2012-03-09

BAM versus CRAM v0.7

CRAM 0.7 was released earlier this month, and includes support for storing arbitrary read tags - a key requirement for it to be evaluated in existing pipelines as a BAM alternative. However, it doesn't  preserve read names - which is a compression trick you can also do with plain BAM.

Vadim kindly released some sample data with BAM -> CRAM 0.7 -> BAM. Just from the file sizes this it is clear that something is being lost - and on querying this it should just be the read names. I pointed out that BAM also allows you to omit the read names so the example might be viewed as skewed. In this blog post I'm going to examine what size benefits that does give with BAM.

Previously I compared pre-0.5 CRAM against BAM using reference based compression (equals signs in the SEQ field), where this early version of CRAM did not attempt to preserve read tags. In that setup while reference based BAM compression provides a good saving, CRAM did a lot better.

Here I'm also discarding the original read names in BAM. Specifically I'm dropping the read name for single-fragment reads (using * for QNAME, originally suggested by Shaun Jackman), and replacing multi-fragment read names with short automatically allocated names. Like sam_seq_equals.py, I wrote sam_drop_qname.py as a simple Python script manipulating SAM format, which can then be piped in and out as BAM using samtools.

Note that similarly to how CRAM 0.7 generates read names on CRAM to BAM conversion, I too have simply used a base ten integer (1, 2, 3, ...). Switching to hex or even using all the letters in both cases might be more efficient still, but may cause more user confusion than it is worth.

In the following table I am trying to perform a comparison of loss-less compression bar the read names. Again, I've picked NA12878 as the example, and as in the previous post, the final column is bits per bases (bpb).

NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.bam
With tags including OQ, allowing original read names to be lost.
In total 298,406,433 bases in 3,931,328 mapped reads, none unmapped.
CompressionSize (bytes)Saving on normal BAMbpb
BAM, normal, with read names424,968,956-11.39
BAM, dropping read names403,835,6105.0%10.83
BAM, seq equals, with read names384,405,0629.5%10.31
BAM, seq equals, dropping read names366,516,99013.8%9.83
CRAM v0.7, dropping read names263,183,02638.1%7.06

Compared to my optimised BAM, CRAM v0.7 gives a respectable disk saving of 28.2% (251MB vs 387MB) - not as good as the headline figure of 38.1% but still impressive.

However, as James Bonfield observed, this is something of a pathological example because a large chunk of the data is in the form of original quality tags (OQ), which his group remove altogether. I wrote a short Python script for removing tags from SAM files to help explore this, sam_strip_tags.py, and retested:

NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.noOQ.bam
With tags except OQ, allowing original read names to be lost.
In total 298,406,433 bases in 3,931,328 mapped reads, none unmapped.
CompressionSize (bytes)Saving on normal BAMbpb
BAM, normal, with read names265,661,710-7.12
BAM, dropping read names247,271,7096.9%6.63
BAM, seq equals, with read names229,824,02413.5%6.16
BAM, seq equals, dropping read names211,179,62820.5%5.66
CRAM v0.7, dropping read names141,125,89246.9%3.78

This time compared to my optimised BAM, CRAM v0.7 gives a more impressive saving of 33.2% (135MB vs 201MB), although not as good as the headline figure of 46.9% for the default fat BAM.

This is somewhat idealised since there are no unmapped reads in this example - typical saving would probably be lower.

Notes

Unlike the results in the first table where I could download the file in CRAM format, for the second table I had to run CRAM 0.7 locally, which I did with the following command (the memory settings are to avoid a Java out of heap space error):

$ java -Xms256m -Xmx2048m -jar cramtools.jar cram --input-bam-file NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.noOQ.bam --reference-fasta-file human_g1k_v37.fasta --output-cram-file NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.noOQ.bam.cram --capture-all-quality-scores --capture-all-tags

Similarly, to recover a BAM file from this for checking I used:

$ java -Xms256m -Xmx2048m -jar cramtools.jar bam --input-cram-file NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.noOQ.bam.cram --reference-fasta-file human_g1k_v37.fasta --output-bam-file NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.noOQ.bam.cram.bam

Conclusions

As in the previous post, this aimed to demonstrate that we can save significant amounts of space by the adoption of small changes to BAM (which are within the specification) right away. In this case, dropping the read names makes a further marked improvement to the size on disk.

Unlike the earlier releases of CRAM, in v0.7 you can choose to record all the read tags. As a result, the CRAM size savings and bits-per-base costs shown today are much less impressive than before where the tag overhead was not included.

Sadly I wasn't able to confirm complete agreement between the CRAM 0.7 compression and my BAM size optimisations due to CRAM problems with the header and TLEN fields. Stay tuned for the next CRAM release, which should be another step closer to a drop in replacement for BAM.

Update (16 March 2012)

Heng Li has also started looking at BAM vs CRAM vs Goby vs BioHDF vs cSRA, both from a file size and CPU needs perspective.

1 comment:

  1. It may seem at first glance advantageous to keep a format similar to BAM, since many pipelines have been written with this format, but with each improvement in sequencing throughput the BAM design gets more challenged (i.e., reads and alignment together in one file). Switching to a new format may seem daunting at first, but it is not that bad (when good, documented APIs are available) and often a necessary step to fix design problems and move forward. Goby uses several tiers of files: reads files that keep all reads (and read names if you care) and alignment files that keep only differences with the reference. Because of this design, alignment files are much smaller and transfer nicely through the network (useful to share alignments or make them available for download). See http://goby.campagnelab.org

    ReplyDelete