Recently Torsten Seemann had a look at GZIP and BZIP2 applied to FASTQ data, which is interesting. I decided to look at a different dataset: UniProt/TrEMBL data files (now one month out of date):
| Dataset | Records | Uncompressed | GZIP | BGZF | BZIP2 (100KB blocks) | BZIP2 (900KB blocks) |
|---|---|---|---|---|---|---|
| SwissProt Text | 532,792 | 2.52GB | 448.6MB | 534.9MB | 468.6MB | 317.6MB |
| TrEMBL Text | 17,651,715 | 44.66GB | 7.23GB | 8.29GB | 7.21GB | 5.42GB |
The GZIP sizes are for the files as downloaded from the EBI mirror FTP site in early September 2011. I then decompressed them, and re-compressed them as BGZF using bgzip from samtools, or as BZIP2 using bzip2. I presume that both the original GZIP files and the BGZF files use the same default ZLIB compression settings.
As I noted before, the 64KB block size currently used in BGZF hurts its compression of repeat-rich files like XML, although this isn't an issue with compact records like FASTA, FASTQ, SAM, or BAM. This suggests an increased block size would help if applying BGZF to XML file formats.
We can explore this easily with BZIP2, since you can choose a block size of 100KB (least space efficient, shown above), 200KB, ..., or 900KB (the default, also shown above). The biggest block size (900KB) clearly pays dividends in reducing the size of the rich-text SwissProt format and the UniProt XML files. On the other hand, using the smallest BZIP2 block size (100KB) puts the compression efficiency on a par with GZIP (no blocks), which is still generally better than BGZF (GZIP with 64KB blocks).
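The block-size effect is easy to reproduce with Python's bz2 module, where compresslevel 1 to 9 corresponds to the 100KB to 900KB block sizes. A quick sketch (the repeated-random-chunk test data is my own contrivance, chosen so the redundancy only becomes visible when a whole repeat period fits inside one block):

```python
import bz2
import os

# A 150KB pseudo-random chunk repeated five times: no single 100KB block
# can hold two copies of the chunk, but a 900KB block holds all five,
# so only the larger block size lets BWT-based compression exploit it.
chunk = os.urandom(150_000)
data = chunk * 5

small_blocks = bz2.compress(data, compresslevel=1)  # 100KB blocks
big_blocks = bz2.compress(data, compresslevel=9)    # 900KB blocks

print(len(data), len(small_blocks), len(big_blocks))
```

With 100KB blocks each block sees essentially unique random data and barely compresses at all, while the single 900KB block should come out far smaller.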
BZIP2 file format and random access
I had trouble finding details of the raw BZIP2 file format on bzip.org. Instead, I found the Wikipedia page on BZIP2 much more useful - thanks largely to content added by Paul Sladen while working on pyflate (bzip2 decompression in pure Python).
Anyway, the important bits are that a BZIP2 file contains one or more "streams", which are byte aligned. Each stream contains one (zero?) or more "blocks", which are not byte aligned, followed by an end-of-stream marker: the six bytes 0x177245385090 (the square root of pi as binary coded decimal, BCD), a four-byte checksum, and empty bits for byte alignment.
This means that, like GZIP, you can concatenate multiple BZIP2 files together and still have a valid file. Also, you can combine files with different block sizes. Since the block size is recorded in the stream header (byte four: '1', '2', ..., or '9'), not the block header, that could be a problem for random access to an arbitrary block. However, I suspect that the block size only affects the maximum allocation size for data structures, so assuming the worst case (level 9, i.e. 900KB blocks) will be fine.
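The concatenation property is easy to demonstrate with Python's bz2 module (whose decompress function has handled multi-stream input since Python 3.3), mixing two different block sizes:

```python
import bz2

# Two independent BZIP2 streams compressed with different block sizes.
part_a = bz2.compress(b"first stream, 100KB blocks\n", compresslevel=1)
part_b = bz2.compress(b"second stream, 900KB blocks\n", compresslevel=9)

# Simply concatenated, the result is still a valid BZIP2 file: the
# command line tool and bz2.decompress() both decode the streams in turn.
combined = part_a + part_b
print(bz2.decompress(combined).decode())
```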
Assuming each stream contains at least one block, it is easy to spot a new stream by scanning the file for the magic pattern BZh1 to BZh9 (the four-byte stream header encoding the block size, 1 to 9) followed by the six-byte block header 0x314159265359, which is the number pi as BCD. I think you then have to parse the subsequent fields in order to work out the size of the block, and thus find the next block. Because the blocks are not byte aligned, this pi marker could be bit shifted. So, in principle, to describe the start of a block you need to know its offset in bits, or equivalently the offset in bytes plus a bit shift of 0 to 7.
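A sketch of that scan in Python (find_block_magics is my own name for it; a real implementation would want to confirm each hit by attempting a decode, since compressed data can contain the 48-bit pattern by chance):

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # pi as BCD: the six byte block header
MAGIC_MASK = (1 << 48) - 1

def find_block_magics(raw):
    """Return (byte_offset, bit_shift) for each block magic in raw BZIP2 data.

    Blocks are not byte aligned, so the 48-bit marker is checked at all
    eight possible bit shifts within a sliding 56-bit (7 byte) window.
    """
    hits = []
    for i in range(len(raw) - 6):
        window = int.from_bytes(raw[i:i + 7], "big")
        for shift in range(8):
            if (window >> (8 - shift)) & MAGIC_MASK == BLOCK_MAGIC:
                hits.append((i, shift))
    return hits

raw = bz2.compress(b"some example data " * 1000)
print(raw[:3], raw[3:4])       # b'BZh' stream magic plus the level digit
print(find_block_magics(raw))  # first block right after the 4 byte header
```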
(Update June 2012: I later came across James Taylor's seek-bzip2 which uses offsets in bits to seek to arbitrary bzip2 blocks)
In practice, for random access a block offset in bytes would suffice. You'd just have to inspect the first six or seven bytes to identify the bit shift giving 0x314159265359, and decode from there. One problem is that the bzip2 C API doesn't appear to let you specify a bit offset, so either it would need modifying, or you'd need some sneakiness - there are several options here.
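One such piece of sneakiness, sketched in pure Python so the C API needs no changes (decompress_from and the byte-at-a-time feeding are my own workaround, not an official interface): shift the tail of the file so the target block becomes byte aligned, prepend a synthetic four-byte stream header, and feed that to a stock decompressor. The per-block CRCs still verify; only the combined CRC in the stream footer fails once earlier blocks are skipped, and that error is deliberately swallowed.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # pi as BCD, starts every block
MAGIC_MASK = (1 << 48) - 1

def find_block_magics(raw):
    """(byte_offset, bit_shift) of each candidate block start."""
    hits = []
    for i in range(len(raw) - 6):
        window = int.from_bytes(raw[i:i + 7], "big")
        for shift in range(8):
            if (window >> (8 - shift)) & MAGIC_MASK == BLOCK_MAGIC:
                hits.append((i, shift))
    return hits

def decompress_from(raw, byte_offset, bit_shift):
    """Decode a BZIP2 file from an arbitrary (bit shifted) block start.

    Shifts the tail of the file left by bit_shift bits so the block magic
    becomes byte aligned, prepends the original 4 byte stream header
    ('BZh' plus the block size digit), and decompresses.  Each block's
    own CRC still verifies; only the combined CRC in the stream footer
    fails when blocks were skipped, and that is swallowed.
    """
    tail = raw[byte_offset:]
    nbits = len(tail) * 8 - bit_shift
    value = int.from_bytes(tail, "big") & ((1 << nbits) - 1)  # drop lead bits
    pad = (-nbits) % 8                                        # re-pad to bytes
    body = (value << pad).to_bytes((nbits + pad) // 8, "big")
    stream = raw[:4] + body
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    try:
        # Feed byte by byte so output already emitted for complete blocks
        # is not lost when the footer CRC check raises an exception.
        for j in range(len(stream)):
            if decomp.eof:
                break
            out += decomp.decompress(stream[j:j + 1])
    except OSError:
        pass  # expected: stream CRC mismatch when earlier blocks skipped
    return bytes(out)

# Demo: force a multi-block file (100KB blocks) with low-redundancy data,
# then decode from the second block onwards.
data = bytes((i * 7919) % 256 for i in range(250_000))
raw = bz2.compress(data, compresslevel=1)
starts = find_block_magics(raw)
partial = decompress_from(raw, *starts[1])
print(len(starts), len(partial), data.endswith(partial))
```

Starting from the first block (byte offset 4, bit shift 0) this reduces to ordinary full decompression; starting from any later block yields exactly the tail of the original data.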
Here are some relevant bits I'd like to quote from the bzip2 documentation:
4.1. Limitations of the compressed file format
The compressed file format was never designed to be handled by a library, and I have had to jump through some hoops to produce an efficient implementation of decompression. It's a bit hairy. Try passing decompress.c through the C preprocessor and you'll see what I mean. Much of this complexity could have been avoided if the compressed size of each block of data was recorded in the data stream.
It would be fair to say that the bzip2 format was frozen before I properly and fully understood the performance consequences of doing so.
Further ahead, it would be nice to be able to do random access into files. This will require some careful design of compressed file formats.
That point about recording the size of each compressed block is particularly interesting, as the only difference between GZIP and the BGZF variant is that the latter records the block size (and limits you to 64KB blocks). It would make life easier, but I don't think we need that - random access should still be possible with the current BZIP2 format.
Later in the bzip2 documentation we have this:
4.4. Did you get the right package?

bzip2 is a resource hog. It soaks up large amounts of CPU cycles and memory. ...

For something faster and lighter still, you might try Markus F X J Oberhumer's LZO real-time compression/decompression library, at http://www.oberhumer.com/opensource.

Julian Seward has been honestly self-critical, which is great. Most of his points concern compression rather than decompression, and I'm not concerned about the memory overhead (given I'd plan to keep at least a few decompressed blocks in RAM). Rather, it is the decompression CPU time I'm worried about, which can be four to twelve times that of GZIP.
Based on all that, I believe it is possible to implement efficient random access to BZIP2 files based on double offsets (raw block start offset, and offset into the decompressed block), and that this should work well for semi-sequential access where we can keep enough decompressed blocks in RAM. It will be tricky to implement because the blocks are not byte aligned. However, given the expected CPU overheads on decompression, and the potential size of each block, true random access even with a reasonable block cache will be CPU bound.
On this basis, going with GZIP rather than BZIP2 for use in BAM was a sensible choice, even if it did require tweaks to record the block size (BGZF).
If commenting on the blog is still not working, try this SEQanswers thread instead.