BGZF files are bigger than GZIP files, but they are much faster for random access.
Tucked away in samtools' sister tool set tabix is a little command line tool called bgzip (blocked gzip), which lets you compress any file into BGZF format. It also lets you decompress a BGZF file (nothing special here - BGZF is a GZIP variant, so you can use gunzip for this if you like). More usefully, it can decompress just a requested region of the file, without the excessive IO required to do this with a traditional gzipped file.
This next paragraph assumes you are familiar with the programming ideas of seek and tell for low-level file IO. Seeking on a file handle means jumping to a particular point in the file - usually a simple offset counting the number of bytes from the start of the file. If you then read some data, the position in the file changes. Calling tell tells you where in the file you are now. This idea also works for writing data (you can open files for writing with random access).
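If you've not met these before, here is a minimal sketch in Python (the filename is just a placeholder):

```python
# A minimal seek/tell sketch on an ordinary (uncompressed) file;
# the filename is just a placeholder.
handle = open("example.fasta", "rb")
handle.seek(100)        # jump to byte 100, counted from the start of the file
print(handle.tell())    # 100
data = handle.read(50)  # reading moves the position forward...
print(handle.tell())    # ...so this prints 150
handle.close()
```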
BGZF files (including BAM files) consist of many GZIP blocks, each limited to 64kb on disk (compressed) and 64kb of uncompressed data. The virtual offset is an unsigned 64 bit integer which combines a 48 bit block start offset and a 16 bit within-block offset. What this means is that to get a particular bit of data from the BGZF file, you seek to a virtual offset, which is translated into a seek to the relevant BGZF block; that block can be read and decompressed in memory (or incrementally), and then the within-block offset tells you where the requested data begins. Just one disk seek is needed (except for the special case where the data continues into the subsequent block).
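In code, packing and unpacking a virtual offset is just a little bit-shifting - a rough sketch of the scheme described above (any real implementation may differ in details):

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    # 48 bits for where the BGZF block starts on disk,
    # 16 bits for how far into the decompressed block the data starts.
    return (block_start_offset << 16) | within_block_offset

def split_virtual_offset(virtual_offset):
    return virtual_offset >> 16, virtual_offset & 0xFFFF

# e.g. data starting 137 bytes into the block which begins at byte 65536 on disk:
voffset = make_virtual_offset(65536, 137)
assert split_virtual_offset(voffset) == (65536, 137)
```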
BGZF files are bigger than GZIP files! (Boo!)
BGZF files are bigger than traditional GZIP files. Each block requires its own header data, and compressing many units of data separately isn't as space efficient as compressing one large data file.
As an example of this size penalty, let's look at the UniProt/TrEMBL data files - the final column shows the extra disk space taken using BGZF rather than GZIP, and this is noticeably higher for the "SwissProt" plain text format and the XML files (which contain a lot of repeated text that each 64kb BGZF block must compress independently):
| Database | Format | Raw | GZIP | BGZF | Overhead |
|----------|--------|-----|------|------|----------|
| Uniprot SProt (532,792 records) | FASTA | 252.7 Mb | 78.9 Mb | 82.0 Mb | 4% |
| Uniprot SProt (532,792 records) | SwissProt Text | 2.52 Gb | 448.6 Mb | 534.9 Mb | 19% |
| Uniprot SProt (532,792 records) | UniProt XML | 5.26 Gb | 773.2 Mb | 896.0 Mb | 16% |
| Uniprot TrEMBL (17,651,715 records) | FASTA | 7.94 Gb | 3.50 Gb | 3.58 Gb | 2% |
| Uniprot TrEMBL (17,651,715 records) | SwissProt Text | 44.66 Gb | 7.23 Gb | 8.29 Gb | 4% |
| Uniprot TrEMBL (17,651,715 records) | UniProt XML | 90.83 Gb | 14.21 Gb | 16.21 Gb | 14% |
The GZIP sizes are for the files as downloaded from the EBI mirror FTP site. I then decompressed them, and re-compressed them as BGZF using bgzip. I presume that both the original GZIP files and the BGZF files are using the same default ZLIB compression settings.
The 64kb block size currently used in BGZF hurts the compression of repeat-rich files like XML, although this isn't an issue with compact records like FASTA, FASTQ, SAM, or BAM. This suggests an increased block size would benefit applying BGZF to XML file formats, with the significant downside that this would require a different virtual offset scheme.
BGZF files are better than GZIP files! (Yay!)
BGZF is intended to improve on GZIP for random access. For testing, my use case is to randomly select some SProt record IDs, extract those records (in this random order) from the parent file, and write them to a new subset file. Note that for the XML file you'd also want to include an XML header/footer. I'm going to do this with a modified version of Biopython's index support in Bio.SeqIO, currently on my bgzf branch on GitHub.
Uniprot SProt (532,792 records)

| Format | Task | Raw | GZIP | BGZF |
|--------|------|-----|------|------|
| FASTA | Size | 252.7 Mb | 78.9 Mb | 82.0 Mb |
| FASTA | Index | 9s | 40s | 18s |
| FASTA | Extract 10 | 0s | 10s | 0s |
| FASTA | Extract 50 | 0s | 50s | 0s |
| FASTA | Extract 100 | 0s | 106s | 0s |
| FASTA | Extract 500 | 0s | 505s | 0s |
| FASTA | Extract 1,000 | 0s | 1,025s | 1s |
| FASTA | Extract 5,000 | 0s | 5,043s | 2s |
| FASTA | Extract 100,000 | 3s | 100,464s | 48s |
| FASTA | Extract 500,000 | 16s | - | 242s |
| FASTA | Extract 532,792 | 17s | - | 258s |
| FASTA | Shuffle 532,792 | 26s | - | 277s |
| FASTA | Copy/decompress | 2s | 3s | 3s |
Uniprot SProt (532,792 records)

| Format | Task | Raw | GZIP | BGZF |
|--------|------|-----|------|------|
| SwissProt Text | Size | 2.52 Gb | 448.6 Mb | 534.9 Mb |
| SwissProt Text | Index | 80s | 479s | 197s |
| SwissProt Text | Extract 10 | 0s | 77s | 0s |
| SwissProt Text | Extract 50 | 0s | 369s | 0s |
| SwissProt Text | Extract 100 | 0s | 787s | 0s |
| SwissProt Text | Extract 500 | 0s | 3,740s | 0s |
| SwissProt Text | Extract 1,000 | 0s | - | 1s |
| SwissProt Text | Extract 5,000 | 1s | - | 3s |
| SwissProt Text | Extract 100,000 | 17s | - | 65s |
| SwissProt Text | Extract 500,000 | 87s | - | 325s |
| SwissProt Text | Extract 532,792 | 93s | - | 346s |
| SwissProt Text | Shuffle 532,792 | 172s | - | 543s |
| SwissProt Text | Copy/decompress | 21s | 27s | 27s |
Uniprot SProt (532,792 records)

| Format | Task | Raw | GZIP | BGZF |
|--------|------|-----|------|------|
| UniProt XML | Size | 5.26 Gb | 773.2 Mb | 896.0 Mb |
| UniProt XML | Index | 170s | 1,194s | 341s |
| UniProt XML | Extract 10 | 0s | 157s | 0s |
| UniProt XML | Extract 50 | 0s | 642s | 0s |
| UniProt XML | Extract 100 | 0s | 1,352s | 1s |
| UniProt XML | Extract 500 | 0s | 6,508s | 1s |
| UniProt XML | Extract 1,000 | 0s | - | 1s |
| UniProt XML | Extract 5,000 | 2s | - | 6s |
| UniProt XML | Extract 100,000 | 48s | - | 112s |
| UniProt XML | Extract 500,000 | 240s | - | 560s |
| UniProt XML | Extract 532,792 | 256s | - | 593s |
| UniProt XML | Shuffle 532,792 | 427s | - | 934s |
| UniProt XML | Copy/decompress | 41s | 54s | 55s |
The times are all in seconds, rounded to the nearest whole number. The "shuffle" time is the combination of indexing the file plus extracting all the records in the random order. The "-" entries are where I was not patient enough to let the GZIP random access task complete. The copy/decompress time used the Unix commands time and cat (plus gunzip for the GZIP and BGZF files) to give an idea of the minimal disk IO and decompression time if we didn't need random access.
The extraction rows of the tables should convincingly make the point that while indexing and random access to a large BGZF file is slower than with the uncompressed file, it is still reasonable - whereas trying to do this with a GZIP file is not a good idea. BGZF clearly trounces GZIP for random access.
Now there is a small catch - you can't do this with Biopython at home yet (update - you can with Biopython 1.60 onwards). Currently the Bio.SeqIO code doesn't support indexing compressed files. Indexing gzipped files was easy to add (we knew from earlier work that performance would be poor, which is why it was never included), but support for BGZF was a bit harder to implement for this experiment. The main problem was that the current code did things like taking the difference between start and end offsets to get an entry's length - and you can't take the difference of BGZF virtual offsets.
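To see why the subtraction trick breaks down, here is a toy example with made-up numbers:

```python
# Two records in consecutive BGZF blocks (numbers made up for illustration):
first = (0 << 16) | 65000      # 65000 bytes into the decompressed block starting at byte 0
second = (20000 << 16) | 10    # 10 bytes into the next block, which starts at byte 20000
print(second - first)          # 1310655010 - nothing like the real gap between the records
```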
Note that the "official" Bio.SeqIO code (which currently only handles uncompressed files) could be a little faster at building the indexes because it assumes a normal file handle where the offsets behave naturally. Also, I'm deliberately only timing the smaller Uniprot SProt database, because it has few enough records to safely work with the index in memory. We could ask Biopython to index Uniprot TrEMBL with an SQLite3 database for the offsets, but it isn't necessary to demonstrate my point here, which is the raw IO side of things, not the indexing scheme.
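For completeness, asking Biopython to keep the offsets in an SQLite3 database rather than in memory looks something like this (a sketch only - the filenames here are illustrative):

```python
from Bio import SeqIO

# The offsets go into an SQLite3 file on disk rather than a Python dict in memory,
# which makes indexing something the size of TrEMBL practical.
index = SeqIO.index_db("uniprot_trembl.idx", "uniprot_trembl.dat", "swiss")
print(len(index))         # number of records indexed
record = index["F2D7G4"]  # random access to a single record by its ID
```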
Another proviso is that it may be possible to improve the GZIP speed. I'm just using the Python gzip library via gzip.open(...) and the seek/tell methods it provides, so any optimisation (or lack of it) falls to the Python standard library. Internally this calls the Python zlib library, which is also what I'm using to implement BGZF in Biopython. On the other hand, I'm pretty sure that my BGZF code could be sped up after some profiling, which I haven't tried yet.
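For reference, the GZIP random access essentially boils down to the standard library doing something like this, where seeking forward decompresses everything in between and a backwards seek rewinds to the start of the file and decompresses forwards again (offsets made up):

```python
import gzip

# What indexed access to a gzipped file boils down to (offsets made up):
handle = gzip.open("uniprot_sprot.fasta.gz")
handle.seek(50000000)     # gzip has to decompress its way forward to this point
data = handle.read(1000)  # grab the record's raw text
handle.seek(1000000)      # a backwards seek rewinds and decompresses from the start again
handle.close()
```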
Test Code
First, the following quick shell commands extracted all the record names from the FASTA file (using grep and cut), randomised them (using a little Perl), and saved them as plain text, one name per line.
```
$ time grep "^>" uniprot_sprot.fasta | cut -c 2- | cut -f 1 -d " " \
  | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' \
  > uniprot_sprot.fasta.rnd

real    0m3.570s
user    0m5.902s
sys     0m0.523s

$ time cut -d "|" -f 2 uniprot_sprot.fasta.rnd > uniprot_sprot.swiss.rnd

real    0m0.361s
user    0m0.329s
sys     0m0.028s
```
(The first command is wrapped across lines for readability.) These are the input files for my random access trials - the two ID files handle the different ID formats used by default for the different file types in Biopython (e.g. tr|F2D7G4|F2D7G4_HORVD from the FASTA header vs just F2D7G4 from the SwissProt plain text file).
The Python code boils down to just some outer loops to set filenames and formats, a call to build an in-memory index (with the experimental support for GZIP and BGZF compression), and another loop to time extracting various records from the file. The timings shown above are the best of three runs:
```python
import time
from Bio import SeqIO

start = time.time()
index = SeqIO.index(filename, format, compression=comp)
time_index = time.time() - start
print "SeqIO.index(%r, %r, compression=%r) took %0.1fs" \
    % (filename, format, comp, time_index)
```
Then I have a generator function to load the identifiers from the text file created earlier:
```python
def load_ids(filename, limit=10):
    count = 0
    with open(filename) as handle:
        for line in handle:
            name = line.strip()
            if name:
                yield name
                count += 1
                if count == limit:
                    break
```
And finally, we time using the index to extract various numbers of records:
```python
for lim in [10, 50, 100, 500, 1000, 5000, 100000, 500000, len(index)]:
    start = time.time()
    handle = open(out_filename, "w")
    for name in load_ids(id_filename, lim):
        handle.write(index.get_raw(name))
    handle.close()
    time_write = time.time() - start
    print "Writing %i in random order took %0.1fs" \
        % (lim, time_write)
    if lim == len(index):
        print "Shuffling %i took %0.1fs" \
            % (lim, time_index + time_write)
```
This requires the BGZF and GZIP support currently on my bgzf branch on GitHub (update - included in Biopython 1.60 onwards). That's about it.
BGZF is bigger but better than GZIP for random access
I'm still tweaking the BGZF code but will probably propose merging it into Biopython later this month.
However, the main point of this blog post was to try to raise awareness of BGZF as a suitable alternative to GZIP for when you want to compress a data file while retaining reasonably fast random access:
BGZF files are bigger than GZIP files, but they are much faster for random access.

For High Throughput Sequencing (HTS or HTSeq, aka Next Gen Seq or NGS) data, I still believe you should just use SAM/BAM for your unaligned reads, but for FASTQ / FASTA / QUAL files using BGZF is a more flexible alternative to plain GZIP (with a fairly small size overhead).
Also on the plus side, the separate self-contained GZIP blocks used in BGZF offer the possibility of multi-threaded decompression and data analysis at the block level, in addition to the random access benefits I've been focusing on here.
If you are interested in the Biopython development side, please join the Biopython-dev mailing list and read this thread, while for any discussion/feedback/comments on the general use of BGZF, I've started a thread on SEQanswers.com.
Update (June 2012)
Biopython 1.60 includes module Bio.bgzf for working with BGZF files, and Bio.SeqIO's indexing functions will use this to support BGZF compressed sequence files.
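As a quick illustration of what that looks like in practice (a sketch only - the filenames are illustrative, the whole file is read into memory for brevity, and this assumes the BGZF-aware indexing behaviour described above):

```python
from Bio import SeqIO, bgzf

# Make a BGZF compressed copy of a FASTA file (whole file in memory for brevity).
data = open("uniprot_sprot.fasta", "rb").read()
handle = bgzf.BgzfWriter("uniprot_sprot.fasta.bgz", "wb")
handle.write(data)
handle.close()  # flushes the last block and writes the special BGZF end-of-file marker

# Bio.SeqIO.index spots the BGZF compression from the file itself,
# giving random access much as with the uncompressed original.
index = SeqIO.index("uniprot_sprot.fasta.bgz", "fasta")
print(len(index))         # number of records
print(next(iter(index)))  # ID of the first record in the file
```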
Comment: Do you have a snippet which explains how to extract a sequence from a BAM file with Biopython, without using samtools? Thanks.

Reply: No. The standard Biopython releases as of Biopython 1.62 do NOT include SAM/BAM support...