Blasted Bioinformatics!?: Random access to blocked XZ format (BXZF)

I've written about random access to the blocked GZIP variant BGZF used in BAM, and looked at random access to BZIP2, but here I'm looking at XZ files which are based on LZMA compression. This was prompted by the release of Python 3.3 which includes the lzma module to support XZ files, which I then back-ported to offer lzma for Python 2.6 or later. Over the Christmas / New Year break I extended this to handle Blocked XZ Format (BXZF for short), so publishing this post is a bit overdue.

XZ files have some important properties also shared with other formats like GZIP,

Streamable: It is always possible to create and decompress .xz files in a pipe; no seeking is required.
Concatenation: Just like with .gz and .bz2 files, it is possible to concatenate .xz files giving a multi-stream file. The decompressor can decompress a concatenated file as if it was a regular single-stream .xz file.
Random-access reading: Each XZ stream can be split into independently compressed blocks, followed by an index, which makes fast random-access reading possible when multiple-blocks are used.

Getting XZ blocks

GZIP too can use blocks, but generally this isn't exploited - hence the blocked GZIP variant BGZF. And as I've explored before, BZ2 also uses blocks but doesn't make random access that easy.

So, does life get any easier with the blocks in XZ? Well yes, but for XZ random access to work, you need the file to actually use the multiple blocks feature (at the cost of being less space efficient).

Sadly it appears that the current stable release command line tools from the XZ Utils package (v5.0.4) only creates single-block XZ files, which are useless for random access. You can check this with the simple command line --list option:

$ xz < thousand_blastx_nr.xml > thousand_blastx_nr.xml.xz
$ xz --list thousand_blastx_nr.xml.xz
Strms Blocks Compressed Uncompressed Ratio Check Filename
1 1 18.4 MiB 285.7 MiB 0.064 CRC64 thousand_blastx_nr.xml.xz

However, in v5.1.1 (alpha) a new option --block-size=SIZE was added. The default remains one large block. Running with the latest release and asking for 1MB blocks gives:

$ xz -V
xz (XZ Utils) 5.1.2alpha
liblzma 5.1.2alpha
$ xz --block-size 1048576 < thousand_blastx_nr.xml > thousand_blastx_nr.xml.b1M.xz
$ xz --list thousand_blastx_nr.xml.b1M.xz
Strms Blocks Compressed Uncompressed Ratio Check Filename
1 286 20.5 MiB 285.7 MiB 0.072 CRC64 thousand_blastx_nr.xml.b1M.xz

As expected, compression suffers a little - but we now have multiple blocks and in principle can do fast random access to this XZ file. So how would random access to multi-block XZ files work? Most of the following is simply from reading the XZ file format specification (currently v1.0.4, 2009-08-27).

XZ Stream Structure

Valid XZ files contain one or more streams, each null padded to be a multiple of four bytes long. Each stream begins with a 12 byte header:

Header Magic (6 bytes, b'\xfd7zXZ\x00' in Python notation, or 0xfd 0x37 0x7a 0x58 0x5a in hex)
Stream Flags (2 bytes)
CRC32 (4 bytes)

There are then zero or more blocks (null padded to be a multiple of four bytes long), followed by an index block (null padded again), and finally a 12 byte stream footer (also null padded):

CRC32 (4 bytes)
Backward size (4 bytes, encodes the size of the index block)
Stream Flags (2 bytes, should match stream header)
Footer Magic (2 bytes, b'YZ' or 0x59 0x5a in hex)

This means you can spot a XZ file by checking the header magic bytes - in this case the first six bytes should be 0xFD then 7zXY then null. Also any XZ file should end with 'YZ' plus up to three nulls for alignment. Typically you'll only find one stream.

XZ Block structure

I'm not going to say anything here other than the block header may include the compressed and decompressed size of the data in the block. Including this allows more efficient seeking when processing a data stream - but doesn't seem important when trying to do random access to a file on disk.

XZ Index Structure

This is where I start to get interested, because one of the aims of the index is to allow random access to any block in the stream. The index holds:

Index indicator (1 byte, null)
Number of records (i.e. number of blocks in the stream)
List of records (one per block, recording the unpadded size on disk and the uncompressed size)
Index padding
CRC (4 bytes)

Using this index we can do from desired (decompressed) offset to the raw offset on disk of the start of the block of interest, and the (decompressed) relative offset into that block.

XZ Random Access

From reading the XZ file format specification, the relevant bits of which I have tried to summarise above, this is how could implement random access:

Open the XZ file, and seek to the end (i.e. the end of the final stream)
Parse the footer of this stream, which tells you the size and thus location of any preceding stream index
Parse the index of the current stream, which tells you the block information for that stream, and the total size on disk of that stream.
Using the size of the current stream from its index, seek to the end of the previous stream, and repeat.

After all that you can build up a table of entries for each block in each stream, giving the block start offset, and the size of the block contents, and thus the start offset in the decompressed data for that block. We'll also need to record the settings for each stream (e.g. the filters) or at least the offset for the start of each stream. Doable.

XZ Random Access in Python

Python 3.3 includes LZMA/XZ support in the standard library - here's how I installed Python 3.3 with lmza under Mac OS X 10.8 Mountain Lion, and here is my backport of lzma for older versions of Python.

The documentation for the Python lmza module says it only emulates seeking - and testing confirmed it was rather slow. Examining the source code of the lzma.py for Python 3.3.0 it was clear that this worked by rewinding to the start of the file if required, and simply reading forward until the desired offset is reached. This Python layer is built on top of a C module _lzma, described as a low-level Python interface to liblzma. This appears not to do anything with the internal structure of the XZ files.

(Low level support for random access within the XZ C library itself is still on the author's wish list.)

I wrote my own pure Python code to parse the stream and blocks, and then call the lzma module to decompress the individual blocks. i.e. The same approach I used to implement BGZF random access with zlib doing the decompression. Have a look at the file bxzf.py on this branch - the name BXZF is my shorthand for Blocked XZ Format.

At some point I'm intending to try this out on a Biopython branch, integrating it into SeqIO, to test it as a practical alternative to the BGZF support included in Biopython 1.60 onwards. I was intending to do that before publishing this blog post, but that hasn't happened.

3 comments:

Unknown5 April 2013 at 12:25
Thanks for an interesting post!
Evolved Microbe13 March 2014 at 20:28
Interesting, do you know how this format compares to zlib performance wise? I was just playing around a bit with the snappy format (http://evolvedmicrobe.com/blogs/?p=253) seems quite fast on compression, but not such a big win on decompression. I really like the idea of having a file format that is somewhat agnostic to what is using the underlying compression for things like tabix, etc. though, so think implementing this in biopython is awesome.
Peter Cock13 March 2014 at 23:23
For zlib/gzip you could use the BGZF performance as a guide (which is just blocked GZIP format).

2013-04-01

Random access to blocked XZ format (BXZF)

3 comments: