2011-10-21

FASTQ must die! Long live SAM/BAM!


I think it is time to retire the FASTQ file format in favour of storing unaligned reads in SAM/BAM format. I will try to explain, as this may not immediately strike everyone as logical, given SAM/BAM is primarily a sequence alignment/mapping format, while for "raw" reads FASTQ is near ubiquitous in Next Generation Sequencing (NGS), more sensibly known as High Throughput Sequencing (HTS).



Unaligned reads in FASTQ vs SAM/BAM

There are several good reasons for this. Perhaps surprisingly, I don't think file size is one of them. BAM files use a variant of gzip compression called BGZF, which has the advantage of random access via its BGZF blocks, but it's not that important to have random access to unaligned reads. Typically you only need sequential access, so a gzipped FASTQ file works just fine.

One real advantage is consistency of base calling quality scores: there is only one way to store them in SAM/BAM, and the specification is nice and clear about it. I'm delighted Illumina have switched to the Sanger FASTQ encoding with CASAVA v1.8, which settles the old issue of incompatible FASTQ quality scores, but that legacy will plague us for a bit longer.
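
To make the encoding problem concrete, here is a minimal Python sketch. It assumes the two PHRED-based conventions only: Sanger FASTQ with an ASCII offset of 33, and the Illumina 1.3 to 1.7 style with an offset of 64. The very old Solexa scores are not PHRED scores and are not handled here. The function names are my own, made up for illustration.

```python
def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of integer PHRED scores.

    offset=33 is the Sanger convention (also used by SAM),
    offset=64 is the older Illumina 1.3 to 1.7 convention.
    """
    return [ord(char) - offset for char in quality_string]


def illumina_to_sanger(quality_string):
    """Re-encode an offset 64 quality string using the Sanger offset of 33."""
    return "".join(chr(ord(char) - 64 + 33) for char in quality_string)


# The same characters give very different scores depending on which
# offset you (or your tool) assume, which is the whole problem:
old_style = "hhhB"
print(decode_qualities(old_style, offset=64))  # intended scores
print(decode_qualities(old_style, offset=33))  # wrong offset, inflated scores
print(illumina_to_sanger(old_style))
```

Nothing in a FASTQ file tells you which offset applies, so tools have to guess or be told. With SAM/BAM there is nothing to guess.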

However, as we move on from the pain of inconsistent FASTQ encodings, the main FASTQ headaches now are all the different metadata conventions, and in particular paired end reads. We have the messy situation where some tools want one FASTQ file containing both forward and reverse reads interleaved, while others want two matched files. Some require one particular naming scheme, some another, while some tools actually ignore the read names altogether and just go by their position in the file (e.g. Velvet).

It had looked to me like most tool authors were settling on the /1 and /2 suffixes for paired end reads, which is simple (and extensible to strobed reads with more than two parts). This naming was introduced by Illumina and by sheer volume of data swamped the historic Sanger naming scheme (which was actually rather complicated because it allowed for multiple attempts at sequencing either end of a PCR product - see here and here), along with the many other schemes. Frustratingly, in CASAVA v1.8, while Illumina adopted Sanger FASTQ encoding for the quality scores (yay!), they also changed their paired end naming conventions and how they handled reads failing QC. See this important SEQanswers thread where with hindsight I should have complained louder, and more recently this SEQanswers thread. One step forward, two steps backward?
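
As a rough illustration of what tool authors end up writing, here is a Python sketch that copes with just two of the conventions: the older /1 and /2 suffix, and the CASAVA v1.8 style where the pair number and filter flag follow a space (read number, then Y if the read was filtered out or N otherwise, then control bits and index sequence). The function name and the example read names are only for illustration, and real data has more variations than this handles.

```python
def split_read_name(title):
    """Return (template_name, pair_number, failed_qc) for a FASTQ title line.

    pair_number is 1, 2 or None if it cannot be determined.
    failed_qc is True, False, or None if the title does not say.
    """
    if title.startswith("@"):
        title = title[1:]
    name, _, comment = title.partition(" ")

    # CASAVA v1.8 style, e.g. "name 1:Y:18:ATCACG"
    fields = comment.split(":")
    if len(fields) >= 2 and fields[0] in ("1", "2") and fields[1] in ("Y", "N"):
        return name, int(fields[0]), fields[1] == "Y"

    # Older suffix style, e.g. "name/1" or "name/2"
    if name.endswith(("/1", "/2")):
        return name[:-2], int(name[-1]), None

    # Anything else: no pairing information we recognise
    return name, None, None


print(split_read_name("@HWUSI-EAS100R:6:73:941:1973#0/1"))
print(split_read_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
```

Note the older suffix style has nowhere to record QC failure at all, hence the None.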

It is for this kind of metadata that I think SAM/BAM has its biggest appeal over FASTQ for unaligned reads. Paired reads are catered for explicitly via the FLAG field, and so too is a simple QC pass/fail. You can also store other metadata, for example the SAM/BAM header can include information about the software tool and version used to create the file. You can assign read groups which let you specify what platform was used (Sanger, Illumina, Roche 454, IonTorrent, etc), the sample name (e.g. Cancer vs Healthy), and for paired end libraries the expected insert size. You can even store per-read annotation in the extensible tag fields.
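
Here is a minimal sketch of what that looks like in practice, writing unaligned reads as plain text SAM lines in Python. The FLAG bit values and the @HD, @RG and @PG header tags are from the SAM specification; the helper name, the read group ID "run1", the sample name, the insert size of 300 and the program name "fastq2sam" are all placeholders I have made up for the example.

```python
# FLAG bits defined in the SAM specification
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8
FLAG_FIRST = 0x40
FLAG_SECOND = 0x80
FLAG_QC_FAIL = 0x200


def unaligned_sam_line(name, seq, qual, pair_number=None,
                       failed_qc=False, read_group=None):
    """Build one SAM line for an unaligned read (hypothetical helper).

    qual must already be in the Sanger (offset 33) encoding.
    """
    flag = FLAG_UNMAPPED
    if pair_number in (1, 2):
        flag |= FLAG_PAIRED | FLAG_MATE_UNMAPPED
        flag |= FLAG_FIRST if pair_number == 1 else FLAG_SECOND
    if failed_qc:
        flag |= FLAG_QC_FAIL
    # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    fields = [name, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual]
    if read_group:
        fields.append("RG:Z:" + read_group)
    return "\t".join(fields)


# Header recording the platform, sample, expected insert size and the
# tool which made the file (all values here are made up placeholders)
header = "\n".join([
    "@HD\tVN:1.4\tSO:unsorted",
    "@RG\tID:run1\tPL:ILLUMINA\tSM:Healthy\tPI:300",
    "@PG\tID:fastq2sam\tPN:fastq2sam\tVN:0.1",
])

print(header)
print(unaligned_sam_line("frag1", "ACGT", "IIII", pair_number=1, read_group="run1"))
print(unaligned_sam_line("frag1", "TTGA", "II#I", pair_number=2,
                         failed_qc=True, read_group="run1"))
```

Both reads of the pair share one name, and which end is which (and whether it failed QC) lives in the FLAG rather than in a naming convention.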

Admittedly, explicit tags may not exist for all the key metadata yet, but those can be added. Dialogue with the sequencing archives SRA/ENA would help here to ensure all the key metadata they want can be represented in SAM/BAM.

On the other hand, with FASTQ files you're stuck playing silly games with filenames, the single "free text" line per read, often linking pairs of FASTQ files, etc. It's a mess, and far too much of bioinformaticians' time in high throughput sequencing is spent messing about with the consequences of this.

Moving from FASTQ to SAM/BAM for unaligned reads

What needs to happen for SAM/BAM to supplant FASTQ for "raw" unaligned reads?
  • Adoption by more key analysis tools as an input format, already supported in:
  • Adoption by sequencing companies as a vendor output format, including:
    • Illumina who might build on illumina2bam (used by Sanger for their pipelines).
    • Roche 454 and IonTorrent. They offer SFF format which works very well, but flow space information can be recorded in SAM/BAM with the FZ read tag, and the flow order in the header @RG line's FO tag.
    • SOLiD's color space information can be recorded in SAM/BAM using the CQ and CS read tags.
  • Adoption by platforms like Galaxy (already being discussed on their mailing list)
  • Backing from SRA/ENA would be a bonus.
I think users asking for this and helping to test it will be the biggest driver, but getting key tools to accept unaligned SAM/BAM as input will be vital. That support can then be built on by adopting unaligned SAM/BAM within pipelines and platforms like Galaxy.

And on a related note we should all be using SAM/BAM for alignments (de novo or reference guided) as well (Nick Loman says so too), and I've blogged about that before (here and here), and will write more about this later.

Discussion

I've started a thread on SEQanswers.com for discussion (partly since I don't think the comment settings on my blog are working as I want them to).
