Blasted Bioinformatics!?: SAM/BAM without gapped reference

In my last post I talked about SAM/BAM with a gapped reference, and how this makes it much easier to work with inserted bases relative to the reference/consensus - especially for visualisation.

I should point out that some viewers do actually manage to show the inserts as columns even with the traditional ungapped/unpadded reference sequence - notably Gap5, Bambino, and the text based samtools tview, as shown in these tview screenshots. You press the "i" key to toggle this insert display, press "?" for help.

Here's the same E. coli sample from my previous post (shown there using Tablet and IGV), using a BAM file with a traditional ungapped reference:

samtools tview 0.1.18, showing simplistic insertion display (press i to collapse this)

And the same alignment but using a BAM file with a padded reference:

samtools tview v0.1.18 showing padded reference (i toggle has no effect)

You may notice some subtle differences in the insert region - which I'll come back to in a moment.

A major problem for a viewer wanting to do display insert sequences like this with the traditional ungapped reference sequence is before drawing anything, you must inspect all the reads for inserts in order to compute the gapped reference. That could be slow - especially if you want to this for the entire reference sequence for the purposes of showing an overview and both gapped and ungapped coordinates. I'm pretty sure samtools tview only does this insertion scan for the section being viewed, which is sensible as it doesn't attempt to provide an overview - nor track both the gapped and ungapped coordinates.

The next problem is how to place the inserted bases to recover the original multiple sequence alignment, rather than treating the SAM/BAM file as a collection of pairwise alignments? For example, at one insert you might have extra bases AT from one read, but just T from another. This could be treated as a mini-multiple sequence alignment but that gets complicated (and also potentially slow). The viewer Bamino does something along these lines. A much simpler answer is shunt them all to the left, which seems to be what samtools tview does.

This is where the CIGAR string pad operator P comes in. CIGAR strings are how the SAM/BAM format stores alignment information, and are one of the most complicated bits of the SAM/BAM specification.

Have a look at the first read in the two screenshots above. The first, where tview has done the insert layout itself, uses "AGGT**" while the second image has "*A*GGT" which is described explicitly in the padded reference BAM file (and is consistent with the original assembly output in ACE format). In both cases the reference is six gaps, "******".

If you treat the original alignment (as shown in the ACE file) as a pairwise alignment ("------" vs "-A-GGT") and ignore the CIGAR pad operator, you can describe this bit of the alignment with the CIGAR string 4I (four inserts, assuming as conventional an ungapped reference). However, to preserve the multiple sequence alignment's layout you should use 1P1I1P3I (one pad, one insert, one pad, three inserts). This is something that I didn't find immediately obvious on reading the spec.

So, in principle, if you get a traditional SAM/BAM file with an unpadded reference, but where the CIGAR pad operator has been used properly, it is possible to recover the original padded alignment. But if any of the reads have been processed or lack the P operator, you can't. One example of this is the SRA/ENA ArchiveBAM files which make a number of simplifications/restrictions to save disk space. The ArchiveBAM 1.0 Format Specification says very clearly:

If present the ‘P’ operations will be removed by SRA which effectively converts the multiple alignment into a pair-wise alignment.

So, in general, this means a displaying a traditional SAM/BAM file using an unpadded reference as a padded alignment is hard. This is why I was so excited about using SAM/BAM with a padded reference in my last blog post.

P.S. Personally I like to use minus signs for gaps in alignments, which is the convention from multiple sequence alignments (which can cover either nucleotide or protein sequences, where asterisk has the meaning of a stop codon). In an alternative convention going back over 30 years, some of the assembly community use an asterisk for their gaps instead (e.g. ACE files). Thanks to James Bonfield for that little history lesson.

Update (29 May 2015)

Those of you following the samtools-devel mailing list, or the SAM/BAM specification on GitHub might have noticed using a padded reference was made part of the specification. I coordinated a paper describing this work which is now available as a preprint: SAM/BAM format v1.5 extensions for de novo assemblies (DOI: 10.1101/014043).

Blasted Bioinformatics!?

2011-10-03

SAM/BAM without gapped reference

Update (29 May 2015)

No comments:

Post a Comment