Blasted Bioinformatics!?: 2012

2012-10-30

My IDs not good enough for NCBI BLAST+

The blastdbcmd tool in the BLAST+ suite (replacing fastacmd in the C 'legacy' BLAST suite) lets you do a lot of clever things with a BLAST database. As long as you follow the baroque NCBI FASTA naming scheme you can do this with local BLAST databases too. However, if you don't want to bow down to the NCBI naming (e.g. use FASTA files directly from your favourite assembler), then blastdbcmd seems needlessly crippled.

Update (2 April 2013): Some changes in BLAST 2.2.28+ (released yesterday) seem to be intended to address these issues, but there remain problems with this which I intend to expand on later.

Update (20 April 2013): I found a quiet moment this weekend to update this post with the BLAST 2.2.28+ problems I was alluding to. There has been some progress on this issue with the new release, but it is flawed. See below.

Broken blastdbcmd for -target_only

This is just a quick post to document a bug in the blastdbcmd tool from the BLAST+ suite when used on the NR database with a full identifier and the -target_only option.

Update: See end of post, BLAST 2.2.28+ fixed this :)

How not to deal with NGS data - MrFast & MrsFast

One of the first things a programmer dealing with 'Next Generation Sequencing' (NGS) aka 'High Throughput Sequencing' (HTSeq) data learns is to be very aware of memory limitations. You can't just go loading files into RAM when they are often gigabytes in size. Instead where possible you loop over a file (iterating over it record by record) or employ indexed random access. The authors of MrFast & MrsFast didn't do this.

Installing Python 3.3 with LZMA/XZ on Mac OS X

Today's release of Python 3.3 includes the lzma module in the standard library for LZMA/XZ files, but it didn't work 'out of the box' under Mac OS X 10.8 Mountain Lion - it requires XZ Utils. This is how I installed it.

Stop breaking NCBI BLAST searches!

Have you ever tried to use a BLAST database of protein sequences containing stop codons? If you work on nice model organisms with solid gene annotation maybe not. However, with draft annotations, mutation studies, or read through translation it is not unreasonable for the odd internal stop codon to appear in a protein sequence. And some translation pipelines do leave in a trailing * character. It turns out the BLAST+ suite has a rather nasty glitch with this sort of sequence.

Update: BLAST 2.2.27+ fixed this bug.

BLAST tabular output missing descriptions

This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremost the option to include BLAST result descriptions in the tabular output. Having the taxonomic identifiers (if available) would be great too - allowing downstream filtering of BLAST results by species etc.

Update: See end of post, BLAST 2.2.28+ added most of these features :)

BLAST+ ignoring search space size for e-values

Sometimes using BLAST is frustrating. Today I'm writing about it returning different expectation values, and therefore different answers, depending on if you use a FASTA subject file, or a database made from that file. I noticed something funny a while ago, but didn't immediately investigate and report it (which I regret). In the continued absence of an official public bug tracker for NCBI BLAST, I'm again going to blog about it here, so people can find it via Google, and email this to the NCBI.

BLAST+ memory hog with subject FASTA and XML output

We noticed a major memory problem running NCBI BLAST+ with XML output using a FASTA subject (consuming loads of swap space then getting killed). This doesn't happen with tabular output, nor if using a BLAST database:

Memory and CPU usage (until killed by OS)

Missing feature locations in GenBank "with parts"

My last blog post was on a problem with missing NCBI GenBank feature location information for trans-spliced genes when using GenBank 'with parts'. That was fixed, but while using EFetch a colleague just stumbled over something possibly related where the location is given as just a question mark. This took a little digging to understand...

Missing external exons in GenBank "with parts"

I recently stumbled on a problem in NCBI Entrez with the GenBank (with parts) return type. Some GenBank files don't actually contain a sequence at the end - instead they have a CONTIG section telling you how to construct the sequence from other referenced pieces. That's often inconvenient so the NCBI have the handy option of downloading it with all this parts pre-computed, which normally is great.

BAM versus CRAM v0.7

CRAM 0.7 was released earlier this month, and includes support for storing arbitrary read tags - a key requirement for it to be evaluated in existing pipelines as a BAM alternative. However, it doesn't preserve read names - which is a compression trick you can also do with plain BAM.

Reference based SAM/BAM compression

In some respects the SAM/BAM specification is quite loose, in that there is more than one way to represent a given piece of information. We can take advantage of this to reduce the size on disk of mapped reads which match the reference sequence, while still maintaining conformance within the spec. I've written a SAM/BAM reference based compression script in Python - put this in your pipeline and smoke it!

Ion Torrent Suite on GitHub

Good news - the Ion Torrent Suite is now freely available open source software on GitHub under the GPL v2 licence, as promised late last year. There is now something more substantial behind talk of Ion Torrent "democratising sequencing", and a clear advantage over the closed source tools of rival companies. I commend them!

Ion Torrent does the Samba

I'm a bit behind the curve here (see Lex's blog post from July 2011), but I was amused to find out Ion Torrent call their current nucleotide flow order TACGTACGTCTGAGCATCGATCGATGTACAGC the "Samba". Apparently the idea is to avoid reads going out of phase which could happen with the traditional repeated flow TACG (still used by Roche 454), by giving the molecules which missed a base a chance to catch up, and for IonTorrent this works better.

I wonder if the flow order next revision will also get a dance based name? I'd suggest conga, since it is about synchronising lots of people.

Blasted Bioinformatics!?

2012-10-30

My IDs not good enough for NCBI BLAST+

Broken blastdbcmd for -target_only

2012-10-02

How not to deal with NGS data - MrFast & MrsFast

2012-09-29

Installing Python 3.3 with LZMA/XZ on Mac OS X

2012-08-16

Stop breaking NCBI BLAST searches!

2012-05-25

BLAST tabular output missing descriptions

2012-05-18

BLAST+ ignoring search space size for e-values

2012-04-18

BLAST+ memory hog with subject FASTA and XML output

2012-04-03

Missing feature locations in GenBank "with parts"

2012-03-16

Missing external exons in GenBank "with parts"

2012-03-09

BAM versus CRAM v0.7

2012-02-14

Reference based SAM/BAM compression

2012-01-23

Ion Torrent Suite on GitHub

2012-01-16

Ion Torrent does the Samba