2014-11-26

Column headers in BLAST+ tabular and CSV output

In the last couple of years, my preferred BLAST output format has switched from BLAST XML to plain tabular output. The main reason for this it is easier to parse, and now gives easy access to more fields - BLAST+ 2.2.28 added descriptions and taxonomy output to the tabular and CSV output, but the cumulative effect is BLAST XML has been lagging behind.

However, there is a simple change the NCBI could make to greatly improve the usability of the tabular or CSV output - label the columns with a header line! This is vital meta-data: No-one should be forced to guess-the-columns when presented with a data file. 

2014-10-31

BLAST! No frequency ratios needed for composition-based statistics

While working on updating the NCBI BLAST+ wrapper for Galaxy for any changes in the new BLAST+ 2.2.30 release, I hit a cryptic error message from deltablast

$ deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb /data/blastdb/cdd_delta
BLAST engine error: /data/blastdb/cdd_delta contains no frequency ratios needed for composition-based statistics.
Please disable composition-based statistics when searching against /data/blastdb/ncbi/cdd/cdd_delta.

To cut a long story short, to fix this you need to download and unpack a newer cdd_delta.tar.gz which now includes another file cdd_delta.freq containing frequency ratio information which the newer deltablast tool requires.

The same applies to the rpsblast tool, although here you just get a warning rather than an error:

$ rpsblast -query four_human_proteins.fasta -db /data/blastdb/cdd_delta -evalue 1e-08 -outfmt "6 qseqid sseqid score"
Warning: /data/blastdb/cdd_delta contain(s) no freq ratios needed for composition-based statistics.
RPSBLAST will be run without composition-based statistics.
sp|Q9BS26|ERP44_HUMAN    gnl|CDD|222416    401
...
sp|P06213|INSR_HUMAN    gnl|CDD|238021    137
sp|P08100|OPSD_HUMAN    gnl|CDD|215646    411

2014-04-09

BLAST+ 2.2.29 upset by [key=value] entries in queries

I recently got a weird error/warning message (repeated) in my BLAST+ stderr output,

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.


This turns out to be due to having [key=value] tags in my query FASTA file, and appears to be a new bug introduced in BLAST+ 2.2.29 (as BLAST+ 2.2.26 through 2.2.28 inclusive are not affected).

Update (31 October 2014): This was fixed in BLAST+ 2.2.30 (released yesterday).

2014-02-27

BLAST XML output needs more love from NCBI

For some time I had thought that the best option for computer parsing of BLAST+ output was BLAST XML. It had all the key bits of information, and XML is designed for automated parsing. However, with the extra fields added to the tabular or comma separated output in BLAST+ 2.2.28 like the long overdue hit descriptions, and taxonomy fields, I think they are now preferable. BLAST XML is now lagging behind!

2013-12-24

BLAST+ should keep its BL_ORD_ID identifiers to itself

This is in a sense a continuation of my previous BLAST blog post, My IDs not good enough for NCBI BLAST+. My core complaint is that makeblastdb currently ignores the user's own identifiers and automatically assigns its own identifiers (gnl|BL_ORD_ID|0, gnl|BL_ORD_ID|1, gnl|BL_ORD_ID|2, etc), and that the BLAST+ suite as a whole is inconsistent about hiding these in its output.

Note that one side-effect of BLAST+ ignoring the user identifiers and creating its own is that it can tolerate databases made from FASTA files with accidentally duplicated identifiers, but this only causes great confusion and ambiguity in the downstream analysis. One of the ways I've seen FASTA files be created with accidentally duplicated identifiers is pooling of assemblies where generic names like contig1 (or even the more complex Trinity naming scheme) naturally cause clashes. In situations like this, I think makeblastdb should give an error when attempting to build a BLAST database.