2015-12-18

What BLAST's max-target-sequences doesn't do


This is a short post to highlight a scary BLAST+ -max_target_seqs bug found and reported by Sujai Kumar, which he discovered in the course of working on some puzzling Blobtools output while analysing the tardigrade genome.



In essence we, and probably most BLAST+ users, had assumed the -max_target_seqs command line option was applied after the search was finished prior to output, and likewise for the plain text output's -num_descriptions and -num_alignments (see also #5 on my Christmas 2014 BLAST wish list).

$ blastp -help
...
-max_target_seqs <Integer, >=1>
Maximum number of aligned sequences to keep
Not applicable for outfmt <= 4
Default = `500'
* Incompatible with: num_descriptions, num_alignments
...


Then Sujai found a reproducible counter example contracting this mental model, where the top 20 hits switched from all being Eukaryotes to all Bacteria depending on -max_target_seqs.

The NCBI reply by email was:
Hello,
Thank you for the report. We don't consider this a bug, but I agree that we should document this possibility better. This can happen because limits, including max target sequences, are applied in an early ungapped phase of the algorithm, as well as later. In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits. In your case, relaxing the limit to 200 appears to have allowed hits that would have been excluded in the ungapped phase at 100 max target sequences to rise.


So basically while most of the time -max_target_seqs does seem to be a final filter before the output is written, this setting actually happens much earlier in the search and can (unexpectedly) cull what turns out to be the best hit.

I agree with the NCBI BLAST team that this needs better documentation, but believe this is should be treated as bug not a feature.

Many thanks to Sujai for openly reporting this via Twitter and posting his BLAST+ bug report as a gist on GitHub where we could see and comment on the issue. Sadly I'm still waiting for an official NCBI public BLAST+ bug tracker

No comments:

Post a Comment