2018-11-13

BLAST max alignment limits reply - part four

This is the fourth in a series of blog posts seeking to throw light some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (one and two, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the Shah et al. (2018) test case. Thank you!

Their test case turns out to be using MEGABLAST (the default algorithm in the blastn binary), with a custom nucleotide BLAST database (the previous blog post examined this).

On the other hand, the original Dec 2015 -max_target_seqs bug report (and my earlier blog posts), used BLASTP with a protein BLAST database.

This is important because one key setting which the internal limit on the number of alignments (N_i) that BLAST+ considers depends on, is if  composition-based statistics (CBS) are being used. This is the default with BLASTP, but not for MEGABLAST (i.e. the blastn binary).

The key point is that requesting N=1 alignments, but otherwise the blastp tool's default settings, gives an internal limit N_i = 2*N + 50 = 52, but with the blastn tool you get an internal alignment limit N_i = 10.  Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm, but this emphasises why it is especially important to avoid duplicates in a nucleotide BLAST database.

BLAST max alignment limits reply - part three

This is the third in a series of blog posts seeking to throw light some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (one and two, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the Shah et al. (2018) test case. Thank you!

Their test case turns out to be using a custom nucleotide BLAST database (rather than a protein example as per the original Dec 2015 -max_target_seqs bug report, see my post "What BLAST's max-target-sequences doesn't do"), and rather than a single query sequence, they have ten.

I could reproduce their initial example locally. I could indeed see the database order coming into play - but so far nothing that cannot be explained by using this as a tie breaker. De-duplicating their database to make it non-redundant greatly improves things, but some of the queries still showed the (December 2015) problem where using -max_target_seqs 1 does not give the expected top alignment.

2018-11-02

BLAST max alignment limits repartee - part two

This is the second in a series of blog posts seeking to throw light some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". Since regrettably they did not provide a reproducible test case, my previous post began by introducing a minimal test case.

This topic dates back to 2015, when Sujai Kumar reported as a scaryBLAST+ -max_target_seqs bug which as I wrote about ("What BLAST's max-target-sequences doesn't do"), the NCBI BLAST developers explained it as a poorly documented feature.

Here I focus on what might be the most quoted part of Shah et al. (2018), which is causing what I consider to be unwarranted panic:
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
This does not seem to be the case. If I am misreading their message, I am not alone. See for example Emma Bell's blog post, or John Walshaw's comments on my 2015 post. It is possible Shah et al. have found a separate issue, but since no test case was given, that cannot currently be verified.

BLAST max alignment limits repartee - part one

Back in 2015, my blog post "What BLAST's max-target-sequences doesn't do" highlighted what we called a scary BLAST+ -max_target_seqs bug, found and reported by Sujai Kumar. The NCBI BLAST teams took the stance this was a feature not a bug (and as a heuristic search tool, this is an understandable view), but conceded it could be better documented.

Sadly, I don't think there has been much if any clarification in the BLAST+ documentation about the settings limiting the number of alignments returned, and what else they control. The recent letter Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows" did serve the purpose of raising the profile of this issue, but sadly seems confused and misleading in several places. Most regrettably they did not provide a reproducible test case, so it is possible they found another issue.

This is the first of a planned series of blog posts which seeks to clarify the situation, and some of the claims in Shah et al. (2018).

First of all, this issue is not specific to -max_target_seqs (used with the computer readable output format). With human readable output formats, the maximum of -num_descriptions and -num_alignments is used in exactly the same way during the BLAST search.