This is the fourth in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by
Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (
one and
two, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the
Shah et al. (2018) test case. Thank you!
Their test case turns out to use MEGABLAST (the default algorithm in the
blastn binary), with a custom nucleotide BLAST database (the
previous blog post examined this).
On the other hand, the original
Dec 2015 -max_target_seqs bug report (and my earlier blog posts) used BLASTP with a protein BLAST database.
This is important because the internal limit on the number of alignments that BLAST+ considers (
N_i) depends on whether composition-based statistics (CBS) are being used. CBS is on by default with BLASTP, but
not for MEGABLAST (i.e. the
blastn binary).
The key point is that requesting
N=1 alignments, but otherwise the
blastp tool's default settings, gives an internal limit
N_i = 2*N + 50 = 52, but with the
blastn tool you get an internal alignment limit
N_i = 10. Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm. Even so, this emphasises why
it is especially important to avoid duplicates in a nucleotide BLAST database.
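
To make the arithmetic above concrete, here is a minimal Python sketch. The function and constant names are made up for illustration and are not part of BLAST+. It only encodes the two figures quoted in this post: the 2*N + 50 relationship for blastp with CBS on, and the single observed value of 10 for blastn with N=1. The general formula for the non-CBS case is not given here, so the sketch does not attempt to guess it.

```python
def internal_limit_with_cbs(n: int) -> int:
    """Internal alignment limit N_i when composition-based statistics are on.

    Uses the N_i = 2*N + 50 relationship quoted in this post for the
    blastp tool's default settings.
    """
    if n < 1:
        raise ValueError("N must be a positive integer")
    return 2 * n + 50


# Value reported in this post for blastn (MEGABLAST, no CBS) when N = 1.
# Only this single data point is recorded; the general non-CBS formula
# is deliberately not guessed at.
BLASTN_LIMIT_FOR_N1 = 10

n = 1
blastp_limit = internal_limit_with_cbs(n)
print(f"blastp, N={n}: N_i = {blastp_limit}")
print(f"blastn, N={n}: N_i = {BLASTN_LIMIT_FOR_N1}")
print(f"blastp considers {blastp_limit / BLASTN_LIMIT_FOR_N1:.1f}x as many alignments")
```

With N=1, blastp keeps roughly five times as many candidate alignments in play as blastn does, which is why a nucleotide database has less headroom to absorb duplicate entries.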