Their test case turns out to be using MEGABLAST (the default algorithm in the blastn binary), with a custom nucleotide BLAST database (the previous blog post examined this).
On the other hand, the original Dec 2015 -max_target_seqs bug report (and my earlier blog posts), used BLASTP with a protein BLAST database.
This is important because one key setting which the internal limit on the number of alignments (N_i) that BLAST+ considers depends on, is if composition-based statistics (CBS) are being used. This is the default with BLASTP, but not for MEGABLAST (i.e. the blastn binary).
The key point is that requesting N=1 alignments, but otherwise the blastp tool's default settings, gives an internal limit N_i = 2*N + 50 = 52, but with the blastn tool you get an internal alignment limit N_i = 10. Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm, but this emphasises why it is especially important to avoid duplicates in a nucleotide BLAST database.
How is the internal alignment limit set?In late October 2018, the NCBI team added an Appendix entry "Outline of the BLAST process" to the online BLAST+ documentation, which described in words how the internal maximum limit of N_i databases sequences is setup.
This was revised in early November 2018 to note "CBS can be applied for BLASTP, BLASTX, and TBLASTN". How N_i was set with CBS in a BLASTP example was described in a previous post, but Shah et al. (2018) turned out to be using a MEGABLAST example - and this not using CBS.
This is the code as used from the first public BLAST+ release 2.2.18 (released 14 October 2008) though to the current release BLAST+ 2.7.1 (released 23 October 2017) and the 2.8.0alpha (released 28 March 2018) as well, at around line 65 of file c++/src/algo/blast/core/blast_hits.c:
prelim_hitlist_size = hit_options->hitlist_size; if (ext_options->compositionBasedStats) prelim_hitlist_size = prelim_hitlist_size * 2 + 50; else if (scoring_options->gapped_calculation) prelim_hitlist_size = MIN(2 * prelim_hitlist_size, prelim_hitlist_size + 50); (*retval)->prelim_hitlist_size = MAX(prelim_hitlist_size, 10);
The release dates are copied from the NCBI BLAST+ release notes. I've omitted the full output, but by eye this code snippet looked identical over all the tar-ball source code releases from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ as examined with:
$ grep -A 7 "prelim_hitlist_size = hit_options->hitlist_size;" ncbi-blast-2.*+-src/c++/src/algo/blast/core/blast_hits.c
As an aside, I found the same file under NCBI source control online (and confused myself temporarily by misreading a decade - see the update at the end of the post). It matches as of the latest changes, 31 October 2018 (SVN revision 84285). Curiously this records an important change (before the first public release of BLAST+) in 15 April 2008 (SVN revision 37584), when the plus fifty change was made to the CBS mode as part of a bi-weekly merge with no explanatory commit comment. Also, in 9 April 2007 (SVN revision 33344), a stylistic change was made (introducing a local temporary variable), but importantly it added the last line quoted, as emphasised in the commit comment "save hits for at least 10 sequences, in case the traceback significantly changes the scores of hits found". I'm not going to go back any further in this code's history here, since I think nothing substantial changed (other than a special case for RPS-BLAST settings) since this fragment was introduced back in 16 May 2005 (SVN revision 22335).
Minimum N_i valuesPutting the C++ into pseudocode, and noting that in the CBS case due to the plus fifty, the value will always exceed the minimum of ten, we have:
if CBS: N_i = 2*N + 50 elif gapped: N_i = MAX(MIN(2*N, N+50), 10) else: N_i = MAX(N, 10)
So, if asking for one alignment (N=1), via -max_target_seqs 1 or equivalently for the human readable output formats (see part one), what do we get?
if CBS: N_i = 2*N + 50 = 52 elif gapped: N_i = MAX(MIN(2*N, N+50), 10) = MAX(MIN(2, 52), 10) = MAX(2, 10) = 10 else: N_i = MAX(N, 10) = MAX(1, 10) = 10
That means for N=1 with CBS we get an internal limit N_i = 52, and otherwise N_i = 10.
When are composition-based statistics (CBS) used?Consulting the command line help,
$ blastp -help ... -comp_based_stats
Use composition-based statistics: D or d: default (equivalent to 2 ) 0 or F or f: No composition-based statistics 1: Composition-based statistics as in NAR 29:2994-3005, 2001 2 or T or t : Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties 3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally Default = `2' ...
I won't quote them all, but blastp -help, blastx -help and tblastn -help all report their default is -comp_based_stats 2, meaning enabled as per the Bioinformatics paper, Yu et al. (2005). Likewise for rpsblast and rpstblastn (although their the default is -comp_based_stats 1, meaning CBS is enabled as per the older NAR paper, Schäffer et al. 2001).
Meanwhile, there is no mention of composition in the blastn -help, nor tblastx -help, which fits as both of those papers are about protein databases.
ConclusionBy default the command line tools blastp, blastx, tblastn, rpsblast and rpstblastn for protein databases all use CBS, and so requesting N=1 we get an internal limit N_i = 52.
However, with blastn and tblastx for nucleotide databases, there is no CBS mode, and so with N=1 we get the much lower internal limit N_i = 10.
Evidently for the Shah et al. (2018) test case, even with the database de-duplicated as described in my earlier post, N_i = 10 is not big enough, and this heuristic limit is affecting the top hit returned when only one alignment is requested.
In summary, if you are using the alignment limits like -max_target_seqs 1, then it is especially important to avoid duplicates in a nucleotide BLAST database.
Update (14 November 2018)Corrected the decade of the most recent SVN commit of interest, 15 April 2008 (SVN revision 37584). It was not 15 April 2018, which was what was confusing me. I updated this and the start of the paragraph accordingly (with note added in italics). It would be interesting to go over the "legacy" BLAST release notes from that period.
The NCBI have also updated the Appendix entry "Outline of the BLAST process" again, which is now much clearer about this limit, and the minimum value of 10.