2018-11-13

BLAST max alignment limits reply - part four

This is the fourth in a series of blog posts seeking to throw light some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (one and two, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the Shah et al. (2018) test case. Thank you!

Their test case turns out to be using MEGABLAST (the default algorithm in the blastn binary), with a custom nucleotide BLAST database (the previous blog post examined this).

On the other hand, the original Dec 2015 -max_target_seqs bug report (and my earlier blog posts), used BLASTP with a protein BLAST database.

This is important because one key setting which the internal limit on the number of alignments (N_i) that BLAST+ considers depends on, is if  composition-based statistics (CBS) are being used. This is the default with BLASTP, but not for MEGABLAST (i.e. the blastn binary).

The key point is that requesting N=1 alignments, but otherwise the blastp tool's default settings, gives an internal limit N_i = 2*N + 50 = 52, but with the blastn tool you get an internal alignment limit N_i = 10.  Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm, but this emphasises why it is especially important to avoid duplicates in a nucleotide BLAST database.

How is the internal alignment limit set?

In late October 2018, the NCBI team added an Appendix entry "Outline of the BLAST process" to the online BLAST+ documentation, which described in words how the internal maximum limit of N_i databases sequences is setup.

This was revised in early November 2018 to note "CBS can be applied for BLASTP, BLASTX, and TBLASTN". How N_i was set with CBS in a BLASTP example was described in a  previous post, but Shah et al. (2018) turned out to be using a MEGABLAST example - and this not using CBS.

This is the code as used from the first public BLAST+ release 2.2.18 (released 14 October 2008) though to the current release BLAST+ 2.7.1 (released 23 October 2017) and the 2.8.0alpha (released 28 March 2018) as well, at around line 65 of file c++/src/algo/blast/core/blast_hits.c:

       prelim_hitlist_size = hit_options->hitlist_size;
       if (ext_options->compositionBasedStats)
            prelim_hitlist_size = prelim_hitlist_size * 2 + 50;
       else if (scoring_options->gapped_calculation)
            prelim_hitlist_size = MIN(2 * prelim_hitlist_size,
                                      prelim_hitlist_size + 50);

       (*retval)->prelim_hitlist_size = MAX(prelim_hitlist_size, 10);

The release dates are copied from the NCBI BLAST+ release notes. I've omitted the full output, but by eye this code snippet looked identical over all the tar-ball source code releases from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ as examined with:

$ grep -A 7 "prelim_hitlist_size = hit_options->hitlist_size;" ncbi-blast-2.*+-src/c++/src/algo/blast/core/blast_hits.c
...

As an aside, I found the same file under NCBI source control online (and confused myself temporarily by misreading a decade - see the update at the end of the post). It matches as of the latest changes, 31 October 2018 (SVN revision 84285). Curiously this records an important change (before the first public release of BLAST+) in 15 April 2008 (SVN revision 37584), when the plus fifty change was made to the CBS mode as part of a bi-weekly merge with no explanatory commit comment. Also, in 9 April 2007 (SVN revision 33344), a stylistic change was made (introducing a local temporary variable), but importantly it added the last line quoted, as emphasised in the commit comment "save hits for at least 10 sequences, in case the traceback significantly changes the scores of hits found". I'm not going to go back any further in this code's history here, since I think nothing substantial changed (other than a special case for RPS-BLAST settings) since this fragment was introduced back in 16 May 2005 (SVN revision 22335).

Minimum N_i values

Putting the C++ into pseudocode, and noting that in the CBS case due to the plus fifty, the value will always exceed the minimum of ten, we have:

if CBS:
    N_i = 2*N + 50
elif gapped:
    N_i = MAX(MIN(2*N, N+50), 10)
else:
    N_i = MAX(N, 10)


So, if asking for one alignment (N=1), via -max_target_seqs 1 or equivalently for the human readable output formats (see part one), what do we get?

if CBS:
    N_i = 2*N + 50 = 52
elif gapped:
    N_i = MAX(MIN(2*N, N+50), 10) = MAX(MIN(2, 52), 10) = MAX(2, 10) = 10
else:
    N_i = MAX(N, 10) = MAX(1, 10) = 10


That means for N=1 with CBS we get an internal limit N_i = 52, and otherwise N_i = 10.

When are composition-based statistics (CBS) used?

Consulting the command line help,

$ blastp -help
...
 -comp_based_stats 
   Use composition-based statistics:
       D or d: default (equivalent to 2 )
       0 or F or f: No composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t : Composition-based score adjustment as in Bioinformatics
   21:902-911,
       2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
       2005, unconditionally
   Default = `2'
...

I won't quote them all, but blastp -help,  blastx -help and tblastn -help all report their default is -comp_based_stats 2, meaning enabled as per the Bioinformatics paper, Yu et al. (2005). Likewise for rpsblast and rpstblastn (although their the default is  -comp_based_stats 1, meaning CBS is enabled as per the older NAR paper, Schäffer et al. 2001).

Meanwhile, there is no mention of composition in the blastn -help, nor tblastx -help,  which fits as both of those papers are about protein databases.

Conclusion

By default the command line tools blastp, blastx, tblastn, rpsblast and rpstblastn for protein databases all use CBS, and so requesting N=1 we get an internal limit N_i = 52.

However, with blastn and tblastx for nucleotide databases, there is no CBS mode, and so with N=1 we get the much lower internal limit N_i = 10.

Evidently for the Shah et al. (2018) test case, even with the database de-duplicated as described in my earlier post, N_i = 10 is not big enough, and this heuristic limit is affecting the top hit returned when only one alignment is requested.

In summary, if you are using the alignment limits like -max_target_seqs 1, then it is especially important to avoid duplicates in a nucleotide BLAST database.

Update (14 November 2018)

Corrected the decade of the most recent SVN commit of interest, 15 April 2008 (SVN revision 37584). It was not 15 April 2018, which was what was confusing me. I updated this and the start of the paragraph accordingly (with note added in italics). It would be interesting to go over the "legacy" BLAST release notes from that period.

The NCBI have also updated the Appendix entry "Outline of the BLAST process" again, which is now much clearer about this limit, and the minimum value of 10.

Update (4 December 2018)

Corrected minor typo in third paragraph of the conclusion, de-duplicated rather than duplicated.

5 comments:

  1. Do you plan to write an article for Oxford Bioinformatics, proving that the problem with BLAST in not nearly as big as Shah and co-authors stated? I think this article will be very valuable. Many bioinformaticians almost panic because they don't realize that the article of Shah is wrong https://oxfordjournals.altmetric.com/details/48829434/twitter

    ReplyDelete
    Replies
    1. Thank you for your feedback. Sujai Kumar and I have discussed writing a formal reply letter, and have been in touch with the NCBI BLAST team about this. There are still a few more points I would like to explore as blog posts, so in the mean time, please subscribe to the RSS/Atom feed (or follow me on Twitter).

      Delete
  2. Thank you for this excellent elucidation of the determination of prelim_hitlist_size in the various BLAST programs. Using your analogy about a half marathon, what criteria do the BLAST programs use to determine which sequences are in the lead at the halfway point (I am particularly concerned with BLASTn)? That is, how do sequences make it onto the prelim_hitlist? Is it really as simple the first ones that meet the criteria?

    ReplyDelete
    Replies
    1. I can only really point you at the Appendix entry "Outline of the BLAST process", or if you really want to explore this yourself, the source code. However, saying"first ones" is misleading as the order in the database only comes into play as a tie break (see discussion in blog post part three).

      Delete
  3. All of the discussion about the SIZE of the prelim_hitlist is fine, but this is only a part of the problem. The composition of the prelim_hitlist is also a problem. Regardless of its size, if the prelim_hitlist actually contained the "best" targets, there would be no problem. Hence, the algorithm for constructing the prelim_hitlist is crucial for understanding the actual scope of the problems associated with max_target_seqs. Hopefully someone will investigate this.

    ReplyDelete