2019-01-08

An overly aggressive optimization in BLASTN and MegaBLAST

All my recent blog posts have been looking at issues raised by the recent letter Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows", and the associated test case which they ought to have included with the letter itself. Thus far I have focussed on:


Over the Christmas break, there were two notable public developments. The BLAST team's reply was published, Madden et al. (2018) "Reply to the paper: Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows". Quoting from this:

"we examined the new example and it became clear that the demonstrated behavior was a bug, resulting from an overly aggressive optimization, introduced in 2012 for BLASTN and MegaBLAST (DNADNA alignments). This bug has been fixed in the BLAST+ 2.8.1 release, due out in December 2018. The aberrant behavior seems to occur only in alignments with an extremely large number of gaps, which is the case in the example provided by Shah and collaborators."

And BLAST+ v2.8.1 was released. Quoting the release notes,

"Disabled an overly aggressive optimization that caused problems mentioned by Shah et al."

So, what was this aggressive optimisation in BLASTN and MegaBLAST, and how did it combine with the internal candidate alignment limit and database order to produce counter intuitive results? It turns out to explain the details I'd not yet followed up from blog post part three.

When was the change?

The NCBI says the "overly aggressive optimization" was introduced in 2012, which suggests it first appeared in BLAST+ 2.2.26 (31 Jan 2012), BLAST+ 2.2.27 (10 September 2012), or as it turns out BLAST+ 2.2.28 (released 19 March 2013).

Time for some historical testing - a continuation of the examples in my blog post BLAST max alignment limits reply - part three, using a deduplicated version of the Shah et al. test case.

On my system the earliest version of BLAST+ where the NCBI provided 64-bit Linux binaries work is BLAST+ 2.2.21, but this and 2.2.22 do not use the subject names in the tabular output - so I'm starting from v2.2.23 instead. There are some apparently minor differences in scores or gaps as the versions change, but if we look just at the query and match names what do we see?

Here I am using a multi-line double nested bash for loop, where I look at each version of BLAST in turn, and use the head command to get the top result for running BLAST with the default alignment limit. Then I output the MD5 checksum of the first two columns (query and match name). Happily each version gives the same output:

$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1
  do echo $v; rm -rf top1_${v}.tsv; for i in {1..10};
  do ~/downloads/ncbi-blast-${v}+/bin/blastn -query query_$i.fasta -db dedup.fasta -outfmt 6 | head -n 1 >> top1_${v}.tsv;
  done; cut -f 1,2 top1_${v}.tsv | md5sum;
  done;
2.2.23
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.24
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.25
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.26
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.27
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.28
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.29
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.30
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.31
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.3.0
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.4.0
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.5.0
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.6.0
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.7.1
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.8.1
dd4d4c2ffb2a1d43bc3844b5406366d6  -

There have been (apparently minor) changes in the output, which I am not going to look at further:

$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
  do md5sum top1_${v}.tsv;
  done
8b64ffa5e3452e03e53b5c8f83efd00d  top1_2.2.23.tsv
251b13795f5df0505b6f476dc34ae997  top1_2.2.24.tsv
251b13795f5df0505b6f476dc34ae997  top1_2.2.25.tsv
251b13795f5df0505b6f476dc34ae997  top1_2.2.26.tsv
51794c55ec13bb6d96f180c2508a85a8  top1_2.2.27.tsv
51794c55ec13bb6d96f180c2508a85a8  top1_2.2.28.tsv
d6f8bdd885154ab3ad66f51b86cddb62  top1_2.2.29.tsv
d6f8bdd885154ab3ad66f51b86cddb62  top1_2.2.30.tsv
ae5406a65f9e6842ee01dcbbafc182c3  top1_2.2.31.tsv
4118c80f712b0dff24cffb328432c0bb  top1_2.3.0.tsv
ac84739316e78ba53811c71c1eea0b8a  top1_2.4.0.tsv
ac84739316e78ba53811c71c1eea0b8a  top1_2.5.0.tsv
ac84739316e78ba53811c71c1eea0b8a  top1_2.6.0.tsv
ac84739316e78ba53811c71c1eea0b8a  top1_2.7.1.tsv
ac84739316e78ba53811c71c1eea0b8a  top1_2.8.1.tsv

Still, looking at just the first two columns (query and match names), nothing changed. Now, for the simpler command to request just one alignment from each query - again with a bash for loop over the BLAST+ version, and reporting the MD5 checksum of the first two columns:

$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
  do echo $v;
  ~/downloads/ncbi-blast-${v}+/bin/blastn -query example.fasta -db dedup.fasta -outfmt 6 -max_target_seqs 1 -out max1_${v}.tsv; cut -f 1,2 max1_${v}.tsv | md5sum;
  done
2.2.23
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.24
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.25
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.26
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.27
dd4d4c2ffb2a1d43bc3844b5406366d6  -
2.2.28
be102cef9b8eca7a49ffd35c74005999  -
2.2.29
be102cef9b8eca7a49ffd35c74005999  -
2.2.30
be102cef9b8eca7a49ffd35c74005999  -
2.2.31
be102cef9b8eca7a49ffd35c74005999  -
2.3.0
be102cef9b8eca7a49ffd35c74005999  -
2.4.0
be102cef9b8eca7a49ffd35c74005999  -
2.5.0
be102cef9b8eca7a49ffd35c74005999  -
2.6.0
be102cef9b8eca7a49ffd35c74005999  -
2.7.1
be102cef9b8eca7a49ffd35c74005999  -
2.8.1
Warning: [blastn] Examining 5 or more matches is recommended
dd4d4c2ffb2a1d43bc3844b5406366d6  -

The new release BLAST+ 2.8.1 gives a warning about using the alignment limit, but it's output now matches that from BLAST+ 2.2.23 to 2.2.27 inclusive. Meanwhile, BLAST+ 2.2.28 to 2.7.1 gave a consistent but different best hit.

Another way to look at this is to compare the BLAST hit when requesting at most one alignment, versus the top hit without the limit:

$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
  do echo $v;
  diff -q <(cut -f 1,2 max1_${v}.tsv) <(cut -f 1,2 top1_${v}.tsv);
  done
2.2.23
2.2.24
2.2.25
2.2.26
2.2.27
2.2.28
Files /dev/fd/63 and /dev/fd/62 differ
2.2.29
Files /dev/fd/63 and /dev/fd/62 differ
2.2.30
Files /dev/fd/63 and /dev/fd/62 differ
2.2.31
Files /dev/fd/63 and /dev/fd/62 differ
2.3.0
Files /dev/fd/63 and /dev/fd/62 differ
2.4.0
Files /dev/fd/63 and /dev/fd/62 differ
2.5.0
Files /dev/fd/63 and /dev/fd/62 differ
2.6.0
Files /dev/fd/63 and /dev/fd/62 differ
2.7.1
Files /dev/fd/63 and /dev/fd/62 differ
2.8.1

So, things worked as expected with this (deduplicated) database for BLAST+ 2.2.23 through 2.2.27, but the top result changed with BLAST+ 2.2.28 to 2.7.1 with -max_target_seqs 1, and this was fixed again in BLAST+ 2.8.1.

What was the change?


Having pinned down the change to the release of BLAST+ 2.2.28, with the fix being in BLAST+ 2.8.1, I was able to make a guess at the relevant commits in SVN.

The version bump for BLAST+ 2.2.27 was done in 2 August 2012 in SVN revision 55265, while the version bump for 2.2.28 was done on 12 March 2013 in SVN revision  57472,  so we're looking for the bug being introduced in that window.

Similarly, the version bump for BLAST+ 2.8.0 (alpha) was done on 16 Jan 2018 in SVN revision 80848, while the final release date bump for 2.8.1 was 20 November 2018 in SVN revision 84607, so we're looking for the fix in that window.

I asked the BLAST team if my guess was right, it wasn't - Tom Madden kindly pointed me at the relevant commits. First, this commit added a new option:
  • 30 November 2011, SVN revision 52133
    "Ignore low scoring ungapped alignments if hitlist is full, JIRA:SB-914"
Then this commit activated the setting in blast_nucl_options.cpp:

  • 18 October 2012, SVN revision 56007
  • "Use standard gap trigger for blastn to find missing hits, increase reduced_nucl_cutoff_score and turn on SetLowScorePerc to make up lost speed, JIRA:SB-1047"

While finally this was the fix, turning that setting off again:
  • 9 November 2018, SVN revision 84455
    "Disable ungapped low score perc check, JIRA:SB-2407"
This was fixed after the NCBI studied the test case from Shah et al. (2018), which had gap-rich alignments.

Returning to the original Shah et al. (2018) test case

Does the same apply with the original Shah et al. (2018) test case with duplicated sequences?

$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
   do echo $v; rm -rf dups_top1_${v}.tsv; for i in {1..10};
   do ~/downloads/ncbi-blast-${v}+/bin/blastn -query query_$i.fasta -db db.fasta -outfmt 6 | head -n 1 >> dups_top1_${v}.tsv;
   done; cut -f 1,2 dups_top1_${v}.tsv | md5sum;
   done;
2.2.23
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.24
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.25
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.26
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.27
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.28
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.29
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.30
77c8ac0df4a04523c39f01cdd4629b1f  -
2.2.31
77c8ac0df4a04523c39f01cdd4629b1f  -
2.3.0
77c8ac0df4a04523c39f01cdd4629b1f  -
2.4.0
77c8ac0df4a04523c39f01cdd4629b1f  -
2.5.0
77c8ac0df4a04523c39f01cdd4629b1f  -
2.6.0
77c8ac0df4a04523c39f01cdd4629b1f  -
2.7.1
77c8ac0df4a04523c39f01cdd4629b1f  -
2.8.1
77c8ac0df4a04523c39f01cdd4629b1f  -

That shows the top hit returned from the duplicate database was consistent over these versions of BLAST+, but with the alignment limit set to one:

$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
   do echo $v;
   ~/downloads/ncbi-blast-${v}+/bin/blastn -query example.fasta -db db.fasta -outfmt 6 -max_target_seqs 1 -out dups_max1_${v}.tsv; cut -f 1,2 dups_max1_${v}.tsv | md5sum;
   done
2.2.23
9870749de9ca82e81bc96ba72b45c2f9  -
2.2.24
9870749de9ca82e81bc96ba72b45c2f9  -
2.2.25
9870749de9ca82e81bc96ba72b45c2f9  -
2.2.26
9870749de9ca82e81bc96ba72b45c2f9  -
2.2.27
9870749de9ca82e81bc96ba72b45c2f9  -
2.2.28
b980215a60700a608bf4016f6ddced2b  -
2.2.29
b980215a60700a608bf4016f6ddced2b  -
2.2.30
b980215a60700a608bf4016f6ddced2b  -
2.2.31
b980215a60700a608bf4016f6ddced2b  -
2.3.0
b980215a60700a608bf4016f6ddced2b  -
2.4.0
b980215a60700a608bf4016f6ddced2b  -
2.5.0
b980215a60700a608bf4016f6ddced2b  -
2.6.0
b980215a60700a608bf4016f6ddced2b  -
2.7.1
b980215a60700a608bf4016f6ddced2b  -
2.8.1
Warning: [blastn] Examining 5 or more matches is recommended
9870749de9ca82e81bc96ba72b45c2f9  -

We see the same pattern - and BLAST+ 2.8.1 fixed things to match BLAST+ 2.2.23 to 2.2.27, but these are not all the same hits! Here is the change in the BLAST+ 2.2.7 output:

$ diff <(cut -f 1,2 dups_top1_2.2.27.tsv) <(cut -f 1,2 dups_max1_2.2.27.tsv)
5c5
< NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087
---
> NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088

And the same with BLAST+ 2.8.1,

$ diff <(cut -f 1,2 dups_top1_2.8.1.tsv) <(cut -f 1,2 dups_max1_2.8.1.tsv)
5c5
< NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087
---
> NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088

It turns out that query 5 (NC_006448.1-15016:454_5cov_045M-050M) still suffers from the internal candidate limit problem (see blog post part four, with one alignment requested an internal limit of 10 would be used):

$ ~/downloads/ncbi-blast-2.2.27+/bin/blastn -query query_5.fasta -db db.fasta -outfmt 6 -max_target_seqs 1
NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241

There are six duplicates of the true best hit which are lost (shown in red), giving the sixth ranked sequence instead which is one of the second best alignments (group shown in orange):

$ ~/downloads/ncbi-blast-2.2.27+/bin/blastn -query query_5.fasta -db db.fasta -outfmt 6 | head -n 25
NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087 98.54 137 1 1 1 136 492 628 2e-65  243
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.54 137 1 1 1 136 492 628 2e-65  243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087 98.54 137 1 1 1 136 492 628 2e-65  243
NC_006448.1-15016:454_5cov_045M-050M s_8701:COG0087 98.54 137 1 1 1 136 492 628 2e-65  243
NC_006448.1-15016:454_5cov_045M-050M s_16540:COG0087 98.54 137 1 1 1 136 492 628 2e-65  243
NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_12527:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_3668:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_4487:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_238:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_4073:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_15051:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_5970:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_6971:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_10867:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_4893:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_9798:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_16093:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_10339:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_3051:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_9302:COG0088 100.00 130 0 0 161 290 1 130 9e-65  241
NC_006448.1-15016:454_5cov_045M-050M s_15416:COG0087 97.81 137 2 1 1 136 492 628 1e-63  237
NC_006448.1-15016:454_5cov_045M-050M s_16215:COG0087 97.81 137 2 1 1 136 492 628 1e-63  237
NC_006448.1-15016:454_5cov_045M-050M s_16063:COG0087 97.81 137 2 1 1 136 492 628 1e-63  237
NC_006448.1-15016:454_5cov_045M-050M s_15862:COG0087 97.81 137 2 1 1 136 492 628 1e-63  237

Shown here using BLAST+ 2.2.27, but the same applies with BLAST+ 2.8.1 as well.

Conclusion

It appears the bug fix in BLAST+ 2.8.1 to remove the overly aggressive optimisation in BLASTN and MegaBLAST alone fixed 9 of the 10 queries with the original Shah et al. (2018) test case with a database full of duplicates.

Only query 5 (NC_006448.1-15016:454_5cov_045M-050M) is also directly affected by the internal alignment limit setting (see blog post part four), which can be solved by also deduplicating the database.

Note that based on my analysis on blog post part three (using BLAST+ 2.7.0), with BLAST+ 2.2.28 through 2.8.0, I think just deduplicating the database fixed 5 of the 10 queries. The remainder are now explained by the 2012 "overly aggressive optimization".

My advice from blog post part three still stands - you should deduplicate your database (especially for nucleotides), as does the previous post looking at exactly how the BLAST database order tie breaking is done (be careful if you have lots of similar sequences like marker genes, and there is meaning in how your database entries are sorted).

In closing, while some of the strange behaviour in the Shah et al. (2018), where applying  -max_target_seqs 1 gave a different result, could be explained by deduplicating the database to avoid the internal alignment candidate limit (i.e. the original 2015 issue), it turns out it was mostly due to a completely separate issue which the BLAST developers could identify using the test case.

I think what Shah et al. ought to have done was contact the BLAST developers with a bug report, sharing their reproducible test case. In hindsight, they would likely have been told something like "Ah, yes. Thank you for the test case, that was very helpful. Sorry. Actually that's a different bug present since BLAST+ 2.2.28, affecting gap-rich alignments. We've fixed this as part of the next release (BLAST+ 2.8.1), due December 2018."

Instead, we are left with a confused and misleading letter in the scientific literature. Still, on the bright side, more people are now aware of how the alignment limits and other heuristics in BLAST work, and are less likely to take the top hit at face value. And I got to blog more - yay for my scientific impact!

1 comment:

  1. Enjoyed the post :) I agree that Shah et al. should have sent a nice bug report to the developers.

    ReplyDelete