2018-11-02

BLAST max alignment limits repartee - part two

This is the second in a series of blog posts seeking to throw light some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". Since regrettably they did not provide a reproducible test case, my previous post began by introducing a minimal test case.

This topic dates back to 2015, when Sujai Kumar reported as a scaryBLAST+ -max_target_seqs bug which as I wrote about ("What BLAST's max-target-sequences doesn't do"), the NCBI BLAST developers explained it as a poorly documented feature.

Here I focus on what might be the most quoted part of Shah et al. (2018), which is causing what I consider to be unwarranted panic:
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
This does not seem to be the case. If I am misreading their message, I am not alone. See for example Emma Bell's blog post, or John Walshaw's comments on my 2015 post. It is possible Shah et al. have found a separate issue, but since no test case was given, that cannot currently be verified.

Reverse ordering the existing test case

As explained in the previous post, I have a database older_matches.fasta of 496 sequences where the first hit returned for Sujai's tardigrade query sequence input.fasta changes with the -max_target_seqs setting.

The quote from Shah et al. (2018) claims the BLAST output when using -max_target_seqs 1 (to return a single hit) "depends on the order in which the sequences occur in the database".

It is hard to prove a negative, but this test database ought to be appropriate. I therefore prepared a new database of the same sequences but in the reverse order. A quick bit of Python generated reverse_order.fasta which was turned into a database with makeblastdbcmd as usual:

$ python3 -c "from Bio import SeqIO; \
print(SeqIO.write(list(SeqIO.parse('older_matches.fasta', 'fasta'))[::-1], 'reverse_order.fasta', 'fasta'))"
496

$ makeblastdb -dbtype prot -in reverse_order.fasta -parse_seqids -taxid_map older_matches.taxmap.txt
Building a new DB, current time: 11/01/2018 13:42:24
New DB name:   /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/tests/reverse_order.fasta
New DB title:  reverse_order.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 496 sequences in 0.036006 seconds.

We can now try the two databases side by side - they contain the same sequences but in different order:

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
$ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota

No difference. However, increasing the limit we can see minor differences in the sort order of tied hits:

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 -out input_vs_older_matches_max_100.tsv
$ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 -out input_vs_reverse_order_max_100.tsv
$ diff input_vs_older_matches_max_100.tsv input_vs_reverse_order_max_100.tsv
70d69
< nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
71a71
> nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota

That turns out to be a trivial difference in the order of tied hits:

$ grep -n -C 2 "XP_015813695.1" input_vs_older_matches_max_100.tsv
68-nHd.2.3.1.t00019-RA XP_001988914.1 59.677 124 49 1 1 124 100 222 1.03e-38 133 Eukaryota
69-nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota
70:nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
71-nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota
72-nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota
$ grep -n -C 2 "XP_015813695.1" input_vs_reverse_order_max_100.tsv
69-nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota
70-nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota
71:nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
72-nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota
73-nHd.2.3.1.t00019-RA XP_014848806.1 59.200 125 50 1 1 125 100 223 1.26e-38 133 Eukaryota

The same appears to be true without any alignments limit, but with even more re-orderings due to more ties.

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 \
-out input_vs_older_matches.tsv
$ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 \
-out input_vs_reverse_order.tsv
$ diff <(sort input_vs_older_matches.tsv) <(sort input_vs_reverse_order.tsv)

i.e. No differences after applying Unix sort to the output files, so no changes to the scoring etc.

Pathological Test Case

I was able to make a test case where the hit returned with  -max_target_seqs 1 really does depend on the database order, but only by constructing an exact duplicate of the expected hit. This is hardly something that should cause a problem in real life - if there are multiple sequences giving equally good alignments and you ask BLAST for just one, this is unavoidable.

The technical details of the setup,

$ grep -A 7  ">KRX89027.1" older_matches.fasta | sed "s/KRX89027.1/duplicate/g" > dup_euk_hit.fasta
$ cat dup_euk_hit.fasta older_matches.fasta > dup_at_start.fasta
$ cat older_matches.fasta dup_euk_hit.fasta > dup_at_end.fasta
$ makeblastdb -dbtype prot -in dup_at_start.fasta -parse_seqids -taxid_map older_matches.taxmap.txt
...
$ makeblastdb -dbtype prot -in dup_at_end.fasta -parse_seqids -taxid_map older_matches.taxmap.txt
...

These two databases now have an extra copy of KRX89027.1 either at the start or the end, and this duplicate does not have a taxonomy mapping defined. Here are the search results using a limit of one:

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
$ blastp -query input.fasta -db dup_at_start.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota
$ blastp -query input.fasta -db dup_at_end.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A

As you can see, depending on the database order when there is a duplicate, sometimes you get the original sequence (KRX89027.1), and sometimes the duplicate.

Increasing the limit makes this a little clearer what is going on - database order seems to be the tie breaker for ranking the identical hits:

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota
$ blastp -query input.fasta -db dup_at_start.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota
nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
$ blastp -query input.fasta -db dup_at_end.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3
nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota

Again, these example searches are missing the two even better bacterial hits which come through with a much higher alignment limit.

Conclusion

I remain open to the possibility that Shah et al. (2018) have found a separate issue, but for now think that paragraph of their paper in particular needs some serious editing.

It is completely at odds with the Outline of the BLAST process (recently added to the BLAST Command Line Applications User Manual Appendix in October 2018). This describes how the number of alignments requested, N, is used as an internal limit, N_i, to cap the number of candidates to be taken forward to the final gapped alignment with traceback stage. For the default composition based statistics, that internal limit is N_i = 2*N+50.

This is like marathon race with N top spots at stake (e.g. N=1 for the gold medal, or N=3 for a podium finish), where at the half way check point only the N_i = 2*N+50 front runners are allowed to go forward. With N=1, this means only the 52 front runners at the checkpoint are allowed to finish the race - and this might exclude a slow-starter who could otherwise overtake the front runners and win.

I would like to see this better documented at the command line, but BLAST is a heuristic search tool, and this approach does clearly reduce the amount of computation performed, and thus returns results much quicker.

Note that document does confirm at the final stage "A tie (two matches with identical score and expect value) is broken by the order of the sequences in the database" as demonstrated practically above.

I am expecting the NR database to gradually build up lots more similar protein sequences as more and more genomes are sequenced and submitted. It would therefore not surprise me if the NCBI BLAST team choose in future to tinker with the culling thresholds (like increasing the current limit N_i = 2*N+50, perhaps increasing the 50 value for larger databases?) to be more relaxed at the earlier stage of the search, to reduce the chances of situations like the one Sujai found.

Simplified Test Case

While doing final proof reading, I stumbled upon some independent work on this issue. BioStars user fishgolden had an elegant idea for a simplified test case. They took one copy of the best match WP_042303394.1 (bacteria) and either 51 or 52 copies of the unwanted best match KRX89030.1 (eukaryote), to make two databases.
Using -max_target_seqs 1 sets the internal limit to 52 candidates, and we know that at the first stage in the algorithm, KRX89030.1 (eukaryote) ranks higher than WP_042303394.1 (bacteria). So, if there are up to 51 copies of the eukaryote match, then the bacterial match is still included - at the check point it is ranked last, but overtakes the eukaryotic matches in the final stage to finish top. However, if there are 52 copies of the eukaryote, they alone are taken forward to the final stage, and our desired ultimate winner bacteria sequence WP_042303394.1 is excluded.

Update (4 November 2018)

Corrected spelling of Nidhi Shah's surname, in this post I had consistently written Shar et al. (2018) rather than Shah et al. 2018. My deep apologies. See the comments below, and in particular the link to Nidhi Shah's BLAST test case on GitHub, which I intend to try out this week.

Update (14 November 2018)

Corrected one misspelling of Sujai's name (sorry!). Thank you to John Walshaw who spotted this.

I've published BLAST max alignment limits - part three looking at the test case from Nidhi Shah, and BLAST max alignment limits - part four looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).

4 comments:

  1. Already mentioned this on Twitter, but it's fairly trivial to test the claim that "...BLAST returns the first N hits that exceed the specified E-value threshold..."

    To test this, one can effectively remove the E-value threshold by setting it to some extreme value such that no hits are discarded. In that case, if this claim were true, every search should return the first N entries in the database.

    ReplyDelete
  2. Hi Peter,

    We have provided some test examples here - https://github.com/shahnidhi/BLAST_maxtargetseq_analysis. Please check it out. Also, my last name is 'Shah', which is misspelled here at multiple places. I'd appreciate if you change it. :)
    - Nidhi

    ReplyDelete
    Replies
    1. Deep apologies about misspelling your name - I've been actively fighting autocorrect, but have clearly internalised the wrong spelling. I'll fix that, and hope to look at your test case soon.

      Delete
  3. Dear peter, Just wanted to briefly give my thanks for these detailed comments, thoughts and analyses about the BLAST issues.
    Cheers!
    Yannick

    ReplyDelete