This topic dates back to 2015, when Sujai Kumar reported as a scaryBLAST+ -max_target_seqs bug which as I wrote about ("What BLAST's max-target-sequences doesn't do"), the NCBI BLAST developers explained it as a poorly documented feature.
Here I focus on what might be the most quoted part of Shah et al. (2018), which is causing what I consider to be unwarranted panic:
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
This does not seem to be the case. If I am misreading their message, I am not alone. See for example Emma Bell's blog post, or John Walshaw's comments on my 2015 post. It is possible Shah et al. have found a separate issue, but since no test case was given, that cannot currently be verified.
We can now try the two databases side by side - they contain the same sequences but in different order:
No difference. However, increasing the limit we can see minor differences in the sort order of tied hits:
That turns out to be a trivial difference in the order of tied hits:
The same appears to be true without any alignments limit, but with even more re-orderings due to more ties.
i.e. No differences after applying Unix sort to the output files, so no changes to the scoring etc.
The technical details of the setup,
These two databases now have an extra copy of KRX89027.1 either at the start or the end, and this duplicate does not have a taxonomy mapping defined. Here are the search results using a limit of one:
As you can see, depending on the database order when there is a duplicate, sometimes you get the original sequence (KRX89027.1), and sometimes the duplicate.
Increasing the limit makes this a little clearer what is going on - database order seems to be the tie breaker for ranking the identical hits:
Reverse ordering the existing test case
As explained in the previous post, I have a database older_matches.fasta of 496 sequences where the first hit returned for Sujai's tardigrade query sequence input.fasta changes with the -max_target_seqs setting.
The quote from Shah et al. (2018) claims the BLAST output when using -max_target_seqs 1 (to return a single hit) "depends on the order in which the sequences occur in the database".
It is hard to prove a negative, but this test database ought to be appropriate. I therefore prepared a new database of the same sequences but in the reverse order. A quick bit of Python generated reverse_order.fasta which was turned into a database with makeblastdbcmd as usual:
$ python3 -c "from Bio import SeqIO; \ print(SeqIO.write(list(SeqIO.parse('older_matches.fasta', 'fasta'))[::-1], 'reverse_order.fasta', 'fasta'))" 496
$ makeblastdb -dbtype prot -in reverse_order.fasta -parse_seqids -taxid_map older_matches.taxmap.txt Building a new DB, current time: 11/01/2018 13:42:24 New DB name: /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/tests/reverse_order.fasta New DB title: reverse_order.fasta Sequence type: Protein Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 496 sequences in 0.036006 seconds.
We can now try the two databases side by side - they contain the same sequences but in different order:
$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1 nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota $ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1 nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
No difference. However, increasing the limit we can see minor differences in the sort order of tied hits:
$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 -out input_vs_older_matches_max_100.tsv $ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 -out input_vs_reverse_order_max_100.tsv $ diff input_vs_older_matches_max_100.tsv input_vs_reverse_order_max_100.tsv 70d69 < nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota 71a71 > nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
That turns out to be a trivial difference in the order of tied hits:
$ grep -n -C 2 "XP_015813695.1" input_vs_older_matches_max_100.tsv 68-nHd.2.3.1.t00019-RA XP_001988914.1 59.677 124 49 1 1 124 100 222 1.03e-38 133 Eukaryota 69-nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota 70:nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota 71-nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota 72-nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota $ grep -n -C 2 "XP_015813695.1" input_vs_reverse_order_max_100.tsv 69-nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota 70-nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota 71:nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota 72-nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota 73-nHd.2.3.1.t00019-RA XP_014848806.1 59.200 125 50 1 1 125 100 223 1.26e-38 133 Eukaryota
The same appears to be true without any alignments limit, but with even more re-orderings due to more ties.
$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 \ -out input_vs_older_matches.tsv $ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 \ -out input_vs_reverse_order.tsv $ diff <(sort input_vs_older_matches.tsv) <(sort input_vs_reverse_order.tsv)
i.e. No differences after applying Unix sort to the output files, so no changes to the scoring etc.
Pathological Test Case
I was able to make a test case where the hit returned with -max_target_seqs 1 really does depend on the database order, but only by constructing an exact duplicate of the expected hit. This is hardly something that should cause a problem in real life - if there are multiple sequences giving equally good alignments and you ask BLAST for just one, this is unavoidable.The technical details of the setup,
$ grep -A 7 ">KRX89027.1" older_matches.fasta | sed "s/KRX89027.1/duplicate/g" > dup_euk_hit.fasta $ cat dup_euk_hit.fasta older_matches.fasta > dup_at_start.fasta $ cat older_matches.fasta dup_euk_hit.fasta > dup_at_end.fasta $ makeblastdb -dbtype prot -in dup_at_start.fasta -parse_seqids -taxid_map older_matches.taxmap.txt ... $ makeblastdb -dbtype prot -in dup_at_end.fasta -parse_seqids -taxid_map older_matches.taxmap.txt ...
These two databases now have an extra copy of KRX89027.1 either at the start or the end, and this duplicate does not have a taxonomy mapping defined. Here are the search results using a limit of one:
$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1 nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota $ blastp -query input.fasta -db dup_at_start.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1 nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota $ blastp -query input.fasta -db dup_at_end.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1 nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A
As you can see, depending on the database order when there is a duplicate, sometimes you get the original sequence (KRX89027.1), and sometimes the duplicate.
$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3 nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota $ blastp -query input.fasta -db dup_at_start.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3 nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota $ blastp -query input.fasta -db dup_at_end.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3 nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
Again, these example searches are missing the two even better bacterial hits which come through with a much higher alignment limit.
Conclusion
I remain open to the possibility that Shah et al. (2018) have found a separate issue, but for now think that paragraph of their paper in particular needs some serious editing.It is completely at odds with the Outline of the BLAST process (recently added to the BLAST Command Line Applications User Manual Appendix in October 2018). This describes how the number of alignments requested, N, is used as an internal limit, N_i, to cap the number of candidates to be taken forward to the final gapped alignment with traceback stage. For the default composition based statistics, that internal limit is N_i = 2*N+50.
This is like marathon race with N top spots at stake (e.g. N=1 for the gold medal, or N=3 for a podium finish), where at the half way check point only the N_i = 2*N+50 front runners are allowed to go forward. With N=1, this means only the 52 front runners at the checkpoint are allowed to finish the race - and this might exclude a slow-starter who could otherwise overtake the front runners and win.
I would like to see this better documented at the command line, but BLAST is a heuristic search tool, and this approach does clearly reduce the amount of computation performed, and thus returns results much quicker.
Note that document does confirm at the final stage "A tie (two matches with identical score and expect value) is broken by the order of the sequences in the database" as demonstrated practically above.
I am expecting the NR database to gradually build up lots more similar protein sequences as more and more genomes are sequenced and submitted. It would therefore not surprise me if the NCBI BLAST team choose in future to tinker with the culling thresholds (like increasing the current limit N_i = 2*N+50, perhaps increasing the 50 value for larger databases?) to be more relaxed at the earlier stage of the search, to reduce the chances of situations like the one Sujai found.
Simplified Test Case
While doing final proof reading, I stumbled upon some independent work on this issue. BioStars user fishgolden had an elegant idea for a simplified test case. They took one copy of the best match WP_042303394.1 (bacteria) and either 51 or 52 copies of the unwanted best match KRX89030.1 (eukaryote), to make two databases.Using -max_target_seqs 1 sets the internal limit to 52 candidates, and we know that at the first stage in the algorithm, KRX89030.1 (eukaryote) ranks higher than WP_042303394.1 (bacteria). So, if there are up to 51 copies of the eukaryote match, then the bacterial match is still included - at the check point it is ranked last, but overtakes the eukaryotic matches in the final stage to finish top. However, if there are 52 copies of the eukaryote, they alone are taken forward to the final stage, and our desired ultimate winner bacteria sequence WP_042303394.1 is excluded.
Update (4 November 2018)
Corrected spelling of Nidhi Shah's surname, in this post I had consistently written Shar et al. (2018) rather than Shah et al. 2018. My deep apologies. See the comments below, and in particular the link to Nidhi Shah's BLAST test case on GitHub, which I intend to try out this week.Update (14 November 2018)
Corrected one misspelling of Sujai's name (sorry!). Thank you to John Walshaw who spotted this.I've published BLAST max alignment limits - part three looking at the test case from Nidhi Shah, and BLAST max alignment limits - part four looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).
Already mentioned this on Twitter, but it's fairly trivial to test the claim that "...BLAST returns the first N hits that exceed the specified E-value threshold..."
ReplyDeleteTo test this, one can effectively remove the E-value threshold by setting it to some extreme value such that no hits are discarded. In that case, if this claim were true, every search should return the first N entries in the database.
Hi Peter,
ReplyDeleteWe have provided some test examples here - https://github.com/shahnidhi/BLAST_maxtargetseq_analysis. Please check it out. Also, my last name is 'Shah', which is misspelled here at multiple places. I'd appreciate if you change it. :)
- Nidhi
Deep apologies about misspelling your name - I've been actively fighting autocorrect, but have clearly internalised the wrong spelling. I'll fix that, and hope to look at your test case soon.
DeleteDear peter, Just wanted to briefly give my thanks for these detailed comments, thoughts and analyses about the BLAST issues.
ReplyDeleteCheers!
Yannick