2018-11-13

BLAST max alignment limits reply - part three

This is the third in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (one and two, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November with the URL of a GitHub repository describing the Shah et al. (2018) test case. Thank you!

Their test case turns out to be using a custom nucleotide BLAST database (rather than a protein example as per the original Dec 2015 -max_target_seqs bug report, see my post "What BLAST's max-target-sequences doesn't do"), and rather than a single query sequence, they have ten.

I could reproduce their initial example locally. I could indeed see the database order coming into play - but so far nothing that cannot be explained by using this as a tie breaker. De-duplicating their database to make it non-redundant greatly improves things, but some of the queries still showed the (December 2015) problem where using -max_target_seqs 1 does not give the expected top alignment.

Their core test case

Using the links provided, I fetched the ten-sequence query file example.fasta and the 1991-sequence db.fasta, and made a nucleotide BLAST database as usual. It does not seem to matter that they used BLAST+ 2.6.0, while I am using BLAST+ 2.7.1 (on 64-bit Linux, installed via BioConda).

$ blastn -version
blastn: 2.7.1+
 Package: blast 2.7.1, build Sep 20 2018 02:20:26
$ makeblastdb -dbtype nucl -in db.fasta
Building a new DB, current time: 11/05/2018 14:55:50
New DB name:   /.../db.fasta
New DB title:  db.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1991 sequences in 0.182925 seconds.

First, running with the limit on, the output is naturally short:

$ blastn -query example.fasta -db db.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_15833:COG0185 87.583 451 8 42 2352 2785 1 420 2.40e-135 479
S208_335:pb_10cov_001M-002M s_6114:COG0087 85.065 616 32 55 809 1406 3 576 5.90e-164 573
S216_4030:pb_10cov_001M-002M s_401:COG0088 82.232 681 40 73 1154 1818 1 616 2.33e-145 512
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 2.27e-150 531
NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.000 130 0 0 161 290 1 130 8.87e-65 241
S52_804:pb_10cov_000M-001M s_7931:COG0088 86.025 644 23 62 1 618 195 797 1.42e-180 628
S190_1420:pb_10cov_001M-002M s_8639:COG0088 83.639 709 39 72 2762 3452 3 652 3.42e-170 595
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S188_1416:pb_10cov_001M-002M s_15600:COG0094 86.311 599 15 63 2224 2801 2 554 2.46e-168 590
S1170_2543:pb_10cov_011M-012M s_8667:COG0094 86.038 573 24 53 606 1162 3 535 1.06e-160 564

Now, without giving BLAST a limit on the number of alignments, I wanted to show the equivalent ten lines of output. The most concise way I could come up with was to split the ten-sequence example.fasta into ten single-sequence FASTA files using a short Python script, and then run BLAST ten times, once for each query, with the Unix head command to show only the top hit:
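For completeness, here is a minimal sketch of such a splitting script (a simplification written for this post with my own function names - the actual script I used is in the GitHub repository linked further down):

```python
def parse_fasta(text):
    """Yield (header, sequence) tuples from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def split_fasta(in_name, stem="query"):
    """Write each record of a multi-FASTA file to its own
    <stem>_<n>.fasta file, numbered from one; returns the count."""
    with open(in_name) as handle:
        records = list(parse_fasta(handle.read()))
    for i, (header, seq) in enumerate(records, start=1):
        with open("%s_%d.fasta" % (stem, i), "w") as out:
            out.write(">%s\n%s\n" % (header, seq))
    return len(records)
```

Running split_fasta("example.fasta") should give query_1.fasta through query_10.fasta, matching the filenames used in the shell loops below.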

$ for i in {1..10}; do blastn -query query_$i.fasta -db db.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_1013:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_9883:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_9886:COG0090 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_14626:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_16963:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_10253:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_17130:COG0201 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_1966:COG0201 82.612 1409 94 133 4088 5459 3 1297 0.0 1105

Indeed, all ten queries have a different top hit with and without the -max_target_seqs 1 setting.

I have included these single-query FASTA files in my BLAST test case repository on GitHub (specifically this commit added the 10 single-query files and the script).

However, without applying the head command, you see lots of tied hits with equal bitscores and e-values - which are exactly the circumstances where the database order is documented to come into play (see part two). This is clear if we apply the same command to the two randomised-order versions of the same database which the authors provide:
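One quick way to see those ties is to count how many hits share the first-reported bitscore. This is a hypothetical helper of my own (not part of BLAST+), demonstrated here on three hard-coded fake hit lines; in practice you would pipe real `blastn ... -outfmt 6` output into it:

```shell
# Count how many tabular hits share the top (first-reported) bitscore,
# which is column 12 of -outfmt 6 output.
count_ties () {
    awk 'NR==1 {top=$12} $12==top {n++} END {print n+0}'
}

# Demo on three fake hits: two tied at bitscore 200, one at 150.
{ printf 'q\ts1\t99\t100\t0\t0\t1\t100\t1\t100\t1e-50\t200\n'
  printf 'q\ts2\t99\t100\t0\t0\t1\t100\t1\t100\t1e-50\t200\n'
  printf 'q\ts3\t90\t100\t0\t0\t1\t100\t1\t100\t1e-40\t150\n'
} | count_ties   # prints 2
```

For example, `blastn -query query_1.fasta -db db.fasta -outfmt 6 | count_ties` would report how many database entries tie for the top spot, and so are candidates for the database-order tie break.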

$ for i in {1..10}; do blastn -query query_$i.fasta -db db_rand_1.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_15069:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_12233:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_5441:COG0090 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_16540:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_4179:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_606:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_563:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_14987:COG0201 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_474:COG0201 82.612 1409 94 133 4088 5459 3 1297 0.0 1105

And:

$ for i in {1..10}; do blastn -query query_$i.fasta -db db_rand_2.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_6935:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_15110:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_16192:COG0090 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_8701:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_2111:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_6587:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_563:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_17130:COG0201 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_1966:COG0201 82.612 1409 94 133 4088 5459 3 1297 0.0 1105

Everything looks consistent by eye, except the second column - the name of the sequence matched. Once you know this database is full of duplicates (see later in this post), this is not at all surprising.

However, now for a rather different set of results - here are searches with -max_target_seqs 1 on the three duplicate-filled databases, which differ only in their entry order:

$ blastn -query example.fasta -db db.fasta -outfmt 6 -max_target_seqs 1 
S50_7242:pb_10cov_000M-001M s_15833:COG0185 87.583 451 8 42 2352 2785 1 420 2.40e-135 479
S208_335:pb_10cov_001M-002M s_6114:COG0087 85.065 616 32 55 809 1406 3 576 5.90e-164 573
S216_4030:pb_10cov_001M-002M s_401:COG0088 82.232 681 40 73 1154 1818 1 616 2.33e-145 512
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 2.27e-150 531
NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.000 130 0 0 161 290 1 130 8.87e-65 241
S52_804:pb_10cov_000M-001M s_7931:COG0088 86.025 644 23 62 1 618 195 797 1.42e-180 628
S190_1420:pb_10cov_001M-002M s_8639:COG0088 83.639 709 39 72 2762 3452 3 652 3.42e-170 595
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S188_1416:pb_10cov_001M-002M s_15600:COG0094 86.311 599 15 63 2224 2801 2 554 2.46e-168 590
S1170_2543:pb_10cov_011M-012M s_8667:COG0094 86.038 573 24 53 606 1162 3 535 1.06e-160 564

$ blastn -query example.fasta -db db_rand_1.fasta -outfmt 6 -max_target_seqs 1 
S50_7242:pb_10cov_000M-001M s_2717:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_8841:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_16711:COG0088 82.232 681 40 73 1154 1818 1 616 2.33e-145 512
S2153_228:pb_10cov_022M-023M s_10644:COG0088 84.476 715 33 71 9530 10227 5 658 0.0 634
NC_006448.1-15016:454_5cov_045M-050M s_16540:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_14721:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_5615:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_10253:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_8257:COG0201 85.126 1432 44 143 5687 7073 1 1308 0.0 1308
S1170_2543:pb_10cov_011M-012M s_16228:COG0201 82.553 1410 93 134 4088 5459 3 1297 0.0 1099

$ blastn -query example.fasta -db db_rand_2.fasta -outfmt 6 -max_target_seqs 1 
S50_7242:pb_10cov_000M-001M s_539:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_8710:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_11826:COG0090 81.678 906 55 101 2118 2995 3 825 0.0 651
S2153_228:pb_10cov_022M-023M s_16088:COG0092 82.500 880 46 98 12227 13084 1 794 0.0 673
NC_006448.1-15016:454_5cov_045M-050M s_8701:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_301:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_606:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_8271:COG0201 84.039 1347 87 118 7042 8359 14 1261 0.0 1179
S188_1416:pb_10cov_001M-002M s_16648:COG0201 85.126 1432 44 143 5687 7073 1 1308 0.0 1308
S1170_2543:pb_10cov_011M-012M s_2195:COG0201 82.624 1410 92 134 4088 5459 3 1297 0.0 1105

The eye is drawn to the e-value differences in column 11, but on closer inspection the results are seemingly arbitrary, with minimal agreement with each other or with the search above without the -max_target_seqs limit. Given this rather pathological configuration (a database full of duplicates), some of the bold claims in Shah et al. (2018) make more sense.

De-duplicating the test database

I checked, and found this database has a lot of duplicated sequences - 404 unique sequences present once, and 193 further unique sequences which were duplicated, meaning an equivalent non-redundant database has only 597 unique sequences. I had some existing code for this, which I reworked into a Python script to de-duplicate a FASTA file and make it non-redundant, make_nr.py (on GitHub, I used v0.0.1 as of this commit):
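For readers who just want the idea rather than the script itself, here is a minimal sketch of the approach (my simplification for this post, not the actual make_nr.py): group records by exact sequence, keep unique records as they are, and merge duplicated records into one representative whose identifier is the semicolon-joined list of the originals - which is how match names like s_1013:COG0088;... arise in the de-duplicated results later in this post.

```python
def make_nr(records):
    """Collapse records sharing an identical sequence.

    records is an iterable of (identifier, sequence) tuples in file
    order. Returns (unique, representatives): the records whose
    sequence appears exactly once, and one representative per
    duplicated sequence whose new identifier is the semicolon-joined
    list of the original identifiers.
    """
    by_seq = {}  # sequence -> identifiers, in order seen (Python 3.7+ dicts keep order)
    for name, seq in records:
        by_seq.setdefault(seq, []).append(name)
    unique = [(names[0], seq) for seq, names in by_seq.items() if len(names) == 1]
    reps = [(";".join(names), seq) for seq, names in by_seq.items() if len(names) > 1]
    return unique, reps
```

Applied to the 1991-entry db.fasta, this grouping should reproduce the 404 unique records and 193 representatives reported by make_nr.py below.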

$ ./make_nr.py -a -o dedup.fasta db.fasta
404 unique entries; removed 1394 duplicates leaving 193 representative records
$ ./make_nr.py -a -o dedup_rand_1.fasta db_rand_1.fasta
404 unique entries; removed 1394 duplicates leaving 193 representative records
$ ./make_nr.py -a -o dedup_rand_2.fasta db_rand_2.fasta
404 unique entries; removed 1394 duplicates leaving 193 representative records
$ makeblastdb -dbtype nucl -in dedup.fasta
Building a new DB, current time: 11/08/2018 23:39:42
New DB name:   /.../dedup.fasta
New DB title:  dedup.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 597 sequences in 0.029556 seconds.
$ makeblastdb -dbtype nucl -in dedup_rand_1.fasta
...
$ makeblastdb -dbtype nucl -in dedup_rand_2.fasta
...

Now repeating those searches on the non-redundant version of the test database, first with the -max_target_seqs 1 limit:

$ blastn -query example.fasta -db dedup.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_1013:COG0088;... 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_12233:COG0088;... 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_10249:COG0090;... 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S188_1416:pb_10cov_001M-002M s_14987:COG0201;... 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105

And without the limit:

$ for i in {1..10}; do blastn -query query_$i.fasta -db dedup.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_1013:COG0088;... 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_12233:COG0088;... 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_11713:COG0090;... 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
S52_804:pb_10cov_000M-001M s_11999:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_10249:COG0090;... 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_10253:COG0201;... 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_14987:COG0201;... 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105

Note that for display here I have abbreviated the match names of the long de-duplicated entries with dots, and the lines in red or orange are those which differ with the limit applied.

Making the test database non-redundant fixes 5 of these 10 problematic searches (those left in green). Queries 6 and 10 (in orange) have an equally good result (same bitscore and evalue, consistent with a tie break by database order at the end of the search), but query numbers 3, 4, and 8 (in red) show a more substantial change (and I discuss why this could happen later).

This suggests the NCBI could help reduce the chances of hitting the December 2015 bug with custom databases by warning when they contain duplicate sequences, or better still, by supporting de-duplication as an option in makeblastdb.

Continuing my previous marathon analogy with a half-way check point, what I think we've done here is remove most of the fast starters which would otherwise fill out the N_i candidate limit. This allows more of the slow-starting ultimate winners into the final stage, where they can overtake.
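That hand waving can be made concrete with a deliberately crude toy model (my own invention for this post, not how BLAST is actually implemented): candidates are culled on a preliminary score, and only the survivors are re-ranked on their final score.

```python
def cull_then_rescore(candidates, limit):
    """Toy two-stage search (NOT BLAST's real algorithm): keep only
    the `limit` candidates with the best preliminary score, then
    return the name of the survivor with the best final score."""
    survivors = sorted(candidates, key=lambda c: c["prelim"], reverse=True)[:limit]
    return max(survivors, key=lambda c: c["final"])["name"]


# Five identical "fast starters" fill the cull, so the slow-starting
# ultimate winner never reaches the final stage...
dups = [{"name": "dup_%d" % i, "prelim": 90, "final": 90} for i in range(5)]
best = {"name": "best", "prelim": 80, "final": 99}
crowded = cull_then_rescore(dups + [best], limit=5)

# ...but after de-duplication there is room for it to overtake:
dedup = cull_then_rescore([dups[0], best], limit=5)
```

In the first call one of the duplicates wins, while after de-duplication the slow starter survives the cull and overtakes on its final score.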

To assist anyone wanting to explore this further, I have included these single-query FASTA files and the deduplicated FASTA files used in my BLAST test case repository on GitHub.

Database Order on a Non-Redundant Database

Now that we've de-duplicated the database, what is the effect of randomising the database order? Taking the two re-orderings provided, deduplicating them while preserving the order, and making those into databases, we can compare that too (code shown above; files in this repository).

It shouldn't be a surprise, but it turns out different orderings of the deduplicated database can still give different results.

Without applying a limit on the number of alignments (i.e. the default), sometimes the first reported hits differ, but have the same final evalue and bitscore. This is consistent with the database order tie break happening right at the end (i.e. step D2 in the Outline of the BLAST process document). I saw this with query_5.fasta (NC_006448.1-15016:454_5cov_045M-050M), query_6.fasta (S52_804:pb_10cov_000M-001M) and query_10.fasta (S1170_2543:pb_10cov_011M-012M) as shown in the following bash nested-for loop, where I have omitted the consistent outputs and shortened the deduplicated names with triple dots:

$ for i in {1..10}; \
do echo "Query $i" ; for d in dedup.fasta dedup_rand_1.fasta dedup_rand_2.fasta; \
do blastn -query "query_$i.fasta" -db "$d" -outfmt 6 | head -n 1; done; \
echo; done
Query 1
...

Query 2
...

Query 3
...

Query 4
...

Query 5
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243

Query 6
S52_804:pb_10cov_000M-001M s_11999:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798

Query 7
...

Query 8
...

Query 9
...

Query 10
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105

No real surprises there, but what about with -max_target_seqs 1 active? In the following I have again omitted those outputs which are consistent:

$ for i in {1..10}; \
do echo "Query $i"; for d in dedup.fasta dedup_rand_1.fasta dedup_rand_2.fasta;\
do blastn -query "query_$i.fasta" -db "$d" -outfmt 6 -max_target_seqs 1; done; \
echo ; done
Query 1
...

Query 2
...

Query 3
S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518
S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518
S216_4030:pb_10cov_001M-002M s_11713:COG0090;... 82.671 906 46 101 2118 2995 3 825 0.0 701

Query 4
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531
S2153_228:pb_10cov_022M-023M s_10644:COG0088;... 84.476 715 33 71 9530 10227 5 658 0.0 634
S2153_228:pb_10cov_022M-023M s_16088:COG0092;... 82.500 880 46 98 12227 13084 1 794 0.0 673

Query 5
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243

Query 6
...

Query 7
...

Query 8
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S232_2558:pb_10cov_001M-002M s_10253:COG0201;... 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S232_2558:pb_10cov_001M-002M s_8271:COG0201 84.039 1347 87 118 7042 8359 14 1261 0.0 1179

Query 9
...

Query 10
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105

Again, the alternative results in queries 5 and 10 have the same bitscore and evalue, suggesting that the database order is being used as a tie breaker at the final stage. Query 6 this time gives the same result with all three databases; the equally good hit sometimes seen without the limit appears to have lost out here. However, in the other cases the top hit changes dramatically with the database order - for example query_4.fasta (S2153_228:pb_10cov_022M-023M) gave three different answers with the three different database orderings:

$ blastn -query query_4.fasta -db dedup.fasta -outfmt 6 -max_target_seqs 1
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531
$ blastn -query query_4.fasta -db dedup_rand_1.fasta -outfmt 6 -max_target_seqs 1
S2153_228:pb_10cov_022M-023M s_10644:COG0088;s_1307:COG0088;s_3859:COG0088;s_6993:COG0088;s_9865:COG0088 84.476 715 33 71 9530 10227 5 658 0.0 634
$ blastn -query query_4.fasta -db dedup_rand_2.fasta -outfmt 6 -max_target_seqs 1
S2153_228:pb_10cov_022M-023M s_16088:COG0092;s_17102:COG0092 82.500 880 46 98 12227 13084 1 794 0.0 673

Referring again to the Outline of the BLAST process document, database order also comes into play as a tie break at stage C4, which could explain these results (the scores at stages C4 and D2 can be quite different). My hypothesis is that these alternative top matches must have had tied scores at step C4, but some were dropped due to the internal limit on the number of alignments. Confirming this needs more work.

Initial Conclusion

If you are using the alignment limits with a custom BLAST database, de-duplicate it to ensure it is non-redundant. This is a good idea anyway for speed, and will force you to think about interpreting ties, which is also good. However, the immediate practical benefit here is that it reduces the chances of the desired best hit being crowded out at the N_i internal cull by needlessly duplicated candidates.
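As a quick sanity check before building a database, this hypothetical one-liner of my own (not part of the BLAST+ suite) reports how many distinct sequences occur more than once in a FASTA file, by collapsing each record onto a single line and counting repeats:

```shell
# Hypothetical helper: count distinct duplicated sequences in a FASTA
# file, ignoring the headers and any line wrapping of the sequences.
fasta_dup_count () {
    awk '/^>/ {if (seq) print seq; seq=""} !/^>/ {seq=seq $0} END {if (seq) print seq}' "$1" \
        | sort | uniq -d | wc -l
}
```

Anything other than zero suggests running a de-duplication step (such as make_nr.py) before makeblastdb.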

My next post looks at what the N_i internal candidate alignment limit actually is in this context.

To return to the Shah et al. 2018 letter, I am still unhappy with the text as it paints a very negative picture of what appears to be a problem only for databases of highly similar sequences, and especially so for databases with duplicated sequences. However, further work is needed to determine just how often this occurs - I expect that it is a very rare occurrence with the main NCBI provided BLAST databases, which likely see most usage.

Foot Notes

My thanks to John Walshaw for proof reading an earlier draft of this post, and catching one mistake. Any remaining or new mistakes are my own.

Update (14 November 2018)

Fixed a typo in the third paragraph (December 2015, not Dec 2018), spotted by David Eccles.

Update (8 January 2019)


In early December I published the post "BLAST tie breaking by database order", which I think means tie breaking alone cannot explain the results shown for queries 6 and 10 (marked in orange above). Meanwhile, the results for queries 3, 4 and 8 (in red above) had also yet to be fully explained.

I have today published a follow up post after the BLAST team's formal reply and BLAST+ 2.8.1 were published in late December, "An overly aggressive optimization in BLASTN and MegaBLAST". This update fixes these oddities, which turn out to have been a complex interaction of multiple issues.
