2018-12-07

BLAST tie breaking by database order

My November blog posts discussing the BLAST+ tools behaviour with an alignment limit setting (see What BLAST's max-target-sequences doesn't do, and the links from it), touched on database order, which comes into play as a tie breaker.

Well, how is the BLAST database order defined? It turns out to be the reverse of the FASTA file used with makeblastdb, or in other words: Last-in, First-out (LIFO).

Making simple test cases

The idea here is to make a database full of almost identical sequences, by adding a tiny barcode to the end. For the nucleotide example, I've used a 3bp barcode using all the combinations of A, C, G and T giving 64 unique entries. The point here is when querying with the original sequence, these all give the same perfect alignment.

My protein example is very similar - although because blastp uses composition based statistics (CBS) by default, I want all the sequences to have the same composition. Here my barcodes are all the possible permutations of five distinct amino acids, giving 120 barcodes. This was picked to give a reasonable sized database, potentially useful if exploring the maximum alignment settings due to how the internal limits are set by default on protein versus nucleotide searches (my previous blog post).

Note I am using BLAST+ 2.7.1 on Linux here.

Nucleotide test case

Here is my Python script make_sweetpea.py which generates a FASTA nucleotide file:

#!/usr/bin/env python
"""Generate a FASTA file of 64 near identical nucleotide sequences.

Outputs FASTA records consisting of the sweet-pea sequence U78617.1
(from the Lathyrus odoratus phytochrome A (PHYA) gene), with a 3bp
unique barcode append, giving 64 near identical sequences.
"""

import itertools

template = """>pea%i based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTT%s"""

for i, barcode in enumerate(itertools.product('ACGT', repeat=3)):
    print(template % (i + 1, ''.join(barcode)))

Then make this into a nucleotide BLAST database with makeblastdb as usual,

$ python make_sweetpea.py > sweetpea_64.fasta
$ makeblastdb -dbtype nucl -in sweetpea_64.fasta -out sweetpea_64
Building a new DB, current time: 12/07/2018 10:31:20
New DB name:   /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/sweetpea_64
New DB title:  sweetpea_64.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 64 sequences in 0.00718689 seconds.

Note that the FASTA file looks like this:

>pea1 based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTTAAA
>pea2 based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTTAAC
...
>pea64 based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTTTTT

Then we need the query sequence, which I have in a file named sweetpea.fasta as follows:

>U78617.1 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTT

Protein Test Case

Here is my Python script make_aster.py which generates a FASTA protein file:

#!/usr/bin/env python                                                                                                                                                                                             
"""Generate a FASTA file of 120 near identical protein sequences.                                                                                                                                                 
                                                                                                                                                                                                                  
Outputs FASTA records consisting of the Aster sequence BAA31520.1                                                                                                                                                 
with a 5aa unique barcode appended, giving 120 near identical                                                                                                                                                     
sequences.                                                                                                                                                                                                        
"""

import itertools

template = """>aster%i based on BAA31520.1 from Aster                                                                                                                                                             
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV                                                                                                                                            
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI%s"""

for i, barcode in enumerate(itertools.permutations('AEILV')):
    print(template % (i + 1, ''.join(barcode)))

Then make this into a nucleotide BLAST database with makeblastdb as usual,

$ python make_aster.py > aster_120.fasta
$ makeblastdb -dbtype prot -in aster_120.fasta -out aster_120
Building a new DB, current time: 12/07/2018 11:22:36
New DB name:   /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/aster_120
New DB title:  aster_120.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/aster_120
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 120 sequences in 0.00914717 seconds.

The protein FASTA file looks like this:

>aster1 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANIAEILV
>aster2 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANIAEIVL
...
>aster120 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANIVLIEA

Then we need the query sequence, which I have in a file named aster.fasta as follows:

>BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI

What does this show?

First, let's just run blastn with the defaults other than asking for tabular output:

$ blastn -query sweetpea.fasta -db sweetpea_64 -outfmt 6
U78617.1 pea64 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea63 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea62 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea61 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea60 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea59 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea58 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea57 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea56 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea55 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea54 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea53 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea52 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea51 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea50 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea49 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea48 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea47 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea46 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea45 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea44 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea43 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea42 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea41 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea40 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea39 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea38 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea37 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea36 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea35 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea34 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea33 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea32 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea31 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea30 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea29 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea28 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea27 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea26 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea25 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea24 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea23 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea22 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea21 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea20 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea19 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea18 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea17 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea16 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea15 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea14 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea13 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea12 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea11 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea10 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea9 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea8 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea7 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea6 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea5 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea4 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea3 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea2 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea1 100.000 309 0 0 1 309 1 309 4.34e-166 571

That's long, but you immediately see that the hits are all the same, but listed 64 to 1, which is the reverse order of the FASTA file where they are 1 to 64.

Requesting just one alignment gives what was reported first, entry 64:

$ blastn -query sweetpea.fasta -db sweetpea_64 -outfmt 6 -max_target_seqs 1
U78617.1 pea64 100.000 309 0 0 1 309 1 309 4.34e-166 571

i.e. When there are 64 equally good matches, and we ask for just one, we do get the "first" entry in the database (it just happens that was the last entry in the FASTA file used to build the database).

With blastp the situation is exactly the same (because this database was constructed with all the near-identical sequences having the same amino acid composition, and thus CBS does not complicate the ranking):

$ blastp -query aster.fasta -db aster_120 -outfmt 6 
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster119 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster118 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster117 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster116 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster115 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster114 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster113 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster112 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster111 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster110 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster109 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster108 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster107 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster106 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster105 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster104 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster103 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster102 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster101 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster100 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster99 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster98 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster97 100.000 107 0 0 1 107 1 107 2.73e-73 205
...
BAA31520.1 aster3 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster2 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster1 100.000 107 0 0 1 107 1 107 2.73e-73 205

I abridged the output at the ... line. Asking for just one alignment this we do indeed get entry 120:

$ blastp -query aster.fasta -db aster_120 -outfmt 6 -max_target_seqs 1
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205

i.e. Exactly as with the observation from blastn, the protein database order is used as the tie breaker with the proviso that the database order is the reverse of the FASTA file used to build it.

What about chunked databases?

If you've used the NCBI provided NT or NR databases, you'll know they come in multiple chunks, with a master file nt.nal or nr.pal listing the child-databases which together make up the database. How does the database order work here? This turns out to be easy to check via the makeblastdb -max_file_sz setting.

$ makeblastdb -dbtype prot -in aster_120.fasta -out aster_120_chunked -max_file_sz 1200B
Building a new DB, current time: 12/07/2018 11:54:54
New DB name:   /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/aster_120_chunked
New DB title:  aster_120.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 0B
Adding sequences from FASTA; added 120 sequences in 0.0767181 seconds.

I found that value of 1200 bytes by trial and error, but it results in twelve chunks:

$ cat aster_120_chunked.pal
#
# Alias file created: Dec 7, 2018  11:54 AM
#
TITLE aster_120.fasta
DBLIST aster_120_chunked.00 aster_120_chunked.01 aster_120_chunked.02 aster_120_chunked.03 aster_120_chunked.04 aster_120_chunked.05 aster_120_chunked.06 aster_120_chunked.07 aster_120_chunked.08 aster_120_chunked.09 aster_120_chunked.10 aster_120_chunked.11

We can use the blastdbcmd tool to see which records are in which - but it shouldn't surprise you that the first ten records are in aster_120_chunked.00.p* while the last ten records are in aster_120_chunked.11.p* - the chunks are created as needed while looping though the input FASTA file.

$ blastdbcmd -entry all -db aster_120.11
>aster111 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIVMTFGLVYTVY
ATAIDPKKGSLGTIAPIAIGFIVGANIVIEAL
...
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIVMTFGLVYTVY
ATAIDPKKGSLGTIAPIAIGFIVGANIVLIEA


And how does this behave in terms of tie-breaking? Happily, exactly the same:

$ blastp -query aster.fasta -db aster_120_chunked -outfmt 6
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster119 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster118 100.000 107 0 0 1 107 1 107 2.73e-73 205
...
BAA31520.1 aster3 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster2 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster1 100.000 107 0 0 1 107 1 107 2.73e-73 205

and:

$ blastp -query aster.fasta -db aster_120_chunked -outfmt 6 -max_target_seqs 1
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205

We can do the same trick with the nucleotide test case. This command makes eight equal chunks, each with eight sequences:

$ makeblastdb -dbtype nucl -in sweetpea_64.fasta -out sweetpea_64_chunked -max_file_sz 1000B

Again, the search results and tie breaking behave the same for the single database versus the chunked nucleotide database.

Conclusion

The BLAST database order for both nucleotide and protein databases is the reverse of the FASTA file used to build the database.

Discussion

BLAST has to use something as a tie breaker, and database order is deterministic and fast - and indirectly this is under the user's full control. The NCBI could implement tie breaking using the sequence or its identifier as a tie breaker, but not only would the string comparison be a little slower, this seems more likely to introduce a subtle bias.

However, this makes it clear that if you ignore the ties and only look at the first result, and your input database FASTA file has a meaningful order, that will be introducing a bias to your BLAST results.

For example, you might update the database FASTA file by appending new sequences to it - meaning once it is turned into a BLAST database, in a tie break the most recently added sequences will be preferred over the older sequences. If you only look at the top result, in a tie-break your results will change as the database is updated. From a results stability point of view you might prefer the older sequence takes priority (in which case, reverse the FASTA file record order when making the database), but really this is a problem waiting to catch you out. You could randomise the FASTA file record order, but if tied best hits are likely, do not just look at the top BLAST hit!

No comments:

Post a Comment