Consider this output using a 20,359 protein FASTA subject file, and a query file with two EST sequences. The problem existed on BLAST 2.2.25+ (tested on Mac OS X) but here I'm using BLAST 2.2.26+ (the current release) on Linux. First, let's make a database:
$ makeblastdb -dbtype prot -in MincV1A1.fasta -out MincV1A1
Building a new DB, current time: 05/18/2012 10:24:28
New DB name: MincV1A1
New DB title: MincV1A1.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 20359 sequences in 1.55252 seconds.
If we search against the database with a moderate cut off, we find three matches for query one, and none for query two:
$ blastx -query test_est.fasta -db MincV1A1 -outfmt 6 -evalue 1e-4
Gpa_EST_02_04___C05_022 prot:Minc09151 63.16 76 28 0 3 230 198 273 1e-27 103
Gpa_EST_02_04___C05_022 prot:Minc01375 47.69 65 33 1 36 230 210 273 3e-13 63.5
Gpa_EST_02_04___C05_022 prot:Minc04363 47.69 65 33 1 36 230 759 822 2e-12 63.2
Now the same search using the subject FASTA file, instead of the database:
$ blastx -query test_est.fasta -subject MincV1A1.fasta -outfmt 6 -evalue 1e-5
Gpa_EST_02_04___C05_022 prot:Minc01375 47.69 65 33 1 36 230 210 273 1e-17 63.5
Gpa_EST_02_04___C05_022 prot:Minc04363 47.69 65 33 1 36 230 759 822 3e-16 63.2
Gpa_EST_02_04___C05_022 prot:Minc09151 63.16 76 28 0 3 230 198 273 4e-32 103
gi|54546214 prot:Minc01918 65.00 20 7 0 76 135 729 748 1e-07 33.1
gi|54546214 prot:Minc01918 84.62 13 2 0 170 208 760 772 1e-07 26.6
Can you spot the differences? My query file has two EST sequences in it, and both times BLASTX finds three hits for Gpa_EST_02_04___C05_022 but with different e-values and thus a different order, but it now returns hits for the second query gi|54546214. Using the database BLASTX does find them, but with much weaker (larger) e-values:
$ blastx -query test_est.fasta -db MincV1A1 -outfmt 6 -evalue 1e-3
Gpa_EST_02_04___C05_022 prot:Minc09151 63.16 76 28 0 3 230 198 273 1e-27 103
Gpa_EST_02_04___C05_022 prot:Minc01375 47.69 65 33 1 36 230 210 273 3e-13 63.5
Gpa_EST_02_04___C05_022 prot:Minc04363 47.69 65 33 1 36 230 759 822 2e-12 63.2
gi|54546214 prot:Minc01918 65.00 20 7 0 76 135 729 748 5e-04 33.1
gi|54546214 prot:Minc01918 84.62 13 2 0 170 208 760 772 5e-04 26.6
Notice the alignment positions and bit-scores are identical. This suggests a problem with the database size or composition in calculating the expectation value.
Using the plain text output throws a little light on this - effective search space used with a subject file is suspiciously small. If we switch to looking at the XML output, I have a hunch for how this happens. From the XML output, we get this for query one:
<Statistics>
<Statistics_db-num>20359</Statistics_db-num>
<Statistics_db-len>7198396</Statistics_db-len>
<Statistics_hsp-len>83</Statistics_hsp-len>
<Statistics_eff-space>198309564</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
And this for query two:
<Statistics>
<Statistics_db-num>20359</Statistics_db-num>
<Statistics_db-len>7198396</Statistics_db-len>
<Statistics_hsp-len>93</Statistics_hsp-len>
<Statistics_eff-space>795751350</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
Whereas with the subject file, we get 20359 entries like this from query one (one for each entry in the subject file):
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>14</Statistics_hsp-len>
<Statistics_eff-space>13755</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
and similarly 20359 entries like this for query two:
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>20</Statistics_hsp-len>
<Statistics_eff-space>27875</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
It would appear that using a FASTA subject file, BLAST+ is effectively doing a series of pairwise alignments, and completely ignores the effective database size when calculating the e-values. By using this inappropriately small search space, you get an overly significant (small) e-value.
What BLAST+ probably should do is scan the subject FASTA file once, and calculate the total length (i.e. the database length). But there is probably more to this, as trying -dbsize 7198396 on the command line doesn't completely solve this (although it helps - the evalues still don't quite match those from the database search).
This discrepancy between searching the FASTA file and searching a BLAST database made from the file is extremely frustrating. Coupled with the BLAST+ memory bug mentioned last time, this really makes searching a FASTA file with BLAST+ a potentially bad idea.
Update (29 May 2012)
I got an email back from the NCBI, "This is expected behavior, at least for us, when using "-subject." In such cases, the effective search space is calculated for each query/subject pair, so it will always be smaller than the effective search space for a database of the same subject sequences (and, hence, the evalues will be smaller)."
I guess I misunderstood the intended usage of the -subject switch, but I doubt I am the only one to do so. I don't think I'll be using this option again...
spelling error in title: BLAST+ "ingoring "
ReplyDelete