2012-05-18

BLAST+ ignoring search space size for e-values

Sometimes using BLAST is frustrating. Today I'm writing about it returning different expectation values, and therefore different answers, depending on if you use a FASTA subject file, or a database made from that file. I noticed something funny a while ago, but didn't immediately investigate and report it (which I regret). In the continued absence of an official public bug tracker for NCBI BLAST, I'm again going to blog about it here, so people can find it via Google, and email this to the NCBI.


Consider this output using a 20,359 protein FASTA subject file, and a query file with two EST sequences. The problem existed on BLAST 2.2.25+ (tested on Mac OS X) but here I'm using BLAST 2.2.26+ (the current release) on Linux. First, let's make a database:

$ makeblastdb -dbtype prot -in MincV1A1.fasta -out MincV1A1
Building a new DB, current time: 05/18/2012 10:24:28
New DB name:   MincV1A1
New DB title:  MincV1A1.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 20359 sequences in 1.55252 seconds.

If we search against the database with a moderate cut off, we find three matches for query one, and none for query two:

$ blastx -query test_est.fasta -db MincV1A1 -outfmt 6 -evalue 1e-4
Gpa_EST_02_04___C05_022 prot:Minc09151 63.16 76 28 0  3 230 198 273 1e-27  103
Gpa_EST_02_04___C05_022 prot:Minc01375 47.69 65 33 1 36 230 210 273 3e-13 63.5
Gpa_EST_02_04___C05_022 prot:Minc04363 47.69 65 33 1 36 230 759 822 2e-12 63.2

Now the same search using the subject FASTA file, instead of the database:

$ blastx -query test_est.fasta -subject MincV1A1.fasta -outfmt 6 -evalue 1e-5
Gpa_EST_02_04___C05_022 prot:Minc01375 47.69 65 33 1  36 230 210 273 1e-17 63.5
Gpa_EST_02_04___C05_022 prot:Minc04363 47.69 65 33 1  36 230 759 822 3e-16 63.2
Gpa_EST_02_04___C05_022 prot:Minc09151 63.16 76 28 0   3 230 198 273 4e-32  103
gi|54546214             prot:Minc01918 65.00 20  7 0  76 135 729 748 1e-07 33.1
gi|54546214             prot:Minc01918 84.62 13  2 0 170 208 760 772 1e-07 26.6

Can you spot the differences? My query file has two EST sequences in it, and both times BLASTX finds three hits for Gpa_EST_02_04___C05_022 but with different e-values and thus a different order, but it now returns hits for the second query gi|54546214. Using the database BLASTX does find them, but with much weaker (larger) e-values:

$ blastx -query test_est.fasta -db MincV1A1 -outfmt 6 -evalue 1e-3
Gpa_EST_02_04___C05_022 prot:Minc09151 63.16 76 28 0   3 230 198 273 1e-27  103
Gpa_EST_02_04___C05_022 prot:Minc01375 47.69 65 33 1  36 230 210 273 3e-13 63.5
Gpa_EST_02_04___C05_022 prot:Minc04363 47.69 65 33 1  36 230 759 822 2e-12 63.2
gi|54546214             prot:Minc01918 65.00 20  7 0  76 135 729 748 5e-04 33.1
gi|54546214             prot:Minc01918 84.62 13  2 0 170 208 760 772 5e-04 26.6

Notice the alignment positions and bit-scores are identical. This suggests a problem with the database size or composition in calculating the expectation value.

Using the plain text output throws a little light on this - effective search space used with a subject file is suspiciously small. If we switch to looking at the XML output, I have a hunch for how this happens. From the XML output, we get this for query one:

        <Statistics>
          <Statistics_db-num>20359</Statistics_db-num>
          <Statistics_db-len>7198396</Statistics_db-len>
          <Statistics_hsp-len>83</Statistics_hsp-len>
          <Statistics_eff-space>198309564</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>


And this for query two:

        <Statistics>
          <Statistics_db-num>20359</Statistics_db-num>
          <Statistics_db-len>7198396</Statistics_db-len>
          <Statistics_hsp-len>93</Statistics_hsp-len>
          <Statistics_eff-space>795751350</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>


Whereas with the subject file, we get 20359 entries like this from query one (one for each entry in the subject file):
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>0</Statistics_db-num>
          <Statistics_db-len>0</Statistics_db-len>
          <Statistics_hsp-len>14</Statistics_hsp-len>
          <Statistics_eff-space>13755</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>


and similarly 20359 entries like this for query two:

        <Statistics>
          <Statistics_db-num>0</Statistics_db-num>
          <Statistics_db-len>0</Statistics_db-len>
          <Statistics_hsp-len>20</Statistics_hsp-len>
          <Statistics_eff-space>27875</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>


It would appear that using a FASTA subject file, BLAST+ is effectively doing a series of pairwise alignments, and completely ignores the effective database size when calculating the e-values. By using this inappropriately small search space, you get an overly significant (small) e-value.

What BLAST+ probably should do is scan the subject FASTA file once, and calculate the total length (i.e. the database length). But there is probably more to this, as trying  -dbsize 7198396 on the command line doesn't completely solve this (although it helps - the evalues still don't quite match those from the database search).

This discrepancy between searching the FASTA file and searching a BLAST database made from the file is extremely frustrating. Coupled with the BLAST+ memory bug mentioned last time, this really makes searching a FASTA file with BLAST+ a potentially bad idea.

Update (29 May 2012)

I got an email back from the NCBI, "This is expected behavior, at least for us, when using "-subject." In such cases, the effective search space is calculated for each query/subject pair, so it will always be smaller than the effective search space for a database of the same subject sequences (and, hence, the evalues will be smaller)."

I guess I misunderstood the intended usage of the -subject switch, but I doubt I am the only one to do so. I don't think I'll be using this option again...

1 comment: