2018-11-02

BLAST max alignment limits repartee - part one

Back in 2015, my blog post "What BLAST's max-target-sequences doesn't do" highlighted what we called a scary BLAST+ -max_target_seqs bug, found and reported by Sujai Kumar. The NCBI BLAST teams took the stance this was a feature not a bug (and as a heuristic search tool, this is an understandable view), but conceded it could be better documented.

Sadly, I don't think there has been much if any clarification in the BLAST+ documentation about the settings limiting the number of alignments returned, and what else they control. The recent letter Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows" did serve the purpose of raising the profile of this issue, but sadly seems confused and misleading in several places. Most regrettably they did not provide a reproducible test case, so it is possible they found another issue.

This is the first of a planned series of blog posts which seeks to clarify the situation, and some of the claims in Shah et al. (2018).

First of all, this issue is not specific to -max_target_seqs (used with the computer readable output format). With human readable output formats, the maximum of -num_descriptions and -num_alignments is used in exactly the same way during the BLAST search.

Re-creating the test case

The most useful bug reports come with steps to reproduce the issue, and that is exactly what Sujai Kumar did originally back in December 2015. His example worked with a tardigrade query sequence and the NCBI NR database of the time, which didn't yet have any tardigrade matches.  The project Sujai was working on was later published as Koutsovoulos et al. (2016) "No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini". Given the nature of checking for horizontal gene transfer, simply looking at the top BLAST hit is unwise. BLAST results should be scrutinised very carefully - which is how Sujai must noticed the taxonomic kingdom of a top result changing between bacteria and eukaryota.

If you run the same search now against the 2018 NR database, there are two new top hits from tardigrade genomes, Hypsibius dujardini and Ramazzottius varieornatus. It turned out to be quite simple to make a new test case by building a mini BLAST database of all the current hits except those two sequences. I've put these files into a git repository which explains how I did this in fine detail:

https://github.com/peterjc/blast_max_target_seqs

Revisiting the test case

Now, to show the problem - let's run a BLASTP search using BLAST+ 2.7.1, with my new minimal database.

$ blastp -version

blastp: 2.7.1+

 Package: blast 2.7.1, build Sep 20 2018 02:20:26

First with the default alignment number limits, tabular output, and looking at the top hits (using the Unix command head to show the first ten lines):

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 | head
nHd.2.3.1.t00019-RA WP_042303394.1 58.678 121 49 1 4 124 93 212 7.54e-46 153 Bacteria
nHd.2.3.1.t00019-RA WP_017775351.1 58.678 121 49 1 4 124 93 212 8.49e-46 153 Bacteria
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRZ17714.1 63.115 122 45 0 1 122 121 242 1.39e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD48812.1 61.983 121 46 0 1 121 419 539 1.44e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KHJ41189.1 61.983 121 46 0 1 121 39 159 1.49e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRX89026.1 63.115 122 45 0 1 122 153 274 1.82e-41 140 Eukaryota
nHd.2.3.1.t00019-RA CDW52156.1 61.983 121 46 0 1 121 97 217 1.90e-41 141 Eukaryota

Notice the top two hits are bacteria, then lots of eukaryota. Now, if you limit this to 100 alignments using -max_target_seqs 100  the two bacterial top hits have gone:

$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 | head
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRZ17714.1 63.115 122 45 0 1 122 121 242 1.39e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD48812.1 61.983 121 46 0 1 121 419 539 1.44e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KHJ41189.1 61.983 121 46 0 1 121 39 159 1.49e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRX89026.1 63.115 122 45 0 1 122 153 274 1.82e-41 140 Eukaryota
nHd.2.3.1.t00019-RA CDW52156.1 61.983 121 46 0 1 121 97 217 1.90e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRZ35475.1 63.115 122 45 0 1 122 105 226 2.77e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89032.1 63.115 122 45 0 1 122 105 226 2.95e-41 140 Eukaryota

Now the first point I want to clear up, is this is not specific to the -max_target_seqs setting (which is used for the computer readable output formats), you get the same problem with the human readable formats as well!

$ blastp -query input.fasta -db older_matches.fasta  -evalue 1e-5 | grep "Query=" -A 15
Query= nHd.2.3.1.t00019-RA

Length=126
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

WP_042303394.1  glycogen/starch/alpha-glucan phosphorylase [Parab...  153     8e-46
WP_017775351.1  glycogen/starch/alpha-glucan phosphorylase [Parab...  153     8e-46
KRX89027.1  Glycogen phosphorylase, partial [Trichinella pseudosp...  140     5e-42
KRX89025.1  Glycogen phosphorylase, liver form [Trichinella pseud...  140     1e-41
KFD69381.1  hypothetical protein M514_10296 [Trichuris suis]          141     1e-41
KRZ17714.1  Glycogen phosphorylase [Trichinella pseudospiralis]       140     1e-41
KFD48812.1  hypothetical protein M513_10296 [Trichuris suis]          141     1e-41
KHJ41189.1  phosphorylase, glycogen/starch/alpha-glucan family [T...  141     1e-41
KRX89026.1  Glycogen phosphorylase [Trichinella pseudospiralis]       140     2e-41
CDW52156.1  Phosphorylase domain containing protein [Trichuris tr...  141     2e-41

The above is with no limits. Now if we use -num_descriptions=100 -num_alignments=100 to again limit to 100 alignments, we see the two bacteria top hits (WP_042303394.1 and WP_017775351.1) have gone:

$ blastp -query input.fasta -db older_matches.fasta -num_descriptions=100 -num_alignments=100 -evalue 1e-5 | grep "Query=" -A 15
Query= nHd.2.3.1.t00019-RA

Length=126
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

KRX89027.1  Glycogen phosphorylase, partial [Trichinella pseudosp...  140     5e-42
KRX89025.1  Glycogen phosphorylase, liver form [Trichinella pseud...  140     1e-41
KFD69381.1  hypothetical protein M514_10296 [Trichuris suis]          141     1e-41
KRZ17714.1  Glycogen phosphorylase [Trichinella pseudospiralis]       140     1e-41
KFD48812.1  hypothetical protein M513_10296 [Trichuris suis]          141     1e-41
KHJ41189.1  phosphorylase, glycogen/starch/alpha-glucan family [T...  141     1e-41
KRX89026.1  Glycogen phosphorylase [Trichinella pseudospiralis]       140     2e-41
CDW52156.1  Phosphorylase domain containing protein [Trichuris tr...  141     2e-41
KRZ35475.1  Glycogen phosphorylase [Trichinella pseudospiralis]       140     3e-41
KRX89032.1  Glycogen phosphorylase [Trichinella pseudospiralis]       140     3e-41

For conciseness, I have used the Unix command grep to highlight the top of the summary table, but the same applies to the full pairwise alignments which would be shown later. The test files are public, so you can verify this if you wish to.

This is not specific to -max_target_seqs!

What is important here is that either -max_target_seqs, or the maximum of -num_descriptions and -num_alignments, is used during the search to set a heuristic limit on the number of alignments to consider. You can verify this by tracing the arguments through the C++ source code for BLAST+.

In this corner case, as the NCBI BLAST team explained back in 2015, the two bacterial results which would have become the top results can be excluded during an early part of the search.

I think the BLAST team should not have introduced -max_target_seqs for the computer readable output formats, rather they could have simply continued to use -num_alignments. This was #5 on my BLAST+ Christmas Wish List (2014).

However, I find the command line documentation here is not helpful here by listing -max_target_seqs under "Restrict search or results" (which is accurate), while -num_descriptions and -num_alignments are under "Formatting options" (which is misleading):

$ blastp -help
...
DESCRIPTION
   Protein-Protein BLAST 2.7.1+
...
 *** Formatting options
...

 -num_descriptions =0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with:  max_target_seqs
 -num_alignments =0>
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with:  max_target_seqs
...
 *** Restrict search or results
...

 -max_target_seqs =1>   Maximum number of aligned sequences to keep 
   Not applicable for outfmt <= 4
   Default = `500'
    * Incompatible with:  num_descriptions, num_alignments
...

Terry Jones made this same point in his August 2017 comment on the origin report, where he had another reproducible example using some local databases.

Watch this space - further posts are planned.

Update (2 November 2018)

In BLAST max alignment limits repartee - part two, I focus on the question of if database order is important (as claimed in Shah et al. 2018), and how exactly the internal alignment number limit works.

Update (4 November 2018)

Corrected spelling of Nidhi Shah's surname, I had sometimes written Shar et al. (2018) rather than Shah et al. 2018. My deep apologies. See also the comments on part two, and the link to his test case.

Update (13 November 2018)

I've published BLAST max alignment limits - part three looking at the test case from Nidhi Shah, and BLAST max alignment limits - part four looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).

Update (7 December 2018)

I've published a fifth post, BLAST tie breaking by database order, looking at how the BLAST database order is defined in comparison to the FASTA file used to build a database.

4 comments:

  1. The link "BLAST max alignment limits repartee - part two" leads to part one

    ReplyDelete
  2. So why are the two Bacterial hits eliminated? Do we know?

    ReplyDelete
    Replies
    1. Hi Mick. Part Two ought to answer that, but in essence when the initial cull is too strict, the two best bacteria get drowned out by initially more promising eukaryota hits. It would be possible to go into this example in more depth, e.g. inserting debugging into the BLAST+ codebase to dump out the intermediate candidate list, but there are other things I'd like to explore more.

      Delete