2013-12-24

BLAST+ should keep its BL_ORD_ID identifiers to itself

This is in a sense a continuation of my previous BLAST blog post, My IDs not good enough for NCBI BLAST+. My core complaint is that makeblastdb currently ignores the user's own identifiers and automatically assigns its own identifiers (gnl|BL_ORD_ID|0, gnl|BL_ORD_ID|1, gnl|BL_ORD_ID|2, etc), and that the BLAST+ suite as a whole is inconsistent about hiding these in its output.

Note that one side-effect of BLAST+ ignoring the user identifiers and creating its own is that it can tolerate databases made from FASTA files with accidentally duplicated identifiers, but this only causes great confusion and ambiguity in the downstream analysis. One of the ways I've seen FASTA files be created with accidentally duplicated identifiers is pooling of assemblies where generic names like contig1 (or even the more complex Trinity naming scheme) naturally cause clashes. In situations like this, I think makeblastdb should give an error when attempting to build a BLAST database.

Sample Data

For the tests here, I created a simple sample FASTA file with deliberately duplicated identifiers. In fact I deliberately included a duplicated entry too (in the sense that both entries gene1 have the same sequence). Here the first three entries are from the NC_005816 Yersinia pestis plasmid, while the remaining entries are from the HQ230977 Salmonella enteric plasmid.

$ cat with_dups.fasta
>gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTG
AAAAACTGGATGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGA
ACTCCAGAACAGCATCCAGGCTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
>gene2
ATGAAATTTCATTTTTGTGATCTGAATCACTCTTATAAAAATCAGGAAGGGAAGATTCGCAGCAGAAAAA
CAGCACCGGGTAACATCAGAAAAAAACAGAAAGGAGATAACGTGAGCAAAACAAAATCTGGTCGCCACCG
ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
GACCTCATCCAGAAACACGTTTACACACTGACACAGGCCGACCTGCGCCATCTGGTCAGTGAAATCAGTA
ACGGTGTGGGACAGTCACAGGCCTACGATGCGATTTACCAGGCGAGACGCATTCGTCTCGCCCGTAAATA
CCTGAGCGGAAAAAAACCGGAAGGGGTGGAACCCCGGGAAGGGCAGGAACGGGAAGATTTACCATAA
>gene3 uniquely named
TTGGCTGATTTGAAAAAGCTACAGGTTTACGGACCTGAGTTACCCAGGCCATATGCCGATACCGTGAAAG
GTTCTCGGTACAAAAATATGAAAGAGCTTCGCGTTCAGTTTTCTGGCCGTCCGATAAGAGCCTTTTATGC
GTTCGATCCGATTCGTCGGGCTATCGTTCTTTGTGCAGGAGATAAAAGTAATGATAAGCGGTTTTATGAA
AAACTGGTGCGTATAGCTGAGGATGAATTTACAGCACATCTGAACACACTGGAGAGCAAGTAA
>gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTG
AAAAACTGGATGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGA
ACTCCAGAACAGCATCCAGGCTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
>gene2
ATGAGTGATTTATATAATGTAATATCACGGGCTGTTGAAGCGTCTGGCGCCGATCATTCAATTAATGAAA
AATTGACAAATGTCTTGAAAAGAGAATTAGTTGATTATGTCAGCATTGCGCATCTAAAAACCAAATTGTC
TGTATTATATGAGTTTGAAAAGAATTATTTGCAGCTTATCGCAGAATATAAGGAAGAAATAAAATTTGCT
TCCTCTTTGCAAGAGGATTTACGTAAAGAACGTGCTAAATTCTTTTCTGAGACATTAAAAGAGGTTCATC
AAACTCTAAATGAATCTCAAGTTGATAATGAAGTGGCATCAAAATGGATTAAAGAACTTGTTGGTAGTTA
TACCAAAAGCTTGGATCTAAGCGGAGGCCTTGTTGAAGAGCATACATTAGATACAATTGCTTGTATTCGC
GCTGAGGCTAAATTAAATAAACCATCTATTGAGCCGGGCAATAACTAA

So, now let's try using this with recent versions of BLAST+ ...

Making a BLAST database

Here I'm using a problematic FASTA file containing duplicated identifiers to build a BLAST database:
$ ~/Downloads/ncbi-blast-2.2.28+/bin/makeblastdb -in with_dups.fasta -dbtype nucl -out with_dups


Building a new DB, current time: 12/24/2013 07:47:05
New DB name:   with_dups_v28
New DB title:  with_dups.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 5 sequences in 0.030242 seconds.

$ ls with_dups.n*
with_dups.nhr with_dups.nin with_dups.nsq

I also created the same database using older versions of BLAST+ for comparison,
$ ~/Downloads/ncbi-blast-2.2.27+/bin/makeblastdb -in with_dups.fasta -dbtype nucl -out with_dups_v27


Building a new DB, current time: 12/24/2013 07:47:55
New DB name:   with_dups_v27
New DB title:  with_dups.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 5 sequences in 0.0043211 seconds.

$ ls with_dups_v27*
with_dups_v27.nhr with_dups_v27.nin with_dups_v27.nsq

$ diff with_dups_v27.nhr with_dups.nhr 
$ diff with_dups_v27.nsq with_dups.nsq
$ diff with_dups_v27.nin with_dups.nin

Comparing the two, the *.nhr and *.nsq files are the same, while the *.nin files embed a time stamp - and therefore differ unless you run makeblastdb within the same minute.

So, from BLAST+ 2.2.27 to 2.2.28 nothing seems to have changed on the makeblastdb side. I went back and checked with the first release BLAST+ 2.2.18 and it too gives an identical database.

Examining the BLAST database

It seems that as of BLAST+ 2.2.28, blastdbcmd tries to hide the internal details of the automatically assigned names. For example, attempting to recover the input FASTA file previously showed quite clearly that the original identifiers were in some sense demoted to part of the description (or sequence title) only. Here's an example using BLAST+ 2.2.27:
$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups -entry all
>gnl|BL_ORD_ID|0 gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTGAAAAACTGGA
TGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGAACTCCAGAACAGCATCCAGG
CTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
...

With BLAST+ 2.2.28, the internal implementation details are more hidden and the original FASTA naming is recovered when you ask for all the database entries:
$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry all
>gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTGAAAAACTGGA
TGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGAACTCCAGAACAGCATCCAGG
CTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
...

Furthermore, you can see how older versions of BLAST+ exposed the internal identifiers:
$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups -entry all
-outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"
OID: 0     GI: N/A     ACC: BL_ORD_ID:0     IDENTIFIER: gnl|BL_ORD_ID|0
OID: 1     GI: N/A     ACC: BL_ORD_ID:1     IDENTIFIER: gnl|BL_ORD_ID|1
OID: 2     GI: N/A     ACC: BL_ORD_ID:2     IDENTIFIER: gnl|BL_ORD_ID|2
OID: 3     GI: N/A     ACC: BL_ORD_ID:3     IDENTIFIER: gnl|BL_ORD_ID|3
OID: 4     GI: N/A     ACC: BL_ORD_ID:4     IDENTIFIER: gnl|BL_ORD_ID|4

Compare that to the BLAST+ 2.2.28 where the internal BLAST ordinal identifiers are redacted:
$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry all
-outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"
OID: 0     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available
OID: 1     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available
OID: 2     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available
OID: 3     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available
OID: 4     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available

This is problematic as in order to pull out an individual entry using blastdbcmd your own identifiers are not currently supported, so you need to know the internally assigned identifier - but most of BLAST+ is now trying to hide them from you:
$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry gene3
Error: gene3: OID not found
BLAST query/options error: Entry not found in BLAST database
Please refer to the BLAST+ user manual.
and similarly:
$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups_v27 -entry gene3
Error: gene3: OID not found
BLAST query/options error: Entry not found in BLAST database

In this example, gene3 is an original unambiguous identifier. Pulling this entry from the database requires knowing the automatically assigned identifier, but as of BLAST+ 2.2.28 working out what this is was made harder:
$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups -entry "gnl|BL_ORD_ID|2"
>gnl|BL_ORD_ID|2 gene3 uniquely named
TTGGCTGATTTGAAAAAGCTACAGGTTTACGGACCTGAGTTACCCAGGCCATATGCCGATACCGTGAAAGGTTCTCGGTA
CAAAAATATGAAAGAGCTTCGCGTTCAGTTTTCTGGCCGTCCGATAAGAGCCTTTTATGCGTTCGATCCGATTCGTCGGG
CTATCGTTCTTTGTGCAGGAGATAAAAGTAATGATAAGCGGTTTTATGAAAAACTGGTGCGTATAGCTGAGGATGAATTT
ACAGCACATCTGAACACACTGGAGAGCAAGTAA
$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry "gnl|BL_ORD_ID|2"
>gnl|BL_ORD_ID|2 gene3 uniquely named
TTGGCTGATTTGAAAAAGCTACAGGTTTACGGACCTGAGTTACCCAGGCCATATGCCGATACCGTGAAAGGTTCTCGGTA
CAAAAATATGAAAGAGCTTCGCGTTCAGTTTTCTGGCCGTCCGATAAGAGCCTTTTATGCGTTCGATCCGATTCGTCGGG
CTATCGTTCTTTGTGCAGGAGATAAAAGTAATGATAAGCGGTTTTATGAAAAACTGGTGCGTATAGCTGAGGATGAATTT
ACAGCACATCTGAACACACTGGAGAGCAAGTAA

Also note here BLAST+ 2.2.28 blastdbcmd is being inconsistent about showing the internal names (for individual entries) or hiding them (when all entries are request).

Searching the ambiguous database

Now let's search against the nucleotide database we made from this ambiguous file using blastn. As my query I've taken a chunk from one of the sequences in the database. Let's start with the tabular output - as an aside, I'm piping the query into blastn as stdin, and it gets automatically named as Query_1. This is a handy shortcut for quick tests without bothering to make a query FASTA file.

$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.28+/bin/blastn -db with_dups -outfmt 6
Query_1 gene2 100.00 70 0 0 1 70 141 210 5e-35  130

Notice we've got a single 100% match against "gene2". But which gene2, since there were actually two sequences named this in the original FASTA file used for this database?

Interestingly you get the same from BLAST+ 2.2.23 to 2.2.28, but going back to the first release BLAST+ 2.2.18 through to 2.2.22, it gave the internal identifier "gnl|BL_ORD_ID|1" instead (and a shorter default query name of "1"):
$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.22+/bin/blastn -db with_dups -outfmt 6
1 gnl|BL_ORD_ID|1 100.00 70 0 0 1 70 141 210 5e-35  130

The same change in BLAST+ 2.2.23 also applies to the comma-separated output:
$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.22+/bin/blastn -db with_dups -outfmt 10
1,gnl|BL_ORD_ID|1,100.00,70,0,0,1,70,141,210,5e-35, 130
$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.23+/bin/blastn -db with_dups -outfmt 10
Query_1,gene2,100.00,70,0,0,1,70,141,210,5e-35, 130

Here's an excerpt from the default human readable text output instead,

$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.28+/bin/blastn -db with_dups

...
Query= 
Length=70
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  gene2                                                                130    5e-35


> gene2
Length=417

 Score =  130 bits (70),  Expect = 5e-35
 Identities = 70/70 (100%), Gaps = 0/70 (0%)
 Strand=Plus/Plus

Query  1    ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACG  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  141  ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACG  200

Query  61   GACAGCCCGT  70
            ||||||||||
Sbjct  201  GACAGCCCGT  210
...

Again, we're told "gene2" which is sadly ambiguous. Also oddly unlike the tabular output (and XML output below), the query is unnamed here. The output from BLAST+ 2.2.18 through 2.2.27 looks the same as this output from BLAST+ 2.2.28 in terms of seeing the original identifier ("gene2").

However, if we ask for the XML output, we get to see the internal identifier of the "gene2" BLAST match, "gnl|BL_ORD_ID|1":

$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.28+/bin/blastn -db with_dups -outfmt 5
...
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_query-len>70</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|1</Hit_id>
  <Hit_def>gene2</Hit_def>
  <Hit_accession>1</Hit_accession>
  <Hit_len>417</Hit_len>
  <Hit_hsps>
    <Hsp>
...

Ways forward

First of all, makeblastdb needs to reject duplicated identifiers when building a BLAST database. This should be an error condition, and would prevent the ambiguous situation in the example above (a BLAST match to "gene2", but which "gene2"?).

The wider issue is that despite some cosmetic changes, BLAST+ still essentially ignores the user's own identifiers (unless of course you follow the NCBI pipe-based naming scheme and used -parse_seqids). Without digging into the BLAST+ source code, I can't say if this is feasible or not - but from an end-user perspective, BLAST+ should really just use the given identifiers as they are. This is what I was hoping for in the previous blog post (My IDs not good enough for NCBI BLAST+).

Assuming however that the database file format etc makes using the user provided identifies impractical, then BLAST+ will have to continue to assign its own identifiers as now (e.g. "gnl|BL_ORD_ID|2"). However, I regard these as an internal implementation detail, and think BLAST+ should consistently hide this in its output (making it look like it uses the given identifiers).

This requires several minor changes:
  • Treat duplicated identifiers as an error in makeblastdb (for the reasons explained above)
  • Hide the internal identifiers in BLAST XML output (as always done in the plain text and HTML, and done for the tabular and comma separated output since BLAST+ 2.2.23)
  • Finish hiding the internal identify in blastdbcmd FASTA output (hidden for -entry all in BLAST+ 2.2.28, but not for specific entries)
  • Support user's original identifiers in blastdbcmd for retrieval (at worst this could be done without any changes to the database format with a brute force loop, acceptable for small custom databases)
  • Review other BLAST+ output for any other remaining cases where the internal identifiers are still used.
The take away message is if BLAST+ has to use its own automatically assigned identifiers, it should be consistent about if they should appear in all the BLAST output files, or not? The current situation where they are sometimes shown and sometimes hidden is simply confusing. My view is they are an internal implementation detail which should not be exposed to the end user.

Update (13 January 2014)

Nothing seems to have changed here in the new point release BLAST+ 2.2.29 released today.

4 comments:

  1. I've just hit this issue in development - my company is building a cloud-based platform which uses BLAST. I need to be able to perform very large, distributed BLAST runs, and afterwards take potentially lots of separate sets of output and extract the hits, matching them up to metadata about the targets which my application already knows about. I (naively) assumed that by using the internal sequence IDs from *my* system as the FASTA headers when building the databases, they would be preserved and I could just match them up after the fact. I thought I was going crazy, wondering where in the world all of the "BL_ORD_ID"'s were coming from, and how to turn them off. Why should BLAST insist on telling me the ordinal it assigns internally to each sequence?

    Thanks so much for these posts, and your efforts to communicate with the BLAST team - it's reassuring to know that I'm not the only one frustrated by this.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. Just to add a piece to the puzzle…
    Using 2.2.28 I can create a db using sequences with the form:
    >WI-ST957:277:H7J8MADXX:1:1101:1670:2185_N:0:TCACCA/1
    using the -parse_seqids flag of makeblastdb. However, using 2.2.9 I get this error:
    BLAST Database creation error: Defline lacks a proper ID around line 1

    But if I query the db created by 2.2.28 with:
    blastdbcmd -db blast_db/SA_reads -entry 'WI-ST957:277:H7J8MADXX:1:1101:1670:2185_N:0:TCACCA/1'
    I get back a fasta named like this:
    >lcl|WI-ST957:277:H7J8MADXX:1:1101:1670:2185_N:0:TCACCA/1
    Notice the "lcl|" prepended to my original fasta id.

    So for now, I seem to be stuck with creating dbs in 2.2.28 and living with the unwelcome "lcl|" prepended to my fasta ids.

    Thanks for looking into this whole crazy issue, I'll be keeping my eye out.

    ReplyDelete
    Replies
    1. Oh yes, the magic "local" prefix "lcl|" used under the NCBI naming scheme.

      Delete