2012-10-30

My IDs not good enough for NCBI BLAST+

The blastdbcmd tool in the BLAST+ suite (replacing fastacmd in the C 'legacy' BLAST suite) lets you do a lot of clever things with a BLAST database. As long as you follow the baroque NCBI FASTA naming scheme you can do this with local BLAST databases too. However, if you don't want to bow down to the NCBI naming (e.g. use FASTA files directly from your favourite assembler), then blastdbcmd seems needlessly crippled.

Update (2 April 2013): Some changes in BLAST 2.2.28+ (released yesterday) seem to be intended to address these issues, but there remain problems with this which I intend to expand on later.

Update (20 April 2013): I found a quiet moment this weekend to update this post with the BLAST 2.2.28+ problems I was alluding to. There has been some progress on this issue with the new release, but it is flawed. See below.

To demonstrate the problem (tested using BLAST 2.2.27+), consider this small silly protein FASTA file example:

$ more silly.faa
>example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example2
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDI
SKKAGIFTE
>example3|truncated Note pipe characters but this isn't using NCBI style!
MLNIFNLICIFFNSTLFSSTFLVAKLPEAYAFLNPIVDVMPVIPLFFLLLAFVWQAA


First, here's what happens if I tell BLAST+ that I'm using the NCBI style naming and it can parse the sequence identifiers:

$ ~/downloads/ncbi-blast-2.2.27+/bin/makeblastdb -dbtype prot -in silly.faa -out silly -parse_seqids
...
Error: NCBI C++ Exception:
    "/am/ncbiapdata/release/blast/src/2.2.27/Linux64-Centos-icc/c++/ICC1010-ReleaseMT64--Linux64-Centos-icc/../src/objects/seq/../seqloc/Seq_id.cpp", line 1689: Error: ncbi::objects::CSeq_id::x_Init() - Unsupported ID type example3


Here BLAST+ has quite rightly rejected my FASTA file when I told it to try and parse the sequence identifiers as if they were using the NCBI style. (Update: Error handling improved in BLAST 2.2.29+, see end of post). Clearly we shouldn't try to do that, so:

$ ~/downloads/ncbi-blast-2.2.27+/bin/makeblastdb -dbtype prot -in silly.faa -out silly


Building a new DB, current time: 10/30/2012 16:59:24
New DB name:   silly
New DB title:  silly.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.00222683 seconds.


Now, what does blastdbcmd make of this? For this we can call blastdbcmd with the special argument 'all' as the entry name:

$ blastdbcmd -db silly -entry all
>gnl|BL_ORD_ID|0 example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>gnl|BL_ORD_ID|1 example2
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>gnl|BL_ORD_ID|2 example3|truncated Note pipe characters but this isn't using NCBI style!
MLNIFNLICIFFNSTLFSSTFLVAKLPEAYAFLNPIVDVMPVIPLFFLLLAFVWQAA

As you can see via the default FASTA output, makeblastdb has made up its own NCBI style names for our sequences - numbering with an ordinal from zero upwards. To look at this another way,

$ blastdbcmd -db silly -entry all -outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"
OID: 0     GI: N/A     ACC: BL_ORD_ID:0     IDENTIFIER: gnl|BL_ORD_ID|0
OID: 1     GI: N/A     ACC: BL_ORD_ID:1     IDENTIFIER: gnl|BL_ORD_ID|1
OID: 2     GI: N/A     ACC: BL_ORD_ID:2     IDENTIFIER: gnl|BL_ORD_ID|2

If these BLAST ordinal identifiers (my guess for what BL_ORD_ID stands for) were purely an internal implementation detail, everything would be fine. If you use the BLAST search tools with tabular output you'll never see them (which seems to me a good thing!). However, if you want to actually pull out an individual sequence from the database you need to know the NCBI's made up identifier:

$ blastdbcmd -db silly -entry "gnl|BL_ORD_ID|0" -outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"
OID: 0     GI: N/A     ACC: BL_ORD_ID:0     IDENTIFIER gnl|BL_ORD_ID|0

$ blastdbcmd -db silly -entry "gnl|BL_ORD_ID|0"
>gnl|BL_ORD_ID|0 example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

What frustrates me is blastdbcmd cannot lookup the record using the original identifier:

$ blastdbcmd -db silly -entry example1
Error: example1: OID not found
BLAST query/options error: Entry not found in BLAST database

Attempting to use the invented accession fails too, which is kind of funny, or perhaps infuriating depending on your state of mind:

$ blastdbcmd -db silly -entry BL_ORD_ID:0 -outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER %i"
Error: BL_ORD_ID:0: OID not found


Desired Behaviour

What I would like to happen is BLAST build the database using the identifers as given (and if there happen to be duplicate identifiers, abort with a clear error message). For example, something like this:

$ blastdbcmd -db silly -entry all
>example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example2
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example3|truncated Note pipe characters but this isn't using NCBI style!
MLNIFNLICIFFNSTLFSSTFLVAKLPEAYAFLNPIVDVMPVIPLFFLLLAFVWQAA

$ blastdbcmd -db silly -entry all -outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"
OID: 0     GI: N/A     ACC: N/A     IDENTIFIER: example1
OID: 1     GI: N/A     ACC: N/A     IDENTIFIER: example2
OID: 2     GI: N/A     ACC: N/A     IDENTIFIER: example3|truncated

And specifically I'd want to be able to pull out records using the provided identifiers:

$ blastdbcmd -db silly -entry example1
>example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
$ blastdbcmd -db silly -entry example1 -outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"
OID: 0     GI: N/A     ACC: N/A     IDENTIFIER: example1

What I don't know is if this how much work it would be, and if indeed it would be possible without breaking anything else. I guess any sequence called "all" would be a problem, best solved with an explicit command line switch rather than using this as a special case search term?


Workaround

If you make a custom BLAST database (unless you are using NCBI style names), keep the FASTA file and index that to pull out specific records. I've you've lost the FASTA file, you can recover it like this using sed to remove the NCBI BLAST made up gnl|BL_ORD_ID|nnn identifiers:

$ blastdbcmd -db silly -entry all | sed 's/>\(lcl|\|gnl|BL_ORD_ID|[0-9]* \)/>/1'
>example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example2
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example3|truncated Note pipe characters but this isn't using NCBI style!
MLNIFNLICIFFNSTLFSSTFLVAKLPEAYAFLNPIVDVMPVIPLFFLLLAFVWQAA

Of course, rather than printing this to screen, pipe it to the new file:

$ blastdbcmd -db silly -entry all | sed 's/>\(lcl|\|gnl|BL_ORD_ID|[0-9]* \)/>/1' > recovered.faa

For those on Mac OS X, the sed command needs a little tweak, try:

$ blastdbcmd -db silly -entry all | sed -E 's/^>(lcl\||gnl\|BL_ORD_ID\|[0-9]* )/>/1' > recoved.faa


Scope

I touched on the NCBI BLAST funnyness with identifiers on this older blog post, and this protein example is (deliberately) trivial.

However, this is is a general problem - what prompted this blog post was trying to extract sequences from a nucleotide BLAST database made from a Trinity transcriptome assembly file names like comp1_c0_seq1 for the transcripts. In fact, I'm not aware of any assembly tools whose FASTA output follows the NCBI naming scheme. Pretty much the only FASTA files I see which do, are those downloaded directly from the NCBI.

Update (20 April 2013)

Later than planned, a quick summary of the BLAST 2.2.28+ update which said in the change log:

  • Bug fix: blastdbcmd displaying internal sequence ID for databases built without -parse_seqid
Evidently this was intended to address some of my complaints. So let's try the examples given above... and we're off to a bad start:

$ makeblastdb -dbtype prot -in silly.faa -out silly -parse_seqids
...

This fails to give an error due to non-NCBI naming. Perhaps it thinks silently munging the data is a good idea? Well, pressing on

$ makeblastdb -dbtype prot -in silly.faa -out silly $ blastdbcmd -entry all -db silly >example1 Here is an example with a simple name MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE >example2 MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE >example3|truncated Note pipe characters but this isn't using NCBI style! MLNIFNLICIFFNSTLFSSTFLVAKLPEAYAFLNPIVDVMPVIPLFFLLLAFVWQAA

Things looked good, but not so fast - the old IDs are still there if we ask for an individual entry:

$ blastdbcmd -entry example2 -db silly >gnl|BL_ORD_ID|1 example2 MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

So on the plus side, extracting a sequence with the user provided identifier is now possible. However, it seems that BLAST+ is not really using our identifiers properly,

$ blastdbcmd -db silly -entry all -outfmt "OID: %o     GI: %g     ACC: %a     IDENTIFIER: %i"  
OID: 0     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available 
OID: 1     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available
OID: 2     GI: N/A     ACC: No ID available     IDENTIFIER: No ID available

I'm guessing they're just doing something sneaky with the title field (but haven't quite got it right):

$ blastdbcmd -db silly -entry all -outfmt "OID: %o     TITLE: %t" 
OID: 0     TITLE: example1 Here is an example with a simple name 
OID: 1     TITLE: example2 
OID: 2     TITLE: example3|truncated Note pipe characters but this isn't using NCBI style!

So, it seems a partial effort has been made but not properly tested. Given I found this immediately on testing (updating the blog took a bit longer), this bug wouldn't have made it into the official release if the BLAST team had actually asked people to test it. Step one would be to setup a public bug tracker for BLAST+ please. Still, this does seem to be a small step forward (and overall BLAST 2.2.28+ is a major improvement - see my other BLAST posts).

Update (22 April 2013)

The NCBI replied to my email about problem with BLAST 2.2.28+, and they have an open bug for -parse_seqids not failing. The later odd results were related to files left on disk from the first run of makeblastdb. So if we remove the old database files and try from a fresh start:

$ rm silly.p*
$ ~/downloads/ncbi-blast-2.2.28+/bin/makeblastdb -dbtype prot -in silly.faa -out silly
...
$ ~/downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -entry all -db silly
>example1 Here is an example with a simple name
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example2
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE
>example3|truncated Note pipe characters but this isn't using NCBI style!
MLNIFNLICIFFNSTLFSSTFLVAKLPEAYAFLNPIVDVMPVIPLFFLLLAFVWQAA


So far so good, but now extracting using my identifiers fails:

$ ~/downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -entry example2 -db silly
Error: example2: OID not found
BLAST query/options error: Entry not found in BLAST database
Please refer to the BLAST+ user manual.


So sadly this isn't very useful yet.

Update (24 December 2013)

It took me a while, but I finally sat down and explored the issue with how BLAST+ is treating user-provided identifiers in more detail, see BLAST+ should keep its BL_ORD_ID identifiers to itself.

Update (13 January 2014)

The NCBI just released BLAST 2.2.29+, which includes as a bug fix "makeblastdb provides error message when -parse_seqids is used and invalid FASTA is provided".

$ rm silly.p*
$ ~/downloads/ncbi-blast-2.2.29+/bin/makeblastdb -dbtype prot -in silly.faa -out silly -parse_seqids


Building a new DB, current time: 01/13/2014 15:55:34
New DB name:   silly
New DB title:  silly.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B

volume: silly

file: silly.pin
file: silly.phr
file: silly.psq
file: silly.psi
file: silly.psd
file: silly.pog

BLAST Database creation error: Defline lacks a proper ID around line 6

 Good, a clearer error about the identifier with a pipe not following the NCBI style.

Note there is no change to the problems above when trying to use our own IDs:

$ rm silly.p*
$ ~/downloads/ncbi-blast-2.2.29+/bin/makeblastdb -dbtype prot -in silly.faa -out silly
Building a new DB, current time: 01/13/2014 15:53:53
New DB name:   silly
New DB title:  silly.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3 sequences in 0.000931978 seconds.
$ ~/downloads/ncbi-blast-2.2.29+/bin/blastdbcmd -entry example2 -db silly
Error: example2: OID not found
BLAST query/options error: Entry not found in BLAST database
Please refer to the BLAST+ user manual.


I've written more about this on a follow up post, BLAST+ should keep its BL_ORD_ID identifiers to itself.

6 comments:

  1. Peter, can you confirm this bug was fixed in the latest 2.2.28+ release?

    "Bug fix: blastdbcmd displaying internal sequence ID for databases built without -parse_seqid"

    ftp://ftp.ncbi.nih.gov/blast/executables/blast+/2.2.28/ChangeLog

    ReplyDelete
    Replies
    1. They've certainly done some work here with the new BLAST+ release, but it doesn't seem to be working quite right (and certainly it does not do what I was hoping for). Time for an update to the post exploring this, and more bug reports :(

      Delete
    2. I've updated the post with the initial problems I'd found trying this with BLAST 2.2.28+ but I haven't found the time in the last two weeks to dig into this any deeper.

      Delete
  2. Thanks for the post! This was informative and helped me fix a problem in my project

    ReplyDelete
  3. Thank you very much for posting this. Would you be able to advise me on how to extract ~12,000 specific sequences from a Trinity-assembled fasta file containing ~160,000 sequences?

    ReplyDelete
    Replies
    1. That doesn't need to involve BLAST at all - write a script in your favourite language to pull out the records from the FASTA file directly. e.g. https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_filter_by_id or https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_select_by_id offer this with a Galaxy wrapper.

      Delete