2012-10-30

Broken blastdbcmd for -target_only

This is just a quick post to document a bug in the blastdbcmd tool from the BLAST+ suite when used on the NR database with a full identifier and the -target_only option.

Update: See end of post, BLAST 2.2.28+ fixed this :)

Here's a simple example of how you can retrieve a FASTA record from the NCBI provided NR database using the original NCBI FASTA identifier - note that the pesky pipe character requiring escaping (given with a preceding slash):

$ blastdbcmd -db nr -entry gi\|7525015\|ref\|NP_051041.1\|
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana] >gi|6685910|sp|P56806.1|RR16_ARATH RecName: Full=30S ribosomal protein S16, chloroplastic >gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE


Or, wrapping the record identifier in double-quotes:

$ blastdbcmd -db nr -entry "gi|7525015|ref|NP_051041.1"
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana] >gi|6685910|sp|P56806.1|RR16_ARATH RecName: Full=30S ribosomal protein S16, chloroplastic >gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE


What is very nice is you can also look up the sequence via the GI number or accession (with or without the version suffix). Notice this is a merged entry in the 'non-redundant' NR database, combining three proteins which have the same sequence into a single FASTA entry. You can also use their identifiers as well, e.g.

$ blastdbcmd -db nr -entry "gi|5881676|dbj|BAA84367.1|"
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana] >gi|6685910|sp|P56806.1|RR16_ARATH RecName: Full=30S ribosomal protein S16, chloroplastic >gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

$ blastdbcmd -db nr -entry 5881676
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana] >gi|6685910|sp|P56806.1|RR16_ARATH RecName: Full=30S ribosomal protein S16, chloroplastic >gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

$ blastdbcmd -db nr -entry BAA84367.1
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana] >gi|6685910|sp|P56806.1|RR16_ARATH RecName: Full=30S ribosomal protein S16, chloroplastic >gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

$ blastdbcmd -db nr -entry BAA84367
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana] >gi|6685910|sp|P56806.1|RR16_ARATH RecName: Full=30S ribosomal protein S16, chloroplastic >gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE


The -target_only switch gives you just the one set of annotation:

$ blastdbcmd -db nr -entry 5881676 -target_only
>gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE


$ blastdbcmd -db nr -entry BAA84367.1 -target_only
>gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

$ blastdbcmd -db nr -entry BAA84367 -target_only
>gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE


That is particularly useful with the other output formats which give one line per entry for these merged sequences, e.g.

$ blastdbcmd -db nr -entry 5881676 -outfmt "GI:%g ACC:%a - %i"
GI:7525015 ACC:NP_051041.1 - ref|NP_051041.1|
GI:6685910 ACC:P56806.1 - sp|P56806.1|RR16_ARATH
GI:5881676 ACC:BAA84367.1 - dbj|BAA84367.1|

$ blastdbcmd -db nr -entry 5881676 -outfmt "GI:%g ACC:%a - %i" -target_only
GI:5881676 ACC:BAA84367.1 - dbj|BAA84367.1|


Here's the bug: The -target_only should work with the long FASTA entry identifiers too, but doesn't:

$ blastdbcmd -db nr -entry gi\|7525015\|ref\|NP_051041.1\| -target_only
Error: FASTA-style ID GI|7525015|REF|NP_051041.1| has too many parts.

$ blastdbcmd -db nr -entry "gi|7525015|ref|NP_051041.1|" -target_only
Error: FASTA-style ID GI|7525015|REF|NP_051041.1| has too many parts.


It doesn't matter if you look at the primary entry or not:

$ blastdbcmd -db nr -entry "gi|5881676|dbj|BAA84367.1|" -target_only
Error: FASTA-style ID GI|5881676|DBJ|BAA84367.1| has too many parts.


It doesn't matter if you use a custom output format or not:

$ blastdbcmd -db nr -entry "gi|5881676|dbj|BAA84367.1|" -outfmt "GI:%g ACC:%a - %i" -target_only
Error: FASTA-style ID GI|5881676|DBJ|BAA84367.1| has too many parts.


I've tried this with the provided pre-compiled binaries for 64-bit Linux for the current and most recent releases (BLAST 2.2.26+ and BLAST 2.2.27+).  Should it matter, I'm using the NR database downloaded from the NCBI FTP site, once unzipped they are dated 23 October 2012. I've also tried with the provided pre-compiled binaries for Mac OS X for BLAST 2.2.27+, although with an older copy of the NR database.

I am reporting this to the NCBI - by email and this public blog post since the NCBI BLAST team don't have a public bug tracker. This way anyone else bitten by the problem should be able to find this information via an internet search.


Update (2 April 2013)

Although dated 19 March, the NCBI released BLAST 2.2.28+ yesterday. The change log included a bug fix entry:
  • blastdbcmd not fetching sequence data for complete sequence ID and -target_only
This does indeed seem to have fixed the bug demonstrated above:

$ blastdbcmd -version
blastdbcmd: 2.2.28+
Package: blast 2.2.28, build Mar 12 2013 16:52:31

$ blastdbcmd -db nr -entry gi\|7525015\|ref\|NP_051041.1\| -target_only
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

$ blastdbcmd -db nr -entry "gi|7525015|ref|NP_051041.1|" -target_only
>gi|7525015|ref|NP_051041.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE


Using an alternative ID or custom output format works too:

$ blastdbcmd -db nr -entry "gi|5881676|dbj|BAA84367.1|" -target_only
>gi|5881676|dbj|BAA84367.1| ribosomal protein S16 [Arabidopsis thaliana]
MVKLRLKRCGRKQRAVYRILAIDVRYRREGRDLSKVGFYDPITNQTFLNLSAILDFLKKGAQPTRTAHDISKKAGIFTE

$ blastdbcmd -db nr -entry "gi|5881676|dbj|BAA84367.1|" -outfmt "GI:%g ACC:%a - %i" -target_only
GI:5881676 ACC:BAA84367.1 - dbj|BAA84367.1|


Lovely.

4 comments:

  1. Peter, can you confirm this was fixed in the latest 2.2.28+ release?

    ftp://ftp.ncbi.nih.gov/blast/executables/blast+/2.2.28/ChangeLog

    "Bug fixes: blastdbcmd not fetching sequence data for complete sequence ID and -target_only"

    ReplyDelete
  2. Peter, I see 2 fasta headers come up when I am extracting using blastdbcmd as I see same case in yours too. "target_only" parameter seems to solve the problem. But, do you know if there is a reason why it is happening like that?

    ReplyDelete
    Replies
    1. If I have understood your question, this is by design. In the NR database when a protein sequence is identical between multiple sources, it can be represented in the database once with multiple names. The first name is the only one shown if "-target_only" is used.

      Delete