2012-05-25

BLAST tabular output missing descriptions

This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremost the option to include BLAST result descriptions in the tabular output. Having the taxonomic identifiers (if available) would be great too - allowing downstream filtering of BLAST results by species etc.

Update: See end of post, BLAST 2.2.28+ added most of these features :)

As an example, consider a BLASTP search against the NR database with the following single protein sequence, limited to one target sequence here for conciseness.

>spo_like
MNEKILIVDEQYGIRVLLNEVFHKEGYQTFQAANGIQALEIVTKERPDLILLDMKIPGMDGIEILKRMK
MVDESIRVIIMTAYGELDMIKESKELGALTHFAKPFEIDDIRDAVKKYLPLKSN


Here I'm interested in the descriptions of the matched sequences in particular - useful not just for human interpretation but handy to search for keywords such as an enzyme or organism name.

BLAST plain text output

Here is some of the default plain text output from BLAST 2.2.26+, where I have highlighted the match description. This occurs twice, once truncated in the description listing (yellow), and once in full in the alignment listing (multiple lines, in orange).

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt 1
...

Query= spo_like

Length=123
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

ref|YP_005543406.1|  spo0F gene product [Bacillus amyloliquefacie...   239    3e-79


>ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefaciens TA208]
 gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]
 gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]
Length=123

 Score =  239 bits (611),  Expect = 3e-79, Method: Compositional matrix adjust.
 Identities = 116/123 (94%), Positives = 123/123 (100%), Gaps = 0/123 (0%)

Query  1    MNEKILIVDEQYGIRVLLNEVFHKEGYQTFQAANGIQALEIVTKERPDLILLDMKIPGMD  60
            MNEKILIVD+QYGIR+LLNEVFHKEGYQTFQAANGIQAL+IVTKERPDL+LLDMKIPGMD
Sbjct  1    MNEKILIVDDQYGIRILLNEVFHKEGYQTFQAANGIQALDIVTKERPDLVLLDMKIPGMD  60

Query  61   GIEILKRMKMVDESIRVIIMTAYGELDMIKESKELGALTHFAKPFEIDDIRDAVKKYLPL  120
            GIEILKRMKM+DESIRVIIMTAYGELDMIKESKELGALTHFAKPF+ID+IRDAVKKYLPL
Sbjct  61   GIEILKRMKMIDESIRVIIMTAYGELDMIKESKELGALTHFAKPFDIDEIRDAVKKYLPL  120

Query  121  KSN  123
            KSN
Sbjct  121  KSN  123

...

Why are there multiple lines? This is a quirk of the NR database, the clue is in the full name - non redundant. In this example, three identical protein sequences (YP_001422989.1, AEB25913.1 and EHM07003.1) are combined into one entry in the database.

BLAST XML output

Here is the equivalent output in XML, where the matched sequence description occurs only once (highlighted in orange):

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt 5
<?xml version="1.0"?>
...
     <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>spo_like</Iteration_query-def>
      <Iteration_query-len>123</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|384161333|ref|YP_005543406.1|</Hit_id>
          <Hit_def>spo0F gene product [Bacillus amyloliquefaciens TA208] &gt;gi|328555421|gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208] &gt;gi|363726869|gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]</Hit_def>
          <Hit_accession>YP_005543406</Hit_accession>
          <Hit_len>123</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>239.965</Hsp_bit-score>
              <Hsp_score>611</Hsp_score>
              <Hsp_evalue>2.57739e-79</Hsp_evalue>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>123</Hsp_query-to>
              <Hsp_hit-from>1</Hsp_hit-from>
              <Hsp_hit-to>123</Hsp_hit-to>
              <Hsp_query-frame>0</Hsp_query-frame>
              <Hsp_hit-frame>0</Hsp_hit-frame>
              <Hsp_identity>116</Hsp_identity>
              <Hsp_positive>123</Hsp_positive>
              <Hsp_gaps>0</Hsp_gaps>
              <Hsp_align-len>123</Hsp_align-len>
              <Hsp_qseq>MNEKILIVDEQYGIRVLLNEVFHKEGYQTFQAANGIQAL...KSN</Hsp_qseq>
              <Hsp_hseq>MNEKILIVDDQYGIRILLNEVFHKEGYQTFQAANGIQAL...KSN</Hsp_hseq>
              <Hsp_midline>MNEKILIVD+QYGIR+LLNEVFHKEGYQTFQAANGI...KSN</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        ...


This time the three combined names are a little more painful to get - first combine the <Hit_id> and <Hit_def> tags, and then split this up by looking for &gt; (shown in red), which is an HTML encoded greater than sign, familiar to any Bioinformatician as the new sequence marker in a FASTA file.

There is reasonably a logical explanation for that palava, and it is based on what I infer the entry in the FASTA file used to make the NR database would have looked like. You can actually extract this from the database using blastbcmd, shown here I have shown ^A to represent the non-printing ASCII character CTRL+A, which is what the NCBI use as the separator between merged redundant entries.

$ blastdbcmd -db nr -entry YP_005543406.1 -ctrl_a
>ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefaciens TA208]
^Agb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]
^Agb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]
MNEKILIVDDQYGIRILLNEVFHKEGYQTFQAANGIQALDIVTKERPDLVLLDMKIPGMD
GIEILKRMKMIDESIRVIIMTAYGELDMIKESKELGALTHFAKPFDIDEIRDAVKKYLPL 

KSN

For the blog formatting I've had to stick line breaks in at the ^A bits, but all that long triple name bit would have been one line in the FASTA file.

In the standard FASTA header line interpretation (which doesn't care about this weird control character), the first word is the record's ID, and thus becomes <Hit_id>gi|384161333|ref|YP_005543406.1|</Hit_id> in the blast output, while everything after the first space is the description, and thus becomes <Hit_def>spo0F gene product [Bacillus amyloliquefaciens TA208] &gt;gi|328555421|gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208] &gt;gi|363726869|gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]</Hit_def>. You can't put a nasty non-printing ASCII character in XML, so they were switched to something else. That is my working theory for this anyway.

Why the NCBI didn't come up with a clean and explicit XML solution to this beats me. Adding some additional tags to the XML seems like a good idea for my wish list (that should be backwards compatible).

BLAST tabular output

Now for tabular output, one of the nice things about BLAST+ (and a big improvement over the NCBI 'legacy' BLAST suite) is you can customise the tabular output to include additional columns.I've highlighted the relevant entries from the help text:

$ blastp -help
...

 -outfmt 
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 
   
   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
        qseqid means Query Seq-id
           qgi means Query GI
          qacc means Query accesion
       qaccver means Query accesion.version
          qlen means Query sequence length
        sseqid means Subject Seq-id
     sallseqid means All subject Seq-id(s), separated by a ';'
           sgi means Subject GI
        sallgi means All subject GIs
          sacc means Subject accession
       saccver means Subject accession.version
       sallacc means All subject accessions
          slen means Subject sequence length
        qstart means Start of alignment in query
          qend means End of alignment in query
        sstart means Start of alignment in subject
          send means End of alignment in subject
          qseq means Aligned part of query sequence
          sseq means Aligned part of subject sequence
        evalue means Expect value
      bitscore means Bit score
         score means Raw score
        length means Alignment length
        pident means Percentage of identical matches
        nident means Number of identical matches
      mismatch means Number of mismatches
      positive means Number of positive-scoring matches
       gapopen means Number of gap openings
          gaps means Total number of gaps
          ppos means Percentage of positive-scoring matches
        frames means Query and subject frames separated by a '/'
        qframe means Query frame
        sframe means Subject frame
          btop means Blast traceback operations (BTOP)
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'

What do those fields look like? This shows the first two columns from the default output:

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sseqid"
spo_like gi|384161333|ref|YP_005543406.1|

Or, rather than just the first ID for the redundant merged entries, showing all the IDs of merged entries:

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sallseqid"
spo_like gi|384161333|ref|YP_005543406.1|;gi|328555421|gb|AEB25913.1|;gi|363726869|gb|EHM07003.1|


A discussion of the ins-and-outs of the NCBI naming conventions, and why the BLAST tabular output includes a GI number prefix but the BLAST tabular and XML output does not, can wait for another day.

Tabular Output Request - Subject descriptions

The tabular output is far more compact than both the XML and plain text output, and by its nature it is trivial to parse or work with at the command line. This makes it very attractive, but surprisingly one of the key bits of information is currently unavailable. The option to include the description would for example allow you to easily filter BLAST output (e.g. with grep) for keywords such as an enzyme or organism name.

I would like the NCBI to add a further two fields to the tabular output options:

        qdescr means Query identifier and description
        sdescr means Subject identifier(s) and description(s), separated by CTRL+A

The proposed qdescr would just repeat the entire contents of the FASTA query's header line (without the leading greater than sign, and with any tabs turned into spaces). Likewise, I want to be able to do the same for the subject. i.e.

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sdescr"
spo_like gi|384161333|ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefaciens TA208]
^Agi|328555421|gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]
^Agi|363726869|gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]

As before, I've inserted a line break at the ^A bits due to the page width, but that would all be one long line. I don't really mind if this is made into two new options, qdesrc for the first hit name and description, and salldescr for all the names and descriptions of a merged entry. I can see some appeal for symmetry with the other existing fields, although personally I only need the later functionality.

I realise that the descriptions can be very long, but so too can existing fields in the tabular output like the aligned sequences. This isn't really a problem for processing tabular files.


Tabular Output Request - Subject TaxID

Another new feature in BLAST+ is build in taxonomic filtering, which relies on additional information in the database. You can supply this to makeblastdb when making your own custom BLAST database. Surprisingly, there is no option to show the subject taxid information in the tabular output - how about adding new entries staxid and salltaxid for this? It could be added to the XML output too.

Update (25 May 2012)

The NCBI replied to my email saying "We are currently considering both of these options" and it was being added to their request list. I don't know how they prioritise requests but fingers crossed for the next release.

Update (1 April 2013)

Although dated 19 March, the NCBI released BLAST 2.2.28+ today. The change log lists most of the features I was asking for:
  • Support for taxonomy data in custom tabular output format
  • Support for query coverage and title in custom tabular output format
 In fact, they've added quite a few new fields:

$ blastp -help
...
          staxids means Subject Taxonomy ID(s), separated by a ';'
        sscinames means Subject Scientific Name(s), separated by a ';'
        scomnames means Subject Common Name(s), separated by a ';'
      sblastnames means Subject Blast Name(s), separated by a ';'
                  (in alphabetical order)
       sskingdoms means Subject Super Kingdom(s), separated by a ';'
                  (in alphabetical order) 
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
...

Very nice. Here's my example again - with manual line breaks for the blog post. Note the NR database has grown and there are now ten aliases for the first hit:

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sallseqid"
spo_like gi|384161333|ref|YP_005543406.1|;gi|394991260|ref|ZP_10384067.1|;gi|421729881|ref|ZP_16169010.1|
;gi|429507007|ref|YP_007188191.1|;gi|451345129|ref|YP_007443760.1|;gi|328555421|gb|AEB25913.1|
;gi|393808032|gb|EJD69344.1|;gi|407075847|gb|EKE48831.1|;gi|429488597|gb|AFZ92521.1| 
;gi|449848887|gb|AGF25879.1|

Using the new field for the first match's title/description:

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid stitle"
spo_like stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]

And all the titles/descriptions for the first match - I've manually added line breaks for the blog, and highlighted the separators:

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid salltitles"
spo_like stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]<>stage 0 sporulation protein F
[Bacillus sp. 916]<>two-component system, sporulation family, response regulator, stage 0 sporulation
protein F [Bacillus amyloliquefaciens subsp. plantarum M27]<>two-component system, sporulation family,
response regulator, stage 0 sporulation protein F [Bacillus amyloliquefaciens subsp. plantarum AS43.3]
<>two-component system, sporulation family, response regulator, stage 0 sporulation protein F [Bacillus
amyloliquefaciens IT-45]<>stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]<>stage 0
sporulation protein F [Bacillus sp. 916]<>two-component system, sporulation family, response regulator,
stage 0 sporulation protein F [Bacillus amyloliquefaciens subsp. plantarum M27]<>two-component system,
sporulation family, response regulator, stage 0 sporulation protein F [Bacillus amyloliquefaciens subsp.
plantarum AS43.3]<>two-component system, sporulation family, response regulator, stage 0 sporulation
protein F [Bacillus amyloliquefaciens IT-45]

And the new field for the taxonomy IDs:

$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid staxids"
spo_like 999891;1007654;1091041;1177186;1225788

Maybe writing this blog post and my habit of telling anyone else asking for this functionality to email the NCBI BLAST team paid off? This didn't include my request for qdescr (or maybe qtitle based on the terms the NCBI seems to prefer) but that was the easiest to work around anyway.

Big smiles - thank you to the NCBI BLAST+ team for listening :)

13 comments:

  1. Peter,

    I agree with everything you have said in this post! I've been dreaming of Description and Taxa information in BLAST -m8 format for 10 years, and when BLAST+ still didn't have it, I was very disappointed. I'm not holding my breath on this one.

    Torst

    ReplyDelete
  2. The descriptions are desperately needed on the tabular output, I can't understand the difficult in adding this feature!

    ReplyDelete
  3. Hi Peter,

    I end up on your page precisely because I was looking for a way to deal with this output issue. Then a 100% agree with your post.

    Well, I guess that it's still possible to parse the blast output with perl script or something like that. (for instance when you perform the blast with Geneious software, the "description" column contains the full description).

    Any advices about such a parsing script ?

    Thanks

    Alex

    ReplyDelete
    Replies
    1. Hi Alex,

      First of all, would you (and anyone else who thinks this is a good idea) ask the NCBI BLAST team to implement this. You can link to this blog if you like - if they are more aware of the demand, it might get given a higher priority.

      Right now if you need the description you could parse the XML or the plain text. If you want to use Perl, there is a parser in BioPerl you could use. Personally I'd use Biopython ;)

      You could also write a script combining the input FASTA file (for the descriptions) with the output tabular file from BLAST.

      Peter

      Delete
  4. Thanks for your post - i too have banged my head against this issue for a while, and ended up always dumping full blast output as a text file and parsing it to get this information. I'm glad the blast+ team listened!

    ReplyDelete
  5. Thanks for this writeup! So glad you updated this to point out the 2.2.28 version of BLAST+. I was having to do too much after-the-fact cleanup.

    ReplyDelete
    Replies
    1. All credit to the NCBI - they fixed a lot of my reported issues in BLAST 2.2.28+ (as well as adding this enhancement).

      Delete
  6. Yeah -- super cool! Thanks for helping making the world a better place! For reals.
    For English speakers interested in bioinformatics, anyway ;)

    So, another problem I initially ran into with this topic was that these new fields,
    e.g., sscinames, scomnames, etc.
    seem to only come back with N/A values when just plugged right in to the old command.

    A little more searching led me here: http://www.biostars.org/p/76551/
    at which point I learned that an additional "taxa" database (in the right location) is required (ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz)

    This page was so helpful for me and the problem I was tacking for the first time!
    You might edit it to suggest the above additional step?
    In case people (like me) don't have a complete/properly set up blast installation...?
    Or not, if this is not an issue you want to conflate with things here.

    Thanks!

    ReplyDelete
  7. Thanks for this post! The BLAST+ table parameters don't make a whole lot of sense to me, but your color-coded examples do.

    ReplyDelete
  8. Peter thank you so much for making BLAST 2.2.28+ compliant with Taxonomy info! Thank you! :-)

    Federico

    ReplyDelete
    Replies
    1. To be clear, I didn't do the work - I just talked about it ;)

      Delete
  9. Hey Peter,
    We are using blast+ to check some vector contamination in our data. Do you know what are the columns in parsed tabular output file?
    Thanks
    Shaadi

    ReplyDelete
    Replies
    1. It depends which columns you asked for, but the default 12 column output is described in the help text (e.g. blastn -help).

      Delete