Update: See end of post, BLAST 2.2.28+ added most of these features :)
As an example, consider a BLASTP search against the NR database with the following single protein sequence, limited to one target sequence here for conciseness.
>spo_like
MNEKILIVDEQYGIRVLLNEVFHKEGYQTFQAANGIQALEIVTKERPDLILLDMKIPGMDGIEILKRMK
MVDESIRVIIMTAYGELDMIKESKELGALTHFAKPFEIDDIRDAVKKYLPLKSN
Here I'm interested in the descriptions of the matched sequences in particular - useful not just for human interpretation but handy to search for keywords such as an enzyme or organism name.
BLAST plain text output
Here is some of the default plain text output from BLAST 2.2.26+, where I have highlighted the match description. This occurs twice, once truncated in the description listing (yellow), and once in full in the alignment listing (multiple lines, in orange).
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt 1 ... Query= spo_like Length=123 Score E Sequences producing significant alignments: (Bits) Value ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefacie... 239 3e-79 >ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefaciens TA208] gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208] gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45] Length=123 Score = 239 bits (611), Expect = 3e-79, Method: Compositional matrix adjust. Identities = 116/123 (94%), Positives = 123/123 (100%), Gaps = 0/123 (0%) Query 1 MNEKILIVDEQYGIRVLLNEVFHKEGYQTFQAANGIQALEIVTKERPDLILLDMKIPGMD 60 MNEKILIVD+QYGIR+LLNEVFHKEGYQTFQAANGIQAL+IVTKERPDL+LLDMKIPGMD Sbjct 1 MNEKILIVDDQYGIRILLNEVFHKEGYQTFQAANGIQALDIVTKERPDLVLLDMKIPGMD 60 Query 61 GIEILKRMKMVDESIRVIIMTAYGELDMIKESKELGALTHFAKPFEIDDIRDAVKKYLPL 120 GIEILKRMKM+DESIRVIIMTAYGELDMIKESKELGALTHFAKPF+ID+IRDAVKKYLPL Sbjct 61 GIEILKRMKMIDESIRVIIMTAYGELDMIKESKELGALTHFAKPFDIDEIRDAVKKYLPL 120 Query 121 KSN 123 KSN Sbjct 121 KSN 123 ...
Why are there multiple lines? This is a quirk of the NR database, the clue is in the full name - non redundant. In this example, three identical protein sequences (YP_001422989.1, AEB25913.1 and EHM07003.1) are combined into one entry in the database.
BLAST XML output
Here is the equivalent output in XML, where the matched sequence description occurs only once (highlighted in orange):
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt 5
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>spo_like</Iteration_query-def>
<Iteration_query-len>123</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|384161333|ref|YP_005543406.1|</Hit_id>
<Hit_def>spo0F gene product [Bacillus amyloliquefaciens TA208] >gi|328555421|gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208] >gi|363726869|gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]</Hit_def>
<Hit_accession>YP_005543406</Hit_accession>
<Hit_len>123</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>239.965</Hsp_bit-score>
<Hsp_score>611</Hsp_score>
<Hsp_evalue>2.57739e-79</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>123</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>123</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>116</Hsp_identity>
<Hsp_positive>123</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>123</Hsp_align-len>
<Hsp_qseq>MNEKILIVDEQYGIRVLLNEVFHKEGYQTFQAANGIQAL...KSN</Hsp_qseq>
<Hsp_hseq>MNEKILIVDDQYGIRILLNEVFHKEGYQTFQAANGIQAL...KSN</Hsp_hseq>
<Hsp_midline>MNEKILIVD+QYGIR+LLNEVFHKEGYQTFQAANGI...KSN</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
...
This time the three combined names are a little more painful to get - first combine the <Hit_id> and <Hit_def> tags, and then split this up by looking for > (shown in red), which is an HTML encoded greater than sign, familiar to any Bioinformatician as the new sequence marker in a FASTA file.
There is reasonably a logical explanation for that palava, and it is based on what I infer the entry in the FASTA file used to make the NR database would have looked like. You can actually extract this from the database using blastbcmd, shown here I have shown ^A to represent the non-printing ASCII character CTRL+A, which is what the NCBI use as the separator between merged redundant entries.
$ blastdbcmd -db nr -entry YP_005543406.1 -ctrl_a
>ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefaciens TA208]
^Agb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]
^Agb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]
MNEKILIVDDQYGIRILLNEVFHKEGYQTFQAANGIQALDIVTKERPDLVLLDMKIPGMD
GIEILKRMKMIDESIRVIIMTAYGELDMIKESKELGALTHFAKPFDIDEIRDAVKKYLPL
KSN
For the blog formatting I've had to stick line breaks in at the ^A bits, but all that long triple name bit would have been one line in the FASTA file.
In the standard FASTA header line interpretation (which doesn't care about this weird control character), the first word is the record's ID, and thus becomes <Hit_id>gi|384161333|ref|YP_005543406.1|</Hit_id> in the blast output, while everything after the first space is the description, and thus becomes <Hit_def>spo0F gene product [Bacillus amyloliquefaciens TA208] >gi|328555421|gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208] >gi|363726869|gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]</Hit_def>. You can't put a nasty non-printing ASCII character in XML, so they were switched to something else. That is my working theory for this anyway.
Why the NCBI didn't come up with a clean and explicit XML solution to this beats me. Adding some additional tags to the XML seems like a good idea for my wish list (that should be backwards compatible).
BLAST tabular output
Now for tabular output, one of the nice things about BLAST+ (and a big improvement over the NCBI 'legacy' BLAST suite) is you can customise the tabular output to include additional columns.I've highlighted the relevant entries from the help text:
$ blastp -help ... -outfmtalignment view options: 0 = pairwise, 1 = query-anchored showing identities, 2 = query-anchored no identities, 3 = flat query-anchored, show identities, 4 = flat query-anchored, no identities, 5 = XML Blast output, 6 = tabular, 7 = tabular with comment lines, 8 = Text ASN.1, 9 = Binary ASN.1, 10 = Comma-separated values, 11 = BLAST archive format (ASN.1) Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. The supported format specifiers are: qseqid means Query Seq-id qgi means Query GI qacc means Query accesion qaccver means Query accesion.version qlen means Query sequence length sseqid means Subject Seq-id sallseqid means All subject Seq-id(s), separated by a ';' sgi means Subject GI sallgi means All subject GIs sacc means Subject accession saccver means Subject accession.version sallacc means All subject accessions slen means Subject sequence length qstart means Start of alignment in query qend means End of alignment in query sstart means Start of alignment in subject send means End of alignment in subject qseq means Aligned part of query sequence sseq means Aligned part of subject sequence evalue means Expect value bitscore means Bit score score means Raw score length means Alignment length pident means Percentage of identical matches nident means Number of identical matches mismatch means Number of mismatches positive means Number of positive-scoring matches gapopen means Number of gap openings gaps means Total number of gaps ppos means Percentage of positive-scoring matches frames means Query and subject frames separated by a '/' qframe means Query frame sframe means Subject frame btop means Blast traceback operations (BTOP) When not provided, the default value is: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std' Default = `0'
What do those fields look like? This shows the first two columns from the default output:
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sseqid"
spo_like gi|384161333|ref|YP_005543406.1|
Or, rather than just the first ID for the redundant merged entries, showing all the IDs of merged entries:
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sallseqid"
spo_like gi|384161333|ref|YP_005543406.1|;gi|328555421|gb|AEB25913.1|;gi|363726869|gb|EHM07003.1|
A discussion of the ins-and-outs of the NCBI naming conventions, and why the BLAST tabular output includes a GI number prefix but the BLAST tabular and XML output does not, can wait for another day.
Tabular Output Request - Subject descriptions
The tabular output is far more compact than both the XML and plain text output, and by its nature it is trivial to parse or work with at the command line. This makes it very attractive, but surprisingly one of the key bits of information is currently unavailable. The option to include the description would for example allow you to easily filter BLAST output (e.g. with grep) for keywords such as an enzyme or organism name.
I would like the NCBI to add a further two fields to the tabular output options:
qdescr means Query identifier and description
sdescr means Subject identifier(s) and description(s), separated by CTRL+A
The proposed qdescr would just repeat the entire contents of the FASTA query's header line (without the leading greater than sign, and with any tabs turned into spaces). Likewise, I want to be able to do the same for the subject. i.e.
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sdescr"
spo_like gi|384161333|ref|YP_005543406.1| spo0F gene product [Bacillus amyloliquefaciens TA208]
^Agi|328555421|gb|AEB25913.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]
^Agi|363726869|gb|EHM07003.1| stage 0 sporulation protein F [Bacillus amyloliquefaciens IT-45]
As before, I've inserted a line break at the ^A bits due to the page width, but that would all be one long line. I don't really mind if this is made into two new options, qdesrc for the first hit name and description, and salldescr for all the names and descriptions of a merged entry. I can see some appeal for symmetry with the other existing fields, although personally I only need the later functionality.
I realise that the descriptions can be very long, but so too can existing fields in the tabular output like the aligned sequences. This isn't really a problem for processing tabular files.
Tabular Output Request - Subject TaxID
Another new feature in BLAST+ is build in taxonomic filtering, which relies on additional information in the database. You can supply this to makeblastdb when making your own custom BLAST database. Surprisingly, there is no option to show the subject taxid information in the tabular output - how about adding new entries staxid and salltaxid for this? It could be added to the XML output too.
Update (25 May 2012)
The NCBI replied to my email saying "We are currently considering both of these options" and it was being added to their request list. I don't know how they prioritise requests but fingers crossed for the next release.
Update (1 April 2013)
Although dated 19 March, the NCBI released BLAST 2.2.28+ today. The change log lists most of the features I was asking for:
- Support for taxonomy data in custom tabular output format
- Support for query coverage and title in custom tabular output format
$ blastp -help
...
staxids means Subject Taxonomy ID(s), separated by a ';'
sscinames means Subject Scientific Name(s), separated by a ';'
scomnames means Subject Common Name(s), separated by a ';'
sblastnames means Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
...
Very nice. Here's my example again - with manual line breaks for the blog post. Note the NR database has grown and there are now ten aliases for the first hit:
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid sallseqid"
spo_like gi|384161333|ref|YP_005543406.1|;gi|394991260|ref|ZP_10384067.1|;gi|421729881|ref|ZP_16169010.1|
;gi|429507007|ref|YP_007188191.1|;gi|451345129|ref|YP_007443760.1|;gi|328555421|gb|AEB25913.1|
;gi|393808032|gb|EJD69344.1|;gi|407075847|gb|EKE48831.1|;gi|429488597|gb|AFZ92521.1|
;gi|449848887|gb|AGF25879.1|
Using the new field for the first match's title/description:
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid stitle"
spo_like stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]
And all the titles/descriptions for the first match - I've manually added line breaks for the blog, and highlighted the separators:
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid salltitles"
spo_like stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]<>stage 0 sporulation protein F
[Bacillus sp. 916]<>two-component system, sporulation family, response regulator, stage 0 sporulation
protein F [Bacillus amyloliquefaciens subsp. plantarum M27]<>two-component system, sporulation family,
response regulator, stage 0 sporulation protein F [Bacillus amyloliquefaciens subsp. plantarum AS43.3]
<>two-component system, sporulation family, response regulator, stage 0 sporulation protein F [Bacillus
amyloliquefaciens IT-45]<>stage 0 sporulation protein F [Bacillus amyloliquefaciens TA208]<>stage 0
sporulation protein F [Bacillus sp. 916]<>two-component system, sporulation family, response regulator,
stage 0 sporulation protein F [Bacillus amyloliquefaciens subsp. plantarum M27]<>two-component system,
sporulation family, response regulator, stage 0 sporulation protein F [Bacillus amyloliquefaciens subsp.
plantarum AS43.3]<>two-component system, sporulation family, response regulator, stage 0 sporulation
protein F [Bacillus amyloliquefaciens IT-45]
And the new field for the taxonomy IDs:
$ blastp -query spo_like.faa -db nr -max_target_seqs 1 -outfmt "6 qseqid staxids"
spo_like 999891;1007654;1091041;1177186;1225788
Maybe writing this blog post and my habit of telling anyone else asking for this functionality to email the NCBI BLAST team paid off? This didn't include my request for qdescr (or maybe qtitle based on the terms the NCBI seems to prefer) but that was the easiest to work around anyway.
Big smiles - thank you to the NCBI BLAST+ team for listening :)
Update (Monday 13 June 2016)
With the release of BLAST+ 2.4.0, the NCBI have added a few more columns for the taxonomy information of the first hit where multiple entries have been merged (e.g. before just had staxids which would give multiple values with a database like NR, now have staxid which will give the first entry only):
Added new tabular format specifiers for taxonomic information (staxid, ssciname, scomname, sblastname, skingdom) that correspond to the first subject ID.
Peter,
ReplyDeleteI agree with everything you have said in this post! I've been dreaming of Description and Taxa information in BLAST -m8 format for 10 years, and when BLAST+ still didn't have it, I was very disappointed. I'm not holding my breath on this one.
Torst
The descriptions are desperately needed on the tabular output, I can't understand the difficult in adding this feature!
ReplyDeleteHi Peter,
ReplyDeleteI end up on your page precisely because I was looking for a way to deal with this output issue. Then a 100% agree with your post.
Well, I guess that it's still possible to parse the blast output with perl script or something like that. (for instance when you perform the blast with Geneious software, the "description" column contains the full description).
Any advices about such a parsing script ?
Thanks
Alex
Hi Alex,
DeleteFirst of all, would you (and anyone else who thinks this is a good idea) ask the NCBI BLAST team to implement this. You can link to this blog if you like - if they are more aware of the demand, it might get given a higher priority.
Right now if you need the description you could parse the XML or the plain text. If you want to use Perl, there is a parser in BioPerl you could use. Personally I'd use Biopython ;)
You could also write a script combining the input FASTA file (for the descriptions) with the output tabular file from BLAST.
Peter
Thanks for your post - i too have banged my head against this issue for a while, and ended up always dumping full blast output as a text file and parsing it to get this information. I'm glad the blast+ team listened!
ReplyDeleteThanks for this writeup! So glad you updated this to point out the 2.2.28 version of BLAST+. I was having to do too much after-the-fact cleanup.
ReplyDeleteAll credit to the NCBI - they fixed a lot of my reported issues in BLAST 2.2.28+ (as well as adding this enhancement).
DeleteYeah -- super cool! Thanks for helping making the world a better place! For reals.
ReplyDeleteFor English speakers interested in bioinformatics, anyway ;)
So, another problem I initially ran into with this topic was that these new fields,
e.g., sscinames, scomnames, etc.
seem to only come back with N/A values when just plugged right in to the old command.
A little more searching led me here: http://www.biostars.org/p/76551/
at which point I learned that an additional "taxa" database (in the right location) is required (ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz)
This page was so helpful for me and the problem I was tacking for the first time!
You might edit it to suggest the above additional step?
In case people (like me) don't have a complete/properly set up blast installation...?
Or not, if this is not an issue you want to conflate with things here.
Thanks!
Thanks for this post! The BLAST+ table parameters don't make a whole lot of sense to me, but your color-coded examples do.
ReplyDeletePeter thank you so much for making BLAST 2.2.28+ compliant with Taxonomy info! Thank you! :-)
ReplyDeleteFederico
To be clear, I didn't do the work - I just talked about it ;)
DeleteHey Peter,
ReplyDeleteWe are using blast+ to check some vector contamination in our data. Do you know what are the columns in parsed tabular output file?
Thanks
Shaadi
It depends which columns you asked for, but the default 12 column output is described in the help text (e.g. blastn -help).
DeleteIt looks like 2.2.30 has taxid columns now?
ReplyDeletestaxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
Yes, easily overlooked but they were added back in BLAST+ 2.2.28 (I added a footnote and example output to the blog back in April 2013).
Delete