However, there is a simple change the NCBI could make to greatly improve the usability of the tabular or CSV output - label the columns with a header line! This is vital meta-data: No-one should be forced to guess-the-columns when presented with a data file.
Look at all the columns
Here is an excerpt from the BLAST+ 2.2.30 command line help on the output format:
*** Formatting options
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
12 = JSON Seqalign output
Options 6, 7, and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
When not provided, the default value is:
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
12 = JSON Seqalign output
Options 6, 7, and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
When not provided, the default value is:
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'
Just look at all those interesting potential output fields :)
Sample Output
Now for a quick example, the protein sequence for E. coli K-12's gene moaC, and a fake sequence:
$ more queries.fasta
>moaC molybdopterin biosynthesis, protein C
MSQLTHINAAGEAHMVDVSAKAETVREARAEAFVTMRSETLAMIIDGRHHKGDVFATARIAGIQAAKRTW
DLIPLCHPLMLSKVEVNLQAEPEHNRVRIETLCRLTGKTGVEMEALTAASVAALTIYDMCKAVQKDMVIG
PVRLLAKSGGKSGDFKVEADD
>fake sequence of letters which should not match any real proteins
DFAIBFNWACNMVBNMDEYGBCBFCNKSFDEZNVDXHALAHLFGDNASHBCVNMDFGNMNDFSILAPPQG
FCGHAKGRDAIBVKPDJKAHCIIBYANMNVB
>moaC molybdopterin biosynthesis, protein C
MSQLTHINAAGEAHMVDVSAKAETVREARAEAFVTMRSETLAMIIDGRHHKGDVFATARIAGIQAAKRTW
DLIPLCHPLMLSKVEVNLQAEPEHNRVRIETLCRLTGKTGVEMEALTAASVAALTIYDMCKAVQKDMVIG
PVRLLAKSGGKSGDFKVEADD
>fake sequence of letters which should not match any real proteins
DFAIBFNWACNMVBNMDEYGBCBFCNKSFDEZNVDXHALAHLFGDNASHBCVNMDFGNMNDFSILAPPQG
FCGHAKGRDAIBVKPDJKAHCIIBYANMNVB
Let's search this against the NCBI NR database using BLASTP with the default column tabular output, limiting this to the top two hits for brevity:
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt 6
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>100.00<tab>161<tab>0<tab>0<tab>1<tab>161
<tab>1<tab>161<tab>3e-114<tab>330<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>99.38<tab>161<tab>1<tab>0<tab>1<tab>161
<tab>1<tab>161<tab>9e-114<tab>329<end>
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>100.00<tab>161<tab>0<tab>0<tab>1<tab>161
<tab>1<tab>161<tab>3e-114<tab>330<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>99.38<tab>161<tab>1<tab>0<tab>1<tab>161
<tab>1<tab>161<tab>9e-114<tab>329<end>
Due to line splitting for the blog, I have represented the tab characters as <tab> in yellow, and emphasised the new lines with <end>.
Or, for comparison, the comma separated output - highlighted in yellow (which appears to include some unexpected spaces before the final values - a minor bug?):
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt 10
moaC,gi|15800534|ref|NP_286546.1|,100.00,161,0,0,1,161,1,161,3e-114, 330
moaC,gi|170768970|ref|ZP_02903423.1|,,99.38,161,1,,0,1,161,1,161,9e-114, 329
moaC,gi|15800534|ref|NP_286546.1|,100.00,161,0,0,1,161,1,161,3e-114, 330
moaC,gi|170768970|ref|ZP_02903423.1|,,99.38,161,1,,0,1,161,1,161,9e-114, 329
If you know the standard 12 columns by heart, this is workable. But one of the best things about BLAST+ is it will happily output other columns! For instance,
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "6 qseqid sseqid score qcovs"
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
If you don't know how the file was generated, how are you to guess what the columns mean? Supposedly this is where -outfmt 7 is useful:
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "7 qseqid sseqid score qcovs"
# BLASTP 2.2.30+
# Query: moaC molybdopterin biosynthesis, protein C
# Database: nr
# Fields: query id, subject id, score, % query coverage per subject
# 2 hits found
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
# BLASTP 2.2.30+
# Query: fake sequence of letters which should not match any real proteins
# Database: nr
# 0 hits found
# BLAST processed 2 queries
# BLASTP 2.2.30+
# Query: moaC molybdopterin biosynthesis, protein C
# Database: nr
# Fields: query id, subject id, score, % query coverage per subject
# 2 hits found
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
# BLASTP 2.2.30+
# Query: fake sequence of letters which should not match any real proteins
# Database: nr
# 0 hits found
# BLAST processed 2 queries
This is extremely comment heavy (OK, not as verbose as BLAST XML, but still...) and not immediately useful for parsing the data (although you can now see queries with no hits being reported). However, loading this with Excel or R will not recognise the columns for you.
Enhancement Proposal
I would like the NCBI to add a new tabular output format to BLAST+, say -outfmt 13 if not used for anything else first, which acts like -outfmt 6 (tabular) but with the addition of a single header line. This should start with the # character (indicating this is a comment rather than data), followed by tab separated column names:
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "13 qseqid sseqid score qcovs"
#qseqid<tab>sseqid<tab>score<tab>qcovs<end>
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
#qseqid<tab>sseqid<tab>score<tab>qcovs<end>
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
I debated just changing the existing -outfmt 6 output, but fear this would break too many existing scripts. Similarly, this new mode could replace the existing -outfmt 7, but that includes other functionality as well - like reporting query sequences with no hits. So a new output format number seems best to me.
For fans of comma separated variable (CSV) files, the same style header should also be useful? I prefer to avoid CSV with BLAST as commas do occur in record titles and so can complicate parsing. Maybe that can be -outfmt 14 giving:
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "14 qseqid sseqid score qcovs"
#qseqid,sseqid,score,qcovs
moaC,gi|15800534|ref|NP_286546.1|,846,100
moaC,gi|170768970|ref|ZP_02903423.1|,843,100
#qseqid,sseqid,score,qcovs
moaC,gi|15800534|ref|NP_286546.1|,846,100
moaC,gi|170768970|ref|ZP_02903423.1|,843,100
Note that I am explicitly suggest using the same one-word concise format names that BLAST+ itself uses in the command line to generate the file (and not the more verbose column names currently used in -outfmt 7). These short column names are clear, plus this should make it easier to trace the values back to their meaning. Also they are much easier to work with in R than column names with spaces.
The point of this header line convention is it is widely used for self-describing tab-separated tables (I specifically want this for the BLAST+ Galaxy wrappers). This is trivial to load into Excel, and much easier to load into R or Python etc and get column names matched up with the data.
In essence, this column header line is vital meta-data which avoids the guess-the-columns problem which otherwise can come up when sharing BLAST data in tabular or CSV format.
Maybe adopting the ASCII char 31 for column delimitation and ASCII char 30 for lines would make it easy to avoid problems when using comma. See here: http://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/
ReplyDeleteTrue, but tab separated variables are much more commonly used than those ASCII codes. My guess is they were too hard to enter from a standard keyboard layout? So sure, we could have ASCII delimited text as another BLAST+ output format - but I'd still want the column names there as the first line ;)
Delete