The command line help in BLAST+ 2.2.31 only describes output formats 0 to 14:
$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -help USAGE ... DESCRIPTION Protein-Protein BLAST 2.2.31+ ... *** Formatting options -outfmtalignment view options: 0 = pairwise, 1 = query-anchored showing identities, 2 = query-anchored no identities, 3 = flat query-anchored, show identities, 4 = flat query-anchored, no identities, 5 = XML Blast output, 6 = tabular, 7 = tabular with comment lines, 8 = Text ASN.1, 9 = Binary ASN.1, 10 = Comma-separated values, 11 = BLAST archive format (ASN.1), 12 = JSON Seqalign output, 13 = JSON Blast output, 14 = XML2 Blast output Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. ...
I discovered by accident that 15 offers SAM format output. First, using the sample files from this repository, here's an example using -outfmt 6, the concise tabular output:
$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 6 gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 96.55 348 12 0 1 348 1 348 0.0 701 gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 83.33 354 53 2 1 354 1 348 0.0 605 gi|283855846|gb|ADB45242.1| sp|P08100|OPSD_HUMAN 94.82 328 17 0 1 328 11 338 0.0 630 gi|283855823|gb|ADB45229.1| sp|P08100|OPSD_HUMAN 94.82 328 17 0 1 328 11 338 0.0 630 gi|223523|prf||0811197A sp|P08100|OPSD_HUMAN 93.10 348 23 1 1 347 1 348 0.0 651 gi|12583665|dbj|BAB21486.1| sp|P08100|OPSD_HUMAN 81.09 349 65 1 1 349 1 348 0.0 587
Here's what BLAST+ 2.2.31 gives with -outfmt 15 instead:
$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 15 @HD VN:1.2 GO:query @SQ SN:gnl|BL_ORD_ID|3 LN:348 @SQ SN:gnl|BL_ORD_ID|3 LN:348 @SQ SN:gnl|BL_ORD_ID|3 LN:348 @SQ SN:gnl|BL_ORD_ID|3 LN:348 @SQ SN:gnl|BL_ORD_ID|3 LN:348 @SQ SN:gnl|BL_ORD_ID|3 LN:348 lcl|Query_1 0 gnl|BL_ORD_ID|3 1 255 348M * 0 0 * * AS:i:1808 EV:f:NM:i:0 PI:f:96.55 BS:f:701.049 lcl|Query_2 0 gnl|BL_ORD_ID|3 1 255 333M1I8M5I7M * 0 0 * * AS:i:1560 EV:f:0 NM:i:6 PI:f:84.77 BS:f:605.52 lcl|Query_3 0 gnl|BL_ORD_ID|3 11 255 328M * 0 0 * * AS:i:1625 EV:f:NM:i:0 PI:f:94.82 BS:f:630.558 lcl|Query_4 0 gnl|BL_ORD_ID|3 11 255 328M * 0 0 * * AS:i:1625 EV:f:NM:i:0 PI:f:94.82 BS:f:630.558 lcl|Query_5 0 gnl|BL_ORD_ID|3 1 255 190M1D157M * 0 0 * * AS:i:1680 EV:f:0 NM:i:1 PI:f:93.37 BS:f:651.744 lcl|Query_6 0 gnl|BL_ORD_ID|3 1 255 328M1I20M5H * 0 0 * * AS:i:1512 EV:f:0 NM:i:1 PI:f:81.32 BS:f:587.03
Note the use of standard SAM/BAM format tags - looking at the first match:
- AS:i:1808 - integer alignment score, here 1808
- NM:i:0 - integer edit distance, here 0
- EV:f:0 - float e-value, here 0
- PI:f:96.55 - float percent identify, here 96.55
- BS:f:701.049 - float bit-score, here 701.049
Sadly rather than the real names for the query and matches, BLAST's internal names are being used (e.g. lcl|Query_1 and gnl|BL_ORD_ID|3) rather than gi|57163783|ref|NP_001009242.1| and sp|P08100|OPSD_HUMAN). This reminds me of earlier problems, see my older post "BLAST+ should keep its BL_ORD_ID identifiers to itself".
Also there's an obvious bug in the duplication of the "reference" @SQ lines in the header.
For bonus points they should add a @PG line to the header as well, giving the BLAST+ version etc.
Still, as long as these niggles are sorted out for the first official release to offer SAM output, I think this will be surprisingly useful. For now, there are options like the BLAST XML to SAM tool Pierre Lindenbaum's student Aurélien Guy-Duché wrote (blog post, repository).
Update (6 July 2015)
Last night on Zach Charlop-powers got in touch on Twitter (@zach_cp) to report using -parse_deflines fixes the query names, and if the database was built using makeblastdb -parse_seqids ... then this fixes the match names. He put together an example too.Unfortunately that doesn't work if you are not using the NCBI specific pipe-character based naming scheme (see also My IDs not good enough for NCBI BLAST+).
P.S. To date SAM/BAM has only been used for nucleotides, but my example above was from a protein BLAST so it was apparently using SAM format working in amino acid residues rather than base pairs.
Thanks for keeping this updated with the latest news:)!
ReplyDeleteAlso...
>P.S. To date SAM/BAM has only been used for nucleotides, but my example above was from a protein BLAST so it was apparently using SAM format working in amino acid residues rather than base pairs.
I get this error when using it with BLASTP
"Error BLAST query/options error: SAM format is only applicable to blastn results"
Its probably deliberate - SAM output makes sense for blastn (nucleotides versus nucleotides), but is a bit of a stretch when you have protein sequences.
DeleteThanks for this very informative post.
ReplyDeleteI'm working with blastn and the SAM output, I've noticed that the version in 2.2.31 works perfectly for my needs, while the newer versions are broken!
Specifically, in the newer versions, in the sam output the query is taken to be the reference and the subject the reads, which makes no sense.
Query as the reference works for my purposes, because I can then view the results in IGV. I can see that both could be useful though. Each should have its own outfmt number, query as reads and query as reference.
ReplyDeleteThe reference IDs are set to Query_# rather than the query sequence's FASTA ID, which is not not at all helpful!
I just submitted a bug report to NCBI regarding the "Query_#" issue, which still exists...
ReplyDeleteThanks Tomer, post what you hear.
ReplyDeleteThe -parse_deflines still works to fix the query_# issue
ReplyDelete