2014-12-23

BLAST+ Christmas Wish List

Dear Santa,

Please could you ask the Elves at the NCBI to deliver the following BLAST+ feature requests for Christmas 2014?

Thank you,

Peter

P.S. Do they think I have been naughty or nice with my BLAST blog posts?


These are roughly in order of increasing priority, starting with some relatively minor issues. It was of course utterly unrealistic to expect this by Christmas, even though I started writing this a month before. But fingers crossed some of this might appear in BLAST+ during 2015?

10. Keep the old BLAST URL alive


The NCBI are dropping the long-lived www.ncbi.nlm.nih.gov/blast URL as of 1st December 2014, a change they announced widely (including via mailing lists and repeatedly on Twitter).

I worry this will break a number of legacy applications and scripts (whose development has ceased) which used this to run BLAST, and anticipate a flood of queries on mailing lists and forums.

The old URL already redirects to the new blast.ncbi.nlm.nih.gov address (and has done so for some time). Is it really enough of a maintenance burden to justify dropping the old URL? Michael Hoffman wondered the same thing on Twitter.

[Thinking this would be a good example, I just checked BioEdit, last updated in 2005. Its remote BLAST searches fail with an NCBI message about the withdrawal of Blastcl3 in 2013 - sadly I doubt this was enough to stop people using BioEdit.]

At the time of posting, 23rd December 2014, the old URL still redirects... so maybe it got a stay of execution?

9. Resurrect blastclust


A minor casualty in the BLAST rewrite from C to C++ was the blastclust tool for clustering sequences based on their similarity. OK, yes, there are alternatives like UCLUST, but there are times when it would be nice to have a BLAST+ version (and others agree, e.g. Mick Watson).

8. Command line option for taxonomy database path


The NCBI added taxonomy output to BLAST+ 2.2.28, which requires you to download taxdb.tar.gz from the NCBI FTP site and decompress it somewhere on your $BLASTDB path.

I sometimes want to specify the taxonomy database at the command line. One use-case is if you really care about reproducibility and want to use a particular version of the NCBI taxonomy tree. I would like a new optional argument to do this, e.g. -taxdb /my/data/taxdb-2014-11-26 to tell BLAST to use the files /my/data/taxdb-2014-11-26.* rather than looking for taxdb.* on the $BLASTDB path.

This would be similar to how the deltablast command has an optional argument -rpsdb which defaults to looking for cdd_delta.* on the $BLASTDB path.
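Until such an option exists, one possible workaround is to keep each taxonomy snapshot in its own directory and put that directory first on $BLASTDB for a single run. Here is a minimal Python sketch of that idea. The directory names are hypothetical, and it assumes BLAST+ uses the first taxdb.* files it finds on the path, which I have not verified:

```python
import os


def env_with_taxdb(taxdb_dir, environ=None):
    """Return a copy of the environment with taxdb_dir first on BLASTDB."""
    env = dict(os.environ if environ is None else environ)
    existing = env.get("BLASTDB", "")
    # Keep any existing entries (the sequence databases may live there),
    # but make sure the chosen taxonomy snapshot comes first.
    parts = [taxdb_dir]
    parts += [p for p in existing.split(os.pathsep) if p and p != taxdb_dir]
    env["BLASTDB"] = os.pathsep.join(parts)
    return env


# Hypothetical snapshot directory holding the unpacked taxdb.* files.
env = env_with_taxdb("/my/data/taxdb-2014-11-26", {"BLASTDB": "/data/blastdb"})
print(env["BLASTDB"])
```

The returned dictionary can be handed to subprocess.run(..., env=env) when launching blastp, so the choice of taxonomy snapshot is recorded in the script rather than in the shell session.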

7. Include an official local BLAST web-server


The old "legacy" BLAST suite included a basic web-server, wwwblast. It was functional and ugly by today's standards, but got the job done. Sadly, when the "legacy" BLAST suite was retired and development shifted from C to C++, no official replacement appeared - meanwhile the NCBI-hosted BLAST web-server has gone from strength to strength.

This gap has led to numerous alternatives like my own work enabling BLAST+ within Galaxy, or SequenceServer for running BLAST on a local server or cluster from a browser.

6. Update the BLAST XML format


I hope the NCBI got lots of useful feedback from their March 2014 BLAST XML consultation exercise. Back in February 2014 I wrote about my own thoughts on what needs fixing in the BLAST XML format.

5. Fix the alignment limit arguments


BLAST+ has arguments -num_descriptions and -num_alignments for use with the human readable plain text output formats (-outfmt 0 to 4 inclusive). For other formats the data does not get split into a summary listing (descriptions) followed by alignments, so instead option -max_target_seqs is used.

I want to be able to use -max_target_seqs for all the output formats. If applied to the plain text outputs, it should be treated as the default limit for both the descriptions and the alignments.

The sad thing is this actually worked from the first release of BLAST+ 2.2.18 through to 2.2.25:

$ blastp -query queries.fasta -db nr -outfmt 0 -max_target_seqs 2
BLASTP 2.2.25+
...

There was a somewhat scary warning in BLAST+ 2.2.26:

$ blastp -query queries.fasta -db nr -outfmt 0 -max_target_seqs 2
Warning: Number of descriptions overridden to 2, number of alignments overridden to 2.
max_target_seqs should not be set with outfmt 0
BLASTP 2.2.26+
...

However, the limit still worked. Unfortunately with BLAST+ 2.2.27 through to the current release 2.2.30 this was changed to ignore the argument:

$ blastp -query queries.fasta -db nr -outfmt 0 -max_target_seqs 2
Warning: The parameter -max_target_seqs is ignored for output formats, 0,1,2,3. Use -num_descriptions and -num_alignments to control output
BLASTP 2.2.30+
...

There is even an off-by-one bug in the warning message as it also applies to format four:

$ blastp -query queries.fasta -db nr -outfmt 4 -max_target_seqs 2
Warning: The parameter -max_target_seqs is ignored for output formats, 0,1,2,3. Use -num_descriptions and -num_alignments to control output
BLASTP 2.2.30+
...

I want this to be reverted to the original behaviour (with no warning, just obey the argument). This way you can just remember and use a single option -max_target_seqs for all the output formats (and for most usage, ignore -num_descriptions and -num_alignments).
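In the meantime, scripts that call BLAST+ have to pick the right option for each output format themselves. A minimal Python sketch of that dispatch follows. It is based only on the behaviour described above for BLAST+ 2.2.27 to 2.2.30, the helper name limit_args is my own invention, and nothing here runs BLAST:

```python
def limit_args(outfmt, limit):
    """Return the BLAST+ arguments that cap the number of reported hits."""
    # -outfmt may carry column specifiers, e.g. "6 qseqid sseqid";
    # the format number is always the first token.
    fmt = int(str(outfmt).split()[0])
    if 0 <= fmt <= 4:
        # Plain text reports: descriptions and alignments are limited separately.
        return ["-num_descriptions", str(limit), "-num_alignments", str(limit)]
    # All other formats take the single combined limit.
    return ["-max_target_seqs", str(limit)]


cmd = ["blastp", "-query", "queries.fasta", "-db", "nr", "-outfmt", "0"]
cmd += limit_args(0, 2)
print(" ".join(cmd))
```

This is exactly the sort of boilerplate that a single -max_target_seqs option honoured by every format would make unnecessary.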

4. Hide BL_ORD_ID as an implementation detail


I wrote about why I think BLAST+ should hide its BL_ORD_ID identifiers as an internal implementation detail, and reject FASTA files with duplicate identifiers when making a BLAST database - which should resolve my lingering issues with BLAST+ not handling user defined naming conventions very well.
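Until makeblastdb rejects such files itself, the duplicate check can be done beforehand. A minimal Python sketch, assuming the identifier is simply the first word after the ">" on each header line (it does not try to parse NCBI-style pipe-delimited identifiers, and the function name is my own):

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return the sorted identifiers that occur on more than one '>' line."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited word after '>'.
            words = line[1:].split()
            counts[words[0] if words else ""] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up records purely for illustration.
example = [">seq1 first record", "MKV", ">seq2", "MAA", ">seq1 repeated", "MKL"]
print(duplicate_ids(example))
```

For a real file, pass the open file handle instead of a list, and refuse to build the database if the returned list is not empty.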

3. Optional headers in the BLAST+ tabular and CSV output


I don't like playing guess-the-column with tables of BLAST data. Judging from the re-tweets and replies on Twitter (e.g. Torsten Seemann, Laura Williams, Matt Loose, Lex Nederbragt, and Bastien Chevreux), I am not alone in this.
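For now the header row has to be bolted on afterwards. A minimal Python sketch for the default twelve columns of -outfmt 6 follows. The hit line is made up, and if you request custom columns you would need to pass the same names yourself:

```python
# Default columns of -outfmt 6 when no column specifiers are given.
DEFAULT_COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
                   "qstart qend sstart send evalue bitscore").split()


def add_header(tabular_text, columns=DEFAULT_COLUMNS):
    """Prepend a tab-separated header row to BLAST+ tabular output."""
    return "\t".join(columns) + "\n" + tabular_text


# Made-up hit line purely for illustration.
hit = "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n"
print(add_header(hit), end="")
```

An optional header written by BLAST+ itself would be safer, since it could never drift out of step with the columns actually requested.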

2. Taxonomy filters


One of the most common uses I have seen for the Entrez filter when using the BLAST+ command line tools to run a remote search at the NCBI is to filter by taxonomy. Building on the taxonomy support added in BLAST+ 2.2.28, I would like to see new options for restricting the results to given taxa (a white list) or excluding taxa (a black list).

For example to do a BLAST search against NR restricting to only plant (Embryophyta, higher plants, taxid 3193) matches I would like to be able to do something like this:

$ blastp -query my_seqs.fasta -db nr -taxidlist 3193 -evalue 0.0001 ...

Or to do a BLAST search against NR excluding any bacterial (taxid 2) or archaeal (taxid 2157) matches I would like to be able to use:

$ blastp -query my_seqs.fasta -db nr -negative_taxidlist 2,2157 -evalue 0.0001 ...

Hopefully the taxonomy database files BLAST+ uses (taxdb.tar.gz) already contain the tree structure needed to do this, but if need be that could be expanded.
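To illustrate why the tree structure matters, here is a minimal Python sketch of the same white list and black list logic applied as a post-filter. Remember that -taxidlist and -negative_taxidlist above are only my proposed option names. The parent table below is a toy with intermediate ranks omitted, and the hits are made up - in practice the taxid for each hit might come from the staxids column of the tabular output:

```python
def in_clade(taxid, clade_roots, parent):
    """True if taxid equals, or descends from, any taxid in clade_roots."""
    seen = set()
    # Walk up the tree towards the root, guarding against cycles.
    while taxid is not None and taxid not in seen:
        if taxid in clade_roots:
            return True
        seen.add(taxid)
        taxid = parent.get(taxid)
    return False


# Toy, heavily simplified child -> parent links (intermediate ranks omitted).
# 3702 Arabidopsis thaliana, 3193 Embryophyta, 2759 Eukaryota,
# 562 Escherichia coli, 2 Bacteria, 2157 Archaea, 131567 cellular organisms.
PARENT = {3702: 3193, 3193: 2759, 562: 2, 2: 131567, 2157: 131567, 2759: 131567}

hits = [("hitA", 3702), ("hitB", 562)]  # made-up (subject id, taxid) pairs
white = [h for h in hits if in_clade(h[1], {3193}, PARENT)]
black = [h for h in hits if not in_clade(h[1], {2, 2157}, PARENT)]
print(white, black)
```

Filtering after the search is wasteful compared to restricting the search itself, which is why I would much rather BLAST+ did this internally.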

1. Develop BLAST+ in the open


Last, but definitely not least on my list: returning to the inaugural post on this blog, I'd still like to see the NCBI BLAST+ team adopt a more open approach to development - with a public issue tracker etc.

1 comment:

  1. I would add "please explain the -culling limit option" in the documentation too!
