2011-08-11

Opening up NCBI BLAST?

The BLAST chapter of the Biopython Tutorial (PDF) starts with these lines by Brad Chapman,
Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world?

I know what he meant - but it turns out things could be easier, especially once you start running "standalone BLAST" on your own machines, rather than using the NCBI's ever improving BLAST website. Part of the problem is setting up BLAST and its databases can be complicated (especially on a cluster), but also inevitably, BLAST has bugs.

This isn't a slight on the NCBI; any non-trivial software product will have bugs. I'm more concerned with how they are dealt with.



I've observed in general that with proprietary software you typically report a bug and hope for the best (even if you do have a support contract), while with an active open source project there is a fair chance a reported bug will be fixed (or your fix applied, if you can provide one yourself). Part of this difference in culture is how open the project is about tracking and reporting bugs: commercial companies tend to keep their bug database private, while open source projects tend to have it out in the open (usually with the exception of security issues). The same generally applies to the source code repository as well: most open source projects let you access and browse the history of their source code.

Now as a US government-funded project, NCBI BLAST is released into the public domain. Check out the LICENSE file included in each download:
PUBLIC DOMAIN NOTICE
National Center for Biotechnology Information

This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite the author in any work or product based on this material

That's great: it's free and open source, which means BLAST can be modified and redistributed. This is perfect for inclusion in Linux distributions like Debian, which take licence freedom issues very seriously (see packages blast2 for NCBI "legacy" BLAST, and ncbi-blast+ for NCBI BLAST+, the re-write in C++).

However, in other terms the NCBI BLAST project is far from open. Looking at the BLAST Developer Information there is nothing about participating in BLAST development, and no sign of a developers mailing list.

NCBI BLAST doesn't have a public source code repository. All you get are snapshots of the source code for each release, and a web-browsable version of the current code. That is not helpful if you want to track down a regression, for instance, or experiment on a fork of the code. In contrast, the Wellcome Trust Sanger Institute's code is available on GitHub!

I guess I could just create a GitHub repository and import all the NCBI releases to date... but without any means to push proposed changes and improvements back to the NCBI, it would be of limited benefit.
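For the curious, here is a rough sketch of what such an import might look like: one commit and one tag per release snapshot, oldest first. It assumes the release tarballs have already been downloaded and unpacked into one directory per version; the directory names, repository name, commit messages and tag scheme below are all hypothetical, and git must be installed with a user name and email configured. Treat it as an illustration of the idea rather than a tested migration tool.

```python
import subprocess


def plan_import(repo, snapshots):
    """Return the commands that import each snapshot as one git commit.

    repo      -- path of the git repository to create (hypothetical name)
    snapshots -- list of (version, directory) pairs, oldest release first
    """
    repo = str(repo)
    commands = [["git", "init", repo]]
    for version, directory in snapshots:
        commands += [
            # Drop all tracked files so deletions between releases are recorded.
            ["git", "-C", repo, "rm", "-r", "-q", "--ignore-unmatch", "."],
            # Copy the unpacked release (including hidden files) into the work tree.
            ["cp", "-R", f"{directory}/.", repo],
            ["git", "-C", repo, "add", "-A"],
            ["git", "-C", repo, "commit", "-q", "-m", f"Import NCBI BLAST+ {version}"],
            # Tag scheme is an assumption, chosen only for this example.
            ["git", "-C", repo, "tag", f"v{version}"],
        ]
    return commands


def run_import(repo, snapshots):
    """Execute the plan, stopping at the first command that fails."""
    for command in plan_import(repo, snapshots):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical layout: one unpacked source tree per release.
    releases = [("2.2.24", "unpacked/2.2.24"), ("2.2.25", "unpacked/2.2.25")]
    for command in plan_import("blast-history", releases):
        print(" ".join(command))
```

Calling run_import instead of printing the plan would actually build the repository. With the releases as tagged commits, something like git bisect could then at least narrow a regression down to the release that introduced it, though not to an individual change.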

NCBI BLAST doesn't have a public bug tracker. Instead individuals must contact the NCBI by email, at blast-help (at) ncbi.nlm.nih.gov, which gets you in touch with the front line support team that then pass proper bug reports on to the actual developer team. The only way to track an issue is by follow up email, referencing the original report by date and email subject -- if there is an internal bug tracking number I've never been told it, and I have asked about this specifically.

Worse still, when you next fall over a bug in BLAST, you have no way to check whether it is a known issue, what workarounds (if any) exist, or whether it is likely to be resolved in the next release!

I would be delighted if the NCBI would provide a read-only public bug tracker for their software tools, where the NCBI could log legitimate issues reported by the public and identified internally. Obviously I'd prefer to be able to file bugs directly, but that might be open to abuse and overload the support team.

As an interim solution, I am seriously considering a series of blog posts each detailing a reproducible bug in BLAST that has been reported to the NCBI. Alternatively, perhaps the Open Bioinformatics Foundation (OBF) might be willing to host bug tracking for external projects like NCBI BLAST on their issue tracker (used by OBF supported Bio* projects). The point about this is that we (the BLAST users) could then track issues reported to the NCBI, and note if and when they are fixed. Not perfect, but it should also make Google searching for bugs in BLAST a little more useful.

Am I just having a rant after a frustrating day? Maybe. Do I still use and recommend BLAST? Absolutely yes. But with a few provisos.

Peter

P.S. BLAST is a registered trademark of the National Library of Medicine (NLM), USA.

P.P.S. This post and all others (unless clearly stated to the contrary) reflect my own personal opinions, not those of my employer or any other organisation or project I may contribute to or belong to.

Update (21 October 2011)

Not sure how long it's been there, but there is (now) a read-only public SVN for BLAST+ etc:

http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c++/src/app/blast/

I must say finding it from the main BLAST website is an exercise in persistence! Thanks to ) via this tweet.

2 comments:

  1. Have also recently experienced difficulty reporting bugs + accompanying code patches that fix the bugs to the NCBI Toolkit team.

    Ultimately resorted to just sending to the mailing list, but not a preferred method to track developer review.

    http://www.ncbi.nlm.nih.gov/mailman/pipermail/cpp/2012q4/002486.html
