2018-11-02

BLAST max alignment limits repartee - part two

This is the second in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by Shah et al. (2018), "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". Since, regrettably, they did not provide a reproducible test case, my previous post began by introducing a minimal one.

This topic dates back to 2015, when Sujai Kumar reported a scary BLAST+ -max_target_seqs bug. As I wrote at the time ("What BLAST's max-target-sequences doesn't do"), the NCBI BLAST developers explained it as a poorly documented feature.

Here I focus on what might be the most quoted part of Shah et al. (2018), which is causing what I consider to be unwarranted panic:
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
This does not seem to be the case. If I am misreading their message, I am not alone. See for example Emma Bell's blog post, or John Walshaw's comments on my 2015 post. It is possible Shah et al. have found a separate issue, but since no test case was given, that cannot currently be verified.
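
To make the distinction at stake concrete, here is a toy Python sketch contrasting the two readings of a hit limit: "the first N hits passing the E-value threshold, in database order" (the behaviour the quote describes) versus "the best N hits by score" (the intuitive expectation). Neither function is BLAST's actual algorithm, and the function names, hit records and scores are all made up for illustration.

```python
def first_n_hits(hits_in_db_order, evalue_threshold, n):
    """Toy model of the quoted claim: keep the first n passing hits in database order."""
    passing = [h for h in hits_in_db_order if h["evalue"] <= evalue_threshold]
    return passing[:n]


def best_n_hits(hits_in_db_order, evalue_threshold, n):
    """Toy model of the intuitive reading: keep the n highest scoring passing hits."""
    passing = [h for h in hits_in_db_order if h["evalue"] <= evalue_threshold]
    return sorted(passing, key=lambda h: h["bitscore"], reverse=True)[:n]


# Hypothetical hits, listed in the order the subjects occur in the database.
example_hits = [
    {"id": "subjectA", "evalue": 1e-20, "bitscore": 100.0},
    {"id": "subjectB", "evalue": 1e-50, "bitscore": 200.0},
    {"id": "subjectC", "evalue": 5.0, "bitscore": 20.0},
]
```

With these made-up records and a limit of one, the first function picks subjectA and the second picks subjectB. A reproducible test case is what would show which model, if either, matches a given BLAST+ search.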

BLAST max alignment limits repartee - part one

Back in 2015, my blog post "What BLAST's max-target-sequences doesn't do" highlighted what we called a scary BLAST+ -max_target_seqs bug, found and reported by Sujai Kumar. The NCBI BLAST team took the stance that this was a feature, not a bug (and for a heuristic search tool, this is an understandable view), but conceded it could be better documented.

Sadly, I don't think there has been much, if any, clarification in the BLAST+ documentation about the settings limiting the number of alignments returned, and what else they control. The recent letter by Shah et al. (2018), "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows", did serve the purpose of raising the profile of this issue, but unfortunately seems confused and misleading in several places. Most regrettably, they did not provide a reproducible test case, so it is possible they found another issue.

This is the first of a planned series of blog posts which seeks to clarify the situation, and some of the claims in Shah et al. (2018).

First of all, this issue is not specific to -max_target_seqs (used with the computer readable output format). With human readable output formats, the maximum of -num_descriptions and -num_alignments is used in exactly the same way during the BLAST search.
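
As a rough illustration of that point, the Python sketch below builds two blastn command lines, one with tabular output and -max_target_seqs, and one with pairwise output and -num_descriptions / -num_alignments, and computes the limit that, as described above, would apply to the search in the second case. The helper names and the query and database names are hypothetical, and nothing here runs BLAST.

```python
def tabular_cmd(query, db, max_target_seqs):
    """blastn command line for tabular (computer readable) output, -outfmt 6."""
    return [
        "blastn", "-query", query, "-db", db,
        "-outfmt", "6",
        "-max_target_seqs", str(max_target_seqs),
    ]


def pairwise_cmd(query, db, num_descriptions, num_alignments):
    """blastn command line for pairwise (human readable) output, -outfmt 0."""
    return [
        "blastn", "-query", query, "-db", db,
        "-outfmt", "0",
        "-num_descriptions", str(num_descriptions),
        "-num_alignments", str(num_alignments),
    ]


def effective_search_limit(num_descriptions, num_alignments):
    """Per the text above: the larger of the two settings acts as the limit
    during the search, in the same way -max_target_seqs does."""
    return max(num_descriptions, num_alignments)


# Hypothetical file and database names, for illustration only.
print(" ".join(tabular_cmd("query.fasta", "example_db", 1)))
print(" ".join(pairwise_cmd("query.fasta", "example_db", 1, 5)))
```

Under this reading, asking for one description but five alignments would behave during the search like a limit of five, not one.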

2017-10-27

Entrez eSpell can't resolve PubmedSpellSrv

Another quick bug report blog post: this time NCBI Entrez's espell is currently broken, returning:

Couldn't resolve #PubmedSpellSrv, the address table is empty.

2017-10-12

BLAST+ 2.7.0 segmentation fault with HTML output

I've not managed to blog much at all this year (parenthood), so here's a quick BLAST+ bug report from working on updating the Galaxy wrappers: I've found a reproducible segmentation fault in tblastn under both Mac and Linux when requesting HTML output.

2017-01-27

Mozilla Science Fellowship application 2016

Last summer I applied to the Mozilla Fellows for Science 2016 call. Congratulations to the four 2016 fellows, selected from an impressive 483 submissions for only the second year of this innovative program.

I was delighted to be shortlisted and interviewed, but also slightly relieved not to have made the final cut. This is due to the timing of personal circumstances - I'm now a father, and despite taking time off under the UK's Shared Parental Leave scheme, I'm currently trying to cut back work-related activities.

While preparing my application, I was impressed by Jon Tennant's decision to post his application openly online, and had been meaning to share mine too. Better late than never?

[Update: To be clear, this was my application in 2016, which was shortlisted but ultimately unsuccessful]

[Update: Cross-posted on the James Hutton ICS blog]