Showing posts with label BLAST. Show all posts


BLAST+ 2.2.29 upset by [key=value] entries in queries

I recently got a weird, repeated error/warning message in my BLAST+ stderr output:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.

This turns out to be caused by [key=value] tags in my query FASTA file, and appears to be a new bug introduced in BLAST+ 2.2.29 (BLAST+ 2.2.26 through 2.2.28 inclusive are not affected).
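
One possible workaround is to strip the tags from the query headers before running the search. The following is a minimal sketch, not anything provided by BLAST+ itself: the function name strip_modifiers and the regular expression are my own assumptions about what counts as a [key=value] tag, and the example record is made up.

```python
import re

# Assumed pattern: a bracketed key=value modifier such as [organism=Homo sapiens],
# together with any whitespace immediately before it.
MODIFIER = re.compile(r"\s*\[[^\[\]=]+=[^\[\]]*\]")


def strip_modifiers(lines):
    """Yield FASTA lines with [key=value] tags removed from header lines only."""
    for line in lines:
        if line.startswith(">"):
            newline = "\n" if line.endswith("\n") else ""
            yield MODIFIER.sub("", line.rstrip("\n")).rstrip() + newline
        else:
            # Sequence lines are passed through untouched.
            yield line


# Made-up example record, for illustration only.
example = [">seq1 test protein [organism=Homo sapiens] [strain=x]\n", "MKV\n"]
cleaned = list(strip_modifiers(example))
print("".join(cleaned), end="")
```

The same generator can be pointed at an open file handle and its output written to a new FASTA file for use as the query.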


BLAST XML output needs more love from NCBI

For some time I had thought that the best option for computer parsing of BLAST+ output was BLAST XML. It had all the key bits of information, and XML is designed for automated parsing. However, with the extra fields added to the tabular and comma separated output in BLAST+ 2.2.28, such as the long overdue hit descriptions and the taxonomy fields, I think those formats are now preferable. BLAST XML is now lagging behind!
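
To illustrate how little code the tabular output needs, here is a minimal Python sketch. It assumes the search was run with the standard twelve columns plus a hit title column and a taxonomy ID column (labelled stitle and staxids here); the column list must match whatever was actually requested, the function name parse_tabular is hypothetical, and the sample row is made up.

```python
import csv
import io

# The twelve standard tabular columns, in their default order.
STD = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
       "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
# Assumption: two extra columns were requested after the standard ones.
COLUMNS = STD + ["stitle", "staxids"]


def parse_tabular(handle, columns=COLUMNS):
    """Yield one dict per hit line, skipping blank and '#' comment lines."""
    # QUOTE_NONE because hit descriptions may contain literal quote characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(columns, row))


# Made-up line of output, for illustration only.
sample = ("query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t365"
          "\tmade-up example protein\t9606\n")
hits = list(parse_tabular(io.StringIO(sample)))
print(hits[0]["sseqid"], hits[0]["stitle"], hits[0]["staxids"])
```

All values come back as strings, so numeric fields such as evalue need converting with float() before filtering.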


BLAST+ should keep its BL_ORD_ID identifiers to itself

This is in a sense a continuation of my previous BLAST blog post, My IDs not good enough for NCBI BLAST+. My core complaint is that makeblastdb currently ignores the user's own identifiers and automatically assigns its own identifiers (gnl|BL_ORD_ID|0, gnl|BL_ORD_ID|1, gnl|BL_ORD_ID|2, etc), and that the BLAST+ suite as a whole is inconsistent about hiding these in its output.

Note that one side-effect of BLAST+ ignoring the user identifiers and creating its own is that it can tolerate databases made from FASTA files with accidentally duplicated identifiers, but this only causes great confusion and ambiguity in the downstream analysis. One of the ways I've seen FASTA files be created with accidentally duplicated identifiers is pooling of assemblies where generic names like contig1 (or even the more complex Trinity naming scheme) naturally cause clashes. In situations like this, I think makeblastdb should give an error when attempting to build a BLAST database.
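
Until makeblastdb does this check itself, it is easy to do before building the database. This is a minimal sketch, assuming the identifier is the first whitespace-delimited word after the ">" character; the function name duplicated_ids and the pooled records are made up for illustration.

```python
from collections import Counter


def duplicated_ids(lines):
    """Return the identifiers (first word after '>') that occur more than once."""
    counts = Counter(
        line[1:].split(None, 1)[0]
        for line in lines
        # Only header lines, and skip headers with no identifier at all.
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up pooled assembly where two contigs share a generic name.
pooled = [
    ">contig1 assembly A\n", "ACGT\n",
    ">contig2\n", "GGCC\n",
    ">contig1 assembly B\n", "TTAA\n",
]
clashes = duplicated_ids(pooled)
print(clashes)
```

If the returned list is not empty, the sequences can be renamed (for example with a per-assembly prefix) before the database is built.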


Trouble with chimeras - getting all complete viral genomes from the NCBI

Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in GenBank format, which I then processed to make FASTA files and BLAST databases. Recently I updated them and ran into some problems... false positives like entire bacterial genomes! This turns out to be due to a few bacteria with integrated phage being annotated as chimeras - genomes combined from multiple organisms.


My IDs not good enough for NCBI BLAST+

The blastdbcmd tool in the BLAST+ suite (replacing fastacmd in the C 'legacy' BLAST suite) lets you do a lot of clever things with a BLAST database. As long as you follow the baroque NCBI FASTA naming scheme you can do this with local BLAST databases too. However, if you don't want to bow down to the NCBI naming (e.g. use FASTA files directly from your favourite assembler), then blastdbcmd seems needlessly crippled.

Update (2 April 2013): Some changes in BLAST 2.2.28+ (released yesterday) seem to be intended to address these issues, but there remain problems with this which I intend to expand on later.

Update (20 April 2013): I found a quiet moment this weekend to update this post with the BLAST 2.2.28+ problems I was alluding to. There has been some progress on this issue with the new release, but it is flawed. See below.

Broken blastdbcmd for -target_only

This is just a quick post to document a bug in the blastdbcmd tool from the BLAST+ suite when used on the NR database with a full identifier and the -target_only option.

Update: See end of post, BLAST 2.2.28+ fixed this :)


Stop breaking NCBI BLAST searches!

Have you ever tried to use a BLAST database of protein sequences containing stop codons? If you work on nice model organisms with solid gene annotation maybe not. However, with draft annotations, mutation studies, or read through translation it is not unreasonable for the odd internal stop codon to appear in a protein sequence. And some translation pipelines do leave in a trailing * character. It turns out the BLAST+ suite has a rather nasty glitch with this sort of sequence.

Update: BLAST 2.2.27+ fixed this bug.
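
For anyone stuck on an older release, one way to sidestep the problem is to tidy the protein sequences before building the database. This is only a sketch of that idea, with a hypothetical function name clean_protein: it removes trailing * characters and reports where any internal stops are, leaving the decision about what to do with those to you.

```python
def clean_protein(seq):
    """Strip trailing '*' characters and list 0-based positions of internal stops."""
    trimmed = seq.rstrip("*")
    internal = [i for i, aa in enumerate(trimmed) if aa == "*"]
    return trimmed, internal


# Made-up sequence with one internal stop and a trailing stop.
trimmed, internal = clean_protein("MKV*LLT*")
print(trimmed, internal)
```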


BLAST tabular output missing descriptions

This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremost the option to include BLAST result descriptions in the tabular output. Having the taxonomic identifiers (if available) would be great too - allowing downstream filtering of BLAST results by species etc.

Update: See end of post, BLAST 2.2.28+ added most of these features :)


BLAST+ ignoring search space size for e-values

Sometimes using BLAST is frustrating. Today I'm writing about it returning different expectation values, and therefore different answers, depending on whether you use a FASTA subject file or a database made from that file. I noticed something funny a while ago, but didn't immediately investigate and report it (which I regret). In the continued absence of an official public bug tracker for NCBI BLAST, I'm again going to blog about it here, so people can find it via Google, and email this to the NCBI.
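
As background on why the search space matters, the textbook Karlin-Altschul relation in bit score form is E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total subject length. The sketch below just evaluates that formula; it is a simplification which ignores the effective length corrections real BLAST applies, it makes no claim about what BLAST+ actually does in either mode, and the lengths and score are made-up numbers.

```python
def expect_value(bit_score, query_len, subject_len):
    """Textbook e-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * subject_len * 2.0 ** (-bit_score)


# Made-up numbers: the same 50 bit hit from a 300 letter query, scored against
# a single 1,000 letter subject versus a 1,000,000 letter database.
e_small = expect_value(50, 300, 1_000)
e_large = expect_value(50, 300, 1_000_000)
ratio = e_large / e_small
print(e_small, e_large, ratio)
```

Under this formula the e-value scales linearly with the size of the search space, so a thousand-fold larger database gives a thousand-fold larger e-value for an identical alignment.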


BLAST+ memory hog with subject FASTA and XML output

We noticed a major memory problem running NCBI BLAST+ with XML output using a FASTA subject (consuming loads of swap space then getting killed). This doesn't happen with tabular output, nor if using a BLAST database:
[Figure: Memory and CPU usage (until killed by OS)]


Opening up NCBI BLAST?

The BLAST chapter of the Biopython Tutorial (PDF) starts with these lines by Brad Chapman,
Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world?

I know what he meant - but it turns out things could be easier, especially once you start running "standalone BLAST" on your own machines, rather than using the NCBI's ever improving BLAST website. Part of the problem is setting up BLAST and its databases can be complicated (especially on a cluster), but also inevitably, BLAST has bugs.

This isn't a slight on the NCBI; any non-trivial software product will have bugs. I'm more concerned with how they are dealt with.