BLAST+ 2.2.29 upset by [key=value] entries in queries

I recently got a weird error/warning message (repeated) in my BLAST+ stderr output,

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.

This turns out to be due to having [key=value] tags in my query FASTA file, and appears to be a new bug introduced in BLAST+ 2.2.29 (as BLAST+ 2.2.26 through 2.2.28 inclusive are not affected).


BLAST XML output needs more love from NCBI

For some time I had thought that the best option for computer parsing of BLAST+ output was BLAST XML. It had all the key bits of information, and XML is designed for automated parsing. However, with the extra fields added to the tabular or comma separated output in BLAST+ 2.2.28 like the long overdue hit descriptions, and taxonomy fields, I think they are now preferable. BLAST XML is now lagging behind!


BLAST+ should keep its BL_ORD_ID identifiers to itself

This is in a sense a continuation of my previous BLAST blog post, My IDs not good enough for NCBI BLAST+. My core complaint is that makeblastdb currently ignores the user's own identifiers and automatically assigns its own identifiers (gnl|BL_ORD_ID|0, gnl|BL_ORD_ID|1, gnl|BL_ORD_ID|2, etc), and that the BLAST+ suite as a whole is inconsistent about hiding these in its output.

Note that one side-effect of BLAST+ ignoring the user identifiers and creating its own is that it can tolerate databases made from FASTA files with accidentally duplicated identifiers, but this only causes great confusion and ambiguity in the downstream analysis. One of the ways I've seen FASTA files be created with accidentally duplicated identifiers is pooling of assemblies where generic names like contig1 (or even the more complex Trinity naming scheme) naturally cause clashes. In situations like this, I think makeblastdb should give an error when attempting to build a BLAST database.


Trouble with chimeras - getting all complete viral genomes from the NCBI

Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in GenBank format, which I then processed to make FASTA files and BLAST databases. Recently I updated them and ran into some problems... false positives like entire bacterial genomes! This turns out to be due to a few bacteria with integrated phage being annotated as chimeras - genomes combined from multiple organisms.


UTF8 encoded Japanese in LaTeX

Slightly off topic, but anyway... notes on getting Japanese text working in LaTeX under Mac OS X using TeX Live. Once I finally got it to work it is quite easy, but first I explored a lot of dead ends and distractions (in the end I could ignore LaTeX Omega, XeLaTeX, etc). I'm just using pdflatex with the LaTeX Chinese, Japanese, Korean (CJK) package, here's an example from the PDF output: