BLAST+ should keep its BL_ORD_ID identifiers to itself

This is in a sense a continuation of my previous BLAST blog post, My IDs not good enough for NCBI BLAST+. My core complaint is that makeblastdb currently ignores the user's own identifiers and automatically assigns its own identifiers (gnl|BL_ORD_ID|0, gnl|BL_ORD_ID|1, gnl|BL_ORD_ID|2, etc), and that the BLAST+ suite as a whole is inconsistent about hiding these in its output.

Note that one side-effect of BLAST+ ignoring the user identifiers and assigning its own is that it will happily build databases from FASTA files containing accidentally duplicated identifiers, which causes great confusion and ambiguity in the downstream analysis. One common way FASTA files end up with accidentally duplicated identifiers is pooling of assemblies, where generic names like contig1 (or even the more complex Trinity naming scheme) naturally clash. In situations like this, I think makeblastdb should give an error when attempting to build the BLAST database.
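In the meantime you can guard against this yourself before calling makeblastdb. Here's a minimal sketch (find_duplicate_ids is my own helper name, not part of any BLAST+ tool); it takes the identifier to be the first word after ">" on each header line, which matches how makeblastdb parses FASTA titles:

```python
from collections import Counter

def find_duplicate_ids(fasta_path):
    """Return sorted FASTA record identifiers appearing more than once.

    The identifier is the first whitespace-separated word after '>'
    on each header line.
    """
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                counts[line[1:].split(None, 1)[0]] += 1
    return sorted(name for name, n in counts.items() if n > 1)
```

If this returns a non-empty list, abort (or rename the records) rather than build an ambiguous database.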


Trouble with chimeras - getting all complete viral genomes from the NCBI

Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in GenBank format, which I then processed to make FASTA files and BLAST databases. Recently I updated them and ran into some problems... false positives like entire bacterial genomes! This turns out to be due to a few bacteria with integrated phage being annotated as chimeras - genomes combined from multiple organisms.


UTF8 encoded Japanese in LaTeX

Slightly off topic, but anyway... notes on getting Japanese text working in LaTeX under Mac OS X using TeX Live. Once I finally got it working it turned out to be quite easy, but first I explored a lot of dead ends and distractions (in the end I could ignore Omega, XeLaTeX, etc). I'm just using pdflatex with the LaTeX Chinese, Japanese, Korean (CJK) package; here's an example from the PDF output:
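For reference, the input side can be a minimal document along these lines, compiled with plain pdflatex (this uses the CJK bundle's CJKutf8 package, which handles UTF-8 input directly, and its "min" Japanese font; your exact font choice may differ):

```latex
\documentclass{article}
\usepackage{CJKutf8}
\begin{document}
\begin{CJK}{UTF8}{min}
こんにちは、世界。
\end{CJK}
\end{document}
```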


Interview with a PeerJ author (me)

This summer I submitted a paper to the innovative new open access journal PeerJ, where it was published this week (Cock et al. 2013). I decided to write up the experience in the style of PeerJ's Interview with an Author blog posts. I've copied the questions they normally ask, and written up my own replies - other than some rough edges in their current submission system it was all good.

Update: This has been reposted on the official PeerJ blog, with responses which I have inserted below.

Using Travis-CI for testing Galaxy Tools

Travis CI is one of the best things to happen to GitHub in some time - it runs automated tests against your source code repository as changes are committed, and even on pull requests, to help ensure new work doesn't break existing functionality.

We've been using this for Biopython for over a year, but this month I started using Travis CI to test my add-ons for the Galaxy Project as well. My Galaxy tools (see also Cock et al. 2013) were already being tested every night once uploaded to the Galaxy Tool Shed, and I always stage releases via the Galaxy Test Tool Shed before posting them on the main Galaxy Tool Shed. However, this fixed nightly schedule isn't very flexible for debugging failures.

- Galaxy BLAST tools
- Galaxy sequence analysis tools

I currently have Travis CI working for my two Galaxy tool repositories on GitHub. Both configurations follow the same basic approach, which I have tried to explain in this post, and the tests run as soon as I push updates to GitHub.


Pixelated Posters at Potatoes in Practice

Yesterday I attended the annual "Potatoes in Practice" meeting for the first time, mainly to see the finished display which I helped produce. Here it is, showing the twelve chromosomes of potato, drawn as stylized uniform green 'X' shapes, with different colour LEDs marking traits of interest for potato breeding.

Potato chromosomes 1 to 6, and 7 to 12, with LEDs marking traits of interest

My contribution was the background images, which are actually drawn using the bases of the potato genome instead of pixels. For this I wrote a little Python script to render photos using A, C, G and T from a FASTA sequence file, using Biopython to load the sequences, the Python Imaging Library (PIL) to load the photos, NumPy to manipulate the array, and ReportLab to render a PDF.
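The actual script leaned on PIL, NumPy and ReportLab to produce a PDF, but the core idea fits in a few lines of plain Python. This stdlib-only sketch is my own simplified illustration (render_ascii and its uppercase/lowercase shading rule are not from the real script): consume the sequence one base per pixel, and let each pixel's brightness decide how that base is drawn, so the letters themselves trace out the image.

```python
def render_ascii(gray, sequence):
    """Render a grayscale image as lines of sequence letters.

    gray     -- 2D list of 0-255 brightness values, one per pixel
    sequence -- string of bases (A/C/G/T) consumed left-to-right,
                top-to-bottom, one base per pixel
    Dark pixels become uppercase bases, light pixels lowercase.
    """
    bases = iter(sequence)
    rows = []
    for row in gray:
        chars = []
        for value in row:
            base = next(bases)
            chars.append(base.upper() if value < 128 else base.lower())
        rows.append("".join(chars))
    return rows
```

In the real poster images the brightness instead controlled the colour of each rendered letter, but the mapping from (pixel, next base) to a drawn character is the same.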

Potato chromosomes 9 and 10
Close-up showing the A, C, G, T pixels


Markup support for a Python project

Markup languages allow you to write a plain text input file which is then processed to produce nicely formatted output - as HTML, PDF or similar. The plain text nature of the input is perfect for tracking under version control, which richer binary formats are not suited to. So what's the current best markup choice for a Python project? It looks like reStructuredText (*.rst).


Random access to blocked XZ format (BXZF)

I've written about random access to the blocked GZIP variant BGZF used in BAM, and looked at random access to BZIP2, but here I'm looking at XZ files which are based on LZMA compression. This was prompted by the release of Python 3.3 which includes the lzma module to support XZ files, which I then back-ported to offer lzma for Python 2.6 or later. Over the Christmas / New Year break I extended this to handle Blocked XZ Format (BXZF for short), so publishing this post is a bit overdue.
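To give a feel for the trick, here's a minimal stdlib-only sketch of the same idea - not the actual BXZF implementation (which works with XZ's internal block structure and index), and both function names are mine. It exploits the fact that concatenated XZ streams still form a valid .xz file, so if each chunk is compressed independently and we record where each compressed piece starts, we can decompress just the piece we want:

```python
import lzma

def compress_blocked(chunks):
    """Compress each chunk as an independent XZ stream, concatenated.

    Returns (data, index) where index holds the (offset, size) of each
    compressed piece, enabling random access to any chunk.
    """
    data = bytearray()
    index = []
    for chunk in chunks:
        block = lzma.compress(chunk)
        index.append((len(data), len(block)))
        data += block
    return bytes(data), index

def read_block(data, index, i):
    """Random access: decompress only block i using its recorded offset."""
    offset, size = index[i]
    return lzma.decompress(data[offset:offset + size])
```

The cost is slightly worse compression (each block restarts the compressor), which is the same trade-off BGZF makes for BAM.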


Free for non-commercial academic use only sucks

Scientific software released for free under an "academic only"/"non-commercial" licence really irks me, even if the source code is included. This is mainly because it isn't open source, and I believe science should be open, but it also causes a whole lot of practical headaches. Right now I'm particularly disappointed by the GATK licensing debacle from The Broad Institute - and I'm not alone in this; for example, Mick Watson has just blogged about why the GATK re-licensing matters, and there has been plenty of comment on Twitter.