2012-10-30

My IDs not good enough for NCBI BLAST+

The blastdbcmd tool in the BLAST+ suite (replacing fastacmd in the C 'legacy' BLAST suite) lets you do a lot of clever things with a BLAST database. As long as you follow the baroque NCBI FASTA naming scheme you can do this with local BLAST databases too. However, if you don't want to bow down to the NCBI naming scheme (e.g. you want to use FASTA files directly from your favourite assembler), then blastdbcmd seems needlessly crippled.

Update (2 April 2013): Some changes in BLAST 2.2.28+ (released yesterday) seem intended to address these issues, but problems remain, which I intend to expand on later.

Update (20 April 2013): I found a quiet moment this weekend to update this post with the BLAST 2.2.28+ problems I was alluding to. There has been some progress on this issue with the new release, but it is flawed. See below.

Broken blastdbcmd for -target_only

This is just a quick post to document a bug in the blastdbcmd tool from the BLAST+ suite when used on the NR database with a full identifier and the -target_only option.

Update: See end of post, BLAST 2.2.28+ fixed this :)

2012-10-02

How not to deal with NGS data - MrFast & MrsFast

One of the first things a programmer dealing with 'Next Generation Sequencing' (NGS) aka 'High Throughput Sequencing' (HTSeq) data learns is to be very aware of memory limitations. You can't just go loading files into RAM when they are often gigabytes in size. Instead, where possible, you loop over a file (iterating over it record by record) or employ indexed random access. The authors of MrFast & MrsFast didn't do this.
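
To illustrate the record-by-record idea, here is a minimal Python sketch, assuming a plain FASTA input. The function name iter_fasta is hypothetical and made up for this example; it is not taken from MrFast, MrsFast, or any library. Only one record is held in memory at a time, however large the file is.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples one record at a time.

    `handle` is any iterable of text lines, e.g. an open file.
    Hypothetical helper for illustration only.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence line (possibly wrapped); ignore text before the first header.
            chunks.append(line.strip())
    # Emit the final record at end of input.
    if title is not None:
        yield title, "".join(chunks)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a multi-gigabyte file.
    demo = io.StringIO(">read1 example\nACGT\nAC\n>read2\nGG\n")
    total = 0
    for title, seq in iter_fasta(demo):
        total += len(seq)
    print("Total bases:", total)
```

With a real file you would pass an open file handle instead of the StringIO object, and the loop body would see each record in turn without the whole file ever being loaded.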