I've written about random access to BGZF, the blocked GZIP variant used in BAM, and looked at random access to BZIP2, but here I'm looking at XZ files, which are based on LZMA compression. This was prompted by the release of Python 3.3, which includes the lzma module to support XZ files, and which I then back-ported to offer lzma for Python 2.6 or later. Over the Christmas / New Year break I extended this to handle Blocked XZ Format (BXZF for short), so publishing this post is a bit overdue.
Bioinformatics lessons learned the hard way, bugs, gripes, and maybe topical paper reviews too...
2013-04-01
2013-01-28
Free for non-commercial academic use only sucks
Scientific software released for free under an "academic only"/"non-commercial" licence really irks me, even if the source code is included. This is mainly because it isn't open source, and I believe science should be open, but this also causes a whole lot of headaches. Right now I'm particularly disappointed by the GATK licensing debacle from The Broad Institute - and I'm not alone in this, for example Mick Watson has just blogged about why the GATK re-licensing matters and there has been plenty of comment on Twitter.
2012-10-30
My IDs not good enough for NCBI BLAST+
The blastdbcmd tool in the BLAST+ suite (replacing fastacmd in the C 'legacy' BLAST suite) lets you do a lot of clever things with a BLAST database. As long as you follow the baroque NCBI FASTA naming scheme you can do this with local BLAST databases too. However, if you don't want to bow down to the NCBI naming (e.g. use FASTA files directly from your favourite assembler), then blastdbcmd seems needlessly crippled.
Update (2 April 2013): Some changes in BLAST 2.2.28+ (released yesterday) seem to be intended to address these issues, but there remain problems with this which I intend to expand on later.
Update (20 April 2013): I found a quiet moment this weekend to update this post with the BLAST 2.2.28+ problems I was alluding to. There has been some progress on this issue with the new release, but it is flawed. See below.
Broken blastdbcmd for -target_only
This is just a quick post to document a bug in the blastdbcmd tool from the BLAST+ suite when used on the NR database with a full identifier and the -target_only option.
Update: See end of post, BLAST 2.2.28+ fixed this :)
2012-10-02
How not to deal with NGS data - MrFast & MrsFast
One of the first things a programmer dealing with 'Next Generation Sequencing' (NGS) aka 'High Throughput Sequencing' (HTSeq) data learns is to be very aware of memory limitations. You can't just go loading files into RAM when they are often gigabytes in size. Instead, where possible, you loop over a file (iterating over it record by record) or employ indexed random access. The authors of MrFast & MrsFast didn't do this.
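To make the "record by record" idea concrete, here is a minimal sketch in Python, assuming a plain FASTA file as input. The function names iter_fasta and count_bases are my own, made up for illustration; they do not come from MrFast, MrsFast or any library. The point is only that a generator holds one record in memory at a time, however large the file is.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples one at a time from a FASTA handle.

    Hypothetical helper for illustration: only the current record is
    held in memory, never the whole file.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header line closes off the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None and line:
            # Sequence line belonging to the current record.
            chunks.append(line)
    # Emit the final record once the file is exhausted.
    if title is not None:
        yield title, "".join(chunks)


def count_bases(handle):
    """Return (number of records, total sequence length) by streaming."""
    records = 0
    bases = 0
    for _title, seq in iter_fasta(handle):
        records += 1
        bases += len(seq)
    return records, bases


if __name__ == "__main__":
    # Tiny in-memory stand-in for a multi-gigabyte file on disk.
    example = io.StringIO(">read1\nACGT\nAC\n>read2\nGGG\n")
    print(count_bases(example))
```

With a real file you would pass an open file handle instead of the StringIO object. Memory use then depends on the longest single record, not on the size of the file.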