|Memory and CPU usage (until killed by OS)|
The screenshots are from the cluster monitoring tool Ganglia; the horizontal red lines show the machine's physical limits (here 4 CPUs and 8 GB of RAM).
This example uses BLAST+ with a subject FASTA file rather than a pre-compiled BLAST database because that is easier, the original search was expected to be a one-off, and I expected only a relatively modest performance hit. That turned out to be naive.
The cluster node is running 64-bit CentOS Linux, and has 8 GB of RAM. The BLAST+ tools were version 2.2.25+ (then updated to the current release 2.2.26+ and retested), using the pre-compiled binaries from the NCBI FTP site.
The query FASTA file held 1420 nucleotide sequences (ESTs) and the subject FASTA file held 20359 protein sequences (predicted genes), so this was a BLASTX search. This is actually a subset of the real query set; I wanted something more modest for exploring this issue.
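As an aside, counting the records in a FASTA file and cutting out a smaller test set can be sketched in a few lines of shell. The function names `count_fasta` and `head_fasta` and the file name `all_queries.fasta` are purely illustrative, and simply taking the first N records is an assumption; it is not necessarily how the subset above was chosen.

```shell
# Count the records in a FASTA file (one '>' header line per record).
count_fasta() {
    grep -c '^>' "$1"
}

# Print the first N records of a FASTA file (each header plus its sequence lines).
# Usage: head_fasta N file.fasta
head_fasta() {
    awk -v n="$1" '/^>/ { seen++ } seen > n { exit } { print }' "$2"
}
```

For example, `head_fasta 1420 all_queries.fasta > q.fasta` would write the first 1420 records to `q.fasta`, and `count_fasta q.fasta` would confirm the count.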
My colleague observed similar issues with BLASTP comparing bacterial gene sets (so about 5000 proteins versus 5000 proteins).
Failing Approach - FASTA subject with XML out
The command run was:
$ blastx -query q.fasta -subject s.fasta -evalue 0.0001 -out blastx.xml -outfmt 5 -num_threads 4
Note that at the time of writing the multi-threading option is disabled when using a subject file. Back in 2010 I found BLAST 2.2.24+ could crash with a bus error or segmentation fault when used in this way. To avoid this, the NCBI disabled this feature in BLAST 2.2.25+ and it is still disabled in 2.2.26+ as well. I'm still hoping for a proper fix.
That ran for about 20 minutes, consuming about 20 GB of RAM and swap before getting killed by the OS (error 9). Interestingly, no output was written to the XML file at all: either something stupid is going on with caching the output in memory, or it didn't even manage to search one query sequence!
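One way to stop a runaway job like this from dragging a node into swap is to cap its virtual memory before launching it. The following is only a sketch, assuming a Linux shell where `ulimit -v` is supported; the helper name `run_capped` and the 6 GB figure in the usage note are arbitrary illustrations, not something used in the runs described here.

```shell
# Run a command with a cap on its virtual memory (given in kilobytes), so a
# runaway process fails on allocation instead of filling RAM and swap.
# Usage: run_capped LIMIT_KB command [args...]
run_capped() {
    limit_kb="$1"
    shift
    # The subshell keeps the limit from applying to the calling shell.
    ( ulimit -v "$limit_kb" && "$@" )
}
```

For example, `run_capped 6000000 blastx -query q.fasta -subject s.fasta -evalue 0.0001 -out blastx.xml -outfmt 5` would limit the search to roughly 6 GB of address space. Note that the cap applies to virtual memory, which can be larger than the resident memory shown in the monitoring graphs, so the limit may need some headroom.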
Workaround One - Make a BLAST database
$ makeblastdb -dbtype prot -in s.fasta
Building a new DB, current time: 04/18/2012 14:35:45
New DB name: s.fasta
New DB title: s.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 20359 sequences in 0.88878 seconds.
$ blastx -query q.fasta -db s.fasta -evalue 0.0001 -out blastx_db.xml -outfmt 5 -num_threads 4
That finished in just 3 minutes, giving an 11MB XML file. It didn't seem to take advantage of all four CPUs on the machine though - only two seem to have been used.
|Using BLAST database instead (right)|
Workaround Two - Use tabular output not XML
$ blastx -query q.fasta -subject s.fasta -evalue 0.0001 -out blastx.tabstd.txt -outfmt 6 -num_threads 4
That finished in 24 minutes, far slower than using a database (not helped by subject mode disabling multi-threading), and the smaller tabular file (917K) lacks much of the information in the XML.
As you can see from the cluster monitoring program, memory usage during this run was minimal, under 1 GB. This shows the memory problem is specific to XML output when using a subject FASTA file, and is not in the core BLAST search.
|Using subject FASTA & tabular output (right)|
Combined Workaround - Using BLAST database and tabular output
$ blastx -query q.fasta -db s.fasta -evalue 0.0001 -out blastx_db.tabstd.txt -outfmt 6 -num_threads 4
Again, about 3 minutes, and not maxing out all four CPUs. The profile looks very like the XML output version shown above.
I ran a few more trials for time comparisons (to the nearest minute) with and without multiple threads, all using BLAST 2.2.25+ or the current release BLAST 2.2.26+ (same times):
|Output format||XML||XML||Tabular||Tabular|
|Threads||1 thread||4 threads||1 thread||4 threads|
|Subject FASTA file||Killed||Disabled||24 mins||Disabled|
|BLAST database||3 mins||3 mins||3 mins||3 mins|
It is surprising to see that on this dataset multi-threading makes almost no difference to the BLASTX run time; it was faster, but only marginally. The fact that this is still disabled with a subject FASTA file is therefore not such an issue here.
However, the miserably slow performance using a large subject FASTA file, and this apparent XML memory leak, render the subject FASTA option most unattractive. I shall probably go back to wrapper scripts which create or update BLAST databases before calling BLAST itself, which was what I had been assuming the new option might be doing internally. Clearly not.
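A minimal sketch of such a wrapper follows. It is an illustration rather than the actual scripts referred to above: the function names `db_is_stale` and `blastx_via_db` are made up, and it assumes a small protein database that `makeblastdb` writes as a single volume next to the FASTA file (so `s.fasta.phr` exists after a build).

```shell
# Succeed (return 0) if the protein BLAST database for FASTA file $1 is
# missing, or older than the FASTA file itself.
db_is_stale() {
    fasta="$1"
    # .phr is one of the index files makeblastdb writes for a protein database;
    # -nt is true when the FASTA file is newer than that index file.
    [ ! -e "$fasta.phr" ] || [ "$fasta" -nt "$fasta.phr" ]
}

# Usage: blastx_via_db query.fasta subject.fasta [extra blastx arguments...]
blastx_via_db() {
    query="$1"
    subject="$2"
    shift 2
    # (Re)build the database only when needed, then search against it.
    if db_is_stale "$subject"; then
        makeblastdb -dbtype prot -in "$subject" || return 1
    fi
    blastx -query "$query" -db "$subject" "$@"
}
```

With this, `blastx_via_db q.fasta s.fasta -evalue 0.0001 -out blastx_db.xml -outfmt 5 -num_threads 4` would run the same search as Workaround One, rebuilding the database first only if `s.fasta` had changed.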
I have reported this to the NCBI BLAST email address.
Update (Thursday 20 April)
The BLAST developers are looking into the memory issue.