2012-04-18

BLAST+ memory hog with subject FASTA and XML output

We noticed a major memory problem running NCBI BLAST+ with XML output using a FASTA subject (consuming loads of swap space then getting killed). This doesn't happen with tabular output, nor if using a BLAST database:
[Figure: Memory and CPU usage (until killed by OS)]

The screenshots are from the cluster monitoring tool Ganglia; the horizontal red lines show the machine's physical limits (here 4 CPUs and 8 GB of RAM).

This example uses BLAST+ with a subject FASTA file rather than a pre-compiled BLAST database, because that is easier, the original search was expected to be a one-off, and I expected only a relatively modest performance hit. It turns out that was naive.

The cluster node is running 64-bit CentOS Linux, and has 8 GB of RAM. The BLAST+ tools were version 2.2.25+ (then updated to the current release 2.2.26+ and retested), using the pre-compiled binaries from the NCBI FTP site.

The query FASTA file was 1420 nucleotide sequences (ESTs) and the subject FASTA file was 20359 protein sequences (predicted genes), so this was a BLASTX search. This is actually a subset of the real query set, as I wanted something more modest for exploring this issue.

My colleague observed similar issues with BLASTP comparing bacterial gene sets (so about 5000 proteins versus 5000 proteins).


Failing Approach - FASTA subject with XML out

The command run was:

$ blastx -query q.fasta -subject s.fasta -evalue 0.0001 -out blastx.xml -outfmt 5 -num_threads 4

Note that at the time of writing the multi-threading option is disabled when using a subject file. Back in 2010 I found BLAST 2.2.24+ could crash with a bus error or segmentation fault when used in this way. To avoid this, the NCBI disabled this feature in BLAST 2.2.25+ and it is still disabled in 2.2.26+ as well. I'm still hoping for a proper fix.

That ran for about 20 minutes, consuming about 20 GB of RAM and swap before getting killed by the OS (error 9). Interestingly, no output was written to the XML file at all - either something stupid is going on with caching the output in memory, or it didn't even manage to search one query sequence!
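
One way to stop a runaway job like this from dragging a node into swap is to cap its address space before launching it. The following is a minimal sketch, assuming a shell whose `ulimit` builtin supports `-v` (virtual memory in kilobytes, as in bash and dash on Linux). `run_capped` is a hypothetical helper name, and how gracefully BLAST+ fails when it hits the cap is untested here.

```shell
# Hypothetical helper: run a command with a cap on its virtual memory.
# Usage: run_capped LIMIT_IN_KB command [args...]
run_capped() {
    limit_kb=$1
    shift
    # The subshell keeps the limit from affecting the calling shell.
    (
        ulimit -v "$limit_kb" || exit 1
        exec "$@"
    )
}

# Example (not run here): cap blastx at roughly 6 GB on the 8 GB node.
# run_capped 6000000 blastx -query q.fasta -subject s.fasta \
#     -evalue 0.0001 -out blastx.xml -outfmt 5
```

With a cap in place the search should fail on a memory allocation error rather than being killed by the OS after exhausting swap, which at least keeps the node usable for other jobs.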


Workaround One - Make a BLAST database

$ makeblastdb -dbtype prot -in s.fasta

Building a new DB, current time: 04/18/2012 14:35:45
New DB name:   s.fasta
New DB title:  s.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 20359 sequences in 0.88878 seconds.

$ blastx -query q.fasta -db s.fasta -evalue 0.0001 -out blastx_db.xml -outfmt 5 -num_threads 4

That finished in just 3 minutes, giving an 11MB XML file. It didn't seem to take advantage of all four CPUs on the machine though - only two seem to have been used.

[Figure: Using BLAST database instead (right)]

Workaround Two - Use tabular output not XML

$ blastx -query q.fasta -subject s.fasta -evalue 0.0001 -out blastx.tabstd.txt -outfmt 6 -num_threads 4

That finished in 24 minutes, far slower than using a database (not helped by subject mode disabling multi-threading), and the smaller tabular file (917K) lacks much of the information in the XML.

As you can see from the cluster monitoring program, memory usage during this run was minimal, under 1 GB. This suggests the memory problem lies in the XML output code rather than in the core BLAST search.

[Figure: Using subject FASTA & tabular output (right)]

Combined Workaround - Using BLAST database and tabular output

$ blastx -query q.fasta -db s.fasta -evalue 0.0001 -out blastx_db.tabstd.txt -outfmt 6 -num_threads 4

Again, about 3 minutes, and not maxing out all four CPUs. The profile looks very like the XML output version shown above.



Conclusion

I ran a few more trials for time comparisons (to the nearest minute) with and without multiple threads, all using BLAST 2.2.25+ or the current release BLAST 2.2.26+ (same times):

Format              XML                    Tabular
Threads             1 thread   4 threads   1 thread   4 threads
Subject FASTA file  Killed     Disabled    24 mins    Disabled
BLAST database      3 mins     3 mins      3 mins     3 mins

It is surprising to see that on this dataset multi-threading makes almost no difference to the BLASTX run time: it was faster, but only marginally. The fact that this is still disabled with a subject FASTA file is therefore not such an issue here.
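
Since BLAST's own threading gains so little here, an alternative is to split the query FASTA into chunks and run several single-threaded searches side by side. A rough sketch follows, assuming a well-formed FASTA file and tabular output (which can simply be concatenated afterwards, unlike XML). `split_fasta`, the chunk size and the `chunk` prefix are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: split a FASTA file into chunks of N records each.
# Usage: split_fasta INPUT.fasta RECORDS_PER_CHUNK OUTPUT_PREFIX
# Writes OUTPUT_PREFIX.000.fasta, OUTPUT_PREFIX.001.fasta, ...
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new chunk file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%03d.fasta", prefix, count / n)
            }
            count++
        }
        # Copy header and sequence lines into the current chunk.
        out != "" { print > out }
    ' "$1"
}

# Example (not run here): four chunks of 355 queries, searched in parallel.
# split_fasta q.fasta 355 chunk
# for f in chunk.*.fasta; do
#     blastx -query "$f" -db s.fasta -evalue 0.0001 -outfmt 6 -out "$f.tab" &
# done
# wait
# cat chunk.*.fasta.tab > blastx_db.tabstd.txt
```

Chunking by record count is the simplest choice. If the query lengths vary a lot, balancing chunks by total sequence length might spread the work more evenly.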

However, the miserably slow performance using a large subject FASTA file, together with this apparent XML memory leak, makes the subject FASTA option most unattractive. I shall probably go back to wrapper scripts which create / update BLAST databases before calling BLAST itself - which was what I had been assuming the new option might be doing internally. Clearly not.
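
For illustration, such a wrapper might look something like the sketch below. It assumes a single-volume protein database built with the same makeblastdb call used above (so the index files are the FASTA name plus .phr, .pin and .psq), and a shell whose `[` supports the `-nt` file comparison (bash and dash do). `db_is_stale` and `blastx_with_db` are hypothetical names, not part of BLAST+.

```shell
# Hypothetical helper: succeed (exit 0) if the protein database built from
# FASTA file $1 is missing or older than the FASTA file itself.
# Assumes "makeblastdb -dbtype prot -in FILE" produced FILE.phr/.pin/.psq.
db_is_stale() {
    for ext in phr pin psq; do
        if [ ! -e "$1.$ext" ] || [ "$1" -nt "$1.$ext" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: rebuild the database only when needed, then search.
# Usage: blastx_with_db QUERY SUBJECT_FASTA [other blastx args...]
blastx_with_db() {
    query=$1
    subject=$2
    shift 2
    if db_is_stale "$subject"; then
        makeblastdb -dbtype prot -in "$subject" || return 1
    fi
    blastx -query "$query" -db "$subject" "$@"
}

# Example (not run here):
# blastx_with_db q.fasta s.fasta -evalue 0.0001 -out blastx_db.xml -outfmt 5
```

Comparing modification times is crude but cheap: with makeblastdb taking under a second on this subject file, an occasional unnecessary rebuild costs far less than the 24 minute subject FASTA search.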

I have reported this to the NCBI BLAST email address.

Update (Thursday 20 April)

The BLAST developers are looking into the memory issue.

3 comments:

  1. As far as I'm aware, only one of the three stages in BLAST is multi-threaded. I too have not found threading to give much overall speed increase. So it's still worth splitting your query file into smaller chunks and running them in parallel.

  2. Any idea how this approach compares to the old bl2seq in legacy BLAST (http://nebc.nerc.ac.uk/bioinformatics/docs/bl2seq.html)?

    I presume this would have been based off that, which to my knowledge does not internally produce a BLAST database.

  3. I have run many blasts with xml output. The biggest problem that I have seen with xml is that the program seems to hold all of the output in memory until the run is completely finished. This is why you didn't have any output in your xml file when your run failed. I am not sure why the programs work this way, but I would love to know if there is a way to change this behavior.
