2012-10-02

How not to deal with NGS data - MrFast & MrsFast

One of the first things a programmer dealing with 'Next Generation Sequencing' (NGS) aka 'High Throughput Sequencing' (HTSeq) data learns is to be very aware of memory limitations. You can't just load files into RAM when they are often gigabytes in size. Instead, where possible, you loop over a file (iterating over it record by record) or employ indexed random access. The authors of MrFast & MrsFast didn't do this.

I forget where I heard about the tools first, but the catchily named short read alignment tools Mr & Mrs Fast (and for colour-space, their brother Dr Fast) seemed worth trying out. The MrFast algorithm is described in this paper:

Dinov ID, Torri F, Macciardi F, Petrosyan P, Liu Z, Zamanyan A, Eggert P, Pierce J, Genco A, Knowles JA, Clark AP, Van Horn JD, Ames J, Kesselman C, Toga AW. (2011) Applications of the pipeline environment for visual informatics and genomics computations. BMC Bioinformatics 26;12:304. DOI: 10.1186/1471-2105-12-304

I'd initially tried out MrFast 2.5.5.0 on some small datasets, and it worked fine. Then I tried a full sized FASTQ file with 134,270,353 Illumina reads of 100bp - again with a small reference sequence:

@pjacock:
MrFast? More like MrGreedy! Killed it at 68GB of RAM aligning 12GB gzipped FASTQ file to tiny reference of 160,000bp #bioinformatics #fail

A few other folk replied on Twitter:

@stuka:
@pjacock great. I'm supposed to start using that and its wife for an upcoming project #bioinformatics #fail

@mwanger:
@pjacock My last run on MrFast was extremely slow...

Well, what was going on? I decided to look at the source code for MrFast 2.5.5.0 which is what I had been using - the latest release. I was gobsmacked:

@pjacock:
Reading MrFast source code: does it really start by loading all reads to be mapped in RAM! WTF? Plz tell me I'm wrong! #bioinformatics #fail

I would link to the files in their repository on the MrFast SourceForge page, but it is empty. I do hope they are using source control, even if they haven't been kind enough to share the history online.

Here's a snippet from file Reads.c which made me sit up:
int readAllReads(char *fileName1,
             char *fileName2,
             int compressed,
             unsigned char *fastq,
             unsigned char pairedEnd,
             Read **seqList,
             unsigned int *seqListSize)
{
...

  Read *list = NULL;

...

  // Counting the number of lines in the file
  while (readFirstSeq(dummy)) maxCnt++;

  if (!compressed)
    {
      rewind(_r_fp1);
    }
  else
    {
      gzrewind(_r_gzfp1);
    }

  // Calculating the Maximum # of sequences
  if (*fastq)
    {
      maxCnt /= 4;
    }
  else
    {
      maxCnt /= 2;
    }

...

  list = getMem(sizeof(Read)*maxCnt);

...

  *seqList = list;
  *seqListSize = seqCnt;

...

  return 1;
}

Their function getMem is defined in Common.c and simply calls the standard memory allocator, malloc. Translated from C into English, this is a function to read in a FASTA or FASTQ file. It scans through the file once to count the lines, then divides by two (FASTA) or four (FASTQ) to get an upper bound on the number of sequences. It then allocates enough memory to keep an array of that many reads in RAM, parses the file into this array, and returns the array.

Looking over the main entry point, this function is called almost immediately after parsing the command line arguments and starting their timer in file baseFAST.c:
int main(int argc, char *argv[])
{
  if (!parseCommandLine(argc, argv))
    return 1;

  configHashTable();

...

      char outputFileName[FILE_NAME_LENGTH];
      // Loading Sequences & Sampling Locations
      startTime = getTime();
      if (!readAllReads(seqFile1, seqFile2, seqCompressed, &seqFastq,
                        pairedEndMode, &seqList, &seqListSize))
        {
          return 1;
        }
...

i.e. Before starting any alignments, MrFast calls the function readAllReads and tries to load all the given FASTA/FASTQ reads into RAM. I checked and the same occurs in MrsFast v2.4.0.4 (the current release).

To me this is a major design flaw in Mr & Mrs Fast. Until it is fixed, I will be obliged to refer to the Mr & Mrs Fast tools as Mr & Mrs Greedy due to their incredible appetite for RAM ;)

[Images: Mr Rush (*) and Mr Greedy]
(*) = There is no Mr Men character called Mr Fast

Bug reported here: 3573811 - MrFast & MrsFast both attempt to load all reads into RAM

On the bright side, having pre-filtered my reads I think I can still use MrFast for the analysis I have in mind, but I will definitely be trying alternatives for comparison.

P.S. The Mr Men cartoon characters were created by Roger Hargreaves, and their images here are used without permission. Please see http://www.mrmen.com/ for details.

Update

One of the authors replied on the bug report. Since the tools are currently single-threaded, they break up their read files and run multiple instances in order to take advantage of multiple cores. As a side effect, this brings down the total memory per job as well. So there's a straightforward practical workaround.

[Images: Mr Uppity and Mr Happy]

5 comments:

  1. Hi,

    I really like your blog.

    > Instead where possible you loop over a file (iterating over it record by record) or employ indexed random access.

    There is a design pattern called "Lazy loading" for that.

    http://en.wikipedia.org/wiki/Lazy_loading

    ReplyDelete
    Replies
    1. Glad you like the blog :)

      Personally I would use the term "lazy loading" for tasks where you may never need to look at a piece of data. Here with read mapping, the program will need to look at every read eventually.

      Delete
    2. > Personally I would use the term "lazy loading" for tasks where you may never need to look at a piece of data.

      That's a case where on-demand resource provisioning is very good: the client asks for a piece of data but never actually looks at it (a false promise).

      > Here with read mapping, the program will need to look at every read eventually.

      If I am not mistaken, bwa is quite efficient and loads individual chunks of data from input files. From the outside, this can be seen as lazy loading too. It works well because objects are touched sequentially.

      We use lazy loading in Ray. It's basically just a class that sits between a data file and an operator class, providing on-demand data.


      According to Wikipedia, these are the four types of lazy loading:

      > lazy initialization;

      object construction is deferred until first use

      > a virtual proxy;

      pretty much like lazy initialization, but with an object wrapping the true object behind the same interface

      > a ghost

      uses a partial state -- this can be used to partially load a FASTQ file

      > a value holder

      an object to which lazy loading is delegated

      It's also related somehow to copy-on-write (except that here it's more like copy-on-read)

      http://en.wikipedia.org/wiki/Copy-on-write


      You don't provide the resource unless the customer/client requests it. Other objects can remain in the warehouse (fastq files).

      Delete
  2. > Since the tools are currently single-threaded, they break up their read files and run multiple instances in order to take advantage of multiple cores.

    This part should be done by the software tool, not by its end user.

    ReplyDelete
    Replies
    1. Ideally yes, although all too often in bioinformatics this kind of software optimisation isn't done in the original code base :(

      Delete