2012-10-02

How not to deal with NGS data - MrFast & MrsFast

One of the first things a programmer dealing with 'Next Generation Sequencing' (NGS) aka 'High Throughput Sequencing' (HTSeq) data learns is to be very aware of memory limitations. You can't just load files into RAM when they are often gigabytes in size. Instead, where possible, you loop over a file (iterating over it record by record) or employ indexed random access. The authors of MrFast & MrsFast didn't do this.

I forget where I heard about the tools first, but the catchily named short read alignment tools Mr & Mrs Fast (and for colour-space, their brother Dr Fast) seemed worth trying out. The MrFast algorithm is described in this paper:

Dinov ID, Torri F, Macciardi F, Petrosyan P, Liu Z, Zamanyan A, Eggert P, Pierce J, Genco A, Knowles JA, Clark AP, Van Horn JD, Ames J, Kesselman C, Toga AW. (2011) Applications of the pipeline environment for visual informatics and genomics computations. BMC Bioinformatics 26;12:304. DOI: 10.1186/1471-2105-12-304

I'd initially tried out MrFast 2.5.5.0 on some small datasets, and it worked fine. Then I tried a full sized FASTQ file with 134,270,353 Illumina reads of 100bp - again with a small reference sequence:

@pjacock:
MrFast? More like MrGreedy! Killed it at 68GB of RAM aligning 12GB gzipped FASTQ file to tiny reference of 160,000bp #bioinformatics #fail

A few other folk replied on Twitter:

@stuka:
@pjacock great. I'm supposed to start using that and its wife for an upcoming project #bioinformatics #fail

@mwanger:
@pjacock My last run on MrFast was extremely slow...

Well, what was going on? I decided to look at the source code for MrFast 2.5.5.0 which is what I had been using - the latest release. I was gobsmacked:

@pjacock:
Reading MrFast source code: does it really start by loading all reads to be mapped in RAM! WTF? Plz tell me I'm wrong! #bioinformatics #fail

I would link to the files in their repository on the MrFast SourceForge page, but it is empty. I do hope they are using source control, even if they haven't been kind enough to share the history online.

Here's a snippet from file Reads.c which made me sit up:
int readAllReads(char *fileName1,
             char *fileName2,
             int compressed,
             unsigned char *fastq,
             unsigned char pairedEnd,
             Read **seqList,
             unsigned int *seqListSize)
{
...

  Read *list = NULL;

...

  // Counting the number of lines in the file
  while (readFirstSeq(dummy)) maxCnt++;

  if (!compressed)
    {
      rewind(_r_fp1);
    }
  else
    {
      gzrewind(_r_gzfp1);
    }

  // Calculating the Maximum # of sequences
  if (*fastq)
    {
      maxCnt /= 4;
    }
  else
    {
      maxCnt /= 2;
    }

...

  list = getMem(sizeof(Read)*maxCnt);

...

  *seqList = list;
  *seqListSize = seqCnt;

...

  return 1;
}

Their function getMem is defined in Common.c and simply calls the standard memory allocator, malloc. Translated from C into English, this is a function to read in a FASTA or FASTQ file. It scans through the file once to count the lines, then divides by two (FASTA) or four (FASTQ) to get an upper bound on the number of sequences. It then allocates enough memory to keep an array of that many reads in RAM, parses the file into this array, and returns the array.

Looking over the main entry point, this function is called almost immediately after parsing the command line arguments and starting their timer in file baseFAST.c:
int main(int argc, char *argv[])
{
  if (!parseCommandLine(argc, argv))
    return 1;

  configHashTable();

...

      char outputFileName[FILE_NAME_LENGTH];
      // Loading Sequences & Sampling Locations
      startTime = getTime();
      if (!readAllReads(seqFile1, seqFile2, seqCompressed, &seqFastq,
                        pairedEndMode, &seqList, &seqListSize))
        {
          return 1;
        }
...

i.e. Before starting any alignments, MrFast calls the function readAllReads and tries to load all the given FASTA/FASTQ reads into RAM. I checked and the same occurs in MrsFast v2.4.0.4 (the current release).

To me this is a major design flaw in Mr & Mrs Fast. Until it is fixed, I will be obliged to refer to the Mr & Mrs Fast tools as Mr & Mrs Greedy due to their incredible appetite for RAM ;)

[Images: Mr Rush (*) and Mr Greedy]
(*) = There is no Mr Men character called Mr Fast

Bug reported here: 3573811 - MrFast & MrsFast both attempt to load all reads into RAM

On the bright side, having pre-filtered my reads I think I can still use MrFast for the analysis I have in mind, but I will definitely be trying alternatives for comparison.

P.S. The Mr Men cartoon characters were created by Roger Hargreaves, and their images here are used without permission. Please see http://www.mrmen.com/ for details.

Update

One of the authors replied on the bug report. Since the tools are currently single-threaded, they break up their read files and run multiple instances in order to take advantage of multiple cores. As a side effect, this brings down the total memory per job as well. So there's a straightforward practical workaround.

[Images: Mr Uppity and Mr Happy]

5 comments:

  1. Hi,

    I really like your blog.

    > Instead where possible you loop over a file (iterating over it record by record) or employ indexed random access.

    There is a design pattern called "Lazy loading" for that.

    http://en.wikipedia.org/wiki/Lazy_loading

    ReplyDelete
    Replies
    1. Glad you like the blog :)

      Personally I would use the term "lazy loading" for tasks where you may never need to look at a piece of data. Here with read mapping, the program will need to look at every read eventually.

      Delete
    2. > Personally I would use the term "lazy loading" for tasks where you may never need to look at a piece of data.

      That's a case where on-demand resource provisioning is very good: the client asks for a piece of data but never actually looks at it (a false promise).

      > Here with read mapping, the program will need to look at every read eventually.

      If I am not mistaken, bwa is quite efficient and loads individual chunks of data from input files. From the outside, this can be seen as lazy loading too. It works well because objects are touched sequentially.

      We use lazy loading in Ray. It's basically just a class that sits between a data file and an operator class, providing on-demand data.


      According to Wikipedia, these are the four types of lazy loading:

      > lazy initialization;

      object construction is deferred until first use

      > a virtual proxy;

      pretty much like lazy initialization, but with an object wrapping the true object behind the same interface

      > a ghost

      uses a partial state -- this can be used to partially load a FASTQ file

      > a value holder

      an object to which lazy loading is delegated

      It's also related somehow to copy-on-write (except that here it's more like copy-on-read)

      http://en.wikipedia.org/wiki/Copy-on-write


      You don't provide the resource unless the customer/client requests it. Other objects can remain in the warehouse (fastq files).

      Delete
  2. > Since the tools are currently single-threaded, they break up their read files and run multiple instances in order to take advantage of multiple cores.

    This part should be done by the software tool, not by its end user.

    ReplyDelete
    Replies
    1. Ideally yes, although all too often in bioinformatics this kind of software optimisation isn't done in the original code base :(

      Delete