2024-06-07

FASTQ uploads to ENA FTP site with rclone

I've recently been working on what I considered to be a large-scale FASTQ upload to the European Nucleotide Archive (ENA), from where it will be mirrored to the NCBI Short or Sequence Read Archive (SRA). Although the total size was "only" 37GB, this was about 3500 pairs of Illumina MiSeq FASTQ files - more than enough to make me worry about the job being interrupted and needing to resume without repeating uploads.

2024-02-02

BLAST max-target-seq meets metabarcoding

This is my first blog post in years - primarily down to a second child who is now a toddler. And what better topic to return to than a mainstay of past content, NCBI BLAST? This time with a motivating example from my recent work, metabarcoding. This is the term used for sequencing a diagnostic region of DNA using specific primers for a group of organisms of interest, and then matching that amplicon to a database of known species. Human interpretation of a BLAST search can generally make a good guess at the organism - weighing hits and annotated taxonomy (e.g. ignoring the odd suspicious uncultured "fungal" match).

This post is about how sometimes BLAST on the NCBI website can miss 100% identical (albeit not full length) matches, returning instead lots of very good but longer matches. Basically the online defaults don't suit this use-case.

2019-01-08

An overly aggressive optimization in BLASTN and MegaBLAST

All my recent blog posts have been looking at issues raised by the recent letter Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows", and the associated test case which they ought to have included with the letter itself. Thus far I have focussed on:


Over the Christmas break, there were two notable public developments. The BLAST team's reply was published, Madden et al. (2018) "Reply to the paper: Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows". Quoting from this:

"we examined the new example and it became clear that the demonstrated behavior was a bug, resulting from an overly aggressive optimization, introduced in 2012 for BLASTN and MegaBLAST (DNA-DNA alignments). This bug has been fixed in the BLAST+ 2.8.1 release, due out in December 2018. The aberrant behavior seems to occur only in alignments with an extremely large number of gaps, which is the case in the example provided by Shah and collaborators."

And BLAST+ v2.8.1 was released. Quoting the release notes,

"Disabled an overly aggressive optimization that caused problems mentioned by Shah et al."

So, what was this aggressive optimisation in BLASTN and MegaBLAST, and how did it combine with the internal candidate alignment limit and database order to produce counterintuitive results? It turns out to explain the details I'd not yet followed up from blog post part three.

2018-12-07

BLAST tie breaking by database order

My November blog posts discussing the BLAST+ tools' behaviour with an alignment limit setting (see What BLAST's max-target-sequences doesn't do, and the links from it) touched on database order, which comes into play as a tie breaker.

Well, how is the BLAST database order defined? It turns out to be the reverse of the FASTA file used with makeblastdb, or in other words: Last-in, First-out (LIFO).
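As a toy illustration of the LIFO idea, here is a minimal Python sketch. It is not how BLAST is implemented: the function names are hypothetical, and it assumes that among equally scoring hits the one earlier in database order is reported first.

```python
def database_order(fasta_ids):
    """Toy model: database order is the reverse of the FASTA input order (LIFO)."""
    return list(reversed(fasta_ids))


def pick_best(hits, fasta_ids):
    """Return the ID of the top hit from a dict mapping ID to score.

    Assumption for illustration only: ties on score are broken in favour
    of the sequence that comes first in database order.
    """
    rank = {seq_id: i for i, seq_id in enumerate(database_order(fasta_ids))}
    # Highest score first, then lowest database rank.
    return min(hits, key=lambda seq_id: (-hits[seq_id], rank[seq_id]))


fasta_ids = ["seqA", "seqB", "seqC"]  # order in the FASTA file given to makeblastdb
hits = {"seqA": 100, "seqB": 90, "seqC": 100}  # seqA and seqC tie
best = pick_best(hits, fasta_ids)
print(best)
```

Under these assumptions the tied hit that appeared last in the FASTA file (seqC) wins, which is why duplicated sequences in a database can give surprising top hits.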

2018-11-13

BLAST max alignment limits reply - part four

This is the fourth in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (one and two, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the Shah et al. (2018) test case. Thank you!

Their test case turns out to be using MEGABLAST (the default algorithm in the blastn binary), with a custom nucleotide BLAST database (the previous blog post examined this).

On the other hand, the original Dec 2015 -max_target_seqs bug report (and my earlier blog posts) used BLASTP with a protein BLAST database.

This is important because the internal limit on the number of alignments (N_i) that BLAST+ considers depends on one key setting: whether composition-based statistics (CBS) are being used. CBS is on by default with BLASTP, but not for MEGABLAST (i.e. the blastn binary).

The key point is that requesting N=1 alignments with otherwise default blastp settings gives an internal limit N_i = 2*N + 50 = 52, whereas with the blastn tool you get an internal alignment limit N_i = 10. Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm, but this emphasises why it is especially important to avoid duplicates in a nucleotide BLAST database.
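The arithmetic above can be written as a small Python sketch. The function name is hypothetical. The CBS branch uses the 2*N + 50 formula quoted above. The non-CBS branch is an assumption: only the N=1 case (N_i = 10) is stated here, so max(N, 10) is a guess at the general rule that happens to agree with it.

```python
def internal_alignment_limit(n, composition_based_stats):
    """Sketch of the internal alignment limit N_i for a requested limit N."""
    if composition_based_stats:
        # With CBS (the blastp default): N_i = 2*N + 50
        return 2 * n + 50
    # Without CBS (e.g. blastn/MEGABLAST): assumed floor of 10.
    # Only the N=1 case is confirmed in the text above.
    return max(n, 10)


print(internal_alignment_limit(1, composition_based_stats=True))   # 52
print(internal_alignment_limit(1, composition_based_stats=False))  # 10
```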