2024-02-02

BLAST max-target-seq meets metabarcoding

This is my first blog post in years - primarily down to a second child who is now a toddler. And what better topic to return to than a mainstay of past content than NCBI BLAST? This time with a motivating example from my recent work, metabarcoding. This is term used for sequencing a diagnostic region of DNA using specific primers for a group of organisms of interest, and then matching that amplicon to a database of known species. Human interpretation of a BLAST search can generally put a good guess as the organism - weighing hits and annotated taxonomy (e.g. ignoring the odd suspicious uncultured "fungal" match).

This post is about how sometimes BLAST on the NCBI website can miss 100% identical (albeit not full length) matches, returning instead lots of very good but longer matches. Basically the online defaults don't suit this use-case.

 

Background

We're using a region of the ITS1 gene using primers which target Phytophthora and related organisms, many of which are important plant pathogens. Some of these are quarantine class organisms, so we want to avoid false positives so the default classifier we use in my metabarcoding pipeline THAPBI PICT only declares a species level match for a perfect match or something 1bp different (a single base substitution, deletion, or insertion). In BLAST terms, this means over 99% identical over the full query length.

Now barcoding primers are of course designed to match conserved regions of a genome, yet span a variable region in order to get meaningfully different amplicons. It so happens that the first 32bp of our typical Phytophthora amplicons (immediately after the forward primer site) are also conserved - and importantly often missing in published ITS1 sequences. That means when using NCBI BLAST to check an amplicon, although we hope to see full length perfect matches, the most interesting sequences are often only about 85% of the query (due to the subject match in the database missing the first 32bp) but in the region of 99% to 100% identical. Importantly those hits may not be ranked first by the BLAST e-value or bitscore (but you can change the sort order online).

The other handicap for using BLAST in these context is there are lots of very very similar ITS1 sequences in the database, and while the NCBI does do some de-duplicating, the BLAST NT database is still full of duplicates, or near duplicates, of common barcoding sequences.

Readers of my past blog posts (e.g. the most recent) will anticipate how the -max-target-seqs setting now comes into play here. During the initial search, BLAST will find lots of good candidates, so many that the heuristic max-target-sequences will drop some. And this means it can sometimes drop the perfect but incomplete matches I am interested in, in favour of the imperfect full length matches (which in fairness do get a better overall score).

i.e. The default -max-target-seqs value can be too low for this use case.

Example

I have two example query sequences, both observed from multiple UK samples - uncultured but from the sequence almost certainly from Phytophthora:

>dfae766ff29a02c0521fea4ee7969dc2 Phytophthora
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAAC
CGTATCAACCCCTTAAAATTGGGGGCTTGCTCGGCGGCGTGCGTGCTGGCCTATAATGGG
TTGGTGTGCTGCTGCTGGGCGGGCTCTATCATGGGCGAGCGTTTGGGCTTCGGCTCGAGC
TAGTAGCTTTTTCTTTTAAACCCATTCTTTAATTACTGAAATACT

>3d3321eed13dba60899edfbb40cb7629 Phytophthora
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAAC
CGTATCAACCCTTTTAAATTGGGGGCTTCCGTCTGGCCGGCCGGTTCTCGGCTGGCTGGG
TGGCGGCTCTATCATGGCGACCGCCTGGGGCCTCGGCCTGGGCTAGTAGCGTATTTTTTA
AACCATTCCTAATTACTGAAAAAACT

They both start with TTTCCGTAGGTGAACCTGCGGAAGGATCATTA which is the typical conserved 32bp expected for Phytophthora amplicons, which we can remove giving two truncated queries:

>dfae76-truncated Phytophthora
CCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCCTTAAAATTGGGGGCTTGCTC
GGCGGCGTGCGTGCTGGCCTATAATGGGTTGGTGTGCTGCTGCTGGGCGGGCTCTATCAT
GGGCGAGCGTTTGGGCTTCGGCTCGAGCTAGTAGCTTTTTCTTTTAAACCCATTCTTTAA
TTACTGAAATACT

>3d3321e-truncated Phytophthora
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
CCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCTTTTAAATTGGGGGCTTCCGT
CTGGCCGGCCGGTTCTCGGCTGGCTGGGTGGCGGCTCTATCATGGCGACCGCCTGGGGCC
TCGGCCTGGGCTAGTAGCGTATTTTTTAAACCATTCCTAATTACTGAAAAAACT

Truncated queries

Putting those truncated queries into the NCBI BLAST website using BLASTN against the NT database, or the currently experimental Eukaryota NT (nt_euk) database, with default settings gives full length perfect matches. Specifically query dfae76 gives two perfect matches (shown as a single deuplicated alignment):

  • KT337856.1 Phytophthora cf. inundata D0S1P25 isolate D0S1P25-3
  • KT337858.1 Phytophthora cf. inundata D0S1P25 isolate D0S1P25-5

While for query 3d3321e there are five perfect matches (show as five alignments):

  • JF300264.1 Phytophthora drechsleri clone S7.Oak4-1F
  • JF300263.1 Phytophthora drechsleri clone S7.Oak4-1B
  • JF300257.1 Phytophthora drechsleri clone S7.Oak4-1G
  • JF300256.1 Phytophthora drechsleri clone S7.Oak4-1D
  • JF300255.1 Phytophthora drechsleri clone S7.Oak4-4A

i.e. Leaving the 32bp leader aside, perfect matches of my environmental Phytophthora sequences have been published from cultured isolates/clones, and we can confidentally assign species.

Full-length queries

Now, repeating that but using the full queries this time (and again default settings, including specifically leaving "Max target sequences" at the current web-blast default of 100), those matches vanish.


This uses the TSV output (format mode 6) with custom columns picking percentage identify first, which allows a simple search with grep to pull out any perfect (partial) matches. The output:

Query dfae766.fasta

Any perfect (partial) hits with -max_target_seqs 100?
NO - expected perfect partial hits are missed!

Any perfect (partial) hits with -max_target_seqs 250?
100.000    193    KT337858.1 Phytophthora cf. inundata D0S1P25 isolate
100.000    193    KT337856.1 Phytophthora cf. inundata D0S1P25 isolate
Yes

Query 3d3321e.fasta

Any perfect (partial) hits with -max_target_seqs 100?
NO - expected perfect partial hits are missed!

Any perfect (partial) hits with -max_target_seqs 250?
100.000    174    JF300264.1 Phytophthora drechsleri clone S7.Oak4-1F
100.000    174    JF300263.1 Phytophthora drechsleri clone S7.Oak4-1B
100.000    174    JF300257.1 Phytophthora drechsleri clone S7.Oak4-1G
100.000    174    JF300256.1 Phytophthora drechsleri clone S7.Oak4-1D
100.000    174    JF300255.1 Phytophthora drechsleri clone S7.Oak4-4A
Yes

Note that the command line default of 500 is comfortably above what is needed for this example to "work", and it is unfortunate that the online default of 100 is too low for this niche use-case.

Conclusion

So, for this particular use case, given the first 32bp of my amplicons is highly conserved but often omitting in the ITS1 fragments in the public databases, I should consider running a second BLAST search with that removed - or remember to bump up the "Max target sequences" option - before concluding that I have a truely novel Phytophthora amplicon.

No comments:

Post a Comment