Blasted Bioinformatics!? Bioinformatics lessons learned the hard way, bugs, gripes, and maybe topical paper reviews too... By Peter Cock.<br /><br />BLAST max-target-seq meets metabarcoding (2 February 2024)<br /><p>This is my first blog post in years - primarily down to a second child who is now a toddler. And what better topic to return to than a mainstay of past content, NCBI BLAST? This time with a motivating example from my recent work, metabarcoding. This is the term used for sequencing a diagnostic region of DNA using specific primers for a group of organisms of interest, and then matching that amplicon to a database of known species. Human interpretation of a BLAST search can generally make a good guess at the organism - weighing hits and annotated taxonomy (e.g. ignoring the odd suspicious uncultured "fungal" match).</p><p>This post is about how sometimes BLAST on the NCBI website can miss 100% identical (albeit not full length) matches, returning instead lots of very good but longer matches. Basically the online defaults don't suit this use-case.</p><span><a name='more'></a></span><p> </p><h3 style="text-align: left;">Background<br /></h3><p>We're using a region of the ITS1 gene using primers which target <i>Phytophthora</i> and related organisms, many of which are important plant pathogens. Some of these are quarantine class organisms, so we want to avoid false positives, and therefore the default classifier we use in my <a href="https://github.com/peterjc/thapbi-pict">metabarcoding pipeline THAPBI PICT</a> only declares a species level match for a perfect match or something 1bp different (a single base substitution, deletion, or insertion). 
In BLAST terms, this means over 99% identical over the full query length.</p><p>Now barcoding primers are of course designed to match conserved regions of a genome, yet span a variable region in order to get meaningfully different amplicons. It so happens that the first 32bp of our typical <i>Phytophthora</i> amplicons (immediately after the forward primer site) are also conserved - and importantly often missing in published ITS1 sequences. That means when using NCBI BLAST to check an amplicon, although we hope to see full length perfect matches, the most interesting hits often cover only about 85% of the query (due to the subject match in the database missing the first 32bp) but are in the region of 99% to 100% identical. Importantly those hits may not be ranked first by the BLAST e-value or bitscore (but you can change the sort order online).</p><p>The other handicap for using BLAST in this context is that there are lots of very similar ITS1 sequences in the database, and while the NCBI does do some de-duplicating, the BLAST NT database is still full of duplicates, or near duplicates, of common barcoding sequences.</p><p>Readers of my past blog posts (e.g. <a href="https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html">the most recent</a>) will anticipate how the <span style="font-family: courier;">-max_target_seqs</span> setting now comes into play here. During the initial search, BLAST will find lots of good candidates, so many that the max target sequences heuristic will drop some. And this means it can sometimes drop the perfect but incomplete matches I am interested in, in favour of the imperfect full length matches (which in fairness do get a better overall score).</p><p>i.e. 
The default <span style="font-family: courier;">-max-target-seqs</span> value can be too low for this use case.</p><h3 style="text-align: left;">Example</h3><p>I have two example query sequences, both observed from multiple UK samples - uncultured but from the sequence almost certainly from Phytophthora:</p><p><span style="font-family: courier;">>dfae766ff29a02c0521fea4ee7969dc2 Phytophthora<br />TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAAC<br />CGTATCAACCCCTTAAAATTGGGGGCTTGCTCGGCGGCGTGCGTGCTGGCCTATAATGGG<br />TTGGTGTGCTGCTGCTGGGCGGGCTCTATCATGGGCGAGCGTTTGGGCTTCGGCTCGAGC<br />TAGTAGCTTTTTCTTTTAAACCCATTCTTTAATTACTGAAATACT<br /><br />>3d3321eed13dba60899edfbb40cb7629 Phytophthora<br />TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAAC<br />CGTATCAACCCTTTTAAATTGGGGGCTTCCGTCTGGCCGGCCGGTTCTCGGCTGGCTGGG<br />TGGCGGCTCTATCATGGCGACCGCCTGGGGCCTCGGCCTGGGCTAGTAGCGTATTTTTTA<br />AACCATTCCTAATTACTGAAAAAACT<br /></span></p><p></p><p>They both start with <code><span style="font-family: courier;">TTTCCGTAGGTGAACCTGCGGAAGGATCATTA</span> </code>which is the typical conserved 32bp expected for Phytophthora amplicons, which we can remove giving two truncated queries:</p><p><span style="font-family: courier;">>dfae76-truncated Phytophthora<br />CCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCCTTAAAATTGGGGGCTTGCTC<br />GGCGGCGTGCGTGCTGGCCTATAATGGGTTGGTGTGCTGCTGCTGGGCGGGCTCTATCAT<br />GGGCGAGCGTTTGGGCTTCGGCTCGAGCTAGTAGCTTTTTCTTTTAAACCCATTCTTTAA<br />TTACTGAAATACT<br /><br />>3d3321e-truncated Phytophthora<br />xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx<br />CCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCTTTTAAATTGGGGGCTTCCGT<br />CTGGCCGGCCGGTTCTCGGCTGGCTGGGTGGCGGCTCTATCATGGCGACCGCCTGGGGCC<br />TCGGCCTGGGCTAGTAGCGTATTTTTTAAACCATTCCTAATTACTGAAAAAACT<br /><br /></span></p><h3 style="text-align: left;">Truncated queries<br /></h3><p style="text-align: left;">Putting those truncated queries into the NCBI BLAST website using BLASTN against the NT database, or the 
currently experimental Eukaryota NT (nt_euk) database, with default settings gives full length perfect matches. Specifically query dfae76 gives two perfect matches (shown as a single deduplicated alignment):<br /></p><ul style="text-align: left;"><li style="text-align: left;"><a href="https://www.ncbi.nlm.nih.gov/nucleotide/KT337856.1?report=genbank&log$=nuclalign&blast_rank=1&RID=VTMWM4DH016" rel="nofollow" target="lnkVTMWM4DH016" title="Show report for KT337856.1">KT337856.1</a> Phytophthora cf. inundata D0S1P25 isolate D0S1P25-3</li><li><a href="https://www.ncbi.nlm.nih.gov/nucleotide/KT337858.1?report=genbank&log$=nuclalign&blast_rank=1&RID=VTMWM4DH016" rel="nofollow" target="lnkVTMWM4DH016" title="Show report for KT337858.1">KT337858.1</a> Phytophthora cf. inundata D0S1P25 isolate D0S1P25-5</li></ul><p style="text-align: left;"> While for query 3d3321e there are five perfect matches (shown as five alignments):<br /></p><ul><li style="text-align: left;"><a href="https://www.ncbi.nlm.nih.gov/nucleotide/JF300264.1?report=genbank&log$=nuclalign&blast_rank=1&RID=VTMWM4DH016" rel="nofollow">JF300264.1</a> Phytophthora drechsleri clone S7.Oak4-1F</li><li style="text-align: left;"><a href="https://www.ncbi.nlm.nih.gov/nucleotide/JF300263.1?report=genbank&log$=nuclalign&blast_rank=2&RID=VTMWM4DH016" rel="nofollow">JF300263.1</a> Phytophthora drechsleri clone S7.Oak4-1B</li><li style="text-align: left;"><a href="https://www.ncbi.nlm.nih.gov/nucleotide/JF300257.1?report=genbank&log$=nuclalign&blast_rank=3&RID=VTMWM4DH016" rel="nofollow">JF300257.1</a> Phytophthora drechsleri clone S7.Oak4-1G</li><li style="text-align: left;"><a href="https://www.ncbi.nlm.nih.gov/nucleotide/JF300256.1?report=genbank&log$=nuclalign&blast_rank=4&RID=VTMWM4DH016" rel="nofollow">JF300256.1</a> Phytophthora drechsleri clone S7.Oak4-1D</li><li style="text-align: left;"><a href="https://www.ncbi.nlm.nih.gov/nucleotide/JF300255.1?report=genbank&log$=nuclalign&blast_rank=5&RID=VTMWM4DH016" 
rel="nofollow">JF300255.1</a> Phytophthora drechsleri clone S7.Oak4-4A</li></ul><p style="text-align: left;">i.e. Leaving the 32bp leader aside, perfect matches of my environmental <i>Phytophthora</i> sequences have been published from cultured isolates/clones, and we can confidently assign species.<br /></p><h3 style="text-align: left;">Full-length queries<br /></h3><p>Now, repeating that but using the full queries this time (and again default settings, including specifically leaving "Max target sequences" at the current web-blast default of 100), those matches vanish.<label><label><label><label><label><br /></label></label></label></label></label></p><p><label><label><label><label><label>This time the top hit for query dfae76 by e-value, bitscore, or percentage identity is a full length match but only 223/225 identical (99%):</label></label></label></label></label></p><ul style="text-align: left;"><li><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/AY995392.1?report=genbank&log$=nuclalign&blast_rank=1&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for AY995392.1">AY995392.1</a><span class=" r"><label> </label></span><label>Phytophthora inundata isolate P756 (BO-2)<br /></label></label></label></label></label></label></li></ul><p><label><label><label><label><label><label> And for query 3d3321e, this time a tied best match, full length but only 204/206 identical (99%):</label></label></label></label></label></label></p><ul style="text-align: left;"><li><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/KJ507657.1?report=genbank&log$=nuclalign&blast_rank=1&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for KJ507657.1">KJ507657.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri isolate 1092RN</label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><a 
href="https://www.ncbi.nlm.nih.gov/nucleotide/GU111633.1?report=genbank&log$=nuclalign&blast_rank=2&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for GU111633.1">GU111633.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri strain TARI 98067 </label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/GU111632.1?report=genbank&log$=nuclalign&blast_rank=3&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for GU111632.1">GU111632.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri strain TARI 97098 </label></label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/GU111630.1?report=genbank&log$=nuclalign&blast_rank=4&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for GU111630.1">GU111630.1</a><span class=" r"><label></label></span> Phytophthora drechsleri strain TARI 28322 </label></label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/GU111629.1?report=genbank&log$=nuclalign&blast_rank=5&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for GU111629.1">GU111629.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri strain TARI 26082 </label></label></label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/GU111628.1?report=genbank&log$=nuclalign&blast_rank=6&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for GU111628.1">GU111628.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri strain TARI 27221 
</label></label></label></label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/GU111627.1?report=genbank&log$=nuclalign&blast_rank=7&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for GU111627.1">GU111627.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri strain TARI 25210 </label></label></label></label></label></label></label></label></label></label></label></li><li><label><label><label><label><label><label><label><label><label><label><label><a href="https://www.ncbi.nlm.nih.gov/nucleotide/EU194428.1?report=genbank&log$=nuclalign&blast_rank=8&RID=VTNNKTNS016" target="lnkVTNNKTNS016" title="Show report for EU194428.1">EU194428.1</a><span class=" r"><label> </label></span><label>Phytophthora drechsleri isolate PS-43</label> </label></label></label></label></label></label></label></label></label></label></label></li></ul><p><label><label><label><label><label>That the best hit by bitscore or e-value changed is no surprise: these are longer matches.<br /><br />While the top matches are still from the same species (<span class=" r"><label></label></span></label></label></label></label></label><i><label>Phytophthora </label></i><label><label><label><label><label><i>inundata</i> and <span class=" r"><label></label></span></label></label></label></label></label><i><label>Phytophthora </label></i><label><label><label><label><label><i>drechsleri</i>), it appears that our environmental sequences are at least 2bp different from anything published. 
The gotcha is that the shorter 100% identical matches are missing.</label></label></label></label></label></p><h3 style="text-align: left;"><label><label><label><label><label>Non-default settings</label></label></label></label></label><br /></h3><p><label><label><label><label><label>To see the shorter but 100% identical matches in the results, we need to raise the "Max target sequences" setting. The web-blast interface currently offers values of 10, 50, 100 (default), 250, 500, 1000 and 5000. For these two queries, increasing that to 250 is enough - the missing hits are now present, although a long way down the default sort order.<br /></label></label></label></label></label></p><h3 style="text-align: left;"><label><label><label><label><label>Off-line testing</label></label></label></label></label></h3><p><label><label><label><label><label>If running BLASTN at the command line, the equivalent setting is <span style="font-family: courier;">-max_target_seqs</span> and it defaults to 500. 
The online default is likely a more aggressive optimization to reduce computational load on the NCBI servers.</label></label></label></label></label></p><p><label><label><label><label><label>I was able to reproduce this locally using a custom database of 17717 <i>Phytophthora</i> ITS1 sequences downloaded via Entrez Direct on 1 Feb 2024 as follows:</label></label></label></label></label></p><p><span style="font-family: courier;"><label><label><label><label><label>$ esearch -db nuccore -sort accession \<br /> -query "Phytophthora[organism] \<br /> AND ((internal AND transcribed AND spacer) OR its1)\<br /> AND 150:10000[sequence length]" > search.xml<br />$ efetch -format fasta < search.xml > search.fasta<br />$ makeblastdb -in search.fasta -dbtype nucl -out search<br />$ for QUERY in dfae766.fasta 3d3321e.fasta; do<br /> echo<br /> echo "Query $QUERY"<br /> for MAX in 100 250; do<br /> echo<br /> echo "Any perfect (partial) hits with -max_target_seqs $MAX?"<br /> if blastn -db search -query "$QUERY" \<br /> -outfmt "6 pident length stitle" \<br /> -max_target_seqs "$MAX" \<br /> | cut -c 1-64 | grep -E "^100\.0" ; then<br /> echo "Yes"<br /> else<br /> echo "NO - expected perfect partial hits are missed!"<br /> fi<br /> done<br />done</label></label></label></label></label></span></p><p>This uses the TSV output (format mode 6) with custom columns, putting percentage identity first, which allows a simple search with grep to pull out any perfect (partial) matches. The output:</p><span style="font-family: courier;">Query dfae766.fasta<br /><br />Any perfect (partial) hits with -max_target_seqs 100?<br />NO - expected perfect partial hits are missed!<br /><br />Any perfect (partial) hits with -max_target_seqs 250?<br />100.000 193 KT337858.1 Phytophthora cf. inundata D0S1P25 isolate<br />100.000 193 KT337856.1 Phytophthora cf. 
inundata D0S1P25 isolate<br />Yes<br /><br />Query 3d3321e.fasta<br /><br />Any perfect (partial) hits with -max_target_seqs 100?<br />NO - expected perfect partial hits are missed!<br /><br />Any perfect (partial) hits with -max_target_seqs 250?<br />100.000 174 JF300264.1 Phytophthora drechsleri clone S7.Oak4-1F <br />100.000 174 JF300263.1 Phytophthora drechsleri clone S7.Oak4-1B <br />100.000 174 JF300257.1 Phytophthora drechsleri clone S7.Oak4-1G <br />100.000 174 JF300256.1 Phytophthora drechsleri clone S7.Oak4-1D <br />100.000 174 JF300255.1 Phytophthora drechsleri clone S7.Oak4-4A <br />Yes</span><p>Note that the command line default of 500 is comfortably above what is needed for this example to "work", and it is unfortunate that the online default of 100 is too low for this niche use-case.</p><h3 style="text-align: left;">Conclusion<br /></h3><p>So, for this particular use case, given the first 32bp of my amplicons are highly conserved but often omitted from the ITS1 fragments in the public databases, I should consider running a second BLAST search with that removed - or remember to bump up the "Max target sequences" option - before concluding that I have a truly novel <i>Phytophthora</i> amplicon.</p><br /><br />An overly aggressive optimization in BLASTN and MegaBLAST (8 January 2019)<br />All my recent blog posts have been looking at issues raised by the recent letter <a href="https://doi.org/10.1093/bioinformatics/bty833">Shah et al. (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows"</a>, and the associated test case which they ought to have included with the letter itself. Thus far I have focussed on:<br />
<br />
<ul>
<li>"<a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">What BLAST's max-target-sequences doesn't do</a>" (2015 blog post, with links to follow up posts in 2018), an issue Sujai Kumar and I called a scary BLAST+ bug to do with alignment limits, which the NCBI BLAST team viewed as expected behaviour of a heuristic setting.</li>
<li>"<a href="https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html">BLAST tie breaking by database order</a>" (2018 blog post, with links to relevant older posts)</li>
</ul>
<br />
Over the Christmas break, there were two notable public developments. The BLAST team's reply was published, <a href="https://doi.org/10.1093/bioinformatics/bty1026">Madden et al. (2018) "Reply to the paper: Misunderstood parameters of NCBI BLAST impacts the correctness of bioinformatics workflows"</a>. Quoting from this:<br />
<br />
<blockquote class="tr_bq">
<i>"we examined the new example and it became clear that the demonstrated behavior was a bug, resulting from an overly aggressive optimization, introduced in 2012 for BLASTN and MegaBLAST (DNA-DNA alignments). This bug has been fixed in the BLAST+ 2.8.1 release, due out in December 2018. The aberrant behavior seems to occur only in alignments with an extremely large number of gaps, which is the case in the example provided by Shah and collaborators."</i></blockquote>
<br />
And BLAST+ v2.8.1 was released. Quoting the <a href="https://www.ncbi.nlm.nih.gov/books/NBK131777/">release notes</a>,<br />
<br />
<blockquote class="tr_bq">
"<i>Disabled an overly aggressive optimization that caused problems mentioned by Shah et al.</i>"</blockquote>
<br />
So, what was this aggressive optimisation in BLASTN and MegaBLAST, and how did it combine with the internal candidate alignment limit and database order to produce counter intuitive results? It turns out to explain the details I'd not yet followed up from <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">blog post part three</a>.<br />
<br />
<a name='more'></a><h3>
When was the change?</h3>
The NCBI says the "<i>overly aggressive optimization</i>" was introduced in 2012, which suggests it first appeared in BLAST+ 2.2.26 (31 Jan 2012), BLAST+ 2.2.27 (10 September 2012), or as it turns out BLAST+ 2.2.28 (released 19 March 2013).<br />
<br />
Time for some historical testing - a continuation of the examples in my blog post <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">BLAST max alignment limits reply - part three</a>, using a <a href="https://github.com/peterjc/blast_max_target_seqs/tree/master/Shah_et_al_2018">deduplicated version</a> of the Shah <i>et al.</i> <a href="https://github.com/shahnidhi/BLAST_maxtargetseq_analysis">test case</a>.<br />
<br />
On my system the earliest version of BLAST+ where the NCBI provided 64-bit Linux binaries work is BLAST+ 2.2.21, but this and 2.2.22 do not use the subject names in the tabular output - so I'm starting from v2.2.23 instead. There are some apparently minor differences in scores or gaps as the versions change, but if we look just at the query and match names what do we see?<br />
<br />
Here I am using a multi-line double nested bash for loop, where I look at each version of BLAST in turn, and use the head command to get the top result for running BLAST with the default alignment limit. Then I output the MD5 checksum of the first two columns (query and match name). Happily each version gives the same output:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1
do echo $v; rm -rf top1_${v}.tsv; for i in {1..10};
do ~/downloads/ncbi-blast-${v}+/bin/blastn -query query_$i.fasta -db dedup.fasta -outfmt 6 | head -n 1 >> top1_${v}.tsv;
done; cut -f 1,2 top1_${v}.tsv | md5sum;
done;
2.2.23
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.24
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.25
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.26
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.27
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.28
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.29
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.30
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.31
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.3.0
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.4.0
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.5.0
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.6.0
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.7.1
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.8.1
dd4d4c2ffb2a1d43bc3844b5406366d6 -</tt></pre>
</div>
<br />
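As an aside, the checksum trick used above can be wrapped in a small helper. This is only a sketch: the function name name_fingerprint and the two demo files are made up for illustration, and it assumes md5sum is available as in the commands above.<br />

```shell
# Hypothetical helper (not part of BLAST+): checksum only the query and
# subject name columns (1 and 2) of a BLAST tabular (-outfmt 6) file.
name_fingerprint() {
    cut -f 1,2 "$1" | md5sum | cut -d ' ' -f 1
}

# Made-up demo data: same hit, but the last column (bitscore) differs.
demo_dir=$(mktemp -d)
printf 'query_1\tsubject_A\t98.54\t243\n' > "$demo_dir/old.tsv"
printf 'query_1\tsubject_A\t98.54\t244\n' > "$demo_dir/new.tsv"

if [ "$(name_fingerprint "$demo_dir/old.tsv")" = "$(name_fingerprint "$demo_dir/new.tsv")" ]; then
    echo "Same query and match names"
else
    echo "Different query or match names"
fi
```

Comparing fingerprints of the name columns only means that small changes in scores or gap counts between BLAST+ versions do not register as a different answer.<br />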
There have been (apparently minor) changes in the output, which I am not going to look at further:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
do md5sum top1_${v}.tsv;
done
8b64ffa5e3452e03e53b5c8f83efd00d top1_2.2.23.tsv
251b13795f5df0505b6f476dc34ae997 top1_2.2.24.tsv
251b13795f5df0505b6f476dc34ae997 top1_2.2.25.tsv
251b13795f5df0505b6f476dc34ae997 top1_2.2.26.tsv
51794c55ec13bb6d96f180c2508a85a8 top1_2.2.27.tsv
51794c55ec13bb6d96f180c2508a85a8 top1_2.2.28.tsv
d6f8bdd885154ab3ad66f51b86cddb62 top1_2.2.29.tsv
d6f8bdd885154ab3ad66f51b86cddb62 top1_2.2.30.tsv
ae5406a65f9e6842ee01dcbbafc182c3 top1_2.2.31.tsv
4118c80f712b0dff24cffb328432c0bb top1_2.3.0.tsv
ac84739316e78ba53811c71c1eea0b8a top1_2.4.0.tsv
ac84739316e78ba53811c71c1eea0b8a top1_2.5.0.tsv
ac84739316e78ba53811c71c1eea0b8a top1_2.6.0.tsv
ac84739316e78ba53811c71c1eea0b8a top1_2.7.1.tsv
ac84739316e78ba53811c71c1eea0b8a top1_2.8.1.tsv</tt></pre>
</div>
<br />
Still, looking at just the first two columns (query and match names), nothing changed. Now, for the simpler command to request just one alignment from each query - again with a bash for loop over the BLAST+ version, and reporting the MD5 checksum of the first two columns:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
do echo $v;
~/downloads/ncbi-blast-${v}+/bin/blastn -query example.fasta -db dedup.fasta -outfmt 6 -max_target_seqs 1 -out max1_${v}.tsv; cut -f 1,2 max1_${v}.tsv | md5sum;
done
2.2.23
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.24
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.25
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.26
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.27
dd4d4c2ffb2a1d43bc3844b5406366d6 -
2.2.28
be102cef9b8eca7a49ffd35c74005999 -
2.2.29
be102cef9b8eca7a49ffd35c74005999 -
2.2.30
be102cef9b8eca7a49ffd35c74005999 -
2.2.31
be102cef9b8eca7a49ffd35c74005999 -
2.3.0
be102cef9b8eca7a49ffd35c74005999 -
2.4.0
be102cef9b8eca7a49ffd35c74005999 -
2.5.0
be102cef9b8eca7a49ffd35c74005999 -
2.6.0
be102cef9b8eca7a49ffd35c74005999 -
2.7.1
be102cef9b8eca7a49ffd35c74005999 -
2.8.1
Warning: [blastn] Examining 5 or more matches is recommended
dd4d4c2ffb2a1d43bc3844b5406366d6 -</tt></pre>
</div>
<br />
The new release BLAST+ 2.8.1 gives a warning about using the alignment limit, but its output now matches that from BLAST+ 2.2.23 to 2.2.27 inclusive. Meanwhile, BLAST+ 2.2.28 to 2.7.1 gave a consistent but different best hit.<br />
<br />
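With this many versions, the change points are easier to spot if the version and checksum lines are collapsed into runs. The following is a rough sketch rather than anything I ran at the time: summarise_runs and demo_runs are hypothetical names, and the demo input is a subset of the listing above with the warning line left out.<br />

```shell
# Hypothetical helper: collapse alternating "version" / "checksum  -" lines
# (as printed by the loops above) into runs of consecutive versions which
# share a checksum.
summarise_runs() {
    paste - - | awk '
        # Appending "" forces a string comparison of the checksums.
        ($2 "") != (prev "") {
            if (NR > 1) print first " to " last ": " prev
            first = $1; prev = $2
        }
        { last = $1 }
        END { if (NR > 0) print first " to " last ": " prev }
    '
}

# A subset of the version and checksum lines from the listing above.
demo_runs() {
    printf '%s\n' \
        2.2.26 'dd4d4c2ffb2a1d43bc3844b5406366d6  -' \
        2.2.27 'dd4d4c2ffb2a1d43bc3844b5406366d6  -' \
        2.2.28 'be102cef9b8eca7a49ffd35c74005999  -' \
        2.7.1  'be102cef9b8eca7a49ffd35c74005999  -' \
        2.8.1  'dd4d4c2ffb2a1d43bc3844b5406366d6  -' | summarise_runs
}
demo_runs
```

Each output line gives the first and last version of a run followed by the shared checksum, so a regression and its later fix show up as three runs where the first and last checksums agree.<br />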
Another way to look at this is to compare the BLAST hit when requesting at most one alignment, versus the top hit without the limit:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
do echo $v;
diff -q <(cut -f 1,2 max1_${v}.tsv) <(cut -f 1,2 top1_${v}.tsv);
done
2.2.23
2.2.24
2.2.25
2.2.26
2.2.27
2.2.28
Files /dev/fd/63 and /dev/fd/62 differ
2.2.29
Files /dev/fd/63 and /dev/fd/62 differ
2.2.30
Files /dev/fd/63 and /dev/fd/62 differ
2.2.31
Files /dev/fd/63 and /dev/fd/62 differ
2.3.0
Files /dev/fd/63 and /dev/fd/62 differ
2.4.0
Files /dev/fd/63 and /dev/fd/62 differ
2.5.0
Files /dev/fd/63 and /dev/fd/62 differ
2.6.0
Files /dev/fd/63 and /dev/fd/62 differ
2.7.1
Files /dev/fd/63 and /dev/fd/62 differ
2.8.1</tt></pre>
</div>
<br />
So, things worked as expected with this (deduplicated) database for BLAST+ 2.2.23 through 2.2.27, but the top result changed with BLAST+ 2.2.28 to 2.7.1 with <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span>, and this was fixed again in BLAST+ 2.8.1.<br />
<br />
<h3>
What was the change?</h3>
<br />
Having pinned down the change to the release of BLAST+ 2.2.28, with the fix being in BLAST+ 2.8.1, I was able to make a guess at the relevant commits in SVN.<br />
<br />
The version bump for BLAST+ 2.2.27 was done on 2 August 2012 in <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_engine.c?r1=55135&r2=55265">SVN revision 55265</a>, while the version bump for 2.2.28 was done on 12 March 2013 in <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_engine.c?r1=56905&r2=57472">SVN revision 57472</a>, so we're looking for the bug being introduced in that window.<br />
<br />
Similarly, the version bump for BLAST+ 2.8.0 (alpha) was done on 16 Jan 2018 in <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_engine.c?r1=79855&r2=80848">SVN revision 80848</a>, while the final release date bump for 2.8.1 was 20 November 2018 in <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_engine.c?r1=83681&r2=84607">SVN revision 84607</a>, so we're looking for the fix in that window.<br />
<br />
I asked the BLAST team if my guess was right. It wasn't, but Tom Madden kindly pointed me at the relevant commits. First, this commit added a new option:<br />
<ul>
<li>30 November 2011, <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1?view=revision&revision=52133">SVN revision 52133</a><br /><i>"Ignore low scoring ungapped alignments if hitlist is full, JIRA:SB-914"</i></li>
</ul>
Then this commit activated the setting in <span style="font-family: "courier new" , "courier" , monospace;">blast_nucl_options.cpp</span>:<br />
<br />
<ul>
<li>18 October 2012, <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/api/blast_nucl_options.cpp?r1=48605&r2=56007&pathrev=56007">SVN revision 56007</a><br /><i>"Use standard gap trigger for blastn to find missing hits, increase reduced_nucl_cutoff_score and turn on SetLowScorePerc to make up lost speed, JIRA:SB-1047"</i></li>
</ul>
<br />
Finally, this was the fix, turning that setting off again:<br />
<ul>
<li>9 November 2018, <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c++/src/algo/blast/api/blast_nucl_options.cpp?r1=73100&r2=84455&pathrev=84455">SVN revision 84455</a><br /><i>"Disable ungapped low score perc check, JIRA:SB-2407"</i></li>
</ul>
This was fixed after the NCBI studied the test case from Shah et al. (2018), which had gap-rich alignments.<br />
<br />
<h3>
Returning to the original Shah <i>et al.</i> (2018) test case</h3>
Does the same apply with the original Shah <i>et al.</i> (2018) test case with duplicated sequences?<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
do echo $v; rm -rf dups_top1_${v}.tsv; for i in {1..10};
do ~/downloads/ncbi-blast-${v}+/bin/blastn -query query_$i.fasta -db db.fasta -outfmt 6 | head -n 1 >> dups_top1_${v}.tsv;
done; cut -f 1,2 dups_top1_${v}.tsv | md5sum;
done;
2.2.23
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.24
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.25
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.26
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.27
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.28
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.29
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.30
77c8ac0df4a04523c39f01cdd4629b1f -
2.2.31
77c8ac0df4a04523c39f01cdd4629b1f -
2.3.0
77c8ac0df4a04523c39f01cdd4629b1f -
2.4.0
77c8ac0df4a04523c39f01cdd4629b1f -
2.5.0
77c8ac0df4a04523c39f01cdd4629b1f -
2.6.0
77c8ac0df4a04523c39f01cdd4629b1f -
2.7.1
77c8ac0df4a04523c39f01cdd4629b1f -
2.8.1
77c8ac0df4a04523c39f01cdd4629b1f -</tt></pre>
</div>
<br />
That shows the top hit returned from the duplicate database was consistent over these versions of BLAST+, but with the alignment limit set to one:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ for v in 2.2.23 2.2.24 2.2.25 2.2.26 2.2.27 2.2.28 2.2.29 2.2.30 2.2.31 2.3.0 2.4.0 2.5.0 2.6.0 2.7.1 2.8.1;
do echo $v;
~/downloads/ncbi-blast-${v}+/bin/blastn -query example.fasta -db db.fasta -outfmt 6 -max_target_seqs 1 -out dups_max1_${v}.tsv; cut -f 1,2 dups_max1_${v}.tsv | md5sum;
done
2.2.23
9870749de9ca82e81bc96ba72b45c2f9 -
2.2.24
9870749de9ca82e81bc96ba72b45c2f9 -
2.2.25
9870749de9ca82e81bc96ba72b45c2f9 -
2.2.26
9870749de9ca82e81bc96ba72b45c2f9 -
2.2.27
9870749de9ca82e81bc96ba72b45c2f9 -
2.2.28
b980215a60700a608bf4016f6ddced2b -
2.2.29
b980215a60700a608bf4016f6ddced2b -
2.2.30
b980215a60700a608bf4016f6ddced2b -
2.2.31
b980215a60700a608bf4016f6ddced2b -
2.3.0
b980215a60700a608bf4016f6ddced2b -
2.4.0
b980215a60700a608bf4016f6ddced2b -
2.5.0
b980215a60700a608bf4016f6ddced2b -
2.6.0
b980215a60700a608bf4016f6ddced2b -
2.7.1
b980215a60700a608bf4016f6ddced2b -
2.8.1
Warning: [blastn] Examining 5 or more matches is recommended
9870749de9ca82e81bc96ba72b45c2f9 -</tt></pre>
</div>
<br />
We see the same pattern - and BLAST+ 2.8.1 fixed things to match BLAST+ 2.2.23 to 2.2.27, but these are not all the same hits! Here is the change in the BLAST+ 2.2.27 output:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ diff <(cut -f 1,2 dups_top1_2.2.27.tsv) <(cut -f 1,2 dups_max1_2.2.27.tsv)
5c5
< NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087
---
> NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088</tt></pre>
</div>
<br />
And the same with BLAST+ 2.8.1,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre><tt>$ diff <(cut -f 1,2 dups_top1_2.8.1.tsv) <(cut -f 1,2 dups_max1_2.8.1.tsv)
5c5
< NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087
---
> NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088</tt></pre>
</div>
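<br />
The same comparison can be sketched in Python, which may be handy when the shell's process substitution is not available. This is only a minimal sketch: the function names <tt>top_pairs</tt> and <tt>changed_hits</tt> are my own for illustration, and it assumes tab-separated <tt>-outfmt 6</tt> files with one row per query in the same order.<br />
<br />
```python
def top_pairs(lines):
    """Return (query, subject) pairs from BLAST tabular (-outfmt 6) lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        fields = line.split("\t")
        pairs.append((fields[0], fields[1]))
    return pairs


def changed_hits(lines_a, lines_b):
    """Return (row number, pair_a, pair_b) where the query/subject pair differs.

    Assumption: both inputs list the same queries in the same order, one
    row per query. Extra rows at the end of either input are ignored.
    """
    diffs = []
    rows = zip(top_pairs(lines_a), top_pairs(lines_b))
    for row, (a, b) in enumerate(rows, start=1):
        if a != b:
            diffs.append((row, a, b))
    return diffs
```

Both functions accept any iterable of lines, so an open file handle for each of the two TSV files can be passed in directly.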
<br />
It turns out that query 5 (<span style="font-family: "courier new" , "courier" , monospace;">NC_006448.1-15016:454_5cov_045M-050M</span>) still suffers from the internal candidate limit problem (see <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">blog post part four</a>; with one alignment requested, an internal limit of 10 is used):<br />
<br />
<div style="background-color: black; font-family: "andale mono"; font-size: xx-small;">
<pre><tt><span style="color: #29f914;">$ ~/downloads/ncbi-blast-2.2.27+/bin/blastn -query query_5.fasta -db db.fasta -outfmt 6 -max_target_seqs 1
</span><span style="color: orange;">NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241</span></tt></pre>
</div>
<br />
There are five duplicates of the true best hit which are lost (shown in red), giving the sixth ranked sequence instead, which is one of the second best alignments (group shown in orange):<br />
<br />
<div style="background-color: black; font-family: "andale mono"; font-size: xx-small;">
<pre><tt><span style="color: #29f914;">$ ~/downloads/ncbi-blast-2.2.27+/bin/blastn -query query_5.fasta -db db.fasta -outfmt 6 | head -n 25
</span><span style="color: red;">NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087 98.54 137 1 1 1 136 492 628 2e-65 243
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.54 137 1 1 1 136 492 628 2e-65 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087 98.54 137 1 1 1 136 492 628 2e-65 243
NC_006448.1-15016:454_5cov_045M-050M s_8701:COG0087 98.54 137 1 1 1 136 492 628 2e-65 243
NC_006448.1-15016:454_5cov_045M-050M s_16540:COG0087 98.54 137 1 1 1 136 492 628 2e-65 243
</span><span style="color: orange;">NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_12527:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_3668:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_4487:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_238:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_4073:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_15051:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_5970:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_6971:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_10867:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_4893:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_9798:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_16093:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_10339:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_3051:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
NC_006448.1-15016:454_5cov_045M-050M s_9302:COG0088 100.00 130 0 0 161 290 1 130 9e-65 241
</span><span style="color: #29f914;">NC_006448.1-15016:454_5cov_045M-050M s_15416:COG0087 97.81 137 2 1 1 136 492 628 1e-63 237
NC_006448.1-15016:454_5cov_045M-050M s_16215:COG0087 97.81 137 2 1 1 136 492 628 1e-63 237
NC_006448.1-15016:454_5cov_045M-050M s_16063:COG0087 97.81 137 2 1 1 136 492 628 1e-63 237
NC_006448.1-15016:454_5cov_045M-050M s_15862:COG0087 97.81 137 2 1 1 136 492 628 1e-63 237</span></tt></pre>
</div>
<br />
Shown here using BLAST+ 2.2.27, but the same applies with BLAST+ 2.8.1 as well.<br />
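<br />
Spotting this kind of tie by eye is tedious, so here is a rough Python sketch which lists every subject sharing the best bit score for each query. The function name <tt>tied_best_hits</tt> is hypothetical, and it assumes the default twelve-column tab-separated <tt>-outfmt 6</tt> layout with the bit score in the last column.<br />
<br />
```python
def tied_best_hits(lines):
    """Map each query to the subjects sharing its highest bit score.

    Assumption: default 12-column BLAST tabular output (-outfmt 6),
    tab separated, with the bit score in column 12.
    """
    best = {}  # query -> (best bit score seen, list of subjects with it)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard tabular row
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, [subject])
        elif bitscore == best[query][0] and subject not in best[query][1]:
            best[query][1].append(subject)
    return {query: subjects for query, (score, subjects) in best.items()}
```

Any query mapped to more than one subject has tied best hits, and the reported top hit is then only one of several equally good answers. Note this only sees the hits BLAST actually reported, so it cannot recover candidates already dropped by the internal limit.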
<h3>
Conclusion</h3>
It appears the bug fix in BLAST+ 2.8.1 to remove the overly aggressive optimisation in BLASTN and MegaBLAST <i>alone</i> fixed 9 of the 10 queries with the original Shah <i>et al.</i> (2018) <a href="https://github.com/shahnidhi/BLAST_maxtargetseq_analysis">test case</a> with a database full of duplicates.<br />
<br />
Only query 5 (<span style="font-family: "courier new" , "courier" , monospace;">NC_006448.1-15016:454_5cov_045M-050M</span>) is also directly affected by the internal alignment limit setting (see <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">blog post part four</a>), which can be solved by also deduplicating the database.<br />
<br />
Note that based on my analysis in <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">blog post part three</a> (using BLAST+ 2.7.1), with BLAST+ 2.2.28 through 2.8.0, I think just deduplicating the database fixed 5 of the 10 queries. The remainder are now explained by the 2012 "overly aggressive optimization".<br />
<br />
My advice from <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">blog post part three</a> still stands - you should deduplicate your database (especially for nucleotides), as does the previous post looking at exactly how the <a href="https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html">BLAST database order tie breaking</a> is done (be careful if you have lots of similar sequences like marker genes, and there is meaning in how your database entries are sorted).<br />
<br />
In closing, while some of the strange behaviour in the Shah <i>et al.</i> (2018) test case, where applying <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> gave a different result, could be resolved by deduplicating the database to avoid the internal alignment candidate limit (i.e. the original 2015 issue), it turns out most of it was due to a completely separate issue, which the BLAST developers could identify using the test case.<br />
<br />
I think what Shah <i>et al.</i> ought to have done was contact the BLAST developers with a bug report, sharing their reproducible test case. In hindsight, they would likely have been told something like <i>"</i><i style="background-color: white;">Ah, yes. Thank you for the test case, that was very helpful. Sorry. Actually that's a different bug present since BLAST+ 2.2.28, affecting gap-rich alignments. We've fixed this as part of the next release (BLAST+ 2.8.1), due December 2018.</i><i>"</i><br />
<br />
Instead, we are left with a confused and misleading letter in the scientific literature. Still, on the bright side, more people are now aware of how the alignment limits and other heuristics in BLAST work, and are less likely to take the top hit at face value. And I got to blog more - yay for my scientific impact!<br />
<br />
<h2>
BLAST tie breaking by database order</h2>
Posted by Peter Cock, 7 December 2018.<br />
<br />
My November blog posts discussing how the BLAST+ tools behave with an alignment limit setting (see <a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">What BLAST's max-target-sequences doesn't do</a>, and the links from it) touched on database order, which comes into play as a tie breaker.<br />
<br />
Well, how is the BLAST database order defined? It turns out to be the reverse of the FASTA file used with <tt>makeblastdb</tt>, or in other words: Last-in, First-out (LIFO).<br />
<a name='more'></a><br />
<h3>
Making simple test cases</h3>
The idea here is to make a database full of almost identical sequences, by adding a tiny barcode to the end. For the nucleotide example, I've used a 3bp barcode using all the combinations of A, C, G and T giving 64 unique entries. The point here is when querying with the original sequence, these all give the same perfect alignment.<br />
<br />
My protein example is very similar - although because <tt>blastp</tt> uses composition based statistics (CBS) by default, I want all the sequences to have the same composition. Here my barcodes are all the possible permutations of five distinct amino acids, giving 120 barcodes. This was picked to give a reasonable sized database, potentially useful if exploring the maximum alignment settings due to <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">how the internal limits are set by default on protein versus nucleotide searches</a> (my previous blog post).<br />
<br />
Note I am using BLAST+ 2.7.1 on Linux here.<br />
<br />
<h4>
Nucleotide test case</h4>
Here is my Python script <tt>make_sweetpea.py</tt> which generates a FASTA nucleotide file:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>#!/usr/bin/env python
"""Generate a FASTA file of 64 near identical nucleotide sequences.
Outputs FASTA records consisting of the sweet-pea sequence U78617.1
(from the Lathyrus odoratus phytochrome A (PHYA) gene), with a 3bp
unique barcode appended, giving 64 near identical sequences.
"""
import itertools
template = """>pea%i based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTT%s"""
for i, barcode in enumerate(itertools.product('ACGT', repeat=3)):
print(template % (i + 1, ''.join(barcode)))</pre>
</div>
<br />
Then make this into a nucleotide BLAST database with <tt>makeblastdb</tt> as usual,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ python make_sweetpea.py > sweetpea_64.fasta
$ makeblastdb -dbtype nucl -in sweetpea_64.fasta -out sweetpea_64
Building a new DB, current time: 12/07/2018 10:31:20
New DB name: /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/sweetpea_64
New DB title: sweetpea_64.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 64 sequences in 0.00718689 seconds.</pre>
</div>
<br />
Note that the FASTA file looks like this:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>>pea1 based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTTAAA
>pea2 based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTTAAC
...
>pea64 based on U78617.1 Lathyrus odoratus phytochrome A
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTTTTT</pre>
</div>
<br />
Then we need the query sequence, which I have in a file named <tt>sweetpea.fasta</tt> as follows:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>>U78617.1 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds
CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCAAAACATGTGA
AGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCGACCTTAAGAGCTCCACATAG
TTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCTTCATTGGTTATGGCAGTGGTCGTCAATGAC
AGCGATGAAGATGGAGATAGCCGTGACGCAGTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAG
TTTGTCATAACACTACTCCGAGGTTTGTT</pre>
</div>
<br />
<h4>
Protein Test Case</h4>
Here is my Python script <tt>make_aster.py</tt> which generates a FASTA protein file:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>#!/usr/bin/env python
"""Generate a FASTA file of 120 near identical protein sequences.
Outputs FASTA records consisting of the Aster sequence BAA31520.1
with a 5aa unique barcode appended, giving 120 near identical
sequences.
"""
import itertools
template = """>aster%i based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI%s"""
for i, barcode in enumerate(itertools.permutations('AEILV')):
print(template % (i + 1, ''.join(barcode)))</pre>
</div>
<br />
Then make this into a protein BLAST database with <tt>makeblastdb</tt> as usual,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ python make_aster.py > aster_120.fasta
$ makeblastdb -dbtype prot -in aster_120.fasta -out aster_120
Building a new DB, current time: 12/07/2018 11:22:36
New DB name: /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/aster_120
New DB title: aster_120.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/aster_120
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 120 sequences in 0.00914717 seconds.</pre>
</div>
<br />
The protein FASTA file looks like this:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>>aster1 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANIAEILV
>aster2 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANIAEIVL
...
>aster120 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANIVLIEA</pre>
</div>
<br />
Then we need the query sequence, which I have in a file named <tt>aster.fasta</tt> as follows:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>>BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIV
MTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI</pre>
</div>
<br />
<h3>
What does this show?</h3>
First, let's just run <tt>blastn</tt> with the defaults other than asking for tabular output:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastn -query sweetpea.fasta -db sweetpea_64 -outfmt 6
U78617.1 pea64 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea63 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea62 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea61 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea60 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea59 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea58 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea57 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea56 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea55 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea54 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea53 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea52 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea51 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea50 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea49 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea48 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea47 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea46 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea45 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea44 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea43 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea42 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea41 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea40 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea39 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea38 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea37 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea36 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea35 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea34 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea33 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea32 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea31 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea30 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea29 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea28 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea27 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea26 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea25 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea24 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea23 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea22 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea21 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea20 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea19 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea18 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea17 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea16 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea15 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea14 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea13 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea12 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea11 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea10 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea9 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea8 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea7 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea6 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea5 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea4 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea3 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea2 100.000 309 0 0 1 309 1 309 4.34e-166 571
U78617.1 pea1 100.000 309 0 0 1 309 1 309 4.34e-166 571</pre>
</div>
<br />
That's long, but you immediately see that the hits are all the same, but listed 64 to 1, which is the reverse order of the FASTA file where they are 1 to 64.<br />
<br />
Requesting just one alignment gives what was reported first, entry 64:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastn -query sweetpea.fasta -db sweetpea_64 -outfmt 6 -max_target_seqs 1
U78617.1 pea64 100.000 309 0 0 1 309 1 309 4.34e-166 571</pre>
</div>
<br />
i.e. When there are 64 equally good matches, and we ask for just one, we do get the "first" entry in the database (it just happens that was the last entry in the FASTA file used to build the database).
<br />
<br />
With <tt>blastp</tt> the situation is exactly the same (because this database was constructed with all the near-identical sequences having the same amino acid composition, and thus CBS does not complicate the ranking):<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -query aster.fasta -db aster_120 -outfmt 6
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster119 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster118 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster117 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster116 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster115 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster114 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster113 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster112 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster111 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster110 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster109 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster108 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster107 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster106 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster105 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster104 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster103 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster102 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster101 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster100 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster99 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster98 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster97 100.000 107 0 0 1 107 1 107 2.73e-73 205
...
BAA31520.1 aster3 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster2 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster1 100.000 107 0 0 1 107 1 107 2.73e-73 205</pre>
</div>
<br />
I abridged the output at the ... line. Asking for just one alignment, we do indeed get entry 120:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -query aster.fasta -db aster_120 -outfmt 6 -max_target_seqs 1
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205</pre>
</div>
<br />
i.e. Exactly as with the observation from <tt>blastn</tt>, the protein database order is used as the tie breaker with the proviso that the database order is the reverse of the FASTA file used to build it.<br />
<br />
<h3>
What about chunked databases?</h3>
If you've used the NCBI provided NT or NR databases, you'll know they come in multiple chunks, with a master file <tt>nt.nal</tt> or <tt>nr.pal</tt> listing the child-databases which together make up the database. How does the database order work here? This turns out to be easy to check via the <tt>makeblastdb -max_file_sz</tt> setting.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ makeblastdb -dbtype prot -in aster_120.fasta -out aster_120_chunked -max_file_sz 1200B
Building a new DB, current time: 12/07/2018 11:54:54
New DB name: /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/nuc_test/aster_120_chunked
New DB title: aster_120.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 0B
Adding sequences from FASTA; added 120 sequences in 0.0767181 seconds.</pre>
</div>
<br />
I found that value of 1200 bytes by trial and error, but it results in twelve chunks:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ cat aster_120_chunked.pal
#
# Alias file created: Dec 7, 2018 11:54 AM
#
TITLE aster_120.fasta
DBLIST aster_120_chunked.00 aster_120_chunked.01 aster_120_chunked.02 aster_120_chunked.03 aster_120_chunked.04 aster_120_chunked.05 aster_120_chunked.06 aster_120_chunked.07 aster_120_chunked.08 aster_120_chunked.09 aster_120_chunked.10 aster_120_chunked.11</pre>
</div>
<br />
We can use the <tt>blastdbcmd</tt> tool to see which records are in which chunk - but it shouldn't surprise you that the first ten records are in <tt>aster_120_chunked.00.p*</tt> while the last ten records are in <tt>aster_120_chunked.11.p*</tt> - the chunks are created as needed while looping through the input FASTA file.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastdbcmd -entry all -db aster_120_chunked.11
>aster111 based on BAA31520.1 from Aster
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIVMTFGLVYTVY
ATAIDPKKGSLGTIAPIAIGFIVGANIVIEAL
...
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVGVTNALVFEIVMTFGLVYTVY
ATAIDPKKGSLGTIAPIAIGFIVGANIVLIEA</pre>
</div>
<br />
And how does this behave in terms of tie-breaking? Happily, exactly the same:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -query aster.fasta -db aster_120_chunked -outfmt 6
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster119 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster118 100.000 107 0 0 1 107 1 107 2.73e-73 205
...
BAA31520.1 aster3 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster2 100.000 107 0 0 1 107 1 107 2.73e-73 205
BAA31520.1 aster1 100.000 107 0 0 1 107 1 107 2.73e-73 205</pre>
</div>
<br />
and:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -query aster.fasta -db aster_120_chunked -outfmt 6 -max_target_seqs 1
BAA31520.1 aster120 100.000 107 0 0 1 107 1 107 2.73e-73 205</pre>
</div>
<br />
We can do the same trick with the nucleotide test case. This command makes eight equal chunks, each with eight sequences:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ makeblastdb -dbtype nucl -in sweetpea_64.fasta -out sweetpea_64_chunked -max_file_sz 1000B</pre>
</div>
<br />
Again, the search results and tie breaking behave the same for the single database versus the chunked nucleotide database.
<br />
<br />
<h3>
Conclusion</h3>
The BLAST database order for both nucleotide and protein databases is the reverse of the FASTA file used to build the database.
<br />
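Given that reversal, one way to make the BLAST database order follow the original FASTA order is to reverse the records before calling <tt>makeblastdb</tt>. Here is a minimal sketch, where the function name <tt>reverse_fasta_records</tt> is my own and the whole file is assumed to fit in memory. I have only observed the reversed ordering with BLAST+ 2.7.1, so treat the effect on tie breaking as an assumption to check on your own version.<br />
<br />
```python
def reverse_fasta_records(text):
    """Return FASTA text with the records in reverse order.

    Each record is a '>' header line plus the sequence lines after it.
    Blank lines, and any lines before the first header, are dropped.
    """
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])  # start a new record
        elif records and line.strip():
            records[-1].append(line)  # sequence line of current record
    if not records:
        return ""
    return "\n".join("\n".join(rec) for rec in reversed(records)) + "\n"
```

The returned text can be written to a new file and used as the <tt>makeblastdb -in</tt> input in place of the original FASTA file.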
<br />
<h3>
Discussion</h3>
BLAST has to use something as a tie breaker, and database order is deterministic and fast - and indirectly this is under the user's full control. The NCBI could instead break ties using the sequence or its identifier, but not only would the string comparison be a little slower, this seems more likely to introduce a subtle bias.<br />
<br />
However, this makes it clear that <i>if</i> you ignore the ties and only look at the first result, and your input database FASTA file has a meaningful order, that order will introduce a bias into your BLAST results.<br />
<br />
For example, you might update the database FASTA file by appending new sequences to it - meaning once it is turned into a BLAST database, in a tie break the most recently added sequences will be preferred over the older sequences. If you only look at the top result, in a tie-break your results will change as the database is updated. From a results stability point of view you might prefer the older sequence takes priority (in which case, reverse the FASTA file record order when making the database), but really this is a problem waiting to catch you out. You could randomise the FASTA file record order, but if tied best hits are likely, <i>do not</i> just look at the top BLAST hit!<br />
<br />
<h2>
BLAST max alignment limits reply - part four</h2>
Posted by Peter Cock, 13 November 2018 (updated 8 January 2019).<br />
<br />
This is the fourth in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by <a href="https://doi.org/10.1093/bioinformatics/bty833">Shah <i>et al.</i> (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows</a>". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (<a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html">one</a> and <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html">two</a>, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the <a href="https://github.com/shahnidhi/BLAST_maxtargetseq_analysis">Shah et al. (2018) test case</a>. Thank you!<br />
<br />
Their test case turns out to be using MEGABLAST (the default algorithm in the <span style="font-family: "courier new" , "courier" , monospace;">blastn</span> binary), with a custom nucleotide BLAST database (the <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">previous blog post</a> examined this).<br />
<br />
On the other hand, the original <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5">Dec 2015 -max_target_seqs bug report</a> (and my earlier blog posts), used BLASTP with a protein BLAST database.<br />
<br />
This is important because the internal limit on the number of alignments (<i>N_i</i>) that BLAST+ considers depends on whether composition-based statistics (CBS) are being used. CBS is the default with BLASTP, but <i>not</i> for MEGABLAST (i.e. the <span style="font-family: "courier new" , "courier" , monospace;">blastn</span> binary).<br />
<br />
The key point is that requesting <i>N=1</i> alignments, but otherwise the <span style="font-family: "courier new" , "courier" , monospace;">blastp</span> tool's default settings, gives an internal limit <i>N_i = 2*N + 50 = 52</i>, but with the <span style="font-family: "courier new" , "courier" , monospace;">blastn</span> tool you get an internal alignment limit <i>N_i = 10</i>. Evidently the BLAST+ developers were comfortable with a lower limit, so I presume there is less chance of the hit ordering changing in the final stages of the algorithm, but this emphasises why <b>it is <i>especially</i> important to avoid duplicates in a <i>nucleotide</i> BLAST database</b>.<br />
<br />
<a name='more'></a><h3>
How is the internal alignment limit set?</h3>
In late October 2018, the NCBI team added an <a href="https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_">Appendix entry "Outline of the BLAST process"</a> to the online BLAST+ documentation, which described in words how the internal maximum limit of <i>N_i</i> database sequences is set up.<br />
<br />
This was revised in early November 2018 to note "<i>CBS can be applied for BLASTP, BLASTX, and TBLASTN</i>". How <i>N_i</i> was set with CBS in a <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html">BLASTP example was described in a previous post</a>, but Shah <i>et al.</i> (2018) turned out to be using a MEGABLAST example, which does not use CBS.<br />
<br />
This is the code as used from the first public BLAST+ release 2.2.18 (released 14 October 2008) through to the current release BLAST+ 2.7.1 (released 23 October 2017) and the 2.8.0alpha (released 28 March 2018) as well, at around line 65 of file <span style="font-family: "courier new" , "courier" , monospace;">c++/src/algo/blast/core/blast_hits.c</span>:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: xx-small;">
<pre>  prelim_hitlist_size = hit_options->hitlist_size;
  if (ext_options->compositionBasedStats)
    prelim_hitlist_size = prelim_hitlist_size * 2 + 50;
  else if (scoring_options->gapped_calculation)
    prelim_hitlist_size = MIN(2 * prelim_hitlist_size,
                              prelim_hitlist_size + 50);
  (*retval)->prelim_hitlist_size = MAX(prelim_hitlist_size, 10);</pre>
</div>
<br />
The release dates are copied from the <a href="https://www.ncbi.nlm.nih.gov/books/NBK131777/">NCBI BLAST+ release notes</a>. I've omitted the full output, but by eye this code snippet looked identical over all the tar-ball source code releases from <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/">ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/</a> as examined with:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: xx-small;">
<pre>$ grep -A 7 "prelim_hitlist_size = hit_options->hitlist_size;" ncbi-blast-2.*+-src/c++/src/algo/blast/core/blast_hits.c</pre>
<pre>...</pre>
</div>
<br />
As an aside, I found the same file under NCBI source control online (<i>and confused myself temporarily by misreading a decade - see the update at the end of the post</i>). It matches as of the latest changes, <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_hits.c?view=markup&pathrev=84285">31 October 2018 (SVN revision 84285)</a>. Curiously this records an important change (before the first public release of BLAST+) on <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_hits.c?r1=37548&r2=37547&pathrev=37548">15 April 2008 (SVN revision 37548)</a>, when the plus fifty change was made to the CBS mode as part of a bi-weekly merge with no explanatory commit comment. Also, on <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_hits.c?r1=33344&r2=33343&pathrev=33344">9 April 2007 (SVN revision 33344)</a>, a stylistic change was made (introducing a local temporary variable), but importantly it added the last line quoted, as emphasised in the commit comment "<i>save hits for at least 10 sequences, in case the traceback significantly changes the scores of hits found</i>". I'm not going to go back any further in this code's history here, since I think nothing substantial changed (other than a special case for RPS-BLAST settings) since this fragment was introduced back on <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_hits.c?r1=22202&r2=22335">16 May 2005 (SVN revision 22335)</a>.<br />
<br />
<h3>
Minimum <i>N_i</i> values</h3>
Putting the C++ into pseudocode, and noting that in the CBS case due to the plus fifty, the value will always exceed the minimum of ten, we have:<br />
<br />
<div>
<pre>if CBS:
    N_i = 2*N + 50
elif gapped:
    N_i = MAX(MIN(2*N, N+50), 10)
else:
    N_i = MAX(N, 10)</pre>
</div>
<br />
<br />
So, if asking for one alignment (<i>N=1</i>), via <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> or equivalently for the human readable output formats (see <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html">part one</a>), what do we get?
<br />
<br />
<div>
<pre>if CBS:
    N_i = 2*N + 50 = 52
elif gapped:
    N_i = MAX(MIN(2*N, N+50), 10) = MAX(MIN(2, 51), 10) = MAX(2, 10) = 10
else:
    N_i = MAX(N, 10) = MAX(1, 10) = 10</pre>
</div>
<br />
<br />
That means for <i>N=1</i> with CBS we get an internal limit <i>N_i = 52</i>, and otherwise <i>N_i = 10</i>.
<br />
<br />
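To replay this arithmetic for other values of <i>N</i>, here is a minimal Python sketch of the same logic. The function name <span style="font-family: "courier new" , "courier" , monospace;">internal_hit_limit</span> and its keyword arguments are illustrative labels only and do not appear in the BLAST+ source; just the arithmetic follows the C fragment quoted above.<br />
<br />

```python
def internal_hit_limit(n, cbs=False, gapped=True):
    """Return N_i for a requested hit list size N.

    Mirrors the arithmetic of the quoted blast_hits.c fragment;
    the function and argument names are hypothetical.
    """
    limit = n
    if cbs:
        # Composition-based statistics: double the request and add fifty
        limit = 2 * limit + 50
    elif gapped:
        # Gapped search without CBS: the smaller of 2*N and N+50
        limit = min(2 * limit, limit + 50)
    # In every case at least ten candidates are kept
    return max(limit, 10)


if __name__ == "__main__":
    for n in (1, 5, 10, 50, 500):
        print(n,
              internal_hit_limit(n, cbs=True),
              internal_hit_limit(n, gapped=True),
              internal_hit_limit(n, gapped=False))
```

With <i>N=1</i> this reproduces the values worked through above (52 with CBS, 10 otherwise), and it also shows that without CBS the minimum of ten is what applies for any request up to <i>N=5</i>.<br />
<br />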
<h3>
When are composition-based statistics (CBS) used?
</h3>
Consulting the command line help,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: xx-small;">
<pre>$ blastp -help
...
 -comp_based_stats &lt;String&gt;
   Use composition-based statistics:
       D or d: default (equivalent to 2 )
       0 or F or f: No composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t : Composition-based score adjustment as in Bioinformatics
   21:902-911,
       2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
       2005, unconditionally
   Default = `2'
...</pre>
</div>
<br />
I won't quote them all, but <span style="font-family: "courier new" , "courier" , monospace;">blastp -help</span>, <span style="font-family: "courier new" , "courier" , monospace;">blastx -help</span> and <span style="font-family: "courier new" , "courier" , monospace;">tblastn -help</span> all report their default is <span style="font-family: "courier new" , "courier" , monospace;">-comp_based_stats 2</span>, meaning enabled as per the Bioinformatics paper, <a href="https://doi.org/10.1093/bioinformatics/bti070">Yu et al. (2005)</a>. Likewise for <span style="font-family: "courier new" , "courier" , monospace;">rpsblast</span> and <span style="font-family: "courier new" , "courier" , monospace;">rpstblastn</span> (although their default is <span style="font-family: "courier new" , "courier" , monospace;">-comp_based_stats 1</span>, meaning CBS is enabled as per the older NAR paper, <a href="https://doi.org/10.1093/nar/29.14.2994">Schäffer <i>et al.</i> 2001</a>).<br />
<br />
Meanwhile, there is no mention of composition in the <span style="font-family: "courier new" , "courier" , monospace;">blastn -help</span>, nor <span style="font-family: "courier new" , "courier" , monospace;">tblastx -help</span>, which fits as both of those papers are about protein databases.<br />
<br />
<h3>
Conclusion</h3>
By default the command line tools <span style="font-family: "courier new" , "courier" , monospace;">blastp</span>, <span style="font-family: "courier new" , "courier" , monospace;">blastx</span>, <span style="font-family: "courier new" , "courier" , monospace;">tblastn</span>, <span style="font-family: "courier new" , "courier" , monospace;">rpsblast</span> and <span style="font-family: "courier new" , "courier" , monospace;">rpstblastn</span> for protein databases all use CBS, and so requesting <i>N=1</i> we get an internal limit <i>N_i = 52</i>.<br />
<br />
However, with <span style="font-family: "courier new" , "courier" , monospace;">blastn</span> and <span style="font-family: "courier new" , "courier" , monospace;">tblastx</span> for nucleotide databases, there is no CBS mode, and so with <i>N=1</i> we get the much lower internal limit <i>N_i = 10</i>.<br />
<br />
Evidently for the Shah <i>et al.</i> (2018) test case, even with the database de-duplicated as described in my <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">earlier post</a>, <i>N_i = 10</i> is not big enough, and this heuristic limit is affecting the top hit returned when only one alignment is requested.<br />
<br />
In summary, if you are using the alignment limits like <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span>, then <b>it is <i>especially</i> important to avoid duplicates in a <i>nucleotide</i> BLAST database</b>.<br />
<div>
<br />
<h3>
Update (14 November 2018)</h3>
Corrected the decade of the most recent SVN commit of interest, <a href="https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/c%2B%2B/src/algo/blast/core/blast_hits.c?r1=37548&r2=37547&pathrev=37548">15 April 2008 (SVN revision 37548)</a>. It was not 15 April 2018, which was what was confusing me. I updated this and the start of the paragraph accordingly (with note added in italics). It would be interesting to go over the "legacy" BLAST release notes from that period.<br />
<br />
The NCBI have also updated the <a href="https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_">Appendix entry "Outline of the BLAST process"</a> again, which is now much clearer about this limit, and the minimum value of 10.<br />
<br />
<h3>
Update (4 December 2018)</h3>
Corrected minor typo in third paragraph of the conclusion, de-duplicated rather than duplicated.<br />
<br />
<h3>
Update (8 January 2019)</h3>
I have today published a follow up post after the BLAST team's formal reply and BLAST+ 2.8.1 were published in late December, "<a href="https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html">An overly aggressive optimization in BLASTN and MegaBLAST</a>". It turns out problems in Shah <i>et al.</i> (2018) were the complex interaction of multiple issues, with the internal alignment limit <i>N_i</i> and "<a href="https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html">BLAST tie breaking by database order</a>" being only part of the story.</div>
<h2>BLAST max alignment limits reply - part three</h2>
Posted by Peter Cock on 13 November 2018 (last updated 8 January 2019).<br />
<br />
This is the third in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by <a href="https://doi.org/10.1093/bioinformatics/bty833">Shah <i>et al.</i> (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows</a>". It was very frustrating that the letter did not provide a reproducible test case, but in reply to the first pair of posts (<a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html">one</a> and <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html">two</a>, both on Friday 2 November 2018), lead author Nidhi Shah got in touch via the comments on Sunday 4 November, with the URL to a GitHub repository describing the <a href="https://github.com/shahnidhi/BLAST_maxtargetseq_analysis">Shah et al. (2018) test case</a>. Thank you!<br />
<br />
Their test case turns out to be using a custom nucleotide BLAST database (rather than a protein example as per the original <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5">Dec 2015 -max_target_seqs bug report</a>, see my post "<a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">What BLAST's max-target-sequences doesn't do</a>"), and rather than a single query sequence, they have ten.<br />
<br />
I could reproduce their initial example locally. I could indeed see the database order coming into play - but so far nothing that cannot be explained by using this as a tie breaker. De-duplicating their database to make it non-redundant greatly improves things, but some of the queries still showed the (December 2015) problem where using <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> does not give the expected top alignment.<br />
<a name='more'></a><br />
<h3>
Their core test case</h3>
<div>
Using the links provided, I fetched the ten sequence query file <span style="font-family: "courier new" , "courier" , monospace;">example.fasta</span> and 1991 sequence <span style="font-family: "courier new" , "courier" , monospace;">db.fasta</span>, and made a nucleotide BLAST database as usual. It does not seem to matter that they used BLAST+ 2.6.0, while I am using BLAST+ 2.7.1 (on 64-bit Linux installed via BioConda).</div>
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastn -version
blastn: 2.7.1+
Package: blast 2.7.1, build Sep 20 2018 02:20:26
$ makeblastdb -dbtype nucl -in db.fasta
Building a new DB, current time: 11/05/2018 14:55:50
New DB name: /.../db.fasta
New DB title: db.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1991 sequences in 0.182925 seconds.
</pre>
</div>
<br />
First, running with the limit on, the output is naturally short:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastn -query example.fasta -db db.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_15833:COG0185 87.583 451 8 42 2352 2785 1 420 2.40e-135 479
S208_335:pb_10cov_001M-002M s_6114:COG0087 85.065 616 32 55 809 1406 3 576 5.90e-164 573
S216_4030:pb_10cov_001M-002M s_401:COG0088 82.232 681 40 73 1154 1818 1 616 2.33e-145 512
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 2.27e-150 531
NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.000 130 0 0 161 290 1 130 8.87e-65 241
S52_804:pb_10cov_000M-001M s_7931:COG0088 86.025 644 23 62 1 618 195 797 1.42e-180 628
S190_1420:pb_10cov_001M-002M s_8639:COG0088 83.639 709 39 72 2762 3452 3 652 3.42e-170 595
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S188_1416:pb_10cov_001M-002M s_15600:COG0094 86.311 599 15 63 2224 2801 2 554 2.46e-168 590
S1170_2543:pb_10cov_011M-012M s_8667:COG0094 86.038 573 24 53 606 1162 3 535 1.06e-160 564</pre>
</div>
<div>
<br />
Now, without giving BLAST a limit on the number of alignments, I wanted to show the equivalent ten lines of output. The most concise way I could come up with was to split the 10 sequence <span style="font-family: "courier new" , "courier" , monospace;">example.fasta</span> into 10 single sequence FASTA files using a short Python script, and then run BLAST ten times, once for each query, with the Unix <span style="font-family: "courier new" , "courier" , monospace;">head</span> command to show only the top hit:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ for i in {1..10}; do blastn -query query_$i.fasta -db db.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_1013:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_9883:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_9886:COG0090 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_16779:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_14626:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_16963:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_10253:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_17130:COG0201 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_1966:COG0201 82.612 1409 94 133 4088 5459 3 1297 0.0 1105</pre>
</div>
<br />
Indeed, all ten queries have a different top hit with and without the <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> setting.<br />
<br />
I have included these single-query FASTA files in my <a href="https://github.com/peterjc/blast_max_target_seqs">BLAST test case repository</a> on GitHub (specifically this commit added the <a href="https://github.com/peterjc/blast_max_target_seqs/commit/f70f749d5ae999c61e5fa31d3b3f5a8a132ee31d">10 single-query files and the script</a>).<br />
<br />
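For readers who just want the idea, splitting a multi-record FASTA file needs only a few lines of plain Python. The following is a minimal sketch and not the script from the repository: the function name <span style="font-family: "courier new" , "courier" , monospace;">split_fasta</span> and its arguments are hypothetical, and the only thing taken from this post is the <span style="font-family: "courier new" , "courier" , monospace;">query_1.fasta</span>, <span style="font-family: "courier new" , "courier" , monospace;">query_2.fasta</span>, ... naming used in the commands above.<br />
<br />

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir=".", prefix="query_"):
    """Write each record of a FASTA file to its own numbered file.

    Illustrative sketch only; returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as source:
        for line in source:
            if line.startswith(">"):
                # A header line starts a new record, so start a new file
                if handle is not None:
                    handle.close()
                target = out_dir / f"{prefix}{len(written) + 1}.fasta"
                handle = open(target, "w")
                written.append(target)
            if handle is not None:
                # Copy header and sequence lines unchanged;
                # anything before the first header is ignored
                handle.write(line)
    if handle is not None:
        handle.close()
    return written


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for path in split_fasta(sys.argv[1]):
            print(path)
```

Numbering the files in input order keeps the query numbers used later in this post in step with the order of the sequences in <span style="font-family: "courier new" , "courier" , monospace;">example.fasta</span>.<br />
<br />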
<i>However</i>, without applying the head command, you see <i>lots</i> of tied hits with equal bitscores and e-values - which are exactly the circumstances where the database order is documented to come into play (see <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html">part two</a>). This is clear if we apply the same command to the two randomised order versions of the same database the authors provide:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ for i in {1..10}; do blastn -query query_$i.fasta -db db_rand_1.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_15069:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_12233:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_5441:COG0090 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_16540:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_4179:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_606:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_563:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_14987:COG0201 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_474:COG0201 82.612 1409 94 133 4088 5459 3 1297 0.0 1105</pre>
</div>
<br />
And:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ for i in {1..10}; do blastn -query query_$i.fasta -db db_rand_2.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_6935:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_15110:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_16192:COG0090 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
NC_006448.1-15016:454_5cov_045M-050M s_8701:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_2111:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_6587:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_563:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_17130:COG0201 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
S1170_2543:pb_10cov_011M-012M s_1966:COG0201 82.612 1409 94 133 4088 5459 3 1297 0.0 1105</pre>
</div>
<br />
Everything looks consistent by eye, except the second column - the name of the sequence matched. Once you know this database is full of duplicates (see later in this post), this is not at all surprising.<br />
<br />
However, now for a rather different set of results - here are searches with <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> on the three duplicate filled databases which differ only in their entry order:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastn -query example.fasta -db db.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_15833:COG0185 87.583 451 8 42 2352 2785 1 420 2.40e-135 479
S208_335:pb_10cov_001M-002M s_6114:COG0087 85.065 616 32 55 809 1406 3 576 5.90e-164 573
S216_4030:pb_10cov_001M-002M s_401:COG0088 82.232 681 40 73 1154 1818 1 616 2.33e-145 512
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 2.27e-150 531
NC_006448.1-15016:454_5cov_045M-050M s_7827:COG0088 100.000 130 0 0 161 290 1 130 8.87e-65 241
S52_804:pb_10cov_000M-001M s_7931:COG0088 86.025 644 23 62 1 618 195 797 1.42e-180 628
S190_1420:pb_10cov_001M-002M s_8639:COG0088 83.639 709 39 72 2762 3452 3 652 3.42e-170 595
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S188_1416:pb_10cov_001M-002M s_15600:COG0094 86.311 599 15 63 2224 2801 2 554 2.46e-168 590
S1170_2543:pb_10cov_011M-012M s_8667:COG0094 86.038 573 24 53 606 1162 3 535 1.06e-160 564
$ blastn -query example.fasta -db db_rand_1.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_2717:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_8841:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_16711:COG0088 82.232 681 40 73 1154 1818 1 616 2.33e-145 512
S2153_228:pb_10cov_022M-023M s_10644:COG0088 84.476 715 33 71 9530 10227 5 658 0.0 634
NC_006448.1-15016:454_5cov_045M-050M s_16540:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_14721:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_5615:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_10253:COG0201 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S188_1416:pb_10cov_001M-002M s_8257:COG0201 85.126 1432 44 143 5687 7073 1 1308 0.0 1308
S1170_2543:pb_10cov_011M-012M s_16228:COG0201 82.553 1410 93 134 4088 5459 3 1297 0.0 1099
$ blastn -query example.fasta -db db_rand_2.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_539:COG0088 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_8710:COG0088 86.420 648 35 49 1406 2038 1 610 0.0 660
S216_4030:pb_10cov_001M-002M s_11826:COG0090 81.678 906 55 101 2118 2995 3 825 0.0 651
S2153_228:pb_10cov_022M-023M s_16088:COG0092 82.500 880 46 98 12227 13084 1 794 0.0 673
NC_006448.1-15016:454_5cov_045M-050M s_8701:COG0087 98.540 137 1 1 1 136 492 628 2.46e-65 243
S52_804:pb_10cov_000M-001M s_301:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S190_1420:pb_10cov_001M-002M s_606:COG0090 84.294 885 49 83 3789 4646 2 823 0.0 782
S232_2558:pb_10cov_001M-002M s_8271:COG0201 84.039 1347 87 118 7042 8359 14 1261 0.0 1179
S188_1416:pb_10cov_001M-002M s_16648:COG0201 85.126 1432 44 143 5687 7073 1 1308 0.0 1308
S1170_2543:pb_10cov_011M-012M s_2195:COG0201 82.624 1410 92 134 4088 5459 3 1297 0.0 1105</pre>
</div>
<br />
The eye is drawn to the evalue differences in column 11, but on closer inspection the results are seemingly arbitrary, with minimal agreement with each other, or the search above without the <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> limit. Based on this rather pathological configuration (a database full of duplicates), some of the bold claims in Shah <i>et al.</i> (2018) make more sense.<br />
<br />
<h3>
De-duplicating the test database</h3>
I checked, and found this database has a lot of duplicated sequences - 404 unique and present once, and a further 193 unique sequences which were duplicated, meaning an equivalent non-redundant database has only 597 unique sequences. I had some existing code for this, which I reworked into a <a href="https://github.com/peterjc/galaxy_blast/blob/master/tools/make_nr/make_nr.py">Python script to de-duplicate a FASTA file and make it non-redundant, <span style="font-family: "courier new" , "courier" , monospace;">make_nr.py</span></a> (on GitHub, I used <a href="https://github.com/peterjc/galaxy_blast/commit/1afabffff30e7ee9d9aad48635cb54f7a66fd3e9">v0.0.1 as of this commit</a>):<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ ./make_nr.py -a -o dedup.fasta db.fasta
404 unique entries; removed 1394 duplicates leaving 193 representative records
$ ./make_nr.py -a -o dedup_rand_1.fasta db_rand_1.fasta
404 unique entries; removed 1394 duplicates leaving 193 representative records
$ ./make_nr.py -a -o dedup_rand_2.fasta db_rand_2.fasta
404 unique entries; removed 1394 duplicates leaving 193 representative records
$ makeblastdb -dbtype nucl -in dedup.fasta
Building a new DB, current time: 11/08/2018 23:39:42
New DB name: /.../dedup.fasta
New DB title: dedup.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 597 sequences in 0.029556 seconds.
$ makeblastdb -dbtype nucl -in dedup_rand_1.fasta
...
$ makeblastdb -dbtype nucl -in dedup_rand_2.fasta
...
</pre>
</div>
<br />
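The core of this kind of de-duplication is small enough to sketch here. This is not the code of <span style="font-family: "courier new" , "courier" , monospace;">make_nr.py</span>, just a minimal illustration under stated assumptions: records count as duplicates only if their sequences are identical (ignoring case), the merged record keeps the position of the first copy, and the names are joined with a semicolon in the order encountered. The function names are hypothetical.<br />
<br />

```python
def parse_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the whole header line (minus the ">") as the name
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def make_non_redundant(records, sep=";"):
    """Merge records with identical sequences, keeping first-seen order.

    Assumption for this sketch: sequences are compared in upper case,
    and the merged name is the original names joined with `sep`.
    """
    names_by_seq = {}  # dicts preserve insertion order in Python 3.7+
    for name, seq in records:
        names_by_seq.setdefault(seq.upper(), []).append(name)
    return [(sep.join(names), seq) for seq, names in names_by_seq.items()]


if __name__ == "__main__":
    demo = [">a", "ACGT", ">b", "GG", ">c", "acgt"]
    for name, seq in make_non_redundant(parse_fasta(demo)):
        print(f">{name}\n{seq}")
```

Because the first copy of each sequence fixes where the merged record goes, two differently ordered inputs give differently ordered non-redundant outputs, which matters for the database order comparison later in this post.<br />
<br />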
Now repeating those searches on the non-redundant version of the test database, first with the <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> limit:<br />
<br />
<div style="background-color: black; font-family: "andale mono"; font-size: xx-small;">
<pre><span style="color: #29f914;">$ blastn -query example.fasta -db dedup.fasta -outfmt 6 -max_target_seqs 1
S50_7242:pb_10cov_000M-001M s_1013:COG0088;... 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_12233:COG0088;... 86.420 648 35 49 1406 2038 1 610 0.0 660
</span><span style="color: red;">S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531
</span><span style="color: #29f914;">NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
</span><span style="color: orange;">S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
</span><span style="color: #29f914;">S190_1420:pb_10cov_001M-002M s_10249:COG0090;... 84.294 885 49 83 3789 4646 2 823 0.0 782
</span><span style="color: red;">S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
</span><span style="color: #29f914;">S188_1416:pb_10cov_001M-002M s_14987:COG0201;... 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
</span><span style="color: orange;">S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105
</span></pre>
</div>
<br />
And without the limit:<br />
<br />
<div style="background-color: black; font-family: "andale mono"; font-size: xx-small;">
<pre><span style="color: #29f914;">$ for i in {1..10}; do blastn -query query_$i.fasta -db dedup.fasta -outfmt 6 | head -n 1; done
S50_7242:pb_10cov_000M-001M s_1013:COG0088;... 84.598 870 33 89 423 1258 1 803 0.0 771
S208_335:pb_10cov_001M-002M s_12233:COG0088;... 86.420 648 35 49 1406 2038 1 610 0.0 660
</span><span style="color: red;">S216_4030:pb_10cov_001M-002M s_11713:COG0090;... 82.671 906 46 101 2118 2995 3 825 0.0 701
S2153_228:pb_10cov_022M-023M s_1068:COG0092 82.841 880 43 98 12227 13084 1 794 0.0 689
</span><span style="color: #29f914;">NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
</span><span style="color: orange;">S52_804:pb_10cov_000M-001M s_11999:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
</span><span style="color: #29f914;">S190_1420:pb_10cov_001M-002M s_10249:COG0090;... 84.294 885 49 83 3789 4646 2 823 0.0 782
</span><span style="color: red;">S232_2558:pb_10cov_001M-002M s_10253:COG0201;... 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
</span><span style="color: #29f914;">S188_1416:pb_10cov_001M-002M s_14987:COG0201;... 84.922 1479 47 149 5687 7120 1 1348 0.0 1334
</span><span style="color: orange;">S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105</span></pre>
</div>
<br />
Note for display here I have abbreviated the match names of the long de-duplicated entries with dots, and the lines in red or orange are different with the limit applied.
<br />
<br />
Making the test database non-redundant fixes 5 of these 10 problematic searches (those left in green). Queries 6 and 10 (in orange) have an equally good result (same bitscore and evalue, consistent with a tie break by database order at the end of the search), but query numbers 3, 4, and 8 (in red) show a more substantial change (and I discuss why this could happen later).<br />
<br />
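The green, orange and red colouring above was done by eye, but the same three-way comparison can be sketched in Python. This is an illustrative sketch only: the function names and the labels "same", "tie" and "different" are made up for this example, and it assumes the standard twelve column <span style="font-family: "courier new" , "courier" , monospace;">-outfmt 6</span> layout with the e-value and bitscore in the last two columns.<br />
<br />

```python
def top_hits(tabular_lines):
    """Map each query ID to the fields of its first -outfmt 6 line."""
    best = {}
    for line in tabular_lines:
        fields = line.split()
        # Standard tabular output has twelve columns; keep the first
        # line seen for each query, which BLAST reports as the top hit
        if len(fields) >= 12 and fields[0] not in best:
            best[fields[0]] = fields
    return best


def classify(with_limit, without_limit):
    """Label each query as 'same', 'tie', 'different' or 'missing'."""
    limited = top_hits(with_limit)
    unlimited = top_hits(without_limit)
    labels = {}
    for query, hit in limited.items():
        other = unlimited.get(query)
        if other is None:
            labels[query] = "missing"
        elif hit[1] == other[1]:
            labels[query] = "same"       # same subject (green)
        elif hit[10:12] == other[10:12]:
            labels[query] = "tie"        # same e-value and bitscore (orange)
        else:
            labels[query] = "different"  # substantively different (red)
    return labels


if __name__ == "__main__":
    # Three of the result lines shown above, with and without the limit
    limited = [
        "S50_7242:pb_10cov_000M-001M s_1013:COG0088;... 84.598 870 33 89 423 1258 1 803 0.0 771",
        "S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518",
        "S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798",
    ]
    unlimited = [
        "S50_7242:pb_10cov_000M-001M s_1013:COG0088;... 84.598 870 33 89 423 1258 1 803 0.0 771",
        "S216_4030:pb_10cov_001M-002M s_11713:COG0090;... 82.671 906 46 101 2118 2995 3 825 0.0 701",
        "S52_804:pb_10cov_000M-001M s_11999:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798",
    ]
    for query, label in classify(limited, unlimited).items():
        print(query, label)
```

Comparing the e-value and bitscore as text is deliberate here, since both runs were formatted by the same BLAST version; comparing across versions would need a numeric comparison instead.<br />
<br />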
This suggests the NCBI could help reduce the chances of the December 2015 bug reported by Sujai Kumar in custom databases by warning if they are redundant, or better supporting de-duplication as an option in <span style="font-family: "courier new" , "courier" , monospace;">makeblastdb</span>.<br />
<br />
Continuing my previous marathon analogy with a half-way check point, what I think we've done here is remove most of the fast-starters which otherwise fill out the <i>N_i</i> candidate limit. This allows more of the slow starting ultimate winners into the final stage, where they can overtake.<br />
<br />
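The marathon analogy can be turned into a toy model. To be clear, this is not how BLAST is implemented and the scores are invented for illustration: it simply assumes each candidate has a preliminary score (the half-way check point) and a final score (the finish line), that only the best <i>N_i</i> by preliminary score go through, and that ties keep database order. All names and numbers below are hypothetical.<br />
<br />

```python
def top_hit(candidates, n_i):
    """Return the name of the winner in a two-stage toy ranking.

    candidates: list of (name, preliminary_score, final_score) tuples
    in database order. A toy model only, not BLAST's algorithm.
    """
    # Stage 1: keep the n_i best preliminary scores. sorted() is stable,
    # so candidates with tied scores stay in database order.
    shortlist = sorted(candidates, key=lambda c: -c[1])[:n_i]
    # Stage 2: the survivor with the best final score wins; max() returns
    # the first of any tied entries, so ties again follow shortlist order.
    return max(shortlist, key=lambda c: c[2])[0]


if __name__ == "__main__":
    # Invented scores: three copies of a fast starter, one slow starter
    redundant = [
        ("fast_copy1", 100, 100),
        ("fast_copy2", 100, 100),
        ("fast_copy3", 100, 100),
        ("slow", 90, 120),
    ]
    non_redundant = [("fast", 100, 100), ("slow", 90, 120)]
    print(top_hit(redundant, n_i=3))      # duplicates fill the shortlist
    print(top_hit(non_redundant, n_i=3))  # slow starter gets through
```

In the redundant case the three copies of the fast starter take all three places at the check point, so the slow starter never gets the chance to overtake; after de-duplication it does.<br />
<br />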
To assist anyone wanting to explore this further, I have included these single-query FASTA files and the deduplicated FASTA files used in my <a href="https://github.com/peterjc/blast_max_target_seqs">BLAST test case repository</a> on GitHub.<br />
<br />
<h3>
Database Order on a Non-Redundant Database</h3>
Now that we've de-duplicated the database, what is the effect of randomising the database order? Taking the two re-orderings provided, deduplicating them while preserving the order, and making those into databases, we can compare that too (code shown above; <a href="https://github.com/peterjc/blast_max_target_seqs">files in this repository</a>).<br />
<br />
It shouldn't be a surprise, but it turns out different orderings of the deduplicated database can still give different results.<br />
<br />
Without applying a limit on the number of alignments (i.e. the default), sometimes the first reported hits differ, but have the same final evalue and bitscore. This is consistent with the database order tie break happening right at the end (i.e. <a href="https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_">step D2 in the Outline of the BLAST process document</a>). I saw this with <span style="font-family: "courier new" , "courier" , monospace;">query_5.fasta</span> (<span style="font-family: "courier new" , "courier" , monospace;">NC_006448.1-15016:454_5cov_045M-050M</span>), <span style="font-family: "courier new" , "courier" , monospace;">query_6.fasta</span> (<span style="font-family: "courier new" , "courier" , monospace;">S52_804:pb_10cov_000M-001M</span>) and <span style="font-family: "courier new" , "courier" , monospace;">query_10.fasta</span> (<span style="font-family: "courier new" , "courier" , monospace;">S1170_2543:pb_10cov_011M-012M</span>) as shown in the following bash nested-for loop, where I have omitted the consistent outputs and shortened the deduplicated names with triple dots:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ for i in {1..10}; \
do echo "Query $i" ; for d in dedup.fasta dedup_rand_1.fasta dedup_rand_2.fasta; \
do blastn -query "query_$i.fasta" -db "$d" -outfmt 6 | head -n 1; done; \
echo; done
Query 1
...
Query 2
...
Query 3
...
Query 4
...
Query 5
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243
Query 6
S52_804:pb_10cov_000M-001M s_11999:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
S52_804:pb_10cov_000M-001M s_6705:COG0090 87.126 769 19 74 911 1656 7 718 0.0 798
Query 7
...
Query 8
...
Query 9
...
Query 10
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105
</pre>
</div>
<br />
No real surprises there, but what about with <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs 1</span> active? In the following I have again omitted those outputs which are consistent:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ for i in {1..10}; \
do echo "Query $i"; for d in dedup.fasta dedup_rand_1.fasta dedup_rand_2.fasta;\
do blastn -query "query_$i.fasta" -db "$d" -outfmt 6 -max_target_seqs 1; done; \
echo ; done
Query 1
...
Query 2
...
Query 3
S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518
S216_4030:pb_10cov_001M-002M s_1365:COG0088;... 82.379 681 39 73 1154 1818 1 616 1.65e-147 518
S216_4030:pb_10cov_001M-002M s_11713:COG0090;... 82.671 906 46 101 2118 2995 3 825 0.0 701
Query 4
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531
S2153_228:pb_10cov_022M-023M s_10644:COG0088;... 84.476 715 33 71 9530 10227 5 658 0.0 634
S2153_228:pb_10cov_022M-023M s_16088:COG0092;... 82.500 880 46 98 12227 13084 1 794 0.0 673
Query 5
NC_006448.1-15016:454_5cov_045M-050M s_14096:COG0087 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243
NC_006448.1-15016:454_5cov_045M-050M s_10633:COG0087;... 98.540 137 1 1 1 136 492 628 8.17e-66 243
Query 6
...
Query 7
...
Query 8
S232_2558:pb_10cov_001M-002M s_3834:COG0090 84.407 885 51 84 28 883 3 829 0.0 789
S232_2558:pb_10cov_001M-002M s_10253:COG0201;... 85.682 1348 63 120 7042 8359 14 1261 0.0 1301
S232_2558:pb_10cov_001M-002M s_8271:COG0201 84.039 1347 87 118 7042 8359 14 1261 0.0 1179
Query 9
...
Query 10
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105
S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105
</pre>
</div>
<br />
Again, the alternative results in queries 5 and 10 have the same bitscore and evalue, suggesting that the database order is being used at the <i>final</i> stage as a tie breaker. Query 6 gives the same result each time here; the equally good hit sometimes seen without the limit seems to have lost out this time. However, in the other cases the top hit changes dramatically with the database order - for example <span style="font-family: "courier new" , "courier" , monospace;">query_4.fasta</span> (<span style="font-family: "courier new" , "courier" , monospace;">S2153_228:pb_10cov_022M-023M</span>) gave three different answers with three different database orderings:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastn -query query_4.fasta -db dedup.fasta -outfmt 6 -max_target_seqs 1
S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531
$ blastn -query query_4.fasta -db dedup_rand_1.fasta -outfmt 6 -max_target_seqs 1
S2153_228:pb_10cov_022M-023M s_10644:COG0088;s_1307:COG0088;s_3859:COG0088;s_6993:COG0088;s_9865:COG0088 84.476 715 33 71 9530 10227 5 658 0.0 634
$ blastn -query query_4.fasta -db dedup_rand_2.fasta -outfmt 6 -max_target_seqs 1
S2153_228:pb_10cov_022M-023M s_16088:COG0092;s_17102:COG0092 82.500 880 46 98 12227 13084 1 794 0.0 673</pre>
</div>
<br />
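Telling a harmless tie from a genuinely different answer by eye gets tedious, so here is a minimal sketch of how the check could be automated. It assumes the 12 standard columns of <i>-outfmt 6</i>; the names <i>Hit</i>, <i>parse_hit</i> and <i>classify</i> are my own illustration and not part of BLAST. The example lines are copied from the output above, with the elided identifiers left exactly as shown.<br />
<br />
```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: str    # kept as text so "same score" means "printed identically"
    bitscore: str


def parse_hit(line: str) -> Hit:
    """Parse one line of BLAST tabular output (-outfmt 6, 12 standard columns)."""
    fields = line.split()
    return Hit(fields[0], fields[1], fields[10], fields[11])


def classify(hits: list[Hit]) -> str:
    """Describe how the top hits from differently ordered databases compare."""
    if len({h.subject for h in hits}) == 1:
        return "identical"
    if len({(h.evalue, h.bitscore) for h in hits}) == 1:
        return "tie"
    return "different alignments"


# Top hits for queries 4 and 10 against the three database orderings.
query_4 = [
    "S2153_228:pb_10cov_022M-023M s_7936:COG0087 82.336 702 38 74 8837 9516 1 638 7.47e-151 531",
    "S2153_228:pb_10cov_022M-023M s_10644:COG0088;... 84.476 715 33 71 9530 10227 5 658 0.0 634",
    "S2153_228:pb_10cov_022M-023M s_16088:COG0092;... 82.500 880 46 98 12227 13084 1 794 0.0 673",
]
query_10 = [
    "S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105",
    "S1170_2543:pb_10cov_011M-012M s_1966:COG0201;... 82.612 1409 94 133 4088 5459 3 1297 0.0 1105",
    "S1170_2543:pb_10cov_011M-012M s_2195:COG0201;... 82.624 1410 92 134 4088 5459 3 1297 0.0 1105",
]

print("Query 4:", classify([parse_hit(line) for line in query_4]))
print("Query 10:", classify([parse_hit(line) for line in query_10]))
```
<br />
This reports query 4 as "different alignments" and query 10 as a "tie", matching the reading given above. Comparing the scores as printed text is a deliberate simplification.<br />
<br />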
Referring to the <a href="https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_">Outline of the BLAST process document</a>, because database order also comes into play as a tie break at C4, that could explain these results (the scores at stage C4 and D2 can be quite different). My hypothesis is that these alternative top matches must have had tied scores at step C4, but some were dropped due to the internal limit on the number of alignments. Confirming this needs more work.<br />
<br />
<h3>
Initial Conclusion</h3>
If you are using the alignment limits with a custom BLAST database, de-duplicate it to ensure it is non-redundant. This is a good idea anyway for speed, and will force you to think about interpreting ties, which is also good. However, the immediate practical benefit here is it reduces the chances of the desired best hit getting crowded out at the <i>N_i</i> internal cull by needlessly duplicated candidates.<br />
<br />
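As a rough illustration of the de-duplication step, here is a minimal standard-library sketch. Joining the identifiers of identical sequences with a semicolon mirrors names like <i>s_10644:COG0088;s_1307:COG0088</i> in the output above, but that is only my assumption about a sensible convention, and the functions <i>read_fasta</i> and <i>deduplicate</i> are hypothetical helpers rather than the script actually used for these databases.<br />
<br />
```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first word after the ">" character.
            name = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def deduplicate(records):
    """Merge records with identical sequences, joining identifiers with ';'."""
    merged = {}  # upper-case sequence -> identifiers, in first-seen order
    for name, seq in records:
        merged.setdefault(seq.upper(), []).append(name)
    return [(";".join(names), seq) for seq, names in merged.items()]


# Tiny made-up example: s_1 and s_2 differ only in letter case.
example = """>s_1
ACGTACGT
>s_2
acgtacgt
>s_3
ACGTTCGT
""".splitlines()

for name, seq in deduplicate(read_fasta(example)):
    print(f">{name}\n{seq}")
```
<br />
Only exact duplicates (ignoring case) are merged here; near duplicates and sub-sequences would need a proper clustering tool.<br />
<br />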
My <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">next post</a> looks at what the <i>N_i</i> internal candidate alignment limit actually is in this context.<br />
<br />
To return to the Shah <i>et al.</i> 2018 letter, I am still unhappy with the text as it paints a very negative picture of what appears to be a problem only for databases of highly similar sequences, and especially so for databases with duplicated sequences. However, further work is needed to determine just how often this occurs - I expect that it is a very rare occurrence with the main NCBI provided BLAST databases, which likely see most usage.<br />
<br />
<h3>
Footnotes</h3>
My thanks to <a href="https://twitter.com/seqwave">John Walshaw</a> for proof reading an earlier draft of this post, and catching one mistake. Any remaining or new mistakes are my own.<br />
<br />
<h3>
Update (14 November 2018)</h3>
Fixed a typo in the third paragraph (December 2015, not Dec 2018), spotted by <a href="https://twitter.com/gringene_bio/status/1062653996253343744">David Eccles</a>.<br />
<br />
<h3>
Update (8 January 2019)</h3>
<br />
In early December I published the post "<a href="https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html">BLAST tie breaking by database order</a>", which I think means tie breaking alone cannot explain the results shown for queries 6 and 10 (marked in orange above). Meanwhile the results for queries 3, 4 and 8 (in red above) had also yet to be fully explained.<br />
<br />
I have today published a follow up post after the BLAST team's formal reply and BLAST+ 2.8.1 were published in late December, "<a href="https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html">An overly aggressive optimization in BLASTN and MegaBLAST</a>". This update fixes these oddities, which turn out to have been a complex interaction of multiple issues.</div>
Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com0tag:blogger.com,1999:blog-8584629468471803075.post-83349662135623097652018-11-02T16:41:00.000+00:002018-11-15T16:26:14.116+00:00
BLAST max alignment limits repartee - part two
This is the second in a series of blog posts seeking to throw light on some of the claims about the BLAST+ tool recently published by <a href="https://doi.org/10.1093/bioinformatics/bty833">Shah <i>et al.</i> (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows</a>". Since regrettably they did not provide a reproducible test case, my <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html">previous post</a> began by introducing <a href="https://github.com/peterjc/blast_max_target_seqs" style="background-color: transparent;">a minimal test case</a>.<br />
<br />
This topic dates back to 2015, when <a href="http://ylog.org/sujai/">Sujai Kumar</a> reported <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5">a scary <span class="il">BLAST</span>+ -max_target_seqs bug</a>. As I wrote at the time ("<a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">What <span class="il">BLAST</span>'s max-target-sequences doesn't do</a>"), the NCBI BLAST developers explained it as a poorly documented feature.
<br />
<br />
Here I focus on what might be the most quoted part of Shah <i>et al.</i> (2018), which is causing what I consider to be unwarranted panic:<br />
<blockquote class="tr_bq">
<i>To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.</i></blockquote>
<div>
This does not seem to be the case. If I am misreading their message, I am not alone. See for example <a href="https://emmabell42.wordpress.com/2018/11/01/the-max_target_seqs-parameter-of-ncbi-blast-may-not-do-what-you-think-it-does/">Emma Bell's blog post</a>, or John Walshaw's comments on <a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">my 2015 post</a>. It is possible Shah <i>et al.</i> have found a separate issue, but since no test case was given, that cannot currently be verified.
<br />
<br />
<a name='more'></a><h3>
Reverse ordering the existing test case</h3>
<div>
As explained in the previous post, I have a database <span style="background-color: black; color: #29f914; font-family: "andale mono";">older_matches.fasta</span> of 496 sequences where the first hit returned for Sujai's tardigrade query sequence <span style="background-color: black; color: #29f914; font-family: "andale mono";">input.fasta</span> changes with the <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs</span> setting.</div>
<div>
<br /></div>
<div>
The quote from Shah et al. (2018) claims the BLAST output when using <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs 1</span> (to return a single hit) "<i style="background-color: transparent;">depends on the order in which the sequences occur in the database</i>".</div>
<div>
<br /></div>
<div>
It is hard to prove a negative, but this test database ought to be appropriate. I therefore prepared a new database of the same sequences but in the reverse order. A quick bit of Python generated <span style="background-color: black; color: #29f914; font-family: "andale mono";">reverse_order.fasta</span> which was turned into a database with <span style="background-color: black; color: #29f914; font-family: "andale mono";">makeblastdb</span> as usual:</div>
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ python3 -c "from Bio import SeqIO; \
print(SeqIO.write(list(SeqIO.parse('older_matches.fasta', 'fasta'))[::-1], 'reverse_order.fasta', 'fasta'))"
496
</pre>
</div>
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ makeblastdb -dbtype prot -in reverse_order.fasta -parse_seqids -taxid_map older_matches.taxmap.txt
Building a new DB, current time: 11/01/2018 13:42:24
New DB name: /mnt/shared/users/pc40583/repositories/blast_max_target_seqs/tests/reverse_order.fasta
New DB title: reverse_order.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 496 sequences in 0.036006 seconds.</pre>
</div>
<br />
We can now try the two databases side by side - they contain the same sequences but in different order:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
$ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota</pre>
</div>
<br />
No difference. However, increasing the limit we can see minor differences in the sort order of tied hits:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 -out input_vs_older_matches_max_100.tsv
$ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 -out input_vs_reverse_order_max_100.tsv
$ diff input_vs_older_matches_max_100.tsv input_vs_reverse_order_max_100.tsv
70d69
< nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
71a71
> nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota</pre>
</div>
<br />
That turns out to be a trivial difference in the order of tied hits:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ grep -n -C 2 "XP_015813695.1" input_vs_older_matches_max_100.tsv
68-nHd.2.3.1.t00019-RA XP_001988914.1 59.677 124 49 1 1 124 100 222 1.03e-38 133 Eukaryota
69-nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota
70:nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
71-nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota
72-nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota
$ grep -n -C 2 "XP_015813695.1" input_vs_reverse_order_max_100.tsv
69-nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota
70-nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota
71:nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota
72-nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota
73-nHd.2.3.1.t00019-RA XP_014848806.1 59.200 125 50 1 1 125 100 223 1.26e-38 133 Eukaryota</pre>
</div>
<br />
The same appears to be true without any alignment limit, but with even more re-orderings due to more ties.
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono';">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 \
-out input_vs_older_matches.tsv
$ blastp -query input.fasta -db reverse_order.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 \
-out input_vs_reverse_order.tsv
$ diff <(sort input_vs_older_matches.tsv) <(sort input_vs_reverse_order.tsv)</pre>
</div>
<br />
i.e. No differences after applying Unix sort to the output files, so no changes to the scoring etc.
<br />
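The sort-then-diff trick confirms that both files hold the same lines, but not that the re-ordering is confined to tied hits. A minimal sketch of a slightly stricter check follows, assuming whitespace or tab separated <i>-outfmt 6</i> lines with the e-value and bit score in columns 11 and 12. The function names are my own, and the example lines are four of the hits shown earlier.<br />
<br />
```python
from collections import Counter


def score_key(line):
    """Return the (evalue, bitscore) text fields of one tabular hit line."""
    fields = line.split()
    return fields[10], fields[11]


def differs_only_in_ties(lines_a, lines_b):
    """True if both outputs hold the same hits and only tied hits swap places."""
    if Counter(lines_a) != Counter(lines_b):
        return False  # a hit is missing, extra, or scored differently
    # Same hits overall, so check the scores agree position by position.
    return [score_key(a) for a in lines_a] == [score_key(b) for b in lines_b]


older = [
    "nHd.2.3.1.t00019-RA KTG43357.1 60.800 125 48 1 1 125 100 223 1.08e-38 133 Eukaryota",
    "nHd.2.3.1.t00019-RA XP_015813695.1 60.000 125 49 1 1 125 100 223 1.13e-38 133 Eukaryota",
    "nHd.2.3.1.t00019-RA XP_007556854.1 59.200 125 50 1 1 125 100 223 1.13e-38 133 Eukaryota",
    "nHd.2.3.1.t00019-RA PWA22675.1 59.200 125 50 1 1 125 100 223 1.20e-38 133 Eukaryota",
]
# The reverse ordered database swapped the two hits with e-value 1.13e-38.
reverse = [older[0], older[2], older[1], older[3]]

print(differs_only_in_ties(older, reverse))
```
<br />
This prints True for the swap shown above; moving a hit past one with a different e-value or bit score would give False.<br />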
<br />
<h3>
Pathological Test Case</h3>
I was able to make a test case where the hit returned with <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs 1</span> really does depend on the database order, but only by constructing an exact duplicate of the expected hit. This is hardly something that should cause a problem in real life - if there are multiple sequences giving equally good alignments and you ask BLAST for just one, this is unavoidable.<br />
<br />
The technical details of the setup:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ grep -A 7 ">KRX89027.1" older_matches.fasta | sed "s/KRX89027.1/duplicate/g" > dup_euk_hit.fasta
$ cat dup_euk_hit.fasta older_matches.fasta > dup_at_start.fasta
$ cat older_matches.fasta dup_euk_hit.fasta > dup_at_end.fasta
$ makeblastdb -dbtype prot -in dup_at_start.fasta -parse_seqids -taxid_map older_matches.taxmap.txt
...
$ makeblastdb -dbtype prot -in dup_at_end.fasta -parse_seqids -taxid_map older_matches.taxmap.txt
...
</pre>
</div>
<br />
These two databases now have an extra copy of KRX89027.1 either at the start or the end, and this duplicate does not have a taxonomy mapping defined. Here are the search results using a limit of one:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
$ blastp -query input.fasta -db dup_at_start.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota
$ blastp -query input.fasta -db dup_at_end.fasta -outfmt "6 std sskingdoms" -max_target_seqs 1
nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A</pre>
</div>
<br />
As you can see, depending on the database order when there is a duplicate, sometimes you get the original sequence (KRX89027.1), and sometimes the duplicate.
<br />
<br /></div>
Increasing the limit makes it a little clearer what is going on - database order seems to be the tie breaker for ranking the identical hits:
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota
$ blastp -query input.fasta -db dup_at_start.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota
nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
$ blastp -query input.fasta -db dup_at_end.fasta -outfmt "6 std sskingdoms" -max_target_seqs 3
nHd.2.3.1.t00019-RA duplicate 63.115 122 45 0 1 122 105 226 5.27e-42 140 N/A
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.27e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota</pre>
</div>
<br />
Again, these example searches are missing the two even better bacterial hits which come through with a much higher alignment limit.<br />
<br />
<h3>
Conclusion</h3>
I remain open to the possibility that Shah <i>et al.</i> (2018) have found a separate issue, but for now think that paragraph of their paper in particular needs some serious editing.<br />
<br />
It is completely at odds with the <a href="https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_">Outline of the BLAST process</a> (recently added to the <i>BLAST Command Line Applications User Manual Appendix</i> in October 2018). This describes how the number of alignments requested, <i>N,</i> is used as an internal limit, <i>N_i</i>, to cap the number of candidates to be taken forward to the final gapped alignment with traceback stage. For the default composition based statistics, that internal limit is <i>N_i = 2*N+50</i>.<br />
<br />
This is like a marathon race with <i>N</i> top spots at stake (e.g. <i>N=1</i> for the gold medal, or <i>N=3</i> for a podium finish), where at the half way check point only the <i>N_i = 2*N+50</i> front runners are allowed to go forward. With <i>N=1</i>, this means only the 52 front runners at the checkpoint are allowed to finish the race - and this might exclude a slow-starter who could otherwise overtake the front runners and win.<br />
<br />
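The marathon analogy can be turned into a toy simulation. This is only a sketch of the idea, not BLAST's actual algorithm: each candidate gets an invented "checkpoint" score and an invented "finish" score, the field is cut to <i>N_i = 2*N+50</i> at the checkpoint, and the survivors are ranked by finish score. The names and numbers are all made up for illustration.<br />
<br />
```python
def internal_limit(n):
    """Internal candidate limit N_i = 2*N + 50, as quoted above."""
    return 2 * n + 50


def two_stage_search(candidates, n):
    """Rank (name, checkpoint_score, finish_score) tuples in two stages."""
    # Stage 1: only the N_i best candidates at the checkpoint go forward.
    # sorted() is stable, so ties keep their input order in this toy model.
    by_checkpoint = sorted(candidates, key=lambda c: c[1], reverse=True)
    shortlisted = by_checkpoint[: internal_limit(n)]
    # Stage 2: survivors are re-ranked on their finish score.
    by_finish = sorted(shortlisted, key=lambda c: c[2], reverse=True)
    return [name for name, _, _ in by_finish[:n]]


# 55 front runners, plus one slow starter with the best finish score.
field = [(f"front_runner_{i}", 100, 140) for i in range(55)]
field.append(("slow_starter", 90, 153))

print(two_stage_search(field, 1))  # cut to 52 at the checkpoint
print(two_stage_search(field, 3))  # cut to 56 at the checkpoint
```
<br />
Asking for one winner excludes the slow starter at the checkpoint, so a front runner takes gold. Asking for three lets all 56 candidates through, and the slow starter finishes first.<br />
<br />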
I would like to see this better documented at the command line, but BLAST is a heuristic search tool, and this approach does clearly reduce the amount of computation performed, and thus returns results much quicker.<br />
<br />
Note that document does confirm at the final stage "<i>A tie (two matches with identical score and expect value) is broken by the order of the sequences in the database</i>" as demonstrated practically above.<br />
<br />
I am expecting the NR database to gradually build up lots more similar protein sequences as more and more genomes are sequenced and submitted. It would therefore not surprise me if the NCBI BLAST team chose in future to relax the culling threshold at the earlier stage of the search (for example by raising the current limit <i>N_i = 2*N+50</i>, perhaps increasing the 50 for larger databases), to reduce the chances of situations like the one Sujai found.<br />
<br />
<h3>
Simplified Test Case</h3>
While doing final proof reading, I stumbled upon some independent work on this issue. BioStars user <a href="https://www.biostars.org/p/341227/"><b>fishgolden</b> had an elegant idea for a simplified test case</a>. They took one copy of the best match WP_042303394.1 (bacteria) and either 51 or 52 copies of the unwanted best match KRX89030.1 (eukaryote), to make two databases.
<br />
Using <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs 1</span> sets the internal limit to 52 candidates, and we know that at the first stage in the algorithm, <span style="background-color: black; color: #29f914; font-family: "andale mono";">KRX89030.1</span> (eukaryote) ranks higher than <span style="background-color: black; color: #29f914; font-family: "andale mono";">WP_042303394.1</span> (bacteria). So, if there are up to 51 copies of the eukaryote match, then the bacterial match is still included - at the check point it is ranked last, but overtakes the eukaryotic matches in the final stage to finish top. However, if there are 52 copies of the eukaryote, they alone are taken forward to the final stage, and our desired ultimate winner bacteria sequence <span style="background-color: black; color: #29f914; font-family: "andale mono";">WP_042303394.1</span> is excluded.<br />
<br />
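For anyone wanting to try this construction, here is a minimal sketch of building the two FASTA files in memory. The sequences are short placeholders (the real ones would come from the test database), the <i>_copy</i> suffix is my own naming to keep identifiers unique, and putting the bacterial sequence first is an arbitrary choice. None of this is taken from fishgolden's post.<br />
<br />
```python
def build_database(best, decoy, copies):
    """Return FASTA text: one best match, then numbered copies of the decoy."""
    best_id, best_seq = best
    decoy_id, decoy_seq = decoy
    lines = [f">{best_id}", best_seq]
    for i in range(1, copies + 1):
        # Unique identifiers, since duplicate names would clash in a database.
        lines += [f">{decoy_id}_copy{i}", decoy_seq]
    return "\n".join(lines) + "\n"


# Placeholder sequences only, NOT the real proteins.
bacterial = ("WP_042303394.1", "MKTAYIAKQR")
eukaryote = ("KRX89030.1", "MSLNDVWKRG")

for copies in (51, 52):
    fasta = build_database(bacterial, eukaryote, copies)
    print(copies, "copies ->", fasta.count(">"), "records")
```
<br />
Each string would then be written to a file and indexed with makeblastdb before searching with the limit set to one.<br />
<br />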
<h3>
Update (4 November 2018)</h3>
Corrected the spelling of Nidhi Shah's surname: in this post I had consistently written Shar et al. (2018) rather than Shah <i>et al.</i> (2018). My deep apologies. See the comments below, and in particular the link to <a href="https://github.com/shahnidhi/BLAST_maxtargetseq_analysis">Nidhi Shah's BLAST test case on GitHub</a>, which I intend to try out this week.<br />
<br />
<h3>
Update (14 November 2018)</h3>
Corrected one misspelling of Sujai's name (sorry!). Thank you to John Walshaw who spotted this.<br />
<br />
I've published <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">BLAST max alignment limits - part three</a> looking at the test case from Nidhi Shah, and <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">BLAST max alignment limits - part four</a> looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).
Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com4tag:blogger.com,1999:blog-8584629468471803075.post-68227554644819910302018-11-02T09:56:00.000+00:002018-12-07T11:40:57.354+00:00
BLAST max alignment limits repartee - part one
Back in 2015, my blog post "<a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">What BLAST's max-target-sequences doesn't do</a>" highlighted what we called <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5">a scary BLAST+ -max_target_seqs bug</a>, found and reported by <a href="http://ylog.org/sujai/">Sujai Kumar</a>. The NCBI BLAST team took the stance that this was a feature not a bug (and as a heuristic search tool, this is an understandable view), but conceded it could be better documented.<br />
<br />
Sadly, I don't think there has been much if any clarification in the BLAST+ documentation about the settings limiting the number of alignments returned, and what else they control. The recent letter <a href="https://doi.org/10.1093/bioinformatics/bty833">Shah <i>et al.</i> (2018) "Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows</a>" did serve the purpose of raising the profile of this issue, but seems confused and misleading in several places. Most regrettably they did not provide a reproducible test case, so it is possible they found another issue.<br />
<br />
This is the first of a planned series of blog posts which seeks to clarify the situation, and some of the claims in Shah <i>et al.</i> (2018).<br />
<br />
First of all, this issue is not specific to <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs</span> (used with the computer readable output format). With human readable output formats, the maximum of <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_descriptions</span> and <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_alignments</span> is used <i>in exactly the same</i> way during the BLAST search.<br />
<br />
<h3>
<a name='more'></a>Re-creating the test case</h3>
The most useful bug reports come with steps to reproduce the issue, and that is exactly what Sujai Kumar did originally back in December 2015. <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5">His example</a> worked with a tardigrade query sequence and the NCBI NR database of the time, which didn't yet have any tardigrade matches. The project Sujai was working on was later published as <a href="https://doi.org/10.1073/pnas.1600338113">Koutsovoulos <i>et al.</i> (2016) "No evidence for extensive horizontal gene transfer in the genome of the tardigrade <i>Hypsibius dujardini</i>"</a>. Given the nature of checking for horizontal gene transfer, simply looking at the top BLAST hit is unwise. BLAST results should be scrutinised very carefully - which is how Sujai must have noticed the taxonomic kingdom of a top result changing between bacteria and eukaryota.<br />
<br />
If you run the same search now against the 2018 NR database, there are two new top hits from tardigrade genomes, <i>Hypsibius dujardini </i>and <i>Ramazzottius varieornatus</i>. It turned out to be quite simple to make a new test case by building a mini BLAST database of all the current hits <i>except</i> those two sequences. I've put these files into a git repository which explains how I did this in fine detail:<br />
<br />
<a href="https://github.com/peterjc/blast_max_target_seqs">https://github.com/peterjc/blast_max_target_seqs</a><br />
<br />
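The repository documents the real steps. Purely as an illustration of the "everything except those two sequences" idea, here is a minimal standard-library sketch that drops named records from a FASTA file. The function name and the identifiers in the example are hypothetical.<br />
<br />
```python
def drop_records(fasta_lines, unwanted_ids):
    """Yield FASTA lines, skipping records whose identifier is unwanted."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # The identifier is the first word of the header line.
            identifier = header.split()[0] if header else ""
            keep = identifier not in unwanted_ids
        if keep:
            yield line


# Made-up records; "tardigrade_1" stands in for a hit to be excluded.
example = [
    ">hit_1 some description",
    "MKV",
    ">tardigrade_1",
    "MKL",
    "MKA",
    ">hit_2",
    "MRT",
]

for line in drop_records(example, {"tardigrade_1"}):
    print(line)
```
<br />
This keeps the records for hit_1 and hit_2 and omits both sequence lines of the excluded record.<br />
<br />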
<h3>
Revisiting the test case</h3>
Now, to show the problem - let's run a BLASTP search using BLAST+ 2.7.1, with my new minimal database.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -version
blastp: 2.7.1+
Package: blast 2.7.1, build Sep 20 2018 02:20:26</pre>
</div>
<br />
First with the default alignment number limits, tabular output, and looking at the top hits (using the Unix command head to show the first ten lines):<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -evalue 1e-5 | head
nHd.2.3.1.t00019-RA WP_042303394.1 58.678 121 49 1 4 124 93 212 7.54e-46 153 Bacteria
nHd.2.3.1.t00019-RA WP_017775351.1 58.678 121 49 1 4 124 93 212 8.49e-46 153 Bacteria
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRZ17714.1 63.115 122 45 0 1 122 121 242 1.39e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD48812.1 61.983 121 46 0 1 121 419 539 1.44e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KHJ41189.1 61.983 121 46 0 1 121 39 159 1.49e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRX89026.1 63.115 122 45 0 1 122 153 274 1.82e-41 140 Eukaryota
nHd.2.3.1.t00019-RA CDW52156.1 61.983 121 46 0 1 121 97 217 1.90e-41 141 Eukaryota</pre>
</div>
<br />
Notice the top two hits are bacteria, then lots of eukaryota. Now, if you limit this to 100 alignments using <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs 100</span>, the two bacterial top hits have gone:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: xx-small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -outfmt "6 std sskingdoms" -max_target_seqs 100 -evalue 1e-5 | head
nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89025.1 63.115 122 45 0 1 122 105 226 1.07e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD69381.1 61.983 121 46 0 1 121 466 586 1.39e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRZ17714.1 63.115 122 45 0 1 122 121 242 1.39e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KFD48812.1 61.983 121 46 0 1 121 419 539 1.44e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KHJ41189.1 61.983 121 46 0 1 121 39 159 1.49e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRX89026.1 63.115 122 45 0 1 122 153 274 1.82e-41 140 Eukaryota
nHd.2.3.1.t00019-RA CDW52156.1 61.983 121 46 0 1 121 97 217 1.90e-41 141 Eukaryota
nHd.2.3.1.t00019-RA KRZ35475.1 63.115 122 45 0 1 122 105 226 2.77e-41 140 Eukaryota
nHd.2.3.1.t00019-RA KRX89032.1 63.115 122 45 0 1 122 105 226 2.95e-41 140 Eukaryota</pre>
</div>
<br />
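When screening many queries, the kingdom of the top hit is the thing to watch. Below is a minimal sketch, assuming the 13 column "6 std sskingdoms" layout used above with the kingdom in the last column; <i>top_hit</i> is a hypothetical helper of my own, and the two example lines are the first hits from the two searches above.<br />
<br />
```python
def top_hit(lines):
    """Return (subject, kingdom) for the first hit in '6 std sskingdoms' output."""
    for line in lines:
        if line.strip():
            fields = line.split()
            return fields[1], fields[12]
    return None  # no hits at all


unlimited = [
    "nHd.2.3.1.t00019-RA WP_042303394.1 58.678 121 49 1 4 124 93 212 7.54e-46 153 Bacteria",
]
limited = [
    "nHd.2.3.1.t00019-RA KRX89027.1 63.115 122 45 0 1 122 105 226 5.26e-42 140 Eukaryota",
]

before, after = top_hit(unlimited), top_hit(limited)
if before[1] != after[1]:
    print(f"Top hit kingdom changed: {before[1]} -> {after[1]}")
```
<br />
For this query it reports the change from Bacteria to Eukaryota, which is exactly the kind of flip that matters when looking for horizontal gene transfer.<br />
<br />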
Now the first point I want to clear up is that this is not specific to the <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs</span> setting (which is used for the computer readable output formats). You get the same problem with the human readable formats as well!<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -evalue 1e-5 | grep "Query=" -A 15
Query= nHd.2.3.1.t00019-RA
Length=126
Score E
Sequences producing significant alignments: (Bits) Value
WP_042303394.1 glycogen/starch/alpha-glucan phosphorylase [Parab... 153 8e-46
WP_017775351.1 glycogen/starch/alpha-glucan phosphorylase [Parab... 153 8e-46
KRX89027.1 Glycogen phosphorylase, partial [Trichinella pseudosp... 140 5e-42
KRX89025.1 Glycogen phosphorylase, liver form [Trichinella pseud... 140 1e-41
KFD69381.1 hypothetical protein M514_10296 [Trichuris suis] 141 1e-41
KRZ17714.1 Glycogen phosphorylase [Trichinella pseudospiralis] 140 1e-41
KFD48812.1 hypothetical protein M513_10296 [Trichuris suis] 141 1e-41
KHJ41189.1 phosphorylase, glycogen/starch/alpha-glucan family [T... 141 1e-41
KRX89026.1 Glycogen phosphorylase [Trichinella pseudospiralis] 140 2e-41
CDW52156.1 Phosphorylase domain containing protein [Trichuris tr... 141 2e-41</pre>
</div>
<br />
The above is with no limits. Now if we use <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_descriptions=100 -num_alignments=100</span> to again limit to 100 alignments, we see the two bacteria top hits (<span style="background-color: black; color: #29f914; font-family: "andale mono";">WP_042303394.1</span> and <span style="background-color: black; color: #29f914; font-family: "andale mono";">WP_017775351.1</span>) have gone:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -query input.fasta -db older_matches.fasta -num_descriptions=100 -num_alignments=100 -evalue 1e-5 | grep "Query=" -A 15
Query= nHd.2.3.1.t00019-RA
Length=126
Score E
Sequences producing significant alignments: (Bits) Value
KRX89027.1 Glycogen phosphorylase, partial [Trichinella pseudosp... 140 5e-42
KRX89025.1 Glycogen phosphorylase, liver form [Trichinella pseud... 140 1e-41
KFD69381.1 hypothetical protein M514_10296 [Trichuris suis] 141 1e-41
KRZ17714.1 Glycogen phosphorylase [Trichinella pseudospiralis] 140 1e-41
KFD48812.1 hypothetical protein M513_10296 [Trichuris suis] 141 1e-41
KHJ41189.1 phosphorylase, glycogen/starch/alpha-glucan family [T... 141 1e-41
KRX89026.1 Glycogen phosphorylase [Trichinella pseudospiralis] 140 2e-41
CDW52156.1 Phosphorylase domain containing protein [Trichuris tr... 141 2e-41
KRZ35475.1 Glycogen phosphorylase [Trichinella pseudospiralis] 140 3e-41
KRX89032.1 Glycogen phosphorylase [Trichinella pseudospiralis] 140 3e-41</pre>
</div>
<br />
For conciseness, I have used the Unix command grep to highlight the top of the summary table, but the same applies to the full pairwise alignments which would be shown later. The test files are public, so you can verify this if you wish to.<br />
<br />
<h3>
This is not specific to -max_target_seqs!</h3>
What is important here is that either <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs</span>, or the maximum of <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_descriptions</span> and <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_alignments</span>, is used during the search to set a heuristic limit on the number of alignments to consider. You can verify this by tracing the arguments through the C++ source code for BLAST+.<br />
<br />
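To make that rule concrete, here is a small sketch which simply restates the claim above in Python. It is an illustration only, not the BLAST+ C++ logic: the function name is hypothetical, and the defaults are those shown by <tt>blastp -help</tt> in BLAST 2.7.1+.<br />

```python
# Defaults as listed by "blastp -help" in BLAST 2.7.1+
DEFAULT_NUM_DESCRIPTIONS = 500
DEFAULT_NUM_ALIGNMENTS = 250


def effective_hit_limit(max_target_seqs=None,
                        num_descriptions=None,
                        num_alignments=None):
    """Hit limit applied during the search, as described in this post."""
    if max_target_seqs is not None:
        if num_descriptions is not None or num_alignments is not None:
            # The command line tools reject this combination
            raise ValueError("max_target_seqs is incompatible with "
                             "num_descriptions and num_alignments")
        return max_target_seqs
    if num_descriptions is None:
        num_descriptions = DEFAULT_NUM_DESCRIPTIONS
    if num_alignments is None:
        num_alignments = DEFAULT_NUM_ALIGNMENTS
    return max(num_descriptions, num_alignments)
```

Under this reading, setting only <tt>-num_alignments=100</tt> would leave the limit at 500 (the default number of descriptions), which is why the example above sets both options to 100.<br />
<br />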
In this corner case, <a href="https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html">as the NCBI BLAST team explained back in 2015</a>, the two bacterial results which would have become the top results can be excluded during an early part of the search.<br />
<br />
I think the BLAST team should not have introduced <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs</span> for the computer readable output formats, rather they could have simply continued to use <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_alignments</span>. This was <a href="https://blastedbio.blogspot.com/2014/12/blast-christmas-wish-list.html">#5 on my BLAST+ Christmas Wish List (2014)</a>.<br />
<br />
However, I find the command line documentation is not helpful here, listing <span style="background-color: black; color: #29f914; font-family: "andale mono";">-max_target_seqs</span> under "<i>Restrict search or results</i>" (which is accurate), while <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_descriptions</span> and <span style="background-color: black; color: #29f914; font-family: "andale mono";">-num_alignments</span> are under "<i>Formatting options</i>" (which is misleading):<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: small;">
<pre>$ blastp -help
...
DESCRIPTION
Protein-Protein BLAST 2.7.1+
...
*** Formatting options
...
-num_descriptions &lt;Integer, &gt;=0&gt;
Number of database sequences to show one-line descriptions for
Not applicable for outfmt > 4
Default = `500'
* Incompatible with: max_target_seqs
-num_alignments &lt;Integer, &gt;=0&gt;
Number of database sequences to show alignments for
Default = `250'
* Incompatible with: max_target_seqs
...
*** Restrict search or results
...
-max_target_seqs &lt;Integer, &gt;=1&gt;
Maximum number of aligned sequences to keep
Not applicable for outfmt <= 4
Default = `500'
* Incompatible with: num_descriptions, num_alignments
...
</pre>
</div>
<br />
Terry Jones made this same point in his <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5#gistcomment-2185621">August 2017 comment on the original report</a>, where he had another reproducible example using some local databases.<br />
<br />
Watch this space - further posts are planned.
<br />
<br />
<h3>
Update (2 November 2018)</h3>
In <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html">BLAST max alignment limits repartee - part two</a>, I focus on the question of whether database order is important (as claimed in Shah <i>et al.</i> 2018), and how exactly the internal alignment number limit works.<br />
<br />
<h3>
Update (4 November 2018)</h3>
Corrected spelling of Nidhi Shah's surname; I had sometimes written Shar et al. (2018) rather than Shah <i>et al.</i> 2018. My deep apologies. See also the comments on part two, and the link to his test case.<br />
<br />
<h3>
Update (13 November 2018)</h3>
I've published <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">BLAST max alignment limits - part three</a> looking at the test case from Nidhi Shah, and <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">BLAST max alignment limits - part four</a> looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).<br />
<br />
<h3>
Update (7 December 2018)</h3>
I've published a fifth post, <a href="https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html">BLAST tie breaking by database order</a>, looking at how the BLAST database order is defined in comparison to the FASTA file used to build a database.<br />
<br />
<h2>
Entrez eSpell can't resolve PubmedSpellSrv</h2>
Posted by Peter Cock, 27 October 2017 (updated 29 October 2017).<br />
<br />
Another quick bug report blog post, this time NCBI Entrez's espell is currently broken returning:<br />
<br />
<tt>Couldn't resolve #PubmedSpellSrv, the address table is empty.</tt><br />
<br />
<a name='more'></a>Here's an example in full:<br />
<br />
<div style="background-color: black;">
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">$ curl -v https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=biopythooon</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* Trying 165.112.7.20...</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* TCP_NODELAY set</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* Connected to eutils.ncbi.nlm.nih.gov (165.112.7.20) port 443 (#0)</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* Server certificate: *.ncbi.nlm.nih.gov</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* Server certificate: DigiCert SHA2 High Assurance Server CA</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* Server certificate: DigiCert High Assurance EV Root CA</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">> GET /entrez/eutils/espell.fcgi?term=biopythooon HTTP/1.1</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">> Host: eutils.ncbi.nlm.nih.gov</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">> User-Agent: curl/7.54.0</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">> Accept: */*</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">> </span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< HTTP/1.1 200 OK</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Date: Fri, 27 Oct 2017 12:30:59 GMT</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Server: Apache</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Referrer-Policy: origin-when-cross-origin</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Content-Security-Policy: upgrade-insecure-requests</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Access-Control-Allow-Origin: *</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Cache-Control: private</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< NCBI-PHID: 6ECE2B019F326F7100000000001A001A</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< NCBI-SID: 6ECE2B019F327031_0026SID</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Content-Type: text/xml; charset=UTF-8</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Set-Cookie: ncbi_sid=6ECE2B019F327031_0026SID; domain=.nih.gov; path=/; expires=Sat, 27 Oct 2018 12:30:59 GMT</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Vary: Accept-Encoding</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< X-UA-Compatible: IE=Edge</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< X-XSS-Protection: 1; mode=block</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< Transfer-Encoding: chunked</span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">< </span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><?xml version="1.0"?></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><!DOCTYPE eSpellResult PUBLIC "-//NLM//DTD eSpellResult, 23 November 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSpell.dtd"></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><eSpellResult></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><Database/></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><Query/></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><CorrectedQuery/></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><SpelledQuery/></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"><span style="white-space: pre;"> </span><ERROR>Couldn't resolve #PubmedSpellSrv, the address table is empty.</ERROR></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;"></eSpellResult></span></span><br />
<span style="color: #29f914; font-family: "andale mono";"><span style="font-size: 14px;">* Connection #0 to host eutils.ncbi.nlm.nih.gov left intact</span></span></div>
<br />
As usual, in the absence of a public NCBI issue tracker, I will report this by email as well.<br />
<br />
Note this also highlights a general problem with NCBI Entrez having horrible error handling - it fails to set an HTTP error code even on clear failures like this (you get HTTP 200 OK instead).<br />
<br />
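Since the HTTP status cannot be trusted, client code has to inspect the XML itself. The following is a minimal sketch of one way to do that with the Python standard library. The function and exception names are my own (hypothetical), and it only looks for the <tt>ERROR</tt> element seen in the espell output above, so other Entrez tools may need different checks.<br />

```python
import xml.etree.ElementTree as ET


class EntrezError(RuntimeError):
    """Raised when an Entrez XML reply carries an ERROR element."""


def check_entrez_xml(xml_text):
    """Parse an Entrez XML reply, raising EntrezError on any ERROR text.

    Returns the parsed root element if no error message was found.
    """
    root = ET.fromstring(xml_text)
    errors = [element.text.strip()
              for element in root.iter("ERROR")
              if element.text and element.text.strip()]
    if errors:
        raise EntrezError("; ".join(errors))
    return root
```

Passing the body of the reply shown above to <tt>check_entrez_xml()</tt> would raise an exception carrying the "Couldn't resolve #PubmedSpellSrv" message, rather than silently treating the HTTP 200 as success.<br />
<br />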
<h3>
Update (Sunday 29 October)</h3>
A surprise weekend email from the NCBI in reply to my query (Ticket #28045-275380): this is working again and was presumably a transient error.<br />
<br />
<h2>
BLAST+ 2.7.0 segmentation fault with HTML output</h2>
Posted by Peter Cock, 12 October 2017 (updated 8 November 2017).<br />
<br />
I've not managed to blog much at all this year (<a href="http://blastedbio.blogspot.com/2017/01/bbsrc-shared-parental-leave.html">parenthood</a>), so here's a quick BLAST+ bug report from working on updating the Galaxy wrappers: I've found a reproducible segmentation fault in <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">tblastn</span> under both Mac and Linux when requesting HTML output.<br />
<br />
<a name='more'></a><br />
Sample FASTA data files <a href="https://github.com/peterjc/galaxy_blast/blob/master/test-data/four_human_proteins.fasta">four_human_proteins.fasta</a> and <a href="https://github.com/peterjc/galaxy_blast/blob/master/test-data/rhodopsin_nucs.fasta">rhodopsin_nucs.fasta</a> are examples I use often as test cases. This example works perfectly on BLAST+ 2.6.0, here on Mac OS X:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ ~/Downloads/Software/ncbi-blast-2.6.0+/bin/tblastn -query four_human_proteins.fasta -subject rhodopsin_nucs.fasta -evalue 1e-10 -out tblastn_four_human_vs_rhodopsin.html -outfmt 0 -html -db_gencode 1 -seg no -matrix BLOSUM80
</div>
<br />
And now using the recently released BLAST+ 2.7.0, we get a segmentation fault:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ ~/Downloads/Software/ncbi-blast-2.7.0+/bin/tblastn -query four_human_proteins.fasta -subject rhodopsin_nucs.fasta -evalue 1e-10 -out tblastn_four_human_vs_rhodopsin.html -outfmt 0 -html -db_gencode 1 -seg no -matrix BLOSUM80<br />
Segmentation fault: 11
</div>
<br />
The same happens on Linux. In both cases I am using the pre-compiled binaries provided by the NCBI.<br />
<br />
The critical option for triggering the crash seems to be <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-html</span> to request an HTML output file, so the impact is not as bad as it could be.<br />
<br />
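When checking a command like this from a script (for example across several BLAST+ versions), a crash can be told apart from an ordinary error exit by looking at the return code. This is a minimal sketch using the Python standard library, relying on the documented POSIX behaviour that <tt>subprocess</tt> reports a negative return code when the child is killed by a signal. The function names are hypothetical.<br />

```python
import signal
import subprocess


def describe_returncode(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by " + name
    return "exit status %d" % returncode


def run_and_describe(cmd):
    """Run a command (given as a list of arguments) and describe how it ended."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    return describe_returncode(result.returncode)
```

With the two <tt>tblastn</tt> commands above passed as argument lists, I would expect "ok" from the working version and "killed by SIGSEGV" from the crashing one on Mac or Linux.<br />
<br />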
I will report this bug to the NCBI by email, and update this post with any resolution.<br />
<br />
<h3>
Update (18 October 2017)</h3>
The BLAST team confirmed the problem on 12 October 2017, and hope to have a fix out this week - so any day now...<br />
<br />
<h3>
Update (25 October 2017)</h3>
The BLAST team have yet to release a fix, and the original BLAST+ 2.7.0 release was removed from the NCBI FTP site last week.<br />
<br />
<h3>
Update (8 November 2017)</h3>
I forgot to update this blog post at the time, but the BLAST 2.7.1 release was announced 27 October, and fixed this. I believe this issue was covered under "<i>Fixed bl2seq problem with HTML output</i>" in the <a href="https://www.ncbi.nlm.nih.gov/books/NBK131777/">change log</a>.<br />
<br />
<h2>
Mozilla Science Fellowship application 2016</h2>
Posted by Peter Cock, 27 January 2017 (updated 30 January 2017).<br />
<br />
Last summer I applied to the <a href="https://science.mozilla.org/blog/2016-fellows-cfp">Mozilla Fellows for Science 2016 call</a>. Congratulations to the <a href="https://science.mozilla.org/blog/2016-science-fellows">four 2016 fellows</a>, selected from an impressive 483 submissions for only the second year of this innovative program.<br />
<br />
I was delighted to be short listed and interviewed, but also slightly relieved <i>not</i> to have made the final cut. This is due to the timing of personal circumstances - I'm now a father, and despite <a href="https://blastedbio.blogspot.co.uk/2017/01/bbsrc-shared-parental-leave.html">taking time off under the UK's Shared Parental Leave scheme</a>, I'm currently trying to cut back work related activities.<br />
<br />
While preparing my application, I was impressed by <a href="http://fossilsandshit.com/application-for-a-mozilla-science-fellow/">Jon Tennant's decision to post his application openly online</a>, and had been meaning to share mine too. Better late than never?<br />
<br />
<i>[Update: To be clear, this was my application in 2016, which was shortlisted but ultimately unsuccessful]</i><br />
<i><br /></i>
<i>[Update: <a href="https://ics.hutton.ac.uk/mozilla-science-fellowship-application-2016/">Cross-posted on the James Hutton ICS blog</a>]</i><br />
<a name='more'></a>What follows are my answers to the non-administrative questions on the application web form (including the original typos, plus some minor notes <i>in italics</i>), and my cover letter text.<br />
<br />
In addition, I had to provide a CV, and letter of support from my line manager, <a href="http://www.hutton.ac.uk/staff/leighton-pritchard">Leighton Pritchard</a>. I am incredibly grateful - his support letter was glowing, and I felt he did a better job promoting me for the fellowship than my own dry style. Our head of department <a href="http://www.hutton.ac.uk/staff/rupert-hough">Rupert Hough</a> also encouraged me, and HR helped to work out a way the fellowship funding could be handled. I would also like to thank <a href="http://rik.smith-unna.com/">Richard Smith-Unna</a>, one of the inaugural Mozilla Fellows for Science 2015, for taking the time to discuss the scheme, and his encouragement.<br />
<br />
<b>What institution are you based at currently? (10 words)</b><br />
<br />
<a href="http://www.hutton.ac.uk/">The James Hutton Institute</a>, Dundee, Scotland, UK<br />
<br />
<b>What is your role at the institution?</b><br />
<br />
Research staff <i>(picked from a drop down list)</i><br />
<br />
<b>What research fields are you in? (25 words)</b><br />
<br />
Bioinformatics, particularly analysis of DNA and RNA sequence data for genome level analysis of plant pathogens in the context of improving and sustaining agricultural yields.<br />
<br />
<b>What is your research focus? (50 words)</b><br />
<br />
The applied focus has been on the genomic analysis of plant pathogens, particularly nematode worms and plant viruses, using DNA and RNA sequencing. To support this much of my work is on infrastructure maintaining an internal Galaxy server for my colleagues to run bioinformatics tools and workflows within a web-browser.<br />
<br />
<b>Describe to us your current research team. (50 words)</b><br />
<br />
The institute operates a project-centric management system where staff members can be part of multiple projects. This is particularly true for bioinformaticians like myself, as I am and have been involved in multiple projects with different teams. Currently my main focus is plant pathogenic nematodes with mainly wet-lab biologists.<br />
<br />
<b>Describe to us how open science advances your research. (100 words)</b><br />
<br />
Historically following the precedent of early major efforts like the Human Genome project, genomic data including raw DNA and RNA sequencing data is generally openly shared in public repositories upon publication. However, we still struggle with groups hoarding genomic resources prior to publication, which has hampered genomic comparison work.<br />
<br />
On the software side, we are fortunate that open source is the dominant model for bioinformatics tool and algorithm development, allowing free sharing of ideas and methods. From a practical perspective this allows packaging for automated installation and wrapping tools for use in other contexts like browser-based front ends.<br />
<br />
<b>Are you leading any projects related to open science? (100 words)</b><br />
<br />
I co-chair the <a href="https://www.open-bio.org/wiki/BOSC">Bioinformatics Open Source Conference (BOSC)</a> - an annual meeting with around 100 attendees. Much of the work presented is also explicitly about open science, and related themes like data sharing and integration.<br />
<br />
<i>[I have now stepped down as co-chair for <a href="https://www.open-bio.org/wiki/BOSC_2017">BOSC 2017</a>, but remain part of the organising committee.]</i><br />
<br />
Since 2009 I have been de-facto project lead for Biopython, an open source Python library for computational molecular biology developed by an international volunteer based team since 1999. See <a href="http://biopython.org/">http://biopython.org</a> <a href="http://dx.doi.org/10.1093/bioinformatics/btp163">http://dx.doi.org/10.1093/bioinformatics/btp163</a><br />
<br />
I started and maintain several bioinformatics tool wrappers for the Galaxy web-based platform, including the widely used NCBI BLAST+ tools (see <a href="https://github.com/peterjc/galaxy_blast">https://github.com/peterjc/galaxy_blast</a> and <a href="http://dx.doi.org/10.1186/s13742-015-0080-7">http://dx.doi.org/10.1186/s13742-015-0080-7</a> ), and sequence analysis tools (see <a href="https://github.com/peterjc/pico_galaxy">https://github.com/peterjc/pico_galaxy</a> and <a href="http://dx.doi.org/10.7717/peerj.167">http://dx.doi.org/10.7717/peerj.167</a> ).<br />
<br />
<b>How do you see Mozilla advancing your work? (50 words)</b><br />
<br />
This fellowship would allow me to focus on open source development and community work which does not fit neatly into my current job, such as organising the Bioinformatics Open Source Conference (BOSC), secretary to the Open Bioinformatics Foundation (OBF), completing and using my Software Carpentry trainer qualification.<br />
<br />
<i>[When my term as OBF secretary expired, I took on the role of treasurer instead.]</i><br />
<br />
<b>What do you see as the opportunities for impact around open research at your university? Could you leverage this opportunity in a potential project? (50 words)</b><br />
<br />
I write a technical blog at <a href="http://blastedbio.blogspot.com/">http://blastedbio.blogspot.com</a> and would like the group to do do something similar to highlight our work.<br />
<br />
The institute already releases a lot of open source software including <a href="https://ics.hutton.ac.uk/software/">https://ics.hutton.ac.uk/software/</a> and I would like to promote further use of GitHub at <a href="https://github.com/HuttonICS">https://github.com/HuttonICS </a>to encourage external contributions.<br />
<br />
<b>What do you think needs to change most immediately in scientific research? (100 words)</b><br />
<br />
Complete adoption of open-access for scientific publications, making them both free-of-cost to read (e.g. to the public who as tax payers may have funded the work) and also re-use (e.g. large scale data-mining).<br />
<br />
I believe this best route to achieve this is funding mandates, backed by financial penalties for non-compliance.<br />
<br />
I accept that in the short term a 6 or 12 month paid-access embargo period might be a useful compromise.<br />
<br />
The recent EU decision to require Open Access for work funded in Horizon 2020 is a very positive step in this directly <i>[sic, should have been direction]</i>, but all funders need to encourage this.<br />
<br />
<b>What project in the field do you find most inspiring to further science and the web? (50 words)</b><br />
<br />
Although slightly outside my area, <a href="http://www.alltrials.net/">http://www.alltrials.net</a> (campaigning to register all medical trials) and the closely related <a href="http://compare-trials.org/">http://compare-trials.org/</a> (tracking switched outcomes in clinical trials) projects are a combination of web-based public engagement, evidence driven medicine, scientific rigour, and open data sharing - born out of Ben Goldacre blogging as a young doctor.<br />
<br />
<b>Why is the the open web important to you? (25 words)</b><br />
<br />
Openness allows data sharing and innovative data integration. Also open web technologies are one of few roots <i>[sic, should have been routes]</i> for today's children to understand how computers work.<br />
<br />
<b>GitHub or other code repository profile</b><br />
<br />
<a href="https://github.com/peterjc">https://github.com/peterjc</a><br />
<br />
<b>Links to 2 of your projects that have high relevance to open science</b><br />
<br />
<a href="https://github.com/biopython/biopython">https://github.com/biopython/biopython</a><br />
<br />
<a href="https://www.open-bio.org/wiki/BOSC">https://www.open-bio.org/wiki/BOSC</a><br />
<br />
<b>Are you comfortable with semi-regular travel, and what are your travel constraints?</b><br />
<br />
Comfortable with regular travel (and have lived in multiple countries), but shorter trips preferred as we're expecting our first child <i>[rest of answer redacted]</i>.<br />
<br />
<b>Cover Letter</b><br />
<b><br /></b>
15 July 2016<br />
<br />
To the Mozilla Science Fellowship Selection Committee,<br />
<br />
I am writing to apply for the <i>Mozilla Fellowships for Science</i>.<br />
<br />
I have been aware of the Mozilla Project for a long time - I used Netscape as my main internet browser as an undergraduate, and was an early adopter of Pheonix (and later Firefox). Mozilla has undoubtedly changed the internet by promoting competition, innovations and standards, and I am excited to see what changes Mozilla Science Foundation can spur in how science works. I hope to contribute to this work.<br />
<br />
I am a bioinformatician, a computational biologist working mainly with genome data of plant pathogens. We are fortunate that there is a strong tradition of sharing genomic data, and releasing bioinformatics software as open source – but this is by no means universal, and continued advocacy is needed to improve openness and reproducibility in science.<br />
<br />
The Mozilla Fellowship would allow me to focus on and expand my efforts on the open source and related outreach activities. I would continue to serve on the Open Bioinformatics Foundation (OBF) board of directors, again help organize our annual Bioinformatics Open Source Conference (BOSC), and devote more time to the Biopython codebase and community. For example, I would like to mentor another Google Summer of Code student. Taking advantage of the potential travel support, I would want to be more active within the Software Carpentry community, and also attend (or organize) developer gathering like the Debian Packaging Sprints.<br />
<br />
As a relatively young field, like many bioinformaticians my career path has not followed the traditional university academic route. I first studied Mathematics and Physics, and then worked for several years as a systems analysist/programmer at a small company doing software integration and modification. Here I learnt important skills like software testing, using shared version control systems, and learnt first-hand the value of properly documented data standards, and the handicap of working with closed-source tools. With hindsight after returning to academia, this experience was important in helping me see the advantages of, and actively embracing, open source for scientific programming.<br />
<br />
After this spell in industry, I returned to university to complete an interdisciplinary MSc and PhD. My project was a mixture of wet-lab experiments and computational sequence analysis, which led me to start using, and then contributing to, the open source Biopython project.<br />
<br />
At the end of my PhD I joined the Scottish Crop Research Institute, which is now post-merger The James Hutton Institute. I am grateful that the Institute, and in particular my line manager Dr. Leighton Pritchard, have recognized and supported my continued open source, open standards, and open science work.<br />
<br />
However, as institute funding has grown tighter with a growing emphasis on securing external project funding, securing travel funding and dedicating work time to these activities has been getting harder. I recognize that with imminent fatherhood, despite my passion, as things stand I will soon have to scale back my unpaid community work after hours.<br />
<br />
Reducing my hours at the Institute to 20% during the fellowship, as proposed by our human resources department, will be enough to sustain my key institute commitments - including co- supervision of PhD students. I have already been pushing for more openness within the institute, in particular within our Information and Computational Sciences Group, and the fellowship would allow me to spend more time on this. Specific goals include establishing a developer-slated shared group blog, and having recently secured agreement to create a shared GitHub account, I want to expand uptake of this - especially amongst my colleagues already releasing open source scientific software.<br />
<br />
This arrangement would let me spend 80% of my working week as a Mozilla Fellow within our institute, actively engaged with the international open scientific community beyond what I have achieved to-date.<br />
<br />
Thank you for taking the time to evaluate this application, and for establishing this unique fellowship scheme.<br />
<br />
Sincerely,<br />
<br />
Dr Peter Cock<br />
Research Scientist<br />
<div>
<br /></div>
<h2>
Achievement Unlocked: Shared Parental Leave</h2>
Posted by Peter Cock, 25 January 2017 (updated 19 March 2021).<br />
<br />
This year was a first for my employer, the <a href="http://www.hutton.ac.uk/">James Hutton Institute</a>, and potentially all the UK research institutes on <a href="http://www.bbsrc.ac.uk/about/working-for-us/code/introduction/introduction-main/">BBSRC employment Terms and Conditions</a>: A father has taken time off work using Shared Parental Leave (SPL). That's something to be pleased about, although to me overshadowed of course by becoming a father.<br />
<br />
The UK's new SPL system was introduced in April 2015, and is intended to allow working couples to share a year off after the birth of a child. The basics of the <a href="https://www.gov.uk/shared-parental-leave-and-pay/overview">UK SPL scheme and Statutory Shared Parental Pay (ShPP) rules</a> are laid down in law, but good employers will pay the father more than the statutory minimum.<br />
<br />
<a name='more'></a>The James Hutton Institute is currently one of a number of <a href="https://webarchive.nationalarchives.gov.uk/20160701204307/http://www.bbsrc.ac.uk/about/working-for-us/code/introduction/appendix/">UK research institutes who have staff employed on BBSRC Terms and Conditions</a>, something it inherited from its creation by the <a href="http://www.hutton.ac.uk/about/history">merger of SCRI and MLURI</a>. One of the nice things about this is that it isn't just the local Human Resources team who interpret the rules - we can turn to the BBSRC for advice. Surprisingly when our HR checked, we were told that this would be a test case. Understanding how the SPL scheme worked, and what the <a href="http://webarchive.nationalarchives.gov.uk/20160615185255/http://www.rcuk.ac.uk/documents/terms/maternityadoptivematernitysupportparentalleavepolicy-pdf/">BBSRC Terms and Conditions</a> actually meant, took a lot of back and forth. Still, it was nice to be breaking new ground for equality.<br />
<br />
As you would hope from the <a href="http://www.hutton.ac.uk/news/james-hutton-institute-one-first-achieve-athena-swan-status">James Hutton Institute's Athena Swan Bronze</a> status, the principle guiding the BBSRC SPL payments is one of equality: If the mother just took the absolute minimum two weeks maternity leave, turning the remaining up to 52-2=50 weeks into SPL, and the father took his two weeks paternity leave (aka Maternity Support Leave) and all 50 weeks of SPL, then the father would be paid the same as a mother taking all 52 weeks of maternity leave. That example is a nice simple arrangement, but quite unlikely in practice.<br />
<br />
Referring to the "<a href="http://webarchive.nationalarchives.gov.uk/20160615185255/http://www.rcuk.ac.uk/documents/terms/maternityadoptivematernitysupportparentalleavepolicy-pdf/">RESEARCH COUNCIL MATERNITY, ADOPTIVE, MATERNITY SUPPORT (PATERNITY) AND PARENTAL LEAVE POLICY</a>", the BBSRC T&Cs give the mother a generous 26 weeks maternity leave at full pay, then 37-26=11 weeks at Statutory Maternity Pay (SMP) only, with the final 52-37=15 weeks unpaid. How much the father would be paid on SPL turns out to depend on both when it is taken, and if entitled to statutory pay at the time, or not. If taken within the first 26 weeks, any ShPP (i.e. SPL with statutory pay) is <i>upgraded</i> to full pay under the BBSRC terms.<br />
<br />
My wife originally planned to take 38 weeks of maternity leave, which included all 37 weeks with Statutory Maternity Pay (SMP) or better from her employer, and one week of unpaid leave. That would have left 52-38=14 weeks of unpaid SPL available for me. Tempting, but financially that was not attractive.<br />
<br />
Instead she is taking just 35 weeks of maternity leave, leaving a potential 52-35=17 weeks of SPL. Of this, 37-35=2 weeks are ShPP. I have taken the 2 weeks ShPP (paid SPL which is upgraded to full pay with pension contributions thanks to the BBSRC terms), and she is taking 3 weeks of unpaid SPL (meaning as originally planned, she is taking 38 weeks of leave combined). That initially sounds very good, but the catch is my wife is giving up two weeks of statutory pay with full pension contributions.<br />
<br />
This <a href="http://www.telegraph.co.uk/men/the-filter/why-are-only-1-in-100-men-taking-up-shared-parental-leave/">Telegraph article touches on the poor uptake of SPL</a>, and I agree that while the bureaucracy is a major hurdle, the main problem is the very nature of the scheme: Beyond the miserly two weeks, for the father to take (paid) time off work, the mother has to give up some of her (paid) maternity leave. Maybe one day the UK will catch up with Iceland, Norway and Sweden who (on top of a shared leave system) give the father three months paternity leave? Or Japan, where both parents are entitled to up to 52 weeks of partly paid leave (even if it is rare for the father to take much).<br />
<br />
<h3>
Update (24 May 2018)</h3>
<br />
Following the closure of the RCUK website, I have updated affected links to use the final version as recorded on UK Government national archive snapshots.<br />
<br />
Note that James Hutton staff are no longer employed under the BBSRC Terms and Conditions.<br />
<br />
<h2>
Automatically keeping a GitHub fork up to date</h2>
Posted by Peter Cock on 13 May 2016 (last updated 17 August 2018)<br />
<br />
We recently set up a <a href="http://github.com/HuttonICS">departmental GitHub account for Hutton ICS</a>, and one of the things we'll use this for is to showcase projects which <a href="https://ics.hutton.ac.uk/staff/">ICS staff</a> are contributing to - such as <a href="http://biopython.org/">Biopython</a> in my case.<br />
<br />
To start with we have forked <a href="https://github.com/biopython/biopython">https://github.com/biopython/biopython</a> as <a href="https://github.com/huttonics/biopython">https://github.com/huttonics/biopython</a> which we'll use as a read-only mirror - but now we want to keep it up to date with commits pushed to the upstream repository.<br />
<br />
How can we automatically mirror the upstream repository? Enter <a href="https://developer.github.com/guides/managing-deploy-keys/#deploy-keys">GitHub Deploy Keys</a>, which we can use to grant read/write access on a repository basis - which a cron job can use to push changes to our mirrored git repository.<br />
<a name='more'></a><br />
My plan requires an online server where we can set up a cron job which will essentially run:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ git checkout master</span><br />
$ git pull --ff-only origin master</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
$ git push mirror master --tags</div>
<br />
<i>(Update: Original version did not push new tags to the mirror)</i><br />
<br />
This assumes you've left the default remote (<span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">origin</span>) pointing at the upstream repository you want to pull from, and added <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">mirror</span> as the downstream mirror you want to push to. In this example, I did:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ git clone https://github.</span>com/biopython/biopython.git</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ cd </span>biopython</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
$ git remote add mirror git@github.com:HuttonICS/biopython.git</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
$ git fetch mirror</div>
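<br />
To rehearse the whole pull-then-push cycle without touching GitHub, the same commands can be run against throw-away local repositories. This is only a sketch: the function name <i>rehearse_mirror</i> and the repositories upstream.git, mirror.git, seed and work are hypothetical stand-ins made up for this demo, and it assumes git is installed.<br />

```shell
#!/bin/bash
# Offline rehearsal of the mirroring steps using throw-away local repositories.
# upstream.git, mirror.git, seed and work are hypothetical names for this demo.
rehearse_mirror() {
    (
        set -e
        scratch=$(mktemp -d)
        cd "$scratch"

        # A stand-in "upstream" holding one tagged commit on master
        git init -q --bare upstream.git
        git --git-dir=upstream.git symbolic-ref HEAD refs/heads/master
        git init -q seed
        cd seed
        git symbolic-ref HEAD refs/heads/master
        echo "hello" > README
        git add README
        git -c user.name="Example" -c user.email="example@example.org" \
            commit -q -m "Initial commit"
        git tag v1.0
        git push -q "$scratch/upstream.git" master --tags
        cd "$scratch"

        # A stand-in for the (initially empty) mirror on GitHub
        git init -q --bare mirror.git

        # The same steps as in the post, with local paths instead of URLs
        git clone -q "$scratch/upstream.git" work
        cd work
        git remote add mirror "$scratch/mirror.git"
        git fetch -q mirror
        git checkout -q master
        git pull -q --ff-only origin master
        git push -q mirror master --tags

        # Report where the scratch repositories live
        echo "$scratch"
    )
}
```

Calling <i>rehearse_mirror</i> prints the scratch directory, where mirror.git should then hold the same master commit and tag as upstream.git. No SSH key is involved here because both remotes are local paths.<br />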
<br />
Given we're mirroring a public open source project, the fetch doesn't need any special permissions. However, we do need write access to push to our mirror repository. You could do this with a pass-phrase-less personal SSH key associated with your own user account, or with a GitHub account set up just for the script, but a <a href="https://developer.github.com/guides/managing-deploy-keys/#deploy-keys">GitHub Deploy Key</a> seems the best option.<br />
<br />
So, we'll make a new pass-phrase-less RSA SSH key just for this mirroring task:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ ssh-keygen -t rsa -b 4096 -C "biopython key" -f biopython_key -N ""</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Generating public/private rsa key pair.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Your identification has been saved in biopython_key.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Your public key has been saved in biopython_key.pub.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">The key fingerprint is:</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">ce:9d:c4:de:aa:63:02:10:fe:a8:25:b6:ec:37:b5:dc biopython key</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">The key's randomart image is:</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">+--[ RSA 4096]----+</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">| |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">| . |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">| . . |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">| o . |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">| + S o |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">|..o o. o + o |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">|o+. o.o o + . |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">|.o o o.Eo . |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">|... . o.o. |</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">+-----------------+</span></div>
<br />
Then go into the GitHub settings for the mirror repository, and add this as a deploy key (copy and paste the <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">biopython_key.pub</span> file contents):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMS6RMFFHzXL0NZucb0mqeQbswKM7NyRBu3IdaSq6RKR-WlAgOAAomF6fh4jk28opncCvETOiFHobE8qeVZ9PjTEkHykH1TIqRrNFmQ09QJBy3x-1WkgHZiWT1NCAuuqz18JjXBRUdFIs/s1600/deploy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="90" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMS6RMFFHzXL0NZucb0mqeQbswKM7NyRBu3IdaSq6RKR-WlAgOAAomF6fh4jk28opncCvETOiFHobE8qeVZ9PjTEkHykH1TIqRrNFmQ09QJBy3x-1WkgHZiWT1NCAuuqz18JjXBRUdFIs/s320/deploy.png" width="320" /></a></div>
<br />
<br />
If this was a personal account, you could configure which SSH key to use with GitHub via your <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">~/.ssh/config</span> file, but to do this at the command line seems easiest via the <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">$GIT_SSH</span> environment variable which points at the binary or shell script to use in place of the default ssh command. So we have a simple shell script named <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">mirror_ssh</span>,<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">#!/bin/bash</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;"># Call ssh using our GitHub repository deploy key (set via -i)</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;"># using -F to make sure this ignores ~/.ssh/config</span><br />
ssh -i /path/to/deploy_key -F /dev/null -p 22 "$@"</div>
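<br />
As an aside, Git 2.3 and later also honour a GIT_SSH_COMMAND environment variable, which takes the ssh command line directly, so a separate wrapper script is not strictly needed. A minimal sketch, where the helper name <i>deploy_key_ssh_command</i> and the key path are placeholders of mine rather than part of the original setup:<br />

```shell
#!/bin/bash
# Build an ssh command line for git to use via GIT_SSH_COMMAND (Git 2.3+).
# deploy_key_ssh_command is a hypothetical helper; the key path is a placeholder.
deploy_key_ssh_command() {
    key=$1
    # -i picks the deploy key, -F /dev/null ignores ~/.ssh/config, and
    # IdentitiesOnly=yes stops ssh offering other keys from an agent.
    printf 'ssh -i "%s" -F /dev/null -o IdentitiesOnly=yes -p 22' "$key"
}

# Usage (not run here, as it needs a real key and network access):
#   export GIT_SSH_COMMAND="$(deploy_key_ssh_command /path/to/deploy_key)"
#   git push mirror master --tags
```

Git passes this string to the shell, so the quoted key path survives spaces. The wrapper script approach remains the safer choice on older Git installations.<br />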
<br />
The basic task script becomes:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'andale mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ export GIT_SSH=./mirror_ssh</span><br />
<span style="font-variant-ligatures: no-common-ligatures;">$ git checkout master</span><br />
$ git pull --ff-only origin master</div>
<div style="background-color: black; color: #00f900; font-family: 'andale mono'; font-size: 16px; line-height: normal;">
$ git push mirror master --tags</div>
<br />
I wanted to be able to extend this to mirroring multiple repositories, each of which could (and perhaps should) have its own unique GitHub Deploy Key. I'll set up local git repositories and keys for each, and do the sync via a master script <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">mirror_git</span> (see <a href="https://gist.github.com/peterjc/eccac1942a9709993040425d33680352">mirror_git gist</a>) taking the git folder and deploy key file path as arguments:<br />
<br />
<script src="https://gist.github.com/peterjc/eccac1942a9709993040425d33680352.js"></script>
Then I add multiple calls to <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">mirror_git</span> to cron, one for each repository, e.g.<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ crontab -l</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">0 * * * * ~/cron/mirror_git ~/cron/biopython ~/cron/biopython_key</span></div>
<br />
As written <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">mirror_git</span> takes two arguments, the directory name where the temporary git repository is, and the location of the (private) SSH key used to push to the mirror repository, and does some minimal sanity checking before pulling and pushing to GitHub.<br />
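<br />
For readers who cannot see the embedded gist, the kind of sanity checking meant here might look roughly like the following. This is a hypothetical sketch and not the contents of the mirror_git gist: the function names <i>mirror_check_args</i> and <i>mirror_sync</i> are invented, and it uses GIT_SSH_COMMAND (Git 2.3 or later) in place of the wrapper script.<br />

```shell
#!/bin/bash
# Hypothetical sketch of a mirror script's sanity checks. This is NOT the
# mirror_git gist itself, just an illustration of the idea.
mirror_check_args() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: mirror_git GIT_FOLDER DEPLOY_KEY" >&2
        return 1
    fi
    if [ ! -d "$1/.git" ]; then
        echo "ERROR: $1 is not a git working directory" >&2
        return 1
    fi
    if [ ! -r "$2" ]; then
        echo "ERROR: cannot read deploy key $2" >&2
        return 1
    fi
    if ! git -C "$1" config --get remote.mirror.url > /dev/null; then
        echo "ERROR: $1 has no remote named mirror" >&2
        return 1
    fi
}

mirror_sync() {
    mirror_check_args "$@" || return 1
    (
        # The key path should be absolute, since we change directory here
        cd "$1" || exit 1
        # GIT_SSH_COMMAND (Git 2.3+) plays the role of the wrapper script
        GIT_SSH_COMMAND="ssh -i \"$2\" -F /dev/null -o IdentitiesOnly=yes"
        export GIT_SSH_COMMAND
        git checkout -q master &&
        git pull -q --ff-only origin master &&
        git push -q mirror master --tags
    )
}
```

Failing early with a message on stderr matters under cron, which typically emails any output, so a missing key or mistyped folder is reported rather than silently ignored.<br />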
<br />
These cron-jobs are running on an existing server under a non-user account (without any admin privileges).<br />
<br />
<h3>
Update (20 May 2016)</h3>
<br />
In testing when there were no new commits, everything looked fine - the deploy key seemed to be working. But now there are some upstream changes, as shown by a dry-run:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ export GIT_SSH=./mirror_ssh</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ git push mirror master --dry-run</span></div>
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">To git@github.com:HuttonICS/biopython.git</span></div>
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;"> 1517344..c5b3309 master -> master</span></div>
</div>
<br />
However actually pushing the changes gives a novel git failure I've not seen before:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ export GIT_SSH=./mirror_ssh</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
$ git push mirror master</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Counting objects: 45, done.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Delta compression using up to 8 threads.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Compressing objects: 100% (45/45), done.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Writing objects: 100% (45/45), 5.73 KiB | 0 bytes/s, done.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Total 45 (delta 34), reused 0 (delta 0)</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">remote: fatal error in commit_refs</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">To git@github.com:HuttonICS/biopython.git</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;"> ! [remote rejected] master -> master (failure)</span><br />
error: failed to push some refs to 'git@github.com:HuttonICS/biopython.git'</div>
<br />
Adding <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">--verbose</span> didn't reveal any clues. Using <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">ssh -v</span> in the wrapper script confirmed the deploy key was accepted, and that some data was sent.<br />
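<br />
For anyone wanting to repeat that kind of check, here is a rough sketch of the debugging steps. The function name <i>debug_mirror_push</i> and the key path are hypothetical, GIT_SSH_COMMAND assumes Git 2.3 or later, and it must be run from inside the local clone with network access.<br />

```shell
#!/bin/bash
# Hypothetical helper gathering the debugging steps; run it inside the clone.
debug_mirror_push() {
    key=$1
    # 1. Does GitHub accept the key at all? On success this should print a
    #    greeting, although ssh still exits non-zero as no shell is offered.
    ssh -i "$key" -F /dev/null -o IdentitiesOnly=yes -T git@github.com
    # 2. Trace git's own activity, with verbose ssh output, during a dry-run
    GIT_TRACE=1 \
    GIT_SSH_COMMAND="ssh -v -i \"$key\" -F /dev/null -o IdentitiesOnly=yes" \
        git push --dry-run mirror master
}

# Usage (not run here):
#   debug_mirror_push /path/to/deploy_key
```

Neither step would have explained this particular failure, since the rejection happened on the server after authentication, but together they rule out the key and the transport.<br />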
<br />
If we deliberately don't use the deploy key, even the dry-run will fail:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ unset GIT_SSH</span><br />
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ git push mirror master </span></div>
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Permission denied (publickey).</span></div>
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">fatal: Could not read from remote repository.</span></div>
<div style="font-family: 'Andale Mono'; line-height: normal; min-height: 18px;">
<span style="font-variant-ligatures: no-common-ligatures;"></span><br /></div>
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Please make sure you have the correct access rights</span></div>
<span style="font-variant-ligatures: no-common-ligatures;">
</span><br />
<div style="font-family: 'Andale Mono'; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">and the repository exists.</span></div>
<div>
<span style="font-variant-ligatures: no-common-ligatures;"><br /></span></div>
</div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ git push mirror master --dry-run</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Permission denied (publickey).</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">fatal: Could not read from remote repository.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal; min-height: 18px;">
<span style="font-variant-ligatures: no-common-ligatures;"></span><br /></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Please make sure you have the correct access rights</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">and the repository exists.</span></div>
<br />
GitHub itself seemed to be working: pushing to other repositories as myself worked, and there were no problems reported on the <a href="https://status.github.com/">GitHub Status</a> page.<br />
<br />
<h3>
Update (23 May 2016)</h3>
<div>
A few other people also hit this <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">remote: fatal error in commit_refs</span> problem around the same time, e.g. <a href="http://stackoverflow.com/questions/37341960/how-do-i-fix-remote-fatal-error-in-commit-refs-errors-trying-to-push-with-git">Tomas Skogberg</a> and <a href="http://pastebin.com/50BL9YTF">Mona Jalal</a>, and neither of them was trying to use a Deploy Key. This looks like a rare but more general problem at GitHub, so I have reported it to them.<br />
<br />
<h3>
Update (23 May 2016)</h3>
<br />
GitHub replied that this does look like a problem at their end, and they are looking into it.<br />
<br />
<h3>
Update (1 June 2016)</h3>
<br />
GitHub confirmed yesterday that this was due to their <a href="https://help.github.com/articles/about-protected-branches/">protected branch settings</a>, and have updated their system to give a much more useful error message:<br />
<br />
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">$ git push mirror master</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Counting objects: 391, done.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Delta compression using up to 8 threads.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Compressing objects: 100% (391/391), done.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Writing objects: 100% (391/391), 99.28 KiB | 0 bytes/s, done.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">Total 391 (delta 298), reused 0 (delta 0)</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">remote: error: GH006: Protected branch update failed for refs/heads/master.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">remote: error: You're not authorized to push to this branch. Visit https://help.github.com/articles/about-protected-branches/ for more information.</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">To git@github.com:HuttonICS/biopython.git</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;"> ! [remote rejected] master -> master (protected branch hook declined)</span></div>
<div style="background-color: black; color: #00f900; font-family: 'Andale Mono'; font-size: 16px; line-height: normal;">
<span style="font-variant-ligatures: no-common-ligatures;">error: failed to push some refs to 'git@github.com:HuttonICS/biopython.git'</span></div>
<br />
The master branch on the mirror repository was protected (to prevent force-pushes etc), but also <i>"Restrict who can push to this branch"</i> was ticked (I wanted to avoid any accidental updates) which had the perhaps unexpected side effect of preventing use of a Deploy Key. Unticking this has fixed my automated deployment.<br />
<h3>
Update (12 July 2016)</h3>
</div>
<div>
I'm told that after some internal discussions, GitHub will now allow a Deploy Key to be used on a protected branch ignoring the "<i>Restrict who can push to this branch</i>" setting. Since you need Admin permissions to create the key, this seems better to me.<br />
<br />
<h3>
Update (9 October 2017)</h3>
Added <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">--tags</span> argument to the push command.<br />
<br />
<h3>
Update (2 February 2018)</h3>
Added <span style="background-color: black; color: #00f900; font-family: "andale mono"; font-size: 16px;">--tags</span> argument to the fetch command (in case any tags were changed).<br />
<br />
<h3>
Update (17 August 2018)</h3>
<br />
I've posted the <a href="https://gist.github.com/peterjc/2f1bf0633afbbbb93e074625758fa9a7">mirror_setup</a> script I used to simplify adding more repositories to the collection we mirror under HuttonICS:<br />
<br /></div>
<script src="https://gist.github.com/peterjc/2f1bf0633afbbbb93e074625758fa9a7.js"></script>
This is all hard coded with HuttonICS as the mirror account username, but may be useful anyway.
<h2>
What BLAST's max-target-sequences doesn't do</h2>
Posted by Peter Cock on 18 December 2015 (last updated 8 January 2019)<br />
<br />
This is a short post to highlight a <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5" target="_blank">scary BLAST+ -max_target_seqs bug</a> found and reported by <a href="http://ylog.org/sujai/" target="_blank">Sujai Kumar</a>, which he discovered in the course of working on some puzzling <a href="https://github.com/DRL/blobtools" target="_blank">Blobtools</a> output while analysing the <a href="http://dx.doi.org/10.1101/033464">tardigrade genome</a>.<br />
<br />
<a name='more'></a>In essence we, and probably most BLAST+ users, had assumed the <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> command line option was applied after the search was finished prior to output, and likewise for the plain text output's <span style="font-family: "courier new" , "courier" , monospace;">-num_descriptions</span> and <span style="font-family: "courier new" , "courier" , monospace;">-num_alignments</span> (see also <a href="http://blastedbio.blogspot.co.uk/2014/12/blast-christmas-wish-list.html">#5 on my Christmas 2014 BLAST wish list</a>).<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -help<br />
...<br />
-max_target_seqs <Integer, >=1><br />
Maximum number of aligned sequences to keep<br />
Not applicable for outfmt <= 4<br />
Default = `500'<br />
* Incompatible with: num_descriptions, num_alignments<br />
...</div>
<br />
Then Sujai found a reproducible counter example contradicting this mental model, where the top 20 hits switched from all being Eukaryotes to all Bacteria depending on <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span>.<br />
<br />
The NCBI reply by email was:<br />
<blockquote class="tr_bq">
<i>Hello,<br />
Thank you for the report. We don't consider this a bug, but I agree that
we should document this possibility better. This can happen because
limits, including max target sequences, are applied in an early ungapped
phase of the algorithm, as well as later. In some cases a final HSP
will improve enough in the later gapped phase to rise to the top hits.
In your case, relaxing the limit to 200 appears to have allowed hits
that would have been excluded in the ungapped phase at 100 max target
sequences to rise.</i></blockquote>
<br />
So basically while most of the time <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> does <i>seem</i> to be a final filter before the output is written, this setting actually happens much earlier in the search and can (unexpectedly) cull what turns out to be the best hit.<br />
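<br />
Given that, one cautious way to get a "top N" list is to ask BLAST for far more hits than needed and trim the output afterwards. The following is a sketch only, assuming tabular output (<span style="font-family: "courier new" , "courier" , monospace;">-outfmt 6</span>) where column one is the query ID, column two the subject ID, and hits are listed best first. The helper name <i>top_hits_per_query</i>, the file query.fasta and the database my_db are all placeholders.<br />

```shell
#!/bin/bash
# Keep only the first N distinct subjects per query from BLAST tabular output.
# top_hits_per_query is a hypothetical helper name; input is read from stdin.
top_hits_per_query() {
    n=$1
    awk -F '\t' -v n="$n" '
        # First time we see this query/subject pair, record its rank
        !(($1, $2) in rank) { rank[$1, $2] = ++count[$1] }
        # Print every HSP line belonging to the top n subjects of each query
        rank[$1, $2] <= n
    '
}

# Usage (not run here, query.fasta and my_db are placeholders):
#   blastp -query query.fasta -db my_db -outfmt 6 -max_target_seqs 1000 \
#       | top_hits_per_query 20
```

This only reduces the risk rather than removing it, because a generous limit is still applied during the search and could in principle still cull a hit that would have ranked well.<br />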
<br />
I agree with the NCBI BLAST team that this needs better documentation, but believe this should be treated as a bug, not a feature.<br />
<br />
Many thanks to Sujai for <a href="https://twitter.com/sujaik/status/671333856461660160">openly reporting this via Twitter</a> and posting his <a href="https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5">BLAST+ bug report as a gist on GitHub</a> where we could see and comment on the issue. Sadly <a href="http://blastedbio.blogspot.co.uk/2011/08/opening-up-ncbi-blast.html">I'm still waiting for an official NCBI public BLAST+ bug tracker</a>.<br />
<br />
<h3>
Update 26 September 2018</h3>
This post was cited in <a href="https://doi.org/10.1093/bioinformatics/bty833">Shah <i>et al.</i> (2018) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows.</a> I also fixed a typo.<br />
<br />
<h3>
Update 2 November 2018</h3>
I've posted an initial followup, <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html">BLAST max alignment limits repartee - part one</a>, which introduces a small self-contained test case, and emphasises that this issue is not specific to <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> but <i>also</i> affects the limits <span style="font-family: "courier new" , "courier" , monospace;">-num_descriptions</span> and <span style="font-family: "courier new" , "courier" , monospace;">-num_alignments</span> used with the human readable plain text or HTML output.<br />
<br />
In <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-one.html">BLAST max alignment limits repartee - part two</a>, I focus on the question of whether database order is important (as claimed in Shah <i>et al.</i> 2018), and how exactly the internal alignment number limit works.<br />
<br />
<h3>
Update 13 November 2018</h3>
I've published <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-three.html">BLAST max alignment limits - part three</a> looking at the test case from Nidhi Shah, and <a href="https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-part-four.html">BLAST max alignment limits - part four</a> looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).<br />
<br />
<h3>
Update (7 December 2018)</h3>
I've published a fifth follow-up post, <a href="https://blastedbio.blogspot.com/2018/12/blast-tie-break-db-order.html">BLAST tie breaking by database order</a>, looking at how the BLAST database order is defined in comparison to the FASTA file used to build a database.<br />
<br />
<h3>
Update (8 January 2019)</h3>
I've published a sixth follow-up post, "<a href="https://blastedbio.blogspot.com/2019/01/blast-overly-aggressive-optimization.html">An overly aggressive optimization in BLASTN and MegaBLAST</a>", which comes after the BLAST team's formal reply to the Shah <i>et al.</i> (2018) letter and BLAST+ 2.8.1 were published in late December. This update fixed oddities reported in Shah <i>et al.</i> (2018), which turn out to have been a complex interaction of multiple issues.Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com10tag:blogger.com,1999:blog-8584629468471803075.post-70798496710361396962015-07-08T17:37:00.000+01:002015-12-31T04:03:39.019+00:00BLAST XML 2 - does the sequel live up to my hopes?Last year I wrote a blog post "<a href="http://blastedbio.blogspot.co.uk/2014/02/blast-xml-output-needs-more-love-from.html">BLAST XML output needs more love from NCBI</a>", and in the numerous updates to this, tracked the NCBI outreach and then release of BLAST XML 2.<br />
<br />
The new output format was included in BLAST+ 2.2.31 as output format 14, without any kind of beta release for user feedback. Later than planned, I was able to give this a try during the Galaxy Community Conference 2015 Hackathon. Sadly the worries voiced on the OBF Bio* mailing lists were well founded.<br />
<br />
In part because XML is so verbose, it is nice to be able to parse it as a stream - meaning capturing the output via stdout and Unix pipes. That appears to be "broken". In fact, producing a bundle of XML files using <span style="font-family: "courier new" , "courier" , monospace;">XInclude</span> seems a recipe for trouble.<br />
<a name='more'></a><h3>
Setting up the example</h3>
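Here I am using the <a href="https://github.com/peterjc/galaxy_blast/tree/master/test-data">sample files from this repository</a>. First, an example using <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-outfmt 6</span>, the concise tabular output:<br />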
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 6
gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 96.55 348 12 0 1 348 1 348 0.0 701
gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 83.33 354 53 2 1 354 1 348 0.0 605
...</pre>
</div>
<br />
Now, using classic BLAST XML,<br />
<br />
<div style="background-color: black;">
<pre style="color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 5
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.31+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>four_human_proteins.fasta</BlastOutput_db>
<BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
<BlastOutput_query-def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</BlastOutput_query-def>
<BlastOutput_query-len>348</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>0.0001</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Iteration_query-def>
<Iteration_query-len>348</Iteration_query-len>
...
</Iteration>
<Iteration>
<Iteration_iter-num>2</Iteration_iter-num>
<Iteration_query-ID>Query_2</Iteration_query-ID>
<Iteration_query-def>gi|3024260|sp|P56514.1|OPSD_BUFBU RecName: Full=Rhodopsin</Iteration_query-def>
<Iteration_query-len>354</Iteration_query-len>
...
</Iteration></pre>
<pre style="color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">...
</BlastOutput_iterations>
</BlastOutput></pre>
</div>
<br />
There is one <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;"><Iteration></span> block per query.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 5 | grep "<Iteration>"
<Iteration>
<Iteration>
<Iteration>
<Iteration>
<Iteration>
<Iteration>
</pre>
</div>
<br />
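Because classic BLAST XML is a single document with one Iteration element per query, it can be read as a stream from a pipe. Here is a minimal Python sketch using the standard library's incremental parser; the function name iter_query_defs is my own invention for illustration, and real code would pull out the hits as well:<br />

```python
import xml.etree.ElementTree as ET


def iter_query_defs(handle):
    """Yield the query description from each <Iteration> block in turn.

    The handle should be a binary file-like object, for example
    sys.stdin.buffer when reading from a Unix pipe.
    """
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            yield elem.findtext("Iteration_query-def")
            # Discard this query's parsed children to keep memory use low.
            elem.clear()
```

This could be fed with something like <code>blastp ... -outfmt 5 | python my_script.py</code>, where the hypothetical my_script.py loops over <code>iter_query_defs(sys.stdin.buffer)</code>.<br />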
<h3>
Invalid XML to stdout</h3>
Finally, using the new BLAST XML v2 as implemented in BLAST+ 2.2.31,<br />
<br />
<div style="background-color: black;">
<pre style="color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14
<?xml version="1.0"?>
<BlastOutput2
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov http://www.ncbi.nlm.nih.gov/data_specs/schema_alt/NCBI_BlastOutput2.xsd"
>
<report>
<Report>
...
</Report>
</report>
</BlastOutput2>
<?xml version="1.0"?>
<BlastOutput2
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov http://www.ncbi.nlm.nih.gov/data_specs/schema_alt/NCBI_BlastOutput2.xsd"
>
<report>
<Report>
...
</Report>
</report>
</BlastOutput2>
...
</pre>
</div>
<br />
Anyone see the problem right away? There are multiple <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;"><?xml version="1.0"?></span> lines throughout the output - this is not valid XML but the concatenation of XML files.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 | grep "xml version"
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
</pre>
</div>
<br />
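If you are stuck with this behaviour and need to read the output from a pipe, one possible workaround is to cut the stream back into separate documents at each XML declaration before parsing. This is only a sketch: it assumes every report starts with its own declaration exactly as shown above, and the function names are mine rather than anything from BLAST+:<br />

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text):
    """Split a string holding several XML documents back to back.

    Assumes every document begins with its own '<?xml' declaration.
    """
    marker = "<?xml"
    starts = []
    pos = text.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = text.find(marker, pos + len(marker))
    if not starts:
        # No declaration at all: treat the input as at most one document.
        return [text] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def parse_reports(text):
    """Parse each concatenated document and return the root elements."""
    return [ET.fromstring(doc) for doc in split_concatenated_xml(text)]
```

Note this holds the whole output in memory, so it loses the streaming benefit that made stdout attractive in the first place.<br />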
Does anyone find this familiar? Early versions of the XML BLAST output from "legacy" BLAST had the same problem (before the <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;"><Iteration></span> element was repurposed as a per-query block).<br />
<br />
The reason behind this is perhaps somewhat explained if we ask BLAST+ to write to a file instead of defaulting to stdout:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out example.xml
$ cat example.xml
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
<xi:include href="example_1.xml"/>
<xi:include href="example_2.xml"/>
<xi:include href="example_3.xml"/>
<xi:include href="example_4.xml"/>
<xi:include href="example_5.xml"/>
<xi:include href="example_6.xml"/>
</BlastXML></pre>
</div>
<br />
It appears that what is being written to stdout by default is the concatenation of these six child files (at least in this example).<br />
<br />
<h3>
Problems with paths</h3>
I also found a clear bug in the new BLAST XML v2 output, consider this example:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ mkdir output
$ ls output/
$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out output/example.xml
$ ls output/
example.xml example_1.xml example_2.xml example_3.xml example_4.xml example_5.xml example_6.xml
$ cat output/example.xml
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
<xi:include href="output/example_1.xml"/>
<xi:include href="output/example_2.xml"/>
<xi:include href="output/example_3.xml"/>
<xi:include href="output/example_4.xml"/>
<xi:include href="output/example_5.xml"/>
<xi:include href="output/example_6.xml"/>
</BlastXML></pre>
</div>
<br />
The problem here is that the master XML file contains include links with paths relative to where BLAST was run - not relative to the master XML file itself. This would break parsing the output.<br />
<br />
Similarly, if I use an absolute path to the output parameter, then the Include lines also use absolute paths. Again, since they are output in the same folder as the master XML file, the include paths should simply be relative paths (otherwise the XML file set cannot be moved without breaking the links):
<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: x-small;">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out /tmp/example.xml
$ cat /tmp/example.xml
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
<xi:include href="/tmp/example_1.xml"/>
<xi:include href="/tmp/example_2.xml"/>
<xi:include href="/tmp/example_3.xml"/>
<xi:include href="/tmp/example_4.xml"/>
<xi:include href="/tmp/example_5.xml"/>
<xi:include href="/tmp/example_6.xml"/>
</BlastXML></pre>
</div>
<br />
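Until this is fixed, a parser could sidestep both path problems by ignoring the folder part of each include link. The following Python sketch assumes (as observed above) that the child files always sit in the same folder as the master file; the function name child_report_paths is hypothetical:<br />

```python
import os
import xml.etree.ElementTree as ET


def child_report_paths(master_path):
    """List the child XML files named by a master file's include elements.

    Only the file name part of each href is kept, and it is joined to the
    folder holding the master file.
    """
    folder = os.path.dirname(os.path.abspath(master_path))
    paths = []
    for elem in ET.parse(master_path).getroot().iter():
        # Match on the local name so the exact XInclude namespace URI
        # does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "include":
            href = elem.get("href")
            if href:
                paths.append(os.path.join(folder, os.path.basename(href)))
    return paths
```

With this approach the set of XML files can be moved to another folder and still be found, which is what relative include paths would have given us.<br />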
<h3>
Problems with permissions</h3>
BLAST+ 2.2.31 will auto-name the child XML files for each query based on the requested output file - but there is no guarantee that those files won't already exist, or that they can even be created. Here's a pathological example using a special "filename":<br />
<br />
<div style="background-color: black; font-family: 'Andale Mono'; font-size: x-small;">
<pre><span style="color: #29f914;">$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out /dev/stdout
</span><span style="color: red;">Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file
Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file
Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file
Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file
Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file
Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file</span><span style="color: #29f914;">
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
<xi:include href="/dev/stdout_1.xml"/>
<xi:include href="/dev/stdout_2.xml"/>
<xi:include href="/dev/stdout_3.xml"/>
<xi:include href="/dev/stdout_4.xml"/>
<xi:include href="/dev/stdout_5.xml"/>
<xi:include href="/dev/stdout_6.xml"/>
</BlastXML></span></pre>
</div>
<br />
In this example, <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">/dev/stdout</span> is a Linux convention for writing to stdout (useful with command line tools which otherwise would only output to a file). I've highlighted in red the errors (on stderr) from BLAST+ 2.2.31 naively generating child XML filenames from the <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-out</span> parameter.<br />
<br />
There's another bug here in that BLAST did not abort, indeed it gave a zero return code meaning success. You can trigger this in other ways:<br />
<br />
<div style="background-color: black; font-family: 'Andale Mono'; font-size: x-small;">
<pre><span style="color: #29f914;">$ ls example*.xml
ls: example*.xml: No such file or directory
$ touch example_2.xml
$ chmod a-w example_2.xml
$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out example.xml && echo "Returned $?"
</span><span style="color: red;">Error: [blastp] Cannot open output fileNCBI C++ Exception:
T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file</span><span style="color: #29f914;">
Returned 0
$ ls -l example*.xml
-rw-r--r-- 1 peterjc staff 340 5 Jul 17:03 example.xml
-rw-r--r-- 1 peterjc staff 4079 5 Jul 17:03 example_1.xml
-r--r--r-- 1 peterjc staff 0 5 Jul 17:02 example_2.xml
-rw-r--r-- 1 peterjc staff 4026 5 Jul 17:03 example_3.xml
-rw-r--r-- 1 peterjc staff 4019 5 Jul 17:03 example_4.xml
-rw-r--r-- 1 peterjc staff 4070 5 Jul 17:03 example_5.xml
-rw-r--r-- 1 peterjc staff 4099 5 Jul 17:03 example_6.xml
</span></pre>
</div>
<br />
In this case BLAST+ 2.2.31 was unable to create one of the child files, but again despite printing an error message to stderr, lies and returns a success return code.<br />
<br />
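Since the return code cannot be trusted here, a wrapper script has to look at stderr and at the files themselves. A rough Python sketch of such a check, assuming the error lines start with "Error:" as in the output above (the function name blast_output_problems is made up for this example):<br />

```python
import os


def blast_output_problems(stderr_text, expected_files):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for line in stderr_text.splitlines():
        # BLAST+ 2.2.31 printed lines like "Error: [blastp] Cannot open ..."
        if line.startswith("Error:"):
            problems.append("stderr: " + line.strip())
    for path in expected_files:
        if not os.path.isfile(path):
            problems.append("missing: " + path)
        elif os.path.getsize(path) == 0:
            problems.append("empty: " + path)
    return problems
```

In the second example above this would flag both the message on stderr and the zero-byte example_2.xml file.<br />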
<h3>
Why use XML includes?</h3>
Since I first read the BLAST XML v2 proposal, I have yet to come up with a compelling reason for the NCBI to be using XML includes and all the overheads of multi-file output. Would it make sense to get rid of this in the next BLAST+ release and simply produce a single (large) XML file even for multiple-query BLAST searches?<br />
<br />
<h3>
Update (12 November 2015)</h3>
<br />
The NCBI are going to offer the <a href="http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2015q4/000118.html">new BLAST+ XML and JSON output in a single-file mode</a> with the next release.<br />
<br />
<h3>
Update (31 December 2015)</h3>
BLAST+ 2.3.0 (released 21 December 2015) appears to fix some/all of the bugs logged here, and offers a single-file mode. I have not yet tested this.Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com0tag:blogger.com,1999:blog-8584629468471803075.post-11151449238831296232015-07-04T19:55:00.000+01:002017-06-02T10:37:20.151+01:00NCBI working on SAM output from BLAST+Recently <a href="http://www.ncbi.nlm.nih.gov/news/06-16-2015-blast-plus-update/">NCBI BLAST+ 2.2.31</a> was released, and it contains an undocumented "Easter Egg" - this is still very rough around the edges but they're working on SAM format output!<br />
<a name='more'></a><br />
The command line help in BLAST+ 2.2.31 only describes output formats 0 to 14:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
<pre>$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -help
USAGE
...
DESCRIPTION
Protein-Protein BLAST 2.2.31+
...
*** Formatting options
-outfmt &lt;string&gt;
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1),
12 = JSON Seqalign output,
13 = JSON Blast output,
14 = XML2 Blast output
Options 6, 7, and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
...
</pre>
</div>
<br />
I discovered by accident that 15 offers SAM format output. First, using the <a href="https://github.com/peterjc/galaxy_blast/tree/master/test-data">sample files from this repository</a>, here's an example using <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-outfmt 6</span>, the concise tabular output:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono';">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 6
gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 96.55 348 12 0 1 348 1 348 0.0 701
gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 83.33 354 53 2 1 354 1 348 0.0 605
gi|283855846|gb|ADB45242.1| sp|P08100|OPSD_HUMAN 94.82 328 17 0 1 328 11 338 0.0 630
gi|283855823|gb|ADB45229.1| sp|P08100|OPSD_HUMAN 94.82 328 17 0 1 328 11 338 0.0 630
gi|223523|prf||0811197A sp|P08100|OPSD_HUMAN 93.10 348 23 1 1 347 1 348 0.0 651
gi|12583665|dbj|BAB21486.1| sp|P08100|OPSD_HUMAN 81.09 349 65 1 1 349 1 348 0.0 587
</pre>
</div>
<br />
Here's what BLAST+ 2.2.31 gives with <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-outfmt 15</span> instead:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono';">
<pre>$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 15
@HD VN:1.2 GO:query
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
lcl|Query_1 0 gnl|BL_ORD_ID|3 1 255 348M * 0 0 * * AS:i:1808 EV:f:NM:i:0 PI:f:96.55 BS:f:701.049
lcl|Query_2 0 gnl|BL_ORD_ID|3 1 255 333M1I8M5I7M * 0 0 * * AS:i:1560 EV:f:0 NM:i:6 PI:f:84.77 BS:f:605.52
lcl|Query_3 0 gnl|BL_ORD_ID|3 11 255 328M * 0 0 * * AS:i:1625 EV:f:NM:i:0 PI:f:94.82 BS:f:630.558
lcl|Query_4 0 gnl|BL_ORD_ID|3 11 255 328M * 0 0 * * AS:i:1625 EV:f:NM:i:0 PI:f:94.82 BS:f:630.558
lcl|Query_5 0 gnl|BL_ORD_ID|3 1 255 190M1D157M * 0 0 * * AS:i:1680 EV:f:0 NM:i:1 PI:f:93.37 BS:f:651.744
lcl|Query_6 0 gnl|BL_ORD_ID|3 1 255 328M1I20M5H * 0 0 * * AS:i:1512 EV:f:0 NM:i:1 PI:f:81.32 BS:f:587.03
</pre>
</div>
<br />
Note the use of standard <a href="http://samtools.github.io/hts-specs/SAMv1.pdf">SAM/BAM format</a> tags - looking at the first match:<br />
<ul>
<li><span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">AS:i:1808</span> - integer alignment score, here 1808</li>
<li><span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">NM:i:0</span> - integer edit distance, here 0</li>
</ul>
Plus non-standard tags:<br />
<ul>
<li><span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">EV:f:0</span> - float e-value, here 0</li>
<li><span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">PI:f:96.55</span> - float percent identity, here 96.55</li>
<li><span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">BS:f:701.049</span> - float bit-score, here 701.049</li>
</ul>
<br />
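These optional fields all follow the SAM TAG:TYPE:VALUE pattern, so they are easy to pull apart. A small Python sketch, assuming a well-formed tab-separated record with the usual eleven mandatory columns first (the function name parse_sam_tags is just for illustration):<br />

```python
def parse_sam_tags(line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # leave other types (e.g. Z strings) as text
    return tags
```

Applied to the second record above, this gives the alignment score under AS, the edit distance under NM, and the NCBI specific e-value, percent identity and bit-score under EV, PI and BS.<br />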
Sadly, BLAST's internal names are being used for the query and matches (e.g. <span style="background-color: black; color: #29f914; font-family: "andale mono";">lcl|Query_1</span> and <span style="background-color: black; color: #29f914; font-family: "andale mono";">gnl|BL_ORD_ID|3</span>) rather than the real names (<span style="background-color: black; color: #29f914; font-family: "andale mono";">gi|57163783|ref|NP_001009242.1|</span> and <span style="background-color: black; color: #29f914; font-family: "andale mono";">sp|P08100|OPSD_HUMAN</span>). This reminds me of earlier problems; see my older post "<a href="http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html">BLAST+ should keep its BL_ORD_ID identifiers to itself</a>".<br />
<br />
Also there's an obvious bug in the duplication of the "reference" <span style="background-color: black; color: #29f914; font-family: "andale mono";">@SQ</span> lines in the header.<br />
<br />
For bonus points they should add a <span style="background-color: black; color: #29f914; font-family: "andale mono";">@PG</span> line to the header as well, giving the BLAST+ version etc.<br />
<br />
Still, as long as these niggles are sorted out for the first official release to offer SAM output, I think this will be surprisingly useful. For now, there are options like the BLAST XML to SAM tool Pierre Lindenbaum's student Aurélien Guy-Duché wrote (<a href="http://plindenbaum.blogspot.co.uk/2015/06/a-blast-to-sam-converter.html">blog post</a>, <a href="https://github.com/guyduche/Blast2Bam">repository</a>).<br />
<br />
<h3>
Update (6 July 2015)</h3>
Last night Zach Charlop-Powers got in touch on Twitter (<a href="https://twitter.com/zach_cp">@zach_cp</a>) to report that using <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-parse_deflines</span> fixes the query names, and if the database was built using <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">makeblastdb -parse_seqids ...</span> then this fixes the match names. He put together <a href="https://gist.github.com/zachcp/3781489afcdf26d3eeda">an example</a> too.<br />
<br />
Unfortunately that doesn't work if you are not using the NCBI specific pipe-character based naming scheme (see also <a href="http://blastedbio.blogspot.co.uk/2012/10/my-ids-not-good-enough-for-ncbi-blast.html">My IDs not good enough for NCBI BLAST+</a>).<br />
<br />
<i>P.S.</i> To date SAM/BAM has only been used for nucleotides, but my example above was from a protein BLAST so it was apparently using SAM format working in amino acid residues rather than base pairs.<br />
<br />
<h3>
Update (31 December 2015)</h3>
SAM output is now officially documented as a beta feature in BLAST+ 2.3.0 (released 21 December 2015).<br />
<br />
<h3>
Update (2 June 2017)</h3>
Note that in the official release, SAM output is available via <span style="background-color: black; color: #29f914; font-family: "andale mono"; font-size: 14px;">-outfmt 17</span> (not 15 as in the examples above), and is only in <span style="font-family: Courier New, Courier, monospace;">blastn</span>. It is not in <span style="font-family: Courier New, Courier, monospace;">blastp</span> etc (which makes sense as SAM is about nucleotide alignments).Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com7tag:blogger.com,1999:blog-8584629468471803075.post-65448269729949900182015-06-01T21:33:00.000+01:002015-06-01T21:33:59.297+01:00PrePrint: SAM/BAM format v1.5 extensions for de novo assembliesHere's a little back-story on my latest preprint (based on my <a href="http://sourceforge.net/p/samtools/mailman/message/34161881/">email</a> to samtools-devel), which went live on the biology preprint server bioRxiv at the end of last week:<br />
<blockquote class="tr_bq">
<a href="http://dx.doi.org/10.1101/020024"><span class="il">SAM</span>/<span class="il">BAM</span> format v1.5 extensions for de novo assemblies.</a><br />
<a href="http://orcid.org/0000-0001-9513-9993">Peter J. A. Cock</a>, <a href="http://orcid.org/0000-0002-6447-4112">James K. Bonfield</a>, <a href="http://orcid.org/0000-0003-4419-8840">Bastien Chevreux</a>, <a href="http://orcid.org/0000-0003-4874-2874">Heng Li</a>.<br />
bioRxiv DOI: 10.1101/020024</blockquote>
The current version is a terse three pages (trying to meet an "application note" page limit), but nevertheless should clarify the intended usage of these parts of the <span class="il">SAM</span>/<span class="il">BAM</span> specification.<br />
<a name='more'></a>This manuscript has been in progress since 2012, in parallel with
the associated file format change discussions all held openly on the <a href="http://sourceforge.net/p/samtools/mailman/samtools-devel/">samtools-devel mailing list</a>, and "<a href="https://github.com/samtools/samtools/blob/develop/padding.c">samtools depad</a>" work on GitHub.<br />
<br />
My apologies to my co-authors, the long delays are my fault. After getting useful comments from an internal pre-submission review (thank you to <a href="https://ics.hutton.ac.uk/staff/">colleagues at the James Hutton Institute</a>), I should have posted the current preprint back in February. Better late than never.<br />
<br />
Also, <a href="http://nickloman.github.io/high-throughput%20sequencing/2011/09/19/sambam-its-time-for-a-single-standard-for-assembly-output/">thank you to Nick Loman for a discussion in 2011</a> which was one of the motivations in making this effort in the first place.<br />
<br />
Also relevant are some of my blog posts from late 2011, with screenshots illustrating <a href="http://blastedbio.blogspot.co.uk/2011/09/sambam-with-gapped-reference.html"><span class="il">SAM</span>/<span class="il">BAM</span> files with a padded reference</a>, vs <a href="http://blastedbio.blogspot.co.uk/2011/10/sambam-without-gapped-reference.html">SAM/BAM files with an un-padded reference</a>.<br />
<br />
If there are queries about the file format itself, please raise them on the <a href="http://sourceforge.net/p/samtools/mailman/message/34161881/">samtools-devel mailing list</a>. I'm happy to receive comments about the manuscript itself by email directly.Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com0tag:blogger.com,1999:blog-8584629468471803075.post-33773376515861229062015-02-02T15:42:00.001+00:002018-10-19T15:46:47.015+01:00BLAST+ rejecting query files with zero sequencesThis is another brief NCBI BLAST+ bug report blog post, about a regression in BLAST+ 2.2.29 which will be breaking existing pipelines around the world. The problem is a new "<i>feature</i>" which treats an empty query file as an error.<br />
<a name='more'></a><br />
For this example, first make an empty query file:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ touch empty_file.fasta</div>
<br />
Here's a simple example command showing that older versions of BLAST+ would handle this corner case nicely, finishing with a zero return code (meaning success - shown here using echo and the special shell variable <span style="font-family: "courier new" , "courier" , monospace;">$?</span>). I tried this with BLAST+ 2.2.18 through to 2.2.28 inclusive:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query empty_file.fasta -db nr -outfmt 6; echo "[Return code $?]"<br />
[Return code 0]</div>
<br />
But not any more, both BLAST+ 2.2.29 and the current release 2.2.30 have broken this:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query empty_file.fasta -db nr -outfmt 6; echo "[Return code $?]"<br />
Command line argument error: Query is Empty!<br />
[Return code 1]
</div>
<br />
Following Unix conventions for an error, here the message is printed to <span style="font-family: "courier new" , "courier" , monospace;">stderr</span>, and a non-zero return code is used (one). I just don't agree that this is an error.<br />
<br />
I accept that an empty input query file is <i>unusual</i>, but it does happen legitimately - particularly in automated pipelines. For instance, I have written Galaxy workflows which do things like start from a protein set, filter based on the presence of a signal peptide, then run BLAST against some known false-positives, which are then removed. This pipeline might very reasonably return zero sequences - and I want BLAST to accept this and carry on.<br />
<br />
This bug was actually reported to me by Jim Johnson (see his <a href="https://github.com/peterjc/galaxy_blast/issues/58">issue report here</a>), suggesting we add a workaround in the Galaxy BLAST+ wrappers. The group at the University of Minnesota Supercomputing Institute has a pipeline which chunked large sequence sets by length before running BLAST. Occasionally one of the size bins could be empty, at which point BLAST+ broke their workflow.<br />
<br />
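The kind of workaround a wrapper could use is to check for an empty query itself and skip calling BLAST+ altogether. This Python sketch is only an illustration under my own assumptions, not the actual Galaxy wrapper code; both function names are hypothetical, and cmd is expected to be the full BLAST+ command as a list of strings which writes to out_path:<br />

```python
import subprocess


def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def run_blast_unless_empty(cmd, query_path, out_path):
    """Run cmd, unless the query has no sequences.

    For an empty query, write an empty output file and return zero,
    mimicking what BLAST+ 2.2.28 and older did.
    """
    if count_fasta_records(query_path) == 0:
        open(out_path, "w").close()
        return 0
    return subprocess.call(cmd)
```

An empty output file is a sensible stand-in for tabular output, but XML or other structured formats would need a valid empty document instead.<br />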
My suggestion is for the NCBI to either remove this check, or simply downgrade it to a warning on <span style="font-family: "courier new" , "courier" , monospace;">stderr</span> - with the critical requirement that it should revert to a zero return code. e.g.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query empty_file.fasta -db nr -outfmt 6; echo "[Return code $?]"<br />
Warning: Command line argument error?: Query is Empty!<br />
[Return code 0]
</div>
<br />
This gives some useful feedback for the user (especially if running BLAST+ by hand at the command line), without breaking legitimate use cases.<br />
<br />
Since <a href="http://blastedbio.blogspot.co.uk/2011/08/opening-up-ncbi-blast.html">NCBI BLAST+ don't have a public bug tracker</a>, I am blogging this here, and have reported the problem by email as well.<br />
<br />
<h3>
Update 19 October 2018</h3>
Belatedly noting this was fixed in the BLAST+ 2.2.31 release, e.g.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query empty_file.fasta -db nr -outfmt 6; echo "[Return code $?]"<br />
Warning: [blastp] Query is Empty!<br />
[Return code 0]</div>
<br />
Thank you!Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com0tag:blogger.com,1999:blog-8584629468471803075.post-52613386694853824222014-12-23T12:31:00.001+00:002018-12-20T18:34:52.065+00:00BLAST+ Christmas Wish List<i>Dear Santa,</i><br />
<i><br /></i>
<i>Please could you ask the Elves at the NCBI to deliver the following BLAST+ feature requests for Christmas 2014?</i><br />
<i><br /></i>
<i>Thank you,</i><br />
<i><br /></i>
<i>Peter</i><br />
<i><br /></i>
<i>P.S. Do they think I have been naughty or nice with <a href="http://blastedbio.blogspot.co.uk/search/label/BLAST">my BLAST blog posts</a>?</i><br />
<br />
<a name='more'></a><br />
These are <i>roughly</i> in order of increasing priority, starting with some relatively minor issues. It was of course utterly unrealistic to expect this by Christmas, even though I started writing this a month before. But fingers crossed <i>some</i> of this might appear in BLAST+ during 2015?<br />
<br />
<h3>
10. Keep the old BLAST URL alive</h3>
<br />
The <a href="http://www.ncbi.nlm.nih.gov/news/05-22-2014-BLAST-URL-domain-change/">NCBI are dropping the long lived www.ncbi.nlm.nih.gov/blast URL as of 1st December 2014</a>, which they have announced widely (including via <a href="http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2014q2/000106.html">mailing lists</a> and repeatedly on Twitter).<br />
<br />
I worry this will break a number of legacy applications and scripts (whose development has ceased) which used this to run BLAST, and anticipate a flood of queries on mailing lists and forums.<br />
<br />
The old URL already redirects to the new <a href="http://blast.ncbi.nih.gov/">blast.ncbi.nih.gov</a> address (and has done so for some time). Is it really enough of a maintenance burden to justify dropping the old URL? <a href="https://twitter.com/michaelhoffman/status/539450276344651776">Michael Hoffman</a> wondered the same thing on Twitter.<br />
<br />
[Thinking this would be a good example, I just checked <a href="http://www.mbio.ncsu.edu/bioedit/bioedit.html">BioEdit, last updated in 2005</a>. Its remote BLAST searches fail with an NCBI message about the <a href="http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2013q2/000103.html">withdrawal of Blastcl3 in 2013</a> - sadly I doubt this was enough to stop people using BioEdit.]<br />
<br />
At the time of posting, 23rd December 2014, the old URL still redirects... so maybe it got a stay of execution?<br />
<br />
<h3>
9. Resurrect blastclust</h3>
<br />
A minor casualty in the BLAST rewrite from C to C++ was the <span style="font-family: "courier new" , "courier" , monospace;">blastclust</span> tool for clustering sequences based on their similarity. OK, yes, there are alternatives like <a href="http://drive5.com/usearch/manual/uclust_algo.html">UCLUST</a> but there are times when it would be nice to have a BLAST+ version (and others agree, e.g. <a href="https://twitter.com/BioMickWatson/status/527752989038374913">Mick Watson</a>).<br />
<br />
<h3>
8. Command line option for taxonomy database path</h3>
<br />
The <a href="http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html">NCBI added taxonomy output to BLAST+ 2.2.28</a>, which requires you to download <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz">taxdb.tar.gz from the NCBI FTP site</a> and decompress this somewhere on your <span style="font-family: "courier new" , "courier" , monospace;">$BLASTDB</span> path.<br />
<br />
I sometimes want to specify the taxonomy database at the command line. One use-case is if you really care about reproducibility and want to use a particular version of the NCBI taxonomy tree. I would like a new optional argument to do this, e.g. <span style="font-family: "courier new" , "courier" , monospace;">-taxdb /my/data/taxdb-2014-11-26</span> to tell BLAST to use the files <span style="font-family: "courier new" , "courier" , monospace;">/my/data/taxdb-2014-11-26.*</span> rather than looking for <span style="font-family: "courier new" , "courier" , monospace;">taxdb.*</span> on the <span style="font-family: "courier new" , "courier" , monospace;">$BLASTDB</span> path.<br />
<br />
This would be similar to how the <span style="font-family: "courier new" , "courier" , monospace;">deltablast</span> command has an optional argument <span style="font-family: "courier new" , "courier" , monospace;">-rpsdb</span> which defaults to looking for <span style="font-family: "courier new" , "courier" , monospace;">cdd_delta.*</span> on the <span style="font-family: "courier new" , "courier" , monospace;">$BLASTDB</span> path.<br />
<br />
<h3>
7. Include an official local BLAST web-server</h3>
<br />
The old "legacy" BLAST suite included a basic web-server, <a href="http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/">wwwblast</a>.
It was functional but ugly by today's standards, and got the job done.
Sadly as part of the retirement of the "legacy" BLAST suite with
development shifted from C to C++, there never was an official
replacement - meanwhile the <a href="http://blast.ncbi.nlm.nih.gov/">NCBI hosted BLAST web-server</a> has gone from strength to strength.<br />
<br />
This gap has led to numerous alternatives like my own work enabling <a href="https://github.com/peterjc/galaxy_blast">BLAST+ within Galaxy</a>, or <a href="http://www.sequenceserver.com/">SequenceServer</a> for running BLAST on a local server or cluster from a browser.<br />
<br />
<h3>
6. Update the BLAST XML format.</h3>
<br />
I hope the NCBI got lots of useful feedback from their <a href="http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2014q1/000105.html">March 2014 BLAST XML consultation</a> exercise. Back in February 2014 I wrote about my own thoughts on <a href="http://blastedbio.blogspot.co.uk/2014/02/blast-xml-output-needs-more-love-from.html">what needs fixing in the BLAST XML format</a>.<br />
<br />
<h3>
5. Fix the alignment limit arguments</h3>
<br />
BLAST+ has arguments <span itemprop="text"><span style="font-family: "courier new" , "courier" , monospace;">-num_descriptions</span> and <span style="font-family: "courier new" , "courier" , monospace;">-num_alignments</span></span> for use with the human readable plain text output formats (<span style="font-family: "courier new" , "courier" , monospace;">-outfmt 0</span> to <span style="font-family: "courier new" , "courier" , monospace;">4</span> inclusive). For other formats the data does not get split into a summary listing (descriptions) followed by alignments, so instead option <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> is used.<br />
<br />
I want to be able to use <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> for <i>all</i> the output formats. If applied to the plain text outputs, it should be treated as the default limit for the descriptions and alignments.<br />
<br />
The sad thing is this actually worked from the first release of BLAST+ 2.2.18 through to 2.2.25,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -outfmt 0 -max_target_seqs 2<br />
BLASTP 2.2.25+<br />
...</div>
<br />
There was a somewhat scary warning in BLAST+ 2.2.26,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -outfmt 0 -max_target_seqs 2<br />
Warning: Number of descriptions overridden to 2, number of alignments overridden to 2.<br />
max_target_seqs should not be set with outfmt 0<br />
BLASTP 2.2.26+ <br />
...</div>
<br />
However the limit still worked. Unfortunately with BLAST+ 2.2.27 through to the current release 2.2.30 this was changed to ignore the argument:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -outfmt 0 -max_target_seqs 2<br />
Warning: The parameter -max_target_seqs is ignored for output formats, 0,1,2,3. Use -num_descriptions and -num_alignments to control output<br />
BLASTP 2.2.30+ <br />
...</div>
<br />
There is even an off-by-one bug in the warning message as it also applies to format four:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -outfmt 4 -max_target_seqs 2<br />
Warning: The parameter -max_target_seqs is ignored for output formats, 0,1,2,3. Use -num_descriptions and -num_alignments to control output<br />
BLASTP 2.2.30+<br />
...</div>
<br />
I want this to be reverted to the original behaviour (with no warning, just obey the argument). This way you can just remember and use a single option <span style="font-family: "courier new" , "courier" , monospace;">-max_target_seqs</span> for <i>all</i> the output formats (and for most usage, ignore <span itemprop="text"><span style="font-family: "courier new" , "courier" , monospace;">-num_descriptions</span> and <span style="font-family: "courier new" , "courier" , monospace;">-num_alignments</span></span>).<br />
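<br />
Until then, a wrapper script has to pick the right limit option itself. A minimal Python sketch, assuming the behaviour described above for BLAST+ 2.2.27 to 2.2.30 (the helper name limit_args is hypothetical, not part of BLAST+):<br />
<br />

```python
def limit_args(outfmt, limit):
    """Return the BLAST+ arguments limiting output to `limit` hits.

    Assumes the 2.2.27 to 2.2.30 behaviour: formats 0 to 4 ignore
    -max_target_seqs and need the two -num_* options instead.
    """
    # The outfmt may carry column names, e.g. "6 qseqid sseqid"
    fmt_number = int(str(outfmt).split()[0])
    if 0 <= fmt_number <= 4:
        return ["-num_descriptions", str(limit), "-num_alignments", str(limit)]
    return ["-max_target_seqs", str(limit)]


cmd = ["blastp", "-query", "queries.fasta", "-db", "nr", "-outfmt", "0"]
cmd += limit_args(0, 2)
print(" ".join(cmd))
```

<br />
The point of the sketch is only that the caller should not need to care which output format is in use.<br />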
<br />
<h3>
4. Hide BL_ORD_ID as an implementation detail.</h3>
<br />
I wrote about why I think <a href="http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html">BLAST+ should hide its BL_ORD_ID identifiers as an internal implementation detail</a>,
and <i>reject</i> FASTA files with duplicate identifiers when making a BLAST
database - which should resolve my lingering issues with <a href="http://blastedbio.blogspot.co.uk/2012/10/my-ids-not-good-enough-for-ncbi-blast.html">BLAST+ not handling user defined naming conventions very well</a>.<br />
<br />
<h3>
3. Optional headers in the BLAST+ tabular and CSV output.</h3>
<br />
I don't like <a href="http://blastedbio.blogspot.com/2014/11/column-headers-in-blast-tabular-and-csv.html">playing guess-the-column with tables of BLAST data</a>. Judging from the re-tweets and replies on Twitter (e.g. <a href="https://twitter.com/torstenseemann/status/537576961451622400">Torsten Seemann</a>, <a href="https://twitter.com/MicroWavesSci/status/537589926326710273">Laura Williams</a>, <a href="https://twitter.com/mattloose/status/537577399018590208">Matt Loose</a>, <a href="https://twitter.com/lexnederbragt/status/538002190267580416">Lex Nederbragt</a>, and <a href="https://twitter.com/BaCh_mira/status/538026270895120385">Bastien Chevreux</a>), I am not alone in this.<br />
<br />
<h3>
2. Taxonomy filters</h3>
<br />
One of the most common uses I have seen for the Entrez filter when using the BLAST+ command line tools to run a remote search at the NCBI is to filter by taxonomy. Building on the taxonomy support added in BLAST+ 2.2.28, I would like to see new options for restricting the results to given taxa (a white list) or excluding taxa (a black list).<br />
<br />
For example to do a BLAST search against NR restricting to only plant (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3193">Embryophyta, higher plants, taxid 3193</a>) matches I would like to be able to do something like this:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query my_seqs.fasta -db nr -taxidlist 3193 -evalue 0.0001 ...</div>
<br />
Or to do a BLAST search against NR excluding any bacterial (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2">taxid 2</a>) or archaeal (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2157">taxid 2157</a>) matches I would like to be able to use:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query my_seqs.fasta -db nr -negative_taxidlist 2,2157 -evalue 0.0001 ...</div>
<br />
Hopefully the taxonomy database files BLAST+ uses (<span style="font-family: "courier new" , "courier" , monospace;"><a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz">taxdb.tar.gz</a></span>) already contains the tree structure needed to do this, but if need be that could be expanded.<br />
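<br />
In the meantime, one workaround is to request the staxids column and filter afterwards. A rough Python sketch, assuming tabular output made with -outfmt "6 qseqid sseqid staxids"; the function name and the example rows are made up, and unlike a real taxonomy filter this only matches the exact taxids listed, not their descendants in the tree:<br />
<br />

```python
def filter_by_taxid(lines, keep=None, drop=None):
    """Yield tabular lines passing a taxid white list and/or black list.

    Assumes the third column is staxids (semicolon separated). Only the
    listed taxids themselves are matched - no taxonomy tree is consulted.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        taxids = set(fields[2].split(";"))
        if keep is not None and not taxids & set(keep):
            continue
        if drop is not None and taxids & set(drop):
            continue
        yield line


# Made-up rows for illustration only
rows = ["q1\tsubjectA\t3702\n", "q1\tsubjectB\t562\n", "q1\tsubjectC\t3702;4081\n"]
print(list(filter_by_taxid(rows, keep={"3702"})))
```

<br />
Filtering on a whole clade such as taxid 3193 would additionally need the parent/child relationships, which is why doing this inside BLAST+ would be so much nicer.<br />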
<br />
<h3>
1. Develop BLAST+ in the open</h3>
<br />
Last, but definitely not least on my list: Returning to the inaugural post on this blog, <a href="http://blastedbio.blogspot.co.uk/2011/08/opening-up-ncbi-blast.html">I'd still like to see the NCBI BLAST+ team adopt a more open approach to development</a> - with a public issue tracker etc.<br />
<br />
<h3>
Update (20 December 2018)</h3>
Today the BLAST+ 2.8.1 release was announced, just in time for Christmas, and it looks like #5 on my list has been fixed - quoting the <a href="https://www.ncbi.nlm.nih.gov/books/NBK131777/">release notes</a>:<br />
<br />
<i>Allow use of the -max_target_seqs option for formats 0-4. The number of alignments and descriptions will be set to the max_target_seqs</i><br />
<br />
And what's more, this is the first production release (after the BLAST+ 2.8.0 alpha release) to support the new BLAST DB v5, which indirectly should solve my wish list entry #2 - quoting the <a href="https://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2018q4/000152.html">announcement email</a>:<br />
<br />
<i>Allows you to limit your search by taxonomy using information built into the BLAST databases.</i><br />
<br />
Also, belatedly, in connection to wish list entry #6, BLAST+ 2.2.31 released in 2015 introduced BLAST XML v2 as another output format.Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com3tag:blogger.com,1999:blog-8584629468471803075.post-57548647399401009322014-11-26T11:00:00.000+00:002014-11-26T13:30:41.732+00:00Column headers in BLAST+ tabular and CSV outputIn the last couple of years, my preferred BLAST output format has switched from BLAST XML to plain tabular output. The main reason for this is that it is easier to parse, and now gives easy access to more fields - <a href="http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html">BLAST+ 2.2.28 added descriptions and taxonomy output</a> to the tabular and CSV output, and the cumulative effect is that <a href="http://blastedbio.blogspot.co.uk/2014/02/blast-xml-output-needs-more-love-from.html">BLAST XML has been lagging behind</a>.<br />
<br />
However, there is a simple change the NCBI could make to greatly improve the usability of the tabular or CSV output - label the columns with a header line! This is vital meta-data: <i>No-one should be forced to guess-the-columns when presented with a data file.</i> <br />
<br />
<a name='more'></a><h3>
Look at all the columns</h3>
<br />
Here is an excerpt from the BLAST+ 2.2.30 command line help on the output format:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
*** Formatting options<br />
-outfmt <String><br />
alignment view options:<br />
0 = pairwise,<br />
1 = query-anchored showing identities,<br />
2 = query-anchored no identities,<br />
3 = flat query-anchored, show identities,<br />
4 = flat query-anchored, no identities,<br />
5 = XML Blast output,<br />
6 = tabular,<br />
7 = tabular with comment lines,<br />
8 = Text ASN.1,<br />
9 = Binary ASN.1,<br />
10 = Comma-separated values,<br />
11 = BLAST archive format (ASN.1) <br />
12 = JSON Seqalign output <br />
<br />
Options 6, 7, and 10 can be additionally configured to produce<br />
a custom format specified by space delimited format specifiers.<br />
The supported format specifiers are:<br />
qseqid means Query Seq-id<br />
qgi means Query GI<br />
qacc means Query accesion<br />
qaccver means Query accesion.version<br />
qlen means Query sequence length<br />
sseqid means Subject Seq-id<br />
sallseqid means All subject Seq-id(s), separated by a ';'<br />
sgi means Subject GI<br />
sallgi means All subject GIs<br />
sacc means Subject accession<br />
saccver means Subject accession.version<br />
sallacc means All subject accessions<br />
slen means Subject sequence length<br />
qstart means Start of alignment in query<br />
qend means End of alignment in query<br />
sstart means Start of alignment in subject<br />
send means End of alignment in subject<br />
qseq means Aligned part of query sequence<br />
sseq means Aligned part of subject sequence<br />
evalue means Expect value<br />
bitscore means Bit score<br />
score means Raw score<br />
length means Alignment length<br />
pident means Percentage of identical matches<br />
nident means Number of identical matches<br />
mismatch means Number of mismatches<br />
positive means Number of positive-scoring matches<br />
gapopen means Number of gap openings<br />
gaps means Total number of gaps<br />
ppos means Percentage of positive-scoring matches<br />
frames means Query and subject frames separated by a '/'<br />
qframe means Query frame<br />
sframe means Subject frame<br />
btop means Blast traceback operations (BTOP)<br />
staxids means unique Subject Taxonomy ID(s), separated by a ';'<br />
(in numerical order)<br />
sscinames means unique Subject Scientific Name(s), separated by a ';'<br />
scomnames means unique Subject Common Name(s), separated by a ';'<br />
sblastnames means unique Subject Blast Name(s), separated by a ';'<br />
(in alphabetical order)<br />
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'<br />
(in alphabetical order) <br />
stitle means Subject Title<br />
salltitles means All Subject Title(s), separated by a '<>'<br />
sstrand means Subject Strand<br />
qcovs means Query Coverage Per Subject<br />
qcovhsp means Query Coverage Per HSP<br />
When not provided, the default value is:<br />
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send<br />
evalue bitscore', which is equivalent to the keyword 'std'<br />
Default = `0'</div>
<br />
Just look at all those interesting potential output fields :)<br />
<br />
<h3>
Sample Output</h3>
<br />
Now for a quick example, the protein sequence for <i>E. coli</i> K-12's gene <i>moaC</i>, and a fake sequence:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ more queries.fasta <br />
>moaC molybdopterin biosynthesis, protein C<br />
MSQLTHINAAGEAHMVDVSAKAETVREARAEAFVTMRSETLAMIIDGRHHKGDVFATARIAGIQAAKRTW<br />
DLIPLCHPLMLSKVEVNLQAEPEHNRVRIETLCRLTGKTGVEMEALTAASVAALTIYDMCKAVQKDMVIG<br />
PVRLLAKSGGKSGDFKVEADD<br />
>fake sequence of letters which should not match any real proteins<br />
DFAIBFNWACNMVBNMDEYGBCBFCNKSFDEZNVDXHALAHLFGDNASHBCVNMDFGNMNDFSILAPPQG<br />
FCGHAKGRDAIBVKPDJKAHCIIBYANMNVB</div>
<br />
Let's search this against the NCBI NR database using BLASTP with the default column tabular output, limiting this to the top two hits for brevity:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt 6<br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|15800534|ref|NP_286546.1|<span style="color: yellow;"><i><tab></i></span>100.00<span style="color: yellow;"><i><tab></i></span>161<span style="color: yellow;"><i><tab></i></span>0<span style="color: yellow;"><i><tab></i></span>0<span style="color: yellow;"><i><tab></i></span>1<span style="color: yellow;"><i><tab></i></span>161<span style="color: yellow;"><i> </i></span><br />
<span style="color: yellow;"><i><tab></i></span>1<span style="color: yellow;"><i><tab></i></span>161<span style="color: yellow;"><i><tab></i></span>3e-114<span style="color: yellow;"><i><tab></i></span>330<span style="color: orange;"><i><end></i></span><br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|170768970|ref|ZP_02903423.1|<span style="color: yellow;"><i><tab></i></span>99.38<span style="color: yellow;"><i><tab></i></span>161<span style="color: yellow;"><i><tab></i></span>1<span style="color: yellow;"><i><tab></i></span>0<span style="color: yellow;"><i><tab></i></span>1<span style="color: yellow;"><i><tab></i></span>161<br />
<span style="color: yellow;"><i><tab></i></span>1<span style="color: yellow;"><i><tab></i></span>161<span style="color: yellow;"><i><tab></i></span>9e-114<span style="color: yellow;"><i><tab></i></span>329<span style="color: orange;"><i><end></i></span></div>
<br />
Due to line splitting for the blog, I have represented the tab characters as <i><span style="background-color: black;"><span style="color: yellow;"><tab></span></span> </i>in yellow, and emphasised the new lines with <span style="color: orange;"><span style="background-color: black;"><i><end></i></span></span>.<br />
<br />
Or, for comparison, the comma separated output, with the commas highlighted in yellow (this appears to include some unexpected spaces before the final values - a minor bug?):<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt 10<br />
moaC<span style="color: yellow;">,</span>gi|15800534|ref|NP_286546.1|<span style="color: yellow;">,</span>100.00<span style="color: yellow;">,</span>161<span style="color: yellow;">,</span>0<span style="color: yellow;">,</span>0<span style="color: yellow;">,</span>1<span style="color: yellow;">,</span>161<span style="color: yellow;">,</span>1<span style="color: yellow;">,</span>161<span style="color: yellow;">,</span>3e-114<span style="color: yellow;">,</span> 330<br />
moaC<span style="color: yellow;">,</span>gi|170768970|ref|ZP_02903423.1|<span style="color: yellow;">,</span>99.38<span style="color: yellow;">,</span>161<span style="color: yellow;">,</span>1<span style="color: yellow;">,</span>0<span style="color: yellow;">,</span>1<span style="color: yellow;">,</span>161<span style="color: yellow;">,</span>1<span style="color: yellow;">,</span>161<span style="color: yellow;">,</span>9e-114<span style="color: yellow;">,</span> 329</div>
<br />
If you know the standard 12 columns by heart, this is workable. But one of the best things about BLAST+ is it will happily output other columns! For instance,<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "6 qseqid sseqid score qcovs"<br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|15800534|ref|NP_286546.1|<span style="color: yellow;"><i><tab></i></span>846<span style="color: yellow;"><i><tab></i></span>100<span style="color: orange;"><i><end></i></span><br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|170768970|ref|ZP_02903423.1|<span style="color: yellow;"><i><tab></i></span>843<span style="color: yellow;"><i><tab></i></span>100<span style="color: orange;"><i><end></i></span></div>
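<br />
For the default output, the twelve names behind the 'std' keyword in the help text above can be attached by hand. A minimal Python sketch (the row is copied from the first example output above; the helper name is hypothetical):<br />
<br />

```python
STD_COLUMNS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()


def parse_std_line(line):
    """Turn one default -outfmt 6 line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(STD_COLUMNS):
        raise ValueError(
            "Expected %i columns, got %i" % (len(STD_COLUMNS), len(values))
        )
    return dict(zip(STD_COLUMNS, values))


row = parse_std_line(
    "moaC\tgi|15800534|ref|NP_286546.1|\t100.00\t161\t0\t0"
    "\t1\t161\t1\t161\t3e-114\t330\n"
)
print(row["sseqid"], row["bitscore"])
```

<br />
Of course this only helps when you already know the file uses the default columns, and it fails outright on a custom column selection like the one just shown.<br />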
<br />
If you don't know how the file was generated, how are you to guess what the columns mean? Supposedly this is where <span style="background-color: black;"><span style="color: lime;">-outfmt 7</span></span> is useful:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "7 qseqid sseqid score qcovs"<br />
# BLASTP 2.2.30+<br />
# Query: moaC molybdopterin biosynthesis, protein C<br />
# Database: nr<br />
# Fields: query id, subject id, score, % query coverage per subject<br />
# 2 hits found<br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|15800534|ref|NP_286546.1|<span style="color: yellow;"><i><tab></i></span>846<span style="color: yellow;"><i><tab></i></span>100<span style="color: orange;"><i><end></i></span><br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|170768970|ref|ZP_02903423.1|<span style="color: yellow;"><i><tab></i></span>843<span style="color: yellow;"><i><tab></i></span>100<span style="color: orange;"><i><end></i></span><br />
# BLASTP 2.2.30+<br />
# Query: fake sequence of letters which should not match any real proteins<br />
# Database: nr<br />
# 0 hits found<br />
# BLAST processed 2 queries</div>
<br />
This is extremely comment heavy (OK, not as verbose as BLAST XML, but still...) and not immediately useful for parsing the data (although you can now see queries with no hits being reported). Worse, neither Excel nor R will recognise the column names when loading this.<br />
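<br />
The field names can be recovered from the comment lines with a little work. A Python sketch, assuming every "# Fields:" line in the file is identical and that no field name itself contains a comma followed by a space (the function name is hypothetical, and the sample text is abridged from the output above):<br />
<br />

```python
def read_outfmt7(lines):
    """Return (field_names, rows) from BLAST+ -outfmt 7 style text lines."""
    fields = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# Fields: "):
            # Verbose names, comma separated, e.g. "query id, subject id"
            fields = line[len("# Fields: "):].split(", ")
        elif line and not line.startswith("#"):
            rows.append(line.split("\t"))
    return fields, rows


sample = """# BLASTP 2.2.30+
# Query: moaC molybdopterin biosynthesis, protein C
# Database: nr
# Fields: query id, subject id, score, % query coverage per subject
# 2 hits found
moaC\tgi|15800534|ref|NP_286546.1|\t846\t100
moaC\tgi|170768970|ref|ZP_02903423.1|\t843\t100
""".splitlines()
fields, rows = read_outfmt7(sample)
print(fields)
print(rows[0])
```

<br />
Note the names you get back are the verbose ones with spaces, not the short names used on the command line.<br />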
<br />
<h3>
Enhancement Proposal</h3>
<br />
I would like the NCBI to add a new tabular output format to BLAST+, say <span style="color: lime;"><span style="background-color: black;">-outfmt 13</span></span> if not used for anything else first, which acts like <span style="background-color: black;"><span style="color: lime;">-outfmt 6</span></span> (tabular) but with the addition of a single header line. This should start with the # character (indicating this is a comment rather than data), followed by tab separated column names:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "<span style="color: red;">13</span> qseqid sseqid score qcovs"<br />
#qseqid<span style="color: yellow;"><i><tab></i></span>sseqid<span style="color: yellow;"><i><tab></i></span>score<span style="color: yellow;"><i><tab></i></span>qcovs<span style="color: orange;"><i><end></i></span><br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|15800534|ref|NP_286546.1|<span style="color: yellow;"><i><tab></i></span>846<span style="color: yellow;"><i><tab></i></span>100<span style="color: orange;"><i><end></i></span><br />
moaC<span style="color: yellow;"><i><tab></i></span>gi|170768970|ref|ZP_02903423.1|<span style="color: yellow;"><i><tab></i></span>843<span style="color: yellow;"><i><tab></i></span>100<span style="color: orange;"><i><end></i></span></div>
<br />
I debated just changing the existing <span style="background-color: black;"><span style="color: lime;">-outfmt 6</span></span> output, but fear this would break too many existing scripts. Similarly, this new mode could replace the existing <span style="color: lime;"><span style="background-color: black;">-outfmt 7</span></span>,
but that includes other functionality as well - like reporting query
sequences with no hits. So a new output format number seems best to me.<br />
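<br />
Until something like this exists, the header can be added after the fact, provided you still have the -outfmt string used. A Python sketch of that workaround (the function is made up for illustration, not a BLAST+ feature):<br />
<br />

```python
def add_header(outfmt, tabular_text):
    """Prefix -outfmt 6 style text with a '#' header line of column names."""
    parts = outfmt.split()
    if parts[0] != "6":
        raise ValueError("Expected a tabular (6) output format string")
    # Bare "6" or the "std" keyword mean the default twelve columns
    std = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore").split()
    names = []
    for name in parts[1:] or ["std"]:
        names.extend(std if name == "std" else [name])
    return "#" + "\t".join(names) + "\n" + tabular_text


print(add_header("6 qseqid sseqid score qcovs",
                 "moaC\tgi|15800534|ref|NP_286546.1|\t846\t100\n"), end="")
```

<br />
This relies on the format string travelling with the file, which is exactly what a built-in header line would make unnecessary.<br />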
<br />
For fans of comma separated values (CSV) files, the same style
header should also be useful? I prefer to avoid CSV with BLAST as
commas do occur in record titles and so can complicate parsing. Maybe that can be <span style="background-color: black;"><span style="color: lime;">-outfmt 14</span></span> giving:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "<span style="color: red;">14</span> qseqid sseqid score qcovs"<br />
#qseqid<span style="color: yellow;"><i>,</i></span>sseqid<span style="color: yellow;"><i>,</i></span>score<span style="color: yellow;"><i>,</i></span>qcovs<br />
moaC<span style="color: yellow;"><i>,</i></span>gi|15800534|ref|NP_286546.1|<span style="color: yellow;"><i>,</i></span>846<span style="color: yellow;"><i>,</i></span>100<br />
moaC<span style="color: yellow;"><i>,</i></span>gi|170768970|ref|ZP_02903423.1|<span style="color: yellow;"><i>,</i></span>843<span style="color: yellow;"><i>,</i></span>100</div>
<br />
<br />
Note that I am explicitly suggesting using the same one-word concise
format names that BLAST+ itself uses in the command line to generate the
file (and not the more verbose column names currently used in <span style="background-color: black;"><span style="color: lime;">-outfmt 7</span></span>). These short column names are clear,
plus this should make it easier to trace the values back to their
meaning. Also they are <i>much</i> easier to work with in R than column names with spaces.<br />
<br />
The point of this header line convention is that it
is widely used for self-describing tab-separated tables (I specifically want this for the <a href="https://github.com/peterjc/galaxy_blast">BLAST+ Galaxy wrappers</a>). This is trivial to load into Excel, and much easier to load into R or Python etc and get column names matched up with the data.<br />
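<br />
As an illustration of how little code such a file needs, here is a Python sketch reading the proposed format using only the standard library (the format does not exist in BLAST+; the sample text mirrors the mock-up above):<br />
<br />

```python
import csv
import io

# Mock-up of the proposed header-line output, not real BLAST+ output
text = (
    "#qseqid\tsseqid\tscore\tqcovs\n"
    "moaC\tgi|15800534|ref|NP_286546.1|\t846\t100\n"
    "moaC\tgi|170768970|ref|ZP_02903423.1|\t843\t100\n"
)

handle = io.StringIO(text)
# Drop the leading '#' so the first name is plain 'qseqid'
names = handle.readline().lstrip("#").rstrip("\n").split("\t")
reader = csv.DictReader(handle, fieldnames=names, delimiter="\t")
hits = list(reader)
for hit in hits:
    print(hit["sseqid"], hit["score"])
```

<br />
No knowledge of the original command line is needed here, since the column names come from the file itself.<br />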
<br />
In essence, this column header line is vital meta-data which avoids the <i>guess-the-columns</i> problem which otherwise can come up when sharing BLAST data in tabular or CSV format.Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com2tag:blogger.com,1999:blog-8584629468471803075.post-17489912183355064322014-10-31T14:43:00.000+00:002014-11-26T22:28:49.108+00:00BLAST! No frequency ratios needed for composition-based statisticsWhile working on updating the <a href="https://github.com/peterjc/galaxy_blast">NCBI BLAST+ wrapper for Galaxy</a> for any changes in the new <a href="http://www.ncbi.nlm.nih.gov/news/10-30-2014-new-BLAST-plus-2_2_30/">BLAST+ 2.2.30 release</a>, I hit a cryptic error message from <span style="font-family: "Courier New",Courier,monospace;">deltablast</span>: <br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ deltablast -query rhodopsin_proteins.fasta
-subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid
sseqid score" -rpsdb /data/blastdb/cdd_delta<br />
BLAST engine error: /data/blastdb/cdd_delta contains no frequency ratios needed for composition-based statistics.<br />
Please disable composition-based statistics when searching against /data/blastdb/ncbi/cdd/cdd_delta. </div>
<br />
To cut a long story short, to fix this you need to download and unpack a newer <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/cdd_delta.tar.gz">cdd_delta.tar.gz</a> which now includes another file <span style="font-family: "Courier New",Courier,monospace;">cdd_delta.freq</span> containing frequency ratio information which the newer <span style="font-family: "Courier New",Courier,monospace;">deltablast</span> tool requires.<br />
<br />
The same applies to the <span style="font-family: "Courier New",Courier,monospace;">rpsblast</span> tool, although here you just get a warning rather than an error:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ rpsblast -query four_human_proteins.fasta -db /data/blastdb/cdd_delta -evalue 1e-08 -outfmt "6 qseqid sseqid score"<br />
Warning: /data/blastdb/cdd_delta contain(s) no freq ratios needed for composition-based statistics.<br />
RPSBLAST will be run without composition-based statistics.<br />
sp|Q9BS26|ERP44_HUMAN gnl|CDD|222416 401<br />
...<br />
sp|P06213|INSR_HUMAN gnl|CDD|238021 137<br />
sp|P08100|OPSD_HUMAN gnl|CDD|215646 411</div>
<br />
<a name='more'></a>For the full story, I am using two small sample files <a href="https://github.com/peterjc/galaxy_blast/blob/master/test-data/rhodopsin_proteins.fasta">rhodopsin_proteins.fasta</a> and <a href="https://github.com/peterjc/galaxy_blast/blob/master/test-data/four_human_proteins.fasta">four_human_proteins.fasta</a> as test cases. Using BLAST+ 2.2.26 through 2.2.29, this example worked:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ ~/ncbi_blast_2.2.29+/deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb /data/blastdb/cdd_delta<br />
gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 826<br />
gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 767<br />
gi|283855846|gb|ADB45242.1| sp|P08100|OPSD_HUMAN 718<br />
gi|283855823|gb|ADB45229.1| sp|P08100|OPSD_HUMAN 721<br />
gi|223523|prf||0811197A sp|P08100|OPSD_HUMAN 842<br />
gi|12583665|dbj|BAB21486.1| sp|P08100|OPSD_HUMAN 795</div>
<br />
The error message from BLAST+ 2.2.30 was a bit cryptic, but suggested the domain database format had changed. I was using quite an old copy of the <span style="font-family: "Courier New",Courier,monospace;">cdd_delta</span> database from November 2013, so I downloaded the current version of <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/cdd_delta.tar.gz">cdd_delta.tar.gz</a> (dated 24 Oct 2014, verified MD5 checksum 0a5513e147aa320264a1414f8194cfbc as per <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/cdd_delta.tar.gz.md5">cdd_delta.tar.gz.md5</a>).<br />
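<br />
Checking a download against the published checksum is easy to script. A minimal Python sketch, assuming the .md5 file follows the usual md5sum layout (checksum first, then the filename); the function names are made up for illustration:<br />
<br />

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, md5_path):
    """Compare a file against a checksum file whose first token is the MD5."""
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(path) == expected
```

<br />
With both files downloaded, md5_matches("cdd_delta.tar.gz", "cdd_delta.tar.gz.md5") should return True before you unpack anything.<br />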
<br />
Now <span style="font-family: "Courier New",Courier,monospace;">deltablast</span> from BLAST+ 2.2.30 works, although the bit scores (and other details of the alignments) are slightly different.<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
$ ~/ncbi_blast_2.2.30+/2.2.30+/deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb cdd_delta<br />
gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 822<br />
gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 759<br />
gi|283855846|gb|ADB45242.1| sp|P08100|OPSD_HUMAN 714<br />
gi|283855823|gb|ADB45229.1| sp|P08100|OPSD_HUMAN 718<br />
gi|223523|prf||0811197A sp|P08100|OPSD_HUMAN 839<br />
gi|12583665|dbj|BAB21486.1| sp|P08100|OPSD_HUMAN 790</div>
<br />
So what changed? The new database contained an extra file, <span style="font-family: "Courier New",Courier,monospace;">cdd_delta.freq</span> - so for anyone else stumped by the error message "<i>BLASTDB</i> contains no frequency ratios needed for composition-based statistics" you need to check if there is a file named <span style="font-family: "Courier New",Courier,monospace;"><i>BLASTDB</i>.freq</span> present. <br />
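<br />
If you want to test for this before launching a long job, a minimal Python sketch might look like the following. The helper name <span style="font-family: "Courier New",Courier,monospace;">has_freq_file</span> is made up for illustration, and it assumes the database files all sit side by side under one path prefix:<br />

```python
import os


def has_freq_file(db_prefix):
    """Return True if the frequency ratios file sits next to the database.

    db_prefix is the database path without any extension, for example
    "/data/blastdb/cdd_delta" (the location used earlier in this post).
    """
    return os.path.isfile(db_prefix + ".freq")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_freq.py /data/blastdb/cdd_delta
    for prefix in sys.argv[1:]:
        status = "found" if has_freq_file(prefix) else "MISSING"
        print(f"{prefix}.freq {status}")
```

This only checks that the file exists, not that its contents are valid.<br />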
<br />Peter Cock http://www.blogger.com/profile/00233221181317137855 noreply@blogger.com 0<br />
tag:blogger.com,1999:blog-8584629468471803075.post-5705921585563122187 2014-04-09T17:23:00.002+01:00 2014-10-31T12:46:44.898+00:00<br />
<b>BLAST+ 2.2.29 upset by [key=value] entries in queries</b><br />
<br />
I recently got a weird error/warning message (repeated) in my BLAST+ stderr output:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.</div>
<br />
<br />
This turns out to be due to having <tt>[key=value]</tt> tags in my query FASTA file, and appears to be a new bug introduced in BLAST+ 2.2.29 (as BLAST+ 2.2.26 through 2.2.28 inclusive are not affected).<br />
<br />
<u><b>Update (31 October 2014):</b></u> <i>This was fixed in BLAST+ 2.2.30 (released yesterday).</i><br />
<br />
<a name='more'></a><br />
Here's a snippet from the source code giving the cryptic message (file <tt>ncbi-blast-2.2.29+-src/c++/src/objtools/readers/fasta.cpp</tt> from <tt>ncbi-blast-2.2.29+-src.tar.gz</tt> on the BLAST+ FTP site):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre> ...
// user did not request fAddMods, so we warn that we found
// mods anyway
smp.ParseTitle(
title,
CConstRef<CSeq_id>(bioseq.GetFirstId()),
1 // "1" since we only care whether or not there are mods, not how many
);
CSourceModParser::TMods unused_mods = smp.GetMods(CSourceModParser::fUnusedMods);
if( ! unused_mods.empty() ) {
FASTA_WARNING(iLineNum,
"FASTA-Reader: Ignoring FASTA modifier(s) found because "
"the input was not expected to have any.",
ILineError::eProblem_ModifierFoundButNoneExpected,
"defline");
}
...
</pre>
</td></tr>
</tbody></table>
This is apparently an error about some unexpected <i>modifiers</i> (whatever the NCBI meant by that) in a FASTA record's title, but confusingly the number 1431.1 is not a line number but a unique constant - <tt>eProblem_ModifierFoundButNoneExpected</tt>.<br />
<br />
Since BLAST+ is open source, I could recompile it locally with a small change to this error message to include the title of the problem FASTA entry:
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre> ...
CSourceModParser::TMods unused_mods = smp.GetMods(CSourceModParser::fUnusedMods);
if( ! unused_mods.empty() ) {
FASTA_WARNING(iLineNum,
"FASTA-Reader: Ignoring FASTA modifier(s) found because "
"the input was not expected to have any: " << title,
ILineError::eProblem_ModifierFoundButNoneExpected,
"defline");
}
...
</pre>
</td></tr>
</tbody></table>
<br />
Compiling BLAST+ takes a while, but this confirmed the error/warning message was triggered by <i>every</i> record in my FASTA file - which was from Prokka with default settings giving lines like this:<br />
<br />
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
<span style="color: #5330e1;">$</span> grep "^>" prokka.fsa</div>
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
>husec41_c1 [gcode=11] [organism=Genus species] [strain=strain]</div>
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
>husec41_c10 [gcode=11] [organism=Genus species] [strain=strain]</div>
<div style="background-color: black; color: #29f914; font-family: 'Andale Mono'; font-size: 14px;">
>husec41_c100 [gcode=11] [organism=Genus species] [strain=strain]<br />
... </div>
<br />
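One workaround, short of changing BLAST+ version, would be to strip the tags from the query file before searching. Here is a minimal Python sketch under that assumption - the function names are made up, and note it throws away the organism and strain information, which may or may not matter to you:<br />

```python
import re

# Matches one " [key=value]" tag, including any whitespace before it.
MODIFIER = re.compile(r"\s*\[[^\[\]=\s]+=[^\[\]]*\]")


def strip_modifiers(line):
    """Remove [key=value] tags from a FASTA title line."""
    if not line.startswith(">"):
        return line  # sequence lines are left untouched
    return MODIFIER.sub("", line.rstrip("\n")).rstrip() + "\n"


def clean_fasta(in_handle, out_handle):
    """Copy FASTA records, dropping [key=value] tags from each title."""
    for line in in_handle:
        out_handle.write(strip_modifiers(line))
```

Square brackets without an equals sign, such as a trailing <tt>[Genus species]</tt>, are deliberately left alone.<br />
<br />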
This raises a number of issues:<br />
<ul>
<li>Can BLAST+ give a more helpful message (like mine)? Or only show the message once?</li>
<li>Could the "Error:" prefix be removed from this warning message? Are other warning messages being upgraded to errors in the same way?</li>
<li>Why is BLAST checking these <tt>[key=value]</tt> tags anyway? Was this a deliberate change in BLAST+ 2.2.29?</li>
</ul>
Unfortunately while this seems to be a warning, because it starts with the scary word "Error" my <a href="https://github.com/peterjc/galaxy_blast">Galaxy BLAST+ wrappers</a>
cautiously treat it as a real error and declare such jobs a failure...
perhaps I can <a href="https://github.com/peterjc/galaxy_blast/issues/40">tweak my regex</a> to ignore this false positive?<br />
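<br />
To give an idea of what such a tweak could look like, here is a rough Python sketch. This is not the code in the Galaxy wrappers - the pattern and the function name are assumptions for illustration, based only on the message text shown above:<br />

```python
import re

# BLAST+ 2.2.29 prefixes this FASTA reader warning with "Error:".
FALSE_ALARM = re.compile(r"^Error: \(\d+\.\d+\) FASTA-Reader: Warning: ")


def real_errors(stderr_text):
    """Return stderr lines starting with "Error:" that are not known warnings."""
    return [
        line
        for line in stderr_text.splitlines()
        if line.startswith("Error:") and not FALSE_ALARM.match(line)
    ]
```

A job would then only be flagged as failed if this list is non-empty (or the return code is non-zero).<br />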
<br />
<u><b>Update (31 October 2014):</b></u> <i>This was fixed in BLAST+ 2.2.30 (released yesterday). In reply to my email reporting this back in April, I was told some changes to FASTA in the C++ Toolkit had accidentally snuck into the BLAST code.</i><br />
<br />Peter Cock http://www.blogger.com/profile/00233221181317137855 noreply@blogger.com 2<br />
tag:blogger.com,1999:blog-8584629468471803075.post-5239747104937791994 2014-02-27T17:20:00.003+00:00 2015-06-24T10:35:48.498+01:00<br />
<b>BLAST XML output needs more love from NCBI</b><br />
<br />
For some time I had thought that the best option for computer parsing of BLAST+ output was BLAST XML. It had all the key bits of information, and XML is <i>designed</i> for automated parsing. However, with the extra fields added to the tabular and comma separated output in BLAST+ 2.2.28, like the <a href="http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html">long overdue hit descriptions, and taxonomy fields</a>, I think those formats are now preferable. BLAST XML is now lagging behind!<br />
<a name='more'></a><br />
<h3>
BLAST tabular output</h3>
<br />
The greatly expanded set of columns available to the tabular (and comma separated) output is the motivation behind <a href="http://dev.list.galaxyproject.org/Pick-you-own-columns-in-BLAST-tabular-output-tc4663541.html">adding a pick-your-own columns option</a> to the <a href="https://github.com/peterjc/galaxy_blast">Galaxy BLAST+ wrappers</a> (which already use the new match description column by default):<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqOTuWujAg_7AJxBcG0090JSNW-e9v_cVuSKowv6I6mfeH9m79s1PcNOakFYpRfEvki1zkNV-13SOSQlkpRf_mtvpnYC6w65AUZm-MBm7b883hHS47VmEL0IZtuJewOYhiZ-w087HRnJc/s1600/std_ext_columns.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqOTuWujAg_7AJxBcG0090JSNW-e9v_cVuSKowv6I6mfeH9m79s1PcNOakFYpRfEvki1zkNV-13SOSQlkpRf_mtvpnYC6w65AUZm-MBm7b883hHS47VmEL0IZtuJewOYhiZ-w087HRnJc/s1600/std_ext_columns.png" /></a></td></tr>
<tr><td style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigXzhL3GcLvjX6fU8N0O3gpmYCijWfjySfqPErt590iuSiyi2OtjXZhCaYe9FWTxBYaQfHVnTfF6mGRyvwi7Uu7psBXfIkP6H7uQJMCBLsjodPe7ZjnuN3pFkLqt2ua9fFko9f1Tz0DgI/s1600/ids_misc_tax_columns.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigXzhL3GcLvjX6fU8N0O3gpmYCijWfjySfqPErt590iuSiyi2OtjXZhCaYe9FWTxBYaQfHVnTfF6mGRyvwi7Uu7psBXfIkP6H7uQJMCBLsjodPe7ZjnuN3pFkLqt2ua9fFko9f1Tz0DgI/s1600/ids_misc_tax_columns.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Planned Galaxy interface for picking BLAST+ columns.<br />
With 44 output fields to choose from, this is a bit overwhelming!</td></tr>
</tbody></table>
<br />
That screenshot shows the proposed column selection for the Galaxy BLAST+ wrappers (<i><u>Update</u> - now available on the Galaxy Tool Shed</i>), which internally works via the <span style="font-family: "Courier New",Courier,monospace;">-outfmt 6</span> command line switch. I think the new taxonomy fields in the BLAST output will be especially popular - for example I know the Blaxter group was able to use this new feature to simplify the code for <a href="http://github.com/blaxterlab/blobology">Blobology (which maps assembly contigs to taxonomic groups)</a>.<br />
<br />
Notice these BLAST+ output fields explicitly handle multiple IDs/titles/species for a single match as used in the NCBI Non-Redundant (NR) database, where identical sequences from different organisms are collapsed into one sequence record (removing redundancy).<br />
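<br />
As an illustration of handling those multi-valued fields, here is a minimal Python sketch. It assumes the search was run with <tt>-outfmt "6 qseqid sallseqid staxids"</tt>, where BLAST+ separates multiple identifiers or taxonomy IDs with semicolons. The function name is my own invention for the example:<br />

```python
def parse_hit(line, columns=("qseqid", "sallseqid", "staxids")):
    """Split one tabular line into a dict, expanding multi-value fields."""
    multi_valued = {"sallseqid", "staxids"}  # semicolon separated in BLAST+
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(f"Expected {len(columns)} columns, got {len(values)}")
    hit = {}
    for name, value in zip(columns, values):
        hit[name] = value.split(";") if name in multi_valued else value
    return hit
```

The column names must be given in the same order as on the BLAST+ command line, since the tabular output itself has no header row.<br />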
<br />
<h3>
BLAST XML output</h3>
<br />
So well done to the NCBI for expanding the capabilities of BLAST+'s tabular output :)<br />
<br />
However BLAST's XML output needs some love to maintain parity and its utility:<br />
<ul>
<li>Include the taxonomy fields, defining them as optional in the XML DTD for backward compatibility.</li>
<li><a href="http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html">Hide the internal identifiers like gnl|BL_ORD_ID|1</a>, a bug fixed in the tabular output back in BLAST+ 2.2.23 (Feb 2010).</li>
<li>Properly handle secondary identifiers (aliases) as used in the NR database, rather than putting the primary identifier in <Hit_id> and hiding the rest only within <Hit_def> (<a href="http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html">see this post for details</a>).</li>
</ul>
Now <a href="http://blastedbio.blogspot.co.uk/2011/08/opening-up-ncbi-blast.html">if only the NCBI ran BLAST+ as an open project</a>, I would log some enhancement requests on their issue tracker. But they don't, so it's blog post time! ;)<br />
<br />
<h3>
Update (17 March 2014)</h3>
<br />
Apparently the NCBI team are planning some updates to the BLAST XML output, which I heard about via <a href="https://twitter.com/seandavis12/status/445585197783085056">Sean Davis on Twitter</a>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: left;"><a href="https://twitter.com/seandavis12/status/445585197783085056" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://pbs.twimg.com/profile_images/1913633871/sean_bigger.png" /></a></td><td class="tr-caption" style="text-align: left;"><b>Sean Davis</b> (<b>@seandavis12</b>):<br />
Proposed BLAST XML changes with embedded link for comment:
<a href="ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf">ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf</a>
</td></tr>
</tbody></table>
<br />
The PDF talks about including the taxonomy information and sorting out multiple identifiers (listed above), plus other issues like the current abuse of the <Iteration> tag originally just for PSI-BLAST. It doesn't address the BL_ORD_ID issue yet, but they are asking for feedback...<br />
<br />
<h3>
Update (18 March 2014)</h3>
<br />
The NCBI have now posted this <a href="https://twitter.com/NCBI/status/445640686894088192">on their official Twitter account</a>, and to the <a href="http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2014q1/000105.html">blast-announce list</a>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: left;"><a href="https://twitter.com/NCBI/status/445640686894088192" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://pbs.twimg.com/profile_images/759051827/26243_367877492480_367794192480_4194340_6124010_n_bigger.jpg" /></a></td><td class="tr-caption" style="text-align: left;"><b>NCBI Staff</b> (<b>@NCBI</b>):<br />
The BLAST dev team needs your help! Suggestions, comments, etc. needed on proposed XML changes <a href="http://www.ncbi.nlm.nih.gov/news/03-17-2014-blast-xml-feedback/">http://www.ncbi.nlm.nih.gov/news/03-17-2014-blast-xml-feedback/</a> #bioinformatics
</td></tr>
</tbody></table>
<br />
<h3>
Update (5 May 2015)</h3>
<br />
The NCBI have now released details of the <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/xml2.pdf">new BLAST XML format (PDF)</a> to the <a href="http://www.ncbi.nlm.nih.gov/mailman/pipermail/blast-announce/2015q2/000113.html">blast-announce list</a>.<br />
<br />
<h3>
Update (June 2015)</h3>
<br />
The <a href="http://www.ncbi.nlm.nih.gov/news/06-16-2015-blast-plus-update/">NCBI have released BLAST+ 2.2.31</a> which offers this new BLAST XML output.<br />
Peter Cock http://www.blogger.com/profile/00233221181317137855 noreply@blogger.com 3<br />
tag:blogger.com,1999:blog-8584629468471803075.post-4783750047341705502 2013-12-24T14:39:00.000+00:00 2014-01-13T16:06:31.838+00:00<br />
<b>BLAST+ should keep its BL_ORD_ID identifiers to itself</b><br />
<br />
This is in a sense a continuation of my previous BLAST blog post, <a href="http://blastedbio.blogspot.co.uk/2012/10/my-ids-not-good-enough-for-ncbi-blast.html">My IDs not good enough for NCBI BLAST+</a>. My core complaint is that <tt>makeblastdb</tt> currently ignores the user's own identifiers and automatically assigns its own identifiers (<tt>gnl|BL_ORD_ID|0</tt>, <tt>gnl|BL_ORD_ID|1</tt>, <tt>gnl|BL_ORD_ID|2</tt>, <i>etc</i>), and that the BLAST+ suite as a whole is inconsistent about hiding these in its output.<br />
<br />
Note that one side-effect of BLAST+ ignoring the user identifiers and creating its own is that it can tolerate databases made from FASTA files with accidentally duplicated identifiers, but this only causes great confusion and ambiguity in the downstream analysis. One of the ways I've seen FASTA files be created with accidentally duplicated identifiers is pooling of assemblies where generic names like contig1 (or even the more complex Trinity naming scheme) naturally cause clashes. In situations like this, I think <tt>makeblastdb</tt> should give an error when attempting to build a BLAST database.<br />
<a name='more'></a><br />
<h3>
Sample Data</h3>
For the tests here, I created a simple sample FASTA file with deliberately duplicated identifiers. In fact I deliberately included a duplicated entry too (in the sense that both <tt>gene1</tt> entries have the same sequence). Here the first three entries are from the NC_005816 <i>Yersinia pestis</i> plasmid, while the remaining entries are from the HQ230977 <i>Salmonella enterica</i> plasmid.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ cat with_dups.fasta
>gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTG
AAAAACTGGATGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGA
ACTCCAGAACAGCATCCAGGCTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
>gene2
ATGAAATTTCATTTTTGTGATCTGAATCACTCTTATAAAAATCAGGAAGGGAAGATTCGCAGCAGAAAAA
CAGCACCGGGTAACATCAGAAAAAAACAGAAAGGAGATAACGTGAGCAAAACAAAATCTGGTCGCCACCG
ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
GACCTCATCCAGAAACACGTTTACACACTGACACAGGCCGACCTGCGCCATCTGGTCAGTGAAATCAGTA
ACGGTGTGGGACAGTCACAGGCCTACGATGCGATTTACCAGGCGAGACGCATTCGTCTCGCCCGTAAATA
CCTGAGCGGAAAAAAACCGGAAGGGGTGGAACCCCGGGAAGGGCAGGAACGGGAAGATTTACCATAA
>gene3 uniquely named
TTGGCTGATTTGAAAAAGCTACAGGTTTACGGACCTGAGTTACCCAGGCCATATGCCGATACCGTGAAAG
GTTCTCGGTACAAAAATATGAAAGAGCTTCGCGTTCAGTTTTCTGGCCGTCCGATAAGAGCCTTTTATGC
GTTCGATCCGATTCGTCGGGCTATCGTTCTTTGTGCAGGAGATAAAAGTAATGATAAGCGGTTTTATGAA
AAACTGGTGCGTATAGCTGAGGATGAATTTACAGCACATCTGAACACACTGGAGAGCAAGTAA
>gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTG
AAAAACTGGATGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGA
ACTCCAGAACAGCATCCAGGCTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
>gene2
ATGAGTGATTTATATAATGTAATATCACGGGCTGTTGAAGCGTCTGGCGCCGATCATTCAATTAATGAAA
AATTGACAAATGTCTTGAAAAGAGAATTAGTTGATTATGTCAGCATTGCGCATCTAAAAACCAAATTGTC
TGTATTATATGAGTTTGAAAAGAATTATTTGCAGCTTATCGCAGAATATAAGGAAGAAATAAAATTTGCT
TCCTCTTTGCAAGAGGATTTACGTAAAGAACGTGCTAAATTCTTTTCTGAGACATTAAAAGAGGTTCATC
AAACTCTAAATGAATCTCAAGTTGATAATGAAGTGGCATCAAAATGGATTAAAGAACTTGTTGGTAGTTA
TACCAAAAGCTTGGATCTAAGCGGAGGCCTTGTTGAAGAGCATACATTAGATACAATTGCTTGTATTCGC
GCTGAGGCTAAATTAAATAAACCATCTATTGAGCCGGGCAATAACTAA
</pre>
</td></tr>
</tbody></table>
<br />
So, now let's try using this with recent versions of BLAST+ ...<br />
<br />
<h3>
Making a BLAST database</h3>
Here I'm using a problematic FASTA file containing duplicated identifiers to build a BLAST database:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.28+/bin/makeblastdb -in with_dups.fasta -dbtype nucl -out with_dups
Building a new DB, current time: 12/24/2013 07:47:05
New DB name: with_dups
New DB title: with_dups.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 5 sequences in 0.030242 seconds.
$ ls with_dups.n*
with_dups.nhr with_dups.nin with_dups.nsq
</pre>
</td></tr>
</tbody></table>
<br />
I also created the same database using older versions of BLAST+ for comparison,<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.27+/bin/makeblastdb -in with_dups.fasta -dbtype nucl -out with_dups_v27
Building a new DB, current time: 12/24/2013 07:47:55
New DB name: with_dups_v27
New DB title: with_dups.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 5 sequences in 0.0043211 seconds.
$ ls with_dups_v27*
with_dups_v27.nhr with_dups_v27.nin with_dups_v27.nsq
$ diff with_dups_v27.nhr with_dups.nhr
$ diff with_dups_v27.nsq with_dups.nsq
$ diff with_dups_v27.nin with_dups.nin
</pre>
</td></tr>
</tbody></table>
<br />
Comparing the two, the <tt>*.nhr</tt> and <tt>*.nsq</tt> files are the same, while the <tt>*.nin</tt> files embed a time stamp - and therefore differ <i>unless</i> you run <tt>makeblastdb</tt> within the same minute.<br />
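<br />
If you have many BLAST+ versions to check, the same comparison could be scripted. Here is a minimal Python sketch using the standard library - the function name is made up, and it assumes a single-volume nucleotide database with just these three files:<br />

```python
import filecmp


def compare_databases(prefix_a, prefix_b, extensions=(".nhr", ".nsq", ".nin")):
    """Return {extension: True/False} for byte-identical database files."""
    return {
        ext: filecmp.cmp(prefix_a + ext, prefix_b + ext, shallow=False)
        for ext in extensions
    }
```

For example, <tt>compare_databases("with_dups_v27", "with_dups")</tt> would compare the two databases built above, with the caveat about the time stamp in the <tt>*.nin</tt> files.<br />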
<br />
So, from BLAST+ 2.2.27 to 2.2.28 nothing seems to have changed on the <tt>makeblastdb</tt> side. I went back and checked with the first release BLAST+ 2.2.18 and it too gives an identical database.<br />
<br />
<h3>
Examining the BLAST database</h3>
It seems that as of BLAST+ 2.2.28, <tt>blastdbcmd</tt> tries to hide the internal details of the automatically assigned names. For example, attempting to recover the input FASTA file previously showed quite clearly that the original identifiers were in some sense demoted to part of the description (or sequence title) only. Here's an example using BLAST+ 2.2.27:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups -entry all
>gnl|BL_ORD_ID|0 gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTGAAAAACTGGA
TGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGAACTCCAGAACAGCATCCAGG
CTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
...
</pre>
</td></tr>
</tbody></table>
<br />
With BLAST+ 2.2.28, the internal implementation details are more hidden and the original FASTA naming is recovered when you ask for all the database entries:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry all
>gene1
GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAGAGCCTGATACTGCTTGAAAAACTGGA
TGCTCTGGATGCCGACGAGCAGGCGGCCATGTGTGAACGACTGCACGAACTCGCGGAAGAACTCCAGAACAGCATCCAGG
CTCGCTTTGAAGCCGAAAGTGAAACAGGAACATAA
...
</pre>
</td></tr>
</tbody></table>
<br />
Furthermore, you can see how older versions of BLAST+ exposed the internal identifiers:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups -entry all
-outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
OID: 0 GI: N/A ACC: BL_ORD_ID:0 IDENTIFIER: gnl|BL_ORD_ID|0
OID: 1 GI: N/A ACC: BL_ORD_ID:1 IDENTIFIER: gnl|BL_ORD_ID|1
OID: 2 GI: N/A ACC: BL_ORD_ID:2 IDENTIFIER: gnl|BL_ORD_ID|2
OID: 3 GI: N/A ACC: BL_ORD_ID:3 IDENTIFIER: gnl|BL_ORD_ID|3
OID: 4 GI: N/A ACC: BL_ORD_ID:4 IDENTIFIER: gnl|BL_ORD_ID|4</pre>
</td></tr>
</tbody></table>
<br />
Compare that to the BLAST+ 2.2.28 where the internal BLAST ordinal identifiers are redacted:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry all
-outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
OID: 0 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 1 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 2 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 3 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 4 GI: N/A ACC: No ID available IDENTIFIER: No ID available</pre>
</td></tr>
</tbody></table>
<br />
This is problematic as in order to pull out an individual entry using <tt>blastdbcmd</tt> your own identifiers are not currently supported, so you need to know the internally assigned identifier - but most of BLAST+ is now trying to hide them from you:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry gene3
Error: gene3: OID not found
BLAST query/options error: Entry not found in BLAST database
Please refer to the BLAST+ user manual.</pre>
</td></tr>
</tbody></table>
and similarly:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups_v27 -entry gene3
Error: gene3: OID not found
BLAST query/options error: Entry not found in BLAST database</pre>
</td></tr>
</tbody></table>
<br />
In this example, <tt>gene3</tt> is an original unambiguous identifier. Pulling this entry from the database requires knowing the automatically assigned identifier, but as of BLAST+ 2.2.28 working out what this is was made harder:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.27+/bin/blastdbcmd -db with_dups -entry "gnl|BL_ORD_ID|2"
>gnl|BL_ORD_ID|2 gene3 uniquely named
TTGGCTGATTTGAAAAAGCTACAGGTTTACGGACCTGAGTTACCCAGGCCATATGCCGATACCGTGAAAGGTTCTCGGTA
CAAAAATATGAAAGAGCTTCGCGTTCAGTTTTCTGGCCGTCCGATAAGAGCCTTTTATGCGTTCGATCCGATTCGTCGGG
CTATCGTTCTTTGTGCAGGAGATAAAAGTAATGATAAGCGGTTTTATGAAAAACTGGTGCGTATAGCTGAGGATGAATTT
ACAGCACATCTGAACACACTGGAGAGCAAGTAA
</pre>
</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ ~/Downloads/ncbi-blast-2.2.28+/bin/blastdbcmd -db with_dups -entry "gnl|BL_ORD_ID|2"
>gnl|BL_ORD_ID|2 gene3 uniquely named
TTGGCTGATTTGAAAAAGCTACAGGTTTACGGACCTGAGTTACCCAGGCCATATGCCGATACCGTGAAAGGTTCTCGGTA
CAAAAATATGAAAGAGCTTCGCGTTCAGTTTTCTGGCCGTCCGATAAGAGCCTTTTATGCGTTCGATCCGATTCGTCGGG
CTATCGTTCTTTGTGCAGGAGATAAAAGTAATGATAAGCGGTTTTATGAAAAACTGGTGCGTATAGCTGAGGATGAATTT
ACAGCACATCTGAACACACTGGAGAGCAAGTAA</pre>
</td></tr>
</tbody></table>
<br />
Also note here BLAST+ 2.2.28 <tt>blastdbcmd</tt> is being inconsistent about showing the internal names (for individual entries) or hiding them (when all entries are requested).<br />
<br />
<h3>
Searching the ambiguous database</h3>
Now let's search against the nucleotide database we made from this ambiguous file using <tt>blastn</tt>. As my query I've taken a chunk from one of the sequences in the database. Let's start with the tabular output - as an aside, I'm piping the query into <tt>blastn</tt> as stdin, and it gets automatically named as <tt>Query_1</tt>. This is a handy shortcut for quick tests without bothering to make a query FASTA file.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.28+/bin/blastn -db with_dups -outfmt 6
Query_1 gene2 100.00 70 0 0 1 70 141 210 5e-35 130</pre>
</td></tr>
</tbody></table>
<br />
Notice we've got a single 100% match against "gene2". But which gene2, since there were actually two sequences named this in the original FASTA file used for this database?<br />
<br />
Interestingly you get the same from BLAST+ 2.2.23 to 2.2.28, but going back to the first release BLAST+ 2.2.18 through to 2.2.22, it gave the internal identifier "gnl|BL_ORD_ID|1" instead (and a shorter default query name of "1"):<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.22+/bin/blastn -db with_dups -outfmt 6
1 gnl|BL_ORD_ID|1 100.00 70 0 0 1 70 141 210 5e-35 130</pre>
</td></tr>
</tbody></table>
<br />
The same change in BLAST+ 2.2.23 also applies to the comma-separated output:
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.22+/bin/blastn -db with_dups -outfmt 10
1,gnl|BL_ORD_ID|1,100.00,70,0,0,1,70,141,210,5e-35, 130
$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.23+/bin/blastn -db with_dups -outfmt 10
Query_1,gene2,100.00,70,0,0,1,70,141,210,5e-35, 130</pre>
</td></tr>
</tbody></table>
<br />
Here's an excerpt from the default human readable text output instead,<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.28+/bin/blastn -db with_dups
...
Query=
Length=70
Score E
Sequences producing significant alignments: (Bits) Value
gene2 130 5e-35
> gene2
Length=417
Score = 130 bits (70), Expect = 5e-35
Identities = 70/70 (100%), Gaps = 0/70 (0%)
Strand=Plus/Plus
Query 1 ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACG 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 141 ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACG 200
Query 61 GACAGCCCGT 70
||||||||||
Sbjct 201 GACAGCCCGT 210
...
</pre>
</td></tr>
</tbody></table>
<br />
Again, we're told "gene2" which is sadly ambiguous. Also oddly unlike the tabular output (and XML output below), the query is unnamed here. The output from BLAST+ 2.2.18 through 2.2.27 looks the same as this output from BLAST+ 2.2.28 in terms of seeing the original identifier ("gene2").<br />
<br />
However, if we ask for the XML output, we get to see the internal identifier of the "gene2" BLAST match, "gnl|BL_ORD_ID|1":
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td class="td-caption"><pre>$ echo ACTGAGCAAAACAGACAAACGCCTGCTGGCTGCACTTGTCGTTGCCGGATACGAAGAACGGACAGCCCGT
| ~/Downloads/ncbi-blast-2.2.28+/bin/blastn -db with_dups -outfmt 5
...
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>No definition line</Iteration_query-def>
<Iteration_query-len>70</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|1</Hit_id>
<Hit_def>gene2</Hit_def>
<Hit_accession>1</Hit_accession>
<Hit_len>417</Hit_len>
<Hit_hsps>
<Hsp>
...
</pre>
</td></tr>
</tbody></table>
<br />
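When parsing this XML yourself, one way to cope is to fall back on the first word of <tt>Hit_def</tt> whenever <tt>Hit_id</tt> looks like an internal identifier. Here is a minimal Python sketch using the standard library XML parser. The function name is made up, and the fallback rule is my assumption about what most people would want rather than anything BLAST+ itself defines:<br />

```python
import xml.etree.ElementTree as ET


def hit_names(xml_text):
    """Yield (identifier, description) for each hit in BLAST XML output."""
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id", default="")
        hit_def = hit.findtext("Hit_def", default="")
        if hit_id.startswith("gnl|BL_ORD_ID|"):
            # Internal identifier: the original title is in Hit_def instead.
            name, _, description = hit_def.partition(" ")
            yield name, description
        else:
            yield hit_id, hit_def
```

For the example above this would report the hit as "gene2", which matches the tabular output but is of course just as ambiguous.<br />
<br />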
<h3>
Ways forward</h3>
First of all, <tt>makeblastdb</tt> needs to reject duplicated identifiers when building a BLAST database. This should be an error condition, and would prevent the ambiguous situation in the example above (a BLAST match to "gene2", but which "gene2"?).<br />
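<br />
In the meantime you can check for this yourself before calling <tt>makeblastdb</tt>. Here is a minimal Python sketch, taking the identifier to be the first word after the <tt>&gt;</tt> on each title line - the function name is made up for illustration:<br />

```python
from collections import Counter


def duplicated_ids(fasta_lines):
    """Return {identifier: count} for identifiers used more than once."""
    counts = Counter(
        line[1:].split(None, 1)[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return {name: n for name, n in counts.items() if n > 1}
```

This takes any iterable of lines, so an open file handle for <tt>with_dups.fasta</tt> can be passed in directly. An empty dictionary means no clashes were found.<br />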
<br />
The wider issue is that despite some cosmetic changes, BLAST+ still essentially ignores the user's own identifiers (unless of course you follow the NCBI pipe-based naming scheme and use <tt>-parse_seqids</tt>). Without digging into the BLAST+ source code, I can't say if this is feasible or not - but from an end-user perspective, BLAST+ should really just use the given identifiers as they are. This is what I was hoping for in the previous blog post (<a href="http://blastedbio.blogspot.co.uk/2012/10/my-ids-not-good-enough-for-ncbi-blast.html">My IDs not good enough for NCBI BLAST+</a>).<br />
<br />
Assuming, however, that the database file format etc. makes using the user-provided identifiers impractical, then BLAST+ will have to continue to assign its own identifiers as now (e.g. "gnl|BL_ORD_ID|2"). However, I regard these as an internal implementation detail, and think BLAST+ should <i>consistently</i> hide them in its output (making it look like it uses the given identifiers).<br />
<br />
This requires several minor changes:<br />
<ul>
<li>Treat duplicated identifiers as an error in makeblastdb (for the reasons explained above)</li>
<li>Hide the internal identifiers in BLAST XML output (as always done in the plain text and HTML, and done for the tabular and comma separated output since BLAST+ 2.2.23)</li>
<li>Finish hiding the internal identifier in <tt>blastdbcmd</tt> FASTA output (hidden for <tt>-entry all</tt> in BLAST+ 2.2.28, but not for specific entries)</li>
<li>Support user's original identifiers in <tt>blastdbcmd</tt> for retrieval (at worst this could be done without any changes to the database format with a brute force loop, acceptable for small custom databases)</li>
<li>Review other BLAST+ output for any other remaining cases where the internal identifiers are still used.</li>
</ul>
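To show what I mean by a brute force loop, here is a minimal Python sketch which works on the FASTA text from <tt>blastdbcmd -entry all</tt>. It copes with both the older style titles (starting <tt>gnl|BL_ORD_ID|N</tt>) and the BLAST+ 2.2.28 style. The function names are made up for illustration, and this is a stop-gap outside BLAST+ rather than how <tt>blastdbcmd</tt> itself might implement it:<br />

```python
def read_fasta(fasta_text):
    """Yield (title, sequence) pairs from FASTA formatted text."""
    title, chunks = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


def original_id(title):
    """Return the user's identifier, skipping any gnl|BL_ORD_ID|N prefix."""
    words = title.split()
    if words and words[0].startswith("gnl|BL_ORD_ID|"):
        words = words[1:]
    return words[0] if words else ""


def find_by_original_id(fasta_text, wanted):
    """Brute force search returning every record named wanted."""
    return [
        (title, seq)
        for title, seq in read_fasta(fasta_text)
        if original_id(title) == wanted
    ]
```

Returning a list rather than a single record is deliberate: with the sample database, asking for <tt>gene1</tt> gives two records, making the duplication visible rather than silently picking one.<br />
<br />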
The take-away message is that if BLAST+ has to use its own automatically assigned identifiers, it should be consistent about whether or not they appear in the BLAST output files. The current situation, where they are sometimes shown and sometimes hidden, is simply confusing. My view is they are an internal implementation detail which should not be exposed to the end user.<br />
<br />
<u><b>Update (13 January 2014)</b></u><br />
<br />
Nothing seems to have changed here in the new point release BLAST+ 2.2.29 released today. Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com7tag:blogger.com,1999:blog-8584629468471803075.post-12486606475819143922013-11-14T15:15:00.001+00:002015-02-02T11:48:08.128+00:00Trouble with chimeras - getting all complete viral genomes from the NCBIBack in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in GenBank format, which I then processed to make FASTA files and BLAST databases. Recently I updated them and ran into some problems... false positives like entire bacterial genomes! This turns out to be due to a few bacteria with integrated phage being annotated as chimeras - genomes combined from multiple organisms.<br />
<a name='more'></a><br />
<br />
The core of this task uses <a href="http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/esearch_help.html">esearch (NCBI Entrez Search)</a> on the nucleotide database, originally with the following query: <tt>"<a href="http://www.ncbi.nlm.nih.gov/nuccore/?term=txid35237%5Borgn%5D+AND+complete%5Bprop%5D">txid10239[orgn] AND complete[prop]</a>"</tt>. This restricted the search to organisms (shortened to <tt>[orgn]</tt> or <tt>[ORGN]</tt>) under the NCBI taxonomy tree for <a href="http://www.ncbi.nlm.nih.gov/taxonomy/?term=10239">Taxon ID 10239, the superkingdom for viruses</a>, with the "complete" property (shortened to <tt>[prop]</tt> or <tt>[PROP]</tt>).<br />
<br />
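For illustration, here is a small Python sketch of how such a query can be turned into an esearch request URL using only the standard library. The helper name and the <tt>retmax</tt> value are assumptions made for this example, and it only builds the URL rather than contacting the NCBI.<br />

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez esearch utility
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nucleotide", retmax=100):
    """Return the esearch URL for a query (hypothetical helper)."""
    # urlencode escapes the square brackets and turns spaces into plus signs
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = esearch_url("txid10239[orgn] AND complete[prop]")
print(url)
```

The escaped form of the query (with <tt>%5B</tt> and <tt>%5D</tt> for the square brackets) is the same style as used in the search links in this post.<br />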
Fast-forward to 2013: as the databases have grown, there are now noticeable numbers of "complete" CDS sequences which I don't want. After some poking about exploring the current property set via the <i>advanced search</i> in the Entrez web interface, I settled on this query "<a href="http://www.ncbi.nlm.nih.gov/nuccore/?term=txid35237%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome"><tt>txid35237[orgn] AND complete[prop] AND genome</tt></a>" instead (although there ought to be a more precise way to restrict this to genome sequences).<br />
<br />
All seemed well: I had my <a href="https://github.com/peterjc/picobio/blob/master/fetch_viruses/fetch_viruses.py">fetch_viruses.py</a> script talking to the NCBI and caching all the GenBank format files locally (using the <a href="https://github.com/biopython/biopython/blob/master/Bio/Entrez/__init__.py">Biopython Bio.Entrez module</a>), and <a href="https://github.com/peterjc/picobio/blob/master/fetch_viruses/merge_viruses.py">merge_viruses.py</a> turning them into non-redundant BLAST databases of complete virus genomes, their genes, and proteins. However, a week later I spotted a false positive in a set of local BLAST results: the complete genome of <a href="http://www.ncbi.nlm.nih.gov/nuccore/AE017333.1">Bacillus licheniformis DSM 13 = ATCC 14580</a> was in my (dsDNA) virus genome database!<br />
<br />
Adding a length filter to the NCBI esearch query, "<a href="http://www.ncbi.nlm.nih.gov/nuccore/?term=txid35237%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+AND+1000000%3A100000000000%5BSequence+Length%5D">txid35237[orgn] AND complete[prop] AND genome AND 1000000:100000000000[Sequence Length]</a>", reveals multiple complete bacterial genomes showing up under viruses (and some cool enormous megaviruses, mimiviruses, and pandoraviruses). Here are the first three hits - all apparent false positives:<br />
<ul>
<li><a href="http://www.ncbi.nlm.nih.gov/nuccore/AE006468.1">Salmonella enterica subsp. enterica serovar Typhimurium str. LT2, complete genome</a><br />
4,857,432 bp circular DNA<br />
Accession: AE006468.1 GI: 16445344</li>
<li><a href="http://www.ncbi.nlm.nih.gov/nuccore/AE017333.1">Bacillus licheniformis DSM 13 = ATCC 14580, complete genome</a><br />
4,222,645 bp circular DNA<br />
Accession: AE017333.1 GI: 52346357</li>
<li><a href="http://www.ncbi.nlm.nih.gov/nuccore/CP002277.1">Haemophilus influenzae R2866, complete genome</a><br />
1,932,306 bp circular DNA<br />
Accession: CP002277.1 GI: 309750011</li>
</ul>
Looking at these in the GenBank format all becomes clear - they are annotated as chimeras, bacteria with integrated phage, e.g.<br />
<br />
<pre>LOCUS       AE017333             4222645 bp    DNA     circular BCT 23-MAY-2013
DEFINITION  Bacillus licheniformis DSM 13 = ATCC 14580, complete genome.
ACCESSION   AE017333
VERSION     AE017333.1  GI:52346357
...
FEATURES             Location/Qualifiers
     source          order(1..1317753,1345263..1422555,1464175..4222645)
                     /organism="Bacillus licheniformis DSM 13 = ATCC 14580"
                     /mol_type="genomic DNA"
                     /strain="DSM 13"
                     /culture_collection="ATCC:14580"
                     /culture_collection="DSMZ:DSM 13"
                     /db_xref="taxon:279010"
                     /focus
     source          1317754..1345262
                     /organism="Bacillus phage BLi_Pp2"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:1230651"
                     /note="PBSX homolog phage"
     source          1422556..1464174
                     /organism="Bacillus phage BLi_Pp3"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:1230652"
                     /note="putative PBLD homologe phage"
...
</pre>
<br />
This seems unusual, but understandable. Historically GenBank files for bacteria had a single source feature covering the whole of the genome. The fact that these (mostly) bacterial genomes now show up in virus-specific searches is an unfortunate side effect.<br />
<br />
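A record like this can be flagged automatically by counting the distinct organisms named in its source features. The sketch below is a deliberately simple line-based parser written for this example; it is not taken from the scripts mentioned above, the function names are hypothetical, and it assumes the standard GenBank layout (feature keys starting in column 6, qualifiers in column 22) with each organism name on a single line. A real pipeline would more likely use a proper GenBank parser.<br />

```python
def source_organisms(genbank_text):
    """Return the /organism value of each source feature, in file order."""
    organisms = []
    in_source = False
    for line in genbank_text.splitlines():
        key = line[5:21].split()
        if line.startswith("     ") and key:
            # A new feature key such as "source" or "gene"
            in_source = key[0] == "source"
        elif in_source and line.strip().startswith('/organism="'):
            value = line.strip()[len('/organism="'):]
            organisms.append(value.rstrip('"'))
    return organisms


def looks_chimeric(genbank_text):
    """True if the source features name more than one organism."""
    return len(set(source_organisms(genbank_text))) > 1


INDENT = " " * 21  # qualifiers start in column 22
record = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     source".ljust(21) + "order(1..1317753,1345263..1422555,1464175..4222645)",
    INDENT + '/organism="Bacillus licheniformis DSM 13 = ATCC 14580"',
    INDENT + "/focus",
    "     source".ljust(21) + "1317754..1345262",
    INDENT + '/organism="Bacillus phage BLi_Pp2"',
    "     source".ljust(21) + "1422556..1464174",
    INDENT + '/organism="Bacillus phage BLi_Pp3"',
])
print(source_organisms(record))
print(looks_chimeric(record))
```
<br />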
This is still a rare problem, as shown by this search which looks for complete sequences which are under both <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=35237">dsDNA viruses (taxon id 35237)</a> and <a href="http://www.ncbi.nlm.nih.gov/taxonomy/?term=131567">cellular organisms (taxon id 131567)</a>, "<a href="http://www.ncbi.nlm.nih.gov/nuccore/?term=txid35237%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+AND+txid131567%5Borgn%5D">txid35237[orgn] AND complete[prop] AND genome AND txid131567[orgn]</a>" (currently 10 matches).<br />
<br />
The solution to avoiding these chimeras is the search "<a href="http://www.ncbi.nlm.nih.gov/nuccore/?term=txid35237%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+NOT+txid131567%5Borgn%5D">txid35237[orgn] AND complete[prop] AND genome NOT txid131567[orgn]</a>" (viruses but not cellular organisms). I've <a href="https://github.com/peterjc/picobio/commit/58430f4e84d601a62063bb9247ea793fea3009e0">fixed my script</a> to do this, and am regenerating my complete virus BLAST databases.<br />
<br />
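The final query is just a few Entrez terms joined with boolean operators, so it can be assembled from its parts. A minimal sketch follows; the function and argument names are hypothetical and chosen only for this example.<br />

```python
def genome_term(include_taxid, exclude_taxid=None):
    """Build an Entrez query for complete genomes under one taxon."""
    term = "txid%i[orgn] AND complete[prop] AND genome" % include_taxid
    if exclude_taxid is not None:
        # Drop anything also classified under the excluded taxon
        term += " NOT txid%i[orgn]" % exclude_taxid
    return term


print(genome_term(35237, exclude_taxid=131567))
```
<br />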
<u><b>Update (18 April 2014)</b></u><br />
<br />
With hindsight, I neglected to consider the more straightforward approach of using the NCBI FTP site, specifically the <a href="ftp://ftp.ncbi.nih.gov/genomes/Viruses/">ftp://ftp.ncbi.nih.gov/genomes/Viruses/</a> folder where you can download many species individually - or all the GenBank or FASTA files as a tar-ball.<br />
<br />
Rodney Brister from the Virus Genomes Group at the NCBI also emailed to draw my attention to <a href="ftp://ftp.ncbi.nih.gov/genomes/Viruses/Viruses_RefSeq_and_neighbors_genome_data.tab">ftp://ftp.ncbi.nih.gov/genomes/Viruses/Viruses_RefSeq_and_neighbors_genome_data.tab</a> which lists all virus RefSeqs and their manually curated "neighbours", and looks useful too. He also confirmed the "chimera" problems are just a side effect of more detailed "source" annotations in some recent submissions.<br />
<br />
<b><u>Update (2 Feb 2015)</u></b><br />
<br />
See also <a href="http://dx.doi.org/10.1093/nar/gku1207">Brister <i>et al.</i> (2015) NCBI Viral Genomes Resource</a>.Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com4tag:blogger.com,1999:blog-8584629468471803075.post-63867078914611647422013-10-20T13:40:00.003+01:002013-10-20T13:40:44.413+01:00UTF8 encoded Japanese in LaTeXSlightly off topic, but anyway... notes on getting Japanese text working in LaTeX under Mac OS X using TeX Live. Once I finally got it to work it is quite easy, but first I explored a lot of dead ends and distractions (in the end I could ignore <a href="http://en.wikipedia.org/wiki/Omega_(TeX)">LaTeX Omega</a>, <a href="http://en.wikipedia.org/wiki/XeTeX">XeLaTeX</a>, etc). I'm just using <span style="font-family: Courier New, Courier, monospace;">pdflatex</span> with the <a href="http://ctan.org/pkg/cjk">LaTeX Chinese, Japanese, Korean (CJK) package</a>, here's an example from the PDF output:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9SfhTDAcjy3WjSvLIFcY1GMQPM4s_IW_7smYgVykyywyrWduBfY-NofESmifuKFKvFmGhw6WrSGdV7rUxOHOM_YSS-o6Tv7z5l5wQeLr12544J47NZMUkpg_NyDZTz0Vu0W2iOkhmd0E/s1600/Japanese_in_LaTeX_output.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9SfhTDAcjy3WjSvLIFcY1GMQPM4s_IW_7smYgVykyywyrWduBfY-NofESmifuKFKvFmGhw6WrSGdV7rUxOHOM_YSS-o6Tv7z5l5wQeLr12544J47NZMUkpg_NyDZTz0Vu0W2iOkhmd0E/s400/Japanese_in_LaTeX_output.png" width="400" /></a><br />
<div>
<br /></div>
</td></tr>
</tbody></table>
<br />
<a name='more'></a><br />
<h3>
Using Unicode UTF8 in LaTeX</h3>
<br />
Here's the LaTeX source to the example using kanji, hiragana and katakana, and a choice of three fonts:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: left;"><pre>% !TEX encoding = UTF-8 Unicode
\documentclass{letter}
\usepackage{CJKutf8}
\begin{document}
The following three Japanese fonts should work:
\begin{CJK}{UTF8}{min}
明朝 Mincho (\texttt{min}), \\
e.g. 私の名前はピーターです。
\end{CJK}
\begin{CJK}{UTF8}{goth}
ゴシック Gothic (\texttt{goth}), \\
e.g. 私の名前はピーターです。
\end{CJK}
\begin{CJK}{UTF8}{maru}
丸ゴシック Maru Gothic (\texttt{maru}), \\
e.g. 私の名前はピーターです。
\end{CJK}
\end{document}</pre>
</td></tr>
</tbody></table>
<br />
This just says the font names and repeats the phrase "watashi no namae wa pi-ta- desu" ("My name is Peter"). As long as you have configured Japanese input within Mac OS X, you can just type the Japanese text into your LaTeX file as normal.<br />
<br />
Assuming the relevant packages are installed (see below) compile it as usual, via the LaTeX button within TeXShop, or in the terminal:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">$ pdflatex example.tex</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span>
<br />
<span style="font-family: Courier New, Courier, monospace;">Output written on example.pdf (1 page, 63529 bytes).</span><br />
<span style="font-family: Courier New, Courier, monospace;">Transcript written on example.log.</span><br />
<br />
The special encoding line at the start can be inserted within TeXShop via the menu "Macros", "Encoding", "UTF-8 Unicode". However, it does not seem to be essential.<br />
<br />
<br />
Notice that you can use normal English characters inside the CJK environment, which is perfect for bits of English. If you are mostly writing in Japanese with occasional English, it seems quite natural to use a single CJK environment for the entire document.<br />
<br />
<br />
<h3>
LaTeX Package Installation</h3>
<br />
I'm using <a href="http://pages.uoregon.edu/koch/texshop/">TeXShop</a> as my LaTeX editor on Mac OS X 10.8 <i>Mountain Lion</i>, specifically v3.25 which also supports <i>Lion</i> and the imminent update <i>Mavericks</i>. Due to the small 64GB SSD on my MacBook Air, I'm not using the standard all-inclusive TeX Live setup, but instead the <a href="http://www.tug.org/mactex/morepackages.html">smaller BasicTeX package</a> (i.e. <a href="http://mirror.ctan.org/systems/mac/mactex/mactex-basic.pkg">mactex-basic.pkg</a>).<br />
<br />
Here's how to update your TeX/LaTeX system using the TeX Live Manager (tlmgr) command line tool:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">$ sudo tlmgr update --self --all</span><br />
<br />
Since I did not install the full TeX Live, every so often I find I am missing a useful package. For instance, if a LaTeX file fails to compile because it is missing <span style="font-family: Courier New, Courier, monospace;">fullpage.sty</span>, first find out which package provides that file, then install it:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">$ tlmgr search --global --file fullpage.sty</span><br />
<span style="font-family: Courier New, Courier, monospace;">tlmgr: package repository http://mirror.ox.ac.uk/sites/ctan.org/systems/texlive/tlnet</span><br />
<span style="font-family: Courier New, Courier, monospace;">preprint:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>texmf-dist/tex/latex/preprint/fullpage.sty</span><br />
<br />
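If you find yourself doing this often, the search output is simple enough to parse. The Python sketch below is only an illustration based on the output format shown above (a package name ending in a colon, followed by indented file paths); the function name is hypothetical, and other tlmgr versions may format their output differently.<br />

```python
def packages_providing(search_output):
    """Map each package name to the list of files reported under it."""
    packages = {}
    current = None
    for line in search_output.splitlines():
        if line.startswith("tlmgr:") or not line.strip():
            continue  # status message or blank line
        if line[0].isspace():
            # Indented lines are file paths belonging to the current package
            if current is not None:
                packages[current].append(line.strip())
        elif line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]
            packages[current] = []
    return packages


example = (
    "tlmgr: package repository "
    "http://mirror.ox.ac.uk/sites/ctan.org/systems/texlive/tlnet\n"
    "preprint:\n"
    "\ttexmf-dist/tex/latex/preprint/fullpage.sty\n"
)
print(packages_providing(example))
```
<br />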
You need admin rights to actually install the package:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">$ sudo tlmgr install preprint</span><br />
<br />
Notice that the search automatically picked a local mirror, even though the configured repository is the generic mirror.ctan.org address:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">$ tlmgr repository list</span><br />
<span style="font-family: Courier New, Courier, monospace;">List of repositories (with tags if set):</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>http://mirror.ctan.org/systems/texlive/tlnet (main)</span><br />
<br />
For this Japanese text example I had to install the <a href="http://ctan.org/pkg/cjk">LaTeX Chinese, Japanese, Korean (CJK) package</a>:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">$ sudo tlmgr install cjk</span>Peter Cockhttp://www.blogger.com/profile/00233221181317137855noreply@blogger.com0