The core of this task uses esearch (NCBI Entrez Search) on the nucleotide database, originally with the following query: "txid10239[orgn] AND complete[prop]". This restricted the search to organisms (shortened to [orgn] or [ORGN]) under the NCBI taxonomy tree for Taxon ID 10239, the superkingdom for viruses, with the "complete" property (shortened to [prop] or [PROP]).
Fast-forward to 2013, as the databases grew there are now noticeable numbers of "complete" CDS sequences which I don't want. After some poking about exploring the current property set via the advanced search in the Entrez web interface, I settled on this query "txid35237[orgn] AND complete[prop] AND genome" instead (although there ought to be a more precise way to restrict this to genome sequences).
All seemed well, I had my fetch_viruses.py script talking to the NCBI and caching all the GenBank format files locally (using the Biopython Bio.Entrez module), and merge_virurses.py turning them into non-redundant BLAST databases of complete virus genomes, their genes, and proteins. However, a week later I spotted a false positive on a set of local BLAST results, the complete genome of Bacillus licheniformis DSM 13 = ATCC 14580 was in my (dsDNA) virus genome database!
Adding a length filter to the NCBI esearch query "txid35237[orgn] AND complete[prop] AND genome AND 1000000:100000000000[Sequence Length]" there are multiple complete bacteria showing up under viruses (and some cool enormous megaviruses, minivirus, and pandoraviruses). Here are the first three hits - all apparent false positives:
- Salmonella enterica subsp. enterica serovar Typhimurium str. LT2, complete genome
4,857,432 bp circular DNA
Accession: AE006468.1 GI: 16445344 - Bacillus licheniformis DSM 13 = ATCC 14580, complete genome
4,222,645 bp circular DNA
Accession: AE017333.1 GI: 52346357 - Haemophilus influenzae R2866, complete genome
1,932,306 bp circular DNA
Accession: CP002277.1 GI: 309750011
LOCUS AE017333 4222645 bp DNA circular BCT 23-MAY-2013 DEFINITION Bacillus licheniformis DSM 13 = ATCC 14580, complete genome. ACCESSION AE017333 VERSION AE017333.1 GI:52346357 ... FEATURES Location/Qualifiers source order(1..1317753,1345263..1422555,1464175..4222645) /organism="Bacillus licheniformis DSM 13 = ATCC 14580" /mol_type="genomic DNA" /strain="DSM 13" /culture_collection="ATCC:14580" /culture_collection="DSMZ:DSM 13" /db_xref="taxon:279010" /focus source 1317754..1345262 /organism="Bacillus phage BLi_Pp2" /mol_type="genomic DNA" /db_xref="taxon:1230651" /note="PBSX homolog phage" source 1422556..1464174 /organism="Bacillus phage BLi_Pp3" /mol_type="genomic DNA" /db_xref="taxon:1230652" /note="putative PBLD homologe phage" ...
This seems unusual, but understandable. Historically GenBank files for bacteria had a single source feature covering the whole of the genome. The fact these (mostly) bacterial genomes now show up in virus specific searches is an unfortunate side effect.
This is still a rare problem as shown by this search which looks for complete sequences which are under both virus (taxon id 35237) and cellular organisms (taxon id 131567), "txid35237[orgn] AND complete[prop] AND genome AND txid131567[orgn]" (currently 10 matches).
The solution to avoiding these chimeras is the search "txid35237[orgn] AND complete[prop] AND genome NOT txid131567[orgn]" (viruses but not cellular organisms). I've fixed my script to do this, and am regenerating my complete virus BLAST databases.
Update (18 April 2014)
With hindsight, I neglected to consider a more straight forward approach of using the NCBI FTP site, specifically the ftp://ftp.ncbi.nih.gov/genomes/Viruses/ folder where you can download many species individually - or all the GenBank or FASTA files as a tar-ball.
Rodney Brister from the Virus Genomes Group at the NCBI also emailed to draw my attention to ftp://ftp.ncbi.nih.gov/genomes/Viruses/Viruses_RefSeq_and_neighbors_genome_data.tab which lists all virus RefSeqs and their and manually curated "neighbours" and looks useful too. He also confirmed the "chimera" problems are just a side effect of more detailed "source" annotations in some recent submissions.
Update (2 Feb 2015)
See also Brister et al. (2015) NCBI Viral Genomes Resource.
Peter, thanks a lot for this extremely useful hint to a better use of research!
ReplyDeleteI have a few questions: in your script why do you loop through dsDnaViruses, dsRnaViruses, ssDnaViruses, ssRnaViruses, and then allViruses? Wouldn't it be enough to just cycle through allViruses only? I checked and, unless I did some mistake, allViruses is a superset of all others.
That said, I still find some glitches in the annotation. For example, the search http://www.ncbi.nlm.nih.gov/nuccore/?term=txid10239%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+NOT+txid131567%5Borgn%5D does not return at least one strain of HEV C 104: http://www.ncbi.nlm.nih.gov/nuccore/AB686524.1, although I think it should. Any idea on why?
I'm quite sure that you know this already, but just in case: you can download much faster from NCBI by fetching multiple ids at a time (I read somewhere on the NCBI website around 500 at a time). My script to download looks like this.
ids_count = len(names) # names is the list of genbank ids
step = 250 # download 250 sequences at a time
handle = open('your_nt_db.fasta', 'w+')
for i in range(0, ids_count, step):
ids = ','.join(names[i:i + step])
fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
data = fetch_handle.read()
fetch_handle.close()
handle.write(data)
if i % 1000 == 0:
print time.ctime(), ' Fetched %d sequences' % i
# download the sequences outside of the range
ids = ','.join(names[i:ids_count])
fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
data = fetch_handle.read()
fetch_handle.close()
handle.write(data)
handle.close()
Thanks again.
I have that loop over dsDNA, dsRNA, ssDNA, ssRNa, then all viruses as I wanted to make BLAST databases for each type of virus as well as all viruses. And yes, it is faster to download batches of records - but instead I wanted a local cache of one file per virus, which makes updates much more efficient.
DeleteRegarding AB686524, for some reason it does not have the complete property set - this is likely a bug in the annotation metadata, please tell the NCBI. e.g. Try this search on the Nucleotide database: AB686524 AND complete[prop]
Thanks Peter.
ReplyDeleteIn the meanwhile, I also realised that a better search might be "complete genome[Title] AND txid10239[orgn] NOT txid131567[orgn]". It returns ~35 thousand hits, a bit more than before. The point is that if you look for "complete[prop]" you exclude those hits that are described as complete genome in the title, but where the authors forgot to mention the property "completeness: complete".
This neatly shows the trouble with inconsistent annotation. Consider this search appears to show over 5000 complete virus genomes missing the completeness property: "complete genome"[Title] AND txid10239[orgn] NOT complete[prop]
DeleteIdeally all of those entries' metadata could be fixed...