The core of this task uses esearch (NCBI Entrez Search) on the nucleotide database, originally with the following query: "txid10239[orgn] AND complete[prop]". This restricted the search to organisms (shortened to [orgn] or [ORGN]) under the NCBI taxonomy tree for Taxon ID 10239, the superkingdom for viruses, with the "complete" property (shortened to [prop] or [PROP]).
Fast-forward to 2013, as the databases grew there are now noticeable numbers of "complete" CDS sequences which I don't want. After some poking about exploring the current property set via the advanced search in the Entrez web interface, I settled on this query "txid35237[orgn] AND complete[prop] AND genome" instead (although there ought to be a more precise way to restrict this to genome sequences).
All seemed well, I had my fetch_viruses.py script talking to the NCBI and caching all the GenBank format files locally (using the Biopython Bio.Entrez module), and merge_virurses.py turning them into non-redundant BLAST databases of complete virus genomes, their genes, and proteins. However, a week later I spotted a false positive on a set of local BLAST results, the complete genome of Bacillus licheniformis DSM 13 = ATCC 14580 was in my (dsDNA) virus genome database!
Adding a length filter to the NCBI esearch query "txid35237[orgn] AND complete[prop] AND genome AND 1000000:100000000000[Sequence Length]" there are multiple complete bacteria showing up under viruses (and some cool enormous megaviruses, minivirus, and pandoraviruses). Here are the first three hits - all apparent false positives:
- Salmonella enterica subsp. enterica serovar Typhimurium str. LT2, complete genome
4,857,432 bp circular DNA
Accession: AE006468.1 GI: 16445344
- Bacillus licheniformis DSM 13 = ATCC 14580, complete genome
4,222,645 bp circular DNA
Accession: AE017333.1 GI: 52346357
- Haemophilus influenzae R2866, complete genome
1,932,306 bp circular DNA
Accession: CP002277.1 GI: 309750011
LOCUS AE017333 4222645 bp DNA circular BCT 23-MAY-2013 DEFINITION Bacillus licheniformis DSM 13 = ATCC 14580, complete genome. ACCESSION AE017333 VERSION AE017333.1 GI:52346357 ... FEATURES Location/Qualifiers source order(1..1317753,1345263..1422555,1464175..4222645) /organism="Bacillus licheniformis DSM 13 = ATCC 14580" /mol_type="genomic DNA" /strain="DSM 13" /culture_collection="ATCC:14580" /culture_collection="DSMZ:DSM 13" /db_xref="taxon:279010" /focus source 1317754..1345262 /organism="Bacillus phage BLi_Pp2" /mol_type="genomic DNA" /db_xref="taxon:1230651" /note="PBSX homolog phage" source 1422556..1464174 /organism="Bacillus phage BLi_Pp3" /mol_type="genomic DNA" /db_xref="taxon:1230652" /note="putative PBLD homologe phage" ...
This seems unusual, but understandable. Historically GenBank files for bacteria had a single source feature covering the whole of the genome. The fact these (mostly) bacterial genomes now show up in virus specific searches is an unfortunate side effect.
This is still a rare problem as shown by this search which looks for complete sequences which are under both virus (taxon id 35237) and cellular organisms (taxon id 131567), "txid35237[orgn] AND complete[prop] AND genome AND txid131567[orgn]" (currently 10 matches).
The solution to avoiding these chimeras is the search "txid35237[orgn] AND complete[prop] AND genome NOT txid131567[orgn]" (viruses but not cellular organisms). I've fixed my script to do this, and am regenerating my complete virus BLAST databases.
Update (18 April 2014)
With hindsight, I neglected to consider a more straight forward approach of using the NCBI FTP site, specifically the ftp://ftp.ncbi.nih.gov/genomes/Viruses/ folder where you can download many species individually - or all the GenBank or FASTA files as a tar-ball.
Rodney Brister from the Virus Genomes Group at the NCBI also emailed to draw my attention to ftp://ftp.ncbi.nih.gov/genomes/Viruses/Viruses_RefSeq_and_neighbors_genome_data.tab which lists all virus RefSeqs and their and manually curated "neighbours" and looks useful too. He also confirmed the "chimera" problems are just a side effect of more detailed "source" annotations in some recent submissions.
Update (2 Feb 2015)
See also Brister et al. (2015) NCBI Viral Genomes Resource.