Blasted Bioinformatics!?: Trouble with chimeras - getting all complete viral genomes from the NCBI

2013-11-14

Trouble with chimeras - getting all complete viral genomes from the NCBI

Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in GenBank format, which I then processed to make FASTA files and BLAST databases. Recently I updated them and ran into some problems... false positives like entire bacterial genomes! This turns out to be due to a few bacteria with integrated phage being annotated as chimeras - genomes combined from multiple organisms.

The core of this task uses esearch (NCBI Entrez Search) on the nucleotide database, originally with the following query: "txid10239[orgn] AND complete[prop]". This restricted the search to organisms (shortened to [orgn] or [ORGN]) under the NCBI taxonomy tree for Taxon ID 10239, the superkingdom for viruses, with the "complete" property (shortened to [prop] or [PROP]).

Fast-forward to 2013, as the databases grew there are now noticeable numbers of "complete" CDS sequences which I don't want. After some poking about exploring the current property set via the advanced search in the Entrez web interface, I settled on this query "txid35237[orgn] AND complete[prop] AND genome" instead (although there ought to be a more precise way to restrict this to genome sequences).

All seemed well, I had my fetch_viruses.py script talking to the NCBI and caching all the GenBank format files locally (using the Biopython Bio.Entrez module), and merge_virurses.py turning them into non-redundant BLAST databases of complete virus genomes, their genes, and proteins. However, a week later I spotted a false positive on a set of local BLAST results, the complete genome of Bacillus licheniformis DSM 13 = ATCC 14580 was in my (dsDNA) virus genome database!

Adding a length filter to the NCBI esearch query "txid35237[orgn] AND complete[prop] AND genome AND 1000000:100000000000[Sequence Length]" there are multiple complete bacteria showing up under viruses (and some cool enormous megaviruses, minivirus, and pandoraviruses). Here are the first three hits - all apparent false positives:

Salmonella enterica subsp. enterica serovar Typhimurium str. LT2, complete genome
4,857,432 bp circular DNA
Accession: AE006468.1 GI: 16445344
Bacillus licheniformis DSM 13 = ATCC 14580, complete genome
4,222,645 bp circular DNA
Accession: AE017333.1 GI: 52346357
Haemophilus influenzae R2866, complete genome
1,932,306 bp circular DNA
Accession: CP002277.1 GI: 309750011

Looking at these in the GenBank format all becomes clear - they are annotated as chimeras, bacteria with integrated phage, e.g.

LOCUS       AE017333             4222645 bp    DNA     circular BCT 23-MAY-2013
DEFINITION  Bacillus licheniformis DSM 13 = ATCC 14580, complete genome.
ACCESSION   AE017333
VERSION     AE017333.1  GI:52346357
...
FEATURES             Location/Qualifiers
     source          order(1..1317753,1345263..1422555,1464175..4222645)
                     /organism="Bacillus licheniformis DSM 13 = ATCC 14580"
                     /mol_type="genomic DNA"
                     /strain="DSM 13"
                     /culture_collection="ATCC:14580"
                     /culture_collection="DSMZ:DSM 13"
                     /db_xref="taxon:279010"
                     /focus
     source          1317754..1345262
                     /organism="Bacillus phage BLi_Pp2"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:1230651"
                     /note="PBSX homolog phage"
     source          1422556..1464174
                     /organism="Bacillus phage BLi_Pp3"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:1230652"
                     /note="putative PBLD homologe phage"
...

This seems unusual, but understandable. Historically GenBank files for bacteria had a single source feature covering the whole of the genome. The fact these (mostly) bacterial genomes now show up in virus specific searches is an unfortunate side effect.

This is still a rare problem as shown by this search which looks for complete sequences which are under both virus (taxon id 35237) and cellular organisms (taxon id 131567), "txid35237[orgn] AND complete[prop] AND genome AND txid131567[orgn]" (currently 10 matches).

The solution to avoiding these chimeras is the search "txid35237[orgn] AND complete[prop] AND genome NOT txid131567[orgn]" (viruses but not cellular organisms). I've fixed my script to do this, and am regenerating my complete virus BLAST databases.

Update (18 April 2014)

With hindsight, I neglected to consider a more straight forward approach of using the NCBI FTP site, specifically the ftp://ftp.ncbi.nih.gov/genomes/Viruses/ folder where you can download many species individually - or all the GenBank or FASTA files as a tar-ball.

Rodney Brister from the Virus Genomes Group at the NCBI also emailed to draw my attention to ftp://ftp.ncbi.nih.gov/genomes/Viruses/Viruses_RefSeq_and_neighbors_genome_data.tab which lists all virus RefSeqs and their and manually curated "neighbours" and looks useful too. He also confirmed the "chimera" problems are just a side effect of more detailed "source" annotations in some recent submissions.

Update (2 Feb 2015)

See also Brister et al. (2015) NCBI Viral Genomes Resource.

4 comments:

Anonymous9 April 2014 at 10:22
Peter, thanks a lot for this extremely useful hint to a better use of research!
I have a few questions: in your script why do you loop through dsDnaViruses, dsRnaViruses, ssDnaViruses, ssRnaViruses, and then allViruses? Wouldn't it be enough to just cycle through allViruses only? I checked and, unless I did some mistake, allViruses is a superset of all others.

That said, I still find some glitches in the annotation. For example, the search http://www.ncbi.nlm.nih.gov/nuccore/?term=txid10239%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+NOT+txid131567%5Borgn%5D does not return at least one strain of HEV C 104: http://www.ncbi.nlm.nih.gov/nuccore/AB686524.1, although I think it should. Any idea on why?

I'm quite sure that you know this already, but just in case: you can download much faster from NCBI by fetching multiple ids at a time (I read somewhere on the NCBI website around 500 at a time). My script to download looks like this.

ids_count = len(names) # names is the list of genbank ids
step = 250 # download 250 sequences at a time
handle = open('your_nt_db.fasta', 'w+')
for i in range(0, ids_count, step):
ids = ','.join(names[i:i + step])
fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
data = fetch_handle.read()
fetch_handle.close()
handle.write(data)

if i % 1000 == 0:
print time.ctime(), ' Fetched %d sequences' % i

# download the sequences outside of the range
ids = ','.join(names[i:ids_count])
fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
data = fetch_handle.read()
fetch_handle.close()
handle.write(data)

handle.close()

Thanks again.
ReplyDelete
Replies
Anonymous9 April 2014 at 11:11
Thanks Peter.
In the meanwhile, I also realised that a better search might be "complete genome[Title] AND txid10239[orgn] NOT txid131567[orgn]". It returns ~35 thousand hits, a bit more than before. The point is that if you look for "complete[prop]" you exclude those hits that are described as complete genome in the title, but where the authors forgot to mention the property "completeness: complete".
ReplyDelete
Replies

Add comment