Comments on Blasted Bioinformatics!?: Trouble with chimeras - getting all complete viral genomes from the NCBI

Peter Cock (2014-04-09 11:24):

This neatly shows the trouble with inconsistent annotation. This search appears to show over 5,000 complete virus genomes missing the completeness property:

    "complete genome"[Title] AND txid10239[orgn] NOT complete[prop]

Ideally all of those entries' metadata could be fixed...

Anonymous (2014-04-09 11:11):

Thanks Peter.
In the meanwhile, I also realised that a better search might be:

    complete genome[Title] AND txid10239[orgn] NOT txid131567[orgn]

It returns ~35 thousand hits, a bit more than before. The point is that if you search for complete[prop] you exclude those hits that are described as "complete genome" in the title, but where the authors forgot to set the property "completeness: complete".

Peter Cock (2014-04-09 10:59):

I have that loop over dsDNA, dsRNA, ssDNA, ssRNA, then all viruses, because I wanted to make BLAST databases for each type of virus as well as for all viruses. And yes, it is faster to download batches of records - but instead I wanted a local cache of one file per virus, which makes updates much more efficient.

Regarding AB686524, for some reason it does not have the complete property set. This is likely a bug in the annotation metadata; please tell the NCBI. For example, try this search on the Nucleotide database:

    AB686524 AND complete[prop]

Anonymous (2014-04-09 10:22):

Peter, thanks a lot for this extremely useful hint to a better use of the search!

I have a few questions: in your script, why do you loop through dsDnaViruses, dsRnaViruses, ssDnaViruses, ssRnaViruses, and then allViruses? Wouldn't it be enough to cycle through allViruses only? I checked and, unless I made a mistake, allViruses is a superset of all the others.

That said, I still find some glitches in the annotation.
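Peter's one-file-per-virus cache can be sketched as a small helper that skips anything already on disk before calling efetch. This is a minimal illustration, not his actual script: the function name `accessions_to_fetch`, the `.gbk` extension, and the flat directory layout are all assumptions.

```python
import os

def accessions_to_fetch(accessions, cache_dir):
    """Return only the accessions with no cached file yet, so an
    update run re-downloads nothing that is already on disk."""
    return [acc for acc in accessions
            if not os.path.isfile(os.path.join(cache_dir, acc + ".gbk"))]
```

On a later update only new or missing records get fetched, which is why a per-record cache makes updates cheaper than re-downloading one big batch file.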
For example, the search

    http://www.ncbi.nlm.nih.gov/nuccore/?term=txid10239%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+NOT+txid131567%5Borgn%5D

does not return at least one strain of HEV C 104 (http://www.ncbi.nlm.nih.gov/nuccore/AB686524.1), although I think it should. Any idea why?

I'm quite sure you know this already, but just in case: you can download much faster from the NCBI by fetching multiple IDs at a time (I read somewhere on the NCBI website that around 500 at a time is the limit). My download script looks like this:

    import time

    from Bio import Entrez

    Entrez.email = "your@email"  # NCBI asks for a contact address

    ids_count = len(names)  # names is the list of GenBank IDs
    step = 250  # download 250 sequences at a time
    with open('your_nt_db.fasta', 'w') as handle:
        for i in range(0, ids_count, step):
            ids = ','.join(names[i:i + step])
            fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
            handle.write(fetch_handle.read())
            fetch_handle.close()
            if i % 1000 == 0:
                print(time.ctime(), 'Fetched %d sequences' % i)

Thanks again.
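The batching logic in the script above can be isolated into a tiny helper (a sketch; `batch_ids` is a hypothetical name). Note that `range(0, len(names), step)` already covers the whole list, including a final partial batch, so no separate fetch is needed for leftover IDs.

```python
def batch_ids(names, step=250):
    """Yield comma-separated strings of at most `step` IDs each,
    ready to pass as the id= argument of an efetch call."""
    for i in range(0, len(names), step):
        yield ",".join(names[i:i + step])
```

Each yielded string can then be handed to `Entrez.efetch` in turn, keeping every request within the NCBI's per-request limit.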