Comments on Blasted Bioinformatics!?: Trouble with chimeras - getting all complete viral genomes from the NCBI

Peter Cock (2014-04-09 11:24):

This neatly shows the trouble with inconsistent annotation. This search appears to show over 5,000 complete virus genomes missing the completeness property:

    "complete genome"[Title] AND txid10239[orgn] NOT complete[prop]

Ideally all of those entries' metadata could be fixed...

Anonymous (2014-04-09 11:11):

Thanks Peter.
In the meanwhile, I also realised that a better search might be:

    complete genome[Title] AND txid10239[orgn] NOT txid131567[orgn]

It returns ~35 thousand hits, a bit more than before. The point is that if you search for complete[prop] you exclude those hits that are described as "complete genome" in the title, but where the authors forgot to set the property "completeness: complete".

Peter Cock (2014-04-09 10:59):

I have that loop over dsDNA, dsRNA, ssDNA, ssRNA, then all viruses, because I wanted to make BLAST databases for each type of virus as well as for all viruses. And yes, it is faster to download batches of records - but instead I wanted a local cache of one file per virus, which makes updates much more efficient.

Regarding AB686524, for some reason it does not have the complete property set. This is likely a bug in the annotation metadata; please tell the NCBI. For example, try this search on the Nucleotide database:

    AB686524 AND complete[prop]

Anonymous (2014-04-09 10:22):

Peter, thanks a lot for this extremely useful hint to a better use of the search!

I have a few questions: in your script, why do you loop through dsDnaViruses, dsRnaViruses, ssDnaViruses, ssRnaViruses, and then allViruses? Wouldn't it be enough to cycle through allViruses only? I checked and, unless I made a mistake, allViruses is a superset of all the others.

That said, I still find some glitches in the annotation.
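Peter's one-file-per-virus cache can be sketched as a small helper that skips anything already on disk before calling efetch. This is a minimal illustration, not his actual script: the function name `accessions_to_fetch`, the `.gbk` extension, and the flat directory layout are all assumptions.

```python
import os

def accessions_to_fetch(accessions, cache_dir):
    """Return only the accessions with no cached file yet, so an
    update run re-downloads nothing that is already on disk."""
    return [acc for acc in accessions
            if not os.path.isfile(os.path.join(cache_dir, acc + ".gbk"))]
```

On a later update only new or missing records get fetched, which is why a per-record cache makes updates cheaper than re-downloading one big batch file.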
For example, the search

    http://www.ncbi.nlm.nih.gov/nuccore/?term=txid10239%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+NOT+txid131567%5Borgn%5D

does not return at least one strain of HEV C 104 (http://www.ncbi.nlm.nih.gov/nuccore/AB686524.1), although I think it should. Any idea why?

I'm quite sure you know this already, but just in case: you can download much faster from the NCBI by fetching multiple IDs at a time (I read somewhere on the NCBI website that around 500 at a time is the limit). My download script looks like this:

    import time

    from Bio import Entrez

    Entrez.email = "your@email"  # NCBI asks for a contact address

    ids_count = len(names)  # names is the list of GenBank IDs
    step = 250  # download 250 sequences at a time
    with open('your_nt_db.fasta', 'w') as handle:
        for i in range(0, ids_count, step):
            ids = ','.join(names[i:i + step])
            fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
            handle.write(fetch_handle.read())
            fetch_handle.close()
            if i % 1000 == 0:
                print(time.ctime(), 'Fetched %d sequences' % i)

Thanks again.
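The batching logic in the script above can be isolated into a tiny helper (a sketch; `batch_ids` is a hypothetical name). Note that `range(0, len(names), step)` already covers the whole list, including a final partial batch, so no separate fetch is needed for leftover IDs.

```python
def batch_ids(names, step=250):
    """Yield comma-separated strings of at most `step` IDs each,
    ready to pass as the id= argument of an efetch call."""
    for i in range(0, len(names), step):
        yield ",".join(names[i:i + step])
```

Each yielded string can then be handed to `Entrez.efetch` in turn, keeping every request within the NCBI's per-request limit.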