The bug I found appears to be an unexpected side effect of this processing: it seems to discard exons in trans-spliced genes where part of the sequence comes from another file. Mitochondria are a gold mine for these kinds of weird genes - in this case mixed strand splicing wasn't strange enough!
In this specific example from Silene vulgaris (Bladder Campion), the trans-spliced gene nad1 on mitochondrial chromosome 3 (NC_016406) includes a fragment from mitochondrial chromosome 4 (NC_016402). Here is an excerpt from NC_016406 from Entrez using GenBank as the return type (the default, not with parts), copied from the NCBI website:
     gene            join(complement(149815..150200),
                     complement(293787..295573),NC_016402.1:6618..6676,
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     NC_016402.1:6618..6676,181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"
Here is the same excerpt from NC_016406 from Entrez using GenBank (with parts), again copied from the NCBI website:
     gene            join(complement(149815..150200),
                     complement(293787..295573),181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"
Look carefully and you will see that the penultimate part of each join in the first excerpt, NC_016402.1:6618..6676, is missing from the "with parts" version.
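Spotting a dropped part by eye is error prone, so here is a minimal sketch of how the two CDS locations could be compared automatically. The helper names `split_location` and `missing_parts` are made up for this illustration (they are not Biopython functions), and the parser only handles the simple join(...) and order(...) forms seen in these excerpts.

```python
def split_location(location):
    """Split a join(...)/order(...) location string into its top-level parts."""
    text = "".join(location.split())  # drop line breaks and spaces
    for op in ("join(", "order("):
        if text.startswith(op) and text.endswith(")"):
            text = text[len(op):-1]
            break
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "," and depth == 0:
            # A comma outside any brackets separates two parts.
            parts.append(current)
            current = ""
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current += ch
    if current:
        parts.append(current)
    return parts


def missing_parts(full, with_parts):
    """Return the parts of `full` that do not appear in `with_parts`."""
    kept = set(split_location(with_parts))
    return [p for p in split_location(full) if p not in kept]


# CDS locations copied from the two NC_016406 excerpts above.
gb_cds = """join(complement(149815..150200),
complement(295492..295573),complement(293787..293978),
NC_016402.1:6618..6676,181647..181905)"""
gbwithparts_cds = """join(complement(149815..150200),
complement(295492..295573),complement(293787..293978),
181647..181905)"""

print(missing_parts(gb_cds, gbwithparts_cds))
```

Run on the CDS locations above, this reports the single part NC_016402.1:6618..6676, the exon that lives on the other chromosome.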
I reported this issue to the eutilities@ncbi.nlm.... address a week ago on Friday 9 March. I will update this post if they reply, or if I notice the bug has been fixed. It was discovered while exploring a separate Biopython bug with location parsing.
Update: I re-sent the report to info@ncbi.nlm... just after posting the blog this afternoon (USA morning). I just got an email back tonight (USA afternoon) from the NCBI acknowledging the report and saying they'll correct it as soon as possible.
Update: This is fixed now (as of 3rd April 2012 when I noticed; it probably happened a bit earlier).
Update (4th August 2016)
The record in the examples above still looks fine, but there are other records broken in this way, e.g. NC_024258.1 (GenBank) looks OK:
     misc_feature    order(11,27,YP_009040213.1:43..44,46,61,
                     YP_009040213.1:63..65,YP_009040213.1:79..80,82,
                     YP_009040213.1:85..86,88,95,YP_009040213.1:113..114,
                     YP_009040213.1:116..117,120,122)
                     /note="epsilon subunit interface [polypeptide binding];
                     other site"
                     /db_xref="CDD:213395"
Compare this to NC_024258.1 GenBank (with parts) which is missing bits:
     misc_feature    order(11,27,46,61,82,88,95,120,122)
                     /note="epsilon subunit interface [polypeptide binding];
                     other site"
                     /db_xref="CDD:213395"
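In both examples the parts that vanish are the ones prefixed with another record's identifier. The rough sketch below checks that hypothesis for this misc_feature: it strips every part that looks like a remote reference from the plain GenBank location and compares what is left with the "with parts" location. The regular expression and the name `local_parts` are assumptions for illustration only; this is not the full GenBank location grammar, and it would not cope with nested operators such as complement(...) around a remote reference.

```python
import re

# Assumed pattern for a remote reference: an accession-like identifier,
# an optional ".version", then a colon (e.g. "YP_009040213.1:43..44").
REMOTE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*(\.\d+)?:")


def local_parts(location):
    """Return the parts of a flat order(...)/join(...) that are not remote."""
    text = "".join(location.split())  # drop line breaks and spaces
    inner = text[text.index("(") + 1:text.rindex(")")]
    return [part for part in inner.split(",") if not REMOTE.match(part)]


# Locations copied from the two NC_024258.1 excerpts above.
gb_location = """order(11,27,YP_009040213.1:43..44,46,61,
YP_009040213.1:63..65,YP_009040213.1:79..80,82,
YP_009040213.1:85..86,88,95,YP_009040213.1:113..114,
YP_009040213.1:116..117,120,122)"""
gbwithparts_location = "order(11,27,46,61,82,88,95,120,122)"

# True if "with parts" is exactly the GenBank location minus remote references.
print(local_parts(gb_location) == local_parts(gbwithparts_location))
```

For this feature the comparison holds, which is consistent with the remote references being the only parts dropped.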
I reported this to the NCBI by email, and they confirmed they can reproduce the problem on this and other entries.
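For anyone wanting to re-check a record, here is a small sketch that builds the two Entrez EFetch requests so the plain GenBank and "with parts" views can be downloaded and compared. The endpoint and the parameter values (`db=nuccore`, `rettype=gb` or `gbwithparts`, `retmode=text`) reflect my understanding of the NCBI E-utilities and are not taken from the NCBI's reply; `efetch_url` is just a helper name invented here. The code only constructs the URLs and does not contact the NCBI.

```python
from urllib.parse import urlencode

# Assumed E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype):
    """Build an EFetch URL for a nucleotide record in the given return type."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,  # "gb" (default GenBank) or "gbwithparts"
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


for rettype in ("gb", "gbwithparts"):
    print(efetch_url("NC_024258.1", rettype))
```

Fetching both URLs and comparing the feature locations should show whether a given record is affected.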