2012-03-16

Missing external exons in GenBank "with parts"

I recently stumbled on a problem in NCBI Entrez with the GenBank (with parts) return type. Some GenBank files don't actually contain a sequence at the end - instead they have a CONTIG section telling you how to construct the sequence from other referenced pieces. That's often inconvenient, so the NCBI offer the handy option of downloading the record with all these parts pre-computed, which is normally great.
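For anyone wanting to compare the two return types outside the website, here is a minimal sketch that builds the corresponding E-utilities EFetch URLs. It assumes the standard `efetch.fcgi` endpoint with `rettype=gb` for plain GenBank and `rettype=gbwithparts` for the with-parts version; the helper name `efetch_url` is made up for illustration, and the accession is the record examined below.

```python
from urllib.parse import urlencode

# Public E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, with_parts=False):
    """Build an EFetch URL for a nucleotide record in GenBank text format.

    Hypothetical helper: it only builds the URL and does not contact the NCBI.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        # "gb" is plain GenBank, "gbwithparts" has the CONTIG pieces expanded.
        "rettype": "gbwithparts" if with_parts else "gb",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


print(efetch_url("NC_016406"))
print(efetch_url("NC_016406", with_parts=True))
```

Downloading both URLs and comparing the feature tables should show whether the two return types agree for a given record.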

The bug I found appears to be an unexpected side effect of this processing: it seems to discard exons in trans-spliced genes where part of the sequence is from another file. Mitochondria are a gold mine for these kinds of weird genes - in this case mixed strand splicing wasn't strange enough!

In this specific example from Silene vulgaris (Bladder Campion), the trans-spliced gene nad1 on mitochondrial chromosome 3 (NC_016406) includes a fragment from mitochondrial chromosome 4 (NC_016402). Here is an excerpt from NC_016406 from Entrez using GenBank as the return type (the default, not with parts), copied from the NCBI website:

    gene            join(complement(149815..150200),
                    complement(293787..295573),NC_016402.1:6618..6676,
                    181647..181905)
                    /gene="nad1"
                    /trans_splicing
                    /note="exons 1, 2, 3, and 5 on chromosome 1 are
                    trans-spliced with exon 4 on chromosome 3 to form the
                    complete coding region"
                    /db_xref="GeneID:11447159"
    CDS             join(complement(149815..150200),
                    complement(295492..295573),complement(293787..293978),
                    NC_016402.1:6618..6676,181647..181905)
                    /gene="nad1"
                    /trans_splicing
                    /note="exons 1, 2, 3, and 5 on chromosome 1 are
                    trans-spliced with exon 4 on chromosome 3 to form the
                    complete coding region"
                    /codon_start=1
                    /transl_except=(pos:complement(150198..150200),aa:Met)
                    /product="NADH dehydrogenase subunit 1"
                    /protein_id="YP_004935334.1"
                    /db_xref="GI:357967323"
                    /db_xref="GeneID:11447159"
                    /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                    SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                    PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                    CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                    EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                    LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"

Here is the same excerpt from NC_016406 from Entrez using GenBank (with parts), again copied from the NCBI website:

    gene            join(complement(149815..150200),
                    complement(293787..295573),181647..181905)
                    /gene="nad1"
                    /trans_splicing
                    /note="exons 1, 2, 3, and 5 on chromosome 1 are
                    trans-spliced with exon 4 on chromosome 3 to form the
                    complete coding region"
                    /db_xref="GeneID:11447159"
    CDS             join(complement(149815..150200),
                    complement(295492..295573),complement(293787..293978),
                    181647..181905)
                    /gene="nad1"
                    /trans_splicing
                    /note="exons 1, 2, 3, and 5 on chromosome 1 are
                    trans-spliced with exon 4 on chromosome 3 to form the
                    complete coding region"
                    /codon_start=1
                    /transl_except=(pos:complement(150198..150200),aa:Met)
                    /product="NADH dehydrogenase subunit 1"
                    /protein_id="YP_004935334.1"
                    /db_xref="GI:357967323"
                    /db_xref="GeneID:11447159"
                    /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                    SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                    PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                    CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                    EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                    LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"

Look carefully and you'll see that the penultimate part of each join in the first excerpt, NC_016402.1:6618..6676, is missing from the second.
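Spotting this by eye is error prone, so here is a rough sketch of how the two location strings could be compared automatically. It uses plain string handling rather than any GenBank parser, and the function names `split_parts` and `dropped_parts` are invented for this illustration. It assumes the location is a single `join(...)` or `order(...)` at the top level.

```python
import re


def split_parts(location):
    """Split a join(...)/order(...) location into its top-level parts."""
    location = "".join(location.split())  # drop line breaks and spaces
    match = re.match(r"^(?:join|order)\((.*)\)$", location)
    inner = match.group(1) if match else location
    parts, depth, current = [], 0, ""
    for char in inner:
        if char == "," and depth == 0:
            # A comma outside any brackets separates two parts.
            parts.append(current)
            current = ""
            continue
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        current += char
    if current:
        parts.append(current)
    return parts


def dropped_parts(plain, with_parts):
    """Return parts of the plain location that are absent from the other."""
    kept = split_parts(with_parts)
    return [part for part in split_parts(plain) if part not in kept]


# The nad1 gene locations from the two excerpts above.
plain = """join(complement(149815..150200),
           complement(293787..295573),NC_016402.1:6618..6676,
           181647..181905)"""
with_parts = """join(complement(149815..150200),
                complement(293787..295573),181647..181905)"""
print(dropped_parts(plain, with_parts))  # ['NC_016402.1:6618..6676']
```

The membership test is deliberately simple, so a part that appears twice in one location would need more careful handling.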

I reported this issue to the eutilities@ncbi.nlm.... address a week ago on Friday 9 March. I will update this post if they reply, or if I notice the bug has been fixed. It was discovered while exploring a separate Biopython bug with location parsing.

Update: I resent the report to info@ncbi.nlm... just after posting the blog this afternoon (USA morning). I just got an email back tonight (USA afternoon) from the NCBI acknowledging the report and saying they'll correct it as soon as possible.

Update: This is fixed now (as of 3rd April 2012 when I noticed, although it probably happened a bit earlier).

Update (4th August 2016)

The record in the examples above still looks fine, but there are other records broken in this way, e.g. NC_024258.1 (GenBank) looks OK:

     misc_feature    order(11,27,YP_009040213.1:43..44,46,61,
                     YP_009040213.1:63..65,YP_009040213.1:79..80,82,
                     YP_009040213.1:85..86,88,95,YP_009040213.1:113..114,
                     YP_009040213.1:116..117,120,122)
                     /note="epsilon subunit interface [polypeptide binding];
                     other site"
                     /db_xref="CDD:213395"

Compare this to NC_024258.1 GenBank (with parts) which is missing bits:


     misc_feature    order(11,27,46,61,82,88,95,120,122)
                     /note="epsilon subunit interface [polypeptide binding];
                     other site"
                     /db_xref="CDD:213395"

I reported this to the NCBI by email; they confirmed they can reproduce the problem, both on this record and on other entries.
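Since the parts that go missing are the ones pointing into another record, one way to find features worth checking is to scan the plain GenBank feature table for locations containing an `accession.version:` prefix. The sketch below does this with a regular expression; the function name `features_with_remote_parts` is hypothetical, and it assumes the usual flat-file layout where feature keys are indented by a few spaces and qualifier lines much further.

```python
import re

# A reference into another record: accession.version followed by a colon,
# e.g. "YP_009040213.1:43..44". Local positions such as "46" do not match.
REMOTE_REF = re.compile(r"[A-Za-z][A-Za-z0-9_]*\.\d+:")

# Start of a feature: shallow indent, feature key, then the location text.
FEATURE_START = re.compile(r"^ {1,8}(\S+)\s+(\S+)\s*$")


def features_with_remote_parts(feature_table):
    """Yield (key, location) for features whose location cites another record."""
    key, location, in_location = None, "", False
    for line in feature_table.splitlines():
        start = FEATURE_START.match(line)
        text = line.strip()
        if start:
            if key and REMOTE_REF.search(location):
                yield key, location
            key, location = start.group(1), start.group(2)
            in_location = True
        elif in_location and text and not text.startswith("/"):
            location += text  # continuation line of a wrapped location
        else:
            in_location = False  # qualifiers have started
    if key and REMOTE_REF.search(location):
        yield key, location


# The misc_feature from the plain GenBank excerpt above.
table = (
    "     misc_feature    order(11,27,YP_009040213.1:43..44,46,61,\n"
    "                     YP_009040213.1:63..65,YP_009040213.1:79..80,82,\n"
    "                     YP_009040213.1:85..86,88,95,YP_009040213.1:113..114,\n"
    "                     YP_009040213.1:116..117,120,122)\n"
    '                     /note="epsilon subunit interface [polypeptide binding];\n'
    '                     other site"\n'
    '                     /db_xref="CDD:213395"\n'
)
for key, location in features_with_remote_parts(table):
    print(key, location)
```

Any feature reported here in the plain download is one to compare against the with-parts version of the same record.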
