2011-08-15

Why are NCBI GFF3 files still broken?

For the early part of my career in Bioinformatics I was able to avoid GFF3 files - initially I focused on finished annotated genomes from the NCBI in plain text GenBank format (which has complications of its own), but with genome sequencing becoming widespread, so too is genome assembly and annotation. And for this, you will have to learn about GFF3 files.



I'm assuming my target audience will know what GFF3 files are, but to recap: In the beginning there was the Generic Feature Format (GFF), a plain text tab-delimited file format for describing annotation on biological sequences (typically genes and such like on chromosomes or contigs). This was followed by GFF2 (Sanger), but a number of dialects and variants were invented including Gene Transfer Format (GTF), sometimes nick-named GFF 2.5, which apparently comes in several flavours of its own including GTF 2 and GTF 2.2. The GMOD guys have a nice introduction to GFF.


When this XKCD comic came out I wasn't surprised to see it frequently linked to by bioinformaticians - after all, inventing a new file format seems to be one of the most commonly committed Deadly Sins of Bioinformatics. For those not familiar with this, see Carol Goble's The Seven Deadly Sins of Bioinformatics slides from BOSC 2007, Vienna.

In an effort to bring unity to the Generic Feature Format world, we now have GFF3. The GFF3 specification is here, www.sequenceontology.org/gff3.shtml, currently at v1.20, and Lincoln Stein's introduction alludes to the apparent mess things were in.

Notice that the GFF3 specification lives on www.sequenceontology.org - one reason for this is a formal ontology (restricted vocabulary) avoids different files using different feature names when they mean the same thing, GFF3 insists you use the Sequence Ontology terms.

The NCBI run GenBank and are one of the three main international sequence databases making up the INSDC (the others being EMBL at the EBI in Europe, and DDBJ in Japan), and as such are one of the main repositories of annotated genomes. They provide these in several formats, including the plain text GenBank format (also used by the DDBJ), while these is a sister format from EMBL. At some point the NCBI started publishing GFF3 versions as well.

An example Genome

I'm going to now point to a specific example, Nanoarchaeum equitans Kin4 M, specifically RefSeq accession NC_005213.1. If you go to that website, you can see this genome in various formats - but not (at the time of writing) GFF3. For that, see the N. equitans FTP site which includes:

  • NC_005213.fna - FASTA file of whole circular chromosome
  • NC_005213.ffn - FASTA file of annotated gene sequences (including stop codons)
  • NC_005213.faa - FASTA file of annotated protein sequences (excluding stop codons)
  • NC_005213.gbk - Plain text GenBank file for circular chromosome with annotation
  • NC_005213.gff - Currently broken GFF3 file (updated Feb 2012)

Problem One - Invalid Feature Types

The Genbank and EMBL plain text files share a common well documented feature table standard. The feature types used in the DDBJ/EMBL/GenBank Feature Table pre-date the Sequence Ontology (SOFA), and unfortunately the two standards differ. So, when the NCBI first started to generate GFF3 files, can you guess what happened?

Here's the start of NC_005213.gff, which I have edited to fit the page (truncation shown with ...):

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_005213.1
NC_005213.1 RefSeq source 1 490885 . + . organism=Nanoarchaeum equitans...
NC_005213.1 RefSeq gene 35233 35301 . + . locus_tag=NEQ_t01;...
NC_005213.1 RefSeq gene 3254 3289 . + . locus_tag=NEQ_t01;...
NC_005213.1 RefSeq exon 35249 35287 . + . gbkey=tRNA;locus_tag=NEQ_t01;...
NC_005213.1 RefSeq exon 3254 3289 . + . gbkey=tRNA;locus_tag=NEQ_t01;...
...

Yep - rather than following the GFF3 standard and using the Sequence Ontology feature types, they continued to use their own separate feature types. The first line of real data uses feature type source to describe the genome, and this term is not in the Sequence Ontology - the GFF3 examples all use region here. You'll also find lots of misc_feature in column 3 further down (shown later).

Problem Two - Circular features not marked

This example genome is circular, which means the landmark feature should include the special Is_circular=true tag. The full column 9 entry of the landmark feature is:

organism=Nanoarchaeum equitans Kin4-M;mol_type=genomic DNA;strain=Kin4-M;db_xref=taxon:228908

This omits the Is_circular=true tag. It is also using source rather than a Sequence Ontology approved term like region (mentioned above).

To be fair, the second line of the NCBI GFF3 file reads #!gff-spec-version 1.14 so this predates the circular conventions.

Problem Three - Missing ID tags on multi-location features

The first gene in the GFF3 file is a tRNA, which is clear from the GenBank version:

     gene            join(35233..35301,3254..3289)
                     /locus_tag="NEQ_t01"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /trans_splicing
                     /db_xref="GeneID:3362429"
     tRNA            join(35249..35287,3254..3289)
                     /locus_tag="NEQ_t01"
                     /product="tRNA-Met"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /trans_splicing
                     /db_xref="GeneID:3362429"

Like many tRNA genes, it is trans-spliced meaning it comes from non-continuous bits of the genome, here 35249 to 35287, and 3254 to 3289. In GFF3 you should record this using two separate entries which are linked by virtue of having the same ID tag in column 9.

The column 9 entries for the two exons in the NCBI GFF3 file are:

gbkey=tRNA;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=;db_xref=GeneID:3362429;exon_number=1

gbkey=tRNA;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=;db_xref=GeneID:3362429;exon_number=2

These lack the special ID tag.

Problem Four - Wrong tag for database cross references

Notice in the above where I quoted the full column 9 entries that the NCBI's GFF file used db_xref=GeneID:3362429 which naturally enough is consistent with the qualifier key used in GenBank/EMBL feature tables.

Unfortunately the GFF3 specification clearly lists Dbxref as the appropriate predefined convention for database cross references.

Problem Five - Missing stop codon in CDS features

I'm going to illustrate this with an easy case, the coding sequence (CDS) for gene NEQ003 which is on the forward strand from positions 883 to 2691 inclusive (one-based counting). Here are the gene and CDS entries in the plain text GenBank file's feature table (there are also a dozen misc_feature entries within this gene, not shown):

     gene            883..2691
                     /locus_tag="NEQ003"
                     /db_xref="GeneID:2654355"
     CDS             883..2691
                     /locus_tag="NEQ003"
                     /note="large helicase-related protein [Pyrococcus abyssi].
                     SPLIT See NEQ409; COG1201: Lhr-like helicases; IPR001410:
                     DEAD/DEAH box helicase; IPR001472: Bipartite nuclear
                     localization signal; IPR001650: Helicase C-terminal
                     domain; IPR003593: AAA ATPase superfamily"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_963296.1"
                     /db_xref="GI:41614798"
                     /db_xref="GeneID:2654355"
                     /translation="MKKPQPYKDEEIYSILEEPVKQWFKEKYKTFTPPQRYAIMEIHK
                     RNNVLISSPTGSGKTLAAFLAIINELIKLSHKGKLENRVYAIYVSPLRSLNNDVKKNL
                     ETPLKEIKEKAKELNYYIGDIRIAVRTSDTKESEKAKMLKQPPHILITTPESLAIILS
                     TKKFREHIKKVEFVVVDEIHALAESKRGTHLALSLERLNYLTNFVRIGLSATIHPLEE
                     VAKFLFGYENGKPREGYIIDVSFEKPIEIQVYSPVDDIIYSSQEELMRNLYKFIGEQL
                     KKYRTILIFTNTRHGAESVAYHLKKAFPDLEKYIAVHHSSLSREERLEIEEKLKRGEL
                     RIVCTSTSLELGIDIGTIDLVIQIGSPKSVNRALQRIGRAGHRLDEKSKGILIGLDRD
                     ELLENTIIKYSADNRRLDRIHIPTNALDVLAQHIVGMGLEKEWDLKEAYNLIRNAYPY
                     RNLSWDDYINTIKYLAGEYTKLEHKKVYAKIILRDNKFKTKGSVRAIYFMNIGTIPEE
                     TKIAVYTLDNKRIGYVEEEFVEKLVKGDIFVLAGKTYEFLYSKGNKIYVRPVKGVNPT
                     IPHWFSEMLPLSYDLAIEIQKFREEIAKKLEREPLD"

You can check the matching amino acid sequence entry from the protein FASTA file NC_005213.faa. Similarly the CDS for this gene is in the FASTA file NC_005213.ffn, where is it called ref|NC_005213.1|:883-2691 - notice the FASTA entry's name encodes the location 883-2691 (matching that in the GenBank file), and the sequence runs ATG...TAA (including both the start and stop codons).

How does this look in the NCBI's GFF3 file?

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_005213.1
NC_005213.1 RefSeq source 1 490885 . + . organism=Nanoarchaeum equitans...
...
NC_005213.1 RefSeq gene 883 2691 . + . locus_tag=NEQ003;...
NC_005213.1 RefSeq CDS 883 2688 . + 0 locus_tag=NEQ003;note=...
NC_005213.1 RefSeq start_codon 883 885 . + 0 locus_tag=NEQ003;note=...
NC_005213.1 RefSeq stop_codon 2689 2691 . + 0 locus_tag=NEQ003;note=...
NC_005213.1 RefSeq misc_feature 949 1632 . + . locus_tag=NEQ003;note=...
...

There are actually a dozen of these misc_feature entries for NEQ003, which are not using a valid Sequence Ontology term as noted above. But enough about that flaw.

On to the gene itself. For the gene feature, the GFF3 file says the co-ordinates are 883 to 2691 on the forward strand, which matches the GenBank file. However, the CDS feature is three bases short at 883 to 2688 on the forward strand. Why a three base difference? That is explained by the stop codon, correctly described as being at 2689 to 2691 on the forward strand. The start codon co-ordinates look fine too.

The problem with this is the GFF3 standard is very explicit that the co-ordinates for a CDS feature should include both the start and stop codons.

Problem Six - Features wrapping the origin of a circular genome

I'm now going to go back to the first protein in this genome, NEQ001, which is interesting because this is a circular genome (of length 490885 bp) and this gene spans the origin. In the GenBank file it is described as:

     gene            complement(join(490883..490885,1..879))
                     /locus_tag="NEQ001"
                     /db_xref="GeneID:2732620"
     CDS             complement(join(490883..490885,1..879))
                     /locus_tag="NEQ001"
                     /note="conserved hypothetical [Methanococcus jannaschii];
                     COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
                     localization signal; IPR002743: Protein of unknown
                     function DUF57"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_963295.1"
                     /db_xref="GI:41614797"
                     /db_xref="GeneID:2732620"
                     /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
                     EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
                     KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
                     IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
                     EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
                     LNSMGFGFVNTKKNSAR"

This is on the reverse complement, so the gene starts at +879, runs back to +1, continues round the origin to the +490885, and continues back to +490883. Remember this genome is a 490885 bp circle.

How do you describe that in GFF3? Over a year ago this was clarified (GFF3 v1.18, 24 June 2010) that the start is kept less than the end by making end = the position of the end + the length of the landmark feature. Because the start/end columns in GFF3 are given with respect to the forward strand, I think that means in GFF3 this gene starts at 490883 and ends at 879+490885=491764. And also the landmark feature describing the genome should include Is_circular=true in column 9.

How does this look in the NCBI's GFF3 file?

##gff-version 3
...
NC_005213.1 RefSeq gene 1 879 . - . locus_tag=NEQ001;...
NC_005213.1 RefSeq gene 490883 490885 . - . locus_tag=NEQ001;...
NC_005213.1 RefSeq CDS 1 879 . - 0 locus_tag=NEQ001;note=...
NC_005213.1 RefSeq start_codon 877 879 . - 0 locus_tag=NEQ001;note=...
NC_005213.1 RefSeq stop_codon 490883 490885 . - 0 locus_tag=NEQ001;note=...
NC_005213.1 RefSeq misc_feature 7 879 . - . locus_tag=NEQ001;note=...
...

Now for the gene feature - the NCBI have used two entries, one from 1 to 879 and one from 490883 to 490885. That's not right for the current specification, but would have been a reasonable representation if there was an common ID tag in column 9 to indicate these are two parts of the same feature.

On to the CDS feature - they're missing half of it! Only the region from bases 1 to 879 is mentioned. Weird, and clearly wrong.

The start_codon and stop_codon look OK.

Problem Seven - No parent/child relationships

I've already commented on the lack of an ID tag in column nine when it should have been present to unite split-location features. This omission is perhaps symptomatic of the following larger issue.

The GFF3 specification describes how to use ID and Parent tags in column 9 to express explicit "part of" relationships, and gives a detailed canonical gene example where the exon features and CDS features are part of an mRNA which is part of a gene.

The GenBank/EMBL feature table has an entirely implicit parent child relationship - for instance most prokaryotic gene features are followed by a CDS feature, while eukaryotic gene features are followed by mRNA and CDS features. However, for GFF3 we would expect these relationships to be made explicit.

I've also checked a eukaryotic GFF3 file from the NCBI to see if that makes a difference, Arabidopsis thaliana chromosome 5. Both NC_003076.gff and NC_003076.gbk do include mRNA features (unlike the NCBI files for prokaryotes), but the GFF3 file still lacks the parent

Problem Eight - Invalid tags

At this point I wanted a second opinion, so turned to the online GFF validator mentioned in the GFF3 specification, which also declared the NCBI's GFF file invalid.

On page one there was an unknown directive warning for the line ##Type DNA NC_005213.1, errors for invalid type (source and misc_feature both mentioned above), and something I'd missed: empty tag value/malformed tag-value pair (trans_splicing=). Reading the spec it isn't clear to me how such a tag should be expressed. I imagine for any genome where the GenBank/EMBL feature has a /pseduo the NCBI GFF3 files would have a similar empty GFF3 tag.

Further down the file, it complained of another error, unrecognized first-uppercase (reserved) tag (tag: EC_number) - in GFF3 user defined tags should start with lower case, upper case is reserved for tags defined in the standard itself.

Conclusions:

So, to recap, the NCBI GFF3 files are broken in at least the following ways - and this is without digging into the complexities of gene models (splicing, introns and exons, etc):
  • Use of invalid feature types in column 3, e.g. misc_feature and source
  • Doesn't declaring circular features, e.g. genomes (new in GFF3 v1.18, 24 June 2010)
  • Using db_xref instead of Dbxref in column 9.
  • Missing ID tags in column 9 to link multi-location features explicitly
  • Omit the stop codon in their CDS features (clarified in GFF3 v1.13, 23 May 2007)
  • Doesn't handle features wrapping the origin (new in GFF3 v1.18, 24 June 2010)
  • Makes no attempt to use parent/child relationships to link gene, mRNA, CDS etc.
  • Invalid tags in column 9 (e.g. EC_number rather than perhaps ec_number).

NCBI GFF3 files still claim to target v1.14 (28 August 2008) while the current standard is v1.20 (15 December 2010). The circular genome stuff and features wrapping the origin were only clarified in the specification a little over a year ago in v1.18 (24 June 2010), so I am not too surprised by problems in this area. Big organisations often move slowly after all.

What I'm more surprised about is why nothing seems to have been done following contact from Scott Cain from GMOD way back in January 2006 (and probably others then and since) to flag some more long standing errors, see the gmod-gbrowse mailing list archive. That was over five years ago!

As things stand, my plan to cross validate Brad Chapman's proposed GFF parser for Biopython against the existing Biopython GenBank parser using examples from the NCBI isn't viable.

To anyone at the NCBI in a position to address this, please do so. Standards only work if people follow them.

Also if I have made any errors in the above, please let me know so I can correct this. Thanks!

Update October 2011:

Good news - as of 14 October 2011, Terence Murphy (NIH/NLM/NCBI) has been in touch with me, Chris Fields, etc and the NCBI are working on a complete rewrite of their GFF3 writer. We'll be examining some of their output and fingers crossed this will be live in a month or two.

Bad news - comments on the blog still don't seem to be working properly....

Update March 2012:

I just checked the NCBI FTP site again, and last month the GFF3 file was updated and it now starts:

##gff-version 3
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_005213.1 1 490885
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=228908
##sequence-region NC_005213.1 1 490885
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=228908
NC_005213.1 RefSeq region 1 490885 . + . ID=id0;...
NC_005213.1 RefSeq gene 35233 35301 . + . ID=gene0;...
NC_005213.1 RefSeq gene 3254 3289 . + . ID=gene0;...
NC_005213.1 RefSeq tRNA 35249 35287 . + . ID=rna0;...
NC_005213.1 RefSeq tRNA 494139 494174 . + . ID=rna0;...
NC_005213.1 RefSeq exon 35249 35287 . + . ID=id1;Parent=rna0;...
NC_005213.1 RefSeq exon 494139 494174 . + . ID=id2;Parent=rna0;...
NC_005213.1 RefSeq gene 490883 491764 . - . ID=gene1;...
NC_005213.1 RefSeq CDS 490883 491764 . - 0 ID=cds0;...;Parent=gene1;...
...

I've not done a full re-inspection yet, but it looks a lot better - origin spanning features look sane, multi-line features with common ID, even parent links. Nice.

4 comments:

  1. Just posted some links from my blog to see if this can get out a bit more. Let's see...

    ReplyDelete
  2. Something still isn't quite right with the comment setup here.. but anyway here's a direct link to Chris' blog post on Broken NCBI GFF3.

    ReplyDelete
  3. Peter do you happen to have any specific edits which I could hammer out on that NCBI >>> GFF3 converter?

    ReplyDelete