Schema for Gencode Genes - ENCODE Gencode Gene Annotations

Database: hg18 Primary Table: wgEncodeGencodeManualV3 Row Count: 87,627 Data last updated: 2009-11-02
Format description: A gene prediction with some additional info.
On download server: MariaDB table dump directory

field	example	SQL type	info	description
`bin`	585	`smallint(5) unsigned`	range	Indexing field to speed chromosome range queries.
`name`	ENST00000450305	`varchar(255)`	values	Name of gene (usually transcript_id from GTF)
`chrom`	chr1	`varchar(255)`	values	Reference sequence chromosome or scaffold
`strand`	+	`char(1)`	values	+ or - for strand
`txStart`	1872	`int(10) unsigned`	range	Transcription start position (or end position for minus strand item)
`txEnd`	3533	`int(10) unsigned`	range	Transcription end position (or start position for minus strand item)
`cdsStart`	3533	`int(10) unsigned`	range	Coding region start (or end position for minus strand item)
`cdsEnd`	3533	`int(10) unsigned`	range	Coding region end (or start position for minus strand item)
`exonCount`	6	`int(10) unsigned`	range	Number of exons
`exonStarts`	1872,2041,2475,2837,3083,3315,	`longblob`		Exon start positions (or end positions for minus strand item)
`exonEnds`	1920,2090,2560,2915,3237,3533,	`longblob`		Exon end positions (or start positions for minus strand item)
`score`	0	`int(11)`	range	score
`name2`	RP11-34P13.1	`varchar(255)`	values	Alternate name (e.g. gene_id from GTF)
`cdsStartStat`	none	`enum('none', 'unk', 'incmpl', 'cmpl')`	values	Status of CDS start annotation (none, unknown, incomplete, or complete)
`cdsEndStat`	none	`enum('none', 'unk', 'incmpl', 'cmpl')`	values	Status of CDS end annotation (none, unknown, incomplete, or complete)
`exonFrames`	-1,-1,-1,-1,-1,-1,	`longblob`		Reading frame of the start of the CDS region of the exon, in the direction of transcription (0,1,2), or -1 if there is no CDS region.
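
The `bin` field can be recomputed from `txStart` and `txEnd`. The following is a minimal sketch, assuming this table uses the standard UCSC binning scheme (five levels, with 128 kb bins at the finest level); the function name `ucsc_bin` is illustrative and not part of any UCSC API.

```python
# Offset of the first bin at each level, from finest (128 kb) to coarsest (512 Mb).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # 2**17 bases = 128 kb, the size of the finest bins
BIN_NEXT_SHIFT = 3    # each coarser level has bins 8 times larger


def ucsc_bin(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the 0-based, half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        # The range spans more than one bin at this level; try the next coarser level.
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} does not fit the assumed binning scheme")


print(ucsc_bin(1872, 3533))  # 585, the example `bin` value in the table above
```

A range query can then restrict its search to the bins that may overlap the region of interest instead of scanning every row on the chromosome.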

Connected Tables and Joining Fields


	hg18.wgEncodeGencodeAutoV3.name (via wgEncodeGencodeManualV3.name)
	hg18.wgEncodeGencodeClassesV3.name (via wgEncodeGencodeManualV3.name)
	hg18.wgEncodeGencodePolyaV3.name (via wgEncodeGencodeManualV3.name)

Sample Rows

bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	name2	cdsStartStat	cdsEndStat	exonFrames
585	ENST00000450305	chr1	+	1872	3533	3533	3533	6	1872,2041,2475,2837,3083,3315,	1920,2090,2560,2915,3237,3533,	RP11-34P13.1	none	none	-1,-1,-1,-1,-1,-1,
585	ENST00000488147	chr1	-	4266	19433	19433	19433	11	4266,4867,5658,6469,6720,7095,7468,7777,8130,14600,19396,	4364,4901,5810,6628,6918,7231,7605,7924,8229,14754,19433,	WASH5P	none	none	-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
585	ENST00000473358	chr1	+	19416	20960	20960	20960	3	19416,20426,20838,	19902,20530,20960,	AL627309.6	none	none	-1,-1,-1,
585	ENST00000469289	chr1	+	20129	20972	20972	20972	2	20129,20838,	20530,20972,	AL627309.6	none	none	-1,-1,
585	ENST00000417324	chr1	-	24416	25944	25000	25599	3	24416,25139,25583,	25037,25344,25944,	FAM138A	cmpl	cmpl	2,1,0,
585	ENST00000461467	chr1	-	25107	25936	25936	25936	2	25107,25583,	25344,25936,	FAM138A	none	none	-1,-1,
585	ENST00000492842	chr1	+	52810	53750	53750	53750	1	52810,	53750,	OR4G11P	none	none	-1,
585	ENST00000326183	chr1	+	58917	59971	58953	59871	1	58917,	59971,	OR4F5	cmpl	cmpl	0,
585	ENST00000466430	chr1	-	79157	110795	110795	110795	4	79157,81953,102562,110637,	81492,82103,102667,110795,	AL627309.9	none	none	-1,-1,-1,-1,
585	ENST00000495576	chr1	-	79413	80968	80968	80968	2	79413,80149,	79913,80968,	AL627309.4	none	none	-1,-1,
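
The `exonStarts` and `exonEnds` fields are comma-terminated lists, and a transcript is non-coding when `cdsStart` equals `cdsEnd`. As a rough sketch (the helper names `parse_int_list` and `cds_per_exon` are illustrative, and the column order is taken from the Sample Rows header, which has no `score` column), one row can be parsed and its coding portion extracted like this:

```python
from typing import Dict, List, Tuple

# Column order as shown in the Sample Rows header.
COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "name2", "cdsStartStat", "cdsEndStat",
    "exonFrames",
]


def parse_int_list(field: str) -> List[int]:
    """Split a comma-terminated list such as '24416,25139,25583,' into integers."""
    return [int(item) for item in field.split(",") if item]


def cds_per_exon(row: Dict[str, str]) -> List[Tuple[int, int]]:
    """Return the (start, end) coding portion of every exon that overlaps the CDS."""
    cds_start, cds_end = int(row["cdsStart"]), int(row["cdsEnd"])
    starts = parse_int_list(row["exonStarts"])
    ends = parse_int_list(row["exonEnds"])
    parts = []
    for exon_start, exon_end in zip(starts, ends):
        lo, hi = max(exon_start, cds_start), min(exon_end, cds_end)
        if lo < hi:  # empty when the exon lies outside the CDS or there is no CDS
            parts.append((lo, hi))
    return parts


# Sample row for ENST00000417324 (FAM138A), a coding transcript on the minus strand.
line = (
    "585\tENST00000417324\tchr1\t-\t24416\t25944\t25000\t25599\t3\t"
    "24416,25139,25583,\t25037,25344,25944,\tFAM138A\tcmpl\tcmpl\t2,1,0,"
)
row = dict(zip(COLUMNS, line.split("\t")))
parts = cds_per_exon(row)
print(parts)                              # [(25000, 25037), (25139, 25344), (25583, 25599)]
print(sum(hi - lo for lo, hi in parts))   # 258 coding bases
```

For this minus-strand row the coding bases are read from the last exon backwards (16, then 205, then 37 bases), which agrees with the `exonFrames` values 2,1,0 listed in genomic order.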

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
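
As an illustration of that note, the sketch below converts a database interval to a 1-based position string. It assumes the usual half-open convention in which only the start needs adjusting and the end is used as stored; the function name `to_one_based` is hypothetical.

```python
def to_one_based(chrom: str, start: int, end: int) -> str:
    """Format a 0-based, half-open database interval as a 1-based, inclusive position."""
    return f"{chrom}:{start + 1}-{end}"


# txStart and txEnd of ENST00000326183 (OR4F5) from the sample rows.
print(to_one_based("chr1", 58917, 59971))  # chr1:58918-59971
print(59971 - 58917)                       # 1054, the feature length in bases
```

Under the same convention, the length of a feature is simply the stored end minus the stored start, with no off-by-one correction.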

Gencode Genes (wgEncodeSangerGencode) Track Description

Release Notes

This release of the Gencode Genes track (Version 3c, October 2009) shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project.

Version 3 of the Gencode gene set presents a full merge between HAVANA and ENSEMBL, giving priority to the manually curated Havana objects and using ENSEMBL objects where they are different or fall into un-annotated regions. The annotation was carried out on genome assembly GRCh37 (hg19); features are projected back to NCBI36 (hg18) where possible. Gencode 3c is a small update of version 3b (July 2009 freeze), mainly for chromosomes 3 and 4, for which the latest annotation was held back and QC'ed again to be used in the RNASeq Genome Annotation Assessment Project. Statistics about this release can be found here.

Display Conventions and Configuration

The annotations are divided into separate tracks based on source/confidence. The Gencode project recommends that the annotations from levels 1 and 2 be used as the reference gene annotation. Level 3 was added to fill gaps for methods that analyze the entire genome and require a full set.

Level 1: validated

At this time, this level contains only pseudogene loci that were predicted by the analysis pipelines from Yale and UCSC as well as by HAVANA manual annotation from WTSI.

Level 2: manual annotation

HAVANA manual annotation from WTSI.
The following regions are considered "fully annotated" and contain level 2 annotation from HAVANA only, although they will still be updated: chromosomes 1, 2, 6, 9, 10, 13, 20, 21, 22, X, Y, ENCODE pilot regions, chr11:2353995-3878750.

Level 3: automated annotation

ENSEMBL loci in regions where no HAVANA annotation can be found.

NOTE: The release cycles for Gencode, Havana and Ensembl differ. Users are cautioned to compare release dates to determine which annotation is most current.

The gene annotations are colored based on the HAVANA annotation type and the confidence level. See the table below for the color key, as well as more detail about the transcript and feature types.

Class	Color	Description	Transcript Types (see Vega Transcript Types)
Validated_coding	Dark Orange	Level 1 Validated: coding regions	protein_coding
Validated_processed	Light Orange	Level 1 Validated: processed	processed_transcript
Validated_processed_pseudogene	Dark Pink	Level 1 Validated: processed pseudogenes	processed_pseudogene, processed_transcript, transcribed_processed_pseudogene
Validated_unprocessed_pseudogene	Medium Pink	Level 1 Validated: unprocessed pseudogenes	transcribed_unprocessed_pseudogene, unprocessed_pseudogene
Validated_pseudogene	Light Pink	Level 1 Validated: pseudogenes	IG_pseudogene, polymorphic_pseudogene, pseudogene, retrotransposed, unitary_pseudogene
Havana_coding	Dark Orange	Level 2 Manual annotation: coding	IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, protein_coding
Havana_nonsense	Medium Orange	Level 2 Manual annotation: nonsense	nonsense_mediated_decay
Havana_non_coding	Light Orange	Level 2 Manual annotation: non-coding	ambiguous_orf, antisense, non_coding, processed_transcript, retained_intron
Havana_polyA	Black	Level 2 Manual annotation: polyA	polyA_signal, polyA_site, pseudo_polyA
Havana_processed_pseudogene	Dark Pink	Level 2 Manual annotation: processed pseudogene	processed_pseudogene, transcribed_processed_pseudogene
Havana_unprocessed_pseudogene	Medium Pink	Level 2 Manual annotation: unprocessed pseudogene	transcribed_unprocessed_pseudogene, unprocessed_pseudogene
Havana_pseudogene	Light Pink	Level 2 Manual annotation: pseudogene	IG_pseudogene, TR_pseudogene, polymorphic_pseudogene, pseudogene, retrotransposed, unitary_pseudogene
Havana_TEC	Grey	Level 2 Manual annotation: TEC	TEC, artifact
Ensembl_coding	Dark Red	Level 3 Automated annotation: coding	IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, protein_coding
Ensembl_non_coding	Light Orange	Level 3 Automated annotation: non-coding	antisense, non_coding, processed_transcript, retained_intron
Ensembl_pseudogene	Dark Pink	Level 3 Automated annotation: pseudogene	IG_pseudogene, miRNA_pseudogene, misc_RNA_pseudogene, pseudogene, retrotransposed, unitary_pseudogene
Ensembl_processed_pseudogene	Medium Pink	Level 3 Automated annotation: processed pseudogene	processed_pseudogene
Ensembl_unprocessed_pseudogene	Light Pink	Level 3 Automated annotation: unprocessed pseudogene	unprocessed_pseudogene
Ensembl_RNA	Light Red	Level 3 Automated annotation: RNA transcripts	Mt_rRNA, Mt_tRNA, Mt_tRNA_pseudogene, miRNA, misc_RNA, rRNA, rRNA_pseudogene, scRNA_pseudogene, snRNA, snRNA_pseudogene, snoRNA, snoRNA_pseudogene, tRNA_pseudogene, tRNAscan
2way_consensus_pseudogene	Dark Purple	Level 3 Automated annotation: pseudogenes	pseudogenes
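
The class names in the table encode the confidence level in their prefix. As a minimal sketch (the helper names `class_level` and `is_reference` are illustrative and not part of the track itself), a class can be mapped to its level and tested against the recommended reference set of levels 1 and 2:

```python
# Level for each class-name prefix, as listed in the table above.
LEVEL_BY_PREFIX = {"Validated": 1, "Havana": 2, "Ensembl": 3}


def class_level(class_name: str) -> int:
    """Return the annotation level (1, 2 or 3) for a class name from the table."""
    if class_name == "2way_consensus_pseudogene":
        return 3  # listed as Level 3 although it has no level prefix
    prefix = class_name.split("_", 1)[0]
    if prefix not in LEVEL_BY_PREFIX:
        raise ValueError(f"unknown class: {class_name}")
    return LEVEL_BY_PREFIX[prefix]


def is_reference(class_name: str) -> bool:
    """True for classes in levels 1 and 2, the recommended reference annotation."""
    return class_level(class_name) <= 2


for name in ("Validated_coding", "Havana_polyA", "Ensembl_RNA"):
    print(name, class_level(name), is_reference(name))
```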

This track uses filtering by category to select subsets of transcripts and has additional advanced features. Help with these features can be found here.

Methods

We aim to annotate all evidence-based gene features at high accuracy on the human reference sequence. This includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. We integrate computational approaches (including comparative methods), manual annotation and targeted experimental verification.

For a detailed description of the methods and references used, see Harrow et al. (2006).

Verification

See Harrow et al. (2006) for information on verification techniques.

Credits

This GENCODE release is the result of a collaborative effort among the following laboratories: (contact: Felix Kokocinski)

Lab/Institution	Contributors
HAVANA annotation group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK	Adam Frankish, James Gilbert, Jennifer Harrow, Felix Kokocinski, Stephen Trevanion, Tim Hubbard (GENCODE Principal Investigator)
Genome Bioinformatics Lab (CRG), Barcelona, Spain	Thomas Derrien, Tyler Alioto, Roderic Guigó
Genome Bioinformatics, University of California Santa Cruz (UCSC), USA	Rachel Harte, Mark Diekhans, Robert Baertsch, David Haussler
Comp. Genomics Lab, Washington University St. Louis (WUSTL), USA	Jeltje van Baren, Charlie Comstock, David Lu, Michael Brent
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA	Mike Lin, Manolis Kellis
Bioinformatics, Yale University (Yale), USA	Philip Cayting, Mark Gerstein
Center for Integrative Genomics, University of Lausanne, Switzerland	Cedric Howald, Alexandre Reymond
ENSEMBL genebuild group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK	Bronwen Aken, Julio Fernandez Banet, Stephen Searle
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncologicas (CNIO), Madrid, Spain	Manuel Rodríguez José, Jan-Jaap Wesselink, Michael Tress, Alfonso Valencia

References

Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, Joyce CJ, Leproust EM, Harrow J, Hunt S et al. The GENCODE exome: sequencing the complete human exome. European Journal of Human Genetics. 2011;19:827-831.

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Data Release Policy

GENCODE data are available for use without restrictions. The full data release policy for ENCODE is available here.