Schema for UCSC Genes - UCSC Genes (RefSeq, GenBank, tRNAs & Comparative Genomics)
  Database: mm9    Primary Table: knownGene    Row Count: 55,419   Data last updated: 2011-03-02
Format description: Transcript from default gene set in UCSC browser
On download server: MariaDB table dump directory
field       example           SQL type          info    description
name        uc007aet.1        varchar(255)      values  Name of gene
chrom       chr1              varchar(255)      values  Reference sequence chromosome or scaffold
strand      -                 char(1)           values  + or - for strand
txStart     3195984           int(10) unsigned  range   Transcription start position (or end position for minus strand item)
txEnd       3205713           int(10) unsigned  range   Transcription end position (or start position for minus strand item)
cdsStart    3195984           int(10) unsigned  range   Coding region start (or end position if for minus strand item)
cdsEnd      3195984           int(10) unsigned  range   Coding region end (or start position if for minus strand item)
exonCount   2                 int(10) unsigned  range   Number of exons
exonStarts  3195984,3203519,  longblob                  Exon start positions (or end positions for minus strand item)
exonEnds    3197398,3205713,  longblob                  Exon end positions (or start positions for minus strand item)
proteinID                     varchar(40)       values  UniProt display ID, UniProt accession, or RefSeq protein ID
alignID     uc007aet.1        varchar(255)      values  Unique identifier (GENCODE transcript ID for GENCODE Basic)
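
As a rough illustration of how the fields above fit together, the following Python sketch parses one row into a typed record. It assumes the row arrives as a single tab-separated line with the 12 columns in the order listed; the `KnownGene` class and the helper names are hypothetical and not part of any UCSC tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnownGene:
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    protein_id: str
    align_id: str


def parse_int_list(blob: str) -> List[int]:
    """Parse a comma-separated list with a trailing comma, e.g. '1,2,'."""
    return [int(x) for x in blob.split(",") if x]


def parse_known_gene(line: str) -> KnownGene:
    """Parse one row, assumed to be tab-separated with 12 columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    gene = KnownGene(
        name=f[0], chrom=f[1], strand=f[2],
        tx_start=int(f[3]), tx_end=int(f[4]),
        cds_start=int(f[5]), cds_end=int(f[6]),
        exon_count=int(f[7]),
        exon_starts=parse_int_list(f[8]),
        exon_ends=parse_int_list(f[9]),
        protein_id=f[10], align_id=f[11],
    )
    # The two exon lists should each hold exonCount entries.
    if not (len(gene.exon_starts) == len(gene.exon_ends) == gene.exon_count):
        raise ValueError(f"exon lists disagree with exonCount for {gene.name}")
    return gene


# The example values from the field table above.
row = ("uc007aet.1\tchr1\t-\t3195984\t3205713\t3195984\t3195984\t2\t"
       "3195984,3203519,\t3197398,3205713,\t\tuc007aet.1")
gene = parse_known_gene(row)
```

The trailing comma in `exonStarts` and `exonEnds` is why `parse_int_list` filters out empty strings rather than calling `int` on every piece.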

Connected Tables and Joining Fields
      mm9.affyGnfU74ADistance.query (via knownGene.name)
      mm9.affyGnfU74ADistance.target (via knownGene.name)
      mm9.affyGnfU74BDistance.query (via knownGene.name)
      mm9.affyGnfU74BDistance.target (via knownGene.name)
      mm9.affyGnfU74CDistance.query (via knownGene.name)
      mm9.affyGnfU74CDistance.target (via knownGene.name)
      mm9.bioCycPathway.kgID (via knownGene.name)
      mm9.ccdsKgMap.geneId (via knownGene.name)
      mm9.ceBlastTab.query (via knownGene.name)
      mm9.dmBlastTab.query (via knownGene.name)
      mm9.drBlastTab.query (via knownGene.name)
      mm9.foldUtr3.name (via knownGene.name)
      mm9.foldUtr5.name (via knownGene.name)
      mm9.gnfAtlas2Distance.query (via knownGene.name)
      mm9.gnfAtlas2Distance.target (via knownGene.name)
      mm9.hgBlastTab.query (via knownGene.name)
      mm9.keggPathway.kgID (via knownGene.name)
      mm9.kgAlias.kgID (via knownGene.name)
      mm9.kgColor.kgID (via knownGene.name)
      mm9.kgProtAlias.kgID (via knownGene.name)
      mm9.kgProtMap2.qName (via knownGene.name)
      mm9.kgSpAlias.kgID (via knownGene.name)
      mm9.kgTargetAli.qName (via knownGene.name)
      mm9.kgTxInfo.name (via knownGene.name)
      mm9.kgXref.kgID (via knownGene.name)
      mm9.knownBlastTab.query (via knownGene.name)
      mm9.knownBlastTab.target (via knownGene.name)
      mm9.knownCanonical.protein (via knownGene.name)
      mm9.knownCanonical.transcript (via knownGene.name)
      mm9.knownGeneMrna.name (via knownGene.name)
      mm9.knownGenePep.name (via knownGene.name)
      mm9.knownIsoforms.transcript (via knownGene.name)
      mm9.knownToAllenBrain.name (via knownGene.name)
      mm9.knownToEnsembl.name (via knownGene.name)
      mm9.knownToGnfAtlas2.name (via knownGene.name)
      mm9.knownToKeggEntrez.name (via knownGene.name)
      mm9.knownToLocusLink.name (via knownGene.name)
      mm9.knownToPfam.name (via knownGene.name)
      mm9.knownToRefSeq.name (via knownGene.name)
      mm9.knownToSuper.gene (via knownGene.name)
      mm9.knownToVisiGene.name (via knownGene.name)
      mm9.rnBlastTab.query (via knownGene.name)
      mm9.scBlastTab.query (via knownGene.name)
      mm9.ucscScop.ucscId (via knownGene.name)
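
A join on one of these fields might look like the sketch below. It uses an in-memory SQLite database as a stand-in for the MariaDB server, with cut-down copies of `knownGene` and `kgXref`. The `geneSymbol` column and the placeholder value stored in it are assumptions for illustration only; the real `kgXref` table has its own schema and contents.

```python
import sqlite3

# In-memory stand-in for the mm9 database (assumption: toy tables only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT,
                        txStart INTEGER, txEnd INTEGER);
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
""")

# Two transcripts taken from the sample rows of this table.
conn.executemany(
    "INSERT INTO knownGene VALUES (?, ?, ?, ?, ?)",
    [
        ("uc007aet.1", "chr1", "-", 3195984, 3205713),
        ("uc007aeu.1", "chr1", "-", 3204562, 3661579),
    ],
)
# Hypothetical cross-reference entry; the symbol is a placeholder.
conn.execute("INSERT INTO kgXref VALUES (?, ?)",
             ("uc007aeu.1", "PLACEHOLDER_SYMBOL"))

# Join through knownGene.name = kgXref.kgID, keeping unmatched transcripts.
rows = conn.execute("""
    SELECT g.name, g.chrom, g.txStart, g.txEnd, x.geneSymbol
    FROM knownGene AS g
    LEFT JOIN kgXref AS x ON x.kgID = g.name
    ORDER BY g.txStart
""").fetchall()

for r in rows:
    print(r)
```

A `LEFT JOIN` is used here so that transcripts without a matching cross-reference row still appear, with `None` in the joined column.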

Sample Rows
 
name        chrom  strand  txStart  txEnd    cdsStart  cdsEnd   exonCount  exonStarts                                exonEnds                                  proteinID     alignID
uc007aet.1  chr1   -       3195984  3205713  3195984   3195984  2          3195984,3203519,                          3197398,3205713,                                        uc007aet.1
uc007aeu.1  chr1   -       3204562  3661579  3206102   3661429  3          3204562,3411782,3660632,                  3207049,3411982,3661579,                  Q5GH67        uc007aeu.1
uc007aev.1  chr1   -       3638391  3648985  3638391   3638391  2          3638391,3648927,                          3640590,3648985,                                        uc007aev.1
uc007aew.1  chr1   -       4280926  4399322  4283061   4399268  4          4280926,4341990,4342282,4399250,          4283093,4342162,4342918,4399322,          NP_001182591  uc007aew.1
uc007aex.2  chr1   -       4333587  4350395  4334680   4342906  4          4333587,4341990,4342282,4350280,          4340172,4342162,4342918,4350395,          Q548Q8        uc007aex.2
uc007aey.1  chr1   -       4481008  4483816  4481796   4483487  2          4481008,4483180,                          4482749,4483816,                          Q61473        uc007aey.1
uc007aez.1  chr1   -       4481008  4486494  4481796   4483487  5          4481008,4483180,4483852,4485216,4486371,  4482749,4483547,4483944,4486023,4486494,  Q61473        uc007aez.1
uc007afa.1  chr1   -       4481008  4486494  4481796   4485236  4          4481008,4483852,4485216,4486371,          4482749,4483944,4486023,4486494,          Q61473        uc007afa.1
uc007afb.1  chr1   -       4481008  4486494  4481796   4482672  3          4481008,4483852,4486371,                  4482749,4483944,4486494,                  Q61473-2      uc007afb.1
uc007afc.1  chr1   -       4481008  4486494  4481796   4483487  4          4481008,4483180,4483852,4486371,          4482749,4483571,4483944,4486494,          Q61473        uc007afc.1

Note: all start coordinates in our database are 0-based, not 1-based. End coordinates are 1-based, so each start/end pair describes a half-open interval.
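
To make the coordinate convention concrete, here is a small Python sketch that converts a stored interval into a 1-based position string and computes its length. It assumes the usual half-open convention, in which the stored end coordinate is exclusive; the function names are hypothetical.

```python
def to_one_based(chrom: str, start: int, end: int) -> str:
    """Convert a 0-based, half-open interval to a 1-based, closed position string."""
    return f"{chrom}:{start + 1}-{end}"


def interval_length(start: int, end: int) -> int:
    """Length in bases of a 0-based, half-open interval."""
    return end - start


# First sample transcript: txStart=3195984, txEnd=3205713.
position = to_one_based("chr1", 3195984, 3205713)

# First exon of the same transcript: 3195984 to 3197398.
first_exon_length = interval_length(3195984, 3197398)
```

Only the start is shifted by one. With half-open intervals the length is simply `end - start`, with no extra `+ 1`.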

UCSC Genes (knownGene) Track Description
 

Description

The UCSC Genes track shows gene predictions based on data from RefSeq, GenBank, and the tRNA Genes track. This is a moderately conservative set of predictions, requiring the support of one GenBank RNA sequence plus at least one additional line of evidence. The RefSeq RNAs are an exception to this, requiring no additional evidence. The track includes both protein-coding and putative non-coding transcripts. Some of these non-coding transcripts may actually code for protein, but the evidence for the associated protein is weak at best. Compared to RefSeq, this gene set generally has about 10% more protein-coding genes, approximately five times as many putative non-coding genes, and about twice as many splice variants.

For more information on the different gene tracks, see our Genes FAQ.

Display Conventions and Configuration

This track in general follows the display conventions for gene prediction tracks. The exons for putative noncoding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker. The following color key is used:

  • Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
  • Dark blue -- transcript has been reviewed or validated by either the RefSeq or SwissProt staff
  • Medium blue -- other RefSeq transcripts
  • Light blue -- non-RefSeq transcripts
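
The thin and thick blocks described above can be derived from the table fields. The Python sketch below splits each exon into untranslated (thin) and coding (thick) pieces using `cdsStart` and `cdsEnd`. It assumes that a transcript with `cdsStart` equal to `cdsEnd` is non-coding, as the sample rows with an empty `proteinID` suggest; the function name is hypothetical and this is not the browser's own drawing code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def split_thin_thick(
    exon_starts: List[int],
    exon_ends: List[int],
    cds_start: int,
    cds_end: int,
) -> Tuple[List[Interval], List[Interval]]:
    """Return (thin, thick) intervals for one transcript."""
    thin: List[Interval] = []
    thick: List[Interval] = []
    for start, end in zip(exon_starts, exon_ends):
        # Overlap between this exon and the coding region.
        c_start = max(start, cds_start)
        c_end = min(end, cds_end)
        if cds_start >= cds_end or c_start >= c_end:
            # Non-coding transcript, or exon entirely outside the CDS.
            thin.append((start, end))
            continue
        if start < c_start:
            thin.append((start, c_start))   # UTR before the CDS
        thick.append((c_start, c_end))      # coding part
        if c_end < end:
            thin.append((c_end, end))       # UTR after the CDS
    return thin, thick


# uc007aey.1 from the sample rows: two exons, CDS 4481796-4483487.
thin, thick = split_thin_thick(
    [4481008, 4483180], [4482749, 4483816], 4481796, 4483487
)

# uc007aet.1: cdsStart == cdsEnd, so every exon is drawn thin.
nc_thin, nc_thick = split_thin_thick(
    [3195984, 3203519], [3197398, 3205713], 3195984, 3195984
)
```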

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu.

Methods

The UCSC Genes are built using a multi-step pipeline:

  1. RefSeq and GenBank RNAs are aligned to the genome with BLAT, keeping only the best alignments for each RNA and discarding alignments of less than 98% identity.
  2. Alignments are broken up at non-intronic gaps, with small isolated fragments thrown out.
  3. A splicing graph is created for each set of overlapping alignments. This graph has an edge for each exon or intron, and a vertex for each splice site, start, and end. Each RNA that contributes to an edge is kept as evidence for that edge.
  4. A similar splicing graph is created in the mouse, based on mouse RNA and ESTs. If the mouse graph has an edge that is orthologous to an edge in the human graph, that is added to the evidence for the human edge.
  5. If an edge in the splicing graph is supported by two or more human ESTs, it is added as evidence for the edge.
  6. If there is an Exoniphy prediction for an exon, that is added as evidence.
  7. The graph is traversed to generate all unique transcripts. The traversal is guided by the initial RNAs to avoid a combinatorial explosion in alternative splicing. All RefSeq transcripts are output. For other multi-exon transcripts to be output, an edge supported by at least one additional line of evidence beyond the RNA is required. Single-exon genes require either two RNAs or two additional lines of evidence beyond the single RNA.
  8. Protein predictions are generated. For non-RefSeq transcripts we use the txCdsPredict program to determine if the transcript is protein-coding and if so, the locations of the start and stop codons. The program weighs as positive evidence the length of the protein, the presence of a Kozak consensus sequence at the start codon, and the length of the orthologous predicted protein in other species. As negative evidence it considers nonsense-mediated decay and start codons in any frame upstream of the predicted start codon. For RefSeq transcripts the RefSeq protein prediction is used.
  9. The corresponding UniProt protein is found, if any.
  10. The transcript is assigned a permanent "uc" accession.
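
The splicing graph in step 3 and the evidence check in step 7 can be sketched in Python as follows. This is a simplified illustration, not the UCSC pipeline: vertices are represented implicitly by genomic coordinates, only other RNAs count as additional evidence (ESTs, orthologous edges, and Exoniphy predictions are left out), and the alignment data, RNA names, and function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Block = Tuple[int, int]        # one aligned exon block: (start, end)
Edge = Tuple[int, int, str]    # (from_vertex, to_vertex, "exon" or "intron")


def build_splicing_graph(alignments: Dict[str, List[Block]]) -> Dict[Edge, Set[str]]:
    """Map each exon or intron edge to the set of RNAs that support it."""
    evidence: Dict[Edge, Set[str]] = defaultdict(set)
    for rna, blocks in alignments.items():
        blocks = sorted(blocks)
        for i, (start, end) in enumerate(blocks):
            evidence[(start, end, "exon")].add(rna)
            if i + 1 < len(blocks):
                # The gap to the next block becomes an intron edge.
                evidence[(end, blocks[i + 1][0], "intron")].add(rna)
    return dict(evidence)


def rna_path(blocks: List[Block]) -> List[Edge]:
    """List the edges that one RNA's alignment traverses."""
    blocks = sorted(blocks)
    path: List[Edge] = []
    for i, (start, end) in enumerate(blocks):
        path.append((start, end, "exon"))
        if i + 1 < len(blocks):
            path.append((end, blocks[i + 1][0], "intron"))
    return path


def has_extra_support(path: List[Edge], evidence: Dict[Edge, Set[str]], rna: str) -> bool:
    """True if any edge on the path is supported by an RNA other than `rna`."""
    return any(evidence[edge] - {rna} for edge in path)


# Hypothetical alignments: rnaB skips the middle exon of rnaA; rnaC is isolated.
alignments = {
    "rnaA": [(100, 200), (300, 400), (500, 600)],
    "rnaB": [(100, 200), (500, 600)],
    "rnaC": [(1000, 1100), (1200, 1300)],
}
evidence = build_splicing_graph(alignments)
supported = {
    rna: has_extra_support(rna_path(blocks), evidence, rna)
    for rna, blocks in alignments.items()
}
```

In this toy input, `rnaA` and `rnaB` share two exon edges and so support each other, while `rnaC` has no second line of evidence and would not be output under this simplified rule.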

Credits

The UCSC Genes track was produced at UCSC using a computational pipeline developed by Jim Kent, Chuck Sugnet and Mark Diekhans. It is based on data from NCBI RefSeq, UniProt (including TrEMBL and TrEMBL-NEW) and GenBank. Our thanks to the people running these databases and to the scientists worldwide who have made contributions to them.

Data Use Restrictions

The UniProt data have the following terms of use, UniProt copyright(c) 2002 - 2004 UniProt consortium:

For non-commercial use, all databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

For commercial use, all databases and documents in the UniProt FTP directory except the files

  • ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
  • ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz
may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. More information for commercial users can be found here.

From January 1, 2005, all databases and documents in the UniProt FTP directory may be copied and redistributed freely by all entities, without advance permission, provided that this copyright statement is reproduced with each copy.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779

Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006 May 1;22(9):1036-46. PMID: 16500937

Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518