Schema for Known Genes - UCSC Known Genes Based on UniProt, RefSeq, and GenBank mRNA
  Database: mm8    Primary Table: knownGene    Row Count: 31,863   Data last updated: 2006-02-27
Format description: Transcript from default gene set in UCSC browser
field       example                   SQL type          info    description
name        NM_001011874              varchar(255)      values  Name of gene
chrom       chr1                      varchar(255)      values  Reference sequence chromosome or scaffold
strand      -                         char(1)           values  + or - for strand
txStart     3204562                   int(10) unsigned  range   Transcription start position (or end position for minus strand item)
txEnd       3661579                   int(10) unsigned  range   Transcription end position (or start position for minus strand item)
cdsStart    3206102                   int(10) unsigned  range   Coding region start (or end position for minus strand item)
cdsEnd      3661429                   int(10) unsigned  range   Coding region end (or start position for minus strand item)
exonCount   3                         int(10) unsigned  range   Number of exons
exonStarts  3204562,3411782,3660632,  longblob                  Exon start positions (or end positions for minus strand item)
exonEnds    3207049,3411982,3661579,  longblob                  Exon end positions (or start positions for minus strand item)
proteinID   XKR4_MOUSE                varchar(40)       values  UniProt display ID, UniProt accession, or RefSeq protein ID
alignID     R1                        varchar(8)        values  Unique identifier (GENCODE transcript ID for GENCODE Basic)
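
As a rough illustration of how the twelve fields above fit together, the following Python sketch parses one row into a typed record. It assumes rows arrive as tab-separated text, which is an assumption about the dump format rather than something stated on this page; the `KnownGene` class and the helper names are hypothetical. Note that the exonStarts and exonEnds lists end with a trailing comma, which the parser has to tolerate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KnownGene:
    """One knownGene row; attribute names mirror the schema fields."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    protein_id: str
    align_id: str

    def exons(self) -> List[Tuple[int, int]]:
        """Pair each exon start with its end, in the order stored."""
        return list(zip(self.exon_starts, self.exon_ends))


def parse_int_list(blob: str) -> List[int]:
    """Parse a comma-separated list such as '1,2,3,' (trailing comma allowed)."""
    return [int(item) for item in blob.split(",") if item]


def parse_known_gene(line: str) -> KnownGene:
    """Parse one tab-separated knownGene line (tab separation is assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    gene = KnownGene(
        name=fields[0],
        chrom=fields[1],
        strand=fields[2],
        tx_start=int(fields[3]),
        tx_end=int(fields[4]),
        cds_start=int(fields[5]),
        cds_end=int(fields[6]),
        exon_count=int(fields[7]),
        exon_starts=parse_int_list(fields[8]),
        exon_ends=parse_int_list(fields[9]),
        protein_id=fields[10],
        align_id=fields[11],
    )
    # Sanity check: both exon lists must agree with exonCount.
    if not (len(gene.exon_starts) == len(gene.exon_ends) == gene.exon_count):
        raise ValueError(f"{gene.name}: exon lists do not match exonCount")
    return gene


# The example values from the schema table, joined with tabs.
row = "\t".join([
    "NM_001011874", "chr1", "-", "3204562", "3661579", "3206102", "3661429",
    "3", "3204562,3411782,3660632,", "3207049,3411982,3661579,",
    "XKR4_MOUSE", "R1",
])
gene = parse_known_gene(row)
print(gene.name, gene.exons())
```

Validating the exon lists against exonCount at parse time is a design choice of this sketch; it catches truncated rows early.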

Connected Tables and Joining Fields
      mm8.affyGnfU74ADistance.query (via knownGene.name)
      mm8.affyGnfU74ADistance.target (via knownGene.name)
      mm8.affyGnfU74BDistance.query (via knownGene.name)
      mm8.affyGnfU74BDistance.target (via knownGene.name)
      mm8.affyGnfU74CDistance.query (via knownGene.name)
      mm8.affyGnfU74CDistance.target (via knownGene.name)
      mm8.ccdsKgMap.geneId (via knownGene.name)
      mm8.ceBlastTab.query (via knownGene.name)
      mm8.cgapAlias.alias (via knownGene.name)
      mm8.dmBlastTab.query (via knownGene.name)
      mm8.drBlastTab.query (via knownGene.name)
      mm8.dupSpMrna.mrnaID (via knownGene.name)
      mm8.foldUtr3.name (via knownGene.name)
      mm8.foldUtr5.name (via knownGene.name)
      mm8.gnfAtlas2Distance.query (via knownGene.name)
      mm8.gnfAtlas2Distance.target (via knownGene.name)
      mm8.hgBlastTab.query (via knownGene.name)
      mm8.keggPathway.kgID (via knownGene.name)
      mm8.kgAlias.kgID (via knownGene.name)
      mm8.kgProtAlias.kgID (via knownGene.name)
      mm8.kgSpAlias.kgID (via knownGene.name)
      mm8.kgTargetAli.qName (via knownGene.name)
      mm8.kgXref.kgID (via knownGene.name)
      mm8.knownBlastTab.query (via knownGene.name)
      mm8.knownBlastTab.target (via knownGene.name)
      mm8.knownCanonical.transcript (via knownGene.name)
      mm8.knownGeneMrna.name (via knownGene.name)
      mm8.knownGenePep.name (via knownGene.name)
      mm8.knownIsoforms.transcript (via knownGene.name)
      mm8.knownToAllenBrain.name (via knownGene.name)
      mm8.knownToEnsembl.name (via knownGene.name)
      mm8.knownToGnfAtlas2.name (via knownGene.name)
      mm8.knownToKeggEntrez.name (via knownGene.name)
      mm8.knownToLocusLink.name (via knownGene.name)
      mm8.knownToPfam.name (via knownGene.name)
      mm8.knownToRefSeq.name (via knownGene.name)
      mm8.knownToSuper.gene (via knownGene.name)
      mm8.knownToVisiGene.name (via knownGene.name)
      mm8.rnBlastTab.query (via knownGene.name)
      mm8.scBlastTab.query (via knownGene.name)
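
Each entry above names a column in another table that holds the same identifier as knownGene.name, so the tables can be combined with an ordinary equality join. The sketch below shows the idea with Python's built-in `sqlite3` module and an in-memory toy database. The `geneSymbol` column of kgXref and the symbol values are assumptions made for illustration; only the joining fields (`kgXref.kgID`, `knownGene.name`) come from the list above, and the real mm8 database is MySQL, not SQLite.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for two mm8 tables (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE knownGene (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
        -- geneSymbol is an assumed column name; kgID is the documented joining field.
        CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO knownGene VALUES (?, ?, ?, ?)",
        [
            ("NM_001011874", "chr1", 3204562, 3661579),
            ("AK140060", "chr1", 3670235, 3671869),
            ("NM_011441", "chr1", 4481008, 4486494),
        ],
    )
    # Symbols below are illustrative values, not copied from the database.
    conn.executemany(
        "INSERT INTO kgXref VALUES (?, ?)",
        [("NM_001011874", "Xkr4"), ("NM_011441", "Sox17")],
    )
    return conn


def symbols_by_transcript(conn: sqlite3.Connection, chrom: str):
    """Join knownGene to kgXref via knownGene.name = kgXref.kgID."""
    query = """
        SELECT kg.name, kg.txStart, kg.txEnd, x.geneSymbol
        FROM knownGene AS kg
        JOIN kgXref AS x ON x.kgID = kg.name
        WHERE kg.chrom = ?
        ORDER BY kg.txStart, kg.name
    """
    return conn.execute(query, (chrom,)).fetchall()


conn = build_demo_db()
rows = symbols_by_transcript(conn, "chr1")
for row in rows:
    print(row)
```

Because this is an inner join, a transcript with no matching kgXref row (AK140060 in the toy data) is dropped from the result; a LEFT JOIN would keep it with a NULL symbol.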

Sample Rows
 
name          chrom  strand  txStart  txEnd    cdsStart  cdsEnd   exonCount  exonStarts  exonEnds  proteinID  alignID
NM_001011874  chr1   -       3204562  3661579  3206102   3661429  3          3204562,3411782,3660632,  3207049,3411982,3661579,  XKR4_MOUSE  R1
AK140060      chr1   +       3670235  3671869  3670651   3670993  1          3670235,  3671869,  Q3USV4_MOUSE  G168464
NM_011283     chr1   -       4334225  4350473  4334680   4342906  4          4334225,4341990,4342282,4350280,  4340172,4342162,4342918,4350473,  RP1_MOUSE  R2
AK004781      chr1   -       4481008  4486494  4481796   4482672  5          4481008,4483180,4483852,4485216,4486371,  4482749,4483547,4483944,4486023,4486494,  Q61473-2  U95834
NM_011441     chr1   -       4481008  4486494  4481796   4483487  5          4481008,4483180,4483852,4485216,4486371,  4482749,4483547,4483944,4486023,4486494,  SOX17_MOUSE  R3
NM_025300     chr1   -       4763291  4775791  4764532   4775758  5          4763291,4767605,4772648,4774031,4775653,  4764597,4767729,4772814,4774186,4775791,  Q9CPP5_MOUSE  R4
BC068230      chr1   -       4763291  4775790  4766544   4775758  5          4763291,4767605,4772648,4774031,4775653,  4766882,4767729,4772814,4774186,4775790,  Q9CPR5_MOUSE  U147577
NM_008866     chr1   +       4797973  4836815  4797994   4835097  9          4797973,4798535,4818664,4820348,4822391,4827081,4829467,4831036,4835043,  4798063,4798567,4818730,4820396,4822462,4827155,4829569,4831213,4836815,  LYPA1_MOUSE  R5
AK050549      chr1   +       4797998  4831341  4798009   4831240  8          4797998,4798535,4818664,4820348,4822391,4827081,4829467,4831036,  4798063,4798567,4818730,4820396,4822462,4827155,4829569,4831341,  P97823-2  G168484
NM_011541     chr1   +       4847894  4887983  4847994   4886445  10         4847894,4857550,4868107,4876824,4879537,4880820,4881995,4883497,4885014,4886436,  4848057,4857613,4868213,4876912,4879683,4880877,4882150,4883644,4885086,4887983,  TCEA1_MOUSE  R6

Note: all start coordinates in our database are 0-based, not 1-based, while end coordinates are 1-based, so each item spans the half-open interval [start, end).
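
The practical consequence of 0-based starts is easiest to see in code. The sketch below relies on the usual UCSC convention that stored intervals are half-open, with a 0-based start and an end that needs no adjustment; the function names are hypothetical. It converts a stored interval to the 1-based, inclusive form used for display and computes its length.

```python
from typing import Tuple


def to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a stored 0-based, half-open interval to 1-based, inclusive."""
    return start + 1, end


def interval_length(start: int, end: int) -> int:
    """Length in bases of a 0-based, half-open interval."""
    return end - start


def browser_position(chrom: str, start: int, end: int) -> str:
    """Format a stored interval as a 1-based position string."""
    first, last = to_one_based(start, end)
    return f"{chrom}:{first}-{last}"


# First exon of NM_001011874 from the sample rows.
exon_start, exon_end = 3204562, 3207049
print(browser_position("chr1", exon_start, exon_end))
print(interval_length(exon_start, exon_end))
```

With half-open intervals the length is simply end minus start, with no off-by-one correction.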

Known Genes (knownGene) Track Description
 

Description

The UCSC Known Genes track shows known protein-coding genes based on protein data from UniProt (SWISS-PROT and TrEMBL) and mRNA data from the NCBI reference sequences collection (RefSeq) and GenBank. Each Known Gene is represented by an mRNA and a protein.

Display Conventions and Configuration

This track follows the display conventions for gene prediction tracks with the following color scheme:

  • Black: indicates the gene has a corresponding entry in the Protein Data Bank (PDB).
  • Dark Blue: indicates the gene has either a corresponding RefSeq mRNA that is "Reviewed" or "Validated" or a corresponding Swiss-Prot protein.
  • Medium Blue: indicates the gene has a corresponding RefSeq mRNA that is neither "Reviewed" nor "Validated".
  • Light Blue: everything else. That is, the gene does not have a corresponding Protein Data Bank entry, RefSeq mRNA, or Swiss-Prot protein, but it has supporting evidence of a GenBank mRNA with a UniProt (TrEMBL) protein.
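
The color scheme can be read as a small decision procedure. The following Python sketch encodes it under one assumption not stated explicitly above: that the rules are applied top to bottom, so a gene with a PDB entry is black even if it also has a reviewed RefSeq mRNA. The function and enum names are hypothetical and are not part of the browser's code.

```python
from enum import Enum
from typing import Optional


class TrackColor(Enum):
    BLACK = "Black"
    DARK_BLUE = "Dark Blue"
    MEDIUM_BLUE = "Medium Blue"
    LIGHT_BLUE = "Light Blue"


def known_gene_color(
    has_pdb_entry: bool,
    refseq_status: Optional[str],
    has_swissprot: bool,
) -> TrackColor:
    """Pick a display color; refseq_status is None when no RefSeq mRNA exists.

    Assumes the rules are checked in the order they are listed.
    """
    if has_pdb_entry:
        return TrackColor.BLACK
    if refseq_status in ("Reviewed", "Validated") or has_swissprot:
        return TrackColor.DARK_BLUE
    if refseq_status is not None:
        # A RefSeq mRNA exists but is neither Reviewed nor Validated.
        return TrackColor.MEDIUM_BLUE
    # Only a GenBank mRNA with a TrEMBL protein supports the gene.
    return TrackColor.LIGHT_BLUE


print(known_gene_color(False, "Provisional", False).value)
```

Here "Provisional" stands in for any RefSeq status other than "Reviewed" or "Validated".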

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the "genomic codons" option from the "Color track by codons" pull-down menu.

Methods

This release of UCSC Known Genes was built by a new process, KG II, as described below.

UniProt protein sequences (including alternative splicing isoforms) and mRNA sequences from RefSeq and GenBank were aligned against the base genome using BLAT. RefSeq alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. GenBank mRNA alignments having a base identity level within 0.2% of the best and at least 97% base identity with the genomic sequence were kept. Protein alignments having a base identity level within 0.2% of the best and at least 80% base identity with the genomic sequence were kept.
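
The three filtering rules above share one shape: an absolute identity floor plus a maximum gap below the best alignment. A minimal Python sketch of that filter follows. It assumes that "the best" means the highest-identity alignment of the same query sequence and that the 0.1% and 0.2% gaps are percentage points; neither detail is spelled out in the text. The `Alignment` record and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# source -> (maximum gap below the best identity, minimum identity), in percent
THRESHOLDS = {
    "refseq": (0.1, 96.0),
    "genbank": (0.2, 97.0),
    "protein": (0.2, 80.0),
}


@dataclass(frozen=True)
class Alignment:
    query: str       # sequence that was aligned
    source: str      # "refseq", "genbank", or "protein"
    target: str      # label for the genomic location hit
    identity: float  # base identity with the genome, in percent


def filter_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Keep alignments that are near the query's best and above the floor."""
    best: Dict[str, float] = {}
    for aln in alignments:
        best[aln.query] = max(best.get(aln.query, 0.0), aln.identity)

    kept = []
    for aln in alignments:
        max_gap, min_identity = THRESHOLDS[aln.source]
        near_best = best[aln.query] - aln.identity <= max_gap
        if aln.identity >= min_identity and near_best:
            kept.append(aln)
    return kept


# Hypothetical alignments for illustration.
alignments = [
    Alignment("refseqA", "refseq", "locus1", 99.8),
    Alignment("refseqA", "refseq", "locus2", 99.75),  # within 0.1 of best
    Alignment("refseqA", "refseq", "locus3", 98.0),   # too far below best
    Alignment("mrnaB", "genbank", "locus4", 96.5),    # below the 97% floor
    Alignment("protC", "protein", "locus5", 85.0),
]
kept = filter_alignments(alignments)
print([aln.target for aln in kept])
```

Comparisons that fall exactly on a threshold are sensitive to floating-point rounding; a production filter would likely work in integer counts of matching bases instead.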

Then the genomic mRNA and protein alignments were compared, and protein-mRNA pairings were determined from their overlaps. mRNA CDS data were obtained from RefSeq and GenBank data and supplemented by CDS structures derived from UCSC protein-mRNA BLAT alignments. The initial set of UCSC Known Genes candidates consists of all protein-mRNA pairs with valid mRNA CDS structures. A gene-check program (similar to the one used for the Consensus CDS (CCDS) project) was used to remove questionable candidates, such as those with in-frame stop codons or missing start or stop codons.

From each group of gene candidates that share the same CDS structure, the protein-mRNA pair having the best ranking and protein-mRNA alignment score is selected as a UCSC Known Gene. The ranking of a gene candidate depends on its gene-check quality measures. When all else is equal, preference is given first to RefSeq mRNAs and then to MGC mRNAs. Similarly, preference is given to gene candidates represented by Swiss-Prot proteins. The protein-mRNA alignment score is calculated from the protein-to-mRNA alignment using TBLASTN, plus weighted sub-scores according to the date and length of the mRNA.
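
The selection step can be pictured as "group by CDS structure, keep the best of each group". The Python sketch below shows only that shape. How the ranking and score are actually computed and combined is not specified here, so the sketch assumes a precomputed integer `rank` (higher is better) that is compared first, with the alignment `score` breaking ties; the `Candidate` record, that ordering, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List


@dataclass(frozen=True)
class Candidate:
    mrna: str
    protein: str
    cds_structure: Hashable  # e.g. (chrom, strand, tuple of CDS exon intervals)
    rank: int                # assumed: higher means better gene-check quality
    score: float             # assumed: protein-mRNA alignment score


def select_known_genes(candidates: List[Candidate]) -> List[Candidate]:
    """Keep one candidate per CDS structure: best rank, then best score."""
    best: Dict[Hashable, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.cds_structure)
        if current is None or (cand.rank, cand.score) > (current.rank, current.score):
            best[cand.cds_structure] = cand
    return list(best.values())


# Hypothetical candidates: the first two share a CDS structure.
cds_a = ("chr1", "-", ((100, 200), (300, 400)))
cds_b = ("chr1", "+", ((500, 650),))
candidates = [
    Candidate("mrna1", "prot1", cds_a, rank=2, score=910.0),
    Candidate("mrna2", "prot1", cds_a, rank=3, score=880.0),
    Candidate("mrna3", "prot2", cds_b, rank=1, score=700.0),
]
selected = select_known_genes(candidates)
print([cand.mrna for cand in selected])
```

In this toy input `mrna2` wins its group because its rank is higher, even though `mrna1` has the larger alignment score.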

Credits

The UCSC Known Genes track was produced using protein data from UniProt and mRNA data from NCBI RefSeq and GenBank.

Data Use Restrictions

The UniProt data have the following terms of use, UniProt copyright(c) 2002 - 2004 UniProt consortium:

For non-commercial use, all databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

For commercial use, all databases and documents in the UniProt FTP directory except the files

  • ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
  • ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz
may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. More information for commercial users can be found here.

From January 1, 2005, all databases and documents in the UniProt FTP directory may be copied and redistributed freely by all entities, without advance permission, provided that this copyright statement is reproduced with each copy.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779

Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006 May 1;22(9):1036-46. PMID: 16500937

Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518