This track shows the pseudogenes located in ENCODE regions generated by
five different methods—Yale Pipeline, GenCode manual annotation, two
different UCSC methods, and Gene Identification Signature (GIS)—as well
as a consensus pseudogenes subtrack based on the
pseudogenes from all five methods. Datasets are displayed in separate
subtracks within the annotation and are individually described below.
The annotations are colored as follows:
- Pseudogenes arising via retrotransposition (exon structure of parent gene lost)
- Pseudogenes arising via gene duplication (exon structure of parent gene retained)
- Pseudogene sequences that are single-exon and cannot be confidently
assigned to either the processed or the duplicated category
This subtrack shows pseudogenes derived from a consensus of the five
methods listed above. In the pseudogene.org data freeze dated 6 Jan. 2006,
201 consensus pseudogenes were found.
Here, pseudogenes are defined as genomic sequences that are similar to known
genes but exhibit various inactivating disablements (e.g. premature
stop codons or frameshifts) in their putative protein-coding regions and are
flagged as either recently processed or non-processed.
The pseudogene sets were processed as follows:
- Step I: The four data sets were filtered to remove pseudogenes
that overlap with current Gencode coding exons/loci. Pseudogenes overlapping
with introns or noncoding genes were kept. Subsequent filtering of pseudogene
sets, excluding the Havana set, removed pseudogenes overlapping with exons of
UCSC Known Genes.
- Step II: A union of the pseudogenes from each filtered set was
created. If a pseudogenic region was annotated by more than one group,
the lowest starting coordinate and highest ending coordinate were used as the
boundaries.
- Step III: A parent protein for each pseudogene in the union was
assigned using a protein set from UniProt. Pseudogenes without a matching
protein were excluded.
- Step IV: Each pseudogene was realigned to its parent protein.
- Step V: The consensus list of pseudogenes was updated with
boundaries derived from the alignment in Step IV.
- Step VI: The consensus list of pseudogenes was updated with the
assigned parent proteins and new classifications (processed or non-processed).
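The union in Step II can be sketched as an interval merge. This is a minimal illustration, not the group's actual code: the tuple layout, the function name `merge_annotations`, and the rule that two annotations belong to the same pseudogenic region when they share at least one base are all assumptions, since the text does not define the overlap criterion.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; this layout is an assumption.
Interval = Tuple[str, int, int]


def merge_annotations(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping annotations, keeping the lowest start and highest end."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        same_region = (
            merged
            and merged[-1][0] == chrom
            and start < merged[-1][2]  # assumed criterion: at least one shared base
        )
        if same_region:
            prev_chrom, prev_start, prev_end = merged[-1]
            # Sorting guarantees prev_start is already the lowest start.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical calls from different annotation groups.
calls = [
    ("chr7", 100, 250),
    ("chr7", 200, 400),
    ("chr7", 900, 950),
    ("chr21", 50, 80),
]
print(merge_annotations(calls))
```

In this sketch the two overlapping chr7 calls collapse into a single region spanning 100 to 400, while the isolated calls pass through unchanged.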
Verification of the Consensus Pseudogenes
All pseudogenes in the list have been extensively curated by Adam Frankish and
Jennifer Harrow at the Wellcome Trust Sanger Institute.
More information about this data set is available from pseudogene.org/ENCODE.
Havana-Gencode Annotated Pseudogenes and Immunoglobulin Segments
This track shows pseudogenes annotated by the Havana team
at the Wellcome Trust Sanger Institute. Pseudogenes have homology to protein
sequences but generally have a disrupted CDS. For all annotated
pseudogenes, an active homologous gene (the parent) can be identified
elsewhere in the genome. Pseudogenes are classified as processed or
unprocessed.
Prior to manual annotation, finished sequence is submitted to an
automated analysis pipeline for similarity searches and ab initio gene
predictions. The searches are run on a computer farm and stored in an
Ensembl MySQL database using the Ensembl analysis pipeline system
(Searle et al., 2004; Harrow et al., 2006).
A pseudogene is annotated
where the total length of the protein homology to the genomic sequence
is >20% of the length of the parent protein or >100 aa in length,
whichever is shorter. If a gene structure has an ORF but has lost
the structure of the parent gene, a pseudogene is annotated provided there
is no evidence of transcription from the pseudogene locus. When an
open but truncated reading frame is present, other evidence is used
(for example, 3' genomic polyA tract) to allow classification as a
pseudogene. When a parent gene has only a single coding exon (e.g.
olfactory receptors), a small 5' or 3' truncation to the CDS at the
pseudogene locus (compared to other family members) is sufficient to
confirm pseudogene status where the truncation is predicted to
significantly affect secondary structure by the literature and/or
Processed and unprocessed pseudogenes are
distinguished on the basis of structure and genomic context.
Processed pseudogenes, which arise via retrotransposition, lose the
intron-exon structure of the parent gene, often have an A-rich tract
indicative of the insertion site at their 3' end, and are flanked by
genomic sequence different from that flanking the parent gene. Unprocessed
pseudogenes, which arise via gene duplication, share both the
intron-exon structure and flanking genomic sequence with the parent
gene. Transcribed pseudogenes are indicated by the annotation of a
pseudogene and transcript variant alongside each other.
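Two of the rules described above lend themselves to a short sketch: the minimum homology length and the structural distinction between processed and unprocessed pseudogenes. This is one reading of the text, not the annotators' procedure, which is manual. The function names and the "needs manual review" outcome for mixed evidence are hypothetical.

```python
def meets_length_rule(homology_aa: int, parent_aa: int) -> bool:
    """Homology must exceed 20% of the parent protein or 100 aa, whichever is less."""
    threshold = min(0.20 * parent_aa, 100)
    return homology_aa > threshold


def classify_by_structure(retains_intron_exon_structure: bool,
                          shares_flanking_sequence: bool) -> str:
    """Classify a pseudogene from the two structural cues named in the text."""
    if retains_intron_exon_structure and shares_flanking_sequence:
        return "unprocessed"  # arose via gene duplication
    if not retains_intron_exon_structure and not shares_flanking_sequence:
        return "processed"  # arose via retrotransposition
    # Mixed evidence is not covered by the text; this label is an assumption.
    return "needs manual review"


# A 400 aa parent gives a threshold of 80 aa; a 1000 aa parent is capped at 100 aa.
print(meets_length_rule(90, 400))
print(meets_length_rule(90, 1000))
print(classify_by_structure(False, False))
```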
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J,
Gilbert JG, Storey R, Swarbreck D, et al.
GENCODE: Producing a reference annotation for ENCODE.
Genome Biol. 2006;7 Suppl 1:S4.1-9.
Searle SM, Gilbert J, Iyer V, Clamp M.
The otter annotation system.
Genome Res. 2004 May;14(5):963-70.
This subtrack shows pseudogenes in the ENCODE regions identified by the Yale
Pseudogene Pipeline. In this analysis, pseudogenes are defined as genomic
sequences that are similar to known genes with various inactivating
disablements (e.g. premature stop codons or frameshifts) in their
putative protein-coding regions. Pseudogenes are flagged as
recently processed, recently duplicated, or of uncertain origin (either
ancient fragments or resulting from a single-exon parent).
- Step I: Repeat-masked human genome sequence was used as the target
for a six-frame TBLASTN where the query was the nonredundant human proteome
set (European Bioinformatics Institute). Only high-quality human protein
sequences from SWISS-PROT and TrEMBL were used, because this set included
processed or duplicated pseudogenes.
- Step II: BLAST hits that had a significant overlap with annotated
multiple-exon Ensembl genes were removed from consideration.
- Step III: The set of BLAST hits was reduced by selecting hits in
decreasing significance level and removing matches that overlapped by more
than 10 amino acids or 30 bp with a picked match.
- Step IV: Adjacent matches on a chromosome were merged together if
they were thought to belong to the same pseudogene locus. Merged matches were
extended on both sides to include the length of the query protein to which they
matched along with an extra 30 bp buffer on either side.
- Step V: The FASTA program was used to re-align these extended hits to
the genome. Redundant hits were removed and hits with gaps greater than 60 bp
were split into two alignments.
- Step VI: Alignments with possible artifactual frameshifts or stop
codons introduced by the alignment process were closely inspected.
- Step VII: False positives (E-value less than 10^-10 or
amino acid sequence of less than 40% identity) and sequences matching
protein queries containing repeats or low-complexity regions were removed.
Potential functional genes were also removed. These were defined as having no
frameshift disruptions, less than 95% sequence identity to the query protein,
and translatable to a protein sequence longer than 95% of the length of
the query protein.
- Step VIII: The remaining putative pseudogene sequences were
classified based on several criteria. The intron-exon structure of the
functional gene was further used to infer whether a pseudogene was recently
duplicated or processed. A duplicated pseudogene retains the intron-exon
structure of its parent functional gene, whereas a processed pseudogene shows
evidence that this structure has been spliced out. Those sequences
where the insertions were 50% or more repeats (as detected by RepeatMasker)
are "Disrupted" processed pseudogenes. Small pseudogene sequences that
cannot be confidently assigned to either the processed or duplicated
category may be ancient fragments. Further details can be found in the
Verification of Yale Pseudogenes
All pseudogenes in the list have been manually checked.
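Step III of the pipeline above, which reduces the set of BLAST hits, reads as a greedy selection. The sketch below assumes that "decreasing significance" means increasing E-value and applies only the 30 bp limit in genomic coordinates. The `Hit` record and the function names are hypothetical, and the example hits are invented.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    chrom: str
    start: int  # half-open genomic coordinates (an assumption)
    end: int
    evalue: float


def overlap_bp(a: Hit, b: Hit) -> int:
    """Number of bases shared by two hits, or 0 if they do not overlap."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def pick_hits(hits: List[Hit], max_overlap_bp: int = 30) -> List[Hit]:
    """Keep hits from most to least significant, dropping heavy overlaps."""
    picked: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap_bp(hit, kept) <= max_overlap_bp for kept in picked):
            picked.append(hit)
    return picked


hits = [
    Hit("chr1", 100, 400, 1e-50),
    Hit("chr1", 350, 600, 1e-20),  # overlaps the first hit by 50 bp
    Hit("chr1", 380, 700, 1e-30),  # overlaps the first hit by 20 bp
    Hit("chr1", 900, 1000, 1e-5),
]
for kept in pick_hits(hits):
    print(kept)
```

Processing hits in order of significance means a weaker hit can never displace a stronger one that it overlaps.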
Zhang Z, Harrison PM, Liu Y, Gerstein M.
Millions of years of evolution preserved: a comprehensive catalog
of the processed pseudogenes in the human genome.
Genome Res. 2003 Dec;13(12):2541-58.
Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M.
Integrated pseudogene annotation for human chromosome 22: evidence
for transcription.
J Mol Biol. 2005 May 27;349(1):27-45.
UCSC Retrogene Predictions
The Retrogene subtrack shows processed mRNAs that have been inserted back
into the genome since the mouse/human split. Retrogenes can be
functional genes that have acquired a promoter from a neighboring gene,
non-functional pseudogenes, or transcribed pseudogenes.
- Step I: All GenBank mRNAs for a particular species were aligned to
the genome using blastz.
- Step II: mRNAs that aligned twice in the genome (once with introns
and once without introns) were initially screened.
- Step III: A series of features were scored to determine candidates
for retrotransposition events. These features included position and length of the
polyA tail, degree of synteny with mouse, coverage of repetitive elements,
number of exons that can still be aligned to the retrogene, and degree of
divergence from the parent gene. Retrogenes are classified using a threshold
on a score function that is a linear combination of this set of features.
Retrogenes in the final set have a score greater than 425, a threshold based on a
ROC plot against the Vega annotated pseudogenes.
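The scoring in Step III can be sketched as a weighted sum compared against the cutoff. Only the cutoff of 425 comes from the text. The feature names, their units, and every weight below are hypothetical placeholders, since the actual coefficients are not given here.

```python
from typing import Mapping

SCORE_THRESHOLD = 425  # from the track description

# Hypothetical weights, invented for illustration only.
HYPOTHETICAL_WEIGHTS: Mapping[str, float] = {
    "polya_tail": 2.0,
    "mouse_synteny_break": 1.5,
    "repeat_coverage": -1.0,
    "aligned_exons": 50.0,
    "divergence_from_parent": -3.0,
}


def retrogene_score(features: Mapping[str, float],
                    weights: Mapping[str, float] = HYPOTHETICAL_WEIGHTS) -> float:
    """Linear combination of feature values."""
    return sum(weights[name] * features[name] for name in weights)


def is_retrogene(features: Mapping[str, float]) -> bool:
    """A candidate is kept when its score exceeds the threshold."""
    return retrogene_score(features) > SCORE_THRESHOLD


# Invented feature values for one candidate.
candidate = {
    "polya_tail": 40.0,
    "mouse_synteny_break": 100.0,
    "repeat_coverage": 10.0,
    "aligned_exons": 5.0,
    "divergence_from_parent": 5.0,
}
print(retrogene_score(candidate), is_retrogene(candidate))
```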
The "type" field has four possible values:
- singleExon: the parent gene is a single exon gene
- mrna: the parent gene is a spliced mRNA that
has no annotation in NCBI refSeq, UCSC knownGene or Mammalian Gene Collection
- annotated: the parent gene has been annotated
by one of refSeq, knownGene or MGC
- expressed: an mRNA overlaps
the retrogene, indicating probable transcription
These features can be downloaded from the table pseudoGeneLink in many
formats using the Table Browser option on the menubar.
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.
Duplication, deletion, and rearrangement in the mouse and human genomes.
Proc Natl Acad Sci USA. 2003 Sep 30;100(20):11484-9.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R,
Haussler D, Miller W.
Human-mouse alignments with BLASTZ.
Genome Res. 2003 Jan;13(1):103-7.
UCSC Pseudogene Predictions
- Step I: A set of pre-aligned human known genes was mapped across the
human genome through the human Blastz Self Alignment using HomoMap (homologous
mapping method). The fragments identified by HomoMap are homologs of genes
from the Known Genes set.
- Step II: Each homologous fragment was compared with its
known reference gene and a set of features was then collected. The features
included sequence identity, Ka/Ks ratio (nonsynonymous substitution per codon vs.
synonymous substitution per codon), splicing sites, and the number of
premature stop codons. These homologous fragments are either genes or
pseudogenes.
- Step III: Homologous fragments that overlapped known reference
genes were labeled as positive samples; those overlapping known pseudogenes
were labeled as negative samples.
- Step IV: These positive and negative sets were used to train
support vector machines (SVMs) to separate coding fragments from pseudo
fragments. The trained SVMs were used to classify all homologous fragments
into potential coding elements or potential pseudo elements.
- Step V: Finally, a heuristic filter was used to correct some
misclassified fragments and to generate the final potential pseudogene set.
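One of the features listed in Step II, the number of premature stop codons, is simple to illustrate. The sketch below assumes the fragment is already in reading frame 0 and that its final codon is the normal terminator. The track description does not say how the feature was actually computed, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_premature_stops(cds: str) -> int:
    """Count in-frame stop codons that occur before the final codon."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # The last complete codon is assumed to be the regular terminator.
    return sum(1 for codon in codons[:-1] if codon in STOP_CODONS)


# Invented sequences: the first has an internal TAG, the second is intact.
print(count_premature_stops("ATGAAATAGGGGTAA"))
print(count_premature_stops("ATGAAAGGGTAA"))
```

A count like this would be one entry in the feature vector passed to a classifier, alongside sequence identity and the Ka/Ks ratio.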
GIS-PET Pseudogene Predictions
This subtrack shows retrotransposed pseudogenes predicted by multiple mapped
GIS-PETs (gene identification signature-pair end ditags) collected from two
different cancer cell lines, HCT116 and MCF7. A total of 49 non-redundant
processed pseudogenes predicted in the ENCODE regions are presented in this
dataset. Each pseudogene is labeled with an ID of the format
where "AAA" indicates the parental gene name, "GISPgene" is the GIS pseudogene, and "XX" is the unique ID for each pseudogene.
PETs were generated from full-length transcripts and
computationally mapped onto the human genome to demarcate the transcript start
and end positions. The PETs that mapped to multiple genome locations were
grouped into PET-based gene families that include parent gene and
pseudogenes. A representative member—the shortest PET as defined by
genomic coordinates—was selected from each family. This representative
PET was aligned to the hg17 genome using in order to identify all the
putative pseudogenes at the whole genome level. All hits with an
identity >=70% and coverage >=50% within ENCODE regions were
reported. In this context, "coverage" refers to alignment coverage of
the query sequence, i.e. a measure of how complete the predicted pseudogene
is relative to the query sequence.
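The reporting thresholds above can be sketched as a simple filter. Coverage follows the definition in the text (aligned portion of the query). Identity is assumed here to be matching bases divided by aligned query bases, which the text does not spell out. The `Alignment` record and the function names are hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    query_length: int         # length of the representative PET sequence
    aligned_query_bases: int  # query bases included in the alignment
    matching_bases: int       # aligned bases identical to the genome


def identity_pct(a: Alignment) -> float:
    """Percent identity over the aligned portion (assumed definition)."""
    if a.aligned_query_bases == 0:
        return 0.0
    return 100.0 * a.matching_bases / a.aligned_query_bases


def coverage_pct(a: Alignment) -> float:
    """Percent of the query sequence covered by the alignment."""
    if a.query_length == 0:
        return 0.0
    return 100.0 * a.aligned_query_bases / a.query_length


def is_reported(a: Alignment, min_identity: float = 70.0,
                min_coverage: float = 50.0) -> bool:
    """Apply the identity >= 70% and coverage >= 50% thresholds."""
    return identity_pct(a) >= min_identity and coverage_pct(a) >= min_coverage


# Invented alignments: the second fails on coverage (40%).
print(is_reported(Alignment(1000, 600, 480)))
print(is_reported(Alignment(1000, 400, 390)))
```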
Verification of GIS-PET Pseudogene Predictions
Pseudogenes were verified by manual examination.
These data were generated by the ENCODE Pseudogene Annotation group:
Siew Woh Choo
Roderic Guigo Serra,
Suganthi Balasubramanian and