Description
This track displays different measurements of conservation based on the
MAVID multiple sequence alignments of ENCODE regions
shown in the MAVID Alignment track. Two programs,
phastCons (phylogenetic hidden-Markov model method) and
GERP (Genomic Evolutionary Rate Profiling),
were used to generate the conservation scores shown in this track. A
related track, MAVID Elements, shows multi-species conserved sequences (MCSs)
based on the conservation measurements displayed in this track.
For details on the conservation scores generated by each program, refer to the
individual Methods subsections.
Display Conventions and Configuration
The subtracks within this composite annotation track may be configured in a
variety of ways to highlight different aspects of the
displayed data. The graphical configuration options
are shown at the top of the track description page, followed by a list of
the subtracks. A subtrack may be hidden from view by unchecking the box to the
left of the track name in the list. For more information about the
graphical configuration options, click the
Graph
configuration help link.
Color differences among the subtracks are arbitrary; they provide a
visual cue for distinguishing the different conservation scoring methods. See the
Methods section for display information specific to each subtrack.
Methods
The methods used to create the MAVID alignments in the ENCODE
regions are described in the MAVID Alignment track description.
PhastCons
The phastCons program predicts conserved elements and produces base-by-base
conservation scores using a two-state phylogenetic hidden Markov model.
The model consists of a state for conserved regions and a
state for nonconserved regions, each of which is associated with a
phylogenetic model. These two models are identical
except that the branch lengths of the conserved phylogeny are
multiplied by a scaling parameter rho (0 < rho < 1).
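As a minimal sketch of how the two phylogenetic models relate, the snippet below derives the conserved-state branch lengths by multiplying each nonconserved branch length by rho. The branch names, branch lengths, and the value of rho are hypothetical and chosen only for illustration; they are not taken from the ENCODE analysis.

```python
def scale_branch_lengths(branch_lengths, rho):
    """Return conserved-model branch lengths: each nonconserved length times rho."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must satisfy 0 < rho < 1")
    return {branch: rho * length for branch, length in branch_lengths.items()}


# Hypothetical nonconserved branch lengths (substitutions per site).
nonconserved = {"human": 0.10, "mouse": 0.35, "dog": 0.20}

# rho = 0.3 is an illustrative value, not the estimate used for this track.
conserved = scale_branch_lengths(nonconserved, rho=0.3)
```

Because rho is less than 1, every branch of the conserved tree is shorter than its nonconserved counterpart, which is how the model encodes a slower substitution rate in conserved regions.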
To score conservation in the ENCODE
alignments, the nonconserved model was estimated
from four-fold degenerate coding sites within the ENCODE regions using
the program phyloFit. The parameter rho was then estimated by
maximum likelihood, conditional on the nonconserved model, using the EM
algorithm implemented in phastCons. Parameter estimation was based on
a single large alignment, constructed by concatenating the
alignments for all ENCODE regions.
PhastCons was run with the options --expected-lengths 15 and
--target-coverage 0.05 to obtain the desired level of
"smoothing" and a final coverage by conserved elements of 5%.
The conservation score at each base is the posterior probability that the
base was generated by the conserved state of the phylo-HMM. It can
be interpreted as the probability that the base is in a conserved
element, given the assumptions of the model and the estimated parameters.
Scores range from 0 to 1, with higher scores corresponding to
higher levels of conservation.
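The sketch below illustrates, under stated assumptions, how the two tuning options and the posterior score fit together in a two-state HMM. It assumes the parameterisation described in Siepel et al. (2005), in which the expected element length equals 1/mu and the expected coverage equals nu/(mu + nu), where mu and nu are the probabilities of leaving and entering the conserved state. The function names, the target coverage value, and the per-column likelihoods are hypothetical; in phastCons the likelihoods come from the two phylogenetic models, and this is not the program's actual code.

```python
def transition_params(expected_length, target_coverage):
    """Map the two tuning options to transition probabilities.

    mu = P(conserved -> nonconserved), nu = P(nonconserved -> conserved),
    assuming expected length = 1/mu and coverage = nu / (mu + nu).
    """
    mu = 1.0 / expected_length
    nu = target_coverage * mu / (1.0 - target_coverage)
    return mu, nu


def posterior_conserved(emissions, mu, nu):
    """Posterior P(conserved) per column via scaled forward-backward.

    emissions: list of (likelihood under conserved, likelihood under
    nonconserved) pairs, one per alignment column (placeholder inputs).
    """
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]     # state 0 = conserved
    start = [nu / (mu + nu), mu / (mu + nu)]     # stationary distribution
    n = len(emissions)

    # Forward pass, normalising each column to avoid underflow.
    fwd = []
    for i, emit in enumerate(emissions):
        if i == 0:
            cur = [start[s] * emit[s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            cur = [emit[s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        z = cur[0] + cur[1]
        fwd.append([c / z for c in cur])

    # Backward pass, also normalised per column.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = emissions[i + 1]
        cur = [sum(trans[s][r] * nxt[r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        z = cur[0] + cur[1]
        bwd[i] = [c / z for c in cur]

    # Combine and renormalise so the two state posteriors sum to 1.
    scores = []
    for f, b in zip(fwd, bwd):
        cons, noncons = f[0] * b[0], f[1] * b[1]
        scores.append(cons / (cons + noncons))
    return scores


# Illustrative option values and made-up column likelihoods.
mu, nu = transition_params(expected_length=15, target_coverage=0.05)
columns = [(0.9, 0.1)] * 5 + [(0.1, 0.9)] * 5
scores = posterior_conserved(columns, mu, nu)
```

Each entry of scores lies between 0 and 1. A longer expected length lowers mu, which makes the conserved state "stickier" and smooths the scores across neighbouring columns.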
More details on phastCons can be found in Siepel et al. (2005),
cited below.
GERP
The GERP score is the expected substitution rate minus the observed substitution
rate at a particular human base. Scores are estimated on a column-by-column
basis using multiple sequence alignments of mammalian genomic DNA.
The scores are both positive and negative, with negative values (i.e.
observed > expected) corresponding to neutral or unconstrained sites and
positive values (i.e. observed < expected) corresponding to
constrained or slowly evolving sites.
The expected and observed rates are both calculated on a phylogenetic tree using
the same fixed topology.
The branch lengths of the expected tree are based on the average substitutions
at neutral sites.
The branch lengths of the observed tree, which is calculated separately for
each human base, are based on the substitutions seen at the column of the
multiple alignment at that base.
Species that have gaps at a particular column are not considered in the scoring
for that column.
Higher scores correspond to human
bases in alignment columns with higher degrees of similarity, i.e.
bases that have evolved slowly, some of which have been under purifying
selection. The opposite holds true for swiftly evolving (low similarity)
columns.
Scores are deterministic, given a maximum-likelihood model of
nucleotide substitution, species topology, neutral tree, and alignment.
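The scoring rule described above can be sketched as follows. This is a simplified illustration, not GERP's implementation: it assumes that a "rate" is the total branch length of a tree, that excluding gapped species amounts to restricting the neutral tree to the species present in the column, and that the observed rate for the column has already been estimated elsewhere. The tree, species names, branch lengths, and function names are all hypothetical.

```python
def subtree_length(tree, kept):
    """Total branch length of the tree restricted to the species in `kept`.

    A node is a tuple (name, branch_length, children); leaves have no children.
    A branch is counted only if kept species lie on both sides of it.
    """
    def count_leaves(node):
        name, _, children = node
        if not children:
            return 1 if name in kept else 0
        return sum(count_leaves(child) for child in children)

    total = count_leaves(tree)

    def walk(node):
        name, length, children = node
        if children:
            results = [walk(child) for child in children]
            count = sum(c for c, _ in results)
            acc = sum(a for _, a in results)
        else:
            count, acc = (1 if name in kept else 0), 0.0
        if 0 < count < total:
            acc += length
        return count, acc

    return walk(tree)[1]


def gerp_like_score(neutral_tree, ungapped_species, observed_rate):
    """Expected rate (from the neutral tree) minus the observed rate."""
    expected_rate = subtree_length(neutral_tree, ungapped_species)
    return expected_rate - observed_rate


# Hypothetical neutral tree with made-up branch lengths.
NEUTRAL_TREE = ("root", 0.0, [
    ("primates", 0.2, [("human", 0.1, []), ("chimp", 0.1, [])]),
    ("rodents", 0.4, [("mouse", 0.3, []), ("rat", 0.3, [])]),
])

# Column where rat is gapped and the observed rate is a made-up 0.2.
score = gerp_like_score(NEUTRAL_TREE, {"human", "chimp", "mouse"}, 0.2)
```

In this example the observed rate falls well below the expected rate, so the score is positive, which the text above associates with constrained or slowly evolving sites. Given the same tree, species set, and observed rate, the function always returns the same value.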
Credits
PhastCons was developed by
Adam Siepel, Cold Spring Harbor Laboratory, while at the
Haussler Lab at UCSC.
GERP was developed primarily by Greg Cooper in the lab of
Arend Sidow
at Stanford University
(Depts. of Pathology and Genetics), in close collaboration with
Eric Stone (Biostatistics, NC State), and George Asimenos and
Eugene Davydov in the lab of
Serafim Batzoglou
(Dept. of Computer Science, Stanford).
The GERP data for this track was generated by Greg Cooper.
The PhastCons data was generated by Elliott Margulies,
with assistance from Adam Siepel.
References
Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program,
Haussler, D. and Green, E.D.
Identification and characterization of multi-species conserved
sequences.
Genome Res 13(12), 2507-18 (2003).
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M.,
Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al.
Evolutionarily conserved elements in vertebrate,
insect, worm, and yeast genomes.
Genome Res 15(8), 1034-50 (2005).
References for the MAVID alignment tools can be found on the
MAVID Alignment track description page.