Description
This track displays different measurements of conservation based on
the multiple sequence alignments of ENCODE regions generated by the
Threaded Blockset Aligner (TBA) and shown in the TBA Alignment track.
The conservation scoring used to create this track was generated by
three programs:
- phastCons (phylogenetic hidden-Markov model method)
- GERP (Genomic Evolutionary Rate Profiling)
- SCONE (from Harvard Genetics)
A related track, TBA Elements, shows multi-species conserved sequences (MCSs)
based on the conservation measurements displayed in this track.
For details on the conservation scores generated by each program, refer to the
individual Methods subsections.
Display Conventions and Configuration
The subtracks within this composite annotation track may be configured in a
variety of ways to highlight different aspects of the
displayed data. The graphical configuration options
are shown at the top of the track description page, followed by a list of
subtracks. A subtrack may be hidden from view by checking the box to the
left of the track name in the list. For more information about the
graphical configuration options, click the
Graph
configuration help link.
Color differences among the subtracks are arbitrary; they provide a
visual cue for distinguishing the different conservation scoring methods. See the
Methods section for display information specific to each subtrack.
Methods
The methods used to create the TBA alignments in the ENCODE
regions are described in the TBA Alignment track description.
PhastCons
The phastCons program predicts conserved elements and produces base-by-base
conservation scores using a two-state phylogenetic hidden Markov model.
The model consists of a state for conserved regions and a
state for nonconserved regions, each of which is associated with a
phylogenetic model. These two models are identical
except that the branch lengths of the conserved phylogeny are
multiplied by a scaling parameter rho (0 < rho < 1).
For determining the conservation for the ENCODE
alignments, the nonconserved model was estimated
from four-fold degenerate coding sites within the ENCODE regions using
the program phyloFit. The parameter rho was then estimated by
maximum likelihood, conditional on the nonconserved model, using the EM
algorithm implemented in phastCons. Parameter estimation was based on
a single large alignment, constructed by concatenating the
alignments for all conserved regions.
PhastCons was run with the options --expected-lengths 15 and
--target-coverage 0.01 to obtain the desired level of
"smoothing" and a final coverage by conserved elements of 5%.
The conservation score at each base is the posterior probability that the
base was generated by the conserved state of the phylo-HMM. It can
be interpreted as the probability that the base is in a conserved
element, given the assumptions of the model and the estimated parameters.
Scores range from 0 to 1, with higher scores corresponding to
higher levels of conservation.
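The posterior computation described above can be sketched with a generic two-state forward-backward pass. This is an illustrative sketch only, not the phastCons implementation: the per-column likelihoods under the conserved and nonconserved phylogenetic models are taken as given inputs (hypothetical lists lik_cons and lik_non), and the mapping from expected length and target coverage to the transition probabilities mu and nu is an assumption stated in the comments.

```python
def transition_params(expected_length, target_coverage):
    """Assumed mapping (not taken from this page): mu = 1/expected_length is
    the probability of leaving the conserved state, and nu is chosen so that
    the stationary probability of the conserved state equals target_coverage."""
    mu = 1.0 / expected_length
    nu = target_coverage * mu / (1.0 - target_coverage)
    return mu, nu


def posterior_conserved(lik_cons, lik_non, mu, nu):
    """Posterior probability of the conserved state at each column of a
    two-state HMM, using scaled forward and backward recursions.
    lik_cons / lik_non are hypothetical per-column likelihoods."""
    n = len(lik_cons)
    emit = list(zip(lik_cons, lik_non))      # index 0 = conserved, 1 = nonconserved
    trans = ((1.0 - mu, mu), (nu, 1.0 - nu)) # trans[from][to]
    start = (nu / (mu + nu), mu / (mu + nu)) # stationary distribution

    def normalize(pair):
        total = pair[0] + pair[1]
        return (pair[0] / total, pair[1] / total)

    # Forward pass, rescaled at every column to avoid underflow.
    fwd = [normalize((start[0] * emit[0][0], start[1] * emit[0][1]))]
    for t in range(1, n):
        prev = fwd[-1]
        fwd.append(normalize(tuple(
            emit[t][k] * (prev[0] * trans[0][k] + prev[1] * trans[1][k])
            for k in range(2))))

    # Backward pass, rescaled the same way.
    bwd = [(1.0, 1.0)] * n
    for t in range(n - 2, -1, -1):
        nxt = bwd[t + 1]
        bwd[t] = normalize(tuple(
            sum(trans[k][j] * emit[t + 1][j] * nxt[j] for j in range(2))
            for k in range(2)))

    # Posterior is proportional to forward * backward at each column.
    return [normalize((f[0] * b[0], f[1] * b[1]))[0] for f, b in zip(fwd, bwd)]


mu, nu = transition_params(expected_length=15, target_coverage=0.01)
lik_cons = [0.01, 0.01, 0.9, 0.9, 0.9, 0.9, 0.01, 0.01]
lik_non = [0.5, 0.5, 0.05, 0.05, 0.05, 0.05, 0.5, 0.5]
scores = posterior_conserved(lik_cons, lik_non, mu, nu)
```

Under these assumptions, columns that are far more likely under the conserved model receive higher scores than their neighbors, and the transition probabilities control how smoothly the scores change along the sequence.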
More details on phastCons can be found in Siepel et al. (2005),
cited below.
GERP
The GERP score is the expected substitution rate minus the observed substitution
rate at a particular human base. Scores are estimated on a column-by-column
basis using multiple sequence alignments of mammalian genomic DNA.
The scores are both positive and negative, with negative values (i.e.
observed > expected) corresponding to neutral or unconstrained sites and
positive values (i.e. observed < expected) corresponding to
constrained or slowly evolving sites.
The expected and observed rates are both calculated on a phylogenetic tree using
the same fixed topology.
The branch lengths of the expected tree are based on the average substitutions
at neutral sites.
The branch lengths of the observed tree, which are calculated separately for each
human base, are based on the substitutions seen at the column of the multiple
alignment at that base.
Species that have gaps at a particular column are not considered in the scoring
for that column.
Higher scores correspond to human
bases in alignment columns with higher degrees of similarity, i.e.
bases that have evolved slowly, some of which have been under purifying
selection. The opposite holds true for swiftly evolving (low similarity)
columns.
Scores are deterministic, given a maximum-likelihood model of
nucleotide substitution, species topology, neutral tree, and alignment.
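As a rough illustration of the arithmetic, the score for one column can be sketched as the total length of the neutral (expected) tree minus the total length of the column-specific (observed) tree. The function name, the dictionary representation of branch lengths, and the handling of gapped species are all assumptions made for this sketch. In particular, dropping only the terminal branch of a gapped species is a simplification of true tree pruning, and the branch lengths themselves would come from a maximum-likelihood fit that is not shown.

```python
def gerp_like_score(neutral_branches, observed_branches, gapped_species=()):
    """Expected minus observed substitution rate for one alignment column.

    neutral_branches / observed_branches: hypothetical dicts mapping a branch
    name to its length on the same fixed topology.
    gapped_species: species with a gap in this column; their terminal
    branches are skipped in both trees (a simplification of pruning).
    """
    skip = set(gapped_species)
    expected = sum(length for name, length in neutral_branches.items()
                   if name not in skip)
    observed = sum(length for name, length in observed_branches.items()
                   if name not in skip)
    return expected - observed


# Toy branch lengths, invented for illustration only.
neutral = {"human": 0.1, "mouse": 0.3, "dog": 0.2}
slow_column = {"human": 0.0, "mouse": 0.1, "dog": 0.1}
fast_column = {"human": 0.2, "mouse": 0.5, "dog": 0.3}

constrained = gerp_like_score(neutral, slow_column)    # positive score
unconstrained = gerp_like_score(neutral, fast_column)  # negative score
```

A slowly evolving column yields a positive score and a rapidly evolving column a negative one, matching the sign convention described above.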
SCONE
SCONE is a probabilistic measure of purifying selection expressed as a
p-value that a given position evolves neutrally. It has a model of
evolution that considers both sequence-contextual effects on
substitution rates and insertion/deletion events. This model may be
used to compute the probability of any transitional event along a
lineage.
The score is computed for any column in a multiple sequence alignment
by first parsimoniously inferring the evolutionary history of the
site, using a given phylogenetic tree with known branch-lengths.
Subsequently, transition probabilities are computed for each branch in
the tree. A heuristic score is computed using the formula:
S = \ln\left( \frac{\prod_{i \in M} p_i}{\prod_{j \in C} p_j} \right)
where p_i is the transition probability computed for branch i, and M and C are
the sets of branches in the tree that do and do not contain mutations,
respectively. This heuristic score
serves to effectively partition sites according to the influence of
purifying selection on the site.
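A minimal sketch of the heuristic score follows, assuming the products in the formula run over the per-branch transition probabilities. The function name and input layout are hypothetical, and the probabilities would in practice come from the SCONE model, which is not reproduced here. Working in log space avoids underflow when many small probabilities are multiplied.

```python
import math


def scone_like_score(branch_probs, has_mutation):
    """Log ratio of the product of transition probabilities on mutated
    branches (set M) to the product on unmutated branches (set C).

    branch_probs: hypothetical per-branch transition probabilities, each > 0.
    has_mutation: parallel list of booleans, True if the inferred history
    places a mutation on that branch.
    """
    log_m = sum(math.log(p) for p, m in zip(branch_probs, has_mutation) if m)
    log_c = sum(math.log(p) for p, m in zip(branch_probs, has_mutation) if not m)
    return log_m - log_c


# Toy values, invented for illustration only.
probs = [0.1, 0.9, 0.8]
mutated = [True, False, False]
score = scone_like_score(probs, mutated)
```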
This heuristic score is used to compute a p-value by comparing it
against the expected distribution of neutral scores as determined by
Monte-Carlo simulation. Forward simulation of evolution is performed
along the phylogenetic tree using the SCONE model of mutation events, and
the above heuristic score is computed for a simulated tree. Repeated
simulation produces a distribution of scores that reflects the
neutral expected distribution. A p-value score may be computed by
counting the fraction of simulated heuristic scores that fall below
the heuristic score for the site.
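The final counting step can be sketched as an empirical p-value. The simulated scores are assumed to have been produced already by forward simulation under the neutral model, which is not shown; the function name is hypothetical.

```python
def empirical_p_value(site_score, simulated_scores):
    """Fraction of simulated neutral scores that fall strictly below the
    heuristic score observed for the site."""
    if not simulated_scores:
        raise ValueError("at least one simulated score is required")
    below = sum(1 for s in simulated_scores if s < site_score)
    return below / len(simulated_scores)


# Toy values, invented for illustration only.
p = empirical_p_value(0.5, [0.1, 0.2, 0.6, 0.9])
```

With more simulations the estimate becomes finer grained, since the smallest nonzero value it can take is one over the number of simulated scores.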
Credits
PhastCons was developed by
Adam Siepel, Cold Spring Harbor Laboratory, while at the
Haussler lab at UCSC.
GERP was developed primarily by Greg Cooper in the lab of
Arend Sidow
at Stanford University
(Depts. of Pathology and Genetics), in close collaboration with
Eric Stone (Biostatistics, NC State), and George Asimenos and
Eugene Davydov in the lab of
Serafim Batzoglou
(Dept. of Computer Science, Stanford).
SCONE was developed by Saurabh Asthana in the lab of Shamil
Sunyaev at Harvard Medical School and Brigham & Women's Hospital
(Department of Medicine/Division of Genetics).
TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the
Penn State Bioinformatics
Group.
The GERP data for this track was generated by Greg Cooper.
The PhastCons data was generated by Elliott Margulies,
with assistance from Adam Siepel.
The SCONE data was generated by Saurabh Asthana.
References
Asthana S, Roytberg M, Stamatoyannopoulos J, Sunyaev S.
Analysis of Sequence Conservation at Nucleotide Resolution.
PLoS Comput. Biol. 2007 Dec 28;3(12):e254.
Blanchette M, Kent WJ, Reimer C, Elnitski L, Smit A,
Roskin K, Baertsch R, Rosenbloom KR, Clawson H, Green ED, et al.
Aligning Multiple Genomic Sequences With the Threaded Blockset
Aligner.
Genome Res. 2004 Apr;14(4):708-15.
Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program,
Green ED, Batzoglou S, Sidow A.
Distribution and intensity of constraint in mammalian genomic
sequence.
Genome Res. 2005 Jul;15(7):901-13. Epub 2005 Jun 17.
Margulies EH, Blanchette M, NISC Comparative Sequencing Program,
Haussler D, Green ED.
Identification and characterization of multi-species conserved
sequences.
Genome Res. 2003 Dec;13(12):2507-18.
Siepel A, Bejerano G, Pedersen JS, Hinrichs A, Hou M,
Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al.
Evolutionarily conserved elements in vertebrate,
insect, worm, and yeast genomes.
Genome Res. 2005 Aug;15(8):1034-50. Epub 2005 Jul 15.