Description
This track displays human-centric multiple sequence alignments in the
ENCODE regions for the 28 vertebrates included in the
September 2005 ENCODE MSA freeze,
based on comparative sequence data generated for the ENCODE project
as well as whole-genome assemblies residing at UCSC, as listed:
- human (May 2004, hg17)
- chimp (Nov 2003, panTro1)
- colobus_monkey (NISC)
- baboon (NISC)
- macaque (Jan 2005, rheMac1)
- dusky_titi (NISC)
- owl_monkey (NISC)
- marmoset (NISC)
- mouse_lemur (NISC)
- galago (NISC)
- rat (June 2003, rn3)
- mouse (Mar 2005, mm6)
- rabbit (NISC and May 2005 Broad Assisted Assembly v 1.0)
- cow (BCM)
- dog (July 2004, canFam1)
- rfbat (NISC)
- hedgehog (NISC)
- shrew (NISC and Sep 2005 Mullikin Phusion Assembly of Broad Traces)
- armadillo (NISC and May 2005 Broad Assisted Assembly v 1.0)
- elephant (NISC and May 2005 Broad Assisted Assembly v 1.0)
- tenrec (Apr 2005 Mullikin Phusion Assembly of Broad Traces)
- monodelphis (Oct 2004, monDom1)
- platypus (NISC and Aug 2005 Mullikin Phusion Assembly of WUGSC Traces)
- chicken (Feb 2004, galGal2)
- xenopus (Oct 2004, xenTro1)
- tetraodon (Feb 2004, tetNig1)
- fugu (Aug 2002, fr1)
- zebrafish (June 2004, danRer2)
The alignments in this track were generated using the
LAGAN
Alignment Toolkit. The Genome Browser companion tracks, MLAGAN Cons and MLAGAN Elements,
display conservation scoring and conserved elements for these alignments based
on various conservation methods.
Display Conventions and Configuration
In full display mode, this track shows the pairwise alignment
of each species to the human genome.
In dense mode, the alignments are depicted using a gray-scale
density gradient. The checkboxes in the track configuration section allow
species to be excluded from the pairwise display.
When zoomed in to the base-display level, the track shows the base
composition of each alignment. The numbers and symbols on the "Gaps"
line indicate the lengths of gaps in the human sequence at those
alignment positions relative to the longest non-human sequence. If there is
sufficient space in the display, the size of the gap is shown. If there is
not, a "*" is displayed when the gap size is a multiple of 3, and
a "+" is shown otherwise.
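The labeling rule for the "Gaps" line can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the Genome Browser's own code; the function name `gap_label` and the `available_chars` parameter (the number of character cells free for the label) are assumptions made for the example.

```python
def gap_label(gap_length: int, available_chars: int) -> str:
    """Return the text shown on the "Gaps" line for one alignment position.

    gap_length      -- length of the gap in the human sequence, in bases
    available_chars -- hypothetical count of character cells free for the label
    """
    if gap_length <= 0:
        return ""  # no gap at this position, nothing to draw

    text = str(gap_length)
    if len(text) <= available_chars:
        return text  # enough room: show the actual gap size

    # Not enough room: fall back to a one-character symbol.
    return "*" if gap_length % 3 == 0 else "+"


if __name__ == "__main__":
    for length, room in [(12, 2), (12, 1), (10, 1), (7, 1)]:
        print(length, room, repr(gap_label(length, room)))
```

Under these assumptions, a 12-base gap with only one free cell is drawn as "*", while a 10-base gap in the same space is drawn as "+".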
To view detailed information about the
alignments at a specific position, zoom the display in to 30,000 or fewer
bases, then click on the alignment.
Methods
MLAGAN alignments were produced by a pipeline specifically designed for
ENCODE. First, WU-BLAST was
used to find local similarities (anchors)
between the human sequence and the sequence of every other species. Then,
Shuffle-LAGAN was used to calculate the highest-scoring
human-monotonic
chain of these local similarities (according to a scoring scheme that
penalized evolutionary rearrangements) and, with the help of a utility
called SuperMap, to produce a map of orthologous segments ordered by increasing
human coordinates. This map was used to undo the genomic rearrangements of the
other sequence and convert it to a form that was directly alignable to the
human sequence. The new humanized sequences, together with the human
sequence, were then multiply aligned using
MLAGAN.
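To make the idea of a highest-scoring human-monotonic chain concrete, the following Python sketch chains anchors with a simple quadratic dynamic program. It is not Shuffle-LAGAN's actual algorithm or scoring scheme: the `Anchor` type, the flat `rearrangement_penalty`, and the rule that any backward jump in the other species counts as one rearrangement are all simplifying assumptions, and strand is ignored.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    """A local similarity between human and one other species (hypothetical type)."""
    human_start: int
    human_end: int   # exclusive
    other_start: int
    other_end: int   # exclusive
    score: float


def best_human_monotonic_chain(
    anchors: List[Anchor], rearrangement_penalty: float = 50.0
) -> List[Anchor]:
    """Pick non-overlapping anchors in increasing human coordinates.

    The other species' coordinates may jump backwards, but each such jump
    costs `rearrangement_penalty` (an assumed, flat penalty).
    """
    ordered = sorted(anchors, key=lambda a: (a.human_start, a.human_end))
    n = len(ordered)
    if n == 0:
        return []

    best = [a.score for a in ordered]  # best chain score ending at each anchor
    prev = [-1] * n                    # back-pointers for chain recovery

    for j in range(n):
        for i in range(j):
            a, b = ordered[i], ordered[j]
            if a.human_end > b.human_start:
                continue  # anchors overlap in human: not monotonic
            colinear = a.other_end <= b.other_start
            penalty = 0.0 if colinear else rearrangement_penalty
            candidate = best[i] + b.score - penalty
            if candidate > best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk back from the highest-scoring chain end.
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    chain.reverse()
    return chain


if __name__ == "__main__":
    a = Anchor(0, 10, 0, 10, 100.0)
    b = Anchor(20, 30, 100, 110, 80.0)
    c = Anchor(40, 50, 50, 60, 30.0)    # out of order in the other species
    d = Anchor(60, 70, 120, 130, 90.0)
    print(best_human_monotonic_chain([a, b, c, d], rearrangement_penalty=50.0))
    print(best_human_monotonic_chain([a, b, c, d], rearrangement_penalty=10.0))
```

In this toy input, anchor `c` is dropped when the penalty outweighs its score and kept when the penalty is small, which is the trade-off a rearrangement-penalizing scoring scheme is meant to capture.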
The resulting alignments were subsequently refined using
MUSCLE, which
processed small non-overlapping alignment windows and realigned them in an
iterative fashion, keeping the refined alignment if it had a better
sum-of-pairs score than the original. Finally, a pairwise refinement
round was performed, during which the pieces that had very low identity (in the
induced pairwise alignments between human and each species) were removed
from the alignment.
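The two refinement criteria described above, a sum-of-pairs comparison and a pairwise identity check, can be illustrated with a small Python sketch. This is not MUSCLE's scoring or the pipeline's actual code: the match, mismatch, and gap values are assumed for the example, and `refine_window` takes a hypothetical `realign` callable standing in for the realignment step.

```python
from itertools import combinations
from typing import Callable, List


def sum_of_pairs(rows: List[str], match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum column scores over every pair of rows in an alignment window.

    The scoring values are illustrative assumptions, not MUSCLE's parameters.
    """
    total = 0
    for first, second in combinations(rows, 2):
        for x, y in zip(first, second):
            if x == "-" and y == "-":
                continue  # gap aligned to gap contributes nothing
            if x == "-" or y == "-":
                total += gap
            elif x.upper() == y.upper():
                total += match
            else:
                total += mismatch
    return total


def refine_window(original: List[str], realign: Callable[[List[str]], List[str]]) -> List[str]:
    """Keep the realigned window only if its sum-of-pairs score is better."""
    candidate = realign(original)
    return candidate if sum_of_pairs(candidate) > sum_of_pairs(original) else original


def pairwise_identity(human: str, other: str) -> float:
    """Fraction of identical bases among columns where both rows have a base."""
    aligned = matches = 0
    for h, o in zip(human, other):
        if h == "-" or o == "-":
            continue
        aligned += 1
        if h.upper() == o.upper():
            matches += 1
    return matches / aligned if aligned else 0.0


if __name__ == "__main__":
    before = ["ACG-T", "AC-GT", "ACGGT"]
    after = ["ACG-T", "ACG-T", "ACGGT"]
    print(sum_of_pairs(before), sum_of_pairs(after))
    print(refine_window(before, lambda window: after))
    print(pairwise_identity("ACGT-A", "ACCTTA"))
```

A pairwise filter along the lines of the final refinement round would compare `pairwise_identity` for each human-to-species piece against a cutoff; the source does not state the cutoff used, so none is assumed here.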
Credits
The MLAGAN alignments were generated by George Asimenos from Stanford's ENCODE group.
Shuffle-LAGAN, SuperMap and MLAGAN were written by Mike Brudno.
MUSCLE was authored by Bob Edgar.
WU-BLAST was provided by
the Gish lab
at the School of Medicine, Washington University in St. Louis.
The phylogenetic tree is based on Murphy et al. (2001).
References
Brudno M, Do C, Cooper G, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S.
LAGAN and Multi-LAGAN: efficient tools for large-scale multiple
alignment of genomic DNA.
Genome Res. 2003;13(4):721-31.
Brudno M, Malde S, Poliakov A, Do C, Couronne O, Dubchak I, Batzoglou S.
Global alignment: finding rearrangements during alignment.
Bioinformatics. 2003;19(Suppl. 1):i54-i62.
Edgar RC.
MUSCLE: multiple sequence alignment with high
accuracy and high throughput.
Nucl Acids Res. 2004;32(5):1792-7.
Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E,
Ryder OA, Stanhope MJ, de Jong WW et al.
Resolution of the early placental mammal radiation using Bayesian phylogenetics.
Science. 2001;294(5550):2348-51.