Description
These tracks display DNaseI hypersensitivity (HS) evidence as part of the
four Open Chromatin track sets.
DNaseI is an enzyme that has long been used to map general
chromatin accessibility and DNaseI "hypersensitivity" is a feature of active
cis-regulatory sequences. The use of this method has led to the discovery of
functional regulatory elements that include promoters, enhancers, silencers,
insulators, locus control regions, and novel elements. DNaseI hypersensitivity
signifies chromatin accessibility following binding of trans-acting factors in
place of a canonical nucleosome.
Together with FAIRE and
ChIP-seq experiments, these tracks display the locations of active regulatory
elements identified as open chromatin in
multiple cell types
from the Duke, UNC-Chapel Hill, UT-Austin, and EBI ENCODE group.
Within this project, open chromatin was identified using two
independent and complementary methods: these DNaseI HS assays
and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE),
combined with chromatin immunoprecipitation (ChIP) for select
regulatory factors. DNaseI HS and FAIRE provide assay
cross-validation with commonly identified regions delineating the
highest confidence areas of open chromatin. ChIP assays provide
functional validation and preliminary annotation of a subset of
open chromatin sites. Each method employed Illumina (formerly Solexa)
sequencing by synthesis as the detection platform.
The Tier 1 and Tier 2 cell types were additionally verified using
high-resolution 1% ENCODE tiled microarrays supplied by NimbleGen.
Other Open Chromatin track sets:
- Data for the FAIRE experiments can be found in
UNC FAIRE.
- Data for the ChIP experiments can be found in
UTA TFBS.
- A synthesis of all the open chromatin assays for select cell lines can
be previewed in
Open Chrom Synth.
Display Conventions and Configuration
This track is a multi-view composite track that contains a single data type
with multiple levels of annotation (views). For each view, there are
multiple subtracks representing different cell types that display individually
on the browser. Instructions for configuring multi-view tracks are
here.
Chromatin data displayed here represents a continuum of signal intensities.
The Crawford lab recommends setting the "Data view scaling: auto-scale"
option when viewing signal data in full mode to see the full dynamic
range of the data. Note that in regions that do not have open chromatin sites,
autoscale will rescale the data and inflate the background signal, making the
regions appear noisy. Changing back to fixed scale will alleviate this issue.
In general, for each experiment in each of the cell types, the
Duke DNaseI HS tracks contain the following views:
- Peaks
- Regions of enriched signal in DNaseI HS experiments.
Peaks were called based on signals created using F-Seq, a software program
developed at Duke (Boyle et al., 2008b). Significant regions were
determined by fitting the data to a gamma distribution to calculate p-values.
Contiguous regions where p-values were below a 0.05/0.01 threshold were
considered significant. The solid vertical line in the peak represents the
point with the highest signal.
- F-Seq Density Signal
- Density graph (wiggle) of signal
enrichment calculated using F-Seq for the combined set of sequences from all
replicates. F-Seq employs Parzen kernel density estimation to create base pair
scores (Boyle et al., 2008b). This method does not look at fixed-length
windows, but rather weights the contributions of nearby sequences according to
their distance from that base. It considers only sequences aligned 4 or fewer
times in the genome and uses an alignability background model to try to correct
for regions where sequences cannot be aligned. For each cell type (especially
important for those with an abnormal karyotype), a model to try to correct for
amplifications and deletions that is based on control input data was also used.
- Base Overlap Signal
- An alternative version of the
F-Seq Density Signal track annotation that provides a higher resolution
view of the raw sequence data. This track also includes the combined set of
sequences from all replicates. For each sequence, the aligned read is
extended 5 bp in both directions from its 5' aligned end where DNase cut
the DNA. The score at each base pair represents the number of
extended fragments that overlap the base pair.
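The Base Overlap Signal calculation described above can be sketched in a few lines of Python. This is only an illustration, not the production code: the function name, the 0-based coordinates, and the choice to count the cut-site base itself (giving an 11 bp fragment) are assumptions not stated in the text.

```python
from collections import defaultdict


def base_overlap_signal(cut_sites, extend=5):
    """Count, for each base, how many extended fragments overlap it.

    cut_sites: iterable of (chrom, position) pairs, one per read, where
        position is the 0-based 5' aligned end (the DNase cut site).
    extend: number of bases added on each side of the cut site.
    Returns {chrom: {position: count}} containing only covered positions.
    """
    signal = defaultdict(lambda: defaultdict(int))
    for chrom, pos in cut_sites:
        # Assumption: the fragment includes the cut-site base itself.
        first = max(0, pos - extend)
        for p in range(first, pos + extend + 1):
            signal[chrom][p] += 1
    return {chrom: dict(cov) for chrom, cov in signal.items()}


# Two reads cut 2 bp apart: bases covered by both fragments score 2.
example = base_overlap_signal([("chr1", 10), ("chr1", 12)])
```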
Peaks and signals displayed in this track are the results of pooled replicates. The raw
sequence and alignment files for each replicate are available for
download.
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
Methods
Cells were grown according to the approved
ENCODE cell culture protocols.
DNaseI hypersensitive sites were isolated using methods called DNase-seq or
DNase-chip (Song and Crawford, 2010; Boyle et al., 2008a; Crawford et al., 2006).
Briefly, cells were lysed with NP40, and intact nuclei were digested with optimal
levels of DNaseI enzyme. DNaseI-digested ends were captured from three different
DNase concentrations, and material was sequenced using Illumina (Solexa)
sequencing. DNase-seq data for Tier 1 and Tier 2 cell lines were verified by comparing
multiple independent growths (replicates) and determining the reproducibility of the
data. In general, cell lines were verified if 80% of the top 50,000 peaks in
one replicate were detected in the top 100,000 peaks of a second replicate. For
some cell types, additional verification was performed using similar material
hybridized to NimbleGen Human ENCODE tiling arrays (1% of the genome) along with
the input DNA as reference (DNase-chip). A more detailed protocol is available
here.
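As a rough illustration of the replicate check described above, the sketch below computes the fraction of the top peaks in one replicate that are also found among the top peaks of a second replicate. The peak tuple layout, the function names, and the rule that "detected" means an overlap of at least one base are assumptions made for this example; the text does not specify them.

```python
def top_peaks(peaks, n):
    """Return the n highest-scoring peaks; a peak is (chrom, start, end, score)."""
    return sorted(peaks, key=lambda p: p[3], reverse=True)[:n]


def overlaps_any(peak, intervals_by_chrom):
    """True if the peak shares at least one base with any interval (half-open)."""
    chrom, start, end, _ = peak
    return any(start < e and s < end
               for s, e in intervals_by_chrom.get(chrom, []))


def fraction_reproduced(rep1, rep2, n_top=50_000, n_pool=100_000):
    """Fraction of the top n_top peaks in rep1 found in the top n_pool of rep2."""
    top = top_peaks(rep1, n_top)
    if not top:
        return 0.0
    pool = {}
    for chrom, start, end, _ in top_peaks(rep2, n_pool):
        pool.setdefault(chrom, []).append((start, end))
    hits = sum(overlaps_any(p, pool) for p in top)
    return hits / len(top)


def is_verified(rep1, rep2, min_fraction=0.8):
    """Apply the 80% rule of thumb quoted in the text."""
    return fraction_reproduced(rep1, rep2) >= min_fraction
```

The linear scan in overlaps_any keeps the sketch short; for peak lists of the size mentioned in the text, a sorted sweep or an interval index would be the more practical choice.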
Sequences from DNase-seq were 20 bases long due to an MmeI
cutting step of the approximately 50 kb DNA fragments extracted after DNaseI
digestion. Sequences from each experiment were aligned to the GRCh37 (hg19)
genome assembly using BWA (Li et al., 2008).
- The command used for these alignments was:
> bwa aln -t 8 genome.fa s_1.sequence.txt.bfq > s_1.sequence.txt.sai
where genome.fa is the whole genome sequence and s_1.sequence.txt.bfq is one lane
of sequences converted into the required bfq format.
Sequences from multiple lanes
were combined for a single replicate using the bwa samse command and converted
to the SAM/BAM format using SAMtools.
Only those sequences that aligned to 4 or fewer locations were retained. Other sequences
were also filtered based on their alignment to problematic regions
(such as satellites and rRNA genes - see
supplemental materials).
The mappings of these short reads to the genome are available for
download.
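The text does not say how the "4 or fewer locations" filter was implemented. One possible approach, shown here purely as a sketch, is to read the number of best hits from the BWA-specific X0:i optional tag that bwa samse writes to each SAM record. The use of X0, the function names, and the decision to drop records lacking the tag are all assumptions of this example.

```python
def best_hit_count(sam_record):
    """Return the value of the X0:i tag in a SAM record, or None if absent."""
    fields = sam_record.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM alignment line.
    for tag in fields[11:]:
        if tag.startswith("X0:i:"):
            return int(tag[len("X0:i:"):])
    return None


def keep_read(sam_record, max_hits=4):
    """Keep mapped reads whose best-hit count is at most max_hits."""
    fields = sam_record.split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # SAM flag bit 0x4 marks an unmapped read
        return False
    hits = best_hit_count(sam_record)
    # Assumption: records without an X0 tag are discarded.
    return hits is not None and hits <= max_hits


def filter_sam_lines(lines, max_hits=4):
    """Yield header lines unchanged and alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_read(line, max_hits):
            yield line
```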
Using F-Seq, which employs Parzen kernel density estimation to create base pair scores, the resulting digital signal was converted to a continuous wiggle track
(Boyle et al., 2008b). Input data was generated for several
cell lines. These were used directly to create a control/background model used
for F-Seq when generating signal annotations for these cell lines.
These models were meant to correct for sequencing biases, alignment artifacts,
and copy number changes in these cell lines. Input data was not generated
directly for other cell lines. For cell lines for which there is
no input experiment available, the peaks were generated using the control
of generic_male or generic_female, as an attempt to create a general
background based on input data from several cell types. These files
are in "iff" format, which is used when calling peaks with
F-Seq, and can be downloaded directly from the
production lab
under the section titled "Copy number / karyotype correction."
Using a general background model derived from the available Input data sets provided corrections for
sequencing biases and alignment artifacts, but did not correct for cell type-specific copy number changes.
- The exact command used for this step was:
> fseq -l 600 -v -f 0 -b <bff files> -p <iff files> alignments.bed
where the bff files are the background files based on alignability, the
iff files are the background files based on the Input experiments,
and alignments.bed is a bed file of filtered sequence alignments.
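To give a feel for the Parzen kernel density estimation mentioned above, the following minimal sketch scores each base by summing kernel-weighted contributions from nearby read positions. It is not F-Seq itself: the Gaussian kernel, the bandwidth value, and the function names are assumptions chosen for illustration, and no background model is applied.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density(positions, start, end, bandwidth):
    """Parzen density estimate at every base in the half-open range [start, end).

    positions: read positions (for example, 5' aligned ends) on one chromosome.
    bandwidth: smoothing width in bases (a hypothetical value in this sketch).
    """
    n = len(positions)
    if n == 0:
        return [0.0] * (end - start)
    density = []
    for x in range(start, end):
        # Reads closer to x contribute more; distant reads contribute little.
        total = sum(gaussian_kernel((x - p) / bandwidth) for p in positions)
        density.append(total / (n * bandwidth))
    return density


# Three reads clustered near position 100 give a smooth bump in that area.
scores = kernel_density([100, 100, 110], start=90, end=121, bandwidth=5.0)
```

Because every read contributes to every base in the range, no fixed-length window has to be chosen; the bandwidth alone controls how quickly a read's influence falls off.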
Discrete DNaseI HS sites (peaks) were identified from the DNase-seq F-Seq density signal.
Significant regions were determined by fitting the data to a gamma distribution to
calculate p-values. Contiguous regions where p-values were below a 0.05/0.01
threshold were considered significant.
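The final step of peak calling, grouping contiguous bases whose p-values fall below the threshold, can be sketched as follows. The sketch assumes the per-base p-values have already been computed from the fitted gamma distribution; the function name, the strict "less than" comparison, and the half-open coordinates are assumptions of this example, and the threshold is left as a parameter.

```python
def significant_regions(p_values, threshold, offset=0):
    """Return half-open (start, end) runs of bases with p-value below threshold.

    p_values: one p-value per base, in genomic order.
    threshold: significance cutoff (for example 0.05 or 0.01).
    offset: genomic coordinate of the first base in p_values.
    """
    regions = []
    run_start = None
    for i, p in enumerate(p_values):
        if p < threshold:
            if run_start is None:
                run_start = i  # a new significant run begins here
        elif run_start is not None:
            regions.append((offset + run_start, offset + i))
            run_start = None
    if run_start is not None:
        # The last run extends to the end of the input.
        regions.append((offset + run_start, offset + len(p_values)))
    return regions


# Two runs of significant bases separated by one non-significant base.
peaks = significant_regions([0.5, 0.01, 0.02, 0.2, 0.04, 0.04, 0.04], 0.05)
```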
Data from the high-resolution 1% ENCODE tiled microarrays supplied by
NimbleGen were normalized using the Tukey biweight normalization and peaks
were called using ChIPOTle (Buck et al., 2005) at multiple levels
of significance. Regions matched on size to these peaks that were devoid of
any significant signal were also created as a null model. These data were used
for additional verification of Tier 1 and Tier 2 cell lines by ROC analysis.
Files containing this data can be found in the
Downloads
directory, labeled 'Validation' in the View column.
Release Notes
This is Release 3 (August 2012) of the track. It adds 27 new experiments, including 18 new cell lines.
- A synthesis of open chromatin evidence from the three assay types was
compiled for Tier 1 and 2 cell lines and can be viewed in
Open Chromatin Synthesis.
- Enhancer and Insulator Functional assays: A subset of DNase and FAIRE
regions were cloned into functional tissue culture reporter assays to test for
enhancer and insulator activity. Coordinates and results from these
experiments can be found in the
supplemental materials.
Credits
These data and annotations were created by a collaboration of multiple
institutions (contact:
Terry Furey)
- Duke University's Institute for Genome Sciences & Policy (IGSP): Alan Boyle,
Lingyun Song, and
Greg Crawford
- University of North Carolina at Chapel Hill: Paul Giresi,
Jason Lieb, and Terry Furey
- University of Texas at Austin: Zheng Liu, Ryan McDaniell, Bum-Kyu Lee, and
Vishy Iyer
- European Bioinformatics
Institute: Paul Flicek, Damian Keefe, and
Ewan Birney
- University of
Cambridge, Department of Oncology and CR-UK Cambridge Research Institute (CRI): Stefan Graf
We thank NHGRI for ENCODE funding support.
References
Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR.
Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE).
Genome Res. 2007 Jun;17(6):910-6.
Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE.
High-resolution mapping and characterization of open chromatin across the genome.
Cell. 2008 Jan 25;132(2):311-22.
Boyle AP, Guinney J, Crawford GE, Furey TS.
F-Seq: a feature density estimator for high-throughput sequence tags.
Bioinformatics. 2008 Nov 1;24(21):2537-8.
Buck MJ, Nobel AB, Lieb JD.
ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data.
Genome Biol. 2005;6(11):R97.
Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS.
DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays.
Nat Methods. 2006 Jul;3(7):503-9.
Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D et al.
Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS).
Genome Res. 2006 Jan;16(1):123-31.
ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al.
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
Nature. 2007 Jun 14;447(7146):799-816.
Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD.
FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin.
Genome Res. 2007 Jun;17(6):877-85.
Giresi PG, Lieb JD.
Isolation of active regulatory elements from eukaryotic chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory Elements).
Methods. 2009 Jul;48(3):233-9.
Li H, Ruan J, Durbin R.
Mapping short DNA sequencing reads and calling variants using mapping quality scores.
Genome Res. 2008 Nov;18(11):1851-8.
Song L, Crawford GE.
DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells.
Cold Spring Harb Protoc. 2010 Feb;2010(2):pdb.prot5384.
Data Release Policy
Data users may freely use ENCODE data, but may not, without prior consent, submit publications
that use an unpublished ENCODE dataset until nine months following the release of the dataset.
This date is listed in the
Restricted Until column, above. The full data release policy for ENCODE is available
here.