Schema for Open Chromatin - ENCODE Open Chromatin, Duke/UNC/UT
  Database: hg18    Primary Table: wgEncodeUtaustinChIPseqSignalHuvecInput    Row Count: 2,930,636   Data last updated: 2009-09-24
Format description: Wiggle track values to display as y-values (first 6 fields are bed6)
On download server: MariaDB table dump directory
field       example                         SQL type              description
bin         585                             smallint(5) unsigned  Indexing field to speed chromosome range queries.
chrom       chr1                            varchar(255)          Reference sequence chromosome or scaffold
chromStart  102                             int(10) unsigned      Start position in chromosome
chromEnd    1126                            int(10) unsigned      End position in chromosome
name        chr1.0                          varchar(255)          Name of item
span        1                               int(10) unsigned      each value spans this many bases
count       1024                            int(10) unsigned      number of values in this block
offset      0                               int(10) unsigned      offset in File to fetch data
file        /gbdb/hg18/wib/wgEncodeUtau...  varchar(255)          path name to data file, one byte per value
lowerLimit  0                               double                lowest data value in this block
dataRange   0.1144                          double                lowerLimit + dataRange = upperLimit
validCount  1024                            int(10) unsigned      number of valid data values in this block
sumData     23.2904                         double                sum of the data points, for average and stddev calc
sumSquares  1.74187                         double                sum of data points squared, for stddev calc
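
The sumData, sumSquares and validCount fields exist so that the average and standard deviation of a block can be computed without reading the data file. A minimal Python sketch follows. The function name block_stats is hypothetical, and the n-1 (sample) denominator is an assumption, since the schema does not say which form of the variance is intended.

```python
import math

def block_stats(valid_count, sum_data, sum_squares):
    """Return (mean, stddev) for one table row from its summary fields."""
    if valid_count == 0:
        return (float("nan"), float("nan"))
    mean = sum_data / valid_count
    if valid_count < 2:
        return (mean, 0.0)
    # Assumption: sample variance (n - 1 denominator), computed from the sums.
    variance = (sum_squares - sum_data * sum_data / valid_count) / (valid_count - 1)
    # Clamp tiny negative values caused by floating-point rounding.
    return (mean, math.sqrt(max(variance, 0.0)))

# Values taken from the example column of the schema (block chr1.0).
mean, stddev = block_stats(1024, 23.2904, 1.74187)
print(f"mean={mean:.6f} stddev={stddev:.6f}")
```

Summing sumData, sumSquares and validCount over several rows before calling the function gives the statistics for a larger region, which is the reason the table stores sums rather than per-block averages.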

Sample Rows
 
bin  chrom  chromStart  chromEnd  name    span  count  offset  file                                                        lowerLimit  dataRange  validCount  sumData  sumSquares
585  chr1   102         1126      chr1.0  1     1024   0       /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0.1144     1024        23.2904  1.74187
585  chr1   1126        2150      chr1.1  1     1024   1024    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0.0634     1024        7.1477   0.335884
585  chr1   2150        3174      chr1.2  1     1024   2048    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0          1024        0        0
585  chr1   3174        4198      chr1.3  1     1024   3072    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0.0056     1024        0.1161   0.00035553
585  chr1   4198        5222      chr1.4  1     1024   4096    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0.0508     1024        11.3191  0.440635
585  chr1   5222        6246      chr1.5  1     1024   5120    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0          1024        0        0
585  chr1   6246        7270      chr1.6  1     1024   6144    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0          1024        0        0
585  chr1   7270        8294      chr1.7  1     1024   7168    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0          1024        0        0
585  chr1   8294        9318      chr1.8  1     1024   8192    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0.0508     1024        5.7176   0.220495
585  chr1   9318        10342     chr1.9  1     1024   9216    /gbdb/hg18/wib/wgEncodeUtaustinChIPseqSignalHuvecInput.wib  0           0.0509     1024        15.9602  0.568893
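
Each row points at count bytes starting at offset in the data file, one byte per value. The Python sketch below shows one way such a block might be turned back into floating-point values. The byte encoding used here (0 to 127 scaled linearly between lowerLimit and lowerLimit + dataRange, with 128 marking a base that has no data) is an assumption based on the usual wiggle binary layout and is not stated on this page. The function names are hypothetical.

```python
WIG_MAX_BYTE = 127  # assumption: data bytes run from 0 to 127
WIG_NO_DATA = 128   # assumption: this byte marks "no data at this base"

def decode_block(raw, lower_limit, data_range):
    """Map the bytes of one block to floats; no-data bytes become None."""
    values = []
    for b in raw:
        if b == WIG_NO_DATA:
            values.append(None)
        else:
            values.append(lower_limit + data_range * b / WIG_MAX_BYTE)
    return values

def read_block(path, offset, count, lower_limit, data_range):
    """Read `count` bytes at `offset` from a data file and decode them."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(count)
    return decode_block(raw, lower_limit, data_range)

# Synthetic bytes, decoded with lowerLimit and dataRange from block chr1.0.
demo = decode_block(bytes([0, 127, 128]), 0.0, 0.1144)
print(demo)
```

With this layout the resolution inside a block is dataRange / 127, so blocks with a small dataRange are stored more precisely than blocks with a large one.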

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
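
As a worked illustration of the 0-based start convention and of the bin field, the Python sketch below converts a table row to a 1-based position string and recomputes its bin. It assumes that chromEnd is exclusive (so chromEnd - chromStart equals span * count) and that the table uses the standard five-level binning scheme with 128 kb bins at the finest level. The function names are hypothetical.

```python
# Assumed standard binning scheme: offsets of each level, finest level first.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # finest bins cover 2**17 = 128 kb
BIN_NEXT_SHIFT = 3    # each coarser level is 8 times larger

def bin_from_range(start, end):
    """Return the smallest bin containing the 0-based range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range is outside the standard binning scheme")

def to_one_based(chrom, chrom_start, chrom_end):
    """Format a 0-based, end-exclusive row as a 1-based, end-inclusive position."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"

# First sample row: chromStart=102, chromEnd=1126, span=1, count=1024.
print(to_one_based("chr1", 102, 1126))
print(bin_from_range(102, 1126))
print(1126 - 102 == 1 * 1024)
```

A range query on this table would typically restrict on both chrom and the set of bins that can overlap the query range, then filter on chromStart and chromEnd.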

Open Chromatin (wgEncodeChromatinMap) Track Description
 

Description

These tracks display evidence of open chromatin in multiple cell types from the Duke/UNC/UT-Austin/EBI ENCODE group. Open chromatin was identified using two independent and complementary methods: DNaseI hypersensitivity (HS) and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE), combined with chromatin immunoprecipitation (ChIP) for select regulatory factors. Each method was verified by two detection platforms: Illumina (formerly Solexa) sequencing by synthesis, and high-resolution 1% ENCODE tiled microarrays supplied by NimbleGen.

DNaseI HS data: DNaseI is an enzyme that has long been used to map general chromatin accessibility, and DNaseI "hyperaccessibility" or "hypersensitivity" is a feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, silencers, insulators, promoters, locus control regions and novel elements. DNaseI hypersensitivity signifies chromatin accessibility following binding of trans-acting factors in place of a canonical nucleosome.

FAIRE data: FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) is a method to isolate and identify nucleosome-depleted regions of the genome. FAIRE was initially discovered in yeast and subsequently shown to identify active regulatory elements in human cells (Giresi et al., 2007). Although less well-characterized than DNase, FAIRE also appears to identify functional regulatory elements that include enhancers, silencers, insulators, promoters, locus control regions and novel elements. DNA fragments isolated by FAIRE are 100-200 bp in length, with the average length being 140 bp.

ChIP data: ChIP (Chromatin Immunoprecipitation) is a method to identify the specific location of proteins that are directly or indirectly bound to genomic DNA. By identifying the binding location of sequence-specific transcription factors, general transcription machinery components, and chromatin factors, ChIP can help in the functional annotation of the open chromatin regions identified by DNaseI HS mapping and FAIRE.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. Chromatin data displayed here represents a continuum of signal intensities. The Crawford lab recommends setting the "Data view scaling: auto-scale" option when viewing signal data in full mode. In general, for each experiment in each of the cell types, the Open Chromatin tracks contain the following views:

Peaks
Regions of enriched signal in either DNaseI HS, FAIRE, or ChIP experiments. Peaks were called based on signals created using F-Seq, a software program developed at Duke (Boyle et al., 2008b). Significant regions were determined by performing ROC analysis of sequence data using data from the 1% ENCODE arrays, and determining a cut-off value at approximately the 95% sensitivity level. The solid vertical line in the peak represents the point with highest signal. ENCODE Peaks tables contain a p-value for statistical significance. For these data, this was determined by fitting the data to a gamma distribution.
Peaks (Zinba)
Enriched regions for FAIRE data were called using ZINBA (Zero Inflated Negative Binomial Algorithm). ZINBA is a flexible statistical method that uses a generalized linear model to select genomic windows with enriched sequence counts after adjusting for relevant confounding factors such as mappability, GC content, and copy number alterations. Significant regions are selected using the set of standardized residuals below a false discovery rate (qvalue) threshold. Peaks were further refined using a shape detection algorithm to identify local maxima and boundaries of the Signal (Base Overlap) data within each significant region.
Signal (F-Seq Density)
Density graph (wiggle) of signal enrichment calculated using F-Seq for the combined set of sequences from all replicates. F-Seq employs Parzen kernel density estimation to create base pair scores (Boyle et al., 2008b). This method does not look at fixed-length windows but rather weights contributions of nearby sequences according to their distance from each base. It only considers sequences aligned four or fewer times in the genome, and uses an alignability background model to try to correct for regions where sequences cannot be aligned. For the K562, HepG2 and HeLa-S3 cell types, where there is an abnormal karyotype, a model to try to correct for amplifications and deletions was also used. No control data were used in the creation of these annotations.
Signal (Base Overlap)
An alternative version of the Signal (F-Seq Density) track annotation that provides a higher resolution view of the raw sequence data. This track also includes the combined set of sequences from all replicates. For each sequence, the aligned read is extended in the following way: for DNase, the read is extended 5 bp in both directions from its 5' aligned end where DNase cut the DNA; for FAIRE and ChIP, the sequence is extended to a fragment length of 134 bp from the 5' aligned end, representing the approximate average fragment length. The score at each base pair represents the number of extended fragments that overlap the base pair.
Alignments
Mappings of short reads to the genome (currently only available for download).
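
The read extension behind the Signal (Base Overlap) view can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's code: reads are given as (0-based 5' position, strand) pairs, the DNase extension is taken to cover the 5' base plus 5 bp on each side (11 bases), and FAIRE/ChIP fragments are extended in the direction of the read. The function names are hypothetical.

```python
from collections import Counter

FRAGMENT_LENGTH = 134  # FAIRE and ChIP fragment length given in the text
DNASE_FLANK = 5        # DNase extension on each side of the 5' end

def extended_interval(five_prime, strand, assay):
    """Return the 0-based, end-exclusive interval covered by one extended read."""
    if assay == "DNase":
        # Assumption: the 5' base itself plus 5 bp in both directions.
        return (max(five_prime - DNASE_FLANK, 0), five_prime + DNASE_FLANK + 1)
    # FAIRE and ChIP: a 134 bp fragment starting at the 5' aligned end.
    if strand == "+":
        return (five_prime, five_prime + FRAGMENT_LENGTH)
    return (max(five_prime - FRAGMENT_LENGTH + 1, 0), five_prime + 1)

def base_overlap(reads, assay):
    """Count, for each base, how many extended fragments overlap it."""
    coverage = Counter()
    for five_prime, strand in reads:
        start, end = extended_interval(five_prime, strand, assay)
        for pos in range(start, end):
            coverage[pos] += 1
    return coverage

dnase = base_overlap([(100, "+"), (103, "-")], "DNase")
faire = base_overlap([(100, "+"), (300, "-")], "FAIRE")
print(dnase[100], faire[200])
```

A per-base counter keeps the sketch short. For genome-scale data a difference array (add 1 at the start, subtract 1 at the end, then take a running sum) would avoid looping over every covered base.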
Additional data that were used to generate these tracks are located in the ENCODE Mappability track:
Uniqueness
The Duke uniqueness tracks were used to identify regions of unique sequence for different tag lengths. The tracks also identify regions where high-throughput sequence tags cannot be mapped.
Excluded Regions
The Duke excluded regions track was used to identify problematic regions for short sequence tag signal detection (such as satellites and rRNA genes). These regions of the genome were excluded from the Open Chromatin tracks.

Methods

Cells were grown according to the approved ENCODE cell culture protocols.

DNaseI hypersensitive sites were isolated using methods called DNase-seq or DNase-chip (Boyle et al., 2008a, Crawford et al., 2006). Briefly, cells were lysed with NP40, and intact nuclei were digested with optimal levels of DNaseI enzyme. DNaseI digested ends were captured from three different DNase concentrations, and material was sequenced using Illumina (Solexa) sequencing. DNase-seq data were verified using material that was hybridized to NimbleGen Human ENCODE tiling arrays (1% of the genome). Multiple independent growths (replicates) were compared to verify the reproducibility of the data. A more detailed protocol is available here.

FAIRE was performed (Giresi et al., 2007) by cross-linking proteins to DNA using 1% formaldehyde solution, and the complex was sheared using sonication. Phenol/chloroform extractions were performed to remove DNA fragments cross-linked to protein. The DNA recovered in the aqueous phase was hybridized to NimbleGen Human ENCODE tiling arrays (1% of the genome) and sequenced using a Solexa sequencing system. The ENCODE array data were used to verify the accuracy of the sequencing data, and multiple independent growths (replicates) were compared to assess the reproducibility of the data. A more detailed protocol is available here. Also see Giresi et al., 2009.

To perform ChIP, proteins were cross-linked to DNA in vivo using 1% formaldehyde solution (Bhinge et al., 2007, ENCODE Project Consortium, 2007). Cross-linked chromatin was sheared by sonication and immunoprecipitated using a specific antibody against the protein of interest. After reversal of the cross-links, the immunoprecipitated DNA was used to identify the genomic location of transcription factor binding. This was accomplished by Solexa sequencing of the ends of the immunoprecipitated DNA (ChIP-seq), as well as labeling and hybridization of the immunoprecipitated DNA to NimbleGen Human ENCODE tiling arrays (1% of the genome) along with the input DNA as reference (ChIP-chip). The ENCODE array data were used to verify the accuracy of the sequencing data, and multiple independent growths (replicates) were compared to assess the reproducibility of the data. A more detailed protocol is available here.

ENCODE Array data were normalized using the Tukey biweight normalization, and peaks were called using ChIPOTle (Buck et al., 2005) at multiple levels of significance. Regions matched on size to these peaks that were devoid of any significant signal were also created to allow for ROC analysis.
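
The page describes choosing a peak cut-off by ROC analysis against array-defined peaks and matched non-peak regions, at approximately the 95% sensitivity level, but does not give the exact procedure. The Python sketch below shows one simple way such a cut-off could be picked: take the highest score threshold that still recovers the target fraction of array peaks, then report how many non-peak regions pass it. The function name and the toy scores are hypothetical.

```python
import math

def cutoff_at_sensitivity(positive_scores, negative_scores, target=0.95):
    """Pick the highest cut-off that recovers at least `target` of the positives.

    Returns (cutoff, sensitivity, false_positive_rate).
    """
    ranked = sorted(positive_scores, reverse=True)
    # Number of positive regions that must score at or above the cut-off.
    needed = math.ceil(target * len(ranked))
    cutoff = ranked[needed - 1]
    sensitivity = sum(s >= cutoff for s in positive_scores) / len(positive_scores)
    false_positive_rate = sum(s >= cutoff for s in negative_scores) / len(negative_scores)
    return cutoff, sensitivity, false_positive_rate

# Toy sequencing scores for array-defined peaks and matched non-peak regions.
array_peaks = [i / 20 for i in range(1, 21)]
array_non_peaks = [0.01, 0.02, 0.03, 0.04, 0.12, 0.30]
cutoff, sens, fpr = cutoff_at_sensitivity(array_peaks, array_non_peaks)
print(cutoff, sens, fpr)
```

Sweeping the target from 0 to 1 and plotting sensitivity against the false positive rate traces out the ROC curve from which such a cut-off is read.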

Sequences from each experiment were aligned to the genome using Maq (Li et al., 2008) and those that aligned to 4 or fewer locations were retained. Other sequences were also filtered based on their alignment to problematic regions (such as satellites and rRNA genes). The resulting digital signal was converted to a continuous wiggle track using F-Seq that employs Parzen kernel density estimation to create base pair scores (Boyle et al., 2008b). Discrete DNase HS, FAIRE, and ChIP sites (peaks) were identified from DNase/FAIRE/ChIP-seq using F-Seq by setting a Parzen cutoff based on ROC curve analysis using peaks and non-peaks identified from DNase/FAIRE/ChIP-chip using NimbleGen Human ENCODE tiling arrays (1% of the genome).
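
To illustrate the two steps described above (keeping reads that align to 4 or fewer locations, then smoothing read positions into a continuous signal with Parzen kernel density estimation), here is a minimal Python sketch. It is not F-Seq: the Gaussian kernel, the bandwidth of 20 bp, the (position, number of alignments) read representation and the function names are all assumptions made for illustration.

```python
import math

def keep_alignable(reads, max_hits=4):
    """Keep positions of reads that align to at most `max_hits` locations."""
    return [pos for pos, hits in reads if hits <= max_hits]

def gaussian_density(positions, grid, bandwidth):
    """Parzen (kernel) density estimate of read positions at each grid point."""
    norm = 1.0 / (len(positions) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:
        # Every read contributes, with a weight that decays with its distance.
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
        density.append(norm * total)
    return density

# Hypothetical reads as (position, number of alignments in the genome).
reads = [(100, 1), (105, 2), (110, 5), (300, 1)]
kept = keep_alignable(reads)
signal = gaussian_density(kept, range(0, 400), bandwidth=20.0)
print(kept, signal[102] > signal[200])
```

Because every read contributes to every base with a distance-dependent weight, the estimate has no fixed window boundaries, which matches the description of the Signal (F-Seq Density) view.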

Input data was generated for GM12878, K562, HeLa-S3, HepG2, and HUVEC. These were used directly to create a control/background model used for F-Seq when generating signal annotations and subsequently peaks for these cell lines. These models are meant to correct for sequencing biases, alignment artifacts, and copy number changes in these cell lines. Input data is not being generated directly for other cell lines. Instead, a general background model was derived from the five Input data sets. This should provide corrections for sequencing biases and alignment artifacts, but not for cell-type-specific copy number changes.

Release Notes

This is Release 3 (Mar 2010) of this track, which includes 18 new cell line or cell/treatment experiments. In addition, a number of new experiments were added to existing cell lines. Almost all Peaks have been called anew using improved cut-offs and p-values. Finally, a second type of peak called using the ZINBA algorithm has been provided for several of the FAIRE-seq experiments. For all new versions of previously released data, the affected database tables and files include 'V2' or 'V3' in the name, and metadata is marked with "submittedDataVersion=V", followed by a number and reason for replacement. Previous versions of these files are available for download from the FTP site.

Credits

These data and annotations were created by a collaboration of multiple institutions (contact: Terry Furey):

We thank NHGRI for ENCODE funding support.

References

Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res. 2007 Jun;17(6):910-6.

Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008 Jan 25;132(2):311-22.

Boyle AP, Guinney J, Crawford GE, Furey TS. F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics. 2008 Nov 1;24(21):2537-8.

Buck MJ, Nobel AB, Lieb JD. ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol. 2005;6(11):R97.

Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS. DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006 Jul;3(7):503-9.

Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31.

The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816.

Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 2007 Jun;17(6):877-85.

Giresi PG, Lieb JD. Isolation of active regulatory elements from eukaryotic chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory Elements). Methods. 2009 Jul;48(3):233-9.

Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 Nov;18(11):1851-8.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.