Schema for HAIB RNA-seq - RNA-seq from ENCODE/HAIB
  Database: hg19    Primary Table: wgEncodeHaibRnaSeqA549Etoh02AlnRep4
BAM File: /gbdb/hg19/bbi/wgEncodeHaibRnaSeqA549Etoh02AlnRep4.bam
Format description: The fields of a SAM short read alignment, the text version of BAM.
See the SAM Format Specification for more details.
field        description
qName        Query template name - name of a read
flag         Flags. 0x10 set for reverse complement. See SAM docs for others.
rName        Reference sequence name (often a chromosome)
pos          1-based position
mapQ         Mapping quality 0-255, 255 is best
cigar        CIGAR encoded alignment string.
rNext        Ref sequence for next (mate) read. '=' if same as rName, '*' if no mate
pNext        Position (1-based) of next (mate) sequence. May be -1 or 0 if no mate
tLen         Size of DNA template for mated pairs. -size for one of mate pairs
seq          Query template sequence
qual         ASCII of Phred-scaled base QUALity+33. Just '*' if no quality scores
tagTypeVals  Tab-delimited list of tag:type:value optional extra fields
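
The field list above maps directly onto the columns of a SAM text line. The following is a minimal sketch, not UCSC's own code, of splitting one tab-delimited line into those fields and testing the 0x10 reverse-complement bit. The function names and the example line (read name, sequence, and tag) are made up for illustration.

```python
# Field names as listed in the schema above (UCSC naming, not the SAM spec's).
SAM_FIELDS = ["qName", "flag", "rName", "pos", "mapQ", "cigar",
              "rNext", "pNext", "tLen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapQ", "pNext", "tLen"}


def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into a dict of fields."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Columns after the 11 mandatory ones are optional tag:type:value fields.
    record["tagTypeVals"] = cols[len(SAM_FIELDS):]
    return record


def is_reverse_strand(record):
    """True when flag bit 0x10 (reverse complement) is set."""
    return bool(record["flag"] & 0x10)


# Hypothetical alignment line, built only to exercise the parser.
example = "\t".join(["read1", "16", "chr1", "100", "255", "4M",
                     "*", "0", "0", "ACGT", "BBBB", "NM:i:0"])
rec = parse_sam_line(example)
print(rec["rName"], rec["pos"], is_reverse_strand(rec))
```

In practice a BAM file would be read with a dedicated library or converted to text first; the sketch only shows how the named fields line up with the columns.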

Sample Rows
 
qName                                  flag  rName  pos    mapQ  cigar  rNext  pNext  tLen  seq                                   qual                                  tagTypeVals
HWI-EAS149_3:1:113:677:494:0:1:1       16    chr1   10585  255   36M    *      0      0     TCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCC  #6A@ABB?A?BBAABABBAABBBBB@BBA@BB@BB@  XA:C:0 MD:Z:36 NM:C:0
ILLUMINA-EAS45_4:1:57:1030:1861:0:1:1  16    chr1   11811  255   36M    *      0      0     NNNAATGTAGTTTAAACTAGATTGCCAGCACCGGGT  ####DD:?EA>F:DAGG'@B750CF?C2D/70EGGG  XA:C:1 MD:Z:0C0T0G0C13G18 NM:C:5
HWI-EAS149_3:1:117:594:1237:0:1:1      0     chr1   12033  255   36M    *      0      0     GGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGG  ABA=??AABA<<BA1A<<:A9AA=A=94===2=13=  XA:C:0 MD:Z:36 NM:C:0
HWI-EAS149_3:1:109:1472:999:0:1:1      0     chr1   12088  255   36M    *      0      0     GCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTT  BBBBBBBBBBBAB>BAAA??ABAA>BBB?AAA@?BA  XA:C:0 MD:Z:36 NM:C:0
HWI-EAS149_3:1:16:1623:17:0:1:1        0     chr1   12150  255   36M    *      0      0     TAATACCACAACCAGGCATAGGGGAAAGATTGGAGN  @=?99A:=BA@BA4;B>>C8@;;@3)7>286;>:##  XA:C:0 MD:Z:35G0 NM:C:1
ILLUMINA-EAS45_4:1:44:630:884:0:1:1    16    chr1   12154  255   36M    *      0      0     NGNACAACCAGGCATAGGGGAAAGATTGGAGGAAAG  ###ADDGAGEGGE;DFGGGGGGGGGDBGGGGGGGGG  XA:C:0 MD:Z:0A0C0C33 NM:C:3
HWI-EAS149_3:1:38:786:129:0:1:1        0     chr1   12276  255   36M    *      0      0     CTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGC  B;BAA=AA?ABBA?AAAABB5?A>5A>?>=<8>4:6  XA:C:0 MD:Z:36 NM:C:0
ILLUMINA-EAS45_4:1:89:1696:812:0:1:1   0     chr1   12672  255   36M    *      0      0     CGACGGCCGACTTGGATCACACTCTTGTGAGTGNAN  GHEHGBHHDDHHBFGGGHGHEHGGGGHHBE?DD#?#  XA:C:0 MD:Z:33T0C0C0 NM:C:3
ILLUMINA-EAS45_4:1:40:870:539:0:1:1    0     chr1   12977  255   36M    *      0      0     CAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTNTA  HF=GBGEFEHHE@HGHEHHGGFHGGH18HH@BF###  XA:C:0 MD:Z:33C1G0 NM:C:2
HWI-EAS149_3:1:30:542:1817:0:1:1       0     chr1   13185  255   36M    *      0      0     CTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAG  BBBA@BBBBB=ABBBBBBBBBBBBBBB@AA>A?AA?  XA:C:0 MD:Z:36 NM:C:0
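
The qual column is described as Phred-scaled base quality plus 33, stored as ASCII. As an illustrative sketch (the function names are hypothetical), the characters can be decoded back to integer scores and to the standard Phred error probability 10^(-Q/10). The string "#6A@" is taken to be the start of the first sample row's quality string, assuming the row splits into 36-base seq and qual columns.

```python
def phred33_to_scores(qual):
    """Decode a Phred+33 ASCII quality string into integer scores."""
    if qual == "*":          # '*' means no quality scores were stored
        return []
    return [ord(ch) - 33 for ch in qual]


def error_probability(score):
    """Standard Phred relationship: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred33_to_scores("#6A@")
print(scores)                                   # [2, 21, 32, 31]
print([round(error_probability(q), 4) for q in scores])
```

A '#' character therefore corresponds to a score of 2, which is why runs of '#' appear next to N bases in the sample rows.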

HAIB RNA-seq (wgEncodeHaibRnaSeq) Track Description
 

Description

This track was produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). Biological replicates of ENCODE cell lines were grown on separate culture plates; total RNA was purified and polyA-selected twice. The mRNA extract was then fragmented by magnesium-catalyzed hydrolysis and reverse transcribed to cDNA by random priming and amplification. The cDNA was sequenced on an Illumina Genome Analyzer (GAI or GAIIx).

The sequenced reads were aligned to the NCBI Build37 (hg19) version of the human genome using the sequence alignment programs ELAND (Illumina) or Bowtie (Langmead et al., 2009). The first 10 bases of each read show a weak characteristic nucleotide bias of unknown origin. This RNA-seq protocol does not specify the coding strand, so the strand of origin is ambiguous at loci where both strands are transcribed.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks (cell lines, replicates and growth conditions) that display individually on the browser. Instructions for configuring multi-view tracks are here. The following views are in this track:

Alignments
  The Alignments view shows reads mapped to the genome. See the Bowtie Manual for more information about Bowtie's SAM output (including tag definitions) and the SAM Format Specification for more information on the SAM/BAM file format.
  The reads are named using the following convention:
    Lane #:Tile #:X-coordinate:Y-coordinate
Raw Signal
  Density graph of signal enrichment based on normalized aligned read density (Reads Per Million, RPM). RPM is reported in the score field and equals the number of reads at a position divided by the total number of reads in millions. The Raw Signal view displays dense, continuous data as a graph, and the RPM measure helps in comparing the relative amount of a given transcript across multiple samples.
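
The RPM arithmetic can be sketched as follows. This reads the definition above as "reads at the position divided by (total reads / 1,000,000)"; the function name and the example counts are hypothetical and are not taken from any ENCODE dataset.

```python
def reads_per_million(reads_at_position, total_reads):
    """RPM: reads at a position, normalized by total reads in millions."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads_at_position / (total_reads / 1_000_000)


# Hypothetical numbers: 50 reads cover a position in a 20-million-read run.
rpm = reads_per_million(50, 20_000_000)
print(rpm)  # 2.5
```

Because the denominator scales with sequencing depth, the same transcript sequenced twice as deeply yields roughly the same RPM, which is what makes the signal comparable across samples.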

Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.

Methods

Experimental Procedures

Cells were grown according to the approved ENCODE cell culture protocols. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. The mRNA was isolated from at least 10 µg of total RNA with oligo(dT) two times (Dynabeads mRNA Purification Kit, Invitrogen). Alternatively, cells were lysed and mRNA was purified directly two times with oligo(dT) (Dynabeads mRNA DIRECT Kit, Invitrogen). A quantity of 100 ng of mRNA was fragmented by magnesium-catalyzed hydrolysis and reverse transcribed to cDNA by random priming according to the protocol in Mortazavi et al. (2008). The cDNA was prepared for sequencing on the Genome Analyzer flowcell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The sequencing libraries were size-selected around 225 bp and amplified with 15 rounds of PCR.

Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Single end reads of 36 nt in length were obtained.

Data Processing and Analysis

FastQ files were made from qseq files generated by the Illumina pipeline (Casava 1.7). The Raw Signal files (bigWig) were generated from bedGraph files; the score (RPM) was calculated as the number of reads at a position divided by the total number of reads in millions.

Casava export files were aligned to the NCBI Build37 (hg19) version of the human genome with ELAND (Illumina), generating SAM files. FastQ files of experiments that were previously aligned to NCBI Build36 (hg18) were aligned to NCBI Build37 (hg19) using Bowtie (Langmead et al., 2009; parameters: -S -n 2 -k 11 -m 10 --best), also generating SAM files. SAM files were converted to BAM files with SAMtools (Li et al., 2009).
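
For orientation, the Bowtie parameters quoted above can be assembled into a command line as in the sketch below. Only the option string (-S -n 2 -k 11 -m 10 --best) comes from this description; the index prefix, file names, helper names, and the exact SAMtools invocation are assumptions for illustration. The commands are only printed, not executed.

```python
import shlex

# Options quoted in the track description for the hg18-to-hg19 realignment.
BOWTIE_OPTIONS = ["-S", "-n", "2", "-k", "11", "-m", "10", "--best"]


def bowtie_command(index_prefix, fastq_path, sam_path):
    """Build a Bowtie 1 command: bowtie [options] <index> <reads> <output>."""
    return ["bowtie", *BOWTIE_OPTIONS, index_prefix, fastq_path, sam_path]


def sam_to_bam_command(sam_path, bam_path):
    """Build a SAMtools command converting SAM text to BAM (assumed flags)."""
    return ["samtools", "view", "-b", "-S", "-o", bam_path, sam_path]


# Hypothetical index prefix and file names.
align = bowtie_command("hg19", "reads.fastq", "reads.sam")
convert = sam_to_bam_command("reads.sam", "reads.bam")
print(shlex.join(align))
print(shlex.join(convert))
```

In Bowtie 1, -n 2 allows up to two mismatches in the seed, -k 11 reports up to 11 alignments per read, -m 10 suppresses reads with more than 10 reportable alignments, and --best orders the reported alignments from best to worst.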

Gene expression within GENCODE V7 (Harrow et al., 2006) gene models was estimated using Cufflinks v0.9.3 (Roberts et al., 2011). Estimates of transcript abundance were reported in Fragments Per Kilobase of exon per Million fragments mapped (FPKM). FPKM is calculated by dividing the total number of fragments that align to the gene model by the size of the spliced transcript (exons) in kilobases. This number is then divided by the total number of reads in millions for the experiment. FPKM is reported in the last column of the GTF (TranscriptGencV7) files.
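
The FPKM arithmetic described above can be written out as a short sketch. It shows only the two normalization steps, not Cufflinks' statistical model for assigning fragments to transcripts; the function name and the example values are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon per Million fragments mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    # Step 1: normalize by spliced transcript (exon) length in kilobases.
    per_kilobase = fragments / (transcript_length_bp / 1000)
    # Step 2: normalize by sequencing depth in millions of fragments.
    return per_kilobase / (total_mapped_fragments / 1_000_000)


# Hypothetical gene: 500 fragments on a 2,000 bp spliced transcript,
# in an experiment with 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

For the single-end reads used here, each read counts as one fragment.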

Raw Data (fastQ), Raw Signal (bigWig), Alignments (BAM) and Transcript GENCODE V7 (GTF) files are available from the Downloads page.

Verification

  • The mapped data were visually inspected to verify that the majority of the reads fell within known exons.
  • Biological replicates confirm expression measurements with r > 0.90.
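
A replicate check of this kind can be sketched as a correlation between the expression values of two replicates. The description does not say which correlation statistic was used or on what scale; the sketch assumes Pearson's r on raw values, and the replicate values are toy numbers, not ENCODE data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy expression values for the same four genes in two replicates.
rep1 = [1.0, 2.0, 3.0, 4.0]
rep2 = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(rep1, rep2)
print(r > 0.90)  # True
```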

Release Notes

Update (May 2012): the labels of the Raw Signal subtracks have been updated because they were originally labeled as Signals instead of Raw Signals.

This is the first NCBI Build37 (hg19) release of this track (Feb 2012).
This release includes the 3 datasets (Jurkat, A549/DEX100nm, and A549/EtOH2pct) previously released on NCBI Build36 (hg18) and adds data for several more cell types and growth conditions in replicate. Four types of download files are available for each replicate including the Raw Data (fastQ), Transcript GENCODE V7 (GTF), Raw Signal (bigWig), and Alignments (BAM).

Credits

These data were produced by the Dr. Richard Myers Lab at the HudsonAlpha Institute for Biotechnology.

Contact: Dr. Florencia Pauli

References

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 Jul;5(7):621-8.

Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12(3):R22.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.