Schema for HAIB RNA-seq - RNA-seq from ENCODE/HAIB

Database: hg19 Primary Table: wgEncodeHaibRnaSeqA549Etoh02AlnRep4
BAM File: /gbdb/hg19/bbi/wgEncodeHaibRnaSeqA549Etoh02AlnRep4.bam
Format description: The fields of a SAM short read alignment, the text version of BAM.
See the SAM Format Specification for more details

field	description
`qName`	Query template name - name of a read
`flag`	Flags. 0x10 set for reverse complement. See SAM docs for others.
`rName`	Reference sequence name (often a chromosome)
`pos`	1-based leftmost mapping position
`mapQ`	Mapping quality 0-255, 255 is best
`cigar`	CIGAR encoded alignment string.
`rNext`	Ref sequence for next (mate) read. '=' if same as rName, '*' if no mate
`pNext`	Position (1-based) of next (mate) sequence. May be -1 or 0 if no mate
`tLen`	Size of DNA template for mated pairs. -size for one of mate pairs
`seq`	Query template sequence
`qual`	ASCII of Phred-scaled base QUALity+33. Just '*' if no quality scores
`tagTypeVals`	Tab-delimited list of tag:type:value optional extra fields
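
As a minimal sketch of how two of the fields above can be interpreted, the Python below tests the 0x10 bit of `flag` and converts a `qual` string to Phred scores by subtracting 33 from each character code. The function names are illustrative assumptions, not part of the Genome Browser or SAMtools.

```python
REVERSE_COMPLEMENT = 0x10  # flag bit described in the field table


def is_reverse(flag):
    """Return True if the 0x10 (reverse complement) bit is set in the flag."""
    return bool(flag & REVERSE_COMPLEMENT)


def phred_scores(qual):
    """Convert a qual string (ASCII of Phred+33) into a list of integer scores.

    A bare '*' means no quality scores were stored, so an empty list is returned.
    """
    if qual == "*":
        return []
    return [ord(ch) - 33 for ch in qual]


print(is_reverse(16), is_reverse(0))
print(phred_scores("#6A"))
```

In the sample rows, a flag of 16 therefore marks a read aligned to the reverse strand, and a flag of 0 marks a forward-strand read.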

Sample Rows

qName	flag	rName	pos	mapQ	cigar	rNext	seq	qual	tagTypeVals
HWI-EAS149_3:1:113:677:494:0:1:1	16	chr1	10585	255	36M	*	TCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCC	#6A@ABB?A?BBAABABBAABBBBB@BBA@BB@BB@	XA:C:0 MD:Z:36 NM:C:0
ILLUMINA-EAS45_4:1:57:1030:1861:0:1:1	16	chr1	11811	255	36M	*	NNNAATGTAGTTTAAACTAGATTGCCAGCACCGGGT	####DD:?EA>F:DAGG'@B750CF?C2D/70EGGG	XA:C:1 MD:Z:0C0T0G0C13G18 NM:C:5
HWI-EAS149_3:1:117:594:1237:0:1:1	0	chr1	12033	255	36M	*	GGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGG	ABA=??AABA<<BA1A<<:A9AA=A=94===2=13=	XA:C:0 MD:Z:36 NM:C:0
HWI-EAS149_3:1:109:1472:999:0:1:1	0	chr1	12088	255	36M	*	GCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTT	BBBBBBBBBBBAB>BAAA??ABAA>BBB?AAA@?BA	XA:C:0 MD:Z:36 NM:C:0
HWI-EAS149_3:1:16:1623:17:0:1:1	0	chr1	12150	255	36M	*	TAATACCACAACCAGGCATAGGGGAAAGATTGGAGN	@=?99A:=BA@BA4;B>>C8@;;@3)7>286;>:##	XA:C:0 MD:Z:35G0 NM:C:1
ILLUMINA-EAS45_4:1:44:630:884:0:1:1	16	chr1	12154	255	36M	*	NGNACAACCAGGCATAGGGGAAAGATTGGAGGAAAG	###ADDGAGEGGE;DFGGGGGGGGGDBGGGGGGGGG	XA:C:0 MD:Z:0A0C0C33 NM:C:3
HWI-EAS149_3:1:38:786:129:0:1:1	0	chr1	12276	255	36M	*	CTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGC	B;BAA=AA?ABBA?AAAABB5?A>5A>?>=<8>4:6	XA:C:0 MD:Z:36 NM:C:0
ILLUMINA-EAS45_4:1:89:1696:812:0:1:1	0	chr1	12672	255	36M	*	CGACGGCCGACTTGGATCACACTCTTGTGAGTGNAN	GHEHGBHHDDHHBFGGGHGHEHGGGGHHBE?DD#?#	XA:C:0 MD:Z:33T0C0C0 NM:C:3
ILLUMINA-EAS45_4:1:40:870:539:0:1:1	0	chr1	12977	255	36M	*	CAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTNTA	HF=GBGEFEHHE@HGHEHHGGFHGGH18HH@BF###	XA:C:0 MD:Z:33C1G0 NM:C:2
HWI-EAS149_3:1:30:542:1817:0:1:1	0	chr1	13185	255	36M	*	CTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAG	BBBA@BBBBB=ABBBBBBBBBBBBBBB@AA>A?AA?	XA:C:0 MD:Z:36 NM:C:0
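
The sample rows can be read programmatically. The sketch below, in Python, assumes the ten columns displayed above (pNext and tLen are not shown in the sample table) and assumes that the optional tags are whitespace-separated `tag:type:value` items, as they appear in the rows. The helper names `parse_row` and `md_mismatches` are hypothetical. For ungapped alignments such as the 36M reads shown here, the number of reference bases listed in the MD tag should match the NM edit distance.

```python
import re

# Column order as shown in the Sample Rows table
# (pNext and tLen are not displayed there).
COLUMNS = ["qName", "flag", "rName", "pos", "mapQ",
           "cigar", "rNext", "seq", "qual", "tagTypeVals"]

# Tag type codes that hold integer values.
INTEGER_TAG_TYPES = {"c", "C", "s", "S", "i", "I"}


def parse_row(line):
    """Split one tab-delimited sample row into a dict keyed by column name."""
    row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    for key in ("flag", "pos", "mapQ"):
        row[key] = int(row[key])
    tags = {}
    for item in row["tagTypeVals"].split():
        tag, tag_type, value = item.split(":", 2)
        tags[tag] = int(value) if tag_type in INTEGER_TAG_TYPES else value
    row["tags"] = tags
    return row


def md_mismatches(md):
    """Count mismatched reference bases in an MD string, ignoring deletions (^...)."""
    without_deletions = re.sub(r"\^[A-Za-z]+", "", md)
    return len(re.findall(r"[A-Za-z]", without_deletions))


example = ("HWI-EAS149_3:1:16:1623:17:0:1:1\t0\tchr1\t12150\t255\t36M\t*\t"
           "TAATACCACAACCAGGCATAGGGGAAAGATTGGAGN\t"
           "@=?99A:=BA@BA4;B>>C8@;;@3)7>286;>:##\t"
           "XA:C:0 MD:Z:35G0 NM:C:1")
row = parse_row(example)
print(row["pos"], row["tags"]["MD"], md_mismatches(row["tags"]["MD"]), row["tags"]["NM"])
```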

HAIB RNA-seq (wgEncodeHaibRnaSeq) Track Description


Description

This track was produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). Biological replicates of ENCODE cell lines were grown on separate culture plates; total RNA was purified and polyA-selected twice. The mRNA extract was then fragmented by magnesium-catalyzed hydrolysis and reverse transcribed to cDNA by random priming and amplification. The cDNA was sequenced on an Illumina Genome Analyzer (GAI or GAIIx). The DNA sequences were aligned to the NCBI Build37 (hg19) version of the human genome using the sequence alignment programs ELAND (Illumina) or Bowtie (Langmead et al., 2009).

The first 10 residues of sequencing have a weak characteristic nucleotide bias of unknown origin. This RNA-seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks (cell lines, replicates and growth conditions) that display individually on the browser. Instructions for configuring multi-view tracks are here. The following views are in this track:

Alignments: The Alignments view shows reads mapped to the genome. See the Bowtie Manual for more information about the SAM Bowtie output (including tag definitions) and the SAM Format Specification for more information on the SAM/BAM file format. The reads are named using the following convention:

Lane #:Tile #:X-coordinate:Y-coordinate

Raw Signal: Density graph of signal enrichment based on a normalized aligned read density (Reads Per Million, RPM). RPM is reported in the score field and is equal to the number of reads at that position divided by the total number of reads in millions, that is, reads at the position / (total reads / 1,000,000).
The Raw Signal view displays dense, continuous data as a graph, and the RPM measure assists in visualizing the relative amount of a given transcript across multiple samples.

Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.

Methods

Experimental Procedures

Cells were grown according to the approved ENCODE cell culture protocols. Cells were lysed in RLT buffer (Qiagen RNEasy kit) and processed on RNEasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. The mRNA was isolated from at least 10 ug of total RNA with oligo(dT) two times (Dynabeads mRNA Purification Kit, Invitrogen). Alternatively, cells were lysed and mRNA was purified directly two times with oligo(dT) (Dynabeads mRNA DIRECT Kit, Invitrogen). A quantity of 100 ng of mRNA was fragmented by magnesium-catalyzed hydrolysis and reverse transcribed to cDNA by random priming according to the protocol in Mortazavi et al. (2008). The cDNA was prepared for sequencing on the Genome Analyzer flowcell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The sequencing libraries were size-selected around 225 bp and amplified with 15 rounds of PCR. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Single-end reads of 36 nt in length were obtained.

Data Processing and Analysis

FastQ files were made from qseq files generated by the Illumina pipeline (Casava 1.7). The Raw Signal files (bigWig) were generated from bedgraph files, and the score was calculated as the number of reads at that position divided by the total number of reads in millions. Casava export files were aligned to the NCBI Build37 (hg19) version of the human genome with ELAND (Illumina), generating SAM files.
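
The Raw Signal score described above (reads at a position divided by the total number of reads in millions) can be illustrated with a short Python sketch. The function name `rpm` and the example numbers are assumptions for illustration. The track description does not state whether "total number of reads" means all sequenced reads or only aligned reads, so the total is left as a plain input.

```python
def rpm(reads_at_position, total_reads):
    """Reads Per Million: reads at a position divided by total reads in millions."""
    return reads_at_position / (total_reads / 1_000_000)


# Hypothetical example: 25 reads cover a position in an experiment
# with 12.5 million reads in total.
print(rpm(25, 12_500_000))  # 2.0
```

Because the score is scaled by library size, the same transcript can be compared across samples that were sequenced to different depths.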
FastQ files of experiments that were previously aligned to NCBI Build36 (hg18) were aligned to NCBI Build37 (hg19) using Bowtie (Langmead et al., 2009; parameters: -S -n 2 -k 11 -m 10 --best), also generating SAM files. SAM files were converted to BAM files with SAMtools (Li et al., 2009).

Gene expression within GENCODE V7 (Harrow et al., 2006) gene models was estimated using Cufflinks v0.9.3 (Roberts et al., 2011). Estimates of transcript abundance were reported in Fragments Per Kilobase of exon per Million fragments mapped (FPKM). FPKM is calculated by dividing the total number of fragments that align to the gene model by the size of the spliced transcript (exons) in kilobases. This number is then divided by the total number of reads in millions for the experiment. FPKM is reported in the last column of the GTF (TranscriptGencV7) files.

Raw Data (fastQ), Raw Signal (bigWig), Alignments (BAM) and Transcript GENCODE V7 (GTF) files are available from the Downloads page.

Verification

The mapped data were visually inspected to verify the majority of the reads fell within known exons. Biological replicates confirm expression measurements with r > 0.90.

Release Notes

Update (May 2012): the labels of the Raw Signal subtracks have been updated because they were originally labeled as Signals instead of Raw Signals.

This is the first NCBI Build37 (hg19) release of this track (Feb 2012). This release includes the 3 datasets (Jurkat, A549/DEX100nm, and A549/EtOH2pct) previously released on NCBI Build36 (hg18) and adds data for several more cell types and growth conditions in replicate. Four types of download files are available for each replicate including the Raw Data (fastQ), Transcript GENCODE V7 (GTF), Raw Signal (bigWig), and Alignments (BAM).

Credits

These data were produced by the Dr. Richard Myers Lab at the HudsonAlpha Institute for Biotechnology.

Contact: Dr. Florencia Pauli

References

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 Jul;5(7):621-8.

Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12(3):R22.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.
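
Worked example for the FPKM definition given under Data Processing and Analysis: the Python sketch below reproduces only the arithmetic stated there (fragments divided by spliced transcript size in kilobases, then divided by total reads in millions). It is not the statistical estimation that Cufflinks performs, and the function name and input numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_reads):
    """FPKM arithmetic as described in the track documentation.

    fragments            -- fragments aligned to the gene model
    transcript_length_bp -- size of the spliced transcript (exons) in bases
    total_reads          -- total number of reads in the experiment
    """
    kilobases = transcript_length_bp / 1000
    millions = total_reads / 1_000_000
    return fragments / kilobases / millions


# Hypothetical example: 500 fragments on a 2,500-base spliced transcript
# in an experiment with 20 million reads.
print(fpkm(500, 2500, 20_000_000))  # 10.0
```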
