Schema for HudsonAlpha RNA-seq - ENCODE HudsonAlpha RNA-seq


| field | example | SQL type | info | description |
|---|---|---|---|---|
| bin | 585 | smallint(5) unsigned | range | Indexing field to speed chromosome range queries. |
| chrom | chr1 | varchar(255) | values | Reference sequence chromosome or scaffold |
| chromStart | 55424 | int(10) unsigned | range | Start position in chromosome |
| chromEnd | 59692 | int(10) unsigned | range | End position in chromosome |
| name | OR4F5 | varchar(255) | values | Name of item |
| score | | int(10) unsigned | range | Optional score, nominal range 0-1000 |
| strand | | char(1) | values | + or - |
| thickStart | 58953 | int(10) unsigned | range | Start of where display should be thick (start codon) |
| thickEnd | 59691 | int(10) unsigned | range | End of where display should be thick (stop codon) |
| reserved | | int(10) unsigned | range | Used as itemRgb as of 2004-11-22 |
| blockCount | | int(10) unsigned | range | Number of blocks |
| blockSizes | 12,83,793, | longblob | | Comma separated list of block sizes |
| chromStarts | 0,327,3475, | longblob | | Start positions relative to chromStart |
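
The bin field can be recomputed from chromStart and chromEnd. The following is a minimal sketch, assuming this table uses the standard UCSC binning scheme (128 kb bins at the finest level, each coarser level 8 times larger, covering up to 512 Mb). The function name `ucsc_bin` and the constant names are illustrative and do not come from this page.

```python
# Sketch of the standard UCSC binning scheme. That this table uses the
# standard scheme is an assumption; the names here are illustrative only.
BIN_OFFSETS = (585, 73, 9, 1, 0)  # first bin number of each level, finest to coarsest
FIRST_SHIFT = 17                  # 2**17 bases = 128 kb, size of the finest bins
NEXT_SHIFT = 3                    # each coarser level holds 8 bins of the level below


def ucsc_bin(chrom_start: int, chrom_end: int) -> int:
    """Return the smallest bin that fully contains [chrom_start, chrom_end)."""
    start_bin = chrom_start >> FIRST_SHIFT
    end_bin = (chrom_end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            # Both ends fall in the same bin at this level.
            return offset + start_bin
        # Otherwise move to the next coarser level.
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("interval does not fit in the 512 Mb standard binning scheme")


print(ucsc_bin(55424, 59692))    # 585
print(ucsc_bin(850983, 869824))  # 591
```

Both results agree with the bin values shown for these coordinates in the sample rows on this page. A range query only needs to scan the bins that can overlap the requested interval, which is why the field speeds up lookups.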

| bin | chrom | chromStart | chromEnd | name | score | strand | thickStart | thickEnd | reserved | blockCount | blockSizes | chromStarts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 585 | chr1 | 55424 | 59692 | OR4F5 | | | 58953 | 59691 | | | 12,83,793, | 0,327,3475, |
| 585 | chr1 | 58953 | 59871 | OR4F5 | | | 58953 | 59871 | | | 918, | |
| 587 | chr1 | 357521 | 358460 | OR4F16 | | | 357521 | 358460 | | | 939, | |
| 589 | chr1 | 610958 | 611897 | OR4F16 | | | 610958 | 611897 | | | 939, | |
| 591 | chr1 | 850983 | 869824 | SAMD11 | 300 | | 851184 | 869396 | | | 60,92,182,51,125,90,186,163,116,79,500,125,111,674, | 0,181,4414,5298,10031,13299,13534,15403,16395,16669,16818,17512,17957,18167, |
| 591 | chr1 | 861014 | 869824 | SAMD11 | 300 | | 861072 | 869396 | | | 125,90,138,163,116,79,500,125,111,674, | 0,3268,3503,5372,6364,6638,6787,7481,7926,8136, |
| 591 | chr1 | 869445 | 883781 | NOC2L | 375 | | 869936 | 879414 | | | 598,105,136,114,144,102,114,112,140,189,114,111,520,91,121,132,1440, | 0,839,1315,1970,2199,3928,4287,6924,7797,8209,8972,9579,9801,11720,11892,12691,12896, |
| 591 | chr1 | 869445 | 884542 | NOC2L | 375 | | 869936 | 884483 | | | 598,105,136,114,144,102,114,112,140,189,114,111,79,91,121,132,175,153,85, | 0,839,1315,1970,2199,3928,4287,6924,7797,8209,8972,9579,9801,11720,11892,12691,12896,14726,15012, |
| 591 | chr1 | 869445 | 884542 | NOC2L | 375 | | 869936 | 884483 | | | 598,90,136,114,144,102,114,112,140,189,114,111,79,91,121,132,175,153,85, | 0,854,1315,1970,2199,3928,4287,6924,7797,8209,8972,9579,9801,11720,11892,12691,12896,14726,15012, |
| 591 | chr1 | 885829 | 890958 | KLHL17 | 190 | | 885936 | 890434 | | | 214,260,122,222,117,214,145,168,89,74,182,753, | 0,706,1042,1239,1768,2117,2522,2750,3333,3520,3762,4376, |
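
The blockSizes and chromStarts fields together describe where each block (typically an exon) lies on the chromosome. As a minimal sketch, assuming the usual BED convention of zero-based, half-open coordinates, the absolute block intervals for the first OR4F5 row can be recovered as follows. The helper names `parse_int_list` and `block_intervals` are illustrative and not part of any UCSC tool.

```python
from typing import List, Tuple


def parse_int_list(field: str) -> List[int]:
    """Parse a comma-separated list of integers that may end with a trailing comma."""
    return [int(item) for item in field.split(",") if item]


def block_intervals(chrom_start: int, block_sizes: str, chrom_starts: str) -> List[Tuple[int, int]]:
    """Convert relative block starts and sizes into absolute (start, end) pairs.

    Assumes zero-based, half-open coordinates (the usual BED convention).
    """
    sizes = parse_int_list(block_sizes)
    starts = parse_int_list(chrom_starts)
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and chromStarts must list the same number of blocks")
    return [(chrom_start + rel, chrom_start + rel + size) for rel, size in zip(starts, sizes)]


# First OR4F5 row: chromStart 55424, blockSizes "12,83,793,", chromStarts "0,327,3475,"
blocks = block_intervals(55424, "12,83,793,", "0,327,3475,")
print(blocks)  # [(55424, 55436), (55751, 55834), (58899, 59692)]
```

The last block ends at 59692, which matches the chromEnd value of that row and serves as a quick consistency check on the two lists.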

Description

This track is produced as part of the ENCODE Project. It shows short tag sequencing of cDNA obtained from biological replicate samples (different culture plates) of the ENCODE cell lines. The sequences were aligned to the human genome (hg18) and UCSC known-gene splice junctions using different sequence alignment programs such as ELAND (Illumina) or Bowtie (Langmead et al., 2009).

RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on an Illumina Genome Analyzer (GA2) (Mortazavi et al., 2008). The transcriptome measurements shown on these tracks were performed on polyA-selected RNA from total cellular RNA.

Data have been produced in two formats: single reads, each of which comes from one end of a randomly primed cDNA molecule; and paired-end reads, which are obtained as pairs from both ends of cDNAs resulting from random priming. The resulting sequence reads are then informatically mapped onto the genome sequence (Alignments). Those that do not map to the genome are mapped to known RNA splice junctions (Splice Sites). These mapped reads are then counted to determine their frequency of occurrence at known gene models. Sequence reads that cluster at genome locations lacking an existing transcript model are also identified informatically and quantified.

RNA-seq is especially suited for giving information about RNA splicing patterns and for determining unequivocally the presence or absence of lower-abundance classes of RNA. As performed here, internal RNA standards are used to assist in quantification and to provide internal process controls.

This RNA-seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is apparently not fully random. This is inferred from a sequence bias in the first residues of the read population, and it likely contributes to the observed unevenness in sequence coverage across transcripts.

Display Conventions and Configuration

This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. The following views are in this track:

RPKM: RefSeq gene models are displayed shaded by their RPKM (Reads Per Kilobase of exon per Million reads) value. RPKM is reported in the score of each element, and each element is shaded using a gray scale that becomes darker as RPKM increases. The RPKM measure assists in visualizing the relative amount of a given transcript across multiple samples.
Alignments: The Alignments view shows reads mapped to the genome. Alignments are colored by cell type.

Methods

Gene expression is measured in Reads Per Kilobase exon per Million reads (RPKM; Mortazavi et al., 2008). RNA-seq reads are aligned to RefSeq gene models. RPKM is then calculated by dividing the total number of reads that align to the gene model (RefSeq) by the size of the spliced transcript in kilobases. This number is then divided by the total number of reads in millions for the experiment. For example, if x reads align to a RefSeq gene whose spliced transcript is y kb in size and there are z million reads in the experiment, then RPKM = x/(y*z).
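
The RPKM formula above can be written out as a short calculation. This is a minimal sketch of the arithmetic only, not the ERANGE implementation. The function name `rpkm` and the example numbers are made up for illustration.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_reads: int) -> float:
    """Compute RPKM = x / (y * z).

    x = read_count, the reads aligned to the gene model
    y = spliced transcript length in kilobases
    z = total reads in the experiment, in millions
    """
    if transcript_length_bp <= 0 or total_reads <= 0:
        raise ValueError("transcript length and total reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical example: 500 reads on a 2 kb spliced transcript, 10 million reads in total.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

Dividing by transcript length and by sequencing depth is what makes the value comparable between genes of different sizes and between experiments with different read totals.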

Cells were grown according to the approved ENCODE cell culture protocols. A total of 2 x 10^7 cells were lysed in 4 ml of RLT buffer (Qiagen RNeasy kit) and processed on 2 RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. 100 ng of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). Following alignment of the sequence reads to the genome assembly as described above, the sequence reads were further analyzed using the ERANGE 3.0 software package, which quantifies the number of reads falling within the mapped boundaries of known transcripts from the Gencode annotations. For quantification, ERANGE assigns both genomically unique reads and reads that occur in 2-10 genomic locations.

Verification

Known exon maps as displayed on the genome browser are confirmed by the alignment of sequence reads.
Known spliced exons are detected at the expected frequency for transcripts of given abundance.
RT-qPCR confirms expression measurements with r > 0.8.

Credits

Myers Group: Florencia Pauli, Tim Reddy

Wold Group: Ali Mortazavi, Brian Williams, Diane Trout, Brandon King, Ken McCue, Lorian Schaeffer.

Illumina gene expression group: Gary Schroth, Shujun Luo, Eric Vermaas.

Contacts: Tim Reddy and Flo Pauli (experimental).

References

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold BJ. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008 Jul; 5(7):621-628.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009 Mar; 10:R25.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.