Schema for CSHL Sm RNA-seq - ENCODE Cold Spring Harbor Labs Small RNA-seq
  Database: hg18    Primary Table: wgEncodeCshlShortRnaSeqAlignmentsProstateCellShort    Data last updated: 2009-11-20
Big Bed File Download: /gbdb/hg18/bbi/wgEncodeCshlShortRnaSeqAlignmentsProstateCellShort.bb
Item Count: 58,102,701
Format description: Tag Alignment format (BED 3+)
field       example          description
chrom       chr1             Reference sequence chromosome or scaffold
chromStart  164838310        Start position in chromosome
chromEnd    164838325        End position in chromosome
sequence    CCGGCCCCCCGTCCT  Sequence of this read
score       5                Indicates mismatches, quality, or other measurement (0-1000)
strand      -                Orientation of this read (+ or -)

Sample Rows
 
chrom  chromStart  chromEnd   sequence           score  strand
chr1   164838310   164838325  CCGGCCCCCCGTCCT    5      -
chr1   164838310   164838325  CCGGCCCCCCGTCCT    5      -
chr1   164838310   164838326  CCCGGCCCCCCGTCCT   3      -
chr1   164838310   164838326  CCCGGCCCCCCGTCCT   3      -
chr1   164838310   164838326  CCCGGCCCCCCGTCCT   3      -
chr1   164838310   164838327  CCCCGGCCCCCCGTCCT  3      -
chr1   164838310   164838327  TCCCGGCCCCCCGTCCT  10     -
chr1   164838311   164838325  CCGGCCCCCCGTCC     5      -
chr1   164838311   164838326  CCCGGCCCCCCGTCC    3      -
chr1   164838311   164838327  CCCCGGCCCCCCGTCC   3      -
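
As an illustration of the six-field layout above, the following Python sketch parses one row into typed fields. It assumes the rows have already been exported from the bigBed file to whitespace-delimited text; the TagAlign class and parse_tag_align function are hypothetical names introduced here, not part of any UCSC tool.

```python
from typing import NamedTuple


class TagAlign(NamedTuple):
    """One row of the table, using the field names from the schema."""
    chrom: str
    chrom_start: int  # BED-style start: 0-based
    chrom_end: int    # BED-style end: exclusive
    sequence: str
    score: int
    strand: str


def parse_tag_align(line: str) -> TagAlign:
    """Split a whitespace-delimited row into its six fields."""
    chrom, start, end, sequence, score, strand = line.split()
    return TagAlign(chrom, int(start), int(end), sequence, int(score), strand)


# First sample row from the table above.
row = parse_tag_align("chr1\t164838310\t164838325\tCCGGCCCCCCGTCCT\t5\t-")

# In the sample rows, the interval length equals the read length.
print(row.chrom_end - row.chrom_start, len(row.sequence))
```

In every sample row shown, chromEnd minus chromStart equals the length of the sequence field, which makes a convenient sanity check when parsing.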

CSHL Sm RNA-seq (wgEncodeCshlShortRnaSeq) Track Description
 

Description

This track shows next-generation sequencing data for RNAs 20-200 nt in length, isolated from tissues or subcellular compartments of ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome.

This cloning protocol generates directional libraries that are read from the 5′ ends of the inserts, which should largely correspond to the 5′ ends of the mature RNAs. The libraries were sequenced on a Solexa platform for a total of 36, 50, or 76 cycles; however, the reads undergo post-processing that trims their 3′ ends. Consequently, the mapped read lengths are variable.

Display Conventions and Configuration

To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

Color differences among the views are arbitrary. They provide a visual cue for distinguishing between the different cell types and compartments.

Transfrags
Identical reads were collapsed while maintaining their multiplicity information and reported as "transfrags". "Y" means that the transfrag underwent clipping prior to mapping. "N" indicates that the transfrag did not undergo clipping. The Transfrags view includes all transfrags before filtering.
Raw Signals
The Raw Signal views show the density of aligned tags on the plus and minus strands.
Alignments
The Alignments view shows reads mapped to the genome and indicates where bases may mismatch. Every mapped read is displayed, that is, reads are not collapsed. Sequences determined to be transcribed on the positive strand are shown in blue, and those on the negative strand in orange. Sequences for which the direction of transcription could not be determined are shown in black. The score of each alignment is the number of locations in the genome to which the read aligned; for example, a score of two means that the read aligned to two different locations.
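
The idea of a per-strand tag density, as in the Raw Signal views, can be sketched in Python as follows. This is only an illustration: it counts each alignment once per covered base, whereas the exact weighting used to build the Raw Signal subtracks is not stated here. The function name strand_coverage and the tuple layout are assumptions made for the example.

```python
from collections import Counter


def strand_coverage(alignments, strand):
    """Count how many alignments on `strand` cover each base.

    `alignments` is an iterable of (chrom, start, end, strand) tuples with
    BED-style coordinates (0-based start, exclusive end).
    """
    depth = Counter()
    for chrom, start, end, aln_strand in alignments:
        if aln_strand != strand:
            continue
        for pos in range(start, end):
            depth[(chrom, pos)] += 1
    return depth


# Three intervals taken from the sample rows; all lie on the minus strand.
alignments = [
    ("chr1", 164838310, 164838325, "-"),
    ("chr1", 164838310, 164838326, "-"),
    ("chr1", 164838311, 164838325, "-"),
]

minus = strand_coverage(alignments, "-")
plus = strand_coverage(alignments, "+")
print(minus[("chr1", 164838311)], len(plus))
```

A per-base dictionary is practical only for small regions; for whole-genome data an interval-based or array-based representation would be the more usual choice.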

Methods

Small RNAs between 20-200 nt were RiboMinus-treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and highly abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5′ cap structures.

Poly-A Polymerase was used to catalyze the addition of C's to the 3′ end. The 5′ ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5′ end. Reverse transcription was carried out using a poly-G oligo with a defined 5′ extension. The inserts were then amplified using oligos targeting the 5′ linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50, or 76 cycles. Initially, one lane is run. If an appreciable number of mappable reads are obtained, additional lanes are run. Sequence reads underwent quality filtration using the standard Illumina pipeline (GERALD).

The read lengths may exceed the insert sizes and consequently introduce 3′ adaptor sequence into the 3′ end of the reads. The 3′ sequencing adaptor was removed from the reads using a custom clipper program, which aligned the adaptor sequence to the short reads, allowing up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. Identical trimmed reads were then collapsed, their counts noted, and the collapsed reads aligned to the human genome (NCBI build 36, hg18 unmasked) using Nexalign (Lassmann et al., not published). The alignment parameters are tuned to tolerate up to 2 mismatches with no indels and will allow for trimmed portions as small as 5 nucleotides to be mapped. We report reads that mapped 10 or fewer times.
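
The clipping and collapsing steps described above can be sketched in Python as follows. This is not the custom clipper program itself, whose details are not given; it is a minimal illustration of ungapped adaptor matching with up to 2 mismatches. The adaptor sequence, the min_overlap parameter, and the function name clip_adaptor are all hypothetical.

```python
from collections import Counter

# Placeholder adaptor for illustration only; not the adaptor used by the lab.
ADAPTOR = "ACGTTGCAAC"


def clip_adaptor(read, adaptor=ADAPTOR, max_mismatches=2, min_overlap=5):
    """Clip the read at the leftmost ungapped adaptor match.

    The adaptor is compared base by base (no indels) against the read at
    each start position. `min_overlap` is an assumed safeguard so that very
    short overlaps, which always pass the mismatch limit, are ignored.
    """
    for start in range(len(read) - min_overlap + 1):
        # zip stops at the shorter sequence, so the adaptor may run off
        # the 3' end of the read.
        mismatches = sum(a != b for a, b in zip(read[start:], adaptor))
        if mismatches <= max_mismatches:
            return read[:start]
    return read


insert = "CCGGCCCCCCGTCCT"
with_adaptor = insert + "ACGTAGCA"  # partial adaptor carrying one mismatch
no_adaptor = "CCCGGCCCCCCGTCCT"

# Collapse identical clipped reads while keeping their multiplicity.
reads = [with_adaptor, with_adaptor, no_adaptor]
collapsed = Counter(clip_adaptor(r) for r in reads)
print(collapsed)
```

Taking the leftmost acceptable match is one possible design choice; a production clipper might instead prefer the match with the fewest mismatches or the longest overlap.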

Note: Data obtained from each lane are processed and mapped independently. The processed and mapped data from each lane are then compiled into a single track without additional processing and submitted to UCSC. Consequently, identical reads within a lane were collapsed and their count is reported as the "transfrag" signal value. However, the redundancy between lanes has not been eliminated, so the same transfrag may appear multiple times within a track.
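
Because the same transfrag may appear once per lane, a user who wants a single count per transfrag could sum the per-lane values. The Python sketch below shows one way to do this; it is not part of the submitted processing, and the key layout (chrom, start, strand, sequence), the counts, and the function name merge_lanes are hypothetical.

```python
from collections import Counter


def merge_lanes(lane_counts):
    """Sum per-lane transfrag counts into a single table."""
    merged = Counter()
    for counts in lane_counts:
        # Counter.update adds counts for existing keys rather than
        # replacing them.
        merged.update(counts)
    return merged


# Illustrative counts only; keys are (chrom, start, strand, sequence).
lane_1 = {
    ("chr1", 164838310, "-", "CCGGCCCCCCGTCCT"): 2,
    ("chr1", 164838310, "-", "CCCGGCCCCCCGTCCT"): 3,
}
lane_2 = {
    ("chr1", 164838310, "-", "CCGGCCCCCCGTCCT"): 1,
}

merged = merge_lanes([lane_1, lane_2])
print(merged[("chr1", 164838310, "-", "CCGGCCCCCCGTCCT")])
```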

Verification

Comparison of referential data generated from 8 individual sequencing lanes (Illumina technology).

Credits

Hannon lab members: Katalin Fejes-Toth, Vihra Sotirova, Gordon Assaf, Jon Preall

And members of the Gingeras and Guigo labs.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.