Description
These data tracks present a comprehensive collection of processed human small non-coding RNAs (sncRNAs) based on over
180 high-throughput small RNA-seq (smRNA-seq) experiments generated by over 30 independent groups.
The tracks show raw signal (expression) and peaks (regions of enrichment) generated with a uniform processing pipeline
by the Wang lab at UPenn. The tracks are based on an integrated analysis of data
from over 40 normal human tissues and cell types (see the DASHR database).
We also provide sncRNA processing information for each peak/locus.
Methods
Data collection and curation of smRNA-seq
We manually curated Illumina smRNA-seq datasets for normal human tissue samples and cell types from GEO and SRA.
The smRNA-seq samples were categorized into different groups of tissues and cell types, according to study ID (GSE accession).
Processing smRNA-seq datasets
We standardized the processing of smRNA-seq datasets and generated expression levels for sncRNA genes and for the mature sncRNA products derived from these larger RNAs. The pipeline can be summarized in three parts.
We first identified the correct adapter sequence and trimmed the sequencing reads using cutadapt.
We then mapped the set of trimmed reads corresponding to small RNAs to a standardized version of the human reference genome (GRCh37/hg19).
The reads were aligned with STAR using an 'all-matches' strategy, i.e., allowing multi-mapping reads but no mismatches.
Segmentation and quantification
We used a customized approach to identify peaks with evidence of specific processing for mature sncRNA products at base-pair resolution.
We scanned the genomic sequence and identified the start of a peak by finding two adjacent positions with at least a 2-fold
increase in the number of mapped reads. Similarly, the corresponding end of the peak was found by looking for at least a 2-fold decrease
in the number of mapped reads. Additionally, detected peaks were required to have at least 10 reads.
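The segmentation rule described above can be illustrated with a minimal sketch. This is not the pipeline's actual code: the function name call_peaks, the per-base coverage list as input, the 0-based half-open coordinates, and the use of maximum per-base coverage as a stand-in for the "at least 10 reads" requirement are all assumptions made for illustration.

```python
from typing import List, Tuple


def call_peaks(
    coverage: List[int], fold: float = 2.0, min_reads: int = 10
) -> List[Tuple[int, int, int]]:
    """Hypothetical sketch: return (start, end, max_coverage) per peak.

    Coordinates are 0-based and half-open. `coverage` holds the number of
    mapped reads overlapping each position (an assumed input format).
    """
    peaks = []
    n = len(coverage)
    i = 0
    while i < n:
        prev = coverage[i - 1] if i > 0 else 0
        cur = coverage[i]
        # Peak start: coverage rises at least `fold`-fold between two
        # adjacent positions (a rise from zero always qualifies).
        if cur > 0 and cur >= fold * prev:
            j = i
            # Extend the peak until the next position shows at least a
            # `fold`-fold decrease relative to the current one.
            while j + 1 < n and coverage[j + 1] * fold > coverage[j]:
                j += 1
            peak_max = max(coverage[i : j + 1])
            # Assumption: maximum coverage approximates the read count.
            if peak_max >= min_reads:
                peaks.append((i, j + 1, peak_max))
            i = j + 1
        else:
            i += 1
    return peaks


if __name__ == "__main__":
    toy = [0, 0, 12, 12, 12, 12, 1, 0, 0, 3, 3, 0]
    print(call_peaks(toy))
```

In this toy input, the block of coverage 12 qualifies as a peak, while the block of coverage 3 has sharp boundaries but falls below the 10-read minimum and is discarded.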
After identifying the mature sncRNA locations, we then quantified the number of reads falling within these regions as expression (raw read counts) for each sncRNA.
To enable comparison across tissues, we took into account the library size information for each of the sequencing experiments and reported the read count in 'reads per million' (RPM).
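As a small worked example of the RPM normalization, the sketch below scales a raw read count by library size. The function name reads_per_million is hypothetical, and how library size is defined (for example, total mapped reads per experiment) is an assumption not specified here.

```python
def reads_per_million(read_count: int, library_size: int) -> float:
    """Scale a raw read count to reads per million (RPM).

    `library_size` is assumed to be the total number of reads counted
    for the sequencing experiment; the exact definition is an assumption.
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return read_count * 1_000_000 / library_size


if __name__ == "__main__":
    # 250 reads in a library of 5 million reads
    print(reads_per_million(250, 5_000_000))
```

Because the count is divided by library size, the same sncRNA can be compared between experiments sequenced to different depths.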
The BED score (Score field) gives the log-transformed RPM expression, clamped to the [0,1000] range and computed as max(0,min(100*log10(100*RPM+0.05),1000)), where log10(x) = log(x)/log(10).
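The score formula can be transcribed directly, as in the following sketch. The function name bed_score is hypothetical, and the value is returned as a float; whether and how the tracks round it to an integer is not stated here.

```python
import math


def bed_score(rpm: float) -> float:
    """Log-transformed RPM score clamped to [0, 1000].

    Implements max(0, min(100 * log(100 * RPM + 0.05) / log(10), 1000)).
    The 0.05 offset keeps the logarithm defined when RPM is 0.
    """
    raw = 100 * math.log(100 * rpm + 0.05) / math.log(10)
    return max(0.0, min(raw, 1000.0))


if __name__ == "__main__":
    for rpm in (0.0, 0.01, 1.0, 1000.0):
        print(rpm, round(bed_score(rpm), 2))
```

Under this formula an RPM of 0 maps to a score of 0, an RPM of 1 maps to roughly 200, and the score saturates at 1000 once 100*RPM+0.05 reaches 10^10.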
A detailed description of the data processing pipeline and precise set of considerations for evaluating the quality of small RNA-seq data is available
in [1].
References
[1] Yuk Yee Leung, Pavel P. Kuksa, Alexandre Amlie-Wolf, Otto Valladares, Lyle H. Ungar, Sampath Kannan, Brian D. Gregory, and Li-San Wang (2015) DASHR: database of small human noncoding RNAs. Nucleic Acids Research (Database Issue). doi:10.1093/nar/gkv1188 PMID: 26553799
[2] Yuk Yee Leung, Paul Ryvkin, Lyle H. Ungar, Brian D. Gregory, and Li-San Wang (2013) CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Research, 41, e137. PMID: 23700308
Data Release Policy
There are no restrictions on the use of the tracks.
Contact
Li-San Wang