Schema for GIS DNA PET - ENCODE Genome Institute of Singapore DNA Paired-End Ditags


field        description
qName        Query template name - name of a read
flag         Flags. 0x10 set for reverse complement. See SAM docs for others.
rName        Reference sequence name (often a chromosome)
pos          1-based position
mapQ         Mapping quality 0-255; 255 means the mapping quality is not available
cigar        CIGAR encoded alignment string.
rNext        Ref sequence for next (mate) read. '=' if same as rName, '*' if no mate
pNext        Position (1-based) of next (mate) sequence. May be -1 or 0 if no mate
tLen         Size of DNA template for mated pairs. -size for one of mate pairs
seq          Query template sequence
qual         ASCII of Phred-scaled base QUALity+33. Just '*' if no quality scores
tagTypeVals  Tab-delimited list of tag:type:value optional extra fields
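As a minimal sketch of how the twelve fields above map onto one tab-delimited SAM alignment line, the following Python snippet splits a line into a record. The names `SamRecord` and `parse_sam_line` are hypothetical, and the example line is made up for illustration (in particular its rNext value '=' and qual value '*' are placeholders, not values taken from this track).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SamRecord:
    """One alignment line, using the field names from the schema above."""
    qName: str
    flag: int
    rName: str
    pos: int          # 1-based
    mapQ: int
    cigar: str
    rNext: str
    pNext: int        # 1-based
    tLen: int
    seq: str
    qual: str
    tagTypeVals: List[str]  # optional tag:type:value fields


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-delimited SAM line into its 11 mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-delimited fields")
    return SamRecord(
        qName=cols[0], flag=int(cols[1]), rName=cols[2], pos=int(cols[3]),
        mapQ=int(cols[4]), cigar=cols[5], rNext=cols[6], pNext=int(cols[7]),
        tLen=int(cols[8]), seq=cols[9], qual=cols[10], tagTypeVals=cols[11:],
    )


# Illustrative line only; '=' and '*' are placeholder values.
example = "read1\t115\tchr1\t10075\t255\t50M\t=\t18941\t8916\tACGT\t*\tRG:Z:S1\tMD:Z:50"
record = parse_sam_line(example)
print(record.rName, record.pos, record.tagTypeVals)
```

For real BAM files a dedicated library is the more robust choice; the sketch only shows the field layout.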

qName | flag | rName | pos | mapQ | cigar | rNext | pNext | tLen | seq | qual | tagTypeVals
1510_196_411 | 115 | chr1 | 10075 | 255 | 50M | | 18941 | 8916 | ACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACC | | RG:Z:S1 CS:Z:T10103200103200101001002001032001032001032001032001 CQ:Z:* MD:Z:50
1690_1738_401 | 131 | chr1 | 10091 | 255 | 25M | | 20207 | 10141 | TAACCCTAACCCTAACCCAACCCTN | | RG:Z:S1 CS:Z:G1301002301002301001010020 CQ:Z:* MD:Z:24A
1317_1772_84 | 131 | chr1 | 10157 | 255 | 25M | | 21877 | 11745 | TAACCCTAACCCTAACCCTAACCTA | | RG:Z:S1 CS:Z:G1301002301002301002301023 CQ:Z:* MD:Z:7CCAAAGCCAAAGCCAAGC
313_612_637 | 131 | chr1 | 10157 | 255 | 25M | | 21877 | 11745 | TAACCCTAACCCTAACCCTAACCTA | | RG:Z:S1 CS:Z:G1301002301002301002301023 CQ:Z:* MD:Z:1CCAAAGCCAAAGCCAAAGCCAAGC
1413_1557_843 | 115 | chr1 | 10157 | 255 | 25M | | 21538 | 11406 | TAACCCTAACCCTAACCCTAACCTA | | RG:Z:S1 CS:Z:T0321103200103200103200103 CQ:Z:* MD:Z:25
1993_770_1106 | 115 | chr1 | 10160 | 255 | 25M | | 20177 | 10042 | CCCTAACCCTAACCCTAACCTAACC | | RG:Z:S1 CS:Z:T1010320103200103200103000 CQ:Z:* MD:Z:25
1551_922_1854 | 115 | chr1 | 10163 | 255 | 25M | | 20598 | 10460 | TAACCCTAACCCTAACCTAACCCTA | | RG:Z:S1 CS:Z:T0320010320103200103000103 CQ:Z:* MD:Z:25
1648_45_861 | 115 | chr1 | 10163 | 255 | 25M | | 20598 | 10460 | NAACCCTAACCCTAACCTAACCCTA | | RG:Z:S1 CS:Z:T0320010320103200103000100 CQ:Z:* MD:Z:T24
651_1706_442 | 131 | chr1 | 10172 | 255 | 25M | | 20698 | 10551 | CCCTAACCTAACCCTAACCCTAACC | | RG:Z:S1 CS:Z:G3002301023010023010023010 CQ:Z:* MD:Z:19AGCTGA
115_1451_1203 | 131 | chr1 | 10174 | 255 | 25M | | 20032 | 9883 | CTAACCTAACCCTAACCCTAACCCT | | RG:Z:S1 CS:Z:G3230102301002301002301002 CQ:Z:* MD:Z:2GGTTCGGTTTCGGTTGATTGAAG

Description

This track is produced as part of the ENCODE Transcriptome Project. It shows the starts and ends of DNA fragments from different cell lines determined by paired-end ditag (PET) sequencing using different DNA fragment sizes for analysis of genome structural variation.

Display Conventions and Configuration

In the graphical display, the ends are represented by blocks connected by a horizontal line. In full and packed display modes, the arrowheads on the horizontal line represent the strand, and an ID of the format XXXXX-N-M is shown to the left of each PET, where X is the unique ID for each PET, N indicates the number of mapping locations in the genome (1 for a single mapping location, 2 for two mapping locations, and so forth), and M is the number of PET sequences at this location. PETs that mapped to multiple locations may represent low complexity or repetitive sequences.
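The XXXXX-N-M label format described above can be split into its three parts with a few lines of Python. This is only a sketch: the names `PetId` and `parse_pet_id` and the label "12345-1-3" are hypothetical, chosen for illustration.

```python
from typing import NamedTuple


class PetId(NamedTuple):
    pet_id: str        # X: unique ID of the PET
    n_locations: int   # N: number of mapping locations in the genome
    n_pets: int        # M: number of PET sequences at this location


def parse_pet_id(label: str) -> PetId:
    """Split a label of the form XXXXX-N-M into its three parts."""
    # Split from the right so an ID that itself contains '-' stays intact.
    pet_id, n, m = label.rsplit("-", 2)
    return PetId(pet_id, int(n), int(m))


pet = parse_pet_id("12345-1-3")  # made-up label
print(pet.pet_id, pet.n_locations, pet.n_pets)
# A PET with n_locations > 1 mapped to several places and may be repetitive.
```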

To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

The query sequences in the SAM/BAM alignment representation are normalized to the + strand of the reference genome (see the SAM Format Specification for more information on the SAM/BAM file format). If a query sequence was originally the reverse of what has been stored and aligned, it will have the following flag:

(0x10) Read is on '-' strand.
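The flag column is a bit field, so the 0x10 strand bit is tested with a bitwise AND. The sketch below decodes a flag using the bit meanings defined in the SAM Format Specification; the helper names `decode_flag` and `is_reverse_strand` are hypothetical.

```python
# Bit meanings as defined in the SAM Format Specification.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def is_reverse_strand(flag: int) -> bool:
    """True if the 0x10 bit is set, i.e. the read is on the '-' strand."""
    return bool(flag & 0x10)


# The two flag values that appear in the sample rows of this track:
print(decode_flag(115))  # paired, proper_pair, reverse, mate_reverse, first_in_pair
print(decode_flag(131))  # paired, proper_pair, second_in_pair
```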

BAM/SAM alignment representations also have tags. The following tags are associated with this track: RG, CQ, CS, and MD.
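Each optional field has the form tag:type:value. In the SAM optional-field conventions, RG names the read group, CS and CQ hold the color-space read sequence and its qualities, and MD is the mismatch string. A minimal sketch of splitting such fields into a dictionary follows; `parse_tags` is a hypothetical helper, and only the 'i' (integer) and 'f' (float) types are converted here, with everything else kept as a string.

```python
def parse_tags(tag_fields):
    """Turn ['RG:Z:S1', 'MD:Z:50', ...] into {'RG': 'S1', 'MD': '50', ...}."""
    tags = {}
    for field in tag_fields:
        # Split at most twice so values that contain ':' stay intact.
        tag, value_type, value = field.split(":", 2)
        if value_type == "i":
            tags[tag] = int(value)
        elif value_type == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value
    return tags


# Tag string copied from one of the sample rows of this track.
raw = "RG:Z:S1 CS:Z:T0321103200103200103200103 CQ:Z:* MD:Z:25"
tags = parse_tags(raw.split())
print(tags["RG"], tags["MD"])
```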

Mapping quality is not available for this track and so, in accordance with the SAM Format Specification, a score of 255 is used.

Methods

Sample genomic DNA was isolated, hydrosheared to a given size range, and ligated with a specific DNA linker sequence at both ends, followed by gel selection of the desired fragment size (e.g., 1 kb or 10 kb). The linker-modified DNA fragments (e.g., 10 kb) were then circularized by ligation and digested with the restriction enzyme EcoP15I to generate DNA PETs (a 25 bp tag from each end). The PETs were ligated with SOLiD sequencing adaptors at both ends, then amplified by PCR and purified as complex templates for high-throughput DNA sequencing. Most of the DNA PET data sets submitted so far were generated on the SOLiD platform. Cells were grown according to the approved ENCODE cell culture protocols.

Data: Reads of DNA PETs were mapped onto the reference genome (GRCh37, hg19), excluding the mitochondrion, haplotypes, randoms and chromosome Y. The majority of the PETs mapped to the same chromosome in the correct orientation and within the expected span (e.g., a 10 kb DNA PET was expected to map across a span of ~10 kb). A small portion of PETs, called discordant PETs, mapped too far from each other, in the wrong orientation, or to different chromosomes, indicating structural differences between the sample and the reference genome. These variations could be due to deletions, inversions, tandem repeats, translocations, fusions, etc.
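The three discordance criteria named above (different chromosomes, wrong orientation, too large a span) can be expressed as a simple decision function. This is a sketch under stated assumptions, not the pipeline's actual rule: the function name `classify_pet`, the `tolerance` parameter and its default of 0.5 are hypothetical, and the caller is assumed to have already decided whether the orientation of the two tags is the expected one for the library.

```python
def classify_pet(same_chromosome: bool,
                 orientation_ok: bool,
                 span: int,
                 expected_span: int,
                 tolerance: float = 0.5) -> str:
    """Label a PET as concordant or name the reason it is discordant.

    tolerance is a hypothetical fraction: spans larger than
    expected_span * (1 + tolerance) count as "too far apart".
    """
    if not same_chromosome:
        return "discordant: different chromosomes"
    if not orientation_ok:
        return "discordant: wrong orientation"
    if abs(span) > expected_span * (1 + tolerance):
        return "discordant: too far apart"
    return "concordant"


# A 10 kb library PET spanning 8916 bp on one chromosome:
print(classify_pet(True, True, 8916, 10_000))
# The same library with tags 50 kb apart:
print(classify_pet(True, True, 50_000, 10_000))
```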

Mapping parameters: Mapping was done using Applied Biosystems' SOLiD alignment and pairing pipeline. The ungapped alignment is done in color space. A seed-and-extend strategy is used: an initial seed of length 25 is mapped with a maximum of 2 mismatches and then extended to the read length. Each color space match is awarded a score of +1 and each mismatch a penalty of -2, so that:

Read Score = read length - # of mismatches - 2 * # of mismatches

After extension, each read is trimmed to the shortest length that gives its maximum score. The color space sequences are then converted into base space and checked to ensure that each sequence has a maximum of 2 base pair mismatches. If any sequence has more than 2 mismatches, that pair is discarded. The final output is converted into SAM/BAM format.
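The scoring and trimming rule can be illustrated with a short sketch. It assumes that trimming only shortens the read from its far end (that is, the best prefix is kept), which the text does not state explicitly; the function names `read_score` and `trim_to_best_prefix` are hypothetical, and this is not the SOLiD pipeline's code.

```python
def read_score(read_length: int, mismatches: int) -> int:
    """+1 per matching color, -2 per mismatching color."""
    matches = read_length - mismatches
    return matches - 2 * mismatches


def trim_to_best_prefix(color_matches):
    """Return (length, score) of the shortest prefix with the maximum score.

    color_matches is a sequence of booleans, True where the read's color
    matches the reference. Assumes trimming removes colors from the end only.
    """
    best_len, best_score = 0, 0
    score = 0
    for length, is_match in enumerate(color_matches, start=1):
        score += 1 if is_match else -2
        # Strict '>' keeps the shortest prefix when scores tie.
        if score > best_score:
            best_len, best_score = length, score
    return best_len, best_score


print(read_score(50, 2))  # 48 matches - 2 * 2 mismatches = 44

# 25 matching colors, then a mismatch, then one more match:
# extending past the mismatch never recovers the score of 25.
print(trim_to_best_prefix([True] * 25 + [False, True]))  # (25, 25)
```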

Verification

Representative structural variations identified by DNA PET data have been verified by targeted PCR and sequencing analysis to confirm the predicted rearrangement sites. Some of them have also been validated by FISH.

Credits

The GIS DNA PET libraries and sequence data for genome structural variation analysis were produced at the Genome Institute of Singapore. The data were mapped and analyzed by scientists Xiaoan Ruan, Atif Shahab, Chialin Wei, and Yijun Ruan at the Genome Institute of Singapore.

Contact: Yijun Ruan (now at The Jackson Laboratory)

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.