NHGRI DIPs Track Settings
 
NHGRI Deletion/Insertion Polymorphisms in ENCODE regions

Source data version: ENCODE June 2005 Freeze
Assembly: Human Mar. 2006 (NCBI36/hg18)
Data coordinates converted via liftOver from: May 2004 (NCBI35/hg17)
Data last updated at UCSC: 2007-10-22

Description

This track shows deletion/insertion polymorphisms (DIPs). In packed and full modes, the sequence variation is shown to the left of the DIP. The naming convention "-/sequence" is used for deletions; "sequence/-" is used for insertions. The details page shows the name of the trace used to define the polymorphism, the quality score, and the strand on which the trace aligns to the reference sequence.
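The naming convention can be illustrated with a small parser. This is only a sketch: the function name `classify_dip` and the returned labels are hypothetical and not part of the browser or of ssahaDIP; it assumes names contain exactly one "/" with "-" on one side.

```python
from typing import Tuple


def classify_dip(name: str) -> Tuple[str, str]:
    """Classify a DIP name written as "-/sequence" or "sequence/-".

    Returns (kind, sequence), where kind is "deletion" or "insertion",
    following the convention described in the track documentation.
    """
    left, sep, right = name.partition("/")
    if not sep:
        raise ValueError("DIP name must contain '/': %r" % name)
    if left == "-" and right and right != "-":
        return ("deletion", right)   # "-/sequence"
    if right == "-" and left and left != "-":
        return ("insertion", left)   # "sequence/-"
    raise ValueError("not a recognised DIP name: %r" % name)
```

For example, `classify_dip("-/ACG")` yields a deletion of ACG, and `classify_dip("ACG/-")` yields an insertion of ACG.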

The quality score reflects the minimum PHRED quality value over the entire range of the DIP within the trace, plus 5 flanking bases. PHRED quality scores are expressed as log probabilities using the formula:

  Q = -10 * log10(Pe)

where Pe is the estimated probability of an error at that base. PHRED quality scores typically vary from 0 to 40, where 0 indicates complete uncertainty about the base and 40 implies odds of 10,000 to 1 that the base is correct. Sometimes a PHRED value of 50 or higher is used to denote finished sequence. A color gradient is used to distinguish quality scores in the browser display: brighter shading indicates higher scores.
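The relationship between Q and Pe, and the way a DIP quality score is derived from per-base qualities, can be sketched as follows. The function names are hypothetical, and the sketch assumes that "5 flanking bases" means 5 bases on each side of the DIP, clipped at the ends of the trace; the documentation does not spell this out.

```python
import math


def phred_to_error_prob(q):
    """Convert a PHRED score Q to an error probability Pe = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(pe):
    """Convert an error probability Pe to a PHRED score Q = -10 * log10(Pe)."""
    if not 0 < pe <= 1:
        raise ValueError("Pe must be in (0, 1]")
    return -10 * math.log10(pe)


def dip_quality(quals, start, end, flank=5):
    """Minimum quality over trace positions [start, end) plus flanking bases.

    Assumption: `flank` bases are taken on EACH side of the DIP and the
    window is clipped to the trace boundaries.
    """
    lo = max(0, start - flank)
    hi = min(len(quals), end + flank)
    if lo >= hi:
        raise ValueError("empty quality window")
    return min(quals[lo:hi])
```

Under these assumptions, `phred_to_error_prob(40)` gives an error probability of about 1 in 10,000, and a single low-quality base within 5 positions of the DIP lowers the reported score.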

The "Trace Pos" value on the details page indicates the 3' position of the DIP within the trace. The alleles are reported relative to the "+" strand of the reference sequence; however, the trace may actually align to the "-" strand. When viewing the chromatogram using the URL provided, if the trace aligned to the "-" strand, the DIP bases in the trace will be the reverse complement of the variant allele given.
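To compare a reported allele with what appears in the chromatogram, the strand adjustment can be sketched as below. The helper names are hypothetical, and only the bases A, C, G, T and N are handled.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def allele_as_seen_in_trace(allele, strand):
    """Return the allele as it would read in a trace aligned to `strand`.

    Alleles are reported on the "+" strand of the reference, so a trace
    aligned to "-" shows the reverse complement.
    """
    if strand == "+":
        return allele
    if strand == "-":
        return reverse_complement(allele)
    raise ValueError("strand must be '+' or '-'")
```

For instance, an allele reported as AACG would read CGTT in a trace aligned to the "-" strand.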

Methods

All human trace data from NCBI's trace archive were aligned to hg17 with ssahaSNP, followed by ssahaDIP post-processing to detect deletion/insertion polymorphisms. DIPs within ENCODE regions were extracted.

Verification

For verification, 500k traces from the mouse whole genome shotgun (WGS) sequencing effort were compared to mm6 using ssahaSNP and ssahaDIP. Because mm6 and these traces are from the same mouse strain, C57BL/6J, the true DIP rate should be very low, so DIPs detected in this comparison can be treated as false positives. Applying a quality threshold of Q23, the detected DIP rate was one DIP per 140k Neighborhood Quality Standard (NQS) bases. This rate was ten-fold lower than the SNP rate for the same data set using ssahaSNP, which has been validated as having a 5% false positive rate. The detected DIP rate for human traces against hg17 is one DIP per 12k NQS bases. Taking the mouse rate as the background false-detection rate, this indicates a false positive rate of (1/140k)/(1/12k) = 12k/140k, or about 8.6%.
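The arithmetic behind this estimate can be reproduced with a short calculation. This is a sketch of the ratio described above, not the author's code; the function name is hypothetical, and it assumes the same-strain control rate approximates the false-detection rate per NQS base in the human data.

```python
def estimated_false_positive_rate(control_bases_per_dip, sample_bases_per_dip):
    """Estimate the false positive fraction from two detection rates.

    control_bases_per_dip: NQS bases per detected DIP in the same-strain
        control (where nearly all detections are assumed to be false).
    sample_bases_per_dip: NQS bases per detected DIP in the real comparison.
    """
    control_rate = 1.0 / control_bases_per_dip   # false DIPs per base
    sample_rate = 1.0 / sample_bases_per_dip     # all detected DIPs per base
    return control_rate / sample_rate


# Figures quoted in the text: one DIP per 140k bases (mouse control)
# and one DIP per 12k bases (human traces against hg17).
rate = estimated_false_positive_rate(140_000, 12_000)
print("estimated false positive rate: %.1f%%" % (100 * rate))
```

The ratio evaluates to 12/140, a little under 9%, in line with the figure quoted above.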

Further validation experiments are in progress.

Credits

All analyses were performed by Jim Mullikin using ssahaSNP and ssahaDIP. The trace data were contributed to the trace archive by many sequencing centers.

References

Ning Z, Cox AJ, Mullikin JC. SSAHA: A fast search method for large DNA databases. Genome Res. 2001 Oct;11(10):1725-9.

The International SNP Map Working Group. A map of human genome sequence variation containing 1.4 million single nucleotide polymorphisms. Nature. 2001 Feb 15;409(6822):928-33.