Schema for Platinum Genomes - Platinum genome variants
  Database: hg38    Primary Table: platinumNA12877
VCF File Download: /gbdb/hg38/platinumGenomes/NA12877.vcf.gz
Format description: The fields of a Variant Call Format data line
field      description
chrom      An identifier from the reference genome
pos        The reference position, with the 1st base having position 1
id         Semicolon-separated list of unique identifiers where available
ref        Reference base(s)
alt        Comma-separated list of alternate non-reference alleles called on at least one of the samples
qual       Phred-scaled quality score for the assertion made in ALT, i.e. -10 log_10 prob(call in ALT is wrong)
filter     PASS if this position has passed all filters; otherwise, a semicolon-separated list of codes for filters that fail
info       Additional information encoded as a semicolon-separated series of short keys with optional comma-separated values
format     If genotype columns are specified in the header, a colon-separated list of short keys starting with GT
genotypes  If genotype columns are specified in the header, a tab-separated set of genotype column values; each value is a colon-separated list of values corresponding to keys in the format column
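
The field layout above can be illustrated with a minimal Python sketch that splits one tab-separated VCF data line into the ten fields. The names `VcfRecord`, `parse_vcf_line`, and `phred_to_error_prob` are hypothetical helpers introduced here for illustration and are not part of any UCSC or VCF tooling. The sketch assumes well-formed lines and does no validation.

```python
from typing import Dict, List, NamedTuple, Optional


class VcfRecord(NamedTuple):
    """One VCF data line, split according to the field table (hypothetical type)."""
    chrom: str
    pos: int                          # 1-based reference position
    ids: List[str]                    # empty when the id column is "."
    ref: str
    alts: List[str]                   # comma-separated in the file
    qual: Optional[float]             # None when the qual column is "."
    filters: List[str]                # ["PASS"] or codes of the filters that failed
    info: Dict[str, List[str]]        # key -> comma-separated values (flags map to [])
    format_keys: List[str]            # colon-separated keys, starting with GT
    genotypes: List[Dict[str, str]]   # one dict per sample column


def parse_vcf_line(line: str) -> VcfRecord:
    cols = line.rstrip("\n").split("\t")
    chrom, pos, id_, ref, alt, qual, filt, info = cols[:8]

    info_map: Dict[str, List[str]] = {}
    for entry in info.split(";"):
        if entry and entry != ".":
            key, _, value = entry.partition("=")
            info_map[key] = value.split(",") if value else []

    # The format and genotype columns are optional in VCF.
    format_keys = cols[8].split(":") if len(cols) > 8 else []
    genotypes = [dict(zip(format_keys, sample.split(":"))) for sample in cols[9:]]

    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ids=[] if id_ == "." else id_.split(";"),
        ref=ref,
        alts=alt.split(","),
        qual=None if qual == "." else float(qual),
        filters=filt.split(";"),
        info=info_map,
        format_keys=format_keys,
        genotypes=genotypes,
    )


def phred_to_error_prob(qual: float) -> float:
    """Invert qual = -10 * log10(p) to recover p, the probability the call is wrong."""
    return 10 ** (-qual / 10)
```

For example, a line built from the first sample row (assuming it splits into the columns chr1, 727034, ., C, T, 0, PASS, ...) parses to a record whose `info["MTD"]` is the list of three caller names and whose single genotype is `{"GT": "1|0"}`.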

Sample Rows
 
chrom  pos     id  ref  alt  qual  filter  info                                                                       format  genotypes
chr1   727034  .   C    T    0     PASS    KM=5.71;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      1|0
chr1   727242  .   G    A    0     PASS    KM=9.86;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus,isaac_strelka  GT      1|0
chr1   727477  .   G    A    0     PASS    KM=6.89;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      0|1
chr1   727717  .   G    C    0     PASS    KM=11.2;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      0|1
chr1   729886  .   T    C    0     PASS    KM=8.29;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      1|0
chr1   736852  .   C    T    0     PASS    KM=12.1;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      1|0
chr1   744224  .   C    A    0     PASS    KM=8.19;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      1|1
chr1   758351  .   A    G    0     PASS    KM=11.5;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      1|0
chr1   758443  .   G    C    0     PASS    KM=8.57;KFP=0;KFF=0;MTD=bwa_freebayes,bwa_gatk,bwa_platypus                GT      1|0
chr1   766566  .   A    G    0     PASS    KM=2.92;KFP=0;KFF=0;MTD=bwa_freebayes,isaac_strelka                        GT      1|1
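
The genotype values in the sample rows (for example 1|0 or 1|1) follow the VCF convention that allele index 0 is the ref allele, indices 1 and up are the alt alleles in order, and `|` marks a phased genotype. The following Python sketch turns such a value into the pair of bases carried on the two haplotypes. The function name `phased_haplotypes` is hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple


def phased_haplotypes(ref: str, alts: List[str], gt: str) -> Tuple[Optional[str], ...]:
    """Map a phased GT value such as "1|0" to the allele on each haplotype.

    Index 0 selects the ref allele; index n >= 1 selects the n-th alt allele.
    A missing allele (".") is returned as None.
    """
    if "|" not in gt:
        raise ValueError(f"genotype {gt!r} is not phased")
    alleles = [ref] + alts
    return tuple(None if idx == "." else alleles[int(idx)] for idx in gt.split("|"))


# A C>T variant with GT 1|0: the first haplotype carries T, the second carries C.
print(phased_haplotypes("C", ["T"], "1|0"))
```

A value of 1|1 means both haplotypes carry the alt allele, so the sample is homozygous for the variant at that position.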

Platinum Genomes (platinumGenomes) Track Description
 

Description

These tracks show high-confidence "Platinum Genome" variant calls for two individuals, NA12877 and NA12878, who are part of a sequenced 17-member pedigree, family number 1463, from the Centre d'Etude du Polymorphisme Humain (CEPH). The hybrid track displays a merge of the NA12878 results with variant calls produced by Genome in a Bottle, discussed further below. CEPH is an international genetic research center that provides a resource of immortalized cell cultures used to map genetic markers. Pedigree 1463 represents a family lineage from Utah consisting of four grandparents, two parents, and 11 children. The whole pedigree was sequenced to 50x depth on an Illumina HiSeq 2000 system. The result is considered a platinum standard, where "platinum" refers to the quality and completeness of the resulting assembly, such as providing full chromosome scaffolds with phasing and haplotypes resolved across the entire genome.

This figure depicts the pedigree of the family sequenced for this study, where the ID for each sample is defined by adding the prefix NA128 to each numbered individual, so that 77 = NA12877 and 78 = NA12878, corresponding to the VCF tracks available in this track set. The dark orange individuals indicate sequences used in the analysis methods, whereas the blue represent the founder generations (grandparents), which were also sequenced and used in validation steps. The genomes of the parent-child trio on the top right side, 91-92-78, were also sequenced during Phase I of the 1000 Genomes Project.

These tracks represent a comprehensive genome-wide set of phased small variants that have been validated to high confidence. Sequencing and phasing a larger pedigree, beyond the two parents and one child of a standard trio analysis, increases the ability to detect errors and to assess the accuracy of more of the variants. The genetic inheritance data enables a more comprehensive catalog of "platinum variants" that reflects both high accuracy and completeness. A comprehensive set of valid single-nucleotide variants (SNVs) and insertions and deletions (indels), in both the easy and the difficult parts of the genome, is a vital resource for software developers creating the next generation of variant callers, because these are the areas where current methods most need training data to improve. Since every variant in this catalog is phased, the data set is also a resource for assessing emerging technologies designed to generate valid phasing information. To generate the calls, six analysis pipelines for calling SNVs and indels were run and their results merged into one catalog. The sensitivity of the genetic inheritance analysis helped to detect genotyping errors and to maximize the chance of including only true variants, among them variants that might otherwise be removed by suboptimal filtering. The referenced paper describes the methods in detail and further characterizes this catalog of 4.7 million SNVs plus 0.7 million small (1-50 bp) indels, all of which are consistent with the pattern of inheritance in the parents and 11 children of this pedigree.
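
To give a feel for what "consistent with the pattern of inheritance" means at a single site, here is a deliberately simplified Python sketch of a Mendelian consistency check for one parent-parent-child trio. It is not the method of the referenced paper, which uses inheritance across the whole pedigree and all 11 children. The function names `split_gt` and `is_mendelian_consistent` are hypothetical, and the sketch assumes diploid genotypes with no missing alleles.

```python
from itertools import product
from typing import List


def split_gt(gt: str) -> List[int]:
    """Split a GT value such as "0|1" or "0/1" into integer allele indices."""
    return [int(allele) for allele in gt.replace("|", "/").split("/")]


def is_mendelian_consistent(father_gt: str, mother_gt: str, child_gt: str) -> bool:
    """Return True if the child could have received one allele from each parent.

    Simplification: phase is ignored and only this single site is considered.
    """
    father, mother, child = split_gt(father_gt), split_gt(mother_gt), split_gt(child_gt)
    if len(child) != 2:
        raise ValueError("this sketch only handles diploid child genotypes")
    target = sorted(child)
    # Try every combination of one paternal allele and one maternal allele.
    return any(sorted(pair) == target for pair in product(father, mother))


# A heterozygous child is explained by a heterozygous father and a homozygous-reference mother.
print(is_mendelian_consistent("0|1", "0|0", "0|1"))
# Two homozygous-reference parents cannot explain a child carrying the alt allele.
print(is_mendelian_consistent("0|0", "0|0", "0|1"))
```

A call that fails such a check in a trio is either a genotyping error or a de novo event. A larger pedigree adds many more such constraints per site, which is why it can validate more variants than a single trio.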

The hybrid track in this set extends the characterization of NA12878 by incorporating high-confidence calls produced by the Genome in a Bottle analysis. The resulting merged files contain more comprehensive coverage of variation than either set independently. For instance, the hg19 version contains over 80,000 more indels than either input set. Read more about the hybrid methods at the following link: https://github.com/Illumina/PlatinumGenomes/wiki/Hybrid-truthset

Data Access

The VCF files for this track can be obtained from the download server: https://hgdownload.soe.ucsc.edu/gbdb/hg38/platinumGenomes/.
These files were obtained from the Platinum Genomes source archive: https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/ReleaseNotes.txt.
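
Once a VCF file has been downloaded, it can be read with the Python standard library alone. The sketch below tallies SNVs and indels in a gzip-compressed VCF. It assumes a local copy of the file (the path `NA12877.vcf.gz` in the usage comment is an assumed location, not something the download server creates for you), and the function name `count_variant_types` and its simple length-based classification are illustrative choices rather than the classification used to build the catalog.

```python
import gzip
from collections import Counter


def count_variant_types(path: str) -> Counter:
    """Count alt alleles by type in a gzip-compressed VCF file.

    Classification used in this sketch:
      snv    ref and alt are both a single base
      indel  ref and alt differ in length
      other  everything else (symbolic alleles, "*", ".", equal-length multi-base changes)
    """
    counts: Counter = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header line
            cols = line.rstrip("\n").split("\t")
            ref = cols[3]
            for alt in cols[4].split(","):
                if alt in (".", "*") or alt.startswith("<"):
                    counts["other"] += 1
                elif len(ref) == 1 and len(alt) == 1:
                    counts["snv"] += 1
                elif len(ref) != len(alt):
                    counts["indel"] += 1
                else:
                    counts["other"] += 1
    return counts


# Usage, assuming the file was saved in the current directory:
# print(count_variant_types("NA12877.vcf.gz"))
```

Counting per alt allele rather than per line means a multi-allelic site contributes once for each alternate allele, so totals may differ slightly from a per-site count.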

Reference

Eberle MA, Fritzilas E, Krusche P, Källberg M, Moore BL, Bekritsky MA, Iqbal Z, Chuang HY, Humphray SJ, Halpern AL et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 2017 Jan;27(1):157-164. PMID: 27903644; PMC: PMC5204340