Schema for Nextstrain Variants - Nextstrain/GISAID Sample Variants from nextstrain.org/ncov
  Database: wuhCor1    Primary Table: nextstrainSamplesAll
VCF File: /gbdb/wuhCor1/nextstrain/nextstrainSamples.vcf.gz
Format description: The fields of a Variant Call Format data line
See the Variant Call Format specification for more details
field       description
chrom       An identifier from the reference genome
pos         The reference position, with the 1st base having position 1
id          Semicolon-separated list of unique identifiers where available
ref         Reference base(s)
alt         Comma-separated list of alternate non-reference alleles called on at least one of the samples
qual        Phred-scaled quality score for the assertion made in ALT, i.e. -10 log_10 prob(call in ALT is wrong)
filter      PASS if this position has passed all filters. Otherwise, a semicolon-separated list of codes for filters that fail
info        Additional information encoded as a semicolon-separated series of short keys with optional comma-separated values
format      If genotype columns are specified in header, a colon-separated list of short keys starting with GT
genotypes   If genotype columns are specified in header, a tab-separated set of genotype column values; each value is a colon-separated list of values corresponding to keys in the format column
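
As an illustration of how the fields above fit together, the following is a minimal Python sketch that splits one data line into those fields. The function name parse_vcf_line and the example line are hypothetical; the example is assembled from values shown in the sample rows, assumes tab-separated columns as in a VCF file, and carries only two genotype columns for brevity.

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into the fields listed above."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = cols[:8]

    # info: semicolon-separated keys, each with optional comma-separated values.
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value.split(",") if value else True

    # format: colon-separated keys; every genotype column uses the same order.
    format_keys = cols[8].split(":") if len(cols) > 8 else []
    genotypes = [dict(zip(format_keys, g.split(":"))) for g in cols[9:]]

    return {
        "chrom": chrom,
        "pos": int(pos),              # 1-based reference position
        "id": vid.split(";"),
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),  # "." means missing
        "filter": filt,
        "info": info_dict,
        "genotypes": genotypes,
    }


# Hypothetical example line built from values in the sample rows.
example = (
    "NC_045512v2\t160\tG160A,G160T\tG\tA,T\t.\tPASS\t"
    "AC=4,2;AN=4298\tGT:CLADE\t0:19A\t0:19A"
)
record = parse_vcf_line(example)
```

Values in info are kept as strings here; converting them to numbers depends on the key and is left to the caller.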

Sample Rows
 
chrom        pos  id           ref  alt  qual  filter  info            format    genotypes
NC_045512v2  103  C103T        C    T    .     PASS    AC=1;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  105  G105T        G    T    .     PASS    AC=1;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  106  C106T        C    T    .     PASS    AC=6;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  110  C110T        C    T    .     PASS    AC=1;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  127  G127T        G    T    .     PASS    AC=2;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  129  A129G        A    G    .     PASS    AC=1;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  141  T141C        T    C    .     PASS    AC=1;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  160  G160A,G160T  G    A,T  .     PASS    AC=4,2;AN=4298  GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  174  G174T        G    T    .     PASS    AC=2;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
NC_045512v2  180  C180T        C    T    .     PASS    AC=3;AN=4298    GT:CLADE  0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A 0:19A ...
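
The info column in these rows carries the standard VCF keys AC (allele count, one value per alternate allele) and AN (total number of alleles in called genotypes). A small sketch of turning them into per-allele frequencies follows; the helper name allele_frequencies is hypothetical.

```python
def allele_frequencies(info):
    """Return one frequency per alternate allele from an 'AC=...;AN=...' string."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    total_alleles = int(fields["AN"])
    # AC holds one comma-separated count per alternate allele, in ALT order.
    return [int(count) / total_alleles for count in fields["AC"].split(",")]


# Position 160 in the sample rows has two alternate alleles (A and T).
freqs = allele_frequencies("AC=4,2;AN=4298")
```

For position 160 this yields two values, 4/4298 for G160A and 2/4298 for G160T.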

Nextstrain Variants (nextstrainSamples) Track Description
 

Description

Nextstrain.org displays data about single nucleotide variant alleles in the SARS-CoV-2 RNA and protein sequences that have occurred in different samples of the virus during the current 2019/2020 outbreak. Nextstrain has a powerful user interface for viewing the evolutionary tree that it infers from the patterns of variants in sequences worldwide, but it does not offer a detailed plot of variants along the genome that can be correlated with other molecular information. We have therefore processed its data into this track, which displays the variants called by Nextstrain for each sample that Nextstrain has obtained from GISAID.

Click on the vertical column in the display for any position in the SARS-CoV-2 genome to see more details about the variant(s) that occur at that position. The details include the protein change (if applicable; protein changes use gene names in the Nextstrain Genes track), the number of samples with the variant, and a list giving the nucleotide (allele) at that position in each GISAID sample, among other information.

Nextstrain identifies certain clades within the phylogenetic tree according to a set of defining variants. The Nextstrain Clades track provides more information about these clades and serves as a useful color key for the clade colors in the phylogenetic tree display.

This track is composed of several subtracks so that different subsets of variants may be viewed:

  • Recurrent Bi-allelic: This is the only subtrack displayed by default. It is limited to variants that have been observed in at least two samples, and excludes positions at which more than one alternate allele has been observed in more than one sample.
  • All Variants: All variants found in all samples.
  • <Clade> Variants: All variants found in samples belonging to <Clade>, which is one of Nextstrain's clades: A1a, A2, A2a, A3, A6, A7, B, B1, B2, or B4 ("old" clades, March 15 - June 2, 2020), or 19A, 19B, 20A, 20B or 20C (June 2, 2020 - present).
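
The Recurrent Bi-allelic rule can be sketched in Python as a filter on per-allele sample counts. This is one reading of the description above, not the code UCSC uses: it assumes each sample contributes a single allele (so the AC value in the info column equals the number of samples carrying that allele), and it keeps a position when exactly one alternate allele reaches the two-sample threshold. The function name is_recurrent_biallelic is hypothetical.

```python
def is_recurrent_biallelic(allele_counts, min_samples=2):
    """Decide whether a position qualifies under the assumed rule.

    allele_counts: sample counts per alternate allele, e.g. [4, 2].
    A position qualifies when exactly one alternate allele has been
    observed in at least min_samples samples.
    """
    recurrent = [count for count in allele_counts if count >= min_samples]
    return len(recurrent) == 1


# AC values taken from the sample rows.
singleton = is_recurrent_biallelic([1])        # position 103, AC=1
recurrent = is_recurrent_biallelic([6])        # position 106, AC=6
multi_allelic = is_recurrent_biallelic([4, 2])  # position 160, AC=4,2
```

Under this reading, position 103 is dropped as a singleton, position 106 is kept, and position 160 is dropped because two alternate alleles each occur in more than one sample.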

Display Conventions

In "dense" mode, a vertical line is drawn at each position where there is a variant. In "pack" mode, the display shows a plot of all samples' variants, with samples ordered using Nextstrain's phylogenetic tree in order to highlight patterns of linkage.

Each sample is placed in a horizontal row of pixels; when the number of samples exceeds the number of vertical pixels for the track, multiple samples fall in the same pixel row and pixels are averaged across samples.
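
The averaging described above can be illustrated with a short Python sketch. It is not the browser's drawing code: the function name average_rows, the even binning of samples into rows, and the 0/1 encoding (0 for the reference allele, 1 for a non-reference allele) are all assumptions made for the example.

```python
def average_rows(sample_values, pixel_rows):
    """Collapse one variant's per-sample values into pixel_rows intensities.

    sample_values: one number per sample in tree order
                   (assumed: 0 = reference, 1 = non-reference).
    pixel_rows:    number of vertical pixels available for the track.
    """
    n = len(sample_values)
    if n <= pixel_rows:
        # Enough room: every sample gets its own pixel row.
        return [float(v) for v in sample_values]

    sums = [0.0] * pixel_rows
    counts = [0] * pixel_rows
    for i, value in enumerate(sample_values):
        row = i * pixel_rows // n   # consecutive samples share a row
        sums[row] += value
        counts[row] += 1
    # Because n > pixel_rows, every row receives at least one sample.
    return [s / c for s, c in zip(sums, counts)]


# Six samples squeezed into three pixel rows.
rows = average_rows([0, 0, 1, 1, 0, 1], 3)
```

In this example the first row holds two reference samples (0.0), the second two non-reference samples (1.0), and the third one of each (0.5), which would appear as an intermediate gray.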

Each variant is a vertical bar at its position in the SARS-CoV-2 genome with white (invisible) representing the reference allele and black representing the non-reference allele(s). Tick marks are drawn at the top and bottom of each variant's vertical bar to make the bar more visible when most alleles are reference alleles. Insertions and deletions are not shown as these are removed from the data by Nextstrain.

The phylogenetic tree for the samples built by Nextstrain is depicted in the left column of the display. Mousing over this will show the GISAID identifiers for the different samples. When the vertical height of the track is set sufficiently high (10 pixels per sample with the default font), sample names are drawn to the right of the tree; however, with thousands of samples in the Nextstrain tree, and a maximum track height of 2500 pixels, the full Nextstrain tree is too large for sample names to be displayed. In the track controls, the user can choose to display subtracks containing the phylogenetic trees and variants for individual clades. Some clades have few enough samples that they can be made tall enough to display sample names. Branches of the phylogenetic tree are colored by clade using the same color scheme as nextstrain.org.
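
The limit on sample names follows from simple arithmetic on the two numbers given above, shown here as a small Python sketch. The constant and function names are hypothetical, and the 10-pixel figure applies to the default font only.

```python
MAX_TRACK_HEIGHT_PX = 2500      # maximum track height stated above
PIXELS_PER_NAMED_SAMPLE = 10    # height needed per sample with the default font


def can_show_sample_names(sample_count):
    """True if sample_count rows of 10 pixels fit within the maximum height."""
    return sample_count * PIXELS_PER_NAMED_SAMPLE <= MAX_TRACK_HEIGHT_PX


# Largest tree (or clade subtrack) for which names can be drawn.
max_named_samples = MAX_TRACK_HEIGHT_PX // PIXELS_PER_NAMED_SAMPLE
```

With these values, names can be drawn for at most 250 samples, which is why only the smaller clade subtracks can show them.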

Methods

Nextstrain downloads SARS-CoV-2 genomes from GISAID as they are submitted by labs worldwide. The sequences are processed by an automated pipeline, and the resulting annotations are written to a data file. UCSC downloads this file and extracts the annotations for display.

Data Access

You can download the VCF files underlying this track (nextstrainSamples*.vcf.gz) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API.
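
Once a VCF file has been downloaded, it can be read with the Python standard library alone. The sketch below lists the sample names from the #CHROM header line, which in VCF holds the eight fixed columns, then FORMAT, then one column per sample. The function name read_sample_names is hypothetical, and the path passed to it is whatever local copy you saved.

```python
import gzip


def read_sample_names(vcf_gz_path):
    """Return the sample names from the #CHROM header line of a gzipped VCF."""
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 0-8 are the fixed fields plus FORMAT; the rest are samples.
                return line.rstrip("\n").split("\t")[9:]
    return []
```

The returned list is in the same order as the genotype columns, so its entries can be paired with the genotype values on each data line.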

Nextstrain.org offers phylogenetic trees and metadata files: scroll to the bottom of the page and click "DOWNLOAD DATA", and a dialog with download options appears.

Credits

This work is made possible by the open sharing of genetic data by research groups from all over the world. We gratefully acknowledge their contributions. Special thanks to nextstrain.org for sharing its analysis of genomes collected by GISAID.

Data usage policy

The data presented here is intended to rapidly disseminate analysis of important pathogens. Unpublished data is included with permission of the data generators, and does not impact their right to publish. Please contact the respective authors if you intend to carry out further research using their data. Author contact info is available via nextstrain.org: scroll to the bottom of the page, click "DOWNLOAD DATA" and click "ALL METADATA (TSV)" in the resulting dialog.

References

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. PMID: 29790939; PMC: PMC6247931

Sagulenko P, Puller V, Neher RA. TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evol. 2018 Jan;4(1):vex042. PMID: 29340210; PMC: PMC5758920

Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015 Jan;32(1):268-74. PMID: 25371430; PMC: PMC4271533