Attempts to infer phylogenetic relationships, sites under selection, or evidence of recombination
from SARS-CoV-2 genome sequences can be led astray by sequencing errors, contamination, and
hypermutable sites. To make reliable inferences, it is important to identify probable
errors and susceptible sites within the genome sequences, carefully consider how those might
affect the specific analysis one is about to perform, and perhaps exclude problematic sites from
the analysis.
This track shows locations in the SARS-CoV-2 genome that have been flagged, for various reasons,
as problematic for analysis. They have been collected in the GitHub repository
Locations have been separated into two subtracks and colored according to their level of severity:
- Mask: Problems are expected to affect most types
of analysis, so masking these sites before analysis is recommended.
- Caution: Some types of analysis may be
affected while others may not; caution is recommended.
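As an illustration of what masking a site means in practice, the following minimal Python sketch replaces selected positions in a sequence with N. It assumes the sequence is already aligned to the reference so that 1-based reference coordinates apply directly; the function name mask_sites and the toy sequence and positions are hypothetical and are not taken from the track data.

```python
def mask_sites(sequence: str, positions, mask_char: str = "N") -> str:
    """Return a copy of `sequence` with the given 1-based positions masked.

    Assumes `sequence` is aligned to the reference, so a reference
    coordinate maps directly to a string index (position - 1).
    Positions outside the sequence are ignored.
    """
    bases = list(sequence)
    for pos in positions:
        if 1 <= pos <= len(bases):
            bases[pos - 1] = mask_char
    return "".join(bases)


# Toy example (hypothetical data): mask positions 2 and 9 of a 10-base sequence.
aligned = "ACGTACGTAC"
masked = mask_sites(aligned, [2, 9])
print(masked)  # ANGTACGTNC
```

Replacing a base with N keeps the alignment coordinates intact, which is usually preferable to deleting the column when downstream tools expect reference-length sequences.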
Locations are labeled with the following terms to indicate the type of potential problem:
- ambiguous: Sites which show an excess of ambiguous basecalls relative to the
number of alternative alleles, often emerging from a single country or sequencing laboratory
- amended: Previous sequencing errors which now appear to have been fixed in the
latest versions of the GISAID sequences, at least in sequences from some of the sequencing
laboratories
- highly_ambiguous: Sites with a very high proportion of ambiguous characters,
relative to the number of alternative alleles
- highly_homoplasic: Positions which are extremely homoplasic; it is not always
clear whether these are hypermutable sites or sequencing artefacts
- homoplasic: Homoplasic sites, with many mutation events needed to explain a
relatively small alternative allele count
- interspecific_contamination: Cases (only one instance as of July 2020) in which
the known sequencing issue is due to contamination from genetic material of a different species
- nanopore_adapter: Cases in which the known sequencing issue is due to the adapter
sequences in nanopore reads
- narrow_src: Mutations which are found in sequences from only a few sequencing labs
(usually two or three), possibly as a consequence of the same artefact reproduced independently
- neighbour_linked: Proximal mutations displaying near-perfect linkage
- seq_end: Alignment ends are affected by low coverage and high error rates
(masking recommended, but might be more stringent than necessary)
- single_src: Only observed in samples from a single laboratory
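When deciding which sites matter for a particular analysis, it can help to group locations by the labels listed above. The sketch below does this for a small hand-made list. It assumes each site is available as a (position, labels) pair in which multiple labels are comma-separated; that representation, the function name group_by_label, and the toy positions are assumptions made for illustration, not the track's actual schema or contents.

```python
from collections import defaultdict


def group_by_label(sites):
    """Map each label to the sorted list of positions that carry it.

    `sites` is an iterable of (position, labels) pairs, where `labels`
    is a comma-separated string (an assumed representation).
    """
    by_label = defaultdict(list)
    for position, labels in sites:
        for label in labels.split(","):
            label = label.strip()
            if label:
                by_label[label].append(position)
    return {label: sorted(positions) for label, positions in by_label.items()}


# Toy records (hypothetical positions, real label names from the list above).
toy_sites = [
    (100, "seq_end"),
    (200, "homoplasic,single_src"),
    (300, "single_src"),
]
grouped = group_by_label(toy_sites)
print(grouped["single_src"])  # [200, 300]
```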
Multiple groups applied various methods (De Maio, Walker et al.;
De Maio, Gozashti et al.; Turakhia et al.) to identify sites that
were homoplasic, likely contaminated, likely the result of sequencing error, and/or observed
in multiple virus lineages by only one or a few laboratories. They contributed their
observations and recommendations to the GitHub repository
UCSC downloaded the collection, split the sites into Mask and Caution subsets according to
the recommended action, and reformatted the data for display in the Genome Browser.
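The split-and-reformat step can be pictured with a short Python sketch. This is not the UCSC pipeline itself; it is a minimal illustration that assumes each input record is a (1-based position, recommended action, label) tuple and that the output is BED-style rows with 0-based, half-open coordinates. The function name split_sites and the toy records are hypothetical.

```python
CHROM = "NC_045512v2"  # UCSC sequence name for the SARS-CoV-2 reference


def split_sites(records):
    """Split records into 'mask' and 'caution' lists of BED-style rows.

    Each record is (position, action, label) with a 1-based position
    (an assumed input layout). Each output row is
    (chrom, start, end, name) using 0-based, half-open coordinates.
    Records with any other action are skipped.
    """
    subsets = {"mask": [], "caution": []}
    for position, action, label in records:
        key = action.lower()
        if key not in subsets:
            continue
        subsets[key].append((CHROM, position - 1, position, label))
    return subsets


# Toy records (hypothetical positions).
toy_records = [
    (10, "mask", "seq_end"),
    (500, "caution", "homoplasic"),
    (11, "MASK", "seq_end"),
]
subsets = split_sites(toy_records)
for row in subsets["mask"]:
    print("\t".join(str(field) for field in row))
```

The one-base shift in the start coordinate is the detail most easily missed: a site at 1-based position 10 becomes the BED interval 9 to 10.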
The original data file was downloaded from GitHub:
You can download the bigBed files underlying this track (problematicSites*.bb) from our
Download Server. The data can be explored interactively with the
Table Browser or the Data Integrator. The data can be
accessed from scripts through our API.
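For scripted access, a request to the UCSC REST API might be assembled as in the Python sketch below. The /getData/track endpoint and its genome, track, chrom, start and end parameters are part of the public API; the track name problematicSitesMask is an assumption inferred from the problematicSites*.bb file names and should be checked against the track list before use. The helper names build_track_url and fetch_track are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(track, genome="wuhCor1", chrom="NC_045512v2",
                    start=None, end=None):
    """Build a getData/track URL; start/end are only added when both are given."""
    params = {"genome": genome, "track": track, "chrom": chrom}
    if start is not None and end is not None:
        params["start"] = start
        params["end"] = end
    return f"{API_BASE}?{urlencode(params)}"


def fetch_track(track, **kwargs):
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(build_track_url(track, **kwargs)) as response:
        return json.load(response)


# "problematicSitesMask" is an assumed track name, not confirmed by this page.
url = build_track_url("problematicSitesMask", start=0, end=1000)
print(url)
```

Calling fetch_track("problematicSitesMask") would then return the decoded JSON, provided the assumed track name is correct.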
De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N.
Issues with SARS-CoV-2 sequencing data.
virological.org. 2020 May 5.
De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N.
Updated analysis with data from 12th June 2020.
virological.org. 2020 July 14.
Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, Corbett-Detig R.
Stability of SARS-CoV-2 Phylogenies.
bioRxiv. 2020 June 9.