Galaxy ENA mutations Tracks
 
GalaxyProject surveillance of SARS-CoV-2 mutations through consistent processing of public raw sequencing data

Galaxy ENA mutations in top lineages - current quarter  Most frequent lineages of current quarter  
Galaxy ENA mutations in top lineages - a quarter ago  Most frequent lineages of a quarter ago  
Galaxy ENA mutations in top lineages - two quarters ago  Most frequent lineages of two quarters ago  
Galaxy ENA mutations in top lineages - three quarters ago  Most frequent lineages of three quarters ago  
Assembly: SARS-CoV-2 Jan. 2020 (NC_045512.2)

Description

This track represents part of the SARS-CoV-2 analysis efforts of the GalaxyProject [1]. The project aims at a fully open, transparent and high-quality reanalysis of public raw sequencing data deposited in INSDC databases, carried out on ready-to-use public infrastructure [2]. It restricts itself to data deposited by national genome surveillance projects that provide sufficient sample metadata (along with the submitted data or through personal communication) to allow for best-practice analysis and reporting (for examples see [3, 4, 5]).

Required metadata are:

  • Sample collection date
  • Sequencing platform, library layout and strategy (currently reanalysis is done for ampliconic paired-end Illumina and ONT data)
  • The primer scheme used for the generation of amplicons (this information is used to trim primer sequences from the data before variant calling; reanalysis can be done for any primer scheme with publicly available primer binding site information)
  • Some kind of discernible batch information (e.g. a library identifier) that can be used to form batches of samples for reanalysis and batch-level reporting

Analysis is performed on public Galaxy servers using only open-source tools, orchestrated through public, community-developed, reproducible workflows available from WorkflowHub and Dockstore. It includes mutation calling for all samples, generation of per-sample and batch-level mutation reports and plots, generation of consensus sequences, and pangolin lineage assignment. Key results and metadata are hosted on a public FTP server provided by the Centre for Genomic Regulation and the Barcelona Supercomputing Centre and form the basis of these UCSC Genome Browser tracks. The project website has more information about the available result data.

Display Conventions and Configuration

Track structure

The GalaxyProject SARS-CoV-2 mutation tracking effort comes as a supertrack containing four subtracks that represent mutation data from SARS-CoV-2 samples collected in different three-month periods of the COVID-19 pandemic. The quarters are redefined with each data update, with the latest (current) quarter starting three months before the day of the update. The end date displayed on the current quarter track corresponds to the collection date of the most recent analyzed sample on the day of the update.
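The rolling quarter definition can be illustrated with a short Python sketch. It is not the project's actual code: the function names are hypothetical, "three months" is interpreted as calendar months with the day of month clamped to the length of the target month, and the update day itself is used as the upper bound of the current quarter (whereas the track label shows the collection date of the most recent analyzed sample).

```python
import calendar
from datetime import date


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped, so May 31 minus 3 months gives Feb 28/29.
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def quarter_windows(update_day: date, n_quarters: int = 4):
    """Return (start, end) date pairs, newest quarter first.

    Assumption: each window starts 3 months before its end, and the
    newest window ends on the update day.
    """
    windows = []
    for k in range(n_quarters):
        start = months_before(update_day, 3 * (k + 1))
        end = months_before(update_day, 3 * k)
        windows.append((start, end))
    return windows


if __name__ == "__main__":
    for start, end in quarter_windows(date(2022, 6, 15)):
        print(start.isoformat(), "to", end.isoformat())
```

Because the windows are anchored to the update day rather than to calendar quarters, the same sample can move from one subtrack to the next between updates.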

Each quarter's subtrack is, in turn, composed of separate mutation data tracks for the five most common pangolin lineages observed in the data for that quarter.

Together the tracks can be used to explore the change of dominating lineages (and their associated mutation patterns) over time and, for lineages dominant over multiple quarters, to search for evidence of emerging within-lineage mutations.

Mutation feature display

To facilitate such searches, the shading of a mutation feature reflects the mutation's observed frequency among the samples of a given lineage in the given quarter. Lineage-defining mutations should therefore be displayed in dark grey or black, while newly emerging mutations and non-systematic variant calling artefacts should appear in lighter shades of grey.
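One common way to obtain frequency-dependent grey shading in the genome browser is to store a value between 0 and 1000 in the BED score field and let the browser shade features by score. Whether these tracks encode frequency exactly this way is an assumption; the sketch below only shows the linear mapping, and `frequency_to_score` is a hypothetical helper name.

```python
def frequency_to_score(frequency: float) -> int:
    """Map a within-lineage frequency in [0, 1] to a BED score in [0, 1000].

    Assumption: a plain linear mapping; higher scores render darker.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    return round(frequency * 1000)


if __name__ == "__main__":
    # A lineage-defining mutation versus a rare, possibly emerging one.
    print(frequency_to_score(0.98), frequency_to_score(0.06))
```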

Mutation features are labeled with their effects at the amino acid level and, for SNV mutations, the feature as a whole will extend across the base triplet encoding the affected amino acid, while the thick part of the feature will indicate the precise base that gets changed by the mutation. For deletions, the whole feature will have a thick rendering, while insertions will be displayed all thin.
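The SNV display rule maps naturally onto BED coordinates: the feature spans the affected codon and the thick region covers the single changed base. The following sketch assumes a single contiguous coding sequence read in frame from its first base (so it ignores the ribosomal frameshift in ORF1ab); `snv_feature` and `SnvFeature` are hypothetical names, not part of the project's code.

```python
from typing import NamedTuple


class SnvFeature(NamedTuple):
    chrom_start: int  # 0-based start of the codon
    chrom_end: int    # end of the codon (exclusive)
    thick_start: int  # 0-based position of the changed base
    thick_end: int    # thick_start + 1


def snv_feature(pos: int, cds_start: int) -> SnvFeature:
    """Build BED-style coordinates for an SNV.

    `pos` and `cds_start` are 1-based genomic positions; the result uses
    the 0-based, half-open convention of the BED format.
    """
    if pos < cds_start:
        raise ValueError("position lies upstream of the coding sequence")
    offset = (pos - cds_start) % 3   # position of the base within its codon
    codon_start = pos - offset       # 1-based first base of the codon
    return SnvFeature(codon_start - 1, codon_start + 2, pos - 1, pos)


if __name__ == "__main__":
    # Spike D614G (A23403G); the S coding sequence starts at 21563 in NC_045512.2.
    print(snv_feature(23403, 21563))
```

For the D614G example the changed base is the middle base of the codon, so the thick part sits in the centre of a three-base feature.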

Mutation details

Hovering over any mutation feature (in dense or full display mode of the track) will reveal details of the mutation and the associated statistics, in particular:

  • the precise value for its observed frequency in the lineage and quarter
  • the intra-sample allele frequency (median and lower/upper quartile) at which the mutation has been called in the samples in which it has been detected
  • the collection date and the collecting country of the sample in which this mutation was first (ever) detected in the context of the lineage. Note that for older, still circulating lineages the collection date of that sample can be earlier than the start of the earliest quarter displayed in the genome browser (since our complete surveillance data goes back further than four quarters).
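The per-mutation statistics listed above could be derived along the following lines. This is a minimal sketch with hypothetical function and key names; in particular, the quartile interpolation method used by the project is not stated in this description, so the "inclusive" method of Python's `statistics.quantiles` is an assumption.

```python
from statistics import quantiles


def mutation_statistics(allele_frequencies, lineage_sample_count):
    """Summarize one mutation within one lineage and quarter.

    `allele_frequencies` holds the intra-sample allele frequency of the
    mutation in every sample where it was called; `lineage_sample_count`
    is the number of samples assigned to the lineage in that quarter.
    """
    afs = sorted(allele_frequencies)
    if not afs:
        raise ValueError("mutation was not called in any sample")
    if len(afs) == 1:
        q1 = median = q3 = afs[0]  # quantiles() needs at least two values
    else:
        q1, median, q3 = quantiles(afs, n=4, method="inclusive")
    return {
        "frequency": len(afs) / lineage_sample_count,
        "af_lower_quartile": q1,
        "af_median": median,
        "af_upper_quartile": q3,
    }


if __name__ == "__main__":
    print(mutation_statistics([0.90, 0.95, 0.97, 0.99, 1.0], 50))
```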

Filtering Mutations

Mutation features displayed in each subtrack can be filtered by

  • country or combination of countries in which samples of the given lineage and collection quarter with the mutation have been collected. You could for example filter all current quarter lineage tracks to show only mutations that have been found (in their respective lineage) in the UK.
  • within-lineage frequency. By default, only mutations observed in at least 5% of the samples assigned to the given lineage in the given quarter are shown (0.05 default filter setting). You can lower or raise that threshold as you see fit. Note, however, that the underlying bigBed data of the tracks only contains mutations above a threshold of 0.1% (i.e. a 0.001 hard filter is always in effect).
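The interplay of the two filters can be sketched in Python as follows. The `Mutation` record and `filter_mutations` function are hypothetical, and the sketch assumes that a combination of countries matches a mutation if it was found in any of the selected countries; the browser's actual matching rule may differ.

```python
from dataclasses import dataclass

HARD_FILTER = 0.001     # already applied to the bigBed data, per the text above
DEFAULT_FILTER = 0.05   # default display threshold, per the text above


@dataclass(frozen=True)
class Mutation:
    label: str
    frequency: float      # within-lineage frequency in the quarter
    countries: frozenset  # countries where samples carrying it were collected


def filter_mutations(mutations, min_frequency=DEFAULT_FILTER, countries=None):
    """Return the mutations that pass the frequency and country filters."""
    # Lowering the threshold below the hard filter has no additional effect.
    threshold = max(min_frequency, HARD_FILTER)
    wanted = frozenset(countries) if countries else None
    return [
        m for m in mutations
        if m.frequency >= threshold
        and (wanted is None or m.countries & wanted)
    ]


if __name__ == "__main__":
    data = [
        Mutation("S:D614G", 0.99, frozenset({"UK", "Greece"})),
        Mutation("N:X1Y", 0.02, frozenset({"Greece"})),  # made-up example label
    ]
    print([m.label for m in filter_mutations(data, countries={"UK"})])
```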

Methods

For analyses, batches of raw sequencing data are downloaded from public databases (in particular, from the FTP server of the European Nucleotide Archive) onto one of several public Galaxy instances. The data is processed with a sequencing platform-specific variation analysis workflow (one for paired-end Illumina data, another for ONT data), which performs QC, read mapping, postprocessing of mapped reads including primer trimming, variant calling and annotation, and results in a collection of VCF files, one for each sample in the batch. This output is picked up by a reporting workflow, which generates per-sample and per-batch mutation reports and a per-batch allele-frequency plot for a quick overview of variant patterns in the batch. In parallel, the outputs of the variation analysis workflow are also used by a consensus workflow to produce a FASTA consensus sequence for every sample in the batch.

Sequencing data downloads, execution of the three types of workflows, and export of key result files are orchestrated by bot scripts, which can be used together with the public workflows to set up the complete analysis system on any Galaxy server. The bot accounts on participating Galaxy servers are checked on a roughly weekly basis for newly finished analysis histories. Whenever such histories are found:

  1. those histories are made publicly accessible on their server
  2. batch information, i.e., samples analyzed and their metadata, links to the histories, etc. are added to
    ftp://xfer13.crg.eu/gx-surveillance.json
  3. pangolin lineage assignment is (re)performed for the entire collection of samples ever analyzed
  4. the genome browser tracks get recalculated by
    1. parsing all analyzed data on the FTP server
    2. determining the five most frequently observed pangolin lineages for each of the last four quarters, starting from the current date
    3. extracting all mutations seen in each quarter for each of the five top lineages in that quarter
    4. rebuilding the bigBed files and track files
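The lineage ranking in the track recalculation step can be pictured with a small Python sketch. It is an illustration under assumptions, not the project's script: samples are represented as hypothetical `(quarter_index, lineage)` pairs, where the quarter index would come from comparing each sample's collection date with the quarter boundaries, and ties are broken by first appearance in the input.

```python
from collections import Counter


def top_lineages(samples, n=5):
    """Return the `n` most frequent lineages for every quarter.

    `samples` is an iterable of (quarter_index, lineage) pairs, with
    quarter 0 being the current quarter.
    """
    counts = {}
    for quarter, lineage in samples:
        counts.setdefault(quarter, Counter())[lineage] += 1
    return {
        quarter: [lineage for lineage, _ in counter.most_common(n)]
        for quarter, counter in counts.items()
    }


if __name__ == "__main__":
    demo = [(0, "BA.2"), (0, "BA.2"), (0, "BA.1"), (1, "AY.4"), (1, "BA.1"), (1, "AY.4")]
    print(top_lineages(demo, n=1))
```

Mutations would then be extracted only for samples whose lineage appears in the list for their quarter.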

Credits

The analysis behind these tracks is the result of joint efforts of the Galaxy community at large, the usegalaxy.org and usegalaxy.eu teams, the IUC, and the IWC.

The infrastructure and development work behind the project was made possible by generous support from funding agencies around the world.

For questions regarding SARS-CoV-2 data analysis and its automation with Galaxy, please join us in the GalaxyProject Public Health Matrix channel.

The project would not be possible without the sequencing data provided by genome surveillance initiatives that have decided to make their data and metadata publicly available by depositing them in INSDC databases. In particular we would like to thank:

References

  1. Baker, D.; van den Beek, M.; Blankenberg, D.; Bouvier, D.; Chilton, J.; Coraor, N.; Coppens, F.; Eguinoa, I.; Gladman, S.; Grüning, B.; Keener, N.; Lariviere, D.; Lonie, A.; Kosakovsky Pond, S.; Maier, W.; Nekrutenko, A.; Taylor, J. & Weaver, S. (2020): No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics. PLoS Pathogens 16(8):e1008643. DOI: 10.1371/journal.ppat.1008643
  2. Maier, W.; Bray, S.; van den Beek, M.; Bouvier, D.; Coraor, N.; Miladi, M.; Singh, B.; Argila, J. R. D.; Baker, D.; Roach, N.; Gladman, S.; Coppens, F.; Martin, D. P.; Lonie, A.; Grüning, B.; Pond, S. L. K. & Nekrutenko, A. (2021): Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nature Biotechnology 39, 1178-1179. DOI: 10.1038/s41587-021-01069-1