Patching up the Genome

From biologists to computer scientists, the human genome has presented a grand puzzle. At UCSC, the story began in 1985 when our chancellor, molecular biologist Robert Sinsheimer, proposed a bold endeavor: sequencing the complete human genome. Five years later, the international Human Genome Project was launched. The next chapter took place in 1999, when computer science professor David Haussler was asked to join the project. Haussler, in turn, enlisted then-graduate student Jim Kent to help with assembling the genome. This collaboration culminated on July 7, 2000, when the first human genome assembly was made available on the UCSC servers. Over 500 GB were downloaded worldwide in 24 hours. (Hey, back in 2000, that was a lot!)


Total web traffic at the University of California Santa Cruz in 2000. When the genome became available online, all other web activity at the university shrank into the background.

Three months later, the UCSC Genome Browser came online as a resource to distribute and visualize the genome. The first ten releases, hg1-hg10, were assembled at UCSC, after which the task was taken over by NCBI. As NCBI incremented the official releases and changed the naming scheme, UCSC released browsers at a slower rate, continuing to increment the hg* nomenclature. When NCBI released NCBI33 in 2003, UCSC released it as hg15. After so many browser releases in under three years, the pace slowed, with each assembly taking around one year longer than the previous one.

Patches: What are they and why are they important?

Blog_table

Note: hg38 follows hg19. The UCSC nomenclature was changed to match the Genome Reference Consortium (GRC)’s GRCh release number.

The early genome assemblies aimed largely to increase the fidelity of the reference. With each release, however, research progress was temporarily hampered as scientists adjusted to sequence changes and shifted coordinates. As a result, scientists often continue to use an older release because it is better annotated and more established. This is evident in the Genome Browser, where a majority of our users continue to work on GRCh37/hg19 even though GRCh38/hg38 was released more than 4 years ago.

Looking at the numbers, however, we can see that GRCh38 is the most accurate human genome to date. Having reached these benchmarks in accuracy, the GRC has shifted its focus beyond fidelity to inclusion. The GRC now strives to capture more of the genetic diversity present in the human population. The initial release of GRCh38/hg38 included 261 alternate haplotype sequences, nearly a 30-fold increase over GRCh37/hg19.

UCSC builds a new assembly database for each full release of a genome assembly, but the GRC also releases “patch” updates for genome assemblies. Through patch releases, the GRC adds new alternate haplotype sequences and corrected sequences without changing the sequences or coordinate system of the initial assembly release.

To quote directly from the GRC:

Patches are accessioned scaffold sequences that represent assembly updates. They add information to the assembly without disrupting the chromosome coordinates. Patches are given chromosome context via alignment to the current assembly. Together, the scaffold sequence and alignment define the patch.

These patch sequences are more important now than ever before as the GRC has decided to indefinitely postpone the release of the next coordinate-changing assembly (which would have been GRCh39/hg39), instead opting for additional patches to GRCh38/hg38. There are two kinds of patch sequences:

Novel patches (alternative haplotypes): Chromosomal regions of the genome that exhibit sufficient variability to prevent adequate representation by a single sequence. Also referred to as alternate loci. UCSC labels these haplotype sequences by appending “_alt” to their names.

Fix patches: Error corrections (addressed by approaches such as base changes, component replacements/updates, switch-point updates or tiling-path changes) or assembly improvements, such as the extension of sequence into gaps. UCSC labels these fix sequences by appending “_fix” to their names.

These patch sequences, especially novel patches, have been increasing in number and will continue to do so.
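The naming convention described above can be used to sort sequence names into groups. The following is a minimal sketch, assuming only the “_alt” and “_fix” suffixes mentioned in this post; the function name is made up for illustration, and the _fix sequence name in the example list is a hypothetical placeholder rather than a real accession.

```python
def classify_sequence(name: str) -> str:
    """Classify a UCSC hg38 sequence name by its suffix.

    Only the two suffixes described in the post are recognized;
    every other name is reported as "other".
    """
    if name.endswith("_alt"):
        return "alt"    # novel patch / alternate haplotype
    if name.endswith("_fix"):
        return "fix"    # fix patch
    return "other"      # e.g. an initial assembly chromosome such as chr19


# "chrN_EXAMPLEv1_fix" is a hypothetical name used only for illustration.
names = ["chr19", "chr19_GL949747v2_alt", "chrN_EXAMPLEv1_fix"]
groups = {name: classify_sequence(name) for name in names}
print(groups)
```

A pipeline that is not ready for patch sequences could use a check like this to skip them, or to route them into a separate analysis.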


The number of human assembly patch sequences is growing quickly. This is primarily due to alternate haplotype (_alt) sequences, though fix (_fix) sequences are also being introduced. The number of fix patches resets between GRCh37.p13 and GRCh38 because those fixes were integrated into the new assembly.

A better approach to patches

Our approach thus far in the Genome Browser has been to make data tracks indicating the locations of these patch releases along the initial assembly chromosomes. While these are useful, they provide little in the way of annotations and are largely underutilized by users. With the increase of these patches and postponement of GRCh39, however, we have decided to switch our approach and add the new sequences, and annotations on the new sequences, to the UCSC hg38 database. This will allow patches to be visualized on the Browser as standalone reference sequences, not unlike a regular chromosome or the alternate haplotype sequences that were included in the initial assembly release. BLAT results may also include alignments to these sequences.

The addition of new genomic sequences to an existing UCSC database is a departure from our longstanding practice of building a new database every time we import a new genome assembly release. To minimize disruption to pipelines that use our download files, especially those in the bigZips directory, we will leave the original bigZips/hg38.* files unchanged and add a subdirectory when we incorporate sequences from a patch release; for example, bigZips/p12/ for patch release GRCh38.p12. We will also add bigZips/latest/, which will link to the most recent patch release subdirectory, so that pipelines may stay up to date with UCSC’s patch sequence annotations if desired. In other words, the bigZips downloads will be “opt-in” for patch sequences.
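As a rough illustration of the opt-in layout, the sketch below builds the directory URL a pipeline might point at. The base URL is an assumption, inferred from the hgdownload address of the hg38 liftOver directory; the function name is hypothetical, and no file names inside the patch subdirectories are assumed.

```python
# Assumed base URL for the hg38 bigZips directory (not stated in this post).
BASE = "http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/"


def bigzips_dir(release: str = "") -> str:
    """Return the bigZips directory URL for a release choice.

    release=""       -> the original, unchanged bigZips/ files
    release="p12"    -> the subdirectory for patch release GRCh38.p12
    release="latest" -> the link to the most recent patch release
    """
    if not release:
        return BASE
    return f"{BASE}{release}/"


print(bigzips_dir())          # pipelines that do not opt in
print(bigzips_dir("p12"))     # pinned to one patch release
print(bigzips_dir("latest"))  # follows the newest patch release
```

Pinning to a named release such as p12 keeps results reproducible, while “latest” trades that for staying current.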

Changes and improvements to hg38

Currently, we are in the process of adding these sequences to the GRCh38/hg38 genome database with the potential to do the same for GRCh37/hg19 and GRCm38/mm10 at a future date. Changes that users may see are as follows:

  • BLAT/In-Silico PCR – Additional hits on _alt and _fix sequences
  • Position searches in the hg38 browser may lead to _alt and _fix sequences in addition to or instead of initial assembly chromosomes
  • New ‘Fix Patches’ and ‘Alt Haplotypes’ tracks replace the ‘GRC Patch Release’ and ‘Alt Map’ tracks; they include alignments to alts/fixes, with details pages and links to jump between main chromosomes and alts/fixes
  • New subdirectories of bigZips download directory (initial, p12, latest)
  • New sequences/annotations in /gbdb/hg38 download files (same file names, extended contents)
  • SQL queries to genome-mysql.soe.ucsc.edu may include new results on _alt and _fix sequences

It is also worth noting what will not change. Existing sequences, and annotations on existing sequences, will not change. Download files in the bigZips directory, such as bigZips/hg38.2bit and bigZips/hg38.fa.masked.gz, will not change.
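For SQL users, one way to keep existing queries limited to the initial assembly chromosomes is to filter on the sequence name. The sketch below uses an in-memory SQLite table as a stand-in so that it runs without a network connection; the table name, its rows and the “!” escape character are assumptions chosen for illustration, and a real query against the public server would target an actual track table.

```python
import sqlite3

# In-memory stand-in for an annotation table. The table name "example_track"
# and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_track (chrom TEXT, chromStart INTEGER, chromEnd INTEGER)"
)
conn.executemany(
    "INSERT INTO example_track VALUES (?, ?, ?)",
    [
        ("chr19", 100, 200),
        ("chr19_GL949747v2_alt", 450950, 460291),
        ("chrN_EXAMPLEv1_fix", 10, 20),
    ],
)

# "_" is a single-character wildcard in LIKE, so it is escaped with "!"
# to match a literal underscore.
MAIN_ONLY = """
SELECT chrom FROM example_track
WHERE chrom NOT LIKE '%!_alt' ESCAPE '!'
  AND chrom NOT LIKE '%!_fix' ESCAPE '!'
ORDER BY chrom
"""
PATCHES_ONLY = """
SELECT chrom FROM example_track
WHERE chrom LIKE '%!_alt' ESCAPE '!'
   OR chrom LIKE '%!_fix' ESCAPE '!'
ORDER BY chrom
"""

main_chroms = [row[0] for row in conn.execute(MAIN_ONLY)]
patch_chroms = [row[0] for row in conn.execute(PATCHES_ONLY)]
print(main_chroms)
print(patch_chroms)
```

Without the escape, “%_alt” would also match any name that merely ends in “alt” preceded by some other character, which is a subtle way for a filter to go wrong.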

So what kind of annotations can be found on these hg38 patch sequences?

  • Annotations generated by UCSC such as RepeatMasker, CpG Islands, AUGUSTUS, Human mRNAs and Pfam
  • NCBI’s sequence alignments of patch sequences to chromosomes: Fix Patches, Alt Haplotypes
  • External annotation sources such as RefSeq and GENCODE that include annotations on patch sequences (up to this point we have ignored those patch annotations)
  • Select tracks have been lifted from main chromosomes onto the patches using NCBI’s alignments, most notably GTEx Gene and ENCODE Regulation

For additional information on these patch sequences, and a full list of sequences in hg38, you may visit the hg38 Genome Browser Gateway page.

We are always receptive to our users and their needs. If there are any specific track annotations you would like to see on these patches, or if you have any questions regarding this implementation and how it may affect you, please write to our public mailing list (genome@soe.ucsc.edu) or, if your message includes sensitive data, to our private mailing list (genome-www@soe.ucsc.edu).

2 thoughts on “Patching up the Genome”

  1. Daniel

    Is it possible to lift over genes/coordinates from a patch/alternative chromosome to the reference genome’s chromosome?
    For example, the Ensembl gene ID ENSG00000274143 (KIR2DL5A) is located at chr19_GL949747v2_alt:450950-460291. Are there chain files anywhere that liftOver can use for that? (Obviously, http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver only has chain files for lifting over between reference assemblies.)

    Thanks

    1. Luis Nassar Post author

      Hello Daniel,

      If you are looking at a patch sequence (alt or fix) on the browser, there is a way to view the corresponding coordinates on the ‘main’ chromosome. For example, here is a session of the area you are referring to where ENSG00000274143 is located: http://genome.ucsc.edu/s/Lou/hg38_altchrom. If you click into the “Reference Assembly Alternate Haplotype Sequence Alignments” item on the Browser, which represents the alignment of this alternate sequence, you will see the option to “view the corresponding position range on the main chromosome”, or even view the entire patch sequence placed on the chromosome. This area happens to correspond to chr19:54,832,710-54,833,356. I would like to point out, however, that while the original window is 9342bp, the resulting region on the main chromosome is only 647bp and contains only the first exon of a different gene. I’ll explain the reason for this poor conversion further below.
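The two window sizes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names; it assumes the position strings use inclusive start and end coordinates, as displayed in the browser's position box.

```python
from typing import Tuple


def parse_position(position: str) -> Tuple[str, int, int]:
    """Split a position such as 'chr19:54,832,710-54,833,356' into parts."""
    chrom, _, coords = position.rpartition(":")
    start_text, _, end_text = coords.replace(",", "").partition("-")
    return chrom, int(start_text), int(end_text)


def span_bp(position: str) -> int:
    """Length of the window in base pairs, counting both endpoints."""
    _, start, end = parse_position(position)
    return end - start + 1


alt_window = span_bp("chr19_GL949747v2_alt:450950-460291")
main_window = span_bp("chr19:54,832,710-54,833,356")
print(alt_window, main_window)  # 9342 647
```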

      As far as having chain files for these patches, we do not currently have any available for download. Patches often contain new sequences that are not present in the original chromosome, and these new sequences may not be alignable. In the case of KIR2DL5A, if you go to the following session (http://genome.ucsc.edu/s/Lou/hg38_altchrom_gap) you can see that most exons fall within an alignment gap (red highlight) so liftOver would fail unless you use a really low -minBlocks value, and then it would map only the first exon.

      Another way to look at this is to get the KIR2DL5A sequence and BLAT it against the hg38 genome. You will see that the alignment to chr19_GL949747v2_alt has 100% identity and is a better match than the alignment of the transcript sequence to the main chr19, which has 93.4% identity:

      ACTIONS          QUERY        SCORE  START  END   QSIZE  IDENTITY  CHROM                 STRAND  START     END       SPAN
      browser details  NM_020535.3  735    861    1596  1596   100.0%    chr19_GL949747v2_alt  +       459579    460414    836
      vs.
      browser details  NM_020535.3  624    861    1596  1596   93.4%     chr19                 +       54752216  54753051  836

      Main chr19 just doesn’t have this sequence (just some similar sequence, possibly due to lots of duplication and divergence over evolutionary time). For this reason, we are trying to promote the use of patch sequences.
      You can identify which tracks have information on patch sequences by the “P12” icon next to their names (such as Fix Patches, Alt Haplotypes, etc.). Do note that a few patch sequences are missing data even on these labeled tracks.

      Are you, by chance, looking to map it back to the main chromosome in order to compare it to additional annotations? As this is a recently released feature, we would be interested in hearing how users are utilizing patch sequences, and what implementations/data tracks they would like to see on them.

      I encourage you to write to our publicly archived mailing list (genome@soe.ucsc.edu) if you have any further questions or suggestions, or you may also use our private mailing list (genome-www@soe.ucsc.edu) if your message contains sensitive information.
