The new NCBI RefSeq tracks and You

The release of the new NCBI RefSeq track marks a major shift in how we include annotations from NCBI’s Reference Sequence Database (RefSeq) in the UCSC Genome Browser. This new track is a composite track that contains the combined set of curated and predicted annotations from the RefSeq database for hg38/GRCh38. It also contains tracks that break up the annotation set into a few subsets. These subsets include only the curated transcripts (NM, NR, or YP transcripts), only the predicted transcripts (XM or XR transcripts), all of the other annotations from RefSeq that don’t fit into the curated or predicted subsets, and the alignments of the curated and predicted transcripts to the genome. All of the coordinates and alignments in these tracks are provided by the RefSeq group.
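As a rough illustration of how the subsets relate to accession prefixes, the sketch below sorts an accession into "curated", "predicted", or "other". The function name `classify_accession` and the fallback to "other" are assumptions made for this example, not part of any UCSC or NCBI tool. The prefixes come from this post: NM, NR, and YP from the paragraph above, plus NP, which the follow-up comment at the end of the post lists for the curated set.

```python
# Prefixes named in this post. NP appears only in the follow-up comment,
# so treat its inclusion here as an assumption of this sketch.
CURATED_PREFIXES = {"NM", "NR", "NP", "YP"}
PREDICTED_PREFIXES = {"XM", "XR"}


def classify_accession(accession: str) -> str:
    """Return 'curated', 'predicted', or 'other' based on the accession prefix.

    Hypothetical helper: anything without a recognized prefix falls into
    'other', which is a simplification of how the RefSeq Other subset is built.
    """
    # The prefix is the part before the first underscore, e.g. "NM" in "NM_001130970".
    prefix = accession.strip().split("_", 1)[0].upper()
    if prefix in CURATED_PREFIXES:
        return "curated"
    if prefix in PREDICTED_PREFIXES:
        return "predicted"
    return "other"
```

For example, `classify_accession("NM_001130970")` returns `"curated"`, since that accession (mentioned later in this post) begins with NM.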

This new NCBI RefSeq composite also includes a “UCSC RefSeq” track that is based on our original method of producing the “RefSeq Genes” track. This “UCSC RefSeq” track is built by aligning RNAs obtained from the RefSeq Database to the genome. In the early days of the UCSC Genome Browser, RefSeq provided only RNA sequences, so we used BLAT to align them to the genome. This was a good solution at the time, but it has led to two kinds of issues: some transcripts match multiple places in the genome, and our alignments of small exons or other regions sometimes differ slightly from those found in the RefSeq database. This type of minor alignment difference can be seen in the following session, where the RefSeq Curated (top) and UCSC RefSeq (bottom) tracks place the small fifth exon of transcript NM_001130970 at different locations because the exon sequence has multiple matches in that region.
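A small sketch of how such a difference could be spotted when comparing two annotations of the same transcript exon by exon. The helper `differing_exons` is hypothetical, and the coordinates are toy values invented for illustration; they are not the real positions of NM_001130970 or of any other transcript.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) of one exon


def differing_exons(a: List[Exon], b: List[Exon]) -> List[Tuple[int, Exon, Exon]]:
    """Return (exon_number, exon_in_a, exon_in_b) wherever the two lists disagree.

    Assumes both lists describe the same transcript with the same exon count,
    listed in the same order.
    """
    if len(a) != len(b):
        raise ValueError("exon counts differ; the lists are not directly comparable")
    return [
        (number, exon_a, exon_b)
        for number, (exon_a, exon_b) in enumerate(zip(a, b), start=1)
        if exon_a != exon_b
    ]


# Toy coordinates (made up): only the small fifth exon is placed differently.
ncbi_exons = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 912), (1000, 1100)]
ucsc_exons = [(100, 200), (300, 400), (500, 600), (700, 800), (950, 962), (1000, 1100)]

print(differing_exons(ncbi_exons, ucsc_exons))
# prints [(5, (900, 912), (950, 962))]
```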

The new set of RefSeq tracks differs from the “UCSC RefSeq” track in a few key ways. First, as mentioned previously, the new tracks are based entirely on positions and alignments provided by RefSeq. This means that if you obtain the hg38 coordinates for a RefSeq transcript from the UCSC Genome Browser, these coordinates should be the same as those from the entry found at NCBI’s RefSeq Database. Second, the new tracks are currently available only for the hg38/GRCh38 assembly. Lastly, these new NCBI RefSeq tracks include predicted transcripts, which were absent from our original RefSeq track.

This has been a long and exciting collaboration between the UCSC Genome Browser staff and NCBI’s RefSeq group. We trust that this full complement of tracks from the Reference Sequence Database will be helpful to you, our Browser users. We hope to bring these tracks to more genome assemblies in the future.

4 thoughts on “The new NCBI RefSeq tracks and You”

  1. Brian

    There are now new NCBI RefSeq tracks for human, rat, yeast, C. elegans, zebrafish, X. tropicalis and fly!

    We are pleased to announce the release of a new set of gene annotation tracks for the hg19/GRCh37, hg38/GRCh38, rn6/Rnor_6.0, sacCer3/R64, ce11/WBcel235, danRer10/GRCz10, danRer11/GRCz11, xenTro7/Xtropicalis_v7, xenTro9/Xenopus_tropicalis_v9.1 and dm6/Release 6 plus ISO1 MT assemblies based on data from NCBI’s Reference Sequence Database (RefSeq). For all of these tracks, the alignments and coordinates are provided by RefSeq. These tracks are organized in a composite track that includes:

    RefSeq All – all annotations from the curated and predicted sets
    RefSeq Curated – curated annotations beginning with NM, NR, or NP
    RefSeq Predicted – predicted annotations beginning with XM or XR
    RefSeq Other – all other RefSeq annotations not included in RefSeq All
    RefSeq Alignments – alignments of transcripts to the genome provided by RefSeq

    The new composite track also includes a “UCSC RefSeq” track that is based on our original “RefSeq Genes” track. As before, this UCSC track is the result of our realignments of RefSeq RNAs to the genome, which means that there may be some cases where its annotations differ from those in the new NCBI RefSeq tracks. Also note that the RefSeq Predicted subtrack is unavailable for the following assemblies: hg19, dm6, ce11, and sacCer3.

    A huge thank you to Terence Murphy from the RefSeq group and to Hiram Clawson, Angie Hinrichs, Christopher Lee and many others from the UCSC Genome Browser staff for bringing this track to life.
