How portable is your track hub? Use hubCheck to find out!

Track and assembly hubs are collections of data that are hosted on your servers and can be displayed using the UCSC Genome Browser and other genome browsers supporting the UCSC track hub format. Track hubs allow for the visualization of data on assemblies that we already host (such as the human or mouse genomes), while assembly hubs can be used to create genome browsers for any genome assembly of your choosing.

Hubs depend on a number of different plain text configuration files. The most important are the trackDb.txt files for each assembly in your hub. These files contain the track configuration settings, also known as “trackDb settings”, that control how the track displays in the Genome Browser as well as the display of the item detail pages. You can see the trackDb settings available for hubs on the Hub Track Database Definition page.

As the track hub format has grown in popularity, other genome browsers, including Ensembl, Biodalliance, and the WashU Epigenome Browser, have implemented support for the UCSC track hub format. The Ensembl genome browser currently boasts fairly comprehensive support of the UCSC track hub format. In addition to supporting track hubs on their site, the Ensembl team has also created a Track Hub Registry that pulls hubs listed on our Public Hubs page into a centralized database alongside those hubs submitted to their registry. In an attempt to make the adoption of our track hub format easier, we talked to the other genome browsers about what settings were core to a track hub being, well, a track hub. We sorted the list of hundreds of settings into various support levels, which include:

  • Required – needed to display a hub across the different browsers.
  • Base – non-required settings that are likely to be supported by other genome browsers
  • Full – all other trackDb settings fully supported in the UCSC Genome Browser
  • New – settings introduced since the last versioned release, may change between now and the next versioned release
  • Deprecated – settings that may currently work, but could cease to work in the future as they are not being actively developed

We periodically increment the trackDb version number as major updates and changes are made to the settings. The latest change, version 2, included settings related to the release of several “big*” file types, such as bigGenePred, bigPsl, bigChain, and bigMaf, as well as CRAM. It also moved several settings (html, priority, colorByStrand, autoScale, spectrum) from the “full” level to the “base” level to indicate that they are supported at other genome browsers (at this time primarily Ensembl).

During the initial versioning process, we improved the “hubCheck” utility to check the support of the trackDb settings used in your hub against our master list of trackDb settings, by version and support level. The hubCheck tool can be used in several ways: principally it checks whether your hub works, but it can also list your hub’s settings and their support levels (required, deprecated, base, full, and new) and check your settings against the support level of another genome browser. For example, to test compatibility with Ensembl (which supports the ‘base’ level of hub settings), use the command:

hubCheck -checkSettings -level=base http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt
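
If you check several hubs on a regular basis, the command above can be assembled in a small script. The following Python sketch is illustrative only: the function names hub_check_args and run_hub_check are hypothetical, it uses just the two options shown above (-checkSettings and -level), and it assumes the hubCheck binary is on your PATH if you choose to run it.

```python
import shutil
import subprocess

EXAMPLE_HUB = "http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt"

def hub_check_args(hub_url, level="base"):
    """Build the argument list for the hubCheck command shown above."""
    return ["hubCheck", "-checkSettings", f"-level={level}", hub_url]

def run_hub_check(hub_url, level="base"):
    """Run hubCheck if it is installed; return (exit_code, output) or None."""
    if shutil.which("hubCheck") is None:
        return None  # hubCheck was not found on PATH, so nothing is run
    result = subprocess.run(hub_check_args(hub_url, level),
                            capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr

# Print the command line that would be executed for the example hub.
print(" ".join(hub_check_args(EXAMPLE_HUB)))
```

Building the argument list separately from running it makes it easy to inspect the exact command before anything is executed.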

You can see more examples of how you might use hubCheck to check the compatibility of your hub with other genome browsers in our help documentation. To acquire hubCheck, you can click Downloads from the top blue menu bar and then select Utilities and navigate to the utilities directory.

If you have questions about creating or validating your track and assembly hubs, please feel free to contact us!

For more information on hubs in the UCSC Genome Browser, please see the following pages:

For more information on hubs in other genome browsers, see their help pages here:

Questions about other genome browsers’ support for hubs should be directed to their mailing lists.

The new NCBI RefSeq tracks and You

The release of the new NCBI RefSeq track marks a major shift in how we include annotations from NCBI’s Reference Sequence Database (RefSeq) in the UCSC Genome Browser. This new track is a composite track that contains the combined set of curated and predicted annotations from the RefSeq database for hg38/GRCh38. It also contains tracks that break up the annotation set into a few subsets. These subsets include only the curated transcripts (NM, NR, or YP transcripts), only the predicted transcripts (XM or XR transcripts), all of the other annotations from RefSeq that don’t fit into the curated or predicted subsets, and the alignments of the curated and predicted transcripts to the genome. All of the coordinates and alignments in these tracks are provided by the RefSeq group.

This new NCBI RefSeq composite also includes a “UCSC RefSeq” track that is based on our original method of producing the “RefSeq Genes” track. This “UCSC RefSeq” track is built by aligning RNAs obtained from the RefSeq Database to the genome. In the early days of the UCSC Genome Browser, only RNA sequences were provided by RefSeq, so we used BLAT to align them to the genome. This was a good solution in the past, but over time this method has led to some issues with transcripts matching to multiple places and our alignments of small exons or other regions differing slightly from those found in the RefSeq database. This type of minor alignment difference can be seen in the following session, where you can see that the RefSeq Curated (top) and UCSC RefSeq (bottom) tracks place the small fifth exon in transcript NM_001130970 at different locations due to the fact that there are multiple matches to this exon sequence in that region.

The new set of RefSeq tracks differs from the “UCSC RefSeq” track in a few key ways. First, as mentioned previously, the new tracks are based entirely on positions and alignments provided by RefSeq. This means that if you obtain the hg38 coordinates for a RefSeq transcript from the UCSC Genome Browser, these coordinates should be the same as those from the entry found at NCBI’s RefSeq Database. Second, this track is currently only available for the hg38/GRCh38 assembly. Lastly, these new NCBI RefSeq tracks include predicted transcripts, which were absent from our original RefSeq track.

This has been a long and exciting collaboration between the UCSC Genome Browser staff and NCBI’s RefSeq group. We trust that this full complement of tracks from the Reference Sequence Database will be helpful to you, our Browser users. We hope to bring these tracks to more genome assemblies in the future.

The UCSC Genome Browser Coordinate Counting Systems

If you think dogs can’t count, try putting three dog biscuits in your pocket and then giving Fido only two of them.  

~Phil Pastoret

“Counting is easy. Right?”

I say this with my hand out, my thumb and 4 fingers spread out. With my other hand’s pointer finger, I simply count each digit, “one, two, three, four, five.” Easy.

But what happens when you start counting at 0 instead of 1? You can see that you have 5 digits (4 fingers and a thumb), but how do you calculate the size of your range?

With your hand in mind as an example, let’s look at counting conventions as they relate to bioinformatics and the UCSC Genome Browser genomic coordinate systems.

The UCSC Genome Browser uses two different systems:

  • “1-start, fully-closed” = coordinates positioned within the web-based UCSC Genome Browser.
  • “0-start, half-open” = coordinates stored in database tables.

Table 1. UCSC Genome Browser coordinate systems summary

0-start, half-open (0-based)

  • “BED” format (Browser Extensible Data): chr1 127140000 127140001
  • Note: Spaces, not punctuation.
  • When using BED format, browser & utilities assume coords are 0-start, half-open.
  • Stored in UCSC Genome Browser tables.
  • To convert to 1-start, fully-closed: add 1 to start, end = same.

1-start, fully-closed (1-based)

  • “Position” format: chr1:127140001-127140001
  • Note: Punctuation used, no spaces.
  • When using “position” format, browser & utilities assume coords are 1-start, fully-closed.
  • Positioned in UCSC Genome Browser web interface.
  • To convert to 0-start, half-open: subtract 1 from start, end = same.
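
The two conversion rules in Table 1 can be written as a pair of one-line functions. This is a minimal Python sketch; the function names are hypothetical and are not part of any UCSC utility.

```python
def bed_to_position(start, end):
    """0-start, half-open -> 1-start, fully-closed: add 1 to start, end = same."""
    return start + 1, end

def position_to_bed(start, end):
    """1-start, fully-closed -> 0-start, half-open: subtract 1 from start, end = same."""
    return start - 1, end

# The single base used as the example in Table 1.
print(bed_to_position(127140000, 127140001))  # (127140001, 127140001)
print(position_to_bed(127140001, 127140001))  # (127140000, 127140001)
```

Applying one function and then the other returns the original pair, which is a quick sanity check when converting a whole file.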
 

Section 1: Interval types

0-start vs. 1-start: Does counting start at 0 or 1?
Synonyms: Sometimes referred to as “0-based” vs. “1-based” or “0-relative” vs. “1-relative.”

Interval Types
For a counted range, is the specified interval fully-open, fully-closed, or a hybrid-interval (e.g., half-open)?

Ok, time to flashback to math class!
You might recall that specifying an interval type as open, closed (or a combination, e.g., “half-open”) refers to whether or not the endpoints of the interval are included in the set. For further explanation, see the interval math terminology wiki article. Figure 1 below describes various interval types.

Figure 1. Description of interval types.

Section 2: Interval types in the UCSC Genome Browser

UCSC Genome Browser web interface = “1-start, fully-closed”

A common counting convention is a system that we all used when we first learned to count the fingers on our hands; this is referred to as the “one-based, fully-closed” system (Figure 2, below). Note that an extra step is needed to calculate the range total (5).

The “1-start, fully-closed” system is what you SEE when using the UCSC Genome Browser web interface. However, all positional data that are stored in database tables use a different system.

Figure 2. 1-start, fully-closed interval. Most common counting convention. Used within the UCSC Genome Browser web interface (but not used in UCSC Genome Browser databases/tables). We calculate that we have 5 digits because 5 (pinky finger, range end) – 1 (the thumb, range start) = 4. We then need to add one to calculate the correct range; 4 + 1 = 5.

UCSC Genome Browser tables = “0-start, half-open”

While the commonly-used “one-start, fully-closed” system is more intuitive, it is not always the most efficient method for performing calculations in bioinformatic systems, because an additional step is required to calculate the size of the base-pair (bp) range.

To increase efficiency, the UCSC Genome Browser uses a “hybrid-interval” coordinate system for storing coordinates in databases/tables that is referred to as “0-start, half-open” (see Figure 3, below).

Although coordinates in the web browser are converted to the more human-readable “1-start, fully-closed” system, coordinates are stored in database tables as “0-start, half-open.” You may have heard various terms to express this 0-start system:

Synonyms for “0-start, half-open”

  • 0-based, half-open
  • 0-based start, 1-based end
    • Note: This is not technically accurate, but conceptually helpful. A “1-based end” refers to the end of the range being included, as in the common “1-based, fully-closed” system.
  • 0-start, hybrid-interval (interval type is: start-included, end-excluded)

Figure 3. The UCSC Genome Browser coordinate system for databases/tables (not the web interface) is “0-start, half-open” where start is included (closed-interval), and stop is excluded (open-interval). We calculate that we have 5 digits because 5 (range end after pinky finger) – 0 (the thumb, range start) = 5.

Another example which compares 0-start and 1-start systems is seen below, in Figure 4. This figure describes the differences in defining and calculating the range for a specified sequence highlighted in yellow, “T, C, G, A.”

Figure 4. Calculation of genomic range for comparing “1-start, fully-closed” vs. “0-start, half-open” counting systems.
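
The size calculations described for the two systems can be checked with a few lines of Python. This is a small sketch using the hand example from the figures above; the function names are hypothetical.

```python
def size_fully_closed(start, end):
    """1-start, fully-closed: both endpoints are counted, so add 1."""
    return end - start + 1

def size_half_open(start, end):
    """0-start, half-open: the end is excluded, so a plain subtraction works."""
    return end - start

# Five digits on one hand, counted in each system.
print(size_fully_closed(1, 5))  # 5
print(size_half_open(0, 5))     # 5

# Adjacent half-open intervals share a boundary without overlapping,
# so their sizes add up to the size of the combined interval.
print(size_half_open(0, 5) + size_half_open(5, 10) == size_half_open(0, 10))  # True
```

The last line shows one practical reason half-open intervals are convenient for storage: the end of one interval is the start of the next, with no off-by-one adjustment.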

Section 3: Formatting

Coordinate formatting indicates interval type

The UCSC Genome Browser and many of its related command-line utilities distinguish two types of formatted coordinates and make assumptions about each type.

The “Position” format (referring to the “1-start, fully-closed” system as coordinates are “positioned” in the browser)

  • Written as: chr1:127140001-127140001
  • No spaces.
  • Includes punctuation: a colon after the chromosome, and a dash between the start and end coordinates.
  • When in this format, the assumption is that the coordinate is 1-start, fully-closed.

The “BED” format (referring to the “0-start, half-open” system)

  • Written as: chr1 127140000 127140001
  • Spaces between chromosome, start coordinate, and end coordinate.
  • No punctuation.
  • When in this format, the assumption is that the coordinates are 0-start, half-open.
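
The same format-based assumption can be mimicked in your own scripts. The Python sketch below is not how the Browser itself parses input; it is a hypothetical helper (parse_coords) that recognizes the two formats described above and returns both as 0-start, half-open. It does not handle commas inside the numbers.

```python
import re

# "Position" format: chrom, a colon, start, a dash, end, and no spaces.
POSITION_RE = re.compile(r"^(\S+):(\d+)-(\d+)$")

def parse_coords(text):
    """Return (chrom, start, end) as 0-start, half-open coordinates."""
    text = text.strip()
    match = POSITION_RE.match(text)
    if match:
        # Position format is assumed to be 1-start, fully-closed.
        chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        return chrom, start - 1, end
    fields = text.split()
    if len(fields) >= 3:
        # BED format is assumed to be 0-start, half-open already.
        return fields[0], int(fields[1]), int(fields[2])
    raise ValueError(f"unrecognized coordinate format: {text!r}")

print(parse_coords("chr1:127140001-127140001"))  # ('chr1', 127140000, 127140001)
print(parse_coords("chr1 127140000 127140001"))  # ('chr1', 127140000, 127140001)
```

Both inputs describe the same single base, so both calls return the same tuple.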

Section 4: Examples

SNP example

What we SEE in the Genome Browser interface itself is the “1-start, fully-closed” system. However, these data are not STORED in the UCSC Genome Browser databases and tables in the same way. The UCSC Genome Browser databases store coordinates in the “0-start, half-open” coordinate system.

Table 2. SNP coordinates in web browser (1-start) vs. table (0-start) for rs782519173 (hg38)

  • Positioned in web browser (1-start, fully-closed): Start = 133255708, End = 133255708
  • Stored in table (0-start, half-open): Start = 133255707, End = 133255708

LiftOver examples and coordinate formatting

Let’s take a look at the two types of coordinate formatting (“BED” and “position”) when using the UCSC Genome Browser web-based and command-line utility liftOver tools.

1) Web-based LiftOver example

Below is an example from the UCSC Genome Browser’s web-based LiftOver tool (Home > Tools > LiftOver). Depending on how input coordinates are formatted, web-based LiftOver will assume the associated coordinate system and output the results in the same format.

Table 3. UCSC Genome Browser web-based LiftOver and “position” coordinate formatting

  • Input: Assembly = panTro3, chr1:127140001-127140001
  • Output: Lifts to this position in hg19: chr1:110255313-110255313 *
  • Notes: If your input is entered with the “position” formatted coords (1-start, fully-closed), the browser will also output the same “position” format. (Note positional format includes “:” and “-” and no spaces.)

Table 4. UCSC Genome Browser web-based LiftOver and “BED” coordinate formatting

  • Input: Assembly = panTro3, chr1 127140000 127140001
  • Output: Lifts to this position in hg19: chr1 110255312 110255313
  • Notes: If your input is entered with the “BED” formatted coords (0-start, half-open), the browser will also output the same “BED” format. (Note BED format contains no punctuation and includes spaces.)

* Note that the web-based output file extension is misleading in this case; while titled “*.bed” the positional output is not actually in “0-start, half-open” BED format, because the 1-start, fully-closed “positional” format was used for input.

 2) Command-line liftOver utility example

When using the command-line liftOver utility, understanding coordinate formatting is also important. Just as with the web-based tool, coordinate formatting specifies either the “0-start, half-open” or the “1-start, fully-closed” convention. For example, if you have a list of 1-start “position” formatted coordinates, you will need to tell the liftOver utility in your command that the input uses “position” format.

To view the liftOver utility usage statement and options, enter “liftOver” on your command-line (with no other arguments, and without the quotes).

Table 5. UCSC Genome Browser command-line liftOver and “position” coordinate formatting

  • Input (panTro3.txt): chr1:127140001-127140001
  • Command: liftOver -positions panTro3.txt liftOver/panTro3ToHg19.over.chain.gz mapped unMapped
  • Output (via “mapped” file for hg19): chr1:110255313-110255313
  • Notes: Must specify “-positions” for 1-start “position” format in command-line liftOver.

Table 6. UCSC Genome Browser command-line liftOver and “BED” coordinate formatting

  • Input (panTro3.bed): chr1 127140000 127140001
  • Command: liftOver panTro3.bed liftOver/panTro3ToHg19.over.chain.gz mapped unMapped
  • Output (via “mapped” file for hg19): chr1 110255312 110255313
  • Notes: No special argument needed; 0-start “BED” formatted coordinates are the default.
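
If you script the command-line liftOver, the choice between the two commands above can be made from the first line of the input. The Python sketch below is an illustration only: the helper names are hypothetical, and the only liftOver option it uses is the “-positions” flag shown above.

```python
def uses_position_format(line):
    """True if a line looks like chr1:127140001-127140001 (punctuation, no spaces)."""
    line = line.strip()
    return ":" in line and "-" in line and len(line.split()) == 1

def liftover_args(input_path, chain_path, first_line,
                  mapped="mapped", unmapped="unMapped"):
    """Build the liftOver argument list, adding -positions only when needed."""
    args = ["liftOver"]
    if uses_position_format(first_line):
        args.append("-positions")  # 1-start "position" input must be declared
    return args + [input_path, chain_path, mapped, unmapped]

CHAIN = "liftOver/panTro3ToHg19.over.chain.gz"
print(" ".join(liftover_args("panTro3.txt", CHAIN, "chr1:127140001-127140001")))
print(" ".join(liftover_args("panTro3.bed", CHAIN, "chr1 127140000 127140001")))
```

The two printed lines match the two commands in the tables above, one with “-positions” and one without.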

Wiggle Files

The wiggle (WIG) format is used for dense, continuous data that is displayed as a graph in the browser. Wiggle files of variableStep or fixedStep data use “1-start, fully-closed” coordinates. Like all other UCSC Genome Browser data, these coordinates are positioned in the browser as “1-start, fully-closed.”

Note: Many other formats outside of the UCSC Genome Browser use 1-start coordinate systems, such as GTF/GFF.

Table 7. UCSC Genome Browser wiggle files & coordinate systems

  • bedGraph -> bigWig: wiggle file is 0-start, half-open; positioned in the UCSC Genome Browser as 1-start, fully-closed
  • wiggle variableStep -> bigWig: wiggle file is 1-start, fully-closed; positioned in the UCSC Genome Browser as 1-start, fully-closed
  • wiggle fixedStep -> bigWig: wiggle file is 1-start, fully-closed; positioned in the UCSC Genome Browser as 1-start, fully-closed
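
To make the difference between the wiggle and bedGraph conventions concrete, here is a hedged Python sketch that turns the values of one fixedStep block into bedGraph-style rows. The function name is hypothetical and the sketch ignores header parsing; it only shows the coordinate arithmetic, assuming the usual fixedStep parameters start, step, and span.

```python
def fixed_step_to_bedgraph(chrom, start, step, values, span=1):
    """Convert fixedStep values (1-start) into bedGraph rows (0-start, half-open)."""
    rows = []
    for i, value in enumerate(values):
        first_base = start + i * step  # 1-start position of the first base covered
        chrom_start = first_base - 1   # shift to 0-start
        chrom_end = chrom_start + span # half-open end
        rows.append((chrom, chrom_start, chrom_end, value))
    return rows

for row in fixed_step_to_bedgraph("chr1", 127140001, 1, [0.5, 1.5]):
    print(*row)
```

A fixedStep value at position 127140001 becomes the bedGraph interval 127140000 to 127140001, the same single-base shift used for BED throughout this post.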

 Section 5: Resources

GTEx Resources in the Browser

Have you been wondering when we’ll get some of that next-gen gene expression in human tissues up as tracks in the browser? The GNF Atlas microarray tracks are so 2004… Yes, we do have RNA-seq from ENCODE cell lines, but you can get only so far with cell lines (are they even human?). Well, wait no longer! Once we learned what the GTEx folks are up to – RNA-seq and genotyping of samples from 53 tissues in many hundreds of donors – we just had to get on board! Read on for details…

The NIH Genotype-Tissue Expression (GTEx) project was created to establish a sample and data resource for studies on the relationship between genetic variation and gene expression in multiple human tissues. In April this year the Genome Browser released the GTEx Gene Expression track, which showcases data from the GTEx midpoint milestone data release (V6, October 2015) – 8555 tissue samples obtained from 570 adult postmortem individuals. The track shows median expression level per tissue at each gene via a new bar graph display:

[Image: GTEx Gene Expression track bar graph display at the TCAP gene]

The height of each bar represents the median expression level across all samples for a tissue, and the bar color indicates the tissue (we are using GTEx publication color conventions). You can see the gene description and tissue name with expression level when you mouseover, and can view the tissue legend in glorious detail on the track configuration page. Above, notice the 3 highly expressed tissues for TCAP protein (titin-cap, used in muscle assembly) – unsurprisingly in this case, heart (2 sub-tissues) and skeletal muscle.

In the tissue mix sampled by GTEx, you’ll find a dozen brain sub-tissues, a handful of cardiovascular tissues, and bits from digestive, reproductive, and endocrine systems. For a nice summary of the tissues assayed, check out the GTEx project portal. Not so interested in all the tissues? Turn on the tissue filter and limit the graph to show just your faves!

Once you’ve found your favorite gene, you can drill down for more detail. A nice boxplot showing the range for all samples and the sample count is right here on the details page:

[Image: GTEx expression boxplot for TCAP on the details page]

You’ll also see this plot on the new RNA-Seq Expression panel of the UCSC Genes detail page:

[Image: RNA-Seq Expression panel on the UCSC Genes details page]

If gene-level calls aren’t your thing – you’re more of a deep diver and want to see the actual RNA-seq coverage – you might find the newly released GTEx Signal Hub just your style. We were fortunate to be able to team up with the Global Alliance crowd here within the UCSC Genomics Institute and convince them to pump all the available GTEx RNA-seq through their hot new Toil pipeline (along with twice as much cancer data) to produce signal graphs. A round of ‘biggification’, lifting and track configuration (gotta have those GTEx colors!) produced the hub. Find it on the Public Hubs panel of the Track Hubs page, which you can navigate to via the My Data > Track Hubs menu option in the top blue bar.

Did I mention you can find the GTEx gene track and the GTEx Signal hub on both the hg19 (GRCh37) and hg38 (GRCh38) genome browsers?

Give the new tracks a spin! To get you started, here’s a session:

[Image: GTEx session in the Genome Browser]

Now enjoy!!

The new Genome Browser gateway

New UCSC Genome Browser gateway page design

The opinions expressed here are those of the author, Cath Tyner, and do not necessarily reflect those of the University of California Santa Cruz or any of its units.

Maybe it’s just me, but I can clearly remember the excitement of getting brand new sparkly shoes as a young kid.  Half the excitement was picking out the shoes – my siblings and I would try on every potential new-shoe option. There were high standards, of course; rigorous criteria that had to be thoroughly discussed and tested. Could they make you jump SUPER high? Low tops or high tops? Classic laces or cutting edge velcro? After finally picking out the perfect pair and racing to put them on with maniacal laughter, the reality set in. New shoes. Ahhhh. The excitement of showing off my new kicks at school was only one night away.

That’s a little bit how we all felt when we unveiled the brand new sparkly Genome Browser gateway page  earlier this week. This was a project that had been “in the works” for quite a long time, starting from ideas and drawings, moving into design phases, and finally maturing into many iterations of testable versions as the development process gained its own momentum. This project soon had a life of its own – we all became shepherds as we guided it into what we finally knew was a final product.

The things we are most excited about? We’ve already received feedback that the new human-centric phylogenetically ordered tree menu is downright awesome (and we think so too).  For me, the graphics and colors pull me in, inviting me to visually scroll through our entire genome species collection. With a flick of the scroll handle on the tree menu, I can zip from “us humans” all the way down to sea hare or Ebola virus; within two seconds, I’ve just traveled through millions and millions of evolutionary years. Based on NCBI’s taxonomy database, the “tree menu” provides an interactive way to explore our genome species collection. Little known fact: Try hovering over one of the “branches” of the tree (the horizontal and vertical lines connecting all species) and see what you find!

Example of mouse hover on tree menu branch

Another exciting new feature that makes our eyes light up is the autocomplete search function and “popular species” button shortcuts:

Button shortcuts & autocomplete search

We know that over 95% of you will benefit from our “popular species” buttons as quick access shortcuts to the genomes that you use most. We also believe that just about everyone will benefit from the autocomplete search function. For example, you can enter “fish” to see genomes from our aquatic friends, or you can enter something as specific as “hg38” to load a particular assembly version. With a whopping 276 genomes and counting, autocomplete search is a celebrated new feature! The same autocomplete function works great for our public genome hubs; try typing “plant” to see related hubs.

Want to jump to your favorite gene in the genome browser? The “position/search term” functionality remains just as efficient – just enter a genomic position, gene symbol, or search term, lean back in your comfy chair, and press “GO.” You’re there.

To see more details, including a few menu option changes, visit the gateway announcement on our news page and watch the short gateway video tour.

We sincerely hope you enjoy the new gateway page as much as we do – and as always, we invite you to contact us with questions, concerns, and compliments. 😉

How to share your UCSC screenthoughts

by Robert Kuhn      August 12, 2015

The UCSC Genome Browser is a great tool for visualizing your data alongside a ton of data from all over the place.  Perhaps, at long last, you have loaded up a gene set, the supporting mRNAs and maybe the SNPs from OMIM or dbSNP, and the Conservation track to make a great point.

Now you want to save that thought, or share it with a colleague, or make a slide for a meeting, or publish it in a paper. Saving your screenthought can take two forms: static or dynamic.  You can snap and save a picture of the screen, or you can share a link to an active Genome Browser.  We’ll talk about both approaches here and discuss some of the advantages and pitfalls of each.

Share a static image.    You can always take a screen grab and throw it onto a slide with little effort.  The screen resolution is fine for a slide, because your computer and your slide will both be 72 or 96 dpi.  But, if you try that for a publication, your image will have to be really small (scale down 3x in each dimension to get 300 dpi for print) or it will be unacceptably fuzzy.

To get high resolution images for publication, use the Browser’s .pdf export function to allow the vector-graphics image to scale to full journal size and resolution. Look for the .pdf output in the “View” pulldown menu at the top of the Browser page.  Both the chromosome ideogram and the main Browser graphic can be saved in this fashion.

Share a dynamic session, but DO NOT copy a URL.  To save a dynamic screen session that would allow you or others to look around, add more data tracks, check out other genes, etc., you might be tempted to simply copy the URL from your Firefox or Chrome web browser.  That might even seem to work OK at first, but it is in fact not a stable link and can lead to weird Browser behavior.  Worse, you may not even be sharing what you think you are, and will never know it.

Let’s break down a URL as copied directly from my Firefox and see how it plays out.

This URL contains a parameter, hgsid, which is actually a pointer to a row in a UCSC database identifying your session and keeping the state of all your variables (we borrowed the name “cart”).  If you send this URL to someone, yet keep browsing around, your cart will continue to change as you work, and your friend will see the latest state your Genome Browser is in when she clicks the link. The original state of your cart when you shared the URL is long gone before she sees it.

Your shared URL might even appear to work OK because two of the variables in the URL, db (database) and position, will override values stored in your cart (cart variables are separated by an ampersand).  Your friend will see the right genome assembly (db variable) and location (position variable) and think she’s seeing what you want.  But, if you have turned any data tracks on or off in the interim, or removed a custom track, those changes will also be part of what she sees. The original state is lost.  A different colleague could click the link at some other time and see something different still.

As an experiment, here is that same URL in a form you can click or copy/paste into your web browser:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr8%3A38311140-38327276&hgsid=438231169_c2xrrbHK2bQhTuHqjEIOniXGqenu

Does it look like this?

[Image: screenshot of the Genome Browser view as it looked when the URL was shared]

That’s what it looked like when I shared the URL. Your click will show the 5’ end of the FGFR1 gene region on human assembly hg19 (because the URL has explicitly included db and position variables), but who knows what tracks might be turned on or off in the interim? Whatever the last person to click it did to it will rule. Every person who reads this blog and clicks the link can change the track configuration for whomever comes next. Only the db and position are going to persist.

Quick-and-dirty URL hack.    If you really want a quick-and-dirty way to share a link, here are a couple of suggestions.  You could send the link as it is above, then strip a few characters out of the hgsid in the URL in your own browser and refresh.  Because the new long hgsid string will not exist in our database, you will be assigned a new hgsid and the state of the old one will stick – until your friend starts messing with it.  Or you could strip out the hgsid parameter entirely and add in other parameters that define the tracks you want to turn on, e.g.:

&knownGene=pack&snp142=dense

That will better define the tracks you want, but it is neither as stable nor as easy as saving a Session. You can use “hide,” too, to be sure certain tracks are turned off. Read more about configuring your links here.
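
A link of this kind can also be assembled programmatically. The Python sketch below is only an illustration of the idea: the function name browser_link is hypothetical, and the parameters are the ones that appear in the URLs above (db, position, and track names with a visibility such as pack, dense, or hide).

```python
from urllib.parse import urlencode

HGTRACKS = "http://genome.ucsc.edu/cgi-bin/hgTracks"

def browser_link(db, position, tracks):
    """Build an hgTracks link without an hgsid, listing track visibilities explicitly."""
    params = {"db": db, "position": position}
    params.update(tracks)  # e.g. {"knownGene": "pack", "snp142": "dense"}
    return HGTRACKS + "?" + urlencode(params)

print(browser_link("hg19", "chr8:38311140-38327276",
                   {"knownGene": "pack", "snp142": "dense"}))
```

Because urlencode escapes the colon in the position as %3A, the result has the same shape as the URL copied from the web browser earlier, minus the hgsid.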

Share a stable dynamic Session.    The best way to save a train of thought in a stable fashion is via the Saved Session tools under the “My Data” pulldown menu. A Saved Session acts as a stable snapshot of all the details of your Browser view.  Saving a thought using this feature requires a login, but it allows you to save the state of a Browser session (semi)-permanently. Anyone viewing your session will be able to further browse around the genome without affecting the session you saved.  After you have saved a session, you will see a “Browser” link that can be copied and shared.

For example, to load the view above as a stable session, try this link (no login is required to view someone else’s Saved Session):

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=sessionGallery&hgS_otherUserSessionName=hg19_watsonKriek

Although anyone with this URL can view this session, no one can change it unless logged in as user “SessionGallery.”

In the past we endeavored to save a Session for at least 3-4 months after the last time it was viewed, and custom tracks in sessions persisted for at least 48 hours after the last time they were viewed. We no longer remove session data unless it is deleted by the user, and we no longer remove custom tracks in sessions.  We still encourage people to save their Session cart to a local file using the “Save Settings” feature (and to keep backups of all their custom tracks on a local machine).  That way, you can load your Session settings any time and onto any copy of the Browser (such as to the European mirror or a local Genome Browser-in-a-Box) and avoid any possible loss of data due to unforeseen circumstances.  We do the best we can to maintain our servers so that you do not lose your sessions, but computers are only human and they break.

Really stable sessions.    If you are looking to create a permanent link for a publication, you should consider hosting your downloaded Session and any of your own custom data on a server you control (such as in a Track Hub). It will still be loaded onto the UCSC Genome Browser, but you are not at the mercy of California earthquakes, wildfires or crashed servers (except for your own).  You can read more about building links to remotely hosted user information here and on our Session Gallery page here.

On both pages you can learn about the following parameters for forming links to launch sessions from your hub:

hgS_doLoadUrl=submit
hgS_loadUrlName=
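
As a rough illustration, the two parameters above can be combined into a link with a few lines of Python. The function name and the example.org address are hypothetical placeholders; substitute the URL of the session file you actually host.

```python
from urllib.parse import urlencode

HGTRACKS = "http://genome.ucsc.edu/cgi-bin/hgTracks"

def session_load_link(session_file_url):
    """Build a link that asks the Browser to load a remotely hosted session file."""
    params = {
        "hgS_doLoadUrl": "submit",
        "hgS_loadUrlName": session_file_url,  # URL of your saved settings file
    }
    return HGTRACKS + "?" + urlencode(params)

# Placeholder address, used only to show the shape of the resulting link.
print(session_load_link("https://example.org/hub/mySession.txt"))
```

The session file URL is percent-encoded so that its own slashes and colon do not interfere with the surrounding link.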

We hope we have given you some food for thought about how to make the Genome Browser more useful in your work.  Using a reliable method for saving and sharing sessions is a great way to avoid the frustration of lost data and misleading links.  Stay tuned for more useful Browser tips in future blogs.

New default gene set on GRCh38: GENCODE Basic genes

Genome Browser screen shot of the GRCh38 (hg38) human assembly showing the GENCODE Basic track opened in the PTEN region on chromosome 10.

As of Monday, June 29, 2015, the UCSC Genome Browser will use the GENCODE v22 comprehensive gene set as its default gene set on the human genome assembly GRCh38 (hg38), replacing the previous default set of genes created here at UCSC using code written by Jim Kent. This track, which is labeled “GENCODE Basic” in the Genes and Gene Predictions track group, replaces the UCSC Genes track as the default gene set.  We’re making this change in recognition of the value of reducing the number of competing gene sets used by the bioinformatics community.  With this change we will be using the same set of genes as Ensembl, reducing the potential for confusion, especially in clinical settings.

We’ve kept the same familiar UCSC Genes schema for the new gene set, using nearly all the same table names and fields that appeared in earlier versions of UCSC Genes. Hopefully this will make the transition to the new GENCODE models easier. Every transcript in the new set has both a UCSC ID and a GENCODE transcript ID. There are a couple of new tables: knownCds, which has the coding frame numbers for each gene, and knownToMrna, which captures the association to GenBank mRNAs. A couple tables are no longer present: knownGeneTxMrna and knownGeneTxPep.

By default, we display only the transcripts tagged as “basic” by the GENCODE Consortium. However, all the transcripts in the GENCODE comprehensive set are present in the tables. You can view them in the browser by selecting “show comprehensive set” in the “Show” section of the track’s description page. On that same page, you can also configure the browser to label the genes with the GENCODE transcript IDs by selecting the “GENCODE Transcript ID” label option.

The new gene set has 195,178 total transcripts, compared with 104,178 in the previous UCSC Genes version. The total number of canonical genes, now defined using the GENCODE gene loci (ENSG* identifiers), has increased from 48,424 to 49,534.

Comparing the previous gene set with the new version:

  • 9,459 transcripts are identical.
  • 22,088 transcripts were not carried forward to the new version.
  • 43,681 have consistent splicing, but changes in the UTR.
  • 28,950 transcripts overlap with those in the previous set, but have at least one different splice.
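
As a quick sanity check on these figures, the Python sketch below confirms that the four categories add up to the 104,178 transcripts of the previous set and shows each category's share. The dictionary labels are shorthand chosen for this example only.

```python
# Transcript counts quoted in this post
previous_total = 104_178

fate_of_previous_transcripts = {
    "identical": 9_459,
    "not carried forward": 22_088,
    "same splicing, changed UTR": 43_681,
    "overlapping, different splice": 28_950,
}

accounted_for = sum(fate_of_previous_transcripts.values())
print(accounted_for == previous_total)  # True: the categories cover the old set

# Share of the previous gene set that falls in each category
for label, count in fate_of_previous_transcripts.items():
    print(f"{label:32s} {count:7,d}  {count / previous_total:6.1%}")
```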

We plan to continue using the previous UCSC computational pipeline to generate the default gene set on the mouse assembly, GRCm38 (mm10), for the foreseeable future. We will also periodically update the old UCSC-computed gene set on the human GRCh38 assembly as an ancillary track (“Old UCSC Genes”) without the rich set of link-outs we maintain for the default gene set.

Introducing the Genome Browser YouTube Channel

Here at the Genome Browser we’re constantly looking for ways to improve the Browser and make it more accessible. A big part of that is making it as easy as possible for people to learn how to use our tools to best serve their research. In the past this has included setup and maintenance of documentation, including our help docs as well as a dedicated wiki site, where browser staffers and external users alike have shared content. We also continue to offer real-time support on our mailing list (genome@soe.ucsc.edu).

Thanks to funding support from the NHGRI we were recently able to amp up our training efforts in two ways. We now have a program whereby interested groups can economically host a Genome Browser workshop at their institution. For more information, fill out our intake survey: bit.ly/ucscTraining.

The other thing we have been able to do is launch a YouTube channel where you will find video tutorials explaining how to use various parts of the Browser. While static documents and email support are great, we realize some people learn better by seeing how something is done. We also hope this will be a good resource for those unable to physically attend one of our trainings. The video topics are meant to address some of the common workflows and questions we get from users. Each video is an illustration of how to answer a particular query, for example: “How do I identify exon numbers with the UCSC Genome Browser?”

The answer will follow a sequence of steps traversing different parts of the Browser. For those who want to jump straight to one of the steps/skills listed in the video, you will find a set of internal links to the timepoints within the video in the YouTube video description. There, you will also find a transcript of the video if you want to follow along or take notes.

You can find links to these resources on our training page. If you have a question that you’d like to see demoed in a video, we are always open to suggestions! You can reach the training department by email or tweet us an idea @GenomeBrowser.

New features & data – Winter 2015

We realize that it is sometimes difficult to keep up with all of the new features and data sets in the Genome Browser. After all, we release new annotation tracks almost daily, and we update our software every three weeks. This post highlights a smattering of the most recent updates.

Browser & Track Hub Features

– Personalize your view of GENCODE Genes

In addition to choosing which GENCODE Gene tracks to view (e.g. basic gene set, PolyA, pseudogenes), you can now filter and highlight transcripts within the tracks. Try it here (click on the “Genes” link).

– Display your bigWig data on the other strand in your track hub

Use the new trackDb setting, negateValues on, to allow your bigWig data to be displayed on the Crick strand. This setting negates the values in the wiggle file, meaning that positive values become negative and vice versa. This is useful for wiggles representing transcription or other activities on the Crick strand. Note that wiggles with negative values are drawn in the color specified in altColor, not color as positive values are.
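
To make this concrete, here is a small Python sketch that writes out a trackDb stanza for a bigWig track with negateValues turned on. The track name, data URL, and colors are placeholders invented for the example; only the setting names come from the trackDb format.

```python
def bigwig_stanza(track, big_data_url, color, alt_color, negate=True):
    """Build the text of a trackDb stanza for a single bigWig track."""
    lines = [
        f"track {track}",
        f"bigDataUrl {big_data_url}",
        f"shortLabel {track}",
        f"longLabel {track} bigWig signal",
        "type bigWig",
        f"color {color}",        # used for positive values
        f"altColor {alt_color}",  # used for negative values
    ]
    if negate:
        # Flip the sign of every value, e.g. for Crick-strand signal
        lines.append("negateValues on")
    return "\n".join(lines) + "\n"

# Hypothetical track and file location
print(bigwig_stanza("crickSignal", "http://example.org/hub/hg19/crick.bw",
                    color="0,0,255", alt_color="255,0,0"))
```

With negateValues on, originally positive values end up below zero, so it is the altColor entry that determines how they are drawn.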

– Disconnect your hub automatically

If you need to automatically disconnect your hub, you can use the hubClear variable in the URL. This is especially helpful for users who are creating hubs dynamically. For example, to disconnect the urlOfHubToClear hub, use a URL constructed like so:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubClear=http://urlOfHubToClear
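
If your hubs are generated by a script, the same link can be built programmatically. The Python sketch below is one possible way to do it; the function name is invented for this example, and the hub URL is the same placeholder used above.

```python
from urllib.parse import urlencode

def hub_clear_link(hub_url, db="hg19",
                   cgi="http://genome.ucsc.edu/cgi-bin/hgTracks"):
    """Return a Browser link that disconnects the hub at hub_url."""
    params = {"db": db, "hubClear": hub_url}
    # safe=":/" keeps the hub URL readable rather than percent-encoded
    return cgi + "?" + urlencode(params, safe=":/")

print(hub_clear_link("http://urlOfHubToClear"))
# http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubClear=http://urlOfHubToClear
```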

– Enable BLAT for your assembly hub

If you have created your own assembly hub you can now set up a BLAT server to enable quick mRNA/DNA and cross-species protein alignments. All you need is a server from which you can run gfServer, and the .2bit file containing the sequence of your assembly. Read the detailed instructions here.

Note that the BLAT and gfServer programs and source code are freely available from the University of California Santa Cruz for academic and non-commercial use. A license is required for commercial use.

Annotation Tracks & Assemblies

– dbSNP v141 for hg19/GRCh37 & hg38/GRCh38

We released four annotation tracks from human Build 141 of NCBI’s database of short genetic variations, dbSNP. This release marks the first set of data available for the newest human assembly, hg38/GRCh38. Read more.

Since then, NCBI has released the next database update: dbSNP Build 142. We have derived another four tracks from this release, which are currently undergoing our rigorous quality assurance process and will be released very soon.

– Proteomics data available for hg19/GRCh37: PeptideAtlas track & CPTAC data hub

Data from the National Cancer Institute’s (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) is now available in the UCSC Genome Browser as a public track hub. This track hub contains peptides that were identified by CPTAC in their deep mass spectrometry-based characterization of the proteome content of breast, colorectal and ovarian cancer biospecimens that were initially sequenced by The Cancer Genome Atlas (TCGA).

In addition, we have also released a PeptideAtlas track that displays peptide identifications from the PeptideAtlas August 2014 (Build 433) Human build. This build, based on 971 samples containing more than 420 million spectra, identified over a million distinct peptides covering more than 15,000 canonical proteins. Read more.

– GenBank track updates

We have reduced the frequency of GenBank data updates for assemblies other than human and mouse. The GenBank-based tracks for selected recent assemblies are now updated about once a week rather than daily. The remainder of the 150+ assemblies in the Genome Browser are updated whenever a newer assembly is released and after that, about once a month. The GenBank update schedule for the human and mouse assemblies remains unchanged. Read more.

– UniProt track for hg19/GRCh37

We have added a UniProt track to hg19/GRCh37, and merged the old PFAM (Protein Families) track into it. Check it out here.

– New assembly browsers:

  • Cow, Bos taurus, bosTau8 – sequenced/assembled by University of Maryland (UMD 3.1.1)
  • Fruitfly, D. melanogaster, dm6 – provided by the FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics
  • Ebola Virus, Sierra Leone 2014 outbreak, eboVir3 assembly browser and portal

New Product

– Genome Browser in a Box (GBiB)

In case you missed the previous blog post, we have created an easily installable version of the Genome Browser. You can set it up in just a few minutes on your laptop for private browsing of your own data alongside the native annotation tracks. It’s fine-tuned to work with hg19/GRCh37, but it works with all other assemblies as well. If you have genomic sequence for other organisms, you can add your own assembly hub. Read more.


If you would like to stay informed about our new features and data sets, you are welcome to subscribe to our low-volume announcement mailing list: genome-announce@soe.ucsc.edu

Ebola update

The opinions expressed here are those of the author, Jim Kent, and do not necessarily reflect those of the University of California Santa Cruz or any of its units. 

It’s been nearly a month since I wrote my first Ebola blog entry. Since then the world at large, and I in particular, have learned more about Ebola. We have seen clearly that the virus can be transmitted within hospitals in developed countries. We’ve gotten more data showing that good hospital care including hydration, survivor plasma, and electrolyte balancing can save 75% of the patients, perhaps more if applied early. We’ve seen that, from a political point of view, it’s better for the Centers for Disease Control (CDC) to overreact than under-react. We see the epidemic continue to grow, but we also see some signs of its growth rate slowing, at least in Liberia. It seems a good time for a follow-up post.

At UCSC we’ll be adding new data and editing the Ebola Portal (http://genome.ucsc.edu/ebolaPortal/) in the coming weeks. Wikipedia has done such a great job synthesizing Ebola scientific knowledge that we’re dropping the Treatments and Vaccines section of the Ebola Portal in favor of a Wikipedia link. We’re continuing to encourage people to release Ebola viral and antibody sequences. We’ve added a new viral genome sequence from the smaller epidemic going on in the Democratic Republic of Congo (Maganga et al., 2014), and expect the first sequences from American patients soon.

In broader scientific terms, I think that most of the important medical, scientific, and epidemiological issues are now known. The challenge of how to formulate this knowledge into the most effective response is still a huge task. How can we minimize the loss of life with the resources at our disposal?

Many aspects of the epidemiology of Ebola are clear. In Africa as a whole, the time it takes to double the number of people who have been infected is about three weeks. In rural areas and affluent urban areas the doubling time is approximately four weeks, while in the shantytowns it is approximately two weeks. Epidemics in general follow an S-shaped curve, as shown in Figure 1 below. Initially there is a period of exponential growth. Approximately at the point where half of the people have become infected, the growth slows simply because there are fewer people left to infect. Even in the worst hit place in this epidemic, the shantytown of New Cru Town in Monrovia, the epidemic is still in the exponential growth phase on the left side of this curve. This is both good and bad: good because most of the people have not been subject to the certain pain and likely death of Ebola infection, but bad in that the epidemic will rapidly worsen.

Figure 1. A graph of the constrained growth equation that epidemics tend to follow in enclosed, freely mixing areas.
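
For readers who want to reproduce a curve of this shape, the Python sketch below uses the standard logistic form of constrained growth. The community size of 100,000 and the single starting case are hypothetical values picked only for illustration; the three-week doubling time is the figure quoted above.

```python
import math

def constrained_growth(t_weeks, n0, capacity, doubling_weeks):
    """Logistic (constrained) growth: cumulative infections after t_weeks.

    n0             -- infections at t = 0
    capacity       -- size of the enclosed, freely mixing population
    doubling_weeks -- doubling time during the early exponential phase
    """
    rate = math.log(2) / doubling_weeks
    return capacity / (1 + (capacity - n0) / n0 * math.exp(-rate * t_weeks))

# Hypothetical community of 100,000 people starting from one case
for week in range(0, 73, 12):
    infected = constrained_growth(week, n0=1, capacity=100_000, doubling_weeks=3)
    print(f"week {week:2d}: {infected:10.0f} infected")
```

Early on the output doubles every three weeks; growth slows once roughly half of the population has been infected, which is the bend in the S-shaped curve.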

Within a single patient, the medical course of the disease is also relatively clear. After initial infection there is an incubation period of typically 9 days, which can be as short as three days and at least as long as three weeks before symptoms develop. The first symptoms are similar to those of many diseases – aches, fatigue, sometimes a headache, and sometimes stomach pains. After about three days of general malaise, usually a fever develops. The disease progresses rapidly in the next four days. During this phase there is intense diarrhea, usually vomiting, and sometimes bleeding. An adult patient will lose about 10 liters of fluid per day from these causes, if kept hydrated, and will often die from the effects of dehydration otherwise. After four days of intense symptoms patients will start improving if they are destined to recover, or deteriorate further if not. The recovery rates in Africa are only about 30%.

Taking care of an Ebola patient is a lot of work and is vastly complicated by the precautions caretakers must take to avoid becoming infected themselves. The patients are in considerable pain and subject to retching, spasms, and convulsions. For many patients, a madness sets in during the peak of the disease as well. Getting the patients to drink their 10 liters of electrolytes or stay attached to their IV lines, as well as cleaning up after them, is physically demanding and emotionally draining work. This is exacerbated by the need to wear a protective suit that gets so hot people can safely work in it for only 45 minutes without themselves getting dehydrated. In the U.S. hospitals, approximately 100 staff are required for a single Ebola patient. Doctors Without Borders manages to get by with far fewer staff than this, but it is unrealistic to think that an Ebola patient can be managed with fewer than two staff per bed.

This is where we come to the fundamental conflict between the epidemiology and the medicine. Medically we want to treat every Ebola patient. The combination of hydration and plasma and/or antiviral treatment seems to raise the recovery rate from 30% to 75%, and is likely to improve further as our experience and tools for treatment grow. However, according to CDC estimates (corrected for under-reporting), as of 9/26/2014 there were 1500 people needing beds in Ebola treatment facilities in Liberia and Sierra Leone alone. We did not have the ~3000 support staff we needed then, and do not have the ~10,000 staff we would need for the ~5000 people estimated to need beds as I write this on Nov 3.
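
The staffing figures follow from the floor of two staff per bed mentioned earlier. The Python sketch below reproduces that arithmetic. As an added assumption that is not part of the CDC estimate, it also applies the three-week doubling time to the September bed count to see whether it lands near the November figure.

```python
from datetime import date

STAFF_PER_BED = 2    # lower bound on staffing given in this post
DOUBLING_DAYS = 21   # "about three weeks" doubling time for Africa as a whole

def staff_needed(beds):
    """Minimum staff for a given number of occupied beds."""
    return beds * STAFF_PER_BED

beds_sep_26 = 1500   # CDC estimate quoted above
beds_nov_3 = 5000    # estimate quoted above

print(staff_needed(beds_sep_26))  # 3000
print(staff_needed(beds_nov_3))   # 10000

# Assumption for illustration: bed demand doubles every three weeks
elapsed_days = (date(2014, 11, 3) - date(2014, 9, 26)).days
projected_beds = beds_sep_26 * 2 ** (elapsed_days / DOUBLING_DAYS)
print(elapsed_days, round(projected_beds))
```

Under that assumption the projection comes out a little above 5,000 beds, in line with the estimate quoted for early November.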

In medicine, generally prevention is far easier than treatment. For Ebola the most important prevention is keeping the patient away from other people during the most infectious phase when the patient is sickest, typically starting the day after the first sign of a fever and continuing until the patient dies or recovers. If the patient dies, the body is also exceedingly infectious. By and large the Africans have accepted the need to treat the body as hazardous and to bypass traditional funeral practices as a result. The big controversy in Africa right now concerns what to do with the patient during the infectious stage.

Ideally, patients would be brought into a treatment facility a day or two before they become highly infectious. This would have the dual benefit of isolating the population at large from infection and more than doubling the patient’s chance of survival. Unfortunately, because we don’t have enough people to treat patients this way, we have to pursue other courses of action as well that are not ideal for the people currently infected, but at least reduce the number of people who will be infected in the future. Once we have vaccines in quantity, likely by March 2015, the situation will get much better. In the meantime though, to save lives, we have to consider a measure nobody really likes – quarantine.

Quarantine has become a bad word, in large part because most of the recent quarantines have been implemented so poorly. Quarantine is never going to be a joyful event, but if done carefully and with compassion, it need not be particularly unpleasant either. Certainly being quarantined is much more pleasant than catching Ebola or having friends and family die, and for the next several months at least, that is the alternative.

In general, people need food, water, and protection from extremes of temperature to live, and a degree of social contact with friends and family and a bit of entertainment to be happy. There is no reason that these can’t be provided inside of quarantine, and the cost of doing so is ever so much less than the cost of providing care for an Ebola patient.

The worst hit parts of Africa, and the ones in most need of quarantine, are the shantytowns. In a shantytown in the tropics, most structures are little more than a roof for shade and protection from the rain. Setting up structures such as these, capable of holding a family or social unit of about six with simple cots to sleep on, would not be hard and could be the basis of a quarantine unit. Food could be distributed in a central mess hall, and temperatures taken before one was allowed into the mess hall to eat. People showing fevers or other signs of sickness would be taken from the mess hall to a community care center where family could see patients. Ideally quarantine units of approximately 250 people could be set up in many places. The 250-person limit would reduce the spread of infection within a unit.

Once out of quarantine, ideally the dwellers of a shantytown would be moved into a refugee camp that would slowly grow to the size of the shantytown it is replacing. This camp would need a mess hall and a latrine system of some sort.

People would be invited, not forced, from the shantytown into the quarantine facility. If food, water, shelter, and minimal medical care are available, it is likely that the demand for going into such a quarantine facility would exceed the space available. A lottery would be a fair way to decide who gets in first.

After a certain point in time, everyone in the shantytown will either have passed through quarantine and into the refugee camp, have caught Ebola and either died or become non-infectious, proven naturally immune, or gotten very lucky. At this point the shantytown could be disinfected and the people from the refugee camp could move back home. It seems likely that we may have a vaccine deployed as well by then.

Outside of the shantytowns, needed quarantines could be done in people’s own homes. In villages, a community care center coupled with contact tracing is all that is necessary. The traditional methods of contact tracing do work well outside of dense urban settings lacking basic infrastructure.

What would a community care center look like? The goal would be to have a place where the patients could, to the best of their ability, take care of themselves with limited help from survivors of Ebola and the bravest volunteers from their friends and family. The crucial parts of a facility are:

  • Adequate stocks of oral rehydration fluids containing the correct balance of sugars, sodium, and potassium salts.
  • “Cholera cots” (see Figure 2) that can efficiently and safely collect the patient liquid hazardous waste.
  • A place to disinfect and dispose of the waste.
  • Basic protection equipment and disinfection facilities for the workers.
  • Water and simple food such as bananas and rice.
  • Lamivudine or other mass-produced antivirals that don’t require refrigeration, if available.
  • A fence so patients can’t exit until they’ve recovered and to keep out unprotected people.

Figure 2. A cholera cot – a must for treating diarrheal diseases in the tropics. (Image from Hesperian health guides.)

How well these community centers will work is perhaps the most uncertain part of this plan but, particularly with the cooperation of survivors, they may represent our best hope until vaccines are widely available. Socially they would need to be set up so that people could visit and talk through the fence to patients, but be located out of sight of the main habitations so as not to provoke despair. Community care centers have worked successfully in some Liberian towns, as described in detail in the Nov. 4, 2014 issue of Morbidity and Mortality Weekly Report (MMWR) from the CDC (Logan et al., 2014).

The CDC has done a lot of good work in containing this epidemic. Where they’ve faltered has been in portraying more certainty and perhaps more optimism than is warranted by what we know. Perhaps the CDC and leadership are worried that people won’t listen to them if they don’t convey absolute certainty; that if they don’t minimize statements of risk, people will panic. However, panic is normally a temporary condition. In the end, level heads that can reasonably appraise the situation will prevail. How can we appraise the situation, though, if we are not told the truth in all of its uncertainty and risk?

It is true that Ebola is mostly spread by contact with bodily fluids. It is true in previous, smaller epidemics that airborne spread between humans, if any, has played a minor role. However, it is wishful thinking, not science, to absolutely rule this out. With a disease as dangerous as Ebola, certainly it is better to err on the side of caution. Wearing a face mask on public transportation in an Ebola-infected area and washing one’s hands when one arrives back home or at work should be our advice, not — as Obama has said in videos aimed at West Africans — that you need not worry about catching Ebola on the bus if you live in an area where it is rampant. Wearing full body protection including a breathing apparatus should be the norm among Ebola medical personnel, and somewhat belatedly it has become so.

It is true that people with Ebola will mostly show a fever before the illness gets really serious, and vomiting and diarrhea start. However, the temperature increase one develops in response to an illness is highly variable across the population. Children in general spike higher fevers than adults. A noticeable fraction of adults, around 10%, don’t get fevers higher than 100 degrees Fahrenheit even in the absence of medicine. A significant fraction of people are on anti-inflammatory medications for arthritis and other common conditions and don’t get fevers for this reason. In Africa, where presumably people tend to be less medicated than in the U.S., reports show that 11% to 13% of people sick enough with Ebola to take themselves to the hospital do not have a fever (Schieffelin JS et al., 2014; WHO Ebola Response Team, 2014).

It is true that Ebola is mostly non-contagious before people reach the stage of illness where they show a fever (if one is going to develop a fever). Using an RT-PCR test, we can’t detect virus in the blood before the initial pre-fever symptoms of malaise, aches, fatigue etc. are felt. By the time fever shows, typically we do get solid RT-PCR results, but the viral levels measure only 10% of what they will the next day when the viral level typically peaks and the blood, at least, is maximally infectious (Towner et al., 2004). The viral loads in blood typically remain at the peak level for four days, and then either the patient dies, or the viral loads decrease and the patient recovers. If we assume (and it is an assumption) that a person’s level of contagiousness follows the blood viral load, then certainly most of the disease transmission occurs in the last four days, rather than the days leading up to and including the initial fever stage. Because there is a lag time before people notice that they have a fever and go to the hospital, how much of the transmission is likely to occur in the 8 hours after fever starts? Since the viral load will be rising from 10% to 100% over the course of the day, following an exponential progression, I’ll estimate the viral load on average during the first 8 hours after fever as 13% of peak, and the next 16 hours after fever as 33% of peak. With this I can estimate the fraction of transmission that falls in the 8-hour window as:

Initial-transmission/(initial-transmission + later-transmission)
or
(8 hours * 13%) / (8 hours * 13% + 16 hours * 33% + 4*24 hours * 100%)

which comes to almost exactly 1%. So, while it is scientifically reasonable to estimate that 99% of the transmission will be avoided if people go into isolation relatively promptly after they’ve reached the stage of the disease usually associated with a fever, it is also reasonable to estimate that 1% of the transmission occurs before this stage. The clinical and epidemiological data suggest that it could not be much higher than this, but are not strong enough to say that it could be lower. Given the deadliness of the disease, it is prudent to consider people infectious at a low level even before the illness becomes severe.
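
The estimate above can be checked in a few lines of Python. The sketch below simply evaluates the formula with the durations and average viral loads given in the text; treating contagiousness as proportional to blood viral load is the stated assumption of the estimate, not something the code verifies.

```python
# Each window: (duration in hours, average viral load as a fraction of peak)
early_window = [(8, 0.13)]                    # first 8 hours after fever starts
later_windows = [(16, 0.33), (4 * 24, 1.00)]  # rest of day one, then four days at peak

def exposure(windows):
    """Sum of hours weighted by relative viral load."""
    return sum(hours * load for hours, load in windows)

fraction = exposure(early_window) / (exposure(early_window) + exposure(later_windows))
print(f"{fraction:.1%}")  # 1.0%
```

Changing the numbers in either list shows how sensitive the 1% figure is to the assumed viral loads.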

If the world at large tended to under-react early in the course of this epidemic, for the most part this has changed. The CDC and others have tightened their recommendations and response in the USA. African nations and health organizations have been effective in keeping the spread of Ebola outside of Guinea, Liberia, and Sierra Leone to small, quickly extinguished outbreaks. The combination of popular education about how to avoid catching Ebola, contact tracing, and quarantine seems to be putting the brakes on the epidemic in the rural areas of West Africa. I do hope a system similar to the quarantine-into-refuge I describe here can be applied to the slums and shantytowns, and that these, together with community care centers, will help save many of those in even the hardest hit regions.

References:

Logan G et al. Establishment of a Community Care Center for Isolation and Management of Ebola Patients — Bomi County, Liberia, October 2014. MMWR 2014;63(Early Release):1-3.

Maganga GD et al. Ebola Virus Disease in the Democratic Republic of Congo. N Engl J Med. 2014 Oct 15. [Epub ahead of print]

Schieffelin JS et al. Clinical Illness and Outcomes in Patients with Ebola in Sierra Leone. N Engl J Med. 2014 Oct 29. [Epub ahead of print]

Towner JS et al. Rapid diagnosis of Ebola hemorrhagic fever by reverse transcription-PCR in an outbreak setting and assessment of patient viral load as a predictor of outcome. J Virol. 2004 Apr;78(8):4330-41.

WHO Ebola Response Team. Ebola virus disease in West Africa–the first 9 months of the epidemic and forward projections. N Engl J Med. 2014 Oct 16;371(16):1481-95.