Monthly Archives: December 2016

The UCSC Genome Browser Coordinate Counting Systems

If you think dogs can’t count, try putting three dog biscuits in your pocket and then giving Fido only two of them.  

~Phil Pastoret

“Counting is easy. Right?”

I say this with my hand out, my thumb and 4 fingers spread out. With my other hand’s pointer finger, I simply count each digit, “one, two, three, four, five.” Easy.

But what happens when you start counting at 0 instead of 1? You can see that you have 5 digits (4 fingers and a thumb), but how do you calculate the size of your range?

With your hand in mind as an example, let’s look at counting conventions as they relate to bioinformatics and the UCSC Genome Browser genomic coordinate systems.

The UCSC Genome Browser uses two different systems:

  • “1-start, fully-closed” = coordinates positioned within the web-based UCSC Genome Browser.
  • “0-start, half-open” = coordinates stored in database tables.
Table 1. UCSC Genome Browser coordinate systems summary

| | 0-start, half-open (0-based) | 1-start, fully-closed (1-based) |
|---|---|---|
| Format | “BED” format (Browser Extensible Data) | “Position” format |
| Example | chr1 127140000 127140001 | chr1:127140001-127140001 |
| Note | Spaces, not punctuation | Punctuation used, no spaces |
| Assumption | When using BED format, browser & utilities assume coords are 0-start, half-open. | When using “position” format, browser & utilities assume coords are 1-start, fully-closed. |
| Where used | Stored in UCSC Genome Browser tables | Positioned in UCSC Genome Browser web interface |
| Conversion | To convert to 1-start, fully-closed: add 1 to start, end = same | To convert to 0-start, half-open: subtract 1 from start, end = same |
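The two conversion rules in Table 1 can be sketched in a few lines of Python. The function names here are illustrative only and are not part of any UCSC tool; the coordinates are the ones used in Table 1.

```python
def bed_to_position(start, end):
    """Convert 0-start, half-open to 1-start, fully-closed.

    Add 1 to the start; the end stays the same.
    """
    return start + 1, end


def position_to_bed(start, end):
    """Convert 1-start, fully-closed to 0-start, half-open.

    Subtract 1 from the start; the end stays the same.
    """
    return start - 1, end


# Coordinates from Table 1 (hypothetical helper names, real example values).
print(bed_to_position(127140000, 127140001))  # (127140001, 127140001)
print(position_to_bed(127140001, 127140001))  # (127140000, 127140001)
```

Only the start coordinate ever changes, so converting one way and then back returns the original pair.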
 

Section 1: Interval types

0-start vs. 1-start : Does counting start at 0 or 1?
Synonyms:
Sometimes referred to as “0-based” vs. “1-based” or “0-relative” vs. “1-relative.”

Interval Types
For a counted range, is the specified interval fully-open, fully-closed, or a hybrid-interval (e.g., half-open)?

Ok, time to flashback to math class!
You might recall that specifying an interval type as open, closed (or a combination, e.g., “half-open”) refers to whether or not the endpoints of the interval are included in the set. For further explanation, see the interval math terminology wiki article. Figure 1 below describes various interval types.

Figure 1. Description of interval types.
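To make the idea of included and excluded endpoints concrete, here is a minimal Python sketch. The function `in_interval` and its parameters are hypothetical names chosen for this illustration; it simply tests whether a whole number falls inside an interval whose endpoints may or may not be included.

```python
def in_interval(x, start, end, include_start=True, include_end=True):
    """Return True if x lies in the interval from start to end.

    include_start / include_end say whether each endpoint is part of the set
    (closed) or not (open).
    """
    after_start = x >= start if include_start else x > start
    before_end = x <= end if include_end else x < end
    return after_start and before_end


candidates = range(-1, 8)

# Fully-closed interval [1, 5]: both endpoints are included.
closed = [x for x in candidates if in_interval(x, 1, 5)]

# Half-open interval [0, 5): start included, end excluded.
half_open = [x for x in candidates if in_interval(x, 0, 5, include_end=False)]

print(closed)     # [1, 2, 3, 4, 5]
print(half_open)  # [0, 1, 2, 3, 4]
```

Both intervals contain five whole numbers, which is the same hand of five digits counted from 1 or from 0.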

Section 2: Interval types in the UCSC Genome Browser

UCSC Genome Browser web interface = “1-start, fully-closed”

A common counting convention is a system that we all used when we first learned to count the fingers on our hands; this is referred to as the “one-based, fully-closed” system (Figure 2, below). Note that an extra step is needed to calculate the range total (5).

The “1-start, fully-closed” system is what you SEE when using the UCSC Genome Browser web interface. However, all positional data that are stored in database tables use a different system.

Figure 2. 1-start, fully-closed interval. This is the most common counting convention. It is used within the UCSC Genome Browser web interface (but not in UCSC Genome Browser databases/tables). To count the digits, subtract the range start from the range end: 5 (pinky finger, range end) – 1 (the thumb, range start) = 4. We then need to add one to get the correct range size: 4 + 1 = 5.

UCSC Genome Browser tables = “0-start, half-open”

While the commonly-used “one-start, fully-closed” system is more intuitive, it is not always the most efficient method for performing calculations in bioinformatic systems, because an additional step is required to calculate the size of the base-pair (bp) range.

To increase efficiency, the UCSC Genome Browser uses a “hybrid-interval” coordinate system for storing coordinates in databases/tables that is referred to as “0-start, half-open” (see Figure 3, below).

Although coordinates in the web browser are converted to the more human-readable “1-start, fully-closed” system, coordinates are stored in database tables as “0-start, half-open.” You may have heard various terms to express this 0-start system:

Synonyms for “0-start, half-open”

  • 0-based, half-open
  • 0-based start, 1-based end
    • Note: This is not technically accurate, but conceptually helpful. A “1-based end” refers to the end of the range being included, as in the common “1-based, fully-closed” system.
  • 0-start, hybrid-interval (interval type is: start-included, end-excluded)

Figure 3. The UCSC Genome Browser coordinate system for databases/tables (not the web interface) is “0-start, half-open,” where the start is included (closed) and the stop is excluded (open). We calculate that we have 5 digits because 5 (range end, after the pinky finger) – 0 (the thumb, range start) = 5. No extra step is needed.

Another example which compares 0-start and 1-start systems is seen below, in Figure 4. This figure describes the differences in defining and calculating the range for a specified sequence highlighted in yellow, “T, C, G, A.”

Figure 4. Calculation of genomic range for comparing “1-start, fully-closed” vs. “0-start, half-open” counting systems.
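The range arithmetic for both systems can be sketched as follows. The function names and the ten-base toy sequence are made up for this illustration (the sequence is not taken from Figure 4); it is arranged so that “T, C, G, A” sits at 1-start positions 5 through 8.

```python
def size_closed(start, end):
    """Size of a 1-start, fully-closed range: the extra +1 step is required."""
    return end - start + 1


def size_half_open(start, end):
    """Size of a 0-start, half-open range: a plain subtraction."""
    return end - start


# Hypothetical toy sequence, for illustration only.
sequence = "ACGTTCGAAC"

# "TCGA" occupies 1-start, fully-closed positions 5-8,
# which is 0-start, half-open 4-8.
print(size_closed(5, 8))     # 4
print(size_half_open(4, 8))  # 4

# Python slices are themselves 0-start, half-open, so the stored
# coordinates can be used directly to pull out the bases.
print(sequence[4:8])         # TCGA
```

Python's slice notation happens to follow the same 0-start, half-open convention, which makes it a handy way to check stored coordinates by eye.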

Section 3: Formatting

Coordinate formatting indicates interval type

The UCSC Genome Browser and many of its related command-line utilities distinguish two types of formatted coordinates and make assumptions of each type.

The “Position” format (referring to the “1-start, fully-closed” system as coordinates are “positioned” in the browser)

  • Written as: chr1:127140001-127140001
  • No spaces.
  • Includes punctuation: a colon after the chromosome, and a dash between the start and end coordinates.
  • When in this format, the assumption is that the coordinate is 1-start, fully-closed.

The “BED” format (referring to the “0-start, half-open” system)

  • Written as: chr1 127140000 127140001
  • Spaces between chromosome, start coordinate, and end coordinate.
  • No punctuation.
  • When in this format, the assumption is that the coordinates are 0-start, half-open.
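As a rough sketch of how the two formats map onto each other, the Python below parses a “position” string into a BED-style line and back. The function names and the regular expression are assumptions made for this example; they are not how the Genome Browser itself parses input, and the sketch handles only plain digits with no thousands separators.

```python
import re

# chrom, a colon, start, a dash, end; no spaces (simplified, hypothetical pattern).
_POSITION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def position_to_bed_line(position):
    """Turn a 1-start, fully-closed 'chrom:start-end' string into a
    0-start, half-open 'chrom start end' line."""
    match = _POSITION_RE.match(position.strip())
    if match is None:
        raise ValueError(f"not in position format: {position!r}")
    chrom, start, end = match.groups()
    return f"{chrom} {int(start) - 1} {int(end)}"


def bed_line_to_position(line):
    """Turn a 0-start, half-open BED line (first three fields) into a
    1-start, fully-closed 'chrom:start-end' string."""
    chrom, start, end = line.split()[:3]
    return f"{chrom}:{int(start) + 1}-{int(end)}"


print(position_to_bed_line("chr1:127140001-127140001"))  # chr1 127140000 127140001
print(bed_line_to_position("chr1 127140000 127140001"))  # chr1:127140001-127140001
```

`split()` accepts either spaces or tabs between the BED fields, while the output uses single spaces to match the examples in this post.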

Section 4: Examples

SNP example

What we SEE in the Genome Browser interface itself is the “1-start, fully-closed” system. However, these data are not STORED in the UCSC Genome Browser databases and tables in the same way. The UCSC Genome Browser databases store coordinates in the “0-start, half-open” coordinate system.

Table 2. SNP coordinates in web browser (1-start) vs table (0-start)

| rs782519173 (hg38) | Start | End |
|---|---|---|
| Positioned in web browser: 1-start, fully-closed | 133255708 | 133255708 |
| Stored in table: 0-start, half-open | 133255707 | 133255708 |

LiftOver examples and coordinate formatting

Let’s take a look at the two types of coordinate formatting (“BED” and “position”) when using the UCSC Genome Browser web-based and command-line utility liftOver tools.

1) Web-based LiftOver example

Below is an example from the UCSC Genome Browser’s web-based LiftOver tool (Home > Tools > LiftOver). Depending on how input coordinates are formatted, web-based LiftOver will assume the associated coordinate system and output the results in the same format.

Table 3. UCSC Genome Browser web-based LiftOver and “position” coordinate formatting

| | |
|---|---|
| Input | Assembly = panTro3: chr1:127140001-127140001 |
| Output | Lifts to this position in hg19: chr1:110255313-110255313 |
| Notes | If your input is entered with the “position” formatted coords (1-start, fully-closed), the browser will also output the same “position” format. (Note positional format includes “:” and “-” and no spaces.) |

Table 4. UCSC Genome Browser web-based LiftOver and “BED” coordinate formatting

| | |
|---|---|
| Input | Assembly = panTro3: chr1 127140000 127140001 |
| Output | Lifts to this position in hg19: chr1 110255312 110255313 |
| Notes | If your input is entered with the “BED” formatted coords (0-start, half-open), the browser will also output the same “BED” format. (Note BED format contains no punctuation and includes spaces.) |

* Note that the web-based output file extension is misleading when “position” formatted input is used: while the file is titled “*.bed,” the positional output is not actually in “0-start, half-open” BED format, because the 1-start, fully-closed “positional” format was used for input.

2) Command-line liftOver utility example

Coordinate formatting matters for the command-line liftOver utility as well. Just like the web-based tool, it uses the formatting to decide between the “0-start, half-open” and the “1-start, fully-closed” convention. For example, if you have a list of 1-start “position” formatted coordinates and you want to use the command-line liftOver utility, you will need to specify in your command that the input is “position” formatted.

To view the liftOver utility usage statement and options, enter “liftOver” on your command-line (with no other arguments, and without the quotes).

Table 5. UCSC Genome Browser command-line liftOver and “position” coordinate formatting

| | |
|---|---|
| Input (panTro3.txt) | chr1:127140001-127140001 |
| Command | liftOver -positions panTro3.txt liftOver/panTro3ToHg19.over.chain.gz mapped unMapped |
| Output (via “mapped” file for hg19) | chr1:110255313-110255313 |
| Notes | Must specify “-positions” for 1-start “position” format in command-line liftOver. |

Table 6. UCSC Genome Browser command-line liftOver and “BED” coordinate formatting

| | |
|---|---|
| Input (panTro3.bed) | chr1 127140000 127140001 |
| Command | liftOver panTro3.bed liftOver/panTro3ToHg19.over.chain.gz mapped unMapped |
| Output (via “mapped” file for hg19) | chr1 110255312 110255313 |
| Notes | No special argument needed; 0-start “BED” formatted coordinates are the default. |
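If you script your lifts, the only difference between the two commands in Tables 5 and 6 is the “-positions” flag. The Python sketch below assembles the argument list for either case. The helper name `build_liftover_command` is hypothetical, and the sketch only builds the command; it assumes the liftOver utility and chain file are available if you choose to run it.

```python
def build_liftover_command(input_path, chain_path,
                           mapped="mapped", unmapped="unMapped",
                           positions=False):
    """Assemble a liftOver command line as a list of arguments.

    positions=True adds "-positions" for 1-start "position" formatted input;
    the default (no flag) is for 0-start, half-open BED input.
    """
    command = ["liftOver"]
    if positions:
        command.append("-positions")
    command += [input_path, chain_path, mapped, unmapped]
    return command


chain = "liftOver/panTro3ToHg19.over.chain.gz"

# "Position" formatted input, as in Table 5.
print(" ".join(build_liftover_command("panTro3.txt", chain, positions=True)))

# BED formatted input, as in Table 6.
print(" ".join(build_liftover_command("panTro3.bed", chain)))
```

The returned list could be handed to `subprocess.run(command, check=True)` on a machine where liftOver is installed and on the PATH.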

Wiggle Files

The wiggle (WIG) format is used to display dense, continuous data as a graph in the browser. Wiggle files of variableStep or fixedStep data use “1-start, fully-closed” coordinates. Like all other UCSC Genome Browser data, these coordinates are positioned in the browser as “1-start, fully-closed.”

Note: Many other formats outside of the UCSC Genome Browser use 1-start coordinate systems, such as GTF/GFF.

Table 7. UCSC Genome Browser wiggle files & coordinate systems

| File type | Coordinate system in wiggle file | Coordinate system as positioned in UCSC Genome Browser |
|---|---|---|
| bedGraph -> bigWig | 0-start, half-open | 1-start, fully-closed |
| wiggle variableStep -> bigWig | 1-start, fully-closed | 1-start, fully-closed |
| wiggle fixedStep -> bigWig | 1-start, fully-closed | 1-start, fully-closed |
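To see how the two file conventions in Table 7 relate, here is a small Python sketch that turns the values of a fixedStep wiggle declaration into bedGraph-style rows. The function name and the data values are invented for this example, and it assumes the usual fixedStep meaning of start (1-start position of the first value), step (distance between values), and span (bases covered by each value, default 1).

```python
def fixed_step_to_bedgraph(chrom, start, step, values, span=1):
    """Convert fixedStep data (1-start) into bedGraph rows (0-start, half-open).

    Each value covers `span` bases beginning at a 1-start position, so the
    bedGraph start is that position minus 1 and the end is start + span.
    """
    rows = []
    for i, value in enumerate(values):
        first_base = start + i * step  # 1-start position of this value
        bed_start = first_base - 1     # shift to 0-start
        rows.append((chrom, bed_start, bed_start + span, value))
    return rows


# Hypothetical data values at the chr1 position used earlier in this post.
for row in fixed_step_to_bedgraph("chr1", 127140001, 1, [0.5, 1.0]):
    print(*row)
# chr1 127140000 127140001 0.5
# chr1 127140001 127140002 1.0
```

The first value lands on chr1 127140000 127140001, the same single base as chr1:127140001-127140001 in the browser.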

Section 5: Resources


If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.