The Gene Haplotype Alleles feature displays the chromosome-phased
1000 Genomes Phase 1 data for protein
coding regions. These data comprise the genomes of 1,092 individuals from 14 populations in Africa,
Europe, East Asia and the Americas, constructed using a combination of low-coverage whole-genome
and exome sequencing.
The variant genotypes have been phased by the 1000 Genomes Project (i.e., the two alleles
of each diploid genotype have been assigned to two haplotypes, one inherited from each parent).
How to use the "Gene Haplotype Alleles" section
Click on any protein-coding gene in the UCSC Genes track and scroll to the Common Gene
Haplotype Alleles section. (The feature is currently implemented only on GRCh37/hg19
protein-coding genes.) There will be a table of haplotypes for the protein-coding
portion of the gene. Each row in the table represents a unique gene haplotype as found in
the 1000 Genomes Phase 1 project data. The table is sortable on any column by clicking on
the column headers.
Haplotype frequencies are based upon all relevant chromosomes in the data set. The total
number is almost always 2,184 (1,092 for Y and location-dependent for X). Hover over
the frequency calculations to show the number of a particular haplotype in the dataset
(e.g., "N=1327 of 2184").
If appropriate, the homozygous frequency will also be shown and will reflect the number
of individuals in the dataset.
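The two frequencies described above can be illustrated with a minimal sketch. It assumes a toy list of phased individuals, each given as a pair of haplotype strings; the function names and data are hypothetical and are not part of the Genome Browser.

```python
from collections import Counter

def haplotype_frequencies(chromosome_haplotypes):
    """Map each haplotype to (count, total chromosomes, frequency)."""
    total = len(chromosome_haplotypes)
    counts = Counter(chromosome_haplotypes)
    return {hap: (n, total, n / total) for hap, n in counts.items()}

def homozygous_frequency(individual_pairs, haplotype):
    """Fraction of individuals carrying `haplotype` on both chromosomes."""
    homozygous = sum(1 for a, b in individual_pairs if a == haplotype and b == haplotype)
    return homozygous / len(individual_pairs)

# Toy data (hypothetical): four individuals, two phased haplotypes each.
pairs = [("AT", "AT"), ("AT", "GT"), ("GT", "GT"), ("AT", "AC")]
chromosomes = [hap for pair in pairs for hap in pair]

freqs = haplotype_frequencies(chromosomes)
n, total, freq = freqs["AT"]
print(f"N={n} of {total} ({freq:.1%})")   # N=4 of 8 (50.0%)
print(homozygous_frequency(pairs, "AT"))  # 0.25
```

Note that the haplotype frequency is counted over chromosomes, while the homozygous frequency is counted over individuals.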
The reference haplotype (made entirely of reference variants) may
not be represented in the 1000 Genomes data. If it is, it will be so marked in the
By default, only non-synonymous, common variant sites are displayed. Common variants
are defined as occurring in at least 1% of 1000 Genomes subject chromosomes.
includes rare and synonymous variant sites found in
1000 Genomes subjects in the list of haplotypes.
returns to haplotypes defined by common and non-synonymous
displays variant sites (and full sequence) as DNA bases.
displays variant sites (and full sequence) as predicted amino acids. Predicted stop codons
are represented by "]" and predicted frameshifts by
The reference variant is shown at the top of each variant site column. This is the
value found in the GRCh37/hg19 reference genome at that variant site. In most cases it is a
single letter (AA code/DNA base). In the case of an insertion with respect to the reference
genome, the reference value is shown as "-". Large deletions are represented
by the first two sequence letters followed by "+++".
Hovering the pointer over any of the variant site links will show a more complete description
of that variant. For example, the variant description
"AA:16 A|T chr9:136137554 SNP: G|A (0.995|0.005) rs55917063"
consists of the following elements:
AA:16 A|T - AA residue number and variants (AA view only)
chr9:136137554 - genome location
SNP: G|A (0.995|0.005) - nucleotide variants and allele frequencies
rs55917063 - dbSNP variant name (if one is known)
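For readers who copy these descriptions into their own notes or scripts, the following sketch splits one into its elements. The regular expression is an assumption based only on the single example above; it handles a two-allele SNP and is not an official format specification.

```python
import re

# Pattern inferred from the one example description; the AA part and the
# rs identifier are treated as optional.
DESCRIPTION_PATTERN = re.compile(
    r"^(?:AA:(?P<aa_pos>\d+) (?P<aa_ref>[^|\s]+)\|(?P<aa_alt>\S+) )?"
    r"(?P<chrom>chr\w+):(?P<pos>\d+) "
    r"SNP: (?P<ref>[^|\s]+)\|(?P<alt>\S+) "
    r"\((?P<ref_freq>[\d.]+)\|(?P<alt_freq>[\d.]+)\)"
    r"(?: (?P<rsid>rs\d+))?$"
)

def parse_variant_description(text):
    """Return the elements of a hover description as a dict."""
    match = DESCRIPTION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized variant description: {text!r}")
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    fields["ref_freq"] = float(fields["ref_freq"])
    fields["alt_freq"] = float(fields["alt_freq"])
    if fields["aa_pos"] is not None:
        fields["aa_pos"] = int(fields["aa_pos"])
    return fields

parsed = parse_variant_description(
    "AA:16 A|T chr9:136137554 SNP: G|A (0.995|0.005) rs55917063"
)
print(parsed["chrom"], parsed["pos"], parsed["rsid"])
```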
Clicking on any
non-reference variant shown in the variant sites columns
will link to the full details of that variant site in the 1000 Genomes phase 1 track.
Each haplotype allele sequence is generated from GRCh37/hg19 reference DNA, with variants
spliced in, then translated into amino acids.
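The splice-then-translate idea can be sketched as follows. This is an illustration only: it assumes single-base substitutions at 0-based offsets in an already-assembled coding sequence, uses the standard genetic code, and writes stop codons as "*". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def splice_variants(reference, variants):
    """Substitute single bases ({0-based offset: base}) into the reference."""
    sequence = list(reference)
    for offset, base in variants.items():
        sequence[offset] = base
    return "".join(sequence)

def translate(dna):
    """Translate complete codons; any trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

reference_cds = "ATGGCCTGA"                            # toy coding sequence
haplotype_cds = splice_variants(reference_cds, {3: "A"})  # G to A at offset 3
print(translate(reference_cds))  # MA*
print(translate(haplotype_cds))  # MT*
```

In this toy case a G to A substitution changes alanine to threonine, the same kind of change as in the "A|T ... G|A" description shown earlier.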
shows the predicted effects of variation on gene sequence for each of the haplotypes.
If variant sites are currently displayed as DNA bases, then the predicted DNA
sequence is shown (for coding regions only). If variant sites are displayed as amino acids, the
predicted protein sequence is shown.
simultaneously shows the DNA sequence above the protein sequence for easy comparison.
Showing protein sequence with the DNA triplets is the easiest way to visualize the
reverts to the simplified protein sequences view.
hides the full sequence view completely.
Green vertical highlights accentuate the
variant sites within the full sequence.
Bold red letters mark the effects of variation. Synonymous changes
are only evident when DNA bases are displayed.
Blue vertical highlights show a variant that
has been sorted on by clicking its column header. Sorting on a variant can be used to quickly
locate one site out of many in the full sequence view.
The AA residue number is shown when hovering over any part of
the sequence in amino acid view.
By default, only common gene haplotype alleles are displayed, defined as occurring in at least
1% of the relevant 1000 Genomes subject chromosomes.
includes all haplotypes. Some large gene models cover many variants
and therefore have a very large number of distinct
haplotypes represented in the 1000 Genomes project data. If this is the case, only the 100
most frequently occurring haplotypes will be shown in the table, though the true number will
limits the display to only common haplotypes.
Each haplotype is found in one or more subjects participating in phase 1 of the
1000 Genomes project. The distribution of the haplotypes across different population
groups can be examined by pressing
to display the distribution across a broader spectrum of groups
to return to the major grouping and
to hide these data again.
When shown, each population group is a column in the table which contains the percent of
that haplotype that is found in each group. This is not the same as the percent of each
group that has the haplotype. Hover over the distribution numbers to show the
frequency of occurrence of the haplotype within each group. For example,
hovering over 25.7
might show "N=304 of 1183 (found in 71.0% of all ASN)",
meaning that of the 1183 occurrences of the haplotype, 304 or 25.7% are found in the
ASN group and that 71.0% of all East Asian copies of this gene (in 1000 Genomes phase 1
data) match this haplotype.
To see the number of 1000 Genomes chromosomes covered for each group, hover over the
column header (e.g. ASN will usually show
"East Asian [N=572]").
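The difference between the two percentages (share of the haplotype found in a group versus share of the group carrying the haplotype) is easy to confuse, so here is a small sketch with made-up counts. The group totals and counts below are hypothetical and are not 1000 Genomes values.

```python
def distribution_summary(group_counts, group_totals):
    """For each group, report both percentages discussed above.

    group_counts: occurrences of one haplotype in each group
    group_totals: number of chromosomes sampled in each group
    """
    total_occurrences = sum(group_counts.values())
    summary = {}
    for group, n in group_counts.items():
        summary[group] = {
            # Value shown in the table column: share of the haplotype.
            "percent_of_haplotype": 100 * n / total_occurrences,
            # Value shown on hover: share of the group's chromosomes.
            "percent_of_group": 100 * n / group_totals[group],
        }
    return summary

counts = {"AFR": 10, "ASN": 30, "EUR": 60}    # hypothetical
totals = {"AFR": 100, "ASN": 50, "EUR": 200}  # hypothetical
summary = distribution_summary(counts, totals)
print(summary["ASN"])
```

Here 30% of the haplotype's occurrences fall in the ASN group, yet 60% of the ASN chromosomes carry it, because that group is small in this toy data.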
By default, scoring is hidden. Three types of scores are provided to help users find
haplotype alleles that occur more or less frequently than expected or that have unusual
distributions in populations.
See definitions below.
To see the calculated scores, press
Population Group definitions
The numbers listed here are of individuals, but the numbers used in generating
the haplotypes table are frequently the number of relevant chromosomes
(e.g. 2184 not 1092).
Includes only major groups for which there are data in phase 1 of the
1000 Genomes project.
Ad Mixed American
1000 Genomes Groups
Includes only 1000 Genomes groups for which there are data in phase 1
of the project.
African Ancestry in Southwest US
Luhya in Webuye, Kenya
Yoruba in Ibadan, Nigeria
Ad Mixed American:
Colombian in Medellin, Colombia
Mexican Ancestry in Los Angeles, California
Puerto Rican in Puerto Rico
Han Chinese in Beijing, China
Han Chinese South
Japanese in Tokyo, Japan
Utah residents with Northern and Western European ancestry
Finnish in Finland
British in England and Scotland
Iberian populations in Spain
Toscani in Italia
Scores alone cannot be used to draw definitive conclusions about
The haplotype score is based on the normalized (-log10) probability of
finding exactly N subject chromosomes with this haplotype, given the frequencies of individual
variants and assuming they are independent. The score is normalized by multiplying the base
probability by the total number of variants. Normalization allows comparing the scores
between genes with many variant sites and those with few. The score will be positive if the
haplotype is more frequent than expected by chance and negative if less frequent. Larger
scores will result when minor variant alleles occur together more frequently
than expected, which might reflect co-selection or may merely be an artifact of more
recent events. A negative haplotype score may be more informative.
For haplotypes made from common, non-synonymous variants, haplotype scores above 606
are seen in only 2% of genes. Likewise, a score of less than -199 is only seen
in 2% of genes.
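A rough sketch of the idea behind the haplotype score follows. It makes several assumptions that the text does not confirm: a binomial model for "exactly N chromosomes", an expected haplotype frequency equal to the product of the individual allele frequencies, and a sign applied according to whether the observed count is above or below expectation. The normalization step is not reproduced, so the numbers will not match those in the table.

```python
import math

def log10_binomial_pmf(k, n, p):
    """log10 of the binomial probability of exactly k successes in n trials."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return log_pmf / math.log(10)

def expected_haplotype_frequency(allele_frequencies):
    """Expected frequency if the variant sites were independent."""
    return math.prod(allele_frequencies)

def signed_haplotype_score(observed, total, allele_frequencies):
    """-log10 P(exactly `observed`), signed by direction (an assumption)."""
    expected = expected_haplotype_frequency(allele_frequencies)
    magnitude = -log10_binomial_pmf(observed, total, expected)
    return magnitude if observed > expected * total else -magnitude

# Two sites whose alleles have frequencies 0.5 and 0.1: about 5% expected.
print(signed_haplotype_score(300, 2184, [0.5, 0.1]))  # positive: over-represented
print(signed_haplotype_score(10, 2184, [0.5, 0.1]))   # negative: under-represented
```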
The homozygous score is based upon the (-log10) probability of finding
exactly N individuals with this haplotype on both chromosomes, given the actual
frequency of the haplotype in subject chromosomes. The score will be positive if the
haplotype is found homozygous in more individuals than expected and negative when found
in fewer than expected. Negative values might suggest that the haplotype is deleterious
when homozygous. For haplotypes made from common, non-synonymous variants, homozygous
scores above 92 are seen in only 2% of genes. Likewise, a score of less than -15 is
only seen in 2% of genes.
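To make the "more or fewer than expected" comparison concrete, the sketch below computes the number of homozygous individuals expected from the haplotype frequency alone. It assumes random pairing of chromosomes (expected homozygous fraction equal to the squared haplotype frequency), which is an illustrative assumption and not a statement of how the browser computes its score.

```python
def expected_homozygous_individuals(haplotype_count, total_chromosomes, individuals):
    """Expected homozygous individuals under random pairing (an assumption)."""
    frequency = haplotype_count / total_chromosomes
    return frequency ** 2 * individuals

def homozygous_excess(observed_homozygous, haplotype_count, total_chromosomes, individuals):
    """Positive if more homozygotes are seen than expected, negative if fewer."""
    expected = expected_homozygous_individuals(
        haplotype_count, total_chromosomes, individuals
    )
    return observed_homozygous - expected

# Toy numbers: the haplotype is on 50 of 100 chromosomes from 50 individuals.
print(expected_homozygous_individuals(50, 100, 50))  # 12.5
print(homozygous_excess(5, 50, 100, 50))             # -7.5
```

A negative excess corresponds to the case described above, where fewer homozygous individuals are found than the haplotype frequency would suggest.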
The population score (only visible when population distributions are displayed)
is the fixation index (FST) based upon the difference in variance between
sub-population haplotype frequencies and the total haplotype frequency. Note that
this calculation is based upon the frequency of haplotype, rather than the distribution of
that haplotype across populations. Nevertheless, large population scores should reflect
large skews in distribution in more frequently occurring haplotypes.
For haplotypes made from common, non-synonymous variants,
population scores above 0.424 are seen in only 2% of genes and scores above 0.506 are
seen in only 1% of genes.
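As a hedged illustration of a variance-based fixation index, the sketch below uses the classic form FST = Var(p) / (p_bar * (1 - p_bar)), with the variance of sub-population haplotype frequencies weighted by sample size. The exact estimator used by the browser is not stated here, so treat this as one common formulation and not as the implementation.

```python
def variance_based_fst(sub_counts, sub_totals):
    """Variance of sub-population frequencies over p_bar * (1 - p_bar).

    sub_counts: haplotype occurrences in each sub-population
    sub_totals: chromosomes sampled in each sub-population
    """
    total = sum(sub_totals)
    p_bar = sum(sub_counts) / total
    if p_bar in (0.0, 1.0):
        return 0.0  # no variation in the total population
    variance = sum(
        n * (count / n - p_bar) ** 2 for count, n in zip(sub_counts, sub_totals)
    ) / total
    return variance / (p_bar * (1 - p_bar))

print(variance_based_fst([50, 50], [100, 100]))  # 0.0: same frequency everywhere
print(variance_based_fst([100, 0], [100, 100]))  # 1.0: fixed in one group only
```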
If the gene is on the negative ('-' or "reverse") strand, all variant sites and
sequences will be presented with respect to the negative strand. This differs from the way
variants are displayed in the 1000 Genomes phase 1 variations track, which are always
shown as they appear on the positive ('+' or "forward") strand.
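When comparing a negative-strand gene's variants with the forward-strand values in the variations track, the conversion is a reverse complement. A minimal sketch (the helper name is hypothetical):

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the sequence as read on the opposite strand."""
    return sequence.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # GCAT
# A forward-strand SNP of G|A reads as C|T on the negative strand.
print(reverse_complement("G"), reverse_complement("A"))  # C T
```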
Only variant sites occurring within coding exons are currently included in haplotypes.
Variants occurring within intron splice junctions are not included.
Haplotypes are defined by the set of variant sites included and the variant allele at each
of those sites. Therefore haplotype and homozygous frequency calculations depend upon
which variant sites are included. Likewise all scores are specific to the haplotype as
defined by the variant sites included, and population scores are also specific to the
population groups that are examined.
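The dependence on the chosen variant sites can be seen in a small sketch. It assumes each chromosome is represented as a string with one allele per variant site; the data and function name are hypothetical.

```python
from collections import Counter

def haplotype_counts(chromosomes, site_indices):
    """Count distinct haplotypes defined by the alleles at the chosen sites."""
    return Counter(
        "".join(chromosome[i] for i in site_indices) for chromosome in chromosomes
    )

# One allele per variant site, for four toy chromosomes.
chromosomes = ["GAT", "GAC", "AAT", "GAT"]

print(haplotype_counts(chromosomes, [0, 1]))     # two haplotypes: GA x3, AA x1
print(haplotype_counts(chromosomes, [0, 1, 2]))  # three haplotypes: GAT x2, GAC, AAT
```

Including the third site splits one haplotype into two, which lowers its frequency and would change any score derived from it.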
The haplotypes displayed are not pregenerated but are derived from 1000 Genomes VCF files
and other Genome Browser datasets at the time they are requested. Consequently, scoring is
calculated in the context of a single gene model and the variant sites used to derive
If the number of variants covered exceeds 200, the haplotype table will not be
displayed and the reason will be noted.
Certain viewing options are expensive operations that slow the
gene page response time. If this section is not being actively used, it is recommended that
previous choices be cleared by pressing "Reset to defaults" at the bottom of this section.