Author Archives: Christopher Lee

Accessing the Genome Browser Programmatically Part 3 – Controlling the Genome Browser Image

The previous parts of this series (part 1 and part 2) focused on how to use the Genome Browser to obtain data, and for this third and final post we’re gonna divert from that theme and talk about how to control the track image itself.

Standard procedure for obtaining images of the browser is to configure the view exactly as you want, and then use the “View->PDF/PS” option in the menu bar in order to download a PDF or PostScript of your image. In addition to this method, you can generate PNG images on the fly with the following hgRenderTracks template:
http://genome.ucsc.edu/cgi-bin/hgRenderTracks?parameters

Parameters should be replaced by the URL key-value pairs that the main track display, hgTracks, understands, like ‘db=hg19’ or ‘knownGene=pack’. For example, to compare the transcripts provided by NCBI to UCSC’s own alignments of the transcripts at the ABO locus, you can use the following URL and the cURL program to download a PNG file:

curl 'http://genome.ucsc.edu/cgi-bin/hgRenderTracks?db=hg19&position=chr9:136130563-136150630&hideTracks=1&refSeqComposite=pack&ncbiRefSeqCurated_sel=1&ucscRefSeqView=pack&refGene_sel=1&pubs=pack' > example.png

Opening example.png in your favorite image viewer will display the following image:
refSeqAndPubs

There are many additional parameters described on the Sharing your custom track section of the custom tracks help page. Using these parameters you can configure hgRenderTracks to display any combinations of tracks, and along the hgt.customText parameter, also show your custom tracks with them.

To illustrate, if I have the following custom track:

browser hide all
browser gold=pack
browser gap=pack
browser visibility=pack
chr1 1000 2000
chr1 2100 3000
chr1 3100 4000
…

Hosted on the web at http://genome-test.soe.ucsc.edu/~chmalee/exCustomTrack.bed, then I can tell hgRenderTracks to load this file with the hgt.customText parameter like so:

http://genome.ucsc.edu/cgi-bin/hgRenderTracks?db=hg38&position=chr1:1-100000&hgt.customText=http://genome-test.soe.ucsc.edu/~chmalee/exCustomTrack.bed
customTrackAndGoldGap

The “browser” lines at the beginning of the custom track indicate which native tracks to turn on along their visibilities, while the “hide all” line turns all the other native tracks off. In addition to these basic instructions there are many more examples on the UCSC Genome Browser Wiki.

What about when you want to view a genome and annotations not hosted on our site? If you have a FASTA file of your genome available, you can use faToTwoBit to convert your genome into a 2bit file, then make an assembly hub out of your data. Once you’ve created your hub, you can view the hub with the hubUrl setting. As an example, I have hosted an assembly hub for Arabadopsis thaliana here, and I can view the hub via a single URL like so:

https://genome.ucsc.edu/cgi-bin/hgTracks?genome=araTha1&hubUrl=https://genome-test.gi.ucsc.edu/~chmalee/araTha1/plantAraTha1/hub.txt.
assemblyHubViaUrl

If your data needs to stay behind your local firewall, then you can use the GBiB and GBiC products so you can set up your own “copy” of the UCSC Genome Browser that meets your privacy needs.

Further Reading:

Accessing the Genome Browser Programmatically Part 2 – Using the Public MySQL Server and gbdb System

If you missed part 1 about obtaining sequence data, you can catch up here.

The UCSC Genome Browser is a large repository of data from multiple sources, and if you want to query that annotation data, the easiest way to get started is via the Table Browser. Choose the assembly and track of interest and click the “describe table schema” button, which will show the MySQL database name, the primary table name, the fields of the table and their descriptions. If the track is stored not in MySQL but as a binary file (like bigBed or bigWig) in /gbdb, it will show a file name, e.g. "Big Bed File: /gbdb/dm6/ncbiRefSeq/ncbiRefSeqOther.bb". If this is the case, skip directly to the Accessing the gbdb directory system section below. Otherwise, the track data is either a single MySQL table or a set of related tables, which you can either download as gzipped text files from the “Annotation Database” section on our downloads page (here’s the GRCh37/hg19 listing) and work on them locally, or use the public MySQL server and issue MySQL queries remotely. Generally speaking, the format for most of our tables is similar to the formats described here, e.g., in bed (“chrom chromStart chromEnd”) format, and we do not store any sequence or contigs in our databases, which means you’ll need to use the instructions in Part 1 of this blog series in order to get any raw sequence data.

Accessing the public MySQL server
The best way to showcase the public MySQL server is to show some examples — here are a few to get you started:
1. If you want to download some transcripts from the new NCBI RefSeq Genes track, you can use the following command:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select * from ncbiRefSeq limit 2" hg38
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+
| bin | name        | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts                                                         | exonEnds                                                           | score | name2   | cdsStartStat | cdsEndStat | exonFrames                        |
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+
| 585 | NR_046018.2 | chr1  | +      |   11873 | 14409 |    14409 |  14409 |         3 | 11873,12612,13220,                                                 | 12227,12721,14409,                                                 |     0 | DDX11L1 | none         | none       | -1,-1,-1,                         |
| 585 | NR_024540.1 | chr1  | -      |   14361 | 29370 |    29370 |  29370 |        11 | 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, | 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, |     0 | WASH7P  | none         | none       | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, |
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+

2. If you are interested in a particular enhancer region, for instance “chr1:166,167,154-166,167,602”, and want to find the nearest genes within a 10kb range, then the following query will do the job:

$ chrom="chr1"
$ chromStart="166167154"
$ chromEnd="166167602"
$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \
   e.chrom, e.txStart, e.txEnd, e.strand, e.name, j.name as geneSymbol from ncbiRefSeqCurated e,\
   ncbiRefSeqLink j where e.name = j.id AND e.chrom='${chrom}' AND \
      ((e.txStart >= ${chromStart} - 10000 AND e.txStart <= ${chromEnd} + 10000) OR \ (e.txEnd >= ${chromStart} - 10000 AND e.txEnd <= ${chromEnd} + 10000)) \
order by e.txEnd desc " hg38
+-------+-----------+-----------+--------+----------------+------------+
| chrom | txStart   | txEnd     | strand | name           | geneSymbol |
+-------+-----------+-----------+--------+----------------+------------+
| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B     |
| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B     |
| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B     |
+-------+-----------+-----------+--------+----------------+------------+

3. If you need to get gene names and their lengths for RNA-seq read normalization, you can use the following query:

$ mysql -h genome-mysql.soe.ucsc.edu -u genome -A -e “ \
  select l.name, kr.value, psl.qEnd - psl.qStart as length \
  from   refGene r, hgFixed.refLink l, knownToRefSeq kr, knownCanonical kc, refSeqAli psl \
  where  r.name = l.mrnaAcc and r.name = kr.value and kr.name = kc.transcript \
         and r.name = psl.qName group by kr.value limit 3” hg38
+-------+-----------+--------+
| name  | value     | length |
+-------+-----------+--------+
| A2M   | NM_000014 |   4920 |
| NAT2  | NM_000015 |   1317 |
| ACADM | NM_000016 |   2622 |
+-------+-----------+--------+

In addition to our download site and public MySQL server hosted here in California, we have also recently added support for a download site (http://hgdownload-euro.soe.ucsc.edu) and public MySQL server (genome-euro-mysql.soe.ucsc.edu) hosted in Europe, which will speed up downloads for many of our users.

Please follow the Conditions for Use when querying the public MySQL servers.

Many of the command line utilities available on our utilities downloads server are also able to interact with our databases or download files, like mafFetch (as long as your ~/.hg.conf file is present as discussed below):

$ mafFetch xenTro9 multiz11way region.bed stdout
##maf version=1
##maf version=1 scoring=blastz
a score=0.000000
s xenTro9.chr9     15946024 497 +  80437102 ACTAT...
e galGal5.chr14     1678315   0 -  15595052 I
e xenLae2.chr9_10L 13130032 2034 - 117834370 I

a score=2992.000000
s xenTro9.chr9     15946521 145 +  80437102 TCATC...
s xenLae2.chr9_10L 13132066 148 - 117834370 TTATC...

Note: Only the first 5 bases on each line and only the first 10 lines are shown for brevity.

Here we are directly querying the mutliz11way table for the Xenopus tropicalis xenTro9 assembly, no need to download the entire alignment file to the local disk and query manually. Commands of this nature usually require a special private .hg.conf file in the user’s home directory (note the leading dot). This configuration file contains a couple key=value lines that most of our programs can parse and then use to access the public MySQL server. This page contains a sample .hg.conf file that can be used by most of the command line utilities to direct them to access either our US MySQL server or our European MySQL server. That sample .hg.conf is certainly enough to get started, but for more information about the various Genome Browser configuration options, please see the comments in the ex.hg.conf and minimal.hg.conf files.


Accessing the gbdb directory system
The third method of grabbing our data is via the /gbdb/ directory system. This location, browsable here, holds most of the bigBed, bigWig, and other large data files that we do not keep directly in MySQL databases/tables. There are many utilities available for manipulating these files, and most of them are able to work on remote files, for example:

$ bigBedToBed -chrom=chr1 -start=5563837 -end=5564370 http://hgdownload.soe.ucsc.edu/gbdb/hg38/crispr/crispr.bb stdout 
chr1    5563870    5563893        55    +    5563870    5563890    0,200,0    255,255,0    128,128,0    CAAGTGGAATCAGGATGCCT    GGG    55    72% (57)    52% (46)    10    60    MIT Spec. Score: 55, Doench 2016: 72%, Moreno-Mateos: 52%    3345002138
chr1    5563878    5563901        59    +    5563878    5563898    0,200,0    0,200,0    128,128,0    ATCAGGATGCCTGGGATATG    TGG    59    63% (54)    61% (50)    6    63    MIT Spec. Score: 59, Doench 2016: 63%, Moreno-Mateos: 61%    22777603204

Also note that we have all of this data available via rsync as well, so the following command will work to download the crispr.bb file referenced above:

$ rsync -vh hgdownload.soe.ucsc.edu::gbdb/hg38/crispr/crispr.bb
-rw-rw-r--  1466266135 2017/03/30 14:31:48 crispr.bb

sent 33 bytes  received 70 bytes  206.00 bytes/sec
total size is 1.47G  speedup is 14235593.54

If you are interested in say, Human GRCh37/hg19 gbdb data, then all you have to do is change the “hg38” at the end of the template http://hgdwonload.soe.ucsc.edu/gbdb/hg38 url to “hg19”, resulting in http://hgdwonload.soe.ucsc.edu/gbdb/hg19. This holds for all databases at UCSC, like mm10 or bosTau8.

Summary
Just as in part 1, if you are going to continually request parts of the same files or table over and over again, it is best to download the file from our downloads server and operate on it locally. All of our track data, including MySQL tables and bigBed/Wig/BAM files are hosted on our downloads server at http://hgdownload.soe.ucsc.edu. Generally speaking bigBeds/bigWigs/BAMs and other binary files are located in the hgdownload.soe.ucsc.edu/gbdb/ location discussed earlier, while MySQL table data in gzipped plain text format can be found at http://hgdownload.soe.ucsc.edu/goldenPath/$db (where $db is a database name like hg19 or hg38) or via queries against the public MySQL server directly.

Stay tuned for part 3 of this programmatic access series — controlling the Genome Browser image!

Accessing the Genome Browser Programmatically Part 1 – How to get sequence from the UCSC Genome Browser

Note: We now have an API which can also perform many of these functions.

As the number of bioinformaticians have grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. Although there is no true API developed by UCSC (yet), there are a number of ways to interface with the UCSC Genome Browser, some more efficient than others. The intention of this blog post series is to explain some of the preferred ways to access the commonly requested Genome Browser data and tools and to add a bit of explanation of the architecture of the UCSC Genome Browser in general. The three most common requests are 1) how to download a single stretch of sequence in FASTA format, 2) how to download multiple ranges of sequence, and 3) how to get basic statistics on the nucleotides in a sequence. If you want the in-depth examples and explanation, skip down, but if you’re crunched for time, all you really need to know is the following three Q&As:

Q: How do I extract some sequence?
A: The best choice is to use the twoBitToFa command, available for your system here (Windows 10 users can use the linux.x86_64/ binaries in the Windows Subsystem for Linux). Here’s an example:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

Q: What if I have a list of coordinates?
A: Again use twoBitToFa, this time with the -bed option (also check out the post on coordinate systems):

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Q: How do I count A, C, G, T?
A: twoBitToFa followed by faCount (available from the same location as twoBitToFa):

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0

Run twoBitToFa or faCount with no arguments to get a usage message and view all of their options:

$ faCount
faCount - count base statistics and CpGs in FA files.
...


The most efficient way to get sequence from UCSC Genome Browser

The most common data request we receive is a request for FASTA sequence or sequences, making it a fitting subject for part 1 of this blog series about programmatic access to the Genome Browser. If you are browsing a region in the genome browser and you want to get a FASTA sequence for just the region you are browsing, using the keyboard shortcut ‘vd’ (v then d for view DNA) is probably the easiest way. But what about when you want to get sequences for a list of regions? What about if you need your web application to download the sequence? You could download sequence interactively with the Table Browser, although the solution is somewhat cumbersome: first you must make a custom track of the region(s) you would like sequence for, and then use the “output format: sequence” option with your custom track selected as the primary track. Fortunately, there is a much easier approach – downloading the 2bit file for your organism of interest and then using the twoBitToFa command on it like so:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ twoBitToFa hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

The twoBitToFa command is available from the list of public utilities, in the directory appropriate to your operating system. twoBitToFa even accepts a URL to our downloads server as the 2bit argument, so if you wanted to grab some mm10 sequence, or even a list of sequences, you can just query the downloads server directly like so:

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Note that “stdout” in the above commands is a special option (along with the corresponding “stdin”) that tells the majority of UCSC commands to read/write from/to /dev/stdin and /dev/stdout instead of the required filenames, and is exemplified by the following common usage of generating some quick statistics on a region like chr1:100100-100200:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0

The twoBitToFa and URL to hgdownload 2bit combo is important because our downloads server is significantly more robust than our DAS CGI, can support more requests, and won’t slow the main site down for other users. We’ve also noticed that our DAS server often receives many requests for the same sequence, so for those of you providing software where the same query will be made multiple times, consider whether it would be more efficient to download an entire 2bit file to your local disk, rather than send the same query thousands of times to our servers.

Summary
twoBitToFa and faCount are two useful utilities, among the many other hundreds of tools available, that are useful for extracting sequence data. While not as preferable to working with locally downloaded files, twoBitToFa can also work with URLs to 2bit files, such as those on the UCSC Genome Browser download site. Stay tuned for part 2 of this programmatic access series — Using the Genome Browser public MySQL server and gbdb.

Annotating millions of private variants with vai.pl

For almost 4 years, Genome Browser users have been able to use the Variant Annotation Integrator (VAI) to predict the functional effects of their variants of interest. The VAI takes a variety of inputs (pgSnp/VCF custom track or hub, dbSNP rsID, HGVS terms) and annotates all the variants with their functional effect in Sequence Ontology terms (e.g. synonymous_variant, missense_variant, frameshift_variant, etc). The VAI returns predictions in Variant Effect Predictor (VEP) format, which is described here.

The VAI is quite flexible, and offers the option to choose any gene set or gene prediction track in the chosen genome database for functional annotation. All human genome databases include UCSC Genes (based on GENCODE V24 in GRCh38/hg38), RefSeq Genes, GENCODE/Ensembl genes as well as gene predictions produced by tools such as Augustus. Gene/transcript annotations used as the basis for functional effect prediction should be chosen carefully since they have a large effect on results (McCarthy et al.). For the GRCh37/hg19 and GRCh38/hg38 assemblies in particular, regulatory regions from ENCODE summary datasets can be used to identify variants that may have a regulatory effect, and disease or pathogenicity information from the Database of Non-Synonymous Functional Predictions (dbNSFP) can help distinguish between protein changes that are likely to be very disruptive to function, versus those that are likely to have little functional effect. Conservation scores may also be added to the output.

To reduce the volume of output and narrow in on the variants that are most likely to damage genes, filters can be added to restrict the output to specific functional effects (such as missense, frameshift, etc.) and/or variants overlapping conserved elements predicted from multi-species alignments.

Unfortunately, as a web tool the VAI does have some limitations, namely that only 100,000 variants at a time can be annotated, which prevents annotating variants derived from whole genome sequencing experiments. Also, for clinical users, privacy restrictions may prevent the uploading of a patient’s variant data to the UCSC Genome Browser.

Now our new vai.pl program provides a way around these restrictions. This program is intended to be run on a Genome Browser in a box (GBiB) or server hosting a mirror of the Genome Browser. vai.pl forms an interface to the VAI program running on your GBiB or mirror (so private data stays local), and is able to bypass the variant limit imposed by the web-based VAI. The script has many of the same configuration options as the web-based VAI, including filtering via functional effect term, position filters, and dbSNP rsID annotation. The script even includes a “–dry-run” option, so power users can further configure the VAI to better suit their needs.

Example Usage

For example, say you have a VCF file with a couple thousand variants and you want to check to see if there are any dbSNP rs IDs associated with your variants. Use the --rsId option:

$ vai.pl hg19 --rsId gatkUG.vcf.gz

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:40:35
## Connected to UCSC database hg19
## Variants: from file or URL (/hive/users/chmalee/hgVaiScriptTesting/gatkUG.vcf.gz)
## Transcripts: RefSeq Genes (hg19.refGene)
## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 149) (/gbdb/hg19/vai/snp149.bed4.bb)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
chr20_10000117_C/T chr20:10000117 T SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs4816203 DISTANCE=4343
chr20_10000211_C/T chr20:10000211 T SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs4813908 DISTANCE=4249
chr20_10000439_T/G chr20:10000439 G SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs4816204 DISTANCE=4021
chr20_10000598_T/A chr20:10000598 A SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs6057087 DISTANCE=3862
...
...
...

What if your colleague gave you a list of rs IDs, and you want to know what genes they fall in and what changes they might cause? Just pass vai.pl your list of rs IDs as an input file, and it will do the rest!

$ vai.pl hg19 listOfRsIDs.txt

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:45:46
## Connected to UCSC database hg19
## Variants: Variant Identifiers (/data/tmp/hgv/hg19_bd61d73837d586acba8b9a674d8bf351.vcf)
## Transcripts: RefSeq Genes (hg19.refGene)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs762221666 chr1:36228116 T CLSPN NM_001330490 Transcript intron_variant - - - - - INTRON=4/24
rs762221666 chr1:36228116 T CLSPN NM_001190481 Transcript intron_variant - - - - - INTRON=4/23
rs762221666 chr1:36228116 T CLSPN NM_022111 Transcript intron_variant - - - - - INTRON=4/24
rs528917690 chr1:229013556 G - - - intergenic_variant - - - - - - -
rs558192635 chr10:25615277 C GPR158 NM_020752 Transcript intron_variant - - - - - INTRON=2/10
rs769006799 chr10:26225804 T LOC101929073 NR_120650 Transcript upstream_gene_variant - - - DISTANCE=3165
...
...
...

Note the missing gene name for the rs528917690, because this is an intergenic variant.

What if you only care about the variants that fall on chr22? vai.pl supports a --position option built just for that:

$ vai.pl hg19 --rsId --position=chr22 chr22.1000GenomesPhase3.vcf.gz

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:36:32
## Connected to UCSC database hg19
## Variants: from file or URL (/hive/users/chmalee/hgVaiScriptTesting/chr221000GenomesPhase3.vcf.gz)
## Transcripts: RefSeq Genes (hg19.refGene)
## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 149) (/gbdb/hg19/vai/snp149.bed4.bb)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs587697622 chr22:16050075 G - - - intergenic_variant - - - - - rs587697622 -
rs587755077 chr22:16050115 A - - - intergenic_variant - - - - - rs587755077 -
rs587654921 chr22:16050213 T - - - intergenic_variant - - - - - rs587654921 -
rs587712275 chr22:16050319 T - - - intergenic_variant - - - - - rs587712275 -
rs587769434 chr22:16050527 A - - - intergenic_variant - - - - - rs587769434 -
...
...
...

Ok great, but web-based VAI lets me annotate only specific variants, and I don’t care about intronic variants, upstream/downstream variants, or intergenic variants, only those that fall within exons as annotated by the GENCODE V24 track. Well good thing vai.pl supports annotation via specific gene tracks with the --geneTrack option, and can include/exclude different functional types with the include_ option:

$ vai.pl hg38 --include_intron=off --include_upDownstream=off --include_intergenic=off \
--geneTrack=wgEncodeGencodeCompV24 listOfRsIDs.txt

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:53:57
## Connected to UCSC database hg38
## Variants: Variant Identifiers (/data/tmp/hgv/hg38_bd61d73837d586acba8b9a674d8bf351.vcf)
## Transcripts: Comprehensive Gene Annotation Set from GENCODE Version 24 (Ensembl 83) (hg38.wgEncodeGencodeCompV24)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs371031144 chr13:112864559 A ATP11A ENST00000471555.5 Transcript NMD_transcript_variant - - - INTRON=8/12
rs750654524 chr16:57996479 T ZNF319 ENST00000299237.2 Transcript 3_prime_UTR_variant 2410 - - EXON=2/2
rs575863935 chr16:83575109 A CDH13 ENST00000539548.6 Transcript NMD_transcript_variant - - - INTRON=6/12
rs78060447 chr4:83316348 C HPSE ENST00000507150.5 Transcript NMD_transcript_variant - - - INTRON=4/11
...
...
...

By default, vai.pl includes all functional types, use the --include_type=off switch to turn them off. vai.pl also limits output to only 10,000 variants, but you can override this with the --variantLimit option. The following example compares the number of annotated variants with default settings and with the --variantLimit option (Please note the grep command at the end is only a rough approximation of finding all the unique variants):

$ vai.pl hg19 ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.vcf.gz > HRC.vai
$ grep -v ^# HRC.vai | grep -v ^Uploaded | awk '{print $2 ":" $1;}' | uniq | wc -l
9992
$ vai.pl hg19 --variantLimit=10000000 ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.vcf.gz > HRC.vai
$ grep -v ^# HRC.vai | grep -v ^Uploaded | awk '{print $2 ":" $1;}' | uniq | wc -l
9992195

Unfortunately the script will not annotate more than approximately 10,000,000 variants due to the amount of memory needed (it caps its usage at 6GB; this may change in the future), so setting the --variantLimit option any higher than 10,000,000 will not work. Instead you will need to split up your VCF file. For a full list of all the options outlined here as well as others, run vai.pl with no arguments to get the usage message.

The script has some other drawbacks as well. For one, as previously mentioned, the script can only be run on a GBiB, a mirror site (installed via the Genome Browser in the Cloud (GBiC) script or a manual installation), or another machine running our CGIs. This is because under the hood the script uses the existing web-based VAI executable to run, which in turn requires either manual compilation of our source code, or a precompiled binary from our downloads server. Furthermore, VAI is tightly coupled to our genome databases and files.

Secondly, users will need to have a .hg.conf file in their home directory in order to run the program. The .hg.conf file is a file that a majority of UCSC-specific utilities use to set various configuration options. This file specifies options like which MySQL server to point to (a local server with private data), a fallback MySQL server if one isn’t available (UCSC public MySQL server), the primary location of bigData files, etc. Our source tree includes a very minimal hg.conf file that should allow basic usage of the script.

Different system setups will require different options in each users’ hg.conf settings, and if you are running a full mirror, then the CGIs will require their own hg.conf, separate from each user who may be running vai.pl! GBiB users should have a functioning .hg.conf file set up already, and thus for the script to work out of the box, you should only need to change the udc.cacheDir setting from:
udc.cacheDir=/data/trash/udcCache

to a user-writable directory such as:
udcCacheDir=./udcCache

Mirror users won’t have an .hg.conf file by default, but can create one and add the line:
include /usr/local/apache/cgi-bin/hg.conf

This will take care of most of the work outside of fine-tuning a few settings like the udc.cacheDir mentioned previously. For more information about hg.conf parameters and settings, please see the example hg.conf file here. Any questions about fine-tuning these parameters should be sent to our public support forum mentioned below.

Download

vai.pl is available from the UCSC Genome Browser Store via download of the GBiB (use the gbibAddTools command), GBiC (use the browserSetup.sh addTools command), or full source. vai.pl is free for non-commercial use. If you encounter issues or have any questions while running vai.pl, please send your questions to our public mailing list at genome@soe.ucsc.edu, or if your question involves private data to genome-www@soe.ucsc.edu.

References

Choice of transcripts and software has a large effect on variant annotation.
McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P.
Genome Med. 2014 Mar 31;6(3):26. doi: 10.1186/gm543.

UCSC Data Integrator and Variant Annotation Integrator.
Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ.
Bioinformatics. 2016 May 1;32(9):1430-2. doi: 10.1093/bioinformatics/btv766.