Schema for Web Sequences - DNA Sequences in Web Pages Indexed by Bing / Microsoft Research
  Database: hg19    Primary Table: pubsBingBlat    Row Count: 313,510   Data last updated: 2013-11-21
Format description: publications blat feature table, in bed12+ format
On download server: MariaDB table dump directory
field	example	SQL type	info	description
bin	585	smallint(5) unsigned	range	Indexing field to speed chromosome range queries.
chrom	chr1	varchar(255)	values	chromosome
chromStart	14789	int(10) unsigned	range	start position on chromosome
chromEnd	15004	int(10) unsigned	range	end position on chromosome
name	3500336380	varchar(255)	values	internal articleId, article that matches here
score	75	int(10) unsigned	range	score of feature
strand		char(1)	values	strand of feature
thickStart	14789	int(10) unsigned	range	start of exons
thickEnd	15004	int(10) unsigned	range	end of exons
reserved	8421504	int(10) unsigned	range	no clue
blockCount	2	int(10) unsigned	range	number of blocks
blockSizes	40,35	longblob		size of blocks
chromStarts	0,180	longblob		A comma-separated list of block starts
tSeqTypes	g	varchar(255)	values	comma-separated list of matching sequence db (g=genome, p=protein, c=cDNA)
seqIds	350033638000000000	blob	values	comma-separated list of matching seqIds
seqRanges	0-75	blob	values	ranges start-end on sequence that matched, one for each seqId
publisher		varchar(255)	values	publisher of article, for hgTracks feature filter
pmid		varchar(255)	values	PMID of article, for annoGrator output, avoids table join
doi		varchar(255)	values	doi of article, for annoGrator output, avoids table join
issn		varchar(255)	values	issn of journal
journal		varchar(255)	values	name of journal
title	Tophat, Cufflinks and repl...	varchar(255)	values	title of article, for genome browser mouseover
firstAuthor	seqanswers.com	varchar(255)	values	first author family name of article, for genome browser
year	0	varchar(255)	values	year of article, for genome browser
impact	0	varchar(255)	values	impact factor of journal, for genome browser coloring, derived from official impact factors: max impact is 25, value is scaled to 0-255
classes		varchar(255)	values	classes assigned to journal article, for genome browser coloring
locus	WASH2P,WASH7P	varchar(255)	values	closest gene symbols, one or two, comma-separated
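The block fields follow the usual BED12 layout: each entry in chromStarts is an offset relative to chromStart, and the matching entry in blockSizes gives the block length. A minimal Python sketch, using the example values from the table above; the function name `block_intervals` is illustrative and not part of any UCSC tool:

```python
def block_intervals(chrom_start, block_sizes, chrom_starts):
    """Return absolute (start, end) pairs for each block of a BED12 feature.

    block_sizes and chrom_starts are the comma-separated strings stored in
    the table; a trailing comma, if present, is tolerated.
    """
    sizes = [int(s) for s in block_sizes.split(",") if s]
    offsets = [int(s) for s in chrom_starts.split(",") if s]
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and chromStarts differ in length")
    # Block offsets are relative to chromStart; ends are exclusive.
    return [(chrom_start + off, chrom_start + off + size)
            for off, size in zip(offsets, sizes)]


# Example row from the schema table: chromStart 14789, chromEnd 15004.
blocks = block_intervals(14789, "40,35", "0,180")
print(blocks)  # [(14789, 14829), (14969, 15004)]
```

The last block ends at 15004, which agrees with the chromEnd value of the example row.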

Connected Tables and Joining Fields
  hg19.pubsBingBlatPsl.articleId (via
  hgFixed.pubsBingArticle.articleId (via
  hgFixed.pubsBingSequenceAnnot.articleId (via
  hg19.pubsBingBlatPsl.qName (via pubsBingBlat.seqIds)
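The last entry in the list above joins hg19.pubsBingBlatPsl.qName via pubsBingBlat.seqIds, which holds a comma-separated list rather than a single key. A minimal Python sketch of splitting that list into individual join keys and pairing each with its entry in seqRanges (the schema states there is one range per seqId). The helper name `pair_seq_ids` is hypothetical:

```python
def pair_seq_ids(seq_ids, seq_ranges):
    """Pair each seqId with its (start, end) range on the matched sequence."""
    ids = [s for s in seq_ids.split(",") if s]
    ranges = []
    for item in seq_ranges.split(","):
        if not item:
            continue
        start, end = item.split("-")
        ranges.append((int(start), int(end)))
    if len(ids) != len(ranges):
        raise ValueError("seqIds and seqRanges differ in length")
    return list(zip(ids, ranges))


# Values taken from one of the sample rows (articleId 3500359797).
pairs = pair_seq_ids("350035979700000000,350035979700000002", "0-76,10-76")
for seq_id, (start, end) in pairs:
    print(seq_id, start, end)
```

Each seqId string in the result can then be used as a lookup key against qName.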

Sample Rows
bin	chrom	chromStart	chromEnd	name	score	strand	thickStart	thickEnd	reserved	blockCount	blockSizes	chromStarts	tSeqTypes	seqIds	seqRanges	publisher	pmid	doi	issn	journal	title	firstAuthor	year	impact	classes	locus
585	chr1	14789	15004	3500336380	75		14789	15004	8421504	2	40,35	0,180	g	350033638000000000	0-75						Tophat, Cufflinks and replicates - Page 2 - SEQanswers	seqanswers.com	0	0		WASH2P,WASH7P
585	chr1	15017	15590	3500327042	381		15017	15590	8421504	2	326,55	0,518	g	350032704200000008	0-747						Research Technologies at Indiana University	biomedapp.iu.edu	0	0		WASH7P
585	chr1	68858	68895	3500020489	37		68858	68895	8421504	1	37	0	g	350002048900000000,350002048900000001	0-36,0-36						Genome mapability - Musings from a PhD candidate	davetang.org	0	0		OR4F5
585	chr1	69170	69479	3500359797	142		69170	69479	8421504	2	76,66	0,243	c	350035979700000000,350035979700000002	0-76,10-76						CRAM compression and TLEN SAM's field - SEQanswers	seqanswers.com	0	0		OR4F5
585	chr1	70013	70230	3500427570	150		70013	70230	8421504	2	75,75	0,142	g	350042757000000000,350042757000000001	0-75,0-75						Inconsistency with SAM flag output? - SEQanswers	seqanswers.com	0	0		OR4F5
585	chr1	98860	98888	3500207083	26		98860	98888	8421504	3	5,7,14	0,6,14	g	350020708300000108,350020708300000060,350020708300000239	0-24,0-21,0-21						Method For The Simultaneous Determination Of Blood Group And Platelet Antigen Genotypes.	freshpatents.com	0	0		OR4F5
586	chr1	137603	138008	3500170315	405		137603	138008	8421504	1	405	0	p	350017031500015076,350017031500015074	0-135,0-270						Balding D. (2007) Handbook of Statistical Genetics	www.scribd.com	0	0		OR4F5
586	chr1	139485	143008	3500419332	1794		139485	143008	8421504	2	65,1729	0,1794	g	350041933200000004,350041933200000000,350041933200000001,350041933200000002,350041933200000003	0-1263,0-1859,0-1852,0-1860,0-576						PPT – Evolution by Genome Duplication PowerPoint presentation | free to view	www.powershow.com	0	0		OR4F5
586	chr1	141535	143008	3500270480	1372		141535	143008	8421504	24	57,60,58,59,61,59,60,59,59,62,61,58,60,58,16,59,59,59,59,57,57,59,58,58	0,61,125,187,250,314,377,441,503,566,631,695,756,819,881,919,981,1044,1107,1170,1230,1291,1353,1415	g	350027048000000003,350027048000000002	0-902,0-525						Chen-Kung Chou
587	chr1	352265	352290	3500427583	25		352265	352290	8421504	1	25	0	g	350042758300000000	74-99						2010-11-10.GENSIPS.Assembly in the Cloud	schatzlab.cshl.edu	0	0		OR4F29
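The bin values in the sample rows above (585, 586, 587) are consistent with the standard UCSC binning scheme, in which the smallest bins span 128 kb (a shift of 17 bits) and each coarser level is eight times larger. The Python sketch below assumes this table uses that standard scheme; the function name mirrors the usual `binFromRange` routine but is written here only for illustration:

```python
# Offsets of the first bin at each level, finest level first.
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
FIRST_SHIFT = 17  # finest bins cover 2**17 = 131072 bases
NEXT_SHIFT = 3    # each coarser level covers 8x more


def bin_from_range(start, end):
    """Return the smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("range does not fit the standard binning scheme")


print(bin_from_range(14789, 15004))    # 585
print(bin_from_range(137603, 138008))  # 586
print(bin_from_range(352265, 352290))  # 587
```

A range query can then restrict itself to rows whose bin could overlap the region of interest before comparing chromStart and chromEnd, which is what makes the field useful as an index.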

Note: all start coordinates in our database are 0-based, not 1-based.
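In practice this means a stored chromStart must be incremented by one when a position is written in the 1-based form used in the browser's position box, while chromEnd is used unchanged. A small Python sketch, assuming the usual half-open BED convention (start inclusive and 0-based, end exclusive); the helper name is hypothetical:

```python
def bed_to_position(chrom, chrom_start, chrom_end):
    """Convert a 0-based, half-open table range to a 1-based position string."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def feature_length(chrom_start, chrom_end):
    """With half-open coordinates the length is a plain difference."""
    return chrom_end - chrom_start


print(bed_to_position("chr1", 14789, 15004))  # chr1:14790-15004
print(feature_length(14789, 15004))           # 215
```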

Web Sequences (pubsBingBlat) Track Description


This track is powered by Bing! and Microsoft Research. UCSC collaborators at Microsoft Research (Bob Davidson, David Heckerman) implemented a DNA sequence detector and processed thirty days of web crawler updates, covering roughly 40 billion web pages. The results were mapped with BLAT to the genome.

Display Convention and Configuration

The track indicates the location of sequences on web pages mapped to the genome, labelled with the web page URL. If the web page includes invisible metadata, the first author and year of publication are shown instead of the URL. All matches of one web page are grouped ("chained") together. Web page titles are shown when you move the mouse cursor over the features. Thicker parts of the features (exons) represent matching sequences, connected by thin lines to matches from the same web page within 30 kbp.
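The grouping rule described above can be sketched as follows. This is only an illustration of the idea, not the code used to build the track: it assumes that "within 30 kbp" refers to the gap between the end of the current chain and the start of the next match on the same chromosome, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict

MAX_GAP = 30_000  # 30 kbp, as stated in the track description


def chain_matches(matches, max_gap=MAX_GAP):
    """Group matches into chains.

    matches: iterable of (article_id, chrom, start, end) tuples.
    Returns a list of (article_id, chrom, [(start, end), ...]) chains.
    """
    grouped = defaultdict(list)
    for article_id, chrom, start, end in matches:
        grouped[(article_id, chrom)].append((start, end))

    chains = []
    for (article_id, chrom), spans in sorted(grouped.items()):
        spans.sort()
        current = [spans[0]]
        chain_end = spans[0][1]
        for start, end in spans[1:]:
            if start - chain_end <= max_gap:
                current.append((start, end))
                chain_end = max(chain_end, end)
            else:
                # Gap too large: close the current chain and start a new one.
                chains.append((article_id, chrom, current))
                current = [(start, end)]
                chain_end = end
        chains.append((article_id, chrom, current))
    return chains


# Made-up coordinates for a single hypothetical page "A".
example = [("A", "chr1", 100, 150), ("A", "chr1", 100000, 100040),
           ("A", "chr1", 20000, 20050)]
for chain in chain_matches(example):
    print(chain)
```

In the track display, each resulting chain would correspond to one feature, with its spans drawn as the thick blocks.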


All file types (PDFs and various Microsoft Office formats) were converted to text. The results were processed to find groups of words that look like DNA/RNA sequences. These were then mapped with BLAT to the human genome using the same software as used in the Publication track.
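To make the detection step concrete, here is a deliberately simple Python sketch of finding "groups of words that look like DNA/RNA sequences" in converted text. It is not the Microsoft Research detector, whose details are not given here; the allowed alphabet, the 20-letter minimum, and the function name are all assumptions chosen for illustration.

```python
import re

# A word counts as sequence-like if it consists only of nucleotide letters.
NUCLEOTIDE_WORD = re.compile(r"[ACGTUNacgtun]+")


def find_sequence_candidates(text, min_length=20):
    """Join runs of consecutive nucleotide-only words and keep the long ones."""
    candidates = []
    run = []

    def flush():
        sequence = "".join(run)
        if len(sequence) >= min_length:
            candidates.append(sequence)
        run.clear()

    for word in text.split():
        if NUCLEOTIDE_WORD.fullmatch(word):
            run.append(word.upper())
        else:
            flush()
    flush()
    return candidates


text = "Primer sequence: ACGTACGTAC GGTTAACCGG was used at a tag site."
print(find_sequence_candidates(text))  # ['ACGTACGTACGGTTAACCGG']
```

The length threshold is what filters out ordinary words such as "at" or "tag" that happen to use only nucleotide letters; candidates that pass would then be handed to an aligner such as BLAT.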


DNA sequence detection by Bob Davidson at Microsoft Research. HTML parsing and sequence mapping by Maximilian Haeussler at UCSC.


Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium. Text-mining assisted regulatory annotation. Genome Biol. 2008;9(2):R31. PMID: 18271954; PMC: PMC2374703

Haeussler M, Gerner M, Bergman CM. Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics. 2011 Apr 1;27(7):980-6. PMID: 21325301; PMC: PMC3065681

Van Noorden R. Trouble at the text mine. Nature. 2012 Mar 7;483(7388):134-5.