Schema for N-SCAN - N-SCAN Gene Predictions


field | example | SQL type | info | description
bin | 585 | smallint(5) unsigned | range | Indexing field to speed chromosome range queries.
name | chr1.1.001.a | varchar(255) | values | Name of gene
chrom | chr1 | varchar(255) | values | Reference sequence chromosome or scaffold
strand | | char(1) | values | + or - for strand
txStart | 4558 | int(10) unsigned | range | Transcription start position (or end position for minus strand item)
txEnd | 19480 | int(10) unsigned | range | Transcription end position (or start position for minus strand item)
cdsStart | 4558 | int(10) unsigned | range | Coding region start (or end position for minus strand item)
cdsEnd | 12334 | int(10) unsigned | range | Coding region end (or start position for minus strand item)
exonCount | | int(10) unsigned | range | Number of exons
exonStarts | 4558,4832,5658,6469,6716,77... | longblob | | Exon start positions (or end positions for minus strand item)
exonEnds | 4692,4901,5754,6628,6918,79... | longblob | | Exon end positions (or start positions for minus strand item)
id | | int(10) unsigned | range |
name2 | chr1.1.001 | varchar(255) | values |
cdsStartStat | cmpl | enum('none', 'unk', 'incmpl', 'cmpl') | values |
cdsEndStat | cmpl | enum('none', 'unk', 'incmpl', 'cmpl') | values |
exonFrames | 1,1,1,1,0,0,2,0,-1, | longblob | |

hg18.nscanPasaPep.name (via nscanPasaGene.name)
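The list-valued fields (exonStarts, exonEnds, exonFrames) are stored as comma-separated text with a trailing comma, so they need a small parsing step before use. The following is a minimal Python sketch, not part of the UCSC tooling: the `GenePred` class and `parse_int_list` helper are hypothetical names, the values are copied from the first sample row (chr1.1.001.a), and coordinates are assumed to follow the usual UCSC convention of zero-based, half-open intervals.

```python
from dataclasses import dataclass


def parse_int_list(blob: str) -> list[int]:
    """Parse a comma-separated list with a trailing comma, e.g. '4558,4832,'."""
    return [int(item) for item in blob.split(",") if item]


@dataclass
class GenePred:
    """Hypothetical container for a subset of the fields in this table."""
    name: str
    chrom: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: list[int]
    exon_ends: list[int]

    def exons(self) -> list[tuple[int, int]]:
        """Pair each exon start with its end (assumed half-open intervals)."""
        return list(zip(self.exon_starts, self.exon_ends))

    def coding_length(self) -> int:
        """Sum the exon bases that fall inside [cds_start, cds_end)."""
        total = 0
        for start, end in self.exons():
            lo = max(start, self.cds_start)
            hi = min(end, self.cds_end)
            if hi > lo:
                total += hi - lo
        return total


# Values copied from the first sample row (chr1.1.001.a).
gene = GenePred(
    name="chr1.1.001.a",
    chrom="chr1",
    tx_start=4558,
    tx_end=19480,
    cds_start=4558,
    cds_end=12334,
    exon_starts=parse_int_list("4558,4832,5658,6469,6716,7777,8130,12290,19183,"),
    exon_ends=parse_int_list("4692,4901,5754,6628,6918,7924,8242,12468,19480,"),
)

print(len(gene.exons()), gene.coding_length())
```

Under the half-open assumption, the coding length for this row is a multiple of three, which is a quick sanity check to run when cdsStartStat and cdsEndStat are both cmpl.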

bin | name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds | name2 | cdsStartStat | cdsEndStat | exonFrames
585 | chr1.1.001.a | chr1 | | 4558 | 19480 | 4558 | 12334 | | 4558,4832,5658,6469,6716,7777,8130,12290,19183, | 4692,4901,5754,6628,6918,7924,8242,12468,19480, | chr1.1.001 | cmpl | | 1,1,1,1,0,0,2,0,-1,
585 | chr1.1.002.a | chr1 | | 55418 | 59871 | 55427 | 59871 | | 55418,58899, | 55436,59871, | chr1.1.002 | cmpl | | 0,0,
589 | chr1.pasa.1.a | chr1 | | 556520 | 557941 | 557315 | 557858 | | 556520,556983, | 556673,557941, | chr1.pasa.1 | cmpl | | -1,0,
589 | chr1.pasa.1.b | chr1 | | 557150 | 557928 | 557663 | 557858 | | 557150,557467, | 557158,557928, | chr1.pasa.1 | cmpl | | -1,0,
 | chr1.1.003.b | chr1 | | 654571 | 658978 | 654600 | 658552 | | 654571,658549,658834, | 655047,658607,658978, | chr1.1.003 | cmpl | | 0,0,-1,
589 | chr1.1.003.a | chr1 | | 654571 | 655287 | 654600 | 655143 | | 654571,655140, | 655047,655287, | chr1.1.003 | cmpl | | 0,0,
 | chr1.1.004.a | chr1 | | 703947 | 786582 | 786390 | 786582 | | 703947,707179,729161,730018,732459,756883,766442,774726,777913,778633,786347, | 704335,707286,729465,730209,732566,756920,766554,774778,778009,778765,786582, | chr1.1.004 | cmpl | | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,
591 | chr1.1.005.a | chr1 | | 793821 | 830082 | 793821 | 794805 | | 793821,794770,800282,801988,802300,829700, | 793918,794818,800398,802249,802510,830082, | chr1.1.005 | cmpl | | 2,0,-1,-1,-1,-1,
591 | chr1.1.006.a | chr1 | | 842236 | 843777 | 842236 | 842554 | | 842236,843264, | 842729,843777, | chr1.1.006 | cmpl | | 0,-1,
591 | chr1.1.007.a | chr1 | | 850122 | 869824 | 851184 | 869396 | | 850122,851164,855397,856281,861014,864282,864517,866386,867378,867652,867801,868495,868940,869150, | 850191,851256,855579,856332,861139,864372,864703,866549,867494,867731,868301,868620,869051,869824, | chr1.1.007 | cmpl | | -1,0,0,2,2,1,1,1,2,1,2,1,0,0,
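The bin field is described only as an indexing field for range queries. As a hedged illustration of how such a value can be derived from txStart and txEnd, the sketch below implements the standard UCSC binning scheme (128 kb bins at the finest level, each coarser level eight times larger). The function name and constants are written here for illustration, and the assumption that this table uses the standard (non-extended) scheme is mine; the bin values that survive in the sample rows are consistent with it.

```python
# Offsets of the first bin at each level, finest (128 kb) level first.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb bins at the finest level
BIN_NEXT_SHIFT = 3    # each coarser level covers 8x more sequence


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is outside the standard binning scheme")


# txStart/txEnd pairs copied from sample rows, with the bin listed for each.
samples = [
    (4558, 19480, 585),      # chr1.1.001.a
    (556520, 557941, 589),   # chr1.pasa.1.a
    (793821, 830082, 591),   # chr1.1.005.a
]
for tx_start, tx_end, expected in samples:
    print(tx_start, tx_end, bin_from_range(tx_start, tx_end), expected)
```

A range query can then restrict its scan to the handful of bins that could overlap the region of interest instead of comparing coordinates for every row.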

Description

This track shows gene predictions generated by the N-SCAN gene structure prediction software, provided by the Computational Genomics Lab at Washington University in St. Louis, MO, USA.

Methods

N-SCAN

N-SCAN combines biological-signal modeling in the target genome sequence with information from a multiple-genome alignment to generate de novo gene predictions. It extends the TWINSCAN target-informant genome pair to allow for an arbitrary number of informant sequences as well as richer models of sequence evolution. N-SCAN models the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, insertions, and deletions.

The human N-SCAN predictions use mouse (mm7) as the informant genome, together with iterative pseudogene masking.

N-SCAN PASA-EST

N-SCAN PASA-EST incorporates EST alignments into N-SCAN. Similar to the conservation sequence models in TWINSCAN, separate probability models are developed for EST alignments to genomic sequence in exons, introns, splice sites and UTRs, reflecting the EST alignment patterns in these regions. N-SCAN PASA-EST is more accurate than N-SCAN while retaining the ability to discover novel genes to which no ESTs align.

For N-SCAN PASA-EST, cDNA sequences were first clustered using the PASA program. PASA, the Program to Assemble Spliced Alignments, was created by Brian Haas at TIGR. The algorithm assembles clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations.
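As a rough illustration of the clustering idea only, the Python sketch below groups genomic spans that overlap one another. It is not PASA's algorithm: PASA also checks splice compatibility and builds maximal assemblies, which this sketch does not attempt. The `cluster_overlapping` function is a hypothetical helper, the spans are txStart/txEnd values from the sample rows used as stand-ins for transcript alignments, and all spans are assumed to lie on the same chromosome and strand.

```python
def cluster_overlapping(spans):
    """Group (name, start, end) spans into clusters of mutually chained overlaps.

    Assumes half-open coordinates and a single chromosome and strand.
    """
    clusters = []
    cluster_end = None
    for name, start, end in sorted(spans, key=lambda span: span[1]):
        if clusters and start < cluster_end:
            # Overlaps the current cluster: extend it.
            clusters[-1].append(name)
            cluster_end = max(cluster_end, end)
        else:
            # No overlap: start a new cluster.
            clusters.append([name])
            cluster_end = end
    return clusters


# Spans copied from the sample rows, used here as stand-ins for alignments.
spans = [
    ("chr1.pasa.1.a", 556520, 557941),
    ("chr1.pasa.1.b", 557150, 557928),
    ("chr1.1.003.a", 654571, 655287),
    ("chr1.1.003.b", 654571, 658978),
    ("chr1.1.005.a", 793821, 830082),
]
clusters = cluster_overlapping(spans)
for cluster in clusters:
    print(cluster)
```

In this sample the overlap clusters happen to coincide with the name2 values of the rows, so the two chr1.pasa.1 transcripts end up together, as do the two chr1.1.003 transcripts.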

The PASA clusters were used as 'EST' sequences in N-SCAN PASA-EST. The resulting gene models were then updated with the input PASA clusters using the assembly tool of the PASA pipeline. These updates consist of automatically generated alternative splice forms, UTR features and, in some cases, the merging of two gene models. In addition, PASA assigned open reading frames to clusters that did not overlap a gene prediction but did contain a full-length cDNA, and output them as 'novel genes'. Note that PASA does not use any cDNA annotation from the input but assigns the ORF itself.

No manual annotation was performed to generate any of the gene models. The high accuracy of the set is due in part to the large number of available ESTs and full-length cDNAs.

Credits

Thanks to Michael Brent's Computational Genomics Group at Washington University in St. Louis for providing these data.

Special thanks for this implementation of N-SCAN to Aaron Tenney in the Brent lab, and Robert Zimmermann, currently at Max F. Perutz Laboratories in Vienna, Austria.

References

Gross SS, Brent MR. Using multiple alignments to improve gene prediction. In Proc. 9th Int'l Conf. on Research in Computational Molecular Biology (RECOMB '05):374-388 and J Comput Biol. 2006 Mar;13(2):379-93.

Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001 Jun 1;17(90001):S140-8.

van Baren MJ, Brent MR. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 2006 May;16(5):678-85.

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.