Description
ECgene (gene prediction by EST clustering) predicts genes by combining
genome-based EST clustering and a transcript assembly procedure in a coherent
and consistent fashion. Specifically, ECgene takes alternative splicing events
into consideration. The positions of splice sites (i.e., exon-intron
boundaries) in the genome map are used as critical information throughout the
procedure. Sequences that share splice sites in the genomic alignment are
grouped together to define an EST cluster. Transcript assembly, based
on graph theory, produces gene models and clone evidence, which is essentially
identical to sub-clustering according to splice variants.
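The grouping step can be illustrated with a minimal sketch. It assumes each aligned sequence has already been reduced to a set of splice-site coordinates; the function name `cluster_by_splice_sites`, the `min_shared` parameter, and the tuple layout of a splice site are hypothetical choices for illustration, not part of ECgene itself. Union-find is used here only as a convenient way to merge sequences transitively.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_splice_sites(alignments, min_shared=1):
    """Group sequence IDs whose alignments share >= min_shared splice sites.

    alignments: dict mapping a sequence ID to a set of hashable splice-site
    keys, e.g. (chrom, position). The key layout is an assumption.
    """
    parent = {seq: seq for seq in alignments}

    def find(x):
        # Walk to the cluster representative, shortening the chain as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Index sequences by splice site so only sequences sharing a site are paired.
    by_site = defaultdict(set)
    for seq, sites in alignments.items():
        for site in sites:
            by_site[site].add(seq)

    # Count how many splice sites each pair of sequences has in common.
    shared = defaultdict(int)
    for seqs in by_site.values():
        for a, b in combinations(sorted(seqs), 2):
            shared[(a, b)] += 1

    # Merge every pair that meets the threshold.
    for (a, b), count in shared.items():
        if count >= min_shared:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for seq in alignments:
        clusters[find(seq)].append(seq)
    return sorted(sorted(members) for members in clusters.values())


example = {
    "est1": {("chr1", 100), ("chr1", 200)},
    "est2": {("chr1", 200), ("chr1", 300)},
    "est3": {("chr1", 300)},
    "est4": {("chr2", 50)},
}
clusters = cluster_by_splice_sites(example)
```

In this toy input, `est1` and `est3` share no splice site directly but end up in the same cluster through `est2`; `est4` remains a singleton. Raising `min_shared` makes the grouping stricter.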
Display Conventions and Configuration
This track follows the display conventions for gene prediction tracks.
The track description page offers the following filter and configuration
options:
- Color track by codons: Select the genomic codons option
to color and label each codon in a zoomed-in display to facilitate validation
and comparison of gene predictions. Click the "Codon coloring help"
link on the track description page for more information about this feature.
Methods
The following is a brief summary of the ECgene algorithm:
- Genomic alignment of mRNA and ESTs: Input sequences were aligned against the
genome using the Blat program developed by Jim Kent. Blat alignments were
corrected for valid splice sites, and the SIM4 program was used for suspicious
alignments if necessary.
- Sequences that share more than one splice site were grouped together to define
an EST cluster in a similar manner to the genome-based version of the UniGene
algorithm.
- The exon-connectivity in each cluster was represented as a directed acyclic
graph (DAG). Distinct paths along exons were obtained by the
depth-first-search (DFS) method. They correspond to possible gene models
encompassing all alternative splicing events.
- EST sequences in each cluster were sub-clustered further according to the
compatibility of each splice variant with gene structure, and they can be
regarded as clone evidence for the corresponding isoform. Gene models without
sufficient evidence were discarded at this stage. The presence of polyA tails,
detected from careful analysis of genomic alignment of mRNA and EST sequences,
was specifically used to determine the gene boundary.
- Finally, unspliced sequences were added without altering the exon-intron
boundaries of existing gene models.
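The path-enumeration step in the summary above can be sketched as follows. This is a minimal illustration, not the ECgene implementation: it assumes the exon-connectivity graph is given as an adjacency mapping, and the function name `enumerate_paths` and the exon labels are hypothetical. Each path from an exon with no incoming edge to an exon with no outgoing edge is treated as one candidate gene model.

```python
def enumerate_paths(graph):
    """Return every source-to-sink path in an exon-connectivity DAG.

    graph: dict mapping an exon label to the list of exons it connects to.
    The input is assumed to be acyclic; a cycle would cause unbounded recursion.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Sources are exons that no other exon connects to.
    has_incoming = {t for targets in graph.values() for t in targets}
    sources = sorted(nodes - has_incoming)

    paths = []

    def dfs(node, path):
        path.append(node)
        successors = graph.get(node, [])
        if not successors:
            # Reached a terminal exon: record a copy of the current path.
            paths.append(list(path))
        else:
            for nxt in successors:
                dfs(nxt, path)
        path.pop()  # backtrack before exploring sibling branches

    for source in sources:
        dfs(source, [])
    return paths


# Toy cluster in which exon E2 is optionally skipped.
exon_graph = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": ["E4"]}
models = enumerate_paths(exon_graph)
```

For the toy graph, the two resulting paths correspond to the isoform that includes `E2` and the one that skips it.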
Coding potential of gene models:
Peptide sequences are available only for those gene models judged to have good
coding potential. ORF and CDS were determined based on the number of exons,
the ORF length, the presence of the start codon (Met), and the CDS length.
ORFs (defined as the region between two adjacent stop codons) were classified
into four groups:
- spliced ORFs with Met
- spliced ORFs without Met
- single-exon ORFs with Met
- single-exon ORFs without Met
Initially, the first group was searched for the ORF with the longest CDS.
Coding sequences were accepted if they were longer than 30 amino acids (93 bp)
or were identical to one of the SwissProt proteins, excluding fragmented
entries. If such an ORF could not be identified in the first group, the other
groups were examined sequentially for the presence of an ORF using the same
criteria. Genes lacking an apparent ORF were defined as non-coding RNA genes.
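The group-by-group search described above can be sketched as a small selection routine. This is one possible reading of the procedure, not the published code: the `Orf` record, its field names, and `select_orf` are hypothetical, and the sketch assumes that ORF detection, CDS length, and the SwissProt comparison have been computed elsewhere. Within a group, it picks the longest CDS among the ORFs that meet the acceptance criteria.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Orf:
    name: str
    spliced: bool                  # True for spliced ORFs, False for single-exon
    has_met: bool                  # True if a start codon (Met) is present
    cds_aa: int                    # CDS length in amino acids
    swissprot_match: bool = False  # identical to a non-fragment SwissProt entry


# Search order taken from the four groups listed in the text.
GROUP_ORDER = [(True, True), (True, False), (False, True), (False, False)]
MIN_CDS_AA = 30  # CDS must be longer than this many amino acids


def select_orf(orfs: Iterable[Orf]) -> Optional[Orf]:
    """Return the accepted ORF, or None if the gene model looks non-coding."""
    orfs = list(orfs)
    for spliced, has_met in GROUP_ORDER:
        accepted = [
            o for o in orfs
            if o.spliced == spliced and o.has_met == has_met
            and (o.cds_aa > MIN_CDS_AA or o.swissprot_match)
        ]
        if accepted:
            # Prefer the longest CDS within the first group that has a hit.
            return max(accepted, key=lambda o: o.cds_aa)
    return None


candidates = [
    Orf("a", spliced=True, has_met=True, cds_aa=25),
    Orf("b", spliced=True, has_met=False, cds_aa=120),
    Orf("c", spliced=False, has_met=True, cds_aa=300),
]
chosen = select_orf(candidates)
```

Here ORF `a` is too short and has no SwissProt match, so the search falls through to the second group and returns `b`, even though `c` in a later group has a longer CDS. A `None` result corresponds to a gene model that would be treated as a non-coding RNA gene.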
Credits
This algorithm and the predictions for this track were developed by Professor
Sanghyuk Lee's
Lab of Bioinformatics at Ewha Womans University, Seoul, Korea.