  Terminology
  • Raw sequence: Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
  • Draft genome sequence: The sequence produced by combining the information from individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes. (Nickname "golden path".)
  • Public (sequence) databases: GenBank, EMBL, DDBJ.
  • Paired-end sequence: Raw sequence obtained from both ends of a clone's insert in any vector, such as a plasmid or BAC (see below).
  • Finished sequence: Complete sequence of a clone or genome, having an accuracy of at least 99.99% and no gaps.
  • Accession: a record from a public database, more specifically a record for a sequenced clone (see below).
  • Clone: human BAC, PAC, cosmid, etc. clone.
  • Accessioned clone: clone with sequence submitted to the public sequence databases.
  • Coverage (or depth): The average number of times that a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a Phred score of at least 20).
  • Reported coverage: fold coverage of a clone by reads (e.g., 4x) as reported in the accession.
  • Full shotgun coverage: The coverage in random raw sequence needed from a large-insert clone to ensure that it is ready for finishing; this varies among centers but is typically 8-10 fold. Clones with full shotgun coverage can usually be assembled with only a handful of gaps per 100 kb.
  • Half shotgun coverage: Half the amount of full shotgun coverage (typically, 4-5 fold random coverage).
  • Sequenced clone: clone that has been at least partially sequenced and submitted to the public databases as an accession.
  • Finished clone: A large-insert clone which is entirely represented by finished sequence. (A phase 3 clone with keyword "HTGS_phase3" in the accession.)
  • Full shotgun clone: A large-insert clone for which full shotgun sequence has been produced.
  • Draft clone: A large-insert clone for which roughly half-shotgun sequence has been produced. Operationally, the collection of draft clones produced by each center was required to have an average coverage of fourfold for the entire set and a minimum coverage of threefold for each clone. (A clone with keyword "HTGS_Draft" or "HTGS_phase1" or "HTGS_phase2" in the accession.)
  • Predraft clone: A large-insert clone for which some shotgun sequence is available, but which does not meet the standards for inclusion in the collection of draft clones.
  • Freeze: snapshot of the most recent accession for each sequenced clone and ancillary data taken at a particular date.
  • Contig: The result of joining an overlapping collection of sequences or clones.
  • Sequenced-clone contig (SCC): Contig produced by merging overlapping sequenced clones. (Nickname "barge".)
  • SCC gap: Gap between adjacent sequenced-clone contigs in the same fingerprint clone contig. (Nickname "clone gap".)
  • Fingerprint clone contig (FCC): Contig produced by joining clones inferred to overlap on the basis of their restriction digest fingerprints.
  • FCC gap: a gap between adjacent fingerprint clone contigs in the genome sequence (nickname "contig gap" or "layout gap").
  • Genome fingerprint map: The collection of all fingerprint clone contigs placed in a genome-wide map. (Nickname "BAC map" or "FPC map".)
  • Genetic map: A genome map in which polymorphic loci are positioned relative to one another based on the frequency with which they recombine during meiosis. The unit of distance is centiMorgans (cM), denoting a 1% chance of recombination.
  • Radiation hybrid (RH) map: A genome map in which STSs are positioned relative to one another based on the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analyzing a panel of human-hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centiRays (cR), denoting a 1% chance of a break occurring between the loci.
  • Sequenced-clone layout: An ordering of a set of sequenced clones based on association with a single fingerprint clone contig framework. (Nickname "Accession Map").
  • STS: Sequence tagged site, corresponding to a short (typically, less than 500 bp) unique genomic locus for which a PCR assay has been developed.
  • EST: Expressed sequence tag, obtained by performing a single raw sequence read from a random cDNA clone.
  • SSR: Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.
  • SNP: Single nucleotide polymorphism, or a single nucleotide position in the genomic sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
  • Initial sequence contigs: Contigs produced by merging overlapping sequence reads obtained from a single clone, in a process called sequence assembly. (Nickname "fragment" or ".ffa fragment".)
  • Merged sequence contigs: Contigs produced by taking the initial sequence contigs contained in overlapping clones and merging those found to overlap. These are also referred to simply as 'sequence contigs' where no confusion will result. (Nickname "raft".)
  • Sequence gap: a gap between adjacent sequence contigs in the draft genome sequence that is not also an FCC or SCC gap. (Nickname "fragment gap".)
  • BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically of 100-200kb. Most of the large-insert clones sequenced in the project were BAC clones.
  • BAC end: end sequence from a BAC clone.
  • BAC end pair: the two reads from the ends of a BAC clone, taken from the public databases.
  • Plasmid end pair: the two reads from the ends of a plasmid clone.
  • cDNA: EST or full length mRNA from the public databases.
  • Bridge: a link between two sequence contigs formed by:
    • matches to a BAC end pair
    • matches to a plasmid end pair
    • matches to two consecutive parts of a cDNA sequence
    • ordering and orientation information provided in an accession.
  • Bridged sequence gap: a gap between two consecutive sequence contigs in the draft genome sequence that are joined by a bridge.
  • Bridged SCC gap: an SCC gap between two sequenced clone contigs A and B such that there is a bridge from a sequence contig of A to a sequence contig of B.
  • Bridged FCC gap: similar to a bridged SCC gap, but between two fingerprint clone contigs.
  • Scaffold: The result of connecting contigs by linking information, such as paired-end reads from plasmids, paired-end reads from BACs, known mRNAs, or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
  • Sequence-contig scaffold: Scaffold produced by connecting a maximal set of sequence contigs joined by bridged gaps.
  • Sequenced-clone-contig scaffold: Scaffold produced by joining sequenced clone contigs by bridged SCC gaps.
  • .fa and .agp files: described in data formats
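  As a rough illustration of the coverage (depth) definition in the list above, the following Python sketch counts high-quality bases (Phred score of at least 20) across a set of reads and divides by the length of the sequenced target. The function names, the list-of-quality-scores input, and the toy data are assumptions made for this example and do not come from any tool named in this glossary. The conversion uses the standard Phred relation, error probability = 10^(-Q/10).

```python
def phred_to_error_probability(q):
    """Standard Phred relation: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def coverage(read_qualities, target_length, min_phred=20):
    """Average number of high-quality bases per nucleotide of the target.

    read_qualities: one list of per-base Phred scores for each read
                    (hypothetical input layout for this sketch).
    target_length:  length in bp of the clone or region being sequenced.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    high_quality_bases = sum(
        1 for read in read_qualities for q in read if q >= min_phred
    )
    return high_quality_bases / target_length


# Hypothetical toy data: three short reads over a 10-bp target.
reads = [
    [30, 30, 25, 10, 40],      # 4 bases at Phred >= 20
    [20, 22, 19, 35, 35, 35],  # 5 bases at Phred >= 20
    [15, 40, 40],              # 2 bases at Phred >= 20
]
depth = coverage(reads, target_length=10)
print(f"coverage: {depth:.1f}x")
print(f"error probability at Phred 20: {phred_to_error_probability(20):.2f}")
```

  In this toy case 11 high-quality bases over a 10-bp target give about 1.1-fold coverage; a real clone would be compared against the 4-5 fold (half shotgun) or 8-10 fold (full shotgun) ranges given above.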

  • PHRED: A widely-used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A Phred quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a Phred quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
  • PHRAP: A widely-used computer program that assembles raw sequence into sequence contigs and assigns to each position in the assembled sequence an associated 'quality score', based on the Phred scores of the raw sequence reads. A Phrap quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a Phrap quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
  • GigAssembler: A computer program developed during this project for merging the information from individual sequenced clones into a draft genome sequence.
  • N50 length: A measure of the contig length (or scaffold length) containing the 'typical' nucleotide. Specifically, it is the maximal length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.
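  The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is simply a list of contig (or scaffold) lengths in base pairs; the function name and the example lengths are made up for this sketch.

```python
def n50(lengths):
    """Largest L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases until half
    # of the total is reached; the contig that crosses the halfway
    # point defines the N50 length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), 300 bp in total.
contig_lengths = [100, 80, 50, 30, 20, 20]
print(n50(contig_lengths))  # 100 + 80 = 180 >= 150, so N50 is 80
```

  Sorting in descending order makes the first length at which the running total reaches half of all bases the maximal L that satisfies the definition.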

  Notes on Chromosome Comparator Statistics
  • Input/Assembly Statistics: Statistics labelled as "A"("Assembly") are based on the final draft assembly (or 'Golden Path') of the genome for the freeze in question. "I"("Input") statistics are based on the set of clones (defined by the 'freeze') used as input by GigAssembler in generating the assembly, in their entire, raw form.
  • Size: The 'Size' of a genomic object is the number of actual basepairs it includes - i.e., it doesn't include any gaps. So, for example, the 'size' of a clone is the sum of the lengths of its fragments (nonoverlapping parts of fragments for assembly stats).
  • Extent: How much space the object takes up on the genome - how far apart, in basepairs, its extreme points are.
  • Coverage: How thoroughly a clone is sequenced - specifically, the ratio of the Size to the Extent.
  • Finished, Deep, Draft Sequence Contigs: These are classifications of Sequence Contigs:
    • Finished: sequence contig that contains at least one finished clone fragment. Often abbreviated 'Fin'.
    • Deep: Unfinished sequence contig that contains fragments from at least two different draft clones.
    • Draft/Predraft: All other sequence contigs
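  To make the Size, Extent, and Coverage statistics above concrete, here is a small Python sketch. It assumes each fragment of a clone is given as a half-open (start, end) interval in genome coordinates, and it counts overlapping parts only once, as described for assembly statistics. The function names, the interval representation, and the example coordinates are assumptions for illustration, not the Chromosome Comparator's actual implementation.

```python
def size_bp(fragments):
    """Number of actual base pairs, counting overlapping parts once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(fragments):
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged block, start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def extent_bp(fragments):
    """Distance in base pairs between the object's extreme points."""
    if not fragments:
        raise ValueError("need at least one fragment")
    return max(end for _, end in fragments) - min(start for start, _ in fragments)


def coverage_ratio(fragments):
    """Ratio of Size to Extent."""
    return size_bp(fragments) / extent_bp(fragments)


# Hypothetical clone with three fragments; the first two overlap by 20 bp
# and there is a 150-bp gap before the third.
clone_fragments = [(0, 100), (80, 150), (300, 400)]
print(size_bp(clone_fragments))         # 250
print(extent_bp(clone_fragments))       # 400
print(coverage_ratio(clone_fragments))  # 0.625
```

  For input ("I") statistics, where fragments are taken in their entire raw form, the Size would instead be the plain sum of fragment lengths without merging overlaps.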

  More Terminology Resources
  • Further terminology is introduced in the description of the algorithm used to build the working draft.
  • Further general definitions can be found in NHGRI's Glossary of Genetic Terms.