General Technique
Many algorithms for computational gene prediction have been described and embodied in a large variety of software packages in regular use by the scientific community. These algorithms are ultimately based on two underlying concepts:
- the identification and modeling of gene structure, and
- the recognition of sequence similarity.
When predicting gene structure, one identifies the structural and functional requirements for a segment of genomic sequence to be transcribed, then spliced, and finally to produce a protein sequence. When studying sequence similarity, one identifies genes by observing regions of sequence that were maintained in evolution; or, when comparing a genome to mRNA/EST data, one identifies those regions that are observed to be expressed in the available data set. Some very successful hybrid methods have been developed, combining these two concepts. The commonly used gene prediction programs are therefore typically not conceptually independent, and combining them is not trivial, usually yielding only minor incremental improvements over each separate program. The algorithms have a strong bias for identifying conserved protein-coding genes, but there is now evidence for the existence of many transcripts that do not fulfill these requirements.
We have defined a third basic concept that can be used for gene prediction based entirely on the analysis of genomic sequence data:
- the recognition of "transcription side effects"
This refers to the effects of sustained transcription on the genomic sequence as an accumulation of small mutational and selective biases related to the transcription process. We identified and developed four novel algorithms based on this third orthogonal concept. Two of the methods are based on an observed mutational strand bias due to transcription-coupled DNA repair, while the other two are based on asymmetric selection against poly-adenylation signals that would truncate transcription units.
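As a rough illustration of the second kind of signal (asymmetric selection against poly-adenylation signals), the sketch below counts the canonical poly-adenylation hexamer AATAAA on each strand of a sequence window and reports a normalized difference. This is not the FEAST implementation: the function names, the use of a single hexamer, and the scoring formula are assumptions made here for illustration only.

```python
# Illustrative sketch only; not the published FEAST algorithm.
# Assumption: AATAAA is used as the sole poly-adenylation signal, and the
# score is a simple normalized difference of counts between strands.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement; unknown symbols become 'N'."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def polya_signal_asymmetry(seq, motif="AATAAA"):
    """Hypothetical score in [-1, 1].

    Positive values mean the signal is rarer on the given (forward) strand
    than on the reverse strand, which would be consistent with selection
    against the signal in forward-strand transcripts. Returns 0.0 when the
    motif is absent from both strands.
    """
    seq = seq.upper()
    forward = count_overlapping(seq, motif)
    reverse = count_overlapping(seq, reverse_complement(motif))
    total = forward + reverse
    if total == 0:
        return 0.0
    return (reverse - forward) / total
```

For a window containing three reverse-strand signals and one forward-strand signal, the score is (3 - 1) / 4 = 0.5. A real analysis would need long windows and a background model before any such score could be called significant.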
The new methods are statistically combined into an integrated method called "FEAST". FEAST can:
- successfully identify the location and orientation of many previously known genes, and
- give specific predictions for the location and orientation of thousands of additional transcripts.
Purpose/use/application of the technique:
The genome of Homo sapiens has been sequenced as a means to establish the ultimate genetic "parts list" of human biology (i.e., the identification of the complete ensemble of genes, followed by their functional characterization). The genomes of several other model species have been completed or drafted, informing us about their biology and helping us interpret the human genomic sequence data. The genetic "parts list", together with the identification of regulatory sites in the genome, is among the most basic building blocks for the construction of metabolic and regulatory models of cell function.
Computational and experimental analyses have suggested the existence of a relatively modest number of genes, but the identification of significantly conserved sequences shows the existence of many additional functional elements in the genome. More recently, genome tiling experiments have provided evidence for much more extensive transcription than previously reported, and revealed that about half of the observed transcripts are not exported outside the nucleus, suggesting their functional form may be as untranslated RNA.
The FEAST algorithms are a powerful new tool for identifying the extent and orientation of novel genes, revealing the presence of functional units within regions previously considered to be "gene deserts". As such, they are a natural tool for following up on positional cloning efforts that suggest that the causative agent for a disease might be located in an otherwise uncharacterized region of the genome.
Example(s) of projects at ISB that use this technique:
We successfully developed a "proof of principle" implementation of the novel transcript prediction method and used it to identify previously undetected human genes in several loci, including the loci of susceptibility to Type 1 (juvenile) diabetes.
Ongoing area of technology development:
With the availability of complete (or extensively drafted) genomes, extensive pairwise and multi-species alignments can be produced at high certainty for most regions of the genome. This will allow us to combine information from orthologous loci in different species. For example, when considering interspersed repeats, some will have inserted into the genome prior to the divergence between the species, while others will be lineage-specific. If a genomic region has continued to be transcribed after a speciation event, the lineage-specific repeats can be taken as independent pieces of evidence supporting transcription at the specific locus. Therefore, a combined and more powerful prediction can be made, simultaneously, for the orthologous loci. A locus may not have accumulated sufficient signal to produce a significant prediction in any one species, but the combined multi-species signal may be significant. In this sense, even a locus that has lost its function in one species, and is decaying as a pseudogene, may hold significant information about its previous transcription to support the identification of its ortholog in a species where it is still functional.
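The idea that a weak signal in each species can add up to a significant combined signal can be sketched with Stouffer's method for combining z-scores. This is a generic statistical technique swapped in for illustration; the text above does not say how the multi-species combination would actually be computed. The sketch assumes each species contributes one score that is independent of the others and standard normal under the null hypothesis of no transcription, with the sign encoding the predicted strand. The function names are hypothetical.

```python
import math

# Illustrative sketch only: Stouffer's method is an assumption made here,
# not a description of how FEAST combines evidence across species.


def combine_z_scores(z_scores):
    """Combine independent per-species z-scores (unweighted Stouffer)."""
    if not z_scores:
        raise ValueError("at least one z-score is required")
    return sum(z_scores) / math.sqrt(len(z_scores))


def two_sided_p_value(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical scores for one orthologous locus in three species.
per_species = [1.5, 1.5, 1.5]
combined = combine_z_scores(per_species)

# Each species alone falls short of the 0.05 threshold,
# while the combined score passes it.
single_p = two_sided_p_value(per_species[0])
combined_p = two_sided_p_value(combined)
```

The independence assumption matters: lineage-specific repeats inserted after a speciation event are plausibly independent observations, whereas sequence inherited from the common ancestor is not, so a practical scheme would have to count only the former or down-weight the latter.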
Representative publication(s):
Glusman, G., Qin, S., El-Gewely, R., Siegel, A., Roach, J., Hood, L. and Smit, A.F.A. A third approach to gene prediction suggests thousands of additional human transcribed regions. (2006) PLoS Computational Biology 2(3): e18.