Gene Prediction based on Comparative Genomics
Recently, the importance of comparing genome sequences from different
species to locate functional domains conserved through evolution
(protein-coding regions among them) has been underscored. New
bioinformatics methodologies have been developed to infer
protein-coding genes from sequence comparisons of the genomes of two
different species (Batzoglou et al., 2000; Bafna and Huson, 2000;
Wiehe et al., 2001; Korf et al., 2001; Novichkov et al., 2001), and
these appear to lead to highly accurate predictions. The rationale is
that functional regions, including protein-coding ones, tend to be
more conserved across the genomes of different species than
non-functional ones (see figure below).
We are developing a method to predict genes in the human genome which
combines information from sequence signals potentially involved in
gene specification (splice sites and start codons, essentially) and
from protein-coding induced bias in the nucleotide composition of the
DNA sequence, with information from sequence similarity to the mouse
genome. Unlike methods previously described, this method does not
require fully assembled genomic mouse syntenic regions, and it can be
used with fragmentary mouse data at any level of coverage. A
preliminary version of this program is being used by the Mouse Genome
Figure: Pairwise comparison, using the program tblastx, of the
human and mouse genomic sequences coding for the HLA class II alpha
chain. Red boxes indicate the coding exons, while black diagonals
indicate the conserved alignments. The score of the conserved
alignments (divided by 10) is given in the lower panels. While the
conserved regions between the human and mouse genomic sequences coding
for this gene fully include the coding exons, a substantial fraction
of the intronic regions is also conserved.
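
The idea of combining a per-exon score with similarity to the mouse genome can be illustrated with a small sketch. This is not the scoring scheme of the program described above, which is not specified here. It is a minimal illustration under stated assumptions: conserved alignments are given as (start, end, score) intervals on the human sequence, the similarity of a position is the best score of any alignment covering it, and a candidate exon's score is its signal and composition score plus a weighted mean similarity. The names `Hsp`, `similarity_track`, `exon_score` and the default `weight` of 0.1 are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Hsp:
    """One conserved alignment on the human sequence (0-based, end-exclusive).

    Hypothetical container; real tblastx output would need to be parsed into it.
    """
    start: int
    end: int
    score: float


def similarity_track(hsps: Sequence[Hsp], length: int) -> List[float]:
    """Per-base similarity: best score of any alignment covering the position, else 0."""
    track = [0.0] * length
    for h in hsps:
        # Clip the alignment to the sequence boundaries before projecting it.
        for pos in range(max(h.start, 0), min(h.end, length)):
            track[pos] = max(track[pos], h.score)
    return track


def exon_score(base_score: float, track: Sequence[float],
               exon: Tuple[int, int], weight: float = 0.1) -> float:
    """Add the weighted mean similarity over the exon to its base score.

    base_score stands in for the signal and coding-bias score of the exon;
    the additive form and the weight are assumptions made for illustration.
    """
    start, end = exon
    if end <= start:
        raise ValueError("exon must have positive length")
    mean_similarity = sum(track[start:end]) / (end - start)
    return base_score + weight * mean_similarity


# Two overlapping alignments projected onto a 100-base sequence.
hsps = [Hsp(10, 40, 120.0), Hsp(30, 60, 80.0)]
track = similarity_track(hsps, 100)
score = exon_score(2.5, track, (20, 50))
```

Because the track only needs whatever alignments happen to be available, a sketch like this works with fragmentary mouse data: positions with no alignment simply contribute zero similarity.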
- R. Guigó, E.T. Dermitzakis, P. Agarwal, C.P. Ponting, G. Parra, A. Reymond, J.F. Abril, E. Keibler, R. Lyle, C. Ucla, S.E. Antonarakis and M.R. Brent.
"Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes."
PNAS 100(3):1140-1145 (2003)
- G. Parra, P. Agarwal, J.F. Abril, T. Wiehe, J.W. Fickett and R. Guigó.
"Comparative gene prediction in human and mouse."
Genome Research 13(1):108-117 (2003)
- Mouse Genome Sequencing Consortium (including J.F. Abril, G. Parra and R. Guigó).
"Initial sequencing and comparative analysis of the mouse genome."
Nature 420(6915):520-562 (2002)
- M.J. Betts, R. Guigó, P. Agarwal and R.B. Russell.
"Exon structure conservation despite low sequence similarity: a relic of dramatic events in evolution?"
EMBO Journal 20(19):5354-5360 (2001)
- T. Wiehe, S. Gebauer-Jung, T. Mitchell-Olds and R. Guigó.
"SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments."
Genome Research 11(9):1574-1583 (2001)
- T. Wiehe, R. Guigó, and W. Miller.
"Genome Sequence Comparisons: Hurdles in the Fast Lane to Functional Genomics."
Briefings in Bioinformatics 1(4):381-388 (2000)