Genome BioInformatics Research Lab

 
RESEARCH TOPICS
 
   
Gene Prediction Software: geneid
 

The group is involved in the ongoing development of the gene prediction program geneid. geneid (Guigó et al., 1992) was one of the first programs to predict full exonic structures of vertebrate genes in anonymous DNA sequences. geneid was designed with a hierarchical structure: first, gene-defining signals (splice sites, start and stop codons) were predicted along the query DNA sequence. Next, potential exons were constructed from these sites, and finally the optimal scoring gene prediction was assembled from the exons. In the original geneid the scoring function to optimize was rather heuristic: the sequence sites were predicted and scored using frequency matrices (position weight matrices, PWMs), a number of coding statistics were computed on the predicted exons, and each exon was scored as a function of the scores of the exon-defining sites and of the coding statistics. A neural network was used to estimate the coefficients of this function. An exhaustive search of the space of possible gene assemblies was then performed to rank predicted genes according to a score obtained through a complex function of the scores of the assembled exons.
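
The first level of the hierarchy, scoring candidate signals with a position weight matrix, can be sketched roughly as follows. This is a toy illustration only: the counts, the uniform background, the 4-base motif and the function names (build_pwm, score_site, scan) are assumptions made for the example, not geneid's parameters or code, and the log-odds form used here is one common way to score with a PWM rather than the formula of the original program.

```python
import math

# Hypothetical toy counts for a 4-base signal (100 example sites per column).
# These numbers are invented for illustration; they are NOT geneid parameters.
SITE_COUNTS = [
    {"A": 1, "C": 1, "G": 97, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 97},
    {"A": 60, "C": 5, "G": 30, "T": 5},
    {"A": 70, "C": 10, "G": 10, "T": 10},
]
# Assumed uniform background base composition.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(counts, background, pseudocount=1.0):
    """Turn per-position base counts into log-odds weights (site vs. background)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        pwm.append({
            base: math.log(((column[base] + pseudocount) / total) / background[base])
            for base in column
        })
    return pwm


def score_site(pwm, window):
    """Sum the per-position weights; bases outside A/C/G/T contribute 0 (neutral)."""
    return sum(column.get(base, 0.0) for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold=0.0):
    """Return (start, window, score) for every window scoring above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_site(pwm, window)
        if score > threshold:
            hits.append((start, window, score))
    return hits


# Example: list candidate sites in a short made-up sequence.
for start, window, score in scan(build_pwm(SITE_COUNTS, BACKGROUND), "CCGTAACC"):
    print(start, window, round(score, 2))
```

Sites that pass the threshold would then serve as the boundaries from which candidate exons are built.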

During the nineties geneid had some usage, mostly through a now nonfunctional e-mail server at Boston University (geneid@darwin.bu.edu) and through a WWW server at the Institut Municipal d'Investigació Mèdica (/geneid.html). During this period, however, there were substantial developments in the field of computational gene identification, and the original geneid became clearly inferior to other existing tools. We therefore started developing an improved version of the geneid program some time ago, one at least as accurate as other existing tools but much more efficient at handling very large genomic sequences, in terms of both speed and memory usage.

geneid prediction on the ADH region of the fly genome compared with the actual gene structure of the region.


This new version maintains the hierarchical structure (signal to exon to gene) of the original geneid, but we have simplified the scoring schema and given it a probabilistic meaning: scores for both exon-defining signals and protein-coding potential are computed as log-likelihood ratios, which for a given predicted exon are summed into the exon score, which is consequently also a log-likelihood ratio. Then, a dynamic programming algorithm (Guigó, 1998) is used to search the space of predicted exons and assemble the gene structure (in the general case, multiple genes on both strands) that maximizes the sum of the scores of the assembled exons, which can also be assumed to be a log-likelihood ratio.
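
The assembly step can be illustrated with a much-simplified sketch: given candidate exons that already carry a log-likelihood-ratio score, pick the chain of non-overlapping exons with the highest total score. The Exon type, the assemble function and the example coordinates are hypothetical, and the sketch ignores reading frame, strand, exon type and gene boundaries, all of which a real gene assembler has to respect. It is not the published algorithm, although it borrows the same general idea of sweeping over exons sorted by start and by end position, so that after sorting each exon is visited a constant number of times.

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # first position of the exon (assumed start <= end)
    end: int      # last position of the exon
    score: float  # log-likelihood ratio: site scores plus coding potential


def assemble(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the best total score and the chain of non-overlapping exons giving it."""
    n = len(exons)
    by_start = sorted(range(n), key=lambda i: exons[i].start)
    by_end = sorted(range(n), key=lambda i: exons[i].end)

    best = [0.0] * n   # best[i]: score of the best chain that ends with exon i
    prev = [-1] * n    # prev[i]: previous exon in that chain, or -1 if none
    run_best, run_idx = 0.0, -1  # best chain among exons ending before the current start
    k = 0
    for i in by_start:
        # Fold in every exon that ends strictly before exon i starts.
        while k < n and exons[by_end[k]].end < exons[i].start:
            j = by_end[k]
            if best[j] > run_best:
                run_best, run_idx = best[j], j
            k += 1
        best[i] = exons[i].score + run_best
        prev[i] = run_idx

    if n == 0 or max(best) <= 0.0:
        return 0.0, []  # no chain scores better than predicting nothing

    last = max(range(n), key=lambda i: best[i])
    total = best[last]
    chain = []
    while last != -1:
        chain.append(exons[last])
        last = prev[last]
    chain.reverse()
    return total, chain


# Made-up candidate exons: (start, end, log-likelihood-ratio score).
candidates = [
    Exon(10, 50, 2.0),
    Exon(40, 90, 3.5),
    Exon(100, 150, 1.5),
    Exon(60, 120, -1.0),
    Exon(160, 200, 2.5),
]
total, chain = assemble(candidates)
print(total, chain)
```

Because the exon scores are additive log-likelihood ratios, the total for a chain is simply the sum of its exon scores, which is what makes this kind of dynamic programming applicable.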

Execution time in this new version of geneid grows linearly with the size of the input sequence, currently at about two megabases per minute on a Pentium III (500 MHz) running Linux. The amount of memory required is also linear in the length of the sequence: about one megabyte per megabase, plus a constant of about 15 megabytes that does not depend on the sequence length. In practice, therefore, geneid can analyze sequences of virtually any length, for instance chromosome-size sequences.
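
As a back-of-the-envelope illustration of these figures, the small helper below converts a sequence length into an approximate running time and memory footprint. The function name estimate_resources is hypothetical, and the constants are simply the approximate values quoted above for a 500 MHz Pentium III, so the results are rough estimates rather than measurements.

```python
def estimate_resources(sequence_length_bp):
    """Rough (minutes, megabytes) estimate from the approximate figures in the text."""
    megabases = sequence_length_bp / 1_000_000
    minutes = megabases / 2.0            # about two megabases per minute
    memory_mb = 15.0 + 1.0 * megabases   # about 15 MB fixed plus 1 MB per megabase
    return minutes, memory_mb


# A 100-megabase sequence: roughly 50 minutes and 115 megabytes.
print(estimate_resources(100_000_000))
```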

This new version was initially trained to predict genes in the genome sequence of Drosophila melanogaster (Parra et al., 2000), but versions currently exist for human, Dictyostelium discoideum, Fugu rubripes and Tetraodon nigroviridis. geneid is at the core of the developments in our group to predict selenoprotein genes, and for comparative gene prediction.

 
Relevant publications
 

  • E. Blanco, G. Parra and R. Guigó.
    "Using geneid to Identify Genes."
    In A. D. Baxevanis and D. B. Davison, chief editors:
    Current Protocols in Bioinformatics. Volume 1, Unit 4.3.
    John Wiley & Sons Inc., New York, 2002. ISBN: 0-471-25093-7.

  • G. Glöckner, L. Eichinger, K. Szafranski, J.A. Pachebat, A.T. Bankier, P.H. Dear, R. Lehmann, C. Baumgart, G. Parra, J.F. Abril, R. Guigó, K. Kumpf, B. Tunggal, the Dictyostelium Genome Sequencing Consortium, E. Cox, M.A. Quail, M. Platzer, A. Rosenthal and A.A. Noegel.
    "Sequence and Analysis of Chromosome 2 of Dictyostelium discoideum."
    Nature 418(6893):79-85 (2002)

  • G. Parra, E. Blanco, and R. Guigó.
    "Geneid in Drosophila."
    Genome Research 10(4):511-515 (2000)

  • R. Guigó, M. Burset, P. Agarwal, J.F. Abril, R.F. Smith and J.W. Fickett.
    "Sequence Similarity Based Gene Prediction."
    In S. Suhai editor:
    Genomics and Proteomics: Functional and Computational Aspects.
    Plenum Publishing Corporation, 2000.

  • R. Guigó.
    "DNA composition, codon usage and exon prediction."
    In M. Bishop, editor:
    Genetic Databases. Pp:53-80.
    Academic Press, 1999.

  • R. Guigó.
    "Assembling genes from predicted exons in linear time with dynamic programming."
    Journal of Computational Biology, 5:681-702 (1998)

  • R. Guigó.
    "Computational gene identification."
    Journal of Molecular Medicine, 75:389-393 (1997)

  • J. W. Fickett and R. Guigó.
    "Computational gene identification."
    In S.R. Swindell, R.R. Miller and G. Myers, editors:
    Internet for the Molecular Biologist. Pp:73-100.
    Horizon Scientific Press, Oxford, United Kingdom, 1996.

  • R. Guigó and J. W. Fickett.
    "Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA."
    Journal of Molecular Biology, 253:51-60 (1995)

  • J. W. Fickett and R. Guigó.
    "Estimation of protein coding density in a corpus of DNA sequence data."
    Nucleic Acids Research, 20:2837-2844 (1993)

  • R. Guigó, S. Knudsen, N. Drake, and T. F. Smith.
    "Prediction of gene structure."
    Journal of Molecular Biology, 226:141-157 (1992)

 