Genome BioInformatics Research Lab

sgp2 HomePage

  1. What's sgp2?
  2. Main features
  3. Examples
  4. tblastx searching
  5. Gene predictions on genomes
  6. Comparative gene prediction in human and mouse
  7. Source code distribution
  8. References
  9. Authors and acknowledgements

What's sgp2?

sgp2 is a program that predicts genes by comparing anonymous genomic sequences from two different species. It combines tblastx, a sequence similarity search program, with geneid, an "ab initio" gene prediction program. In "asymmetric" mode, genes are predicted in a sequence from one species (the target sequence) using a set of one or more sequences from the other species (the reference set). Essentially, geneid is used to predict all potential exons along the target sequence. Exon scores are computed as log-likelihood ratios that depend on the splice sites defining the exon, the coding bias in the composition of the exon sequence as measured by a Markov model of order five, and the optimal alignment at the amino acid level between the target exon sequence and its homologous counterpart in the reference set. From the set of predicted exons, the gene structure (possibly multiple genes on both strands) is assembled by maximizing the sum of the scores of the assembled exons. A more detailed description of the algorithm can be found here.
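
The scoring and assembly steps described above can be illustrated with a small Python sketch. It is not the geneid or sgp2 implementation: it assumes the three log-likelihood ratio components are simply added with equal weights, and it assembles exons with plain weighted interval scheduling, ignoring reading frame compatibility, exon types, intron constraints and strands. The names Exon, exon_score and assemble are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 1-based start on the target sequence
    end: int      # inclusive end on the target sequence
    score: float  # combined log-likelihood ratio score


def exon_score(site_llr: float, coding_llr: float, similarity_llr: float) -> float:
    """Add the three score components (equal weights are an assumption)."""
    return site_llr + coding_llr + similarity_llr


def assemble(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Choose non-overlapping exons that maximize the summed score."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best total using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of best[i]
    prev = [0] * (n + 1)      # prev[i]: prefix length to continue from
    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons that end strictly before this one starts.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]
    # Walk back through the table to recover the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain
```

Calling assemble on a list of candidate exons returns the best total score and the chosen exons in order along the sequence. Exons with a negative score are never selected, since leaving them out always gives a higher sum.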

Main features

  • sgp2 predictions are more accurate than those obtained by pure "ab initio" gene finding programs.
  • Gene prediction can be done in the target and the reference genome simultaneously.
  • sgp2 output is available in several formats (GFF, geneid format or XML), each carrying a different amount of information.
  • Shotgun data can be used at any level of coverage as the informant genome.
  • sgp2 is very efficient in terms of speed and memory usage.




tblastx searching

The BLAST algorithms employ heuristics at various stages of the search, and the choice of parameters strongly affects the speed, sensitivity and specificity of the searches. sgp2 predictions are mainly based on the tblastx results, so the selection of parameters is critical for the final predictions. Although output from the two most widely used versions of BLAST (NCBI Blast and WU-blast) is accepted as input for sgp2, we recommend using WU-blast because it has a more flexible set of parameters.

For sgp2 predictions, the following set of WU-blast parameters is suggested to optimize the speed, sensitivity and specificity of the tblastx searches:

               B=100000 V=100000 hspmax=0 gspmax=0 nogap
               altscore='* any -999' altscore='any * -999'
               E=0.01 E2=0.01 S2=80 Z=3000000000

If you are comparing long sequences, you can increase the word size to 5 (W=5), which speeds up the search 60-fold with nearly the same sensitivity as the parameters above.
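
As a rough illustration, the suggested options can be assembled into an argument list from Python. This is only a sketch: it assumes the WU-blast convention of giving the database first and the query file second, and the names mouse_db and human.fa are hypothetical placeholders. The helper tblastx_command is not part of sgp2.

```python
import shlex
from typing import List, Optional

# Options copied from the parameter set suggested in the text.
SGP2_TBLASTX_OPTIONS = [
    "B=100000", "V=100000", "hspmax=0", "gspmax=0", "nogap",
    "altscore=* any -999", "altscore=any * -999",
    "E=0.01", "E2=0.01", "S2=80", "Z=3000000000",
]


def tblastx_command(database: str, query: str,
                    word_size: Optional[int] = None) -> List[str]:
    """Build an argument list for a WU-blast tblastx run."""
    # Assumption: WU-blast expects "tblastx database query [options]".
    command = ["tblastx", database, query, *SGP2_TBLASTX_OPTIONS]
    if word_size is not None:
        # W=5 is the value mentioned in the text for long sequences.
        command.append(f"W={word_size}")
    return command


if __name__ == "__main__":
    # "mouse_db" and "human.fa" are hypothetical names.
    args = tblastx_command("mouse_db", "human.fa", word_size=5)
    # shlex.join quotes the altscore options for display in a shell.
    print(shlex.join(args))
```

Keeping the command as a list and passing it to subprocess.run without a shell avoids accidental expansion of the "*" inside the altscore options.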
More details on the tblastx parameters used to predict genes with sgp2 in long genomic sequences can be found in the human and mouse prediction section or in the papers listed in the references section. An in-depth analysis of tblastx parameters for large genomic sequences can be found in the following article:

  • I. Korf. "Serial Blast searching."
    Bioinformatics 19(12):1492-1496 (2003) [Abstract]

Gene predictions on genomes

This link contains the sets of genes predicted with sgp2 on the recently sequenced genomes (Homo sapiens, Mus musculus and Rattus norvegicus) for some of their latest releases.

Comparative gene prediction in human and mouse

We evaluated the accuracy of sgp2 using a number of different data sets, because the lack of a gold standard for gene prediction makes it difficult to get an accurate assessment from any single data set. The results of this initial test can be found here.
In collaboration with other groups, we developed a two-stage procedure that exploits the human and mouse genome sequences, using sgp2 and twinscan, to produce a set of genes with a much higher rate of experimental verification. The results of this approach can be found here.

Source code distribution

The sgp2 distribution contains several directories and files compressed in a tar.gz file. Source code and documentation files are included, as well as several parameter files and other extra information.

All of the files can be obtained from our ftp server:

sgp2 v 1.1:

  • sgp2 v 1.1 full distribution: source code and documentation

Instructions to install sgp2 on your computer.



References

  • G. Parra, P. Agarwal, J.F. Abril, T. Wiehe, J.W. Fickett and R. Guigó.
    "Comparative gene prediction in human and mouse."
    Genome Research 13(1):108-117 (2003) [Abstract]

  • R. Guigó, E.T. Dermitzakis, P. Agarwal, C.P. Ponting, G. Parra, A. Reymond, J.F. Abril, E. Keibler, R. Lyle, C. Ucla, S.E. Antonarakis and M.R. Brent.
    "Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes."
    Proc. Nat. Acad. Sci. 100(3):1140-1145 (2003) [PubMed] [Abstract] [Full Text]

Authors and acknowledgements

The current version of sgp2 was written by Genis Parra, Josep F. Abril and Roderic Guigó, with contributions from Enrique Blanco, Thomas Wiehe and Moises Burset.

Copyright © 2003

sgp2 is distributed under the GNU General Public License.
