- What's sgp2?
- Main features
- tblastx searching
- Gene predictions on genomes
- Comparative gene prediction in human and mouse
- Source code distribution
- Authors and acknowledgements
sgp2 is a program that predicts genes by comparing anonymous
genomic sequences from two different species. It combines
tblastx, a sequence similarity search program, with
geneid, an "ab initio" gene prediction program. In "asymmetric" mode, genes
are predicted in one sequence from one species (the target sequence),
using a set of sequences (possibly only one) from the other species (the
reference set). Essentially,
geneid is used to predict all potential
exons along the target sequence. Exon scores are computed as
log-likelihood ratios, as a function of the splice sites defining the exon,
the coding bias in the composition of the exon sequence as measured by a
Markov model of order five, and the optimal alignment at the amino
acid level between the target exon sequence and its
homologous counterpart in the reference set. From the set of predicted
exons, the gene structure is assembled (possibly multiple genes on
both strands) by maximizing the sum of the scores of the assembled exons.
A more detailed description of the algorithm can be found in the papers listed in the references section.
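The assembly step described above can be illustrated with a minimal sketch. It is not the geneid or sgp2 implementation: it ignores strands, reading-frame compatibility and exon types, and simply selects the non-overlapping set of exons with the highest total score by dynamic programming. The `Exon` type, the `assemble` function and the example coordinates and scores are all hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # first position of the exon (inclusive)
    end: int      # last position of the exon (inclusive)
    score: float  # log-likelihood ratio score (hypothetical values below)


def assemble(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the best total score and the chain of non-overlapping exons.

    Simplified: real gene assembly must also respect strand, frame and
    exon-type constraints, which are not modelled here.
    """
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best score using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is in that optimum
    prev = [0] * (n + 1)      # prev[i]: exons ending before exon i starts
    for i, exon in enumerate(ordered, start=1):
        p = bisect_left(ends, exon.start)  # count of exons with end < start
        prev[i] = p
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i] = with_exon
            take[i] = True
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the selected exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


if __name__ == "__main__":
    candidates = [
        Exon(100, 200, 5.0),
        Exon(150, 300, 7.0),
        Exon(320, 400, 4.0),
        Exon(380, 500, -1.0),
    ]
    total, chain = assemble(candidates)
    print(total, chain)
```

In this sketch, exons with a negative score are never selected, because leaving them out always gives a total at least as high.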
- sgp2 predictions are more accurate than those obtained by pure "ab
initio" gene finding programs.
- Gene prediction can be done in the target and the reference
- sgp2 output is available in several formats (gff, geneid format
or XML), each carrying a different amount of
information.
- Shotgun data can be used at any level of coverage as the informant
- sgp2 is very efficient in terms of speed and memory usage.
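As an illustration of working with the gff output mentioned in the list above, the following sketch reads tab-separated GFF lines into records. It assumes the standard GFF column layout (seqname, source, feature, start, end, score, strand, frame, optional attributes). The sample lines, including the feature and group names, are made up for the example and are not real sgp2 output.

```python
from typing import List, NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    attributes: str


def parse_gff(text: str) -> List[GffRecord]:
    """Parse GFF text, skipping blank lines, comments and short lines."""
    records: List[GffRecord] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete GFF feature line
        seqname, source, feature, start, end, score, strand, frame = fields[:8]
        attributes = fields[8] if len(fields) > 8 else ""
        records.append(GffRecord(
            seqname, source, feature, int(start), int(end),
            None if score == "." else float(score),
            strand, frame, attributes,
        ))
    return records


# Made-up sample lines for illustration only.
SAMPLE = (
    "# hypothetical example\n"
    "chr1\tsgp2\tFirst\t100\t200\t5.2\t+\t0\tgene_1\n"
    "chr1\tsgp2\tTerminal\t320\t400\t.\t+\t1\tgene_1\n"
)

if __name__ == "__main__":
    for record in parse_gff(SAMPLE):
        print(record.feature, record.start, record.end, record.score)
```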
The BLAST algorithms employ heuristics at various stages of the
searching process, and the choice of parameters strongly affects
the speed, sensitivity and specificity of the searches. sgp2
predictions are mainly based on the tblastx results,
so the selection of parameters is critical for the final
predictions. Although output from both of the most widely used versions of BLAST (NCBI Blast and WU-blast) is accepted as input
for sgp2, we recommend using WU-blast because it has a more
flexible set of parameters.
For sgp2 predictions, the following set of WU-blast parameters was
suggested to optimize the speed, sensitivity and specificity of the search:
B=100000 V=100000 hspmax=0 gspmax=0 nogap
altscore='* any -999' altscore='any * -999'
E=0.01 E2=0.01 S2=80 Z=3000000000
If you are comparing long sequences, you can increase the word size
to 5 (W=5), which speeds up the search 60-fold while keeping the
sensitivity nearly the same as with the previous parameters.
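The parameter set quoted above can be collected into a command line programmatically. The sketch below only builds the argument list and does not run anything. It assumes the usual WU-blast invocation order (program, database, query, then options). The function name and the database and query file names are hypothetical.

```python
import shlex
from typing import List

# WU-blast tblastx parameters as listed in the text above.
SGP2_TBLASTX_PARAMS = [
    "B=100000", "V=100000", "hspmax=0", "gspmax=0", "nogap",
    "altscore=* any -999", "altscore=any * -999",
    "E=0.01", "E2=0.01", "S2=80", "Z=3000000000",
]


def tblastx_command(database: str, query: str,
                    long_sequences: bool = False) -> List[str]:
    """Build an argument list for a WU-blast tblastx search.

    Assumes the 'program database query options' calling convention.
    When long_sequences is True, the word size W=5 suggested in the
    text for long sequences is appended.
    """
    args = ["tblastx", database, query, *SGP2_TBLASTX_PARAMS]
    if long_sequences:
        args.append("W=5")
    return args


if __name__ == "__main__":
    # "mouse_db" and "human_region.fa" are placeholder names.
    command = tblastx_command("mouse_db", "human_region.fa",
                              long_sequences=True)
    # shlex.join quotes arguments containing spaces or shell wildcards.
    print(shlex.join(command))
```

Keeping the arguments as a list means that the `altscore` values, which contain spaces and `*`, can be passed to `subprocess.run` without any shell quoting.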
More details on the tblastx parameters used to predict genes with sgp2
on long genomic sequences can be found in the human mouse prediction section or in the
papers of the references section. An
in-depth analysis of tblastx parameters for large genomic
sequences can be found in the following article:
- I. Korf. "Serial Blast searching."
Bioinformatics 19(12):1492-1496 (2003)
contains the set of genes predicted by sgp2 on the
recently sequenced genomes (Homo sapiens, Mus musculus and Rattus
norvegicus) for some of their latest releases.
We evaluated the accuracy of sgp2 on a number of different
data sets, because the lack of a gold standard for gene prediction makes it
difficult to get an accurate assessment from any single data set.
The results of this initial test can be
In collaboration with other groups, we developed a two-stage procedure
that exploits the human and mouse genome sequences, using
sgp2 and twinscan, to produce a set of
genes with a much higher rate of experimental verification. The results of this approach can
The sgp2 distribution contains several directories and files
compressed in a tar.gz file. Source code and documentation
files are included in the distribution, as well as several parameter
files and other extra information.
All of the files can be
obtained from our
sgp2 v 1.1:
- sgp2 v 1.1 full distribution: source code and documentation
Instructions for installing sgp2 on your computer.
- G. Parra, P. Agarwal, J.F. Abril, T. Wiehe, J.W. Fickett and R. Guigó.
"Comparative gene prediction in human and mouse."
Genome Research 13(1):108-117 (2003)
- R. Guigó, E.T. Dermitzakis, P. Agarwal, C.P. Ponting, G. Parra, A. Reymond, J.F. Abril, E. Keibler, R. Lyle, C. Ucla, S.E. Antonarakis and M.R. Brent.
"Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes."
Proc. Nat. Acad. Sci. 100(3):1140-1145 (2003)
The current version of sgp2 has been written by
Josep F. Abril and
With contributions from Enrique Blanco, Thomas Wiehe and Moises Burset.
Copyright © 2003
sgp2 is distributed under the GNU General Public License.