"An Assessment of Gene Prediction Accuracy in Large DNA Sequences."
R. Guigó, P. Agarwal, J.F. Abril, M. Burset and J.W. Fickett.
Genome Research 10(10):1631-1642 (2000).
One of the first useful products from the human genome will be a set
of predicted genes. Besides its intrinsic scientific interest, the
accuracy and completeness of this data set are of considerable
importance for human health and medicine.
Though progress has been made on computational gene identification in
terms of both methods and accuracy evaluation measures, most of the
sequence sets in which the programs are tested are short genomic
sequences, and there is concern that these accuracy measures may not
extrapolate well to larger, more challenging data sets. Given the
absence of experimentally verified large genomic data sets, we
constructed a semi-artificial test set comprising a number of short
single-gene genomic sequences with randomly generated intergenic
regions. This test set, which should still present an easier problem
than real human genomic sequence, mimics BACs roughly 200 Kbp long.
In our experiments with these longer genomic sequences, the accuracy
of GenScan, likely the most accurate ab initio gene prediction
program, dropped significantly, although its sensitivity remained
high. The accuracy of similarity-based programs, such as Genewise,
Procrustes, and Blastx, on the other hand, was
not affected significantly by the presence of random intergenic
sequence, but it depended on the strength of the similarity to the
protein homolog. As expected, the accuracy dropped if the models were
built using more distant homologs, and we were able to quantitatively
estimate this decline. However, the specificities of these techniques
are still rather good even when the similarity is weak, which is a
desirable characteristic for driving expensive follow-up experiments.
Our experiments suggest that though gene prediction will improve with
every new protein that is discovered and through improvements in the
current set of tools, we still have a long way to go before we can
decipher the precise exonic structure of every gene in the human
genome using purely computational methodology.
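
The sensitivity and specificity figures discussed above can be illustrated at the nucleotide level. The sketch below assumes the convention common in the gene-prediction literature, where sensitivity is the fraction of annotated coding nucleotides that are predicted and specificity is the fraction of predicted coding nucleotides that are annotated; whether this matches the exact measures used in the paper is an assumption. The function names and coordinates are hypothetical.

```python
def exon_positions(exons):
    """Return the set of 1-based positions covered by exons.

    `exons` is a list of (start, end) pairs, both inclusive, as in GFF.
    """
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_accuracy(annotated, predicted):
    """Nucleotide-level (sensitivity, specificity) of a prediction.

    Assumed convention: Sn = TP / (TP + FN), Sp = TP / (TP + FP).
    """
    real = exon_positions(annotated)
    pred = exon_positions(predicted)
    tp = len(real & pred)  # coding positions found by both
    sn = tp / len(real) if real else float("nan")
    sp = tp / len(pred) if pred else float("nan")
    return sn, sp


# Hypothetical coordinates: the second exon is only half recovered, and
# a false exon is predicted far from the gene (e.g. in intergenic DNA).
annotated = [(101, 200), (301, 400)]
predicted = [(101, 200), (301, 350), (1001, 1100)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # Sn = 0.75, Sp = 0.60
```

In this toy case the false exon lowers specificity without touching sensitivity, which is the kind of effect long intergenic regions can have on an ab initio predictor.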
Sequence Test Set
Here you will find the files containing the genomic DNA sequences used in
the analysis (FASTA format), with and without masking of low-complexity regions, and the gene features extracted from EMBL v50 (GFF format).
- Single-Gene Sequences (SGS):
- Semi-Artificial Genomic Sequences (SAGS):
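
The relationship between the two sets can be sketched as follows: single-gene sequences are joined with randomly generated intergenic spacers, and the exon coordinates are shifted into the combined sequence. This is only a minimal illustration, not the procedure used to build the files. The uniform base composition, the spacer lengths, the single-strand layout, and all function names are assumptions.

```python
import random


def random_intergenic(length, rng):
    """Random DNA with uniform base composition (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_semi_artificial(genes, spacer_lengths, rng):
    """Join single-gene sequences with random intergenic spacers.

    `genes` is a list of (sequence, exons) pairs, with exons given as
    1-based inclusive (start, end) pairs relative to that sequence.
    `spacer_lengths` needs len(genes) + 1 entries: one spacer before
    each gene and one after the last.
    Returns the combined sequence and the shifted exon coordinates.
    """
    if len(spacer_lengths) != len(genes) + 1:
        raise ValueError("need one more spacer than genes")
    parts, shifted, offset = [], [], 0
    for (seq, exons), gap in zip(genes, spacer_lengths):
        parts.append(random_intergenic(gap, rng))
        offset += gap
        # Exon coordinates move by everything placed before this gene.
        shifted.extend((s + offset, e + offset) for s, e in exons)
        parts.append(seq)
        offset += len(seq)
    parts.append(random_intergenic(spacer_lengths[-1], rng))
    return "".join(parts), shifted


# Two toy "genes", each a single 9-nucleotide exon (hypothetical data).
genes = [("ATGAAATAG", [(1, 9)]), ("ATGCCCTGA", [(1, 9)])]
rng = random.Random(0)  # fixed seed so the spacers are reproducible
sags, exons = build_semi_artificial(genes, [5, 7, 4], rng)
print(len(sags), exons)  # 34 [(6, 14), (22, 30)]
```

Slicing the combined sequence with the shifted coordinates (converted to 0-based, e.g. `sags[5:14]`) returns the original gene sequence, which is a quick check that the annotation still matches.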