UPDATE: This page is no longer maintained. The GENCODE project has moved to the Sanger Institute.
The ENCODE Project: ENCyclopedia Of DNA Elements
The American National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence.
(A complete description of the project is available on the main ENCODE website (NHGRI); see also Science 2004 Oct 22; 306(5696):636-40.)
NEW: We are pleased to announce that the EGASP Genome Biology supplement is now available via Amazon.
The GENCODE Project: Encyclopedia of genes and gene variants
GENCODE is a sub-project of ENCODE.
The overall goal of the proposal is to identify all protein-coding genes in the regions of the human genome selected within the ENCODE project. This means the delineation of one complete mRNA sequence for at least one splice isoform of each protein-coding gene in the ENCODE regions, and often, but not systematically, the inference of a number of additional alternative splice forms of these genes.
Towards that end, we plan to use a wide variety of existing computational gene prediction tools. Among these, the tools that take advantage of the conservation of characteristic features between human genes and their orthologues in other vertebrate species will play an essential role.
Computational predictions will be followed by experimental verification with RT-PCR, RACE and direct sequencing of the products. The entire system will be integrated in a largely automated pipeline.
In contrast to other undirected large-scale gene characterization projects, our proposal emphasizes a targeted approach in which computational gene predictions guide the subsequent experimental verification. We will follow a stepwise strategy: first we will validate known genes, then we will confirm reliable computational predictions, and finally we will identify previously unknown genes. Specifically, our aims are:
1. Experimental validation of known genes. We will validate genes in the RefSeq collection, but only when discrepancies with the corresponding orthologous vertebrate genes suggest that the human RefSeq transcript version does not correspond to the correct exonic structure.
2. Experimental confirmation of reliable computational predictions. We will obtain the full-length mRNA sequence for at least one splice form of each prediction in the Ensembl collection not included in RefSeq. This will often involve refining the initial prediction, and identifying non-functional pseudogenes in this collection.
3. Identification of previously unknown genes, followed by experimental verification. We will obtain a full-length mRNA sequence for each experimentally confirmed gene not in RefSeq or Ensembl. We will especially concentrate our efforts on genes and exonic variants likely to be underrepresented in the current catalog of human genes: short and intronless genes, genes undergoing non-canonical splicing, selenoprotein genes (genes in which the TGA stop codon is translated into a selenocysteine residue), genes with unusual codon composition that may be expressed at very low levels or with a very restricted pattern, and human-specific genes and genes evolving very rapidly, whose corresponding orthologues either do not exist in other species or are difficult to identify.
4. Identification of a number of splice variants. The exhaustive characterization of the splice isoforms of the genes in the ENCODE regions is beyond the scope of this proposal. However, as a result of the experimental verification techniques used in the previous steps, we will often be able to identify all constitutive and alternative exons of the verified genes, and infer a few of their splice variants.
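As an illustration only, the stepwise strategy in aims 1 to 3 can be read as a simple triage rule applied to each candidate gene model. The sketch below is not part of the GENCODE pipeline: the names GeneModel, Aim and assign_aim, the boolean flags, and the example gene names are all hypothetical, and it assumes that membership in RefSeq or Ensembl and the presence of an orthologue discrepancy have already been determined elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Aim(Enum):
    """Hypothetical labels for the first three aims listed above."""
    NO_ACTION = 0           # RefSeq gene with no orthologue discrepancy
    VALIDATE_KNOWN = 1      # aim 1: re-validate a questionable RefSeq gene
    CONFIRM_PREDICTION = 2  # aim 2: Ensembl prediction not in RefSeq
    NOVEL_GENE = 3          # aim 3: in neither RefSeq nor Ensembl


@dataclass(frozen=True)
class GeneModel:
    """Illustrative record; the field names are assumptions, not a real schema."""
    name: str
    in_refseq: bool
    in_ensembl: bool
    ortholog_discrepancy: bool = False


def assign_aim(model: GeneModel) -> Aim:
    """Route a candidate gene model to the aim under which it would be verified."""
    if model.in_refseq:
        # Known genes are re-validated only when comparison with orthologous
        # vertebrate genes suggests the exonic structure is wrong.
        if model.ortholog_discrepancy:
            return Aim.VALIDATE_KNOWN
        return Aim.NO_ACTION
    if model.in_ensembl:
        return Aim.CONFIRM_PREDICTION
    return Aim.NOVEL_GENE


if __name__ == "__main__":
    # Made-up candidates, one per branch of the triage rule.
    candidates = [
        GeneModel("geneA", in_refseq=True, in_ensembl=True),
        GeneModel("geneB", in_refseq=True, in_ensembl=True, ortholog_discrepancy=True),
        GeneModel("geneC", in_refseq=False, in_ensembl=True),
        GeneModel("geneD", in_refseq=False, in_ensembl=False),
    ]
    for model in candidates:
        print(model.name, assign_aim(model).name)
```

Aim 4 is deliberately left out of the sketch, since splice variants are described above as a by-product of the verification experiments rather than as a separate class of candidates.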