Description
The Gencode Genes track (v3.1, March 2007) shows high-quality manual
annotations in the ENCODE regions generated by the
GENCODE project.
The gene annotations are colored based on the HAVANA annotation type. See the
table below for the color key, as well as more detail about the transcript
and feature types. The GENCODE project recommends that the annotations
with known and validated transcripts, i.e., the types Known and Novel_CDS
(colored dark green in the track display), be used as the reference gene
annotation.
The v3.1 release includes the following updates and enhancements to v2.2
(Oct. 2005):
- Apart from the usual additions and corrections, 69 loci (consisting of 132
transcripts) were re-annotated based on Rapid Amplification of cDNA Ends
(RACE), array, and sequencing
analyses performed within the Affymetrix/GENCODE collaboration
(see the Methods section below, also Denoeud et al., 2007 and The
ENCODE Project Consortium, 2007).
- The polymorphic gene type was added.
- PolyA features were added.
- A bug affecting frames of CDSs with missing start or stop codons was
fixed.
- The experimental validation data contained in the Gencode
Introns track from the previous release were integrated into the
Gencode Genes track by annotators using the Human and Vertebrate Analysis and
Annotation manual curation process (HAVANA).
Type | Color | Description |
Known | dark green | Known protein-coding genes (i.e., referenced in Entrez Gene) |
Novel_CDS | dark green | Have an open reading frame (ORF) and are identical, or have homology, to cDNAs or proteins but do not fall into the above category. These can be known in the sense that they are represented by mRNA sequences in the public databases, but they are not yet represented in Entrez Gene or have not received an official gene name. They can also be novel in that they are not yet represented by an mRNA sequence in human. |
Novel_transcript | light green | Similar to Novel_CDS; however, cannot be assigned an unambiguous ORF. |
Putative | light green | Are identical, or have homology, to spliced ESTs, but are devoid of significant ORF and polyA features. These are generally short (two or three exon) genes or gene fragments. |
TEC | light green | (To Experimentally Confirm) Single-exon objects (supported by multiple unspliced ESTs with polyA sites and signals). |
Polymorphic | purple | Have functional transcripts in one haplotype and "pseudo" (non-functional) transcripts in another. |
Processed_pseudogene | blue | Pseudogenes that lack introns and are thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome. |
Unprocessed_pseudogene | blue | Pseudogenes that can contain introns, as they are produced by gene duplication. |
Artifact | grey | Transcript evidence and/or its translation equivocal. Usually these arise from high-throughput cDNA sequencing projects that submit automatic annotation, sometimes resulting in erroneous CDSs in what turns out to be, for example, 3' UTRs. In addition, HAVANA has extended this category to include cDNAs with non-canonical splice sites due to deletion/sequencing errors. |
PolyA_signal | brown | Polyadenylation signal |
PolyA_site | orange | Polyadenylation site |
Pseudo_polyA | pink | "Pseudo"-polyadenylation signal detected in the sequence of a processed pseudogene. Warning: Pseudo_polyA features and processed_pseudogenes generally don't overlap. The reason is that pseudogene annotations are based solely on protein evidence, whereas pseudo_polyA signals are identified from transcript evidence; as they are found at the end of the 3' UTR, they can lie several kb downstream of the 3' end of the pseudogene. |
The current full set of GENCODE annotations is available for download
here.
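As a rough illustration of how the color key and the reference-set recommendation could be applied to downloaded annotations, the following Python sketch maps each type to its display color and keeps only the Known and Novel_CDS records. The color values are copied from the table above; the record layout (dictionaries with "name" and "type" keys), the function name reference_subset, and the example records are assumptions made for illustration and do not reflect the actual download format.

```python
# Display colors copied from the color key table; everything else is hypothetical.
TYPE_COLORS = {
    "Known": "dark green",
    "Novel_CDS": "dark green",
    "Novel_transcript": "light green",
    "Putative": "light green",
    "TEC": "light green",
    "Polymorphic": "purple",
    "Processed_pseudogene": "blue",
    "Unprocessed_pseudogene": "blue",
    "Artifact": "grey",
    "PolyA_signal": "brown",
    "PolyA_site": "orange",
    "Pseudo_polyA": "pink",
}

# Types recommended by the GENCODE project as the reference gene annotation.
REFERENCE_TYPES = frozenset({"Known", "Novel_CDS"})


def reference_subset(records):
    """Return only the records whose type is in REFERENCE_TYPES."""
    return [r for r in records if r["type"] in REFERENCE_TYPES]


# Made-up records, for illustration only.
example = [
    {"name": "tx1", "type": "Known"},
    {"name": "tx2", "type": "Putative"},
    {"name": "tx3", "type": "Novel_CDS"},
]
print([r["name"] for r in reference_subset(example)])
```

Keeping the recommended types in a separate set, rather than filtering on the color "dark green", avoids tying the selection to a display detail.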
Methods
For a detailed description of the methods and references used, see Harrow
et al., 2006 and Denoeud et al., 2007.
5' RACE/array experiments
A combination of 5' RACE and
high-density tiling microarrays was used to empirically annotate 5'
transcription start sites (TSSs) and internal exons of all 410 annotated
protein-coding loci across the 44 ENCODE regions (Oct. 2005 GENCODE
freeze). The 5' RACE reactions were performed with oligonucleotides
mapping to a coding exon common to most of the transcripts of a protein-coding
gene locus annotated by GENCODE (Oct. 2005 freeze) on polyA+ RNA
from twelve adult human tissues (brain, heart, kidney, spleen, liver,
colon, small intestine, muscle, lung, stomach, testis, placenta) and
three cell lines
(GM06990 (lymphoblastoid),
HL60 (acute promyelocytic leukemia) and
HeLaS3 (cervix carcinoma)).
The RACE reactions were then hybridized to 20-nucleotide-resolution
Affymetrix tiling arrays covering the non-repeated regions of the 44
ENCODE regions. The resulting "RACEfrags"
-- array-detected fragments of RACE products -- were assessed for
novelty by comparing their genome coordinates to those of
GENCODE-annotated exons. Connectivity between novel RACEfrags and their
respective index exons was further investigated by RT-PCR, cloning and
sequencing. The resulting cDNA sequences (deposited in GenBank under
accession numbers DQ655905-DQ656069 and
EF070113-EF070122) were then fed into the HAVANA annotation pipeline as
mRNA evidence (see "HAVANA manual annotations" below).
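The novelty assessment described above, comparing RACEfrag coordinates with those of annotated exons, can be sketched as a simple interval-overlap test. This is a minimal illustration only: the text states that coordinates were compared but not the exact criterion, so treating "no overlap with any annotated exon" as novel is an assumption, as are the half-open coordinate convention, the function names, and the example coordinates. See Denoeud et al., 2007 for the actual procedure.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def novel_racefrags(racefrags, exons):
    """Return RACEfrags that overlap no annotated exon on the same chromosome.

    Both arguments are lists of (chrom, start, end) tuples.
    """
    novel = []
    for chrom, start, end in racefrags:
        hit = any(
            chrom == e_chrom and overlaps(start, end, e_start, e_end)
            for e_chrom, e_start, e_end in exons
        )
        if not hit:
            novel.append((chrom, start, end))
    return novel


# Made-up coordinates, for illustration only.
annotated_exons = [("chr21", 1000, 1200), ("chr21", 5000, 5300)]
racefrags = [("chr21", 1100, 1150), ("chr21", 3000, 3080), ("chr22", 1100, 1150)]
print(novel_racefrags(racefrags, annotated_exons))
```

The nested scan is quadratic in the worst case; for larger inputs, sorting the exons per chromosome or using an interval tree would be the usual choice.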
HAVANA manual annotations
The HAVANA
process was used to produce these annotations.
Before the manual annotation process begins, an automated analysis pipeline
for similarity searches and ab initio predictions is run
on a computer farm, and the results are stored in an Ensembl MySQL
database using a modified Ensembl analysis pipeline system. All
searches and prediction algorithms, except CpG island prediction (see
cpgreport in the EMBOSS application suite), are run on repeat-masked
sequence. RepeatMasker is used to mask interspersed repeats, followed by Tandem
Repeats Finder to mask tandem repeats.
Nucleotide sequence databases are searched with wuBLASTN, and
significant hits are re-aligned to the unmasked genomic sequence using
est2genome.
The UniProt protein database is searched with wuBLASTX, and the
accession numbers of significant hits are looked up in the Pfam
database. The hidden Markov models for Pfam protein domains are aligned
against the genomic sequence using Genewise to provide annotation of
protein domains.
Several ab initio prediction algorithms are also run:
Genscan and Fgenesh for genes, tRNAscan to find tRNA genes, and Eponine
TSS to predict transcription start sites.
Once the automated analysis is complete, the annotator uses a
Perl/Tk-based graphical interface, "otterlace", developed in-house at
the Wellcome Trust Sanger Institute to edit annotation data held in a
separate MySQL database system. The interface displays a rich,
interactive graphical view of the genomic region, showing features such as
database matches, gene predictions, and transcripts created by the
annotators. Gapped alignments of nucleotide and protein BLAST hits to
the genomic sequence are viewed and explored using the "Blixem"
alignment viewer.
Additionally, the "Dotter" dot plot tool is used to show the
pairwise alignments of unmasked sequence, thus revealing the location
of exons that are occasionally missed by the automated BLAST searches
because of their small size and/or match to repeat-masked sequence.
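To illustrate the general idea behind a dot plot, the sketch below marks every position pair at which two sequences share an exact word of a given length; a short conserved exon then shows up as a run of dots along a diagonal. This is not Dotter's algorithm and makes no claim about how Dotter scores or renders alignments. The function name, the exact-match criterion, the word length, and the example sequences are all assumptions chosen for brevity.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)

    # Look up each word of seq_a and record the matching position pairs.
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots


# Made-up sequences, for illustration only.
print(sorted(dot_plot("ACGTAC", "TTACGT", word=3)))
```

Dots that share the same value of j - i lie on one diagonal; consecutive dots on a diagonal correspond to an ungapped matching stretch.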
The interface provides a number of tools that the annotator uses to
build genes and edit annotations: adding transcripts, exon coordinates,
translation regions, gene names and descriptions, remarks and
polyadenylation signals and sites.
Verification
See Harrow et al., 2006 for information on verification techniques.
Credits
This GENCODE release is the result of a collaborative effort among
the following laboratories:
Lab/Institution | Contributors |
HAVANA annotation group, Wellcome Trust Sanger Institute, Hinxton, UK | Adam Frankish, Jonathan Mudge, James Gilbert, Tim Hubbard, Jennifer Harrow |
Genome Bioinformatics Lab CRG, Barcelona, Spain | France Denoeud, Julien Lagarde, Sylvain Foissac, Robert Castelo, Roderic Guigó (GENCODE Principal Investigator) |
Department of Genetic Medicine and Development, University of Geneva, Switzerland | Catherine Ucla, Carine Wyss, Caroline Manzano, Colette Rossier, Stylianos E. Antonarakis |
Center for Integrative Genomics, University of Lausanne, Switzerland | Jacqueline Chrast, Charlotte N. Henrichsen, Alexandre Reymond |
Affymetrix, Inc., Santa Clara, CA, USA | Philipp Kapranov, Thomas R. Gingeras |
References
Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C,
Chrast J et al.
Prominent use of distal 5' transcription start sites and discovery of a large number of additional
exons in ENCODE regions.
Genome Res. 2007 Jun;17(6):746-59.
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R,
Swarbreck D et al.
GENCODE: producing a reference annotation for ENCODE.
Genome Biol. 2006;7 Suppl 1:S4.1-9.
ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR,
Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al.
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot
project.
Nature. 2007 Jun 14;447(7146):799-816.