This track was produced as part of the ENCODE Project. RNA-seq is a method for
mapping and quantifying the transcriptome of any organism that has a genomic DNA
sequence assembly. RNA-seq was performed by reverse-transcribing an RNA sample into
cDNA, followed by high-throughput DNA sequencing, which was done on an Illumina
Genome Analyzer (GAI or GAIIx) (Mortazavi et al., 2008). The transcriptome
measurements shown on these tracks were performed on polyA-selected RNA isolated from
total cellular RNA using two different protocols: one that preserves information about
which strand each read comes from and one that does not. Because of the enzymology of
library construction, gene and transcript quantification is more accurate with the
non-strand-specific protocol; the strand-specific protocol is useful for assigning
strandedness but is generally less reliable for quantification.
Non-strand-specific Protocol (deep "reference" transcriptome measurements, 2x75 bp reads)
PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis, converted into
cDNA by random priming and then amplified. Data were produced in two formats: single
reads, each of which came from one end of a cDNA molecule, and paired-end reads, which
were obtained as pairs from both ends of cDNAs. This RNA-seq protocol does not specify
the coding strand. As a result, there is ambiguity at loci where both strands are
transcribed. The "randomly primed" reverse transcription appears not to be fully
random: the first residues of the read population show a sequence bias, which likely
contributes to the observed unevenness in sequence coverage across transcripts.
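The sequence bias in the first residues of the reads can be checked by tabulating base composition at each read-start position. The following is a minimal sketch, not the procedure used to produce this track; the function name and the toy reads are hypothetical, and real input would come from the fastq files.

```python
from collections import Counter


def start_position_base_frequencies(reads, n_positions=10):
    """Return one dict per read-start position giving the fraction of A, C, G and T.

    Hypothetical helper for illustration; bases other than A/C/G/T (e.g. N)
    are ignored when computing fractions.
    """
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1

    freqs = []
    for position_counts in counts:
        total = sum(position_counts[b] for b in "ACGT")
        freqs.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return freqs


# Toy reads (made up): with unbiased priming each base would be near 0.25
# at every position; a strong skew at the first positions suggests bias.
toy_reads = ["GGTACA", "GCTAAT", "GGAACC", "ACTAGG"]
freqs = start_position_base_frequencies(toy_reads, n_positions=3)
for position, f in enumerate(freqs, start=1):
    print(position, f)
```

On real data, the frequencies at the first few positions would be compared with those further into the read.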
Strand-specific Protocol (1x75 bp reads)
PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis. A set of 3' and 5'
adapters was ligated to the 3' and 5' ends of the fragments, respectively. The resulting
RNA molecules were converted to cDNA and amplified. This RNA-seq protocol specifies the
coding strand because each read is in the same 5'-3' orientation as the original RNA strand.
As a result, loci where both strands are transcribed can be disambiguated. However, RNA
ligation is an inherently biased process, so sequence coverage across transcripts is more
uneven than in the non-strand-specific data, and quantification is less accurate.
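Because each strand-specific read has the same orientation as the original RNA, the strand of the originating transcript can be read from the alignment strand. The sketch below assumes single-end reads and uses the standard SAM FLAG bits (0x4 for unmapped, 0x10 for reverse-strand alignment). The function name is hypothetical.

```python
def transcript_strand(sam_flag):
    """Infer the transcript strand for a single-end, strand-specific read.

    Assumption: the read is in the same 5'-3' orientation as the RNA, so the
    strand the read aligns to is the strand of the transcript.
    Returns "+", "-", or None for an unmapped read.
    """
    if sam_flag & 0x4:  # read is unmapped
        return None
    return "-" if sam_flag & 0x10 else "+"


print(transcript_strand(0))   # forward-strand alignment
print(transcript_strand(16))  # reverse-strand alignment
```

For the non-strand-specific libraries, the same flag only reports how the read aligned and says nothing about the transcribed strand.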
Data Analysis
Reads were aligned to the hg19 human reference genome using TopHat (Trapnell et al.,
2009), a program specifically designed to align RNA-seq reads and discover splice junctions de
novo. Cufflinks (Trapnell et al., 2010), a de novo transcript assembly and quantification
software package, was run on the TopHat alignments to discover and quantify novel transcripts and
to obtain transcript expression estimates based on the GENCODE annotation. All sequence files,
alignments, gene and transcript models, and expression estimate files are available for download.
Display Conventions and Configuration
This track is a multi-view composite track that contains multiple data types (views). For each
view, there are multiple subtracks that display individually on the browser. Instructions for
configuring multi-view tracks are here.
The following views are in this track:
Alignments
The Alignments view shows reads aligned to the genome. Alignments are colored by cell type.
The tags used in this file are: NH XS CP NM CC. The 'XS' custom tag indicates the sense/anti-sense
of the strand. See the Bowtie Manual (Langmead et al., 2009) for more information about the
Bowtie SAM output (including other tags) and the SAM Format Specification for more information
on the SAM/BAM file format.
For Stranded Data (1x75)
Plus Raw Signal (only for stranded data)
Density graph (wiggle) of signal enrichment based on normalized aligned read density (Reads
Per Million, RPM) for reads aligning to the forward strand.
Minus Raw Signal (only for stranded data)
Density graph (wiggle) of signal enrichment based on normalized aligned read density (Reads
Per Million, RPM) for reads aligning to the reverse strand. Minus-strand scores are multiplied by -1
for display purposes so that plus- and minus-strand data can be viewed around a common axis.
For Paired-End Non-Stranded Data (2x75)
Raw Signal (only for paired-end data)
Density graph (wiggle) of signal enrichment based on normalized aligned read density (Reads
Per Million, RPM). The RPM measure assists in visualizing the relative amount of a given transcript
across multiple samples. Separate tracks for the forward (plus) and reverse (minus) strands are
provided for strand-specific data.
Splice Sites
Subset of aligned reads that cross splice junctions.
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
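The RPM normalization used in the signal views, and the sign flip applied to minus-strand scores, can be sketched as follows. This is an illustration rather than the pipeline that generated the tracks. It assumes the denominator is the total number of aligned reads in the sample, and the function and parameter names are hypothetical.

```python
def rpm_signal(coverage, total_aligned_reads, strand="+"):
    """Scale per-base read depth to Reads Per Million (RPM).

    coverage: list of raw read depths, one value per base.
    total_aligned_reads: assumed normalization denominator for the sample.
    strand: "+" or "-"; minus-strand values are multiplied by -1 so that
            both strands can be drawn around a common axis.
    """
    if total_aligned_reads <= 0:
        raise ValueError("total_aligned_reads must be positive")
    scale = 1_000_000 / total_aligned_reads
    sign = -1.0 if strand == "-" else 1.0
    return [sign * depth * scale for depth in coverage]


# Toy depths (made up) for a sample with 20 million aligned reads.
plus = rpm_signal([0, 10, 20], total_aligned_reads=20_000_000, strand="+")
minus = rpm_signal([0, 10, 20], total_aligned_reads=20_000_000, strand="-")
print(plus)
print(minus)
```

Dividing by sequencing depth is what makes signal heights comparable between samples sequenced to different depths.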
Methods
Experimental Procedures
Cells were grown according to the approved ENCODE cell culture protocols, except for H1-hESC, for which frozen cell pellets
were purchased from Cellular Dynamics. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and
processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of
the "on-column" DNase digestion step to remove residual genomic DNA.
A quantity of 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according
to the manufacturer's protocol to isolate mRNA from each of the preparations. For 2x75 bp non-stranded
RNA-seq, 100 ng of mRNA was then processed according to the protocol in Mortazavi et al.
(2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the
ChIP-seq DNA genomic DNA kit (Illumina). The majority of paired-end libraries were size-selected around
200 bp (fragment length) with the exception of a few additional replicates that were size-selected at 400
bp with the specific intent to investigate the effect of fragment length on results. Strand-specific RNA-seq
libraries were prepared from 100 ng of mRNA from the same preparation following
Illumina's Strand-Specific RNA-seq protocol.
Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according
to the manufacturer's recommendations. Reads of 75 bp length were obtained, single-end for directional,
strand-specific libraries (1x75D) and paired-end for non-strand-specific libraries (2x75).
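For the paired-end libraries, the fragment length and read length together determine the gap between the two mates (the inner-mate distance that paired-end aligners take as a parameter). The sketch below assumes the stated fragment lengths refer to the cDNA insert without adapters, which the text does not specify. The function names and the example fragment lengths are hypothetical.

```python
def inner_mate_distance(fragment_length, read_length):
    """Unsequenced gap between two mates: fragment length minus both reads.

    Assumption: fragment_length is the insert size excluding adapters.
    A negative result would mean the two mates overlap.
    """
    return fragment_length - 2 * read_length


def mean_inner_mate_distance(fragment_lengths, read_length):
    """Mean inner-mate distance over a collection of observed fragment lengths."""
    distances = [inner_mate_distance(f, read_length) for f in fragment_lengths]
    return sum(distances) / len(distances)


print(inner_mate_distance(200, 75))  # library size-selected around 200 bp
print(inner_mate_distance(400, 75))  # library size-selected at 400 bp
print(mean_inner_mate_distance([190, 200, 210], 75))
```

In practice the mean would be estimated from observed fragment lengths rather than from the nominal size-selection target.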
Data Processing and Analysis
Reads were mapped to the reference human genome (version hg19), with or without the Y chromosome,
depending on the sex of the cell line, and without the random chromosomes and haplotypes in all cases,
using TopHat (version 1.0.14). TopHat was
used with default settings with the exception of specifying an empirically determined mean inner-mate
distance. After mapping reads to the genome and identifying splice junctions, the data were further
analyzed using the transcript assembly and quantification software
Cufflinks (version 0.9.3) using the sequence
bias detection and correction option. Cufflinks was used in two modes: 1) expression for genes and
individual transcripts was quantified based on the GENCODE annotation, for both versions v3c and v4 of
GENCODE GRCh37; 2) Cufflinks was run in de novo transcript assembly and quantification mode to
obtain candidate novel transcript and gene models and expression estimates for them.
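The choice of reference sequences described above (chrY only for male cell lines, no random chromosomes or haplotypes) can be sketched as a filter over hg19 sequence names. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and dropping the unplaced chrUn_ sequences is an assumption not stated in the text.

```python
def reference_chromosomes(all_names, sex):
    """Select the hg19 sequences to align against for a cell line.

    sex: "male" or "female". chrY is kept only for male cell lines.
    Sequences ending in "_random" and haplotype sequences ("_hap") are dropped.
    Assumption: unplaced "chrUn_" sequences are dropped as well.
    """
    kept = []
    for name in all_names:
        if name.endswith("_random") or "_hap" in name:
            continue
        if name.startswith("chrUn_"):  # assumption, not stated in the text
            continue
        if name == "chrY" and sex != "male":
            continue
        kept.append(name)
    return kept


names = ["chr1", "chrX", "chrY", "chrM", "chr1_gl000191_random", "chr6_apd_hap1"]
print(reference_chromosomes(names, "female"))
print(reference_chromosomes(names, "male"))
```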
.fastq - Raw sequence files in fastq format with phred33 quality scores.
Junctions.bedRnaElements - A BED file containing TopHat-defined splice junctions.
TranscriptDeNovo.gtf - A GTF file containing transcript models and expression estimates in
FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) produced by Cufflinks in de novo mode.
TranscriptGencV3c.gtf - Expression level estimates at the transcript level for the GENCODE
GRCh37.v3c annotation in GTF format.
GenesDeNovo.gtf - Expression estimates for genes defined by Cufflinks in de novo mode in
GTF format.
GeneGencV3c.gtf - Expression level estimates at the gene level for the GENCODE GRCh37.v3c
annotation in GTF format.
ExonGencV3c.gtf - Expression level estimates for GENCODE GRCh37.v3c exons in GTF format
derived by summing the expression levels in FPKM for all transcripts containing a given exon.
TSS.gtf - Expression level estimates for GENCODE GRCh37.v3c transcription start sites (TSS) in
GTF format derived by summing the expression levels in FPKM for all transcripts originating from a given TSS.
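The exon-level and TSS-level estimates are described as sums of transcript FPKM values. A minimal sketch of that summation is shown below; the function name, the transcript identifiers, and the coordinates are made up for illustration, and parsing of the GTF files is left out.

```python
from collections import defaultdict


def sum_fpkm_by_feature(transcript_fpkm, transcript_features):
    """Sum transcript FPKM over every feature (exon or TSS) a transcript contains.

    transcript_fpkm: dict mapping transcript ID -> FPKM.
    transcript_features: dict mapping transcript ID -> iterable of feature keys,
        e.g. (chrom, start, end, strand) for exons or (chrom, pos, strand) for TSSs.
    """
    totals = defaultdict(float)
    for transcript_id, features in transcript_features.items():
        fpkm = transcript_fpkm.get(transcript_id, 0.0)
        for feature in set(features):  # count each feature once per transcript
            totals[feature] += fpkm
    return dict(totals)


# Toy example: two transcripts sharing one exon.
exon_a = ("chr1", 100, 200, "+")
exon_b = ("chr1", 300, 400, "+")
exon_c = ("chr1", 500, 600, "+")
fpkm = {"T1": 4.0, "T2": 1.5}
exons = {"T1": [exon_a, exon_b], "T2": [exon_b, exon_c]}
exon_fpkm = sum_fpkm_by_feature(fpkm, exons)
print(exon_fpkm)
```

The same function covers the TSS case when each transcript is mapped to its single start site instead of its exons.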
Verification
Known exon maps as displayed on the genome browser are confirmed by the alignment of sequence reads.
Known spliced exons are detected at the expected frequency for transcripts of given abundance.
Spiked-in RNA transcripts from Arabidopsis and phage lambda are detected in a linear range
spanning 5 orders of magnitude.
Endpoint RT-PCR confirms presence of selected 3' UTR extensions.
Correlation to published microarray data: r = 0.62.
Release Notes
This is release 4 (August 2012). Fastq files for GM12892, GM12891 and K562 (R1x75) were replaced
after errors were found in the GEO submission process.
Credits
Wold Group: Ali Mortazavi, Brian Williams, Georgi Marinov, Diane Trout, Brandon King, Ken McCue, Lorian Schaeffer.
Myers Group: Norma Neff, Florencia Pauli, Fan Zhang, Tim Reddy, Rami Rauch, Chris Partridge.
Illumina gene expression group: Gary Schroth, Shujun Luo, Eric Vermaas.
TopHat/Cufflinks development: Cole Trapnell, Lior Pachter, Steven Salzberg.
Data users may freely use ENCODE data, but may not, without prior consent, submit publications
that use an unpublished ENCODE dataset until nine months following the release of the dataset.
This date is listed in the Restricted Until column, above. The full data release policy
for ENCODE is available here.