Intro: sequencing and the data deluge

MCB3895-004 Lecture #21
Nov 20/14
Prokaryote RNAseq
Today:
• Building off last lecture, we will use reference
alignment methods to understand differential
gene expression in prokaryotes
• Use Bowtie2 for alignment
• Use EDGE-pro for determining transcript
abundance
Experiment:
• Compare E. coli K-12 grown in glucose minimal
medium aerobically vs. anaerobically
• Aerobic datasets: SRR922260
• Anaerobic datasets: SRR922265
• All sequenced using Illumina GAIIx, 2x36bp PE
Basic idea of RNAseq
• One way to analyze a transcriptome (i.e., all
the mRNA molecules) is to count the number of
transcripts from each gene
• More transcripts implies more activity of that
gene
• Improvement over the previous technology
(microarrays), which required prior knowledge of
which genes to look for and was less sensitive
Problems:
1. How to compare short genes to long ones?
• Short genes will have fewer reads mapping to them
by random chance
2. How to compare genes from different
transcriptomes with different sampling intensity?
• Transcriptomes sampled more deeply will have more
reads mapping to each gene
RPKM
• "Reads per kilobase per million"
• RPKM normalizes for both gene length and
sampling intensity
• RPKM = [# of reads mapped to the gene] / [length of
transcript in kb] / [total mapped reads in millions]
• Allows genes to be compared to each other
• Allows transcription to be compared between
transcriptomes
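A minimal sketch of the RPKM formula above, in Python. The function name and the read counts are hypothetical and chosen only to show that a short and a long gene expressed at the same level end up with the same RPKM.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    length_kb = gene_length_bp / 1_000                # normalize for gene length
    millions_mapped = total_mapped_reads / 1_000_000  # normalize for sampling depth
    return gene_reads / length_kb / millions_mapped


# Hypothetical numbers: the long gene collects 4x the reads only because it is 4x longer.
short_gene = rpkm(gene_reads=500, gene_length_bp=500, total_mapped_reads=10_000_000)
long_gene = rpkm(gene_reads=2000, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(short_gene, long_gene)  # 100.0 100.0
```

Dividing by the length first and the depth second is arbitrary; the two normalizations are independent and can be applied in either order.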
RNAseq software
• Many packages exist for comparing
transcriptomes
• Most are tailored towards eukaryotes
• Emphasis on finding splice variants (not in bacteria)
• Do not account for overlapping genes (common in
bacteria, rare in eukaryotes)
Generalized scheme for RNAseq
1. Map reads to reference genome
2. Count reads mapping to each gene
3. Normalize for gene length and sampling
depth (i.e., calculate RPKM)
4. Statistically compare test and control sample
sets (a topic in itself, not covered in depth
here)
EDGE-pro
• The software we will use is EDGE-pro
• Installed on server in
/opt/bioinformatics/EDGE_pro_v1.3.1/
• Tailored for prokaryotes
• Magoc et al. (2013) Evolutionary Bioinformatics
9:127-136
• http://ccb.jhu.edu/software/EDGE-pro/
EDGE-pro outline
1. Use Bowtie2 to map reads
2. Calculate per base coverage
3. Assign per gene coverage
4. Disambiguate overlapping genes
5. Calculate RPKM for each gene
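To make steps 2 and 3 of the outline concrete, here is a toy sketch in Python. It is not EDGE-pro's own code: the function names, the 10 bp "genome", and the alignment and gene coordinates are all made up, and overlapping genes (step 4) are deliberately ignored.

```python
def per_base_coverage(genome_length, alignments):
    """Count how many aligned reads cover each base.

    alignments: iterable of (start, end) pairs, 1-based and inclusive.
    """
    coverage = [0] * genome_length
    for start, end in alignments:
        # Clip each alignment to the genome before counting.
        for pos in range(max(start, 1), min(end, genome_length) + 1):
            coverage[pos - 1] += 1
    return coverage


def per_gene_coverage(coverage, genes):
    """Average the per-base coverage over each gene's coordinates."""
    result = {}
    for name, (start, end) in genes.items():
        window = coverage[start - 1:end]
        result[name] = sum(window) / len(window)
    return result


# Hypothetical toy data: four reads on a 10 bp genome with two non-overlapping genes.
alignments = [(1, 4), (3, 6), (5, 8), (9, 10)]
genes = {"geneA": (1, 5), "geneB": (6, 10)}

cov = per_base_coverage(10, alignments)
print(cov)                            # [1, 1, 2, 2, 2, 2, 1, 1, 1, 1]
print(per_gene_coverage(cov, genes))  # {'geneA': 1.6, 'geneB': 1.2}
```

Working from per-base coverage rather than whole-read counts is what makes it possible to split signal between genes that share bases, which is the problem step 4 addresses.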
Running EDGE-pro
• syntax: $ perl
/opt/bioinformatics/EDGE_pro_v1.3.1/edge.pl
-g [.fna name] -p [.ptt name] -r [.rnt name]
-u [.fastq 1 name] -v [.fastq 2 name] -s
/opt/bioinformatics/EDGE_pro_v1.3.1/
• -g: reference .fna file name
• -p: reference .ptt file name
• -r: reference .rnt file name
• -u: .fastq file name to map
• -v: .fastq file pairing with that specified by -u, if it exists
• -s: location where the program lives
• e.g.: $ perl
/opt/bioinformatics/EDGE_pro_v1.3.1/edge.pl
-g NC_000913.fna -p NC_000913.ptt -r
NC_000913.rnt -u SRR922260_1.fastq -v
SRR922260_2.fastq -s
/opt/bioinformatics/EDGE_pro_v1.3.1/
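Since the same command has to be run once per dataset, it can help to assemble it programmatically. The sketch below, in Python, only builds and prints the command line using the flags listed above; the helper name build_edge_command and the assumption that every input file follows the NC_000913.* and SRR*_1/_2.fastq naming pattern are illustrative, not part of EDGE-pro.

```python
import shlex

# Install path as given on the slides; adjust for other servers.
EDGE_DIR = "/opt/bioinformatics/EDGE_pro_v1.3.1/"


def build_edge_command(accession, run, edge_dir=EDGE_DIR):
    """Assemble the edge.pl command line as an argument list (hypothetical helper)."""
    return [
        "perl", edge_dir + "edge.pl",
        "-g", f"{accession}.fna",   # reference genome sequence
        "-p", f"{accession}.ptt",   # protein-coding gene coordinates
        "-r", f"{accession}.rnt",   # RNA gene coordinates
        "-u", f"{run}_1.fastq",     # first read of each pair
        "-v", f"{run}_2.fastq",     # second read of each pair
        "-s", edge_dir,             # location where the program lives
    ]


for run in ("SRR922260", "SRR922265"):
    print(shlex.join(build_edge_command("NC_000913", run)))
```

To actually launch a run, the list can be passed to subprocess.run(cmd, check=True). It is probably safest to run each dataset in its own directory, on the assumption that both runs would otherwise write to the same output file names.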
EDGE-pro: results
• One nice thing about EDGE-pro is that it runs
many scripts all by itself
• A "wrapper" or "pipeline" is something that bundles
different programs together
• Many of the output files are from bowtie2,
some are from EDGE-pro itself
• Note: make sure that you have enough space in
your account for these files
• The RPKM data are located in "out.rpkm_0",
which is a tab-delimited table listing the reads
mapped to each predicted transcript
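A minimal sketch of loading such a table in Python. The column names (gene_ID, RPKM, and the others in the example header) and every value in the example rows are assumptions made for illustration; check the header of your own out.rpkm_0 and pass the real column names if they differ.

```python
import csv
import io


def read_rpkm_table(handle, id_column="gene_ID", rpkm_column="RPKM"):
    """Return {gene ID: RPKM} from a tab-delimited table with a header row.

    The default column names are assumptions, not a documented format.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: float(row[rpkm_column]) for row in reader}


# Made-up stand-in for an out.rpkm_0 file; the numbers are not real results.
example = io.StringIO(
    "gene_ID\tstart_coord\tend_coord\taverage_cov\t#reads\tRPKM\n"
    "b0001\t190\t255\t12.5\t23\t310.4\n"
    "b0002\t337\t2799\t8.1\t554\t201.7\n"
)
table = read_rpkm_table(example)
print(table)  # {'b0001': 310.4, 'b0002': 201.7}
```

For a real file, replace the StringIO object with open("out.rpkm_0", newline="").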
Comparing conditions
• There are many different ways to compare test
and control conditions
• This is outside of the scope of this class
• The RPKM values generated by EDGE-pro can
be reformatted as input for other comparison tools
• EDGE-pro contains a script that will do this for
DESeq, one of the most popular such tools
• Generally multiple replicates should be
considered for each condition
EDGE-pro comparison
• The EDGE-pro paper suggests an easy
heuristic for transcriptome comparison:
1. Average RPKM values from treatment
replicates
2. Determine the RPKM fold change between
test and control treatments using simple
division
3. Only keep results >4-fold different
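The three steps above can be sketched in Python as follows. The function names, gene names, and RPKM values are hypothetical, and the handling of genes with zero RPKM (skipped when both are zero, reported as infinite fold change when only the control is zero) is a choice made for this sketch rather than something the heuristic specifies.

```python
from statistics import mean


def average_replicates(replicates):
    """Step 1: average RPKM per gene across a list of {gene: RPKM} dicts."""
    shared = set.intersection(*(set(r) for r in replicates))
    return {gene: mean(r[gene] for r in replicates) for gene in shared}


def fold_changes(test, control):
    """Step 2: test/control RPKM ratio for every gene present in both."""
    ratios = {}
    for gene in test.keys() & control.keys():
        t, c = test[gene], control[gene]
        if t == 0 and c == 0:
            continue  # no expression in either condition, nothing to compare
        ratios[gene] = float("inf") if c == 0 else t / c
    return ratios


def over_four_fold(test, control, threshold=4.0):
    """Step 3: genes more than `threshold`-fold higher in test, then in control."""
    ratios = fold_changes(test, control)
    up_in_test = sorted(g for g, r in ratios.items() if r > threshold)
    up_in_control = sorted(g for g, r in ratios.items() if r < 1 / threshold)
    return up_in_test, up_in_control


# Made-up RPKM values, one replicate per condition.
aerobic = average_replicates([{"geneA": 400.0, "geneB": 50.0, "geneC": 120.0}])
anaerobic = average_replicates([{"geneA": 80.0, "geneB": 250.0, "geneC": 100.0}])
print(over_four_fold(aerobic, anaerobic))  # (['geneA'], ['geneB'])
```

Without replicates there is no estimate of variability, so a fixed fold-change cutoff like this is a rough filter rather than a statistical test.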
A reference genome quirk:
• EDGE-pro requires the standard .fna
genome file and .ptt and .rnt files that list
gene locations on the chromosome
• Unfortunately, these files are only available from
the old version of the NCBI FTP server
• Location for today:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/
Today's assignment
• Use EDGE-pro to calculate RPKM values for
the E. coli K-12 RNAseq transcriptomes
generated under aerobic (SRR922260) and
anaerobic (SRR922265) conditions
• Write a short Perl script to calculate the
recommended EDGE-pro comparison
• Only one replicate, so no averaging is needed
• Report genes >4-fold overrepresented in the aerobic
treatment
• Report genes >4-fold overrepresented in the anaerobic
treatment
