RNA-seq data analysis Project

Download Report

Transcript RNA-seq data analysis Project

RNA-seq data analysis Project
QI LIU
From reads to differential expression
QC by
FastQC
Raw Sequence Data
FASTQ Files
Reads Mapping
Unspliced Mapping
Spliced mapping
BWA, Bowtie
TopHat, MapSplice
Mapped Reads
Expression Quantification
SAM/BAM Files
Summarize read counts
FPKM/RPKM
Cufflinks
DE testing
DEseq, edgeR, etc
Cuffdiff
List of DE
Functional Interpretation
Function
enrichment
Infer networks
QC by
RNA-SeQC
Integrate with
other data
Biological Insights & hypothesis
Tools
Read alignment:
TopHat2 (http://ccb.jhu.edu/software/tophat/index.shtml)
Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
Tools for manipulating SAM files:
SAMTOOLS (http://samtools.sourceforge.net/samtools.shtml)
Counting reads in gene level:
htseq-count
(http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
Downstream analysis:
The R statistical computing environment (http://www.r-project.org/)
R package: edgeR
(http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
Visualization: IGV (http://www.broadinstitute.org/software/igv/download)
/home/igptest/path.txt
setpkgs -a python
BOWTIE2=/scratch/liuq6/software/bowtie2
TOPHAT2=/scratch/liuq6/software/tophat2
export PYTHONPATH=/scratch/liuq6/software/htseqlib:$PYTHONPATH
export PATH=/home/igptest/exomesequencing/software/:$BOWTIE2:$TOPHAT2:$PATH
transcriptomeindex=/scratch/liuq6/reference/gtfindex/Homo_sapiens.GRCh37.75
genomeindex=/scratch/liuq6/reference/bowtie2_index/hg19
gtffile=/scratch/liuq6/reference/Homo_sapiens.GRCh37.75_chr1-22-X-Y-M.gtf
reference=/home/igptest/exomesequencing/reference/hg19/hg19_chr.fa
VarScan=/home/igptest/exomesequencing/software/VarScan.v2.2.10.jar
Environment Variables
• The PATH is an environment variable. It is a colon delimited
list of directories that your shell searches through when you
enter a command. All executables are kept in different
directories on the Linux and Unix like operating systems.
• PYTHONPATH is an environment variable which you can set to
add additional directories where python will look for modules
and packages
Two ways
1. source path.txt
source is a bash shell built-in command that executes the content of the file
passed as argument.
2. put the scripts in .bashrc
.bashrc is a file from which bash reads and executes command automatically
when you log in.
/home/yourusername/.bashrc
Practice
Before and after you load all the environment
echo $SHELL (the name of the current shell)
env (the existing environment variables)
echo $HOME
echo $PATH
echo $gtffile
Reads alignment
TopHat
1. un-spliced mapping
to transcriptome and
then genome (bowtie)
2. “Contiguously unmappable" reads are used
to predict possible splice junctions.
Tophat2
The basename of the genome
index to be searched
• tophat2 [options]* <genome_index_base>
PE_reads_1.fq.gz PE_reads_2.fq.gz
• Options:
-o/--output-dir <string>
-G/--GTF <GTF/GFF3 file>
Sets the name of the directory in which TopHat will
write all of its output. The default is "./tophat_out".
Supply TopHat with a set of gene model
annotations and/or known transcripts, as a GTF 2.2
or GFF3 formatted file
--transcriptome-index <dir/prefix> use the previously built transcriptome index files
TopHat output
• accepted_hits.bam. A list of read alignments in SAM
format.
• junctions.bed. A UCSC BED track of junctions reported
by TopHat. Each junction consists of two connected
BED blocks, where each block is as long as the maximal
overhang of any read spanning the junction. The score
is the number of alignments spanning the junction.
• insertions.bed and deletions.bed. UCSC BED tracks of
insertions and deletions reported by TopHat.
Practice
• tophat2 --transcriptomeindex=$transcriptomeindex -o ctrl1
$genomeindex
/scratch/igptest/RNAseq/data/ctrl1.fastq
Add --transcriptome-only to save time
• tophat2 --transcriptomeindex=$transcriptomeindex -o ctrl1 -transcriptome-only $genomeindex
/scratch/igptest/RNAseq/data/ctrl1.fastq
SAMTOOLS
• samtools view
• samtools sort
• samtools index
Practice
• View the header section of bam file
samtools view -h accepted_hits.bam
• Extract the alignments with MAPQ>10
samtools view -q 10 accepted_hits.bam
Practice
• Extract the alignments mapped to reverse
strand
samtools view -f 16 accepted_hits.bam
Practice
• Index the alignment file (.bai file)
samtools index accepted_hits.bam
Use samtools to get all the reads mapped to
tp53 and myc. How many reads? Any junction
reads?
– TP53, 17:7,571,720-7,590,868
– MYC, 8:128,748,315-128,753,680
Homework
• Install IGV
https://www.broadinstitute.org/igv/download
• Use IGV to view the bam file (need the bam
and index files)
• Take a look at the reads mapped to genes MYC
and TP53
Homework
• Align ER1.fastq
• How many reads mapped to MYC and TP53?
Are there any junction reads? Visualize the
results in IGV
• Extract the alignments mapped to MYC and
generate a new BAM file
• Extract the alignments with MAPQ>10 and
generate a new BAM file and index the file
Ht-seq
Given a file with aligned sequencing reads and a list of genomic features, a common
task is to count how many reads map to each feature.
Deal with reads that overlap more than one features
Ht-seq
• Options
Ht-seq
• htseq-count [options] <alignment_file> <gff_file>
Output:
a table with counts for each feature, followed by the special counters, which count reads that
were not counted for any feature for various reasons. The names of the special counters all
start with a double underscore, to facilitate filtering.
 __no_feature: reads (or read pairs) which could not be assigned to any feature (set S as described above
was empty).
 __ambiguous: reads (or read pairs) which could have been assigned to more than one feature and hence
were not counted for any of these (set Shad mroe than one element).
 __too_low_aQual: reads (or read pairs) which were skipped due to the –a option
 __not_aligned: reads (or read pairs) in the SAM file without alignment
 __alignment_not_unique: reads (or read pairs) with more than one reported alignment. These reads are
recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned
reads will be counted multiple times, unless they getv filtered out by due to the -a option.)
Practice
• htseq-count -s no -f bam -i gene_name
ctrl1/accepted_hits.bam $gtffile > ctrl1.count
Homework
• Summarize the read counts to the gene level
for ER1
• Summarize the read counts to the exon level
for ctrl1
edgeR
• Install R in your laptop
• Install edgeR package
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
edgeR
• Raw counts
• Small sample size
• Complicated experimental design
Negative binomial distribution
• Technical replicates –Poisson distribution
var(ygi)=𝜇gi
• Biological variances
var(ygi)=𝜇gi+∅𝑔 𝜇gi2
Normalization
• TMM (a trimmed mean of M-values)
minimize the log-fold changes between the
samples for most genes.
Normalization
• RNA-seq measures the relative abundance of
each gene in each RNA sample, but RNA
output per cell
N
T
FPKM: g2-g10
TMM: g1
g1
1
1000
g2
1
1
g3
1
1
g4
1
1
g5
1
1
g6
1
1
g7
1
1
g8
1
1
g9
1
1
g10 1
1
Differential analysis
DGEList
Read Data
preprocess
calcNormFactors
MDS
Normalization
visualize
PCA
estimateGLMCommonDisp
estimateGLMTrendedDisp
Estimate the dispersion
estimateGLMTagwiseDisp
glmFit
glmLRT
Differential expression
Heatmap