RNA sequencing for differential expression genes
SPEAKER : TZU-CHUN LO
ADVISOR : YAO-TING HUANG
Outline
Molecular Central Dogma
RNA Sequencing
Differential Expression Gene
Case–Control Study
Negative Binomial Distribution
Hypothesis Testing
Rice
SNP, QTL, Pathway
Molecular Central Dogma
The central dogma of molecular biology
describes the flow of genetic information
within a biological system.
Forest
Branches
BBQ
RNA Sequencing
[Figure: Gene 1 and Gene 2 on the DNA with their exons; the transcribed RNA (mRNA); reads mapped back by alignment and spliced alignment, giving read counts per gene]
DEG process
Find differentially expressed genes (DEGs)
from the read counts of each gene.
Differential Expression Gene
We want to find the cold-resistant genes in rice.
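[Figure: rice genome with Gene 1, Gene 2, and Gene 3]
We should compare two conditions: room temperature and low temperature.
[Figure: reads mapped to Gene 1, Gene 2, and Gene 3 under each condition]
Read counts per gene in the two conditions:
Gene 1    13    6
Gene 2     4    5
Gene 3     7    2
Cold-resistant differentially expressed genes: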
Strategy for DEG
Case–control study
Two existing groups differing in outcome are identified and
compared on the basis of some supposed causal attribute.
condition   case   control
Gene 1      69     71        69 vs. 71: almost the same?
Gene 2      86     56        86 vs. 56: possible DEG
Gene 3      66     111       66 vs. 111: more likely DEG
Gene 4      80     60        80 vs. 60: how to judge?
…           …      …
Each number is just one sample from its condition.
Question
Is the observed count representative of the gene? Use the negative binomial distribution.
How do we decide that a gene is differentially expressed? Use a hypothesis test.
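A rough way to see why raw counts alone cannot settle the question is to turn each pair into a log fold-change. The sketch below does that for the four genes in the table; the function name `log2_fold_change` and the choice of base 2 are illustrative assumptions, not something the slide specifies.

```python
import math

# Read counts copied from the case-control table (one sample per condition).
counts = {
    "Gene 1": (69, 71),
    "Gene 2": (86, 56),
    "Gene 3": (66, 111),
    "Gene 4": (80, 60),
}

def log2_fold_change(case, control):
    """Return log2(case / control); 0 means the two counts are equal."""
    return math.log2(case / control)

fold_changes = {gene: log2_fold_change(a, b) for gene, (a, b) in counts.items()}

for gene, lfc in fold_changes.items():
    print(f"{gene}: log2 fold change = {lfc:+.3f}")
```

A fold-change by itself says nothing about how much the counts vary between samples, which is why the variance model and the hypothesis test are still needed.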
Negative Binomial Distribution
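The negative binomial (NB) distribution models count data and can replace
the Poisson distribution when the variance of the counts is larger than their mean.
[Formula figure: labels "gene abundance parameter", "library size parameter", and "smooth function"; indices i = 1~n and j = 1~m]
Library size parameter of the case sample, computed from the counts of Gene 1 to Gene 3 (case, control):
\[ \operatorname{median}\left( \frac{69}{\sqrt{69 \times 71}},\ \frac{86}{\sqrt{86 \times 56}},\ \frac{66}{\sqrt{66 \times 111}} \right) = 0.986 \]
The smooth function is more complex, so let us forget it.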
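The library size value 0.986 can be reproduced with a median-of-ratios calculation: each count is divided by the geometric mean of that gene across the samples, and the median over genes is taken per sample. This is a minimal sketch; the helper name `size_factors` is hypothetical, and the use of a geometric mean (a square root for two samples) is inferred from the fact that it yields the slide's 0.986.

```python
import math
from statistics import median

# counts[gene] = (case, control), taken from the slide's example.
counts = {"Gene 1": (69, 71), "Gene 2": (86, 56), "Gene 3": (66, 111)}

def size_factors(counts):
    """Per sample: median over genes of count / geometric mean across samples."""
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        geo_mean = math.prod(row) ** (1.0 / len(row))
        for j, k in enumerate(row):
            ratios[j].append(k / geo_mean)
    return [median(r) for r in ratios]

case_factor, control_factor = size_factors(counts)
print(round(case_factor, 3), round(control_factor, 3))
```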
FPKM
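An indicator used to represent mRNA expression.
Fragments Per Kilobase of transcript per Million
mapped reads.
\[ \mathrm{FPKM} = \frac{\text{reads of gene}}{\text{all mapped reads (millions)} \times \text{exon length of gene (kilobases)}} \]
[Figure: genome with two genes. Gene 1: 10 reads, exon lengths 8, 10, and 7 bases. Gene 2: 4 reads, exon lengths 8 and 9 bases.]
\[ \text{Gene 1} = \frac{10}{\frac{10 + 4}{10^{6}} \times \frac{8 + 10 + 7}{10^{3}}} = 0.029 \times 10^{9} \]
\[ \text{Gene 2} = \frac{4}{\frac{10 + 4}{10^{6}} \times \frac{8 + 9}{10^{3}}} = 0.017 \times 10^{9} \]
FPKM
With K the read count of the gene, M the number of mapped reads, and L the exon length in bases, FPKM = 10^9 K / (M L), so:
\[ \mathrm{Var}[\mathrm{FPKM}] = \left( \frac{10^{9}}{M \times L} \right)^{2} \mathrm{Var}[K] \]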
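The FPKM arithmetic of the two-gene example can be checked with a few lines of code. This is a sketch under the assumption that M is counted in reads and L in bases, which is what the factor 10^9 implies; the function names `fpkm` and `fpkm_variance` are ours.

```python
def fpkm(reads, total_mapped_reads, exon_length_bases):
    """FPKM = 1e9 * K / (M * L), with M in reads and L in bases."""
    return 1e9 * reads / (total_mapped_reads * exon_length_bases)

def fpkm_variance(var_reads, total_mapped_reads, exon_length_bases):
    """Var[FPKM] = (1e9 / (M * L))**2 * Var[K]."""
    scale = 1e9 / (total_mapped_reads * exon_length_bases)
    return scale ** 2 * var_reads

# Toy example from the slide: 10 + 4 mapped reads in total.
total = 10 + 4
gene1 = fpkm(10, total, 8 + 10 + 7)  # exons of 8, 10, and 7 bases
gene2 = fpkm(4, total, 8 + 9)        # exons of 8 and 9 bases

print(f"Gene 1: {gene1:.3e}")
print(f"Gene 2: {gene2:.3e}")
```

The values are huge only because the toy genome has 14 reads and exons of a few bases; real libraries have millions of reads and kilobase-scale exons.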
Before hypothesis testing, we have to get FPKM and
variance of FPKM.
          K (reads)          FPKM
          case   control     case    control
Gene 1    69     71          9.34    14.75
Gene 2    86     56          22.31   15.37
Gene 3    66     111         40.48   53.98
…         …      …           …       …
          Var(K)             Var(FPKM)
          case   control     case    control
Gene 1    10     6           6       3.6
Gene 2    170    166         136     132.8
Gene 3    362    310         120.6   109.3
…         …      …           …       …
Hypothesis Testing
Step 1 : You find some observations or clues that support
a novel idea.
Step 2 : Assume the opposing claim, the one you want to
argue against.
Step 3 : Test it and take a stand.
p-value
T-test
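Use a t-test on the log ratio (log fold-change)
of a gene's expression between conditions (a) and (b).
\[ Y = \frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}, \qquad \log Y = \log \frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b} \]
If FPKM_a = FPKM_b, then Y = 1 and log(Y) = 0.
\[ H_0 : \mu = 0, \qquad H_1 : \mu \neq 0, \qquad \text{assume that } H_0 \text{ is true.} \]
\[ T = \frac{E[\log Y] - \mu}{\sqrt{\mathrm{Var}[\log Y]}} = \frac{E[\log Y]}{\sqrt{\mathrm{Var}[\log Y]}} \approx \frac{\log \frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}}{\sqrt{\frac{\mathrm{Var}[\mathrm{FPKM}_a]}{\mathrm{FPKM}_a^{2}} + \frac{\mathrm{Var}[\mathrm{FPKM}_b]}{\mathrm{FPKM}_b^{2}}}} \]
The two variance terms add because the FPKM estimates of the two conditions are treated as independent.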
T-test
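\[ T \approx \frac{\log \frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}}{\sqrt{\frac{\mathrm{Var}[\mathrm{FPKM}_a]}{\mathrm{FPKM}_a^{2}} + \frac{\mathrm{Var}[\mathrm{FPKM}_b]}{\mathrm{FPKM}_b^{2}}}} \Rightarrow p\text{-value} \]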
          FPKM               Var(FPKM)
          case    control    case    control
Gene 1    9.34    14.75      6       3.6
Gene 2    22.31   15.37      136     132.8
Gene 3    40.48   53.98      120.6   109.3
…         …       …          …       …
T-test    p-value
Gene 1    0.187
Gene 2    0.039
Gene 3    0.014
…         …
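The test statistic can be sketched in a few lines. Two assumptions are made here that the slides do not state: the variance of the log ratio is taken as the sum of the two relative variances (the usual delta-method form for independent estimates), and the p-value uses a two-sided normal approximation because the degrees of freedom are not given. The p-values it produces are therefore not expected to match the ones listed on the slide.

```python
import math

def log_ratio_test(fpkm_a, fpkm_b, var_a, var_b):
    """Return (T, p) for H0: log(FPKM_a / FPKM_b) = 0."""
    log_ratio = math.log(fpkm_a / fpkm_b)
    # Delta-method variance of the log ratio, assuming independent estimates.
    var_log = var_a / fpkm_a ** 2 + var_b / fpkm_b ** 2
    t = log_ratio / math.sqrt(var_log)
    # Two-sided tail of a standard normal (an approximation, see lead-in).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

# Gene 1 from the tables: FPKM 9.34 vs. 14.75, Var(FPKM) 6 vs. 3.6.
t1, p1 = log_ratio_test(9.34, 14.75, 6.0, 3.6)
print(f"Gene 1: T = {t1:.3f}, p = {p1:.3f}")

# Equal expression gives T = 0 and p = 1.
t0, p0 = log_ratio_test(5.0, 5.0, 1.0, 1.0)
```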
Result Investigating
Discuss the read counts and p-values at alpha = 0.05.
If alpha = 0.05 (V: called a DEG, X: not called):
          case   control   p-value   result
Gene 1    69     71        0.187     X
Gene 2    86     56        0.039     V
Gene 3    66     111       0.016     V
Gene 4    80     60        0.045     V
What if alpha = 0.04 or 0.03?
We don't know which alpha is the best,
but we can do some subsequent processing.
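How the set of called genes shrinks as alpha gets stricter can be shown directly from the table's p-values. The helper name `call_degs` is hypothetical.

```python
# p-values copied from the result table.
p_values = {"Gene 1": 0.187, "Gene 2": 0.039, "Gene 3": 0.016, "Gene 4": 0.045}

def call_degs(p_values, alpha):
    """Return the genes whose p-value is below the significance level alpha."""
    return [gene for gene, p in p_values.items() if p < alpha]

calls = {alpha: call_degs(p_values, alpha) for alpha in (0.05, 0.04, 0.03)}

for alpha, genes in calls.items():
    print(f"alpha = {alpha}: {', '.join(genes)}")
```

Gene 4 is lost at alpha = 0.04 and Gene 2 at alpha = 0.03, which is the sensitivity to the threshold that the slide is pointing at.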
RNA sequencing for Rice
Plan
Cold-resistant genes
Samples
Japonica (TN67): room temperature (R), low temperature (L)
Indica (IR64): room temperature (R), low temperature (L)
Rice
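Japonica rice (TN67) : grains are broad and short, stickier, and chewy, e.g. Penglai rice.
Indica rice (IR64) : grains are slender and long, less sticky, and brittle, e.g. Zailai rice.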
Zone
TN67 : High-latitude, or high altitude
IR64 : Low-latitude, or low altitude
Strategy for DEG
Case–control study
Four combinations
Different varieties or distinct temperatures
Four sets of differentially expressed genes
The DEGs from the combinations above (A, B, C, D)
Negative binomial
Infer the distribution of the counts from the samples
Hypothesis test
Decide which genes are the DEGs we want
Subsequent processing
SNP, QTL, Pathway
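[Figure: the four samples TN67R, IR64R, TN67L, and IR64L arranged in a square, with the four pairwise comparisons between neighbouring samples labeled A, B, C, and D]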
SNP
A single-nucleotide polymorphism is a
sequence variation occurring when a single
nucleotide differs between members of a biological
species.
Case
ATGCCCTCGTAA
TTACTGCGT
ATGCGCTCGAAA
TTACTCCGT
Control
Assembly
SNP
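Once the sequences are assembled and aligned, finding SNP candidates amounts to comparing them position by position. The sketch below pairs the two 12-base sequences and the two 9-base sequences from the figure; which sequence belongs to the case and which to the control is an assumption, and `snp_positions` is a hypothetical helper.

```python
def snp_positions(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]

first = snp_positions("ATGCCCTCGTAA", "ATGCGCTCGAAA")
second = snp_positions("TTACTGCGT", "TTACTCCGT")

print(first)
print(second)
```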
QTL
Quantitative traits refer to phenotypes (characteristics)
that vary in degree and can be attributed
to polygenic effects (product of two or more genes)
Quantitative trait loci (QTLs) are stretches of DNA
containing or linked to the genes that underlie a
quantitative trait.
Ex : QT (cold) loci : bases 599~799
[Figure: DNA axis from base 1 to base 1000 with the genes and the QTL region marked]
Cold tolerance (29) & pollen fertility (43)
QTL length : ~million bases
Pathway
Pathway is a collection of manually drawn pathway
maps representing molecular interaction and
reaction networks.
Rice
Gene No.2
Gene No.55
Gene No.99
Cold-resistant
Conclusion
Review
RNA Sequencing
Differential Expression Gene
Case–Control Study
Negative Binomial Distribution
Hypothesis Testing
Rice
SNP
QTL
Pathway
Variance of negative binomial
The negative binomial (NB) distribution models count data and can replace
the Poisson distribution when the variance of the counts is larger than their mean.
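The difference between the two variance models can be written out as follows. The dispersion symbol \alpha is a notational assumption; the slides do not name it.

```latex
\[
\text{Poisson: } \mathrm{Var}[K] = \mu,
\qquad
\text{NB: } \mathrm{Var}[K] = \mu + \alpha \mu^{2}, \quad \alpha > 0 .
\]
```

With \alpha close to 0 the NB variance approaches the Poisson variance, and any positive \alpha lets the variance exceed the mean, as replicate read counts usually do.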
Strategy for DEG
Case-control in the same temperature : A, C
Case-control in the same variety : B, D
Let T be the set of all genes.
\[ A \cap C = X \]
\[ A \cap (T - C) = Y, \qquad (T - A) \cap C = Z \]
\[ B \cap D = O \]
\[ B \cap (T - D) = P, \qquad (T - B) \cap D = Q \]
\[ \mathrm{result} = \{X, Y, Z, O, P, Q\} \]
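These set operations map directly onto set intersection and difference. The sketch below uses made-up gene IDs purely for illustration; since A, B, C, and D are all subsets of T, A ∩ (T − C) is simply A − C.

```python
def partition(first, second):
    """Return (in both, only in first, only in second)."""
    return first & second, first - second, second - first

# Hypothetical DEG sets, for illustration only (not data from the slides).
A = {"g1", "g2", "g3"}   # same temperature (room)
C = {"g2", "g3", "g4"}   # same temperature (low)
B = {"g1", "g5"}         # same variety
D = {"g5", "g6"}         # same variety

X, Y, Z = partition(A, C)
O, P, Q = partition(B, D)

result = {"X": X, "Y": Y, "Z": Z, "O": O, "P": P, "Q": Q}
for name, genes in result.items():
    print(name, sorted(genes))
```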
QTL
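Another class of biological traits includes human height, weight, hypertension,
and diabetes; plant height, yield, and degree of disease resistance in rice;
body-fat percentage in mice; milk yield in dairy cows; and egg production in chickens.
Because their variation is continuous, hard to classify, and easily influenced by
the environment, they are called quantitative traits. A quantitative trait is
controlled by multiple genes; since every gene affects the trait, the effect of
each single gene is relatively small. The genes that control quantitative traits
are called polygenes (minor-effect genes), or quantitative trait loci (QTL).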
Rice genome size 430Mb
QTL
Negative binomial distribution
The NB is a count-data distribution that lets us infer the expected
count of a gene from the samples.
[Formula figure: indices i and j, and the smooth function]
RNA sequencing
[Figure: Gene 1 and Gene 2 on the DNA with their exons, the mRNA, and reads mapped by alignment and spliced alignment]
Reads should be aligned to the regions shown in blue above.
RNA sequencing
Spliced alignment
TopHat
          Condition 1 : case        Condition 2 : control
Sample    1     2     3     …       1     2     3     …
Gene 1    75    69    70    …       73    71    68    …
Gene 2    101   86    75    …       31    56    49    …
Gene 3    28    66    45    …       120   111   145   …
…         …     …     …     …       …     …     …     …
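With several samples per condition, the mean and the spread of each gene's counts can be estimated directly. The sketch below computes the sample mean and the sample variance (n − 1 denominator) for the three samples shown per condition; truncated to integers, these variances come out as 10, 6, 170, 166, 362, and 310, the same numbers as the Var(K) table earlier in the deck. Using only the three listed samples (the "…" columns are ignored) is an assumption.

```python
from statistics import mean, variance

# Read counts of samples 1 to 3 per condition, copied from the table.
samples = {
    "Gene 1": {"case": [75, 69, 70], "control": [73, 71, 68]},
    "Gene 2": {"case": [101, 86, 75], "control": [31, 56, 49]},
    "Gene 3": {"case": [28, 66, 45], "control": [120, 111, 145]},
}

# summary[gene][condition] = (sample mean, sample variance)
summary = {
    gene: {cond: (mean(vals), variance(vals)) for cond, vals in conds.items()}
    for gene, conds in samples.items()
}

for gene, conds in summary.items():
    for cond, (m, v) in conds.items():
        print(f"{gene} {cond}: mean = {m:.2f}, variance = {v:.2f}")
```

For Gene 2 and Gene 3 the variance is several times the mean, which a Poisson model cannot represent.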
          Reads              Variance
          case   control     case    control
Gene 1    69     71          69      71
Gene 2    86     56          86      56
Gene 3    66     111         66      111
…         …      …           …       …