dplasso - Center for Science of Information


Big Challenges with Big Data in Life Sciences
Shankar Subramaniam
UC San Diego
The Digital Human
A Super-Moore’s Law
Adapted from Lincoln Stein 2012
The Phenotypic Readout
Data to Networks to Biology
NETWORK RECONSTRUCTION
• Data-driven network reconstruction of biological systems
  – Derive relationships between input/output data
  – Represent the relationships as a network
Experiments/Measurements
Inverse Problem: Data-driven Network Reconstruction
Network Reconstructions
Reverse Engineering of Biological Networks
• Reverse engineering of biological networks:
  – Structural identification: to ascertain the network structure or topology.
  – Identification of dynamics: to determine the interaction details.
• Main approaches:
  – Statistical methods
  – Simulation methods
  – Optimization methods
  – Regression techniques
  – Clustering
Network Reconstruction of Dynamic Biological Systems: Doubly Penalized LASSO
Behrang Asadi*, Mano R. Maurya*,
Daniel Tartakovsky, Shankar Subramaniam
Department of Bioengineering
University of California, San Diego
• NSF grants (STC-0939370, DBI-0641037 and DBI-0835541)
• NIH grant 5 R33 HL087375-02
* Equal effort
APPLICATION
Phosphoprotein signaling and cytokine measurements in RAW 264.7 macrophage cells.
MOTIVATION FOR THE NOVEL METHOD
• Various methods:
  – Regression-based approaches (least squares) with statistical significance testing of coefficients
  – Dimensionality reduction to handle correlation: PCR and PLS
  – Optimization/shrinkage (penalty)-based approach: LASSO
  – Partial-correlation and probabilistic model/Bayesian-based approaches
• Different methods have distinct advantages and disadvantages
  ‒ Can we benefit by combining the methods?
  ‒ Compensate for the disadvantages
• A novel method: Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO)
  ‒ Incorporates both statistical significance testing and shrinkage
LINEAR REGRESSION
Goal: Building a linear-relationship based model
y  Xb  e ; e ~ N (0,  )
X: input data (m samples by n inputs), zero mean, unit standard deviation
y: output data (m samples by 1 output column), zero-mean
b: model coefficients: translates into the edges in the network
e: normal random noise with zero mean
Ordinary Least Squares solution:
bˆ  argmin{e2  (y - Xb)T ( y - Xb)}
bˆ  (XT X)-1 XT y
Formulation for dynamic systems:
$y = \frac{dX}{dt} \approx \frac{\Delta X}{\Delta t} = Xb + e$,  $e(t) \sim N(0, \sigma)$
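As a concrete illustration (not from the slides), a minimal NumPy sketch of the OLS solution above on synthetic, standardized data; the dimensions, coefficients, and noise level are invented:

```python
# Minimal sketch of the OLS solution above, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5                                   # m samples, n inputs
X = rng.standard_normal((m, n))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # zero mean, unit std, as required
b_true = np.array([1.5, 0.0, -2.0, 0.0, 0.7])
y = X @ b_true + 0.1 * rng.standard_normal(m)
y = y - y.mean()                                # zero-mean output

# Ordinary least squares: b_hat = (X^T X)^{-1} X^T y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)
```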
STATISTICAL SIGNIFICANCE TESTING
• Most coefficients non-zero, a mathematical artifact
• Perform statistical significance testing
• Compute the standard deviation on the coefficients
$\mathrm{cov}(\hat{b}) = \hat{\sigma}^2 (X^T X)^{-1}$, with $\hat{\sigma}^2 = (y - \hat{y})^T (y - \hat{y}) / v$
For least squares: $\sigma_{b,\mathrm{LS}} = \mathrm{diag}((X^T X)^{-1})^{1/2} \cdot \mathrm{RMSE}_{\mathrm{LS}} \cdot \sqrt{m/v}$, with $v = m - n - 1$
$\mathrm{RMSE}_{\mathrm{LS}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - y_{i,p})^2} = \mathrm{std}(y_i - y_{i,p}) \cdot \sqrt{(m-1)/m}$
• Ratio: $r_{ij,k} = \hat{b}_{ij,k} / \sigma_{b,ij,k}$
• A coefficient is significant (different from zero) if $|r_{ij}| > \mathrm{tinv}(1 - \alpha/2, v)$, where $v$ = DOF and $\alpha$ = 1 − confidence level.
• Edges in the network graph represent the coefficients.
* Krämer, Nicole, and Masashi Sugiyama. "The degrees of freedom of partial least squares regression." Journal of the American Statistical Association 106.494 (2011): 697-705.
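A hedged sketch of this significance test, assuming the least-squares formulas above; MATLAB's tinv is played here by scipy's stats.t.ppf, and the data interface is invented:

```python
# Compute each OLS coefficient's standard deviation, form the ratio
# r = b / sigma_b, and keep coefficients whose |r| exceeds the
# t-distribution quantile tinv(1 - alpha/2, v).
import numpy as np
from scipy import stats

def significant_coefficients(X, y, alpha=0.05):
    m, n = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    v = m - n - 1                                   # degrees of freedom
    rmse = np.sqrt(np.sum(resid**2) / m)
    sigma_b = np.sqrt(np.diag(np.linalg.inv(X.T @ X))) * rmse * np.sqrt(m / v)
    r = b / sigma_b
    threshold = stats.t.ppf(1 - alpha / 2, v)       # tinv(1 - alpha/2, v)
    return b, np.abs(r) > threshold                 # coefficients + significance mask
```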
CORRELATED INPUTS: PLS
• Partial least squares finds directions in the X space that explain the maximum-variance directions in the Y space
$X = TP^T + E$
$Y = UQ^T + F$
$\hat{Y} = X\hat{B} + \hat{B}_0$
• PLS regression is used when the number of observations per
variable is low and/or collinearity exists among X values
• Requires an iterative algorithm: NIPALS, SIMPLS, etc.
• Statistical significance testing is iterative
* H. Wold (1975), "Soft modelling by latent variables: the non-linear iterative partial least squares approach," in Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, J. Gani, ed., Academic Press, London.
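For illustration, a small sketch with scikit-learn's PLSRegression (NIPALS-based) on synthetic, deliberately collinear inputs; the data and dimensions are invented:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(50)   # nearly collinear inputs
y = X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.standard_normal(50)

pls = PLSRegression(n_components=3)
pls.fit(X, y)
print(pls.coef_.ravel())     # regression coefficients mapped back to X space
```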
LASSO
• Shrinkage version of the Ordinary Least Squares, subject to
L-1 penalty constraint (the sum of the absolute value of the
coefficients should be less than a threshold)
The LASSO estimator is then defined as:
$(\hat{b}_0, \hat{b}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - b_0 - \sum_j b_j x_{ij} \Big)^2 \right\}$  (cost function)
subject to the L-1 constraint $\sum_j |b_j| \le t \sum_j |\hat{b}^0_j|$
• where $\hat{b}^0$ represents the full least-squares estimates
• 0 < t < 1: causes the shrinkage
* Tibshirani, R.: ‘Regression shrinkage and selection via the Lasso’, J. Roy. Stat. Soc. B Met., 1996, 58, (1), pp. 267–288
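A short sketch of LASSO with scikit-learn; note sklearn's Lasso solves the penalized (Lagrangian) form with parameter alpha rather than the explicit constraint with threshold t, though the two forms are equivalent for some alpha-t correspondence. The data here is synthetic:

```python
from sklearn.linear_model import Lasso
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 22))
b_true = np.zeros(22)
b_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]        # most true coefficients are zero
y = X @ b_true + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.nonzero(lasso.coef_)[0])               # indices of retained predictors
```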
Noise and Missing Data
– A more systematic comparison is needed with respect to:
1. Noise: level, type
2. Size (dimension)
3. Level of missing data
4. Collinearity or dependency among input channels
5. Missing data
6. Nonlinearity between inputs/outputs and nonlinear dependency
7. Time-series inputs(/outputs) and dynamic structure
METHODS
• Linear Matrix Inequalities (LMI)*
  – Converts a nonlinear optimization problem into a linear optimization problem:
    $\min_B e$  s.t.  $(Y - XB)(Y - XB)^T \preceq e I_{m \times m}$
  – Congruence transformation:
    $\begin{bmatrix} e I_{m \times m} & Y - X\hat{b} \\ (Y - X\hat{b})^T & I_{p \times p} \end{bmatrix} \succeq 0$
  – Pre-existing knowledge of the system (e.g. $a_{13} = 0$, $a_{21} > 0$) can be added in the form of LMI constraints:
    $v_i^T B u_j + u_j^T B^T v_i \le (\ge)\ 0$
    where $v_i$ and $u_i$ are indicator vectors: $v_r = 0$ for $r \ne i$, $v_r = 1$ for $r = i$ (and similarly for $u_i$).
  – Threshold the coefficients: $\tilde{b}_{ij} = \hat{b}_{ij} / (\|\hat{b}_{i\cdot}\|_2 \, \|\hat{b}_{\cdot j}\|_2)$
* [Cosentino, C., et al., IET Systems Biology, 2007. 1(3): pp. 164-173]
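A hedged sketch of how such constrained estimation can be posed as a convex program with CVXPY (not from the slides): for a single output, minimizing e subject to (Y − Xb)(Y − Xb)ᵀ ⪯ eI reduces to constrained least squares, and prior structural knowledge enters as linear constraints, mirroring the LMI-constraint mechanism above. Data and constraints are invented:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, n = 30, 5
X = rng.standard_normal((m, n))
Y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.05 * rng.standard_normal(m)

b = cp.Variable(n)
constraints = [b[1] == 0,        # prior knowledge: this edge is absent
               b[3] >= 0]        # prior knowledge: positive interaction
prob = cp.Problem(cp.Minimize(cp.sum_squares(Y - X @ b)), constraints)
prob.solve()
print(np.round(b.value, 3))
```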
METRICS
• Metrics for comparing the methods
o Reconstruction from 80% of the dataset and validation on the remaining 20%
o RMSE on the test set, and the number and identity of the significant predictors, as the basic metrics to evaluate the performance of each method
1. Fractional error in estimating the parameters:
$b_{\mathrm{frac},j} = \mathrm{mean}\left( \left| 1 - \frac{b_{\mathrm{method},j}}{b_{\mathrm{true},j}} \right| \right)$
(Parameters smaller than 10% of the standard deviation of all parameter values were set to 0 when generating the synthetic data.)
2. Sensitivity, specificity, G, accuracy:
$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$
$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$
$\mathrm{Specificity} = \frac{TN}{TN + FP}$
(TP: true positive; FP: false positive; TN: true negative; FN: false negative)
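A minimal helper implementing these metrics, with an edge counted as a "positive"; the counts in the usage line are invented:

```python
def classification_metrics(tp, fp, tn, fn):
    return {
        "accuracy":    (tn + tp) / (tn + tp + fn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # G is the geometric mean of sensitivity and specificity
        "G": ((tp / (tp + fn)) * (tn / (tn + fp))) ** 0.5,
    }

print(classification_metrics(tp=8, fp=2, tn=9, fn=3))
```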
RESULTS: DATA SETS
• Two data sets for benchmarking:
1. First set: experimental data measured on macrophage cells (phosphoprotein (PP) vs. cytokine)*
2. Second set: synthetic data generated in MATLAB. We build the model using 80% of the data set (the training set) and use the rest to validate the model (the test set).
* [Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11].
RESULTS: PP-Cytokine Data Set
• Schematic representation of phosphoprotein (PP) vs. cytokine signaling:
  – Signals were transmitted through 22 recorded signaling proteins and other (unmeasured) pathways.
  – Only measured pathways contributed to the analysis.
Schematic graphs from:
[Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11].
PP-CYTOKINE DATASET
Measurements of phosphoproteins and cytokines in response to LPS (courtesy: AfCS); ~250 such datasets.
RESULTS: COMPARISON
• Comparison on synthetic noisy data
  – The methods are applied to synthetic data with 22 inputs and 1 output. About one third of the true input coefficients are set to zero to test whether the methods identify them as insignificant.
• Effect of noise level
  – Four outputs with 5, 10, 20 and 40% noise levels, respectively, are generated from the noise-free (true) output.
• Effect of noise type
  – Three outputs with white, t-distributed, and uniform noise types, respectively, are generated from the noise-free (true) output.
RESULTS: COMPARISON
• Variability between realizations of data with white noise
  – Three methods (PCR, LASSO, and LMI) are used to identify significant predictors for 1000 input-output pairs.
  – Histograms of the coefficients of the three significant predictors common to the three methods:
Mean and standard deviation in the histograms of the coefficients computed with PCR, LASSO, and LMI:

Method       Statistic             Pred. 1    Pred. 10   Pred. 11
True value                          -3.40       5.82      -6.95
PCR          Mean                   -3.81       4.73      -6.06
             Std.                    0.33       0.32       0.32
             Frac. err. in mean      0.12       0.19       0.13
LASSO        Mean                   -2.82       4.48      -5.62
             Std.                    0.34       0.32       0.33
             Frac. err. in mean      0.17       0.23       0.19
LMI          Mean                   -3.70       4.74      -6.34
             Std.                    0.34       0.32       0.34
             Frac. err. in mean      0.09       0.18       0.09
RESULTS: COMPARISON
• Comparison of the outcomes of the different methods on the real data
  – The methods identified partially common and partially distinct sets of significant predictors for each output.
• Only the PCR method detects the true input cAMP.
• Zone I (the predictors common to all methods) provides validation and highlights the common output of all the methods.
[Figure: graphical illustration of the PCR, LASSO, and LMI methods in detection of significant predictors for output IL-6 in the PP/cytokine experimental dataset.]
RESULTS: SUMMARY
• Comparison with respect to different noise types:
  – LASSO is the most robust method across noise types.
• Missing data (RMSE):
  – LASSO shows less deviation, i.e., it is more robust.
• Collinearity:
  – PCR shows less deviation against noise level, and better accuracy and G with increasing noise level.
A COMPARISON
(Asadi, et al., 2012)
Criteria (higher score is better)                                    PCR     LASSO   LMI

Increasing noise:
  RMSE                                                               0.68*   0.56    0.94
    Score = (average RMSE across noise levels for LS) / (average RMSE across noise levels for the chosen method)
    * degrades gradually with the level of noise
  Standard deviation and error in mean of coefficients               0.53    0.47    0.55
    Score = 1 − average(fractional error in mean(10,12,20) + std(10,12,20)/|true associated coefficients|)
  Acc./G (white noise)                                               0.70    0.87    0.91
    Score = average accuracy across noise levels for the chosen method; at high noise all are similar
  Fractional error in estimating the parameters (white noise)        0.81    0.55    0.78
    Score = 1 − average fractional error in estimating the coefficients across noise levels

Types of noise:
  Fractional error in estimating the parameters (20% noise level)    0.80    0.56    0.79
    Score = 1 − average fractional error across noise levels and noise types
  Accuracy and G                                                     0.71    0.87    0.91
    Score = average accuracy across noise levels and noise types

Dimension ratio / size:
  Fractional error in estimating the parameters                      0.77    0.53    0.75
    Score = 1 − average fractional error across noise levels and ratios (m/n = 100/25, 100/50, 400/100)
  Accuracy and G                                                     0.66    0.83    0.90
    Score = average accuracy across white-noise levels and ratios (m/n = 100/25, 100/50, 400/100)
Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO)
OUR APPROACH: DPLASSO
[Schematic: Layer 1 (PLS with statistical significance testing) maps the full coefficient set B = (b1, b2, b3, b4, ...) to a binary weight vector W = (0, 1, 0, 1, ...); Layer 2 (LASSO with these weights) shrinks the coefficients, and the surviving coefficients (e.g. b1, b3, b5, b6, b7, ...) define the model $y = X\hat{b} + \varepsilon$ and the reconstructed network.]
DPLASSO WORK FLOW
• Our approach, DPLASSO, includes two parameter-selection layers:
• Layer 1 (supervisory layer):
  – Partial Least Squares (PLS)
  – Statistical significance testing
• Layer 2 (lower layer):
  – LASSO, with extra weights on the less informative model parameters identified in layer 1
  – Retain significant predictors and set the remaining small coefficients to zero
$\hat{b}_j = \arg\min \{ (y_j - Xb_j)^T (y_j - Xb_j) \}$
subject to $\sum_{i=1,\dots,p} w_{ij} |\hat{b}_{ij}| \le t \sum_{i=1,\dots,p} w_{ij} |\hat{b}^{LS}_{ij}|$
with $w_{ij} = 0$ if $b_{ij}$ is PLS-significant, and $w_{ij} = 1$ otherwise.
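A hedged sketch of this two-layer idea with binary weights; `pls_significant` stands in for the layer-1 output, however it was obtained, and CVXPY is used because sklearn's Lasso does not expose per-coefficient constraint weights:

```python
import cvxpy as cp
import numpy as np

def dplasso_binary(X, y, pls_significant, t):
    """Weighted LASSO: only coefficients NOT flagged by PLS are penalized."""
    n = X.shape[1]
    b_ls = np.linalg.lstsq(X, y, rcond=None)[0]       # full least-squares estimate
    w = np.where(pls_significant, 0.0, 1.0)           # 0 if significant, 1 otherwise
    b = cp.Variable(n)
    budget = t * np.sum(w * np.abs(b_ls))             # t * sum_i w_i |b_i^LS|
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b)),
                      [cp.sum(cp.multiply(w, cp.abs(b))) <= budget])
    prob.solve()
    return b.value
```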
DPLASSO: EXTENDED VERSION
• Smooth weights:
• Layer 1:
  – Continuous significance score η (versus binary): $\eta_i = r_i - \mathrm{tinv}(1 - \alpha/2, v)$, with $v = \mathrm{DOF}_{PLS}$ and $\alpha$ = 1 − confidence level.
  – Mapping function (logistic significance score), with tuning parameter γ:
    $s_i(\eta_i) = \frac{1}{1 + e^{-\gamma \eta_i}}$
  [Figure: significance score s(η) and weight function w(η) plotted against η.]
• Layer 2:
  – Continuous weight vector (versus the binary weight vector): $w_i = 1 - s_i(\eta_i)$, so that
    significant coefficients: $\eta_i > 0 \Rightarrow 0.5 < s_i(\eta_i) < 1 \Rightarrow 0 < w_i < 0.5$
    insignificant coefficients: $\eta_i < 0 \Rightarrow 0 < s_i(\eta_i) < 0.5 \Rightarrow 0.5 < w_i < 1$
  – $\hat{b}_j = \arg\min \{ (y_j - Xb_j)^T (y_j - Xb_j) \}$  s.t.  $\sum_{i=1,\dots,p} w_i |\hat{b}_{ij}| \le t \sum_{i=1,\dots,p} w_i |\hat{b}^{LS}_{ij}|$
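A small sketch of this smooth-weight construction, assuming the significance ratios r and the PLS degrees of freedom v are already available:

```python
import numpy as np
from scipy import stats

def dplasso_weights(r, v, alpha=0.05, gamma=1.0):
    eta = np.abs(r) - stats.t.ppf(1 - alpha / 2, v)   # eta_i = r_i - tinv(1-alpha/2, v)
    s = 1.0 / (1.0 + np.exp(-gamma * eta))            # logistic significance score
    # significant (eta > 0) -> w < 0.5; insignificant (eta < 0) -> w > 0.5
    return 1.0 - s
```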
APPLICATIONS
1. Synthetic (random) networks: Datasets
generated in Matlab
2. Biological dataset: Saccharomyces
cerevisiae - cell cycle model
SYNTHETIC (RANDOM) NETWORKS
Datasets generated in MATLAB using:
$y = \frac{dX}{dt} \approx \frac{\Delta X}{\Delta t} = Xb + e$,  $e(t) \sim N(0, \sigma)$
• Linear dynamic system
• Dominant poles/eigenvalues (λ) in the range [−2, 0]
• Lyapunov stable
  – Informal definition from Wikipedia: if all solutions of the dynamical system that start out near an equilibrium point x_e stay near x_e forever, then the system is "Lyapunov stable".
• Zero-input/excited-state release condition
• 5% measurement (white) noise
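An illustrative sketch of this kind of data generation; the network matrix, step size, and stabilizing eigenvalue shift are invented for the example:

```python
# Simulate a stable linear system dx/dt = A x, then form the regression
# output y = dX/dt, approximated by finite differences, with white noise.
import numpy as np

rng = np.random.default_rng(4)
n, T, dt = 20, 100, 0.05
A = rng.standard_normal((n, n)) / np.sqrt(n)
A -= (np.max(np.linalg.eigvals(A).real) + 0.1) * np.eye(n)   # shift: all eigs < 0

X = np.zeros((T, n))
X[0] = rng.standard_normal(n)                 # nonzero initial ("excited") state
for k in range(T - 1):
    X[k + 1] = X[k] + dt * (X[k] @ A.T)       # Euler step of dx/dt = A x
X += 0.05 * X.std() * rng.standard_normal(X.shape)   # ~5% measurement noise
Y = np.diff(X, axis=0) / dt                   # y = dX/dt ~ delta X / delta t
```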
METRICS
• Two metrics to evaluate the performance of DPLASSO:
1. Sensitivity, specificity, G (the geometric mean of sensitivity and specificity), and accuracy:
$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$, $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$, $\mathrm{Specificity} = \frac{TN}{TN + FP}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$
(TP: true positive; FP: false positive; TN: true negative; FN: false negative)
2. The root-mean-squared error (RMSE) of prediction:
$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - y_{i,p})^2}$
TUNING
• Tuning the shrinkage parameter for DPLASSO:
  – The shrinkage parameter at the LASSO level (threshold t) is tuned via k-fold cross-validation (k = 10) on the associated dataset.
  – Rule of thumb after cross-validation: the optimal value of the tuning parameter roughly equals the network connectivity. Example: for a network with 65% connectivity, t ≈ 0.65.
[Figure: validation error versus selection threshold t for DPLASSO on the synthetic data set.]
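For illustration, the same kind of 10-fold cross-validated tuning with sklearn's LassoCV, which searches the penalized form's alpha rather than the constraint threshold t; the procedure is otherwise analogous, and the data is synthetic:

```python
from sklearn.linear_model import LassoCV
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 22))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)

model = LassoCV(cv=10).fit(X, y)
print(model.alpha_)            # shrinkage level minimizing validation error
```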
PERFORMANCE COMPARISON: ACCURACY
Network size 20, MC 10, noise 5%
[Figure: accuracy surfaces for LASSO, DPLASSO, and PLS over the two tuning parameters, at network densities of 50%, 20%, 10%, and 5%.]
• PLS shows the better performance.
• DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network densities.
PERFORMANCE COMPARISON: SENSITIVITY
Network size 20, MC 10, noise 5%
[Figure: sensitivity surfaces for LASSO, DPLASSO, and PLS over the two tuning parameters, at network densities of 50%, 20%, 10%, and 5%.]
• LASSO has the better performance.
• DPLASSO provides a good compromise between LASSO and PLS in terms of sensitivity for different network densities.
PERFORMANCE COMPARISON: SPECIFICITY
Network size 20, MC 10, noise 5%
[Figure: specificity surfaces for LASSO, DPLASSO, and PLS over the two tuning parameters, at network densities of 50%, 20%, 10%, and 5%.]
• DPLASSO provides a good compromise between LASSO and PLS in terms of specificity for different network densities.
PERFORMANCE COMPARISON: NETWORK SIZE
[Figure: accuracy surfaces for LASSO, DPLASSO, and PLS over the two tuning parameters, for network sizes 10 (100 potential connections), 20 (400 potential connections), and 50 (2500 potential connections).]
• DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network sizes.
• DPLASSO provides a good compromise between LASSO and PLS in terms of sensitivity (not shown) for different network sizes.
ROC CURVE vs. DYNAMICS AND WEIGHTINGS
[Figure: ROC curves (sensitivity vs. specificity) for LASSO, DPLASSO, and PLS as the dynamics parameter λ and the weight-tuning parameter γ are varied; density 20%, MC 10, size 50.]
• DPLASSO exhibits better performance for networks with slow dynamics.
• The parameter γ in DPLASSO can be adjusted to improve performance for networks with fast dynamics.
YEAST CELL DIVISION
Dataset generated via a well-known nonlinear model of the cell division cycle of fission yeast. The model is dynamic, with 9 state variables.
* Novak, Bela, et al. "Mathematical model of the cell division cycle of fission yeast." Chaos: An Interdisciplinary Journal of Nonlinear Science 11.1 (2001): 277-286.
CELL DIVISION CYCLE
[Figure: the true network (cell division cycle) compared with the PLS, DPLASSO, and LASSO reconstructions; one true interaction is missing in the DPLASSO reconstruction.]
RECONSTRUCTION PERFORMANCE
Case Study I: 10 Monte Carlo simulations, size 20; averaged over different γ, λ, network densities, and Monte Carlo sample datasets.

Method     Accuracy   Sensitivity   Specificity   SD RMSE/Mean
LASSO      0.31       0.92          0.16          0.14
DPLASSO    0.56       0.73          0.52          0.08
PLS        0.60       0.67          0.63          0.09

Case Study II: cell division cycle; averaged over γ values.

Method     Accuracy   Sensitivity   Specificity   SD RMSE/Mean
LASSO      0.39       0.90          0.05          0.06
DPLASSO    0.52       0.90          0.34          0.07
PLS        0.59       0.80          0.20          0.07
CONCLUSION
• Novel method, Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO), to reconstruct dynamic biological networks
  – Based on integrating significance testing of coefficients with optimization
  – Smoothing function to trade off between PLS and LASSO
• Simulation results on synthetic datasets
  – DPLASSO provides a good compromise between PLS and LASSO in terms of accuracy and sensitivity for
    • different network densities
    • different network sizes
• For the biological dataset
  – DPLASSO is best in terms of sensitivity
  – DPLASSO is a good compromise between LASSO and PLS in terms of accuracy, specificity and lift
Information Theory Methods
Farzaneh Farangmehr
Mutual Information
• Mutual information gives a metric indicating how much information one variable provides for predicting the behavior of another variable.
• The higher the mutual information, the more similar the two profiles.
• For two discrete random variables X = {x1,...,xn} and Y = {y1,...,ym}:
$I(X;Y) = \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}$
where $p(x_i, y_j)$ is the joint probability of $x_i$ and $y_j$, and $p(x_i)$ and $p(y_j)$ are the marginal probabilities of $x_i$ and $y_j$.
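A minimal sketch of this formula, computed from a joint probability table; the table values are invented:

```python
import numpy as np

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)       # marginal p(x_i)
    p_y = p_xy.sum(axis=0, keepdims=True)       # marginal p(y_j)
    nz = p_xy > 0                               # treat 0 * log(0) as 0
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))

p = np.array([[0.3, 0.1],
              [0.1, 0.5]])
print(mutual_information(p))    # > 0: X and Y are dependent
```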
Information theoretical approach
Shannon theory
• Hartley's conceptual framework of information relates the information of a random variable to its probability.
• Shannon defined the "entropy" H of a random variable X, given a random sample {x1,...,xn}, in terms of its probability distribution:
$H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$
• Entropy is a good measure of randomness or uncertainty.
• Shannon defines "mutual information" as the amount of information about a random variable X that can be obtained by observing another random variable Y:
$I(X,Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) = I(Y,X)$
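A quick numeric check of the identity I(X,Y) = H(X) + H(Y) − H(X,Y), reusing the invented joint table from the earlier sketch:

```python
import numpy as np

def H(p):                               # Shannon entropy of a distribution
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([[0.3, 0.1],
              [0.1, 0.5]])
I = H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p.ravel())
print(I)                                # matches mutual_information(p) above
```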
Mutual information networks
X = {x1,...,xi}    Y = {y1,...,yj}
• The ultimate goal is to find the best model that maps X → Y.
  – The general definition: Y = f(X) + U. In linear cases, Y = [A]X + U, where [A] is a matrix that defines the linear dependency of the inputs and outputs.
• Information theory maps inputs to outputs (for both linear and non-linear models) by using the mutual information:
$I(X;Y) = \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}$
Mutual information networks
• The framework of network reconstruction using information theory has two stages:
  1. Mutual information measurements
  2. Selection of a proper threshold
• Mutual information networks rely on the measurement of the mutual information matrix (MIM), a square matrix whose elements MIM_ij = I(X_i; Y_j) are the mutual information between X_i and Y_j.
• Choosing a proper threshold is a non-trivial problem. The usual way is to permute the expression measurements many times and recalculate the distribution of the mutual information for each permutation. The distributions are then averaged, and a good choice for the threshold is the largest mutual information value in the averaged permuted distribution.
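A hedged sketch of that permutation procedure; `mi` stands for any pairwise MI estimator, e.g. the discrete one above or the KDE version later in this section, and taking the maximum permuted value as the threshold follows the description above:

```python
import numpy as np

def mi_threshold(x, y, mi, n_perm=1000, rng=np.random.default_rng(0)):
    # Permuting x destroys any real dependence, giving a null distribution of MI.
    null_mis = [mi(rng.permutation(x), y) for _ in range(n_perm)]
    return max(null_mis)          # edges with MI above this value are kept
```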
Mutual information networks
Data Processing Inequality (DPI)
• The DPI for biological networks states that if genes g1 and g3 interact only through a third gene, g2, then
$I(g_1, g_3) \le \min[I(g_1, g_2),\ I(g_2, g_3)]$
• Checking against the DPI may identify gene pairs that are not directly dependent, even if $p(g_i, g_j) \ne p(g_i)\, p(g_j)$.
ARACNe algorithm
• ARACNE stands for "Algorithm for the Reconstruction of Accurate Cellular NEtworks" [25].
• ARACNE identifies candidate interactions by estimating the pairwise mutual information between gene expression profiles, I(g_i, g_j), and then filters the MIs using an appropriate threshold, I_0, computed for a specific p-value, p_0. In the second step, ARACNE removes the vast majority of indirect connections using the Data Processing Inequality (DPI).
[Figure: ARACNE flowchart, from Califano and coworkers.]
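A hedged sketch of ARACNE's two steps on a precomputed, symmetric MI matrix: thresholding at I0, then DPI pruning of the weakest edge in each fully connected triple; the tolerance parameter mirrors the published algorithm, but everything else here is illustrative:

```python
import numpy as np
from itertools import combinations

def aracne_prune(mim, i0, tol=0.0):
    adj = np.where(mim > i0, mim, 0.0)      # step 1: threshold at I0
    keep = adj.copy()
    n = adj.shape[0]
    for i, j, k in combinations(range(n), 3):   # step 2: DPI on triangles
        vals = {(i, j): adj[i, j], (j, k): adj[j, k], (i, k): adj[i, k]}
        if all(v > 0 for v in vals.values()):   # fully connected triple
            weakest = min(vals, key=vals.get)
            others = min(v for e, v in vals.items() if e != weakest)
            if vals[weakest] < others - tol:    # weakest edge is likely indirect
                a, b = weakest
                keep[a, b] = keep[b, a] = 0.0
    return keep
```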
Protein-Cytokine Network in Macrophage Activation
Application to Protein-Cytokine Network Reconstruction
The release of immune-regulatory cytokines during the inflammatory response is mediated by a complex signaling network [45]. Current knowledge does not provide a complete picture of these signaling components.
22 signaling proteins responsible for cytokine release:
cAMP, AKT, ERK1, ERK2, Ezr/Rdx, GSK3A, GSK3B, JNK lg, JNK sh, MSN, p38, p40Phox, NFkB p65, PKCd, PKCmu2, RSK, Rps6, SMAD2, STAT1a, STAT1b, STAT3, STAT5
7 released cytokines (as signal receivers):
G-CSF, IL-1a, IL-6, IL-10, MIP-1a, RANTES, TNFa
We developed an information-theoretic model that derives the responses of the seven cytokines from the activation of the twenty-two signaling phosphoproteins in RAW 264.7 macrophages. This model captured most of the known signaling components involved in cytokine release and was able to reasonably predict potentially important novel signaling components.
Protein-Cytokine Network Reconstruction
MI Estimation using KDE
– Given a random sample {x1,...,xn} for a univariate random variable X with an unknown density f, a kernel density estimator (KDE) estimates the shape of this function as:
$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right)$
Assuming Gaussian kernels:
$\hat{f}(x) = \frac{1}{\sqrt{2\pi}\, nh} \sum_{i=1}^{n} \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right)$
– Bivariate kernel density function of two random variables X and Y, given two random samples {x1,...,xn} and {y1,...,yn}:
$\hat{f}(x, y) = \frac{1}{2\pi n h^2} \sum_{i=1}^{n} \exp\!\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2h^2}\right)$
Mutual information of X and Y using kernel density estimation:
$I(X,Y) = \frac{1}{n} \sum_{j=1}^{n} \ln \frac{\hat{f}(x_j, y_j)}{\hat{f}(x_j)\, \hat{f}(y_j)}$
(n = sample size; h = kernel width)
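A direct, illustrative implementation of this estimator with a shared Gaussian kernel width h; the function interface is an assumption:

```python
import numpy as np

def mi_kde(x, y, h):
    n = len(x)
    dx = (x[:, None] - x[None, :]) ** 2          # pairwise squared distances
    dy = (y[:, None] - y[None, :]) ** 2
    # univariate Gaussian KDEs evaluated at the sample points
    fx = np.exp(-dx / (2 * h**2)).sum(1) / (np.sqrt(2 * np.pi) * n * h)
    fy = np.exp(-dy / (2 * h**2)).sum(1) / (np.sqrt(2 * np.pi) * n * h)
    # bivariate Gaussian KDE at the sample points
    fxy = np.exp(-(dx + dy) / (2 * h**2)).sum(1) / (2 * np.pi * n * h**2)
    return np.mean(np.log(fxy / (fx * fy)))      # (1/n) sum ln f(x,y)/(f(x) f(y))
```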
Protein-Cytokine Network Reconstruction
Kernel bandwidth selection
• There is no universal way of choosing h; however, the ranking of the MIs depends only weakly on it.
• The most common criterion used to select the optimal kernel width is to minimize the expected risk function, also known as the mean integrated squared error (MISE):
$\mathrm{MISE}(h) = E \int [\hat{f}_h(x) - f(x)]^2\, dx$
• Loss function (integrated squared error):
$L(h) = \int [\hat{f}_h(x) - f(x)]^2\, dx = \int \hat{f}_h^2(x)\, dx - 2\int \hat{f}_h(x) f(x)\, dx + \int f^2(x)\, dx$, where $\int f^2(x)\, dx = \mathrm{const}$
• The unbiased cross-validation approach selects the kernel width that minimizes the loss function by minimizing
$\mathrm{UCV}(h) = \int \hat{f}_h^2(x)\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{(-i),h}(x_i)$
where $\hat{f}_{(-i),h}(x_i)$ is the kernel density estimate at $x_i$ using bandwidth h, obtained after removing the i-th observation.
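A sketch of UCV-based bandwidth selection; for Gaussian kernels the integral of the squared density has a closed form (a Gaussian of width √2·h evaluated at the pairwise distances), which is used below. The sample and search grid are invented:

```python
import numpy as np

def ucv(x, h):
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    # closed form of  int f_h^2 dx  (convolution of two Gaussian kernels)
    int_f2 = np.exp(-d2 / (4 * h**2)).sum() / (n**2 * 2 * h * np.sqrt(np.pi))
    # leave-one-out density estimates f_(-i),h(x_i): drop each self-term
    K = np.exp(-d2 / (2 * h**2))
    loo = (K.sum(1) - 1.0) / ((n - 1) * h * np.sqrt(2 * np.pi))
    return int_f2 - 2.0 * loo.mean()

x = np.random.default_rng(6).standard_normal(200)
grid = np.linspace(0.05, 1.0, 40)
h_opt = grid[np.argmin([ucv(x, h) for h in grid])]
print(h_opt)
```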
Protein-Cytokine Network Reconstruction
Threshold Selection
• Based on large-deviation theory (extended to biological networks by ARACNE), the probability that an empirical value of mutual information I is greater than I_0, provided that its true value $\bar{I} = 0$, is
$P(I > I_0 \mid \bar{I} = 0) \sim e^{-cNI_0}$
where the bar denotes the true MI, N is the sample size, and c is a constant. Taking the logarithm of both sides,
$\ln P \approx a + b I_0$
so ln P can be fitted as a linear function of I_0 with slope b, where b is proportional to the sample size N.
• Using these results, for any given dataset with sample size N and a desired p-value, the corresponding threshold can be obtained.
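An illustrative sketch of that fit: estimate empirical tail probabilities from permuted (null) MI values, fit ln P ≈ a + b·I0 by least squares, and invert for the desired p-value. The helper name and interface are assumptions:

```python
import numpy as np

def mi_threshold_from_pvalue(null_mis, p_value):
    null_mis = np.sort(np.asarray(null_mis))
    # empirical tail probability P(I > I0) at each null MI value
    tail_p = 1.0 - np.arange(1, len(null_mis) + 1) / (len(null_mis) + 1.0)
    a, b = np.polyfit(null_mis, np.log(tail_p), 1)[::-1]   # ln P ~ a + b*I0
    return (np.log(p_value) - a) / b                       # solve for I0
```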
Protein-Cytokine Network Reconstruction
Kernel density estimation of cytokines
[Figure 3: the probability distributions of the seven released cytokines in RAW 264.7 macrophages, estimated by kernel density estimation (KDE).]
[Figure: mutual information for all 22 x 7 phosphoprotein-cytokine pairs, from the Toll data (upper bar) and the non-Toll data (lower bar).]
Protein-Cytokine Network Reconstruction
Protein-Cytokine signaling networks
[Figure: the topology of the signaling protein-released cytokine networks obtained from the non-Toll (A) and Toll (B) data, and their combination (A + B).]
Protein-cytokine Network Reconstruction
Summary
• This model successfully captures all known signaling components involved in cytokine release.
• It predicts two potentially new signaling components involved in cytokine release: ribosomal S6 kinase on tumor necrosis factor, and ribosomal protein S6 on interleukin-10.
• For MIP-1α and IL-10, whose low coefficients of determination lead to less precise linear models, the information-theoretic model shows an advantage over linear methods such as the PCR minimal model [Pradervand et al.] in capturing all known regulatory components involved in cytokine release.
Network reconstruction from time-course data
Background: Time-delayed gene networks
• The idea comes from the consideration that the expression of a gene at a certain time may depend on the expression level of another gene at the previous time point, or at a few time points before.
• The time-delayed gene regulation pattern in organisms is a common phenomenon, since:
  – If the effect of gene g1 on gene g2 depends on an inducer, g3, that has to be bound first in order to be able to bind to the inhibition site on g2, there can be a significant delay between the expression of gene g1 and its observed effect, i.e., the inhibition of gene g2.
  – Not all the genes that influence the expression level of a gene are necessarily observable in one microarray experiment. It is quite possible that they are not among the genes being monitored in the experiment, or that their function is currently unknown.
Network reconstruction from time-course data
The Algorithm
$\mathrm{ICNA}(es_i) = \arg\min_t \left\{ es_i^0 / es_i^t \ge \lambda_{up} \ \text{or}\ es_i^t / es_i^0 \ge \lambda_{down} \right\}$
Network reconstruction from time-course data
Algorithm
Network reconstruction from time-course data
The flow diagram
[Flow diagram: gene lists → cluster into n sub-networks → measure sub-network activities → flag potentially dependent sub-networks by measuring ICNA → measure the influence between flagged sub-networks → build the influence matrix → find the threshold → remove connections below the threshold → apply DPI to connections above the threshold → build the network based on the non-zero elements of the mutual information matrix.]
The flow diagram of the information-theoretic approach for biological network reconstruction from time-course microarray data by identifying the topology of functional sub-networks.
Network reconstruction from time-course data
Case study: the yeast cell-cycle
The cell cycle consists of four distinct phases (with G0 as a resting phase outside the cycle):
• G0 (Gap 0): a resting phase where the cell has left the cycle and has stopped dividing.
• G1 (Gap 1): cells increase in size in Gap 1. The G1 checkpoint control mechanism ensures that everything is ready for DNA synthesis.
• S (Synthesis): DNA replication occurs during this phase.
• G2 (Gap 2): during the gap between DNA synthesis and mitosis, the cell will continue to grow. The G2 checkpoint control mechanism ensures that everything is ready to enter the M (mitosis) phase and divide.
• M (Mitosis): cell growth stops at this stage, and cellular energy is focused on the orderly division into two daughter cells. A checkpoint in the middle of mitosis (the metaphase checkpoint) ensures that the cell is ready to complete cell division.
Network reconstruction from time-course data
Case study: the yeast cell-cycle
• Data from Gene Expression Omnibus (GEO)
• Culture synchronized by alpha-factor arrest; samples taken every 7 minutes as cells went through the cell cycle
• Value type: log ratio
• 5,981 genes, 7,728 probes, and 14 time points
• 94 pathways from KEGG Pathways
Network reconstruction from time-course data
Case study: the yeast cell-cycle
[Figure: the reconstructed functional network of the yeast cell cycle obtained from time-course microarray data.]
Mutual information networks
Advantages and Limits
• A major advantage of information theory is its nonparametric nature: entropy does not require any assumptions about the distribution of the variables [43].
• It does not assume linearity of the model for the ease of computation.
• It is applicable to time-series data.
• A limitation: a high mutual information does not tell us anything about the direction of the relationship.
Time Varying Networks
Causality
Maryam Masnardi-Shirazi
Causal Inference of Time-Varying Biological Networks
Definition of Causality
Beyond Correlation: Causation
Idea: map a set of K time series to a directed graph with K nodes, where an edge is placed from a to b if the past of a has an impact on the future of b.
How do we do this quantitatively, in a general-purpose manner?
Granger’s Notion of Causality
A process X is said to Granger-cause a process Y if future values of Y can be better predicted using the past values of both X and Y than using only the past values of Y.
Granger Causality Formulation
• There are many ways to formulate the notion of Granger causality, some of which are:
  – Information theory and the concept of directed information
  – Learning theory
  – Dynamic Bayesian networks
  – Vector autoregressive models (VAR)
  – Hypothesis tests, e.g. t-tests and F-tests
Vector Autoregressive Model (VAR)
Least Squares Estimation
Least Squares Estimation (Cont.)
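The VAR and least-squares details on these slides were figures; as a stand-in, a hedged sketch of Granger-causality testing with a first-order VAR fitted by least squares, using an F-test on the restriction that the past of series j does not help predict series i. The interface and the VAR order are assumptions:

```python
import numpy as np
from scipy import stats

def granger_var1(X, i, j):
    """Does series j Granger-cause series i? X is (time, series)."""
    past, future = X[:-1], X[1:, i]
    T = len(future)
    full = np.column_stack([past, np.ones(T)])      # past of all series + intercept
    restricted = np.delete(full, j, axis=1)         # drop the past of series j

    def rss(design):
        coef = np.linalg.lstsq(design, future, rcond=None)[0]
        return np.sum((future - design @ coef) ** 2)

    rss_full, rss_r = rss(full), rss(restricted)
    dof = T - full.shape[1]
    F = (rss_r - rss_full) / (rss_full / dof)       # one linear restriction
    return F, 1.0 - stats.f.cdf(F, 1, dof)          # F statistic and p-value
```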
Processing the data
• Phosphoprotein two-ligand screen assay: RAW 264.7
• There are 327 experiments from western blots processed with mixtures of phosphospecific antibodies. In all experiments, the effects of single-ligand and simultaneous ligand addition are measured.
• Each experiment includes the fold change of phosphoproteins at time points t = 0, 1, 3, 10, 30 minutes.
• Data at t = 30 minutes is omitted, and data from t = 0 to 10 minutes is interpolated in steps of 1 minute.
Least Squares Estimation and Rank Deficiency of
Transformation Matrix
[Figure: stacking of all 327 experiments into the full X and Y data matrices.]
Normalizing the data
Normalizing the data
Statistical Significance Test (Confidence Interval)
The Reconstructed Phosphoproteins Signaling
Network
• The network is reconstructed by estimating causal relationships between all nodes.
• All 21 phosphoproteins are present and interact with one another.
• There are 122 edges in this network.
Correlation and Causation
• The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal relationship between the variables.
• This does not mean that correlations cannot indicate the potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown.
• Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).
Correlation and Causality comparison
[Figures: heat-map of the correlation matrix between the input (X) and output (Y); the reconstructed network considering significant coefficients and their intersection with connections having correlations higher than 0.5.]
Correlation and Causality comparison (cont.)
[Figures: heat-map of the correlation matrix between the input (X) and output (Y); the reconstructed network considering significant coefficients and their intersection with connections having correlations higher than 0.4.]
Validating our network
Identification of Crosstalk between Phosphoprotein Signaling Pathways in RAW 264.7 Macrophage Cells (Gupta et al., 2010)
The Reconstructed Phosphoproteins Signaling Network
for t=0 to t=4 minutes
[Figures: heat-map of the correlation matrix between the input (X) and output (Y) for t = 0 to t = 4 minutes; intersection of the causal coefficients with connections having correlations higher than 0.4 for t = 0 to t = 4 minutes. The network has 9 nodes and 15 edges.]
The Reconstructed Phosphoproteins Signaling Network
for t=3 to t=7 minutes
[Figures: heat-map of the correlation matrix between the input (X) and output (Y) for t = 3 to t = 7 minutes; intersection of the causal coefficients with connections having correlations higher than 0.4 for t = 3 to t = 7 minutes. The network has 19 nodes and 51 edges.]
The Reconstructed Phosphoproteins Signaling Network
for t=6 to t=10 minutes
[Figures: heat-map of the correlation matrix between the input (X) and output (Y) for t = 6 to t = 10 minutes; intersection of the causal coefficients with connections having correlations higher than 0.4 for t = 6 to t = 10 minutes. The network has 19 nodes and 56 edges.]
Time-Varying reconstructed Network
t=0 to 4 min
t=3 to 7 min
t=6 to 10 min
The Reconstructed Network for t = 0 to t = 4 Minutes without LPS as a Ligand
With LPS: 15 edges. Without LPS: 16 edges.
The Reconstructed Network for t = 3 to t = 7 Minutes without LPS as a Ligand vs. with All Ligands
With all ligands including LPS: 51 edges. Without LPS: 55 edges.
The Reconstructed Network for t = 6 to t = 10 Minutes without LPS as a Ligand vs. with All Ligands
With all ligands including LPS: 56 edges. Without LPS: 66 edges.
Time-Varying Network with LPS Not Present as a Ligand
t=0 to 4 min
t=3 to 7 min
t=6 to 10 min
Summary
• Information-theoretic methods can help in determining causal and time-dependent networks from time-series data.
• The granularity of the time course is a factor in determining the causal connections.
• Such dynamical networks can be used to construct both linear and nonlinear models from data.