
Clustering in Generalized Linear Mixed
Model Using Dirichlet Process Mixtures
Ya Xue and Xuejun Liao
April 1, 2005
Introduction

• Concept drift fits within the framework of the generalized linear mixed model, but it raises a new question: how to exploit the structure of the auxiliary data.
• Mixtures with a countably infinite number of components can be handled in a Bayesian framework by employing Dirichlet process priors.
Outline

Part I: generalized linear mixed model
• Generalized linear model (GLM)
• Generalized linear mixed model (GLMM)
• Advanced applications
• Bayesian feature selection in GLMM

Part II: nonparametric method
• Chinese restaurant process
• Dirichlet process (DP)
• Dirichlet process mixture models
• Variational inference for Dirichlet process mixtures
Part I
Generalized Linear Mixed Model
Generalized Linear Model (GLM)
• A linear model specifies the relationship between a dependent (or response) variable $Y$ and a set of predictor variables, the $X$s, so that
$y_i = x_i'\beta + \epsilon_i$,
where $i$ indexes subjects.
• GLM is a generalization of normal linear regression models to the exponential family (normal, Poisson, Gamma, binomial, etc.).
Generalized Linear Model (GLM)
GLM differs from the linear model in two major respects:
• The distribution of $Y$ can be non-normal and does not have to be continuous.
• $Y$ can still be predicted from a linear combination of the $X$s, but the two are "connected" via a link function.
Generalized Linear Model (GLM)
DDE example: binomial distribution
• Scientific interest: does DDE exposure increase the risk of cancer? Test on rats; let $i$ index rats.
• Dependent variable:
$y_i \sim \mathrm{Bin}(1, p_i)$, where $p_i$ is the risk of cancer for rat $i$, and
$y_i = \begin{cases} 1, & \text{rat } i \text{ is diagnosed with cancer} \\ 0, & \text{no cancer.} \end{cases}$
• Independent variable: dose of DDE exposure, denoted by $x_i$.
Generalized Linear Model (GLM)
• Likelihood function of $y_i$:
$f(y_i \mid p_i) = p_i^{y_i}(1-p_i)^{1-y_i} = \exp\left\{ y_i \ln\frac{p_i}{1-p_i} + \ln(1-p_i) \right\} = \frac{\exp\{y_i \theta_i\}}{1+\exp\{\theta_i\}}$, where $\theta_i = \ln\frac{p_i}{1-p_i}$.
• Choosing the canonical link $\theta_i = \ln\frac{p_i}{1-p_i} = x_i'\beta$, the likelihood function becomes
$f(y_i \mid x_i, \beta) = \frac{\exp\{y_i\, x_i'\beta\}}{1+\exp\{x_i'\beta\}}$
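As a quick sanity check of this likelihood, here is a minimal NumPy sketch (the function and variable names are ours, not from the slides):

```python
import numpy as np

def bernoulli_loglik(y, X, beta):
    """Log-likelihood of the logistic GLM under the canonical logit link:
    f(y_i | x_i, beta) = exp(y_i * x_i'beta) / (1 + exp(x_i'beta))."""
    eta = X @ beta                        # linear predictor x_i'beta
    # log f = y * eta - log(1 + exp(eta)), computed stably via logaddexp
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Toy data in the spirit of the DDE example: intercept column plus dose.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.uniform(0.0, 1.0, 5)])
y = rng.integers(0, 2, 5)
print(bernoulli_loglik(y, X, np.array([-1.0, 2.0])))
```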
GLMM – Basic Model
Returning to the DDE example: 19 labs all over the world participated in this bioassay.
• There are unmeasured factors that vary between the different labs, for example rodent diet.
• GLMM extends the generalized linear model by adding random effects to the linear predictor (Schall 1991).
GLMM – Basic Model
• The previous linear predictor is modified as:
$\eta_{ij} = x_{ij}'\beta + z_{ij}'b_i$,
where $i = 1, \ldots, n$ indexes labs and $j = 1, \ldots, n_i$ indexes rats within lab $i$.
• $\beta$ are "fixed" effects, parameters common to all rats.
• $b_i$ are "random" effects, deviations for lab $i$.
GLMM – Basic Model
$\eta_{ij} = x_{ij}'\beta + z_{ij}'b_i$
• If we choose $x_{ij} = z_{ij}$, then all the regression coefficients are assumed to vary across the different labs.
• If we choose $z_{ij} = 1$, then only the intercept varies across the different labs (random intercept model).
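To make the two choices concrete, here is a small sketch of this linear predictor; the helper name and the lab-assignment encoding are our own illustration:

```python
import numpy as np

def linear_predictor(X, beta, lab, b, Z=None):
    """Compute eta_ij = x_ij'beta + z_ij'b_i.
    If Z is None we take z_ij = 1 (random intercept model); otherwise
    b[lab] must hold one random coefficient vector per lab (z_ij = x_ij case)."""
    if Z is None:
        return X @ beta + b[lab]                      # only the intercept varies
    return X @ beta + np.sum(Z * b[lab], axis=1)      # all coefficients vary by lab

beta = np.array([-1.0, 2.0])
b = np.array([0.3, -0.3])                  # one random intercept per lab
X = np.array([[1.0, 0.1], [1.0, 0.5], [1.0, 0.9]])
lab = np.array([0, 0, 1])                  # which lab each rat belongs to
print(linear_predictor(X, beta, lab, b))
```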
GLMM – Implementation
• Gibbs sampling
  - Disadvantage: slow convergence.
  - Solution: hierarchical centering reparametrisation (Gelfand 1994; Gelfand 1995).
• Deterministic methods are only available for logit and probit models:
  - EM algorithm (Anderson 1985)
  - Simplex method (Im 1988)
GLMM – Advanced Applications


Nested GLMM: within each lab, rats were group
housed with three cats per cage.
ijk  xijk '   zijk ' bi  vijk 'ij
let i index lab, j index cage and k index rat.
Crossed GLMM: for all labs, four dose protocols
were applied on different rats.
ij  xij '   zij ' bi  vij 'k
let i index lab, j index rat and k indicate the
protocol applied on rat i,j.
13
GLMM – Advanced Applications
• Nested GLMM: within each lab, rats were group-housed with three rats per cage. This is a two-level GLMM: level I is the lab, level II is the cage.
• Crossed GLMM: across all labs, four dose protocols were applied to different rats.
  - Rats are sorted into 19 groups by lab.
  - Rats are sorted into 4 groups by protocol.
GLMM – Advanced Applications
• Temporal/spatial statistics: account for correlation between the random effects at different times/locations.
• Dynamic latent variable model (Dunson 2003). Let $i$ index patients and $t$ index follow-up times; the linear predictor takes the form
$\eta_{it} = x_{it}'\nu + \sum_{k=0}^{t-1} (\gamma_{jk}' x_{jk}'\nu)\, b_{jk} + \epsilon_{it}$
GLMM – Advanced Applications
• Spatially varying coefficient processes (Gelfand 2003): random effects are modeled as a spatially correlated process.
[Figure: scatter plot of a possible application, a landmine field where landmines tend to be close together.]
Bayesian Feature Selection in GLMM
Simultaneous selection of fixed and random effects in GLMM (Cai and Dunson 2005)
• Mixture prior: $p(x) = \pi\,\delta(x-0) + (1-\pi)\,g(x)$
[Figure: density of the mixture prior, a point mass $\delta(x-0)$ at zero plus a continuous component $g(x)$.]
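A minimal sketch of drawing coefficients from this mixture prior, assuming (our choice, not the paper's) a standard normal slab for $g$:

```python
import numpy as np

def sample_mixture_prior(n, pi=0.5, rng=None):
    """Draw n coefficients from p(x) = pi * delta(x - 0) + (1 - pi) * g(x)."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(n)            # slab component g, assumed N(0, 1)
    x[rng.uniform(size=n) < pi] = 0.0     # spike: exactly zero with probability pi
    return x

draws = sample_mixture_prior(10_000, pi=0.5, rng=1)
print((draws == 0.0).mean())              # about 0.5 of the draws are zeroed out
```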
Bayesian Feature Selection in GLMM
• Fixed effects: choose mixture priors for the fixed effects coefficients.
• Random effects: reparameterization
  - LDU decomposition of the random effects covariance.
  - Choose mixture priors for the elements of the diagonal matrix.
Missing Identification in GLMM
• Data table of the DDE bioassay (excerpt):

  Lab     y    Covariates (x)
  ......
  Berlin  1    0.01  0.00  34.10  40.90  37.50
  Berlin  1    0.01  0.00  35.70  35.60  32.10
  Tokyo   0    0.01  0.00  56.50  28.90  27.10
  Tokyo   1    0.01  0.00  51.50  29.90  25.90
  ......

• What if the first column is missing? This is an unusual case in statistics, so few people have worked on it. But it is the problem we have to solve for concept drift.
Concept Drift
• Primary data: $p(y_i \mid x_i, w) = \sigma(y_i\, x_i'w)$
• Auxiliary data: $p(y_i \mid x_i, \mu_i, w) = \sigma(y_i (x_i'w + \mu_i))$
• If we treat the drift variable $\mu_i$ as a random variable, concept drift is a random intercept model, a special case of GLMM.
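A small sketch of the two likelihoods, with labels coded as $y \in \{-1, +1\}$ so that $\sigma(y\,\eta)$ is the probability of the observed label; the symbol $\mu$ for the drift variables is our reconstruction:

```python
import numpy as np

def sigma(t):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-t))

def primary_lik(y, X, w):
    return sigma(y * (X @ w))             # p(y_i | x_i, w)

def auxiliary_lik(y, X, mu, w):
    return sigma(y * (X @ w + mu))        # p(y_i | x_i, mu_i, w)

w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1, -1])                     # labels in {-1, +1}
print(primary_lik(y, X, w))
print(auxiliary_lik(y, X, np.array([0.2, 0.0]), w))
```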
Clustering in Concept Drift
[Figure: two panels. Left: histogram of the estimated non-zero auxiliary variables, C=10 (x-axis: value of $\mu$; y-axis: number of occurrences; bin resolution = 1). Right: estimated auxiliary variables, C=10 (x-axis: index of auxiliary data).]
K = 51 clusters (including 0) out of 300 auxiliary data points.
Clustering in Concept Drift
• There are intrinsic clusters in the auxiliary data with respect to drift value.
• "The simplest explanation is best." (Occam's razor) Why don't we instead give each cluster its own random effect variable?
Clustering in Concept Drift
• In usual statistics applications, we know which individuals share the same random effect.
• However, in concept drift, we do not know which individuals (data points or features) share the same random intercept.
• Can we train the classifier and cluster the auxiliary data simultaneously? This is a new problem we aim to solve.
Clustering in Concept Drift
• How many clusters (K) should we include in our model?
• Does choosing K actually make sense?
• Is there a better way?
Part II
Nonparametric Method
Nonparametric method
• Parametric methods assume the forms of the underlying density functions are known.
• Nonparametric methods are a wide category, e.g., nearest neighbors (NN), minimax, bootstrapping...
• Nonparametric Bayesian methods make use of the Bayesian calculus without prior parameterized knowledge.
Cornerstones of NBM
• Dirichlet process (DP): allows flexible structures to be learned and allows sharing of statistical strength among sets of related structures.
• Gaussian process (GP): allows sharing in the context of multiple nonparametric regressions. (We suggest having a separate seminar on GP.)
Chinese Restaurant Process
• The Chinese restaurant process (CRP) is a distribution on partitions of the integers.
• The CRP is used to represent uncertainty over the number of components in a mixture model.
Chinese Restaurant Process
• Unlimited number of tables.
• Each table has an unlimited capacity to seat customers.
Chinese Restaurant Process
The (m+1)th customer sits at a table drawn from the following distribution:
$p(\text{occupied table } i \mid \text{previous customers}) = \dfrac{m_i}{\alpha + m}$
$p(\text{an unoccupied table} \mid \text{previous customers}) = \dfrac{\alpha}{\alpha + m}$
where $m_i$ is the number of previous customers at table $i$ and $\alpha$ is a parameter.
Chinese Restaurant Process
Example: m  9,   1.
The probability that next customer sits at table
2
9 1


1
9 1
2
9 1




4
9 1

 
1
9 1

31
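These seating probabilities are easy to simulate; a minimal sketch (function name ours):

```python
import numpy as np

def crp_sample(m, alpha, rng=None):
    """Simulate table assignments for m customers under a CRP(alpha):
    customer n+1 joins occupied table i with prob m_i / (alpha + n)
    and opens a new table with prob alpha / (alpha + n)."""
    rng = np.random.default_rng(rng)
    counts = []                                       # m_i for each occupied table
    for n in range(m):
        probs = np.array(counts + [alpha]) / (alpha + n)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                          # open a new table
        else:
            counts[table] += 1
    return counts

print(crp_sample(9, alpha=1.0, rng=0))                # table occupancies after 9 customers
```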
Chinese Restaurant Process
• The CRP yields an exchangeable distribution on partitions of the integers, i.e., the specific ordering of the customers is irrelevant.
• An infinite set of random variables is said to be infinitely exchangeable if for every finite subset $\{x_1, x_2, \ldots, x_n\}$ we have
$p(x_1, x_2, \ldots, x_n) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$
for any permutation $\pi$.
Dirichlet Process
Let $G_0$ be any probability measure on the reals and $\{B_1, \ldots, B_k\}$ any partition of the reals. A process is a Dirichlet process if the following holds for all partitions:
$(G(B_1), \ldots, G(B_k)) \sim \mathrm{Dir}(\alpha G_0(B_1), \ldots, \alpha G_0(B_k))$
where $\alpha$ is a concentration parameter. (Notation: Dir is the Dirichlet distribution, DP the Dirichlet process.)
Dirichlet Process
• Denote a sample from the Dirichlet process as $G \sim \mathrm{DP}(\alpha, G_0)$; $G$ is itself a distribution.
• Denote a sample from the distribution $G$ as $\theta \mid G \sim G$.
[Figure: graphical model for a DP generating the parameters $\theta$.]
Dirichlet Process
Properties of the DP:
• $E[G] = G_0$
• $p(G \mid \theta_1, \ldots, \theta_n) = \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha}{\alpha+n} G_0 + \frac{1}{\alpha+n}\sum_{i=1}^{n} \delta_{\theta_i}\right)$
• $E[G \mid \theta_1, \ldots, \theta_n] = \frac{\alpha}{\alpha+n} G_0 + \frac{1}{\alpha+n}\sum_{i=1}^{n} \delta_{\theta_i}$
Dirichlet Process
The marginal probabilities for a new $\theta$:
$p(\theta_{n+1} = \theta_i \text{ for some } 1 \le i \le n \mid \theta_1, \ldots, \theta_n, \alpha, G_0) = \frac{1}{\alpha+n}\sum_{j=1}^{n} \delta_{\theta_i}(\theta_j)$
$p(\theta_{n+1} \ne \theta_i \text{ for all } 1 \le i \le n \mid \theta_1, \ldots, \theta_n, \alpha, G_0) = \frac{\alpha}{\alpha+n}$
This is the Chinese restaurant process.
DP Mixtures
$x_i \mid \theta_i \sim F(\theta_i)$
$\theta_i \mid G \sim G$
$G \sim \mathrm{DP}(\alpha, G_0)$
If $F$ is a normal distribution, this is a Gaussian mixture model.
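A generative sketch of such a DP mixture, drawing cluster assignments from the CRP marginal and taking, as our own assumptions, $G_0 = N(0, 3^2)$ for component means and $F(\theta) = N(\theta, 1)$:

```python
import numpy as np

def sample_dp_mixture(n, alpha=1.0, rng=None):
    """Draw n points from a DP Gaussian mixture via the CRP representation."""
    rng = np.random.default_rng(rng)
    thetas, labels, counts = [], [], []
    for i in range(n):
        probs = np.array(counts + [alpha]) / (alpha + i)
        k = rng.choice(len(probs), p=probs)           # CRP cluster assignment
        if k == len(counts):
            counts.append(1)
            thetas.append(rng.normal(0.0, 3.0))       # new component mean from G0
        else:
            counts[k] += 1
        labels.append(k)
    x = np.array([rng.normal(thetas[k], 1.0) for k in labels])  # x_i ~ F(theta_i)
    return x, np.array(labels)

x, z = sample_dp_mixture(300, alpha=1.0, rng=0)
print(len(np.unique(z)), "clusters among", len(x), "points")
```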
Applications of DP
• Infinite Gaussian Mixture Model (Rasmussen 2000)
• Infinite Hidden Markov Model (Beal 2002)
• Hierarchical Topic Models and the Nested Chinese Restaurant Process (Blei 2004)
Implementation of DP
Gibbs sampling:
• If $G_0$ is a conjugate prior for the likelihood given by $F$: (Escobar 1995)
• If the prior is non-conjugate: (Neal 1998)
Variational Inference for DPM
• The goal is to compute the predictive density under the DP mixture:
$p(x \mid x_1, \ldots, x_n) = \int p(x \mid \theta)\, p(\theta \mid x_1, \ldots, x_n)\, d\theta$
• We then minimize the KL divergence between $p$ and a variational distribution $q$.
• This algorithm is based on the stick-breaking representation of the DP.
(We would suggest a separate seminar on the stick-breaking view of DP and variational DP.)
Open Questions
• Can we apply ideas of infinite models beyond identifying the number of states or components in a mixture?
• Under what conditions can we expect these models to give consistent estimates of densities?
• ...
• Specific to our problem: the model is non-conjugate due to the sigmoid function.