Transcript Document

Multivariate Analysis
Review
Multivariate distributions
The multivariate Normal distribution
x  [x1, x2, … xp] is said to have a p-variate
normal distribution with mean vector  and
covariance matrix S if
f ( x )  f  x1 ,

 2 
1
p/2
, xp 
S
1/ 2
x ~ N p  , S
e

1
 x    S1  x   
2
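As a quick numerical companion to the density formula above, here is a minimal sketch that evaluates it directly with numpy and cross-checks the result against scipy's multivariate_normal. The mean vector, covariance matrix and evaluation point are made up purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(x, mu, Sigma):
    """Evaluate the p-variate normal density at x using the formula above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)           # (x-mu)' Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)                 # log |Sigma|
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.exp(0.5 * logdet))

# illustrative (made-up) parameters
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.5])
print(mvn_density(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))    # should agree
```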
Surface Plots of the bivariate Normal distribution
Contour Plots of the bivariate Normal distribution
Scatter Plots of data from the bivariate Normal distribution
Trivariate Normal distribution - Contour map
Contours satisfy $(x - \mu)'\Sigma^{-1}(x - \mu) = \text{const}$, centred at the mean vector $\mu = (\mu_1, \mu_2, \mu_3)'$.
[Figures: contour surfaces of the trivariate Normal distribution plotted against the axes $x_1$, $x_2$, $x_3$]
Marginal and Conditional
distributions
Theorem (Marginal distributions for the Multivariate Normal distribution)
Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (with $x_1$ of dimension $q$ and $x_2$ of dimension $p-q$) have a p-variate Normal distribution with mean vector $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$.
Then the marginal distribution of $x_i$ is $q_i$-variate Normal ($q_1 = q$, $q_2 = p - q$) with mean vector $\mu_i$ and covariance matrix $\Sigma_{ii}$.
Theorem (Conditional distributions for the Multivariate Normal distribution)
Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (with $x_1$ of dimension $q$ and $x_2$ of dimension $p-q$) have a p-variate Normal distribution with mean vector $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$.
Then the conditional distribution of $x_i$ given $x_j$ is $q_i$-variate Normal with mean vector
$$\mu_{i|j} = \mu_i + \Sigma_{ij}\Sigma_{jj}^{-1}(x_j - \mu_j)$$
and covariance matrix
$$\Sigma_{ii|j} = \Sigma_{ii} - \Sigma_{ij}\Sigma_{jj}^{-1}\Sigma_{ij}'.$$
The conditional distribution of $x_2$ given $x_1$ is:
$$f_{2|1}(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_1(x_1)} = \frac{1}{(2\pi)^{(p-q)/2}\,|A|^{1/2}}\; e^{-\frac{1}{2}(x_2 - b)'A^{-1}(x_2 - b)}$$
where $b = \mu_2 + \Sigma_{12}'\Sigma_{11}^{-1}(x_1 - \mu_1)$ and $A = \Sigma_{22} - \Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$.
The matrix $\Sigma_{22\cdot 1} = \Sigma_{22} - \Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$ is called the matrix of partial variances and covariances.
The $(i,j)^{\text{th}}$ element of the matrix $\Sigma_{22\cdot 1}$, $\sigma_{ij\cdot 1,2,\dots,q}$, is called the partial covariance (variance if $i = j$) between $x_i$ and $x_j$ given $x_1, \dots, x_q$.
$$\rho_{ij\cdot 1,2,\dots,q} = \frac{\sigma_{ij\cdot 1,2,\dots,q}}{\sqrt{\sigma_{ii\cdot 1,2,\dots,q}\,\sigma_{jj\cdot 1,2,\dots,q}}}$$
is called the partial correlation between $x_i$ and $x_j$ given $x_1, \dots, x_q$.
The matrix $B = \Sigma_{12}'\Sigma_{11}^{-1}$ is called the matrix of regression coefficients for predicting $x_{q+1}, x_{q+2}, \dots, x_p$ from $x_1, \dots, x_q$.
The mean vector of $x_{q+1}, x_{q+2}, \dots, x_p$ given $x_1, \dots, x_q$ is:
$$\mu_{2\cdot 1} = \mu_2 + B(x_1 - \mu_1) \quad \text{where } B = \Sigma_{12}'\Sigma_{11}^{-1}.$$
Independence
Note: two vectors, $x_1$ and $x_2$, are independent if
$$f(x_1, x_2) = f_1(x_1)\,f_2(x_2).$$
Then the conditional distribution of $x_i$ given $x_j$ is equal to the marginal distribution of $x_i$.
If $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ is multivariate Normal with mean vector $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$, then the two vectors, $x_1$ and $x_2$, are independent if $\Sigma_{12} = 0$.
The components of the vector $x$ are independent if $\sigma_{ij} = 0$ for all $i$ and $j$ ($i \neq j$), i.e. $\Sigma$ is a diagonal matrix.
Transformations
Theorem
Let $x_1, x_2, \dots, x_n$ denote random variables with joint probability density function $f(x_1, x_2, \dots, x_n)$.
Let
$u_1 = h_1(x_1, x_2, \dots, x_n)$
$u_2 = h_2(x_1, x_2, \dots, x_n)$
$\vdots$
$u_n = h_n(x_1, x_2, \dots, x_n)$
define an invertible transformation from the x's to the u's.
Then the joint probability density function of $u_1, u_2, \dots, u_n$ is given by:
$$g(u_1, \dots, u_n) = f(x_1, \dots, x_n)\,|J|$$
where
$$J = \frac{d(x_1, \dots, x_n)}{d(u_1, \dots, u_n)} = \det\begin{pmatrix} \dfrac{dx_1}{du_1} & \cdots & \dfrac{dx_1}{du_n} \\ \vdots & & \vdots \\ \dfrac{dx_n}{du_1} & \cdots & \dfrac{dx_n}{du_n} \end{pmatrix}$$
is the Jacobian of the transformation.
Theorem
Let $x_1, x_2, \dots, x_n$ denote random variables with joint probability density function $f(x_1, x_2, \dots, x_n)$.
Let
$u_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n + c_1$
$u_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n + c_2$
$\vdots$
$u_n = a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n + c_n$
define an invertible linear transformation from the x's to the u's:
$$u = Ax + c \quad \text{or} \quad x = A^{-1}(u - c).$$
Then the joint probability density function of $u_1, u_2, \dots, u_n$ is given by:
$$g(u_1, \dots, u_n) = f\big(A^{-1}(u - c)\big)\,\frac{1}{|\det A|}
\quad \text{where} \quad
A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}.$$
Theorem
Suppose that the random vector $x = [x_1, x_2, \dots, x_p]'$ has a p-variate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Then $u = Ax + c$ has a p-variate normal distribution with mean vector $\mu_u = A\mu + c$ and covariance matrix $\Sigma_u = A\Sigma A'$.
Theorem (Linear transformations of Normal RV's)
Suppose that the random vector $x$ has a p-variate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Let $A$ be a $q \times p$ matrix of rank $q \leq p$. Then $Ax$ has a q-variate normal distribution with mean vector $\mu_{Ax} = A\mu$ and covariance matrix $\Sigma_{Ax} = A\Sigma A'$.
Maximum Likelihood Estimation
Multivariate Normal distribution
The Method of Maximum Likelihood
Suppose that the data $x_1, \dots, x_n$ has joint density function $f(x_1, \dots, x_n; \theta_1, \dots, \theta_p)$, where $\theta = (\theta_1, \dots, \theta_p)$ are unknown parameters assumed to lie in $\Omega$ (a subset of p-dimensional space). We want to estimate the parameters $\theta_1, \dots, \theta_p$.
Definition: The Likelihood function
Suppose that the data $x_1, \dots, x_n$ has joint density function $f(x_1, \dots, x_n; \theta_1, \dots, \theta_p) = f(x, \theta)$. Then, given the data, the likelihood function is defined to be
$$L(\theta) = L(\theta_1, \dots, \theta_p) = f(x_1, \dots, x_n; \theta_1, \dots, \theta_p) = f(x, \theta).$$
Note: the domain of $L(\theta_1, \dots, \theta_p)$ is the set $\Omega$.
Definition: Maximum Likelihood Estimators
The maximum likelihood estimators of the parameters $\theta_1, \dots, \theta_p$ are the values $\hat\theta_1, \dots, \hat\theta_p$ that maximize $L(\theta) = L(\theta_1, \dots, \theta_p)$, i.e. such that
$$L(\hat\theta_1, \dots, \hat\theta_p) = \max_{\theta_1, \dots, \theta_p} L(\theta_1, \dots, \theta_p).$$
Note: maximizing $L(\theta_1, \dots, \theta_p)$ is equivalent to maximizing the log-likelihood function
$$l(\theta_1, \dots, \theta_p) = \ln L(\theta_1, \dots, \theta_p).$$
Maximum Likelihood Estimation
Multivariate Normal distribution
Summary:
the maximum likelihood estimators of $\mu$ and $\Sigma$ are
$$\hat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and
$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' = \frac{n-1}{n}\,S.$$
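The estimators above are just the sample mean vector and a covariance matrix with divisor n. A minimal sketch, assuming numpy and using simulated data with made-up parameters, that computes both and confirms the relationship to the usual sample covariance S:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, Sigma = 200, np.array([1.0, -1.0]), np.array([[2.0, 0.7], [0.7, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=n)        # n x p data matrix (simulated)

mu_hat = X.mean(axis=0)                               # MLE of mu: the sample mean vector
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n                 # MLE of Sigma (divisor n)
S = centered.T @ centered / (n - 1)                   # usual sample covariance (divisor n-1)
print(mu_hat)
print(Sigma_hat)
print((n - 1) / n * S)                                # equals Sigma_hat
```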
Sampling distribution of the
MLE’s
Summary
The sampling distribution of $\bar{x}$ is p-variate normal with
$$\mu_{\bar{x}} = \mu \quad \text{and} \quad \Sigma_{\bar{x}} = \frac{1}{n}\Sigma.$$
The sampling distribution of the sample covariance matrix $S$, and of $\hat\Sigma = \frac{n-1}{n}S$, involves the Wishart distribution, introduced next.
The Wishart distribution
A multivariate generalization of the $\chi^2$ distribution
Definition: the p-variate Wishart distribution
Let $z_1, z_2, \dots, z_k$ be $k$ independent random p-vectors, each having a p-variate normal distribution with mean vector $0$ and covariance matrix $\Sigma$ ($p \times p$).
Let $U = z_1z_1' + z_2z_2' + \dots + z_kz_k'$ ($p \times p$).
Then $U$ is said to have the p-variate Wishart distribution with $k$ degrees of freedom and covariance matrix $\Sigma$:
$$U \sim W_p(k, \Sigma).$$
The density of the p-variate Wishart distribution
Suppose $U \sim W_p(k, \Sigma)$. Then the joint density of $U$ is:
$$f_U(u) = \frac{|u|^{(k-p-1)/2}\exp\left\{-\tfrac{1}{2}\operatorname{tr}\left(\Sigma^{-1}u\right)\right\}}{2^{kp/2}\,|\Sigma|^{k/2}\,\Gamma_p(k/2)}$$
where $\Gamma_p(\cdot)$ is the multivariate gamma function, i.e.
$$\Gamma_p(k/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\big((k+1-j)/2\big).$$
It can easily be checked that when $p = 1$ and $\Sigma = 1$, the Wishart distribution becomes the $\chi^2$ distribution with $k$ degrees of freedom.
Theorem
Suppose $U \sim W_p(k, \Sigma)$ and let $C$ denote a $q \times p$ matrix of rank $q \leq p$. Then
$$V = CUC' \sim W_q(k, C\Sigma C').$$
Corollary 1: $v = a'Ua \sim W_1(k, a'\Sigma a)$, i.e. $v \sim \sigma_a^2\chi_k^2$ with $\sigma_a^2 = a'\Sigma a$.
Corollary 2: If $u_{ii}$ = the $i^{\text{th}}$ diagonal element of $U$, then $u_{ii} \sim \sigma_{ii}\chi_k^2$, where $\Sigma = (\sigma_{ij})$.
Theorem
Suppose $U_1 \sim W_p(k_1, \Sigma)$ and $U_2 \sim W_p(k_2, \Sigma)$ are independent; then $V = U_1 + U_2 \sim W_p(k_1 + k_2, \Sigma)$.
Theorem
Suppose $U_1 \sim W_p(k_1, \Sigma)$ and $U_2$ are independent and $V = U_1 + U_2 \sim W_p(k, \Sigma)$ with $k > k_1$; then $U_2 \sim W_p(k - k_1, \Sigma)$.
Summary: Sampling distribution of MLE's for the multivariate Normal distribution
Let $x_1, x_2, \dots, x_n$ be a sample from $N_p(\mu, \Sigma)$. Then
$$\bar{x} \sim N_p\!\left(\mu, \tfrac{1}{n}\Sigma\right)$$
and
$$U = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})' = (n-1)S \sim W_p(n-1, \Sigma).$$
Also
$$\frac{u_{ii}}{\sigma_{ii}} = \frac{(n-1)s_{ii}}{\sigma_{ii}} \sim \chi^2(n-1).$$
Correlation
The sample covariance matrix:
$$S_{p\times p} = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}
\quad \text{where} \quad
s_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)(x_{kj} - \bar{x}_k).$$
The sample correlation matrix:
$$R_{p\times p} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}
\quad \text{where} \quad
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}s_{kk}}} = \frac{\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)(x_{kj} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2\sum_{j=1}^{n}(x_{kj} - \bar{x}_k)^2}}.$$
Note:
$$R = D^{-1}SD^{-1} \quad \text{where} \quad D_{p\times p} = \begin{pmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{pmatrix}.$$
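The relation $R = D^{-1}SD^{-1}$ is a one-line computation. A small sketch, assuming numpy and a made-up sample covariance matrix:

```python
import numpy as np

def correlation_from_covariance(S):
    """R = D^{-1} S D^{-1} with D = diag(sqrt(s11), ..., sqrt(spp))."""
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# made-up sample covariance matrix
S = np.array([[4.0, 1.2, 0.8],
              [1.2, 1.0, 0.3],
              [0.8, 0.3, 2.0]])
print(correlation_from_covariance(S))
```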
Tests for Independence and Non-zero Correlation
Test for zero correlation (independence between two variables)
The test statistic: $t = r_{ij}\sqrt{\dfrac{n-2}{1 - r_{ij}^2}}$.
If independence is true then the test statistic $t$ will have a t distribution with $\nu = n - 2$ degrees of freedom.
The test is to reject independence if $|t| > t_{\alpha/2}^{(n-2)}$.
Test for non-zero correlation ($H_0: \rho = \rho_0$)
The test statistic:
$$z = \frac{\tfrac{1}{2}\ln\left(\dfrac{1+r}{1-r}\right) - \tfrac{1}{2}\ln\left(\dfrac{1+\rho_0}{1-\rho_0}\right)}{\sqrt{1/(n-3)}}$$
If $H_0$ is true the test statistic $z$ will have approximately a standard Normal distribution.
We then reject $H_0$ if $|z| > z_{\alpha/2}$.
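Both tests above are easy to carry out directly. A minimal sketch, assuming numpy and scipy.stats and made-up values of r and n (np.arctanh(r) equals the Fisher transform 0.5*ln((1+r)/(1-r))):

```python
import numpy as np
from scipy.stats import t as t_dist, norm

def test_zero_correlation(r, n):
    """t = r*sqrt((n-2)/(1-r^2)); two-sided p-value on n-2 degrees of freedom."""
    t_stat = r * np.sqrt((n - 2) / (1 - r**2))
    p = 2 * t_dist.sf(abs(t_stat), df=n - 2)
    return t_stat, p

def test_correlation(r, rho0, n):
    """Fisher z test of H0: rho = rho0; approximately standard normal under H0."""
    z = (np.arctanh(r) - np.arctanh(rho0)) / np.sqrt(1 / (n - 3))
    p = 2 * norm.sf(abs(z))
    return z, p

print(test_zero_correlation(r=0.42, n=30))      # made-up numbers
print(test_correlation(r=0.42, rho0=0.2, n=30))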
Partial Correlation
Conditional Independence
Recall
If $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (with $x_1$ of dimension $q$ and $x_2$ of dimension $p-q$) has a p-variate Normal distribution with mean vector $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$,
the matrix $\Sigma_{2\cdot 1} = \Sigma_{22} - \Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$ is called the matrix of partial variances and covariances.
The $(i,j)^{\text{th}}$ element of the matrix $\Sigma_{2\cdot 1}$, $\sigma_{ij\cdot 1,2,\dots,q}$, is called the partial covariance (variance if $i = j$) between $x_i$ and $x_j$ given $x_1, \dots, x_q$.
$$\rho_{ij\cdot 1,2,\dots,q} = \frac{\sigma_{ij\cdot 1,2,\dots,q}}{\sqrt{\sigma_{ii\cdot 1,2,\dots,q}\,\sigma_{jj\cdot 1,2,\dots,q}}}$$
is called the partial correlation between $x_i$ and $x_j$ given $x_1, \dots, x_q$.
Let $S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}' & S_{22} \end{pmatrix}$ denote the sample covariance matrix, and let $S_{2\cdot 1} = S_{22} - S_{12}'S_{11}^{-1}S_{12}$.
The $(i,j)^{\text{th}}$ element of the matrix $S_{2\cdot 1}$, $s_{ij\cdot 1,2,\dots,q}$, is called the sample partial covariance (variance if $i = j$) between $x_i$ and $x_j$ given $x_1, \dots, x_q$.
Also
$$r_{ij\cdot 1,2,\dots,q} = \frac{s_{ij\cdot 1,2,\dots,q}}{\sqrt{s_{ii\cdot 1,2,\dots,q}\,s_{jj\cdot 1,2,\dots,q}}}$$
is called the sample partial correlation between $x_i$ and $x_j$ given $x_1, \dots, x_q$.
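The sample partial covariances and correlations above come straight out of the partitioned covariance matrix, and the zero-partial-correlation t test described on the next slides can then be applied to any entry. A minimal sketch, assuming numpy and scipy.stats, with a made-up 4-variable sample covariance matrix and q = 2 conditioning variables:

```python
import numpy as np
from scipy.stats import t as t_dist

def partial_correlations(S, q):
    """Partial covariances/correlations of the last variables given the first q.
    S is partitioned as [[S11, S12], [S12', S22]] with S11 of size q x q."""
    S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
    S22_1 = S22 - S12.T @ np.linalg.solve(S11, S12)    # sample partial variances/covariances
    d = np.sqrt(np.diag(S22_1))
    return S22_1, S22_1 / np.outer(d, d)               # partial correlation matrix

def test_zero_partial_correlation(r, n, q):
    """t = r*sqrt((n-q-2)/(1-r^2)) with n-q-2 d.f. (q = number of conditioning variables)."""
    t_stat = r * np.sqrt((n - q - 2) / (1 - r**2))
    return t_stat, 2 * t_dist.sf(abs(t_stat), df=n - q - 2)

# made-up sample covariance matrix; condition on the first 2 variables
S = np.array([[2.0, 0.5, 0.8, 0.4],
              [0.5, 1.0, 0.3, 0.2],
              [0.8, 0.3, 1.5, 0.6],
              [0.4, 0.2, 0.6, 1.2]])
S22_1, R22_1 = partial_correlations(S, q=2)
print(R22_1)
print(test_zero_partial_correlation(R22_1[0, 1], n=50, q=2))
```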
Test for zero partial correlation (conditional independence between two variables given a set of $p$ conditioning variables)
The test statistic: $t = r_{ij\cdot x_1,\dots,x_p}\sqrt{\dfrac{n-p-2}{1 - r_{ij\cdot x_1,\dots,x_p}^2}}$,
where $r_{ij\cdot x_1,\dots,x_p}$ = the partial correlation between $y_i$ and $y_j$ given $x_1, \dots, x_p$.
If independence is true then the test statistic $t$ will have a t distribution with $\nu = n - p - 2$ degrees of freedom. The test is to reject independence if $|t| > t_{\alpha/2}^{(n-p-2)}$.
Test for non-zero partial correlation
$$H_0: \rho_{ij\cdot x_1,\dots,x_p} = \rho^{0}_{ij\cdot x_1,\dots,x_p}$$
The test statistic:
$$z = \frac{\tfrac{1}{2}\ln\left(\dfrac{1 + r_{ij\cdot x_1,\dots,x_p}}{1 - r_{ij\cdot x_1,\dots,x_p}}\right) - \tfrac{1}{2}\ln\left(\dfrac{1 + \rho^{0}_{ij\cdot x_1,\dots,x_p}}{1 - \rho^{0}_{ij\cdot x_1,\dots,x_p}}\right)}{\sqrt{1/(n-p-3)}}$$
If $H_0$ is true the test statistic $z$ will have approximately a standard Normal distribution. We then reject $H_0$ if $|z| > z_{\alpha/2}$.
The Multiple Correlation Coefficient
Testing independence between a single variable and a group of variables
Definition
Suppose $x = \begin{pmatrix} y \\ x_1 \end{pmatrix}$ ($y$ of dimension 1, $x_1$ of dimension $p$) has a $(p+1)$-variate Normal distribution with mean vector $\mu = \begin{pmatrix} \mu_y \\ \mu_1 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \sigma_{yy} & \sigma_{1y}' \\ \sigma_{1y} & \Sigma_{11} \end{pmatrix}$.
We are interested in whether the variable $y$ is independent of the vector $x_1$.
The multiple correlation coefficient is the maximum correlation between $y$ and a linear combination of the components of $x_1$:
$$\rho_{y\cdot x_1, x_2, \dots, x_p} = \sqrt{\frac{\sigma_{1y}'\Sigma_{11}^{-1}\sigma_{1y}}{\sigma_{yy}}}$$
The sample multiple correlation coefficient
Let $S = \begin{pmatrix} s_{yy} & s_{1y}' \\ s_{1y} & S_{11} \end{pmatrix}$ denote the sample covariance matrix. Then the sample multiple correlation coefficient is
$$r_{y\cdot x_1, \dots, x_p} = \sqrt{\frac{s_{1y}'S_{11}^{-1}s_{1y}}{s_{yy}}}$$
Testing for independence between $y$ and $x_1$
The test statistic:
$$F = \frac{n-p-1}{p}\cdot\frac{r_{y\cdot x_1,\dots,x_p}^{2}}{1 - r_{y\cdot x_1,\dots,x_p}^{2}} = \frac{n-p-1}{p}\cdot\frac{s_{1y}'S_{11}^{-1}s_{1y}}{s_{yy} - s_{1y}'S_{11}^{-1}s_{1y}}$$
If independence is true then the test statistic $F$ will have an F distribution with $\nu_1 = p$ degrees of freedom in the numerator and $\nu_2 = n - p - 1$ degrees of freedom in the denominator.
The test is to reject independence if $F > F_\alpha(p, n-p-1)$.
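The sample multiple correlation coefficient and the F test above only require the partitioned sample covariance matrix. A minimal sketch, assuming numpy and scipy.stats, with y as the first variable of a made-up covariance matrix and a made-up sample size:

```python
import numpy as np
from scipy.stats import f as f_dist

def multiple_correlation_test(S, n):
    """y is the first variable in S; the remaining p variables are the predictors.
    Returns r_{y.x}, the F statistic and its p-value."""
    s_yy, s_1y, S11 = S[0, 0], S[1:, 0], S[1:, 1:]
    expl = s_1y @ np.linalg.solve(S11, s_1y)           # s1y' S11^{-1} s1y
    r2 = expl / s_yy
    p = S.shape[0] - 1
    F = (n - p - 1) / p * r2 / (1 - r2)
    return np.sqrt(r2), F, f_dist.sf(F, p, n - p - 1)

# made-up covariance of (y, x1, x2) from a sample of n = 40
S = np.array([[3.0, 1.1, 0.9],
              [1.1, 2.0, 0.4],
              [0.9, 0.4, 1.5]])
print(multiple_correlation_test(S, n=40))
```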
Canonical Correlation Analysis
The problem
Quite often one has collected data on several variables. The variables are grouped into two (or more) sets of variables, and the researcher is interested in whether one set of variables is independent of the other set.
In addition, if it is found that the two sets of variates are dependent, it is then important to describe and understand the nature of this dependence.
The appropriate statistical procedure in this case is called Canonical Correlation Analysis.
Definition (Canonical variates and canonical correlations)
Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (with $x_1$ of dimension $q$ and $x_2$ of dimension $p-q$) have a p-variate Normal distribution with $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$.
Let
$$U_1 = a_1'x_1 = a_1^{(1)}x_1 + \dots + a_q^{(1)}x_q$$
and
$$V_1 = b_1'x_2 = b_1^{(1)}x_{q+1} + \dots + b_{p-q}^{(1)}x_p$$
be such that $U_1$ and $V_1$ have achieved the maximum correlation $\phi_1$. Then $U_1$ and $V_1$ are called the first pair of canonical variates and $\phi_1$ is called the first canonical correlation coefficient.
The remaining canonical variates and canonical correlation coefficients
The second pair of canonical variates, $U_2 = a_2'x_1$ and $V_2 = b_2'x_2$, are found by choosing $a_2$ and $b_2$ so that
1. $(U_2, V_2)$ are independent of $(U_1, V_1)$, and
2. the correlation between $U_2$ and $V_2$ is maximized.
The correlation, $\phi_2$, between $U_2$ and $V_2$ is called the second canonical correlation coefficient.
The $i^{\text{th}}$ pair of canonical variates, $U_i = a_i'x_1$ and $V_i = b_i'x_2$, are found by choosing $a_i$ and $b_i$ so that
1. $(U_i, V_i)$ are independent of $(U_1, V_1), \dots, (U_{i-1}, V_{i-1})$, and
2. the correlation between $U_i$ and $V_i$ is maximized.
The correlation, $\phi_i$, between $U_i$ and $V_i$ is called the $i^{\text{th}}$ canonical correlation coefficient.
Coefficients for the $i^{\text{th}}$ pair of canonical variates: $a_i$ and $b_i$ are eigenvectors of the matrices
$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' \quad \text{and} \quad \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}$$
respectively, associated with the $i^{\text{th}}$ largest eigenvalue (the same for both matrices).
The $i^{\text{th}}$ largest eigenvalue of the two matrices is the square of the $i^{\text{th}}$ canonical correlation coefficient $\phi_i$:
$$\phi_i = \sqrt{\text{the } i^{\text{th}} \text{ largest eigenvalue of } \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'} = \sqrt{\text{the } i^{\text{th}} \text{ largest eigenvalue of } \Sigma_{22}^{-1}\Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}}.$$
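The eigenvalue characterization above translates directly into code. The following is a minimal sketch, assuming numpy and a made-up covariance matrix partitioned after the first q variables; the canonical correlations are the square roots of the eigenvalues of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'$.

```python
import numpy as np

def canonical_correlations(S, q):
    """Canonical correlations from a covariance matrix partitioned after the first q variables."""
    S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)   # S11^{-1} S12 S22^{-1} S12'
    eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(eigvals, 0.0, 1.0))

# made-up covariance matrix for (x1, x2 | x3, x4)
S = np.array([[1.0, 0.4, 0.5, 0.3],
              [0.4, 1.0, 0.2, 0.4],
              [0.5, 0.2, 1.0, 0.3],
              [0.3, 0.4, 0.3, 1.0]])
print(canonical_correlations(S, q=2))
```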
S12
Inference for the mean vector
Univariate Inference
Let $x_1, x_2, \dots, x_n$ denote a sample of n from the normal distribution with mean $\mu$ and variance $\sigma^2$.
Suppose we want to test
$$H_0: \mu = \mu_0 \quad \text{vs} \quad H_A: \mu \neq \mu_0.$$
The appropriate test is the t test. The test statistic:
$$t = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s}.$$
Reject $H_0$ if $|t| > t_{\alpha/2}$.
The multivariate Test
Let $x_1, x_2, \dots, x_n$ denote a sample of n from the p-variate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
Suppose we want to test
$$H_0: \mu = \mu_0 \quad \text{vs} \quad H_A: \mu \neq \mu_0.$$
Roy's Union-Intersection Principle
This is a general procedure for developing a multivariate test from the corresponding univariate test.
1. Convert the multivariate problem to a univariate problem by considering an arbitrary linear combination of the observation vector, i.e. for the observation vector $X = (X_1, \dots, X_p)'$ consider the arbitrary linear combination $U = a'X = a_1X_1 + \dots + a_pX_p$.
2. Perform the test for the arbitrary linear combination of the observation vector.
3. Repeat this for all possible choices of $a = (a_1, \dots, a_p)'$.
4. Reject the multivariate hypothesis if $H_0$ is rejected for any one of the choices for $a$.
5. Accept the multivariate hypothesis if $H_0$ is accepted for all of the choices for $a$.
6. Set the type I error rate for the individual tests so that the type I error rate for the multivariate test is $\alpha$.
Hotelling's $T^2$ statistic
We reject $H_0: \mu = \mu_0$ if
$$T^2 = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) > T^2_\alpha.$$
To determine $T^2_\alpha$: it turns out that if $H_0$ is true then
$$F = \frac{n-p}{p(n-1)}\,T^2 = \frac{(n-p)\,n}{p(n-1)}\,(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0)$$
has an F distribution with $\nu_1 = p$ and $\nu_2 = n - p$.
Hotelling's $T^2$ test
We reject $H_0: \mu = \mu_0$ if
$$F = \frac{n-p}{p(n-1)}\,T^2 > F_\alpha(p, n-p),$$
or equivalently if
$$T^2 = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) > \frac{p(n-1)}{n-p}\,F_\alpha(p, n-p) = T^2_\alpha.$$
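The one-sample test above is a few lines of linear algebra once the sample mean and covariance are computed. A minimal sketch, assuming numpy and scipy.stats and using simulated data with made-up parameters:

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_one_sample(X, mu0):
    """One-sample Hotelling T^2 test of H0: mu = mu0."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                        # sample covariance (divisor n-1)
    d = xbar - mu0
    T2 = n * d @ np.linalg.solve(S, d)
    F = (n - p) / (p * (n - 1)) * T2
    return T2, F, f_dist.sf(F, p, n - p)

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.2, 0.1, 0.0], np.eye(3), size=30)   # made-up data
print(hotelling_one_sample(X, mu0=np.zeros(3)))
```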
Simultaneous Inference for means
Recall (using Roy's union-intersection principle)
$$T^2 = n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) = \max_{a} t^2(a) = \max_{a}\frac{n(a'\bar{x} - a'\mu)^2}{a'Sa}.$$
Now
$$1 - \alpha = P\left[T^2 \leq T^2_\alpha\right] = P\left[\max_{a}\frac{n(a'\bar{x} - a'\mu)^2}{a'Sa} \leq T^2_\alpha\right]
= P\left[\frac{n(a'\bar{x} - a'\mu)^2}{a'Sa} \leq T^2_\alpha \text{ for all } a\right]
= P\left[|a'\bar{x} - a'\mu| \leq \sqrt{\frac{a'Sa}{n}}\,T_\alpha \text{ for all } a\right].$$
Thus
$$P\left[a'\bar{x} - \sqrt{\frac{a'Sa}{n}}\,T_\alpha \leq a'\mu \leq a'\bar{x} + \sqrt{\frac{a'Sa}{n}}\,T_\alpha \text{ for all } a\right] = 1 - \alpha$$
and the set of intervals
$$a'\bar{x} - \sqrt{\frac{a'Sa}{n}}\,T_\alpha \quad \text{to} \quad a'\bar{x} + \sqrt{\frac{a'Sa}{n}}\,T_\alpha$$
form a set of $(1-\alpha)100\%$ simultaneous confidence intervals for $a'\mu$.
Recall
$$T^2_\alpha = \frac{(n-1)p}{n-p}\,F_\alpha^{(p,\,n-p)}.$$
Thus the set of $(1-\alpha)100\%$ simultaneous confidence intervals for $a'\mu$ is
$$a'\bar{x} - \sqrt{\frac{a'Sa}{n}\cdot\frac{(n-1)p}{n-p}F_\alpha^{(p,\,n-p)}} \quad \text{to} \quad a'\bar{x} + \sqrt{\frac{a'Sa}{n}\cdot\frac{(n-1)p}{n-p}F_\alpha^{(p,\,n-p)}}.$$
The two sample problem
The multivariate Test
Let $x_1, x_2, \dots, x_n$ denote a sample of n from the p-variate normal distribution with mean vector $\mu_x$ and covariance matrix $\Sigma$.
Let $y_1, y_2, \dots, y_m$ denote a sample of m from the p-variate normal distribution with mean vector $\mu_y$ and covariance matrix $\Sigma$.
Suppose we want to test
$$H_0: \mu_x = \mu_y \quad \text{vs} \quad H_A: \mu_x \neq \mu_y.$$
Hotelling's $T^2$ statistic for the two sample problem:
$$T^2 = (\bar{x} - \bar{y})'\left[\left(\frac{1}{n} + \frac{1}{m}\right)S_{\text{pooled}}\right]^{-1}(\bar{x} - \bar{y})
\quad \text{where} \quad
S_{\text{pooled}} = \frac{n-1}{n+m-2}S_x + \frac{m-1}{n+m-2}S_y.$$
If $H_0$ is true then
$$F = \frac{n+m-p-1}{p(n+m-2)}\,T^2$$
has an F distribution with $\nu_1 = p$ and $\nu_2 = n + m - p - 1$.
Thus, Hotelling's $T^2$ test: we reject $H_0: \mu_x = \mu_y$ if
$$F = \frac{n+m-p-1}{p(n+m-2)}\,T^2 > F_\alpha(p,\, n+m-p-1)$$
with $T^2$ as above.
Simultaneous inference for the two-sample problem
• Hotelling's $T^2$ statistic can be shown to have been derived by Roy's union-intersection principle, namely
$$T^2 = (\bar{x} - \bar{y} - \delta)'\left[\left(\tfrac{1}{n} + \tfrac{1}{m}\right)S_{\text{pooled}}\right]^{-1}(\bar{x} - \bar{y} - \delta)
= \max_{a} t^2(a)
= \max_{a}\frac{\left[a'(\bar{x} - \bar{y}) - a'\delta\right]^2}{a'S_{\text{pooled}}a\left(\tfrac{1}{n} + \tfrac{1}{m}\right)}
\quad \text{where } \delta = \mu_x - \mu_y.$$
Thus
$$1 - \alpha = P\left[F \leq F_\alpha(p,\, n+m-p-1)\right]
= P\left[T^2 \leq \frac{p(n+m-2)}{n+m-p-1}F_\alpha(p,\, n+m-p-1)\right]
= P\left[T^2 \leq T^2_\alpha\right]$$
where $T^2_\alpha = \dfrac{p(n+m-2)}{n+m-p-1}F_\alpha(p,\, n+m-p-1)$.
Thus
$$P\left[\max_{a}\frac{\left[a'(\bar{x} - \bar{y}) - a'\delta\right]^2}{a'S_{\text{pooled}}a\left(\tfrac{1}{n} + \tfrac{1}{m}\right)} \leq T^2_\alpha\right] = 1 - \alpha,$$
or
$$P\left[\left[a'(\bar{x} - \bar{y}) - a'\delta\right]^2 \leq T^2_\alpha\, a'S_{\text{pooled}}a\left(\tfrac{1}{n} + \tfrac{1}{m}\right) \text{ for all } a\right] = 1 - \alpha.$$
Hence
$$P\left[a'(\bar{x} - \bar{y}) - T_\alpha\sqrt{a'S_{\text{pooled}}a\left(\tfrac{1}{n} + \tfrac{1}{m}\right)} \leq a'(\mu_x - \mu_y) \leq a'(\bar{x} - \bar{y}) + T_\alpha\sqrt{a'S_{\text{pooled}}a\left(\tfrac{1}{n} + \tfrac{1}{m}\right)} \text{ for all } a\right] = 1 - \alpha.$$
Thus
$$a'(\bar{x} - \bar{y}) \pm T_\alpha\sqrt{a'S_{\text{pooled}}a\left(\tfrac{1}{n} + \tfrac{1}{m}\right)}$$
form $(1-\alpha)$ simultaneous confidence intervals for $a'(\mu_x - \mu_y)$.
MANOVA
Multivariate Analysis of Variance
One way Multivariate Analysis
of Variance (MANOVA)
Comparing k p-variate Normal
Populations
The F test – for comparing k means
Situation
• We have k normal populations.
• Let $\mu_i$ and $\Sigma$ denote the mean vector and covariance matrix of population $i$, $i = 1, 2, 3, \dots, k$.
• Note: we assume that the covariance matrix for each population is the same: $\Sigma_1 = \Sigma_2 = \dots = \Sigma_k = \Sigma$.
We want to test
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$
against
$$H_A: \mu_i \neq \mu_j \text{ for at least one pair } (i,j).$$
The data
• Assume we have collected data from each of the k populations.
• Let $x_{i1}, x_{i2}, \dots, x_{in}$ denote the n observations from population $i$, $i = 1, 2, 3, \dots, k$.
Computing Formulae:
Compute
1) $T_i = \sum_{j=1}^{n} x_{ij} = \begin{pmatrix} \sum_{j=1}^{n} x_{1ij} \\ \vdots \\ \sum_{j=1}^{n} x_{pij} \end{pmatrix} = \begin{pmatrix} T_{1i} \\ \vdots \\ T_{pi} \end{pmatrix}$ = total vector for sample $i$.
2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n} x_{ij} = \begin{pmatrix} G_1 \\ \vdots \\ G_p \end{pmatrix}$ = grand total vector.
3) $N = kn$ = total sample size.
4) $\sum_{i=1}^{k}\sum_{j=1}^{n} x_{ij}x_{ij}'$, the matrix whose $(l,m)$ element is $\sum_{i=1}^{k}\sum_{j=1}^{n} x_{lij}x_{mij}$ (e.g. $\sum_i\sum_j x_{1ij}^2$ in the $(1,1)$ position and $\sum_i\sum_j x_{1ij}x_{pij}$ in the $(1,p)$ position).
5) $\dfrac{1}{n}\sum_{i=1}^{k} T_iT_i'$, the matrix whose $(l,m)$ element is $\dfrac{1}{n}\sum_{i=1}^{k} T_{li}T_{mi}$.
Let
$$H = \frac{1}{n}\sum_{i=1}^{k} T_iT_i' - \frac{1}{N}GG'.$$
Its $(l,m)$ element is
$$\frac{1}{n}\sum_{i=1}^{k} T_{li}T_{mi} - \frac{G_lG_m}{N} = n\sum_{i=1}^{k}(\bar{x}_{li} - \bar{x}_l)(\bar{x}_{mi} - \bar{x}_m),$$
so, for example, the $(1,1)$ element is $n\sum_{i=1}^{k}(\bar{x}_{1i} - \bar{x}_1)^2$ and the $(1,p)$ element is $n\sum_{i=1}^{k}(\bar{x}_{1i} - \bar{x}_1)(\bar{x}_{pi} - \bar{x}_p)$.
$H$ = the Between SS and SP matrix.
Let
$$E = \sum_{i=1}^{k}\sum_{j=1}^{n} x_{ij}x_{ij}' - \frac{1}{n}\sum_{i=1}^{k} T_iT_i'.$$
Its $(l,m)$ element is
$$\sum_{i=1}^{k}\sum_{j=1}^{n} x_{lij}x_{mij} - \frac{1}{n}\sum_{i=1}^{k} T_{li}T_{mi} = \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{lij} - \bar{x}_{li})(x_{mij} - \bar{x}_{mi}),$$
so, for example, the $(1,1)$ element is $\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{1ij} - \bar{x}_{1i})^2$ and the $(p,p)$ element is $\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{pij} - \bar{x}_{pi})^2$.
$E$ = the Within SS and SP matrix.
The MANOVA Table
Source: Between; SS and SP matrix: $H = \begin{pmatrix} h_{11} & \cdots & h_{1p} \\ \vdots & & \vdots \\ h_{1p} & \cdots & h_{pp} \end{pmatrix}$
Source: Within; SS and SP matrix: $E = \begin{pmatrix} e_{11} & \cdots & e_{1p} \\ \vdots & & \vdots \\ e_{1p} & \cdots & e_{pp} \end{pmatrix}$
There are several test statistics for testing
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k \quad \text{against} \quad H_A: \mu_i \neq \mu_j \text{ for at least one pair } (i,j).$$
1. Roy's largest root: $\lambda_1$ = largest eigenvalue of $HE^{-1}$. This test statistic is derived using Roy's union-intersection principle.
2. Wilks' lambda ($\Lambda$): $\Lambda = \dfrac{|E|}{|H + E|} = \dfrac{1}{|HE^{-1} + I|}$. This test statistic is derived using the generalized likelihood ratio principle.
3. Lawley-Hotelling trace statistic: $T_0^2 = \operatorname{tr}(HE^{-1})$ = sum of the eigenvalues of $HE^{-1}$.
4. Pillai trace statistic ($V$): $V = \operatorname{tr}\left[H(H + E)^{-1}\right]$.
Profile Analysis
Definition
• Let X1, X2, … , Xp denote p jointly distributed variables
under study
• Let $\mu_1, \mu_2, \dots, \mu_p$ denote the means of these variables.
• The profile of these variables is a plot of $\mu_i$ vs $i$.
The multivariate Test
Let $x_1, \dots, x_n$ denote a sample of n from the p-variate normal distribution with mean vector $\mu_x$ and covariance matrix $\Sigma$, and let $y_1, \dots, y_m$ denote a sample of m from the p-variate normal distribution with mean vector $\mu_y$ and covariance matrix $\Sigma$. Suppose we want to test
$$H_0: \mu_x = \mu_y \quad \text{vs} \quad H_A: \mu_x \neq \mu_y.$$
Hotelling's $T^2$ statistic for the two sample problem:
$$T^2 = (\bar{x} - \bar{y})'\left[\left(\tfrac{1}{n} + \tfrac{1}{m}\right)S_{\text{pooled}}\right]^{-1}(\bar{x} - \bar{y}),
\qquad
S_{\text{pooled}} = \frac{n-1}{n+m-2}S_x + \frac{m-1}{n+m-2}S_y.$$
If $H_0$ is true then $F = \dfrac{n+m-p-1}{p(n+m-2)}T^2$ has an F distribution with $\nu_1 = p$ and $\nu_2 = n+m-p-1$.
Profile Comparison
[Figure: the profiles of Group A and Group B plotted against variables 1, 2, 3, ..., p]
Hotelling's $T^2$ test tests
$H_0$: equality of profiles against $H_A$: different profiles.
Profile Analysis: Parallelism
[Figure: variables not interacting with groups (parallelism) - the group profiles are parallel across variables 1, 2, 3, ..., p]
[Figure: variables interacting with groups (lack of parallelism) - the group profiles cross or diverge across variables 1, 2, 3, ..., p]
Parallelism
• Group differences are constant across variables.
Lack of Parallelism
• Group differences are variable dependent.
• The differences between groups are not the same for each variable.
Test for parallelism
Let $x_1, \dots, x_n$ denote a sample of n from the p-variate normal distribution with mean vector $\mu_x$ and covariance matrix $\Sigma$, and let $y_1, \dots, y_m$ denote a sample of m from the p-variate normal distribution with mean vector $\mu_y$ and covariance matrix $\Sigma$.
Let
$$C_{(p-1)\times p} = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix}.$$
Then
$$CX = C\begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} = \begin{pmatrix} X_1 - X_2 \\ X_2 - X_3 \\ \vdots \\ X_{p-1} - X_p \end{pmatrix}.$$
The test for parallelism is
$$H_0: C\mu_x = C\mu_y \quad \text{vs} \quad H_A: C\mu_x \neq C\mu_y.$$
Consider the data $Cx_1, Cx_2, \dots, Cx_n$. This is a sample of n from the $(p-1)$-variate normal distribution with mean vector $C\mu_x$ and covariance matrix $C\Sigma C'$. Also $Cy_1, Cy_2, \dots, Cy_m$ is a sample of m from the $(p-1)$-variate normal distribution with mean vector $C\mu_y$ and covariance matrix $C\Sigma C'$.
Hotelling's $T^2$ test for parallelism:
$$T^2 = \frac{nm}{n+m}\,(C\bar{x} - C\bar{y})'\left(CS_{\text{pooled}}C'\right)^{-1}(C\bar{x} - C\bar{y}).$$
If $H_0$ is true then
$$F = \frac{n+m-p}{(p-1)(n+m-2)}\,T^2$$
has an F distribution with $\nu_1 = p - 1$ and $\nu_2 = n + m - p$. Thus we reject $H_0$ if $F > F_\alpha$ with $\nu_1 = p - 1$ and $\nu_2 = n + m - p$.
To perform the test for parallelism, compute differences of successive variables for each case in each group and perform the two-sample Hotelling's $T^2$ test.
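In code, the parallelism test is the two-sample Hotelling test applied to successive differences. A minimal sketch, assuming numpy and scipy.stats, with simulated profiles whose made-up mean vectors are parallel:

```python
import numpy as np
from scipy.stats import f as f_dist

def successive_difference_matrix(p):
    """(p-1) x p contrast matrix C taking differences of successive variables."""
    C = np.zeros((p - 1, p))
    for i in range(p - 1):
        C[i, i], C[i, i + 1] = 1.0, -1.0
    return C

def parallelism_test(X, Y):
    """Two-sample Hotelling T^2 test on successive differences (test for parallel profiles)."""
    p = X.shape[1]
    C = successive_difference_matrix(p)
    dX, dY = X @ C.T, Y @ C.T                          # differences for each case
    n, m = dX.shape[0], dY.shape[0]
    d = dX.mean(axis=0) - dY.mean(axis=0)
    Sp = ((n - 1) * np.cov(dX, rowvar=False) + (m - 1) * np.cov(dY, rowvar=False)) / (n + m - 2)
    T2 = d @ np.linalg.solve((1 / n + 1 / m) * Sp, d)
    F = (n + m - p) / ((p - 1) * (n + m - 2)) * T2
    return T2, F, f_dist.sf(F, p - 1, n + m - p)

rng = np.random.default_rng(4)
X = rng.multivariate_normal([1, 2, 3, 4], np.eye(4), size=20)      # made-up profiles
Y = rng.multivariate_normal([2, 3, 4, 5], np.eye(4), size=25)      # parallel to X
print(parallelism_test(X, Y))
```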
Test for Equality of Groups (Parallelism assumed)
[Figure: equal group profiles plotted against variables 1, 2, 3, ..., p]
If parallelism is established, it is appropriate to test for equality of profiles:
$$H_0: \tfrac{1}{p}(\mu_{x1} + \dots + \mu_{xp}) = \tfrac{1}{p}(\mu_{y1} + \dots + \mu_{yp})
\quad \text{vs} \quad
H_A: \tfrac{1}{p}(\mu_{x1} + \dots + \mu_{xp}) \neq \tfrac{1}{p}(\mu_{y1} + \dots + \mu_{yp}),$$
i.e.
$$H_0: \tfrac{1}{p}\mathbf{1}'\mu_x = \tfrac{1}{p}\mathbf{1}'\mu_y \quad \text{vs} \quad H_A: \tfrac{1}{p}\mathbf{1}'\mu_x \neq \tfrac{1}{p}\mathbf{1}'\mu_y.$$
The t test:
$$t = \sqrt{\frac{nm}{n+m}}\;\frac{\tfrac{1}{p}\mathbf{1}'\bar{x} - \tfrac{1}{p}\mathbf{1}'\bar{y}}{\sqrt{\tfrac{1}{p}\mathbf{1}'S_{\text{pooled}}\mathbf{1}\,\tfrac{1}{p}}}
= \sqrt{\frac{nm}{n+m}}\;\frac{\mathbf{1}'\bar{x} - \mathbf{1}'\bar{y}}{\sqrt{\mathbf{1}'S_{\text{pooled}}\mathbf{1}}}.$$
Thus we reject $H_0$ if $|t| > t_{\alpha/2}$ with $df = \nu = n + m - 2$.
To perform this test, average all the variables for each case in each group and perform the two-sample t-test.
Test for equality of variables (Parallelism assumed)
[Figure: equal variable means - flat group profiles plotted against variables 1, 2, 3, ..., p]
Let $C$ be the $(p-1)\times p$ matrix of successive differences defined above, so that
$$CX = \begin{pmatrix} X_1 - X_2 \\ X_2 - X_3 \\ \vdots \\ X_{p-1} - X_p \end{pmatrix}.$$
The test for equality of variables for the first group is
$$H_0: C\mu_x = 0 \quad \text{vs} \quad H_A: C\mu_x \neq 0.$$
Consider the data $Cx_1, Cx_2, \dots, Cx_n$. This is a sample of n from the $(p-1)$-variate normal distribution with mean vector $C\mu_x$ and covariance matrix $C\Sigma C'$.
Hotelling's $T^2$ test for equality of variables:
$$T^2 = n(C\bar{x} - 0)'\left(CS_{\text{pooled}}C'\right)^{-1}(C\bar{x} - 0) = n(C\bar{x})'\left(CS_{\text{pooled}}C'\right)^{-1}(C\bar{x}).$$
If $H_0$ is true then
$$F = \frac{n-p+1}{(p-1)(n-1)}\,T^2$$
has an F distribution with $\nu_1 = p - 1$ and $\nu_2 = n - p + 1$. Thus we reject $H_0$ if $F > F_\alpha$ with $\nu_1 = p - 1$ and $\nu_2 = n - p + 1$.
To perform the test, compute differences of successive variables for each case in the group and perform the one-sample Hotelling's $T^2$ test for a zero mean vector. A similar test can be performed for the second sample. Neither of these tests assumes parallelism.
If parallelism is assumed, then $Cx_1, Cx_2, \dots, Cx_n, Cy_1, Cy_2, \dots, Cy_m$ is a sample of $n + m$ from the $(p-1)$-variate normal distribution with mean vector $C\mu_x = C\mu_y$ and covariance matrix $C\Sigma C'$.
The test for equality of variables is
$$H_0: C\mu_x = C\mu_y = 0 \quad \text{vs} \quad H_A: C\mu_x = C\mu_y \neq 0.$$
Hotelling's $T^2$ test for equality of variables:
$$T^2 = \frac{1}{n+m}\,(nC\bar{x} + mC\bar{y})'\left(CS_{\text{pooled}}C'\right)^{-1}(nC\bar{x} + mC\bar{y}).$$
If $H_0$ is true then
$$F = \frac{n+m-p}{(p-1)(n+m-2)}\,T^2$$
has an F distribution with $\nu_1 = p - 1$ and $\nu_2 = n + m - p$. Thus we reject $H_0$ if $F > F_\alpha$ with $\nu_1 = p - 1$ and $\nu_2 = n + m - p$.
To perform this test:
1. Compute differences of successive variables for each case in each group.
2. Combine the two samples into a single sample of $n + m$.
3. Perform the single-sample Hotelling's $T^2$ test for a zero mean vector.
Repeated Measures Designs
In a Repeated Measures Design
We have experimental units that
• may be grouped according to one or several
factors (the grouping factors)
Then on each experimental unit we have
• not a single measurement but a group of
measurements (the repeated measures)
• The repeated measures may be taken at
combinations of levels of one or several
factors (The repeated measures factors)
The ANOVA Model for a simple repeated measures design
Repeated measures on subjects:
subject 1: $y_{11}, y_{12}, y_{13}, \dots, y_{1t}$
subject 2: $y_{21}, y_{22}, y_{23}, \dots, y_{2t}$
$\vdots$
subject n: $y_{n1}, y_{n2}, y_{n3}, \dots, y_{nt}$
The Model
$y_{ij}$ = the $j^{\text{th}}$ repeated measure on the $i^{\text{th}}$ subject:
$$y_{ij} = \mu + \alpha_i + \tau_j + \varepsilon_{ij}$$
where $\mu$ = the mean effect, $\alpha_i$ = the effect of subject $i$ with $\alpha_i \sim N(0, \sigma_\alpha^2)$, $\tau_j$ = the effect of time $j$ with $\sum_{j=1}^{t}\tau_j = 0$, and $\varepsilon_{ij}$ = random error with $\varepsilon_{ij} \sim N(0, \sigma^2)$.
The Analysis of Variance
The Sums of Squares
1. $SS_{\text{Subject}} = t\sum_{i=1}^{n}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$, used to measure the variability of $\alpha_i$ (between-subject variability).
2. $SS_{\text{Time}} = n\sum_{j=1}^{t}(\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$, used to test for the differences in $\tau_j$ (time).
3. $SS_{\text{Error}} = \sum_{i=1}^{n}\sum_{j=1}^{t}(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot})^2$, used to measure the variability of $\varepsilon_{ij}$ (within-subject variability).
ANOVA table - repeated measures (no grouping factor, 1 repeated measures factor (time))
Source: Between Subjects; S.S.: $SS_{\text{Subject}}$; d.f.: $n - 1$; M.S.: $MS_{\text{Subject}}$
Source: Time; S.S.: $SS_{\text{Time}}$; d.f.: $t - 1$; M.S.: $MS_{\text{Time}}$; F: $MS_{\text{Time}}/MS_{\text{Error}}$
Source: Within Subject Error; S.S.: $SS_{\text{Error}}$; d.f.: $(n - 1)(t - 1)$; M.S.: $MS_{\text{Error}}$
The general Repeated Measures
Design
g groups of n subjects
t repeated measures
In a Repeated Measures Design
We have experimental units that
• may be grouped according to one or several
factors (the grouping factors – df = g - 1)
Then on each experimental unit we have
• not a single measurement but a group of
measurements (the repeated measures)
• The repeated measures may be taken at
combinations of levels of one or several factors
(The repeated measures factors – df = t - 1)
• There are also the interaction effects between the
grouping and repeated measures factors – df =
(g -1)(t -1)
The Model - Repeated Measures Design
y (observation) = μ (mean)
+ main effects and interactions of grouping factors
+ ε₁ (between-subject error)
+ main effects and interactions of repeated measures factors
+ interactions of grouping and repeated measures factors
+ ε₂ (within-subject error)
ANOVA table for the general repeated measures design
Source: Main effects and interactions of grouping factors; d.f.: $g - 1$
Source: Between subject Error; d.f.: $g(n - 1)$
Source: Main effects and interactions of repeated measures factors; d.f.: $t - 1$
Source: Interactions of grouping factors with repeated measures factors; d.f.: $(t - 1)(g - 1)$
Source: Within subject Error; d.f.: $g(t - 1)(n - 1)$
The Multivariate Model for a
Repeated measures design
The ANOVA (univariate) Model
$y_{ij}$ = the $j^{\text{th}}$ repeated measure on the $i^{\text{th}}$ subject:
$$y_{ij} = \mu + \alpha_i + \tau_j + \varepsilon_{ij}$$
where $\mu$ = the mean effect, $\alpha_i$ = the effect of subject $i$ with $\alpha_i \sim N(0, \sigma_\alpha^2)$, $\tau_j$ = the effect of time $j$ with $\sum_{j=1}^{t}\tau_j = 0$, and $\varepsilon_{ij}$ = random error with $\varepsilon_{ij} \sim N(0, \sigma^2)$.
Implications of the ANOVA (univariate) model
$$\mu_j = \text{the mean of } y_{ij} = E(y_{ij}) = E(\mu) + E(\alpha_i) + E(\tau_j) + E(\varepsilon_{ij}) = \mu + 0 + \tau_j + 0 = \mu + \tau_j$$
$$\operatorname{var}(y_{ij}) = E\left[(y_{ij} - \mu_j)^2\right] = E\left[(\alpha_i + \varepsilon_{ij})^2\right] = E\left[\alpha_i^2 + 2\alpha_i\varepsilon_{ij} + \varepsilon_{ij}^2\right] = \sigma_\alpha^2 + \sigma^2$$
$$\operatorname{cov}(y_{ij}, y_{ij'}) = E\left[(y_{ij} - \mu_j)(y_{ij'} - \mu_{j'})\right] = E\left[(\alpha_i + \varepsilon_{ij})(\alpha_i + \varepsilon_{ij'})\right] = E\left[\alpha_i^2 + \alpha_i\varepsilon_{ij'} + \alpha_i\varepsilon_{ij} + \varepsilon_{ij}\varepsilon_{ij'}\right] = \sigma_\alpha^2$$
$$\rho = \text{correlation between } y_{ij} \text{ and } y_{ij'} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma^2}$$
The implication of the ANOVA model for a repeated measures design is that the correlation between repeated measures is constant.
The multivariate model for a repeated measures design
Let $y_1, y_2, \dots, y_n$ denote a sample of n from the t-variate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Here
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1t} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2t} \\ \vdots & \vdots & & \vdots \\ \sigma_{1t} & \sigma_{2t} & \cdots & \sigma_{tt} \end{pmatrix}.$$
This allows an arbitrary correlation structure amongst the repeated measures $y_{i1}, y_{i2}, \dots, y_{it}$.
Test for equality of repeated measures
[Figure: equal repeated-measure means plotted against repeated measures 1, 2, 3, ..., t]
Let $C$ be the $(t-1)\times t$ matrix of successive differences, so that
$$CY = C\begin{pmatrix} Y_1 \\ \vdots \\ Y_t \end{pmatrix} = \begin{pmatrix} Y_1 - Y_2 \\ Y_2 - Y_3 \\ \vdots \\ Y_{t-1} - Y_t \end{pmatrix}.$$
The test for equality of repeated measures is
$$H_0: C\mu = 0 \quad \text{vs} \quad H_A: C\mu \neq 0.$$
Consider the data $Cy_1, Cy_2, \dots, Cy_n$. This is a sample of n from the $(t-1)$-variate normal distribution with mean vector $C\mu$ and covariance matrix $C\Sigma C'$.
Hotelling's $T^2$ test for equality of repeated measures:
$$T^2 = n(C\bar{y} - 0)'\left(CSC'\right)^{-1}(C\bar{y} - 0) = n(C\bar{y})'\left(CSC'\right)^{-1}(C\bar{y}).$$
If $H_0$ is true then
$$F = \frac{n-t+1}{(t-1)(n-1)}\,T^2$$
has an F distribution with $\nu_1 = t - 1$ and $\nu_2 = n - t + 1$. Thus we reject $H_0$ if $F > F_\alpha$ with $\nu_1 = t - 1$ and $\nu_2 = n - t + 1$.
To perform the test, compute differences of successive repeated measures for each case and perform the one-sample Hotelling's $T^2$ test for a zero mean vector.
Techniques for studying correlation and
covariance structure
Principal Components Analysis (PCA)
Factor Analysis
Principal Component Analysis
Let $x$ have a p-variate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
Definition:
The linear combination
$$C_1 = a_1x_1 + \dots + a_px_p = a'x$$
is called the first principal component if $a' = (a_1, \dots, a_p)$ is chosen to maximize
$$\operatorname{Var}(C_1) = \operatorname{Var}(a'x) = a'\Sigma a$$
subject to
$$a'a = a_1^2 + \dots + a_p^2 = 1.$$
The complete set of Principal components
Let $x$ have a p-variate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
Definition:
The set of linear combinations
$$C_1 = a_{11}x_1 + \dots + a_{1p}x_p = a_1'x, \quad \dots, \quad C_p = a_{p1}x_1 + \dots + a_{pp}x_p = a_p'x$$
are called the principal components of $x$ if the $a_i' = (a_{i1}, \dots, a_{ip})$ are chosen such that $a_i'a_i = a_{i1}^2 + \dots + a_{ip}^2 = 1$ and
1. $\operatorname{Var}(C_1)$ is maximized;
2. $\operatorname{Var}(C_i)$ is maximized subject to $C_i$ being independent of $C_1, \dots, C_{i-1}$ (the previous $i - 1$ principal components).
Result
$a_i' = (a_{i1}, \dots, a_{ip})$ is the eigenvector of $\Sigma$ associated with the $i^{\text{th}}$ largest eigenvalue, $\lambda_i$, of the covariance matrix, and
$$\operatorname{Var}(C_i) = \operatorname{Var}(a_i'x) = a_i'\Sigma a_i = \lambda_i.$$
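Since the principal components are just the eigenvectors and eigenvalues of the covariance matrix, they can be obtained in a few lines. A minimal sketch, assuming numpy and a made-up covariance matrix; it also reports the proportion of variance explained, which is discussed on the following slides.

```python
import numpy as np

def principal_components(Sigma):
    """Eigen-decomposition of a covariance matrix: loadings a_i and variances lambda_i."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)            # symmetric eigenproblem, ascending order
    order = np.argsort(eigvals)[::-1]
    lam, A = eigvals[order], eigvecs[:, order]          # columns of A are a_1, ..., a_p
    explained = lam / lam.sum()                         # proportion of total variance
    return lam, A, explained

# made-up covariance matrix
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
lam, A, explained = principal_components(Sigma)
print(lam)          # Var(C_i)
print(explained)    # proportion of variance explained by each component
```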
Recall that any positive definite matrix $\Sigma$ can be written
$$\Sigma = (a_1, \dots, a_p)\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}\begin{pmatrix} a_1' \\ \vdots \\ a_p' \end{pmatrix} = PDP'$$
where $a_1, \dots, a_p$ are eigenvectors of $\Sigma$ of length 1, $\lambda_1 \geq \dots \geq \lambda_p \geq 0$ are the eigenvalues of $\Sigma$, and $P = (a_1, \dots, a_p)$ is an orthogonal matrix ($P'P = PP' = I$).
Graphical Picture of Principal Components
Multivariate Normal data falls in an ellipsoidal pattern. The shape and orientation of the ellipsoid is determined by the covariance matrix $\Sigma$. The eigenvectors of $\Sigma$ are vectors giving the directions of the axes of the ellipsoid; the eigenvalues give the lengths of these axes.
Recall that if $\Sigma$ is a positive definite matrix,
$$\Sigma = \lambda_1a_1a_1' + \dots + \lambda_pa_pa_p' = (a_1, \dots, a_p)\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}\begin{pmatrix} a_1' \\ \vdots \\ a_p' \end{pmatrix} = PDP'$$
where $P$ is an orthogonal matrix ($P'P = PP' = I$) with columns equal to the eigenvectors of $\Sigma$, and $D$ is a diagonal matrix with diagonal elements equal to the eigenvalues of $\Sigma$.
The vector of Principal components
$$C = \begin{pmatrix} C_1 \\ \vdots \\ C_p \end{pmatrix} = \begin{pmatrix} a_1'x \\ \vdots \\ a_p'x \end{pmatrix} = P'x$$
has covariance matrix
$$\Sigma_C = P'\Sigma P = P'(PDP')P = (P'P)D(P'P) = D = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}.$$
An orthogonal matrix rotates vectors; thus $C = P'x$ rotates the vector $x$ into the vector of principal components $C$.
Also
$$\operatorname{tr}(D) = \operatorname{tr}(\Sigma_C) = \operatorname{tr}(P'\Sigma P) = \operatorname{tr}(\Sigma PP') = \operatorname{tr}(\Sigma),$$
so
$$\sum_{i=1}^{p}\lambda_i = \sum_{i=1}^{p}\sigma_{ii}, \quad \text{i.e.} \quad \sum_{i=1}^{p}\operatorname{var}(C_i) = \sum_{i=1}^{p}\operatorname{var}(x_i) = \text{total variance of } x.$$
The ratio
$$\frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j} = \frac{\operatorname{var}(C_i)}{\text{total variance of } x}$$
denotes the proportion of variance explained by the $i^{\text{th}}$ principal component $C_i$.
Also
$$\operatorname{Cov}(C_i, x_j) = \lambda_i a_{ij} \quad \text{and} \quad \operatorname{Corr}(C_i, x_j) = \frac{\sqrt{\lambda_i}\,a_{ij}}{\sqrt{\sigma_{jj}}} = \sqrt{\lambda_i}\,a_{ij} \text{ if } \sigma_{jj} = 1.$$
Factor Analysis
An Alternative technique for studying
correlation and covariance structure
Let $x$ have a p-variate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
The Factor Analysis Model:
Let $F_1, F_2, \dots, F_k$ denote independent standard normal random variables (the factors).
Let $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_p$ denote independent normal random variables with mean 0 and $\operatorname{var}(\varepsilon_i) = \psi_i$.
Suppose that there exist constants $\lambda_{ij}$ (the loadings) such that:
$$x_1 = \mu_1 + \lambda_{11}F_1 + \lambda_{12}F_2 + \dots + \lambda_{1k}F_k + \varepsilon_1$$
$$x_2 = \mu_2 + \lambda_{21}F_1 + \lambda_{22}F_2 + \dots + \lambda_{2k}F_k + \varepsilon_2$$
$$\vdots$$
$$x_p = \mu_p + \lambda_{p1}F_1 + \lambda_{p2}F_2 + \dots + \lambda_{pk}F_k + \varepsilon_p$$
Factor Analysis Model
$$x = \mu + LF + \varepsilon$$
where $F \sim N(0_k, I_k)$, $\varepsilon \sim N(0_p, \Psi)$, and
$$\Psi = \begin{pmatrix} \psi_1 & 0 & \cdots & 0 \\ 0 & \psi_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \psi_p \end{pmatrix}.$$
Note:
$$\Sigma = \operatorname{Var}(x) = LL' + \Psi,$$
hence
$$\sigma_{ii} = \operatorname{var}(x_i) = \sum_{j=1}^{k}\lambda_{ij}^2 + \psi_i
\quad \text{and} \quad
\sigma_{im} = \operatorname{cov}(x_i, x_m) = \sum_{j=1}^{k}\lambda_{ij}\lambda_{mj}.$$
$\sum_{j=1}^{k}\lambda_{ij}^2$ is called the communality, i.e. the component of variance of $x_i$ that is due to the common factors $F_1, F_2, \dots, F_k$.
$\psi_i$ is called the specific variance, i.e. the component of variance of $x_i$ that is specific only to that observation.
$F_1, F_2, \dots, F_k$ are called the common factors; $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_p$ are called the specific factors.
$\lambda_{ij} = \operatorname{cov}(x_i, F_j)$ = the correlation between $x_i$ and $F_j$ if $\operatorname{var}(x_i) = 1$.
Rotating Factors
Recall the factor analysis model
$$x - \mu = LF + \varepsilon.$$
This gives rise to the vector $x$ having covariance matrix
$$\Sigma = \operatorname{Var}(x) = LL' + \Psi.$$
Let $P$ be any orthogonal matrix; then
$$x - \mu = LF + \varepsilon = LP'PF + \varepsilon = L^*F^* + \varepsilon$$
and
$$\Sigma = \operatorname{Var}(x) = LL' + \Psi = LP'PL' + \Psi = L^*L^{*\prime} + \Psi,$$
where $F^* = PF$ and $L^* = LP'$.
Hence if $x - \mu = LF + \varepsilon$ with $\Sigma = \operatorname{Var}(x) = LL' + \Psi$ is a factor analysis model, then so also is $x - \mu = L^*F^* + \varepsilon$ with $\Sigma = \operatorname{Var}(x) = L^*L^{*\prime} + \Psi$, where $P$ is any orthogonal matrix.
The process of exploring other models through orthogonal transformations of the factors is called rotating the factors.
There are many techniques for rotating the factors:
• VARIMAX
• Quartimax
• Equimax
VARIMAX rotation attempts to have each individual variable load highly on a subset of the factors.
Extracting the Factors
Several methods; we consider two:
1. Principal Component Method
2. Maximum Likelihood Method
Principal Component Method
Recall
$$\Sigma = (a_1, \dots, a_p)\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}\begin{pmatrix} a_1' \\ \vdots \\ a_p' \end{pmatrix} = PDP'$$
where $a_1, \dots, a_p$ are eigenvectors of $\Sigma$ of length 1 and $\lambda_1 \geq \dots \geq \lambda_p \geq 0$ are the eigenvalues of $\Sigma$.
Hence
$$\Sigma = \left(\sqrt{\lambda_1}\,a_1, \dots, \sqrt{\lambda_p}\,a_p\right)\begin{pmatrix} \sqrt{\lambda_1}\,a_1' \\ \vdots \\ \sqrt{\lambda_p}\,a_p' \end{pmatrix} = LL' + 0_{p\times p}.$$
Thus $L = \left(\sqrt{\lambda_1}\,a_1, \dots, \sqrt{\lambda_p}\,a_p\right)$ and $\Psi = 0_{p\times p}$.
This is the principal component solution with p factors. Note: the specific variances, $\psi_i$, are all zero.
The objective in Factor Analysis is to explain the correlation structure in the data vector with as few factors as necessary.
It may happen that the latter eigenvalues of $\Sigma$ are small: $\lambda_{k+1} \approx \dots \approx \lambda_p \approx 0$. Then
$$\Sigma = \lambda_1a_1a_1' + \dots + \lambda_pa_pa_p' \approx \lambda_1a_1a_1' + \dots + \lambda_ka_ka_k' = L_kL_k'$$
where $L_k = \left(\sqrt{\lambda_1}\,a_1, \dots, \sqrt{\lambda_k}\,a_k\right)$.
In addition let
$$\psi_i = \sigma_{ii} - \left(i^{\text{th}} \text{ diagonal element of } L_kL_k'\right) = \sigma_{ii} - \sum_{j=1}^{k}\lambda_{ij}^2.$$
In this case
$$\Sigma \approx L_kL_k' + \Psi \quad \text{where} \quad \Psi = \begin{pmatrix} \psi_1 & & 0 \\ & \ddots & \\ 0 & & \psi_p \end{pmatrix}.$$
The equality will be exact along the diagonal.
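The principal-component factor solution described above is again an eigen-decomposition followed by a rescaling of the leading eigenvectors. A minimal sketch, assuming numpy and a made-up correlation matrix, with k = 1 retained factor:

```python
import numpy as np

def pc_factor_solution(Sigma, k):
    """Principal-component factor solution with k factors:
    L_k = (sqrt(l1) a1, ..., sqrt(lk) ak), psi_i = sigma_ii - sum_j L_k[i, j]^2."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    lam, A = eigvals[order][:k], eigvecs[:, order][:, :k]
    L = A * np.sqrt(lam)                               # p x k loading matrix
    psi = np.diag(Sigma) - (L ** 2).sum(axis=1)        # specific variances
    return L, psi

Sigma = np.array([[1.0, 0.6, 0.5],
                  [0.6, 1.0, 0.4],
                  [0.5, 0.4, 1.0]])                    # made-up correlation matrix
L, psi = pc_factor_solution(Sigma, k=1)
print(L)
print(psi)
print(L @ L.T + np.diag(psi))                          # reproduces Sigma exactly on the diagonal
```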
Maximum Likelihood Estimation
Let $x_1, \dots, x_n$ denote a sample from $N_p(\mu, \Sigma)$ where
$$\Sigma_{p\times p} = L_{p\times k}L_{k\times p}' + \Psi_{p\times p}.$$
The joint density of $x_1, \dots, x_n$ is
$$L(\mu, \Sigma) = L(\mu, L, \Psi) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\exp\left\{-\tfrac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(A + n(\bar{x} - \mu)(\bar{x} - \mu)'\right)\right]\right\}$$
where $A = (n-1)S = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})'$.
The likelihood function can be written
$$L(\mu, \Sigma) = L(\mu, L, \Psi) = \frac{1}{(2\pi)^{(n-1)p/2}|\Sigma|^{(n-1)/2}}\exp\left\{-\tfrac{n-1}{2}\operatorname{tr}\left(\Sigma^{-1}S\right)\right\}\cdot\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\tfrac{n}{2}(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu)\right\}$$
with $\Sigma = LL' + \Psi$.
The maximum likelihood estimates $\hat\mu$, $\hat{L}$ and $\hat\Psi$ are obtained by numerical maximization of $L(\mu, L, \Psi)$.
Discrimination and
Classification
Discrimination
Situation:
We have two or more populations $\pi_1$, $\pi_2$, etc. (possibly p-variate normal). The populations are known (or we have data from each population). We have data for a new case (population unknown) and we want to identify the population to which the new case belongs.
The Basic Problem
Suppose that the data from a new case, $x_1, \dots, x_p$, has joint density function either
$\pi_1$: $g(x_1, \dots, x_p)$ or $\pi_2$: $h(x_1, \dots, x_p)$.
We want to make the decision
$D_1$: classify the case in $\pi_1$ ($g$ is the correct distribution), or
$D_2$: classify the case in $\pi_2$ ($h$ is the correct distribution).
The Two Types of Errors
1. Misclassifying the case in $\pi_1$ when it actually lies in $\pi_2$. Let $P[1|2] = P[D_1|\pi_2]$ = the probability of this type of error.
2. Misclassifying the case in $\pi_2$ when it actually lies in $\pi_1$. Let $P[2|1] = P[D_2|\pi_1]$ = the probability of this type of error.
This is similar to Type I and Type II errors in hypothesis testing.
Note:
A discrimination scheme is defined by splitting p-dimensional space into two regions:
1. $C_1$ = the region where we make the decision $D_1$ (the decision to classify the case in $\pi_1$).
2. $C_2$ = the region where we make the decision $D_2$ (the decision to classify the case in $\pi_2$).
There can be several approaches to determining the regions $C_1$ and $C_2$, all concerned with taking into account the probabilities of misclassification $P[2|1]$ and $P[1|2]$:
1. Set up the regions $C_1$ and $C_2$ so that one of the probabilities of misclassification, $P[2|1]$ say, is at some low acceptable value $\alpha$. Accept the level of the other probability of misclassification, $P[1|2] = \beta$.
2. Set up the regions $C_1$ and $C_2$ so that the total probability of misclassification, $P[\text{misclassification}] = P[1]\,P[2|1] + P[2]\,P[1|2]$, is minimized, where $P[1]$ = P[the case belongs to $\pi_1$] and $P[2]$ = P[the case belongs to $\pi_2$].
3. Set up the regions $C_1$ and $C_2$ so that the total expected cost of misclassification, $E[\text{cost of misclassification}] = ECM = c_{2|1}P[1]\,P[2|1] + c_{1|2}P[2]\,P[1|2]$, is minimized, where $c_{2|1}$ = the cost of misclassifying the case in $\pi_2$ when the case belongs to $\pi_1$, and $c_{1|2}$ = the cost of misclassifying the case in $\pi_1$ when the case belongs to $\pi_2$.
The Optimal Classification Rule
The Neyman-Pearson Lemma
Suppose that the data $x_1, \dots, x_p$ has joint density function $f(x_1, \dots, x_p; \theta)$ where $\theta$ is either $\theta_1$ or $\theta_2$. Let
$$g(x_1, \dots, x_p) = f(x_1, \dots, x_p; \theta_1) \quad \text{and} \quad h(x_1, \dots, x_p) = f(x_1, \dots, x_p; \theta_2).$$
We want to make the decision
$D_1$: $\theta = \theta_1$ ($g$ is the correct distribution) against
$D_2$: $\theta = \theta_2$ ($h$ is the correct distribution).
Then the optimal regions (minimizing ECM, the expected cost of misclassification) for making the decisions $D_1$ and $D_2$ respectively are
$$C_1 = \left\{(x_1, \dots, x_p) : \lambda = \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} \geq k\right\}$$
and
$$C_2 = \left\{(x_1, \dots, x_p) : \lambda = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} < k\right\},
\quad \text{where} \quad
k = \frac{c_{1|2}P[2]}{c_{2|1}P[1]}.$$
Fisher's Linear Discriminant Function
Suppose that $x_1, \dots, x_p$ is data from a p-variate Normal distribution with mean vector either $\mu_1$ or $\mu_2$, and that the covariance matrix $\Sigma$ is the same for both populations $\pi_1$ and $\pi_2$:
$$g(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1)},
\qquad
h(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2)}.$$
The function
$$a'x = (\mu_1 - \mu_2)'\Sigma^{-1}x$$
is called Fisher's linear discriminant function, and the classification rule takes the form $a'x = (\mu_1 - \mu_2)'\Sigma^{-1}x \geq K$.
In the case where the populations are unknown but estimated from data, Fisher's linear discriminant function is
$$\hat{a}'x = (\bar{x}_1 - \bar{x}_2)'S^{-1}x.$$
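The sample version of Fisher's linear discriminant function is straightforward to compute from two training samples. The following is a minimal sketch, assuming numpy and simulated data with made-up means; the cutoff uses equal costs and equal priors (k = 1), which matches the special case worked out on the slides that follow.

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Sample Fisher linear discriminant a_hat = S_pooled^{-1}(xbar1 - xbar2) and cutoff K
    (equal costs and equal prior probabilities assumed)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False) + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(Sp, m1 - m2)
    K = 0.5 * a @ (m1 + m2)                            # classify into pi_1 if a'x >= K
    return a, K

rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([2, 0], np.eye(2), size=40)           # made-up training samples
X2 = rng.multivariate_normal([0, 1], np.eye(2), size=40)
a, K = fisher_discriminant(X1, X2)
x_new = np.array([1.5, 0.2])
print("population 1" if a @ x_new >= K else "population 2")
```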
The Optimal Classification Rule (applied to this case)
Recall the optimal regions (minimizing ECM): decide $D_1$ when $\lambda = g(x_1, \dots, x_p)/h(x_1, \dots, x_p) \geq k$ and $D_2$ otherwise, with $k = c_{1|2}P[2]/(c_{2|1}P[1])$, where $g$ and $h$ are the p-variate normal densities above with mean vectors $\mu_1$ and $\mu_2$ and common covariance matrix $\Sigma$.
The optimal rule states that we should classify into populations $\pi_1$ and $\pi_2$ using
$$\lambda = \frac{g(x)}{h(x)} = \frac{e^{-\frac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1)}}{e^{-\frac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2)}}
= e^{\frac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2) - \frac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1)}.$$
That is, make the decision $D_1$: the population is $\pi_1$ if $\lambda \geq k$ with $k = \dfrac{c_{1|2}P[2]}{c_{2|1}P[1]}$,
or
$$\ln\lambda = \tfrac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2) - \tfrac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1) \geq \ln k,$$
or
$$(x-\mu_2)'\Sigma^{-1}(x-\mu_2) - (x-\mu_1)'\Sigma^{-1}(x-\mu_1) \geq 2\ln k,$$
or
$$x'\Sigma^{-1}x - 2\mu_2'\Sigma^{-1}x + \mu_2'\Sigma^{-1}\mu_2 - x'\Sigma^{-1}x + 2\mu_1'\Sigma^{-1}x - \mu_1'\Sigma^{-1}\mu_1 \geq 2\ln k,$$
and
$$(\mu_1 - \mu_2)'\Sigma^{-1}x \geq \ln k + \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2\right).$$
Finally we make the decision $D_1$: the population is $\pi_1$ if
$$a'x \geq K$$
where $a = \Sigma^{-1}(\mu_1 - \mu_2)$ (Fisher's linear discriminant function) and $K = \ln k + \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2\right)$, with $k = \dfrac{c_{1|2}P[2]}{c_{2|1}P[1]}$.
Note: $k = 1$ and $\ln k = 0$ if $c_{1|2} = c_{2|1}$ and $P[1] = P[2]$, and then
$$K = \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2\right) = \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2).$$
Graphical illustration of Fisher's linear discriminant function
[Figure: the rule $a'x = (\mu_1 - \mu_2)'\Sigma^{-1}x \geq K$ splits the plane into the regions where each population is chosen]
When $k = 1$ ($c_{1|2} = c_{2|1}$ and $P[1] = P[2]$), so that $K = \tfrac{1}{2}\left(\mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2\right) = \tfrac{1}{2}(\mu_1 + \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$, the rule $a'x \geq K$ with $a = \Sigma^{-1}(\mu_1 - \mu_2)$ is equivalent to
$$(\mu_1 - \mu_2)'\Sigma^{-1}x - \tfrac{1}{2}\mu_1'\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2'\Sigma^{-1}\mu_2 \geq 0,$$
which, completing the two quadratic forms, is the same as
$$(x - \mu_2)'\Sigma^{-1}(x - \mu_2) \geq (x - \mu_1)'\Sigma^{-1}(x - \mu_1),$$
i.e.
$$\text{Mahalanobis distance}(x, \mu_2) \geq \text{Mahalanobis distance}(x, \mu_1),
\qquad
d_M^2(x, \mu_2 \mid \Sigma) \geq d_M^2(x, \mu_1 \mid \Sigma).$$
Thus we make the decision $D_1$: the population is $\pi_1$ if
$$\text{Mahalanobis distance}(x, \mu_2) \geq \text{Mahalanobis distance}(x, \mu_1),$$
and the decision $D_2$: the population is $\pi_2$ if
$$\text{Mahalanobis distance}(x, \mu_2) < \text{Mahalanobis distance}(x, \mu_1).$$
Discrimination of p-variate Normal distributions (unequal covariance matrices)
Suppose that $x_1, \dots, x_p$ is data from a p-variate Normal distribution with mean vector either $\mu_1$ or $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$ respectively:
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_1|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)},
\qquad
g(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_2|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)}.$$
The optimal rule states that we should classify into populations $\pi_1$ and $\pi_2$ using
$$\lambda = \frac{f(x)}{g(x)} = \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}}\,e^{\frac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2) - \frac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)}.$$
That is, make the decision $D_1$: the population is $\pi_1$ if $\lambda \geq k$, i.e.
$$\ln\lambda = \tfrac{1}{2}\left[(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)\right] + \tfrac{1}{2}\left[\ln|\Sigma_2| - \ln|\Sigma_1|\right] \geq \ln k,$$
or
$$\tfrac{1}{2}x'\left(\Sigma_2^{-1} - \Sigma_1^{-1}\right)x + \left(\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1}\right)x \geq K + \ln k,$$
where
$$K = \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} + \tfrac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right)
\quad \text{and} \quad
k = \frac{c_{1|2}P[2]}{c_{2|1}P[1]}.$$
Summarizing, we make the decision to classify in population $\pi_1$ if
$$x'Ax + b'x + c \geq 0$$
where
$$A = -\tfrac{1}{2}\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right), \qquad b = \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2,$$
and
$$c = -\tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} - \tfrac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right) - \ln\frac{c_{1|2}P[2]}{c_{2|1}P[1]}.$$
Discrimination of p-variate Normal distributions (unequal covariance matrices)
[Figure: the quadratic boundary $x'Ax + b'x + c = 0$ separating the regions where $\pi_1$ and $\pi_2$ are chosen]
Discrimination amongst k populations
We want to determine if an observation vector
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix}$$
comes from one of the k populations
$$\pi_1: f_1(x_1, \dots, x_p) = f_1(x), \quad \dots, \quad \pi_k: f_k(x_1, \dots, x_p) = f_k(x).$$
For this purpose we need to partition p-dimensional space into k regions $C_1, C_2, \dots, C_k$. We will make the decision $D_i$ ($x$ came from $\pi_i$) if $x \in C_i$.
Misclassification probabilities
$$P[j|i] = P[\text{classify the case in } \pi_j \text{ when the case is from } \pi_i] = P[x \in C_j \mid \pi_i] = \int_{C_j} f_i(x)\,dx.$$
Cost of misclassification: $c_{j|i}$ = the cost of classifying the case in $\pi_j$ when the case is from $\pi_i$.
Initial probabilities of inclusion: $P[i]$ = P[the case is from $\pi_i$ initially].
Expected cost of misclassification of a case from population $\pi_i$ (we assume that we know the case came from $\pi_i$):
$$ECM(i) = c_{1|i}P[1|i] + \dots + c_{i-1|i}P[i-1|i] + c_{i+1|i}P[i+1|i] + \dots + c_{k|i}P[k|i] = \sum_{j \neq i} c_{j|i}P[j|i].$$
Total expected cost of misclassification:
$$ECM = P[1]\,ECM(1) + \dots + P[k]\,ECM(k) = \sum_i P[i]\sum_{j \neq i} c_{j|i}P[j|i] = \sum_j \int_{C_j}\sum_{i \neq j} P[i]\,f_i(x)\,c_{j|i}\,dx.$$
Optimal Classification Rule
The optimal classification rule will find the regions $C_j$ that minimize
$$ECM = \sum_j \int_{C_j}\sum_{i \neq j} P[i]\,f_i(x)\,c_{j|i}\,dx
= c\sum_j \int_{C_j}\sum_{i \neq j} P[i]\,f_i(x)\,dx \quad (\text{if } c_{j|i} = c)
= c\sum_j \int_{C_j}\left[\sum_{i=1}^{k} P[i]\,f_i(x) - P[j]\,f_j(x)\right]dx.$$
ECM will be minimized if $C_j$ is chosen where the term that is omitted, $P[j]\,f_j(x)$, is the largest.
Optimal regions when misclassification costs are equal:
$$C_j = \left\{x : P[j]\,f_j(x) \geq P[i]\,f_i(x) \text{ for all } i \neq j\right\} = \left\{x : \ln P[j]\,f_j(x) \geq \ln P[i]\,f_i(x) \text{ for all } i \neq j\right\}.$$
Optimal regions when misclassification costs are equal and the distributions are p-variate Normal with common covariance matrix $\Sigma$:
In the case of normality,
$$f_i(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_i)'\Sigma^{-1}(x-\mu_i)},$$
so
$$\ln P[i]\,f_i(x) = \ln P[i] - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x-\mu_i)'\Sigma^{-1}(x-\mu_i),$$
and $\ln P[j]\,f_j(x) \geq \ln P[i]\,f_i(x)$ if
$$\mu_j'\Sigma^{-1}x - \tfrac{1}{2}\mu_j'\Sigma^{-1}\mu_j + \ln P[j] \geq \mu_i'\Sigma^{-1}x - \tfrac{1}{2}\mu_i'\Sigma^{-1}\mu_i + \ln P[i],$$
or
$$a_j'x + b_j \geq a_i'x + b_i \quad \text{where } a_i = \Sigma^{-1}\mu_i \text{ and } b_i = \ln P[i] - \tfrac{1}{2}\mu_i'\Sigma^{-1}\mu_i.$$
Summarizing
We will classify the observation vector in population $\pi_j$ if
$$L_j = a_j'x + b_j = \max_i L_i = \max_i\left(a_i'x + b_i\right), \quad \text{where } a_i = \Sigma^{-1}\mu_i \text{ and } b_i = \ln P[i] - \tfrac{1}{2}\mu_i'\Sigma^{-1}\mu_i.$$
[Figure: the plane partitioned into the regions where $L_1 \geq L_2, L_3$; $L_2 \geq L_1, L_3$; and $L_3 \geq L_1, L_2$]
Classification or Cluster Analysis
Have data from one or several
populations
Situation
• Have multivariate (or univariate) data from
one or several populations (the number of
populations is unknown)
• Want to determine the number of populations
and identify the populations
Hierarchical Clustering Methods
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables):
1. Start with N clusters, each consisting of a single entity, and an N x N symmetric matrix (table) of distances (or similarities) D = (d_ij).
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters U and V be d_UV.
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by
a) deleting the rows and columns corresponding to clusters U and V, and
b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.
4. Repeat steps 2 and 3 a total of N - 1 times. (All objects will be in a single cluster at termination of this algorithm.) Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.
Different methods of computing inter-cluster distance
[Figure: two clusters of points {1, 2} and {3, 4, 5}. Single linkage uses the smallest pairwise distance between the clusters (here d_24); complete linkage uses the largest pairwise distance (here d_15); average linkage uses the average of all pairwise distances, (d_13 + d_14 + d_15 + d_23 + d_24 + d_25)/6.]
k-means Clustering
A non-hierarchical clustering scheme: we want to subdivide the data set into k groups.
The k-means algorithm
1. Initially subdivide the complete data into k groups.
2. Compute the centroids (mean vectors) for each group.
3. Sequentially go through the data, reassigning each case to the group with the closest centroid.
4. After reassigning a case to a new group, recalculate the centroids for the original group and for the new group to which it now belongs.
5. Continue until there are no new reassignments of cases.
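The following is a minimal sketch of the idea, assuming numpy and made-up data. It uses the common batch variant (recompute all centroids after a full pass) rather than updating after every single reassignment as in step 4 above, and it does not guard against a cluster becoming empty.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each case to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])       # step 1: arbitrary initial grouping
    for _ in range(n_iter):
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])   # step 2
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)              # step 3: reassign to closest centroid
        if np.array_equal(new_labels, labels):         # step 5: stop when nothing changes
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])   # made-up data
labels, centroids = k_means(X, k=2)
print(centroids)
```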
Discrete Multivariate Analysis
Analysis of Multivariate
Categorical Data
Multiway Frequency Tables
[Figures: a two-way table (A x B), a three-way table (A x B x C) shown in two layouts, and a four-way table (A x B x C x D)]
Models for count data
• Binomial
• Hypergeometric
• Poisson
• Multinomial
Log Linear Model
Three-way Frequency Tables
Log-linear model for three-way tables
Let $\mu_{ijk}$ denote the expected frequency in cell $(i,j,k)$ of the table; then in general
$$\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)},$$
or in the multiplicative form
$$\mu_{ijk} = e^{\ln\mu_{ijk}} = e^{u}e^{u_{1(i)}}e^{u_{2(j)}}e^{u_{3(k)}}e^{u_{12(i,j)}}e^{u_{13(i,k)}}e^{u_{23(j,k)}}e^{u_{123(i,j,k)}}
= \theta\,\theta_{1(i)}\,\theta_{2(j)}\,\theta_{3(k)}\,\theta_{12(i,j)}\,\theta_{13(i,k)}\,\theta_{23(j,k)}\,\theta_{123(i,j,k)}.$$
Hierarchical Log-linear models
for categorical Data
For three way tables
The hierarchical principle:
If an interaction is in the model, also keep
lower order interactions and main effects
associated with that interaction
Hierarchical log-linear models for a 3-way table
Model [1][2][3]: mutual independence between all three variables.
Model [1][23]: independence of variable 1 with variables 2 and 3.
Model [2][13]: independence of variable 2 with variables 1 and 3.
Model [3][12]: independence of variable 3 with variables 1 and 2.
Model [12][13]: conditional independence between variables 2 and 3 given variable 1.
Model [12][23]: conditional independence between variables 1 and 3 given variable 2.
Model [13][23]: conditional independence between variables 1 and 2 given variable 3.
Model [12][13][23]: pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
Model [123]: the saturated model.
Comments
• The log-linear model is similar to the ANOVA
models for factorial experiments.
• The ANOVA models are used to understand
the effects of categorical independent variables
(factors) on a continuous dependent variable
(Y).
• The log-linear model is used to understand dependence amongst categorical variables.
• The presence of interactions indicates dependence between the variables present in the interactions.
Goodness of Fit Statistics
These statistics can be used to check
if a log-linear model will fit the
observed frequency table
Goodness of Fit Statistics
The Chi-squared statistic:
$$\chi^2 = \sum\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum\frac{(x_{ijk} - \hat\mu_{ijk})^2}{\hat\mu_{ijk}}$$
The Likelihood Ratio statistic:
$$G^2 = 2\sum\text{Observed}\,\ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 2\sum x_{ijk}\ln\left(\frac{x_{ijk}}{\hat\mu_{ijk}}\right)$$
d.f. = # cells - # parameters fitted.
We reject the model if $\chi^2$ or $G^2$ is greater than $\chi^2_\alpha$ with these degrees of freedom.
Conditional Test Statistics
• Suppose that we are considering two log-linear models and that Model 2 is a special case of Model 1.
• That is, the parameters of Model 2 are a subset of the parameters of Model 1.
• Also assume that Model 1 has been shown to adequately fit the data.
In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation.
The likelihood ratio chi-square statistic that achieves this goal is:
$$G^2(2|1) = G^2(2) - G^2(1) = 2\sum\text{Observed}\,\ln\left(\frac{\text{Expected}(1)}{\text{Expected}(2)}\right), \qquad df = df_2 - df_1.$$
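To make the chi-squared and G-squared goodness-of-fit statistics defined above concrete, here is a minimal sketch, assuming numpy, for the simplest case: the independence model in a two-way frequency table, where the fitted expected counts are the products of the margins divided by the total. The table values are made up for illustration, and the calculation assumes no zero cells.

```python
import numpy as np

def independence_fit(table):
    """Pearson chi-squared and likelihood-ratio G^2 for the independence model
    in a two-way frequency table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    chi2 = ((table - expected) ** 2 / expected).sum()
    g2 = 2 * (table * np.log(table / expected)).sum()   # assumes no zero cells
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return chi2, g2, df

# made-up 2 x 3 frequency table
table = [[20, 30, 25],
         [15, 40, 20]]
print(independence_fit(table))
```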
Stepwise selection procedures
Forward Selection
Backward Elimination
Forward Selection:
Starting with a model that underfits the data, log-linear parameters that are not in the model are added step by step until a model that does fit is achieved.
At each step the log-linear parameter that is most
significant is added to the model:
To determine the significance of a parameter added we
use the statistic:
G2(2|1) = G2(2) – G2(1)
Model 1 contains the parameter.
Model 2 does not contain the parameter
Backward Elimination:
Starting with a model that overfits the data, log-linear parameters that are in the model are deleted step by step until a model that continues to fit the data and has the smallest number of significant parameters is achieved.
At each step the log-linear parameter that is least
significant is deleted from the model:
To determine the significance of a parameter deleted we
use the statistic:
G2(2|1) = G2(2) – G2(1)
Model 1 contains the parameter.
Model 2 does not contain the parameter
Modelling of response variables
Independent → Dependent
Logit Models
When some variables are dependent (response) variables and other variables are independent (predictor) variables, the logit model is used when the dependent variable is binary.
Case: one dependent variable, two independent variables
Consider the log-linear model for $\mu_{ijk}$, the expected frequency in cell $(i,j,k)$ of the table:
$$\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}.$$
Variables 1 and 2 are the independent variables and variable 3 is the binary dependent variable.
The logit model (for the binary response variable):
$$\ln\left(\frac{\mu_{ij1}}{\mu_{ij2}}\right) = \ln\mu_{ij1} - \ln\mu_{ij2}$$
$$= \left[u + u_{1(i)} + u_{2(j)} + u_{3(1)} + u_{12(i,j)} + u_{13(i,1)} + u_{23(j,1)} + u_{123(i,j,1)}\right] - \left[u + u_{1(i)} + u_{2(j)} + u_{3(2)} + u_{12(i,j)} + u_{13(i,2)} + u_{23(j,2)} + u_{123(i,j,2)}\right]$$
$$= 2u_{3(1)} + 2u_{13(i,1)} + 2u_{23(j,1)} + 2u_{123(i,j,1)},$$
since
$$u_{3(2)} = -u_{3(1)}, \quad u_{13(i,2)} = -u_{13(i,1)}, \quad u_{23(j,2)} = -u_{23(j,1)}, \quad u_{123(i,j,2)} = -u_{123(i,j,1)}.$$
The logit model:
$$\ln\left(\frac{\mu_{ij1}}{\mu_{ij2}}\right) = v + v_{1(i)} + v_{2(j)} + v_{12(i,j)}$$
where
$$v = 2u_{3(1)}, \quad v_{1(i)} = 2u_{13(i,1)}, \quad v_{2(j)} = 2u_{23(j,1)}, \quad v_{12(i,j)} = 2u_{123(i,j,1)}.$$
Thus corresponding to a log-linear model there is a logit model predicting the log ratio of the expected frequencies of the two categories of the dependent variable.
Also, $(k+1)$-factor interactions with the dependent variable in the log-linear model determine $k$-factor interactions in the logit model:
$k + 1 = 1$: the constant term in the logit model;
$k + 1 = 2$: main effects in the logit model.
Fitting a Logit Model with a
Polytomous Response Variable
Techniques for handling Polytomous Response Variable
Approaches
1. Consider the categories 2 at a time. Do this for all
possible pairs of the categories.
2. Look at the continuation ratios
i.
ii.
iii.
iv.
1 vs 2
1,2 vs 3
1,2,3 vs 4
etc
Causal or Path Analysis for
Categorical Data
When the data is continuous, a causal pattern
may be assumed to exist amongst the variables.
The path diagram
This is a diagram summarizing causal relationships. Straight arrows are drawn from a variable that has some cause and effect on another variable (X → Y). Curved double-sided arrows are drawn between variables that are simply correlated (X ↔ Y).
Example 1
The variables: Job Stress, Smoking, Heart Disease.
[Path diagram relating Job Stress, Smoking and Heart Disease]
In path analysis for continuous variables, one is interested in determining the contribution along each path (the path coefficients).
Example 2
The variables: Job Stress, Alcoholic Drinking, Smoking, Heart Disease.
[Path diagram relating Job Stress, Drinking, Smoking and Heart Disease]
If you have a causal path diagram, then variables at the end of a straight one-sided arrow should be treated as dependent variables in a logit analysis.
Example 1
The variables: Job Stress, Smoking, Heart Disease.
[Path diagram relating Job Stress, Smoking and Heart Disease]
Analysis 1: Job Stress predicting Smoking.
Analysis 2: Job Stress and Smoking predicting Heart Disease.
Example 2
The variables: Job Stress, Alcoholic Drinking, Smoking, Heart Disease.
[Path diagram relating Job Stress, Drinking, Smoking and Heart Disease]
Analysis 1: Job Stress predicting Drinking.
Analysis 2: Job Stress predicting Smoking.
Analysis 3: Job Stress, Drinking and Smoking predicting Heart Disease.
Disease