
Chapter 5
Statistical Inference
Estimation and Testing Hypotheses
5.1 Data Sets & Matrix Normal Distribution
Data matrix
$$X = \begin{pmatrix} X_{11} & \cdots & X_{1p} \\ \vdots & & \vdots \\ X_{n1} & \cdots & X_{np} \end{pmatrix} \quad (n \text{ observations on } p \text{ variables}),$$
where the $n$ rows $X_1, \ldots, X_n$ are i.i.d. $N_p(\mu, \Sigma)$.
$\mathrm{Vec}(X')$ is an $np \times 1$ random vector with
mean vector $1_n \otimes \mu$ and
covariance matrix $\mathrm{diag}(\Sigma, \ldots, \Sigma) = I_n \otimes \Sigma$.
We write $X \sim N_{n \times p}(1_n \mu', I_n \otimes \Sigma)$. More generally, we can define the matrix normal distribution.
Definition 5.1.1
An $n \times p$ random matrix $X$ is said to follow a matrix normal distribution $N_{n \times p}(M, W \otimes V)$ if $\mathrm{Vec}(X') \sim N_{np}(\mu, W \otimes V)$, where $\mu = \mathrm{Vec}(M')$.
In this case,
$$X = M + BYA',$$
where $W = BB'$, $V = AA'$, and $Y$ has i.i.d. elements each following $N(0,1)$.
Theorem 5.1.1
The density function of $X \sim N_{n \times p}(M, W \otimes V)$ with $W > 0$, $V > 0$ is given by
$$(2\pi)^{-np/2}\,|W|^{-p/2}\,|V|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}W^{-1}(X - M)V^{-1}(X - M)'\right),$$
where $\mathrm{etr}(A) = \exp(\mathrm{tr}(A))$.
Corollary 1:
Let X be a matrix of n observations from $N_p(\mu, \Sigma)$. Then the density function of X is
$$(2\pi)^{-np/2}\,|\Sigma|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}A\right),$$
where
$$A = \sum_{j=1}^{n}(X_j - \mu)(X_j - \mu)'.$$
5.2 Maximum Likelihood Estimation
A. Review
$X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$.
Step 1. The likelihood function
L μ ,
2

1

i 1 2 
n
 2 
n
1  x μ 
  i

2


e 
2
n
1
2

2   n exp 


x

μ



i
2
 2 i 1

Step 2. Domain (parameter space)
$$\Theta = \{(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 > 0\}$$
The MLE of $(\mu, \sigma^2)$ maximizes $L(\mu, \sigma^2)$ over $\Theta$.
Step 3. Maximization
$$L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)\exp\!\left(-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right)$$
$$\le (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\right) = L(\bar{x}, \sigma^2), \quad \forall\,\sigma^2 > 0.$$
It implies that $\hat{\mu} = \bar{x}$.
Let $a = \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $g(\sigma^2) = (\sigma^2)^{-n/2}\exp\!\left(-\frac{a}{2\sigma^2}\right)$.
Setting $g'(\sigma^2) = 0$ gives
$$\hat{\sigma}^2 = \frac{a}{n} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Result 4.9
(p. 168 of the textbook)
B. Multivariate population
X1 , , X n are samples of N p  μ ,Σ .
Step 1. The likelihood function
L μ ,Σ   2 
 np
2
Σ
n
2 etr 
1 -1 
 Σ A
 2

where A    x j  μ  x j  μ '
n
j 1
Step 2. Domain
$$\Theta = \{(\mu, \Sigma) : \mu \in \mathbb{R}^p,\ \Sigma : p \times p,\ \Sigma > 0\}$$
Step 3. Maximization
(a)
$$\max_{\mu,\,\Sigma > 0} L(\mu, \Sigma) = \max_{\Sigma > 0} L(\bar{x}, \Sigma) = \max_{\Sigma > 0}\,(2\pi)^{-np/2}\,|\Sigma|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}B\right),$$
where
$$B = \sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})'.$$
We can prove that $P(B > 0) = 1$ if $n > p$.
(b) Let $B = CC'$, $C : p \times p$, $|C| \ne 0$, and let $\Sigma = C\Sigma^{*}C'$.
Then $\Sigma^{*-1} = C'\Sigma^{-1}C$ and $|\Sigma| = |\Sigma^{*}|\,|B|$.
We have
$$\mathrm{tr}(\Sigma^{-1}B) = \mathrm{tr}(\Sigma^{-1}CC') = \mathrm{tr}(C'\Sigma^{-1}C) = \mathrm{tr}(\Sigma^{*-1}).$$
Hence
$$\max_{\Sigma > 0}\,|\Sigma|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}B\right) = |B|^{-n/2}\max_{\Sigma^{*} > 0}\,|\Sigma^{*}|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{*-1}\right).$$
(c) Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of $\Sigma^{*}$. Then
$$\max_{\Sigma^{*} > 0}\,|\Sigma^{*}|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{*-1}\right) = \max_{\lambda_1, \ldots, \lambda_p > 0}\prod_{j=1}^{p}\lambda_j^{-n/2}\,e^{-\frac{1}{2\lambda_j}}.$$
The function $g(\lambda) = \lambda^{-n/2}e^{-1/(2\lambda)}$ attains its maximum at $\lambda = 1/n$.
The function $L(\Sigma^{*})$ attains its maximum at $\lambda_1 = 1/n, \ldots, \lambda_p = 1/n$, and
$$\hat{\Sigma}^{*} = \frac{1}{n}I_p.$$
n
(d) The MLE of $\Sigma$ is
$$\hat{\Sigma} = C\hat{\Sigma}^{*}C' = \frac{1}{n}CC' = \frac{1}{n}B.$$
Theorem 5.2.1
Let $X_1, \ldots, X_n$ be a sample from $N_p(\mu, \Sigma)$ with $n > p$ and $\Sigma > 0$. Then the MLEs of $\mu$ and $\Sigma$ are
$$\hat{\mu} = \bar{x} \quad \text{and} \quad \hat{\Sigma} = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})',$$
respectively, and the maximum likelihood is
$$L(\bar{x}, \hat{\Sigma}) = (2\pi)^{-np/2}\,\left|\frac{B}{n}\right|^{-n/2} e^{-np/2}.$$
Theorem 5.2.2
Under the above notations, we have
a) $\bar{x}$ and $\hat{\Sigma}$ are independent;
b) $\bar{x} \sim N_p\!\left(\mu, \frac{1}{n}\Sigma\right)$;
c) $\hat{\Sigma}$ is a biased estimator of $\Sigma$:
$$E(\hat{\Sigma}) = \frac{n-1}{n}\Sigma.$$
An unbiased estimator of $\Sigma$ is given by
$$S = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})',$$
called the sample covariance matrix.
Theorem 5.2.3
Let $\hat{\theta}$ be the MLE of $\theta$ and $f(\theta)$ be a measurable function. Then $f(\hat{\theta})$ is the MLE of $f(\theta)$.
Corollary 1
The MLE of the correlations $\rho_{ij}$ is
$$r_{ij} = \frac{b_{ij}}{\sqrt{b_{ii}\,b_{jj}}}, \quad \text{where } B = (b_{ij}).$$
Matlab code: mean, cov, corrcoef
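A minimal Matlab sketch of these computations (the simulated data matrix is only a placeholder for real observations):

% Sketch: sample statistics and MLEs for an n-by-p data matrix X.
X = randn(20, 3);                % simulated placeholder data
n = size(X, 1);

xbar      = mean(X)';            % MLE of mu (p-by-1)
S         = cov(X);              % sample covariance (divides by n-1)
Sigma_hat = (n - 1)/n * S;       % MLE of Sigma (divides by n)
R         = corrcoef(X);         % MLE of the correlation matrix (Corollary 1)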
5.3 Wishart distribution
A. Chi-square distribution
Let $X_1, \ldots, X_n$ be i.i.d. $N(0,1)$. Then $Y = X_1^2 + \cdots + X_n^2 \sim \chi_n^2$, the chi-square distribution with $n$ degrees of freedom. Equivalently:
Definition 5.3.1
If $x \sim N_n(0, I_n)$, then $Y = x'x$ is said to have a chi-square distribution with $n$ degrees of freedom, and we write $Y \sim \chi_n^2$.
• If $x \sim N_n(0, \sigma^2 I_n)$, then $Y = \frac{1}{\sigma^2}x'x \sim \chi_n^2$.
• If $x \sim N_n(0, \Sigma)$, then $Y = x'\Sigma^{-1}x \sim \chi_n^2$.
B. Wishart distribution (obtained by Wishart in 1928)
Definition 5.3.2
Let $X \sim N_{n \times p}(0, I_n \otimes \Sigma)$. Then we say that $W = X'X$ is distributed according to a Wishart distribution $W_p(n, \Sigma)$.
• $p = 1$: $W_p(n, \Sigma) = \sigma^2\chi_n^2$, where $\Sigma = \sigma^2$.
• The density of $W_p(n, \Sigma)$ ($n \ge p$, $\Sigma > 0$) is
$$p(W) = \begin{cases} C\,|W|^{\frac{n-p-1}{2}}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}W\right), & \text{if } W > 0 \\ 0, & \text{otherwise} \end{cases}$$
• $B = \sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})' \sim W_p(n-1, \Sigma)$.
5.4 Discussion on estimation
A. Unbiasedness
Let $\hat{\theta}$ be an estimator of $\theta$. If $E(\hat{\theta}) = \theta$, then $\hat{\theta}$ is called an unbiased estimator of $\theta$.
Theorem 5.4.1
Let $X_1, \ldots, X_n$ be a sample from $N_p(\mu, \Sigma)$. Then
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n}x_j \quad \text{and} \quad S = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})'$$
are unbiased estimators of $\mu$ and $\Sigma$, respectively.
Matlab code: mean, cov, corrcoef
B. Decision Theory
t  x  : an estimator of θ based on sample X
Lθ , t  : a loss function
pθ  x  : the density of X with the parameter θ
Then the average of loss is give by
Rθ , t   Eθ Lθ , t    Lθ , t  x pθ  x dx
That is called the risk function.
max R θ , t  : the maximum risk if t is employed.
θ 
Definition 5.4.2
An estimator $t(X)$ is called a minimax estimator of $\theta$ if
$$\max_{\theta \in \Theta} R(\theta, t) = \min_{t}\max_{\theta \in \Theta} R(\theta, t).$$
Example 1
Under the loss function
$$L(\theta, t) = (\theta - t)'(\theta - t),$$
the sample mean $\bar{x}$ is a minimax estimator of $\mu$.
C. Admissible estimation
Definition 5.4.3
An estimator $t_1(x)$ is said to be at least as good as another $t_2(x)$ if
$$R(\theta, t_1) \le R(\theta, t_2), \quad \forall\,\theta \in \Theta,$$
and $t_1$ is said to be better than (or strictly dominate) $t_2$ if the above inequality holds with strict inequality for at least one $\theta \in \Theta$.
Definition 5.4.4
An estimator t* is said to be inadmissible if there exists
another estimator t** that is better than t*. An estimator t* is
admissible if it is not inadmissible.
• Admissibility is a weak requirement.
• Under the loss $L(\mu, t) = (\mu - t)'(\mu - t)$, the sample mean $\bar{x}$ is inadmissible if the population is $N_p(\mu, \Sigma)$ and $p \ge 3$.
• James & Stein pointed out that
$$\hat{\mu} = \left(1 - \frac{p-2}{n\,\bar{x}'\bar{x}}\right)\bar{x}$$
is better than $\bar{x}$ when $p \ge 3$. The estimator $\hat{\mu}$ is called the James-Stein estimator.
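A minimal Matlab sketch of this estimator, assuming (for illustration only) a $N_p(\mu, I_p)$ population and simulated data:

% Sketch: James-Stein shrinkage of the sample mean (p >= 3).
p  = 5;  n = 50;
mu = ones(p, 1);
X  = randn(n, p) + repmat(mu', n, 1);                % sample from N_p(mu, I_p)

xbar  = mean(X)';
mu_js = (1 - (p - 2)/(n * (xbar' * xbar))) * xbar;   % James-Stein estimator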
5.5 Inferences about a mean vector (Ch.5 Textbook)
Let $X_1, \ldots, X_n$ be iid samples from $N_p(\mu, \Sigma)$.
$$H_0: \mu = \mu_0, \quad H_1: \mu \ne \mu_0$$
Case A: $\Sigma$ is known.
a) $p = 1$:
$$u = \frac{\bar{x} - \mu_0}{\sigma}\sqrt{n} \sim N(0, 1)$$
b) $p > 1$:
$$T_0^2 = n(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0).$$
Under the hypothesis $H_0$, $\bar{x} \sim N_p(\mu_0, \frac{1}{n}\Sigma)$. Then
$$\bar{x} - \mu_0 = n^{-\frac{1}{2}}\Sigma^{\frac{1}{2}}y, \quad y \sim N_p(0, I_p),$$
i.e. $y = \sqrt{n}\,\Sigma^{-\frac{1}{2}}(\bar{x} - \mu_0)$, so
$$T_0^2 = n(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) = y'y \sim \chi_p^2.$$
Theorem 5.5.1
Let $X_1, \ldots, X_n$ be a sample from $N_p(\mu, \Sigma)$, where $\Sigma$ is known. The null distribution of $T_0^2$ under $H_0: \mu = \mu_0$ is $\chi_p^2$, and the rejection region is $T_0^2 \ge \chi_p^2(\alpha)$.
Case B: $\Sigma$ is unknown.
a) Suggestion: Replace $\Sigma$ by the sample covariance matrix $S$ in $T_0^2$, i.e.
$$T^2 = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) = n(n-1)(\bar{x} - \mu_0)'B^{-1}(\bar{x} - \mu_0),$$
where
$$S = \frac{1}{n-1}B = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})'.$$
There are many theoretical approaches to find a suitable statistic. One of the methods is the Likelihood Ratio Criterion.
The Likelihood Ratio Criterion (LRC)
Step 1 The likelihood function
$$L(\mu, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}A\right),$$
where $A = \sum_{j=1}^{n}(x_j - \mu)(x_j - \mu)'$.
Step 2 Domains
$$\Theta = \{(\mu, \Sigma) \mid \mu \in \mathbb{R}^p,\ \Sigma > 0\}, \quad \omega = \{(\mu, \Sigma) \mid \mu = \mu_0,\ \Sigma > 0\}$$
$$\lambda = \frac{\max_{\omega} L(\mu, \Sigma)}{\max_{\Theta} L(\mu, \Sigma)}$$
Step 3 Maximization
We have obtained
$$\max_{\Theta} L(\mu, \Sigma) = (2\pi)^{-np/2}\left|\frac{B}{n}\right|^{-n/2} e^{-np/2}.$$
In a similar way we can find
$$\max_{H_0} L(\mu, \Sigma) = (2\pi)^{-np/2}\left|\frac{A_0}{n}\right|^{-n/2} e^{-np/2},$$
where, under $H_0$,
$$A_0 = \sum_{j=1}^{n}(x_j - \mu_0)(x_j - \mu_0)' = \sum_{j=1}^{n}(x_j - \bar{x} + \bar{x} - \mu_0)(x_j - \bar{x} + \bar{x} - \mu_0)' = B + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'.$$
Then, the LRC is
$$\lambda = \left(\frac{|A_0|}{|B|}\right)^{-n/2} = \left(\frac{|B|}{|B + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'|}\right)^{n/2}.$$
Note
$$|B + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'| = |B|\left|1 + n(\bar{x} - \mu_0)'B^{-1}(\bar{x} - \mu_0)\right| = |B|\left(1 + \frac{T^2}{n-1}\right).$$
Finally
$$\lambda = \left(1 + \frac{T^2}{n-1}\right)^{-n/2}.$$
Remark: Let $t(x)$ be a statistic for the hypothesis and $f(u)$ be a strictly monotone function. Then
$$\eta(x) = f(t(x))$$
is a statistic which is equivalent to $t(x)$. We write $\eta(x) \sim t(x)$.
5.6 T2-statistic
Definition 5.6.1
Let $W \sim W_p(n, \Sigma)$ and $\mu \sim N_p(0, \Sigma)$ be independent, with $n > p$. The distribution of
$$T^2 = n\,\mu'W^{-1}\mu$$
is called the $T^2$ distribution.
• The distribution of $T^2$ does not depend on $\Sigma$; we shall write $T^2 \sim T^2_{p,n}$.
• $\frac{n - p + 1}{np}\,T^2 \sim F_{p,\,n-p+1}$.
• As $\sqrt{n}(\bar{x} - \mu_0) \sim N_p(0, \Sigma)$ and $B \sim W_p(n-1, \Sigma)$,
$$T^2 = (n-1)\left[\sqrt{n}(\bar{x} - \mu_0)\right]'B^{-1}\left[\sqrt{n}(\bar{x} - \mu_0)\right] \sim T^2_{p,\,n-1}$$
and
$$\frac{n-p}{(n-1)p}\,T^2 \sim F_{p,\,n-p}.$$
Theorem 5.6.1
Under $H_0: \mu = \mu_0$, $T^2 \sim T^2_{p,\,n-1}$ and
$$\frac{n-p}{(n-1)p}\,T^2 \sim F_{p,\,n-p}.$$
Theorem 5.6.2
The distribution of $T^2$ is invariant under all affine transformations
$$y = Gx + d, \quad G : p \times p,\ |G| \ne 0, \quad d : p \times 1,$$
of the observations and the hypothesis.
Confidence Region
• A $100(1-\alpha)\%$ confidence region for the mean of a p-dimensional normal distribution is the ellipsoid determined by all $\mu$ such that
$$n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) \le \frac{p(n-1)}{n-p}F_{p,\,n-p}(\alpha).$$
Proof:
                            Original              After transformation
Observations                $X_1, \ldots, X_n$    $y_j = Gx_j + d,\ j = 1, \ldots, n$
Sample mean                 $\bar{x}$             $\bar{y} = G\bar{x} + d$
Sample covariance matrix    $S$                   $S_y = GSG'$
Mean                        $\mu$                 $\mu^* = G\mu + d$
Given mean                  $\mu_0$               $\mu_0^* = G\mu_0 + d$
$H_0$                       $\mu = \mu_0$         $\mu^* = \mu_0^*$

$$T_y^2 = n(\bar{y} - \mu_0^*)'S_y^{-1}(\bar{y} - \mu_0^*) = n\left[G(\bar{x} - \mu_0)\right]'(GSG')^{-1}\left[G(\bar{x} - \mu_0)\right] = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) = T_x^2.$$
Example 5.6.1 (Example 5.2 in Textbook)
Perspiration from 20 healthy females was analyzed.

SWEAT DATA
Individual   X1 (Sweat rate)   X2 (Sodium)   X3 (Potassium)
 1                3.7              48.5            9.3
 2                5.7              65.1            8.0
 3                3.8              47.2           10.9
 4                3.2              53.2           12.0
 5                3.1              55.5            9.7
 6                4.6              36.1            7.9
 7                2.4              24.8           14.0
 8                7.2              33.1            7.6
 9                6.7              47.4            8.5
10                5.4              54.1           11.3
11                3.9              36.9           12.7
12                4.5              58.8           12.3
13                3.5              27.8            9.8
14                4.5              40.2            8.4
15                1.5              13.5           10.1
16                8.5              56.4            7.1
17                4.5              71.6            8.2
18                6.5              52.8           10.9
19                4.1              44.4           11.2
20                5.5              40.9            9.4
Source: Courtesy of Dr. Gerald Bargman.
4
4
 
 
H 0 : μ   50, H1 : μ   50,
 10 
 10 
 
 
Computer calculations provide:
$$\bar{x} = \begin{pmatrix} 4.640 \\ 45.400 \\ 9.965 \end{pmatrix}, \quad S = \begin{pmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{pmatrix}$$
and
$$S^{-1} = \begin{pmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{pmatrix}.$$
We evaluate
$$T^2 = 20\,(4.640 - 4,\ 45.400 - 50,\ 9.965 - 10)\begin{pmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{pmatrix}\begin{pmatrix} 4.640 - 4 \\ 45.400 - 50 \\ 9.965 - 10 \end{pmatrix}$$
$$= 20\,(.640,\ -4.600,\ -.035)\begin{pmatrix} .467 \\ -.042 \\ .160 \end{pmatrix} = 9.74.$$
Comparing the observed $T^2 = 9.74$ with the critical value
$$\frac{(n-1)p}{n-p}F_{p,\,n-p}(.10) = \frac{19 \cdot 3}{17}F_{3,17}(.10) = 3.353 \times 2.44 = 8.18,$$
we see that $T^2 = 9.74 > 8.18$, and consequently we reject $H_0$ at the 10% level of significance.
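This computation can be reproduced in Matlab from the summary statistics above (a sketch; finv again assumes the Statistics Toolbox):

% Reproducing the sweat-data computation from the summary statistics.
n = 20;  p = 3;
xbar = [4.640; 45.400; 9.965];
S    = [ 2.879   10.010  -1.810;
        10.010  199.788  -5.640;
        -1.810   -5.640   3.628];
mu0  = [4; 50; 10];

T2   = n * (xbar - mu0)' * (S \ (xbar - mu0));      % about 9.74
crit = (n - 1)*p/(n - p) * finv(0.90, p, n - p);    % about 8.18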
Mahalanobis Distance
Definition 5.6.2
Let x and y be samples of a population G with mean $\mu$ and covariance matrix $\Sigma > 0$. The quadratic forms
$$D_M^2(x, y) = (x - y)'\Sigma^{-1}(x - y) \quad \text{and} \quad D_M^2(x, G) = (x - \mu)'\Sigma^{-1}(x - \mu)$$
are called the Mahalanobis distance (M-distance) between x and y, and between x and G, respectively.
It can be verified that
• $D_M(x, y) \ge 0$, and $D_M(x, y) = 0 \iff x = y$;
• $D_M(x, y) = D_M(y, x)$;
• $D_M(x, y) \le D_M(x, z) + D_M(z, y)$, $\forall\,x, y, z$;
• $T_0^2 = n(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) = n\,D_M^2(\bar{x}, G)$.
5.7 Two-Sample Problems (Section 6.3, Textbook)
We have two samples from the two populations:
$$G_1: N_p(\mu_1, \Sigma), \quad x_1, \ldots, x_n, \quad n > p,$$
$$G_2: N_p(\mu_2, \Sigma), \quad y_1, \ldots, y_m, \quad m > p,$$
where $\mu_1$, $\mu_2$ and $\Sigma$ are unknown.
H 0 : μ1  μ 2 ,
The LRC is
nm
1
 x  y 'S -pooled
x  y
T 
nm
2
1n
where x   xi ,
n i 1
S pooled
H1 : μ1  μ 2
1 m
y   yj
m j 1
m
1
n


  xi  x  xi  x '    y j  y  y j  y ' 

n  m  2 i 1
j 1

Under the hypothesis,
$$T^2 \sim T^2_{p,\,n+m-2} \quad \text{and} \quad \frac{n+m-p-1}{(n+m-2)p}\,T^2 \sim F_{p,\,n+m-p-1}.$$
The 1001   % confidence region of a'  μ1  μ 2  is
1
2
nm

2
a'  x  y   T
a'S pooled a   a'  μ1  μ 2 
nm


1
2
nm

2
 a'  x  y   T
a'S pooled a  ,
nm


where
T2  Tp2,n  m1   .
Example 5.7.1 (pp. 338-339)
Jolicoeur and Mosimann (1960) studied the relationship of size and shape for painted turtles. The following table contains their measurements on the carapaces of 24 female and 24 male turtles.
                 Female                                      Male
Length(x1)   Width(x2)   Height(x3)     Length(x1)   Width(x2)   Height(x3)
    98           81           38             93           74           37
   103           84           38             94           78           35
   103           86           42             96           80           35
   105           86           42            101           84           39
   109           88           44            102           85           38
   123           92           50            103           81           37
   123           95           46            104           83           39
   133           99           51            106           83           39
   133          102           51            107           82           38
   133          102           51            112           89           40
   134          100           48            113           88           40
   136          102           49            114           86           40
   138           98           51            116           90           43
   138           99           51            117           90           41
   141          105           53            117           91           41
   147          108           57            119           93           41
   149          107           55            120           89           40
   153          107           56            120           93           44
   155          115           63            121           95           42
   155          117           60            125           93           45
   158          115           62            127           96           45
   159          118           63            128           95           45
   162          124           61            131           95           46
   177          132           67            135          106           47
136.0417 
11.3750 
x  102.5833 , y  88.2917 




 52.0417 
40.7083
$$S_{pooled} = \begin{pmatrix} 295.1431 & 175.0607 & 101.6649 \\ 175.0607 & 110.8869 & 61.7491 \\ 101.6649 & 61.7491 & 37.9982 \end{pmatrix}$$
24  24
1
 x  y 'S -pooled
 x  y   72.3816
T 
24  24
24  24  3  1 2
F
T  23.0782 F3, 44 0.01  4.30
324  24  2 
2
5.8 Multivariate Analysis of Variance
A. Review
There are k normal populations:
$$G_1: N(\mu_1, \sigma^2), \quad x_1^{(1)}, \ldots, x_{n_1}^{(1)}, \quad \bar{x}_1$$
$$\vdots$$
$$G_k: N(\mu_k, \sigma^2), \quad x_1^{(k)}, \ldots, x_{n_k}^{(k)}, \quad \bar{x}_k$$
One wants to test equality of the means $\mu_1, \ldots, \mu_k$:
$$H_0: \mu_1 = \cdots = \mu_k, \quad H_1: \mu_i \ne \mu_j \text{ for some } i \ne j$$
The analysis of variance employs a decomposition of sums of squares:
$$SS_{TR} = \sum_{a=1}^{k} n_a(\bar{x}_a - \bar{x})^2, \quad \text{sum of squares among treatments},$$
$$SSE = \sum_{a=1}^{k}\sum_{j=1}^{n_a}\left(x_j^{(a)} - \bar{x}_a\right)^2, \quad \text{sum of squares within groups},$$
$$SST = \sum_{a=1}^{k}\sum_{j=1}^{n_a}\left(x_j^{(a)} - \bar{x}\right)^2, \quad \text{total sum of squares},$$
where
$$\bar{x}_a = \frac{1}{n_a}\sum_{j=1}^{n_a}x_j^{(a)}, \quad \bar{x} = \frac{1}{n}\sum_{a=1}^{k}\sum_{j=1}^{n_a}x_j^{(a)}, \quad n = n_1 + \cdots + n_k.$$
The testing statistic is
$$F = \frac{SS_{TR}/(k-1)}{SSE/(n-k)} \overset{H_0}{\sim} F_{k-1,\,n-k}.$$
B. Multivariate population (pp. 295-305)
$$G_1: N_p(\mu_1, \Sigma), \quad x_1^{(1)}, \ldots, x_{n_1}^{(1)}$$
$$\vdots$$
$$G_k: N_p(\mu_k, \Sigma), \quad x_1^{(k)}, \ldots, x_{n_k}^{(k)}$$
$\Sigma$ is unknown; one wants to test
$$H_0: \mu_1 = \cdots = \mu_k, \quad H_1: \mu_i \ne \mu_j \text{ for some } i \ne j.$$
I. The likelihood ratio criterion
Step 1 The likelihood function
$$L(\mu_1, \ldots, \mu_k, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}A\right),$$
where
$$A = \sum_{a=1}^{k}\sum_{j=1}^{n_a}\left(x_j^{(a)} - \mu_a\right)\left(x_j^{(a)} - \mu_a\right)'.$$
Step 2 The domains
$$\Theta = \{(\mu_1, \ldots, \mu_k, \Sigma) : \mu_j \in \mathbb{R}^p,\ j = 1, \ldots, k,\ \Sigma > 0\}$$
$$\omega = \{(\mu_1, \ldots, \mu_k, \Sigma) : \mu_1 = \cdots = \mu_k \in \mathbb{R}^p,\ \Sigma > 0\}$$
Step 3 Maximization
$$\max_{\Theta} L(\mu_1, \ldots, \mu_k, \Sigma) = (2\pi e)^{-np/2}\left|\frac{1}{n}E\right|^{-n/2}$$
$$\max_{\omega} L(\mu_1, \ldots, \mu_k, \Sigma) = (2\pi e)^{-np/2}\left|\frac{1}{n}T\right|^{-n/2}$$
where
$$T = \sum_{a=1}^{k}\sum_{j=1}^{n_a}\left(x_j^{(a)} - \bar{x}\right)\left(x_j^{(a)} - \bar{x}\right)', \quad E = \sum_{a=1}^{k}\sum_{j=1}^{n_a}\left(x_j^{(a)} - \bar{x}_a\right)\left(x_j^{(a)} - \bar{x}_a\right)'$$
are the total sum of squares and products matrix and the error sum of squares and products matrix, respectively, with
$$\bar{x}_a = \frac{1}{n_a}\sum_{j=1}^{n_a}x_j^{(a)}, \quad \bar{x} = \frac{1}{n}\sum_{a=1}^{k}\sum_{j=1}^{n_a}x_j^{(a)}.$$
The treatment sum of squares and products matrix is
$$B = T - E = \sum_{a=1}^{k} n_a(\bar{x}_a - \bar{x})(\bar{x}_a - \bar{x})'.$$
The LRC is
$$\lambda = \frac{|E|^{n/2}}{|T|^{n/2}} = \Lambda^{n/2}, \quad \text{where } \Lambda = \frac{|E|}{|T|} = \frac{|E|}{|E + B|}.$$
Definition 5.8.1
Assume $A \sim W_p(n, \Sigma)$ and $B \sim W_p(m, \Sigma)$ are independent, where $n \ge p$, $\Sigma > 0$. The distribution of
$$\Lambda = \frac{|A|}{|A + B|}$$
is called the Wilks $\Lambda$-distribution, and we write $\Lambda \sim \Lambda_{p,n,m}$.
Theorem 5.8.1
Under $H_0$ we have
1) $T \sim W_p(n-1, \Sigma)$, $E \sim W_p(n-k, \Sigma)$, $B \sim W_p(k-1, \Sigma)$;
2) $E$ and $B$ are independent;
3) the LRC under the hypothesis has a $\Lambda_{p,\,n-k,\,k-1}$ distribution.
Special cases of the Wilks $\Lambda$-distribution $\Lambda_{p,n,m}$:
• $m = 1$: $\frac{(n-p+1)(1-\Lambda)}{p\,\Lambda} \sim F_{p,\,n-p+1}$
• $m = 2$: $\frac{(n-p)(1-\sqrt{\Lambda})}{p\,\sqrt{\Lambda}} \sim F_{2p,\,2(n-p)}$
• $p = 1$: $\frac{n(1-\Lambda)}{m\,\Lambda} \sim F_{m,\,n}$
• $p = 2$: $\frac{(n-1)(1-\sqrt{\Lambda})}{m\,\sqrt{\Lambda}} \sim F_{2m,\,2(n-1)}$
See pp. 300-305 of the textbook for an example.
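A minimal Matlab sketch of the MANOVA matrices and Wilks $\Lambda$ in this notation (placeholder data):

% Sketch: E, B, T and Wilks Lambda for k groups of p-variate data,
% held in a cell array of (n_a-by-p) matrices.
groups = {randn(10,2), randn(12,2) + 0.5, randn(9,2)};
k    = numel(groups);
Xall = vertcat(groups{:});
n    = size(Xall, 1);  p = size(Xall, 2);
xbar = mean(Xall)';

E = zeros(p);  B = zeros(p);
for a = 1:k
    Xa = groups{a};  na = size(Xa, 1);  xa = mean(Xa)';
    E  = E + (na - 1)*cov(Xa);                % error SSP matrix
    B  = B + na*(xa - xbar)*(xa - xbar)';     % treatment SSP matrix
end
T      = E + B;                               % total SSP matrix
Lambda = det(E)/det(T);                       % ~ Lambda_{p, n-k, k-1} under H0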
II. Union-Intersection Decision Rule
$$H_0: \mu_1 = \cdots = \mu_k$$
Consider the projection hypotheses
$$H_{a0}: a'\mu_1 = \cdots = a'\mu_k, \quad a \in \mathbb{R}^p,\ a \ne 0,$$
so that
$$H_0 = \bigcap_{a \ne 0} H_{a0}.$$
The projected populations and data are
$$G_{a1}: N(a'\mu_1, a'\Sigma a), \quad a'x_1^{(1)}, \ldots, a'x_{n_1}^{(1)}$$
$$\vdots$$
$$G_{ak}: N(a'\mu_k, a'\Sigma a), \quad a'x_1^{(k)}, \ldots, a'x_{n_k}^{(k)}$$
For the projected data we have
$$SS_{TR}(a) = a'Ba, \quad SSE(a) = a'Ea, \quad SST(a) = a'Ta,$$
and the F-statistic
$$F_a = \frac{a'Ba/(k-1)}{a'Ea/(n-k)} \overset{H_0}{\sim} F_{k-1,\,n-k}.$$
The rejection region for $H_{a0}$ is $R_a = \{F_a \ge F_{k-1,\,n-k}(\alpha^*)\}$. With the rejection region $\bigcup_{a \in \mathbb{R}^p} R_a$, the testing statistic is $\max_{a} F_a$, or equivalently
$$\max_{a \ne 0}\frac{a'Ba}{a'Ea},$$
with rejection when $\max_a F_a \ge F_{k-1,\,n-k}(\alpha^*)$.
Lemma 1
Let A be a symmetric matrix of order p. Denote by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ the eigenvalues of A, and by $l_1, \ldots, l_p$ the associated eigenvectors. Then
$$\max_{x \ne 0}\frac{x'Ax}{x'x} = \max_{\|x\|=1} x'Ax = \lambda_1, \quad \min_{x \ne 0}\frac{x'Ax}{x'x} = \min_{\|x\|=1} x'Ax = \lambda_p.$$
Lemma 2
Let A and B be two $p \times p$ matrices with $A' = A$, $B > 0$. Denote by $\lambda_1 \ge \cdots \ge \lambda_p$ and $l_1, \ldots, l_p$ the eigenvalues and associated eigenvectors of $B^{-\frac{1}{2}}AB^{-\frac{1}{2}}$. Then
$$\max_{x \ne 0}\frac{x'Ax}{x'Bx} = \lambda_1, \quad \max_{\substack{x'l_i = 0,\ i = 1, \ldots, k \\ x \ne 0}}\frac{x'Ax}{x'Bx} = \lambda_{k+1}, \quad k = 1, \ldots, p-1.$$
Remark 1: $\lambda_1, \ldots, \lambda_p$ are also the eigenvalues of $B^{-1}A$ ($B > 0$).
Remark 2: The union-intersection statistic is the largest eigenvalue of $E^{-\frac{1}{2}}BE^{-\frac{1}{2}}$ ($E > 0$).
Remark 3: Let $\lambda_1 \ge \cdots \ge \lambda_p$ be the eigenvalues of $E^{-\frac{1}{2}}BE^{-\frac{1}{2}}$. The Wilks $\Lambda$-statistic can be expressed as
$$\Lambda = \prod_{i=1}^{p}\frac{1}{1 + \lambda_i}.$$