Chapter 5
Statistical Inference
Estimation and Testing Hypotheses
5.1 Data Sets & Matrix Normal Distribution
Data matrix
$$X = \begin{pmatrix} X_{11} & \cdots & X_{1p} \\ \vdots & & \vdots \\ X_{n1} & \cdots & X_{np} \end{pmatrix} \qquad \begin{matrix} n \text{ observations (rows)} \\ p \text{ variables (columns)} \end{matrix}$$
where the $n$ rows $X_1, \dots, X_n$ are i.i.d. $N_p(\mu, \Sigma)$.
$\mathrm{Vec}(X')$ is an $np \times 1$ random vector with
mean vector $(\mu', \dots, \mu')' = 1_n \otimes \mu$,
covariance matrix $\mathrm{diag}(\Sigma, \dots, \Sigma) = I_n \otimes \Sigma$.
We write $X \sim N_{n \times p}(1_n \mu', I_n \otimes \Sigma)$. More generally, we can define the matrix normal distribution.
Definition 5.1.1
An $n \times p$ random matrix $X$ is said to follow a matrix normal distribution $N_{n \times p}(M, W \otimes V)$ if $\mathrm{Vec}(X') \sim N_{np}(\mu, W \otimes V)$, where $\mu = \mathrm{Vec}(M')$.
In this case,
$$X = M + BYA',$$
where $W = BB'$, $V = AA'$, and $Y$ has i.i.d. elements each following $N(0,1)$.
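This construction is easy to simulate. The following MATLAB sketch draws one matrix normal variate by Cholesky-factoring $W$ and $V$; the particular $n$, $p$, $M$, $W$, $V$ below are illustrative choices, not values from the notes.

```matlab
% Sketch: draw X ~ N_{n x p}(M, W (x) V) via X = M + B*Y*A' (Definition 5.1.1).
n = 5; p = 3;
M = zeros(n, p);                  % mean matrix
W = eye(n);                       % n x n row covariance, positive definite
V = [1 .5 .2; .5 1 .3; .2 .3 1];  % p x p column covariance
B = chol(W, 'lower');             % W = B*B'
A = chol(V, 'lower');             % V = A*A'
Y = randn(n, p);                  % i.i.d. N(0,1) entries
X = M + B * Y * A';               % one draw from N_{n x p}(M, W (x) V)
```

With $W = I_n$ and $M = 1_n \mu'$ this reproduces the i.i.d.-rows data matrix above.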
Theorem 5.1.1
The density function of $X \sim N_{n \times p}(M, W \otimes V)$ with $W > 0$, $V > 0$ is given by
$$(2\pi)^{-np/2} \, |W|^{-p/2} \, |V|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} W^{-1}(X - M) V^{-1} (X - M)'\right),$$
where $\mathrm{etr}(A) = \exp(\mathrm{tr}(A))$.
Corollary 1:
Let $X$ be a matrix of $n$ observations from $N_p(\mu, \Sigma)$. Then the density function of $X$ is
$$(2\pi)^{-np/2} \, |\Sigma|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} A\right),$$
where
$$A = \sum_{j=1}^{n} (X_j - \mu)(X_j - \mu)'.$$
5.2 Maximum Likelihood Estimation
A. Review
$X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$.
Step 1. The likelihood function
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)$$
Step 2. Domain (parameter space)
$$H = \{(\mu, \sigma^2) : \mu \in R, \ \sigma^2 > 0\}$$
The MLE of $(\mu, \sigma^2)$ maximizes $L(\mu, \sigma^2)$ over $H$.
Step 3. Maximization. Since $\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$,
$$L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2\right) \exp\left(-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right) \le L(\bar{x}, \sigma^2), \quad \forall \, \sigma^2 > 0.$$
It implies that $\hat{\mu} = \bar{x}$.
Let $a = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $g(\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{a}{2\sigma^2}\right)$. Solving $g'(\sigma^2) = 0$ gives
$$\hat{\sigma}^2 = \frac{a}{n} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
(See Result 4.9, p. 168 of the textbook.)
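A quick numerical check of Step 3 in MATLAB (the simulated data and the true values 5 and 4 are illustrative):

```matlab
% Check that the MLEs match the closed forms mu_hat = xbar, sigma2_hat = a/n.
x = 5 + 2 * randn(1000, 1);          % illustrative sample from N(5, 4)
mu_hat = mean(x);                    % MLE of mu: the sample mean
sigma2_hat = mean((x - mu_hat).^2);  % MLE of sigma^2: divisor n, not n-1
```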
B. Multivariate population
$X_1, \dots, X_n$ are samples of $N_p(\mu, \Sigma)$.
Step 1. The likelihood function
$$L(\mu, \Sigma) = (2\pi)^{-np/2} \, |\Sigma|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} A\right),$$
where $A = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)'$.
Step 2. Domain
$$H = \{(\mu, \Sigma) : \mu \in R^p, \ \Sigma : p \times p, \ \Sigma > 0\}$$
Step 3. Maximization
(a)
$$\max_{\mu, \, \Sigma > 0} L(\mu, \Sigma) = \max_{\Sigma > 0} L(\bar{x}, \Sigma) = \max_{\Sigma > 0} (2\pi)^{-np/2} \, |\Sigma|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} B\right),$$
where $B = \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})'$.
We can prove that $P(B > 0) = 1$ if $n > p$.
(b) Let $B = CC'$, $C : p \times p$, $|C| \ne 0$, and let $\Sigma = C \Sigma^* C'$. Then $\Sigma^{*-1} = C' \Sigma^{-1} C$.
We have
$$\mathrm{tr}(\Sigma^{-1} B) = \mathrm{tr}(\Sigma^{-1} CC') = \mathrm{tr}(C' \Sigma^{-1} C) = \mathrm{tr}(\Sigma^{*-1}), \qquad |\Sigma^*| = |C^{-1}| \, |\Sigma| \, |C'^{-1}| = |\Sigma| / |B|,$$
so that
$$\max_{\Sigma > 0} |\Sigma|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} B\right) = |B|^{-n/2} \max_{\Sigma^* > 0} |\Sigma^*|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{*-1}\right).$$
(c) Let $\lambda_1, \dots, \lambda_p$ be the eigenvalues of $\Sigma^*$. Then
$$\max_{\Sigma^* > 0} |\Sigma^*|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{*-1}\right) = \max_{\lambda_1, \dots, \lambda_p > 0} \prod_{j=1}^{p} \lambda_j^{-n/2} e^{-\frac{1}{2\lambda_j}}.$$
The function $g(\lambda) = \lambda^{-n/2} e^{-1/(2\lambda)}$ attains its maximum at $\lambda = 1/n$.
The function $L(\Sigma^*)$ attains its maximum at $\lambda_1 = 1/n, \dots, \lambda_p = 1/n$, and
$$\hat{\Sigma}^* = \frac{1}{n} I_p.$$
(d) The MLE of $\Sigma$ is
$$\hat{\Sigma} = C \hat{\Sigma}^* C' = \frac{1}{n} CC' = \frac{1}{n} B.$$
Theorem 5.2.1
Let $X_1, \dots, X_n$ be a sample from $N_p(\mu, \Sigma)$ with $n > p$ and $\Sigma > 0$. Then the MLEs of $\mu$ and $\Sigma$ are
$$\hat{\mu} = \bar{x} \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})',$$
respectively, and the maximum likelihood is
$$L(\bar{x}, \hat{\Sigma}) = (2\pi)^{-np/2} \left|\frac{B}{n}\right|^{-n/2} e^{-np/2}.$$
Theorem 5.2.2
Under the above notations, we have:
a) $\bar{x}$ and $\hat{\Sigma}$ are independent;
b) $\bar{x} \sim N_p\left(\mu, \frac{1}{n}\Sigma\right)$;
c) $\hat{\Sigma}$ is a biased estimator of $\Sigma$:
$$E(\hat{\Sigma}) = \frac{n-1}{n} \Sigma.$$
An unbiased estimator of $\Sigma$ is recommended:
$$S = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})',$$
called the sample covariance matrix.
Theorem 5.2.3
Let $\hat{\theta}$ be the MLE of $\theta$ and $f(\theta)$ be a measurable function. Then $f(\hat{\theta})$ is the MLE of $f(\theta)$.
Corollary 1
The MLE of the correlations $\rho_{ij}$ is
$$r_{ij} = \frac{b_{ij}}{\sqrt{b_{ii} \, b_{jj}}},$$
where $B = (b_{ij})$.
Matlab code: mean, cov, corrcoef (see the sketch below)
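A usage sketch for these three calls, together with the divisor-$n$ MLE of Theorem 5.2.1 (the data matrix here is illustrative):

```matlab
% Rows of X are observations; columns are variables.
X = randn(50, 3);                % illustrative n x p data matrix
xbar = mean(X);                  % MLE of mu (row vector)
S = cov(X);                      % sample covariance, divisor n-1
n = size(X, 1);
Sigma_hat = (n - 1) / n * S;     % MLE of Sigma, divisor n (Theorem 5.2.1)
R = corrcoef(X);                 % MLE of the correlation matrix (Corollary 1)
```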
5.3 Wishart distribution
A. Chi-square distribution
Let $X_1, \dots, X_n$ be i.i.d. $N(0,1)$. Then $Y = X_1^2 + \cdots + X_n^2 \sim \chi_n^2$, the chi-square distribution with $n$ degrees of freedom.
Definition 5.3.1
If $x \sim N_n(0, I_n)$, then $Y = x'x$ is said to have a chi-square distribution with $n$ degrees of freedom, and we write $Y \sim \chi_n^2$.
- If $x \sim N_n(0, \sigma^2 I_n)$, then $Y = \frac{1}{\sigma^2} x'x \sim \chi_n^2$.
- If $x \sim N_n(0, \Sigma)$, then $Y = x' \Sigma^{-1} x \sim \chi_n^2$.
B. Wishart distribution (obtained by Wishart in 1928)
Definition 5.3.2
Let $X \sim N_{n \times p}(0, I_n \otimes \Sigma)$. Then we say that $W = X'X$ is distributed according to a Wishart distribution $W_p(n, \Sigma)$.
- For $p = 1$, $W_1(n, \sigma^2) = \sigma^2 \chi_n^2$, where $\Sigma = \sigma^2$.
- The density of $W_p(n, \Sigma)$ ($n \ge p$, $\Sigma > 0$) is
$$p_W(W) = \begin{cases} C \, |W|^{\frac{n-p-1}{2}} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} W\right), & \text{if } W > 0, \\ 0, & \text{otherwise.} \end{cases}$$
- $B = \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' \sim W_p(n-1, \Sigma)$.
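A small Monte Carlo sketch consistent with this last fact: since $B \sim W_p(n-1, \Sigma)$, the average of $B$ over many samples should approach $(n-1)\Sigma$. All parameter values below are illustrative.

```matlab
% Monte Carlo check that E(B) = (n-1)*Sigma when B ~ W_p(n-1, Sigma).
n = 20; p = 2; reps = 5000;
Sigma = [2 .6; .6 1];
L = chol(Sigma, 'lower');
Bbar = zeros(p);
for r = 1:reps
    X = randn(n, p) * L';             % rows i.i.d. N_p(0, Sigma)
    Xc = X - mean(X);                 % center at the sample mean
    Bbar = Bbar + (Xc' * Xc) / reps;  % running average of B = Xc'*Xc
end
disp(Bbar ./ ((n - 1) * Sigma))       % entries should be close to 1
```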
5.4 Discussion on estimation
A. Unbiasedness
Let $\hat{\theta}$ be an estimator of $\theta$. If $E(\hat{\theta}) = \theta$, then $\hat{\theta}$ is called an unbiased estimator of $\theta$.
Theorem 5.4.1
Let $X_1, \dots, X_n$ be a sample from $N_p(\mu, \Sigma)$. Then
$$\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \text{and} \qquad S = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})'$$
are unbiased estimators of $\mu$ and $\Sigma$, respectively.
Matlab code: mean, cov, corrcoef
B. Decision Theory
- $t(x)$: an estimator of $\theta$ based on sample $X$
- $L(\theta, t)$: a loss function
- $p_\theta(x)$: the density of $X$ with parameter $\theta$
Then the average loss is given by
$$R(\theta, t) = E_\theta L(\theta, t) = \int L(\theta, t(x)) \, p_\theta(x) \, dx.$$
This is called the risk function.
$\max_\theta R(\theta, t)$: the maximum risk if $t$ is employed.
Definition 5.4.2
An estimator $t(X)$ is called a minimax estimator of $\theta$ if
$$\max_\theta R(\theta, t) = \min_{t} \max_\theta R(\theta, t).$$
Example 1
Under the loss function
$$L(\theta, t) = (\theta - t)'(\theta - t),$$
the sample mean $\bar{x}$ is a minimax estimator of $\mu$.
C. Admissible estimation
Definition 5.4.3
An estimator $t_1(x)$ is said to be at least as good as another $t_2(x)$ if
$$R(\theta, t_1) \le R(\theta, t_2), \quad \forall \theta,$$
and $t_1$ is said to be better than (or strictly dominate) $t_2$ if the above inequality holds with strict inequality for at least one $\theta$.
Definition 5.4.4
An estimator $t^*$ is said to be inadmissible if there exists another estimator $t^{**}$ that is better than $t^*$. An estimator $t^*$ is admissible if it is not inadmissible.
Admissibility is a weak requirement. Under the loss $L(\mu, t) = (\mu - t)'(\mu - t)$, the sample mean $\bar{x}$ is inadmissible if the population is $N_p(\mu, \Sigma)$ and $p \ge 3$.
James & Stein pointed out that
$$\hat{\mu} = \left(1 - \frac{p-2}{n \, \bar{x}'\bar{x}}\right) \bar{x}$$
is better than $\bar{x}$ when $p \ge 3$. The estimator $\hat{\mu}$ is called the James-Stein estimator.
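A minimal MATLAB sketch of the shrinkage formula above, assuming for simplicity that $\Sigma = I_p$ (the setting in which this form of the estimator is usually stated); the data are illustrative.

```matlab
% James-Stein shrinkage of the sample mean toward 0, p >= 3, assuming Sigma = I.
p = 5; n = 30;
mu = ones(p, 1);                     % illustrative true mean
X = randn(n, p) + mu';               % sample from N_p(mu, I)
xbar = mean(X)';                     % p x 1 sample mean
shrink = 1 - (p - 2) / (n * (xbar' * xbar));
mu_js = shrink * xbar;               % James-Stein estimate of mu
```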
5.5 Inferences about a mean vector (Ch.5 Textbook)
Let $X_1, \dots, X_n$ be i.i.d. samples from $N_p(\mu, \Sigma)$.
$$H_0 : \mu = \mu_0, \qquad H_1 : \mu \ne \mu_0$$
Case A: $\Sigma$ is known.
a) $p = 1$:
$$u = \frac{\bar{x} - \mu_0}{\sigma} \sqrt{n} \sim N(0, 1)$$
b) $p > 1$:
$$T_0^2 = n (\bar{x} - \mu_0)' \Sigma^{-1} (\bar{x} - \mu_0).$$
Under the hypothesis $H_0$, $\bar{x} \sim N_p(\mu_0, \frac{1}{n}\Sigma)$. Then
$$\bar{x} - \mu_0 = \frac{1}{\sqrt{n}} \Sigma^{1/2} y, \qquad y \sim N_p(0, I_p),$$
that is, $y = \sqrt{n} \, \Sigma^{-1/2} (\bar{x} - \mu_0)$, so
$$T_0^2 = n (\bar{x} - \mu_0)' \Sigma^{-1} (\bar{x} - \mu_0) = y'y \sim \chi_p^2.$$
Theorem 5.5.1
Let $X_1, \dots, X_n$ be a sample from $N_p(\mu, \Sigma)$, where $\Sigma$ is known. The null distribution of $T_0^2$ under $H_0 : \mu = \mu_0$ is $\chi_p^2$, and the rejection region is $T_0^2 > \chi_p^2(\alpha)$.
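A MATLAB sketch of this test (chi2inv is in the Statistics Toolbox; the data, $\Sigma$, $\mu_0$, and $\alpha$ are illustrative):

```matlab
% Chi-square test of H0: mu = mu0 with Sigma known (Theorem 5.5.1).
p = 3; n = 25; alpha = 0.05;
Sigma = eye(p); mu0 = zeros(p, 1);
X = randn(n, p);                                  % illustrative sample (H0 true)
xbar = mean(X)';
T0sq = n * (xbar - mu0)' / Sigma * (xbar - mu0);  % n*(xbar-mu0)'*inv(Sigma)*(xbar-mu0)
reject = T0sq > chi2inv(1 - alpha, p)             % true => reject H0 at level alpha
```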
Case B: $\Sigma$ is unknown.
a) Suggestion: replace $\Sigma$ by the sample covariance matrix $S$ in $T_0^2$, i.e.
$$T^2 = n (\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) = n(n-1) (\bar{x} - \mu_0)' B^{-1} (\bar{x} - \mu_0),$$
where
$$S = \frac{1}{n-1} B = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})'.$$
There are many theoretical approaches to find a suitable statistic. One of these is the Likelihood Ratio Criterion.
The Likelihood Ratio Criterion (LRC)
Step 1. The likelihood function
$$L(\mu, \Sigma) = (2\pi)^{-np/2} \, |\Sigma|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} A\right),$$
where $A = \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)'$.
Step 2. Domains
$$\Omega = \{(\mu, \Sigma) \mid \mu \in R^p, \ \Sigma > 0\}, \qquad \omega = \{(\mu, \Sigma) \mid \mu = \mu_0, \ \Sigma > 0\}$$
Step 3. Maximization
We have obtained
$$\max_{\Omega} L(\mu, \Sigma) = (2\pi)^{-np/2} \left|\frac{B}{n}\right|^{-n/2} e^{-np/2}.$$
In a similar way we can find
$$\max_{H_0} L(\mu, \Sigma) = (2\pi)^{-np/2} \left|\frac{A_0}{n}\right|^{-n/2} e^{-np/2},$$
where, under $H_0$,
$$A_0 = \sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)' = \sum_{j=1}^{n} (x_j - \bar{x} + \bar{x} - \mu_0)(x_j - \bar{x} + \bar{x} - \mu_0)' = B + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'.$$
Then the LRC is
$$\lambda = \left(\frac{|A_0|}{|B|}\right)^{-n/2} = \left(\frac{|B + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'|}{|B|}\right)^{-n/2}.$$
Note
$$|B + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'| = |B| \left(1 + n(\bar{x} - \mu_0)' B^{-1} (\bar{x} - \mu_0)\right) = |B| \left(1 + \frac{T^2}{n-1}\right).$$
Finally,
$$\lambda = \left(1 + \frac{T^2}{n-1}\right)^{-n/2}.$$
Remark: Let $t(x)$ be a statistic for the hypothesis and $f(u)$ a strictly monotone function. Then $\psi(x) = f(t(x))$ is a statistic equivalent to $t(x)$; we write $\psi(x) \cong t(x)$. In particular, $\lambda$ is a strictly decreasing function of $T^2$, so the LRC test is equivalent to rejecting $H_0$ for large $T^2$.
5.6 T2-statistic
Definition 5.6.1
Let $W \sim W_p(n, \Sigma)$ and $\mu \sim N_p(0, \Sigma)$ be independent with $n > p$. The distribution of
$$T^2 = n \, \mu' W^{-1} \mu$$
is called the $T^2$ distribution.
• The distribution of $T^2$ does not depend on $\Sigma$; we write $T^2 \sim T^2_{p,n}$.
• $\dfrac{n - p + 1}{np} \, T^2 \sim F_{p, \, n-p+1}$.
• As $\sqrt{n}(\bar{x} - \mu_0) \sim N_p(0, \Sigma)$ and $B \sim W_p(n-1, \Sigma)$,
$$T^2 = (n-1) \left(\sqrt{n}(\bar{x} - \mu_0)\right)' B^{-1} \left(\sqrt{n}(\bar{x} - \mu_0)\right) \sim T^2_{p, \, n-1},$$
and
$$\frac{n-p}{(n-1)p} \, T^2 \sim F_{p, \, n-p}.$$
Theorem 5.6.1
Under $H_0 : \mu = \mu_0$, $T^2 \sim T^2_{p, \, n-1}$ and
$$\frac{n-p}{(n-1)p} \, T^2 \sim F_{p, \, n-p}.$$
Theorem 5.6.2
The distribution of $T^2$ is invariant under all affine transformations
$$y = Gx + d, \qquad G : p \times p, \ |G| \ne 0, \ d : p \times 1,$$
of the observations and the hypothesis.
Confidence Region
• A $100(1-\alpha)\%$ confidence region for the mean of a $p$-dimensional normal distribution is the ellipsoid determined by all $\mu$ such that
$$n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le \frac{p(n-1)}{n-p} \, F_{p, \, n-p}(\alpha).$$
Proof:

                          Original           After transformation
observations              x_1, ..., x_n      y_j = d + G x_j, j = 1, ..., n
sample mean               x̄                  ȳ = d + G x̄
sample covariance matrix  S                  S_y = G S G'
mean                      μ                  μ* = G μ + d
given mean                μ_0                μ_0* = G μ_0 + d
hypothesis H_0            μ = μ_0            μ* = μ_0*

Then
$$T_y^2 = n(\bar{y} - \mu_0^*)' S_y^{-1} (\bar{y} - \mu_0^*) = n\left(G(\bar{x} - \mu_0)\right)' (GSG')^{-1} G(\bar{x} - \mu_0) = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) = T_x^2.$$
Example 5.6.1 (Example 5.2 in Textbook)
Perspiration from 20 healthy females was analyzed.

SWEAT DATA
Individual   X1 (Sweat rate)   X2 (Sodium)   X3 (Potassium)
     1             3.7             48.5            9.3
     2             5.7             65.1            8.0
     3             3.8             47.2           10.9
     4             3.2             53.2           12.0
     5             3.1             55.5            9.7
     6             4.6             36.1            7.9
     7             2.4             24.8           14.0
     8             7.2             33.1            7.6
     9             6.7             47.4            8.5
    10             5.4             54.1           11.3
    11             3.9             36.9           12.7
    12             4.5             58.8           12.3
    13             3.5             27.8            9.8
    14             4.5             40.2            8.4
    15             1.5             13.5           10.1
    16             8.5             56.4            7.1
    17             4.5             71.6            8.2
    18             6.5             52.8           10.9
    19             4.1             44.4           11.2
    20             5.5             40.9            9.4
Source: Courtesy of Dr. Gerald Bargman.
$$H_0 : \mu = \begin{pmatrix} 4 \\ 50 \\ 10 \end{pmatrix}, \qquad H_1 : \mu \ne \begin{pmatrix} 4 \\ 50 \\ 10 \end{pmatrix}$$
Computer calculations provide:
$$\bar{x} = \begin{pmatrix} 4.640 \\ 45.400 \\ 9.965 \end{pmatrix}, \qquad S = \begin{pmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{pmatrix},$$
and
$$S^{-1} = \begin{pmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{pmatrix}.$$
We evaluate
$$T^2 = 20 \, (4.640 - 4, \ 45.400 - 50, \ 9.965 - 10) \, S^{-1} \begin{pmatrix} 4.640 - 4 \\ 45.400 - 50 \\ 9.965 - 10 \end{pmatrix} = 20 \, (.640, \ -4.600, \ -.035) \begin{pmatrix} .467 \\ -.042 \\ .160 \end{pmatrix} = 9.74.$$
Comparing the observed $T^2 = 9.74$ with the critical value
$$\frac{(n-1)p}{n-p} \, F_{p, \, n-p}(.10) = \frac{19 \cdot 3}{17} \, F_{3,17}(.10) = 3.353 \times 2.44 = 8.18,$$
we see that $T^2 = 9.74 > 8.18$, and consequently we reject $H_0$ at the 10% level of significance.
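The whole calculation can be reproduced from the table above in a few lines of MATLAB (finv is in the Statistics Toolbox):

```matlab
% Example 5.6.1: Hotelling T^2 test of H0: mu = (4, 50, 10)' for the sweat data.
X = [3.7 48.5  9.3; 5.7 65.1  8.0; 3.8 47.2 10.9; 3.2 53.2 12.0; ...
     3.1 55.5  9.7; 4.6 36.1  7.9; 2.4 24.8 14.0; 7.2 33.1  7.6; ...
     6.7 47.4  8.5; 5.4 54.1 11.3; 3.9 36.9 12.7; 4.5 58.8 12.3; ...
     3.5 27.8  9.8; 4.5 40.2  8.4; 1.5 13.5 10.1; 8.5 56.4  7.1; ...
     4.5 71.6  8.2; 6.5 52.8 10.9; 4.1 44.4 11.2; 5.5 40.9  9.4];
mu0 = [4; 50; 10];
[n, p] = size(X);
xbar = mean(X)'; S = cov(X);
T2 = n * (xbar - mu0)' / S * (xbar - mu0)            % about 9.74
crit = (n - 1) * p / (n - p) * finv(0.90, p, n - p)  % about 8.18; T2 > crit => reject
```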
Mahalanobis Distance
Definition 5.6.2
Let $x$ and $y$ be samples of a population $G$ with mean $\mu$ and covariance matrix $\Sigma > 0$. The quadratic forms
$$D_M^2(x, y) = (x - y)' \Sigma^{-1} (x - y) \qquad \text{and} \qquad D_M^2(x, G) = (x - \mu)' \Sigma^{-1} (x - \mu)$$
are called the Mahalanobis distance (M-distance) between $x$ and $y$, and between $x$ and $G$, respectively.
It can be verified that:
• $D_M(x, y) \ge 0$, and $D_M(x, y) = 0 \iff x = y$;
• $D_M(x, y) = D_M(y, x)$;
• $D_M(x, y) \le D_M(x, z) + D_M(z, y)$ for all $x, y, z$;
• $T_0^2 = n(\bar{x} - \mu_0)' \Sigma^{-1} (\bar{x} - \mu_0) = n \, D_M^2(\bar{x}, G)$.
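A small MATLAB sketch of both distances ($\mu$, $\Sigma$, $x$, and $y$ are illustrative values):

```matlab
% Mahalanobis distances of Definition 5.6.2.
mu = [0; 0]; Sigma = [2 .5; .5 1];
x = [1; 2]; y = [-1; 0];
DM_xy = sqrt((x - y)' / Sigma * (x - y));    % M-distance between x and y
DM_xG = sqrt((x - mu)' / Sigma * (x - mu));  % M-distance between x and G
```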
5.7 Two Samples Problems (Section 6.3, Textbook)
We have two samples from the two populations
$$G_1 : N_p(\mu_1, \Sigma), \quad x_1, \dots, x_n, \ n > p,$$
$$G_2 : N_p(\mu_2, \Sigma), \quad y_1, \dots, y_m, \ m > p,$$
where $\mu_1$, $\mu_2$ and $\Sigma$ are unknown. We test
$$H_0 : \mu_1 = \mu_2, \qquad H_1 : \mu_1 \ne \mu_2.$$
The LRC is equivalent to
$$T^2 = \frac{nm}{n+m} \, (\bar{x} - \bar{y})' S_{\text{pooled}}^{-1} (\bar{x} - \bar{y}),$$
where
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{m} \sum_{j=1}^{m} y_j,$$
$$S_{\text{pooled}} = \frac{1}{n+m-2} \left[ \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' + \sum_{j=1}^{m} (y_j - \bar{y})(y_j - \bar{y})' \right].$$
Under the hypothesis,
$$T^2 \sim T^2_{p, \, n+m-2} \qquad \text{and} \qquad \frac{n+m-p-1}{(n+m-2)p} \, T^2 \sim F_{p, \, n+m-p-1}.$$
The $100(1-\alpha)\%$ confidence region of $a'(\mu_1 - \mu_2)$ is
$$a'(\bar{x} - \bar{y}) - \left( T_\alpha^2 \, \frac{n+m}{nm} \, a' S_{\text{pooled}} \, a \right)^{1/2} \le a'(\mu_1 - \mu_2) \le a'(\bar{x} - \bar{y}) + \left( T_\alpha^2 \, \frac{n+m}{nm} \, a' S_{\text{pooled}} \, a \right)^{1/2},$$
where $T_\alpha^2 = T^2_{p, \, n+m-2}(\alpha)$.
Example 5.7.1 (pp. 338-339)
Jolicoeur and Mosimann (1960) studied the relationship of size and shape for painted turtles. The following table contains their measurements on the carapaces of 24 female and 24 male turtles.

              Female                                 Male
Length(x1)  Width(x2)  Height(x3)    Length(x1)  Width(x2)  Height(x3)
    98          81         38             93          74         37
   103          84         38             94          78         35
   103          86         42             96          80         35
   105          86         42            101          84         39
   109          88         44            102          85         38
   123          92         50            103          81         37
   123          95         46            104          83         39
   133          99         51            106          83         39
   133         102         51            107          82         38
   133         102         51            112          89         40
   134         100         48            113          88         40
   136         102         49            114          86         40
   138          98         51            116          90         43
   138          99         51            117          90         41
   141         105         53            117          91         41
   147         108         57            119          93         41
   149         107         55            120          89         40
   153         107         56            120          93         44
   155         115         63            121          95         42
   155         117         60            125          93         45
   158         115         62            127          96         45
   159         118         63            128          95         45
   162         124         61            131          95         46
   177         132         67            135         106         47
$$\bar{x} = \begin{pmatrix} 136.0417 \\ 102.5833 \\ 52.0417 \end{pmatrix}, \qquad \bar{y} = \begin{pmatrix} 113.3750 \\ 88.2917 \\ 40.7083 \end{pmatrix},$$
$$S_{\text{pooled}} = \begin{pmatrix} 295.1431 & 175.0607 & 101.6649 \\ 175.0607 & 110.8869 & 61.7491 \\ 101.6649 & 61.7491 & 37.9982 \end{pmatrix}.$$
Then
$$T^2 = \frac{24 \cdot 24}{24 + 24} \, (\bar{x} - \bar{y})' S_{\text{pooled}}^{-1} (\bar{x} - \bar{y}) = 72.3816,$$
$$F = \frac{24 + 24 - 3 - 1}{3(24 + 24 - 2)} \, T^2 = 23.0782 > F_{3,44}(0.01) = 4.30,$$
so $H_0 : \mu_1 = \mu_2$ is rejected at the 1% level.
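A MATLAB sketch reproducing these numbers from the table above (finv is in the Statistics Toolbox):

```matlab
% Example 5.7.1: two-sample T^2 for the painted turtle data (F = female, M = male).
F = [ 98  81 38; 103  84 38; 103  86 42; 105  86 42; 109  88 44; 123  92 50; ...
     123  95 46; 133  99 51; 133 102 51; 133 102 51; 134 100 48; 136 102 49; ...
     138  98 51; 138  99 51; 141 105 53; 147 108 57; 149 107 55; 153 107 56; ...
     155 115 63; 155 117 60; 158 115 62; 159 118 63; 162 124 61; 177 132 67];
M = [ 93  74 37;  94  78 35;  96  80 35; 101  84 39; 102  85 38; 103  81 37; ...
     104  83 39; 106  83 39; 107  82 38; 112  89 40; 113  88 40; 114  86 40; ...
     116  90 43; 117  90 41; 117  91 41; 119  93 41; 120  89 40; 120  93 44; ...
     121  95 42; 125  93 45; 127  96 45; 128  95 45; 131  95 46; 135 106 47];
n = size(F, 1); m = size(M, 1); p = size(F, 2);
xbar = mean(F)'; ybar = mean(M)';
Sp = ((n - 1) * cov(F) + (m - 1) * cov(M)) / (n + m - 2);   % pooled covariance
T2 = n * m / (n + m) * (xbar - ybar)' / Sp * (xbar - ybar)  % about 72.38
Fstat = (n + m - p - 1) / ((n + m - 2) * p) * T2            % about 23.08
crit = finv(0.99, p, n + m - p - 1)                         % about 4.3 => reject H0
```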
5.8 Multivariate Analysis of Variance
A. Review
There are $k$ normal populations:
$$G_1 : N(\mu_1, \sigma^2), \quad x_1^{(1)}, \dots, x_{n_1}^{(1)}, \quad \bar{x}_1$$
$$\vdots$$
$$G_k : N(\mu_k, \sigma^2), \quad x_1^{(k)}, \dots, x_{n_k}^{(k)}, \quad \bar{x}_k$$
One wants to test equality of the means $\mu_1, \dots, \mu_k$:
$$H_0 : \mu_1 = \cdots = \mu_k, \qquad H_1 : \mu_i \ne \mu_j \ \text{for some } i \ne j.$$
The analysis of variance employs a decomposition of sums of squares:
$$SS_{TR} = \sum_{a=1}^{k} n_a (\bar{x}_a - \bar{x})^2, \quad \text{sum of squares among treatments},$$
$$SSE = \sum_{a=1}^{k} \sum_{j=1}^{n_a} \left(x_j^{(a)} - \bar{x}_a\right)^2, \quad \text{sum of squares within groups},$$
$$SST = \sum_{a=1}^{k} \sum_{j=1}^{n_a} \left(x_j^{(a)} - \bar{x}\right)^2, \quad \text{total sum of squares},$$
where
$$\bar{x}_a = \frac{1}{n_a} \sum_{j=1}^{n_a} x_j^{(a)}, \qquad \bar{x} = \frac{1}{n} \sum_{a=1}^{k} \sum_{j=1}^{n_a} x_j^{(a)}, \qquad n = n_1 + \cdots + n_k.$$
The test statistic is
$$F = \frac{SS_{TR} / (k-1)}{SSE / (n-k)} \overset{H_0}{\sim} F_{k-1, \, n-k}.$$
B. Multivariate population (pp295-305)
$$G_1 : N_p(\mu_1, \Sigma), \quad x_1^{(1)}, \dots, x_{n_1}^{(1)}$$
$$\vdots$$
$$G_k : N_p(\mu_k, \Sigma), \quad x_1^{(k)}, \dots, x_{n_k}^{(k)}$$
$\Sigma$ is unknown; one wants to test
$$H_0 : \mu_1 = \cdots = \mu_k, \qquad H_1 : \mu_i \ne \mu_j \ \text{for some } i \ne j.$$
I. The likelihood ratio criterion
Step 1. The likelihood function
$$L(\mu_1, \dots, \mu_k, \Sigma) = (2\pi)^{-np/2} \, |\Sigma|^{-n/2} \, \mathrm{etr}\left(-\tfrac{1}{2} \Sigma^{-1} A\right),$$
where $A = \sum_{a=1}^{k} \sum_{j=1}^{n_a} (x_j^{(a)} - \mu_a)(x_j^{(a)} - \mu_a)'$.
Step 2. The domains
$$\Omega = \{(\mu_1, \dots, \mu_k, \Sigma) : \mu_j \in R^p, \ j = 1, \dots, k, \ \Sigma > 0\},$$
$$\omega = \{(\mu_1, \dots, \mu_k, \Sigma) : \mu_1 = \cdots = \mu_k \in R^p, \ \Sigma > 0\}.$$
Step 3. Maximization
$$\max_{\Omega} L(\mu_1, \dots, \mu_k, \Sigma) = (2\pi)^{-np/2} \, e^{-np/2} \left|\frac{1}{n} E\right|^{-n/2},$$
$$\max_{\omega} L(\mu_1, \dots, \mu_k, \Sigma) = (2\pi)^{-np/2} \, e^{-np/2} \left|\frac{1}{n} T\right|^{-n/2},$$
where
$$T = \sum_{a=1}^{k} \sum_{j=1}^{n_a} (x_j^{(a)} - \bar{x})(x_j^{(a)} - \bar{x})', \qquad E = \sum_{a=1}^{k} \sum_{j=1}^{n_a} (x_j^{(a)} - \bar{x}_a)(x_j^{(a)} - \bar{x}_a)'$$
are the total sum of squares and products matrix and the error sum of squares and products matrix, respectively, with
$$\bar{x}_a = \frac{1}{n_a} \sum_{j=1}^{n_a} x_j^{(a)}, \qquad \bar{x} = \frac{1}{n} \sum_{a=1}^{k} \sum_{j=1}^{n_a} x_j^{(a)}.$$
The treatment sum of squares and products matrix is
$$B = T - E = \sum_{a=1}^{k} n_a (\bar{x}_a - \bar{x})(\bar{x}_a - \bar{x})'.$$
The LRC is
$$\lambda = \left(\frac{|E|}{|T|}\right)^{n/2}, \qquad \text{equivalently} \qquad \Lambda = \frac{|E|}{|T|} = \frac{|E|}{|E + B|}.$$
Definition 5.8.1
Assume $A \sim W_p(n, \Sigma)$ and $B \sim W_p(m, \Sigma)$ are independent, where $n \ge p$, $\Sigma > 0$. The distribution of
$$\Lambda = \frac{|A|}{|A + B|}$$
is called the Wilks $\Lambda$-distribution, and we write $\Lambda \sim \Lambda_{p,n,m}$.
Theorem 5.8.1
Under $H_0$ we have:
1) $T \sim W_p(n-1, \Sigma)$, $E \sim W_p(n-k, \Sigma)$, $B \sim W_p(k-1, \Sigma)$;
2) $E$ and $B$ are independent;
3) the LRC statistic $\Lambda = |E|/|T|$ under the hypothesis has a $\Lambda_{p, \, n-k, \, k-1}$ distribution.
Special cases of the Wilks $\Lambda$-distribution $\Lambda_{p,n,m}$:
• $m = 1$: $\dfrac{n-p+1}{p} \cdot \dfrac{1 - \Lambda}{\Lambda} \sim F_{p, \, n-p+1}$
• $m = 2$: $\dfrac{n-p+1}{p} \cdot \dfrac{1 - \sqrt{\Lambda}}{\sqrt{\Lambda}} \sim F_{2p, \, 2(n-p+1)}$
• $p = 1$: $\dfrac{n}{m} \cdot \dfrac{1 - \Lambda}{\Lambda} \sim F_{m, \, n}$
• $p = 2$: $\dfrac{n-1}{m} \cdot \dfrac{1 - \sqrt{\Lambda}}{\sqrt{\Lambda}} \sim F_{2m, \, 2(n-1)}$
See pp. 300-305 of the Textbook for examples.
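A minimal one-way MANOVA sketch computing $\Lambda = |E|/|T|$ in MATLAB (the group sizes, mean shifts, and dimensions below are illustrative):

```matlab
% One-way MANOVA: Wilks Lambda = det(E)/det(T) (Theorem 5.8.1).
k = 3; p = 2; na = [10 12 9];
X = []; g = [];
for a = 1:k
    X = [X; randn(na(a), p) + a];   % group a sample, mean shifted by a (illustrative)
    g = [g; a * ones(na(a), 1)];
end
Xc = X - mean(X);
T = Xc' * Xc;                       % total SSP matrix
E = zeros(p);
for a = 1:k
    Xa = X(g == a, :);
    Xac = Xa - mean(Xa);
    E = E + Xac' * Xac;             % error SSP matrix
end
Lambda = det(E) / det(T)            % ~ Lambda_{p, n-k, k-1} under H0
```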
2. Union-Intersection Decision Rule
$$H_0 : \mu_1 = \cdots = \mu_k$$
Consider the projection hypotheses
$$H_{a0} : a'\mu_1 = \cdots = a'\mu_k, \qquad a \in R^p, \ a \ne 0,$$
so that
$$H_0 = \bigcap_{a \ne 0} H_{a0}.$$
The projected data follow univariate normal populations:
$$G_{a1} : N(a'\mu_1, \, a'\Sigma a), \quad a'x_1^{(1)}, \dots, a'x_{n_1}^{(1)},$$
$$\vdots$$
$$G_{ak} : N(a'\mu_k, \, a'\Sigma a), \quad a'x_1^{(k)}, \dots, a'x_{n_k}^{(k)}.$$
For the projected data, we have
$$SS_{TR} = a'Ba, \qquad SSE = a'Ea, \qquad SST = a'Ta,$$
and the $F$-statistic
$$F_a = \frac{a'Ba / (k-1)}{a'Ea / (n-k)} \overset{H_0}{\sim} F_{k-1, \, n-k}.$$
The rejection region for $H_{a0}$ is $R_a = \{F_a \ge F_{k-1, \, n-k}(\alpha^*)\}$. With the rejection region for $H_0$ being $\bigcup_{a \in R^p} R_a$, i.e. $\max_{a} F_a \ge F_{k-1, \, n-k}(\alpha^*)$, the test statistic is
$$\max_{a \ne 0} \frac{a'Ba}{a'Ea}.$$
Lemma 1
Let $A$ be a symmetric matrix of order $p$. Denote by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ the eigenvalues of $A$, and by $l_1, \dots, l_p$ the associated eigenvectors. Then
$$\max_{x \ne 0} \frac{x'Ax}{x'x} = \max_{\|x\| = 1} x'Ax = \lambda_1, \qquad \min_{x \ne 0} \frac{x'Ax}{x'x} = \min_{\|x\| = 1} x'Ax = \lambda_p.$$
Lemma 2
Let $A$ and $B$ be two $p \times p$ matrices with $A' = A$, $B > 0$. Denote by $\lambda_1 \ge \cdots \ge \lambda_p$ and $l_1, \dots, l_p$ the eigenvalues and associated eigenvectors of $B^{-1/2} A B^{-1/2}$. Then
$$\max_{x \ne 0} \frac{x'Ax}{x'Bx} = \lambda_1, \qquad \max_{\substack{x \ne 0 \\ x'l_i = 0, \ i = 1, \dots, k}} \frac{x'Ax}{x'Bx} = \lambda_{k+1}, \quad k = 1, \dots, p-1.$$
Remark 1: $\lambda_1, \dots, \lambda_p$ are the eigenvalues of $|A - \lambda B| = 0$.
Remark 2: The union-intersection statistic is the largest eigenvalue of $|B - \lambda E| = 0$.
Remark 3: Let $\lambda_1 \ge \cdots \ge \lambda_p$ be the eigenvalues of $E^{-1/2} B E^{-1/2}$. The Wilks $\Lambda$-statistic can be expressed as
$$\Lambda = \prod_{i=1}^{p} \frac{1}{1 + \lambda_i}.$$
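A quick numeric check of Remark 3 in MATLAB (the random positive definite $E$ and positive semidefinite $B$ are purely illustrative):

```matlab
% Check: det(E)/det(E+B) = prod over i of 1/(1 + lambda_i),
% where the lambda_i solve |B - lambda*E| = 0.
p = 3;
A = randn(p + 2, p); E = A' * A;     % E > 0 with probability 1
C = randn(p + 1, p); B = C' * C;     % B >= 0
lam = eig(B, E);                     % generalized eigenvalues of |B - lambda*E| = 0
Lambda1 = det(E) / det(E + B);
Lambda2 = prod(1 ./ (1 + lam));      % agrees with Lambda1 up to rounding
```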