Transcript Document
Multivariate Analysis Review

Multivariate distributions

The multivariate Normal distribution
x = (x1, x2, ..., xp)' is said to have a p-variate Normal distribution with mean vector μ and covariance matrix Σ, written x ~ N_p(μ, Σ), if
  f(x) = f(x1, ..., xp) = (2π)^{-p/2} |Σ|^{-1/2} exp{ -(1/2)(x - μ)'Σ^{-1}(x - μ) }.

[Figures: surface plots, contour plots and scatter plots of the bivariate Normal distribution; contour maps of the trivariate Normal distribution — the contours (x - μ)'Σ^{-1}(x - μ) = constant are ellipsoids centred at the mean vector μ = (μ1, μ2, μ3)'.]

Marginal and Conditional distributions

Theorem (Marginal distributions for the Multivariate Normal distribution)
Let x = (x1', x2')' have a p-variate Normal distribution, where x1 is q × 1 and x2 is (p - q) × 1, with mean vector μ = (μ1', μ2')' and covariance matrix
  Σ = [ Σ11  Σ12 ;  Σ21  Σ22 ].
Then the marginal distribution of x_i is a q_i-variate Normal distribution (q1 = q, q2 = p - q) with mean vector μ_i and covariance matrix Σ_ii.

Theorem (Conditional distributions for the Multivariate Normal distribution)
With the same partition, the conditional distribution of x_i given x_j is a q_i-variate Normal distribution with
  mean vector    μ_{i|j} = μ_i + Σ_ij Σ_jj^{-1}(x_j - μ_j)
  covariance     Σ_{i|j} = Σ_ii - Σ_ij Σ_jj^{-1} Σ_ji.

In particular, the conditional distribution of x2 given x1 is
  f_{2|1}(x2 | x1) = f(x1, x2) / f1(x1) = (2π)^{-(p-q)/2} |A|^{-1/2} exp{ -(1/2)(x2 - b)'A^{-1}(x2 - b) },
where b = μ2 + Σ21 Σ11^{-1}(x1 - μ1) and A = Σ22 - Σ21 Σ11^{-1} Σ12.

The matrix Σ_{22·1} = Σ22 - Σ21 Σ11^{-1} Σ12 is called the matrix of partial variances and covariances.
The (i, j)th element of Σ_{22·1}, written σ_{ij·1,2,...,q}, is called the partial covariance (variance if i = j) between x_i and x_j given x1, ..., xq, and
  ρ_{ij·1,2,...,q} = σ_{ij·1,...,q} / sqrt( σ_{ii·1,...,q} σ_{jj·1,...,q} )
is called the partial correlation between x_i and x_j given x1, ..., xq.

The matrix Σ21 Σ11^{-1} is called the matrix of regression coefficients for predicting x_{q+1}, x_{q+2}, ..., x_p from x1, ..., xq. The mean vector of x_{q+1}, ..., x_p given x1, ..., xq is
  μ_{2·1} = μ2 + B(x1 - μ1),   where B = Σ21 Σ11^{-1}.

Independence
Note: two vectors x1 and x2 are independent if f(x1, x2) = f1(x1) f2(x2); the conditional distribution of x_i given x_j is then equal to the marginal distribution of x_i.
If x = (x1', x2')' is multivariate Normal with mean vector (μ1', μ2')' and covariance matrix Σ partitioned as above, then the two vectors x1 and x2 are independent if and only if Σ12 = 0.
The components of the vector x are independent if σ_ij = 0 for all i and j (i ≠ j), i.e. Σ is a diagonal matrix.
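As a numerical illustration of the conditional-distribution formulas above, the following sketch computes the conditional mean b and the partial covariance matrix A for a partitioned multivariate normal; the mean vector, covariance matrix and observed x1 are assumed example values, not from the notes.

```python
import numpy as np

# Sketch: conditional distribution of x2 given x1 for a partitioned multivariate normal.
# mu, Sigma and the observed x1 below are assumed example values.
mu = np.array([1.0, 2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
q = 1                                    # x1 = first q components, x2 = remaining p - q
mu1, mu2 = mu[:q], mu[q:]
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

x1_obs = np.array([1.8])                 # observed value of x1
b = mu2 + S21 @ np.linalg.solve(S11, x1_obs - mu1)   # conditional mean mu2 + S21 S11^{-1}(x1 - mu1)
A = S22 - S21 @ np.linalg.solve(S11, S12)            # partial covariance S22 - S21 S11^{-1} S12
print("conditional mean:", b)
print("conditional covariance:\n", A)
```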
Transformations

Theorem
Let x1, x2, ..., xn denote random variables with joint probability density function f(x1, x2, ..., xn), and let
  u1 = h1(x1, ..., xn),  u2 = h2(x1, ..., xn),  ...,  un = hn(x1, ..., xn)
define an invertible transformation from the x's to the u's. Then the joint probability density function of u1, ..., un is given by
  g(u1, ..., un) = f(x1, ..., xn) |J|,
where the x's are written as functions of the u's and
  J = det[ ∂x_i / ∂u_j ]   (the n × n matrix of partial derivatives ∂x1/∂u1, ..., ∂xn/∂un)
is the Jacobian of the transformation.

Theorem (linear transformations)
Let x1, ..., xn have joint density f(x1, ..., xn) and let
  u1 = a11 x1 + a12 x2 + ... + a1n xn + c1
  u2 = a21 x1 + a22 x2 + ... + a2n xn + c2
  ⁞
  un = an1 x1 + an2 x2 + ... + ann xn + cn
define an invertible linear transformation from the x's to the u's, i.e. u = Ax + c with A = [a_ij] nonsingular, so x = A^{-1}(u - c). Then
  g(u1, ..., un) = (1 / |det A|) f( A^{-1}(u - c) ).

Theorem
Suppose the random vector x = (x1, ..., xp)' has a p-variate Normal distribution with mean vector μ and covariance matrix Σ. Then u = Ax + c (A nonsingular, p × p) has a p-variate Normal distribution with mean vector μ_u = Aμ + c and covariance matrix Σ_u = AΣA'.

Theorem (linear transformations of Normal random vectors)
Suppose x has a p-variate Normal distribution with mean vector μ and covariance matrix Σ, and let A be a q × p matrix of rank q ≤ p. Then Ax has a q-variate Normal distribution with mean vector Aμ and covariance matrix AΣA'.

Maximum Likelihood Estimation — Multivariate Normal distribution

The Method of Maximum Likelihood
Suppose the data x1, ..., xn have joint density function f(x1, ..., xn; θ1, ..., θp), where θ = (θ1, ..., θp) are unknown parameters assumed to lie in Ω (a subset of p-dimensional space). We want to estimate the parameters θ1, ..., θp.

Definition: the Likelihood function
Given the data, the likelihood function is defined to be L(θ) = L(θ1, ..., θp) = f(x1, ..., xn; θ1, ..., θp). Note: the domain of L(θ1, ..., θp) is the set Ω.

Definition: Maximum Likelihood Estimators
The maximum likelihood estimators of θ1, ..., θp are the values θ̂1, ..., θ̂p that maximize L(θ1, ..., θp), i.e. such that
  L(θ̂1, ..., θ̂p) = max over (θ1, ..., θp) of L(θ1, ..., θp).
Note: maximizing L(θ1, ..., θp) is equivalent to maximizing the log-likelihood function l(θ1, ..., θp) = ln L(θ1, ..., θp).

Summary: the maximum likelihood estimators of μ and Σ are
  μ̂ = x̄ = (1/n) Σ_{i=1}^{n} x_i   and   Σ̂ = (1/n) Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)' = ((n - 1)/n) S.

Sampling distribution of the MLE's — summary
The sampling distribution of x̄ is p-variate Normal with mean μ_x̄ = μ and covariance Σ_x̄ = (1/n)Σ.
The sampling distribution of the sample covariance matrix S (and hence of Σ̂ = ((n - 1)/n) S) is described by the Wishart distribution, introduced next.

The Wishart distribution — a multivariate generalization of the χ² distribution

Definition: the p-variate Wishart distribution
Let z1, z2, ..., zk be k independent random p-vectors, each having a p-variate Normal distribution with mean vector 0 and covariance matrix Σ. Let
  U = z1 z1' + z2 z2' + ... + zk zk'   (p × p).
Then U is said to have the p-variate Wishart distribution with k degrees of freedom and covariance matrix Σ, written U ~ W_p(k, Σ).

The density of the p-variate Wishart distribution
Suppose U ~ W_p(k, Σ). Then the density of U is
  f_U(u) = |u|^{(k - p - 1)/2} exp{ -(1/2) tr(Σ^{-1} u) } / ( 2^{kp/2} |Σ|^{k/2} Γ_p(k/2) ),
where Γ_p(·) is the multivariate gamma function, i.e.
  Γ_p(k/2) = π^{p(p-1)/4} Π_{j=1}^{p} Γ( (k + 1 - j)/2 ).
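As a quick check of the MLE formulas above, the sketch below computes μ̂ and Σ̂ from a simulated sample and verifies the relation Σ̂ = ((n − 1)/n) S; the simulation parameters are assumed for illustration.

```python
import numpy as np

# Sketch: maximum likelihood estimates for a multivariate normal sample (simulated data).
rng = np.random.default_rng(0)
n = 50
X = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                            cov=[[1.0, 0.3, 0.2],
                                 [0.3, 1.5, 0.4],
                                 [0.2, 0.4, 2.0]],
                            size=n)

mu_hat = X.mean(axis=0)                     # MLE of the mean vector
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n       # MLE of Sigma (divisor n)
S = centered.T @ centered / (n - 1)         # usual sample covariance matrix (divisor n - 1)

print("mu_hat =", mu_hat)
print("Sigma_hat == (n-1)/n * S:", np.allclose(Sigma_hat, (n - 1) / n * S))
```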
It can easily be checked that when p = 1 and Σ = 1, the Wishart distribution becomes the χ² distribution with k degrees of freedom.

Theorem
Suppose U ~ W_p(k, Σ) and let C denote a q × p matrix of rank q ≤ p. Then V = CUC' ~ W_q(k, CΣC').
Corollary 1: v = a'Ua ~ W_1(k, a'Σa) = σ_a² χ²_k, with σ_a² = a'Σa.
Corollary 2: if u_ii is the ith diagonal element of U, then u_ii ~ σ_ii χ²_k, where σ_ii is the ith diagonal element of Σ.

Theorem
Suppose U1 ~ W_p(k1, Σ) and U2 ~ W_p(k2, Σ) are independent. Then V = U1 + U2 ~ W_p(k1 + k2, Σ).

Theorem
Suppose U1 ~ W_p(k1, Σ) and U2 are independent and V = U1 + U2 ~ W_p(k, Σ) with k > k1. Then U2 ~ W_p(k - k1, Σ).

Summary: sampling distribution of the MLE's for the multivariate Normal distribution
Let x1, x2, ..., xn be a sample from N_p(μ, Σ). Then
  x̄ ~ N_p(μ, (1/n)Σ)   and   U = Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)' = (n - 1)S ~ W_p(n - 1, Σ).
Also u_ii = (n - 1) s_ii ~ σ_ii χ²_{n-1}.

Correlation

The sample covariance matrix
  S = [s_ik]  (p × p),   where  s_ik = (1/(n - 1)) Σ_{j=1}^{n} (x_ij - x̄_i)(x_kj - x̄_k).

The sample correlation matrix
  R = [r_ik]  (p × p),   where  r_ik = s_ik / sqrt(s_ii s_kk)
       = Σ_j (x_ij - x̄_i)(x_kj - x̄_k) / sqrt( Σ_j (x_ij - x̄_i)²  Σ_j (x_kj - x̄_k)² ),
with r_ii = 1. Note: R = D^{-1} S D^{-1}, where D = diag( sqrt(s_11), sqrt(s_22), ..., sqrt(s_pp) ).

Tests for Independence and Non-zero correlation

Test for zero correlation (independence between two variables)
The test statistic is t = r_ij sqrt(n - 2) / sqrt(1 - r_ij²). If independence is true then t has a t distribution with ν = n - 2 degrees of freedom. The test is to reject independence if |t| > t_{α/2}.

Test for non-zero correlation (H0: ρ = ρ0)
The test statistic is
  z = [ (1/2) ln( (1 + r)/(1 - r) ) - (1/2) ln( (1 + ρ0)/(1 - ρ0) ) ] / ( 1/sqrt(n - 3) ).
If H0 is true, z has approximately a standard Normal distribution. We reject H0 if |z| > z_{α/2}.

Partial Correlation — Conditional Independence

Recall: if x = (x1', x2')' has a p-variate Normal distribution (x1 q × 1, x2 (p - q) × 1) with mean vector (μ1', μ2')' and covariance matrix Σ = [ Σ11 Σ12 ; Σ21 Σ22 ], then the matrix Σ_{22·1} = Σ22 - Σ21 Σ11^{-1} Σ12 is called the matrix of partial variances and covariances. Its (i, j)th element σ_{ij·1,2,...,q} is the partial covariance (variance if i = j) between x_i and x_j given x1, ..., xq, and
  ρ_{ij·1,...,q} = σ_{ij·1,...,q} / sqrt( σ_{ii·1,...,q} σ_{jj·1,...,q} )
is the partial correlation between x_i and x_j given x1, ..., xq.

Let S = [ S11 S12 ; S21 S22 ] denote the sample covariance matrix and let S_{22·1} = S22 - S21 S11^{-1} S12. The (i, j)th element of S_{22·1}, s_{ij·1,2,...,q}, is called the sample partial covariance (variance if i = j) between x_i and x_j given x1, ..., xq. Also
  r_{ij·1,...,q} = s_{ij·1,...,q} / sqrt( s_{ii·1,...,q} s_{jj·1,...,q} )
is called the sample partial correlation between x_i and x_j given x1, ..., xq.

Test for zero partial correlation (conditional independence between two variables given a set of p independent variables)
Let r_{ij·x1,...,xp} denote the partial correlation between y_i and y_j given x1, ..., xp. The test statistic is
  t = r_{ij·x1,...,xp} sqrt(n - p - 2) / sqrt(1 - r²_{ij·x1,...,xp}).
If conditional independence is true, t has a t distribution with ν = n - p - 2 degrees of freedom. The test is to reject independence if |t| > t_{α/2}.

Test for non-zero partial correlation (H0: ρ_{ij·x1,...,xp} = ρ0_{ij·x1,...,xp})
The test statistic is
  z = [ (1/2) ln( (1 + r_{ij·x1,...,xp})/(1 - r_{ij·x1,...,xp}) ) - (1/2) ln( (1 + ρ0_{ij·x1,...,xp})/(1 - ρ0_{ij·x1,...,xp}) ) ] / ( 1/sqrt(n - p - 3) ).
If H0 is true, z has approximately a standard Normal distribution. We reject H0 if |z| > z_{α/2}.
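The two correlation tests above are easy to compute directly; the sketch below uses assumed values of n, r and ρ0 purely for illustration (note that arctanh(r) equals (1/2) ln((1 + r)/(1 − r))).

```python
import numpy as np
from scipy import stats

# Sketch: test for zero correlation (t) and test of H0: rho = rho0 (Fisher z).
n, r = 40, 0.45          # assumed sample size and sample correlation
rho0 = 0.30              # hypothesized correlation for the z test

t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)               # t with n - 2 df
p_t = 2 * stats.t.sf(abs(t), df=n - 2)

z = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)  # approximately standard normal
p_z = 2 * stats.norm.sf(abs(z))

print(f"t = {t:.3f} (p = {p_t:.4f}),  z = {z:.3f} (p = {p_z:.4f})")
```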
The Multiple Correlation Coefficient — testing independence between a single variable and a group of variables

Definition
Suppose x = (y, x1')' has a (p + 1)-variate Normal distribution with mean vector (μ_y, μ1')' and covariance matrix
  Σ = [ σ_yy  σ_1y' ;  σ_1y  Σ11 ].
We are interested in whether the variable y is independent of the vector x1. The multiple correlation coefficient is the maximum correlation between y and a linear combination of the components of x1:
  ρ_{y·x1,...,xp} = sqrt( σ_1y' Σ11^{-1} σ_1y / σ_yy ).

The sample multiple correlation coefficient
Let S = [ s_yy  s_1y' ;  s_1y  S11 ] denote the sample covariance matrix. Then the sample multiple correlation coefficient is
  r_{y·x1,...,xp} = sqrt( s_1y' S11^{-1} s_1y / s_yy ).

Testing for independence between y and x1
The test statistic is
  F = [ r²_{y·x1,...,xp} / (1 - r²_{y·x1,...,xp}) ] · (n - p - 1)/p
    = [ (n - p - 1)/p ] · s_1y' S11^{-1} s_1y / ( s_yy - s_1y' S11^{-1} s_1y ).
If independence is true then F has an F distribution with ν1 = p degrees of freedom in the numerator and ν2 = n - p - 1 degrees of freedom in the denominator. The test is to reject independence if F > F_α(p, n - p - 1).

Canonical Correlation Analysis

The problem
Quite often one has collected data on several variables. The variables are grouped into two (or more) sets, and the researcher is interested in whether one set of variables is independent of the other set. In addition, if the two sets of variates are found to be dependent, it is important to describe and understand the nature of this dependence. The appropriate statistical procedure in this case is called canonical correlation analysis.

Definition (canonical variates and canonical correlations)
Let x = (x1', x2')' have a p-variate Normal distribution, with x1 q × 1, x2 (p - q) × 1, mean vector (μ1', μ2')' and covariance matrix Σ = [ Σ11 Σ12 ; Σ21 Σ22 ]. Let
  U1 = a1'x1 = a11 x1 + ... + a_q1 x_q   and   V1 = b1'x2 = b11 x_{q+1} + ... + b_{p-q,1} x_p
be chosen so that the correlation between U1 and V1 is maximized. Then U1 and V1 are called the first pair of canonical variates and their correlation φ1 is called the first canonical correlation coefficient.

The remaining canonical variates and canonical correlation coefficients
The second pair of canonical variates, U2 = a2'x1 and V2 = b2'x2, is found by choosing a2 and b2 so that
  1. (U2, V2) are independent of (U1, V1), and
  2. the correlation between U2 and V2 is maximized.
This correlation, φ2, is called the second canonical correlation coefficient. In general the ith pair of canonical variates, U_i = a_i'x1 and V_i = b_i'x2, is found by choosing a_i and b_i so that
  1. (U_i, V_i) are independent of (U1, V1), ..., (U_{i-1}, V_{i-1}), and
  2. the correlation between U_i and V_i is maximized.
The correlation φ_i between U_i and V_i is called the ith canonical correlation coefficient.

The coefficient vectors a_i and b_i for the ith pair of canonical variates are eigenvectors of the matrices
  Σ11^{-1} Σ12 Σ22^{-1} Σ21   and   Σ22^{-1} Σ21 Σ11^{-1} Σ12,
respectively, associated with the ith largest eigenvalue (the same for both matrices). The ith largest eigenvalue of the two matrices is the square of the ith canonical correlation coefficient:
  φ_i² = ith largest eigenvalue of Σ11^{-1} Σ12 Σ22^{-1} Σ21 = ith largest eigenvalue of Σ22^{-1} Σ21 Σ11^{-1} Σ12.
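A minimal numerical sketch of the eigenvalue characterization above, using an assumed 4 × 4 covariance matrix partitioned into two sets of two variables; in practice the sample covariance matrix would be used in place of Σ.

```python
import numpy as np

# Sketch: canonical correlations from a partitioned covariance matrix (assumed values).
Sigma = np.array([[1.0, 0.4, 0.5, 0.3],
                  [0.4, 1.0, 0.3, 0.4],
                  [0.5, 0.3, 1.0, 0.2],
                  [0.3, 0.4, 0.2, 1.0]])
q = 2
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

# Eigenvalues of S11^{-1} S12 S22^{-1} S21 are the squared canonical correlations.
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
print("canonical correlations:", np.sqrt(np.clip(eigvals, 0.0, 1.0)))
```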
Inference for the mean vector

Univariate inference
Let x1, x2, ..., xn denote a sample of n from the Normal distribution with mean μ and variance σ². Suppose we want to test H0: μ = μ0 vs HA: μ ≠ μ0. The appropriate test is the t test with test statistic
  t = sqrt(n) (x̄ - μ0) / s.
Reject H0 if |t| > t_{α/2}.

The multivariate test
Let x1, x2, ..., xn denote a sample of n from the p-variate Normal distribution with mean vector μ and covariance matrix Σ. Suppose we want to test H0: μ = μ0 vs HA: μ ≠ μ0.

Roy's Union-Intersection Principle
This is a general procedure for developing a multivariate test from the corresponding univariate test.
  1. Convert the multivariate problem to a univariate problem by considering an arbitrary linear combination of the observation vector X = (X1, ..., Xp)':  U = a'X = a1 X1 + ... + ap Xp.
  2. Perform the univariate test for this arbitrary linear combination of the observation vector.
  3. Repeat the test for all possible choices of a = (a1, ..., ap)'.
  4. Reject the multivariate hypothesis if H0 is rejected for any one of the choices of a.
  5. Accept the multivariate hypothesis if H0 is accepted for all of the choices of a.
  6. Set the type I error rate for the individual tests so that the type I error rate for the multivariate test is α.

Hotelling's T² statistic
We reject H0: μ = μ0 if
  T² = n (x̄ - μ0)' S^{-1} (x̄ - μ0) > T²_α.
To determine T²_α: it turns out that if H0 is true then
  F = [ (n - p) / ( p(n - 1) ) ] T² = [ (n - p) / ( p(n - 1) ) ] n (x̄ - μ0)' S^{-1} (x̄ - μ0)
has an F distribution with ν1 = p and ν2 = n - p.

Hotelling's T² test
We reject H0: μ = μ0 if
  F = [ (n - p) / ( p(n - 1) ) ] T² > F_α(p, n - p),
or equivalently if T² = n (x̄ - μ0)' S^{-1} (x̄ - μ0) > T²_α = [ p(n - 1)/(n - p) ] F_α(p, n - p).

Simultaneous inference for means
Recall (using Roy's Union-Intersection Principle)
  T² = n (x̄ - μ)' S^{-1} (x̄ - μ) = max_a t_a² = max_a n (a'x̄ - a'μ)² / (a'Sa).
Now
  1 - α = P[ T² ≤ T²_α ] = P[ n (a'x̄ - a'μ)² / (a'Sa) ≤ T²_α for all a ]
        = P[ |a'x̄ - a'μ| ≤ sqrt( T²_α a'Sa / n ) for all a ].
Thus the set of intervals
  a'x̄ - sqrt( T²_α a'Sa / n )   to   a'x̄ + sqrt( T²_α a'Sa / n )
forms a set of (1 - α)100% simultaneous confidence intervals for a'μ. Recalling that T²_α = [ (n - 1)p / (n - p) ] F_α(p, n - p), the set of (1 - α)100% simultaneous confidence intervals for a'μ is
  a'x̄ ± sqrt( [ (n - 1)p / (n - p) ] F_α(p, n - p) a'Sa / n ).
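The one-sample Hotelling T² test above is straightforward to compute; the sketch below uses simulated data and an assumed null value μ0 = 0.

```python
import numpy as np
from scipy import stats

# Sketch: one-sample Hotelling T^2 test of H0: mu = mu0 (simulated data, mu0 = 0 assumed).
rng = np.random.default_rng(1)
n, p = 30, 3
X = rng.multivariate_normal([0.2, 0.0, -0.1], np.eye(p), size=n)
mu0 = np.zeros(p)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                    # sample covariance matrix (divisor n - 1)
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)             # Hotelling's T^2
F = (n - p) / (p * (n - 1)) * T2               # F with (p, n - p) degrees of freedom
print(f"T^2 = {T2:.3f}, F = {F:.3f}, p-value = {stats.f.sf(F, p, n - p):.4f}")
```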
The two-sample problem — the multivariate test
Let x1, x2, ..., xn denote a sample of n from the p-variate Normal distribution with mean vector μ_x and covariance matrix Σ, and let y1, y2, ..., ym denote a sample of m from the p-variate Normal distribution with mean vector μ_y and covariance matrix Σ. Suppose we want to test H0: μ_x = μ_y vs HA: μ_x ≠ μ_y.

Hotelling's T² statistic for the two-sample problem
  T² = (x̄ - ȳ)' [ (1/n + 1/m) S_pooled ]^{-1} (x̄ - ȳ),
where
  S_pooled = [ (n - 1)/(n + m - 2) ] S_x + [ (m - 1)/(n + m - 2) ] S_y.
If H0 is true then
  F = [ (n + m - p - 1) / ( p(n + m - 2) ) ] T²
has an F distribution with ν1 = p and ν2 = n + m - p - 1.

Hotelling's T² test
We reject H0: μ_x = μ_y if F = [ (n + m - p - 1) / ( p(n + m - 2) ) ] T² > F_α(p, n + m - p - 1).

Simultaneous inference for the two-sample problem
• Hotelling's T² statistic can be shown to be derived by Roy's Union-Intersection principle, namely
  T² = (x̄ - ȳ - δ)' [ (1/n + 1/m) S_pooled ]^{-1} (x̄ - ȳ - δ)
     = max_a t_a² = max_a [ a'(x̄ - ȳ) - a'δ ]² / [ (1/n + 1/m) a'S_pooled a ],
where δ = μ_x - μ_y. Thus
  1 - α = P[ F ≤ F_α(p, n + m - p - 1) ] = P[ T² ≤ T²_α ],   with T²_α = [ p(n + m - 2)/(n + m - p - 1) ] F_α(p, n + m - p - 1),
and hence
  1 - α = P[ |a'(x̄ - ȳ) - a'δ| ≤ sqrt( T²_α (1/n + 1/m) a'S_pooled a ) for all a ].
Thus the intervals
  a'(x̄ - ȳ) ± sqrt( T²_α (1/n + 1/m) a'S_pooled a )
form (1 - α)100% simultaneous confidence intervals for a'(μ_x - μ_y).

MANOVA — Multivariate Analysis of Variance

One-way Multivariate Analysis of Variance (MANOVA): comparing k p-variate Normal populations — the multivariate analogue of the F test for comparing k means.

Situation
• We have k Normal populations.
• Let μ_i and Σ denote the mean vector and covariance matrix of population i, i = 1, 2, 3, ..., k.
• Note: we assume that the covariance matrix for each population is the same: Σ1 = Σ2 = ... = Σk = Σ.
We want to test H0: μ1 = μ2 = μ3 = ... = μk against HA: μ_i ≠ μ_j for at least one pair (i, j).

The data
• Assume we have collected data from each of the k populations.
• Let x_{i1}, x_{i2}, ..., x_{in} denote the n observations from population i, i = 1, 2, 3, ..., k.

Computing formulae: compute
  1) T_i = Σ_{j=1}^{n} x_{ij}   (total vector for sample i, with components T_{1i}, ..., T_{pi})
  2) G = Σ_{i=1}^{k} T_i = Σ_i Σ_j x_{ij}   (grand total vector, with components G_1, ..., G_p)
  3) N = kn   (total sample size)
  4) Σ_i Σ_j x_{ij} x_{ij}'   (the matrix of raw sums of squares and products)
  5) (1/n) Σ_i T_i T_i'.

Let
  H = (1/n) Σ_{i=1}^{k} T_i T_i' - (1/N) G G' = n Σ_{i=1}^{k} (x̄_i - x̄)(x̄_i - x̄)'
= the Between SS and SP matrix, and
  E = Σ_i Σ_j x_{ij} x_{ij}' - (1/n) Σ_i T_i T_i' = Σ_i Σ_j (x_{ij} - x̄_i)(x_{ij} - x̄_i)'
= the Within SS and SP matrix.

The MANOVA table
  Source    SS and SP matrix
  Between   H = [h_ij]
  Within    E = [e_ij]

There are several test statistics for testing H0: μ1 = μ2 = ... = μk against HA: μ_i ≠ μ_j for at least one pair (i, j):
  1. Roy's largest root: λ1 = largest eigenvalue of HE^{-1}. This test statistic is derived using Roy's union-intersection principle.
  2. Wilks' lambda: Λ = |E| / |H + E| = 1 / |HE^{-1} + I|. This test statistic is derived using the generalized likelihood ratio principle.
  3. Lawley-Hotelling trace statistic: T0² = tr(HE^{-1}) = sum of the eigenvalues of HE^{-1}.
  4. Pillai trace statistic: V = tr[ H(H + E)^{-1} ].
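The sketch below computes the between (H) and within (E) SS and SP matrices and two of the MANOVA statistics for simulated data with k groups of n observations on p variables; the group means used in the simulation are assumed for illustration.

```python
import numpy as np

# Sketch: one-way MANOVA matrices H and E, Wilks' lambda and Roy's largest root.
rng = np.random.default_rng(2)
k, n, p = 3, 20, 2
groups = [rng.multivariate_normal([0.5 * i, 0.0], np.eye(p), size=n) for i in range(k)]

grand_mean = np.vstack(groups).mean(axis=0)
H = np.zeros((p, p))                               # between SS and SP matrix
E = np.zeros((p, p))                               # within SS and SP matrix
for X in groups:
    d = (X.mean(axis=0) - grand_mean).reshape(-1, 1)
    H += n * d @ d.T
    C = X - X.mean(axis=0)
    E += C.T @ C

wilks = np.linalg.det(E) / np.linalg.det(H + E)
roy = np.linalg.eigvals(H @ np.linalg.inv(E)).real.max()
print(f"Wilks' lambda = {wilks:.4f}, Roy's largest root = {roy:.4f}")
```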
Profile Analysis

Definition
• Let X1, X2, ..., Xp denote p jointly distributed variables under study.
• Let μ1, μ2, ..., μp denote the means of these variables.
• The profile of these variables is a plot of μ_i vs i.

Profile comparison
[Figure: profiles of Group A and Group B plotted against the variables 1, 2, 3, ..., p.]
With the two-sample set-up above (x1, ..., xn a sample of n from N_p(μ_x, Σ) and y1, ..., ym a sample of m from N_p(μ_y, Σ)), the two-sample Hotelling T² test tests H0: equality of profiles (μ_x = μ_y) against HA: different profiles.

Parallelism
[Figure: profiles with the variables not interacting with the groups (parallelism) and with the variables interacting with the groups (lack of parallelism).]
• Parallelism: group differences are constant across variables.
• Lack of parallelism: group differences are variable dependent — the difference between the groups is not the same for each variable.

Test for parallelism
Let x1, ..., xn and y1, ..., ym be the two samples as above, and let C be the (p - 1) × p matrix of successive differences
  C = [ -1  1  0  ...  0 ;
         0 -1  1  ...  0 ;
         ...             ;
         0  0 ... -1   1 ],
so that
  CX = ( X2 - X1, X3 - X2, ..., Xp - X_{p-1} )'.
The test for parallelism is H0: Cμ_x = Cμ_y vs HA: Cμ_x ≠ Cμ_y. The data Cx1, Cx2, ..., Cxn are a sample of n from the (p - 1)-variate Normal distribution with mean vector Cμ_x and covariance matrix CΣC'; likewise Cy1, Cy2, ..., Cym is a sample of m from the (p - 1)-variate Normal distribution with mean vector Cμ_y and covariance matrix CΣC'.

Hotelling's T² test for parallelism
  T² = (Cx̄ - Cȳ)' [ (1/n + 1/m) C S_pooled C' ]^{-1} (Cx̄ - Cȳ).
If H0 is true then
  F = [ (n + m - p) / ( (p - 1)(n + m - 2) ) ] T²
has an F distribution with ν1 = p - 1 and ν2 = n + m - p. Thus we reject H0 if F > F_α with ν1 = p - 1 and ν2 = n + m - p.
To perform the test for parallelism, compute differences of successive variables for each case in each group and perform the two-sample Hotelling's T² test (see the sketch following this section).

Test for equality of groups (parallelism assumed)
[Figure: equal group profiles plotted against the variables 1, 2, 3, ..., p.]
If parallelism has been established, it is appropriate to test for equality of profiles:
  H0: (1/p)(μ_{x1} + ... + μ_{xp}) = (1/p)(μ_{y1} + ... + μ_{yp}),   i.e.   H0: (1/p) 1'μ_x = (1/p) 1'μ_y,
  vs HA: (1/p) 1'μ_x ≠ (1/p) 1'μ_y.
The t test:
  t = ( 1'x̄ - 1'ȳ ) / sqrt( (1/n + 1/m) 1'S_pooled 1 ).
Thus we reject H0 if |t| > t_{α/2} with df = ν = n + m - 2. To perform this test, average all the variables for each case in each group and perform the two-sample t test.
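A sketch of the parallelism test: difference the successive variables in each group, then apply the two-sample Hotelling T² test to the differenced data. The data below are simulated, with an assumed parallel shift between the two groups.

```python
import numpy as np
from scipy import stats

# Sketch: profile parallelism test via successive differences + two-sample Hotelling T^2.
rng = np.random.default_rng(3)
n, m, p = 25, 30, 4
X = rng.multivariate_normal([1.0, 2.0, 3.0, 4.0], np.eye(p), size=n)   # group 1
Y = rng.multivariate_normal([2.0, 3.0, 4.0, 5.0], np.eye(p), size=m)   # group 2 (parallel shift)

DX, DY = np.diff(X, axis=1), np.diff(Y, axis=1)        # successive differences: p - 1 columns
d = DX.mean(axis=0) - DY.mean(axis=0)
Sp = ((n - 1) * np.cov(DX, rowvar=False) + (m - 1) * np.cov(DY, rowvar=False)) / (n + m - 2)
T2 = d @ np.linalg.solve((1.0 / n + 1.0 / m) * Sp, d)
F = (n + m - p) / ((p - 1) * (n + m - 2)) * T2          # F with (p - 1, n + m - p) df
print(f"T^2 = {T2:.3f}, F = {F:.3f}, p-value = {stats.f.sf(F, p - 1, n + m - p):.4f}")
```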
Hotelling’s T2 test for equality of variables T n Cx 0 2 CS pooled C 1 Cx 0 1 n Cx CS pooled C Cx if H0 is true than n p 1 2 F T p 1 n 1 has an F distribution with n1 = p – 1 and n2 = n - p + 1 Thus we reject H0 if F > F with n1 = p – 1 and n2 = n – p + 1 To perform the test, compute differences of successive variables for each case in the group and perform the one-sample Hotelling’s T2 test for a zero mean vector A similar test can be performed for the second sample. Both of these tests do not assume parllelism. If parallelism is assumed then Then Cx1, Cx2 , , Cxn , Cy1, Cy2 , , Cym This is a sample of n + m from the p-variate normal distribution with mean vector Cx Cy and covariance matrix CSC . The test for equality of variables is: H 0 : C x C x 0 vs H A : C x C x 0 Hotelling’s T2 test for equality of variables 1 1 T nCx mCy CS pooled C nCx mCy nm 2 if H0 is true than nm p F T2 p 1 n m 2 has an F distribution with n1 = p – 1 and n2 = n +m - p Thus we reject H0 if F > F with n1 = p – 1 and n2 = n + m – p To perform this test for parallelism, 1. Compute differences of successive variables for each case in each group 2. Combine the two samples into a single sample of n + m and 3. Perform the single-sample Hotelling’s T2 test for a zero mean vector. Repeated Measures Designs In a Repeated Measures Design We have experimental units that • may be grouped according to one or several factors (the grouping factors) Then on each experimental unit we have • not a single measurement but a group of measurements (the repeated measures) • The repeated measures may be taken at combinations of levels of one or several factors (The repeated measures factors) The Anova Model for a simple repeated measures design Repeated measures subjects y11 y12 y13 … y1t y21 y22 y23 … y2t yn1 yn2 y13 … ynt The Model yij = the jth repeated measure on the ith subject = + i + tj + eij where = the mean effect, i = the effect of subject i, tj = the effect of time j, eij = random error. ̴ i N 0, 2 t t j 0 j 1 ̴ e ij N 0, 2 The Analysis of Variance The Sums of Squares n 1. SSSubject t yi y 2 i 1 - used to measure the variability of i (between subject variability) 2. SSTime n y j y t 2 j 1 - used to test for the differences in tj (time) 3. SSError yij yi y j y n t i 1 j 1 2 - used to measure the variability of eij (within subject variability) ANOVA table – Repeated measures (no grouping factor, 1 repeated measures factor (time)) Source Between Subject Error Time Between Subject Error S.S. d.f. M.S SSSubject n-1 MSSubject SSTime SSError t-1 MSTime MSError (n - 1)(t - 1) F MS Time MS Error The general Repeated Measures Design g groups of n subjects t repeated measures In a Repeated Measures Design We have experimental units that • may be grouped according to one or several factors (the grouping factors – df = g - 1) Then on each experimental unit we have • not a single measurement but a group of measurements (the repeated measures) • The repeated measures may be taken at combinations of levels of one or several factors (The repeated measures factors – df = t - 1) • There are also the interaction effects between the grouping and repeated measures factors – df = (g -1)(t -1) The Model - Repeated Measures Design yobservation mean Main effects,interactionsGroupingfactors Betweensubject Error Main effects,interactionsRM factors Interactio nsGrouping& RM factors e 1 e 2 Withinsubject Error ANOVA table for the general repeated measures design Source d.f. 
  Source                                                              d.f.
  Main effects and interactions of grouping factors                   g - 1
  Between-subject error                                               g(n - 1)
  Main effects and interactions of repeated measures factors          t - 1
  Interactions of grouping factors with repeated measures factors     (t - 1)(g - 1)
  Within-subject error                                                g(t - 1)(n - 1)

The multivariate model for a repeated measures design

The ANOVA (univariate) model
  y_ij = the jth repeated measure on the ith subject = μ + α_i + τ_j + ε_ij,
where μ = the mean effect, α_i = the effect of subject i, τ_j = the effect of time j and ε_ij = random error, with α_i ~ N(0, σ_α²), Σ_j τ_j = 0 and ε_ij ~ N(0, σ²).

Implications of the ANOVA (univariate) model
  μ_j = E[y_ij] = μ + E[α_i] + τ_j + E[ε_ij] = μ + 0 + τ_j + 0 = μ + τ_j,
  var(y_ij) = E[(y_ij - μ_j)²] = E[(α_i + ε_ij)²] = σ_α² + σ²,
  cov(y_ij, y_ij') = E[(α_i + ε_ij)(α_i + ε_ij')] = σ_α²   for j ≠ j',
so the correlation between y_ij and y_ij' is σ_α² / (σ_α² + σ²). The implication of the ANOVA model for a repeated measures design is therefore that the correlation between repeated measures is constant.

The multivariate model for a repeated measures design
Let y1, y2, ..., yn denote a sample of n from the t-variate Normal distribution with mean vector μ and covariance matrix
  Σ = [ σ11 σ12 ... σ1t ; σ12 σ22 ... σ2t ; ... ; σ1t σ2t ... σtt ].
This allows an arbitrary correlation structure amongst the repeated measures y_{i1}, y_{i2}, ..., y_{it}.

Test for equality of repeated measures
[Figure: a flat profile — the repeated measures equal across 1, 2, 3, ..., t.]
Let C be the (t - 1) × t matrix of successive differences, so that
  CY = ( Y2 - Y1, Y3 - Y2, ..., Yt - Y_{t-1} )'.
The test for equality of repeated measures is H0: Cμ = 0 vs HA: Cμ ≠ 0. The data Cy1, Cy2, ..., Cyn are a sample of n from the (t - 1)-variate Normal distribution with mean vector Cμ and covariance matrix CΣC'.

Hotelling's T² test for equality of repeated measures
  T² = n (Cȳ - 0)' (C S C')^{-1} (Cȳ - 0) = n (Cȳ)' (C S C')^{-1} Cȳ.
If H0 is true then
  F = [ (n - t + 1) / ( (t - 1)(n - 1) ) ] T²
has an F distribution with ν1 = t - 1 and ν2 = n - t + 1. Thus we reject H0 if F > F_α with ν1 = t - 1 and ν2 = n - t + 1.
To perform the test, compute differences of successive repeated measures for each case and perform the one-sample Hotelling's T² test for a zero mean vector.

Techniques for studying correlation and covariance structure
• Principal Components Analysis (PCA)
• Factor Analysis

Principal Component Analysis
Let x have a p-variate Normal distribution with mean vector μ and covariance matrix Σ.

Definition
The linear combination C1 = a1 x1 + ... + ap xp = a'x is called the first principal component if a = (a1, ..., ap)' is chosen to maximize
  Var(C1) = Var(a'x) = a'Σa   subject to   a'a = a1² + ... + ap² = 1.

The complete set of principal components
The set of linear combinations
  C1 = a11 x1 + ... + a1p xp = a1'x,  ...,  Cp = ap1 x1 + ... + app xp = ap'x
are called the principal components of x if the vectors a_i = (a_{i1}, ..., a_{ip})' are chosen with a_i'a_i = 1 such that
  1. Var(C1) is maximized;
  2. Var(C_i) is maximized subject to C_i being independent of C1, ..., C_{i-1} (the previous i - 1 principal components).

Result
a_i = (a_{i1}, ..., a_{ip})' is the eigenvector of Σ associated with the ith largest eigenvalue λ_i of the covariance matrix, and
  Var(C_i) = Var(a_i'x) = a_i'Σa_i = λ_i.

Recall that any positive definite matrix Σ can be written
  Σ = P D P' = λ1 a1 a1' + ... + λp ap ap',
where a1, ..., ap are eigenvectors of Σ of length 1, λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0 are the eigenvalues of Σ, P = (a1, ..., ap) is an orthogonal matrix (P'P = PP' = I) and D = diag(λ1, ..., λp).
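A small sketch of the result above: the principal components are obtained from the eigendecomposition of the covariance matrix. The matrix below is an assumed example.

```python
import numpy as np

# Sketch: principal components via the eigendecomposition of an assumed covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]                 # reorder: largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

for i, (lam, a) in enumerate(zip(eigvals, eigvecs.T), start=1):
    share = lam / eigvals.sum()                   # proportion of total variance explained
    print(f"PC{i}: Var(C{i}) = {lam:.3f} ({share:.1%}), a{i} = {np.round(a, 3)}")
```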
Graphical picture of principal components
Multivariate Normal data fall in an ellipsoidal pattern. The shape and orientation of the ellipsoid are determined by the covariance matrix Σ: the eigenvectors of Σ are vectors giving the directions of the axes of the ellipsoid, and the eigenvalues give the lengths of these axes.

Recall that if Σ is a positive definite matrix then
  Σ = λ1 a1 a1' + ... + λp ap ap' = P D P',
where P is an orthogonal matrix (P'P = PP' = I) with columns equal to the eigenvectors of Σ, and D is a diagonal matrix with diagonal elements equal to the eigenvalues of Σ.

The vector of principal components
  C = (C1, ..., Cp)' = (a1'x, ..., ap'x)' = P'x
has covariance matrix
  Σ_C = P'ΣP = P'(PDP')P = (P'P)D(P'P) = D = diag(λ1, ..., λp).
An orthogonal matrix rotates vectors; thus C = P'x rotates the vector x into the vector of principal components C. Also
  tr(D) = tr(Σ_C) = tr(P'ΣP) = tr(ΣPP') = tr(Σ),
i.e. Σ_{i=1}^{p} λ_i = Σ_{i=1}^{p} var(C_i) = Σ_{i=1}^{p} σ_ii = Σ_{i=1}^{p} var(x_i) = total variance of x.
The ratio
  λ_i / Σ_{j=1}^{p} λ_j = var(C_i) / Σ_{j=1}^{p} σ_jj
denotes the proportion of variance explained by the ith principal component C_i. Also
  Cov(C_i, x_j) = λ_i a_ij   and   Corr(C_i, x_j) = a_ij sqrt(λ_i) / sqrt(σ_jj) = a_ij sqrt(λ_i)  if σ_jj = 1.

Factor Analysis — an alternative technique for studying correlation and covariance structure

Let x have a p-variate Normal distribution with mean vector μ and covariance matrix Σ.

The Factor Analysis model
Let F1, F2, ..., Fk denote independent standard Normal random variables (the factors). Let ε1, ε2, ..., εp denote independent Normal random variables with mean 0 and var(ε_i) = ψ_i. Suppose that there exist constants ℓ_ij (the loadings) such that
  x1 = μ1 + ℓ11 F1 + ℓ12 F2 + ... + ℓ1k Fk + ε1
  x2 = μ2 + ℓ21 F1 + ℓ22 F2 + ... + ℓ2k Fk + ε2
  ...
  xp = μp + ℓp1 F1 + ℓp2 F2 + ... + ℓpk Fk + εp,
i.e. in matrix form x = μ + LF + ε, where F ~ N(0_k, I_k), ε ~ N(0_p, Ψ) and
  Ψ = diag(ψ1, ψ2, ..., ψp).
Note: Σ = Var(x) = LL' + Ψ, hence
  σ_ii = var(x_i) = Σ_{j=1}^{k} ℓ_ij² + ψ_i = h_i² + ψ_i   and   σ_im = cov(x_i, x_m) = Σ_{j=1}^{k} ℓ_ij ℓ_mj.
h_i² = Σ_{j=1}^{k} ℓ_ij² is called the communality, i.e. the component of the variance of x_i that is due to the common factors F1, ..., Fk. ψ_i is called the specific variance, i.e. the component of the variance of x_i that is specific only to that observation. F1, ..., Fk are called the common factors and ε1, ..., εp are called the specific factors. Also ℓ_ij = cov(x_i, F_j) (= the correlation between x_i and F_j if var(x_i) = 1).

Rotating factors
Recall that the factor analysis model x = μ + LF + ε gives rise to the vector x having covariance matrix Σ = Var(x) = LL' + Ψ. Let P be any orthogonal matrix; then
  x = μ + LP P'F + ε = μ + L*F* + ε   and   Σ = LL' + Ψ = LPP'L' + Ψ = L*L*' + Ψ,
where F* = P'F and L* = LP. Hence if x = μ + LF + ε with Σ = LL' + Ψ is a factor analysis model, then so also is x = μ + L*F* + ε with Σ = L*L*' + Ψ, where P is any orthogonal matrix. The process of exploring other models through orthogonal transformations of the factors is called rotating the factors. There are many techniques for rotating the factors:
• VARIMAX
• Quartimax
• Equimax
VARIMAX rotation attempts to have each individual variable load high on a subset of the factors.

Extracting the factors
Several methods — we consider two:
  1. the Principal Component Method;
  2. the Maximum Likelihood Method.

Principal Component Method
Recall Σ = PDP' = λ1 a1 a1' + ... + λp ap ap', where a1, ..., ap are eigenvectors of Σ of length 1 and λ1 ≥ ... ≥ λp ≥ 0 are the eigenvalues of Σ. Hence
  Σ = [ sqrt(λ1) a1, ..., sqrt(λp) ap ] [ sqrt(λ1) a1, ..., sqrt(λp) ap ]' = LL',
so L = [ sqrt(λ1) a1, ..., sqrt(λp) ap ] and Ψ = 0. This is the principal component solution with p factors. Note: the specific variances ψ_i are all zero.

The objective in factor analysis is to explain the correlation structure in the data vector with as few factors as necessary. It may happen that the latter eigenvalues of Σ are small, i.e. λ_{k+1} ≈ ... ≈ λ_p ≈ 0. Then
  Σ = λ1 a1 a1' + ... + λp ap ap' ≈ λ1 a1 a1' + ... + λk ak ak' = L_k L_k',
where L_k = [ sqrt(λ1) a1, ..., sqrt(λk) ak ]. In addition let
  ψ_i = σ_ii - (L_k L_k')_ii = σ_ii - Σ_{j=1}^{k} ℓ_ij²   (the ith diagonal element of Σ - L_k L_k').
In this case Σ ≈ L_k L_k' + Ψ with Ψ = diag(ψ1, ..., ψp); the equality is exact along the diagonal.
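A short sketch of the principal-component method with k factors, applied to an assumed correlation matrix; it returns the loadings L_k, the communalities and the specific variances.

```python
import numpy as np

# Sketch: principal-component factor extraction with k factors (assumed correlation matrix).
R = np.array([[1.00, 0.60, 0.50, 0.30],
              [0.60, 1.00, 0.45, 0.25],
              [0.50, 0.45, 1.00, 0.20],
              [0.30, 0.25, 0.20, 1.00]])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
L = eigvecs[:, :k] * np.sqrt(eigvals[:k])       # loadings: columns sqrt(lambda_j) * a_j
communality = (L ** 2).sum(axis=1)              # h_i^2
psi = np.diag(R) - communality                  # specific variances (exact on the diagonal)
print("loadings L_k:\n", np.round(L, 3))
print("communalities:", np.round(communality, 3))
print("specific variances:", np.round(psi, 3))
```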
Maximum Likelihood Estimation
Let x1, ..., xn denote a sample from N_p(μ, Σ) with Σ = LL' + Ψ, where L is p × k and Ψ is diagonal. The joint density of x1, ..., xn gives the likelihood function
  L(μ, L, Ψ) = (2π)^{-np/2} |Σ|^{-n/2} exp{ -(1/2) tr[ Σ^{-1}( A + n(x̄ - μ)(x̄ - μ)' ) ] },
where A = Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)' = (n - 1)S and Σ = LL' + Ψ. The maximum likelihood estimates μ̂, L̂ and Ψ̂ are obtained by numerical maximization of L(μ, L, Ψ).

Discrimination and Classification

Discrimination — the situation
We have two or more populations π1, π2, etc. (possibly p-variate Normal). The populations are known (or we have data from each population). We have data for a new case (population unknown) and we want to identify the population of which the new case is a member.

The basic problem
Suppose that the data x1, ..., xp from a new case have joint density function either
  π1: g(x1, ..., xp)   or   π2: h(x1, ..., xp).
We want to make the decision
  D1: classify the case in π1 (g is the correct distribution), or
  D2: classify the case in π2 (h is the correct distribution).

The two types of errors
  1. Misclassifying the case in π2 when it actually lies in π1; let P[2|1] = P[D2|π1] denote the probability of this type of error.
  2. Misclassifying the case in π1 when it actually lies in π2; let P[1|2] = P[D1|π2] denote the probability of this type of error.
This is similar to the Type I and Type II errors in hypothesis testing.

Note: a discrimination scheme is defined by splitting p-dimensional space into two regions:
  1. C1 = the region where we make the decision D1 (the decision to classify the case in π1);
  2. C2 = the region where we make the decision D2 (the decision to classify the case in π2).

There can be several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification P[2|1] and P[1|2]:
  1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, say P[2|1], is at some low acceptable value α; accept the resulting level of the other probability of misclassification, P[1|2] = β.
  2. Set up the regions C1 and C2 so that the total probability of misclassification,
       P[misclassification] = P[1] P[2|1] + P[2] P[1|2],
     is minimized, where P[1] = P[the case belongs to π1] and P[2] = P[the case belongs to π2].
  3. Set up the regions C1 and C2 so that the total expected cost of misclassification,
       E[cost of misclassification] = ECM = c_{2|1} P[1] P[2|1] + c_{1|2} P[2] P[1|2],
     is minimized, where c_{2|1} = the cost of misclassifying the case in π2 when the case belongs to π1 and c_{1|2} = the cost of misclassifying the case in π1 when the case belongs to π2.

The Optimal Classification Rule — the Neyman-Pearson Lemma
Suppose that the data x1, ..., xp have joint density function f(x1, ..., xp; θ), where θ is either θ1 or θ2. Let g(x1, ..., xp) = f(x1, ..., xp; θ1) and h(x1, ..., xp) = f(x1, ..., xp; θ2). We want to make the decision D1: θ = θ1 (g is the correct distribution) against D2: θ = θ2 (h is the correct distribution). Then the optimal regions (minimizing ECM, the expected cost of misclassification) for making the decisions D1 and D2 respectively are
  C1 = { (x1, ..., xp) : L(θ1)/L(θ2) = g(x1, ..., xp)/h(x1, ..., xp) ≥ k }   and
  C2 = { (x1, ..., xp) : g(x1, ..., xp)/h(x1, ..., xp) < k },
where k = c_{1|2} P[2] / ( c_{2|1} P[1] ).
Fisher's Linear Discriminant Function
Suppose that x1, ..., xp is data from a p-variate Normal distribution with mean vector either μ1 or μ2, and the covariance matrix Σ is the same for both populations π1 and π2:
  g(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ -(1/2)(x - μ1)'Σ^{-1}(x - μ1) },
  h(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ -(1/2)(x - μ2)'Σ^{-1}(x - μ2) }.
The function
  a'x = (μ1 - μ2)'Σ^{-1} x
is called Fisher's linear discriminant function. In the case where the populations are unknown but estimated from data, Fisher's linear discriminant function is
  â'x = (x̄1 - x̄2)'S^{-1} x.

Applying the optimal classification rule, we should classify into populations π1 and π2 using the likelihood ratio
  g(x)/h(x) = exp{ (1/2)(x - μ2)'Σ^{-1}(x - μ2) - (1/2)(x - μ1)'Σ^{-1}(x - μ1) }.
That is, make the decision D1 (the population is π1) if g(x)/h(x) ≥ k with k = c_{1|2}P[2] / (c_{2|1}P[1]), or equivalently if
  (1/2)(x - μ2)'Σ^{-1}(x - μ2) - (1/2)(x - μ1)'Σ^{-1}(x - μ1) ≥ ln k,
or, expanding the quadratic forms (the terms x'Σ^{-1}x cancel),
  (μ1 - μ2)'Σ^{-1} x ≥ ln k + (1/2) μ1'Σ^{-1}μ1 - (1/2) μ2'Σ^{-1}μ2.
Finally, we make the decision D1 (the population is π1) if
  a'x ≥ K,   where   a = Σ^{-1}(μ1 - μ2)   (Fisher's linear discriminant function)
and
  K = ln k + (1/2) μ1'Σ^{-1}μ1 - (1/2) μ2'Σ^{-1}μ2,   with k = c_{1|2}P[2] / (c_{2|1}P[1]).
Note: k = 1 and ln k = 0 if c_{1|2} = c_{2|1} and P[1] = P[2]; in that case
  K = (1/2) μ1'Σ^{-1}μ1 - (1/2) μ2'Σ^{-1}μ2 = (1/2)(μ1 + μ2)'Σ^{-1}(μ1 - μ2).

Graphical illustration of Fisher's linear discriminant function
[Figure: the two Normal populations and the separating boundary a'x = (μ1 - μ2)'Σ^{-1}x = K.]

Mahalanobis distance form of the rule
With k = 1, the rule a'x ≥ K, where a = Σ^{-1}(μ1 - μ2) and K = (1/2)(μ1 + μ2)'Σ^{-1}(μ1 - μ2), is equivalent to
  (x - μ2)'Σ^{-1}(x - μ2) ≥ (x - μ1)'Σ^{-1}(x - μ1),
i.e.
  d²_M(x, μ2 | Σ) ≥ d²_M(x, μ1 | Σ)   (squared Mahalanobis distances).
Thus we make the decision D1 (the population is π1) if the Mahalanobis distance from x to μ1 is smaller than the Mahalanobis distance from x to μ2, and the decision D2 (the population is π2) if the Mahalanobis distance from x to μ2 is the smaller one.
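A small sketch of the rule above with assumed μ1, μ2 and Σ and equal costs and priors (so k = 1): classify by comparing a'x with K, or equivalently by the smaller squared Mahalanobis distance.

```python
import numpy as np

# Sketch: Fisher's linear discriminant rule for two normal populations with common Sigma.
# mu1, mu2, Sigma and x_new are assumed example values; k = 1 (equal costs and priors).
mu1 = np.array([1.0, 2.0])
mu2 = np.array([3.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

a = np.linalg.solve(Sigma, mu1 - mu2)        # a = Sigma^{-1}(mu1 - mu2)
K = 0.5 * a @ (mu1 + mu2)                    # K = (1/2)(mu1 + mu2)'Sigma^{-1}(mu1 - mu2)

x_new = np.array([2.2, 1.4])
decision = "pi_1" if a @ x_new >= K else "pi_2"

# Equivalent Mahalanobis-distance rule: classify to the population with the closer mean.
d1 = (x_new - mu1) @ np.linalg.solve(Sigma, x_new - mu1)
d2 = (x_new - mu2) @ np.linalg.solve(Sigma, x_new - mu2)
print(f"a'x = {a @ x_new:.3f}, K = {K:.3f} -> {decision};  d1^2 = {d1:.3f}, d2^2 = {d2:.3f}")
```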
Discrimination of p-variate Normal distributions (unequal covariance matrices)
Suppose that x1, ..., xp is data from a p-variate Normal distribution with mean vector μ1 or μ2 and covariance matrices Σ1 and Σ2 respectively:
  f(x) = (2π)^{-p/2} |Σ1|^{-1/2} exp{ -(1/2)(x - μ1)'Σ1^{-1}(x - μ1) },
  g(x) = (2π)^{-p/2} |Σ2|^{-1/2} exp{ -(1/2)(x - μ2)'Σ2^{-1}(x - μ2) }.
The optimal rule states that we should classify into populations π1 and π2 using
  f(x)/g(x) = ( |Σ2| / |Σ1| )^{1/2} exp{ (1/2)(x - μ2)'Σ2^{-1}(x - μ2) - (1/2)(x - μ1)'Σ1^{-1}(x - μ1) }.
That is, make the decision D1 (the population is π1) if f(x)/g(x) ≥ k, or equivalently if
  (1/2)(x - μ2)'Σ2^{-1}(x - μ2) - (1/2)(x - μ1)'Σ1^{-1}(x - μ1) + (1/2) ln|Σ2| - (1/2) ln|Σ1| ≥ ln k.
Summarizing, we make the decision to classify in population π1 if
  x'Ax + b'x + c ≥ 0,
where
  A = -(1/2)( Σ1^{-1} - Σ2^{-1} ),
  b = Σ1^{-1}μ1 - Σ2^{-1}μ2,
  c = -(1/2)( μ1'Σ1^{-1}μ1 - μ2'Σ2^{-1}μ2 ) + (1/2) ln( |Σ2| / |Σ1| ) - ln k,   with k = c_{1|2}P[2] / (c_{2|1}P[1]).
[Figure: with unequal covariance matrices the boundary x'Ax + b'x + c = 0 between the two classification regions is a quadratic curve rather than a straight line.]

Discrimination amongst k populations
We want to determine whether an observation vector x = (x1, ..., xp)' comes from one of the k populations
  π1: f1(x1, ..., xp) = f1(x),  ...,  πk: fk(x1, ..., xp) = fk(x).
For this purpose we need to partition p-dimensional space into k regions C1, C2, ..., Ck. We will make the decision D_i: x came from π_i if x ∈ C_i.

Misclassification probabilities
  P[j|i] = P[classify the case in π_j when the case is from π_i] = P[x ∈ C_j | π_i] = ∫_{C_j} f_i(x) dx.
Cost of misclassification: c_{j|i} = the cost of classifying the case in π_j when the case is from π_i.
Initial (prior) probabilities of inclusion: P[i] = P[the case is from π_i].

Expected cost of misclassification of a case from population π_i (we assume we know the case came from π_i):
  ECM(i) = c_{1|i} P[1|i] + ... + c_{i-1|i} P[i-1|i] + c_{i+1|i} P[i+1|i] + ... + c_{k|i} P[k|i] = Σ_{j ≠ i} c_{j|i} P[j|i].
Total expected cost of misclassification:
  ECM = P[1] ECM(1) + ... + P[k] ECM(k) = Σ_i P[i] Σ_{j ≠ i} c_{j|i} P[j|i] = Σ_j ∫_{C_j} [ Σ_{i ≠ j} P[i] c_{j|i} f_i(x) ] dx.

Optimal classification rule
The optimal classification rule finds the regions C_j that minimize ECM. If all misclassification costs are equal (c_{j|i} = c), then
  ECM = c Σ_j ∫_{C_j} [ Σ_{i=1}^{k} P[i] f_i(x) - P[j] f_j(x) ] dx,
and ECM is minimized if each C_j is chosen where the term that is omitted, P[j] f_j(x), is the largest.

Optimal regions when misclassification costs are equal
  C_j = { x : P[j] f_j(x) ≥ P[i] f_i(x) for all i ≠ j } = { x : ln( P[j] f_j(x) ) ≥ ln( P[i] f_i(x) ) for all i ≠ j }.

Optimal regions when misclassification costs are equal and the distributions are p-variate Normal with common covariance matrix Σ
In the case of normality
  f_i(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ -(1/2)(x - μ_i)'Σ^{-1}(x - μ_i) },
so
  ln( P[i] f_i(x) ) = ln P[i] - (p/2) ln 2π - (1/2) ln|Σ| - (1/2)(x - μ_i)'Σ^{-1}(x - μ_i),
and ln( P[j] f_j(x) ) ≥ ln( P[i] f_i(x) ) if and only if
  μ_j'Σ^{-1}x - (1/2) μ_j'Σ^{-1}μ_j + ln P[j] ≥ μ_i'Σ^{-1}x - (1/2) μ_i'Σ^{-1}μ_i + ln P[i],
i.e. a_j'x + b_j ≥ a_i'x + b_i, where a_i = Σ^{-1}μ_i and b_i = ln P[i] - (1/2) μ_i'Σ^{-1}μ_i.

Summarizing, we classify the observation vector in population π_j if
  L_j = a_j'x + b_j = max_i L_i = max_i ( a_i'x + b_i ),
where a_i = Σ^{-1}μ_i and b_i = ln P[i] - (1/2) μ_i'Σ^{-1}μ_i.
[Figure: for k = 3 populations the plane is partitioned into the three regions where L1, L2 or L3 is the largest, with boundaries L1 = L2, L1 = L3 and L2 = L3.]
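A minimal sketch of classification amongst k normal populations with a common covariance matrix via the linear scores L_i = a_i'x + b_i; the means, Σ and priors below are assumed example values.

```python
import numpy as np

# Sketch: classification amongst k populations with common Sigma via L_i = a_i'x + b_i.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([1.0, 3.0])]   # assumed means
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])
priors = [1.0 / 3, 1.0 / 3, 1.0 / 3]

def linear_scores(x):
    scores = []
    for mu, P in zip(mus, priors):
        a = np.linalg.solve(Sigma, mu)                          # a_i = Sigma^{-1} mu_i
        b = np.log(P) - 0.5 * mu @ np.linalg.solve(Sigma, mu)   # b_i = ln P[i] - mu_i'Sigma^{-1}mu_i / 2
        scores.append(a @ x + b)
    return np.array(scores)

x_new = np.array([1.5, 1.2])
L = linear_scores(x_new)
print("scores:", np.round(L, 3), "-> classify into population", int(np.argmax(L)) + 1)
```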
Classification or Cluster Analysis

Situation
• We have multivariate (or univariate) data from one or several populations (the number of populations is unknown).
• We want to determine the number of populations and identify the populations.

Hierarchical Clustering Methods
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables):
  1. Start with N clusters, each consisting of a single entity, and an N × N symmetric matrix (table) of distances (or similarities) D = (d_ij).
  2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters U and V be d_UV.
  3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by
     a) deleting the rows and columns corresponding to clusters U and V, and
     b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.
  4. Repeat steps 2 and 3 a total of N - 1 times. (All objects will be in a single cluster at termination of this algorithm.) Record the identity of the clusters that are merged and the levels (distances or similarities) at which the mergers take place.

Different methods of computing inter-cluster distance
[Figure: two clusters {1, 2} and {3, 4, 5} illustrating the three linkage rules.]
• Single linkage: the distance between two clusters is the smallest distance between a member of one cluster and a member of the other.
• Complete linkage: the distance between two clusters is the largest such distance.
• Average linkage: the distance is the average of all pairwise distances between members of the two clusters, e.g. (d13 + d14 + d15 + d23 + d24 + d25)/6.

k-means Clustering
A non-hierarchical clustering scheme: we want to subdivide the data set into k groups. The k-means algorithm:
  1. Initially subdivide the complete data set into k groups.
  2. Compute the centroid (mean vector) of each group.
  3. Sequentially go through the data, reassigning each case to the group with the closest centroid.
  4. After reassigning a case to a new group, recalculate the centroids of the original group and of the new group of which it is now a member.
  5. Continue until there are no further reassignments of cases.

Discrete Multivariate Analysis — Analysis of Multivariate Categorical Data

Multiway frequency tables
• Two-way: A × B.
• Three-way: A × B × C.
• Four-way: A × B × C × D.

Models for count data
• Binomial
• Hypergeometric
• Poisson
• Multinomial

Log-Linear Model — three-way frequency tables
Let μ_ijk denote the expected frequency in cell (i, j, k) of the table. Then in general
  ln μ_ijk = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)},
or, in the multiplicative form,
  μ_ijk = e^{ln μ_ijk} = θ θ_{1(i)} θ_{2(j)} θ_{3(k)} θ_{12(i,j)} θ_{13(i,k)} θ_{23(j,k)} θ_{123(i,j,k)}.

Hierarchical log-linear models for categorical data (three-way tables)
The hierarchical principle: if an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

Hierarchical log-linear models for a three-way table
  Model            Description
  [1][2][3]        Mutual independence between all three variables.
  [1][23]          Independence of variable 1 with variables 2 and 3.
  [2][13]          Independence of variable 2 with variables 1 and 3.
  [3][12]          Independence of variable 3 with variables 1 and 2.
  [12][13]         Conditional independence between variables 2 and 3 given variable 1.
  [12][23]         Conditional independence between variables 1 and 3 given variable 2.
  [13][23]         Conditional independence between variables 1 and 2 given variable 3.
  [12][13][23]     Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
  [123]            The saturated model.
Comments
• The log-linear model is similar to the ANOVA models for factorial experiments.
• The ANOVA models are used to understand the effects of categorical independent variables (factors) on a continuous dependent variable (Y).
• The log-linear model is used to understand dependence amongst categorical variables.
• The presence of interactions indicates dependence between the variables present in the interactions.

Goodness of Fit Statistics
These statistics can be used to check whether a log-linear model fits the observed frequency table.

The chi-squared statistic
  χ² = Σ (Observed - Expected)² / Expected = Σ_{ijk} (x_ijk - μ̂_ijk)² / μ̂_ijk.
The likelihood ratio statistic
  G² = 2 Σ Observed ln( Observed / Expected ) = 2 Σ_{ijk} x_ijk ln( x_ijk / μ̂_ijk ).
d.f. = (number of cells) - (number of parameters fitted). We reject the model if χ² or G² is greater than the critical value χ²_α with these degrees of freedom.

Conditional Test Statistics
• Suppose that we are considering two log-linear models and that Model 2 is a special case of Model 1, i.e. the parameters of Model 2 are a subset of the parameters of Model 1.
• Also assume that Model 1 has been shown to adequately fit the data.
In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation. The likelihood ratio chi-square statistic that achieves this goal is
  G²(2|1) = G²(2) - G²(1) = 2 Σ Observed ln( Expected_1 / Expected_2 ),
with d.f. = d.f._2 - d.f._1.

Stepwise selection procedures: forward selection, backward elimination

Forward selection: starting with a model that underfits the data, log-linear parameters that are not in the model are added step by step until a model that does fit is achieved. At each step the log-linear parameter that is most significant is added to the model. To determine the significance of a parameter added we use the statistic G²(2|1) = G²(2) - G²(1), where Model 1 contains the parameter and Model 2 does not.

Backward elimination: starting with a model that overfits the data, log-linear parameters that are in the model are deleted step by step until a model that continues to fit and has the smallest number of significant parameters is achieved. At each step the log-linear parameter that is least significant is deleted from the model. To determine the significance of a parameter deleted we use the statistic G²(2|1) = G²(2) - G²(1), where Model 1 contains the parameter and Model 2 does not.

Modelling of response variables (independent → dependent) — Logit Models
When some variables are dependent (response) variables and other variables are independent (predictor) variables, the logit model is used when the dependent variable is binary.

Case: one dependent variable, two independent variables
Consider the log-linear model for μ_ijk, the expected frequency in cell (i, j, k) of the table:
  ln μ_ijk = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}.
Variables 1 and 2 are the independent variables and variable 3 is the binary dependent variable.

The logit model (for the binary response variable)
  ln( μ_ij1 / μ_ij2 ) = ln μ_ij1 - ln μ_ij2
    = [ u + u_{1(i)} + u_{2(j)} + u_{3(1)} + u_{12(i,j)} + u_{13(i,1)} + u_{23(j,1)} + u_{123(i,j,1)} ]
      - [ u + u_{1(i)} + u_{2(j)} + u_{3(2)} + u_{12(i,j)} + u_{13(i,2)} + u_{23(j,2)} + u_{123(i,j,2)} ]
    = 2u_{3(1)} + 2u_{13(i,1)} + 2u_{23(j,1)} + 2u_{123(i,j,1)},
since u_{3(2)} = -u_{3(1)}, u_{13(i,2)} = -u_{13(i,1)}, u_{23(j,2)} = -u_{23(j,1)} and u_{123(i,j,2)} = -u_{123(i,j,1)}.
The logit model is therefore
  ln( μ_ij1 / μ_ij2 ) = v + v_{1(i)} + v_{2(j)} + v_{12(i,j)},
where v = 2u_{3(1)}, v_{1(i)} = 2u_{13(i,1)}, v_{2(j)} = 2u_{23(j,1)} and v_{12(i,j)} = 2u_{123(i,j,1)}. Thus corresponding to a log-linear model there is a logit model predicting the log ratio of the expected frequencies of the two categories of the dependent variable.
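As a small worked illustration of the goodness-of-fit statistics above, the sketch below fits the independence model to an assumed two-way table and computes χ² and G²; the same cell-by-cell formulas apply in higher-way tables.

```python
import numpy as np
from scipy import stats

# Sketch: Pearson chi-square and likelihood ratio G^2 for the independence model [1][2]
# fitted to an assumed two-way frequency table.
observed = np.array([[25, 15, 10],
                     [20, 30, 20]], dtype=float)
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()                     # fitted (expected) frequencies

chi2 = ((observed - expected) ** 2 / expected).sum()
G2 = 2.0 * (observed * np.log(observed / expected)).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)    # cells minus fitted parameters
print(f"chi2 = {chi2:.3f}, G2 = {G2:.3f}, df = {df}, p = {stats.chi2.sf(G2, df):.4f}")
```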
Also, (k + 1)-factor interactions with the dependent variable in the log-linear model determine k-factor interactions in the logit model:
  k + 1 = 1 gives the constant term in the logit model;
  k + 1 = 2 gives the main effects in the logit model.

Fitting a Logit Model with a Polytomous Response Variable
Techniques for handling a polytomous response variable — approaches:
  1. Consider the categories two at a time; do this for all possible pairs of the categories.
  2. Look at the continuation ratios:
     i.   1 vs 2
     ii.  1, 2 vs 3
     iii. 1, 2, 3 vs 4
     iv.  etc.

Causal or Path Analysis for Categorical Data
When the data is continuous, a causal pattern may be assumed to exist amongst the variables.

The path diagram
This is a diagram summarizing causal relationships. Straight arrows are drawn from a variable that has some cause and effect on another variable (X → Y). Curved double-sided arrows are drawn between variables that are simply correlated (X ↔ Y).

Example 1
The variables: Job Stress, Smoking, Heart Disease.
[Path diagram: Job Stress → Smoking, Job Stress → Heart Disease, Smoking → Heart Disease.]
In path analysis for continuous variables, one is interested in determining the contribution along each path (the path coefficients).

Example 2
The variables: Job Stress, Alcoholic Drinking, Smoking, Heart Disease.
[Path diagram: Job Stress → Drinking, Job Stress → Smoking, with Job Stress, Drinking and Smoking → Heart Disease.]

If you have a causal path diagram, then variables at the end of a straight one-sided arrow should be treated as dependent variables in a logit analysis.

Example 1 (analyses)
  Analysis 1 — Job Stress predicting Smoking.
  Analysis 2 — Job Stress and Smoking predicting Heart Disease.

Example 2 (analyses)
  Analysis 1 — Job Stress predicting Drinking.
  Analysis 2 — Job Stress predicting Smoking.
  Analysis 3 — Job Stress, Drinking and Smoking predicting Heart Disease.