
Contingency tables and Correspondence analysis

• Contingency table
• Pearson's chi-squared test for association
• Correspondence analysis using SVD
• Plots
• References
• Exercises
Contingency tables
Contingency tables are often used in the social sciences (such as sociology, education and psychology). These tables can be considered as frequency tables whose rows and columns are categorical variables. If a variable is continuous, we can bin it and so convert it into a categorical one.
Categorical variables take discrete values: for example, different drugs and the effect of each drug rated as "excellent", "good", etc.
Contingency tables are sometimes called incidence matrices. An example of a contingency table: a survey of the effects of four different drug types, in which patients gave a score for each drug type (excellent, very good, good, fair, poor). The total number of observations is 121.
            Drug A   Drug B   Drug C   Drug D
excellent      6       12        0        1
very good      8        8        3        1
good          10        3       12        8
fair           1        3        6       12
poor           5        5       10        7
The first question is whether there is an association between rows and columns. If there is, then we want to find structure in the table: can we order rows and columns by their closeness? Can we find associations between particular rows and columns?
The problem of correspondence analysis is to find an optimal representation of the contingency table in a lower-dimensional space, so that rows and columns are on the same scale.
Pearson's chi-squared test
Suppose that we have a data matrix N with I rows and J columns, whose elements are n_ij. Let us use the following notation:

$n = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}, \qquad P = N/n, \qquad r = P\mathbf{1}, \qquad c = P^T\mathbf{1}$

$D_r = \mathrm{diag}(r), \qquad D_c = \mathrm{diag}(c)$

$R = D_r^{-1}P, \qquad C = D_c^{-1}P^T$

$Q = P - rc^T$

Here r and c are the row and column sums, R and C are the row and column profiles, respectively, and Q is the difference between P and the product of the row and column sums. More notation and relations:
$\mathrm{in}(I) = \mathrm{tr}\big(D_r(R - \mathbf{1}c^T)D_c^{-1}(R - \mathbf{1}c^T)^T\big)$ is the total inertia of the rows.

$\mathrm{in}(J) = \mathrm{tr}\big(D_c(C - \mathbf{1}r^T)D_r^{-1}(C - \mathbf{1}r^T)^T\big)$ is the total inertia of the columns.

The relation in(I) = in(J) holds:

$\mathrm{in}(I) = \mathrm{tr}\big(D_r(D_r^{-1}P - \mathbf{1}c^T)D_c^{-1}(D_r^{-1}P - \mathbf{1}c^T)^T\big) = \mathrm{tr}\big(QD_c^{-1}Q^TD_r^{-1}\big) = X^2/n$

$\mathrm{in}(J) = \mathrm{tr}\big(D_c(D_c^{-1}P^T - \mathbf{1}r^T)D_r^{-1}(D_c^{-1}P^T - \mathbf{1}r^T)^T\big) = \mathrm{tr}\big(Q^TD_r^{-1}QD_c^{-1}\big) = X^2/n$

The row and column inertias are thus the chi-squared statistic, with (I-1)(J-1) degrees of freedom, multiplied by 1/n. If P were a probability matrix and there were no association between rows and columns, then Q would be 0; this is equivalent to saying that the rows and columns are independent.
For the example above, the chi-squared test carried out in R gives:

Pearson's Chi-squared test
data: Dr1
X-squared = 47.0718, df = 12, p-value = 4.53e-06

This test shows that the null hypothesis should be rejected, i.e. there is strong evidence of a row-column association.
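This result can be reproduced in a few lines of R. The following is a minimal sketch; the object name drug is our choice, and R will warn that some expected counts are small:

drug <- matrix(c( 6, 12,  0,  1,
                  8,  8,  3,  1,
                 10,  3, 12,  8,
                  1,  3,  6, 12,
                  5,  5, 10,  7),
               ncol = 4, byrow = TRUE,
               dimnames = list(c("excellent", "very good", "good", "fair", "poor"),
                               c("A", "B", "C", "D")))
chisq.test(drug)    # X-squared = 47.0718, df = 12, as quoted above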
Probabilistic interpretation of matrices
If the matrix P were a probability matrix, i.e. each element p_ij were the probability of the corresponding row and column categories occurring simultaneously, then we would have the following interpretation of the involved matrices (a short R sketch follows the list):
1) Elements of r are the marginal probabilities of the rows; elements of c are the marginal probabilities of the columns.
2) Elements of Q are the differences between the joint probabilities and the products of the individual probabilities. In this sense the matrix represents the degree of dependence between rows and columns.
3) Elements of R are the conditional probabilities of the columns when the row is known.
4) Elements of C are the conditional probabilities of the rows when the column is known.
5) The total inertia is an overall indicator of the dependence between rows and columns.
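As a sketch, all of these matrices can be computed from the drug table defined above (the variable names are ours):

P <- drug / sum(drug)     # correspondence matrix P = N/n
r <- rowSums(P)           # row masses: marginal probabilities of the rows
c <- colSums(P)           # column masses: marginal probabilities of the columns
Q <- P - outer(r, c)      # Q = P - r c^T, deviations from independence
Rprof <- P / r            # row profiles R = D_r^{-1} P
Cprof <- t(P) / c         # column profiles C = D_c^{-1} P^T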
Contingency tables: homogeneity and heterogeneity
t = in(I) = X^2/n is a coefficient of association called Pearson's mean-square contingency; it is the total inertia. The total inertia is a measure of the homogeneity/heterogeneity of the table: large t indicates heterogeneity, small t indicates homogeneity. Homogeneity means that there is no row-column association. t can also be calculated using:

$t = \sum_{i=1}^{I} r_i \left[\sum_{j=1}^{J} (p_{ij}/r_i - c_j)^2 / c_j\right]$

The inner summation is a weighted squared distance between the vector of relative frequencies of the ith row (the ith row profile, with elements p_ij/r_i) and the average row profile c; the weights are the inverses of the elements of c. It is known as the chi-squared distance between the ith row profile and the average row profile. The total inertia is then a weighted sum of the I chi-squared distances, where the weights are the elements of r. If all row profiles are close to the average row profile then the table is homogeneous; otherwise the table is heterogeneous.
We can do similar calculations for the column profiles; this is done easily by exchanging the roles of r and c.
These distances are similar to Euclidean distances, and techniques used for Euclidean distances can also be used in this case. We will learn techniques for metric scaling in one of the later lectures.
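A minimal sketch of this calculation, reusing Rprof, r and c from the snippet above:

d2 <- apply(Rprof, 1, function(prof) sum((prof - c)^2 / c))   # chi-squared distance of each row profile to the average profile c
t_inertia <- sum(r * d2)   # total inertia, weighted by the row masses
t_inertia                  # equals X^2/n = 47.0718/121, about 0.389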
Correspondence analysis and eigenvalues
For a given contingency table we calculate the row and column profiles. Now we want to find a vector g such that the scores Rg of the row profiles have the highest possible variance. That means we want to maximise

$(Rg - \mathbf{1}c^Tg)^T D_r (Rg - \mathbf{1}c^Tg) \to \max$

To make this problem solvable we add constraints (similar to PCA): we require the weighted norm of the vector to be one and its weighted mean to be 0, where the weights are the column sums:

$g^T D_c g = 1, \qquad c^T g = 0$

Since the mean is 0 and we know that $c^Tg = r^TRg = 0$, the maximisation problem can be written as

$(Rg)^T D_r Rg = g^T P^T D_r^{-1} P g \to \max$

Using the Lagrange multiplier technique we get:

$g^T P^T D_r^{-1} P g + \lambda(1 - g^T D_c g) \to \max \;\;\Rightarrow\;\; P^T D_r^{-1} P g = \lambda D_c g$

Thus the problem reduces to a generalized eigenvalue problem. As a result we get principal coordinates for the columns; similarly we can find principal coordinates for the rows. The problem is solved easily and compactly using the singular value decomposition, as shown in the next section.
The conditions that the weighted norm of the vector be one and the weighted mean be 0 are similar to those in PCA (where the norm of the vector is one and the mean values of the variables are 0).
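For illustration, the eigenproblem can also be solved directly in R by symmetrising it with D_c^{1/2} (a sketch, reusing P, r and c from above; the leading eigenvalue 1 is the trivial solution with g proportional to a vector of ones, which the constraint c^T g = 0 removes):

M <- diag(1/sqrt(c)) %*% t(P) %*% diag(1/r) %*% P %*% diag(1/sqrt(c))
e <- eigen(M)
g <- diag(1/sqrt(c)) %*% e$vectors[, 2]   # first non-trivial solution; satisfies g^T D_c g = 1 and c^T g = 0
e$values[2]                               # the first principal inertia, equal to the squared first singular value below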
Contingency table: Correspondence analysis
The problem stated above is solved using the singular value decomposition of Q, the probability matrix minus the product of the row and column sums:

$Q = P - rc^T, \qquad W = D_r^{-1/2} Q D_c^{-1/2}$

Let us use the singular value decomposition:

$W = UD_\mu V^T, \qquad U^TU = I, \qquad V^TV = I$

It is equivalent to the generalized singular value decomposition:

$Q = U_1 D_\mu V_1^T, \qquad U_1 = D_r^{1/2}U, \qquad V_1 = D_c^{1/2}V, \qquad U_1^T D_r^{-1} U_1 = I, \qquad V_1^T D_c^{-1} V_1 = I$

The principal row and column coordinates are:

$F = D_r^{-1} Q D_c^{-1} V_1 = D_r^{-1} U_1 D_\mu$
$G = D_c^{-1} Q^T D_r^{-1} U_1 = D_c^{-1} V_1 D_\mu$
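Continuing the R sketch (Q, r and c as computed earlier), the decomposition and the principal coordinates take only a few lines:

W <- Q / sqrt(outer(r, c))           # W = D_r^{-1/2} Q D_c^{-1/2}
s <- svd(W)                          # s$u = U, s$d = singular values, s$v = V
F <- (s$u / sqrt(r)) %*% diag(s$d)   # principal row coordinates F = D_r^{-1} U_1 D_mu
G <- (s$v / sqrt(c)) %*% diag(s$d)   # principal column coordinates G = D_c^{-1} V_1 D_mu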
The first few (one or two) columns of F and G are usually taken and plotted simultaneously. The transitions between row and column coordinates are given by:

$F = D_r^{-1} P G D_\mu^{-1} = RGD_\mu^{-1}, \qquad G = D_c^{-1} P^T F D_\mu^{-1} = CFD_\mu^{-1}$

This relation is useful for adding supplementary rows or columns to the picture, as the sketch below illustrates.
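For example, a supplementary row can be placed on the first two axes with the transition formula (the frequencies below are made up purely for illustration; G and s come from the previous snippet):

new_row <- c(4, 6, 2, 1)               # hypothetical ratings for a fifth drug
h <- new_row / sum(new_row)            # its row profile
f_sup <- (h %*% G[, 1:2]) / s$d[1:2]   # F = R G D_mu^{-1}, applied to this one profile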
Another useful formula is the reconstruction formula:

$P = rc^T + D_r^{1/2} U D_\mu V^T D_c^{1/2} = rc^T + D_r F D_\mu^{-1} G^T D_c$
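The formula can be checked numerically with the objects from the earlier sketches; a reconstruction of rank min(I, J) - 1 recovers P exactly, up to rounding:

k <- min(nrow(Q), ncol(Q)) - 1   # the rank of Q for a two-way table
P_rec <- outer(r, c) + diag(r) %*% F[, 1:k] %*% diag(1/s$d[1:k]) %*% t(G[, 1:k]) %*% diag(c)
max(abs(P - P_rec))              # ~ 0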
Correspondence analysis
The squared elements of D_μ are called the principal inertias; the singular values themselves correspond to the canonical correlations given by the package R. A larger value means that the corresponding dimension has higher importance. It is usual to use one or two columns of F and G; these columns are then used for various plots.
For a pictorial representation, either the columns and rows are plotted in an ordered form, or a biplot is used to find possible associations between rows and columns as well as their order.
It is worth noting that correspondence analysis is a very useful tool. It is widely used in archaeology, ecology, medicine and psychology, and it may even be useful in history and other fields.
Many problems can be brought to this type of analysis: as soon as you can define two sets of categories, say cat1 and cat2, and find frequencies for all cross terms of cat1 and cat2, you can apply correspondence analysis.
On the other hand, it should be considered a dimension reduction technique and can be used together with others (for example PCA). Comparative application of different dimension reduction techniques may give insight into the problem and the structure of the data.
Algorithm of Correspondence analysis
1. Take a contingency table (N) and find the sum of all its elements (the total sum).
2. Divide all elements by the total sum (call the result P).
3. Find the row and column sums (r and c).
4. From each element of P, subtract the product of the corresponding elements of the row and column sums (call the result Q).
5. Find the generalised SVD of Q. The normalisation conditions for the left and right matrices are weighted normalisations, with weights corresponding to the inverses of the row and column sums.
6. Find the principal row and column coordinates. Take a few columns and plot them (steps 1-6 are collected into an R sketch below).
7. If there are new elements (rows or columns), use the transition formulas to find the principal coordinates corresponding to them, and plot them as supplementary points. (R does not allow doing this directly.)
8. Analyse the results (the order and closeness of columns and rows, and possible associations between columns and rows).
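The following sketch collects steps 1-6 into a single function; all names are our own, and it is an illustration rather than the implementation used by R's corresp:

ca_sketch <- function(N, nf = 2) {
  P <- N / sum(N)                      # steps 1-2: correspondence matrix
  r <- rowSums(P); c <- colSums(P)     # step 3: row and column masses
  Q <- P - outer(r, c)                 # step 4: deviations from independence
  s <- svd(Q / sqrt(outer(r, c)))      # step 5: SVD of the standardised residuals
  F <- (s$u / sqrt(r)) %*% diag(s$d)   # step 6: principal row coordinates
  G <- (s$v / sqrt(c)) %*% diag(s$d)   #         principal column coordinates
  list(rows = F[, 1:nf], cols = G[, 1:nf], sv = s$d[1:nf])
}
ca <- ca_sketch(drug)
plot(rbind(ca$rows, ca$cols), type = "n", xlab = "Axis 1", ylab = "Axis 2")
text(ca$rows, labels = rownames(drug))                # rows in black
text(ca$cols, labels = colnames(drug), col = "red")   # columns in red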
Plot of correspondence analysis: Example
This is a one-dimensional pictorial form of the drug-quality table. The positions of the rows and columns correspond to the row and column scores, and the size of each circle corresponds to the count in the corresponding cell of the contingency table. This picture can already tell us something about the structure of the data.
Biplot for the correspondence analysis
Biplot produced by R: Columns and rows are plotted simultaneously. Black are rows
and red are columns. Positions of the points correspond to their scores. Again
from this picture we can deduce some structure about data.
R commands for contingency tables and correspondence analysis
For correspondence analysis we need the libraries ctest, MASS and mva, which we load with:

library(mva)
library(MASS)
library(ctest)

(mva and ctest may not be needed if you use R version 2.0.0 or higher.)
To perform the chi-squared test we can use (loading the data first):

data(drivers)
dr1 = matrix(drivers, ncol = 12, byrow = TRUE)
chisq.test(dr1)
chisq.test(dr1, simulate.p.value = TRUE)

If there is some association between rows and columns then we can start using correspondence analysis:

cdriver = corresp(dr1, nf = 1)   # nf is the number of factors we want to find
plot(cdriver)                    # with nf = 1 this gives a pictorial representation of the table; with nf = 2 it gives a biplot
References
1) Krzanowski WJ and Marriott FHC (1994). Multivariate Analysis. Kendall's Library of Statistics.
2) Greenacre MJ (1984). Theory and Applications of Correspondence Analysis.
Exercises 5
a) Take the data set deaths from R: monthly death rates from lung diseases in the UK. These data cannot be used directly with the chisq.test and corresp commands; they should first be converted to a data matrix, which can be done using:

data(deaths)
dth = matrix(deaths, ncol = 12, byrow = TRUE)

Now try to analyse these data using the correspondence analysis technique.
b) Take the data set accdeaths (accidental deaths in the USA, 1973-1978). These data should also be converted to a data matrix.
Sometimes it is better to work with data frames with names for the rows and columns. In the example of deaths this can be done using (now dth1 should be analysed):

dth1 = data.frame(dth, row.names = c('1974','1975','1976','1977','1978','1979'))
names(dth1) = c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')