Statistics for Marketing and Consumer Research


Correspondence Analysis
Chapter 14
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi
Correspondence analysis
• A multivariate statistical technique that looks into the association of two or more categorical variables and displays them jointly on a bivariate graph
• It can be used to apply multidimensional scaling to categorical variables
Correspondence analysis and data reduction techniques
• Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables
• Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale
• Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables
• The issue with categorical (non-ordinal) variables is how to measure distances between two objects: correspondence analysis exploits contingency tables and association measures
Example (Trust data)
• Do consumers with different jobs (q55) show preferences for some specific type of chicken (q6)?

Correspondence Table
Rows: "If employed, what is your occupation?" — Columns: "In a typical week, what type of fresh or frozen chicken do you buy for your household's home consumption?"

                               'Value'   'Standard'   'Organic'   'Luxury'   Active Margin
                               chicken    chicken      chicken     chicken
I am not employed                 17         50           10          17          94
Non manual employee               11         74           14          28         127
Manual employee                    6         19            4           8          37
Executive                          0          7            6          14          27
Self employed professional         1         18            7           3          29
Farmer / agricultural worker       1          1            1           0           3
Employer / Entrepreneur            0          4            2           3           9
Other                             11         31            1           1          44
Active Margin                     47        204           45          74         370
Independence
• If the two variables are independent, then the number in each cell of the table should simply depend on the row and column totals (lecture 9)
• Measure the distance between the expected frequency in each cell and the actual (observed) frequency
• Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual values is statistically significant (a sketch of this computation follows below)
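As an illustration, the expected frequencies and the Chi-square statistic for the chicken-by-occupation table above can be computed as in the following minimal sketch (Python with NumPy/SciPy assumed; the slides themselves use SPSS and SAS):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the correspondence table above
# (rows: occupations, columns: 'Value', 'Standard', 'Organic', 'Luxury' chicken)
observed = np.array([
    [17, 50, 10, 17],
    [11, 74, 14, 28],
    [ 6, 19,  4,  8],
    [ 0,  7,  6, 14],
    [ 1, 18,  7,  3],
    [ 1,  1,  1,  0],
    [ 0,  4,  2,  3],
    [11, 31,  1,  1],
])

# Expected counts under independence: row total * column total / grand total
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)

# The same test in one call
chi2_check, p_value, dof, expected_check = chi2_contingency(observed)
print(chi2_check, p_value, dof)
```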
Reducing the number of dimensions
• The elements composing the Chi-square statistic are standardized metric values, one for each of the cells
• They become larger as the association between two specific categories increases
• These elements can be interpreted as a metric measure of distance
• The resulting matrix is similar to a covariance matrix
• A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions
Coordinates
• The principal component scores provide standardized values that can be used as coordinates
• One may apply the same data reduction technique:
  • first by rows (synthesizing occupation as a function of types of chicken)
  • then by columns (synthesizing types of chicken as a function of occupation)
• The first two components for each application generate a bivariate plot which shows both the occupations and the types of chicken in the same space
Output from Correspondence Analysis
[Correspondence analysis map of occupations and chicken types. The unemployed are closer to "Value" chicken; executives prefer "Luxury" chicken.]
Applications
• It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)
• This allows one to explore brand choice in relation to product characteristics, opening the way to product modifications and innovations to meet consumer preferences
• Correspondence analysis is particularly useful when the variables have many categories
• The application to metric (continuous) data is not ruled out, but the data need to be categorized first
Summary
• Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand
• This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps
• Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix
• For example, if rows represent brands and columns are different attributes:
  1. applying the method by rows yields the coordinates for the brands
  2. applying it by columns allows one to represent the attributes in the same graph
Steps to run correspondence analysis
• Represent the data in a contingency table
• Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles
• Extract the dimensions (in a similar fashion to PCA)
• Evaluate the explanatory power of the selected number of dimensions
• Plot row and column objects in the same coordinate space (a sketch of these steps is given below)
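A minimal sketch of these steps in Python (NumPy assumed; the function name and details are illustrative, not the exact SPSS algorithm):

```python
import numpy as np

def correspondence_analysis(N):
    """Simple correspondence analysis of a contingency table N (array of counts)."""
    P = N / N.sum()                          # relative frequencies f_ij
    r = P.sum(axis=1)                        # row masses f_i0
    c = P.sum(axis=0)                        # column masses f_0j
    # Chi-square-type standardized residuals: the metric "distance" matrix
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    # Extract the dimensions with a PCA-like decomposition (SVD)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2                        # explanatory power of each dimension
    # Principal coordinates used to plot rows and columns in the same space
    row_coord = (U * sv) / np.sqrt(r)[:, None]
    col_coord = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coord, col_coord, inertia
```

Plotting the first two columns of `row_coord` and `col_coord` on the same axes gives the kind of map shown in the Trust-data example above.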
The frequency table
Relative frequencies f_ij for categorical variable X (k categories, in rows) and categorical variable Y (l categories, in columns):

                   y1    y2   ...   yj   ...   yl  | Row masses
   x1             f11   f12   ...  f1j   ...  f1l  |   f10
   x2             f21   f22   ...  f2j   ...  f2l  |   f20
   ...
   xi             fi1   fi2   ...  fij   ...  fil  |   fi0
   ...
   xk             fk1   fk2   ...  fkj   ...  fkl  |   fk0
   -------------------------------------------------------
   Column masses  f01   f02   ...  f0j   ...  f0l  |    1

Each row (fi1, ..., fil) is a row profile and each column (f1j, ..., fkj) is a column profile; the marginal totals fi0 and f0j are the row masses and column masses.
Interpretation of coordinates
• The categories of the x variable can be seen as different coordinates for the points identified by the y variable
• The categories of the y variable can be seen as different coordinates for the points identified by the x variable
• Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure
Representations
• Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances
• This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y
• The same procedure is carried out transposing the table, which means that the categories of y can be represented using the categories of x to define the distances (row and column profiles are computed in the sketch below)
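For example, the row and column profiles (and the corresponding masses) can be obtained directly from the contingency table. A sketch reusing the chicken-by-occupation counts from the Trust-data slide (Python/NumPy assumed):

```python
import numpy as np

# Contingency table: occupations (rows) by type of chicken (columns)
N = np.array([[17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
              [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1]])

P = N / N.sum()                                   # relative frequencies f_ij
row_masses = P.sum(axis=1)                        # f_i0
col_masses = P.sum(axis=0)                        # f_0j
row_profiles = N / N.sum(axis=1, keepdims=True)   # each row rescaled to sum to one
col_profiles = N / N.sum(axis=0, keepdims=True)   # each column rescaled to sum to one
```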
Computing the distances
• When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows.
• Obtain the expected table frequencies under independence:

  $f^{*}_{ij} = \frac{n_{i0}\, n_{0j}}{n_{00}^{2}} = \frac{f_{i0}\, f_{0j}}{f_{00}} = f_{i0}\, f_{0j}$

  where n_ij and f_ij are the absolute and relative frequencies, respectively; n_i0 and n_0j (or f_i0 and f_0j) are the marginal totals for row i and column j (the row masses and column masses); and n_00 is the sample size (hence the total relative frequency f_00 equals one).
• The Chi-square value can now be computed for each cell (i, j):

  $\chi^{2}_{ij} = \frac{(f_{ij} - f^{*}_{ij})^{2}}{f^{*}_{ij}}$

  These cell values can be read as squared distances between category i of the x variable and category j of the y variable.
The distance matrix
• The matrix of χ²_ij values measures all of the associations between the categories of the first variable and those of the second one.
• A generalization to the multivariate case (MCA) is possible by stacking the matrix
  • Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)
  • The stacked matrix is referred to as the Burt table
• To obtain similarity values from the χ² matrix (see the sketch below):
  • compute the square root of the elemental Chi-square values
  • use the appropriate sign (the sign of the difference f_ij − f*_ij)
  • large positive values correspond to strongly associated categories
  • large negative values identify those categories where the association is strong but negative, indicating dissimilarity
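A sketch of this step (Python/NumPy assumed; the function name is illustrative):

```python
import numpy as np

def signed_chi_square_matrix(N):
    """Signed square roots of the elemental Chi-square values of a contingency table N."""
    P = N / N.sum()                                      # relative frequencies f_ij
    expected = np.outer(P.sum(axis=1), P.sum(axis=0))    # f*_ij = f_i0 * f_0j
    chi2_cells = (P - expected) ** 2 / expected          # elemental Chi-square values
    # Square root with the sign of (f_ij - f*_ij): large positive = strong positive association
    return np.sign(P - expected) * np.sqrt(chi2_cells)
```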
Estimation
• The resulting matrix D contains metric and continuous similarity data
• It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x, then those of y
• Before PCA can be applied, some normalization is required so that the input matrix becomes similar to a correlation matrix
• The use of the square root of the row (column) masses for normalizing the values in D represents the key difference from PCA
• The rest of the estimation process follows the same procedure as PCA
• As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension
Inertia
• Inertia is a measure of association between two categorical variables based on the Chi-square statistic.
• In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit, because the effectiveness of correspondence analysis depends on the degree of association between x and y
• Total inertia
  – is a measure of the overall association between x and y
  – is equal to the sum of the eigenvalues
  – corresponds to the Chi-square value divided by the number of observations
  – a total inertia above 0.20 is expected for adequate representations
• Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables (see the sketch below)
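As a sketch of how these quantities relate (Python/NumPy assumed; illustrative helper), total inertia can be computed either as the sum of the eigenvalues or as the Chi-square statistic divided by the sample size:

```python
import numpy as np

def inertia_summary(N):
    """Total inertia and the proportion accounted for by each dimension (sketch)."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    eig = np.linalg.svd(S, compute_uv=False) ** 2    # eigenvalues = inertia per dimension
    total_inertia = eig.sum()                        # equals the Pearson Chi-square / n
    return total_inertia, eig / total_inertia        # proportion of inertia per dimension
```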
SPSS example
• EFS data set:
  • economic position of the household reference person (a093)
  • type of tenure (a121)
• Their Pearson Chi-square value is 274, which means a significant association at the 99.9% confidence level
Analysis
Define the range, i.e. the categories for each variable that enter the analysis.
Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores.
Model options
• Choose the number of dimensions to be retained
• Choice of distance measure
• Standardization (only for Euclidean distance)
• Normalization: which variable should be privileged?
Number of dimensions
• The maximum number of dimensions for the analysis is equal to
  • the number of rows minus one, or
  • the number of columns minus one (whichever is smaller; a one-line helper is sketched below)
• In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.
• As shown later in this section, one may then choose to graphically represent only a sub-set of the extracted dimensions (usually two or three) to make interpretation easier
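In code, the bound described above is simply (illustrative helper, not part of any package):

```python
def max_dimensions(n_row_categories, n_col_categories):
    # Maximum number of extractable dimensions for a correspondence table
    return min(n_row_categories - 1, n_col_categories - 1)
```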
Distance measure
• Chi-square distance (as discussed earlier; a sketch of the Chi-square distance between row profiles is given below)
• Euclidean distance
  • uses the square root of the sum of squared differences between pairs of rows and pairs of columns
  • this also requires one to choose a method for centering the data (see the SPSS manual for details)
• For this example, standard correspondence analysis (with the Chi-square distance) does not require a standardization method.
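A sketch of the Chi-square distance between row profiles (Python/NumPy assumed; the Euclidean option described above would use unweighted squared differences instead):

```python
import numpy as np

def chi_square_row_distances(N):
    """Pairwise Chi-square distances between the row profiles of a contingency table N."""
    profiles = N / N.sum(axis=1, keepdims=True)          # row profiles
    col_masses = N.sum(axis=0) / N.sum()                 # f_0j, used as weights
    diff = profiles[:, None, :] - profiles[None, :, :]   # pairwise differences of profiles
    return np.sqrt((diff ** 2 / col_masses).sum(axis=-1))
```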
Normalization method
• Defines how correspondence analysis is run: whether to give priority to comparisons between the categories of x (rows) or those of y (columns)
• This choice influences the way distances are summarized by the first dimensions
• Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is, the categories of x
• The opposite holds for the column principal method
• Symmetrical normalization: the distances on the graph resemble as much as possible the distances for both x and y, by spreading the total inertia symmetrically
• Principal normalization: inertia is first spread over the scores for x, then y
• Weighted normalization: defines a weighting value between minus one and plus one, where minus one is the column principal method, zero is the symmetrical method and plus one is the row principal method
• EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions impact on the tenure type than to look at distances between tenure types (the sketch below illustrates these options)
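The sketch below illustrates, in Python/NumPy, how the row principal, column principal and symmetrical options rescale the same decomposition; this is only an assumption-laden outline, and the exact SPSS conventions may differ in detail:

```python
import numpy as np

def ca_coordinates(N, normalization="row principal"):
    """Row and column coordinates under three of the normalizations described above."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    V = Vt.T
    if normalization == "row principal":       # row distances approximate Chi-square distances
        row, col = U * sv, V
    elif normalization == "column principal":  # column distances are privileged instead
        row, col = U, V * sv
    else:                                      # "symmetrical": inertia spread over both sets
        row, col = U * np.sqrt(sv), V * np.sqrt(sv)
    return row / np.sqrt(r)[:, None], col / np.sqrt(c)[:, None]
```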
Additional statistics
Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of a multinomial distribution of the cell frequencies (when data are obtained as a random sample from the population).
The procedure also allows one to order the categories of x and y using the scores obtained from CA. E.g. the tenure types and the socio-economic conditions might follow some ordering, but this cannot be defined with sufficient precision to consider these variables as ordinal; one can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table.
Plots
Three graphs are available:
• Biplot (both x and y)
• x only (rows)
• y only (columns)
One usually chooses to represent only the first two or three of the extracted dimensions.
Output
The first dimension explains 85% of total inertia, and the first two together explain 93%. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions.

Summary (a. 24 degrees of freedom)

Dimension   Singular Value   Inertia   Chi Square   Sig.    Proportion of Inertia
                                                            Accounted for   Cumulative
1               .669           .447                             .850           .850
2               .209           .044                             .083           .933
3               .173           .030                             .057           .990
4               .072           .005                             .010          1.000
Total                          .526     231.402     .000a      1.000          1.000

Confidence Singular Value
Dimension   Standard Deviation   Correlation with dim. 2   dim. 3   dim. 4
1                 .031                    .094              -.032    -.022
2                 .055                                       .011     .081
3                 .055                                               -.042
4                 .053

• The singular value (SV) of each dimension is the square root of its inertia (the eigenvalue)
• Usually a value of total inertia above 0.2 is regarded as acceptable
• The Chi-square statistic suggests a strong and significant association
• These precision measures (standard deviations and correlations of the singular values) are based on the multinomial distribution assumption
Row scores
[SPSS table: Overview Row Points (Row Principal normalization; a. supplementary point). For each economic-position category (Self-employed, Full-time employee, Part-time employee, Unemployed, Work-related government training programme, Retired/unoccupied over minimum NI age) the table reports the mass, the scores in dimensions 1-4, the inertia, the contribution of the point to the inertia of each dimension and the contribution of each dimension to the inertia of the point.]

• The mass column shows the relative weight of each category in the sample
• The full-time employee and retired categories have a higher relevance because they are more important categories in the original correspondence table; these two categories (especially retirement) strongly contribute to explaining the first dimension
• The second dimension is characterized by the unemployed and part-time employees
• Scores are computed for each category except the supplementary one, provided there are no missing data
• The inertia column shows how total inertia has been distributed across rows (similar to communalities)
• The scores are the coordinates for the map
Column scores
• The same exercise is carried out on the columns; note, however, that the row principal method does not normalize by column
[SPSS table: Overview Column Points (Row Principal normalization; a. supplementary point). For each tenure type (Local Authority rented unfurnished, Housing association, Other rented unfurnished, Rented furnished, Owned with mortgage, Owned by rental purchase, Owned outright, Rent free) the table reports the mass, the scores in dimensions 1-4, the inertia and the contributions of points to dimensions and of dimensions to points.]

• By column, the first dimension is especially related to the "owned with mortgage" and "owned outright" categories
Bi-plot
[Biplot of economic position and tenure type. Employed individuals are closer to owned accommodation; retired individuals are also close to owned accommodation; part-time employees and unemployed individuals are closer to rented and other forms of accommodation.]
Multiple Correspondence Analysis (MCA)
When all variables are multiple nominal, optimal scaling applies MCA.
Plot with 3 variables
The analysis now also includes the government office region.
SAS correspondence analysis
• SAS procedure: proc CORRESP
• simple correspondence analysis
• multiple correspondence analysis (option MCA)
• same types of normalization as SPSS
• option PROFILE (ROW, COLUMN or BOTH)