Transcript Document
Multivariate Coarse Classing
of Nominal Variables
Geraldine E. Rosario
Talk given at Fair Isaac on July 14, 2003
Based on paper “Mapping Nominal Values to Numbers
for Effective Visualization”, InfoVis 2003.
1
Outline
• Motivation
• Overview of Distance-Quantification-Classing approach
• Algorithmic Details
• Experimental Evaluation
• Wrap-Up
2
Those pesky nominal variables
• Nominal variable: a variable whose values
do not have a natural ordering or distance
• High cardinality nominal variable: has a
large number of distinct values
• Examples?
• Examples of business applications using
nominal variables?
• Why do you usually pre-process/transform
them before doing data analysis?
3
Visualizing Nominal
Variables
• Most data visualization
tools are designed for
numeric variables.
• What if variable is
nominal?
• Most tools which are
designed for nominal
variables cannot handle
a large # of values.
4
Quantified Nominal Variables
Are the order
and spacing
of values
within each
variable
believable?
5
Coarse Classing Nominal Variables
• Possible ways of classing nominal variables with
high cardinality:
– Domain expertise
– Univariate: using information about the variable itself.
e.g. based on frequency of occurrence of the values
– Bivariate: using information from one other variable.
e.g. relationship with predictor variable
– Multivariate: based on the profile across several other
variables. e.g. using cluster analysis
• Is multivariate coarse classing better?
6
The approach
7
Proposed Approach
Pre-process nominal variables using a Distance-Quantification-Classing (DQC) approach
Steps:
1. Distance – transform the data so that the distance
between 2 nominal values can be calculated (based on
the variable’s relationship with other variables)
2. Quantification – assign order and spacing to the
nominal values
3. Classing or intra-dimension clustering – determine
which values are similar to each other and can be
grouped together
Each step can be done by more than one technique.
8
Distance-Quantification-Classing Approach
Input: target variable & data set with nominal variables
DISTANCE STEP → transformed data for distance calculation
QUANTIFICATION STEP → nominal-to-numeric mapping
CLASSING STEP → classing tree
9
Example Input to Output
Task: Pre-process color based on its patterns across quality and size.

Data:
  Quality (3): good, ok, bad
  Color (6): blue, green, orange, purple, red, white
  Size (10): a to j

Observed Counts, COLOR by QUALITY:

            Good     Ok    Bad   Total
  Blue       187    727    546    1460
  Green      267    538    356    1161
  Orange     276    411    191     878
  Purple     155    436    361     952
  Red        283    307    357     947
  White      459    366    327    1152
  Total     1627   2785   2138    6550

Output: a nominal-to-numeric mapping (blue -0.02, purple 0, green -0.54,
red -0.50, orange 0.55, white 0.57) and a classing tree over the leaf
order blue, purple, green, red, orange, white.
10
Other Potential Uses of DQC as Pre-Processor
• For techniques that require numeric inputs: linear
regression, some clustering algorithms (can speed up
calculations but with some loss of accuracy)
• For techniques that require low cardinality nominal
variables: scorecards, neural networks, association rules
• FICO-specific:
– Multivariate coarse classing
– ClusterBots – nominal variables could be quantified
and distance calculations would be simpler. Could be
applied to mixed variables?
– Product groups, merchant groups
• Can you think of other uses?
11
Details … Details …
12
Distance Step:
Correspondence Analysis
• Used for analyzing n-way tables containing some measure
of association between rows and columns
• Simple Correspondence Analysis (SCA) – for 2 variables
• Multiple Correspondence Analysis (MCA) – for > 2
variables. Uses SCA.
• Focused Correspondence Analysis (FCA) – proposed
alternative to MCA when memory is limited. Uses SCA.
• Reinvented as Dual Scaling, Reciprocal Averaging,
Homogeneity Analysis, etc.
• Similar to PCA but for nominal variables
13
Simple Correspondence Analysis – The Basic Idea
Observed Counts, COLOR by QUALITY:

            Good     Ok    Bad   Total
  Blue       187    727    546    1460
  Green      267    538    356    1161
  Orange     276    411    191     878
  Purple     155    436    361     952
  Red        283    307    357     947
  White      459    366    327    1152
  Total     1627   2785   2138    6550

Can we find similar COLORs based on their association with QUALITY?

Calculate the χ² statistic (measures the strength of association between
COLOR and QUALITY based on the assumption of independence). Any deviation
from independence will increase the χ² value. (A quick sketch of this
check follows the tables.)

Row Percentages:

            Good     Ok    Bad   Total
  Blue        13     50     37     100
  Green       23     46     31     100
  Orange      31     47     22     100
  Purple      16     46     38     100
  Red         30     32     38     100
  White       40     32     28     100

Similar profiles: e.g., blue (13, 50, 37) and purple (16, 46, 38).
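A minimal Python sketch of this χ² check (using scipy is my choice for illustration, not something from the talk):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Observed counts: COLOR (rows) by QUALITY (columns good, ok, bad)
    counts = np.array([
        [187, 727, 546],   # blue
        [267, 538, 356],   # green
        [276, 411, 191],   # orange
        [155, 436, 361],   # purple
        [283, 307, 357],   # red
        [459, 366, 327],   # white
    ])
    chi2, pval, dof, expected = chi2_contingency(counts)
    print(chi2, dof)   # any deviation from independence inflates chi2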
14
Simple Correspondence Analysis – Steps
Steps:
1. Normalize the counts table into a row percentage matrix (similar row
   profiles: (blue, purple), …) and a column percentage matrix (similar
   column profiles: (ok, bad), …).
2. Identify a few independent dimensions which can reconstruct the χ²
   value (SVD, eigenanalysis).
3. Scale the new dimensions such that χ² distances between row points
   are maximized.

Output: eigenvalues and coordinates for the independent dimensions
(a code sketch follows):

            Dim1    Dim2
  Blue     -0.02   -0.28
  Green    -0.54    0.14
  Orange    0.55    0.10
  Purple    0.00   -0.25
  Red      -0.50    0.20
  White     0.57    0.19
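A compact numpy sketch of these steps (my own rendering of textbook SCA, e.g. [Gre93], not the talk's code; applied to the COLOR by QUALITY counts above it reproduces Dim1/Dim2 up to sign and rounding):

    import numpy as np

    def sca(counts):
        """Simple Correspondence Analysis of a two-way counts table.
        Returns row principal coordinates and eigenvalues (principal
        inertias); χ² distances between row points are preserved."""
        P = counts / counts.sum()              # correspondence matrix
        r, c = P.sum(axis=1), P.sum(axis=0)    # row / column masses
        # standardized residuals from independence
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
        U, sv, Vt = np.linalg.svd(S, full_matrices=False)
        row_coords = (U * sv) / np.sqrt(r)[:, None]
        return row_coords, sv ** 2             # min(r,c)-1 useful dims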
15
Simple Correspondence Analysis – The Output
• Coordinates Matrix
  – Set of independent dimensions
  – Dimensions ordered by diminishing importance
  – Total # of independent dimensions = min(r,c) - 1
  – Similar to principal components from PCA
• Eigenvalues
  – Indicate the importance of each independent dimension
16
Distance Step Alternative:
Multiple Correspondence Analysis
• Steps:
  1. BurtTable(rawdataMatrix) → burtMatrix
  2. SCA(burtMatrix) → coordMatrix, evaluesVector
  3. ReduceNDim(coordMatrix, evaluesVector) → coordMatrixSubset
• Input to SCA – the Burt table: crosses all variables by all variables.
  Block (i, j) of the Burt table is the Xi-by-Xj counts table
  (X1 by X1, X1 by X2, …). A sketch of its construction follows.
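A pandas sketch of the BurtTable step (pandas is my choice for illustration; the actual implementation used SAS procedures):

    import pandas as pd

    def burt_table(df):
        """Cross all nominal variables by all variables. The Burt matrix
        is Z'Z, where Z is the 0/1 indicator matrix of df; block (i, j)
        is the Xi-by-Xj counts table."""
        Z = pd.get_dummies(df.astype(str)).astype(float)
        return Z.T @ Z    # square and symmetric, one row/column per value

    # MCA is then SCA applied to this matrix, e.g.:
    # coords, evalues = sca(burt_table(df).to_numpy())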
17
Multiple Correspondence Analysis
• Features:
– For a given variable, determines which values
are similar to each other by comparing value
profiles across all other variables
• multivariate
• maximizes usage of information
• memory-intensive
– Simultaneously analyzes all variables
• efficient calculations
18
Reduce Number of Dimensions to Keep
• Reduce the number of independent dimensions to keep for subsequent
analysis (needed because of the large # of analysis variables and
high cardinality); see the sketch below.

[Scree plot: eigenvalue vs. dimension # (1 to 5)]
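One simple stand-in for the scree-plot judgment, as a Python sketch (the cumulative-inertia cutoff and its 90% default are my assumptions):

    import numpy as np

    def reduce_ndim(coords, evalues, cum_share=0.90):
        """Keep the leading dimensions that together account for a given
        share of the total eigenvalue mass."""
        share = np.cumsum(evalues) / np.sum(evalues)
        k = int(np.searchsorted(share, cum_share)) + 1
        return coords[:, :k]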
19
Distance Step Alternative:
Focused Correspondence Analysis
• Proposed alternative to MCA when memory space
is limited
• Core idea: instead of comparing value profiles
across all other nominal variables, just compare
value profiles across the nominal variables which
are most correlated with the target variable
• Input to Simple CA: the counts tables of the target variable Xi
  against each of its top correlated variables, placed side by side
  (Xi by X3, Xi by X1, Xi by X9, …)
20
Focused Correspondence Analysis
• Steps (a sketch of steps 1-3 follows):
  1. PairwiseAssociate(rawdataMatrix) → assocMatrix
  2. Set k (# analysis variables to use)
  3. FCATable(rawdataMatrix, k, assocMatrix) → fcaInputMatrix
  4. SCA(fcaInputMatrix) → coordMatrix, evaluesVector
  5. ReduceNDim(coordMatrix, evaluesVector) → coordMatrixSubset
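A pandas sketch of the FCATable step (function and variable names are mine; the association matrix is the U(R|C) table from the next slide, with a 1.0 diagonal):

    import pandas as pd

    def fca_table(df, target, k, assoc):
        """Stack the counts tables of `target` against its top-k most
        associated variables side by side; the result is the input
        table for Simple CA."""
        partners = assoc.loc[target].drop(target).nlargest(k).index
        blocks = [pd.crosstab(df[target], df[v]) for v in partners]
        return pd.concat(blocks, axis=1)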
21
FCA: Calculate Pairwise Association
• Used the Uncertainty Coefficient U(R|C) to
measure strength of nominal association
  – Bounded [0,1]
  – U(R|C) = 1 means the value of row variable R can be
    known precisely given the value of column variable C
• Example: U(R|C) association matrix

  U(R|C)      Quality    Color     Size
  Quality     1.0        0.0287    0.0028
  Color       0.0173     1.0       0.1234
  Size        0.0017     0.1267    1.0
22
FCA: Determine top k associated
variables for each nominal variable
• Set k >= 2 to ensure use of at least one
analysis variable per target variable (each
variable's strongest association is with itself)
• Cannot use a threshold on the association
measure
23
Focused Correspondence Analysis
• Features:
– One-at-a-time analysis
• Less/controllable memory usage
• Sub-optimal quantification compared to MCA
– Requires pre-processing step to determine top
correlated variables per target variable
• longer run time
24
Quantification Step: Modified Optimal Scaling
Coordinates for Independent Dimensions:

            Dim1    Dim2
  Blue     -0.02   -0.28
  Green    -0.54    0.14
  Orange    0.55    0.10
  Purple    0.00   -0.25
  Red      -0.50    0.20
  White     0.57    0.19

  → Optimal Scaling →

Nominal-to-numeric mapping:

  Nominal    Numeric
  Blue        -0.02
  Green       -0.54
  Orange       0.55
  Purple       0
  Red         -0.50
  White        0.57

Optimal Scaling goal: maximize the variance of the scores of the
records, where score = average(q_ij):

  Rec    Q1     Q2    ...   Score
  1      0.5   -0.3   …      0.4
  2     -0.6    0.1   …     -0.02
  …
25
Quantification Step: Modified Optimal Scaling
• Problem with Optimal Scaling: perfect
associations between variables are not
recreated in the quantified versions
• Modified Optimal Scaling (a sketch follows):
  – Let p = # of eigenvalues equal to 1.0
  – If p >= 1 then set numeric[i] = Σ_{j=1..p} coordinate[i, j]
  – Else set numeric[i] = coordinate[i, 1]
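A Python sketch of this rule (assumes the eigenvalues are sorted in decreasing order, so any eigenvalues equal to 1.0 come first; the float tolerance is my assumption):

    import numpy as np

    def modified_optimal_scaling(coords, evalues, tol=1e-9):
        """numeric[i] = sum of coordinates over the dimensions whose
        eigenvalue equals 1.0 (perfect association); otherwise fall
        back to the first, most important dimension."""
        p = int(np.sum(np.abs(evalues - 1.0) < tol))
        if p >= 1:
            return coords[:, :p].sum(axis=1)
        return coords[:, 0]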
26
Classing Step: Hierarchical Cluster Analysis
Coordinates for Independent Dimensions [from FCA], weighted by counts:

            Dim1    Dim2   Counts
  Blue     -0.02   -0.28     1460
  Green    -0.54    0.14     1161
  Orange    0.55    0.10      878
  Purple    0.00   -0.25      952
  Red      -0.50    0.20      947
  White     0.57    0.19     1152

Cluster analysis weighted by counts yields a classing tree over the
leaf order blue, purple, green, red, orange, white (a sketch follows).
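A minimal sketch of count-weighted agglomerative clustering (nearest weighted centroids merged first; the talk's implementation used PROC CLUSTER, so this is only illustrative). On the table above, the first three merges are (blue, purple), (green, red), and (orange, white), matching the leaf order shown:

    import numpy as np

    def weighted_classing(coords, counts, labels):
        """Merge the two closest weighted centroids until one cluster
        remains; return the classing tree as nested tuples."""
        clusters = [(np.asarray(x, float), float(w), name)
                    for x, w, name in zip(coords, counts, labels)]
        while len(clusters) > 1:
            i, j = min(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda ab: np.linalg.norm(
                           clusters[ab[0]][0] - clusters[ab[1]][0]))
            xj, wj, nj = clusters.pop(j)    # j > i, so index i stays valid
            xi, wi, ni = clusters[i]
            clusters[i] = ((wi * xi + wj * xj) / (wi + wj),  # new centroid
                           wi + wj, (ni, nj))
        return clusters[0][2]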
27
Loss of Information
due to Classing
Observed Counts, COLOR by SIZE (U(R|C) = 0.1234):

            a      b    …   j   Total
  Blue      0      8    …        1460
  Green     0      2    …        1161
  Orange    7     49    …         878
  Purple    0      5    …         952
  Red       0      0    …         947
  White     6     70    …        1152
  Total    13    134    …        6550

[Bar chart: cumulative info loss (0 to 100) at each merge, over the
leaf order blue, purple, green, red, orange, white]

1. Determine the variable V with the highest association with target X.
2. Create the X by V counts table.
3. Calculate the total table measure of association A (e.g., U(X|V)).
4. Starting from the bottom of the tree, for every pair of nodes merged,
   calculate the cumulative information loss (a sketch follows):

   InfoLoss = 100 × (A(fullTable) − A(afterMerging)) / A(fullTable)
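A Python sketch of step 4, using U(X|V) as the association measure A (the row-index encoding of merged classes is my assumption):

    import numpy as np

    def u_r_given_c(counts):
        """Uncertainty coefficient U(R|C) = (H_R + H_C - H_RC) / H_R."""
        p = counts / counts.sum()
        H = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
        h_r = H(p.sum(axis=1))
        return (h_r + H(p.sum(axis=0)) - H(p.ravel())) / h_r

    def info_loss(full_table, classes):
        """classes: list of row-index lists, one list per merged class
        of the X by V counts table."""
        merged = np.array([full_table[c].sum(axis=0) for c in classes])
        a_full, a_merged = u_r_given_c(full_table), u_r_given_c(merged)
        return 100.0 * (a_full - a_merged) / a_full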
28
Distance-Quantification-Classing Approach
Input: target variable & data set with nominal variables
DISTANCE STEP → transformed data for distance calculation
QUANTIFICATION STEP → nominal-to-numeric mapping
CLASSING STEP → classing tree
29
Does this approach work?
30
Experimental Evaluation
• Wrong quantification and classing will introduce
artificial patterns and cause errors in interpretation
• Evaluation measures:
  – Believability (perception)
  – Quality of visual display (perception)
  – Quality of classing (statistical)
  – Quality of quantification (statistical)
  – Space (computational) – FCA uses less space
  – Run time (computational) – MCA is faster
31
Test Data Sets
32
Believability and Quality of
Visual Display
• Given two displays resulting from different
nominal-to-numeric mappings:
– Which mapping gives a more believable
ordering and spacing?
• Based on your domain knowledge, are the values
that are positioned close together similar to each
other?
• Are the values that are positioned far from the rest
of the values really outliers?
– Which display has less clutter?
33
Automobile Data: Alphabetical
34
Automobile Data: MCA
Are these
patterns
believable?
35
Automobile Data: FCA
Are these
patterns
believable?
36
PERF Data: Alphabetical
Region-Country:
1-many
Country-Product:
many-many
Are these
associations
preserved and
revealed?
37
PERF Data: FCA
Region-Country:
1-many
Country-Product:
many-many
Are these
associations
preserved and
revealed?
38
Quality of Classing
• Classing A is better than classing B if, given a
classing tree, the rate of information loss with
each merge is slower

[Chart: information loss due to classing for one variable; calculate
the difference between the lines. The lower the line, the slower the
info loss, the better the classing.]
39
Which classing is better
… depends on the data set

[Chart: distribution of the difference between the lines]
40
Quality of Quantification
• A quantification is good if:
  1. Data points that are close together in nominal space are also
     close together in numeric space.
  2. Two variables that are highly associated with each other also
     have highly correlated quantified versions.
41
MCA gives better quantification
[Charts: average squared correlation (higher value = better
quantification), and correlation between MCA and FCA scales
(how close the FCA scales are to the MCA scales)]
42
Had enough yet?
43
Going back to
Multivariate Coarse Classing
• Other issues:
– Missing values
– Mixed or numeric variables as analysis
variables
– Nominal values with small counts
– Robustness of quantification and classing
44
Can you think of other uses of DQC at FICO?
• For techniques that require numeric inputs: linear
regression, some clustering algorithms (can speed up
calculations but with some loss of accuracy)
• For techniques that require low cardinality nominal
variables: scorecards, neural networks, association rules
• FICO-specific:
– Multivariate coarse classing
– ClusterBots – nominal variables could be quantified
and distance calculations would be simpler. Could be
applied to mixed variables?
– Product groups, merchant groups
– ???????
45
Implementation
• SAS version exists
– PROC CORRESP, PROC CLUSTER, PROC
FREQ
• C++ version in development
46
Summary
• DQC is a general-purpose approach for pre-processing
nominal variables for data analysis techniques
requiring numeric variables or low cardinality nominal
variables
• DQC – multivariate, data-driven, scalable, distance-preserving, association-preserving
• FCA is a viable alternative to MCA when memory
space is limited
• Quality of classing and quantification
– depends on strength of associations within the data set.
– is in the eye of the user
47
Yippee, it’s over!
Original InfoVis2003 paper: Mapping Nominal
Values to Numbers for Effective Visualization.
http://davis.wpi.edu/~xmdv/documents.html
XmdvTool Homepage:
http://davis.wpi.edu/~xmdv
[email protected]
Code is free for research and education.
48
References
• [Gre93] Greenacre, M.J. (1993). Correspondence Analysis in Practice.
London: Academic Press.
• [Gre84] Greenacre, M.J. (1984). Theory and Applications of
Correspondence Analysis. London: Academic Press.
• [Sta] StatSoft Inc. Correspondence Analysis.
http://www.statsoftinc.com/textbook/stcoran.html
• [Fri99] Friendly, M. (1999). "Visualizing Categorical Data." In
Sirken, M.G., et al. (eds.), Cognition and Survey Research. New
York: John Wiley & Sons.
• [Kei97] Keim, D.A. (1997). Visual Techniques for Exploring Databases.
Invited tutorial, Int. Conference on Knowledge Discovery in
Databases (KDD'97), Newport Beach, CA.
• [Hua97b] Huang, Z. (1997). A Fast Clustering Algorithm to Cluster Very
Large Categorical Data Sets in Data Mining.
• SAS Manuals (PROC CORRESP, PROC CLUSTER, PROC FREQ)
49
What input tables can SCA accept?
• In general, SCA can use as input any table
that has these properties:
  1. The table must use the same physical units or
     measurements, and
  2. The values in the table must be non-negative.
• The FCA input table satisfies these properties.
50
Uncertainty Coefficient U(R|C)
U(R|C) = (H_R + H_C − H_RC) / H_R

where

  H_RC = − Σ_{i=1..r} Σ_{j=1..c} p_ij log(p_ij),  with p_ij = P[R = i, C = j]
  H_R  = − Σ_{i=1..r} p_i• log(p_i•),             with p_i• = Σ_{j=1..c} p_ij
  H_C  = − Σ_{j=1..c} p_•j log(p_•j),             with p_•j = Σ_{i=1..r} p_ij

Source: SAS PROC FREQ
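The same formula as a small Python sketch (the log base cancels in the ratio, so natural log is fine):

    import numpy as np

    def uncertainty_coefficient(counts):
        """U(R|C) for a counts table with rows R and columns C."""
        p = counts / counts.sum()
        H = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
        h_r = H(p.sum(axis=1))
        return (h_r + H(p.sum(axis=0)) - H(p.ravel())) / h_r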
51
Average Squared Correlation
• Given the raw data matrix R = [r_ij], where the columns represent
the variables, create a new matrix Q = [q_ij] where q_ij = quantified
version of r_ij. Let Q_j = the jth column of Q.
• For each record i, calculate score_i = average_j(q_ij).
• For each variable j, calculate corr_j = correlation(Q_j, score).
• Calculate the average of the squared correlations (a sketch follows).

  Rec    Q1     Q2    ...   Score
  1      0.5   -0.3   …      0.4
  2     -0.6    0.1   …     -0.02
  …

  Pair         Sqr(Correlation)
  Q1, score    0.36
  Q2, score    0.49
  …            average = ___

Source: [Gre93]
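A numpy sketch following the steps above:

    import numpy as np

    def average_squared_correlation(Q):
        """Q: records-by-variables matrix of quantified values.
        score_i = mean of row i; return the mean over j of
        corr(Q_j, score) squared."""
        score = Q.mean(axis=1)
        corrs = [np.corrcoef(Q[:, j], score)[0, 1]
                 for j in range(Q.shape[1])]
        return float(np.mean(np.square(corrs)))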
52