Transcript Slide 1

Chapter 2
Data Mining Processes and Knowledge Discovery
Identify actionable results
Contents
 Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
 Discusses each phase in detail
 Gives an example illustration
 Discusses a knowledge discovery process
CRISP-DM
Cross-Industry Standard Process for Data Mining
 One of the first comprehensive attempts toward a standard process model for data mining
 Independent of industry sector & technology
CRISP-DM Phases
1. Business (or problem) understanding
2. Data understanding
   • A systematic process for making sense of the massive amounts of data generated by daily operations.
3. Data preparation
   • Transform & create the data set for modeling
4. Modeling
5. Evaluation
   • Check the candidate models and evaluate them to make sure nothing important is missing
6. Deployment
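To make the flow of these phases concrete, here is a minimal, self-contained Python sketch that walks a toy data set through the first five phases; the tiny in-line records and the trivial threshold "model" are illustrative assumptions, not anything prescribed by CRISP-DM.

```python
# A minimal sketch of the CRISP-DM phases as plain functions.
# The toy records and the trivial threshold "model" are illustrative only.

def business_understanding():
    # Phase 1: state the goal in measurable terms.
    return "predict whether a customer will buy"

def data_understanding():
    # Phase 2: collect and inspect the raw data (toy records here).
    return [{"age": 25, "income": "High", "buy": "No"},
            {"age": 45, "income": "Low", "buy": "Yes"}]

def data_preparation(raw):
    # Phase 3: transform and create the data set for modeling.
    income_code = {"Low": 1, "Medium": 2, "High": 3}
    return [([r["age"], income_code[r["income"]]], r["buy"]) for r in raw]

def modeling(train):
    # Phase 4: fit a (trivial) classifier.
    return lambda x: "Yes" if x[0] > 30 else "No"

def evaluation(model, data):
    # Phase 5: check the model before deployment (phase 6).
    return sum(model(x) == y for x, y in data) / len(data)

goal = business_understanding()
prepared = data_preparation(data_understanding())
classifier = modeling(prepared)
print(goal, "- accuracy on the toy data:", evaluation(classifier, prepared))
```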
Business Understanding
Solve a specific problem
 Determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan.
A clear problem definition helps
Measurable success criteria
Convert business objectives into a set of data-mining goals
What to achieve in technical terms, such as:
 What types of customers are interested in each of our products?
 What are the typical profiles of customers …
Data Understanding
Initial data collection, data description, data exploration, and the verification of data quality.
Three issues considered in data selection:
1. Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify spending behaviors of female shoppers who purchase seasonal clothes.
2. Identify the data relevant to the problem description, such as demographic, credit card transaction, and financial data …
3. Select the variables that are relevant and important for the project.
Data Understanding (cont.)
Data types:
 Demographic data (income, education, age, …)
 Socio-graphic data (hobby, club membership, …)
 Transactional data (sales records, credit card spending, …)
 Quantitative data: measurable using numerical values
 Qualitative data: also known as categorical data; contains both nominal and ordinal data (see also page 22)
Related data can come from many sources:
 Internal
 ERP (or MIS)
 Data Warehouse
 External
 Government data
 Commercial data
 Created
 Research
Data Preparation
Once the available data sources are identified, the data need to be selected, cleaned, and built into the desired, properly formatted form.
Clean data: fix formats, fill gaps, filter outliers & redundancies (see page 22)
Unified numerical scales:
 Nominal data: code (such as gender data, male and female)
 Ordinal data: numerical code or scale (excellent, fair, poor)
 Cardinal data (categorical: A, B, C levels)
Types of Data

Type          Features              Synonyms
Numerical     Continuous, Integer   Range
Binary        Yes/No                Flag
Categorical   Finite set            Set
Date/Time                           Range
String, Text                        Typeless

Range: numeric values (integer, real, or date/time)
Set: data with multiple distinct values (numeric, string, or date/time)
Typeless: for other types of data
Data Preparation (Cont.)
Several statistical methods and visualization tools can be used to preprocess the selected data.
 Summary statistics such as max, min, mean, and mode can be used to aggregate or smooth the data.
 Scatter plots and box plots can be used to filter outliers.
 More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may be applied in data preprocessing.
 In some cases, data preprocessing can take over 50% of the time of the entire data mining process.
 Shortening data preprocessing time can therefore cut much of the total computation time in data mining.
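As a concrete illustration of these preprocessing steps, the sketch below computes the summary statistics named above and filters outliers with the interquartile-range rule that underlies a box plot; the data values are made up.

```python
# Minimal preprocessing sketch: summary statistics plus a box-plot style
# (interquartile range) outlier filter. The numbers are made up.
import statistics

values = [12, 15, 14, 13, 400, 16, 15, 14, 13, 12]   # 400 is an obvious outlier

print("min:", min(values), "max:", max(values),
      "mean:", statistics.mean(values), "mode:", statistics.mode(values))

q1, _, q3 = statistics.quantiles(values, n=4)          # quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr             # the usual box-plot fences
filtered = [v for v in values if low <= v <= high]
print("kept:", filtered)
```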
Data Preparation – Data Transformation
Data transformation uses simple mathematical formulas or learning curves to convert different measurements of the selected, cleaned data into a unified numerical scale for analysis.
Data transformation can be used to:
1. Transform from one numerical scale to another, to shrink or enlarge the given data, e.g., (x - min)/(max - min) to map the data into the interval [0, 1] (sketched below).
2. Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, …), e.g., 1 = yes, 0 = no.
See page 24 for more details.
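A short sketch of both transformations, using made-up values:

```python
# Sketch of the two transformations above: min-max scaling into [0, 1]
# and recoding categorical values to numbers. The data values are made up.
ages = [22, 35, 58, 41]
lo, hi = min(ages), max(ages)
scaled = [(x - lo) / (hi - lo) for x in ages]          # (x - min) / (max - min)

ordinal = {"less": 1, "moderate": 2, "strong": 3}      # ordered categories
nominal = {"yes": 1, "no": 0}                          # unordered, binary code
print(scaled, ordinal["moderate"], nominal["no"])
```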
Modeling
Data modeling is where the data mining software is used to generate results for the various situations. Data visualization and cluster analysis are useful for initial analysis.
Depending on the type of data and the task:
1. If the task is to group data, discriminant analysis is applied.
2. If the purpose is estimation, regression is appropriate when the data are continuous (and logistic regression when they are not).
3. Neural networks can be applied to both tasks.
Data treatment:
 Training set for development of the model.
 Test set for testing the model that is built.
 Possibly other sets for refining the model.
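The data treatment above (a training set plus a held-out test set) is commonly done with a random split; the sketch below uses scikit-learn's train_test_split on made-up data, and the one-third test share is an assumption rather than anything the slide specifies.

```python
# Hold out part of the data for testing; the rest is used for training.
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]   # toy features
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]                        # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(len(X_train), "training cases,", len(X_test), "test cases")
```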
Data mining techniques
Techniques
 Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also page 25 for an example.
 Classification: methods intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and predictive power (e.g., C4.5).
Mathematical models often used to construct classification methods are binary decision trees (CART), neural networks (nonlinear), linear programming (boundary-based), and statistics.
See also pages 25 and 26 for more explanation.
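As a tiny illustration of the association idea, the sketch below computes support and confidence for a hypothetical rule {bread} -> {butter} over made-up market-basket transactions.

```python
# Support and confidence of the rule {bread} -> {butter} over toy baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / len(transactions)      # how often the pair occurs at all
confidence = both / bread               # how often butter accompanies bread
print(support, confidence)              # 0.5 and about 0.67
```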
Data mining techniques (Cont.)
 Clustering: takes ungrouped data and uses automatic techniques to put the data into groups. Clustering is unsupervised and does not require a learning set. (Chapter 5)
 Prediction: related to regression techniques; used to discover the relationship between dependent and independent variables.
 Sequential patterns: seek to find similar patterns in data transactions over a business period. The mathematical models behind sequential patterns include logic rules, fuzzy logic, and so on.
 Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods.
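A minimal clustering sketch: k-means (here via scikit-learn's KMeans, one of several possible tools) groups unlabeled points without any learning set.

```python
# k-means clustering on made-up, unlabeled points.
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]   # two obvious groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)   # cluster id assigned to each point
```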
Evaluation (CRISP-DM)
 Does the model meet the business objectives?
 Are any important business objectives not addressed?
 Does the model make sense?
 Is the model actionable?
Deployment
DM can be used to verify previously held hypotheses or for knowledge discovery.
DM models can be applied to business purposes, including prediction or identification of key situations.
Ongoing monitoring & maintenance:
 Evaluate performance against success criteria
 Watch market reaction & competitor changes (remodel or fine-tune as needed)
Example
Training set for computer purchase
16 records
5 attributes
Goal
Find classifier for consumer behavior
Database (1st half)

Case  Age    Income  Student  Credit     Gender  Buy?
A1    31-40  High    No       Fair       Male    Yes
A2    >40    Medium  No       Fair       Female  Yes
A3    >40    Low     Yes      Fair       Female  Yes
A4    31-40  Low     Yes      Excellent  Female  Yes
A5    ≤30    Low     Yes      Fair       Female  Yes
A6    >40    Medium  Yes      Fair       Male    Yes
A7    ≤30    Medium  Yes      Excellent  Male    Yes
A8    31-40  Medium  No       Excellent  Male    Yes
Database (2nd half)

Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   ≤30    High     No       Fair       Male    No
A11   ≤30    High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   ≤30    Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   ≤30    Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No
Data Selection
Gender has a weak relationship with purchase (based on correlation), so drop Gender.
Selected attribute set: {Age, Income, Student, Credit}
Data Preprocessing
Income unknown in Case 15
Credit not available in Case 16
Drop these noisy cases
Data Transformation
Assign numerical values to each attribute:
 Age: ≤30 = 3, 31-40 = 2, >40 = 1
 Income: High = 3, Medium = 2, Low = 1
 Student: Yes = 2, No = 1
 Credit: Excellent = 2, Fair = 1
Data Mining
Categorize the output:
 Buys = C1
 Doesn't buy = C2
Conduct the analysis:
 The model says A8 and A10 don't buy; the rest do
 Of the actual "yes" cases, 7 correct and 1 not
 Of the actual "no" cases, 2 correct
Confusion matrix
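The slides do not say which classification algorithm produced these results; as one plausible way to reproduce the exercise, the sketch below trains a decision tree (scikit-learn's DecisionTreeClassifier) on the 14 retained cases, using the numeric codes from the Data Transformation slide.

```python
# Sketch only: the slides do not name the classifier; a decision tree is one
# plausible choice for this small, coded training set.
from sklearn.tree import DecisionTreeClassifier

# Coded cases A1-A14 (A15 and A16 were dropped), as [Age, Income, Student, Credit].
X = [
    [2, 3, 1, 1], [1, 2, 1, 1], [1, 1, 2, 1], [2, 1, 2, 2],  # A1-A4
    [3, 1, 2, 1], [1, 2, 2, 1], [3, 2, 2, 2], [2, 2, 1, 2],  # A5-A8
    [2, 3, 2, 1], [3, 3, 1, 1], [3, 3, 1, 2], [1, 1, 2, 2],  # A9-A12
    [3, 2, 1, 1], [1, 2, 1, 2],                               # A13-A14
]
y = ["C1"] * 9 + ["C2"] * 5   # A1-A9 buy (C1), A10-A14 do not buy (C2)

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[3, 3, 1, 1]]))   # predict for a profile coded like case A10
```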
Data Interpretation and Test Data Set
Test on independent data

Case          Actual  Model
B1            Yes     Yes (1)
B2            Yes     Yes (2)
B3            Yes     Yes (3)
B4            Yes     Yes (4)
B5            Yes     Yes (5)
B6            Yes     Yes (6)
B7            Yes     Yes (7)
B8 (do not)   No      No
B9            No      Yes
B10 (do not)  No      No
Confusion Matrix

             Model Buy  Model Not  Totals
Actual Buy       7          0         7
Actual Not       1          2         3
Totals           8          2        10
Measures
Correct classification rate: 9/10 = 0.90
Cost function (cost of error):
 model says buy, actual no: $20
 model says no, actual buy: $200
Cost = 1 × $20 + 0 × $200 = $20
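A quick arithmetic check of both measures, using the confusion-matrix counts from the previous slide:

```python
# Correct classification rate and error cost from the confusion matrix.
tp, fn = 7, 0    # actual buy:  predicted buy / predicted not
fp, tn = 1, 2    # actual not:  predicted buy / predicted not

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 9/10 = 0.90
cost = fp * 20 + fn * 200                    # 1 x $20 + 0 x $200 = $20
print(accuracy, cost)
```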
Goals
Avoid broad concepts:
 "Gain insight," "discover meaningful patterns," "learn interesting things"
 Attainment can't be measured
Narrow and specify:
 Identify customers likely to renew; reduce churn
 Rank-order by propensity (inclination) to …
Goals
Description: what is
understand
explain
discover knowledge
Prescription: what should be done
classify
predict
Goal
Method A: four rules, explains 70%
Method B: fifty rules, explains 72%
Which is best?
 To gain understanding: Method A is better (minimum description length, MDL)
 To reduce the cost of a mailing: Method B is better
Measurement
Accuracy
 How well does the model describe the observed data?
Confidence levels
 The proportion of the time the true value falls between the lower and upper limits
Comprehensibility
 Of the whole model or of its parts?
Measuring Predictive Accuracy
Classification & prediction:
 error rate = incorrect / total
 requires that the evaluation set be representative
Estimators based on (predicted - actual): MAD, MSE, MAPE
 variance = sum of (predicted - actual)^2
 standard deviation = square root of variance
 distance: how far off the prediction is
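A small sketch of the error measures listed above, computed for a hypothetical set of predictions:

```python
# MAD, MSE, MAPE, plus the variance and standard deviation as defined on the slide.
predicted = [10.0, 12.0, 9.0, 15.0]   # made-up predictions
actual = [11.0, 12.0, 7.0, 14.0]      # made-up actual values

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

mad = sum(abs(e) for e in errors) / n                            # mean absolute deviation
mse = sum(e ** 2 for e in errors) / n                            # mean squared error
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n  # mean absolute percentage error
variance = sum(e ** 2 for e in errors)                           # as defined on the slide
std_dev = variance ** 0.5                                        # square root of variance
print(mad, mse, mape, std_dev)
```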
Statistics
Population: the entire group studied
Sample: a subset drawn from the population
Bias: the difference between the sample average & the population average
Related concepts: mean, median, mode; distribution; significance; correlation and regression (Hamming distance)
Classification Models
LIFT = probability in class for the sample divided by probability in class for the population
 If the population probability is 20% and the sample probability is 30%, LIFT = 0.3/0.2 = 1.5
 The best lift is not necessarily best overall; a sufficient sample size is needed as confidence increases.
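A one-line check of the lift arithmetic above (the 20% and 30% rates are the slide's example figures):

```python
# Lift = response rate in the targeted sample / response rate in the population.
population_rate = 0.20
sample_rate = 0.30
print(sample_rate / population_rate)   # 1.5
```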
Lift Chart
[Figure: lift chart plotting cumulative % responded against % mailed, from 0 to 100%, compared with the baseline of random mailing.]
Measuring Impact
Ideal: dollar impact (NPV) attributable to the expenditure
A mass mailing may still be better; it depends on:
 fixed cost
 cost per recipient
 cost per respondent
 value of a positive response
Bottom Line
Return on investment
Example Application
Telephone industry
Problem: unpaid bills
Data mining used to develop models to predict nonpayment as early as possible
See page 27
Knowledge Discovery Process
1. Data Selection
    Learning the application domain
    Creating the target data set
2. Data Preprocessing
    Data cleaning & preprocessing
3. Data Transformation
    Data reduction & projection
4. Data Mining
    Choosing the function
    Choosing the algorithms
    Data mining
5. Data Interpretation
    Interpretation
    Using the discovered knowledge
1: Business Understanding
Predict which customers will become insolvent, in time for the firm to take preventive measures (and avert losing good customers).
Hypothesis: insolvent customers change their calling habits & phone usage during a critical period before & immediately after the end of the billing period.
2: Data Understanding
Static customer information available in files
Bills, payments, usage
Used data warehouse to gather & organize
data
Coded to protect customer privacy
Creating Target Data Set
Customer files
 Customer information
 Disconnects
 Reconnections
Time-dependent data
 Bills
 Payments
 Usage
100,000 customers over a 17-month period
Stratified (hierarchical) sampling to assure all groups are appropriately represented
3: Data Preparation
Filtered out incomplete data
Deleted inexpensive calls
Reduced data volume by about 50%
Low number of fraudulent cases
Cross-checked with phone disconnects
Lagged data made synchronization necessary
Data Reduction & Projection
Information grouped by account
Customer data aggregated by 2-week periods
Discriminant analysis on 23 categories
Calculated average owed by category (significant)
Identified extra charges (significant)
Investigated payment by installments (not
significant)
Choosing Data Mining Function
Classes:
 Most probably solvent (99.3%)
 Most probably insolvent (0.7%)
Costs of error widely different
New data set created through stratified sampling (sketched below)
 Retained all insolvent cases
 Altered distribution to 90% solvent
 Used 2,066 cases in total
Critical period identified
 Last 15 two-week periods before service interruption
Variables defined by counting measures in two-week periods
 46 variables as candidate discriminant factors
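The stratified resampling step above can be sketched as follows; the function name and parameters are illustrative assumptions, not the study's actual procedure, but the 90% solvent share mirrors the slide.

```python
# Sketch of rebalancing by stratified sampling: keep every insolvent case and
# down-sample the solvent majority to roughly a 90%/10% split.
import random

def rebalance(cases, solvent_share=0.90, seed=0):
    """cases: list of (features, label) with label 'solvent' or 'insolvent'."""
    insolvent = [c for c in cases if c[1] == "insolvent"]
    solvent = [c for c in cases if c[1] == "solvent"]
    # number of solvent cases needed to reach the target share
    n_solvent = round(len(insolvent) * solvent_share / (1 - solvent_share))
    random.seed(seed)
    sample = insolvent + random.sample(solvent, min(n_solvent, len(solvent)))
    random.shuffle(sample)
    return sample

# usage with toy cases:
toy = [((i,), "insolvent") for i in range(3)] + [((i,), "solvent") for i in range(100)]
print(len(rebalance(toy)))   # 3 insolvent + 27 solvent = 30 cases
```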
4: Modeling
Discriminant analysis
 Linear model
 SPSS – stepwise forward selection
Decision trees
 Rule-based classifiers (C5, C4.5)
Neural networks
 Nonlinear model
Data Mining
Training set: about 2/3 of the data; the rest used for testing
Discriminant analysis
 Used 17 variables
 Equal costs – 0.875 correct
 Unequal costs – 0.930 correct
Rule-based classifier – 0.952 correct
Neural network – 0.929 correct
5: Evaluation
1st objective: maximize accuracy of predicting insolvent customers
 Decision tree classifier was best
2nd objective: minimize the error rate for solvent customers
 Neural network model was close to the decision tree
Used all 3 models on a case-by-case basis
Coincidence Matrix – Combined Models

                   Model insolvent  Model solvent  Unclassified  Totals
Actual insolvent         19              17             28          64
Actual solvent            1             626             27         654
Totals                   20             643             55         718
6: Implementation
Every customer examined using all 3 algorithms
 If all 3 agreed, that classification was used
 If they disagreed, the customer was categorized as unclassified
Correct on test data: 0.898
Only 1 actually solvent customer would have been disconnected
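A minimal sketch of the unanimous-vote rule described above, assuming three already-trained classifiers with a scikit-learn-style predict() method (the model names in the usage comment are hypothetical):

```python
# Combine three classifiers: accept a label only when all of them agree.
def combined_prediction(case, classifiers):
    """Return the agreed class label, or 'unclassified' if the models disagree."""
    votes = {clf.predict([case])[0] for clf in classifiers}
    return votes.pop() if len(votes) == 1 else "unclassified"

# usage (hypothetical trained models):
# label = combined_prediction(case_features,
#                             [discriminant_model, decision_tree_model, neural_net_model])
```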