
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading
1
Objectives
 Name two major types of data mining analyses.
 List techniques for supervised and unsupervised analyses.
3
Analytical Methodology
A methodology clarifies the purpose and implementation of analytics.
Define/refine business objective → Select data → Explore input data → Prepare and repair data → Transform input data → Apply analysis → Deploy models → Assess results → (back to define/refine the business objective)
4
Business Analytics and Data Mining
Data mining is a key part of effective business analytics.
Components of data mining:
 data management
 data management
 data management
 customer segmentation
 predictive modeling
 forecasting
 standard and nonstandard statistical modeling practices
5
What Is Data Mining?
 Information Technology
– Complicated database queries
 Machine Learning
– Inductive learning from examples
 Statistics
– What we were taught not to do
6
Translation for This Course
Segmentation
 Unsupervised classification
– Cluster analysis
– Association rules
– Other techniques
Predictive Modeling
 Supervised classification
– Linear regression
– Logistic regression
– Decision trees
– Other techniques
7
Customer Segmentation
Segmentation is a vague term with many meanings.
Segments can be based on the following:
 A Priori Judgment
– Alike based on business rules, not based on data analysis
 Unsupervised Classification
– Alike with respect to several attributes
 Supervised Classification
– Alike with respect to a target, defined by a set of inputs
8
Segmentation: Unsupervised Classification
[Diagram: training data before clustering (case 1: inputs, ? … case 5: inputs, ?) and after clustering (case 1: inputs, cluster 1; case 2: inputs, cluster 3; case 3: inputs, cluster 2; case 4: inputs, cluster 1; case 5: inputs, cluster 2); a new case is assigned to one of the discovered clusters.]
9
Segmentation: A Selection of Methods
 k-means clustering
 Association rules (market basket analysis)
– Barbie → Candy
– Beer → Diapers
– Peanut butter → Meat
10
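The course runs these methods in SAS Enterprise Miner; as a language-neutral sketch of what k-means actually does, here is a minimal pure-Python version. The shopper coordinates and the naive "first k points" initialization are illustrative assumptions, not the course's data.

```python
def kmeans(points, k, iters=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: centroid = mean of its cluster (keep old if empty).
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

# Two obvious spending segments: low spenders and high spenders.
shoppers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
            (8.0, 8.5), (8.3, 8.1), (7.9, 9.0)]
centroids, clusters = kmeans(shoppers, k=2)
```

On this toy data the algorithm converges in two iterations to one centroid per spending segment; production tools add smarter initialization and multiple restarts.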
Predictive Modeling: Supervised Classification
[Diagram: training data (case 1 … case 5: inputs → probability → class) is used to fit a model that assigns a probability and a class to each new case.]
11
Predictive Modeling: Supervised Classification
[Diagram: rectangular data layout — one row per case, one column per input, plus a target column.]
12
Types of Targets
 Logistic Regression
– event/no event (binary target)
– class label (multiclass problem)
 Regression
– continuous outcome
 Survival Analysis
– time-to-event (possibly censored)
13
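For the binary-target case, a minimal sketch of logistic regression fit by gradient descent may make the idea concrete. The balance/default numbers are invented for illustration; real projects would use a statistics library rather than hand-rolled optimization.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Logistic regression for a binary target with one input,
    fit by batch gradient descent. Returns (intercept, slope)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y)          # gradient w.r.t. intercept
            g1 += (p - y) * x      # gradient w.r.t. slope
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Toy binary target: larger balances are more likely to default.
balances = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
defaults = [0,   0,   0,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(balances, defaults)
prob = lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
```

The fitted model outputs a probability of the event for any input value, which is what the later decision-rule slides consume.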
Discrete Targets
 Healthcare
– Target = favorable/unfavorable outcome
 Credit Scoring
– Target = defaulted/did not default on a loan
 Marketing
– Target = purchased product A, B, C, or none
14
Continuous Targets
 Healthcare Outcomes
– Target = hospital length of stay, hospital cost
 Liquidity Management
– Target = amount of money at an ATM or in a branch vault
 Merchandise Returns
– Target = time between purchase and return (censored)
15
Application: Target Marketing
 Cases = customers, prospects, suspects, households
 Inputs = geo/demographics, psychometrics, RFM variables
 Target = response to a past or test solicitation
 Action = target high-responding segments of customers in future campaigns
16
Application: Attrition Prediction/Defection Detection
 Cases = existing customers
 Inputs = payment history, product/service usage, demographics
 Target = churn, brand-switching, cancellation, defection
 Action = customer loyalty promotion
17
Application: Fraud Detection
 Cases = past transactions or claims
 Inputs = particulars and circumstances
 Target = fraud, abuse, deception
 Action = impede or investigate suspicious cases
18
Application: Credit Scoring
 Cases = past applicants
 Inputs = application information, credit bureau reports
 Target = default, charge-off, serious delinquency, repossession, foreclosure
 Action = accept or reject future applicants for credit
19
The Fallacy of Univariate Thinking
What is the most important cause of churn?
[Graph: Prob(churn) as a joint function of international usage and daytime usage — no single input tells the whole story.]
20
A Selection of Modeling Methods
[Diagram: fitted surfaces for linear regression, logistic regression, and decision trees.]
21
Hard Target Search
[Diagrams, slides 22–23: a large field of transactions with the rare fraud cases hidden among them.]
Undercoverage
[Diagrams, slides 24–25: past credit decisions — accepted good, accepted bad, and rejected with no follow-up. Because rejected applicants are never observed, a model trained on accepted applicants undercovers the next generation of applicants.]
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading
26
Objectives
 Discuss several of the challenges of data mining and ways to address these challenges.
27
Initial Challenges in Data Mining
1. What do I want to predict?
 a transaction
 an individual
 a household
 a store
 a sales team
2. What level of granularity is needed to obtain data about the customer?
 transactional
 regional
 daily
 monthly
 other
28–30
Typical Data Mining Time Line
[Chart: projected, actual, dreaded (data acquisition), and needed allocations of the allotted time between data preparation and data analysis.]
31
Data Challenges
What identifies a unit?
32
Cracking the Code
What identifies a unit?

ID1   ID2  DATE    JOB  SEX  FIN  PRO3  CR_T  ERA
2612  624  941106  06   8    DEC   .     .    .
2613  625  940506  04   5    ETS   .     .    .
2614  626  940809  11   5    PBB   .     .    .
2615  627  941010  16   1    RVC   .     .    .
2616  628  940507  04   2    ETT   .     .    .
2617  629  940812  09   1    OFS   .     .    .
2618  630  950906  09   2    RFN  71    612   12
2618  631  951107  13   2    PBB   0    623   23
2619  632  950112  10   5    SLP   0    504   04
2620  633  950802  11   1    STL  34    611   11
2620  634  950908  06   0    DES   0    675   75
2620  635  950511  01   1    DLF   0    608   08
33
Data Challenges
What should the data look like to perform an analysis?
34
Data Arrangement
What should the data look like to perform an analysis?

Long-Narrow
Acct  Type
2133  MTG
2133  SVG
2133  CK
2653  CK
2653  SVG
3544  MTG
3544  CK
3544  MMF
3544  CD
3544  LOC

Short-Wide
Acct  CK  SVG  MMF  CD  LOC  MTG
2133   1    1    0   0    0    1
2653   1    1    0   0    0    0
3544   1    0    1   1    1    1
35
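In SAS this rearrangement is typically a transpose step; as a stdlib-Python sketch of the same long-narrow to short-wide pivot (using the slide's account/product rows):

```python
# Long-narrow: one row per (account, product) pair, as on the slide.
long_narrow = [
    (2133, "MTG"), (2133, "SVG"), (2133, "CK"),
    (2653, "CK"), (2653, "SVG"),
    (3544, "MTG"), (3544, "CK"), (3544, "MMF"), (3544, "CD"), (3544, "LOC"),
]
products = ["CK", "SVG", "MMF", "CD", "LOC", "MTG"]

# Short-wide: one row per account, one 0/1 indicator column per product.
held = {}
for acct, prod in long_narrow:
    held.setdefault(acct, set()).add(prod)
short_wide = {acct: [1 if p in prods else 0 for p in products]
              for acct, prods in held.items()}
```

Most analysis algorithms expect the short-wide arrangement: one row per unit of analysis.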
Data Challenges
What variables do I need?
36
Derived Inputs
What variables do I need?

Claim Date  Accident Date/Time  Delay  Season  Dark
11nov96     102396/12:38          19   fall     0
22dec95     012395/01:42         333   winter   1
26apr95     042395/03:05           3   spring   1
02jul94     070294/06:25           0   summer   0
08mar96     123095/18:33          69   winter   0
15dec96     061296/18:12         186   summer   0
09nov94     110594/22:14           4   fall     1
37
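A sketch of how such derived inputs can be computed from the raw timestamps. The slide does not state its darkness rule; the "before 6 a.m. or 8 p.m. and later" cutoff used here is an assumption that happens to reproduce the Dark column shown.

```python
from datetime import datetime

def derive(claim_str, accident_str):
    """Derive Delay (days), Season, and Dark from raw claim-date and
    accident-timestamp strings in the slide's formats."""
    accident = datetime.strptime(accident_str, "%m%d%y/%H:%M")
    claim = datetime.strptime(claim_str, "%d%b%y")
    delay = (claim.date() - accident.date()).days
    season = {12: "winter", 1: "winter", 2: "winter",
              3: "spring", 4: "spring", 5: "spring",
              6: "summer", 7: "summer", 8: "summer",
              9: "fall", 10: "fall", 11: "fall"}[accident.month]
    # Assumed rule: accidents before 6 a.m. or from 8 p.m. on count as dark.
    dark = 1 if accident.hour < 6 or accident.hour >= 20 else 0
    return delay, season, dark
```

Derived inputs like these often carry far more predictive signal than the raw date fields they replace.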
Data Challenges
How do I convert my data to the proper level of
granularity?
38
Roll-Up
How do I convert my data to the proper level of granularity?

Account level:
HH    Acct  Sales
4461  2133   160
4461  2244    42
4461  2773   212
4461  2653   250
4461  2801   122
4911  3544   786
5630  2496   458
5630  2635   328
6225  4244    27
6225  4165   759

Rolled up to household level:
HH    Acct  Sales
4461  2133    ?
4911  3544    ?
5630  2496    ?
6225  4244    ?
39
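One reasonable way to fill in the "?" marks — assuming the analysis wants total household sales plus an account count — is a group-by aggregation, sketched here in stdlib Python:

```python
# Account-level rows from the slide: (household, account, sales).
rows = [
    (4461, 2133, 160), (4461, 2244, 42), (4461, 2773, 212),
    (4461, 2653, 250), (4461, 2801, 122),
    (4911, 3544, 786),
    (5630, 2496, 458), (5630, 2635, 328),
    (6225, 4244, 27), (6225, 4165, 759),
]

# Roll up to household level: count accounts and total the sales.
household = {}
for hh, acct, sales in rows:
    n, total = household.get(hh, (0, 0))
    household[hh] = (n + 1, total + sales)
```

With the slide's numbers, every household happens to total 786 in sales, whether it holds one account or five — a detail that only becomes visible after rolling up.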
Rolling Up Longitudinal Data
How do I convert my data to the proper level of granularity?

Flier  Month  Mileage  Frequent Flying VIP Member
10621  Jan     650     No
10621  Feb       0     No
10621  Mar       0     No
10621  Apr     250     No
33855  Jan     350     No
33855  Feb     300     No
33855  Mar    1200     Yes
33855  Apr     850     Yes
40
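Rolling up longitudinal data usually means pivoting the repeated monthly rows into one row per flier, with one mileage column per month — a sketch:

```python
# One row per (flier, month), as on the slide.
trips = [
    (10621, "Jan", 650), (10621, "Feb", 0),
    (10621, "Mar", 0), (10621, "Apr", 250),
    (33855, "Jan", 350), (33855, "Feb", 300),
    (33855, "Mar", 1200), (33855, "Apr", 850),
]
months = ["Jan", "Feb", "Mar", "Apr"]

# Pivot to one row per flier with a mileage value per month.
by_flier = {}
for flier, month, miles in trips:
    by_flier.setdefault(flier, dict.fromkeys(months, 0))[month] = miles
wide = {f: [m[mo] for mo in months] for f, m in by_flier.items()}
```

The wide row preserves the monthly trend (flier 33855's mileage is climbing), which a simple total would hide.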
Data Challenges
What sorts of raw data quality problems can I expect?
41
Errors, Outliers, and Missings
What sorts of raw data quality problems can I expect?
[Table: a checking/savings extract with columns cking, #cking, ADB, NSF, dirdep, SVG, and svg bal. The extract shows inconsistent coding (a lowercase "y" among the "Y" flags), missing values (.), and suspicious outliers such as an ADB of 89,981.12 and a savings balance of 45,662.]
42
Missing Value Imputation
What sorts of raw data quality problems can I expect?
[Diagram: a cases-by-inputs data matrix with missing cells (?) scattered across rows and columns.]
43
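A minimal sketch of one common repair — median imputation with a companion flag column, so downstream models can still see that a value was originally missing. The balance values are invented for illustration.

```python
# Toy column with missing checking-balance values (None = missing).
adb = [468.11, 68.75, 212.04, None, 585.05, 47.69, None]

# Median of the observed values only.
observed = sorted(v for v in adb if v is not None)
median = observed[len(observed) // 2]

# Impute, and flag which rows were imputed.
imputed = [v if v is not None else median for v in adb]
was_missing = [1 if v is None else 0 for v in adb]
```

The flag column matters because "was missing" is itself often predictive, and imputation should not erase that information.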
Data Challenges
Can I (more importantly, should I) analyze all the data
that I have?
All the observations?
All the variables?
44
Massive Data
Can I (more importantly, should I) analyze all the data that I have?

           Bytes  Paper
Kilobyte   2^10   ½ sheet
Megabyte   2^20   1 ream
Gigabyte   2^30   167 feet
Terabyte   2^40   32 miles
Petabyte   2^50   32,000 miles
45
Sampling
Can I (more importantly, should I) analyze all the data
that I have?
46
Oversampling
Can I (more importantly, should I) analyze all the data that I have?
[Diagram: a sample in which the rare Fraud cases are overrepresented relative to the OK cases.]
47
The Curse of Dimensionality
Can I (more importantly, should I) analyze all the data that I have?
[Diagram: the same number of points spread over 1-D, 2-D, and 3-D spaces — the data grows sparse as dimensions are added.]
48
Dimension Reduction
Can I (more importantly, should I) analyze all the data that I have?
[Diagram: E(Target) plotted against Input1 and Input3, illustrating redundancy and irrelevancy among inputs.]
49
Catalog Case Study
Analysis goal:
A mail-order catalog retailer wants to save money on
mailing and increase revenue by targeting mailed catalogs
to customers who are most likely to purchase in the future.
Data set: CATALOG
Number of rows: 48,356
Number of columns: 98
Contents: sales figures summarized
across departments and quarterly totals
for 5.5 years of sales
Targets: RESPOND (binary)
ORDERSIZE (continuous)
50
Catalog Case Study: Basics
Throughout this chapter, you work with data in
SAS Enterprise Miner to perform exploratory analysis.
1. Import the CATALOG data.
2. Identify the target variables.
3. Define and transform the variables for use in RFM
analysis.
4. Perform graphical RFM analysis in SAS Enterprise
Miner.
Later, you use the CATALOG data for predictive modeling
and scoring.
51
Accessing and Importing Data for Modeling
First, get familiar with the data!
The data file is a SAS data set.
1. Create a project in SAS Enterprise Miner.
2. Create a diagram.
3. Locate and import the CATALOG data.
4. Define characteristics of the data set, such as the
variable roles and measurement levels.
5. Perform a basic exploratory analysis of the data.
52
Defining a Data Source
[Diagram: the Catalog data in a SAS Foundation Server library (ABA1) is registered through a metadata definition.]
53
Metadata Definition
Select a table.
Set the metadata information.
Three purposes for metadata:
 Define variable roles (input, target, ID, etc.).
 Define measurement levels (binary, interval, nominal, etc.).
 Define table role (raw data, transactional data, scoring data, etc.).
54
Creating Projects
and Diagrams in
SAS Enterprise Miner
Catalog Case Study
Task: Create a project and a diagram in
SAS Enterprise Miner.
55
Defining a Data Source
Catalog Case Study
Task: Define the CATALOG data source in
SAS Enterprise Miner.
56
Defining Column Metadata
Catalog Case Study
Task: Define column metadata.
57
Changing the Explore Window
Sampling Defaults and
Exploring a Data Source
Catalog Case Study
Tasks: Change preference settings in
the Explore window and explore
variable associations.
58
IDEA EXCHANGE
Consider an academic retention example. Freshmen
enter a university in the fall term, and some of them drop
out before the second term begins. Your job is to try to
predict whether a student is likely to drop out after the first
term.
What kinds of variables would you consider using to
assess this question?
Continued…
59
IDEA EXCHANGE
As an administrator, do you have this information? Could
you obtain it? What kinds of data quality issues do you
anticipate?
Are there any ethical considerations in accessing the
information in your study?
Continued…
60
IDEA EXCHANGE
How does time factor into your data collection? Do
inferences about students five years ago apply to
students today? How do changes in technology,
university policies, and teaching trends affect your
conclusions?
61
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading
62
Objectives
 Explain what is meant by a model giving the best prediction.
 Describe data splitting.
 Discuss the advantages of using honest assessment to evaluate a model and obtain the model with the best prediction.
63
Predictive Modeling Implementation
 Model Selection and Comparison
– Which model gives the best prediction?
 Decision/Allocation Rule
– What actions should be taken on new cases?
 Deployment
– How can the predictions be applied to new cases?
64
Getting the “Best” Prediction: Fool’s Gold
My model fits the
training data perfectly...
I’ve struck it rich!
66
Model Complexity
[Diagrams, slides 67–70: the same data fit with increasing flexibility — a model that is too flexible chases individual points, while a "just right" model follows the general pattern.]
Data Splitting and Honest Assessment
71
Overfitting
[Diagram: an overfit model matches the training set almost perfectly but fits the test set poorly.]
72
Better Fitting
[Diagram: a smoother model fits the training set less exactly but holds up on the test set.]
73
Predictive Modeling Implementation
 Model Selection and Comparison
– Which model gives the best prediction?
 Decision/Allocation Rule
– What actions should be taken on new cases?
 Deployment
– How can the predictions be applied to new cases?
74
Decisions, Decisions
Three cutoffs applied to the same predictions (1,000 cases, 100 events):

Cutoff .08:  Actual 0: 360 predicted 0, 540 predicted 1
             Actual 1:  20 predicted 0,  80 predicted 1
             Accuracy 44%, Sensitivity 80%, Lift 1.3
Cutoff .10:  Actual 0: 540 predicted 0, 360 predicted 1
             Actual 1:  40 predicted 0,  60 predicted 1
             Accuracy 60%, Sensitivity 60%, Lift 1.4
Cutoff .12:  Actual 0: 720 predicted 0, 180 predicted 1
             Actual 1:  60 predicted 0,  40 predicted 1
             Accuracy 76%, Sensitivity 40%, Lift 1.8
75
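The statistics above follow mechanically from each 2×2 table; a small sketch makes the definitions explicit (using the conventional formulas for accuracy, sensitivity, and lift against the 10% base event rate):

```python
def classification_summary(tn, fp, fn, tp):
    """Accuracy, sensitivity, and lift from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    sensitivity = tp / (fn + tp)            # share of events caught
    base_rate = (fn + tp) / total           # overall event rate
    lift = (tp / (fp + tp)) / base_rate     # response rate among
    return accuracy, sensitivity, lift      # predicted-1 vs. base rate

# First cutoff from the slide (1,000 cases, 100 events).
acc, sens, lift = classification_summary(tn=360, fp=540, fn=20, tp=80)
```

Note how the three cutoffs trade sensitivity against accuracy and lift; no single cutoff dominates.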
Misclassification Costs

                 Predicted Class
                 0 (Accept)   1 (Deny)
Actual  0        True Neg     False Pos
        1        False Neg    True Pos

Cost matrix:
                 Accept       Deny
OK                 0            1
Fraud              9            0
76
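With this cost matrix, the rational decision rule picks whichever action has the lower expected cost — a sketch:

```python
def decide(p_fraud, cost_accept_fraud=9, cost_deny_ok=1):
    """Choose the action with the lower expected misclassification cost.
    Expected cost of accepting = p * 9; of denying = (1 - p) * 1."""
    expected_accept = p_fraud * cost_accept_fraud
    expected_deny = (1 - p_fraud) * cost_deny_ok
    return "deny" if expected_deny < expected_accept else "accept"
```

Setting 9p = 1 − p gives a break-even probability of 0.10: with these costs it pays to deny any case whose fraud probability exceeds 10%, far below the 50% cutoff that cost-blind classification would use.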
Predictive Modeling Implementation
 Model Selection and Comparison
– Which model gives the best prediction?
 Decision/Allocation Rule
– What actions should be taken on new cases?
 Deployment
– How can the predictions be applied to new cases?
77
Scoring
[Diagram: model development produces a model; model deployment applies it to score new data.]
78
Scoring Recipe
 The model results in a formula or rules.
 The data requires modifications.
– Derived inputs
– Transformations
– Missing value imputation
 The scoring code is deployed.
– To score, you do not re-run the algorithm; apply score code (equations) obtained from the final model to the scoring data.
79
Scorability
[Diagram: a decision tree classifier trained on (x1, x2) data partitions the input space; a new case is scored by the region it falls into.]
Scoring Code
If x1<.47 and x2<.18, or x1>.47 and x2>.29, then red.
80
Scoring Pitfalls: Population Drift
[Timeline: data generated → data acquired → data cleaned → data analyzed → model deployed. By deployment time, the population may have drifted from the one that generated the data.]
81
The Secret to Better Predictions
[Diagrams, slides 82–84: Fraud and OK transactions plotted against transaction amount alone overlap heavily; adding a second, more informative input ("Cheatin' Heart") separates the classes.]
IDEA EXCHANGE
Think of everything you have done in the past week. What
transactions or actions created data? For example, point
of sale transactions, internet activity, surveillance, and
questionnaires are all data collection avenues that many
people encounter daily.
 How do you think that the data about you will be used?
 How could models be deployed that use data about
you?
85
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading
86
Objectives
 Describe a methodology for implementing business analytics through data mining.
 Discuss each of the steps, with examples, in the methodology.
 Create a project and diagram in SAS Enterprise Miner.
87
Methodology
Data mining is not a linear process but a cycle, where later results can lead back to previous steps.
Define/refine business objective → Select data → Explore input data → Prepare and repair data → Transform input data → Apply analysis → Deploy models → Assess results → (back to define/refine the business objective)
88
Why Have a Methodology?
 To avoid learning things that are not true
 To avoid learning things that are not useful
– results that arise from past marketing decisions
– results that you already know
– results that you should already know
– results that you are not allowed to use
 To create stable models
 To avoid making the mistakes that you made in the past
 To develop useful tips from what you learned
89
Methodology
1. Define the business objective and state it as a data mining task.
[Methodology cycle diagram]
90
1) Define the Business Objective
 Improve the response rate for a direct marketing campaign.
 Increase the average order size.
 Determine what drives customer acquisition.
 Forecast the size of the customer base in the future.
 Choose the right message for the right groups of customers.
 Target a marketing campaign to maximize incremental value.
 Recommend the next, best product for existing customers.
 Segment customers by behavior.
A lot of good statistical analysis is directed at solving the wrong business problem.
91
Define the Business Goal
Example: Who is the yogurt lover?
What is a yogurt lover?
 One answer prints coupons at the cash register.
 Another answer mails coupons to people's homes.
 Another results in advertising.
92
Big Challenge: Defining a Yogurt Lover
[Chart: $$ spent on yogurt (low/medium/high) versus yogurt as % of all purchases (low/medium/high).]
"Yogurt lover" is not in the data. You can impute it, using business rules:
 Yogurt lovers spend a lot of money on yogurt.
 Yogurt lovers spend a relatively large amount of their shopping dollars on yogurt.
93
Next Challenge: Profile the Yogurt Lover
You have identified a segment of customers that you believe are yogurt lovers.
But who are they? How would I know them in the store?
 Identify them by demographic data.
 Identify them by other things that they purchase (for example, yogurt lovers are people who buy nutrition bars and sports drinks).
What action can I take?
 Set up "yogurt-lover-attracting" displays.
94
IDEA EXCHANGE
If a customer is identified as a yogurt lover, what action
should be taken? Should you give yogurt coupons, even
though these individuals will buy yogurt anyway? Is there
a cross-sell opportunity? Is there an opportunity to identify
potential yogurt lovers? What would you do?
95
Profiling in the Extreme: Best Buy
Using analytical methodology, electronics retailer
Best Buy discovered that a small percentage of
customers accounted for a large percentage of revenue.
Over the past several years, the company has adopted a
customer-centric approach to store design and flow,
staffing, and even corporate acquisitions such as the
Geek Squad support team.
The company’s largest competitor has gone bankrupt
while Best Buy has seen growth in market share.
See Gulati (2010)
96
Define the Business Objective
What is the business objective?
Example: Telco Churn
Initial problem: Assign a churn score to all customers.
 Recent customers with little call history
 Telephones? Individuals? Families?
 Voluntary churn versus involuntary churn
How will the results be used?
Better objective: By September 24, provide a list of the 10,000 elite customers who are most likely to churn in October.
The new objective is actionable.
97
Define the Business Objective
Example: Credit Churn
How do you define the target? When did a customer leave?
 When she has not made a new charge in six months?
 When she had a zero balance for three months?
 When the balance does not support the cost of carrying the customer?
 When she cancels her card?
 When the contract ends?
[Chart: attrition rate (0.0%–3.0%) by tenure in months (0–15).]
98
Translate Business Objectives into Data Mining Tasks
Do you already know the answer?
In supervised data mining, the data has examples of what you are looking for, such as the following:
 customers who responded in the past
 customers who stopped
 transactions identified as fraud
In unsupervised data mining, you are looking for new patterns, associations, and ideas.
99
Data Mining Tasks Lead to Specific Techniques
Objectives: Customer Acquisition, Credit Risk, Pricing, Customer Churn, Fraud Detection, Discovery, Customer Value
Tasks: Exploratory Data Analysis, Binary Response Modeling, Multiple Response Modeling, Estimation, Forecasting, Detecting Outliers, Pattern Detection
Techniques: Decision Trees, Neural Networks, Regression, Survival Analysis, Association Rules, Link Analysis, Hypothesis Testing, Visualization, Clustering
100–102
Data Analysis Is Pattern Detection
Patterns might not represent any underlying rule.
Some patterns reflect some underlying reality.
 The party that holds the White House tends to lose seats in Congress during off-year elections.
Others do not.
 When the American League wins the World Series in Major League Baseball, Republicans take the White House.
 Stars cluster in constellations.
Sometimes, it is difficult to tell without analysis.
 In U.S. presidential contests, the taller candidate usually wins.
103
Example: Maximizing Donations
Example from the KDD Cup, a data mining competition associated with the KDD Conference (www.sigkdd.org):
 Purpose: Maximize profit for a charity fundraising campaign.
 Tested on actual results from the mailing (using data withheld from competitors).
Competitors took multiple approaches to the modeling:
 Modeling who will respond
 Modeling how much people will give
 Perhaps more esoteric approaches
However, the top three winners all took the same approach (although they used different techniques, methods, and software).
104
The Winning Approach: Expected Revenue
Task: Estimate response_person, the probability that a person responds to the mailing (all customers).
Task: Estimate the value of a response, dollars_person (only customers who respond).
Choose prospects with the highest expected value, response_person × dollars_person.
105
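The two-stage scoring above reduces to a product once both models are fitted; a sketch with invented per-prospect scores (the probabilities and gift sizes are illustrative, not KDD Cup data):

```python
# Two fitted scores per prospect: probability of responding, and
# expected gift size given a response (hypothetical values).
prospects = {
    "A": (0.05, 50.0),   # rare responder, large gifts
    "B": (0.20, 10.0),   # frequent responder, small gifts
    "C": (0.02, 40.0),
}

# Expected revenue = response probability x expected gift.
expected = {name: p * dollars for name, (p, dollars) in prospects.items()}
ranked = sorted(expected, key=expected.get, reverse=True)
```

Prospect A outranks B here even though B responds four times as often — exactly the trade-off the frequent-donor pattern on the next slide describes.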
An Unexpected Pattern
An unexpected pattern suggests an approach: when people give money frequently, they tend to donate less money each time.
 In most business applications, as people take an action more often, they spend more money.
 Donors to a charity are different.
This suggests that potential donors go through a two-step process:
 Shall I respond to this mailing?
 How much money should I give this time?
Modeling can follow the same logic.
106
Methodology
2. Select or collect the appropriate data to address the problem. Identify the customer signature.
[Methodology cycle diagram]
107
2) Select Appropriate Data
 What is available?
 What is the right level of granularity?
 How much data is needed?
 How much history is required?
 How many variables should be used?
 What must the data contain?
Assemble results into customer signatures.
108
Representativeness of the Training Sample
The model set might not reflect the relevant population.
 Customers differ from prospects.
 Survey responders differ from non-responders.
 People who read e-mail differ from people who do not read e-mail.
 Customers who started three years ago might differ from customers who started three months ago.
 People with land lines differ from those without.
109
Availability of Relevant Data
Elevated printing defect rates might be due to humidity,
but that information is not in press run records.
Poor coverage might be the number one reason for
wireless subscribers canceling their subscriptions, but
data about dropped calls is not in billing data.
Customers might already have potential cross-sell
products from other companies, but that information is not
available internally.
110
Types of Attributes in Data
Readily Supported:
 Binary
 Categorical (nominal)
 Numeric (interval)
 Date and time
Require More Work:
 Text
 Image
 Video
 Links
111
IDEA EXCHANGE
Suppose that you were in charge of a charity similar to the
KDD example above. What kind of data are you likely to
have available before beginning the project? Is there
additional data that you would need?
Do you have to purchase the data, or is it publicly
available for free? How could you make the best use of a
limited budget to acquire high quality data about individual
donation patterns?
112
The Customer Signature
The primary
key uniquely
identifies each
row, often
corresponding
to customer ID.
The target
A foreign key
columns are
gives access to
what you are
data in another
looking for.
table, such as
Sometimes, the
ZIP code
information is in
demographics.
multiple columns,
such as a churn flag
and churn date.
Some columns
are ignored
because the
values are not
predictive or they
contain future
information, or for
other reasons.
Each row generally corresponds to a customer.
113
Data Assembly Operations
 Copying
 Pivoting
 Table lookup
 Derivation of new variables
 Summarization of values from data
 Aggregation
114
Methodology
3. Explore the data. Look for anomalies. Consider time-dependent variables. Identify key relationships among variables.
[Methodology cycle diagram]
115
3) Explore the Data
Examine distributions.
 Study histograms.
 Think about extreme values.
 Notice the prevalence of missing values.
Compare values with descriptions.
Validate assumptions.
Ask many questions.
116
Ask Many Questions
 Why were some customers active for 31 days in February, but none were active for more than 28 days in January?
 How do some retail card holders spend more than $100,000 in a week in a grocery store?
 Why were so many customers born in 1911? Are they really that old?
 Why do Safari users never make second purchases?
 What does it mean when the contract begin date is after the contract end date?
 Why are there negative numbers in the sale price field?
 How can active customers have a non-null value in the cancellation reason code field?
117
Be Wary of Changes over Time
Does the same code have the same meaning in historical data?
Did different data elements start being loaded at different points in time?
Did something happen at a particular point in time?
[Chart: price-related cancellations by month, May through June of the following year — after a price increase, the "price complaint" code stops being recorded.]
118
Methodology
4. Prepare and repair the data. Define metadata correctly. Partition the data and create balanced samples, if necessary.
[Methodology cycle diagram]
119
4) Prepare and Repair the Data
 Set up a proper temporal relationship between the target variable and inputs.
 Create a balanced sample, if possible.
 Include multiple time frames if necessary.
 Split the data into Training, Validation, and (optionally) Test data sets.
120
Temporal Relationship: Prediction or Profiling?
The same techniques work for both.
In a predictive model, values of explanatory variables are from an earlier time frame than the target variable.
In a profiling model, the explanatory variables and the target variable might all be from the same time frame.
121
Balancing the Input Data Set
A very accurate model simply predicts that no one wants a brokerage account:
 98.8% accurate
 1.2% error rate
This is useless for differentiating among customers.
Distribution of Brokerage Target Variable:
Brokerage = "Y":   2,355
Brokerage = "N": 228,926
122
Two Ways to Create Balanced Data
123
Data Splitting and Validation
Improving the model causes the error rate to decline on the data used to build it. At the same time, the model becomes more complex.
[Chart: error rate falling as models grow more complex.]
124
Validation Data Prevents Overfitting
[Chart: error rate versus model complexity for training and validation data — training error keeps falling into the noise, while validation error turns back up past the "sweet spot" where the model has captured the signal.]
125
Partitioning the Input Data Set
 Training: Use the training set to find patterns and create an initial set of candidate models.
 Validation: Use the validation set to select the best model from the candidate set of models.
 Test: Use the test set to measure performance of the selected model on unseen data. The test set can be an out-of-time sample of the data, if necessary.
Partitioning data is an allowable luxury because data mining assumes a large amount of data.
Test sets do not help select the final model; they only provide an estimate of the model's effectiveness in the population. Test sets are not always used.
126
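SAS Enterprise Miner performs this split in its Data Partition node; a stdlib-Python sketch of the same idea, with an assumed 60/30/10 allocation:

```python
import random

def partition(cases, weights=(0.6, 0.3, 0.1), seed=12345):
    """Randomly split cases into training / validation / test sets."""
    rng = random.Random(seed)       # fixed seed for a repeatable split
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(weights[0] * n)
    n_valid = int(weights[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train, valid, test = partition(list(range(1000)))
```

Fixing the random seed keeps the partition reproducible, so every candidate model is compared on exactly the same validation cases.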
Fix Problems with the Data
Data imperfectly describes the features of the real world.
 Data might be missing or empty.
 Samples might not be representative.
 Categorical variables might have too many values.
 Numeric variables might have unusual distributions and outliers.
 Meanings can change over time.
 Data might be coded inconsistently.
127
No Easy Fix for Missing Values
Throw out the records with missing values?
 No. This creates a bias in the sample.
Replace missing values with a "special" value (-99)?
 No. This resembles any other value to a data mining algorithm.
Replace with some "typical" value?
 Maybe. Replacement with the mean, median, or mode changes the distribution, but predictions might be fine.
Impute a value? (Imputed values should be flagged.)
 Maybe. Use the distribution of values to randomly choose a value.
 Maybe. Model the imputed value using some technique.
Use data mining techniques that can handle missing values?
 Yes. One of these, decision trees, is discussed.
Partition records and build multiple models?
 Yes. This action is possible when data is missing for a canonical reason, such as insufficient history.
128
Methodology
5. Transform data. Standardize, bin, combine, replace, impute, log, etc.
[Methodology cycle diagram]
129
5) Transform Data
 Standardize values into z-scores.
 Turn counts into percentages.
 Remove outliers.
 Capture trends with ratios, differences, or beta values.
 Combine variables to bring information to the surface.
 Replace categorical variables with some numeric function of the categorical values.
 Impute missing values.
 Transform using mathematical functions, such as logs.
 Translate dates to durations.
Example: Body Mass Index (kg/m²) is a better predictor of diabetes than either variable separately.
130
A Selection of Transformations
Standardize numeric values.
 All numeric values are replaced by the notion of "how far is this value from the average."
 Conceptually, all numeric values are in the same range. (The actual range differs, but the meaning is the same.)
 Although it sometimes has no effect on the results (such as for decision trees and regression), it never produces worse results.
 Standardization is so useful that it is often built into SAS Enterprise Miner modeling nodes.
131
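The z-score transformation described above is a one-liner once the mean and standard deviation are known — a sketch with invented balance values:

```python
def standardize(values):
    """Replace each value with 'how far is this value from the
    average,' measured in standard deviations (a z-score)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    sd = var ** 0.5
    return [(v - mean) / sd for v in values]

balances = [100.0, 200.0, 300.0, 400.0, 500.0]
z = standardize(balances)
```

After standardization the column has mean 0 and standard deviation 1, so inputs measured in dollars, counts, and days all live on a comparable scale.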
A Selection of Transformations
"Stretching" and "squishing" transformations
 Log, reciprocal, and square root are examples.
Replace categorical values with appropriate numeric values.
 Many techniques work better with numeric values than with categorical values.
 Historical projections (such as handset churn rate or penetration by ZIP code) are particularly useful.
132
IDEA EXCHANGE
What are some other warning signs you can think of in
modeling? Have you experienced any pitfalls that were
memorable, or that changed the way you approach the
data analysis objectives?
133
Methodology
6. Apply analysis. Fit many candidate models, try different solutions, try different sets of input variables, and select the best model.
[Methodology cycle diagram]
134
6) Apply Analysis
 Regression
 Decision trees
 Cluster detection
 Association rules
 Neural networks
 Memory-based reasoning
 Survival analysis
 Link analysis
 Genetic algorithms
135
Train Models
[Diagram: many inputs feed candidate Models 1, 2, and 3; each model produces an output.]
Build candidate models by applying a data mining technique (or techniques) to the training data.
136
Assess Models
[Diagram: the same candidate Models 1, 2, and 3 applied to held-out inputs; each model produces an output.]
Assess models by applying the models to the validation data set.
137
Assess Models
Score the validation data using the candidate models and
then compare the results. Select the model with the best
performance on the validation data set.
Communicate model assessments through
 quantitative measures
 graphs.
138
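The select-the-champion step above can be sketched as follows; the candidate "models" here are stand-in functions (as if already fit on the training data) and the validation cases are hypothetical:

```python
def mse(model, data):
    """Mean squared error of a model's predictions on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Stand-ins for candidate models already fit on the training data
candidates = {
    "model_1": lambda x: 2 * x,       # e.g. a simple regression
    "model_2": lambda x: x ** 2,      # e.g. a more flexible fit
    "model_3": lambda x: 2 * x + 1,
}

# Held-out validation cases the models never saw during training
validation = [(1, 2.1), (2, 3.9), (3, 6.2)]

# Score every candidate on the validation data, then pick the
# one with the lowest validation error as the champion.
scores = {name: mse(m, validation) for name, m in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # model_1
```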
Look for Warnings in Models
Trailing Indicators: Learning Things That Are Not True
What happens in month 8?
[Chart: Minutes of Use (0 to 120) by Tenure (Months 1 to 11); usage declines at month 8.]
Does declining usage in month 8 predict attrition in month 9?
139
Look for Warnings in Models
Perfect Models: Things that are too good to be true.
100% of customers who spoke to a customer support
representative cancelled a contract.
Eureka! It’s all I need to know!
 If a customer cancels, the customer is automatically flagged to get a call from customer support.
 The information is useless in predicting cancellation.
Models that seem too good usually are.
140
Methodology
7. Deploy models. Score new observations and make model-based decisions. Gather the results of model deployment.
141
7) Deploy Models and Score New Data
142
Methodology
8. Assess the usefulness of the model. If the model has gone stale, revise it.
143
8) Assess Results
 Compare actual results against expectations.
 Compare the challenger's results against the champion's.
 Did the model find the right people?
 Did the action affect their behavior?
 What are the characteristics of the customers most affected by the intervention?
144
Good Test Design Measures the Impact of Both the Message and the Model

                        Message: YES               Message: NO
Picked by Model: YES    Target Group               Modeled Holdout
                        Chosen by model;           Chosen by model;
                        receives message.          receives no message.
                        Response measures          Response measures
                        message with model.        model without message.
Picked by Model: NO     Control Group              Holdout Group
                        Chosen at random;          Chosen at random;
                        receives message.          receives no message.
                        Response measures          Response measures
                        message without model.     background response.

Comparing the Target Group with the Control Group measures the impact of the model on the group getting the message. Comparing the Target Group with the Modeled Holdout measures the impact of the message on the group with good model scores.
145
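With all four cells measured, the two comparisons fall out as simple differences; the response rates below are hypothetical:

```python
# Hypothetical response rates for the four test-design cells
rates = {
    "target":          0.070,  # chosen by model, receives message
    "control":         0.020,  # chosen at random, receives message
    "modeled_holdout": 0.015,  # chosen by model, no message
    "holdout":         0.010,  # chosen at random, no message (background)
}

# Impact of the model on the group getting the message:
# Target Group vs. Control Group
model_impact = rates["target"] - rates["control"]

# Impact of the message on the group with good model scores:
# Target Group vs. Modeled Holdout
message_impact = rates["target"] - rates["modeled_holdout"]

print(round(model_impact, 3), round(message_impact, 3))  # 0.05 0.055
```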
Test Mailing Results
E-mail campaign test results
 lift 3.5
[Bar chart: response rate (0 to 0.8) for the Target, Control, and Holdout groups; the Target Group responds at the highest rate.]
146
Methodology
9. As you learn from earlier model results, refine the
business goals to gain more from the data.
147
9) Begin Again
Revisit business objectives.
Define new objectives.
Gather and evaluate new data.
 model scores
 cluster assignments
 responses
Example:
A model discovers that geography is a good predictor
of churn.
 What do the high-churn geographies have in
common?
 Is the pattern your model discovered stable over time?
148
Lessons Learned
Data miners must be careful to avoid pitfalls.
 Learning things that are not true or not useful
 Confusing signal and noise
 Creating unstable models
A methodology is a way of being careful.
149
IDEA EXCHANGE
Outline a business objective of your own in terms of the
methodology described here.
What is your business objective? Can you frame it in
terms of a data mining problem? How will you select the
data? What are the inputs? What do you want to look at to
get familiar with the data?
Continued…
150
IDEA EXCHANGE
Anticipate any data quality problems you might encounter
and how you could go about fixing them.
Do any variables require transformation?
Proceed through the remaining steps of the methodology
as you consider your example.
151
Basic Data Modeling
A common approach to modeling customer value is RFM
analysis, so named because it uses three key variables:
 Recency – how long it has been since the customer’s
last purchase
 Frequency – how many times the customer has
purchased something
 Monetary value – how much money the customer has
spent
RFM variables tend to predict responses to marketing
campaigns effectively.
152
RFM Cell Approach
[Diagram: RFM cells pictured as a cube with Recency, Frequency, and Monetary value as its three axes.]
153
RFM Cell Approach
A typical approach to RFM analysis is to bin customers into (roughly) equal-sized groups on each of the rank-ordered R, F, and M variables. For example:
 Bin five groups on R (highest bin = most recent).
 Bin five groups on F (highest bin = most frequent).
 Bin five groups on M (highest bin = highest value).
The combination of the bins gives an RFM "score" that can be compared to some target or outcome variable. A customer score of 555 means the most recent quintile, most frequent quintile, and highest spending quintile.
154
Computing Profitability in RFM
Break-even response rate = the cost of the promotion per dollar of net profit:
Cost of promotion to an individual ÷ Average net profit per sale
Example: It costs $2.00 to print and mail each catalog. Average net profit per transaction is $30.
2.00/30.00 = 0.067
Profitable RFM cells are those with a response rate greater than 6.7%.
155
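The break-even calculation is a one-liner; this sketch just wraps the catalog arithmetic from the slide:

```python
def break_even_rate(cost_per_contact, net_profit_per_sale):
    """Break-even response rate: the response rate at which a
    promotion exactly pays for itself."""
    return cost_per_contact / net_profit_per_sale

# The catalog example: $2.00 per catalog, $30 net profit per sale
rate = break_even_rate(2.00, 30.00)
print(round(rate, 3))  # 0.067 -> cells responding above 6.7% are profitable
```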
RFM Analysis of the Catalog Data
 Recode recency so that the highest values are the most recent.
 Bin the R, F, and M variables into 5 groups each, numbered 1-5, so that 1 is the least valuable and 5 is the most valuable bin.
 Concatenate the RFM variables to obtain a single RFM "score."
 Graphically investigate the response rates for the different groups.
156
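The binning and concatenation steps above can be sketched in plain Python; the customer values are hypothetical, and `quintile` here assigns bins by rank, a rough stand-in for equal-sized quantile binning:

```python
def quintile(values):
    """Assign each value a bin 1-5 by rank, 5 = most valuable."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(values) + 1
    return bins

def rfm_scores(recency, frequency, monetary):
    """Concatenate the R, F, and M quintiles into a score like 555.
    Assumes recency is already recoded so higher = more recent."""
    r, f, m = quintile(recency), quintile(frequency), quintile(monetary)
    return [100 * a + 10 * b + c for a, b, c in zip(r, f, m)]

# Hypothetical customers (already recoded so bigger is better)
recency   = [10, 2, 8, 5, 1]
frequency = [ 4, 1, 9, 2, 3]
monetary  = [90, 5, 70, 30, 10]
print(rfm_scores(recency, frequency, monetary))  # [545, 211, 454, 323, 132]
```

Each resulting score (545, 211, ...) identifies an RFM cell whose observed response rate can then be compared against the break-even rate.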
Performing RFM Analysis of
the Catalog Data
Catalog Case Study
Task: Perform RFM analysis on the
catalog data.
157
Performing Graphical RFM
Analysis
Catalog Case Study
Task: Perform graphical RFM analysis.
158
Limitations of RFM
Only uses three variables
 Modern data collection processes offer rich
information about preferences, behaviors, attitudes,
and demographics.
Scores are entirely categorical
 515, 551, and 155 are equally good if the RFM variables are of equal importance.
 Sorting by the RFM values is not informative and
overemphasizes recency.
So many categories
 The simple example above results in 125 (5 × 5 × 5) groups.
Not very useful for finding prospective customers
 Statistics are descriptive.
159
IDEA EXCHANGE
Would RFM analysis apply to a business objective you
are considering? If so, what would be your R, F, and M
variables?
What other basic analytical techniques could you use to
explore your data and get preliminary answers to your
questions?
160
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading
161
Recommended Reading
Davenport, Thomas H., Jeanne G. Harris, and Robert
Morison. 2010. Analytics at Work: Smarter Decisions,
Better Results. Boston: Harvard Business Press.
 Chapters 2 through 6, DELTA method, optional
162