The Process of Data Mining


Data Mining Methodology
Why have a Methodology

Don’t want to learn things that aren’t true
 May not represent any underlying reality
○ Spurious correlation
○ May not be statistically significant or may be
statistically significant but coincidental
○ Because data mining makes fewer assumptions
about the data and searches through a richer
hypothesis space, this is a big issue
 Model overfitting is an issue
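The risk of coincidental findings when searching a large hypothesis space can be illustrated with a small simulation. This is only a sketch under stated assumptions: the target and all features are pure random noise, and the helper names (`pearson`, `best_chance_correlation`) are hypothetical. Since nothing here is truly related, any strong correlation found is spurious.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_chance_correlation(n_rows, n_features, seed=0):
    """Largest |r| between a random target and purely random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best

# Same seed, so the first candidate feature is identical in both runs.
few = best_chance_correlation(n_rows=20, n_features=1)
many = best_chance_correlation(n_rows=20, n_features=1000)
print(f"best |r| when testing 1 feature:     {few:.2f}")
print(f"best |r| when testing 1000 features: {many:.2f}")
```

The more candidate features are searched, the stronger the best correlation looks, even though every feature is noise.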
Why have a Methodology II

Data may not reflect relevant population
 Data mining normally assumes the training data
has the same distribution as the test and score data

Quick overview of how data is used for DM
○ Training set used to build the model
○ Validation set used to tune the model or select amongst
alternative models
○ Test set used to evaluate the model & report quality
 For prediction tasks, test set must have the “answer”
○ Model is eventually applied to the score set, which for
predictive tasks does not have the answer
○ Evaluation must always occur on data not used to
build, tune, or select the model
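The three-way split of labeled data described above can be sketched as follows. The 60/20/20 proportions, the function name `split_labeled_data`, and the toy data are assumptions for illustration, not values from the slides.

```python
import random

def split_labeled_data(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle labeled rows and split them into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_valid = round(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]   # never used to build, tune, or select
    return train, valid, test

labeled = [(i, i % 2) for i in range(100)]  # (example id, class label)
train, valid, test = split_labeled_data(labeled)
print(len(train), len(valid), len(test))
```

Each labeled example lands in exactly one of the three sets, which is what keeps the final evaluation honest.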
Why have a Methodology III

Do not want to learn things that are not
useful
 May be already known
 May not be actionable
Hypothesis Testing

Data Mining is not usually used for
hypothesis testing
 B&L does not really say this
 Typical assumption is data already collected
and you have little influence on the process
○ Data may be in a data warehouse
○ You usually do not modify the scenarios for
collecting the data or the parameters
○ Experimental design not part of data mining
○ Active learning is related to this, where you
carefully select the data to learn from
The Methodology (Fayyad)

According to the article by Fayyad et al.,
the main steps are:
 Data Selection
 Preprocessing
 Transformation
 Data Mining
 Interpretation/Evaluation
The Methodology (B & L)

According to Berry & Linoff, the main steps are:
 Translate business problem into a DM problem
 Select Data
 Get to know the data
 Create a model set
 Fix problems with the data (“preprocess”)
 Transform the data
 Build models (“Data mining”)
 Assess models (“Interpret/Evaluate”)
 Deploy Models
 Assess Results then start over
Steps in the Process: Selection

Many of the steps are not very complex,
so here are some selective comments:
 Selection:
○ DM usually tries to use all available data
○ This may not be necessary; you can generate learning
curves to see how performance varies
with increasing amounts of data
○ Data Mining is not afraid of using lots of
variables (unlike statistics). But some data
mining methods (especially statistical ones)
do have problems with many variables.
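A learning curve as mentioned above can be sketched like this. Everything here is an assumption for illustration: the synthetic one-feature data, the nearest-centroid "model", and the chosen training sizes stand in for whatever data and learner are actually used.

```python
import random

def make_data(n, rng):
    """Hypothetical two-class data: one noisy numeric feature per example."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data

def fit_centroids(train):
    """Nearest-centroid 'model': mean feature value of each class seen."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(centroids, test):
    """Fraction of test examples whose nearest centroid has the right label."""
    correct = 0
    for x, y in test:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += predicted == y
    return correct / len(test)

def learning_curve(train, test, sizes):
    """Test accuracy when training on the first n examples, for each n."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]

rng = random.Random(0)
train, test = make_data(2000, rng), make_data(1000, rng)
curve = learning_curve(train, test, sizes=[10, 50, 250, 1000, 2000])
for n, acc in curve:
    print(f"{n:5d} training examples -> test accuracy {acc:.3f}")
```

If the curve flattens well before all the data is used, the extra data is not buying much for this learner.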
Steps in the Process: Know the Data

Getting to know the data:
 Always useful, and also helps make sure you
understand the problem
 Data visualization can help
 Data mining is not really like a black box
where the computer does all of the work
○ Having or generating good features (variables)
is critical. Data visualization can help here too
Steps in the Process: Create Model Set

Creating a model (training) set
 Sometimes you may want to form the training
set other than by random sampling
○ It is often recommended to balance the classes if
they are highly unbalanced
 Not really a good idea or needed. Can use cost-sensitive
learning instead, which we will address later
 May want to focus on harder problems
- Active learning skews the training data, but the purpose
is to save effort in manually labeling the training data
Steps in the Process: Create Model Set
 Data sets relevant to Data Mining
○ Training set: used to build initial model
○ Validation set: used to either tune model (e.g.,
pruning) or select amongst multiple models
○ Test set: used to evaluate goodness of model
 For predictive tasks, must have class labels
○ Score set: data that the model is ultimately built to score
 For predictive tasks, class labels are not available
 Note that training, validation and test data come
from labeled data
 Cross validation makes the most of limited labeled data
○ 10-fold cross validation uses 90% of the labeled data for
training and 10% for testing in each run. It entails 10 runs,
so every example is used for testing exactly once.
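The fold structure of 10-fold cross validation can be sketched as below. The helper name `k_fold_indices` is hypothetical, and the sketch assumes the examples are already in random order (in practice they would usually be shuffled first).

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) for each of the k runs."""
    indices = list(range(n_examples))
    for fold in range(k):
        test_idx = indices[fold::k]      # every k-th example, offset by fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(k_fold_indices(100, k=10))
for run, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"run {run}: {len(train_idx)} training, {len(test_idx)} testing")
```

A model is built and evaluated once per run, and the 10 evaluation results are then averaged.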
Steps in the Process: Fix Data
Many data mining methods don’t need as
much variable “fixing” as statistical methods
 Types of fixing
 Missing values: many ways to fix
 Too many categorical values: reduce
○ Binning, etc.
 Numerical values skewed
○ Take log, etc.

Data preprocessing (Fayyad) may just alter
the representation
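The three types of fixing listed above can each be sketched in a few lines. These are only one possible choice per problem (median fill, an "OTHER" bin for rare categories, log(1 + x)); the function names, the `min_count` threshold, and the sample values are assumptions for illustration.

```python
import math
from collections import Counter
from statistics import median

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def bin_rare_categories(values, min_count=2, other="OTHER"):
    """Collapse categories seen fewer than min_count times into one bin."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]

incomes = [30_000, None, 45_000, 1_200_000, 52_000]   # hypothetical
states = ["NY", "NY", "NJ", "CT", "NY"]                # hypothetical

filled = fill_missing(incomes)
binned = bin_rare_categories(states)
logged = log_transform(filled)
print(filled)
print(binned)
print([round(v, 2) for v in logged])
```

The median is used for the fill because, unlike the mean, it is not pulled upward by the one very large income.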
Steps in the Process: Transform

Aggregate data to a higher level
 Time series data often must be converted into
examples for classification algorithms
○ Phone call data aggregated from the call level to
describe activity associated with a phone number/user

Construction of new features is part of this
step. Feature construction can be critical.
 Area of a plot is more useful for estimating the value of a
home than its length and width separately.
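Both kinds of transformation mentioned above, aggregation to a higher level and feature construction, can be sketched briefly. The record layout, the field names, and the phone numbers are hypothetical.

```python
from collections import defaultdict

def aggregate_calls(calls):
    """Roll call-level records up to one example per phone number."""
    totals = defaultdict(lambda: {"n_calls": 0, "total_minutes": 0.0})
    for phone, minutes in calls:
        totals[phone]["n_calls"] += 1
        totals[phone]["total_minutes"] += minutes
    return {
        phone: {**t, "avg_minutes": t["total_minutes"] / t["n_calls"]}
        for phone, t in totals.items()
    }

def add_area(plot):
    """Construct an area feature from a plot's length and width."""
    return {**plot, "area": plot["length"] * plot["width"]}

# (phone number, call duration in minutes), one row per call
calls = [("555-0100", 2.0), ("555-0100", 4.0), ("555-0199", 10.0)]
per_phone = aggregate_calls(calls)
plot = add_area({"length": 30.0, "width": 20.0})
print(per_phone)
print(plot)
```

After aggregation each phone number is a single example that a classification algorithm can consume.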
Steps in the Process: Assess Model

Predictive models are assessed based on the
correctness of their predictions
 Accuracy is the simplest measure, but often not very
useful since not all errors are equal
○ we will learn more about this later
○ Lift curves are discussed in B&L (p 81)
 Lift ratio = P(class|sample) / P(class|population)
 Lift only makes sense when we can be selective, like in direct
marketing where we don’t have to judge every response

Descriptive models can be hard to evaluate since
there may not be objective criteria
 How do you tell if a clustering is meaningful?
○ More on assessment methods later
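The lift ratio defined above, P(class|sample) / P(class|population), can be computed for the top-scoring fraction of a scored list. This is a minimal sketch; the function name `lift` and the scores and responses are made up for illustration.

```python
def lift(scored, fraction):
    """Lift of the top-scoring fraction of (score, label) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_sample = max(1, round(len(ranked) * fraction))
    sample = ranked[:n_sample]
    p_sample = sum(label for _, label in sample) / len(sample)
    p_population = sum(label for _, label in ranked) / len(ranked)
    return p_sample / p_population

# (model score, actual response) pairs; hypothetical values
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]

top_20 = lift(scored, 0.2)    # both top-2 respond vs. 40% overall: 2.5
everyone = lift(scored, 1.0)  # selecting everyone gives no lift: 1.0
print(top_20, everyone)
```

Selecting the whole population always gives a lift of 1, which is why lift is only informative when we can be selective.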
Steps in the Process: Deploy

Research models are easy to deploy: we run them
off-line, whenever we want to
 In a business, must deal with real-world issues
○ In the WISDM project, we want to classify activities
in real time. This is also needed for many fraud
detection models. Must be able to execute the
model and do it quickly, possibly on different
hardware.
 Some tools allow you to export the model as code
○ Even in off-line evaluation, may need to handle
huge amounts of data
Steps in the Process: Assess Results

True assessment is not just of model,
but includes the business context
 Takes into account all costs and benefits
 This may include costs that are very hard to
quantify
○ How much does a false negative medical test
cost if it causes the patient to die of a
preventable disease?
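Taking costs and benefits into account can change which model looks best. The sketch below compares two hypothetical models on the same 1,000 test cases. All counts and per-outcome costs are invented for illustration, and in practice some of these costs (such as a false negative on a medical test) are very hard to pin down.

```python
def total_cost(counts, costs):
    """Sum of (number of outcomes) x (cost per outcome) over outcome types."""
    return sum(counts[outcome] * costs[outcome] for outcome in counts)

def accuracy(counts):
    """Fraction of cases that are true positives or true negatives."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Hypothetical cost per outcome; a negative cost is a benefit.
costs = {"tp": -50.0, "fp": 10.0, "fn": 500.0, "tn": 0.0}

# Hypothetical confusion counts for two models on 1,000 test cases.
model_a = {"tp": 40, "fp": 20, "fn": 60, "tn": 880}
model_b = {"tp": 80, "fp": 150, "fn": 20, "tn": 750}

for name, counts in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: accuracy {accuracy(counts):.2f}, "
          f"total cost {total_cost(counts, costs):.0f}")
```

Under these assumed costs the less accurate model B is far cheaper, because false negatives are so much more expensive than false positives.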
Steps in the Process: Iterate

Data Mining is an iterative process
 Iteration can occur between most of the steps
 Example: You don’t like overall results so you
add another feature. You then assess its impact
to see if you should keep it.
 Example: You realize that assessment of your
model does not make sense and is missing
some costs, so you then incorporate these costs
into the model