
Chapter 3 Data Mining Concepts: Data Preparation and Model Evaluation


Data Preparation

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!

  – Quality decisions must be based on quality data
  – A data warehouse needs consistent integration of quality data
  – The assessment of data quality determines how much confidence to place in the results

Preparing Data for Analysis

• Think about your data – how is it measured, what does it mean?

– nominal or categorical
  • jersey numbers, ids, colors, simple labels
  • sometimes recoded into integers - careful!

– ordinal
  • rank has meaning - the numeric value does not necessarily
  • educational attainment, military rank
– integer valued
  • distances between numeric values have meaning
  • temperature, time
– ratio
  • a zero value has meaning - so fractions and ratios are sensible
  • money, age, height
• It might seem obvious what type a given data value is, but not always
  – pain index, movie ratings, etc.

Investigate your data carefully!

• Example: lapsed donors to a charity: (KDD Cup 1998)

– Made their last donation to PVA 13 to 24 months prior to June 1997
– 200,000 records (training and test sets)
– Who should get the current mailing?
– What is the cost-effective strategy?

– “tcode” was an important variable…


Tasks in Data Preprocessing

• Data cleaning
  – Check for data quality
  – Missing data
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtain a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
  – A combination of reduction and transformation, particularly important for numerical data

Data Cleaning / Quality

• Individual measurements
  – Random noise in individual measurements
    • Outliers
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
    • Can be corrected or smoothed out
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc. than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don’t want to answer
      – Patients who are too ill for a certain test

Missing Data

• Data is not always available
  – E.g., many records have no recorded value for several attributes
    • survey respondents
    • disparate sources of data
• Missing data may be due to
  – equipment malfunction
  – data not entered properly
  – data not available
  – different versions of data having been merged
• Try and figure it out!!!


How to Handle Missing Data?

• Ignore the tuple
  – Only feasible for a small % of missing values
• Use a global constant (such as the variable mean) to fill in the missing value
  – “unknown” as a category
  – For continuous data, this will decrease the variance significantly
• Use a random value to fill in the missing value
  – Preserves variance, and ‘does no harm’
• Use imputation
  – nearest neighbor
  – model based (regression or Bayesian based)
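A minimal Python sketch of the fill-in strategies above, assuming pandas and scikit-learn are available; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing values (hypothetical columns)
df = pd.DataFrame({"age":    [23, 45, np.nan, 31, 52],
                   "income": [40e3, np.nan, 55e3, 48e3, np.nan]})

# 1. Ignore the tuple: drop rows with any missing value
dropped = df.dropna()

# 2. Fill with a global constant, here the column mean (shrinks the variance)
mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                           columns=df.columns)

# 3. Nearest-neighbor imputation: borrow values from the k most similar rows
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                          columns=df.columns)
```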

Missing Data

• What do I choose for a given situation?

• What you do depends on
  – the data: how much is missing? are the missing values ‘important’?
  – the model: can it handle missing values?
  – is the data missing at random?
  – there is no right answer!
• Always check the robustness of your results

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values (outliers) may be due to
  – faulty data collection
  – data entry problems
  – technology limitations
  – YOU!
  – Try and figure it out
• Other data problems that require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data

Data Transformation

• Can help reduce the influence of extreme values
• Variance reduction:
  – Often very useful when dealing with skewed data (e.g. incomes)
  – square root, reciprocal, logarithm, raising to a power
  – Logit: transforms probabilities in (0, 1) to the real line

      logit(p) = log( p / (1 - p) )

• Normalization: scale values to fall within a small, specified range
  – min-max normalization
  – Standardization / z-score normalization
• Attribute/feature construction
  – New attributes constructed from the given ones
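A short sketch of these transformations in Python with numpy, using made-up values; the logit is computed directly from the formula above:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])   # skewed values, e.g. incomes
p = np.array([0.1, 0.5, 0.9])              # probabilities in (0, 1)

log_x   = np.log(x)                        # variance-reducing transform
sqrt_x  = np.sqrt(x)
logit_p = np.log(p / (1 - p))              # maps (0, 1) onto the real line

# min-max normalization: rescale to [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# standardization / z-score: mean 0, standard deviation 1
zscore = (x - x.mean()) / x.std()
```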

Dealing with massive data

• What if the data simply does not fit on my computer (or R crashes)?

– Sample, sample, sample
  • be careful to do proper randomization and stratification
– Find a smaller question
  • Use tools to reduce the dataset and reframe the question
– Use a database
  • MySQL is a good (and free) one
– Investigate data reduction strategies
  • Can reduce either n (rows) or p (variables)

Data Reduction: Dimension Reduction

• In general, incurs a loss of information about x
• If the dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable
  – e.g., a clustering example:
    • 100-dimensional data
    • if the cluster structure is only present in 2 of the dimensions, the others are just noise
    • if the other 98 dimensions are just noise (relative to the cluster structure), then clusters will be much easier to discover if we focus on the 2-d space
• Dimension reduction can also provide interpretation/insight
  – e.g., for 2-d visualization purposes

Data Reduction: Dimension Reduction

• Feature selection (i.e., attribute subset selection):
  – Use EDA to find useless variables
  – Use exhaustive search on a simple model (e.g. regression)
    • Can be computationally expensive
  – Use heuristic methods like stepwise methods (forward / backward selection)
    • Can get trapped in local minima (see the sketch below)
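A minimal sketch of greedy forward selection, assuming scikit-learn and synthetic regression data; the stopping rule (stop when cross-validated R² no longer improves) is one reasonable choice, not the only one:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Try adding each remaining variable; keep the best single addition
    scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                 cv=5).mean() for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:   # no improvement: stop (a local optimum)
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected features:", selected)
```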

Data Reduction (n): Sampling

• Don’t forget about sampling!

• Choose a representative subset of the data
  – Simple random sampling may be OK, but beware of skewed variables
• Stratified sampling methods (sketched below)
  – Approximate the percentage of each class (or subpopulation of interest) in the overall database
  – Used in conjunction with skewed data
  – Propensity scores may be useful if the response is unbalanced

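A short pandas sketch contrasting simple random sampling with stratified sampling on made-up data with a rare class:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical dataset: the rare class makes up about 2% of the rows
df = pd.DataFrame({"x": rng.normal(size=100_000),
                   "cls": rng.choice(["common", "rare"], size=100_000,
                                     p=[0.98, 0.02])})

# Simple random sample: the rare class may be badly under-represented
srs = df.sample(frac=0.01, random_state=0)

# Stratified sample: take 1% of each class, preserving the class proportions
strat = df.groupby("cls").sample(frac=0.01, random_state=0)

print(srs["cls"].value_counts(normalize=True))
print(strat["cls"].value_counts(normalize=True))
```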

Data Reduction: Projection Methods

• Projections: the shadow of a multidimensional object on a lower-dimensional space
• Mathematically: multiplying an n x p data matrix by an orthonormal p x d projection matrix
(Courtesy: Cook, Buja, Lee, Wickham)

Projections

(Figure: example projections. Courtesy: Cook, Buja, Lee, Wickham)

Data Reduction: Principal Components

• One of several projection methods
• Idea: find a projection of your data into a lower dimension that maximizes the amount of information retained
• Information = variance
• Works for numeric data only
• Used when the number of dimensions is large

PCA Example

(Figure: a scatterplot of two variables x1 and x2. The first principal component vector points in the direction of the highest-variance projection; the second principal component vector is orthogonal to it.)

Principal Components

• Sequentially extracts the maximal-variance “directions”
• All directions (principal components) are orthogonal
• Note: variables must be standardized!!
• x = original points in p-dimensional space; multiplying by a projection matrix of orthogonal directions gives the original points projected into d dimensions
• Principal components are related to the covariance of the original data
  – Technically: the first PC is the eigenvector for the first eigenvalue of the covariance of X
  – Highly correlated data reduces nicely
• A ‘scree’ plot can help assess how many PCs to use… (see the sketch below)

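A minimal PCA sketch in Python with scikit-learn on synthetic data, showing why standardization matters and where the scree-plot quantities come from:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
X[:, 3] *= 1000                      # one variable on a much larger scale

# Standardize first; otherwise the large-scale variable dominates the PCs
Xs = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(Xs)       # points projected onto the PCs

# Scree-plot quantities: variance explained by each component
print(pca.explained_variance_ratio_)
# Loadings: coefficients (weights) of the variables in each projection vector
print(pca.components_)

# Keep the first d components as the reduced representation
d = 2
X_reduced = scores[:, :d]
```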

Example: Music Data

(Figure: scree plots for the music data. With the raw variables, the scree plot is dominated by the variables with the largest scales - what’s wrong with this picture? With scaled data, the scree plot is more informative. Loadings = coefficients (weights) of the variables in the projection vector.)

Data Reduction: Multidimensional Scaling

• Start with an n x p matrix of observations and variables
• Create an n x n matrix of distances (similarities)
  – Feasible when n is small(ish)
  – 0’s on the diagonal
  – Symmetric
• Or, you may have a matrix of distances to start with
  – Relationships, networks, etc.
• MDS finds a representation of these points in a lower-dimensional space (usually 2), where the distances in this space best represent the original distances

• Example: car data

              Cadillac   Camaro   Corsica   Civic
  Price         34.7      15.1      11.4     12
  Fuel           16        19        25      42
  FuelTank      18.0      15.5      15.6     11.9

• Distance between X and Y?

    d(x, y) = sqrt( Σ_{i=1..3} (x_i − y_i)² )

              Cadillac   Camaro   Corsica   Civic
  Cadillac      0         20.0     25.09    38.1
  Camaro       20.0        0        7.05    26.9
  Corsica      25.09       7.05     0       21.84
  Civic        38.1       26.9     21.84     0

Multidimensional Scaling (MDS)

• MDS score function (“stress”):

    S = Σ_{i<j} [ d(i,j) − δ(i,j) ]²  /  Σ_{i<j} d(i,j)²

  where d(i,j) are the original dissimilarities and δ(i,j) are the Euclidean distances in the “embedded” k-dimensional space
• A local minimum is found via algorithmic methods
  – (the algorithm is gradient descent)
• Morse code example

MDS: face data


MDS: 2d embedding of face images

• Similar faces are close to each other
• Sometimes the axes can have an interpretation


Model Evaluation


Evaluating Models: in-sample

How good is (a,b)?

For a given set of (x, y) pairs, a score function S measures how well the model fits, e.g. the sum of squared errors S(a, b) = Σ_i ( y_i − (a + b·x_i) )². This is just one of many possible score functions.

Evaluating Models: In-Sample

• In-sample: error goes to zero with enough parameters (k); goodness of fit increases with the number of parameters
• High bias: doesn’t fit the data well, but generalizable and robust
• High variance: low error, but not robust to changes or new data
• The score function should embody the compromise:

    score(model) = goodness-of-fit − penalty(k)

  e.g. the Bayesian Information Criterion (BIC); a sketch follows below
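A minimal sketch of a penalized in-sample score, assuming the usual BIC formula for an ordinary least-squares fit with Gaussian errors (n·log(RSS/n) + k·log(n), up to a constant) and synthetic data:

```python
import numpy as np

def bic_linear(X, y):
    """BIC for an OLS fit with intercept (Gaussian errors, up to a constant)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])            # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    # goodness-of-fit term plus a penalty that grows with the parameter count
    return n * np.log(rss / n) + (k + 1) * np.log(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)     # only 2 variables matter

# Adding parameters always reduces RSS, but BIC penalizes the extra ones
for k in range(1, 6):
    print(k, round(bic_linear(X[:, :k], y), 1))
```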

In v. Out

• In-sample evaluation
  – Uses all of the data to fit parameters
  – Focus: how well does my model ‘fit’ the data?
  – Penalties to decide on the number of parameters
• Out-of-sample evaluation
  – Split the data into training and test sets
  – Focus: how well does my model predict?
  – Prediction error is all that matters
• Statistics traditionally looks at in-sample, whereas data mining / machine learning typically uses out-of-sample

Evaluating Models: Out-of-sample

• Fit the model on part of the data
• Evaluate on the out-of-sample remainder
• If the model is overfit, it will not perform well on out-of-sample data


Data Partitioning

• Randomly partition the data into a training set and a test set
• Training set: data used to train/build the model
  – Estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.
• Test set: a set of examples not used for model induction; the model’s performance is evaluated on unseen data (aka out-of-sample data)
• Generalization error: the model’s error on the test data
(Figure: the data split into a set of training examples and a set of test examples)

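A minimal sketch of a random train/test partition with scikit-learn on synthetic data; the logistic regression model is just a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Randomly partition into training and test sets (one third held out)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Generalization error: performance on data the model never saw
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```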

Complexity and Generalization

(Figure: a score function, e.g. squared error, plotted against model complexity. S_train(θ) keeps decreasing as complexity grows, while S_test(θ) reaches a minimum at the optimal model complexity and then rises. Complexity = degrees of freedom in the model, e.g. the number of variables.)


Holding out data

• The holdout method reserves a certain amount of the data for testing and uses the remainder for training
  – Usually: one third for testing, the rest for training
• For “unbalanced” datasets, random samples might not be representative
  – Few or no instances of some classes
• Stratified sample:
  – Make sure that each class is represented with approximately equal proportions in both subsets


Repeated holdout method

• The holdout estimate can be made more reliable by repeating the process with different subsamples
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  – The error rates on the different iterations are averaged to yield an overall error rate

• This is called the repeated holdout method



Cross-validation

• The most popular and effective type of repeated holdout is cross-validation
• Cross-validation avoids overlapping test sets
  – First step: the data is split into k subsets of equal size
  – Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed (a sketch follows below)
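A short sketch of stratified k-fold cross-validation with scikit-learn on synthetic data; the classifier is again just a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10-fold cross-validation with stratified folds:
# each fold is used once for testing and nine times for training
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("mean accuracy:", scores.mean(), "std:", scores.std())
```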

Cross-validation example:

(Figure: the data split into k folds; each fold serves once as the test set while the remaining folds form the training set.)

More on cross-validation

• The standard data-mining method for evaluation: stratified ten-fold cross-validation
• Why ten? Extensive experiments have shown that this is the best choice for getting an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
  – E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the sampling variance)
• The error estimate is the mean across all repetitions



Leave-One-Out cross-validation

• Leave-One-Out: a particular form of cross-validation
  – Set the number of folds to the number of training instances
  – I.e., for n training instances, build the classifier n times
• Makes the best use of the data
• Involves no random subsampling
• Computationally expensive, but good performance
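A minimal leave-one-out sketch with scikit-learn; the iris data and logistic regression are stand-ins chosen only so the example runs out of the box:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# Leave-one-out: n folds of size 1, so the model is fit n times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```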


Leave-One-Out-CV and stratification

• Disadvantage of Leave-One-Out-CV: stratification is not possible
  – It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes
  – The best model predicts the majority class
  – That gives 50% accuracy on fresh data
  – But the Leave-One-Out-CV estimate is 100% error! (the held-out instance is always in the minority of the remaining training data, so the majority-class prediction is always wrong)


Three way data splits

• One problem with CV is that since the data is used jointly to fit the model and estimate the error, the error estimate can be biased downward
• If the goal is a real estimate of error (as opposed to deciding which model is best), you may want a three-way split:
  – Training set: examples used for learning
  – Validation set: used to tune parameters
  – Test set: never used in the model-fitting process; used at the end for an unbiased estimate of the hold-out error (see the sketch below)
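A short sketch of a training/validation/test split with scikit-learn on synthetic data; tuning the regularization strength C of a logistic regression is only an illustrative stand-in for parameter tuning:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# First carve off a test set that is never touched during fitting or tuning
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
# Then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

# Tune a parameter on the validation set...
best_C = max([0.01, 0.1, 1.0, 10.0],
             key=lambda C: LogisticRegression(C=C, max_iter=1000)
                           .fit(X_train, y_train).score(X_val, y_val))

# ...and report an unbiased error estimate on the held-out test set
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```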

Classification Evaluation

• The score for a continuous response is based on squared error
• What if the response is binary or categorical?
  – classification problems
  – e.g., fraud or not, boy or girl, etc.
• A simple example:

  Inputs                                    Output          Model's      Correct/incorrect
  Single   No of cards   Age   Income>50K   Good/Bad risk   prediction   prediction
  0        1             28    1            1               1            :)
  1        2             56    0            0               0            :)
  0        5             61    1            0               1            :(
  0        1             28    1            1               1            :)
  …        …             …     …            …               …            …

Evaluation of Classification

                        actual outcome
                          1        0
  predicted     1         a        b
  outcome       0         c        d

• Accuracy = (a + d) / (a + b + c + d)
  – Not always the best choice
• Example: assume 1% fraud, and a model that predicts “no fraud” for everyone. What is the accuracy?

                          Actual class
                          Fraud    No Fraud
  Predicted    Fraud      0        0
  class        No Fraud   10       990

  Accuracy = 990 / 1000 = 99%, even though the model never catches a single fraud.
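A minimal Python sketch of the fraud example above (10 frauds, 990 non-frauds, a model that always predicts “no fraud”), using scikit-learn's metrics; note that sklearn's confusion_matrix puts actual classes on the rows:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1] * 10 + [0] * 990)   # 1 = fraud, 0 = no fraud
y_pred = np.zeros_like(y_true)            # always predict "no fraud"

# Rows = actual class (fraud, no fraud), columns = predicted class
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))

# 0.99 accuracy, yet the model is useless for catching fraud
print("accuracy:", accuracy_score(y_true, y_pred))
```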

Evaluation of Classification

• Other options:
  – recall or sensitivity (how many of those that are really positive did you predict?)
    • a / (a + c)
  – precision (how many of those predicted positive really are?)
    • a / (a + b)
• Precision and recall are always in tension
  – Increasing one tends to decrease the other
  – Document retrieval example

                        actual outcome
                          1        0
  predicted     1         a        b
  outcome       0         c        d

Evaluation of Classification

• Yet another option:
  – recall or sensitivity (how many of the positives did you get right?)
    • a / (a + c)
  – specificity (how many of the negatives did you get right?)
    • d / (b + d)
• Sensitivity and specificity have the same tension
• Different fields use different metrics (see the sketch below)

                        actual outcome
                          1        0
  predicted     1         a        b
  outcome       0         c        d

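A short sketch computing these metrics in Python with scikit-learn, on made-up labels; a, b, c, d follow the 2x2 table above (rows = predicted, columns = actual):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# sklearn returns the counts as tn, fp, fn, tp; map them onto a, b, c, d
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
a, b, c, d = tp, fp, fn, tn

print("recall / sensitivity:", a / (a + c), recall_score(y_true, y_pred))
print("precision:           ", a / (a + b), precision_score(y_true, y_pred))
print("specificity:         ", d / (b + d))
```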

Evaluation for a Thresholded Response

• Many classification models output probabilities
• These probabilities are thresholded to make a prediction
• Classification accuracy depends on the threshold
  – good models give low probabilities to Y = 0 cases and high probabilities to Y = 1 cases

• Test data: the model produces a predicted probability for each case
• Suppose we use a cutoff of 0.5…

                        actual outcome
                          1        0
  predicted     1         8        3
  outcome       0         0        9

  sensitivity = 8 / (8 + 0) = 100%
  specificity = 9 / (9 + 3) = 75%

  We want both of these to be high.

• Suppose we use a cutoff of 0.8…

                        actual outcome
                          1        0
  predicted     1         6        2
  outcome       0         2        10

  sensitivity = 6 / (6 + 2) = 75%
  specificity = 10 / (10 + 2) = 83%

• Note there are 20 possible thresholds (one per test case)
• Plotting sensitivity vs. specificity over all thresholds gives a sense of model performance by showing the tradeoff between them
• Note:
  – if the threshold is at the minimum, c = d = 0, so sensitivity = 1 and specificity = 0
  – if the threshold is at the maximum, a = b = 0, so sensitivity = 0 and specificity = 1
  – if the model is perfect, sensitivity = 1 and specificity = 1

• The ROC curve plots sensitivity vs. (1 − specificity), also known as the false positive rate
  – It always goes from (0, 0) to (1, 1)
  – The more area in the upper left, the better
  – A random model lies on the diagonal
• “Area under the curve” (AUC) is a common measure of predictive performance (see the sketch below)
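A minimal ROC/AUC sketch with scikit-learn; the 20 predicted probabilities and labels are made up so that the threshold sweep can be seen end to end:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# True labels and predicted probabilities for 20 test cases (made up)
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
p_hat  = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.45, 0.55, 0.50,
                   0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02])

# Sweep over all thresholds: fpr = 1 - specificity, tpr = sensitivity
fpr, tpr, thresholds = roc_curve(y_true, p_hat)
print("AUC:", roc_auc_score(y_true, p_hat))

# Plotting tpr against fpr gives the ROC curve, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], "--"); plt.show()
```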

Another Look at ROC Curves

(Figure sequence: two overlapping distributions of the test result, one for patients without the disease and one for patients with the disease. A threshold on the test result labels patients below it “negative” and patients above it “positive”. Diseased patients above the threshold are true positives; healthy patients above it are false positives; healthy patients below it are true negatives; diseased patients below it are false negatives. Moving the threshold to the right trades false positives for false negatives; moving it to the left does the reverse.)

ROC curve

(Figure: sweeping the threshold across all values traces out the ROC curve, with the true positive rate from 0% to 100% on the vertical axis and the false positive rate (1 − specificity) from 0% to 100% on the horizontal axis.)

Comparing Models

• Highest AUC wins
• But pay attention to ‘Occam’s Razor’
  – ‘the best theory is the smallest one that describes all the facts’
  – Also known as the ‘parsimony principle’
  – If two models are similar, pick the simpler one

Incorporating cost functions

• Not all errors are the same:
  – Loan payments
    • A bad loan costs us much more than a lost customer
  – Medical tests
    • Cost of a false alarm vs. a missed diagnosis
  – Spam
    • Cost of spam getting through vs. filtering out important mail
• Building an algorithm to minimize cost is the same as adding weights to false negatives and false positives (a sketch follows below):

                        actual outcome
                          1          0
  predicted     1         0          C(FP)
  outcome       0         C(FN)      0
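A minimal sketch of cost-sensitive threshold selection in Python; the probabilities, labels, and the 10:1 cost ratio C(FN):C(FP) are all made up for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
p_hat = np.clip(y_true * 0.3 + rng.uniform(size=1000) * 0.7, 0, 1)

C_FN, C_FP = 10.0, 1.0   # a missed positive costs 10x a false alarm

def expected_cost(threshold):
    """Total cost of classifying at the given probability cutoff."""
    y_pred = (p_hat >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return C_FN * fn + C_FP * fp

# Pick the threshold that minimizes total cost rather than raw error
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
print("cost-minimizing threshold:", round(best, 2))
```

With C(FN) much larger than C(FP), the cost-minimizing cutoff moves toward lower probabilities: the model flags more cases as positive to avoid expensive misses.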