Nonparametric Regression


Transcript: Nonparametric Regression

Prelude of Machine Learning 202: "Statistical Data Analysis in the Computer Age" (1991), Bradley Efron and Robert Tibshirani

Agenda

• Overview
• Bootstrap
• Nonparametric Regression
• Generalized Additive Models
• Classification and Regression Trees
• Conclusion


Overview

• Classical statistical methods, 1920-1950:
– Linear regression, hypothesis testing, standard errors, confidence intervals, etc.
• New statistical methods, post-1980:
– Based on the power of electronic computation
– Require fewer distributional assumptions than their predecessors
• How to spend computational wealth wisely?


Bootstrap

• Random sample of 164 data points: t(x) = 28.58
• How accurate is t(x)?
• The bootstrap: a device for extending SE to estimators other than the mean
• Suppose t(x) is the 25% trimmed mean

Bootstrap

• Why use a trimmed mean rather than mean(x)?
• If the data come from a long-tailed probability distribution, the trimmed mean can be substantially more accurate than mean(x)
• In practice, one does not know a priori whether the true distribution is long-tailed; the bootstrap can help answer this question
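The bootstrap recipe is simple enough to sketch in a few lines of Python: resample with replacement, recompute the statistic, and take the standard deviation of the replicates. Everything here (the toy data, the function names) is illustrative, not from the paper.

```python
import random
import statistics

def trimmed_mean(xs, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    core = xs[k:len(xs) - k]
    return sum(core) / len(core)

def bootstrap_se(data, stat, B=200, seed=0):
    """Resample with replacement B times, apply `stat` to each
    resample, and return the standard deviation of the replicates."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(data) for _ in range(len(data))])
            for _ in range(B)]
    return statistics.stdev(reps)

# Toy long-tailed sample (made up): two gross outliers among
# otherwise well-behaved values.
data = [9.8, 10.1, 10.4, 9.7, 10.0, 10.3, 9.9, 10.2, 25.0, -4.0]
```

Comparing `bootstrap_se(data, trimmed_mean)` with the bootstrap SE of the plain mean on the same data shows which estimator is more stable, which is exactly the question the slide poses.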


Nonparametric Regression

• Quadratic regression curve at 60% compliance: 27.72 +/- 3.08

Nonparametric Regression

• Nonparametric regression with loess at 60% compliance: 32.38 +/- ?
• Loess, i.e.:
– Windowing with the nearest 20% of data points
– A smooth weight function
– Weighted linear regression
• How to find the SE?
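As a rough illustration of the loess idea (nearest-neighbor window, smooth tricube weights, weighted linear fit), here is a minimal one-predictor sketch; the function name, window handling, and kernel details are assumptions, not the paper's implementation.

```python
import math

def loess_point(xs, ys, x0, span=0.2):
    """Local linear fit at x0: take the nearest `span` fraction of the
    points, weight them with the tricube kernel, and solve weighted
    least squares for a line."""
    n = len(xs)
    k = max(3, math.ceil(span * n))        # window: nearest 20% of the data
    order = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    h = max(abs(xs[i] - x0) for i in order) or 1.0
    w = {i: (1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in order}
    # closed-form weighted least squares for y = a + b*x
    sw = sum(w.values())
    swx = sum(w[i] * xs[i] for i in order)
    swy = sum(w[i] * ys[i] for i in order)
    swxx = sum(w[i] * xs[i] ** 2 for i in order)
    swxy = sum(w[i] * xs[i] * ys[i] for i in order)
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a + b * x0
```

Sliding x0 across the range of x traces out the smooth curve; unlike the quadratic fit, no formula for its SE falls out of the algebra.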

Nonparametric Regression

• How to find the SE? Bootstrap
• At 60% compliance:
– QR: 27.72 +/- 3.08
– NPR: 32.38 +/- 5.71 (with B = 50 bootstrap replications)
• On balance, the quadratic estimate should probably be preferred in this case. It would have to have an unusually large bias to undo its superiority in SE.
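The bootstrap SE for a fitted curve follows the same pattern: resample the (x, y) pairs, refit, and take the standard deviation of the fitted values at the point of interest. This sketch uses a plain least-squares line in place of the paper's quadratic and loess fits to keep it short; data and names are made up.

```python
import random
import statistics

def linfit_predict(pairs, x0):
    """Least-squares line through the (x, y) pairs, evaluated at x0."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a + b * x0

def bootstrap_curve_se(pairs, x0, B=50, seed=0):
    """Resample the (x, y) pairs with replacement, refit, and take the
    standard deviation of the fitted values at x0."""
    rng = random.Random(seed)
    reps = [linfit_predict([rng.choice(pairs) for _ in range(len(pairs))], x0)
            for _ in range(B)]
    return statistics.stdev(reps)
```

Swapping `linfit_predict` for a quadratic or loess fitter gives the two standard errors being compared on the slide.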


Generalized Additive Models

• Generalized Linear Model (GLM):
– Generalizes linear regression
– Linear model related to the response variable through a link function:
Y = g(b0 + b1*X1 + ... + bm*Xm)
• Additive Model:
– Nonparametric regression method
– Estimate a nonparametric function for each predictor
– Combine all predictor functions to predict the dependent variable
• Generalized Additive Model (GAM):
– Blends properties of additive models with the generalized linear model (GLM)
– Each predictor function f_i(x_i) is fit using parametric or nonparametric means
– Provides good fits to training data at the expense of interpretability
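The additive-model fitting loop (backfitting) can be sketched as follows, using a crude running-mean smoother in place of loess and an identity link. This is a toy illustration under those assumptions, not the algorithm as used in the case study.

```python
def smooth(x, r, k=3):
    """Running-mean smoother: average residuals r over a window of
    2k+1 points adjacent in the sorted order of predictor x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    out = [0.0] * len(x)
    for pos, i in enumerate(order):
        win = order[max(0, pos - k): pos + k + 1]
        out[i] = sum(r[j] for j in win) / len(win)
    return out

def backfit(y, x1, x2, iters=20):
    """Backfitting for y ~ mean + f1(x1) + f2(x2): cycle through the
    predictors, smoothing the partial residuals against each in turn."""
    n = len(y)
    mean = sum(y) / n
    f1, f2 = [0.0] * n, [0.0] * n
    for _ in range(iters):
        f1 = smooth(x1, [y[i] - mean - f2[i] for i in range(n)])
        c = sum(f1) / n
        f1 = [v - c for v in f1]          # center each f for identifiability
        f2 = smooth(x2, [y[i] - mean - f1[i] for i in range(n)])
        c = sum(f2) / n
        f2 = [v - c for v in f2]
    fitted = [mean + f1[i] + f2[i] for i in range(n)]
    return fitted, f1, f2
```

Plotting each estimated f against its predictor is what produces curves like the log-odds-versus-age plot discussed in the case study.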

GAM Case Study

• Analyze survival of infants after cardiac surgery for heart defects
• Dataset: 497 infant records
• Explanatory variables:
– Age (days)
– Weight (kg)
– Whether warm-blood cardioplegia (WBC) was applied
• WBC support data:
– Of the 57 infants who received the WBC procedure, 7 died
– Of the 440 infants who received the standard procedure, 133 died

GAM Case Study: Logistic Regression Results

• Three-parameter regression model:
– Age, weight: continuous variables
– WBC applied: binary variable
• Results:
– WBC has a strong beneficial effect: odds ratio of 3.8:1
– Higher weight => lower risk of death
– Age has no significant effect

GAM Case Study: GAM Analysis

• Add three individual smooth functions
– Use the locally weighted scatterplot smoothing (loess) method
• Results:
– WBC has a strong beneficial effect: odds ratio of 4.2:1
– Lighter infants are 55 times more likely to die than heavier infants
– Surprising findings from the log-odds curve for age!

GAM Case Study: Conclusion

• Traditional regression models may lead to oversimplification:
– Linear logistic regression forces curves to be straight lines
– Vital information regarding the effect of age is lost in a linear model
– The problem is more acute with a large number of explanatory variables
• GAM analysis exploits computational power to achieve a new level of analysis flexibility
– A personal computer can do what required a mainframe 10 years ago


Classification and Regression Trees (CART)

• A nonparametric technique
• An ideal analysis method for computer algorithms
• Splits are chosen by how well they explain variability
• Once a node is split, the procedure is applied to each branch recursively
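A minimal one-predictor regression tree illustrates the recursive splitting idea just described; real CART scans every predictor, handles misclassification costs, and prunes, none of which is shown in this sketch.

```python
def best_split(xs, ys):
    """Find the threshold on a single predictor that most reduces the
    within-node sum of squared deviations of y."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    pairs = sorted(zip(xs, ys))
    best_thr, best_cost = None, sse(ys)
    for cut in range(1, len(pairs)):
        if pairs[cut][0] == pairs[cut - 1][0]:
            continue                      # cannot split between equal x values
        cost = (sse([y for _, y in pairs[:cut]]) +
                sse([y for _, y in pairs[cut:]]))
        if cost < best_cost:
            best_thr = (pairs[cut - 1][0] + pairs[cut][0]) / 2
            best_cost = cost
    return best_thr

def grow(xs, ys, min_leaf=3):
    """Recursively split; stop when a node is small or no split helps."""
    thr = best_split(xs, ys)
    if thr is None or len(ys) < 2 * min_leaf:
        return sum(ys) / len(ys)          # leaf: predict the node mean
    left = [(x, y) for x, y in zip(xs, ys) if x <= thr]
    right = [(x, y) for x, y in zip(xs, ys) if x > thr]
    if len(left) < min_leaf or len(right) < min_leaf:
        return sum(ys) / len(ys)
    return (thr,
            grow([x for x, _ in left], [y for _, y in left], min_leaf),
            grow([x for x, _ in right], [y for _, y in right], min_leaf))

def predict(node, x):
    while isinstance(node, tuple):
        thr, lo, hi = node
        node = lo if x <= thr else hi
    return node
```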

CART Case Study

• Goal: gain insight into the causes of duodenal ulcers
– Sample of 745 rats
– One of 56 different alkyl nucleophiles administered to each rat
– Response: one of three severity levels (1, 2, 3), 3 being the most severe
• Skewed misclassification costs:
– Misclassifying a severe ulcer is more expensive than misclassifying a mild one
• Analysis tree construction:
– Use the 745 observations as training data
– Compute "apparent" misclassification rates
– The training-data misclassification rate has a downward bias

CART Case Study

• Classification tree (figure)

CART Case Study: Observations

• The optimal size of a classification tree is a tradeoff: higher training error versus overfitting
• It is usually better to grow a large tree and prune from the bottom
• How to choose the optimal tree size?
– Use test data on different tree models to measure each tree's misclassification rate
– In the absence of test data, use a cross-validation approach

CART: Cross-Validation

• Mimics the use of a test sample
• Standard cross-validation approach:
– Divide the dataset into 10 equal partitions
– Use 90% of the data as the training set and the remaining 10% as test data
– Repeat over all combinations of training and test partitions
• Cross-validation misclassification errors were found to be 10% higher than the apparent rates
• Cross-validation and bootstrapping are closely related
– Research on hybrid approaches is in progress
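The standard 10-fold procedure described above can be sketched generically; `fit` and `loss` stand in for whatever model (e.g., trees of different sizes) is being assessed. Names and data here are illustrative, not from the paper.

```python
import random

def kfold_error(pairs, fit, loss, k=10, seed=0):
    """k-fold cross-validation: hold out each fold in turn, fit on the
    remaining folds, and average the loss on the held-out points."""
    rng = random.Random(seed)
    data = pairs[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    total, n = 0.0, 0
    for i in range(k):
        held_out = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        model = fit(train)
        total += sum(loss(model, p) for p in held_out)
        n += len(held_out)
    return total / n

# Example: assess the humble "predict the training mean" model.
fit_mean = lambda train: sum(y for _, y in train) / len(train)
sq_loss = lambda model, p: (p[1] - model) ** 2
```

Because every point is scored by a model that never saw it, this estimate avoids the downward bias of the apparent (training-data) error rate.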


Conclusion

• Computers have enabled a new generation of statistical methods and tools
• Traditional mathematical derivations are replaced with computer algorithms
• Freedom from the bell-shaped-curve assumptions of the traditional approach
• Modern statisticians need to understand:
– Mathematical tractability is not required for computer-based methods
– Which computer-based methods to use
– When to use each method