Transcript Nonparametric Regression
Prelude of Machine Learning 202
Statistical Data Analysis in the Computer Age (1991)
Bradley Efron and Robert Tibshirani
Agenda
• Overview
• Bootstrap
• Nonparametric Regression
• Generalized Additive Models
• Classification and Regression Trees
• Conclusion
Overview
• Classical statistical methods from 1920-1950:
  – Linear regression, hypothesis testing, standard errors, confidence intervals, etc.
• New statistical methods post-1980:
  – Based on the power of electronic computation
  – Require fewer distributional assumptions than their predecessors
• How to spend computational wealth wisely?
Bootstrap
• A device for extending SE to estimators other than the mean
• Random sample of 164 data points
• Suppose t(x) is the 25% trimmed mean: t(x) = 28.58
• How accurate is t(x)?
Bootstrap
• Why use a trimmed mean rather than mean(x)?
• If the data come from a long-tailed probability distribution, the trimmed mean can be substantially more accurate than mean(x)
• In practice, one does not know a priori whether the true probability distribution is long-tailed. The bootstrap can help answer this question.
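The bootstrap idea above can be sketched in a few lines. This is a minimal illustration, not the authors' code. The data are a synthetic long-tailed stand-in for the 164 points, and the helper names (`trimmed_mean`, `bootstrap_se`), the number of replicates and the reading of "25% trimmed" as dropping 25% from each tail are all assumptions.

```python
import numpy as np

def trimmed_mean(x, proportion=0.25):
    """Mean of the values left after dropping `proportion` of the points from each tail."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(proportion * x.size))  # number of points dropped per tail
    return x[k:x.size - k].mean()

def bootstrap_se(x, statistic, n_boot=200, seed=0):
    """Bootstrap standard error: resample with replacement, recompute, take the spread."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(x, size=x.size, replace=True)
        replicates[b] = statistic(resample)
    return replicates.std(ddof=1)

# Synthetic long-tailed sample (assumption: NOT the data from the slides).
rng = np.random.default_rng(1)
sample = 28.0 + 10.0 * rng.standard_t(df=3, size=164)

se_trimmed = bootstrap_se(sample, trimmed_mean)
se_mean = bootstrap_se(sample, np.mean)
print(f"trimmed mean = {trimmed_mean(sample):.2f}, bootstrap SE = {se_trimmed:.2f}")
print(f"mean         = {sample.mean():.2f}, bootstrap SE = {se_mean:.2f}")
```

Comparing the two bootstrap SEs on the same sample is one way to judge whether trimming pays off for the data at hand.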
Nonparametric Regression
• Quadratic regression curve
• At 60% compliance: 27.72 +/- 3.08
Nonparametric Regression
• Nonparametric regression with loess
• At 60% compliance: 32.38 +/- ?
• i.e.
  – Windowing with the nearest 20% of data points
  – Smooth weight function
  – Weighted linear regression
• How to find SE?
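The three loess steps listed above (window, smooth weights, weighted linear fit) can be sketched for a single target point. This is a simplified illustration under stated assumptions: the tricube weight function is one common choice and is not named on the slide, the function name `loess_at` is hypothetical, and the data are synthetic.

```python
import numpy as np

def loess_at(x0, x, y, span=0.2):
    """Local linear estimate at x0 using the nearest `span` fraction of the points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(3, int(np.ceil(span * x.size)))      # window size: nearest 20% by default
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                   # indices of the k nearest points
    h = dist[idx].max()                          # window half-width
    if h == 0:
        return y[idx].mean()
    w = (1.0 - (dist[idx] / h) ** 3) ** 3        # tricube weights (assumed choice)
    # Weighted least squares of y on (1, x - x0); the intercept is the fit at x0.
    design = np.column_stack([np.ones(k), x[idx] - x0])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y[idx] * sw, rcond=None)
    return beta[0]

# Synthetic compliance/improvement data (assumption: NOT the data from the slides).
rng = np.random.default_rng(0)
compliance = rng.uniform(0, 100, size=164)
improvement = 0.005 * compliance**2 + rng.normal(0, 15, size=164)
print(f"loess estimate at 60% compliance: {loess_at(60.0, compliance, improvement):.2f}")
```

Because each window is fit with a straight line, the estimator reproduces exactly linear data without error, which is a quick sanity check for any implementation.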
Nonparametric Regression
• How to find SE? Bootstrap
• At 60% compliance:
  – QR: 27.72 +/- 3.08
  – NPR: 32.38 +/- 5.71 (with B=50)
• On balance, the quadratic estimate should probably be preferred in this case. It would have to have an unusually large bias to undo its superiority in SE.
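A bootstrap SE for a regression estimate at a single point can be sketched as follows. The sketch assumes the (compliance, improvement) pairs are resampled together, uses B=50 as on the slide, and runs on synthetic data. The names `quadratic_at` and `bootstrap_se_at` are hypothetical, and any other estimator with the same call signature (for example a loess fit) could be passed in place of the quadratic.

```python
import numpy as np

def quadratic_at(x0, x, y):
    """Value at x0 of the least-squares quadratic fitted to (x, y)."""
    coeffs = np.polyfit(x, y, deg=2)
    return np.polyval(coeffs, x0)

def bootstrap_se_at(x0, x, y, estimator, n_boot=50, seed=0):
    """Bootstrap SE of estimator(x0, x, y), resampling (x, y) pairs with replacement."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # one bootstrap sample of row indices
        replicates[b] = estimator(x0, x[idx], y[idx])
    return replicates.std(ddof=1)

# Synthetic compliance/improvement data (assumption: NOT the data from the slides).
rng = np.random.default_rng(0)
compliance = rng.uniform(0, 100, size=164)
improvement = 0.005 * compliance**2 + rng.normal(0, 15, size=164)

estimate = quadratic_at(60.0, compliance, improvement)
se = bootstrap_se_at(60.0, compliance, improvement, quadratic_at, n_boot=50)
print(f"quadratic estimate at 60%: {estimate:.2f} +/- {se:.2f}")
```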
Generalized Additive Models
• Generalized Linear Model (GLM):
  – Generalizes linear regression
  – Linear model related to response variable using a link function
    Y = g(b_0 + b_1*X_1 + ... + b_m*X_m)
• Additive Model:
  – Nonparametric regression method
  – Estimate a nonparametric function for each predictor
  – Combine all predictor functions to predict the dependent variable
• Generalized Additive Model (GAM):
  – Blends properties of additive models with the generalized linear model (GLM)
  – Each predictor function f_i(x_i) is fit using parametric or nonparametric means
  – Provides good fits to training data at the expense of interpretability
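One way to estimate a function per predictor and combine them is backfitting, sketched below for a plain additive model with a continuous response. Everything here is an assumption for illustration: the slides do not name the fitting algorithm, the smoother is a crude running mean rather than loess, the data are synthetic, and the names `running_mean_smooth` and `backfit` are hypothetical. A logistic GAM would wrap a loop like this inside an iteratively reweighted fit, which is not shown.

```python
import numpy as np

def running_mean_smooth(x, r, k=15):
    """Smooth r against x by averaging over a window of up to k points adjacent in sorted x."""
    order = np.argsort(x)
    r_sorted = r[order]
    n, half = x.size, k // 2
    fitted = np.empty(n)
    for pos in range(n):
        lo, hi = max(0, pos - half), min(n, pos + half + 1)
        fitted[order[pos]] = r_sorted[lo:hi].mean()
    return fitted

def backfit(X, y, n_iter=20, k=15):
    """Fit y ~ alpha + sum_j f_j(X[:, j]) by cycling a smoother over partial residuals."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                          # f[:, j] holds f_j at the data points
    for _ in range(n_iter):
        for j in range(p):
            # Residual with every component removed except the j-th.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = running_mean_smooth(X[:, j], partial, k)
            f[:, j] -= f[:, j].mean()             # centre each function for identifiability
    return alpha, f

# Synthetic data with one sinusoidal and one quadratic effect (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, size=300)

alpha, f = backfit(X, y)
residual = y - alpha - f.sum(axis=1)
print(f"response variance {y.var():.3f}, residual variance {residual.var():.3f}")
```

Plotting each column of `f` against its predictor gives the kind of per-variable curve that a GAM analysis inspects.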
GAM Case Study
• Analyze survival of infants after cardiac surgery for heart defects
• Dataset: 497 infant records
• Explanatory variables:
  – Age (days)
  – Weight (kg)
  – Whether warm-blood cardioplegia (WBC) was applied
• WBC support data:
  – Of 57 infants who received the WBC procedure, 7 died
  – Of 440 infants who received the standard procedure, 133 died
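As a quick check on the raw counts above, the unadjusted odds ratio of death (standard procedure versus WBC) can be computed directly. This is plain arithmetic on the four numbers given. It ignores age and weight, so it is not expected to match the odds ratios that come out of a fitted regression model.

```python
def odds(events, total):
    """Odds of the event: events divided by non-events."""
    return events / (total - events)

wbc_died, wbc_total = 7, 57          # counts from the slide
std_died, std_total = 133, 440       # counts from the slide

crude_odds_ratio = odds(std_died, std_total) / odds(wbc_died, wbc_total)
print(f"unadjusted odds ratio (standard vs WBC): {crude_odds_ratio:.2f}")  # about 3.09
```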
GAM Case Study: Logistic regression results
• Three-parameter regression model
  – Age, weight: continuous variables
  – WBC applied: binary variable
• Results:
  – WBC has a strong beneficial effect: odds ratio of 3.8:1
  – Higher weight => lower risk of death
  – Age has no significant effect
GAM Case Study: GAM Analysis
• Add three individual smooth functions
  – Use the locally weighted scatterplot smoothing (loess) method
• Results:
  – WBC has a strong beneficial effect: odds ratio of 4.2:1
  – Lighter infants are 55 times more likely to die than heavier infants
  – Surprising findings from the log-odds curve for age!
GAM Case Study: Conclusion
• Traditional regression models may lead to oversimplification
  – Linear logistic regression forces curves to be straight lines
  – Vital information regarding the effect of age is lost in a linear model
  – More acute problem with a large number of explanatory variables
• GAM analysis exploits computational power to achieve a new level of analysis flexibility
  – A personal computer can do what required a mainframe 10 years ago
Classification and Regression Tree
• A nonparametric technique
• An ideal analysis method to apply computer algorithms
• Splits based upon how well the splits can explain variability
• Once a node is split, the procedure is applied to each “split” recursively
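The recursive splitting described above can be sketched as a small classification tree builder. The split criterion here is the reduction in Gini impurity, which is an assumed choice since the slide only says splits should explain variability. The names `gini`, `best_split` and `grow`, the stopping rules and the data are illustrative, and pruning is not shown.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 for a pure node)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature, threshold, gain) for the split that most reduces impurity."""
    n, p = X.shape
    parent = gini(y)
    best = (None, None, 0.0)
    for j in range(p):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:   # midpoints between distinct values
            left = X[:, j] <= t
            n_left = left.sum()
            child = (n_left * gini(y[left]) + (n - n_left) * gini(y[~left])) / n
            if parent - child > best[2]:
                best = (j, t, parent - child)
    return best

def grow(X, y, depth=0, max_depth=3, min_size=5):
    """Split a node, then apply the same procedure to each child recursively."""
    feature, threshold, _ = best_split(X, y)
    if feature is None or depth >= max_depth or y.size < min_size:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # predict the majority class
    left = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_size),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}

# Synthetic data with three severity levels driven by the first feature (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 1 + (X[:, 0] > 0.33).astype(int) + (X[:, 0] > 0.66).astype(int)
tree = grow(X, y)
print("root split on feature", tree["feature"], "at", round(float(tree["threshold"]), 3))
```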
CART Case study
• Gain insight into causes of duodenal ulcers
  – Use a sample of 745 rats
  – 1 out of 56 different alkyl nucleophiles administered to each rat
  – Response: one of three severity levels (1, 2, 3), 3 being the highest severity
• Skewed misclassification costs
  – Severe ulcer misclassification is more expensive than mild ulcer misclassification
• Analysis tree construction:
  – Use 745 observations as the training data
  – Compute ‘apparent’ misclassification rates
  – Training data misclassification rate has downward bias
CART Case study
• Classification tree
CART Case study: Observations
• Optimal size of classification tree is a tradeoff
  – Higher training errors versus overfitting
• It is usually better to construct a large tree and prune from the bottom
• How to choose the optimal size of the classification tree?
  – Use test data on different tree models to understand the misclassification rate in each tree
  – In the absence of test data, use a cross-validation approach
CART: Cross validation
• Mimic the use of a test sample
• Standard cross-validation approach:
  – Divide dataset into 10 equal partitions
  – Use 90% of data as training set and the remaining 10% as test data
  – Repeat with all different combinations of the training and test data
• Cross-validation misclassification errors found to be 10% higher than the original
• Cross validation and bootstrapping are closely related
  – Research on hybrid approaches in progress
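The 10-fold procedure above can be sketched generically, with the classifier passed in as a pair of functions. To keep the example short, a 1-nearest-neighbour rule stands in for the classification tree. That substitution, the synthetic data and the names `k_fold_error`, `fit_1nn` and `predict_1nn` are all assumptions. The stand-in also makes the downward bias of the apparent error easy to see, because a 1-nearest-neighbour rule classifies its own training points perfectly.

```python
import numpy as np

def k_fold_error(X, y, fit, predict, k=10, seed=0):
    """Misclassification rate estimated by k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(y.size), k)   # k nearly equal partitions
    errors = 0
    for test_idx in folds:                # each partition serves once as test data
        train = np.ones(y.size, dtype=bool)
        train[test_idx] = False
        model = fit(X[train], y[train])
        errors += np.sum(predict(model, X[test_idx]) != y[test_idx])
    return errors / y.size

def fit_1nn(X, y):
    """A 1-nearest-neighbour 'model' is simply the stored training data."""
    return X, y

def predict_1nn(model, X_new):
    """Label each new point with the class of its closest training point."""
    X_train, y_train = model
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

# Synthetic two-class data with noisy labels (assumption: NOT the rat data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(int)

apparent_error = np.mean(predict_1nn(fit_1nn(X, y), X) != y)
cv_error = k_fold_error(X, y, fit_1nn, predict_1nn, k=10)
print(f"apparent error = {apparent_error:.3f}, 10-fold CV error = {cv_error:.3f}")
```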
Conclusion
• Computers have enabled a new generation of statistical methods and tools
  – Replace traditional mathematical ways with computer algorithms
  – Freedom from bell-shaped curve assumptions of the traditional approach
• Modern statisticians need to understand:
  – Mathematical tractability is not required for computer-based methods
  – Which computer-based methods to use
  – When to use each method