Transcript Slide 1

8/31/2006

Lecture 2 – Modern Statistical Modeling, an Overview

Rice ELEC 697, Farinaz Koushanfar, Fall 2006

Summary

• A little bit of history
• The culture of statistical modeling
  – Classic
  – Modern
• Exploratory data analysis
  – Exploratory vs. confirmatory
  – Examples

A little bit of history

• Statistics is the science of learning from data to understand its meaning, structure, relationships, etc. – a pursuit hundreds of years old
• For a brief history of pre-20th century statistics, see: http://www.bized.ac.uk/timeweb/reference/statisticians.htm

• Statistics started to separate from math as an independent discipline ~70 years ago
• Like many other disciplines in science and engineering, statistics has undergone a major revolution in the past 30 years
• Earlier, most data was collected manually, and we were dealing with small data sets. Now, we have terabit-scale databases that we would like to capture and model

The scientists behind what I will talk about today…

John Wilder Tukey (1915-2000)

• "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem" – John Tukey
• Wrote his PhD thesis on Convergence and Uniformity in Topology at Princeton (1939)
• Recognized the importance of statistics during World War II
• Mathematics is just a tool to facilitate addressing sound problems
• Many contributions, including the fast Fourier transform, the jackknife, and exploratory data analysis

Reference: John W. Tukey, "We Need Both Exploratory and Confirmatory", The American Statistician, Vol. 34, No. 1 (Feb. 1980), pp. 23-25

The scientists behind what I will talk about today…

Leo Breiman (1928-2005)

• PhD in math in 1954 at Berkeley
• Became a professor of probability in the UCLA math department
• Left in 1967 – realized that abstract mathematics has very little to do with real life
• Wrote a book, then worked as an independent consultant for 13 years
• Finally could solve interesting and important real-world problems!!
• Took a Berkeley position in 1980, this time helping to found the right department for him

Reference: Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, Vol. 16, No. 3 (Aug. 2001), pp. 199-215

The culture of statistical modeling

• Statistics really starts with data
• Two main goals
  – Prediction (estimation)
  – Information (detection)
• Two different cultures
  – Stochastic models (classic): e.g., response var = f(predictor var, random noise, parameters); model selection, prediction, evaluation
  – Algorithmic models (modern): the relating function is an algorithm that operates on the input x to predict the response y

[Figure: x → Nature → y – nature maps inputs x to responses y]
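The contrast can be sketched in a few lines of Python (toy data and illustrative names, not anything from the lecture): the stochastic-model culture assumes a functional form and interprets its fitted parameters, while the algorithmic culture treats the x → y map as a black box judged by prediction.

```python
import random

random.seed(0)

# Toy data from an unknown "nature" mechanism: y ≈ 2x + 1 plus noise.
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.3) for x in xs]
n = len(xs)

# Data-modeling culture: assume y = b0 + b1*x + noise, estimate the
# parameters by least squares, and draw conclusions from b0 and b1.
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
     sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# Algorithmic-modeling culture: no assumed form; an algorithm (here a
# 1-nearest-neighbor rule) maps x to a predicted y and is judged only
# by predictive accuracy.
def predict_1nn(x0):
    i = min(range(n), key=lambda j: abs(xs[j] - x0))
    return ys[i]

print(round(b0, 2), round(b1, 2))  # fitted parameters (interpretable)
print(round(predict_1nn(5.0), 2))  # black-box prediction at x = 5
```

Both cultures answer the same prediction question; they differ in whether the fitted object is meant to describe nature or only to predict it.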

Breiman argues that,

The focus on classical data models and the neglect of modern methods has:
• Led to irrelevant theory and questionable scientific conclusions
• Kept statisticians from using more suitable models
• Prevented classical statisticians from working on exciting new problems
In this course, we will cover mostly the classics and a few modern methods

Back to the history

• Upon his return to academia, Breiman realized that all articles (at the time) began and ended with data models
• Data models have had success in analyzing data and extracting information about the mechanisms producing the data
• Misuse of data models has led to many questionable conclusions about the underlying system
• Algorithmic models are mostly developed in the machine learning community
• Modern learning has led to changes in perception!


The model becomes the truth!

• Invent or use a reasonably good parametric class of models for a complex mechanism
• Estimate the parameters and draw conclusions:
  – The conclusions are about the model's mechanism, not about nature's mechanism
  – If the model is a poor approximation of nature, the conclusions are wrong!

• Example:
  y = b0 + Σ (m = 1 to M) bm xm + ε,  ε ~ N(0, σ²)
• Assume that the data is iid following the above model
• The coefficients {bm} are to be estimated
• Tests of hypothesis, confidence intervals, the distribution of the residual sum of squares, etc.
• Thousands of articles are published on related proofs
• Conclusions are drawn ignoring whether the model is valid

More problems with classical data models

• Multiplicity of data models
  – Answering the question of which model is best
  – Each model gives a different picture of reality and leads to different conclusions
• Predictive accuracy
  – This is a function of the number of parameters used, so it is not a good measure alone
• Other limitations of data models (next slide)
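The predictive-accuracy caveat can be sketched with toy stdlib Python (illustrative data and model, not from the lecture): a piecewise-constant model with k bins has k parameters, and its error on the fitting data keeps falling as k grows even after held-out error starts rising.

```python
import random

random.seed(2)

# Toy data: y = 2x plus noise, x uniform on [0, 1).
def make_data(n):
    return [(x, 2 * x + random.gauss(0, 0.5))
            for x in (random.random() for _ in range(n))]

train, test = make_data(200), make_data(200)

# A piecewise-constant model with k bins has k free parameters:
# each bin predicts the mean of the training points that fall in it.
def fit_bins(data, k):
    sums, counts = [0.0] * k, [0] * k
    for x, y in data:
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in data) / len(data)
    return [sums[b] / counts[b] if counts[b] else overall for b in range(k)]

def mse(model, data, k):
    return sum((y - model[min(int(x * k), k - 1)]) ** 2
               for x, y in data) / len(data)

results = {}
for k in (2, 10, 100):
    model = fit_bins(train, k)
    results[k] = (mse(model, train, k), mse(model, test, k))
    print(k, round(results[k][0], 3), round(results[k][1], 3))
```

With enough parameters the training error can be driven arbitrarily low, so accuracy on the fitting data alone says little about the quality of a model.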

Limitations of data models

• Multivariate analysis is just not working
  – Nobody really believes in the multivariate Normal, but everybody uses it
  – If all a man has is a hammer, then every problem looks like a nail... As data becomes more complex, the simplicity of the model-based approach diminishes
  – Approaching the problem by looking for a data model restricts statisticians from dealing with more interesting and realistic problems

Algorithmic models

• Have been around for some time; pioneers among statisticians include Olshen, Friedman, Wahba, Zhang, and Singer
• Many new problems have been attacked, including speech, image, and handwriting recognition, nonlinear time series, and financial market prediction
• Shift from data models to the properties of the algorithms
• Characterizing convergence and complexity
• Example: Vapnik constructed informative bounds on the generalization error of classification algorithms that depend on the capacity of the algorithm! (Support Vector Machines)

Examples of recent advances

• Multiplicity of good models (Rashomon)
  – Bagging is a solution
• The conflict between simplicity and accuracy (Occam)
  – Occam dilemma: accuracy requires more complex predictors; simple and interpretable functions do not make accurate predictors
• Dimensionality – curse or blessing (Bellman)
  – How to extract and put together many small pieces of information
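A sketch of the Rashomon point in stdlib Python (toy data; the base learner and numbers are illustrative): many bootstrap refits of an unstable learner fit about equally well yet disagree, and bagging, i.e., averaging their predictions, typically beats any single one.

```python
import random

random.seed(3)

# Toy data: a noisy quadratic mechanism on [0, 1).
def sample(n):
    return [(x, x * x + random.gauss(0, 0.5))
            for x in (random.random() for _ in range(n))]

train, test = sample(100), sample(500)

# An unstable base learner: 1-nearest-neighbor regression.
def predict_1nn(data, x0):
    return min(data, key=lambda p: abs(p[0] - x0))[1]

# Bagging: refit the same learner on B bootstrap resamples of the
# training set and average the B predictions.
B = 25
boots = [[random.choice(train) for _ in train] for _ in range(B)]

def predict_bagged(x0):
    return sum(predict_1nn(b, x0) for b in boots) / B

mse_single = sum((y - predict_1nn(train, x)) ** 2 for x, y in test) / len(test)
mse_bagged = sum((y - predict_bagged(x)) ** 2 for x, y in test) / len(test)
print(round(mse_single, 3), round(mse_bagged, 3))
```

Instead of choosing among many equally plausible models, bagging exploits their disagreement: averaging washes out the variance of the individual fits.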

Breiman's concluding remarks

• Nowhere is it written in stone what kind of models should be used
• Breiman is not against data models, but he thinks the emphasis has to be on the problem and not on the model
• Find a way to manage complex environments
  – E.g., microarray data, Internet traffic, ad-hoc network complexity, ULSI variability, etc.
• The root of science is to check theory against reality
• We need this philosophy to address real-world problems

Exploratory data analysis (EDA)

• Analysis can be done by various techniques
  – Mathematical
  – Logical
  – Tabular
  – Graphical
  – …

EDA

• EDA mostly uses graphical techniques, but
• It is really a different philosophy of approaching the problem
• It differs from classical methods, also referred to as confirmatory data analysis (CDA)

EDA vs. CDA

• CDA
  – A general problem to explore
  – Collects some data
  – Makes a hypothesis on the models
  – Carries out an analysis of the data based on the models
  – Draws conclusions based on the model features
• EDA
  – A general problem to explore
  – Collects some data
  – Carries out an analysis of the data
  – Infers a model that is appropriate
  – Draws conclusions based on the data features

EDA vs. CDA (Cont’d)

• Rigor
  – CDA is rigorous, formal, and objective
  – EDA is suggestive, subject to the analyst's view
• Data treatment
  – In CDA, a few numbers summarize the data's properties
  – In EDA, all of the data is in focus
• Assumptions
  – In CDA, one discovers statistically significant variations from the assumed model, assuming it was correct
  – In EDA, the assumptions are few; analysis of the data has priority

Why Exploratory Data Analysis?

• EDA is oriented toward the future, rather than the past
  – Utilize data to understand, rather than summarize
  – Really important in research
• A good feel for the data is invaluable
  – Gain insights into the process behind the data
  – To understand what is NOT in the data
• Can (almost) only be obtained by graphical techniques
  – Graphs give information that no number can replace
  – Rely on the human ability to recognize patterns and to compare

Typical Assumptions for Measurement Process

• The data from a process is:
  – Random drawings (one data point should not influence another)
  – From a fixed distribution (and thus generalizable)
  – The distribution has a fixed location (the expectation is fixed)
  – And a fixed variation (the way the data differs from the expectation is fixed)
• We measure mean and variance to assess the last two assumptions

EDA Techniques

• Plot many aspects of the data using a variety of techniques, including scatter plots, bar plots, histograms, pie charts, and factor plots
• E.g., a run sequence plot for the mean and variance assumptions
  – All values yi are plotted on a chart: the y-axis is yi, plotted against the index i (x-axis)
  – Graphically check the fixed location
  – Graphically check the fixed variation
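A run sequence plot is just yi against i; the same checks can be sketched numerically (stdlib Python, simulated data) by comparing the location and spread of the first and second halves of the sequence against the fixed-location and fixed-variation assumptions.

```python
import random
import statistics

random.seed(4)

# A stable process vs. one whose location drifts upward over time.
stable = [random.gauss(10, 1) for _ in range(200)]
drifting = [random.gauss(10 + i / 50, 1) for i in range(200)]

# Numeric stand-in for a run sequence plot: compare the mean (location)
# and standard deviation (variation) of the two halves of the sequence.
def half_check(ys):
    a, b = ys[: len(ys) // 2], ys[len(ys) // 2 :]
    return (statistics.mean(b) - statistics.mean(a),
            statistics.stdev(b) / statistics.stdev(a))

for name, ys in (("stable", stable), ("drifting", drifting)):
    shift, spread_ratio = half_check(ys)
    print(name, round(shift, 2), round(spread_ratio, 2))
```

A large mean shift between halves flags a drifting location; a spread ratio far from 1 flags changing variation – the same features a run sequence plot shows at a glance.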

Example- EDA

• Run sequence plot (compare the two)