Transcript Slide 1

8/31/2006

Lecture 2 – Modern Statistical Modeling, an Overview

Rice ELEC 697, Farinaz Koushanfar, Fall 2006

Summary

• A little bit of history
• The culture of statistical modeling
  – Classic
  – Modern
• Exploratory data analysis
  – Exploratory vs. confirmatory
  – Examples

A little bit of history

• Statistics is the science of learning from data to understand its meaning, structure, relationships, etc. – a pursuit hundreds of years old
• For a brief history of pre-20th century statistics, see: http://www.bized.ac.uk/timeweb/reference/statisticians.htm

• Statistics started to separate from math as an independent discipline ~70 years ago
• Like many other disciplines in science and engineering, statistics has undergone a major revolution in the past 30 years
• Earlier, most data was collected manually, and we were dealing with small data sets. Now, we have terabit-scale databases that we would like to capture and model

The scientists behind what I will talk about today…

John Wilder Tukey (1915-2000)

• "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem" – John Tukey
• Wrote his PhD thesis on Convergence and Uniformity in Topology at Princeton (1939)
• Recognized the importance of statistics during World War II
• Mathematics is just a tool to facilitate addressing sound problems
• Many contributions, including the fast Fourier transform, the jackknife, and exploratory data analysis

Reference: John W. Tukey, "We Need Both Exploratory and Confirmatory", The American Statistician, Vol. 34, No. 1 (Feb. 1980), pp. 23-25

The scientists behind what I will talk about today…

Leo Breiman (1928-2005)

• PhD in math in 1954 at Berkeley
• Became a professor of probability in the UCLA math department
• Left in 1967 – realized that abstract mathematics has very little to do with real life
• Wrote a book, then worked as an independent consultant for 13 years
• Finally could solve interesting and important real-world problems!!
• Took a Berkeley position in 1980, this time helping to found the right department for him

Reference: Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, Vol. 16, No. 3 (Aug. 2001), pp. 199-215

The culture of statistical modeling

• Statistics really starts with data
• Two main goals
  – Prediction (estimation)
  – Information (detection)
• Two different cultures
  – Stochastic models (classic): e.g., response var = f(predictor var, random noise, parameters); model selection, prediction, evaluation
  – Algorithmic models (modern): the relating function is an algorithm that operates on the input x to predict the response y

[Figure: x → Nature → y – nature maps inputs x to responses y]
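The contrast can be sketched in a few lines of Python (toy data and illustrative names, not anything from the lecture): the stochastic-model culture assumes a functional form and interprets its fitted parameters, while the algorithmic culture treats the x → y map as a black box judged by prediction.

```python
import random

random.seed(0)

# Toy data from an unknown "nature" mechanism: y ≈ 2x + 1 plus noise.
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.3) for x in xs]
n = len(xs)

# Data-modeling culture: assume y = b0 + b1*x + noise, estimate the
# parameters by least squares, and draw conclusions from b0 and b1.
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
     sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# Algorithmic-modeling culture: no assumed form; an algorithm (here a
# 1-nearest-neighbor rule) maps x to a predicted y and is judged only
# by predictive accuracy.
def predict_1nn(x0):
    i = min(range(n), key=lambda j: abs(xs[j] - x0))
    return ys[i]

print(round(b0, 2), round(b1, 2))  # fitted parameters (interpretable)
print(round(predict_1nn(5.0), 2))  # black-box prediction at x = 5
```

Both cultures answer the same prediction question; they differ in whether the fitted object is meant to describe nature or only to predict it.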

Breiman argues that,

The focus on classical data models and the neglect of modern methods has:
• Led to irrelevant theory and questionable scientific conclusions
• Kept statisticians from using more suitable models
• Prevented classical statisticians from working on exciting new problems
In this course, we will cover mostly the classics and a few modern methods

Back to the history

• Upon his return to academia, Breiman realized that all articles (at the time) began and ended with data models
• Data models have had success in analyzing data and extracting information about the mechanisms producing the data
• Misuse of data models has led to many questionable conclusions about the underlying system
• Algorithmic models are mostly developed in the machine learning community
• Modern learning has led to changes in perception!


The model becomes the truth!

• Invent or use a reasonably good parametric class of models for a complex mechanism
• Estimate the parameters and draw conclusions:
  – The conclusions are about the model's mechanism, not about nature's mechanism
  – If the model is a poor approximation of nature, the conclusions are wrong!

• Example:
  y = b0 + Σ (m = 1 to M) bm xm + ε,  ε ~ N(0, σ²)
• Assume that the data is iid following the above model
• The coefficients {bm} are to be estimated
• Tests of hypothesis, confidence intervals, the distribution of the residual sum of squares, etc.
• Thousands of articles are published on related proofs
• Conclusions are drawn ignoring whether the model is valid

More problems with classical data models

• Multiplicity of data models
  – Answering the question of which model is best
  – Each model gives a different picture of reality and leads to different conclusions
• Predictive accuracy
  – This is a function of the number of parameters used, so it is not a good measure alone
• Other limitations of data models (next slide)
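The predictive-accuracy caveat can be sketched with toy stdlib Python (illustrative data and model, not from the lecture): a piecewise-constant model with k bins has k parameters, and its error on the fitting data keeps falling as k grows even after held-out error starts rising.

```python
import random

random.seed(2)

# Toy data: y = 2x plus noise, x uniform on [0, 1).
def make_data(n):
    return [(x, 2 * x + random.gauss(0, 0.5))
            for x in (random.random() for _ in range(n))]

train, test = make_data(200), make_data(200)

# A piecewise-constant model with k bins has k free parameters:
# each bin predicts the mean of the training points that fall in it.
def fit_bins(data, k):
    sums, counts = [0.0] * k, [0] * k
    for x, y in data:
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in data) / len(data)
    return [sums[b] / counts[b] if counts[b] else overall for b in range(k)]

def mse(model, data, k):
    return sum((y - model[min(int(x * k), k - 1)]) ** 2
               for x, y in data) / len(data)

results = {}
for k in (2, 10, 100):
    model = fit_bins(train, k)
    results[k] = (mse(model, train, k), mse(model, test, k))
    print(k, round(results[k][0], 3), round(results[k][1], 3))
```

With enough parameters the training error can be driven arbitrarily low, so accuracy on the fitting data alone says little about the quality of a model.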

Limitations of data models

• Multivariate analysis is just not working
  – Nobody really believes in the multivariate Normal, but everybody uses it
  – If all a man has is a hammer, then every problem looks like a nail... As data becomes more complex, the simplicity of the model-based approach diminishes
  – Approaching the problem by looking for a data model restricts statisticians from dealing with more interesting and realistic problems

Algorithmic models

• Have been around for some time; pioneers among statisticians include Olshen, Friedman, Wahba, Zhang, and Singer
• Many new problems have been attacked, including speech, image, and handwriting recognition, nonlinear time series, and financial market prediction
• Shift from data models to the properties of the algorithms
• Characterizing convergence and complexity
• Example: Vapnik constructed informative bounds on the generalization error of classification algorithms that depend on the capacity of the algorithm! (Support Vector Machines)

Examples of recent advances

• Multiplicity of good models (Rashomon)
  – Bagging is a solution
• The conflict between simplicity and accuracy (Occam)
  – Occam dilemma: accuracy requires more complex predictors; simple and interpretable functions do not make accurate predictors
• Dimensionality – curse or blessing (Bellman)
  – How to extract and put together many small pieces of information
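A sketch of the Rashomon point in stdlib Python (toy data; the base learner and numbers are illustrative): many bootstrap refits of an unstable learner fit about equally well yet disagree, and bagging, i.e., averaging their predictions, typically beats any single one.

```python
import random

random.seed(3)

# Toy data: a noisy quadratic mechanism on [0, 1).
def sample(n):
    return [(x, x * x + random.gauss(0, 0.5))
            for x in (random.random() for _ in range(n))]

train, test = sample(100), sample(500)

# An unstable base learner: 1-nearest-neighbor regression.
def predict_1nn(data, x0):
    return min(data, key=lambda p: abs(p[0] - x0))[1]

# Bagging: refit the same learner on B bootstrap resamples of the
# training set and average the B predictions.
B = 25
boots = [[random.choice(train) for _ in train] for _ in range(B)]

def predict_bagged(x0):
    return sum(predict_1nn(b, x0) for b in boots) / B

mse_single = sum((y - predict_1nn(train, x)) ** 2 for x, y in test) / len(test)
mse_bagged = sum((y - predict_bagged(x)) ** 2 for x, y in test) / len(test)
print(round(mse_single, 3), round(mse_bagged, 3))
```

Instead of choosing among many equally plausible models, bagging exploits their disagreement: averaging washes out the variance of the individual fits.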

Breiman's concluding remarks

• Nowhere is it written in stone what kind of models should be used
• Breiman is not against data models, but he thinks the emphasis has to be on the problem and not on the model
• Find a way to manage complex environments
  – E.g., microarray data, Internet traffic, ad-hoc network complexity, ULSI variability, etc.
• The root of science is to check theory against reality
• We need this philosophy to address real-world problems

Exploratory data analysis (EDA)

• Analysis can be done by various techniques
  – Mathematical
  – Logical
  – Tabular
  – Graphical
  – …

EDA

• EDA mostly uses graphical techniques, but
• It is really a different philosophy of approaching the problem
• It differs from classical methods, also referred to as confirmatory data analysis (CDA)

EDA vs. CDA

• CDA
  – A general problem to explore
  – Collects some data
  – Makes a hypothesis on the models
  – Carries out an analysis of the data based on the models
  – Draws conclusions based on the model features
• EDA
  – A general problem to explore
  – Collects some data
  – Carries out an analysis of the data
  – Infers a model that is appropriate
  – Draws conclusions based on the data features

EDA vs. CDA (Cont’d)

• Rigor
  – CDA is rigorous, formal, and objective
  – EDA is suggestive, subject to the analyst's view
• Data treatment
  – In CDA, a few numbers summarize the data's properties
  – In EDA, all of the data is in focus
• Assumptions
  – In CDA, one discovers statistically significant variations from the assumed model, assuming it was correct
  – In EDA, the assumptions are few; analysis of the data has priority

Why Exploratory Data Analysis?

• EDA is oriented toward the future, rather than the past
  – Utilize data to understand, rather than summarize
  – Really important in research
• A good feel for the data is invaluable
  – Gain insights into the process behind the data
  – To understand what is NOT in the data
• Can (almost) only be obtained by graphical techniques
  – Graphs give information that no number can replace
  – Rely on the human ability to recognize patterns and to compare

Typical Assumptions for Measurement Process

• The data from a process is:
  – Random drawings (one data point should not influence another)
  – From a fixed distribution (and thus generalizable)
  – The distribution has a fixed location (the expectation is fixed)
  – And a fixed variation (the way the data differs from the expectation is fixed)
• We measure mean and variance to assess the last two assumptions

EDA Techniques

• Plot many aspects of the data using a variety of techniques, including scatter plots, bar plots, histograms, pie charts, and factor plots
• E.g., a run sequence plot for the mean and variance assumptions
  – All values yi are plotted on a chart: the y-axis is yi, plotted against the index i (x-axis)
  – Graphically check the fixed location
  – Graphically check the fixed variation
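A run sequence plot is just yi against i; the same checks can be sketched numerically (stdlib Python, simulated data) by comparing the location and spread of the first and second halves of the sequence against the fixed-location and fixed-variation assumptions.

```python
import random
import statistics

random.seed(4)

# A stable process vs. one whose location drifts upward over time.
stable = [random.gauss(10, 1) for _ in range(200)]
drifting = [random.gauss(10 + i / 50, 1) for i in range(200)]

# Numeric stand-in for a run sequence plot: compare the mean (location)
# and standard deviation (variation) of the two halves of the sequence.
def half_check(ys):
    a, b = ys[: len(ys) // 2], ys[len(ys) // 2 :]
    return (statistics.mean(b) - statistics.mean(a),
            statistics.stdev(b) / statistics.stdev(a))

for name, ys in (("stable", stable), ("drifting", drifting)):
    shift, spread_ratio = half_check(ys)
    print(name, round(shift, 2), round(spread_ratio, 2))
```

A large mean shift between halves flags a drifting location; a spread ratio far from 1 flags changing variation – the same features a run sequence plot shows at a glance.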

Example- EDA

• Run sequence plot (compare the two)