Transcript Slide 1
8/31/2006
Lecture 2 – Modern Statistical Modeling, an Overview
Rice ELEC 697, Farinaz Koushanfar, Fall 2006
Summary
• A little bit of history
• The culture of statistical modeling
  – Classic
  – Modern
• Exploratory data analysis
  – Exploratory vs. confirmatory
  – Examples
A little bit of history
• Statistics is the science of learning from data to understand its meaning, structure, relationships, etc. – hundreds of years old
• For a brief history of pre-20th century statistics, see: http://www.bized.ac.uk/timeweb/reference/statisticians.htm
• Statistics started to separate from mathematics as an independent discipline about 70 years ago
• Like many other disciplines in science and engineering, statistics has undergone a major revolution in the past 30 years
• Earlier, most data was collected manually and data sets were small. Now we have terabyte-scale databases that we would like to capture and model
The scientists behind what I will talk about today…
• "An appropriate answer to the right problem is worth a good deal more than an exact answer to an approximate problem" – John Tukey
• Wrote his PhD thesis, "Convergence and Uniformity in Topology," at Princeton (1939)
• Recognized the importance of statistics during World War II
• Mathematics is just a tool to facilitate addressing sound problems
• Many contributions, including the fast Fourier transform, the jackknife, and exploratory data analysis
John Wilder Tukey, 1915–2000
Reference: John W. Tukey, "We Need Both Exploratory and Confirmatory," The American Statistician, Vol. 34, No. 1 (Feb. 1980), pp. 23–25
The scientists behind what I will talk about today…
• PhD in math in 1954 at Berkeley
• Became a professor of probability in the UCLA math department
• Left in 1967 – realized that abstract mathematics has very little to do with real life
• Wrote a book, then worked as an independent consultant for 13 years
• Finally could solve interesting and important real-world problems!!
• Got a Berkeley position in 1980, this time in the right department for him
Leo Breiman, 1928–2005
Reference: Leo Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, Vol. 16, No. 3 (Aug. 2001), pp. 199–215
The culture of statistical modeling
• Statistics really starts with data
• Two main goals
  – Prediction (estimation)
  – Information (detection)
• Two different cultures
  – Stochastic models, e.g., response var = f(predictor var, random noise, parameters); model selection, prediction, evaluation (classic)
  – Algorithmic models: the relating function is an algorithm that operates on the input x to predict the response y (modern)
[Diagram: input x → Nature (unknown mechanism) → response y]
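The algorithmic culture can be made concrete with a tiny sketch (not from the lecture; data and names are illustrative): a 1-nearest-neighbor predictor treats the map from x to y as an algorithm, with no assumed functional form or noise distribution.

```python
def nn_predict(train_x, train_y, x):
    """1-nearest-neighbor: return the response of the closest training input.
    No model form or noise distribution is assumed."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.1, 0.9, 4.2, 8.8]  # toy responses
print(nn_predict(train_x, train_y, 2.2))  # closest training input is 2.0 → 4.2
```

The "model" here is nothing but the algorithm itself, which is exactly Breiman's contrast with the stochastic-model culture.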
Breiman argues that,
The focus on classical data models and ignorance of modern methods has:
• Led to irrelevant theory and questionable scientific conclusions
• Kept statisticians from using more suitable models
• Prevented classical statisticians from working on exciting new problems
• In this course, we will cover mostly classics and a few modern methods
Back to the history
• Upon his return to academia, Breiman realized that all articles (at the time) began and ended with data models
• Data models have had success in analyzing data and getting information about the process producing the data
• Misuse of data models has led to many questionable conclusions about the underlying system
• Algorithmic models are mostly developed in the machine learning community
• Modern learning has led to changes in perception!
The model becomes the truth!
• Invent or use a reasonably good parametric class of models for a complex mechanism
• Estimate the parameters and draw conclusions:
  – The conclusions are about the model's mechanism, not about nature's mechanism
  – If the model is a poor approximation of nature, the conclusions are wrong!
• Example: assume that the data is iid following the model
  y = b0 + Σ (m = 1..M) bm·xm + ε,  ε ~ N(0, σ²)
• The coefficients {bm} are to be estimated
• Tests of hypothesis, confidence intervals, distribution of the residual sum of squares, etc.
• Thousands of articles are published on related proofs
• Conclusions are drawn ignoring whether the models are valid
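As a concrete instance of this model (an illustrative single-predictor sketch, not from the lecture), a least-squares fit of y = b0 + b1·x can look like this; the data is simulated so the model assumptions actually hold:

```python
import random

def fit_linear(xs, ys):
    """Ordinary least-squares estimates of b0, b1 in y = b0 + b1*x + eps."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]  # data truly follows the model
b0, b1 = fit_linear(xs, ys)
print(b0, b1)  # estimates should land close to 2.0 and 3.0
```

Breiman's point is that the hypothesis tests and intervals built on such a fit are statements about this model; if nature did not generate the data this way, those conclusions need not hold.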
More problems with classical data models
• Multiplicity of data models
  – Which model is the best? Hard to answer
  – Each model gives a different picture of reality and leads to different conclusions
• Predictive accuracy
  – Accuracy is a function of the number of parameters used, so it is not a good measure alone
• Other limitations of data models (next slide)
Limitations of data models
• Multivariate analysis is just not working
  – Nobody really believes in the multivariate Normal, but everybody uses it
  – If all a man has is a hammer, then every problem looks like a nail...
• As data becomes more complex, the simplicity of the model-based approach diminishes
• Approaching a problem by looking for a data model restricts statisticians from dealing with more interesting and realistic problems
Algorithmic models
• Have been around for some time; pioneers among statisticians include Olshen, Friedman, Wahba, Zhang, and Singer
• Many new problems have been attacked, including speech, image, and handwriting recognition, nonlinear time series, and financial market prediction
• Shift from data models to the properties of the algorithms
• Characterizing convergence and complexity
• Example: Vapnik constructed informative bounds on the generalization error of a classification algorithm that depend on the capacity of the algorithm! (Support Vector Machines)
Examples of recent advances
• Multiplicity of good models (Rashomon)
  – Bagging is a solution
• The conflict between simplicity and accuracy (Occam)
  – Occam dilemma: accuracy requires more complex predictors; simple, interpretable functions do not make accurate predictors
• Dimensionality – curse or blessing (Bellman)
  – How to extract and put together many small pieces of information
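Bagging (bootstrap aggregating) can be sketched in a few lines. This is an illustrative toy, using a 1-nearest-neighbor base predictor rather than the trees Breiman actually used: fit the base model on bootstrap resamples of the training data and average the predictions.

```python
import random
import statistics

def bagged_predict(train_x, train_y, x, n_models=25, seed=1):
    """Bagging: average the base model's predictions over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(train_x)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        bx = [train_x[i] for i in idx]
        by = [train_y[i] for i in idx]
        j = min(range(n), key=lambda i: abs(bx[i] - x))  # 1-NN base model
        preds.append(by[j])
    return statistics.mean(preds)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.1, 3.9, 9.2, 15.8]
print(bagged_predict(train_x, train_y, 2.5))
```

Averaging over resamples is what reconciles the "many equally good models" (Rashomon) observation: instead of picking one, bagging combines them.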
Breiman concluding remarks
• Nowhere is it written in stone what kind of models should be used
• Breiman is not against data models, but he thinks the emphasis has to be on the problem, not the model
• Find a way to manage complex environments
  – E.g., microarray data, Internet traffic, ad-hoc network complexity, ULSI variability, etc.
• The root of science is to check theory against reality
• We need this philosophy to address real-world problems
Exploratory data analysis (EDA)
• Analysis can be done by various techniques
  – Mathematical
  – Logical
  – Tabular
  – Graphical
  – …
EDA
• EDA mostly uses graphical techniques, but
• It is really a different philosophy for approaching a problem
• Differs from classical methods, also referred to as confirmatory data analysis (CDA)
EDA vs. CDA
• CDA
  – A general problem to explore
  – Collects some data
  – Makes a hypothesis on the models
  – Carries out an analysis of the data based on the models
  – Draws conclusions based on the model features
• EDA
  – A general problem to explore
  – Collects some data
  – Carries out an analysis of the data
  – Infers a model that is appropriate
  – Draws conclusions based on the data features
EDA vs. CDA (Cont’d)
• Rigor
  – CDA is rigorous, formal, and objective
  – EDA is suggestive and subject to the analyst's view
• Data treatment
  – In CDA, a few numbers summarize the data's properties
  – In EDA, all the data is in focus
• Assumptions
  – In CDA, one discovers statistically significant deviations from the assumed model, assuming it was correct
  – In EDA, the assumptions are few; analysis of the data has priority
Why Exploratory Data Analysis?
• EDA is oriented toward the future rather than the past
  – Utilize data to understand, rather than to summarize
  – Really important in research
• A good feel for data is invaluable
  – Gain insights into the process behind the data
  – Understand what is NOT in the data
• Can (almost) only be obtained by graphical techniques
  – Graphs give information that no number can replace
  – Rely on the human ability to recognize patterns and to compare
Typical Assumptions for Measurement Process
• The data from a process is assumed to be:
  – Random drawings (one data point should not influence another)
  – From a fixed distribution (and thus generalizable)
  – With a fixed location (the expectation is fixed)
  – With a fixed variation (the way the data differs from the expectation is fixed)
• We measure mean and variance to assess the last two assumptions
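A crude numeric version of those last two checks (a sketch, not from the lecture; the thresholds are arbitrary, not principled tests) compares the mean and variance of the first and second halves of the series:

```python
import statistics

def check_stability(y, tol_mean=0.5, tol_var=2.0):
    """Rough check of the fixed-location / fixed-variation assumptions:
    compare mean and variance across the two halves of the series.
    Thresholds are illustrative only."""
    h = len(y) // 2
    a, b = y[:h], y[h:]
    mean_shift = abs(statistics.mean(a) - statistics.mean(b))
    var_ratio = statistics.variance(a) / statistics.variance(b)
    return mean_shift < tol_mean and (1 / tol_var) < var_ratio < tol_var

stable = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 1.0]
drifting = [1.0, 1.1, 0.9, 1.0, 3.0, 3.2, 2.9, 3.1]
print(check_stability(stable), check_stability(drifting))  # prints: True False
```

Graphical checks like the run sequence plot convey the same information at a glance, without having to pick thresholds.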
EDA Techniques
• Plot many aspects of the data with a variety of techniques, including scatter plots, bar plots, histograms, pie charts, and factor plots
• E.g., a run sequence plot for the mean and variance assumptions
  – All values yi are plotted with yi on the y-axis against the index i on the x-axis
  – Graphically check for a fixed location
  – Graphically check for a fixed variation
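A run sequence plot is ordinarily drawn with a graphics package; as a self-contained sketch, a text version (bar length proportional to yi, one row per index i) shows the same idea: drift in location or spread appears as a trend down the page.

```python
def run_sequence(y, width=40):
    """Text run-sequence plot: one row per observation i, bar length
    proportional to y_i (rescaled to [0, width])."""
    lo, hi = min(y), max(y)
    rows = []
    for i, v in enumerate(y):
        n = round((v - lo) / (hi - lo) * width) if hi > lo else 0
        rows.append(f"{i:3d} |" + "*" * n)
    return "\n".join(rows)

print(run_sequence([1.0, 1.2, 0.9, 2.5, 2.6, 2.4]))  # location shifts mid-series
```

The jump in bar lengths halfway down is exactly the violation of the fixed-location assumption that the plot is meant to expose.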
Example- EDA
• Run sequence plot (compare the two) [plots not reproduced in the transcript]