Theory-Based Inference


Using simulation to teach inference about
correlation and regression in introductory
statistics courses
Soma Roy
Department of Statistics, Cal Poly, San Luis Obispo
Joint Mathematics Meetings, San Antonio, Texas
January 10, 2015
Overview
• Use of simulation/randomization-based methods of inference
• Current status in statistics education
• A key feature of our approach
• Inference about correlation and regression
• Examples we use
• Some issues to consider
JMM: January 10, 2015
Simulation/Randomization-based
methods of inference
• Momentum towards the use of these methods has been building since Cobb (2007)
• Using simulation/randomization-based methods to put “the core logic of inference at the center” of our curriculum
• Higher computational power now makes it easier for statistics instructors to use these methods in class
• However, many instructors (of those who use these methods at all) still use them only to introduce inference in simpler contexts (one proportion, one mean, etc.) and then move on to using only traditional theory-based methods of inference
Simulation/Randomization-based
methods of inference (contd.)
• Our approach: Ask the same key question every time: “Is the observed result surprising (unlikely) to have happened by random chance alone?”
• Investigate first through simulation/randomization, and then through theory-based methods, every time
  - Start with a tactile simulation/randomization using coins, dice, cards, etc.
  - Follow up with technology: purposefully designed (free) web applets (instead of commercial software) that are self-explanatory and offer lots of visual explanation
  - Wrap up with the “theory-based” method, if available
Who are “we”?
• Nathan Tintle (previously at Hope College, now at Dordt College)
• George Cobb (Mt. Holyoke College)
• Todd Swanson, Jill VanderStoep (Hope College)
• Beth Chance, Allan Rossman, Soma Roy (Cal Poly, San Luis Obispo)
Introduction to Statistical Investigations
(John Wiley and Sons)
Example: Draft Lottery
• In 1970, the United States Selective Service conducted a lottery to decide which young men would be drafted into the armed forces (Fienberg, 1971). Each of the 366 birthdays in a year (including February 29) was assigned a draft number. Young men born on days assigned low draft numbers were drafted.
Example: Draft Lottery (contd.)
• In a fair, random lottery, we would expect the correlation coefficient between the draft number and the sequential date of the birthday to be close to zero.
• Did this happen? Let us look at the scatterplot of the draft number assigned to each birthday against the sequential date of the birthday.
• Correlation coefficient: r = -0.226!
Example: Draft Lottery (contd.)
• Question to students: What are two possible explanations for finding r = -0.226?
  - Random chance
  - The process of randomly assigning draft numbers to dates was flawed
• To investigate random chance as a plausible explanation, we ask: is r = -0.226 surprising (unlikely) to have happened by random chance alone?
• How can we decide?
Example: Draft Lottery (contd.)
• We could repeat (many times) the random assignment of draft numbers to sequential dates.
  - That is, we simulate under the null hypothesis: the lottery was fair and random.
• Each time, we could record what the r value turns out to be by chance alone.
• Thus, we generate a pattern of which values of r are typical and which are atypical to happen by chance alone.
• Corr/Regression applet!
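The repeated random assignment described above can also be sketched in code. The following is a minimal Python sketch, not the Corr/Regression applet itself: it assumes a fair lottery is a random permutation of the draft numbers 1 to 366 across the 366 sequential dates, and it uses only the observed r = -0.226 quoted on the slides. The function name, the number of repetitions, and the seed are illustrative choices.

```python
import numpy as np

def simulate_null_correlations(n_days=366, n_reps=5000, seed=0):
    """Simulate r under the null hypothesis of a fair, random lottery."""
    rng = np.random.default_rng(seed)
    days = np.arange(1, n_days + 1)            # sequential dates 1..366
    null_r = np.empty(n_reps)
    for i in range(n_reps):
        draft_numbers = rng.permutation(days)  # one fair random assignment
        null_r[i] = np.corrcoef(days, draft_numbers)[0, 1]
    return null_r

observed_r = -0.226                            # value reported on the slides
null_r = simulate_null_correlations()

# Two-sided p-value: how often chance alone gives an r at least this extreme
p_value = np.mean(np.abs(null_r) >= abs(observed_r))
print(f"SD of simulated r values: {null_r.std():.3f}")
print(f"Approximate p-value: {p_value:.4f}")
```

The simulated r values form the null distribution; an observed r far out in its tail is unlikely to have happened by random chance alone.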
Example: Draft Lottery (contd.)
• Corr/Regression applet!
• Very strong evidence that the assignment of draft numbers to birthdays was not random.
How students do
• Last topic of the quarter; I spend 2 hours of class time on this.
• In my experience, by this point students have a very good grasp of how to use simulation/randomization to investigate “Are the results surprising to have happened by chance alone?”
• Students tend to do well.
How students do (contd.)
• Final exam question:
  - Data from Mythbusters:
  - Describe a step-by-step tactile simulation strategy (using either coins, or spinners, or cards) to find a p-value to investigate whether there is a relationship between the distance a car is from a semi-truck and the car’s gas mileage.
Sample student response
• Random shuffling; recording the statistic for the shuffled data; repetition; comparing the observed statistic value to the null distribution
Some things to be aware of when
using this approach
• When comparing inference results from random shuffling to theory-based results
  - Assuming it is valid to use theory-based methods
• Another simulation strategy:
  - Use random sampling instead of random shuffling
Random shuffling vs. Theory-based:
• Example: Predicting child’s height (in.) from biological father’s height (in.)
• Similar results (SE for slope, p-value)
Random shuffling vs. Theory-based
(contd.):
• Example: Predicting height (in.) from length of index finger (cm)
• Different results; random shuffling gives a larger SE for the slope, and hence a larger p-value
Random shuffling vs. Theory-based
(contd.):
• Results from random shuffling tend to match the theory-based methods (“Regression Table”) more closely when the sample data show a fairly weak correlation; the discrepancy increases as the correlation gets stronger.
• Random shuffling is more conservative than the theory-based (t-test) method
• Addressing this issue with students
  - Stat 101: I don’t address this
  - Calculus-based intro stats course: I do
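The discrepancy between random shuffling and the regression table can be explored with a small simulation. The following is a minimal Python sketch on synthetic data, not the data from the talk: the sample size, the two correlation strengths (0.1 and 0.8), and all function names are assumptions chosen for illustration. It compares the usual theory-based SE of the slope with the SD of slopes obtained from random shuffling.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc ** 2)

def theory_se(x, y):
    """Regression-table standard error of the slope."""
    n = len(x)
    b = slope(x, y)
    residuals = (y - y.mean()) - b * (x - x.mean())
    s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    return s_e / np.sqrt(np.sum((x - x.mean()) ** 2))

def shuffle_se(x, y, rng, n_reps=2000):
    """SD of the slopes obtained by repeatedly shuffling y against x."""
    slopes = [slope(x, rng.permutation(y)) for _ in range(n_reps)]
    return np.std(slopes, ddof=1)

def compare(rho, n=50, seed=1):
    """Generate synthetic data with population correlation rho (hypothetical)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    return theory_se(x, y), shuffle_se(x, y, rng)

results = {rho: compare(rho) for rho in (0.1, 0.8)}
for rho, (se_theory, se_shuffle) in results.items():
    print(f"rho = {rho}: theory SE = {se_theory:.3f}, "
          f"shuffle SE = {se_shuffle:.3f}, "
          f"ratio = {se_shuffle / se_theory:.2f}")
```

One way to see why the gap grows: shuffling breaks the association, so the shuffled slopes vary according to the full spread of the response, whereas the theory-based SE uses the residual spread, which shrinks as the sample correlation gets stronger.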
Another simulation-based strategy:
Random sampling
• Using random sampling instead of random shuffling
• Random sampling: we repeatedly take random samples from a hypothetical population that has features similar to those of our sample, except that the slope is set to 0 (or another hypothesized value)
• Various options
  - Sampling from a uniform distribution on the explanatory variable (X)
  - Sampling from a bivariate normal population
  - Other options…
Random sampling: an example
• Example (contd.): Predicting height (in.) from length of index finger (cm)
• y-bar = 68.4 in.; x-bar = 7.6 cm; s_x = 1 cm; s = 3.5 in.
• Sampling from a bivariate normal population
Random sampling: an example
• Example (contd.): Predicting height (in.) from length of index finger (cm)
• For inference: sample repeatedly from the created population; record the slope for each sample
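The random-sampling strategy can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the applet's implementation: it uses the summary statistics quoted on the slides, interprets s = 3.5 in. as the SD of heights (an assumption), and uses a hypothetical sample size of 30, since the slides do not state the actual sample size. The function name and default arguments are illustrative.

```python
import numpy as np

# Summary statistics from the slides
X_BAR, S_X = 7.6, 1.0    # index finger length (cm)
Y_BAR, S_Y = 68.4, 3.5   # height (in.); assumes s is the SD of heights

def sample_null_slopes(n, n_reps=5000, null_slope=0.0, seed=0):
    """Repeatedly sample from a bivariate normal population whose
    slope equals null_slope, and record each sample's fitted slope."""
    rng = np.random.default_rng(seed)
    # Population slope = covariance / variance of x
    cov_xy = null_slope * S_X ** 2
    mean = [X_BAR, Y_BAR]
    cov = [[S_X ** 2, cov_xy], [cov_xy, S_Y ** 2]]
    slopes = np.empty(n_reps)
    for i in range(n_reps):
        sample = rng.multivariate_normal(mean, cov, size=n)
        x, y = sample[:, 0], sample[:, 1]
        slopes[i] = np.polyfit(x, y, 1)[0]  # polyfit returns [slope, intercept]
    return slopes

# n = 30 is a hypothetical sample size, used only for illustration
null_slopes = sample_null_slopes(n=30)
print(f"Mean of simulated slopes: {null_slopes.mean():.3f}")
print(f"SD of simulated slopes (simulation-based SE): {null_slopes.std():.3f}")
```

A p-value would then be the proportion of simulated slopes at least as extreme as the slope observed in the actual sample.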
In summary…
Advantages of using simulation/randomization-based
methods to teach inference about correlation/regression
• Helps students see that the core logic of inference stays the same regardless of data type and data structure
• Avoids the potential misconception that randomization-based tests only work for very simple scenarios, and that theory-based methods are needed in other cases.
Acknowledgements
• Thank you for listening!
• National Science Foundation DUE/TUES-114069, 1323210
Resources
• Course materials: Introduction to Statistical Investigations (John Wiley and Sons) by Tintle, Chance, Cobb, Rossman, Roy, Swanson, VanderStoep
• Samples of our materials as well as slides for various conference presentations are available at: http://www.math.hope.edu/isi/
• Applets are available at: http://www.rossmanchance.com/ISIapplets.html
• Simulation-based inference blog: www.causeweb.org/sbi/
• My email address: [email protected]