STAT 101: Data Analysis and Statistical Inference

Confidence Intervals I
2/1/12
• Correlation (continued)
• Population parameter versus sample statistic
• Uncertainty in estimates
• Sampling distribution
• Confidence interval
Section 3.1
Professor Kari Lock Morgan
Duke University
Correlation Guessing Game
http://istics.net/gett/gcstart.php?group_id=duke
Highest scorer in
the class gets one
extra point on the
first exam!
NFL Teams
[Scatterplot: z-score for Penalty Yards vs. Malevolence Rating of Uniform]
Correlation: r = 0.43
[Scatterplot: z-score for Penalty Yards vs. Malevolence Rating of Uniform]
Correlation: r = 0.08
Same plot, but with Dolphins and Raiders (outliers) removed
Human Cannonball
[Scatterplot: Y vs. X]
What is the correlation
between X and Y?
(a) r > 0
(b) r < 0
(c) r = 0
Are X and Y associated?
(a) Yes
(b) No
Correlation Cautions
1. Correlation can be heavily affected by
outliers. Always plot your data!
2. r = 0 means no linear association. The
variables could still be otherwise associated.
Always plot your data!
3. Correlation does not imply causation!
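The first two cautions can be checked numerically. The sketch below uses made-up data (not the NFL or cannonball data from the slides): one set of unrelated points with a single added outlier, and one perfectly parabolic relationship. The helper name `pearson_r` and all values are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)

# Caution 1: a single outlier can manufacture a correlation.
# x and y are generated independently, so any correlation is chance.
x = rng.normal(size=30)
y = rng.normal(size=30)
r_without = pearson_r(x, y)

# Add one extreme point far from the rest (hypothetical outlier).
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_with = pearson_r(x_out, y_out)

# Caution 2: r = 0 only rules out a LINEAR association.
# v is completely determined by u, yet the correlation is zero
# because the parabola is symmetric about u = 0.
u = np.linspace(-3, 3, 61)
v = u ** 2
r_parabola = pearson_r(u, v)

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
print(f"r for parabola:    {r_parabola:.2f}")
```

A scatterplot of either data set would reveal the problem immediately, which is the point of "Always plot your data!"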
Summary: Two Quantitative Variables
• Summary Statistics
– Correlation
• Visualization
– Scatterplot
Variable(s) | Visualization | Summary Statistics
Categorical | bar chart, pie chart | frequency table, relative frequency table, proportion
Quantitative | dotplot, histogram, boxplot | mean, median, max, min, standard deviation, range, IQR, five number summary
Categorical vs Categorical | side-by-side bar chart, segmented bar chart, mosaic plot | two-way table, difference in proportions
Quantitative vs Categorical | side-by-side boxplots | statistics by group
Quantitative vs Quantitative | scatterplot | correlation
The Big Picture
[Diagram: Population → (Sampling) → Sample → (Statistical Inference) → Population]
Parameter vs Statistic
• A sample statistic is a number computed
from sample data.
• A population parameter is a number that
describes some aspect of a population
• We usually have a sample statistic and
want to make inferences about the
population parameter
The Big Picture
[Diagram: Population (PARAMETERS) → (Sampling) → Sample (STATISTICS) → (Statistical Inference) → Population]
Parameter vs Statistic
Parameters (Greek letters): μ, σ, ρ, β
Obama’s Approval Rating
• Gallup surveyed 1500 Americans between Jan
28-30, 2012, and 46% of these people approve of
the job Barack Obama is doing as president
Statistic: p̂ = 0.46
• What do you think is the true proportion of
Americans who approve of the job Barack Obama
is doing as president?
Parameter: p = ???
http://www.gallup.com/poll/113980/Gallup-Daily-Obama-Job-Approval.aspx
Point and Interval Estimates
• The sample statistic gives a point estimate of
the population parameter (a single number)
• Usually, it is more useful to provide an interval
estimate which gives a range of plausible values
for the population parameter:
statistic ± margin of error
• How do we determine the margin of error???
Obama’s Approval Rating
Point Estimate: p̂ = 0.46
Interval Estimate: 0.46 ± 0.03 = (0.43, 0.49)
(statistic ± ME)
• Between 43% and 49% of Americans currently
approve of the job Obama is doing as president
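As a small worked sketch of "statistic ± margin of error", the function below turns a point estimate and a margin of error into an interval. The function name `interval_estimate` is an illustrative choice. The two numbers are the ones on the slide.

```python
def interval_estimate(statistic, margin_of_error):
    """Return (lower, upper) for statistic ± margin of error."""
    return (statistic - margin_of_error, statistic + margin_of_error)

# Values from the slide: p-hat = 0.46, margin of error = 0.03.
low, high = interval_estimate(0.46, 0.03)
print(f"({low:.2f}, {high:.2f})")  # prints (0.43, 0.49)
```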
IMPORTANT POINTS
• Sample statistics vary from sample to
sample. (they will not match the parameter
exactly)
• KEY QUESTION: For a given sample
statistic, what are plausible values for the
population parameter? How much
uncertainty surrounds the sample statistic?
• KEY ANSWER: It depends on how much the
statistic varies from sample to sample!
Reese’s Pieces
• What proportion of Reese’s pieces are
orange?
• Take a random sample of 10 Reese’s pieces
• What is your sample proportion? (class
dotplot)
• Give a range of plausible values for the
population proportion
Sampling Distribution
• A sampling distribution is the
distribution of statistics computed for
different samples of the same size taken
from the same population
• The sampling distribution shows us how
the statistic varies from sample to sample
• We can use the spread of the sampling
distribution to determine the margin of
error for a statistic
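A sampling distribution like the class dotplot can be approximated by simulation. This is only a sketch under a stated assumption: the true proportion of orange pieces is not given in these slides, so `TRUE_P = 0.5` is a hypothetical value chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(101)

TRUE_P = 0.5        # HYPOTHETICAL population proportion (not from the slides)
N = 10              # pieces per sample, as in the class activity
NUM_SAMPLES = 5000  # number of simulated samples

# Each binomial draw is the count of orange pieces in one sample of size N,
# so dividing by N gives one sample proportion per simulated sample.
counts = rng.binomial(n=N, p=TRUE_P, size=NUM_SAMPLES)
sample_props = counts / N

# Center and spread of the simulated sampling distribution.
center = float(sample_props.mean())
spread = float(sample_props.std(ddof=1))

print(f"mean of sample proportions: {center:.3f}")
print(f"SD of sample proportions:   {spread:.3f}")
```

Each entry of `sample_props` plays the role of one dot in the dotplot: one sample statistic, not one piece of candy.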
Sampling Distribution
In the Reese’s pieces sampling
distribution, what does each dot
represent?
a) One Reese’s piece
b) One sample statistic
Sampling Distribution
The higher the standard deviation of the
sampling distribution, the
(a) higher
(b) lower
the margin of error
Sample Size
http://www.rossmanchance.com/applets/Reeses/ReesesPieces.html
n = 10
n = 50
n = 100
• For a larger sample size you get less variability in
the statistics, so less uncertainty in your estimate
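The effect of sample size can be sketched by repeating a simulation for n = 10, 50, and 100 and comparing the spread of the sample proportions. The population proportion `TRUE_P = 0.5` and the helper name `sd_of_sample_proportions` are assumptions made for illustration only.

```python
import numpy as np

def sd_of_sample_proportions(n, true_p, num_samples, rng):
    """Simulate num_samples samples of size n; return the SD of their proportions."""
    counts = rng.binomial(n=n, p=true_p, size=num_samples)
    return float((counts / n).std(ddof=1))

rng = np.random.default_rng(2012)
TRUE_P = 0.5  # HYPOTHETICAL population proportion

# Same population, three different sample sizes (the ones on the slide).
spreads = {n: sd_of_sample_proportions(n, TRUE_P, 5000, rng) for n in (10, 50, 100)}

for n, sd in spreads.items():
    print(f"n = {n:3d}: SD of sample proportions = {sd:.3f}")
```

The spread shrinks as n grows, which is what "less variability in the statistics" means in practice.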
Sampling Distribution
• A sampling distribution is the
distribution of statistics computed for
different samples of the same size taken
from the same population
• The sampling distribution shows us how
the statistic varies from sample to sample
• This gives us an idea of the uncertainty
surrounding the estimate of a parameter
Random Samples
• If you take random samples, the
sampling distribution will be centered
around the true population parameter
• If sampling bias exists (if you do not take
random samples), your sampling
distribution may give you bad information
about the true parameter
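The contrast between random and biased sampling can be sketched with a made-up population of word lengths. Everything here is hypothetical: the population is randomly generated, and the "biased" scheme (longer words are more likely to be picked) is just one assumed form of sampling bias.

```python
import numpy as np

rng = np.random.default_rng(1863)

# HYPOTHETICAL population: lengths (in letters) of 300 words, values 1..11.
word_lengths = rng.integers(1, 12, size=300)
pop_mean = float(word_lengths.mean())

SAMPLE_SIZE = 10
NUM_SAMPLES = 2000

# Random sampling: every word is equally likely to be chosen.
random_means = np.array([
    rng.choice(word_lengths, size=SAMPLE_SIZE, replace=False).mean()
    for _ in range(NUM_SAMPLES)
])

# Biased sampling (assumed scheme): selection probability proportional to length.
weights = word_lengths / word_lengths.sum()
biased_means = np.array([
    rng.choice(word_lengths, size=SAMPLE_SIZE, replace=False, p=weights).mean()
    for _ in range(NUM_SAMPLES)
])

random_center = float(random_means.mean())
biased_center = float(biased_means.mean())

print(f"population mean:                 {pop_mean:.2f}")
print(f"center under random sampling:    {random_center:.2f}")
print(f"center under biased sampling:    {biased_center:.2f}")
```

Under random sampling the sampling distribution centers on the parameter. Under the biased scheme it centers somewhere else, no matter how many samples are taken.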
Lincoln’s Gettysburg Address
Confidence Interval
• A confidence interval for a parameter is an
interval computed from sample data by a
method that will capture the parameter for a
specified proportion of all samples
• The success rate (the proportion of all samples
whose intervals contain the parameter) is
known as the confidence level
• A 95% confidence interval will contain the
true parameter for 95% of all samples
Confidence Intervals
http://bcs.whfreeman.com/ips4e/cat_010/applets/confidenceinterval.html
Sampling Distribution
Parameter
• The parameter is fixed
• The statistic is random
(depends on the sample)
• The interval is random
(depends on the sample)
Sampling Distribution
If you had access to the sampling distribution,
how would you find the margin of error to
ensure that intervals of the form
statistic ± margin of error
would capture the parameter for 95% of all
samples?
Standard Error
• The standard error (SE) of a statistic is the
standard deviation of the sample statistic
• A 95% confidence interval can be created by
statistic ± 2 × SE
http://bcs.whfreeman.com/ips4e/cat_010/applets/confidenceinterval.html
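The claim that statistic ± 2 × SE captures the parameter for about 95% of samples can be checked by simulation. This sketch assumes a known population proportion (`TRUE_P = 0.4`, hypothetical) so that we can see which intervals succeed. In a real study the parameter is unknown.

```python
import numpy as np

rng = np.random.default_rng(95)

TRUE_P = 0.4         # HYPOTHETICAL parameter, fixed and (in real life) unknown
N = 1000             # size of each sample
NUM_SAMPLES = 10000  # number of simulated samples

# One sample proportion per simulated sample.
sample_props = rng.binomial(n=N, p=TRUE_P, size=NUM_SAMPLES) / N

# Standard error = standard deviation of the sample statistic.
se = float(sample_props.std(ddof=1))

# One interval per sample: statistic ± 2 × SE.
lower = sample_props - 2 * se
upper = sample_props + 2 * se

# Proportion of intervals that contain the true parameter.
captured = (lower <= TRUE_P) & (TRUE_P <= upper)
coverage = float(captured.mean())

print(f"SE = {se:.4f}")
print(f"proportion of intervals capturing the parameter = {coverage:.3f}")
```

The coverage should land close to 0.95. The parameter never moves; the intervals do.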
Economy
A recent survey of 1,502 Americans in January 2012
found that 86% consider the economy a “top
priority” for the president and Congress this year.
The standard error for this statistic is 0.01.
What is the 95% confidence interval for the true
proportion of all Americans that consider the
economy a “top priority” for the president and
Congress this year?
(a) (0.85, 0.87)
(b) (0.84, 0.88)
(c) (0.82, 0.90)
http://www.people-press.org/2012/01/23/public-priorities-deficit-rising-terrorismslipping/
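The question above can be worked through with the statistic ± 2 × SE rule. The function name `ci_95` is an illustrative choice. The statistic (0.86) and standard error (0.01) are the values given on the slide.

```python
def ci_95(statistic, standard_error):
    """Approximate 95% confidence interval: statistic ± 2 × SE."""
    margin_of_error = 2 * standard_error
    return (statistic - margin_of_error, statistic + margin_of_error)

# Values from the slide: 86% of the sample, SE = 0.01.
low, high = ci_95(0.86, 0.01)
print(f"({low:.2f}, {high:.2f})")  # prints (0.84, 0.88)
```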
Summary
• To create a plausible range of values for a
parameter:
• Take many random samples from the population,
and compute the sample statistic for each sample
• Compute the standard error as the standard
deviation of all these statistics
• Use statistic ± 2 × SE
• One small problem…
Reality
… WE ONLY HAVE ONE SAMPLE!!!!
• How do we know how much sample
statistics vary, if we only have one
sample?!?
… to be continued
Project 1
• Pose a question that you would like to investigate.
If possible, choose something related to your
major!
• Find or collect data that will help you answer this
question (you may need to edit your question
based on available data)
– If using existing data, you have to find your own (do
not use a dataset already used in this class)
– If collecting data, wait until your proposal has been
approved to collect the data
• You can choose either a single variable or a
relationship between two variables
Project 1
• The result will be a five page paper including
– Description of the data collection method, and the
implications this has for statistical inference
– Descriptive statistics (summary stats, visualization)
– Confidence intervals
– Hypothesis testing (following week)
– Distribution-based inference (after Exam 1)
• Proposal due 2/15
– Can submit earlier if you want feedback sooner
– Include data if you are using existing data
– If collecting your own data, proposal should include
a detailed data collection plan
To Do
• Homework 2 (due Monday)
• Idea and data for Project 1 (proposal due 2/15)
FINDING DATA
http://library.duke.edu/data/
Joel Herndon