Inferences from sample data

• Confidence Intervals
• Hypothesis Testing
• Regression Model
Confidence Intervals
• How well does the sample mean xbar represent the
true population mean µ?
• Can use CIs to determine “how close” we
are to the true mean
• General form of a confidence interval
– sample statistic ± (multiplier based on confidence
level) x (standard error of statistic)
– sampling distribution based on central limit theorem
Confidence Interval for Mean
General expression:
xbar ± t (alpha/2, df) x s/(square root of n)
• t (alpha/2, df): multiplier from the Student t distribution (df = n - 1)
• s/(square root of n): standard error (n = sample size)
• t (alpha/2, df) x s/(square root of n): margin of error -- how close
we are likely (based on confidence level) to be to the population parameter
• confidence level -- how confident we are that the population
parameter will be in our interval -- 95% means alpha is .05
A Confidence Interval Example:
Suppose xbar = 100, s = 5 and n = 25.
Construct a 95% confidence interval and
interpret the interval.
100 ± t (.025,24) x 5 / (square root of 25)
100 ± 2.064 x 1
(97.94, 102.06)
We are 95% confident that the true mean
is in our interval of (about) 98 to 102.
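The arithmetic in this example can be checked with a short Python sketch. The function name t_interval is ours, not from the slides, and the multiplier 2.064 is taken from the slide rather than computed.

```python
import math

def t_interval(xbar, s, n, t_mult):
    """Return (lower, upper) for xbar ± t_mult x s / sqrt(n).

    t_mult is the Student t multiplier for the chosen confidence level
    and n - 1 degrees of freedom; it is looked up, not computed here.
    """
    margin = t_mult * s / math.sqrt(n)  # margin of error
    return xbar - margin, xbar + margin

# Slide example: xbar = 100, s = 5, n = 25, t(.025, 24) = 2.064
lower, upper = t_interval(100, 5, 25, 2.064)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```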
[Table: Student t critical values by degrees of freedom (df) for upper-tail areas .100, .050, .025, .010 and .005]
Confidence Interval Questions :
1. If we took another sample, would we get the
same confidence interval ?
2. How does the confidence level relate to the
margin of error ?
3. What can be done to reduce the margin of error ?
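Questions 2 and 3 can be explored numerically. The sketch below is an illustration under assumptions: it reuses s = 5 from the example above, treats it as a known standard deviation so that a normal multiplier can stand in for the t multiplier, and the function name margin_of_error is hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(sd, n, confidence):
    """Margin of error z x sd / sqrt(n) using a standard normal multiplier."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided multiplier
    return z * sd / math.sqrt(n)

# Higher confidence level -> larger margin of error (n fixed at 25)
for level in (0.90, 0.95, 0.99):
    print(f"confidence {level:.2f}: margin = {margin_of_error(5, 25, level):.3f}")

# Larger sample -> smaller margin of error (95% confidence)
for n in (25, 100, 400):
    print(f"n = {n}: margin = {margin_of_error(5, n, 0.95):.3f}")
```

Quadrupling the sample size halves the margin of error, because the standard error shrinks with the square root of n.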
Conceptual view of confidence intervals:
CONFIDENCE(alpha,standard_dev,size)
Alpha is the significance level used to compute the confidence
level. The confidence level equals 100*(1 - alpha)%, or in other
words, an alpha of 0.05 indicates a 95 percent confidence level.
Standard_dev is the population standard deviation for the data
range and is assumed to be known.
Size is the sample size.
If we assume alpha equals 0.05, we need to calculate the area
under the standard normal curve that equals (1 - alpha), or 95
percent. This value is ± 1.96. The confidence interval is therefore:
xbar ± 1.96 x (standard_dev / (square root of size))
Example
Suppose we observe that, in our sample of 50 commuters,
the average length of travel to work is 30 minutes with a
population standard deviation of 2.5. With alpha = .05,
CONFIDENCE(.05, 2.5, 50) returns 0.69295. The
corresponding confidence interval is then 30 ± 0.69295 =
approximately [29.3, 30.7].
For any population mean, μ0, in this interval, the probability
of obtaining a sample mean further from μ0 than 30 is more
than 0.05. Likewise, for any population mean, μ0, outside
this interval, the probability of obtaining a sample mean
further from μ0 than 30 is less than 0.05.
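The commuter example can be reproduced outside the spreadsheet. This is a minimal Python sketch that mirrors the arguments of CONFIDENCE(alpha,standard_dev,size); it is our own stand-in built on the standard library, not Excel's implementation.

```python
import math
from statistics import NormalDist

def confidence(alpha, standard_dev, size):
    """Margin of error for a mean when the population standard deviation is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z * standard_dev / math.sqrt(size)

# Commuter example: alpha = .05, standard deviation 2.5, sample of 50, mean 30
margin = confidence(0.05, 2.5, 50)
print(f"margin = {margin:.5f}")
print(f"interval = [{30 - margin:.1f}, {30 + margin:.1f}]")
```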
Hypothesis Testing
• A study and assessment of data to
examine two hypotheses: null and
alternative.
• Six step process
1. state hypotheses, decision making alternatives and
consequences of wrong decisions
2. select the appropriate test statistic
3. sketch sampling distribution and identify rejection region
4. collect data, compute statistics
5. test the null hypothesis and state conclusions
6. state managerial decision
Hypothesis test example:
Our engineering staff claims we will obtain an average
catapult launch of more than 110 inches. We will not
market the catapult unless this is true.
1. state hypotheses
Ho = ‘statement of no effect’
Ho = mean of launch is less than or equal to 110 inches
Null action = don’t market.
Ha = ‘there is an effect or difference’
Ha = mean launch is greater than 110 inches
Alternative action = market catapult; ‘launch’ marketing
campaign.
Type 1 and Type 2 Errors
With sample data, there is always a chance of making an incorrect
decision. Setting the significance level: for a type 1 error
(rejecting a true null hypothesis), alpha is the maximum risk we
are willing to take for this type of error.
Rules of thumb from Harvey Brightman:
1. Type 1 error costly and type 2 is not: set alpha low -- .05 or less
2. Type 2 error costly and type 1 is not: set alpha higher -- perhaps .25 or above
3. Both errors costly: set alpha low and increase sample size
We are going to set alpha = .01
2. Select test statistic
For large samples, use Z and for small samples use t
Z = (xbar - mu) / (sigma / (square root of n))
-- for the t test statistic, substitute s for sigma
3. Sketch sampling distribution and rejection region
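[Figure: Student t distribution with 19 degrees of freedom, horizontal axis from -4 to 4; the rejection region lies to the right of the critical value]
t (.01,19) = 2.5395
=TINV(2*0.01,19)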
EXCEL
TINV(probability,degrees_freedom)
Probability is the probability associated with the two-tailed Student's t-distribution.
Degrees_freedom is the number of degrees of freedom with which to
characterize the distribution.
Remarks
• A one-tailed t-value can be returned by replacing probability with
2*probability. For a probability of 0.05 and degrees of freedom of 10, the
two-tailed value is calculated with TINV(0.05,10), which returns 2.228139.
The one-tailed value for the same probability and degrees of freedom can
be calculated with TINV(2*0.05,10), which returns 1.812462.
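For readers without Excel, the two-tailed convention of TINV can be imitated with the Python standard library. The sketch below is an assumption-laden stand-in, not how Excel computes the value: it integrates the t density with Simpson's rule and finds the critical value by bisection, and the names t_pdf, t_upper_tail and tinv are ours.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def tinv(probability, df):
    """Two-tailed critical value t with P(|T| > t) = probability."""
    lo, hi = 0.0, 100.0  # search range; an assumption that suits typical cases
    if t_upper_tail(hi, df) > probability / 2:
        raise ValueError("critical value lies outside the search range")
    for _ in range(60):  # bisection on the decreasing upper-tail area
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > probability / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(tinv(0.05, 10), 4))      # two-tailed, probability 0.05, 10 df
print(round(tinv(2 * 0.05, 10), 4))  # one-tailed, probability 0.05, 10 df
print(round(tinv(2 * 0.01, 19), 4))  # one-tailed, probability 0.01, 19 df
```

Doubling the probability argument is what converts the two-tailed lookup into a one-tailed critical value, as in the remark above.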
4. Collect data and compute statistics
Let xbar = 115, s = 8 and n = 20
t* = (115 - 110) / (8 / (square root of 20))
t* = 2.795
5. Statistical decision
Since t* = 2.795 exceeds the critical value of 2.5395, it is in the
rejection region; we reject the null and accept the alternative hypothesis.
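Steps 4 and 5 of the catapult example can be sketched in Python as follows. The function name one_sample_t is hypothetical, and the critical value is the t (.01,19) figure from the slide, hard-coded rather than computed.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Test statistic t* = (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / math.sqrt(n))

t_star = one_sample_t(xbar=115, mu0=110, s=8, n=20)
t_crit = 2.5395  # t(.01, 19) as given on the slide (upper-tail test)

# Ha: mean launch > 110, so reject the null only for large positive t*
reject_null = t_star > t_crit
print(f"t* = {t_star:.3f}, critical value = {t_crit}, reject H0: {reject_null}")
```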
Observations from Two Populations
Comparing Two Catapults
[Figure: histogram of launch distances (80 to 130) for the two catapults; vertical axis: frequency]
Cumulative Percentages
[Figure: cumulative probabilities (0 to 1) against launch distances (80 to 130) for the two catapults]
Summary Statistics for the Two Groups
                            Group 1     Group 2
Mean                        113.950      96.050
Standard Error                1.375       1.562
Median                      115.000      96.000
Mode                        117.000      85.000
Standard Deviation            6.151       6.985
Variance                     37.839      48.787
Kurtosis                     -0.838      -1.213
Skewness                     -0.267      -0.034
Range                        21.000      22.000
Minimum                     104.000      85.000
Maximum                     125.000     107.000
Sum                        2279.000    1921.000
Count                        20.000      20.000
Confidence Level(0.95)        2.696       3.061
Test statistic for hypothesis test on difference in two means
Using the t test statistic to test for a difference in the two means:
Hypothesis test:
t = (xbar1 - xbar2) / (s x square root of (1/n1 + 1/n2))
where s is the pooled standard deviation:
s^2 = ((n1 - 1) x s1^2 + (n2 - 1) x s2^2) / (n1 + n2 - 2)
the Numbers
In our case:
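n1 = 20 s1 = 6.15
n2 = 20 s2 = 6.98
df = n1 + n2 - 2 = 38
s = 6.58
xbar1 - xbar2 = 113.95 - 96.05 = 17.9
t* = 17.9 / 2.08 = 8.6
Comparing to a critical value of t (.05,38) = 1.686, t* is far larger.
Would conclude the two means are statistically different.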
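The pooled-variance t statistic for the two catapults can be recomputed from the summary statistics. This is a sketch assuming equal population variances, with n1 + n2 - 2 degrees of freedom; the function name pooled_t is ours.

```python
import math

def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Return (t_star, pooled_s, df) for the equal-variance two-sample t test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    pooled_s = math.sqrt(pooled_var)
    t_star = (xbar1 - xbar2) / (pooled_s * math.sqrt(1 / n1 + 1 / n2))
    return t_star, pooled_s, df

# Means, standard deviations and counts from the summary statistics
t_star, pooled_s, df = pooled_t(113.95, 6.151, 20, 96.05, 6.985, 20)
print(f"pooled s = {pooled_s:.2f}, df = {df}, t* = {t_star:.2f}")
```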
EXCEL Example
Tools | Data Analysis |
=TINV(0.05*2,16)
=TINV(0.05,16)
Linear Regression Model
Linear regression form:
Yt = β0 + β1 X1t + ε
• β0 + β1 X1t: regression function (linear function of time), the
systematic variation in the time series
• ε: error term, represents unsystematic or random variation
Tools | Data Analysis | Regression
Our regression model
=375.17+92.6255*C3
note: column C is time period
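The fitted equation can be used for forecasting outside the worksheet. In the Python sketch below, fit_line and forecast are hypothetical names, and the series is synthetic: it is generated from the slide's coefficients only to show that a least-squares fit recovers them. It is not the original data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

def forecast(period, b0=375.17, b1=92.6255):
    """Fitted value from the slide's model: 375.17 + 92.6255 * period."""
    return b0 + b1 * period

# Synthetic, noise-free series built from the slide's coefficients
periods = list(range(1, 11))
values = [forecast(t) for t in periods]

b0, b1 = fit_line(periods, values)
print(f"intercept = {b0:.2f}, slope = {b1:.4f}")
print(f"forecast for period 11 = {forecast(11):.2f}")
```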