presentation slides for this follow along session


Bus 621 Statistics

Lecture 1 Basics of Statistical Inference

Lecture 1

1. Inference for a single numerical variable
2. Statement of hypotheses
3. P-value concept
4. How to communicate the results of a test
5. Inference for a single numerical variable and a categorical variable with 2 categories
6. Inference for a single categorical variable
7. Inference for 2 categorical variables

Statistical Methods

• Descriptive Statistics
• Inferential Statistics
  • Estimation
  • Hypothesis Testing

Estimation Process

Population: the mean, μ, is unknown.

Random sample: sample mean x̄ = 50.

"I am 95% confident that μ is between 40 & 60."

Unknown Population Parameters Are Estimated

Estimate population parameter... with sample statistic
• Mean μ: x̄
• Proportion p: p̂
• Std. dev. σ: s
• Difference μ1 – μ2: x̄1 – x̄2

Estimation Methods

Estimation
• Point Estimation
• Interval Estimation

Point Estimation

1. Provides a single value
   • Based on observations from one sample
2. Gives no information about how close the value is to the unknown population parameter
3. Example: Sample mean x̄ = 3 is a point estimate of the unknown population mean

Interval Estimation

1. Provides a range of values
   • Based on observations from one sample
2. Gives information about closeness to the unknown population parameter
3. Example: Unknown population mean lies between 50 and 70 with 95% confidence

Confidence Level

1. Probability that the unknown population parameter falls within the interval
2. Denoted (1 – α)
   • α is the probability that the parameter is not within the interval
3. Typical values are 99%, 95%, 90%

Intervals & Confidence Level

[Figure: sampling distribution of the sample mean x̄, centered at μx̄ = μ, with area 1 – α in the middle and α/2 in each tail.]

Over a large number of intervals, 100(1 – α)% of the intervals contain μ and 100α% do not.
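The "large number of intervals" idea can be checked with a small simulation. This is an illustrative sketch only: the normal population (μ = 50, σ = 10), the sample size of 10, the seed and the number of trials are assumptions made for the demonstration, and 2.262 is the tabled t value for 9 degrees of freedom at 95% confidence.

```python
import math
import random
import statistics

random.seed(621)                 # arbitrary seed, only for repeatability

MU, SIGMA, N = 50, 10, 10        # assumed population and sample size
T_CRIT = 2.262                   # t table value: df = 9, 95% confidence
TRIALS = 4000

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    half_width = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    # Count the interval as a hit when it contains the true mean.
    if xbar - half_width <= MU <= xbar + half_width:
        hits += 1

coverage = hits / TRIALS
print(f"Share of intervals containing mu: {coverage:.3f}")
```

The printed share should land close to .95; each individual interval either contains μ or it does not.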

Factors Affecting Interval Width

1. Data dispersion
   • More variability = larger width
2. Sample size
   • Larger sample = smaller width
3. Level of confidence (1 – α)
   • Higher confidence = larger width

© 1984-1994 T/Maker Co.

Accurate Confidence Interval for Mean (σ Unknown)

Assumption: Population must be normally distributed

Thinking Challenge

You’re a time study analyst in manufacturing. You’ve recorded the following task times (min.): 3.6, 4.2, 4.0, 3.5, 3.8, 3.1. What is the 90% confidence interval estimate of the population mean task time?

Confidence Interval for a Mean (μ) with Unknown σ, Using MegaStat

MegaStat does all calculations for you. We can be 90% confident that the population mean falls between 3.379 and 4.021.
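For readers without MegaStat, the same interval can be reproduced by hand as x̄ ± t · s/√n. The sketch below uses only the Python standard library; the critical value 2.015 is an assumption read from a t table (df = 5, 5% in each tail), and the variable names are illustrative.

```python
import math
import statistics

times = [3.6, 4.2, 4.0, 3.5, 3.8, 3.1]   # task times (min.) from the slide
n = len(times)

xbar = statistics.mean(times)
s = statistics.stdev(times)               # sample std. dev. (n - 1 divisor)
std_error = s / math.sqrt(n)

# Assumption: critical value taken from a t table, df = n - 1 = 5, 90% confidence.
T_CRIT = 2.015

margin = T_CRIT * std_error
lower, upper = xbar - margin, xbar + margin
print(f"90% CI for the mean task time: {lower:.3f} to {upper:.3f}")
```

The result should agree with the MegaStat interval of 3.379 to 4.021.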

Applications

An example using L1 One sample numerical variable.xlsx

Problem 1: Obtain and interpret a 95% confidence interval for the population mean for price per square foot for all combinations of SAD and with/without a pool.

Problem 2: Check to see if these confidence intervals may be inaccurate by looking at normality/sample size.

Your Turn: Do PS1 problem 1

Statistical Methods

• Descriptive Statistics
• Inferential Statistics
  • Estimation
  • Hypothesis Testing

What’s a Hypothesis?

A belief about a population parameter
• Parameter is population mean, proportion, slope
• Must be stated before analysis

"I believe the mean GPA of this class is 3.5!"


Hypothesis Testing

Population: "I believe the population mean age is 50 (hypothesis)."

Random sample: mean x̄ = 20.

"Reject hypothesis! Not close."

How do we Measure “Close”?

1. If the hypothesized value of μ were really the true mean, there should be a reasonably high probability of obtaining, by pure random chance, a sample mean at least as far from it as the observed x̄. Call this probability the p-value.
2. If the p-value is smaller than, say, 5%, we “reject” the hypothesized value for μ.

Basic Idea

It is unlikely that we would get a sample mean of this value (x̄ = 20) if in fact the population mean were μ = 50 (H0); therefore, we reject the hypothesis that μ = 50.

[Figure: sampling distribution of sample means centered at μ = 50 under H0, with the observed sample mean of 20 far out in the left tail.]

Naming Null & Alternative Hypotheses

1. Null hypothesis, H0 (pronounced H-oh)
2. H0 always has an equality sign: =, ≤, or ≥
3. Alternative hypothesis, Ha, opposite of null
4. Ha always has an inequality sign: ≠, <, or >
   • Specified as Ha: μ ≠, <, or > some value
   • Example, Ha: μ < 3

Identifying Hypotheses

Example: Test that the population mean is not 3

Steps:
• State the question statistically (μ ≠ 3)
• State the opposite statistically (μ = 3)
  • Must be mutually exclusive & exhaustive
• Designate which is the alternative hypothesis (μ ≠ 3)
  • Has the ≠, <, or > sign
• Designate which is the null hypothesis (μ = 3)
• Called a two-tailed hypothesis because of ≠ in Ha

What Are the Hypotheses?

Is the population average amount of TV viewing equal to 12 hours?

• State the question statistically: μ = 12
• State the opposite statistically: μ ≠ 12
• Select the alternative hypothesis: Ha: μ ≠ 12
• State the null hypothesis: H0: μ = 12

This is a two-tailed test.

What Are the Hypotheses?

Is the population average amount of TV viewing different from 12 hours?

• State the question statistically: μ ≠ 12
• State the opposite statistically: μ = 12
• Select the alternative hypothesis: Ha: μ ≠ 12
• State the null hypothesis: H0: μ = 12

This is a two-tailed test.

What Are the Hypotheses?

Is the average amount spent in the bookstore greater than $25?

• State the question statistically: μ > 25
• State the opposite statistically: μ ≤ 25
• Select the alternative hypothesis: Ha: μ > 25
• State the null hypothesis: H0: μ ≤ 25

This is a one-tailed or right-tailed test.

What Are the Hypotheses?

Is the average cost per hat less than $20?

• State the question statistically: μ < 20
• State the opposite statistically: μ ≥ 20
• Designate the alternative hypothesis: Ha: μ < 20
• State the null hypothesis: H0: μ ≥ 20

This is a one-tailed or left-tailed test.

Level of Significance

1. A “tail” probability of the bell curve, used to define how many std. devs. of x̄ still count as “close” and as the threshold the p-value is compared against
2. Designated α (alpha)
   • Typical values are .01, .05, .10 (.05 is most common)
3. Selected by the researcher; otherwise it will be given in a problem
4. Defines unlikely values of the sample statistic if the null hypothesis is true

p-Value Approach

1. The probability of obtaining a test statistic more extreme (≤ or ≥) than the actual sample value, given H0 is true, is called the p-value
2. 1 – (p-value) is called the confidence in Ha
3. 1 – α is called the required confidence to conclude Ha
4. Used to make a decision between hypotheses
   • If confidence in Ha is greater than the required confidence, conclude Ha; otherwise find H0 acceptable.

The Four Steps of a Hypothesis Test

1. State hypotheses
2. Determine p-value (MegaStat)
3. Make decision based on 1 – p = confidence in Ha
   • If confidence in Ha is greater than the required confidence, conclude Ha; otherwise find H0 acceptable.
4. Draw conclusion within context of problem
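The decision step can be written as a few lines of code. This is a minimal sketch of the rule as stated on the slides; the function name `decide` and its return format are illustrative assumptions, and the two p-values in the usage lines are the ones quoted in this lecture's one-tailed examples.

```python
def decide(p_value, alpha):
    """Compare confidence in Ha (1 - p) with the required confidence (1 - alpha)."""
    confidence_in_ha = 1 - p_value
    required_confidence = 1 - alpha
    if confidence_in_ha > required_confidence:
        return confidence_in_ha, "conclude Ha"
    return confidence_in_ha, "H0 is acceptable"


# Usage with p-values quoted in this lecture, both at alpha = .05.
for p in (0.009, 0.111):
    confidence, decision = decide(p, alpha=0.05)
    print(f"p-value = {p:.3f}: confidence in Ha = {confidence:.1%}, {decision}")
```

Comparing 1 – p with 1 – α is the same as comparing p with α; the slides simply phrase the rule in terms of confidence.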

t Test for Mean (σ Unknown)

Assumption for p-value to be accurate
• Population is normally distributed
• If not normal, take a large sample (n ≥ 30)
• Or switch to a test for the population median, such as the Wilcoxon signed rank test

One-Tailed t Test Example

Is the average capacity of batteries less than 140 ampere-hours? A random sample of 20 batteries had a mean of 138.47 and a standard deviation of 2.66. Assume a normal distribution. Test at the .05 level of significance.

One-Tailed t Test Solution

• H0: μ ≥ 140
• Ha: μ < 140
• α = .05
• df = 20 – 1 = 19
• p-value = .009 (MegaStat)
• Conclusion: We can be 99.1% confident that the population mean is less than 140, and since that exceeds the requirement of 95% we can conclude μ < 140.

One-Tailed t Test

You’re a marketing analyst for Wal-Mart. Wal-Mart had teddy bears on sale last week. The weekly sales ($00s) of bears sold in 10 stores were: 8 11 0 4 7 8 10 5 8 3. At the .05 level of significance, is there evidence that the average bear sales per store is more than 5 ($00s)?

One-Tailed t Test Solution*

• H0: μ ≤ 5
• Ha: μ > 5
• α = .05
• df = 10 – 1 = 9
• p-value = .111 from MegaStat
• Confidence in Ha = 1 – .111 = .889

Required confidence to conclude Ha is 95%.

There is insufficient evidence that the pop. mean is more than 5, since we can be only 88.9% confident.

One-tailed t Test for a Mean (μ) with Unknown σ, Using MegaStat

Hypothesis Test: Mean vs. Hypothesized Value
5.0000 hypothesized value
6.4000 mean Sales ($00)
3.3731 std. dev.
1.0667 std. error
1.31 t
.1109 p-value (one-tailed, upper)
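The MegaStat figures for the teddy bear data can be checked from the raw numbers. This is a sketch that stays within the Python standard library: `t_cdf` is a hypothetical helper (not a library function) that integrates the t density numerically with Simpson's rule, and it is one of several ways to obtain the tail area.

```python
import math
import statistics


def t_cdf(t, df, steps=2000):
    """P(T <= t) for Student's t, via Simpson's rule on the density (steps must be even)."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    a = abs(t)
    h = a / steps
    total = pdf(0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3          # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


sales = [8, 11, 0, 4, 7, 8, 10, 5, 8, 3]   # weekly sales ($00s) in 10 stores
hypothesized = 5

n = len(sales)
mean = statistics.mean(sales)
std_dev = statistics.stdev(sales)
std_error = std_dev / math.sqrt(n)
t_stat = (mean - hypothesized) / std_error
p_value = 1 - t_cdf(t_stat, n - 1)          # upper tail, because Ha is mu > 5

print(f"mean = {mean:.4f}, std. dev. = {std_dev:.4f}, std. error = {std_error:.4f}")
print(f"t = {t_stat:.2f}, one-tailed p-value = {p_value:.4f}")
```

The printed values should line up with the MegaStat output above, so the confidence in Ha stays below the required 95%.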

Two-Tailed t Test

You work for the FTC. A manufacturer of detergent claims that the mean weight of detergent is 3.25 lb. You take a random sample of 64 containers. You calculate the sample average to be 3.238 lb. with a standard deviation of .117 lb. At the .01 level of significance, is the manufacturer correct?

Two-Tailed t Test Solution*

• H0: μ = 3.25
• Ha: μ ≠ 3.25
• α = .01
• df = 64 – 1 = 63
• p-value = .415 (two-tailed; twice the one-tailed area of .208 for t = –0.82)
• Confidence in Ha = 1 – .415 = .585

Need to be 99% confident to conclude Ha. There is insufficient evidence that the pop. mean is not 3.25, since we can only be 58.5% confident. The null hypothesis is acceptable.
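The detergent test can be worked from the summary statistics alone. The sketch below prints both the one-tailed area (about .208) and the two-tailed p-value, which is twice that; either way the confidence in Ha falls well short of the required 99%. `t_cdf` is a hypothetical standard-library helper using Simpson's rule, not a library function.

```python
import math


def t_cdf(t, df, steps=2000):
    """P(T <= t) for Student's t, via Simpson's rule on the density (steps must be even)."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    a = abs(t)
    h = a / steps
    total = pdf(0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3          # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


n, xbar, s = 64, 3.238, 0.117     # sample size, mean and std. dev. from the slide
claimed_mean = 3.25

t_stat = (xbar - claimed_mean) / (s / math.sqrt(n))
lower_tail = t_cdf(t_stat, n - 1)
one_tail = min(lower_tail, 1 - lower_tail)
two_tail = 2 * one_tail           # Ha is "not equal", so both tails count

print(f"t = {t_stat:.2f}")
print(f"one-tailed area = {one_tail:.3f}, two-tailed p-value = {two_tail:.3f}")
```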

Applications

An example using L1 One sample numerical variable.xlsx

Problem 3: Test the hypothesis that the mean price per square foot for SAD3Pool is different from $320 at a level of significance of .05. How does that compare to the 95% confidence interval you calculated in Problem 1? Use a level of significance of .05 in this problem and all that follow.

Problem 4: Use the Wilcoxon signed rank test to test whether the median price per square foot for SAD2Pool is different than $320.

Problem 5: Test the hypothesis that the mean price per square foot for SAD1NoPool is less than $350.

Example 6: Use the Wilcoxon signed rank test to test whether the median price per square foot for SAD1NoPool is less than $350.

Your Turn: Do PS1 problems 2,3,4

Two Independent Populations: Example applications

1. An economist wishes to determine whether there is a difference in mean family income for households in two socioeconomic groups.
2. An admissions officer of a small liberal arts college wants to compare the mean SAT scores of applicants educated in rural high schools and in urban high schools.

How can we tell what to use for these situations?
• Both have a numerical variable and a categorical variable (with 2 categories)
• See “Choosing Situation by Data Type”

Comparing Two Independent Means, μ1 – μ2, assuming σ unknown

Assumptions
• Independent, random samples
• Populations are approximately normally distributed
• Population standard deviations are equal

If at least one population is not normal then an alternative test is to compare population medians using the Wilcoxon Mann-Whitney test

Hypothesis Test Example

You’re a financial analyst for Charles Schwab. Is there a difference in dividend yield between stocks listed on the NYSE and NASDAQ? You collect the following data:

          Number   Mean   Std Dev
NYSE          11   3.27      1.30
NASDAQ        15   2.53      1.16

Assuming normal populations, is there a difference in average yield (α = .05)?


Independent Samples Hypothesis Test Solution

• H0: μ1 – μ2 = 0 (μ1 = μ2)
• Ha: μ1 – μ2 ≠ 0 (μ1 ≠ μ2)
• α = .05
• df = 11 + 15 – 2 = 24
• Need to be 95% confident to conclude Ha.
• p-value = .1397
• Confidence in Ha is 1 – .1397 = .8603

There is little evidence of a difference in means, since we can only be 86.03% confident that the pop. means are different.

Two Sample t Test & C.I. for Mean Difference Assuming Equal Variances, Using MegaStat

Hypothesis Test: Independent Groups (t-test, pooled variance)

NYSE    NASDAQ
3.27    2.53    mean
1.3     1.16    std. dev.
11      15      n

0.740 difference (NYSE - NASDAQ)
1.489 pooled variance
1.220 pooled std. dev.
0.484 standard error of difference
0 hypothesized difference
1.53 t
.1397 p-value (two-tailed)
-0.260 confidence interval 95.% lower
1.740 confidence interval 95.% upper
1.000 margin of error
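The pooled-variance output can be reproduced from the summary table. This sketch is illustrative: `t_cdf` is a hypothetical helper that integrates the t density with Simpson's rule, and the critical value 2.064 for the interval is an assumption read from a t table (df = 24, 95% confidence).

```python
import math


def t_cdf(t, df, steps=2000):
    """P(T <= t) for Student's t, via Simpson's rule on the density (steps must be even)."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    a = abs(t)
    h = a / steps
    total = pdf(0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3          # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


n1, mean1, sd1 = 11, 3.27, 1.30   # NYSE
n2, mean2, sd2 = 15, 2.53, 1.16   # NASDAQ

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
difference = mean1 - mean2
t_stat = difference / std_error
p_value = 2 * (1 - t_cdf(abs(t_stat), df))   # two-tailed

T_CRIT = 2.064                    # assumption: t table value, df = 24, 95% confidence
margin = T_CRIT * std_error
lower, upper = difference - margin, difference + margin

print(f"pooled variance = {pooled_var:.3f}, std. error = {std_error:.3f}")
print(f"t = {t_stat:.2f}, p-value (two-tailed) = {p_value:.4f}")
print(f"95% CI for the difference: {lower:.3f} to {upper:.3f}")
```

Because the interval contains 0, it leads to the same decision as the test: H0 is acceptable at the .05 level.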

Wilcoxon Mann-Whitney test using MegaStat

Wilcoxon - Mann/Whitney Test (Pr/SF)

n     sum of ranks
32    1035.5    SAD1Pool
29    855.5     SAD2Pool
61    1891      total

992.000 expected value
69.243 standard deviation
0.621 z corrected for ties with continuity correction
.5346 p-value (two-tailed)

H0: Population medians are equal
H1: Population medians are not equal
P-value = .5346

We can only be 46.54% confident of a difference in population medians.

See L1 2 sample tests excel file for this example.
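The z statistic in this output follows from the rank sum and the two sample sizes. The sketch below uses the normal approximation with a continuity correction; as a simplifying assumption it leaves out the tie correction that MegaStat applies, which barely changes the standard deviation here.

```python
import math
from statistics import NormalDist

n1, n2 = 32, 29                   # SAD1Pool and SAD2Pool sample sizes
rank_sum_1 = 1035.5               # sum of ranks for SAD1Pool (from the output above)

total = n1 + n2
expected = n1 * (total + 1) / 2                    # expected rank sum under H0
std_dev = math.sqrt(n1 * n2 * (total + 1) / 12)    # assumption: no tie correction

# Continuity correction: pull the observed gap 0.5 closer to the expected value.
z = (abs(rank_sum_1 - expected) - 0.5) / std_dev
p_value = 2 * (1 - NormalDist().cdf(z))            # two-tailed

print(f"expected = {expected:.3f}, std. dev. = {std_dev:.3f}")
print(f"z = {z:.3f}, p-value (two-tailed) = {p_value:.4f}")
```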

Applications

An example using the L1 2 sample tests.xlsx excel file.

Example 7: Test whether price per square foot has the same population means for homes with and without pools.

Your Turn: Do PS1 problems 5,6,7

A single categorical variable: Z Test for a Proportion

1. Condition
   • nπ and n(1 – π) > 5
2. Z-test from MegaStat

Example: Do ranch style homes make up less than 50% of the population of homes?

Data: A sample of 108 homes revealed that 44 were ranch style.

One-Tailed Solution

• H0: π ≥ 0.50
• Ha: π < 0.50
• α = .05
• P-value = .0271 from Excel MegaStat
• We can be 97.29% confident that the population proportion is less than 0.5 and therefore can conclude that π < 0.50

95% confidence interval estimate for π: We can be 95% confident that the population proportion falls between .3147 and .5001.

Note: A 2-tailed test would have found the null hypothesis acceptable.
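The z test and the interval can be checked with the normal distribution in the Python standard library. This sketch assumes 44 ranch homes out of 108, the ranch total in the SAD-by-style table used for the chi-square example; that count reproduces the reported p-value and interval. Variable names are illustrative.

```python
import math
from statistics import NormalDist

n, ranch = 108, 44                # assumption: ranch count taken from the SAD-by-style table
pi_0 = 0.50                       # hypothesized population proportion

p_hat = ranch / n
# Test statistic uses the hypothesized proportion in the standard error.
z = (p_hat - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n)
p_value = NormalDist().cdf(z)     # lower tail, because Ha is pi < 0.50

# The interval uses the sample proportion in the standard error.
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"p-hat = {p_hat:.4f}, z = {z:.2f}, one-tailed p-value = {p_value:.4f}")
print(f"95% CI for pi: {lower:.4f} to {upper:.4f}")
print(f"two-tailed p-value = {2 * p_value:.4f}")
```

The two-tailed p-value is above .05, which is why a 2-tailed test would have found the null hypothesis acceptable, and why the 95% interval just reaches .50.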

Applications

An example using the L1 Categorical variables tests and CI-1.xls file.

Example 8: Test whether less than 50% of the homes are ranch style in the population and obtain a 95% interval estimate for that population proportion.

Your Turn: Do PS1 problem 8

Two categorical variables: Chi-square Test for Independence

1. Chi-square test statistic

Example: Do 3 different school districts have the same percentage of ranch, trilevel and two-story homes?

Data: A sample of 108 homes revealed the following table.

Count of STYLE by SAD
SAD           ranch   trilevel   twostory   Grand Total
SAD1              8         24         11            43
SAD2             15         11          7            33
SAD3             21          4          7            32
Grand Total      44         39         25           108

Chi-square solution

• H0: No relationship
• Ha: Relationship exists
• α = .05
• P-value = .0005 from Excel MegaStat
• We can be 99.95% confident that there is a relationship between school district and style of home.

A follow-up analysis suggests that SAD 1 has fewer ranch homes and more trilevel homes than expected and that the reverse holds for SAD 3. See the Results tab in L1 Categorical variables file for details.
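The chi-square statistic behind this p-value can be computed directly from the table of counts. This is a sketch under stated assumptions: the layout of `observed` is illustrative, and the tail probability uses the closed form exp(–x/2)(1 + x/2), which holds only for 4 degrees of freedom (a 3 × 3 table).

```python
import math

# Observed counts per district: ranch, trilevel, twostory.
observed = {
    "SAD1": [8, 24, 11],
    "SAD2": [15, 11, 7],
    "SAD3": [21, 4, 7],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

chi_square = 0.0
for row, row_total in zip(rows, row_totals):
    for count, col_total in zip(row, col_totals):
        expected = row_total * col_total / n       # expected count if no relationship
        chi_square += (count - expected) ** 2 / expected

df = (len(rows) - 1) * (len(col_totals) - 1)

# Closed-form upper tail of the chi-square distribution, valid for df = 4 only.
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)

print(f"chi-square = {chi_square:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The largest contributions to the statistic come from the ranch and trilevel cells of SAD1 and SAD3, matching the follow-up analysis.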

Applications

An example using the L1 Categorical variables tests and CI-1.xlsx file.

Example 9: Test whether there is a relationship between SAD and style of home.

Your Turn: Do PS1 problem 9