## Lecture 1 Basics of Statistical Inference

1. Inference for a single numerical variable
2. Statement of hypotheses
3. P-value concept
4. How to communicate the results of a test
5. Inference for a single numerical variable and a categorical variable with 2 categories
6. Inference for a single categorical variable
7. Inference for 2 categorical variables

### Statistical Methods

• Statistical Methods
  • Descriptive Statistics (Tutorials)
  • Inferential Statistics
    • Estimation
    • Hypothesis Testing

### Estimation Process

Population: the mean, μ, is unknown.

Random sample: mean X̄ = 50.

Conclusion: “I am 95% confident that μ is between 40 & 60.”

### Estimation Methods

• Estimation
  • Point Estimation
  • Interval Estimation

### Point Estimation

1. Provides a single value
   • Based on observations from one sample
2. Gives no information about how close the value is to the unknown population parameter
3. Example: Sample mean x̄ = 3 is a point estimate of the unknown population mean μ

### Interval Estimation

1. Provides a range of values
   • Based on observations from one sample
2. Gives information about closeness to unknown population parameter
3. Example: Unknown population mean lies between 50 and 70 with 95% confidence

### Confidence Level

1. Probability that the unknown population parameter falls within interval
2. Denoted (1 – α)
   • α is probability that parameter is not within interval
3. Typical values are 99%, 95%, 90%

### Intervals & Confidence Level

[Figure: sampling distribution of the sample mean X̄, centered at μ_x̄ = μ, with area 1 – α in the middle and α/2 in each tail]

Over a large number of intervals, 100(1 – α)% of intervals contain μ and 100α% do not.

### Factors Affecting Interval Width

1. Data dispersion
   • More variability = larger width
2. Sample size
   • Larger sample = smaller width
3. Level of confidence (1 – α)
   • Higher confidence = larger width

© 1984-1994 T/Maker Co.
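The three factors above can be illustrated numerically. The following is a minimal sketch, assuming a z-based interval with a known standard deviation (a simplification; the intervals in this lecture are t-based and come from MegaStat). The function name `interval_width` and all input values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def interval_width(std_dev, n, confidence):
    """Width of a z-based confidence interval for a mean (std. dev. treated as known)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return 2 * z * std_dev / sqrt(n)


# Hypothetical baseline: std. dev. 10, n = 25, 95% confidence.
base = interval_width(std_dev=10, n=25, confidence=0.95)
more_spread = interval_width(std_dev=20, n=25, confidence=0.95)     # more variability
bigger_sample = interval_width(std_dev=10, n=100, confidence=0.95)  # larger sample
more_confident = interval_width(std_dev=10, n=25, confidence=0.99)  # higher confidence

print(f"baseline width:      {base:.2f}")
print(f"doubled dispersion:  {more_spread:.2f}")
print(f"quadrupled sample:   {bigger_sample:.2f}")
print(f"99% confidence:      {more_confident:.2f}")
```

Doubling the dispersion doubles the width, quadrupling the sample size halves it, and raising the confidence level widens it.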

### Confidence Interval for a Mean (σ Unknown)

Assumption: Population must be normally distributed

### Thinking Challenge

You’re a time study analyst in manufacturing. You’ve recorded the following task times (min.): 3.6, 4.2, 4.0, 3.5, 3.8, 3.1. What is the 90% confidence interval estimate of the population mean?

Confidence Interval for a Mean (μ) with Unknown σ, Using MegaStat

• MegaStat does all calculations for you.
• We can be 90% confident that the population mean falls between 3.379 and 4.021.
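For readers who want to see the arithmetic behind the MegaStat result, here is a minimal sketch using only the Python standard library. It assumes the usual t-based interval, x̄ ± t·s/√n, and takes the critical value 2.015 (t with 5 degrees of freedom, .05 in the upper tail) from a standard t table rather than computing it.

```python
from math import sqrt
from statistics import mean, stdev

times = [3.6, 4.2, 4.0, 3.5, 3.8, 3.1]  # task times (min.) from the slide

n = len(times)
x_bar = mean(times)
s = stdev(times)          # sample standard deviation (divides by n - 1)
std_error = s / sqrt(n)

# Assumption: table value of t for df = n - 1 = 5 with .05 in each tail (90% confidence).
T_CRIT_90_DF5 = 2.015

margin = T_CRIT_90_DF5 * std_error
lower, upper = x_bar - margin, x_bar + margin

print(f"90% CI for the mean: {lower:.3f} to {upper:.3f}")  # 3.379 to 4.021
```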

### Applications

An example using L1 One sample numerical variable.xlsx

Problem 1: Obtain and interpret a 95% confidence interval for the population mean for price per square foot for all combinations of SAD and with/without a pool.

Problem 2: Check to see if these confidence intervals may be inaccurate by looking at normality/sample size.

Your Turn: Do PS1 problem 1

### Statistical Methods

• Statistical Methods
  • Descriptive Statistics
  • Inferential Statistics
    • Estimation
    • Hypothesis Testing

### What’s a Hypothesis?

A belief about a population parameter
• Parameter is population mean, proportion, slope
• Must be stated before analysis

Example: “I believe the mean GPA of this class is 3.5!”

### Hypothesis Testing

Population: “I believe the population mean age is 50 (hypothesis).”

Random sample: mean X̄ = 20.

Decision: “Reject hypothesis! Not close.”

### How do we Measure “Close”?

If the hypothesized μ value were really the true mean, there should be a reasonably high probability of obtaining, by pure random chance, a sample mean at least as far from it as the observed x̄. Call this probability the p-value. If the p-value is smaller than, say, 5%, we “reject” the hypothesized value for μ.

### Basic Idea

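[Figure: sampling distribution of sample means, centered at the hypothesized value μ = 50 (H0), with the observed sample mean of 20 far out in the tail]

It is unlikely that we would get a sample mean of this value (20) if in fact this (50) were the population mean; therefore, we reject the hypothesis that μ = 50.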

### Naming Null & Alternative Hypotheses

1. Null hypothesis, H0 (pronounced H-oh)
2. H0 always has an equality sign: =, ≤, or ≥
3. Alternative hypothesis, Ha, is the opposite of the null
4. Ha always has an inequality sign: ≠, <, or >
   • Specified as Ha: μ ≠, <, or > some value
   • Example, Ha: μ < 3

### Identifying Hypotheses

Example: Test that the population mean is not 3

Steps:
• State the question statistically (μ ≠ 3)
• State the opposite statistically (μ = 3)
  • Must be mutually exclusive & exhaustive
• Designate which is the alternative hypothesis (μ ≠ 3)
  • Has the ≠, <, or > sign
• Designate which is the null hypothesis (μ = 3)

Called a two-tailed hypothesis because of the ≠ in Ha.

### What Are the Hypotheses?

Is the population average amount of TV viewing equal to 12 hours?

• State the question statistically: μ = 12
• State the opposite statistically: μ ≠ 12
• Select the alternative hypothesis: Ha: μ ≠ 12
• State the null hypothesis: H0: μ = 12

This is a two-tailed test.

### What Are the Hypotheses?

Is the population average amount of TV viewing different from 12 hours?

• State the question statistically: μ ≠ 12
• State the opposite statistically: μ = 12
• Select the alternative hypothesis: Ha: μ ≠ 12
• State the null hypothesis: H0: μ = 12

This is a two-tailed test.

### What Are the Hypotheses?

Is the average amount spent in the bookstore greater than \$25?

• State the question statistically: μ > 25
• State the opposite statistically: μ ≤ 25
• Select the alternative hypothesis: Ha: μ > 25
• State the null hypothesis: H0: μ ≤ 25

This is a one-tailed or right-tailed test.

### What Are the Hypotheses?

Is the average cost per hat less than \$20?

• State the question statistically: μ < 20
• State the opposite statistically: μ ≥ 20
• Designate the alternative hypothesis: Ha: μ < 20
• State the null hypothesis: H0: μ ≥ 20

This is a one-tailed or left-tailed test.

### Level of Significance

1. A “tail” probability of the bell curve, used to define how many standard deviations of x̄ count as “close” and to compare the p-value against
2. Designated α (alpha)
   • Typical values are .01, .05, .10 (.05 is most common)
3. Selected by the researcher; otherwise it will be given in a problem
4. Defines unlikely values of the sample statistic if the null hypothesis is true

### p-Value Approach

1. The probability of obtaining a test statistic at least as extreme (≤ or ≥) as the actual sample value, given H0 is true, is called the p-value
2. 1 – (p-value) is called the confidence in Ha
3. 1 – α is called the required confidence to conclude Ha
4. Used to make a decision between hypotheses
   • If confidence in Ha is greater than the required confidence, conclude Ha; otherwise find H0 acceptable.

## The Four Steps of a Hypothesis Test

1. State hypotheses
2. Determine p-value (MegaStat)
3. Make decision based on 1 – p = confidence in Ha
   • If confidence in Ha is greater than the required confidence, conclude Ha; otherwise find H0 acceptable.
4. Draw conclusion within context of problem
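The decision rule in step 3 can be written as a small helper. This is only a sketch; the function name `decide` and its return strings are hypothetical, and the p-values in the usage lines are the ones reported for the battery and teddy bear examples in this lecture.

```python
def decide(p_value, alpha):
    """Apply the lecture's rule: compare confidence in Ha with required confidence."""
    confidence_in_ha = 1 - p_value   # 1 - (p-value)
    required_confidence = 1 - alpha  # 1 - alpha
    if confidence_in_ha > required_confidence:
        return "conclude Ha"
    return "H0 acceptable"


# Battery example: p-value .009 at the .05 level of significance.
print(decide(p_value=0.009, alpha=0.05))  # conclude Ha
# Teddy bear example: p-value .111 at the .05 level of significance.
print(decide(p_value=0.111, alpha=0.05))  # H0 acceptable
```

Comparing 1 – p with 1 – α is the same as checking whether the p-value is smaller than α.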

### t Test for a Mean (σ Unknown)

Assumption for p-value to be accurate
• Population is normally distributed
• If not normal, take large sample (n ≥ 30)
• Or switch to a test for the population median, such as the Wilcoxon signed rank test

### One-Tailed t Test Example

Is the average capacity of batteries less than 140 ampere-hours? A random sample of 20 batteries had a mean of 138.47 and a standard deviation of 2.66. Assume a normal distribution. Test at the .05 level of significance.

### One-Tailed t Test Solution

• H0: μ ≥ 140
• Ha: μ < 140
• α = .05
• df = 20 - 1 = 19
• p-value = .009 (MegaStat)
• Conclusion: We can be 99.1% confident that the population mean is less than 140, and since that exceeds the requirement of 95% we can conclude μ < 140.
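The p-value in this solution comes from MegaStat. As a cross-check, the sketch below computes the t statistic from the summary numbers and approximates the tail probability by integrating the t density numerically with Simpson's rule. This is a stand-in under stated assumptions, not MegaStat's method; the helper names `t_pdf` and `t_upper_tail` are hypothetical.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|]; steps must be even."""
    if t < 0:
        return 1 - t_upper_tail(-t, df, steps)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3  # half the area lies above zero


# Battery example: n = 20, sample mean 138.47, sample std. dev. 2.66, H0 value 140.
n, x_bar, s, mu0 = 20, 138.47, 2.66, 140
df = n - 1
t_stat = (x_bar - mu0) / (s / sqrt(n))

# Left-tailed test: P(T < t_stat) equals P(T > -t_stat) by symmetry.
p_value = t_upper_tail(-t_stat, df)

print(f"t = {t_stat:.2f}, df = {df}, p-value = {p_value:.3f}")
```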

### One-Tailed t Test

You’re a marketing analyst for Wal-Mart. Wal-Mart had teddy bears on sale last week. The weekly sales (\$00s) of bears sold in 10 stores were: 8 11 0 4 7 8 10 5 8 3. At the .05 level of significance, is there evidence that the average bear sales per store is more than 5 (\$00s)?

### One-Tailed t Test Solution*

• H0: μ ≤ 5
• Ha: μ > 5
• α = .05
• df = 10 - 1 = 9
• p-value = .111 from MegaStat
• Confidence in Ha = 1 - .111 = .889

Required confidence to conclude Ha is 95%.

There is insufficient evidence that the population mean is more than 5 since we can be only 88.9% confident.

One-tailed T-test for a Mean (μ) with Unknown σ, Using MegaStat

Hypothesis Test: Mean vs. Hypothesized Value

| Value | Statistic |
| --- | --- |
| 5.0000 | hypothesized value |
| 6.4000 | mean Sales (\$00) |
| 3.3731 | std. dev. |
| 1.0667 | std. error |
| 1.31 | t |
| .1109 | p-value (one-tailed, upper) |
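The mean, standard deviation, standard error, and t value in the MegaStat output can be reproduced from the ten store sales figures. This is a minimal sketch with the Python standard library; it stops at the t statistic and does not compute the p-value.

```python
from math import sqrt
from statistics import mean, stdev

sales = [8, 11, 0, 4, 7, 8, 10, 5, 8, 3]  # weekly bear sales ($00s) in 10 stores
mu0 = 5                                   # hypothesized value

n = len(sales)
x_bar = mean(sales)
s = stdev(sales)            # sample standard deviation (divides by n - 1)
std_error = s / sqrt(n)
t_stat = (x_bar - mu0) / std_error

print(f"mean       = {x_bar:.4f}")      # 6.4000
print(f"std. dev.  = {s:.4f}")          # 3.3731
print(f"std. error = {std_error:.4f}")  # 1.0667
print(f"t          = {t_stat:.2f}")     # 1.31
```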

### Two-Tailed t Test

You work for the FTC. A manufacturer of detergent claims that the mean weight of detergent is 3.25 lb. You take a random sample of 64 containers. You calculate the sample average to be 3.238 lb. with a standard deviation of .117 lb. At the .01 level of significance, is the manufacturer correct?

### Two-Tailed t Test Solution*

• H0: μ = 3.25
• Ha: μ ≠ 3.25
• α = .01
• df = 64 - 1 = 63
• t = (3.238 - 3.25) / (.117 / √64) = -0.82
• p-value = .415 (two-tailed; the one-tailed value is .208)
• Confidence in Ha = 1 - .415 = .585

Need to be 99% confident to conclude Ha. There is insufficient evidence that the population mean is not 3.25 since we can only be 58.5% confident. The null hypothesis is acceptable.

### Applications

An example using L1 One sample numerical variable.xlsx

Problem 3: Test the hypothesis that the mean price per square foot for SAD3Pool is different from \$320 at a level of significance of .05. How does that compare to the 95% confidence interval you calculated in Problem 1? Use a level of significance of .05 in this problem and all that follow.

Problem 4: Use the Wilcoxon signed rank test to test whether the median price per square foot for SAD2Pool is different from \$320.

Problem 5: Test the hypothesis that the mean price per square foot for SAD1NoPool is less than \$350.

Example 6: Use the Wilcoxon signed rank test to test whether the median price per square foot for SAD1NoPool is less than \$350.

Your Turn: Do PS1 problems 2, 3, 4

### Two Independent Populations: Example Applications

1. An economist wishes to determine whether there is a difference in mean family income for households in two socioeconomic groups.
2. An admissions officer of a small liberal arts college wants to compare the mean SAT scores of applicants educated in rural high schools and in urban high schools.

How can we tell what to use for these situations?
• Both have a numerical variable and a categorical variable (with 2 categories)
• See “Choosing Situation by Data Type”

### Two Independent Populations (σ1 and σ2 Unknown)

Assumptions
• Independent, random samples
• Populations are approximately normally distributed
• Population standard deviations are equal

If at least one population is not normal, then an alternative test is to compare population medians using the Wilcoxon Mann-Whitney test.

### Hypothesis Test Example

You’re a financial analyst for Charles Schwab. Is there a difference in dividend yield between stocks listed on the NYSE and NASDAQ? You collect the following data:

| | NYSE | NASDAQ |
| --- | --- | --- |
| Number | 11 | 15 |
| Mean | 3.27 | 2.53 |
| Std Dev | 1.30 | 1.16 |

Assuming normal populations, is there a difference in average yield (α = .05)?

### Independent Samples Hypothesis Test Solution

• H0: μ1 - μ2 = 0 (μ1 = μ2)
• Ha: μ1 - μ2 ≠ 0 (μ1 ≠ μ2)
• α = .05
• df = 11 + 15 - 2 = 24
• Need to be 95% confident to conclude Ha
• p-value = .1397
• Confidence in Ha is 1 - .1397 = .8603

There is little evidence of a difference in means since we can only be 86.03% confident that the population means are different.

Two Sample T-test & C.I. for Mean Difference Assuming Equal Variances, Using MegaStat

Hypothesis Test: Independent Groups (t-test, pooled variance)

| NYSE | NASDAQ | |
| --- | --- | --- |
| 3.27 | 2.53 | mean |
| 1.3 | 1.16 | std. dev. |
| 11 | 15 | n |

| Value | Statistic |
| --- | --- |
| 0.740 | difference (NYSE - NASDAQ) |
| 1.489 | pooled variance |
| 1.220 | pooled std. dev. |
| 0.484 | standard error of difference |
| 0 | hypothesized difference |
| 1.53 | t |
| .1397 | p-value (two-tailed) |
| -0.260 | confidence interval 95.% lower |
| 1.740 | confidence interval 95.% upper |
| 1.000 | margin of error |
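The pooled-variance quantities in this output follow directly from the summary statistics. The sketch below reproduces them with the standard library; the critical value 2.064 (t with 24 degrees of freedom, .025 in each tail) is an assumption taken from a standard t table, and the p-value is not recomputed here.

```python
from math import sqrt

# Summary statistics from the slide: (n, mean, std. dev.)
n1, mean1, sd1 = 11, 3.27, 1.30   # NYSE
n2, mean2, sd2 = 15, 2.53, 1.16   # NASDAQ

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
pooled_sd = sqrt(pooled_var)
std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))

diff = mean1 - mean2
t_stat = (diff - 0) / std_error   # hypothesized difference is 0

# Assumption: table value of t for df = 24 with .025 in each tail (95% confidence).
T_CRIT_95_DF24 = 2.064
margin = T_CRIT_95_DF24 * std_error
lower, upper = diff - margin, diff + margin

print(f"pooled variance = {pooled_var:.3f}, pooled std. dev. = {pooled_sd:.3f}")
print(f"standard error  = {std_error:.3f}, t = {t_stat:.2f}, df = {df}")
print(f"95% CI for the difference: {lower:.3f} to {upper:.3f}")
```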

Wilcoxon Mann-Whitney test using MegaStat

Wilcoxon - Mann/Whitney Test (variable: Pr/SF)

| n | sum of ranks | |
| --- | --- | --- |
| 32 | 1035.5 | SAD1Pool |
| 29 | 855.5 | SAD2Pool |
| 61 | 1891 | total |

| Value | Statistic |
| --- | --- |
| 992.000 | expected value |
| 69.243 | standard deviation |
| 0.621 | z, corrected for ties, with continuity correction |
| .5346 | p-value (two-tailed) |

• H0: Population medians are equal
• H1: Population medians are not equal
• P-value = .5346

We can only be 46.54% confident of a difference in population medians.

See L1 2 sample tests excel file for this example.
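The expected value, standard deviation, z, and p-value in the Wilcoxon Mann-Whitney output can be approximated from the group sizes and the first group's rank sum. The sketch below uses the large-sample normal approximation with a continuity correction. It omits the tie correction that MegaStat applies, which is an assumption of this sketch; the results agree with the output to the precision shown.

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 32, 29        # SAD1Pool, SAD2Pool
rank_sum_1 = 1035.5    # sum of ranks for SAD1Pool

total_n = n1 + n2
expected = n1 * (total_n + 1) / 2              # mean of the rank sum under H0
std_dev = sqrt(n1 * n2 * (total_n + 1) / 12)   # no tie correction (assumption)

# Continuity correction: shrink the distance from the expected value by 0.5.
z = (abs(rank_sum_1 - expected) - 0.5) / std_dev
p_two_tailed = 2 * (1 - NormalDist().cdf(z))

print(f"expected = {expected:.3f}, std. dev. = {std_dev:.3f}")
print(f"z = {z:.3f}, two-tailed p-value = {p_two_tailed:.4f}")
```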

### Applications

An example using the L1 2 sample tests.xlsx excel file.

Example 7: Test whether price per square foot has the same population means for homes with and without pools.

Your Turn: Do PS1 problems 5, 6, 7

### A Single Categorical Variable: Z Test for a Proportion

1. Condition
   • nπ and n(1-π) > 5
2. Z-test from MegaStat

Example: Do ranch style homes make up less than 50% of the population of homes?

Data: A sample of 108 homes revealed that 44 were ranch style.

### One-Tailed Solution

• H0: π ≥ 0.50
• Ha: π < 0.50
• α = .05
• P-value = .0271 from Excel MegaStat
• We can be 97.29% confident that the population proportion is less than 0.5 and therefore can conclude that π < 0.50

95% confidence interval estimate for π: We can be 95% confident that the population proportion falls between .3147 and .5001.

Note: A 2-tailed test would have found the null hypothesis acceptable.
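The p-value and interval above can be reproduced with a z test for a proportion. The sketch below assumes 44 ranch homes out of 108, which is the ranch total in the style-by-district table later in this lecture and the count consistent with the reported p-value and interval. It uses the hypothesized proportion in the test's standard error and the sample proportion in the interval's standard error.

```python
from math import sqrt
from statistics import NormalDist

n, ranch, pi0 = 108, 44, 0.50   # ranch = 44 is an assumption, see the lead-in
p_hat = ranch / n

# Condition check: n*pi0 and n*(1 - pi0) should both exceed 5.
condition_ok = n * pi0 > 5 and n * (1 - pi0) > 5

# Left-tailed z test of H0: pi >= 0.50 against Ha: pi < 0.50.
z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
p_value = NormalDist().cdf(z)

# 95% confidence interval for the population proportion.
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"95% CI: {lower:.4f} to {upper:.4f}")  # 0.3147 to 0.5001
```

The two-tailed p-value is twice the one-tailed value, which is above .05 and matches the note that a 2-tailed test finds the null hypothesis acceptable.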

### Applications

An example using the L1 Categorical variables tests and CI-1.xls file.

Example 8: Test whether less than 50% of the homes are ranch style in the population and obtain a 95% interval estimate for that population proportion.

Your Turn: Do PS1 problem 8

### Two Categorical Variables: Chi-square Test for Independence

1. Chi-square test statistic

Example: Do 3 different school districts have the same percentage of ranch, trilevel and two-story homes?

Data: A sample of 108 homes revealed the following table.

Count of STYLE, by SAD (rows) and STYLE (columns):

| SAD | ranch | trilevel | twostory | Grand Total |
| --- | --- | --- | --- | --- |
| SAD1 | 8 | 24 | 11 | 43 |
| SAD2 | 15 | 11 | 7 | 33 |
| SAD3 | 21 | 4 | 7 | 32 |
| Grand Total | 44 | 39 | 25 | 108 |

### Chi-square Solution

• H0: No relationship
• Ha: Relationship exists
• α = .05
• P-value = .0005 from Excel MegaStat
• We can be 99.95% confident that there is a relationship between school district and style of home

A follow-up analysis suggests that SAD 1 has fewer ranch homes and more trilevel homes than expected and that the reverse holds for SAD 3. See the Results tab in L1 Categorical variables file for details.
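The chi-square statistic behind this p-value can be computed from the table of counts. In the sketch below the counts are read by district (SAD1: 8, 24, 11; SAD2: 15, 11, 7; SAD3: 21, 4, 7), the reading under which every row and column total in the data matches. The tail probability uses the closed form that holds when the degrees of freedom are even; the helper name `chi_sq_upper_tail_even_df` is hypothetical.

```python
from math import exp

# Observed counts: rows are SAD1, SAD2, SAD3; columns are ranch, trilevel, twostory.
observed = [
    [8, 24, 11],
    [15, 11, 7],
    [21, 4, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total  # count expected under H0
        chi_sq += (count - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi_sq_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid only for even df."""
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


p_value = chi_sq_upper_tail_even_df(chi_sq, df)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The largest contributions come from the SAD1 and SAD3 ranch and trilevel cells, which is consistent with the follow-up analysis described above.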

### Applications

An example using the L1 Categorical variables tests and CI-1.xlsx file.

Example 9: Test whether there is a relationship between SAD and style of home.

Your Turn: Do PS1 problem 9