#### Transcript Top Ten #1 - Armstrong

**Review of Top 10 Concepts in Statistics**

*NOTE: This Power Point file is not an introduction, but rather a checklist of topics to review*

**Top Ten**

- #10: Qualitative vs. Quantitative Data
- #9: Population vs. Sample
- #8: Graphical Tools
- #7: Variation Creates Uncertainty
- #6: Which Distribution?
- #5: P-value
- #4: Linear Regression
- #3: Confidence Intervals
- #2: Descriptive Statistics
- #1: Hypothesis Testing

**Top Ten #10**

Qualitative vs. Quantitative

**Qualitative**

Categorical data:

- success vs. failure
- ethnicity
- marital status
- color
- zip code
- 4-star hotel in tour guide

**Qualitative**

If you need an “average”, do not calculate the mean. However, you can compute the mode (the “average” person is married and buys a blue car made in America).

**Quantitative**

- integer values (0,1,2,…)
- number of brothers
- number of cars arriving at gas station
- Real numbers, such as decimal values ($22.22)
- Examples: Z, t
- Miles per gallon, distance, duration of time

**Hypothesis Testing and Confidence Intervals**

- Quantitative: Mean
- Qualitative: Proportion

**Top Ten #9**

Population vs. Sample

**Population**

- Collection of all items (all light bulbs made at factory)
- Parameter: measure of population characteristic
  1. population mean (average number of hours in life of all bulbs)
  2. population proportion (% of all bulbs that are defective)

**Sample**

- Part of population (bulbs tested by inspector)
- Statistic: measure of sample = estimate of parameter
  1. sample mean (average number of hours in life of bulbs tested by inspector)
  2. sample proportion (% of bulbs in sample that are defective)

**Top Ten #8: Graphical Tools**

- Pie chart or bar chart: qualitative
- Joint frequency table: qualitative (relate marital status vs zip code)
- Scatter diagram: quantitative (distance from ASU vs duration of time to reach ASU)
- Histograms
- Stem plots

**Graphical Tools**

- Line chart: trend over time
- Scatter diagram: relationship between two variables
- Bar chart: frequency for each category
- Histogram: frequency for each class of measured data (graph of frequency distribution)
- Box plot: graphical display based on quartiles, which divide data into 4 parts

**Top Ten #7**

Variation Creates Uncertainty

**No Variation**

- Certainty, exact prediction
- Standard deviation = 0
- Variance = 0
- All data exactly the same
- Example: all workers in minimum wage job

**High Variation**

- Uncertainty, unpredictable
- High standard deviation
- Ex #1: Workers in downtown L.A. have variation between CEOs and garment workers
- Ex #2: New York temperatures in spring range from below freezing to very hot

**Comparing Standard Deviations**

Temperature Example

- Beach city: small standard deviation (single temperature reading close to mean)
- High Desert city: high standard deviation (hot days, cool nights in spring)

**Standard Error of the Mean**

- Standard deviation of sample mean = standard deviation / square root of n
- Ex: standard deviation = 10, n = 4, so standard error of the mean = 10/2 = 5
- Note that 5 < 10, so standard error < standard deviation.

As n increases, standard error decreases.
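The arithmetic above can be checked with a short sketch. The function name `standard_error` is an illustrative choice, not something from the slides.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    return sd / math.sqrt(n)

# Slide example: standard deviation = 10, n = 4
print(standard_error(10, 4))    # 5.0

# Quadrupling the sample size halves the standard error
print(standard_error(10, 16))   # 2.5
```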

**Sampling Distribution**

- Expected value of sample mean = population mean, but an individual sample mean could be smaller or larger than the population mean
- Population mean is a constant parameter, but sample mean is a random variable
- Sampling distribution is distribution of sample means

**Example**

- Mean age of all students in the building is population mean
- Each classroom has a sample mean
- Distribution of sample means from all classrooms is sampling distribution

**Central Limit Theorem (CLT)**

- If population standard deviation is known, sampling distribution of sample means is approximately normal if n > 30
- CLT applies even if original population is skewed
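A small simulation can make the CLT concrete. This is only a sketch under assumed settings: the exponential population, the sample size of 36, the 5,000 replicates, and the seed are all illustrative choices that do not come from the slides.

```python
import random
import statistics

random.seed(0)      # fixed seed so the run is repeatable (assumed value)

n = 36              # sample size, chosen to satisfy n > 30 (assumed)
replicates = 5000   # number of samples drawn (assumed)

# Exponential population with rate 1: mean = 1, standard deviation = 1,
# strongly skewed to the right.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replicates)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Expect the mean of the sample means to be near the population mean (1)
# and their standard deviation to be near sigma / sqrt(n) = 1/6.
print(round(mean_of_means, 3), round(sd_of_means, 3))
```

Even though the population is skewed, the sample means cluster symmetrically around the population mean, with spread close to the standard error.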

**Top Ten #6**

What Distribution to Use?

**Normal Distribution**

- Continuous, bell-shaped, symmetric
- Mean = median = mode
- Measurement (dollars, inches, years)
- Cumulative probability under normal curve: use Z table if you know population mean and population standard deviation
- Sample mean: use Z table if you know population standard deviation and either normal population or n > 30

**t Distribution**

- Continuous, mound-shaped, symmetric
- Applications similar to normal
- More spread out than normal
- Use t if normal population but population standard deviation not known
- Degrees of freedom = df = n-1 if estimating the mean of one population
- t approaches z as df increases

**Normal or t Distribution?**

- Use t table if normal population but population standard deviation (σ) is not known
- If you are given the sample standard deviation (*s*), use t table, assuming normal population

**Top Ten #5**

P-value

**P-value**

P-value = probability of getting a sample statistic as extreme as (or more extreme than) the one you got from your sample, given that the null hypothesis is true

**P-value Example: one tail test**

- H0: µ = 40
- HA: µ > 40
- Sample mean = 43
- P-value = P(sample mean > 43, given H0 true)
- Meaning: probability of observing a sample mean as large as 43 when the population mean is 40
- How to use it: Reject H0 if p-value < α (significance level)

**Two Cases**

Suppose α = .05

- Case 1: suppose p-value = .02, then reject H0 (unlikely H0 is true; you believe population mean > 40)
- Case 2: suppose p-value = .08, then do not reject H0 (H0 may be true; you do not have enough evidence to conclude that the population mean is greater than 40)

**P-value Example: two tail test**

- H0: µ = 70
- HA: µ ≠ 70
- Sample mean = 72
- If two-tails, then P-value = 2 × P(sample mean > 72) = 2(.04) = .08
- If α = .05, p-value > α, so do not reject H0
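The decision rule used in these p-value examples can be written as a tiny sketch. The helper name `decide` is hypothetical; the p-values are the ones given on the slides.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Reject H0 only when the p-value is below the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"

# One-tail cases from the slides (alpha = .05)
print(decide(0.02))   # reject H0
print(decide(0.08))   # do not reject H0

# Two-tail example: double the one-tail probability before comparing
one_tail = 0.04
two_tail = 2 * one_tail
print(decide(two_tail))   # do not reject H0
```

Note that the one-tail probability of .04 would have led to rejection on its own; doubling it for the two-tail test changes the decision.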

**Top Ten #4**

Linear Regression

**Linear Regression**

Regression equation:

$$\hat{y} = b_0 + b_1 x$$

- ŷ = dependent variable = predicted value
- x = independent variable
- b0 = y-intercept = predicted value of y if x = 0
- b1 = slope = regression coefficient = change in y per unit change in x

**Slope vs Correlation**

- Positive slope (b1 > 0): positive correlation between x and y (y increases if x increases)
- Negative slope (b1 < 0): negative correlation (y decreases if x increases)
- Zero slope (b1 = 0): no correlation (predicted value for y is mean of y), no linear relationship between x and y

**Simple Linear Regression**

- Simple: one independent variable, one dependent variable
- Linear: graph of regression equation is straight line

**Example**

- y = salary (female manager, in thousands of dollars)
- x = number of children
- n = number of observations

**Given Data**

| x | y |
|---|---|
| 2 | 48 |
| 1 | 52 |
| 4 | 33 |

**Totals**

| | x | y |
|---|---|---|
| | 2 | 48 |
| | 1 | 52 |
| | 4 | 33 |
| Sum | 7 | 133 |

n = 3

**Slope (b1) = -6.5**

- Method of Least Squares formulas not on BUS 302 exam; b1 = -6.5 is given
- Interpretation: If one female manager has 1 more child than another, her expected salary is $6,500 lower; that is, salary of female managers is expected to decrease by 6.5 (in thousands of dollars) per child

**Intercept (b0)**

$$b_0 = \bar{y} - b_1 \bar{x}$$

$$\bar{x} = \frac{\sum x}{n} = \frac{7}{3} = 2.33$$

$$\bar{y} = \frac{\sum y}{n} = \frac{133}{3} = 44.33$$

$$b_0 = 44.33 - (-6.5)(2.33) = 59.5$$

If number of children is zero, expected salary is $59,500

**Regression Equation**

$$\hat{y} = 59.5 - 6.5x$$

**Forecast Salary If 3 Children**

59.5 – 6.5(3) = 40

$40,000 = expected salary
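The slides take the slope as given, so the sketch below is an addition: it applies the standard least-squares formula (slope = sum of cross-deviations divided by sum of squared x-deviations), which is an assumption about how the -6.5 was obtained, and then reproduces the intercept and the forecast for 3 children.

```python
# Salary example from the slides
x = [2, 1, 4]      # number of children
y = [48, 52, 33]   # salary in thousands of dollars
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard least-squares slope (formula not shown on the slides)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx

# Intercept from the slide formula: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

# Forecast salary for a manager with 3 children
forecast = b0 + b1 * 3

print(round(b1, 2), round(b0, 2), round(forecast, 2))   # -6.5 59.5 40.0
```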

**Standard Error of Estimate**

Forecast:

$$\hat{y} = b_0 + b_1 x$$

Error:

$$\text{error} = y - \hat{y}$$

Standard error of estimate:

$$S = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$$

**Standard Error of Estimate**

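| (1) = x | (2) = y | (3) = ŷ = 59.5 – 6.5x | (4) = (2) – (3) = y – ŷ | (y – ŷ)² |
|---|---|---|---|---|
| 2 | 48 | 46.5 | 1.5 | 2.25 |
| 1 | 52 | 53 | -1 | 1 |
| 4 | 33 | 33.5 | -.5 | .25 |
| | | | | SSE = 3.5 |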

**Standard Error of Estimate**

$$S = \sqrt{\frac{3.5}{3-2}} = \sqrt{3.5} = 1.9$$

Actual salary is typically $1,900 away from expected salary
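The table and the final calculation can be reproduced with a short sketch. It uses the slide's coefficients (59.5 and -6.5) directly; the variable names are illustrative.

```python
import math

x = [2, 1, 4]          # number of children
y = [48, 52, 33]       # actual salary, thousands of dollars
b0, b1 = 59.5, -6.5    # intercept and slope from the slides

predicted = [b0 + b1 * xi for xi in x]                # 46.5, 53.0, 33.5
residuals = [yi - pi for yi, pi in zip(y, predicted)] # 1.5, -1.0, -0.5
sse = sum(r ** 2 for r in residuals)                  # sum of squared errors

n = len(x)
s = math.sqrt(sse / (n - 2))   # standard error of estimate

print(sse, round(s, 1))   # 3.5 1.9
```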

**Coefficient of Determination**

- R² = % of total variation in y that can be explained by variation in x
- Measure of how closely the linear regression line fits the points in a scatter diagram
- R² = 1: max. possible value: perfect linear relationship between y and x (straight line)
- R² = 0: min. value: no linear relationship

**Sources of Variation (V)**

- Total V = Explained V + Unexplained V
- SS = Sum of Squares = V
- Total SS = Regression SS + Error SS
- SST = SSR + SSE
- SSR = Explained V, SSE = Unexplained V

**Coefficient of Determination**

$$R^2 = \frac{SSR}{SST} = \frac{197}{200.5} = .98$$

Interpretation: 98% of total variation in salary can be explained by variation in number of children
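As a rough check, the sums of squares can be computed from the three data points. This sketch gets SSR by subtracting SSE from SST; the unrounded values (about 197.17 and 200.67) differ slightly from the rounded 197 and 200.5 on the slide, but R² still rounds to .98.

```python
# Salary example from the slides
y = [48, 52, 33]                 # actual salaries (thousands of dollars)
predicted = [46.5, 53.0, 33.5]   # 59.5 - 6.5x for x = 2, 1, 4

y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                   # total variation
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained variation
ssr = sst - sse                                            # explained variation

r_squared = ssr / sst

print(round(sst, 2), round(ssr, 2), round(r_squared, 2))   # 200.67 197.17 0.98
```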

**0 ≤ R² ≤ 1**

- 0: No linear relationship since SSR = 0 (explained variation = 0)
- 1: Perfect relationship since SSR = SST (unexplained variation = SSE = 0), but does not prove cause and effect

**R=Correlation Coefficient**

Case 1: slope (b1) < 0, so R < 0

R is the negative square root of the coefficient of determination:

$$R = -\sqrt{R^2}$$

**Our Example**

- Slope = b1 = -6.5
- R² = .98
- R = -.99

**Case 2: Slope > 0**

- R is the positive square root of the coefficient of determination
- Ex: R² = .49, so R = .70
- R has no interpretation
- R overstates relationship

**Caution**

- Nonlinear relationship (parabola, hyperbola, etc.) can NOT be measured by R²
- In fact, you could get R² = 0 with a nonlinear graph on a scatter diagram

**Summary: Correlation Coefficient**

Case 1: If b1 > 0, R is the positive square root of the coefficient of determination. Ex#1: y = 4+3x, R² = .36: R = +.60

Case 2: If b1 < 0, R is the negative square root of the coefficient of determination. Ex#2: y = 80-10x, R² = .49: R = -.70

NOTE! Ex#2 has stronger relationship, as measured by coefficient of determination
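The sign rule in the two cases can be sketched as a small helper. The function name `correlation_from_r_squared` is hypothetical; it simply takes the square root of R² and gives it the sign of the slope.

```python
import math

def correlation_from_r_squared(r_squared: float, slope: float) -> float:
    """Return R: the square root of R-squared, carrying the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Ex#1: y = 4 + 3x, R-squared = .36
print(round(correlation_from_r_squared(0.36, 3), 2))     # 0.6

# Ex#2: y = 80 - 10x, R-squared = .49
print(round(correlation_from_r_squared(0.49, -10), 2))   # -0.7

# Salary example: slope = -6.5, R-squared = .98
print(round(correlation_from_r_squared(0.98, -6.5), 2))  # -0.99
```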

**Extreme Values**

- R = +1: perfect positive correlation
- R = -1: perfect negative correlation
- R = 0: zero correlation

**MS Excel Output**

- Correlation Coefficient (-0.9912): Note that you need to change the sign because the sign of slope (b1) is negative (-6.5)
- Coefficient of Determination
- Standard Error of Estimate
- Regression Coefficient

**Top Ten #3**

Confidence Intervals: Mean and Proportion

**Confidence Interval**

A confidence interval is a range of values within which the population parameter is expected to occur.

**Factors for Confidence Interval**

The factors that determine the width of a confidence interval are:

1. The sample size, *n*.
2. The variability in the population, usually estimated by standard deviation.
3. The desired level of confidence.

**Confidence Interval: Mean**

Use normal distribution (Z table) if population standard deviation (σ) is known and either (1) or (2):

1. Normal population
2. Sample size > 30

**Confidence Interval: Mean**

If normal table, then

$$\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$$

**Normal Table**

- Tail = .5(1 – confidence level)
- NOTE! Different statistics texts have different normal tables
- This review uses the tail of the bell curve
- Ex: 95% confidence: tail = .5(1 – .95) = .025

$$Z_{.025} = 1.96$$

**Example**

n = 49, Σx = 490, σ = 2, 95% confidence

$$\frac{490}{49} \pm 1.96\,\frac{2}{\sqrt{49}} = 10 \pm 0.56$$

9.44 < µ < 10.56

**Another Example**

An ASU professor wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours. It is assumed that the population standard deviation is 4 hours. What is the population mean?

**Another Example – cont’d**

95 percent confidence interval for the population mean:

$$\bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}} = 24.00 \pm 1.96\,\frac{4}{\sqrt{49}} = 24.00 \pm 1.12$$

The confidence limits range from 22.88 to 25.12. We estimate with 95 percent confidence that the average number of hours worked per week by students lies between these two values.
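Both z-interval examples can be reproduced with Python's standard library. This is a sketch: the helper name `z_interval` is hypothetical, and it computes z from the normal distribution rather than reading 1.96 from a table, so results agree with the slides after rounding.

```python
import math
from statistics import NormalDist

def z_interval(mean: float, sigma: float, n: int, confidence: float = 0.95):
    """Confidence interval for a mean when the population sigma is known."""
    tail = 0.5 * (1 - confidence)        # area in one tail, as on the slides
    z = NormalDist().inv_cdf(1 - tail)   # about 1.96 for 95% confidence
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

# Hours worked per week: n = 49, sample mean = 24, sigma = 4
low, high = z_interval(24, 4, 49)
print(round(low, 2), round(high, 2))     # 22.88 25.12

# Earlier example: n = 49, sample mean = 490/49 = 10, sigma = 2
low2, high2 = z_interval(490 / 49, 2, 49)
print(round(low2, 2), round(high2, 2))   # 9.44 10.56
```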

**Confidence Interval: Mean t distribution**

- Use if normal population but population standard deviation (σ) not known
- If you are given the sample standard deviation (*s*), use t table, assuming normal population
- If one population, n-1 degrees of freedom

**Confidence Interval: Mean t distribution**

$$\bar{x} \pm t_{n-1}\,\frac{s}{\sqrt{n}}$$

**Confidence Interval: Proportion**

- Use if success or failure (ex: defective or not-defective, satisfactory or unsatisfactory)
- Normal approximation to binomial ok if (n)(π) > 5 and (n)(1-π) > 5, where n = sample size and π = population proportion
- NOTE: NEVER use the t table if proportion!!

**Confidence Interval: Proportion**

$$p \pm z\sqrt{\frac{p(1-p)}{n}}$$

Ex: 8 defectives out of 100, so p = .08 and n = 100, 95% confidence

$$.08 \pm 1.96\sqrt{\frac{(0.08)(.92)}{100}} = .08 \pm .05$$

**Confidence Interval: Proportion**

A sample of 500 people who own their house revealed that 175 planned to sell their homes within five years. Develop a 98% confidence interval for the proportion of people who plan to sell their house within five years.

$$p = \frac{175}{500} = 0.35$$

$$.35 \pm 2.33\sqrt{\frac{(.35)(.65)}{500}} = .35 \pm .0497$$
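The proportion interval can be sketched the same way. The helper name `proportion_interval` is hypothetical. Because it uses the exact normal quantile (about 2.326) instead of the table value 2.33, the margin for the house example comes out near .0496 rather than .0497.

```python
import math
from statistics import NormalDist

def proportion_interval(successes: int, n: int, confidence: float):
    """Return (p, margin) for the normal-approximation interval p ± margin."""
    p = successes / n
    tail = 0.5 * (1 - confidence)
    z = NormalDist().inv_cdf(1 - tail)
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin

# 175 of 500 homeowners plan to sell, 98% confidence
p, margin = proportion_interval(175, 500, 0.98)
print(p, round(margin, 4))

# 8 defectives out of 100, 95% confidence
p2, margin2 = proportion_interval(8, 100, 0.95)
print(p2, round(margin2, 2))
```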

**Interpretation**

- If 95% confidence, then 95% of all confidence intervals will include the true population parameter
- NOTE! Never use the term “probability” when estimating a parameter!! (Ex: Do NOT say “Probability that population mean is between 23 and 32 is .95” because the parameter is not a random variable. In fact, the population mean is a fixed but unknown quantity.)

**Point vs Interval Estimate**

- Point estimate: statistic (single number)
- Ex: sample mean, sample proportion
- Each sample gives different point estimate
- Interval estimate: range of values
- Ex: Population mean = sample mean ± error
- Parameter = statistic ± error

**Width of Interval**

- Ex: sample mean = 23, error = 3
- Point estimate = 23
- Interval estimate = 23 ± 3, or (20, 26)
- Width of interval = 26 – 20 = 6
- Wide interval: point estimate unreliable

**Wide Confidence Interval If**

1. small sample size (n)
2. large standard deviation
3. high confidence level (ex: 99% confidence interval wider than 95% confidence interval)

If you want a narrow interval, you need a large sample size or small standard deviation or low confidence level.

**Top Ten #2**

Descriptive Statistics

**Measures of Central Location**

- Mean
- Median
- Mode

**Mean**

- Population mean = µ = Σx/N = (5+1+6)/3 = 12/3 = 4
- Algebra: Σx = N × µ = 3 × 4 = 12
- Sample mean = x-bar = Σx/n
- Example: the number of hours spent on the Internet: 4, 8, and 9; x-bar = (4+8+9)/3 = 7 hours
- Do NOT use if the number of observations is small or with extreme values
- Ex: Do NOT use if 3 houses were sold this week, and one was a mansion

**Median**

- Median = middle value
- Example: 5, 1, 6
  - Step 1: Sort data: 1, 5, 6
  - Step 2: Middle value = 5
- When there is an even number of observations, the median is computed by averaging the two observations in the middle.
- OK even if there are extreme values
- Home sales: 100K, 200K, 900K, so mean = 400K, but median = 200K

**Mode**

- Mode: most frequent value
- Ex: female, male, female; Mode = female
- Ex: 1, 1, 2, 3, 5, 8; Mode = 1
- It may not be a very good measure; see the following example

**Measures of Central Location Example**

- Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23
- Sample Mean = x-bar = Σx/n = 100/10 = 10
- Median = (8+9)/2 = 8.5
- Mode = 0
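These three measures are available in Python's standard `statistics` module, so the example can be verified with a minimal sketch:

```python
import statistics

sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

mean = statistics.mean(sample)       # sum / count = 100 / 10
median = statistics.median(sample)   # average of the two middle values, 8 and 9
mode = statistics.mode(sample)       # most frequent value

print(mean, median, mode)
```

The mode of 0 sits at the edge of the data, far from the mean of 10 and the median of 8.5, which is why it can be a poor measure of central location here.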

**Relationship**

- Case 1: if probability distribution symmetric (ex. bell-shaped, normal distribution), Mean = Median = Mode
- Case 2: if distribution positively skewed (skewed to the right; ex. incomes of employees in a large firm: a large number of relatively low-paid workers and a small number of high-paid executives), Mode < Median < Mean

**Relationship – cont’d**

- Case 3: if distribution negatively skewed (skewed to the left; ex. the time taken by students to write exams: a few students hand in their exams early and the majority turn in their exams at the end of the exam), Mean < Median < Mode

**Dispersion – Measures of Variability**

- How much spread of data
- How much uncertainty
- Measures: range, variance, standard deviation

**Range**

- Range = Max – Min ≥ 0
- But range is affected by unusual values
- Ex: Santa Monica has a high of 105 degrees and a low of 30 once a century, but range would be 105 – 30 = 75

**Standard Deviation (SD)**

- Better than range because all data used
- Population SD = square root of variance = sigma = σ
- SD ≥ 0

**Empirical Rule**

- Applies to mound or bell-shaped curves
- Ex: normal distribution
- 68% of data within ± one SD of mean
- 95% of data within ± two SD of mean
- 99.7% of data within ± three SD of mean
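For the normal distribution specifically, the three percentages can be checked against the standard normal CDF. This sketch uses only the standard library:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean
shares = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]

for k, share in zip((1, 2, 3), shares):
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```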

**Standard Deviation = Square Root of Variance**

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$

**Sample Standard Deviation**

| x | x – x̄ | (x – x̄)² |
|---|---|---|
| 6 | 6 – 8 = -2 | (-2)(-2) = 4 |
| 6 | 6 – 8 = -2 | 4 |
| 7 | 7 – 8 = -1 | (-1)(-1) = 1 |
| 8 | 8 – 8 = 0 | 0 |
| 13 | 13 – 8 = 5 | (5)(5) = 25 |
| Sum = 40, Mean = 40/5 = 8 | Sum = 0 | Sum = 34 |

**Standard Deviation**

Total variation = 34 Sample variance = 34/4 = 8.5

Sample standard deviation = square root of 8.5 = 2.9

**Measures of Variability - Example**

The hourly wages earned by a sample of five students are: $7, $5, $11, $8, and $6

Range: 11 – 5 = 6

Variance:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{(7-7.4)^2 + \dots + (6-7.4)^2}{5-1} = \frac{21.2}{5-1} = 5.30$$

Standard deviation:

$$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$$
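Both worked examples of sample variance and standard deviation can be checked with the standard `statistics` module, whose `variance` and `stdev` functions use the n – 1 denominator shown above. A minimal sketch:

```python
import statistics

# Hourly wages of five students
wages = [7, 5, 11, 8, 6]
wage_range = max(wages) - min(wages)
wage_variance = statistics.variance(wages)   # divides by n - 1
wage_sd = statistics.stdev(wages)
print(wage_range, round(wage_variance, 2), round(wage_sd, 2))   # 6 5.3 2.3

# Earlier table: 6, 6, 7, 8, 13
data = [6, 6, 7, 8, 13]
data_variance = statistics.variance(data)
data_sd = statistics.stdev(data)
print(round(data_variance, 1), round(data_sd, 1))               # 8.5 2.9
```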

**Top Ten #1**

Hypothesis Testing

**H0: Null Hypothesis**

- Population mean = µ
- Population proportion = π
- A statement about the value of a population parameter
- Never include sample statistic (such as x-bar) in hypothesis

**HA or H1: Alternative Hypothesis**

ONE TAIL ALTERNATIVE

- Right tail: µ > number (smog ck), π > fraction (% defectives)
- Left tail: µ

**One-Tailed Tests**

A test is one-tailed when the alternate hypothesis, *H*1 or HA, states a direction, such as:

- *H*1: The mean yearly salary earned by full-time employees is more than $45,000. (µ > $45,000)
- *H*1: The average speed of cars traveling on the freeway is less than 75 miles per hour. (µ < 75)
- *H*1: Less than 20 percent of the customers pay cash for their gasoline purchase. (π < 0.2)

**Two-Tail Alternative**

- Population mean not equal to number (too hot or too cold)
- Population proportion not equal to fraction (% alcohol too weak or too strong)

**Two-Tailed Tests**

A test is two-tailed when no direction is specified in the alternate hypothesis:

- *H*1: The mean amount of time spent on the Internet is not equal to 5 hours. (µ ≠ 5)
- *H*1: The mean price for a gallon of gasoline is not equal to $2.54. (µ ≠ $2.54)

**Reject Null Hypothesis (H0) If**

- Absolute value of test statistic* > critical value*
- Reject H0 if |Z value| > critical Z
- Reject H0 if |t value| > critical t
- Reject H0 if p-value < significance level (alpha)

**Note that direction of inequality is reversed!**

Reject H0 if very large difference between sample statistic and population parameter in H0

\* Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.

\* Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.

**Example: Smog Check**

- H0: µ = 80
- HA: µ > 80
- If test statistic = 2.2 and critical value = 1.96, reject H0, and conclude that the population mean is likely > 80
- If test statistic = 1.6 and critical value = 1.96, do not reject H0, and reserve judgment about H0

**Type I vs Type II Error**

- Alpha = α = P(type I error) = significance level = probability that you reject a true null hypothesis
- Beta = β = P(type II error) = probability that you do not reject the null hypothesis, given H0 false
- Ex: H0: Defendant innocent
  - α = P(jury convicts innocent person)
  - β = P(jury acquits guilty person)

**Type I vs Type II Error**

| | H0 true | H0 false |
|---|---|---|
| Reject H0 | Alpha = α = P(type I error) | 1 – β (Correct Decision) |
| Do not reject H0 | 1 – α (Correct Decision) | Beta = β = P(type II error) |

**Example: Smog Check**

- H0: µ = 80
- HA: µ > 80
- If p-value = 0.01 and alpha = 0.05, reject H0, and conclude that the population mean is likely > 80
- If p-value = 0.07 and alpha = 0.05, do not reject H0, and reserve judgment about H0

**Test Statistic**

When testing for the population mean from a large sample and the population standard deviation is known, the test statistic is given by:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

**Example**

The processors of Best Mayo indicate on the label that the bottle contains 16 ounces of mayo. The standard deviation of the process is 0.5 ounces. A sample of 36 bottles from last hour’s production showed a mean weight of 16.12 ounces per bottle. At the .05 significance level, can we conclude that the mean amount per bottle is greater than 16 ounces?

**Example – cont’d**

1. State the null and the alternative hypotheses: *H*0: µ = 16, *H*1: µ > 16
2. Select the level of significance. In this case, we selected the .05 significance level.
3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z.
4. State the decision rule. Because this is a right-tail test, reject H0 if z > 1.645 (the critical value $z_{0.05}$).

**Example – cont’d**

5. Compute the value of the test statistic:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{16.12 - 16.00}{0.5 / \sqrt{36}} = 1.44$$

6. Conclusion: Since 1.44 < 1.645, do not reject the null hypothesis. We cannot conclude the mean is greater than 16 ounces.
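The Best Mayo test can be reproduced with a short sketch. The helper name `z_statistic` is hypothetical, and the p-value at the end is an addition that the slides do not compute for this example; it is included only to show that the critical-value rule and the p-value rule lead to the same decision.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Test statistic for a mean when the population sigma is known."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

alpha = 0.05
z = z_statistic(16.12, 16.00, 0.5, 36)
critical = NormalDist().inv_cdf(1 - alpha)   # right-tail critical value
p_value = 1 - NormalDist().cdf(z)            # right-tail p-value (not on the slides)

print(round(z, 2), round(critical, 3), round(p_value, 3))   # 1.44 1.645 0.075

# Both rules agree: z is below the critical value and the p-value exceeds alpha
reject = z > critical
print("reject H0" if reject else "do not reject H0")        # do not reject H0
```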