Setup and Assumptions for This Lecture


The Standardized Normal Distribution
X is N( μ, σ^2 )
Z is N( 0, 1^2 )
The standardized normal
1. For comparison of several different
normal distributions
2. For calculations without a computer
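As a minimal sketch of standardization (the numbers below are hypothetical, not from the lecture), a value x from N(μ, σ^2) is converted to z = (x – μ) / σ, and probabilities are then read off the standard normal. Python's standard library `statistics.NormalDist` plays the role of the table:

```python
from statistics import NormalDist

def standardize(x: float, mu: float, sigma: float) -> float:
    """Convert a value x from N(mu, sigma^2) to its z-score on N(0, 1)."""
    return (x - mu) / sigma

# Hypothetical example: X is N(100, 15^2); find P(X <= 130).
mu, sigma = 100.0, 15.0
z = standardize(130.0, mu, sigma)

p_from_z = NormalDist().cdf(z)               # probability from the standard normal
p_direct = NormalDist(mu, sigma).cdf(130.0)  # same probability computed directly

print(f"z = {z}, P via Z = {p_from_z:.4f}, P direct = {p_direct:.4f}")
```

Both routes give the same probability, which is why a single standard normal table is enough for any normal distribution.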
The Normal Approximation
to the Binomial
[Figure: the binomial and normal distributions plotted together; horizontal axis 0 to 10, vertical axis 0 to 0.3]
Both distributions have the same shape…
This normal approximation to the binomial
works reasonably well when
1. np ≥ 5 and n(1-p) ≥ 5
2. No computer is nearby
But it is an important fact that the binomial
distribution and normal distribution are
similar … we will return to this subject
relatively soon
… central limit …
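A rough sketch of the comparison, assuming n = 10 and p = 0.5 (hypothetical values chosen so that np and n(1-p) both equal 5; the parameters behind the slide's figure are not stated). The normal curve uses the binomial's mean np and std dev sqrt(np(1-p)), with a half-unit continuity correction:

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # assumed values for illustration
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

rows = []
for k in range(n + 1):
    exact = binomial_pmf(k, n, p)
    # Continuity correction: normal area between k - 0.5 and k + 0.5
    near = approx.cdf(k + 0.5) - approx.cdf(k - 0.5)
    rows.append((k, exact, near))
    print(f"{k:2d}  binomial={exact:.4f}  normal={near:.4f}")
```

The two columns agree to within about one percentage point at every outcome, which is the sense in which the distributions "have the same shape."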
Conceptual idea of new topics
When talking about binomial and normal probabilities,
we’ve taken the following point of view:
A situation follows certain “probabilities,” and
we can use this knowledge to deduce specific
information about the situation
Now we will take the reverse point of view:
Specific information about a situation can be used to
find “probabilities” that describe the situation
The word “features” could replace “probabilities”
For example, think about the quiz you took…
Professors have always noticed that students’ scores
on a test tend to follow a normal distribution
By actually giving a test to a sample of students,
you can estimate the mean and standard deviation
of the underlying normal distribution
For tests like the SAT, the underlying distribution is
then used as a ranking measure for students taking
the same test later
These ideas are loose, and first we’re going
to learn how to work with sample data
Populations and Samples
A population is a complete set of data representing
a given situation
A sample is a subset of the population --- ideally a
small-scale replica of the population
E.g., all students that take the SAT constitute a
population, while those taking the test on a particular
Saturday are a sample
E.g., all American citizens are a population, while those
selected for a survey are a sample
Populations are a relative concept
For the following definitions, imagine a population like the
starting salaries of all MBA students graduating this year
A population is assumed to follow a random
variable X, with values and probabilities
X = starting salary of a particular MBA student
P(X = $100,000) = ????
So we could calculate the expected value μ of the
population, as well as the standard deviation σ
Except it is sometimes hard to get a handle on the entire
population. Imagine finding out the starting salaries of
every single graduating MBA student in the U.S.!
So instead of trying to look at the entire population,
we look at a sample of the population, which
hopefully gives us a good picture of the population
We might take a survey of graduating MBAs to
determine the average (or expected) starting salary
• A sample statistic is a quantitative measure of a sample;
used to make estimates of the population
• A sample mean (or expected value) is used to estimate
the population mean
• A sample standard deviation is used to estimate the
population standard deviation
Summary/Sample Measures
A sample is made up of n observations X1, X2, …, Xn
sample mean = X-bar = (X1 + X2 + … + Xn) / n
Sample std dev = SX =
sqrt ( [ (X1 – X-bar)^2 + (X2 – X-bar)^2 + … + (Xn – X-bar)^2 ] / (n-1) )
The median is the middle of the values; 50% of the observation values
fall below the median and 50% above
The mode is the most frequent observation value
The maximum and minimum are the largest and smallest observations;
the range is the difference between the max and min
Relevant Excel Commands
= AVERAGE(array)
= STDEV(array)
Tools → Data Analysis → Descriptive Statistics
(see Excel file)
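For readers without Excel, the same summary measures can be sketched with Python's standard `statistics` module. The sample below is made up for illustration; `statistics.stdev` divides by n-1, like the sample std dev formula above:

```python
import statistics

# Hypothetical sample of n = 8 observations (not data from the lecture).
sample = [52, 47, 55, 47, 60, 49, 51, 47]

mean = statistics.mean(sample)           # counterpart of =AVERAGE(array)
std_dev = statistics.stdev(sample)       # counterpart of =STDEV(array), n-1 divisor
median = statistics.median(sample)       # middle of the sorted values
mode = statistics.mode(sample)           # most frequent observation
value_range = max(sample) - min(sample)  # max minus min

print(mean, round(std_dev, 4), median, mode, value_range)
```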
Setup and Assumptions for This Lecture
• You have a population about which you’d like to
know things such as mean, std dev, proportions
• Each member of the population is assumed to
follow the random variable X with mean μX, std
dev σX, and particular proportion pX
• Again, however, μX, σX, and pX are unknown
• The population is too big to measure directly
• You will take samples instead
• What information can be deduced?
A Practical Approach: Point Estimates
Here’s an idea…
To estimate μX, σX, and pX, take a sample of size
n and calculate sample mean X-bar, sample std
dev SX, and sample proportion X/n. Use these as
“point estimates” of the true μX, σX, and pX
(a) Xi = value of the i-th observation:
X-bar = (1/n) (X1 + ... + Xn)
SX = sqrt( [ (X1 – X-bar)^2 + ... + (Xn – X-bar)^2 ] / (n – 1) )
(b) Xi = 1 if the i-th observation has the attribute, 0 otherwise:
X/n = (1/n) (X1 + ... + Xn)
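A small sketch of the three point estimates, using made-up observations and a hypothetical attribute ("value exceeds 23.6"). It illustrates that the sample proportion X/n is just the sample mean of the 0/1 indicators in case (b):

```python
import statistics

# Case (a): hypothetical measured values (not from the lecture).
values = [23.1, 24.0, 23.8, 22.9, 24.4, 23.5]

# Case (b): 1 if the observation has the attribute, 0 otherwise.
# The attribute "value > 23.6" is an assumption made for this example.
has_attribute = [1 if v > 23.6 else 0 for v in values]

x_bar = statistics.mean(values)         # point estimate of the population mean
s_x = statistics.stdev(values)          # point estimate of the population std dev
p_hat = statistics.mean(has_attribute)  # X/n, point estimate of the proportion

print(round(x_bar, 4), round(s_x, 4), p_hat)
```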
Can we do better?
Point estimates are nice, but is there a better
idea? After all, who knows how close X-bar, SX,
and X/n are to μX, σX, and pX?
…interval estimates…
For example, point estimate: “An estimate for the true mean μX
is the point estimate X-bar = 23.66.”
For example, interval estimate: “There is a 95% probability that
the true mean μX lies between 23.60 and 23.70.”
Interval estimates are stronger than point estimates
Yes, we can do better!
But it takes the investigation of some
pretty tricky concepts…
X-bar
the sampling random variable
and the sampling distribution
The Sampling Random Variable and
the Sampling Distribution
Fix in your mind a number n – the number of
observations taken in a single sample
Now think about taking many different samples of size n
and calculating the sample mean for each sample taken
X-bar
The sampling random variable is
the random variable that assigns
the sample mean to each sample
of size n …
And the sampling distribution is
the distribution of this random
variable
Key Facts about the
Sampling Distribution
The mean of the sampling distribution is the mean of the population:
μX-bar = μX
The std dev of the sampling distribution is the std dev of the
population divided by the square root of n:
σX-bar = σX / sqrt(n)
Central Limit Theorem: If n is large then the sampling
random variable is approximately normally distributed
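These facts can be checked by simulation. The sketch below assumes a hypothetical population (uniform on [0, 1], so the mean is 0.5 and the std dev is sqrt(1/12)), draws many samples of size n, and compares the mean and std dev of the sample means with the population mean and the population std dev divided by sqrt(n):

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: uniform on [0, 1].
pop_mean, pop_sd = 0.5, sqrt(1 / 12)
n = 36              # observations per sample
num_samples = 5000  # number of samples drawn

sample_means = [
    statistics.fmean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means   = {mean_of_means:.4f} (population mean {pop_mean})")
print(f"std dev of sample means = {sd_of_means:.4f} (theory {pop_sd / sqrt(n):.4f})")
```

Even though the population here is far from normal, a histogram of `sample_means` would look bell-shaped, as the Central Limit Theorem suggests.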
Comments on the Sampling Distribution
Remember: we don’t actually know μX or σX
and so we don’t know the mean and standard
deviation of the sampling distribution either
μX-bar = μX
σX-bar = σX / sqrt(n)
We can make statements like: “The sample mean of a
random sample of size n has a 95% chance of falling within
2 std devs (of the sampling distribution) up or down from the true population mean”
The standard deviation of the sampling distribution
is commonly called the standard error
An Example
We can make statements like: “The sample mean of a random
sample of size n has a 95% chance of falling within 2 standard
errors up or down from the true population mean”
(see Excel)
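A simulation sketch of the "within 2 standard errors" statement, assuming a hypothetical normal population with mean 50 and std dev 10. Here the true values are known only because we chose them; in practice they are not:

```python
import random
import statistics
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma = 50.0, 10.0  # assumed true population mean and std dev
n = 40                  # sample size
standard_error = sigma / sqrt(n)

trials = 4000
hits = 0
for _ in range(trials):
    x_bar = statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    # Count the sample if its mean lands within 2 standard errors of mu.
    if abs(x_bar - mu) <= 2 * standard_error:
        hits += 1

coverage = hits / trials
print(f"fraction of sample means within 2 standard errors: {coverage:.3f}")
```

The fraction comes out near 95%, matching the statement above.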
Again, we must stress that we don’t know
true population mean, population std dev, or
sampling distribution std error
How the Sampling
Distribution Can Be Used
If we don’t know anything about the sampling distribution
except “in theory,” then how can we really use it?
Well, we can determine some information about the sampling
distribution by taking an actual sample of size n
μX-bar ≈ X-bar = (1/n) (X1 + ... + Xn)
σX-bar ≈ SX-bar = sqrt( [ (X1 – X-bar)^2 + ... + (Xn – X-bar)^2 ] / [ n (n – 1) ] ) = SX / sqrt(n)
SX-bar is called the sample standard error
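A minimal sketch of the sample standard error, using a made-up sample. It computes SX-bar both as SX / sqrt(n) and directly from the n(n-1) form, which should agree:

```python
import statistics
from math import sqrt

# Hypothetical sample (not data from the lecture).
sample = [6.62, 6.63, 6.61, 6.64, 6.62, 6.63, 6.62, 6.61, 6.63]

n = len(sample)
x_bar = statistics.mean(sample)  # estimate of the sampling distribution's mean
s_x = statistics.stdev(sample)   # sample std dev (n-1 divisor)
s_x_bar = s_x / sqrt(n)          # sample standard error

# Same quantity from the direct formula with n(n-1) in the denominator.
direct = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n * (n - 1)))

print(f"X-bar = {x_bar:.4f}, SX = {s_x:.5f}, SX-bar = {s_x_bar:.5f}")
```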
A Practical Approach:
Interval Estimates (Means)
Using a sample of size n, let X-bar serve as a
point estimate of the true population mean μX and
of the mean μX-bar of the sampling distribution
Also let the sample standard error SX-bar serve as
an estimate of the standard error of X-bar
From this information, we can build “confidence
intervals” for the true mean μX of the population
Heart Valve Manufacturer
Dimension               Mean     Std. Deviation
Piston Diameter         0.060    0.0002
Sleeve Diameter         0.065    0.0002
Clearance (unsorted)    0.005    0.000283
Approximately 52% of the heart valve
assemblies will meet the desired tolerance.
Can we do better?
Decision: Implement sorting with batches of 5
A random sample (after sorting has been implemented) of
100 piston/valve assemblies yields 79 valid (meet
tolerances) assemblies out of the 100 trials.
How do we know whether or not the process change has
really improved the resulting yield?
The yield (# good assemblies out of 100) is a
binomial random variable.
Our estimate of the mean (based on this sample)
is 79% (or 79 out of 100).
One way of determining whether the process has
been improved is to construct a confidence
interval about our estimate.
X = the number of good assemblies in 100 trials.
The probability that X is within + or – 10 of our estimate, 79:
P{69 < X ≤ 89} = P{X ≤ 89} – P{X ≤ 69}
= BINOMDIST(89,100,0.79,true) – BINOMDIST(69,100,0.79,true)
≈ 0.9971 – 0.0123 ≈ 0.985
79 ± 10 is a 98.5% confidence interval for the number of
valid assemblies out of 100.
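The BINOMDIST calculation can be reproduced without Excel. In this sketch, `binom_cdf` is a hand-written stand-in for BINOMDIST(k, n, p, true) and `coverage` is a helper name introduced here for illustration; both use only the standard library:

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); counterpart of BINOMDIST(k, n, p, TRUE)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def coverage(half_width: int, center: int = 79, n: int = 100, p: float = 0.79) -> float:
    """P(X <= center + half_width) - P(X <= center - half_width)."""
    return binom_cdf(center + half_width, n, p) - binom_cdf(center - half_width, n, p)

# Confidence attached to 79 +/- r for several half-widths r.
for r in (2, 4, 8, 10, 14):
    print(f"79 +/- {r:2d}: {coverage(r):.1%}")
```

Widening the interval can only add probability, so the confidence rises with the half-width.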
Confidence Interval    % Confidence
79 ± 2                 37.6%
79 ± 4                 67.4%
79 ± 8                 94.9%
79 ± 10                98.5%
79 ± 14                99.9%
Note: the larger the interval the more certain we become
that it covers the true mean.
Note that the yield of the original process was 52%. Since
the lower limit of a 99.9% confidence interval about our
sample mean is 65% (substantially larger than 52%) we can
be pretty certain the process has improved.
Confidence Intervals (Means)
Using a single sample of size n ≥ 30 with sample mean X-bar and
sample standard error SX-bar,
a 95% confidence interval for the actual population mean μX is
X-bar ± 1.960 · SX-bar
“We are 95% confident that the true population
mean μX is between these two numbers.”
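A sketch of the interval calculation, assuming the usual large-sample form X-bar ± 1.960 · SX-bar with SX-bar = SX / sqrt(n). The sample is randomly generated for illustration, and `mean_confidence_interval` is a name introduced here:

```python
import random
import statistics
from math import sqrt

def mean_confidence_interval(sample, z=1.960):
    """Return (low, high) = X-bar -/+ z * SX-bar; intended for n >= 30."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s_x_bar = statistics.stdev(sample) / sqrt(n)  # sample standard error
    return x_bar - z * s_x_bar, x_bar + z * s_x_bar

random.seed(2)
# Hypothetical sample of 40 observations (not data from the lecture).
sample = [random.gauss(23.66, 0.2) for _ in range(40)]

low, high = mean_confidence_interval(sample)
print(f"95% C.I. for the population mean: {low:.3f} to {high:.3f}")
```

Passing a different `z` (for example 2.576) gives other confidence levels.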
Confidence Intervals (Means) (cont’d)
For a k% confidence interval, 1.960 is replaced with
the value z having P(-z ≤ Z ≤ z) = 0.01*k:
By formula, z = NORMSINV( 0.01*( 50 + k/2 ) )
k%     0.01*(50 + k/2)    z = NORMSINV(0.01*(50 + k/2))
90%    0.95               1.645
95%    0.975              1.960
99%    0.995              2.576
Z is the standardized normal
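The z values can also be computed in Python, where `NormalDist().inv_cdf` from the standard library is the counterpart of NORMSINV. The helper name `z_for_confidence` is introduced here for illustration:

```python
from statistics import NormalDist

def z_for_confidence(k: float) -> float:
    """z such that the standard normal puts k% of its area between -z and z."""
    return NormalDist().inv_cdf(0.01 * (50 + k / 2))

for k in (90, 95, 99):
    z = z_for_confidence(k)
    # Area between -z and z, as a check that it recovers k%.
    central = NormalDist().cdf(z) - NormalDist().cdf(-z)
    print(f"k = {k}%: z = {z:.3f}, P(-z <= Z <= z) = {central:.3f}")
```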
Sample Problem
The corresponding Excel file contains a sample of size
80 on the length of a precision shaft for use in lathes.
a. Calculate the mean, standard deviation, and standard
error of the 80 values
b. Construct 95% and 99% confidence intervals (C.I.s)
for the population mean
(see Excel)
Confidence Intervals (Proportions)
If you have
then a 95% confidence interval for the true population proportion is
The 1.960 follows the same rules as for means with n ≥ 30
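A sketch of a proportion interval, assuming the standard normal-approximation form X/n ± z · sqrt( (X/n)(1 – X/n) / n ), which is consistent with the heart valve numbers used in this lecture. The function name and the example counts are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes: int, n: int, confidence: float = 0.95):
    """Normal-approximation C.I. for a population proportion (assumed form)."""
    p_hat = successes / n                           # sample proportion X/n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.960 for 95%
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical survey: 30 of 120 respondents have the attribute.
low, high = proportion_confidence_interval(successes=30, n=120)
print(f"95% C.I. for the proportion: {low:.3f} to {high:.3f}")
```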
Confidence Intervals - Using Central Limit Theorem for
Heart Valve Problem
Recall that: X = # successful assemblies in 100 trials
An estimate of the probability of obtaining a successful assembly, p, is given
by:
Estimate = X/100
The estimate is, therefore, a binomial random variable with :
Mean = np/n = p
And
Standard Deviation = sqrt(np[1-p])/n = sqrt(p[1-p]/n)
Note: We can apply the CLT to approximate the binomial
with a normal having the same mean and standard deviation.
Central Limit Theorem Restated for
Population Proportions
• As the sample size, n, increases, the sampling
distribution approaches a normal distribution with
• Mean = p
• Standard deviation = sqrt[p (1 – p)/n]
Heart Valve Example Re-Visited
• 79 out of 100 assemblies were good 
• Estimates for mean and stdev are:
• Mean = 0.79
• Stdev = sqrt[0.79 (1 – 0.79)/100] = 0.040731
To reiterate, we are using the Central
Limit Theorem to approximate the
distribution of the estimate as a Normal
distribution with mean 0.79 and standard
deviation 0.040731.
What we mean by, for example, a 95%
confidence interval is to find a number, r,
satisfying:
P(0.79 – r <= p <= 0.79 + r) = 95%
Since the Normal distribution is symmetric about its mean
(in this case 0.79), this means that exactly half of the
“leftover” probability (5% for a 95% confidence interval)
must lie in each tail. This means that a probability of 2.5%
must lie in each tail for a 95% confidence interval. In other
words,
P{estimate <= 0.79 – r} = 0.025, and
P{estimate <= 0.79 + r} = 0.975
To perform this calculation, use the NORMINV function
from Excel:
0.79 + r = NORMINV(0.975,0.79,0.040731) ≈ 0.8698.
Solving for r, we get r ≈ 0.0798, so
a 95% confidence interval for the estimate is
given by
estimate = 0.79 ± 0.0798
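The same calculation can be sketched in Python, where `NormalDist(mean, std_dev).inv_cdf(q)` from the standard library stands in for NORMINV(q, mean, std_dev):

```python
from math import sqrt
from statistics import NormalDist

n, successes = 100, 79
p_hat = successes / n                    # estimate of p: 0.79
std_dev = sqrt(p_hat * (1 - p_hat) / n)  # std dev of the estimate

# Upper end of the interval: the 97.5th percentile of the normal approximation.
upper = NormalDist(p_hat, std_dev).inv_cdf(0.975)
r = upper - p_hat                        # half-width of the 95% interval

print(f"std dev = {std_dev:.6f}, upper = {upper:.4f}, r = {r:.4f}")
```

Replacing 0.975 with another upper-tail percentile gives the half-width for a different confidence level.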
Results using Normal Approximation:
Confidence Level    Confidence Interval
99.9%               0.79 ± 0.135
98.5%               0.79 ± 0.099
95%                 0.79 ± 0.0798
67.4%               0.79 ± 0.04
Results using Binomial Directly:
Confidence Level    Confidence Interval
99.9%               0.79 ± 0.14
98.5%               0.79 ± 0.10
94.9%               0.79 ± 0.08
67.4%               0.79 ± 0.04
37.6%               0.79 ± 0.02
Example Problem Continued
c. Estimate the population proportion of shafts that
exceed 6.625 inches. Construct a 90% C.I. for this
proportion
(see Excel)
Sample Size Needed to Achieve
High Confidence (Means)
In estimating μX, how many observations n
are needed to obtain a 95% confidence interval for a
particular error tolerance?
The error tolerance E is ½ the
width of the confidence interval
Here, σ is a conservative
(high) estimate of the true std
dev σX, often obtained from
a preliminary small sample
1.960 can be adjusted to get different confidences
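A sketch of the sample size calculation, assuming the standard formula n = (z · σ / E)^2, which follows from setting the half-width z · σ / sqrt(n) equal to E and solving for n. The function name and the example inputs are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma: float, error: float, confidence: float = 0.95) -> int:
    """Smallest n with z * sigma / sqrt(n) <= error (assumed standard formula)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.960 for 95%
    return ceil((z * sigma / error) ** 2)           # round up to a whole observation

# Hypothetical inputs: conservative std dev 10, error tolerance 2.
n_95 = sample_size_for_mean(sigma=10, error=2)
n_99 = sample_size_for_mean(sigma=10, error=2, confidence=0.99)
print(n_95, n_99)
```

Halving the error tolerance roughly quadruples the required sample size, since E enters the formula squared.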
Example Problem Continued
d. Consider the sample of 80 as a preliminary
sample. Find the minimal sample size to yield a
95% C.I. for the population mean with E =
0.00005. What about for 99% confidence?
(see Excel)
Sample Size Needed to Achieve
High Confidence (Proportions)
In estimating pX, how many observations n
are needed to obtain a 95% confidence interval for a
particular error tolerance?
The error tolerance E is ½ the
width of the confidence interval
Here, p is a conservative
(closer to 0.5) estimate of the
true population proportion
pX, often obtained from a
preliminary small sample
1.960 can be adjusted to get different confidences
Polling Example
In estimating the proportion of the population that
approves of Bush’s performance as President, how many
people should be polled to provide a 95% confidence
interval with error 0.02?
“83% of Americans approve of the job Bush is
doing, plus or minus 2 percentage points”
We should use a conservative estimate of the true proportion
n = (1.960/0.02)^2 (0.5)(1 – 0.5) = 2401.0
But, if we’re certain pX ≥ 0.6…
n = (1.960/0.02)^2 (0.6)(1 – 0.6) = 2304.96
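The arithmetic in the polling example can be checked with a few lines of Python. The function name is introduced here for illustration; it implements the formula n = (z / E)^2 · p(1 – p) used above and returns the unrounded value, which in practice would be rounded up to a whole number of people:

```python
def sample_size_for_proportion(p: float, error: float, z: float = 1.960) -> float:
    """Unrounded n = (z / error)^2 * p * (1 - p)."""
    return (z / error) ** 2 * p * (1 - p)

conservative = sample_size_for_proportion(p=0.5, error=0.02)  # worst case, p = 0.5
if_p_is_06 = sample_size_for_proportion(p=0.6, error=0.02)    # if pX is at least 0.6

print(f"p = 0.5: n = {conservative:.2f}")
print(f"p = 0.6: n = {if_p_is_06:.2f}")
```

Because p(1 – p) is largest at p = 0.5, any other assumed proportion gives a smaller required sample.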