MATH408: PROBABILITY & STATISTICS

MATH408: Probability & Statistics
Summer 1999
WEEK 2
Dr. Srinivas R. Chakravarthy
Professor of Mathematics and Statistics
Kettering University
(GMI Engineering & Management Institute)
Flint, MI 48504-4898
Phone: 810.762.7906
Email: [email protected]
Homepage: www.kettering.edu/~schakrav
SAMPLE
• Sample: is a subset (part) of the population.
• Since it is infeasible (and impossible in
many cases) to study the entire population,
one has to rely on samples to make the
study.
• Samples have to be as representative as
possible in order to make valid conclusions
about the populations under study.
SAMPLE (cont'd)
• A representative sample contains more or less
the same type of information that the population has.
• For example if workers from three shifts are
involved in assembling cars of a particular
model, then the sample should contain units
from all three.
• Samples will be used to “estimate” the
parameters.
SAMPLE (cont’d)
• Much care should be devoted to the sampling.
• There is always going to be some error involved in
making inferences about the populations based on
the samples.
• The goal is to minimize this error as much as
possible.
• There are many ways of bringing in systematic
bias (consistently misrepresenting the population).
SAMPLE (cont’d)
• This can be avoided by taking random samples.
• Simple random sample: all units are equally likely to
be selected.
• Multi-stage sample: units are selected in several
stages.
• Cluster sample: is used when there is no list of all the
elements in the population and the elements are clustered
in larger units.
SAMPLE (cont’d)
• Stratified sample: In cases where the population under
study may be viewed as comprising different groups
(strata) and where elements in each group are more or less
homogeneous, we randomly select elements from every
one of the strata.
• Convenience sample: samples are taken based on
convenience of the experimenter.
• Systematic sample: units are taken in a systematic
way such as selecting every 10th item after selecting the
first item at random.
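The sampling schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 90 cars over three shifts, the function names, and the sample sizes are all hypothetical and do not come from the course data.

```python
import random


def simple_random_sample(units, k, rng):
    """Simple random sample: every unit is equally likely to be selected."""
    return rng.sample(units, k)


def systematic_sample(units, step, rng):
    """Systematic sample: random start, then every `step`-th unit."""
    start = rng.randrange(step)
    return units[start::step]


def stratified_sample(strata, k_per_stratum, rng):
    """Stratified sample: randomly select units from every stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}


rng = random.Random(408)  # fixed seed so the sketch is reproducible

# Hypothetical population: 30 cars assembled on each of three shifts.
shifts = ("day", "evening", "night")
cars = [(shift, i) for shift in shifts for i in range(30)]
by_shift = {s: [c for c in cars if c[0] == s] for s in shifts}

srs = simple_random_sample(cars, 9, rng)
systematic = systematic_sample(cars, 10, rng)
stratified = stratified_sample(by_shift, 3, rng)
```

Only the stratified sample is guaranteed to contain units from all three shifts; the simple random sample may, by chance, miss a shift.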
HOW TO USE SAMPLES?
• Samples should represent the population.
• A random sample will not always be
an exact copy of the population.
• Thus, there is bound to be some error:
SAMPLES (cont’d)
• Random or unbiased error: This is due to the random
selection of the sample and the mean of such error will be
0 as positive deviation and negative deviation cancel out.
This random error is also referred to as random deviation
and is measured by the standard deviation of the estimator.
• Non-random or biased error: this occurs due to several
sources such as human, machines, mistakes due to copying
or punching, recording and so on. Through careful
planning we should try to avoid or minimize this error.
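The claim that random error averages out to 0 can be checked exactly on a tiny example. The sketch below assumes a made-up population of five values and lists every possible simple random sample of size 2; the sample mean is used as the estimator of the population mean.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8, 10]  # made-up values, for illustration only
mu = Fraction(sum(population), len(population))

# Under simple random sampling every sample of size 2 is equally likely.
errors = [Fraction(sum(s), len(s)) - mu for s in combinations(population, 2)]

# Positive and negative deviations cancel, so the mean error is exactly 0.
mean_error = sum(errors) / len(errors)

# The size of the random error is measured by the standard deviation
# of the estimator (here, the sample mean) over all possible samples.
sd_of_estimator = float(sum(e * e for e in errors) / len(errors)) ** 0.5
```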
EXPERIMENTS USING MINITAB
• We will illustrate the concepts of sample,
sampling error, etc with practical data using
MINITAB when we go to the laboratory
next time.
NEXT?
• Once the data has been gathered, what do
we do next?
• Before any formal statistical inference
through estimation or test of hypotheses is
conducted, EDA should be employed.
EXPLORATORY DATA
ANALYSIS (EDA)
• This is a procedure by which the data are
carefully examined for patterns, if any, in order to
isolate them.
• First step in identifying appropriate model.
EDA (cont'd)
• The main difference between EDA and
conventional data analysis is:
– The former, which is more flexible (in
terms of assumptions on the nature of the
populations from which the data are gathered),
emphasizes searching for evidence and clues
for the patterns. The latter concentrates on
evaluating the evidence and the hypotheses on
the nature of the parameters of the
population(s) under study.
CAPABILITY ANALYSIS
• Deals with the study of the ability of the
process to manufacture products within
specifications.
• In order to perform the capability analysis,
the process must be stable (i.e., things such
as warm up period needed on the process
before manufacturing products and others
should be taken care of).
CAPABILITY ANALYSIS
(cont'd)
• The process specifications are compared to
the variance (or the spread) of the process.
• For a process to be more capable, more
measurements would be expected to fall
within the specifications.
CAPABILITY ANALYSIS
(cont'd)
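A commonly used capability index is given by

$$C_{pk} = \min\left\{ \frac{USL - \bar{x}}{3s},\ \frac{\bar{x} - LSL}{3s} \right\}$$

where USL and LSL are the upper and lower specification limits, and $\bar{x}$ and $s$ are the sample mean and sample standard deviation.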
CAPABILITY ANALYSIS
(cont'd)
• The larger the value of Cpk, the less
evidence that the process is outside the
specifications. A value of 1.5 or higher for
Cpk is usually desired. More on this will be
seen later.
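As a minimal sketch, Cpk can be computed directly from its definition (the smaller of the two distances from the sample mean to a specification limit, each divided by 3s). The measurements and the specification limits below are hypothetical.

```python
from statistics import mean, stdev


def cpk(measurements, lsl, usl):
    """Capability index: min of (USL - xbar)/(3s) and (xbar - LSL)/(3s)."""
    x_bar = mean(measurements)
    s = stdev(measurements)  # sample standard deviation (n - 1 divisor)
    return min((usl - x_bar) / (3 * s), (x_bar - lsl) / (3 * s))


# Hypothetical measurements and specification limits.
parts = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1]
index = cpk(parts, lsl=9.4, usl=10.6)
```

For these made-up numbers the index comes out slightly above 1.5, so this process would just meet the guideline mentioned above.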
DESCRIPTIVE STATISTICS
• Deals with characterization and summary of key
observations from the data.
• Quantitative measures: mean, median, mode,
standard deviation, percentiles, etc.
• Graphs: histogram, Box plot, scatter plot, Pareto
diagram, stem-and-leaf plot, etc.
• Here one has to be careful in interpreting the
numbers. Usually more than one descriptive
measure will be used to assess the problem on
hand.
DIFFERENT TYPES OF PLOTS
• Point plot: The horizontal axis (x-axis)
covers the range of the data values, and the
points are plotted vertically above it, stacking any
repeated values.
• Time series plot: The x-axis corresponds to the
number of the observation, the time of the
observation, the day, and so on, and the y-axis
corresponds to the value of the observation.
Time-series plot
PLOTS (cont'd)
• Scatter plot: Construct x-axis and y-axis
that cover the ranges of two variables. Plot
(xi, yi) points for each observation in the
data set.
• Histogram: This is a bar graph, where the
data is grouped into many classes. The x-axis
corresponds to the classes and the y-axis gives
the frequency of the observations.
Histogram
PLOTS (cont'd)
• Stem-and-leaf plot: Data is plotted in such a way
that the output looks like a histogram and also
features a frequency distribution. The idea is to
use the digits of the data to illustrate its range,
shape and density. Each observation is split into
leading digits and trailing digits. All the leading
digits are sorted and listed to the left of a vertical
line. The trailing digits are written to the right of
the vertical line.
Stem-and-leaf
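The construction described above can be sketched as follows. This is an illustrative helper, not MINITAB's algorithm: it assumes non-negative values, uses one line per stem, and the data are made up.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return display lines of the form 'stem | leaves'."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)     # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)  # last kept digit is the leaf
        rows[stem].append(str(leaf))
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # keep empty stems too
        lines.append(f"{stem:>3} | {''.join(rows[stem])}")
    return lines


data = [12, 15, 21, 23, 23, 28, 34, 35, 35, 36, 47]  # made-up values
for line in stem_and_leaf(data):
    print(line)
```

Turned on its side, the display reads like a histogram while still showing the individual digits.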
PLOTS (cont'd)
• Pareto Diagram: Named after the Italian
economist. This is a bar diagram for
qualitative factors. This is very useful to
identify and separate the commonly
occurring factors from the less important
ones. Visually it conveys the information
very easily.
Pareto Diagram
Pareto Chart for Failures
Solder Defects (Messina: SQC for MM, 1987)
[Figure: bars show Number of Occurrences by Defect (Count axis, 0 to 800); the line shows cumulative Percent (0 to 100)]

Defect                 Count   Percent   Cum %
Insufficient solder      440      55.0    55.0
Blowholes                120      15.0    70.0
Unwetted                  80      10.0    80.0
Unsoldered                64       8.0    88.0
Pinholes                  56       7.0    95.0
Shorts                    40       5.0   100.0
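The Count, Percent and Cum % rows of the chart can be reproduced with a short sketch. The counts are those printed on the chart; the defect names are as best read from the rotated axis labels, and the function name is hypothetical.

```python
def pareto_table(counts):
    """Sort categories by frequency and attach percent and cumulative percent."""
    total = sum(counts.values())
    rows, running = [], 0
    for defect, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((defect, count,
                     round(100.0 * count / total, 1),
                     round(100.0 * running / total, 1)))
    return rows


# Counts from the solder-defect chart; labels are a best reading of the axis.
solder_defects = {
    "Pinholes": 56,
    "Insufficient solder": 440,
    "Shorts": 40,
    "Blowholes": 120,
    "Unsoldered": 64,
    "Unwetted": 80,
}

for row in pareto_table(solder_defects):
    print(row)
```

The first two categories already account for 70% of all defects, which is the kind of separation the diagram is meant to make visible.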
PLOTS (cont'd)
• Box plot: due to J. Tukey, this plot provides a
great deal of information. A rectangle whose
lower and upper limits are the first and third
quartiles, respectively, is drawn. The
median is given by a horizontal line
segment inside the rectangle. The
average value is marked by a symbol such
as “x” or “+”. Points that are extreme relative
to the rest of the data are identified individually.
Boxplot for MPG example
[Figure: Box Plot of City MPG vs Size; y-axis: mpg_c (15 to 45); x-axis: size (Small, Midsize, Compact, Large, Sports, Van)]
PLOTS (cont'd)
• Quantile plot: This plot is very useful
when we want to identify or verify a
hypothesized population distribution from
which the data set could have been chosen.
A quantile, Q(r), is a number that divides a
sample (or population) into two groups so
that the specified fraction r of the data
values is less than or equal to the value of
the quantile.
PLOTS (cont'd)
• Probability plot: This involves plotting the
cumulative probability and the observed value of
the variable against a suitable probability scale,
which will result in linearization of the data. The
basic steps involved here are: (a) Sorting the data
into ascending order; (b) Computing the plotting
points; (c) Selecting appropriate probability paper;
(d) Plotting the points; (e) Fitting a “best” line to the data.
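Steps (a) through (e) can be sketched numerically for a normal probability plot. Several choices here are assumptions rather than part of the slides: the plotting positions (i - 0.5)/n are one common convention among several, normal "probability paper" is imitated by converting probabilities to standard normal quantiles, the "best" line is taken to be the least-squares line, and the data are made up.

```python
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Return (z, x): standard normal quantiles paired with the sorted data."""
    x = sorted(data)                                   # (a) sort ascending
    n = len(x)
    probs = [(i - 0.5) / n for i in range(1, n + 1)]   # (b) plotting positions
    z = [NormalDist().inv_cdf(p) for p in probs]       # (c) normal probability scale
    return z, x                                        # (d) points to plot: (z_i, x_i)


def least_squares_line(z, x):
    """(e) Fit x = intercept + slope * z by least squares."""
    z_bar, x_bar = mean(z), mean(x)
    slope = (sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
             / sum((zi - z_bar) ** 2 for zi in z))
    return slope, x_bar - slope * z_bar


data = [9.6, 10.4, 9.9, 10.1, 10.0, 9.7, 10.3, 10.2, 9.8, 10.0]  # made-up values
z, x = normal_plot_points(data)
slope, intercept = least_squares_line(z, x)
```

If the points fall close to the fitted line, the normal model is plausible; the intercept then roughly estimates the mean and the slope the standard deviation.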
MEASURES OF LOCATION
• MEAN: Used very often in analyzing the data.
– Although this is a common measure, if the data vary
greatly the average may take a non-typical value and
could be misleading.
• Median: is the halfway point of the data and tells us
something about the location of the distribution of the data.
• Mode: if it exists, gives the data value that occurs most
frequently.
– It is possible for a set of data to have 0, 1 or more
modes.
LOCATION (cont’d)
• Mean and median always exist.
• Mode need not exist.
• Median and mode are less sensitive to extreme
observations.
• Mean is most widely used.
• There are some data sets for which the median or mode
may be more appropriate than the mean.
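A small sketch with made-up numbers shows how differently the three measures react to one extreme observation, and that a data set can have more than one mode.

```python
from statistics import mean, median, multimode

typical = [22, 24, 24, 25, 27]   # made-up values
with_outlier = typical + [95]    # same data plus one extreme observation

mean_before, mean_after = mean(typical), mean(with_outlier)
median_before, median_after = median(typical), median(with_outlier)
modes_before, modes_after = multimode(typical), multimode(with_outlier)

# A data set with two modes.
two_modes = multimode([1, 1, 2, 2, 3])
```

Here the mean moves from 24.4 to about 36.2, a value that is typical of none of the observations, while the median moves only from 24 to 24.5 and the mode stays at 24.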
LOCATION (cont’d)
• Percentiles: The 100pth percentile of a set of data
is the value below which a proportion p of the data
points will lie.
• Percentiles convey more information and are very
useful in setting up warranty or guarantee periods
for manufactured items.
• Also referred to as quantiles.
• The shape of the frequency data can be classified
into several classes.
LOCATION (cont’d)
• Symmetric: mean = median = mode
• Positively skewed: tail to the right; mean > median
• Negatively skewed: tail to the left; median >
mean
• In problems, such as waiting time problems one is
interested in the tails of the distributions.
• For skewed data median is preferred to the mean.
MEASURES OF SPREAD
• One should not solely rely on mean or
median or mode.
• Also two or more sets of data may have the
same mean but they may be qualitatively
different.
• In order to make a meaningful study, we
need to rely on other measures.
MEASURES OF SPREAD
• For example, we may be interested to see
how the data is spread.
• Range: is the difference between the largest
and the smallest observations.
• Gives a quick estimate of the standard deviation.
• Plays an important role in SPC (statistical process control).
SPREAD (cont’d)
• Standard deviation: describes how the data is
spread around its mean.
• Coefficient of variation: The measures we have
seen so far depend on the unit of measurements. It
is sometimes necessary and convenient to have a
measure that is independent of the unit and such a
useful and common measure is given by the ratio
of the standard deviation to the mean called the
coefficient of variation.
SPREAD (cont’d)
• Interquartile range: is the difference
between the 75th and 25th percentiles.
• Gives the interval which contains the
central 50 % of the observations.
• Avoids the total dependence on extreme
data
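The measures of spread discussed so far can be computed as in the sketch below. The data are made up, and the quartiles use the default rule of Python's `statistics.quantiles`, which is one of several quartile conventions.

```python
from statistics import mean, quantiles, stdev

data = [4, 8, 15, 16, 23, 42]  # made-up values

data_range = max(data) - min(data)   # largest minus smallest observation
s = stdev(data)                      # sample standard deviation (n - 1 divisor)
cv = s / mean(data)                  # coefficient of variation (unit-free)
q1, _, q3 = quantiles(data, n=4)     # 25th, 50th and 75th percentiles
iqr = q3 - q1                        # spread of the central 50% of the data
```

Replacing 42 by 420 changes the range, the standard deviation and the coefficient of variation drastically but leaves the interquartile range almost unchanged, which is what "avoids the total dependence on extreme data" refers to.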
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\text{and}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} X_i^2 - \frac{n}{n-1}\bar{X}^2$$
Stem-and-leaf of cycles (Problem 2.2)    N = 70
Leaf Unit = 100

   1    0 3
   1    0
   5    0 7777
  10    0 88899
  22    1 000000011111
  33    1 22222223333
 (15)   1 444445555555555
  22    1 66667777777
  11    1 888899
   5    2 011
   2    2 22
Descriptive Statistics (Problem 2.2)
Variable: cycles

Anderson-Darling Normality Test
  A-Squared          0.250
  P-Value            0.735

  Mean             1403.66
  StDev             402.39
  Variance          161914
  Skewness        -2.5E-02
  Kurtosis        -4.4E-01
  N                     70

  Minimum           375.00
  1st Quartile     1097.75
  Median           1436.50
  3rd Quartile     1735.00
  Maximum          2265.00

95% Confidence Interval for Mu:       1307.71 to 1499.60
95% Confidence Interval for Sigma:     345.01 to  482.82
95% Confidence Interval for Median:   1263.17 to 1538.24

[Figure: histogram of cycles (axis 400 to 2000) with interval plots of the 95% confidence intervals for Mu and for the Median (axis 1250 to 1550)]
INFERENTIAL STATISTICS
• Recall that a parameter is a descriptive
measure of some characteristic of the
population.
• The standard ones are the mean, variance
and proportion.
• We will simply denote by $\theta$ the parameter
of the population under study.
INFERENTIAL STATISTICS
• Estimation Theory and Tests of Hypotheses
are two pillars of statistical inference.
• While estimation theory is concerned about
giving point and interval estimates for
parameter(s) under study, test of hypotheses
deals with testing claims on the
parameter(s).
Illustrative Example 1
• The following data corresponds to an
experiment in which the effect of engine
RPM on the horsepower is under study.
TABLE 1: Data for HP Example

hp@4500   hp@5500   hp@4500   hp@5500
    243       211       258       252
    241       234       257       264
    233       218       247       245
    264       240       276       275
    233       218       248       244
    263       237       272       270
    248       216       261       257
    240       203       268       273

Variable    N     Mean   Median   Tr Mean   StDev   SE Mean
hp@4500    16   253.25   252.50    253.07   13.51      3.38
hp@5500    16   241.06   242.00    241.36   23.16      5.79

Variable      Min      Max       Q1       Q3
hp@4500    233.00   276.00   241.50   263.75
hp@5500    203.00   275.00   218.00   262.25
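Most of the MINITAB summary for Table 1 can be reproduced with Python's standard library, as sketched below. The two lists assume Table 1 is read row by row with the columns alternating hp@4500, hp@5500; the function name is hypothetical, and the trimmed mean is not computed here.

```python
from statistics import mean, median, quantiles, stdev

# Table 1 values, assuming columns alternate hp@4500, hp@5500 across each row.
hp_4500 = [243, 258, 241, 257, 233, 247, 264, 276,
           233, 248, 263, 272, 248, 261, 240, 268]
hp_5500 = [211, 252, 234, 264, 218, 245, 240, 275,
           218, 244, 237, 270, 216, 257, 203, 273]


def describe(values):
    """Descriptive statistics in the spirit of the summary table."""
    n = len(values)
    s = stdev(values)
    q1, _, q3 = quantiles(values, n=4)  # default method uses (n + 1)p positions
    return {
        "N": n, "Mean": mean(values), "Median": median(values),
        "StDev": s, "SE Mean": s / n ** 0.5,
        "Min": min(values), "Max": max(values), "Q1": q1, "Q3": q3,
    }


for name, values in (("hp@4500", hp_4500), ("hp@5500", hp_5500)):
    print(name, {k: round(v, 2) for k, v in describe(values).items()})
```

The standard error of the mean is simply the standard deviation divided by the square root of N, which is why both SE Mean entries are one quarter of the StDev entries when N = 16.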
[Figure: Horse Power at 4500 RPM and 5500 RPM; time series plot of Horse Power (200 to 280) against Time Order]
[Figure: Boxplots of hp@4500 and hp@5500 (means are indicated by solid circles)]
GROUNDWORK FOR PROBABILITY
• Looking at the data in Table 1, why is it that the hp
values, say at 4500 RPM, are not exactly the same
if the experiment is repeated under the “same
conditions”?
• The fluctuation that occurs from one repetition to
another is called experimental variation, which is
usually referred to as “noise” or “statistical error”
or simply “error” [Recall this term from earlier
discussion on data collection].
PROBABILITY (cont’d)
• This represents the “variation” that is
inherently present in any (practical) system.
• The noise is a “random variable” and is
studied through probability.
What is Probability?
• A manufacturer of blender motors wants to
determine the warranty period for this
product.
• If motor life were constant, (say 8 years) the
manufacturer would have no problem. The
motor could be warranted for 8 years.
• But, in reality, the motor life is not a
constant.
PROBABILITY (cont’d)
• Some motors will fail quickly and others will last
for several years.
• There is an element of randomness in the life of
the motors.
• The manufacturer cannot precisely predict how
long any motor will last.
• Probability theory gives the manufacturer the
means to quantify what is known about motor
lifetimes and helps to quantify the risks involved
in setting a warranty period.
PROBABILITY (cont’d)
• Similar problems arise in the context of
other products.
• Flexible manufacturing systems (FMS) play an
important role in modern manufacturing. Improved
quality, lower inventory, shorter lead times, higher
productivity and greater safety are some of
the benefits derived from FMS.
• All of these have random elements.
PROBABILITY (cont’d)
• Probability theory deals with randomness,
allowing the study of quantities whose
behavior cannot be predicted completely in
advance.
• The above examples deal with
manufacturing.
PROBABILITY (cont’d)
• We could just as easily find examples in
business, electrical and computer
engineering, biomedical science and
engineering, sociology, economics,
marketing, civil engineering, the behavioral
sciences and so on. The underlying
problem, randomness, is the same.
PROBABILITY (cont’d)
• One should understand the ideas of probability
and statistics from both theoretical and practical
points of view.
• To properly apply probability and statistics in the
real world, we must appreciate both sides of the
picture.
• We cannot properly apply a procedure if we don't,
at least in general terms, understand the reasoning
(theory) behind it.
PROBABILITY (cont’d)
• On the other hand, trying to apply theory without
knowledge of the area of application is foolish. We
have to have a proper perspective on both before
meaningful progress can be made.
• Probability theory develops mathematical models
for random experiments.
• A random experiment is a sequence of actions
whose outcome cannot be predicted with certainty.
PROBABILITY (cont’d)
• If you've used phrases like "one chance in a 1000", "50-50"
or "3-to-2 odds" to describe something, you have most
likely been using an informal probability model.
• If we throw two fair dice and our concern is about whether
or not the dice eventually land and come to rest, then the
throwing of the two dice is not a random experiment.
• Our knowledge of physical laws allows us to predict with
virtual certainty that this outcome will happen.
PROBABILITY (cont’d)
• If, however, we are concerned with how
many dots show on the topmost faces when
the dice come to rest, then we are
performing a random experiment in tossing
the dice, since we cannot predict with
certainty which faces will show.
PROBABILITY (cont’d)
• Examples of outcomes of random experiments: the
length of a phone call, the gender mix of
three people chosen from a group of 25
people, the phenotype of the offspring
of a cross-breeding experiment, and the number
of defects on a painted panel.
EXPERIMENT
• Calculation of MPG of a new model car.
• Measurements of current in a thin copper wire.
• Measurements of Film build thickness in a
painting process.
• Duration of phone calls.
• Time to assemble a job.
• Tossing a coin.
• Throwing a die.
Sample space (S)
• Collection of all possible outcomes in an
experiment.
– The MPG’s of all cars from that particular
model car.
• Event (A)
– A subset of a sample space
– The MPG of the new model car exceeds, say,
25 miles per gallon.
SET THEORY
[Venn diagrams: $A \cup B$ (union) and $A \cap B$ (intersection)]
SET THEORY (cont’d)
[Venn diagram: $A'$ (complement of A)]
PROBABILITY
• is a function defined on the set of all
possible events.
• is a number between 0 and 1.
• satisfies a set of axioms:
– $P(A) \ge 0$.
– $P(S) = 1$.
– If $A_i \cap A_j = \emptyset$ for all $i \ne j$, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
PROPERTIES
• $0 \le P(A) \le 1$.
• $P(A') = 1 - P(A)$.
• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
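These properties can be checked on a small sample space. The sketch below assumes one fair die, so every outcome has probability 1/6; the events A and B are chosen only for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))  # sample space for one fair die


def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))


A = {2, 4, 6}   # an even number shows
B = {4, 5, 6}   # a number greater than 3 shows

complement_ok = P(S - A) == 1 - P(A)
union_ok = P(A | B) == P(A) + P(B) - P(A & B)
```

The subtraction in the last property removes the outcomes 4 and 6, which would otherwise be counted twice.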
Classical definition
• While the axiomatic definition of probability is very
useful in developing the theory of probability, it
doesn’t tell us how to compute probabilities of
events.
• Classical Definition: If S has a finite number of
sample points that are equally likely to occur, then
P(A) = (number of points in A) / (number of points in S).
• If S doesn’t contain equally likely outcomes, then
P(A) = sum of the weights associated with points
in A.
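Both cases can be illustrated with dice. In the sketch below, the two fair dice give 36 equally likely outcomes; the loaded die and its weights are purely hypothetical.

```python
from fractions import Fraction
from itertools import product

# Equally likely case: two fair dice, event A = "the sum is 7".
S = list(product(range(1, 7), repeat=2))
A = [outcome for outcome in S if sum(outcome) == 7]
p_sum_is_7 = Fraction(len(A), len(S))

# Unequal case: a hypothetical loaded die whose weights add up to 1.
weights = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
           4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}
p_even = sum(w for face, w in weights.items() if face % 2 == 0)
```

For the fair dice the answer is 6/36 = 1/6; for the loaded die the probability of an even face is 7/10 rather than 1/2.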
Classical definition(cont’d)
• To use this definition, we need to calculate the
number of points in S and in A.
• How do we do this without actually listing all
possible outcomes?
• Using Counting Techniques.
– Principle of addition and multiplication
– Permutations and combinations
Principle of Addition
and Multiplication
• Suppose a task consists of k subtasks and
subtask i can be done in $n_i$ ways.
• If the task is done when any one of the subtasks
is done, then the total number of ways of
doing the main task is $n_1 + n_2 + \cdots + n_k$.
• If the task is done if and only if all the
subtasks are done, then the total number of
ways of doing the main task is the product
$n_1 \times n_2 \times \cdots \times n_k$.
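Both principles can be checked by listing the possibilities. The travel routes and the car options below are hypothetical examples, not taken from the course.

```python
from itertools import product

# Addition: the trip is done by taking ANY ONE bus route or train route.
bus_routes = ["B1", "B2", "B3"]
train_routes = ["T1", "T2"]
ways_to_travel = len(bus_routes + train_routes)         # 3 + 2

# Multiplication: a car is specified only when ALL choices are made.
trims = ["base", "sport", "luxury"]
colors = ["red", "blue", "white", "black"]
engines = ["V6", "V8"]
configurations = list(product(trims, colors, engines))  # 3 * 4 * 2
```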
PERMUTATION
• Suppose that r objects are to be drawn
without replacement from n ($r \le n$).
• If the order of selection is important, then
using the principle of multiplication we see
that the total number of ways of doing this
is $n(n-1)\cdots(n-r+1)$.
• This could be written in a compact form
using factorials as $n!/(n-r)!$ or $P_r^n$.
COMBINATION
• Suppose that r objects are to be drawn
without replacement from n ($r \le n$).
• When the order of selection is not
important: since any particular set of r objects
can be ordered in $P_r^r = r!$ ways, the total number
of ways of selecting r out of n in which
order is immaterial is $P_r^n / r!$. It is convenient
to denote this by $C_r^n$ or by $\binom{n}{r}$.
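The two counting formulas can be verified by brute-force listing. The sketch below uses n = 5 and r = 3 as arbitrary illustrative values; `math.perm` and `math.comb` require Python 3.8 or later.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

n, r = 5, 3
items = range(n)

# Ordered selections: n(n-1)...(n-r+1) = n!/(n-r)!
ordered = list(permutations(items, r))
p_formula = factorial(n) // factorial(n - r)

# Unordered selections: each set of r objects appears r! times among the ordered ones.
unordered = list(combinations(items, r))
c_formula = p_formula // factorial(r)
```

With these values there are 60 ordered and 10 unordered selections, and both agree with `perm(n, r)` and `comb(n, r)`.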
EXAMPLES
CONDITIONAL PROBABILITY
• What we saw so far is referred to as
unconditional probability. That is, the
probabilities of events of interest were
computed only based on the sample space
and with no prior information.
• Sometimes it is convenient to compute
certain unconditional probabilities by first
conditioning on some event.
CONDITIONAL PROBABILITY
• Also, this plays an important role in
stochastic modeling.
• In a finite buffer queuing model,
computation of waiting time of an admitted
customer involves conditional probability.
• DEFINITION: $P(B \mid A) = P(A \cap B) / P(A)$, provided $P(A) > 0$.
• Events A and B are independent if and only
if $P(A \cap B) = P(A)P(B)$.
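The definition and the independence condition can be tried out on two fair dice. The events below are chosen only for illustration: A = "first die is even", B = "sum is 7", C = "sum is at least 10".

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # 36 equally likely outcomes


def P(event):
    return Fraction(len(event), len(S))


def P_given(b, a):
    """Conditional probability P(B | A) = P(A and B) / P(A)."""
    return P(a & b) / P(a)


A = {o for o in S if o[0] % 2 == 0}
B = {o for o in S if sum(o) == 7}
C = {o for o in S if sum(o) >= 10}

a_b_independent = P(A & B) == P(A) * P(B)
a_c_independent = P(A & C) == P(A) * P(C)
```

Knowing that the first die is even does not change the chance of a sum of 7 (it stays 1/6), so A and B are independent; it does raise the chance of a sum of at least 10 from 1/6 to 2/9, so A and C are not.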
RANDOM VARIABLES
• Often in probability and statistics, the
quantities that are of interest are not the
outcomes but rather the values associated
with the outcome of the experiment.
• If n items are selected from a production lot
the quality inspector is interested in the
number of defectives out of the n chosen
and the corresponding probabilities.
RANDOM VARIABLES
(cont’d)
• A random variable, X, is a real-valued function
defined on the sample space S.
• Random variables can be
– Discrete – taking only discrete values
• Number of defective molds
– Continuous – taking continuous values
• Time taken to assemble a product
– Mixture of discrete and continuous
• Waiting time of a customer
INDEPENDENCE
EXAMPLES
STUDY OF RANDOM
VARIABLES
• Probability functions
• Probability mass function (discrete)
• Probability density function (continuous)
• Cumulative probability distribution function
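For a discrete random variable, the probability mass function and the cumulative distribution function can be built directly from the sample space. The sketch below revisits the inspection example with n = 3 items and X = number of defectives; the defect probability of 1/10 and the assumption that items are defective independently of one another are both hypothetical.

```python
from fractions import Fraction
from itertools import product

p_defective = Fraction(1, 10)  # hypothetical defect probability
n = 3                          # items inspected

# X maps each outcome (e.g. ('D', 'G', 'G')) to its number of defectives.
pmf = {}
for outcome in product("DG", repeat=n):
    x = outcome.count("D")
    prob = Fraction(1)
    for item in outcome:       # independence assumed: multiply item probabilities
        prob *= p_defective if item == "D" else 1 - p_defective
    pmf[x] = pmf.get(x, 0) + prob

# The CDF accumulates the PMF: F(x) = P(X <= x).
cdf, running = {}, Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running
```

The eight outcomes of the experiment collapse into only four values of X, which is why it is usually more convenient to work with the random variable than with the outcomes themselves.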