1.2x - mhsapstatistics

Chapter 1
Section 1.2
Describing Distributions
with Numbers
Parameter - a fixed value about a population; typically unknown
Statistic - a value calculated from a sample
Measures of Central Tendency
Mean - the arithmetic average
Use $\mu$ to represent a population mean (parameter)
Use $\bar{x}$ to represent a sample mean (statistic)
Formula: $\bar{x} = \frac{\sum x}{n}$
This is on the formula sheet, so you do not have to memorize it.
$\Sigma$ is the capital Greek letter sigma – it means to sum the values that follow
Measures of Central Tendency
Median - the middle of the data; 50th
percentile
Observations must be in numerical
order
Is the middle single value if n is odd
The average of the middle two values
if n is even
NOTE: n denotes the sample size
Measures of Central Tendency
Mode – the observation that occurs the
most often
Can be more than one mode
If all values occur only once – there is
no mode
Not used as often as mean & median
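As a rough illustration of the mode rules above, the sketch below counts how often each value occurs. The helper name `modes` and the two-mode list are made up for this example; the other two lists are data sets that appear in these notes.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())   # assumes values is not empty
    if highest == 1:
        return []                    # all values occur only once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([22, 23, 24, 25, 25, 26, 29, 30]))  # [25]
print(modes([1, 1, 2, 2, 3]))                   # [1, 2]  (made-up list with two modes)
print(modes([2, 3, 4, 8, 12]))                  # []
```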
Measures of Central Tendency
Range - the difference between the largest and smallest observations.
This is only one number! Not 3-8 but 5
Suppose we are interested in the number of lollipops that are bought at a certain store. A sample of 5 customers buys the following number of lollipops. Find the median.
2 3 4 8 12
The numbers are in order & n is odd – so find the middle observation.
The median is 4 lollipops!
Suppose we have a sample of 6 customers that buy the following number of lollipops. The median is …
2 3 4 6 8 12
The numbers are in order & n is even – so find the middle two observations.
Now, average these two values.
The median is 5 lollipops!
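The odd and even cases can be sketched in a few lines of Python. This is only an illustration; the function name `median` and the variable names are chosen for this example.

```python
def median(values):
    """Middle value if n is odd; average of the two middle values if n is even."""
    ordered = sorted(values)      # observations must be in numerical order
    n = len(ordered)              # n denotes the sample size
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

five_customers = [2, 3, 4, 8, 12]
six_customers = [2, 3, 4, 6, 8, 12]
print(median(five_customers))  # 4
print(median(six_customers))   # 5.0
```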
Suppose we have a sample of 6 customers that buy the following number of lollipops. Find the mean.
2 3 4 6 8 12
To find the mean number of lollipops, add the observations and divide by n.
$\bar{x} = \frac{2 + 3 + 4 + 6 + 8 + 12}{6} \approx 5.833$
What would happen to the median & mean if the 12 lollipops were 20?
2 3 4 6 8 20
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 20}{6} \approx 7.17$
What happened?
What would happen to the median & mean if the 20 lollipops were 50?
2 3 4 6 8 50
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 50}{6} \approx 12.17$
What happened?
Resistant - statistics that are not affected by outliers
Is the median resistant? YES
Is the mean resistant? NO
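A quick way to see resistance is to recompute both measures while only the largest lollipop purchase changes. The sketch below uses Python's standard `statistics` module; the helper name `center` is made up for this example.

```python
from statistics import mean, median

def center(largest):
    """Median and mean (rounded to 2 places) when the largest purchase is `largest`."""
    sample = [2, 3, 4, 6, 8, largest]
    return median(sample), round(mean(sample), 2)

for largest in (12, 20, 50):
    print(largest, center(largest))
# 12 (5.0, 5.83)
# 20 (5.0, 7.17)
# 50 (5.0, 12.17)
```

The median stays at 5 every time, while the mean follows the extreme value.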
Look at the following data set. Find the mean.
22 23 24 25 25 26 29 30
$\bar{x} = 25.5$
Now find how each observation deviates from the mean.
$x - \bar{x}$ is the deviation from the mean.
What is the sum of the deviations from the mean?
$\sum (x - \bar{x}) = 0$
Will this sum always equal zero? YES
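The claim that the deviations from the mean sum to zero can be checked directly on the eight values above. This is a small sketch with variable names chosen for illustration. With other data, floating-point rounding may leave a sum that is tiny rather than exactly zero.

```python
data = [22, 23, 24, 25, 25, 26, 29, 30]

x_bar = sum(data) / len(data)             # sample mean
deviations = [x - x_bar for x in data]    # each observation minus the mean

print(x_bar)            # 25.5
print(deviations)       # [-3.5, -2.5, -1.5, -0.5, -0.5, 0.5, 3.5, 4.5]
print(sum(deviations))  # 0.0
```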
Look at the following data set. Find the mean & median.
21 27 23 23 24 25 25 27 27 28 30 30 26 26 26 27 30 31 32 32
Create a histogram with the data (use x-scale of 2). Then find the mean and median.
Mean = 27
Median = 27
Look at the placement of the mean and median in this symmetrical distribution.
Look at the following data set. Find the mean & median.
22 29 28 22 24 25 28 21 23 62 23 24 23 26 36 38 25
Create a histogram with the data (use x-scale of 8). Then find the mean and median.
Mean = 28.176
Median = 25
Look at the placement of the mean and median in this right skewed distribution.
Look at the following data set. Find the mean & median.
21 46 54 47 53 60 55 55 56 63 64 58 58 58 58 62 60
Create a histogram with the data. Then find the mean and median.
Mean = 54.588
Median = 58
Look at the placement of the mean and median in this skewed left distribution.
Recap:
In a symmetrical distribution, the mean
and median are equal.
In a skewed distribution, the mean is
pulled in the direction of the skewness.
In a symmetrical distribution, you should
report the mean!
In a skewed distribution, the median
should be reported as the measure of
center!
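To see the mean being pulled toward the tail, the sketch below recomputes the mean and median for the right skewed and skewed left data sets from the earlier slides. It uses Python's standard `statistics` module; the variable names are chosen for illustration.

```python
from statistics import mean, median

right_skewed = [22, 29, 28, 22, 24, 25, 28, 21, 23, 62, 23, 24, 23, 26, 36, 38, 25]
left_skewed = [21, 46, 54, 47, 53, 60, 55, 55, 56, 63, 64, 58, 58, 58, 58, 62, 60]

for name, data in (("right skewed", right_skewed), ("skewed left", left_skewed)):
    print(name, "mean =", round(mean(data), 3), "median =", median(data))
# right skewed mean = 28.176 median = 25
# skewed left mean = 54.588 median = 58
```

In the right skewed set the mean sits above the median; in the skewed left set it sits below.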
Quartiles
Arrange the observations in
increasing order and locate the
median M in the ordered list of
observations.
The first quartile Q1 is the median of the 1st half of the observations.
The third quartile Q3 is the median of the 2nd half of the observations.
16 19 24 25 25 33 33 34 34 37 37 40 42 46 49 73
Q1 = 25
Median = 34
Q3 = 41
What if there is an odd number of observations?
16 19 24 25 25 33 33 34 34
The median is 25.
When dividing the data in half, leave out the middle number.
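The quartile rule described here (median of each half, dropping the middle value when n is odd) can be sketched as follows. The function names are made up for this example. Library routines such as `statistics.quantiles` may use other conventions and can give slightly different quartiles.

```python
def median(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # when n is odd, skip the middle number before taking the upper half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

sixteen = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
nine = [16, 19, 24, 25, 25, 33, 33, 34, 34]
print(quartiles(sixteen))  # (25.0, 34.0, 41.0)
print(quartiles(nine))     # (21.5, 25, 33.5)
```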
The interquartile range (IQR)
The distance between the first
and third quartiles.
IQR = Q3 – Q1
Always positive
Outlier:
We call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
Let’s look back at the same data:
16 19 24 25 25 33 33 34 34 37 37 40 42 46 49 73
Q1 = 25, Q3 = 41, IQR = 41 - 25 = 16
Lower cutoff: 25 - 1.5 x 16 = 1
Upper cutoff: 41 + 1.5 x 16 = 65
Since 73 is above the upper cutoff, we will call it an outlier.
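The 1.5 x IQR rule for this data set can be written as a short check. This is a sketch only: the function name `outlier_cutoffs` is made up for this example, and Q1 = 25 and Q3 = 41 are taken from the slide rather than recomputed.

```python
def outlier_cutoffs(q1, q3):
    """Return (lower cutoff, upper cutoff) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 46, 49, 73]
lower, upper = outlier_cutoffs(25, 41)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # 1.0 65.0
print(outliers)      # [73]
```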
Five-number summary
Minimum
Q1
Median
Q3
Maximum
If you plot these five numbers on
a graph, we have a ………
Advantages of boxplots
ease of construction
convenient handling of outliers
construction is not subjective
(like histograms)
Used with medium or large size
data sets (n > 10)
useful for comparative displays
Disadvantages of boxplots
does not retain the
individual observations
should not be used with
small data sets (n < 10)
How to construct
find five-number summary
Min Q1 Med Q3 Max
draw box from Q1 to Q3
draw median as center line in
the box
extend whiskers to min & max
Modified boxplots
display outliers
fences mark off the outliers
whiskers extend to largest (smallest) data value inside the fence
ALWAYS use modified boxplots in this class!!!
Modified Boxplot
Q1 – 1.5IQR and Q3 + 1.5IQR
These are called the fences and should not be seen.
Interquartile Range (IQR) – is the range (length) of the box: Q3 - Q1
Any observation outside this fence is an outlier! Put a dot for the outliers.
Modified Boxplot . . .
Draw the “whisker” from the
quartiles to the observation that is
within the fence!
A report from the U.S. Department of
Justice gave the following percent increase
in federal prison populations in 20
northeastern & mid-western states in 1999.
5.9 1.3 5.0 5.9 4.5 5.6 4.1 6.3 4.8 6.9
4.5 3.5 7.2 6.4 5.5 5.3 8.0 4.4 7.2 3.2
Create a modified boxplot. Describe the distribution.
Use the calculator to create a modified boxplot.
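For readers without the calculator at hand, the numbers behind a modified boxplot of the prison data can be sketched in Python. This is an illustration under the median-of-halves quartile rule from these notes; a calculator or library may use a slightly different quartile convention. The function names and dictionary keys are made up for this example.

```python
def median(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def modified_boxplot_stats(values):
    """Five-number summary, fences, whisker ends, and outliers."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + 1:] if n % 2 else data[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "five_number": (data[0], q1, median(data), q3, data[-1]),
        "fences": (low_fence, high_fence),
        "whiskers": (inside[0], inside[-1]),   # whiskers stop inside the fences
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

prison = [5.9, 1.3, 5.0, 5.9, 4.5, 5.6, 4.1, 6.3, 4.8, 6.9,
          4.5, 3.5, 7.2, 6.4, 5.5, 5.3, 8.0, 4.4, 7.2, 3.2]
stats = modified_boxplot_stats(prison)
for key, value in stats.items():
    print(key, value)
```

For these data the sketch flags 1.3 as the only outlier, so the lower whisker stops at 3.2 and the upper whisker reaches the maximum, 8.0.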
Evidence suggests that a high indoor radon
concentration might be linked to the
development of childhood cancers. The data
that follows is the radon concentration in two
different samples of houses. The first sample
consisted of houses in which a child was
diagnosed with cancer. Houses in the second
sample had no recorded cases of childhood
cancer.
(see data on note page)
Create parallel boxplots. Compare the
distributions.
[Figure: parallel boxplots of radon concentration for the Cancer and No Cancer groups; Radon axis marked at 100 and 200]
The median radon concentration for the no
cancer group is lower than the median for the
cancer group. The range of the cancer group
is larger than the range for the no cancer
group. Both distributions are skewed right.
The cancer group has outliers at 39, 45, 57,
and 210. The no cancer group has outliers at
55 and 85.
Assignment 1.2
Why is the study of variability
important?
Allows us to distinguish between usual &
unusual values
In some situations, want more/less
variability
scores on standardized tests
time bombs
medicine
Measures of Variability
range (max - min)
interquartile range (Q3 - Q1)
deviations $(x - \bar{x})$
variance $\sigma^2$
standard deviation $\sigma$
($\sigma$ is the lower case Greek letter sigma)
Suppose that we have these data values:
24 16 34 28 26 21 30 35 37 29
Find the mean.
Find the deviations: $x - \bar{x}$
What is the sum of the deviations from the mean?
Square the deviations: $(x - \bar{x})^2$
Find the average of the squared deviations: $\frac{\sum (x - \bar{x})^2}{n}$
The average of the deviations squared is called the variance.
Population (parameter): $\sigma^2$
Sample (statistic): $s^2$
Calculation of variance of a sample
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
(the divisor $n - 1$ is the degrees of freedom, df)
A standard deviation is a
measure of the average
deviation from the mean.
Calculation of standard deviation
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Degrees of Freedom (df)
n deviations contain (n - 1) independent pieces of information about variability
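The steps for the ten data values used earlier (24, 16, 34, ...) can be followed in a short sketch: find the mean, square the deviations, then divide by n - 1 for the sample variance. Variable names are chosen for illustration. The last line cross-checks against `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
from math import sqrt
from statistics import stdev, variance

data = [24, 16, 34, 28, 26, 21, 30, 35, 37, 29]
n = len(data)

x_bar = sum(data) / n                              # mean
squared_devs = [(x - x_bar) ** 2 for x in data]    # squared deviations
sample_variance = sum(squared_devs) / (n - 1)      # divide by df = n - 1
sample_sd = sqrt(sample_variance)

print(x_bar)                      # 28.0
print(sum(squared_devs))          # 384.0
print(round(sample_variance, 3))  # 42.667
print(round(sample_sd, 3))        # 6.532
print(round(variance(data), 3), round(stdev(data), 3))  # 42.667 6.532
```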
Which measure(s) of
variability is/are
resistant?
Activity (worksheet)
Linear transformation rule
When multiplying or adding a constant to a random variable, the mean and median change by both.
When multiplying or adding a constant to a random variable, the standard deviation changes only by multiplication.
Formulas:
$\mu_{ax+b} = a\mu_x + b$
$\sigma_{ax+b} = |a|\sigma_x$
An appliance repair shop charges a $30 service call
to go to a home for a repair. It also charges $25
per hour for labor. From past history, the average
length of repairs is 1 hour 15 minutes (1.25 hours)
with standard deviation of 20 minutes (1/3 hour).
Including the charge for the service call, what is the
mean and standard deviation for the charges for
labor?
$\mu = 30 + 25(1.25) = \$61.25$
$\sigma = 25\left(\frac{1}{3}\right) \approx \$8.33$
Rules for Combining two variables
To find the mean for the sum (or difference), add (or subtract) the two means.
To find the standard deviation of the sum (or difference), ALWAYS add the variances, then take the square root.
Formulas:
$\mu_{a+b} = \mu_a + \mu_b$
$\mu_{a-b} = \mu_a - \mu_b$
$\sigma_{a \pm b} = \sqrt{\sigma_a^2 + \sigma_b^2}$ (if variables are independent)
Bicycles arrive at a bike shop in boxes. Before they can be
sold, they must be unpacked, assembled, and tuned
(lubricated, adjusted, etc.). Based on past experience, the
times for each setup phase are independent with the
following means & standard deviations (in minutes). What
are the mean and standard deviation for the total bicycle
setup times?
Phase       Mean   SD
Unpacking    3.5   0.7
Assembly    21.8   2.4
Tuning      12.3   2.7
$\mu_T = 3.5 + 21.8 + 12.3 = 37.6$ minutes
$\sigma_T = \sqrt{0.7^2 + 2.4^2 + 2.7^2} \approx 3.680$ minutes
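The bicycle setup totals can be reproduced with a short sketch that adds the means and adds the variances before taking the square root. It assumes, as the problem states, that the three phases are independent. The function name `combine_independent` is made up for this example.

```python
from math import sqrt

def combine_independent(means, sds):
    """Mean and SD of a sum of independent variables."""
    total_mean = sum(means)
    total_sd = sqrt(sum(sd ** 2 for sd in sds))   # add variances, then square root
    return total_mean, total_sd

means = [3.5, 21.8, 12.3]   # unpacking, assembly, tuning (minutes)
sds = [0.7, 2.4, 2.7]

total_mean, total_sd = combine_independent(means, sds)
print(round(total_mean, 1))  # 37.6
print(round(total_sd, 3))    # 3.68
```

Adding the standard deviations directly (0.7 + 2.4 + 2.7 = 5.8) would overstate the spread of the total.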
Assignment 1.2B