Chapter Three Numerically Summarizing Data


Numerically Summarizing Data
Learning Objectives
1. Understand the difference between a parameter and a statistic
2. Describe and compute measures of central tendency
3. Describe and compute measures of dispersion
4. Compute measures of location
5. Learn to read box plots and check for outliers
Measures of Central Tendency
(Mean, Median and Mode)
A parameter is a descriptive measure of a population.
In most real-world cases, the population parameter is not known. An example is the average gas price across the whole nation.
A statistic is a descriptive measure of a sample. We use a statistic to estimate the corresponding parameter. For example, the national average gas price is not known, but we can take a random sample of 100 stations, compute the sample average gas price, and use that sample average to estimate the unknown population average.
The population mean, μ, is computed using all the individuals in a population; the total number of individuals is N:
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$
The population mean is a parameter.
The sample mean, x̄, is computed using sample data of size n:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
The sample mean is a statistic that is an unbiased estimator of the population mean.
NOTE: In real-world applications, the population mean μ is usually not known and is estimated by the sample mean x̄.
Median
The median of a variable is the value that lies in the middle of the data when arranged in ascending order. That is, half the data are below the median and half the data are above the median. We use M to represent the median.
Steps in Computing the Median of a Data Set
1. Arrange the data in ascending order.
2. Determine the number of observations (n).
3. Determine the observation in the middle of the data set. The position is (n+1)/2.
(a) If (n+1)/2 is an integer, locate the data value at the (n+1)/2 position. This is the median. (NOTE: in this situation, the number of data values, n, is odd.)
(b) If (n+1)/2 is NOT an integer, the median is the average of the two data values on either side of the (n+1)/2 position. (NOTE: in this situation, n is even.)
EXAMPLE
Computing the Median of Data
Find the mean and median of the following pulse rates from a sample of 8 individuals (NOTE: n = 8 in this case):
80, 76, 65, 68, 72, 73, 65, 80
Mean: (80 + 76 + 65 + 68 + 72 + 73 + 65 + 80)/8 = 579/8 = 72.375
Arrange them in ascending order: 65, 65, 68, 72, 73, 76, 80, 80
Find the position: (n+1)/2 = (8+1)/2 = 4.5
The position is not an integer: Median = (72+73)/2 = 72.5
Now add one additional pulse rate of 100 and find the median of the data (NOTE: n = 9 in this case):
80, 76, 65, 68, 72, 73, 65, 80, 100
Ascending order: 65, 65, 68, 72, 73, 76, 80, 80, 100
Position: (9+1)/2 = 5. The median is 73 (the value in the 5th position).
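The median steps and the pulse-rate example can be sketched in a few lines of Python. This is only an illustration of the (n+1)/2 position rule; the function names `mean` and `median` are hypothetical helpers written for this example, not part of the course material.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Median found with the (n + 1) / 2 position rule."""
    ordered = sorted(values)            # Step 1: ascending order
    n = len(ordered)                    # Step 2: number of observations
    position = (n + 1) / 2              # Step 3: 1-based middle position
    if position.is_integer():           # (a) n is odd: take the middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]  # (b) n is even: value just below the position
    upper = ordered[int(position)]      #     and the value just above it
    return (lower + upper) / 2


pulse_8 = [80, 76, 65, 68, 72, 73, 65, 80]
pulse_9 = pulse_8 + [100]

print(mean(pulse_8))    # 72.375
print(median(pulse_8))  # 72.5
print(median(pulse_9))  # 73
```

Note that adding the pulse rate of 100 moves the median only from 72.5 to 73.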
The mode of a variable is the most frequent observation of the variable in the data set.
If two values are tied for the highest frequency, we say the data are bimodal.
Exercise: Find the mode of the following pulse rate data:
80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98
The modes are 65 and 80 (each occurs four times).
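As a quick check of the mode exercise, the following sketch counts how often each pulse rate occurs and returns every value tied for the highest count. The helper name `modes` is an assumption made for this example.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


pulse = [80, 76, 65, 68, 72, 73, 65, 80, 100, 80, 74, 65, 66, 70, 74, 65, 80, 98]
print(modes(pulse))  # [65, 80]
```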
Comparing Mean and Median:
How do extreme observations affect the mean and median?
[similar exam questions]
Example:
The following are the quiz scores of 10 students in class A:
5, 5, 5, 5, 5, 7, 7, 7, 7, 7
Find mean = ______, find median = ________
The following are the quiz scores of 10 students in class B:
5, 5, 5, 5, 5, 7, 7, 7, 7, 30
Find mean = ________, find median = _________
Fact:
The mean is sensitive to extreme data values.
The median is robust to extreme data values.
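One way to check the class A and class B answers is with Python's standard `statistics` module. The sketch below uses only the two score lists given in the example; the variable names are illustrative.

```python
from statistics import mean, median

class_a = [5, 5, 5, 5, 5, 7, 7, 7, 7, 7]
class_b = [5, 5, 5, 5, 5, 7, 7, 7, 7, 30]  # same scores, but one 7 is replaced by 30

for name, scores in (("A", class_a), ("B", class_b)):
    print(name, "mean =", mean(scores), "median =", median(scores))
```

The single score of 30 pulls the mean from 6 up to 8.3, while the median stays at 6 in both classes.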
How do unusual cases affect the average, the median, and the shape of the histogram?
Compare the histograms with and without the ‘outlier’ case of 5000 miles.
[Figure: Histogram of Miles, with the case of 5000 miles (x-axis: Miles; y-axis: Frequency)] Shape is _________
[Figure: Histogram of Miles, without the case of 5000 miles (x-axis: Distance from Home; y-axis: Frequency)] Shape is _________
Descriptive Statistics: Miles for 148 cases (with the case of 5000 miles)
Variable    N    Mean  SE Mean  TrMean  StDev  Min  Q1  Median   Q3   Max
Miles     148   151.5     33.6   111.8  409.4  1.0  75     120  150  5000
Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N    Mean  SE Mean  TrMean  StDev  Min  Q1  Median   Q3  Maximum
Miles     147  ______     6.71  111.02  81.37  1.0  75  ______  150      600
Descriptive Statistics: Miles for 147 cases (without the case of 5000 miles)
Variable    N    Mean  Min  Median  Maximum
Miles     147  118.52    1     120      600
NOTE: The median remains unchanged. Why? The median uses only the middle data point (or the middle two data points), whereas the average uses every data value. So one unusually large value makes the average larger, but not the median.
When a data set has unusually large or small values relative to the rest of the data, or when the distribution of the data is skewed, the median is preferred over the arithmetic mean as the measure of central tendency because it is more representative of the typical observation.
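The drop in the mean can be reproduced from the summary numbers alone. The sketch below rebuilds the total of the 148 distances from the reported (rounded) mean of 151.5, removes the 5000-mile case, and divides by 147. The variable names are illustrative, and the result is only as precise as the rounded mean allows.

```python
n_with = 148
mean_with = 151.5   # reported mean including the 5000-mile case
outlier = 5000

total_with = mean_with * n_with                       # approximate sum of all 148 distances
mean_without = (total_with - outlier) / (n_with - 1)  # mean of the remaining 147 cases

print(round(mean_without, 2))  # 118.52
```

The median cannot be recomputed this way because it depends on the ordered data values, not on their sum.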
Comparison of Mean, Median, and Mode for different shapes of distributions
[Similar exam question]
Shape          Relationship    Order from left to right
Left-Skewed    Mean < Median   Mean, Median, Mode
Symmetric      Mean ≈ Median   Mean = Median = Mode
Right-Skewed   Mean > Median   Mode, Median, Mean
Exercise
NOTE: In real-world applications, the distribution of sample data is almost never perfectly symmetric. The shape can only be approximately symmetric.
IF THE MEAN IS CLOSE TO THE MEDIAN (NOT NECESSARILY EXACTLY MEAN = MEDIAN), WE SAY THE DISTRIBUTION IS APPROXIMATELY SYMMETRIC.
Exercise: A sample of 50 gas prices is recorded and summarized. The average price is $3.15 and the median price is $3.13. Is the shape of the price distribution more likely to be skewed-to-left, approximately symmetric, or skewed-to-right?
ANS:
Measures of Dispersion
Four different measures of dispersion:
Range, Variance, Standard Deviation, Interquartile Range (IQR)
Measures of dispersion describe how spread out the data values are. The more spread out the data values, the larger the variation.
Example: Scores of 5 students in class A: 60, 60, 70, 80, 80
Scores of 5 students in class B: 40, 60, 70, 80, 100
Scores of 5 students in class C: 70, 70, 70, 70, 70
Q:
Scores in Class ____ have the largest variation.
Scores in Class ____ have zero variation.
Visualizing Variability using Histogram
[Figure: three histograms labeled A, B, and C]
Which one shows the largest variation:
Which one shows the smallest variation:
How to measure the variation?
• Range = R = Largest Data Value − Smallest Data Value
• The sample variance is:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• The sample standard deviation is:
$s = \sqrt{s^2}$
NOTE: the divisor (n − 1) is called the degrees of freedom.
The population variance is symbolically represented by lower case Greek sigma squared:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The population standard deviation is:
$\sigma = \sqrt{\sigma^2}$
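To make the range, sample variance, and sample standard deviation concrete, here is a minimal sketch applied to the class A, B, and C scores from the dispersion example. It assumes the usual sample definitions with divisor n − 1; the function names `sample_variance` and `sample_std` are hypothetical helpers for this illustration.

```python
from math import sqrt


def sample_variance(values):
    """s^2: sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """s: the square root of the sample variance."""
    return sqrt(sample_variance(values))


class_a = [60, 60, 70, 80, 80]
class_b = [40, 60, 70, 80, 100]
class_c = [70, 70, 70, 70, 70]

for name, scores in (("A", class_a), ("B", class_b), ("C", class_c)):
    data_range = max(scores) - min(scores)
    print(name, data_range, sample_variance(scores), round(sample_std(scores), 2))
```

Class B has the largest range, variance, and standard deviation, and class C, where every score is 70, has zero for all three.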
NOTE: As mentioned before, for real-world problems the population mean, population variance, and population standard deviation are NOT KNOWN.
Like the sample mean, the sample variance and sample standard deviation are obtained from sample data. They are used to estimate the unknown population variance and population standard deviation. This is a major part of inferential statistics, which will be dealt with later.
In this chapter, we are learning how to compute and interpret these sample descriptive summaries to understand the sample data.
Notation:
s²: sample variance
$s = \sqrt{s^2}$: sample standard deviation
NOTE: If the original measurement unit is feet (ft), the variance s² has unit (ft)²: if x has unit (ft), then (x − x̄)² has unit (ft)(ft), which is (ft)².
The measurement unit of s² is (ft)².
The measurement unit of s is (ft).
σ²: population variance
$\sigma = \sqrt{\sigma^2}$: population standard deviation
Some important Tips
NOTE: Sample statistics, such as the sample mean x̄, the sample median, s, and s², will be different for different samples.
Population parameters, such as the population mean μ, the population variance σ², and the population s.d. σ, are fixed constants for a given population. They do not change for different samples.
Exercise
Comparing Variation: Quiz Scores of 40 students
[similar exam questions]
[Figure: histograms of quiz scores for Class A, Class B, and Class C]
Variation:
Which one has the smallest s.d.?
Which one has the largest s.d.?
Answer
Class B has the smallest standard deviation.
Class A has the largest standard deviation.
Points to remember about variance and standard deviation and their relationship with the histogram:
- The values of s and s² are always greater than or equal to zero.
- The larger the value of s² or s, the greater the variability of the data set.
- If s² or s is equal to zero, all measurements must have the same value.
- The standard deviation s is computed in order to have a measure of variability with the same unit as the observations.
- The larger the s.d., the more spread out the data and the flatter the histogram.
- The smaller the s.d., the more clustered the data around the mean and the taller the peak of the histogram.
Exercise (Similar Exam questions)
1. The gas price is a concern for people. A random sample of 40 stations gives the following data summary:
Sample mean = $2.15
Median = $2.12
s = $0.15
Q: Is the distribution of the gas prices more likely to be
(a) Symmetric (b) Skewed-to-right (c) Skewed-to-left
And WHY?
2. The following two data sets are prices of milk from 6 stores, one from January 2004 and one from a year later.
Store                    A     B     C     D     E     F
Price in January 2004    1.85  1.95  1.85  2.00  1.78  1.97
Price in January 2005    2.05  2.15  2.05  2.20  1.98  2.17
True or False for each of the following statements:
(a) The average price remains the same between the two years.
(b) The price range remains the same between the two years.
(c) The median remains the same between the two years.
(d) The standard deviation (s) remains the same between the two years.
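In the milk-price table, every January 2005 price equals the January 2004 price plus $0.20. The sketch below, using Python's standard `statistics` module, shows what such a constant shift does to each summary. The helper `summary` and the tolerance used for the comparison are assumptions made for this illustration.

```python
from math import isclose
from statistics import mean, median, stdev

jan_2004 = [1.85, 1.95, 1.85, 2.00, 1.78, 1.97]
jan_2005 = [2.05, 2.15, 2.05, 2.20, 1.98, 2.17]


def summary(prices):
    """Mean, median, range, and sample standard deviation of a price list."""
    return {
        "mean": mean(prices),
        "median": median(prices),
        "range": max(prices) - min(prices),
        "stdev": stdev(prices),
    }


before, after = summary(jan_2004), summary(jan_2005)
for key in before:
    # Compare with a small tolerance to ignore floating-point rounding.
    same = isclose(before[key], after[key], abs_tol=1e-9)
    print(f"{key}: {before[key]:.3f} -> {after[key]:.3f} (same: {same})")
```

Adding a constant moves the measures of center (mean and median) by that constant but leaves the measures of spread (range and standard deviation) unchanged.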
Descriptive Summary for the 56 distances
Descriptive Statistics: distance
Variable    N   Mean  Median  TrMean  StDev  SE Mean
distance   56  142.0   140.0   128.3  112.2     15.0
Variable  Minimum  Maximum    Q1     Q3
distance      5.0    800.0  92.5  160.0
Reading the output:
Mean: the sample mean x̄, which estimates the population mean μ.
StDev: s, the sample standard deviation, so s² = (112.2)².
TrMean: the trimmed mean, that is, the mean after excluding the lowest 5% and the highest 5% of the data.
Minimum: the smallest value. Maximum: the largest value.
Q1: 25% of the distances are lower than Q1, the first quartile, or 25th percentile.
Q3: 75% of the distances are lower than Q3, the third quartile, or 75th percentile.
If we add a new maximum, 6000, to the data so that we have 57 cases, what is the effect of the 6000 on the following summary statistics: Increase? Decrease? The same?
(a) the average distance:
(b) the median distance:
(c) the standard deviation:
(d) the range:
Answer
After adding 6000 miles to the data:
• The average distance increases.
• The median distance is the same for this example (in general, it will be almost the same).
• The standard deviation increases.
• The range increases.
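The size of the change in the mean and the range can be worked out from the reported summary of the 56 distances (mean 142.0, minimum 5.0, maximum 800.0). The sketch below is a rough check based on those rounded values; the variable names are illustrative.

```python
n = 56
mean_56 = 142.0
minimum, maximum = 5.0, 800.0
new_value = 6000

# New mean: rebuild the total from the reported mean, add the new case, divide by 57.
new_mean = (mean_56 * n + new_value) / (n + 1)

old_range = maximum - minimum
new_range = max(maximum, new_value) - min(minimum, new_value)

print(round(new_mean, 2))    # 244.77
print(old_range, new_range)  # 795.0 5995.0
```

The median and the standard deviation cannot be recomputed from the summary alone because they require the individual data values.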
Empirical Rule and Applications
What is the meaning of variation and how is it used in solving real-world problems?
For symmetric, mound-shaped (bell-shaped) data, approximately
68% of the data lie within ± 1 s of the mean,
95% of the data lie within ± 2 s of the mean,
99.7% (nearly 100%) of the data lie within ± 3 s of the mean.
An important application of the Empirical Rule is identifying rare (unusual, extreme) observations.
Only about 5% of observations fall more than two standard deviations from the mean, so an observation outside the two-s.d. range is considered a rare (or unusual) case.
[Figure: bell curve centered at μ. Areas from left to right: 2.5% below μ−2σ, 13.5% between μ−2σ and μ−σ, 34% between μ−σ and μ, 34% between μ and μ+σ, 13.5% between μ+σ and μ+2σ, 2.5% above μ+2σ.]
NOTE: The percentages on each side of the center line μ add up to 50%, because a mound-shaped distribution is symmetric about the mean.
Applying Empirical Rule to identify Rare Events
A simple and powerful tool for identifying outliers, extremes, or unusual or rare events.
We will use this rule very often throughout the entire semester.
(Similar questions in the test)
Consider the 2010 ACT test: the mean was 21 and the standard deviation was 4. The distribution of the ACT scores is mound-shaped.
Q1: A student received a score of 25. Is this an unusually high score?
Q2: If CMU admits students whose ACT score is no lower than one standard deviation below the mean, what is the minimum ACT score for CMU admission?
Q3: A student received an ACT score of 30. Is this an unusually high score?
ANSWER:
Q1: 25 = 21 + 4, which is one s.d. above the mean. It is within two s.d. of the mean, so it is NOT an unusually high score.
Q2: The score at one s.d. below the mean is 21 − 4 = 17.
Q3: The score 30 > 21 + 2(4) = 29, so 30 is more than two s.d. above the mean. Only about 2.5% of scores are higher than 29. Hence, 30 is an unusually high score.
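The two-standard-deviation check used in the ACT answers can be written as a small helper. This is a sketch of the rule as stated in the example (mean 21, s.d. 4); the function names `two_sd_interval` and `is_unusual` are hypothetical.

```python
def two_sd_interval(mean, sd):
    """Interval mean ± 2 sd, which holds about 95% of mound-shaped data."""
    return mean - 2 * sd, mean + 2 * sd


def is_unusual(value, mean, sd):
    """True when the value lies outside the two-standard-deviation interval."""
    low, high = two_sd_interval(mean, sd)
    return value < low or value > high


act_mean, act_sd = 21, 4

print(two_sd_interval(act_mean, act_sd))  # (13, 29)
print(is_unusual(25, act_mean, act_sd))   # False (Q1)
print(act_mean - act_sd)                  # 17    (Q2: one s.d. below the mean)
print(is_unusual(30, act_mean, act_sd))   # True  (Q3)
```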
Exercise: Estimating the average and standard deviation and applying the Empirical Rule when the distribution is mound-shaped
We collect a sample of weekly spending from 40 students. Suppose the spending has a mound-shaped distribution. We only know the min = $20 and max = $80. As you can see, weekly spending varies from student to student.
(a) Give a good estimate of the average and the standard deviation of weekly spending based on the 40 students' data.
(b) Approximately what percentage of students spend $35 or more per week?
ANS (a): Since the distribution is mound-shaped, we can use the midpoint (20+80)/2 = $50 to estimate the average spending. For the s.d. of this sample, we use the estimate s ≈ range/4, which gives (80 − 20)/4 = $15.
ANS (b): Using the estimated average and s, $35 is one s.d. below the mean. Hence, the percentage spending $35 or more is 34% + 50% = 84%. Approximately 84% of students spend $35 or more per week.
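The arithmetic in this exercise can be laid out step by step. The sketch below assumes, as the exercise does, a mound-shaped distribution, the midpoint as the estimate of the mean, and range/4 as the estimate of s; the variable names are illustrative.

```python
minimum, maximum = 20, 80

estimated_mean = (minimum + maximum) / 2  # midpoint of the range
estimated_sd = (maximum - minimum) / 4    # range / 4 rule of thumb

# Number of estimated standard deviations that $35 lies from the estimated mean.
z = (35 - estimated_mean) / estimated_sd

# Empirical Rule: about 34% of the data lie between (mean - 1 sd) and the mean,
# and 50% lie above the mean.
percent_at_least_35 = 34 + 50

print(estimated_mean, estimated_sd, z, percent_at_least_35)  # 50.0 15.0 -1.0 84
```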
Five Number Summary; Box plots
The Five-Number Summary
MINIMUM
Q1
Median
Q3
MAXIMUM
IQR (Inter-quartile Range) = Q3 – Q1
Steps for Drawing a Box plot
Step 1: Determine the lower and upper fences:
Lower fence = Q1 − 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
Step 2: Draw vertical lines at Q1, M, and Q3. Enclose these vertical lines in a box.
Step 3: Label the lower and upper fences.
Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence.
Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).
EXAMPLE Drawing a Boxplot
Min   Q1   M    Q3   Max   IQR
28    38   48   56   73    Q3 − Q1 = 56 − 38 = 18
Draw a boxplot for the serum HDL. Compute the lower and upper fences and draw a boxplot.
[Figure: Boxplot of HDL with Q1, Median, Mean, and Q3 marked; x-axis: HDL, 30 to 70]
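The fence computation from Step 1 and the outlier check from Step 5 can be sketched as follows, using the HDL five-number summary from the example (Q1 = 38, Q3 = 56, min = 28, max = 73). The function names `fences` and `outliers` are hypothetical helpers; since the raw HDL values are not given, only the minimum and maximum are checked here.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, q1, q3):
    """Values that fall below the lower fence or above the upper fence."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]


hdl_min, hdl_q1, hdl_q3, hdl_max = 28, 38, 56, 73

print(fences(hdl_q1, hdl_q3))                        # (11.0, 83.0)
print(outliers([hdl_min, hdl_max], hdl_q1, hdl_q3))  # []
```

Because both the minimum (28) and the maximum (73) lie inside the fences, no HDL value can be an outlier, and the whiskers extend to 28 and 73.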
Relationship between Distribution Shape and Boxplot (Similar questions in the test)
1. If the median is near the center of the box and the two horizontal lines are approximately equal in length, then the distribution is roughly symmetric.
2. If the median is left of the center of the box and/or the right line is substantially longer than the left line, the distribution is right skewed.
3. If the median is right of the center of the box and/or the left line is substantially longer than the right line, the distribution is left skewed.
[Figure: Symmetric]
[Figure: Skewed Right]
[Figure: Skewed Left]
Distance data: 100 distances
[Figure: Boxplot of Miles and Histogram of Miles (x-axis: Miles; y-axis: Frequency)]
[Figure: Boxplot of Miles and Histogram of Miles in separate female and male panels (panel variable: Gender; x-axis: Miles; y-axis: Frequency)]
EXAMPLE
Comparing Two Data Sets Using Boxplots
The following boxplots represent the birth rate for women 15 to 44 years of age in 1990 and 1997 for each state.
What conclusion can you make?