Transcript EPSY 439 - Texas A&M University
EPSY 439
Variability
Range Variance Standard Deviation
z-score
Concept Map
Variability - Chapter 3
Index of Variability Distribution Single Score Range Standard Deviation Variance z Scores Depends only on extreme scores Same mean, same range, different form Computation: Population variability Deviation-Score Scores: (X - Mean) Computation: Raw-Score Deviation Graphing Standard Deviations Measure of a sample's variability Computation: Deviation-Score Estimate of a population's variability Computation: Raw-Score Concept of N - 1 Describes the relation of an X to the Mean with respect to the variability of the distribution Represents # of SDs a score is above or below the Mean Gives you a way to compare raw scores Standard Score Range between -3 and +3
April 24, 2020 Copyright 2000 - Robert J. Hall
Overview
The mean, as a single value, is a useful measure for describing the center of a distribution and where the most frequent scores in a distribution are located. It does not tell us very much, however, about scores away from the center of the distribution and/or scores that occur infrequently.
Overview
Thus, to describe a data set accurately, we need to know not only where scores are centered but also about how much individual scores within the distribution differ from one another. Measures of variability , then, provide context (spread/consistency) for measures of center.
Overview
Three different distributions have the same mean score
Sample A: 0, 2, 6, 10, 12 ($\bar{X} = 6$)
Sample B: 8, 7, 6, 5, 4 ($\bar{X} = 6$)
Sample C: 6, 6, 6, 6, 6 ($\bar{X} = 6$)
Each of the three samples has a mean of 6, so if you did not look at the raw scores, you might think that we have three identical distributions. Clearly, they are not the same.
Overview
Distance between the locations of scores in three distributions
Sample A ($\bar{X} = 6$): 0, 2, 6, 10, 12
Sample B ($\bar{X} = 6$): 4, 5, 6, 7, 8
Sample C ($\bar{X} = 6$): 6, 6, 6, 6, 6
[Figure: the scores of each sample marked with an X on a number line running from 0 to 12]
Here you can see differences that separate scores in the three samples.
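The point of the three samples can be checked with a minimal Python sketch. The variable names are illustrative and not part of the lecture; the scores are the ones listed on the slide.

```python
from statistics import mean

# The three samples from the slide.
samples = {
    "A": [0, 2, 6, 10, 12],
    "B": [4, 5, 6, 7, 8],
    "C": [6, 6, 6, 6, 6],
}

for name, scores in samples.items():
    # Every sample has the same mean, but the lowest and highest
    # scores sit at very different distances from it.
    print(name, "mean =", mean(scores),
          "lowest =", min(scores), "highest =", max(scores))
```

The mean alone cannot tell the samples apart; the lowest and highest scores already hint at how differently the scores are spread.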
Statistical Model of Variability
The statistical model for abnormality requires two statistical concepts: a measure of the average and a measure of the deviation from the average.
– We have examined the standard techniques for defining the average score; we will now focus on methods for measuring deviations.
– The fact that scores deviate from average means that they are variable. Variability, then, is one of the most basic statistical concepts.
Meaning of Variability
The term variability has much the same meaning in statistics as it has in everyday language; to say that things are variable means that they are not all the same.
In statistics, our goal is to measure the amount of variability for a particular set of scores, a distribution.
– Are scores all clustered together, or are they scattered over a wide range of values?
Measuring Variability
A good measure of variability should: provide an accurate picture of the spread of the distribution, and give an indication of how well an individual score (or group of scores) represents the entire distribution.
– In inferential statistics, where we often use small samples to answer general questions about large populations, variability is a particularly important concern. Why?
Types of Variability
In this section we will consider three different measures of variability: the range (and the interquartile range), the standard deviation, and the variance.
Of these three, the standard deviation is by far the most important.
Range
The range is the distance between the largest score ($X_{max}$) and the smallest score in the distribution ($X_{min}$). In determining this distance, you must also take into account the real limits of the maximum and minimum $X$ values. The range, therefore, is computed as:
$\text{range} = \text{URL } X_{max} - \text{LRL } X_{min}$
Range
The range is the most obvious way of describing how spread out the scores are in the distribution.
*** Problem *** Range is completely determined by the two extreme values and ignores the other scores in the distribution.
Range
Example: The following two distributions have exactly the same range (10 points in each case) but
– Distribution 1 clusters scores at the upper end of the range
– Distribution 2 spreads scores out over the range
Distribution 1: 1, 8, 9, 9, 10, 10
Distribution 2: 1, 2, 4, 6, 8, 10
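As a rough sketch, the range with real limits can be computed as below. The function name and the `unit` parameter are assumptions made for illustration; the sketch assumes scores are measured to the nearest whole unit, so the real limits lie half a unit above and below each score.

```python
def score_range(scores, unit=1):
    """Range using real limits: URL of X_max minus LRL of X_min.

    Assumes scores are measured to the nearest `unit`, so each real
    limit lies half a unit away from the score itself.
    """
    upper_real_limit = max(scores) + unit / 2
    lower_real_limit = min(scores) - unit / 2
    return upper_real_limit - lower_real_limit

distribution_1 = [1, 8, 9, 9, 10, 10]
distribution_2 = [1, 2, 4, 6, 8, 10]

print(score_range(distribution_1))  # 10.0
print(score_range(distribution_2))  # 10.0
```

Both distributions give 10 points, even though their scores are arranged very differently between the two extremes.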
Range
Summary: Because the range does not consider all the scores in the distribution, it often does not give an accurate description of the variability for the entire distribution.
Generally, the range is considered to be a crude and unreliable measure of variability.
– John Tukey has developed ways of looking at data that give prominence to the dispersion of data, using what are called the boxplot and the interquartile range.
Interquartile Range
A distribution can be divided into four equal parts using quartiles.
The first quartile ($Q_1$) is the score that separates the lower 25% of the distribution from the rest. The second quartile ($Q_2$, or median) is the score that has exactly two quarters, or 50%, of the distribution below it.
Interquartile Range
Finally, the third quartile ($Q_3$) is the score that divides the bottom three-fourths of the distribution from the top quarter.
The interquartile range is the distance between the first quartile and the third quartile:
$\text{interquartile range} = Q_3 - Q_1$
Interquartile Range
Frequency distribution for a set of 16 scores
[Figure: histogram of SCORE with $Q_1 = 4.5$ and $Q_3 = 8$ marked; the bottom 25% of scores lie below $Q_1$, the top 25% lie above $Q_3$, and the interquartile range between them is 3.5 points. Std. Dev = 2.50, Mean = 6.3, N = 16]
Boxplot
[Figure: boxplot of SCORE for N = 16 on an axis from 0 to 12, marking Min, $Q_1 = 4.5$, $Q_2 = 6$, $Q_3 = 8$, and Max; IQR = 3.5]
Construction of Boxplot
Draw a number line that includes the range of observations.
Compute $Q_1$, $Q_2$, and $Q_3$.
Above the line drawn in the first step, draw a box extending from $Q_1$ to $Q_3$.
Inside the box, draw a line at the median.
Construction of Boxplot
To identify the outliers, compute the lower and upper fences, which are located at 1.5(IQR) below $Q_1$ and above $Q_3$. That is, the lower fence is
$f_L = Q_1 - 1.5(\text{IQR})$
and the upper fence is
$f_U = Q_3 + 1.5(\text{IQR})$
Construction of Boxplot
Observations located beyond the fence are classified as outliers and are identified with an asterisk (*).
If there are no outliers, extend horizontal line segments (whiskers) from the ends of the box to the smallest and largest observations. If there are outliers, extend the whiskers to the smallest and largest non-outliers.
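The fence rule can be sketched in a few lines of Python. The function name `fences` is hypothetical; the quartiles are the ones from the 16-score example ($Q_1 = 4.5$, $Q_3 = 8$).

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles from the 16-score example: Q1 = 4.5, Q3 = 8, so IQR = 3.5.
lower, upper = fences(4.5, 8)
print(lower, upper)  # -0.75 13.25
```

Any score below -0.75 or above 13.25 would be drawn as an outlier; whiskers stop at the most extreme scores that stay inside those fences.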
Example
Temperatures at time of space shuttle launches:
53 57 58 63 66 67 67 67 68 69 70 70
70 70 72 73 75 75 76 76 78 79 80 81
Five number summary:
Low 53, $Q_1$ 66.84, $M$ 70, $Q_3$ 75.5, High 81
Example - Frequency Distribution
Temperature at time of Space Shuttle Launches
Median: the 50th percentile falls within the real limits of the score 70.
Real limit 70.5, cumulative percent 58.3
Real limit 69.5, cumulative percent 41.7
Interval: 58.3 - 41.7 = 16.6; distance to 50: 50 - 41.7 = 8.3
8.3 / 16.6 = 0.5, so Median = 69.5 + 0.5 = 70.0
Example - Frequency Distribution
Temperature at time of Space Shuttle Launches
First Quartile: the 25th percentile falls within the real limits of the score 67.
Real limit 67.5, cumulative percent 33.3
Real limit 66.5, cumulative percent 20.8
Interval: 33.3 - 20.8 = 12.5; distance to 25: 25 - 20.8 = 4.2
4.2 / 12.5 = 0.34, so $Q_1$ = 66.5 + 0.34 = 66.84
Example - Frequency Distribution
Temperature at time of Space Shuttle Launches
Third Quartile: the 75th percentile falls within the real limits of the score 75.
Real limit 75.5, cumulative percent 75.0
Real limit 74.5, cumulative percent 66.7
Interval: 75.0 - 66.7 = 8.3; distance to 75: 75 - 66.7 = 8.3
8.3 / 8.3 = 1.00, so $Q_3$ = 74.5 + 1.00 = 75.5
Example - Summary Statistics
[Table: summary statistics for the space shuttle launch temperatures, including mean, median, mode, standard deviation, variance, skewness, and kurtosis]
Example - Histogram
[Figure: histogram of Space Shuttle Launch Temperatures, horizontal axis from 52.5 to 80.0 in steps of 2.5, frequency axis from 0 to 6. Std. Dev = 7.22, Mean = 70.0, N = 24]
Example - Boxplot
[Figure: boxplot of Launch Temperature, N = 24, on a vertical axis from 50 to 90, with case 1 flagged as an outlier]
The flag on case 1 indicates that case number 1, 53°, is an outlier in this data set. Case 1 was the temperature at the time that the Challenger space shuttle was launched. On the morning of January 28, 1986, Challenger exploded and crashed 73 seconds after take off. There were no survivors. A simple boxplot such as this might have raised enough of a red flag for mission control to abort the launch.
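The outlier flagged in the boxplot can be checked against the fence rule. This is a sketch only: it uses the quartiles reported in the five-number summary ($Q_1 = 66.84$, $Q_3 = 75.5$), and the variable names are illustrative.

```python
temperatures = [53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81]

q1, q3 = 66.84, 75.5  # quartiles from the five-number summary
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Scores beyond either fence are outliers; the whiskers end at the
# smallest and largest scores that are not outliers.
outliers = [t for t in temperatures if t < lower_fence or t > upper_fence]
inside = [t for t in temperatures if lower_fence <= t <= upper_fence]

print(round(lower_fence, 2), round(upper_fence, 2))  # 53.85 88.49
print(outliers)                                      # [53]
print(min(inside), max(inside))                      # 57 81
```

Only the 53° launch falls below the lower fence, so the whiskers run from 57 to 81.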
Victims of Challenger Disaster
Seven astronauts were killed when the space shuttle Challenger suddenly exploded shortly after launch on January 28, 1986. The crewmembers who died during the mission are shown here.
Standard Deviation - Index of Variability
The most commonly used measure of the variability of scores in a distribution is the standard deviation.
It is an index of the variability (spread) of scores about the mean of a distribution, something like the average dispersion or deviation of the scores ($X$) about their mean ($\bar{X}$).
– The greater the spread of scores, the greater the standard deviation.
Standard Deviation - Illustration
– For example, in the figure below, the scores in distribution A cluster about the mean of the distribution, and the scores in distribution B are spread out.
The standard deviation for distribution A, then, would be less than the standard deviation for distribution B.
[Figure: distributions A and B]
Standard Deviation - Definition
More formally, the standard deviation may be defined as the square root of the average squared deviation of scores from the mean of the distribution, measured in units of the original score.
This definition is reflected in the deviation score formula.
$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}}$
Standard Deviation - Explanation
A major limitation of the range as a measure of the variability of scores in a distribution is that it is based on two extreme, often unstable, scores.
Another way to measure the spread of scores in a distribution might be to consider the variability of each score about the mean, by subtracting the mean (the most stable measure of central tendency) from each of the scores in the distribution.
Standard Deviation - Deviation Scores
This process produces deviation scores that can be represented as:
$x = X - \bar{X}$
These deviation scores provide a measure of the distance of each raw score from the mean of the distribution.
Standard Deviation - $\sum (X - \bar{X})$
Since we want a measure that takes all deviation scores into account, the most obvious next step would be to sum up the deviation scores.
Since the mean is the balance point of the distribution, however, we know that this attempt is doomed to failure:
$\sum (X - \bar{X}) = \sum x = 0$
Standard Deviation - $\sum (X - \bar{X})^2$
How do we get around this problem?
Square each deviation score to eliminate the sign and to preserve the relative distance between scores and the mean.
This would yield:
$\sum (X - \bar{X})^2 = \sum x^2 \geq 0$
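A short Python sketch makes the two sums concrete. It uses Sample A from the overview slides; the variable names are illustrative.

```python
scores = [0, 2, 6, 10, 12]        # Sample A from the overview slides
mean = sum(scores) / len(scores)  # 6.0

deviations = [x - mean for x in scores]            # x = X - mean
squared_deviations = [d ** 2 for d in deviations]  # x squared

print(deviations)               # [-6.0, -4.0, 0.0, 4.0, 6.0]
print(sum(deviations))          # 0.0
print(sum(squared_deviations))  # 104.0
```

The negative and positive deviations cancel exactly, while squaring removes the signs and leaves a total that grows with the spread of the scores.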
Standard Deviation - Control for N
We’re finished, right?
No.
We still have a problem comparing across distributions because different distributions are likely to have different numbers of people.
To control for the size of the distribution, we get a measure of the “average” amount of variability about the mean of the distribution.
Standard Deviation - Average Deviation
This average can be obtained by dividing the sum of squared deviation scores by $N$:
$\frac{\sum (X - \bar{X})^2}{N} = \frac{\sum x^2}{N}$
Standard Deviation - Average Dispersion
This formula will give us something like the average dispersion of scores in the distribution about the mean; it is defined as the variance.
The last problem is to return the measure of variability back to the original score scale.
This problem can be solved by taking the square root.
Standard Deviation - Square Root
$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}}$
Standard Deviation - Example (Deviation Score)
X    Mean    x     $x^2$
1    5      -4     16
3    5      -2      4
5    5       0      0
7    5       2      4
9    5       4     16
$\sum x^2 = 40$
$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\frac{40}{5}} = \sqrt{8} = 2.83$
Standard Deviation - Example (Raw Score)
X    $X^2$
1     1
3     9
5    25
7    49
9    81
$\sum X = 25$, $\sum X^2 = 165$
$S_X = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}} = \sqrt{\frac{165 - \frac{25^2}{5}}{5}} = \sqrt{\frac{165 - 125}{5}} = \sqrt{\frac{40}{5}} = \sqrt{8} = 2.83$
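The deviation-score and raw-score formulas can be compared directly in Python. This is a sketch under the assumption that the final division uses N, as in both worked examples; the function names are hypothetical.

```python
from math import isclose, sqrt

def sd_deviation_formula(scores):
    """Square root of the average squared deviation from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    sum_squared_deviations = sum((x - mean) ** 2 for x in scores)
    return sqrt(sum_squared_deviations / n)

def sd_raw_score_formula(scores):
    """Same quantity computed from the sum of X and the sum of X squared."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x ** 2 for x in scores)
    return sqrt((sum_x_squared - sum_x ** 2 / n) / n)

scores = [1, 3, 5, 7, 9]
print(round(sd_deviation_formula(scores), 2))  # 2.83
print(round(sd_raw_score_formula(scores), 2))  # 2.83
```

The raw-score version avoids computing each deviation, which is convenient for hand calculation; both routes reduce to the square root of 40 / 5.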
Organizational Chart - Variability
Describing variability (differences between scores)
Descriptive measures are used to describe a known sample of scores or a population. In formulas, the final division uses $N$.
– To describe sample variance, compute $S_X^2$. Taking the square root gives the sample standard deviation, $S_X$.
– To describe population variance, compute $\sigma_X^2$. Taking the square root gives the population standard deviation, $\sigma_X$.
Inferential measures are used to estimate the population based on a sample. In formulas, the final division uses $N - 1$.
– To estimate population variance, compute $s_X^2$. Taking the square root gives the estimated population standard deviation, $s_X$.
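The difference between dividing by N and by N - 1 can be seen with Python's standard `statistics` module, where `pvariance` and `pstdev` divide by N, and `variance` and `stdev` divide by N - 1. The scores are the ones from the worked examples; applying the N - 1 versions to them is an illustration, not something the slides compute.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [1, 3, 5, 7, 9]

# Descriptive measures: final division uses N.
print(float(pvariance(scores)), round(pstdev(scores), 2))  # 8.0 2.83

# Inferential estimates: final division uses N - 1.
print(float(variance(scores)), round(stdev(scores), 2))    # 10.0 3.16
```

The sum of squared deviations is 40 in both cases; only the divisor changes (5 versus 4), so the estimate of the population's variability is slightly larger than the descriptive value.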