EPSY 439 - Texas A&M University


EPSY 439

Variability

Range Variance Standard Deviation

z-score

Concept Map

Variability - Chapter 3

Index of Variability

– Distribution: Range, Standard Deviation, Variance

– Single Score: z Scores

Range: Depends only on extreme scores. Same mean, same range, different form.

Standard Deviation: Computation: Deviation-Score. Deviation Scores: (X - Mean). Computation: Raw-Score. Graphing Standard Deviations.

Variance: Population variability. Measure of a sample's variability. Estimate of a population's variability. Computation: Deviation-Score. Computation: Raw-Score. Concept of N - 1.

z Scores: Describes the relation of an X to the Mean with respect to the variability of the distribution. Represents # of SDs a score is above or below the Mean. Gives you a way to compare raw scores. Standard Score. Range between -3 and +3.

April 24, 2020 Copyright 2000 - Robert J. Hall

Overview

The mean, as a single value, is a useful measure for describing the center of a distribution and where the most frequent scores in a distribution are located. It does not tell us very much, however, about scores away from the center of the distribution and/or scores that occur infrequently.

Overview

Thus, to describe a data set accurately, we need to know not only where scores are centered but also how much individual scores within the distribution differ from one another. Measures of variability, then, provide context (spread/consistency) for measures of center.

Overview

Three different distributions have the same mean score

Sample A: 0, 2, 6, 10, 12; $\bar{X} = 6$

Sample B: 8, 7, 6, 5, 4; $\bar{X} = 6$

Sample C: 6, 6, 6, 6, 6; $\bar{X} = 6$

Each of the three samples has a mean of 6, so if you did not look at the raw scores, you might think that we have three identical distributions. Clearly, they are not the same.

Overview

Distance between the locations of scores in three distributions

Sample A ($\bar{X} = 6$): 0, 2, 6, 10, 12

Sample B ($\bar{X} = 6$): 4, 5, 6, 7, 8

Sample C ($\bar{X} = 6$): 6, 6, 6, 6, 6

[Figure: each sample's scores plotted as X marks on a number line running from 0 to 12]

Here you can see differences that separate scores in the three samples.

Statistical Model of Variability

The statistical model for abnormality requires two statistical concepts: a measure of the average and a measure of the deviation from the average.

– We have examined the standard techniques for defining the average score; we will now focus on methods for measuring deviations.

– The fact that scores deviate from average means that they are variable. Variability, then, is one of the most basic statistical concepts.

Meaning of Variability

The term variability has much the same meaning in statistics as it has in everyday language; to say that things are variable means that they are not all the same.

In statistics, our goal is to measure the amount of variability for a particular set of scores, a distribution.

– Are scores all clustered together, or are they scattered over a wide range of values?

Measuring Variability

A good measure of variability should provide an accurate picture of the spread of the distribution, and give an indication of how well an individual score (or group of scores) represents the entire distribution.

– In inferential statistics, where we often use small samples to answer general questions about large populations, variability is a particularly important concern. Why?

Types of Variability

In this section we will consider three different measures of variability: the range (along with the interquartile range), the standard deviation, and the variance.

Of these three, the standard deviation is by far the most important.

Range

The range is the distance between the largest score ($X_{\max}$) and the smallest score ($X_{\min}$) in the distribution.

In determining this distance, you must also take into account the real limits of the maximum and minimum X values.

The range, therefore, is computed as:

$$\text{range} = \text{URL } X_{\max} - \text{LRL } X_{\min}$$

where URL is the upper real limit and LRL is the lower real limit.

Range

The range is the most obvious way of describing how spread out the scores are in the distribution.

*** Problem *** Range is completely determined by the two extreme values and ignores the other scores in the distribution.

Range

Example: The following two distributions have exactly the same range (10 points in each case), but

– Distribution 1 clusters scores at the upper end of the range

– Distribution 2 spreads scores out over the range

Distribution 1: 1, 8, 9, 9, 10, 10

Distribution 2: 1, 2, 4, 6, 8, 10
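The range computation can be sketched in a few lines of Python. This is only an illustration: the function name `score_range` and the `unit` parameter are assumptions introduced here, and the sketch assumes scores are recorded to the nearest whole unit, so the real limits sit half a unit beyond the extreme scores.

```python
def score_range(scores, unit=1.0):
    """Range using real limits: URL of X_max minus LRL of X_min.

    `unit` is the assumed measurement precision; the real limits lie
    half a unit above the largest score and half a unit below the smallest.
    """
    upper_real_limit = max(scores) + unit / 2
    lower_real_limit = min(scores) - unit / 2
    return upper_real_limit - lower_real_limit


distribution_1 = [1, 8, 9, 9, 10, 10]
distribution_2 = [1, 2, 4, 6, 8, 10]

print(score_range(distribution_1))  # 10.0
print(score_range(distribution_2))  # 10.0
```

Both calls return the same value even though the scores are arranged very differently, which is exactly the weakness of the range described here.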

Range

Summary

Because the range does not consider all the scores in the distribution, it often does not give an accurate description of the variability for the entire distribution.

Generally, the range is considered to be a crude and unreliable measure of variability.

– John Tukey developed ways of looking at data that give prominence to the dispersion of data, using what are called the boxplot and the interquartile range.

Interquartile Range

A distribution can be divided into four equal parts using quartiles.

The first quartile ($Q_1$) is the score that separates the lower 25% of the distribution from the rest.

The second quartile ($Q_2$, or median) is the score that has exactly two quarters, or 50%, of the distribution below it.

Interquartile Range

Finally, the third quartile ($Q_3$) is the score that divides the bottom three-fourths of the distribution from the top quarter.

The interquartile range is the distance between the first quartile and the third quartile:

$$\text{interquartile range} = Q_3 - Q_1$$

Interquartile Range

[Figure: frequency distribution (histogram) for a set of 16 scores, with SCORE on the horizontal axis. The bottom 25% of scores lies below $Q_1 = 4.5$ and the top 25% lies above $Q_3 = 8$, so the interquartile range is 3.5 points. Std. Dev = 2.50, Mean = 6.3, N = 16.]

Boxplot

[Figure: boxplot of SCORE for N = 16, marking Min, $Q_1 = 4.5$, $Q_2 = 6$, $Q_3 = 8$, and Max; IQR = 3.5.]

Construction of Boxplot

Draw a number line that includes the range of observations.

Compute $Q_1$, $Q_2$, and $Q_3$.

Above the line drawn in the first step, draw a box extending from $Q_1$ to $Q_3$.

Inside the box, draw a line at the median.

Construction of Boxplot

To identify the outliers, compute the lower and upper fences, which are located 1.5(IQR) below $Q_1$ and above $Q_3$. That is, the lower fence is

$$f_L = Q_1 - 1.5(\text{IQR})$$

and the upper fence is

$$f_U = Q_3 + 1.5(\text{IQR})$$

Construction of Boxplot

Observations located beyond the fences are classified as outliers and are identified with an asterisk (*).

If there are no outliers, extend horizontal line segments (whiskers) from the ends of the box to the smallest and largest observations.

If there are outliers, extend the whiskers to the smallest and largest non-outliers.

Example

Temperatures at time of space shuttle launches:

53 57 58 63 66 67 67 67 68 69 70 70
70 70 72 73 75 75 76 76 78 79 80 81

Five number summary:

Low = 53, $Q_1$ = 66.84, M = 70, $Q_3$ = 75.5, High = 81
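The fence rule from the boxplot construction steps can be applied to these temperatures with a short Python sketch. The function names (`boxplot_fences`, `find_outliers`, `whisker_ends`) are hypothetical helpers written for this illustration, and the quartiles are taken directly from the five number summary rather than recomputed, since different software uses different quartile conventions.

```python
def boxplot_fences(q1, q3):
    """Return (lower fence, upper fence), each 1.5 * IQR beyond the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(observations, q1, q3):
    """Observations that fall beyond either fence."""
    lower_fence, upper_fence = boxplot_fences(q1, q3)
    return [x for x in observations if x < lower_fence or x > upper_fence]


def whisker_ends(observations, q1, q3):
    """Smallest and largest observations that are not outliers."""
    lower_fence, upper_fence = boxplot_fences(q1, q3)
    inside = [x for x in observations if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)


temperatures = [53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81]
q1, q3 = 66.84, 75.5  # quartiles as reported in the five number summary

lower_fence, upper_fence = boxplot_fences(q1, q3)
print(round(lower_fence, 2), round(upper_fence, 2))  # 53.85 88.49
print(find_outliers(temperatures, q1, q3))           # [53]
print(whisker_ends(temperatures, q1, q3))            # (57, 81)
```

With these quartiles, only the 53 degree launch falls below the lower fence, so the lower whisker stops at 57.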

Example - Frequency Distribution

Median, from the frequency distribution of temperature at time of Space Shuttle launches. The 50th percentile falls in the interval with real limits 69.5° and 70.5°: 41.7% of the scores lie below 69.5°, 58.3% lie below 70.5°, and the interval itself holds 16.6% of the scores.

$$\frac{50 - 41.7}{16.6} = \frac{8.3}{16.6} = .50 \qquad .50 \times 1 = .5 \qquad \text{Median} = 69.5^\circ + 0.5^\circ = 70.0^\circ$$

Example - Frequency Distribution

First quartile, from the frequency distribution of temperature at time of Space Shuttle launches. The 25th percentile falls in the interval with real limits 66.5° and 67.5°: 20.8% of the scores lie below 66.5°, 33.3% lie below 67.5°, and the interval itself holds 12.5% of the scores.

$$\frac{25 - 20.8}{12.5} = \frac{4.2}{12.5} = .34 \qquad .34 \times 1 = .34 \qquad Q_1 = 66.5^\circ + 0.34^\circ = 66.84^\circ$$

Example - Frequency Distribution

Third quartile, from the frequency distribution of temperature at time of Space Shuttle launches. The 75th percentile falls in the interval with real limits 74.5° and 75.5°: 66.7% of the scores lie below 74.5°, 75.0% lie below 75.5°, and the interval itself holds 8.3% of the scores.

$$\frac{75.0 - 66.7}{8.3} = \frac{8.3}{8.3} = 1.00 \qquad 1.00 \times 1 = 1.00 \qquad Q_3 = 74.5^\circ + 1.00^\circ = 75.5^\circ$$
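The interpolation used for the median and quartiles can be sketched in Python as follows. This is an illustrative reconstruction, not the software that produced the slides: `interpolated_percentile` is a hypothetical helper, and it assumes whole-degree scores, so each distinct score is treated as an interval of width 1 with real limits half a degree on either side.

```python
from collections import Counter


def interpolated_percentile(scores, percent, width=1.0):
    """Percentile by linear interpolation within a score's real limits.

    Each distinct score is treated as an interval of `width` centered on
    the score; the percentile is located proportionally inside the first
    interval whose cumulative count reaches the target.
    """
    n = len(scores)
    target_count = percent * n / 100  # scores that must lie below the percentile
    counts = Counter(scores)
    below = 0  # number of scores below the current interval
    for value in sorted(counts):
        frequency = counts[value]
        if below + frequency >= target_count:
            lower_real_limit = value - width / 2
            fraction = (target_count - below) / frequency
            return lower_real_limit + fraction * width
        below += frequency
    raise ValueError("percent must be between 0 and 100 and scores non-empty")


temperatures = [53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81]

print(round(interpolated_percentile(temperatures, 25), 2))  # 66.83
print(interpolated_percentile(temperatures, 50))            # 70.0
print(interpolated_percentile(temperatures, 75))            # 75.5
```

The first quartile comes out as 66.83 rather than 66.84 because the sketch works from exact counts, whereas the hand computation uses percentages already rounded to one decimal place.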

Example - Summary Statistics

[Table: summary statistics for Space Shuttle launch temperatures]

Example - Histogram

[Figure: histogram of Space Shuttle launch temperatures, horizontal axis from 52.5 to 80.0. Std. Dev = 7.22, Mean = 70.0, N = 24.]

Example - Boxplot

[Figure: boxplot of launch temperature, N = 24, vertical axis from 50 to 90, with case 1 flagged below the lower whisker]

O1 indicates that case number 1, 53°, is an outlier in this data set. Case 1 was the temperature at the time that the Challenger space shuttle was launched. On the morning of January 28, 1986, Challenger exploded and crashed 73 seconds after takeoff. There were no survivors. A simple boxplot such as this might have raised enough of a red flag for mission control to abort the launch.

Victims of Challenger Disaster

Seven astronauts were killed when the space shuttle Challenger suddenly exploded shortly after launch on January 28, 1986. The crewmembers who died during the mission are shown here.

Standard Deviation - Index of Variability

The most commonly used measure of the variability of scores in a distribution is the standard deviation.

It is an index of the variability (spread) of scores about the mean of a distribution, something like the average dispersion or deviation of the scores ($X$) about their mean ($\bar{X}$).

– The greater the spread of scores, the greater the standard deviation.

Standard Deviation - Illustration

– For example, in the figure below, the scores in distribution A cluster about the mean of the distribution, and the scores in distribution B are spread out.

The standard deviation for distribution A, then, would be less than the standard deviation for distribution B.

[Figure: distributions A and B]

Standard Deviation - Definition

More formally, the standard deviation may be defined as the square root of the average squared deviation of scores from the mean of the distribution, measured in units of the original score.

This definition is reflected in the deviation score formula.

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}}$$

Standard Deviation - Explanation

A major limitation of the range as a measure of the variability of scores in a distribution is that it is based on two extreme, often unstable, scores.

Another way to measure the spread of scores in a distribution might be to consider the variability of each score about the mean: subtract the mean (the most stable measure of central tendency) from each of the scores in the distribution.

Standard Deviation - Deviation Scores

This process produces deviation scores that can be represented as:

$$x = X - \bar{X}$$

These deviation scores provide a measure of the distance of each raw score from the mean of the distribution.

Standard Deviation - $\sum (X - \bar{X})$

Since we want a measure that takes all deviation scores into account, the most obvious next step would be to sum up the deviation scores.

Since the mean is the balance point of the distribution, however, we know that this attempt is doomed to failure:

$$\sum (X - \bar{X}) = \sum x = 0$$

Standard Deviation - $\sum (X - \bar{X})^2$

How do we get around this problem?

Square each deviation score to eliminate the sign and to preserve the relative distance between scores and the mean.

This would yield:

$$\sum (X - \bar{X})^2 = \sum x^2 \geq 0$$

The sum is 0 only when every score equals the mean.
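A small Python check makes the contrast between the two sums concrete. It uses Sample A from the Overview (0, 2, 6, 10, 12); the variable names are chosen for this illustration only.

```python
sample_a = [0, 2, 6, 10, 12]

mean = sum(sample_a) / len(sample_a)          # 6.0
deviations = [x - mean for x in sample_a]     # x = X - mean for each score

sum_of_deviations = sum(deviations)           # positive and negative deviations cancel
sum_of_squares = sum(d ** 2 for d in deviations)  # squaring removes the signs

print(deviations)         # [-6.0, -4.0, 0.0, 4.0, 6.0]
print(sum_of_deviations)  # 0.0
print(sum_of_squares)     # 104.0
```

The deviations always cancel around the mean, while their squares accumulate, which is why the squared version is the one carried forward.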

Standard Deviation - Control for N

We’re finished, right?

No.

We still have a problem comparing across distributions because different distributions are likely to have different numbers of people.

To control for the size of the distribution, we get a measure of the “average” amount of variability about the mean of the distribution.

Standard Deviation - Average Deviation

This average can be obtained by dividing the sum of squared deviation scores by N:

$$\frac{\sum (X - \bar{X})^2}{N} = \frac{\sum x^2}{N}$$

Standard Deviation - Average Dispersion

This formula will give us something like the average dispersion of scores in the distribution about the mean; it is defined as the variance.

The last problem is to return the measure of variability to the original score scale.

This problem can be solved by taking the square root.

Standard Deviation - Square Root

$$\sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}}$$

Standard Deviation - Example (Deviation Score)

X     Mean    x     $x^2$
1     5       -4    16
3     5       -2    4
5     5       0     0
7     5       2     4
9     5       4     16

$$\sum (X - \bar{X})^2 = \sum x^2 = 40$$

$$\frac{\sum x^2}{N} = \frac{40}{5} = 8 \qquad S = \sqrt{\frac{\sum x^2}{N}} = \sqrt{8} = 2.83$$
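The same deviation-score computation can be written as a short Python sketch. The function name `sd_deviation_score` is introduced here for illustration; it follows the descriptive formula with division by N.

```python
import math


def sd_deviation_score(scores):
    """Standard deviation from deviation scores, dividing by N."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((x - mean) ** 2 for x in scores)  # sum of x^2
    variance = sum_of_squares / n                          # average squared deviation
    return math.sqrt(variance)                             # back to the original score scale


scores = [1, 3, 5, 7, 9]
print(round(sd_deviation_score(scores), 2))  # 2.83
```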

Standard Deviation - Example (Raw Score)

X     $X^2$
1     1
3     9
5     25
7     49
9     81

$\sum X = 25$, $\sum X^2 = 165$

$$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}} = \sqrt{\frac{165 - \frac{25^2}{5}}{5}} = \sqrt{\frac{165 - 125}{5}} = \sqrt{\frac{40}{5}} = \sqrt{8} = 2.83$$
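As a rough check that the raw-score formula agrees with the deviation-score formula, both can be coded side by side in Python. The function names are hypothetical, chosen for this sketch.

```python
import math


def sd_raw_score(scores):
    """Standard deviation from raw scores: sqrt((sum X^2 - (sum X)^2 / N) / N)."""
    n = len(scores)
    sum_x = sum(scores)                          # sum of X
    sum_x_squared = sum(x * x for x in scores)   # sum of X^2
    sum_of_squares = sum_x_squared - sum_x ** 2 / n
    return math.sqrt(sum_of_squares / n)


def sd_deviation_score(scores):
    """Standard deviation from deviation scores, for comparison."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / n)


scores = [1, 3, 5, 7, 9]
print(round(sd_raw_score(scores), 2))  # 2.83
print(math.isclose(sd_raw_score(scores), sd_deviation_score(scores)))  # True
```

The raw-score form avoids computing each deviation, which is convenient by hand, but with large scores and floating-point arithmetic the subtraction can lose precision, so the deviation-score form is usually the safer choice in code.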

Organizational Chart - Variability

Describing variability (differences between scores)

Descriptive measures are used to describe a known sample of scores or a population. In formulas, final division uses N.

– To describe sample variance, compute $S_x^2$. Taking the square root gives the sample standard deviation, $S_x$.

– To describe population variance, compute $\sigma_x^2$. Taking the square root gives the population standard deviation, $\sigma_x$.

Inferential measures are used to estimate the population based on a sample. In formulas, final division uses N - 1.

– To estimate population variance, compute $s_x^2$. Taking the square root gives the estimated population standard deviation, $s_x$.
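The difference between dividing by N and dividing by N - 1 can be seen with Python's standard `statistics` module, where `pvariance` and `pstdev` divide by N and `variance` and `stdev` divide by N - 1. The data set is the 1, 3, 5, 7, 9 example used earlier; treating it as a sample drawn from a larger population is an assumption made only for this illustration.

```python
import statistics

scores = [1, 3, 5, 7, 9]

# Descriptive measures: final division uses N (40 / 5).
descriptive_variance = statistics.pvariance(scores)
descriptive_sd = statistics.pstdev(scores)

# Inferential measures: final division uses N - 1 (40 / 4).
estimated_variance = statistics.variance(scores)
estimated_sd = statistics.stdev(scores)

print(round(descriptive_sd, 2))  # 2.83
print(round(estimated_sd, 2))    # 3.16
```

The estimate based on N - 1 is always a little larger than the descriptive value, and the gap shrinks as N grows.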