GRAPHICAL METHODS FOR QUANTITATIVE DATA


Chapter 4 Displaying and Summarizing Quantitative Data

CHAPTER OBJECTIVES

At the conclusion of this chapter you should be able to:

1) Construct graphs that appropriately describe quantitative data.

2) Calculate and interpret numerical summaries of quantitative data.

3) Combine numerical methods with graphical methods to analyze a data set.

4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.

5) Apply software and/or calculators to automate graphical and numerical summary procedures.

Displaying Quantitative Data

Histograms

Stem and Leaf Displays

[Figure: Relative Frequency Histogram of Exam Grades. Horizontal axis: Grade (40 to 100); vertical axis: relative frequency (0 to .30).]

[Figure: Frequency Histogram]

Histograms

A histogram shows three general types of information:

It provides visual indication of where the approximate center of the data is.

We can gain an understanding of the degree of spread, or variation, in the data.

We can observe the shape of the distribution.

[Figure: histogram of all 200 m races run in 20.2 secs or less (approx. 700 races). Horizontal axis: times; vertical axis: frequency (0 to 60). Labeled times: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32.]

Histograms Showing Different Centers

Histograms Showing Different Centers (football head coach salaries)

Histograms Same Center, Different Spread (football head coach salaries)

Excel Example: 2012-13 NFL Salaries

Histogram Bin

Statcrunch Example: 2012-13 NFL Salaries

Grades on a statistics exam

Data: 75 66 77 66 64 73 91 65 59 86 61 86 61 58 70 77 80 58 94 78 62 79 83 54 52 45 82 48 67 55

Frequency Distribution of Grades

Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30

Relative Frequency Distribution of Grades

Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
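The two tables above can be reproduced with a few lines of code. This is a minimal sketch, assuming each class includes its lower limit and excludes its upper limit (the "40 up to 50" convention); the function name `frequency_table` is only illustrative.

```python
# Exam grades from the "Grades on a statistics exam" slide
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61, 58, 70,
          77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45, 82, 48, 67, 55]

def frequency_table(data, start, stop, width):
    """Count the values falling in each class [lower, lower + width)."""
    table = []
    for lower in range(start, stop, width):
        upper = lower + width
        count = sum(1 for x in data if lower <= x < upper)
        # store class limits, frequency, and relative frequency
        table.append((lower, upper, count, count / len(data)))
    return table

table = frequency_table(grades, 40, 100, 10)
for lower, upper, count, rel in table:
    print(f"{lower} up to {upper}: frequency {count}, relative frequency {rel:.3f}")
```

The relative frequencies are the bar heights of the relative frequency histogram.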

[Figure: Relative Frequency Histogram of Grades. Horizontal axis: Grade (40 to 100); vertical axis: relative frequency (0 to .30).]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?

1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays

Have the following general appearance

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Stem and Leaf Displays

Partition each number in the data into a “stem” and a “leaf”

Constructing a stem and leaf display
1) Determine the stem and leaf partition (5-20 stems).
2) Write the stems in a column with the smallest stem at the top; include all stems in the range of the data.
3) Use only 1 digit in the leaves; drop digits or round off.
4) Record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps.

Example: employee ages at a small company

18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39; stem: 10’s digit; leaf: 1’s digit

18: stem=1; leaf=8; 18 = 1 | 8

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Suppose a 95 yr. old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5
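The construction steps can be sketched in code. This is a minimal illustration for whole numbers only, assuming the stem is everything except the 1's digit and the leaf is the 1's digit; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return display rows with stem = value // 10 and leaf = value % 10."""
    rows = defaultdict(list)
    for value in sorted(data):          # sorting orders the leaves in each row
        rows[value // 10].append(value % 10)
    lines = []
    # include every stem in the range of the data, even one with no leaves
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)
print("\n".join(display))

# Hiring a 95 yr. old adds the empty stems 7 and 8 as well as stem 9
print("\n".join(stem_and_leaf(ages + [95])))
```

Listing the empty stems is what makes the gap between 64 and 95 visible in the display.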

Number of TD passes by NFL teams: 2012-2013 season (stems are 10’s digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates n = 138

  #   Stem   Leaves
      4*
  3   4.     588
  9   5*     001233444
 10   5.     5556788899
 23   6*     00011111122233333344444
 23   6.     55556666667777788888888
 16   7*     00000112222334444
 23   7.     55555666666777888888999
 10   8*     0000112224
 10   8.     5555667789
  4   9*     0012
  2   9.     58
  4   10*    0223
      10.
  1   11*    1

Advantages/Disadvantages of Stem-and-Leaf Displays

Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)

Disadvantages
display becomes unwieldy for large data sets

Population of 185 US cities with between 100,000 and 500,000

Multiply stems by 100,000

Back-to-back stem-and-leaf displays. TD passes by NFL teams: 1999-2000, 2012-13; multiply stems by 10

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
          1 |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?

Stems are 10’s digits

1. 4
2. 6
3. 8
4. 10
5. 12

Interpreting Graphical Displays: Shape

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

[Figure panels: Symmetric distribution; Skewed distribution; Complex, multimodal distribution]

Not all distributions have a simple overall shape, especially when there are few observations.

Heights of Students in Recent Stats Class

Shape (cont.): Female heart attack patients in New York state

Age: left-skewed Cost: right-skewed

Shape (cont.): Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.

A large gap in the distribution is typically a sign of an outlier.

Alaska Florida

Center: typical value of frozen personal pizza? ~$2.65

Spread: fuel efficiency 4, 8 cylinders

4 cylinders: more spread 8 cylinders: less spread

Other Graphical Methods for Economic Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.

Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

Unemployment Rate, by Educational Attainment

Water Use During Super Bowl

Winning Times 100 M Dash

Numerical Summaries of Quantitative Data

Numerical and More Graphical Methods to Describe Univariate Data

2 characteristics of a data set to measure

center measures where the “middle” of the data is located

variability measures how “spread out” the data is

The median: a measure of center

Given a set of n measurements arranged in order of magnitude,

Median = middle value, n odd; mean of 2 middle values, n even

Ex. 2, 4, 6, 8, 10; n=5; median=6

Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5

Student Pulse Rates (n=62)

38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103

Median = (75+76)/2 = 75.5 (mean of the values in ordered positions 31 and 32)

Medians are used often

Year 2011 baseball salaries Median $1,450,000 (max=$32,000,000 Alex Rodriguez; min=$414,000)

Median fan age: MLB 45 ; NFL 43 ; NBA 41 ; NHL 39

Median existing home sales price: May 2011 $166,500 ; May 2010 $174,600

Median household income (2008 dollars) 2009 $50,221 ; 2008 $52,029

The median splits the histogram into 2 halves of equal area

Examples

Example: n = 7 17.5 2.8 3.2 13.9 14.1 25.3 45.8

Example n = 7 (ordered):

m = 14.1

2.8 3.2 13.9 14.1 17.5 25.3 45.8

Example: n = 8 17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8

Example n =8 (ordered)

m = (14.1+17.5)/2 = 15.8

2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8
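The two examples can be checked with a short function. This is a minimal sketch of the definition given earlier (middle value when n is odd, mean of the 2 middle values when n is even); Python's built-in `statistics.median` follows the same definition.

```python
def median(values):
    """Median: middle value (n odd) or mean of the 2 middle values (n even)."""
    ordered = sorted(values)            # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]))        # n = 7
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8]))  # n = 8
```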

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 4960 4971 5245 5546 7586

1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?

4429 4960 5245 5546 4971 5587 7586

1. 5245
2. 4965.5
3. 5546
4. 4971

Measures of Spread

The range and interquartile range

Ways to measure variability

range=largest-smallest

OK sometimes; in general, too crude; sensitive to one large or small data value

The range measures spread by examining the ends of the data

A better way to measure spread is to examine the middle portion of the data

Quartiles: Measuring spread by examining the middle

The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).

The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

Sorted data (n = 25):

0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 = first quartile = 2.3 (ordered position 7)

m = median = 3.4 (ordered position 13)

Q3 = third quartile = 4.2 (ordered position 19)

Quartiles and median divide data into 4 pieces

1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread

http://www2.acs.ncsu.edu/UPA/admissions/fresprof.htm

http://www2.acs.ncsu.edu/UPA/peers/current/ncsu_peers/sat.htm

University of Southern California

Rules for Calculating Quartiles

Step 1: find the median of all the data (the median divides the data in half).
Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.

Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.
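The rules can be written as a short function. This is a minimal sketch of the convention stated here (overall median included in both halves when n is odd); the names `median_sorted` and `quartiles` are illustrative, and the two data sets are the worked examples in this section.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule on this slide."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # n odd: overall median is in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]       # n even: halves do not overlap
        upper = ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)

even_example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
ruth_home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(quartiles(even_example))
print(quartiles(ruth_home_runs))
```

Other textbooks and software packages use slightly different quartile conventions, so their answers can differ a little from this rule.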

Example

2 4 6 8 10 12 14 16 18 20; n = 10

Median m = (10+12)/2 = 22/2 = 11

Q1: median of lower half 2 4 6 8 10; Q1 = 6

Q3: median of upper half 12 14 16 18 20; Q3 = 16

Quartile example: odd no. of data values

HR’s hit by Babe Ruth in each season as a Yankee: 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

Ordered values: 22 25 34 35 41 41 46 46 46 47 49 54 54 59 60

Median: value in ordered position 8; median = 46

Lower half (including overall median): 22 25 34 35 41 41 46 46

Q1 = lower quartile = (35+41)/2 = 38

Upper half (including overall median): 46 46 47 49 54 54 59 60

Q3 = upper quartile = (49+54)/2 = 51.5

Pulse Rates n = 138

  #   Stem   Leaves
      4*
  3   4.     588
  9   5*     001233444
 10   5.     5556788899
 23   6*     00011111122233333344444
 23   6.     55556666667777788888888
 16   7*     00000112222334444
 23   7.     55555666666777888888999
 10   8*     0000112224
 10   8.     5555667789
  4   9*     0012
  2   9.     58
  4   10*    0223
      10.
  1   11*    1

Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70

Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63

Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?

  #    stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 287
2. 257.5
3. 263.5
4. 262.5

Interquartile range

lower quartile Q 1

middle quartile: median

upper quartile Q 3

interquartile range (IQR)

IQR = Q 3 – Q 1 measures spread of middle 50% of the data

Example: beginning pulse rates

Q 3 = 78; Q 1 = 63

IQR = 78 – 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?

  #    stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0

1. 23.5
2. 39.5
3. 46
4. 69.5

5-number summary of data

              Minimum    Q1    median    Q3    maximum
Pulse data       45      63      70      78      111
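Software can produce a 5-number summary directly. This is a sketch using Python's standard `statistics.quantiles` with `method="inclusive"` on the Babe Ruth home run data from the quartile example. For a data set with an odd number of values, as here, this method agrees with the quartile rule used in this chapter; for other data sets, software conventions can give slightly different quartiles. The function name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

ruth_home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
summary = five_number_summary(ruth_home_runs)
iqr = summary[3] - summary[1]        # IQR = Q3 - Q1
print("5-number summary:", summary)
print("IQR:", iqr)
```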

End of General Numerical Summaries Next: Numerical Summaries of Symmetric Data

Numerical Summaries of Symmetric Data.

Measure of Center: Mean Measure of Variability: Standard Deviation

Symmetric Data Body temp. of 93 adults

Recall: 2 characteristics of a data set to measure

center measures where the “middle” of the data is located

variability measures how “spread out” the data is

Measure of Center When Data Approx. Symmetric

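mean (arithmetic mean)

notation

$x_1, x_2, x_3, \ldots, x_n$: the measurements in the data set

$n$: number of measurements in data set; sample size

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$

Sample mean

$\bar{x} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Population mean $\mu$ (value typically not known); N = population size

$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$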

Connection Between Mean and Histogram

A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6

[Figure: histogram of Absences from Work; vertical axis: Frequency (0 to 70)]

Mean: balance point

Median: 50% area each half

right histo: mean 55.26 yrs, median 57.7 yrs

Properties of Mean, Median

1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).

2. The mean uses the value of every number in the data set; the median does not.

Ex. 2, 4, 6, 8: $\bar{x} = 20/4 = 5$; $m = (4+6)/2 = 5$

Ex. 2, 4, 6, 9: $\bar{x} = 21/4 = 5\tfrac{1}{4}$; $m = (4+6)/2 = 5$

Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$

$\bar{x} = \dfrac{\sum_{i=1}^{23} x_i}{23} = 84.48$

$m$: location: 12th obs.; $m = 85$

2010, 2014 baseball salaries

2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000

2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000

Disadvantage of the mean

Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
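A quick way to see this is to recompute both measures with and without an extreme value. This sketch uses the class pulse rates from the earlier example and drops the largest value (140); choosing that value to remove is an illustration added here, not part of the original slides.

```python
from statistics import mean, median

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
         85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the single largest observation removed
without_max = [x for x in pulse if x != 140]

print("all data:    mean", round(mean(pulse), 2), "median", median(pulse))
print("without 140: mean", round(mean(without_max), 2), "median", median(without_max))
```

Removing one observation shifts the mean by several times as much as it shifts the median.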

[Figure: Baseball Salaries: Mean, Median and Maximum 1985-2014. Series: Mean, Median, Maximum; horizontal axis: Year.]

Skewness: comparing the mean and median

Skewed to the right (positively skewed)

mean > median

[Figure: histogram of 2011 Baseball Salaries. Horizontal axis: Salary ($1,000's); vertical axis: frequency.]

Skewed to the left (negatively skewed)

mean < median

mean=78; median=87

[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100); vertical axis: frequency.]

Symmetric data

mean, median approx. equal

[Figure: histogram, Bank Customers: 10:00-11:00 am. Horizontal axis: Number of Customers; vertical axis: frequency.]

DESCRIBING VARIABILITY OF SYMMETRIC DATA

Describing Symmetric Data (cont.)

Measure of center for symmetric data:

Sample mean

$\bar{x} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Measure of variability for symmetric data?

Example

2 data sets:

$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$

$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$

On average, they’re both comfortable

Ways to measure variability

1. range = largest - smallest: ok sometimes; in general, too crude; sensitive to one large or small obs.

2. measure spread from the middle, where the middle is $\bar{x}$;

deviation of $x_i$ from the mean: $x_i - \bar{x}$

sum the deviations of all the $x_i$ from the mean: $\sum_{i=1}^{n} (x_i - \bar{x})$

Previous Example

sum of deviations from mean:

$x_1 = 49$, $x_2 = 51$: $(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = 0$

$y_1 = 0$, $y_2 = 100$: $(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = 0$

The Sample Standard Deviation, a measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations

Square each deviation: $(x_i - \bar{x})^2$

Sum the squared deviations, $\sum_{i=1}^{n} (x_i - \bar{x})^2$, and find the “average”

Then take the square root of the average:

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, called the sample standard deviation

Calculations …

Women height (inches)

  i  | x_i | mean | x_i - mean | (x_i - mean)^2
  1  | 59  | 63.4 |   -4.4     |   19.0
  2  | 60  | 63.4 |   -3.4     |   11.3
  3  | 61  | 63.4 |   -2.4     |    5.6
  4  | 62  | 63.4 |   -1.4     |    1.8
  5  | 62  | 63.4 |   -1.4     |    1.8
  6  | 63  | 63.4 |   -0.4     |    0.1
  7  | 63  | 63.4 |   -0.4     |    0.1
  8  | 63  | 63.4 |   -0.4     |    0.1
  9  | 64  | 63.4 |    0.6     |    0.4
 10  | 64  | 63.4 |    0.6     |    0.4
 11  | 65  | 63.4 |    1.6     |    2.7
 12  | 66  | 63.4 |    2.6     |    7.0
 13  | 67  | 63.4 |    3.6     |   13.3
 14  | 68  | 63.4 |    4.6     |   21.6
     | Mean 63.4 |  | Sum 0.0   | Sum 85.2

Mean = 63.4

Sum of squared deviations from mean = 85.2

(n − 1) = 13; (n − 1) is called the degrees of freedom (df)

We’ll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.

Mean ± 1 s.d.

1. First calculate the variance $s^2$.

$s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

2. Then take the square root to get the standard deviation $s$.

$s = \sqrt{\dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
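The two steps can be followed literally in code. This is a minimal sketch applied to the 14 women's heights from the calculation table; the variable names are illustrative, and the resulting values of the variance and standard deviation are computed here rather than quoted from the slides.

```python
from math import sqrt

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]  # inches

n = len(heights)
x_bar = sum(heights) / n                         # sample mean
deviations = [x - x_bar for x in heights]        # x_i - mean; these sum to 0
sum_sq = sum(d ** 2 for d in deviations)         # sum of squared deviations

variance = sum_sq / (n - 1)                      # step 1: divide by n - 1 (df)
s = sqrt(variance)                               # step 2: take the square root

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {variance:.2f}, standard deviation = {s:.2f}")
```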

Population Standard Deviation

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

$\sigma$: population standard deviation; value of $\sigma$ typically not known

Remarks

1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)

2. Note that $s$ and $\sigma$ are always greater than or equal to zero.

3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.

When does $s = 0$? When does $\sigma = 0$?

When all data values are the same.

Remarks (cont.)

4. The standard deviation is the most commonly used measure of risk in finance and business: stocks, mutual funds, etc.

5. Variance

$s^2$: sample variance

$\sigma^2$: population variance

Units are squared units of the original data: square dollars, square gallons ??

Remarks 6): Why divide by n-1 instead of n?

degrees of freedom

each observation has 1 degree of freedom

however, when you estimate an unknown population parameter like $\mu$, you lose 1 degree of freedom

In the formula for $s$, we use $\bar{x}$ to estimate the unknown value of $\mu$:

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Remarks 6) (cont.): Why divide by n-1 instead of n? Example

Suppose we have 3 numbers whose average is 9.

Choose ANY values for $x_1$ and $x_2$: $x_1$ = ____, $x_2$ = ____

Since the average (mean) is 9, $x_1 + x_2 + x_3$ must equal 9*3 = 27, so $x_1 + x_2 + x_3 = 27$.

Then $x_3$ was determined once we selected $x_1$ and $x_2$, since the average was 9.

3 numbers but only 2 “degrees of freedom”

Computational Example

observations 1, 3, 5, 9; $\bar{x} = 18/4 = 4.5$

$s = \sqrt{\dfrac{(1 - 4.5)^2 + (3 - 4.5)^2 + (5 - 4.5)^2 + (9 - 4.5)^2}{4 - 1}}$

$= \sqrt{\dfrac{(-3.5)^2 + (-1.5)^2 + (.5)^2 + (4.5)^2}{3}}$

$= \sqrt{\dfrac{12.25 + 2.25 + .25 + 20.25}{3}} = \sqrt{\dfrac{35}{3}} = \sqrt{11.67} = 3.42$

$s^2 = 11.67$

class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140

$n = 23$; $\bar{x} = 84.48$; $m = 85$

$s^2 = 290.26$ (beats per minute)$^2$

$s = 17.037$ beats per minute
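In practice these summaries come from software. As one possible illustration, Python's standard `statistics` module computes the sample variance and sample standard deviation with the n − 1 divisor; a calculator or Excel would serve equally well.

```python
import statistics

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85,
         85, 89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

print("n        =", len(pulse))
print("mean     =", round(statistics.mean(pulse), 2))
print("median   =", statistics.median(pulse))
print("variance =", round(statistics.variance(pulse), 2))   # divides by n - 1
print("st. dev. =", round(statistics.stdev(pulse), 3))      # square root of the variance

# The small computational example (observations 1, 3, 5, 9) works the same way
small = [1, 3, 5, 9]
print(round(statistics.variance(small), 2), round(statistics.stdev(small), 2))
```

`statistics.pvariance` and `statistics.pstdev` divide by N instead and correspond to the population formulas.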

Review: Properties of $s$ and $\sigma$

$s$ and $\sigma$ are always greater than or equal to 0

when does $s = 0$? $\sigma = 0$?

The larger the value of $s$ (or $\sigma$), the greater the spread of the data

the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE                            POPULATION
$\bar{y}$  sample mean            $\mu$  population mean
$m$  sample median                $m$  population median
$s^2$  sample variance            $\sigma^2$  population variance
$s$  sample stand. dev.           $\sigma$  population stand. dev.

End of Chapter 4