F13_lect3b_ch02


Lecture 4 Chapter 2. Numerical descriptors

Objectives (PSLS Chapter 2)

Describing distributions with numbers

• Measure of center: mean and median (Meas. Cent. Award)
• Measure of spread: quartiles, standard deviation, IQR (Meas. Var. Award)
• The five-number summary and boxplots (SUMS Award)
• Dealing with outliers (Outliers Award)
• Choosing among summary statistics (All Numeric Awards)
• Organizing a statistical problem (Foundational)

Measure of center: the mean

The mean, or arithmetic average

To calculate the average (mean) of a data set, add all values, then divide by the number of individuals. It is the “center of mass.”

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

where n is the sample size and x is the variable. In summation notation:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Measure of center: the median

The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

Sorted data (n = 25), positions 1 to 25 reading left to right:

0.6  1.2  1.5  1.6  1.9
2.1  2.3  2.3  2.5  2.8
2.9  3.3  3.4  3.6  3.7
3.8  3.9  4.1  4.2  4.5
4.7  4.9  5.3  5.6  6.1

1) Sort observations from smallest to largest. Let n = number of observations.

2) The location of the median is (n + 1)/2 in the sorted list.

______________________________

If n is odd, the median is the value of the center observation.

Example: n = 25, (n + 1)/2 = 13, median = 3.4

If n is even, the median is the mean of the two center observations.

Example: n = 24, (n + 1)/2 = 12.5, median = (3.3 + 3.4)/2 = 3.35

Sorted data (n = 24), positions 1 to 24 reading left to right:

0.6  1.2  1.5  1.6  1.9  2.1
2.3  2.3  2.5  2.8  2.9  3.3
3.4  3.6  3.7  3.8  3.9  4.1
4.2  4.5  4.7  4.9  5.3  5.6
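
The two-step rule above (sort, then locate position (n + 1)/2) can be sketched in Python. This is a minimal illustration, not part of the original slides; the function name `median` and the variable `disease_x` are chosen here for the example, and the data are the 25 values from the slide, sorted inside the function.

```python
def median(values):
    """Median by the (n + 1) / 2 location rule."""
    ordered = sorted(values)      # step 1: sort from smallest to largest
    n = len(ordered)
    mid = n // 2                  # 0-based index of the center (or upper center)
    if n % 2 == 1:
        # n odd: position (n + 1) / 2 is a single observation
        return ordered[mid]
    # n even: position (n + 1) / 2 falls between two observations; average them
    return (ordered[mid - 1] + ordered[mid]) / 2


# The 25 observations from the slide, in the order they appear in the transcript
disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

print(median(disease_x))                  # n = 25: the 13th sorted value
print(round(median(disease_x[:24]), 2))   # n = 24: mean of the 12th and 13th values
```

Dropping the last value (6.1) gives the n = 24 case from the slide; the result is rounded only to hide floating-point noise.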

Comparing the mean and the median

The median is a measure of center that is resistant to skew and outliers. The mean is not.

[Figure: mean and median for a symmetric distribution]

[Figure: mean and median for skewed distributions; panels: left skew, right skew]
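
To see what "resistant" means in practice, the sketch below compares the mean and median of the 25 slide values with a second copy in which the largest value, 6.1, is replaced by 7.9 (the value used in the outlier example later in this lecture). It uses Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

# Same data, but the last observation is 7.9 instead of 6.1
with_outlier = disease_x[:-1] + [7.9]

print(round(mean(disease_x), 3), median(disease_x))        # 3.312 3.4
print(round(mean(with_outlier), 3), median(with_outlier))  # 3.384 3.4
```

The mean shifts by 1.8/25 = 0.072 because every value enters the sum, while the median stays at the 13th sorted value.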

Measure of spread: quartiles

The first quartile, Q1, is the median of the values below the median in the sorted data set.

The third quartile, Q3, is the median of the values above the median in the sorted data set.

Sorted data (n = 25):

Positions 1 to 12 (below the median): 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
Position 13 (median): 3.4
Positions 14 to 25 (above the median): 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1

Q1 = first quartile = 2.2

M = median = 3.4

Q3 = third quartile = 4.35
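
As a worked check of these three numbers, the sketch below splits the sorted data at the median and takes the median of each half. It assumes the convention that, for odd n, the median itself belongs to neither half; the slide's values Q1 = 2.2 and Q3 = 4.35 are consistent with that convention. Variable names are illustrative.

```python
from statistics import median

disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

ordered = sorted(disease_x)
n = len(ordered)                 # 25, so the median is the 13th value

below = ordered[:n // 2]         # positions 1 to 12
above = ordered[n // 2 + 1:]     # positions 14 to 25 (median excluded, n is odd)

q1 = median(below)
m = median(ordered)
q3 = median(above)

print(round(q1, 2), m, round(q3, 2))   # 2.2 3.4 4.35
```

Statistical software may use a different interpolation rule for quartiles, so results from a package can differ slightly from this hand method.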

How fast do skin wounds heal?

Here are the skin healing rate data from 18 newts, measured in micrometers per hour:

28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35

Sorted data:

11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40

Median = ???

Quartiles = ???

Measure of spread: standard deviation

The standard deviation is used to describe the variation around the mean.

To get the standard deviation of a SAMPLE of data:

1) Calculate the variance s^2:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

2) Take the square root to get the standard deviation s:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Learn how to obtain the standard deviation of a sample using a spreadsheet.

A person’s metabolic rate is the rate at which the body consumes energy. Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).

$$\bar{x} = 1600, \qquad \sum (x_i - \bar{x})^2 = 214{,}870, \qquad df = n - 1 = 6$$

$$s^2 = \frac{1}{df} \sum (x_i - \bar{x})^2 = \frac{214{,}870}{6} = 35{,}811.7$$

$$s = \sqrt{35{,}811.7} = 189.2$$
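
The same calculation can be sketched in Python. The seven individual rates are not listed in this transcript; the values below are an assumption, taken as the ones usually printed with this textbook example, and they do reproduce the mean (1600) and the sum of squared deviations (214,870) given above.

```python
import math

# ASSUMED data: the transcript gives only the totals, not these seven values
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]   # Cal per 24 hours

n = len(rates)
x_bar = sum(rates) / n                                   # mean
sum_sq = sum((x - x_bar) ** 2 for x in rates)            # sum of squared deviations
df = n - 1                                               # degrees of freedom
variance = sum_sq / df                                   # s^2
s = math.sqrt(variance)                                  # s

print(x_bar, sum_sq, df)      # 1600.0 214870.0 6
print(round(variance, 1))     # 35811.7
print(round(s, 1))            # 189.2
```

The standard library function `statistics.stdev(rates)` uses the same n - 1 divisor and should agree with `s`.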

Center and spread in boxplots

Disease X data (n = 25), sorted, positions 1 to 25 reading left to right:

0.6  1.2  1.5  1.6  1.9
2.1  2.3  2.3  2.5  2.8
2.9  3.3  3.4  3.6  3.7
3.8  3.9  4.1  4.2  4.5
4.7  4.9  5.3  5.6  6.1

“Five-number summary”

max = 6.1
Q3 = 4.35
median = 3.4
Q1 = 2.2
min = 0.6

[Figure: boxplot of the Disease X data; axis from 0 to 7]
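
A boxplot is drawn from exactly these five numbers. The sketch below collects them in one helper; `five_number_summary` is a name chosen here, and the quartiles follow the median-of-each-half convention (median excluded from both halves when n is odd), which is an assumption that matches the slide's values.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max as a dictionary."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # values below the median
    upper = ordered[(n + 1) // 2:]     # values above the median
    return {
        "min": ordered[0],
        "Q1": median(lower),
        "median": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }


disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

for name, value in five_number_summary(disease_x).items():
    print(name, round(value, 2))
```

The box of the boxplot spans Q1 to Q3 with a line at the median, and the whiskers extend to the min and max when there are no suspected outliers.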

Boxplots and skewed data

[Figure: boxplots for a symmetric and a right-skewed distribution; panels: Disease X, Multiple Myeloma]

Boxplots show symmetry or skew.

IQR and outliers

The interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot):

IQR = Q3 - Q1

An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier?

• Suspected low outlier: any value < Q1 - 1.5 × IQR
• Suspected high outlier: any value > Q3 + 1.5 × IQR

Disease X data (n = 25), sorted; individual #25 has the value 7.9:

0.6  1.2  1.5  1.6  1.9
2.1  2.3  2.3  2.5  2.8
2.9  3.3  3.4  3.6  3.7
3.8  3.9  4.1  4.2  4.5
4.7  4.9  5.3  5.6  7.9

Q1 = 2.2
Q3 = 4.35

[Figure: boxplot of the Disease X data with one point marked * above the upper whisker; axis from 0 to 8]

Interquartile range: Q3 - Q1 = 4.35 - 2.2 = 2.15

Distance to Q3: 7.9 - 4.35 = 3.55

Individual #25 has a survival of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years, so individual #25 is a suspected outlier.
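
The 1.5 × IQR rule can be sketched as a small filter. The function name `suspected_outliers` and its parameters are chosen here for illustration; Q1 and Q3 are passed in directly, using the values from the slide rather than being recomputed.

```python
def suspected_outliers(values, q1, q3, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr      # below this: suspected low outlier
    high_fence = q3 + k * iqr     # above this: suspected high outlier
    return [x for x in values if x < low_fence or x > high_fence]


# Disease X survival times (years), with individual #25 at 7.9
survival = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
            3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 7.9]

print(suspected_outliers(survival, q1=2.2, q3=4.35))   # [7.9]
```

With these quartiles the fences sit at about -1.025 and 7.575 years, so only 7.9 is flagged. The rule marks a value as suspected; deciding what to do with it is a separate question.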

Dealing with outliers: Baldi and Moore’s Suggestions

What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:

• Human error in recording information
• Human error in experimentation or data collection
• Unexplainable but apparently legitimate wild observations

• Are you interested in ALL individuals?
• Are you interested only in typical individuals?
• Learn. Does the outlier tell you something interesting about biology?

Don’t discard outliers just to make your data look better, and don’t act as if they did not exist.

Choosing among summary statistics: B & M

• Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. Plot the mean and use the standard deviation for error bars.

• Otherwise, use the median and the five-number summary, which can be plotted as a boxplot.

• Describe a distribution with its S.U.M.S. (shape, unusual points, middle, and spread).

[Figure: height of 30 women, shown as a boxplot and as mean ± s.d.; axis from 58 to 69]

Deep-sea sediments.

Phytopigment concentrations in deep-sea sediments collected worldwide show a very strong right-skew.

• Which of these two values is the mean and which is the median? 0.015 and 0.009 grams per square meter of bottom surface
• Which would be a better summary statistic for these data?