Transcript Slide 1

Review of Previous Lecture


Range

– The difference between the largest and smallest values

Interquartile range

– The difference between the 25th and 75th percentiles

Variance

– The sum of squared deviations from the mean, divided by the population size (population variance) or by the sample size minus one (sample variance)

Standard deviation

– The square root of the variance

Z-scores

– The number of standard deviations an observation is away from the mean

Outline of Session

• Another Measure of Dispersion – Coefficient of Variation (CV)
• Histograms
• Skewness
• Kurtosis
• Other Descriptive Summary Measures

Measures of Dispersion – Coefficient of Variation

• Coefficient of variation (CV) measures the spread of a set of data as a proportion of its mean
• It is the ratio of the sample standard deviation to the sample mean

$CV = \frac{s}{\bar{x}} \times 100\%$

• It is sometimes expressed as a percentage
• There is an equivalent definition for the coefficient of variation of a population

Measures of Dispersion – Coefficient of Variation

• A standard application of the Coefficient of Variation (CV) is to characterize the variability of geographic variables over space or time
• Coefficient of Variation (CV) is particularly applied to characterize the interannual variability of climate variables (e.g., temperature or precipitation) or biophysical variables (leaf area index (LAI), biomass, etc.)

Site | Mean | Standard Deviation | Coefficient of Variation (CV)
Chapel Hill (A) | 1198.10 | 191.80 | 0.16 (16%)
Bend (B) | 298.07 | 82.08 | 0.28 (28%)

Coefficient of Variation (CV)

• It is a dimensionless number that can be used to compare the amount of variance between populations with different means

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

$CV = \frac{s}{\bar{x}} \times 100\%$
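The CV formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the `annual` series are hypothetical, and the summary values are the Chapel Hill and Bend means and standard deviations quoted in the table.

```python
import math


def sample_std(values):
    """Sample standard deviation: sqrt of squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def coefficient_of_variation(values):
    """CV = s / mean, as a proportion (multiply by 100 for a percentage)."""
    mean = sum(values) / len(values)
    return sample_std(values) / mean


def cv_from_summary(mean, std):
    """CV when only the mean and standard deviation are available."""
    return std / mean


# Hypothetical annual values, used only to exercise the functions.
annual = [900.0, 1100.0, 1000.0, 1200.0, 800.0]
print(f"CV of hypothetical series: {coefficient_of_variation(annual):.3f}")

# Summary statistics quoted in the slide table.
chapel_hill = cv_from_summary(1198.10, 191.80)
bend = cv_from_summary(298.07, 82.08)
print(f"Chapel Hill: {chapel_hill:.2f}, Bend: {bend:.2f}")
```

Bend has the smaller standard deviation but the larger CV, which is the point of dividing by the mean: the spread is judged relative to the typical value.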

Source: http://www.daymet.org

[Figure, 1990 - 2000: CV of NDVI ~ CV of temperature; CV of NDVI ~ CV of precipitation. Panels: NDVI, Temperature, Precipitation. Source: Xiao & Moody, 2004]

Measures of Skewness and Kurtosis

• A fundamental task in many statistical analyses is to characterize the location and variability of a data set (measures of central tendency vs. measures of dispersion)
• Neither kind of measure tells us anything about the shape of the distribution
• A further characterization of the data includes skewness and kurtosis
• The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set

Histograms

Fig. 3. Histogram of crown width (m) measured in situ for a random sample of Quercus robur trees in Frame Wood (n = 63; mean = 9.3 m; SD = 4.64 m). Source: Koukoulas & Blackburn, 2005. Journal of Vegetation Science: Vol. 16, No. 5, pp. 587–596

Frequency & Distribution

• A histogram is one way to depict a frequency distribution
• Frequency is the number of times a variable takes on a particular value
• Note that any variable has a frequency distribution
• e.g., roll a pair of dice several times and record the resulting values (constrained to being between 2 and 12), counting the number of times any given value occurs (the frequency of that value occurring), and take these all together to form a frequency distribution

Frequency & Distribution

• Frequencies can be absolute (when the frequency provided is the actual count of the occurrences) or relative (when they are normalized by dividing the absolute frequency by the total number of observations, giving values in [0, 1])
• Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources (e.g., when the numbers of observations from each source differ)
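A small sketch of the dice example, assuming simulated rolls from Python's `random` module with fixed seeds. The function name and the two sample sizes are hypothetical choices, picked to show why relative frequencies help when the sources have different numbers of observations.

```python
import random
from collections import Counter


def roll_pair_frequencies(n_rolls, seed=0):
    """Simulate n_rolls of two dice; return absolute and relative frequencies."""
    rng = random.Random(seed)
    totals = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
    absolute = Counter(totals)  # actual counts of each total
    relative = {value: count / n_rolls for value, count in absolute.items()}
    return absolute, relative


# Two "sources" with very different numbers of observations.
abs_small, rel_small = roll_pair_frequencies(50, seed=1)
abs_large, rel_large = roll_pair_frequencies(5000, seed=2)

print("value  abs(50)  rel(50)  abs(5000)  rel(5000)")
for value in range(2, 13):
    print(
        f"{value:5d}  {abs_small.get(value, 0):7d}  "
        f"{rel_small.get(value, 0.0):7.3f}  "
        f"{abs_large.get(value, 0):9d}  {rel_large.get(value, 0.0):9.3f}"
    )
```

The absolute counts of the two runs are not comparable, but the relative frequencies are on the same [0, 1] scale and can be set side by side.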

Histograms

• We may summarize our data by constructing histograms, which are vertical bar graphs
• A histogram is used to graphically summarize the distribution of a data set
• A histogram divides the range of values in a data set into intervals
• Over each interval is placed a bar whose height represents the frequency of data values in the interval.

Building a Histogram

• To construct a histogram, the data are first grouped into categories
• The histogram contains one vertical bar for each category
• The height of the bar represents the number of observations in the category (i.e., frequency)
• It is common to note the midpoint of the category on the horizontal axis

Building a Histogram – Example

1. Develop an ungrouped frequency table

– That is, we build a table that counts the number of occurrences of each variable value from lowest to highest:

TMI Value | 4.16 | 4.17 | 4.18 | … | 13.71
Ungrouped Freq. | 2 | 4 | 0 | … | 1

• We could attempt to construct a bar chart from this table, but it would have too many bars to really be useful

Building a Histogram – Example

2. Construct a grouped frequency table

– Select an appropriate number of classes

Class | Frequency | Percentage
4.00 - 4.99 | 120 |
5.00 - 5.99 | 807 |
6.00 - 6.99 | 1411 |
7.00 - 7.99 | 407 |
8.00 - 8.99 | 87 |
9.00 - 9.99 | 33 |
10.00 - 10.99 | 17 |
11.00 - 11.99 | 22 |
12.00 - 12.99 | 43 |
13.00 - 13.99 | 19 |
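The Percentage column of the grouped table is blank in this transcript. A minimal sketch of how it could be derived from the class frequencies listed above is shown here; the variable names are hypothetical, and the result is simply each class's relative frequency times 100.

```python
# Class labels and absolute frequencies as listed in the grouped table.
classes = [
    "4.00 - 4.99", "5.00 - 5.99", "6.00 - 6.99", "7.00 - 7.99",
    "8.00 - 8.99", "9.00 - 9.99", "10.00 - 10.99", "11.00 - 11.99",
    "12.00 - 12.99", "13.00 - 13.99",
]
frequencies = [120, 807, 1411, 407, 87, 33, 17, 22, 43, 19]

total = sum(frequencies)
# Percentage = relative frequency x 100.
percentages = [100 * f / total for f in frequencies]

print(f"{'Class':<15}{'Frequency':>10}{'Percentage':>12}")
for label, freq, pct in zip(classes, frequencies, percentages):
    print(f"{label:<15}{freq:>10}{pct:>11.1f}%")
print(f"{'Total':<15}{total:>10}{sum(percentages):>11.1f}%")
```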

Building a Histogram – Example

3. Plot the frequencies of each class

– All that remains is to create the bar graph

[Figure: Pond Branch TMI Histogram. Horizontal axis: Topographic Moisture Index (a proxy for soil moisture)]

Further Moments of the Distribution

• While measures of dispersion are useful for helping us describe the width of the distribution, they tell us nothing about the shape of the distribution

Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.

Further Moments of the Distribution

• There are further statistics that describe the shape of the distribution, using formulae that are similar to those of the mean and variance
• 1st moment - Mean (describes central value)
• 2nd moment - Variance (describes dispersion)
• 3rd moment - Skewness (describes asymmetry)
• 4th moment - Kurtosis (describes peakedness)

Further Moments – Skewness

• Skewness measures the degree of asymmetry exhibited by the data

$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$

• If skewness equals zero, the histogram is symmetric about the mean
• Positive skewness vs negative skewness

Further Moments – Skewness

Source: http://library.thinkquest.org/10030/3smodsas.htm

Further Moments – Skewness

• Positive skewness
– There are more observations below the mean than above it
– When the mean is greater than the median
• Negative skewness
– There are a small number of low observations and a large number of high ones
– When the median is greater than the mean
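The skewness formula and the mean-versus-median rule can be checked together in a short sketch. It assumes that s is the sample standard deviation (n - 1 denominator), which is how the previous slides define it; the three data sets are hypothetical.

```python
from statistics import mean, median, stdev


def skewness(values):
    """Sum of cubed deviations from the mean, divided by n * s**3.

    Assumes s is the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    m = mean(values)
    s = stdev(values)
    return sum((x - m) ** 3 for x in values) / (n * s ** 3)


# Hypothetical data sets chosen to show each case.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10, 15]   # a few large values
symmetric = [1, 2, 3, 4, 5]
left_skewed = [-x for x in right_skewed]       # mirror image

for name, data in [("right", right_skewed),
                   ("symmetric", symmetric),
                   ("left", left_skewed)]:
    print(f"{name:>9}: skewness={skewness(data):+.3f}, "
          f"mean={mean(data):.2f}, median={median(data)}")
```

In the right-skewed set the few large values pull the mean above the median; mirroring the data flips both the sign of the skewness and the ordering of mean and median.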

Further Moments – Kurtosis

• Kurtosis measures how peaked the histogram is

$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$

• The kurtosis of a normal distribution is 0
• Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution

Further Moments – Kurtosis

• Platykurtic – When the kurtosis < 0, the frequencies throughout the curve are closer to being equal (i.e., the curve is more flat and wide)
• Thus, negative kurtosis indicates a relatively flat distribution
• Leptokurtic – When the kurtosis > 0, there are high frequencies in only a small part of the curve (i.e., the curve is more peaked)
• Thus, positive kurtosis indicates a relatively peaked distribution
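A sketch of the kurtosis statistic applied to one flat and one peaked data set. As with skewness, s is assumed to be the sample standard deviation (n - 1 denominator), so for small samples the value for normally distributed data will only be approximately 0. Both data sets are hypothetical.

```python
from statistics import mean, stdev


def excess_kurtosis(values):
    """Sum of fourth-power deviations over n * s**4, minus 3.

    Assumes s is the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    m = mean(values)
    s = stdev(values)
    return sum((x - m) ** 4 for x in values) / (n * s ** 4) - 3


# Hypothetical data: evenly spread values vs. values piled at the center
# with two far-out observations.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]

print(f"flat (platykurtic):   {excess_kurtosis(flat):+.2f}")
print(f"peaked (leptokurtic): {excess_kurtosis(peaked):+.2f}")
```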

Further Moments – Kurtosis

[Figure: platykurtic and leptokurtic curves. Source: http://www.riskglossary.com/link/kurtosis.htm]

• Kurtosis is based on the size of a distribution's tails.
• Negative kurtosis (platykurtic) – distributions with short tails
• Positive kurtosis (leptokurtic) – distributions with relatively long tails

Why Do We Need Kurtosis?

• These two distributions have the same variance, approximately the same skew, but differ markedly in kurtosis.

Source: http://davidmlane.com/hyperstat/A53638.html

How to Graphically Summarize Data?

Histograms

Box plots

Functions of a Histogram

• The function of a histogram is to graphically summarize the distribution of a data set
• The histogram graphically shows the following:
1. Center (i.e., the location) of the data
2. Spread (i.e., the scale) of the data
3. Skewness of the data
4. Kurtosis of the data
5. Presence of outliers
6. Presence of multiple modes in the data.

Functions of a Histogram

• The histogram can be used to answer the following questions:
1. What kind of population distribution do the data come from?
2. Where are the data located?
3. How spread out are the data?
4. Are the data symmetric or skewed?
5. Are there outliers in the data?

Source: http://www.robertluttman.com/vms/Week5/page9.htm

http://office.geog.uvic.ca/geog226/frLab1.html


Box Plots

• We can also use a box plot to graphically summarize a data set
• A box plot represents a graphical summary of what is sometimes called a “five-number summary” of the distribution:
– Minimum
– Maximum
– 25th percentile
– 75th percentile
– Median

[Figure: box plot annotated with max., 75th %-ile, median, 25th %-ile, min., and the Interquartile Range (IQR). Rogerson, p. 8.]

Box Plots

Example – Consider first 9 Commodore prices (in $1,000s)

6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0

• Arrange these in order of magnitude

3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0

• The median is Q2 = 6.7 (there are 4 values on either side)
• Q1 = 5.9 (median of the 4 smallest values)
• Q3 = 10.2 (median of the 4 largest values)
• IQR = Q3 – Q1 = 10.2 – 5.9 = 4.3
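The quartile steps for the Commodore prices can be sketched as follows, using the same "median of each half" rule as the slide (for an odd number of values the middle value is left out of both halves). The function name is hypothetical. The slide rounds Q1, Q3, and the IQR to one decimal place; the code keeps the unrounded values.

```python
from statistics import median


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule.

    For an odd number of values, the middle value is excluded from
    both halves, as in the slide example.
    """
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]
    upper = ranked[half + 1:] if n % 2 else ranked[half:]
    return {
        "min": ranked[0],
        "q1": median(lower),
        "median": median(ranked),
        "q3": median(upper),
        "max": ranked[-1],
    }


prices = [6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0]
summary = five_number_summary(prices)
iqr = summary["q3"] - summary["q1"]

for key, value in summary.items():
    print(f"{key:>6}: {value}")
print(f"   IQR: {iqr:.4f}")
```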


Box Plots

Example: Table 1.1 Commuting data (Rogerson, p. 5)

Ranked commuting times: 5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47

• 25th percentile is represented by observation (30+1)/4 = 7.75
• 75th percentile is represented by observation 3(30+1)/4 = 23.25
• 25th percentile: 11.75 (three quarters of the way from the 7th value, 11, to the 8th value, 12)
• 75th percentile: 26 (the 23rd and 24th values are both 26)
• Interquartile range: 26 – 11.75 = 14.25
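The percentile rule used for the commuting data, position p(n + 1) with linear interpolation between neighboring ranked values, can be sketched as below. The function name is hypothetical, and other software may use different percentile conventions that give slightly different answers.

```python
def percentile_by_position(ranked, p):
    """Percentile at 1-based position p * (n + 1), linearly interpolated.

    `ranked` must already be sorted in ascending order; p is a
    proportion such as 0.25 or 0.75.
    """
    n = len(ranked)
    position = p * (n + 1)
    lower_rank = int(position)          # integer part of the position
    fraction = position - lower_rank    # how far toward the next value
    if lower_rank < 1:
        return ranked[0]
    if lower_rank >= n:
        return ranked[-1]
    lower = ranked[lower_rank - 1]
    upper = ranked[lower_rank]
    return lower + fraction * (upper - lower)


commuting = [5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21,
             21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47]

q1 = percentile_by_position(commuting, 0.25)   # position 7.75
q3 = percentile_by_position(commuting, 0.75)   # position 23.25
print(f"25th percentile: {q1}")
print(f"75th percentile: {q3}")
print(f"Interquartile range: {q3 - q1}")
```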


Other Descriptive Summary Measures

• Descriptive statistics provide an organization and summary of a dataset
• A small number of summary measures replaces the entirety of a dataset
• We’ll briefly talk about other simple descriptive summary measures

Other Descriptive Summary Measures

• You're likely already familiar with some simple descriptive summary measures
– Ratios
– Proportions
– Percentages
– Rates of Change
– Location Quotients

Other Descriptive Summary Measures

• Ratios – Ratio = (# of observations in A) / (# of observations in B), e.g., A - 6 overcast days, B - 24 mostly cloudy days, giving a ratio of 6/24 = 0.25
• Proportions – Relates one part or category of data to the entire set of observations, e.g., a box of marbles that contains 4 yellow, 6 red, 5 blue, and 2 green gives a yellow proportion of 4/17. With category counts $a$ = {yellow, red, blue, green} = {4, 6, 5, 2}:

$\text{proportion}_i = \frac{a_i}{\sum_j a_j}$

Other Descriptive Summary Measures

• Proportions - The sum of all proportions equals 1. They are useful for comparing two sets of data with different sizes and category counts, e.g., a different box of marbles gives a yellow proportion of 2/23; for this to be a reasonable comparison we need to know the totals for both samples
• Percentages - Calculated as proportion x 100, e.g., 2/23 x 100% = 8.696%. Their use should be restricted to larger sample sizes, perhaps 20+ observations
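A short sketch of the marble example: proportions for the first box (4 yellow, 6 red, 5 blue, 2 green) and the yellow percentage for the second box (2 of 23). The function and variable names are hypothetical; only the yellow count and the total are known for the second box, so nothing else is assumed about it.

```python
def proportions(counts):
    """Divide each category count by the total number of observations."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# First box of marbles, as given in the example.
box_one = {"yellow": 4, "red": 6, "blue": 5, "green": 2}
box_one_props = proportions(box_one)

# Second box: only the yellow count and the total are given.
yellow_two, total_two = 2, 23
yellow_two_pct = yellow_two / total_two * 100

for color, prop in box_one_props.items():
    print(f"{color:>6}: {prop:.3f}")
print(f"Sum of proportions: {sum(box_one_props.values()):.3f}")
print(f"Yellow share, box one: {box_one_props['yellow'] * 100:.3f}%")
print(f"Yellow share, box two: {yellow_two_pct:.3f}%")
```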

Other Descriptive Summary Measures

• Location Quotients - An index of relative concentration in space, a comparison of a region's share of something to the total
• Example – Suppose we have a region of 1000 km² which we subdivide into three smaller areas of 200, 300, and 500 km² (labeled A, B, & C)
• The region has an influenza outbreak with 150 cases in A, 100 in B, and 350 in C (a total of 600 flu cases):

 | A | B | C
Proportion of Area | 200/1000 = 0.2 | 300/1000 = 0.3 | 500/1000 = 0.5
Proportion of Cases | 150/600 = 0.25 | 100/600 = 0.17 | 350/600 = 0.58
Location Quotient | 0.25/0.2 = 1.25 | 0.17/0.3 = 0.57 | 0.58/0.5 = 1.17
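The location quotient calculation for the influenza example can be sketched as below; the function name is hypothetical. The code divides unrounded proportions, so the value for B comes out as about 0.56 rather than the 0.57 obtained from the rounded 0.17 in the table.

```python
def location_quotients(cases, areas):
    """Share of cases in each subarea divided by its share of total area."""
    total_cases = sum(cases.values())
    total_area = sum(areas.values())
    return {
        region: (cases[region] / total_cases) / (areas[region] / total_area)
        for region in cases
    }


areas = {"A": 200, "B": 300, "C": 500}    # km^2
cases = {"A": 150, "B": 100, "C": 350}    # flu cases

lq = location_quotients(cases, areas)
for region, value in lq.items():
    status = "over-represented" if value > 1 else "under-represented"
    print(f"{region}: LQ = {value:.2f} ({status})")
```

A quotient above 1 means the subarea holds a larger share of cases than its share of the region's area; below 1 means a smaller share.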

Assignment II

• Due by Thursday (02/09/2006)
• Downloadable from Course website:
– http://www.unc.edu/courses/2006spring/geog/090/001/www/