Global Measures of
Spatial Autocorrelation
Briggs Henan University 2010
Last Time
• The concept of spatial autocorrelation.
– “Near things are more similar than distant things”
• The use of the weights matrix Wij to measure
“nearness”
• The difficulty of measuring “nearness”
– This was a surprise!
This Time
• Measures of Spatial Autocorrelation
– Join Count Statistic
– Moran’s I
– Geary’s C
– Getis-Ord G statistic
Global Measures and Local Measures
• Global Measures
– A single value which applies to the entire data set
• The same pattern or process occurs over the entire geographic area
• An average for the entire area
• Local Measures
– A value calculated for each observation unit
• Different patterns or processes may occur in different
parts of the region
• A unique number for each location
An equivalent local measure can be
calculated for most global measures
Join (or Joins or Joint) Count Statistic
• Polygons only
• binary (1,0) data only
– Polygon has or does not have a
characteristic
– For example, a candidate won or lost an
election
• Based on examining polygons which share a
border
– Do they have the same characteristic or not?
• Border same
on each side
• Border not the same
on each side
• Requires a contiguity matrix for polygons
Join (or Joint or Joins) Count Statistic
• Uses binary (1,0) data
– Shown here as B/W (black/white)
• Measures the number of borders (“joins”) of each type (1,1), (0,0), (1,0 or 0,1) relative to the total number of borders
• For a 6 x 6 matrix, border totals are:
– 60 for Rook Case
– 110 for Queen Case
[Figure: three 6 x 6 black/white grids, captioned as follows]
– Small number of BW joins (only 6 for rook); large proportion of BB and WW joins
– Different numbers of BW, BB and WW joins
– Large number of BW joins; small number of BB and WW joins
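The border totals quoted above can be checked by counting pairs of adjacent cells. The sketch below is illustrative only; the function name `grid_join_totals` is hypothetical and not part of the lecture materials.

```python
def grid_join_totals(n_rows, n_cols):
    """Return (rook, queen) join totals for a regular grid, counting each pair once."""
    # Rook case: pairs of cells that share an edge (horizontal + vertical).
    edge_pairs = n_rows * (n_cols - 1) + n_cols * (n_rows - 1)
    # Queen case adds pairs that touch only at a corner:
    # every 2 x 2 block of cells contributes two diagonal pairs.
    corner_pairs = 2 * (n_rows - 1) * (n_cols - 1)
    return edge_pairs, edge_pairs + corner_pairs


rook, queen = grid_join_totals(6, 6)
print(rook, queen)
```

For the 6 x 6 grid this reproduces the two totals on the slide.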
Join Count: Test Statistic
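Test Statistic given by:
$$Z = \frac{\text{Observed} - \text{Expected}}{\text{SD of Expected}}$$
Expected = random pattern generated by tossing a coin in each cell.
Expected given by:
$$E(J_{BB}) = k\,p_B^2, \qquad E(J_{WW}) = k\,p_W^2, \qquad E(J_{BW}) = 2\,k\,p_B\,p_W$$
Standard Deviation of Expected (standard error) given by:
$$\sigma_{BB} = \sqrt{k\,p_B^2 + 2m\,p_B^3 - (k + 2m)\,p_B^4}$$
$$\sigma_{WW} = \sqrt{k\,p_W^2 + 2m\,p_W^3 - (k + 2m)\,p_W^4}$$
$$\sigma_{BW} = \sqrt{2(k + m)\,p_B\,p_W - 4(k + 2m)\,p_B^2\,p_W^2}$$
Where: k is the total number of joins (neighbors)
pB is the expected proportion Black, if random
pW is the expected proportion White
m is calculated from the number of joins k_i of each polygon i according to:
$$m = \frac{1}{2}\sum_{i=1}^{n} k_i (k_i - 1)$$
Note: the formulae given here are for free (normality) sampling. Those for non-free
(randomization) sampling are substantially more complex. See Lee and Wong 1st ed. p. 151
compared to p. 155. See next slide for explanation.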
A Note on Sampling Assumptions:
applies to most tests for spatial autocorrelation
• Test results depend on the assumption made regarding the type of
sampling:
– Free (or normality) sampling
• Analogous to sampling with replacement
• After a polygon is selected for a sample, it is returned to the population set
• The same polygon can occur more than one time in a sample
– Non-free (or randomization) sampling
• Analogous to sampling without replacement
• After a polygon is selected for a sample, it is not returned to the population set
• The same polygon can occur only one time in a sample
• The formulae used to calculate the test statistic (particularly the
standard error) are different for each
– Generally, the formulae are substantially more complex for non-free
sampling; unfortunately, it is also the more common situation!
– Assuming free sampling requires knowledge about larger trends from
outside the region or access to additional information within the region in
order to estimate parameters.
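To make the difference concrete, here is a small sketch comparing the expected number of BB joins under the two assumptions. The non-free expression used here, k·nB(nB−1)/(n(n−1)), is the usual without-replacement form and is an assumption on my part, since the slides do not print it; the function names are hypothetical.

```python
def expected_bb_free(k, p_b):
    """Free sampling: each cell is Black independently with probability p_b."""
    return k * p_b ** 2


def expected_bb_nonfree(k, n_b, n):
    """Non-free sampling (assumed form): exactly n_b of the n cells are Black."""
    return k * n_b * (n_b - 1) / (n * (n - 1))


# 6 x 6 grid, rook case (k = 60 joins), half of the 36 cells Black.
free = expected_bb_free(60, 0.5)
nonfree = expected_bb_nonfree(60, 18, 36)
print(free, round(nonfree, 4))
```

The two expectations are close but not equal, and the gap narrows as the number of units grows.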
Gore/Bush Presidential Election 2000
Is there evidence of clustering by State?
Use Join Count to answer this question!
          Jbb   Jgg   Jbg   Total
Actual     60    21    28     109

Many BB joins
total number of joins = 109
= sum of neighbors/2 in the sparse contiguity matrix
= number of 1s/2 in the full contiguity matrix for US States
(see slides from SA Concepts lecture)
Sparse Contiguity Matrix for US States (Queens Case) -- obtained from Anselin's web site (see powerpoint for link)

Name                  Fips  Ncount  N1  N2  N3  N4  N5  N6  N7  N8
Alabama                  1       4  28  13  12  47
Arizona                  4       5  35   8  49   6  32
Arkansas                 5       6  22  28  48  47  40  29
California               6       3   4  32  41
Colorado                 8       7  35   4  20  40  31  49  56
Connecticut              9       3  44  36  25
Delaware                10       3  24  42  34
District of Columbia    11       2  51  24
Florida                 12       2  13   1
Georgia                 13       5  12  45  37   1  47
Idaho                   16       6  32  41  56  49  30  53
Illinois                17       5  29  21  18  55  19
Indiana                 18       4  26  21  17  39
Iowa                    19       6  29  31  17  55  27  46
Kansas                  20       4  40  29  31   8
Kentucky                21       7  47  29  18  39  54  51  17
Louisiana               22       3  28  48   5
Maine                   23       1  33
Maryland                24       5  51  10  54  42  11
Massachusetts           25       5  44   9  36  50  33
Michigan                26       3  18  39  55
Minnesota               27       4  19  55  46  38
Mississippi             28       4  22   5   1  47
Missouri                29       8   5  40  17  21  47  20  19  31
Montana                 30       4  16  56  38  46
Nebraska                31       6  29  20   8  19  56  46
Nevada                  32       5   6   4  49  16  41
New Hampshire           33       3  25  23  50
New Jersey              34       3  10  36  42
New Mexico              35       5  48  40   8   4  49
New York                36       5  34   9  42  50  25
North Carolina          37       4  45  13  47  51
North Dakota            38       3  46  27  30
Ohio                    39       5  26  21  54  42  18
Oklahoma                40       6   5  35  48  29  20   8
Oregon                  41       4   6  32  16  53
Pennsylvania            42       6  24  54  10  39  36  34
Rhode Island            44       2  25   9
South Carolina          45       2  13  37
South Dakota            46       6  56  27  19  31  38  30
Tennessee               47       8   5  28   1  37  13  51  21  29
Texas                   48       4  22   5  35  40
Utah                    49       6   4   8  35  56  32  16
Vermont                 50       3  36  25  33
Virginia                51       6  47  37  24  54  11  21
Washington              53       2  41  16
West Virginia           54       5  51  21  24  39  42
Wisconsin               55       4  26  17  19  27
Wyoming                 56       6  49  16  31   8  46  30

• Ncount is the number of neighbors for each state
• Equals number of 1s in a row of full contiguity matrix
• Sum of Ncount is 218
• Number of common borders (joins) = Σ ncount / 2 = 109
• N1, N2… FIPS codes for neighbors
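As an illustration of how the sparse form relates to the full contiguity matrix, the sketch below expands the rows for the six New England states (neighbor lists copied from the table above) into a 0/1 matrix and counts joins as number of 1s / 2. Restricting to this subset, and the helper name `full_contiguity`, are my own choices for the example.

```python
import numpy as np

# Neighbor lists (FIPS codes) copied from the table for six New England states.
sparse = {
    23: [33],                 # Maine
    33: [25, 23, 50],         # New Hampshire
    50: [36, 25, 33],         # Vermont
    25: [44, 9, 36, 50, 33],  # Massachusetts
    44: [25, 9],              # Rhode Island
    9: [44, 36, 25],          # Connecticut
}


def full_contiguity(sparse):
    """Expand a sparse neighbor list into a full 0/1 contiguity matrix."""
    ids = sorted(sparse)
    index = {fips: pos for pos, fips in enumerate(ids)}
    w = np.zeros((len(ids), len(ids)), dtype=int)
    for fips, neighbors in sparse.items():
        for nb in neighbors:
            if nb in index:  # neighbors outside the subset (New York, 36) are dropped
                w[index[fips], index[nb]] = 1
    return ids, w


ids, w = full_contiguity(sparse)
n_joins = int(w.sum()) // 2  # each join appears twice in the full matrix
print(ids, n_joins)
```

Applied to all 49 rows, the same count gives 218 / 2 = 109 joins, as stated above.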
Join Count Statistic for Gore/Bush 2000 by State
% of votes in election: Bush % (Pb) 0.49885, Gore % (Pg) 0.50115

         Actual  Expected  Stan Dev  Z-score
Jbb          60    27.125     8.667   3.7930
Jgg          21    27.375     8.704  -0.7325
Jbg          28    54.500     5.220  -5.0763
Total       109   109.000
• The expected number of joins is calculated based on the proportion of votes each
received in the election (for Bush/Bush = 109 × 0.49885 × 0.49885 = 27.125)
• K = 109 = total number of joins
• There are far more Bush/Bush joins (actual = 60) than would be expected (27)
– Since the test score (3.79) is greater than the critical value (2.58 at 1%), the result is statistically
significant at the 99% confidence level (p <= 0.01)
– Strong evidence of spatial autocorrelation (clustering)
• There are far fewer Bush/Gore joins (actual = 28) than would be expected (54)
– Since the absolute value of the test score (-5.08) is greater than the critical value (2.58 at 1%), the result is statistically
significant at the 99% confidence level (p <= 0.01)
– Again, strong evidence of spatial autocorrelation (clustering)
– Actual calculations available in spatstat.xls spreadsheet (JC-%vote tab)
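The table values can be reproduced with a few lines of code. This is a sketch under the free (normality) sampling assumption, using the standard free-sampling expressions for the expected counts and standard errors; it is not the spreadsheet's own implementation. `NCOUNT` is the Ncount column of the sparse contiguity matrix in table order, and the function name `join_count_free` is hypothetical.

```python
import math

# Ncount column of the sparse contiguity matrix (49 units, in table order).
NCOUNT = [4, 5, 6, 3, 7, 3, 3, 2, 2, 5, 6, 5, 4, 6, 4, 7, 3, 1, 5, 5,
          3, 4, 4, 8, 4, 6, 5, 3, 3, 5, 5, 4, 3, 5, 6, 4, 6, 2, 2, 6,
          8, 4, 6, 3, 6, 2, 5, 4, 6]


def join_count_free(ncount, p_b, observed):
    """Expected joins, standard error and Z for BB, WW and BW (free sampling)."""
    k = sum(ncount) / 2                               # total number of joins
    m = 0.5 * sum(ki * (ki - 1) for ki in ncount)     # from joins per polygon
    p_w = 1.0 - p_b
    expected = {"BB": k * p_b ** 2, "WW": k * p_w ** 2, "BW": 2 * k * p_b * p_w}
    sd = {
        "BB": math.sqrt(k * p_b ** 2 + 2 * m * p_b ** 3 - (k + 2 * m) * p_b ** 4),
        "WW": math.sqrt(k * p_w ** 2 + 2 * m * p_w ** 3 - (k + 2 * m) * p_w ** 4),
        "BW": math.sqrt(2 * (k + m) * p_b * p_w
                        - 4 * (k + 2 * m) * p_b ** 2 * p_w ** 2),
    }
    return {name: (expected[name], sd[name],
                   (observed[name] - expected[name]) / sd[name])
            for name in expected}


# B = Bush, W = Gore; observed join counts from the table above.
results = join_count_free(NCOUNT, 0.49885, {"BB": 60, "WW": 21, "BW": 28})
for name, (exp, sd, z) in results.items():
    print(name, round(exp, 3), round(sd, 3), round(z, 4))
```

Only the Bush/Bush and Bush/Gore joins give Z scores beyond the 1% critical value; the Gore/Gore count is close to its expectation.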
Moran’s I
• The most common measure of Spatial Autocorrelation
• Use for points or polygons
– Join Count statistic only for polygons
• Use for a continuous variable (any value)
– Join Count statistic only for binary variable (1,0)
• Varies on a scale between –1 through 0* to +1
– -1: high negative spatial autocorrelation (uniform/dispersed)
– 0*: no spatial autocorrelation (random)
– +1: high positive spatial autocorrelation (clustered)
– *technically the value for no spatial autocorrelation is –1/(n-1)
Can also use it as an index for dispersion/random/cluster patterns.
[Figure: Dispersed Pattern, Random Pattern, Clustered Pattern]
Moran’s I and Correlation Coefficient r
Differences and Similarities
Correlation Coefficient r
• Relationship between two variables
– e.g. Income vs. Education: r = 0.71
– e.g. Quantity vs. Price: r = -0.71
Moran’s I
– Involves one variable only
– Correlation between variable, X, and the “spatial lag” of X formed
by averaging all the values of X for the neighboring polygons
– e.g. Crime Rate vs. crime in nearby area: r = 0.71
– e.g. Grocery Store Density vs. grocery store density nearby: r = -0.71
Formula for Moran’s I
$$I = \frac{N \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• Where:
N is the number of observations (points or polygons)
x̄ is the mean of the variable
Xi is the variable value at a particular location
Xj is the variable value at another location
Wij is a weight indexing location of i relative to j
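A minimal sketch of the formula in code, assuming the weights are supplied as a full N x N matrix. The function name `morans_i` and the four-unit "chain" example (each unit is a neighbor of the next one only) are illustrative and not from the lecture.

```python
import numpy as np


def morans_i(x, w):
    """Moran's I for values x (length N) and an N x N weights matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                                  # deviations from the mean
    numerator = len(x) * (w * np.outer(z, z)).sum()   # N * sum_ij w_ij z_i z_j
    denominator = w.sum() * (z ** 2).sum()            # (sum_ij w_ij) * sum_i z_i^2
    return numerator / denominator


# Four units in a row; unit i touches unit i + 1 (binary contiguity).
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])

i_trend = morans_i([1, 2, 3, 4], chain)      # similar values are adjacent
i_alternating = morans_i([1, 0, 1, 0], chain)  # unlike values are adjacent
print(i_trend, i_alternating)
```

The smooth trend gives a positive I and the alternating pattern gives a negative I, matching the scale described earlier.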
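Correlation Coefficient
$$r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) / n}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 / n}\ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}}$$
Note the similarity of the numerator (top) to the measures of spatial association discussed
earlier if we view Yi as being the Xi for the neighboring polygon
(see next slide)
Spatial auto-correlation
$$I = \frac{N \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x}) \Big/ \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}\ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}}$$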
Correlation Coefficient
$$r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) / n}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 / n}\ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}}$$
Yi is the Xi for the neighboring polygon
Moran’s I
$$I = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x}) \Big/ \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}\ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n}}$$
Spatial weights: the sum of the weights w_ij takes the place of n in the numerator
Adjustment for Short or Zero Distances
• If an inverse distance measure is used,
and distances are very short, then wij
becomes very large and distorts I.
• An adjustment for short distances can
be used, usually scaling the distance to
one mile.
• The units in the adjustment formula
are the number of data measurement
units in a mile
• In the example, the data is assumed to
be in feet.
• With this adjustment, the weights will
never exceed 1
• If a contiguity matrix is used (1 or 0
only), this adjustment is unnecessary
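The adjustment formula itself did not survive in this transcript. As a purely illustrative sketch, one form that has the properties listed above (scaled to one mile, data in feet, weights never above 1) is w = 1 / (1 + d / 5280). This specific expression and the function names are assumptions, not necessarily the formula shown on the slide.

```python
FEET_PER_MILE = 5280.0


def inverse_distance_weight(d_feet):
    """Plain inverse distance weight; grows without limit as d approaches 0."""
    return 1.0 / d_feet


def adjusted_weight(d_feet, units_per_mile=FEET_PER_MILE):
    """ASSUMED adjustment: one mile / (one mile + d), so the weight is at most 1."""
    return units_per_mile / (units_per_mile + d_feet)


for d in (0.5, 100.0, 5280.0, 52800.0):
    print(d, inverse_distance_weight(d), round(adjusted_weight(d), 4))
```

With the plain form a distance of half a foot gives a weight of 2, while the adjusted form stays just under 1.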
Statistical Significance Tests for Moran’s I
• Based on the normal frequency distribution with
$$Z = \frac{I - E(I)}{S_{error}(I)}$$
Where: I is the calculated value for Moran’s I from the sample
E(I) is the expected value if random
E(I) = -1/(n-1)
S is the standard error
• Again, there are two different formulae for calculating the standard
error
– The free sampling or normality method
– The nonfree sampling or randomization method
• These formulae are complicated!
– They are in Lee and Wong 1st Ed. p. 82 and 160-1
• In either case, the statistical test is carried out in the same way
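As a rough sketch of the normality (free sampling) version of the test: the variance expression below is the commonly published normality form, written with the usual S0, S1, S2 sums of the weights. I am assuming it corresponds to the formula on the pages cited above; the function name is hypothetical.

```python
import numpy as np


def morans_i_z_normality(x, w):
    """Return (I, E(I), Z) for Moran's I under the normality assumption."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(x)
    z = x - x.mean()
    s0 = w.sum()                                          # sum of all weights
    i_obs = n * (w * np.outer(z, z)).sum() / (s0 * (z ** 2).sum())
    e_i = -1.0 / (n - 1)                                  # expected value if random
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=1) + w.sum(axis=0)) ** 2).sum()
    # Assumed normality-form variance of I.
    var_i = (n ** 2 * s1 - n * s2 + 3 * s0 ** 2) / ((n ** 2 - 1) * s0 ** 2) - e_i ** 2
    return i_obs, e_i, (i_obs - e_i) / np.sqrt(var_i)


# Four units in a row, binary contiguity, values rising steadily.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
i_obs, e_i, z_score = morans_i_z_normality([1, 2, 3, 4], chain)
print(i_obs, e_i, z_score)
```

With only four units the Z score stays below 1.96 even though I is clearly positive, which shows why the test matters as well as the index.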
Test Statistic for Normal Frequency Distribution
[Figure: normal curve centered on 0 (*technically –1/(n-1)), with 2.5% rejection regions below Z = -1.96 and above Z = 1.96 (reject null at 5%), and Z = 2.58 marking rejection of the null at 1%]
Null Hypothesis: no spatial autocorrelation
*Moran’s I = 0
Alternative Hypothesis: spatial autocorrelation exists
*Moran’s I is not 0 (I > 0 for clustering, I < 0 for dispersion)
Reject Null Hypothesis if Z test statistic > 1.96 (or < -1.96)
---if there were no spatial autocorrelation in the population, there would be
less than a 5% chance of a Z value this extreme
---95% confident that spatial autocorrelation exists
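The decision rule can be written out as a small sketch using only the standard library. The function names are hypothetical; the two-sided p-value is computed from the standard normal distribution.

```python
import math


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z (both tails)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def reject_null(z, alpha=0.05):
    """True if the null hypothesis of no spatial autocorrelation is rejected."""
    return two_sided_p_value(z) < alpha


for z in (1.96, 3.793, -0.7325):
    print(z, round(two_sided_p_value(z), 5), reject_null(z))
```

A Z score of exactly 1.96 sits right at the 5% boundary; values further out in either tail give smaller p-values.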
Moran Scatter Plots
Moran’s I can be interpreted as the correlation between variable, X,
and the “spatial lag” of X formed by averaging all the values of
X for the neighboring polygons
We can then draw a scatter diagram between these two variables (in
standardized form): X and lag-X (or W_X)
[Figure: a polygon with value Xi and its neighbors; Lag Xi is the average of the neighbors’ values]
Least squares “best fit” line to the points.
The slope of this regression line is Moran’s I (when the weights are row-standardized, so that the lag is an average)
(will discuss Regression later)
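A short sketch of the scatter plot construction, assuming a binary contiguity matrix that is row-standardized so the lag is the average of the neighbors. The function name `moran_scatter` and the four-unit example are illustrative only.

```python
import numpy as np


def moran_scatter(x, w):
    """Return standardized X, its spatial lag, and the slope of lag on X."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w_row = w / w.sum(axis=1, keepdims=True)   # row-standardize (no isolated units)
    z = (x - x.mean()) / x.std()               # X in standardized form
    lag = w_row @ z                            # average of the neighbors' values
    slope, intercept = np.polyfit(z, lag, 1)   # least squares best-fit line
    return z, lag, slope


chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
z, lag, slope = moran_scatter([1, 2, 3, 4], chain)

# Moran's I computed directly with the row-standardized weights (sum of weights = n).
i_row_standardized = (z @ lag) / (z @ z)
print(slope, i_row_standardized)
```

The fitted slope and the directly computed index agree, which is the point of the scatter plot interpretation.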
Moran Scatterplot: example
• Scatterplot of X vs. Lag-X
• The slope of the regression line is Moran’s I
• Moran’s I = 0.49
[Figure: Moran scatterplot of population density in Puerto Rico, with X on the horizontal axis and Lag-X on the vertical axis; points labeled “high surrounded by high” and “low surrounded by low”]
Moran’s I for rate-based data
• Moran’s I is often calculated for rates, such as crime
rates (e.g. number of crimes per 1,000 population) or
infant mortality rates (e.g. number of deaths per 1,000
births)
• An adjustment should be made, especially if the
denominator in the rate (population or number of births)
varies greatly (as it usually does)
• Adjustment is known as the EB adjustment:
– see Assuncao-Reis, Empirical Bayes Standardization,
Statistics in Medicine, 1999
• GeoDA software includes an option for this adjustment
Geary’s C (Contiguity) Ratio
• Calculation is similar to Moran’s I,
– For Moran, the cross-product is based on the deviations from the mean
for the two location values
– For Geary, the cross-product uses the difference between the actual values at the
two locations
$$I = \frac{N \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$C = \frac{(N - 1) \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - x_j)^2}{2 \left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• Interpretation is very different, essentially the opposite!
Geary’s C varies on a scale from 0 to 2
– 0 indicates perfect positive autocorrelation/clustered
– 1 indicates no autocorrelation/random
– 2 indicates perfect negative autocorrelation/dispersed
• Can convert to a -/+1 scale by calculating C* = 1 - C
• Moran’s I usually used!
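A minimal sketch of Geary’s C in code, using the usual (N − 1) scaling in the numerator so that the value under no autocorrelation is 1. The function name `gearys_c` and the four-unit example are illustrative.

```python
import numpy as np


def gearys_c(x, w):
    """Geary's C for values x (length N) and an N x N weights matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(x)
    diff_sq = (x[:, None] - x[None, :]) ** 2          # (x_i - x_j)^2 for every pair
    numerator = (n - 1) * (w * diff_sq).sum()
    denominator = 2.0 * w.sum() * ((x - x.mean()) ** 2).sum()
    return numerator / denominator


chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
c_trend = gearys_c([1, 2, 3, 4], chain)        # neighbors are similar
c_alternating = gearys_c([1, 0, 1, 0], chain)  # neighbors are dissimilar
print(c_trend, c_alternating)
```

The smooth trend gives C below 1 and the alternating pattern gives C above 1, the reverse of the direction of Moran’s I.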
Statistical Significance Tests for Geary’s C
• Similar to Moran
• Again, based on the normal frequency distribution with
$$Z = \frac{C - E(C)}{S_{error}(C)}$$
Where: C is the calculated value for Geary’s C from the sample
E(C) is the expected value if no autocorrelation
S is the standard error
however, E(C) = 1
• Again, there are two different formulations for the standard
error calculation
– The randomization or nonfree sampling method
– The normality or free sampling method
• The actual formulae for calculation are in Lee and Wong, 1st
Ed. p. 81 and p. 162
Hot Spots and Cold Spots
• What is a hot spot?
– A place where high values cluster together (e.g. high crime area)
• What is a cold spot?
– A place where low values cluster together (e.g. low crime area)
• Moran’s I and Geary’s C cannot distinguish them
• They only indicate clustering
• Cannot tell if these are hot spots, cold spots, or both
Getis-Ord General/Global G-Statistic
• The G statistic distinguishes between hot spots and cold spots. It
identifies spatial concentrations.
– G is relatively large if high values cluster together
– G is relatively low if low values cluster together
• The General G statistic is interpreted relative to its expected value
– The value for which there is no spatial association
– G > (larger than) expected value: potential “hot spots”
– G < (smaller than) expected value: potential “cold spots”
• A Z test statistic is used to test if the difference is statistically
significant
• Calculation of G is based on a neighborhood distance within which
clustering is expected to occur
Getis, A. and Ord, J.K. (1992) The analysis of spatial association by use of
distance statistics. Geographical Analysis, 24(3), 189-206
Calculating General G
• Begins by identifying a distance band, d, within which clustering occurs
• Actual Value for G is given by:
$$G(d) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(d)\, x_i x_j}{\sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j}, \qquad j \neq i$$
Where:
d is neighborhood distance
Wij weights matrix has only 1 or 0
1 if j is within d distance of i
0 if it is beyond that distance
Thus any point beyond distance d has a value of zero and therefore is excluded
• the terms in the numerator (top) are calculated “within a distance ring (d),”
and are then divided by totals for the entire region to create a proportion
– if nearby x values are both large (indicating “hot” spot), the numerator
(top) will be large
– If they are both small (indicating “cold” spot), the numerator (top) will be
small
• Expected value for G (if no concentration) is given by:
$$E(G) = \frac{W}{n(n-1)}, \qquad \text{where } W = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(d)$$
W is the number of points within distance band d (summed over all locations)
n is the total number of points in the study region
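A small sketch of General G with a binary distance band, following the Getis and Ord (1992) definition. The function name `general_g`, the coordinates and the values are made up for illustration; the standard error needed for the Z test is not computed here.

```python
import numpy as np


def general_g(coords, x, d):
    """Return (G, E(G)) using a binary distance-band weights matrix."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Pairwise straight-line (Euclidean) distances between all points.
    dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))
    w = (dist <= d).astype(float)   # 1 if j is within distance d of i, else 0
    np.fill_diagonal(w, 0.0)        # G excludes the point itself (j != i)
    cross = np.outer(x, x)          # x_i * x_j for every pair
    np.fill_diagonal(cross, 0.0)
    g = (w * cross).sum() / cross.sum()
    e_g = w.sum() / (n * (n - 1))   # W / (n (n - 1))
    return g, e_g


# Four points one unit apart along a line; the high values sit next to each other.
coords = [[0, 0], [1, 0], [2, 0], [3, 0]]
g, e_g = general_g(coords, [1, 2, 3, 4], d=1.0)
print(round(g, 4), e_g)
```

Here G is larger than its expected value because the large values are neighbors, the pattern the slides call a potential “hot spot”.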
Comments on General G
• General G will not show negative spatial autocorrelation
• Should only be calculated for ratio scale data
– data with a “natural” zero such as crime rates, birth rates
• Although it was defined using a contiguity (0,1) weights
matrix, any type of spatial weights matrix can be used
– ArcGIS gives multiple options
• There are two global versions: G and G*
– G does not include the value of Xi itself, only “neighborhood”
values
– G* includes Xi as well as “neighborhood” values
Testing General G
• The test statistic for G is normally distributed and is given by:
$$Z = \frac{G - E(G)}{S_{error}(G)} \qquad \text{with} \qquad E(G) = \frac{W}{n(n-1)}$$
Calculation of the standard error is complex.
See Lee and Wong 1st ed. pp. 164-167
or Getis and Ord 1992 for formulae.
• The next slide shows the results for running General G on
Anselin’s Columbus crime data
– This data set is not ideal, but it is very commonly used since Anselin uses it in his
original LISA article and in the examples in the GeoDA documentation
– The geographic coordinates are completely arbitrary
General/Global G in ArcGIS
[Figure: ArcGIS tool dialog, with these callouts]
• Shapefile containing polygon or point data
• Variable to analyze
• Different options available for specifying cluster neighborhood
– simple distance band selected, as described in lecture
• Options for measuring distance
– straight line (Euclidean)
– city block
• Size of neighborhood distance band
General/Global G in ArcGIS: results
Observed G = .777
Expected G = .637
Observed > Expected >> “Hot spots”
Z score: 5.067 > 1.96 >> significant
But where are the hot spots?
For this we use Local Statistics
What have we learned today?
• Difference between global and local measures of spatial
autocorrelation
• How to calculate and interpret some global measures
– Join Count Statistic
• Used for binary (0,1) data only
– Moran’s I
• The most common global measure of spatial autocorrelation
– Geary’s C
• interpretation almost opposite of Moran’s I, but not used very often
– Getis-Ord G statistic
• Identifies hot spots or cold spots
Next Time:
local measures of spatial autocorrelation
Challenge for You
• Calculate Moran’s I and/or General G for
some appropriate variables in the China
provinces data set
• Use ArcGIS or GeoDA software
References
• O’Sullivan and Unwin, Geographic Information Analysis. New
York: John Wiley, 1st ed. 2003, 2nd ed. 2010
• Jay Lee and David Wong, Statistical Analysis with ArcView GIS.
New York: Wiley, 1st ed. 2001 (all page references are to this
book), 2nd ed. 2005
– Unfortunately, these books are based on old software (Avenue scripts
used with ArcView 3.x), and the scripts no longer work in the current versions of
ArcGIS (9 or 10).
• Ned Levine and Associates, CrimeStat III. Washington: National
Institute of Justice, 2010
– Available as pdf
– download from:
http://www.icpsr.umich.edu/NACJD/crimestat.html