Transcript Spatial Statistics Lecture
University at Albany School of Public Health EPI 621, Geographic Information Systems and Public Health
Introduction to Smoothing and Spatial Regression
Glen Johnson, PhD Lehman College / CUNY School of Public Health [email protected]
Consider points distributed in space
“Pure” Point process: Only coordinates locating some “events”.
Set of points, $S = \{s_1, s_2, \ldots, s_n\}$
Examples include
• location of burglaries
• location of disease cases
• location of trees, etc.
Points represent locations of something that is measured.
Values of a random variable, $Z$, are observed for a set $S$ of locations, such that the set of measurements is $Z(s) = \{Z(s_1), Z(s_2), \ldots, Z(s_n)\}$
Examples include
• cases and controls (binary outcome) identified by location of residence
• population-based count (integer outcome) tied to geographic centroids
• PCBs measured in mg/kg (continuous outcome) in soil cores taken at specific point locations
Example of a Pure Point Process: Baltimore Crime Events
Question: How to interpolate a smoothed surface that shows varying “intensity” of the points?
(source: http://www.people.fas.harvard.edu/~zhukov/spatial.html)
Kernel Density Estimation
From: Cromley and McLafferty. 2002. GIS and Public Health.
Kernel Density Estimation
Estimate “intensity” of events at regular grid points as a function of nearby observed events. General formula for any point $x$ is:
$$\frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right)$$
where $x_i$ are “observed” points for $i = 1, \ldots, n$ locations in the study area, and $k(\cdot)$ is a kernel function that assigns decreasing weight to observed points as their distance from $x$ approaches the bandwidth $h$. Points that lie beyond the bandwidth, $h$, are given zero weighting.
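As a rough illustration of the kernel formula, here is a minimal Python sketch. It follows the one-dimensional form as written on the slide; the Epanechnikov kernel is an assumption (the lecture does not name a kernel), chosen only because it gives zero weight beyond the bandwidth. The names `epanechnikov` and `kde` are hypothetical.

```python
def epanechnikov(u):
    """Assumed kernel: weight decreases with |u| and is zero for |u| > 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, observed, h, kernel=epanechnikov):
    """Kernel estimate at x: (1 / (n h)) * sum of k((x - x_i) / h)."""
    n = len(observed)
    return sum(kernel((x - xi) / h) for xi in observed) / (n * h)


# Evaluate the estimate on a small regular grid (made-up event locations).
events = [0.2, 0.25, 0.9]
grid = [i / 10 for i in range(11)]
surface = [kde(g, events, h=0.3) for g in grid]
```

A larger `h` spreads each event's contribution over more grid points and gives a smoother surface; a smaller `h` gives a spikier one.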
Results from Kernel Density Smoothing in R
Baltimore Crime Locations (Kernel Density)
[Figure: four map panels, Bandwidth = 0.007, 0.05, 0.1 and 0.15]
Kernel Density Surface of Bike Share Locations in NYC
Source: http://spatialityblog.com/2011/09/29/spatial-analysis-of-nyc-bikeshare-maps/
Examples of Values Observed at Point Locations, $Z(s)$:
Question: How to interpolate a smoothed surface that captures variation in $Z(s)$?
First, consider “deterministic” approaches to spatial interpolation:
• Deterministic models do not acknowledge uncertainty.
• Only real advantage is simplicity; good for exploratory analysis
• Several options, all with limitations. We will consider Inverse Distance Weighted (IDW) because of its common usage.
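Inverse Distance Weighted Surface Interpolation
$$\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i)$$
where the weight
$$\lambda_i = \frac{d_{0,i}^{-p}}{\sum_{j=1}^{n} d_{0,j}^{-p}}$$
and $d_{0,i}$ is the distance between the prediction location $s_0$ and the observed location $s_i$.
• Define search parameters
• Define power of distance-decay function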
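To make the IDW weights concrete, here is a minimal Python sketch, assuming Euclidean distance and using every observed point as a neighbor (no search window). The function name `idw` and the default power `p = 2` are illustrative choices, not something specified in the lecture.

```python
import math


def idw(s0, sites, values, p=2.0):
    """Inverse distance weighted prediction at s0 from (site, value) pairs."""
    dists = [math.dist(s0, s) for s in sites]
    # If s0 coincides with an observed site, return that observation
    # directly (the weight would otherwise be a division by zero).
    for d, z in zip(dists, values):
        if d == 0.0:
            return z
    raw = [d ** -p for d in dists]          # d^(-p) for each site
    total = sum(raw)                        # normalizing constant
    return sum(w / total * z for w, z in zip(raw, values))


# Made-up example: two sites, prediction halfway between them.
prediction = idw((0.5, 0.0), [(0.0, 0.0), (1.0, 0.0)], [0.0, 10.0])
```

Raising `p` makes the nearest observations dominate the prediction; lowering it toward zero pulls the prediction toward the plain average.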
Illustration: Tampa Bay sediment total organic carbon
True “geostatistical” models assume the data, $Z(S) = \{Z(s_1), Z(s_2), \ldots, Z(s_n)\}$, are a partial realization of a random field.
Note that the set of locations $S$ is a subset of some 2-dimensional spatial domain $D$, which is itself a subset of the real plane.
General Protocol:
1. Characterize properties of spatial autocorrelation through variogram modeling;
2. Predict values for spatial locations where no data exist, through Kriging.
A semivariogram is defined as
$$\gamma(h) = \frac{1}{2} E\left[(Z(s) - Z(s + h))^2\right]$$
for distance $h$ between the two locations, and is estimated as
$$\hat{\gamma}(h_j) = \frac{1}{2 n_h} \sum_{i=1}^{n_h} (Z(s_i) - Z(s_i + h_j))^2$$
for $n_h$ pairs separated by distance $h_j$ (called a “lag”). After repeating for different lags, say $j = 1, \ldots, 10$, the semivariance can be plotted as a function of distance.
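The estimator can be sketched in a few lines of Python. This is a minimal omnidirectional version under stated assumptions: a pair counts toward lag $h_j$ when its separation distance is within a tolerance `tol` of that lag, and direction is ignored. The function name, the `tol` parameter and the toy data are all hypothetical.

```python
import math
from itertools import combinations


def empirical_semivariogram(sites, values, lags, tol):
    """Average half squared difference of all pairs falling in each lag bin."""
    sums = [0.0] * len(lags)
    counts = [0] * len(lags)
    for (si, zi), (sj, zj) in combinations(zip(sites, values), 2):
        d = math.dist(si, sj)
        for j, h in enumerate(lags):
            if abs(d - h) <= tol:           # pair belongs to lag h_j
                sums[j] += (zi - zj) ** 2
                counts[j] += 1
                break
    # Semivariance per lag; None where no pairs were found.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Made-up transect of three measurements.
gammas = empirical_semivariogram(
    [(0, 0), (1, 0), (2, 0)], [1.0, 2.0, 4.0], lags=[1.0, 2.0], tol=0.25
)
```

Plotting `gammas` against `lags` gives the points to which a semivariogram model is then fitted.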
Given any location $s_i$, all other locations are treated as within distance $h$ if they fall within a search window defined by the direction, lag $h$, angular tolerance and bandwidth.
Adapted from Waller and Gotway. Applied Spatial Statistics for Public Health. Wiley, 2004.
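One way to express such a search window in code is sketched below in Python. The conventions are assumptions, not taken from the lecture or from Waller and Gotway: the direction is an angle in degrees counter-clockwise from the x-axis, it is treated as an axis (a pair counts in either sense), a lag tolerance `lag_tol` is added, and the bandwidth caps the perpendicular offset from the direction line. The function name is hypothetical.

```python
import math


def in_search_window(si, sj, direction_deg, lag, lag_tol, angle_tol_deg, bandwidth):
    """True if sj lies in the directional search window anchored at si."""
    dx, dy = sj[0] - si[0], sj[1] - si[1]
    d = math.hypot(dx, dy)
    if d == 0 or abs(d - lag) > lag_tol:    # wrong separation distance
        return False
    theta = math.radians(direction_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    along = dx * ux + dy * uy               # component along the direction
    across = abs(dx * uy - dy * ux)         # perpendicular offset
    deviation = math.degrees(math.atan2(across, abs(along)))
    return deviation <= angle_tol_deg and across <= bandwidth


# Made-up check: a point 10 units east and 1 unit north, searching due east.
ok = in_search_window((0, 0), (10, 1), 0.0, 10.0, 1.0, 22.5, 2.0)
```

Restricting pairs this way is what allows separate semivariograms to be estimated for different directions.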
Example semivariogram cloud for pairwise differences (red dots), with the average semivariance for each lag (blue +), and a fitted semivariogram model (solid blue line)
Characteristics of a semivariogram
• Range = the distance within which positive spatial autocorrelation exists
• Nugget = spatial discontinuity + observation error
• Sill = maximum semivariance
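To see how range, nugget and sill shape a fitted curve, here is a minimal Python sketch of one common model form. The spherical model is an assumption (the lecture does not say which model was fitted), and the parameter values are made up. Here `sill` means the maximum semivariance, nugget included.

```python
def spherical(h, nugget, sill, rng):
    """Spherical semivariogram model evaluated at distance h."""
    if h == 0:
        return 0.0                          # gamma(0) = 0 by convention
    if h >= rng:
        return sill                         # flat at the sill beyond the range
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)


# Made-up parameters: nugget 1, sill 5, range 10.
curve = [spherical(h, nugget=1.0, sill=5.0, rng=10.0) for h in range(0, 13)]
```

The jump from 0 to roughly the nugget at very small distances is the discontinuity, and the curve levels off at the sill once the distance reaches the range.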
If the variogram form does not depend on direction, the spatial process is isotropic. If it does depend on direction, it is anisotropic.
Multiple semivariograms for different directions. Note that the changing parameter is the range.
Surface map of semivariance shows values more similar in NW-SE direction and more different in SW-NE direction.
Kriging then uses semivariogram model results to define weights used for interpolating values where no data exist.
The result is called the “Best Linear Unbiased Predictor”. The basic form is
$$\hat{Z}(s_0) = \sum_{i=1}^{p} \lambda_i Z(s_i)$$
where the $\lambda_i$ assign weights to neighboring values according to semivariogram modeling that defines a distance-decay relation within the range, beyond which the weight goes to zero.
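As a rough sketch of how the weights can be obtained, the Python below solves the ordinary kriging system for one prediction location. Several things are assumptions: the spherical semivariogram and its parameters, the use of all sites as neighbors, and the helper names `solve` and `ordinary_kriging`. The unbiasedness constraint (weights sum to 1) is enforced with a Lagrange multiplier row.

```python
import math


def spherical(h, nugget, sill, rng):
    """Assumed semivariogram model; gamma(0) = 0 by convention."""
    if h == 0:
        return 0.0
    if h >= rng:
        return sill
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ordinary_kriging(s0, sites, values, gamma):
    """Return (prediction, weights) at s0 for semivariogram function gamma."""
    n = len(sites)
    # Semivariances between all pairs of sites, bordered by the constraint.
    a = [[gamma(math.dist(sites[i], sites[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(math.dist(s0, s)) for s in sites] + [1.0]
    weights = solve(a, b)[:n]               # last entry is the multiplier
    return sum(w * z for w, z in zip(weights, values)), weights


def gamma(h):
    return spherical(h, nugget=0.0, sill=1.0, rng=2.0)   # made-up parameters


corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
pred, lam = ordinary_kriging((0.5, 0.5), corners, [1.0, 2.0, 3.0, 4.0], gamma)
```

In this made-up example the prediction location is equidistant from all four sites, so the weights come out equal and the prediction is the plain average.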
Several variations of Kriging:
• Simple (assumes known mean)
• Ordinary (assumes constant mean, though unknown) [our focus this week]
• Universal (non-stationary mean)
• Cokriging (prediction based on more than one inter-related spatial process)
• Indicator (probability mapping based on binary variable) [you will see in the lab work]
• Block (areal prediction from point data)
• And other variations …
Example of two types of Kriging for California O3:
1. Ordinary Kriging (Detrended, Anisotropic) - continuous surface
2. Indicator Kriging - probability isolines
What if point locations are centroids of polygons and the value $Z(s_i)$ represents aggregation within polygon $i$?
With polygon data, we can still define neighbors as some function of Euclidean distance between polygon centroids, as we do for point-level data, but now we have other ways to define neighbors and their weights …
Defining spatial “Neighborhoods”
Raster or Lattice: Rook, Queen - 1st order, Queen - 2nd order
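The lattice neighborhood rules can be sketched in Python as follows. The conventions here are assumptions: rook neighbors share an edge, queen neighbors share an edge or a corner, and a 2nd-order neighborhood is taken to be cumulative (it includes the 1st-order cells). Some software instead treats higher orders as exclusive rings. The function name is hypothetical.

```python
def lattice_neighbors(row, col, n_rows, n_cols, rule="rook", order=1):
    """List the (row, col) neighbors of a cell on an n_rows x n_cols lattice."""
    neighbors = []
    for dr in range(-order, order + 1):
        for dc in range(-order, order + 1):
            if dr == 0 and dc == 0:
                continue                    # a cell is not its own neighbor
            # Rook: reachable in at most `order` edge-sharing steps.
            if rule == "rook" and abs(dr) + abs(dc) > order:
                continue
            r, c = row + dr, col + dc
            if 0 <= r < n_rows and 0 <= c < n_cols:
                neighbors.append((r, c))
    return neighbors


# Center cell of a 5 x 5 lattice under each rule.
rook = lattice_neighbors(2, 2, 5, 5, rule="rook")
queen1 = lattice_neighbors(2, 2, 5, 5, rule="queen", order=1)
queen2 = lattice_neighbors(2, 2, 5, 5, rule="queen", order=2)
```

Cells on the edge of the lattice have fewer neighbors, which matters later when neighbor weights are standardized.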