Analysing Health and Crime Data


GIS in Health and Crime
Analysis
Stan Openshaw * or Andrew Turner**
School of Geography, University of Leeds
Leeds LS2 9JT
email: [email protected]
* on Tuesday and Wednesday, **on Thursday
The Role of GIS in Health and
Crime is fairly obvious!
• GIS provides an integrated spatial data
management environment for the capture,
storage, manipulation, management and
mapping of relevant data
• Developments in computerisation and IT have
allowed GIS to cover nearly all of the Health
and Crime data application areas
• Standardisation of address data and new
digital products (eg Address Point) are very
relevant
BUT
What is currently
MISSING
are many serious attempts to
use GIS for the
ANALYSIS
of Health and Crime Databases
So an
alternative title
for this talk is..
A quick account of
how to do some useful
Spatial Analysis in
GIS using Health and
Crime Data as
examples
Ah....??
You guessed it!!
I really wanted to talk about
Spatial Analysis in GIS BUT was
persuaded that this might be too
complex for lunchtime audiences
It was thought that
EQUATIONS might give you
indigestion!!
• Quite right too!!!
• So there are no equations, no maths and
absolutely no complex statistical stuff
• Just two things
– some reasons WHY you should be doing
Spatial Analysis in GIS despite the problems
– some EXAMPLES of useful Spatial Analysis
being done in GIS
Why do you want to analyse
Health and Crime Data?
Why ruin a perfectly good GIS
with a spotless record of sparkling
multicoloured mappings by also
expecting it to do SPATIAL
ANALYSIS as well as everything
else?
The answer is simply...
You really have NO CHOICE!
GIS has created an imperative for
mapping and analysis by putting X,Y
coordinates on data that previously
lacked it
People will now EXPECT you to be
constantly monitoring and analysing
crime and health databases for patterns
Unconvinced?
Time for a bit of AUDIENCE
baiting - bashing!
• Question 1. Hands up if you have a
database with X,Y coordinates on it or
plan to have one soon?
• Question 2. Hands up if you do not know
what an X,Y coordinate is?
• Question 3. Hands up if you do not know
where to find accurate X,Y coordinates to
add to your data?
• Question 4. Is it silly to collect data and
create GIS databases that are not fully
analysed using every suitable method?
• Answer. YES!! Very silly!!!
• Question 5. Does Mapping count as
analysis?
• Answer. NO!! It is the display of data that
have been damaged, had various biases
and noise added, and may mislead you.
Something far more powerful is needed.
Mapping is not ANALYSIS!!
Mapping is not ANALYSIS!!
• Question 6. Will a statistical package such as
SPSS or SAS help?
• Answer. No. This is a major problem. Most
statistical methods do not work well or at all
on geographical data. Sad isn’t it!!!
• Question 7. What about S+?
• Answer. No!!! NO!!
NO!!!!!!!!!
• Question 8. My GIS has a Spatial
Analysis Module or Section. Will that do?
• Answer. Grow up!! Get wise!! What the
GIS vendors tell you is spatial analysis is
really only spatial data manipulation
Gulp!
• Question 9. So is my GIS TOTALLY
USELESS at SPATIAL ANALYSIS and
cannot offer much relevant assistance?
• Answer. YES.. Sorry. Did no one ever tell
you?
• Question 10. What about this Getis G
statistic thingey and Moran’s
autocorrelation coefficient?
• Answer. Quite useless!!
Ah!.. Well maybe we should
keep quiet about this Spatial
Analysis deficiency. Sounds
rather too academic for us
practitioners. Also as no one
does it (since they cannot)
therefore no one probably
wants it!
Wrong!
Enter Joe Blogs..
“Excuse me.. are you saying
that you collect data you do not
fully analyse which I pay for?”
Yes but it is not a problem!
Joe Blogs.....
“Excuse me.. are you saying
that the analysis of DISEASES
that might kill me or of
CRIMES that might harm me
is not important?”
Yes... but there is no problem..
we know what is going on out there.
“How!!
If you are not doing analysis
HOW DO YOU KNOW what is
going on? I might die
prematurely because of you or
have my car stolen and wrecked
because of your ignorance and
failure to do your job properly!”
I exaggerate to make a point!
• There is a strong imperative to analyse
geographical information if it is
important to do so.
• Crime and Health Databases are
IMPORTANT
• It is surely important that they are fully
analysed using state of the art methods
Spatial Analysis Crime..
• Occurs when people collect, manage,
store, cherish, archive, and map BUT not
analyse data that they should analyse
because it may contain patterns and
processes of considerable public interest.
• Are you a Spatial Analysis Criminal?
• Do you know some others who are?
• There is a lot of it about right now!
Spatial Analysis Crime
• A term invented to describe users of
GIS who have successfully created
databases relating to all kinds of
useful information BUT who then
fail to ANALYSE it for whatever
reasons
Spatial Analysis Crime is a
consequence of the success of GIS
in creating spatial databases and a
widespread failure by users to
realise that having access to a GIS
is NOT SUFFICIENT because
there are fundamental gaps in the
GIS tool-kits
People DIE
each year because no one
BOTHERS to properly
analyse DISEASE and
DEATH data for unusual
localised concentrations
People DIE
each year because the spatial
epidemiological analysis that is
done is either too limited or too
academic research orientated or
based on inappropriate
technology that basically does
not work
Criminals ESCAPE
Detection
because no one BOTHERS
to properly analyse the real-time on-line crime data that
already exists
Let’s have a closer look at Police
IT!
• Some facts
– most police forces have installed or are
installing Command and Control Systems
that have GIS capabilities
– most have on-line crime recording with
accurate X,Y coordinates
– Police IT costs US lots of money
– most Police Forces do little or no Crime
Pattern Analysis and no localised crime
forecasting?
A Home Office Consultative
paper “Getting to Grips with
Crime” Sept 1997
• Creates a new need for local Crime
Pattern Analysis and Crime Audits
• Generates a new need for the analysis of
BS7666 spatially referenced crime data
• And...
According to a survey of Local
Authorities in England and Wales in
July 1996, some 62% undertook local
Crime Pattern Analysis!
• Much depends what is meant by
ANALYSIS!
• Crime counts for Police Beats are not
spatial analysis or Crime Pattern
Analysis!
• Drawing maps is NOT spatial analysis!
Let’s look at Health IT
• Even more of our money spent here than
with Police IT
• Databases cover most aspects of health,
disease, vaccinations, hospital visits,
deaths, etc
• They have done so for quite a while!
• Extensive national databases exist with
fairly geographic referencing
So WHY is there so little spatial analysis?
• Many reasons
– absolute confidentiality
– owned by this or that consultant or trust or
charity
– ethical approval needed
– more important to treat patients than spot
patterns
– a massive over-emphasis on causal
explanation rather than pattern spotting
and identifying persistent but circumstantial
associations
GIS needs spatial analysis
methods that are exploratory
• There are few or no hypotheses to test
which paralyses conventional approaches
• Here more than anywhere else there is a
rigid and unyielding addiction to
confirmatory approaches (viz..
hypothesis testing)
• BUT...
What happens if you have no
hypothesis to test?
a blank slide
A category of REAL Spatial
Analysis needs are essentially
anomalous pattern detection
• Nothing too clever!
• Hypothesis testing is more research than an
operational GIS activity
• Pattern detection via monitoring GIS
databases will meet most immediate needs
• So why is it not being routinely
done????
NO
Software!
No SOFTWARE!
• GIS vendors see no need to provide any
• They argue there is no market
• They think it is too specialised and too
complex for themselves to support
• They have been scared off by statisticians
• There is no consensus amongst
researchers as to WHAT methods should
be used
• AND..
Most serious of all.. there are no
EXISTING techniques the
vendors can re-code, copyright
and thus own!
Spatial Analysis is also
SPECIAL because unlike
much of GIS there was little
pre-GIS spatial analysis
activity and hence the cost-benefit analysis is harder to
perform
GIS has created a need for
Spatial Analysis as a spin-off of
its success! The vendors do not
know how to cope with these
needs and the users are
deprived of relevant technology
and have to try to make do as
best they can.
Many Spatial Data Bases are
now available for analysis
• BUT
• very few suitable spatial analysis
tools exist that can cope with
BOTH
the data
and
the users
The Available
Methods can
be classified
as follows ..
almost a blank slide
The principal problem is
an almost complete
absence of suitable
Geographical Analysis
Technology (GAT)
for use within GIS
MapInfo (for example) defines
spatial analysis as follows:
“An operation that examines data with the
intent to extract or create new data that
fulfills some required condition or
conditions. It includes such GIS
functions as polygon overlay or buffer
generation and concepts of contains,
intersects, within or adjacent.”
(Page 396, MapInfo Professional: User Guide,
1995)
Yet drawing Maps is not a very
good idea
Map based Visualization
and Analysis is a simple but
fundamentally flawed
technology
maps can tell lies
map stories can be manipulated
the analysis task is left to the
viewer’s eyeballs
it is NOT analysis!
Unemployment
Leeds and Bradford Wards
Unemployment
Leeds and Bradford EDs
The Modifiable Areal Unit
Problem (MAUP) is an
ADDED complication
scale changes the level of
generalization and thus what you see
on maps
aggregational variability is even
more devastating since there are often far
more than a billion different sets of
results for each scale!
Zones are
arbitrary
and
modifiable
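A small sketch may make the aggregation effect concrete. It is not from the talk: the six small areas, their figures and both zonings are invented for illustration, and `zone_rates` and `count_zonings` are hypothetical helper names. The count ignores contiguity, so it overstates the number of valid zonings, but it shows how quickly the alternatives multiply.

```python
def zone_rates(unemployed, workforce, zoning):
    """Unemployment rate per zone; `zoning` gives the zone label of each small area."""
    totals = {}
    for u, w, z in zip(unemployed, workforce, zoning):
        total_u, total_w = totals.get(z, (0, 0))
        totals[z] = (total_u + u, total_w + w)
    return {z: total_u / total_w for z, (total_u, total_w) in totals.items()}


def count_zonings(n_areas, n_zones):
    """Ways to group n_areas into n_zones non-empty zones (Stirling number, 2nd kind).

    Assumption: contiguity is ignored, so this is an upper bound on real zonings.
    """
    table = [[0] * (n_zones + 1) for _ in range(n_areas + 1)]
    table[0][0] = 1
    for n in range(1, n_areas + 1):
        for k in range(1, n_zones + 1):
            # Area n either joins one of k existing zones or starts a new one.
            table[n][k] = k * table[n - 1][k] + table[n - 1][k - 1]
    return table[n_areas][n_zones]


# Six hypothetical small areas (made-up numbers), aggregated in two different ways.
unemployed = [30, 5, 5, 5, 5, 30]
workforce = [100, 100, 100, 100, 100, 100]
zoning_a = ["A", "A", "A", "B", "B", "B"]
zoning_b = ["A", "B", "B", "B", "B", "A"]

rates_a = zone_rates(unemployed, workforce, zoning_a)
rates_b = zone_rates(unemployed, workforce, zoning_b)
```

With these made-up numbers zoning A reports the same rate in both zones, while zoning B reports one zone at six times the rate of the other. Grouping only 20 areas into 5 zones (`count_zonings(20, 5)`) already allows more than a billion alternatives.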
Unemployment Equal
Population
Unemployment Positively
correlated with Ethnic
Minorities
Unemployment negatively
correlated with unemployment
OK... So WHAT should you be
doing?
The NEED is for
Exploratory Spatial
Data Analysis
capable of being safely and
easily used and understood by
people who do not have higher
degrees in the statistical or
spatial sciences
The need is for automated
geographical analysis machines
that read data, perform some
analysis, and then tell you
about it in a readily understood
way
Mark 1 Geographical Analysis
Machine
• an early attempt at automated exploratory
spatial data analysis that was easy to
understand
• it answered a simple practical question:
given some X,Y point referenced data
of something interesting, WHERE might
there be evidence of localized clustering
if you do not have the foggiest idea of
where to look due to lack of knowledge?
How does GAM work?
• Uses circles as a pattern detector
• Study region covered with millions of
overlapping circles of varying sizes
• A significance test is applied to each and
the most interesting results used to build
up a density surface of pattern strength
• You examine this density surface for
peaks which define localised excess
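The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the original GAM code: the function names, the regular lattice of circle centres, the Poisson test, the 0.01 threshold and the crude centre-only surface are all assumptions made for the example.

```python
import math


def poisson_tail(observed, expected):
    """Probability of `observed` or more events when `expected` are expected."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)      # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k        # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def count_within(points, cx, cy, r):
    """Brute-force count of points inside the circle centred at (cx, cy)."""
    r2 = r * r
    return sum(1 for (x, y) in points if (x - cx) ** 2 + (y - cy) ** 2 <= r2)


def gam_scan(cases, population, radii, overlap=0.5, alpha=0.01):
    """Return (cx, cy, r, excess) for each circle with a significant excess of cases."""
    rate = len(cases) / len(population)     # overall incidence rate
    xs = [p[0] for p in population]
    ys = [p[1] for p in population]
    hits = []
    for r in radii:
        step = r * overlap                  # neighbouring circles overlap
        nx = int((max(xs) - min(xs)) / step) + 1
        ny = int((max(ys) - min(ys)) / step) + 1
        for i in range(nx):
            for j in range(ny):
                cx, cy = min(xs) + i * step, min(ys) + j * step
                at_risk = count_within(population, cx, cy, r)
                observed = count_within(cases, cx, cy, r)
                expected = rate * at_risk
                if observed > expected and poisson_tail(observed, expected) < alpha:
                    hits.append((cx, cy, r, observed - expected))
    return hits


def excess_surface(hits, cell=1.0):
    """Simplified surface: add each circle's excess to the cell holding its centre."""
    surface = {}
    for cx, cy, r, excess in hits:
        key = (int(cx // cell), int(cy // cell))
        surface[key] = surface.get(key, 0.0) + excess
    return surface
```

Here `population` is a list of (x, y) locations at risk and `cases` is the subset where the event occurred. Peaks in the dictionary returned by `excess_surface` mark the places to look at first. A real implementation would spread each circle's excess over its whole area rather than its centre cell.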
Geographical Analysis Machine
(GAM) Mark 1 history
• GAM/1 developed in the mid 1980s
• it was very computationally intensive,
hence the term MACHINE,
because it really needed a dedicated
computing machine
• Early runs took over 1 month of CPU
time on a large Mainframe (Amdahl 580)
• Later it ran on Cray X-MP, Y-MP and
Cray 2 supercomputers
• It was developed to analyze Child Leukemia
Data in Northern England
• GAM/1 easily spotted the suspected Sellafield
Cluster
• BUT
it also found an even stronger major
new cluster in Gateshead in 1986
• This is possibly the ONLY instance of a major
cancer cluster being found by analysis (rather
than journalism) since John Snow’s famous
cholera spatial epidemiology of the mid 19th
century
10 years ago GAM/1
was a mixed blessing!
• It was praised by many geographers as a
major development in useful spatial
analysis technology
• It was severely criticized by some
statisticians (mainly due to ignorance of
the geography of the problem)
• Software for GAM was never distributed
as ten years ago it was not easily run
GAM/1: good aspects
• it was automated
• prior knowledge or ignorance was rendered
irrelevant
• it looked for localized clusters at a time when
most spatial statistical methods concentrated
on global measures of pattern
• the search for local clusters was
geographically comprehensive
GAM/1: Bads
• it needed a supercomputer and was not
easy to apply because of restricted access
• there are multiple testing problems
• it upset some major statisticians who
conducted a brief campaign of intensive
criticism most of which turned out to be
either incorrect or irrelevant or
mischievous in intent
Some of the problems went away
• Criticisms faded away in the early 1990s as
spatial statisticians developed a better
understanding of the geography of the
problem and the statistical concerns were
better understood
• The Gateshead results were subsequently
corroborated although their cause remains
an official mystery?
GAM was no longer being
developed
until...
International Agency for
Research on Cancer (IARC)
• Commissioned a study in 1989-91 of different
clustering methods, many developed by critics of
GAM, FINALLY published in late 1996
• 50 synthetic cancer data sets were created for
which the degree of clustering and locations of
clusters were known but kept secret
• the data were given to the participants who
performed their analyses without any knowledge
of the correct results
• Methods applied without knowledge of the
results were
– Potthoff-Whittinghill
– Cuzick-Edwards
– GAM/K
– Besag-Newell’s method
– ISD’s Original Method
• Later extended to include 4 others but these
were applied with knowledge of the cluster
locations
– ISD revised
– Cuzick-Edwards one sample method
– Diggle-Morris K functions
– CAS method
Results published in Alexander
and Boyle (1996)
• It was anticipated that the statistical methods
preferred by the critics of GAM would work best
• Much to the SURPRISE of Alexander
and Boyle GAM/K was shown to be the
best or equivalent best means of
TESTING FOR PRESENCE OF
CLUSTERING and for FINDING
THE LOCATIONS OF CLUSTERS
Overall Performance when
Detecting Clustering
Rank  Method                    Percentage Errors
1     GAM/K                     14%
2     CAS                       16%
3     Cuzick-Edwards Revised    24%
4     ISD Revised               28%
5     ISD Original              30%
6     Cuzick-Edwards Original   32%
7     Potthoff-Whittinghill     34%
8     Besag-Newell              40%
9     Diggle-Morris             46%
Estimated Positive Sensitivities
in Finding CLUSTER locations
Method            Sensitivity
Besag-Newell      36%
Cuzick-Edwards    66%
GAM/K             87%
Alexander and Boyle (1996)
authors of the IARC study
concluded:
“The GAM has potential applications in
this area if adequate computer resources
are available. At the present time,
however, the new, more sophisticated
version of the GAM is complex, difficult
to understand..”
(p 157)
That was in 1991!!!!!!!!
• There were THEN TWO remaining criticisms:
– (1) GAM needed a supercomputer
– (2) GAM was complex
• Others could have been added
– (3) GAM was not available for others to use
– (4) GAM linkages with GIS were unclear
• Are these criticisms still valid
today?
Reviving GAM/K
• GAM/K still runs on the latter-day version of
the Cray X-MP vector supercomputer (the
Cray J90)
• Efforts were made in 1996-7 to port the Cray
X-MP code on to a Cray T3D parallel
supercomputer with 512 processors
• BUT it failed!!!
• The algorithm was re-programmed from scratch
• But it needed an estimated 9 Days of CPU
time on a single J90 processor to perform a
single run
• Fortunately:
– a modern workstation is as fast as a Cray J90
processor
• But.. This would hardly constitute a
generally applicable and easy to use
method!
Making GAM/K run faster
• Subsequent modifications to the spatial
data retrieval algorithm used in GAM/K
reduced the 9 days to 714 seconds on a
workstation
• GAM/K was now a PRACTICAL tool
• It can be readily linked to any GIS now that it
no longer needs a supercomputer to run
it
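The talk does not say which spatial data retrieval algorithm produced the reduction from 9 days (777,600 seconds) to 714 seconds, roughly a thousand-fold gain. The sketch below shows one common approach, a uniform grid index, purely as an assumption: `GridIndex` and its methods are hypothetical names, not part of GAM/K. The idea is that each circle query inspects only the grid cells it touches instead of every point in the database.

```python
import math
from collections import defaultdict


class GridIndex:
    """Bucket points into square cells so circle queries skip distant points."""

    def __init__(self, points, cell):
        self.cell = cell
        self.buckets = defaultdict(list)
        for x, y in points:
            self.buckets[self._key(x, y)].append((x, y))

    def _key(self, x, y):
        # Integer cell coordinates of the cell containing (x, y).
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def count_within(self, cx, cy, r):
        """Count points inside the circle, visiting only the cells it overlaps."""
        r2 = r * r
        kx0, ky0 = self._key(cx - r, cy - r)
        kx1, ky1 = self._key(cx + r, cy + r)
        total = 0
        for kx in range(kx0, kx1 + 1):
            for ky in range(ky0, ky1 + 1):
                for x, y in self.buckets.get((kx, ky), ()):
                    if (x - cx) ** 2 + (y - cy) ** 2 <= r2:
                        total += 1
        return total
```

Building the index costs one pass over the data. After that, the cost of each circle depends on the number of points near it rather than on the size of the whole database, which matters when millions of circles are examined.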
Example 1. Burglary Data for
somewhere in Northern
England
• Part of a town analyzed
– 71,911 Address Point houses
– 3,784 burglaries
• There is no real limit on the size of area or
amount of data analysed
• Results are self-evident!
Burglaries
Results like this are NOT found
in random data... they are
REAL
..various options in GAM to
explore the statistical aspects
further (if required) and it can
be run to check on its own
performance!
Random Crimes
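One way to check that a result is not found in random data is to scatter the same number of crimes at random over the same houses and re-run the analysis. The sketch below assumes that approach. It is not the GAM option itself, and `random_cases`, `max_cell_count` and `monte_carlo_p` are hypothetical names. In the burglary example the draw would be 3,784 of the 71,911 Address Point houses.

```python
import random


def random_cases(houses, n_cases, rng):
    """Pick n_cases houses uniformly at random: crime with no spatial pattern."""
    return rng.sample(houses, n_cases)


def max_cell_count(cases, cell=10.0):
    """Illustrative clustering statistic: the largest number of cases in any grid cell."""
    counts = {}
    for x, y in cases:
        key = (int(x // cell), int(y // cell))
        counts[key] = counts.get(key, 0) + 1
    return max(counts.values())


def monte_carlo_p(observed_stat, houses, n_cases, statistic, runs=99, seed=0):
    """Share of data sets (random runs plus the observed one) at least as extreme."""
    rng = random.Random(seed)
    as_extreme = 1                      # the observed data set counts once
    for _ in range(runs):
        if statistic(random_cases(houses, n_cases, rng)) >= observed_stat:
            as_extreme += 1
    return as_extreme / (runs + 1)
```

A small value means that random allocations of the same crimes to the same houses rarely look as clustered as the real data. Any statistic can be plugged in through the `statistic` argument, including the peak of a GAM-style surface.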
Example 2. Applying GAM to the
Long Term Limiting Illness data
from 1991 census for Northern
England
• based on Census EDs even though this is
rather coarse
• there are 6,905 EDs in the area of interest
• GAM works best on small area data ideally
one metre grid-referenced points
Map of Ward Level LLTI
Where are the localised areas of
excess?
All Data
Regional Age-Sex Covariates included
Results for Age-Sex Adjusted
Data using Bootstrap of Excess
Circle Size (km)   Thousand Circles   Number Significant
0.5                5,142              21,654
1                  1,362              11,584
1.5                626                7,418
2                  359                5,149
2.5                235                3,731
3                  165                2,876
4                  95                 1,962
5                  62                 1,436
Teesside
Tyneside
Random Data
GAM/K is a descriptive tool
OK!!
so you have
found some
possible clusters
so what!!!!
But aren’t the results so
self evident that merely
mapping the data would
be enough and a blind
man with a walking stick
couldn’t have helped but
notice them?
Ward LLTI Map and GAM
Well you PUT YOUR
GEOGRAPHER’S head back
on and start to relate the
clusters to the underlying map
patterns!
• What is associated with the clusters of
excess?
• Does their DISTRIBUTION provide any
clues?
• What is linked to the clusters of deficiency?
Clusters of Deficiency
Teesside mapped with DoE’s
Deprivation Score
Tyneside, DoE Deprivation
this is
1997
not 1967!
A Geographical Explanation
Machine will hunt out the map
associations for you!
• The Geographical Correlates Exploration
Machine of 1990 was a start
• It looked at 2^M - 1 permutations of map
coverages to define clusters that could be
“EXPLAINED” by local spatial
associations
• Location and GIS data layers were used
as surrogates for missing explanatory
variables
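The count of map-coverage permutations quoted above (two to the power M, minus one) is the number of non-empty combinations of M layers. Reading it that way is an assumption, since the slides do not spell out the enumeration. A minimal sketch, with a hypothetical `layer_subsets` helper and invented layer names:

```python
from itertools import combinations


def layer_subsets(layers):
    """Yield every non-empty combination of layers: 2**M - 1 in total for M layers."""
    for size in range(1, len(layers) + 1):
        for subset in combinations(layers, size):
            yield subset


# Hypothetical layer names, for illustration only.
example_layers = ["geology", "roads", "rivers"]
example_subsets = list(layer_subsets(example_layers))
```

Three layers give seven combinations to test. The number doubles (plus one) with each extra layer, which is why an automated machine rather than a human map reader is needed for the search.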
A Geographical Explanations
Machine- GEM/1
• Explanation here is to be interpreted in
the traditional geographical sense of
there being a possibly interesting
localised spatial association between
clusters and certain GIS data layers
• Maps do not cause patterns to appear
BUT they do contain clues as to the
processes that do if only we were clever
enough to spot and decode them
GEM can be run in 4 modes
• MODE=1 is a GAM/K search for clustering
• MODE=2 is the use of 2^K permutations of M
GIS data layers to add general covariates in an
attempt to destroy the clustering (the spatial
epidemiologist approach)
• MODE=3 examines 2^K permutations to find the
strongest spatial associations with GIS layers that
enhance the clustering (GCEM)
• MODE=4 uses 2^K permutations to add local GIS
covariates to destroy the cluster
Insufficient time to
describe how GEM
works instead we
present some results
using as pseudo
coverages
Which clusters cannot be
“explained” away?
Unexplained clusters on
Tyneside
Clusters that can be “explained
away”
The other GAMs
• MAP EXplorer (MAPEX) is an intelligent
search version of GAM/K
– uses Genetic Algorithm to perform search
– uses AVS to create MPEG computer movies of
search process
• Space Time Attribute Creature (STAC)
extends MAPEX to multiple data domains
– uses Java and web browser for animation
– interactive partnership?
Future Plans
• An ESRC project to implement GAMs
and more sophisticated GEM and
Artificial Life based Geographical
Analysis and Explanation Tools is
planned as an Internet based
distributed geographical analysis
system over the next 2 years
• If you are interested then please get in
touch
Further Info: Email
[email protected]
[email protected]