A PARALLEL FORMULATION OF THE SPATIAL AUTO-REGRESSION MODEL FOR MINING LARGE GEO-SPATIAL DATASETS

HPDM 2004 Workshop at SIAM Data Mining Conference
Barış M. Kazar, Shashi Shekhar, David J. Lilja, Daniel Boley
Army High Performance Computing and Research Center (AHPCRC)
Minnesota Supercomputing Institute (MSI)
Digital Technology Center (DTC)
University of Minnesota
04.24.2004
Overview
• Motivation
• Classical and New Data-Mining Techniques
• Problem Definition
• Our Approach
• Experimental Results
• Conclusions and Future Work
Motivation
• Widespread use of spatial databases
  – Mining spatial patterns
  – The 1855 Asiatic Cholera in London [Griffith]
• Fair Lending [NYT, R. Nader]
  – Correlation of bank locations with loan activity in poor neighborhoods
• Retail Outlets [NYT, Walmart, McDonald etc.]
  – Determining locations of stores by relating neighborhood maps with customer databases
• Crime Hot Spot Analysis [NYT, NIJ CML]
  – Explaining clusters of sexual assaults by locating addresses of sex-offenders
• Ecology [Uygar]
  – Explaining location of bird nests based on structural environmental variables
Key Concept: Neighborhood Matrix (W)
Given:
• Spatial framework
• Attributes
 1  2  3  4
 5  6  7  8
 9 10 11 12
13 14 15 16
Space + 4-neighborhood (the slide marks the 6th row of W, i.e. the neighbors of cell 6)
0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 
1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 
0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 
1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 
0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 
0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 
0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 
0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 
0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 
0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 
0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 
0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 
0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 
0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
Binary W
  0 1/2   0   0 1/2   0   0   0   0   0   0   0   0   0   0   0
1/3   0 1/3   0   0 1/3   0   0   0   0   0   0   0   0   0   0
  0 1/3   0 1/3   0   0 1/3   0   0   0   0   0   0   0   0   0
  0   0 1/2   0   0   0   0 1/2   0   0   0   0   0   0   0   0
1/3   0   0   0   0 1/3   0   0 1/3   0   0   0   0   0   0   0
  0 1/4   0   0 1/4   0 1/4   0   0 1/4   0   0   0   0   0   0   <- 6th row
  0   0 1/4   0   0 1/4   0 1/4   0   0 1/4   0   0   0   0   0
  0   0   0 1/3   0   0 1/3   0   0   0   0 1/3   0   0   0   0
  0   0   0   0 1/3   0   0   0   0 1/3   0   0 1/3   0   0   0
  0   0   0   0   0 1/4   0   0 1/4   0 1/4   0   0 1/4   0   0
  0   0   0   0   0   0 1/4   0   0 1/4   0 1/4   0   0 1/4   0
  0   0   0   0   0   0   0 1/3   0   0 1/3   0   0   0   0 1/3
  0   0   0   0   0   0   0   0 1/2   0   0   0   0 1/2   0   0
  0   0   0   0   0   0   0   0   0 1/3   0   0 1/3   0 1/3   0
  0   0   0   0   0   0   0   0   0   0 1/3   0   0 1/3   0 1/3
  0   0   0   0   0   0   0   0   0   0   0 1/2   0   0 1/2   0
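Row-normalized W
\[
\text{neighbors}(i,j) =
\begin{cases}
(i-1,\ j), & 2 \le i \le p,\ 1 \le j \le q \quad \text{(NORTH)} \\
(i,\ j+1), & 1 \le i \le p,\ 1 \le j \le q-1 \quad \text{(EAST)} \\
(i+1,\ j), & 1 \le i \le p-1,\ 1 \le j \le q \quad \text{(SOUTH)} \\
(i,\ j-1), & 1 \le i \le p,\ 2 \le j \le q \quad \text{(WEST)}
\end{cases}
\]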
W allows other neighborhood definitions
• distance based
• 8 and more neighbors
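A minimal Python sketch of the construction above, for illustration only (the talk's software is Fortran 77). The function names `neighbors`, `binary_w` and `row_normalize` are hypothetical, and cells are assumed to be numbered row by row as in the 4-by-4 grid.

```python
import numpy as np

def neighbors(i, j, p, q):
    """4-neighborhood of cell (i, j) on a p-by-q grid, 1-based as on the slide."""
    out = []
    if i >= 2:
        out.append((i - 1, j))   # NORTH
    if j <= q - 1:
        out.append((i, j + 1))   # EAST
    if i <= p - 1:
        out.append((i + 1, j))   # SOUTH
    if j >= 2:
        out.append((i, j - 1))   # WEST
    return out

def binary_w(p, q):
    """Binary neighborhood matrix; cell (i, j) maps to row (i - 1) * q + (j - 1)."""
    n = p * q
    w = np.zeros((n, n))
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            row = (i - 1) * q + (j - 1)
            for a, b in neighbors(i, j, p, q):
                w[row, (a - 1) * q + (b - 1)] = 1.0
    return w

def row_normalize(w):
    """Divide each row by its sum (assumes every cell has at least one neighbor)."""
    return w / w.sum(axis=1, keepdims=True)

C = binary_w(4, 4)
W = row_normalize(C)
print(np.nonzero(C[5])[0] + 1)   # neighbors of cell 6: [ 2  5  7 10]
print(W[5, [1, 4, 6, 9]])        # 6th row weights: [0.25 0.25 0.25 0.25]
```

Other neighborhood definitions (distance based, 8 neighbors) would only change `neighbors`; the normalization step stays the same.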
Classical and New Data-Mining Techniques
Name                          Model                 Classification Accuracy
Classical Linear Regression   y = xβ + ε            Low
Spatial Auto-Regression       y = ρWy + xβ + ε      High
ρ : the spatial auto-regression (auto-correlation) parameter
W : n-by-n neighborhood matrix over spatial framework
• Solving Spatial Auto-regression Model
  – ρ = 0, ε = 0 : Least Squares Problem
  – β = 0, ε = 0 : Eigenvalue Problem
  – General case: Computationally expensive
• Maximum Likelihood Estimation
\[
\ln(L) = \ln\left| I - \rho W \right| - \frac{n \ln(2\pi)}{2} - \frac{n \ln(\sigma^2)}{2} - SSE
\]
• Need parallel implementation to scale up
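The log-determinant term is what ties the likelihood to the eigenvalues of W: since the eigenvalues of I − ρW are 1 − ρλ_i, ln|I − ρW| equals the sum of ln(1 − ρλ_i). The sketch below checks that identity numerically on the 4-by-4 grid. It is an illustration in Python, not the authors' Fortran code, and `grid_w` and `logdet_from_eigs` are hypothetical helper names.

```python
import numpy as np

def grid_w(p, q):
    """Row-normalized 4-neighbor W for a p-by-q grid, cells numbered row by row."""
    n = p * q
    c = np.zeros((n, n))
    for r in range(n):
        i, j = divmod(r, q)
        if i > 0:
            c[r, r - q] = 1.0      # north
        if j < q - 1:
            c[r, r + 1] = 1.0      # east
        if i < p - 1:
            c[r, r + q] = 1.0      # south
        if j > 0:
            c[r, r - 1] = 1.0      # west
    return c / c.sum(axis=1, keepdims=True)

def logdet_from_eigs(rho, eigs):
    """ln|I - rho W| from the eigenvalues of W; valid while 1 - rho*eig > 0."""
    return float(np.sum(np.log(1.0 - rho * eigs)))

W = grid_w(4, 4)
# Row-normalized W is similar to a symmetric matrix, so its eigenvalues are real;
# .real only discards rounding-level imaginary parts from the general solver.
eigs = np.linalg.eigvals(W).real

rho = 0.5
sign, direct = np.linalg.slogdet(np.eye(16) - rho * W)
print(logdet_from_eigs(rho, eigs), direct)   # the two values agree
```

Once the eigenvalues are known, each new value of ρ costs only a sum over n terms, which is why the eigenvalue computation is done once up front.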
Related Work & Our Contributions
• Related work: Li, 1996
  – Limitations: Solved 1-D problem
• Our Contributions
  – Parallel solution for 2-D problems
  – Portable software
    – Fortran 77
    – An Application of Hybrid Parallelism
      » MPI messaging system
      » Compiler directives of OpenMP
A Serial Solution
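[Flow diagram: three stages, A → B → C]
A: Compute eigenvalues of W. Input: n, x, y, W. Output: eigenvalues of W.
B: Golden section search; calculate ML function. Input: x, y, W, n, range of ρ, eigenvalues of W. Output: ρ̂.
C: Least squares. Output: ρ̂, β̂, σ̂².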
• Compute Eigenvalues (Stage A)
  – Produces dense W neighborhood matrix
  – Forms synthetic data y
  – Makes W symmetric
  – Householder transformation
    – Convert dense symmetric matrix to tri-diagonal matrix
  – QL Transformation
    – Compute all eigenvalues of tri-diagonal matrix
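The three stages can be sketched end to end in Python. This is an illustration under stated assumptions, not the authors' Fortran 77 code: W is made symmetric by the similarity transform D^(-1/2) C D^(-1/2) (one common choice; the slide does not say which is used), `np.linalg.eigvalsh` stands in for the Householder and QL steps, the ML function is the standard concentrated log-likelihood in ρ with constants dropped, and all function names and the synthetic data settings are hypothetical.

```python
import numpy as np

def grid_c(p, q):
    """Binary 4-neighbor matrix for a p-by-q grid, cells numbered row by row."""
    n = p * q
    c = np.zeros((n, n))
    for r in range(n):
        i, j = divmod(r, q)
        if i > 0:
            c[r, r - q] = 1.0
        if j < q - 1:
            c[r, r + 1] = 1.0
        if i < p - 1:
            c[r, r + q] = 1.0
        if j > 0:
            c[r, r - 1] = 1.0
    return c

def stage_a(c):
    """Stage A: eigenvalues of row-normalized W via a symmetric similar matrix."""
    d = c.sum(axis=1)
    w_sym = c / np.sqrt(np.outer(d, d))   # D^(-1/2) C D^(-1/2), same spectrum as D^(-1) C
    return np.linalg.eigvalsh(w_sym)      # symmetric eigen-solver, ascending order

def stage_c(rho, x, y, wy):
    """Stage C: least squares for beta and sigma^2 at a fixed rho."""
    z = y - rho * wy                      # (I - rho W) y
    beta, *_ = np.linalg.lstsq(x, z, rcond=None)
    resid = z - x @ beta
    return beta, float(resid @ resid) / len(y)

def log_likelihood(rho, eigs, x, y, wy):
    """Concentrated log-likelihood in rho (additive constants dropped)."""
    _, sigma2 = stage_c(rho, x, y, wy)
    return float(np.sum(np.log(1.0 - rho * eigs))) - 0.5 * len(y) * np.log(sigma2)

def golden_section_max(f, lo, hi, tol=1e-6):
    """Stage B: golden section search for the maximizer of a unimodal f."""
    g = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c1, c2 = b - g * (b - a), a + g * (b - a)
    f1, f2 = f(c1), f(c2)
    while b - a > tol:
        if f1 < f2:                       # maximum lies in [c1, b]
            a, c1, f1 = c1, c2, f2
            c2 = a + g * (b - a)
            f2 = f(c2)
        else:                             # maximum lies in [a, c2]
            b, c2, f2 = c2, c1, f1
            c1 = b - g * (b - a)
            f1 = f(c1)
    return 0.5 * (a + b)

# Synthetic 10-by-10 problem with known parameters (illustrative values).
rng = np.random.default_rng(0)
c = grid_c(10, 10)
w = c / c.sum(axis=1, keepdims=True)
n = c.shape[0]
x = np.column_stack([np.ones(n), rng.normal(size=n)])
rho_true, beta_true = 0.5, np.array([1.0, 2.0])
y = np.linalg.solve(np.eye(n) - rho_true * w, x @ beta_true + 0.01 * rng.normal(size=n))

eigs = stage_a(c)                         # done once, reused for every rho
wy = w @ y                                # static term, also computed once
rho_hat = golden_section_max(lambda r: log_likelihood(r, eigs, x, y, wy), 0.0, 0.99)
beta_hat, sigma2_hat = stage_c(rho_hat, x, y, wy)
print(rho_hat, beta_hat, sigma2_hat)
```

In this sketch every evaluation inside the search is cheap compared with Stage A, because the eigenvalues and `wy` are computed once and reused.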
Serial Response Times (sec)
• Stage A is the bottleneck; Stages B and C contribute very little to the response time
[Figure: Time (sec), 0 to 7000, broken down by Stage A, Stage B and Stage C. x-axis: Problem Sizes on Different Machines (2500, 6400 and 10000, each on SGI Origin, IBM SP and IBM Regatta).]
Problem Definition
Given:
• A Sequential solution procedure: “Serial Dense Matrix Approach” for
one-dimensional geo-spaces
Find:
• Parallel Formulation of Serial Dense Matrix Approach for
multi-dimensional geo-spaces
Constraints:
• ε ~ N(0, σ²I) IID
• Reasonably efficient parallel implementation
• Parallel Platform
• Size of W (large vs. small and dense vs. sparse)
Objective:
• Portable & scalable software
Our Approach – Parallel Spatial Auto-Regression
• Function vs. Data Partitioning
  – Function partitioning: Each processor works on the same data with different instructions
  – Data partitioning (applied): Each processor works on different data with the same instructions
• Implementation Platform:
  – Fortran with MPI & OpenMP APIs
• No machine-specific compiler directives
  – Portability
  – Helps software development and technology transfer
• Other Performance Tuning
  – Static terms computed once
Data Partitioning in a Smaller Scale
• 4 processors are used and
chunk size can be determined by the user
• W is 16-by-16 and partitioned across processors
P1- (40 vs. 58)
P2- (36 vs. 42)
P3- (32 vs. 26)
P4- (28 vs. 10)
[Figure: the 16-by-16 row-normalized W shown under two partitionings, "Contiguous" and "Round-robin with chunk size 1". Rows carry the labels 16 down to 1 and their owning processor: P1 to P4 in blocks of four rows for contiguous, and P1 to P4 repeating row by row for round-robin.]
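One reading that reproduces the "40 vs. 58" style figures for P1 to P4 is that row r of the 16 rows costs 16, 15, ..., 1 units of work (a triangular loop), and that the first number in each pair is the load under round-robin with chunk size 1 and the second the load under contiguous partitioning. That cost model is an assumption; the sketch below, with hypothetical function names, simply checks the arithmetic.

```python
def contiguous(n, p):
    """Processor k gets one block of n/p consecutive rows (assumes p divides n)."""
    b = n // p
    return [list(range(k * b, (k + 1) * b)) for k in range(p)]

def round_robin(n, p, chunk):
    """Chunks of `chunk` consecutive rows are dealt to processors in turn."""
    rows = [[] for _ in range(p)]
    for start in range(0, n, chunk):
        owner = (start // chunk) % p
        rows[owner].extend(range(start, min(start + chunk, n)))
    return rows

def loads(assignment, weight):
    """Total work per processor for a given row assignment."""
    return [sum(weight[r] for r in rows) for rows in assignment]

n, p = 16, 4
weight = [n - r for r in range(n)]            # assumed row costs: 16, 15, ..., 1
print(loads(round_robin(n, p, 1), weight))    # [40, 36, 32, 28]
print(loads(contiguous(n, p), weight))        # [58, 42, 26, 10]
```

Under this model the round-robin loads differ by at most 12 units, while the contiguous loads range from 10 to 58, which is the imbalance the slide illustrates.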
Data Partitioning & Synchronization
[Flow diagram: three stages, A → B → C]
A: Compute eigenvalues of W. Input: n, x, y, W. Output: eigenvalues of W.
B: Golden section search; calculate ML function. Input: x, y, W, n, range of ρ, eigenvalues of W. Output: ρ̂.
C: Least squares. Output: ρ̂, β̂, σ̂².
• A : Contiguous for rectangular loops & round-robin with chunk-size 4
• B : Contiguous
• C : Contiguous
• The arrows are also synchronization points for the parallel solution
• There are synchronization points within the boxes as well
Experimental Design
Factor Name                  Parameter Domain
Language                     f77 w/ OpenMP & MPI
Problem Size (n)             2500, 6400 and 10000 observation points
Neighborhood Structure       2-D w/ 4-neighbors
Method                       Maximum Likelihood for exact SAM
Auto-regression Parameter    [0,1)
Load-Balancing               SLB: Contiguous (B=n/p)
                             SLB: Round-robin w/ B={1,4,8,16}
                             SLB: Combined (Contiguous+Round-robin)
                             DLB: Dynamic w/ B={n/p,1,4,8,16}
                             DLB: Guided w/ B={1,4,8,16}
                             MLB: Affinity w/ B={n/p,1,4,8,16}
Hardware Platform            IBM Regatta w/ 47.5 GB Main Memory; 32 1.3 GHz Power4 architecture processors
Number of Processors         1, 4, and 8
Experimental Results – Effect of Load Balancing
[Figure: Effect of Load-Balancing Techniques on Speedup for Problem Size 10000. y-axis: Speedup (0 to 8); x-axis: Number of Processors (1, 4, 8); series: mixed1, Static B=8, Dynamic B=8, Affinity B=1, Guided B=16.]
Experimental Results- Effect of Problem Size
[Figure: Impact of Problem Size on Speedup Using Affinity Scheduling on 8 Processors. y-axis: Speedup (0 to 8); x-axis: Problem Size (2500, 6400, 10000); series: affinity B=n/p, B=1, B=4, B=8, B=16.]
Experimental Results- Effect of Chunk Size
[Figure, two panels: "Effect of Chunk Size on Speedup Using Dynamic Scheduling on 8 Processors" and "Effect of Chunk Size on Speedup Using Static Scheduling on 8 Processors". y-axis: Speedup (0 to 8); x-axis: Chunk Size (1, 4, 8, 16, n/p); series: PS=2500, PS=6400, PS=10000.]
• There is a critical value of the chunk size at which the speedup reaches its maximum.
• This value is higher for dynamic scheduling, to compensate for the scheduling overhead.
• The workload is more evenly distributed across processors at the critical chunk size value.
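Why a critical chunk size can exist is easy to see in a toy cost model. The sketch below is not a reproduction of the measured results: it assumes triangular row costs (n, n−1, ..., 1), a fixed overhead per dispatched chunk, and dynamic scheduling in which the next chunk goes to whichever processor becomes idle first. The function name `makespan` and the overhead value 20 are hypothetical.

```python
import heapq

def makespan(n, p, chunk, overhead):
    """Finish time when chunks of rows are handed to the first idle processor."""
    weight = [n - r for r in range(n)]        # assumed triangular row costs
    idle_at = [(0.0, k) for k in range(p)]    # (time the processor becomes idle, id)
    heapq.heapify(idle_at)
    for start in range(0, n, chunk):
        work = sum(weight[start:start + chunk])
        t, k = heapq.heappop(idle_at)         # earliest idle processor
        heapq.heappush(idle_at, (t + overhead + work, k))
    return max(t for t, _ in idle_at)

for b in (1, 4, 16):
    print(b, makespan(64, 4, b, overhead=20.0))
# chunk 4 finishes at 600.0 and chunk 16 (= n/p) at 924.0;
# chunk 1 pays the overhead 64 times and cannot finish before 840.0
```

In this model small chunks balance the load but pay the dispatch overhead many times, while large chunks pay it rarely but leave one processor with most of the work, so an intermediate chunk size wins.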
Experimental Results- Effect of # of Processors
[Figure: Effect of Number of Processors on Speedup (PS=10000). y-axis: Speedup (0 to 8); x-axis: Load-Balancing (Scheduling) Techniques (mixed1, mixed2, static B={1,4,8,16,n/p}, dynamic B={1,4,8,16,n/p}, affinity B={1,4,8,16,n/p}, guided B={1,4,8,16}); series: 4 and 8 processors.]
Summary
• Developed a parallel formulation of spatial
auto-regression model
• Estimates maximum likelihood of regular
square tessellation 1-D and 2-D planar
surface partitionings for location prediction
problems
• Used dense eigenvalue computation and
hybrid parallel programming
Future Work
1. Understand reasons for inefficiencies
  – Algebraic cost model for speedup measurements on different architectures
2. Fine-tune the implemented parallel formulation
  – Consider alternate parallel formulations
3. Parallelize other serial solutions using sparse-matrix techniques
  – Chebyshev Polynomial approximation
  – Markov Chain Monte Carlo Estimator
Acknowledgments & Final Word
• Army High Performance Computing Research Center - AHPCRC
• Minnesota Supercomputing Institute - MSI
• Digital Technology Center - DTC
• Spatial Database Group Members
• ARCTiC Labs Group Members
• Dr. Sanjay Chawla
• Dr. Kelley Pace
• Dr. James LeSage
THANK YOU VERY MUCH
Questions?