slides-zaiben-trajectorySearch_sigmod2010
Searching Trajectories by Locations
– An Efficiency Study
Zaiben Chen (1), Heng Tao Shen (1), Xiaofang Zhou (1), Yu Zheng (2), Xing Xie (2)
(1) The University of Queensland
(2) Microsoft Research Asia
Outline
Research problem & application scenarios
Basic ideas
K Best-Connected Trajectory (k-BCT) query
The Incremental k-NN Algorithm (IKNN)
Performance study
Best-first
Depth-first
Optimization & extension
Experiments
Conclusion
Research Problem: Searching Trajectory Databases
How to retrieve the trajectories we want?
GPS trajectories collected by GeoLife Project, MSRA
Searching Trajectory Databases
Search by a location
Frentzos et al., GeoInformatica07; Pfoser et al., VLDB00. (R-tree variants)
Search by a sample trajectory
Chen et al, SIGMOD05; Vlachos et al, ICDE02; Yi et al, ICDE98, etc. (Similarity)
Searching Trajectory Databases
The problem we study: Searching by multiple locations
To find trajectories that are ‘close’ to all the locations
Technically, it is an extension of the single-location based query, but more complicated.
Practically, it produces a more general way to search trajectories.
Two extreme cases (one location, many locations)
Application motivations
The Microsoft GeoLife Project
http://research.microsoft.com/en-us/projects/geolife/
GeoLife is a location-based service built
on Microsoft Virtual Earth.
Our work benefits the following two functions
(1) Travel recommendation
E.g. to help a visitor plan a trip to
multiple attractions by considering others'
travel trajectories.
(2) Sharing life experiences & friend
recommendation
E.g. to find out which users share a
similar daily route through Queens Plaza,
Central Stat., Mains St.
Application motivations
[Figure: geo-coding, from pictures to coordinates; the recommended route]
The first step: to define the closeness (i.e. distance) between a trajectory and locations
Similarity Function
The similarity function reflects how close a trajectory is to the given
locations, and we call the most similar trajectory the best-connected
trajectory.
Q = {q1, q2, …, qm}, R = {p1, p2, …, pn}
Step 1. Find the closest trajectory point on R to each location qi.
Dist_q(qi, R) is the shortest distance from qi to R: $Dist_q(q_i, R) = \min_{p_j \in R} |q_i, p_j|$
Step 2. Sum up the contribution of each matched pair (unordered query): $Sim(Q, R) = \sum_{i=1}^{m} e^{-Dist_q(q_i, R)}$
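The two steps can be sketched in a few lines of Python. This is a minimal illustration, assuming 2-D points, Euclidean distance, and the exponential contribution e^(-distance) that the bound slides of this deck use; the names dist_q and sim are chosen here for illustration.

```python
import math

def dist_q(q, trajectory):
    """Step 1: shortest distance from query location q to any point of the trajectory."""
    return min(math.dist(q, p) for p in trajectory)

def sim(query, trajectory):
    """Step 2: sum the contribution e^(-distance) of each matched pair (order ignored)."""
    return sum(math.exp(-dist_q(q, trajectory)) for q in query)

# Toy data (hypothetical): R_near passes within 1 unit of both query locations.
Q = [(0.0, 0.0), (4.0, 0.0)]
R_near = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)]
R_far = [(0.0, 5.0), (4.0, 5.0)]
print(sim(Q, R_near) > sim(Q, R_far))
```

A trajectory that passes through every query location would reach the maximum similarity m, since each term is e^0 = 1.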
Problem Definition
k-Best Connected Trajectory (k-BCT) query
Given a set of trajectories T = {R1, R2, … , Rn}, a set of query locations
Q = {q1, q2, … ,qm}, and the similarity function Sim(Q, R), the k-BCT query is
to find the k trajectories among T that have the highest similarity.
Assumption:
The number of query locations is small. (m is a small constant)
Intuition:
The k-BCT result is the JOIN of m single-location based queries.
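As a reference point, the k-BCT query can be answered by scoring every trajectory. The sketch below is a brute-force baseline, not the method of these slides; it assumes trajectories are stored in a dict from an id to a list of 2-D points, and the function names are illustrative.

```python
import heapq
import math

def sim(query, trajectory):
    """Sum of e^(-shortest distance) over all query locations."""
    return sum(math.exp(-min(math.dist(q, p) for p in trajectory)) for q in query)

def k_bct_brute_force(trajectories, query, k):
    """Return the ids of the k most similar trajectories, best first."""
    return heapq.nlargest(k, trajectories, key=lambda tid: sim(query, trajectories[tid]))

# Hypothetical toy database.
T = {
    "R1": [(0.0, 0.0), (1.0, 0.0)],
    "R2": [(5.0, 5.0)],
    "R3": [(0.0, 1.0), (1.0, 1.0)],
}
Q = [(0.0, 0.0), (1.0, 0.0)]
print(k_bct_brute_force(T, Q, 2))
```

This baseline touches every trajectory point for every query location, which is the cost an index-based search tries to avoid.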
Basic ideas
Incremental k-NN Algorithm (IKNN)
Step 1. Index all the trajectory points by one single R-tree
Get the shortest distance from a query location to the trajectories
Step 2. Search for the λ-nearest neighbor (λ-NN) of each query
location (q1 to qm), by using any traditional k-nearest neighbor
algorithm over R-tree.
For any trajectory scanned by a λ-NN search, its shortest distance to
the query point is known.
Candidate set C = {all scanned trajectories}
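Steps 1 and 2 can be sketched as follows. The sketch replaces the R-tree with a brute-force scan over a flat list of (trajectory id, point) pairs, which is an assumption made only to keep the example self-contained; the names lambda_nn and scan are illustrative.

```python
import heapq
import math

def lambda_nn(points, q, lam):
    """Return the lam nearest trajectory points to q as (distance, trajectory_id), nearest first."""
    return heapq.nsmallest(lam, ((math.dist(q, p), tid) for tid, p in points))

def scan(points, query, lam):
    """Run one λ-NN search per query location and build the candidate set C."""
    matched = []
    for q in query:
        nearest = {}
        for d, tid in lambda_nn(points, q, lam):
            # Results arrive nearest first, so the first hit per trajectory is its shortest distance.
            nearest.setdefault(tid, d)
        matched.append(nearest)
    candidates = set().union(*matched)
    return matched, candidates

# Hypothetical toy data: all trajectory points in one flat collection.
points = [("R1", (0.0, 0.0)), ("R1", (1.0, 0.0)), ("R2", (0.0, 2.0)), ("R3", (9.0, 9.0))]
matched, C = scan(points, [(0.0, 0.5)], lam=3)
print(matched, C)
```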
IKNN algorithm
Step 3. Construct lower bounds of similarity.
For a trajectory R1 in C, assume three of its points, p1, p2 and p3,
were scanned by the λ-NN searches of q1 and q2.
[Figure: trajectory R1 with points p1, p2, p3 and p5, and query locations q1, q2 and q3]
$Sim(Q, R_1) = e^{-|q_1, p_1|} + e^{-|q_2, p_2|} + e^{-|q_3, p_5|} \ge e^{-|q_1, p_1|} + e^{-|q_2, p_2|}$
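The lower bound keeps only the terms whose matched pair has already been found. A minimal sketch, assuming matched[i] is a dict from trajectory id to the shortest distance found by the λ-NN search of the i-th query location (a layout chosen here for illustration):

```python
import math

def lower_bound(tid, matched):
    """Sum e^(-distance) over the query locations whose search has scanned trajectory tid."""
    return sum(math.exp(-m[tid]) for m in matched if tid in m)

# R1 was scanned by the searches of q1 and q2 but not yet by q3 (hypothetical distances).
matched = [{"R1": 1.0}, {"R1": 2.0}, {}]
print(lower_bound("R1", matched))
```

Dropping the unknown term is safe because every term e^(-distance) is positive, so the true similarity can only be larger.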
The Incremental k-NN algorithm
Step 4. Construct upper bound of similarity.
For any trajectory that is not covered by the λ-NN search, e.g. R5,
its distance to qi must be no less than the search radius of qi.
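[Figure: search circles of radius1, radius2 and radius3 around q1, q2 and q3, with trajectories R1 and R5]
$Sim(Q, R_5) = e^{-|q_1, R_5|} + e^{-|q_2, R_5|} + e^{-|q_3, R_5|} \le e^{-radius_1} + e^{-radius_2} + e^{-radius_3}$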
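The upper bound depends only on the search radii, not on any particular trajectory. A minimal sketch, assuming radii[i] is the distance from the i-th query location to its λ-th nearest neighbor; the distances to R5 below are made-up values for illustration.

```python
import math

def upper_bound_unscanned(radii):
    """Largest similarity any trajectory outside all search circles could have."""
    return sum(math.exp(-r) for r in radii)

radii = [1.0, 2.0, 1.5]
dist_to_R5 = [1.8, 2.4, 3.0]  # hypothetical: each distance exceeds the matching radius
sim_R5 = sum(math.exp(-d) for d in dist_to_R5)
print(sim_R5 <= upper_bound_unscanned(radii))
```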
The Incremental k-NN algorithm
Step 5. Check the STOP condition (pruning condition)
For a k-BCT query, if we can get k candidate trajectories whose
lower bounds are not less than the upper bound of similarity for all
un-scanned trajectories, then the k best-connected trajectories must
be included in the candidate set.
if the condition is satisfied
go to the refinement step
else
increase λ by some Δ
repeat the search process
As the search region of the λ-NN search enlarges, the k
best-connected trajectories will eventually be found.
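Steps 1 to 5 plus the refinement step can be put together as below. This is a sketch under stated assumptions, not the authors' implementation: the λ-NN search is a brute-force scan instead of an R-tree search, λ starts at k and grows by Δ = k, and the loop also stops once every point has been scanned. All names are illustrative.

```python
import heapq
import math

def iknn(trajectories, query, k, delta=None):
    """trajectories: dict id -> non-empty list of (x, y). Returns up to k ids, best first."""
    points = [(tid, p) for tid, pts in trajectories.items() for p in pts]
    lam = k
    delta = delta or k
    while True:
        matched, radii = [], []
        for q in query:
            # Brute-force stand-in for a λ-NN search over an R-tree.
            nn = heapq.nsmallest(lam, ((math.dist(q, p), tid) for tid, p in points))
            nearest = {}
            for d, tid in nn:
                nearest.setdefault(tid, d)  # nearest first, so first hit is the shortest
            matched.append(nearest)
            radii.append(nn[-1][0])         # search radius of this query location
        candidates = set().union(*matched)
        lower = {t: sum(math.exp(-m[t]) for m in matched if t in m) for t in candidates}
        upper_unscanned = sum(math.exp(-r) for r in radii)
        best_lower = heapq.nlargest(k, lower.values())
        all_scanned = lam >= len(points)
        # STOP condition: k candidates whose lower bounds reach the upper bound.
        if all_scanned or (len(best_lower) == k and best_lower[-1] >= upper_unscanned):
            break
        lam += delta                        # enlarge the search and repeat

    def sim(t):
        """Exact similarity, computed for candidates only (refinement step)."""
        return sum(math.exp(-min(math.dist(q, p) for p in trajectories[t])) for q in query)

    return heapq.nlargest(k, candidates, key=sim)

# Hypothetical toy database.
T = {
    "R1": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "R2": [(0.0, 3.0), (2.0, 3.0)],
    "R3": [(10.0, 10.0), (11.0, 10.0)],
    "R4": [(0.0, 0.5), (2.0, 0.5)],
}
print(iknn(T, [(0.0, 0.0), (2.0, 0.0)], k=2))
```

Note that each round of this sketch restarts the λ-NN search from scratch, so points close to a query location are returned again in every round.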
Problem
The problem: we may need to increase λ and compute the
lower/upper bounds for many rounds before we eventually find the
k-BCT results.
The λ-NN search will run for many rounds for every query location.
(let λ be a constant k initially, and Δ be k as well)
round 1: 1 – k nearest neighbors
round 2: 1 – 2k nearest neighbors
…
round i: 1 – i*k nearest neighbors
Trajectory points are visited multiple times.
Normally, λ >> k, so the number of point visits grows as O(λ^2).
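The quadratic growth follows from summing the rounds: k + 2k + … + i·k = k·i(i+1)/2, with i = λ/k rounds. A small sketch of that arithmetic, for one query location and assuming every round restarts its search (the function name is illustrative):

```python
def visits_with_restart(lam, k):
    """Points returned in total when rounds fetch k, 2k, 3k, ... neighbors until lam is reached."""
    rounds = -(-lam // k)  # ceil(lam / k)
    return k * rounds * (rounds + 1) // 2

print(visits_with_restart(1000, 10))  # 50500
```

With λ = 1000 and k = 10, the restarting search returns 50,500 points to deliver the 1,000 neighbors that are finally needed.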
Can we reduce the overlapped search regions?
Efficiency study of the IKNN
Adaptation of the λ-NN algorithm
The best-first nearest neighbor search [Hjaltason et al., TODS99]
A priority queue is maintained to store all the R-tree entries that have yet to be
visited, using the MINDIST as a key. So it visits MBRs/Objects in the order of the
MINDIST.
The depth-first nearest neighbor search [Roussopoulos et al., SIGMOD95]
It recursively traverses the R-tree level by level in a depth-first manner, while
maintaining a global list of k nearest candidates found so far.
Estimate the performance of the IKNN adopting different λ-NN
algorithms
Adaptation of the λ-NN algorithm
The best-first NN search
Retrieve the λ, λ+∆, λ+2∆, … NN for each query location incrementally
until the k best-connected trajectories are included in the candidate set.
Benefit
The λ-NN is returned in an incremental way
I/O optimal, no overlap occurs, Vsum = λ
Shortcoming
Memory consumption is NOT guaranteed. A priority queue is
maintained to store all the R-tree entries that have yet to be visited.
The queue may be as large as the whole dataset in an extreme case.
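The incremental behavior of best-first search can be sketched with a priority queue keyed by MINDIST. The Node class below is a hand-built stand-in for an R-tree node, assumed here for illustration only; a real R-tree would also handle insertion and node splitting.

```python
import heapq
import itertools
import math
from dataclasses import dataclass

@dataclass
class Node:
    box: tuple      # MBR as (xmin, ymin, xmax, ymax)
    children: list  # Node objects, or (x, y) points in a leaf

def mindist(q, box):
    """Smallest distance from point q to the rectangle box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)

def best_first_nn(root, q):
    """Yield (distance, point) in increasing distance from q, one neighbor at a time."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes or points
    heap = [(mindist(q, root.box), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if not isinstance(item, Node):
            yield d, item    # a data point: the next nearest neighbor
            continue
        for child in item.children:
            key = mindist(q, child.box) if isinstance(child, Node) else math.dist(q, child)
            heapq.heappush(heap, (key, next(tie), child))

# Hypothetical two-leaf tree.
leaf1 = Node((0, 0, 1, 1), [(0, 0), (1, 1)])
leaf2 = Node((5, 5, 6, 6), [(5, 5), (6, 6)])
root = Node((0, 0, 6, 6), [leaf1, leaf2])

search = best_first_nn(root, (0, 0))
first = [p for _, p in itertools.islice(search, 2)]   # the λ nearest neighbors
more = [p for _, p in itertools.islice(search, 2)]    # the next Δ, without restarting
print(first, more)
```

Because the generator keeps its queue between calls, asking for the next Δ neighbors resumes where the previous round stopped; the cost is that the queue itself is unbounded.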
The best-first strategy
Performance (R-tree leaf access)
Estimate the circle region (with radius r) that contains λ points [Belussi et al., VLDB95]
[Figure: circle of the estimated radius around q that contains λ objects]
Estimate the leaf access of a range query with radius r [Korn et al., TKDE2001]
m independent λ-NN queries
Adaptation of the λ-NN algorithm
The depth-first NN search
Every time we search for the λ+∆ NN, we have to re-visit the search
region of the λ-NN query.
Benefit: Guaranteed memory usage, O(c log_c N)
Drawback: Too many overlaps
A simple improvement: Double λ at each round, to reduce the
number of rounds and amortize cost.
Pruning: All MBRs whose MAXDIST is even smaller than the current
search range of λ-NN can be skipped in the search of λ+∆ NN.
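The effect of doubling can be checked with a short calculation. The sketch counts, for one query location, the points returned when each round restarts and the round sizes are k, 2k, 4k, … until λ is covered; it ignores the MAXDIST pruning, and the function name is illustrative.

```python
def visits_with_doubling(lam, k):
    """Points returned in total when the requested neighbor count doubles each round."""
    total, size = 0, k
    while True:
        total += size
        if size >= lam:
            return total
        size *= 2

print(visits_with_doubling(1000, 10))  # 2550
```

The round sizes form a geometric series, so the total stays below twice the last round and therefore below 4λ, which is linear in λ.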
The depth-first strategy
Performance (R-tree leaf access)
The search region is not necessarily a circle, so we cannot use the previous
method directly.
Estimate the size of the first visited MBR (at any level) that contains
not less than λ points.
Estimate the radius (MAXDIST) of the region that contains the MBR.
[Figure: MBR1 and its MAXDIST from qi]
R-tree nodes outside the circle with radius MAXDIST will not be visited.
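MINDIST and MAXDIST between a query point and an MBR can be computed directly from the rectangle corners. A minimal sketch, assuming 2-D rectangles given as (xmin, ymin, xmax, ymax); the helper outside_circle is an illustrative name for the visit test described above.

```python
import math

def mindist(q, box):
    """Distance from q to the nearest point of the rectangle (0 if q is inside)."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)

def maxdist(q, box):
    """Distance from q to the farthest corner of the rectangle."""
    dx = max(abs(q[0] - box[0]), abs(q[0] - box[2]))
    dy = max(abs(q[1] - box[1]), abs(q[1] - box[3]))
    return math.hypot(dx, dy)

def outside_circle(q, box, radius):
    """True if the rectangle lies entirely outside the circle around q."""
    return mindist(q, box) > radius

q = (0.0, 0.0)
mbr1 = (3.0, 4.0, 6.0, 8.0)        # hypothetical MBR holding at least λ points
radius = maxdist(q, mbr1)          # 10.0: every point of mbr1 is within this distance
print(mindist(q, mbr1), radius, outside_circle(q, (20.0, 0.0, 22.0, 1.0), radius))
```

Since the first MBR already holds at least λ points within distance MAXDIST, no node farther away than that can contribute a closer neighbor.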
The depth-first strategy (cont.)
Performance
Estimate the leaf access of a range query with radius MAXDIST [Korn et al., TKDE2001]
Finally,
Summary
IKNN algorithm            | Memory usage | Object visits | Leaf access
The best-first strategy   | no guarantee | m × O(λ)      |
The depth-first strategy  | O(logN * c)  | m × O(λ)      |
Although the best-first strategy has no guarantee on memory usage, it normally
runs faster, and the priority queue can still be accommodated easily in the main
memory of a modern computer.
The modified depth-first strategy reaches nearly the same performance as
the best-first strategy while preserving low memory consumption.
Optimization & Extension
Considering the importance of the query locations and assigning
different weights in exploring objects.
Extension to query locations with an order specified
Experiments
12,653 trajectories (1,147,116 points) collected by the GeoLife
project
Number of query locations: 2 to 10
Tests are conducted on a PC with a 2.1GHz CPU and 1GB memory
Experiments – Node Access
Experiments – Query Time
Experiments – Memory Usage
Thank you