slides-zaiben-trajectorySearch_sigmod2010


Searching Trajectories by Locations
– An Efficiency Study
Zaiben Chen1, Heng Tao Shen1, Xiaofang Zhou1, Yu Zheng2, Xing Xie2
1 The University of Queensland
2 Microsoft Research, Asia
Outline

Research problem & application scenarios

Basic ideas
    K Best-Connected Trajectory (k-BCT) query
    The Incremental k-NN Algorithm (IKNN)

Performance study
    Best-first
    Depth-first

Optimization & extension

Experiments

Conclusion
Research Problem: Searching Trajectory Databases
How to retrieve the trajectories we want?
GPS trajectories collected by GeoLife Project, MSRA
Searching Trajectory Databases

Search by a location
Frentzos et al. GeoInformatica07; Pfoser et al. VLDB00. (R-tree variants)

Search by a sample trajectory
Chen et al, SIGMOD05; Vlachos et al, ICDE02; Yi et al, ICDE98, etc. (Similarity)
Searching Trajectory Databases

The problem we study: Searching by multiple locations

To find trajectories that are ‘close’ to all the locations


Technically, it is an extension of the single-location based query,
but more complicated.
Practically, it produces a more general way to search trajectories.
Two extreme cases (one location, many locations)
Application motivations

The Microsoft GeoLife Project
http://research.microsoft.com/en-us/projects/geolife/
GeoLife is a location-based service built
on Microsoft Virtual Earth.
Our work benefits the following two functions
(1) Travel recommendation
E.g. to help a visitor plan a trip to
multiple attractions by considering others’
travel trajectories.
(2) Sharing life experiences & friend
recommendation
E.g. to find out which users share a
similar daily route through Queens Plaza,
Central Station, Mains St.
Application motivations
Geo-Coding: from pictures to coordinates
The recommended route
The first step: to define the closeness (i.e. distance)
between a trajectory and locations
Similarity Function

The similarity function reflects how close a trajectory is to the given
locations, and we call the most similar trajectory the best-connected
trajectory.

Q = {q1, q2, …, qm}, R = {p1, p2, …, pn}

Step 1. Find the closest trajectory point on R to each location qi.
Distq(qi, R) is the shortest distance from qi to R.

Step 2. Sum up the contribution of each matched pair. (unordered query)
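The two steps can be sketched in a few lines of Python. The formula itself did not survive in this transcript, so the exponential form used here is inferred from the worked example under Step 3 (terms of the form e^-|qi, pj|). Euclidean distance on 2-D points and the names `dist_q` and `similarity` are assumptions made for illustration.

```python
import math

def dist_q(q, trajectory):
    """Step 1: shortest distance from location q to any sample point of the trajectory."""
    return min(math.dist(q, p) for p in trajectory)

def similarity(query, trajectory):
    """Step 2: sum the contribution exp(-Dist_q(q_i, R)) of each matched pair."""
    return sum(math.exp(-dist_q(q, trajectory)) for q in query)

# Hypothetical toy data: two query locations and one three-point trajectory.
Q = [(0.0, 0.0), (2.0, 0.0)]
R = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
print(similarity(Q, R))  # exp(-1) + exp(0)
```

A location lying exactly on the trajectory contributes 1, and the contribution decays quickly with distance, so only nearby trajectories score well.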
Problem Definition

k-Best Connected Trajectory (k-BCT) query
Given a set of trajectories T = {R1, R2, … , Rn}, a set of query locations
Q = {q1, q2, … ,qm}, and the similarity function Sim(Q, R), the k-BCT query is
to find the k trajectories among T that have the highest similarity.
Assumption:
The number of query locations is small. (m is a small constant)
Intuition:
The k-BCT result is the JOIN of m single-location based queries.
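As a reference point, the k-BCT query can be answered by brute force: score every trajectory and keep the k best. This is only a baseline sketch, not the algorithm of the talk. The dictionary layout, the function names, and the exponential similarity (inferred from the Step 3 example) are assumptions.

```python
import heapq
import math

def similarity(query, trajectory):
    """Sum over query locations of exp(-shortest distance to the trajectory)."""
    return sum(math.exp(-min(math.dist(q, p) for p in trajectory)) for q in query)

def k_bct_brute_force(trajectories, query, k):
    """trajectories: dict mapping trajectory id -> list of (x, y) points.
    Returns the ids of the k trajectories with the highest similarity."""
    return heapq.nlargest(
        k, trajectories, key=lambda tid: similarity(query, trajectories[tid])
    )

# Hypothetical toy data.
T = {
    "R1": [(0, 0), (1, 0), (2, 0)],
    "R2": [(0, 5), (1, 5)],
    "R3": [(0, 1), (2, 1)],
}
Q = [(0, 0), (2, 0)]
print(k_bct_brute_force(T, Q, 2))
```

The baseline touches every point of every trajectory for every query location, which is the cost an index-based method tries to avoid.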
Basic ideas
Incremental k-NN Algorithm (IKNN)

Step 1. Index all the trajectory points by one single R-tree


Get the shortest distance from a query location to the trajectories
Step 2. Search for the λ-nearest neighbors (λ-NN) of each query
location (q1 to qm), using any traditional k-nearest neighbor
algorithm over the R-tree.
For any trajectory scanned by a λ-NN search, its shortest distance to
the query point is known.
Candidate set C = {all scanned trajectories}
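A minimal sketch of Steps 1 and 2 follows. To stay self-contained it replaces the R-tree with a plain sort of all points, which returns the same neighbors without the index; the function names and the `(traj_id, point)` layout are hypothetical. Because neighbors arrive in increasing distance, the first point seen for a trajectory gives its shortest distance to that query location.

```python
import math

def lambda_nn(points, q, lam):
    """Stand-in for an R-tree k-NN search: the lam nearest points to q,
    as (distance, traj_id) pairs in increasing distance."""
    return sorted((math.dist(q, p), tid) for tid, p in points)[:lam]

def scan(points, query, lam):
    """Run one λ-NN search per query location.
    Returns (candidates, radius):
      candidates[tid][i] = shortest distance from query[i] to trajectory tid
      radius[i]          = distance of the lam-th neighbor of query[i]"""
    candidates, radius = {}, []
    for i, q in enumerate(query):
        hits = lambda_nn(points, q, lam)
        radius.append(hits[-1][0])
        for d, tid in hits:
            # Keep only the first (closest) hit per trajectory and location.
            candidates.setdefault(tid, {}).setdefault(i, d)
    return candidates, radius

# Hypothetical toy data: every trajectory point tagged with its trajectory id.
points = [("R1", (0, 0)), ("R1", (1, 0)), ("R2", (0, 3)), ("R3", (5, 5))]
print(scan(points, [(0, 0)], 3))
```

The candidate set C is simply the key set of `candidates`; trajectories that never appear in any neighbor list stay outside it.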
IKNN algorithm

Step 3. Construct lower bounds of similarity.
For a trajectory R1 in C, assume it got 3 points p1, p2 and p3
scanned by the λ-NN search of q1, q2.
[Figure: trajectory R1 with points p1, p2, p3, p5 and query locations q1, q2, q3]
$$\mathrm{Sim}(Q, R_1) = e^{-|q_1, p_1|} + e^{-|q_2, p_2|} + e^{-|q_3, p_5|} \ge e^{-|q_1, p_1|} + e^{-|q_2, p_2|}$$
The Incremental k-NN algorithm

Step 4. Construct upper bound of similarity.
For any trajectory that is not covered by the λ-NN search, e.g. R5,
its distance to qi must be larger than the search radius of qi.
[Figure: search circles of radius1, radius2, radius3 around q1, q2, q3, with trajectories R1 and R5]
$$\mathrm{Sim}(Q, R_5) = e^{-|q_1, R_5|} + e^{-|q_2, R_5|} + e^{-|q_3, R_5|} \le e^{-\mathit{radius}_1} + e^{-\mathit{radius}_2} + e^{-\mathit{radius}_3}$$
The Incremental k-NN algorithm

Step 5. Check the STOP condition (pruning condition)
For a k-BCT query, if we can get k candidate trajectories whose
lower bounds are not less than the upper bound of similarity for all
un-scanned trajectories, then the k best-connected trajectories must
be included in the candidate set.
if the condition is satisfied
go to the refinement step
else
increase λ by some Δ
repeat the search process
As the search region of the λ-NN search enlarges, the k
best-connected trajectories will eventually be found.
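Steps 1 to 5 can be put together in one loop. The sketch below makes several assumptions: pre-sorted neighbor lists stand in for the R-tree, λ starts at k and grows by Δ = k, and the refinement step (not detailed on the slides) simply computes the exact similarity of every candidate. All names are hypothetical.

```python
import math

def similarity(query, traj):
    return sum(math.exp(-min(math.dist(q, p) for p in traj)) for q in query)

def iknn(trajectories, query, k):
    """trajectories: dict id -> list of (x, y). Returns the k best-connected ids."""
    points = [(tid, p) for tid, traj in trajectories.items() for p in traj]
    # Stand-in for the R-tree: one neighbor list per query location.
    ranked = [sorted((math.dist(q, p), tid) for tid, p in points) for q in query]
    k = min(k, len(trajectories))
    delta = k
    lam = k
    while True:
        lam = min(lam, len(points))
        seen = {}  # candidate set C: tid -> {query index: shortest distance}
        for i, hits in enumerate(ranked):
            for d, tid in hits[:lam]:
                seen.setdefault(tid, {}).setdefault(i, d)
        # Step 3: lower bound, unmatched query locations contribute nothing.
        lower = {t: sum(math.exp(-d) for d in m.values()) for t, m in seen.items()}
        # Step 4: upper bound for any trajectory outside every search radius.
        upper_unseen = sum(math.exp(-hits[lam - 1][0]) for hits in ranked)
        # Step 5: stop when k lower bounds dominate the unseen upper bound.
        top = sorted(lower.values(), reverse=True)[:k]
        if (len(top) == k and top[-1] >= upper_unseen) or lam == len(points):
            break
        lam += delta
    # Refinement (assumed): exact similarity for every candidate.
    exact = lambda tid: similarity(query, trajectories[tid])
    return sorted(seen, key=exact, reverse=True)[:k]

# Hypothetical toy data.
T = {
    "R1": [(0, 0), (1, 0), (2, 0)],
    "R2": [(0, 5), (1, 5)],
    "R3": [(0, 1), (2, 1)],
    "R4": [(10, 10)],
}
print(iknn(T, [(0, 0), (2, 0)], 2))
```

Note that each round here rebuilds the candidate set from the first λ neighbors, so earlier neighbors are revisited in every round.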
Problem

The problem: we may need to increase λ and compute the
lower/upper bounds for many rounds before we eventually find the
k-BCT results.

The λ-NN search will run for many rounds for every query location.
(let λ be a constant k initially, and Δ be k as well)
round 1: 1 – k nearest neighbors
round 2: 1 – 2k nearest neighbors
…
round i: 1 – i*k nearest neighbors
Trajectory points are visited multiple times.
Normally, λ >> k, so the total cost is O(λ^2) point visits per query location (k + 2k + … + λ ≈ λ^2 / 2k).
Can we reduce the overlapped search regions?
Efficiency study of the IKNN

Adaptation of the λ-NN algorithm

The best-first nearest neighbor search [Hjaltason et al., TODS99]
A priority queue is maintained to store all the R-tree entries that have yet to be
visited, using the MINDIST as a key. So it visits MBRs/Objects in the order of the
MINDIST.

The depth-first nearest neighbor search [Roussopoulos et al., SIGMOD95]
It recursively traverses the R-tree level by level in a depth-first manner, while
maintaining a global list of k nearest candidates found so far.

Estimate the performance of the IKNN adopting different λ-NN
algorithms
Adaptation of the λ-NN algorithm

The best-first NN search

Retrieve the λ, λ+∆, λ+2∆, … NN for each query location incrementally
until the k best-connected trajectories are included in the candidate set.


Benefit
The λ-NN is returned in an incremental way
I/O optimal, no overlap occurs, Vsum = λ
Shortcoming
Memory consumption is NOT guaranteed. A priority queue is
maintained to store all the R-tree entries that have yet to be visited.
The queue may be as large as the whole dataset in an extreme case.
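The best-first strategy can be sketched with a priority queue keyed on MINDIST. The tiny hand-built `Node` class below is a hypothetical stand-in for an R-tree (no insertion or balancing), and the code is an illustration rather than the authors' implementation. Because the search is a generator, asking for Δ more neighbors resumes where the previous round stopped, so no point is visited twice.

```python
import heapq
import itertools
import math

class Node:
    """Hypothetical minimal tree node: a leaf holds (traj_id, point) pairs,
    an inner node holds child nodes. The MBR is (xmin, ymin, xmax, ymax)."""
    def __init__(self, children=(), points=()):
        self.children, self.points = list(children), list(points)
        corners = [p for _, p in self.points]
        for c in self.children:
            corners += [(c.mbr[0], c.mbr[1]), (c.mbr[2], c.mbr[3])]
        xs, ys = zip(*corners)
        self.mbr = (min(xs), min(ys), max(xs), max(ys))

def mindist(q, mbr):
    """Smallest possible distance from q to any point inside the MBR."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def best_first_nn(root, q):
    """Yield (distance, traj_id, point) in non-decreasing distance from q."""
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    heap = [(mindist(q, root.mbr), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            for child in item.children:
                heapq.heappush(heap, (mindist(q, child.mbr), next(tie), child))
            for tid, p in item.points:
                heapq.heappush(heap, (math.dist(q, p), next(tie), (tid, p)))
        else:
            yield d, item[0], item[1]

# Hypothetical toy tree with two leaves.
leaf_a = Node(points=[("R1", (0, 0)), ("R1", (1, 0))])
leaf_b = Node(points=[("R2", (5, 5)), ("R3", (6, 5))])
root = Node(children=[leaf_a, leaf_b])

search = best_first_nn(root, (0.2, 0.0))
first_round = list(itertools.islice(search, 2))  # the λ-NN
next_round = list(itertools.islice(search, 1))   # Δ more, with no re-scan
print(first_round, next_round)
```

The heap holds every entry that has been opened but not yet reported, which is exactly the unbounded memory cost listed as the shortcoming.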
The best-first strategy

Performance (R-tree leaf access)

Estimate the circle region (with radius r) that contains λ points [Belussi
et al. VLDB95]
[Figure: circle of radius r centered at q containing λ objects]

Estimate the leaf access of a range query with radius r [Korn et al.
TKDE2001]

m independent λ-NN queries
Adaptation of the λ-NN algorithm

The depth-first NN search

Every time we search for the λ+∆ NN, we have to re-visit the search
region of the λ-NN query.

Benefit: Guaranteed memory usage, O(c log_c N)

Drawback: Too many overlaps

A simple improvement: Double λ at each round, to reduce the
number of rounds and amortize cost.

Pruning: Any MBR whose MAXDIST is smaller than the current
search range of the λ-NN can be skipped in the search for the λ+∆ NN.
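The effect of doubling can be checked with simple arithmetic. The sketch below counts how many neighbors are retrieved in total when every round restarts from scratch, once with a fixed step Δ = k and once with doubling. The target λ = 1000 and k = 10 are made-up numbers for illustration, and the MAXDIST pruning is not modeled.

```python
def total_retrieved(target, k, grow):
    """Sum of λ over all rounds, starting at λ = k and applying `grow`
    until λ reaches the target, with every round restarting from scratch."""
    total, lam = 0, k
    while True:
        total += lam
        if lam >= target:
            return total
        lam = grow(lam)

fixed = total_retrieved(1000, 10, lambda lam: lam + 10)   # λ, λ+Δ, λ+2Δ, ...
doubled = total_retrieved(1000, 10, lambda lam: lam * 2)  # λ, 2λ, 4λ, ...
print(fixed, doubled)  # 50500 2550
```

With doubling, the sum is a geometric series, so the total stays within a small constant factor of the final λ instead of growing quadratically.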
The depth-first strategy

Performance (R-tree leaf access)
The search region is not necessarily a circle, so we cannot use the previous
method directly.

Estimate the size of the first visited MBR (at any level) that contains no
fewer than λ points.

Estimate the radius (MAXDIST) of the region that contains the MBR.
[Figure: MBR1 and its MAXDIST from query location qi]
R-tree nodes outside the circle with radius MAXDIST won't be visited.
The depth-first strategy (cont.)

Performance

Estimate the leaf access of a range query with radius MAXDIST [Korn et
al. TKDE2001]

Finally,
Summary
IKNN algorithm              Memory usage    Object visits    Leaf access
The best-first strategy     no guarantee    m × O(λ)
The depth-first strategy    O(c log_c N)    m × O(λ)
Although the best-first strategy has no guarantee on memory usage, it normally
runs faster, and the priority queue can still easily fit in the main memory
of a modern computer.
The modified depth-first strategy reaches nearly the same performance as
the best-first strategy while preserving low memory consumption.
Optimization & Extension


Considering the importance of the query locations and assigning
different weights in exploring objects.
Extension to query locations with an order specified
Experiments

12,653 trajectories (1,147,116 points) collected by the GeoLife
project

Number of query locations: 2 to 10

Tests are conducted on a PC with a 2.1GHz CPU and 1GB memory
Experiments – Node Access
Experiments – Query Time
Experiments – Memory Usage
Thank you