Optimal Top-k Generation of Attribute Combinations based on Ranked Lists
Jiaheng Lu, Renmin University of China
Joint work with Pierre Senellart, Chunbin Lin, Xiaoyong Du, Shan Wang, and Xinxing Chen
Motivation & Problem Statement
Goal: Select a combination of three players including forward, center, and guard positions.

Forward                       Center                        Guard
F1             F2             C1             C2             G1             G2
Juwan Howard   LeBron James   Chris Bosh     Eddy Curry     Dwyane Wade    Terrel Harris
(G01, 9.31)    (G02, 8.91)    (G05, 7.21)    (G01, 3.81)    (G02, 6.59)    (G09, 7.10)
(G07, 9.02)    (G08, 8.07)    (G02, 6.01)    (G06, 3.59)    (G03, 6.19)    (G03, 6.01)
(G03, 8.87)    (G05, 7.54)    (G06, 5.58)    (G04, 3.21)    (G04, 5.81)    (G04, 3.79)
(G04, 5.02)    (G10, 7.52)    (G10, 5.51)    (G07, 3.03)    (G05, 4.01)    (G08, 3.02)
(G11, 4.81)    (G03, 6.14)    (G04, 5.00)    (G09, 2.07)    (G01, 3.38)    (G05, 2.89)
(G08, 4.02)    (G01, 5.05)    (G11, 3.09)    (G11, 1.70)    (G09, 2.25)    (G02, 2.52)
(G06, 4.31)    (G04, 5.01)    (G01, 2.06)    (G10, 1.62)    (G06, 1.52)    (G01, 2.00)
(G05, 3.59)    (G09, 3.34)    (G08, 2.03)    (G02, 1.59)    (G08, 1.51)    (G10, 1.59)
……             ……             ……             ……             ……             ……
(G09, 2.06)    (G06, 3.01)    (G09, 1.98)    (G08, 1.19)    (G07, 1.00)    (G06, 1.52)
(a) Source data of three groups; each entry is (Game ID, Score)
Methods:
• Select players with the highest score in each group
• Calculate the average scores of players across all games
• ……
Limitation: these methods overlook team spirit
Motivation & Problem Statement
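Top-k,m Problem
Consider the players' combined scores in the same game:
select the top-k combinations according to their top-m aggregate scores.
• Tuple aggregation function: combines the scores that the members of a combination obtain in the same game
• Instance aggregation function: combines the top-m per-game scores of a combination into its overall score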
Motivation & Problem Statement
Forward                       Center                        Guard
F1             F2             C1             C2             G1             G2
Juwan Howard   LeBron James   Chris Bosh     Eddy Curry     Dwyane Wade    Terrel Harris
(G01, 9.31)    (G02, 8.91)    (G05, 7.21)    (G01, 3.81)    (G02, 6.59)    (G09, 7.10)
(G07, 9.02)    (G08, 8.07)    (G02, 6.01)    (G06, 3.59)    (G03, 6.19)    (G03, 6.01)
(G03, 8.87)    (G05, 7.54)    (G06, 5.58)    (G04, 3.21)    (G04, 5.81)    (G04, 3.79)
(G04, 5.02)    (G10, 7.52)    (G10, 5.51)    (G07, 3.03)    (G05, 4.01)    (G08, 3.02)
(G11, 4.81)    (G03, 6.14)    (G04, 5.00)    (G09, 2.07)    (G01, 3.38)    (G05, 2.89)
(G08, 4.02)    (G01, 5.05)    (G11, 3.09)    (G11, 1.70)    (G09, 2.25)    (G02, 2.52)
(G06, 4.31)    (G04, 5.01)    (G01, 2.06)    (G10, 1.62)    (G06, 1.52)    (G01, 2.00)
(G05, 3.59)    (G09, 3.34)    (G08, 2.03)    (G02, 1.59)    (G08, 1.51)    (G10, 1.59)
……             ……             ……             ……             ……             ……
(G09, 2.06)    (G06, 3.01)    (G09, 1.98)    (G08, 1.19)    (G07, 1.00)    (G06, 1.52)
(a) Source data of three groups

Top-1,2: select the top-1 combination of players according to top-2 aggregate scores for games where they played together.

Combination   Best game       Second-best game   …
F1C1G1        (G04, 15.83)    (G05, 14.81)       …
F1C1G2        (G04, 13.81)    (G05, 13.69)       …
F1C2G1        (G01, 16.50)    (G04, 14.04)       …
F1C2G2        (G01, 15.12)    (G07, 12.05)       …
F2C1G1        (G02, 21.51)    (G05, 18.76)       …
F2C1G2        (G05, 17.64)    (G02, 17.44)       …
F2C2G1        (G02, 17.09)    (G04, 14.03)       …
F2C2G2        (G02, 13.02)    (G09, 12.51)       …
(b) Top-2 aggregate scores for each combination
F2C1G1 is the best combination, since its top-2 aggregate score (21.51 + 18.76 = 40.27) is the highest overall.
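The top-1,2 example can be checked with a brute-force sketch. It assumes that both the tuple aggregation function and the instance aggregation function are plain sums, which matches the numbers on the slide, and it uses only the (Game ID, Score) tuples visible in the source data; the elided rows ("……") are left out. The function name top_k_m and the dictionary layout are illustrative, not part of the original work.

```python
from itertools import product

# Visible (game, score) tuples from the slide; rows hidden behind "……" are omitted.
FORWARD = {
    "F1": {"G01": 9.31, "G07": 9.02, "G03": 8.87, "G04": 5.02, "G11": 4.81,
           "G08": 4.02, "G06": 4.31, "G05": 3.59, "G09": 2.06},
    "F2": {"G02": 8.91, "G08": 8.07, "G05": 7.54, "G10": 7.52, "G03": 6.14,
           "G01": 5.05, "G04": 5.01, "G09": 3.34, "G06": 3.01},
}
CENTER = {
    "C1": {"G05": 7.21, "G02": 6.01, "G06": 5.58, "G10": 5.51, "G04": 5.00,
           "G11": 3.09, "G01": 2.06, "G08": 2.03, "G09": 1.98},
    "C2": {"G01": 3.81, "G06": 3.59, "G04": 3.21, "G07": 3.03, "G09": 2.07,
           "G11": 1.70, "G10": 1.62, "G02": 1.59, "G08": 1.19},
}
GUARD = {
    "G1": {"G02": 6.59, "G03": 6.19, "G04": 5.81, "G05": 4.01, "G01": 3.38,
           "G09": 2.25, "G06": 1.52, "G08": 1.51, "G07": 1.00},
    "G2": {"G09": 7.10, "G03": 6.01, "G04": 3.79, "G08": 3.02, "G05": 2.89,
           "G02": 2.52, "G01": 2.00, "G10": 1.59, "G06": 1.52},
}


def top_k_m(groups, k, m):
    """Brute force: score every combination by the sum of its m best shared games."""
    results = []
    for combo in product(*(g.items() for g in groups)):
        names = tuple(name for name, _ in combo)
        lists = [scores for _, scores in combo]
        # A match instance exists only for games in which all members played.
        shared = set(lists[0]).intersection(*lists[1:])
        # Tuple aggregation (assumed: sum) gives one score per shared game.
        instances = sorted((sum(l[g] for l in lists) for g in shared), reverse=True)
        # Instance aggregation (assumed: sum) over the top-m instances.
        results.append((sum(instances[:m]), names))
    results.sort(reverse=True)
    return [(names, round(score, 2)) for score, names in results[:k]]


best = top_k_m([FORWARD, CENTER, GUARD], k=1, m=2)
print(best)
```

This enumerates all eight combinations, so it only serves as a reference point for the algorithms discussed later, which aim to avoid reading every tuple.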
Difference between top-k queries and top-k,m queries

Top-k                                  Top-k,m
Return the top-k tuples                Return the top-k combinations of attributes
Can be transformed into a SQL query    Cannot be transformed into a SQL query
Application
XML keyword refinement
Example: Q = {DB; UC Irvine; 2002}
Groups: G1 = {"DB"; "database"}, G2 = {"UCI"; "UC Irvine"}, G3 = {"2002"}
Consider a top-1,2 query
Answer: Q’ = {DB, UCI, 2002}
Application (Cont.)
• Evidence combination mining in medical databases
• Package recommendation systems
• …
Outline
Motivation & Problem Statement
Top-k,m Query Processing
Experimental Results
Conclusion
Top-k,m Query Processing
Access Model
• Sorted accesses: read a ranked list sequentially, from the highest score downwards
• Random accesses: look up the score of a given ID in a list directly
[Figure: ranked lists of (ID, score) tuples, showing (a, 9.0), (b, 8.7), (c, 8.7), (i, 8.8), (c, 7.9), (a, 7.5), (d, 7.4), ……, (f, 6.9), ……, (i, 5.3), (d, 4.7)]
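The two access modes can be illustrated with a minimal sketch of a ranked list that counts both kinds of access. The class name RankedList, its method names, and the sample values are hypothetical and chosen only for illustration; the slides do not prescribe an interface.

```python
class RankedList:
    """A list of (id, score) tuples kept in descending score order."""

    def __init__(self, tuples):
        self._tuples = sorted(tuples, key=lambda t: t[1], reverse=True)
        self._index = dict(self._tuples)   # id -> score, used for random access
        self._cursor = 0                   # position of the next sorted access
        self.sorted_accesses = 0
        self.random_accesses = 0

    def sorted_access(self):
        """Return the next (id, score) in rank order, or None when exhausted."""
        if self._cursor >= len(self._tuples):
            return None
        self.sorted_accesses += 1
        item = self._tuples[self._cursor]
        self._cursor += 1
        return item

    def random_access(self, item_id):
        """Return the score of item_id, or None if the list has no such tuple."""
        self.random_accesses += 1
        return self._index.get(item_id)


# Illustrative values only.
lst = RankedList([("a", 9.0), ("c", 7.9), ("d", 4.7)])
first = lst.sorted_access()        # highest-ranked tuple
score_d = lst.random_access("d")   # direct lookup by ID
```

Counting the two access types separately makes it possible to compare algorithms by number of accesses, which is one of the measures reported in the experiments.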
Top-k,m Query Processing
Baseline Method: ETA
• Compute top-m tuples for each combination, using the Threshold Algorithm (TA)
• Calculate aggregate score for each combination
• Return the top-k combinations

[Figure: three groups of ranked lists and the combinations built from them]
G1:
A1         A2
(a, 10)    (c, 8.3)
(b, 7.4)   (d, 7.1)
……         ……
G2 (lists B1, B2): (b, 9), (a, 8), (e, 8), (f, 7), ……
G3 (lists C1, C2): (b, 9), (a, 8), (e, 8), (f, 7), ……
Combinations: (A1,B1,C1), (A1,B1,C2), …...
Values shown next to the combinations: 100, …..., 105
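The baseline can be sketched as follows. This is a simplified reading of ETA, assuming sum is used for both aggregation functions and that a match instance exists only for IDs present in every list of a combination. The function names ta_top_m and eta are illustrative, and the sample lists follow the two-group ULA example later in the deck.

```python
import heapq
from itertools import product


def ta_top_m(lists, m):
    """Threshold Algorithm: the m best match-instance scores of one combination.

    lists: one ranked list per group, each [(id, score), ...] in descending score order.
    """
    lookup = [dict(l) for l in lists]   # random-access index per list
    seen = {}                           # id -> score of a complete match instance
    depth = 0
    # Once any list is exhausted, every possible instance has been resolved.
    while all(depth < len(l) for l in lists):
        for l in lists:                 # one sorted access per list
            item_id, _ = l[depth]
            if item_id not in seen and all(item_id in idx for idx in lookup):
                seen[item_id] = sum(idx[item_id] for idx in lookup)  # random accesses
        threshold = sum(l[depth][1] for l in lists)
        top = heapq.nlargest(m, seen.values())
        if len(top) == m and top[-1] >= threshold:
            break                       # no unseen instance can enter the top-m
        depth += 1
    return heapq.nlargest(m, seen.values())


def eta(groups, k, m):
    """Baseline: run TA on every combination, then keep the k best totals."""
    scored = []
    for combo in product(*(g.items() for g in groups)):
        names = tuple(name for name, _ in combo)
        total = sum(ta_top_m([l for _, l in combo], m))
        scored.append((round(total, 1), names))
    return heapq.nlargest(k, scored)


GROUP_1 = {
    "A1": [("G1", 9.3), ("G2", 8.3), ("G8", 3.0), ("G4", 2.6)],
    "A2": [("G5", 7.8), ("G11", 7.3), ("G4", 1.8), ("G2", 1.5)],
}
GROUP_2 = {
    "B1": [("G2", 7.9), ("G1", 7.0), ("G11", 4.2), ("G5", 3.3)],
    "B2": [("G4", 8.0), ("G8", 7.3), ("G2", 4.4), ("G1", 2.3)],
}

result = eta([GROUP_1, GROUP_2], k=1, m=2)
print(result)
```

Because TA is run to completion for every combination, the work grows with the number of combinations, which is the cost the bound-based algorithms try to avoid.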
Top-k,m Query Processing
Upper and Lower bounds Algorithm: ULA
Compute the upper and lower bounds for each combination:
• Lower Bound: consider top-m seen match instances
• Upper Bound: consider threshold value and top-m match instances
Termination condition: k combinations meet the hit-condition
Upper and Lower bounds Algorithm: ULA (example)

Group 1                    Group 2
A1          A2             B1          B2
(G1, 9.3)   (G5, 7.8)      (G2, 7.9)   (G4, 8.0)
(G2, 8.3)   (G11, 7.3)     (G1, 7.0)   (G8, 7.3)
(G8, 3.0)   (G4, 1.8)      (G11, 4.2)  (G2, 4.4)
(G4, 2.6)   (G2, 1.5)      (G5, 3.3)   (G1, 2.3)

Bounds shown for each combination (U: upper bound, L: lower bound):
Combination   U      L
A1B1          34.4   32.5
A1B2          34.6   22.2
A2B1          31.4   20.5
A2B2          31.6   9.8
A1B1 L: 32.5, from (G1, 9.3+7.0) and (G2, 7.9+8.3)
A2B1 U: 31.4, from threshold value = 7.8+7.9 = 15.7, 15.7*2 = 31.4

Updated bounds on the following slide:
Combination   U      L
A1B1          32.5   32.5
A1B2          31.2   22.2
Result: A1B1
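The bounds in this example can be reproduced with a short sketch. It is one reading of the slides rather than the authors' exact procedure: after a given number of sorted accesses per list, each seen ID is resolved by random access in the other list, the lower bound sums the top-m seen match instances, and the upper bound sums the top-m values among the seen instances plus m copies of the threshold. The function name bounds and the list layout are assumptions; m = 2 is inferred from the "15.7*2" annotation.

```python
import heapq

A1 = [("G1", 9.3), ("G2", 8.3), ("G8", 3.0), ("G4", 2.6)]
A2 = [("G5", 7.8), ("G11", 7.3), ("G4", 1.8), ("G2", 1.5)]
B1 = [("G2", 7.9), ("G1", 7.0), ("G11", 4.2), ("G5", 3.3)]
B2 = [("G4", 8.0), ("G8", 7.3), ("G2", 4.4), ("G1", 2.3)]


def bounds(list_a, list_b, depth, m):
    """Return (lower, upper) for a two-list combination after `depth` sorted accesses per list."""
    index_a, index_b = dict(list_a), dict(list_b)
    seen = {}
    # Each ID met under sorted access is completed by a random access to the other list.
    for item_id, _ in list_a[:depth] + list_b[:depth]:
        if item_id in index_a and item_id in index_b:
            seen[item_id] = index_a[item_id] + index_b[item_id]
    # Threshold: sum of the last scores read under sorted access.
    threshold = list_a[depth - 1][1] + list_b[depth - 1][1]
    lower = sum(heapq.nlargest(m, seen.values()))
    upper = sum(heapq.nlargest(m, list(seen.values()) + [threshold] * m))
    return round(lower, 1), round(upper, 1)


for name, (la, lb) in {"A1B1": (A1, B1), "A1B2": (A1, B2),
                       "A2B1": (A2, B1), "A2B2": (A2, B2)}.items():
    print(name, bounds(la, lb, depth=1, m=2))
```

With depth 1 this yields the first set of bounds listed above, and bounds(A1, B1, 2, 2) gives 32.5 for both bounds of A1B1. At that point the lower bound of A1B1 (32.5) is at least as large as the upper bounds of A2B1 (31.4), A2B2 (31.6), and the updated A1B2 (31.2).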
Can we run fast?
Optimization heuristics (1)
Pruning combinations without computing the bounds
(A3,B2) is dominated by (A2,B1), since 6.3 < 7.1 and 8.0 < 8.2
Optimization heuristics (2)
Reducing the number of accesses
Avoiding both sorted and random accesses for specific lists

G1:
A1         A2         A3
(a, 10)    (c, 8.3)   (a, 6.3)
(b, 7.4)   (d, 7.1)   (d, 4)
……         ……         ……
G2 (lists B1, B2): (b, 9), (a, 8), (e, 8), (f, 7), ……

If (A1,B1) and (A1,B2) cannot be part of the answers, all sorted accesses and random accesses on list A1 are unnecessary.
Optimization heuristics (3)
Reducing the number of accesses
Reducing random accesses across two lists

G1:
A1         A2
(a, 10)    (c, 8.3)
(b, 7.4)   (d, 7.1)
……         ……
G2 (lists B1, B2): (b, 9), (a, 8), (e, 8), (f, 7), ……
G3:
C1         C2
(e, 9)     (k, 8)
(d, 7)     (f, 7)
……         ……

If (A1,B1,C1) and (A1,B1,C2) cannot be part of the answers, random accesses between A1 and B1 are unnecessary.
Optimization heuristics (4)
Reducing the number of accesses
Eliminating random accesses for specific tuples
Random access from Le to Lt for tuple x is useless
Top-k,m Query Processing
ULA+
Repeat until k combinations meet the hit-condition:
• Compute upper and lower bounds for unterminated combinations
• Prune dominated combinations
• Terminate combinations by reducing the number of accesses
Interesting theoretical results
Optimality properties
Instance Optimality

If wild guesses are not allowed, and the size of each group is treated as a constant, then ULA and ULA+ are instance-optimal:
there exist two constants a and b such that, for every instance and every other algorithm A’, cost(A) <= a*cost(A’) + b.
The upper bound of the optimality ratio is tight.
Interesting theoretical results (Cont.)
Optimality properties
No Instance Optimal Algorithms

If wild guesses are allowed, then there is no deterministic algorithm that is instance-optimal.
Experimental Results
Experimental Setup
Language: Java; OS: Windows XP; CPU: 2.0GHz; Disk: 320GB
[Table: data sets]
Experimental Results
Experimental results on NBA and YQL datasets
ULA+ outperforms ETA by 1-2 orders of magnitude in both running time and number of accesses.
Experimental Results
Performance of optimization to reduce combinations
More than 60% of the combinations are pruned without computing their bounds.
Experimental Results
Performance of different optimizations
The combination of all optimizations has the most powerful pruning capability.
Experimental Results
Experimental results on XML DBLP dataset
XULA and XULA+ perform better than XETA and scale well in both running time and number of accesses.
Related Works
Top-k with both random and sorted accesses:
• U. Güntzer et al., VLDB 2000
• S. Nepal et al., ICDE 1999
• R. Fagin et al., PODS 2001
Top-k with only sorted accesses:
• R. Fagin et al., JCSS 2003
• N. Mamoulis et al., TODS 2007
• R. Fagin et al., PODS 2001
Related Works
Topics:
• Top-k with sorted access on restricted lists
• Top-k with no need for exact aggregate score
• Ad-hoc top-k queries
References:
• N. Bruno et al., ICDE 2002
• K. C. C. Chang et al., SIGMOD 2002
• I. F. Ilyas et al., VLDB 2002
• C. Li et al., SIGMOD 2006
• M. L. Yiu et al., DKE 2008
Related Works
Top-k package recommendation:
• T. Deng, W. Fan, and F. Geerts, On the Complexity of Package Recommendation Problems, PODS 2012
Conclusion
• Proposed a new problem called top-k,m query evaluation
• Developed a family of efficient algorithms, including ULA and ULA+
• Studied the optimality properties of our algorithms
• Applied top-k,m queries to the context of XML keyword query refinement