Transcript Document
SPARK: Top-k Keyword Query in Relational Databases
Wei Wang
University of New South Wales, Australia
17/07/2015
Outline
• Demo & Introduction
• Ranking
• Query Evaluation
• Conclusions
Demo
Demo …
SPARK I: Searching, Probing & Ranking Top-k Results
• Thesis project (2004 – 2005) with Nino Svonja
• Taste of Research Summer Scholarship (2005)
• Finally, CISRA prize winner
• http://www.computing.unsw.edu.au/softwareengineering.php
SPARK II
Continued as a research project with PhD student Yi Luo
• 2005 – 2006
• SIGMOD 2007 paper
• Still under active development
A Motivating Example
A Motivating Example …
Top-3 results in our system:
1. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1) ⋈ ActorPlay: Character = Himself ⋈ Actors: Hanks, Tom
3. Actors: John Hanks ⋈ ActorPlay: Character = Alexander Kerst ⋈ Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)
Improving the Effectiveness
Three factors contribute to the final score of a search result (a joined tuple tree):
• the (modified) IR ranking score
• the completeness factor
• the size normalization factor
Preliminaries
Data Model & Query Model
• Relation-based
• Joined tuple trees (JTTs)
Sophisticated ranking
• addresses one flaw in previous approaches
• unifies AND and OR semantics
• alternative size normalization
Problems with DISCOVER2

DISCOVER2 scores each tuple with a pivoted-normalization IR formula and sums the tuple scores over the joined tuple tree:

  score(t) = Σ_{w ∈ Q∩t} [ (1 + ln(1 + ln tf)) / ((1 − s) + s·dl/avdl) ] · qtf · ln((N + 1)/df)
  score(c ⋈ p) = score(c) + score(p)

Both results below get the same DISCOVER2 score of 2.0, although c2 ⋈ p2 matches only one of the two query keywords (signature (0, 2)); SPARK’s score separates them:

  result     score(ci)  score(pj)  score  signature  SPARK
  c1 ⋈ p1    1.0        1.0        2.0    (1, 1)     0.98
  c2 ⋈ p2    1.0        1.0        2.0    (0, 2)     0.44
Virtual Document

Treat the whole joined tuple tree ci ⋈ pj as one virtual document: combine tf contributions before tf normalization / attenuation.

  score = Σ_{w ∈ Q∩D} (1 + ln(1 + ln tf)) · ln((N + 1)/df)

  result     score(maxtor)  score(netvista)  score_a*
  c1 ⋈ p1    1.00           1.00             2.00
  c2 ⋈ p2    0.00           1.53             1.53
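The effect of combining tf before attenuation can be checked with a tiny sketch. The attenuation function 1 + ln(1 + ln tf) comes from the formula above; the tf values match the c2 ⋈ p2 example, where “netvista” occurs once in each of the two tuples:

```python
import math

def attenuate(tf):
    # tf attenuation from the scoring formula: 1 + ln(1 + ln(tf)); 0 if absent
    # (tf is an integer occurrence count here)
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0

# "netvista" occurs once in c2 and once in p2
per_tuple = attenuate(1) + attenuate(1)  # DISCOVER2: attenuate per tuple, then sum
virtual = attenuate(1 + 1)               # SPARK: sum tfs over the JTT, then attenuate

print(round(per_tuple, 2))  # 2.0
print(round(virtual, 2))    # 1.53
```

The sub-linear attenuation rewards a keyword that is new to the result more than a repeated one, which is exactly why c2 ⋈ p2 drops from 2.0 to 1.53.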
Virtual Document Collection

  score_a = Σ_{w ∈ Q∩D} [ (1 + ln(1 + ln tf)) / ((1 − s) + s·dl/avdl) ] · qtf · ln((N + 1)/df)

Exact statistics of the virtual document collection (3 results here):
• idf_netvista = ln(4/3)
• idf_maxtor = ln(4/2)

Estimate the statistics instead of materializing all the joins:
• idf_netvista = ln( 1 / (1 − (1 − 1/3)(1 − 1/3)) ) = ln(9/5)
• Estimate avdl = avdl_C + avdl_P

  result     score_a
  c1 ⋈ p1    0.98
  c2 ⋈ p2    0.44
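The idf estimate above can be sketched with a small helper (illustrative, not the paper's exact code: it assumes a keyword's occurrences are independent across the joined relations, with each fraction being df_i/N_i, the share of matching tuples in relation i):

```python
import math

def estimated_idf(fractions):
    """Estimate a keyword's idf over the virtual collection of joined results:
    P(result contains w) = 1 - prod(1 - f_i), with f_i = df_i / N_i per relation."""
    p_absent = 1.0
    for f in fractions:
        p_absent *= 1.0 - f
    return math.log(1.0 / (1.0 - p_absent))

# "netvista" matches 1 of 3 C-tuples and 1 of 3 P-tuples -> ln(9/5)
print(round(estimated_idf([1 / 3, 1 / 3]), 4))  # 0.5878
```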
Completeness Factor

For “short queries”, users prefer results matching more keywords.

Derive the completeness factor from the extended Boolean model:
• place each result in the keyword space (one axis per keyword)
• measure its Lp distance to the ideal position (1, 1, …, 1)

[Figure: results in the netvista–maxtor plane; ideal position at (1, 1); with L2 distance, d(c1 ⋈ p1) = 0.5, d(c2 ⋈ p2) = 1, maximum distance √2 ≈ 1.41]

  result     score_b
  c1 ⋈ p1    (1.41 − 0.5) / 1.41 = 0.65
  c2 ⋈ p2    (1.41 − 1) / 1.41 = 0.29
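A minimal sketch of the completeness factor, assuming each result sits at normalized per-keyword weights in [0, 1] (the coordinates below are chosen to reproduce the distances in the example, not taken from the paper):

```python
def completeness(coords, p=2.0):
    """score_b: Lp distance to the ideal position (1, ..., 1), normalized by
    the maximum possible distance m^(1/p) for m keywords."""
    m = len(coords)
    d = sum((1.0 - x) ** p for x in coords) ** (1.0 / p)
    d_max = m ** (1.0 / p)
    return (d_max - d) / d_max

print(round(completeness([1.0, 0.5]), 2))  # d = 0.5 -> 0.65
print(round(completeness([0.0, 1.0]), 2))  # d = 1.0 -> 0.29
```

Raising p makes the factor stricter about missing keywords, which is what the p = 1.0 / 1.4 / 2.0 columns in the evaluation vary.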
Size Normalization

Results in large CNs tend to have more matches to the keywords, so penalize size:

  score_c = (1 + s1 − s1·|CN|) · (1 + s2 − s2·|CNnf|)

• Empirically, s1 = 0.15 and s2 = 1 / (|Q| + 1) work well.
Putting ‘em Together

  score(JTT) = score_a · score_b · score_c

• score_a: IR score of the virtual document
• score_b: completeness factor
• score_c: size normalization factor

  result     score_a · score_b
  c1 ⋈ p1    0.98 · 0.65 = 0.64
  c2 ⋈ p2    0.44 · 0.29 = 0.13
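The three factors combine multiplicatively; a sketch under the formulas above (size_norm follows the score_c definition from the previous slide; the score_a and score_b values are the ones computed in the running example):

```python
def size_norm(cn_size, cn_nonfree, num_keywords, s1=0.15):
    # score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CNnf|), with s2 = 1/(|Q|+1)
    s2 = 1.0 / (num_keywords + 1)
    return (1.0 + s1 - s1 * cn_size) * (1.0 + s2 - s2 * cn_nonfree)

def final_score(score_a, score_b, score_c):
    return score_a * score_b * score_c

# both results come from the same 2-relation CN (2 non-free tuple sets),
# so score_c is a common factor and the ranking follows score_a * score_b
score_c = size_norm(cn_size=2, cn_nonfree=2, num_keywords=2)
for name, a, b in [("c1xp1", 0.98, 0.65), ("c2xp2", 0.44, 0.29)]:
    print(name, round(a * b, 2), round(final_score(a, b, score_c), 3))
```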
Comparing Top-1 Results
DBLP; Query = “nikos clique”
#Rel and R-Rank Results

DBLP; 18 queries; union of top-20 results:

            DISCOVER2  [Liu et al, SIGMOD06]  p = 1.0  p = 1.4  p = 2.0
  #Rel      2          2                      16       16       18
  R-Rank    0.243      0.333                  0.926    0.935    1.000

Mondial; 35 queries; union of top-20 results:

            DISCOVER2  [Liu et al, SIGMOD06]  p = 1.0  p = 1.4  p = 2.0
  #Rel      2          10                     27       29       34
  R-Rank    0.276      0.491                  0.881    0.909    0.986
Query Processing
3 steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
• Most algorithms differ here.
• The key is how to optimize for top-k retrieval.
Monotonic Scoring Function

Execute a CN. Assume idf_netvista > idf_maxtor and k = 1. CN: P^Q ⋈ C^Q, with {P1, P2, …} and {C1, C2, …} sorted by IR score.

  result     score(ci)  score(pj)  score
  c1 ⋈ p1    1.06       0.97       2.03
  c2 ⋈ p2    1.06       1.06       2.12

With DISCOVER2’s monotonic scoring function, c1 ⋈ p1 < c2 ⋈ p2 holds per input and therefore overall, so execution over the sorted lists can stop early.
Non-Monotonic Scoring Function

Execute a CN. Assume idf_netvista > idf_maxtor and k = 1. CN: P^Q ⋈ C^Q.

  result     score(ci)  score(pj)  score_a
  c1 ⋈ p1    1.06       0.97       0.98
  c2 ⋈ p2    1.06       1.06       0.44

SPARK’s score_a is non-monotonic: c2 ⋈ p2 is at least as good on every input, yet its final score is lower, so the usual early-stopping argument breaks. Two problems to solve:
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
Upper Bounding Function

Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function.

Details:
• sumidf = Σ_w idf_w
• watf(t) = (1/sumidf) · Σ_w tf_w(t) · idf_w; the bound is monotonic w.r.t. watf(t)
• A = sumidf · (1 + ln(1 + ln(Σ_t watf(t))))
• B = sumidf · Σ_t watf(t)
• then score_a ≤ uscore_a = (1/(1 − s)) · min(A, B)
• score_b and score_c are constants given the CN, so score ≤ uscore
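A sketch of the bound under the definitions above (s is the pivot parameter, assumed 0.2 here; tuples are encoded as {word: tf} dicts purely for illustration):

```python
import math

def uscore_a(tuples, idf, s=0.2):
    """uscore_a = (1/(1-s)) * min(A, B), with
    A = sumidf * (1 + ln(1 + ln(sum_t watf(t)))) and B = sumidf * sum_t watf(t).
    A and B each upper-bound score_a on their own; min picks the tighter one."""
    sumidf = sum(idf.values())
    def watf(t):
        return sum(tf * idf.get(w, 0.0) for w, tf in t.items()) / sumidf
    total = sum(watf(t) for t in tuples)
    b = sumidf * total
    if total > 1.0:
        a = sumidf * (1.0 + math.log(1.0 + math.log(total)))
    else:
        a = b  # A's log term may be undefined for small totals; B alone still bounds
    return min(a, b) / (1.0 - s)

idf = {"netvista": 0.59, "maxtor": 0.69}
u = uscore_a([{"netvista": 1}, {"netvista": 1, "maxtor": 1}], idf)
```

Because watf(t) only grows as tuples or tfs are added, the bound is monotonic over the sorted lists even though score_a itself is not.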
Early Stopping Criterion

Execute a CN. Assume idf_netvista > idf_maxtor and k = 1. CN: P^Q ⋈ C^Q.

  result     uscore  score_a
  c1 ⋈ p1    1.13    0.98
  c2 ⋈ p2    1.76    0.44

Since score(·) ≤ uscore(·) and uscore is monotonic, execution can stop as soon as the current top-k score is no less than the uscore of every unchecked candidate.

1) Re-establish the early stopping criterion ✓
2) Check candidates in an optimal order
Query Processing …

Execute the CNs [VLDB 03]. CN: P^Q ⋈ C^Q.
• {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.
• score(Pi ⋈ Cj) = score(Pi) + score(Cj)

Operations (a parametric SQL query is sent to the DBMS for each):
  [P1, P1] ⋈ [C1, C1]
  C.get_next():  [P1, P1] ⋈ C2
  P.get_next():  P2 ⋈ [C1, C2]
  P.get_next():  P3 ⋈ [C1, C2]
  …
Skyline Sweeping Algorithm

Dominance: uscore(<Pi, Cj>) ≥ uscore(<Pi+1, Cj>) and uscore(<Pi, Cj>) ≥ uscore(<Pi, Cj+1>)

Execute the CNs. CN: P^Q ⋈ C^Q. Keep a priority queue of candidate pairs ordered by uscore; pop and check the head, then push its successors:

  Operation    Priority queue (after the operation)
  P1 ⋈ C1      <P1, C1>
  P2 ⋈ C1      <P2, C1>, <P1, C2>
  P3 ⋈ C1      <P3, C1>, <P1, C2>, <P2, C2>
               <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2>
  …

1) Re-establish the early stopping criterion ✓
2) Check candidates in an optimal order ✓ (sort of)
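The queue-driven loop can be sketched as follows. Here uscore(i, j) stands for the monotonic upper bound of the pair <Pi, Cj> and check(i, j) stands for probing the database, returning the real score or None when the tuples do not join; both names are illustrative, not from the paper:

```python
import heapq

def skyline_sweep(n_p, n_c, uscore, check, k):
    """Skyline sweeping over one CN with sorted lists P[0..n_p) and C[0..n_c).
    Dominance lets us enqueue only the successors of each popped pair."""
    top = []                        # min-heap of the k best real scores
    heap = [(-uscore(0, 0), 0, 0)]  # max-queue via negated uscore
    seen = {(0, 0)}
    while heap:
        neg_u, i, j = heapq.heappop(heap)
        if len(top) == k and -neg_u <= top[0]:
            break                   # early stop: no unchecked pair can enter top-k
        s = check(i, j)
        if s is not None:
            heapq.heappush(top, s)
            if len(top) > k:
                heapq.heappop(top)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_p and nj < n_c and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(ni, nj), ni, nj))
    return sorted(top, reverse=True)

# toy run: additive scores, so uscore equals the real score here
p, c = [1.0, 0.9, 0.8], [1.0, 0.8]
print(skyline_sweep(len(p), len(c), lambda i, j: p[i] + c[j],
                    lambda i, j: p[i] + c[j], 1))  # [2.0]
```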
Block Pipeline Algorithm

There is an inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions:
• lots of candidates with high uscores return much lower (real) scores
• unnecessary (expensive) checking
• cannot stop early

Idea:
• partition the space (into blocks) and derive a tighter upper bound for each partition
• be “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)
Block Pipeline Algorithm …

Execute a CN. Assume idf_n > idf_m and k = 1. CN: P^Q ⋈ C^Q. Candidates are partitioned into blocks by keyword signature, e.g. (n:1, m:0) and (n:0, m:1).

[Figure: blocks with their uscore, bscore, and real score_a values; blocks whose bscore (1.05, 0.95) is far below their uscore (2.74, 2.50) are deferred, and execution stops once a checked candidate’s score_a (2.63) is no less than every remaining bscore]

1) Re-establish the early stopping criterion ✓
2) Check candidates in an optimal order ✓
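One plausible sketch of why a per-block bound can be much tighter (illustrative only, not the paper's exact bscore: within a block the per-keyword tf totals are fixed by the signature, so the non-monotonic attenuation can be applied per keyword exactly instead of being bounded through the aggregated watf):

```python
import math

def attenuate(tf):
    # 1 + ln(1 + ln(tf)) for integer tf >= 1, else 0
    return 1.0 + math.log(1.0 + math.log(tf)) if tf >= 1 else 0.0

def bscore(signature, idf, s=0.2):
    """Per-block upper bound: signature maps keyword -> total tf in the block,
    e.g. (n:0, m:2) -> {"m": 2}. s is the pivot parameter, assumed 0.2."""
    return sum(attenuate(tf) * idf[w] for w, tf in signature.items()) / (1.0 - s)

idf = {"n": 1.2, "m": 0.8}
print(round(bscore({"m": 2}, idf), 3))          # block matching only "m": low prospect
print(round(bscore({"n": 1, "m": 1}, idf), 3))  # block matching both keywords
```

A block that matches only the low-idf keyword gets a visibly lower bscore than a block matching both keywords, so it can be deferred or skipped entirely.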
Efficiency

DBLP
• ~0.9M tuples in total
• k = 10
• PC: 1.8 GHz, 512 MB RAM
Efficiency …
DBLP, DQ13
Conclusions

A system that performs effective & efficient keyword search on relational databases:
• meaningful query results with appropriate rankings
• second-level response time for a ~10M-tuple DB (IMDB data) on a commodity PC
Q&A
Thank you.