Transcript Document

SPARK: Top-k Keyword Query in Relational Databases
Wei Wang
University of New South Wales, Australia
17/07/2015

Outline
• Demo & Introduction
• Ranking
• Query Evaluation
• Conclusions

Demo

Demo …

SPARK I
Searching, Probing & Ranking Top-k Results
• Thesis project (2004 – 2005) with Nino Svonja
• Taste of Research Summer Scholarship (2005)
• Finally, CISRA prize winner
• http://www.computing.unsw.edu.au/softwareengineering.php

SPARK II
• Continued as a research project with PhD student Yi Luo
• 2005 – 2006
• SIGMOD 2007 paper
• Still under active development

A Motivating Example

A Motivating Example …
Top-3 results in our system:
1. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1) ⋈ ActorPlay: Character = Himself ⋈ Actors: Hanks, Tom
3. Actors: John Hanks ⋈ ActorPlay: Character = Alexander Kerst ⋈ Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)

Improving the Effectiveness
Three factors contribute to the final score of a search result (a joined tuple tree):
• the (modified) IR ranking score
• the completeness factor
• the size normalization factor

Preliminaries
• Data Model
• Query Model
  • Relation-based
  • Joined tuple trees (JTTs)
  • Sophisticated ranking
    • addresses one flaw in previous approaches
    • unifies AND and OR semantics
    • alternative size normalization

Problems with DISCOVER2
DISCOVER2 scores each tuple with a pivoted-normalization IR function and scores a joined result by summing its tuples' scores:

  score(D, Q) = \sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln tf)}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln \frac{N + 1}{df}

which, ignoring length normalization, reduces to \sum_{t \in Q \cap D} (1 + \ln(1 + \ln tf)) \cdot \ln \frac{N + 1}{df}.

Example (query: netvista maxtor):

  Result     score(ci)  score(pj)  score  signature  SPARK
  c1 ⋈ p1    1.0        1.0        2.0    (1, 1)     0.98
  c2 ⋈ p2    1.0        1.0        2.0    (0, 2)     0.44

Both results receive the same DISCOVER2 score (2.0), even though c2 ⋈ p2 matches only one of the two keywords (twice, signature (0, 2)); SPARK's ranking separates them.

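A direct Python transcription of the per-term contribution may make the notation easier to follow (a sketch: argument names mirror the formula's symbols, the slope s = 0.2 is an assumed default, and tf >= 1 is required by the double log):

import math

def discover2_term_score(tf, qtf, df, N, dl, avdl, s=0.2):
    # One term's contribution to a tuple's IR score under pivoted
    # normalization: attenuated tf / length norm * qtf * idf.
    tf_part = 1 + math.log(1 + math.log(tf))   # requires tf >= 1
    norm = (1 - s) + s * dl / avdl             # pivoted length normalization
    idf = math.log((N + 1) / df)
    return tf_part / norm * qtf * idf

DISCOVER2 then adds these tuple scores up across the joined result, which is exactly what makes its aggregate function monotonic (and what the next slides take issue with).
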
Virtual Document
Combine tf contributions before tf normalization / attenuation, i.e., treat the whole joined result ci ⋈ pj as one virtual document:

  score_a*(D) = \sum_{t \in Q \cap D} \big(1 + \ln(1 + \ln tf)\big) \cdot \ln \frac{N + 1}{df}

where tf now counts a term's occurrences across all tuples in the result. Per-keyword contributions (idf factored out):

  Result     score(maxtor)  score(netvista)  score_a*
  c1 ⋈ p1    1.00           1.00             2.00
  c2 ⋈ p2    0.00           1.53             1.53

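To make the contrast concrete, a minimal Python sketch (idf and length normalization omitted; the per-tuple tf dictionaries are reconstructed from the running example):

import math

def attenuate(tf):
    # Sub-linear tf attenuation: 1 + ln(1 + ln(tf)) for tf >= 1.
    return 1 + math.log(1 + math.log(tf)) if tf >= 1 else 0.0

def score_per_tuple(tuple_tfs):
    # DISCOVER2 style: attenuate each tuple's tf separately, then sum.
    return sum(attenuate(tf) for tfs in tuple_tfs for tf in tfs.values())

def score_virtual_doc(tuple_tfs):
    # SPARK style: merge tf counts across tuples first, attenuate once.
    merged = {}
    for tfs in tuple_tfs:
        for term, tf in tfs.items():
            merged[term] = merged.get(term, 0) + tf
    return sum(attenuate(tf) for tf in merged.values())

r1 = [{"netvista": 1}, {"maxtor": 1}]    # c1 ⋈ p1: both keywords once
r2 = [{"netvista": 1}, {"netvista": 1}]  # c2 ⋈ p2: one keyword twice
print(score_per_tuple(r1), score_per_tuple(r2))      # 2.00 2.00 -- a tie
print(score_virtual_doc(r1), score_virtual_doc(r2))  # 2.00 1.53 -- tie broken
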
Virtual Document Collection
The full score_a is the pivoted-normalization score of the virtual document against the virtual document collection (all results of the CN):

  score_a(D) = \sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln tf)}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln \frac{N + 1}{df}

The collection is never materialized, so its statistics are estimated from the base relations:
• If the collection (3 results) were materialized: idf_netvista = ln(4/3), idf_maxtor = ln(4/2)
• Estimated idf (assuming keyword occurrences are independent across the joined relations), e.g.
  idf_netvista ≈ ln( 1 / (1 − (1 − 1/3)(1 − 1/3)) ) = ln(9/5)
• Estimate avdl = avdl_C + avdl_P

  Result     score_a
  c1 ⋈ p1    0.98
  c2 ⋈ p2    0.44

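The estimate is easy to state in code (a sketch; the statistics, one occurrence in 1 of 3 tuples of each joined relation, are the hypothetical values behind the ln(9/5) above):

import math

def estimated_idf(dfs, sizes):
    # Probability that a joined result contains the keyword, assuming
    # independent occurrence in each joined relation:
    #   p = 1 - prod(1 - df_i / N_i),  idf = ln(1 / p)
    p_miss = 1.0
    for df, n in zip(dfs, sizes):
        p_miss *= 1 - df / n
    return math.log(1 / (1 - p_miss))

print(estimated_idf(dfs=[1, 1], sizes=[3, 3]))  # ln(9/5) ≈ 0.588
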
Completeness Factor
For “short queries”, users prefer results matching more keywords. Derive a completeness factor from the extended Boolean model:
• Place each result in the space spanned by the keywords (axes netvista and maxtor); the ideal position (1, 1) matches both keywords fully.
• Measure the Lp distance to the ideal position (L2 in the figure): c1 ⋈ p1 lies at d = 0.5, c2 ⋈ p2 at d = 1, and the farthest possible point at d = 1.41.

  Result     score_b
  c1 ⋈ p1    (1.41 − 0.5) / 1.41 = 0.65
  c2 ⋈ p2    (1.41 − 1) / 1.41 = 0.29

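A minimal sketch of score_b, assuming the two results sit at the hypothetical coordinates (1, 0.5) and (1, 0) that reproduce the distances above:

def completeness(point, ideal=(1.0, 1.0), p=2.0):
    # L_p distance to the ideal position, normalized so that the ideal
    # scores 1 and the farthest corner (the origin: no match) scores 0.
    d = sum(abs(i - x) ** p for x, i in zip(point, ideal)) ** (1 / p)
    d_max = sum(i ** p for i in ideal) ** (1 / p)
    return (d_max - d) / d_max

print(completeness((1.0, 0.5)))  # c1 ⋈ p1 at d = 0.5 -> ~0.65
print(completeness((1.0, 0.0)))  # c2 ⋈ p2 at d = 1   -> ~0.29

The parameter p interpolates between OR-like (p = 1) and AND-like (large p) semantics, which is the p varied in the effectiveness results later.
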
Size Normalization
Results in large CNs tend to have more matches to the keywords, so larger results are penalized (worked example below):

  score_c = (1 + s1 − s1·|CN|) · (1 + s2 − s2·|CNnf|)

• |CN| is the size of the candidate network; |CNnf| counts its non-free tuple sets.
• Empirically, s1 = 0.15 and s2 = 1 / (|Q| + 1) work well.

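As a quick worked example (assuming both tuples of the CN are non-free): for a two-keyword query, s2 = 1/3, so a size-2 CN gets score_c = (1 + 0.15 − 0.15·2) · (1 + 1/3 − 1/3·2) = 0.85 · 2/3 ≈ 0.57, while a single-tuple result keeps score_c = 1.
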
Putting ‘em Together
score(JTT) = score_a · score_b · score_c
• score_a: IR score of the virtual document
• score_b: completeness factor
• score_c: size normalization factor

  Result     score_a · score_b
  c1 ⋈ p1    0.98 · 0.65 = 0.64
  c2 ⋈ p2    0.44 · 0.29 = 0.13

Comparing Top-1 Results
DBLP; Query = “nikos clique”

#Rel and R-Rank Results
DBLP; 18 queries; union of top-20 results:

  Method                            #Rel   R-Rank
  DISCOVER2 [Liu et al, SIGMOD06]     2    ≥ 0.243
  DISCOVER2 (p = 1.4)                 2    ≥ 0.333
  SPARK (p = 1.0)                    16      0.926
  SPARK (p = 1.4)                    16      0.935
  SPARK (p = 2.0)                    18      1.000

Mondial; 35 queries; union of top-20 results:

  Method                            #Rel   R-Rank
  DISCOVER2 [Liu et al, SIGMOD06]     2    ≥ 0.276
  DISCOVER2 (p = 1.4)                10    ≥ 0.491
  SPARK (p = 1.0)                    27      0.881
  SPARK (p = 1.4)                    29      0.909
  SPARK (p = 2.0)                    34      0.986

Query Processing
3 steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
   • Most algorithms differ here.
   • The key is how to optimize for top-k retrieval.

Monotonic Scoring Function
Execute a CN. Assume idf_netvista > idf_maxtor and k = 1.
CN: P^Q ⋈ C^Q

  Result     score(ci)  score(pj)  score
  c1 ⋈ p1    1.06       0.97       2.03
  c2 ⋈ p2    1.06       1.06       2.12

With DISCOVER2’s monotonic aggregation, score(c1) ≤ score(c2) and score(p1) ≤ score(p2) imply score(c1 ⋈ p1) < score(c2 ⋈ p2), so accessing tuples in decreasing score order yields a simple early-stopping rule.

Non-Monotonic Scoring Function
Execute a CN. Assume idf_netvista > idf_maxtor and k = 1.
CN: P^Q ⋈ C^Q

  Result     score(ci)  score(pj)  score_a
  c1 ⋈ p1    1.06       0.97       0.98
  c2 ⋈ p2    1.06       1.06       0.44

Under SPARK’s score_a, the same tuple-score order no longer determines the result order (c1 ⋈ p1 now outscores c2 ⋈ p2), so two problems must be solved:
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order

Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function (sketched in code below).

Details:
• sumidf = \sum_w idf_w
• watf(t) = (1 / sumidf) \cdot \sum_w tf_w(t) \cdot idf_w
• A = sumidf \cdot (1 + \ln(1 + \ln(\sum_t watf(t))))
• B = sumidf \cdot \sum_t watf(t)
• then score_a ≤ uscore_a = (1 / (1 − s)) \cdot \min(A, B), which is monotonic w.r.t. watf(t)
• score_b and score_c are constants given the CN, so score ≤ uscore

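A small Python sketch of the bound, following these definitions (the slope default s = 0.2 and the guard for Σ watf < 1 are assumptions):

import math

def uscore_a(watfs, sumidf, s=0.2):
    # Monotonic upper bound on SPARK's score_a for one candidate.
    # watfs: watf(t) for each tuple t; sumidf: sum of query keyword idfs.
    total = sum(watfs)
    b = sumidf * total
    # A is only defined for total >= 1; below that, B is the tighter
    # (and valid) bound anyway.
    a = sumidf * (1 + math.log(1 + math.log(total))) if total >= 1 else b
    return min(a, b) / (1 - s)
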
Early Stopping Criterion
Execute a CN. Assume idf_netvista > idf_maxtor and k = 1.
CN: P^Q ⋈ C^Q

  Result     uscore  score_a
  c1 ⋈ p1    1.13    0.98
  c2 ⋈ p2    1.76    0.44

Since score(·) ≤ uscore(·), execution can stop as soon as the best real score found so far is at least the largest uscore of any unchecked candidate: stop!

1) Re-establish the early stopping criterion ✓
2) Check candidates in an optimal order

Query Processing …
Execute the CNs [VLDB 03]
CN: P^Q ⋈ C^Q
• Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)
• {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.

Operations:
  [P1, P1] ⋈ [C1, C1]
  C.get_next()  →  [P1, P1] ⋈ C2
  P.get_next()  →  P2 ⋈ [C1, C2]
  P.get_next()  →  P3 ⋈ [C1, C2]
  …
  // each operation sends a parametric SQL query to the DBMS

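Each operation above maps to one parametric SQL probe; a sketch of what such a probe could look like (table and column names and the psycopg-style placeholders are illustrative, not from the paper):

def probe(cursor, p_ids, c_ids):
    # Join only the given candidate tuples; the DBMS answers which
    # combinations actually join. Schema names here are hypothetical.
    cursor.execute(
        "SELECT p.tid, c.tid "
        "FROM P AS p JOIN C AS c ON c.p_tid = p.tid "
        "WHERE p.tid = ANY(%s) AND c.tid = ANY(%s)",
        (p_ids, c_ids),
    )
    return cursor.fetchall()
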
Skyline Sweeping Algorithm
Dominance: uscore(<Pi, Cj>) > uscore(<Pi+1, Cj>) and uscore(<Pi, Cj>) > uscore(<Pi, Cj+1>), so a candidate only needs to enter the priority queue once a candidate dominating it has been checked.

Execute the CN: P^Q ⋈ C^Q

  Operation       Priority queue (ordered by uscore)
  (init)          <P1, C1>
  check P1 ⋈ C1   <P2, C1>, <P1, C2>
  check P2 ⋈ C1   <P3, C1>, <P1, C2>, <P2, C2>
  check P3 ⋈ C1   <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2>
  …

1) Re-establish the early stopping criterion ✓
2) Check candidates in an optimal order (sort of)

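Written out in Python, the whole loop is quite small (a sketch: uscore and real_score are stand-in callables, with real_score returning None when the two tuples do not join, e.g. after probing the DBMS):

import heapq

def skyline_sweep(P, C, uscore, real_score, k):
    # P, C: tuple lists sorted by decreasing IR score.
    # uscore(i, j): monotonic upper bound for the pair (P[i], C[j]).
    top = []                        # min-heap of the k best real scores
    heap = [(-uscore(0, 0), 0, 0)]  # max-heap on uscore via negation
    seen = {(0, 0)}
    while heap:
        neg_u, i, j = heapq.heappop(heap)
        # Early stopping: nothing left can beat the current k-th score.
        if len(top) == k and -neg_u <= top[0]:
            break
        s = real_score(i, j)
        if s is not None:
            heapq.heappush(top, s)
            if len(top) > k:
                heapq.heappop(top)
        # Expand only the two adjacent candidates; dominance guarantees
        # every other candidate is reached via a higher-uscore path.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(P) and nj < len(C) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(ni, nj), ni, nj))
    return sorted(top, reverse=True)
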
Block Pipeline Algorithm
There is an inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions:
• Lots of candidates with high uscores return much lower (real) scores
  • unnecessary (expensive) checking
  • cannot stop early
(→ draw an example)

Idea (sketched in code below):
• Partition the space (into blocks) and derive tighter upper bounds for each partition
• Be “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)

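A deliberately simplified Python sketch of this idea (the real algorithm interleaves block expansion with the Skyline Sweep priority queue instead of sorting all combinations up front; bscore, pairs and real_score are stand-ins):

import heapq
from itertools import product

def block_pipeline(blocks_P, blocks_C, bscore, pairs, real_score, k):
    # blocks_P / blocks_C: tuples partitioned by keyword signature,
    # e.g. all P-tuples containing only 'netvista'.
    # bscore(bp, bc): tighter upper bound for the block combination,
    # computable because the combined signature is known exactly.
    # pairs(bp, bc): the candidate tuple pairs inside the combination.
    top = []  # min-heap of the k best real scores found so far
    combos = sorted(product(blocks_P, blocks_C),
                    key=lambda combo: -bscore(*combo))
    for bp, bc in combos:
        if len(top) == k and bscore(bp, bc) <= top[0]:
            break  # no remaining combination can beat the current top-k
        for pair in pairs(bp, bc):
            s = real_score(pair)
            if s is not None:
                heapq.heappush(top, s)
                if len(top) > k:
                    heapq.heappop(top)
    return sorted(top, reverse=True)
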
Block Pipeline Algorithm …
Execute a CN. Assume idf_n > idf_m and k = 1.
CN: P^Q ⋈ C^Q; each side is partitioned into blocks by keyword signature, (n:1, m:0) and (n:0, m:1).

  Block combination          uscore   bscore
  (n:1, m:0) ⋈ (n:1, m:0)    2.74     1.05
  (n:1, m:0) ⋈ (n:0, m:1)    2.63     2.63
  (n:0, m:1) ⋈ (n:1, m:0)    2.63     2.63
  (n:0, m:1) ⋈ (n:0, m:1)    2.50     0.95

Combinations that cover both keywords keep bscore = uscore, while those repeating a single keyword drop sharply, so the highest-uscore combination is no longer checked first and execution stops much earlier: stop!

1) Re-establish the early stopping criterion ✓
2) Check candidates in an optimal order ✓

Efficiency
DBLP:
• ~0.9M tuples in total
• k = 10
• PC: 1.8 GHz CPU, 512 MB RAM

Efficiency …
DBLP, query DQ13

Conclusions
A system that performs effective & efficient keyword search on relational databases:
• Meaningful query results with appropriate rankings
• Second-level response time for a ~10M-tuple DB (IMDb data) on a commodity PC

Q&A
Thank you.