DELIS Highlights: Efficient and Intelligent Top-k Search in Peer-to-Peer Systems presented by Gerhard Weikum (Max-Planck Institute of Computer Science)


DELIS Highlights:
Efficient and Intelligent Top-k Search
in Peer-to-Peer Systems
presented by Gerhard Weikum
(Max-Planck Institute of Computer Science)
Introduction
Gerhard Weikum (MPII)
Why Peer-to-Peer Web Search?
Vision: Self-organizing P2P Web Search Engine with Google-or-better functionality
• Proof of Concept for Scalable & Self-Organizing
Data Structures and Algorithms
(e.g., DHTs, Randomized Overlay Networks, Epidemic Spreading)
• Testbed for CS Models, Algorithms, Technologies
and Experimental Platform
• Better Search Result Quality (Precision, Recall, etc.)
• Powerful Search Methods for Each Peer
(Concept-based Search, Query Expansion, Personalization, etc.)
• Leverage Intellectual Input at Each Peer
(Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
• Collaboration among Peers
(Query Routing, Incentives, Fairness, Anonymity, etc.)
• Breaking Information Monopolies
Subproject 6
Data Management on Dynamic P2P
What Google Can't Do
Killer queries (disregarding NLP QA, multilingual, multimedia):
drama with three women making a prophecy
to a British nobleman that he will become king
Outline
✓ Vision
• Demo
• Efficient Top-k Search
• Ontology-based Query Expansion
• Exploiting User Behavior
• Isolating Selfish Peers
Outline
✓ Vision
✓ Demo
• Efficient Top-k Search
• Ontology-based Query Expansion
• Exploiting User Behavior
• Isolating Selfish Peers
Efficient Top-k Search
TA: efficient & principled top-k query processing with monotonic score aggregation
Data items: d1, …, dn
d1: s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3)
TA with sorted access only (NRA)
(Fagin 01, Güntzer/Kießling/Balke 01):
scan index lists; consider d at pos_i in L_i;
E(d) := E(d) ∪ {i}; high_i := s(t_i,d);
worstscore(d) := aggr{s(t_ν,d) | ν ∈ E(d)};
bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
else if bestscore(d) > min-k then
    cand := cand ∪ {d};
threshold := max{bestscore(d') | d' ∈ cand};
if threshold ≤ min-k then exit;
Index lists (Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index)
t1: d78 0.9 | d23 0.8 | d10 0.8 | d1 0.7  | d88 0.2 | …
t2: d64 0.8 | d23 0.6 | d10 0.6 | d10 0.2 | d78 0.1 | …
t3: d10 0.7 | d78 0.5 | d64 0.4 | d99 0.2 | d34 0.1 | …

k = 1

Scan depth 1
Rank  Doc  Worstscore  Bestscore
1     d78  0.9         2.4
2     d64  0.8         2.4
3     d10  0.7         2.4

Scan depth 2
Rank  Doc  Worstscore  Bestscore
1     d78  1.4         2.0
2     d23  1.4         1.9
3     d64  0.8         2.1
4     d10  0.7         2.1

Scan depth 3
Rank  Doc  Worstscore  Bestscore
1     d10  2.1         2.1
2     d78  1.4         2.0
3     d23  1.4         1.8
4     d64  1.2         2.0
STOP!
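The NRA steps and the example index lists above can be traced with a minimal Python sketch. It assumes sum as the aggregation function and round-robin sorted access (one entry per list per round), and it also counts the best possible score of a still unseen document in the threshold; the function name `nra_top_k` and the data layout are illustrative choices, not part of the slides.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (doc id, score); each list sorted by descending score

# Index lists as read off the slide (first five entries each).
LISTS: List[List[Posting]] = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],   # t1
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d10", 0.2), ("d78", 0.1)],  # t2
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],  # t3
]


def nra_top_k(lists: List[List[Posting]], k: int):
    """Top-k with sorted access only, using sum as the aggregation (assumption)."""
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}  # doc -> {list index: score}, i.e. E(d)
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    depth = 0
    top: List[str] = []
    worst: Dict[str, float] = {}
    best: Dict[str, float] = {}
    while depth < max(len(lst) for lst in lists):
        # One round of sorted accesses: read position `depth` of every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                # Keep the first (highest) score if a doc repeats within a list.
                seen.setdefault(doc, {}).setdefault(i, score)
                high[i] = score
            else:
                high[i] = 0.0
        depth += 1
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top = ranked[:k]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        # Best score still reachable by any candidate or by an unseen document.
        threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
        if threshold <= min_k:
            break
    return [(d, worst[d], best[d]) for d in top], depth


result, depth = nra_top_k(LISTS, k=1)
print(result, depth)
```

With k = 1 the sketch stops after scan depth 3 and reports d10, matching the trace on the slide.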
Probabilistic Pruning of Top-k Candidates
TA family of algorithms based on invariant (with sum as aggr):
$$\underbrace{\sum_{i \in E(d)} s_i(d)}_{\text{worstscore}(d)} \;\le\; s(d) \;\le\; \underbrace{\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} \text{high}_i}_{\text{bestscore}(d)}$$
• Add d to top-k result, if worstscore(d) > min-k
• Drop d only if bestscore(d) < min-k, otherwise keep in PQ
→ Often overly conservative (deep scans, high memory for PQ)
[Figure: score vs. scan depth, showing bestscore(d) and worstscore(d) relative to min-k; drop d from priority queue]
Score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
• Approximate top-k with probabilistic guarantees:
$$p(d) := P\Big[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \delta\Big]$$
discard candidates d from queue if p(d) ≤ ε
⇒ E[rel. precision@k] = 1 − ε
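The pruning test above can be illustrated with a small sketch of the histogram-convolution predictor. It assumes that the unknown score S_i of each not yet seen list is described by a discrete histogram (score value mapped to probability), that these scores are independent, and that δ is the current min-k; the function names and the example histograms are hypothetical.

```python
from typing import Dict, List

Histogram = Dict[float, float]  # score value -> probability (assumed discrete)


def convolve(a: Histogram, b: Histogram) -> Histogram:
    """Distribution of the sum of two independent discrete scores."""
    out: Histogram = {}
    for x, px in a.items():
        for y, py in b.items():
            key = round(x + y, 6)  # merge sums that differ only by rounding noise
            out[key] = out.get(key, 0.0) + px * py
    return out


def prob_exceeds(worstscore: float, remaining: List[Histogram], delta: float) -> float:
    """p(d) = P[worstscore(d) + sum of unseen scores S_i > delta]."""
    total: Histogram = {0.0: 1.0}
    for hist in remaining:
        total = convolve(total, hist)
    return sum(p for s, p in total.items() if worstscore + s > delta)


def should_prune(worstscore: float, remaining: List[Histogram],
                 min_k: float, epsilon: float) -> bool:
    """Discard candidate d from the queue if p(d) <= epsilon."""
    return prob_exceeds(worstscore, remaining, min_k) <= epsilon


# Hypothetical candidate: worstscore 1.4, one list still unseen, min-k = 1.9.
unseen = [{0.0: 0.5, 0.3: 0.4, 0.6: 0.1}]
print(prob_exceeds(1.4, unseen, 1.9))          # probability of still beating min-k
print(should_prune(1.4, unseen, 1.9, 0.2))     # pruned at epsilon = 0.2
```

In this example the conservative test would keep the candidate, because its bestscore of 2.0 exceeds min-k, whereas the probabilistic test drops it once ε is at least 0.1.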
Experiments with TREC-12 Web-Track Benchmark
on .GOV corpus from TREC-12 Web track:
1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.:
• "Lewis Clark expedition",
• "juvenile delinquency",
• "legalization Marihuana",
• "air bag safety reducing injuries death facts"

                     TA-sorted    Prob-sorted (smart)
#sorted accesses     2,263,652    527,980
elapsed time [s]     148.7        15.9
max queue size       10849        400
relative precision   1            0.87
rank distance        0            39.5
score error          0            0.031

speedup by factor 10 at high precision/recall (relative to TA-sorted);
aggressive queue mgt. even yields factor 100 at 30-50 % prec./recall
Outline
✓ Vision
✓ Demo
✓ Efficient Top-k Search
• Ontology-based Query Expansion
• Exploiting User Behavior
• Isolating Selfish Peers
Query Expansion
Threshold-based query expansion:
substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) ≥ θ
"Old hat" in IR; highly disputed for danger of topic dilution
Approach to careful expansion:
• determine phrases from query or best initial query results
(e.g., forming 3-grams and looking up ontology/thesaurus entries)
• if uniquely mapped to one concept
then expand with synonyms and weighted hyponyms
• alternatively use statistical learning methods
for word sense disambiguation
Problem: choice of threshold θ
Query Expansion Example
From TREC 2004 Robust Track:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity,
the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00],
~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00],
"organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00],
mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]",
"black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}],
organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20],
columbian[0.686|0.20], cartel[0.466|0.20], ...}}
135530 sorted accesses in 11.073s.
Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...
Excerpts from result documents:
"Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, ..."
"... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ..."
"A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies. ..."
Top-k Query Processing with Query Expansion
consider expandable query "algorithm and ~performance"
with score
$$\sum_{i \in q} \max_{j \in onto(i)} \{\, sim(i,j) \cdot s_j(d) \,\}$$
dynamic query expansion with
incremental on-demand merging of additional index lists
[Figure: B+ tree index on terms and thesaurus / meta-index, pointing to the index lists below]
Index lists (doc: score):
algorithm: 92: 0.9, 67: 0.9, 52: 0.9, 44: 0.8, 55: 0.8, ...
performance: 37: 0.9, 44: 0.8, 22: 0.7, 23: 0.6, 51: 0.6, 52: 0.6, ...
response time (sim 0.7): 12: 0.9, 14: 0.8, 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5, ...
throughput (sim 0.6): 57: 0.6, 44: 0.4, 52: 0.4, 33: 0.3, 75: 0.3, ...
Thesaurus entry for performance:
response time: 0.7
throughput: 0.6
queueing: 0.3
delay: 0.25
...
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
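The scoring rule and the on-demand merging above can be sketched as follows. The sketch assumes that scores are normalized to [0, 1], so that sim(i,j) bounds every weighted score from the list of expansion term j, and that the slide's index lists belong to the terms as labelled in the dictionary below; `expanded_score` and `incremental_merge` are hypothetical names.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[int, float]  # (doc id, score); each list sorted by descending score

# Index lists from the slide; the assignment of lists to terms is an assumption.
INDEX: Dict[str, List[Posting]] = {
    "algorithm": [(92, 0.9), (67, 0.9), (52, 0.9), (44, 0.8), (55, 0.8)],
    "performance": [(37, 0.9), (44, 0.8), (22, 0.7), (23, 0.6), (51, 0.6), (52, 0.6)],
    "response time": [(12, 0.9), (14, 0.8), (28, 0.6), (17, 0.55), (61, 0.5), (44, 0.5)],
    "throughput": [(57, 0.6), (44, 0.4), (52, 0.4), (33, 0.3), (75, 0.3)],
}
# onto(i): expansion terms with similarities; every term expands to itself with sim 1.
ONTO: Dict[str, List[Tuple[str, float]]] = {
    "algorithm": [("algorithm", 1.0)],
    "performance": [("performance", 1.0), ("response time", 0.7), ("throughput", 0.6)],
}


def expanded_score(doc: int, query: List[str]) -> float:
    """Sum over query terms i of max over j in onto(i) of sim(i,j) * s_j(d)."""
    total = 0.0
    for term in query:
        total += max(sim * dict(INDEX.get(j, [])).get(doc, 0.0) for j, sim in ONTO[term])
    return total


def incremental_merge(expansions: List[Tuple[str, float]],
                      index: Dict[str, List[Posting]]) -> Iterator[Tuple[int, float, str]]:
    """Yield (doc, sim * score, term) in descending order of sim * score.

    A list is opened only once its bound sim * 1.0 exceeds the best open entry.
    """
    pending = sorted(expansions, key=lambda e: -e[1])
    heap: List[Tuple[float, int, str, float, int]] = []  # (-weighted, doc, term, sim, pos)
    while heap or pending:
        while pending and (not heap or pending[0][1] > -heap[0][0]):
            term, sim = pending.pop(0)
            postings = index.get(term, [])
            if postings:
                doc, score = postings[0]
                heapq.heappush(heap, (-sim * score, doc, term, sim, 0))
        if not heap:
            break
        neg, doc, term, sim, pos = heapq.heappop(heap)
        if pos + 1 < len(index[term]):
            ndoc, nscore = index[term][pos + 1]
            heapq.heappush(heap, (-sim * nscore, ndoc, term, sim, pos + 1))
        yield doc, -neg, term


print(expanded_score(44, ["algorithm", "performance"]))
for entry in incremental_merge(ONTO["performance"], INDEX):
    print(entry)
```

In this sketch the "response time" list is only opened after the entries of "performance" have dropped below 0.7, which is the intent of on-demand merging: lists of weakly related terms are touched late or never.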
Experiments with TREC-13 Robust-Track Benchmark
on AQUAINT corpus (news articles):
528 000 docs, 2 GB raw data, 8 GB for all indexes
50 most difficult queries, e.g.:
"transportation tunnel disasters"
"Hubble telescope achievements"
potentially expanded into:
"earthquake, flood, wind, seismology, accident, car, auto, train, ..."
"astronomical, electromagnetic radiation, cosmic source, nebulae, ..."

                   no exp.     static exp.      static exp.      incr. merge
                   (ε=0.1)     (θ=0.3, ε=0.0)   (θ=0.3, ε=0.1)   (ε=0.1)
#sorted acc.       1,333,756   10,586,175       3,622,686        5,671,493
#random acc.       0           555,176          49,783           34,895
elapsed time [s]   9.3         156.6            79.6             43.8
max #terms         4           59               59               59
relative prec.     0.934       1.0              0.541            0.786
precision@10       0.248       0.286            0.238            0.298
MAP                0.091       0.111            0.086            0.110
with Okapi BM25 probabilistic scoring model

speedup by factor 4 at high precision/recall;
no topic drift, no need for threshold tuning;
also handles TREC-13 Terabyte benchmark
Outline
✓ Vision
✓ Demo
✓ Efficient Top-k Search
✓ Ontology-based Query Expansion
• Exploiting User Behavior
• Isolating Selfish Peers
Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
$$PR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$$
Authority (page q) = stationary prob. of visiting q
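The PageRank recurrence above can be solved by simple power iteration. The sketch below assumes a uniform random-jump distribution j(q) when none is given, a jump probability ε of 0.15, and that every page has at least one outgoing transition so that each row of t sums to 1; the function name `stationary_rank` and the toy graph are illustrative.

```python
from typing import Dict, Optional

Transitions = Dict[str, Dict[str, float]]  # t[p][q] = probability of moving from p to q


def stationary_rank(t: Transitions, eps: float = 0.15,
                    jump: Optional[Dict[str, float]] = None,
                    iterations: int = 100) -> Dict[str, float]:
    """Iterate PR(q) = eps * j(q) + (1 - eps) * sum over p of PR(p) * t(p, q)."""
    nodes = sorted(t)
    if jump is None:
        jump = {q: 1.0 / len(nodes) for q in nodes}  # uniform random jumps (assumption)
    pr = dict(jump)
    for _ in range(iterations):
        nxt = {q: eps * jump[q] for q in nodes}
        for p, out in t.items():
            for q, prob in out.items():
                nxt[q] += (1.0 - eps) * pr[p] * prob
        pr = nxt
    return pr


# Toy graph: a and b link to each other, c links to a; nobody links to c.
toy: Transitions = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
print(stationary_rank(toy))
```

Page c receives no incoming transitions, so its authority comes only from random jumps and equals ε · j(c).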
Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
to QRank: + query-doc transitions + query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics
$$PR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$$
$$QR(q) = \varepsilon \cdot j(q) + (1 - \varepsilon) \cdot \Big[\, \beta \cdot \sum_{p \in explicitIN(q)} PR(p) \cdot t(p,q) \;+\; (1 - \beta) \cdot \sum_{p \in implicitIN(q)} PR(p) \cdot sim(p,q) \Big]$$
Preliminary Experiments
Setup:
70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks
ca. 15 000 implicit links based on doc-doc similarity
Results (assessment by blind-test users):
• QRank top-10 result preferred over PageRank in 81% of all cases
• QRank has 50.3% precision@10, PageRank has 33.9%
Untrained example query "philosophy":
     PageRank                    QRank
1.   Philosophy                  Philosophy
2.   GNU free doc. license       GNU free doc. license
3.   Free software foundation    Early modern philosophy
4.   Richard Stallman            Mysticism
5.   Debian                      Aristotle
Outline
✓ Vision
✓ Demo
✓ Efficient Top-k Search
✓ Ontology-based Query Expansion
✓ Exploiting User Behavior
• Isolating Selfish Peers
Collaborative P2P Search
[Figure: query peer P0 with local index X0 and bookmarks B0; other peers hold peer lists (directory) such as
term a: 17, 11, 92, ...
term c: 13, 92, 45, ...
term f: 43, 65, 92, ...
term g: 13, 11, 45, ...
url x: 37, 44, 12, ...
url y: 75, 43, 12, ...
url z: 54, 128, 7, ...]
Susceptible to misbehavior!
How do we identify and penalize or isolate selfish/malicious peers?
Self-Organization for Isolating Selfish Peers
Rationale:
• mimic evolution in biological / social networks
• tag selfish vs. altruistic peers and bias interactions towards similar peers
Algorithm:
periodically do
    each peer compares its "utility" with a random peer
    if the other peer has higher utility then
        copy that peer's strategy and links (reproduction)
    mutate with small probability: change behavior, change links
Simulation Results for P2P File Sharing
• peers generate queries and answer queries based on P ∈ [0,1]
with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
• peer utility = # hits (queries answered)
• mutation: change P randomly
[Figure: typical run for 10^4 peers; x-axis: cycles (0 to 100); y-axis: average per node (0 to 60); curves: queries generated, hits; annotations: "Selfishness reduces", "Average performance increases"]
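The compare, copy, and mutate loop behind this simulation can be sketched in a few lines. This is a simplified toy model under stated assumptions, not the simulator used for the plotted run: a peer answers a neighbor's query with probability 1 − P, a peer's utility is the number of its own queries that were answered in the current cycle, and the names `Peer`, `make_peers`, and `run_cycle` as well as all parameter values are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    selfishness: float  # P in [0, 1]: 1.0 = selfish, 0.0 = altruistic
    links: List[int]    # indices of neighbor peers
    hits: int = 0       # utility: own queries answered in this cycle (assumption)


def make_peers(n: int, rng: random.Random, degree: int = 3) -> List[Peer]:
    """Random initial strategies and random links to other peers (never to itself)."""
    return [Peer(rng.random(), [(i + rng.randrange(1, n)) % n for _ in range(degree)])
            for i in range(n)]


def run_cycle(peers: List[Peer], rng: random.Random, queries_per_peer: int = 5,
              mutation_rate: float = 0.01, max_links: int = 5) -> float:
    """One cycle: query phase, then reproduction and mutation. Returns average hits."""
    n = len(peers)
    for peer in peers:
        peer.hits = 0
    # Query phase: a neighbor answers with probability 1 - P (assumption).
    for peer in peers:
        for _ in range(queries_per_peer):
            neighbor = peers[rng.choice(peer.links)]
            if rng.random() >= neighbor.selfishness:
                peer.hits += 1
    average = sum(p.hits for p in peers) / n
    for i, peer in enumerate(peers):
        # Reproduction: compare utility with a random peer, copy the better one.
        j = rng.randrange(n)
        other = peers[j]
        if other.hits > peer.hits:
            peer.selfishness = other.selfishness
            copied = [l for l in other.links if l != i and l != j]
            peer.links = ([j] + copied)[:max_links]
        # Mutation: change behavior and links with small probability.
        if rng.random() < mutation_rate:
            peer.selfishness = rng.random()
            peer.links = [(i + rng.randrange(1, n)) % n]
    return average


rng = random.Random(42)
population = make_peers(200, rng)
history = [run_cycle(population, rng) for _ in range(50)]
print(history[0], history[-1])
```

Peers that copy a successful strategy also move next to the peer they copied, so peers with similar behavior tend to cluster. The toy model is only meant to show the mechanics of the loop, not to reproduce the plotted curves.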
Thank you!