Peer-to-Peer
Information Search
Sebastian Michel
Josiane Xavier Parreira
Ecole Polytechnique Fédérale Lausanne
Lausanne - Switzerland
Max-Planck Institute for Informatics
Saarbrücken - Germany
Outline of Part 1
Introduction to P2P Systems
 Distributed Hashtables & Range Queries
 Peer-to-Peer IR (Query Routing, Result
Merging)
 Overlapping Sources / Multi-key Statistics
 Top-k Query Processing
 Probabilistic Pruning
 Distributed Top-k

21/07/2015
Peer-to-Peer Information Search - SBBD 2007 Tutorial
2
P2P Systems
Peer:
“one that is of equal standing with another”
(source: Merriam-Webster Online Dictionary )

Known from Napster and others

Sharing of mostly illegal content (mp3, movies)
P2P= Pirate-to-Pirate ??


New kind of network organization; no client/server anymore
Basic Ideas:

Each peer connects to a few other peers
 All peers together form powerful networks

Potential Benefits:

No single point of failure
 Load is spread across multiple peers
 (Resilient to failures and dynamics)
Napster
• Developed in 1998.
• First P2P file-sharing system
• Central server (index)
• Client software sends information about users' contents to the server.
• Users send queries to the server.
• The server responds with the IPs of users that store matching files.
→ Peer-to-Peer file sharing!
[Figure: Napster architecture with "File Download" connections]
Gnutella







Protocol for distributed file sharing
Started in 2000
in 2005: 1.81 million computers connected*
Unstructured Network
Truly decentralized
Uses message flooding during query execution.
Later: version with super nodes and query routing
* http://www.slyck.com/news.php?story=814
Gnutella Style
[Figure: the query "Paris Hilton?" is flooded through the network; the messages carry TTL values 3, 2, 1 and 0 as they move away from the initiating peer]
Gnutella Style

Pros:
 no complex statistical bookkeeping
Cons:
 a lot of network traffic
 some peers might not be reachable (TTL)
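The flooding behaviour described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the overlay is a hypothetical adjacency dictionary, each peer forwards a query once, and the TTL is decremented per hop; it is not the Gnutella wire protocol.

```python
from collections import deque

def flood(graph, start, ttl):
    """Flood a query from `start`; return {peer: TTL left when first reached}."""
    reached = {start: ttl}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        if reached[peer] == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for neighbor in graph[peer]:
            if neighbor not in reached:
                reached[neighbor] = reached[peer] - 1
                queue.append(neighbor)
    return reached

# Hypothetical overlay: a simple chain a - b - c - d - e
CHAIN = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
```

With a TTL of 3, peer e in the chain is never asked, which is the "not reachable (TTL)" drawback listed above.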
Bit Torrent




Idea: Load sharing through file splitting
A lot of (legal) software distributors offer software through BitTorrent
Download information in small .torrent file
One tracker node per file (specified in torrent file)
[Figure: a file is split into segments 1 to 5 held by different peers; the client requests a random peer list from the tracker node and then requests segments from those peers]
Incentives: "tit-for-tat"
Each peer remembers collaborative peers → different priorities
Literature

Book: Peer-to-Peer: Harnessing the Power of Disruptive
Technologies by Andy Oram. O'Reilly Media, Inc.
Overlay Networks


On top of existing networks
Different ways to build an overlay network:
 structured
 unstructured
 hybrid
Self* Properties (Promises)

Self-Organizing:




Self-Optimizing
Self-Configuring
Self-Healing:



evolves, grows..... without being
guided/managed
Self-Restoration
Self-Diagnostics
Self-Protecting
Distributed Hash Tables




Hash-Table: given a key, return the bucket id. Based on
a hash function (like SHA-1)
Now: Distributed. For a given key, return the id of the
peer currently responsible for the key.
Challenge: Purely distributed protocols that cope with
node failures, departures, arrivals.
No central manager.
Chord




uses an m-bit identifier space ordered in a mod-$2^m$ circle, the Chord ring;
maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1
uses consistent hashing: an object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to, or follows, id on the Chord ring
Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that $k \le p$ and there is no node p' with $k \le p'$ and $p' < p$
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54]
Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
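The key-to-peer assignment can be tried out with a small sketch. It assumes m = 6 and uses the peer identifiers from the ring figure; `chord_id` and `successor` are illustrative names, and wrapping around the ring (a key larger than every peer identifier goes to the smallest peer) is made explicit here.

```python
import hashlib
from bisect import bisect_left

M = 6  # assumed number of identifier bits (ring size 2^6 = 64)

def chord_id(name, m=M):
    """Map a name (file name, IP address, ...) to an m-bit identifier via SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(peer_ids, key_id):
    """First peer whose identifier is equal to or follows key_id on the ring."""
    ring = sorted(peer_ids)
    idx = bisect_left(ring, key_id)
    return ring[idx % len(ring)]  # wrap around past the largest identifier

PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
```

For example, `successor(PEERS, 54)` yields 56 and `successor(PEERS, 10)` yields 14, matching the placement of k54 and k10 on the ring.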
Chord
peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance
→ Finger tables

offset   fingertable p8   fingertable p42   fingertable p51
+ 1      p14              p48               p56
+ 2      p14              p48               p56
+ 4      p14              p48               p56
+ 8      p21              p51               p1
+ 16     p32              p1                p8
+ 32     p42              p14               p21

[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56; Lookup(54) for key k54 ends at p56]
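The finger tables and a lookup can be reproduced with the following sketch. It assumes a 6-bit identifier space and global knowledge of all peers (only to build the tables); the routing rule, forwarding to the closest finger that precedes the key, is the usual Chord rule, and all function names are illustrative.

```python
M = 6
RING = 2 ** M
PEERS = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(key_id):
    """First peer at or after key_id, wrapping around the ring."""
    for p in PEERS:
        if p >= key_id:
            return p
    return PEERS[0]

def finger_table(n):
    """Entry i points to the successor of n + 2^i (mod 2^M)."""
    return [successor((n + 2 ** i) % RING) for i in range(M)]

def in_open_interval(x, a, b):
    """True if x lies strictly between a and b when walking clockwise."""
    if a < b:
        return a < x < b
    return x > a or x < b

def lookup(start, key_id):
    """Return the list of peers visited until the responsible peer is found."""
    path, node = [start], start
    while True:
        if key_id == node:
            return path
        succ = finger_table(node)[0]
        if key_id == succ or in_open_interval(key_id, node, succ):
            return path + [succ]
        nxt = succ
        for f in reversed(finger_table(node)):  # farthest finger first
            if in_open_interval(f, node, key_id):
                nxt = f
                break
        node = nxt
        path.append(node)
```

Starting at p8, `lookup(8, 54)` visits p42 and p51 before reaching p56, so each hop roughly halves the remaining distance.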
Node Joins in Chord
[Figure: new peer p42 joins between p38 and p48; keys k39 and k40 move from p48 to p42, k43 stays at p48]
p42 lookup(42)
p42 sets succ pointer
moving keys
p38 updates succ pointer

init_finger_tables()
successor = node.find_successor()
predecessor = successor.predecessor
predecessor.successor = new
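The join steps can be sketched as follows. This is a simplified illustration: `find_successor` walks successor pointers instead of using finger tables, there is no concurrency or failure handling, and the class and function names are assumptions made for the example.

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = self
        self.keys = set()

def between(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

def find_successor(start, ident):
    """Linear walk along successor pointers (finger tables omitted)."""
    node = start
    while not between(ident, node.id, node.successor.id):
        node = node.successor
    return node.successor

def join(new, known):
    """Insert `new` into the ring that `known` belongs to and move its keys."""
    succ = find_successor(known, new.id)
    pred = succ.predecessor
    new.successor, new.predecessor = succ, pred
    pred.successor = new      # predecessor updates its succ pointer
    succ.predecessor = new
    moved = {k for k in succ.keys if between(k, pred.id, new.id)}
    succ.keys -= moved        # keys in (pred, new] now belong to the new peer
    new.keys |= moved
```

Replaying the example, joining p42 into a ring that contains p38 and p48 moves k39 and k40 to p42 and leaves k43 at p48.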
And others ...




P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. SIGCOMM 2001: 161-172
Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz. Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference, June 2004.
Range queries

Range queries
 A range query [v1, v2] searches for those peers which store data with key value k ∈ [v1, v2]
DHTs efficiently support only exact-match queries
 The naïve approach to process range queries in DHTs is to query each value of a range individually
 It is HIGHLY EXPENSIVE!
DHTs and Range Queries
Order preserving hash function:
$f(k) = \frac{k - \min}{\max - \min} \cdot 2^m$
usually leads to skewed distributions
There are two main solutions to cope with load imbalances, i.e. to perform load balancing:
 transferring load, or
 replicating data
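A small sketch of the order preserving hash function shows why a range query can be answered from one contiguous arc of the ring. The clamping of the maximum value to the largest identifier and the function names are assumptions of this example, not part of the formula above.

```python
def order_preserving_hash(k, lo, hi, m):
    """f(k) = (k - lo) / (hi - lo) * 2^m, clamped to the m-bit identifier space."""
    ident = int((k - lo) / (hi - lo) * 2 ** m)
    return min(ident, 2 ** m - 1)  # assumption: k == hi maps to the last identifier

def range_to_arc(v1, v2, lo, hi, m):
    """A value range [v1, v2] maps to one contiguous arc of identifiers."""
    return order_preserving_hash(v1, lo, hi, m), order_preserving_hash(v2, lo, hi, m)
```

Because f is monotone, values that cluster in a small part of the domain also cluster on a small arc of the ring, which is where the skew comes from.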
DHT and Range Queries (2)

Existing approaches to deal with range queries:

Locality preserving hashing


OP-Chord: Triantafillou et al (2003). Skip Graphs: Aspnes et al (2004)
Hashing ranges of values instead of each value individually

CAN-based: Andrzejak et al (2002), Sahin et al (2004)

Another problem in that context: access load imbalances

One possible solution: “hot data” transferring to deal with
those load imbalances

However, data transfer does not solve access load
imbalances in skewed access (query) distributions
HotRoD: replicating hot arcs
Theoni Pitoura et al. EDBT 2006.
A peer is "hot" (or overloaded) when its load exceeds the upper limit of its resource capacity
An arc of peers is "hot" when at least one of its peers is hot
replicate ranges of values
Efficient Load Balancing
[Figure: Lorenz curves for access load distribution (r = 200, θ = 0.8). x-axis: cumulative percentage of peers; y-axis: cumulative percentage of hits. Curves: HotRoD, Chord, OP-Chord and the line of uniformity. Marked points at 97% of the peers: 90%, 73% and 40% of the hits]
Building a P2P Search Engine
(Peer to Peer Information Retrieval)
“Distributed Google”

P2P approach is best suited:
 large number of peers
 exploit mostly idle resources
 intellectual input of user community
 scalable and self organizing
Information Retrieval Basics
[Figure: a document is mapped to its terms together with the number of occurrences of each term (term frequency), e.g. 5x, 7x, 4x]
Information Retrieval Basics (2)
Top-k Query Processing: find k documents with the highest total score
Query Execution: Usually using some kind of threshold algorithm*:
- sequential scans over the index lists (round-robin)
- (random accesses to fetch missing scores)
- aggregate scores
- stop when the threshold is reached
* e.g. Fagin's algorithm TA or a variant without random accesses

[Figure: B+ tree on terms pointing to index lists with (DocId: tf*idf) entries sorted by score]
list 1: d28: 0.7, d11: 0.6, d17: 0.1, ...
list 2: d53: 0.8, d55: 0.6, d44: 0.4, d17: 0.3, d52: 0.1, ...
list 3: d51: 0.6, d12: 0.5, d14: 0.4, d52: 0.3, d44: 0.2, d28: 0.1, ...
Going distributed: Index Organization
document index
[Figure: the index lists with (DocId: score) entries, e.g. d51: 0.6, d12: 0.5, ...; d53: 0.8, d55: 0.6, ...; d28: 0.7, d11: 0.6, ..., are spread across Peer 1, Peer 2 and Peer 3]
peer index
 every peer has its own collection (full documents)
 distributed index = index of peer descriptions
[Figure: index list whose entries are peers (Peer 2, Peer 1, ...)]
(Full) Document Index


Straightforward from centralized document index
Each peer is responsible for storing the index list for a subset of terms.
Query Routing: DHT lookups
Query Execution: Distributed Top-k [TPUT '04, KLEE '05]
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56]
Peer Index


Each peer has its own local index (e.g., created by web crawls)
Peers publish compact per-term descriptions about their index
Distributed Directory: Term → List of Peers
Query Routing:
1. DHT lookups
2. Retrieve Metadata
3. Find most promising peers
Query Execution:
- Send the complete Query and merge the incoming results
[Figure: peers P1 to P6 connected through the distributed directory]
P2P Search with Minerva
[Figure: Minerva architecture. Peer P0 holds a local index X0 and bookmarks B0 and issues a query. The directory stores peer lists with peer ranking & statistics, e.g. term a: 17, 11, 92, ...; term c: 13, 92, 45, ...; term f: 43, 65, 92, ...; term g: 13, 11, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ... It is based on a scalable, churn-resilient DHT with O(log n) key lookup]
Query routing aims to optimize benefit/cost driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc.
Maintain semantic/social/statistical overlay network (SON)
Exploit community behavior (bookmarks, links, tags, clicks, etc.)
Two major Problems

Task of merging the obtained results into final ranking: Result Merging
Task of finding "high quality" peers: Query Routing
 aka database/collection/peer selection
Overview articles:
 J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
 Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
Query Routing


Given a Query Q = {term1, term2, ...., termN}: select the most promising peers
Based on:
 per-term per-peer statistics
  document frequency
  vocabulary size
 + normalization issues like
  collection frequency
  avg vocabulary size
Most popular:
 CORI, GlOSS, Decision Theoretic Framework (DTF)
CORI
Apply document ranking to resource ranking
[Figure: inference network linking the resources p1, p2, ..., pj via the terms t1, t2, t3, ..., tk to the query q]
$s(p_i, t_j) = b + (1 - b) \cdot T(p_i, t_j) \cdot I(t_j)$
$T(p, t) = \frac{df_{p,t}}{df_{p,t} + 50 + 150 \cdot cw_p / avg\_cw}$
$I(t) = \frac{\log\left(\frac{C + 0.5}{cf_t}\right)}{\log(C + 1.0)}$
$S(p_i, Q) = \frac{1}{n} \sum_{t_j \in Q} s(p_i, t_j)$
C = #peers
df = document frequency
cf = collection frequency
cw = # distinct words per peer
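The CORI formulas translate directly into code. In this sketch the data layout (a per-peer dictionary with `df` and `cw`) and the default b = 0.4 are assumptions made for illustration; n is taken to be the number of query terms.

```python
import math

def cori_T(df_pt, cw_p, avg_cw):
    """T(p, t) = df / (df + 50 + 150 * cw_p / avg_cw)."""
    return df_pt / (df_pt + 50 + 150 * cw_p / avg_cw)

def cori_I(num_peers, cf_t):
    """I(t) = log((C + 0.5) / cf_t) / log(C + 1.0)."""
    return math.log((num_peers + 0.5) / cf_t) / math.log(num_peers + 1.0)

def cori_score(query_terms, peer, num_peers, cf, avg_cw, b=0.4):
    """S(p, Q): average belief s(p, t) over the query terms (b = 0.4 is assumed)."""
    total = 0.0
    for t in query_terms:
        T = cori_T(peer["df"].get(t, 0), peer["cw"], avg_cw)
        total += b + (1 - b) * T * cori_I(num_peers, cf[t])
    return total / len(query_terms)

# Hypothetical statistics for two peers and one term
PEER_A = {"df": {"p2p": 100}, "cw": 1000}
PEER_B = {"df": {"p2p": 10}, "cw": 1000}
CF = {"p2p": 2}
```

With these numbers peer A, which holds many more documents for the term, receives the higher score and would be contacted first.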
Literature





J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
Decision Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)
Result Merging
Problem: incomparable scores
 Different corpus statistics
  df component used in tf*idf scoring functions is not globally known
  user with a lot of high quality documents for term a → high df
  non expert user with some bad documents for term a → low df
 Different scoring functions
  completely different functions
  different parameters in the same function
Result Merging Approaches

Score Normalization by
 using global statistics
  computation of global statistics difficult (not obvious)
  solution using gossip
 score re-computation with query initiator's local statistics
  requires re-ranking and knowledge about document contents
 score re-computation using query routing scores
  routing score available anyway
$R'_i = (R_i - R_{min}) / (R_{max} - R_{min})$
$D' = (D_i - D_{min_i}) / (D_{max_i} - D_{min_i})$
$D'' = \frac{D' + 0.4 \cdot D' \cdot R'_i}{1.4}$
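The routing-score based normalization can be sketched as below. The input layout (per-peer result lists plus a routing score per peer) and the handling of the degenerate case where minimum and maximum coincide are assumptions of this example.

```python
def min_max(x, lo, hi):
    """Min-max normalization; returns 0.0 when all values coincide (assumption)."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def merge_results(results, routing_scores):
    """results: {peer: [(doc, score), ...]}, routing_scores: {peer: R_i}."""
    r_min, r_max = min(routing_scores.values()), max(routing_scores.values())
    merged = []
    for peer, docs in results.items():
        r_norm = min_max(routing_scores[peer], r_min, r_max)       # R'_i
        d_min = min(s for _, s in docs)
        d_max = max(s for _, s in docs)
        for doc, s in docs:
            d_norm = min_max(s, d_min, d_max)                      # D'
            merged.append((doc, (d_norm + 0.4 * d_norm * r_norm) / 1.4))  # D''
    return sorted(merged, key=lambda entry: -entry[1])

# Hypothetical results from two peers with incomparable local score scales
RESULTS = {"A": [("a1", 10.0), ("a2", 5.0)], "B": [("b1", 0.9), ("b2", 0.1)]}
ROUTING = {"A": 0.8, "B": 0.2}
```

The best document of the peer with the higher routing score ends up first, and the local score scales (10.0 vs. 0.9) no longer matter.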
Global DF Estimation
gdf (global doc. freq.) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible
hash sketches [Flajolet/Martin 1985]: duplicate-insensitive cardinality estimator for multisets
 hash each multiset element x onto an m-bit bitvector and remember the least significant 1 bit
 rough intuition: least-significant bit set by half of the documents, second bit by ¼ of the documents ...
 Theory says: the most significant bit set is an estimator of log(n); n = #documents
 Higher accuracy: average multiple iid sketches
Global DF Estimation
Hash sketches of different peers collected at directory peer
distributivity is free!! $\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$, with $\rho$ the position of the least significant 1 bit
gdf estimation algorithm:
 each peer p posts hash sketch for each (discriminative) term t to directory
 directory peer for term t forms union of incoming hash sketches
 when a peer needs to know gdf(t), simply ask directory peer for t
 sliding-window techniques for dynamic adjustment
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
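A hash sketch and its union property can be illustrated as follows. This is a sketch under assumptions: SHA-1 truncated to 32 bits stands in for the hash function, seeds simulate independent sketches, and the correction constant 0.77351 is the one from the Flajolet/Martin analysis.

```python
import hashlib

BITS = 32  # assumed sketch width

def _hash(x, seed):
    digest = hashlib.sha1(f"{seed}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def rho(y):
    """Position of the least significant 1 bit (BITS if y == 0)."""
    return (y & -y).bit_length() - 1 if y else BITS

def sketch(items, seed=0):
    """Bitmap with bit rho(h(x)) set for every element x; duplicates change nothing."""
    bitmap = 0
    for x in items:
        r = rho(_hash(x, seed))
        if r < BITS:
            bitmap |= 1 << r
    return bitmap

def lowest_zero(bitmap):
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def estimate(bitmaps):
    """Average the lowest-zero position over several sketches: n ~ 2^R / 0.77351."""
    avg = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** avg / 0.77351
```

A directory peer only needs a bitwise OR to combine the sketches posted by different peers: `sketch(A) | sketch(B)` equals the sketch of the union of A and B, however much the two sets overlap.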
Autonomous Peers Overlapping Sources
[Figure: peers A, B, C, D, E with overlapping content. Two plots of recall against #peers: without overlap awareness the peers are selected as {A}, {A,B}, {A,B,C}, {A,..,D}; the overlap aware routing strategy selects {A}, {A,E}]
How?



Enrich published statistics with overlap estimators.
Interested in NOVELTY and QUALITY
Iterative greedy selection process
 select first peer based on quality
 select next peer by quality*novelty
Suitable synopses for overlap estimation:
 Bloom filter [Bloom 1970]
 hash sketches [Flajolet&Martin 1985]
 min-wise independent permutations [Broder 1997]
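The greedy selection can be sketched with exact sets standing in for the synopses; a real system would estimate novelty from Bloom filters, hash sketches or MIPs instead. The novelty definition used here (fraction of a peer's documents not yet covered) and all names are assumptions of the example.

```python
def select_peers(peers, k):
    """peers: {name: (quality, set_of_doc_ids)}; returns up to k peer names."""
    selected, covered = [], set()
    remaining = dict(peers)
    while remaining and len(selected) < k:
        def score(name):
            quality, docs = remaining[name]
            if not selected:
                return quality                      # first peer: quality only
            novelty = len(docs - covered) / len(docs) if docs else 0.0
            return quality * novelty                # later peers: quality * novelty
        best = max(sorted(remaining), key=score)
        selected.append(best)
        covered |= remaining.pop(best)[1]
    return selected

# Hypothetical peers: B is good but fully contained in A, E is smaller but disjoint
PEERS = {
    "A": (0.9, {1, 2, 3, 4, 5, 6}),
    "B": (0.8, {1, 2, 3, 4, 5}),
    "E": (0.5, {7, 8, 9}),
}
```

For k = 2 this picks A and then E rather than B, because B contributes no new documents.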
Min-Wise Independent Permutations [Broder 97]
set of ids: 17 21 3 12 24 8
compute N random permutations:
 h1(x) = 7x + 3 mod 51 → 20 48 24 36 18 8 → minimum 8
 h2(x) = 5x + 6 mod 51 → 40 9 21 15 24 46 → minimum 9
 …
 hN(x) = 3x + 9 mod 51 → 9 21 18 45 30 33 → minimum 9
MIPs vector (minima of the permutations): 8, 9, …, 9

MIPs (set1): 8  9 33 24 36  9
MIPs (set2): 8 24 45 24 48 13
estimated overlap = 2/6

MIPs are unbiased estimator of overlap:
$P[\min\{h(x) \mid x \in A\} = \min\{h(y) \mid y \in B\}] = |A \cap B| / |A \cup B|$
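The example can be recomputed with a short sketch. It uses the three linear functions and the modulus 51 shown above; `mips` and `resemblance` are illustrative names.

```python
def mips(ids, perms, modulus=51):
    """Minimum of (a * x + b) mod modulus over the set, one entry per function."""
    return [min((a * x + b) % modulus for x in ids) for a, b in perms]

def resemblance(v1, v2):
    """Fraction of positions in which two MIPs vectors agree."""
    matches = sum(1 for x, y in zip(v1, v2) if x == y)
    return matches / len(v1)

IDS = [17, 21, 3, 12, 24, 8]
PERMS = [(7, 3), (5, 6), (3, 9)]  # h1, h2 and hN from the example
```

`mips(IDS, PERMS)` gives the minima 8, 9 and 9, and comparing the two six-entry vectors position by position gives 2 matches out of 6.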





Bloom Filter [Bloom 1970]
bit array of size m
k hash functions h_i: docId_space → {1,..,m}
insert n docs by hashing the ids and setting the corresponding bits
document is in the Bloom Filter if the corresponding bits are set
probability of false positives (pfp)
tradeoff accuracy vs. efficiency
[Figure: document ids are hashed by h1 and h2 into a bit array with positions 1 to 16; a query id whose bits are all set by other documents is a false positive]
Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
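A minimal Bloom filter sketch follows. Deriving the k hash functions from SHA-1 with a counter prefix is an assumption of this example, and `false_positive_rate` uses the standard approximation $(1 - e^{-kn/m})^k$.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, doc_id):
        # k hash functions simulated by prefixing a counter (assumption)
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{doc_id}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits[pos] = True

    def __contains__(self, doc_id):
        # May report false positives, never false negatives
        return all(self.bits[pos] for pos in self._positions(doc_id))

def false_positive_rate(m, k, n):
    """Approximate pfp after inserting n ids into m bits with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k
```

A larger m lowers the false positive probability at the price of a bigger synopsis, which is the accuracy vs. efficiency tradeoff named above.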
Multi-Key Statistics

solves interesting problem:
 peer with lots of docs on american football and lots of documents about pop music has not a single document about american music
 cannot be predicted using per-term statistics
 quality(a and b) ≠ quality(a) + quality(b)
Obvious: Recall that quality(a) ~ df(a), e.g. in CORI:
$T = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw}$
Multi-Key Statistics in P2P




Motivation:
 estimated_quality(a and b) = quality(a) + quality(b) = df_a + df_b != df_(a and b)
Impossible (infeasible) to consider all term pairs, triplets, quadruples, .....
Query driven: analyze query logs @ directory peers.
+ Data driven verification:
 P[Anna|Kournikova] = ......
 P[Andy|Roddick] =
 P[Berlin|Marathon] =
Whole process can be easily integrated into Peer-level P2P IR
No additional messages + shorter lists + highly accurate
Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
Single-term vs. multi-term P2P document indexing
Single term indexing
 term 1 → posting list 1, term 2 → posting list 2, ..., term M-1 → posting list M-1, term M → posting list M, spread over PEER 1 ... PEER N
 small voc., long posting lists
Multi term indexing
 PEER 1: key 11 → posting list 11, key 12 → posting list 12, ..., key 1i → posting list 1i
 ...
 PEER N: key N1 → posting list N1, key N2 → posting list N2, ..., key Nj → posting list Nj
 large voc., short posting lists
Multi-term keys
 make use of highly discriminative keys
 limit influence of overly long index lists
 consider term pairs (triplets ...) for shorter lists → efficient query processing
Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
Literature

Overlap Awareness:

Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu:
Identifying redundant search engines in a very large scale metasearch
engine context. WIDM 2006: 51-58
 Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard
Weikum, Christian Zimmer: Improving collection selection with overlap
awareness in P2P search engines. SIGIR 2005: 67-74
 Thomas Hernandez, Subbarao Kambhampati: Improving text collection
selection with coverage and overlap statistics. WWW (Special interest
tracks and posters) 2005: 1128-1129

Sketches

Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael
Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst.
Sci. 60(3): 630-659 (2000)
 Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for
Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
 Andrei Broder and Michael Mitzenmacher: Network Applications of
Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
Literature

Multi-key statistics:
 Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
 Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686
 Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
For the IR people ....

Why top-k?

Cannot take a look at all matching documents
 E.g., Google provides millions of documents about Britney Spears
Requires ranking (scoring):
In text retrieval for instance
+ of course pagerank if you wish
Remember Part one: Local Query Execution at each peer (peer-index-model)
AND truly distributed top-k processing in the full document-index.
For the DB guys ...

Table with schema (id, attribute, value)
SELECT id, aggr(value)
FROM table
GROUP BY id
ORDER BY aggr(value) DESC
LIMIT k
For the networking guys ...
Network Monitoring
Find clients that cause high network traffic.

List 1
IP             Bytes in kB
192.168.1.7    31kB
192.168.1.3    23kB
192.168.1.4    12kB

List 2
IP             Bytes in kB
192.168.1.8    81kB
192.168.1.3    33kB
192.168.1.1    12kB

List 3
IP             Bytes in kB
192.168.1.4    53kB
192.168.1.3    21kB
192.168.1.1    9kB

List 4
IP             Bytes in kB
192.168.1.1    29kB
192.168.1.4    28kB
192.168.1.5    12kB
Computational Model



m lists with (itemId, score)-pairs sorted by score descending. One list per attribute (e.g. term)
Aggregation function aggr()
Monotonicity is important: for all items a, b:
 $\forall i: score(a_i) \le score(b_i) \Rightarrow aggr(a) \le aggr(b)$
 with $score(x_i)$ denoting the score of item x in list i
Goal: return the top-k items w.r.t. their aggregated (overall) scores
How to process this?

Most popular: Family of threshold algorithms
 Fagin, 1999
 Nepal/Ramakrishna, 1999
 Güntzer/Balke/Kießling, 2001
Basic ideas:
 keep upper and lower score bound for each document
  lowerbound (or worstscore) = sum of scores we have seen so far, assuming 0 for unseen dimensions
  upperbound (or bestscore) = lowerbound + highest possible value for unseen dimensions
 know what we've got already; know what to expect
 stop if no further step can improve the current (i.e. final) ranking
Fagin’s NRA

NRA(q,L):
top-k := ∅; candidates := ∅; min-k := 0;
scan all lists Li (i = 1..m) in parallel:
  consider item d at position pos_i in Li;
  E(d) := E(d) ∪ {i};
  high_i := s_i(q_i,d);
  worstscore(d) := aggr{s_ν(q_ν,d) | ν ∈ E(d)};
  bestscore(d) := aggr{aggr{s_ν(q_ν,d) | ν ∈ E(d)}, aggr{high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then
    remove argmin_d'{worstscore(d') | d' ∈ top-k} from top-k;
    add d to top-k
    min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then
    candidates := candidates ∪ {d};
  threshold := max {bestscore(d') | d' ∈ candidates};
  if threshold ≤ min-k then exit;
Top-k Search
Data items: d1, …, dn (e.g. d1 with s(t1,d1) = 0.7, …, s(tm,d1) = 0.2)
Query: q = (t1, t2, t3), k = 1
Index lists
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Scan depth 1
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0
STOP!
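The scan can be replayed with a compact sketch of NRA. It assumes summation as the aggregation function and synchronized round-robin scans, and it recomputes all bounds at every depth for clarity instead of maintaining a candidate queue; the function name and the list layout are illustrative.

```python
def nra(lists, k):
    """lists: score-sorted lists of (item, score); returns (top-k, scan depth)."""
    m = len(lists)
    seen = {}              # item -> {list index: score}
    high = [0.0] * m       # last score seen in each list
    top, worst, depth = [], {}, 0
    for depth in range(1, max(len(lst) for lst in lists) + 1):
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                item, score = lst[depth - 1]
                seen.setdefault(item, {}).setdefault(i, score)
                high[i] = score
            else:
                high[i] = 0.0
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        top = sorted(worst, key=lambda d: (-worst[d], d))[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            # rivals: seen items outside the top-k plus any still unseen item
            rivals = [best[d] for d in seen if d not in top] + [sum(high)]
            if max(rivals) <= min_k:
                break
    return [(d, worst[d]) for d in top], depth

T1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
T2 = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d10", 0.2), ("d78", 0.1)]
T3 = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]
```

For k = 1 the scan stops at depth 3 with d10 and a worst-score of about 2.1, since no other item can still exceed it.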
Evolution of a Candidate’s Score
Observation: pruning often overly conservative (deep scans, high memory for priority queue)
[Figure: score over scan depth for a candidate d, showing bestscore_d, worstscore_d and min-k; d is dropped from the candidate queue]
Approximate top-k
 "What is the probability that d qualifies for the top-k?"
Safe Thresholding vs. Probabilistic Guarantees
NRA based on invariant
$\sum_{i \in E(d)} s_i(d) \le s(d) \le \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i$
(left-hand side: worstscore_d; right-hand side: bestscore_d)
Relaxed into probabilistic threshold test
$p(d) := P\left[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} s_i(d) > \text{min-k}\right] \le \varepsilon$
Or equivalently, with $\delta(d) := \text{min-k} - \sum\{s_i \mid i \in E(d)\}$
$p(d) = P\left[\sum_{i \notin E(d)} s_i(d) > \delta(d)\right] \le \varepsilon$
[Figure: bestscore_d, min-k and worstscore_d, with δ(d) the gap between min-k and worstscore_d]
Expected Result Quality

Missing relevant items
 Probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
 For each candidate p_miss ≤ ε
 $P[recall = r/k] = P[precision = r/k] = \binom{k}{r} (1 - p_{miss})^{r} \, p_{miss}^{(k-r)}$
 $E[precision] = E[recall] = \sum_{r=0..k} P[precision = r/k] \cdot r/k = (1 - p_{miss}) \ge (1 - \varepsilon)$
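The expectation can be checked numerically. The sketch below simply evaluates the binomial sum for given k and p_miss; the function name is illustrative.

```python
from math import comb

def expected_precision(k, p_miss):
    """Sum over r of P[precision = r/k] * r/k for a binomial number of hits."""
    return sum(
        comb(k, r) * (1 - p_miss) ** r * p_miss ** (k - r) * r / k
        for r in range(k + 1)
    )
```

For k = 10 and p_miss = 0.1 the sum evaluates to 0.9 up to rounding, i.e. the mean of the binomial distribution divided by k.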
Going distributed

Key Observations:
 Network traffic is crucial
 Number of round trips is crucial
Straightforward application of TA/NRA?
 expensive: huge number of round trips
 even with batching: unpredictable performance
Where is the data?
[Figure: query initiator P0 and the peers holding the index lists]
P1, t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
P2, t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
P3, t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
Consider
 network consumption
 per peer load
 latency (query response time)
  network
  I/O
  processing
[Figure: peers P1, P2, P3, P4, P5 in the network]
Three Phase Uniform Threshold Algorithm
[Cao and Wang, PODC 2004]
First distributed top-k algorithm with fixed number of phases!
Exactly 3 phases:
1. fetch k best entries (d, s_j) from each of P1 ... Pm and aggregate ($\sum_{j=1..m} s_j(d)$) at query initiator
2. ask each of P1 ... Pm for all entries with s_j > min-k / m and aggregate results at query initiator. min-k is score of item currently at rank k.
3. fetch missing scores for all candidates by random lookups at P1 ... Pm
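The three phases can be simulated in a single process. In this sketch each peer's list is a dictionary, aggregation is summation, and phase 2 uses "greater than or equal" to the threshold to stay on the safe side for ties; these choices and the names are assumptions of the example, and no network messages are modelled.

```python
def tput(lists, k):
    """lists: one {item: score} dict per peer; returns the top-k (item, total)."""
    m = len(lists)

    def topk(scores):
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

    # Phase 1: fetch the k best entries from every peer and sum partial scores
    partial = {}
    for peer in lists:
        for d, s in topk(peer):
            partial[d] = partial.get(d, 0.0) + s
    min_k = topk(partial)[-1][1] if len(partial) >= k else 0.0
    threshold = min_k / m

    # Phase 2: fetch every entry whose local score reaches min-k / m
    candidates = set(partial)
    for peer in lists:
        candidates |= {d for d, s in peer.items() if s >= threshold}

    # Phase 3: random lookups for the missing scores of all candidates
    exact = {d: sum(peer.get(d, 0.0) for peer in lists) for d in candidates}
    return topk(exact)

# Example lists (the repeated d10 entry of the t2 list is left out here)
P1 = {"d78": 0.9, "d23": 0.8, "d10": 0.8, "d1": 0.7, "d88": 0.2}
P2 = {"d64": 0.8, "d23": 0.6, "d10": 0.6, "d78": 0.1}
P3 = {"d10": 0.7, "d78": 0.5, "d64": 0.4, "d99": 0.2, "d34": 0.1}
```

For k = 1 the phase-1 threshold is 0.9 / 3 = 0.3, and d10, which is not the best entry of P1 or P2, is picked up in phase 2 and wins after phase 3.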
[Figure: TPUT message flow. Coordinator peer P0 keeps the current top-k and a candidate set; cohort peers Pi and Pj hold score-sorted index lists and send their top k entries and the candidates above the min-k / m threshold]
Analysis of TPUT

Theorem: TPUT is an exact algorithm, i.e. identifies the true top-k items
Proof (sketch): TPUT cannot miss a true top-k item. Assume it misses one, i.e. item is below min-k / m in all lists.
 → overall score < min-k
 → not a true top-k item!
[Figure: state after phase 2 for list 1, list 2 and list 3; the min-k threshold is marked in each list, and an item below it everywhere has score < min-k]
Analysis of TPUT


if min-k / m is small TPUT retrieves a lot of data in Phase 2 → high network traffic
random accesses → high per-peer load
KLEE [VLDB '05]
 Different philosophy: approximate answers
 Efficiency:
  Reduces (docId, score)-pair transfers
  no random accesses at each peer
 Two pillars:
  The HistogramBlooms structure
  The Candidate List Filter structure
Additional Data Structures
Equi-width histogram
+ Bloom filter for each cell
+ average score per cell
+ upper/lower score
[Figure: histogram of #docs over score; each cell carries a Bloom filter bit vector]
Usage:
During Phase 1:
+ fetch top-k from each list
+ top-c cells
→ "increase" the min-k / m threshold
KLEE
[Figure: coordinator peer P0 with current top-k and candidate set; cohort peers Pi and Pj each hold a score-sorted index list with the top k entries, the min-k / m threshold and a histogram of c cells, each cell with a Bloom filter of b bits]
KLEE– Candidate Set Reduction
[Figure: coordinator peer P0 with current top-k and candidate set builds the candidate filter matrix (one bit vector per list) and sends it to the cohort peers Pi and Pj, whose index lists show the top k entries, the candidates and the min-k / m threshold]
KLEE – Candidate Retrieval
[Figure: coordinator peer P0 with current top-k, candidate set and candidate filter matrix; cohort peers Pi and Pj scan their index lists below the top k entries down to the min-k / m threshold and return the candidates that match the filter, with an early stopping point]
Literature









Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648
Part II – Social Search
Motivation

People connected through a network




Sharing interests



People create links to other people
Links can express friendship, recommendations, etc
Different graph structures appear
Enables users to find others who share common interests
Similar users can provide relevant content
Users and content spread at different sites

Distributed nature and continuously increasing size call for peer-to-peer approaches
Outline of the Second Part

Link Analysis: The Web as a Graph
  PageRank
  Distributed Approaches
    BlockRank
    Local PageRank + ServerRank
    Adaptive OPIC
    JXP
Identifying common interests – Semantic Overlay Networks
  Crespo and Garcia-Molina
  pSearch
  p2pDating
Social Networks – A new paradigm
  What people share
  Social graphs
  Links, tags, user analysis
Links are everywhere…

…connecting Web pages
www.searchtools.com
www.searchengines.com
www.searchengineguide.com
www.openp2p.com/...
www.searchengineshowdown.com
searchenginewatch.com
Links are everywhere…

…connecting people
Example of a Flickr friends network
Links are everywhere…

…connecting products
Link Analysis

The set of nodes (e.g., web pages, people, products) and the links connecting them define a graph
www.searchtools.com
www.searchengines.com
www.searchengineguide.com
www.openp2p.com/...
www.searchengineshowdown.com
searchenginewatch.com
Link Analysis

At the end we have something like this…

Lots of useful information can be obtained from the analysis of such graphs
Adjacency Matrix


Matrix representation of graphs
Given a graph G with n nodes, its adjacency matrix A is n × n, with
$a_{ij} = 1$ if there is a link from node i to node j
$a_{ij} = 0$ otherwise
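The adjacency matrix definition can be illustrated with a minimal sketch. The node names and links below are made up for illustration and are not taken from the slides; `adjacency_matrix` is a hypothetical helper name.

```python
def adjacency_matrix(nodes, links):
    """Build the n x n matrix A with A[i][j] = 1 iff node i links to node j."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    a = [[0] * n for _ in range(n)]
    for src, dst in links:
        a[index[src]][index[dst]] = 1
    return a

# Hypothetical three-node graph: A -> B, A -> C, C -> A
nodes = ["A", "B", "C"]
links = [("A", "B"), ("A", "C"), ("C", "A")]
A = adjacency_matrix(nodes, links)

outdegree = [sum(row) for row in A]        # row sums
indegree = [sum(col) for col in zip(*A)]   # column sums
```

Row sums give the outdegree of each node and column sums the indegree, which is what link analysis methods such as PageRank build on.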
PageRank – Exploring the Wisdom of Crowds




Measures relative importance of pages on the graph
Importance of a page depends on the importance of the
pages that point to it
Random Surfer Model: once on a page, the surfer chooses to follow one of the outlinks with prob. α, or to jump to a random page with prob. (1 − α)
PR: probability of being at a certain page after a sufficiently large number of steps
S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.
PageRank – Formal Definition:
$$PR(q) = \epsilon \sum_{p \mid p \to q} \frac{PR(p)}{out(p)} + (1 - \epsilon)\,\frac{1}{N}$$
N → Total number of pages;
PR(p) → PageRank of page p;
out(p) → Outdegree of p;
ε → Probability of following an outlink (the random jump probability is 1 − ε)

Can be computed using power iteration method


In practice more efficient versions can be used
Google is believed to use it on the Web graph, combined
with other metrics, to rank their search results
PageRank – Matrix Notation

A → Matrix containing the transition probabilities
$$A = \epsilon P^{T} + (1 - \epsilon) E$$
where $P_{ij} = 1/out(i)$ if there is a link from i to j, 0 otherwise; E is the random jumps matrix
Probability distribution vector at time k:
$$x^{(k)} = A^{k} x^{(0)}$$
where $x^{(0)}$ is the starting vector
PageRank → Stationary distribution of the Markov Chain described by A, i.e., principal eigenvector of A
$$PageRank = \lim_{k \to \infty} x^{(k)}$$
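A minimal sketch of the power iteration, assuming a uniform random jump (every entry of E equal to 1/N) and assuming that pages without outlinks spread their score uniformly; neither choice is stated on the slides. The function name `pagerank` and the toy graph are illustrative only.

```python
def pagerank(links, eps=0.85, iters=100):
    """links: dict page -> list of pages it points to.

    eps is the probability of following an outlink; 1 - eps is the
    random jump probability.
    """
    nodes = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}                # starting vector x(0)
    for _ in range(iters):
        nxt = {q: (1.0 - eps) / n for q in nodes}   # random jump term
        for p in nodes:
            outs = links.get(p, [])
            # Assumption: a page without outlinks jumps uniformly at random.
            targets = outs if outs else nodes
            share = eps * pr[p] / len(targets)
            for q in targets:
                nxt[q] += share
        pr = nxt
    return pr

# Hypothetical graph: a <-> b, c -> a
scores = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

Each loop iteration is one application of A to the current vector, so `iters` plays the role of k in the formula above.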
Going Distributed


PageRank in principle needs the whole graph at one
place
Shortcomings:





Not scalable for huge graphs, like the Web
Slow update – computing PageRank on such a huge graph can take weeks
Not suitable for different network architectures (e.g. P2P)
Distributed approaches, where the graph is partitioned, are clearly needed
Some distributed approaches (more details on the next slides):
Local PageRank + ServerRank (Wang et al.)
BlockRank (Kamvar et al.)
JXP (Parreira et al.)
The “Block Structure”

Most links are between web pages inside the same host
Block structure can be exploited for speeding up and/or distributing the PR computation
[Figure: adjacency matrix with dense diagonal blocks for the pages from Host A and the pages from Host B]
BlockRank

PageRank in three steps:
1. Computes “local PageRanks” of pages for each host, by considering only intra-host links
2. Computes the importance of the host, using the local PR values and the inter-host links
3. Combines previous values to create the starting vector for the standard PR algorithm
Speeds up computation
Step 1 can be parallelized
Still needs the whole matrix for step 3
S. Kamvar, T. Haveliwala, C. Manning & G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.
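The three steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: the host graph here weights each inter-host link by the local PageRank of its source page, pages or hosts without outlinks are assumed to jump uniformly, and all function names and the example graph are hypothetical.

```python
def weighted_pagerank(weights, nodes, eps=0.85, iters=100):
    """Power iteration; weights[(u, v)] > 0 is the weight of edge u -> v."""
    n = len(nodes)
    out_sum = {u: 0.0 for u in nodes}
    for (u, _v), w in weights.items():
        out_sum[u] += w
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - eps) / n for u in nodes}
        for (u, v), w in weights.items():
            if out_sum[u] > 0.0:
                nxt[v] += eps * rank[u] * w / out_sum[u]
        # Assumption: rank of nodes without outlinks is spread uniformly.
        lost = sum(rank[u] for u in nodes if out_sum[u] == 0.0)
        for u in nodes:
            nxt[u] += eps * lost / n
        rank = nxt
    return rank

def blockrank_start_vector(links, host_of):
    """links: list of (src_page, dst_page); host_of: page -> host."""
    pages = sorted(host_of)
    hosts = sorted(set(host_of.values()))
    # Step 1: local PageRank per host, intra-host links only.
    local = {}
    for h in hosts:
        members = [p for p in pages if host_of[p] == h]
        intra = {(s, d): 1.0 for s, d in links
                 if host_of[s] == h and host_of[d] == h}
        local.update(weighted_pagerank(intra, members))
    # Step 2: host importance from local PR values and inter-host links.
    block = {}
    for s, d in links:
        hs, hd = host_of[s], host_of[d]
        if hs != hd:
            block[(hs, hd)] = block.get((hs, hd), 0.0) + local[s]
    host_rank = weighted_pagerank(block, hosts)
    # Step 3: starting vector for the standard PageRank iteration.
    return {p: local[p] * host_rank[host_of[p]] for p in pages}

host_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
links = [("a1", "a2"), ("a2", "a1"), ("a1", "b1"), ("b1", "b2"), ("b2", "b1")]
x0 = blockrank_start_vector(links, host_of)
```

The per-host loop in step 1 is independent across hosts, which is the part that can be parallelized; the resulting `x0` would still be fed into a PageRank run over the whole matrix.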
Going Distributed…

Local PR + ServerRank



Similar to BlockRank
Local PR : PR computed inside each server using intra server links
ServerRank: PR computed on server graph using inter server links




Server graph does not need to be materialized. Computation is done by
exchanging messages among servers
Local PR and ServerRank are combined to approximate the true
PR of a page
Values can be further refined by using Local PR info on ServerRank
computation and vice versa.
Server partition can be a limitation…
Y. Wang & D. J. DeWitt. Computing pagerank in a distributed internet search system.
In VLDB, 2004.
Partition at “peer level”

In P2P networks, server partition is not suitable
[Figure: the global graph partitioned among Peer A, Peer B, and Peer C]
Partition at “peer level”

Every peer crawls Web fragments at its discretion




Peers have only local (incomplete) information
Pages might link to, or be linked from, pages at other peers
Overlaps between peers’ graphs may occur
Peers are a priori unaware of other peers’ contents
[Figure: overlapping graph fragments held by Peer A, Peer B, and Peer C]
Adaptive OPIC



OPIC: Online Page Importance Computation
Computes the importance of a page on-line, with few
resources
Algorithm:




Pages initially receive some cash
Pages are randomly visited
When a page is visited, its cash is distributed between the pages it
points to
The page importance for a given page is computed using the
history of cash of that page
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page
importance computation. In WWW, 2003.
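The cash game can be sketched in a few lines. This is a minimal illustration of the basic (non-adaptive) idea only: the three-page graph below is hypothetical and is not the link structure of the slide example, importance is estimated as each page's share of the total cash history, and the `opic` name, `steps`, and `seed` parameters are assumptions.

```python
import random

def opic(links, start, steps=5000, seed=0):
    """links: page -> list of pages it points to (every page needs an outlink)."""
    rng = random.Random(seed)
    pages = sorted(links)
    cash = {p: 0.0 for p in pages}
    cash[start] = 1.0                      # one page holds all the cash initially
    history = {p: 0.0 for p in pages}
    for _ in range(steps):
        page = rng.choice(pages)           # pages are visited at random
        amount = cash[page]
        history[page] += amount            # record the cash the page had collected
        cash[page] = 0.0
        for target in links[page]:         # split the cash among the pages it points to
            cash[target] += amount / len(links[page])
    total = sum(history.values())
    return {p: history[p] / total for p in pages}

# Hypothetical web: alice -> bob, alice -> george, bob -> alice, george -> alice
web = {"alice": ["bob", "george"], "bob": ["alice"], "george": ["alice"]}
importance = opic(web, start="alice")
```

No link matrix is stored and each step touches a single page, which is why the method needs few resources; the adaptive variant would additionally restrict `history` to a recent time window.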
Adaptive OPIC
Example:
Small Web of 3 pages
Alice has all the cash to start (importance is independent of the initial state)
[Figure: three linked pages: Alice, Bob, George]
Cash-Game History:
Alice received    600  (200+400)      40%
Bob received      600  (200+100+300)  40%
George received   300  (200+100)      20%
Adaptive OPIC



No particular graph partition
No need to store the link matrix
Adapts to the changes on the web graph by considering
only the recent part of the cash history for each page



Time window: [now-T, now]
High number of messages exchanged
Does not handle case where same page is stored at
more than one place
The JXP Algorithm





Decentralized algorithm for computing global authority
scores of pages in a P2P Network
Runs locally at every peer
No coordinator, asynchronous
Combines Local PageRank computations + Meetings
between peers
JXP scores converge to the true global PageRank
scores
Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel and Gerhard
Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web
Search Network. The VLDB Journal, 2007.
The JXP Algorithm

“World Node”:

Special node attached to the local graph at every peer
 Compact representation of all other pages in the network
 “Special features”:




All links from local pages to external pages point to World Node
Links from external pages that point to local pages (discovered during meetings) are
represented at the World Node
Score and outdegree of these external pages are stored; World Node outgoing links are
weighted to reflect score mass given by original link
Self-loop link to represent transitions among external pages
The JXP Algorithm

Initialization step:
  Local graph is extended by adding the world node
  PageRank is computed in the extended graph → JXP scores
Main algorithm (for every Pi in the network)
  Select Pj to meet
  Update world node
    Add edges for pages in Pj that point to pages in Pi
    If an edge already exists at the world node, the score of the source page is updated by taking the highest of both scores
  Compute PageRank → JXP scores
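The "update world node" step of a meeting can be sketched as below. The data layout (a dictionary keyed by link, holding the score and outdegree of the external source page) and the name `update_world_node` are assumptions made for illustration; the page names and scores are invented, and the subsequent PageRank recomputation on the extended graph is omitted.

```python
def update_world_node(world_links, local_pages, peer_links, peer_scores):
    """Merge what peer Pj knows into the world node of peer Pi.

    world_links: dict (src, dst) -> {"score": float, "outdegree": int} for
                 external pages src that point to local pages dst.
    local_pages: set of pages stored at Pi.
    peer_links:  dict page -> list of successors, as known by Pj.
    peer_scores: dict page -> current JXP score of that page at Pj.
    """
    for src, successors in peer_links.items():
        if src in local_pages:
            continue  # only external pages are represented at the world node
        for dst in successors:
            if dst not in local_pages:
                continue  # the link does not point into Pi's local graph
            entry = world_links.get((src, dst))
            if entry is None:
                world_links[(src, dst)] = {"score": peer_scores[src],
                                           "outdegree": len(successors)}
            else:
                # Edge already known: keep the highest of both scores.
                entry["score"] = max(entry["score"], peer_scores[src])
    return world_links

# Hypothetical meeting: Pi stores A..E and already knows the link G -> C.
world = {("G", "C"): {"score": 0.10, "outdegree": 1}}
local_pages = {"A", "B", "C", "D", "E"}
peer_links = {"F": ["A", "E"], "G": ["C"], "E": ["G"]}
peer_scores = {"F": 0.20, "G": 0.15, "E": 0.30}
world = update_world_node(world, local_pages, peer_links, peer_scores)
```

After the update, Pi would recompute PageRank on its local graph plus the world node to obtain the new JXP scores.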
The JXP Algorithm
[Figure: Peer X and Peer Y with their local graphs and world nodes (W) before a meeting, and Peer X's world node after the meeting, extended with the links learned from Peer Y; the result is the subgraph relevant to Peer X]
Theorem: “In a fair series of JXP meetings, the JXP scores of all nodes converge to the true global PR scores”
Locating parts of the Graph



“Finding peers that share common interests”
Many applications can benefit from it
Distributed PR
In principle, peers need to send content only to the peers that contain their successors
Random messages guarantee that those peers will eventually be reached, but part of the messages will be “wasted”
[Figure: Peer X meets Peer Z, whose pages do not link to any page of Peer X; Peer X's world node and the subgraph relevant to Peer X are unchanged after the meeting]
WASTED MEETING! We want to avoid it!
Locating parts of the Graph

Query answering



Ideal: Forward query only to peers that are more likely to provide good answers to it
Query flooding is very expensive
Hash-based queries are not suitable for approximate queries
Locating parts of the Graph

Locating “relevant” peers



Increase performance
Reduce traffic load
Idea: Group peers according to the semantic of their
content and place them into different overlay networks
Semantic Overlay Networks


Partition the P2P network into several thematic networks
Peers with similar or beneficial/complementary content
are “clustered” together


Queries for a given content will be forwarded only to peers with such content
Flooding in smaller networks with smaller TTL (or more results with the same TTL)
Overlay Networks: Random vs. Semantic





Random
  Peers connect to a small set of random peers
  Queries are flooded through the network
  Peers with unrelated content receive the query
  Low performance: High number of messages
  Low recall if only few peers are contacted
Semantic
  Peers connect to peers with related content → Cluster of peers
  Peers identify the query’s topic and forward it only to the set of peers on that topic
  Messages to peers with unrelated content are avoided
  Better performance: Smaller number of messages
  High recall by asking only few peers
When creating SONs…

Two main things to consider
  Node partitioning
  Clustering criteria
Node partitioning - When does a peer belong to SON A?
  When it contains a doc of type A
  When it contains more than x docs of type A
    Fewer peers per SON → more results sooner
    Fewer SONs per peer → fewer connections
Clustering criteria - Clustering must provide:
  Load-balance
    Each category has similar number of nodes
    Each node belongs to a small number of categories
  Easy and accurate way to classify a document
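The two node-partitioning rules can be sketched with a single threshold x. The function name `assign_sons`, the input layout, and the example peers and categories are hypothetical.

```python
from collections import Counter

def assign_sons(peer_docs, x=0):
    """peer_docs: peer -> list of document categories (one entry per document).

    A peer joins SON c when it holds more than x documents of category c;
    x = 0 reproduces the rule "contains a doc of type c".
    """
    sons = {}
    for peer, categories in peer_docs.items():
        for category, count in Counter(categories).items():
            if count > x:
                sons.setdefault(category, set()).add(peer)
    return sons

peer_docs = {
    "p1": ["rock", "rock", "jazz"],
    "p2": ["jazz", "jazz", "jazz"],
    "p3": ["rock"],
}
loose = assign_sons(peer_docs, x=0)
strict = assign_sons(peer_docs, x=1)
```

Raising x shrinks every SON and the number of SONs per peer, at the price of missing peers that hold only a few matching documents.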
Crespo and Garcia-Molina

Uses a classification hierarchy to form the overlay networks

Documents and queries are classified into one or more concepts
 Queries are forwarded to peers in the super/sub concepts
A. Crespo and H. Garcia-Molina. Semantic Overlay Networks for P2P
Systems. Technical report, Stanford University, January 2003.
Crespo and Garcia-Molina

Reported results show a significant improvement in the number of messages
Music file sharing scenario: to get half the documents that match a query:
  SONs: 461 msgs
  Gnutella: 1731 msgs
SON links are “logical”: two peers that are connected in a SON can actually be many hops away from each other
The requirement that the hierarchy and the classification algorithm are shared among all nodes might be a problem
pSearch


Semantic Overlay on top of Content Addressable
Networks (CANs)
Latent Semantic Indexing (LSI) is used to generate a
semantic vector for each document



Semantic vectors are used as keys to store docs indices in the
CAN
Indices close in semantics are stored close in the overlay
Two types of operations


Publish document indices
Process queries
Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer
Information Retrieval Using Self-Organizing Semantic Overlay Networks. In
SIGCOMM, 2003.
pSearch Key Idea
[Figure: documents and a query placed as points in the semantic space, which is partitioned into zones A to I]
Background: Content-Addressable Network
• Partition Cartesian space into zones
• Each zone is assigned to a computer
• Neighboring zones are routing neighbors
• An object key is a point in the space
• Object lookup is done through routing
[Figure: Cartesian space partitioned into zones A to E]
Background: Vector Space Model

Term vectors represent documents and queries
  Elements correspond to importance of term in document or query
Statistical computation of vector elements
  Term frequency * inverse document frequency
Ranking of retrieved documents
  Similarity between document vector and query vector
Background: Vector Space Model
vocabulary   Va     Vb     Vq
computer     0.5    0      0
network      0.5    0.5    0.5
P2P          0      0.25   0.5
routing      0      0.25   0
A: “books on computer networks”
B: “network routing in P2P networks”
Q: “P2P network”
Similarity to the query: Va · Vq = 0.25, Vb · Vq = 0.375
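The ranking in this example can be reproduced with a plain dot product, which is assumed here as the similarity measure because it matches the two similarity values shown on the slide (0.25 and 0.375). Variable names are illustrative.

```python
vocabulary = ["computer", "network", "P2P", "routing"]

# Term weights of the two documents and the query, in vocabulary order.
docs = {
    "A": [0.5, 0.5, 0.0, 0.0],    # "books on computer networks"
    "B": [0.0, 0.5, 0.25, 0.25],  # "network routing in P2P networks"
}
query = [0.0, 0.5, 0.5, 0.0]      # "P2P network"

def dot(u, v):
    """Dot product of two equally long term vectors."""
    return sum(a * b for a, b in zip(u, v))

scores = {name: dot(vec, query) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Document B ranks first because it shares both query terms, while A only shares "network".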
Background: Latent Semantic Indexing


The dimension of the document vectors has to match the dimension of the CAN
Latent Semantic Indexing uses Singular Value Decomposition (SVD)
  Maps a high-dimensional term vector to a low-dimensional semantic vector
  Elements correspond to importance of abstract concept in document/query
Also helps to overcome the synonym problem (e.g., a user looks for car and doesn’t find a document about automobile)
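As a hedged sketch of the usual LSI construction (the symbols $A$, $U_k$, $\Sigma_k$, $V_k$, $t$, $n$, and $k$ are notation assumed here, not taken from the slides): the $t \times n$ term-by-document matrix $A$ is approximated by its rank-$k$ truncated SVD, and term vectors are projected onto the $k$ leading left singular vectors.

```latex
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{T},
\qquad U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

\hat{d} \;=\; U_k^{T}\, d,
\qquad
\hat{q} \;=\; U_k^{T}\, q
```

Here $d$ and $q$ are the high-dimensional term vectors of a document and a query, and $\hat{d}$, $\hat{q}$ are their $k$-dimensional semantic vectors; some formulations additionally scale by $\Sigma_k^{-1}$. In pSearch, $k$ would be chosen to match the dimension of the CAN.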
Background: Latent Semantic Indexing
[Figure: SVD maps the term-by-document matrix (term vectors Va, Vb, …) to semantic vectors V’a, V’b, …]
SVD: singular value decomposition
  Reduce dimensionality
  Suppress noise
  Discover word semantics
    Car <-> Automobile
pSearch Basic Algorithm: Steps
1. Receive a new document A: generate a semantic vector Va, store the key in the index
2. Receive a new query Q: generate a semantic vector Vq, route the query in the overlay
3. The query is flooded to nodes within a radius r
   r is determined by similarity threshold or number of wanted documents
4. All receiving nodes do a local search and report references to best matching documents
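The four steps can be sketched with a toy, centralized stand-in for the CAN. This is an illustration under loud assumptions: each node is reduced to a single centre point and owns the keys closest to it, Euclidean distance replaces the similarity measure, and there is no real routing or message passing. The class name `SemanticOverlay` and all coordinates are hypothetical.

```python
import math

def dist(u, v):
    """Euclidean distance between two points of the semantic space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

class SemanticOverlay:
    """Toy stand-in for the CAN: each node owns the points closest to its centre."""

    def __init__(self, centres):
        self.centres = centres                        # node -> centre point
        self.index = {node: [] for node in centres}   # node -> [(doc_id, vector)]

    def owner(self, vector):
        return min(self.centres, key=lambda n: dist(self.centres[n], vector))

    def publish(self, doc_id, vector):
        # Step 1: the semantic vector is the key under which the index entry is stored.
        self.index[self.owner(vector)].append((doc_id, vector))

    def search(self, query_vector, r, top=3):
        # Step 2: route the query to the node owning the query point.
        target = self.owner(query_vector)
        # Step 3: flood to the nodes within radius r of the query point.
        reached = [n for n in self.centres
                   if n == target or dist(self.centres[n], query_vector) <= r]
        # Step 4: every reached node searches locally; the results are merged.
        hits = [(dist(vec, query_vector), doc_id)
                for n in reached for doc_id, vec in self.index[n]]
        return [doc_id for _d, doc_id in sorted(hits)[:top]]

overlay = SemanticOverlay({"n1": (0.25, 0.25), "n2": (0.75, 0.25),
                           "n3": (0.25, 0.75), "n4": (0.75, 0.75)})
overlay.publish("d1", (0.2, 0.2))
overlay.publish("d2", (0.7, 0.3))
overlay.publish("d3", (0.8, 0.8))
wide = overlay.search((0.3, 0.3), r=0.5, top=2)
narrow = overlay.search((0.3, 0.3), r=0.1, top=2)
```

A larger radius r reaches more nodes and returns more documents, at the cost of more messages; that is the trade-off controlled by the similarity threshold or the number of wanted documents.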
pSearch Illustration
[Figure: steps 1 to 4 of the basic algorithm in the semantic space, showing the doc, the query, and the search region for the query]
p2pDating


Start with a randomly connected network
Peers meet other peers they do not know (“blind dates”)

If a peer “likes” another it will remember it as a “friend”.



A remembers B ⇒ abstract link A → B
Directed links ⇒ preserve peers’ autonomy
SONs dynamically evolve from the meeting process
J. X. Parreira et al. p2pDating: Real Life Inspired Semantic Overlay Networks for Web Search. Information Processing & Management [43], 643-664
p2pDating

Finding new friends


Random meetings (Blind dates)
Meet friends of friends
[Figure: peer A, peer B, and B’s friends]
If A and B are friends… it is very likely that B’s friends are friends of A as well.
Defining Good Friends

Criteria for defining a good friend ⇒ combination of different measures
  History: Credits for good behavior in the past
    Response time, query result precision, etc…
  Collection similarity
  Collection overlap
    Different ways of estimating the overlap between two collections
  Number of links between peers
  Etc…
Peers might have more than one list of friends
  E.g., according to different criteria
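One way to combine such measures is a weighted sum, sketched below. The weights, the measure names, the assumption that every measure lies in [0, 1] with higher meaning better, and the bounded friend list are all illustrative choices, not the scoring actually used by p2pDating.

```python
def friend_score(measures, weights):
    """Weighted combination of per-peer measures, each assumed to lie in [0, 1]."""
    return sum(weights[name] * measures.get(name, 0.0) for name in weights)

def update_friends(friends, candidate, measures, weights, max_friends=3):
    """Remember `candidate` if it ranks among the best `max_friends` peers met so far."""
    friends = dict(friends)
    friends[candidate] = friend_score(measures, weights)
    keep = sorted(friends, key=friends.get, reverse=True)[:max_friends]
    return {peer: friends[peer] for peer in keep}

# Hypothetical weights and meeting outcomes.
weights = {"history": 0.2, "similarity": 0.5, "overlap": 0.3}
friends = {}
friends = update_friends(friends, "B", {"history": 0.9, "similarity": 0.8, "overlap": 0.5},
                         weights, max_friends=2)
friends = update_friends(friends, "C", {"history": 0.1, "similarity": 0.2, "overlap": 0.9},
                         weights, max_friends=2)
friends = update_friends(friends, "D", {"history": 0.5, "similarity": 0.9, "overlap": 0.4},
                         weights, max_friends=2)
```

A peer that keeps several friend lists would simply run this with a different `weights` dictionary per list.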
Going Social…

Before:




Only few content producers (e.g., companies, universities)
Analysis was done using the content itself plus a few implicit
recommendations (links)
Very little information about the content consumers (mainly
through query logs)
Nowadays:



New technologies to facilitate content sharing
Content consumers are now also content producers and content describers (e.g., explicit recommendations, tags, etc.)
More and more crowd wisdom that can be harvested
Social Networks

A social structure made of nodes (which are generally
individuals or organizations) that are tied by one or more
specific types of relations, such as








values
visions
ideas
friends
conflict
web links
Etc
Social networks have been studied for over a century
Social Network Services


Enable the creation of online social networks for communities of
people who share interests and activities, or who are interested in
exploring the interests and activities of others
Online communities offer an easy way
for users to publish and share their content.
Social Networking Growth

Several social networking sites have experienced
dramatic growth during the past year.
Worldwide Growth of Selected Social Networking Sites. June 2007 vs. June 2006, Users Age 15+, Source: comScore
Total Unique Visitors (millions)
Social Networking Site   Jun-06   Jun-07   % Change
MySpace                   66.41   114.15         72
Facebook                  14.08    52.17        270
Hi5                       18.10    28.17         56
Friendster                14.92    24.68         65
Orkut                     13.59    24.12         78
Bebo                       6.69    18.20        172
Tagged                     1.51    13.17        774
What people share…
Social Networks

Besides sharing content, a user can…




…describe documents using tags
…maintain a list of friends
…make comments on other users’ content, exchange opinions,
discover users with similar profile.
In contrast to Web Graph, in Social Graphs users are
part of the model
Social Content Graph
Sihem Amer-Yahia, Michael Benedikt, Philip Bohannon: Challenges in Searching
Online Communities. IEEE Data Eng. Bull. 30(2): 23-31 (2007)
Social Graphs

Other models also possible
  Directed vs. undirected edges
  Etc.
[Figure: graph connecting users, tags, and docs]
Standard IR techniques for Web retrieval need to be adapted to work on social networks; a lot of current research is dedicated to this area
Social Networks

The Wisdom of Crowds: Beyond PR
  Spectral analysis of various graphs
  E.g., SocialPageRank, FolkRank.
Tag semantic analysis
  Discovering semantics from tag co-occurrence
  E.g., SocialSimRank
Distributed View
  Exploiting social relations to enhance search
  E.g., PeerSpective
Link Analysis in Social Networks

SocialPageRank:



High quality web pages are usually popularly annotated, and popular web pages, up-to-date web users and hot social annotations can mutually enhance each other.
Let $M_{UT}$, $M_{TD}$, $M_{DU}$ be the matrices corresponding to the relations Users × Tags, Tags × Docs, Docs × Users
Compute iteratively:
$$r_U = M'_{DU} \times r_D$$
$$r_D = M'_{TD} \times r_T$$
$$r_T = M'_{UT} \times r_U$$
[Figure: tripartite graph of Users, Tags, and Documents with example nodes a, b, c]
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
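The iteration can be sketched in plain Python. Two assumptions are made here that the slide does not spell out: M' is read as the transpose of M, and every vector is normalized to sum to 1 after each step to keep the values bounded. The matrices below are invented count matrices and `social_pagerank` is a hypothetical name.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalise(v):
    total = sum(v)
    return [x / total for x in v]

def social_pagerank(m_ut, m_td, m_du, iters=50):
    """m_ut: users x tags, m_td: tags x docs, m_du: docs x users."""
    r_d = [1.0 / len(m_du)] * len(m_du)
    for _ in range(iters):
        r_u = normalise(matvec(transpose(m_du), r_d))   # r_U = M'_DU x r_D
        r_t = normalise(matvec(transpose(m_ut), r_u))   # r_T = M'_UT x r_U
        r_d = normalise(matvec(transpose(m_td), r_t))   # r_D = M'_TD x r_T
    return r_u, r_t, r_d

# Hypothetical data: 2 users, 2 tags, 3 documents.
m_ut = [[2, 0], [1, 1]]            # how often each user used each tag
m_td = [[2, 1, 0], [0, 0, 1]]      # how often each tag was put on each doc
m_du = [[1, 1], [1, 0], [0, 1]]    # which users annotated each doc
r_u, r_t, r_d = social_pagerank(m_ut, m_td, m_du)
```

Scores flow around the cycle documents → users → tags → documents, so a document gains rank when it carries tags used by users who annotate highly ranked documents.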
Link Analysis in Social Networks

FolkRank



Define graph G as union of graphs UsersTags, TagsDocs, DocsUsers
Assume each user has a personal preference vector p
Compute iteratively:
$$r_D = \alpha\, r_D + \beta\, M_G \times r_D + \gamma\, p$$
FolkRank vector of docs is:
$$r_D\big|_{\gamma > 0} - r_D\big|_{\gamma = 0}$$
Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme: Information Retrieval in Folksonomies: Search and Ranking. ESWC 2006: 411-426
Tag Similarity

SocialSimRank


Idea: Similar annotations (tags) are usually assigned to similar
web pages by users with common interests.
sim(t1, t2) ~ aggr {sim(d1,d2) | (t1,d1), (t2,d2) ∈ Tagging}
sim(d1, d2) ~ aggr {sim(t1,t2) | (t1,d1), (t2,d2) ∈ Tagging}
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007
Exploring friendship connections

PeerSpective: users can query their friends’ viewed
pages


HTTP proxies on users’ computers index all browsed content
When a Google search is performed, the query is also sent to the other proxies in parallel
Alan Mislove, Krishna P. Gummadi, and Peter Druschel. Exploiting Social Networks for Internet Search. HotNets, 2006.
Social Networks


New paradigm of publishing and searching content
Rich data



Different link structures
Users input for free!!!
Relatively recent topic: Lots of research opportunities

Works mentioned are by no means complete, still a lot to do
Since we are talking about Web 2.0…
http://p2pinformationsearch.blogspot.com/