CS276A Text Information Retrieval, Mining, and Exploitation

CS728
Lecture 16
Web indexes II
Last Time
• Indexes for answering text queries
– Given a term, produce all URLs containing it
– Compact representations
• for postings: used gap (Elias) encodings
• for dictionary: used pointers into a string of terms
Today’s lecture
• Indexes for connectivity testing
• Distance and Transitive Closure
• Data Structure: 2-hop covers
Connectivity Server
• Support for fast queries on the web graph
– Which URLs point to a given URL?
– Which URLs does a given URL point to?
• Stores mappings in memory from
– URL to outlinks, URL to inlinks
• Applications
– Crawl control, Web graph analysis
• Connectivity, crawl optimization
– Link analysis
Adjacency lists
• The set of neighbors of a node
• Assume each URL represented by an
integer
• E.g., for a 4 billion page web, need 32
bits per node
• Naively, this demands 64 bits to
represent each hyperlink
Adjacency list compression
• Properties exploited in
compression:
–Similarity (between lists)
–Locality (many links from a page go
to “nearby” pages)
–Use gap encodings in sorted lists
–Distribution of gap values
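A minimal sketch of gap encoding on a sorted adjacency list, with each gap written in Elias gamma code. The function names are illustrative, and the first gap is taken relative to 0 for simplicity (a real scheme could take it relative to the source node to exploit locality). Node ids are assumed to be positive and strictly increasing.

```python
def gaps(sorted_ids):
    """Turn a strictly increasing list of positive ids into first value + gaps."""
    out = []
    prev = 0
    for x in sorted_ids:
        out.append(x - prev)
        prev = x
    return out


def elias_gamma(n):
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]                      # binary digits without the '0b' prefix
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the digits


def encode_adjacency(sorted_ids):
    """Concatenate the gamma codes of all gaps."""
    return "".join(elias_gamma(g) for g in gaps(sorted_ids))


example = [1000, 1002, 1003, 1010]   # links to "nearby" pages give small gaps
encoded = encode_adjacency(example)
print(gaps(example), len(encoded), "bits instead of", 32 * len(example))
```

Small gaps get short codes, which is why the distribution of gap values matters for the final size.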
Storage
• A recent paper by Boldi/Vigna reports getting down to an average of ~3 bits/link
– (URL to URL edge)
– For a 118M node web graph
• Why is this remarkable?
• How?
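One way to read the "remarkable" question, as a back-of-the-envelope check: compare ~3 bits/link with the naive cost of storing two node ids per link. The naive scheme here is an assumption (fixed-width ids, no compression), in the spirit of the 64 bits/link figure given earlier for a 4 billion page web.

```python
import math

nodes = 118_000_000
bits_per_id = math.ceil(math.log2(nodes))   # smallest fixed width that can number all nodes
naive_bits_per_link = 2 * bits_per_id       # source id + target id
reported_bits_per_link = 3                  # average reported on the slide

print(bits_per_id, naive_bits_per_link, naive_bits_per_link / reported_bits_per_link)
```

Under these assumptions, a single node id alone already costs several times more than the reported average for a whole link.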
Main ideas of Boldi/Vigna
• Consider lexicographically ordered list of
all URLs, e.g.,
– www.stanford.edu/alchemy
– www.stanford.edu/biology
– www.stanford.edu/biology/plant
– www.stanford.edu/biology/plant/copyright
– www.stanford.edu/biology/plant/people
– www.stanford.edu/chemistry
Boldi/Vigna
• Each of these URLs has an adjacency list
• Main thesis: because of templates, the adjacency list of
a node is similar to one of the 7 preceding URLs in the
lexicographic ordering
• Express the adjacency list in terms of one of these
• E.g., consider these adjacency lists
– 1, 2, 4, 8, 16, 32, 64
– 1, 4, 9, 16, 25, 36, 49, 64
– 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
– 1, 4, 8, 16, 25, 36, 49, 64
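A simplified sketch of the reference idea on the four lists above: describe a list by which entries it copies from an earlier list, plus the leftover entries. This is not the actual Boldi/Vigna format; the function names and the cost measure (number of leftover entries) are assumptions made for illustration.

```python
def encode_with_reference(current, reference):
    """Return (copy_flags, extras).

    copy_flags[i] is 1 if reference[i] also occurs in current;
    extras are the entries of current that the reference does not supply.
    """
    cur = set(current)
    ref = set(reference)
    copy_flags = [1 if x in cur else 0 for x in reference]
    extras = [x for x in current if x not in ref]
    return copy_flags, extras


def decode_with_reference(copy_flags, extras, reference):
    """Rebuild the sorted adjacency list from the reference encoding."""
    copied = [x for x, flag in zip(reference, copy_flags) if flag]
    return sorted(copied + extras)


def best_reference(current, previous_lists):
    """Index of the earlier list that leaves the fewest extras (a crude cost)."""
    return min(
        range(len(previous_lists)),
        key=lambda i: len(encode_with_reference(current, previous_lists[i])[1]),
    )


lists = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
    [1, 4, 8, 16, 25, 36, 49, 64],
]
current = lists[3]
ref_index = best_reference(current, lists[:3])
flags, extras = encode_with_reference(current, lists[ref_index])
print(ref_index, flags, extras)
```

The fourth list is almost a copy of the second, so it reduces to a short bit vector plus one explicit entry; the extras can then be gap-encoded as usual.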
Connectivity Queries
• Beyond adjacency we’d like to answer
– Transitive closure: is there a path from x to y?
– Distance: what is the length of shortest path
from x to y?
• Applications
– Link analysis
– XML path queries with wildcards
Naïve Solutions
• Given graph
– Compute and store APSPs
– Answer any query in constant time
– Space requirements?
• OR online
– Given query compute SSSP
– No additional space
– Time to answer query?
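The two naive options can be sketched as follows, assuming an unweighted directed graph stored as a dict of outlink lists (the representation and names are illustrative). The online variant runs one breadth-first search per query; the precomputed variant runs it from every node and stores the result.

```python
from collections import deque


def bfs_distances(graph, source):
    """Single-source shortest paths (in hops) by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def all_pairs(graph):
    """Precompute all-pairs shortest paths; up to n*n stored entries."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    return {u: bfs_distances(graph, u) for u in nodes}


graph = {1: [2], 2: [3], 3: [], 4: [1]}

# Online: no extra space, but O(n + m) work per query.
online = bfs_distances(graph, 4)

# Precomputed: constant-time lookups, but quadratic space in the worst case.
table = all_pairs(graph)
print(online.get(3), table[1].get(3), 4 in table[3])
```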
Encoding Problem
Find a compact representation for the
transitive closure
• whose size is comparable to the data's size
• that supports connection tests (almost) as fast
as the naive transitive closure lookup
• that can be built efficiently for large data sets
Main Idea: 2-Hop Covers and
2-Hop Labeling
• A 2-hop cover is a set of hops (x,y) such that every connected pair
is covered by 2 hops
• For each node a, maintain two sets of labels (which are
nodes): Lin(a) and Lout(a)
• For each connection (a,b),
– choose a node c on the path from a to b (center node)
– add c to Lout(a) and to Lin(b)
• Then (a,b) ∈ Transitive Closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
[Figure: path from a to b through center node c]
Reachability and distance queries via 2-hop Labels
(Cohen et al., SODA 2002)
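A minimal sketch of the connection test on 2-hop labels. It assumes labels are stored as Python sets and that every node implicitly belongs to its own Lin and Lout, so that connections whose center is one of the endpoints need no extra entry; this convention and the names are assumptions of the sketch.

```python
def reachable(a, b, l_out, l_in):
    """True if some center node lies in both Lout(a) and Lin(b)."""
    out_a = l_out.get(a, set()) | {a}   # implicit self entry
    in_b = l_in.get(b, set()) | {b}     # implicit self entry
    return not out_a.isdisjoint(in_b)


# Labels for the chain a -> c -> b, with c as the center of every connection.
l_out = {"a": {"c"}}
l_in = {"b": {"c"}}

print(reachable("a", "b", l_out, l_in), reachable("b", "a", l_out, l_in))
```

A query touches only the two label sets involved, so its cost depends on label sizes rather than on the size of the graph.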
2-hop Covers
• Conjecture: 2-hop covers of size O(n √m) always exist
• Goal: Minimize the sum of the label sizes
• Problem is NP-complete
– => approximation required
• Theorem: There exists a polynomial-time
algorithm that approximates the optimal cover
within a factor of log n.
• Greedy (set cover) algorithm
Approximation Algorithm
Initial step: all connections are uncovered
• What are good center nodes?
– Nodes that can cover many uncovered connections
⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1 to 6, and the center graph of a candidate node with |I| = 2 and |O| = 4]
• Initial density: |Edges| / (|I| + |O|) = 8 / (2 + 4) ≈ 1.33
– (We can cover 8 connections with 6 cover entries)
• Density of densest subgraph (here: same as initial density)
Approximation Algorithm
Initial step: all connections are uncovered
• What are good center nodes?
– Nodes that can cover many uncovered connections
⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1 to 6, and the center graph of a candidate node with |I| = 4 and |O| = 3]
• Initial density: |Edges| / (|I| + |O|) = 12 / (4 + 3) ≈ 1.71
• Density of densest subgraph = initial density (graph is complete)
• Cover connections in the subgraph with greatest density, using the corresponding center node
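The density computation can be sketched as below. The slide's figure is not recoverable, so the graph here is a hypothetical one on nodes 1 to 6, chosen because it reproduces the two densities quoted above (8/6 and 12/7) when every node is counted as connected to itself. Function names are illustrative.

```python
def closure(graph, nodes):
    """Reflexive transitive closure as a set of (u, v) pairs."""
    pairs = set()
    for s in nodes:
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            for v in graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        pairs.update((s, t) for t in seen)
    return pairs


def center_graph(w, tc, uncovered):
    """I = nodes reaching w, O = nodes reached from w, edges = uncovered pairs."""
    ins = {u for (u, v) in tc if v == w}
    outs = {v for (u, v) in tc if u == w}
    edges = {(u, v) for u in ins for v in outs if (u, v) in uncovered}
    return ins, outs, edges


def density(w, tc, uncovered):
    """Covered connections per label entry if w is used as a center."""
    ins, outs, edges = center_graph(w, tc, uncovered)
    return len(edges) / (len(ins) + len(outs))


# Hypothetical example graph (an assumption, not taken from the figure).
graph = {1: [2], 2: [4], 3: [4], 4: [5, 6]}
nodes = [1, 2, 3, 4, 5, 6]
tc = closure(graph, nodes)
uncovered = set(tc)          # initial step: nothing is covered yet

print(round(density(2, tc, uncovered), 2), round(density(4, tc, uncovered), 2))
```

In this example node 4 is the better first center: its center graph is complete, so the densest subgraph is the whole center graph.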
Approximation Algorithm
Next step: some connections already covered
• What are good center nodes?
– Nodes that can cover many uncovered connections
⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1 to 6 with some connections covered, and the remaining center graph of a candidate node]
• Repeat this algorithm until all connections are covered
• Theorem: the generated cover is optimal up to a logarithmic factor
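A simplified sketch of the greedy loop. It is not the algorithm behind the theorem: instead of finding the densest subgraph of each center graph, it scores each candidate by the density of its whole remaining center graph, so the logarithmic guarantee is not claimed for it. The graph is a hypothetical example, connections are taken to be reflexive, and all names are illustrative.

```python
def closure(graph, nodes):
    """Reflexive transitive closure as a set of (u, v) pairs."""
    pairs = set()
    for s in nodes:
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            for v in graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        pairs.update((s, t) for t in seen)
    return pairs


def greedy_two_hop_cover(graph, nodes):
    """Repeatedly pick the center whose remaining center graph is densest."""
    tc = closure(graph, nodes)
    uncovered = set(tc)
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}
    while uncovered:
        best = None
        for w in nodes:
            # Uncovered connections (u, v) with a path u -> w -> v.
            edges = {(u, v) for (u, v) in uncovered
                     if (u, w) in tc and (w, v) in tc}
            if not edges:
                continue
            ins = {u for u, _ in edges}
            outs = {v for _, v in edges}
            score = len(edges) / (len(ins) + len(outs))
            if best is None or score > best[0]:
                best = (score, w, ins, outs, edges)
        _, w, ins, outs, edges = best
        for u in ins:
            l_out[u].add(w)      # w becomes a center in Lout(u)
        for v in outs:
            l_in[v].add(w)       # w becomes a center in Lin(v)
        uncovered -= edges
    return l_out, l_in, tc


graph = {1: [2], 2: [4], 3: [4], 4: [5, 6]}   # hypothetical example
nodes = [1, 2, 3, 4, 5, 6]
l_out, l_in, tc = greedy_two_hop_cover(graph, nodes)
print(l_out, l_in)
```

Every connection is covered by the center it was assigned to, and a center is only ever added along real paths, so the labels answer exactly the transitive closure.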
Experimental Results
Small example from real world: subset of DBLP
6,210 documents (publications)
168,991 elements
25,368 links (citations)
14 Megabytes (uncompressed XML)
Element-level graph has 168,991 nodes and
188,149 edges
Its transitive closure:
344,992,370 connections
2,632.1 MB
Experimental Results
For example above:
Transitive Closure: 344,992,370
connections
Two-Hop Cover:
1,289,930 entries
⇒ compression factor of ~267
⇒ queries are still fast (~7.6 entries/node)
But:
Computation took 45 hours and 80 GB RAM!
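The quoted ratios can be rechecked from the raw counts. The last line is an observation rather than something stated on the slides: the 2,632.1 MB figure is consistent with storing each connection in 8 bytes (for example two 4-byte node ids), which is an assumption.

```python
connections = 344_992_370     # size of the transitive closure
cover_entries = 1_289_930     # size of the two-hop cover
nodes = 168_991               # nodes in the element-level graph

compression = connections / cover_entries
entries_per_node = cover_entries / nodes
closure_mb = connections * 8 / 2**20   # assumes 8 bytes per connection

print(round(compression), round(entries_per_node, 1), round(closure_mb, 1))
```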
Why Distances are Difficult
• Should be simple to add:
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]
– Reachability labels: Lout(v)={u, …}, Lin(w)={u, …}
⇒ Distance labels: Lout(v)={(u,2), …}, Lin(w)={(u,4), …}
– dist(v,w)=dist(v,u)+dist(u,w)=2+4=6
• Is this correct ...
Why Distances are Difficult
[Figure: as before, dist(v,u) = 2 and dist(u,w) = 4, but v and w are also directly connected, so dist(v,w)=1]
⇒ Center node u does not reflect the
correct distance of v and w
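A small sketch of the distance query on labels, showing the failure above and one way it disappears. Labels are assumed to be dicts mapping a center node to a distance, and the query takes the minimum over common centers; the names and the choice of v itself as the extra center are illustrative.

```python
import math


def label_distance(v, w, l_out, l_in):
    """Minimum of dist(v, c) + dist(c, w) over centers c common to both labels."""
    best = math.inf
    for center, d_vc in l_out.get(v, {}).items():
        d_cw = l_in.get(w, {}).get(center)
        if d_cw is not None:
            best = min(best, d_vc + d_cw)
    return best


# Only center u covers (v, w): the labels give 2 + 4 = 6 although dist(v, w) = 1.
l_out = {"v": {"u": 2}}
l_in = {"w": {"u": 4}}
wrong = label_distance("v", "w", l_out, l_in)

# Covering (v, w) with a center on a shortest path (here v itself) fixes the answer.
l_out["v"]["v"] = 0
l_in["w"]["v"] = 1
right = label_distance("v", "w", l_out, l_in)

print(wrong, right)
```

The query is only exact when, for every connected pair, at least one common center lies on a shortest path between the two nodes.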
Solution: Distance-aware Centergraph
• Add edges to the center graph only if the
corresponding connection is a shortest path
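[Figure: example graph on nodes 1 to 6 and the distance-aware center graph of a candidate node, with input side I and output side O]
• Correct, but with problems:
– Expensive to build the center graph (2 additional lookups per
connection)
– Approximation bound is no longer tight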
Enhancements
• Allow approximate distances for more
compact representations