CS276A Text Information Retrieval, Mining, and Exploitation

CS728-2008
Lecture 9
Storing and Querying
Large Web Graphs
Last Time
• Algorithms for link-based clustering
• Finding “tightly knit communities” (TKCs) in
web graphs
Today’s lecture
Dealing with large graphs
Building Indexes for adjacency and
connectivity testing
Distance and Transitive Closure
• New Data Structure: 2-hop covers
Connectivity Server
• Support for fast queries on the web graph
– Which URLs point to a given URL?
– Which URLs does a given URL point to?
• Stores mappings in main memory from
URL to outlinks and from URL to inlinks
• Applications
– Crawl control, Web graph analysis
– Connectivity, crawl optimization
– TKCs, other Link analysis
Problem of Adjacency lists
• ALs store the set of neighbors of each node
• Assume each URL is represented by an
integer
– Use some natural ordering or hashing
– E.g., for a 60K page web, need 16 bits
– For 4 billion page web, need 32 bits per node
• Naively, this demands 32-64 bits to
represent each hyperlink
• Can we compress this?
Adjacency list compression
• Properties exploited in compression:
– Similarity (between lists)
– Locality (many links from a page go to
“nearby” pages)
– Can use gap encodings in sorted lists
– Look at distribution of gap values
Gap encoding (Elias)
• Given a list of integers in increasing order.
– E.g., 33,47,154,159,202 …
• It suffices to store gaps.
– 33,14,107,5,43 …
• We hope that most gaps can be encoded with far fewer bits.
• Represent a gap G as the pair <length,offset>
• length is ⌊log2 G⌋ in unary, which uses ⌊log2 G⌋ + 1 bits and
specifies the length of the binary encoding of offset
• offset = G − 2^⌊log2 G⌋ in binary.
Recall that the unary encoding of x is
a sequence of x 1’s followed by a 0.
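The gap transformation described above can be sketched in a few lines. This is only an illustration: the names `to_gaps` and `from_gaps` are hypothetical and do not come from the lecture.

```python
from itertools import accumulate

def to_gaps(sorted_ids):
    """Keep the first value, then store each difference to the previous value."""
    if not sorted_ids:
        return []
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    return list(accumulate(gaps))

print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]
print(from_gaps([33, 14, 107, 5, 43]))    # [33, 47, 154, 159, 202]
```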
Elias γ codes for gap encoding
• e.g., 9 is represented as <1110,001>.
• 2 is represented as <10,0>.
• Exercise: does zero have a γ code?
• Encoding G takes 2⌊log2 G⌋ + 1 bits.
– γ codes are always of odd length.
– 1 = 2^0 + 0 = 0
– 2 = 2^1 + 0 = 100
– 3 = 2^1 + 1 = 101
– 4 = 2^2 + 0 = 11000
– 5 = 2^2 + 1 = 11001
– 6 = 2^2 + 2 = 11010
– 7 = 2^2 + 3 = 11011
– 8 = 2^3 + 0 = 1110000
– 9 = 2^3 + 1 = 1110001
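A minimal sketch of a γ encoder that follows the <length,offset> convention stated above (unary length, then the offset in binary). The function names `unary` and `gamma_encode` are assumptions made for this illustration, and the sketch only handles gaps G ≥ 1.

```python
def unary(x):
    """Unary code as defined in the slides: x ones followed by a zero."""
    return "1" * x + "0"

def gamma_encode(g):
    """Return the gamma code of a gap g >= 1 as a string of bits."""
    if g < 1:
        raise ValueError("this sketch assumes g >= 1")
    n = g.bit_length() - 1        # floor(log2 g)
    offset = g - (1 << n)         # g minus the largest power of two <= g
    # The offset is written with exactly n bits (no bits at all when n == 0).
    offset_bits = format(offset, "b").zfill(n) if n > 0 else ""
    return unary(n) + offset_bits

for g in (2, 5, 9):
    print(g, gamma_encode(g))
```

Each code has ⌊log2 G⌋ + 1 unary bits plus ⌊log2 G⌋ offset bits, which gives the 2⌊log2 G⌋ + 1 total.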
Exercise
• Given the following sequence of γ-coded
gaps, reconstruct the gap sequence:
1110001110101011111101101111011
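For checking an answer to this kind of exercise, a decoder can be sketched as follows. It assumes the same convention as above (unary length, then offset); the name `gamma_decode` is hypothetical. The usage line decodes a different, shorter bit string so the exercise itself is left open.

```python
def gamma_decode(bits):
    """Decode a concatenation of gamma codes into the list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        n = 0
        # Unary part: count ones up to the terminating zero.
        while i < len(bits) and bits[i] == "1":
            n += 1
            i += 1
        # Need the terminating zero plus n offset bits.
        if i + 1 + n > len(bits):
            raise ValueError("truncated gamma code")
        i += 1                                   # skip the zero
        offset = int(bits[i:i + n], 2) if n else 0
        i += n
        gaps.append((1 << n) + offset)
    return gaps

print(gamma_decode("1110001100"))   # [9, 2]
```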
Storage Requirements
• A recent paper by Boldi and Vigna
reports that we can get down to an
average of ~3 bits/link
– (URL to URL edge)
– For a 118M node web graph
• Why is this remarkable?
• How can this be possible?
Main ideas of Boldi/Vigna
• First consider lexicographically ordered list
of all URLs, e.g.,
– www.stanford.edu/alchemy
– www.stanford.edu/biology
– www.stanford.edu/biology/plant
– www.stanford.edu/biology/plant/copyright
– www.stanford.edu/biology/plant/people
– www.stanford.edu/chemistry
Boldi/Vigna
• Each of these URLs has an adjacency list
• Main thesis: because of use of webpage templates, the
adjacency list of a node is usually similar to one of the 7
preceding URLs in the lexicographic ordering
• Express adjacency list in terms of one of these
• E.g., consider these adjacency lists
– 1, 2, 4, 8, 16, 32, 64
– 1, 4, 9, 16, 25, 36, 49, 64
– 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
– 1, 4, 8, 16, 25, 36, 49, 64
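To make the "express one list in terms of another" idea concrete, here is a rough sketch that describes the fourth list as edits to the most similar preceding list. It only illustrates the similarity idea; it is not the bit-level format used by Boldi and Vigna, and the names `encode_against` and `decode_against` are hypothetical.

```python
def encode_against(reference, target):
    """Describe target as (entries to remove from reference, entries to add)."""
    ref, tgt = set(reference), set(target)
    return sorted(ref - tgt), sorted(tgt - ref)

def decode_against(reference, removed, added):
    """Rebuild the target list from the reference and the two edit lists."""
    return sorted((set(reference) - set(removed)) | set(added))

lists = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
    [1, 4, 8, 16, 25, 36, 49, 64],
]
target = lists[3]

# Pick the preceding list that needs the fewest edits.
best = min(range(3), key=lambda i: sum(len(e) for e in encode_against(lists[i], target)))
removed, added = encode_against(lists[best], target)
print(best, removed, added)   # 1 [9] [8]
```

Here the fourth list is stored as "copy the second list, drop 9, add 8", which is far shorter than the list itself.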
Connectivity Queries
• Beyond basic adjacency we’d like to
answer other queries…
– Transitive closure: is there a path from x to y?
– Distance: what is the length of shortest path
from x to y?
• Applications
– Link analysis
– XML path queries with wildcards
Naïve Solutions
• Given a web graph, we can compute and store
All Pairs Shortest Paths (APSPs) off-line
– Then answer any query in constant time
– What are the space requirements for an n-node graph?
• Alternatively, given a node, we can compute
paths online
– Answer each query with a single-source shortest path algorithm
– Minimal additional space required.
– What is the time complexity to answer a query?
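The online alternative can be sketched with breadth-first search, which serves as the single-source shortest path algorithm when every link has unit length. This is a minimal sketch; `bfs_distance` and the dictionary-of-lists graph `adj` are assumed names, not part of the lecture. A query may touch every node and edge, so it costs O(n + m) time in the worst case, whereas a precomputed all-pairs table needs on the order of n² entries.

```python
from collections import deque

def bfs_distance(adj, x, y):
    """Length of a shortest path from x to y in an unweighted digraph, or None."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if u == y:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return None                        # y is not reachable from x

# Small hypothetical graph: node -> list of out-neighbors.
adj = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
print(bfs_distance(adj, 1, 5))   # 3
print(bfs_distance(adj, 5, 1))   # None
```

A result of None answers the transitive closure query negatively; any integer answers it positively and gives the distance.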
Transitive Closure Encoding
Problem
We want to find a compact representation for the
transitive closure
• whose size is comparable to the data's size
• that supports connection tests (almost) as fast
as the naive transitive closure lookup
• that can be built efficiently for large data sets
Main Idea: 2-Hop Covers and
2-Hop Labeling
• 2-Hop cover is set of hops (x,y) so that every connected pair
is covered by 2 hops
• For each node a, we maintain two sets of labels (which are
simply lists of nodes): Lin(a) and Lout(a)
• For each connection (a,b),
– choose a node c on the path from a to b (center node)
– add c to Lout(a) and to Lin(b)
• Then (a,b) ∈ Transitive Closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
[Figure: path a → c → b through center node c]
Reachability and distance queries via 2-hop Labels
(Cohen et al., SODA 2002)
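A reachability test with 2-hop labels reduces to a set intersection. The sketch below hand-builds labels for the three-node path a → c → b with c as the center of every connection; the name `reachable` and the label dictionaries are assumptions made for illustration.

```python
def reachable(l_out, l_in, a, b):
    """True if Lout(a) and Lin(b) share at least one center node."""
    return not l_out[a].isdisjoint(l_in[b])

# Labels for the path a -> c -> b, with c chosen as center for
# the connections (a,c), (c,b) and (a,b).
l_out = {"a": {"c"}, "c": {"c"}, "b": set()}
l_in = {"a": set(), "c": {"c"}, "b": {"c"}}

print(reachable(l_out, l_in, "a", "b"))   # True
print(reachable(l_out, l_in, "b", "a"))   # False
```

The cost of a query depends only on the two label sizes, not on the size of the graph.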
2-hop Covers
• Conjecture: For any graph with n nodes and m
edges, a 2-hop cover always exists and has size
bounded by O(n √m )
• Optimization Problem: Find a cover which
minimizes the sum of the label sizes
• Problem is NP-hard
– => approximation required
• Theorem: There exists a polynomial-time algorithm that
approximates the optimal-size 2-hop cover within a factor of
log n.
• Based on a greedy (set cover) algorithm
Approximation Algorithm
What are good center nodes? Nodes that can
cover many uncovered connections.
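[Figure: example graph on nodes 1 to 6]
Initial step: all connections are uncovered
=> Consider the center graph of candidates
[Figure: center graph of a candidate center node, with in-side I (2 nodes) and out-side O (4 nodes)]
Initial density: edges / (|I| + |O|) = 8 / (2 + 4) = 8/6 ≈ 1.33
(We can cover 8 connections with
6 cover entries)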
Approximation Algorithm
[Figure: example graph on nodes 1 to 6]
Initial step: all connections are uncovered
=> Consider the center graph of candidates
[Figure: center graph of a candidate center node, with sides I and O]
Cover the connections in the subgraph
with greatest density, using the
corresponding center node
Approximation Algorithm
[Figure: example graph on nodes 1 to 6]
Next step: some connections are already covered
=> Consider the center graph of candidates
[Figure: smaller center graph with sides I and O]
Repeat this algorithm until all
connections are covered
Theorem: the generated cover is
optimal up to a logarithmic factor
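The greedy loop can be sketched as follows, under a simplifying assumption: for each candidate center the sketch scores the whole center graph of still-uncovered connections, rather than searching for its densest subgraph. The logarithmic guarantee is therefore not claimed for this version, and all names (`descendants`, `greedy_two_hop_cover`) are hypothetical. It also materializes the full transitive closure, so it is only suitable for tiny graphs.

```python
from collections import deque

def descendants(adj, s):
    """All nodes reachable from s, including s itself (BFS)."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def greedy_two_hop_cover(adj):
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    desc = {u: descendants(adj, u) for u in nodes}
    anc = {u: {a for a in nodes if u in desc[a]} for u in nodes}
    uncovered = {(a, b) for a in nodes for b in desc[a] if a != b}
    l_out = {u: set() for u in nodes}
    l_in = {u: set() for u in nodes}
    while uncovered:
        best = None                                   # (density, center, edges)
        for c in sorted(nodes):
            # Center graph of c: uncovered connections a -> b that pass through c.
            edges = [(a, b) for a in anc[c] for b in desc[c] if (a, b) in uncovered]
            if not edges:
                continue
            ins = {a for a, _ in edges}
            outs = {b for _, b in edges}
            density = len(edges) / (len(ins) + len(outs))
            if best is None or density > best[0]:
                best = (density, c, edges)
        _, c, edges = best
        for a, b in edges:                            # cover them with center c
            l_out[a].add(c)
            l_in[b].add(c)
            uncovered.discard((a, b))
    return l_out, l_in

adj = {1: [2], 2: [3, 4], 3: [], 4: [5], 5: []}       # small hypothetical graph
l_out, l_in = greedy_two_hop_cover(adj)
print(l_out, l_in)
```

Every uncovered connection (a,b) appears at least in the center graph of a itself, so each round covers something and the loop terminates.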
Experimental Results
Small example from real world: subset of DBLP
6,210 documents (publications)
168,991 elements
25,368 links (citations)
14 Megabytes (uncompressed XML)
Element-level graph has 168,991 nodes and
188,149 edges
Its transitive closure:
344,992,370 connections
2,632.1 MB
Experimental Results
For example above:
Transitive Closure: 344,992,370
connections
Two-Hop Cover:
1,289,930 entries
=> compression factor of ~267
=> queries are still fast (~7.6 entries/node)
But:
Computation took 45 hours and 80 GB RAM!
Need: Smart partitioning of problem to fit memory
Final Results for Index Creation
Transitive Closure: 344,992,370 connections
Two-Hop Cover:
9,999,052 entries
=> compression factor of ~34.5
=> queries are still ok (~59.2 entries/node)
=> build time is good (~23 minutes with 1
CPU and 1GB RAM)
Cover size 8 times larger than best,
but ~118 times faster with ~1% memory
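The ratios quoted on these two slides can be re-derived from the raw counts. The short script below only repeats that arithmetic; the variable names are made up for the illustration.

```python
closure = 344_992_370        # connections in the transitive closure
nodes = 168_991              # nodes of the element-level graph
full_cover = 1_289_930       # cover entries, unpartitioned build
part_cover = 9_999_052       # cover entries, partitioned build

full_factor = closure / full_cover       # compression, about 267
part_factor = closure / part_cover       # compression, about 34.5
full_per_node = full_cover / nodes       # about 7.6 entries per node
part_per_node = part_cover / nodes       # about 59.2 entries per node
size_ratio = part_cover / full_cover     # about 7.75, i.e. roughly 8 times larger
speedup = (45 * 60) / 23                 # 45 hours vs 23 minutes, about 117
memory_share = 1 / 80                    # 1 GB of 80 GB, i.e. 1.25%

print(full_factor, part_factor, full_per_node, part_per_node)
print(size_ratio, speedup, memory_share)
```

The computed speedup is about 117, close to the quoted ~118 given that both build times are rounded.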
Why Distances are much more
Difficult than TC
• Should be simple to add distance information:
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]
Lout(v)={u, …}
Lin(w)= {u, …}
=>
Lout(v)={(u,2), …}
Lin(w)= {(u,4), …}
dist(v,w)=dist(v,u)+dist(u,w)=2+4=6
• Is this correct ...
Why Distances are Difficult
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4, plus a direct edge v → w]
dist(v,w)=1
=> Center node u does not reflect the
correct distance between v and w
Solution: Distance-aware Centergraph
• Add edges to the center graph only if the
corresponding connection is a shortest path
[Figure: example graph on nodes 1 to 6 and its distance-aware center graph with sides I and O]
• Correct, but with problems:
– Expensive to build the center graph (2 additional lookups per
connection)
– Approximation bound is no longer tight
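A distance query over labels that carry distances can be sketched as a minimum over the shared center nodes. The example reuses the v, u, w situation from the earlier slide: with u as the only center the answer is 6, while a label whose center lies on a shortest path gives the correct distance 1. The function name `two_hop_distance` and the dictionary layout are assumptions made for this illustration.

```python
def two_hop_distance(l_out, l_in, v, w):
    """Minimum of dist(v,c) + dist(c,w) over common centers c, or None."""
    common = l_out[v].keys() & l_in[w].keys()
    if not common:
        return None
    return min(l_out[v][c] + l_in[w][c] for c in common)

# Labels map a center node to a distance.
# Only center u: the query overestimates the distance.
naive_out = {"v": {"u": 2}}
naive_in = {"w": {"u": 4}}
print(two_hop_distance(naive_out, naive_in, "v", "w"))   # 6

# Adding w as the center for the connection (v,w), which lies on a shortest path.
aware_out = {"v": {"u": 2, "w": 1}}
aware_in = {"w": {"u": 4, "w": 0}}
print(two_hop_distance(aware_out, aware_in, "v", "w"))   # 1
```

The query itself is unchanged; correctness comes from only admitting centers that lie on a shortest path when the cover is built.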
