Transcript Slides

Class 38: Googling
CS150: Computer Science
University of Virginia, Computer Science
David Evans
http://www.cs.virginia.edu/evans
Some searches...
“David Evans”
“Dave Evans”
“idiot”
“lawn lighting”
("Lawn lighting" is tomorrow at 6pm, but Google doesn't know that!)
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– How to find documents that match a query
– How to rank the “best” documents
Crawling Crawler
activeURLs = [ "www.yahoo.com" ]
while len(activeURLs) > 0:
    newURLs = [ ]
    for URL in activeURLs:
        page = downloadPage(URL)
        newURLs += extractLinks(page)
    activeURLs = newURLs
Problems:
– Will keep revisiting the same pages
– Will take a very long time to get a good view of the web
– Will annoy web server admins
– downloadPage and extractLinks must be very robust
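The last problem deserves emphasis. A minimal sketch (not the class's actual code) of what downloadPage and extractLinks might look like in Python, using only the standard library; the baseURL parameter is an addition, needed to turn relative links into absolute URLs:

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Collects the href of every <a> tag it sees
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value is not None:
                        self.links.append(value)

    def downloadPage(URL):
        # A real crawler also needs retries, robots.txt checks, politeness delays, ...
        try:
            with urlopen(URL, timeout=10) as response:
                return response.read().decode("utf-8", errors="replace")
        except Exception:
            return ""          # treat any failure as an empty page

    def extractLinks(page, baseURL=""):
        parser = LinkExtractor()
        try:
            parser.feed(page)
        except Exception:
            pass               # malformed HTML is everywhere on the real web
        return [urljoin(baseURL, link) for link in parser.links]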
Crawling Crawler
activeURLs = [ "www.yahoo.com" ]
visitedURLs = [ ]
while len(activeURLs) > 0:
    newURLs = [ ]
    for URL in activeURLs:
        visitedURLs += [ URL ]
        page = downloadPage(URL)
        newURLs += [ u for u in extractLinks(page) if u not in visitedURLs ]
    activeURLs = newURLs
What is the complexity?
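One answer: each "u not in visitedURLs" test above is a linear scan of a growing list, so the crawl can take time quadratic in the number of URLs seen. A sketch of the same loop using sets, whose membership tests average O(1):

    def crawl(seedURL):
        visitedURLs = set()        # set membership tests are O(1) on average
        activeURLs = { seedURL }
        while activeURLs:
            newURLs = set()
            for URL in activeURLs:
                visitedURLs.add(URL)
                page = downloadPage(URL)
                newURLs |= set(extractLinks(page, URL))
            activeURLs = newURLs - visitedURLs   # drop everything already seen
        return visitedURLs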
Distributed Crawler
activeURLs = [ "www.yahoo.com" ]
visitedURLs = [ ]
while len(activeURLs) > 0:
    newURLs = [ ]
    parfor URL in activeURLs:      # parfor: iterations run in parallel
        visitedURLs += [ URL ]
        page = downloadPage(URL)
        newURLs += [ u for u in extractLinks(page) if u not in visitedURLs ]
    activeURLs = newURLs
Is this as "easy" to distribute as finding aliens (SETI@home)?
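There is no parfor in Python, but a thread pool on one machine gives the flavor; a sketch, assuming the downloadPage and extractLinks from earlier. Unlike distributed alien-hunting, the workers share state (visitedURLs), which is what makes real distributed crawling hard:

    from concurrent.futures import ThreadPoolExecutor

    def crawlParallel(seedURL, nworkers=8):
        visitedURLs = set()
        activeURLs = { seedURL }
        with ThreadPoolExecutor(max_workers=nworkers) as pool:
            while activeURLs:
                visitedURLs |= activeURLs
                frontier = list(activeURLs)
                # download every page in the frontier in parallel ("parfor")
                pages = pool.map(downloadPage, frontier)
                newURLs = set()
                for URL, page in zip(frontier, pages):
                    newURLs |= set(extractLinks(page, URL))
                activeURLs = newURLs - visitedURLs
        return visitedURLs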
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– How to find documents that match a query
– How to rank the “best” documents
Building an Index
• What if we just stored all the pages?
Answering a query would be Θ(size of the database)
(need to look at all the characters in the database)
For Google: about 4 billion pages (the actual size is now considered a corporate secret)
× 60 KB (average web page size)
= ~240 trillion bytes
Linear is not nearly good enough when n is in the trillions
Reverse Index
Word      Locations
…         …
"David"   [ …, http://www.cs.virginia.edu/~evans/index.html:12, … ]
"Evans"   [ …, http://www.cs.virginia.edu/~evans/index.html:19, … ]
…         …
What is the time complexity of search now?
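A sketch of building such an index in Python as a dict mapping each word to its (document, position) list; the example document is made up, and Python's dict is itself the hash table that later slides arrive at:

    def buildReverseIndex(documents):
        # documents maps each URL to its list of words
        index = {}
        for url, words in documents.items():
            for position, word in enumerate(words):
                index.setdefault(word, []).append((url, position))
        return index

    index = buildReverseIndex({
        "http://www.cs.virginia.edu/~evans/index.html":
            "David Evans teaches CS150 at Virginia".split(),
    })
    print(index["David"])   # [('http://www.cs.virginia.edu/~evans/index.html', 0)]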
Best Possible Searching
• Searching Problem:
  – Input: a target key key, and a list of n <key, value> pairs, sorted by key using a comparison function cf
  – Output: if key is in the list, the value associated with key; otherwise, not found
• What is the best possible solution to the general searching problem?
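As a point of reference, the classic answer is binary search, sketched here in Python over a key-sorted list of (key, value) pairs, with cf returning a negative, zero, or positive number like a comparator; each comparison eliminates half of the remaining candidates:

    def search(key, pairs, cf=lambda a, b: (a > b) - (a < b)):
        # pairs is sorted by key; cf(a, b) is <0, 0, or >0
        lo, hi = 0, len(pairs) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            c = cf(key, pairs[mid][0])
            if c == 0:
                return pairs[mid][1]    # found: return the associated value
            elif c < 0:
                hi = mid - 1            # key is in the left half
            else:
                lo = mid + 1            # key is in the right half
        return "not found"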
Recall Class 13:
Sorting problem is Ω(n log n)
• There are n! possible orderings
• Each comparison can eliminate at best ½ of them
• So, the best possible sorting procedure is Ω(log₂ n!)
• Stirling's approximation: n! = Ω((n/e)^n)
  – So, the best possible sorting procedure is Ω(log (n/e)^n) = Ω(n log n)
(Recall that log turns multiplication into addition: log mn = log m + log n, so log x^n = n log x)
Searching Problem is Θ(log n)
• It is Ω(log n)
  – Each comparison can eliminate at best ½ of all the elements from consideration
• It is O(log n)
  – We know a procedure (binary search, above) that solves it in Θ(log n)
• For Google: n is the number of distinct words on the web (hundreds of millions?)
  – Θ(log n) is not good enough
Faster Searching?
• The proof that searching is Ω(log n) relied on knowing that the best a comparison can do is eliminate ½ the entries
• Can we do better?
  – Without knowing anything about the comparison function: no
  – Knowing something about the comparison function: yes
• What if one comparison could eliminate O(n) of the entries?
Bin Searching
First Letter   Items
a              [ <"aardvark", [http://www.aardvarksareus.com, …]>, … ]
b              [ … ]
…              …
z              [ …, <"zweitgeist", […]> ]
def binsearch(key, table):
    return search(key, table[key[0]])

What is the time complexity of binsearch?
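A sketch of the one-line procedure made runnable, reusing the search from above (the example table is made up):

    def binsearch(key, table):
        # table maps a first letter to a key-sorted list of <key, value> pairs
        return search(key, table.get(key[0], []))

    table = {
        "a": [("aardvark", ["http://www.aardvarksareus.com"])],
        "z": [("zweitgeist", [])],
    }
    print(binsearch("aardvark", table))   # ['http://www.aardvarksareus.com']

Note that with a fixed 26 bins, each bin still holds Θ(n) of the n entries, so binsearch is still Θ(log n); the next slide asks what it would take to do better.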
Searching in O(1)
• To do better than Θ(log n), the number of bins must scale with n
  – The average number of elements in a bin must be O(1)
  – One comparison must eliminate O(n) of the elements
Hash Tables
• Bin = H(key, number of bins)
  – H is a hash function
  – We've seen cryptographic hash functions, where H must be collision resistant
  – For this, we don't need collision resistance; we just need H to distribute the keys well across the bins
• Finding a good H is difficult
  – You can download Google's from http://goog-sparsehash.sourceforge.net/
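A sketch of one simple choice of H (a polynomial rolling hash, not Google's function); whether it distributes keys well depends on the key distribution, which is why finding a good H is hard:

    def H(key, nbins):
        # polynomial rolling hash: treat the string as digits in base 31
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % nbins
        return h

    print(H("aardvark", 1024), H("zweitgeist", 1024))   # two (probably different) bins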
Google’s Lexicon
• 1998: 14 million words (much more today)
• Look up a word in bin H(word, nbins)
• Maps each word to a WordID
Key         Words
0           [ <"aardvark", 1024235>, … ]
1           [ <"aaa", 224155>, …, <"zzz", 29543> ]
...         ...
nbins – 1   [ <"abba", 25583>, …, <"zeit", 50395> ]
Google’s Reverse Index
(From the 1998 paper; it may have changed some since then)
Lexicon: 293 MB (1998)

WordId     ndocs   pointer
00000000   3       →
00000001   15      →
...        ...     ...
16777215   105     →

Each pointer leads into the Inverted Barrels: 41 GB (1998)
Inverted Barrels
Each entry in a word's doclist: docid (27 bits), nhits (5 bits), hits (16 bits each)
e.g., docid 7630486927, nhits 23, ...

plain hit (16 bits):
  capitalized: 1 bit
  font size: 3 bits
  position: 12 bits (first 4095 chars; everything after that is lumped together)

Anchors and titles use a special hit format with extra info (and fewer position bits).
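A sketch of packing and unpacking a 16-bit plain hit in Python, following the 1 + 3 + 12 bit widths above; the field order is an assumption, since the paper specifies the widths but not the exact layout:

    def packHit(capitalized, fontsize, position):
        # assumed layout: capitalized | fontsize | position
        assert 0 <= capitalized <= 1 and 0 <= fontsize < 8 and 0 <= position < 4096
        return (capitalized << 15) | (fontsize << 12) | position

    def unpackHit(hit):
        return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF

    hit = packHit(1, 3, 117)   # a capitalized word, font size 3, at position 117
    print(unpackHit(hit))      # (1, 3, 117)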
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– How to find documents that match a query
– How to rank the “best” documents
Finding the “Best” Documents
• Humans rate them
  – "Jerry and David's Guide to the World Wide Web" (became Yahoo!)
• Machines rate them
  – Count the number of occurrences of the keyword: easy for sites to rig this
  – Machine understanding of language is not good enough
• Business model
  – Whoever pays you the most is listed first
Random Walk Model
Initialize all page ranks to 0
p = select a random URL
for as long as you feel like:
    p.rank = p.rank + 1
    p = select a random link from Links(p)

Eventually, the ranks measure the probability that a random web surfer would encounter a page.
Problems with this?
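A runnable sketch on a made-up four-page web; it exposes two of the problems, since the walk gets stuck at the dead end "d", and pages unreachable from the starting point never get counted:

    import random

    Links = {                  # a made-up toy web: page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": [],               # a dead end
    }

    rank = {p: 0 for p in Links}
    p = random.choice(list(Links))
    for step in range(10000):  # "for as long as you feel like"
        rank[p] = rank[p] + 1
        if not Links[p]:
            break              # problem: nowhere to go from a dead end
        p = random.choice(Links[p])

    print(rank)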
Back Links
Searching for link:www.cs.virginia.edu/~evans/index.html
(http://www.google.com/search?hl=en&lr=&q=link%3Awww.cs.virginia.edu%2F%7Eevans%2Findex.html&btnG=Search)
= 219 backlinks
Counting Back Links
• link:http://www.deainc.com/
  – 109 backlinks (hey, I should be first!)
• Back links alone are not a good measure
  – Most of mine are from my own pages, but Google doesn't always know that
  – Some pages are more important than others
PageRank
Weight the back links by the popularity of the
linking page
def PageRank(u):
    rank = 0
    for b in BackLinks(u):
        # Links(b) here means the number of outgoing links from b
        rank = rank + PageRank(b) / Links(b)
    return rank

Would this work?
Converging PageRank
• Ranks of all pages depend on ranks of all
other pages
• Keep recalculating ranks until they
converge
def CalculatePageRanks(urls):
    initially, every rank is 1
    for as many times as necessary:
        calculate a new rank for each page (using the old ranks of the other pages)
        replace the old ranks with the new ranks
How do the initial ranks affect the results?
How many iterations are necessary?
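A sketch of the iterative calculation in Python on a made-up three-page graph; real PageRank also adds a damping factor to handle dead ends and guarantee convergence:

    def calculatePageRanks(Links, iterations=50):
        # Links maps each page to the pages it links to
        ranks = {p: 1.0 for p in Links}
        for _ in range(iterations):
            newRanks = {}
            for p in Links:
                # each page b linking to p shares out 1/len(Links[b]) of its old rank
                newRanks[p] = sum(ranks[b] / len(Links[b])
                                  for b in Links if p in Links[b])
            ranks = newRanks
        return ranks

    print(calculatePageRanks({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
    # converges toward {'a': 1.2, 'b': 0.6, 'c': 1.2}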
PageRank
• Crawlable web (1998): 150 million pages, 1.7 billion links
• Database of 322 million links
  – Converges in ~50 iterations
• Initialization matters
  – All pages = 1: very democratic; models a surfer equally likely to start on any page
  – www.yahoo.com = 1, all others = 0: more like what Google probably uses
Query Work
• To respond to 1 query (2002):
  – Read 100 MB of data
  – Tens of billions of CPU cycles
• Google in 2002:
  – 15,000 commodity PCs
    • Racks of 88 2-GB PCs, $278,000 per rack
    • Power: 10 MWh/month (~$1,500)
  – With 15,000 PCs, there will always be some with faults: requires load balancing and data partitioning
Building a Web Search Engine
• Database of web pages
– Crawling the web collecting pages and links
– Indexing them efficiently
• Responding to Searches
– How to find documents that match a query
– How to rank the “best” documents
Ready to go become the next Google?
Charge
• Before becoming the next Google, you
need to finish PS8!
• Tomorrow: 6pm, Lighting of the Lawn
• Friday’s class:
– A few other neat things about Google
– Guidelines for project presentations
– Exam review – email me your topics and
questions
• Monday: project presentations