Transcript Slides
Class 38: Googling
CS150: Computer Science
University of Virginia, Computer Science
David Evans
http://www.cs.virginia.edu/evans

Some searches... “David Evans”, “Dave Evans”, “idiot”, “lawn lighting”
(“lawn lighting”: the Lighting of the Lawn is tomorrow at 6pm, but Google doesn’t know that!)

Building a Web Search Engine
• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to searches
  – How to find documents that match a query
  – How to rank the “best” documents

Crawling
Crawler (first attempt):
    activeURLs = ["www.yahoo.com"]
    while len(activeURLs) > 0:
        newURLs = []
        for URL in activeURLs:
            page = downloadPage(URL)
            newURLs += extractLinks(page)
        activeURLs = newURLs
Problems:
• Will keep revisiting the same pages
• Will take a very long time to get a good view of the web
• Will annoy web server admins
• downloadPage and extractLinks must be very robust

Crawling
Crawler (keeping track of visited pages):
    activeURLs = ["www.yahoo.com"]
    visitedURLs = []
    while len(activeURLs) > 0:
        newURLs = []
        for URL in activeURLs:
            visitedURLs.append(URL)
            page = downloadPage(URL)
            # only keep links we have not already visited
            newURLs += [u for u in extractLinks(page) if u not in visitedURLs]
        activeURLs = newURLs
What is the complexity?

Distributed Crawler
    activeURLs = ["www.yahoo.com"]
    visitedURLs = []
    while len(activeURLs) > 0:
        newURLs = []
        parfor URL in activeURLs:    # parfor: process the URLs in parallel
            visitedURLs.append(URL)
            page = downloadPage(URL)
            newURLs += [u for u in extractLinks(page) if u not in visitedURLs]
        activeURLs = newURLs
Is this as “easy” as distributing finding aliens?

Building a Web Search Engine
• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to searches
  – How to find documents that match a query
  – How to rank the “best” documents

Building an Index
• What if we just stored all the pages?
• Answering a query would be Θ(size of the database): we would need to look at every character in the database
• For Google: about 4 Billion pages (the actual size is now considered a corporate secret) × 60 KB (average web page size) ≈ 240 trillion bytes
• Linear is not nearly good enough when n is in the trillions

Reverse Index
Word       Locations
“David”    [ ..., http://www.cs.virginia.edu/~evans/index.html:12, ... ]
“Evans”    [ ..., http://www.cs.virginia.edu/~evans/index.html:19, ... ]
...        ...
What is the time complexity of search now?

Best Possible Searching
• Searching Problem:
  – Input: a target key key, a list of n <key, value> pairs sorted by key using a comparison function cf
  – Output: if key is in the list, the value associated with key; otherwise, not found
• What is the best possible solution to the general searching problem?

Recall Class 13: the sorting problem is Ω(n log n)
• There are n! possible orderings
• Each comparison can eliminate at best ½ of them, so the best possible comparison-based sorting procedure is Ω(log₂(n!))
• Stirling’s approximation: n! ≈ √(2πn) (n/e)ⁿ, so log₂(n!) = Ω(n log n)
  – Recall that log turns multiplication into addition: log mn = log m + log n, so log(nⁿ) = n log n
• Hence, the best possible sorting procedure is Ω(log₂(n!)) = Ω(n log n)
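For concreteness, here is a minimal sketch, in Python with made-up names (binary_search, the example index list), of the comparison-based procedure the next slide refers to: binary search over the sorted list of <key, value> pairs. Each comparison discards half of the remaining pairs, which is exactly the O(log n) behavior the lower-bound argument says is the best a comparison-based search can do.

    def binary_search(key, pairs, cf=lambda a, b: (a > b) - (a < b)):
        # pairs is a list of (key, value) tuples sorted by key;
        # cf is a comparison function returning <0, 0, or >0 (like C's strcmp).
        lo, hi = 0, len(pairs) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            c = cf(key, pairs[mid][0])
            if c == 0:
                return pairs[mid][1]    # found: return the associated value
            elif c < 0:
                hi = mid - 1            # key must be in the lower half
            else:
                lo = mid + 1            # key must be in the upper half
        return None                     # not found

    # A tiny sorted reverse index of <word, locations> pairs:
    index = [("david", ["~evans/index.html:12"]), ("evans", ["~evans/index.html:19"])]
    print(binary_search("evans", index))    # ['~evans/index.html:19']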
Searching Problem is Θ(log n)
• It is Ω(log n)
  – Each comparison can eliminate at best ½ of all the elements from consideration
• It is O(log n)
  – We know a procedure (binary search) that solves it in O(log n)
• For Google: n is the number of distinct words on the web (hundreds of millions?)
  – Θ(log n) is not good enough

Faster Searching?
• The proof that searching is Ω(log n) relied on knowing that the best a comparison can do is eliminate ½ of the entries
• Can we do better?
  – Without knowing anything about the comparison: no
  – If we know something about the comparison: yes
• What if one comparison can eliminate O(n) of the entries?

Bin Searching
First Letter    Items
a               [ <“aardvark”, [http://www.aardvarksareus.com, ...]>, ... ]
b               [ ... ]
...             ...
z               [ ..., <“zweitgeist”, [...]> ]

    def binsearch(key, table):
        return search(key, table[key[0]])   # look only in the bin for the key's first letter

What is the time complexity of binsearch?

Searching in O(1)
• To do better than Θ(log n), the number of bins must scale with n
  – The average number of elements in a bin must be O(1)
  – One comparison must eliminate O(n) of the elements

Hash Tables
• Bin = H(key, number of bins)
  – H is a hash function
  – We’ve seen cryptographic hash functions, where H must be collision resistant
  – For this we don’t need that; H just needs to distribute the keys well across the bins
• Finding a good H is difficult
  – You can download Google’s from http://goog-sparsehash.sourceforge.net/

Google’s Lexicon
• 1998: 14 million words (much more today)
• Look up a word with H(word, nbins); it maps to a WordID

Key          Words
0            [ <“aardvark”, 1024235>, ... ]
1            [ <“aaa”, 224155>, ..., <“zzz”, 29543> ]
...          ...
nbins − 1    [ <“abba”, 25583>, ..., <“zeit”, 50395> ]

Google’s Reverse Index
(From the 1998 paper... may have changed some since then)

Lexicon: 293 MB (1998)
WordId       ndocs    pointer
00000000     3        →
00000001     15       →
...          ...      ...
16777215     105      →
The pointers lead into the Inverted Barrels: 41 GB (1998)

Inverted Barrels
Each entry: docid (27 bits), nhits (5 bits), hits (16 bits each)
Example row: 7630486927, 23, ...
A plain hit (16 bits): capitalized: 1 bit, font size: 3 bits, position: 12 bits (first 4095 chars; everything past that is lumped together)
Anchor and title hits carry extra info (and fewer position bits)

Building a Web Search Engine
• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to searches
  – How to find documents that match a query
  – How to rank the “best” documents

Finding the “Best” Documents
• Humans rate them
  – “Jerry and David’s Guide to the World Wide Web” (became Yahoo!)
• Machines rate them
  – Count the number of occurrences of the keyword: easy for sites to rig this
  – Machine language understanding is not good enough
• Business model
  – Whoever pays you the most is listed first

Random Walk Model
    initialize all page ranks to 0
    p = select a random URL
    for as long as you feel like:
        p.rank = p.rank + 1
        p = select a random link from Links(p)
Eventually, the ranks measure the probability that a random websurfer would encounter a page.
Problems with this?
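Below is a minimal sketch of the random-surfer simulation just described, in Python. The link graph is a hypothetical dict mapping each URL to the list of URLs it links to (in practice this would come from the crawler). It also bumps into one of the “problems”: a page with no outgoing links strands the surfer, which this sketch handles by restarting at a random page.

    import random

    def random_walk_ranks(links, steps=100000):
        # links: dict mapping each URL to the list of URLs it links to
        # (every linked URL is assumed to also appear as a key).
        # Returns the fraction of steps the surfer spent on each page.
        pages = list(links.keys())
        counts = {p: 0 for p in pages}
        p = random.choice(pages)                # start at a random URL
        for _ in range(steps):
            counts[p] += 1                      # p.rank = p.rank + 1
            if links[p]:
                p = random.choice(links[p])     # follow a random link from Links(p)
            else:
                p = random.choice(pages)        # dead end: restart somewhere random
        return {u: c / steps for u, c in counts.items()}

    # Tiny example graph with hypothetical page names:
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(random_walk_ranks(graph, steps=10000))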
Back Links
http://www.google.com/search?hl=en&lr=&q=link%3Awww.cs.virginia.edu%2F%7Eevans%2Findex.html&btnG=Search
= 219 backlinks

Counting Back Links
• link:http://www.deainc.com/
  – 109 backlinks (hey, I should be first!)
• Back links are not a good measure
  – Most of mine are from my own pages, but Google doesn’t (always) know that
  – Some pages are more important than others

PageRank
Weight the back links by the popularity of the linking page:

    def PageRank(u):
        rank = 0
        for b in BackLinks(u):
            rank = rank + PageRank(b) / len(Links(b))
        return rank

Would this work?

Converging PageRank
• The ranks of all pages depend on the ranks of all the other pages
• Keep recalculating the ranks until they converge

    def CalculatePageRanks(urls):
        initially, every rank is 1
        for as many times as necessary:
            calculate a new rank for each page (using the old ranks of the other pages)
            replace the old ranks with the new ranks

How do the initial ranks affect the results?
How many iterations are necessary?

PageRank
• Crawlable web (1998): 150 million pages, 1.7 Billion links
• Database of 322 million links
  – Converges in ~50 iterations
• Initialization matters
  – All pages = 1: very democratic; models a browser equally likely to start on any page
  – www.yahoo.com = 1, all others = 0: more like what Google probably uses

Query Work
• To respond to 1 query (2002):
  – Read 100 MB of data
  – 10s of billions of CPU cycles
• Google in 2002:
  – 15,000 commodity PCs
    • Racks of 88 2-GB PCs, $278,000 per rack
    • Power: 10 MW-h/month (~$1,500)
  – With 15,000 PCs there will always be some with faults: load balancing, data partitioning

Building a Web Search Engine
• Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
• Responding to searches
  – How to find documents that match a query
  – How to rank the “best” documents
Ready to go become the next Google?

Charge
• Before becoming the next Google, you need to finish PS8!
• Tomorrow: 6pm, Lighting of the Lawn
• Friday’s class:
  – A few other neat things about Google
  – Guidelines for project presentations
  – Exam review: email me your topics and questions
• Monday: project presentations
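Finally, to connect the “Converging PageRank” slide to running code: a minimal sketch of the iterative calculation, not Google’s actual implementation (which, among other things, adds a damping factor). It uses the same hypothetical link-graph dict as the random-walk sketch; every rank starts at 1, each pass recomputes every page’s rank from the old ranks of the pages that link to it, and the old ranks are then replaced by the new ones. Instead of calling a BackLinks function, it loops over linking pages and pushes their contributions forward, which computes the same sums as the recurrence on the slide.

    def calculate_page_ranks(links, iterations=50):
        # links: dict mapping each URL to the list of URLs it links to.
        pages = list(links.keys())
        ranks = {p: 1.0 for p in pages}            # initially, every rank is 1
        for _ in range(iterations):                # the 1998 web converged in ~50 iterations
            new_ranks = {p: 0.0 for p in pages}
            for b in pages:
                if not links[b]:
                    continue                       # a page with no outgoing links contributes nothing
                share = ranks[b] / len(links[b])   # b splits its old rank among its links
                for target in links[b]:
                    new_ranks[target] += share
                ranks = ranks                      # (old ranks are read-only within a pass)
            ranks = new_ranks                      # replace the old ranks with the new ranks
        return ranks

    # Same hypothetical example graph as before:
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(calculate_page_ranks(graph))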