Transcript: Google-Scale Data Management
Google-Scale Data Management (..a course on that??) (partially based on slides by Qing Li)

"Google"... what? Couldn't they come up with a better name? It doesn't mean anything, right? Or does it mean "search" in French? Surprisingly, "Google" actually means something:
• It is the name of a "number"...
• A big one: 1 followed by 100 zeros (a "googol").

This reflects the company's mission, and the reality of the world:
• ...there is a whole lot of "data" out there...
• ...and whoever helps me manage it effectively deserves to be rich!

In fact, there is more to Google than search.

Question #1: How can it know so much? ...because it crawls.
The search engine crawls the Web, indexes what it finds, and searches those indexes to answer user queries. Indexes are quick lookup tables (like the index pages in a book) mapping keywords (k1, k2, ...) to the documents (d1, d2, d3, ...) that contain them.

Question #2: How can it be so fast?
A whole lot of users, one search engine, a whole lot of indexed data. Three ideas make this work:
• Use a lot of computers, but intelligently (parallelism): front-end web servers fan queries out to index servers and content servers, backed by a database holding copies of all web pages.
• Organize data for fast access (indexing).
• Most people want to know the same thing anyhow (caching): keep a cache of copies of recent results in front of the database.
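The caching idea above can be sketched as a small LRU cache sitting in front of the (slow) index servers. This is an illustrative sketch only; the `backend` callable and the 100-entry capacity are assumptions, not Google's actual design.

```python
# Minimal sketch of result caching: serve repeated queries from a small
# in-memory cache instead of re-running the search backend each time.
from collections import OrderedDict

class QueryCache:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = OrderedDict()  # query -> result, in LRU order

    def lookup(self, query, backend):
        if query in self.cache:
            self.cache.move_to_end(query)   # mark as recently used
            return self.cache[query]
        result = backend(query)             # cache miss: ask the index servers
        self.cache[query] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return result
```

Because "most people want to know the same thing," even a small cache like this absorbs a large fraction of the query load.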
Question #3: How can it serve the most relevant page?
• Text indexing / keyword search: analyze the content of a page to determine whether it is relevant to the query or not.
• Link analysis: analyze the (incoming and outgoing) links a page has to determine whether the page is "worthy" or not.

Text Indexing
Web crawling produces raw text; text analysis turns that raw text into an index.

Text Analysis
A query is a set of keywords. A page/document is also a set of keywords (where some keywords are more important than others). So, to create an index, we need to:
• extract index terms
• compute their weights

Term Extraction
Extraction of index terms involves:
• Morphological analysis (stemming, in English): "information", "informed", "informs", "informative" all reduce to "inform".
• Removal of stop words: "a", "an", "the", "is", "are", "am", ... (they occur too often!)
• Compound word identification.

Stop Words
Terms that occur too frequently in a language are not good discriminators for search. On a plot of term frequency against the rank of the term, stop words occupy the high-frequency end of the curve.

What are the weights?
They need to capture how good the term is at describing the content of the page. If term t1 occurs more often in a page than t2, then t1 describes the page better than t2:
  tf = n / K   (term frequency: n occurrences of the term, normalized by the page length K)
They also need to capture how discriminating the term is (remember the stop words?). If t1 occurs in many documents and t2 in only a few, then t2 is a better discriminator than t1:
  idf = log(N / m)   (inverse document frequency: N documents in the collection, m of them containing the term)
So the weights need to capture both how well the term describes the content of the page and how discriminating the term is.
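The two weights above can be sketched directly from their definitions. The toy documents below are made-up illustrations; note how a term that appears in every document gets idf = 0, which is exactly the stop-word effect.

```python
# Sketch of the two weights defined above:
#   tf  = n / K       (occurrences n, normalized by document length K)
#   idf = log(N / m)  (N documents in total, m of them containing the term)
import math

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    m = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / m) if m else 0.0

# Illustrative two-document collection:
docs = [["arizona", "state", "university"],
        ["arizona", "desert", "arizona"]]
# "arizona" appears in both documents, so idf = log(2/2) = 0:
# it has no discriminating power in this collection.
```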
Combining the two gives the tf-idf weight:
  tfidf = (n / K) · log(N / m)

An Example
TF for "Arizona": in Doc 1 it is 1; in Doc 2 it is 2. IDF for "Arizona" in this collection (Doc 1 & Doc 2) is 1/2 (the example apparently uses the simplified form idf = 1/m, without the log). So the term weights are:
  TW(Arizona, Doc1) = 1 · 1/2 = 0.5
  TW(Arizona, Doc2) = 2 · 1/2 = 1.0

The "Inverted" Index
This requires "fast" string search (hashes, search trees). A directory maps each term ("Google", "ASU", "tiger", ...) to a pointer into a posting file; the posting list for a term records which documents (and positions) contain it. A query for "ASU" looks the term up in the directory and follows its pointer to the postings, which in turn point to the documents (Doc #1, Doc #2, Doc #5, ...).

Matching & Ranking
Ranking involves two choices:
• Retrieval model: Boolean (exact) => fuzzy set (inexact), vector space, probabilistic, inference net, language model, ...
• Weighting schemes: for index terms and query terms, and the parameters in the formulas.

The Vector Model of Text
A page containing "ASU" with weight <0.5> is a point on the "ASU" axis. A page with "ASU" <0.5> and "CSE" <0.7> is a point in a two-dimensional space. Add "Selcuk" <0.9> and we are in three dimensions. A web database is thus a vector space, and a request like "find the 2 most similar pages to A" becomes a nearest-neighbor search in that space.

How about "important" pages... how do we identify them? We might be able to learn how important a page is by studying its connectivity ("linkages").

Hubs and Authorities
• Good hubs should point to good authorities.
• Good authorities must be pointed to by good hubs.
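The vector-space search described above ("find the 2 most similar pages to A") can be sketched with cosine similarity over term-weight vectors. The page names and weights below are the illustrative values from the slides' three-axis (ASU, CSE, Selcuk) example plus made-up neighbors; this is a sketch, not a production ranking function.

```python
# Sketch of vector-space retrieval: pages are tf-idf-style weight vectors,
# and similarity between pages is the cosine of the angle between them.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(target, pages, k=2):
    # pages: {name: weight vector}; rank all other pages by similarity to target
    scores = [(name, cosine(pages[target], vec))
              for name, vec in pages.items() if name != target]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Axes: <ASU, CSE, Selcuk>, echoing the three-dimensional example above.
pages = {
    "A": [0.5, 0.7, 0.9],
    "B": [0.5, 0.7, 0.0],
    "C": [0.0, 0.0, 1.0],
    "D": [0.9, 0.0, 0.0],
}
```

Cosine similarity ignores vector length, so a long page and a short page about the same topics still score as similar.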
PageRank
Basic idea: more links to a page implies a better page.
• But all links are not created equal.
• Links from a more important page should count more than links from a weaker page.

Basic PageRank PR(A) for page A:
  PR(A) = Σ_{(B,A) ∈ G} PR(B) / outDegree(B)
• outDegree(B) = number of edges leaving page B = hyperlinks on page B.
• Page B distributes its rank boost over all the pages it points to.
Example (A links to B and C; B links to C; C links to A):
  PR(A) = PR(C)/1    PR(B) = PR(A)/2    PR(C) = PR(A)/2 + PR(B)/1

How to compute ranks?
The PageRank definition is recursive: the rank of a page depends on, and influences, the ranks of other pages. It is solved iteratively:
• Choose an arbitrary initial PR_old and use it to compute PR_new.
• Repeat, setting PR_old to PR_new, until PR converges (the difference between old and new PR is sufficiently small).
Rank values typically converge in 50-100 iterations.

Open-Source Search Engine Code
• Lucene Search Engine: http://lucene.apache.org/
• SWISH: http://swish-e.org/
• Glimpse and more: http://webglimpse.net/

References
L. Page & S. Brin. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies, Working Paper 1999-0120, 1998.
S. Brin & L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998.
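The iterative PageRank computation described above can be sketched as follows, using the basic formula (no damping factor) on the three-page example graph. This is a sketch under the assumption that every page has at least one outgoing link, which the example graph satisfies.

```python
# Sketch of the iterative ("power iteration") PageRank computation:
#   PR(A) = sum over edges (B, A) in G of PR(B) / outDegree(B)
# Repeats until old and new ranks differ by less than a tolerance.
def pagerank(graph, iters=100, tol=1e-8):
    # graph: {page: [pages it links to]}; every page must have outlinks here
    pages = list(graph)
    pr_old = {p: 1.0 / len(pages) for p in pages}  # arbitrary initial ranks
    for _ in range(iters):
        pr_new = {p: 0.0 for p in pages}
        for b, targets in graph.items():
            for t in targets:
                pr_new[t] += pr_old[b] / len(targets)  # B spreads its rank
        if max(abs(pr_new[p] - pr_old[p]) for p in pages) < tol:
            return pr_new  # converged
        pr_old = pr_new
    return pr_old

# The example from the slides: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
```

On this graph the ranks settle at PR(A) = PR(C) = 0.4 and PR(B) = 0.2, which indeed satisfies the three equations of the example: PR(A) = PR(C), PR(B) = PR(A)/2, and PR(C) = PR(A)/2 + PR(B).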