IELM 231: IT for Logistics and Manufacturing

Course Agenda
- Introduction
- IT applications design: Human-Computer Interface
- Fundamental IT tools: sorting, searching
- The Client-Server architecture, interacting applications
- IT in logistics, Case study 1: web search
  - Search robots
  - Data processing, data storage/retrieval (DB, indexes)
  - Data presentation: page ranking techniques
- IT in logistics, Case study 2: web-based auctions
  - How auctions work
  - Web issues: session tracking
  - Web issues: secure communications
  - Web issues: cash transactions

The Internet and Web searching
Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications
Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)
Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking

History of search engines
1993: First web robot - World Wide Web Wanderer
  Matthew Gray, Physics student from MIT
  Objective: track all pages on the web to monitor the growth of the web
1994: First search engine - WebCrawler
  Brian Pinkerton, CS student from U of Washington
  Objective: download web pages, store their links in a keyword-searchable DB
1994: Jerry's Guide to the Internet
  Jerry Yang, David Filo, Stanford University
  Objective: crawl for web pages, organize them by content into hierarchies
  Yet Another Hierarchical Officious Oracle (Yahoo)
1994-97: Lycos, Infoseek, AltaVista, Excite, LookSmart (meta engine)
1998: Google (Sergey Brin, Larry Page, CS students, Stanford University)

Basics of web searching: two modes of searching
- Web directories
- [Boolean] keyword search

Basics of web directories
- A hierarchy of main topics is formed (by the search company)
- Each topic may have several sub-topics, etc.
- All pages 'known' to the engine are linked into the hierarchy by humans (editors)
- Users search for 'main' articles in any subtopic by 'browsing'
- Pages deep inside a hierarchy can be searched using 'keyword search'

Basics of web searching: user's view
Basic search modes
- Any page with the search term(s) occurring in it is returned
- What if you search for multiple words?
  Implicit AND: the page must contain all the words in the search term
- Exact phrase search
  Typically by quoting the search term: "now is the time"
  What happens if you use: now is the time
  Typically: is and the will be omitted, and the results will differ from the quoted search!

Basics of web searching: user's view..
Boolean (set theoretic) operations on search results:
- Intersection: AND (implicit)      manufacturing die casting
- Union: OR                         manufacturing die OR shell casting
- Difference (NOT): '-'             manufacturing cold forging -hot

Some useful Google™ search modes:
- Definition of a word or phrase:           define:search
- Page contains numbers in a given range:   ming emperor 1400..1500
- Simple calculator functions:              150*(1+0.05)**5-150

Basics of web searching: user's view...
Filtering
- Anchor: the text in an HTML tag that carries the hyperlink to a page
  Example: <a href="http://www-ieem.ust.hk/dfaculty/ajay/courses/ielm330/330intro.html">IELM 330 web page</a>
  Anchors have special significance in Google (we shall see this later):
  the anchor text often indicates the content of the page it links to
- Search term must occur in the anchor text:      inanchor:manufacturing-wiki
- Search term must occur in the title of a page:  allintitle:joneja ajay
Focused search
- Search only on pages that occur within a particular site:  IELM 330 site:ust.hk

The Internet and Web searching
Part 1. The infrastructure of the web
- The architecture of the Internet
- The ISO/OSI 7-layer communication protocol
- Client-server architecture for applications
Part 2. How to use a web search engine [case study: Google]
- Basic searching
- Advanced searching (Booleans) and filtering (setting up local engines)
Part 3. How search engines work
- Crawling the internet
- Indexing of information, database design
- Page ranking

Collecting information on the internet: Crawlers
How do search engines know which page has what data?
- By the use of robots (also called spiders or crawlers)
- By asking sites to tell them who they are (registering)
Crawling (a sketch appears after this slide group):
- Start with a small list of web pages (seeds) := OPEN
- For each URL in OPEN:
  - Fetch the page
  - Process the page (index its contents -- see later)
  - Add all web links (URLs) found on the page to OPEN

Problems in crawling
- Too many web sites: > 100 million
- Each web site has many pages (estimated total: ~25 billion)
- Some pages may not be linked from any seed
- Dynamic page generation (e.g. DB based, or rapidly changing)
  Suppose a web site generates a page based on some inputs from the user.
  How can we program a crawler to get this content?
  What potential problems can be faced by this technique?
  Art gallery website: 3 methods to sort thumbnails, 3 choices of font size, 2 skins
  -> 18 URLs (using HTTP GET with different options) for the same page!

Crawler behavior
A basic crawler is easy to program; the policies used by crawlers determine their efficiency:
Selection policy: which page to download (among all URLs in OPEN)
- Breadth first is better than depth first -- why?

Crawler behavior..
Re-visit policy: how often to re-visit a page (to account for dynamic content)
- How often to re-visit? Depends on how frequently the page is changed:
  out-dated data -> cost [but how to estimate this cost?]
- Freshness of a page p in the repository at time t
  = 1 if the local copy is identical to p at time t, 0 otherwise
- Age of a page p in the repository at time t
  = 0 if the local copy is identical to p at time t, t - (time when p was last modified) otherwise
  [source: Wikipedia]

Crawler behavior...
Politeness policy: how not to overload resources
- Crawlers can send download requests faster than the server, the network or the indexer can handle
- Typical policy: wait n seconds before sending the next page request to the same server, n = 10 sec ... 180 sec

Crawler behavior....
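The sketch promised above: a minimal Python version of the crawl loop. The helper names fetch_page, extract_links and index_page are hypothetical placeholders (not part of any real library); a FIFO queue gives the breadth-first selection policy, and a per-host delay stands in for the politeness policy.

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, fetch_page, extract_links, index_page,
          max_pages=1000, politeness_delay=10.0):
    """Breadth-first crawl starting from the seed URLs.

    fetch_page, extract_links and index_page are assumed (hypothetical)
    helpers: download a URL, list the URLs it links to, index its text.
    """
    open_list = deque(seeds)        # OPEN: URLs waiting to be fetched
    seen = set(seeds)               # never queue the same URL twice
    last_visit = {}                 # politeness: last request time per host
    fetched = 0

    while open_list and fetched < max_pages:
        url = open_list.popleft()   # FIFO queue => breadth-first selection

        # Politeness policy: wait before hitting the same server again
        host = urlparse(url).netloc
        wait = politeness_delay - (time.time() - last_visit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_visit[host] = time.time()

        page = fetch_page(url)      # download the page
        index_page(url, page)       # process its contents (indexing, see later)
        fetched += 1

        for link in extract_links(page):    # add newly found URLs to OPEN
            if link not in seen:
                seen.add(link)
                open_list.append(link)
```

Replacing the FIFO queue with a stack would turn this into a depth-first crawl, which tends to burrow deep into a single site before heavily linked pages elsewhere are ever reached -- one intuition for why breadth first is preferred.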
Policies used by crawlers determine their efficiency:
Parallelization policy: how to use multiple crawlers in a coordinated search
- Advantage: the search engine can use multiple computers to build indexes
- Problems:
  - how to coordinate and merge the results of the multiple computers
  - how to ensure that the same page is not downloaded by more than one process

Search engines: Indexing
What to do with the pages downloaded by the crawler?
Processing the data:
- Categorization (mime-types)
- Text processing
  - Delete stop words [Why?] [How?]
  - Identify and handle words
  - Identify and handle phrases [How many?]
  - Handle similar words -- plurals, past/present tense, mis-spellings: stemming

Search engines: Indexing..
Categorization (mime-types)
The indexer must know what type of data it received, and handle it accordingly:
- A/V mime-types: audio/mpeg, audio/x-wav, audio/x-ms-wma, video/mpeg, ...
  Currently: check meta data (anchor text, ALT-tags, HTTP header, ...)
  Future: digital signal processing / voice recognition -> text index
- Image mime-types: image/gif, image/jpeg, image/png, image/tiff, ...
  Currently: check meta data (anchor text, ALT-tags, HTTP header, ...)
- Application mime-types: application/javascript, application/xhtml+xml, application/octet-stream, application/x-shockwave-flash, ...
  Currently: if it is an interpreted language, handle it as text; otherwise use meta data
- Other mime-types (e.g. 3D data, etc.): meta data, filenames, etc.

Search engines: Indexing...
Text processing
- Identify and handle words -- what is a word?
  - English: dictionary
  - Programming: contiguous alphanumeric characters with '_'
  - Hyphenated words [how does Google handle hyphenated search-terms?]
  - Special characters
- Delete stop words
  Stop words: very common words in text, such as the, a, an, on, ...
  How to define? Language experts (humans) make the list
  How to handle? String-match -> do not index stop words?

Search engines: Indexing....
- Identify and handle phrases -- what is a phrase?
  - English: dictionary of phrases, human experts
  - How to identify: string match
  - How many?
    - Too many -> the index becomes too large
    - Too few -> real-time Boolean operations during search (takes time!)
  Google n-gram dataset: ~24 GB compressed text files
    Number of tokens:     1,024,908,267,229
    Number of unigrams:          13,588,391
    Number of bigrams:          314,843,401
    Number of trigrams:         977,069,902
    Number of fourgrams:      1,313,818,354
    Number of fivegrams:      1,176,470,663

Search engines: Indexing.....
- Handle similar words -- plurals, past/present tense, mis-spellings: stemming
  Search for 'expert' == ? == search for 'experts'
  Stemming: reduce a given word to a 'canonical' root word
  Basic methods:
  - Use a dictionary
  - Analyze the sentence, guess the meaning of the word used, find the root from a look-up table
  - Use simple processing rules (English): if a word ends in 'ed', 'ing', 'ly' -> remove 'ed', 'ing', 'ly'
  How to identify mis-spellings?
  - Dictionary lookup and match -> expensive
  - Look-up table with a list of commonly misspelled words

Search engines: Indexing......
Inverted indexes: how to store the processed data
Most common storage structure: the inverted index (see the sketch after this slide group)

  ngram     Documents
  storm     URL1, URL6, ..., URL_n
  teacup    URL1, ..., URL_k
  ...

How to store?
- Database file (lookup time may be long)
- Hash (faster lookup on average)
Challenge: updating an index [Why?] [How?]
Challenge: merging indexes [Why? Parallel processing] [How?]
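A small sketch of the indexing pipeline above, assuming a toy stop-word list and toy suffix-stripping rules (real engines use curated lists and proper stemmers): tokenize the text, drop stop words, stem, and record the result in an inverted index held in a hash (a Python dict of sets). Querying intersects the posting sets, which is exactly the implicit AND of keyword search. This index_page could serve as the indexing helper assumed in the earlier crawler sketch.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "on", "is", "of", "and"}   # toy list; real lists are human-curated

def stem(word):
    """Crude rule-based stemming: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(text):
    """Split text into lower-case word tokens, drop stop words, stem the rest."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

# Inverted index: term -> set of documents (URLs) containing it
inverted_index = defaultdict(set)

def index_page(url, text):
    for term in tokenize(text):
        inverted_index[term].add(url)

def search(query):
    """Keyword search with implicit AND: intersect the posting sets."""
    postings = [inverted_index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

# Tiny usage example with made-up documents
index_page("URL1", "Storms and teacups: a storm in a teacup")
index_page("URL6", "Tracking storms with weather radar")
print(search("storm teacup"))   # -> {'URL1'}
```

In this toy structure, merging indexes built by parallel indexers is just a per-term union of posting sets; updating means re-indexing a re-crawled page and removing its stale postings.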
Search engines: presenting results to the user
- A typical query may match 100 million web pages (hits)
- A typical user may look at perhaps 3-5 pages to find the information they want
- A successful search engine must provide a link to the 'best' page among the top few
How to rank the relative likelihood that a page will provide the 'wanted' info?
- Google uses PageRank™

Search engines: PageRank
PageRank: invented by PhD student Larry Page
- Similar older ideas, e.g. link analysis (graph theory), citation indexes
Main idea of PageRank:
1. The importance of a page is related to how many other pages link to it, and
2. How important those pages are
Is PageRank effective? [Google's market capitalization: US$ 200B!]

Page Ranking
[Figure: an example web with 4 pages, numbered 1-4]
Consider a web with 4 pages
- An arrow from A -> B means page A has a link to page B
- Let the importance score of page k = score of k = x_k
Simplest scheme: x_k = number of back-links to page k
- In the example: x1 = 2, x2 = 1, x3 = 3, x4 = 2
- Problem? Does it account for condition 2?
Main reference for the PageRank notes: Bryan and Leise, "The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google"

Page Ranking..
Suppose page j is more important (e.g. yahoo.com)
-> those pages that are linked from page j should be important
-> the additional importance of a page linked from page j is proportional to x_j
Let L_k be the set of page k's backlinks; then

  x_k = Σ (x_j / n_j), summed over all j in L_k, where n_j = number of outgoing links on page j

Note: we ignore any link from a page to itself.

Page Ranking...
For the 4-page example web this gives
  x1 = x3 + x4/2
  x2 = x1/3
  x3 = x1/3 + x2/2 + x4/2
  x4 = x1/3 + x2/2
or, in matrix form, x = A x with

  A = [ 0    0    1    1/2
        1/3  0    0    0
        1/3  1/2  0    1/2
        1/3  1/2  0    0   ]

Recall from Linear Algebra that for an n x n square matrix A, a scalar λ is an eigenvalue if there is an eigenvector x ≠ 0 such that Ax = λx.
The PageRank of the pages is the eigenvector of A corresponding to eigenvalue = 1.

Page Ranking....
To find the eigenvector, solve the system (A - I)x = 0 using Gaussian elimination.
In our example, the solution is any scalar multiple of (12, 4, 9, 6).
Normalize: (12/31, 4/31, 9/31, 6/31) = (.387, .129, .290, .194)
Problem: is it necessary that the link matrix of any web has eigenvalue = 1?

Page Ranking.....
Problem: is it necessary that the link matrix of any web has eigenvalue = 1?
Dangling node: a web page that has no outgoing links
Theorem: the link matrix of a web with no dangling nodes always has 1 as an eigenvalue.
Examine the matrix A:
  A_ij = 1/n_j, if page j links to page i
  A_ij = 0, otherwise
-> sum of all entries in column j = total (unweighted) contribution of page j = 1
Column stochastic matrix: a square matrix with no negative entries, and the sum of each column = 1

Page Ranking......
Column stochastic matrix: a square matrix with no negative entries, and the sum of each column = 1
Our link matrix is a column stochastic matrix.
Theorem: every column stochastic matrix has 1 as an eigenvalue.
Proof:
1. (linear algebra) A square matrix A and its transpose A^T have the same eigenvalues.
2. Let e = an n x 1 vector with all entries = 1.
3. A is column stochastic -> A^T e = e -> A^T has an eigenvector e (with eigenvalue = 1) -> A has an eigenvalue = 1.
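A quick numeric check of the 4-page example above, done with NumPy rather than by hand. This is purely for illustration; for a 25-billion-page web a dense eigendecomposition is out of the question, and the Power Method described later is used instead.

```python
import numpy as np

# Link matrix of the 4-page example: A[i, j] = 1/n_j if page j links to page i
A = np.array([
    [0,   0,   1,   1/2],
    [1/3, 0,   0,   0  ],
    [1/3, 1/2, 0,   1/2],
    [1/3, 1/2, 0,   0  ],
])

assert np.allclose(A.sum(axis=0), 1.0)   # column stochastic: each column sums to 1

# Eigenvector for eigenvalue 1 (equivalently, a solution of (A - I) x = 0)
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmin(abs(eigvals - 1.0))        # pick the eigenvalue closest to 1
x = np.real(eigvecs[:, k])
x = x / x.sum()                          # normalize so the scores sum to 1

print(np.round(x, 3))   # approx [0.387, 0.129, 0.290, 0.194], i.e. (12, 4, 9, 6)/31
```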
QED

Page Ranking issues
Problems with the link matrix approach, and how to handle them:
- Non-unique eigenvector for eigenvalue = 1
- Dangling nodes
- Computational effort and time
- Other factors that also contribute to the 'importance of the page'

Page Ranking issues: non-unique eigenvector
[Figure: an example web with 5 pages; pages 1 and 2 link only to each other, pages 3, 4 and 5 link only among themselves]
Consider the web in the figure, with link matrix

  A = [ 0  1  0  0  0
        1  0  0  0  0
        0  0  0  1  1/2
        0  0  1  0  1/2
        0  0  0  0  0   ]

What are the eigenvectors for A corresponding to eigenvalue = 1?
- Any linear combination of (1/2, 1/2, 0, 0, 0) and (0, 0, 1/2, 1/2, 0)
What is the relative importance of the pages?? e.g. (.375, .375, .125, .125, 0)

Page Ranking issues: non-unique eigenvector..
Intuitive reason for the problem: this web can be partitioned into two independent sub-webs.
Mathematically: a non-unique eigenvector results when A has a block diagonal form

  A = [ A1  0   ...  0
        0   A2  ...  0
        ...
        0   0   ...  Ak ]

Solution:
- Reorganize A into block diagonal form
- Find the Page Rank for each block independently

Page Ranking issues: Dangling nodes
Suppose a web has dangling nodes (any page with no outgoing links)
- The corresponding column of A is all 0's
- The matrix A is not column stochastic (it is column sub-stochastic)
- For column sub-stochastic matrices, there must exist at least one eigenvalue, 0 < λ < 1, whose eigenvector has all non-negative entries
  [Note: this eigenvector is called the Perron eigenvector]
  [Note 2: we will omit the proof of its existence in this course]
- We could use the Perron eigenvector to get page ranks (but we won't)

Dangling nodes: a probabilistic view of the link matrix
Consider an imaginary, randomized web-surfer:
- He starts at a random page in the web and spends time t on the page.
- He then jumps randomly to one of the links on that page.
- The Page Rank of a page = the expected time spent on that page by the randomized surfer!
However: if our "randomized surfer" ends up on a dangling node, then he is stuck there for ever?!
One solution: if we are stuck at a dangling node, jump to a random page in the web.
How to model this view-point in our matrix A?
The S matrix: for any 0-column (dangling node) in A, make all entries = 1/n
Question: is the matrix S column stochastic?

Page Ranking issues: other factors influencing rank
Main idea (Google's damping factor): allow our randomized surfer to make a purely random jump "occasionally"
How to implement? What is the link matrix of a purely random surfer?

  R = the n x n matrix with every entry = 1/n

Note: R is column stochastic.
Consider the Google matrix, G = d S + (1 - d) R, with d in [0, 1]
Note: G is column stochastic (Why?)
Effect of using G: with probability d we follow the randomized surfer S, and with probability (1 - d) we make a purely random jump.

Page Ranking issues: other factors influencing rank..
Google matrix: G = d S + (1 - d) R
Google found that using the matrix G with damping factor d = 0.85 "works very well" (it gives a ranking that tends to rank "useful pages" higher).
A small sketch of building S and G for a toy web follows after this slide group.

Page Ranking issues: computing the eigenvector
The rank of page n is the n-th element of the eigenvector, I, of the G matrix for eigenvalue = 1
Size of G: approx. 25 billion x 25 billion!!!
Using Gaussian elimination would take months on a supercomputer!
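The sketch promised above: a made-up 5-page web with one dangling node (the link data is invented purely for illustration), used to build A, patch it into S, damp it into G, and then compute the rank vector by repeated multiplication with G -- the Power Method described on the next slides.

```python
import numpy as np

n = 5
# Toy web given as outgoing-link lists; page 5 is a dangling node (invented example)
links = {1: [2, 3], 2: [3], 3: [1, 4], 4: [1], 5: []}

# Raw link matrix A: A[i, j] = 1/n_j if page j+1 links to page i+1
A = np.zeros((n, n))
for j, outgoing in links.items():
    for i in outgoing:
        A[i - 1, j - 1] = 1.0 / len(outgoing)

# S matrix: replace each all-zero (dangling) column of A with 1/n everywhere
S = A.copy()
S[:, S.sum(axis=0) == 0] = 1.0 / n

# Google matrix with damping factor d: G = d*S + (1-d)*R, where R has every entry 1/n
d = 0.85
G = d * S + (1 - d) * np.full((n, n), 1.0 / n)
assert np.allclose(G.sum(axis=0), 1.0)       # G is column stochastic

# Power Method (next slides): iterate I <- G I from any starting rank vector
I = np.full(n, 1.0 / n)
for _ in range(100):
    I = G @ I
print(np.round(I, 3))    # page-rank scores, summing to 1
```

Because G is column stochastic, each iterate keeps summing to 1, so no re-normalization is needed in this small example.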
Properties of G:
- Large size (25 billion x 25 billion)
- Sparse (approx. 10 non-zero entries in a typical non-dangling-node column)
- 1 is the largest eigenvalue for the column stochastic matrix G
-> We can use the Power Method to compute the I-vector

Page Ranking issues: computing the eigenvector..
The Power Method to compute the eigenvector of the dominant eigenvalue:
- Start with an arbitrary rank vector, I0
- Iterate: I1 = G I0, ..., I_{k+1} = G I_k
- I_k converges to the eigenvector of the dominant eigenvalue
- We only need matrix multiplies for a sparse matrix (~10 multiplies per column)
  [NOTE: by writing G = d S + (1 - d) R, the effect of R is merely to add a fixed quantity to each row of I -> faster computation]
Brin and Page:
- Typically, Google needs 50-100 iterations to get a good I-vector
- Typically, the Google computers take a few days to calculate it
- Typically, Google updates rankings once per month (the "Google dance")

Concluding notes
Why is web-search important?
- For researchers, it is the only practical way to find information on the web
- For e-commerce companies: higher rank -> larger sales (why?)
  This is the reason for the high stock value of Google, Baidu, ...
Current trend:
- The Page Ranking method dominates directory-structured search engines
Technology note:
- Page Ranking: a practical application of abstract Linear Algebra concepts!

References and Further Reading
Books:
- Sherman and Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See, CyberAge Books, 2001
  [Note: some information in this book is already obsolete]
Web sources:
1. Googleguide: Google search guide
2. David Austin's article on how Google PageRank works

Next: Auctions, Web-transactions, Ebay