Discussion Class 6
Crawling the Web

Discussion Classes

Format:
Questions. Ask a member of the class to answer.
Provide opportunity for others to comment.

When answering:
Stand up. Give your name. Make sure that the TA hears it.
Speak clearly so that all the class can hear.

Suggestions:
Do not be shy at presenting partial answers.
Differing viewpoints are welcome.

Question 1: Background
(a) When was this paper written, by whom, and why?
(b) What, if anything, has changed since this paper was written?
(c) How has Yahoo changed?

Question 2: Search Engine Architecture
What is the function of the following?
(a) Crawl control
(b) Indexer module
(c) Structure index
(d) Ranking module
(e) Page repository

Question 3: What pages should the crawler download?
(a) What is the problem? Why do crawlers not download every page?
(b) What can a crawler know about a page without downloading it?
(c) The paper describes several importance measures: interest-driven, popularity-driven, location-driven. How do they apply?
(d) How do these importance measures interact with the ordering metrics?

Question 4: How should the crawler refresh pages?
(a) What is the problem?
(b) The paper discusses a "freshness" metric. What is this? Do you consider it a good metric?

Question 5: How should the load on the visited Web sites be minimized?
(a) Why is this a problem?
(b) What can a crawler do to minimize the problem?
(c) What can a web site do to minimize the problem?

Question 6: How should the crawling process be parallelized?
(a) Why should the crawling process be parallelized?
(b) What are the principal options?
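
Background sketch for Question 5: as one possible starting point for discussion, the Python fragment below shows how a crawler might consult a site's robots.txt file before fetching a page, and how long it might wait between requests. It is a minimal sketch, not the method described in the paper. The robots.txt content, the user agent name "ExampleBot", the example.com URLs, the helper names make_parser and plan_fetch, and the one-second default delay are all assumptions made for illustration. Only the standard-library module urllib.robotparser is used.

```python
import urllib.robotparser

# Hypothetical robots.txt content that a site might publish (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def plan_fetch(parser, user_agent, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for a single URL.

    allowed: whether robots.txt permits this user agent to fetch the URL.
    delay_seconds: the site's Crawl-delay if it gives one, otherwise an
    assumed default chosen by the crawler.
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/report.html"):
        print(url, plan_fetch(parser, "ExampleBot", url))
```

The sketch touches both sides of the question: the robots.txt file is something a web site can provide, while checking it and spacing out requests is something a crawler can do. A real crawler would also need per-host scheduling, which is left out here.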