Discussion Class 6
Crawling the Web
Discussion Classes
Ask a member of the class to answer.
Provide opportunity for others to comment.
When answering:
Stand up.
Give your name. Make sure that the TA hears it.
Speak clearly so that all the class can hear.
Do not be shy at presenting partial answers.
Differing viewpoints are welcome.
Question 1: Background
(a) When was this paper written, by whom, and why?
(b) What, if anything, has changed since this paper was written?
(c) How has Yahoo changed?
Question 2: Search Engine Architecture
What is the function of the following?
(a) Crawl control
(b) Indexer module
(c) Structure index
(d) Ranking module
(e) Page repository
Question 3: What pages should the crawler download?
(a) What is the problem? Why do crawlers not download every page?
(b) What can a crawler know about a page without downloading it?
(c) The paper describes several importance measures: interest-driven, popularity-driven, location-driven. How do they
(d) How do these importance measures interact with the ordering metrics?
Question 4: How should the crawler refresh pages?
(a) What is the problem?
(b) The paper discusses a "freshness" metric. What is this? Do you consider it a good metric?
Question 5: How should the load on the visited Web sites be minimized?
(a) Why is this a problem?
(b) What can a crawler do to minimize the problem?
(c) What can a web site do to minimize the problem?
Question 6: How should the crawling process be parallelized?
(a) Why should the crawling process be parallelized?
(b) What are the principal options?