Discussion Class 6
Crawling the Web
Discussion Classes
Format:
Questions.
Ask a member of the class to answer.
Provide opportunity for others to comment.
When answering:
Stand up.
Give your name. Make sure that the TA hears it.
Speak clearly so that all the class can hear.
Suggestions:
Do not be shy at presenting partial answers.
Differing viewpoints are welcome.
Question 1: Background
(a) When was this paper written, by whom, and why?
(b) What, if anything, has changed since this paper was written?
(c) How has Yahoo changed?
Question 2: Search Engine Architecture
What is the function of the following?
(a) Crawl control
(b) Indexer module
(c) Structure index
(d) Ranking module
(e) Page repository
Question 3: What pages should the crawler download?
(a) What is the problem? Why do crawlers not download every page?
(b) What can a crawler know about a page without downloading it?
(c) The paper describes several importance measures: interest-driven, popularity-driven, location-driven. How do they apply?
(d) How do these importance measures interact with the ordering metrics?
Question 4: How should the crawler refresh pages?
(a) What is the problem?
(b) The paper discusses a "freshness" metric. What is this? Do you consider it a good metric?
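As a starting point for discussion, freshness is commonly read as the fraction of locally stored pages that still match the live copy on the Web. The sketch below assumes that reading; the `PageCopy` class and its version fields are hypothetical names used only for illustration, and the paper's exact definition should be checked against this.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PageCopy:
    """A locally stored page (hypothetical structure, for illustration only)."""
    url: str
    local_version: int  # version held in the crawler's repository
    live_version: int   # version currently served by the web site


def page_freshness(page: PageCopy) -> int:
    """Return 1 if the local copy matches the live page, otherwise 0."""
    return 1 if page.local_version == page.live_version else 0


def collection_freshness(pages: Sequence[PageCopy]) -> float:
    """Return the fraction of pages in the collection that are up to date."""
    if not pages:
        return 0.0
    return sum(page_freshness(p) for p in pages) / len(pages)


# Assumed example data: two of the four local copies are stale.
collection = [
    PageCopy("http://example.com/a", local_version=3, live_version=3),
    PageCopy("http://example.com/b", local_version=1, live_version=2),
    PageCopy("http://example.com/c", local_version=7, live_version=7),
    PageCopy("http://example.com/d", local_version=4, live_version=9),
]
score = collection_freshness(collection)
```

Note that under this reading a page that is one edit behind counts the same as a page that is years out of date, which is one angle on whether the metric is a good one.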
Question 5: How should the load on the visited Web sites be minimized?
(a) Why is this a problem?
(b) What can a crawler do to minimize the problem?
(c) What can a web site do to minimize the problem?
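One mechanism that bears on both (b) and (c) is the robots exclusion protocol: a site publishes a robots.txt file, and a well-behaved crawler consults it before fetching. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the crawler name `ExampleCrawler`, and the helper `polite_delay` are assumptions made up for this illustration, not taken from the paper.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content that a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(rules: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = rules.crawl_delay(agent)
    return float(delay) if delay is not None else default


rules = load_rules(ROBOTS_TXT)
allowed_public = rules.can_fetch(USER_AGENT, "http://example.com/index.html")
allowed_private = rules.can_fetch(USER_AGENT, "http://example.com/private/data.html")
wait_seconds = polite_delay(rules, USER_AGENT)
```

A real crawler would fetch robots.txt from each site and sleep for `wait_seconds` between requests to the same host; both steps are left out here to keep the sketch self-contained.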
Question 6: How should the crawling process be parallelized?
(a) Why should the crawling process be parallelized?
(b) What are the principal options?
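To make the options concrete, one possible scheme (not necessarily one the paper names) is to partition URLs among crawler processes by hashing the host name, so that each site is visited by exactly one process. The function name `assign_crawler` and the choice of SHA-1 are assumptions for this sketch.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index in [0, num_crawlers) based on its host.

    A stable hash is used (rather than Python's built-in hash(), which is
    salted per process) so that every process computes the same assignment.
    """
    if num_crawlers < 1:
        raise ValueError("num_crawlers must be at least 1")
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Assumed example: two pages on one host and one page on another host.
first = assign_crawler("http://example.com/a.html", 4)
second = assign_crawler("http://example.com/docs/b.html", 4)
other = assign_crawler("http://example.org/index.html", 4)
```

Because all pages from a host go to the same process, per-site politeness limits can be enforced locally; the cost is that a link found by one process may have to be forwarded to another.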