Crawling through Web Forums

Crawling the Web Forums

By: Ankush Goel
Instructor: Prof. Gail Kaiser
Spring 2009

Web Crawling

Automated traversal of the web to collect all useful, informative pages effectively and efficiently.

Gathers information about the link structure interconnecting the informative pages.

Generic Crawler Architecture
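
The architecture slides themselves are not preserved in this transcript, so the following is only a minimal sketch of what a generic crawler loop commonly looks like: a frontier queue, a set of already-seen URLs, and breadth-first expansion. The `crawl` function, the `get_links` callback, and the `TOY_SITE` link table are hypothetical names used for illustration; a real crawler would fetch pages over HTTP and parse their links.

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first traversal from seed; get_links(url) returns outgoing URLs."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, so none is queued twice
    order = []                 # pages in the order they were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical toy site standing in for real HTTP fetches.
TOY_SITE = {
    "/index": ["/board", "/login"],
    "/board": ["/thread-1", "/thread-2", "/login"],
    "/thread-1": ["/board", "/login"],
    "/thread-2": ["/board"],
    "/login": [],
}

visited = crawl("/index", lambda u: TOY_SITE.get(u, []))
```

Note that this loop treats `/login` like any other link: it has no notion of which pages or links are worth following.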

Web Forums

Web application designed to manage user-created content.

Online discussion area where anyone can discuss their favorite topics.

Why a Generic Crawler Fails on Web Forums

Presence of many functional links.

Inability to index the relationships among post pages.

Avoids crawling deep inside a web site.

Inefficient and ineffective.

List-of-Post Page

iRobot

Tool to crawl through Web Forums.

Intelligent enough to understand structure of forums before selecting traversal paths.

It addresses two issues:

Important pages

Important links

How It Works

Pre-samples a few pages to discover the repetitive regions.

Groups the pre-sampled pages into clusters based on their repetitive regions; each cluster can be considered a vertex in the sitemap.

Selects an optimal traversal path to crawl through the sitemap.
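
The clustering and sitemap steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not iRobot's actual method: each page's repetitive regions are assumed to be already extracted as opaque signature strings, and pages are grouped only when their region sets match exactly. `SAMPLES`, `cluster_pages`, and `build_sitemap` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical pre-sampled pages: the repetitive regions found on each page
# (as opaque signature strings) and the pages each one links to.
SAMPLES = {
    "/board": {"regions": {"thread-list"}, "links": ["/thread-1", "/thread-2", "/login"]},
    "/thread-1": {"regions": {"post-list", "pager"}, "links": ["/board", "/profile-7"]},
    "/thread-2": {"regions": {"post-list", "pager"}, "links": ["/board"]},
    "/profile-7": {"regions": {"user-info"}, "links": []},
    "/login": {"regions": set(), "links": []},
}

def cluster_pages(samples):
    """Group pages whose sets of repetitive regions are identical."""
    clusters = defaultdict(list)
    for url, page in samples.items():
        clusters[frozenset(page["regions"])].append(url)
    return dict(clusters)

def build_sitemap(samples, clusters):
    """Return directed edges between clusters, each cluster being one vertex."""
    cluster_of = {url: key for key, urls in clusters.items() for url in urls}
    edges = set()
    for url, page in samples.items():
        for target in page["links"]:
            # Keep only links that cross from one cluster to another.
            if target in cluster_of and cluster_of[target] != cluster_of[url]:
                edges.add((cluster_of[url], cluster_of[target]))
    return edges

clusters = cluster_pages(SAMPLES)
sitemap = build_sitemap(SAMPLES, clusters)
```

Choosing a traversal path then becomes a question of which sitemap edges to follow, for example board pages to thread pages but not to the login page.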

Repetitive Regions

Information Estimation Criteria

Important Pages

The more pages that share a page's structure or repetitive regions, the more important that page probably is.

An important page containing valuable information is probably larger than a low-value page such as a login page.

The content of an informative page is more diverse than that of a low-value page: a post page contains content created by thousands of different users, so it is far more diverse than an automatically generated, duplicated error page.
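
As a rough sketch of how the three page signals (number of similar pages, page size, content diversity) might be combined into a single cluster score: the slides do not give a formula, so the plain product used here and the word-based diversity measure are assumptions made purely for illustration, as are the names `diversity` and `cluster_score` and the sample texts.

```python
def diversity(texts):
    """Share of distinct words among all words in the cluster (0.0 to 1.0)."""
    words = [w for t in texts for w in t.lower().split()]
    return len(set(words)) / len(words) if words else 0.0

def cluster_score(texts):
    """Combine the three signals; the plain product is an illustrative choice."""
    if not texts:
        return 0.0
    mean_size = sum(len(t) for t in texts) / len(texts)  # average page size in characters
    return len(texts) * mean_size * diversity(texts)

# Hypothetical page texts for two clusters.
post_pages = [
    "great tip on crawler politeness thanks",
    "has anyone tried sampling forum boards first",
    "my parser breaks on nested quote blocks",
]
error_pages = ["error page not found", "error page not found"]

post_score = cluster_score(post_pages)
error_score = cluster_score(error_pages)
```

Under these assumptions the post cluster scores higher than the duplicated error pages, because it has more pages, larger pages, and less repeated wording.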

Important Links

Location: links at a similar location in similar repetitive regions probably have a similar function.
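
The location criterion can be illustrated by grouping links according to where they sit. In this sketch a location is assumed to be a pair of region signature and slot index within that region; this representation, the `LINKS` data, and `group_by_location` are hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Hypothetical links from pages in one cluster: (page, region, slot, href).
LINKS = [
    ("/thread-1", "pager", 0, "/thread-1?page=2"),
    ("/thread-2", "pager", 0, "/thread-2?page=2"),
    ("/thread-1", "post-list", 1, "/reply?post=11"),
    ("/thread-2", "post-list", 1, "/reply?post=27"),
]

def group_by_location(links):
    """Group hrefs by (region, slot); each group is assumed to share one function."""
    groups = defaultdict(list)
    for _page, region, slot, href in links:
        groups[(region, slot)].append(href)
    return dict(groups)

groups = group_by_location(LINKS)
```

A crawler could then decide once per group, rather than once per link, whether that kind of link (for example "next page" versus "reply") is worth following.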

Flowchart

Comparison

Thank You