Nutch in a Nutshell
(part I)

Presented by
Liew Guo Min
Zhao Jin
National University of Singapore
Outline
- Overview
- Nutch as a web crawler
- Nutch as a complete web search engine
- Special features
- Installation/Usage (with demo)
- Exercises

Overview
- Complete web search engine:
  Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
        + Plugins
        + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features:
  - Customizable
  - Extensible (next meeting)
  - Distributed (next meeting)
Nutch as a crawler
[Diagram: the Nutch crawl cycle. The Injector loads the initial URLs into the CrawlDB. The Generator reads the CrawlDB and generates a fetch list as a new Segment. The Fetcher gets the webpages/files from the Web and writes them into the Segment, where the Parser parses them. The CrawlDB tool then updates the CrawlDB with the newly discovered links, and the cycle repeats.]
Nutch as a complete web search engine
[Diagram: the Indexer (Lucene) builds the Index from the Segments, the LinkDB and the CrawlDB. The Searcher (Lucene) queries the Index and serves results through the GUI (Tomcat).]
Special Features
- Customizable
  - Configuration files (XML)
    - Required user parameters:
      - http.agent.name
      - http.agent.description
      - http.agent.url
      - http.agent.email
    - Adjustable parameters for every component, e.g. for the fetcher:
      - threads-per-host
      - threads-per-ip
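As an illustration, a minimal conf/nutch-site.xml setting the required user parameters might look like this (the values are placeholders, not from the slides):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyTestCrawler</value>                    <!-- placeholder crawler name -->
      </property>
      <property>
        <name>http.agent.description</name>
        <value>Test crawler for the Nutch exercises</value>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.com/crawler.html</value>  <!-- page describing the crawler -->
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler-admin@example.com</value>        <!-- contact address -->
      </property>
    </configuration>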
Special Features
- URL filters (text file)
  - Regular expressions to filter URLs during crawling, e.g.:
    - To ignore files with a certain suffix:
      -\.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain:
      +^http://([a-z0-9]*\.)*apache.org/
    (a sample filter file follows this slide)
- Plugin information (XML)
  - The metadata of the plugins (more details next week)
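Putting these rules together, a conf/crawl-urlfilter.txt restricted to one domain could look roughly like the sketch below. Nutch tries the patterns top-down and the first match decides ('+' accepts, '-' rejects), so the catch-all reject goes last; apache.org is only an example domain:

    # skip images, archives and binaries by suffix
    -\.(gif|exe|zip|ico)$
    # accept anything within the apache.org domain
    +^http://([a-z0-9]*\.)*apache.org/
    # reject everything else
    -.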
Installation & Usage
- Installation
  - Software needed:
    - Nutch release
    - Java
    - Apache Tomcat (for the GUI)
    - Cygwin (for Windows)
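A rough sketch of the setup on a Unix-like shell (the release version and paths are assumptions, not from the slides):

    # unpack the Nutch release and point it at a JVM
    tar xzf nutch-0.9.tar.gz
    cd nutch-0.9
    export NUTCH_JAVA_HOME=/usr/lib/jvm/java-1.5.0   # or set JAVA_HOME
    bin/nutch                                        # with no arguments, lists the available commands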

Installation & Usage
- Usage
  - Crawling, which needs:
    - Initial URLs (text file or DMOZ file)
    - Required parameters (conf/nutch-site.xml)
    - URL filters (conf/crawl-urlfilter.txt)
  - Indexing: automatic
  - Searching, which needs:
    - Location of files (WAR file, index)
    - The Tomcat server
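A rough sketch of these steps with the 0.8/0.9 command line (the directory names and the -depth/-topN values below are just examples, not from the slides):

    # one-shot crawl: URLs listed in the 'urls' directory, 3 links deep,
    # at most 50 pages per level, output written under 'crawl/'
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # quick command-line search against the freshly built index
    bin/nutch org.apache.nutch.searcher.NutchBean apache

    # for the GUI, deploy the Nutch WAR into Tomcat and point it
    # (searcher.dir) at the 'crawl' directory, then search from the browser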
Demo time!
Exercises
- Questions:
  - What are the things that need to be done before starting a crawl job with Nutch?
  - What are the ways to tell Nutch what to crawl and what not? What can you do if you are the owner of a website?
  - Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
  - What do you think are good crawling behaviors?
  - Do you think an open-source search engine like Nutch makes it easier for spammers to manipulate the search index ranking?
  - What are the advantages of using Nutch instead of commercial search engines?
Answers
- What are the things that need to be done before starting a crawl job with Nutch?
  - Set the CLASSPATH to the Lucene core
  - Set the JAVA_HOME path
  - Create a folder containing the URLs to be crawled
  - Amend the crawl-urlfilter file
  - Amend the nutch-site.xml file to include the user parameters

- What are the ways to tell Nutch what to crawl and what not?
  - URL filters
  - Depth in crawling
  - Scoring function for URLs
- What can you do if you are the owner of a website?
  - Web server administrators: use the Robots Exclusion Protocol by adding rules to /robots.txt
  - HTML authors: add the robots META tag
  (examples of both are sketched below)
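For illustration, a minimal /robots.txt that keeps well-behaved crawlers out of part of a site, and the equivalent per-page META tag (the path and rules are just examples):

    # /robots.txt at the web server's document root
    User-agent: *
    Disallow: /private/

    <!-- per-page directive placed in the HTML <head> -->
    <meta name="robots" content="noindex, nofollow">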

- Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
  - To ensure accountability (although tracing is still possible without them)
- What do you think are good crawling behaviors?
  - Be accountable
  - Test locally
  - Don't hog resources
  - Stay with it
  - Share results

- Do you think an open-source search engine like Nutch makes it easier for spammers to manipulate the search index ranking?
  - To some extent, but one can always make changes in Nutch to minimize the effect.
- What are the advantages of using Nutch instead of commercial search engines?
  - Open source
  - Transparent
  - Able to define what is returned in searches and how results are ranked in the index
Exercises
- Hands-on exercises
  - Install Nutch, crawl a few webpages using the crawl command and perform a search on them using the GUI
  - Repeat the crawling process without using the crawl command (see the step-by-step sketch below)
  - Modify your configuration to perform each of the following crawl jobs and think about when they would be useful:
    - To crawl only webpages and PDFs but nothing else
    - To crawl the files on your hard disk
    - To crawl but not to parse
  - (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
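For the second exercise, the crawl command can be replaced by the individual tools it wraps. Roughly, with the 0.8/0.9 command line (directory names are examples; run bin/nutch with no arguments to check the exact usage of each tool):

    bin/nutch inject crawl/crawldb urls                  # load the initial URLs into the CrawlDB
    bin/nutch generate crawl/crawldb crawl/segments      # generate a fetch list (a new segment)
    s=`ls -d crawl/segments/* | tail -1`                 # path of the newly created segment
    bin/nutch fetch $s                                   # fetch (and parse) the pages in the segment
    bin/nutch updatedb crawl/crawldb $s                  # update the CrawlDB with newly found links
    # ... repeat generate/fetch/updatedb once per level of depth, then:
    bin/nutch invertlinks crawl/linkdb crawl/segments/*  # build the LinkDB
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*   # build the Lucene index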
Q&A?
Next Meeting
- Special features
  - Extensible
  - Distributed
- Feedback and discussion
References
- http://lucene.apache.org/nutch/ -- Official website
- http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take with a grain of salt)
- http://lucene.apache.org/nutch/release/ -- Nutch source code
- www.nutchinstall.blogspot.com -- Installation guide
- http://www.robotstxt.org/wc/robots.html -- The web robot pages
Thank you!