Nutch in a Nutshell
(part I)
Presented by
Liew Guo Min
Zhao Jin
Outline
Overview
Nutch as a web crawler
Nutch as a complete web search engine
Special features
Installation/Usage (with Demo)
Exercises
Overview
Complete web search engine
Nutch
= Crawler + Indexer/Searcher (Lucene) + GUI
+ Plugins
+ MapReduce & Distributed FS (Hadoop)
Java based, open source
Features:
Customizable
Extensible (Next meeting)
Distributed (Next meeting)
Nutch as a crawler
[Diagram: the Nutch crawl cycle. The Injector seeds the CrawlDB with the initial URLs. The Generator reads the CrawlDB and generates a fetch list in a new Segment. The Fetcher gets the webpages/files from the Web and writes them into the Segment. The Parser reads the fetched content and writes the parsed results back. The CrawlDB tool updates the CrawlDB from the Segment with newly discovered links.]
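The same cycle can be driven step by step with the lower-level Nutch commands. A sketch assuming a Nutch 0.8-style release (directory names are placeholders; with default settings the fetcher also parses as it goes):

  bin/nutch inject crawl/crawldb urls                # Injector: seed the CrawlDB
  bin/nutch generate crawl/crawldb crawl/segments    # Generator: write a fetch list into a new segment
  s1=`ls -d crawl/segments/2* | tail -1`             # pick the newest segment
  bin/nutch fetch $s1                                # Fetcher: get the pages from the web
  bin/nutch updatedb crawl/crawldb $s1               # update the CrawlDB with newly found links

Repeating generate/fetch/updatedb deepens the crawl by one level each round.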
Nutch as a complete web search engine
[Diagram: the search side. The Indexer (Lucene) builds the Index from the Segments, the LinkDB and the CrawlDB. The Searcher (Lucene) serves queries against the Index through the GUI (Tomcat).]
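In command form, the indexing side corresponds roughly to the following sketch (paths are placeholders, continuing the layout above):

  bin/nutch invertlinks crawl/linkdb -dir crawl/segments                        # build the LinkDB from the segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*     # Lucene indexer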
Special Features
Customizable
Configuration files (XML)
Required user parameters
http.agent.name
http.agent.description
http.agent.url
http.agent.email
Adjustable parameters for every component
E.g. for fetcher:
Threads-per-host
Threads-per-ip
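As an illustration, a minimal conf/nutch-site.xml setting the required parameters might look like the sketch below; all values are placeholders, and the fetcher property name follows the slide's threads-per-host hint:

  <?xml version="1.0"?>
  <configuration>
    <!-- Required: identify your crawler to the sites it visits -->
    <property>
      <name>http.agent.name</name>
      <value>MyNutchSpider</value>
    </property>
    <property>
      <name>http.agent.description</name>
      <value>A course-exercise crawler</value>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>http://www.example.com/spider.html</value>
    </property>
    <property>
      <name>http.agent.email</name>
      <value>spider@example.com</value>
    </property>
    <!-- Adjustable: at most one fetcher thread per host -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
  </configuration>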
Special Features
URL filters (text file)
Regular expressions to filter URLs during crawling
E.g.
To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
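Putting the two rules together, a conf/crawl-urlfilter.txt could look like this sketch. Rules are applied top to bottom and the first match wins, so a catch-all reject usually goes last:

  # skip files with these suffixes
  -\.(gif|exe|zip|ico)$
  # accept hosts in the apache.org domain
  +^http://([a-z0-9]*\.)*apache.org/
  # reject everything else
  -.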
Plugin information (XML)
The metadata of the plugins (more details next week)
Installation & Usage
Installation
Software needed
Nutch release
Java
Apache Tomcat (for GUI)
Cygwin (for Windows)
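Installation then boils down to something like the following sketch (version numbers and paths are placeholders):

  tar xzf nutch-0.9.tar.gz                  # unpack the Nutch release
  cd nutch-0.9
  export JAVA_HOME=/usr/lib/jvm/java-1.5    # point Nutch at your JDK
  bin/nutch                                 # with no arguments, lists the available commands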
Installation & Usage
Usage
Crawling
Initial URLs (text file or DMOZ file)
Required parameters (conf/nutch-site.xml)
URL filters (conf/crawl-urlfilter.txt)
Indexing
Automatic
Searching
Location of files (WAR file, index)
The Tomcat server
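For instance, a one-command crawl followed by deploying the search GUI might look like this sketch (seed URL, depth and paths are placeholders):

  mkdir urls
  echo "http://lucene.apache.org/" > urls/seed.txt
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50    # inject, generate, fetch, update, index
  cp nutch-0.9.war $CATALINA_HOME/webapps/ROOT.war     # deploy the GUI to Tomcat

The deployed webapp reads its own nutch-site.xml, so its searcher.dir property should point at the crawl directory (or Tomcat should be started from the directory containing it).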
Demo time!
Exercises
Questions:
What are the things that need to be done before starting a crawl job with Nutch?
What are the ways to tell Nutch what to crawl and what not? What can you do if you are the owner of a website?
Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
What do you think are good crawling behaviors?
Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking?
What are the advantages of using Nutch instead of commercial search engines?
Answers
What are the things that need to be done before starting a crawl job with Nutch?
Set the CLASSPATH to the Lucene Core
Set the JAVA_HOME path
Create a folder containing the URLs to be crawled
Amend the crawl-urlfilter file
Amend the nutch-site.xml file to include the user parameters
What are the ways to tell Nutch what to crawl and what not?
URL filters
Depth in crawling
Scoring function for URLs
What can you do if you are the owner of a website?
Web server administrators
Use the Robots Exclusion Protocol by adding rules to /robots.txt (example below)
HTML authors
Add the robots META tag (example below)
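For example (a generic sketch; the path is a placeholder):

  # /robots.txt -- keep all robots out of /private/
  User-agent: *
  Disallow: /private/

  <!-- per-page robots META tag, placed in the HTML <head> -->
  <meta name="robots" content="noindex, nofollow">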
Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
To ensure accountability (although tracing is still possible without them)
What do you think are good crawling behaviors?
Be Accountable
Test Locally
Don't hog resources
Stay with it
Share results
Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking?
True, but one can always make changes in Nutch to minimize the effect.
What are the advantages of using Nutch instead of commercial search engines?
Open-source
Transparent
Able to define what is returned in searches and how the index is ranked
Exercises
Hands-on exercises
Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
Repeat the crawling process without using the crawl command
Modify your configuration to perform each of the following crawl jobs and think about when they would be useful:
To crawl only webpages and pdfs but not anything else
To crawl the files on your hard disk
To crawl but not to parse
(Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
Q&A?
Next Meeting
Special Features
Extensible
Distributed
Feedback and discussion
References
http://lucene.apache.org/nutch/ -- Official website
http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take with a grain of salt)
http://lucene.apache.org/nutch/release/ -- Nutch source code
www.nutchinstall.blogspot.com -- Installation guide
http://www.robotstxt.org/wc/robots.html -- The Web Robots Pages
Thank you!