Nutch in a Nutshell - National University of Singapore
Nutch in a Nutshell
Presented by
Liew Guo Min
Zhao Jin
Outline
Recap
Special features
Running Nutch in a distributed environment (with demo)
Q&A
Discussion
Recap
Complete web search engine
Nutch
= Crawler + Indexer/Searcher (Lucene) + GUI
+ Plugins
+ MapReduce & Distributed FS (Hadoop)
Java based, open source
Features:
Customizable
Extensible
Distributed
Nutch as a crawler
(Diagram: the Injector injects the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates a Segment; the Fetcher gets webpages/files from the Web and reads/writes the Segment; the Parser reads/writes the Segment; the CrawlDBTool updates the CrawlDB.)
Special Features
Extensible (Plugin system)
Most of the essential functionalities of Nutch are implemented as plugins
Three layers:
Extension points -- what can be extended: Protocol, Parser, ScoringFilter, etc.
Extensions -- the interfaces to be implemented for the extension points
Plugins -- the actual implementations
Special Features
Extensible (Plugin system)
Anyone can write a plugin
Write the code
Prepare the metadata files
plugin.xml: what has been extended by what
build.xml: how ant can build your source code
Ask Nutch to include your plugin in conf/nutch-site.xml
Tell ant to build your plugin in src/plugin/build.xml
More details @
http://wiki.apache.org/nutch/PluginCentral
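As a sketch only, a minimal plugin.xml for a hypothetical parser plugin might look like the following (all ids, class names, and content types are illustrative placeholders, not from the source):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- plugin.xml: declares what this plugin extends (illustrative sketch) -->
<plugin id="parse-example" name="Example Parser"
        version="1.0.0" provider-name="example.org">
   <runtime>
      <!-- the jar that ant builds from this plugin's source -->
      <library name="parse-example.jar">
         <export name="*"/>
      </library>
   </runtime>
   <!-- register an extension against the Parser extension point -->
   <extension id="org.example.parse" name="Example Parse Plugin"
              point="org.apache.nutch.parse.Parser">
      <implementation id="org.example.parse.ExampleParser"
                      class="org.example.parse.ExampleParser">
         <parameter name="contentType" value="text/example"/>
      </implementation>
   </extension>
</plugin>
```

The accompanying build.xml is typically just a few lines delegating to Nutch's shared plugin build logic, and the plugin's id must then appear in the plugin.includes property of conf/nutch-site.xml.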
Special Features
Extensible (Plugin system)
To use a plugin
Make sure you have modified nutch-site.xml to include the plugin
Then, either:
Nutch will call it automatically when needed, or
You can load it by its class name and call it yourself
Special Features
Distributed (Hadoop)
Map-Reduce
(Diagram)
A framework for distributed programming
Map -- process the splits of the data to produce intermediate results, with keys indicating what should be grouped together later
Reduce -- process the intermediate results that share the same key and output the final result
Special Features
Distributed (Hadoop)
MapReduce in Nutch
Example 1: Parsing
Input: <url, content> files from fetch
Map(url, content) -> <url, parse> by calling the parser plugins
Reduce is the identity
Example 2: Dumping a segment
Input: <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
Map is the identity
Reduce(url, value*) -> <url, ConcatenatedValue> by simply concatenating the text representations of the values
Special Features
Distributed (Hadoop)
Distributed File System
Write-once-read-many coherence model
Master/slave architecture
Simple, but the master is a single point of failure
Transparent
High throughput
Access via Java API
More info @
http://lucene.apache.org/hadoop/hdfs_design.html
Running Nutch in a distributed environment
MapReduce
In hadoop-site.xml:
Specify the job tracker host & port
mapred.job.tracker
Specify the task numbers
mapred.map.tasks
mapred.reduce.tasks
Specify the location for temporary files
mapred.local.dir
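The three MapReduce settings above might be written in hadoop-site.xml roughly as follows (the hostname, port, task counts, and path are illustrative placeholders):

```xml
<!-- hadoop-site.xml, MapReduce part (illustrative values) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.org:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```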
Running Nutch in a distributed environment
DFS
In hadoop-site.xml:
Specify the namenode host, port & directory
fs.default.name
dfs.name.dir
Specify the location for files on each datanode
dfs.data.dir
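Likewise, the DFS settings might look like this in hadoop-site.xml (hostname, port, and directories are illustrative placeholders):

```xml
<!-- hadoop-site.xml, DFS part (illustrative values) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.org:9000</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>
</configuration>
```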
Demo time!
Q&A
Discussion
Exercises
Hands-on exercises
Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
Repeat the crawling process without using the crawl command
Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
To crawl only webpages and PDFs, but nothing else
To crawl the files on your hard disk
To crawl but not to parse
(Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
Reference
http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch
plugins
http://lucene.apache.org/hadoop/ -- Hadoop homepage
http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
http://wiki.apache.org/nutchdata/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
http://wiki.apache.org/nutchdata/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
http://www.mail-archive.com/[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
Problem
Find the number of occurrences of "cat" in a file
What if the file is 20GB large?
Why not do it with more computers?
Solution
(Diagram: the file is split into Split 1 and Split 2; PC1 counts 200 occurrences in its split and PC2 counts 300 in the other; the partial counts are then summed on PC1 to give 500.)
Excursion: MapReduce
Problem
Find the number of occurrences of both "cat" and "dog" in a very large file
Solution
(Diagram: the input file is split into Split 1 and Split 2; in the Map phase PC1 and PC2 each count both words in their own split, emitting intermediate pairs such as cat: 200, dog: 250 and cat: 300, dog: 250; the Sort/Group step brings all values with the same key together into one intermediate file; in the Reduce phase PC1 sums the cat counts to cat: 500 and PC2 sums the dog counts to dog: 500, writing the output files. Input files -> Map -> Sort/Group -> intermediate files -> Reduce -> output files.)
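The two-word counting example above can be sketched as a toy, single-machine simulation of the three phases (the class and method names here are our own, not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyMapReduce {
    // Map phase: each "worker" counts the target words in its own split.
    static Map<String, Integer> map(String split, List<String> targets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : split.split("\\s+")) {
            if (targets.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sort/Group phase: gather all intermediate values under the same key.
    static Map<String, List<Integer>> group(List<Map<String, Integer>> intermediates) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map<String, Integer> m : intermediates) {
            for (Map.Entry<String, Integer> e : m.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce phase: sum the grouped counts for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> targets = List.of("cat", "dog");
        // Two splits of the "file", each mapped independently as in the slides.
        List<Map<String, Integer>> intermediates = new ArrayList<>();
        intermediates.add(map("cat dog cat", targets)); // split 1
        intermediates.add(map("dog cat dog", targets)); // split 2
        // Prints the combined counts for "cat" and "dog" across both splits.
        System.out.println(reduce(group(intermediates)));
    }
}
```

In the real framework the splits, grouping, and reducers run on different machines, with the master assigning work; here all three phases simply run in sequence in one process.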
Excursion: MapReduce
Generalized Framework
(Diagram: a Master coordinates the workers. Map workers read Splits 1-4 of the input files and emit key/value pairs such as k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6; the Sort/Group step writes intermediate files with all values for one key together, e.g. k1:v1,v3; k2:v4,v5; k3:v2; k4:v6; Reduce workers then produce Outputs 1-3. Input files -> Map -> Sort/Group -> intermediate files -> Reduce -> output files.)