Nutch in a Nutshell
Presented by
Liew Guo Min
Zhao Jin
Outline

- Recap
- Special features
- Running Nutch in a distributed environment (with demo)
- Q&A
- Discussion

Recap

- Complete web search engine
  - Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features:
  - Customizable
  - Extensible
  - Distributed

Nutch as a crawler

[Diagram: the crawl cycle. The Injector seeds the CrawlDB with the initial URLs. The Generator reads the CrawlDB and generates a fetch list in a new Segment. The Fetcher gets the listed webpages/files from the Web and writes them to the Segment. The Parser reads the fetched content and writes the parse results back to the Segment. The CrawlDBTool then updates the CrawlDB from the Segment, and the cycle repeats.]

Special Features

- Extensible (plugin system)
  - Most of the essential functionalities of Nutch are implemented as plugins
  - Three layers:
    - Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
    - Extensions: the interfaces to be implemented for the extension points
    - Plugins: the actual implementations

Special Features

- Extensible (plugin system)
  - Anyone can write a plugin:
    - Write the code
    - Prepare the metadata files (a sketch follows below):
      - plugin.xml: what has been extended by what
      - build.xml: how ant can build your source code
    - Ask Nutch to include your plugin in conf/nutch-site.xml
    - Tell ant to build your plugin in src/plugin/build.xml
  - More details @ http://wiki.apache.org/nutch/PluginCentral
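
The metadata is small. Below is a minimal, hypothetical plugin.xml for a parser plugin; the id parse-demo, the jar name, and the class org.example.parse.DemoParser are placeholders, and the exact descriptor schema is best checked against the plugins shipped in src/plugin:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical descriptor for a plugin named parse-demo -->
    <plugin id="parse-demo" name="Demo Parse Plug-in"
            version="0.0.1" provider-name="example.org">
      <runtime>
        <!-- The jar that build.xml produces from your source code -->
        <library name="parse-demo.jar">
          <export name="*"/>
        </library>
      </runtime>
      <requires>
        <import plugin="nutch-extensionpoints"/>
      </requires>
      <!-- "point" names the extension point; "class" is your extension -->
      <extension id="org.example.parse.demo" name="Demo Parser"
                 point="org.apache.nutch.parse.Parser">
        <implementation id="DemoParser"
                        class="org.example.parse.DemoParser"/>
      </extension>
    </plugin>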

Special Features

- Extensible (plugin system)
  - To use a plugin:
    - Make sure you have modified nutch-site.xml to include the plugin (see the sketch below)
    - Then, either:
      - Nutch will automatically call it when needed, or
      - You can write code that looks it up by its classname and then uses it
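
Inclusion happens through the plugin.includes property. A sketch for conf/nutch-site.xml, assuming the hypothetical parse-demo plugin from the previous slide; the value is a regular expression over plugin ids, so keep whatever defaults your crawl needs:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>plugin.includes</name>
        <!-- Add your own plugin to the alternation -->
        <value>protocol-http|urlfilter-regex|parse-(text|html|demo)|index-basic|query-(basic|site|url)</value>
      </property>
    </configuration>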

Special Features

- Distributed (Hadoop)
  - MapReduce (diagrams: see the excursion slides at the end)
    - A framework for distributed programming
    - Map -- process the splits of the input data into intermediate results, with keys indicating what should be grouped together later
    - Reduce -- process all intermediate results sharing the same key and output the final result
      (a minimal job is sketched after this slide)
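
As a sketch of how Map and Reduce fit together, here is the classic word count written against Hadoop's old org.apache.hadoop.mapred API (the one current when these slides were written; method names shift slightly across early versions). It answers the "cat"/"dog" counting problem from the excursion slides:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Counts how often each word (e.g. "cat", "dog") occurs in a large input.
    public class WordCount {

      public static class WCMap extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Map: for every word in this split, emit the intermediate pair <word, 1>.
        public void map(LongWritable key, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tok = new StringTokenizer(line.toString());
          while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            out.collect(word, ONE);
          }
        }
      }

      public static class WCReduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Reduce: all counts for the same word arrive together; sum them.
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (counts.hasNext()) sum += counts.next().get();
          out.collect(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WCMap.class);
        job.setReducerClass(WCReduce.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // e.g. the 20GB file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
      }
    }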

Special Features

- Distributed (Hadoop)
  - MapReduce in Nutch
    - Example 1: Parsing
      - Input: <url, content> files from fetch
      - Map(url, content) -> <url, parse>, by calling the parser plugins
      - Reduce is the identity
    - Example 2: Dumping a segment
      - Input: <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
      - Map is the identity
      - Reduce(url, value*) -> <url, ConcatenatedValue>, by simply concatenating the text representation of the values (a schematic reduce follows below)
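
Example 2's reduce might look roughly like this schematic (illustrative only, not Nutch's actual segment-dumping code), again in the old mapred API:

    // Map is the identity; this reduce concatenates the text form of
    // every value recorded for a URL into one dump record.
    public void reduce(Text url, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      StringBuffer dump = new StringBuffer();
      while (values.hasNext()) {
        dump.append(values.next().toString()).append('\n');
      }
      output.collect(url, new Text(dump.toString()));
    }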

Special Features

- Distributed (Hadoop)
  - Distributed file system
    - Write-once-read-many coherence model
      - Enables high throughput
    - Master/slave architecture
      - Simple architecture, but the master is a single point of failure
    - Transparent: access via the Java API (sketched below)
  - More info @ http://lucene.apache.org/hadoop/hdfs_design.html
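
A small sketch of the Java API access, assuming a running DFS and a hypothetical path; Configuration picks up the hadoop-site.xml settings described on the next slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Write a file once, then read it back -- the write-once-read-many model.
    public class DfsHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);       // the DFS named by fs.default.name

        Path p = new Path("/user/demo/hello.txt");  // hypothetical path
        FSDataOutputStream out = fs.create(p);
        out.writeUTF("hello, dfs");
        out.close();

        FSDataInputStream in = fs.open(p);
        System.out.println(in.readUTF());
        in.close();
      }
    }
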
Running Nutch in a distributed environment

- MapReduce
  - In hadoop-site.xml:
    - Specify the job tracker host & port
      - mapred.job.tracker
    - Specify the task numbers
      - mapred.map.tasks
      - mapred.reduce.tasks
    - Specify the location for temporary files
      - mapred.local.dir
  (a sample fragment follows below)
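
A sample hadoop-site.xml fragment with these properties; the host name, port, task counts, and path are placeholders for your own cluster:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.org:9001</value>  <!-- job tracker host & port -->
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <value>4</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/tmp/hadoop/mapred/local</value>     <!-- temporary files -->
      </property>
    </configuration>
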
Running Nutch in a distributed environment

- DFS
  - In hadoop-site.xml:
    - Specify the namenode host, port & directory
      - fs.default.name
      - dfs.name.dir
    - Specify the location for files on each datanode
      - dfs.data.dir
  (a sample fragment follows below)
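
And the corresponding DFS fragment, again with placeholder host and paths:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>namenode.example.org:9000</value>  <!-- namenode host & port -->
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/hadoop/dfs/name</value>      <!-- namenode metadata directory -->
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/hadoop/dfs/data</value>      <!-- block storage on each datanode -->
      </property>
    </configuration>
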
Demo time!
Q&A
Discussion
Exercises

- Hands-on exercises
  - Install Nutch, crawl a few webpages using the crawl command, and perform a search on it using the GUI
  - Repeat the crawling process without using the crawl command (the commands are sketched below)
  - Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
    - To crawl only webpages and PDFs, but nothing else
    - To crawl the files on your hard disk
    - To crawl but not to parse
  - (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
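
For the first two exercises, the commands look roughly like the following (names as in the Nutch 0.8-era tutorial; check bin/nutch with no arguments for the exact usage of your version):

    # One-shot crawl: seed URLs in urls/, results under crawl/
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # The same process step by step, one fetch round:
    bin/nutch inject crawl/crawldb urls                # seed the CrawlDB
    bin/nutch generate crawl/crawldb crawl/segments    # make a fetch list
    s=`ls -d crawl/segments/* | tail -1`               # the new segment
    bin/nutch fetch $s                                 # fetch the pages
    bin/nutch updatedb crawl/crawldb $s                # update the CrawlDB
    bin/nutch invertlinks crawl/linkdb crawl/segments/*
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
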
Reference

- http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
- http://lucene.apache.org/hadoop/ -- Hadoop homepage
- http://wiki.apache.org/lucene-hadoop/ -- Hadoop wiki
- http://wiki.apache.org/nutchdata/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
- http://wiki.apache.org/nutchdata/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
- http://www.mail-archive.com/[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together

Excursion: MapReduce

- Problem
  - Find the number of occurrences of "cat" in a file
  - What if the file is 20GB large?
  - Why not do it with more computers?
- Solution
  [Diagram: the file is cut into Split 1 and Split 2; PC1 counts 200 occurrences in its split and PC2 counts 300 in its split; PC1 then adds the partial counts to get the final answer, 500.]

Excursion: MapReduce

- Problem
  - Find the number of occurrences of both "cat" and "dog" in a very large file
- Solution
  [Diagram: Input Files -> Map -> Sort/Group -> Intermediate files -> Reduce -> Output files. Each split is mapped on PC1 and PC2 to per-split counts for "cat" and "dog"; the intermediate files are sorted/grouped by key so that all "cat" counts (200, 300) reach one reducer and all "dog" counts (250, 250) reach another; the reducers sum them to cat: 500 and dog: 500.]

Excursion: MapReduce

- Generalized framework
  [Diagram: a Master coordinates the Workers. The Input Files are cut into Splits 1-4; map workers emit keyed intermediate pairs (k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6); Sort/Group collects the pairs into intermediate files by key (k1:v1,v3; k2:v4,v5; k3:v2; k4:v6); reduce workers process each key group and write Output files 1-3. Pipeline: Input Files -> Map -> Sort/Group -> Intermediate files -> Reduce -> Output files.]