Crawling - Sapienza
Crawling
Ida Mele
Nutch
• Apache Nutch is an open source Java
implementation of a search engine
• We can use Nutch for crawling a portion of the Web
• Useful links:
• http://nutch.apache.org/
• http://wiki.apache.org/nutch/
• http://wiki.apache.org/nutch/NutchTutorial
Nutch: advantages
• Understanding
• We have the source code and we can use it to see
how a large search engine works
• Nutch has been built using ideas from academia
and industry, and it is very useful for researchers
who want to try out new search algorithms
Nutch: advantages
• Transparency
• The details of the ranking algorithms used by
commercial search engines are secret, and there are
often economic reasons behind the ranked list of
results
• The Nutch implementation is transparent: we know
how the ranking algorithms work, and we can trust
the fairness of the final rankings
Nutch: advantages
• Extensibility
• Nutch is a platform for adding search to
heterogeneous collections of information
• It allows us to customize the search interface
• We can extend the out-of-the-box functionality
through the plugin mechanism
Nutch vs. Lucene
• Nutch is built on top of Lucene
• Apache Lucene is a Java library for text indexing and
searching
• It offers high-performance, full-featured text
search
• It provides support for any application that requires
full-text search
• It is used just for indexing and not for crawling
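To give a concrete idea of what Nutch delegates to Lucene, here is a minimal indexing sketch written against a Lucene 3.0-era API (constructors and field options differ across Lucene versions, and the index directory and field names are just examples):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneIndexingSketch {
  public static void main(String[] args) throws Exception {
    // Create a new on-disk index in the directory "myindex"
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("myindex")),
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    // Index one document: a stored URL field and an analyzed content field
    Document doc = new Document();
    doc.add(new Field("url", "http://www.dis.uniroma1.it/",
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", "text extracted from the page ...",
                      Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();
  }
}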
Architecture
• Nutch can be divided into two pieces:
• the crawler, which fetches pages and turns them into
an inverted index
• the searcher, which answers users' search queries
• The index is the interface between the crawler and
the searcher
• The crawler and searcher systems can be on separate
hardware platforms
Architecture
• Crawler and searcher systems can be scaled
independently
• For example, if we have a highly trafficked
search page that provides searching for a
relatively modest set of sites, we may use a
modest crawler infrastructure, and invest
more substantial resources for supporting the
searcher
Crawler system
• The crawler system is driven by the Nutch tool called
crawl, and by other related tools to build and
maintain the data structures
• Data structures are:
• the web database
• a set of segments
• the index
WebDB
• The web database (WebDB) is a data structure for
mirroring the structure and properties of the web
graph being crawled
• It stores two types of entities:
• Page: It is indexed by its URL and by the MD5 hash of
its contents. It also stores the number of outlinks,
fetch information, and the score of the page
• Link: It represents the connection between the
source page and the target page
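Purely as an illustration (these are not the real Nutch classes), the information kept for the two entity types can be pictured as records like these:

// Illustration only: a sketch of what the WebDB stores per entity,
// not the actual Nutch data structures
class PageEntry {
  String url;          // key: the URL of the page
  byte[] contentMd5;   // key: MD5 hash of the page contents
  int numOutlinks;     // number of outlinks found in the page
  long nextFetchTime;  // fetch information: when to fetch the page again
  float score;         // score of the page
}

class LinkEntry {
  byte[] sourceMd5;    // the source page, identified by its content hash
  String targetUrl;    // the URL of the target page
}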
Segment
• The segment is a collection of pages that are fetched
and indexed by the crawler in a run
• The fetchlist is a list of URLs to fetch, and it is
generated from the WebDB
• The fetcher output is the data retrieved from the
pages in the fetchlist
• Each segment has a lifespan (30 days is the default re-fetch interval)
Index
• Inverted index of all pages retrieved by the system
• The index is created by merging all of the individual
segment indexes
• Nutch uses Lucene to build the index. Note that in
Lucene there is the concept of segment, but it is different
from the segment in Nutch:
• In Lucene, the index segment is a portion of the index
• In Nutch, the segment is a fetched and indexed
portion of the WebDB
Crawling
• Nutch can operate at one of these three different
scales:
• Local filesystem
• Intranet
• Web
• The three scales have different characteristics. For
example, crawling the local filesystem is more reliable
than crawling at the other two scales
Crawling
• For crawling billions of pages from the web, we must:
• define the seed set (i.e., the set of pages we want to
start with)
• decide how many crawlers we use and how to partition
the work among them
• decide how often we want to do the re-crawling
• cope with broken links, unresponsive sites, and
unintelligible or duplicate content
Crawling
• The crawling process is basically a cycle made of
three steps:
1. the crawler generates a set of fetchlists from the
WebDB (generate)
2. a set of fetchers downloads the content from the
Web (fetch)
3. the crawler updates the WebDB with new links
that were found (update)
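As a sketch, with the 1.x command-line tools (where the WebDB is called crawldb) one iteration of the cycle looks roughly like this; the directory names are only examples:

bin/nutch generate crawl/crawldb crawl/segments   # 1. generate a fetchlist in a new segment
s1=`ls -d crawl/segments/2* | tail -1`            # pick the newly created segment
bin/nutch fetch $s1                               # 2. fetch the content
bin/nutch updatedb crawl/crawldb $s1              # 3. update the crawldb with the new links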
Crawling
• Nutch observes:
• Politeness: URLs with the same host are always
assigned to the same fetchlist, so that a web site is
not overloaded with requests from multiple
fetchers in rapid succession
• Robots Exclusion Protocol: It allows site owners to
control which parts of their site may be crawled
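For example, a site owner can publish a robots.txt file like the following (the paths are hypothetical), and Nutch honors it:

User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: BadBot
Disallow: /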
Crawling: low-level tools
• Crawling is done by the crawl tool of Nutch, which is a
front-end to lower-level tools
• The crawl tool can be used to get started with
crawling websites, but then we need to use the
lower-level tools to perform re-crawls and other
maintenance on the data structures built during the
initial crawl
Crawling: low-level tools
• We can use the lower-level tools in sequence:
1. Create a new WebDB (admin db-create)
2. Inject root URLs into the WebDB (inject)
3. Generate a fetchlist from the WebDB in a new
segment (generate)
4. Fetch content from URLs in the fetchlist (fetch)
5. Update the WebDB with links from fetched pages
(updatedb)
6. Repeat steps 3-5 until the required depth is reached
Crawling: low-level tools
7. Update segments with scores and links from the
WebDB (updatesegs)
8. Index the fetched pages (index)
9. Eliminate duplicate content, and duplicate URLs,
from the indexes (dedup)
10. Merge the indexes into a single index for
searching (merge)
Crawling: low-level tools
• We create a new WebDB (step 1), and we populate it
with some seed URLs (step 2)
• Then we use the generate/fetch/update cycle (steps 3-6)
• After the cycle, the crawler creates an index (steps 7-10).
In particular,
• each segment is indexed independently (step 8)
• the duplicate pages are removed (step 9)
• the individual indexes are combined into a single
index (step 10)
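As a rough sketch, with the 1.x tool names the indexing steps 7-10 look something like the commands below (in 1.x the link information is inverted into a linkdb; exact tool names and arguments vary between Nutch releases):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments                      # links/anchors from the segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*   # step 8: index the fetched pages
bin/nutch dedup crawl/indexes                                               # step 9: remove duplicates
bin/nutch merge crawl/index crawl/indexes                                   # step 10: merge into a single index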
Running a crawl with Nutch
• Download and unpack a Nutch distribution (for
example, apache-nutch-1.1-bin.zip)
• Make sure that the environment variable
NUTCH_JAVA_HOME or JAVA_HOME is set to the
Java home path
• Run the following command or add it to the .bashrc
file:
export NUTCH_JAVA_HOME=/path/to/java
Nutch configuration
• All of Nutch's configuration files are in the conf
subdirectory
• The main configuration file is conf/nutch-default.xml.
It contains the default settings and should not be modified
• To change a setting we can create or update the
conf/nutch-site.xml file
Nutch configuration
• Add your agent name in the value field of the
http.agent.name property of the file conf/nutch-site.xml.
For example, we can use the name Sapienza University:
<property>
  <name>http.agent.name</name>
  <value>Sapienza University</value>
  <description>
    HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
  </description>
</property>
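The property sits inside the <configuration> element of conf/nutch-site.xml, so a minimal file looks like this:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Sapienza University</value>
  </property>
</configuration>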
Url filter
• The crawl tool uses a filter to decide which URLs can
go into the WebDB (steps 2 and 5)
• This can be used to restrict the crawl to URLs that
match given patterns, specified as regular expressions
• For example, if we want to restrict the crawl to the
DIS domain, we have to update the configuration file
conf/crawl-urlfilter.txt
Url filter
• Open the file conf/crawl-urlfilter.txt and replace the
line:
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
with:
+^http://([a-z0-9]*\.)*dis.uniroma1.it/
• The file conf/crawl-urlfilter.txt will contain:
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*dis.uniroma1.it/
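In this file, lines starting with + accept matching URLs, lines starting with - reject them, and the first matching pattern wins. The stock file also ships with reject rules roughly like the following (the exact patterns depend on the Nutch release); they should be kept so that non-HTTP URLs, images, and probable query URLs are skipped:

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we cannot parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|exe)$
# skip URLs containing certain characters as probable queries
-[?*!@=]
# skip everything else
-.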
Example
• Create a file called urls that contains the root
URLs.
• These URLs will be used to populate the initial
fetchlist.
• For example, if we want to start from the
home page of the department, we will use:
echo 'http://www.dis.uniroma1.it' > urls
Example
• We run the crawler with:
bin/nutch crawl urls -dir mycrawl -depth 5 > mycrawl.log
where:
• urls is the name of the file with the seed URLs
• mycrawl is the name of the directory
• 5 is the depth of the crawling
• mycrawl.log is the name of the log file
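The crawl tool also accepts a -threads option (number of fetcher threads) and a -topN option (maximum number of pages fetched at each level), so a more constrained crawl can be launched, for example, as:

bin/nutch crawl urls -dir mycrawl -depth 5 -topN 1000 -threads 10 > mycrawl.log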
Results of the crawl
• The directory mycrawl contains the following
subdirectories:
• crawldb
• linkdb
• segments
• index
• indexes
Results of the crawl: readdb
• The readdb tool parses the WebDB and displays
portions of it in human-readable form
• The stats option displays the number of pages
and links:
bin/nutch readdb mycrawl/crawldb -stats >stats.txt
Then, we can use:
more stats.txt
Results of the crawl: readdb
• The dump option gives the dump of the pages. Each page
appears in a separate block, with one field per line. The
ID field is the MD5 hash of the page contents. There is
also information about when the pages should be next
fetched (which defaults to 30 days), and the page scores
• We issue the command:
bin/nutch readdb mycrawl/crawldb -dump mydump
then we use:
more mydump/part-00000
Results of the crawl: readdb
• The readdb tool also supports extraction of an
individual page or link by URL or MD5 hash
• For example, to examine the info of the page
http://cclii.dis.uniroma1.it/airo/index.php
we use the -url option by issuing the command:
bin/nutch readdb mycrawl/crawldb -url
http://cclii.dis.uniroma1.it/airo/index.php
Results of the crawl: readlinkdb
• The readlinkdb tool can be used to create the dump
of the link structure (the graph) by using the option
dump:
bin/nutch readlinkdb mycrawl/linkdb/ -dump
mylinks
• We can read the in-links by using:
more mylinks/part-00000
Note that it gives us just the list of the in-links. For
the out-links we have to merge the segments and
read the result
Results of the crawl: readseg
• The crawl creates a few segments in
timestamped subdirectories, one for each
generate/fetch/update cycle
• The readseg tool is the segment reader
• The option list gives a summary of all of the
generated segments:
bin/nutch readseg -list -dir mycrawl/segments/
Results of the crawl: readseg
• The option dump gives a dump of a given segment:
bin/nutch readseg -dump mycrawl/segments/YYYYMMDDhhmmss/ dump_seg1
where YYYYMMDDhhmmss is the name of the segment,
given by the date and time at which the segment was
created
• Then we can use:
more dump_seg1/dump
Results of the crawl: mergesegs
• We have seen that the readlinkdb tool can be used to obtain the
list of in-links
• To obtain the out-links we need to merge the segments and read
the result
• We use the mergesegs tool:
bin/nutch mergesegs whole-segments -dir mycrawl/segments/*
• Then we can use the dump option of the readseg tool on the
result of the merge:
bin/nutch readseg -dump whole-segments/YYYYMMDDhhmmss/ dump-outlinks
Exercise
• We want to create the webgraph of a portion of the Web
• First of all, install and configure Nutch
• For the crawling:
• Create the file with the seed set (example urls)
• Update the conf/crawl-urlfilter.txt file
• Decide the depth of the crawling and crawl a portion of
the web using the crawl tool. For example, for depth 5 we
issue:
bin/nutch crawl urls -dir mycrawl -depth 5 >
mycrawl.log
Exercise
• Once the crawling is completed, you can create the webgraph
• Download the directory with libraries lib.zip
available at: http://www.dis.uniroma1.it/~mele/WebIR.html
• Download the file set-classpath.sh
available at: http://www.dis.uniroma1.it/~mele/WebIR.html
• Update the file set-classpath.sh with the path to your lib
directory
• Put the set-classpath.sh file in the Nutch home, open the
terminal, and set the classpath with:
• source set-classpath.sh
Exercise
• Create the file with in-links using the following
commands:
• bin/nutch readlinkdb mycrawl/linkdb/ -dump
mylinks
• egrep -v $'^$' mylinks/part-00000 >inlinks.txt
Exercise
• Create the file with the out-links
1) Merge the segments:
• bin/nutch mergesegs whole-segments -dir
mycrawl/segments/*
2) Use readseg to read the segments, and then create
the file with out-links:
• bin/nutch readseg -dump whole-segments/YYYYMMDDhhmmss/ dump-outlinks
• cat dump-outlinks/dump | egrep 'URL|toUrl'
>outlinks.txt
Exercise
• Print the in-links and out-links in the links.txt file by
issuing the following commands:
• java nutchGraph.PrintInlinks inlinks.txt >links.txt
• java nutchGraph.PrintOutlinks outlinks.txt
>>links.txt
• Remove the duplicates:
• LANG=C sort links.txt | uniq > cleaned-links.txt
Exercise
• Create the map of urls with the following commands:
• cut -f1 links.txt >url-list.txt
• cut -f2 links.txt >>url-list.txt
• LANG=C sort url-list.txt | uniq > sorted-url-list.txt
• java -Xmx2G it.unimi.dsi.util.FrontCodedStringList -u
-r 32 umap.fcl < sorted-url-list.txt
• java -Xmx2G it.unimi.dsi.sux4j.mph.MWHCFunction
umap.mph sorted-url-list.txt
Exercise
• Create the graph:
• java -Xmx2G nutchGraph.PrintEdges cleaned-links.txt
umap.mph > webgraph.dat
• numNodes=$(wc -l < sorted-url-list.txt)
• java -Xmx2G nutchGraph.IncidenceList2Webgraph
$numNodes webgraph
• java -Xmx2G it.unimi.dsi.webgraph.BVGraph -g
ASCIIGraph webgraph webgraph
Indexing
• Once the crawling operation is completed, we have the
graph and the indexed pages
• Remember that Nutch uses Lucene for the indexing
phase
• If we want to use MG4J for building the inverted index,
we can collect the pages fetched during the crawling by
using: wget -i sorted-url-list.txt
• Then we can use MG4J for indexing and querying the
resulting collection of web pages
[Diagram: overview of the full pipeline, with the components WEB, Nutch, readdb, link structure db, graph.txt, getfiles, files, ASCIIGraph, BVGraph, MG4J, ParserDB, QueryMG4J, Query, RankPR, and PageRank]
Homework
• Repeat the exercise using a different seed set
and/or depth. Create the corresponding
webgraph. Compute the PageRank for the
nodes of the webgraph. Plot the distribution
of the PageRank values
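For the PageRank computation you can use any implementation; as a self-contained sketch, the power-iteration program below computes PageRank directly from a plain edge list, assuming one "source target" pair of integer node ids per line (check that this matches the actual format of your webgraph.dat before relying on it):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative power-iteration PageRank over a plain edge list;
// not part of Nutch or the WebGraph framework
public class SimplePageRank {
  public static void main(String[] args) throws Exception {
    String edgeFile = args[0];          // e.g. webgraph.dat (assumed "src dst" per line)
    int n = Integer.parseInt(args[1]);  // number of nodes, e.g. $numNodes
    double d = 0.85;                    // damping factor
    int iterations = 50;

    List<int[]> edges = new ArrayList<int[]>();
    int[] outDegree = new int[n];
    BufferedReader in = new BufferedReader(new FileReader(edgeFile));
    String line;
    while ((line = in.readLine()) != null) {
      if (line.trim().isEmpty()) continue;
      String[] t = line.trim().split("\\s+");
      int src = Integer.parseInt(t[0]), dst = Integer.parseInt(t[1]);
      edges.add(new int[] { src, dst });
      outDegree[src]++;
    }
    in.close();

    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n);
    for (int it = 0; it < iterations; it++) {
      double[] next = new double[n];
      double dangling = 0;
      for (int i = 0; i < n; i++)
        if (outDegree[i] == 0) dangling += rank[i];    // nodes with no outlinks
      for (int[] e : edges)
        next[e[1]] += d * rank[e[0]] / outDegree[e[0]];
      double base = (1 - d) / n + d * dangling / n;    // teleport + dangling mass
      for (int i = 0; i < n; i++) next[i] += base;
      rank = next;
    }
    for (int i = 0; i < n; i++)
      System.out.println(i + "\t" + rank[i]);          // plot the distribution of these values
  }
}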