Presentation of
"The Anatomy of a Large-Scale Hypertextual Web Search Engine"
by Sergey Brin and Lawrence Page (1998)
Presenter: Scott White
Presentation Overview
• Problem
• Design Goals
• Google Search Engine Features
• Google Architecture
• Scalability
• Conclusions
Problem
• Web is vast and growing exponentially
• Web is heterogeneous:
– ASCII
– HTML
– Images
– Java applets
– Etc.
• Human-maintained lists can’t keep up
• Previous search methodologies relied on keyword matching, producing low-quality matches
• Human attention is confined to ~10-1000 documents
⇒ It is getting harder for users to locate the documents they need
Solution = Google
“Our main goal is to improve the quality of
web search engines”
• Google comes from “googol” = 10^100
• Originally part of the Stanford digital
library project known as WebBase,
commercialized in 1999
Specific Design Goals
• Deliver results that have very high precision, even at the expense of recall
• Make search engine technology transparent, i.e. advertising shouldn’t bias results
• Bring search engine technology into the academic realm in order to support novel research activities on large web data sets
• Make the system easy to use for most people, e.g. users shouldn’t have to specify more than a couple of words
Google Search Engine Features
Two main features to increase result precision:
• Uses link structure of web (PageRank)
• Uses the text surrounding hyperlinks (anchor text) to improve retrieval accuracy
Other features include:
• Takes into account word proximity in documents
• Uses font size, word position, etc. to weight words
• Stores full raw HTML pages
PageRank For Dummies
Intuition:
• Imagine a web surfer doing a simple random walk on
the entire web for an infinite number of steps.
• Occasionally, the surfer will get bored and instead of
following a link pointing outward from the current page
will jump to another random page.
• At some point, the percentage of time spent at each
page will converge to a fixed value.
• This value is known as the PageRank of the page.
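
To make this intuition concrete, here is a minimal Python sketch (not from the paper) that simulates a bored random surfer on a tiny made-up graph; the fraction of visits to each page approximates its PageRank. The graph, the boredom probability d, and the step count are illustrative assumptions.

import random

# Toy web graph: page -> list of outgoing links (made up for illustration).
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def simulate_surfer(links, d=0.15, steps=100_000):
    """Random walk that, with probability d, gets bored and jumps to a random page.
    Returns the fraction of time spent at each page, approximating its PageRank."""
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if random.random() < d or not links[current]:
            current = random.choice(pages)            # bored (or dead end): jump anywhere
        else:
            current = random.choice(links[current])   # follow a random outgoing link
    return {p: visits[p] / steps for p in pages}

print(simulate_surfer(LINKS))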
PageRank For Techies
N(p): # outgoing links from page p
B(p): set of pages that point to p
d: tendency to get “bored”, 0 ≤ d ≤ 1
R(p): PageRank of p
R(p) = [(1 − d) · Σ_{q ∈ B(p)} R(q)/N(q)] + d
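
The same values can be computed directly by iterating the formula above until the ranks stop changing (power iteration). A minimal sketch, assuming the slide’s convention in which d is the boredom probability; the example graph and convergence tolerance are arbitrary.

def pagerank(links, d=0.15, tol=1e-9):
    """Iterate R(p) = (1 - d) * sum over q in B(p) of R(q)/N(q) + d until convergence.
    links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    backlinks = {p: [q for q in pages if p in links[q]] for p in pages}   # B(p)
    rank = dict.fromkeys(pages, 1.0)
    while True:
        new_rank = {p: d + (1 - d) * sum(rank[q] / len(links[q]) for q in backlinks[p])
                    for p in pages}
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}))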
Why do we need d?
• In the real world, virtually all web graphs are not fully connected, i.e. they have dead ends, islands, etc.
• Without d, rank “leaks” out of such disconnected graphs, which leads to numerical instability
Justifications for using PageRank
• Attempts to model user behavior
• Captures the notion that the more a page
is pointed to by “important” pages, the
more it is worth looking at
• Takes into account global structure of web
Google Architecture
Implemented in C and C++ on Solaris and Linux
Preliminary
A “hitlist” is defined as the list of occurrences of a particular word in a particular document, including additional meta info:
- position of word in doc
- font size
- capitalization
- descriptor type, e.g. title, anchor, etc.
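
As a rough illustration of what one hit might carry, here is a sketch using a Python dataclass; the field names and example values are assumptions drawn from the list above, not the paper’s actual bit-level encoding.

from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in a document, plus the meta info listed above."""
    position: int        # position of the word in the document
    font_size: int       # relative font size
    capitalized: bool    # capitalization flag
    kind: str            # descriptor type, e.g. "title", "anchor", "plain"

# A hitlist is then just a list of Hit records for one (word, document) pair:
hitlist = [Hit(position=3, font_size=2, capitalized=True, kind="title"),
           Hit(position=57, font_size=1, capitalized=False, kind="plain")]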
Google Architecture (cont.)
Key components (from the architecture diagram):
• Crawlers: multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and ~300 connections open at once.
• URL Server: keeps track of URLs that have been and still need to be crawled.
• Store Server: compresses and stores web pages.
• Anchors file / URL Resolver: stores each link and the text surrounding it; converts relative URLs into absolute URLs.
• Indexer: uncompresses and parses documents; stores link information in the anchors file.
• Repository: contains the full HTML of every web page; each document is prefixed by docID, length, and URL.
Google Architecture (cont.)
• URL Resolver: maps absolute URLs into docIDs stored in the Doc Index; stores anchor text in “barrels”; generates the database of links (pairs of docIDs).
• Indexer: parses documents and distributes hitlists into “barrels”.
• Barrels: partially sorted forward indexes, sorted by docID; each barrel stores hitlists for a given range of wordIDs.
• Lexicon: in-memory hash table that maps words to wordIDs; contains a pointer to the doclist in the barrel that each wordID falls into.
• Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.
• Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, a checksum, statistics, status, etc. It also contains URL info if the doc has been crawled; if not, it contains just the URL.
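
To show how the Sorter’s job fits with the structures above, here is a toy sketch that inverts a forward index (keyed by docID) into an inverted index (keyed by wordID); the in-memory dicts stand in for the on-disk barrels and are illustrative only.

# Forward index: docID -> {wordID -> hitlist}  (toy data, not the real barrel format)
forward_index = {
    101: {0: ["title@3"], 2: ["plain@17"]},
    202: {2: ["title@1", "anchor@9"]},
}

def invert(forward_index):
    """Build wordID -> list of (docID, hitlist), with doclists ordered by docID."""
    inverted = {}
    for doc_id in sorted(forward_index):
        for word_id, hitlist in forward_index[doc_id].items():
            inverted.setdefault(word_id, []).append((doc_id, hitlist))
    return inverted

print(invert(forward_index)[2])   # doclist for wordID 2: [(101, ...), (202, ...)]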
Google Architecture (cont.)
• Barrels: two kinds. Short barrels contain hitlists that include title or anchor hits; long (full) barrels hold all hitlists.
• Searcher: uses the new lexicon keyed by wordID, the inverted doc index keyed by docID, and the PageRanks to answer queries.
• DumpLexicon: the list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create the new lexicon used by the Searcher. The lexicon stores ~14 million words.
Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
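
A simplified Python sketch of this loop follows. The data layout (dicts for barrels and doclists) and the rank() placeholder are illustrative assumptions; the real system streams doclists from disk rather than intersecting sets in memory.

def rank(doc_id, hitlists):
    """Placeholder ranking function: the real one combines the IR score with
    PageRank (see the ranking slides that follow)."""
    return sum(len(h) for h in hitlists)

def evaluate(query, lexicon, short_barrels, full_barrels, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    word_ids = [lexicon[w] for w in query.lower().split() if w in lexicon]
    if not word_ids:
        return []
    results, seen = [], set()
    # Steps 3 and 6: search the short barrels first, then fall back to the full barrels.
    for barrels in (short_barrels, full_barrels):
        # Each barrel is assumed to map wordID -> {docID -> hitlist}.
        doclists = [barrels.get(wid, {}) for wid in word_ids]
        # Step 4: find documents that match all the search terms.
        matching = set(doclists[0])
        for dl in doclists[1:]:
            matching &= set(dl)
        # Step 5: compute the rank of each matching document for the query.
        for doc_id in matching - seen:
            results.append((rank(doc_id, [dl[doc_id] for dl in doclists]), doc_id))
        seen |= matching
    # Step 8: sort the matched documents by rank and return the top k.
    results.sort(reverse=True)
    return [doc_id for _, doc_id in results[:k]]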
Single Word Query Ranking
• The hitlist is retrieved for the single word
• Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
• Each hit type is assigned its own weight
• The type-weights make up a vector of weights
• The number of hits of each type is counted to form a count vector
• The dot product of the two vectors is used to compute the IR score
• The IR score is combined with PageRank to compute the final rank
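
A minimal sketch of this scoring scheme, assuming made-up type-weights, a made-up cap on counts, and an assumed way of combining the IR score with PageRank (the paper does not publish the actual numbers):

# Assumed type-weights; the real values are not given in the paper.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "plain": 1.0}

def ir_score(hits, cap=40):
    """hits: list of hit-type strings for the query word in one document.
    Counts are capped so that a huge number of one hit type cannot dominate."""
    counts = dict.fromkeys(TYPE_WEIGHTS, 0)
    for hit_type in hits:
        counts[hit_type] += 1
    # Dot product of the type-weight vector and the (capped) count vector.
    return sum(TYPE_WEIGHTS[t] * min(counts[t], cap) for t in TYPE_WEIGHTS)

def final_rank(hits, pagerank, alpha=0.5):
    # Assumed linear combination; the paper only says the two scores are combined.
    return alpha * ir_score(hits) + (1 - alpha) * pagerank

print(final_rank(["title", "plain", "plain", "anchor"], pagerank=0.8))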
Multi-word Query Ranking
• Similar to single-word ranking, except that word proximity must now be analyzed
• Hits occurring closer together are weighted higher
• Each proximity relation is classified into 1 of 10 values, ranging from a phrase match to “not even close”
• Counts are computed for every type of hit and proximity
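
A toy sketch of the proximity idea: classify the distance between two hits into one of 10 bins and weight nearer pairs higher. The bin boundaries and weights here are invented for illustration.

# Assumed weights for the 10 proximity bins (bin 0 = phrase match, bin 9 = "not even close").
PROX_WEIGHTS = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5]

def proximity_bin(pos_a, pos_b):
    """Map the distance between two word positions to a bin in 0..9."""
    return max(0, min(abs(pos_a - pos_b) - 1, 9))

def proximity_score(positions_a, positions_b):
    """Sum proximity weights over all pairs of hits for the two query words."""
    return sum(PROX_WEIGHTS[proximity_bin(a, b)]
               for a in positions_a for b in positions_b)

print(proximity_score([3, 40], [4, 90]))   # the adjacent pair (3, 4) scores highest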
Scalability
• Cluster architecture combined with
Moore’s Law make for high scalability. At
time of writing:
– ~ 24 million documents indexed in one week
– ~518 million hyperlinks indexed
– Four crawlers collected 100 documents/sec
Summary of Key Optimization
Techniques
– Each crawler maintains its own DNS lookup cache
– Use flex to generate lexical analyzer with own stack for parsing
documents
– Parallelization of indexing phase
– In-memory lexicon
– Compression of repository
– Compact encoding of hitlists accounting for major space savings
– Indexer is optimized so it is just faster than the crawler so that
crawling is the bottleneck
– Document index is updated in bulk
– Critical data structures placed on local disk
– Overall architecture designed to avoid disk seeks wherever possible
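
As an example of the compact hitlist encoding, here is a sketch that packs one plain hit into two bytes (a capitalization bit, a few font-size bits, and the word position). The field widths follow the paper’s description of plain hits, but treat the details as illustrative.

def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: 1 bit capitalization, 3 bits font size, 12 bits position."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int):
    """Inverse of pack_plain_hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

packed = pack_plain_hit(True, 2, 57)
print(hex(packed), unpack_plain_hit(packed))   # two bytes instead of a full record per hit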
Storage Requirements
At the time of publication, the paper reported a detailed statistical breakdown of Google’s storage requirements.
Conclusion
• The writing is not very clear
– Weak presentation of PageRank model.
– Sentences are often too long and dense.
– Poor presentation structure
• No formal user evaluation of search result quality
• Still, this is one of the seminal papers in IR.
– Highly cited
– PageRank link analysis algorithm still one of the best algorithms
available
– Today’s Google architecture still very similar to the one described in this paper
– Success of Google based, in large part, on ideas discussed in this paper