
Google Scale Data Management
Google (..a course on that??)
(partially based on slides by Qing Li)
Google (what?..couldn’t they come up with a better name?)

..doesn’t mean anything, right?
...or does it mean “search” in French?...
Surprising, but “Google” actually means something.
• It is the name of a “number” (a misspelling of “googol”)…
• A big one: 1 followed by 100 zeros
..this reflects the company's mission....and the reality of the world:
• …there are a whole lot of “data” out there…
• …and whoever helps me manage it effectively deserves to be rich!
..in fact there is more to Google than search...
Question #1: How can it know so much?...because it crawls…

[Diagram: Crawling pulls Web data into the search engine; Indexing builds an index mapping keywords (K1, K2, …) to documents (e.g., K1 → d1, d2; K2 → d1, d3); Searching answers the user’s query with a result.]
Indexes: quick lookup tables
(like the index pages in a book)
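The crawl → index → search loop in the diagram can be sketched over a toy in-memory “web” (a real crawler fetches pages over HTTP and respects robots.txt; all page names and contents below are invented for illustration):

```python
from collections import deque

# A toy, in-memory "Web": page id -> (text, outgoing links).
WEB = {
    "p1": ("arizona state university", ["p2", "p3"]),
    "p2": ("university of arizona", ["p1"]),
    "p3": ("search engines crawl the web", ["p1", "p2"]),
}

def crawl(seed):
    """Breadth-first crawl starting from `seed`, building an index
    that maps each keyword to the set of pages containing it."""
    index = {}
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        page = frontier.popleft()
        text, links = WEB[page]
        for word in text.split():          # the "indexing" step
            index.setdefault(word, set()).add(page)
        for link in links:                 # follow hyperlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl("p1")
print(sorted(index["arizona"]))  # pages containing "arizona"
```

Answering a query then becomes a lookup in `index`, which is exactly the quick-lookup-table role described above.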
Question #2: How can it be so fast?

[Diagram: a whole lot of users on one side, a whole lot of indexed data on the other, and one search engine in between.]
Question #2: How can it be so fast?

• use a lot of computers…but, intelligently! (parallelism)

[Diagram, steps 1-6: a query arrives at the front-end web server, which consults the index servers and then the content servers (a database holding copies of all web pages!) before returning the result.]
Question #2: How can it be so fast?

• use a lot of computers…but, intelligently! (parallelism)
• …organize data for fast access (indexing)

[Same diagram: front-end web server, index servers, content servers with copies of all web pages.]
Question #2: How can it be so fast?

• use a lot of computers…but, intelligently! (parallelism)
• …organize data for fast access (indexing)
• …most people want to know the same thing anyhow (caching)

[Diagram: a CACHE (copies of recent results) sits next to the front-end web server, so repeated queries skip the index servers and content servers.]
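The caching idea fits in a few lines. A minimal sketch (real engines bound the cache size and expire stale entries; all names below are illustrative):

```python
# Answer repeated queries from a dict instead of re-searching.
cache = {}          # query -> cached result
hits = misses = 0

def search_backend(query):
    """Stand-in for the expensive index/content-server lookup."""
    return f"results for {query!r}"

def search(query):
    global hits, misses
    if query in cache:              # a recent, already-cached result
        hits += 1
        return cache[query]
    misses += 1
    result = search_backend(query)
    cache[query] = result           # remember it for the next user
    return result

search("ASU"); search("ASU"); search("CSE")
print(hits, misses)   # most people ask the same thing -> cheap hits
```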
Question 3: How can it serve the most relevant page?

• Text indexing/keyword search: analyze the content of a page to determine whether it is relevant to the query or not
• Link analysis: analyze the (incoming and outgoing) links a page has to determine whether the page is “worthy” or not
Text Indexing

Web → Crawling → Raw text → Text analysis → Index
Text Analysis

• Query: a set of keywords
• Page/document: also a set of keywords (where some keywords are more important than others)

….so, to create an index, we need to
• extract index terms
• compute their weights
Term Extraction

Extraction of index terms:
• Morphological Analysis (stemming in English)
  • “information”, “informed”, “informs”, “informative” → inform
• Removal of stop words
  • “a”, “an”, “the”, “is”, “are”, “am”, … (they occur too often!!!!!)
• Compound word identification
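The first two extraction steps can be sketched as follows. The suffix-stripper below is a crude stand-in for a real stemmer such as Porter’s, and only handles the suffixes shown on the slide:

```python
# Toy term extraction: stop-word removal plus crude suffix stripping.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}
SUFFIXES = ["ations", "ation", "ative", "ed", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Lowercase, drop stop words, stem what remains."""
    words = text.lower().split()
    return [stem(w) for w in words if w not in STOP_WORDS]

print(extract_terms("the information is informative"))
```

All four slide variants (“information”, “informed”, “informs”, “informative”) reduce to the stem “inform” under this rule set.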
Stop words

Terms that occur too frequently in a language are not good discriminators for search.

[Plot: term frequency vs. rank of the term; the stop words occupy the high-frequency end of the curve.]
What are the weights????

They need to capture how good the term is in describing the content of the page.

[Figure: a page in which t1 occurs more often than t2; t1 is better than t2 at describing the page.]

tf = n / K    (term frequency: n = occurrences of the term in the page, K = total number of terms in the page)
What are the weights????

They need to capture how discriminating the term is (remember the stop words??).

[Figure: t1 occurs in many documents, t2 in only a few; t2 is better than t1 at discriminating.]

idf = log(N / m)    (inverse document frequency: N = total number of documents, m = number of documents containing the term)
What are the weights????

They need to capture how
• good the term is in describing the content of the page
• discriminating the term is..

tfidf = (n / K) · log(N / m)
An Example

[Two example documents, Document 1 and Document 2, are shown on the slide.]

TF for “Arizona”
• In Doc 1 is 1
• In Doc 2 is 2

IDF for “Arizona”
• In this collection (Doc 1 & Doc 2), IDF = ½ (the slide uses the simplified IDF = 1/m, without the log)

TW (Arizona, Doc1) = 1 · ½ = 0.5
TW (Arizona, Doc2) = 2 · ½ = 1.0
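The example can be reproduced in code. Following the slide’s numbers, IDF is taken as the simplified 1/m rather than log(N/m), and TF is the raw count; the document contents below (other than the “Arizona” counts) are invented:

```python
# Reproduce the slide's TW (term weight) example for "Arizona".
docs = {
    "Doc1": ["arizona", "state"],                  # "Arizona" once
    "Doc2": ["arizona", "university", "arizona"],  # "Arizona" twice
}

def tf(term, doc):
    """Raw term frequency: occurrences of `term` in `doc`."""
    return docs[doc].count(term)

def idf(term):
    """Simplified inverse document frequency: 1/m (no log), as on the slide."""
    m = sum(1 for words in docs.values() if term in words)
    return 1 / m

def tw(term, doc):
    """Term weight = TF * IDF."""
    return tf(term, doc) * idf(term)

print(tw("arizona", "Doc1"), tw("arizona", "Doc2"))  # 0.5 1.0
```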
“Inverted” index
(requires “fast” string search!! hashes, search trees)

[Diagram: a Directory maps each term (…, ASU, Google, search, tiger, …) to a position in a Posting file; each posting entry lists the ids of the documents containing that term (e.g., Doc #1, Doc #2, Doc #5, …). A query such as “ASU” is looked up in the Directory and resolved to its posting list.]
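A minimal sketch of building and querying such an inverted index, using a hash table (a Python dict) as the directory for the fast string search; the documents below are invented:

```python
# Build an inverted index: term -> posting list of document ids.
docs = {
    1: "ASU is in Arizona",
    2: "Google indexes the web",
    5: "search ASU with Google",
}

inverted = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        postings = inverted.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # avoid duplicates
            postings.append(doc_id)

def lookup(term):
    """Resolve a query term to its posting list (document ids)."""
    return inverted.get(term.lower(), [])

print(lookup("ASU"))     # documents containing "ASU"
```

Real engines store the directory in a search tree or hash structure on disk and compress the posting file, but the term-to-postings shape is the same.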
Matching & Ranking

Ranking
• Retrieval Model
  • Boolean (exact) => Fuzzy Set (inexact)
  • Vector Space
  • Probabilistic
  • Inference Net
  • Language Model …

Weighting Schemes
• Index terms, query terms
• Parameters in formulas
Vector model of text

Page with “ASU” with weight <0.5>

[Figure: a 1-D axis labeled ASU; the page sits at 0.5.]
Vectors…what are they???

Page with “ASU” with weight <0.5> and “CSE” <0.7>

[Figure: a 2-D space with axes ASU and CSE; the page sits at (0.5, 0.7).]
Vectors…what are they???

Page with “ASU” with weight <0.5>, “CSE” <0.7>, and “Selcuk” <0.9>

[Figure: a 3-D space with axes ASU, CSE, and Selcuk; the page sits at (0.5, 0.7, 0.9).]
A web database (a vector space!)

[Figure: many pages plotted as points in the ASU/CSE/Selcuk vector space; one point is labeled A.]
“Find 2 most similar pages to A”

[Figure: the same vector space; the two points closest to A are highlighted.]
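Such a query is typically answered with cosine similarity between the term-weight vectors (pages pointing in nearly the same direction score close to 1). A sketch, with all weights invented for illustration:

```python
from math import sqrt

# Pages as term-weight vectors over the axes (ASU, CSE, Selcuk).
pages = {
    "A":  (0.5, 0.7, 0.9),
    "p1": (0.6, 0.6, 0.8),
    "p2": (0.9, 0.1, 0.0),
    "p3": (0.4, 0.8, 0.7),
}

def cosine(u, v):
    """Cosine of the angle between two vectors; similar pages -> near 1."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def most_similar(query, k=2):
    """The k pages whose vectors are closest in direction to `query`."""
    others = [p for p in pages if p != query]
    return sorted(others, key=lambda p: cosine(pages[query], pages[p]),
                  reverse=True)[:k]

print(most_similar("A"))   # the 2 most similar pages to A
```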
How about “important” pages…how do we identify them??

We might be able to learn how important a page is by studying its connectivity (“linkages”).
Hubs and authorities

• Good hubs should point to good authorities
• Good authorities must be pointed to by good hubs.
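This mutual reinforcement can be computed iteratively, in the style of Kleinberg’s HITS algorithm: each page’s authority score is the sum of the hub scores pointing at it, and its hub score is the sum of the authority scores it points to. The tiny link graph below is invented for illustration:

```python
# Iterative hubs-and-authorities sketch (HITS-style).
links = {          # page -> pages it points to
    "h1": ["a1", "a2"],
    "h2": ["a1"],
    "a1": [],
    "a2": [],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):   # a fixed number of rounds is enough here
    # authority(p) = sum of hub scores of pages linking to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub(p) = sum of authority scores of pages p links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm_a = sum(auth.values()); norm_h = sum(hub.values())
    auth = {p: s / norm_a for p, s in auth.items()}   # normalize
    hub = {p: s / norm_h for p, s in hub.items()}

best_auth = max(auth, key=auth.get)   # a1: pointed to by both hubs
best_hub = max(hub, key=hub.get)      # h1: points to both authorities
print(best_auth, best_hub)
```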
PageRank

Basic idea: more links to a page implies a better page
• But, all links are not created equal
• Links from a more important page should count more than links from a weaker page

[Example graph with pages A, B, C and edges C → A, A → B, A → C, B → C:
PR(A) = PR(C)/1
PR(B) = PR(A)/2
PR(C) = PR(A)/2 + PR(B)/1]

Basic PageRank PR(A) for page A:
• outDegree(B) = number of edges leaving page B = hyperlinks on page B
• Page B distributes its rank boost over all the pages it points to

PR(A) = Σ_{(B,A) ∈ G} PR(B) / outDegree(B)
How to compute ranks??

PageRank (PR) definition is recursive
• Rank of a page depends on and influences other pages

Solved iteratively. To compute PageRank:
• Choose an arbitrary initial PR_old and use it to compute PR_new
• Repeat, setting PR_old to PR_new, until PR converges (the difference between old and new PR is sufficiently small)
• Rank values typically converge in 50-100 iterations
• Eventually, ranks will converge
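The iteration can be sketched on the three-page example graph above (basic PageRank only; the published algorithm adds a damping factor, omitted here to match the slide’s formula):

```python
# Power iteration for basic PageRank on: A -> B, A -> C, B -> C, C -> A.
out = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(out)

pr_old = {p: 1.0 / len(pages) for p in pages}   # arbitrary initial PR
for _ in range(100):
    pr_new = {p: 0.0 for p in pages}
    for b, targets in out.items():
        for a in targets:               # B passes PR(B)/outDegree(B) to A
            pr_new[a] += pr_old[b] / len(targets)
    # converged when old and new PR differ by a sufficiently small amount
    if max(abs(pr_new[p] - pr_old[p]) for p in pages) < 1e-9:
        break
    pr_old = pr_new

print({p: round(pr_new[p], 3) for p in pages})
```

On this graph the ranks settle at PR(A) = PR(C) = 0.4 and PR(B) = 0.2, satisfying the three equations on the PageRank slide.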
Open-Source Search Engine Code

• Lucene Search Engine: http://lucene.apache.org/
• SWISH: http://swish-e.org/
• Glimpse: http://webglimpse.net/
• and more
Reference

• L. Page and S. Brin. The PageRank citation ranking: Bringing order to the web. Stanford Digital Library Technologies Project, Working Paper 1999-0120, 1998.
• S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998.