Tools for memory: Document retrieval (Google)


Google & Document Retrieval
Qing Li
School of Computing and Informatics
Arizona State University
Outline
A brief introduction to Google
Architecture of a Web search engine
Key techniques of search engines
• Indexing
• Matching & ranking
Open-source search engines
Google Search Engine
“Google”
• From “googol”: the number 1 followed by 100 zeros
• Reflects the company's mission: organize the immense amount of information available on the web
Information Types
• Text
• Image
• Video
Google Service
[figure]
Google Web Searching
[figure]
Life of a Google Query
[figure]
Web Search System
[figure: Web data → Crawling → Indexing → Index (e.g., K1 → d1, d2; K2 → d1, d3; …); User → Query → Search Engine → Searching → Information]
Conventional Overview of Text Retrieval
[figure: text processing turns raw text, via text analysis, into the search engine's index; user/system interaction turns info needs, via analysis of info needs, into a query; the search engine matches & ranks the query against the index to produce the retrieval result; knowledge resources & tools support both sides]
Text Processing (1) - Indexing
An index is a list of terms with relevant information
• Frequency of terms
• Location of terms
• Etc.
Index terms: represent document content & separate documents
• e.g., “economy” vs. “computer” in a Financial Times news article
To build the index:
• Extract the index terms
• Compute their weights
Text Processing (2) - Extraction
Extraction of index terms
• Word or phrase level
• Morphological analysis (stemming in English)
  • “information”, “informed”, “informs”, “informative” → inform
• Removal of common words on a “stop list”
  • “a”, “an”, “the”, “is”, “are”, “am”, …
• n-grams
  • “정보검색시스템” (“information retrieval system”) → “_정”, “정보”, “보검”, “검색”, … (bigrams)
  • Surprisingly effective in some languages
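As a minimal sketch of these extraction steps (the stop list and the suffix-stripping "stemmer" below are illustrative stand-ins, not what any production engine uses):

```python
# Sketch of index-term extraction: stop-word removal, a naive
# suffix-stripping stemmer, and character bigrams with a boundary marker.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

def extract_terms(text: str) -> list[str]:
    """Tokenize, drop stop words, and crudely stem by stripping a plural 's'."""
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        terms.append(token[:-1] if token.endswith("s") and len(token) > 3 else token)
    return terms

def char_bigrams(word: str) -> list[str]:
    """Character bigrams with a leading '_' boundary, as on the slide."""
    padded = "_" + word
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

print(extract_terms("the dog informs dogs"))  # ['dog', 'inform', 'dog']
print(char_bigrams("정보검색시스템"))            # ['_정', '정보', '보검', '검색', ...]
```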
An Example
Identify all unique words in a collection of 1,033 abstracts in biomedicine
→ 13,471 terms
Delete 170 common function words included in the stop list
→ 13,301 terms left
Delete all terms with collection frequency equal to 1 (terms occurring in one doc with frequency 1)
→ 7,236 terms left
Remove terminal “s” endings & combine identical word forms
→ 6,056 terms left
Delete 30 very high-frequency terms occurring in over 24% of the documents
→ 6,026 terms left: the final indexing vocabulary
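A sketch of the same reduction pipeline in code (the thresholds mirror the slide's; the stop list and stemming rule are toy stand-ins):

```python
# Vocabulary reduction as on the slide: stop words out, collection-frequency-1
# terms out, terminal "s" forms merged, then very high-DF terms out.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "and"}  # stand-in for the 170-word list

def stem(w: str) -> str:
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def build_vocabulary(docs: list[str]) -> set[str]:
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    coll_freq = Counter(w for doc in tokenized for w in doc)
    # Keep terms with collection frequency > 1, merging "s" forms as we go.
    vocab = {stem(w) for w, f in coll_freq.items() if f > 1}
    # Drop terms appearing in over 24% of the documents.
    doc_freq = Counter(t for doc in tokenized for t in {stem(w) for w in doc})
    return {t for t in vocab if doc_freq[t] <= 0.24 * len(docs)}
```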
Text Processing (3) – Term Weight
Calculation of term weights
• Statistical weights using frequency information
  • Reflect the importance of a term in a document
• E.g., TF*IDF
  • TF: total frequency of a term in a document
  • IDF: inverse document frequency
  • DF: the number of documents in which the term appears
• High TF, low DF → good word to represent the text
• High TF, high DF → bad word
An Example
TF for “Arizona”
• in Doc 1 is 1
• in Doc 2 is 2
DF for “Arizona”
• in this collection (Doc 1 & Doc 2) is 2
• → IDF = 1/2
TW = TF * IDF
Normalization of TF is critical to retrieval effectiveness
• prevents a bias towards longer documents
• TF = 0.5 + 0.5 * (TF / Max TF)
TW = TF * log2(N / DF + 1)
• the log dampens extreme values (e.g., log of 10^-34 is -34)
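A small sketch of this weighting scheme in code, using the slide's normalized TF and its TW = TF * log2(N/DF + 1) formula (the two toy documents are hypothetical stand-ins for Doc 1 and Doc 2):

```python
# Term weighting per the slide: TF normalized as 0.5 + 0.5 * (tf / max_tf),
# then TW = TF * log2(N / DF + 1).
import math
from collections import Counter

docs = ["arizona state", "arizona loves arizona"]  # hypothetical Doc 1 / Doc 2

def term_weight(term: str, doc: str, docs: list[str]) -> float:
    counts = Counter(doc.split())
    tf = 0.5 + 0.5 * counts[term] / max(counts.values())  # length normalization
    df = sum(term in d.split() for d in docs)             # document frequency
    return tf * math.log2(len(docs) / df + 1)

for i, d in enumerate(docs, 1):
    print(f"TW('arizona', Doc {i}) = {term_weight('arizona', d, docs):.3f}")
```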
Text Processing (4) - Storing indexing results
From raw text to index:
[figure: Document 1 and Document 2 are mapped to a table of index words (“Arizona”, …, “University”), each paired with word info such as (document id, frequency) entries]
Text Processing (5) - Storing indexing results
Inverted file
[figure: an inverted file answering the query “Arizona State University”. A directory maps each term (“ASU”, “Google”, “search”, …, “tiger”) to a pointer into a posting file; each posting list enumerates the documents (Doc #1, Doc #2, Doc #5, …) containing that term]
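A minimal in-memory sketch of such an inverted file (the directory is a dict from term to a postings list of document ids; the documents are hypothetical):

```python
# Minimal in-memory inverted file: a directory (term -> postings list)
# where each postings list holds the ids of documents containing the term.
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)  # postings stay sorted by doc id
    return dict(index)

def search(index: dict[str, list[int]], query: str) -> set[int]:
    """AND semantics: return documents containing every query term."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "Arizona State University", 2: "Google search", 5: "Arizona search engine"}
index = build_inverted_index(docs)
print(search(index, "Arizona search"))  # {5}
```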
Matching & Ranking
Ranking
• Retrieval Model
  • Boolean (exact)
  • Vector Space
  • Probabilistic
  • Inference Net
  • Language Model, …
Weighting Schemes
• Index terms, query terms
• Parameters in formulas
Vector Space Model
Treat documents and queries as vectors.
(DOC 1): “... dog ........ dog ....”
• Doc 1 = <2> (one axis: dog)
(DOC 2): “... cat ........ cat ...... dog ......... dog ....”
• Doc 2 = <2, 2> (axes: dog, cat)
[figure: Doc 1 plotted on a dog axis; Doc 2 plotted in the dog-cat plane]
Vector Space Model
(DOC): “... cat ........ cat ...... dog ......... dog ....” → Doc = <2, 2>
Query 1: dog → Query 1 = <1, 0>
Query 2: cat, dog → Query 2 = <1, 1>
If we use angles as a similarity measure, then Q2 is more similar to Doc than Q1:
COS(Q1, Doc) < COS(Q2, Doc)
[figure: Doc, Q1, and Q2 plotted in the dog-cat plane]
Vector Space Model
Given x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn):
Dot product: x · y = Σ_{i=1}^{n} x_i y_i
Cosine similarity: cos(x, y) = (x · y) / (|x| |y|)
Vector Space Model
cos(x, y) = (x · y) / (|x| |y|)
<DOC 1>: “... cat ........ dog ...... dog ... mouse ..... dog ... mouse ...”
Term order (cat, mouse, dog): D1 = (1, 2, 3)
Query “cat, mouse”: Q = (1, 1, 0)
Here the term weight is simply the term frequency.
Similarity = (1*1 + 2*1 + 3*0) / (|D1| |Q|)
[figure: D1 and Q as vectors over the cat, mouse, and dog axes]
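A quick sketch verifying this similarity for the slide's vectors (term order cat, mouse, dog):

```python
# Cosine similarity between the slide's document vector and query vector.
import math

def cosine(x: list[float], y: list[float]) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

d1 = [1, 2, 3]  # TFs of (cat, mouse, dog) in Doc 1
q = [1, 1, 0]   # query "cat, mouse"
print(cosine(d1, q))  # 3 / (sqrt(14) * sqrt(2)) ≈ 0.567
```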
Matching & Ranking
Techniques for efficiency
• New storage structures, esp. for new document types
• Use of accumulators for efficient generation of ranked output
• Compression/decompression of indexes
Techniques for Web search engines
• Use of hyperlinks
  • PageRank: inlinks & outlinks
  • HITS: authority vs. hub pages
• In conjunction with directory services (e.g., Yahoo)
• ...
PageRank
Basic idea: more links to a page imply a better page
• But all links are not created equal
• Links from a more important page should count more than links from a weaker page
Basic PageRank PR(A) for page A:
PR(A) = Σ_{x points to A} PR(x) / outDegree(x)
• outDegree(x) = number of edges leaving page x = hyperlinks on page x
• Page x distributes its rank boost over all the pages it points to
[figure: three pages where PR(A) = PR(C)/1, PR(B) = PR(A)/2, PR(C) = PR(A)/2 + PR(B)/1]
PageRank
PageRank definition is recursive
• Rank of a page depends on and influences other pages
• Eventually, ranks will converge
To compute PageRank:
• Choose an arbitrary initial R_old and use it to compute R_new
• Repeat, setting R_old to R_new, until R converges (the difference between old and new R is sufficiently small)
• Rank values typically converge in 50-100 iterations
• Rank orders converge even faster
Problems with Basic PageRank
Web is not a strongly connected graph
• Rank sink – single page (node) with no outward links
• Nodes not part of sink get rank of 0
Extended PageRank
Remove all nodes without outlinks
• No rank for these pages
Add a decay factor, d:
PR(A) = d * Σ_{x points to A} PR(x) / outDegree(x) + (1 - d) / n
• n is the number of nodes/pages
• d is a constant, typically between 0.8 and 0.9
• d represents the fraction of a page's rank that is distributed among the pages it links to; the rest of the rank is distributed among all pages
In the random surfer model, the decay factor corresponds to the user getting bored (or unhappy) with the links on a given page and jumping to a random page (one not linked to)
Example
Set d = 0.5 and ignore the 1/n factor, so PR(A) = 0.5 + 0.5 * Σ_{x points to A} PR(x) / outDegree(x)
Small systems can be solved directly:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
This gives:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
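A sketch of solving this small system exactly (substituting PR(B) and PR(C) into the PR(A) equation collapses it to PR(A) = 7/8 + (3/16) PR(A); the fractions module keeps the arithmetic exact):

```python
# Solve the slide's three equations exactly by substitution:
#   PR(A) = 1/2 + 1/2 PR(C)
#   PR(B) = 1/2 + 1/2 (PR(A) / 2)
#   PR(C) = 1/2 + 1/2 (PR(A) / 2 + PR(B))
from fractions import Fraction

half = Fraction(1, 2)
# Substitution reduces the system to PR(A) = 7/8 + (3/16) PR(A):
pr_a = Fraction(7, 8) / (1 - Fraction(3, 16))
pr_b = half + half * (pr_a / 2)
pr_c = half + half * (pr_a / 2 + pr_b)
print(pr_a, pr_b, pr_c)  # 14/13 10/13 15/13
```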
Example
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
Set the initial values of PR(A), PR(B), PR(C) to 1; each equation uses the most recently computed values.
After the first iteration:
• PR(A) = 0.5 + 0.5 * 1 = 1
• PR(B) = 0.5 + 0.5 * (1 / 2) = 0.75
• PR(C) = 0.5 + 0.5 * (1 / 2 + 0.75) = 1.125
After the second iteration:
• PR(A) = 0.5 + 0.5 * 1.125 = 1.0625
• PR(B) = 0.5 + 0.5 * (1.0625 / 2) = 0.765625
• PR(C) = 0.5 + 0.5 * (1.0625 / 2 + 0.765625) = 1.1484375
Example
For larger systems, use the iteration method:

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
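A sketch of the same iteration in code, printing one row per iteration (values are updated in the slide's order, each using the freshest available numbers):

```python
# Reproduce the iteration table: apply the three update equations repeatedly,
# using each newly computed value immediately, as the slide's example does.
a = b = c = 1.0
for i in range(13):
    print(f"{i:2d}  {a:.8f}  {b:.8f}  {c:.8f}")
    a = 0.5 + 0.5 * c
    b = 0.5 + 0.5 * (a / 2)
    c = 0.5 + 0.5 * (a / 2 + b)
```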
Problems with PageRank
Biased against new web pages (which have few inlinks yet)
• Can be addressed with a boost factor
No balance between relevancy and popularity
• Very popular pages (such as search engines and web portals) may be ranked artificially high due to their popularity (even if not very related to the query)
Despite these problems, PageRank seems to work fairly well in practice
Open-Source Search Engine Code
Lucene Search Engine
• http://lucene.apache.org/
SWISH
• http://swish-e.org/
Glimpse
• http://webglimpse.net/
and more
References
L. Page & S. Brin. The PageRank citation ranking: Bringing order to the web. Stanford Digital Library Technologies Project, Working Paper 1999-0120, 1998.
Steven Levy. All Eyes on Google. Newsweek, April 12, 2004.
E. Brown, J. Callan, B. Croft. "Fast Incremental Indexing for Full-Text Information Retrieval." Proceedings of the 20th International Conference on Very Large Databases (VLDB), 1994.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998.