CS 430 / INFO 430:
Information Retrieval
Lecture 20
Web Search 2
1
Course Administration
Some thoughts about Google as an organization
2
Indexing the Web Goals: Precision
Short queries applied to very large numbers of items
lead to large numbers of hits.
• Goal is that the first 10-100 hits presented should satisfy
the user's information need
-- requires ranking hits in an order that fits the user's requirements
• Recall is not an important criterion
Completeness of index is not an important factor.
• Comprehensive crawling is unnecessary
3
Graphical Methods
Document A refers to document B
Document A provides information about document B
4
Anchor Text
The source of Document A contains the marked-up text:
<a href="http://www.cis.cornell.edu/">The Faculty of
Computing and Information Science</a>
The anchor text:
The Faculty of Computing and Information Science
can be considered descriptive metadata about the
document:
http://www.cis.cornell.edu/
5
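The anchor-text idea above can be sketched with Python's standard html.parser module. The class name AnchorTextParser and the variable names are illustrative assumptions, not part of the lecture; the markup is the example from the slide.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (target URL, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open
        self._chunks = []   # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse the line break and extra spaces inside the anchor.
            text = " ".join("".join(self._chunks).split())
            self.pairs.append((self._href, text))
            self._href = None


# Source of Document A, as shown on the slide.
source_a = (
    '<a href="http://www.cis.cornell.edu/">The Faculty of\n'
    "Computing and Information Science</a>"
)
parser = AnchorTextParser()
parser.feed(source_a)
print(parser.pairs)
```

Each pair can then be indexed as descriptive metadata for the target URL rather than for Document A itself.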
Concept of Relevance and Importance
Document measures
Relevance, as conventionally defined, is binary (relevant
or not relevant). It is usually estimated by the similarity
between the terms in the query and each document.
Importance measures documents by their likelihood of
being useful to a variety of users. It is usually estimated
by some measure of popularity.
Web search engines rank documents by a combination of
estimates of relevance and importance.
6
Ranking Options
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length,
and extra weighting for specific fields, e.g., title, anchors,
etc.
4. Popularity, e.g., PageRank
The details of 3 and the balance between 3 and 4 are not
made public.
7
Citation Graph
[Figure: citation graph in which a paper "cites" other papers and "is cited by" other papers]
Note that journal citations always refer to earlier work.
8
Bibliometrics
Techniques that use citation analysis to measure the
similarity of journal articles or their importance
Bibliographic coupling: two papers that cite many of the
same papers
Co-citation: two papers that were cited by many of the
same papers
Impact factor (of a journal): frequency with which the
average article in a journal has been cited in a particular
year or period
9
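The two similarity measures above can be sketched as simple set operations. The citation data and the function names below are hypothetical, chosen only to illustrate the definitions.

```python
# Hypothetical citation data: paper -> set of papers it cites.
CITES = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"A", "B"},
    "D": {"A", "B", "X"},
}


def bibliographic_coupling(p, q):
    """Number of papers cited by both p and q."""
    return len(CITES.get(p, set()) & CITES.get(q, set()))


def co_citation(p, q):
    """Number of papers that cite both p and q."""
    return sum(1 for refs in CITES.values() if p in refs and q in refs)


print(bibliographic_coupling("A", "B"))  # A and B both cite X and Y
print(co_citation("A", "B"))             # C and D both cite A and B
```

Bibliographic coupling looks at the out-links of the two papers, while co-citation looks at their in-links.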
Bibliometrics: Impact Factor
Impact Factor (Garfield, 1972)
• Set of journals in Journal Citation Reports of the Institute for
Scientific Information
• Impact factor of a journal j in a given year is the average
number of citations received by papers published in the
previous two years of journal j. Impact factor counts in-degrees
of nodes in the network.
Influence Weight (Pinski and Narin, 1976)
• A journal is influential if, recursively, it is heavily cited by
other influential journals.
10
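As a rough illustration of the impact factor definition above, the sketch below averages the citations received in a given year by papers that a journal published in the previous two years. The records and the function name are hypothetical; real impact factors are computed from the Journal Citation Reports data.

```python
# Hypothetical records for one journal: publication year of each paper and
# the number of citations it received in each later year.
PAPERS = [
    {"year": 2003, "citations": {2004: 2, 2005: 4}},
    {"year": 2004, "citations": {2005: 1}},
    {"year": 2004, "citations": {2005: 3}},
    {"year": 2002, "citations": {2005: 9}},  # too old to count for 2005
]


def impact_factor(papers, year):
    """Average citations in `year` to papers from the two previous years."""
    recent = [p for p in papers if p["year"] in (year - 1, year - 2)]
    if not recent:
        return 0.0
    cited = sum(p["citations"].get(year, 0) for p in recent)
    return cited / len(recent)


print(impact_factor(PAPERS, 2005))  # (4 + 1 + 3) / 3
```

Every citation counts equally here, which is the contrast with the recursive influence weight of Pinski and Narin.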
Graphical Analysis of Hyperlinks on the Web
[Figure: directed graph of six pages, numbered 1 to 6]
This page links to many other pages (hub)
Many pages link to this page (authority)
11
Graphical Methods on Web Links
Choices
• Graph of full Web or subgraph
• In-links to a node or all links
Algorithms
• Hubs and Authorities -- subgraph, all links (Kleinberg, 1997)
• PageRank -- full graph, in-links only (Brin and Page, 1998)
See: J. Kleinberg. Authoritative sources in a hyperlinked
environment. Journal of the ACM, 46 (1999), for descriptions
of all these methods
Or take: CS/INFO 685, The Structure of Information Networks
12
PageRank Algorithm
Used to estimate importance of documents.
Concept:
The rank of a web page is higher if many pages link to it.
Links from highly ranked pages are given greater weight
than links from less highly ranked pages.
PageRank is essentially a modified version of Pinski and
Narin's influence weights applied to the Web graph.
13
Intuitive Model (Basic Concept)
Basic (no damping)
A user:
1. Starts at a random page on the web
2. Selects a random hyperlink from the current page and
jumps to the corresponding page
3. Repeats Step 2 a very large number of times
Pages are ranked according to the relative frequency with
which they are visited.
14
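The random-walk model above can be simulated directly. This is a minimal sketch: the link structure in LINKS is an assumption, chosen to be consistent with the six-page normalized link matrix used in the lecture's example, and the function name and seed are illustrative.

```python
import random

# Assumed example graph: page -> pages it links to.
LINKS = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}


def random_surfer(links, steps, seed=0):
    """Return the relative visit frequency of each page (no damping)."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))       # Step 1: start at a random page
    visits = {p: 0 for p in links}
    for _ in range(steps):                 # Step 3: repeat many times
        page = rng.choice(links[page])     # Step 2: follow a random link
        visits[page] += 1
    return {p: visits[p] / steps for p in links}


freq = random_surfer(LINKS, steps=200_000)
for p in sorted(freq):
    print(p, round(freq[p], 3))
```

In this graph the surfer soon enters pages 2, 3 and 4 and never leaves, so the frequencies settle near 0.4, 0.4 and 0.2 for those pages and near 0 for the rest.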
Matrix Representation

                      Citing page (from)
                 P1   P2   P3   P4   P5   P6   Number
Cited       P1    0    0    0    0    1    0     1
page (to)   P2    1    0    1    0    0    0     2
            P3    1    1    0    1    0    0     3
            P4    1    1    0    0    1    1     4
            P5    1    0    0    0    0    0     1
            P6    0    0    0    0    1    0     1
        Number    4    2    1    1    3    1
15
Basic Algorithm: Normalize by
Number of Links from Page

                      Citing page
                 P1     P2     P3     P4     P5     P6
Cited       P1   0      0      0      0      0.33   0
page        P2   0.25   0      1      0      0      0
            P3   0.25   0.5    0      1      0      0       = B
            P4   0.25   0.5    0      0      0.33   1
            P5   0.25   0      0      0      0      0
            P6   0      0      0      0      0.33   0
        Number   4      2      1      1      3      1

Normalized link matrix
16
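The normalization step can be sketched in plain Python. The link lists in LINKS are an assumption, chosen to be consistent with the example's out-link counts (4, 2, 1, 1, 3, 1) and normalized entries; the function name is illustrative.

```python
# Assumed example graph: page -> pages it links to.
LINKS = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}


def normalized_link_matrix(links):
    """Return B as a list of rows.

    B[i][j] is 1 / (number of links out of citing page j) if citing page j
    links to cited page i, and 0 otherwise, so every column sums to 1.
    """
    pages = sorted(links)
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    matrix = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            matrix[index[dst]][index[src]] = 1.0 / len(targets)
    return matrix


B = normalized_link_matrix(LINKS)
for row in B:
    print("  ".join(f"{x:4.2f}" for x in row))
```

Rows correspond to cited pages and columns to citing pages, matching the layout of the slide.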
Basic Algorithm: Weighting of Pages
Initially all pages have weight 1:
w_0 = (1, 1, 1, 1, 1, 1)
Recalculate weights:
w_1 = B w_0 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
If the user starts at a random page, the jth element of w_1,
divided by the number of pages n, is the probability of
reaching page j after one step.
17
Basic Algorithm: Iterate
Iterate: w_k = B w_{k-1}

        w_0    w_1    w_2    w_3    ... converges to ...   w
P1      1      0.33   0.08   0.03            ->            0.00
P2      1      1.25   1.83   2.80            ->            2.39
P3      1      1.75   2.79   2.06            ->            2.39
P4      1      2.08   1.12   1.05            ->            1.19
P5      1      0.25   0.08   0.02            ->            0.00
P6      1      0.33   0.08   0.03            ->            0.00

The sum of the weights is the number of pages.
18
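The basic iteration can be reproduced with a few lines of Python. This is a sketch under the assumption that the example graph has the link lists in LINKS below, which are consistent with the normalized link matrix B of the example; the helper names are illustrative.

```python
# Assumed example graph: page -> pages it links to.
LINKS = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}
N = len(LINKS)

# B[i][j] = 1 / out-degree of page j+1 if page j+1 links to page i+1, else 0.
B = [[1.0 / len(LINKS[j]) if i in LINKS[j] else 0.0 for j in range(1, N + 1)]
     for i in range(1, N + 1)]


def multiply(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def iterate(matrix, steps):
    """Return [w_0, w_1, ..., w_steps], starting from all weights 1."""
    w = [1.0] * len(matrix)
    history = [w]
    for _ in range(steps):
        w = multiply(matrix, w)
        history.append(w)
    return history


history = iterate(B, 100)
for k in (0, 1, 2, 3, 100):
    print(k, [round(x, 2) for x in history[k]])
```

After enough iterations the weights of pages 1, 5 and 6 drain to 0 and the remaining weight settles in the ratio 2 : 2 : 1 over pages 2, 3 and 4, in line with the rounded values in the table.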
Graphical Analysis of Hyperlinks on the Web
[Figure: directed graph of six pages, numbered 1 to 6]
There is no link out of {2, 3, 4}
19
Google PageRank with Damping
A user:
1. Starts at a random page on the web
2a. With probability d, selects any random page and jumps to it
2b. With probability 1-d, selects a random hyperlink from the
current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with
which they are visited.
20
The PageRank Iteration
The basic method iterates using the normalized link
matrix, B.
w_k = B w_{k-1}
The limit w is the high order (principal) eigenvector of B.
PageRank iterates using a damping factor. The method
iterates:
w_k = d w_0 + (1 - d) B w_{k-1}
w_0 is a vector with every element equal to 1.
d is a constant found by experiment.
21
The PageRank Iteration
The iteration expression with damping can be re-written.
Let R be a matrix with every element equal to 1/n
R w_{k-1} = w_0 (The sum of the elements of w_{k-1} equals n)
Let P = dR + (1 - d)B
The iteration formula
w_k = d w_0 + (1 - d) B w_{k-1}
is equivalent to
w_k = P w_{k-1}
so that w is the high order eigenvector of P
22
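The rewriting step above can be spelled out as follows. This is a sketch that assumes, as the slide notes, that the elements of w_{k-1} sum to n, and additionally that every column of B sums to 1 (every page has at least one out-link), which is what keeps the sum equal to n from one iteration to the next.

```latex
\begin{aligned}
(R\,w_{k-1})_i &= \sum_{j=1}^{n} \tfrac{1}{n}\,(w_{k-1})_j
  = \tfrac{1}{n}\cdot n = 1
  \quad\Rightarrow\quad R\,w_{k-1} = w_0, \\
P\,w_{k-1} &= \bigl(dR + (1-d)B\bigr)\,w_{k-1}
  = d\,w_0 + (1-d)\,B\,w_{k-1} = w_k, \\
\sum_{i=1}^{n} (w_k)_i &= d\,n + (1-d)\sum_{i=1}^{n} (B\,w_{k-1})_i
  = d\,n + (1-d)\,n = n .
\end{aligned}
```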
Extra slide added November 22, 2005
Iterate with Damping
Iterate: w_k = P w_{k-1} (d = 0.3)

        w_0    w_1    w_2    w_3    ... converges to ...   w
P1      1      0.53   0.41   0.39            ->            0.38
P2      1      1.18   1.46   1.80            ->            1.68
P3      1      1.53   2.03   1.78            ->            1.87
P4      1      1.76   1.29   1.26            ->            1.31
P5      1      0.48   0.39   0.37            ->            0.37
P6      1      0.53   0.41   0.39            ->            0.38
23
Iteration expression corrected: November 22, 2005
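The damped iteration can be checked numerically with d = 0.3. As before, this is a sketch: the link lists in LINKS are an assumption consistent with the example's normalized link matrix, and the function names are illustrative.

```python
# Assumed example graph: page -> pages it links to.
LINKS = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}
N = len(LINKS)

# Normalized link matrix: column j holds 1 / out-degree of page j+1.
B = [[1.0 / len(LINKS[j]) if i in LINKS[j] else 0.0 for j in range(1, N + 1)]
     for i in range(1, N + 1)]


def multiply(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def damped_iteration(matrix, d, steps):
    """Iterate w_k = d*w_0 + (1 - d)*B*w_{k-1}, where w_0 is all ones."""
    w = [1.0] * len(matrix)
    history = [w]
    for _ in range(steps):
        w = [d * 1.0 + (1 - d) * x for x in multiply(matrix, w)]
        history.append(w)
    return history


history = damped_iteration(B, d=0.3, steps=100)
for k in (1, 2, 3, 100):
    print(k, [round(x, 2) for x in history[k]])
```

With damping, pages 1, 5 and 6 keep a nonzero weight because the random jump can always reach them.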
Google: PageRank
The Google PageRank algorithm is usually written with the
following notation.
Page A has pages T_1, ..., T_n pointing to it.
– d: damping factor
– C(A): number of links out of A
Iterate until the PageRank P(A) of every page A satisfies:
P(A) = (1 - d) + d \sum_{i=1}^{n} \frac{P(T_i)}{C(T_i)}
24
Note added 12/1/05: the parameter d used in this expression
equals 1 - d', where d' is the parameter called d in Slides 20-23.
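The per-page form of the formula can be iterated directly. In this sketch the graph in LINKS is an assumed six-page example, the starting value of 1 for every page is an assumption, and d = 0.7 is chosen because, by the note above, it corresponds to a damping parameter of 0.3 in the matrix form.

```python
# Assumed example graph: page -> pages it links to.
LINKS = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}


def google_pagerank(links, d=0.7, iterations=100):
    """Iterate P(A) = (1 - d) + d * sum(P(T_i) / C(T_i)) for every page A."""
    pages = sorted(links)
    # T_i for each page A: the pages that point to A.
    in_links = {a: [t for t in pages if a in links[t]] for a in pages}
    rank = {a: 1.0 for a in pages}
    for _ in range(iterations):
        rank = {
            a: (1 - d) + d * sum(rank[t] / len(links[t]) for t in in_links[a])
            for a in pages
        }
    return rank


rank = google_pagerank(LINKS)
for page in sorted(rank):
    print(page, round(rank[page], 2))
```

Each pass uses only the ranks from the previous pass, so it is the same computation as the matrix iteration, written one page at a time.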
Information Retrieval Using PageRank
Simple Method
Consider all hits (i.e., all documents that match the query in
the Boolean sense) as equal.
Display the hits ranked by PageRank.
The disadvantage of this method is that it gives no
attention to how closely a document matches a query.
25
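A minimal sketch of the simple method follows. The documents, PageRank values and function name are hypothetical, and the Boolean match is assumed to be an AND over whitespace-separated, lower-cased terms.

```python
# Hypothetical documents and PageRank values.
DOCUMENTS = {
    "d1": "cheap digital camera store",
    "d2": "digital camera specification from the manufacturer",
    "d3": "film camera history",
}
PAGERANK = {"d1": 2.1, "d2": 0.4, "d3": 1.0}


def simple_search(query_terms, documents, pagerank):
    """Boolean AND match, then order the hits by PageRank alone."""
    hits = [
        doc for doc, text in documents.items()
        if all(term in text.lower().split() for term in query_terms)
    ]
    return sorted(hits, key=lambda doc: pagerank[doc], reverse=True)


print(simple_search(["digital", "camera"], DOCUMENTS, PAGERANK))
```

Both matching documents are treated as equally relevant, so the order depends only on PageRank.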
Combining Term Weighting with
Reference Pattern Ranking
Combined Method
1. Find all documents that contain the terms in the query vector.
2. The similarity, using conventional term weighting, between the
query and document j is sj.
3. The rank of document j using PageRank or other reference
pattern ranking is pj.
4. Calculate a combined rank c_j = λ s_j + (1 - λ) p_j, where λ is a
constant.
5. Display the hits ranked by cj.
26
This method is used in several commercial systems, but the details
have not been published.
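The combined method can be sketched as a weighted sum of the two scores. Everything concrete here is an assumption for illustration: the scores are hypothetical and already scaled to a comparable range, and the name lam stands for the mixing constant between 0 and 1.

```python
# Hypothetical scores for three documents that contain the query terms.
SIMILARITY = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.6}   # s_j, term weighting
PAGERANK = {"doc1": 0.1, "doc2": 0.8, "doc3": 0.5}     # p_j, scaled to [0, 1]


def combined_ranking(similarity, pagerank, lam):
    """Order documents by c_j = lam * s_j + (1 - lam) * p_j, highest first."""
    combined = {
        doc: lam * similarity[doc] + (1 - lam) * pagerank[doc]
        for doc in similarity
    }
    return sorted(combined, key=combined.get, reverse=True)


print(combined_ranking(SIMILARITY, PAGERANK, lam=0.5))
print(combined_ranking(SIMILARITY, PAGERANK, lam=0.9))
```

A larger constant favors documents that match the query closely, while a smaller one favors popular documents, so the same hits can come back in a different order.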
Problems with PageRank
Most pages have very small page ranks
• For searches that return large numbers of hits, there are usually a
reasonable number of pages with high PageRank.
• For searches that return smaller numbers of hits, e.g., highly
specific queries, all the pages may have very small PageRanks, so
that it is difficult to rank them in a sensible order.
Example
A search by a customer for information about a product may rank
a large number of mail order businesses that sell the product
above the manufacturer's site that provides a specification for
the product. A small number of links makes big changes to the rank.
27
Problems with Graphical Methods
(Anchor Text + PageRank)
Google Bomb: a collective hyperlinking strategy intended to
change the search results of a specific term or phrase.
Examples
The "failure" Google bomb promoted George W. Bush’s page
on whitehouse.gov to the number one rank in a search of the
phrase "failure."
The "Jew" Google bomb demoted an anti-Semitic Web site
from number one rank with a search of "Jew," and promoted
the wikipedia.org definition of "Jew" to number one.
See: Clifford Tatum, 2005,
http://www.firstmonday.org/issues/issue10_10/tatum/
28
Advanced Graphical Methods:
www.teoma.com
• Carry out a search
• Divide Web sites found by a search into clusters, known as
communities
• Calculate authority within communities
• Calculate hubs within communities, known as experts
Note: Teoma does not publish the precise algorithms it uses
29