Link-based Ranking


Information Retrieval
Link-based Ranking
Ranking is crucial…
“…From our experimental data, we could
observe that the top 20% of the pages with
the highest number of incoming links
obtained 70% of the new links after 7
months, while the bottom 60% of the pages
obtained virtually no new incoming links
during that period…”
[ Cho et al., 04 ]
Query processing

First retrieve all pages meeting the
text query
Order these by their link popularity
+ other scores: TF-IDF, title, anchor text, etc.

Query-independent ordering


First generation: using link counts as simple
measures of popularity.
Two basic suggestions:

Undirected popularity:

Each page gets a score given by the number of its in-links
plus the number of its out-links (e.g. 3 + 2 = 5).

Directed popularity:

Score of a page = number of its in-links (e.g. 3).

Both are easy to SPAM!
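These first-generation scores are pure degree counts. A minimal sketch (the toy graph and helper name are hypothetical, for illustration only):

```python
# Hypothetical toy link graph: maps each page to the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def popularity_scores(links):
    """First-generation link popularity:
    directed   = number of in-links,
    undirected = in-links + out-links."""
    in_deg = {p: 0 for p in links}
    for src, outs in links.items():
        for dst in outs:
            in_deg[dst] = in_deg.get(dst, 0) + 1
    directed = in_deg
    undirected = {p: in_deg.get(p, 0) + len(links.get(p, []))
                  for p in links}
    return directed, undirected

directed, undirected = popularity_scores(links)
print(directed)    # page "c" has the most in-links (3)
print(undirected)  # "c" scores 3 + 1 = 4
```

Spamming either score is as easy as creating pages that link to the target.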
Second generation: PageRank

Each link has its own importance!

PageRank is

independent of the query

open to many interpretations…
Basic Intuition…
What about nodes
with no out-links?
Google’s Pagerank
Random jump
B(i) : set of pages linking to i.
#out(j) : number of outgoing links from j.
e : vector with all components equal to 1/√N.

P_{i,j} = 1 / #out(i)   if i → j
P_{i,j} = 0             otherwise
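The matrix entry above can be built directly from the link graph. A minimal sketch (the toy graph is hypothetical):

```python
import numpy as np

# Hypothetical toy graph: pages 0..3 and their out-links.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
N = 4

# P[i, j] = 1 / #out(i) if i links to j, else 0 (row-stochastic).
P = np.zeros((N, N))
for i, outs in out_links.items():
    for j in outs:
        P[i, j] = 1.0 / len(outs)

# Every row of P sums to 1 (no dangling nodes in this toy graph).
assert np.allclose(P.sum(axis=1), 1.0)
```

Dangling nodes (pages with no out-links) would produce all-zero rows; the random jump handles them, as discussed next.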
PageRank: use in search engines

Preprocessing:

Given the graph of links, build the matrix P
Compute its principal eigenvector
The i-th entry is the PageRank of page i
We are interested in the relative order

Query processing:

Retrieve pages meeting the query
Rank them by their PageRank
Order is query-independent
Three different interpretations

Graph (intuitive interpretation)

What is the random jump?
What happens on the “real” web graph?

Matrix (easy for computation)

Turns out to be an eigenvector computation
Or a linear-system solution

Markov chain (useful to show convergence)

A sort of usage simulation
Pagerank as Usage Simulation

Imagine a user doing a random walk on web:


Start at a random page
At each step, go out of the current page along
one of the links on that page, equiprobably
[Figure: a page with three out-links, each followed with probability 1/3]

“In the steady state” each page has a long-term
visit rate - use this as the page’s score.
Not quite enough

The web is full of dead-ends.


Random walk can get stuck in dead-ends.
Cannot talk about long-term visit rates.
Teleporting

At each step,


with probability 10%, jump to a random page;
with the remaining probability (90%), go out on an
out-link taken at random.

[Figure: with probability 0.1 → any node; with probability 0.9 → a random out-neighbor]
Result of teleporting


Now we CANNOT get stuck locally.
There is a long-term rate at which any page
is visited.

Not obvious → Markov chains
Markov chains

A Markov chain consists of n states, plus an
n×n transition probability matrix P.

At each step, we are in exactly one of the states.
For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the
probability of going from state i to state j.
P_ii > 0 is OK (self-loops are allowed).

[Figure: states i and j, with an arrow i → j labeled P_ij]
Markov chains

Clearly, for all i:  Σ_{j=1}^{n} P_ij = 1.
Markov chains abstract random walks

A Markov chain is ergodic if

you have a path from any state to any other;
you can be in any state at every time step,
with non-zero probability.

[Figure: two example chains that are not ergodic]
Ergodic Markov chains

For any ergodic Markov chain, there is a
unique long-term visit rate for each state.



Steady-state distribution.
Over a long time-period, we visit each state
in proportion to this rate.
It doesn’t matter where we start.
Computing steady-state probabilities?

Let x = (x1, …, xn) denote the row vector of
steady-state probabilities.
If our current position is described by x,
then the next step is distributed as xP.
But x is the steady state, so x=xP.

Solving this matrix equation gives us x.




So x is a (left) eigenvector of P.
(It is the “principal” eigenvector, the one with the
largest eigenvalue, which equals 1 for a stochastic P.)
Computing x by Power Method






Recall: regardless of where we start, we
eventually reach the steady state a.
Start with any distribution, say x = (1, 0, …, 0).
After one step we are at xP;
after two steps at xP², then xP³, and so on.
“Eventually” means: for large k, a ≈ xPᵏ.
Algorithm: multiply x by increasing powers
of P until the product looks stable.
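The power method with teleporting can be sketched in a few lines. This is an illustrative sketch, not Google's implementation; the 10% teleport probability matches the earlier slide, and dead ends are handled by teleporting uniformly:

```python
import numpy as np

def pagerank(out_links, n, alpha=0.9, iters=100):
    """Power method on the teleporting chain:
    with prob. alpha follow a random out-link,
    with prob. 1 - alpha jump to a uniformly random page.
    Dead ends teleport uniformly (one common convention)."""
    P = np.zeros((n, n))
    for i in range(n):
        outs = out_links.get(i, [])
        if outs:
            for j in outs:
                P[i, j] = alpha / len(outs)
            P[i] += (1 - alpha) / n
        else:
            P[i] = 1.0 / n          # dead end: jump anywhere
    x = np.full(n, 1.0 / n)          # any starting distribution works
    for _ in range(iters):
        x = x @ P                    # one step of x <- xP
    return x

# Symmetric 3-cycle: by symmetry each page gets rank 1/3.
r = pagerank({0: [1], 1: [2], 2: [0]}, 3)
print(r)
```

In practice one stops when successive iterates differ by less than a tolerance, rather than running a fixed number of steps.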
Why do we need fast ranking?
“…The link structure of the Web is
significantly more dynamic than the
contents on the Web. Every week, about 25%
new links are created. After a year, about
80% of the links on the Web are replaced
with new ones. This result indicates that
search engines need to update link-based
ranking metrics very often…”
[ Cho et al., 04 ]
Accelerating PageRank

Web Graph Compression to fit in internal
memory

Efficient external-memory implementation

Mathematical approaches

Combination of the above strategies
Personalized PageRank
Influencing PageRank (“Personalization”)

Input:

Web graph W
influence vector v
v : page → degree of influence

Output:

Rank vector r : page → importance w.r.t. v
r = PR(W, v)
Personalized PageRank

α is the damping factor

v is the personalization vector
(the teleport jumps according to v instead of the uniform distribution)
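A sketch of how the personalization vector changes the computation: it is the plain power iteration, except that every teleport (including from dead ends) is distributed according to v. The graph, damping factor 0.85, and function name are assumptions for illustration:

```python
import numpy as np

def personalized_pagerank(out_links, n, v, alpha=0.85, iters=200):
    """Like plain PageRank, but the teleport step jumps
    according to v instead of the uniform distribution."""
    v = np.asarray(v, dtype=float)
    v = v / v.sum()                      # normalize the influence vector
    x = v.copy()
    for _ in range(iters):
        y = np.zeros(n)
        for i in range(n):
            outs = out_links.get(i, [])
            if outs:
                for j in outs:
                    y[j] += alpha * x[i] / len(outs)
            else:
                y += alpha * x[i] * v    # dead end: redistribute via v
        x = y + (1 - alpha) * v          # teleport mass goes to v
    return x

# Teleporting only to page 0 concentrates rank on page 0 and
# the cycle it feeds; page 3 (no in-links, no teleport mass)
# ends with rank ~0.
r = personalized_pagerank({0: [1], 1: [2], 2: [0], 3: [0]}, 4,
                          [1, 0, 0, 0])
```

Setting v uniform recovers ordinary PageRank.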
HITS: Hubs and Authorities



A good hub page for a topic points to many
authoritative pages for that topic.
A good authority page for a topic is pointed
to by many good hubs for that topic.
Circular definition - will turn this into an
iterative computation.
HITS: Hypertext Induced Topic Search

It is query-dependent

Produces two scores per page:


Authority score
Hub score
Hubs & Authorities Calculation
[Figure: the query yields a root set, which is expanded via in/out-links into the base set]
Assembling the base set

Root set typically 200-1000 nodes.

Base set may have up to 5000 nodes.

How do you find the base set nodes?


Follow out-links by parsing root set pages.
Get in-links (and out-links) from a connectivity
server. (Actually, suffices to text-index strings of the
form href=“URL” to get in-links to URL.)
Authority and Hub scores
[Figure: pages 2, 3, 4 point to page 1; page 1 points to pages 5, 6, 7]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
Iterative update

Repeat the following updates, for all x:

h(x) = Σ_{x → y} a(y)

a(x) = Σ_{y → x} h(y)
HITS: Link Analysis Computation
a ← Aᵀ h   ⇒   a ← Aᵀ A a
h ← A a    ⇒   h ← A Aᵀ h
(with scaling/normalization at each step)

Where
a: vector of authority scores
h: vector of hub scores
A: adjacency matrix, with A_{i,j} = 1 if i points to j.

Thus h is an eigenvector of A Aᵀ,
and a is an eigenvector of Aᵀ A  → related to the SVD of A
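The alternating updates with normalization can be sketched directly (the toy adjacency matrix is hypothetical):

```python
import numpy as np

def hits(adj, iters=50):
    """HITS power iteration: a <- A^T h, h <- A a,
    normalizing each vector at every step (the 'scaling' above)."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return a, h

# Pages 1 and 2 both point to page 0:
# page 0 is the authority, pages 1 and 2 are the hubs.
A = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 0, 0]])
a, h = hits(A)
```

Because a converges to the principal eigenvector of AᵀA and h to that of AAᵀ, the same result could be read off an SVD of A.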
Math Behind the Algorithm

Theorem (Kleinberg, 1998). The vectors
a(p) and h(p) converge to the principal
eigenvectors of AᵀA and AAᵀ, where A
is the adjacency matrix of the
(directed) Web subgraph.
How many iterations?

Claim: relative values of scores will converge
after a few iterations:



We only require the relative orders of the h()
and a() scores - not their absolute values.
In practice, ~5 iterations get you close to
stability.
But: scores are sensitive to small variations of the
graph. Bad!
HITS Example Results
[Figure: authority and hubness weights for pages 1–15]
Weighting links

In hub/authority link analysis, we can match
text to the query, then weight each link:

unweighted:  h(x) = Σ_{x → y} a(y)            a(x) = Σ_{y → x} h(y)

weighted:    h(x) = Σ_{x → y} w(x, y) · a(y)  a(x) = Σ_{y → x} w(x, y) · h(y)
Weighting links

The weight should increase with the number of query
terms in the anchor text, e.g. w = 1 + number of query terms.

Example: page x contains “Armonk, NY-based computer
giant IBM announced today” anchoring a link to
y = www.ibm.com. The weight of this link for the
query “computer” is 1 + 1 = 2.
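The weighted update is the same iteration with the 0/1 adjacency matrix replaced by a weight matrix. A sketch, with a hypothetical weight matrix built from anchor-text matches:

```python
import numpy as np

def weighted_hits(W, iters=50):
    """HITS where W[x, y] is the weight of link x -> y,
    e.g. 1 + number of query terms in the anchor text."""
    n = W.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = W.T @ h
        a /= np.linalg.norm(a)
        h = W @ a
        h /= np.linalg.norm(h)
    return a, h

# Pages 1 and 2 both link to page 0, but page 1's anchor text
# contains the query term, so its link gets weight 2.
W = np.array([[0., 0., 0.],
              [2., 0., 0.],
              [1., 0., 0.]])
a, h = weighted_hits(W)
# Page 1's hub score now exceeds page 2's.
```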
Practical HITS

Used by ASK.com

How to compute it efficiently?
Comparison

PageRank
Pros:
Hard to spam
Quality signal for all pages
Proven to be effective for general-purpose ranking
Cons:
Not query specific
Doesn’t work on small graphs

HITS & Variants
Pros:
Query specific
Works on small graphs
Well suited for supervised directory construction
Cons:
Non-trivial to compute; real-time computation is hard
Local graph structure can be manufactured (spam!)
SALSA

Proposed by Lempel and Moran 2001.



Probabilistic extension of the HITS algorithm
Random walk is carried out by following
hyperlinks both in the forward and in the
backward direction
Two separate random walks


Hub walk
Authority walk
Forming a Bipartite Graph in SALSA
SALSA: Random Walks

Hub walk

Follow a forward link from a hub u_h to an authority w_a
Traverse a back link from w_a to some hub v_h

Authority walk

Follow a back link from an authority w_a to a hub u_h
Traverse a forward link from u_h to another authority z_a

Lempel and Moran (2001) proved that SALSA
weights are more robust than HITS weights in
some cases. The authority score converges to the
in-degree, and the hub score to the out-degree.
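The in-degree claim can be checked numerically on a toy graph: build the authority-walk transition matrix (back-link then forward-link, each chosen uniformly) and power-iterate. The graph below is hypothetical; on its single connected component the stationary distribution over authorities is proportional to in-degree:

```python
import numpy as np

# Toy graph: hubs 0, 1; authorities 2, 3.
# Edges: 0 -> 2, 1 -> 2, 1 -> 3.
out_links = {0: [2], 1: [2, 3]}
in_links = {2: [0, 1], 3: [1]}

auths = [2, 3]
idx = {p: k for k, p in enumerate(auths)}

# Authority-walk transition matrix: from authority w, pick a
# random in-link hub u, then a random out-link of u.
T = np.zeros((2, 2))
for w in auths:
    for u in in_links[w]:
        for z in out_links[u]:
            T[idx[w], idx[z]] += (1 / len(in_links[w])) \
                               * (1 / len(out_links[u]))

x = np.full(2, 0.5)
for _ in range(100):
    x = x @ T

# Stationary distribution matches normalized in-degrees:
# page 2 (in-degree 2) gets 2/3, page 3 (in-degree 1) gets 1/3.
print(x)
```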
Information Retrieval
Spamming
Spamming PageRank
Spam Farm (SF), rules of thumb

Use all available own pages in the SF

Accumulate the maximum number of inlinks to SF

NO links pointing outside the SF

Avoid dangling nodes within the SF
Spamming HITS


Easy to spam
Create a new page p pointing to many
authority pages (e.g., Yahoo, Google, etc.)
 p becomes a good hub page
… On p, add a link to your home page
Many applications
Ranking
Web
News
Financial Institutions
Peers
Human Trustness
Viruses
Social networks
You propose!
Information Retrieval
Other tools
Joint with A. Gullì
WWW 2005, Tokyo (JP)
SnakeT’s personalization
 Many interesting features:
 Full adaptivity to varied user needs/behaviors
 Scalable to many users and their unbounded profiles
 Privacy protection: no login or profile information is required/used
 Black box: no change in the underlying (unpersonalized) search engine

The user first gets a “glimpse” of all the themes of the results
without scanning all of them, and then
selects (filters) the results that interest him/her.