Perception of the
Information Value from
Link Analysis
Igor Kanovsky
Emek Yezreel College, Israel
[email protected]
© Igor Kanovsky @ VINet, Haifa, January 2004
New Dimension of Information
In most cases, information is linked to
other information.
The topology of the links is itself important
information about the information.
Examples of important linked systems:
1. Social relationships.
2. Business (organization) collaborations.
3. The Web. The Internet.
4. Biological data (DNA structure, cell
metabolism, etc.).
Linked: The New Science of Networks
(A.L.Barabasi, 2002)
An information piece is a vertex, the relationships
between information pieces are edges. The structure
of this kind of graph is the object of investigation.
• We know too little about the nature of networks, the patterns
of their structure, and the mechanisms of their development.
• The subject is interdisciplinary: CS, mathematics, statistical
physics, and the field of science the network belongs to.
PageRank
Web page ranking (S. Brin and L. Page, 1998):
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PR(A) is the PageRank of page A; C(A) is the number of links going out of
page A; d is the damping factor (0.85); {T1, …, Tn} is the set of pages that point to A
(citations).
A simple and powerful mechanism for estimating
the importance (or trustworthiness) of information
from link analysis.
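The formula above can be evaluated by simple repeated substitution. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the fixed iteration count, the starting value of 1.0 per page, and the toy three-page web are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without out-links passes on no rank
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Toy web: A links to B and C, B links to C, C links back to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy web, page C collects links from both A and B and ends up with the highest rank, while B, which is cited only by A and shares A's vote with C, ends up lowest.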
Search Engine “Optimization” (SEO)
The PageRank algorithm can be
spammed to promote a site
commercially, and a number of
companies make a
business of selling Google
rankings.
SEO is a technique for writing
web pages and link structures
so that your pages come out higher
in the search engine listings
than your competitors' pages.
Google's Florida Update
On Nov 16th 2003, Google changed its search
engine ranking mechanism (Google's
"Florida" update). The new algorithm was not
published.
Proposed explanations of the change:
1. SEO filter.
2. Searchterm list.
3. LocalRank. LocalRank is a method of modifying the
rankings based on the interconnectivity between the
pages (!).
4. Topic-sensitive search engine.
The Web as a graph
A huge digraph whose statistical characteristics are
similar to those of the Web graph is called a Web-like
graph.
The known significant properties of the Web
as a graph are:
1. Power-law distributions.
2. Small world topology.
3. Bipartite cliques.
4. "Bow-tie" shape.
Power-Law distributions (PLD)
Power-law distribution of the in- and out-degrees of vertices. The number
of web pages having $k_{in}$ links to the page or $k_{out}$ links from
the page is proportional to $k^{-\gamma}$, for some constants $\gamma_{in}, \gamma_{out} > 2$.
Andrei Broder, Ravi Kumar and others. Graph structure in the web. 2001.
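One common way to check a power-law claim on measured data is to estimate the exponent directly from the sample. The sketch below uses the standard continuous maximum-likelihood estimator, which is a swapped-in technique and not something stated on the slide. The function names, the cut-off `k_min`, and the synthetic sample are assumptions for illustration.

```python
import math
import random


def estimate_exponent(values, k_min=1.0):
    """Continuous maximum-likelihood estimate of gamma for p(k) ~ k^(-gamma), k >= k_min."""
    tail = [v for v in values if v >= k_min]
    log_sum = sum(math.log(v / k_min) for v in tail)
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, gamma, k_min=1.0, seed=0):
    """Draw n synthetic values with density proportional to k^(-gamma) (inverse transform)."""
    rng = random.Random(seed)
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]


# Synthetic check: the estimate should land close to the exponent used to generate the data.
sample = sample_power_law(20000, gamma=2.1)
gamma_hat = estimate_exponent(sample)
```

Real degree sequences are integers, so this continuous estimator is only an approximation for them, and it is least accurate at small degrees.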
The Small World
Small diameter of the graph. The average distance
between any two connected vertices of the web graph is bounded by
log N, where N is the number of vertices in the graph.
Large clustering coefficient.
The clustering coefficient C(v) of a vertex v is the fraction of pairs of
neighbours of v that are connected to each other. For the whole graph, C = <C(v)>, the average over all vertices.
The clustering coefficient of the Web graph is significantly
larger than that of a random graph.
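The clustering coefficient defined above can be computed directly from an adjacency structure. This is a minimal sketch for an undirected graph. The dictionary-of-sets representation and the function names are assumptions, and vertices with fewer than two neighbours are given C(v) = 0 by convention.

```python
from itertools import combinations


def clustering(adj, v):
    """C(v): fraction of pairs of neighbours of v that are themselves connected."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here for vertices with fewer than two neighbours
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))


def average_clustering(adj):
    """C = <C(v)>, the mean of C(v) over all vertices."""
    return sum(clustering(adj, v) for v in adj) / len(adj)


# Toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to vertex 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
c = average_clustering(toy)
```

For the toy graph, vertices 1 and 2 have C(v) = 1, vertex 3 has C(v) = 1/3, and vertex 4 has C(v) = 0, giving C = 7/12.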
The Small World (2)
Lada A. Adamic. The Small World Web. 2000.
Bipartite Small Cores
A bipartite core Ci,j is a graph on i+j
nodes that contains at least one
bipartite clique Ki,j as a subgraph.
The Web graph contains many small
bipartite cores Ci,j (with i, j ≥ 3),
whereas a random graph does not have
such small cliques.
[Figure: the bipartite clique K3,3]
These small cliques are the cores of web communities:
sets of connected sites with a common content topic.
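A brute-force search makes the definition of a core concrete. The sketch below assumes the directed reading used in the web-community literature, where i "fan" pages all link to the same j "centre" pages. The slide itself does not state the edge direction, and the function name and toy data are hypothetical. The search is exponential in i, so it is only suitable for tiny graphs.

```python
from itertools import combinations


def find_cores(out_links, i, j):
    """Yield (fans, centres): i fan pages that all link to the same >= j centre pages."""
    # Only pages with at least j out-links can act as fans.
    candidates = sorted(p for p, targets in out_links.items() if len(targets) >= j)
    for fans in combinations(candidates, i):
        common = set.intersection(*(set(out_links[f]) for f in fans))
        common -= set(fans)  # keep the two sides of the bipartite clique disjoint
        if len(common) >= j:
            yield fans, common


# Toy data: three fan pages share three centre pages, so they form a K3,3.
toy_links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "x"},
    "g": {"c1"},
}
cores = list(find_cores(toy_links, 3, 3))
```

Here `cores` contains a single entry, with fans `f1`, `f2`, `f3` and centres `c1`, `c2`, `c3`. Page `g` is skipped because it has fewer than three out-links.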
Bipartite Small Cores (2)
[Figure: number of Ci,j as a function of i and j]
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins.
Extracting large-scale knowledge bases from the web. 2000.
"bow-tie" shape
Most web pages can be
divided into four sets: a core formed by
the strongly connected component
(SCC), i.e. pages that are mutually
reachable from each other; two sets
(upstream and downstream) formed by
the pages that can only reach, or only be
reached from, the pages in the core;
and a set (tendrils) containing pages
that can neither reach nor be reached
from the core.
The Web graph has a "bow-tie" shape.
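The four sets can be recovered with two graph traversals from any page known to lie in the core. This is a minimal sketch under that assumption. The function names, the adjacency-list input, and the toy graph are hypothetical.

```python
from collections import deque


def reachable(graph, start):
    """Set of nodes reachable from `start` by following edges (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(graph, core_node):
    """Split nodes into (core, upstream, downstream, rest) relative to core_node's SCC."""
    nodes = set(graph)
    reverse = {}
    for src, targets in graph.items():
        for dst in targets:
            nodes.add(dst)
            reverse.setdefault(dst, set()).add(src)
    forward = reachable(graph, core_node)     # pages the core can reach
    backward = reachable(reverse, core_node)  # pages that can reach the core
    core = forward & backward
    return core, backward - core, forward - core, nodes - forward - backward


toy = {"in": ["a"], "a": ["b"], "b": ["a", "out"], "out": [], "t": ["out"]}
core, upstream, downstream, rest = bow_tie(toy, "a")
```

In the toy graph, `a` and `b` form the core, `in` is upstream, `out` is downstream, and `t` falls into the remaining set because it neither reaches the core nor is reached from it.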
Web-like Graph Modeling
The aim is to find stochastic processes
that yield a web-like graph.
Our integrated approach is based on well-known
Web graph models, extended in
order to satisfy all of the statistical
properties mentioned above.
We try to keep the web-like graph model as
simple as possible, so it has to have a
minimal set of parameters.
Extended scale-free model (1)
1. At each time step, a new vertex is added and is
connected to existing vertices through a random
number m (≤ z) of new edges, where the average
number of edges per node (z) is constant for a
growing graph. The probability that an existing
vertex gains an edge is proportional to its in-degree
plus the initial attractiveness $A_{in}$.
$$\Pi(k_{in,i}) = \frac{k_{in,i} + A_{in}}{\sum_j (k_{in,j} + A_{in})}$$
Extended scale-free model (2)
2. Simultaneously, z-m directed edges are distributed
among all the vertices in the graph by the following
rules: (i) the source is chosen with a probability
proportional to its out-degree, (ii) the target is
chosen with a probability proportional to its in-degree.
The model has 3 parameters: the average
degree z, and the initial attractiveness of a vertex
for gaining in- and out-edges, $A_{in}$ and $A_{out}$.
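The two steps of the model can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes that m is drawn uniformly from 1..z, that the graph starts from a single isolated vertex, that the new vertex is the source of its m edges, that out-degree choices also add the initial attractiveness A_out, and that multi-edges and self-loops are allowed in step 2. The helper `pick` and all names are hypothetical.

```python
import random


def pick(rng, n_vertices, endpoints, attractiveness):
    """Pick a vertex with probability proportional to (degree + attractiveness).

    `endpoints` lists one entry per edge end, so a vertex appears there `degree` times.
    """
    uniform_mass = attractiveness * n_vertices
    if rng.random() * (uniform_mass + len(endpoints)) < uniform_mass:
        return rng.randrange(n_vertices)
    return rng.choice(endpoints)


def extended_scale_free(n, z, a_in, a_out, seed=0):
    """Grow a directed graph with n vertices and z new edges per time step."""
    rng = random.Random(seed)
    sources, targets = [], []  # edge k runs from sources[k] to targets[k]
    n_vertices = 1             # assumption: start from one isolated vertex
    while n_vertices < n:
        new = n_vertices
        m = rng.randint(1, z)  # assumption: m is uniform on 1..z
        # Step 1: the new vertex links to m existing vertices, chosen by in-degree.
        for _ in range(m):
            target = pick(rng, n_vertices, targets, a_in)
            sources.append(new)
            targets.append(target)
        n_vertices += 1
        # Step 2: z - m further edges among all vertices, including the new one.
        for _ in range(z - m):
            source = pick(rng, n_vertices, sources, a_out)
            target = pick(rng, n_vertices, targets, a_in)
            sources.append(source)
            targets.append(target)
    return list(zip(sources, targets))


edges = extended_scale_free(n=500, z=4, a_in=2.0, a_out=6.0)
```

Tallying the second element of each edge, for example with `collections.Counter`, gives the in-degree distribution of the simulated graph, which can then be compared with the measured one.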
Simulation results. In-degree distribution.
[Figure: in-degree distributions of our model and of the Web]
Our model: N = 30 K, <k> = 8, Ain = 2, Aout = 6.
Web: N = 500 M.
Advantages of our approach
Only our extended scale-free model
captures all known statistical properties of
the Web graph.
The model is very simple. It has only
three parameters.
The model may be used for developing and
testing different algorithms for the Web (such as
search, ranking, and site promotion).
New tools needed for link analysis
Obviously, PLD, the clustering coefficient and
bipartite cliques are not the only non-trivial
properties of different networks.
Are there additional link distributions?
How can local patterns be found from a graph's
statistical attributes?
What correlations between links may be
discovered in different graphs?
This is just the beginning!
Thank you.
Contact:
Igor Kanovsky, [email protected],
http://www.yvc.ac.il/ik/