ppt

Transcript ppt

Web Graph & Link Analysis
http://net.pku.edu.cn/~wbia
彭波
[email protected]
北京大学信息科学技术学院
9/27/2010
上一讲主要内容

Crawler面临的难题


实现高效率的基本技术




Scalable, fast, polite,
robust, continuous
Cache
Prefetch
Concurrency
 多进程/多线程
 异步I/O
有趣的技术


Bloom filter
Consistent Hashing
Web Graph
http://www.touchgraph.com/TGGoogleBrowser.html
Giant Global Graph
图论、线性代数若干概念回顾

图，有向图，邻接矩阵，节点的度(degree)，两节
点间的距离（d），图的直径（r），图的连通性，
有向图的强连通分支，连通分支



d(u,v)：从u到v的最短路径的长度
r(G)：最大的距离
矩阵（A），矩阵的转置（AT），行列式（|A|），
线性相关性，矩阵的秩，特征值，特征向量，特征
子空间
Ax  x,
(I  A) x  0
Homework





求以下矩阵的特征值和特征向量
2 3 -1 -4
0 -1 -2 1
A=
0 1 2 -2
0 1 1 2
本次课大纲





Web 有多大？
Web的连通性如何？
Web上节点的分布如何？
Web上节点距离有多远？
Web上节点重要度如何度量？
Web 有多大？
Web的大小—网页总数


图大小不可知，也无法定义
估计Web图节点数的下界


搜索引擎索引的网页数(crawled pages)
 例如CNNIC中国互联网网页调查报告
能更逼近真实值吗？
Capture-Recapture Model






Unknown number of fish in a lake
Catch a sample and mark them
Let them loose
Recapture a sample and look for marks
Estimate population size
n1 = number in first sample 15
n2 = number in second sample 10
n12 = number in both samples 5
N = total population size
assume that
n1/N = n12/n2 therefore 15/N
= 5/10
N = (10 x 15) / 5 = 30
Web的大小

Estimate indexed web low-bound by analysis
overlap of Search Engine (Steve Lawrence and C.
Lee Giles,1998*)




选择6个流行的 search engine, 假设它们索引页面之间
的 independency
Sampling：通过575个查询对这些SE采样，分析它们之
间的overlap
用overlap来估计各个SE所覆盖的 indexable Web的大小
利用已知某个SE的页面数，来估计整个Web的大小
Overlap analysis


P(a) =na/N = n0/nb
N = na*nb/n0


During the test time,
HotBot report indexed
110 miliion page
lower bound on the
size of the indexable
Web of 320 million
pages. (1998)
Web的连通性如何？
Web的形状

A large scale study (Altavista crawls)
reveals interesting properties of web
(Andrei Broder ,1999)



Study of 200 million nodes & 1.5 billion
links
Some parts unreachable, Others have long
paths
found Bow-tie Structure
Bow-tie Components

Strongly
Connected
Component (SCC)


Upstream (IN)



Core can’t reach
IN
Downstream
(OUT)


Core
OUT can’t reach
core
Disconnected
Tendrils & Tubes
Component Properties


Each component is roughly same size
问题：
 ~50 million nodes
在这样一个巨大的图(200M nodes, 1.5G edges)
 Probability of a path between any 2 nodes ~1 quarter
上，Diameter怎么计算出来的？
(0.24)
Diameter, maximal minimal path length(?)



Maximal and average diameter is infinite
28 for SCC, 500 for entire graph
Average length


16 (directed path exists), 7 (undirected)
Shortest directed path between 2 nodes in SCC: 16-20
links on average
Web上节点的分布如何？
站点入度分布

会是下面哪一种情况？
80/20 原则

Pareto principle : for many
phenomena, 80% of the
consequences stem from 20% of the
causes. (Vilfredo Pareto)
scale free
Long tail
Power law
Power law




P(x=k)=CK-λ
A line appears on a log-log plot
rare events are not so rare!
Long tail
Power Law Size and Connectivity

站点大小Site Sizes（以页面数量计算）服从
power law 分布



跨越不同的规模
g 在1.6-1.9之间
节点的度connections per node 服从power law 分
布



Study at Notre Dame University reported
g = 2.45 for outdegree distribution
g = 2.1 for indegree distribution
Power Law Distribution -Examples

From Graph structure in the web, (by altavista
crawl,1999)
Random Graph .vs. Power Law Graph

Random graphs have Poisson distribution if p
is small.




Random uniform graph with random independent
edges of fixed probability p
P(x=k)= e-λ * λk/k!
Decays exponentially fast to 0 as k increases
towards its maximum value n-1
Power law graphs


Decays polynomially for large values
Power law graph  emerging order in a large
graph created by many agents
Examples with Power Law Networks

Examples of networks with Power Law
Distribution






Internet at the router and interdomain level
Citation network
Collaboration network of actors
Networks associated with metabolic pathways
Networks formed by interacting genes and proteins
Network of nervous system connection in C. elegans
Web上节点距离有多远？
What does this mean?


Size: 200M nodes, 1.5G edges
Average length: 16 (directed path exists), 7
(undirected)
Huge graph with small distance
It’s a small world
小世界网络

It is a ‘small world’



Millions of people. Yet,
separated by “six degrees” of
acquaintance relationships
Popularized by Milgram’s
famous experiment
Mathematically


Diameter of graph is small (log
N) as compared to overall size
Property seems interesting
given ‘sparse’ nature of graph
but … This property is ‘natural’
in ‘pure’ random graphs
The small world of WWW

Empirical study of Web-graph reveals small-world
property



Graph generated using power-law model
Diameter properties inferred from sampling
 Calculation of max. diameter computationally
demanding for large values of n
Average distance (d) in simulated web:
d = 0.35 + 2.06 log (n)
e.g. n = 109, d ~= 19
Implications for Web

Logarithmic scaling of diameter makes future
growth of web manageable

10-fold increase of web pages results in only 2 more
additional ‘clicks’, but …
Robustness and vulnerability

How diameter or connectivity affected by deleting
nodes randomly?


Specific nodes are targeted?


Scale-free graph are more robust than random uniform
graph
Diameter doubling when 5% important nodes removed
Topology changes under attack?


Fragment and break down
Phrase change for deletion ratio : 0.28 in exponential
graph & 0.18 in a scale-free network
Web上节点重要度如何度量？
对网页重要性的评价


Larry Page & Sergey Brin
Jon Kleinberg
PageRank算法，HITS（Hyperlink
Induced Topic Search）算法
都是为了利用HTML网页的链接特
点，改善查询的效果

当Spam页面淹没了search engine的
搜索结果页面时，除了页面内容与
查询的相关性以外，页面本身的质
量/重要性的作用就显现出来
重要度的度量
刘翔

一阶指标（“入度”）



Paul Erdös
知晓关系：社会知名度
引用关系：认可程度
“高阶指标”

和一个著名人物“共同发表”论文的“距离”：越短似
乎显得越“有荣誉”（例如，Erdos number，）
PageRank
Why and how it works?
谁重要一些？
如何用一个模型来刻画这种
感觉，使算出来的“重要性”
反映这种感觉？


认识甲的人可能和认识乙的人一样多，但认识乙的
人都是些“重要人物”，于是通常会认为乙比甲重
要
不仅是人，论文也是一样，被重要的文章引用的文
章可能就比较重要些
声望模型Reputation Model


给定一个群体S，及其在上面的一个“知晓”关系
R，于是定义了一个有向“关系图”G。用邻接矩
阵E表示，E(i,j)=1，当且仅当i “听说过” j（注意这
里没有程度之分）。我们希望确定p(i)：所有个体i∈S
的“声望”
模型一：p(i) = ∑E[k,i]，k=1,…,n，即i在G上的
“入度”，亦即E的第i列的1的个数


清楚、好计算；但是“不够好”
模型二：p(i) = ∑E[k,i]p(k)，k=1,…,n，即i的声望
等于知晓他的人的声望之和

清楚、显得要更“精确些”；但是，好计算吗？
声望模型二


对于所有i，p(i) = ∑E[k,i]p(k)，k=1,…,n
也就是，记p = (p(1), p(2), …, p(n))T,
p = ET p

问题是：



这个方程存在解吗？
如果存在，如何得到？
如果不存在，该怎么办？
一般来讲：这
个方程的非0解
是不存在的！
p = ETp 的不存在例
1
2

3
 0 0 0
T


E  1 0 0


1 1 0



S = {1,2,3}, R =
{<1,2>,<1,3>,<2,3>}
E = ((0,1,1),(0,0,1),(0,0,0))
ET = ((0,0,0),(1,0,0),(1,1,0))
不难看到，方程的成立
p(1)=0p(2)=0p(3)=0
一般来讲，p = ETp，意味着要求ET有特
征值1，这是很难得的。
先前那4个点的例子也无解



p = ETp  (I － ET)p = 0
线性代数讲，此方程组有非0解，仅当行列式|I －
ET| = 0
但我们算得|I － ET| = 2
即使有解，还有可能不唯一！


S = {1,2,3}, R = {<1,2>,<2,3>,<3,1>}
不难看出任何 p(1) = p(2) = p(3) 都是解
怎么办？
PageRank


对每一篇网页，得到一个独立于
查询词的相对“重要性”指标，
将这个指标和查询匹配情况结合
起来（以及其他因素），形成网
页的排序。
“重要性”

基本思想是前面讨论的“声望”模
型
“Random Walker”模型



设想有一个永不休止、在网上浏览网页的人,随机
选择一个链出的链接继续访问。我们问，在稳态情
况下（足够长时间后），他会正在看哪一篇网页呢？
等价于：稳态情况下，每个网页v会有一个被访问
的概率，p(v)，它可以作为网页的重要程度的度量。
我们可以合理地设想：此时到达v的概率，依赖于
上一个时刻到达“链向”v的网页的概率，以及那
些网页中超链的个数。
Random walker model
u3
u1
V
u4
u2
稳定时：

p(v) = ∑E[u,v]*p(u)/du, over u


u5
这里，du是网页u的“出度”，∑E[u,v] over u。
∑p(u) = 1
Random Walker Model (continue)

改写一下，成
E[u , v]
p (v )  
 p (u )
du
u
E[u, v]
令，L[u, v] 
du
于是，p [v]   L[u, v]  p[u ]
形式上和
“声望”模
型一样，只
是矩阵L有行
向量元素和
为1的性质。
有用吗？
T
或者写成矩阵形式，p  L  p 且有 | p | 1
u
Dangling Node（出度为0的节点）
对于这些节点，矩阵L对应着元素全0的行，元素和不为1
修正：L[u,v] = 1/N if du=0
Stochastic matrix




矩阵M，元素非负，每个行向量元素之和分别都
等于1（亦称马尔科夫转移矩阵）
L就是这种矩阵()
显然，随机矩阵的最大特征值为1，对应有一个全
1元素的特征向量
转置矩阵的行列式和原矩阵的行列式相等
M x x
| I  M | 0
| I  M | 0
T
于是1也就是LT最
大的特征值！
还有一点问题


上述“随机浏览”模型有稳态
解的条件是：由网页形成的有
向图允许通过链接关系访问到
每一个网页
但有两个情况是破坏这条件的



图中形成“圈”(rank bounce)
有入度或者出度为0的点(rank
sink)
因此该模型的表述通常要求所
形成的图是irreducible（强连
通）和aperiodic（不能有进去
后出不来的圈）。
继续修改模型

让这浏览者每次以一定的概率（1-β）沿着超链
走，以概率（β）重新随机选择一个新的起始节
点

这在物理意义上即总是有可能跳进入度为0的点，跳
出那些“圈”。在模型表达上即为



T


p  (1   ) L p  1N p   (1   ) L  (1N )  p
N
N


T

β选在0.1和0.2之间，被称作damping factor(Page & Brin
1997）
G=(1-β)LT+ β/N(1N) 被称为Google Matrix
Google Matrix特征向量求解




Power Iteration方法：
给定Google Matrix G，记|λ1| ≥|λ2| ≥…,q1
是属于λ1的特征向量
初始化向量p0，使得||p0||1=1
对于k = 1, 2, …,执行如下步骤



x = Gpk-1，
pk = x/||x||1，
基本迭代
规格化步骤
可以证明（收敛速度）

|pk – q1| = O(|λ2/λ1|k)



T
pi 1   (1   ) L  (1N )  pi
N


问题

GoogleMatrix的Power Iteration求解特征向量算法





一定会经过有限步迭代终止吗？
一定会得到有意义的解吗？（正解并且||p||1=1）
一定会得到唯一的解吗？
不管初始值P0如何，都会收敛到相同解吗？
是否需要很多次迭代才能收敛呢？
Amy Langville, Carl Meyer,
Google's PageRank and Beyond:
The Science of Search Engine Rankings.
Princeton University Press, 2006.
例子（power iteration）
1 / 11




 1/ 2


L








1 / 11 1 / 11 1 / 11
1
1
1/ 2
1/ 3
1/ 3
1/ 2
1/ 2
1/ 2
1/ 2
1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11






1/ 3


1/ 2

1/ 2


1/ 2

1/ 2


1

1

小规模数据求解




β取0.15
G= 0.85*LT+0.15/11(1N)
P0=(1/11,1/11,….)T
P1=GP0
You can try this in
MatLab
 ...
 。。。。。。。



Power Iteration求解得(迭代50次)
P=(0.033,0.384,0.343,0.039,0.081,
0.039,0.016……)T
State-of-art PageRank Technology
Topic-Sensitive PageRank[1]
ODP-Biasing
S qd   P(c j q ')  rank jd
j



T
pi 1   (1   ) L  (1N )  pi
N


Results
Precision @ 10 results
for test queries
Ranking preferred by majority of
users
Pagerank for product image search[2]
Results

Similarity graph generated from the top1000
search results of “Mona-Lisa.” The largest two
images contain the highest rank.
Results
References


[1] H. H. Taher, "Topic-Sensitive PageRank: A
Context-Sensitive Ranking Algorithm for Web
Search," IEEE Transactions on Knowledge and
Data Engineering, vol. 15, pp. 784-796, 2003.
[2] Y. Jing and S. Baluja, "Pagerank for product
image search," in Proceeding of the 17th
international conference on World Wide Web,
Beijing, China, 2008, pp. 307-316.
本次课小结

Web Graph的性质





Graph Link Analysis




T
pi 1   (1   ) L  (1N )  pi
N


Capture-recapture
Bow-tie structure
Power law
Small world



声望模型
Random Walker模型
PageRank算法
HITS算法
第一次编程作业



时间：10月25日截止
内容：crawler和graph link analysis
1。heritrix系统使用


要求：配置、安装Heritrix，抓取指定的网站:
http://www.ccer.pku.edu.cn
2。heritrix系统代码分析



要求：按Week2的web crawler系统结构，寻找Heritrix
系统里面的crawler的两个主要模块：
isUrlVisited，politeness
考察它们的主要实现技术，写一个简短的报告文档。

3。搜集web数据的graph link analysis

要求：回答以下问题，并给出方法的说明

这个网站有多少网页？

入度、出度分布情况如何？

top 10的最重要页面是哪些？
以小组为单位，大家需要合作完成，



小组blog里，最少每周写一次进展报告
最后提交的报告里，写上每个同学的分工
Thank You!
Q&A
阅读材料
Steve Lawrence




Dr. Lawrence received a B.Sc. (summa
cum laude) and B.Eng. (summa cum
laude) from the Queensland University
of Technology, Australia, and a Ph.D.
from the University of Queensland,
Australia.
Steve Lawrence is a Senior Staff
Research Scientist at Google. Before
joining Google, he was a Senior
Research Scientist at NEC Research
Institute in Princeton
citeseer http://citeseer.ist.psu.edu/
the creator of Google Desktop Search
Andrei Broder

Andrei Broder has a B.Sc. from Technion,
Israel Institute of Technology and a M.Sc.
and Ph.D. in Computer Science from
Stanford University. He is a member of the
research staff at Digital Equipment
Corporation's Systems Research Center in
Palo Alto, California. His main interests are
the design, analysis, and implementation of
probabilistic algorithms and supporting data
structures, in particular in the context of
Web-scale applications.
Larry Page & Sergey Brin



Google founders Larry Page and
Sergey Brin
The Anatomy of a Large-Scale
Hypertextual Web Search
Engine (www7)
PageRank, an Eigenvector based
Ranking Approach for Hypertext
(sigir98)
Zipf’s law

Sites rank ordered by their popularity
Co-citation分析

给定一个文献的集合，希望表达这些文献两两被同
时（同一篇文章）引用的情况


coc[i,j]越大，表示这两篇文章的相关性越强
形成文章之间的邻接矩阵E，使得E[i,j]=1，当且仅
当文章i引用了j；否则E[i,j]=0。



这意味着，E的第i列反映文章i被引用的情况；
同时引用文章i和文章j的文章数量等于E[*,i]和E[*,j]在相
同的行出现1的个数。考虑到E元素的{0,1}特性，即
coc[i,j]=∑E[k,i]E[k,j], k=1,2,…,n
或者coc
= ETE
文章的引用矩阵E
coc[i,j]=∑E[k,i]E[k,j]
1
i
j
1
1
1
1
i
j
coc[i,j]= ETE
i
1
1
1
1
j
修改模型

模型三：让i的声望等于知晓他的人的声望之和乘
以一个常数（对所有i相同）


p = c*ETp
与模型二的关系



p(i) = c×∑E[k,i]p(k)，k=1,…,n
效果上感觉应该差不多，因为是“共同的常数”，而对
我们有意义的只是“相对声望”
但并不完全等价!
还是要问：

非0解存在吗？如果存在，如何计算？
p = c*ETp

这就是特征值、特征向量的定义方程


注意到c只需要在一个系统中保持常量，不同的系统是
可以不一样的，1/c就是ET的特征值，可以随p同时求出
但这问题就来了！




解的存在性
ET最多可能有n个不同的特征值
如果是有多个不同的特征值，取那一个为好？
 不同的特征值对应有不同的特征向量，我们没有理由
认为这不同的特征向量反映出来的节点声望序是一致
的
即使是同一个特征值，对应的特征子空间中也可能有多
个向量（我们也没理由认为它们反映出来的节点声望序
是一致的），应该取哪一个？
还有，特征值、特征向量不是实数怎么办？
The Perron-Frobenius Theorem


如果有向图G是强连通的，则它的邻接矩阵A有一个
唯一的元素全为正实数的特征向量v，且该特征向
量属于模最大的特征值λ。
注：




这个特征向量的唯一性成立在忽略常数因子前提下
由于A是非负的，v是非负的，因此λ也一定是正实数
注意“强连通”的条件需要满足
Stephane Gaubert, et al, “The Perron-Frobenius
Theorem for Homogeneous, Monotone Functions,”
HP Lab, Bristol, 2001.
Random Walker Model (continue)

改写一下，成
E[u, v]
pk 1 (v)  
 pk (u )
du
u
E[u , v]
令，L[u , v] 
du
于是，pk 1[v]   L[u , v]  p[u ]
u
形式上和
“声望”模
型一样，只
是矩阵L有行
向量元素和
为1的性质。
有用吗？
或者写成矩阵形式，pk 1  LT  pk
Dangling Node（出度为0的节点）
对于这些节点，矩阵L对应着元素全0的行，元素和不为1
修正：L[u,v] = 1/N if du=0
但我们还得解决实际问题：修改模型

让这浏览者每次以一定的概率（1-β）沿着超链
走，以概率（β）重新随机选择一个新的起始节
点

这在物理意义上即总是有可能跳进入度为0的点，跳
出那些“圈”。在模型表达上即为



T


pi 1  (1   ) L pi 
1N pi   (1   ) L  (1N )  pi
N
N


T

剩下来的迭代就可用 power iteration, β选在0.1和0.2之间
迭代最后得到的稳定P就是PageRank（Page & Brin 1997）
HITS





从设计思想来看，HITS和PageRank的一个基本
区别是HITS针对具体查询、应用在查询时间，而
PageRank是独立于查询的
Root set, R(q): 和查询q相关的网页集合
Base set, V(q): 除了R(q)外，还包括指向R(q)元
素和被R(q)元素指向的网页
Expanded set = V - R
两个概念（直觉上有意义）


AUTHORITY（权威型网页）：内容权威，质量高的网
页
HUB（目录型网页）：指向许多authority网页的网页
Authority and Hub scores


针对u∈V(q)，在每个网页u上定义有两个参数：
a[u]和h[u]，分别表示其权威性和目录性。
交叉定义


一个网页u的a值依赖于指向它的网页v的h值
一个网页u的h值依赖于它所指的网页v的a值
a  E h
T
h  E a  E E h
T
HITS (contd.)



声望高的（入度大） 权威性高
认识许多声望高的（出度大）目录性强
如何计算？
Power Iteration on:
a  E h  E E a
T
T
h  E a  E E h
T
HITS算法过程（Topic Distillation）
1.
2.
3.
4.
Send query to a textbased IR system and
obtain the root-set.
Expand the root-set
by radius one to
obtain an expanded
graph.
Run power iterations
on the hub and
authority scores
together.
Report top-ranking
authorities and hubs.
PageRank & HITS



网页的相互链接特性，使得我们可以应用社会网络
分析的方法来从网页集合的结构中提炼有用的信息
提炼什么信息？取决于我们对应用目标的认识，还
取决于有关技术模型的精确定义、有效计算，以及
对可能产生误差的认识和评估
PageRank和HITS是两个经典的例子，大目标一致，
切入点不同。


它们不应该意味着不可能有更好的（或者更有特色的）
新角度
它们的一个本质缺陷是：将网页之间的链接关系“太当
真”。
Overlap analysis

measure the relative sizes of search engine
indices (Andrei Broder,1998)




sample uniformly from any set and check membership
in any set
Pr(A & B|A) = Size(A & B)/Size(A)
Pr(A & B|B) = Size(A & B)/Size(B),
Size(A)/Size(B) = Pr(A & B|B) / Pr(A & B|A)
Web图大小——边的数目

Web图十分稀疏(sparse)


|E| = O(n) or at least o(n2)
Average number of hyperlinks per page roughly a
constant

ppt

Transcript ppt

Directory