Link Farms Model - Institute of Network and Information Systems, Peking University (北京大学网络与信息系统研究所)


Link Analysis & Spam Detection
http://net.pku.edu.cn/~wbia
Huang Lian'en (黄连恩)
School of Information Engineering, Peking University
09/24/2013
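PageRank
Why and how does it work?
The "Random Walker" model

Imagine a person who browses the web without ever stopping, each time picking one of the outgoing links on the current page at random and following it. We ask: in the steady state (after a sufficiently long time), which page will this person be reading?
Equivalently: in the steady state every page v has a probability p(v) of being visited, and p(v) can serve as a measure of the page's importance.
We can reasonably suppose that the probability of arriving at v at a given moment depends on the probabilities of having been, one step earlier, at the pages that link to v, and on the number of hyperlinks on those pages.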
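Google Matrix

Let the surfer follow a hyperlink with probability (1-β) at each step, and with probability β restart from a randomly chosen new node.

Physically, this means the surfer can always jump to nodes with in-degree 0 and can always escape from "loops". In the model this is expressed as

$$p = (1-\beta) L^T p + \frac{\beta}{N} 1_N \, p = \left[ (1-\beta) L^T + \frac{\beta}{N} 1_N \right] p$$

where 1_N is the N×N all-ones matrix.

β is usually chosen between 0.1 and 0.2 and is called the damping factor here (Page & Brin 1997); the damping factor d = 0.85 of the original PageRank paper corresponds to 1-β.
G = (1-β) L^T + (β/N) 1_N is called the Google Matrix.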
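The definition of G can be turned into a few lines of code. The following is a minimal sketch, not the lecture's own implementation: the function name `google_matrix`, the adjacency-list input, and the 3-page graph are all hypothetical, and the handling of pages without outlinks is an assumption made so that every column of G sums to 1.

```python
def google_matrix(out_links, beta=0.15):
    """Build G = (1 - beta) * L^T + (beta / N) * 1_N as a list of rows.

    out_links[u] lists the pages that page u links to (pages are 0..N-1).
    Assumption: a page with no outlinks is treated as linking to every page.
    """
    n = len(out_links)
    g = [[beta / n] * n for _ in range(n)]        # teleport term (beta / N) * 1_N
    for u, targets in enumerate(out_links):
        targets = targets or list(range(n))       # dangling-page assumption
        share = (1.0 - beta) / len(targets)
        for v in targets:
            g[v][u] += share                      # entry (v, u) of (1 - beta) * L^T
    return g


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
G = google_matrix([[1, 2], [2], [0]])
column_sums = [sum(G[v][u] for v in range(3)) for u in range(3)]
```

Each column of G describes where a surfer on one page goes next, so the column sums give a quick sanity check on the construction.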
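Solving for the eigenvector of the Google Matrix

Power Iteration method:
Given the Google Matrix G, order its eigenvalues as |λ1| ≥ |λ2| ≥ …, and let q1 be the eigenvector belonging to λ1.
Initialize a vector p_0 with ||p_0||_1 = 1.
For k = 1, 2, …, perform the following steps:

x = G p_{k-1}          (basic iteration)
p_k = x / ||x||_1      (normalization step)

It can be shown that (convergence rate)

||p_k - q_1|| = O(|λ2/λ1|^k)

For the Google Matrix the iteration reads

$$p_{i+1} = \left[ (1-\beta) L^T + \frac{\beta}{N} 1_N \right] p_i$$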
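The two steps of the loop above (multiply, then normalize) can be sketched as follows. This is an illustrative sketch in plain Python; the name `power_iteration` and the 2x2 matrix are hypothetical and serve only to exercise the loop.

```python
def power_iteration(g, iterations=50):
    """Approximate the dominant eigenvector of square matrix g, scaled to L1 norm 1."""
    n = len(g)
    p = [1.0 / n] * n                                   # p_0 with ||p_0||_1 = 1
    for _ in range(iterations):
        # Basic iteration: x = G p_{k-1}
        x = [sum(g[i][j] * p[j] for j in range(n)) for i in range(n)]
        # Normalization step: p_k = x / ||x||_1
        norm = sum(abs(value) for value in x)
        p = [value / norm for value in x]
    return p


# Hypothetical column-stochastic matrix (each column sums to 1).
G_small = [[0.9, 0.5],
           [0.1, 0.5]]
p_small = power_iteration(G_small)
```

For this matrix the second eigenvalue is 0.4, so the error shrinks by a factor of 0.4 per step, which illustrates the |λ2/λ1|^k convergence rate.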


Example (power iteration)
[Matrix L for an 11-page example graph. The first row is uniform, with 1/11 in every column; the remaining rows have nonzero entries (1), (1), (1/2, 1/2), (1/3, 1/3, 1/3), (1/2, 1/2) four times, (1), and (1). The column positions of these entries are not recoverable here.]

Solving a small-scale example

Take β = 0.15
G = 0.85 · L^T + (0.15/11) · 1_N
p_0 = (1/11, 1/11, …)^T
p_1 = G p_0
…
You can try this in MATLAB

Solution by power iteration (50 iterations):
p = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, …)^T
Applications of PageRank Algorithm
Topic-Sensitive PageRank[1]
ODP-Biasing
$$S_{qd} = \sum_j P(c_j \mid q') \cdot rank_{jd}$$

$$p_{i+1} = \left[ (1-\beta) L^T + \frac{\beta}{N} 1_N \right] p_i$$
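The query-time score S_qd is a weighted sum of per-topic ranks. A minimal sketch, assuming the topic probabilities P(c_j | q') and the per-topic rank vectors have already been computed; the function name, the topic labels, and all numbers below are hypothetical.

```python
def query_sensitive_score(topic_probs, topic_ranks, doc):
    """S_qd = sum over topics j of P(c_j | q') * rank_jd for one document d."""
    return sum(prob * topic_ranks[topic][doc] for topic, prob in topic_probs.items())


# Hypothetical class probabilities for a query and hypothetical per-topic ranks.
topic_probs = {"sports": 0.75, "arts": 0.25}
topic_ranks = {
    "sports": {"d1": 0.4, "d2": 0.1},
    "arts": {"d1": 0.2, "d2": 0.6},
}
score_d1 = query_sensitive_score(topic_probs, topic_ranks, "d1")
score_d2 = query_sensitive_score(topic_probs, topic_ranks, "d2")
```

With these made-up numbers the query leans toward "sports", so d1 outranks d2 even though d2 has the higher "arts" rank.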


Results
[Figure: Precision @ 10 results for test queries]
[Figure: Ranking preferred by majority of users]
Pagerank for product image search[2]
Results
[Figure: Similarity graph generated from the top 1000 search results of "Mona-Lisa." The two largest images have the highest rank.]
References

[1] T. H. Haveliwala, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 784-796, 2003.
[2] Y. Jing and S. Baluja, "PageRank for product image search," in Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 2008, pp. 307-316.
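HITS (Hyperlink-Induced Topic Search)

In terms of design, a basic difference between HITS and PageRank is that HITS is tied to a specific query and is applied at query time, whereas PageRank is independent of the query.
Root set, R(q): the set of pages relevant to query q
Base set, V(q): R(q) plus the pages that point to elements of R(q) and the pages pointed to by elements of R(q)
Expanded set = V - R
Two concepts (intuitively meaningful)

AUTHORITY (authoritative page): a page with authoritative, high-quality content
HUB (directory-like page): a page that points to many authority pages
Authority and Hub scores

For each u ∈ V(q), two parameters are defined on page u: a[u] and h[u], representing its authority and its hub quality respectively.
Mutually recursive definition

The a value of a page u depends on the h values of the pages v that point to it
The h value of a page u depends on the a values of the pages v that it points to

$$a = E^T h$$
$$h = E a = E E^T h$$

where E is the adjacency matrix of the base-set graph (E_uv = 1 if u links to v).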
HITS (contd.)

High prestige (large in-degree) → high authority
Knowing many high-prestige pages (large out-degree) → strong hub
How to compute?
Power Iteration on:

$$a = E^T h = E^T E a$$
$$h = E a = E E^T h$$
The HITS algorithm (Topic Distillation)
1. Send the query to a text-based IR system and obtain the root set.
2. Expand the root set by radius one to obtain an expanded graph.
3. Run power iterations on the hub and authority scores together.
4. Report the top-ranking authorities and hubs.
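Step 3 of the procedure above can be sketched as a joint iteration over an edge list. This is an illustrative sketch only: the function name `hits`, the L1 normalization, and the 5-page base set are assumptions, and steps 1 and 2 (retrieval and expansion) are taken as already done.

```python
def hits(edges, n, iterations=50):
    """Joint power iteration for authority (a) and hub (h) scores.

    edges is a list of (u, v) pairs meaning page u links to page v.
    """
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        new_a = [0.0] * n
        for u, v in edges:
            new_a[v] += h[u]            # a = E^T h
        new_h = [0.0] * n
        for u, v in edges:
            new_h[u] += new_a[v]        # h = E a
        a_norm = sum(new_a) or 1.0      # guard against an empty edge list
        h_norm = sum(new_h) or 1.0
        a = [value / a_norm for value in new_a]
        h = [value / h_norm for value in new_h]
    return a, h


# Hypothetical base set: pages 0 and 1 act as hubs; page 1 also links to page 4.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (1, 4)]
authority, hub = hits(edges, 5)
```

Pages 2 and 3 are pointed to by both hubs and end up with equal, highest authority; page 1 links to one more authority than page 0 and gets the higher hub score.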
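PageRank & HITS

Because web pages link to one another, we can apply social network analysis methods to extract useful information from the structure of a collection of pages.
What information to extract? That depends on our understanding of the application goal, on a precise definition and efficient computation of the technical model, and on recognizing and assessing the errors that may arise.
PageRank and HITS are two classic examples: they share the same broad goal but approach it from different angles.

They should not be taken to mean that better (or more distinctive) new angles are impossible.
An essential weakness of both: they take the link relationships between pages "too seriously".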
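Questions to think about

1. The necessity of PageRank

What problems would arise if there were no PageRank?
2. PageRank vs. anchor text

Why are you able to find exactly the page you are looking for?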
Web Spam Detection
Web spam is one of the most significant challenges facing search engines today
What is web spam?

Spamming = any deliberate action taken solely to boost a web page's position in search engine results, incommensurate with the page's real value
Spam = web pages that are the result of spamming
This is a very broad definition

The SEO industry might disagree!
SEO = search engine optimization
Approximately 10-15% of web pages are spam
Types of Web Spam

Term spamming


Manipulating the text of web pages in order to appear
relevant to queries
Link spamming

Creating link structures that boost page rank or hubs
and authorities scores
Term Spamming

Repetition

of one or a few specific terms, e.g., free, cheap, viagra
Goal is to subvert TF.IDF ranking schemes
Dumping

of a large number of unrelated terms
e.g., copy entire dictionaries
Weaving

Copy legitimate pages and insert spam terms at random positions
Phrase Stitching

Glue together sentences and phrases from different sources
Link Spamming

Three kinds of web pages from a spammer's point of view

Inaccessible pages
Accessible pages
  e.g., web log comments pages
  spammer can post links to his pages
Own pages
  Completely controlled by spammer
  May span multiple domain names
Link Farms

Spammer's goal

Maximize the page rank of target page t
Technique

Get as many links from accessible pages as possible to target page t
Construct "link farm" to get page rank multiplier effect
Post comments on forums, blogs, and the like that link to one's own site
Exchange links
Honey pot
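Link Farms Model
[Figure: link farm structure showing Inaccessible, Accessible, and Own pages; the Own pages are the target page t and farm pages 1, 2, …, M]
One of the most common and effective organizations for a link farm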
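Analysis of the Link Farms Model
[Figure: the link farm structure (Inaccessible, Accessible, Own; target page t; farm pages 1, 2, …, M)]
Suppose rank contributed by accessible pages = x
Let page rank of target page = y
Rank of each "farm" page = (1-β)y/M + β/N

$$\begin{aligned}
y &= x + (1-\beta) M \left[ \frac{(1-\beta) y}{M} + \frac{\beta}{N} \right] + \frac{\beta}{N} \\
  &= x + (1-\beta)^2 y + \frac{\beta (1-\beta) M}{N} + \frac{\beta}{N}
\end{aligned}$$

The final term β/N is very small; ignore it. Solving for y:

$$y = \frac{x}{2\beta - \beta^2} + c \frac{M}{N}, \qquad c = \frac{1-\beta}{2-\beta}$$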
Analysis of the Link Farms Model (contd.)

$$y = \frac{x}{2\beta - \beta^2} + c \frac{M}{N}, \qquad c = \frac{1-\beta}{2-\beta}$$

For β = 0.15, 1/(2β-β²) = 3.6, c = 0.46
Multiplier effect for "acquired" page rank
By making M large, we can make y as large as we want
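The closed form for y and the two constants quoted for β = 0.15 can be checked numerically. A small sketch; the function name `farm_target_rank` and the sample values of x, M, and N are hypothetical.

```python
def farm_target_rank(x, m, n, beta=0.15):
    """Closed form y = x / (2*beta - beta**2) + c * m / n, with c = (1 - beta) / (2 - beta)."""
    c = (1.0 - beta) / (2.0 - beta)
    return x / (2.0 * beta - beta ** 2) + c * m / n


beta = 0.15
multiplier = 1.0 / (2.0 * beta - beta ** 2)   # multiplier applied to the acquired rank x
c = (1.0 - beta) / (2.0 - beta)               # coefficient of M / N

# Hypothetical farm: x = 0.001, M = 1000 farm pages, N = 1,000,000 pages in total.
x, m, n = 0.001, 1000, 1_000_000
y = farm_target_rank(x, m, n)

# The closed form should satisfy the balance equation with the tiny beta/N term dropped.
residual = y - (x + (1.0 - beta) ** 2 * y + beta * (1.0 - beta) * m / n)
```

Rounded, `multiplier` and `c` reproduce the 3.6 and 0.46 quoted above, and `residual` is zero up to floating-point error.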
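Consequences of Link Farms

Produce a large amount of useless information

Produce a large amount of false information

Adversely affect information analysis
Seriously affect crawling systems

Generate large numbers of pages
Generate large numbers of links
Generate large numbers of domain names
Create crawler traps
Consume large amounts of resources
Seriously affect PageRank!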
Methods for Detecting Web Spam

Term spamming




Analyze text using statistical methods e.g., Naïve
Bayes classifiers
Similar to email spam filtering
Also useful: detecting approximate duplicate pages
Link spamming


Statistics-based detection methods
Link-analysis-based detection methods: TrustRank, BadRank
Statistics-Based Web Spam Detection

Nature seems to create bell curves (range around an average)
Human activity seems to create power laws (popularity skewing)
[Figure: Number of Web page in-links (Broder+)]
More Examples
frequency of words
protein-interaction degree distribution
Internet (AS) degree distribution
severity of inter-state wars
severity of terrorist attacks
frequency of bird sightings
size of blackouts
book sales
population of US cities
size of religions
number of citations
papers authored
popularity of surnames
number of web hits
number of web links, with cut-off
number of phone calls
size of email address book
number of species per genus
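Detection method

Underlying premise: spam pages are often generated automatically by machines and differ from naturally created pages, so they tend to show up as outliers in statistical features. Pages that fall in the intersection of the outlier sets of several features are very likely spam.
Features examined:

URL features, domain name resolution
In-degree and out-degree, page content
Evolution over time, clustering characteristics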
Web page out-degrees
There are 158,290 pages with out-degree 1301,
while according to the overall trend
only 1,700 such pages are expected.
Web page in-degrees
There are 369,457 pages with in-degree 1001,
while according to the trend
only 2,000 such pages are expected
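Spikes like these can be found by comparing each observed count with a fitted power-law trend. The sketch below is one simple way to do that and is not necessarily the method behind the figures quoted above: it fits a least-squares line in log-log space and flags degrees whose count exceeds the fit by a chosen factor. The function names, the factor of 10, and the synthetic histogram are assumptions; only the spike value 158,290 at out-degree 1301 comes from the slide.

```python
import math


def fit_power_law(counts):
    """Least-squares line through (log degree, log count); returns (slope, intercept)."""
    xs = [math.log(degree) for degree in counts]
    ys = [math.log(count) for count in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def degree_outliers(counts, ratio=10.0):
    """Degrees whose observed page count exceeds the fitted trend by the given factor."""
    slope, intercept = fit_power_law(counts)
    flagged = []
    for degree, observed in counts.items():
        expected = math.exp(intercept + slope * math.log(degree))
        if observed > ratio * expected:
            flagged.append(degree)
    return flagged


# Synthetic histogram following count = 1e9 / degree^2 (an assumption),
# with the slide's spike planted at out-degree 1301.
degrees = (10, 20, 50, 100, 200, 500, 1000, 1301, 2000, 5000)
histogram = {d: 1e9 / d ** 2 for d in degrees}
histogram[1301] = 158290
suspicious = degree_outliers(histogram)
```

Because the outlier itself takes part in the fit, a robust fit or a larger histogram would be preferable in practice; the sketch only shows the idea of measuring deviation from the overall trend.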
Spammers are studious!
Web Spam Detection Based on Link Analysis
The Idea Behind TrustRank

Basic principle: approximate isolation



It is rare for a “good” page to point to a “bad” (spam)
page
Sample a set of “seed pages” from the web
Have an oracle (human) identify the good pages
and the spam pages in the seed set

Expensive task, so must make seed set as small as
possible
Building the seed set

The notion of human checking of a web page is represented by the Oracle function:

$$O(p) = \begin{cases} 0 & \text{if } p \text{ is bad}, \\ 1 & \text{if } p \text{ is good}. \end{cases}$$

Oracle invocations are expensive, so one should strive to minimize them
To evaluate pages without calling O, it is necessary to estimate the probability that p is good
The Trust function T yields a range of values between 0 (bad) and 1 (good)
Ideally,

$$T(p) = \Pr[O(p) = 1].$$
Trust propagation

Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are denoted as good
[Figures: example graph of pages 1-7 with good pages and bad pages marked]

S = {1, 3, 6}: set of seed pages
M = 1..3: maximum path length
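The M-step rule amounts to a bounded breadth-first search from the good seeds. A minimal sketch; the function name and the edge list are hypothetical and do not reproduce the edges of the figure.

```python
from collections import deque


def good_within_m_steps(out_links, good_seeds, m):
    """Pages reachable from any good seed by a path of at most m links."""
    distance = {seed: 0 for seed in good_seeds}
    queue = deque(good_seeds)
    while queue:
        page = queue.popleft()
        if distance[page] == m:
            continue                      # do not expand beyond m steps
        for target in out_links.get(page, []):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return set(distance)


# Hypothetical 7-page graph (edges are assumptions, not taken from the figure).
out_links = {1: [2], 2: [3, 4], 4: [5], 5: [6, 7]}
reachable = good_within_m_steps(out_links, [1], 2)
```

Raising m marks more pages as good but also raises the chance of reaching a bad page, which motivates the attenuation rules that follow.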
Rules for trust propagation

Trust attenuation


The degree of trust conferred by a trusted page
decreases with distance
Trust splitting


The larger the number of outlinks from a page, the
less scrutiny the page author gives each outlink
Trust is “split” across outlinks
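Trust attenuation

We cannot be absolutely sure that pages reachable from good seeds are indeed good
The further away we are from a good seed, the less certain we are that a page is good
Trust dampening
β: dampening factor
[Figure: trust dampening along a path 1 → 2 → 3; each link carries a factor β, so page 3 receives β² from page 1]
Trust splitting
[Figure: trust splitting. t(1)=1 is split into 1/2 per outlink and t(2)=1 into 1/3 per outlink; page 3 receives 1/2 + 1/3, so t(3)=5/6, passed on as 5/12 per outlink]
Can be combined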
Simple model

Suppose trust of page p is t(p)

Set of outlinks O(p)
For each q in O(p), p confers the trust βt(p)/|O(p)| for 0<β<1
Trust is additive

Trust of p is the sum of the trust conferred on p by all its inlinked pages
Note similarity to Topic-Specific Page Rank

Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set

$$t = \beta \cdot L^T \cdot t + (1-\beta) \cdot d / |d|$$

Here β is the probability of following a link, so it plays the role that (1-β) plays in the Google Matrix.
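The recurrence for t can be iterated directly. The sketch below assumes β = 0.85 as the link-following probability (the slides do not fix a value here) and uses a hypothetical 4-page graph; the function name `trust_rank` is also an assumption.

```python
def trust_rank(out_links, good_seeds, beta=0.85, iterations=50):
    """Iterate t = beta * L^T t + (1 - beta) * d / |d| with d concentrated on good seeds."""
    n = len(out_links)
    d = [1.0 / len(good_seeds) if page in good_seeds else 0.0 for page in range(n)]
    t = d[:]
    for _ in range(iterations):
        nxt = [(1.0 - beta) * weight for weight in d]          # teleport to trusted pages only
        for page, targets in enumerate(out_links):
            for target in targets:
                # Dampening (factor beta) and splitting (divide by the number of outlinks).
                nxt[target] += beta * t[page] / len(targets)
        t = nxt
    return t


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2. Page 0 is the only good seed;
# page 3 has no trusted in-links.
out_links = [[1, 2], [2], [], [2]]
scores = trust_rank(out_links, {0})
```

Page 3 receives no trust because nothing trusted links to it. Pages without outlinks simply drop their trust in this sketch, which is compatible with the "within a scaling factor" remark above.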
TrustRank in Action

Select seed set using inverse PageRank
s=[2, 4, 5, 1, 3, 6, 7]
Invoke L(=3) oracle functions
Populate static score distribution vector
d=[0, 1, 0, 1, 0, 0, 0]
Normalize distribution vector
d=[0, 1/2, 0, 1/2, 0, 0, 0]
Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting
RESULTS [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
[Figure: the 7-page example graph annotated with these scores]
Choosing the best seed set

Two conflicting considerations

A human has to inspect each seed page, so the seed set must be as small as possible
Must ensure every "good page" gets adequate trust rank, so we need to make all good pages reachable from the seed set by short paths
Approaches to picking seed set


Suppose we want to pick a seed set of k pages
PageRank



Pick the top k pages by page rank
Assume high page rank pages are close to other highly
ranked pages
We care more about high page rank “good” pages
Inverse page rank

Pick the pages with the maximum number of outlinks
Can make it recursive

Pick pages that link to pages with many outlinks
Formalize as "inverse page rank"

Construct graph G' by reversing each edge in web graph G
Page Rank in G' is inverse page rank in G
Pick top k pages by inverse page rank
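Inverse page rank needs only a graph reversal in front of an ordinary PageRank computation. A minimal sketch under stated assumptions: β = 0.15 is the teleport probability, pages without outlinks are treated as linking to every page, and the function names and the 4-page graph are hypothetical.

```python
def pagerank(out_links, beta=0.15, iterations=50):
    """Plain PageRank by power iteration; beta is the teleport probability."""
    n = len(out_links)
    p = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [beta / n] * n
        for page, targets in enumerate(out_links):
            targets = targets or list(range(n))      # dangling-page assumption
            for target in targets:
                nxt[target] += (1.0 - beta) * p[page] / len(targets)
        p = nxt
    return p


def reverse_graph(out_links):
    """Construct G' by reversing each edge of G."""
    reversed_links = [[] for _ in out_links]
    for page, targets in enumerate(out_links):
        for target in targets:
            reversed_links[target].append(page)
    return reversed_links


def pick_seeds(out_links, k):
    """Top k pages by inverse page rank, i.e. PageRank computed on G'."""
    inverse = pagerank(reverse_graph(out_links))
    return sorted(range(len(out_links)), key=lambda page: -inverse[page])[:k]


# Hypothetical graph: page 0 links to every other page; 1 -> 2 and 2 -> 3.
out_links = [[1, 2, 3], [2], [3], []]
seeds = pick_seeds(out_links, 1)
```

Page 0 can reach every other page in one step, so it collects the most rank in the reversed graph and is chosen first.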
Question: TrustRank vs. PageRank

Could TrustRank be used in place of PageRank?
What are the advantages?
What are the disadvantages?
If Google had used TrustRank from the start, would the Link Farms problem still have arisen?
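Question: TrustRank + PageRank

An effective method for detecting Link Farms

First compute an ordinary PageRank value
Next compute a TrustRank value
Then subtract TrustRank from PageRank
Pages with large resulting values are the target pages of Link Farms
Why?
See the paper "Link spam detection based on mass estimation"
Consider: how can TrustRank and PageRank be combined to obtain a more reasonable ranking?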
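The subtraction can be tried on a toy graph. This is a rough sketch only: it runs one biased-PageRank routine twice (uniform teleport for PageRank, teleport to a trusted seed for TrustRank) and ignores the scaling that the cited paper treats more carefully. The function name, the 0.85 follow probability, and the graph are assumptions.

```python
def biased_pagerank(out_links, teleport, follow=0.85, iterations=50):
    """Iterate score = follow * L^T score + (1 - follow) * teleport."""
    score = teleport[:]
    for _ in range(iterations):
        nxt = [(1.0 - follow) * weight for weight in teleport]
        for page, targets in enumerate(out_links):
            for target in targets:
                nxt[target] += follow * score[page] / len(targets)
        score = nxt
    return score


# Hypothetical graph: page 0 is a trusted seed linking to page 1;
# farm pages 3, 4, 5 all point at target page 2, which links back to them.
out_links = [[1], [], [3, 4, 5], [2], [2], [2]]
n = len(out_links)
page_rank = biased_pagerank(out_links, [1.0 / n] * n)
trust_rank = biased_pagerank(out_links, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
gap = [p - t for p, t in zip(page_rank, trust_rank)]
most_suspicious = max(range(n), key=lambda page: gap[page])
```

The farm target collects a large PageRank from its own farm but no trust from the seed, so its gap is the largest; pages supported by trusted links have a small or negative gap.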
Summary of this lecture

Applications of PageRank

Topic-sensitive: biased ranking
Image ranking: turning a selection problem into a ranking problem
HITS algorithm

AUTHORITY and HUB
Web Spam detection

Link Farms model
Statistics-based detection methods
TrustRank
Thank You!
Q&A
HOMEWORK

Argue whether Topic-Sensitive PageRank and TrustRank are equivalent.

References:

[1] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating Web Spam with TrustRank. Tech. rep., Stanford University, 2004.
[2] Z. Gyongyi and H. Garcia-Molina. Seed selection in TrustRank. Tech. rep., Stanford University, 2004.