WBIA Review
http://net.pku.edu.cn/~wbia
黄连恩
[email protected]
北京大学信息工程学院
12/24/2013
Bow-tie

(Figure: the bow-tie structure of the Web graph)
- Core: the Strongly Connected Component (SCC)
- Upstream (IN): can reach the core, but the core cannot reach IN
- Downstream (OUT): reachable from the core, but OUT cannot reach the core
- Tendrils & Tubes
- Disconnected components
Power-law

- Nature seems to create bell curves (a range around an average).
- Human activity seems to create power laws (popularity skewing).
Power Law Distribution - Examples

From "Graph structure in the Web" (AltaVista crawl, 1999).
Web Graph

Exercise: how would you store the Web graph? (A minimal adjacency-list sketch follows.)
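One common answer is an adjacency list. The sketch below is an assumed illustration, not the slides' own solution: it keeps, for each page, the list of pages it links to and the list of pages linking to it.

```python
from collections import defaultdict

# Minimal adjacency-list sketch (assumed example) for storing a Web graph.
class WebGraph:
    def __init__(self):
        self.out_links = defaultdict(list)   # page -> pages it points to
        self.in_links = defaultdict(list)    # page -> pages pointing to it

    def add_edge(self, src, dst):
        self.out_links[src].append(dst)
        self.in_links[dst].append(src)

    def out_degree(self, page):
        return len(self.out_links[page])

g = WebGraph()
g.add_edge("A", "B")
g.add_edge("B", "C")
g.add_edge("C", "A")
print(g.out_degree("A"), g.in_links["A"])   # 1 ['C']
```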
PageRank
Why and how does it work?

Random walker model
(Figure: a random walker at page V, with neighboring pages u1, u2, u3, u4, u5)
Damping Factor

p = (1 - β)·L^T·p + (β/N)·1_N = [(1 - β)·L^T + (β/N)·1_N·1_N^T]·p

where 1_N is the all-ones vector (so 1_N·1_N^T is the all-ones matrix).

β, chosen between 0.1 and 0.2, is called the damping factor (Page & Brin, 1997).
G = (1 - β)·L^T + (β/N)·1_N·1_N^T is called the Google Matrix.
1 / 11




 1/ 2


L








1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11

1


1

1/ 2


1/ 3
1/ 3
1/ 3


1/ 2
1/ 2

1/ 2
1/ 2


1/ 2
1/ 2

1/ 2
1/ 2


1

1

Solving a small example

- Take β = 0.15
- G = 0.85·L^T + (0.15/11)·1_N·1_N^T
- P0 = (1/11, 1/11, …)^T
- P1 = G·P0
- …
- You can try this in MatLab.

Power iteration (50 iterations) yields
P = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, …)^T

Exercise: write pseudocode for the PageRank algorithm (see the sketch below).
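A minimal power-iteration sketch, assuming a column-stochastic link matrix L; the tiny 3-page graph here is an invented example, not the 11-page graph from the slides.

```python
import numpy as np

# Power-iteration sketch of PageRank. L[i][j] = 1/outdegree(j) if page j
# links to page i, else 0 (column-stochastic).
def pagerank(L, beta=0.15, iters=50):
    N = L.shape[0]
    G = (1 - beta) * L + beta / N * np.ones((N, N))  # Google matrix
    p = np.full(N, 1.0 / N)                          # start from the uniform vector
    for _ in range(iters):
        p = G @ p
        p /= p.sum()                                 # keep p a probability vector
    return p

# Tiny 3-page example: A -> B, B -> C, C -> A, C -> B.
L = np.array([[0.0, 0.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
print(pagerank(L).round(3))
```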
HITS (Hyperlink Induced Topic Search)

- A page pointed to by many reputable pages (high in-degree) has high authority.
- A page that points to many reputable pages (high out-degree) is a strong hub (directory).

How to compute? Power iteration on:
a = E^T·h = E^T·E·a
h = E·a = E·E^T·h
Authority and Hub scores

For each page u ∈ V(q), define two scores a[u] and h[u], representing its authority and its hub (directory) quality, respectively.

The two are defined in terms of each other:
- the a value of a page u depends on the h values of the pages v that point to it;
- the h value of a page u depends on the a values of the pages v that it points to.

a = E^T·h
h = E·a = E·E^T·h
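A minimal power-iteration sketch for these two equations, assuming E is the adjacency matrix of a small invented example graph.

```python
import numpy as np

# HITS sketch: E[u][v] = 1 if page u links to page v.
def hits(E, iters=50):
    n = E.shape[0]
    a = np.ones(n)          # authority scores
    h = np.ones(n)          # hub scores
    for _ in range(iters):
        a = E.T @ h         # a comes from the h of pages pointing in
        h = E @ a           # h comes from the a of pages pointed to
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

E = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
a, h = hits(E)
print(a.round(3), h.round(3))
```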
Web Spam

- Term spamming: manipulating the text of web pages in order to appear relevant to queries.
- Link spamming: creating link structures that boost PageRank or hub and authority scores.
TrustRank

Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are denoted as good.

t = β·L^T·t + (1 - β)·d/|d|

(Figure: a seven-page example graph; pages 1-4 are good pages, pages 5-7 are bad pages.)
TrustRank in Action

- Select the seed set using inverse PageRank: s = [2, 4, 5, 1, 3, 6, 7]
- Invoke L (= 3) oracle functions
- Populate the static score distribution vector: d = [0, 1, 0, 1, 0, 0, 0]
- Normalize the distribution vector: d = [0, 1/2, 0, 1/2, 0, 0, 0]
- Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting:
  t = β·L^T·t + (1 - β)·d/|d|

Result: t = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
(Figure: the example graph annotated with these scores.)
Tokenization

Friends, Romans, Countrymen, lend me your ears;
→ Friends | Romans | Countrymen | lend | me | your | ears

- Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing.
- Type: the class of all tokens containing the same character sequence.
- Term: a (normalized) type that is included in the system dictionary.
Stemming and lemmatization

- Stemming: a crude heuristic process that chops off the ends of words.
  Democratic → democa
- Lemmatization: uses vocabulary and morphological analysis, and returns the base form of a word (the lemma).
  Democratic → democracy
  Sang → sing
Porter stemmer

The most common algorithm for stemming English; it applies 5 phases of word reduction. Example rules:

- SSES → SS: caresses → caress
- IES → I: ponies → poni
- SS → SS
- S → (drop): cats → cat
- EMENT → (drop): replacement → replac, but cement → cement (the remaining stem "c" is too short for the rule to fire)
Bag of words model

A document can now be viewed as the collection of terms in it and their associated weights.

"Mary is smarter than John" and "John is smarter than Mary" are equivalent in the bag of words model.
Term frequency and weighting

A word that appears often in a document is probably very descriptive of what the document is about. Assign to each term in a document a weight that depends on the number of occurrences of that term in the document.

Term frequency (tf): assign a weight equal to the number of occurrences of term t in document d.
Inverse document frequency

N: number of documents in the collection.
- N = 1000; df[the] = 1000; idf[the] = 0
- N = 1000; df[some] = 100; idf[some] = 2.3
- N = 1000; df[car] = 10; idf[car] = 4.6
- N = 1000; df[merger] = 1; idf[merger] = 6.9
tf.idf weighting

- Highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents).
- Lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal).
- Lowest when the term occurs in virtually all documents.
tf x idf term weights

The tf x idf weighting formula:

- term frequency (tf): or wf, some measure of term density in a doc
- inverse document frequency (idf): expresses how important (rare) a term is
  - raw form: idf_t = 1/df_t
  - as usual, this is smoothed with a log: idf_t = log(N / df_t)

Compute the tf.idf weight of each term in a document:
w_{t,d} = tf_{t,d} × log(N / df_t)
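A minimal sketch of this weighting; the three short documents are an invented example, reusing terms from the table on the next slide.

```python
import math
from collections import Counter

# tf.idf sketch following w_{t,d} = tf_{t,d} * log(N / df_t).
docs = [["中国", "文化", "教育", "北京"],
        ["日本", "留学生", "文化"],
        ["中国", "日本", "留学生"]]

N = len(docs)
df = Counter(t for d in docs for t in set(d))          # document frequency per term

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
```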
Document vector space representation

- Each document is viewed as a vector with one component corresponding to each term in the dictionary.
- The value of each component is the tf-idf score for that word.
- For dictionary terms that do not occur in the document, the weights are 0.
Documents as vectors

         D1    D2    D3    D4    D5    D6
中国     4.1   0.0   3.7   5.9   3.1   0.0
文化     4.5   4.5   0     0     11.6  0
日本     0     3.5   2.9   0     2.1   3.9
留学生   0     3.1   5.1   12.8  0     0
教育     2.9   0     0     2.2   0     0
北京     7.1   0     0     0     4.4   3.8
…

Each document j can be viewed as a vector: every term is one dimension, and the value is the tf.idf weight.
So we have a vector space

- terms are axes
- docs live in this space
- very high-dimensional: even with stemming, there may be 20,000+ dimensions
Cosine similarity

(Figure: document vectors d1 and d2 in term space, with angle θ between them)

The "closeness" of vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of that angle is used as the vector similarity, with the vectors normalized by length:

sim(d_j, d_k) = (d_j · d_k) / (|d_j|·|d_k|)
             = Σ_{i=1..M} w_{i,j}·w_{i,k} / ( sqrt(Σ_{i=1..M} w_{i,j}²) · sqrt(Σ_{i=1..M} w_{i,k}²) )
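A minimal sketch of length-normalized cosine similarity over sparse weight dictionaries, reusing the D1 and D5 columns of the table above.

```python
import math

# Cosine similarity over tf.idf weight dicts (missing terms count as 0).
def cosine(w_j, w_k):
    dot = sum(w_j[t] * w_k.get(t, 0.0) for t in w_j)
    norm_j = math.sqrt(sum(v * v for v in w_j.values()))
    norm_k = math.sqrt(sum(v * v for v in w_k.values()))
    return dot / (norm_j * norm_k)

d1 = {"中国": 4.1, "文化": 4.5, "教育": 2.9, "北京": 7.1}
d5 = {"中国": 3.1, "文化": 11.6, "日本": 2.1, "北京": 4.4}
print(round(cosine(d1, d5), 3))
```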
Jaccard coefficient

Resemblance:
r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

- Symmetric, reflexive, not transitive, not a metric.
- Note r(A, A) = 1, but r(A, B) = 1 does not mean A and B are identical!
- Forgives any number of occurrences and any permutations of the terms.

Resemblance distance: d(A, B) = 1 - r(A, B)
Shingling

A contiguous subsequence contained in D is called a shingle. Given a document D, we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D.

D = (a, rose, is, a, rose, is, a, rose)
S(D, 4) = {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)}

"a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is

Why shingling? Compare S(D, 4) vs. S(D, 1). What is a good value for w?
Shingling & Jaccard Coefficient

Doc1 = "to be or not to be, that is a question!"
Doc2 = "to be a question or not"

Let window size w = 2. Resemblance r(A, B) = ? (A sketch that computes it follows.)
Random permutation

- Let Ω be a set (e.g. 1..N)
- Pick a permutation π: Ω → Ω uniformly at random
- π = {3, 7, 1, 4, 6, 2, 5}
- A = {2, 3, 6}
- MIN(π(A)) = ?
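This is the idea behind MinHash: the probability that two sets share the same minimum element under a random permutation equals their resemblance. A minimal sketch, where the sets A and B are assumed examples:

```python
import random

# Estimate resemblance by comparing minima under the same random permutations.
def minhash_signature(s, permutations):
    return [min(p[x] for x in s) for p in permutations]

universe = list(range(1, 8))                      # Omega = {1..7}
random.seed(0)
perms = []
for _ in range(100):
    order = universe[:]
    random.shuffle(order)
    perms.append(dict(zip(universe, order)))      # element -> its image under pi

A, B = {2, 3, 6}, {2, 3, 5, 6}
sig_a = minhash_signature(A, perms)
sig_b = minhash_signature(B, perms)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(perms)
print(est)   # estimates |A ∩ B| / |A ∪ B| = 3/4
```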
Inverted index

For each term T, store a list of the documents (docIDs) that contain T.

Dictionary → Postings:
中国    → 2, 4, 8, 16, 32, 64, 128
文化    → 1, 2, 3, 5, 8, 13, 21, 34
留学生  → 13, 16

Postings are sorted by docID (more later on why).
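A minimal sketch (assumed toy documents) that builds such an index and intersects two postings lists for an AND query.

```python
from collections import defaultdict

# Build an inverted index: term -> postings list sorted by docID.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "中国 文化", 2: "中国 文化 留学生", 3: "文化 教育"}
index = build_index(docs)
print(index["文化"])                             # [1, 2, 3]

# AND query by intersecting two postings lists.
def intersect(p1, p2):
    return sorted(set(p1) & set(p2))

print(intersect(index["中国"], index["文化"]))   # [1, 2]
```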
Inverted Index with counts

- supports better ranking algorithms

VS-based Retrieval (Sec. 6.4)

(Table of SMART term-weighting schemes; columns headed 'n' are acronyms for weight schemes.)
Why is the base of the log in idf immaterial?
tf-idf example: lnc.ltc (Sec. 6.4)

Document: car insurance auto insurance
Query: best car insurance

Term: auto
  Query:    tf-raw 0, tf-wt 0, df 5000,  idf 2.3, wt 0,   n'lize 0
  Document: tf-raw 1, tf-wt 1, wt 1,   n'lize 0.52
  Prod: 0
Term: best
  Query:    tf-raw 1, tf-wt 1, df 50000, idf 1.3, wt 1.3, n'lize 0.34
  Document: tf-raw 0, tf-wt 0, wt 0,   n'lize 0
  Prod: 0
Term: car
  Query:    tf-raw 1, tf-wt 1, df 10000, idf 2.0, wt 2.0, n'lize 0.52
  Document: tf-raw 1, tf-wt 1, wt 1,   n'lize 0.52
  Prod: 0.27
Term: insurance
  Query:    tf-raw 1, tf-wt 1, df 1000,  idf 3.0, wt 3.0, n'lize 0.78
  Document: tf-raw 2, tf-wt 1.3, wt 1.3, n'lize 0.68
  Prod: 0.53

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
Singular Value Decomposition

W_{t×d} = T_{t×r} · Σ_{r×r} · D^T_{r×d}

Apply singular value decomposition to the term-document matrix:
- r is the rank of the matrix;
- Σ is the diagonal matrix of singular values (in descending order);
- T and D have orthonormal columns (T^T·T = I, D^T·D = I);
- the squared singular values are the eigenvalues of W·W^T, and the columns of T and D are the eigenvectors of W·W^T and W^T·W, respectively.
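A minimal numpy sketch (assumed toy matrix) of truncating the SVD and folding a query into the reduced space, anticipating the LSI retrieval steps on the next slide.

```python
import numpy as np

# SVD of a small term-document matrix W (rows = terms, columns = documents),
# truncated to k dimensions.
W = np.array([[4.1, 0.0, 3.7],
              [4.5, 4.5, 0.0],
              [0.0, 3.5, 2.9]])

T, sigma, Dt = np.linalg.svd(W, full_matrices=False)   # W = T @ diag(sigma) @ Dt
k = 2
T_k, S_k, Dt_k = T[:, :k], np.diag(sigma[:k]), Dt[:k, :]

# Fold a query vector q (in term space) into the k-dimensional space:
# q' = q @ T_k @ inv(S_k), i.e. multiply by T Σ^{-1}.
q = np.array([1.0, 0.0, 1.0])
q_lsi = q @ T_k @ np.linalg.inv(S_k)
docs_lsi = Dt_k.T                      # each row is a document in LSI space
print((docs_lsi @ q_lsi).round(3))     # dot-product similarities
```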
Latent Semantic Model

The LSI retrieval process:

- The query is mapped/projected into the LSI document space D^T, which is called "folding in": with W = T·Σ·D^T, if q projects to q' in the D^T space, then q = T·Σ·q'^T, and hence q' = (Σ^{-1}·T^{-1}·q)^T = q^T·T·Σ^{-1}.
- Folding in therefore means multiplying the document/query vector by T·Σ^{-1}.
- The document vectors of the collection are given by D^T.
- Query and documents are compared by dot-product similarity.
Stochastic Language Models

A statistical model used to generate text: a probability distribution over strings in a given language M.

P(w1 w2 w3 w4 | M) = P(w1 | M) · P(w2 | M, w1) · P(w3 | M, w1 w2) · P(w4 | M, w1 w2 w3)

Unigram model (captures likely topics):
P(w) = count(w) / #tokens

Bigram model (captures grammaticality):
P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
Bigram Model

Approximate P(w_n | w_1^{n-1}) by P(w_n | w_{n-1}), e.g. P(unicorn | the mythical) by P(unicorn | mythical).

Markov assumption: the probability of a word depends only on a limited history.
Generalization: the probability of a word depends only on the n previous words:
- trigrams, 4-grams, …
- the higher n is, the more data is needed to train
- backoff models…

A Simple Example: bigram model

P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
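A minimal sketch of the bigram model with maximum-likelihood estimates; the two-sentence training corpus is invented for illustration.

```python
from collections import Counter

corpus = [["<start>", "I", "want", "to", "eat", "Chinese", "food", "<end>"],
          ["<start>", "I", "want", "Chinese", "food", "<end>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(words):
    words = ["<start>"] + words + ["<end>"]
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(sentence_prob(["I", "want", "to", "eat", "Chinese", "food"]))
```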
LM-based Retrieval

Ranking formula:
p(Q, d) = p(d) · p(Q | d) ≈ p(d) · p(Q | M_d)

Using the maximum likelihood estimate:
p̂(Q | M_d) = Π_{t∈Q} p̂_ml(t | M_d) = Π_{t∈Q} tf(t,d) / dl_d

Unigram assumption: given a particular language model, the query terms occur independently.
- M_d: language model of document d
- tf(t,d): raw tf of term t in document d
- dl_d: total number of tokens in document d
Laplace smoothing

Also called add-one smoothing: just add one to all the counts! Very simple.

MLE estimate (for a vocabulary of V words and N observed tokens):
P_mle(w) = count(w) / N

Laplace estimate:
P_laplace(w) = (count(w) + 1) / (N + V)
Mixture model smoothing

P(w | d) = λ·P_mle(w | M_d) + (1 - λ)·P_mle(w | M_c)

The parameter λ matters a lot:
- a high λ makes the query "conjunctive-like" - suitable for short queries;
- a low λ is more suitable for long queries;
- tune λ to optimize performance, e.g. make it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing).
Example

Document collection (2 documents):
- d1: Xerox reports a profit but revenue is down
- d2: Lucent narrows quarter loss but revenue decreases further

Model: MLE unigram from the documents; λ = ½
Query: revenue down

P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256

Ranking: d1 > d2
What is relative entropy?

KL divergence / relative entropy between two distributions: the cost in bits of coding using Q when the true distribution is P.

H(P) = - Σ_i P(i)·log P(i)

D_KL(P‖Q) = - Σ_i P(i)·log Q(i) - ( - Σ_i P(i)·log P(i) ) = Σ_i P(i)·log( P(i) / Q(i) )
Precision and Recall

Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

                Relevant   Not Relevant
Retrieved       tp         fp
Not Retrieved   fn         tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
Accuracy

Given a query, the search engine classifies each document as "Relevant" or "Irrelevant". The accuracy of the engine is the fraction of these classifications that are correct:

Accuracy = (tp + tn) / (tp + fp + tn + fn)

Is this a very useful evaluation measure in IR?

                Retrieved   Not Retrieved
Relevant        tp          fn
Not Relevant    fp          tn
A combined measure: F

The F measure combines P and R (a weighted harmonic mean):

F = (β² + 1)·P·R / (β²·P + R) = 1 / ( α·(1/P) + (1 - α)·(1/R) )

The balanced F1 measure (β = 1, i.e. α = ½) is the one usually used.

The harmonic mean is a conservative average: it heavily penalizes low values of P or R.
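A minimal sketch of these three measures; the tp/fp/fn counts are invented example values.

```python
# Precision, recall, and F from the contingency counts defined above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    # weighted harmonic mean: F = (beta^2 + 1) P R / (beta^2 P + R)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```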
MAP

Averaging across queries:

- Micro-average: every relevant document is one data point in the average.
- Macro-average: every query is one data point in the average.
  - Average of many queries' average precision values.
  - Called mean average precision (MAP).
  - "Average average precision" sounds weird.
- The macro-average (MAP) is the most common.








Exercise 8-9 [**]: In a collection of 10,000 documents, a certain query has 8 relevant documents in total. Below are the relevance judgments (R = relevant, N = non-relevant) of the top 20 ranked results returned by some system for this query; 6 of them are relevant:

RRNNN NNNRN RNNNR NNNNR

a. What is the precision over the top 20 documents?
b. What is the F1 over the top 20 documents?
c. What is the interpolated precision at the 25% recall level?
KNN (Sec. 14.3)

P(science | ★) = ?
(Figure: documents from the classes Government, Science, and Arts in a 2-D feature space; the test point ★ is classified by its nearest neighbors.)
Naïve Bayes

c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)            (maximum a posteriori hypothesis)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j)·P(c_j)     (Bayes rule)
      = argmax_{c_j ∈ C} P̂(c_j)·Π_i P̂(x_i | c_j)             (conditional independence assumption)

Maximum likelihood estimates, with add-one smoothing:
P̂(c_j) = N(C = c_j) / N
P̂(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) / ( N(C = c_j) + k )
Parameter estimation

- Binomial (Bernoulli) model:
  P̂(X_w = t | c_j) = fraction of documents of topic c_j in which word w appears

- Multinomial model:
  P̂(X_i = w | c_j) = fraction of times in which word w appears across all positions in the documents of topic c_j
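A minimal multinomial NB sketch with add-one smoothing. The four training documents are the textbook China/not-China toy set, which appears to be the data behind the c(5) example on the next slides; that reconstruction is an assumption, since the slides' tables did not survive extraction.

```python
import math
from collections import Counter, defaultdict

# Multinomial Naive Bayes: c_MAP = argmax_c P(c) * prod_i P(x_i | c),
# with add-one smoothed likelihoods.
train = [("Chinese Beijing Chinese", "China"),
         ("Chinese Chinese Shanghai", "China"),
         ("Chinese Macao", "China"),
         ("Tokyo Japan Chinese", "not-China")]

class_docs = Counter(c for _, c in train)
word_counts = defaultdict(Counter)
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def classify(text):
    best_c, best_logp = None, -math.inf
    for c in class_docs:
        logp = math.log(class_docs[c] / len(train))          # prior P(c)
        total = sum(word_counts[c].values())
        for w in text.split():
            # add-one smoothed likelihood P(w | c)
            logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        best_c, best_logp = (c, logp) if logp > best_logp else (best_c, best_logp)
    return best_c

print(classify("Chinese Chinese Chinese Tokyo Japan"))   # expected: China
```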
NB Example

c(5) = ?

Multinomial NB Classifier: estimate the feature likelihoods and the posterior; result: c(5) = China.

Bernoulli NB Classifier: estimate the feature likelihoods and the posterior; result: c(5) <> China.
Worked exercise: your task is to classify words as English or non-English. The words are generated from the following distribution:
(i) Compute the parameters of a multinomial NB classifier that uses the letters b, n, o, u and z as features. When estimating the parameters, apply smoothing that turns zero probabilities into 0.01 and leaves nonzero probabilities unchanged.
(ii) How does this classifier classify the word "zoo"?
Support Vector Machine (SVM)

- SVMs maximize the margin around the separating hyperplane (a.k.a. large margin classifiers).
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving SVMs is a quadratic programming problem.
- Seen by many as the most successful current text classification method* (*but other discriminative methods often perform very similarly).

(Figure: the separating hyperplane, its support vectors, and the maximized margin compared with a narrower margin.)
χ² statistic (CHI)

Observed counts f_o:

               Term = jaguar   Term ≠ jaguar
Class = auto   2               500
Class ≠ auto   3               9500

The null hypothesis: Term (jaguar) is independent of Class (auto). Then what values would we expect in this contingency table?
χ² statistic (CHI)

Observed f_o, with expected f_e in parentheses:

               Term = jaguar   Term ≠ jaguar
Class = auto   2 (0.25)        500 (502)
Class ≠ auto   3 (4.75)        9500 (9498)

χ² sums (f_o − f_e)²/f_e over all table entries:

χ²(j, a) = Σ (O − E)²/E
         = (2 − 0.25)²/0.25 + (3 − 4.75)²/4.75 + (500 − 502)²/502 + (9500 − 9498)²/9498
         = 12.9   (p < .001)

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).
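A minimal sketch that recomputes this statistic from the observed table; it lands near 12.8 rather than 12.9 because it does not round the expected counts the way the slide does.

```python
# Chi-square statistic for the jaguar/auto contingency table above.
observed = [[2, 500],
            [3, 9500]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected count under independence
        chi2 += (o - e) ** 2 / e

print(round(chi2, 1))   # ≈ 12.8, well above 10.83, the critical value at .999 confidence
```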
K-Means

- Assume documents are real-valued vectors.
- Each cluster ω is represented by its centroid (a.k.a. the center of gravity or mean).
- Instances are assigned to clusters according to their distance to the cluster centroids, choosing the nearest centroid.
K-Means Example (K = 2)

(Figure: pick seeds; then repeatedly reassign clusters and compute centroids until converged.)
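A minimal 2-D sketch of that loop; the points are an invented example.

```python
import random

# K-means for 2-D points, K = 2: pick seeds, then alternate
# "reassign clusters" and "compute centroids".
def kmeans(points, k=2, iters=10, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                  # pick seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # reassign clusters
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):                 # compute centroids
            if cl:
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(points)
print(centroids)
```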
Hierarchical Agglomerative Clustering (HAC)

Assume a similarity function that determines the similarity of two instances.

Greedy algorithm:
- start with every instance as its own cluster;
- pick the two most similar clusters and merge them into a new cluster;
- repeat until only one cluster remains.

The merge history above forms a binary tree or hierarchy (a dendrogram).
Purity

(Figure: three clusters of 17 points drawn from three classes.)

Cluster I:   Purity = 1/6 × max(5, 1, 0) = 5/6
Cluster II:  Purity = 1/6 × max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 × max(2, 0, 3) = 3/5
Total:       Purity = 1/17 × (5 + 4 + 3) = 12/17
Rand Index

View clustering evaluation as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection:
- a true positive (TP) decision assigns two similar documents to the same cluster;
- a true negative (TN) decision assigns two dissimilar documents to different clusters;
- a false positive (FP) decision assigns two dissimilar documents to the same cluster;
- a false negative (FN) decision assigns two similar documents to different clusters.
Rand Index

Number of pairs                     Same cluster in clustering   Different clusters in clustering
Same class in ground truth          TP                           FN
Different classes in ground truth   FP                           TN
Rand index Example

(Figure: the same three clusters of 17 points as in the Purity example; count the TP/FP/FN/TN pair decisions over them.)
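A minimal sketch that enumerates all pairs for that 17-point example and computes RI = (TP + TN)/(TP + FP + FN + TN); the class labels x, o, d are placeholders for the original symbols, which did not survive extraction.

```python
from itertools import combinations

# Each point is (cluster assignment, true class), matching the per-cluster
# class counts from the Purity slide: I = 5x+1o, II = 1x+4o+1d, III = 2x+3d.
points = ([("I", "x")] * 5 + [("I", "o")]
          + [("II", "x")] + [("II", "o")] * 4 + [("II", "d")]
          + [("III", "x")] * 2 + [("III", "d")] * 3)

tp = fp = fn = tn = 0
for (c1, g1), (c2, g2) in combinations(points, 2):
    same_cluster, same_class = c1 == c2, g1 == g2
    if same_cluster and same_class:
        tp += 1
    elif same_cluster and not same_class:
        fp += 1
    elif not same_cluster and same_class:
        fn += 1
    else:
        tn += 1

rand_index = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, round(rand_index, 2))
```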
Thank You!
Q&A