Automatic Keyphrase Extraction via Topic Decomposition


Proceedings: EMNLP '10, the 2010 Conference on Empirical Methods in Natural Language Processing
Reporter: Ying-Ying Chen
OUTLINE

- Introduction
- Building Topic Interpreters
- Topical PageRank for Keyphrase Extraction
- Experiments
- Related Work
- Conclusion
INTRODUCTION

Keyphrases are defined as a set of terms in a document that give readers a brief summary of its content.
- Keyphrases are widely used in information retrieval and digital libraries.
- Keyphrase extraction is also an essential step in document categorization, clustering, and summarization.

Two principal approaches: supervised and unsupervised.

Supervised methods:
- regard keyphrase extraction as a classification task
- require a document set with human-assigned keyphrases
INTRODUCTION

Unsupervised methods: graph-based ranking.

Process:
1. First build a word graph according to word co-occurrences within the document.
2. Use random walk techniques to measure word importance.
3. Select the top-ranked words as keyphrases.

Problems this approach does not address:
- keyphrases should be relevant to the major topics of the given document
- keyphrases should also have good coverage of the document's major topics
INTRODUCTION

To address these problems, it is intuitive to consider the topics of words and the document in the random walk for keyphrase extraction:
- decompose traditional PageRank into multiple PageRanks specific to various topics
- obtain the importance scores of words under different topics

We call this topic-decomposed PageRank Topical PageRank (TPR). Moreover, TPR is unsupervised and language independent.

TPR for keyphrase extraction is a two-stage process:
1. Build a topic interpreter to acquire the topics of words and documents.
2. Perform TPR to extract keyphrases for documents.
BUILDING TOPIC INTERPRETERS

There are two methods to acquire topic distributions of words:
- Use manually annotated knowledge bases, e.g., WordNet.
- Use unsupervised machine learning techniques to obtain word topics from a large-scale document collection:
  - LSA (Latent Semantic Analysis)
  - pLSA (probabilistic LSA)
  - LDA (Latent Dirichlet Allocation)
BUILDING TOPIC INTERPRETERS

LDA
- Each word w of a document d is regarded as generated by first sampling a topic z from d's topic distribution θ(d), and then sampling a word from the distribution over words φ(z) that characterizes topic z.
- In LDA, θ(d) and φ(z) are drawn from conjugate Dirichlet priors α and β, respectively.
- Therefore, θ and φ are integrated out, and the probability of word w given document d and the priors is represented as follows:

  pr(w|d) = \sum_{z=1}^{K} pr(w|z) \, pr(z|d)

  where K is the number of topics.
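As a minimal sketch of this marginalization (illustrative values, not from the paper), assume a trained LDA model already gives the topic-word matrix φ and the document-topic vector θ(d); then pr(w|d) is a single matrix-vector product:

```python
import numpy as np

# Illustrative shapes: K topics, V-word vocabulary (values are made up)
K, V = 4, 10
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.full(V, 0.1), size=K)   # phi[z, v] = pr(word v | topic z)
theta_d = rng.dirichlet(np.full(K, 0.5))       # theta_d[z] = pr(topic z | document d)

# pr(w|d) = sum over z of pr(w|z) * pr(z|d), computed for all V words at once
pr_w_given_d = theta_d @ phi                   # shape (V,)
assert np.isclose(pr_w_given_d.sum(), 1.0)     # a proper distribution over words
```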
LDA (LATENT DIRICHLET ALLOCATION)

Dirichlet distribution
- The Dirichlet distribution is the conjugate prior of the multinomial distribution.
- If the prior is a Dirichlet distribution and the likelihood function is a multinomial distribution, then the posterior distribution is still a Dirichlet distribution.
- Here P(Y|X) is the posterior probability, P(Y) the prior probability, and P(X|Y) the likelihood function.
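A short worked statement of this conjugacy (a standard result, not specific to the paper): with a Dirichlet prior over a topic distribution θ and multinomially distributed topic counts, the posterior is again Dirichlet, with the observed counts added to the prior parameters:

```latex
% Dirichlet-multinomial conjugacy (standard result, stated for topic counts)
\theta \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_K), \qquad
(n_1,\dots,n_K) \mid \theta \sim \mathrm{Mult}(n;\theta)
\;\Longrightarrow\;
\theta \mid (n_1,\dots,n_K) \sim \mathrm{Dir}(\alpha_1+n_1,\;\dots,\;\alpha_K+n_K)
```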
LDA (LATENT DIRICHLET ALLOCATION)

- LDA maps documents into a topic space: it assumes an article is randomly composed of many topics, and the relationships between documents are obtained through these topics.
- LDA shares the same premise as LSA and pLSA: all are bag-of-words models, so syntax and word order are ignored.
- Differences between LDA and pLSA:
  - The document parameters of pLSA are trained only from documents that appear in the training corpus.
  - LDA gives documents that do not appear in the training corpus a probabilistic representation, so it needs fewer parameters.
LDA (LATENT DIRICHLET ALLOCATION)

LDA is a generative model: it can randomly generate observable data, i.e., an article composed of multiple topics. Modeling works in the reverse direction, building the generative model from a collection of documents. The generative steps are as follows (see the sketch after this list):
1. Choose N, where N follows a Poisson(ξ) distribution; N is the article length (number of words).
2. Choose θ, where θ follows a Dirichlet(α) distribution; θ gives the probability of each topic, and α is the parameter of the Dirichlet distribution.
3. For each of the N words:
   1. Choose a topic zn following a Multinomial(θ) distribution; zn is the currently selected topic.
   2. Choose a word wn according to p(wn|zn; β), a multinomial conditioned on zn, where β is a K×V matrix with βij = P(wj = 1 | zi = 1).

In LDA, different articles correspond to different θ, and θ can be used to judge the similarity between articles.
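A minimal sketch of these generative steps (all parameter values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
K, V = 3, 8                 # number of topics, vocabulary size
xi, alpha = 20, 0.5         # Poisson length parameter, symmetric Dirichlet prior
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # beta[z, v] = P(word v | topic z), K x V

N = rng.poisson(xi)                        # 1. document length N ~ Poisson(xi)
theta = rng.dirichlet(np.full(K, alpha))   # 2. topic mixture theta ~ Dirichlet(alpha)

doc = []
for _ in range(N):                         # 3. for each of the N words:
    z = rng.choice(K, p=theta)             #    3.1 sample a topic z ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])           #    3.2 sample a word w ~ p(w | z; beta)
    doc.append(w)
print(theta, doc)                          # the document is a bag of word indices
```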
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

Given a document d, the process of keyphrase extraction using TPR consists of the following four steps:
1. Construct a word graph for d according to word co-occurrences within d.
2. Perform TPR to calculate the importance scores of each word with respect to different topics.
3. Using the topic-specific importance scores of words, rank the candidate keyphrases with respect to each topic separately.
4. Given the topics of document d, integrate the topic-specific rankings of candidate keyphrases into a final ranking; the top-ranked candidates are selected as keyphrases.
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

We construct a word graph according to word co-occurrences within the given document (a construction sketch follows).
- Link weight between words: the co-occurrence count within a sliding window of at most W words over the word sequence.
- Direction: when sliding a W-width window, at each position we add links from the first word pointing to the other words within the window.
- Format: only adjectives and nouns are added to the word graph.
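A minimal word-graph construction sketch, assuming the input word sequence has already been filtered to adjectives and nouns (the function name and representation are illustrative):

```python
from collections import defaultdict

def build_word_graph(words, window=10):
    """Directed, weighted word graph: at each window position, add links from
    the first word to every other word in the window; link weights accumulate
    co-occurrence counts. `words` is assumed POS-filtered to adjectives/nouns."""
    weight = defaultdict(int)   # (wi, wj) -> e(wi, wj)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                weight[(words[i], words[j])] += 1
    return weight

# Toy usage
g = build_word_graph(["topic", "model", "keyphrase", "topic", "graph"], window=3)
```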
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

PageRank
- The basic idea of PageRank is that a vertex is important if other important vertices point to it. This can be regarded as voting or recommendation among vertices.
- Let G = (V, E) be the graph of a document:
  - vertex set V = {w1, w2, ..., wN}
  - link set E, with (wi, wj) ∈ E if there is a link from wi to wj
  - the weight of link (wi, wj) is e(wi, wj)
  - the out-degree of vertex wi is O(w_i) = \sum_{j:(w_i,w_j)\in E} e(w_i, w_j)
- The PageRank score of wi is computed iteratively as

  R(w_i) = \lambda \sum_{j:(w_j,w_i)\in E} \frac{e(w_j,w_i)}{O(w_j)} R(w_j) + (1-\lambda)\frac{1}{|V|}

  where λ is a damping factor ranging from 0 to 1 and |V| is the number of vertices.
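A power-iteration sketch of the formula above (illustrative, not the authors' code), reusing the edge-weight dictionary built earlier:

```python
import numpy as np

def pagerank(weight, vertices, lam=0.85, iters=100, tol=1e-3):
    """Weighted PageRank: `weight` maps (wi, wj) -> e(wi, wj); the random-jump
    probability is uniform, 1/|V|."""
    idx = {w: i for i, w in enumerate(vertices)}
    n = len(vertices)
    out = np.zeros(n)                          # O(wi) = sum_j e(wi, wj)
    for (wi, wj), e in weight.items():
        out[idx[wi]] += e
    R = np.full(n, 1.0 / n)
    for _ in range(iters):
        R_new = np.full(n, (1 - lam) / n)      # (1 - lambda) * 1/|V|
        for (wi, wj), e in weight.items():     # propagate scores along links
            R_new[idx[wj]] += lam * e / out[idx[wi]] * R[idx[wi]]
        if np.abs(R_new - R).max() < tol:      # stop when scores stabilize
            R = R_new
            break
        R = R_new
    return dict(zip(vertices, R))
```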
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

Topical PageRank (TPR)
- Each topic-specific PageRank prefers words with high relevance to the corresponding topic.
- In the PageRank of a specific topic z, we assign a topic-specific preference value pz(w) to each word w as its random jump probability, with \sum_{w} p_z(w) = 1:

  R_z(w_i) = \lambda \sum_{j:(w_j,w_i)\in E} \frac{e(w_j,w_i)}{O(w_j)} R_z(w_j) + (1-\lambda)\, p_z(w_i)
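A minimal TPR sketch on top of the PageRank routine above: the only change is that the uniform jump 1/|V| is replaced by the topic-specific preference pz(w). Here `pz` is a hypothetical dict from word to preference value summing to 1, and λ = 0.3 follows the experimental setting reported later:

```python
import numpy as np

def topical_pagerank(weight, vertices, pz, lam=0.3, iters=100, tol=1e-3):
    """Topic-specific PageRank: identical to plain PageRank except the
    random-jump term uses the topic preference pz(w) instead of 1/|V|."""
    idx = {w: i for i, w in enumerate(vertices)}
    n = len(vertices)
    out = np.zeros(n)
    for (wi, wj), e in weight.items():
        out[idx[wi]] += e
    jump = np.array([pz[w] for w in vertices])    # p_z(w), sums to 1
    R = np.full(n, 1.0 / n)
    for _ in range(iters):                        # stop after 100 iterations...
        R_new = (1 - lam) * jump
        for (wi, wj), e in weight.items():
            R_new[idx[wj]] += lam * e / out[idx[wi]] * R[idx[wi]]
        if np.abs(R_new - R).max() < tol:         # ...or when change < 0.001
            R = R_new
            break
        R = R_new
    return dict(zip(vertices, R))
```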
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

Topical PageRank (TPR)
We use three measures to set preference values for TPR:
- pz(w) = pr(w|z): indicates how much topic z focuses on word w.
- pz(w) = pr(z|w): indicates how much word w focuses on topic z.
- pz(w) = pr(w|z) * pr(z|w): this measure is inspired by the work in (Cohn and Chang, 2000).

Termination conditions:
- the number of iterations reaches 100, or
- the difference of each vertex between two neighboring iterations is less than 0.001.
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

Extract Keyphrases Using Ranking Scores
We select noun phrases from a document as candidate keyphrases for ranking (see the sketch below):
1. The document is first tokenized.
2. After that, we annotate the document with part-of-speech (POS) tags.
3. Third, we extract noun phrases matching the pattern (adjective)*(noun)+.
4. We regard these noun phrases as candidate keyphrases.
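One way to sketch this pipeline is with NLTK's tagger and chunker; the paper does not specify a toolkit, so NLTK here is an assumption (and it requires the punkt and averaged_perceptron_tagger data to be installed):

```python
import nltk  # assumed toolkit; needs punkt and averaged_perceptron_tagger data

def candidate_keyphrases(text):
    """Extract noun phrases matching (adjective)*(noun)+ as candidates."""
    tokens = nltk.word_tokenize(text)        # 1. tokenize
    tagged = nltk.pos_tag(tokens)            # 2. POS-tag
    grammar = "NP: {<JJ.*>*<NN.*>+}"         # 3. (adjective)*(noun)+
    tree = nltk.RegexpParser(grammar).parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

print(candidate_keyphrases("Automatic keyphrase extraction uses topical random walks."))
```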
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

Extract Keyphrases Using Ranking Scores
- We rank the candidates using the ranking scores obtained by TPR.
- By considering the topic distribution of the document, we further integrate the topic-specific rankings of candidate keyphrases into a final ranking (a small integration sketch follows).
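A minimal sketch of one natural instantiation of this integration (the phrase-scoring details are an assumption, not quoted from the slides): weight each topic-specific score by the document's topic probability pr(z|d) and sum, taking a phrase's score under a topic as the sum of its words' scores:

```python
def final_scores(candidates, Rz, theta_d):
    """Integrate topic-specific word scores into one phrase ranking (sketch).
    Rz[z][w] is the TPR score of word w under topic z; theta_d[z] is pr(z|d)."""
    scores = {}
    for p in candidates:
        words = p.split()
        scores[p] = sum(theta_d[z] * sum(Rz[z].get(w, 0.0) for w in words)
                        for z in range(len(theta_d)))
    # highest-scoring candidates first; the top ones become keyphrases
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```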
EXPERIMENTS

Datasets
- One dataset was built by Wan and Xiao and used in (Wan and Xiao, 2008b):
  - It contains 308 news articles from DUC2001 (Over et al., 2001).
  - 2,488 manually annotated keyphrases.
  - There are at most 10 keyphrases per document.
  - In experiments we refer to this dataset as NEWS.
- The other dataset was built by Hulth and used in (Hulth, 2003):
  - It contains 2,000 abstracts of research articles.
  - 19,254 manually annotated keyphrases.
  - In experiments we refer to this dataset as RESEARCH.
EXPERIMENTS

Datasets
- We use the Wikipedia snapshot of March 2008 to build topic interpreters with LDA:
  - collected 2,122,618 articles
  - built the vocabulary by selecting 20,000 words according to their document frequency
  - learned several models with different numbers of topics, from 50 to 1,500
EXPERIMENTS

Evaluation Metrics
In experiments we select three evaluation metrics:
- Precision / recall / F-measure.
- Binary preference measure (Bpref), where R is the set of correct keyphrases, M the set of extracted keyphrases, r a correct keyphrase, and n an incorrect keyphrase:

  Bpref = \frac{1}{|R|} \sum_{r \in M} \left( 1 - \frac{|n \text{ ranked higher than } r|}{|M|} \right)

- Mean reciprocal rank (MRR), where d is a document and rank_d is the rank of the first correct keyphrase among all extracted keyphrases:

  MRR = \frac{1}{|D|} \sum_{d \in D} \frac{1}{rank_d}
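A small sketch of the two rank-based metrics under the symbol definitions above (illustrative; the Bpref penalty denominator is taken as the number of extracted keyphrases, following the slide's definitions):

```python
def bpref(extracted, correct):
    """Bpref for one document: `extracted` is a ranked list, `correct` a set."""
    M = len(extracted)
    score, wrong_seen = 0.0, 0
    for phrase in extracted:
        if phrase in correct:
            score += 1 - wrong_seen / M   # penalize incorrect phrases ranked higher
        else:
            wrong_seen += 1
    return score / len(correct) if correct else 0.0

def mrr(ranked_lists, correct_sets):
    """Mean reciprocal rank of the first correct keyphrase per document."""
    total = 0.0
    for extracted, correct in zip(ranked_lists, correct_sets):
        for rank, phrase in enumerate(extracted, start=1):
            if phrase in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```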
EXPERIMENTS

Influences of Parameters on TPR
There are four parameters in TPR that may influence the performance of keyphrase extraction:
1. window size W for constructing the word graph
2. the number of topics K learned by LDA
3. different settings of the preference values pz(w)
4. damping factor λ of TPR

Except for the parameter under investigation, we set the parameters to the following values: W = 10, K = 1000, λ = 0.3, and pz(w) = pr(z|w).
EXPERIMENTS

Window Size W
- In experiments on NEWS, W ranges from 5 to 20, as shown in Table 1.
- Similarly, when W ranges from 2 to 10 the performance on RESEARCH does not change much, but it becomes poor when W = 20:
  - RESEARCH articles (121 words) are much shorter than NEWS articles (704 words)
  - the graph becomes fully connected
  - the weights of links tend to be equal
EXPERIMENTS

The Number of Topics K
- We demonstrate the influence of the number of topics K of the LDA models in Table 2.
- The influence is similar on RESEARCH.
- This indicates that LDA is appropriate for obtaining the topics of words and documents for TPR to extract keyphrases.
EXPERIMENTS

Damping Factor λ
- The damping factor λ of TPR reconciles the influence of walks along graph links with that of topic-specific random jumps.
EXPERIMENTS

Preference Values
- In Table 3 we show the influence of the preference measures when the number of keyphrases M = 10 on NEWS:
  - pr(w|z) assigns preference values according to how frequently words appear in the given topic.
  - pr(z|w) prefers words that are focused on the given topic.
EXPERIMENTS

Comparing with Baseline Methods
We select three baseline methods to compare with TPR:
- TFIDF
- PageRank
  - Neither TFIDF nor PageRank uses topic information.
- LDA
  - computes the ranking score for each word using the topical similarity between the word and the document
  - the LDA baseline calculated using cosine similarity performs the best
EXPERIMENTS

- In Tables 4 and 5 we show the comparison results of the four methods on both NEWS and RESEARCH.
- The improvements of TPR are all statistically significant, tested with bootstrap re-sampling at 95% confidence.
- LDA performs equal to or better than TFIDF and PageRank under precision/recall/F-measure.
- The performance of LDA under MRR is much worse than that of TFIDF and PageRank.
EXPERIMENTS

- In Figures 3 and 4 we show the precision-recall curves of the four methods on NEWS and RESEARCH.
- Each point on a precision-recall curve is evaluated at a different number of extracted keyphrases M.
EXPERIMENTS

- In Table 6 we show an example of keyphrases extracted using TPR from a news article titled "Arafat Says U.S. Threatening to Kill PLO Officials".
- Top 3 topics:
  - Palestine
  - Israel
  - terrorism
EXPERIMENTS

- TFIDF:
  - only considers frequency
  - highly ranked the phrases containing "PLO", which appeared about 16 times in this article
- LDA:
  - does not consider frequency
  - failed to extract the keyphrase "political assassination", in which the word "assassination" occurred 8 times in this article
RELATED WORK

1. Supervised methods:
   - regarded keyphrase extraction as a classification task (Turney, 1999)
   - need a manually annotated training set, which is time-consuming
2. Clustering techniques on word graphs for keyphrase extraction (Grineva et al., 2009; Liu et al., 2009):
   - performed well on short abstracts but poorly on long articles
3. Topical PageRank with random jumps between topics (Nie et al., 2006):
   - did not help improve the performance for keyphrase extraction

References:
- Peter D. Turney. 1999. Learning to extract keyphrases from text. National Research Council Canada, Institute for Information Technology, Technical Report ERB-1057.
- M. Grineva, M. Grinev, and D. Lizorkin. 2009. Extracting key terms from noisy and multi-theme documents. In Proceedings of WWW, pages 661–670.
- Lan Nie, Brian D. Davison, and Xiaoguang Qi. 2006. Topical link analysis for web search. In Proceedings of SIGIR, pages 91–98.
CONCLUSION

- We propose a new graph-based framework, Topical PageRank (TPR).
- We investigate the influence of various parameters on TPR.
- Future work:
  - obtain topics using other machine learning methods and from other knowledge bases
  - consider topic information in other graph-based ranking algorithms such as HITS (Kleinberg, 1999)
  - investigate the influence of corpus selection when training LDA for keyphrase extraction with TPR
RELATED WORK

Topical link analysis for web search (Nie et al., 2006):
- When surfing along a graph link from vertex wi to wj, the ranking score on topic z of wi has a higher probability of passing to the same topic of wj and a lower probability of passing to a different topic of wj.