TextRank : Bringing Order into Texts
Rada Mihalcea and Paul Tarau
Department of Computer Science,
University of North Texas
EMNLP 2004
(Conference on Empirical Methods in Natural Language Processing)
May 18, 2011
In-seok An
SNU Internet Database Lab.
Outline
Introduction
The TextRank Model
– Undirected Graphs
– Weighted Graphs
– Text as a Graph
Keyword Extraction
– TextRank for Keyword Extraction
– Evaluation
Sentence Extraction
– TextRank for Sentence Extraction
– Evaluation
Conclusion
2 / 33
Introduction
Graph-based ranking algorithm
– Decide on the importance of a vertex
take into account global information
– recursively computed from the entire graph
– Have been successfully used in
Citation analysis
Social networks
Link-structure of the Web
– Can be applied to NLP applications
Automated extraction of key phrases
Extractive summarization
Etc.
3 / 33
Introduction
TextRank
– Graph-based model
For graphs extracted from texts
– Two NLP tasks
Keyword extraction
Sentence extraction
– TextRank is competitive
Compared with other NLP algorithms
Graph, Ranking,
Formula, Boring
We introduce TextRank
4 / 33
The TextRank Model
Graph-based ranking algorithms
– A way of deciding the importance of a vertex within a graph
– Based on global information
Recursively drawn from the entire graph
Basic idea
– Voting
– Recommendation
The score of vertex
– How many?
– Who?
6 / 33
The TextRank Model
Score of a vertex
– S (Vi ) : Score of the vertex
– Vi : Vertex
– In(Vi ) : the set of vertices that point to it ( predecessors )
– Out(Vi ) : the set of vertices that vertex Vi points to ( successors )
– d : damping factor
The probability of jumping from a given vertex to another random vertex in the graph
– Random surfer model
– 0.85 ( PageRank )
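The score formula itself did not survive in this transcript. Using only the symbols defined on this slide, the PageRank-style recurrence that TextRank builds on can be written as follows (a reconstruction from those definitions, not the slide's original rendering):

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, S(V_j)
```

Each predecessor Vj splits its own score evenly among its |Out(Vj)| successors, so a vertex scores highly when many vertices vote for it ( how many ) and when those voters are themselves highly scored ( who ).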
7 / 33
The TextRank Model
The Score of Graph
– Starting from arbitrary values
– The computation iterates
Until convergence below a given threshold is achieved
The Score of vertex
– Importance of the vertex
– The final values are not affected by the initial value
Only the number of iterations to convergence may be different
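The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rank_vertices`, the dictionary-based graph format, and the `max_iter` safety cap are assumptions made for the example.

```python
def rank_vertices(out_links, d=0.85, threshold=1e-4, initial=1.0, max_iter=200):
    """Iterate S(Vi) = (1 - d) + d * sum(S(Vj) / |Out(Vj)|) until convergence.

    out_links maps each vertex to the list of vertices it points to
    (hypothetical input format chosen for this sketch).
    """
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)

    # Build the predecessor lists In(Vi) from the successor lists Out(Vi).
    in_links = {v: [] for v in vertices}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    out_degree = {v: len(out_links.get(v, [])) for v in vertices}

    scores = {v: initial for v in vertices}  # arbitrary starting values
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_scores = {
            v: (1 - d) + d * sum(scores[u] / out_degree[u] for u in in_links[v])
            for v in vertices
        }
        # Largest change of any vertex score in this iteration.
        error = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if error < threshold:
            break
    return scores, iterations


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores_from_1, n1 = rank_vertices(graph, initial=1.0)
scores_from_10, n10 = rank_vertices(graph, initial=10.0)
```

Running it from two different starting values should give nearly the same final scores; only the iteration counts `n1` and `n10` are expected to differ.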
8 / 33
The TextRank Model
Undirected Graphs
Recursive graph-based ranking algorithm
– Traditionally applied on directed graphs
– Can be applied to undirected graphs
The out-degree of a vertex is equal to the in-degree of the vertex
Convergence curve
– As the connectivity of the graph increases
Fewer iterations are needed to converge
– The convergence curves for directed and undirected graphs practically overlap
9 / 33
The TextRank Model
Weighted Graphs
PageRank
– Assuming unweighted graph
– A page rarely includes multiple or partial links to another page
TextRank
– May include multiple or partial links between the units
The graphs are built from natural language text
– Incorporate the “strength” of connectivity
Weight of the edge
10 / 33
The TextRank Model
Weighted Graphs
New measure
– The final scores differ significantly compared to the original measure
– The number of iterations is almost identical for weighted and unweighted graphs
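The weighted measure is not reproduced in this transcript. In the TextRank paper it replaces the 1 / |Out(Vj)| share with the edge weight w_ji divided by the total weight leaving Vj. A minimal Python sketch under that reading follows; the name `weighted_rank` and the edge-dictionary input format are assumptions made for illustration.

```python
def weighted_rank(weights, d=0.85, threshold=1e-4, max_iter=200):
    """Weighted score:
    WS(Vi) = (1 - d) + d * sum over Vj in In(Vi) of
             w_ji / (sum over Vk in Out(Vj) of w_jk) * WS(Vj)

    weights[(j, i)] is the weight w_ji of the edge from Vj to Vi
    (hypothetical input format; for an undirected graph, list both directions).
    """
    vertices = {v for edge in weights for v in edge}
    # Ignore non-positive weights so the normaliser below is never zero.
    edges = {edge: w for edge, w in weights.items() if w > 0}

    # Total outgoing weight of each vertex.
    out_weight = {v: 0.0 for v in vertices}
    for (j, _), w in edges.items():
        out_weight[j] += w

    scores = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        new_scores = {v: 1 - d for v in vertices}
        for (j, i), w in edges.items():
            new_scores[i] += d * w / out_weight[j] * scores[j]
        error = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if error < threshold:
            break
    return scores


# "a" links to "c" three times as strongly as to "b".
ws = weighted_rank({("a", "b"): 1.0, ("a", "c"): 3.0, ("b", "a"): 1.0, ("c", "a"): 1.0})
```

With all weights equal, this reduces to the unweighted score, which is one way to sanity-check an implementation.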
11 / 33
The TextRank Model
Text as a Graph
Build a graph
– Represents the text
– Interconnects words or other text entities with meaningful relations
Text units of various sizes
Various characteristics
– Words, entire sentences, collocations, etc.
The type of relations
– Lexical semantic relations
– Contextual overlap
– Etc.
12 / 33
The TextRank Model
Text as a Graph
4 steps of Graph-based ranking algorithms
– Identify text units
Best define the task at hand
Add them as vertices in the graph
– Identify relations
Connect such text units
Use these relations to draw edges
– Directed
– Undirected
– Iterate the graph-based ranking algorithm
Until convergence
– Sort vertices based on their final score
13 / 33
Keyword Extraction
Keyword Extraction
– Automatically identify a set of terms
Best describe the document
– Uses of these keywords
Building an automatic index
Classify a text
Concise summary
Terminology extraction
Construction of domain-specific dictionaries
TextRank
– No limitation on the size of the text
15 / 33
Keyword Extraction
Possible approach
– Frequency criterion
– Supervised learning methods
Parametrized heuristic rules ( combined with genetic algorithm )
– Turney, 1999
– Precision : 29.0%
Five key phrases extracted per document
Naïve Bayes learning scheme
– Frank et al., 1999
– Precision : 18.3%
Fifteen key phrases per document
– Keyword extraction from abstract
More widely applicable
– Many documents on the internet are not available as full-texts
– Accuracy of the system is almost doubled by adding linguistic knowledge to the term representation
Part of speech information
Hulth, 2003
16 / 33
Keyword Extraction
TextRank for Keyword Extraction
End result of TextRank
– A set of words or phrases
Representative for a given natural language text
Sequences of one or more lexical units extracted from text
Relation
– Can be defined between two lexical units
– Co-occurrence relation
Two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words
N can be set anywhere from 2 to 10 words
Syntactic filter
– Consider only
All open class words
Nouns and verbs
Nouns and adjectives only
17 / 33
Keyword Extraction
TextRank for Keyword Extraction
TextRank process
– Text tokenizing
Annotated with part of speech tags
– Preprocessing step required to enable the application of syntactic filters
Only single words as candidates for addition to the graph
– To avoid excessive growth of the graph size
– Multi-word keywords being eventually reconstructed in the post-processing phase
– Syntactic filtering
All lexical units that pass the filter are added to the graph
Edge is added between those lexical units
– That co-occur within a window of N words
Initial score of each vertex is set to 1
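The graph construction above can be sketched as follows. This is an illustrative reading, not the authors' code: the input is assumed to be already tokenized and tagged, the tag names `NOUN` and `ADJ` are placeholders for whatever tagset is in use, and the window is measured over the original token positions (one possible interpretation of "within a window of N words").

```python
def build_cooccurrence_graph(tagged_tokens, window=2, allowed_tags=("NOUN", "ADJ")):
    """Return an undirected graph as {word: set of neighbouring words}.

    tagged_tokens is a list of (word, tag) pairs in text order.
    """
    words = [word.lower() for word, _ in tagged_tokens]
    keep = [tag in allowed_tags for _, tag in tagged_tokens]  # syntactic filter

    graph = {}
    for i, word in enumerate(words):
        if not keep[i]:
            continue
        graph.setdefault(word, set())  # isolated candidates are still vertices
        # A window of N words covers the current word plus the next N - 1 words.
        for j in range(i + 1, min(i + window, len(words))):
            if keep[j] and words[j] != word:
                graph.setdefault(words[j], set())
                graph[word].add(words[j])
                graph[words[j]].add(word)
    return graph


tagged = [("Matlab", "NOUN"), ("code", "NOUN"), ("for", "ADP"),
          ("plotting", "VERB"), ("ambiguity", "NOUN"), ("functions", "NOUN")]
cooc = build_cooccurrence_graph(tagged, window=2)
```

Each vertex of the resulting graph would then receive an initial score of 1 before the ranking algorithm is run.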
18 / 33
Keyword Extraction
TextRank for Keyword Extraction
– Ranking algorithm
Is run on the graph for several iterations
– Until it converges ( usually 20~30 iterations )
– Threshold of 0.0001
– Sorting
Reverse order of their score
The top T vertices are retained for post-processing
– T may be set to any fixed value ( usually ranging from 5 to 20 )
– Here, the number of keywords is decided based on the size of the text
– T is set to a third of the number of vertices in the graph
– Post-processing
Sequences of adjacent keywords are collapsed into a multi-word keyword
– E.g.) Matlab code for plotting ambiguity functions
– If Matlab and code are selected as potential keywords
– They are collapsed into one single keyword Matlab code
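The post-processing step can be illustrated with a short Python sketch. The function name `collapse_keywords` and the case-insensitive matching are assumptions made for this example.

```python
def collapse_keywords(tokens, keywords):
    """Merge runs of adjacent selected words into multi-word keywords."""
    selected = {keyword.lower() for keyword in keywords}
    phrases, current = [], []
    for token in tokens:
        if token.lower() in selected:
            current.append(token)  # extend the current run of keywords
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    # Drop duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(phrases))


text = "Matlab code for plotting ambiguity functions".split()
merged = collapse_keywords(text, {"matlab", "code", "functions"})
```

Here `Matlab` and `code` are adjacent in the text and both selected, so they end up as the single keyword `Matlab code`, while `functions` stays a single-word keyword.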
19 / 33
Keyword Extraction
TextRank for Keyword Extraction
Sample graph
20 / 33
Keyword Extraction
Evaluation
Inspec database
– Abstracts of journal papers from Computer Science and Information Technology
– 500 abstracts
– Each abstract comes with two sets of keywords
Controlled keywords
– Restricted to a given thesaurus
Uncontrolled keywords
– Freely assigned by the indexers
– We use the uncontrolled set of keywords
In the previous experiments
– Hulth uses a total of 2000 abstracts
1000 for training
500 for development
500 for test
– TextRank is completely unsupervised
No training/development data
Only using the test documents for evaluation purposes
21 / 33
Keyword Extraction
Evaluation
Results for automatic keyword extraction
TextRank achieves the highest precision and F-measure
Larger window does not seem to help
– Relation between words that are further apart is not strong
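For reference, precision, recall, and F-measure for one document can be computed as in the sketch below. It assumes exact string matching between extracted and reference keywords, which is a simplification; the slides do not describe the matching procedure used in the actual evaluation.

```python
def precision_recall_f(extracted, reference):
    """Precision, recall and F-measure of extracted keywords against a reference set."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = precision_recall_f(["a", "b", "c", "d"], ["a", "b", "e"])
```

In this toy case two of the four extracted keywords are correct and two of the three reference keywords are found.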
22 / 33
Keyword Extraction
Evaluation
Syntactic filter
– Experiments were performed with various syntactic filters
– Best performance was achieved with the “nouns and adjectives only” filter
– Linguistic information helps the process of keyword extraction
Results obtained without part of speech information were significantly lower
TextRank system
– Leads to an F-measure higher than any of the previously proposed systems
– Is completely unsupervised
– Relies exclusively on information drawn from the text itself
Which makes it easily portable to other text collections, domains, and languages
23 / 33
Sentence Extraction
Sentence Extraction
– Identifying sequences that are more representative for the given text
– Deal with entire sentences
– Regarded as similar to keyword extraction
The goal
– Rank entire sentences
– Extraction for automatic summarization
25 / 33
Sentence Extraction
TextRank for Sentence Extraction
Build a graph
– Vertex is added to the graph for each sentence in the text
– Determines a connection between two sentences
If there is a “similarity” relation between them
Similarity is measured as a function of their content overlap
“co-occurrence” is not a meaningful relation for sentences
– Sentences are large contexts
Link can be drawn between any two such sentences that share common content
26 / 33
Sentence Extraction
TextRank for Sentence Extraction
Similarity
– The number of common tokens between the lexical representations of the two sentences
It can be run through syntactic filters
– We are using a normalization factor
To avoid promoting long sentences
Sentence similarity measure
– Sentence $S_i = w_1^i, w_2^i, \ldots, w_{N_i}^i$ ( a sentence of $N_i$ words )
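The similarity formula is missing from this transcript. In the TextRank paper, the overlap count is divided by the sum of the logarithms of the two sentence lengths, i.e. Similarity(Si, Sj) = |{ w : w in Si and w in Sj }| / ( log|Si| + log|Sj| ). A minimal Python sketch under that definition follows; the handling of empty and one-word sentences is an assumption added to avoid a zero or undefined denominator.

```python
import math


def sentence_similarity(tokens_i, tokens_j):
    """Number of shared tokens, normalised by the log lengths of both sentences."""
    if not tokens_i or not tokens_j:
        return 0.0
    denominator = math.log(len(tokens_i)) + math.log(len(tokens_j))
    if denominator == 0:  # both sentences consist of a single token
        return 0.0
    overlap = len(set(tokens_i) & set(tokens_j))
    return overlap / denominator


s1 = ["graph", "ranking", "model"]
s2 = ["graph", "model", "text", "units"]
sim = sentence_similarity(s1, s2)
```

The two example sentences share two tokens, so `sim` equals 2 / (log 3 + log 4). The logarithmic normaliser grows slowly, which damps the advantage of long sentences without removing it entirely.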
27 / 33
Sentence Extraction
TextRank for Sentence Extraction
The resulting graph
– Highly connected
– Weight associated with each edge
The strength of the connections established between various sentence pairs in the text
– Use weighted graph-based ranking formula
Sentences are sorted in reversed order of their score
– Top ranked sentences are selected for inclusion in the summary
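Once every sentence has a score, building the summary is a simple selection step. The sketch below assumes the scores have already been computed by the weighted ranking; the name `select_summary`, the `max_sentences` parameter, and the choice to present the selected sentences in their original document order are illustrative assumptions, not details given on the slide.

```python
def select_summary(sentences, scores, max_sentences=3):
    """Pick the top-scoring sentences and return them in document order."""
    # Sentence indices sorted in reverse order of their score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]


summary = select_summary(["s0", "s1", "s2", "s3"], [0.2, 0.9, 0.1, 0.5], max_sentences=2)
```

Changing `max_sentences` (or replacing it with a word budget) gives shorter or longer summaries from the same ranking.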
28 / 33
Sentence Extraction
TextRank for Sentence Extraction
Sample graph built for sentence extraction
29 / 33
Sentence Extraction
Evaluation
Single-document summarization
– 567 news articles
Provided during the Document Understanding Evaluations 2002 ( DUC )
– TextRank generates a 100-word summary
The task undertaken by other systems participating in this single document summarization task ( fifteen different systems participated )
ROUGE evaluation toolkit
– Method based on N-gram statistics
– Highly correlated with human evaluations ( Lin and Hovy, 2003 )
– Two manually produced reference summaries are provided
And used in the evaluation process
– We compare the performance of TextRank with the top five performing systems
As well as with the baseline proposed by the DUC evaluators
30 / 33
Sentence Extraction
Evaluation
Result for single document summarization
31 / 33
Sentence Extraction
Evaluation
Discussion
– Represents a summarization model closer to what humans are doing
Fully unsupervised
Relies only on the given text to derive an extractive summary
– TextRank goes beyond the sentence “connectivity” in a text
Sentence 15 would not be identified as “important” based on the number of connections
But it is identified as “important” by TextRank
Humans also identify the sentence as “important”
– TextRank gives a ranking over all sentences in a text
It can be easily adapted to extracting very short or longer summaries
32 / 33
Conclusion
We introduce TextRank
– Graph-based ranking model for text processing
We show how it can be successfully used for natural language applications
– Keyword extraction
– Sentence extraction
– Accuracy achieved by TextRank is competitive
TextRank
– It does not require deep linguistic knowledge, nor domain or language specific annotated corpora
– Highly portable to other domains, genres, or languages
33 / 33