Keyphrase Extraction in Scientific Documents
Thuy Dung Nguyen and
Min-Yen Kan
School of Computing National University of Singapore
Slides available at http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
Keyphrases!
To think about: Are tags keyphrases?
Credits: Amazon.com, ACM.org, IMDB.com
ICADL 2007 (Hanoi, Vietnam)
Using Keyphrases in DLs
Why are keyphrases important to digital libraries?
• Navigation
– Searching: Better weighting for terms
– Browsing and Linking: Finding similar documents
• Reading
– Highlighting
– Key Concepts
Genex
Helping to make the transition seamless between the two
Related Work
Generation
– Kim and Wilbur – statistical properties of distribution
– Tomokiyo and Hurst – Phraseness model
Selection
– GenEx (Turney)
– Kea (Frank et al.): just 3 features: TF×IDF, position, corpus frequency
– Turney: selection not independent, use PMI
Assignment
– From Ontology (Medelyan & Witten), use graph features
Architecture
[Architecture diagram; components as labeled]
Scientific publication
Preprocessing
– Sentence delimiting
– POS tagging
– Stemming
Plain text
Candidate Identification
– Simplex noun phrase detection
Generic header mapping model
HTML formatted output
Basic Features
– TF×IDF
– Position
Morphological Features
– Suffix sequence
– POS sequence
– Acronym
Structural Features
– Section distribution vector
Keyphrase selection model
Key phrases
• Key difference from previous work: Centered on scientific publications
• As such, adds two modules to capitalize on this limited domain
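The two basic features can be illustrated with a minimal Python sketch. It assumes the document has already been reduced to an ordered list of candidate phrases; the function name, the `doc_freq` table, and the add-one smoothing are illustrative assumptions, not details taken from the slides or from Kea.

```python
import math

def basic_features(phrases_in_doc, candidate, doc_freq, num_docs):
    """Return (tf_idf, first_position) for one candidate phrase.

    phrases_in_doc: candidate phrases in order of appearance (assumed input).
    doc_freq: hypothetical table, phrase -> number of corpus documents containing it.
    num_docs: size of the background corpus.
    """
    total = len(phrases_in_doc)
    count = phrases_in_doc.count(candidate)
    if total == 0 or count == 0:
        return 0.0, 1.0
    tf = count / total
    # Add-one smoothing (an assumption) so unseen phrases do not divide by zero.
    idf = math.log((num_docs + 1) / (doc_freq.get(candidate, 0) + 1))
    # Position: offset of the first occurrence, normalized to [0, 1).
    first_position = phrases_in_doc.index(candidate) / total
    return tf * idf, first_position


phrases = ["soft handover", "cluster", "soft handover", "data"]
doc_freq = {"soft handover": 1, "data": 9}
score, position = basic_features(phrases, "soft handover", doc_freq, num_docs=9)
```

A phrase that appears in nearly every corpus document (here "data") gets a TF×IDF score near zero, which is the behavior the feature relies on to suppress general words.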
1) Morphological Features
• POS tags (used in previous work; e.g., GenEx)
– Used to identify candidates for simplex noun phrases (i.e., matching regex “(JJ|NN)* IN? NN”)
– Noun modifiers seem to be more productive than adjectival ones (e.g., “Additive”/NN vs. “Additional”/JJ)
• Suffixes
– Sequences on modifiers and headwords (e.g., -ic, -al, -ive on modifiers; -ion, -ics, -ment on headword)
– More fine-grained than POS tagging
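A rough Python sketch of both ideas: matching the slide's tag pattern over a POS-tagged sentence, and reading off a suffix sequence for a candidate. The tagged input is assumed to come from any Penn-Treebank-style tagger; folding NNS/NNP into NN and limiting the suffix list to the six examples on the slide are simplifying assumptions.

```python
import re

# Only the suffixes quoted on the slide; a real feature would use a larger inventory.
SUFFIXES = ("ment", "ion", "ics", "ive", "ic", "al")  # longest first

def tag_class(tag):
    """Collapse a POS tag to one letter so the pattern can run over a string."""
    if tag.startswith("NN"):   # assumption: NNS, NNP, ... count as NN
        return "N"
    if tag.startswith("JJ"):   # assumption: JJR, JJS count as JJ
        return "J"
    if tag == "IN":
        return "I"
    return "O"

# Single-letter encoding of the slide's pattern (JJ|NN)* IN? NN
CANDIDATE = re.compile(r"[JN]*I?N")

def simplex_noun_phrases(tagged):
    """tagged: list of (word, tag) pairs for one sentence."""
    codes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in CANDIDATE.finditer(codes)
    ]

def suffix_sequence(phrase):
    """One suffix (or "") per word, in order."""
    return tuple(
        next((s for s in SUFFIXES if word.endswith(s)), "")
        for word in phrase.lower().split()
    )


sentence = [("soft", "JJ"), ("handover", "NN"), ("improves", "VBZ"),
            ("quality", "NN"), ("of", "IN"), ("service", "NN")]
candidates = simplex_noun_phrases(sentence)
```

Encoding each tag as one character keeps the regular expression identical in shape to the one on the slide while letting the standard `re` module do the matching.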
Morphological Features
• Acronym candidate
– Binary feature: is the word an acronym?
– Using simple adjacent pattern matching of parenthesized text to candidates to their left / right
  ICADL (Int’l Conf. on Asian Digital Libraries)
  Int’l Conf. on Asian Digital Libraries (ICADL)
– Weakness:
  - Not comparable to state-of-the-art algorithm, not meant to be
  - Not yet evaluated as a separate component
  - A finer-grained feature may be more useful
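The adjacent-pattern idea might look like the following Python sketch. The matching rule used here (the acronym must equal the initials of the capitalized words in the expansion) is an assumption chosen to fit the two ICADL examples; the slides do not spell out the actual rule.

```python
import re

def initials(text):
    """Initials of the capitalized words in text (lowercase words are skipped)."""
    return "".join(word[0] for word in text.split() if word[0].isupper())

def find_acronyms(sentence):
    """Return the set of acronyms supported by an adjacent parenthesized pattern."""
    found = set()
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        inside = match.group(1).strip()
        left_words = sentence[:match.start()].split()
        if not left_words:
            continue
        # Pattern 1: ACRONYM (expansion)
        if left_words[-1].isupper() and initials(inside) == left_words[-1]:
            found.add(left_words[-1])
        # Pattern 2: expansion (ACRONYM); grow the window leftwards until it fits
        elif inside.isupper() and " " not in inside:
            for start in range(len(left_words) - 1, -1, -1):
                if initials(" ".join(left_words[start:])) == inside:
                    found.add(inside)
                    break
    return found


left = find_acronyms("ICADL (Int'l Conf. on Asian Digital Libraries)")
right = find_acronyms("Int'l Conf. on Asian Digital Libraries (ICADL)")
```

The binary feature for a candidate is then simply whether it appears in the returned set.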
Stemming
• After other processing, case folding and stemming conflate candidates to obtain accurate phrase counts
– Use Lovins iterated stemmer
– Represent all stems using the most frequent form
Example: voxel (1), Voxels (2), voxelization (5) → Voxelization (8)
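The conflation step can be sketched in Python as below. The `toy_stem` function is a deliberately crude stand-in and is not the Lovins algorithm; any real stemmer could be passed in through the `stem` parameter. Names are illustrative.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Crude placeholder stemmer (NOT Lovins): lowercases and strips two suffixes."""
    word = word.lower()
    for suffix in ("ization", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflate(surface_counts, stem=toy_stem):
    """Merge counts of forms sharing a stem; label each group by its most frequent form."""
    groups = defaultdict(Counter)
    for form, count in surface_counts.items():
        groups[stem(form)][form.lower()] += count  # case folding happens here
    return {
        max(forms, key=forms.get): sum(forms.values())
        for forms in groups.values()
    }


merged = conflate({"voxel": 1, "Voxels": 2, "voxelization": 5})
```

With the counts from the slide, the three surface forms collapse into one entry labeled by the most frequent form, with a combined count of 8.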
2) Structural Feature
Learning which sections are more productive for keyphrases
[Figure: generic sections of a paper: Abstract, Introduction, Related Work, Methods, Evaluation, Conclusion]
Structural Features
Execution: create a feature vector of where a term logically appears
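One way to picture such a vector, as a Python sketch: count a candidate's occurrences per generic section and normalize. Only the six sections shown on the previous slide are used here, and normalizing to proportions (rather than raw counts or binary flags) is an assumption.

```python
GENERIC_SECTIONS = ["abstract", "introduction", "related work",
                    "methods", "evaluation", "conclusion"]

def section_distribution(occurrences, sections=GENERIC_SECTIONS):
    """occurrences: one generic section label per occurrence of the candidate.

    Returns the share of occurrences falling in each section, in the
    fixed order given by `sections`.
    """
    counts = [occurrences.count(section) for section in sections]
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]


vector = section_distribution(
    ["abstract", "introduction", "introduction", "methods"]
)
```

A learner can then pick up on patterns such as candidates concentrated in the abstract and introduction being more likely keyphrases.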
Mapping to Generic Section Headers
• Method: also supervised machine learning
• Map to 14 generic headers
1. Absolute section number (Section 3)
2. Relative position (Section 3 of 11 = 3 / (11-1) = .30)
3. Previous section header text
4. Current section header text
• Performance (on a corpus of 1020 headers)
– Maximum Entropy: 92% accuracy
– Hidden Markov Model: 36% accuracy
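Reading the four numbered items as the classifier's input features (an assumption; the slide does not label them), a Python sketch of the feature extraction could be as follows. The dictionary keys and the "<start>" placeholder are illustrative names.

```python
def header_features(headers, index):
    """Features for the header at 0-based `index` in a document's header list."""
    total = len(headers)
    number = index + 1                      # 1. absolute section number
    # 2. relative position, following the slide's arithmetic: 3 / (11 - 1) = .30
    relative = number / (total - 1) if total > 1 else 0.0
    previous = headers[index - 1] if index > 0 else "<start>"  # placeholder token
    return {
        "abs_number": number,
        "rel_position": relative,
        "prev_header": previous,            # 3. previous section header text
        "cur_header": headers[index],       # 4. current section header text
    }


headers = [f"Section {i}" for i in range(1, 12)]   # a paper with 11 sections
features = header_features(headers, 2)             # the third section
```

These dictionaries could be fed to any supervised classifier that predicts one of the generic header labels.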
Evaluation - Corpus Collection
No publicly available corpus of keyphrase assignments for scientific documents*. What to do?
So we collected our own. Freely available at: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
• 211 documents where text was extractable
– Superset of previous set
• Searched for “keywords general terms filetype:pdf”
* Consider citeulike.org?
Evaluation
• 120 documents with at least two sets of keyphrases
– One by original author
– One or more by student annotators
• Accuracy by matching top ten extracted keyphrases versus the gold standard
– Standard P/R/F1
– Weighted average: use frequency of phrase in standard: 1 + ln(f)
• Tested Naïve Bayes and Maximum Entropy
• Using Kea features as the baseline
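Both scoring schemes can be sketched in Python. Here f is taken to be the number of keyphrase sets that list a phrase, which is an interpretation of the slide rather than a stated definition. The sample data reuses the phrases from the Discussion slide, reading "Soft handover (2)" as f = 2.

```python
import math

def prf1(extracted, gold, k=10):
    """Precision, recall and F1 of the top-k extracted phrases against a gold set."""
    top = extracted[:k]
    matched = sum(1 for phrase in top if phrase in gold)
    precision = matched / len(top) if top else 0.0
    recall = matched / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def weighted_matches(extracted, gold_freq, k=10):
    """Sum of 1 + ln(f) over matched phrases; gold_freq maps phrase -> f (f >= 1)."""
    return sum(1 + math.log(gold_freq[p]) for p in extracted[:k] if p in gold_freq)


gold_freq = {"neural network": 1, "3g network": 1, "soft handover": 2,
             "cluster analysis": 1, "self organizing map": 1,
             "hierarchical clustering": 1}
system = ["clusters", "soft handover", "data", "3g network",
          "interesting clusters", "neural network"]
p, r, f1 = prf1(system, set(gold_freq))
weighted = weighted_matches(system, gold_freq)
```

A phrase chosen by two annotators contributes 1 + ln 2 ≈ 1.69 instead of 1, so agreement among annotators is rewarded without letting a single popular phrase dominate.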
Evaluation Results
• Maximum Entropy did not work as well as NB
• NB results show statistical significance at .05 level for both evaluation schemes
[Bar chart: number of keywords matched, for Exact Matches and Weighted Matches; values shown: 3.03, 3.25, 3.61, 3.84]
Discussion
Assigned Keyphrases
– Neural network
– 3G network
– Soft handover (2)
– Cluster analysis
– Self organizing map
– Hierarchical clustering
Kea Baseline
– Handover
– Soft handover
– 3G
– Clusters
– 3G network
– Cell
Our System
– Clusters
– Soft handover
– Data
– 3G network
– Interesting clusters
– Neural network
Errors:
• Still encourage longer phrase generation
• General words still appear (e.g., “data”, “cell”)
Conclusions
Current and Future Work
– Enlarge the keyphrase corpus
– Integrate tagging with keyphrases
– Deploy system into a scholarly digital library
Contributions: better keyphrase extraction:
– Developed features specifically for scientific documents
– Developed mapping model for headers
– Created a corpus for keyphrase testing
http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
Advertisement: We’re hiring postdocs in terminology extraction and semistructured document processing
End of Presentation
Backup slides follow
ICADL format
• 23-25 minutes for talk
• 5 minutes for questions
• 30 minutes in total