Keyphrase Extraction in Scientific Documents

Thuy Dung Nguyen and

Min-Yen Kan

School of Computing National University of Singapore

Slides available at http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

Keyphrases!

To think about: Are tags keyphrases?

Credits: Amazon.com, ACM.org, IMDB.com

ICADL 2007 (Hanoi, Vietnam) 2

Using Keyphrases in DLs

Why are keyphrases important to digital libraries?

Navigation

– Searching: Better weighting for terms
– Browsing and Linking: Finding similar documents

Reading

– Highlighting
– Key Concepts Genex

Helping to make the transition seamless between the two

Related Work

Generation

– Kim and Wilbur: statistical properties of distribution
– Tomokiyo and Hurst: phraseness model

Selection

– GenEx (Turney)
– Kea (Frank et al.): just 3 features: TF×IDF, position, corpus frequency
– Turney: selection not independent, use PMI

Assignment

– From Ontology (Medelyan & Witten), use graph features

Architecture

Scientific publication
→ Plain text → Preprocessing
→ HTML formatted output → Generic header mapping model

Preprocessing

- Sentence delimiting
- POS tagging
- Stemming

Candidate Identification

- Simplex noun phrase detection

Basic Features

- TF×IDF
- Position

Morphological Features

- Suffix sequence
- POS sequence
- Acronym

Structural Features

- Section distribution vector

Keyphrase selection model → Key phrases

Key difference from previous works: Centered on scientific publications

As such, it adds two modules to capitalize on this limited domain
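The two basic features named in the architecture, TF×IDF and position, can be sketched as follows. This is a minimal sketch assuming the commonly used Kea-style definitions (term frequency normalized by document length, base-2 inverse document frequency, and first occurrence normalized by document length); the function names and parameters are hypothetical and not taken from the slides.

```python
import math

def tf_idf(phrase_count, doc_length, docs_with_phrase, n_docs):
    """TF x IDF: relative frequency in the document times -log2 of the
    fraction of corpus documents containing the phrase (assumed form)."""
    tf = phrase_count / doc_length
    idf = -math.log2(docs_with_phrase / n_docs)
    return tf * idf

def first_position(tokens, phrase_tokens):
    """Relative position of the first occurrence of the phrase,
    or None if the phrase does not occur in the token list."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i / len(tokens)
    return None
```

A phrase that appears early and is rare across the corpus gets a low position value and a high TF×IDF score.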

1) Morphological Features

POS tags (used in previous work; e.g., Genex)

– Used to identify candidates for simplex noun phrases (i.e., matching regex “(JJ|NN)* IN? NN”)
– Noun modifiers seem to be more productive than adjectival ones (e.g., “Additive”/NN vs. “Additional”/JJ)
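The POS pattern above can be applied to a tagged sentence roughly as follows. This is a sketch under stated assumptions: the input is already POS-tagged with Penn Treebank style tags, all NN* and JJ* tags are collapsed into one class each, and only the longest leftmost non-overlapping matches are returned. The helper names are hypothetical.

```python
import re

def tag_class(tag):
    """Collapse a POS tag into one character so the tag sequence
    can be searched with an ordinary regular expression."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "J"
    if tag == "IN":
        return "I"
    return "O"  # any other tag

# One-character version of the slide's pattern (JJ|NN)* IN? NN
PATTERN = re.compile(r"[JN]*I?N")

def simplex_np_candidates(tagged):
    """tagged: list of (word, tag) pairs for one sentence."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in PATTERN.finditer(classes)
    ]
```

Because each token maps to exactly one character, match offsets in the class string are also token indices, which avoids any offset bookkeeping.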

Suffixes

– sequences on modifiers and headwords (e.g., -ic, -al, -ive on modifiers; -ion, -ics, -ment on headword)
– more fine grained than POS tagging
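A suffix sequence for a candidate phrase could be computed along these lines. This is a minimal sketch: the suffix inventory contains only the six suffixes named on the slide (the real system presumably uses a larger set), and the "-" placeholder for words without a listed suffix is an assumption.

```python
# Only the suffixes named on the slide; longer suffixes are checked first.
SUFFIXES = ("ment", "ion", "ics", "ive", "ic", "al")

def suffix_of(word):
    """Return the first listed suffix the word ends with, or "-" if none."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix):
            return "-" + suffix
    return "-"

def suffix_sequence(phrase):
    """Suffix of each word in the phrase: modifiers first, headword last."""
    return [suffix_of(word) for word in phrase.split()]
```

For example, `suffix_sequence("statistical machine translation")` yields one entry per word, with the last entry describing the headword.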

Morphological Features

Acronym candidate

– Binary feature: is the word an acronym?
– Using simple adjacent pattern matching of parenthesized text to candidates to their left / right
  ICADL (Int’l Conf. on Asian Digital Libraries)
  Int’l Conf. on Asian Digital Libraries (ICADL)
– Weakness:
  - Not comparable to state-of-the-art algorithm, not meant to be
  - Not yet evaluated as a separate component
  - A finer-grained feature may be more useful
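The adjacent pattern matching described above might look like the following. This is only an illustrative sketch, not the authors' matcher: it assumes an acronym is an all-uppercase alphabetic token and that it must equal the initials of the capitalized words in the neighbouring long form (lowercase words such as "on" are skipped). All names are hypothetical.

```python
import re

def _initials(words):
    """Initial letters of the capitalized words only."""
    return "".join(w[0] for w in words if w[0].isupper())

def find_acronyms(text):
    """Map acronym -> long form for 'ACR (long form)' and 'long form (ACR)'."""
    found = {}
    for m in re.finditer(r"\(([^()]+)\)", text):
        inside = m.group(1).strip()
        before = text[:m.start()].split()
        if not inside or not before:
            continue
        if inside.isalpha() and inside.isupper():
            # long form (ACR): walk left until enough capitalized words are seen
            i, n_caps = len(before), 0
            while i > 0 and n_caps < len(inside):
                i -= 1
                if before[i][0].isupper():
                    n_caps += 1
            span = before[i:]
            if _initials(span) == inside:
                found[inside] = " ".join(span)
        elif before[-1].isalpha() and before[-1].isupper():
            # ACR (long form): compare the acronym with the initials in parentheses
            if _initials(inside.split()) == before[-1]:
                found[before[-1]] = inside
    return found
```

The binary feature is then simply whether a candidate appears as a key in the returned mapping.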

Stemming

After other processing, case folding and stemming conflates candidates to obtain accurate phrase counts

– Use Lovins iterated stemmer
– Represent all stems using the most frequent form

Example: voxel (1), Voxels (2), voxelization (5) → Voxelization (8)
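The conflation step can be sketched as below. Note the loud assumption: `toy_stem` is a hypothetical stand-in that only handles this example and is not the Lovins stemmer; any real stemmer can be passed in through the `stem` parameter.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: case-fold, then strip
    one of two suffixes. It is NOT the Lovins algorithm."""
    w = word.lower()
    for suffix in ("ization", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w

def conflate(counts, stem=toy_stem):
    """Merge surface forms that share a stem; label each group with its
    most frequent surface form and sum the counts."""
    groups = defaultdict(Counter)
    for form, n in counts.items():
        groups[stem(form)][form] += n
    return {
        max(forms, key=forms.get): sum(forms.values())
        for forms in groups.values()
    }
```

With the counts from the slide, the three surface forms collapse into one entry with a total count of 8, labeled by the most frequent form.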

2) Structural Feature

Learning which sections are more productive for keyphrases

[Figure: sections of a paper: Abstract, Introduction, Related Work, Methods, Evaluation, Conclusion]

Structural Features

Execution: create a feature vector of where a term logically appears

Stem A: <1, …, 0>
Stem B: <0, …, 0>

Caveat: Lots of unique headers in documents. Not helpful to say a candidate occurs in “Metadata Extraction Approaches”; change it to “Related Work”
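A section distribution vector of this kind could be built as follows. This is a sketch under assumptions: it uses only the six section names shown earlier in the talk rather than the full set of generic headers, it assumes headers have already been mapped to generic ones, and the function and variable names are hypothetical.

```python
# Assumed generic section inventory (six names from the earlier slide).
GENERIC_SECTIONS = [
    "Abstract", "Introduction", "Related Work",
    "Methods", "Evaluation", "Conclusion",
]
_INDEX = {name: i for i, name in enumerate(GENERIC_SECTIONS)}

def section_distribution(stem, sections):
    """sections: list of (generic_header, stems_in_that_section) pairs.
    Returns one count per generic section for the given stem."""
    vector = [0] * len(GENERIC_SECTIONS)
    for header, stems in sections:
        if header in _INDEX:  # headers outside the inventory are ignored
            vector[_INDEX[header]] += stems.count(stem)
    return vector
```

Each candidate stem thus gets a fixed-length vector regardless of how many sections a particular paper has.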

Mapping to Generic Section Headers

Method: also supervised machine learning

Map to 14 generic headers, using four features:

1. Absolute section number (Section 3)
2. Relative position (Section 3 of 11 = 3 / (11-1) = .30)
3. Previous section header text
4. Current section header text
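The four features listed above can be computed per header roughly like this. It is a sketch that follows the slide's arithmetic literally (section number divided by the number of sections minus one); the dictionary keys and the "<START>" marker for the first header are assumptions.

```python
def header_features(headers):
    """headers: section header texts in document order.
    Returns one feature dictionary per header."""
    n = len(headers)
    features = []
    for i, text in enumerate(headers, start=1):
        features.append({
            "absolute": i,                                 # e.g. Section 3
            "relative": i / (n - 1) if n > 1 else 0.0,     # e.g. 3 / (11 - 1)
            "previous": headers[i - 2] if i > 1 else "<START>",
            "current": text,
        })
    return features
```

These dictionaries could then be fed to any supervised classifier that accepts mixed numeric and text features.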

Performance (on a corpus of 1020 headers)

– Maximum Entropy: 92% accuracy
– Hidden Markov Model: 36% accuracy

Evaluation - Corpus Collection

No publicly available corpus of keyphrase assignments for scientific documents*. What to do?

So we collected our own. Freely available at: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

211 documents where text was extractable

– Superset of previous set
– Searched for “keywords general terms filetype:pdf”

* Consider citeulike.org?

Evaluation

120 documents with at least two sets of keyphrases

– One by original author
– One or more by student annotators

Accuracy by matching top ten extracted keyphrases versus the gold standard

– Standard P/R/F1
– Weighted average: weight each phrase by its frequency f in the gold standard, 1 + ln(f)

Tested Naïve Bayes and Maximum Entropy

Using Kea features as the baseline
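The two evaluation schemes can be sketched as follows. The assumptions: matching is exact string match after normalization, and the frequency f of a gold phrase is the number of annotators who assigned it (one reading of the slide). The function names are hypothetical.

```python
import math

def prf1(extracted, gold):
    """Standard precision, recall and F1 over exact phrase matches."""
    matched = len(set(extracted) & set(gold))
    p = matched / len(extracted) if extracted else 0.0
    r = matched / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def weighted_matches(extracted, gold_freq):
    """Sum of 1 + ln(f) over matched phrases, where gold_freq maps each
    gold phrase to its frequency f in the gold standard (assumed: the
    number of annotators who assigned it)."""
    return sum(
        1 + math.log(gold_freq[phrase])
        for phrase in set(extracted)
        if phrase in gold_freq
    )
```

A phrase assigned by a single annotator counts as 1, while a phrase several annotators agree on counts for slightly more, so agreement is rewarded without dominating the score.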

Evaluation Results

Maximum Entropy did not work as well as NB

NB results show statistical significance at .05 level for both evaluation schemes

[Bar chart: number of keywords matched (y-axis, 2 to 5) for Exact Matches and Weighted Matches; bar values 3.03, 3.25, 3.61, 3.84]

Discussion

Assigned Keyphrases

Neural network, 3G network, Soft handover (2), Cluster analysis, Self organizing map, Hierarchical clustering

Kea Baseline

Handover, Soft handover, 3G, Clusters, 3G network, Cell

Our System

Clusters, Soft handover, Data, 3G network, Interesting clusters, Neural network

Errors:

– Still encourage longer phrase generation
– General words still appear (e.g., “data”, “cell”)

Conclusions

Current and Future Work

– Enlarge the keyphrase corpus
– Integrate tagging with keyphrases
– Deploy system into a scholarly digital library

Contributions: better keyphrase extraction:

– Developed features specifically for scientific documents
– Developed mapping model for headers
– Created a corpus for keyphrase testing

http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus

Advertisement: We’re hiring postdocs in terminology extraction and semistructured document processing

End of Presentation

Backup slides follow

ICADL format

23-25 minutes for talk

5 minutes question

30 minutes in total
