Transcript Slides

Learning to Predict Readability using
Diverse Linguistic Features
Rohit J. Kate, Xiaoqiang Luo, Siddharth Patwardhan, Martin Franz,
Radu Florian, Raymond J. Mooney, Salim Roukos, Chris Welty
Presented by: Young-Suk Lee
The University of Texas at Austin; IBM T. J. Watson Research Center
© 2010 IBM Corporation
Outline
 Problem definition and motivations
 Data
 System and Features
 Experimental Results
Readability
 DARPA machine reading program (MRP)
 “Readability is defined as a subjective judgment
of how easily a reader can extract the
information the writer or the speaker intended to
convey.”
 Task: given a general document, assign a
readability score (1 to 5)
Sample Passage: High Readability
 Industrial agriculture has grown increasingly
paradoxical, replacing natural processes
with synthetic practices and treating farms
as factories. Consequently, food has
become a marketing entity rather than a
necessity to sustain life. …
Sample Passage: Low Readability
 The word of the prince of believers may
Allah God him Talk of gold this at present
Reflections on the word of the prince of
believers may Allah pleased with him,
Prince of Believers May Allah be pleased
with him: …
Readability: Motivations
 Remove less readable documents from web-search
 Filter out less readable documents before extracting
knowledge
 Select reading materials
Contrast With Other Work
 Predicting readability: conveying message
– vs. reading difficulty (grade 1 to 12)
 Document sources: multiple genres
– vs. single domain, genre or reader group
Outline
 Problem definition and motivations
 Data
 System and Features
 Experimental Results
Data
 390 training documents
 Each document:
– 8 expert ratings: [1,…,5]
– 6-10 “novice” ratings: [1,…,5]
 Ratings differ by genre
– Nwire and wiki documents: high
– MT documents: low

Genre      #Docs   Expert Rating   Novice Rating
nwire      56      4.93            4.23
wiki       56      4.83            4.13
weblog     55      4.46            3.75
q-trans    56      4.47            3.83
news-grp   55      4.26            3.34
ccap       56      4.13            3.53
mt         56      2.38            1.92
Data
Histogram of Novice Ratings
[Figure: count (0 to 250) of novice ratings at each rating value (1 to 5), broken down by genre: nw, wk, wl, qt, ng (newsgroup), cc (speech: closed caption), mt; an annotation marks the MT docs.]
Outline
 Problem definition and motivations
 Data
 System and Features
 Experimental Results
System Overview
[Diagram: Training Docs and Test Doc pass through Preprocessing; feature scores (LM score, …, Parser score) feed Regression (WEKA), which outputs the Sys. Rating.]
Syntactical Features
 Using Sundance [Riloff & Phillips 04] and
English Slot Grammar (ESG) parsers
– Ratio of sentences without verbs
– Avg. # clauses per sentence
– Avg. # NPs, # VPs, # PPs, # phrases per sentence
– Failure rate of ESG parser
– …
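As an illustration of the first feature in this list, the ratio of sentences without verbs can be sketched as below. This is only a sketch: it assumes the sentences are already POS-tagged with Penn Treebank-style tags (verb tags start with "VB"). The output formats of the Sundance and ESG parsers are not shown in the talk, so the input format and the function name `verbless_sentence_ratio` are hypothetical.

```python
def verbless_sentence_ratio(tagged_sentences):
    """Fraction of sentences that contain no verb tag.

    tagged_sentences: list of sentences, each a list of (token, pos_tag) pairs.
    Assumption: Penn Treebank-style tags, where every verb tag starts with "VB".
    """
    if not tagged_sentences:
        return 0.0
    verbless = sum(
        1
        for sentence in tagged_sentences
        if not any(tag.startswith("VB") for _, tag in sentence)
    )
    return verbless / len(tagged_sentences)


# Toy input: one sentence with verbs, one without.
doc = [
    [("Food", "NN"), ("has", "VBZ"), ("become", "VBN"), ("a", "DT"), ("commodity", "NN")],
    [("Talk", "NN"), ("of", "IN"), ("gold", "NN")],
]
print(verbless_sentence_ratio(doc))  # 0.5
```

The other averages in the list (clauses, NPs, VPs, PPs per sentence) would follow the same pattern: count per sentence, then divide by the number of sentences.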
Language Model (LM) Features
 Normalized document probability:
– by a 5-gram generic LM
 Genre-specific LMs
– Data readily available for those genres
– Genre is a strong predictor of readability
Genre-based Language Model Features
 Perplexity of genre-specific LM (Mj):
[Equation; labeled terms: Document, Word, History words]
 Genre posterior perplexity (relative
probability compared to all G genres):
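The two quantities on this slide can be sketched as follows. The formulas here are assumptions, not taken from the slide: perplexity is computed in the standard way, as exp of the negative average log-probability of each word given its history, and the "posterior perplexity" is taken to be a genre's perplexity divided by the sum over all G genres. The per-word log-probabilities are hypothetical inputs that any n-gram LM could supply.

```python
import math


def perplexity(word_logprobs):
    """Perplexity from natural-log probabilities log P(word | history words)."""
    average = sum(word_logprobs) / len(word_logprobs)
    return math.exp(-average)


def posterior_perplexities(logprobs_by_genre):
    """Divide each genre's perplexity by the sum over all genres (assumed form)."""
    perps = {genre: perplexity(lps) for genre, lps in logprobs_by_genre.items()}
    total = sum(perps.values())
    return {genre: p / total for genre, p in perps.items()}


# Hypothetical log-probabilities of one 3-word document under two genre LMs.
logprobs = {
    "nwire": [math.log(0.5)] * 3,
    "mt": [math.log(0.125)] * 3,
}
post = posterior_perplexities(logprobs)
print(post)  # the genre whose LM fits the document better gets the smaller share
```

Each genre contributes one perplexity feature and one posterior feature per document, so G genres yield 2G features under this reading.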
Lexical Features
 Fraction of known words using dictionary and
gazetteer of names
 Out-of-vocabulary (OOV) rates using genre-based
corpora
 Ratio of function words (“the”, “of” etc.)
 Ratio of pronouns
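A minimal sketch of how such lexical ratios might be computed. The tokenizer, the tiny word lists, and the function name `lexical_features` are illustrative assumptions; the dictionary, gazetteer, and genre-based corpora used in the talk are not shown.

```python
import re

# Tiny illustrative lists; a real system would use full lexicons.
FUNCTION_WORDS = {"the", "of", "a", "an", "and", "to", "in", "with", "on", "at"}
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "him", "her", "them"}


def lexical_features(text, dictionary):
    """Return the known-word, function-word, and pronoun ratios of a text.

    dictionary: set of lowercase known words (stands in for dictionary + gazetteer).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    if n == 0:
        return {"known": 0.0, "function": 0.0, "pronoun": 0.0}
    return {
        "known": sum(t in dictionary for t in tokens) / n,
        "function": sum(t in FUNCTION_WORDS for t in tokens) / n,
        "pronoun": sum(t in PRONOUNS for t in tokens) / n,
    }


feats = lexical_features("The word of the prince", {"the", "word", "of", "prince"})
print(feats)
```

An OOV rate against a genre corpus can be read off the same function as 1 minus the "known" ratio, with the genre's vocabulary passed as `dictionary`.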
Experiments: Evaluation Metric
 Pearson correlation coefficient
– Mean expert judge rating as the gold-standard
 To compare with novice judges:
– A sampling distribution representing performance of novice judges
was generated
– Distribution mean and upper critical value were computed
 Correlation between system and mean expert ratings
– If above the upper critical value: system significantly (statistically)
better than novice judges
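The comparison above can be sketched as follows. The slide does not say how the sampling distribution was generated or which critical level was used, so this sketch makes two assumptions: each sample simulates one novice judge by drawing one novice rating per document at random, and the upper critical value is the 95th percentile of the sampled correlations. All data below are toy values.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def novice_distribution(expert_means, novice_ratings, n_samples=1000, seed=0):
    """Mean and (assumed) 95th-percentile critical value of novice correlations."""
    rng = random.Random(seed)
    corrs = []
    for _ in range(n_samples):
        # One simulated novice judge: one randomly chosen rating per document.
        sampled = [rng.choice(ratings) for ratings in novice_ratings]
        corrs.append(pearson(sampled, expert_means))
    corrs.sort()
    upper = corrs[int(0.95 * (len(corrs) - 1))]
    return statistics.mean(corrs), upper


# Toy data: 4 documents, mean expert rating and a few novice ratings each.
expert_means = [1.5, 2.5, 4.0, 4.8]
novice_ratings = [[1, 2, 2], [2, 3, 4], [3, 4, 5], [4, 5, 5]]
system_scores = [1.4, 2.7, 3.9, 4.9]

dist_mean, upper_critical = novice_distribution(expert_means, novice_ratings)
system_corr = pearson(system_scores, expert_means)
print(system_corr > upper_critical)  # True would mean "better than novice judges"
```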
Outline
 Problem definition and motivations
 Data
 System and Features
 Experimental Results
Experiments: Methodology
 Compared regression algorithms
 Feature ablation experiments
 Results: 13-fold cross-validation
– Balanced genre representation
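One way to build 13 folds with balanced genre representation is sketched below. The talk does not describe how its folds were constructed, so the round-robin scheme and the function name `genre_balanced_folds` are assumptions; the genre counts are the ones from the Data slide.

```python
from collections import Counter, defaultdict


def genre_balanced_folds(doc_genres, n_folds=13):
    """Assign each document to a fold so every genre is spread evenly.

    doc_genres: list where doc_genres[i] is the genre label of document i.
    Returns a list of fold ids (0 .. n_folds - 1), one per document.
    """
    by_genre = defaultdict(list)
    for idx, genre in enumerate(doc_genres):
        by_genre[genre].append(idx)

    folds = [None] * len(doc_genres)
    position = 0  # running counter keeps total fold sizes even across genres
    for genre in sorted(by_genre):
        for idx in by_genre[genre]:
            folds[idx] = position % n_folds
            position += 1
    return folds


# Genre counts from the Data slide (390 documents in total).
counts = {"nwire": 56, "wiki": 56, "weblog": 55, "q-trans": 56,
          "news-grp": 55, "ccap": 56, "mt": 56}
doc_genres = [genre for genre, n in counts.items() for _ in range(n)]
folds = genre_balanced_folds(doc_genres)
fold_sizes = Counter(folds)
print(sorted(fold_sizes.items()))
```

With 390 documents and 13 folds, this scheme gives each fold 30 documents and 4 or 5 documents from every genre.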
Results: Regression Algorithms
[Bar chart: correlation (0 to 1) for Bagged Decision Trees, Decision Tree, Linear Regression, SVM Regression, and Gaussian Process Regression; reference lines: Upper Critical Value, Distribution Mean]
Choice of regression algorithm is not critical.
Results: Feature Sets
[Bar chart: correlation (0 to 1) for the feature sets All, Lexical, Syntactical, Lexical + Syntactical, and LM Based; reference lines: Upper Critical Value, Distribution Mean]
Each feature set contributes; the LM-based feature set is the most useful.
Results: Genre-based Feature Sets
[Bar chart: correlation (0 to 1) for the feature sets All, Genre-independent, and Genre-based; reference lines: Upper Critical Value, Distribution Mean]
Genre-independent features: better than novice mean;
Genre-specific features: significantly improve performance.
Results: Individual Feature Sets
[Bar chart: correlation (0 to 1) for All, Sundance, ESG, Perp., Post. Perp., and OOV rates; two series: By itself, Ablated from all; reference lines: System using all features, Upper Critical Value, Distribution Mean]
Posterior perplexities: best feature set,
but no single feature set is indispensable.
Official Evaluation
 Conducted by SAIC on behalf of DARPA
 Three teams participated
 Evaluation task: Predict readability of 150 test
documents using the 390 documents for training
Official Evaluation Results
[Bar chart: correlation (0 to 1) for Our System, System B, and System C; reference lines: Upper Critical Value, Novice mean; annotation: Sig. better than human at p<0.0001]
Our system performed favorably and scored better
than the upper critical value.
Conclusions
 Readability system
– Regression over syntactical, lexical and language model features
 All features contribute, but LM features are most useful
 System is significantly (statistically) better than novice
human judges
Thank You!
Questions??