Preserving Semantic Content in
Text Mining Using Multigrams
Yasmin H. Said
Department of Computational
and Data Sciences
George Mason University
QMDNS 2010 - May 26, 2010
This is joint work with Edward J.
Wegman
Outline
• Background on Text Mining
• Bigrams
– Term-Document and Bigram-Document Matrices
– Term-Term and Document-Document Associations
• Example using 15,863
Documents
To read between the lines is easier than to
follow the text.
-Henry James
Text Data Mining
• Synthesis of …
– Information Retrieval
• Focuses on retrieving documents
from a fixed database
• May be multimedia including text,
images, video, audio
– Natural Language Processing
• Usually more challenging questions
• Bag-of-words methods
• Vector space models
– Statistical Data Mining
• Pattern recognition, classification,
clustering
Natural Language Processing
• Key elements are:
– Morphology (grammar of word forms)
– Syntax (grammar of word
combinations to form sentences)
– Semantics (meaning of word or
sentence)
– Lexicon (vocabulary or set of words)
• Time flies like an arrow
– Time passes speedily like an arrow
passes speedily or
– Measure the speed of a fly like you
would measure the speed of an arrow
• Ambiguity of nouns and verbs
• Ambiguity of meaning
Text Mining Tasks
• Text Classification
– Assigning a document to one of several
pre-specified classes
• Text Clustering
– Unsupervised learning
• Text Summarization
– Extracting a summary for a document
– Based on syntax and semantics
• Author Identification/Determination
– Based on stylistics, syntax, and semantics
• Automatic Translation
– Based on morphology, syntax, semantics,
and lexicon
• Cross Corpus Discovery
– Also known as Literature Based Discovery
Preprocessing
• Denoising
– Means removing stopper words …
words with little semantic meaning
such as the, an, and, of, by, that
and so on.
– Stopper words may be context
dependent, e.g. Theorem and
Proof in a mathematics document
• Stemming
– Means removing suffixes, prefixes,
and infixes to reduce words to their roots
– An example: wake, waking,
awake, woke → wake
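A minimal sketch of these two preprocessing steps in Python; the talk names no particular stemmer or stopword list, so NLTK's PorterStemmer and a tiny stopper set stand in here:

```python
# Illustrative only: the talk does not name a stemmer or stopword list.
from nltk.stem import PorterStemmer

STOPPERS = {"the", "an", "a", "and", "of", "by", "that"}  # tiny example set
stemmer = PorterStemmer()

def preprocess(tokens):
    """Denoise (drop stopper words), then stem what remains."""
    return [stemmer.stem(t) for t in tokens if t.lower() not in STOPPERS]

print(preprocess(["the", "waking", "of", "the", "comet"]))
# e.g. ['wake', 'comet']
```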
Bigrams and Trigrams
• A bigram is a word pair where
the order of words is preserved.
– The first word is the reference word.
– The second is the neighbor word.
• A trigram is a word triple where
order is preserved.
• Bigrams and trigrams are useful
because they can capture
semantic content.
Example
• Hell hath no fury like a woman
scorned.
• Denoised: Hell hath no fury like
woman scorned.
• Stemmed: Hell has no fury like
woman scorn.
• Bigrams:
– Hell has, has no, no fury, fury like,
like woman, woman scorn, scorn .
– Note that the “.” (any sentence-ending
punctuation) is treated as
a word
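Bigram extraction itself is a one-liner; a sketch using the slide's example, with the sentence-ending period kept as a word:

```python
def bigrams(tokens):
    """Ordered pairs: first word is the reference, second the neighbor."""
    return list(zip(tokens, tokens[1:]))

print(bigrams(["hell", "has", "no", "fury", "like", "woman", "scorn", "."]))
# [('hell', 'has'), ('has', 'no'), ('no', 'fury'), ('fury', 'like'),
#  ('like', 'woman'), ('woman', 'scorn'), ('scorn', '.')]
```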
Bigram Proximity Matrix
Rows are reference words, columns are neighbor words (· = 0):

         .     fury  has   hell  like  no    scorn woman
.        ·     ·     ·     ·     ·     ·     ·     ·
fury     ·     ·     ·     ·     1     ·     ·     ·
has      ·     ·     ·     ·     ·     1     ·     ·
hell     ·     ·     1     ·     ·     ·     ·     ·
like     ·     ·     ·     ·     ·     ·     ·     1
no       ·     1     ·     ·     ·     ·     ·     ·
scorn    1     ·     ·     ·     ·     ·     ·     ·
woman    ·     ·     ·     ·     ·     ·     1     ·
Bigram Proximity Matrix
• The bigram proximity matrix
(BPM) is computed for an entire
document
– Entries in the matrix may be either
binary or a frequency count
• The BPM is a mathematical
representation of a document
with some claim to capturing
semantics
– Because bigrams capture noun-verb, adjective-noun, verb-adverb, and verb-subject structures
– Martinez (2002)
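A sketch of how a per-document BPM might be built as a sparse matrix (illustrative only, not the authors' code); on the slide's example it reproduces the table above:

```python
# Minimal sketch: one document's BPM over its own vocabulary.
from scipy.sparse import lil_matrix

def bpm(tokens, binary=True):
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    m = lil_matrix((len(vocab), len(vocab)), dtype=int)
    for ref, nbr in zip(tokens, tokens[1:]):   # (reference, neighbor)
        m[idx[ref], idx[nbr]] = 1 if binary else m[idx[ref], idx[nbr]] + 1
    return m, vocab

m, vocab = bpm(["hell", "has", "no", "fury", "like", "woman", "scorn", "."])
print(vocab)        # ['.', 'fury', 'has', 'hell', 'like', 'no', 'scorn', 'woman']
print(m.toarray())  # matches the table above
```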
Vector Space Methods
• The classic structure in vector
space text mining methods is a
term-document matrix where
– Rows correspond to terms, columns
correspond to documents, and
– Entries may be binary or frequency
counts
• A simple and obvious
generalization is a bigram-document matrix where
– Rows correspond to bigrams,
columns to documents, and again
entries are either binary or frequency
counts
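Both matrices can be assembled with standard tooling; a sketch using scikit-learn's CountVectorizer, illustrative only since the original analyses predate this library:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["hell has no fury like woman scorn",
        "fury like comet hit jupiter"]

# scikit-learn puts documents in rows, so transpose to get the slide's
# terms-by-documents and bigrams-by-documents orientation.
tdm = CountVectorizer(binary=True).fit_transform(docs).T
bdm = CountVectorizer(ngram_range=(2, 2), binary=True).fit_transform(docs).T

print(tdm.shape, bdm.shape)   # (terms, 2) and (bigrams, 2), both sparse
```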
Example Data
• The text data were collected by the
Linguistic Data Consortium in 1997 and
were originally used in Martinez (2002)
– The data consisted of 15,863 news
reports collected from Reuters and
CNN from July 1, 1994 to June 30, 1995
– The full lexicon for the text database
included 68,354 distinct words
– In all, 313 stopper words are removed
– After denoising and stemming,
45,021 words remain in the lexicon
– The example that I report here is based
on the full set of 15,863 documents.
This is the same basic data set that Dr.
Wegman reported on in his keynote
talk, although he considered a subset
of 503 documents.
Vector Space Methods
• A document corpus we have
worked with has 45,021
denoised and stemmed entries
in its lexicon and 1,834,123
bigrams
– Thus the TDM is 45,021 by 15,863
and the BDM is 1,834,123 by 15,863
– The term vector is 45,021
dimensional and the bigram
vector is 1,834,123 dimensional
– The BPM for each document is
1,834,123 by 1,834,123 and, of
course, very sparse.
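The sparsity point deserves a back-of-envelope check with the sizes quoted above:

```python
# Why the BPM must be stored sparse: a dense version is infeasible.
n_bigrams = 1_834_123
dense_bytes = n_bigrams ** 2          # one byte per binary entry
print(dense_bytes / 1e12)             # ~3.4 terabytes per document, dense
# A typical document contains only a few hundred distinct bigrams,
# so a sparse BPM needs just a few kilobytes instead.
```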
Term-Document Matrix Analysis
Zipf’s Law
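Zipf's law says the frequency of the r-th most frequent term decays roughly as 1/r, so log-frequency against log-rank is near-linear with slope about -1. A sketch of that standard check (the tokens argument is a stand-in for the corpus):

```python
import numpy as np
from collections import Counter

def zipf_slope(tokens):
    """Fit log(frequency) vs. log(rank); Zipf's law predicts slope near -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope
```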
Term-Document Matrix Analysis
[Figure: graph layout of the term-document matrix, with stemmed terms (e.g., serb, bosnia, plane, crash, iraq, korea, comet, jupit, earthquak, kobe, simpson, oklahoma, abort) plotted together with document numbers (1–503); topically related terms and documents fall near one another.]
Text Example - Clusters
• A portion of the hierarchical agglomerative tree for the clusters
Text Example - Clusters
Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008
Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish
5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%,
british 3.2%, adam 2.4%
Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%,
irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%,
british 1.5%, adam 1.5%
Phrases 1: ireland 121, northern 119, british 116, irish 111, ira
110, peac 107, minist 104, govern 104, polit 104, talk 102
Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94,
republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71,
peac.process 66, gerri.adam 59, british.govern 50
Phrases 3: irish.republican.armi 91, prime.minist.john 47,
minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34,
british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland
27, british.rule.northern 27, declar.ceas.fire 26
Text Example - Clusters
Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008
Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea
10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang
2.0%, south 1.9%, south.korea 1.5%
Discriminating: korea 12.7%, north 7.4%, korean 7.2%,
north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%,
pyongyang 1.3%, south.korea 1.0%, simpson 0.8%
Phrases 1: korea 305, north 303, korean 285, south 243, unit 215,
nuclear 204, offici 196, pyongyang 179, presid 167, talk 165
Phrases 2: north.korea 291, north.korean 233, south.korea 204,
south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79,
kim.jong 74, light.water 71, presid.clinton 69
Phrases 3: light.water.reactor 56, unit.north.korea 55,
north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46,
leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36,
warrant.offic.bobbi 35, bobbi.wayn.hall 29
Text Example - Clusters
Cluster 24, Size: 1788, ISim: 0.012, ESim: 0.007
Descriptive: school 2.2%, film 1.3%, children 1.2%, student 1.0%,
percent 0.8%, compani 0.7%, kid 0.7%, peopl 0.7%, movi 0.7%,
music 0.6%
Discriminating: school 2.3%, simpson 1.8%, film 1.7%, student
1.1%, presid 1.0%, serb 0.9%, children 0.8%, clinton 0.8%, movi
0.8%, music 0.8%
Phrases 1: cnn 1034, peopl 920, time 893, report 807, don 680, dai
650, look 630, call 588, live 535, lot 498
Phrases 2: littl.bit 99, lot.peopl 90, lo.angel 85, world.war 71,
thank.join 67, million.dollar 60, 000.peopl 54, york.citi 50, garsten.cnn
48, san.francisco 47
Phrases 3: jeann.moo.cnn 41, cnn.entertain.new 36, cnn.jeann.moo
32, norma.quarl.cnn 30, cnn.norma.quarl 28, cnn.jeff.flock 28,
jeff.flock.cnn 27, brian.cabel.cnn 26, pope.john.paul 25, lisa.price.cnn
25
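These summaries follow CLUTO-style cluster reporting, in which ISim and ESim are the cluster's average internal and external similarities. A sketch of one common internal-similarity definition, average pairwise cosine within the cluster; the exact normalization behind the reported numbers may differ:

```python
import numpy as np

def internal_similarity(X):
    """Average pairwise cosine similarity among a cluster's document
    vectors (rows of X), excluding self-pairs. One common definition."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    S = Xn @ Xn.T                                       # cosine similarities
    n = X.shape[0]
    return (S.sum() - np.trace(S)) / (n * (n - 1))
```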
Bigrams
Cluster 1
Cluster Size Distribution
Document by Cluster Plot
Cluster Identities
• Cluster 02: Comet Shoemaker-Levy Crashing into Jupiter.
• Cluster 08: Oklahoma City Bombing.
• Cluster 11: Bosnian-Serb Conflict.
• Cluster 12: Court-Law, O.J. Simpson Case.
• Cluster 15: Cessna Plane Crashed onto the South Lawn of the White House.
• Cluster 19: American Army Helicopter Emergency Landing in North Korea.
• Cluster 24: Death of North Korean Leader (Kim Il Sung) and North Korea’s Nuclear Ambitions.
• Cluster 26: Shootings at Abortion Clinics in Boston.
• Cluster 28: Two Americans Detained in Iraq.
• Cluster 30: Earthquake that Hit Japan.
Bigram-Document Matrix for 50 Documents
Bigram-Bigram Matrix for 50 Documents
Bigram-Bigram Matrix Using the Top 253 Bigrams
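The matrices above are tied together by the associations listed in the outline: with a bigram-document matrix B (bigrams in rows), B·Bᵀ gives bigram-bigram associations and Bᵀ·B gives document-document associations, binary or weighted depending on the entries of B. A toy sketch:

```python
import numpy as np

# Toy bigram-document matrix: rows = bigrams, columns = documents.
B = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])

bigram_bigram = B @ B.T   # bigrams co-occurring in the same documents
doc_doc = B.T @ B         # documents related through shared bigrams

print(bigram_bigram)
print(doc_doc)
```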
Closing Remarks
• Text mining presents great
challenges, but is amenable to
statistical/mathematical
approaches
– Text mining using vector space
methods raises both mathematical
and visualization challenges,
• especially in terms of
dimensionality, sparsity, and
scalability.
Acknowledgments
• Dr. Angel Martinez
• Dr. Jeff Solka and Avory Bryant
• Dr. Walid Sharabati
• Funding Sources
– National Institute on Alcohol Abuse
and Alcoholism (Grant Number
F32AA015876)
– Army Research Office (Contract
W911NF-04-1-0447)
– Army Research Laboratory
(Contract W911NF-07-1-0059)
– Isaac Newton Institute
Contact Information
Yasmin H. Said
Department of Computational and Data Sciences
Email: [email protected]
Phone: 301-538-7478
The length of this document defends it well against
the risk of its being read.
-Winston Churchill