A Mission for Computational Natural Language Learning


Text Mining
Walter Daelemans
CNTS
Department of Linguistics
University of Antwerp
[email protected]
Centre for Dutch Language and Speech (CNTS)


Part of department of linguistics, University of Antwerp
Staff
2 tenured + 10-15 with temporary funding from EU, IWT, FWO,
NTU, language industry, BOF, …

Topics



Corpus Linguistics (mainly Dutch)
Child language acquisition / computational psycholinguistics
Language Technology
• machine learning of language
• shallow parsing
• text mining
Information Overload


Language is the most natural and most used
knowledge representation formalism
Non-structured or weakly structured information




(Non-structured) information overload



Text
Databases with text fields
Web-pages, e-mail messages, blogs, chat, …
Doubles every three months (Gardner)
Hampers knowledge management and business
intelligence
Translation bottleneck
Natural Language Understanding?

Word meaning




Sentence Meaning



Morphological analysis
Complex Word
Interpretation
Word Sense
Disambiguation
Syntactic structure
(parsing)
Sentence interpretation
Discourse Meaning

World Knowledge
• Frames, scenarios,
grounding, intentions, …
Fremdzugehen
External train marriages
The box is in the pen
I eat a pizza with extra cheese
I eat a pizza with a fork
I eat a pizza with my daughter
The mayors didn’t want the students to
strike because they feared violence
The mayors didn’t want the students to
strike because they preached the
revolution
State of the Art

Robust, efficient, accurate, unrestricted
language understanding will not be
available for a long time


AI-complete problem
Alternative:

text mining: automatic extraction of reusable
knowledge from text, based on linguistic
analysis of the text
Approach

Text analysis tools (shallow instead of deep
understanding)


Robust / Efficient / Accurate
Text Mining applications
• Question Answering
• Summarization
• Ontology extraction
• Information extraction
• Text categorization
For embedding in


End user applications related to knowledge search /
management / discovery / communication
Examples

Application Areas:




Data mining (KDD) from unstructured and semistructured data
(Corporate) Knowledge Management
“Intelligence”
Example Applications:



Email routing and filtering (spam filtering)
Finding protein interactions in biomedical text
Brokering
• Matching on-line resumes and vacancies
• Buying and selling property
• …
Text Data Mining (Discovery)

Find relevant information
• Information extraction
• Text categorization
Analyze the text
• Text mining
Discover new information
• Integrate different sources
• Data mining

Don Swanson 1981: medical
hypothesis generation









stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) is implicated in some
migraines
high levels of magnesium inhibit SCD
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
…
Text analysis output
Magnesium deficiency
implicated in migraine (?)
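
The hypothesis-generation idea above can be sketched as a search for intermediate terms that link two concepts never connected directly in the text. This is a minimal illustration only: the triples paraphrase the facts listed on the slide, and the relation names and the `linking_terms` helper are assumptions made for this example, not part of Swanson's or the CNTS tools.

```python
from collections import defaultdict

# Facts from the slide, paraphrased as (term, relation, term) triples.
# The relation labels are illustrative assumptions.
FACTS = [
    ("stress", "associated_with", "migraine"),
    ("stress", "depletes", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "is_a", "calcium channel blockers"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("platelet aggregability", "high_in", "migraine"),
    ("magnesium", "suppresses", "platelet aggregability"),
]

def linking_terms(facts, a, c):
    """Return the terms that co-occur in a fact with both a and c."""
    neighbours = defaultdict(set)
    for left, _, right in facts:
        neighbours[left].add(right)
        neighbours[right].add(left)
    return sorted((neighbours[a] & neighbours[c]) - {a, c})

links = linking_terms(FACTS, "migraine", "magnesium")
for term in links:
    print(term)
```

No single fact connects magnesium and migraine, but several intermediate terms do, which is what makes the hypothesis worth checking.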
CNTS text analysis tools

MBSP



Flexible and adaptable
Dutch and English
State of the Art accuracy and efficiency
• ~ 90% sentences / ~ 1000 words/sec


Configurable combination of linguistic modules
Modules developed using Machine Learning
• TiMBL


Adaptation through re-training and semi-supervised
learning
Client-server set-up
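
As a rough illustration of the memory-based learning behind these modules, the sketch below stores training instances and classifies a new one by its nearest stored neighbours under a simple overlap distance. It does not use TiMBL's actual interface, it omits feature weighting, and the feature layout (previous tag, word suffix, capitalisation) and the stored instances are hypothetical.

```python
from collections import Counter

def overlap_distance(x, y):
    """Number of feature positions on which two instances differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def classify(memory, instance, k=1):
    """Majority class among the k stored examples nearest to `instance`."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical memory: (previous tag, suffix, capitalisation) -> POS tag.
MEMORY = [
    (("DT", "ine", "lower"), "NN"),
    (("DT", "ane", "lower"), "JJ"),
    (("<s>", "ard", "upper"), "NNP"),
    (("NNP", "is", "lower"), "VBZ"),
]

prediction = classify(MEMORY, ("<s>", "tin", "upper"))
print(prediction)
```

Adapting such a classifier to a new domain amounts to adding or replacing stored instances, which is why re-training is cheap.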
CNTS shallow understanding
Text
Tokenisation
POS tagging
NP chunking
NER
Relation finding
Example input: Insulatard is an isophane insulin suspension (NPH).
Tokenisation
Insulatard is an isophane insulin suspension ( NPH ) .
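
The tokenisation step shown above can be approximated with a single regular expression. This is a minimal sketch, assuming that a token is either a run of word characters or a single punctuation mark; the actual CNTS tokeniser is not described on the slides and is likely more elaborate.

```python
import re

def tokenise(text):
    # A token is a run of word characters, or any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenise("Insulatard is an isophane insulin suspension (NPH).")
print(tokens)
```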
POS tagging
Insulatard/NNP is/VBZ an/DT isophane/JJ insulin/NN suspension/NN (/Punc NPH/NNP )/Punc ./Punc
NP chunking
[NP Insulatard]
[VP is]
[NP an isophane insulin suspension ( NPH )]
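
To make the chunking step concrete, here is a deliberately naive rule-based sketch that groups adjacent tokens by tag class. The CNTS chunker is trained with machine learning rather than written as rules, so this swaps in a simpler technique; the tag sets are assumptions, and unlike the slide output it leaves the parenthesised NPH as a separate NP.

```python
# Tagged tokens as produced by the POS tagging step on the slide.
TAGGED = [
    ("Insulatard", "NNP"), ("is", "VBZ"), ("an", "DT"), ("isophane", "JJ"),
    ("insulin", "NN"), ("suspension", "NN"), ("(", "Punc"), ("NPH", "NNP"),
    (")", "Punc"), (".", "Punc"),
]

NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}   # assumed noun-phrase tags
VP_TAGS = {"VB", "VBZ", "VBG", "VBN"}        # assumed verb-phrase tags

def chunk(tagged):
    """Merge runs of adjacent tokens whose tags fall in the same class."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else None
        if label and chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    # Drop tokens outside any chunk (here: punctuation).
    return [(label, " ".join(words)) for label, words in chunks if label]

chunks = chunk(TAGGED)
for label, text in chunks:
    print(f"[{label} {text}]")
```

A rule like this would wrongly merge two adjacent noun phrases, which is one reason a trained chunker is preferable.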
NER
Insulatard = Medicine name
NPH = Hormone
Relation finding
[SBJ Insulatard]
is
[PREDC an isophane insulin suspension ( NPH )]
Application: Question Answering

Give answer to question
(document retrieval: find documents relevant to query)

Who invented the telephone?


Alexander Graham Bell
When was the telephone invented?

1876
QA System: Shapaqa

Parse question
When was the telephone invented?





Which slots are given?
• Verb invented
• Object telephone
Which slots are asked?
• Temporal phrase linked to verb
Document retrieval on internet with given slot keywords
Parsing of sentences with all given slots
Count most frequent entry found in asked slot (temporal phrase)
Shapaqa: example


When was the telephone invented?
Google: invented AND “the telephone”


produces 835 pages
53 parsed sentences with both input slots and with a
temporal phrase
is through his interest in Deafness and fascination with
acoustics that the telephone was invented in 1876 , with the
intent of helping Deaf and hard of hearing
The telephone was invented by Alexander Graham Bell in 1876
When Alexander Graham Bell invented the telephone in 1876 ,
he hoped that these same electrical signals could
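
The final counting step can be sketched on the three sentences above. This is an illustration under strong assumptions: a regular expression for "in" followed by a four-digit year stands in for the parser's temporal-phrase slot, and plain substring matching stands in for slot checking. The function names are hypothetical and not part of Shapaqa.

```python
import re
from collections import Counter

# The three parsed sentences quoted on the slide.
SENTENCES = [
    "is through his interest in Deafness and fascination with acoustics "
    "that the telephone was invented in 1876 , with the intent of helping "
    "Deaf and hard of hearing",
    "The telephone was invented by Alexander Graham Bell in 1876",
    "When Alexander Graham Bell invented the telephone in 1876 , "
    "he hoped that these same electrical signals could",
]

def temporal_candidates(sentence):
    # Stand-in for the temporal slot: a four-digit year after "in".
    return re.findall(r"\bin (\d{4})\b", sentence)

def rank_answers(sentences, given=("invented", "telephone")):
    """Count slot fillers in sentences that contain all given keywords."""
    counts = Counter()
    for sentence in sentences:
        if all(word in sentence for word in given):
            counts.update(temporal_candidates(sentence))
    return counts.most_common()

ranking = rank_answers(SENTENCES)
print(ranking)
```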
Shapaqa: frequency ranking


So when was the phone invented?
Internet answer is noisy, but robust

17: 1876
3: 1874
2: ago
2: later
1: Bell
…
System was developed quickly
Precision 76% (Google 31%)
International competition (TREC): MRR 0.45
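
For readers unfamiliar with the MRR figure: mean reciprocal rank averages, over all questions, the reciprocal of the rank at which the first correct answer appears, counting zero when no correct answer is returned. A minimal sketch follows; the ranks in the example are hypothetical and are not Shapaqa's TREC results.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all questions.

    `ranks` holds, per question, the 1-based rank of the first correct
    answer, or None when no correct answer was returned.
    """
    if not ranks:
        raise ValueError("at least one question is required")
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Hypothetical ranks for four questions.
score = mean_reciprocal_rank([1, 2, None, 4])
print(round(score, 4))
```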
Application: Biomedical text mining (EU project BioMinT)
[Architecture diagram. Components shown: Medline abstracts, IR, Text Analysis, Linguistic / Semantic Features, IE, Templates, Factoids, (Partial) Factoids]
The mouse lymphoma assay (MLA) utilizing the Tk gene is
widely used to identify chemical mutagens.
[Figure: shallow parse of the sentence. Chunks shown: The mouse lymphoma assay, MLA, utilizing, the Tk gene, is widely used, to identify, chemical mutagens. Labels shown: CELL-LINE, DNA part, S (subject), O (object)]
<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
<S cnt="s1">
<NP rel="SBJ" of="s1_1">
<W pos="DT">The</W>
<W pos="NN" sem="cell_line">mouse</W>
<W pos="NN" sem="cell_line">lymphoma</W>
<W pos="NN">assay</W>
</NP>
<W pos="openparen">(</W>
<NP>
<W pos="NN" sem="cell_line">MLA</W>
</NP>
<W pos="closeparen">)</W>
<VP id="s1_1">
<W pos="VBG">utilizing</W>
</VP>
<NP rel="OBJ" of="s1_1">
<W pos="DT">the</W>
<W pos="NN" sem="DNA_part">Tk</W>
<W pos="NN" sem="DNA_part">gene</W>
</NP>
<VP id="s1_2">
<W pos="VBZ">is</W>
<W pos="RB">widely</W>
<W pos="VBN">used</W>
</VP>
<VP id="s1_3">
<W pos="TO">to</W>
<W pos="VB">identify</W>
</VP>
<NP rel="OBJ" of="s1_3">
<W pos="JJ">chemical</W>
<W pos="NNS">mutagens</W>
</NP>
<W pos="period">.</W>
</S>
</MBSP>
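
Output in this format can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the fragment above and pairs each NP carrying a `rel` attribute with the VP named in its `of` attribute. The helper names are illustrative assumptions, not part of MBSP.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the parser output shown on the slide.
FRAGMENT = """<MBSP>
<S cnt="s1">
<NP rel="SBJ" of="s1_1"><W pos="DT">The</W><W pos="NN">mouse</W>
<W pos="NN">lymphoma</W><W pos="NN">assay</W></NP>
<VP id="s1_1"><W pos="VBG">utilizing</W></VP>
<NP rel="OBJ" of="s1_1"><W pos="DT">the</W><W pos="NN">Tk</W>
<W pos="NN">gene</W></NP>
<VP id="s1_3"><W pos="TO">to</W><W pos="VB">identify</W></VP>
<NP rel="OBJ" of="s1_3"><W pos="JJ">chemical</W>
<W pos="NNS">mutagens</W></NP>
</S>
</MBSP>"""

def words(chunk):
    """Join the text of all W elements inside a chunk."""
    return " ".join(w.text for w in chunk.iter("W"))

def relations(xml_text):
    """Return (relation, noun phrase, verb phrase) triples."""
    root = ET.fromstring(xml_text)
    verbs = {vp.get("id"): words(vp) for vp in root.iter("VP")}
    return [
        (chunk.get("rel"), words(chunk), verbs[chunk.get("of")])
        for chunk in root.iter("NP")
        if chunk.get("rel")
    ]

triples = relations(FRAGMENT)
for rel, noun_phrase, verb_phrase in triples:
    print(rel, "|", noun_phrase, "|", verb_phrase)
```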
Extracted IEX Templates from shallow parser output
NP(<X protein>) contain NP(Y "domain")
EVENT: contain
PROTEIN: <protein>
DOMAIN: “domain”
Jee-Hyub Kim (Geneva)
NP(<X protein>) be associated with NP(Y “disease”)
EVENT: associated_with
PROTEIN: <protein>
DISEASE: “head”
NP(<X protein>) regulate NP(Y)
EVENT: regulate
PROTEIN: <protein>
Y:
(): to be extracted, <>: semantic constraint, "": lexical constraint
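
Applying such a template to shallow parser output can be sketched as follows for the "regulate" pattern: the subject NP must satisfy the semantic constraint (protein), the verb must match, and the object NP is extracted as Y. The input representation (text paired with a semantic class) and the example values "ProteinA" and "GeneB expression" are hypothetical placeholders.

```python
def fill_regulate_template(subject, verb_lemma, obj):
    """Fill the template for NP(<X protein>) regulate NP(Y).

    `subject` and `obj` are (text, semantic_class) pairs; returns a
    filled template, or None when the constraints are not met.
    """
    subj_text, subj_sem = subject
    obj_text, _ = obj
    if verb_lemma != "regulate" or subj_sem != "protein":
        return None
    return {"EVENT": "regulate", "PROTEIN": subj_text, "Y": obj_text}

# Hypothetical parser output for one clause.
template = fill_regulate_template(
    ("ProteinA", "protein"), "regulate", ("GeneB expression", None)
)
print(template)
```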
Application: Ontology Extraction




Clustering of head nouns of Subject-Verb and Verb-Object relations
Combine with pattern matching and heuristics
Case studies: a 4-million-word Medline corpus on hepatitis; the SwissProt corpus
Results:


Better clusters with shallow parsing
Useful in knowledge management, thesaurus development, …
Ontobasis (IWT)
Example (SwissProt corpus)
gene | show | significant homology,
amino_acid_sequence | have/indicate/lack/reveal/show |
homology
protein | show | homology, immunoreactivity, reactivity,
sequence similarity
protein | inhibit | catalytic activity, apoptosis, protein synthesis...
protein | exhibit | significant homology
protein | bind | copper, ubiquitin
protein | correspond | isoelectric point
induction | requires | protein synthesis
Edman degradation | of | intact protein
regulatory subunit | of | cAMP-dependent protein kinase
…
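
One simple way to see how such relations support clustering is to compare nouns by the verbs they occur with. The sketch below computes Jaccard similarity over verb sets taken from the subject-verb rows above. It is a simplification: the slides do not specify the clustering method or similarity measure, and objects are ignored here.

```python
from collections import defaultdict

# (subject head noun, verb) pairs read off the rows above.
PAIRS = [
    ("gene", "show"),
    ("amino_acid_sequence", "have"), ("amino_acid_sequence", "indicate"),
    ("amino_acid_sequence", "lack"), ("amino_acid_sequence", "reveal"),
    ("amino_acid_sequence", "show"),
    ("protein", "show"), ("protein", "inhibit"), ("protein", "exhibit"),
    ("protein", "bind"), ("protein", "correspond"),
    ("induction", "requires"),
]

def verb_profiles(pairs):
    """Map each noun to the set of verbs it is the subject of."""
    profiles = defaultdict(set)
    for noun, verb in pairs:
        profiles[noun].add(verb)
    return profiles

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

profiles = verb_profiles(PAIRS)
sim_gene_protein = jaccard(profiles["gene"], profiles["protein"])
sim_induction_protein = jaccard(profiles["induction"], profiles["protein"])
print(sim_gene_protein, sim_induction_protein)
```

Nouns with a non-zero score share at least one verb context and are candidates for the same cluster.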
[Figure: extracted concept graph. Nodes: liver, hepatitis, cirrhosis, infection, disease, HBV, antibody, culture, immunization, antisera, vaccination. Edge labels: related_to, sim, prevented by, produced by]
Further development
Semantic roles
 Faster adaptation to new domains

Domain semantics (NER / concept tagging)
 Active Learning / semi-supervised learning


More analytic power
Negation, modality, quantification
 Limited event and scenario recognition

Conclusions



Text Mining tasks benefit from text analysis
Understanding can be formulated as a
flexible heterarchy of classifiers
These classifiers can be trained / adapted on
annotated corpora and can eventually
approximate deep understanding
Questions?

Walter Daelemans

A1.10 Campus Drie Eiken
• (September: Stadscampus)

[email protected]