A Mission for Computational Natural Language Learning
Text Mining
Walter Daelemans
CNTS
Department of Linguistics
University of Antwerp
[email protected]
Centre for Dutch Language and Speech (CNTS)
Part of the Department of Linguistics, University of Antwerp
Staff
2 tenured + 10-15 with temporary funding from EU, IWT, FWO, NTU, language industry, BOF, …
Topics
Corpus Linguistics (mainly Dutch)
Child language acquisition / computational psycholinguistics
Language Technology
• machine learning of language
• shallow parsing
• text mining
Information Overload
Language is the most natural and most used knowledge representation formalism
Non-structured or weakly structured information
(Non-structured) information overload
Text
Databases with text fields
Web-pages, e-mail messages, blogs, chat, …
Doubles every three months (Gartner)
Hampers knowledge management and business intelligence
Translation bottleneck
Natural Language Understanding?
Word meaning
Sentence Meaning
Morphological analysis
Complex Word Interpretation
Word Sense Disambiguation
Syntactic structure (parsing)
Sentence interpretation
Discourse Meaning
World Knowledge
• Frames, scenarios, grounding, intentions, …
Fremdzugehen
External train marriages
The box is in the pen
I eat a pizza with extra cheese
I eat a pizza with a fork
I eat a pizza with my daughter
The mayors didn’t want the students to strike because they feared violence
The mayors didn’t want the students to strike because they preached the revolution
State of the Art
Robust, efficient, accurate, unrestricted language understanding will not be available for a long time
AI-complete problem
Alternative:
text mining: automatic extraction of reusable knowledge from text, based on linguistic analysis of the text
Approach
Text analysis tools (shallow instead of deep understanding)
Robust / Efficient / Accurate
Text Mining applications
Question Answering
Summarization
Ontology extraction
Information extraction
Text categorization
For embedding in
End user applications related to knowledge search / management / discovery / communication
Examples
Application Areas:
Data mining (KDD) from unstructured and semistructured data
(Corporate) Knowledge Management
“Intelligence”
Example Applications:
Email routing and filtering (spam filtering)
Finding protein interactions in biomedical text
Brokering
• Matching on-line resumes and vacancies
• Buying and selling property
• …
Text Data Mining (Discovery)
Find relevant information
Information extraction
Text categorization
Analyze the text
Text mining
Discover new information
Integrate different sources
Data mining
Don Swanson 1981: medical hypothesis generation
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) is implicated in some migraines
high levels of magnesium inhibit SCD
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
…
Text analysis output
Magnesium deficiency implicated in migraine (?)
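The chain of statements above can be read as a small graph problem: a term that is never linked to migraine directly, but is linked to several of migraine's known associates, becomes a candidate hypothesis. The following is a minimal sketch of that idea, assuming the text analysis step has already produced term pairs. The `LINKS` list paraphrases the statements on the slide, and the function name `candidate_hypotheses` is hypothetical.

```python
from collections import defaultdict

# Term pairs paraphrasing the slide; a real system would extract these from text.
LINKS = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]


def candidate_hypotheses(links, start):
    """Rank terms reachable from `start` through an intermediate term only."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    direct = neighbours[start]
    bridges = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            # Keep only terms with no direct link to the start term.
            if c != start and c not in direct:
                bridges[c].add(b)
    # Most supporting intermediate terms first, ties broken alphabetically.
    return sorted(bridges.items(), key=lambda kv: (-len(kv[1]), kv[0]))


hypotheses = candidate_hypotheses(LINKS, "migraine")
```

Here `hypotheses` holds a single candidate, magnesium, supported by four intermediate terms. Whether such a candidate is a real finding still has to be judged by a domain expert.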
CNTS text analysis tools
MBSP
Flexible and adaptable
Dutch and English
State of the Art accuracy and efficiency
• ~ 90% sentences / ~ 1000 words/sec
Configurable combination of linguistic modules
Modules developed using Machine Learning
• TiMBL
Adaptation through re-training and semi-supervised learning
Client-server set-up
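TiMBL is a memory-based learner: training examples are stored, and a new instance receives the label of its most similar stored examples. The sketch below illustrates that principle with an unweighted overlap distance on a few hypothetical POS-tagging instances. It does not use the TiMBL API or its feature weighting, and the feature layout (previous word, focus word, next word) is an assumption made for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(memory, instance, k=1):
    """Return the majority label among the k stored examples nearest to `instance`."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training instances: (previous word, focus word, next word) -> POS tag.
MEMORY = [
    (("<s>", "the", "telephone"), "DT"),
    (("the", "telephone", "was"), "NN"),
    (("telephone", "was", "invented"), "VBD"),
    (("was", "invented", "in"), "VBN"),
    (("is", "an", "isophane"), "DT"),
]

# "box" was never seen, but its context matches a stored noun instance.
tag = classify(MEMORY, ("the", "box", "was"))
```

The unseen word "box" is tagged NN because its left and right context match the stored instance for "telephone". Adapting to a new domain then amounts to adding or replacing stored examples.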
CNTS shallow understanding
Text
Tokenisation
POS tagging
NP chunking
NER
Relation
finding
Insulatard is an isophane insulin suspension (NPH).
Tokenisation:
Insulatard is an isophane insulin suspension ( NPH ) .
POS tagging:
Insulatard/NNP is/VBZ an/DT isophane/JJ insulin/NN suspension/NN (/Punc NPH/NNP )/Punc ./Punc
NP chunking:
[NP Insulatard]
[VP is]
[NP an isophane insulin suspension ( NPH )]
NER:
Insulatard = Medicine name
NPH = Hormone
Relation finding:
[SBJ Insulatard]
is
[PREDC an isophane insulin suspension ( NPH )]
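The cascade shown on these slides can be sketched as a list of stages, each consuming the output of the previous one. The example below is a minimal illustration only: the regular-expression tokeniser and the lookup table `TAGS` are hypothetical stand-ins for the trained modules, and the names `run_pipeline`, `tokenise` and `tag` are not taken from MBSP.

```python
import re


def tokenise(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical lookup table standing in for a trained POS tagger.
TAGS = {
    "Insulatard": "NNP", "is": "VBZ", "an": "DT", "isophane": "JJ",
    "insulin": "NN", "suspension": "NN", "NPH": "NNP",
}


def tag(tokens):
    """Pair every token with a tag; unknown tokens are treated as punctuation."""
    return [(tok, TAGS.get(tok, "Punc")) for tok in tokens]


def run_pipeline(text, stages):
    """Feed each stage's output into the next and keep every intermediate result."""
    results, data = {}, text
    for name, stage in stages:
        data = stage(data)
        results[name] = data
    return results


out = run_pipeline(
    "Insulatard is an isophane insulin suspension (NPH).",
    [("tokens", tokenise), ("pos", tag)],
)
```

Keeping the stages in a list makes the combination configurable: chunking, NER and relation finding would be appended as further `(name, function)` pairs.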
Application: Question Answering
Give answer to question
(document retrieval: find documents relevant to query)
Who invented the telephone?
Alexander Graham Bell
When was the telephone invented?
1876
QA System: Shapaqa
Parse question
When was the telephone invented?
Which slots are given?
• Verb invented
• Object telephone
Which slots are asked?
• Temporal phrase linked to verb
Document retrieval on internet with given slot keywords
Parsing of sentences with all given slots
Count most frequent entry found in asked slot (temporal phrase)
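The last two steps, keeping sentences that contain all given slots and counting what appears in the asked slot, can be sketched as follows. This is a simplification under stated assumptions: keyword matching replaces parsing, and a four-digit year pattern stands in for the temporal phrase linked to the verb. The first three sentences are the retrieved snippets quoted in this deck; the fourth is a made-up noisy hit.

```python
import re
from collections import Counter

SENTENCES = [
    "The telephone was invented by Alexander Graham Bell in 1876",
    "When Alexander Graham Bell invented the telephone in 1876 , he hoped that",
    "that the telephone was invented in 1876 , with the intent of helping Deaf",
    "Some pages claim the telephone was invented in 1874",  # hypothetical noise
]

# Stand-in for a temporal phrase: any four-digit year from 1000 to 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def rank_answers(sentences, given_slots):
    """Count year fillers in sentences that mention every given slot keyword."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if all(slot in lowered for slot in given_slots):
            counts.update(YEAR.findall(sentence))
    return counts.most_common()


ranking = rank_answers(SENTENCES, ["invented", "telephone"])
```

The most frequent filler, 1876, ends up first in `ranking`. Redundancy on the web is what makes the count robust against individual wrong pages.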
Shapaqa: example
When was the telephone invented?
Google: invented AND “the telephone”
produces 835 pages
53 parsed sentences with both input slots and with a temporal phrase
is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876 , with the intent of helping Deaf and hard of hearing
The telephone was invented by Alexander Graham Bell in 1876
When Alexander Graham Bell invented the telephone in 1876 , he hoped that these same electrical signals could
Shapaqa: frequency ranking
So when was the phone invented?
Internet answer is noisy, but robust
17: 1876
3: 1874
2: ago
2: later
1: Bell
…
System was developed quickly
Precision 76% (Google 31%)
International competition (TREC): MRR 0.45
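MRR (mean reciprocal rank) averages, over all questions, the reciprocal of the rank at which the first correct answer appears, counting 0 when no correct answer is returned. A minimal sketch with made-up rankings follows; the exact TREC evaluation protocol is not modelled here, and the function name is hypothetical.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Average of 1/rank of the first correct answer per question (0 if none)."""
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(answers, start=1):
            if answer in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


# Hypothetical system output for three questions and the accepted answers.
score = mean_reciprocal_rank(
    [["1876", "1874"], ["Edison", "Bell"], ["unknown"]],
    [{"1876"}, {"Bell"}, {"1969"}],
)
```

In this toy run the first question scores 1, the second 1/2 and the third 0, so `score` is 0.5.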
Application: Biomedical text mining (EU project BioMinT)
[Diagram: Medline abstracts; IR; text analysis producing linguistic / semantic features; IE; output as templates and (partial) factoids]
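The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.
[Diagram: shallow parse of the sentence above. "The mouse lymphoma assay" and "MLA" are labelled CELL-LINE and "the Tk gene" is labelled DNA part; arcs marked S and O link the verb groups "utilizing", "is widely used" and "to identify" to their subjects and objects.]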
<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
<S cnt="s1">
<NP rel="SBJ" of="s1_1">
<W pos="DT">The</W>
<W pos="NN" sem="cell_line">mouse</W>
<W pos="NN" sem="cell_line">lymphoma</W>
<W pos="NN">assay</W>
</NP>
<W pos="openparen">(</W>
<NP>
<W pos="NN" sem="cell_line">MLA</W>
</NP>
<W pos="closeparen">)</W>
<VP id="s1_1">
<W pos="VBG">utilizing</W>
</VP>
<NP rel="OBJ" of="s1_1">
<W pos="DT">the</W>
<W pos="NN" sem="DNA_part">Tk</W>
<W pos="NN" sem="DNA_part">gene</W>
</NP>
<VP id="s1_2">
<W pos="VBZ">is</W>
<W pos="RB">widely</W>
<W pos="VBN">used</W>
</VP>
<VP id="s1_3">
<W pos="TO">to</W>
<W pos="VB">identify</W>
</VP>
<NP rel="OBJ" of="s1_3">
<W pos="JJ">chemical</W>
<W pos="NNS">mutagens</W>
</NP>
<W pos="period">.</W>
</S>
</MBSP>
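Output in this XML format can be consumed with a standard XML parser. The sketch below reads a shortened copy of the sentence above and collects grammatical relations (via the `rel`, `of` and `id` attributes) and semantically tagged words (via `sem`). The helper names `words` and `relations` are hypothetical, and the reading of the attributes is inferred from the example rather than from mbsp.dtd.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the parser output shown above.
FRAGMENT = """
<MBSP>
<S cnt="s1">
<NP rel="SBJ" of="s1_1">
<W pos="DT">The</W>
<W pos="NN" sem="cell_line">mouse</W>
<W pos="NN" sem="cell_line">lymphoma</W>
<W pos="NN">assay</W>
</NP>
<VP id="s1_1">
<W pos="VBG">utilizing</W>
</VP>
<NP rel="OBJ" of="s1_1">
<W pos="DT">the</W>
<W pos="NN" sem="DNA_part">Tk</W>
<W pos="NN" sem="DNA_part">gene</W>
</NP>
</S>
</MBSP>
"""


def words(element):
    """Join the text of all W elements below `element`."""
    return " ".join(w.text for w in element.iter("W"))


def relations(sentence):
    """Yield (relation, verb group, noun phrase) triples for one S element."""
    verbs = {vp.get("id"): words(vp) for vp in sentence.iter("VP")}
    for phrase in sentence.iter("NP"):
        if phrase.get("rel") and phrase.get("of") in verbs:
            yield phrase.get("rel"), verbs[phrase.get("of")], words(phrase)


root = ET.fromstring(FRAGMENT.strip())
triples = [t for s in root.iter("S") for t in relations(s)]
entities = [(w.text, w.get("sem")) for w in root.iter("W") if w.get("sem")]
```

`triples` then links "utilizing" to its subject and object noun phrases, and `entities` lists the words carrying a `sem` label. Both are the kind of input the extraction templates operate on.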
Extracted IEX Templates from shallow parser output
NP(<X protein>) contain NP(Y "domain")
EVENT: contain
PROTEIN: <protein>
DOMAIN: “domain”
Jee-Hyub Kim (Geneva)
NP(<X protein>) be associated with NP(Y “disease”)
EVENT: associated_with
PROTEIN: <protein>
DISEASE: “head”
NP(<X protein>) regulate NP(Y)
EVENT: regulate
PROTEIN: <protein>
Y:
(): to be extracted, <>: semantic
constraint, "": lexical constraint
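A template such as NP(<X protein>) regulate NP(Y) can be applied by sliding over the chunk sequence and checking the semantic and lexical constraints. The sketch below is one possible reading, not the BioMinT implementation: the chunk dictionaries, the `lemma` field and the function `match_template` are hypothetical, and the input sentence is illustrative.

```python
def match_template(chunks, verb_lemma, subject_sem):
    """Fill NP(<subject_sem>) verb NP(Y) templates over adjacent chunks."""
    filled = []
    for left, verb, right in zip(chunks, chunks[1:], chunks[2:]):
        if (
            left["type"] == "NP"
            and left.get("sem") == subject_sem      # semantic constraint <...>
            and verb["type"] == "VP"
            and verb.get("lemma") == verb_lemma     # event verb
            and right["type"] == "NP"               # slot to be extracted (Y)
        ):
            filled.append(
                {"EVENT": verb_lemma, subject_sem.upper(): left["text"], "Y": right["text"]}
            )
    return filled


# Hypothetical chunker output for an illustrative sentence.
CHUNKS = [
    {"type": "NP", "text": "p53", "sem": "protein"},
    {"type": "VP", "text": "regulates", "lemma": "regulate"},
    {"type": "NP", "text": "apoptosis"},
]

templates = match_template(CHUNKS, "regulate", "protein")
```

A lexical constraint such as "domain" would be one more condition on the right-hand noun phrase, for example on its head word.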
Application: Ontology Extraction
Clustering of head nouns of Subject-Verb and Verb-Object relations
Combine with pattern matching and heuristics
Case study: Medline 4 million words hepatitis, SwissProt corpus
Results:
Better clusters with shallow parsing
Useful in knowledge management, thesaurus development, …
Ontobasis (IWT)
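The idea behind clustering head nouns is that nouns occurring with the same verbs in the same grammatical role tend to be semantically related. The sketch below computes only the pairwise similarity step, using Jaccard overlap of (relation, verb) contexts. A full system would feed such scores to a clustering algorithm, which is not shown here. The observations are taken from the SwissProt example in this deck; the function names and the threshold value are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# (head noun, relation, verb) observations taken from the SwissProt example.
OBSERVATIONS = [
    ("gene", "SBJ", "show"),
    ("protein", "SBJ", "show"),
    ("protein", "SBJ", "inhibit"),
    ("protein", "SBJ", "exhibit"),
    ("protein", "SBJ", "bind"),
    ("copper", "OBJ", "bind"),
    ("ubiquitin", "OBJ", "bind"),
]


def context_sets(observations):
    """Map each head noun to the set of (relation, verb) contexts it occurs in."""
    contexts = defaultdict(set)
    for noun, relation, verb in observations:
        contexts[noun].add((relation, verb))
    return contexts


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def similar_pairs(observations, threshold=0.25):
    """Return noun pairs whose context overlap reaches the threshold."""
    contexts = context_sets(observations)
    pairs = []
    for left, right in combinations(sorted(contexts), 2):
        score = jaccard(contexts[left], contexts[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


pairs = similar_pairs(OBSERVATIONS)
```

On this toy input copper and ubiquitin share all their contexts, and gene and protein share one of four. Keeping the relation in the context is what shallow parsing contributes: subject and object occurrences of the same verb are not conflated.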
Example (SwissProt corpus)
gene | show | significant homology
amino_acid_sequence | have/indicate/lack/reveal/show | homology
protein | show | homology, immunoreactivity, reactivity, sequence similarity
protein | inhibit | catalytic activity, apoptosis, protein synthesis...
protein | exhibit | significant homology
protein | bind | copper, ubiquitin
protein | correspond | isoelectric point
induction | requires | protein synthesis
Edman degradation | of | intact protein
regulatory subunit | of | cAMP-dependent protein kinase
…
[Diagram: extracted concept graph around hepatitis. Node labels: liver, hepatitis, cirrhosis, infection, disease, HBV, antibody, culture, immunization, antisera, vaccination. Edge labels: related_to, sim, prevented by, produced by.]
Further development
Semantic roles
Faster adaptation to new domains
Domain semantics (NER / concept tagging)
Active Learning / semi-supervised learning
More analytic power
Negation, modality, quantification
Limited event and scenario recognition
Conclusions
Text Mining tasks benefit from text analysis
Understanding can be formulated as a flexible heterarchy of classifiers
These classifiers can be trained / adapted on annotated corpora and can eventually approximate deep understanding
Questions?
Walter Daelemans
A1.10 Campus Drie Eiken
• (September: Stadscampus)
[email protected]