
On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning
Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications
NCSR “Demokritos”
http://www.iit.demokritos.gr/~paliourg
Kassel, 22 July 2005
Outline
• Motivation and state of the art
• SKEL research
– Vision
– Information integration in CROSSMARC.
– Meta-learning for information extraction.
– Context-free grammar learning.
– Ontology enrichment.
– Bootstrapping ontology evolution with multimedia
information extraction.
• Open issues
Motivation
• Practical information extraction requires a
conceptual description of the domain, e.g. an
ontology, and a grammar.
• Manual creation and maintenance of these
resources is expensive.
• Machine learning has been used to:
– Learn ontologies based on extracted instances.
– Learn extraction grammars, given the conceptual
model.
• Goal: study how the two processes interact and the possibility of combining them.
Information extraction
• Common approach: shallow parsing with
regular grammars.
• Limited use of deep analysis to improve
extraction accuracy (HPSGs, concept graphs).
• Linking of extraction patterns to ontologies (e.g.
information extraction ontologies).
• Initial attempts to combine syntax and
semantics (Systemic Functional Grammars).
• Learning simple extraction patterns (regular
expressions, HMMs, tree-grammars, etc.)
Ontology learning
• Deductive approach to ontology modification:
driven by linguistic rules.
• Inductive identification of new concepts/terms.
• Clustering, based on lexico-syntactic analysis of
the text (subcat frames).
• Formal Concept Analysis for term clustering
and concept identification.
• Clustering and merging of conceptual graphs
(conceptual graph theory).
• Deductive learning of extraction grammars in
parallel with the identification of concepts.
SKEL - vision
Research objective: innovative knowledge technologies for reducing the information overload on the Web.
Areas of research activity:
– Information gathering (retrieval, crawling, spidering)
– Information filtering (text and multimedia
classification)
– Information extraction (named entity recognition and
classification, role identification, wrappers, grammar
and lexicon learning)
– Personalization (user stereotypes and communities)
– Ontology learning and population
CROSSMARC Objectives
Develop technology for Information
Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites
without a standardized format (structured, semi-structured, free text),
• process Web pages written in several
languages,
• be customized semi-automatically to new
domains and languages,
• deliver integrated information according to
personalized profiles.
CROSSMARC Architecture
[Architecture diagram: the system components are organised around a shared domain Ontology]
CROSSMARC Ontology
Ontology (excerpt):
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Processor</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Processor Name</description>
        <discrete_set type="open">
          <value id="OV-d0e1041">
            <description>Intel Pentium 3</description>
          </value>
          …
Lexicon (excerpt):
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
  …
Greek Lexicon (excerpt):
  <node idref="OA-d0e7">
    …
    <synonym>Όνομα Επεξεργαστή</synonym>  <!-- "Processor Name" in Greek -->
  </node>
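To make the ontology/lexicon split concrete, here is a minimal Python sketch (my own illustration, not CROSSMARC code) that reads a language-specific lexicon fragment like the ones above and maps each surface form to the ontology id it refers to; the <lexicon> wrapper element is an assumption, since the slide only shows node fragments.

import xml.etree.ElementTree as ET

def load_synonyms(lexicon_xml: str) -> dict[str, str]:
    """Map each surface form (synonym) to the ontology value/attribute id it names."""
    surface_to_id = {}
    for node in ET.fromstring(lexicon_xml).iter("node"):
        ref = node.get("idref")
        for syn in node.findall("synonym"):
            surface_to_id[syn.text.strip()] = ref
    return surface_to_id

# Hypothetical lexicon fragment, wrapped in an assumed <lexicon> root element.
lexicon = """<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>"""

print(load_synonyms(lexicon))
# {'Intel Pentium III': 'OV-d0e1041', 'Pentium III': 'OV-d0e1041', 'P3': 'OV-d0e1041', 'PIII': 'OV-d0e1041'}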
Meta-learning for Web IE
Motivation:
• There are many different learning
methods, producing different types of
extraction grammar.
• In CROSSMARC we had four different approaches, with significant differences in the extracted information.
Proposed approach:
• Use meta-learning to combine the
strengths of individual learning methods.
Meta-learning for Web IE
Stacked generalization
[Diagram: stacked generalization. The base-level dataset D is split into parts Dj; the base learners L1…LN are trained on D \ Dj and the resulting classifiers C1(j)…CN(j) are applied to Dj, producing meta-level vectors MDj that together form the meta-level dataset MD. The meta-learner LM is trained on MD, yielding the meta-classifier CM. At run time the base classifiers C1…CN turn a new vector x into a meta-level vector, and CM assigns the class value y(x).]
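As a concrete illustration of the scheme in the diagram, here is a minimal stacked-generalization sketch in Python (my own sketch using scikit-learn on toy synthetic data, not the system described in the talk; the choice of base and meta learners is an assumption):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def train_stacking(X, y, base_learners, meta_learner, n_splits=5):
    # Meta-level dataset MD: one column per base learner, filled with
    # out-of-fold predictions (each learner is trained on D \ Dj and applied to Dj).
    meta_X = np.zeros((len(X), len(base_learners)))
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        for k, learner in enumerate(base_learners):
            clf = learner().fit(X[train_idx], y[train_idx])
            meta_X[test_idx, k] = clf.predict(X[test_idx])
    # The meta-classifier CM is trained on MD; the base classifiers C1...CN on all of D.
    cm = meta_learner().fit(meta_X, y)
    base_clfs = [learner().fit(X, y) for learner in base_learners]
    return base_clfs, cm

def predict_stacking(x, base_clfs, cm):
    # Turn a new vector x into its meta-level vector, then let CM assign y(x).
    meta_x = [[clf.predict(x.reshape(1, -1))[0] for clf in base_clfs]]
    return cm.predict(meta_x)[0]

# Toy usage on synthetic data:
X = np.random.rand(60, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
base_clfs, cm = train_stacking(X, y, [DecisionTreeClassifier, GaussianNB], LogisticRegression)
print(predict_stacking(X[0], base_clfs, cm))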
Meta-learning for Web IE
Information Extraction is not naturally a classification task
In IE we deal with text documents, paired with templates
Each template is filled with instances <t(s,e), f>
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel
<b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB
SDRAM up to 1GB…
Template T
  t(s,e)                   s, e      Field f
  Transport ZX             47, 49    model
  15"                      56, 58    screenSize
  TFT                      59, 60    screenType
  Intel <b> Pentium III    63, 67    procName
  600 MHz                  67, 69    procSpeed
  256 MB                   76, 78    ram
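A minimal sketch (my own notation, not the talk's code) of the <t(s,e), f> representation as a data structure:

from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    """One filled slot <t(s,e), f>: the extracted string t spanning
    token offsets (s, e), and the template field f it fills."""
    text: str    # t(s, e)
    start: int   # s
    end: int     # e
    field: str   # f

# A template is simply the set of instances extracted from one document.
Template = set[Instance]

example: Template = {
    Instance("Intel <b> Pentium III", 63, 67, "procName"),
    Instance("256 MB", 76, 78, "ram"),
}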
Meta-learning for Web IE
Combining Information Extraction systems
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br>
Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB
SDRAM up to 1GB…
T1 filled by the IE system E1
  t(s, e)                  s, e      f
  Transport ZX             47, 49    model
  15"                      56, 58    screenSize
  TFT                      59, 60    screenType
  Intel <b> Pentium III    63, 67    procName
  600 MHz                  67, 69    procSpeed
  256 MB                   76, 78    ram
  1 GB                     81, 83    ram

T2 filled by the IE system E2
  t(s, e)                  s, e      f
  Transport ZX             47, 49    manuf
  TFT                      59, 60    screenType
  Intel <b> Pentium        63, 66    procName
  600 MHz                  67, 69    procSpeed
  256 MB                   76, 78    ram
  1 GB                     81, 83    HDcapacity
Meta-learning for Web IE
Creating a stacked template
…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br>
Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256
MB SDRAM up to 1GB…
Stacked template (ST)
  s, e     t(s, e)                  Field by E1   Field by E2   Correct field
  47, 49   Transport ZX             model         manuf         model
  56, 58   15"                      screenSize    -             screenSize
  59, 60   TFT                      screenType    screenType    screenType
  63, 66   Intel <b> Pentium        -             procName      -
  63, 67   Intel <b> Pentium III    procName      -             procName
  67, 69   600 MHz                  procSpeed     procSpeed     procSpeed
  76, 78   256 MB                   ram           ram           ram
  81, 83   1 GB                     ram           HDcapacity    -
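A small sketch (my own illustration, not the actual system) of how templates filled by different IE systems could be merged, keyed on the span offsets (s, e), into the rows of such a stacked template; at training time the "Correct field" column would come from the hand-filled template and serve as the class label for the meta-learner.

def stack_templates(templates):
    """templates: one list of (text, start, end, field) tuples per IE system.
    Returns stacked-template rows keyed by (start, end)."""
    rows = {}
    for i, template in enumerate(templates):
        for text, start, end, field in template:
            row = rows.setdefault((start, end), {"text": text, "fields": {}})
            row["fields"][f"E{i + 1}"] = field
    return rows

t1 = [("Transport ZX", 47, 49, "model"), ("1 GB", 81, 83, "ram")]
t2 = [("Transport ZX", 47, 49, "manuf"), ("1 GB", 81, 83, "HDcapacity")]
for (s, e), row in sorted(stack_templates([t1, t2]).items()):
    print((s, e), row["text"], row["fields"])
# (47, 49) Transport ZX {'E1': 'model', 'E2': 'manuf'}
# (81, 83) 1 GB {'E1': 'ram', 'E2': 'HDcapacity'}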
Meta-learning for Web IE
Training in the new stacking framework
D = set of documents, paired with hand-filled templates.
[Diagram: as in standard stacking, the base learners L1…LN are trained on D \ Dj; the resulting IE systems E1(j)…EN(j) fill templates for the documents of Dj, and these are merged into stacked templates ST1, ST2, …, whose rows give the meta-level feature vectors MDj. MD, the set of all meta-level feature vectors, is then used by the meta-learner LM to train the meta-classifier CM; the base learners are also applied to all of D to obtain the final IE systems E1…EN.]
Meta-learning for Web IE
Stacking at run-time
[Diagram: a new document d is processed by the IE systems E1…EN, producing filled templates T1…TN; these are combined into a stacked template, and the meta-classifier CM decides, for each candidate instance <t(s,e), f>, which field it fills in the final template T.]
Experimental results
F1-scores (combined recall and precision) on four
benchmark domains and one of the CROSSMARC
domains.
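For reference, F1 is the harmonic mean of precision P and recall R: F1 = 2PR / (P + R).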
  Domain     Best base   Stacking
  Courses    65.73       71.93
  Projects   61.64       70.66
  Laptops    63.81       71.55
  Jobs       83.22       85.94
  Seminars   86.23       90.03
Learning CFGs
Motivation:
• Provide more complex extraction patterns for less structured text.
• Learn more compact and human-comprehensible grammars.
• Process large corpora containing only positive examples.
Proposed approach:
• Efficient learning of context-free grammars from positive examples only, guided by the Minimum Description Length principle.
Learning CFGs
Introducing eg-GRIDS
• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation is controlled through a heuristic based on MDL.
• Two basic/three auxiliary learning operators.
• Two search strategies:
– Beam search.
– Genetic search.
Learning CFGs
Minimum Description Length (MDL)
• Model Length (ML) = GDL + DDL
• Grammar Description Length (GDL): the bits required to encode the grammar G.
• Derivations Description Length (DDL): the bits required to encode all training examples, as encoded by the grammar G.
[Diagram: across the space of hypotheses, an overly general grammar has a small GDL but a large DDL, while an overly specific grammar has a large GDL but a small DDL; MDL trades the two off against each other.]
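Written out (the symbol O+ for the set of positive training examples is my notation, not the slide's): ML(G) = GDL(G) + DDL(O+ | G), minimised over candidate grammars G. A very small grammar keeps GDL(G) low at the cost of a high DDL(O+ | G), and vice versa.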
Learning CFGs
eg-GRIDS Architecture
[Architecture diagram: the training examples are first turned into an overly specific grammar, which seeds a beam of grammars. The learning operators (Merge NT, Create NT, Create Optional NT, Body Substitution, Detect Center Embedding) produce candidate grammars, and the search organisation (beam search, or an evolutionary algorithm with mutation and selection) keeps the most promising ones; the loop repeats while any inferred grammar is better than those in the beam, and the best grammar found is returned as the final grammar.]
Experimental results
• The Dyck language with k=1: S → SS | (S) | ε
Errors of:
• Omission: failures to parse sentences
generated from the “correct” grammar (longer
test sentences than in the training set).
– Overly specific grammar.
• Commission: failures of the “correct” grammar
to parse sentences generated by the inferred
grammar.
– Overly general grammar.
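A minimal Python sketch (my own, not part of eg-GRIDS) of this evaluation protocol for the k=1 Dyck language: sample sentences from the "correct" grammar and check them with the inferred grammar (omission), and sample from the inferred grammar and check them with the correct one (commission). Here, for illustration, both roles are played by the correct grammar, so both error rates come out as zero.

import random

def generate_dyck(max_depth=6):
    """Sample a sentence from the 'correct' grammar S -> SS | (S) | ε."""
    def expand(depth):
        r = random.random()
        if depth >= max_depth or r < 1/3:
            return ""                                   # S -> ε
        if r < 2/3:
            return expand(depth + 1) + expand(depth + 1)  # S -> SS
        return "(" + expand(depth + 1) + ")"              # S -> (S)
    return expand(0)

def parses_dyck(s):
    """Recogniser for the correct grammar: balanced parentheses."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def error_rates(inferred_parse, inferred_generate, n=1000):
    """Omission: correct sentences the inferred grammar fails to parse.
       Commission: inferred-grammar sentences the correct grammar fails to parse."""
    omission = sum(not inferred_parse(generate_dyck()) for _ in range(n)) / n
    commission = sum(not parses_dyck(inferred_generate()) for _ in range(n)) / n
    return omission, commission

# If the inferred grammar equals the correct one, both rates are 0:
print(error_rates(parses_dyck, generate_dyck))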
Experimental results
[Plot: probability of parsing a valid sentence (1 - errors of omission)]
Experimental results
[Plot: probability of generating a valid sentence (1 - errors of commission)]
Ontology Enrichment
• We concentrate on instances.
• Highly evolving domain (e.g. laptop descriptions):
  – New instances characterize new concepts.
    e.g. 'Pentium 2' is an instance that denotes a new concept if it does not exist in the ontology.
  – New surface appearances of an instance.
    e.g. 'PIII' is a different surface appearance of 'Intel Pentium 3'.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.
Ontology Enrichment
Annotating the corpus using the domain ontology
[Diagram: the corpus is annotated using the multi-lingual domain ontology; information extraction, trained via machine learning, produces additional annotations; these drive ontology enrichment / population, whose proposals are validated by a domain expert before entering the ontology, closing the loop.]
Finding synonyms
• The number of instances for validation increases
with the size of the corpus and the ontology.
• There is a need to support the enrichment of the 'synonymy' relationship.
• Automatically discover different surface appearances of an instance (the CROSSMARC synonymy relationship).
• Issues to be handled:
  Synonym:         'Intel pentium 3' - 'Intel pIII'
  Orthographical:  'Intel p3' - 'intell p3'
  Lexicographical: 'Hewlett Packard' - 'HP'
  Combination:     'Intell Pentium 3' - 'P III'
COCLU
• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when adding a candidate string. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (based on a threshold on CCDiff).
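A rough Python sketch of the idea (my own reconstruction from the description above, not the actual COCLU implementation; the threshold value is an arbitrary assumption): the code length of a cluster is measured with a Huffman code built over its characters, CCDiff is the growth in that code length caused by adding the candidate string, and the candidate joins the cluster with the smallest CCDiff unless even that exceeds the threshold.

from collections import Counter
import heapq

def huffman_code_lengths(freqs):
    """Build a Huffman tree over symbol frequencies; return each symbol's code length in bits."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**c1, **c2}.items()}  # one level deeper
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def cluster_code_length(strings):
    """Bits needed to encode all strings of the cluster with a Huffman code
    built from their character frequencies (the cluster's model)."""
    lengths = huffman_code_lengths(Counter("".join(strings)))
    return sum(lengths[ch] for s in strings for ch in s)

def cc_diff(cluster, candidate):
    """CCDiff: increase in the cluster's code length caused by adding the candidate."""
    return cluster_code_length(cluster + [candidate]) - cluster_code_length(cluster)

def assign(clusters, candidate, threshold=30.0):
    """Hill climbing: add the candidate to the cluster whose code grows least,
    or start a new cluster when even the best CCDiff exceeds the threshold (assumed value)."""
    if clusters:
        diffs = [cc_diff(c, candidate) for c in clusters]
        best = min(range(len(clusters)), key=diffs.__getitem__)
        if diffs[best] <= threshold:
            clusters[best].append(candidate)
            return clusters
    clusters.append([candidate])
    return clusters

clusters = [["Intel Pentium III", "Pentium III", "PIII"], ["Hewlett Packard", "HP"]]
assign(clusters, "P3")   # 'P3' joins whichever cluster's code length grows least
print(clusters)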
Experimental results
Discovering new instances: hide part of the known instances; evolve the ontology and grammars to recover them.
  Initial:    15/58   28/58   40/58
  2nd iter.:  48/58   56/58   57/58
Discovering lexical synonyms: assign an instance to a group, while proportionally decreasing the number of instances initially available in each group.
  [Plot: Accuracy (%), from 50 to 100, against the percentage of instances removed (0-80%)]
BOEMIE (Bootstrapping Ontology Evolution with Multimedia Information Extraction) - motivation
• Multimedia content grows at increasing rates in public and proprietary webs.
• Hard to provide semantic indexing of multimedia content.
• Significant advances in the automatic extraction of low-level features from visual content.
• Little progress in the identification of high-level semantic features.
• Little progress in the effective combination of semantic features from different modalities.
• Great effort in producing ontologies for the Semantic Web.
• Hard to build and maintain domain-specific multimedia ontologies.
BOEMIE - approach
[Architecture diagram: content collection (crawlers, spiders, etc.) gathers multimedia content. The semantics extraction toolkit (visual, text and audio extraction tools, plus information fusion tools) extracts semantics from visual, non-visual and fused content, driven by the current ontology; the semantics extraction results feed the ontology evolution toolkit (learning tools, matching tools, reasoning engine), which performs ontology population and enrichment. Starting from an initial ontology, and coordinated with other ontologies via an intermediate ontology, this loop yields an evolved ontology, maintained with an ontology management tool, which in turn drives further extraction.]
KR issues
• Is there a common formalism to capture the necessary semantic + syntactic + lexical knowledge for IE?
• Is that better than having separate
representations for different tasks?
• Do we need an intermediate formalism
(e.g. grammar + CG + ontology)?
• Do we need to represent uncertainty (e.g.
using probabilistic graphical models)?
ML issues
• What types and which aspects of
grammars and conceptual structures can
we learn?
• What training data do we need? Can we
reduce the manual annotation effort?
• What background knowledge do we need
and what is the role of deduction?
• What is the role of multi-strategy learning,
especially if complex representations are
used?
Content-type issues
• What is the role of semantically annotated
content in learning, e.g. as training data?
• What is the role of hypertext as a graph?
• Can we extract information from
multimedia content?
• How can ontologies and learning help
improve extraction from multimedia?
Acknowledgements
• This is the research of many current and past members of SKEL.
• CROSSMARC is the joint work of the project consortium (NCSR “Demokritos”, University of Edinburgh, University of Roma ‘Tor Vergata’, Veltinet, Lingway).