Information Extraction

Adapted from slides by Junichi Tsujii, Ronen Feldman and others
Most Data are Unstructured (Text)
or Semi-Structured…






Email
Insurance claims
News articles
Web pages
Patent portfolios
…





Customer complaint letters
Contracts
Transcripts of phone calls with
customers
Technical documents
…
Text data mining has become more and more important…
(Adapted from J. Dorre et al. “Text Mining:
Finding Nuggets in Mountains of Textual Data”)
Application Tasks of NLP
(1)Information Retrieval/Detection
To search and retrieve documents in response to queries
for information
(2)Passage Retrieval
To search and retrieve part of documents in response
to queries for information
(3)Information Extraction
To extract information that fits pre-defined database schemas
or templates, specifying the output formats
(4) Question/Answering Tasks
To answer general questions by using texts as knowledge
base: Fact retrieval, combination of IR and IE
(5)Text Understanding
To understand texts as people do: Artificial Intelligence
Information Extraction:
A Pragmatic Approach




Let application requirements drive semantic analysis
Identify the types of entities that are relevant to a
particular task
Identify the range of facts that one is interested in for
those entities
Ignore everything else
IE definitions




Entity: an object of interest such as a person or
organization
Attribute: A property of an entity such as name, alias,
descriptor or type
Fact: A relationship held between two or more entities
such as Position of Person in Company
Event: An activity involving several entities such as
terrorist act, airline crash, product information
IE accuracy: typical figures by information type
Entity: 90-98%
Attribute: 80%
Fact: 60-70%
Event: 50-60%
MUC conferences



MUC 1 to MUC 7
1987 to 1997
Topics:
 - Naval operations (2)
 - Terrorist activity (2)
 - Joint venture and microelectronics
 - Management changes
 - Space vehicles and missile launches
MUC and Scenario Templates



Define a set of “interesting entities”
 - Persons, organizations, locations…
Define a complex scenario involving interesting events
and relations over entities
 - Example: management succession: persons,
companies, positions, reasons for succession
This collection of entities and relations is called a
“scenario template.”
Problems with Scenario Template


Encouraged development of highly domain specific
ontologies, rule systems, heuristics, etc.
Most of the effort expended on building a scenario
template system was not directly applicable to a different
scenario template.
Addressing the Problem


Address a large number of smaller, more focused
scenario templates (Event-99)
Develop a more systematic ground-up approach to
semantics by focusing on elementary entities, relations,
and events (ACE)
The ACE Evaluation


The ACE program: the challenge of extracting content from
human language. Research effort directed to master
 - first the extraction of “entities”
 - then the extraction of “relations” among these entities
 - finally the extraction of “events” that are causally
related sets of relations
After two years, top systems successfully capture well
over 50% of the value at the entity level
The ACE Program




“Automated Content Extraction”
Develop core information extraction technology by focusing on extracting
specific semantic entities and relations over a very wide range of texts.
Corpora: Newswire and broadcast transcripts, but broad range of topics and
genres.
 - Third person reports
 - Interviews
 - Editorials
 - Topics: foreign relations, significant events, human interest, sports,
weather
Discourage highly domain- and genre-dependent solutions
Applications of IE




Routing of information
Infrastructure for IR and categorization (higher level
features)
Event based summarization
Automatic creation of databases and knowledge bases
Where would IE be useful?



Semi-structured text
Generic documents like news articles
Most of the information in the doc is centered around a
set of easily identifiable entities
The Problem
Date
Time: Start - End
Location
Speaker
Person
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source
concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Slots to fill: NAME, TITLE, ORGANIZATION
Courtesy of William W. Cohen
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
IE output for the article above:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Courtesy of William W. Cohen
What is “Information Extraction”
Information Extraction =
segmentation + classification + association + clustering
Segments found in the article above (aka “named entity extraction”):
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Courtesy of William W. Cohen
What is “Information Extraction”
Information Extraction =
segmentation + classification + association + clustering
Clustering step over the same segments (the mentions marked * are grouped together):
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Courtesy of William W. Cohen
Landscape of IE Tasks:
Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (“named entity” extraction)
 Person: Jack Welch
 Person: Jeffrey Immelt
 Location: Connecticut
Binary relationship
 Relation: Person-Title
 Person: Jack Welch
 Title: CEO
 Relation: Company-Location
 Company: General Electric
 Location: Connecticut
N-ary record
 Relation: Succession
 Company: General Electric
 Title: CEO
 Out: Jack Welch
 In: Jeffrey Immelt
Landscape of IE Techniques
Example sentence for every technique: Abraham Lincoln was born in Kentucky.
Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
Classify Pre-segmented Candidates: a classifier decides which class each candidate segment belongs to.
Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes.
Boundary Models: classifiers mark where a field BEGINs and ENDs.
Finite State Machines: find the most likely state sequence.
Context Free Grammars: parse the sentence (tags NNP NNP V V P; constituents NP, PP, VP, S).
Courtesy of William W. Cohen
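The lexicon and sliding-window ideas can be combined in a few lines. The sketch below is only an illustration: the lexicon is a tiny hand-picked subset of US state names (Kentucky is added so the example sentence produces a hit), and `classify` is a stand-in for a trained classifier, implemented here as plain lexicon membership.

```python
# Hypothetical, truncated lexicon of US state names (illustration only).
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}


def sliding_windows(tokens, max_len=3):
    """Yield every contiguous token window of length 1..max_len."""
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            yield tokens[start:start + size]


def classify(window):
    """Stand-in classifier: lexicon membership instead of a learned model."""
    phrase = " ".join(window)
    return "location" if phrase in US_STATES else None


def extract(sentence):
    """Return (phrase, label) pairs for every window the classifier accepts."""
    tokens = sentence.rstrip(".").split()
    results = []
    for window in sliding_windows(tokens):
        label = classify(window)
        if label is not None:
            results.append((" ".join(window), label))
    return results


print(extract("Abraham Lincoln was born in Kentucky."))
```

A learned classifier would replace `classify` and would typically use features of the window and its context rather than exact string lookup.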
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM:
person name
location name
background
 
Find the most likely state sequence (Viterbi): $\arg\max_{s} P(s, o)$
Yesterday Pedro Domingos spoke this example sentence.
Any words generated by the designated “person name”
state are extracted as a person name:
Person name: Pedro Domingos
HMM for Segmentation

Simplest Model: One state per entity type
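A minimal sketch of the Viterbi search for the one-state-per-entity-type model follows. All probabilities are invented for illustration (a trained HMM would estimate them from data), and unseen words get a small unnormalised floor probability rather than a properly smoothed estimate.

```python
import math

STATES = ["background", "person", "location"]

# Hypothetical parameters, chosen by hand for this example.
START = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.4, "person": 0.5, "location": 0.1},
    "location": {"background": 0.6, "person": 0.1, "location": 0.3},
}
EMIT = {
    "background": {"Yesterday": 0.15, "spoke": 0.15, "this": 0.15,
                   "example": 0.15, "sentence": 0.15},
    "person": {"Pedro": 0.4, "Domingos": 0.4},
    "location": {"Kentucky": 0.5},
}
FLOOR = 1e-3  # crude floor for words a state has never emitted


def emit(state, word):
    return EMIT[state].get(word, FLOOR)


def viterbi(words):
    """Return the state sequence s maximising P(s, o), in log space."""
    score = {s: math.log(START[s]) + math.log(emit(s, words[0])) for s in STATES}
    backpointers = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(emit(s, word)))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def extract(words, path, label):
    """Join maximal runs of words tagged with `label`."""
    spans, current = [], []
    for word, state in zip(words, path):
        if state == label:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


words = "Yesterday Pedro Domingos spoke this example sentence".split()
path = viterbi(words)
print(extract(words, path, "person"))
```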
Discriminative Approaches
Yesterday Pedro Domingos spoke this example sentence.
Is this phrase (X) a name? Y=1 (yes);Y=0 (no)
Learn from many examples to predict Y from X
Maximum Entropy, Logistic Regression:
$$p(Y \mid X) = \frac{1}{Z} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(X, Y)\right)$$
$\lambda_i$: parameters
$f_i$: features (e.g., is the phrase capitalized?)
$Z$: normalization constant
More sophisticated: Consider dependency between different labels
(e.g. Conditional Random Fields)
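To make the log-linear form concrete, here is a small sketch that evaluates p(Y | X) for a binary "is this phrase a name?" decision. The three feature functions and their weights are hypothetical and set by hand; in practice the weights would be learned from many labelled examples.

```python
import math


def features(phrase, y):
    """Hypothetical feature functions f_i(X, Y) for a candidate phrase."""
    words = phrase.split()
    capitalized = all(w[:1].isupper() for w in words)
    return [
        1.0 if capitalized and y == 1 else 0.0,       # capitalized and a name
        1.0 if not capitalized and y == 0 else 0.0,   # lowercase and not a name
        1.0 if len(words) == 2 and y == 1 else 0.0,   # two-word name
    ]


WEIGHTS = [1.5, 2.0, 0.5]  # hand-set lambda_i, not learned


def p_y_given_x(phrase, y):
    """p(Y | X) = exp(sum_i lambda_i f_i(X, Y)) / Z, with Z summed over Y in {0, 1}."""
    def unnormalised(label):
        return math.exp(sum(l * f for l, f in zip(WEIGHTS, features(phrase, label))))

    z = unnormalised(0) + unnormalised(1)
    return unnormalised(y) / z


print(round(p_y_given_x("Pedro Domingos", 1), 3))
print(round(p_y_given_x("this example", 1), 3))
```

Each phrase is scored independently here; a Conditional Random Field would additionally model the dependency between neighbouring labels.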
Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”
 “a local concern”
 “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990
FASTUS
Based on finite-state automata (FSA)
1. Complex Words:
Recognition of multi-words and proper names
(e.g. “set up”, “new Taiwan dollars”)
2. Basic Phrases:
Simple noun groups, verb groups and particles
(e.g. “a Japanese trading house”, “had set up”)
3. Complex Phrases:
Complex noun groups and verb groups
(e.g. “production of 20,000 iron and metal wood clubs”)
4. Domain Events:
Patterns for events of interest to the application
Basic templates are to be built.
(e.g. [company] [set up] [Joint-Venture] with [company])
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
 (1) Name: X?  Country: Germany
 (2) Name: Y?  Country: China
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Different template for crimes
Crime-Type: Murder
Type: Stabbing
The killed: Name: Jurgen Pfrang
Age: 51
Profession: Deputy general manager
Location: Nanjing, China
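The two templates filled from the same passage can be pictured as two record types with optional slots. The sketch below is one possible encoding, not a standard schema: the class and field names are invented for illustration, and `None` marks a slot the text leaves unfilled (the X? and Y? entries).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Company:
    name: Optional[str] = None      # None = slot not filled by the text
    country: Optional[str] = None


@dataclass
class JointVentureTemplate:
    name: Optional[str] = None
    products: Optional[str] = None
    location: Optional[str] = None
    companies: List[Company] = field(default_factory=list)


@dataclass
class CrimeTemplate:
    crime_type: Optional[str] = None
    method: Optional[str] = None
    victim_name: Optional[str] = None
    victim_age: Optional[int] = None
    victim_profession: Optional[str] = None
    location: Optional[str] = None


# Same passage, two different templates.
venture = JointVentureTemplate(
    name="Yaxing Benz",
    products="buses and bus chassis",
    location="Yangzhou, China",
    companies=[Company(country="Germany"), Company(country="China")],
)
crime = CrimeTemplate(
    crime_type="Murder",
    method="Stabbing",
    victim_name="Jurgen Pfrang",
    victim_age=51,
    victim_profession="Deputy general manager",
    location="Nanjing, China",
)
print(venture)
print(crime)
```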
Interpretation of Texts
(1) Information Retrieval/Detection: User
(2) Passage Retrieval: User
(3) Information Extraction: System
(4) Question/Answering Tasks: System
(5) Text Understanding: System
Characterization of Texts
[Figure: an IR System takes Queries and a Collection of Texts and returns texts or a Passage; Knowledge and Interpretation sit outside the system. An IE System applies NLP (structures of sentences) together with Knowledge and Interpretation to map Texts to Templates.]
IE as compromise NLP
[Figure: an IE System inside the General Framework of NLP/NLU, mapping Texts to Predefined Templates.]
Performance Evaluation
(1) Information Retrieval/Detection: Rather clear
(2) Passage Retrieval: A bit vague
(3) Information Extraction: Rather clear
(4) Question/Answering Tasks: A bit vague
(5) Text Understanding: Very vague
Query against a Collection of Documents
N: Correct Documents
M: Retrieved Documents
C: Correct Documents that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P R}{P + R}$
Query against a Collection of Documents
N: Correct Templates
M: Retrieved Templates
C: Correct Templates that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P R}{P + R}$
More complicated due to partially
filled templates
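The three measures are simple to compute once C, M and N are counted. The helper below is a sketch that treats every template as either fully correct or wrong; it does not attempt any partial credit for partially filled templates, and the counts in the usage line are made up.

```python
def precision_recall_f(c, m, n):
    """c: correct items actually retrieved, m: retrieved items, n: correct items."""
    precision = c / m if m else 0.0
    recall = c / n if n else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 8 templates retrieved, 10 correct ones exist, 6 overlap.
p, r, f = precision_recall_f(c=6, m=8, n=10)
print(p, r, round(f, 3))
```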
Framework of IE
IE as compromise NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
General Framework of NLP
 Morphological and Lexical Processing
 Syntactic Analysis
 Semantic Analysis
 Context processing, Interpretation
Incomplete Domain Knowledge, Interpretation Rules
Predefined Aspects of Information
Approaches for building IE systems

Knowledge Engineering Approach
 - Rules crafted by linguists in cooperation with domain
experts
 - Most of the work done by inspecting a set of relevant
documents
Approaches for building IE systems

Automatically trainable systems
 - Techniques based on statistics and almost no
linguistic knowledge
 - Language independent
 - Main input: annotated corpus
 - Small effort for creating rules, but creating the annotated
corpus is laborious
Techniques in IE
(1) Domain Specific Partial Knowledge:
Knowledge relevant to information to be extracted
(2) Ambiguities:
Ignoring irrelevant ambiguities
Simpler NLP techniques
(3) Robustness:
Coping with Incomplete dictionaries
(open class words)
Ignoring irrelevant parts of sentences
(4) Adaptation Techniques:
Machine Learning, Trainable systems
General Framework of NLP
 Morphological and Lexical Processing
 Syntactic Analysis
 Semantic Analysis
 Context processing, Interpretation
Part of Speech Tagger (95%): FSA rules, statistic taggers
Open class words: Named entity recognition
 (ex) Locations, Persons, Companies, Organizations, Position names
 Local Context, Statistical Bias
 Domain specific rules:
  <Word><Word>, Inc.
  Mr. <Cpt-L>. <Word>
 Machine Learning: HMM, Decision Trees
 Rules + Machine Learning
 F-Value 90, Domain Dependent
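The two domain-specific rules can be written as regular expressions. This sketch assumes that `<Word>` means a capitalized word and `<Cpt-L>` a single capital letter; both readings are interpretations of the slide notation, and the test sentence is invented.

```python
import re

# <Word><Word>, Inc.  (assumed: two capitalized words followed by ", Inc.")
COMPANY_RULE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+, Inc\.")

# Mr. <Cpt-L>. <Word>  (assumed: "Mr.", a capital initial, a capitalized word)
PERSON_RULE = re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")


def named_entities(text):
    """Apply both hand-written rules and return labelled matches."""
    found = [(m.group(0), "Company") for m in COMPANY_RULE.finditer(text)]
    found += [(m.group(0), "Person") for m in PERSON_RULE.finditer(text)]
    return found


print(named_entities("Mr. J. Smith joined Acme Widgets, Inc. last year."))
```

Rules like these are precise but brittle, which is why the slide pairs them with machine learning.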
FASTUS
Based on finite-state automata (FSA)
General Framework of NLP:
 Morphological and Lexical Processing
 Syntactic Analysis
 Semantic Analysis
 Context processing, Interpretation
FASTUS stages, set against that framework:
1. Complex Words:
Recognition of multi-words and proper names
2. Basic Phrases:
Simple noun groups, verb groups and particles
3. Complex Phrases:
Complex noun groups and verb groups
4. Domain Events:
Patterns for events of interest to the application
Basic templates are to be built.
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
Chomsky Hierarchy of Grammar    Hierarchy of Automata
Regular Grammar                 Finite State Automata
Context Free Grammar            Push Down Automata
Context Sensitive Grammar       Linear Bounded Automata
Type 0 Grammar                  Turing Machine
Moving down the table: computationally more complex, less efficient.
Example: $A^n B^n$ is context-free but not regular.
[Figure: finite state automaton with states 0 to 4 and arcs labelled PN, ’s, Art, ADJ, N and P, traced step by step over the noun phrase “John’s interesting book with a nice cover”]
Pattern-matching
{PN ’s / Art} (ADJ)* N (P Art (ADJ)* N)*
PN ’s (ADJ)* N P Art (ADJ)* N
John’s interesting book with a nice cover
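The noun-group pattern {PN ’s / Art} (ADJ)* N (P Art (ADJ)* N)* can be run as an explicit finite state automaton over part-of-speech tags. The transition table below is a sketch: its state numbering was chosen for this example and may not match the figure, and the possessive marker is written as the ASCII token `'s`.

```python
# (state, tag) -> next state; state 0 is the start state.
TRANSITIONS = {
    (0, "PN"): 1,    # proper noun ...
    (1, "'s"): 2,    # ... followed by possessive
    (0, "Art"): 2,   # or an article
    (2, "ADJ"): 2,   # any number of adjectives
    (2, "N"): 3,     # head noun
    (3, "P"): 4,     # preposition starts an attached phrase
    (4, "Art"): 2,   # article of the attached phrase
}
ACCEPTING = {3}


def accepts(tags):
    """Return True if the tag sequence is a noun group under the pattern."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no arc for this tag: reject
            return False
    return state in ACCEPTING


# "John's interesting book with a nice cover"
tags = ["PN", "'s", "ADJ", "N", "P", "Art", "ADJ", "N"]
print(accepts(tags))
```

Because each token triggers a single table lookup, recognition is linear in the length of the input, which is the efficiency argument for staying at the bottom of the hierarchy.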
Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 “metal wood” clubs a month.
1. Complex words
2. Basic Phrases:
 Bridgestone Sports Co.: Company name
 said: Verb Group
 Friday: Noun Group
 it: Noun Group
 had set up: Verb Group
 a joint venture: Noun Group
 in: Preposition
 Taiwan: Location
Attachment ambiguities are not made explicit.
Example of IE: FASTUS(1993)
1. Complex words
Structural ambiguities of NP are ignored, as in the phrase “a Japanese trading house”. Compare:
a Japanese tea house
 a [Japanese tea] house
 a Japanese [tea house]
Example of IE: FASTUS(1993)
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE]
in [LOCATION] with [COMPANY] and [COMPANY] to
produce [PRODUCT] to be supplied to [LOCATION].
[JOINT-VENTURE], [COMPANY], capitalized at 20 million
[CURRENCY-UNIT] [START] production in [TIME]
with production of 20,000 [PRODUCT] a month.
3. Complex Phrases
Some syntactic structures
like …
Example of IE: FASTUS(1993)
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE]
in [LOCATION] with [COMPANY] to
produce [PRODUCT] to be supplied to [LOCATION].
[JOINT-VENTURE] capitalized at [CURRENCY] [START]
production in [TIME]
with production of [PRODUCT] a month.
3. Complex Phrases
Syntactic structures relevant
to information to be extracted
are dealt with.
Syntactic variations
GM set up a joint venture with Toyota.
GM announced it was setting up a joint venture with Toyota.
GM signed an agreement setting up a joint venture with Toyota.
GM announced it was signing an agreement to set up a joint
venture with Toyota.
GM plans to set up a joint venture with Toyota.
GM expects to set up a joint venture with Toyota.
[Figure: parse trees (S, NP, VP, V, N) for “GM signed … agreement setting up …” and “GM set up …”, each collapsed to the subject NP plus a single [SET-UP] verb group]
Example of IE: FASTUS(1993)
[COMPANY] [SET-UP] [JOINT-VENTURE]
in [LOCATION] with [COMPANY] to
produce [PRODUCT] to be supplied to [LOCATION].
[JOINT-VENTURE] capitalized at [CURRENCY] [START]
production in [TIME]
with production of [PRODUCT] a month.
3. Complex Phrases
4. Domain Events
[COMPANY] [SET-UP] [JOINT-VENTURE] with [COMPANY]
[COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]
The attachment positions of PP are determined at this stage.
Irrelevant parts of sentences are ignored.
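A domain-event pattern of this kind can be matched over the output of the earlier stages. The sketch below assumes the phrase recognizers have already produced a list of (label, text) chunks; the label names, the `OTHER` catch-all and the returned dictionary are conventions invented for this example, not the FASTUS data format.

```python
def match_tie_up(chunks):
    """Match [COMPANY][SET-UP][JOINT-VENTURE] (others)* with [COMPANY].

    chunks: list of (label, text) pairs from the phrase-recognition stages.
    Returns a basic template dict, or None if the pattern does not occur.
    """
    for i in range(len(chunks) - 2):
        labels = [label for label, _ in chunks[i:i + 3]]
        if labels != ["COMPANY", "SET-UP", "JOINT-VENTURE"]:
            continue
        # Skip irrelevant material until "with" is followed by a company.
        for j in range(i + 3, len(chunks) - 1):
            if chunks[j] == ("OTHER", "with") and chunks[j + 1][0] == "COMPANY":
                return {
                    "Relationship": "TIE-UP",
                    "Entities": [chunks[i][1], chunks[j + 1][1]],
                }
    return None


chunks = [
    ("COMPANY", "GM"),
    ("SET-UP", "set up"),
    ("JOINT-VENTURE", "a joint venture"),
    ("OTHER", "in"),
    ("LOCATION", "Taiwan"),
    ("OTHER", "with"),
    ("COMPANY", "Toyota"),
]
print(match_tie_up(chunks))
```

The intervening "in Taiwan" is skipped by the (others)* part of the pattern, which is how irrelevant parts of the sentence get ignored.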
Complications caused by syntactic variations
Relative clause
The mayor, who was kidnapped yesterday, was found dead today.
[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]
[NG] Relpro {NG/others}* [VG]
Basic patterns → Surface Pattern Generator → Patterns used by Domain Events
The generator covers relative clause construction, passivization, etc.
FASTUS
Based on finite-state automata (FSA)
Example: NP, who was kidnapped, was found.
1. Complex Words
2. Basic Phrases
3. Complex Phrases
4. Domain Events:
Patterns for events of interest to the application
Basic templates are to be built.
 Piece-wise recognition of basic templates
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
 Reconstructing information carried via syntactic structures
 by merging basic templates
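The merging step can be sketched as combining two partial templates whenever none of their filled slots disagree. This is a simplified compatibility test chosen for illustration (slot names follow the TIE-UP example, and `None` marks an empty slot); a real system would also need coreference and type checks before deciding that two templates describe the same entity or event.

```python
def merge(a, b):
    """Merge two partial templates; return None if a filled slot conflicts."""
    for slot in a.keys() & b.keys():
        if a[slot] is not None and b[slot] is not None and a[slot] != b[slot]:
            return None
    merged = dict(a)
    for slot, value in b.items():
        if merged.get(slot) is None:
            merged[slot] = value
    return merged


# Basic template from the first sentence of the joint-venture text.
first = {
    "Relationship": "TIE-UP",
    "Joint Venture Company": None,
    "Activity": "PRODUCTION",
    "Amount": None,
}
# Basic template from the second sentence.
second = {
    "Joint Venture Company": "Bridgestone Sports Taiwan Co.",
    "Activity": "PRODUCTION",
    "Amount": "NT$20000000",
}
print(merge(first, second))
```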
Current state of the art of IE
1. Carefully constructed IE systems
F-60 level (inter-annotator agreement: 60-80%)
Domain: telegraphic messages about naval operations
(MUC-1: 1987, MUC-2: 1989)
News articles and transcriptions of radio broadcasts about
Latin American terrorism (MUC-3: 1991, MUC-4: 1992)
News articles about joint ventures (MUC-5: 1993)
News articles about management changes (MUC-6: 1995)
News articles about space vehicles (MUC-7: 1997)
2. Handcrafted rules (named entity recognition, domain events, etc.)
Automatic learning from texts:
Supervised learning: corpus preparation
Non-supervised, or controlled learning