Information Extraction (Several slides based on those by Ray Mooney and Cohen/McCallum, via Dan Weld’s class)

Make-up Class:
Tomorrow (Wed) 10:30—11:45AM
BY 210 (next to the advising office)
Information Extraction
(Several slides based on those by Ray Mooney and Cohen/McCallum, via Dan Weld’s class)
1
Intended Use of Semantic Web?
• Pages should be annotated with RDF triples, with links to
RDF-S (or OWL) background ontology.
• E.g. See Jim Hendler’s page…
2
Database vs. Semantic Web Inference
(and the Magellan Story)
• Also templated extraction as undoing XML→HTML
conversion. Templated extraction is by DOM patterns;
unstructured extraction is (sort of) by grammar parse-tree
patterns. Grammar learning is mostly from positive examples.
Rinku Patel
To be added
3
Who will annotate the data?
• Semantic web works if the users annotate their pages using some existing
ontology (or their own ontology, but with mapping to other ontologies)
– But users typically do not conform to standards..
• and are not patient enough for delayed gratification…
• Three solutions
– 1. Intercede in the way pages are created (act as if you are helping them write
web-pages)
• What if we change the MS Frontpage/Claris Homepage so that they (slyly)
add annotations?
• E.g. The Mangrove project at U. Wash.
– Help user in tagging their data (allow graphical editing)
– Provide instant gratification by running services that use the tags.
– 2. Collaborative tagging!
• “Folksonomies” (look at Wikipedia article)
– Flickr, Technorati, del.icio.us etc
• CBioC, ESP game etc.
– Need to incentivize users to do the annotations..
– 3. Automated information extraction (next topic)
4
Folksonomies—The good
• Bottom-up approach to taxonomies/ontologies
– [In systems like] Furl, Flickr and Del.icio.us... people classify
their pictures/bookmarks/web pages with tags (e.g. wedding),
and then the most popular tags float to the top (e.g. Flickr's
tags or Del.icio.us on the right)....
– [F]olksonomies can work well for certain kinds of information
because they offer a small reward for using one of the popular
categories (such as your photo appearing on a popular page).
People who enjoy the social aspects of the system will
gravitate to popular categories while still having the freedom
to keep their own lists of tags.
5
Works best when many people tag the same info…
6
Folksonomies… the bad
• On the other hand, not hard to see a few reasons why a
folksonomy would be less than ideal in a lot of cases:
– None of the current implementations have synonym control
(e.g. "selfportrait" and "me" are distinct Flickr tags, as are
"mac" and "macintosh" on Del.icio.us).
– Also, there's a certain lack of precision involved in using
simple one-word tags--like which Lance are we talking about?
– And, of course, there's no hierarchy and the content types
(bookmarks, photos) are fairly simple.
• For indexing and library people, folksonomies are about as
appealing as Wikipedia is to encyclopedia editors.
– But.. there's some interesting stuff happening around them.
7
Mass Collaboration
(& Mice running the Earth)
• The quality of the tags generated through folksonomies is
notoriously hard to control
– So, design mechanisms that ensure correctness of tags..
• ESP game makes it fun to
• CBIOC and Google Co-op restrict annotation privileges to
trusted users..
• It is hard to get people to tag things in which they don’t
have personal interest..
– Find incentive structures..
• ESP makes it a “game” with points
• CBIOC and Google Co-op try to promise delayed
gratification in terms of improved search later..
8
Information Extraction (IE)
• Identify specific pieces of information (data) in an
unstructured or semi-structured textual document.
• Transform unstructured information in a corpus of
documents or web pages into a structured database.
• Applied to different types of text:
– Newspaper articles
– Web pages
– Scientific articles
– Newsgroup messages
– Classified ads
– Medical notes
– Wikipedia (info boxes)..
10
Information Extraction vs. NLP?
• Information extraction is attempting to find
some of the structure and meaning in the
hopefully template driven web pages.
• As IE becomes more ambitious and text
becomes more free form, then ultimately we
have IE becoming equal to NLP.
• Web does give one particular boost to NLP
– Massive corpora..
11
MUC
• DARPA funded significant efforts in IE in the
early to mid 1990’s.
• Message Understanding Conference (MUC) was
an annual event/competition where results were
presented.
• Focused on extracting information from news
articles:
– Terrorist events
– Industrial joint ventures
– Company management changes
• Information extraction of particular interest to the
intelligence community (CIA, NSA).
12
Other Applications
• Job postings:
– Newsgroups: Rapier from austin.jobs
– Web pages: Flipdog
• Job resumes:
– BurningGlass
– Mohomine
• Seminar announcements
• Company information from the web
• Continuing education course info from the web
• University information from the web
• Apartment rental ads
• Molecular biology information from MEDLINE
13
Wikipedia Infoboxes..
• Wikipedia has both unstructured text and structured info boxes..
Infobox
14
Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <[email protected]>
SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with
communicating with and controlling voice cards; preferable Dialogic, however, experience
with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a
Senior level person who can come on board and pick up code with very little training.
Present Operating System is DOS. May go to OS-2 or UNIX in future.
Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
[email protected]
15
Extracted Job Template
computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996
16
Amazon Book Description
….
</td></tr>
</table>
The Age of Spiritual Machines : When Computers Exceed Human Intelligence 

by <a href="/exec/obidos/search-handle-url/index=books&field-author=
Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a> 

 
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90
height=140 align=left border=0></a>



List Price: $14.95 
Our Price: $11.96 
You Save: $2.99 
(20%) 

17
 …

 
Extracted Book Template
Title: The Age of Spiritual Machines :
When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:
18
Extraction from Templated Text
• Many web pages are generated automatically from an
underlying database.
• Therefore, the HTML structure of pages is fairly
specific and regular (semi-structured).
• However, output is intended for human consumption,
not machine interpretation.
• An IE system for such generated pages allows the web
site to be viewed as a structured database.
• An extractor for a semi-structured web site is
sometimes referred to as a wrapper.
• Process of extracting from such pages is sometimes
referred to as screen scraping.
19
Templated Extraction using DOM Trees
• Web extraction may be aided by first
parsing web pages into DOM trees.
• Extraction patterns can then be specified as
paths from the root of the DOM tree to the
node containing the text to extract.
• May still need regex patterns to identify
proper portion of the final CharacterData
node.
20
Sample DOM Tree Extraction
HTML (Element)
  HEADER
  BODY
    B
      “Age of Spiritual Machines” (Character-Data)
    FONT
      “by” (Character-Data)
      A
        “Ray Kurzweil” (Character-Data)
Title: HTML→BODY→B→CharacterData
Author: HTML→BODY→FONT→A→CharacterData
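A minimal sketch of path-based extraction over a DOM-like structure, using only Python's standard `html.parser`. The class `PathTextCollector`, the helper `extract`, and the sample `page` string are illustrative assumptions, not part of the original slides. The sketch records the chain of open tags for every text node and returns the text whose chain equals the requested path. Real pages with unclosed or void tags (e.g. `<br>`) would need a more forgiving parser.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Collect (path, text) pairs; path is the tuple of currently open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements, root first
        self.found = []   # list of (path, text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed inner tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.found.append((tuple(self.stack), text))


def extract(html, path):
    """Return all text nodes whose root-to-node tag path equals `path`."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return [text for p, text in parser.found if p == tuple(path)]


# Hypothetical page mirroring the sample tree (tag names are lowercased by the parser).
page = ("<html><head></head><body><b>Age of Spiritual Machines</b>"
        "<font>by <a href='#'>Ray Kurzweil</a></font></body></html>")
title = extract(page, ["html", "body", "b"])
author = extract(page, ["html", "body", "font", "a"])
```

As the slides note, a regex may still be needed afterwards to pick the right portion of the final text node.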
21
Template Types
• Slots in template typically filled by a substring
from the document.
• Some slots may have a fixed set of pre-specified
possible fillers that may not occur in the text itself.
– Terrorist act: threatened, attempted, accomplished.
– Job type: clerical, service, custodial, etc.
– Company type: SEC code
• Some slots may allow multiple fillers.
– Programming language
• Some domains may allow multiple extracted
templates per document.
– Multiple apartment listings in one ad
22
Simple Extraction Patterns
• Specify an item to extract for a slot using a regular
expression pattern.
– Price pattern: “\$\d+(\.\d{2})?\b”
• May require preceding (pre-filler) pattern to
identify proper context.
– Amazon list price:
• Pre-filler pattern: “List Price: ”
• Filler pattern: “\$\d+(\.\d{2})?\b”
• May require succeeding (post-filler) pattern to
identify the end of the filler.
– Amazon list price:
• Pre-filler pattern: “List Price: ”
• Filler pattern: “.+”
• Post-filler pattern: “”
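The pre-filler/filler idea can be sketched with Python's `re` module. This is an illustration under assumptions: the function name `extract_list_price` and the plain-text `snippet` are made up for the example, and only the pre-filler and filler patterns are used. The filler is matched only at the position where the pre-filler ends, so other prices on the page are ignored.

```python
import re

PRE_FILLER = re.compile(r"List Price:\s*")
FILLER = re.compile(r"\$\d+(\.\d{2})?\b")


def extract_list_price(text):
    """Return the price that immediately follows 'List Price:', or None."""
    pre = PRE_FILLER.search(text)
    if pre is None:
        return None
    # Anchor the filler at the end of the pre-filler match.
    filler = FILLER.match(text, pre.end())
    return filler.group(0) if filler else None


# Hypothetical text resembling the rendered Amazon snippet.
snippet = "List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"
list_price = extract_list_price(snippet)
```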
23
Simple Template Extraction
• Extract slots in order, starting the search for the
filler of the n+1 slot where the filler for the nth
slot ended. Assumes slots always in a fixed order.
– Title
– Author
– List price
– …
• Make patterns specific enough to identify each
filler always starting from the beginning of the
document.
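A small sketch of the in-order strategy: each slot's search starts where the previous filler ended. The slot names, the patterns in `SLOT_PATTERNS`, and the plain-text record `doc` are hypothetical, chosen only to mirror the book template. A slot whose pattern does not match is left empty and the search position is not advanced.

```python
import re

# Hypothetical slot patterns, listed in the order the slots appear on the page.
SLOT_PATTERNS = [
    ("title", re.compile(r"Title:\s*(.+)")),
    ("author", re.compile(r"Author:\s*(.+)")),
    ("list_price", re.compile(r"List-Price:\s*(\$\d+(?:\.\d{2})?)")),
]


def fill_template(text, slot_patterns=SLOT_PATTERNS):
    """Fill slots in order, resuming each search after the previous filler."""
    template, pos = {}, 0
    for slot, pattern in slot_patterns:
        m = pattern.search(text, pos)
        if m is None:
            template[slot] = None  # slot stays empty; position unchanged
            continue
        template[slot] = m.group(1).strip()
        pos = m.end()
    return template


doc = ("Title: The Age of Spiritual Machines\n"
       "Author: Ray Kurzweil\n"
       "List-Price: $14.95\n"
       "Price: $11.96\n")
book = fill_template(doc)
```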
24
Pre-Specified Filler Extraction
• If a slot has a fixed set of pre-specified
possible fillers, text categorization can be
used to fill the slot.
– Job category
– Company type
• Treat each of the possible values of the slot
as a category, and classify the entire
document to determine the correct filler.
25
Learning for IE
• Writing accurate patterns for each slot for each domain (e.g. each
web site) requires laborious software engineering.
• Alternative is to use machine learning:
– Build a training set of documents paired with human-produced
filled extraction templates.
– Learn extraction patterns for each slot using an appropriate
machine learning algorithm.
26
Information Extraction from
unstructured text
37
Information Extraction from Unstructured Text:
• Semantic web needs:
– Tagged data
– Background knowledge
• (blue sky approaches to) automate both
– Knowledge Extraction
• Extract base level knowledge (“facts”) directly from
the web
– Automated tagging
• Start with a background ontology and tag other web
pages
– Semtag/Seeker
39
Fielded IE Systems: Citeseer, Google Scholar; Libra
How do they do it? Why do they fail?

40
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source
concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
IE
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
Slides from Cohen & McCallum
What is “Information Extraction”
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source
concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Slides from Cohen & McCallum
IE in Context
[Pipeline diagram: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search; Data mine. Supporting steps: Create ontology; Label training data → Train extraction models.]
Slides from Cohen & McCallum
IE History
Pre-Web
• Mostly news articles
– De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
– Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
• Most early work dominated by hand-built models
– E.g. SRI’s FASTUS, hand-built FSMs.
– But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then
HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
– Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ‘96
– Build KB’s from the Web.
• Wrapper Induction
– First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97],…
Slides from Cohen & McCallum
What makes IE from the Web Different?
Less grammar, but more formatting & linking
Newswire:
Apple to Open Its First Retail Store
in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in
Manhattan's SoHo district on Thursday, July 18 at
8:00 a.m. EDT. The SoHo store will be Apple's
largest retail store to date and is a stunning example
of Apple's commitment to offering customers the
world's best computer shopping experience.
"Fourteen months after opening our first retail store,
our 31 stores are attracting over 100,000 visitors
each week," said Steve Jobs, Apple's CEO. "We
hope our SoHo store will surprise and delight both
Mac and PC users who want to see everything the
Mac can do to enhance their digital lifestyles."
Web:
www.apple.com/retail
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure,
formatting & layout of the Web is its own
new grammar.
Slides from Cohen & McCallum
Landscape of IE Tasks (1/4):
Pattern Feature Domain
• Text paragraphs without formatting
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables
Slides from Cohen & McCallum
Landscape of IE Tasks (2/4):
Pattern Scope
• Web site specific (Formatting): Amazon.com Book Pages
• Genre specific (Layout): Resumes
• Wide, non-specific (Language): University Names
Slides from Cohen & McCallum
Landscape of IE Tasks (3/4):
Pattern Complexity
E.g. word patterns:
• Closed set: U.S. states
– He was born in Alabama…
– The big Wyoming sky…
• Regular set: U.S. phone numbers
– Phone: (413) 545-1323
– The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
– University of Arkansas, P.O. Box 140, Hope, AR 71802
– Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context + many sources of evidence: Person names
– …was among the six houses sold by Hope Feldman that year.
– Pawel Opalinski, Software Engineer at WhizBang Labs.
Slides from Cohen & McCallum
Landscape of IE Tasks (4/4):
Pattern Combinations
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
• Single entity (“Named entity” extraction)
– Person: Jack Welch
– Person: Jeffrey Immelt
– Location: Connecticut
• Binary relationship
– Relation: Person-Title; Person: Jack Welch; Title: CEO
– Relation: Company-Location; Company: General Electric; Location: Connecticut
• N-ary record
– Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
Slides from Cohen & McCallum
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
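\[ \text{Precision} = \frac{\#\,\text{correctly predicted segments}}{\#\,\text{predicted segments}} = \frac{2}{6} \]
\[ \text{Recall} = \frac{\#\,\text{correctly predicted segments}}{\#\,\text{true segments}} = \frac{2}{4} \]
\[ F_1 = \text{harmonic mean of Precision and Recall} = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right)/2} \]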
Slides from Cohen & McCallum
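The segment-level scores can be computed as in the sketch below. The `(start, end)` token spans are hypothetical, since the highlighting that marked the true and predicted segments on the slide is not available here. They are chosen only to reproduce the slide's counts: 4 true segments, 6 predicted segments, 2 exact matches. A predicted segment counts as correct only if both boundaries match exactly.

```python
def prf1(true_segments, predicted_segments):
    """Exact-match precision, recall and F1 over extracted segments."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    # Harmonic mean, written as 2PR / (P + R); defined as 0 when nothing is correct.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical spans: 4 true, 6 predicted, 2 exact matches.
truth = [(0, 2), (3, 5), (11, 14), (15, 17)]
pred = [(0, 2), (3, 4), (4, 5), (7, 8), (11, 13), (15, 17)]
p, r, f = prf1(truth, pred)
```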
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~30min) required on each site
Slides from Cohen & McCallum
Landscape of IE Techniques (1/1):
Models
(each illustrated on “Abraham Lincoln was born in Kentucky.”)
• Lexicons: is the candidate a member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: a classifier marks BEGIN and END positions
• Finite State Machines: most likely state sequence?
• Context Free Grammars: parse tree over the sentence (POS tags NNP, NNP, V, V, P, …; phrases NP, PP, VP, S)
…and beyond
Any of these models can be used to capture words, formatting or
both.
Slides from Cohen & McCallum
Sliding Windows
Slides from Cohen & McCallum
Extraction by Sliding Window
E.g. Looking for seminar location
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
Slides from Cohen & McCallum
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}
contents: w_t … w_{t+n}
suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Other examples of sliding window: [Baluja et al 2000]
(decision tree over individual words & their context)
Slides from Cohen & McCallum
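A toy sketch of the sliding-window idea, under stated assumptions. The probability tables below are invented for illustration (real values would be estimated from labeled announcements), and the function names are hypothetical. The sketch scores each window by the class-conditional product Pr(length) × Π Pr(word | role, LOCATION) with the independence assumption from the slide. It then returns the best-scoring window, rather than applying Bayes rule against a background model and thresholding the posterior.

```python
import math

# Hypothetical estimates for the LOCATION field; not taken from any real data.
P_PREFIX = {"Place": 0.30, ":": 0.30}
P_CONTENT = {"Wean": 0.20, "Hall": 0.25, "Rm": 0.15, "5409": 0.05}
P_SUFFIX = {"Speaker": 0.30, ":": 0.20}
P_LENGTH = {1: 0.05, 2: 0.25, 3: 0.30, 4: 0.40}
UNSEEN = 0.001  # floor for words or lengths never seen in training


def window_log_score(tokens, start, end, context=2):
    """Log of Pr(length) times the per-word probabilities for tokens[start:end]."""
    prefix = tokens[max(0, start - context):start]
    contents = tokens[start:end]
    suffix = tokens[end:end + context]
    score = math.log(P_LENGTH.get(len(contents), UNSEEN))
    for words, table in ((prefix, P_PREFIX), (contents, P_CONTENT), (suffix, P_SUFFIX)):
        for w in words:
            score += math.log(table.get(w, UNSEEN))
    return score


def best_window(tokens, max_len=4):
    """Try every window of 1..max_len tokens and return the highest-scoring one."""
    candidates = [(window_log_score(tokens, s, e), s, e)
                  for s in range(len(tokens))
                  for e in range(s + 1, min(s + max_len, len(tokens)) + 1)]
    score, s, e = max(candidates)
    return tokens[s:e], score


tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
location, location_score = best_window(tokens)
```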
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during
the 1980s and 1990s.
As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.
Field          F1
Person Name:   30%
Location:      61%
Start Time:    98%
Slides from Cohen & McCallum
Realistic sliding-window-classifier IE
• What windows to consider?
– all windows containing as many tokens as the
shortest example, but no more tokens than the
longest example
• How to represent a classifier? It might:
– Restrict the length of window;
– Restrict the vocabulary or formatting used
before/after/inside window;
– Restrict the relative order of tokens, etc.
• Learning Method
– SRV: Top-Down Rule Learning
[Freitag AAAI ‘98]
– Rapier: Bottom-Up
[Califf & Mooney, AAAI ‘99]
Slides from Cohen & McCallum
Rapier: results – precision/recall
Slides from Cohen & McCallum
Rule-learning approaches to sliding-window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD ‘97]
– Representations for classifiers allow restriction of
the relationships between tokens, etc
– Representations are carefully chosen subsets of
even more powerful representations based on
logic programming (ILP and Prolog)
– Use of these “heavyweight” representations is
complicated, but seems to pay off in results
• Can simpler representations for classifiers
work?
Slides from Cohen & McCallum
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic
classifiers:
– START(i) = Prob( position i starts a field)
– END(j) = Prob( position j ends a field)
– LEN(k) = Prob( an extracted field has length k)
• Then score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
Slides from Cohen & McCallum
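The scoring rule START(i) * END(j) * LEN(j-i) can be sketched as follows. This is only an illustration of the combination step. The boosted boundary detectors are not implemented, and the probability lists and the histogram are hypothetical values. Positions are treated as token boundaries, so an extraction (i, j) covers `tokens[i:j]`.

```python
def best_extraction(start_prob, end_prob, len_prob):
    """Return (score, (i, j)) maximizing START(i) * END(j) * LEN(j - i)."""
    best_score, best_span = 0.0, None
    for i, s in enumerate(start_prob):
        for j in range(i + 1, len(end_prob)):
            score = s * end_prob[j] * len_prob.get(j - i, 0.0)
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_score, best_span


tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()

# Hypothetical detector outputs, one value per boundary position 0..len(tokens).
start = [0.01] * (len(tokens) + 1)
end = [0.01] * (len(tokens) + 1)
start[5], start[11] = 0.9, 0.3
end[9], end[13] = 0.8, 0.4
length_histogram = {1: 0.1, 2: 0.3, 3: 0.2, 4: 0.4}  # hypothetical LEN(k)

score, (i, j) = best_extraction(start, end, length_histogram)
field = tokens[i:j]
```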
BWI: Learning to detect boundaries
• BWI uses boosting to find “detectors” for
START and END
• Each weak detector has a BEFORE and
AFTER pattern (on tokens before/after
position i).
• Each “pattern” is a sequence of
– tokens and/or
– wildcards like: anyAlphabeticToken, anyNumber, …
• Weak learner for “patterns” uses greedy
search (+ lookahead) to repeatedly extend a
pair of empty BEFORE,AFTER patterns
Slides from Cohen & McCallum
BWI: Learning to detect boundaries
Field          F1
Person Name:   30%
Location:      61%
Start Time:    98%
Slides from Cohen & McCallum
Problems with Sliding Windows
and Boundary Finders
• Decisions in neighboring parts of the input
are made independently from each other.
– Naïve Bayes Sliding Window may predict a
“seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both
be above threshold.
– In a Boundary-Finding system, left boundaries are
laid down independently from right boundaries,
and their pairing happens as a separate step.
Solution? Joint inference…
Slides from Cohen & McCallum
More Ambitious (Blue Sky) Approaches
• The information extraction tasks in fielded applications like
Citeseer/Libra are narrowly focused
– We assume that we are learning specific relations (e.g. author/title etc)
– We assume that the extracted relations will be put in a database for db-style look-up
• Semantic web needs:
– Tagged data
– Background knowledge
• (blue sky approaches to) automate both
– Knowledge Extraction
• Extract base level knowledge (“facts”) directly from the web
– Automated tagging
• Start with a background ontology and tag other web pages
– Semtag/Seeker
Let’s look at state of the feasible art
before going to blue-sky..
82
Extraction from Free Text involves
Natural Language Processing
• If extracting from automatically generated web
pages, simple regex patterns usually work.
• If extracting from more natural, unstructured,
human-written text, some NLP may help.
– Part-of-speech (POS) tagging
• Mark each word as a noun, verb, preposition, etc.
– Syntactic parsing
• Identify phrases: NP, VP, PP
– Semantic word categories (e.g. from WordNet)
• KILL: kill, murder, assassinate, strangle, suffocate
• Off-the-shelf software available to do this!
– The “Brill” tagger
• Extraction patterns can use POS or phrase tags.
Analogy to regex patterns on DOM trees for structured text
83
I. Generate-n-Test Architecture
Generic extraction patterns (Hearst ’92):
• “…Cities such as Boston, Los Angeles, and Seattle…”
(“C such as NP1, NP2, and NP3”) =>
IS-A(each(head(NP)), C), …
• “Detailed information for several countries such as maps, …” ProperNoun(head(NP))
• “I listen to pretty much all music but prefer country such as Garth Brooks”
Template-Driven Extraction (where the template is in terms of the syntax tree)
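The generate step for the “C such as NP1, NP2, and NP3” pattern could be sketched roughly as follows. This is an illustration only: a run of capitalized words stands in for a real parser's ProperNoun(head(NP)) check, and the names PROPER, PATTERN and hearst_extract are hypothetical.

```python
import re

# Assumption: a run of capitalized words approximates a proper-noun NP head.
PROPER = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
# Matches "C such as NP1, NP2, and NP3"; group 1 is the class, group 2 the NP list.
PATTERN = re.compile(rf"(\w+) such as ({PROPER}(?:,? (?:and )?{PROPER})*)")

def hearst_extract(text):
    """Return candidate IS-A facts as (instance, class) pairs found in text."""
    facts = []
    for match in PATTERN.finditer(text):
        cls = match.group(1)
        for instance in re.findall(PROPER, match.group(2)):
            facts.append((instance, cls))
    return facts

print(hearst_extract("Cities such as Boston, Los Angeles, and Seattle"))
print(hearst_extract("Detailed information for several countries such as maps"))
print(hearst_extract("I prefer country such as Garth Brooks"))
```

The “maps” sentence yields nothing because of the proper-noun restriction, while the Garth Brooks sentence still produces a wrong candidate, which is why a separate test step is needed.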
84
Test
Assess candidate extractions using Mutual
Information (PMI-IR) (Turney ’01).
$$\mathrm{PMI}(\text{Seattle}, \text{City}) = \frac{|\mathrm{Hits}(\text{Seattle} + \text{City})|}{|\mathrm{Hits}(\text{Seattle})|}$$
Many variations are possible…
85
..but many things indicate “city”ness
Discriminator phrases fi :
“x is a city”
“x has a population of”
“x is the capital of y”
“x’s baseball team…”
$$\mathrm{PMI}(I, D) = \frac{|\mathrm{Hits}(I + D)|}{|\mathrm{Hits}(I)|}$$
• PMI = frequency of I & D co-occurrence
• 5-50 discriminators D_i
• Each PMI for D_i is a feature f_i
• Naïve Bayes evidence combination:
$$P(\phi \mid f_1, f_2, \ldots, f_n) = \frac{P(\phi) \prod_i P(f_i \mid \phi)}{P(\phi) \prod_i P(f_i \mid \phi) + P(\neg\phi) \prod_i P(f_i \mid \neg\phi)}$$
Keep the probabilities with the extracted facts
PMI is used for feature selection. NBC is used for learning. Hits used for assessing PMI as well as conditional probabilities
86
Assessment In Action
1. I = “Yakima” (1,340,000)
2. D = <class name>
3. I+D = “Yakima city” (2760)
4. PMI = (2760 / 1.34M) ≈ 0.002
• I = “Avocado” (1,000,000)
• I+D = “Avocado city” (10)
PMI = 0.00001 << 0.002
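As a quick check of the arithmetic, the hit-count ratio can be computed directly from the counts quoted on this slide. The function name pmi is only illustrative, and the counts are taken as given rather than fetched from a search engine.

```python
def pmi(hits_instance_and_discriminator, hits_instance):
    """Hit-count ratio used on the slides: |Hits(I + D)| / |Hits(I)|."""
    return hits_instance_and_discriminator / hits_instance

yakima = pmi(2760, 1_340_000)   # "Yakima city" vs. "Yakima"
avocado = pmi(10, 1_000_000)    # "Avocado city" vs. "Avocado"

print(f"Yakima:  {yakima:.5f}")
print(f"Avocado: {avocado:.5f}")
print("Yakima scores higher:", yakima > avocado)
```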
87
Some Sources of ambiguity
• Time: “Clinton is the president” (in 1996).
• Context: “common misconceptions..”
• Opinion: Elvis…
• Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
– Dominant senses can mask recessive ones!
– Approach: unmasking. ‘Chicago –City’
88
[Figure: Chicago, with senses City and Movie]
$$\mathrm{PMI}(I, D, C) = \frac{|\mathrm{Hits}(I + D \mid C)|}{|\mathrm{Hits}(I \mid C)|}$$
89
Chicago Unmasked
[Figure: City sense vs. Movie sense]
$$\frac{|\mathrm{Hits}(\text{Chicago} + \text{Movie} - \text{City})|}{|\mathrm{Hits}(\text{Chicago} - \text{City})|}$$
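To see why excluding the dominant sense helps, here is a toy sketch of unmasking. The six-document corpus, the hits helper and its exclude parameter are all hypothetical stand-ins for search-engine queries such as ‘Chicago –City’.

```python
# Hypothetical toy corpus: each document is reduced to a set of words.
DOCS = [
    {"chicago", "city", "illinois"},
    {"chicago", "city", "mayor"},
    {"chicago", "city", "bulls"},
    {"chicago", "city", "weather"},
    {"chicago", "movie", "musical"},
    {"chicago", "band", "album"},
]

def hits(include, exclude=()):
    """Count documents containing every term in include and none in exclude."""
    include, exclude = set(include), set(exclude)
    return sum(1 for doc in DOCS if include <= doc and not (exclude & doc))

# Original score: the dominant "city" sense swamps the "movie" sense.
original = hits({"chicago", "movie"}) / hits({"chicago"})

# Unmasked score: drop every document that mentions the dominant sense.
unmasked = (hits({"chicago", "movie"}, exclude={"city"})
            / hits({"chicago"}, exclude={"city"}))

print(f"original={original:.2f} unmasked={unmasked:.2f}")
```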
90
Impact of Unmasking on PMI
Name         Recessive  Original  Unmask  Boost
Washington   city       0.50      0.99     96%
Casablanca   city       0.41      0.93    127%
Chevy Chase  actor      0.09      0.58    512%
Chicago      movie      0.02      0.21    972%
91
CBioC: Collaborative BioCuration
• Motivation
– To help extract information nuggets from articles and abstracts and store them in a database.
– The challenge is that the number of articles is huge and keeps growing, and that they need natural language processing.
• The two existing approaches
– Human curation and use of automatic information extraction systems
– They are not able to meet the challenge, as the first is expensive, while the second is error-prone.
92
CBioC (cont’d)
• Approach: We propose a solution that is inexpensive, and that scales up.
– Our approach takes advantage of automatic information extraction methods as a starting point,
– Based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles.
– We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information.
– We refer to our approach as “Collaborative Curation”.
93
Using the C-BioCurator System
(cont’d)
[Figure: Collaborative Bio-Curation System architecture. Labels shown: Pubmed, Nature, Science, ...; DIP, BioPax, Reactome, ...; Download Agent (twice); TextDB; ExistingDB; Extractor Systems; Data Format Exchange System; CBioC Database; CBioC Interface; Browse Facts; Vote Facts; Add/Modify New Facts; Add New Schema; Invoke IntExtractor; User Management]
94
What is the main difference between KnowItAll and CBioC?
Assessment: KnowItAll does it by search-engine hit counts (Hits); CBioC does it by user voting.
Annotation
“The Chicago Bulls announced yesterday that Michael Jordan will...”
“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...”
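One way such markup might be produced, once the entity mentions and their ontology URIs are already known, is sketched below. The annotate function and its label-to-URI dictionary are hypothetical; only the two TAP URIs come from the slide, and the hard part (deciding which URI a mention refers to) is assumed to be solved elsewhere.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def annotate(text, label_to_uri):
    """Wrap each known label in a <resource ref="..."> element."""
    # Try longer labels first so they win over any shorter label they contain.
    labels = sorted(label_to_uri, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))
    pieces, position = [], 0
    for match in pattern.finditer(text):
        label = match.group(0)
        pieces.append(escape(text[position:match.start()]))
        pieces.append(
            f"<resource ref={quoteattr(label_to_uri[label])}>{escape(label)}</resource>"
        )
        position = match.end()
    pieces.append(escape(text[position:]))
    return "".join(pieces)

TAP = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
}
print(annotate("The Chicago Bulls announced yesterday that Michael Jordan will...", TAP))
```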
96
Semantic Annotation
Named Entity Identification
The simplest task of metadata
extraction on NL is to establish the “type”
relation between entities in the NL
resources and concepts in ontologies.
97
Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt
Semantics
• Semantic Annotation
- The content of annotation consists of some rich
semantic information
- Targeted not only at human readers of resources
but also at software agents
- formal : metadata following structural standards
informal : personal notes written in the margin while
reading an article
- explicit : carry sufficient information for interpretation
tacit : many personal annotations (telegraphic and
incomplete)
http://www-scf.usc.edu/~csci586/slides/6
98
Uses of Annotation
99
http://www-scf.usc.edu/~csci586/slides/8
Objectives of Annotation
• Generate Metadata for existing information
– e.g., author-tag in HTML
– RDF descriptions to HTML
– Content description to Multimedia files
• Employ metadata for
– Improved search
– Navigation
– Presentation
– Summarization of contents
http://www.aifb.unikarlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf
100
Annotation
Current practice of annotation for knowledge identification and extraction
• is time consuming
• needs annotation by experts
• is complex
=> Reduce burden of text annotation for Knowledge Management
101
www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt
SemTag & Seeker
• WWW-03 Best Paper Prize
• Seeded with TAP ontology (72k concepts)
– And ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags
– Automatically disambiguated
SemTag
• Uses broad, shallow knowledge base
• TAP – lexical and taxonomic information
about popular objects
– Music
– Movies
– Sports
– Etc.
104
SemTag
• Problem:
– No write access to original document, so how
do you annotate?
• Solution:
– Store annotations in a web-available
database
105
SemTag
• Semantic Label Bureau
– Separate store of semantic annotation
information
– HTTP server that can be queried for
annotation information
– Example
• Find all semantic tags for a given document
• Find all semantic tags for a particular object
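The two example queries suggest a store indexed both by document and by object. A minimal in-memory sketch is given below; the class LabelBureau, its method names and the example.org URL are hypothetical and do not reflect SemTag's actual interface, which is served over HTTP.

```python
from collections import defaultdict

class LabelBureau:
    """In-memory stand-in for a separate store of semantic annotations."""

    def __init__(self):
        self._by_document = defaultdict(list)
        self._by_object = defaultdict(list)

    def add_tag(self, document_url, text_span, object_uri):
        """Record that text_span in document_url refers to object_uri."""
        tag = (document_url, text_span, object_uri)
        self._by_document[document_url].append(tag)
        self._by_object[object_uri].append(tag)

    def tags_for_document(self, document_url):
        """Find all semantic tags for a given document."""
        return list(self._by_document.get(document_url, []))

    def tags_for_object(self, object_uri):
        """Find all semantic tags for a particular object."""
        return list(self._by_object.get(object_uri, []))

bureau = LabelBureau()
bureau.add_tag("http://example.org/news.html", "Chicago Bulls",
               "http://tap.stanford.edu/BasketballTeam_Bulls")
print(bureau.tags_for_document("http://example.org/news.html"))
```

Because the annotations live outside the page, no write access to the original document is needed.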
106
SemTag
• Methodology
107
SemTag
• Three phases
1. Spotting Pass:
– Tokenize the document
– All instances plus 20 word window
2. Learning Pass:
– Find corpus-wide distribution of terms at each internal node of taxonomy
– Based on a representative sample
3. Tagging Pass:
– Scan windows to disambiguate each reference
– Finally determined to be a TAP object
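The spotting pass could look roughly like the sketch below. It assumes single-word labels, whitespace tokenization, and a 20-word window split as 10 tokens on each side of the spot; the function spot and its parameters are hypothetical.

```python
def spot(text, labels, half_window=10):
    """Return (label, context_tokens) for every occurrence of a known label.

    Assumptions: labels are single lower-case words, and the 20-word window
    is taken as half_window tokens on each side of the spot.
    """
    tokens = text.split()
    normalized = [token.lower().strip('.,;:!?"') for token in tokens]
    spots = []
    for index, token in enumerate(normalized):
        if token in labels:
            start = max(0, index - half_window)
            context = tokens[start:index] + tokens[index + 1:index + 1 + half_window]
            spots.append((token, context))
    return spots

print(spot("The new Jaguar sedan has a quiet engine.", {"jaguar"}, half_window=2))
```

Each (label, context) pair is what the later passes would use to decide which taxonomy node, if any, the label refers to.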
108
SemTag
• Solution:
– Taxonomy Based Disambiguation (TBD)
• TBD expectation:
– Human tuned parameters used in small, critical sections
– Automated approaches deal with bulk of information
110
SemTag
• TBD methodology:
– Each node in the taxonomy is associated with
a set of labels
• Cats, Football, Cars all contain “jaguar”
– Each label in the text is stored with a window
of 20 words – the context
– Each node has an associated similarity
function mapping a context to a similarity
• Higher similarity => more likely to contain a
reference
111
SemTag
• Similarity:
– Built a 200,000 word lexicon (the 200,100 most
common words minus the 100 most common)
– 200,000 dimensional vector space
– Training: spots (label, context) and correct
node
– Estimated the distribution of terms for nodes
– Standard cosine similarity
– TFIDF vectors (context vs. node)
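The context-versus-node comparison can be illustrated with a toy TF-IDF and cosine-similarity sketch. Everything here is made up for illustration: three tiny node profiles replace the 200,000-word lexicon, the IDF is smoothed and computed over those three profiles only, and best_node simply picks the most similar node rather than running the full TBD algorithm.

```python
import math
from collections import Counter

def cosine(u, v):
    """Standard cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tfidf(tokens, idf):
    """Term frequency times IDF; terms outside the lexicon get weight 0."""
    counts = Counter(tokens)
    return {term: counts[term] * idf.get(term, 0.0) for term in counts}

# Hypothetical training text for three taxonomy nodes labeled "jaguar".
NODE_TEXT = {
    "Cats": "jaguar cat jungle prey predator habitat".split(),
    "Cars": "jaguar car engine luxury sedan dealer".split(),
    "Football": "jaguar team quarterback season touchdown stadium".split(),
}
DOC_FREQ = Counter(t for tokens in NODE_TEXT.values() for t in set(tokens))
IDF = {t: math.log(len(NODE_TEXT) / df) + 1.0 for t, df in DOC_FREQ.items()}
NODE_VECTORS = {node: tfidf(tokens, IDF) for node, tokens in NODE_TEXT.items()}

def best_node(context):
    """Return the node whose term profile is most similar to the context."""
    vector = tfidf(context.lower().split(), IDF)
    return max(NODE_VECTORS, key=lambda node: cosine(vector, NODE_VECTORS[node]))

print(best_node("the jaguar engine roared as the sedan left the dealer"))
```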
112
SemTag
• Some internal nodes very popular:
– Associate a measurement of how accurate
Sim is likely to be at a node
– Also, how ambiguous the node is overall
(consistency of human judgment)
• TBD Algorithm: returns 1 or 0 to indicate
whether a particular context c is on topic
for a node v
• 82% accuracy on 434 million spots
114
Summary
• Information extraction can be motivated either as
explicating more structure from the data or as an
automated route to the Semantic Web
• Extraction complexity depends on whether the
text you have is “templated” or “free-form”
– Extraction from templated text can be done by regular
expressions
– Extraction from free form text requires NLP
• Can be done in terms of parts-of-speech-tagging
• “Annotation” involves connecting terms in a free
form text to items in the background knowledge
– It too can be automated
116
