
Information Extraction

Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing.

He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies.

Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.” Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.

Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.” “Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.” He turned to face the machine. “Is there a God?” The mighty voice answered without hesitation, without the clicking of a single relay.

“Yes, now there is a god.” Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch. A bolt of lightning from the cloudless sky struck him down and fused the switch shut.

‘Answer’ by Fredric Brown.

©1954, Angels and Spaceships

Information Extraction

• What is covered?
– What is information extraction?
• “(ML Approaches to) Extracting Structured Information from Text”
• “Learning How to Turn Words into Data”
– Applications:
• Web info extraction: building catalogs, directories, etc. from web sites
• Biotext info extraction: extracting facts like regulates(CDC23, TNF-1b)
• …
– Techniques:
• Named entity recognition: finding names
• Question-answering: answering questions like “who invented the light bulb?” from text
• …
• Graphical models for classifying sequences of tokens
• Extracting facts (aka events, relationships) – classifying pairs of extractions
• Normalizing extracted data – classifying pairs of extractions
• Semi- and unsupervised approaches to finding information in large corpora (aka bootstrapping – “read the web”-like techniques)
• Today:
– Admin, motivation
– A brief overview of IE, and a less brief overview of named entity recognition

Motivation: Why bother with IE?


Some observations

• In the distant future:
– Complex AI systems are completed by ceremonially soldering the final connection, not ceremonially compiling the last Java class
– Performance is monitored by clicking relays
– A “lightning-from-a-cloudless-sky” peripheral exists
• Writing and debugging device drivers is a dangerous and highly skilled profession
– Question-answering interfaces are still in use
• Natural-language query in, answer out
– Answering (some) complex questions requires combining information from many different places
• With different parts contributed by different people?

Two ways to manage information

“ceremonial soldering”

Two pictures, side by side:

1. Retrieval: a query goes in over a collection of documents, and matching documents come back as the answer.

2. Inference: a structured query such as

X: advisor(wc,X) & affil(X,lti) ?

is answered by inference (an AND over the conjuncts) against facts extracted from the documents, e.g.

advisor(wc,nl)   advisor(yh,tm)   affil(wc,mld)   affil(vc,nl)
name(wc, “William Cohen”)   name(nl, “Ni Lao”)

yielding the answer {X=em; X=nl}.
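Once facts like these sit in a structured store, the inference side of the picture is mechanical. Below is a minimal illustrative sketch: the fact tuples and predicate/constant names are taken from the slide, but the tiny evaluator itself is an assumption of mine, not part of the lecture.

```python
# Toy fact store and one-conjunct query evaluation over the slide's facts.
# The evaluator is an illustrative sketch, not part of the original slides.
facts = {
    ("advisor", "wc", "nl"),
    ("advisor", "yh", "tm"),
    ("affil", "wc", "mld"),
    ("affil", "vc", "nl"),
    ("name", "wc", "William Cohen"),
    ("name", "nl", "Ni Lao"),
}

def match(pred, a=None, b=None):
    """All (arg1, arg2) pairs of predicate `pred`; None means unbound."""
    return [(x, y) for (p, x, y) in facts
            if p == pred and a in (None, x) and b in (None, y)]

# Bindings for X in the single conjunct advisor(wc, X):
answers = [x for (_, x) in match("advisor", a="wc")]
```

A full conjunctive query like advisor(wc,X) & affil(X,lti) would intersect the bindings produced by each conjunct.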

Some observations

• Using computers to combine information from multiple places is and has been important…

Some observations

• Using computers to merge information is and has been important…
– Data cleaning and integration, record linkage, …
– Standards for data exchange:
• KQML, KIF, DAML+OIL, …
• Semantic web: N3Logic, OWL, …
– Friend-of-a-friend, GeneOntology, …

– Growth from 456 OWL ontologies in 2004 to 14,600 in 2007
• Number of web pages estimated at 11.5B as of early 2006
– #webPages/#ontologies =~ 1,000,000 ?
– #webSites/#ontologies =~ 10,000 ?
– It seems to be much easier to generate sharable text than to generate sharable knowledge.
– A lot of accessible knowledge is only accessible in text

How do you extract information?

[Cohen / McCallum tutorial, NIPS 2002, KDD 2003, …] [Some pilfering from Tom Mitchell’s invited talks]

What is “Information Extraction”? As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME | TITLE | ORGANIZATION

What is “Information Extraction”? As a task: filling slots in a database from sub-segments of text. Running IE over the passage above fills the table:

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Software Foundation

The filled database then feeds downstream consumers: QA systems and end users.

What is “Information Extraction”? As a family of techniques:

Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Segmentation (aka “named entity extraction”) marks up the substrings: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation

The remaining steps operate on those segments: classification assigns each segment a type (name, title, organization), association groups segments that belong to the same fact (e.g. Bill Veghte / VP / Microsoft), and clustering merges coreferent mentions (e.g. Bill Gates and Gates).

Example: Finding Jobs Ads on the Web

(Figure: web search results mixing “Martin Baker, a person” with a genomics job, plus employers’ job-posting forms.)

Example: A Solution

Extracting Job Openings from the Web

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

Data Mining the Extracted Job Information

Notice that we get something useful from just identifying the person names and then doing some counting and trending.

Sunita’s Breakdown of IE

• What’s the end goal (application?) • What’s the input (corpus)? How is it preprocessed? How is output postprocessed (to make querying easier)?

• What structure is extracted?

– Entity names? (“William Cohen”, “Anthony ‘Van’ Jones”)

– Relationships between entities? (“Richard Wang” studentOf “William Cohen”)

– Features/properties/adjectives describing entities? (“iPhone 3G”: “expensive service plan”, “color screen”)

• What (learning) methods are used?

Landscape of IE Tasks (1/4): Degree of Formatting

Text paragraphs without formatting:

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Other points on the spectrum: grammatical sentences with some formatting & links; non-grammatical snippets with rich formatting & links; tables.

Landscape of IE Tasks (2/4): Intended Breadth of Coverage

Web site specific (formatting): Amazon.com book pages
Genre specific (layout): resumes
Wide, non-specific (language): university names

Landscape of IE Tasks (3/4): Complexity of extraction task

E.g. word patterns, from easy to hard:

• Closed set (U.S. states):
“He was born in Alabama…” / “The big Wyoming sky…”

• Regular set (U.S. phone numbers):
“Phone: (413) 545-1323” / “The CALD main office can be reached at 412-268-1299”

• Complex pattern (U.S. postal addresses):
“University of Arkansas, P.O. Box 140, Hope, AR 71802” / “Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210”

• Ambiguous patterns, needing context and many sources of evidence (person names):
“…was among the six houses sold by Hope Feldman that year.” / “Pawel Opalinski, Software Engineer at WhizBang Labs.”

Landscape of IE Tasks (4/4): Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

• Single entity:
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

• Binary relationship:
Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut

• N-ary record:
Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

“Named entity” extraction

A little more depth on named entity recognition (NER)

Models for NER

• Lexicons:
Abraham Lincoln was born in Kentucky.
Look each candidate phrase up in a list (Alabama, Alaska, …, Wisconsin, Wyoming): member?

• Classify pre-segmented candidates:
Abraham Lincoln was born in Kentucky.
A classifier asks, for each candidate: which class?

• Sliding window:
Abraham Lincoln was born in Kentucky.
A classifier asks “which class?” for each window; try alternate window sizes.

• Boundary models:
Abraham Lincoln was born in Kentucky.
Classifiers predict BEGIN and END boundaries, then ask “which class?” for each candidate span.

• Token tagging:
Abraham Lincoln was born in Kentucky.
Find the most likely state sequence. This is often treated as a structured prediction problem: classifying tokens sequentially with HMMs, CRFs, …

Sliding Windows

Extraction by Sliding Window

E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement


A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
(prefix w_{t-m} … w_{t-1}) (contents w_t … w_{t+n}) (suffix w_{t+n+1} … w_{t+n+m})

• Estimate Pr(LOCATION | window) using Bayes rule
• Try all “reasonable” windows (vary length, position)
• Assume independence for length, prefix words, suffix words, content words
• Estimate from data quantities like Pr(“Place” in prefix | LOCATION)

If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.

A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
(prefix w_{t-m} … w_{t-1}) (contents w_t … w_{t+n}) (suffix w_{t+n+1} … w_{t+n+m})

1. Create a dataset of examples like these:
+ (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …)
- (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …)
…
2. Train a Naive Bayes classifier (or YFCL), treating the examples like bags of words for text classification.
3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict the content window is a location.

To think about: what if the extracted entities aren’t consistent, e.g. if the location overlaps with the speaker?
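The recipe above can be sketched in a few lines. This is a minimal illustrative implementation, not Freitag's original system; the flattened feature naming (prefixX / contentX / suffixX as strings) and the Laplace smoothing constant are my assumptions.

```python
from collections import Counter
from math import log

def features(tokens, start, end, k=2):
    """Bag-of-words features for the window tokens[start:end],
    with up to k prefix and k suffix context tokens."""
    return (["prefix=" + w for w in tokens[max(0, start - k):start]]
            + ["content=" + w for w in tokens[start:end]]
            + ["suffix=" + w for w in tokens[end:end + k]])

class NBWindow:
    """Naive Bayes over window features (an illustrative sketch)."""
    def __init__(self, alpha=1.0):
        self.counts = {"+": Counter(), "-": Counter()}  # feature counts per class
        self.alpha = alpha                              # Laplace smoothing

    def train(self, examples):
        # examples: list of (feature list, "+" or "-") pairs
        for feats, label in examples:
            self.counts[label].update(feats)

    def loglik(self, feats, label):
        # log Pr(features | class) under the independence assumption
        c = self.counts[label]
        n = sum(c.values())
        vocab = len(set(self.counts["+"]) | set(self.counts["-"]))
        return sum(log((c[f] + self.alpha) / (n + self.alpha * vocab))
                   for f in feats)
```

At extraction time, score every “reasonable” window with loglik(feats, "+") plus a class log-prior and keep the windows that beat the threshold.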

“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements


Field        F1
Person Name: 30%
Location:    61%
Start Time:  98%

Token Tagging

NER by tagging tokens

Given a sentence:

Yesterday Pedro Domingos flew to New York.

1) Break the sentence into tokens, and classify each token with a label indicating what sort of entity it’s part of: person name, location name, or background.

Yesterday Pedro Domingos flew to New York

2) Identify names based on the entity labels:

Person name: Pedro Domingos
Location name: New York

3) To learn an NER system, use YFCL.

NER by tagging tokens

• Similar labels tend to cluster together in text:
Yesterday Pedro Domingos flew to New York
(background, person name, person name, background, background, location name, location name)

• Another common labeling scheme is BIO (begin, inside, outside): e.g. beginPerson, insidePerson, beginLocation, insideLocation, outside.

• BIO also leads to strong dependencies between nearby labels (e.g. inside follows begin).
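Step 2 of the tagging recipe, recovering names from per-token labels, is a small deterministic decode. A sketch for the BIO scheme; the `B-Person` / `I-Person` tag spellings are illustrative, not prescribed by the lecture.

```python
# Sketch: recover typed entity spans from BIO tags.
def bio_to_spans(tokens, tags):
    """tags like 'B-Person', 'I-Person', 'O' -> list of (type, phrase)."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [tag[2:], [tok]]       # start a new entity
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)           # continue the open entity
        else:
            current = None                   # 'O' or an inconsistent 'I-' tag
    return [(t, " ".join(ws)) for t, ws in spans]
```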

NER with Hidden Markov Models

Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM:

person name, location name, background

Find the most likely state sequence (Viterbi):

argmax_s Pr(s, o)

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos

HMM for Segmentation of Addresses

• Simplest HMM architecture: one state per entity type
• Each state emits words with some probability, e.g. one state emits CA 0.15, NY 0.11, PA 0.08, …, and another emits Hall 0.15, Wean 0.03, N-S 0.02, …

[Pilfered from Sunita Sarawagi, IIT/Bombay]

HMMs for Information Extraction

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …

• The HMM consists of two probability tables:
1. Pr(currentState=s | previousState=t), for s = background, location, speaker, …
2. Pr(currentWord=w | currentState=s), for s = background, location, …
• Estimate these tables as (smoothed) CPTs, e.g.
Prob(location | location) = #(loc->loc) / #(loc->*) transitions
• Given a new sentence, find the most likely sequence of hidden states using the Viterbi method:
MaxProb(curr=s | position=k) = max over states t of MaxProb(curr=t | position=k-1) * Prob(word=w_{k-1} | t) * Prob(curr=s | prev=t)
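The Viterbi recurrence above can be written down directly. A minimal sketch in log space; giving unseen words a small floor probability is my stand-in for proper smoothing, and the table layout (dicts of dicts) is an illustrative choice.

```python
from math import log

def viterbi(words, states, start, trans, emit):
    """Most likely state sequence under Pr(s, o); tables are dicts.
    start[s], trans[t][s], emit[s][w] are probabilities."""
    # best[s] = (log prob of the best path ending in state s, that path)
    best = {s: (log(start[s]) + log(emit[s].get(words[0], 1e-6)), [s])
            for s in states}
    for w in words[1:]:
        new = {}
        for s in states:
            # maximize over the previous state t, as in the recurrence
            p, path = max(
                (best[t][0] + log(trans[t][s]) + log(emit[s].get(w, 1e-6)),
                 best[t][1])
                for t in states)
            new[s] = (p, path + [s])
        best = new
    return max(best.values())[1]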

“Naïve Bayes” Sliding Window vs HMMs

Domain: CMU UseNet Seminar Announcements


Sliding window:        HMM:
Field        F1        Field        F1
Speaker:     30%       Speaker:     77%
Location:    61%       Location:    79%
Start Time:  98%       Start Time:  98%

What is a “symbol”?

Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
5317 => “5317”, “9999”, “9+”, “number”, … ?

One abstraction hierarchy over symbols:

All
  Numbers
    3-digits: 000..999
    5-digits: 00000..99999
    Others: 0..99, 0000..9999, 000000.., …
  Words
    Chars: A..z
    Multi-letter: aa..
  Delimiters: . , / - + ? #

Datamold: choose the best abstraction level using a holdout set.
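The word-shape abstractions above (Cohen => “Xxxxx” or “Xx”, 5317 => “9999”) are easy to compute. A sketch; the exact abstraction alphabet is a design choice, and collapsing repeated characters here yields “Xx” and “9” rather than the slide's “9+” spelling.

```python
import re

def shape(word):
    """Full word shape: uppercase -> X, lowercase -> x, digit -> 9."""
    return re.sub(r"[A-Z]", "X",
           re.sub(r"[a-z]", "x",
           re.sub(r"[0-9]", "9", word)))

def brief(word):
    """Collapsed shape: runs of the same shape character become one."""
    return re.sub(r"(.)\1*", r"\1", shape(word))
```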

HMM Example: “Nymble”

Task: Named Entity Extraction

[Bikel, et al 1998], [BBN “IdentiFinder”]

States: start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence.

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).

Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Train on ~500k words of news wire text.

Results:
Case   Language  F1
Mixed  English   93%
Upper  English   91%
Mixed  Spanish   90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

What is a symbol?

Bikel et al mix symbols from two abstraction levels.

What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words, e.g. for the token “Wisniewski”:

• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

What is a symbol?

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:

Pr(s_t | x_t) = …

What is a symbol?

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:

Pr(s_t | x_t, s_{t-1}) = …

What is a symbol?

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history:

Pr(s_t | x_t, s_{t-1}, s_{t-2}, …) = …

Ratnaparkhi’s MXPOST

• Sequential learning problem: predict POS tags of words.

• Uses the MaxEnt model described above.

• Rich feature set.

• To smooth, discard features occurring < 10 times.
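A “rich feature set” for a maxent tagger boils down to a function from a position (and the previous tag) to named features. An illustrative sketch in the spirit of MXPOST; these particular feature templates are my assumptions, not Ratnaparkhi's actual templates.

```python
# Sketch: per-position features for a maxent (logistic regression) tagger.
# Feature names are illustrative.
def token_features(words, i, prev_tag):
    w = words[i]
    return {
        "word=" + w.lower(): 1,             # word identity
        "suffix3=" + w[-3:]: 1,             # short suffix, e.g. "-ski"
        "is_capitalized": int(w[:1].isupper()),
        "prev_tag=" + prev_tag: 1,          # the Markov dependency
    }
```

Each position's feature dict feeds the classifier; features seen fewer than 10 times would be discarded at training time, as on the slide.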

Conditional Markov Models (CMMs), aka MEMMs, aka Maxent Taggers, vs HMMs

HMM (generative; observations depend on states):

Pr(s, o) = Π_i Pr(s_i | s_{i-1}) * Pr(o_i | s_i)

CMM (conditional; states depend on observations):

Pr(s | o) = Π_i Pr(s_i | s_{i-1}, o_i)

HMMs vs MEMMs vs CRFs

(Figure comparing the graphical models of the HMM, MEMM, and CRF.)

Some things to think about

• We’ve seen sliding windows, non-sequential token tagging, and sequential token tagging.

– Which of these are likely to work best, and when?

– Are there other ways to formulate NER as a learning task?

– Is there a benefit from using more complex graphical models? What potentially useful information does a linear chain CRF not capture?

– Can you combine sliding windows with a sequential model?

• Next lecture will survey IE of sets of related entities (e.g., a person and his/her affiliation).

– How can you formalize that as a learning task?