
Practical Natural Language Processing
Catherine Havasi
Luminoso / MIT Media Lab
[email protected]
An example of running text:
“Christine C. Quinn, the New York City Council speaker, released a video and planned to visit all five boroughs on Sunday as she officially began her campaign…”
There are notes!
luminoso.com/blog
Too much text?
Wouldn’t it be cool if we could talk to a computer?
This is hard.
It takes a lot of knowledge to understand language.
I made her duck.
• I cooked waterfowl for her benefit (to eat)
• I cooked waterfowl belonging to her
• I created the (plaster?) duck she owns
• I made sure she got her head down
• I waved my magic wand and turned her into undifferentiated waterfowl
Language is Recursive
• You can build new concepts out of old ones indefinitely
Language is Creative
“It smelled terrible.”
“It was really stuffy.”
“It was like it had been shut away for a long time.”
“Smells like an old house.”
“Was like a wet dog.”
“Smelled really musty.”
“Reminds me of a dusty closet.”
“Really stale.”
A multi-lingual world
Linguistics to the rescue?
--Randall Munroe, xkcd.org/114
“Much Debate”
We just want to get things done.
So, what is state of the art?
The NLP process
• Take in a string of language
• Where are the words?
• What are the root forms of these words?
• How do the words fit together?
• Which words look important?
• What decisions should we make based on these words?
The NLP process (simplified)
• Fake understanding
• Until you make understanding
Example: Detecting bad words
• You want to flag content with certain bad words in it
• Don’t just match sequences of characters
• That would lead to this classic mistake
Many forms of fowl language
• Suppose we want people not to say the word “duck”
“What the duck’s wrong with this”
“It’s all ducked up”
“Un-ducking-believable”
Step 1: break text into tokens
it
’s
all
ducked
up
un
ducking
believable
Step 2: replace tokens with their root forms
it → it
’s → is
all → all
ducked → duck
up → up
un → un
ducking → duck
believable → believe
In a few lines of Python:
>>> import nltk
>>> text = "It's all ducked up. Un-ducking-believable."
>>> tokens = nltk.wordpunct_tokenize(text.lower())
>>> tokens
['it', "'", 's', 'all', 'ducked', 'up', '.', 'un', '-', 'ducking', '-', 'believable', '.']
>>> stemmer = nltk.stem.PorterStemmer()
>>> [stemmer.stem(token) for token in tokens]
['it', "'", 's', 'all', 'duck', 'up', '.', 'un', '-', 'duck', '-', 'believ', '.']
Stemmers can spell things oddly
• duck → duck
• ducking → duck
• believe → believ
• believable → believ
• happy → happi
• happiness → happi
Stemmers can mix up some words
• sincere → sincer
• sincerity → sincer
• universe → univers
• university → univers
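These collisions are easy to reproduce. A minimal sketch with NLTK’s PorterStemmer (the outputs shown are what the slides claim; exact spellings can vary slightly between NLTK versions):

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> # words with different meanings can collapse to the same stem
>>> [stemmer.stem(w) for w in ['sincere', 'sincerity', 'universe', 'university']]
['sincer', 'sincer', 'univers', 'univers']

If you need dictionary spellings and fewer collisions, a lemmatizer such as nltk.stem.WordNetLemmatizer is the usual alternative to a stemmer.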
The NLP tool chain
• Some source of text (a database, a labeled corpus, Web scraping, Twitter...)
• Tokenizer: breaks text into word-like things
• Stemmer: finds words with the same root
• Tagger: identifies parts of speech
• Chunker: identifies key phrases
• Something that makes decisions based on these results (see the sketch below)
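A minimal sketch of the first few links in that chain using NLTK (the sentence is made up, the tokenizer and tagger models need a one-time nltk.download, and the output shown is approximate):

>>> import nltk
>>> # one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
>>> tokens = nltk.word_tokenize("The ducks were swimming quickly.")   # tokenizer
>>> nltk.pos_tag(tokens)                                              # tagger
[('The', 'DT'), ('ducks', 'NNS'), ('were', 'VBD'), ('swimming', 'VBG'), ('quickly', 'RB'), ('.', '.')]
>>> stemmer = nltk.stem.PorterStemmer()                               # stemmer
>>> [stemmer.stem(t) for t in tokens]
['the', 'duck', 'were', 'swim', 'quickli', '.']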
Useful toolkits
• NLTK (Python)
• LingPipe (Java)
• Stanford Core NLP (Java; many wrappers)
• FreeLing (C++)
The statistics of text
• Often we want to understand the differences between different categories of text
– Different genres
– Different writers
– Different forms of writing
Collecting word counts
• Start with a corpus of text
• Brown corpus (1961)
• British National Corpus (1993)
• Google Books (2009, 2012)
Collecting word counts
>>> import nltk
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> counts = Counter(brown.words())
>>> counts.most_common()[:20]
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080),
('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536),
('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841),
('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012),
('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]
Collecting word counts
>>> for category in brown.categories():
...     frequency = Counter(brown.words(categories=category))
...     for word in frequency:
...         frequency[word] /= counts[word] + 100.
...     # format the results nicely
...     print("%20s -> %s" % (category,
...         ', '.join(word for word, prop
...                   in frequency.most_common()[:10])))
Prominent words by category
editorial -> Berlin, Khrushchev, East, editor, nuclear,
West, Soviet, Podger, Kennedy, budget
fiction -> Kate, Winston, Scotty, Rector, Hans,
Watson, Alex, Eileen, doctor, !
government -> fiscal, Rhode, Act, Government, shelter,
States, tax, Island, property, shall
hobbies -> feed, clay, Hanover, site, your,
design, mold, Class, Junior, Juniors
news -> Mrs., Monday, Mantle, yesterday, Dallas,
Texas, Kennedy, Tuesday, jury, Palmer
religion -> God, Christ, Him, Christian, Jesus,
membership, faith, sin, Church, Catholic
reviews -> music, musical, Sept., jazz, Keys,
audience, singing, Newport, cholesterol
science_fiction -> Ekstrohm, Helva, Hal, B'dikkat, Mercer,
Ryan, Earth, ship, Mike, Hesperus
Classifying text
• We can take text that’s categorized and figure out its word frequencies
• Wouldn’t it be more useful to look at word frequencies and figure out the category?
Example: Spam filtering
• Paul Graham’s essay “A Plan for Spam” (2002)
• Remember what e-mail was like before 2002?
• A simple classifier (Naive Bayes) changed everything
Supervised classification
• Distinguish things from other things based on examples
Applications
• Spam filtering
• Detecting important e-mails
• Topic detection
• Language detection
• Sentiment analysis
Naive Bayes
• We know the probability of various data given a category
• Estimate the probability of the category given the data (sketched below)
• Assume all features of the data are independent (that’s the naive part)
• It’s simple
• It’s fast
• Sometimes it even works
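Concretely, the rule is: pick the category that maximizes P(category) × ∏ P(word | category). A toy sketch from scratch (the add-one smoothing and the vocab_size guess are my assumptions, to keep unseen words from zeroing out the product; for real work use a library implementation):

import math
from collections import Counter, defaultdict

def train(labeled_docs):
    # labeled_docs: iterable of (list_of_words, category) pairs
    cat_counts = Counter()                 # counts for P(category)
    word_counts = defaultdict(Counter)     # counts for P(word | category)
    for words, cat in labeled_docs:
        cat_counts[cat] += 1
        word_counts[cat].update(words)
    return cat_counts, word_counts

def classify(words, cat_counts, word_counts, vocab_size=10000):
    total_docs = sum(cat_counts.values())
    scores = {}
    for cat, n_docs in cat_counts.items():
        total_words = sum(word_counts[cat].values())
        score = math.log(n_docs / total_docs)       # log prior
        for w in words:                             # naive independence
            score += math.log((word_counts[cat][w] + 1) /
                              (total_words + vocab_size))  # add-one smoothing
        scores[cat] = score
    return max(scores, key=scores.get)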
A quick Naive Bayes experiment
• nltk.corpus.movie_reviews: movie reviews labeled as ‘pos’ or ‘neg’
• Define document_features(doc) to describe a document by the words it contains (a sketch follows)
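A minimal version of that experiment, in the spirit of the NLTK book’s document-classification example (the 2,000-word feature vocabulary and the 100-review test split are arbitrary choices, not from the slide; needs nltk.download('movie_reviews')):

import random
import nltk
from nltk.corpus import movie_reviews

docs = [(list(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.shuffle(docs)

# use the 2,000 most frequent words in the corpus as candidate features
freq = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in freq.most_common(2000)]

def document_features(doc):
    words = set(doc)
    return {'contains(%s)' % w: (w in words) for w in word_features}

featuresets = [(document_features(d), c) for d, c in docs]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)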
Statistics beyond single words
• Many interesting things about text are longer than one word
• bigram: a sequence of two tokens
• collocation: a bigram that seems to be more than the sum of its parts
When is a bigram interesting?
A bigram is interesting when it occurs more often than its parts would predict if they were independent. With N total words, compare:

#(vice president) / N   vs.   (#(vice) / N) × (#(president) / N)
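NLTK packages exactly this kind of test. A short sketch scoring Brown-corpus bigrams by pointwise mutual information (the minimum frequency of 5 is an arbitrary filter I chose to drop one-off pairs):

>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> from nltk.corpus import brown
>>> finder = BigramCollocationFinder.from_words(brown.words())
>>> finder.apply_freq_filter(5)     # ignore bigrams seen fewer than 5 times
>>> finder.nbest(BigramAssocMeasures.pmi, 10)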
Guess the text
>>> from nltk.book import text4
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal Government;
General Government; American people; Vice President; Old World; Almighty God;
Fellow citizens; Chief Magistrate; Chief Justice; God bless; every citizen;
Indian tribes; public debt; one another; foreign nations; political parties
Guess the text
>>> from nltk.book import text3
>>> text3.collocations()
said unto; pray thee; thou shalt; thou hast; thy seed; years old; spake unto;
thou art; LORD God; every living; God hath; begat sons; seven years;
shalt thou; little ones; living creature; creeping thing; savoury meat;
thirty years; every beast
Guess the text
>>> from nltk.book import text6
>>> text6.collocations()
BLACK KNIGHT; HEAD KNIGHT; Holy Grail; FRENCH GUARD; Sir Robin; Run away;
CARTOON CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON;
Round Table; OLD MAN; dramatic chord; dona eis; eis requiem; LEFT HEAD;
FRENCH GUARDS; music stops; Sir Launcelot
What about grammar?
• Eh
• Too hard
What about word meanings?
• “I liked the movie.”
• “I enjoyed the film.”
• These have a lot more in common than “I” and “the”.
WordNet
• A dictionary for computers
• Contains links between definitions
• Words form (roughly) a tree
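One way to see the tree: walk a noun’s hypernym path from the root down. A quick sketch (needs nltk.download('wordnet'); the output shown here is abbreviated):

>>> from nltk.corpus import wordnet as wn
>>> [s.name() for s in wn.synset('dog.n.01').hypernym_paths()[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', ..., 'canine.n.02', 'dog.n.01']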
An example WordNet entry:
Synset: good, right, ripe
Definition: most suitable or right for a particular purpose
Glosses: “a good time to plant tomatoes”; “the right time to act”; “the time is ripe for great sociological changes”
Measuring word similarity
• Various methods of measuring word similarity using paths in WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.wup_similarity(wn.synset('movie.n.1'),
wn.synset('film.n.1'))
1.0
>>> wn.wup_similarity(wn.synset('cat.n.1'),
wn.synset('dog.n.1'))
0.8571
>>> wn.wup_similarity(wn.synset('cat.n.1'),
wn.synset('movie.n.1'))
0.3636
The black hats have WordNet too
• This is why content farms might try to tell you “What to Anticipate When You’re Anticipating”
Limitations of WordNet
>>> print(wn.wup_similarity(wn.synset('taxi.n.1'),
          wn.synset('driver.n.1')))
0.235294117647
>>> print(wn.wup_similarity(wn.synset('kitten.n.1'),
          wn.synset('adorable.a.1')))
None
ConceptNet
• More types of word relationships
• More languages
• Less precise definitions
• conceptnet5.media.mit.edu
[Diagram: a small ConceptNet graph around “food” — a person buys food and groceries at a supermarket (a kind of building), spends money kept in a wallet or a bank, and cooks food.]
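ConceptNet’s relationships are also queryable over a web API. A hedged sketch against the current public endpoint, api.conceptnet.io, which superseded the address on the slide (needs the requests package; the field names assume the ConceptNet 5.5+ JSON response shape):

>>> import requests
>>> obj = requests.get('http://api.conceptnet.io/c/en/food').json()
>>> for edge in obj['edges'][:5]:
...     print(edge['start']['label'], edge['rel']['label'], edge['end']['label'])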
Take-away points
• NLP in general is hard
• Specific things are easy
• Find tools that work well and chain them together
• Try experimenting with NLTK
• If you need to classify things, try Naive Bayes first
Catherine Havasi
Luminoso & MIT Media Lab
[email protected]