NLP and corpora


Introduction to Natural Language
Processing
Source: Natural Language Processing
with Python --- Analyzing Text with
the Natural Language Toolkit
Status
• We have progressed with Object-Oriented
Programming in Python
– Simple I/O, File I/O
– Lists, Strings, Tuples, and their methods
– Numeric types and operations
– Control structures: if, for, while
– Function definition and use
• Parameters for defining the function, arguments for
calling the function
Applying what we have
• We have looked at some of the NLTK book.
• Chapter 1 of the NLTK book repeats much
of what we see in the other text.
• Now in the context of an application
domain: Natural Language Processing
– Note: there are similar packages for other
domains
• Book examples in chapter 1 are all done
with the interactive Python shell
Reasons
• What can we achieve by combining simple
programming techniques with large
quantities of text?
• How can we automatically extract key
words and phrases that sum up the style
and content of a text?
• What tools and techniques does the Python
programming language provide for such
work?
• What are some of the interesting
challenges of natural language processing?
Quote from nltk
book
Since text can cover any subject area, it is a general interest area to
explore in some depth.
The NLTK
• The natural language tool kit
– modules
– datasets
– tutorials
• Contains: align, app (package), book, ccg (package), chat
(package), chunk (package), classify (package), cluster
(package), collocations, compat, containers, corpus
(package), data, decorators, downloader, draw (package),
etree (package), evaluate, examples (package), featstruct,
grammar, help, inference (package), internals, lazyimport,
metrics (package), misc (package), model (package), olac,
parse (package), probability, sem (package), sourcedstring,
stem (package), tag (package), text, tokenize (package),
toolbox (package), tree, treetransforms, util, yamltags
We will not have time to explore all of them, but
this gives a full list for further exploration.
Recall - the NLTK
>>> import nltk
>>> nltk.download()
opens a window showing the NLTK downloader.
Do it now, if you have not done so.
Getting data from the
downloaded files
• Previously, we used
from math import pi
– to get something specific from a module
• Now, from the nltk.book, we will get the
text files we will use
– from nltk.book import *
Import the data files
>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Do it now. Then type sent1 at the Python prompt to see the first sentence of Moby Dick. Repeat for sent2 .. sent9 to see the first sentence of each text.
Take note of the collection of texts. Great variety. Different ones will be useful for different types of exploration.
What type of data is each first sentence?
Searching the texts
>>> text9.concordance("sunset")
Building index...
Displaying 14 of 14 matches:
E suburb of Saffron Park lay on the sunset side of London , as red and ragged
n , as red and ragged as a cloud of sunset . It was built of a bright brick th
bered in that place for its strange sunset . It looked like the end of the wor
ival ; it was upon the night of the sunset that his solitude suddenly ended .
he Embankment once under a dark red sunset . The red river reflected the red s
st seemed of fiercer flame than the sunset it mirrored . It looked like a stre
he passionate plumage of the cloudy sunset had been swept away , and a naked m
der the sea . The sealed and sullen sunset behind the dark dome of St . Paul '
ming with the colour and quality of sunset . The Colonel suggested that , befo
gold . Up this side street the last sunset light shone as sharp and narrow as
of gas , which in the full flush of sunset seemed coloured like a sunset cloud
sh of sunset seemed coloured like a sunset cloud . " After all ," he said , "
y and quietly , like a long , low , sunset cloud , a long , low house , mellow
house , mellow in the mild light of sunset . All the six friends compared note
A concordance shows a word in context
Same word in different texts
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>> text2.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
. " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went
your sister is to marry him . I am monstrous glad of it , for then I shall have
ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho
k how you will like them . Lucy is monstrous pretty , and so good humoured and
Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company
usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n
t however , as it turns out , I am monstrous glad there was never any thing in
so scornfully ! for they say he is monstrous fond of her , as well he may . I s
possible that she should ." " I am monstrous glad of it . Good gracious ! I hav
thing of the kind . So then he was monstrous happy , and talked on some time ab
e very genteel people . He makes a monstrous deal of money , and they keep thei
>>>
(text1 is Moby Dick; text2 is Sense and Sensibility)
>>> text1.similar("monstrous")
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
>>>
>>> text2.similar("monstrous")
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great
remarkably sweet vast
>>>
Note different sense of the
word in the two texts.
Looking at vocabulary
>>> len(text3)
44764
>>>
Total number of tokens; includes non-words and repeated words.
>>> len(set(text3))
2789
>>> len(set(text2))
6833
>>>
What do these numbers mean?
A rough measure of lexical richness:
>>> float(len(text2))/float(len(set(text2)))
20.719449729255086
>>>
What does this tell us? On average, a word is used more than 20 times.
>>> from __future__ import division
>>> 100*text2.count("money")/len(text2)
0.018364694581002431
>>>
What does this tell us?
Note two ways to get floating point
results when dividing integers
Making life easier
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
...
>>> def percentage(count,total):
...     return 100*count/total
...
>>> lexical_diversity(text2)
20.719449729255086
>>> percentage(text2.count('money'),len(text2))
0.018364694581002431
>>>
Spot check
1. Modify the function percentage so that
you only have to pass it the name of
the text and the word to count
– the new call will look like this:
– percentage(text2, "money")
(one possible version is sketched below)
2. In which of the texts is “money” most
dominant?
– Where is it least dominant?
– What are the percentages for each text?
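A possible sketch for part 1, in the book's interactive style; the float() call is just there to force floating-point division, as in the earlier examples:
>>> def percentage(text, word):
...     # occurrences of the word relative to the total number of tokens
...     return 100 * text.count(word) / float(len(text))
...
>>> percentage(text2, "money")
0.018364694581002431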
Indexing the texts
• Each of the texts is a list, and so all our
list methods work, including slicing:
The first 100 elements in the list for text2 (Sense and Sensibility). Note
that the result of the slice is itself a list.
>>> text2[0:100]
['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER', '1',
'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.',
'Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', 'Norland',
'Park', ',', 'in', 'the', 'centre', 'of', 'their', 'property', ',', 'where', ',', 'for', 'many',
'generations', ',', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'a', 'manner', 'as', 'to',
'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding',
'acquaintance', '.', 'The', 'late', 'owner', 'of', 'this', 'estate', 'was', 'a', 'single',
'man', ',', 'who', 'lived', 'to', 'a', 'very', 'advanced', 'age', ',', 'and', 'who', 'for',
'many', 'years', 'of', 'his', 'life', ',', 'had', 'a', 'constant', 'companion']
>>>
Text index
• We can see what is at a position:
>>> text2[302]
'devolved'
• And where a word appears:
>>> text2.index('marriage')
255
>>>
Remember that indexing begins at 0 and the
index tells how far removed you are from
the initial element.
Strings
• Each of the elements in each of the text
lists is a string, and all the string
methods apply.
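For example, any element will do; index 1 of text2 is the string 'Sense', as the slice shown earlier confirms:
>>> text2[1]
'Sense'
>>> text2[1].upper()
'SENSE'
>>> text2[1].startswith('Sen')
True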
Frequency distributions
>>> fdist1=FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1=fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he',
'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from',
'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>>
These are the 50 most common tokens in the text of Moby
Dick. Many of these are not useful in characterizing the
text. We call them “stop words” and will see how to
eliminate them from consideration later.
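As a preview, one minimal way to filter them out, assuming the stopwords corpus was included in your nltk.download() (the variable names here are just illustrative):
>>> from nltk.corpus import stopwords
>>> stops = set(stopwords.words('english'))
>>> # keep only alphabetic tokens that are not stop words
>>> content_words = [w for w in vocabulary1 if w.isalpha() and w.lower() not in stops]
>>> content_words[:10]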
More precise specification
• Consider the mathematical expression
{w | w ∈ V & P(w)}
• Python implementation is
– [w for w in V if p(w)]
List comprehension – we
saw it first last week
>>> AustenVoc=set(text2)
>>> long_words_2=[w for w in AustenVoc if len(w) >15]
>>> long_words_2
['incomprehensible', 'disqualifications', 'disinterestedness', 'companionableness']
>>>
Add to the condition
>>> fdist2=FreqDist(text2)
>>> long_words_2=sorted([w for w in AustenVoc if len(w) >12 and fdist2[w]>5])
>>> long_words_2
['Somersetshire', 'accommodation', 'circumstances', 'communication',
'consciousness', 'consideration', 'disappointment', 'distinguished',
'embarrassment', 'encouragement', 'establishment', 'extraordinary',
'inconvenience', 'indisposition', 'neighbourhood', 'unaccountable',
'uncomfortable', 'understanding', 'unfortunately']
So, our if p(w) can be as complex as we need
Spot check
• Find all the words longer than 12
characters, which occur at least 5 times,
in each of the texts.
– How well do they give you a sense of the
texts?
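A possible sketch for this check, looping over the nine loaded texts (FreqDist is imported explicitly here in case it is not already in scope):
>>> from nltk import FreqDist
>>> for t in [text1, text2, text3, text4, text5, text6, text7, text8, text9]:
...     fd = FreqDist(t)
...     print(sorted([w for w in set(t) if len(w) > 12 and fd[w] > 5]))
...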
Collocations and Bigrams
• Sometimes a word by itself is not representative of
its role in a text. It is only with a companion word
that we get the intended sense.
– red wine
– high horse
– sign of hope
• Bigrams are two word combinations
– not all bigrams are useful, of course
– len(bigrams(text2)) == 141575
• including “and among”, “they could” , …
• Collocations provides bigrams that include
uncommon words – words that might be significant
in the text.
– text2.collocations() lists 20 pairs
>>> colloc2=text2.collocations()
Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing;
thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele;
every body; John Dashwood; great deal; Harley Street; Berkeley Street;
Miss Dashwoods; young man; Combe Magna; every day; next morning
>>> [len(w) for w in text2]
[1, 5, 3, 11, 2, 4, 6, 4, 1, 7, 1, 3, 6, 2, 8, 3, 4, 4, 7, 2, 6, 1, 5, 6, 3, 5, 1, 3, 5, 9, 3, 2,
7, 4, 1, 2, 3, 6, 2, 5, 8, 1, 5, 1, 3, 4, 11, 1, 4, 3, 5, 2, 2, 11, 1, 6, 2, 2, 6, 3, 7, 4, 7, 2,
5, 11, 12, 1, 3, 4, 5, 2, 4, 6, 3, 1, 6, 3, 1, 3, 5, 2, 1, 4, 8, 3, 1, 3, 3, 3, 4, 5, 2, 3, 4, 1,
3, 1, 8, 9, 3, 11, 2, 3, 6, 1, 3, 3, 5, 1, 5, 8, 3, 5, 6, 3, 3, 1, 8, …
For each word in text2, return its length
>>> fdist2=FreqDist([len(w) for w in text2])
>>> fdist2
<FreqDist with 141576 outcomes>
>>> fdist2.keys()
[3, 2, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 16]
>>>
There are 141,576 words, each with a length. But there are only 17 different word lengths.
>>> fdist2.items()
[(3, 28839), (2, 24826), (1, 23009), (4, 21352), (5, 11438), (6, 9507), (7, 8158), (8,
5676), (9, 3736), (10, 2596), (11, 1278), (12, 711), (13, 334), (14, 87), (15, 24), (17,
3), (16, 2)]
>>>
There are 28,839 3-letter words in Sense and Sensibility
(not unique words, necessarily)
>>> fdist2.keys()
[3, 2, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 16]
>>> fdist2.items()
[(3, 28839), (2, 24826), (1, 23009), (4, 21352), (5, 11438), (6, 9507), (7, 8158), (8,
5676), (9, 3736), (10, 2596), (11, 1278), (12, 711), (13, 334), (14, 87), (15, 24), (17,
3), (16, 2)]
>>> fdist2.max()
3
>>> fdist2[3]
28839
>>> fdist2[13]
334
>>>
There are 28,839 3-letter words and 334 13-letter words in Sense and Sensibility.
Table 1.2 – FreqDist functions
Example – Description
fdist = FreqDist(samples) – create a frequency distribution containing the given samples
fdist.inc(sample) – increment the count for this sample
fdist['monstrous'] – count of the number of times a given sample occurred
fdist.freq('monstrous') – frequency of a given sample
fdist.N() – total number of samples
fdist.keys() – the samples sorted in order of decreasing frequency
for sample in fdist: – iterate over the samples, in order of decreasing frequency
fdist.max() – sample with the greatest count
fdist.tabulate() – tabulate the frequency distribution
fdist.plot() – graphical plot of the frequency distribution
fdist.plot(cumulative=True) – cumulative plot of the frequency distribution
fdist1 < fdist2 – test if samples in fdist1 occur less frequently than in fdist2
Conditionals
We have seen conditionals and loop statements. These are some string tests useful for work on text:
Function – Meaning
s.startswith(t) – test if s starts with t
s.endswith(t) – test if s ends with t
t in s – test if t is contained inside s
s.islower() – test if all cased characters in s are lowercase
s.isupper() – test if all cased characters in s are uppercase
s.isalpha() – test if all characters in s are alphabetic
s.isalnum() – test if all characters in s are alphanumeric
s.isdigit() – test if all characters in s are digits
s.istitle() – test if s is titlecased (all words in s have initial capitals)
Spot check
From the NLTK book: Run the following examples
and explain what is happening. Then make up
some tests of your own.
>>> sorted([w for w in set(text7) if '-' in w and 'index' in w])
>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
>>> sorted([w for w in set(sent7) if not w.islower()])
>>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])
Ending the double count of words
• The count of words from the various texts
was flawed. How?
• We had
>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set([word.lower() for word in text1]))
17231
>>>
• What’s the problem? How do we fix it?
>>> len(set([word.lower() for word in text1 if word.isalpha()]))
16948
>>>
Nested loops and loops with
conditions
>>> for token in sent1:
...     if token.islower():
...         print token, 'is a lowercase word'
...     elif token.istitle():
...         print token, 'is a titlecase word'
...     else:
...         print token, 'is punctuation'
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
>>>
• Follow what happens.
Another example
>>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])
>>> for word in tricky:
... print word,
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>
Automatic Text Understanding
• See section 1.5
• Some realistic, interesting problems associated with
Natural Language Processing
– Word sense disambiguation
a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)
– Pronoun resolution
a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.
Generating text!
>>> text4.generate()
Building ngram index...
Fellow - Citizens : Under Providence I have given freedom new reach ,
and maintain lasting peace -- based on righteousness and justice .
There was this reason only why the cotton - producing States should be
promoted by just and abundant society , on just principles . These
later years have elapsed , and civil war . More than this , we affirm
a new beginning is a destiny . May Congress prohibit slavery in the
workshop , in translating humanity ' s strongest , but we have adopted
, and fear of God . And , in each
>>>
An inaugural address??
-- MIT hoax – conference submission
Translation
Babel> How long before the next flight to Alice Springs?
Babel> german
Babel> run
0> How long before the next flight to Alice Springs?
1> Wie lang vor dem folgenden Flug zu Alice Springs?
2> How long before the following flight to Alice jump?
3> Wie lang vor dem folgenden Flug zu Alice springen Sie?
4> How long before the following flight to Alice do you jump?
5> Wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> How long, before the following flight to Alice does, do you jump?
7> Wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> How long before the following flight to Alice does, do you jump?
9> Wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> How long, before the following flight does to Alice, do do you jump?
11> Wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> How long before the following flight does leap to Alice, does you?
Babel>
Jeopardy and Watson
The ultimate example of a machine and language
http://www.youtube.com/watch?v=xm8iUjzgPTg&feature=related
http://www.youtube.com/watch?v=7h4baBEi0iA&feature=related -- the
strange response
http://www.youtube.com/watch?src_vid=7h4baBEi0iA&feature=iv&v=lIM7O_bRNg&annotation_id=annotation_383798#t=3m11s
Explanation of the strange response
Text corpora
• A collection of text entities
– Usually there is some unifying
characteristic, but not always
– Typical examples
• All issues of a newspaper for a period of time
• A collection of reports from a particular industry
or standards body
– More recent
• The whole collection of posts to Twitter
• All the entries in a blog or set of blogs
Check it out
• Go to http://www.gutenberg.org/
• Take a few minutes to explore the site.
– Look at the top 100 downloads of yesterday
– Can you characterize them? What do you
think of this list?
Corpora in nltk
• The nltk includes part of the Gutenberg
collection
• Find out which ones by
>>> nltk.corpus.gutenberg.fileids()
• These are the texts of the Gutenberg
collection that are downloaded with the
nltk package.
Accessing other texts
• We will explore the files loaded with nltk
• You may want to explore other texts also.
• From the help(nltk.corpus):
– If C{item} is one of the unique identifiers listed
in the corpus module's C{items} variable, then
the corresponding document will be loaded
from the NLTK corpus package.
– If C{item} is a filename, then that file will be
read.
For now – just a note that we can use these tools on other
texts that we download or acquire from any source.
Using the tools we saw before
• The particular texts we saw in chapter 1
were accessed through aliases that
simplified the interaction.
• Now, more general case, we have to do
more.
• To get the list of words in a text:
>>>emma = nltk.corpus.gutenberg.words('austen-emma.txt')
• Now we have the form we had for the texts of Chapter 1
and can use the tools found there. Try:
>>> len(emma)
Note the frequent use of Jane Austen's books in these examples.
Shortened reference
• Global context
– Instead of citing the gutenberg corpus for each
resource,
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
• So,
nltk.corpus.gutenberg.words('austen-emma.txt')
becomes just
gutenberg.words('austen-emma.txt')
Other access options
• gutenberg.words('austen-emma.txt')
– the words of the text
• gutenberg.raw('austen-emma.txt')
– the original text, no separation into tokens
(words). One long string.
• gutenberg.sents('austen-emma.txt')
– the text divided into sentences
Some code to run
• Enter and run the code for counting
characters, words, sentences and finding
the lexical diversity score of each text in
the corpus.
import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    print int(num_chars/num_words), int(num_words/num_sents), \
        int(num_words/num_vocab), fileid
Short, simple code. Already seeing some noticeable time to execute
Modify the code
• Simple change – print out the total number of characters, words, and sentences for each text (one possible version is sketched below).
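One possible version of that change, keeping the same structure and Python 2 print style as the code above:
import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))   # total characters
    num_words = len(gutenberg.words(fileid)) # total word tokens
    num_sents = len(gutenberg.sents(fileid)) # total sentences
    print num_chars, num_words, num_sents, fileid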
The text corpus
• Take a look at your directory of nltk_data to
see the variety of text materials accessible
to you.
– Some are not plain text and we cannot use
them yet – but will
– Of the plain text, note the diversity
• Classic published materials
• News feeds, movie reviews
• Overheard conversations, internet chat
– All categories of language are needed to
understand the language as it is defined and as
it is used.
The Brown Corpus
• First 1 million word corpus
• Explore –
– what are the categories?
– Access words or sentences from one or
more categories or fileids
>>> from nltk.corpus import brown
>>> brown.categories()
>>> brown.fileids(categories="<choose>")
Stylistics
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
... print m + ':', fdist[m],
• Enter that code and run it.
• What does it give you?
• What does it mean?
Spot check
• Repeat the previous code, but look for
the use of those same words in the
categories for religion, government
• Now analyze the use of the "wh" words (who, what, where, when, why) in the news category and one other of your choice. A possible sketch follows.
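A sketch of that analysis, mirroring the modal-counting code above; 'religion' stands in for whichever second category you choose:
>>> import nltk
>>> from nltk.corpus import brown
>>> wh_words = ['who', 'what', 'where', 'when', 'why']
>>> for category in ['news', 'religion']:
...     fdist = nltk.FreqDist([w.lower() for w in brown.words(categories=category)])
...     print category
...     for m in wh_words:
...         print m + ':', fdist[m],
...     print
...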
One step comparison
• Consider the following code:
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
Enter and run it.
What does it do?
Other corpora
• There is some information about the
Reuters and Inaugural address corpora
also. Take a look at them with the online
site. (5 minutes or so)
Spot Check
• Take a look at Table 2-2 for a list of some of
the material available from the nltk project. (I
cannot fit it on a slide in any meaningful way)
• Confirm that you have downloaded all of
these (when you did the nltk.download, if you
selected all)
• Find them in your directory and explore.
– How many languages are represented?
– How would you describe the variety of content?
What do you find most
interesting/unusual/strange/fun?
Languages
• The Universal Declaration of Human Rights is available in over 300 languages.
>>> from nltk.corpus import udhr
>>> udhr.fileids()
Organization of Corpora
• The organization will vary according to
the type of corpus. Knowing the
organization may be important for using
the corpus.
Table 2.3 – Basic Corpus Functionality in NLTK
Example – Description
fileids() – the files of the corpus
fileids([categories]) – the files of the corpus corresponding to these categories
categories() – the categories of the corpus
categories([fileids]) – the categories of the corpus corresponding to these files
raw() – the raw content of the corpus
raw(fileids=[f1,f2,f3]) – the raw content of the specified files
raw(categories=[c1,c2]) – the raw content of the specified categories
words() – the words of the whole corpus
words(fileids=[f1,f2,f3]) – the words of the specified fileids
words(categories=[c1,c2]) – the words of the specified categories
sents() – the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) – the sentences of the specified fileids
sents(categories=[c1,c2]) – the sentences of the specified categories
abspath(fileid) – the location of the given file on disk
encoding(fileid) – the encoding of the file (if known)
open(fileid) – open a stream for reading the given corpus file
root() – the path to the root of the locally installed corpus
readme() – the contents of the README file of the corpus
from help(nltk.corpus.reader)
Corpus reader functions are named based on the type of information
they return. Some common examples, and their return types, are:
Types of information returned from typical functions:
- I{corpus}.words(): list of str
- I{corpus}.sents(): list of (list of str)
- I{corpus}.paras(): list of (list of (list of str))
- I{corpus}.tagged_words(): list of (str,str) tuple
- I{corpus}.tagged_sents(): list of (list of (str,str))
- I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
- I{corpus}.chunked_sents(): list of (Tree w/ (str,str) leaves)
- I{corpus}.parsed_sents(): list of (Tree with str leaves)
- I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
- I{corpus}.xml(): A single xml ElementTree
- I{corpus}.raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use
C{nltk.corpus.brown.words()}:
>>> from nltk.corpus import brown
>>> print brown.words()
Spot check
• Choose a corpus and exercise some of
the functions
– Look at raw, words, sents, categories,
fileids, encoding
• Repeat for a source in a different
language.
• Work in pairs and talk about what you
find, what you might want to look for.
– Report out briefly
Working with your own sources
• NLTK provides a great bunch of resources,
but you will certainly want to access your
own collections – other books you
download, or files you create, etc.
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
You could get the list of files in any directory this way.
Other Corpus readers
• There are a number of different readers
for different types of corpora.
• Many files in corpora are “marked up” in
various ways and the reader needs to
understand the markings to return
meaningful results.
• We will stick to the
PlaintextCorpusReader for now
Conditional Frequency
Distribution
• When texts in a corpus are divided into
categories, we may want to look at the
characteristics by category – word use by
author or over time, for example
Figure 2.4: Counting Words Appearing in a Text Collection (a conditional
frequency distribution)
Frequency Distributions
• A frequency distribution counts some
occurrence, such as the use of a word or
phrase.
• A conditional frequency distribution counts some occurrence separately for each of a number of conditions (author, date, genre, etc.)
• For example:
>>> genre_word = [(genre, word)
...                for genre in ['news', 'romance']
...                for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
Think about this. What exactly is happening? What are those 170,576 things? Run the code, then enter just >>> genre_word
>>> genre_word = [(genre, word)
...                for genre in ['news', 'romance']
...                for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
• For each genre (‘news’, ‘romance’)
• loop over every word in that genre
• produce the pairs showing the genre and
the word
• What type of data is genre_word?
Spot check
• Refining the result
– When you displayed genre_word, you may
have noticed that some of the words are
not words at all. They are punctuation
marks.
– Refine this code to eliminate the entries in
genre_word in which the word is not all
alphabetic.
– Remove duplicate words that differ only in
capitalization.
Work together. Talk about what you are doing. Share your ideas and
insights
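One possible reading of the exercise, a sketch that keeps only alphabetic tokens and folds everything to lowercase so that capitalized duplicates collapse:
>>> genre_word = [(genre, word.lower())
...                for genre in ['news', 'romance']
...                for word in brown.words(categories=genre)
...                if word.isalpha()]
>>> len(genre_word)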
Conditional Frequency
Distribution
• From the list of pairs we created, we can
generate a conditional frequency
distribution of words by genre
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
Run these. Look at the results.
>>> cfd.conditions()
Look at the conditional distributions:
>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193
Presenting the results
• Plotting and tabulating
– concise representations of the frequency
distributions
• Tabulate cfd.tabulate()
• With no parameters, simply tabulates all
the conditions against all the values
Look closely
>>> from nltk.corpus import inaugural              # Get the text
>>> cfd = nltk.ConditionalFreqDist(
...            (target, fileid[:4])                # The two axes
...            for fileid in inaugural.fileids()
...            for w in inaugural.words(fileid)    # All the words in each file
...            for target in ['america', 'citizen']
...            if w.lower().startswith(target))    # Narrow the word choice
Remember list comprehensions?
Three elements
• For a conditional frequency distribution:
– Two axes
• condition or event, something of interest
• some connected characteristic – a year, a place, an
author, anything that is related in some way to the event
– Something to count
• For the condition and the characteristic, what are we
counting? Words? actions? what?
– From the previous example
• inaugural addresses
• specific words
• count the number of times that a form of either of those
words occurred in that address
Spot check
• Run the code on the previous example.
• How many times was some version of
“citizen” used in the 1909 inaugural
address?
• How many times was “america”
mentioned in 2009?
• Play with the code. What can you leave
off and still get some meaningful
output?
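After running the code, the counts can be read straight off the conditional frequency distribution; a minimal sketch (the condition is the target word, the sample is the four-character year):
>>> cfd['citizen']['1909']
>>> cfd['america']['2009']
>>> cfd.tabulate(conditions=['citizen', 'america'], samples=['1909', '2009'])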
Another case
• Somewhat simpler specification
• Distribution of length of word in
languages, with restriction on languages
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
... 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...            (lang, len(word))
...            for lang in languages
...            for word in udhr.words(lang + '-Latin1'))
Now tabulate
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)
                   0    1    2    3    4    5    6    7    8    9
       English     0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch     0  171  263  614  717  894 1013 1110 1213 1275
• Only choose to tabulate some of the
results.
Plot
• import matplotlib
• cfd.plot()
Common methods for Conditional
Frequency Distributions
• cfdist = ConditionalFreqDist(pairs) – create a conditional frequency distribution from a list of pairs
• cfdist.conditions() – alphabetically sorted list of conditions
• cfdist[condition] – the frequency distribution for this condition
• cfdist[condition][sample] – frequency for the given sample for this condition
• cfdist.tabulate() – tabulate the conditional frequency distribution
• cfdist.tabulate(samples, conditions) – tabulation limited to the specified samples and conditions
• cfdist.plot() – graphical plot of the conditional frequency distribution
• cfdist.plot(samples, conditions) – graphical plot limited to the specified samples and conditions
• cfdist1 < cfdist2 – test if samples in cfdist1 occur less frequently than in cfdist2
References
• This set of slides comes very directly
from the book, Natural Language
Processing with Python. www.nltk.org