NLTK & Python
Day 4
LING 681.02
Computational Linguistics
Harry Howard
Tulane University
Course organization
I have requested that Python and NLTK be
installed on the computers in this room.
31-Aug-2009
LING 681.02, Prof. Howard, Tulane University
NLPP
§1 Language processing & Python
§1.1 Computing with language
Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Searching text
Show every token of a word in context, called
concordance view.
text1.concordance("monstrous")
Show the words that appear in a similar range of
contexts.
text1.similar("monstrous")
Show the contexts that two words share.
text2.common_contexts(["monstrous", "very"])
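To see what a concordance view does, here is a minimal sketch in plain Python. It does not use NLTK: the token list and the helper name simple_concordance are made up for illustration, and NLTK's own concordance method formats its output differently.

```python
# Toy token list standing in for a loaded text (hypothetical data).
tokens = ['the', 'monstrous', 'whale', 'and', 'a', 'most', 'monstrous', 'size', '.']

def simple_concordance(tokens, word, width=2):
    """Return each token of `word` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - width):i]      # context before the hit
            right = tokens[i + 1:i + 1 + width]     # context after the hit
            lines.append(' '.join(left + [tok] + right))
    return lines

hits = simple_concordance(tokens, 'monstrous')
for line in hits:
    print(line)
```

Each printed line is one token of the word together with its neighbors, which is the idea behind the concordance view.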
Searching text, cont.
Plot how far each token of a word is from
the beginning of a text.
text1.dispersion_plot(["monstrous"])
Needs NumPy & Matplotlib, though it didn't
work for me.
Generate random text.
text1.generate()
Counting vocabulary
Count the word and punctuation tokens in a text:
len(text1)
List the distinct words, i.e. the word types, in a
text:
set(text1)
Count how many types there are in a text:
len(set(text1))
Count the tokens of a word type:
text1.count("smote")
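The same four operations can be tried on a small hand-made list, assuming only that a text behaves like a list of tokens. The list below is invented for illustration and is not one of the book's texts.

```python
# A tiny "text": word and punctuation tokens in order (hypothetical data).
tokens = ['the', 'cat', 'saw', 'the', 'dog', '.', 'The', 'dog', 'ran', '.']

n_tokens = len(tokens)        # every token counts, including repeats and '.'
types = set(tokens)           # a set keeps one copy of each distinct token
n_types = len(types)          # number of word types
n_the = tokens.count('the')   # tokens of one type; 'The' is a different type

print(n_tokens, n_types, n_the)
```

Note that 'the' and 'The' count as two types, because the comparison is case-sensitive.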
Lexical richness or diversity
The lexical richness or diversity of a text
can be estimated as tokens per type:
len(text1) / len(set(text1))
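The frequency of a type can be estimated as its
tokens as a percentage of all tokens:
100 * text1.count('a') / len(text1)
In Python 2, however, "/" between two integers is integer division.
p. 8: the book's fix is "from __future__ import division". It needs two
underscores on each side; "_future_" with one underscore gives an error.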
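The difference between integer and true division can be seen on a toy list. This sketch assumes Python 3, where "/" is always true division and "//" gives the floor result that Python 2 produces for "/" on two integers. The data is invented for illustration.

```python
# Hypothetical token list; 'a' occurs 3 times out of 8 tokens.
tokens = ['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose']

count_a = tokens.count('a')
total = len(tokens)

floor_pct = 100 * count_a // total   # floor division: the fraction is dropped
true_pct = 100 * count_a / total     # true division: the fraction is kept

print(floor_pct, true_pct)
```

Multiplying by 100 before dividing matters under integer division: count_a // total on its own would be 0 here.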
Making your own function in Python
To save you from typing the same thing
over and over, you can define your own
function:
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
You call this function by typing its name and
putting the argument, a text name, in the
parentheses:
>>> lexical_diversity(text1)
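The point of the parameter is that one definition works for any text. A minimal sketch, assuming Python 3 and two invented token lists in place of the book's texts:

```python
def lexical_diversity(text):
    """Tokens per type: the body uses the parameter, not a fixed text."""
    return len(text) / len(set(text))

# Hypothetical data: one repetitive text and one with no repeats.
repetitive = ['la', 'la', 'la', 'la']
varied = ['do', 're', 'mi', 'fa']

print(lexical_diversity(repetitive))   # 4 tokens, 1 type
print(lexical_diversity(varied))       # 4 tokens, 4 types
```

With this tokens-per-type measure, a larger value means each type is repeated more often on average.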
Other functions
Sort the word types in a text alphabetically:
sorted(set(text1))
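A short sketch of what the sorted list looks like, using an invented token list. Python compares strings by character code, so punctuation and capitalized words come before lowercase words.

```python
# Hypothetical tokens with a repeat, a capitalized word, and punctuation.
tokens = ['whale', 'Call', 'me', '.', 'Ishmael', 'me']

vocab = sorted(set(tokens))   # set() drops the repeat, sorted() returns a list
print(vocab)
```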
Exercises 1.8.…
4. … How many words are there in text2? How
many distinct words are there?
5. Compare the lexical diversity scores for humor
and romance fiction in Table 1.1. Which genre is
more lexically diverse?
8. Consider the following Python expression:
len(set(text4)). State the purpose of this
expression. Describe the two steps involved in
performing this computation.
NLPP
§1.2 A Closer Look at Python: Texts as Lists of Words
The representation of a text
We will think of a text as nothing more than a
sequence of words and punctuation.
The opening sentence of Moby Dick:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
The bracketed material is known as a list in
Python.
We can inspect it by typing the name.
How would you find out how many words it has?
List construction
Append one list to the end of another with
'+', known as concatenation:
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and',
'of', 'the', 'House', 'of', 'Representatives', ':',
'Call', 'me', 'Ishmael', '.']
Append a single item to a list:
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
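One difference between the two operations is worth seeing in code: "+" builds a new list and leaves its operands alone, while append changes the list in place and returns None. A minimal sketch with invented variable names:

```python
first = ['Monty', 'Python']
second = ['and', 'the', 'Holy', 'Grail']

combined = first + second       # new list; first and second are unchanged
result = first.append('!')      # modifies first in place; result is None

print(combined)
print(first)
print(result)
```

Because append returns None, writing first = first.append('!') would lose the list.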
List indexing
Each element in a list is numbered in sequence, a number known as the
element's index.
Show the item that occurs at an index such as 173 in a text:
>>> text4[173]
'awaken'
Show the index of an element's first occurrence:
>>> text4.index('awaken')
173
Show the elements between two indices (slicing):
>>> text5[16715:16735]
>>> text5[16715:]
>>> text5[:16735]
Assign an element to an index:
>>> sent1[0] = 'First'
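The indexing and slicing operations above can be tried on a short list where the results are easy to check by eye. The list below is invented for illustration; a slice [m:n] starts at index m and stops just before index n.

```python
words = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

item = words[2]               # the element at index 2
position = words.index('me')  # index of the first occurrence
middle = words[1:3]           # indices 1 and 2; index 3 is excluded
tail = words[4:]              # from index 4 to the end
head = words[:2]              # from the start up to, not including, index 2

words[0] = 'First'            # assignment replaces the element at index 0
print(item, position, middle, tail, head, words[0])
```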
Python counts from 0
Create a list:
>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
Find the first word:
>>> sent[0]
'word1'
Find the last word:
>>> sent[9]
'word10'
What does sent[10] do?
It raises an IndexError (list index out of range):
the ten elements have indices 0 through 9.
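A minimal sketch of the same point in a script rather than at the prompt. The list is rebuilt here so the example runs on its own; the try/except is only there to capture the error message.

```python
# Rebuild the ten-word list: 'word1' ... 'word10'.
sent = ['word%d' % n for n in range(1, 11)]

last = sent[len(sent) - 1]   # the last valid index is len(sent) - 1
also_last = sent[-1]         # negative indices count from the end

message = None
try:
    sent[10]                 # one past the end
except IndexError as err:
    message = str(err)

print(last, also_last, message)
```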
List exercises
Next time
NLPP: finish §1 and do all exercises;
do up to Ex 8 in §2