NLTK & Python
Day 4
LING 681.02
Computational Linguistics
Harry Howard
Tulane University
Course organization
• I have requested that Python and NLTK be installed on the computers in this room.
31-Aug-2009
LING 681.02, Prof. Howard, Tulane University
NLPP
§1 Language processing & Python
§1.1 Computing with language
Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Searching text
• Show every token of a word in context, called a concordance view:
>>> text1.concordance("monstrous")
• Show the words that appear in a similar range of contexts:
>>> text1.similar("monstrous")
• Show the contexts that two or more words share (the argument is a list of words):
>>> text2.common_contexts(["monstrous", "very"])
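As a rough illustration of what a concordance view does, the sketch below lists each occurrence of a word with a few tokens of context on either side. It uses a short hand-made token list rather than text1, and the helper name simple_concordance is hypothetical, not part of NLTK.

```python
# A toy token list standing in for an NLTK text (assumption: not real corpus data).
tokens = ['the', 'whale', 'was', 'a', 'monstrous', 'size', 'and',
          'the', 'sea', 'held', 'a', 'monstrous', 'calm', '.']

def simple_concordance(tokens, word, width=2):
    """Return one context string per occurrence of `word` (hypothetical helper)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - width):i]       # up to `width` tokens before
            right = tokens[i + 1:i + 1 + width]      # up to `width` tokens after
            lines.append(' '.join(left + [token] + right))
    return lines

for line in simple_concordance(tokens, 'monstrous'):
    print(line)
# was a monstrous size and
# held a monstrous calm .
```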
Searching text, cont.
• Plot how far each token of a word is from the beginning of a text:
>>> text1.dispersion_plot(["monstrous"])
  Needs NumPy & Matplotlib, though it didn't work for me.
• Generate random text:
>>> text1.generate()
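The data behind a dispersion plot is simply the list of positions at which a word occurs. A minimal sketch with a toy token list follows; the helper name word_offsets is hypothetical, and the plotting step is left out, so neither NumPy nor Matplotlib is needed.

```python
# Toy token list (assumption: illustrative only, not one of the book's texts).
tokens = ['call', 'me', 'Ishmael', '.', 'some', 'years', 'ago', 'call', 'me', 'back']

def word_offsets(tokens, word):
    """Return the index of every token equal to `word` (hypothetical helper)."""
    return [i for i, token in enumerate(tokens) if token == word]

print(word_offsets(tokens, 'call'))    # [0, 7]
print(word_offsets(tokens, 'whale'))   # []
```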
Counting vocabulary
• Count the word and punctuation tokens in a text:
>>> len(text1)
• List the distinct words, i.e. the word types, in a text:
>>> set(text1)
• Count how many types there are in a text:
>>> len(set(text1))
• Count the tokens of a word type:
>>> text1.count("smote")
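The same three operations can be tried on any plain Python list. The sketch below uses a toy list of tokens in place of text1, so the numbers are illustrative only.

```python
# Toy token list standing in for text1 (assumption: illustrative only).
tokens = ['the', 'cat', 'saw', 'the', 'dog', '.', 'the', 'dog', 'ran', '.']

n_tokens = len(tokens)         # every word and punctuation mark counts
types = set(tokens)            # duplicates collapse in a set
n_types = len(types)           # number of distinct types
n_the = tokens.count('the')    # tokens of one particular type

print(n_tokens, n_types, n_the)   # 10 6 3
```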
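Lexical richness or diversity
• The lexical richness or diversity of a text can be estimated as tokens per type:
>>> len(text1) / len(set(text1))
• The frequency of a type can be estimated as its tokens as a percentage of all tokens:
>>> 100 * text1.count('a') / len(text1)
• In Python 2, / between two integers is integer division, however, so both results are truncated.
• The "_future_" on p. 8 looks like an error; it is presumably meant to be "from __future__ import division" (two underscores on each side), which makes / return a fractional result.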
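The integer-division remark applies to Python 2. Under Python 3, / always gives a fractional result and // gives the truncated one, so the difference can be seen side by side. The counts below are made up for illustration.

```python
# Made-up counts for illustration (assumption: not taken from any corpus).
n_tokens = 10
n_types = 6
n_a = 3

print(n_tokens / n_types)      # true division in Python 3: a float
print(n_tokens // n_types)     # floor division: 1, what Python 2's / gave for ints
print(100 * n_a / n_tokens)    # 30.0, the percentage of tokens that are this type
```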
Making your own function in Python
• To save you from typing the same thing over and over, you can define your own function:
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
• You call this function by typing its name and filling in the argument, a text name, in the parentheses:
>>> lexical_diversity(text1)
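Because the function works on its parameter rather than on one fixed text, it can be applied to any sequence of tokens. The sketch below assumes Python 3 division and a toy token list; the second helper, percentage, is added here in the same style for illustration.

```python
def lexical_diversity(tokens):
    """Average number of tokens per type."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `total` made up by `count`, as a percentage."""
    return 100 * count / total

toy = ['to', 'be', 'or', 'not', 'to', 'be']    # stand-in for a text (assumption)
print(lexical_diversity(toy))                   # 1.5
print(percentage(toy.count('to'), len(toy)))
```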
Other functions
• Sort the word types in a text alphabetically:
>>> sorted(set(text1))
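One thing to expect from sorted() on strings is that punctuation and capitalized words come before lowercase words, since the comparison follows character codes. A small sketch on toy data, with a case-insensitive variant for comparison:

```python
tokens = ['whale', 'Call', 'me', '.', 'Zebra', 'me']   # toy data (assumption)

print(sorted(set(tokens)))
# ['.', 'Call', 'Zebra', 'me', 'whale']

# To ignore case when ordering, pass a key function.
print(sorted(set(tokens), key=str.lower))
# ['.', 'Call', 'me', 'whale', 'Zebra']
```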
Exercises 1.8.…
4. … How many words are there in text2? How many distinct words are there?
5. Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more lexically diverse?
8. Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.
NLPP
§1.2 A Closer Look at Python: Texts as Lists of Words
The representation of a text
• We will think of a text as nothing more than a sequence of words and punctuation.
• The opening sentence of Moby Dick:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
• The bracketed material is known as a list in Python.
• We can inspect it by typing the name.
• How would you find out how many words it has?
List construction
• Join one list to the end of another with '+', known as concatenation (the result is a new list):
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and',
'of', 'the', 'House', 'of', 'Representatives', ':',
'Call', 'me', 'Ishmael', '.']
• Append a single item to a list with append():
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
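The two operations differ in one respect worth seeing once: + builds a new list and leaves its operands alone, while append() changes the list it is called on. A short sketch with toy lists (the variable names are illustrative):

```python
sent_a = ['Call', 'me']
sent_b = ['Ishmael', '.']

combined = sent_a + sent_b     # builds a new list; sent_a is unchanged
print(combined)                # ['Call', 'me', 'Ishmael', '.']
print(sent_a)                  # ['Call', 'me']

sent_a.append('Some')          # changes sent_a in place and returns None
print(sent_a)                  # ['Call', 'me', 'Some']
```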
List indexing
• Each element in a list is numbered in sequence; this number is known as the element's index.
• Show the item that occurs at an index such as 173 in a text:
>>> text4[173]
'awaken'
• Show the index of an element's first occurrence:
>>> text4.index('awaken')
173
• Show the elements between two indices (slicing):
>>> text5[16715:16735]
>>> text5[16715:]
>>> text5[:16735]
• Assign an element to an index of a list:
>>> sent1[0] = 'First'
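The indexing, slicing and assignment operations above can be tried on a short list, where the results are easy to check by eye. A sketch with a toy list; note that a slice includes the start index and excludes the end index.

```python
sent = ['word1', 'word2', 'word3', 'word4', 'word5']   # toy list (assumption)

print(sent[2])                # word3
print(sent.index('word3'))    # 2
print(sent[1:3])              # ['word2', 'word3']  (index 3 is excluded)
print(sent[3:])               # ['word4', 'word5']
print(sent[:3])               # ['word1', 'word2', 'word3']

sent[0] = 'First'             # assignment replaces the element at that index
print(sent[:2])               # ['First', 'word2']
```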
Python counts from 0
• Create a list:
>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
• Find the first word:
>>> sent[0]
'word1'
• Find the last word:
>>> sent[9]
'word10'
• What does sent[10] do?
  It produces a runtime error (IndexError: list index out of range), because the ten elements are numbered 0 to 9.
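To see the error without stopping a script, the out-of-range access can be wrapped in try/except. The sketch below uses a shorter toy list and also shows that the last element can be reached with a negative index instead of counting.

```python
sent = ['word1', 'word2', 'word3']   # toy list (assumption)

print(len(sent))    # 3, so the valid indices are 0, 1 and 2
print(sent[-1])     # word3: negative indices count from the end

try:
    sent[3]                           # one past the last valid index
except IndexError as err:
    message = str(err)
    print('IndexError:', message)     # IndexError: list index out of range
```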
List exercises
Next time
NLPP: finish §1 and do all exercises;
do up to Ex 8 in §2