Natural Language Processing
>> Tokenisation <<
winter / fall 2010/2011
41.4268
Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Science, Darmstadt, Germany
www.fbi.h-da.de/~harriehausen
content
1 Tokenisation
2 Sentence segmentation
3 Lexical analyser
4 Unix/Linux tools
WS 2010/2011
NLP - Harriehausen
definition
Tokenisation
• It is often useful to represent a text as a list of tokens.
• The process of breaking a text up into its constituent tokens is
known as tokenisation.
• Tokenisation can occur at a number of different levels:
a text could be broken up into paragraphs, sentences, words,
syllables, or phonemes. And for any given level of
tokenisation, there are many different algorithms for breaking
up the text.
• For example, at the word level, it is not immediately clear how to
treat such strings as "can't", "$22.50", "New York", and "so-called".
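The open question in the last bullet can be made concrete with a small regular-expression tokeniser. The sketch below takes one possible set of decisions, not the only defensible one: "can't" and "so-called" stay single tokens, "$22.50" stays together, and "New York" is split into two tokens. The name `tokenise` and the pattern itself are assumptions made for illustration.

```python
import re

# One possible set of word-level decisions (an assumption, not a standard).
TOKEN_PATTERN = re.compile(
    r"""
    \$?\d+(?:\.\d+)?               # amounts and numbers: 22.50, $22.50
  | [A-Za-z]+(?:['-][A-Za-z]+)*    # words with internal ' or -
  | \S                             # any other single non-space character
    """,
    re.VERBOSE,
)

def tokenise(text):
    """Return the list of tokens found by TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("I can't pay $22.50 in New York for a so-called bargain."))
```

Treating "New York" as one token would need an additional lookup step, because no character-level pattern can tell it apart from two ordinary words.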
definition
• Breaking a text into smaller units – each unit having some
particular meaning.
• In most cases, into separated words and sentences.
• Carried out by finding the word boundaries, the points where one
word ends and another begins.
• Tokens/lexemes: the words identified by the process of tokenisation.
• Word segmentation
  • Tokenisation in languages where no word boundaries are
    explicitly marked
  • e.g. when whitespace is not used to signify word boundaries
    (Chinese, Thai).
differences…
Programming Languages
• Part of the lexical analysis in the compilation of a
programming language source.
• Programming languages are designed to be
unambiguous – both with regard to lexemes and syntax.
Natural Languages
• The same letter can serve many different functions.
• The syntax is not as strict as in programming languages.
Is tokenisation easy ?
• “Clairson International Corp. said it expects to report a
net loss for its second quarter ended March 26 and
doesn’t expect to meet analysts’ profit estimates of $3.9
to $4 million, or 76 cents a share to 79 cents a share, for
its year ending Sept. 24.”
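• The period is used in three different ways: in abbreviations
(“Corp.”, “Sept.”), as a decimal point (“$3.9”), and as the
sentence-final full stop. When is a period part of a token and
when not?
• The apostrophe (’) is used in two different ways: in a contraction
(“doesn’t”) and as a possessive marker (“analysts’”).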
Tokenisation; examples of problems
Abbreviations
Abbreviations need to be recognised.
• “The overt involvement of Mr. Obama’s team
they have tried to ease Gov. David A. Paterson
. . . ” (New York Times, 22.09.2009)
• Dr. (doctor? drive? The expansion is language dependent!)
Multiword expressions
In some cases, a sequence of tokens needs to be treated as one token.
• in spite of; on the other hand; 26. mars
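Treating a multiword expression as one token can be sketched as a second pass over an already tokenised text. The example below is a minimal sketch that assumes a lookup list of known expressions; the list `MULTIWORD_EXPRESSIONS` and the function `merge_multiword` are hypothetical names, and the greedy longest-match strategy is one simple choice among several.

```python
# Hypothetical lookup list; a real system would load a much larger lexicon.
MULTIWORD_EXPRESSIONS = [
    ("in", "spite", "of"),
    ("on", "the", "other", "hand"),
]

def merge_multiword(tokens, expressions=MULTIWORD_EXPRESSIONS):
    """Greedy left-to-right merge: at each position prefer the longest expression."""
    by_length = sorted(expressions, key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expr in by_length:
            window = tuple(t.lower() for t in tokens[i:i + len(expr)])
            if window == expr:
                # Join the original tokens so that capitalisation is preserved.
                merged.append(" ".join(tokens[i:i + len(expr)]))
                i += len(expr)
                break
        else:
            # No expression starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_multiword("He left in spite of the rain".split()))
```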
Tokenisation; examples of problems
from last semester's project
Relevant Fields:
• Forename, Surname
• Job Title, Academic Title
• Address
• Telephone, Telefax, Mobile Phone
• E-mail, Web Address
Tokenisation; complexity of „easy“ tokens
Telefon:
Indicator word(s): Telefon,Tel,TEL,Phone,phone,tel,t,T,fon,Durchwahl,
Residence,Privat
Patterns: +(area-code, 2 characters)(6-20 numbers and
special characters)
Characters: 0-9,-,(,),.,[,],+,/
Lookup: table of area codes
Regular expression (one pattern, wrapped over three lines):
(Telefon|Tel|TEL|Phone|phone|tel|t|T)?[\s]*
[\.]?[\s]*[:]?[\s]*[\(]?[\s]*[+]?[\s]*[\(]?
[\s]*\d{0,6}[\s]*[\)]?[\s]*[\.]?[\s]*[\(\)\d\.\-]{6,20}
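The expression can be tried out directly. The sketch below uses the slide's pattern with one assumed repair: the fragment `[\.\]?` is read as `[\.]?`, since the backslash before the closing bracket would otherwise leave the character class open. The function name `is_phone_field` and the sample numbers are made up for illustration.

```python
import re

# The slide's expression; "[\.\]?" is assumed to be a typo for "[\.]?".
PHONE_PATTERN = re.compile(
    r"(Telefon|Tel|TEL|Phone|phone|tel|t|T)?[\s]*"
    r"[\.]?[\s]*[:]?[\s]*[\(]?[\s]*[+]?[\s]*[\(]?"
    r"[\s]*\d{0,6}[\s]*[\)]?[\s]*[\.]?[\s]*[\(\)\d\.\-]{6,20}"
)

def is_phone_field(text):
    """True if the whole string matches the phone pattern."""
    return PHONE_PATTERN.fullmatch(text) is not None

for sample in ["Tel: +49 6151168485", "Phone (06151) 16-8485",
               "06151/16-8485", "Tel: 12345"]:
    print(sample, "->", is_phone_field(sample))
```

The third sample is rejected: "/" appears in the slide's list of permitted characters, but the final character class of the expression does not contain it. This is the kind of gap that makes such "easy" tokens harder than they look.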
Tokenisation; complexity of „easy“ tokens
Telefax:
Indicator word(s): Telefax,Fax,FAX,fax,f,F, PC-Fax
Patterns: +(area-code, 2 characters)(6-20 numbers and
special characters)
Characters: 0-9,-,(,),.,[,],+,/
Lookup: table of area codes
Tokenisation; complexity of „easy“ tokens
Mobile:
Indicator word(s):
Mobile,mobile,MOBIL,MOBILE,m,M,Handy,Mob
Patterns: +(area code, 2 characters)(mobile network code
3 characters)(numbers and special characters)
Characters: 0-9,-,(,),.,[,],+,/
Lookup: table of area codes
Tokenisation; complexity of „easy“ tokens
Email:
Indicator word(s):
e-mail,E-mail,Email,E,Mail,e-mail,eMail
Patterns: (numbers and characters)@(numbers and
characters).(top-level domain)
Characters: a-z, 0-9, .
Lookup: List of top-level domains
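The e-mail pattern and its lookup step can be sketched in a few lines. This is a minimal sketch under stated assumptions: the character classes follow the slide (letters, digits, period), the set `TOP_LEVEL_DOMAINS` is a deliberately tiny stand-in for a real list, and `find_emails` is a hypothetical helper name.

```python
import re

# Hypothetical, deliberately tiny lookup table; a real one would hold all TLDs.
TOP_LEVEL_DOMAINS = {"de", "com", "org", "net", "edu"}

# (numbers and characters)@(numbers and characters).(top-level domain)
EMAIL_PATTERN = re.compile(r"([a-z0-9.]+)@([a-z0-9.]+)\.([a-z]+)", re.IGNORECASE)

def find_emails(text):
    """Return e-mail-like strings whose last label is in the TLD lookup table."""
    found = []
    for match in EMAIL_PATTERN.finditer(text):
        if match.group(3).lower() in TOP_LEVEL_DOMAINS:
            found.append(match.group(0))
    return found

print(find_emails("E-mail: jane.doe@example.de, Fax 123"))
```

The pattern alone accepts any alphabetic ending; the lookup is what rejects strings that merely look like addresses.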
Tokenisation; complexity of „easy“ tokens
Web Address:
Indicator word(s):
Web,URL,Homepage,Internet,web
Patterns: (http://)(www.)(numbers and (special)
characters).(top-level domain)
Characters: -
Lookup: -
Tokenisation; complexity of „easy“ tokens
Forename and Surname:
Indicator word(s): -
Patterns: (Forename Middle name Surname). Forename and surname can
each be composed of two names, divided by "-". The middle name can
be abbreviated.
Characters: -
Lookup: names database
Tokenisation; complexity of „easy“ tokens
Company name:
Indicator word(s): GmbH, Ltd, LTD, Inc., company, AG,
Aktiengesellschaft, Fachhochschule, Universität, Beratung,
e.K., Versicherungsdienst, technik, Co., & Partner, Co.KG,
&, Solutions, System, Systems
Patterns: -
Characters: -
Lookup: Company keywords database
Check against the domain name of the web and e-mail address.
The company name often occurs multiple times, because logos etc.
are recognised as well.
Tokenisation; complexity of „easy“ tokens
German Address:
Zip:
5-digit number (zip identification sequence)
followed by the city name,
preceded by either DE- or D- or nothing.
City:
String of symbols that is preceded by the zip.
Can be identified by cross-matching zip with city.
Street:
Consists of letters and special symbols, followed by digits and
letters.
Followed by D- or DE- in combination with the zip.
Can be identified by cross-matching zip with street.
Can be identified by cross-matching city and street.
Tokenisation; complexity of „easy“ tokens
US Address:
Indicator word(s):
PASEO,PLAZA,PASAJE,CARR,PARQUE,VEREDA,VISTA,VIA,CALLEJON,PATIO,BLVD,CAMINO
A ZipCode can be of the format:
99999 or 99999-9999
Characters: -
Lookup: ZipCode database
See also http://www.usps.com/ncsc/addressstds/deliveryaddress.htm
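The two ZipCode formats translate directly into a pattern. The sketch below only finds candidates: whether a five-digit string really is a ZipCode still has to be confirmed against the lookup database. The names `ZIP_PATTERN` and `find_zip_codes` and the sample addresses are assumptions for illustration.

```python
import re

# 99999 or 99999-9999, not embedded in a longer run of digits.
ZIP_PATTERN = re.compile(r"(?<!\d)\d{5}(?:-\d{4})?(?!\d)")

def find_zip_codes(text):
    """Return all substrings that have the shape of a US ZipCode."""
    return ZIP_PATTERN.findall(text)

print(find_zip_codes("100 VIA VISTA, San Diego, CA 92101-1234"))
print(find_zip_codes("Beverly Hills, CA 90210"))
```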
Lexemes vs. tokens
• A token is a categorized block of text.
• The block of text corresponding to the token is known as
a lexeme.
Example
• “.”, “?”, “!” may all be categorized as punctuation tokens.
However, they are all different lexemes.
• “321.56”, “12”, “19.9” may all be categorized as number
tokens. However, they are all different lexemes.
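The distinction can be shown by pairing every matched piece of text (the lexeme) with its category (the token). This is a minimal sketch; the category names and the three patterns in `TOKEN_SPEC` are illustrative assumptions, not a fixed inventory.

```python
import re

# Category names are illustrative assumptions; earlier entries take precedence.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z]+"),
    ("PUNCTUATION", r"[.?!]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def categorise(text):
    """Return (token category, lexeme) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

print(categorise("321.56 12 19.9"))
print(categorise("Really? Yes!"))
```

All three numbers receive the same category although their lexemes differ, which is exactly the relationship described in the example above.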
Sentence segmentation
• Breaking a text into sentences.
• Requires an understanding of the various uses of
punctuation characters in a language.
• The boundaries between sentences need to be
recognised.
• The boundaries occur between words.
• “Sentence boundary detection”
• At first sight, this seems simple.
• Can’t we just search for “.”, “?”, “!”
• and sometimes “:”, “;”?
Sentence segmentation … BUT…
Counter examples that make segmentation difficult:
• direct speech:
“Oh my goodness!”, he said, while listening to the bad news.
• abbreviations:
• Dogs, cats, budgies, fish, etc. all belong to the category of ‘pets’.
• “The contemporary viewer may simply ogle the vast wooded
vistas rising up from the Saguenay River and Lac St. Jean,
standing in for the St. Lawrence River.”
• “The firm said it plans to sublease its current headquarters at 55
Water St. A spokesman declined to elaborate.”
• ellipsis:
text…text
Sentence segmentation … rules
Is a simple rule not sufficient?
delim = “.” | “!” | “?”
IF (right context = delim + space + capital letter OR
delim + quote + space + capital letter OR
delim + space + quote + capital letter)
THEN sentence boundary
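The rule can be implemented almost literally as a regular expression with lookahead. The sketch below is one reading of the rule, with assumptions: "capital letter" is taken as A-Z only, "space" as a single blank, and the set of quote characters in `QUOTE` is chosen for illustration. The name `split_sentences` is hypothetical.

```python
import re

# delim followed by: space+capital | quote+space+capital | space+quote+capital
QUOTE = "\"'“”’"
BOUNDARY = re.compile(
    rf"[.!?](?:(?= [A-Z])|[{QUOTE}](?= [A-Z])|(?= [{QUOTE}][A-Z]))"
)

def split_sentences(text):
    """Cut the text after every position where the rule detects a boundary."""
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences('He said "Stop!" Then he left.'))
print(split_sentences("Mr. Obama spoke."))
```

The second call shows why the simple rule is not sufficient: the period of "Mr." is followed by a space and a capital letter, so the rule wrongly reports a sentence boundary there.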
A simple sentence segmentation
• If a period preceding a space is used as an indication of
sentence boundaries, then one can recognise about 90%
of the periods which end a sentence in the Brown corpus
(http://en.wikipedia.org/wiki/Brown_Corpus).
• One can get quite far by using simple regular expressions
without using a list of abbreviations.
• Let us assume three kinds of abbreviations in English:
  A., B., C.         [A-Za-z]\.
  U.S., m.p.h.       [A-Za-z]\.([A-Za-z]\.)+
  Mr., St., Assn.    [A-Z][bcdfghj-np-tvxz]+\.
• By using these two simple methods one can correctly
recognise about 98% of the sentence boundaries in the
Brown corpus.
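The two methods combine naturally: a period-final token counts as a sentence end unless it matches one of the three abbreviation patterns. The sketch below simplifies by splitting on whitespace and testing whole tokens; the function name `sentence_final_tokens` and the sample sentences are assumptions, and the 90% and 98% figures above are not reproduced by this toy code.

```python
import re

# The three abbreviation patterns from the slide, applied to whole tokens.
ABBREVIATION = re.compile(
    r"(?:[A-Za-z]\.|[A-Za-z]\.(?:[A-Za-z]\.)+|[A-Z][bcdfghj-np-tvxz]+\.)"
)

def sentence_final_tokens(text):
    """Whitespace tokens that end in a period and are not abbreviations."""
    tokens = text.split()
    return [t for t in tokens if t.endswith(".") and not ABBREVIATION.fullmatch(t)]

print(sentence_final_tokens(
    "Mr. Smith drove 60 m.p.h. in the U.S. last year. He was fined."
))
print(sentence_final_tokens("It is at 55 Water St. A spokesman declined."))
```

The second call illustrates the remaining errors: "St." is correctly recognised as an abbreviation, but here it also ends the sentence, and that boundary is missed.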
(Brown Corpus (http://en.wikipedia.org/wiki/Brown_Corpus))
• The Brown University Standard Corpus of Present-Day American
English (or just Brown Corpus) was compiled in the 1960s by Henry
Kucera and W. Nelson Francis at Brown University, Providence,
Rhode Island as a general corpus (text collection) in the field of
corpus linguistics.
• The initial Brown Corpus had only the words themselves, plus a
location identifier for each.
• Over the following several years part-of-speech tags were applied.
The tagged Brown Corpus used a selection of about 80 parts of
speech, as well as special indicators for compound forms,
contractions, foreign words and a few other phenomena, and
formed the basis for many later corpora such as the
Lancaster-Oslo/Bergen Corpus. The tagged corpus enabled far more
sophisticated statistical analysis.
(Brown Corpus)
Part-of-speech tags used
Tag     Definition
.       sentence closer (. ; ? *)
(       left paren
)       right paren
*       not, n't
--      dash
,       comma
:       colon
ABL     pre-qualifier (quite, rather)
ABN     pre-quantifier (half, all)
ABX     pre-quantifier (both)
AP      post-determiner (many, several, next)
AT      article (a, the, no)
BE      be
BED     were
BEDZ    was
BEG     being
BEM     am
BEN     been
BER     are, art
BEZ     is
CC      coordinating conjunction (and, or)
CD      cardinal numeral (one, two, 2, etc.)
CS      subordinating conjunction (if, although)
DO      do
DOD     did
DOZ     does
DT      singular determiner/quantifier (this, that)
DTI     singular or plural determiner/quantifier (some, any)
DTS     plural determiner (these, those)
DTX     determiner/double conjunction (either)
EX      existential there
FW      foreign word (hyphenated before regular tag)
HV      have
HVD     had (past tense)
HVG     having
HVN     had (past participle)
etc.
(Different taggers)
Tag-Sets
• A tag-set is the inventory of tags that a tagger can assign.
Tag-sets differ considerably in size:
Tag-Set                                  Number of Tags
• Brown Corpus                           87
• Lancaster-Oslo/Bergen                  135
• Lancaster UCREL                        165
• London-Lund Corpus of Spoken English   197
• Penn Treebank                          36 + 12
• IBM / Microsoft                        8
Lexical Analyser
• A lexical analyser is a program which breaks a
text into lexemes (tokens).
• A program which generates a lexical analyser is called a
lexical analyser generator.
• Examples: Lex/Flex/JFlex (http://jflex.de/)
• The user defines a set of regular expression patterns.
• The program generates finite-state automata.
• The automata are used to recognise tokens.
JFlex example (http://jflex.de/manual.html)
// A finite-state automaton recognising (a|b)*abb
%%
%public
%class Simple
%standalone
%unicode
%{
  String str = "Found: ";
%}
Pattern = (a|b)*abb
%%
{Pattern} { System.out.println(str + " " + yytext()); }
.         { ; }
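What "the program generates finite-state automata" means can be illustrated with a hand-written automaton for the same pattern. The table below is not JFlex output; it is a manually constructed deterministic automaton for (a|b)*abb, written as a sketch with assumed names (`TRANSITIONS`, `accepts`).

```python
# Hand-written DFA for (a|b)*abb.
# State n means: the last n characters read are the first n characters of "abb".
TRANSITIONS = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START, ACCEPTING = 0, {3}

def accepts(text):
    """Run the automaton over text; accept if it ends in an accepting state."""
    state = START
    for ch in text:
        if ch not in TRANSITIONS[state]:
            return False  # character outside the alphabet {a, b}
        state = TRANSITIONS[state][ch]
    return state in ACCEPTING

for sample in ["abb", "aababb", "ab", "abba"]:
    print(sample, "->", accepts(sample))
```

Recognition takes one table lookup per input character, which is why generated lexical analysers are fast regardless of how complex the original regular expressions were.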
JFlex example
// A good tokeniser for English?
%%
%public
%class EngGood
%standalone
%unicode
%{
%}
WhiteSpace = [ \t\f\n]
Lower = [a-z]
Upper = [A-Z]
EngChar = {Upper}|{Lower}
EngWord = {EngChar}+
%%
{WhiteSpace} { ; }
{EngWord}    { System.out.println(yytext()); }
.            { System.out.println(yytext()); }
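The question mark in the slide title can be answered by running the three rules on a difficult input. The sketch below approximates the specification in Python rather than running JFlex itself; the function name `eng_good` is an assumption, and the output is collected in a list instead of being printed line by line.

```python
import re

# Mirrors the three rules: skip white space, emit letter runs,
# emit any other character as a token of its own.
RULES = re.compile(r"(?P<space>[ \t\f\n])|(?P<word>[A-Za-z]+)|(?P<other>.)")

def eng_good(text):
    """Return the tokens the three rules would emit, in order."""
    return [m.group() for m in RULES.finditer(text) if m.lastgroup != "space"]

print(eng_good("He can't pay $22.50."))
```

The contraction is cut into three tokens and the amount into six, so the specification is only a starting point for English.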
Unix/Linux tools
Various Unix tools exist which simplify the tokenisation
and processing of texts:
• grep (global regular expression print)
• tr (translate characters)
• sed (stream editor) (http://en.wikipedia.org/wiki/Sed)
• Programming languages with built-in pattern search:
  • Perl
Grep (http://www.panix.com/~elflord/unix/grep.html)
grep is a command line text search utility originally written for Unix. The
name is taken from the first letters in global / regular expression / print.
A backronym of the unusual name also exists in the form of Generalized
Regular Expression Parser. The grep command searches files or standard
input globally for lines matching a given regular expression, and prints
them to the program's standard output.
This is an example of a common grep usage:
grep apple fruitlist.txt
In this case, grep prints all lines containing apple from the file fruitlist.txt,
regardless of word boundaries; therefore lines containing pineapple or
apples are also printed. The grep command is case sensitive by default,
so this example's output does not include lines containing Apple (with a
capital A) unless they also contain apple.
Grep (http://www.panix.com/~elflord/unix/grep.html)
To search all .txt files in a directory for apple in a shell that
supports globbing, use an asterisk in place of the file name:
grep apple *.txt
Regular expressions can be used to match more complicated
queries.
The following prints all lines in the file that begin with the letter a,
followed by any one character, then the letters ple.
grep ^a.ple fruitlist.txt
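The matching behaviour described for grep can be reproduced in a few lines, which makes it easy to check what a pattern will select. This is a sketch, not a reimplementation of grep: the list `fruitlist` is a made-up stand-in for the contents of fruitlist.txt, and Python's regular-expression syntax differs from grep's in some details, although not for these two patterns.

```python
import re

def grep(pattern, lines):
    """Return the lines in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up stand-in for the contents of fruitlist.txt.
fruitlist = ["apple", "apples", "pineapple", "Apple", "ample", "maple"]

print(grep("apple", fruitlist))    # substring match, case sensitive
print(grep("^a.ple", fruitlist))   # anchored at the start of the line
```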
tr (http://linux.die.net/man/1/tr) or
(http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=%2Frzahz%2Ftr.htm)
NAME
tr - translate characters
SYNOPSIS
tr [ -cds ] [ string1 [ string2 ] ]
DESCRIPTION
Tr copies the standard input to the standard output with
substitution or deletion of selected characters (runes). Input
characters found in string1 are mapped into the corresponding
characters of string2. When string2 is short it is padded to the
length of string1 by duplicating its last character. Any
combination of the options -cds may be used:
-c Complement string1: replace it with a lexicographically
ordered list of all other characters.
-d Delete from input all characters in string1.
-s Squeeze repeated output characters that occur in string2 to
single characters.
tr (http://linux.die.net/man/1/tr) or
(http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=%2Frzahz%2Ftr.htm)
EXAMPLES
Replace all upper-case ASCII letters by lower-case.
tr A-Z a-z <mixed >lower
Create a list of all the words in file1, one per line in file2, where a
word is taken to be a maximal string of alphabetics.
String2 is given as a quoted newline (written here as the escape \n).
tr -cs A-Za-z '\n' <file1 >file2
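The two examples have close counterparts in other languages, which helps to see what the options do. The sketch below mirrors them in Python under the assumption that the input is already held in a string; the function names are hypothetical.

```python
import re
import string

def to_lower_ascii(text):
    # Counterpart of "tr A-Z a-z": only the 26 ASCII capitals are mapped.
    table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
    return text.translate(table)

def words_one_per_line(text):
    # Counterpart of "tr -cs A-Za-z" with a newline as string2:
    # -c selects everything that is not a letter, and -s squeezes each
    # run of replacement newlines into a single one.
    return re.sub(r"[^A-Za-z]+", "\n", text)

print(to_lower_ascii("Hello WORLD"))
print(words_one_per_line("Hello, world! 42 times"))
```

As with tr, digits and punctuation disappear entirely from the word list, so "42" is not reported as a word.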