Foundations of Statistical NLP
Chapter 4. Corpus-Based Work
박 태 원
Abstract
Getting Set Up
– Computers, Corpora, Software
Looking at Text
– Low-level formatting issues
– Tokenization: What is a word?
– Morphology
– Sentences
Mark-up Data
– Markup schemes
– Grammatical tagging
Getting Set Up (1/2)
Text corpora are usually big.
– Size used to be a major limitation on the use of corpora
– Overcome by advances in computers
Corpora
– Use text corpora distributed by the main organizations
– Corpus: a special collection of textual material
– The general issue is whether the corpus is a representative sample of the population of interest
Getting Set Up (2/2)
Software
– Text editors: show the text fairly literally
– Regular expressions: find certain patterns
– Programming languages: C, C++, Perl
– Programming techniques
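As an illustration of the regular-expression bullet, the following is a minimal sketch in Python (not one of the languages the slide lists) that finds every occurrence of a pattern and its position. The sample text and the pattern are assumptions made up for this example.

```python
import re

# Hypothetical sample text; a real corpus file could be read in instead.
text = "The cat sat. The dog barked at the cat, and the cats ran."

# \b marks a word boundary; the optional "s" also accepts the plural form.
pattern = re.compile(r"\bcats?\b", re.IGNORECASE)

# Collect (character offset, matched string) for every occurrence.
matches = [(m.start(), m.group()) for m in pattern.finditer(text)]

for offset, word in matches:
    print(offset, word)
```

The offsets make it easy to pull out the surrounding context of each hit, which is usually the next step when inspecting a corpus.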
Looking at Text
Text comes either in raw format or marked up.
Markup
– A term used for putting codes of some sort into a computer file
– Commercial word processing: WYSIWYG
Features of text in human languages
– Make it difficult to process automatically
Low-level formatting issues
Junk formatting/content
– Junk: document headers, separators, tables, diagrams, etc.
– OCR: when dealing only with English text, remove the junk (other content)
Uppercase and lowercase
– The original Brown corpus: * was used to mark a capital letter
– Should we treat brown in Richard Brown and brown paint as the same?
– Proper name detection: a difficult problem
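One simple heuristic for the uppercase/lowercase question can be sketched as follows: lowercase only the first word of each sentence and leave other capitalized words alone, on the assumption that they are names. This is an illustrative sketch, not a method stated on the slide; the function name `fold_case` and the token list are hypothetical.

```python
def fold_case(tokens):
    """Lowercase sentence-initial tokens; keep other capitalized tokens as-is.

    Assumption: sentences end with a separate ".", "?" or "!" token.
    """
    folded = []
    sentence_start = True
    for tok in tokens:
        if sentence_start and tok[:1].isupper():
            folded.append(tok.lower())
        else:
            folded.append(tok)
        # The next token starts a sentence only if this one ends a sentence.
        sentence_start = tok in {".", "?", "!"}
    return folded


tokens = ["Brown", "paint", "is", "cheap", ".",
          "Richard", "Brown", "likes", "it", "."]
print(fold_case(tokens))
```

The heuristic keeps the surname Brown distinct from brown paint, but it also lowercases the sentence-initial name Richard, which shows why proper name detection is hard.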
Tokenization: What is a word? (1)
Tokenization
– Dividing the input text into units called tokens
– What is a word?
• Graphic word (Kucera and Francis, 1967): “a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks”
-> a workable definition, but it does not cover $22.50, Micro$oft, C|net
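The graphic-word definition can be checked mechanically. The sketch below assumes ASCII letters and digits for "alphanumeric" and accepts both straight and curly apostrophes; the names `GRAPHIC_WORD` and `is_graphic_word` are hypothetical.

```python
import re

# Letters, digits, hyphens and apostrophes only; no other punctuation.
# Restricting "alphanumeric" to ASCII is a simplifying assumption.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9'’-]+")


def is_graphic_word(token):
    """Return True if a whitespace-delimited token fits the definition."""
    return GRAPHIC_WORD.fullmatch(token) is not None


for token in "co-operate dog's $22.50 Micro$oft C|net".split():
    print(token, is_graphic_word(token))
```

Tokens such as $22.50, Micro$oft and C|net fail the check even though one would usually want to count them as words.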
Tokenization: What is a word? (2)
Periods
– Distinguish end-of-sentence punctuation marks from abbreviation marks, as in etc. or Wash.
Single apostrophes
– English contractions: I’ll or isn’t
– dog’s: dog is, dog has, or the genitive case
Hyphenation
– Line-breaking hyphens are present in typeset sources
– e-mail, 26-year-old, co-operate
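The apostrophe and hyphenation cases above can be sketched with two small helpers. Both are simplified assumptions for illustration: the clitic list is deliberately short and handles only straight ASCII apostrophes, and the names `split_clitic` and `join_line_break_hyphens` are hypothetical.

```python
import re

# A small, non-exhaustive list of English clitics (assumption).
CLITIC = re.compile(r"(n't|'ll|'re|'ve|'s|'d|'m)$")


def split_clitic(token):
    """Split a trailing clitic off a token, e.g. isn't -> is + n't."""
    m = CLITIC.search(token)
    if m and m.start() > 0:
        return [token[:m.start()], m.group(1)]
    return [token]


def join_line_break_hyphens(text):
    """Remove a hyphen followed by a newline between two lowercase letters."""
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


print(split_clitic("isn't"))   # ['is', "n't"]
print(split_clitic("dog's"))   # ['dog', "'s"]
print(join_line_break_hyphens("tokeni-\nzation is hard"))
```

Neither helper resolves the real ambiguities: the split does not say whether 's stands for is, has, or the genitive, and joining "co-\noperate" yields "cooperate", losing a hyphen that may have belonged to the word.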
Tokenization: What is a word? (3)
The same form representing multiple “words”
– Homographs: ‘saw’ has two lexemes (Chap. 7)
Word segmentation in other languages
– Many languages do not put spaces between words
Whitespace not indicating a word break
– the New York-New Haven railroad
Variant coding of information of a certain semantic type
Morphology
Stemming
– A process that strips off affixes and leaves you with a stem
Lemmatization
– Attempts to find the lemma or lexeme of which the word at hand is an inflected form
The IR community has shown that stemming does not help retrieval performance
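Affix stripping can be sketched as a very crude suffix stripper. This is not an established stemming algorithm; the suffix list, the minimum stem length, and the name `naive_stem` are assumptions chosen for illustration.

```python
# Suffixes are tried in this order; the list is a deliberately tiny assumption.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ["walking", "walked", "walks", "sing", "running"]:
    print(word, "->", naive_stem(word))
```

The forms of walk collapse to one stem and sing is left alone, but running becomes "runn". A stem need not be a real word, whereas lemmatization aims at the actual lemma (run).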
Sentences
What is a sentence?
– Something ending with a ‘.’, ‘?’ or ‘!’
– Text set off by a colon, semicolon, or dash is sometimes regarded as a sentence
Recent research on sentence boundary detection
– Riley (1989): statistical classification trees
– Palmer and Hearst (1994; 1997): a neural network to predict sentence boundaries
– Mikheev (1998): a maximum entropy approach to the problem
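Before turning to the statistical methods listed above, the basic heuristic can be sketched by hand: end a sentence at ‘.’, ‘?’ or ‘!’ unless the token is a known abbreviation. This is not any of the cited systems; the abbreviation list and the name `split_sentences` are assumptions.

```python
import re

# A tiny hand-made abbreviation list (assumption); real lists are much longer.
ABBREVIATIONS = {"etc.", "Wash.", "Mr.", "Dr.", "Prof."}


def split_sentences(text):
    """Split whitespace-tokenized text at ., ? or ! unless abbreviated."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if re.search(r"[.?!]$", token) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith moved to Wash. last year. Did he like it? Yes!"
for sentence in split_sentences(text):
    print(sentence)
```

The heuristic fails when an abbreviation such as etc. really does end a sentence, which is one reason learned classifiers are applied to the problem.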
Mark-up Schemes
Early markup schemes
– Included header information in texts (giving author, date, title, etc.)
SGML
– A general language that lets one define a grammar for texts
XML
– A subset of SGML designed particularly for the web
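A marked-up text with header information can be read with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element names (`text`, `header`, `author`, `date`, `title`, `body`, `s`) and the document content are made up for this example and do not follow any particular corpus standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document; tag names are illustrative assumptions.
doc = """<text>
  <header>
    <author>A. Writer</author>
    <date>1999-01-01</date>
    <title>Sample Document</title>
  </header>
  <body><s>First sentence.</s><s>Second sentence.</s></body>
</text>"""

root = ET.fromstring(doc)

# Header fields become a dictionary: tag name -> text content.
header = {child.tag: child.text for child in root.find("header")}

# Sentence elements are collected in document order.
sentences = [s.text for s in root.iter("s")]

print(header)
print(sentences)
```

Because the markup makes sentence boundaries and metadata explicit, no heuristics are needed to recover them.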
Grammatical tagging
First step of analysis
– Automatic grammatical tagging for categories
– Finer distinctions, e.g. distinguishing comparative and superlative
Tag sets (Table 4.5)
– Incorporate morphological distinctions of a particular language
The design of a tag set
– Target feature of classification
• Gives useful information about the grammatical class of a word
– Predictive feature
• Helps predict the behavior of other words in the context
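The effect of tag set granularity can be sketched with a few Penn Treebank tags, where JJ, JJR and JJS mark plain, comparative and superlative adjectives. The coarse labels (ADJ, NOUN), the mapping table, and the hand-tagged words are assumptions for illustration only.

```python
# Fine-grained Penn Treebank tags mapped to hypothetical coarse classes.
FINE_TO_COARSE = {
    "JJ": "ADJ",    # adjective
    "JJR": "ADJ",   # adjective, comparative
    "JJS": "ADJ",   # adjective, superlative
    "NN": "NOUN",   # noun, singular
    "NNS": "NOUN",  # noun, plural
}

# Hand-tagged example words (not the output of a real tagger).
tagged = [("big", "JJ"), ("bigger", "JJR"), ("biggest", "JJS"), ("dogs", "NNS")]


def coarsen(pairs):
    """Replace each fine tag by its coarse class; unknown tags pass through."""
    return [(word, FINE_TO_COARSE.get(tag, tag)) for word, tag in pairs]


print(coarsen(tagged))
```

Collapsing the tags loses the comparative/superlative distinction, which is the kind of trade-off a tag set design has to weigh.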