Chapter 4 : Corpus-Based Work

Foundations of Statistical NLP

Chapter 4. Corpus-Based Work

박 태 원

Abstract

• Getting Set Up
  – Computers, Corpora, Software
• Looking at Text
  – Low-level formatting issues
  – Tokenization: What is a word?
  – Morphology
  – Sentences
• Mark-up Data
  – Markup schemes
  – Grammatical tagging

Getting Set up(1/2)

• Text corpora are usually big.
  – Size was a major limitation on the use of corpora
  – Overcome by advances in computers
• Corpora
  – use text corpora distributed by the main organizations
  – corpus: a special collection of textual material
  – the general issue is whether the corpus is a representative sample of the population of interest

Getting Set up(2/2)

• Software
  – Text editors: show the text fairly literally
  – Regular expressions: find certain patterns
  – Programming languages: C, C++, Perl
  – Programming techniques
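The use of regular expressions to find patterns in a corpus can be sketched as below. This is a minimal illustration in Python rather than one of the languages listed on the slide; the helper name `find_pattern` and the sample lines are assumptions made for the example.

```python
import re

def find_pattern(lines, pattern):
    """Return (line_number, matched_text) pairs for every match in the lines."""
    regex = re.compile(pattern)
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in regex.finditer(line):
            hits.append((number, match.group()))
    return hits

# Hypothetical sample lines standing in for a corpus file.
sample_lines = [
    "The toy cost $22.50.",
    "She sent an e-mail about the 26-year-old.",
]

# Find hyphenated words: word characters joined by one or more hyphens.
hyphenated = find_pattern(sample_lines, r"\b\w+(?:-\w+)+\b")
print(hyphenated)
```

The same helper can be reused with any other pattern, for example `r"\$\d+\.\d\d"` for dollar amounts.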

Looking at Text

• Text comes in a raw format or marked up.
• Markup
  – the term is used for putting codes of some sort into a computer file
  – commercial word processing: WYSIWYG
• Features of text in human languages
  – difficult to process automatically

Low-level formatting issues

• Junk formatting/content
  – junk: document headers, separators, tables, diagrams, etc.
  – OCR: when dealing only with English text, remove the junk (other text)
• Uppercase and lowercase
  – The original Brown corpus: * was used to mark capital letters
  – Should we treat brown in Richard Brown and brown paint as the same?
  – proper name detection: a difficult problem
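The effect of the uppercase/lowercase decision can be seen in a small counting sketch. The sentence is made up for illustration; the point is only that case folding merges the proper name and the color word.

```python
from collections import Counter

# Hypothetical example sentence, already split on whitespace.
tokens = "Richard Brown bought brown paint".split()

# Counting with case preserved keeps 'Brown' and 'brown' apart.
exact_counts = Counter(tokens)

# Counting after lowercasing merges them into one entry.
folded_counts = Counter(token.lower() for token in tokens)

print(exact_counts["Brown"], exact_counts["brown"], folded_counts["brown"])
```

Folding case makes counts less sparse, but the proper name can no longer be told apart from the adjective, which is why proper name detection matters here.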

Tokenization : What is a word?(1)

• Tokenization
  – dividing the input text into units called tokens
  – what is a word?
    • graphic word (Kucera and Francis, 1967): "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks"
    • a workable definition must also handle cases such as $22.50, Micro$oft, C|net
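The graphic-word definition can be tried out directly. The sketch below is one possible reading of it, assuming ASCII letters and digits and straight apostrophes; the function name `split_graphic_words` is hypothetical.

```python
import re

# Assumed reading of the definition: alphanumerics plus hyphens and
# apostrophes, delimited by whitespace on either side.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9'-]+")

def split_graphic_words(text):
    """Split on whitespace and sort chunks into graphic words and rejects."""
    words, rejected = [], []
    for chunk in text.split():
        if GRAPHIC_WORD.fullmatch(chunk):
            words.append(chunk)
        else:
            rejected.append(chunk)
    return words, rejected

words, rejected = split_graphic_words(
    "Micro$oft paid $22.50 for a well-known dog's toy"
)
print(words)
print(rejected)
```

Under this strict reading, Micro$oft and $22.50 are rejected even though a reader would count them as words, which shows why the definition needs relaxing in practice.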

Tokenization : What is a word?(2)

• Period
  – distinguish end-of-sentence punctuation marks from abbreviation marks, as in etc. or Wash.
• Single apostrophes
  – English contractions: I’ll or isn’t
  – dog’s: dog is, dog has, or the genitive case
• Hyphenation
  – line-breaking hyphens are present in the typographical source
  – e-mail, 26-year-old, co-operate
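The line-breaking hyphen problem can be illustrated with a naive repair rule. This is a sketch under the assumption that every hyphen at the end of a line is a line-breaking hyphen; the function name is hypothetical, and the rule is deliberately too simple.

```python
import re

def rejoin_line_breaks(text):
    """Remove a hyphen plus newline between two word characters."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

# A word split only because of the line break is repaired correctly.
repaired = rejoin_line_breaks("tokeni-\nzation")

# A word with a genuine hyphen at the line end loses that hyphen.
damaged = rejoin_line_breaks("26-\nyear-old")

print(repaired, damaged)
```

The second case shows the ambiguity: at a line end, a genuine hyphen as in 26-year-old looks the same as a line-breaking one, so a simple rule cannot get both right.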

Tokenization : What is a word?(3)

• The same form representing multiple “words”
  – homographs: ‘saw’ has two lexemes (Chapter 7)
• Word segmentation in other languages
  – Many languages do not put spaces in between words
• Whitespace not indicating a word break
  – the New York-New Haven railroad
• Variant coding of information of a certain semantic type

Morphology

• Stemming
  – a process that strips off affixes and leaves you with a stem
• Lemmatization
  – attempting to find the lemma or lexeme of which one is looking at an inflected form
• The IR community has shown that stemming does not help the performance of classic IR systems
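Stripping affixes can be sketched with a toy suffix stripper. This is not a published stemming algorithm; the suffix list and the minimum stem length are assumptions chosen for illustration.

```python
# Hypothetical toy suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["walking", "walked", "dogs", "sing", "running"]:
    print(word, "->", naive_stem(word))
```

The last example comes out as runn rather than run, which shows the gap between crude stemming and lemmatization: finding the lemma needs knowledge of the word, not just its ending.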

Sentences

• What is a sentence?
  – something ending with a ‘.’, ‘?’ or ‘!’
  – text on either side of a colon, semicolon, or dash is sometimes regarded as a sentence
• Recent research on sentence boundary detection
  – Riley (1989): statistical classification trees
  – Palmer and Hearst (1994; 1997): a neural network to predict sentence boundaries
  – Mikheev (1998): a Maximum Entropy approach to the problem
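The simple punctuation rule above can be sketched as a baseline splitter. This is a hand-written heuristic, not any of the statistical methods cited; the abbreviation list is a small hypothetical sample.

```python
# Hypothetical, far from complete abbreviation list (lowercased).
ABBREVIATIONS = {"etc.", "wash.", "dr.", "mr."}

def split_sentences(text):
    """End a sentence at '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_mark = token.endswith((".", "?", "!"))
        if ends_with_mark and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Dr. Brown lives in Wash. state. Is that right? Yes!")
print(result)
```

A fixed list fails when an abbreviation such as etc. also ends the sentence, which is the kind of case the learned classifiers are meant to handle.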

Mark-up Schemes

• Markup schemes in the early days
  – included header information in texts (giving author, date, title, etc.)
• SGML
  – a general language that lets one define a grammar for texts
• XML
  – a subset of SGML particularly designed for the web
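Reading a marked-up text can be sketched with Python's standard XML parser. The element names (`text`, `header`, `author`, `title`, `body`, `s`) are invented for this example and do not come from any particular corpus markup scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document: header information plus sentence markup.
document = """<text>
  <header><author>A. Writer</author><title>Sample</title></header>
  <body><s>This is one sentence.</s><s>Here is another.</s></body>
</text>"""

root = ET.fromstring(document)

# Header information is read by path; sentences by tag name.
author = root.findtext("header/author")
title = root.findtext("header/title")
sentences = [s.text for s in root.iter("s")]

print(author, title, sentences)
```

Once the structure is explicit in the markup, header fields and sentence boundaries can be read off directly instead of being guessed from the raw text.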

Grammatical tagging

• First step of analysis
  – automatic grammatical tagging for categories
  – distinguishing comparative and superlative
• Tag sets (Table 4.5)
  – incorporate morphological distinctions of a particular language
• The design of a tag set
  – target feature of classification
    • useful information about the grammatical class of a word
  – predictive feature
    • predicting the behavior of other words in the context
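A tagged text and the granularity of a tag set can be illustrated as follows. The sketch assumes a word/TAG text format and Penn Treebank style tags (JJR for comparative, JJS for superlative adjectives); the sentence and the coarse mapping are hypothetical.

```python
from collections import Counter

# Hypothetical tagged sentence in word/TAG format.
tagged_text = "the/DT bigger/JJR dog/NN chased/VBD the/DT smallest/JJS cat/NN"

# Split each token at its last slash into a (word, tag) pair.
pairs = [tuple(token.rsplit("/", 1)) for token in tagged_text.split()]

# Assumed coarse mapping that drops the comparative/superlative distinction.
COARSE = {"JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ"}
coarse_pairs = [(word, COARSE.get(tag, tag)) for word, tag in pairs]

tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(coarse_pairs)
print(tag_counts)
```

The fine-grained tags keep the morphological distinction, while the coarse version gives fewer, more frequent classes; which is preferable depends on how the tags are used for prediction.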
