CS506/606: Text Normalization

Richard Sproat, Steven Bedrick TA: Emily Tucker Prud’hommeaux

Fall 2011

Introduction

URL: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/

RT @Bedricks TxtNrm rcks!! #CS506/606

Course Outline

• This course will consist of a combination of
  – a (few) lectures,
  – discussion of papers from the literature,
  – a lab component where the class as a team will build a set of modules for text normalization using the Thrax open-source finite-state grammar toolkit.
• For most classes, there will be a combination of discussion of the readings and discussion of progress on the project.

Text Normalization

• Conversion of text that includes ‘non-standard’ words like numbers, abbreviations, misspellings ... into normal words.
  – Abbreviation expansion (including novel abbreviations)
  – Expansion of numbers into ‘number names’
  – Correction of misspellings
  – Disambiguation in cases where there is ambiguity

Where is normalization needed?

• Very little in cases like this:

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’ So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

Where is normalization needed?

• A lot in cases like this:

Humans are pretty good at this: can you read this?

f u cn rd ths thn u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
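One way to think about what a program would have to do here is a rough sketch of candidate generation for vowel-dropped abbreviations: treat the abbreviation as a subsequence of the full word. The lexicon below is a tiny hypothetical word list chosen for illustration, and the rule that the first letter must match is an assumption, not something taken from the slides.

```python
# Hypothetical mini-lexicon; a real system would use a full dictionary.
LEXICON = ["can", "cane", "read", "road", "red", "this",
           "than", "then", "thin", "text", "program", "do"]


def is_subsequence(abbrev: str, word: str) -> bool:
    """True if the letters of abbrev appear in word in the same order."""
    remaining = iter(word)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later letters must be found further to the right.
    return all(ch in remaining for ch in abbrev)


def expansions(abbrev: str, lexicon: list[str]) -> list[str]:
    """Words that share the first letter and contain abbrev as a subsequence."""
    if not abbrev:
        return []
    return [w for w in lexicon
            if w.startswith(abbrev[0]) and is_subsequence(abbrev, w)]


if __name__ == "__main__":
    for token in ["cn", "rd", "ths", "thn"]:
        print(token, "->", expansions(token, LEXICON))
```

Even with this toy lexicon, "rd" and "thn" each come back with several candidates, so generating the possibilities does not by itself settle which word was meant.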

How about this?

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

Or this?

Goccdrnia to a hscheearcr at Emabrigdc Yinervtisu, it teosn’d rttaem in tahw rredo the stteerl in a drow are, the ylno tprmoetni gihnt is taht the trisf and tsal rtteel be at the tghir eclap. The tser can be a lotat ssem and you can litls daer it touthiw morbelp. Siht is ecuseab the nuamh dnim seod not daer yrvee rtetel by fstlei, but the drow as a elohw.
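The difference between the two scrambled passages can be sketched with a lexicon index. In the first passage the first and last letters stay in place, so a key made of "first letter + sorted middle + last letter" finds the word. In the second, the outer letters have moved too, and only a full anagram key still matches. The function names and the four-word lexicon are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical mini-lexicon for illustration.
LEXICON = ["according", "word", "order", "letters"]


def inner_key(word: str) -> str:
    """Keep first and last letter fixed; sort only the letters in between."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


def anagram_key(word: str) -> str:
    """Sort every letter, ignoring position entirely."""
    return "".join(sorted(word.lower()))


def build_index(lexicon, key):
    """Map each key to the lexicon words that share it."""
    index = defaultdict(list)
    for w in lexicon:
        index[key(w)].append(w)
    return index


def lookup(token, index, key):
    return index.get(key(token), [])


INNER = build_index(LEXICON, inner_key)
ANAGRAM = build_index(LEXICON, anagram_key)

if __name__ == "__main__":
    for token in ["Aoccdrnig", "Goccdrnia", "wrod", "drow"]:
        print(token, lookup(token, INNER, inner_key),
              lookup(token, ANAGRAM, anagram_key))
```

The looser anagram key recovers more tokens but also lets in more spurious candidates in a realistic lexicon, which is the usual trade-off when widening the candidate set.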

Two components of text normalization

• Given a string of characters in a text, what is the (reasonable) set of possible actual words (or word sequences) that might correspond to it?

• Which of those is right for the particular context?

An illustration

Two components of text normalization

• A component that gives you the set of possibilities:
  – 123 = one hundred (and) twenty three
  – 123 = one twenty three
  – 123 = one two three
• A component that tells you which one(s) are appropriate to a particular context.
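A minimal sketch of the second component, assuming the first has already produced the three readings of "123". The cue-word lists and style names are hypothetical. A real system would more likely use a trained classifier than a hand-written table.

```python
# Output of the (assumed) candidate-generation component for one token.
CANDIDATES = {
    "123": {
        "cardinal": "one hundred twenty three",
        "paired": "one twenty three",
        "digits": "one two three",
    }
}

# Hypothetical cue words that suggest a non-default reading.
CUES = {
    "paired": {"room", "flight", "highway"},
    "digits": {"code", "pin", "extension"},
}


def choose_reading(token: str, left_context: list[str]) -> str:
    """Pick a reading for token based on cue words in the preceding context."""
    readings = CANDIDATES[token]
    words = {w.lower().strip(".,:;") for w in left_context}
    for style, cues in CUES.items():
        if words & cues:
            return readings[style]
    return readings["cardinal"]  # default when no cue word is present


if __name__ == "__main__":
    print(choose_reading("123", ["Take", "flight"]))
    print(choose_reading("123", ["dial", "extension"]))
    print(choose_reading("123", ["there", "were"]))
```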

A concrete example of finite-state methods in text normalization: digit to number name translation

• Factor digit string:
  – $123 \rightarrow 1 \cdot 10^{2} + 2 \cdot 10^{1} + 3$
• Translate factors into number names:
  – $10^{2} \rightarrow$ hundred
  – $2 \cdot 10^{1} \rightarrow$ twenty
  – $1 \cdot 10^{1} + 3 \rightarrow$ thirteen
• Languages vary on how extensive these lexicons are. Some (e.g. Chinese) have very regular (hence very simple) number name systems; others (e.g. Urdu/Hindi) have a large set of number names with a name for almost every number from 1 to 100.
• Each of these steps can be accomplished with FSTs.
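The two steps above can be sketched in plain Python for digit strings of up to three digits. This is only an illustration of the factor-then-translate idea, not a finite-state implementation; the function names and the three-digit limit are assumptions made for brevity.

```python
UNITS = ["", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]
TEENS = {10: "ten", 11: "eleven", 12: "twelve", 13: "thirteen",
         14: "fourteen", 15: "fifteen", 16: "sixteen",
         17: "seventeen", 18: "eighteen", 19: "nineteen"}
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def factor(digits: str) -> list[tuple[int, int]]:
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3*10^0."""
    if not digits.isdigit() or len(digits) > 3:
        raise ValueError("expected one to three digits")
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits)]


def name_factors(factors: list[tuple[int, int]]) -> str:
    """Translate (digit, power) factors into English number names."""
    by_power = {power: digit for digit, power in factors}
    hundreds = by_power.get(2, 0)
    tens = by_power.get(1, 0)
    units = by_power.get(0, 0)
    words = []
    if hundreds:
        words += [UNITS[hundreds], "hundred"]
    if tens == 1:
        # 1*10^1 + u has its own lexical entry (ten ... nineteen).
        words.append(TEENS[10 + units])
    else:
        if tens:
            words.append(TENS[tens])
        if units:
            words.append(UNITS[units])
    return " ".join(words) or "zero"


if __name__ == "__main__":
    print(factor("123"))
    print(name_factors(factor("123")))
    print(name_factors(factor("113")))
```

The lookup tables play the role of the number-name lexicon. For a language like Urdu/Hindi the table for 1 to 100 would be much larger, while the factoring step would stay the same.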

Urdu (Hindi) Number Names

  1 eik          21 ik-kees      41 ikta-lees     61 ik-shat                  81 ik-si
  2 dau          22 ba-ees       42 baya-lees     62 ba-shat                  82 baya-si
  3 teen         23 ta-ees       43 tainta-lees   63 tere-shat                83 tera-si
  4 chaar        24 chau-bees    44 chawa-lees    64 chaun-shat               84 chaura-si
  5 paanch       25 pach-chees   45 painta-lees   65 paen-shat                85 picha-si
  6 chay         26 chab-bees    46 chaya-lees    66 sar-shat / chay-aa-shat  86 chaya-si
  7 saath        27 satta-ees    47 santa-lees    67 sataath                  87 sata-si
  8 aath         28 attha-ees    48 arta-lees     68 athath                   88 atha-si
  9 nau          29 unat-tees    49 un-chas       69 unat-tar                 89
 10 dus          30 tees         50 pa-chas       70 sat-tar                  90 navay
 11 gyaa-raan    31 ikat-tees    51 ika-vun       71 ikat-tar                 91 ikan-vay
 12 baa-raan     32 bat-tees     52 ba-vun        72 bahat-tar                92 ban-vay
 13 te-raan      33 tain-tees    53 tera-pun      73 tehat-tar                93 teran-vay
 14 chau-daan    34 chaun-tees   54 chav-van      74 chohat-tar               94 chauran-vay
 15 pand-raan    35 pan-tees     55 pach-pan      75 pagat-tar                95 pichan-vay
 16 so-laan      36 chat-tees    56 chap-pan      76 chayat-tar               96 chiyan-vay
 17 sat-raan     37 san-tees     57 sata-van      77 satat-tar                97 chatan-vay
 18 attha-raan   38 ear-tees     58 atha-van      78 athat-tar                98 athan-vay
 19 un-nees      39 unta-lees    59 un-shat       79 una-si                   99 ninan-vay
 20 bees         40 cha-lees     60 shaat         80 assi                    100 saw

Digit string factoring transducer (fragment)

Germanic “decade flop”

24 → vier und zwanzig

70’s

Digit-string to number name translation: German

• Factor digit string:
  – $123 \rightarrow 1 \cdot 10^{2} + 2 \cdot 10^{1} + 3$
• Flip decades and units:
  – $2 \cdot 10^{1} + 3 \rightarrow 3 + 2 \cdot 10^{1}$
• Translate factors into number names:
  – $10^{2} \rightarrow$ hundert
  – $2 \cdot 10^{1} \rightarrow$ zwanzig
  – $1 \cdot 10^{1} + 3 \rightarrow$ dreizehn
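The German variant adds one step between factoring and translation. The sketch below, limited to three digits, swaps the decade and unit factors and then reads the factors off left to right. It is plain Python rather than a transducer; the function names and the space-separated output (real German orthography writes these as one word) are assumptions made for readability.

```python
UNITS_DE = {1: "ein", 2: "zwei", 3: "drei", 4: "vier", 5: "fünf",
            6: "sechs", 7: "sieben", 8: "acht", 9: "neun"}
TEENS_DE = {10: "zehn", 11: "elf", 12: "zwölf", 13: "dreizehn",
            14: "vierzehn", 15: "fünfzehn", 16: "sechzehn",
            17: "siebzehn", 18: "achtzehn", 19: "neunzehn"}
TENS_DE = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
           6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def factor(digits: str) -> list[tuple[int, int]]:
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3*10^0."""
    if not digits.isdigit() or len(digits) > 3:
        raise ValueError("expected one to three digits")
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits)]


def flip_decades(factors):
    """Swap the tens and units factors: 2*10^1 + 3 -> 3 + 2*10^1."""
    out = list(factors)
    for i in range(len(out) - 1):
        (tens, p1), (units, p0) = out[i], out[i + 1]
        # Teens (tens == 1) and round decades (units == 0) are not flipped.
        if p1 == 1 and p0 == 0 and tens >= 2 and units != 0:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def name_de(digits: str) -> str:
    """Read the flipped factor sequence left to right as German number names."""
    factors = flip_decades(factor(digits))
    words = []
    i = 0
    while i < len(factors):
        d, p = factors[i]
        if p == 2:
            if d > 1:
                words.append(UNITS_DE[d])
            if d:
                words.append("hundert")
        elif p == 1:
            if d == 1:
                # Teens are single lexical items; consume the units factor too.
                words.append(TEENS_DE[10 + factors[i + 1][0]])
                i += 1
            elif d:
                words.append(TENS_DE[d])
        elif d:
            flipped = i + 1 < len(factors) and factors[i + 1][1] == 1
            if flipped:
                words += [UNITS_DE[d], "und"]
            else:
                words.append("eins" if d == 1 else UNITS_DE[d])
        i += 1
    return " ".join(words) or "null"


if __name__ == "__main__":
    for s in ["24", "123", "113"]:
        print(s, "->", name_de(s))
```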

German number grammar (fragment)

Concrete example from English

Consider a machine that maps between digit strings and their reading as number names in English.

30,294,005,179,018,903.56

thirty quadrillion, two hundred and ninety four trillion, five billion, one hundred seventy nine million, eighteen thousand, nine hundred three, point five six
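For comparison, the same mapping can be sketched procedurally in Python: split the integer part into groups of three digits, read each group, attach a scale word, and read the fractional digits one at a time. This is an illustrative sketch, not the transducer from the slides. It never inserts "and", and the function names and the cap at quadrillions are assumptions.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def read_group(n: int) -> str:
    """Read a number in the range 1..999."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n:
        words.append(ONES[n])
    return " ".join(words)


def read_number(text: str) -> str:
    """Read a digit string with optional thousands commas and decimal point."""
    integer, _, fraction = text.replace(",", "").partition(".")
    n = int(integer)
    groups = []  # three-digit groups, least significant first
    while n:
        groups.append(n % 1000)
        n //= 1000
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    parts = [(read_group(g) + " " + SCALES[i]).strip()
             for i, g in reversed(list(enumerate(groups))) if g]
    if not parts:
        parts = ["zero"]
    if fraction:
        parts.append("point " + " ".join(ONES[int(d)] for d in fraction))
    return ", ".join(parts)


if __name__ == "__main__":
    print(read_number("30,294,005,179,018,903.56"))
```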

566 states and 1492 arcs

NSW Classification

Introduction to Thrax

• The OpenGrm Thrax tools compile grammars expressed as regular expressions and context-dependent rewrite rules into weighted finite-state transducers. It makes use of functionality in the OpenFst library to create, access and manipulate compiled grammars.
• It is named after Dionysius Thrax (Διονύσιος ὁ Θρᾷξ) (170 BC – 90 BC), the reputed first Greek grammarian.

http://www.openfst.org/twiki/bin/view/GRM/Thrax
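To give a feel for what a context-dependent rewrite rule does, here is a rough Python stand-in built on regular-expression lookaround. It is not Thrax syntax and does not compile to a transducer. The helper name `cdrewrite` and the two "Dr." rules are hypothetical, chosen only to show a rewrite that depends on left or right context.

```python
import re


def cdrewrite(target: str, replacement: str, left: str = "", right: str = ""):
    """Return a function rewriting target -> replacement only in the context left _ right.

    All arguments are regular expressions except replacement. The left
    context must be fixed-width, because Python lookbehind requires it.
    """
    pattern = ""
    if left:
        pattern += f"(?<={left})"
    pattern += f"(?:{target})"
    if right:
        pattern += f"(?={right})"
    compiled = re.compile(pattern)
    return lambda text: compiled.sub(replacement, text)


# "Dr." before a capitalized word reads as a title ...
title_rule = cdrewrite(r"Dr\.", "Doctor", right=r" [A-Z]")
# ... while "Dr." after a lowercase letter and a space reads as a street type.
street_rule = cdrewrite(r"Dr\.", "Drive", left=r"[a-z] ")


def normalize(text: str) -> str:
    """Apply the rules in sequence, loosely like composing rule transducers."""
    for rule in (title_rule, street_rule):
        text = rule(text)
    return text


if __name__ == "__main__":
    print(normalize("Dr. Smith lives on Elm Dr."))
```

A finite-state toolkit would instead compile each rule into a transducer, so the rules could be composed, weighted and inverted, which a regex substitution cannot do.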

Reading Assignment

• Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001.
