CS506/606: Text Normalization
Richard Sproat, Steven Bedrick
TA: Emily Tucker-Prud’hommeaux
Fall 2011
Introduction
URL : http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/
RT @Bedricks TxtNrm rcks!! #CS506/606
Course Outline
• This course will consist of a combination of
– a (few) lectures,
– discussion of papers from the literature,
– a lab component where the class as a team will build a set of modules for text normalization using the Thrax open-source finite-state grammar toolkit.
• For most classes, there will be a combination of reading discussion and discussion of progress on the project.
Text Normalization
• Conversion of text that includes ‘non-standard’ words like numbers, abbreviations, misspellings . . . into normal words.
– Abbreviation expansion (including novel abbreviations)
– Expansion of numbers into ‘number names’
– Correction of misspellings
– Disambiguation in cases where there is ambiguity
Where is normalization needed?
• Very little in cases like this: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’ So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
Where is normalization needed?
• A lot in cases like this:
Humans are pretty good at this: can you read this?
f u cn rd ths thn u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
How about this?
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
Or this?
Goccdrnia to a hscheearcr at Emabrigdc Yinervtisu, it teosn’d rttaem in tahw rredo the stteerl in a drow are, the ylno tprmoetni gihnt is taht the trisf and tsal rtteel be at the tghir eclap. The tser can be a lotat ssem and you can litls daer it touthiw morbelp. Siht is ecuseab the nuamh dnim seod not daer yrvee rtetel by fstlei, but the drow as a elohw.
Two components of text normalization
• Given a string of characters in a text, what is the (reasonable) set of possible actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
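The two questions above can be sketched as two small functions. This is only a toy illustration: the candidate table, the function names and the capitalisation rule are all hypothetical, not part of the course materials, and a real system would use a much larger lexicon and a statistical or finite-state context model.

```python
# Toy sketch of the two components. The candidate table and the context rule
# are hypothetical examples, chosen only to make the split visible.

CANDIDATES = {
    "Dr.": ["doctor", "drive"],   # (title reading, road reading)
    "St.": ["saint", "street"],
}


def candidates(token):
    """Component 1: the set of words a token might stand for."""
    return CANDIDATES.get(token, [token])


def choose(tokens, i):
    """Component 2: pick one candidate for tokens[i] from its context."""
    options = candidates(tokens[i])
    if len(options) == 1:
        return options[0]
    title_reading, road_reading = options
    prev_is_capitalised = i > 0 and tokens[i - 1][:1].isupper()
    # Crude assumption: after a capitalised word ("Elm Dr.") prefer the road
    # reading; otherwise ("Dr. Smith") prefer the title reading.
    return road_reading if prev_is_capitalised else title_reading


def normalize(text):
    tokens = text.split()
    return " ".join(choose(tokens, i) for i in range(len(tokens)))


print(normalize("Dr. Smith lives on Elm Dr."))
```

The rule is deliberately crude (it would misread "Visit St. Mary", for instance); the point is only that candidate generation and context-dependent selection are separate steps.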
An illustration
Two components of text normalization
• A component that gives you the set of possibilities:
– 123 = one hundred (and) twenty three
– 123 = one twenty three
– 123 = one two three
• A component that tells you which one(s) are appropriate to a particular context.
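As a rough sketch of how these two components might look for a three-digit string, the code below generates the three readings listed above and then picks one from a cue word. The cue-word sets and function names are hypothetical and only stand in for a real context model.

```python
ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical cue words; a real system would learn such preferences.
PAIRED_CUES = {"room", "flight", "route"}
DIGIT_CUES = {"code", "pin", "extension"}


def below_100(n):
    """Name a number in the range 0..99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, units = divmod(n, 10)
    return TENS[tens] if units == 0 else TENS[tens] + " " + ONES[units]


def readings(digits):
    """Component 1: possible readings of a three-digit string."""
    if len(digits) != 3 or not digits.isdigit() or digits[0] == "0":
        raise ValueError("expected three digits without a leading zero")
    hundreds, rest = int(digits[0]), int(digits[1:])
    digit_by_digit = " ".join(ONES[int(d)] for d in digits)
    cardinal = ONES[hundreds] + " hundred"
    if rest:
        cardinal += " " + below_100(rest)
    # The "one twenty three" style only makes sense when the last two digits
    # form a number of at least ten; otherwise fall back to digit-by-digit.
    paired = ONES[hundreds] + " " + below_100(rest) if rest >= 10 else digit_by_digit
    return {"cardinal": cardinal, "paired": paired, "digits": digit_by_digit}


def choose_reading(previous_word, digits):
    """Component 2: pick a reading from the word preceding the digits."""
    options = readings(digits)
    cue = previous_word.lower()
    if cue in PAIRED_CUES:
        return options["paired"]
    if cue in DIGIT_CUES:
        return options["digits"]
    return options["cardinal"]


print(choose_reading("Room", "123"))
print(choose_reading("code", "123"))
print(choose_reading("of", "123"))
```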
A concrete example of finite-state methods in text normalization: digit to number name translation
• Factor digit string:
– $123 \rightarrow 1 \cdot 10^2 + 2 \cdot 10^1 + 3$
• Translate factors into number names:
– $10^2 \rightarrow$ hundred
– $2 \cdot 10^1 \rightarrow$ twenty
– $1 \cdot 10^1 + 3 \rightarrow$ thirteen
• Languages vary on how extensive these lexicons are. Some (e.g. Chinese) have very regular (hence very simple) number name systems; others (e.g. Urdu/Hindi) have a large set of number names with a name for almost every number from 1 to 100.
• Each of these steps can be accomplished with FSTs.
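The two steps can be mimicked in ordinary Python to make the idea concrete. This is a minimal sketch for English numbers of up to three digits; it uses plain functions and dictionaries rather than transducers, and all names in it are illustrative assumptions.

```python
UNITS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
         6: "six", 7: "seven", 8: "eight", 9: "nine"}
TEENS = {0: "ten", 1: "eleven", 2: "twelve", 3: "thirteen", 4: "fourteen",
         5: "fifteen", 6: "sixteen", 7: "seventeen", 8: "eighteen",
         9: "nineteen"}
DECADES = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
           6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def factor(digits):
    """Step 1: '123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3."""
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits) if d != "0"]


def number_name(digits):
    """Step 2: translate the factors of a 1-3 digit string into words."""
    if not digits.isdigit() or len(digits) > 3:
        raise ValueError("this sketch handles one to three digits only")
    by_power = {power: digit for digit, power in factor(digits)}
    words = []
    if 2 in by_power:
        words += [UNITS[by_power[2]], "hundred"]
    tens, units = by_power.get(1, 0), by_power.get(0, 0)
    if tens == 1:
        # 1*10^1 + d has its own lexical entry (ten, eleven, ..., nineteen).
        words.append(TEENS[units])
    else:
        if tens:
            words.append(DECADES[tens])
        if units:
            words.append(UNITS[units])
    return " ".join(words) or "zero"


print(factor("123"))
print(number_name("123"))
print(number_name("113"))
```

The size of the `TEENS` and `DECADES` tables is what varies across languages: a very regular system needs few entries, while a system like the Urdu/Hindi one needs an entry for almost every number up to 100.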
Urdu (Hindi) Number Names
1 eik, 2 dau, 3 teen, 4 chaar, 5 paanch, 6 chay, 7 saath, 8 aath, 9 nau, 10 dus
11 gyaa-raan, 12 baa-raan, 13 te-raan, 14 chau-daan, 15 pand-raan, 16 so-laan, 17 sat-raan, 18 attha-raan, 19 un-nees, 20 bees
21 ik-kees, 22 ba-ees, 23 ta-ees, 24 chau-bees, 25 pach-chees, 26 chab-bees, 27 satta-ees, 28 attha-ees, 29 unat-tees, 30 tees
31 ikat-tees, 32 bat-tees, 33 tain-tees, 34 chaun-tees, 35 pan-tees, 36 chat-tees, 37 san-tees, 38 ear-tees, 39 unta-lees, 40 cha-lees
41 ikta-lees, 42 baya-lees, 43 tainta-lees, 44 chawa-lees, 45 painta-lees, 46 chaya-lees, 47 santa-lees, 48 arta-lees, 49 un-chas, 50 pa-chas
51 ika-vun, 52 ba-vun, 53 tera-pun, 54 chav-van, 55 pach-pan, 56 chap-pan, 57 sata-van, 58 atha-van, 59 un-shat, 60 shaat
61 ik-shat, 62 ba-shat, 63 tere-shat, 64 chaun-shat, 65 paen-shat, 66 sar-shat / chay-aa-shat, 67 sataath, 68 athath, 69 unat-tar, 70 sat-tar
71 ikat-tar, 72 bahat-tar, 73 tehat-tar, 74 chohat-tar, 75 pagat-tar, 76 chayat-tar, 77 satat-tar, 78 athat-tar, 79 una-si, 80 assi
81 ik-si, 82 baya-si, 83 tera-si, 84 chaura-si, 85 picha-si, 86 chaya-si, 87 sata-si, 88 atha-si, 90 navay
91 ikan-vay, 92 ban-vay, 93 teran-vay, 94 chauran-vay, 95 pichan-vay, 96 chiyan-vay, 97 chatan-vay, 98 athan-vay, 99 ninan-vay, 100 saw
Digit string factoring transducer (fragment)
Germanic “decade flop”
2 4 → vier und zwanzig
70’s
Digit-string to number name translation: German
• Factor digit string:
– $123 \rightarrow 1 \cdot 10^2 + 2 \cdot 10^1 + 3$
• Flip decades and units:
– $2 \cdot 10^1 + 3 \rightarrow 3 + 2 \cdot 10^1$
• Translate factors into number names:
– $10^2 \rightarrow$ hundert
– $2 \cdot 10^1 \rightarrow$ zwanzig
– $1 \cdot 10^1 + 3 \rightarrow$ dreizehn
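The factor, flip and translate steps for German can be sketched in plain Python as follows. This is an illustrative approximation rather than the finite-state grammar itself: it handles one to three digits, leaves teens unflipped because they are translated as single lexical items (a choice made for this sketch), and prints the words space-separated, whereas standard German orthography writes them as one word.

```python
UNITS = {1: "ein", 2: "zwei", 3: "drei", 4: "vier", 5: "fünf",
         6: "sechs", 7: "sieben", 8: "acht", 9: "neun"}
TEENS = {0: "zehn", 1: "elf", 2: "zwölf", 3: "dreizehn", 4: "vierzehn",
         5: "fünfzehn", 6: "sechzehn", 7: "siebzehn", 8: "achtzehn",
         9: "neunzehn"}
DECADES = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
           6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def factor(digits):
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3."""
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits) if d != "0"]


def flip(factors):
    """Swap a decade factor (20..90) with the unit factor that follows it."""
    out = list(factors)
    for i in range(len(out) - 1):
        (d1, p1), (_, p2) = out[i], out[i + 1]
        if p1 == 1 and p2 == 0 and d1 >= 2:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def translate(factors):
    """Turn (flipped) factors into German number words."""
    words, i = [], 0
    while i < len(factors):
        digit, power = factors[i]
        nxt = factors[i + 1] if i + 1 < len(factors) else None
        if power == 2:
            words += [UNITS[digit], "hundert"]
        elif power == 1 and digit == 1:
            unit = nxt[0] if nxt and nxt[1] == 0 else 0
            words.append(TEENS[unit])
            i += 1 if unit else 0      # the unit was consumed by the teen
        elif power == 1:
            words.append(DECADES[digit])
        elif nxt and nxt[1] == 1:      # unit placed before its decade
            words += [UNITS[digit], "und"]
        else:                          # bare unit; standalone 1 is "eins"
            words.append("eins" if digit == 1 else UNITS[digit])
        i += 1
    return " ".join(words)


def german_name(digits):
    if not digits.isdigit() or len(digits) > 3:
        raise ValueError("this sketch handles one to three digits only")
    return translate(flip(factor(digits)))


print(german_name("123"))
print(german_name("24"))
```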
German number grammar (fragment)
Concrete example from English
Consider a machine that maps between digit strings and their reading as number names in English.
30,294,005,179,018,903.56 → thirty quadrillion, two hundred and ninety four trillion, five billion, one hundred seventy nine million, eighteen thousand, nine hundred three, point five six
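For comparison, here is a small procedural sketch of the same mapping in Python. It is not the finite-state machine described on the slide: it simply splits the integer part into groups of three digits, names each group, attaches the scale word, and reads the fractional part digit by digit. It never inserts the optional "and", so its output for the example differs from the reading above in that one word. Function names are illustrative.

```python
ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def below_1000(n):
    """Name a number in the range 1..999 (empty string for 0)."""
    hundreds, rest = divmod(n, 100)
    words = []
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, units = divmod(rest, 10)
        words.append(TENS[tens])
        if units:
            words.append(ONES[units])
    elif rest >= 10:
        words.append(TEENS[rest - 10])
    elif rest:
        words.append(ONES[rest])
    return " ".join(words)


def read_number(text):
    """Read a digit string with optional thousands commas and decimals."""
    integer, _, fraction = text.replace(",", "").partition(".")
    groups = []                      # least significant group first
    while integer:
        groups.append(int(integer[-3:]))
        integer = integer[:-3]
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    parts = []
    for scale, value in reversed(list(enumerate(groups))):
        if value:
            suffix = " " + SCALES[scale] if scale else ""
            parts.append(below_1000(value) + suffix)
    words = ", ".join(parts) or "zero"
    if fraction:
        words += ", point " + " ".join(ONES[int(d)] for d in fraction)
    return words


print(read_number("30,294,005,179,018,903.56"))
```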
566 states and 1492 arcs
NSW Classification
Introduction to Thrax
• The OpenGrm Thrax tools compile grammars expressed as regular expressions and context-dependent rewrite rules into weighted finite-state transducers. They make use of functionality in the OpenFst library to create, access and manipulate compiled grammars.
• Thrax is named after Dionysius Thrax (Διονύσιος ὁ Θρᾷξ) (170 BC – 90 BC), the reputed first Greek grammarian.
http://www.openfst.org/twiki/bin/view/GRM/Thrax
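To give a feel for what a context-dependent rewrite rule does, the sketch below imitates rules of the form "target → replacement / left _ right" with Python's `re` module. This is not Thrax syntax or the Thrax API: nothing here is compiled to a transducer, there are no weights, and the helper name `rewrite_rule` is made up for illustration. Note also that Python requires the left context to be a fixed-width pattern.

```python
import re


def rewrite_rule(target, replacement, left="", right=""):
    """Return a function applying  target -> replacement / left _ right.

    `left` and `right` are regular expressions for the required context;
    they are checked with lookbehind/lookahead and are not consumed.
    """
    pattern = re.compile(
        (f"(?<={left})" if left else "")
        + target
        + (f"(?={right})" if right else "")
    )
    return lambda text: pattern.sub(replacement, text)


# Two hypothetical rules for the abbreviation "St.":
#   St. -> Saint  / _ <space><capital letter>
#   St. -> Street / <lowercase letter><space> _
saint = rewrite_rule(r"St\.", "Saint", right=r" [A-Z]")
street = rewrite_rule(r"St\.", "Street", left=r"[a-z] ")


def normalize(text):
    # Applying one rule after the other plays the role of composing them.
    return street(saint(text))


print(normalize("St. Mary lives on Main St."))
```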
Reading Assignment
• Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001.