15-505: Lecture 11
Generative Models for Text Classification and Information Extraction
Kamal Nigam
Some slides from William Cohen, Andrew McCallum

Text Classification by Example
(a series of example documents shown with their class labels)

How could you build a text classifier?
• Take some ideas from machine learning
– Supervised learning setting
– Examples of each class (a few or thousands)
• Take some ideas from machine translation
– Generative models
– Language models
• Simplify each and stir thoroughly

Basic Approach of Generative Modeling
1. Pick a representation for the data
2. Write down a probabilistic generative model
3. Estimate model parameters with training data
4. Turn the model around to calculate unknown values for new data

Naïve Bayes: Bag of Words Representation
• Represent a document by its occurrence count for every word in the dictionary
• Example document: "Corn prices rose today while corn futures dropped in surprising trading activity."
– … activity 1, cable 0, corn 3, damp 0, drawer 0, dropped 1, elbow 0, earning 0, …

Naïve Bayes: Mixture of Multinomials Model
1. Pick the class: P(class)
2. For every word, pick from the class urn: P(word|class)
• E.g. a COMPUTERS urn (java, modem, windows, web, the, in, again, the, windows, …) and a SPORTS urn (soccer, polo, dropped, the, while, in, the, soccer, activity, ball, …)
• Word independence assumption!

Naïve Bayes: Estimating Parameters
• Just like estimating biased coin flip probabilities
• Estimate MAP word probabilities:
  P(word|class) = (1 + Σ_{doc∈class} N(word, doc)) / (|Vocab| + Σ_{doc∈class} N(doc))
  where N(word, doc) is the count of word in doc and N(doc) is the total word count of doc
• Estimate MAP class priors:
  P(class) = (1 + N(class)) / (|Classes| + N(docs))
  where N(class) is the number of training documents in the class and N(docs) is the total number of training documents

Naïve Bayes: Performing Classification
  P(class|doc) = P(doc|class) P(class) / P(doc) ∝ P(class) Π_{word∈doc} P(word|class)
• Word independence assumption
• Take the class with the highest probability

Classification Tricks of the Trade
• Stemming
– run, runs, running, ran → run
– table, tables, tabled → table
– computer, compute, computing → compute
• Stopwords
– Very frequent function words are generally uninformative
– if, in, the, like, …
• Information gain feature selection
– Keep just the most indicative words in the vocabulary

Naïve Bayes Rules of Thumb
• Need hundreds of labeled examples per class for good performance (~85% accuracy)
• Stemming and stopwords may or may not help
• Feature selection may or may not help
• Predicted probabilities will be very extreme
• Use sums of logs instead of multiplying probabilities to prevent underflow
• Coding this up is trivial, as a MapReduce or not
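To make the estimation and classification slides concrete, here is a minimal sketch of the mixture-of-multinomials model in Python. This is not code from the lecture; the function and variable names are illustrative. It uses the smoothed MAP estimates above, and sums log probabilities to prevent the underflow flagged in the rules of thumb.

import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (word_list, class_label) pairs."""
    vocab = {w for words, _ in docs for w in words}
    class_docs = Counter(label for _, label in docs)      # N(class)
    word_counts = defaultdict(Counter)                    # per-class word counts
    total_words = Counter()                               # words seen per class
    for words, label in docs:
        word_counts[label].update(words)
        total_words[label] += len(words)
    # MAP estimates with add-one smoothing, stored as logs
    log_prior = {c: math.log((1 + n) / (len(class_docs) + len(docs)))
                 for c, n in class_docs.items()}
    log_word = {c: {w: math.log((1 + word_counts[c][w]) /
                                (len(vocab) + total_words[c]))
                    for w in vocab}
                for c in class_docs}
    return vocab, log_prior, log_word

def classify(words, vocab, log_prior, log_word):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    def score(c):
        return log_prior[c] + sum(log_word[c][w] for w in words if w in vocab)
    return max(log_prior, key=score)

Words outside the training vocabulary are simply skipped at classification time, one common convention.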
Information Extraction with Generative Models

Example: A Problem
• A keyword search for "Baker" matches many different things:
– Mt. Baker, the school district
– Baker Hostetler, the company
– Baker, a job opening
– Genomics job

Example: A Solution
• Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.

Extracting Job Openings from the Web
• Title: Ice Cream Guru
• Description: If you dream of cold creamy…
• Contact: [email protected]
• Category: Travel/Hospitality
• Function: Food Services

Potential Enabler of Faceted Search

Lots of Structured Information in Text

IE from Research Papers

What is Information Extraction?
• Recovering structured data from formatted text
– Identifying fields (e.g. named entity recognition)
– Understanding relations between fields (e.g. record association)
– Normalization and deduplication
• Today, focus on field identification

IE Posed as a Machine Learning Task
• Training data: documents marked up with ground truth
• In contrast to text classification, local features are crucial. Features of:
– Contents
– Text just before the item
– Text just after the item
– Begin/end boundaries
• E.g. "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …" is split into prefix, contents, and suffix around the candidate field

Good Features for Information Extraction
• Creativity and domain knowledge required!
• Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
• Example feature names: contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30, begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe

Good Features for Information Extraction (continued)
• Creativity and domain knowledge required!
• Orthography: Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial
• Punctuation: Period, Comma, Apostrophe, Dash, Preceded by HTML tag
• Character n-gram classifier says string is a person name (80% accurate)
• Lexicons:
– In stopword list (the, of, their, etc.)
– In honorific list (Mr, Mrs, Dr, Sen, etc.)
– In person suffix list (Jr, Sr, PhD, etc.)
– In name particle list (de, la, van, der, etc.)
– In Census lastname list, segmented by P(name)
– In Census firstname list, segmented by P(name)
– In locations lists (states, cities, countries)
– In company name list ("J. C. Penny")
– In list of company suffixes (Inc, & Associates, Foundation)
• Word features
– Lists of job titles
– Lists of prefixes
– Lists of suffixes
– 350 informative phrases
• HTML/formatting features
– {begin, end, in} × {<b>, <i>, <a>, <hN>} × {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line
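As an illustration of how token-level features like these might be computed, here is a small hypothetical extractor in Python. The feature names echo the slides, but the exact feature sets of the systems cited differ.

def token_features(tokens, i):
    """Return a set of feature-name strings for the token at position i."""
    w = tokens[i]
    feats = {"word=" + w.lower()}                 # identity of word
    if w.isupper():
        feats.add("is-all-caps")
    if w[:1].isupper():
        feats.add("first-alpha-is-capitalized")
    if any(ch.isdigit() for ch in w):
        feats.add("contains-number")
    if w.endswith("ski"):
        feats.add("ends-with-ski")
    # "features of past & future": look at neighboring tokens too
    if i > 0:
        feats.add("prev-word=" + tokens[i - 1].lower())
    if i + 1 < len(tokens):
        feats.add("next-word=" + tokens[i + 1].lower())
    return feats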
Landscape of ML Techniques for IE
• Classify candidates: a classifier asks "which class?" for each candidate string in "Abraham Lincoln was born in Kentucky."
• Sliding window: a classifier asks "which class?" for each window, trying alternate window sizes
• Boundary models: classifiers mark BEGIN and END positions in "Abraham Lincoln was born in Kentucky."
• Finite state machines: what is the most likely state sequence for "Abraham Lincoln was born in Kentucky."?
• Wrapper induction: learn and apply a pattern for a website, e.g. in "<b><i>Abraham Lincoln</i></b> was born in Kentucky." the pattern <b><i> marks a PersonName
• Any of these models can be used to capture words, formatting or both

Sliding Windows & Boundary Detection

Information Extraction by Sliding Windows
• E.g. looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

(CMU UseNet seminar announcement)

• The slides step a candidate window across the announcement, classifying each window (e.g. "7500 Wean Hall") as a location or not
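A minimal sketch of this sliding-window extraction loop in Python. Here classify_window is a stand-in for any trained window classifier (the Naïve Bayes model above, a rule learner, etc.); its interface is an assumption for illustration.

def extract_by_sliding_window(tokens, classify_window, max_len=5, context=3):
    """Classify every window of up to max_len tokens; return labeled spans."""
    spans = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):      # try alternate window sizes
            end = start + length
            if end > len(tokens):
                break
            label = classify_window(
                contents=tokens[start:end],
                prefix=tokens[max(0, start - context):start],   # text just before
                suffix=tokens[end:end + context])               # text just after
            if label is not None:                 # e.g. "seminar location"
                spans.append((start, end, label))
    return spans

Note the weakness flagged later in the slides: each window is scored independently, so two overlapping windows can both fire.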
Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]
• A window in "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …" is described by its prefix (w_{t-m} … w_{t-1}), its contents (w_t … w_{t+n}), and its suffix (w_{t+n+1} … w_{t+n+m})
• Standard supervised learning setting
– Positive instances: windows with a real label
– Negative instances: all other windows
– Features based on candidate, prefix and suffix
• Special-purpose rule learning systems work well:
  courseNumber(X) :-
    tokenLength(X,=,2),
    every(X, inTitle, false),
    some(X, A, <previousToken>, inTitle, true),
    some(X, B, <>, tripleton, true)

IE by Boundary Detection
• E.g. looking for the seminar location in the same CMU UseNet seminar announcement: the slides first mark candidate start boundaries, then candidate end boundaries, and then pair them up to extract the field

BWI: Learning to Detect Boundaries [Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic classifiers:
– START(i) = Prob(position i starts a field)
– END(j) = Prob(position j ends a field)
– LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
• START(i) and END(j) are learned by boosting over simple boundary patterns and features
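A sketch of the BWI scoring rule in Python; start_prob, end_prob and len_prob stand in for the boosted boundary detectors and the length histogram described above.

def best_extraction(n, start_prob, end_prob, len_prob, max_len=10):
    """Return the span (i, j) maximizing START(i) * END(j) * LEN(j - i)."""
    best, best_score = None, 0.0
    for i in range(n):                                   # candidate starts
        for j in range(i + 1, min(i + max_len, n) + 1):  # candidate ends
            score = start_prob(i) * end_prob(j) * len_prob(j - i)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score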
Problems with Sliding Windows and Boundary Finders
• Decisions in neighboring parts of the input are made independently from each other
– A sliding window may predict a "seminar end time" before the "seminar start time"
– It is possible for two overlapping windows to both be above threshold
– In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step

Hidden Markov Models

Citation Parsing
• Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, pp. 524-532.
• Fahlman, S.E. and Lebiere, C., "The Cascade Correlation Learning Architecture," Neural Information Processing Systems, pp. 524-532, 1990.
• Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. NIPS 3, 190-205.
Can we do this with probabilistic generative models?
• Could have classes for {author, title, journal, year, pages}
• Could classify every word or sequence? Which sequences?
• There is something interesting in the sequence of fields that we'd like to capture:
– Authors come first
– Title comes before journal
– Page numbers come near the end

Hidden Markov Models: The Representation
• A document is a sequence of words
• Each word is tagged by its class
• fahlman s e and lebiere c [author] the cascade correlation learning architecture [title] neural information processing systems [journal] pp 524 532 [pages] 1990 [year]

HMM: Generative Model
• A state for each field (Author, Title, Journal, Year, Pages), with transitions between them
• States: x_i
• State transitions: P(x_i|x_j) = a[x_i|x_j]
• Output probabilities: P(o_i|x_j) = b[o_i|x_j]
• Markov independence assumption

HMMs: Estimating Parameters
• With fully-labeled data, just like Naïve Bayes
• Estimate MAP output probabilities:
  b[o_i|x_j] = (1 + Σ_{word∈data@x_j} N(o_i, word)) / (|Vocab| + Σ_{word∈data@x_j} 1)
  i.e. one plus the number of times o_i is emitted at state x_j, over vocabulary size plus the number of words emitted at x_j
• Estimate MAP state transitions:
  a[x_i|x_j] = (1 + Σ_{x_j∈data} N(x_{j+1} = x_i)) / (|States| + Σ_{x_j∈data} 1)
  i.e. one plus the number of transitions from x_j to x_i, over the number of states plus the number of occurrences of x_j

HMMs: Performing Extraction
• Given output words:
– fahlman s e 1991 the recurrent cascade correlation learning architecture nips 3 190 205
• Find the state sequence that maximizes Π_i a[x_i|x_{i-1}] b[o_i|x_i]
• Lots of possible state sequences to test (5^14)… Hmm

Representation for Paths: Trellis
• Lay out states against word positions in a trellis; dynamic programming (the Viterbi algorithm) finds the best path without enumerating every sequence

HMM Example: Nymble [Bikel, et al 97]
• Task: named entity extraction
• States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence
• Bigram model within classes, with backoff to unigram
• Special capitalization and number word features
• Train on 450k words of newswire text
• Results:

  Case    Language   F1
  Mixed   English    93%
  Upper   English    91%
  Mixed   Spanish    90%

HMMs: A Plethora of Applications
• Information extraction
• Part of speech tagging
• Word segmentation
• Gene finding
• Protein structure prediction
• Speech recognition
• Economics, climatology, robotics, …
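To tie the HMM slides together, here is a minimal end-to-end sketch in Python: MAP estimation from fully-labeled sequences, then Viterbi decoding over the trellis. It is illustrative rather than the lecture's code, and it assumes an initial-state distribution log_prior in place of an explicit start state.

import math
from collections import Counter, defaultdict

def train_hmm(sequences, vocab_size, states):
    """sequences: list of [(word, state), ...] fully-labeled examples."""
    trans = Counter()                  # counts of (prev_state, state)
    emit = defaultdict(Counter)        # per-state word emission counts
    occ = Counter()                    # occurrences of each state
    for seq in sequences:
        prev = None
        for word, state in seq:
            emit[state][word] += 1
            occ[state] += 1
            if prev is not None:
                trans[(prev, state)] += 1
            prev = state
    log_a = {(i, j): math.log((1 + trans[(i, j)]) / (len(states) + occ[i]))
             for i in states for j in states}
    def log_b(state, word):
        return math.log((1 + emit[state][word]) / (vocab_size + occ[state]))
    return log_a, log_b

def viterbi(words, states, log_a, log_b, log_prior):
    """Best state sequence under prod_i a[x_i|x_{i-1}] b[o_i|x_i], in log space."""
    # trellis[t][s] = (best log score of a path ending in state s at position t,
    #                  backpointer to the previous state)
    trellis = [{s: (log_prior[s] + log_b(s, words[0]), None) for s in states}]
    for w in words[1:]:
        prev = trellis[-1]
        col = {}
        for s in states:
            p = max(states, key=lambda q: prev[q][0] + log_a[(q, s)])
            col[s] = (prev[p][0] + log_a[(p, s)] + log_b(s, w), p)
        trellis.append(col)
    s = max(states, key=lambda q: trellis[-1][q][0])   # best final state
    path = [s]
    for col in reversed(trellis[1:]):                  # follow backpointers
        s = col[s][1]
        path.append(s)
    return list(reversed(path))

With five states and the fourteen-word citation on the extraction slide, Viterbi examines 5 × 5 transitions per position instead of all 5^14 sequences.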