Machine Learning for Information Extraction: An Overview
Kamal Nigam, Google Pittsburgh
With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea.
Example: A Problem
– Mt. Baker, the school district
– Baker Hostetler, the company
– Baker, a job opening
– Genomics job

Example: A Solution
Job Openings:
– Category = Food Services
– Keyword = Baker
– Location = Continental U.S.

Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services

Potential Enabler of Faceted Search

Lots of Structured Information in Text

IE from Research Papers

What is Information Extraction?
• Recovering structured data from formatted text
– Identifying fields (e.g. named entity recognition)
– Understanding relations between fields (e.g. record association)
– Normalization and deduplication
• Today, focus mostly on field identification & a little on record association

IE Posed as a Machine Learning Task
• Training data: documents marked up with ground truth
• In contrast to text classification, local features are crucial. Features of:
– Contents
– Text just before item
– Text just after item
– Begin/end boundaries
Example: … 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
(prefix / contents / suffix)

Good Features for Information Extraction
Creativity and Domain Knowledge Required!
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Other example features:
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

Good Features for Information Extraction
Creativity and Domain Knowledge Required!
– Is Capitalized
– Is Mixed Caps
– Is All Caps
– Initial Cap
– Contains Digit
– All lowercase
– Is Initial
– Punctuation: Period, Comma, Apostrophe, Dash
– Preceded by HTML tag
– Character n-gram classifier says string is a person name (80% accurate)
– In stopword list (the, of, their, etc.)
– In honorific list (Mr, Mrs, Dr, Sen, etc.)
– In person suffix list (Jr, Sr, PhD, etc.)
– In name particle list (de, la, van, der, etc.)
– In Census lastname list; segmented by P(name)
– In Census firstname list; segmented by P(name)
– In locations lists (states, cities, countries)
– In company name list ("J. C. Penny")
– In list of company suffixes (Inc, & Associates, Foundation)
Word Features
– lists of job titles
– lists of prefixes
– lists of suffixes
– 350 informative phrases
HTML/Formatting Features
– {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line
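A minimal sketch (not from the slides) of how a few of these word-level features might be computed in Python; the lexicons, feature names, and example sentence below are illustrative assumptions, not part of the original tutorial.

```python
CITY_NAMES = {"pittsburgh", "boston", "seattle"}   # toy lexicons, illustrative only
HONORIFICS = {"mr", "mrs", "dr", "sen"}

def token_features(tokens, i):
    """A few of the word-level features listed above, computed for token i."""
    w = tokens[i]
    feats = {
        "identity=" + w.lower(): 1,
        "is_all_caps": int(w.isupper()),
        "first_alpha_is_capitalized": int(w[:1].isupper()),
        "contains_digit": int(any(c.isdigit() for c in w)),
        "ends_in_ski": int(w.lower().endswith("ski")),
        "in_city_list": int(w.lower() in CITY_NAMES),
        "in_honorific_list": int(w.lower().rstrip(".") in HONORIFICS),
    }
    # Features of past & future: look at the neighboring tokens too.
    if i > 0:
        feats["prev_identity=" + tokens[i - 1].lower()] = 1
    if tokens[i + 1:i + 3] == ["and", "Associates"]:
        feats["next_two_words=and_Associates"] = 1
    return feats

tokens = "Speaker : Dr. Sebastian Thrun , Wean Hall".split()
print(token_features(tokens, 2))   # features for "Dr."
```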
IE History
Pre-Web
• Mostly news articles
– De Jong's FRUMP [1982]: a hand-built system to fill Schank-style "scripts" from news wire
– Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
– E.g. SRI's FASTUS, hand-built FSMs
– But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
Web
• AAAI '94 Spring Symposium on "Software Agents"
– Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
• Tom Mitchell's WebKB, '96
– Build KBs from the Web
• Wrapper induction
– Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …

Landscape of ML Techniques for IE
(Each technique is illustrated on the sentence "Abraham Lincoln was born in Kentucky.")
• Classify candidates: a classifier asks "which class?" for each candidate
• Sliding window: a classifier asks "which class?" for each window, trying alternate window sizes
• Boundary models: classifiers mark BEGIN and END positions
• Finite state machines: what is the most likely state sequence?
• Wrapper induction: learn and apply a pattern for a website, e.g. <b> <i> PersonName from "<b><i>Abraham Lincoln</i></b> was born in Kentucky."
Any of these models can be used to capture words, formatting or both.

Sliding Windows & Boundary Detection

Information Extraction by Sliding Window
E.g. looking for the seminar location ("3:30 pm 7500 Wean Hall") in a CMU UseNet seminar announcement:

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

A candidate window slides across this text, and each window is classified as the target field or not.
Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]
A candidate is a window of tokens w_t … w_{t+n} (the contents), together with a prefix w_{t-m} … w_{t-1} and a suffix w_{t+n+1} … w_{t+n+m}:
  … 00 : pm Place : [ Wean Hall Rm 5409 ] Speaker : Sebastian Thrun …
     (prefix)          (contents)           (suffix)
• Standard supervised learning setting
– Positive instances: candidates with the real label
– Negative instances: all other candidates
– Features based on candidate, prefix and suffix
• Special-purpose rule learning systems work well, e.g.:
  courseNumber(X) :-
      tokenLength(X, =, 2),
      every(X, inTitle, false),
      some(X, A, <previousToken>, inTitle, true),
      some(X, B, <>, tripleton, true)

Rule-learning approaches to sliding-window classification: Summary
• Representations for classifiers allow restriction of the relationships between tokens, etc.
• Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
• Use of these "heavyweight" representations is complicated, but seems to pay off in results
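To make the sliding-window formulation above concrete, here is a rough Python sketch (not from the talk) that enumerates candidate windows and scores each one from prefix/contents/suffix features; the `score_fn` argument is a placeholder for whatever classifier or rule set is actually learned.

```python
def window_candidates(tokens, max_len=5):
    """Enumerate all (start, end) token windows up to max_len tokens long."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end

def window_features(tokens, start, end, context=2):
    """Features of the candidate contents plus its prefix and suffix."""
    feats = {"length=" + str(end - start): 1}
    for w in tokens[start:end]:
        feats["contents=" + w.lower()] = 1
    for w in tokens[max(0, start - context):start]:
        feats["prefix=" + w.lower()] = 1
    for w in tokens[end:end + context]:
        feats["suffix=" + w.lower()] = 1
    return feats

def extract_field(tokens, score_fn, threshold=0.5, max_len=5):
    """Return the highest-scoring window above threshold, or None.
    score_fn maps a feature dict to a probability; it stands in for any
    trained model (naive Bayes, a rule set, logistic regression, ...)."""
    best, best_score = None, threshold
    for start, end in window_candidates(tokens, max_len):
        score = score_fn(window_features(tokens, start, end))
        if score > best_score:
            best, best_score = (start, end), score
    return best
```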
IE by Boundary Detection
E.g. looking for the seminar location ("3:30 pm 7500 Wean Hall") in the same CMU UseNet seminar announcement, this time by separately detecting the start and end boundaries of the field.

BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic classifiers:
– START(i) = Prob(position i starts a field)
– END(j) = Prob(position j ends a field)
– LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i)
• LEN(k) is estimated from a histogram

BWI: Learning to detect boundaries
• BWI uses boosting to find "detectors" for START and END
• Each weak detector has a BEFORE and an AFTER pattern (on tokens before/after position i)
• Each "pattern" is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
• The weak learner for "patterns" uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns
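A minimal sketch of the START * END * LEN scoring rule described above, assuming the per-position boundary probabilities and the length histogram have already been estimated. This is illustrative code, not BWI itself, and it uses inclusive token indices, so the length term is j - i + 1.

```python
def score_extraction(i, j, start_prob, end_prob, len_hist):
    """Score a field that starts at token i and ends at token j (inclusive):
    START(i) * END(j) * LEN(j - i + 1)."""
    return start_prob[i] * end_prob[j] * len_hist.get(j - i + 1, 0.0)

def best_extraction(start_prob, end_prob, len_hist, max_len=10):
    """Exhaustively score all spans up to max_len tokens and return the best."""
    n = len(start_prob)
    best, best_score = None, 0.0
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            s = score_extraction(i, j, start_prob, end_prob, len_hist)
            if s > best_score:
                best, best_score = (i, j), s
    return best, best_score

# Toy usage with made-up probabilities for an 8-token document:
start_prob = [0.01, 0.02, 0.90, 0.05, 0.01, 0.01, 0.01, 0.01]
end_prob   = [0.01, 0.01, 0.05, 0.10, 0.85, 0.02, 0.01, 0.01]
len_hist   = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}
print(best_extraction(start_prob, end_prob, len_hist))   # -> ((2, 4), 0.306)
```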
BWI: Learning to detect boundaries (results)
Field         F1
Person Name   30%
Location      61%
Start Time    98%

Problems with Sliding Windows and Boundary Finders
• Decisions in neighboring parts of the input are made independently from each other.
– A naïve Bayes sliding window may predict a "seminar end time" before the "seminar start time".
– It is possible for two overlapping windows to both be above threshold.
– In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Finite State Machines

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Graphical model / finite state model: states S = {s1, s2, …} with transitions generate a state sequence and an observation sequence o1 o2 … o|o|:
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Parameters, for all states:
– Start state probabilities: P(s_t)
– Transition probabilities: P(s_t | s_{t-1})
– Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior)

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi):
  \arg\max_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Lawrence Saul

Generative Extraction with HMMs [McCallum, Nigam, Seymore & Rennie '00]
• Parameters: {P(s_t | s_{t-1}), P(o_t | s_t)} for all states s_t and words o_t
• The parameters define the generative model:
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

HMM Example: "Nymble" [Bikel et al '97]
Task: named entity extraction, with states start-of-sentence, end-of-sentence, Person, Org, Other (and five other name classes). Trained on 450k words of news wire text.
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).
Results:
Language   Case    F1
English    Mixed   93%
English    Upper   91%
Spanish    Mixed   90%
Other examples of HMMs in IE: [Leek '97; Freitag & McCallum '99; Seymore et al. '99]
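For concreteness, here is a standard textbook Viterbi decoder for computing argmax_s P(s, o), of the kind the HMM extractors above rely on. The states, probability tables and tokens below are toy values, not the models from the cited papers.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence under an HMM.
    start_p[s], trans_p[s_prev][s], emit_p[s][o] are probabilities."""
    # V[t][s] = log-probability of the best path ending in state s at time t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-12)))
            back[t][s] = best_prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy example: tag tokens as person-name vs. other.
states = ["person", "other"]
start_p = {"person": 0.2, "other": 0.8}
trans_p = {"person": {"person": 0.5, "other": 0.5},
           "other": {"person": 0.1, "other": 0.9}}
emit_p = {"person": {"Lawrence": 0.3, "Saul": 0.3},
          "other": {"Yesterday": 0.2, "spoke": 0.2}}
obs = ["Yesterday", "Lawrence", "Saul", "spoke"]
print(viterbi(obs, states, start_p, trans_p, emit_p))
# -> ['other', 'person', 'person', 'other']
```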
Regrets from the Atomic View of Tokens
Would like a richer representation of text: multiple overlapping features, whole chunks of text.
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to "authoritative" pages
– emissions that are uncountable
– features at multiple levels of granularity

Problems with a Richer Representation and a Generative Model
• These arbitrary features are not independent:
– Overlapping and long-distance dependencies
– Multiple levels of granularity (words, characters)
– Multiple modalities (words, formatting, layout)
– Observations from past and future
• HMMs are generative models of the text: they model P(s, o).
• Generative models do not easily handle these non-independent features. Two choices:
– Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
– Ignore the dependencies. This causes "over-counting" of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models
• We would prefer a conditional model, P(s | o), instead of P(s, o):
– It can examine features, but is not responsible for generating them.
– We don't have to explicitly model their dependencies.
– We don't "waste modeling effort" trying to generate what we are given at test time anyway.
• If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.

Conditional Markov Models
– Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
– MaxEnt POS Tagger [Ratnaparkhi, 1996]
– SNoW-based Markov Model [Punyakanok & Roth, 2000]
Generative (traditional HMM):
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Conditional:
  P(s | o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)
Standard belief propagation: the forward-backward procedure. Viterbi and Baum-Welch follow naturally.

Exponential Form for the "Next State" Function
Capture the dependency on s_{t-1} with |S| independent functions P_{s_{t-1}}(s_t | o_t). Each state contains a "next-state classifier" that, given the next observation, produces a probability of the next state:
  P(s_t | s_{t-1}, o_t) = P_{s_{t-1}}(s_t | o_t) = \frac{1}{Z(o_t, s_{t-1})} \exp\left( \sum_k \lambda_k f_k(o_t, s_t) \right)
where the \lambda_k are weights and the f_k are features.
Recipe:
– Labeled data is assigned to transitions.
– Train each state's exponential model by maximum entropy.

Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
  Pr(0123 | rib) = 1    Pr(0453 | rob) = 1
• But then:
  Pr(0123 | rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
  Pr(0453 | rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1
Because each state's next-state distribution is locally normalized, states with a single outgoing transition pass all their probability mass along regardless of the observation, so the model cannot prefer the correct path.

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]
From HMMs to MEMMs to CRFs, with s = s_1, s_2, … s_n and o = o_1, o_2, … o_n:
HMM (a special case of MEMMs and CRFs):
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
MEMM:
  P(s | o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}, o_t) = \prod_{t=1}^{|o|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\left( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \right)
CRF:
  P(s | o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\left( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \right)
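To make the per-state "next-state classifier" concrete, here is a small sketch of the locally normalized exponential form above; the feature functions and weights are toy assumptions, not values from any of the cited systems.

```python
import math

def next_state_distribution(prev_state, obs, states, weights, feature_fns):
    """MEMM-style P(s_t | s_{t-1}, o_t): one locally normalized exponential
    model per previous state. weights[prev_state][k] pairs with feature_fns[k]."""
    scores = {}
    for s in states:
        scores[s] = math.exp(sum(w * f(obs, s)
                                 for w, f in zip(weights[prev_state], feature_fns)))
    z = sum(scores.values())   # per-step normalizer Z(o_t, s_{t-1})
    return {s: scores[s] / z for s in states}

# Toy example: two states and two hand-written feature functions.
states = ["person", "other"]
feature_fns = [
    lambda o, s: 1.0 if s == "person" and o[:1].isupper() else 0.0,
    lambda o, s: 1.0 if s == "other" and not o[:1].isupper() else 0.0,
]
weights = {"person": [1.5, 1.0], "other": [2.0, 0.5]}
print(next_state_distribution("other", "Lawrence", states, weights, feature_fns))
# -> roughly {'person': 0.88, 'other': 0.12}
```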
Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]
Graphical structure: a linear chain of states S_t, S_{t+1}, S_{t+2}, …, all conditioned on the entire observation sequence O = O_t, O_{t+1}, O_{t+2}, …
Markov on s, with conditional dependency on o:
  P(s | o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{|o|} \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right)
The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph. Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

Training CRFs
Maximize the log-likelihood of the parameters given the training data {<o, s>^{(i)}}:
  L(\{\lambda_k\} | \{<o, s>^{(i)}\})
The log-likelihood gradient is the feature count using the correct labels, minus the feature count using labels assigned by the current parameters, minus a smoothing penalty:
  \frac{\partial L}{\partial \lambda_k} = \sum_i C_k(s^{(i)}, o^{(i)}) - \sum_i \sum_s P_{\{\lambda_k\}}(s | o^{(i)}) C_k(s, o^{(i)}) - \frac{\lambda_k}{\sigma^2}
where C_k(s, o) = \sum_t f_k(o, t, s_{t-1}, s_t).
Methods:
• iterative scaling (quite slow)
• conjugate gradient (much faster)
• conjugate gradient with preconditioning (super fast)
• limited-memory quasi-Newton methods (also super fast)
Complexity is comparable to standard Baum-Welch. [Sha & Pereira 2002], [Malouf 2002]

Sample IE Applications of CRFs
• Noun phrase segmentation [Sha & Pereira 03]
• Named entity recognition [McCallum & Li 03]
• Protein names in bio abstracts [Settles 05]
• Addresses in web pages [Culotta et al. 05]
• Semantic roles in text [Roth & Yih 05]
• RNA structural alignment [Sato & Sakakibara 05]

Examples of Recent CRF Research
• Semi-Markov CRFs [Sarawagi & Cohen 05]
– Token-level decisions are awkward for segments
– A segment sequence model alleviates this
– Two-level model with sequences of segments, which are themselves sequences of tokens
• Stochastic Meta-Descent [Vishwanathan 06]
– Stochastic gradient optimization for training
– Take gradient steps with small batches of examples
– An order of magnitude faster than L-BFGS
– Same resulting accuracies for extraction

Further Reading about CRFs
Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006. http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
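To connect the training equations above to code, here is a generic sketch (not from the tutorial) of the forward recursion that computes the normalizer log Z_o for a linear-chain CRF; the expected feature counts in the gradient come from the forward-backward extension of the same recursion. The potential tensor below is assumed to have been built already from the features f_k and weights lambda_k.

```python
import numpy as np

def log_partition(log_potentials):
    """log Z_o for a linear-chain CRF.

    log_potentials[t, i, j] = sum_k lambda_k * f_k(s_t = j, s_{t-1} = i, o, t),
    i.e. the log-score of moving from state i to state j at position t.
    Shape (T, S, S); the t = 0 slice is assumed to fold in any start scores
    via a fixed dummy start state 0. This is the same O(T * S^2) dynamic
    program that makes exact inference tractable on a chain."""
    T, S, _ = log_potentials.shape
    # alpha[j] = log-sum of scores of all prefixes ending in state j.
    alpha = log_potentials[0, 0, :]
    for t in range(1, T):
        # logsumexp over the previous state, for every next state.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_potentials[t], axis=0)
    return np.logaddexp.reduce(alpha)

# Toy example: 4 positions, 3 states, random feature scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 3, 3))
print(log_partition(scores))
```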