Transcript Information Extraction
Information Extraction from Web Documents
CS 652 Information Extraction and Integration
Li Xu, Yihong Ding
IR and IE
IR (Information Retrieval):
  Retrieves relevant documents from collections
  Information theory, probabilistic theory, and statistics
IE (Information Extraction):
  Extracts relevant information from documents
  Machine learning, computational linguistics, and natural language processing
History of IE
Large amount of both online and offline textual data.
Message Understanding Conference (MUC):
  Quantitative evaluation of IE systems
  Tasks:
    Latin American terrorism
    Joint ventures
    Microelectronics
    Company management changes
Evaluation Metrics
Precision
Recall
F-measure
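The three metrics can be made concrete with a small sketch. It assumes the usual simplified definitions (precision = correct extractions / all extractions, recall = correct extractions / all items in the answer key, F-measure = weighted harmonic mean of the two). The function name `prf` and the example sets are hypothetical, and partial-credit scoring is not modeled.

```python
def prf(extracted, gold, beta=1.0):
    """Return (precision, recall, F) for extracted items scored against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # items that are both extracted and in the key
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta  # beta > 1 weights recall more heavily, beta < 1 weights precision
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical answer key and system output for one document.
gold = {"Carlos", "Parliament building", "bombing"}
extracted = {"Carlos", "Parliament building", "Madrid", "1993"}
p, r, f = prf(extracted, gold)
```

With beta = 1 the F-measure reduces to 2PR / (P + R), so a system cannot score well by maximizing only one of the two.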
Web Documents
Unstructured (Free) Text:
  Regular sentences and paragraphs
  Linguistic techniques, e.g., NLP
Structured Text:
  Itemized information
  Uniform syntactic clues, e.g., table understanding
Semistructured Text:
  Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
  Specialized programs, e.g., wrappers
Approaches to IE
Knowledge Engineering:
  Grammars are constructed by hand
  Domain patterns are discovered by human experts through introspection and inspection of a corpus
  Much laborious tuning and “hill climbing”
Machine Learning:
  Use statistical methods when possible
  Learn rules from annotated corpora
  Learn rules from interaction with user
Knowledge Engineering
Advantages:
  With skills and experience, well-performing systems are not conceptually hard to develop.
  The best performing systems have been handcrafted.
Disadvantages:
  Very laborious development process
  Some changes to specifications can be hard to accommodate
  Required expertise may not be available
Machine Learning
Advantages:
  Domain portability is relatively straightforward
  System expertise is not required for customization
  “Data driven” rule acquisition ensures full coverage of examples
Disadvantages:
  Training data may not exist, and may be very expensive to acquire
  Large volume of training data may be required
  Changes to specifications may require reannotation of large quantities of training data
Wrapper
A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables)
Challenge: recognizing the data of interest among many other uninteresting pieces of text
Tasks:
  Source understanding
  Data processing
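As a rough illustration of what such a program does, the sketch below picks the records of interest out of a small page and maps them to XML. The page, the regular expression, and the element names (`cars`, `car`, `model`, `year`, `price`) are all hypothetical; a real wrapper would be written or learned for a specific source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source page: two records of interest among unrelated text.
PAGE = """
<html><body>
<h1>Used cars</h1>
<p>Call now for great deals!</p>
<ul>
<li><b>Honda Civic</b> 1998 $4,500</li>
<li><b>Ford Taurus</b> 1996 $3,200</li>
</ul>
</body></html>
"""

# Source understanding: each record is a list item with model, year, and price.
ITEM = re.compile(
    r"<li><b>(?P<model>[^<]+)</b>\s*(?P<year>\d{4})\s*\$(?P<price>[\d,]+)</li>"
)


def wrap(page):
    """Data processing: map every matched record to a <car> element."""
    root = ET.Element("cars")
    for match in ITEM.finditer(page):
        car = ET.SubElement(root, "car")
        for field in ("model", "year", "price"):
            ET.SubElement(car, field).text = match.group(field)
    return root


xml_text = ET.tostring(wrap(PAGE), encoding="unicode")
```

The heading and the advertising paragraph are ignored because they do not match the record pattern, which is the "data of interest" challenge in miniature.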
Free Text
AutoSlog, LIEP, PALKA, HASTEN, CRYSTAL, Webfoot, WHISK
AutoSlog [1993]
The Parliament building was bombed by Carlos.
LIEP [1995]
The Parliament building was bombed by Carlos.
PALKA [1995]
The Parliament building was bombed by Carlos.
HASTEN [1995]
The Parliament building was bombed by Carlos.
Egraphs (SemanticLabel, StructuralElement)
CRYSTAL [1995]
The Parliament building was bombed by Carlos.
CRYSTAL + Webfoot [1997]
WHISK [1999]
The Parliament building was bombed by Carlos.
WHISK Rule: *(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)}
Context-based patterns
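A context-based pattern of this kind can be loosely approximated with an ordinary regular expression. This is only an analogy, not WHISK's rule language: the syntactic tags (@passive, PP) and semantic classes (PhyObj, Person) are replaced here by literal context words and a capitalized-word test, and the names `PATTERN`, `target`, and `perpetrator` are hypothetical.

```python
import re

# The literal context "was bombed by" plays the role of the rule's anchors;
# the two named groups play the role of the two extraction slots.
PATTERN = re.compile(
    r"(?P<target>.+?)\s+was\s+bombed\s+by\s+(?P<perpetrator>[A-Z]\w*)"
)


def extract(sentence):
    """Return both slots as a dict, or None when the context is absent."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None


result = extract("The Parliament building was bombed by Carlos.")
no_match = extract("Carlos left the Parliament building.")
```

Both slots are filled by a single rule application, which is what makes the pattern multi-slot.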
Web Documents
Semistructured and Unstructured:
  RAPIER (E. Califf, 1997)
  SRV (D. Freitag, 1998)
  WHISK (S. Soderland, 1998)
Semistructured and Structured:
  WIEN (N. Kushmerick, 1997)
  SoftMealy (C-H. Hsu, 1998)
  STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
Inductive Learning
Task:
  Inductive inference
Learning systems:
  Zero-order
  First-order, e.g., Inductive Logic Programming (ILP)
RAPIER [1997]
Inductive Logic Programming
Extraction rules:
  Syntactic information
  Semantic information
Advantage:
  Efficient learning (bottom-up)
Drawback:
  Single-slot extraction
RAPIER Rule
SRV [1998]
Relational algorithm (top-down)
Features:
  Simple features (e.g., length, character type, …)
  Relational features (e.g., next-token, …)
Advantages:
  Expressive rule representation
Drawbacks:
  Single-slot rule generation
  Large volume of training data
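The two kinds of features can be sketched as follows. A simple feature maps a token to a value, while a relational feature maps a token to another token, to which simple features can then be applied. This is an illustrative sketch, not SRV's actual feature set: the function names, the character-type categories, and the example sentence are assumptions.

```python
def char_type(token):
    """Simple feature: a coarse character class for one token."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "capitalized" if token[0].isupper() else "lowercase"
    return "other"


def simple_features(token):
    """Simple features look only at the token itself."""
    return {"length": len(token), "char_type": char_type(token)}


def next_token(tokens, i):
    """Relational feature: the token that follows position i, if any."""
    return tokens[i + 1] if i + 1 < len(tokens) else None


def prev_token(tokens, i):
    """Relational feature: the token that precedes position i, if any."""
    return tokens[i - 1] if i > 0 else None


tokens = "Seminar at 3 pm in Room 101".split()
i = tokens.index("3")
features = simple_features(tokens[i])
following = simple_features(next_token(tokens, i))
```

A learned rule could then test, for example, that a candidate token is numeric and that its next token is a short lowercase word.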
SRV Rule
WHISK [1998]
Covering algorithm (top-down)
Advantages:
  Learns multi-slot extraction rules
  Handles various orders of items to be extracted
  Handles document types from free text to structured text
Drawbacks:
  Must see all the permutations of items
  Less expressive feature set
  Needs a large volume of training data
WHISK Rule
WIEN [1997]
Assumes items are always in a fixed, known order
Introduces several types of wrappers
Advantages:
  Fast to learn and extract
Drawbacks:
  Cannot handle permutations and missing items
  Must label entire pages
  Does not use semantic classes
WIEN Rule
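A minimal sketch of how an LR (left-right) wrapper, one of the wrapper types WIEN introduces, might be executed: each attribute has a left and a right delimiter, and the attributes are read in a fixed order, tuple after tuple. The function name `lr_extract`, the delimiters, and the example page are illustrative assumptions, and the learning step that finds the delimiters is not shown.

```python
def lr_extract(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    Attributes are assumed to appear in the same fixed order in every tuple.
    """
    if not delimiters:
        return []
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated attribute: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical page listing countries and their calling codes.
PAGE = (
    "<HTML><BODY>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "</BODY></HTML>"
)
rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
```

Because the loop always looks for the next attribute's left delimiter after the previous one, a missing or reordered item would shift every later value, which is the drawback listed for WIEN.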
SoftMealy [1998]
Learns a transducer
Advantages:
  Learns order of items
  Allows item permutations and missing items
  Allows both the use of semantic classes and disjunctions
Drawbacks:
  Must see all possible permutations
  Cannot use delimiters that do not immediately precede and follow the relevant items
SoftMealy Rule
STALKER [1998,1999,2001]
Hierarchical information extraction
Embedded Catalog Tree (ECT) formalism
Advantages:
  Extracts nested data
  Allows item permutations and missing items
  Need not see all of the permutations
  One hard-to-extract item does not affect others
Drawbacks:
  Does not exploit item order
STALKER Rule
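STALKER rules are built from landmarks that are consumed with SkipTo(...). The sketch below is a simplified illustration under stated assumptions: landmarks are literal strings only (no wildcards), the end of an item is found by scanning forward from its start rather than with a separate end rule, and the function names and the example document are hypothetical.

```python
def skip_to(text, landmarks, pos=0):
    """Apply SkipTo(l1), SkipTo(l2), ... in order.

    Returns the index just past the last landmark, or -1 if one is missing.
    """
    for landmark in landmarks:
        hit = text.find(landmark, pos)
        if hit == -1:
            return -1
        pos = hit + len(landmark)
    return pos


def extract(text, start_landmarks, end_landmark):
    """Extract one item independently of every other item."""
    start = skip_to(text, start_landmarks)
    if start == -1:
        return None
    end = text.find(end_landmark, start)
    return text[start:end] if end != -1 else None


# Hypothetical restaurant description.
DOC = "<p>Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(800) 555-1212</i>"
name = extract(DOC, ["Name:", "<b>"], "</b>")
phone = extract(DOC, ["Phone:", "<i>"], "</i>")
fax = extract(DOC, ["Fax:", "<i>"], "</i>")
```

Each item has its own rule, so a missing item (here the fax number) or a change in item order does not disturb the extraction of the others.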
Web IE Tools
(main technique used)
Wrapper languages (TSIMMIS, Web-OQL)
HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
NLP-based (RAPIER, SRV, WHISK)
Inductive learning (WIEN, SoftMealy, STALKER)
Modeling-based (NoDoSE, DEByE)
Ontology-based (BYU ontology)
Degree of Automation
Trade-off: page layout dependent
RoadRunner:
  Assumes target pages were automatically generated from some data sources
  The only fully automatic wrapper generator
BYU ontology:
  Manually created with a graphical editing tool
  Extraction process fully automatic
Support of Complex Objects
Complex objects: nested objects, graphs, trees, complex tables, …
Earlier tools such as RAPIER, SRV, WHISK, and WIEN do not support extracting complex objects.
BYU ontology: supported
Page Contents
Semistructured data (table type, richly tagged)
Semistructured text (text type, rarely tagged)
NLP-based tools: text type only
Other tools (except ontology-based): table type only
BYU ontology: both types
Ease of Use
HTML-aware tools: easiest to use
Wrapper languages: hardest to use
Other tools: in the middle
Output
XML is the best output format for data sharing on the Web.
Support for Non-HTML Sources
NLP-based and ontology-based tools: supported automatically
Other tools: may support them, but need additional helpers such as syntactic and semantic analyzers
BYU ontology: supported
Resilience and Adaptiveness
Resilience: continuing to work properly when the target pages change
Adaptiveness: working properly with pages from other sources in the same application domain
Only the BYU ontology has both features.
Summary of Qualitative Analysis
Graphical Perspective of Qualitative Analysis
[Table: systems (rows) by capabilities (columns)]
Rows: WIEN, SoftMealy, STALKER, RAPIER, SRV, WHISK, AutoSlog, RoadRunner, BYU Onto
Columns: Structured, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient
Legend:
  X means the information extraction system has the capability.
  X* means the system has the capability as long as the training corpus can accommodate the required training data.
  ? means the system has the capability to some degree.
  * means the extraction pattern itself doesn’t show the capability, but the overall system has it.
Problem of IE
(unstructured documents)
Diagram labels: Information Extraction; Source, Target; Meaning, Knowledge, Data, Information
Problem of IE
(structured documents)
Diagram labels: Information Extraction; Source, Target; Meaning, Knowledge, Data, Information
Problem of IE
(semistructured documents)
Diagram labels: Information Extraction; Source, Target; Meaning, Knowledge, Data, Information
Solution of IE
(the Semantic Web)
Diagram labels: Information Extraction; Source, Target; Meaning, Knowledge, Data, Information