FlashExtract : A General Framework for Data Extraction by Examples Vu Le (UC Davis) Sumit Gulwani (MSR)


FlashExtract: A General Framework for Data Extraction by Examples
Vu Le (UC Davis)
Sumit Gulwani (MSR)
motivation
demo
schema extraction program
o Output schema
o Field extraction programs for all fields in the schema
output schema
o XML-like: sequence and structure
  Seq([blue] Struct(Name: [green] String, City: [yellow] String))
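A minimal sketch of how such a schema could be held in memory, assuming one small class per schema constructor. The class names `Seq`, `Struct`, and `Leaf` and the helper `field_colors` are hypothetical and are not taken from the FlashExtract implementation; colors are the highlighting colors that identify each field.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    color: str   # highlighting color that identifies this field
    type: str    # leaf type, e.g. "String"


@dataclass
class Struct:
    color: str                    # color of each structure instance
    fields: Dict[str, "Schema"]   # named child fields


@dataclass
class Seq:
    element: "Schema"             # schema of every element in the sequence


Schema = Union[Leaf, Struct, Seq]

# Seq([blue] Struct(Name: [green] String, City: [yellow] String))
schema = Seq(Struct("blue", {
    "Name": Leaf("green", "String"),
    "City": Leaf("yellow", "String"),
}))


def field_colors(node: Schema) -> List[str]:
    """List the field colors in the order they appear in the schema."""
    if isinstance(node, Seq):
        return field_colors(node.element)
    if isinstance(node, Struct):
        colors = [node.color]
        for child in node.fields.values():
            colors.extend(field_colors(child))
        return colors
    return [node.color]


print(field_colors(schema))
```

Each color in the list is a field that needs its own field extraction program.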
field extraction program
o An ancestor
o A program in the DSL
o Examples
  o Green = <Blue, PRegion>
  o Yellow = <, PSeqRegion>
data extraction DSL
o DSL is a tuple (G, N1, N2)
  o G : grammar defining extraction strategies
  o N1 : top-level SeqRegion nonterminal
  o N2 : top-level Region nonterminal
o Each non-terminal has a learn method
core algebra
o Decomposable Map Operator
o Filter Operators
o Merge Operator
o Pair Operator
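The operators can be read as ordinary sequence combinators. The sketch below is one plausible reading, not the definitions used by FlashExtract: the function names and the exact semantics are assumptions. In particular, it assumes FilterInt keeps every `step`-th element starting at `init`, Merge combines sequences in sorted order, and Pair applies two sub-expressions to the same input.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def map_op(f: Callable[[A], B], seq: Sequence[A]) -> List[B]:
    """Map: apply a scalar expression to each element of a sequence."""
    return [f(x) for x in seq]


def filter_bool(pred: Callable[[A], bool], seq: Sequence[A]) -> List[A]:
    """FilterBool: keep the elements that satisfy a Boolean predicate."""
    return [x for x in seq if pred(x)]


def filter_int(init: int, step: int, seq: Sequence[A]) -> List[A]:
    """FilterInt (assumed semantics): keep elements init, init+step, ..."""
    return list(seq[init::step])


def merge(*seqs: Sequence[int]) -> List[int]:
    """Merge (assumed semantics): combine several sequences in sorted order."""
    return sorted(x for s in seqs for x in s)


def pair(f: Callable[[A], B], g: Callable[[A], C]) -> Callable[[A], Tuple[B, C]]:
    """Pair: build one expression that returns the results of two others."""
    return lambda x: (f(x), g(x))


lines = ["Redmond, WA", "12 Oak St", "Renton, WA"]
wa_lines = filter_bool(lambda l: l.endswith("WA"), lines)
print(map_op(pair(lambda l: 0, lambda l: l.index(",")), wa_lines))
```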
city example
1. Filter lines that end with “WA”
2. Map each selected line to a pair of positions
3. Learn two leaf exprs for the two positions
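The three steps can be sketched in plain Python. The sample document is made up for illustration, and the two position functions `start_pos` and `end_pos` are hand-written stand-ins for the leaf expressions that the synthesizer would learn.

```python
# Hypothetical input document; not taken from the talk.
doc = """Ana Diaz
12 Oak St
Redmond, WA
(425) 555-0101

Bo Chen
9 Elm Ave
Renton, WA
(425) 555-0102"""

lines = doc.splitlines()

# Step 1: filter lines that end with "WA".
selected = [line for line in lines if line.endswith("WA")]


# Step 3 (stand-ins): two leaf expressions, one per position.
def start_pos(line: str) -> int:
    return 0                  # the city starts at the beginning of the line


def end_pos(line: str) -> int:
    return line.index(",")    # the city ends just before the first comma


# Step 2: map each selected line to a pair of positions.
pairs = [(start_pos(line), end_pos(line)) for line in selected]

# The extracted regions are the substrings between the two positions.
cities = [line[s:e] for line, (s, e) in zip(selected, pairs)]
print(cities)
```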
learning algorithm
o Inductive on the grammar structure
o Learn city = learn a map operator
  o The lines that hold the city
  o The pair that identifies the city within a line
o Learn lines = learn a Boolean filter
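As an illustration of learning a Boolean filter from examples, the sketch below enumerates "line ends with s" predicates and keeps those consistent with the example lines. The function `learn_line_filter`, the restriction to suffix predicates, and the sample lines are assumptions made for this sketch; the real predicate language is richer.

```python
from typing import List, Sequence


def learn_line_filter(lines: Sequence[str],
                      positive_idx: Sequence[int],
                      negative_idx: Sequence[int] = ()) -> List[str]:
    """Return every suffix s such that 'line ends with s' is true for all
    positive example lines and false for all negative example lines."""
    positives = [lines[i] for i in positive_idx]
    negatives = [lines[i] for i in negative_idx]
    first = positives[0]
    # Candidate suffixes are taken from the first positive line, longest first.
    candidates = [first[k:] for k in range(len(first))]
    return [s for s in candidates
            if all(line.endswith(s) for line in positives)
            and not any(line.endswith(s) for line in negatives)]


# Hypothetical document; the user highlighted cities on lines 2 and 6.
lines = ["Ana Diaz", "12 Oak St", "Redmond, WA", "",
         "Bo Chen", "9 Elm Ave", "Renton, WA"]

suffixes = learn_line_filter(lines, positive_idx=[2, 6])
print(suffixes)
```

Several predicates remain consistent with the same two examples, which is why the candidates need to be ranked.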
inductive synthesis
1. Problem Definition: Identify a vertical domain of tasks that users struggle with
2. Domain-Specific Language (DSL): Design a DSL that can succinctly describe tasks in that domain
3. Synthesis Algorithm: Develop an algorithm that can efficiently translate examples into likely programs in DSL
4. Machine Learning: Rank the various programs
5. User Interface: Provide an appropriate interaction mechanism to resolve ambiguities
pros & cons
o Advantages
  o Efficient synthesizer
  o Easier ranking control
  o Tighter integration with user interaction model
o Disadvantages
  o Non-constructive: requires thinking & implementation
  o Non-modular: DSL is not extensible
inductive meta-synthesis
o A synthesizer for a related family of DSLs that supports a common user interaction model
o Alleviates disadvantages of the generic methodology
inductive meta-synthesis
o Identify a family of vertical task domains
o Design an algebra for DSLs
o Implement a search algorithm for each algebra operator
extraction meta-synthesis
o Identify a family of vertical task domains
  o Extraction of semi-structured documents
o Design an algebra for DSLs
  o Merge, Map, FilterBool, FilterInt, Pair
o Implement a search algorithm for each algebra operator
  o Compositional and inductive learners
synthesis algorithm
o Top-down
  o Top-level SeqRegion, Region symbols N1, N2
o Grammar-guided
  o Grammar built from the algebra operators
key insight
o Reduce learning task for an expression to learning tasks for its sub-expressions
o Example: Learn Map(λx : F, S)
  o Learn the scalar expression F
  o Learn the sequence expression S
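The decomposition can be sketched as a learner for Map that only calls the learners of its two sub-expressions and combines their results. Everything here is a toy assumption: the candidate pools `SCALAR_POOL` and `SEQ_POOL`, the function names, and the sample lines are hypothetical, and a real learner searches a grammar rather than a fixed pool.

```python
from itertools import product
from typing import List, Sequence, Tuple

LINES = ["Ana Diaz", "12 Oak St", "Redmond, WA",
         "Bo Chen", "9 Elm Ave", "Renton, WA"]

# Hypothetical candidate scalar expressions F (line -> extracted value).
SCALAR_POOL = {
    "before_comma": lambda line: line.split(",")[0],
    "first_word": lambda line: line.split()[0],
    "whole_line": lambda line: line,
}

# Hypothetical candidate sequence expressions S (document -> lines).
SEQ_POOL = {
    "ends_with_WA": lambda lines: [l for l in lines if l.endswith("WA")],
    "contains_comma": lambda lines: [l for l in lines if "," in l],
    "all_lines": lambda lines: list(lines),
}


def learn_scalar(pairs: Sequence[Tuple[str, str]]) -> List[str]:
    """Learn F: keep scalar candidates that map every input to its output."""
    return [name for name, f in SCALAR_POOL.items()
            if all(f(x) == y for x, y in pairs)]


def learn_sequence(lines, positives, negatives) -> List[str]:
    """Learn S: keep candidates that select all positive lines, no negative."""
    result = []
    for name, s in SEQ_POOL.items():
        selected = s(lines)
        if (all(p in selected for p in positives)
                and not any(n in selected for n in negatives)):
            result.append(name)
    return result


def learn_map(lines, examples, negative_lines) -> List[Tuple[str, str]]:
    """Learn Map(F, S) by learning F and S separately, then combining."""
    scalar_programs = learn_scalar(examples)
    sequence_programs = learn_sequence(
        lines, [x for x, _ in examples], negative_lines)
    return list(product(scalar_programs, sequence_programs))


def run_map(program: Tuple[str, str], lines) -> List[str]:
    f_name, s_name = program
    return [SCALAR_POOL[f_name](x) for x in SEQ_POOL[s_name](lines)]


# One positive example (a city inside its line) and one negative line.
programs = learn_map(LINES, [("Redmond, WA", "Redmond")], ["Ana Diaz"])
print(programs)
print(run_map(programs[0], LINES))
```

The Map learner never inspects the internals of F or S, so each operator's learner can be written once and reused across DSLs built from the same algebra.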
instantiations
o Text files
o Web pages
o Spreadsheets
demo
evaluation
o Can FlashExtract extract data from real-world files?
o How many interactions are typically required?
o How efficient/real-time is FlashExtract?
expressiveness
benchmarks
o 25 text files
  o System log files
  o Copied texts from web pages and PDFs
  o Samples from “Pro Perl Parsing”
o 25 webpages from [1]
  o Add two more test cases for each web page
o 25 spreadsheets
  o 7 from [2] that are applicable for extracting
  o 18 from EUSES corpus
[1] E. Oro, M. Ruffolo, and S. Staab. SXPath: Extending XPath towards spatial querying on web documents. Proc. VLDB Endow., 2010.
[2] B. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.
effectiveness
o Can FlashExtract extract data from real-world files?
  Yes
o How many interactions are typically required?
  2.36 examples
efficiency
o How efficient/real-time is FlashExtract?
  0.82s last interaction
conclusion
o Inductive meta-synthesis
o FlashExtract is general
  o Text file, web page, spreadsheet instantiations
o FlashExtract is practical
  o Extract real-world data, in real time, within a few examples
thank you
Questions?