Efficiently Incorporating User Feedback
into Information Extraction and
Integration Programs
Xiaoyong Chai, Ba-Quy Vuong,
AnHai Doan, Jeffrey F. Naughton
University of Wisconsin-Madison
The Need for Incorporating User Feedback
Panels Chair
Current Approach
Code
Data
…
This Is Not Just For DBLife

A growing number of applications use IE and II
– Avatar@IBM Almaden
– AliBaba@Humboldt Univ. of Berlin
– YAGO@MPI
– Kylin@Univ. of Washington
– …
A systematic user-feedback solution could significantly
benefit them
What User Feedback To Incorporate?
Types of User Feedback
– Flagging an Error
– Fixing an Error
– Editing Data: Input, Intermediate Results, Output
– Editing Code
Challenges

How to expose program data for user feedback?

How to incorporate user feedback?

How to efficiently execute a program?
Exposing Program Data for User Feedback

Extracting conference services
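[Figure: the extraction workflow, with views and user interfaces defined over its tables]
Operators: crawl, extractNames, extractConf, findRoles
dataSources(url, date): (http://.../cidr09/, 09/01/2008), …
roles(name, role, page): …
services(name, conf, role): (Joe Hellerstein, CIDR 2009, PC Chair), …
Views and User Interfaces
– Form over dataSources (url)
– Spreadsheet over roles (name, role, page)
– Wiki over services (name, conf, role)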
Writing User-Feedback Rules
to Expose Program Data

Write extraction program, e.g., in xlog [Shen et al, 07]
R1: pages(page) :- dataSources(url, date), crawl(url, page)
R2: conferences(conf, page) :- pages(page), extractConf(page, conf)
R3: names(name, page) :- pages(page), extractNames(page, name)
R4: roles(name, role, page) :- names(name, page), findRoles(name, page, role)
R5: services(name, conf, role) :- conferences(conf, page), roles(name, role, page)

Write user-feedback rules to specify views and user interfaces
R6: dataSourcesForUserFeedback(url)#form-UI
:- dataSources(url, date), date >= “01/01/2009”
R7: rolesForUserFeedback(name, role, page#no-edit)#spreadsheet-UI :- roles(name, role, page)
R8: servicesForUserFeedback(name, conf, role)#wiki-UI :- services(name, conf, role)
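As a rough illustration, rules R1–R5 and two of the views (R6 and R8) can be mimicked in Python with set comprehensions. This is only a sketch and not the xlog engine: the operator stubs, the CRAWLED and EXTRACTED data, the example URL, and the use of ISO date strings are all assumptions made for the example.

```python
# Stub operators standing in for the blackbox IE operators (hypothetical data).
CRAWLED = {"http://example.org/conf09/": ["page1"]}
EXTRACTED = {
    "page1": {
        "conf": ["CONF 2009"],
        "roles": {"A. Smith": ["PC Chair"], "A. Jones": ["PC Member"]},
    }
}

def crawl(url):
    return CRAWLED.get(url, [])

def extract_conf(page):
    return EXTRACTED[page]["conf"]

def extract_names(page):
    return list(EXTRACTED[page]["roles"])

def find_roles(name, page):
    return EXTRACTED[page]["roles"].get(name, [])

def run_program(data_sources):
    """Evaluate R1-R5 bottom-up; every table is a set of tuples."""
    pages = {(p,) for (url, _date) in data_sources for p in crawl(url)}   # R1
    conferences = {(c, p) for (p,) in pages for c in extract_conf(p)}    # R2
    names = {(n, p) for (p,) in pages for n in extract_names(p)}         # R3
    roles = {(n, r, p) for (n, p) in names for r in find_roles(n, p)}    # R4
    services = {(n, c, r)                                                # R5: join on page
                for (c, p) in conferences
                for (n, r, p2) in roles if p == p2}
    return {"dataSources": set(data_sources), "roles": roles, "services": services}

def feedback_views(tables, cutoff="2009-01-01"):
    """R6 and R8 as a selection/projection; dates are ISO strings in this sketch."""
    return {
        "dataSourcesForUserFeedback": {
            (url,) for (url, date) in tables["dataSources"] if date >= cutoff
        },
        "servicesForUserFeedback": set(tables["services"]),
    }

tables = run_program([("http://example.org/conf09/", "2009-02-01")])
views = feedback_views(tables)
```

Each view is just a query over a program table, which is what lets a UI such as the form or the wiki display it and send edits back to the underlying table.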
Program Semantics
[Figure: the conference-services workflow (dataSources, crawl, extractNames, extractConf, findRoles, roles, services) with the Form, Spreadsheet, and Wiki views attached to its tables]
Incorporating Previous User Feedback
[Figure: operator p maps input I to output O; the feedback t → t’ changes O into O’]
Interpretation: for operator p, if t is in the output, change t into t’
Example: extractNames on page p1 (“… D. Smith, A. Jones, ...”) outputs the names “A. Smith” and “A. Jones”
User feedback: change “A. Smith” to “D. Smith”
On page p2 (“Dr. A. Smith is ...”), extractNames outputs the name “A. Smith”
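The interpretation "if t is in the output, change t into t’" can be sketched as a rewrite applied to whatever the operator outputs. The stub operator and its data below are hypothetical. Assuming the name extracted from p2 was actually correct, the sketch shows how this reading also rewrites that tuple.

```python
# Hypothetical output of a stub extractNames operator, per input page.
EXTRACTED = {"p1": ["A. Smith", "A. Jones"], "p2": ["A. Smith"]}

def extract_names(pages):
    return [(name, page) for page in pages for name in EXTRACTED[page]]

def apply_feedback(output, feedback):
    """Rewrite every output tuple whose name matches, wherever it came from."""
    return [(feedback.get(name, name), page) for name, page in output]

feedback = {"A. Smith": "D. Smith"}  # recorded when the user edited the p1 result

first_run = apply_feedback(extract_names(["p1"]), feedback)
second_run = apply_feedback(extract_names(["p1", "p2"]), feedback)
# second_run also contains ("D. Smith", "p2"): the edit spilled over to page p2.
```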
Interpreting User Feedback Based On
Tuple Provenance

Provenance of output tuple t :
– the set of input tuples that operator p used to produce t
Example: extractNames on page p1 outputs (A. Smith, p1) and (A. Jones, p1)
Feedback: change “A. Smith” to “D. Smith”
Interpretation: if the operator produces {“A. Smith”, “A. Jones”} from {p1},
then replace {“A. Smith”, “A. Jones”} with {“D. Smith”, “A. Jones”}
With pages p1 and p2 as input, the output tuples are (A. Smith, p1), (A. Jones, p1), (A. Smith, p2)
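A minimal sketch of the provenance-based interpretation: the stored feedback is keyed on the input that produced the tuples together with the original output, so it fires only when the operator produces the same output from the same input. It assumes each name's provenance is a single page. The stub operator and its data are hypothetical.

```python
# Hypothetical output of a stub extractNames operator, per input page.
EXTRACTED = {"p1": ["A. Smith", "A. Jones"], "p2": ["A. Smith"]}

def extract_names(page):
    return EXTRACTED[page]

# Feedback record: (provenance, original output) -> corrected output.
feedback = {
    ("p1", frozenset({"A. Smith", "A. Jones"})): frozenset({"D. Smith", "A. Jones"}),
}

def run_with_feedback(pages, feedback):
    result = set()
    for page in pages:
        produced = frozenset(extract_names(page))
        # Apply the correction only if this exact output came from this exact input.
        corrected = feedback.get((page, produced), produced)
        result |= {(name, page) for name in corrected}
    return result

names = run_with_feedback(["p1", "p2"], feedback)
```

Here the tuple extracted from p2 is left alone, because its provenance does not match the recorded feedback.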
Challenges

How to expose program data for user feedback?

How to incorporate user feedback?

How to efficiently execute a program?
– Incremental execution
– Improved concurrency control
Incrementally Executing the Program
[Figure: extractNames has already run on pages p1 and p2; page p3 is added to the input. What is the new name output?]
extractNames(I + ΔI) = extractNames(I) + extractNames(ΔI)

Similar problem in incremental view maintenance

Incremental-update properties
– Closed-form insertion
– Closed-form deletion
– Input partitionability
– Partition correlation
– Attribute independence
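The identity extractNames(I + ΔI) = extractNames(I) + extractNames(ΔI) suggests caching earlier results and running the blackbox only on the new input. The sketch below assumes the operator treats each page independently, which is one reading of the properties listed above and is not spelled out on the slide. The class name, the call counter, and the data are made up.

```python
class IncrementalOperator:
    """Caches per-page results so that only the delta is (re)computed."""

    def __init__(self, operator):
        self.operator = operator  # blackbox function: page -> iterable of names
        self.cache = {}           # page -> set of names extracted from that page
        self.calls = 0            # number of blackbox invocations so far

    def insert(self, delta_pages):
        # p(I + dI) = p(I) + p(dI): run the blackbox on the new pages only.
        for page in delta_pages:
            if page not in self.cache:
                self.cache[page] = set(self.operator(page))
                self.calls += 1

    def delete(self, delta_pages):
        # Drop the results contributed by the removed pages; no blackbox call needed.
        for page in delta_pages:
            self.cache.pop(page, None)

    def output(self):
        return {(name, page) for page, names in self.cache.items() for name in names}

NAMES_ON_PAGE = {"p1": ["A. Smith", "A. Jones"], "p2": ["A. Smith"], "p3": ["B. Lee"]}
inc = IncrementalOperator(lambda page: NAMES_ON_PAGE[page])
inc.insert(["p1", "p2"])   # initial run: two blackbox calls
inc.insert(["p3"])         # later update: one more call, p1 and p2 are not re-extracted
```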
Concurrently Executing Transactions
[Figure: transactions T1 and T2 running concurrently over the workflow (dataSources, crawl, extractNames, extractConf, findRoles, roles, services)]
Table-Locking
– Locks only the input and output tables of the crawl operator
Operator-Skipping
– Skips executing the join operator after updating the roles table
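Table-locking can be sketched with one lock per table, where a transaction that runs an operator takes only the locks of that operator's input and output tables. The table names follow rules R1–R5. The lock registry, the helper names, and the fixed acquisition order are assumptions of this sketch, not details from the slides.

```python
import threading
from contextlib import ExitStack, contextmanager

# One lock per table in the workflow.
LOCKS = {name: threading.Lock() for name in
         ["dataSources", "pages", "conferences", "names", "roles", "services"]}

# Input and output table of each operator, read off the rule bodies and heads.
OPERATOR_TABLES = {
    "crawl": ("dataSources", "pages"),
    "extractNames": ("pages", "names"),
    "findRoles": ("names", "roles"),
}

@contextmanager
def table_locks(operator):
    """Hold locks on just the input and output tables of one operator."""
    with ExitStack() as stack:
        # Acquiring in a fixed (sorted) order avoids deadlock between transactions.
        for table in sorted(OPERATOR_TABLES[operator]):
            stack.enter_context(LOCKS[table])
        yield

def can_start(operator):
    """True if none of the tables the operator touches is currently locked."""
    return not any(LOCKS[t].locked() for t in OPERATOR_TABLES[operator])
```

While one transaction holds `table_locks("crawl")`, `can_start("findRoles")` is still true, since findRoles touches neither dataSources nor pages, whereas `can_start("extractNames")` is false.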
Experiment Setup

Testbed
– A 5-stage DBLife workflow
– 13 blackbox operators: 6 IE operators and 3 II operators

Wrote xlog program and user-feedback rules in < 1 hr

Simulated user-feedback transactions
– On each stage of the workflow
– Each transaction randomly deletes, inserts, or modifies
1/10 of the tuples in a table
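One way such a simulated transaction could be generated is sketched below. It assumes each transaction picks a single action at random and applies it to one tenth of the table; the slides do not say whether actions are mixed within a transaction. The function name and the synthetic row values are made up.

```python
import random

def simulate_transaction(table, rng, divisor=10):
    """Randomly delete, insert, or modify 1/divisor of the tuples in a table."""
    rows = list(table)
    k = len(rows) // divisor
    action = rng.choice(["delete", "insert", "modify"])
    if action == "delete":
        victims = set(rng.sample(range(len(rows)), k))
        rows = [row for i, row in enumerate(rows) if i not in victims]
    elif action == "insert":
        arity = len(rows[0]) if rows else 1
        rows.extend(("synthetic-%d" % i,) * arity for i in range(k))
    else:
        # Modify: alter the first attribute of k randomly chosen tuples.
        for i in rng.sample(range(len(rows)), k):
            rows[i] = (str(rows[i][0]) + " (edited)",) + tuple(rows[i][1:])
    return action, rows

table = [("name-%d" % i, "role-%d" % i) for i in range(50)]
action, new_rows = simulate_transaction(table, random.Random(0))
```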
Incremental-Update Properties are
Broadly Applicable
DBLife Operators
Get Data Pages
Get People Variations
Get Publication Variations
Get Organization Variations
Find People Variations
Find Publication Variations
Find Organization Variations
Find People Entities
Find Publication Entities
Find Organization Entities
Find Related People
Find Authorship
Find Related Organizations
Inc. Update Properties: ci, cd, ip, ai, pc
(ci: closed-form insertion, cd: closed-form deletion, ip: input partitionability, ai: attribute independence, pc: partition correlation)
Incremental Update
Reduces Execution Time
Table-Locking and Operator-Skipping
Improve Concurrency Degree

Reduce transaction response time by 43% and 98%

           Graph-locking   Table-locking   Operator-skipping
Min        ~0s             1s              ~0s
Max        7,584s          5,485s          457s
Average    3,203s          1,841s          43s
                           -43%            -98%
Increase transaction throughput by 50% and 500%
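The response-time reductions appear to be derived from the Average row, with graph-locking as the baseline; that reading is an assumption. A quick check of the arithmetic:

```python
# Average transaction response times in seconds, from the table above.
average = {"graph-locking": 3203, "table-locking": 1841, "operator-skipping": 43}

def reduction_percent(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline

table_locking_gain = reduction_percent(average["graph-locking"], average["table-locking"])
operator_skipping_gain = reduction_percent(average["graph-locking"], average["operator-skipping"])
```

This gives roughly 42.5% for table-locking and 98.7% for operator-skipping, in line with the rounded -43% and -98% reported on the slide.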
Related Work

User feedback in IE and II
– [Doan et al, 01], [Chiticariu et al, 08], [Jeffery et al, 08]
– Leveraging user feedback to improve results of individual operations

Provenance
– [Woodruff & Stonebraker, 97], [Cui & Widom, 01], [Buneman et al, 01],
[Bohannon et al, 08], [Huang et al, 08]

Incremental execution
– View maintenance [Blakeley et al, 86], [Griffin & Libkin, 95], [Gupta &
Mumick, 95]
– Schema matching [Bernstein et al, 06], IE [Chen et al, 07]
Conclusions and Future Work

Incorporating user feedback into IE and II programs
is important

Identify key issues and provide initial solutions:
– Write user-feedback rules to expose program data to UIs
– Model and incorporate user feedback
– Efficiently execute program to process user feedback

Future work:
– Handle unreliable user feedback
– Propagate user feedback down in the workflow
– Conduct user study