Transcript slides

PROGRAM BOOSTING: PROGRAM SYNTHESIS VIA CROWD-SOURCING
Robert Cochran
Loris D’Antoni
Benjamin Livshits
David Molnar
Margus Veanes
2
In Search of the Perfect URL Validation Regex
http://mathiasbynens.be/url-regex
Mathias Bynens
“I’m looking for a decent regular expression to validate URLs.” - @mathias
Submissions:
1. @krijnhoetmer
2. @cowboy
3. @mattfarina
4. @stephenhay
5. @scottgonzales
6. @rodneyrehm
7. @imme_emosol
8. @diegoperini
3
Key Insight for Crowd-Sourcing of Programs
Regular expressions
• Most people get easy cases right
• People are good with positive examples
• …but bad at rejecting negative examples, so their solutions are more permissive than they should be
• However, piecing together different solutions will produce a good score on the examples
CrowdBoost
• In this project we apply this intuition to programs
• CrowdBoost
  • Crowd-source initial programs
  • “Blend” them together
  • Refine the result
• We call this program boosting
4
Overview of Program Boosting
Specification
• Textual description
• Open to interpretation
• Specification is not formal, and often elusive and incomplete
Training set
• Provided by whoever defines the task
• Positive and negative examples
• Broad space of inputs, difficult to get full test coverage for
Initial programs
• Get something right
• But usually get something wrong
• Easy to get started, tough to get good precision
5
Outline
• Vision and motivation
• Our approach: CrowdBoost
• Technical details: regular expressions and SFAs
• Experiment setup
• Experimental results
6
CrowdBoost in a nutshell
Specification
7
CrowdBoost Outline
• Crowd-source initial programs
• We use genetic programming approach for blending
• Needed program operations:
1. Shuffles (2 programs => program)
2. Mutations (program => program)
3. Training Set Generation and Refinement (program => new labeled examples)
ID    Label
Ex1   +
Ex2   -
Ex3   +
Ex4   -
8
Example of Program Blending
[Figure: fitness (y-axis, 0 to 0.9) against iteration (x-axis, 0 to 10). Input 1 (0.53) and Input 2 (0.58) are combined through shuffles (0.62, 0.63) and mutations (0.50, 0.60, 0.60, 0.69) until a winner with fitness 0.85 emerges.]
• Need to prevent over-fitting
9
How Do We Measure Quality?
Candidate Fitness
Training Set Coverage
• How well does a program perform on a
given training set?
• S = Examples accepted by program
• P = Positive examples
• N = Negative examples
Accuracy = (|S ∩ P| + |N \ S|) / |P ∪ N|
[Figure: the possible input space, with labeled positive (+) and negative (-) initial examples forming the “Gold Set”, surrounded by unlabeled (?) inputs.]
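A minimal sketch of the fitness measure, assuming accuracy counts positive examples the program accepts plus negative examples it rejects, divided by the number of labeled examples. The function name `accuracy` and the deliberately permissive date regex are illustrative assumptions; the sample strings are the date examples quoted elsewhere in the talk.

```python
import re

def accuracy(program, positives, negatives):
    """Fraction of labeled examples that `program` classifies correctly."""
    labeled = positives | negatives
    accepted = {s for s in labeled if program(s)}   # S, restricted to labeled examples
    true_pos = len(accepted & positives)            # |S ∩ P|
    true_neg = len(negatives - accepted)            # |N \ S|
    return (true_pos + true_neg) / len(labeled)

# Hypothetical candidate: only handles the numeric format and is too permissive.
date_regex = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")

P = {"7/7/2013", "June 7, 2013", "June-7-2013"}
N = {"Junu 7, 2013", "7/77/2013", "Jul-7-2013"}

score = accuracy(lambda s: date_regex.match(s) is not None, P, N)
print(score)
```

The candidate accepts one of three positives and wrongly accepts the negative 7/77/2013, which is the "more permissive than they should be" pattern the talk describes.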
10
Skilled and Unskilled Crowds
Skilled
• More expensive, longer units of work (hours)
• May require multiple rounds of interaction
• Provide initial programs
Unskilled
• Cheaper, smaller units of work (seconds or minutes)
• Automated process for hiring, vetting and retrieving work
• Used to grow/evolve training examples
CrowdBoost
[Diagram: the specification and the initial examples (the “Gold Set”) feed the CrowdBoost loop: select successful candidates, shuffle / mutate, assess fitness (Accuracy = (+/−) / Total), refine training set.]
12
Outline
• Vision and motivation
• Our approach: CrowdBoost
• Technical details: regular expressions and SFAs
• Experiment setup
• Experimental results
13
Working with Regular Expressions
• Our approach is general
• Our results are very specific
• We use a restricted notion of programs
• Regular expressions can be expressed as automata, which permit efficient implementations of key operations
  1. Shuffles
  2. Mutations (positive and negative)
  3. Training Set Generation
• Tradeoff:
  • expressiveness vs. complexity
14
Automata Shuffle: Overview
• Goal: Interleave two automata A and B
• Large number of edges doesn’t scale
  • Very high complexity
• We also don’t want to swap random edges; we want to have an alignment between A and B
• Not all shuffles are successful. Success rates are sometimes less than 1%
[Figure: automata A and B, with marked states i1 and i2.]
15
Shuffle: Example
• Regular expressions for phone numbers
A. ^[0-9]{3}-[0-9]*-[0-9]{4}$
B. ^[0-9]{3}-[0-9]{3}-[0-9]*$
Shuffle:
^[0-9]{3}-[0-9]{3}-[0-9]{4}$
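As a quick check of the example above, the following sketch runs the three regular expressions from the slide against a few phone-number strings. The sample strings are assumptions chosen for illustration; they do not come from the talk's training set.

```python
import re

A = re.compile(r"^[0-9]{3}-[0-9]*-[0-9]{4}$")          # loose middle group
B = re.compile(r"^[0-9]{3}-[0-9]{3}-[0-9]*$")          # loose last group
SHUFFLED = re.compile(r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$")  # strict parts of both

samples = ["555-123-4567", "555-1-4567", "555-123-45"]

# For each sample: (accepted by A, accepted by B, accepted by the shuffle)
results = {
    s: (bool(A.match(s)), bool(B.match(s)), bool(SHUFFLED.match(s)))
    for s in samples
}
for s, row in results.items():
    print(s, row)
```

Each input regex wrongly accepts one malformed number, while the shuffled regex rejects both and still accepts the well-formed one.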
16
Mutation
• Positive Mutation: ftp://foo.com
  • Add edge for “f”
• Negative Mutation: http://#
  • Remove “#”
[Figure: two URL automata. The positive mutation adds an “f” edge next to the “h” edge; the negative mutation changes one [#&().-:=?-Z_a-z] transition to [&().-:=?-Z_a-z].]
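The effect of the two mutations can be illustrated at the regex level. This is a hand-written sketch, not the automaton-level procedure from the talk: the starting regex `original` is a hypothetical candidate built around the character class shown on the slide, and the two mutated regexes are written by hand to mirror "add edge for f" and "remove #".

```python
import re

URL_CHARS = "[#&().-:=?-Z_a-z]"   # character class shown on the slide
NO_HASH = "[&().-:=?-Z_a-z]"      # same class with "#" removed

# Hypothetical starting candidate: rejects ftp URLs, accepts "http://#".
original = re.compile("^https?://" + URL_CHARS + "+$")

# Positive mutation: also accept the missed positive example ftp://foo.com.
positive = re.compile("^(https?|ftp)://" + URL_CHARS + "+$")

# Negative mutation: forbid "#" as the first character after "//".
negative = re.compile("^https?://" + NO_HASH + URL_CHARS + "*$")

before = (bool(original.match("ftp://foo.com")), bool(original.match("http://#")))
after = (bool(positive.match("ftp://foo.com")), bool(negative.match("http://#")))
print(before, after)
```

A positive mutation widens the language to admit a positive example that was rejected; a negative mutation narrows it to exclude a negative example that was accepted.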
17
Training Set Refinement
• Existing Strings:
  • http://foo.com/
  • ftp://a.x/
• State to cover
• Generate new string
[Figure: URL automaton with transitions h, f, t, t, p, s, :, /, /, [^/?#\s], ., [^\s]. States reached by the existing strings are check-marked; one unmarked state remains to cover.]
18
Training Set Generation
• Compute automaton D of strings
passing through uncovered states
• Choose string s in D at random
• https://f.o/..Q/
• ftp://1.bd:9/:44ZW1
• http://h:68576/:X
• https://f68.ug.dk.it.no.fm
• ftp://hz8.bh8.fzpd85.frn7..
• ftp://i4.ncm2.lkxp.r9..:5811
• ftp://bi.mt..:349/
• http://n.ytnsw.yt.ee8o.w.fos.o
• Given a string e, find the closest string to e in D
• e = “http://youtube.com”
• Whttp://youtube.com
• http://y_outube.com
• h_ttp://youtube.com
• WWWhttp://youtube.co/m
• http://yout.pe.com
• ftp://yo.tube.com
• http://y.foutube.com
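The coverage idea can be sketched on a toy deterministic automaton. Everything here is an assumption made for illustration: the automaton `DFA` is a small hand-built machine for the prefixes "http:", "https:" and "ftp:", and a breadth-first search for the shortest string stands in for the random and edit-distance-based choices described on the slide.

```python
from collections import deque

# Hypothetical toy automaton: state -> {symbol: next_state}.
DFA = {
    0: {"h": 1, "f": 4},
    1: {"t": 2}, 2: {"t": 3}, 3: {"p": 5},
    4: {"t": 8}, 8: {"p": 9}, 9: {":": 6},
    5: {":": 6, "s": 7},
    7: {":": 6},
    6: {},
}
START, ACCEPTING = 0, {6}

def visited_states(string):
    """States touched while reading a string the automaton accepts."""
    state, seen = START, {START}
    for ch in string:
        state = DFA[state][ch]
        seen.add(state)
    return seen

def shortest_path(source, targets):
    """Breadth-first search: shortest label string from source into targets."""
    queue, seen = deque([(source, "")]), {source}
    while queue:
        state, label = queue.popleft()
        if state in targets:
            return label
        for symbol, nxt in sorted(DFA[state].items()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, label + symbol))
    return None

existing = ["http:", "ftp:"]
covered = set().union(*(visited_states(s) for s in existing))
uncovered = set(DFA) - covered

# One new accepted string per uncovered state: reach the state, then finish.
new_strings = [
    shortest_path(START, {q}) + shortest_path(q, ACCEPTING)
    for q in sorted(uncovered)
]
print(uncovered, new_strings)
```

The existing strings never pass through the state reached by "s", so the sketch emits "https:" as a new example to be labeled.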
19
Outline
• Vision and motivation
• Our approach: CrowdBoost
• Technical details: regular expressions and SFAs
• Experiment setup
• Experimental results
20
Four Crowd-Sourcing Tasks
• We consider 4 task
specifications
• Phone numbers
• Dates
• Emails
• URLs
• For Bountify sourcing we used
a handful of + and - examples
Date Specification:
• Please write a regular expression that validates
dates in different formats. Note that we are asking for
original work. Please do not copy your answer from
other sites.
• + (9 total)
• June 7, 2013
• 7/7/2013
• June-7-2013
• - (10 total)
• Junu 7, 2013
• 7/77/2013
• Jul-7-2013
• Please provide the regular expression in the form /^
YOUR ANSWER IS HERE $/ as part of your answer.
Please test your regex on the samples provided before
submitting. You may want to
use http://regexpal.com for testing.
21
Bountify Experience
22
Worker Interface to Classify Strings
23
Outline
• Vision and motivation
• Our approach: CrowdBoost
• Technical details: regular expressions and SFAs
• Experiment setup
• Experimental results
Experimental Analysis
Specification: Phone, Email, Date or URL
• 30 total regexes: 10 from Bountify, 20 found online
• 465 experiments (pairs)
• Initial examples (“Gold Set”): 72+ / 90-
Evolution process:
• Select successful candidates
• SFA shuffle / SFA mutate
• Generate examples using edit distance for state coverage
• Classify new examples using Mturk
• Measure fitness using gold and refined example set
Results measured:
• Boost in fitness
• Mturk costs
• Worker latency
• Running times
25
Final Fitness After Boosting
[Bar chart: fitness (0.00 to 1.00) for Phone, Date, Email and Url at three stages: Initial, Evolve, and Evolve + Crowd.]
• Final fitness upwards of 90%
• Positive boost
26
Other experimental results (per pair)
Task    Mechanical Turk latency (avg)   Total running time (avg)   Mechanical Turk cost (avg)
Phone   8 minutes                       25 minutes                 $0.41
Date    30 minutes                      55 minutes                 $2.59
Email   11 minutes                      17 minutes                 $0.50
URL     30 minutes                      70 minutes                 $3.00
• We run up to 10 generations
• Often 5 or 6 generations are enough to hit plateau
• Classification tasks given in batches
• We hire 5 workers per batch
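The talk does not say how the five answers per batch are combined into one label. The sketch below assumes a strict majority vote, which is only one plausible reading; the function name `consensus` and the sample votes are hypothetical.

```python
from collections import Counter

def consensus(votes):
    """Return '+' or '-' when a strict majority of workers agree, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None

# Hypothetical answers from 5 workers for two generated strings.
labels = {
    "http://y_outube.com": consensus(["+", "+", "-", "+", "-"]),
    "ftp://bi.mt..:349/": consensus(["-", "-", "-", "+", "-"]),
}
print(labels)
```

With an odd number of workers and two labels a strict majority always exists, so `None` only arises if a worker skips an item.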
27
Potential Application: Sanitizers
• Sanitizers for Security
  • String sanitization functions on untrusted data
  • Can be modeled as transducers (automata with output)
Specification
• “Write an HTML sanitizer”
Training set
• Input/output pairs, e.g. “ => \”
Initial programs
• Sanitizers produced by developers
28
Conclusions
• Programs that implement non-trivial
tasks can be crowd-sourced
effectively
• We focus on tasks that defy easy
specification and involve controversy
• CrowdBoost: use genetic programming to improve the quality of crowdsourced programs
• Experiments with regular
expressions
• Tested on 4 complex tasks
• Phone numbers, Dates, Emails, URLs
• Considered pairs of regexes from
Bountify, RegexLib, etc.
CONSISTENT BOOSTS: 0.12 to 0.28 median increase in fitness
MTURK LATENCY: 8 to 37 minutes per iteration
RUNNING TIME: 10 minutes to 2 hours
MTURK COSTS: $0.41 to $3.00
29
BACKUP
30
Potential Application: Browser Rendering
Specification
• “Render HTML/CSS/Javascript”
Training set
• HTML/CSS/Javascript and
render example pairs
Initial programs
• Rendering Engines
31
Overall Running Time
• More optimization still possible
32
New Strings in the Refined Training Set
• Fewer strings, less money and time needs to be spent on mturk
• Each generation we produce new strings to reach 100% coverage
• The number of strings is relatively low, with max across all pairs being about 200
[Bar chart: number of new strings (y-axis 0 to 250) for Phone, Date, Email and URLs.]
33
Value of Mechanical Turk Refinement
34
Boost
35
Successful Shuffles and Mutations (%)
[Bar chart: percentage of successful shuffles and mutations (y-axis 0 to 60) for Phone, Dates, Email and URLs.]
• Fewer considered, higher yield
• Combining works well
36
Final Fitness
37
Algorithm Outline
T := PositiveExamples U NegativeExamples
C := Program_1 … Program_n
generations := 0
while (!IsPerfectFitness(C, T) && budget > 0 && generations < 10) {
    C’ := ∅
    foreach ((Ci, Cj) in ShuffleCandidates(C))
        C’ += Shuffle(Ci, Cj)
    foreach ((Ci, sj) in MutateCandidates(C, T))
        C’ += Mutate(Ci, sj)
    ΔT := GenerateCoveringStrings(C’, T)
    (ΔP, ΔN, budget) := GetConsensusFromCrowd(ΔT, budget)
    T := T U ΔP U ΔN
    C := FilterByFitness(C U C’, T, k)
    generations++
}
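A heavily simplified, runnable sketch of this loop is given below. It is not the authors' implementation: candidates are tuples of regex fragments joined by "-", the shuffle swaps aligned fragments instead of operating on automata, and mutation and crowd refinement are left out. The names `fitness`, `shuffle`, `boost` and the example strings are all assumptions made for illustration.

```python
import re
from itertools import permutations

# Hypothetical fixed training set (no crowd refinement in this sketch).
POSITIVES = {"555-123-4567", "800-555-0199"}
NEGATIVES = {"555-1-4567", "555-123-45", "5551234567"}

def fitness(parts):
    """Accuracy of the regex built from `parts` on the labeled examples."""
    regex = re.compile("^" + "-".join(parts) + "$")
    right = sum(bool(regex.match(s)) for s in POSITIVES)
    right += sum(not regex.match(s) for s in NEGATIVES)
    return right / (len(POSITIVES) + len(NEGATIVES))

def shuffle(a, b):
    """Yield candidates made of a prefix of `a` and the aligned suffix of `b`."""
    for cut in range(1, len(a)):
        yield a[:cut] + b[cut:]

def boost(candidates, k=4, max_generations=10):
    pool = set(candidates)
    for _ in range(max_generations):
        if max(map(fitness, pool)) == 1.0:      # perfect fitness: stop early
            break
        children = {c for a, b in permutations(pool, 2) for c in shuffle(a, b)}
        # Keep only the k fittest of the old and new candidates.
        pool = set(sorted(pool | children, key=fitness, reverse=True)[:k])
    return max(pool, key=fitness)

A = ("[0-9]{3}", "[0-9]*", "[0-9]{4}")
B = ("[0-9]{3}", "[0-9]{3}", "[0-9]*")
winner = boost([A, B])
print(winner, fitness(winner))
```

Both inputs start at fitness 0.8, and one generation of shuffling is enough to find a candidate that classifies every example correctly.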
40
Characterizing the Input Regexes
[Bar chart: regex length and state count (y-axis 0 to 350) for Phone, Dates, Email and URLs.]
• Varying levels of length and complexity
41
Shuffles and Mutations (in thousands)
[Bar chart: number of shuffles and mutations, in thousands (y-axis 0 to 200), for Phone, Dates, Email and URLs.]
• Training set contains more examples with wide chars
• Smaller automata produce fewer shuffles
42
Mechanical Turk Costs
• Classification tasks were batched into varying sizes (max 50) and had scaled payment rates ($0.50 to $1.00)
• 5 workers per batch
• Median cost per pair: $1.5 to $8.9
[Bar chart: Mechanical Turk cost (y-axis $0 to $12) for Phone, Date, Email and Url, with bar labels $0.41, $0.50, $2.59 and $3.00.]
44
Latency
[Two bar charts for Phone, Date, Email and Url: “Elapsed Mturk Time Per Pair (minutes)”, with bar labels from 7.78 to 30.10, and “Total Time Per Pair (minutes)”, with bar labels from 16.53 to 69.40.]
• Larger batches for workers
45
Characterizing the Boosting Process
• Two representative pairs profiled from each category
• Want the process to terminate: limit the # of generations to 10
• Often finish after 5 or 6 generations
• Occasionally, all are required
• While we hit a plateau at 10, in some cases we’re likely to improve with more generations
[Two line plots: fitness against generation (2.5 to 10.0) for Date, Email, Phone and Url.]
46
Proposed Regexes
Length of URL Regexes
Author            Length
@scottgonzales    1347
@cowboy           1241
Spoon Library     979
@diegoperini      502
@mattfarina       287
Jeffrey Friedl    241
@gruber           218
@krijnhoetmer     115
@rodneyrehm       109
@gruber           71
@imme_emosol      54
@stephenhay       38
47
Symbolic Finite Automata
• Extension of classical finite state automata
• Allow transitions to be labeled with predicates
• Need to handle UTF16
  • 2^16 characters
• Implemented using Automata.dll
[Figure: URL automaton with transitions labeled h, f, t, t, p, s, :, /, / and the predicates [^/?#\s], . and [^\s].]
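To make the idea of predicate-labeled transitions concrete, here is a small Python sketch. It does not use or imitate the API of Automata.dll; the helper names `char` and `char_class`, the state numbering and the simplified URL automaton are all assumptions for illustration. The sketch also assumes that the predicates leaving any one state are mutually exclusive.

```python
import re

def char(c):
    """Predicate that holds for exactly one character."""
    return lambda x: x == c

def char_class(pattern):
    """Predicate defined by a single-character regex class."""
    compiled = re.compile(pattern)
    return lambda x: compiled.fullmatch(x) is not None

HOST_CHAR = char_class(r"[^/?#\s]")   # one predicate covers most of Unicode

# (source state, predicate, target state)
TRANSITIONS = [
    (0, char("h"), 1), (1, char("t"), 2), (2, char("t"), 3), (3, char("p"), 4),
    (4, char("s"), 5), (4, char(":"), 6), (5, char(":"), 6),
    (6, char("/"), 7), (7, char("/"), 8),
    (8, HOST_CHAR, 9), (9, HOST_CHAR, 9),
]
ACCEPTING = {9}

def accepts(text):
    state = 0
    for ch in text:
        for source, predicate, target in TRANSITIONS:
            if source == state and predicate(ch):
                state = target
                break
        else:
            return False          # no transition is enabled for this character
    return state in ACCEPTING

print(accepts("http://foo.com"), accepts("https://é.fr"), accepts("http://a b"))
```

A single predicate-labeled edge replaces the tens of thousands of per-character edges a classical automaton would need over a 2^16-character alphabet.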
48
Shuffle: Collapsing into Components
[Figure: a URL automaton collapsed into components: strongly connected components (SCC), stretches, and one-entry one-exit regions.]
• Manageable number of edges to shuffle
49
Some Regexes
50
Bountify Process
Solution 2
Winner
Solution 4