Recent Advances in Software Engineering in Microsoft Research Judith Bishop Microsoft Research [email protected] University of Nanjing, 28 May 2015



Agenda
• Hardware: Statistics, Trends
• Maintenance: WER, CRANE; Testing
• Prevention: Z3 and Friends
• Education: IntelliTest, Code Hunt
Software runs on hardware – lots of it
Worldwide PC units for personal devices increased by 5% year over year in 1Q14, with sales of basic and utility tablets in emerging markets, plus smartphones, driving total device market growth during the quarter. (Gartner, June 2014)
Connected Devices and The Cloud – the most recent technology shift
[Charts: desktop operating system market share and mobile/tablet market share (source: www.netmarketshare.com); market share of operating systems in the United States from January 2012 to September 2014, with the growing non-Windows share labeled "Not Windows"]
Maintenance
The Challenge for Microsoft
Microsoft ships software to 1 billion users around the world. We want to:
• fix bugs regardless of source – application or OS; software, hardware, or malware
• prioritize bugs that affect the most users
• generalize the solution to be used by any programmer
• get the solutions out to users most efficiently
• try to prevent bugs in the first place
Debugging in the Large with WER…
[Diagram: the WER pipeline – client minidumps are uploaded and automatically bucketed with !analyze; 23,450,649 error reports]
WER’s properties
The huge database can be mined to prioritize work:
• Fix bugs from the most (not the loudest) users
• Correlate failures to co-located components
• Show when a collection of unrelated crashes all contain the same culprit (e.g. a device driver)
WER has proven itself "in the wild": it found and fixed 5,000 bugs in beta releases of Windows after programmers had found 100,000 with static analysis and model checking tools.
Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt: Debugging in the (Very) Large: Ten Years of Implementation and Experience. SOSP '09, Big Sky, MT, October 2009
Bucketing Mostly Works
• One bug can hit multiple buckets – up to 40% of error reports; duplicate buckets must be hand triaged
• Multiple bugs can hit one bucket – up to 4% of error reports; harder to isolate each bug
But what if bucketing is wrong 44% of the time? Solution: scale is our friend. With billions of error reports, we can afford to throw away a few million.
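The bucketing-and-prioritizing idea can be sketched in a few lines. This is purely an illustration, not WER's actual algorithm: the fields used for the bucket key below are hypothetical, whereas real WER derives many bucket parameters from the minidump with !analyze.

```python
from collections import Counter

# Hypothetical sketch of WER-style bucketing: map each crash report to a
# bucket key built from failure attributes, then count hits per bucket.
# The report fields are illustrative only.
def bucket(report):
    return (report["app"], report["module"], report["offset"])

reports = [
    {"app": "word", "module": "driver.sys", "offset": 0x1A2},
    {"app": "word", "module": "driver.sys", "offset": 0x1A2},
    {"app": "excel", "module": "gfx.dll", "offset": 0x040},
]

counts = Counter(bucket(r) for r in reports)
# Prioritize by hit count: fix the bug behind the most-hit bucket first,
# so the fix helps the most (not the loudest) users.
top_bucket, hits = counts.most_common(1)[0]
```

At WER scale the same counting step runs over billions of reports, which is why a few misbucketed millions can simply be thrown away.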
Top 20 Buckets for MS Word 2010
[Chart: CDF of relative hit counts for the top 20 buckets, from a 3-week internal deployment to 9,000 users]
• Just 20 buckets account for 50% of all errors
• Fixing a small number of bugs will help many users
Hardware: Processor Bug
[Chart: error reports as a percentage of peak, day -9 through day 18]
• WER helped fix a hardware error
• The manufacturer could have caught this earlier with WER
WER works because … bucketing mostly works
Windows Error Reporting (WER) is:
• the first post-mortem reporting system with automatic diagnosis
• the largest client-server system in the world (by installs)
It has helped 700 companies fix thousands of bugs and billions of errors, and fundamentally changed software development at Microsoft.
http://winqual.microsoft.com
CRANE: Risk Prediction and Change Risk Analysis
Goal: to improve hotfix quality and response time
• CRANE adoption in Windows
• Retrospective evaluation of CRANE on Windows
• Categorization of fixes that failed in the field
Recommendation: make metrics simple, empirical and insightful, project- and context-specific, non-redundant, and actionable.
Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo, Alex Teterev: CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice – Experiences from Windows. ICST 2011: 357-366
IMPROVING TESTING PROCESSES
Release cycles impact the verification process:
• Testing becomes a bottleneck for development.
• How much testing is enough?
• How reliable and effective are tests?
• When should we run a test?
Kim Herzig, Michaela Greiler, Jacek Czerwonka, Brendan Murphy: The Art of Testing Less without Sacrificing Code Quality. ICSE 2015.
Engineering Process
[Diagram: engineer's desktop → integration process → system and integration testing]
Quality gates:
• Developers have to pass quality gates (no control over test selection)
• Gates check system constraints, e.g. compatibility or performance
• Failures are not isolated: they involve human inspections and cause a development freeze for the corresponding branch
System and Integration Testing
Software testing is expensive:
• 10k+ gates executed, 1M+ test cases
• Different branches, architectures, languages, …
• It aims to find code issues as early as possible
• It slows down product development
Research Objective
Running fewer tests increases code velocity:
• We cannot run all tests on all code changes anymore.
• Identify tests that are more likely to find defects (not coverage).
Only run effective and reliable tests:
• Not every test performs equally well; it depends on the code base.
• Reduce the execution frequency of tests that cause false test alarms (failures due to test and infrastructure issues).
Do not sacrifice code quality:
• Run every test at least once on every code change.
• Eventually find all code defects; taking the risk of finding defects later is acceptable.
HISTORIC TEST FAILURE PROBABILITIES
Analyzing past test runs yields two failure probabilities per test:
• How often did the test fail and detect a code defect? (P_TP – effectiveness)
• How often did the test report a false test alarm? (P_FP – reliability)
[Diagram: builds flow through a quality gate over time; each test is rated from its execution history on the two axes below]

                     High reliability           Low reliability
High effectiveness   Low cost, good value $$    High cost, unknown value $$$$
Low effectiveness    Low cost, low value $      High cost, low value $$$$

These probabilities depend on the execution context!
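The two probabilities above can be estimated directly from a test's execution history. The sketch below is hedged: the threshold values and the exact skip rule are illustrative, not the model from the ICSE 2015 paper.

```python
# Hedged sketch of history-based test selection. For each test (in a given
# execution context) we estimate:
#   p_tp: fraction of runs where a failure revealed a real code defect
#   p_fp: fraction of runs that were false alarms (test/infrastructure issues)
# The thresholds below are made up for illustration.
def failure_probabilities(history):
    runs = len(history)
    p_tp = sum(1 for r in history if r == "defect") / runs
    p_fp = sum(1 for r in history if r == "false-alarm") / runs
    return p_tp, p_fp

def should_run(history, min_p_tp=0.01, max_p_fp=0.10):
    p_tp, p_fp = failure_probabilities(history)
    # Keep effective tests; skip tests that find nothing but raise alarms.
    return p_tp >= min_p_tp or p_fp <= max_p_fp

effective = ["pass"] * 8 + ["defect"] * 2        # p_tp = 0.2: keep running
noisy     = ["pass"] * 5 + ["false-alarm"] * 5   # p_tp = 0.0, p_fp = 0.5: skip
```

Because skipped tests still run eventually (at least once per change), low-probability defects are found later rather than never, which is the risk the approach deliberately accepts.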
Does it Pay Off?
• Fewer test executions reduce cost: evaluated over an ~11-month period, > 30 million test executions, multiple branches
• Taking risk increases cost: evaluated over an ~3-month period, > 1.2 million test executions, single branch; and an ~12-month period, > 6.5 million test executions, multiple branches
Across All Products
Table I. Simulation results for Microsoft Windows, Office, and Dynamics (relative improvement / cost improvement):

Measurement             Windows                   Office                 Dynamics
Test executions         40.58%   –                34.9%  –               50.36%   –
Test time               -40.31%  $1,567,607.76    40.1%  -$76,509.24     -47.45%  $19,979.03
Test result inspection  33.04%   $61,532.80       21.1%  $104,880.00     32.53%   $2,337,926.40
Escaped defects         0.20%    $11,970.56       8.7%   $75,326.40      13.40%   $310,159.42
Total cost balance      –        $1,617,170.00    –      $106,063.24     –        $2,047,746.01
Results vary with:
• Branching structure
• Runtime of tests
We save cost on all products. Fine-tuning is possible – it gives better results, but is not general.
DYNAMIC & SELF-ADAPTIVE
Probabilities are dynamic (they change over time):
• Skipping tests influences the risk factors of higher-level branches
• Tests are re-enabled when code quality drops
• Feedback loop between decision points
[Chart: relative test reduction rate (0–70%) over time on Windows 8.1, showing the training period and the points where tests are automatically enabled again]
Impact on Development Process
Development speed:
• The impact on development speed is hard to estimate through simulation.
• Product teams invest because they believe that removing tests:
  – increases code velocity (at least a lower bound)
  – avoids additional changes due to merge conflicts
  – reduces the number of required integration branches, as their main purpose is to test the product
"We used the data your team has provided to cut a bunch of bad content and are running a much leaner BVT system […] we're panning out to scale about 4x and run in well under 2 hours" (Jason Means, Windows BVT PM)
Secondary Improvements
• Machine setup: we may lower the number of machines allocated to the testing process
• Developer satisfaction: removing false test failures increases confidence in the testing process
Prevention
Continual abstraction
Automated Theorem Prover
Z3 reasons over a combination of logical theories: Boolean algebra, first-order axioms, bit vectors, linear arithmetic, floating point, non-linear real arithmetic, sets/maps/…, and algebraic data types.
• Won 19/21 divisions in the 2011 SMT Competition
• The most influential tool paper in the first 20 years of TACAS (2014)
Leonardo de Moura and Nikolaj Bjørner: Satisfiability modulo theories: introduction and applications. Commun. ACM, 54(9):69-77, 2011.
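To make "satisfiability modulo theories" concrete, here is a toy model finder for linear integer constraints. It only enumerates a bounded domain; Z3 instead uses dedicated decision procedures for each theory and their combination, so treat this purely as an illustration of what a satisfying model is.

```python
from itertools import product

# Toy "SMT" over linear integer arithmetic: exhaustively search a bounded
# domain for a model (a variable assignment) satisfying every constraint.
# Z3 does NOT work this way; this only illustrates the notion of a model.
def solve(constraints, variables, bound=20):
    for values in product(range(-bound, bound + 1), repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return env          # satisfiable: return one model
    return None                 # no model within the bound

# Three linear-arithmetic constraints: x + y = 10, x > y, y > 0.
model = solve(
    [lambda e: e["x"] + e["y"] == 10,
     lambda e: e["x"] > e["y"],
     lambda e: e["y"] > 0],
    ["x", "y"],
)
```

The search returns the first model found (x = 6, y = 4 under this enumeration order); a real solver would also report unsatisfiability with a proof rather than just failing to find a model in a bounded range.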
Automated Test Generation and Safety/Termination Checking
SAGE: binary file fuzzing
• Symbolic execution of x86 traces to generate new input files
• Z3 theories: bit vectors and arrays
• [Pie chart: fuzzing bugs found in Win7 (over 100s of file parsers), split between Random + Regression, All Others, and SAGE]
Corral: whole-program analysis
• Finds assertion violations using stratified inlining of procedures and calls to Z3
• Z3 theories: arrays, linear arithmetic, bit vectors, uninterpreted functions
As of Windows Threshold, Corral is the program analysis engine for SDV (Static Driver Verifier).
Validating Network ACLs in the Datacenter
Problem:
• 1000s of devices
• Low-level access control lists for different policies
• Updates to an Edge ACL can break policies
• Complexity is "inhumane"
Education
IntelliTest in Visual Studio 2015
Available in Visual Studio since 2010 (as Pex and Smart Unit Tests)
Nikolai Tillmann, Jonathan de Halleux, Tao Xie: Transferring an automated test generation tool to practice: from Pex to Fakes and Code Digger. ASE 2014: 385-396
Working and learning for fun
• Enjoyment adds to long-term retention on a task
• Discovery is a powerful driver, in contrast with direct instruction
• Gaming joins these two, and is hugely popular
Can we add these elements to coding? Code Hunt can!
www.codehunt.com
Code Hunt
• Is a serious programming game
• Works in C# and Java (Python coming)
• Appeals to coders wishing to hone their programming skills, and also to students learning to code
• Has had over 300,000 users since launching in March 2014, with around 1,000 users a day
• Stickiness (loyalty) is very high
Gameplay
1. The user writes code in the browser.
2. The cloud analyzes the code – test cases show differences.
As long as there are differences, the user must adapt the code and repeat. When there are no more differences, the user wins the level!
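The gameplay loop can be sketched as follows. Everything here is illustrative: Code Hunt's real engine derives inputs by dynamic symbolic execution (Pex/IntelliTest with Z3), not random sampling, and the secret and player functions below are made up (the secret mirrors the "Compute X%3+1" puzzle mentioned later).

```python
import random

# Illustrative sketch of Code Hunt's loop: compare the player's attempt
# against a hidden reference on generated inputs; any mismatch becomes a
# test case shown to the player.
def secret(x):              # hidden reference solution for a puzzle
    return x % 3 + 1        # e.g. the "Compute X%3+1" level

def player(x):              # the player's current (wrong) attempt
    return x % 3

def differences(attempt, reference, trials=100, seed=0):
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        x = rng.randrange(-100, 100)
        if attempt(x) != reference(x):
            diffs.append((x, attempt(x), reference(x)))
    return diffs

mismatches = differences(player, secret)
# Non-empty: the player must adapt the code and try again.
# When differences() returns an empty list, the player wins the level.
```

Using symbolic execution instead of the random generator above is what lets the real game find the one corner-case input that separates two almost-identical programs.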
Dynamic Symbolic Execution
The engine loops: choose the next path, solve its constraints to get an input, then execute and monitor.

void CoverMe(int[] a)
{
    if (a == null) return;
    if (a.Length > 0)
        if (a[0] == 1234567890)
            throw new Exception("bug");
}

Constraints to solve                         Input     Observed constraints
(start)                                      null      a==null
a!=null                                      {}        a!=null && !(a.Length>0)
a!=null && a.Length>0                        {0}       a!=null && a.Length>0 && a[0]!=1234567890
a!=null && a.Length>0 && a[0]==1234567890    {123…}    a!=null && a.Length>0 && a[0]==1234567890
Done: there is no path left.
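The path-by-path table above can be replayed by a tiny driver. This sketch hard-codes the solved inputs rather than deriving them, standing in for the real engine (Pex/IntelliTest), which records branch constraints from monitored executions and asks Z3 for inputs that negate the last unexplored one.

```python
# Sketch of the DSE loop for CoverMe: run each solved input and record which
# path it takes. The inputs mirror the slide's table; a real engine derives
# them by solving negated branch conditions with Z3 instead of listing them.
def cover_me(a):
    if a is None:
        return "return: a == null"
    if len(a) > 0:
        if a[0] == 1234567890:
            return "bug!"
        return "return: a[0] != 1234567890"
    return "return: empty array"

solved_inputs = [None, [], [0], [1234567890]]   # one input per table row
paths_hit = [cover_me(a) for a in solved_inputs]
# Four inputs cover all four paths, including the hidden "bug!" path that
# random testing would almost never hit.
```

This is the same machinery that powers both IntelliTest's generated unit tests and Code Hunt's counterexample test cases.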
Code Hunt - the APCS (default) Zone
• Opened in March 2014
• 129 problems covering the Advanced Placement Computer Science course
• By August 2014, over 45,000 users had started.
[Chart: APCS Zone, first three sectors – players per level (1.1 through 3.8) drop from about 45,000 at level 1.1 to about 1,000 by sector 3]
Effect of difficulty on drop-off in sectors 1-3
Percentage drop-off per puzzle, Aug 2014 and Feb 2015 (in the original chart: yellow = division, blue = operators, green = sectors):

Puzzle                               Level   Aug   Feb-A
Compute -X                           1.1     17    22
Compute 4 / X                        1.6     18    21
Compute X-Y                          1.7     18    22
Compute X/Y                          1.11    32    38
Compute X%3+1                        1.13    15    18
Compute 10%X                         1.14    12    16
Construct a list of numbers 0..N-1   2.1     37    48
Construct a list of multiples of N   2.2     19    23
Compute x^y                          3.1     11    18
Compute X! the factorial of X        3.2     16    19
Compute sum of i*(i+1)/2             3.5     17    22

[Chart: percentage drop-off across levels 1.1–3.8, Aug vs Feb-A]
Towards a Course Experience
Public data release in open source – for experimentation on how people program and reach solutions.
For ImCupSept: 257 users x 24 puzzles x approx. 10 tries = about 13,000 programs

Total try count: 13,374 | Average try count: 363 | Max try count: 1,306 | Total solved users: 1,581

Github.com/microsoft/code-hunt
Upcoming events
• PLOOC 2015 at PLDI 2015, June 14, 2015, Portland, OR, USA
• CHESE 2015 at ISSTA 2015, July 14, 2015, Baltimore, MD, USA
• Worldwide intern and summer school contests
• Public Code Hunt contests are over for the summer
• Special ICSE attendees contest – register at aka.ms/ICSE2015
• Code Hunt Workshop, February 2015
Summary: Code Hunt: A Game for Coding
1. Powerful and versatile platform for coding as a game
2. Unique in working from unit tests, not specifications
3. Contest experience is fun and robust
4. Large contest numbers, with public data sets from cloud data – enables testing hypotheses and drawing conclusions about how players master coding, and what holds them up
5. Has potential to be a teaching platform – collaborators needed
Websites
• Game: www.codehunt.com
• Project: research.microsoft.com/codehunt
• Community: research.microsoft.com/codehuntcommunity
• Data release: github.com/microsoft/code-hunt
• Blogs: linked on the Project page
• Office Mix: mix.office.com
Conclusions
1. Software runs on hardware, and hardware is increasingly varied
2. The hardware sector that is growing (mobile) is the trickiest
3. Maintenance increases in complexity with the number of deployments
4. Addressing human factors in large maintenance teams pays off
5. Prevention is a hugely valuable aid to maintenance
6. Gaming is a way to practice software engineering skills
Thank you! Questions?