Recent Advances in Software Engineering in Microsoft Research
Judith Bishop, Microsoft Research ([email protected])
University of Nanjing, 28 May 2015
Agenda
• Hardware: statistics and trends
• Maintenance: WER and CRANE; improving testing
• Prevention: Z3 and friends
• Education: IntelliTest and Code Hunt

HARDWARE

Software runs on hardware – lots of it
"Worldwide PC units for personal devices increased by 5% year over year in 1Q14, with sales of basic and utility tablets in emerging markets, plus smartphones, driving total device market growth during the quarter." – Gartner, June 2014
The most recent technology shift is to connected devices and the cloud.
[Charts: desktop and mobile/tablet operating system market share (source: www.netmarketshare.com), and market share of operating systems in the United States from January 2012 to September 2014. The growing mobile/tablet segment is not Windows.]

MAINTENANCE

The challenge for Microsoft
Microsoft ships software to 1 billion users around the world. We want to:
• fix bugs regardless of source – application or OS software, hardware, or malware
• prioritize bugs that affect the most users
• generalize the solution to be used by any programmer
• get the solutions out to users most efficiently
• try to prevent bugs in the first place

Debugging in the large with WER
[Figure: the Windows Error Reporting pipeline – a crash on a client machine produces a minidump, which is uploaded and automatically analyzed (!analyze) on the server.]

WER's properties
The huge database can be mined to prioritize work:
• fix bugs from the most (not the loudest) users
• correlate failures to co-located components
• show when a collection of unrelated crashes all contain the same culprit (e.g. a device driver)
WER has proven itself "in the wild": it found and fixed 5,000 bugs in beta releases of Windows after programmers had found 100,000 with static analysis and model checking tools.
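WER's prioritization idea – group crash reports by failure signature and fix the bugs with the most hits first – can be sketched in a few lines. This is a minimal illustration only: the field names and the signature heuristic here are hypothetical, and WER's real bucketing uses many more heuristics (module offsets, exception codes, driver names, and so on).

```python
from collections import Counter

def bucket_key(report):
    # Hypothetical signature; stands in for WER's much richer
    # bucketing heuristics.
    return (report["app"], report["module"], report["exception"])

def prioritize(reports, top=3):
    """Group crash reports into buckets and rank them by hit count,
    so the few bugs hurting the most users are fixed first."""
    buckets = Counter(bucket_key(r) for r in reports)
    return buckets.most_common(top)

reports = [
    {"app": "word", "module": "gfx.dll", "exception": "0xC0000005"},
    {"app": "word", "module": "gfx.dll", "exception": "0xC0000005"},
    {"app": "excel", "module": "calc.dll", "exception": "0xC0000094"},
]
print(prioritize(reports))
```

The heavy-tailed bucket distribution described on the next slides is what makes this ranking useful: a handful of buckets at the top of the list covers a large share of all reports.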
Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt: Debugging in the (Very) Large: Ten Years of Implementation and Experience, SOSP '09, Big Sky, MT, October 2009.

Bucketing mostly works
• One bug can hit multiple buckets: up to 40% of error reports land in duplicate buckets that must be hand-triaged.
• Multiple bugs can hit one bucket: up to 4% of error reports, making each bug harder to isolate.
But what if bucketing is wrong 44% of the time? Scale is our friend: with billions of error reports, we can afford to throw away a few million.

[Chart: top 20 buckets for MS Word 2010, from a 3-week internal deployment to 9,000 users – relative hit count and CDF per bucket. Just 20 buckets account for 50% of all errors, so fixing a small number of bugs will help many users.]

[Chart: bug reports for a hardware (processor) defect as a percentage of peak, by day. WER helped fix the hardware error; the manufacturer could have caught it earlier with WER.]

WER works because bucketing mostly works. Windows Error Reporting (WER) is:
• the first post-mortem reporting system with automatic diagnosis
• the largest client-server system in the world (by installs)
• a system that has helped 700 companies fix thousands of bugs and billions of errors
• a fundamental change to software development at Microsoft
http://winqual.microsoft.com

CRANE: risk prediction and change risk analysis
Goal: to improve hotfix quality and response time.
• CRANE adoption in Windows
• Retrospective evaluation of CRANE on Windows
• Categorization of fixes that failed in the field
Recommendation: make metrics simple, empirical and insightful, project- and context-specific, non-redundant, and actionable.
Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo, Alex Teterev: CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice – Experiences from Windows, ICST 2011: 357-366.

IMPROVING TESTING PROCESSES

Release cycles impact the verification process:
• Testing becomes a bottleneck for development.
• How much testing is enough?
• How reliable and effective are tests?
• When should we run a test?
Kim Herzig, Michaela Greiler, Jacek Czerwonka, Brendan Murphy: The Art of Testing Less without Sacrificing Code Quality, ICSE 2015.

Engineering process
[Diagram: changes flow from the engineer's desktop through the integration process into system and integration testing.]

Quality gates
• Developers have to pass quality gates and have no control over test selection.
• Gates check system constraints, e.g. compatibility or performance.
• Failures are not isolated: they involve human inspections and cause a development freeze for the corresponding branch.

Software testing is expensive
• 10k+ gates executed, 1M+ test cases, across different branches, architectures, languages, …
• Testing aims to find code issues as early as possible, but slows down product development.

Research objective
• Running fewer tests increases code velocity: we cannot run all tests on all code changes anymore, so identify the tests that are most likely to find defects (not just add coverage).
• Only run effective and reliable tests: not every test performs equally well – it depends on the code base – so reduce the execution frequency of tests that cause false test alarms (failures due to test and infrastructure issues).
• Do not sacrifice code quality: run every test at least once on every code change, so all code defects are eventually found – taking the risk of finding some defects later is acceptable.

HISTORIC TEST FAILURE PROBABILITIES

Analyzing past test runs gives failure probabilities per test:
• How often did the test fail and detect a code defect? (P_TP)
• How often did the test report a false test alarm? (P_FP)
From its execution history, each test then falls into one of four cost/value classes:
• low reliability, high effectiveness → high cost, unknown value ($$$$)
• low reliability, low effectiveness → high cost, low value ($$$$)
• high reliability, high effectiveness → low cost, good value ($$)
• high reliability, low effectiveness → low cost, low value ($)
These probabilities depend on the execution context!

Does it pay off?
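Before the payoff numbers, the probabilities just defined suggest a simple per-test decision rule: estimate P_TP and P_FP from a test's execution history in a given context, and only run tests that are effective enough and reliable enough. The sketch below is illustrative only – the history encoding and thresholds are hypothetical, not the model used in the actual Windows system.

```python
def failure_probabilities(history):
    """Estimate, from past runs of one test in one execution context,
    the probability that a run found a true defect (P_TP) and the
    probability that it raised a false alarm (P_FP)."""
    runs = len(history)
    true_pos = sum(1 for r in history if r == "defect")
    false_pos = sum(1 for r in history if r == "false_alarm")
    return true_pos / runs, false_pos / runs

def should_run(history, min_ptp=0.01, max_pfp=0.2):
    # Hypothetical policy: skip tests that rarely find real defects
    # or that frequently raise false alarms (which cost inspections
    # and freeze the branch).
    p_tp, p_fp = failure_probabilities(history)
    return p_tp >= min_ptp and p_fp <= max_pfp

flaky = ["pass"] * 5 + ["false_alarm"] * 4 + ["defect"]
solid = ["pass"] * 8 + ["defect"] * 2
print(should_run(flaky), should_run(solid))
```

The flaky test is skipped despite having found a defect once, because 40% of its runs were false alarms; the solid test keeps running. In the real system such probabilities are tracked per branch and re-evaluated continuously, as the next slides describe.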
Fewer test executions reduce cost; taking risk (escaped defects) increases cost.

[Table I: simulation results for Microsoft Windows (~11-month period, >30 million test executions, multiple branches), Office (~3-month period, >1.2 million test executions, single branch), and Dynamics (~12-month period, >6.5 million test executions, multiple branches). The table reports relative and cost improvements for test executions, test time, and test result inspections, against the cost of escaped defects. The total cost balance is positive for all three products: about $1,617,170 for Windows, $106,063 for Office, and $2,047,746 for Dynamics.]

Results vary with the branching structure and the runtime of tests, but we save cost on all products. Fine-tuning is possible and gives better results, though it is not general.

DYNAMIC & SELF-ADAPTIVE

The probabilities are dynamic – they change over time:
• Skipping tests influences the risk factors of higher-level branches.
• Tests are re-enabled when code quality drops.
• There is a feedback loop between decision points.
[Chart: relative test reduction rate over time on Windows 8.1 – after a training period the rate climbs toward 50-70%, and the system automatically enables tests again when quality drops.]

Impact on the development process
The impact on development speed is hard to estimate through simulation, but product teams invest because they believe that removing tests:
• increases code velocity (at least as a lower bound)
• avoids additional changes due to merge conflicts
• reduces the number of required integration branches, as their main purpose is to test the product
"We used the data your team has provided to cut a bunch of bad content and are running a much leaner BVT system […] we're planning to scale about 4x and run in well under 2 hours" (Jason Means, Windows BVT PM)

Secondary improvements
• Machine setup: we may lower the number of machines allocated to the testing process.
• Developer satisfaction: removing false test failures increases confidence in the testing process.

PREVENTION

Continual abstraction: Z3, an automated theorem prover, reasons over a combination of logical theories – Boolean algebra, first-order axioms, bit vectors, linear arithmetic, floating point, non-linear real arithmetic, sets/maps/…, and algebraic data types.
• Won 19 of 21 divisions in the SMT 2011 competition
• The most influential tool paper in the first 20 years of TACAS (2014)
Leonardo de Moura and Nikolaj Bjørner: Satisfiability modulo theories: introduction and applications, Commun. ACM 54(9):69-77, 2011.

Automated test generation and safety/termination checking
SAGE: binary file fuzzing
• Symbolic execution of x86 traces to generate new input files
• Z3 theories used: bit vectors and arrays
[Chart: fuzzing bugs found in Win7 (over hundreds of file parsers) – SAGE found the majority, compared with random + regression fuzzing and all other techniques.]
Corral: whole-program analysis
• Finds assertion violations using stratified inlining of procedures and calls to Z3
• Z3 theories used: arrays, linear arithmetic, bit vectors, uninterpreted functions
• As of Windows Threshold, Corral is the program analysis engine for SDV (Static Driver Verifier)

Validating network ACLs in the datacenter
Problem: thousands of devices carry low-level access control lists for different policies; updates to an edge ACL can break the policies, and the complexity is "inhumane".

EDUCATION

IntelliTest in Visual Studio 2015
Available in Visual Studio since 2010 (as Pex and Smart Unit Tests).
Nikolai Tillmann, Jonathan de Halleux, Tao Xie: Transferring an automated test generation tool to practice: from Pex to Fakes and Code Digger, ASE 2014: 385-396.
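The dynamic symbolic execution loop behind Code Hunt and IntelliTest – choose a path, solve its constraints, execute and monitor – can be made concrete with a toy Python port of the CoverMe example. A real engine derives the inputs automatically by handing each path condition to Z3; in this sketch the input for each path is written out by hand, purely to show the four paths the table above enumerates.

```python
def cover_me(a):
    # Python port of the C# CoverMe example from the slides.
    if a is None:
        return "null"
    if len(a) > 0:
        if a[0] == 1234567890:
            raise Exception("bug")
        return "first element differs"
    return "empty"

# Each pair below stands for one iteration of the DSE loop: the path
# condition the engine would hand to the solver, and a concrete input
# satisfying it. A real engine computes the inputs; here they are
# hand-picked for illustration.
paths = [
    ("a == null",                                        None),
    ("a != null && !(a.Length > 0)",                     []),
    ("a != null && a.Length > 0 && a[0] != 1234567890",  [0]),
    ("a != null && a.Length > 0 && a[0] == 1234567890",  [1234567890]),
]

for condition, test_input in paths:
    try:
        outcome = cover_me(test_input)
    except Exception:
        outcome = "bug found!"
    print(condition, "->", outcome)
```

Only the last path triggers the exception, which is exactly the kind of hard-to-hit input (a 1-in-2^32 comparison) that random fuzzing misses and constraint solving finds immediately.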
[Chart: APCS zone, first three sectors – players per level fall from 45K at level 1.1 to about 1K by sector 3 (yellow: division; blue: operators; green: sectors).]

Effect of puzzle difficulty on drop-off (sectors 1-3, Aug 2014 and Feb 2015)

Puzzle | Level | % drop-off Aug | % drop-off Feb-A
Compute -X | 1.1 | 17 | 22
Compute 4/X | 1.6 | 18 | 21
Compute X-Y | 1.7 | 18 | 22
Compute X/Y | 1.11 | 32 | 38
Compute X%3+1 | 1.13 | 15 | 18
Compute 10%X | 1.14 | 12 | 16
Construct a list of numbers 0..N-1 | 2.1 | 37 | 48
Construct a list of multiples of N | 2.2 | 19 | 23
Compute x^y | 3.1 | 11 | 18
Compute X!, the factorial of X | 3.2 | 16 | 19
Compute sum of i*(i+1)/2 | 3.5 | 17 | 22

Towards a course experience

Public data release in open source, for experimentation on how people program and reach solutions.
For the ImCupSept contest: 257 users × 24 puzzles × approx. 10 tries = about 13,000 programs (total try count 13,374; average try count 363; max try count 1,306; total solved users 1,581).
github.com/microsoft/code-hunt

Upcoming events
• PLOOC 2015 at PLDI 2015, June 14 2015, Portland, OR, USA
• CHESE 2015 at ISSTA 2015, July 14 2015, Baltimore, MD, USA
• Worldwide intern and summer school contests
• Public Code Hunt contests are over for the summer
• Special ICSE attendees contest – register at aka.ms/ICSE2015
• Code Hunt workshop, February 2015

Summary: Code Hunt, a game for coding
1. A powerful and versatile platform for coding as a game
2. Unique in working from unit tests, not specifications
3. A fun and robust contest experience
4. Large contest numbers, with public data sets from cloud data – enabling hypotheses to be tested and conclusions drawn about how players master coding and what holds them up
5. Potential to be a teaching platform – collaborators needed

Websites
• Game: www.codehunt.com
• Project: research.microsoft.com/codehunt
• Community: research.microsoft.com/codehuntcommunity
• Data release: github.com/microsoft/code-hunt
• Blogs: linked on the project page
• Office Mix: mix.office.com

Conclusions
1. Software runs on hardware, and hardware is increasingly varied.
2. The hardware sector that is growing (mobile) is the trickiest.
3. Maintenance increases in complexity with the number of deployments.
4. Addressing human factors in large maintenance teams pays off.
5. Prevention is a hugely valuable aid to maintenance.
6. Gaming is a way of practicing software engineering skills.

Thank you! Questions?