Improve the Power of Your Tests with Risk



An Overview of High Volume Test Automation (Early Draft: Feb 24, 2012)

Cem Kaner, J.D., Ph.D.

Professor of Software Engineering, Florida Institute of Technology

Acknowledgments: Many of the ideas presented here were developed in collaboration with Douglas Hoffman.

These notes are partially based on research that was supported by NSF Grant CCLI 0717613, “Adaptation & Implementation of an Activity Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Abstract

This talk is an introduction to the start of a research program. Drs. Bond, Gallagher and I have some experience with high volume test automation but we haven't done formal, funded research in the area. We've decided to explore it in more detail, with the expectation of supervising research students. We think this will be an excellent foundation for future employment in industry or at a university. If you're interested, you should talk with us.

Most discussions of automated software testing focus on automated regression testing. Regression testing reruns tests that have been run before. This type of testing makes sense for testing the manufacturing of physical objects, but it is wasteful for software. Automating regression tests *might* make them cheaper (if the test maintenance costs are low enough, which they often are not), but if a test doesn't have much value to begin with, how much should we be willing to spend to make it easier to reuse? Suppose we decided to break away from the regression testing tradition and use our technology to create a steady stream of new tests instead. What would that look like? What would our goals be? What should we expect to achieve?

This is not yet funded research--we are still planning our initial grant proposals. We might not get funded, and if we do, we probably won't get anything for at least a year. So, if you're interested in working with us, you should expect to support yourself (e.g. via GSA) for at least a year and maybe longer.


Typical Testing Tasks

Analyze product & its risks
• Benefits & features
• Risks in use
• Market expectations
• Interaction with external S/W
• Diversity / stability of platforms
• Extent of prior testing
• Assess source code

Develop testing strategy
• Pick key techniques
• Prioritize testing foci

Design tests
• Select key test ideas
• Create tests for each idea

Design oracles
• Mechanisms for determining whether the program passed or failed a test

Execute the tests
• Troubleshoot failures
• Report bugs

Assess the tests
• Polish their design
• Evaluate any bugs found by them

Document the tests
• What test ideas or spec items does each test cover?
• What algorithms generated the tests?
• What oracles are relevant?

Maintain the tests
• Identify broken tests
• Recreate broken tests
• Redocument revised tests

Manage test environment
• Set up test lab
• Select / use hardware/software configurations
• Manage test tools

Keep archival records
• What tests have we run

Regression testing

This is the most commonly discussed approach to automated testing:
• Create a test case
• Run it and inspect the output
• If the program fails, report a bug and try again later
• If the program passes the test, save the resulting outputs
• In future testing:
  – Run the program
  – Compare the output to the saved results
  – Report an exception whenever the current output and the saved output don’t match
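A minimal sketch of that loop in Python (the command lines, file names, and golden-output layout below are hypothetical illustrations, not from the talk):

```python
import subprocess
from pathlib import Path

# Hypothetical layout: each test is a command line plus a saved ("golden") output file.
TESTS = {
    "sort_ascending": (["myapp", "--sort", "data.csv"], Path("golden/sort_ascending.out")),
    "empty_input":    (["myapp", "--sort", "empty.csv"], Path("golden/empty_input.out")),
}

def run_regression_suite():
    """Re-run each saved test and report an exception whenever the current
    output and the saved output don't match."""
    for name, (cmd, golden) in TESTS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.stdout != golden.read_text():
            # A human still decides: product bug, broken test, or intended change?
            print(f"MISMATCH in {name}")
        else:
            print(f"PASS {name}")

if __name__ == "__main__":
    run_regression_suite()
```

Note how little of the work is done by the computer here; when outputs differ, a human still has to decide what the mismatch means, which is the point of the next slide.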

Really? This is automation?

• Analyze product & its risks: Human
• Develop testing strategy: Human
• Design tests: Human
• Design oracles: Human
• Run each test the first time: Human
• Assess the tests: Human
• Save the code: Human
• Save the results for comparison: Human
• Document the tests: Human
• (Re-)Execute the tests: Computer
• Evaluate the results: Computer + Human
• Maintain the tests: Human
• Manage test environment: Human
• Keep archival records: Human

This is computer-assisted testing, not automated testing. ALL testing is computer assisted.


Other computer assistance…

• Tools to help create tests
• Tools to sort, summarize or evaluate test output or test results
• Tools (simulators) to help us predict results
• Tools to build models (e.g. state models) of the software, from which we can build tests and evaluate / interpret results
• Tools to vary inputs, generating a large number of similar (but not the same) tests on the same theme, at minimal cost for the variation
• Tools to capture test output in ways that make test result replication easier
• Tools to expose the API to the non-programmer subject matter expert, improving the maintainability of SME-designed tests
• Support tools for parafunctional tests (usability, performance, etc.)

Don't think "automated or not"

Think of a continuum: more automated to less automated.

Not "Can we automate?" Instead: "Can we automate more?"


A hypothetical

• System conversion (e.g. Filemaker application to SQL)
  – Database application, 100 types of transactions, extensively specified (we know the fields involved in each transaction, and know their characteristics via the data dictionary)
  – 15,000 regression tests
  – Should we assess the new system by making it pass the 15,000 regression tests?
  – Maybe to start, but what about…
    ° Create a test generator to create high volumes of data combinations for each transaction. THEN:
    ° Randomize the order of transactions to check for interactions that lead to intermittent failures
  – This lets us learn things we don’t know, and ask / answer questions we don’t know how to study in other ways
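As a rough sketch of what such a generator could look like (the transaction types, fields, and ranges below are invented for illustration; they are not from the talk):

```python
import random
import string

# Invented data dictionary: transaction type -> {field: (kind, constraint)}.
DATA_DICTIONARY = {
    "deposit":  {"account_id": ("int", (1, 10**6)), "amount": ("decimal", (0.01, 10**7))},
    "withdraw": {"account_id": ("int", (1, 10**6)), "amount": ("decimal", (0.01, 10**7))},
    "rename":   {"account_id": ("int", (1, 10**6)), "new_name": ("str", 40)},
}

def generate_value(kind, constraint, rng):
    """Generate one field value that satisfies the data-dictionary constraint."""
    if kind == "int":
        lo, hi = constraint
        return rng.randint(lo, hi)
    if kind == "decimal":
        lo, hi = constraint
        return round(rng.uniform(lo, hi), 2)
    if kind == "str":
        return "".join(rng.choices(string.ascii_letters, k=rng.randint(1, constraint)))
    raise ValueError(f"unknown kind: {kind}")

def generate_transactions(n, seed=None):
    """Yield n transactions of randomly mixed types, in randomized order,
    so that interactions between transaction types get exercised."""
    rng = random.Random(seed)
    for _ in range(n):
        tx_type = rng.choice(list(DATA_DICTIONARY))
        yield tx_type, {field: generate_value(kind, constraint, rng)
                        for field, (kind, constraint) in DATA_DICTIONARY[tx_type].items()}
```

The same stream could be fed to both the old and the new system, with the old system's resulting database state serving as a reference oracle for the new one.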

Suppose you decided to never run another regression test.

What kind of automation could you do?


[Matrix slide: rows list what we can vary (Inputs: input filters, function, consequences, output filters; Combinations; Task sequences; File contents: input / reference / config; State transitions; Execution environment) against columns for the techniques (Fuzzing; Sampling system; Long-Sequence Regression; Oracles: Model, Reference, Diagnostic, Constraint).]

Issues that Drive Design of Test Automation

• Theory of error: What kinds of errors do we hope to expose?
• Input data: How will we select and generate input data and conditions?
• Sequential dependence: Should tests be independent? If not, what info should persist or drive sequence from test N to N+1?
• Execution: How well are test suites run, especially in case of individual test failures?
• Output data: Observe which outputs, and what dimensions of them?
• Comparison data: If detection is via comparison to oracle data, where do we get the data?
• Detection: What heuristics/rules tell us there might be a problem?
• Evaluation: How to decide whether X is a problem or not?
• Troubleshooting support: Failure triggers what further data collection?
• Notification: How/when is failure reported?
• Retention: In general, what data do we keep?
• Maintenance: How are tests / suites updated / replaced?
• Relevant contexts: Under what circumstances is this approach relevant/desirable?

Primary drivers of our designs

The primary driver of a design is the key factor that motivates us or makes the testing possible. In Doug's and my experience, the most common primary drivers have been:
• Theory of error – We’re hunting a class of bug that we have no better way to find
• Available oracle – We have an opportunity to verify or validate a behavior with a tool
• Ability to drive long sequences – We can execute a lot of these tests cheaply


More on … Theory of Error

• Computational errors
• Communications problems
  – protocol error
  – their-fault interoperability failure
• Resource unavailability or corruption, driven by
  – history of operations
  – competition for the resource
• Race conditions or other time-related or thread-related errors
• Failure caused by toxic data value combinations
  – that span a large portion or a small portion of the data space
  – that are likely or unlikely to be visible in "obvious" tests based on customer usage or common heuristics


Simulate Events with Diagnostic Probes

• One of the first PBX's with integrated voice and data: 108 voice features, 110 data features.
• Simulate traffic on the system, with settable probabilities of state transitions.
• Diagnostic reporting whenever a suspicious event is detected.
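A minimal sketch of this kind of simulator, assuming a hypothetical `station` driver object and an invented call-state model (neither comes from the talk); the diagnostic probe simply compares the state the system reports against the state the model expects:

```python
import random

# Invented call-state model: state -> [(event, next_state, probability), ...]
TRANSITIONS = {
    "idle":      [("offhook", "dial_tone", 1.0)],
    "dial_tone": [("digit", "dialing", 0.8), ("onhook", "idle", 0.2)],
    "dialing":   [("connect", "talking", 0.6), ("onhook", "idle", 0.4)],
    "talking":   [("hold", "on_hold", 0.2), ("onhook", "idle", 0.8)],
    "on_hold":   [("resume", "talking", 0.9), ("onhook", "idle", 0.1)],
}

def simulate_traffic(station, steps, seed=None):
    """Drive `station` (a hypothetical driver with send(event) and
    reported_state() methods) with randomly chosen events, using the
    settable per-state probabilities above, and report a diagnostic
    whenever the system's reported state disagrees with the model."""
    rng = random.Random(seed)
    expected = "idle"
    for step in range(steps):
        events, next_states, weights = zip(*TRANSITIONS[expected])
        i = rng.choices(range(len(events)), weights=weights)[0]
        station.send(events[i])
        expected = next_states[i]
        actual = station.reported_state()
        if actual != expected:  # diagnostic probe fires on suspicious events
            print(f"step {step}: sent {events[i]!r}, expected {expected!r}, "
                  f"system reports {actual!r}")
```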


More on … Available Oracle

Typical oracles used in test automation:
• Reference program
• Model that predicts results
• Embedded or self-verifying data
• Checks for known constraints
• Diagnostics
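For instance, a constraint oracle never predicts the exact expected output; it only checks properties every correct output must satisfy. A minimal, hypothetical Python sketch for a sort routine under test (the routine and the value ranges are illustrative assumptions):

```python
import random
from collections import Counter

def check_sort_constraints(sort_under_test, trials=100_000, seed=0):
    """Constraint oracle: check two properties every correct result must have,
    without ever computing the 'expected' sorted list directly."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-10**9, 10**9) for _ in range(rng.randint(0, 50))]
        result = sort_under_test(list(data))
        # Constraint 1: the output is ordered.
        assert all(a <= b for a, b in zip(result, result[1:])), (data, result)
        # Constraint 2: the output is a permutation of the input.
        assert Counter(result) == Counter(data), (data, result)
```

Calling `check_sort_constraints(sorted)` passes; a routine that drops duplicates or mis-orders some inputs would trip one of the assertions.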

Function Equivalence Testing

• MASPAR (the Massively Parallel computer, 64K parallel processors).
• The MASPAR computer has several built-in mathematical functions. We’re going to consider the integer square root.

• This function takes a 32-bit word as an input. Any bit pattern in that word can be interpreted as an integer whose value is between 0 and 2^32 - 1. There are 4,294,967,296 possible inputs to this function.

• Tested against a reference implementation of square root.

Function Equivalence Test

• The 32-bit tests took the computer only 6 minutes to run and to compare the results to an oracle.
• There were 2 (two) errors, neither of them near any boundary. (The underlying error was that a bit was sometimes mis-set, but in most error cases there was no effect on the final calculated result.) Without an exhaustive test, these errors probably wouldn’t have shown up.
• For 64-bit integer square root, function equivalence tests involved random sampling rather than exhaustive testing, because the full set would have required 2^32 times the 6-minute run.
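As a rough illustration of the shape of such a test (not Kaner's original MASPAR code), a reference-oracle equivalence test might look like the sketch below, where `isqrt_under_test` is a hypothetical stand-in for the implementation being tested and Python's `math.isqrt` plays the reference. Note that a pure-Python exhaustive sweep of all 2^32 inputs would take far longer than the 6 minutes reported for the MASPAR run.

```python
import math
import random

def isqrt_under_test(x: int) -> int:
    # Hypothetical stand-in for the implementation under test.
    return math.isqrt(x)

def exhaustive_32bit_equivalence():
    """Compare the function under test against a trusted reference over the
    entire 32-bit input space; collect every mismatch."""
    mismatches = []
    for x in range(2**32):
        if isqrt_under_test(x) != math.isqrt(x):
            mismatches.append(x)
    return mismatches

def sampled_64bit_equivalence(trials=10_000_000, seed=0):
    """For 64-bit inputs, exhaustive testing is impractical, so sample the
    input space at random instead."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        x = rng.getrandbits(64)
        if isqrt_under_test(x) != math.isqrt(x):
            failures.append(x)
    return failures
```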

This tests for equivalence of functions, but it is less exhaustive than it looks (Acknowledgement: From Doug Hoffman)

[Diagram (after Doug Hoffman): both the system under test and the reference function start from program state, system state, intended inputs, configuration and system resources, and cooperating processes, clients or servers. Each produces a resulting program state (and uninspected outputs), system state, monitored outputs, impacts on connected devices / resources, and messages to cooperating processes, clients or servers. The equivalence comparison typically covers only the monitored outputs, which is why the test is less exhaustive than it looks.]


More on … Ability to Drive Long Sequences

Any execution engine will (potentially) do:
• Commercial regression-test execution tools
• Customized tools for driving programs with (for example)
  – Messages (to be sent to other systems or subsystems)
  – Inputs that will cause state transitions
  – Inputs for evaluation (e.g. inputs to functions)

Long-sequence regression

• Tests are taken from the pool of tests the program has passed in this build.
• The sampled tests are run in random order until the software under test fails (e.g. crashes).
• Typical defects found include timing problems, memory corruption (including stack corruption), and memory leaks.
• Recent (2004) release: 293 reported failures exposed 74 distinct bugs, including 14 showstoppers.

Note:
• These tests are no longer testing for the failures they were designed to expose.
• These tests add nothing to typical measures of coverage, because the statements, branches and subpaths within these tests were covered the first time these tests were run in this build.
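A minimal sketch of a long-sequence regression driver of this kind; the test-to-command-line table and the nonzero-exit-code failure convention are assumptions for illustration, not details from the talk:

```python
import random
import subprocess

def long_sequence_regression(passed_tests, max_runs=1_000_000, seed=None):
    """Sample tests this build has already passed and run them back-to-back in
    random order, without cleanup in between, until the software under test
    fails (e.g. crashes). `passed_tests` maps test names to command lines."""
    rng = random.Random(seed)
    names = list(passed_tests)
    history = []
    for i in range(max_runs):
        name = rng.choice(names)
        history.append(name)
        result = subprocess.run(passed_tests[name])
        if result.returncode != 0:
            # Keep the whole sequence: the order, not any single test, is
            # usually what exposes the leak / corruption / timing bug.
            return i, history
    return max_runs, history
```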


Imagining a structure for high-volume automated testing


Some common characteristics

• The tester codes a testing process rather than individual tests.

• Following the tester’s algorithms, the computer creates tests (maybe millions of tests), runs them, evaluates their results, reports suspicious results (possible failures), and reports a summary of its testing session.
• The tests often expose bugs that we don’t know how to design focused tests to look for.
  – They expose memory leaks, wild pointers, stack corruption, timing errors and many other problems that are not anticipated in the specification, but are clearly inappropriate (i.e. bugs).
  – Traditional expected results (the expected result of 2+3 is 5) are often irrelevant.


What can we vary?

• Inputs to functions
  – To check input filters
  – To check operation of the function
  – To check consequences (what the other parts of the program do with the results of the function)
  – To drive the program's outputs
• Combinations of data
• Sequences of tasks
• Contents of files
  – Input files
  – Reference files
  – Configuration files
• State transitions
  – Sequences in a state model
  – Sequences that drive toward a result
• Execution environment
  – Background activity
  – Competition for specific resources
• Message streams
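To make two of these dimensions concrete, here is a small, hypothetical Python sketch that varies combinations of data and sequences of tasks; the field values and task names are invented for illustration:

```python
import itertools
import random

# Invented example data for two of the dimensions above.
FONTS  = ["Arial", "Courier New", "Wingdings"]
SIZES  = [1, 8, 72, 1638]
STYLES = ["regular", "bold", "strikethrough"]

def data_combinations():
    """All combinations of three input dimensions; with larger dimensions,
    pairwise or random sampling of this cross product is the next step."""
    return itertools.product(FONTS, SIZES, STYLES)

TASKS = ["open", "edit", "save", "print_preview", "undo", "close"]

def random_task_sequence(length, seed=None):
    """A random sequence of tasks, to probe order-dependent interactions."""
    rng = random.Random(seed)
    return [rng.choice(TASKS) for _ in range(length)]
```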

Fuzzing
• Random generation / selection of tests
• Execution engine
• Weak oracle (run till crash)

Fuzzing examples
• Random inputs
• Random state transitions (dumb monkey)
• File contents
• Message streams
• Grammars

Sampling system
• Test selection optimized against some criteria

Long-sequence regression

Model-based oracle
• E.g. state machine
• E.g. mathematical model

Reference program, diagnostic oracle, constraint oracle
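To make the fuzzing idea concrete: a minimal "dumb" fuzzer in the sense above (random inputs, an execution engine, and the weakest oracle, run till crash) might look like this sketch; the target command line is hypothetical:

```python
import random
import subprocess

def dumb_fuzzer(target_cmd, iterations=100_000, max_len=4096, seed=None):
    """Feed random byte strings to the program under test on stdin and flag
    crashes and hangs. target_cmd, e.g. ["./parser"], is hypothetical."""
    rng = random.Random(seed)
    for i in range(iterations):
        payload = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
        try:
            result = subprocess.run(target_cmd, input=payload,
                                    capture_output=True, timeout=5)
        except subprocess.TimeoutExpired:
            print(f"hang on iteration {i}")
            continue
        if result.returncode < 0:  # killed by a signal: crash
            with open(f"crash_{i}.bin", "wb") as f:
                f.write(payload)   # retain the failing input for reproduction
            print(f"crash on iteration {i}: signal {-result.returncode}")
```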

[Matrix slide, repeating the earlier grid: what we can vary (Inputs: input filters, function, consequences, output filters; Combinations; Task sequences; File contents: input / reference / config) against the techniques (Fuzzing; Sampling system; Long-Sequence Regression; Oracles: Model, Reference, Diagnostic, Constraint).]

Issues that Drive Design of Test Automation

• Theory of error: What kinds of errors do we hope to expose?
• Input data: How will we select and generate input data and conditions?
• Sequential dependence: Should tests be independent? If not, what info should persist or drive sequence from test N to N+1?
• Execution: How well are test suites run, especially in case of individual test failures?
• Output data: Observe which outputs, and what dimensions of them?
• Comparison data: If detection is via comparison to oracle data, where do we get the data?
• Detection: What heuristics/rules tell us there might be a problem?
• Evaluation: How to decide whether X is a problem or not?
• Troubleshooting support: Failure triggers what further data collection?
• Notification: How/when is failure reported?
• Retention: In general, what data do we keep?
• Maintenance: How are tests / suites updated / replaced?
• Relevant contexts: Under what circumstances is this approach relevant/desirable?

About Cem Kaner

• Professor of Software Engineering, Florida Tech
• I’ve worked in all areas of product development (programmer, tester, writer, teacher, user interface designer, software salesperson, organization development consultant, as a manager of user documentation, software testing, and software development, and as an attorney focusing on the law of software quality).
• Senior author of three books:
  – Lessons Learned in Software Testing (with James Bach & Bret Pettichord)
  – Bad Software (with David Pels)
  – Testing Computer Software (with Jack Falk & Hung Quoc Nguyen)
• My doctoral research on psychophysics (perceptual measurement) nurtured my interests in human factors (usable computer systems) and measurement theory.