Software Safety Program


Software Safety Engineering (S2E) Program Status

Dan Fitch March 7, 2001

Software Safety Program - Overview
• General Safety Concepts - WHY
• Software Safety and CLCS - HOW
• Known Hazards
• Designing for Safety
• Safety & Reliability Thread
• Current Status

Software Safety – What is it?

[Figure: four software safety functions shown against limit lines]
• Anticipate: rate, slope
• Detect: absolute value
• Control: prevent
• Mitigate: limit damage, return to safe state
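The detect and anticipate functions above can be illustrated with a minimal sketch: one check against an absolute-value limit (detect) and one against a rate-of-change limit (anticipate). The class name `LimitMonitor`, its parameters, and the sample pressure values are hypothetical illustrations, not CLCS code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LimitMonitor:
    """Hypothetical monitor: flags absolute-limit and rate-limit violations."""
    abs_limit: float    # assumed maximum safe value (detect)
    rate_limit: float   # assumed maximum safe rise per second (anticipate)
    _last_value: Optional[float] = None
    _last_time: Optional[float] = None

    def check(self, value: float, t: float) -> str:
        """Return 'DETECT', 'ANTICIPATE', or 'OK' for one sample."""
        rate = None
        if self._last_time is not None and t > self._last_time:
            rate = (value - self._last_value) / (t - self._last_time)
        self._last_value, self._last_time = value, t

        if value > self.abs_limit:
            return "DETECT"        # the limit has already been exceeded
        if rate is not None and rate > self.rate_limit:
            return "ANTICIPATE"    # still within limits, but rising too fast
        return "OK"


# Illustrative samples as (time in seconds, measured value).
monitor = LimitMonitor(abs_limit=100.0, rate_limit=5.0)
samples = [(0.0, 50.0), (1.0, 52.0), (2.0, 70.0), (3.0, 101.0)]
results = [monitor.check(value, t) for t, value in samples]
print(results)
```

In this sketch the rate check fires one sample before the absolute limit is crossed, which is the point of anticipating rather than only detecting.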

Software Safety – What is it?

• Definitions
  • Functionally-critical: mission completion
  • Safety-critical
    • Humans = life & limb
    • Hardware = $10^6
• Some set theory
  • Input versus output

Some Theory…

[Figure: a Set of Inputs mapped to a Set of Outputs. Input labels: Known, Unknowns. Output labels: Known (Safe, Unsafe), Assumed Safe. Input sources: Normal Operation, Hardware Failures, Human Intervention, Models/Simulators.]

Software Safety – Why do it?

• Direction:
  • DoD: Mil-Std-882D, DoD-Std-2167
  • NASA: NSTS-07700, NSS-8719.13, NASA-GB-1740.13, NSS-22206, NSS-22254, direction from Dan Goldin
  • CLCS: 84K-00055, KDP-P-2901

Software Safety – Why do it?

• Objective: Identify & Mitigate Risk
  • Known Fault Scenarios – by requirements, analyses & test
  • Possible Unknowns – by design approach & further test

“Knowns”
• Hardware fault-driven scenarios
  • Legacy of hardware failure data available from the 1970s
• Hardware-driven hazards
  • May be analyzed – the SSA
  • May be tested – specific fault injection
• Identifies Risk & Yields Design Changes – Issues/ESRs
• The Safety Case – Summary of Risk Findings

“Unknowns”
• “Stuff” Happens
• Software doesn’t fail – it just doesn’t do what we thought it would
• Hardware and some functions (e.g., seeds & races) cause most random errors
• Specification & Coding errors = Prime Cause
  • 90% of errors are in the specifications
• C++ and Java are inherently powerful, but dangerous

Ferengi Software Safety Rule #76

If it "touches*" hardware that can impact the safety of people or equipment, an SSA is absolutely necessary.

*(i.e., controls, monitors, or mitigates the risk of using)

SSA - What and When
• Assessment of risk factors due to software
  • Hardware Hazards
  • SFMEA and SFTA
  • KDP-P-2901
• Schedule: 30 days before the first interaction with Flight Hardware
  • In time for 5A/B Testing
  • Presented at TRR/ORR

[Figure: the SSA process overlaid on the development lifecycle, built up over several slides]
• Lifecycle: IPT/DP-1, Conceptual Design, System Safety Analysis, SRS/DP-2, Detail Design, DDS/ODS/DP-3, Code Development, Val/Ver Test (3A/B, 4A/B), Readiness Reviews (TRR/ORR), System Test (5A/B, with hardware)
• KDP-P-2901 SSA Process: S-C Matrix, PHA, FTA/FMEA, Risk Assessment, SSA Report (Risk), CM-Driven Changes

Software Fault Tree Analysis
• Works backward from the fault to its root causes
• Uses design details of the entire system
• Leads to better understanding of causes and their prevention
• Unknown fault events not considered

Fault Tree Analysis

[Figure: example fault tree]
• Top Event: Fill Valve not closed
• Intermediate Events (causal relationship): S/W did not anticipate rapid pressure rise; S/W did not react to over pressure; Human did not notice pressure
• Basic Fault Events: Other Root Cause
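As a rough sketch of how a fault tree like the one above can be evaluated, the Python below represents gates as nested tuples and computes whether the top event occurs for a given set of basic events. The gate structure (an AND of the three listed events, OR'ed with another root cause) and all event names are assumptions made for illustration; the slide does not state the gate types.

```python
from typing import Dict, Tuple, Union

# A node is a basic-event name (str) or a gate: (gate_type, [children]).
Node = Union[str, Tuple[str, list]]


def evaluate(node: Node, basic_events: Dict[str, bool]) -> bool:
    """Return True if the event at this node occurs."""
    if isinstance(node, str):
        return basic_events[node]
    gate, children = node
    results = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate type: {gate}")


# Assumed structure for the top event "Fill Valve not closed".
fill_valve_not_closed: Node = ("OR", [
    ("AND", [
        "sw_did_not_anticipate_pressure_rise",
        "sw_did_not_react_to_over_pressure",
        "human_did_not_notice_pressure",
    ]),
    "other_root_cause",
])

all_barriers_fail = {
    "sw_did_not_anticipate_pressure_rise": True,
    "sw_did_not_react_to_over_pressure": True,
    "human_did_not_notice_pressure": True,
    "other_root_cause": False,
}
human_notices = dict(all_barriers_fail, human_did_not_notice_pressure=False)

top_when_all_fail = evaluate(fill_valve_not_closed, all_barriers_fail)
top_when_human_notices = evaluate(fill_valve_not_closed, human_notices)
print(top_when_all_fail, top_when_human_notices)
```

Under the assumed AND gate, any one working barrier (here, the human noticing the pressure) is enough to prevent the top event.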

Analysis & CLCS Architecture

[Figure: a Hazardous Event passing through Hardware Safing, Detection & Anticipation, and Control & Mitigation, shown against the CLCS layers (Applications, Apps Srvcs, Sys Srvcs, System S/W), leaving Remaining Risk]

The Software FMEA
• Predicted hardware failures followed to their conclusion through the software
  • What can go wrong?
  • What happens when it does?
• Must know system failures up front
• Won’t prevent the unexpected

• Spiral Development
• Cultural Changes
• Failure of software
• Test CLCS

SSA – Traditional Approach

[Figure: Traditional Development, with Fault Tree Analysis and Failure Modes & Effects Analysis]
• All or most code available
• A lot known about the system
• Too late…

SSA An Iterative Process

[Figure: iterative cycle linking Spiral Development, Safety Criticality Assessment, Fault Tree Analysis, Failure Modes & Effects Analysis, and Engineering Design Changes]

SSA - Where

S&MA will perform a Software Safety Analysis (SSA) for each delivery and at every location, i.e., as we step up to each new drop. After the initial SSA, the analysis will be updated and a new SSA report issued for each modification to the safety-critical software.

SSA - Planning

[Figure: timeline from Design Begin … to Val/Ver Test, with the SSA activities in sequence: PHA, FTA, FMEA, Risk Assessment, SSA Report]

On a PERT chart, the SSA preparation activity begins during the preparation of the design specifications and has a finish-to-finish relationship with the validation/verification (4A/B) testing.

Ferengi Software Safety Rule #304

The SSA isn’t enough.

• Spiral Development
• Cultural Changes
• Failure of software
• Test CLCS

Paradigms
• Software failures: “Software does not fail - it just does not perform as intended” (Dr. Nancy Leveson, MIT)

Paradigms
• Design and test for functionality: also specify what the system should not do. Then test it.
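A minimal sketch of a "should not" requirement and its test, assuming a hypothetical fill-valve controller: the function `command_fill_valve`, the constant `MAX_PRESSURE`, and the swept pressure range are illustrative inventions, not part of CLCS.

```python
MAX_PRESSURE = 100.0  # hypothetical pressure limit


def command_fill_valve(request_open: bool, pressure: float) -> str:
    """Hypothetical controller: returns 'OPEN' or 'CLOSED'."""
    if request_open and pressure < MAX_PRESSURE:
        return "OPEN"
    return "CLOSED"


def never_opens_at_or_over_limit() -> bool:
    """Negative requirement: the valve shall NOT open at or above the limit."""
    for tenths in range(1000, 2001):          # sweeps 100.0 through 200.0
        pressure = tenths / 10.0
        for request_open in (True, False):
            if command_fill_valve(request_open, pressure) == "OPEN":
                return False                  # requirement violated
    return True


negative_requirement_holds = never_opens_at_or_over_limit()
print(negative_requirement_holds)
```

The functional test ("the valve opens when requested") and the negative test ("the valve never opens over the limit") exercise different parts of the input space, which is why both are specified.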

Some Theory… 2nd Look

[Figure: the same set diagram with Fault Injection (added known) added to the inputs. Input labels: Known, Unknowns, Fault Injection (added known). Output labels: Known (Safe, Unsafe), Assumed Safe. Input sources: Normal Operation, Hardware Failures, Human Intervention, Models/Simulators.]

Design for Safety
• “Program and Project Responsibilities”
  • Dan Goldin message:
    • Safety is more than FMEA and FTA
    • Safety must be designed in at the earliest
• Existing Specifications
  • Must include safety
  • Methods & techniques for mitigation of hazards
  • Requirements – Traceable and Testable

Initiatives
• Dan Goldin: “Design for Safety”
  • Smart Practices applied early to designs
  • Early engineering changes are cheaper
• Provide draft guidance for design of safety-critical software
• Process changes
  • Design Guidelines – NASA-GB-1740.13
  • Peer reviews – enhanced checklist
  • Test development – Fault Injection for Robustness
  • Works to prevent unforeseen fault scenarios

Objectives
• Known fault scenarios
  • Analysis
  • Redesign
  • Test – functionality and robustness
• Unknowns
  • Design them out of the system
  • Test – fault injection
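Fault injection as a robustness test can be sketched as follows: deliberately bad sensor behaviors are fed to a control step, and the test checks that each one ends in the safe state. The function `control_step`, the `SAFE_STATE` value, and the particular injected faults are hypothetical examples chosen for illustration.

```python
import math
from typing import Callable, Dict

SAFE_STATE = "VALVE_CLOSED"  # hypothetical safe state


def control_step(read_sensor: Callable[[], float], limit: float = 100.0) -> str:
    """Hypothetical control step that falls back to the safe state on bad input."""
    try:
        pressure = read_sensor()
    except Exception:
        return SAFE_STATE                      # sensor failure: go safe
    if not math.isfinite(pressure) or pressure < 0.0:
        return SAFE_STATE                      # implausible reading: go safe
    if pressure >= limit:
        return SAFE_STATE                      # over limit: go safe
    return "VALVE_OPEN"


def raise_timeout() -> float:
    raise TimeoutError("injected fault: sensor did not respond")


# Each entry replaces the real sensor with a deliberately faulty one.
injected_faults: Dict[str, Callable[[], float]] = {
    "nan": lambda: float("nan"),
    "infinite": lambda: float("inf"),
    "negative": lambda: -5.0,
    "over_limit": lambda: 250.0,
    "timeout": raise_timeout,
}

failures = [name for name, fault in injected_faults.items()
            if control_step(fault) != SAFE_STATE]
nominal = control_step(lambda: 42.0)
print(failures, nominal)
```

An empty `failures` list means every injected fault was driven to the safe state, while the nominal reading still produces normal operation.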

S/W Safety – Where we are.

• Safety-Critical software identified & in engineering review
• Software Safety Integration Team formed
• Software FTA/FMEA in work
  • Will be recurring due to spiral development
• Design for Safety concepts being integrated
• Safety & Reliability Thread introduced
• Post-SSA Analysis Tools being procured

S/W Safety – What’s Next?

• Today
  • “Design for Safety” and “Known Fault Analyses”
• Tomorrow
  • Recursive and bi-directional analyses
  • Reliability predictions, Markov, Numerical Integration, Weibull analysis techniques
  • Probabilistic fault injection techniques
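For reference, the Weibull technique named above rests on the standard two-parameter reliability function R(t) = exp(-(t / scale)^shape). The sketch below only evaluates that formula; the parameter values are arbitrary illustrations and do not come from any CLCS data.

```python
import math


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Two-parameter Weibull reliability: R(t) = exp(-(t / scale) ** shape)."""
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be >= 0; shape and scale must be > 0")
    return math.exp(-((t / scale) ** shape))


# At t equal to the scale parameter, R(t) is exp(-1) for any shape value.
r_at_scale = weibull_reliability(1000.0, shape=1.5, scale=1000.0)
r_at_zero = weibull_reliability(0.0, shape=1.5, scale=1000.0)
print(r_at_scale, r_at_zero)
```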

Summary
• Life on the Leading Edge
  • Probably the “Largest real-time safety-critical control system on the planet”
• Safety is our #1 core value
• We are on front and center stage – The NASA team is watching