Software Safety Program
Software Safety Engineering (S2E) Program Status
Dan Fitch March 7, 2001
Software Safety Program - Overview
• General Safety Concepts - WHY
• Software Safety and CLCS - HOW
• Known Hazards
• Designing for Safety
• Safety & Reliability Thread
• Current Status
Software Safety – What is it?
[Figure: successive limits applied to a measured value]
• Anticipate: rate (slope) limit
• Detect: absolute-value limit
• Control: prevent
• Mitigate: limit damage, return to safe state
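The layers named on this slide (anticipate on rate or slope, detect on absolute value, then prevent or mitigate) can be illustrated with a small limit monitor. This is only a sketch: the `Limits` class, the `assess` function, the action strings, and all threshold values are hypothetical and do not come from CLCS.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical thresholds for one measurement (e.g. a tank pressure)."""
    max_value: float  # absolute-value limit, used for detection
    max_rate: float   # rate (slope) limit in units per second, used for anticipation


def assess(prev: float, curr: float, dt: float, limits: Limits) -> str:
    """Return the action for one pair of samples, most severe case first."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    rate = (curr - prev) / dt
    if curr > limits.max_value:
        # Detection: the absolute limit is already exceeded, so mitigate.
        return "MITIGATE: limit damage, return to safe state"
    if rate > limits.max_rate:
        # Anticipation: the value is still legal, but the slope predicts a violation.
        return "PREVENT: act before the absolute limit is reached"
    return "NOMINAL"


limits = Limits(max_value=100.0, max_rate=5.0)
print(assess(80.0, 82.0, 1.0, limits))   # slow rise, below the limit
print(assess(80.0, 90.0, 1.0, limits))   # fast rise, below the limit
print(assess(99.0, 101.0, 1.0, limits))  # above the absolute limit
```

Checking the rate as well as the absolute value is what gives the software time to act before the hazard exists rather than after.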
Software Safety – What is it?
Definitions
• Functionally-critical: mission completion
• Safety-critical: humans = life & limb; hardware = $10^6
Some set theory
• Input versus output
Some Theory…
[Figure: set diagram of inputs and outputs]
Labels: set of inputs; unknowns; set of outputs; known safe; known unsafe; assumed safe
Sources: normal operation, hardware failures, human intervention, models/simulators
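One way to read the set diagram is as a partition of the input space: inputs analyzed and found safe, inputs analyzed and found unsafe, and everything else, which is unknown and only assumed safe. The sketch below uses Python sets to show that reading. Every input name in it is invented for illustration.

```python
# Hypothetical input identifiers; none of these names come from the slides.
all_inputs = {
    "valve_open_cmd",
    "pressure_nominal",
    "pressure_sensor_stuck",
    "operator_override",
    "sim_data_on_live_bus",
}
known_safe = {"valve_open_cmd", "pressure_nominal"}
known_unsafe = {"pressure_sensor_stuck"}

# Whatever has not been analyzed is unknown, and is only *assumed* safe.
unknown = all_inputs - known_safe - known_unsafe


def classify(name: str) -> str:
    """Place one input in the partition described above."""
    if name in known_unsafe:
        return "known unsafe"
    if name in known_safe:
        return "known safe"
    return "unknown (assumed safe)"


for name in sorted(all_inputs):
    print(f"{name}: {classify(name)}")
```

The point of the partition is that the "unknown" set is never empty in practice, which is why analysis of known faults alone cannot close out the risk.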
Software Safety – Why do it?
Direction:
• DoD: Mil-Std-882D, DoD-Std-2167
• NASA: NSTS-07700, NSS-8719.13, NASA-GB-1740.13, NSS-22206, NSS-22254, direction from Dan Goldin
• CLCS: 84K-00055, KDP-P-2901
Software Safety – Why do it?
Objective: Identify & Mitigate Risk
• Known fault scenarios – by requirements, analyses & test
• Possible unknowns – by design approach & further test
“Knowns”
• Hardware fault-driven scenarios
• Legacy of hardware failure data available from the 1970s
• Hardware-driven hazards
• May be analyzed – the SSA
• May be tested – specific fault injection
• Identifies risk & yields design changes – Issues/ESRs
• The Safety Case – summary of risk findings
“Stuff” Happens
“Unknowns”
• Software doesn’t fail – it just doesn’t do what we thought it would
• Hardware and some functions (e.g., seeds & races) cause most random errors
• Specification & coding errors = prime cause
• 90% of errors are in the specifications
• C++ and Java are inherently powerful, but dangerous
Ferengi Software Safety Rule #76
If it "touches*" hardware that can impact the safety of people or equipment, an SSA is absolutely necessary.
*(i.e., controls, monitors, or mitigates the risk of using)
SSA - What and When
• Assessment of risk factors due to software
• Hardware hazards
• SFMEA and SFTA
• KDP-P-2901
Schedule:
• 30 days before the first interaction with flight hardware
• In time for 5A/B testing
• Presented at TRR/ORR
[Figure: SSA process set against the development life cycle, built up over six slides]
Development phases: IPT/DP-1 Conceptual Design; SRS/DP-2 Detail Design; DDS/ODS/DP-3 Code Development; Val/Ver Test (3A/B, 4A/B); TRR/ORR Readiness Reviews; System Test 5A/B (with hardware)
System Safety Analysis (KDP-P-2901 SSA Process): S-C Matrix; PHA; FTA/FMEA; Risk Assessment; SSA Report
Feedback: risk; CM-driven changes
Software Fault Tree Analysis
• Works backward from the fault to its root causes
• Uses design details of the entire system
• Leads to better understanding of causes and their prevention
• Unknown fault events not considered
Fault Tree Analysis
[Figure: example fault tree]
• Top event: fill valve not closed
• Intermediate events (causal relationship): S/W did not anticipate rapid pressure rise; S/W did not react to overpressure; human did not notice pressure
• Basic fault events: root cause; other
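The fill-valve tree can be evaluated mechanically once gates are assigned. The sketch below is illustrative only: the slide does not state the gate types, so the AND gate at the top (all three defenses must fail) is an assumption, and the basic fault events under each intermediate event are invented examples.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    """One node of a fault tree: a basic fault or a gate over child events."""
    name: str
    gate: str = "BASIC"  # "BASIC", "AND", or "OR"
    children: List["Event"] = field(default_factory=list)

    def occurs(self, active_faults: Set[str]) -> bool:
        """True if this event occurs given the set of active basic faults."""
        if self.gate == "BASIC":
            return self.name in active_faults
        results = [child.occurs(active_faults) for child in self.children]
        if self.gate == "AND":
            return all(results)
        if self.gate == "OR":
            return any(results)
        raise ValueError(f"unknown gate: {self.gate}")


# Assumed structure: the top event needs every defense to fail (AND);
# each intermediate event has hypothetical basic causes (OR).
top = Event("Fill valve not closed", "AND", [
    Event("S/W did not anticipate rapid pressure rise", "OR", [
        Event("rate limit missing from spec"),
        Event("sample period too long"),
    ]),
    Event("S/W did not react to overpressure", "OR", [
        Event("absolute limit set too high"),
    ]),
    Event("Human did not notice pressure", "OR", [
        Event("display not monitored"),
    ]),
])

print(top.occurs({"rate limit missing from spec"}))
print(top.occurs({"sample period too long",
                  "absolute limit set too high",
                  "display not monitored"}))
```

Working backward from the top event this way shows which combinations of basic faults are sufficient to cause it, but only for faults someone thought to put in the tree.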
Analysis & CLCS Architecture
[Figure: a hazardous event passing through the CLCS layers]
• Hazardous event
• Hardware safing
• Detection & anticipation; control & mitigation
• Software layers: Applications, Apps Srvcs, Sys Srvcs, System S/W
• Remaining risk
The Software FMEA
• Predicted hardware failures followed to their conclusion through the software
• What can go wrong?
• What happens when it does?
• Must know system failures up front
• Won’t prevent the unexpected
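A software FMEA of this kind can be pictured as a table built by pushing each predicted hardware failure through the control logic and recording the effect. The following is a minimal sketch under stated assumptions: the `control` function, the 100-unit limit, and the three sensor failure modes are all hypothetical.

```python
from typing import Callable, Dict, Optional


def control(reading: Optional[float]) -> str:
    """Hypothetical logic under analysis: close the valve above 100 units."""
    if reading is None:
        return "CLOSE_VALVE"  # no data: fail to the safe state
    return "CLOSE_VALVE" if reading > 100.0 else "HOLD"


# Hypothetical predicted failure modes. Each maps the true pressure
# to the value the software would actually see.
FAILURE_MODES: Dict[str, Callable[[float], Optional[float]]] = {
    "sensor stuck at 90": lambda pressure: 90.0,
    "sensor reads zero": lambda pressure: 0.0,
    "sensor signal lost": lambda pressure: None,
}


def fmea(true_pressure: float) -> Dict[str, str]:
    """What can go wrong, and what happens when it does?"""
    required = control(true_pressure)  # response with healthy hardware
    rows = {}
    for mode, corrupt in FAILURE_MODES.items():
        actual = control(corrupt(true_pressure))
        # Hazard: the valve should have closed but the software held.
        hazard = required == "CLOSE_VALVE" and actual != "CLOSE_VALVE"
        rows[mode] = "HAZARD" if hazard else "safe"
    return rows


for mode, effect in fmea(120.0).items():
    print(f"{mode}: {effect}")
```

The table is only as complete as the list of failure modes fed into it, which is the limitation the slide notes: the failures must be known up front.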
[Figure: CLCS, with spiral development, cultural changes, failure of software, and test]
SSA – Traditional Approach
Traditional development: Fault Tree Analysis, Failure Modes & Effects Analysis
• All or most code available
• A lot known about the system
• Too late…
SSA – An Iterative Process
[Figure: iterative cycle within spiral development]
Safety Criticality Assessment; Fault Tree Analysis; Failure Modes & Effects Analysis; engineering design changes
SSA - Where
S&MA will perform a Software Safety Analysis (SSA) for each delivery at every location, i.e., as we step up to each new drop. After the initial SSA, the analysis will be updated and a new SSA report issued for each modification to the safety-critical software.
SSA - Planning
[Figure: timeline from design begin … to Val/Ver test: PHA, FTA, FMEA, Risk Assessment, SSA Report]
On a PERT chart, SSA preparation begins during preparation of the design specifications and has a finish-to-finish relationship with validation/verification (4A/B) testing.
Ferengi Software Safety Rule #304
The SSA isn’t enough.
Paradigms
Software failures: “Software does not fail - it just does not perform as intended” (Dr. Nancy Leveson, MIT)
Paradigms
Design and test for functionality: also specify what the system should not do. Then test it.
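Specifying what the system should not do turns into a test in the same way a functional requirement does. The sketch below pairs one functional test with one "shall not" test. The `command_valve` handler, the limit value, and the requirement wording are hypothetical. The test functions are plain functions that can be called directly or collected by a test runner.

```python
MAX_PRESSURE = 100.0  # hypothetical limit


def command_valve(request: str, pressure: float) -> str:
    """Hypothetical handler: refuse to open the fill valve at overpressure."""
    if request == "OPEN" and pressure > MAX_PRESSURE:
        return "REJECTED"
    return request


def test_opens_when_requested():
    # Functional requirement: what the system should do.
    assert command_valve("OPEN", 50.0) == "OPEN"


def test_never_opens_at_overpressure():
    # "Shall not" requirement: what the system should not do.
    for pressure in (100.1, 150.0, 1e6):
        assert command_valve("OPEN", pressure) != "OPEN"


test_opens_when_requested()
test_never_opens_at_overpressure()
print("both requirements hold for the sampled cases")
```

A passing functional test says nothing about the second case, which is why the negative requirement needs its own statement and its own test.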
Some Theory… 2nd Look
[Figure: set diagram of inputs and outputs, second look]
Labels: set of inputs; unknowns; fault injection (added known); set of outputs; known safe; known unsafe; assumed safe
Sources: normal operation, hardware failures, human intervention, models/simulators
Design for Safety
“Program and Project Responsibilities”, Dan Goldin message:
• Safety is more than FMEA and FTA
• Safety must be designed in at the earliest stage
• Existing specifications must include safety
• Methods & techniques for mitigation of hazards
• Requirements – traceable and testable
Initiatives
Dan Goldin: “Design for Safety”
• Smart practices applied early to designs
• Early engineering changes are cheaper
• Provide draft guidance for design of safety-critical software
Process changes
• Design guidelines – NASA-GB-1740.13
• Peer reviews – enhanced checklist
• Test development – fault injection for robustness
Works to prevent unforeseen fault scenarios
Objectives
• Known fault scenarios: analysis; redesign; test – functionality and robustness
• Unknowns: design them out of the system; test – fault injection
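Fault injection for robustness differs from testing known fault scenarios: instead of replaying predicted failures, the test feeds the software inputs nobody planned for and checks that it always lands in a safe state. The sketch below is a minimal illustration under assumed names. The `control` handler, the safe-output set, and the list of corrupted values are hypothetical, and the random generator is seeded so a finding can be reproduced.

```python
import math
import random

SAFE_OUTPUTS = {"HOLD", "CLOSE_VALVE"}


def control(reading) -> str:
    """Hypothetical handler: anything it cannot interpret drives the safe state."""
    if isinstance(reading, bool) or not isinstance(reading, (int, float)):
        return "CLOSE_VALVE"
    if math.isnan(reading) or reading < 0.0 or reading > 100.0:
        return "CLOSE_VALVE"
    return "HOLD"


def inject_faults(trials: int, seed: int = 0) -> list:
    """Feed corrupted inputs to control() and collect any that escape the safe set."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    faults = [None, "garbage", float("nan"), float("inf"), -1.0, True]
    escapes = []
    for _ in range(trials):
        value = rng.choice(faults + [rng.uniform(-1e9, 1e9)])
        try:
            if control(value) not in SAFE_OUTPUTS:
                escapes.append(value)
        except Exception:
            escapes.append(value)  # an unhandled exception is also a finding
    return escapes


print(inject_faults(1000))
```

An empty result does not prove the handler is safe. It only shows that none of the injected values broke it, which is why this kind of test complements analysis rather than replacing it.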
S/W Safety – Where we are.
• Safety-critical software identified & in engineering review
• Software Safety Integration Team formed
• Software FTA/FMEA in work
• Will be recurring due to spiral development
• Design for Safety concepts being integrated
• Safety & Reliability Thread introduced
• Post-SSA Analysis Tools being procured
S/W Safety – What’s Next?
Today
• “Design for Safety” and “Known Fault Analyses”
Tomorrow
• Recursive and bi-directional analyses
• Reliability predictions, Markov, numerical integration, Weibull analysis techniques
• Probabilistic fault injection techniques
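As one concrete instance of the Weibull techniques listed under "Tomorrow", the standard two-parameter Weibull model gives reliability R(t) = exp(-(t/scale)^shape) and hazard rate h(t) = (shape/scale)(t/scale)^(shape-1). The sketch below evaluates both. The parameter values are arbitrary illustrations, not CLCS data, and the slides do not say which Weibull analysis was planned.

```python
import math


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Probability of surviving to time t under a two-parameter Weibull model."""
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be >= 0 and shape, scale must be > 0")
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate at time t (t must be positive)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be > 0")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative parameters only: shape < 1 models early failures,
# shape = 1 a constant failure rate, shape > 1 wear-out.
for shape in (0.5, 1.0, 2.0):
    r = weibull_reliability(500.0, shape, 1000.0)
    h = weibull_hazard(500.0, shape, 1000.0)
    print(f"shape={shape}: R(500)={r:.4f}, h(500)={h:.6f}")
```

With shape equal to 1 the model reduces to the exponential distribution, so the hazard rate is constant at 1/scale.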
Summary
Life on the Leading Edge
• Probably the “Largest real-time safety-critical control system on the planet”
• Safety is our #1 core value
• We are on front and center stage – the NASA team is watching