Safety of Embedded Software - Massachusetts Institute of

Engineering a Safer World
Traditional Approach to Safety
• Traditionally view safety as a failure problem
– Chain of random, directly related failure events leads to loss
– Establish barriers between events or try to prevent individual
component failures
e.g., redundancy, overdesign, safety margins, punishment and training
for operators, defense in depth
• Analysis techniques
– Focus on probability of component failures and combinations of
component failures
– Where do they get the probabilities?
• Historical failure rates
• Make up numbers for human error and software error or ignore
these in the analysis
Chain-of-events example
Limitations of Traditional Approach
• Systems are becoming more complex
– Accidents often result from interactions among components, not
just component failures
– Too complex to anticipate all potential interactions
• By designers
• By operators
– Indirect and non-linear interactions
– Can no longer test exhaustively to remove design errors
• Omits or oversimplifies important factors
– Human error
– New technology, particularly software
– Culture and management
– Evolution and adaptation
Accident with No Component Failures
Types of Accidents
• Component Failure Accidents
– Single or multiple component failures
– Usually assume random failure
• Component Interaction Accidents
– Arise in interactions among components
– Complexity getting to point where cannot anticipate or
guard against all potential interactions
– Exacerbated by introduction of computers and software
Interactive Complexity
• Arises in interactions among system components
– Software allows us to build highly coupled and interactively
complex systems
– Coupling causes interdependence
– Increases number of interfaces and potential interactions
• Too complex to anticipate all potential interactions
– By designers
– By operators
• May lead to accidents even when no individual
component failures
Non-Linear Complexity
• Cause and effect not related in an obvious way
• Systemic factors in accidents, e.g., safety culture, work
environment, production pressures, etc.
– Our safety engineering techniques assume linearity
– Systemic factors affect events in non-linear and indirect ways
Dynamic Complexity
• Related to changes over time
– Systems are not static, but we often assume they are
– Systems migrate toward states of high risk under
competitive and financial pressures [Rasmussen]
– Need to control and identify unsafe changes
Software-Related Accidents
• Are usually caused by flawed requirements
– Incomplete or wrong assumptions about operation of
controlled system or required operation of computer
– Unhandled controlled-system states and environmental
conditions
• Merely trying to get the software “correct” or to make it
reliable will not make it safer under these conditions.
Software-Related Accidents (2)
• Software may be highly reliable and “correct” and still be
unsafe:
– Correctly implements requirements but specified behavior unsafe
from a system perspective.
– Requirements do not specify some particular behavior required for
system safety (incomplete)
– Software has unintended (and unsafe) behavior beyond what is
specified in requirements.
Safety vs. Correctness
• Safety involves more than simply getting the software “correct”:
Example: altitude switch
1. Signal is safety-increasing: require that any one of the three altimeters reports below threshold
2. Signal is safety-decreasing: require that all three altimeters report below threshold
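The altitude switch example can be sketched in a few lines. This is only an illustration of the voting rule on the slide: the function name, the readings, and the 1000 ft threshold are assumptions, not part of the original material.

```python
from typing import Sequence


def should_signal(
    altitudes_ft: Sequence[float],
    threshold_ft: float,
    signal_increases_safety: bool,
) -> bool:
    """Decide whether the altitude switch should assert its signal.

    Hypothetical helper: the voting rule depends on whether issuing the
    signal makes the system safer or less safe.
    """
    if not altitudes_ft:
        raise ValueError("at least one altimeter reading is required")
    votes_below = sum(1 for altitude in altitudes_ft if altitude < threshold_ft)
    if signal_increases_safety:
        # Err toward signalling: one altimeter below threshold is enough.
        return votes_below >= 1
    # Err toward not signalling: every altimeter must agree.
    return votes_below == len(altitudes_ft)


# Hypothetical readings: one altimeter reports below a 1000 ft threshold.
readings = [980.0, 1010.0, 1025.0]
print(should_signal(readings, 1000.0, signal_increases_safety=True))
print(should_signal(readings, 1000.0, signal_increases_safety=False))
```

The same three readings give opposite answers depending on the safety effect of the signal, which is why "correct" voting logic cannot be chosen without a system-level view.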
Confusing Safety and Reliability
Event-based Thinking
Systems Thinking
STAMP: System-Theoretic Accident Model and Processes
Based on Systems Theory (vs. Reliability Theory)
Applying Systems Thinking to Safety
• Accidents involve a complex, dynamic “process”
– Not simply chains of failure events
– Arise in interactions among humans, machines and the
environment
• Treat safety as a dynamic control problem
– Safety requires enforcing a set of constraints on system
behavior
– Accidents occur when interactions among system
components violate those constraints
– Safety becomes a control problem rather than just a
reliability problem
Safety as a Dynamic Control Problem
• Examples
– O-ring did not control propellant gas release by sealing gap
in field joint of Challenger Space Shuttle
– Software did not adequately control descent speed of Mars
Polar Lander
– At Texas City, did not control the level of liquids in the ISOM tower
– In Deepwater Horizon (DWH), did not control the pressure in the well
– Financial system did not adequately control the use of
financial instruments
Safety as a Dynamic Control Problem (2)
• Most major accidents arise from a slow migration of the
entire system toward a state of high-risk
• Need to control and detect this migration
• A change in emphasis:
“prevent failures”
↓
“enforce safety constraints on system behavior”
Example Safety Control Structure
Qi Hommes, 2012
Safety as a Control Problem (3)
• Goal: Design an effective control structure that
eliminates or reduces adverse events.
– Need clear definition of expectations, responsibilities,
authority, and accountability at all levels of safety control
structure
– Entire control structure must together enforce the system
safety property (constraints)
• Physical design (inherent safety)
• Operations
• Management
• Social interactions and culture
Role of Process Models in Control
• Controllers use a process model to
determine control actions
[Figure: control loop. The Controller, which contains a Control Algorithm and a Process Model, sends Control Actions to the Controlled Process and receives Feedback from it.]
• Accidents often occur when the
process model is incorrect
• Four types of hazardous control
actions:
• Control commands required for safety
are not given
• Unsafe ones are given
• Potentially safe commands given too
early, too late
• Control stops too soon or applied too
long
(Leveson, 2003); (Leveson, 2011)
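A small simulation can illustrate how an incorrect process model leads to an unsafe control action even though no component has "failed" in the control algorithm itself. The tank, the setpoint, and all names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Tank:
    """Controlled process (hypothetical): a tank that overflows above capacity."""
    level: float = 0.0
    capacity: float = 10.0

    def apply(self, inflow: float) -> None:
        self.level += inflow

    @property
    def overflowed(self) -> bool:
        return self.level > self.capacity


@dataclass
class Controller:
    """Controller whose process model is its belief about the tank level."""
    believed_level: float = 0.0
    setpoint: float = 8.0

    def update_model(self, feedback: Optional[float]) -> None:
        # With no feedback, the process model silently goes stale.
        if feedback is not None:
            self.believed_level = feedback

    def control_action(self) -> float:
        # The algorithm is "correct" with respect to the believed level.
        return 1.0 if self.believed_level < self.setpoint else 0.0


def run(steps: int, feedback_lost_after: Optional[int] = None) -> Tuple[Tank, Controller]:
    tank, controller = Tank(), Controller()
    for step in range(steps):
        has_feedback = feedback_lost_after is None or step < feedback_lost_after
        controller.update_model(tank.level if has_feedback else None)
        tank.apply(controller.control_action())
    return tank, controller


tank_ok, _ = run(steps=20)
tank_bad, controller_bad = run(steps=20, feedback_lost_after=5)
print(tank_ok.level, tank_ok.overflowed)
print(tank_bad.level, tank_bad.overflowed, controller_bad.believed_level)
```

In the second run the controller keeps filling because its model still says the level is below the setpoint, which corresponds to the "applied too long" category of hazardous control action.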
STAMP: Theoretical Causality Model
• Processes built on STAMP
– System Engineering (e.g., Specification, Safety-Guided Design, Design Principles)
– Risk Management
– Management Principles/Organizational Design
– Operations
– Regulation
• Tools built on STAMP
– Accident/Event Analysis: CAST
– Hazard Analysis: STPA
– Specification Tools: SpecTRM
– Organizational/Cultural Risk Analysis
– Identifying Leading Indicators
Learning from Events
• CAST: Causal Analysis based on System Theory
• Goal: more complete causal analysis of accidents,
incidents, and adverse events
Facts about Accidents
• Almost never have single causes
– “Root cause seduction”
– Accidents are complex processes
• Usually involve flaws in
– Engineered equipment
– Operator behavior
– Management decision making
– Safety culture
– Regulatory oversight
Root Cause Seduction
• Assuming there is a root cause gives us an illusion of
control.
– Usually focus on operator error or technical failures
– Ignore systemic and management factors
– Leads to a sophisticated “whack a mole” game
• Fix symptoms but not process that led to those symptoms
• In continual fire-fighting mode
• Having the same accident over and over
Three Levels of Analysis
• What (events)
– e.g., explosion
• Who and how (conditions)
– e.g., bad valve design, operator did not notice something
• Why (systemic factors)
– e.g., production pressures, cost concerns, flaws in design
process, flaws in reporting process, etc.
– Why was safety control structure ineffective in preventing the
loss?
Hazard Analysis
• “Investigating an accident before it occurs”
• Identify potential causal scenarios and try to eliminate them
• Must be based on some model of how and why accidents
occur
• STPA (System-Theoretic Process Analysis)
– Based on STAMP
– Assumes accidents are more complex processes than just
chains of component failure events
Goals for an Accident Analysis Technique
• Minimize hindsight bias
• Provide a framework or process to assist in
understanding entire accident process and identifying
systemic factors
• Get away from blame (“who”) and shift focus to “why”
and how to prevent in the future
• Goal is to determine
– Why people behaved the way they did
– Weaknesses in the safety control structure that allowed
the loss to occur
STPA Step 2
[Figure: generic control loop (Controller, Actuator, Controlled Process, Sensor, plus a second Controller) annotated with causal factors that can lead to unsafe control]
• Controller
– Control input or external information wrong or missing
– Inadequate control algorithm (flaws in creation, process changes, incorrect modification or adaptation)
– Process model inconsistent, incomplete, or incorrect
– Missing or wrong communication with another controller
• Controller to Actuator to Controlled Process
– Inappropriate, ineffective, or missing control action
– Inadequate operation of actuator
– Delayed operation
• Controlled Process
– Component failures
– Changes over time
– Conflicting control actions (from a second controller)
– Process input missing or wrong
– Unidentified or out-of-range disturbance
– Process output contributes to system hazard
• Controlled Process to Sensor to Controller
– Inadequate operation of sensor
– Incorrect or no information provided
– Measurement inaccuracies
– Feedback delays
– Inadequate or missing feedback
STPA (System-Theoretic Process Analysis)
• A top-down, system engineering technique
• Identifies safety constraints (system and component safety
requirements)
• Identifies scenarios leading to violation of safety constraints;
use results to design or redesign system to be safer
• Can be used on technical design and organizational design
• Supports a safety-driven design process where
– Hazard analysis influences and shapes early design decisions
– Hazard analysis iterated and refined as design evolves
Steps in STPA
• Establish fundamentals
– Define “accident” for your system
– Define hazards
– Rewrite hazards as constraints on system design
– Draw preliminary (high-level) safety control structure
• Identify potentially unsafe control actions (high-level safety
requirements and constraints)
• Determine how each potentially hazardous control action
could occur
Steps in STPA
• Establish foundation for analysis
– Define “accident” for your system
– Define hazards
– Rewrite hazards as constraints on system design
– Draw preliminary (high-level) safety control structure
• Step 1: Identify potentially unsafe control actions (high-level
safety requirements and constraints) [Step 1 STPA]
• Step 2: Determine how each potentially hazardous control
action could occur [Step 2 STPA]
Identifying Accidents and Hazards
• Accident (Loss): An undesired or unplanned event that
results in a loss, including a loss of human life or human
injury, property damage, environmental pollution, mission
loss, financial loss, etc.
• Hazard: A system state or set of conditions that together
with a worst-case set of environmental conditions, will lead
to an accident (loss)
Examples of Systems, Accidents, and Hazards
• System: ACC
– Accident: Two vehicles collide
– Hazard: Inadequate distance between vehicle and one in front or in back
• System: Chemical Plant
– Accident: People die or are injured due to exposure to chemicals
– Hazard: Chemicals in air after release from plant
• System: Train door controller
– Accident: Passenger falls out of train
– Hazard: Door is open when train starts
– Hazard: Door is open while train moving
Identify High-Level Safety Constraints (Requirements)
• Hazard: Inadequate distance between vehicle and one in front or in back
– Safety Constraint (Requirement): Vehicles must never violate minimum separation requirements
• Hazard: Chemicals in air after release from plant
– Safety Constraint (Requirement): Chemicals must never be released inadvertently from plant
• Hazard: Door is open when train starts
– Safety Constraint (Requirement): Train must never start while door is open
• Hazard: Door is open while train moving
– Safety Constraint (Requirement): Door must never open while train is moving
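As a rough sketch, the two train door constraints can be written as guards that a door controller evaluates before issuing a command. The class and function names are hypothetical; the slides do not prescribe any implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainState:
    """Hypothetical snapshot of the state relevant to the two door hazards."""
    door_open: bool
    moving: bool


def in_hazardous_state(state: TrainState) -> bool:
    # Both hazards reduce to the same state: door open while the train moves.
    return state.door_open and state.moving


def may_start_train(state: TrainState) -> bool:
    # Constraint: the train must never start while a door is open.
    return not state.door_open


def may_open_door(state: TrainState) -> bool:
    # Constraint: the door must never open while the train is moving.
    return not state.moving


at_platform = TrainState(door_open=True, moving=False)
en_route = TrainState(door_open=False, moving=True)
print(may_start_train(at_platform), may_open_door(at_platform))
print(may_start_train(en_route), may_open_door(en_route))
```

Note that each guard is only as good as the state it is given: if `door_open` comes from a faulty sensor, the guard enforces the constraint on the process model rather than on the real train.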
Safety Constraints vs. Safety Requirements
• Design constraints:
– ACC must not violate separation requirements with object
ahead
– ACC must not brake too abruptly
• Design requirements
– ACC shall maintain a TBD amount of distance between
the vehicle and the object in front when engaged
– ACC shall limit vehicle deceleration to no more than TBC m/s²
In-Class Example
In-Trail Procedure (ITP)
A new passing procedure for oceanic flights
In-Trail Procedure (ITP)
• Enables aircraft to achieve FL changes on a more frequent basis.
• Designed for oceanic and remote airspaces not covered by radar.
• Permits climb and descent using new reduced longitudinal separation
standards.
• Potential Benefits
– Reduced fuel burn and CO2 emissions via more opportunities to reach the optimum
FL or FL with more favorable winds.
– Increased safety via more opportunities to leave turbulent FL.
• But standard separation requirements not met during maneuver
ITP Procedure – Step by Step
Flight Crew
1. Check that ITP criteria are met.
2. If ITP is possible, request ATC clearance.
Air Traffic Controller
3. Check that there are no blocking aircraft other than Reference Aircraft in the ITP request.
4. Check that ITP request is applicable (i.e., standard request not sufficient) and compliant with ITP phraseology.
5. Check that ITP criteria are met.
6. If all checks are positive, issue ITP clearance.
Flight Crew
8. When ITP clearance is received, check that ITP criteria are still met.
9. If ITP criteria are still met, accept ITP clearance via CPDLC.
10. Execute ITP clearance without delay.
11. Report when established at the cleared FL.
Involves multiple aircraft, crew, communications (ADS-B, GPS), ATC
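The sequence of checks in the procedure can be sketched as a chain of gates, each of which must pass before the next party acts. The outcome names and the behavior when a check fails are assumptions made for illustration; the slide lists only the positive path, and it shows no step 7, so none is modeled here.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NOT_REQUESTED = "crew did not request ITP"
    NOT_CLEARED = "ATC did not issue ITP clearance"
    CLEARANCE_NOT_ACCEPTED = "crew did not accept ITP clearance"
    EXECUTED = "ITP executed and new FL reported"


@dataclass(frozen=True)
class AtcChecks:
    no_blocking_aircraft: bool              # step 3
    request_applicable_and_compliant: bool  # step 4
    itp_criteria_met: bool                  # step 5

    def all_positive(self) -> bool:
        return (
            self.no_blocking_aircraft
            and self.request_applicable_and_compliant
            and self.itp_criteria_met
        )


def run_itp_exchange(
    crew_criteria_met: bool,
    atc: AtcChecks,
    criteria_still_met: bool,
) -> Outcome:
    if not crew_criteria_met:       # steps 1-2
        return Outcome.NOT_REQUESTED
    if not atc.all_positive():      # steps 3-6
        return Outcome.NOT_CLEARED
    if not criteria_still_met:      # steps 8-9
        return Outcome.CLEARANCE_NOT_ACCEPTED
    return Outcome.EXECUTED         # steps 10-11


all_ok = AtcChecks(True, True, True)
print(run_itp_exchange(True, all_ok, criteria_still_met=True).value)
print(run_itp_exchange(True, all_ok, criteria_still_met=False).value)
```

The criteria are checked three times by two different controllers, each using its own process model, which is why the later analysis looks at how those models can disagree with the real traffic situation.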
Accident and Hazard Definition for ITP
Accident: Two aircraft collide
Hazard?:
Batch Reactor In-Class Exercise
• What is the accident?
• What is the system-level hazard (associated with that
accident)?
• What is the high-level system safety requirement (safety
constraint)?
Steps in STPA
• Establish foundation for analysis
– Define “accident” for your system
– Define hazards
– Rewrite hazards as constraints on system design
– Draw preliminary (high-level) functional control structure
• Step 1: Identify potentially unsafe control actions (high-level
safety requirements and constraints)
• Step 2: Determine how each potentially hazardous control
action could occur
Draw the Functional Control Structure
• Identify major components and controllers
(HINT: Start at very high level)
• Label control and feedback arrows
• Create the preliminary process models
ITP High-Level Control Structure
• What are the major components and controllers of the
system?
ITP High-Level Control Structure (2)
• What commands are sent and feedback provided?
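[Figure: High-Level Control Structure for ITP. ATC sends clearance to pass (to execute ITP) to the Pilot; the Pilot sends requests and acknowledgements to ATC. The Pilot sends “Execute ITP maneuver” to the Aircraft; the Aircraft returns A/C status, position, etc.]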
Pilot Responsibilities and Process Model
• Responsibilities:
– Assess whether ITP appropriate
– Check if ITP criteria are met
– Request ITP
– Receive ITP approval
– Recheck criteria
– Execute flight level change
– Confirm new flight level to ATC
• Process Model
– Own ship climb/descend capability
– ADS-B data for nearby aircraft (velocity, position, orientation)
– ITP criteria (speed, distance, relative altitude, similar track, data
quality)
– State of ITP request/approval
– etc.
For the Batch Reactor
Draw the functional control structure
What are the Responsibilities of the
Software Controller?
• Not the requirements, simply the basic functions to be
implemented
Steps in STPA
• Establish foundation for analysis
– Define “accident” for your system
– Define hazards
– Rewrite hazards as constraints on system design
– Draw preliminary (high-level) safety control structure
• Step 1: Identify potentially unsafe control actions (high-level
safety requirements and constraints)
• Step 2: Determine how each potentially hazardous control
action could occur
Four Ways Unsafe Control Can Occur
• A control action required for safety is not provided or is not
followed
• An unsafe control action is provided that leads to a hazard
• A potentially safe control action provided too late, too early,
or out of sequence
• A safe control action is stopped too soon or applied too
long (for a continuous or non-discrete control action)
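One way to keep a Step 1 analysis systematic is to treat the four categories as a fixed set of columns and each control action as a row. The sketch below is only an illustration; the enum, the helper names, and the dictionary layout are assumptions, not part of STPA itself.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class UnsafeControlType(Enum):
    NOT_PROVIDED = "Not providing causes hazard"
    PROVIDED = "Providing causes hazard"
    WRONG_TIMING_OR_ORDER = "Too early/too late, wrong order"
    WRONG_DURATION = "Stopped too soon/applied too long"


Step1Table = Dict[str, Dict[UnsafeControlType, List[str]]]


def blank_step1_table(control_actions: Iterable[str]) -> Step1Table:
    """Create one row per control action with an empty cell per category."""
    return {
        action: {kind: [] for kind in UnsafeControlType}
        for action in control_actions
    }


def unfilled_cells(table: Step1Table) -> List[Tuple[str, UnsafeControlType]]:
    """List cells the analyst has not yet considered or filled in."""
    return [
        (action, kind)
        for action, row in table.items()
        for kind, entries in row.items()
        if not entries
    ]


table = blank_step1_table(["Execute ITP maneuver"])
table["Execute ITP maneuver"][UnsafeControlType.PROVIDED].append(
    "ITP executed when not approved"
)
for action, kind in unfilled_cells(table):
    print(f"{action}: {kind.value}")
```

An empty cell is not necessarily a gap, since some categories may not apply to a given control action, but listing them prompts the analyst to make that decision explicitly.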
Step 1 table columns:
• Control Action
• Not providing causes hazard
• Providing causes hazard
• Too early/too late, wrong order
• Stopped too soon/applied too long
Step 1 for Pilot
[Figure: Pilot issues “Execute ITP maneuver” to the Aircraft; the Aircraft returns A/C status, position, etc.]
Control Action: Execute ITP Maneuver
• Not providing causes hazard:
• Providing causes hazard:
• Too early/too late, wrong order:
• Stopped too soon/applied too long:
Potentially Hazardous Control Actions by the Flight Crew
• Control Action: Execute ITP
– Not providing causes hazard: (none listed)
– Providing causes hazard: ITP executed when not approved; ITP executed when ITP criteria are not satisfied; ITP executed with incorrect climb rate, final altitude, etc.
– Wrong timing/order causes hazard: ITP executed too soon before approval; ITP executed too late after reassessment
– Stopped too soon/applied too long: ITP aircraft levels off above requested FL; ITP aircraft levels off below requested FL
• Control Action: Abnormal Termination of ITP
– Not providing causes hazard: FC continues with maneuver in dangerous situation
– Providing causes hazard: FC aborts unnecessarily; FC does not follow regional contingency procedures while aborting
– Wrong timing/order causes hazard: (none listed)
– Stopped too soon/applied too long: (none listed)
High-Level Constraints on Flight Crew
• The flight crew must not execute the ITP when it has not been approved by ATC.
• The flight crew must not execute an ITP when the ITP criteria are not satisfied.
• The flight crew must execute the ITP with correct climb rate, flight levels, Mach number, and other associated performance criteria.
• The flight crew must not continue the ITP maneuver when it would be dangerous to do so.
• The flight crew must not abort the ITP unnecessarily. (Rationale: An abort may violate separation minimums)
• When performing an abort, the flight crew must follow regional contingency procedures.
• The flight crew must not execute the ITP before approval by ATC.
• The flight crew must execute the ITP immediately when approved unless it would be dangerous to do so.
• The crew shall be given positive notification of arrival at the requested FL.
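Potentially Hazardous Control Actions for ATC
• Control Action: Approve ITP request
– Not providing causes hazard: (none listed)
– Providing causes hazard: Approval given when criteria are not met; Approval given to incorrect aircraft
– Wrong timing/order causes hazard: Approval given too early; Approval given too late
– Stopped too soon or applied too long causes hazard: (none listed)
• Control Action: Deny ITP request
– (no entries listed)
• Control Action: Abnormal Termination Instruction
– Not providing causes hazard: Aircraft should abort but instruction not given
– Providing causes hazard: Abort instruction given when abort is not necessary
– Wrong timing/order causes hazard: Abort instruction given too late
– Stopped too soon or applied too long causes hazard: (none listed)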
High-Level Constraints on ATC
• Approval of an ITP request must be given only when the ITP
criteria are met.
• Approval must be given to the requesting aircraft only.
• Approval must not be given too early or too late [needs to be
clarified as to the actual time limits]
• An abnormal termination instruction must be given when
continuing the ITP would be unsafe.
• An abnormal termination instruction must not be given when it
is not required to maintain safety and would result in a loss of
separation.
• An abnormal termination instruction must be given
immediately if an abort is required.
For the Batch Reactor:
Hazard: Catalyst in reactor without reflux condenser
operating (water flowing through it)
Table columns to fill in:
• Control Action
• Not providing causes hazard
• Providing causes hazard
• Too early/too late, wrong order
• Stopped too soon/applied too long
Create this table for the computer controlling the valves
loop in your control diagram
What are the safety requirements (constraints) on
the software controller given this table?
Steps in STPA
• Establish foundation for analysis
– Define “accident” for your system
– Define hazards
– Rewrite hazards as constraints on system design
– Draw preliminary (high-level) safety control structure
• Step 1: Identify potentially unsafe control actions (high-level
safety requirements and constraints)
• Step 2: Determine how each potentially hazardous control
action could occur
STPA Step 2
[Figure: generic control loop (Controller, Actuator, Controlled Process, Sensor, plus a second Controller) annotated with causal factors: wrong or missing control input or external information; inadequate control algorithm; inconsistent, incomplete, or incorrect process model; missing or wrong communication with another controller; inappropriate, ineffective, or missing control action; inadequate or delayed actuator operation; conflicting control actions; component failures and changes over time; missing or wrong process input; unidentified or out-of-range disturbance; process output that contributes to system hazard; inadequate sensor operation; incorrect or no information provided; measurement inaccuracies; feedback delays; inadequate or missing feedback]
Example STPA Results
Example Causal Analysis
• Unsafe control action: Pilot executes maneuver when
criteria not met
• Possible Causes?
Exercise Continued (Batch Reactor)
• STEP 2: Identify some causes of the hazardous control
action: Open catalyst valve when water valve not open
– HINT: Consider how controller’s process model could indicate that
water valve is open when it is not.
• What are some causes for a required control action (e.g.,
open water valve) being given by the software but not
executed?
• What design features (controls) might you use to protect the
system from the scenarios you found?
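As one possible design feature (not the answer given in class), the controller could base its process model on measured cooling-water flow rather than on the valve command it issued. The names and the minimum-flow value below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: minimum measured flow that counts as "condenser operating".
MIN_WATER_FLOW_LPM = 5.0


@dataclass(frozen=True)
class ReactorFeedback:
    water_valve_commanded_open: bool
    water_flow_lpm: Optional[float]  # None means no reading from the flow sensor


def may_open_catalyst_valve(feedback: ReactorFeedback) -> bool:
    """Permit catalyst only when cooling-water flow is positively confirmed."""
    if feedback.water_flow_lpm is None:
        # Missing feedback is treated as "not flowing", never as "assume OK".
        return False
    return (
        feedback.water_valve_commanded_open
        and feedback.water_flow_lpm >= MIN_WATER_FLOW_LPM
    )


# Command was issued, but the valve stuck closed: no flow is measured.
stuck_valve = ReactorFeedback(water_valve_commanded_open=True, water_flow_lpm=0.0)
healthy = ReactorFeedback(water_valve_commanded_open=True, water_flow_lpm=12.0)
print(may_open_catalyst_valve(stuck_valve))
print(may_open_catalyst_valve(healthy))
```

This addresses the scenario in which the command is given but not executed; it does not address a flow sensor that reads high when no water is flowing, which would need a separate control.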
Results
• What did you find?
• What “controls” did you add?
• What about the actual scenario that occurred in the
accident?
Four Ways Unsafe Control Can Occur
• A control action required for safety is not provided or is not
followed
• An unsafe control action is provided that leads to a hazard
• A potentially safe control action provided too late, too early,
or out of sequence
• A safe control action is stopped too soon or applied too
long (for a continuous or non-discrete control action)
[Figure: the STPA Step 2 control loop (Controller, Actuator, Controlled Process, Sensor, plus a second Controller) with its causal factors, now also showing the case “Safe control action provided but not followed”]
Next Steps
• Use causal analysis to identify detailed safety design
requirements and design options
• If desired, iterate top-down
– Refine into more detailed control structures
– Refine safety constraints (requirements) into more detailed
requirements for each component
• Use STPA results to assure safety when used in different
environment(s) than assumed for original certification
• Use STPA results to evaluate other types of changes to
system and environment
Example Controls for Causal Scenarios
• Scenario 1 - Operator was expecting patient to have been
positioned, but table positioning was delayed compared to
plan (e.g. because of delays in patient preparation or patient
transfer to treatment area; because of unexpected delays in
beam availability or technical issues being processed by other
personnel without proper communication with the operator).
• Controls:
– Provide operator with direct visual feedback to the gantry
coupling point, and require check that patient has been
positioned before starting treatment (M1).
– Provide a physical interlock that prevents beam-on unless
table positioned according to plan
Example Controls for Causal Scenarios (2)
• Scenario 2 - Operator is asked to turn the beam on outside
of a treatment sequence (e.g. because the design team wants
to troubleshoot a problem) but inadvertently starts treatment
and does not realize that the facility proceeds with reading the
treatment plan.
• Controls:
– Reduce the likelihood that non-treatment activities have
access to treatment-related input by creating a non-treatment mode to be used for QA and experiments,
during which facility does not read treatment plans that
may have previously been loaded (M2);
– Make procedures (including button design if pushing a
button is what starts treatment) to start treatment
sufficiently different from non-treatment beam on
procedures that the confusion is unlikely.
Additional Things Not Covered Today
• Rigorous method for performing Step 1
– Much of it can be automated or assistance provided
– Generate executable and analyzable model-based safety
requirements
• Multiple controller analysis
Organizational Aspects of Risk
• Examples so far focus on physical level
• Also requirements and control responsibilities at
management level to satisfy system safety requirements
• Can identify unsafe control actions and causal scenarios
at higher levels of the control structure (perform a risk
analysis) and build in controls to prevent them
• Behavior and control structures change over time
– Prevent migration to higher levels of risk
– Detect when occurs
Organizational Aspects of Risk (2)
• Can look at non-safety risks, including project risks, budget
risks, schedule risks and tradeoffs
• Goal may be to evaluate an existing control structure or to
create a new one
• Creating leading indicators
• Current or past examples:
– NASA safety management after Columbia
– Radiation therapy at UCSD and UCLA hospitals (and maybe
Boston Mass General)
– CO2 capture, transport, and storage (Samadi, Ecole des Mines)
– Product Development Process (Goerges, Cummins Engine)
Other Topics Covered by STAMP
• Operations
• Managing safety-critical projects
• Integrating safety into system engineering
– Designing safety into systems from the beginning
– Specification to support maintenance and evolution
Current Projects
• Human factors engineering
– Design to reduce human error
– Integrating sophisticated human factors into hazard analysis
• Leading Indicators
• Cyber Warfare and Nuclear Security
• Organizational, Managerial, and Project Risk Analysis
• More applications: high-speed rail, autos, medicine,
aircraft, NextGen (TBO)
• Application to Financial Systems
• Other Emergent System Properties
• Tools and formal assistance with analysis