How safe is safe enough? (and how do we demonstrate that?)

Dr David Pumfrey
High Integrity Systems Engineering Group
Department of Computer Science
University of York
Why System Safety?
Why do we strive to make systems safe?
Self interest
we wouldn’t want to be harmed by systems we develop and use
unsafe systems are bad business
We have to do so
required by law
required by standards
But what do the law and standards represent?
laws try to prevent what society finds morally unacceptable
ultimately assessed by the courts, as representatives of society
standards try to define what is acceptable practice
to discharge legal and moral responsibilities
Perception of Safety
Perception (and hence individual acceptance) of risk
affected by many factors
(Apparent) degree of control
Number of deaths in one accident (aircraft versus cars)
Familiarity vs. novelty
“Dreadness” of risk (“falling out of the sky”, nuclear radiation)
Voluntary vs. involuntary risk (hang gliding vs nuclear accident)
Politics and journalism
Frequency / profile of reporting of accidents / issues
Experience
Individual factors – age, sex, religion, culture
How do companies (engineers?) make decisions given
diversity of views?
Getting it wrong 1: Boeing 777
An incident of massive altitude fluctuations on a flight out
of Perth
Problem caused by Air Data Inertial Reference Unit (ADIRU)
Software contained a latent fault which was revealed by a change
Problem was in fault management/dispatch logic
June 2001: accelerometer #5 fails with erroneous high output values; ADIRU discards its output values
Power cycle on ADIRU occurs on each occasion the aircraft electrical system is restarted
Aug 2005: accelerometer #6 fails; latent software error allows use of previously failed accelerometer #5
Source: http://www.atsb.gov.au/publications/investigation_reports/2005/AAIR/aair200503722.aspx
Getting it wrong 2: Therac 25
Therac 25 was a development of (safe, successful)
earlier medical machines
Intended for operation on tumours
Uses linear accelerator to produce electron stream and generate
X-rays (both can be used in treatments)
X-ray therapy requires about 100 times more electron
energy than electron therapy
this level of electron energy is hazardous if patient exposed
directly
Selection of treatment type controlled by a turntable
Therac 25 Schematic
[Figure: schematic of the Therac 25 turntable. Labels: position sense microswitch assembly, locking plunger, mirror, counterweight, X-ray mode target, electron mode scan target]
Software in Therac-25
On older models, there were mechanical interlocks on
turntable position and beam intensity
In Therac-25, mechanical interlocks were removed; turntable
position and beam activation were both computer controlled
Older models required operator to enter data twice - at
patient’s side, in shielded area – and then cross-checked
In Therac-25, data only entered once (to speed up therapy sessions)
Very poor user interface
Display updated so slowly experienced therapists could “type ahead”
Undocumented error codes
which occurred so often the operators ignored them
Six over-dosage accidents (resulting in deaths)
May have been many cases where ineffective treatment was given
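The role the removed interlocks played can be sketched in software terms. The following is a minimal, hypothetical Python illustration and not the Therac-25 code: the names `Mode`, `TurntablePosition`, `REQUIRED_POSITION` and `beam_permitted` are assumptions made for this example. It shows a check that inhibits the beam unless the sensed turntable position matches the selected treatment mode.

```python
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class TurntablePosition(Enum):
    ELECTRON_SCAN = "electron_scan"  # electron mode scan target in the beam path
    XRAY_TARGET = "xray_target"      # X-ray mode target in the beam path
    UNKNOWN = "unknown"              # in transit, or position not confirmed


# Hypothetical mapping from treatment mode to the only acceptable position.
REQUIRED_POSITION = {
    Mode.ELECTRON: TurntablePosition.ELECTRON_SCAN,
    Mode.XRAY: TurntablePosition.XRAY_TARGET,
}


def beam_permitted(mode: Mode, sensed: TurntablePosition) -> bool:
    """Return True only if the sensed position matches the selected mode.

    Any unknown or mismatched position fails safe (beam inhibited).
    """
    return sensed is REQUIRED_POSITION[mode]
```

A check like this is only one layer of protection. On the older models the same condition was also enforced mechanically, independently of any software.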
Safety Life Cycle 1
Simple “V” model of development lifecycle
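[Figure: "V" model. Levels from top to bottom: Requirements / Delivered Platform, Platform, Systems, Units, with Implementation at the base. Design & Decomposition runs down one arm; Integration & Test runs up the other.]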
Safety Life Cycle 2
Major safety activities during development:
Hazard Identification and Requirements Setting
identifying potential accidents and associated hazards
assessing risk
OUTPUT: derived safety requirements to avoid / minimise hazards
Driving Design
examining design proposals
identifying causes of hazards, potential weaknesses, and risks
OUTPUT: new derived safety requirements to improve design
preliminary assessment that design proposals can meet targets
OUTPUT: evidence to justify design decisions
Producing Safety Evidence
confirming that design meets requirements
OUTPUT: evidence of achieved safety
Safety Life Cycle 3
Integrated Development and Safety Processes
[Figure: the "V" model annotated with safety activities]
Requirements
· Platform Concept
· Safety Targets
· (Initial Hazard List)
Delivered Platform
· Safe Platform
· Safety Case
Hazard Identification and Requirements Setting
· Hazard identification and assessment
Driving design
· Examining proposals
· Causal analysis
· Preliminary safety justification
Producing safety evidence
· Safety analysis of completed design
· Integration of safety evidence
Development levels: Platform, Systems, Units (Design & Decomposition, Implementation, Integration & Test)
Eurofighter Typhoon
Display Processor Hardware
[Figure: block diagram of the display processor hardware. Labels: two processors, each with a private bus, private RAM, private ROM, timers and a second MMU; shared RAM, shared ROM, I/O and specialist hardware; a local bus and a system bus; four arbitration blocks]
Timing diagram
[Figure: timing diagram spanning Sync 1 to Sync 7, with rows for the CPE, IC and PE1 to PE6 processors. Recurring activities: System Health Monitor, Processor CBIT, MIM Input / Output, Global Bus Input Data, Localise Bus Data, SG BLT, Radar Transfer, message CBIT and checksum, and context switches between SUPERVISOR and USER. Also marked: VME Bus Block Transfer Period, Timer Interrupt (Level 6), Broadcast Interrupt (Level 5), Warning interrupt, acyclic interrupts (Radar and IFF EOF interrupts at CPU level 5, no latency), non-synchronised supervisor operations, and Multi-mission data / PDS load]
Recursive Resource Dependency
[Figure: resource dependency diagram. Resources grouped as All resources, Intrinsically critical resources and Primary control resources. Labels: EVENTS (interrupts, output events, master cycle clock, timer registers, interrupt configuration registers); MEMORY (program ROM, stack RAM, critical variables in RAM, initialisation routines in ROM, MMU registers, bus arbitration control registers); CPU regs; I/O regs]
Initialisation routines for primary control resources use system resources, and dependencies become cyclic.
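The cyclic dependency problem can be made concrete with a small graph check. This is a minimal sketch under stated assumptions: the resource names and edges in `DEPENDS_ON` are illustrative only (loosely based on the slide labels, not on the actual display processor design), and `find_cycle` is a hypothetical helper that uses a depth-first search.

```python
from typing import Dict, List, Optional

# Illustrative "uses" relation: resource -> resources it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "critical_variables": ["ram", "mmu_registers"],
    "mmu_registers": ["initialisation_routines"],
    "initialisation_routines": ["rom", "cpu_regs", "stack_ram"],
    "stack_ram": ["ram", "mmu_registers"],  # this edge closes the cycle
    "ram": [],
    "rom": [],
    "cpu_regs": [],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: a cycle has been found
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

With the illustrative edges above, `find_cycle(DEPENDS_ON)` reports the loop through the MMU registers, the initialisation routines and the stack RAM, which is the kind of dependency the slide warns about.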
Safety Cases: Who are they for?
Many people and organisations will have an interest in a
safety case
supplier / manufacturer
operator
regulatory authorities
bodies that conduct acceptance trials
people who will work with the system
and their representatives (unions)
“neighbours” (e.g. general public who live round an air base)
emergency services
May need more than one “presentation” of safety case to
suit different audiences
Who has the greatest interest?
Goal Structuring Notation
Purpose of a Goal Structure
To show how goals are broken down into sub-goals, and eventually supported by evidence (solutions),
whilst making clear the strategies adopted, the rationale for the approach (assumptions, justifications: A/J)
and the context in which goals are stated
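One way to see what a goal structure must satisfy is to model it as a tree and check that every branch ends in evidence. The sketch below is an illustration under assumptions, not part of GSN or of any GSN tool: `Goal`, `Solution` and `undeveloped_goals` are hypothetical names, the example claims are invented, and strategies and context are reduced to plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Solution:
    """A reference to an item of evidence."""
    description: str


@dataclass
class Goal:
    """A claim, supported by sub-goals and/or solutions."""
    claim: str
    strategy: str = ""  # how the goal is decomposed
    context: List[str] = field(default_factory=list)
    subgoals: List["Goal"] = field(default_factory=list)
    solutions: List[Solution] = field(default_factory=list)


def undeveloped_goals(goal: Goal) -> List[str]:
    """Return the claims of goals that have neither sub-goals nor solutions."""
    if not goal.subgoals and not goal.solutions:
        return [goal.claim]
    missing: List[str] = []
    for sub in goal.subgoals:
        missing.extend(undeveloped_goals(sub))
    return missing


# Hypothetical example structure.
top = Goal(
    claim="System is acceptably safe",
    strategy="Argue over each identified hazard",
    context=["Hazard list (hypothetical reference)"],
    subgoals=[
        Goal("Hazard A eliminated", solutions=[Solution("Analysis report")]),
        Goal("Hazard B sufficiently mitigated"),  # no evidence attached yet
    ],
)
```

Calling `undeveloped_goals(top)` lists the claims that still lack support, which is a simple mechanical check that sub-goals are "eventually supported by evidence".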
A Simple Goal Structure
[Figure: goal structure, shown here as an outline]
Top goal: Control System is Safe
Context: Hazards Identified from FHA (Ref Y); Tolerability targets (Ref Z); I.L. Process Guidelines defined by Ref X
Justification (J): 1 x 10^-6 p.a. limit for Catastrophic Hazards
Sub-goal: All identified hazards eliminated / sufficiently mitigated
    H1 has been eliminated (evidence: Formal Verification)
    Probability of H2 occurring < 1 x 10^-6 per annum (evidence: Fault Tree Analysis)
    Probability of H3 occurring < 1 x 10^-3 per annum (evidence: Fault Tree Analysis)
Sub-goal: Software developed to I.L. appropriate to hazards involved
    Primary Protection System developed to I.L. 4 (evidence: Process Evidence of I.L. 4)
    Secondary Protection System developed to I.L. 2 (evidence: Process Evidence of I.L. 2)
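A target such as "probability of H2 occurring < 1 x 10^-6 per annum" is typically supported by combining basic-event probabilities through the gates of a fault tree. The following is a minimal worked sketch: the tree shape and the basic-event probabilities are invented for illustration, the events are assumed to be independent, and `and_gate` and `or_gate` are hypothetical helper names.

```python
from math import prod
from typing import Iterable


def and_gate(probabilities: Iterable[float]) -> float:
    """Output event occurs only if all (independent) inputs occur."""
    return prod(probabilities)


def or_gate(probabilities: Iterable[float]) -> float:
    """Output event occurs if at least one (independent) input occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Invented per-annum probabilities, for illustration only.
P_PRIMARY_FAILS = 1e-4
P_SECONDARY_FAILS = 1e-3
P_COMMON_CAUSE = 1e-7

# Assumed tree: the hazard occurs if both protection systems fail,
# or if a single common cause defeats both.
p_hazard = or_gate([
    and_gate([P_PRIMARY_FAILS, P_SECONDARY_FAILS]),
    P_COMMON_CAUSE,
])

TARGET = 1e-6  # tolerability target, per annum
meets_target = p_hazard < TARGET
```

With these invented numbers the top event comes out at roughly 2 x 10^-7 per annum, so the target is met. The independence assumption is the part that most needs justifying in a real analysis.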
Westland Helicopters EH101 (Merlin)
Traditional flight controls
Rods and links
Power assistance from high
pressure hydraulics
HEAT project
replaces this…
with this…
…and this…
…with this.
HEAT: Developing the Argument
Top goal: Trials aircraft is acceptably safe to fly with HEAT/ACT fitted
    System: HEAT/ACT system is acceptably safe
        Product: All identified hazards have been suitably addressed
        Process: All relevant requirements and standards have been complied with
    Integration: Trials a/c remains acceptably safe with HEAT fitted
    Clearance: Procedures for flight clearance and certification followed
    SMS: SMS implemented to DS00-56
Progressive Development
G 1.1.4.7 Hazard Log requirement satisfied
    G 1.1.4.7.1 Hazard Log initiated
    G 1.1.4.7.2 Hazard Log correctly maintained
        G 1.1.4.7.2.1 Access rights to Hazard Log correctly controlled
        G 1.1.4.7.2.2 Sign-off procedure and rights to Hazard Log correctly controlled
        G 1.1.4.7.2.3 Hazard Log used consistently
        G 1.1.4.7.2.4 Hazard Log update procedure understood and correctly followed
    G 1.1.4.7.3 Hazard Log used to assess levels of risk throughout project
Evidence cited: Hazard Log Application; Hazard Log Guidance Notes document; Safety Review minutes; ISAT Hazard Log audit report
An analogy
A safety case is like a legal case presented in court
Like a legal case, a safety case must:
be clear
be credible
be compelling
make best use of available evidence
Like a legal case, a safety case will always be subjective
There is no such thing as absolute safety
Safety can never be proved
Always making an argument of acceptability
What is a convincing argument?
Example: The Completeness Problem
G1.1.2.1.1 All relevant airworthiness requirements have been identified completely and correctly
    AwComplete: Argument by showing extreme improbability of overlooking relevant requirements
    AwCorrect: Argument by showing assumptions used to derive requirements were correct
    G1.1.2.1.1.1 Airworthiness requirements specified
    G1.1.2.1.1.2 Relevant airworthiness requirements satisfy mandated standards where applicable
    G1.1.2.1.1.3 Relevance of airworthiness requirements to HEAT/ACT assessed by competent staff
    G1.1.2.1.1.4 Assumptions are proven correct by flight test
    DS970: Def Stan 00-970, JAR 29
    GRS: EH101 General Requirement Specification
    BoC: Basis of Certification document
    CompAwStaff: Competencies of staff used to filter airworthiness requirements
    AwSigs: Competencies of specialists used to vet and approve requirements
    FltTest: Assumptions proven by flight test
How is evidence used?
Think about evidence as used in a legal (court) case
Direct - Supports a conclusion with no “intermediate steps”
e.g. a witness testifies that he saw the suspect at point X at time Y.
Circumstantial - Requires an inference to be made to reach a
conclusion
e.g. ballistics test proves the suspect’s gun fired the fatal shot.
Safety case evidence is similar
e.g. Testing is direct – shows how the system behaves in specific
instance
Conformance to design rules is indirect – allows inference that system is
fit for purpose (if rules have been proven)
Evidence may “stack up” in different ways:
Strong, specific – individually compelling, taken together show system properties
Weak, general – compelling in sum
Conclusions
Demonstrating safety is a challenge
We are building ever more complex systems
Much of the “bespoke” complexity is in software
Essential that safety is a design driver...
... and also, design for ability to demonstrate safety