Risk Module - Space Systems Engineering


Risk Module:
Risk Management, Fault Trees and
Failure Mode Effects Analysis
Space Systems Engineering, version 1.0
Space Systems Engineering: Risk Module
Module Purpose: Risk
• Understand risk, risk management, fault tree
analysis and failure mode effects analysis in the
context of project development
• Acknowledge that risks are inevitable and recognize
that through systematic management and analytic
techniques they can be reduced
• Review three techniques that are used to discover,
assess, rank and mitigate risk: risk management,
fault tree analysis and failure mode effects analysis
What are Risks and Risk Management?
• Risks are potential events that have negative impacts
on safety or project technical performance, cost or
schedule
• Risks are an inevitable fact of life: they can be
reduced but never eliminated
• Risk management gives purposeful thought to
the sources, magnitude, and mitigation of risk, and
directs actions toward its balanced reduction
• The same tools and perspectives that are used to
discover, manage and reduce risks can be used to
discover, manage and increase project opportunities
(opportunity management)
What is Risk Management?
Risk management is a continuous
and iterative decision making
technique designed to improve the
probability of success. It is a
proactive approach that:
• Seeks or identifies risks
• Assesses the likelihood and impact of these risks
• Develops mitigation options for all identified risks
• Identifies the most significant risks and chooses which
mitigation options to implement
• Tracks progress to confirm that cumulative project risk is indeed
declining
• Communicates and documents the project risk status
• Repeats this process throughout the project life
Risk Management Considers the Entire
Development and Operations Life of a Project
Risk Type: Examples
• Technical Performance Risk: Failure to meet a spacecraft technical
requirement or specification during verification
• Cost Risk: Failure to stay within a cost cap for the project
• Programmatic Risk: Failure to secure long-term political support
• Schedule Risk: Failure to meet a critical launch window
• Liability Risk: Spacecraft deorbits prematurely, causing
damage over the debris footprint
• Regulatory Risk: Failure to secure proper approvals for launch
of nuclear materials
• Operational Risk: Failure of spacecraft during mission
• Safety Risk: Hazardous material release while fueling
during ground operations
• Supportability Risk: Failure to resupply sufficient material to
support human presence as planned
Every NASA Space Flight Project Begins
with a Plan for Risk Management
• This plan reflects the project’s risk management philosophy:
  • Priority (criticality to long-term strategic plans)
  • National significance
  • Mission lifetime (primary baseline mission)
  • Estimated project life cycle cost
  • Launch constraints
  • In-flight maintenance feasibility
  • Alternative research opportunities or re-flight opportunities
• The risk management philosophy is reflected in a number of ways:
  • Whether single point failures are allowed
  • Whether the system is monitored continuously during operations
  • How much slack is in the development schedule
  • How technical resource margins (e.g., mass, power, MIPS) are
    allocated throughout the development
Other Factors to Consider in Assessing
Risk (but not limited to)…
• Complexity of management and technical interfaces
• Design and test margins
• Mission criticality
• Availability and allocation of resources such as mass, power,
volume, data volume, data rates, and computing resources
• Scheduling and manpower limitations
• Ability to adjust to cost and funding profile constraints
• Mission operations
• Data handling, i.e., acquisition, archiving, distribution and
analysis
• Launch system characteristics
• Available facilities
Risk Identification
• Risks are identified by the development team, peer
reviews, lessons from past projects and expert
review
• Lessons from past projects are captured via ‘trigger
questions’, or questions that challenge a
development strategy or design solution
• The project risk status and top ten risk list are
reviewed periodically (usually monthly) and at the
project milestone reviews
Example Risk Trigger Questions
• Have requirements been implemented such that a small change
in requirements has the potential to cause large cost,
performance or schedule system ramifications?
• Do designs or requirements push the current state-of-the-art?
• Has the concept for operating, maintaining, decommissioning or
disposal of the system been adequately defined to ensure the
identification of all requirements?
• Has an independent cost estimate (ICE) been performed?
• Is the schedule adequate to handle the level of requirements or
objectives changes that are occurring or are likely to occur?
• Have the necessary facilities for environmental test been
identified and availability problems been resolved?
More Considerations for Risk Discovery
While each space project has its unique risks, a list of the underlying sources
of risks would include the following:
• Technical complexity: many design constraints or many dependent
operational sequences having to occur in the right sequence and at the
right time
• Organizational complexity: many independent organizations having to
perform with limited coordination
• Inadequate margins or reserves
• Inadequate implementation plans
• Unrealistic schedules
• Total and year-by-year budgets mismatched to the actual implementation
risks
• Over-optimistic designs pressured by mission expectations
• Limited engineering analysis and understanding due to inadequate
engineering tools and models
• Limited understanding of the mission’s space environments
• Inadequately trained or inexperienced project personnel
• Inadequate processes or inadequate adherence to proven processes
Pause and Learn Opportunity
Engage the class in identifying risks for a familiar
project.
• What kinds of risks are identified?
• What is the basis for their search for risks?
After the class has thought for a while, the instructor
could present some trigger questions which may help
discover new risks and show the value of the trigger
questions.
Cartoon: Dilbert Identifies Risks
© United Features Syndicate, Inc.
The Benefits of Preparing for the Unexpected
Background:
On January 21, 2004 (Sol 18), Spirit abruptly ceased communicating with
mission control. The next day the rover radioed a 7.8 bit/s beep,
confirming that it had received a transmission from Earth but indicating
that the spacecraft believed it was in a fault mode.
Mars Spirit Rover Flash Memory Problem
“The thing that strikes me most about all this is how critical
it was to have that INIT_CRIPPLED command in the system.
It’s not the kind of command that you’d ever expect to use
under normal conditions on Mars. But back during the
earliest days of the project Glenn realized that someday we
might need the flexibility to deal with a broken flash file
system, and he put INIT_CRIPPLED in the system and left it
there. And when the anomaly hit, it saved the mission.”
–From “Roving Mars” by Steve Squyres, Hyperion 2005
Be prepared for the low probability event with a huge
consequence.
After Identification, Risks are Assessed
• Risks are assessed by characterizing the probability that a
project will experience an undesired event and the
consequences, impact or severity of the undesired event, were
it to occur
• Risks can be compared on iso-curves consisting of a likelihood
measure and a consequence measure
• Because the assessment of the likelihood and consequence of a
risk is subjective and carries significant uncertainty, risk is
characterized either qualitatively (low, medium or high) or
semi-quantitatively (risks are captured on a 5x5 matrix)
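[Figure: iso-risk curves dividing the plot into Low Risk, Medium Risk and High Risk regions. Vertical axis: Likelihood (Probability), 0.0 to 1.0. Horizontal axis: Severity of Consequence.]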
An Example of Some Semi-Quantitative Definitions to
Enable a Project to Compare and Rank Risks
Probability of Occurrence
Scale  Measure
5      Near certain to occur (80-100%)
4      Highly likely to occur (60-80%)
3      Likely to occur (40-60%)
2      Unlikely to occur (20-40%)
1      Not likely; improbable (0-20%)

Impact of Consequences
Class I, Catastrophic (Scale 5)
  Technical: A condition that may cause death or permanently disabling injury, facility destruction on the ground, or loss of crew, major systems, or vehicle during the mission
  Schedule: Launch window to be missed
  Cost: Cost overrun > 50% of planned cost
Class II, Critical (Scale 4)
  Technical: A condition that may cause severe injury or occupational illness, or major property damage to facilities, systems, equipment, or flight hardware
  Schedule: Schedule slippage causing launch date to be missed
  Cost: Cost overrun 15% to 50% of planned cost
Class III, Moderate (Scale 3)
  Technical: A condition that may cause minor injury or occupational illness, or minor property damage to facilities, systems, equipment, or flight hardware
  Schedule: Internal schedule slip that does not impact launch date
  Cost: Cost overrun 2% to 15% of planned cost
Class IV, Negligible (Scale 2)
  Technical: A condition that could cause the need for minor first aid treatment but would not adversely affect personal safety or health; damage to facilities, equipment, or flight hardware more than normal wear and tear level
  Schedule: Internal schedule slip that does not impact internal development milestones
  Cost: Cost overrun < 2% of planned cost
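The semi-quantitative definitions above can be applied mechanically. The following is a minimal sketch, not part of the original module: the function names are hypothetical, and the handling of values that fall exactly on a band boundary (for example 40% or 15%) is an assumption, since the table's ranges share their endpoints.

```python
def likelihood_scale(probability: float) -> int:
    """Map a probability in [0, 1] to the 1-5 likelihood scale in the table."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Upper bounds of scales 1 to 4; anything above 0.8 is scale 5.
    # Assumption: a value exactly on a boundary falls in the lower band.
    for scale, upper in enumerate((0.2, 0.4, 0.6, 0.8), start=1):
        if probability <= upper:
            return scale
    return 5


def cost_consequence_scale(overrun_fraction: float) -> int:
    """Map a cost overrun (fraction of planned cost) to the consequence scale."""
    if overrun_fraction < 0:
        raise ValueError("overrun_fraction must be non-negative")
    if overrun_fraction > 0.50:
        return 5  # Class I, Catastrophic
    # Assumption: exactly 15% is treated as Class II rather than Class III.
    if overrun_fraction >= 0.15:
        return 4  # Class II, Critical
    if overrun_fraction >= 0.02:
        return 3  # Class III, Moderate
    return 2      # Class IV, Negligible


if __name__ == "__main__":
    # Hypothetical risk: 50% chance of a 20% cost overrun.
    print(likelihood_scale(0.5), cost_consequence_scale(0.20))
```

Fixing the boundary rule in code removes one source of inconsistency when different team members score the same risk.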
A 5x5 Risk Matrix Provides a Quick
Visual Comparison of All Project Risks
High risk: mission success jeopardized; immediate action required
Medium risk: review regularly; take contingent
action if it does not improve
Low risk: watch and review periodically
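As an illustration of how a 5x5 matrix supports ranking, here is a minimal sketch that scores each risk as likelihood times consequence and sorts the list. The product score is a common convention assumed here, and the band thresholds (`medium_at`, `high_at`) and the risk IDs are hypothetical; a real project defines its own cell-by-cell coloring in its risk management plan.

```python
def risk_score(likelihood: int, consequence: int) -> int:
    """Score a risk as likelihood x consequence, each on a 1-5 scale."""
    for name, value in (("likelihood", likelihood), ("consequence", consequence)):
        if value not in range(1, 6):
            raise ValueError(f"{name} must be an integer from 1 to 5")
    return likelihood * consequence


def risk_band(likelihood: int, consequence: int,
              medium_at: int = 6, high_at: int = 15) -> str:
    """Classify a risk as Low, Medium or High.

    The thresholds are illustrative assumptions, not values from the module.
    """
    score = risk_score(likelihood, consequence)
    if score >= high_at:
        return "High"
    if score >= medium_at:
        return "Medium"
    return "Low"


def rank_risks(risks: dict) -> list:
    """Return risk IDs ordered from highest to lowest score."""
    return sorted(risks, key=lambda rid: risk_score(*risks[rid]), reverse=True)


if __name__ == "__main__":
    # Hypothetical risks: ID -> (likelihood, consequence)
    risks = {"R-1": (2, 2), "R-2": (4, 5), "R-3": (3, 3)}
    for rid in rank_risks(risks):
        print(rid, risk_score(*risks[rid]), risk_band(*risks[rid]))
```

A product score treats a likely but minor risk the same as an unlikely but severe one, which is why the matrix view, showing both axes, is normally reviewed alongside any ranked list.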
Top Risks and their Trends are Periodically
Reviewed for the SOFIA Project
SOFIA Risk Matrix
Rank  Risk ID  Approach  Risk Title
1     DFRC-34  R         Landing Gear Door System Failure
2     DFRC-12  M         Sched Integration problems, structure vs. avionics
3     DFRC-07  W         Cost growth for engine components
4     DFRC-24  A         Quality Control Resources insufficient
5     DFRC-01  W         Avionics software behind schedule
6     DFRC-11  R         Payload Capacity & Volume Trade-offs design issues
7     DFRC-04  R         Limited Flight Envelope, due to technical issues
8     DFRC-02  R         More flight testing may be required for Soft V&V
Approach: M - Mitigate, W - Watch, A - Accept, R - Research
[Figure: 5x5 risk matrix locating risks 1 to 8. Vertical axis: Likelihood (1 to 5). Horizontal axis: Consequences (1 to 5). Criticality: High, Med, Low. L x C Trend: Decreasing (Improving), Increasing (Worsening), Unchanged, New Since Last Period.]
Top Risks and their Trends are Periodically
Reviewed for the Constellation SE&I
SE&I Top Risk List
Rank  Owning Team   Likelihood  Safe  Perf  Sched  Cost
1     FP_SIG        4           4     5     5      5
2     FP_SIG        4           5     5     4      4
3     SE&IPRIMO     2           0     2     2      2
4     SE&IAT&A      3           0     4     0      4
5     SE&I_SOA      5           3     4     4      4
6     CSI_SIG       4           3     3     3      3
7     SE&I_SOA      4           0     0     0      4
8     SE&I_PTI_HR   3           0     0     3      3
Trend marker N appears against ranks 1 and 5.
Risk titles:
• 1677 - Ares I/Orion Ascent Aeroacoustic Environments
• 1676 - Structural loads on CEV and LSAM during TLI
• 1603 - (SRR) Abort Site Sea State Limits Launch Availability
• 1125 - Software Development and Assurance
• 1135 - Program Visibility for Closing the Architecture
• 1122 - Requirements Maturation
• 1195 - CxP Lifecycle cost (rank 7)
• 1046 - Tailoring of Human-Rating requirements (rank 8)
Legend: Decreasing (Improving), Increasing (Worsening), Unchanged; Top Directorate Risk (TDR), Top Program Risk (TPR), Top Project Risk (TProjR)
[Figure: 5x5 risk matrix locating risks 1 to 8. Vertical axis: Likelihood. Horizontal axis: Consequence.]
The Status of the Most Significant Risks and Their
Mitigation Options are Reviewed Periodically
• Title of risk
• Description or root cause
• Possible categorizations
  • System or subsystem
  • Cause category (technology, programmatic, cost, schedule, etc.)
  • Resources affected (budget, schedule slack, technical margins, etc.)
• Owner
• Assessment of implementation risk or mission risk
  • Likelihood - estimate of the probability of the risk event
  • Consequences - estimate of the performance, cost, safety and
    schedule effects
• Mitigation
  • Description, including costs of mitigation options
  • Mitigation option leverage or reduction in the assessed risk
  • Current mitigation activities
  • Current trends in risk significance - likelihood and impact
• Significant milestones
  • Opening and closing of the window of occurrence
  • Decision points for mitigation implementation effectiveness
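The fields listed above map naturally onto a record in a risk database. The following is a minimal sketch of such a record; the class name, field names, defaults and example values are all hypothetical and are not taken from any NASA tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RiskRecord:
    """One entry in a project risk list (hypothetical field names)."""
    title: str
    root_cause: str
    owner: str
    likelihood: int                      # 1-5 scale
    consequence: int                     # 1-5 scale
    system: str = ""                     # system or subsystem
    cause_category: str = ""             # technology, programmatic, cost, ...
    resources_affected: List[str] = field(default_factory=list)
    mitigation_options: List[str] = field(default_factory=list)
    current_mitigation: str = ""
    trend: str = "unchanged"             # decreasing, increasing, unchanged, new
    milestones: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line status suitable for a periodic review chart."""
        return (f"{self.title} (owner: {self.owner}, "
                f"L={self.likelihood}, C={self.consequence}, trend: {self.trend})")


if __name__ == "__main__":
    # Hypothetical example entry
    record = RiskRecord(
        title="Flight software behind schedule",
        root_cause="Late requirements changes",
        owner="Software lead",
        likelihood=4,
        consequence=3,
        cause_category="schedule",
        resources_affected=["schedule slack"],
        trend="increasing",
    )
    print(record.summary())
```

Keeping every risk in the same structure makes the periodic review a comparison of like with like.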
Part 2 of Risk Module:
Fault Tree Analysis
Event Tree Analysis
Fault Tree Analysis Supports Design
Decisions and Failure Investigations
• Fault Tree Analysis (FTA) uses a top-down symbolic logic
model and estimates of failure probabilities of ‘initiators’ to
estimate the probability of occurrence of the pre-determined,
undesirable, ‘top’ event
• An initiator is a credible undesirable event that is a contributing
cause to top event failure
• ‘Cut sets’ are groups of initiators that, when they all occur
together, cause top event failure
• ‘Path sets’ are groups of initiators such that, if none of them
occurs, the top event does not occur
• FTA is both a design and a diagnostic tool
• As a design tool FTA is used to compare alternative design
solutions and the resulting TOP event probability
• As a diagnostic tool FTA is used to investigate scenarios that
may have led to the TOP event failure, leading to an estimate of
the most likely cut sets
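To make the terms above concrete, here is a minimal sketch that evaluates a small fault tree built from AND and OR gates and lists its minimal cut sets. The tree, the initiator names and the probabilities are hypothetical. The probability calculation assumes the initiators are statistically independent and that each initiator appears only once in the tree; trees with repeated events need a cut-set based calculation instead.

```python
from itertools import product
from math import prod

# A node is either an initiator name (str) or a (gate, children) tuple.
# Hypothetical example: the top event occurs if both pumps fail,
# or if the controller fails.
TREE = ("OR", [("AND", ["pump_a_fails", "pump_b_fails"]), "controller_fails"])
P_INITIATOR = {"pump_a_fails": 0.1, "pump_b_fails": 0.1, "controller_fails": 0.01}


def top_event_probability(node, p):
    """Probability of the event at `node`, assuming independent initiators."""
    if isinstance(node, str):
        return p[node]
    gate, children = node
    probs = [top_event_probability(child, p) for child in children]
    if gate == "AND":   # all inputs must occur
        return prod(probs)
    if gate == "OR":    # at least one input occurs
        return 1.0 - prod(1.0 - q for q in probs)
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Return the minimal cut sets of the tree as a list of frozensets."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [minimal_cut_sets(child) for child in children]
    if gate == "OR":    # any child's cut set is enough
        combined = {s for sets in child_sets for s in sets}
    elif gate == "AND":  # need one cut set from every child
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    # Drop any set that strictly contains another cut set.
    return [s for s in combined if not any(other < s for other in combined)]


if __name__ == "__main__":
    print(top_event_probability(TREE, P_INITIATOR))  # about 0.02
    print(minimal_cut_sets(TREE))
```

In this example the single-initiator cut set (controller failure) is a single point failure, which is the kind of finding that drives design comparisons.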
Fault Tree Analysis
Fault tree analysis is a graphical
representation of the combination
of faults that will result in the
occurrence of some (undesired)
top event.
In the construction of a fault tree,
successive subordinate failure
events are identified and logically
linked to the top event.
The linked events form a tree
structure connected by symbols
called gates.
Refer to NASA Reference Publication 1358:
System Engineering “Toolbox” for
Design-Oriented Engineers
Section 3.6: Fault Tree Analysis
(Handout)
Particular points:
And/Or Gates explanation
Example Fault Tree (Fig 3-20)
Event Trees
• Event trees can be viewed as a special case of fault trees,
where the branches are all ORs weighted by their probabilities.
• Event trees are generated both in the success and failure
domains.
• This technique explores system responses to an initiating
“challenge” and enables assessment of the probability of an
unfavorable or favorable outcome. The system challenge may
be a failure or fault, an undesirable event, or a normal system
operating command.
• In constructing the event tree, one traces each path to eventual
success or failure.
• This technique is typically performed in phase C but may also
be performed in phase B.
• See NASA Reference Publication 1358: System Engineering
“Toolbox” for Design-Oriented Engineers section 3.8 for
additional discussion.
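The path-tracing idea above can be sketched in a few lines. This is a minimal illustration, assuming each system response either succeeds or fails independently of the others; the stage names and probabilities are hypothetical. Practical event trees usually prune branches that no longer matter after an early failure, which this sketch does not do.

```python
def event_tree_paths(stages):
    """Enumerate every path through a binary event tree.

    `stages` is an ordered list of (name, p_success) pairs that follow the
    initiating challenge. Returns a list of (outcomes, probability), where
    outcomes is a tuple of (name, succeeded) pairs.
    """
    paths = [((), 1.0)]
    for name, p_success in stages:
        paths = [
            (outcomes + ((name, ok),),
             prob * (p_success if ok else 1.0 - p_success))
            for outcomes, prob in paths
            for ok in (True, False)
        ]
    return paths


def outcome_probability(stages, is_favorable):
    """Sum the probabilities of all paths judged favorable."""
    return sum(prob for outcomes, prob in event_tree_paths(stages)
               if is_favorable(dict(outcomes)))


if __name__ == "__main__":
    # Hypothetical challenge: a fault occurs; the system must detect it
    # and then shut down safely.
    stages = [("fault_detected", 0.9), ("safe_shutdown", 0.8)]
    for outcomes, prob in event_tree_paths(stages):
        print(dict(outcomes), round(prob, 3))
    # Favorable only if every response succeeds (about 0.72 here).
    print(outcome_probability(stages, lambda o: all(o.values())))
```

Because the paths are mutually exclusive and exhaustive, their probabilities sum to one, which is a useful check on any hand-built tree.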
Will the Stage Make it from Hangman’s Hill to Placer Gulch?
Station    Probability of no horses
1, 2, 3    0.2
4          0.1
Placer Gulch event tree example from a Safety & Mission Assurance
training course by Pat Clemens of Sverdrup.
Fault Tree Analysis of the Placer Gulch Stage
Part 3 of Risk Module:
Failure Mode Effects Analysis
Failure Mode Effects Analysis
• Objective
  • To ensure all failure modes have been identified and evaluated
• Technique
  • Select a method to rank project failure modes
  • Identify failure modes including all single point failure modes
  • Analyze failure modes and their mission effect
  • Determine those failure modes that might benefit from
    corrective action, e.g.,
    – Alternative designs
    – Redundancy
    – Increased reliability
  • Determine which, if any, corrective actions will be
    implemented
Failure Mode Effects Analysis
• FMEA is a design tool for identifying risk in the
system or mission design, with the intent of mitigating
those risks with design changes. The FMEA risk
mitigation:
  1. Recognizes and evaluates the potential failure of
     a system and its effects;
  2. Identifies actions which could eliminate or reduce
     the chance of a potential failure occurring.
• FMEA is initiated in Phase B (Preliminary Design)
and used to support design decisions in Phase C
(Final Design).
Failure Mode and Effects Analysis
FMEA worksheet columns and the question each one answers:
• Item / Function: What are the functions or requirements?
• Potential Failure Mode: What can go wrong?
  – No Function
  – Partially Degraded Function
  – Intermittent Function
  – Unintended Function
• Potential Effects of Failure: What are the effects?
• Sev (severity): How bad is it?
• Class
• Potential Causes/Mechanism(s) of Failure: What are the cause(s)?
• Occur (occurrence): How often does it happen?
• Current Controls, Prevention/Detection: How can this be prevented and detected?
• Detect (detection): How good is this method at detecting it?
• RPN
• Recommended Action(s): What can be done?
  – Design changes
  – Process changes
  – Special controls
  – Changes to standards, procedures, or guides
• Responsibility & Target Completion Date: Who is going to do it and when?
• Action Results (Actions Taken, Sev, Occ, Det, RPN): What did they do and what are the outcomes?
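The RPN column of the worksheet is the Risk Priority Number, conventionally the product of the severity, occurrence and detection ratings. The sketch below assumes the commonly used 1 to 10 rating scales, where a higher detection rating means the failure is harder to detect; the scale range and the example failure mode are assumptions for illustration and are not stated in this module.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number = severity x occurrence x detection.

    Assumes each rating is an integer from 1 to 10 (a common convention).
    """
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if value not in range(1, 11):
            raise ValueError(f"{name} must be an integer from 1 to 10")
    return severity * occurrence * detection


if __name__ == "__main__":
    # Hypothetical failure mode: a valve sticks closed.
    before = rpn(severity=8, occurrence=4, detection=5)
    # After a hypothetical corrective action (redundant valve plus a
    # position sensor), occurrence and detection ratings improve.
    after = rpn(severity=8, occurrence=2, detection=3)
    print(before, after)
```

Recomputing the RPN in the Action Results columns shows whether a corrective action actually reduced the risk it targeted.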
Module Summary: Risk
• Risk is inevitable; risks can be reduced but not eliminated.
• Risk management is a proactive, systematic approach to
assessing risks, generating alternatives and reducing
cumulative project risk.
• Fault Tree Analysis is both a design and a diagnostic tool that
uses estimated failure probabilities of initiators to estimate the
probability of the pre-determined, undesirable, ‘top’ event.
• Failure Mode Effects Analysis is a design tool for identifying risk
in the system design, with the intent of mitigating those risks
with design changes.
Backup Slides
for Risk Module
Uncertainties that Plague Projects
Mission Objectives
  Uncertainties:
  • Will the baseline system satisfy the needs & objectives?
  • Are they the best ones?
  Offsets:
  • Thorough study
  • Analyses
  • Cost & schedule credibility
Technical Factors
  Uncertainties:
  • Can baseline technology achieve the objectives?
  • Can the specified technology be attained?
  • Are all the requirements known?
  Offsets:
  • Technology development plan
  • Paper studies
  • Design reviews
  • Establish performance margins
  • Engineering model test and prototyping
  • Test & evaluation
Internal Factors
  Uncertainties:
  • Can the plan and strategy meet the objectives?
  • Resources: manpower skills, time, facilities
  Offsets:
  • Program strategy
  • Budget allocations
  • Contingency planning
External Factors
  Uncertainties:
  • Will outside influences jeopardize the project?
  Offsets:
  • Contingency
  • Robust design
Project Risk Categories
Typical Technical Risk Sources
• Physical properties
• Material properties
• Radiation properties
• Testing/Modeling
• Integration/Interface
• Software Design
• Safety
• Requirement changes
• Fault detection
• Operating environment
• Proven/Unproven technology
• System complexity
• Unique/Special Resources
• COTS performance
Typical Programmatic Risk Sources
• Material availability
• Personnel availability
• Personnel skills
• Safety
• Security
• Environmental impact
• Communication problems
• Labor strikes
• Requirement changes
• Stakeholder advocacy
• Contractor stability
• Funding continuity and profile
• Regulatory changes
Typical Supportability Risk Sources
• Reliability and maintainability
• Training
• Operations and support
• Manpower considerations
• Facility considerations
• Interoperability considerations
• System safety
• Technical data
• Embedded training
Typical Cost Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to schedule risk
• Labor rates
• Estimating error
Typical Schedule Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to cost risk
• Degree of currency
• Number of critical path items
• Estimating error