Aerospace Mishaps and Lessons Learned 2004 MAPLD International Conference Washington, D.C.

September 7, 2004
"... most accidents are not the result of
unknown scientific principles but rather
of a failure to apply well-known,
standard engineering practices."
Nancy Leveson in Safeware, 1995.
Seminar Program
Time   Speaker              Affiliation                  Mishap Title
9:00   Richard Katz         NASA Office of Logic Design  Introduction
9:15   Faith Chandler       NASA HQ                      Using Root-Cause Analysis to Understand Failures
10:00  Jonathan F Binkley   Aerospace Corp.              The Space System Engineering Database (SSED)
10:45  BREAK
11:00  Owen Brown           DARPA                        Apollo 13 Mishap
12:00  Kathryn Anne Weiss   MIT                          An Analysis of Causation in Aerospace Accidents
12:45  LUNCH
1:30   Susan C. Lee         JHU/APL                      The Near Earth Asteroid Rendezvous (NEAR) Rendezvous Burn Anomaly
2:45   Rick Obenschain      NASA GSFC                    SEASAT: Lessons Learned and Not Learned
3:30   BREAK
3:45   Keith E. Van Tassel  NASA JSC                     STS-86/SAFER
4:30   Paul Cheng           Aerospace Corp.              Aerospace 100 Questions That Should Be Asked During Technical Reviews
5:15   Keith Avery          Mission Research Corp.       STRV-1c/1d Mishap
6:00   SESSION ENDS
Training vs. Education
• The NASA Office of Logic Design works to
educate design engineers, not train them.
– Training promotes rote responses
– Education promotes thinking and the ability to
adapt to and cope with new situations.
• Hence, MAPLD hosts seminars and not
training sessions.
Design Seminars
• These case studies are real and are not contrived examples.
Many of the seminar leaders have firsthand knowledge of these
mishaps.
• Contribute: Discuss the topics presented, disagree with
them, present interesting cases you wish to share,
additional lessons, or alternative viewpoints.
• Do not sit there quietly and expect to be treated like a
cocker spaniel being trained and drilled to emit Pavlovian
responses in response to stimuli (bell for dogs, donuts for
engineers).
Material
• Material will be made available on
– CD-ROM
– Hardcopy
– klabs.org
• All material is in the public domain; you may use it
as you wish.
I Was Reading AW&ST …
Aviation Week & Space Technology, August 23/30, 2004, pp. 29-30
Barto's Law: Every circuit
is considered guilty until
proven innocent.
A Recent Mishap
(that gave me the idea for this seminar)
Background
• Popular single board computer
• Everything was working fine
• Ran vibration test
– Unpowered and unmonitored
• Subsequently failed to boot intermittently
– Testing at the manufacturer also showed
intermittent failures, although at a lower rate
than observed at the contractor.
Project’s Corrective Action
• Unit (S/N 031) pulled from the flight
instrument
• New unit (S/N 034) installed in the flight
instrument
• Repeated testing with the new unit was
successful
• Signed off, ready for launch
Risk Reduction Effort
• Reviewed problem/failure report
– No root cause or failure mechanism identified
– Conclusion of the Verification and Analysis Section stated:
Each time there was a failure to boot, the power was cycled and the computer
subsequently rebooted. The result of the testing at XXXXXX was that the
most probable cause of the boot failure was a workmanship issue specific to
SN034 and is not endemic to the XXXXXXXX computer and therefore does
not affect SN031.
– No direct or indirect evidence given in the “Verification and
Analysis” section to support a workmanship issue.
– No analysis given to show that the workmanship problem was not
systemic to all units. Since the unit is clearly marginal and is
difficult to make fail, it has not been shown that other units have
sufficient margin to support operation in all operating environments
over the design life of the unit. …
Risk Reduction Effort
– Note: the “analyst” consistently remarks that after a
failed boot the next power cycle results in correct
operation of the board. Yet the board fails multiple
times. This is evidence of the “PC mentality” seen in
many Projects where, when there is a problem, the
solution is to switch the power off and back on to
“correct it.”
– Contractor and Project claimed repeatedly that the unit
was troubleshot and nothing more could be done.
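The corrective action above leaned on repeated successful tests of the replacement unit. As a rough illustration of why a run of clean boots says little about an intermittent fault, the following C sketch computes how many consecutive failure-free boots are needed before a given per-boot failure probability would probably have shown itself. It assumes every boot is an independent trial with the same failure probability, which a real intermittent (sensitive to temperature, timing, or supply conditions) may well violate. The function names and the 1% and 95% figures used below are illustrative assumptions, not values from the failure report.

```c
#include <assert.h>

/* Probability that n boots in a row all succeed, assuming each boot
 * fails independently with probability p_fail (an assumption). */
double prob_all_pass(double p_fail, unsigned n)
{
    double p = 1.0;
    assert(p_fail >= 0.0 && p_fail <= 1.0);
    for (unsigned i = 0; i < n; i++)
        p *= 1.0 - p_fail;
    return p;
}

/* Smallest number of consecutive clean boots after which a fault with
 * per-boot failure probability p_fail would have produced at least
 * one failure with probability >= confidence. */
unsigned boots_required(double p_fail, double confidence)
{
    double p_all_pass = 1.0;
    unsigned n = 0;
    assert(p_fail > 0.0 && p_fail < 1.0);
    assert(confidence > 0.0 && confidence < 1.0);
    while (p_all_pass > 1.0 - confidence) {
        p_all_pass *= 1.0 - p_fail;   /* one more clean boot */
        n++;
    }
    return n;
}
```

Under these assumptions, a unit that fails 1 boot in 100 still passes 20 boots in a row about 82% of the time (`prob_all_pass(0.01, 20)`), and `boots_required(0.01, 0.95)` gives 299 clean boots before such a fault would have appeared with 95% probability.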
Let’s Take a Closer Look
• Examination of failures at manufacturer
– The failures reported were caused by the test equipment;
there were zero failures detected at the manufacturer
– Intermittent operation of the computer could not be
supported. Suspicion of the electrical environment grows
– “What if” analysis results in a large number of possible
failure mechanisms
Let’s Take a Closer Look
• Examination of troubleshooting at contractor
– Previously claimed fully troubleshot
– Examination shows that no oscilloscope probe ever
touched the board
• Examined at interface points only
– “Failures to boot” were routine throughout the organization
• Many failure reports written across many units.
– Contractor did not use the available diagnostic signals and
port to ascertain the status of the CPU and computer
Troubleshooting Again
• Contractor fought hard to prevent further troubleshooting
– Stalled effort for many months
• Initial examination showed that the protection signals for
the EEPROM memories did not behave as predicted by the
analysis
– Contractor would not show the analysis
• Examination of diagnostic signals quickly showed that the
CPU had halted
Troubleshooting Results
• Cause of failure determined
– Known issue with pipeline timing
– Software service routines not installed to handle all conditions
– Project previously had assured the independent review that
software was installed to handle all conditions
• Did not fail at the manufacturer because the test software installed
there properly handled the interrupt arising from the pipelining issue
• No support for “a workmanship issue specific to SN034
…”
• Flight software rewritten
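One common defensive practice related to the cause above is to give every interrupt or exception vector a handler, so that an unanticipated condition is recorded instead of leaving the processor with nowhere valid to go. The following C sketch simulates that idea on a host machine. It is not the flight or test software from this case: the source does not identify the processor or how its service routines are installed, and the table size, function names, and vector numbers here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VECTORS 16u               /* hypothetical table size */

typedef void (*handler_fn)(unsigned vector);

static handler_fn vector_table[NUM_VECTORS];
static unsigned unexpected_count;     /* unplanned vectors seen so far */
static int last_unexpected = -1;      /* most recent one, -1 if none */
static unsigned timer_ticks;          /* state for the example handler */

/* Catch-all: record the event and return, so the condition is
 * visible to diagnostics rather than silently stopping the CPU. */
void default_handler(unsigned vector)
{
    unexpected_count++;
    last_unexpected = (int)vector;
}

/* Example of a specific handler (hypothetical timer interrupt). */
void timer_handler(unsigned vector)
{
    (void)vector;
    timer_ticks++;
}

/* Fill every slot with the catch-all before installing anything else. */
void vectors_init(void)
{
    for (size_t i = 0; i < NUM_VECTORS; i++)
        vector_table[i] = default_handler;
    unexpected_count = 0;
    last_unexpected = -1;
    timer_ticks = 0;
}

/* Install a specific handler; returns 0 on success, -1 on bad input. */
int vectors_install(unsigned vector, handler_fn fn)
{
    if (vector >= NUM_VECTORS || fn == NULL)
        return -1;
    vector_table[vector] = fn;
    return 0;
}

/* Simulate the hardware raising a vector; -1 if out of range. */
int vectors_dispatch(unsigned vector)
{
    if (vector >= NUM_VECTORS)
        return -1;
    assert(vector_table[vector] != NULL);  /* vectors_init() must run first */
    vector_table[vector](vector);
    return 0;
}

unsigned vectors_unexpected_count(void) { return unexpected_count; }
int vectors_last_unexpected(void) { return last_unexpected; }
unsigned vectors_timer_ticks(void) { return timer_ticks; }
```

On real hardware the default handler would typically report the event through whatever telemetry or diagnostic port is available. What it should do afterwards (resume, reset, or safe the system) is a design decision this sketch does not attempt to settle.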
Lessons and Suggestions
• Problem/Failure Reports
– Examine original documents.
– Request and examine all related P/FRs from all units
• Provide direct evidence (at a minimum!) for determination
of the cause of failure
– Intermittents after the vibration test led to the conclusion of a
workmanship error; the “bad solder joint” was never identified
– “Failures” at the manufacturer reinforced the false conclusion as
those “failures” were not examined in detail and were a result of a
testing error.
• Do not conduct reviews in a board room with PowerPoint
slides
– Pack up your oscilloscope and go into the lab
Enjoy your seminar!