FAILURE ANALYSIS

A Brief Discussion of Failure in Design
prgodin @ gmail.com
January 2015
Laws of Processes
• If something can go wrong, it will go wrong. (Murphy’s law)
• Additional observations related to Murphy’s Law:
– If nothing can go wrong, it will.
– If something goes wrong it will be the flaw that causes the most
damage.
– Sooner or later the worst possible combination of events will
happen.
– Discoveries are made by mistake.
– The person who predicts the greatest cost and the longest time for
a project is usually the one with the most experience.
– Hindsight is a perfect science.
Processes
• Not all failures are directly caused by hardware or software flaws.
– Groupthink: When a group of people naturally all want to agree and reach
a consensus, thereby ignoring and discouraging any dissenting viewpoints.
– Ineffective Communication
– Ineffective leadership: Leaders must maintain a vision of the project and
ensure that all teams function together and all required resources are
available. Poor leadership leads to conflict, inability to get the work
completed, inability to delegate, lower morale and a poor work ethic.
– Cultural constraints: Some cultures do not permit a team member to
question a leader’s decisions or to make decisions, even in emergencies.
Effective and experienced managers may be forced to resign due to an
event that was out of their control.
– Responsibility: Poorly defined responsibilities can lead to a lack of
ownership, which leads to overlooked problems and errors, and to divisions
between groups that should be working together.
– Incompetence: Leaders and team members may be personally incapable
or unwilling to follow protocols & procedures, make important decisions,
foresee problems or find appropriate solutions. Some are willing to
gamble and take risks that should not be taken.
Space Missions
• Space missions are particularly vulnerable to
faults because they are unique, expensive and
complex systems in the harshest environment
with no hope of rescue or repair.
• The advantage to analyzing these faults is that
they are often thoroughly investigated.
NASA Genesis Probe
• 3-year mission
• About 27 months collecting solar wind particles
• 1.8 billion miles
• $260 million
• Crashed to Earth because the deceleration sensors that were to
trigger the parachute release were installed backward.
NASA WIRE Satellite
• The Wide-field Infrared Explorer (WIRE) was a
satellite meant to study molecular gases in space at
infrared wavelengths
• $73 million
• Lost all of its required coolant within minutes of
reaching orbit, seriously affecting its mission.
• Problem attributed to timing errors and failsafe
design flaws within the digital logic circuit.
NASA WIRE Satellite
• On power-up, the FPGA did not receive a reset before all the
systems were operational. This resulted in a random signal which
ejected the telescope cover at the wrong time. All the coolant
was lost.
• The mission failed because of multiple errors in the digital circuit
design, including a flawed power-up circuit and propagation
delays that were not taken into account.
http://klabs.org
The design on the left shows the
14-bit counter that provides a 100 ms
pulse to “arm” the pyrotechnic bolts on
the cover. The counter relies on its
reset inputs to turn it on or
off. In the absence of a power-up reset,
the counter will start in a random state
and may immediately send an “arm”
signal.
This flaw prematurely fired the cover
and disabled the main satellite
instrument through loss of coolant.
The factors in the electronics circuit design which proved problematic included:
• FPGAs do not start in a predictable manner, so there must be allowances for
incorrect outputs on start-up
• other associated circuits produced glitch states on start-up due to start-up delays,
compounding the problem
• oscillators take time to start up and will not provide reliable edges for some time,
which further compounded the problem
• simulation software does not take start-up into account and was a poor substitute
for testing the functionality of the actual circuit.
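The effect of a missing power-up reset can be illustrated with a small simulation. This is only a sketch under stated assumptions: the real WIRE decode logic is not given here, so the rule that "arm" is asserted when the counter's top bit is set, and the names `power_up`, `arm_asserted` and `spurious_arm_rate`, are hypothetical.

```python
import random

COUNTER_BITS = 14
# Hypothetical decode: "arm" is asserted when the counter's top bit is set.
ARM_THRESHOLD = 1 << (COUNTER_BITS - 1)


def power_up(reset_applied, rng):
    """Return the counter value immediately after power-up."""
    if reset_applied:
        return 0  # a power-on reset forces a known state
    # Without a reset the flip-flops come up in an arbitrary state.
    return rng.randrange(1 << COUNTER_BITS)


def arm_asserted(counter):
    """Assumed decode of the arm output from the counter value."""
    return counter >= ARM_THRESHOLD


def spurious_arm_rate(reset_applied, trials=10_000, seed=0):
    """Fraction of simulated power-ups that assert "arm" straight away."""
    rng = random.Random(seed)
    hits = sum(arm_asserted(power_up(reset_applied, rng)) for _ in range(trials))
    return hits / trials
```

With a reset, no simulated power-up asserts "arm"; without one, roughly half do under this assumed decode. A simulator that always starts the counter at zero would never show the problem.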
Other NASA Failures
• DART mission (2005): Software bugs (calculation and data acquisition errors)
caused the vehicle to burn excess fuel and collide with its rendezvous target.
• SBIRS satellites (2007): Classified mission, but it appears there was a timing
problem with the communication bus on the circuit boards, and the backup
system (“safehold”) failed, locking out any possibility of re-establishing
control.
• Mars Polar Lander (1999): Communication was lost on final approach to Mars.
The software likely misinterpreted vibrations on descent as contact with the
surface and shut down the descent engines prematurely.
• Mars Climate Orbiter (1998): Well known for its engineering fault:
thruster data supplied in pound-force seconds was read as newton-seconds,
and the uncorrected trajectory took the spacecraft too low into the
Martian atmosphere, where it was lost.
These and other losses were partially attributed to using simulation software
instead of physical tests. For instance, simulation software typically did not
include vibration as a possible problem-causing event.
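One common defence against a unit mix-up like the Mars Climate Orbiter's is to attach the unit to the value, so that a bare number cannot cross an interface. The sketch below assumes the quantities were thruster impulses in pound-force seconds and newton-seconds. The `Impulse` class and `total_impulse` function are hypothetical and are not taken from any flight software.

```python
from dataclasses import dataclass

# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_lbf_s(cls, value):
        # Convert at the boundary, where the source unit is still known.
        return cls(value * LBF_S_TO_N_S)

    @classmethod
    def from_n_s(cls, value):
        return cls(value)


def total_impulse(burns):
    """Sum a list of Impulse values, rejecting bare numbers."""
    if not all(isinstance(b, Impulse) for b in burns):
        raise TypeError("every burn must be an Impulse, not a bare number")
    return Impulse(sum(b.newton_seconds for b in burns))
```

A caller must then state the unit explicitly, for example `total_impulse([Impulse.from_lbf_s(10.0), Impulse.from_n_s(5.0)])`. Passing `10.0` on its own raises an error instead of being silently misread.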
In 2003 a technician removed the 24 bolts that held the adapter plate to the cart
but did not record the action. The team that later attached the satellite to the
adapter plate did not check for the bolts. As the satellite was moved to a
horizontal position, it toppled. A complete lack of procedural discipline was to
blame: the processes were in place but were ignored. The repair cost $135 million.
NASA was not the only organization with problems and challenges.
China, Russia, Japan, Korea, the EU, India, Iran and other countries
(and companies) have experienced failures with satellites. Many
were mechanical, and some are simply unknown or unreported.
Wenzhou Train Collision
• In July 2011, two high-speed passenger trains collided in China, killing 40
and injuring more than 200 people.
• One train received a green “go” signal instead of a red “stop” signal,
allowing it to occupy the same track as another train.
• A lightning strike blew fuses for the logic circuits that detected trains and
controlled the signaling. The design flaw was that the deactivated circuit
showed a green light even though it could no longer detect traffic on the rail.
The wrong “failsafe” was incorporated into the design.
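The failsafe principle at stake can be sketched in a few lines: a signal should show "go" only on positive proof that the track is clear, and fall back to "stop" in every other case, including a dead detection circuit. This is a minimal illustration, not the actual signaling logic. `Aspect` and `signal_aspect` are hypothetical names, and `None` stands in for a failed detector.

```python
from enum import Enum
from typing import Optional


class Aspect(Enum):
    RED = "stop"
    GREEN = "go"


def signal_aspect(track_occupied: Optional[bool]) -> Aspect:
    """Choose the aspect; None means the detection circuit is dead or unreadable."""
    if track_occupied is False:
        return Aspect.GREEN  # only positive proof of a clear track permits "go"
    return Aspect.RED  # occupied, unknown or failed: default to "stop"
```

The design choice is in the last line: the safe state is the default, so any failure mode that nobody anticipated still produces a red light.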
Fukushima
• One of the worst nuclear accidents ever recorded, it released
massive amounts of radioactive material following a major
earthquake and tsunami on March 11, 2011 in Japan.
– the earthquake shifted Japan a few meters to the east and dropped
the landmass by 0.5 meters.
– the 14-meter tsunami that followed killed 19,000 people and
destroyed whole towns.
– there were 11 nuclear reactors in the affected area. Several were in
maintenance mode; all shut down automatically.
• Fukushima
– Fukushima Daiichi units 1 to 3 had problems with the shutdown
– Pool storage for spent and highly radioactive fuel was damaged
– Unit 1 would eventually fail completely and breach containment
Known Risk
• The Japanese and Americans knew of the risks involved in
building a nuclear plant at this location.
• In 1896 an estimated magnitude 8.5 earthquake occurred in the area with a
10.5 meter tsunami.
• Before the accident, several Japanese studies had anticipated a possible
disaster, but nothing was done about it.
Ancient stone in Japan warning people not to
build below this point: a warning ignored
Nuclear Plant
• Based on an early General Electric boiling water reactor design:
– Water is used to cool the radioactive fuel rods contained
within a pressure vessel
– Water levels and temperature must be maintained
• All 3 vessels had problems
– Lost electrical power needed to control recirculating and
other systems
– All backup generators were flooded. Batteries ran out
after one day. Other battery banks flooded.
– All “Failsafe” backup cooling systems non-functional
– Reactors 1, 2 and 3 overheated and all 3 received damage
to their fuel rods.
– The storage pool for waste fuel developed a crack and was
also losing coolant. This emergency was most pressing and
the reactors were initially ignored.
Number 1
• Without power, temperature and
pressure in #1, 2 and 3
climbed. Coolant levels
dropped below the fuel rods.
Rods exposed to air
disintegrated due to heat.
• Venting the pressure also released
hydrogen, which exploded.
• Likely coolant leak at base of #1.
• Water was continuously pumped
into the system to avert a more
serious explosion and release.
• Radioactive material at the
bottom of the #1 vessel
eventually ruptured it,
releasing highly radioactive
particles into the environment,
and melted over 2 ft into the
concrete below.
Fukushima Lessons
• Tsunami was predictable and had happened in recent history. Several reports
had determined a serious risk existed yet no action was taken.
• The initial report by the electrical company stated the cause of the accident was
“natural causes” but subsequent government analysis demonstrated the
accident was caused by a complete lack of concern, foresight and will. The
initial report itself was also singled out as a further example of these problems.
• Cultural conventions contributed significantly to the failures in preventing and
handling the disaster.
• Essential generators could have been placed a few dozen meters higher, which
would have prevented the catastrophe. Instead, they were placed below
the pressure vessel and remained flooded.
• All other backup systems failed due to flooding.
• Emergency processes were mismanaged due to confusion over authority, such as
who could authorize airlifting additional backup generators.
• Too much nuclear waste was stored at the facility, and the containment had
been breached by the earthquake.
• Cost of the disaster is estimated at more than $127 billion; some say $250 billion.
• Some land and ocean areas may never be occupied again.
Three Mile Island
• In 1979 a critical valve stuck open, yet a panel light
indicated it was closed. The indicator was wired to the
signal going to the valve and did not detect whether the valve
was actually closed.
• The open valve reduced the amount of coolant in the
reactor core. There was no sensor for water levels.
• Other protocols and regulations were not followed.
Coolant backup systems were off-line due to
maintenance. The operators did not correctly
interpret multiple instrument readings
that indicated low coolant.
• A combination of errors led to a loss of coolant, a partial meltdown and
a release of radioactive materials outside of the plant.
• Only small amounts of radiation leaked to the outside environment,
but this accident created serious concern among the population.
• This accident led to a heightened awareness of managing risk.
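The valve indicator flaw comes down to displaying what was commanded instead of what was measured. The sketch below contrasts the two approaches. The `Valve` fields and both indicator functions are hypothetical, and the position sensor is an assumption, since the plant as described had none.

```python
from dataclasses import dataclass


@dataclass
class Valve:
    commanded_closed: bool  # the signal sent to the valve
    actually_closed: bool   # reading from an assumed position sensor on the valve


def indicator_from_command(valve):
    """Flawed design: the panel light mirrors the command signal."""
    return "CLOSED" if valve.commanded_closed else "OPEN"


def indicator_from_feedback(valve):
    """Safer design: show the measured position and flag any disagreement."""
    if valve.commanded_closed != valve.actually_closed:
        return "FAULT"
    return "CLOSED" if valve.actually_closed else "OPEN"
```

For a valve commanded closed but stuck open, `Valve(commanded_closed=True, actually_closed=False)`, the first indicator reads "CLOSED" while the second reads "FAULT".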
Consumer Products
• There is a large list of consumer items that have been
recalled because of electrical or electronic design flaws.
• In addition to possible lawsuits due to secondary
damage, these recalls cost companies a substantial
amount of money in logistics, labour, and market image.
• In the US, it is estimated that product recalls cost $1
Trillion per year.
Consumer Product Flaws
• Countries like Canada require specific, accredited third-party testing
for electrical products, in accordance with the Standards Council
of Canada (SCC) and provincial regulator guidelines. The
standards are set by the Canadian Standards Association (CSA).
• Some companies, such as Nemko, specialize in testing consumer
products to ensure they are safe and functional over time.
• One analysis of the latest recalls stated that 80% were caused by
design, 20% by production. Many of the flaws were due to
incomplete product testing in a variety of operational conditions.
http://www.ecnmag.com
Consumer Recalls: Fire and Overheating
• Most of these products have a defective electronic
design or component that may cause overheating:
– Ryobi battery chargers: 550,000 units
– Genie garage door openers: 18,000 units
– GE Humidifiers: 2,700,000 units
– Gree Dehumidifiers: 2,200,000 units
– Maytag dishwashers: 1,700,000 units
– HP Chromebook chargers: 145,000 units
– Schneider surge protectors: 15,000,000 units
– many more…
GE ADEW30LN humidifier, before and after
Noteworthy Consumer Product Failures 1
(How were these failures detected?)
• Tyco Smoke Detector (2006): Faulty sensor rendered them unable
to detect smoke in high humidity. Recall of 150,000 units.
• Tyco Simplex Fire Alarm Control (2014): Defective chip, 750 units
• Tyco Simplex Grinnell Fire Alarm Control (2011): Defective
software fails to alert monitoring centers, 540 units
• Kidde residential smoke/combo CO alarms (2014): Defective
design. If a power outage occurs when the device performs its
once-per-minute health check the device goes into a ‘latched’
mode, causing it not to alarm.
Noteworthy Consumer Product Failures 2
(How were these failures detected?)
• Visonic Personal Emergency Response kit recalled because after a
reboot it may fail to reconnect with the personal pendant.
Defective design. (1700 units) A previous recall for the same
model was a failure of the low battery indicator. (24,000 units)
• Bosch Corporate Security Systems fail to activate an alarm in an
alarm condition due to design defect (2000 units).
• Honda vehicles (2013) brake unexpectedly due to an electronic defect in
the stability control system, which fails to boot
properly on startup (344,000+ vehicles).
Conclusion
• There will always be design flaws, inadequate
processes and poor decisions made by people.
• Using design and simulation tools provides the
ability to quickly develop and evaluate
solutions.
• Ultimately the hardware must function over
its anticipated lifetime so it is important to
prototype and anticipate what physical factors
may affect its operation.
Testing Products
• A good manufacturing practice is to sample-test a batch of products for durability and
functionality.
• Many electronic component failures occur
within the first few hours of use.
• Environmental chambers with hot-cold and
humid-dry cycles, different atmospheric
pressures, electrical noise, vibration, UV light
and other environmental factors are used on
circuit boards.
Environmental test chamber
http://www.candctechinc.com
Counterfeit Parts
Image of a counterfeit
500 GB hard drive that was
actually a 128 MB flash drive that
reports 500 GB but
overwrites earlier data.
Purchased in a shop in China
in 2011.
https://www.jitbit.com + other sources
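A drive that misreports its capacity can be caught by writing a unique tag to every block it claims to have and then reading them all back. The sketch below models the idea in memory. `FakeDrive` is a hypothetical stand-in that wraps writes onto a small store, and `verify_capacity` is an illustrative name. Note that a test of this kind on real hardware overwrites whatever data the device holds.

```python
class FakeDrive:
    """Hypothetical model: reports many blocks but wraps writes onto a small store."""

    def __init__(self, reported_blocks, real_blocks):
        self.reported_blocks = reported_blocks
        self._store = [b""] * real_blocks

    def write(self, block, data):
        # Addresses beyond the real capacity silently wrap around.
        self._store[block % len(self._store)] = data

    def read(self, block):
        return self._store[block % len(self._store)]


def verify_capacity(drive):
    """Return the first block that reads back wrong, or None if all are intact."""
    # Write everything first; reading back immediately would hide the wrap-around.
    for block in range(drive.reported_blocks):
        drive.write(block, block.to_bytes(8, "big"))
    for block in range(drive.reported_blocks):
        if drive.read(block) != block.to_bytes(8, "big"):
            return block
    return None
```

An honest device, such as `FakeDrive(10, 10)`, passes with `None`, while `FakeDrive(1000, 10)` fails at the very first block because later writes have overwritten it.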
Counterfeit Parts
• A major issue in the electronics industry
– Significant financial risk and costs for manufacturers and
consumers
• A failed 10¢ part can cost hundreds of dollars in labour to identify and
rectify, or in downtime, replacement, recalls and lawsuits.
• According to the US Government, counterfeiting accounts for 8% of
all merchandise trade ($1T)
• Time-consuming verification for genuine product implementation
– Safety risk due to failed component
Fake parts have been found in aircraft
control boards, military equipment,
NASA equipment, medical equipment,
consumer items, electrical devices, etc.
multiple sources
Counterfeit Parts
• Techniques:
– Remarking (black-topped or sanded):
• Substituted
• Inferior part
• Used part
– Used parts sold as new (leads cleaned)
– Defective or substandard batches sold as good
– Dies of older versions packaged as new versions
– Some fakes completely lack the electronics inside
Note faint previous label (image: multiple sources)
Identifying Counterfeit Parts
• Most likely to be more expensive parts, such as ICs (81%) and transistors (8%)
• Check markings for:
– poor, offset or uneven printing
– date codes that are impossible
– incorrect package codes
– codes that do not match the box they came in
– topping that a standards-based solvent removes
– parts the marked manufacturer could not have manufactured
• Check surface of the parts for:
– inconsistency between parts in the same batch
– filled-in cavities that manufacturers normally leave on the chip
– scratches from sanding
– inconsistent finish
– inconsistencies on the underside
http://spectrum.ieee.org/
http://www.aeri.com/counterfeit-electronic-component-detection/
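The "impossible date code" check can be partly automated. The sketch below assumes a four-digit YYWW code (two-digit year, two-digit week) and parts made in 2000 or later. Both are assumptions, since marking formats vary by manufacturer. The function name `date_code_problems` and its parameters are hypothetical.

```python
from datetime import date


def date_code_problems(code, introduced, today):
    """Return the reasons a four-digit YYWW date code is implausible (empty if none)."""
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        return ["not a four-digit YYWW code"]
    year = 2000 + int(code[:2])  # assumption: all parts were made in 2000 or later
    week = int(code[2:])
    if not 1 <= week <= 53:
        return ["week number out of range"]
    problems = []
    made = (year, week)
    # Compare (ISO year, ISO week) pairs to avoid building invalid dates.
    if made > tuple(today.isocalendar())[:2]:
        problems.append("dated in the future")
    if made < tuple(introduced.isocalendar())[:2]:
        problems.append("dated before the part was introduced")
    return problems
```

For a part introduced on `date(2010, 1, 4)` and inspected on `date(2015, 1, 15)`, the code "1560" is rejected for its week number, "1610" for lying in the future, and "0905" for predating the part, while "1230" passes.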
How do they enter the market?
• Parts purchased on the “open market” such as eBay.
• Unethical or inexperienced brokers, suppliers or vendors
• Often genuine parts are sent as “samples” for testing, but subsequent
orders are fakes
• Often the fakes function and may not be noticed
• Fake performance reports, certificates, specification & safety sheets
• Counterfeiters are getting better
• Source of the materials:
– Parts stripped from used electronics
– Remarked cheaper, lower quality or lower performance parts
– Obsolete or defective parts and dies purchased in bulk
– Most come from China (over 60%) where there is no legal enforcement
preventing counterfeiting
Reclamation & Repackaging
from waste to “new”
images: multiple sources
Conclusion
• Only order parts from authorized distributors
• Train staff to look for counterfeit components
• Require certificates of authenticity when necessary
• Don’t order parts from eBay or questionable suppliers
• Video: Power supply from eBay (funny too):
https://www.youtube.com/watch?v=DZDh8z9UDTo
Copper Clad Aluminum (CCA) Cat 5e wire: not acceptable
https://www.cablewholesale.com