Advancing Medical Equipment Maintenance using RCM Methodology
Malcolm G. Ridgway, Ph.D., CCE
Senior Vice President, Technology Management
Masterplan, Inc., Chatsworth, California
1
How A Machine Fails
Traditional / Classical Concept
(Pre-1945)
2
First Generation Maintenance
(Pre-1945)
Was – like the machines – relatively simple.
Primary maintenance strategy was “keep it
looking sharp” and “Run To Failure”
Primary maintenance tool was an oily rag
3
How A Machine Fails
Second Generation Concept
The “Bath Tub” Curve
4
Second Generation
Maintenance
(1945 - 60)
Was – like the machines – a little more complex
because the consequences of unreliable machines
had become more serious (economically).
Maintenance strategy – Fixed Interval Overhauls
PM was still relatively primitive – more of a craft
than a science, and based on the manufacturer’s
experience-based (?) recommendations.
5
Third Generation Maintenance
(1960s)
Became – like the machines – considerably more
complex. The civil aviation industry became the
driver on machine reliability because of the FAA’s
concerns for the public safety
1960 - FAA established a Task Force which became
known as the Maintenance Steering Group (MSG)
1968 – Landmark document (MSG-1) revolutionized
the maintenance business and made the 747 viable
6
How Machines Really Fail
Third Generation Concept
Based on FAA data
7
In the case of aircraft components
• Only 6% show a wear-out failure (Type B) pattern
• And only 14% have a random failure (Type E) pattern
Whereas
• 72% show an infant mortality (Type F) characteristic
8
The Famous
Moment of Enlightenment
in the 1960s…
...About Scheduled Maintenance
9
More frequent PM can lead to
lower reliability !!
10
How This New Approach To
Maintenance Made Jumbo Jets
Economically Feasible
DC8 – Required the scheduled overhaul of 339 items and 4M
man-hours of maintenance prior to its 20,000 hour inspection
DC10 – Required the scheduled overhaul of 7 items and 66K
man-hours of maintenance prior to its 20,000 hour inspection
The DC10 is 3X larger, more complex, and 200X more
reliable than the DC8
The “event” rate of the DC 8 is
60 per million takeoffs;
The “event” rate of the DC10 is 0.3 per million takeoffs.
11
The 1970s
Introduction of the systems approach to maintenance
1974 – DOD contracted with United Airlines to
document the maintenance processes being used by
the civil aviation industry, and directed that the new
approach embodied in the pioneering new concepts be
labeled Reliability-Centered Maintenance (RCM).
1978 – Publication of the book “Reliability-Centered
Maintenance” by Stanley Nowlan and Howard Heap.
12
Explosive growth of RCM during
the 80s & 90s
The military adopts RCM for its ships (including its
nuclear submarines) and its aircraft
NASA joins in with its Shuttle Program
The utility industry adopts RCM for many of its power
stations, including its nuclear power plants.
1982 – MSG-3 rev 2 Type Certification for the 757/ 767
13
What Exactly Is Reliability-Centered Maintenance?
• Uses processes based on modern reliability analyses
• Considers the entire system: equipment; accessories; user; maintainer; environment; utilities; & the patient
• Focuses on maintaining the device’s function with minimum downtime and acceptable levels of safety
• Uses FMEA to define what can go wrong and why
• Uses precise effectiveness metrics and criteria for whether or not proactive maintenance is cost effective
• If interval-based maintenance is feasible, it provides precise formulas for what the intervals should be
14
Benefits (claimed to result)
from using RCM
1. Increased reliability – 50-70% reduction in repairs
2. Increased availability – 25-50% reduction in downtime
3. Greater maintenance cost effectiveness
4. Improved levels of safety
5. Longer useful life of maintained items
6. Creation of comprehensive maintenance databases
15
Current Joint Commission Standards
Standard EC.02.04.01
The hospital manages medical equipment risks
Elements of Performance for EC.02.04.01
3.
The hospital identifies the activities, in writing, for maintaining,
inspecting, and testing for all medical equipment on the
inventory
Note: Hospitals may use different strategies for different items, as
appropriate. For example, strategies such as predictive
maintenance, reliability-centered maintenance, interval-based
inspections, corrective maintenance, or metered maintenance may
be selected to ensure reliable performance.
16
Reality Check
• Maintenance (particularly PM) is an issue of declining
importance - relative to several other equipment
issues (such as use errors and network connectivity)
• But we are still dedicating an estimated 3000 FTEs
(costing about $300M /year) to our PM programs
• We could (and should) be doing something more
productive and more valuable with these resources !
17
Key PM Issues
1. We still do not have a good consensus on what we
mean by the term “PM”, or even why we do it !
2. Although the Joint Commission has allowed us to
exclude “non-critical” devices from our PM
programs since 1989, we still don’t have a rational
definition for a non-critical/ non-life-support device.
3. We don’t have any good methods for justifying the PM
intervals that we use.
4. The PM procedures that most of us use could be
improved.
18
What Causes Equipment To Fail? (1)
1) Progressive wear or deterioration of a component part
2) Random failure of a component part
3) Poor fabrication or assembly of the hardware
4) Poor design of the system (hardware or processes)
5) Subjecting the device to physical stress outside its
design tolerances
6) Exposing the device to environmental stress outside its
design tolerances
19
What Causes Equipment To Fail? (2)
7) Incorrect set up or operation of the device by the user
8) The use of a wrong or defective accessory
9) Poor or incomplete initial set-up or installation, or a poor
quality previous repair
10) Human interference with the device including (possibly)
earlier intrusive PM
Only the first and (possibly) the last of these could be
classed as maintenance-related failures
20
Hidden failures
• Equipment failures are either likely to be noticed (they are evident, i.e. overt) or they are hidden.
• Ideally, devices that are safety-critical or downtime-critical and that have hidden failure modes (i.e. failures that are unlikely to be noticed by the “operating crew”) should be provided with special protection mechanisms.
• It is important to subject devices that are safety-critical or downtime-critical and that have hidden failure modes, without reliable special protection mechanisms, to appropriate performance and safety testing.
21
Special Protection
Mechanisms
1) Operator warning devices
2) Automatic shut-down devices
3) Automatic relief devices
4) Dual components for functional redundancy
5) Guard mechanisms
Special concern = “multiple failures” = failure
modes within the protection mechanisms
22
PM Basics – Why do we do it?
• PM should address:
  1. Failures that result from the degradation of the device’s non-durable parts, and
  2. Detecting the presence of hidden failures.
• PM cannot and does not prevent all types of equipment failures.
• There are several other, more common, causes of device failure.
• Very important PM issue = hidden failures of any special protection mechanisms
23
What does PM achieve?
• PM prevents some equipment failures and the
associated downtime.
• It creates a certain (usually unspecified) level
of confidence that the devices tested are safe
(because they are not in a hidden failed state).
24
Indirect benefits of PM programs
1. Finding failed or damaged devices that have not been reported as needing to be repaired
2. Periodically confirming that the devices are actually still present in the facility
3. Providing some level of comfort and security that everything possible is being done to maximize the level of equipment safety.
25
What PM does not achieve?
• PM cannot and does not prevent all equipment
failures – only those that would have resulted
from the degradation of the device’s non-durable
parts.
• PM cannot and does not mitigate the most
common causes of adverse equipment-related
accidents
26
The Bottom Line on PM
• With respect to:
  • reducing the downtime of downtime-critical equipment, and
  • eliminating the most common causes of adverse equipment-related incidents and accidents,
• even a well-implemented PM program provides only relatively limited value, and it also has a cost
• The more we can optimize the program and quantify
the benefits, the easier it will be to balance the value
gained from a well-implemented PM program against
its cost
27
Better PM terminology
• True preventive maintenance (TPM) =
inspecting, cleaning, lubricating, adjusting or
replacing the device’s non-durable parts…
(aka scheduled restoration, scheduled
discard tasks or predictive maintenance - JIT
remediation via Condition Monitoring)
• Performance verification and/or safety testing
(PVST) = functional testing to detect hidden
failures … (aka failure-finding tasks)
28
TPM = True Preventive Maintenance
…is the inspection, cleaning, lubricating, adjustment
or replacement of a device’s non-durable parts.
Non-durable parts are those components of the
device that have been identified either by the device
manufacturer or by general industry experience as
needing periodic attention, or being subject to
functional deterioration and having a useful lifetime
less than that of the complete device.
Examples include filters, batteries, cables, bearings,
gaskets, and flexible tubing.
29
Predictive Maintenance…
…involves direct monitoring of some variable that will provide a
reliable early warning that a non-durable part is about to fail
(aka Condition Monitoring).
An example might be using an oil contaminant sensor in your car’s
engine lubricant to turn on a dashboard warning light to tell you
when it is time to change your oil.
At the moment this particular PM strategy probably has more
potential in the physical plant area than in the biomedical area.
Physical plant examples include: using vibration analysis to warn of
bearing wear, and using infrared scanning to detect overheating
in electrical switchgear
30
PVST = Performance Verification
and Safety Testing
…is functional testing to detect hidden failures.
Examples of hidden failures include:
Defibrillators that are delivering significantly
less energy than they are set to deliver;
heart rate alarms that do not alarm at the set
threshold, and protective power cut-offs on
hypo-hyperthermia machines that do not
operate at the pre-set cut-off temperature.
31
32
33
34
Special features of the ASHE format
• The procedure number as a “universal product code”
• Separation of the TPM and PVST tasks
• Use of the Note box for concise reporting
• User tasks disclaimer
35
36
37
38
Repair Call Cause Coding
39
Repair Call Cause Coding
Cat 1. Are the device and its accessories still working
properly and safely? If yes, this is a Category 1
failure (aka: use error; “cannot duplicate”).
Cat 2. Is the device itself OK, with the problem due to use
of a wrong or defective accessory or a problem in a
connected network? If …
Cat 3. Is the problem due to physical stress? If …
Cat 4. Is there evidence that this problem could be the
result of a poor initial installation or an incomplete
repair of a previous problem (a “run on”)? If ….
Cat 5. Is there evidence that the failure was due to an
out-of-tolerance ambient environmental
condition?
40
Repair Call Cause Coding
Cat 6. Is there evidence that the failure is due to a
battery problem? If yes, ….
Cat 7. Is there evidence that the failure was due to a
lack of preventive maintenance? If yes, ….
Cat 8. Is there evidence that the failure was caused by
human interference, e.g. earlier intrusive PM? If …
Cat 9. Is there any reason to believe that the failure
was due to general wear and tear? If yes, ….
Cat 0. The cause of failure is unknown (cannot be
categorized).
41
Typical Cause Coding Analysis
Code  Cause of repair call            Call count  %age   Aust.
1     User-related                    54          10.2   14%
2     Accessory or connectivity       7           1.3    3%
3     Physical stress-related         120         22.8   25%
4     Run-on related                  11          2.1    1%
5     Environmental stress-related    13          2.5    1%
6     Battery-related                 32          6.1    -
7     Inadequate PM-related           17          3.2    1%
8     Human interference-related      0           0      0
9     Random, unpredictable failures  273         51.8   52%
0     Uncategorized repair calls
      Total                           527         100    100%
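The percentage column can be recomputed from the call counts. The following is a minimal sketch, assuming the counts are keyed by cause code exactly as in the table above; the names CALL_COUNTS and percentages are illustrative only and are not part of the original slides.

```python
# Call counts per cause code, copied from the table above.
# Code 0 (uncategorized) has no count in the table and is left out.
CALL_COUNTS = {1: 54, 2: 7, 3: 120, 4: 11, 5: 13, 6: 32, 7: 17, 8: 0, 9: 273}


def percentages(counts):
    """Return each code's share of all repair calls, in percent (1 decimal)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no repair calls recorded")
    return {code: round(100.0 * n / total, 1) for code, n in counts.items()}


shares = percentages(CALL_COUNTS)
for code in sorted(shares):
    print(f"Code {code}: {CALL_COUNTS[code]:4d} calls, {shares[code]:5.1f}%")
print("Total calls:", sum(CALL_COUNTS.values()))
```

Code 7 (inadequate PM) is the share most directly tied to the PM program, so it is the natural figure to track over time.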
42
Some types of devices will benefit more
than others from receiving PM:
(1) Those with non-durable parts
1. Identify all possible PM-preventable failure modes by examining each TPM task listed in the PM procedure.
2. Perform a PM Risk Analysis. Rank each failure mode according to the Level of Severity of its potential adverse consequences (LOS score).
3. Estimate the MTBF to obtain the Likelihood of Occurrence (LOO) score, called the Likelihood of Failure (LOF) score on the slides that follow. (How far out is the knee on the Type B failure curve?)
4. Multiply the LOS score by the LOO score to determine the device’s PM Risk Score.
43
Classifying the Level of Severity (LOS)
of any likely adverse consequences from
(1) any non-durable parts-related failures
LOS  Description
4    A PM-preventable failure mode that could be life-threatening or economically “catastrophic” ($$$$)
3    A PM-preventable failure mode that could cause an injury, have a major impact on patient care, or ($$$)
2    A PM-preventable failure mode that could have some impact on patient care, or facility economics ($$)
1    A PM-preventable failure mode that would have only a minor impact on patient care, or facility economics ($)
44
Adverse consequences of (overt)
equipment failures
Three different kinds of consequences:
1. Adverse safety consequences
   • Life-threatening (LOS = 4), safety-major concern (LOS = 3), safety-moderate concern (LOS = 2), safety-only minor concern
2. Adverse operational consequences (uptime)
   • Uptime-critical (LOS = 4), uptime-major concern (LOS = 3), uptime-moderate concern (LOS = 2), etc.
3. Adverse non-operational consequences (cost of repair)
   • Very high cost of repair (LOS = 4), high cost of repair (LOS = 3), moderate cost of repair (LOS = 2), etc.
45
Adverse consequences of (overt)
equipment failures
Economic consequences:
• Uptime-critical devices (LOS = 4) and uptime-major concern devices (LOS = 3)
  • Sophisticated imaging devices, such as CT scanners
  • Key devices with little or no back-up, such as large central sterilizers and automated lab analyzers
• High and very high cost of repair devices (LOS = 3 and 4)
  • Specialized devices, such as lasers, some sterilizers, some ventilators, etc.
46
Classifying the
Likelihood of Failure (LOF)
of (1) any non-durable parts
LOF  Description
4    Frequent. Wear-out type failure likely to occur within a one year period (MTBF of up to 1 year)
3    Occasional. Wear-out type failure likely to occur within a one to two year period (MTBF of between 1 and 2 years)
2    Uncommon. Wear-out type failure likely to occur within a two to five year period (MTBF of between 2 and 5 years)
1    Remote. Wear-out type failure not likely to occur within a five year period (MTBF of more than 5 years)
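The LOF bands above translate directly into a lookup on the estimated MTBF. This is a minimal sketch; the function name lof_from_mtbf is hypothetical, and the treatment of the exact boundary values (1, 2 and 5 years fall into the more frequent band) is an assumption, since the slide does not say which band a boundary belongs to.

```python
def lof_from_mtbf(mtbf_years: float) -> int:
    """Map a wear-out MTBF estimate (in years) to a Likelihood of Failure score.

    Assumption: a value exactly on a boundary (1, 2 or 5 years) is placed
    in the more frequent, i.e. higher-scoring, band.
    """
    if mtbf_years <= 0:
        raise ValueError("MTBF must be a positive number of years")
    if mtbf_years <= 1:
        return 4  # Frequent
    if mtbf_years <= 2:
        return 3  # Occasional
    if mtbf_years <= 5:
        return 2  # Uncommon
    return 1      # Remote


for mtbf in (0.5, 1.5, 3.0, 8.0):
    print(f"MTBF {mtbf} years -> LOF {lof_from_mtbf(mtbf)}")
```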
47
RCM Risk Score.
Compounding Level of Severity (LOS)
and Likelihood of Failure (LOF)
         LOF = 1    LOF = 2      LOF = 3       LOF = 4
         “Remote”   “Uncommon”   “Occasional”  “Frequent”
LOS = 4  4          8            12            16
LOS = 3  3          6            9             12
LOS = 2  2          4            6             8
LOS = 1  1          2            3             4
12 - 16 = Critical risk
6 - 9 = “Worth doing”
48
Some types of devices will benefit more
than others from receiving PM:
(2) Those with hidden failure modes
1. Identify all possible hidden failure modes by examining each PVST task listed in the PM procedure.
2. Perform a PM Risk Analysis. Rank each hidden failure mode according to the Level of Severity of its potential adverse consequences (LOS Score).
3. Rank the Likelihood of Failure of each hidden failure (LOF Score) by reviewing data on the “yield” of previous PVST testing (# of HFs/device-year).
4. Multiply the LOS Score by the LOF Score to determine the device’s PM Risk Score.
49
Classifying the Level of Severity (LOS)
of any likely adverse consequences from
(2) any hidden failures
LOS  Description
4    A hidden failure mode that could be life-threatening or economically “catastrophic” ($$$$s)
3    A hidden failure mode that could cause an injury or have a major impact on patient care (or $$$s)
2    A hidden failure mode that could have some moderate impact on patient care (or $$s)
1    A hidden failure mode that would have only a minor impact on patient care (or only $)
50
Adverse consequences of hidden
equipment failures
Safety consequences:
• Safety-life-threatening devices (LOS = 4)
  • Defibrillator with zero or very low output
• Safety-major impact devices (LOS = 3)
  • Blood warmer with defective over-temp alarm
  • Hypo/hyperthermia machine with defective over-temp alarm or power cut-off mechanism
51
Classifying the
Likelihood of Failure (LOF) of
(2) any hidden failures
LOF  Description
4    Frequent. “Yield” or hidden failure discovery rate of more than 1 per device-year
3    Occasional. “Yield” or hidden failure discovery rate of 0.5 – 1.0 per device-year
2    Uncommon. “Yield” or hidden failure discovery rate of 0.2 – 0.5 per device-year
1    Remote. “Yield” or hidden failure discovery rate of less than 0.2 per device-year
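The yield bands above can be applied to a PVST testing history as follows. This is a minimal sketch; the function names are hypothetical, and placing the shared boundary values (0.2 and 0.5 per device-year) in the higher-scoring band is an assumption, because the slide lists them in both adjacent bands.

```python
def hidden_failure_yield(failures_found: int, devices: int, years: float) -> float:
    """Hidden failures discovered per device-year of PVST testing."""
    device_years = devices * years
    if device_years <= 0:
        raise ValueError("need a positive number of device-years")
    return failures_found / device_years


def lof_from_yield(yield_per_device_year: float) -> int:
    """Map a hidden-failure yield to a Likelihood of Failure score.

    Assumption: the shared boundary values 0.2 and 0.5 go to the
    higher-scoring band; exactly 1.0 stays in the 0.5 - 1.0 band.
    """
    if yield_per_device_year < 0:
        raise ValueError("yield cannot be negative")
    if yield_per_device_year > 1.0:
        return 4  # Frequent
    if yield_per_device_year >= 0.5:
        return 3  # Occasional
    if yield_per_device_year >= 0.2:
        return 2  # Uncommon
    return 1      # Remote


# Hypothetical history: 16 hidden failures found on 100 devices over 4 years.
y = hidden_failure_yield(16, 100, 4)
print(f"yield = {y:.2f} per device-year -> LOF {lof_from_yield(y)}")
```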
52
RCM Risk Score.
Compounding Level of Severity (LOS)
and Likelihood of Failure (LOF)
         LOF = 1    LOF = 2      LOF = 3       LOF = 4
         “Remote”   “Uncommon”   “Occasional”  “Frequent”
LOS = 4  4          8            12            16
LOS = 3  3          6            9             12
LOS = 2  2          4            6             8
LOS = 1  1          2            3             4
12 - 16 = Critical risk
6 - 9 = “Worth doing”
53
Classifying a device’s PM Priority
according to its (worst-case)
RCM Risk Score
Risk Score  PM Priority
12-16       1            “Must-do PM” = (PM-Critical)
6-9         2            PM judged to be “worth doing”
3-4         3            PM worth doing if economics justify (3A); otherwise (3B) RTF
1-2         0            Do no PM = “Run to Failure”
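The scoring and classification steps can be expressed in a few lines. The sketch below assumes the LOS and LOF scores have already been assigned for each TPM or PVST task; the function names and the economics_justify flag (used to split Priority 3 into 3A and 3B) are illustrative, not part of the original slides.

```python
def rcm_risk_score(los: int, lof: int) -> int:
    """RCM Risk Score = Level of Severity x Likelihood of Failure."""
    for name, value in (("LOS", los), ("LOF", lof)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be 1, 2, 3 or 4, got {value}")
    return los * lof


def pm_priority(score: int, economics_justify: bool = False) -> str:
    """Map a risk score to a PM Priority using the bands listed above."""
    if score >= 12:
        return "1"   # must-do PM (PM-critical)
    if score >= 6:
        return "2"   # PM judged worth doing
    if score >= 3:
        return "3A" if economics_justify else "3B"
    return "0"       # do no PM, run to failure


def device_priority(task_ratings, economics_justify: bool = False) -> str:
    """Classify a device by the worst-case score over all of its tasks."""
    worst = max(rcm_risk_score(los, lof) for los, lof in task_ratings)
    return pm_priority(worst, economics_justify)


# Hypothetical device with three rated tasks, given as (LOS, LOF) pairs.
ratings = [(4, 1), (2, 4), (1, 2)]
print("Worst-case PM Priority:", device_priority(ratings))
```

Because LOS and LOF are each limited to 1 through 4, the only possible products are 1, 2, 3, 4, 6, 8, 9, 12 and 16, so the four bands cover every case.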
54
Documenting the PM Risk Analysis (1)
Note device type and PM procedure number
For each TPM task statement
• Describe briefly the severity of the consequence if this part degenerates either partially or totally
• Is the LOS a 4, 3, 2 or 1?
• Estimate the time lapse before this degeneration will occur. Is the LOF a 4, 3, 2 or 1?
• What is the combined RCM Risk Score?
• What is the corresponding PM Priority Level?
55
Documenting the PM Risk Analysis (2)
For each PVST task statement
• Describe briefly the hidden failure that this testing will detect and the severity of the consequences
• Is the LOS a 4, 3, 2 or 1?
• Consult database or estimate how often this failure is likely to occur. Is the LOF a 4, 3, 2 or 1?
• What is the combined RCM Risk Score?
• What is the corresponding PM Priority Level?
If worst case is Priority 1, 2 or 3A, which PM strategy will be implemented?
If implementing fixed interval PM, what is the optimum interval?
56
Alternative PM strategies
1. Performing JIT TPM when indicated by direct condition monitoring (aka Predictive Maintenance)
   • Optimum approach, but techniques are scarce
2. Using JIT on-board automated or operator-implemented performance and safety testing
   • Optimum approach, but no techniques available (yet)
3. Using variable intervals based on usage (metered maintenance)
4. Using fixed intervals (prescriptive or optimized)
   • This is the traditional approach, favored by many regulators
5. Allowing the device to Run-to-Failure
   • Most cost-effective approach for PM Priority 3B and 0 devices
57
Selecting the most cost-effective PM strategy
• If the device is PM Priority 3B or 0: use RTF
• Otherwise, select in the following order:
  • JIT TPM / JIT PVST (Predictive Maintenance)
  • Metered maintenance
  • Fixed interval (optimized)
  • Fixed interval (prescriptive)
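The selection rule above amounts to a short ordered lookup. This is a minimal sketch, assuming the caller already knows which strategies are technically feasible for the device; the feasible argument and the function name select_strategy are hypothetical.

```python
from typing import Iterable

# Preference order taken from the slide, most preferred first.
STRATEGY_ORDER = [
    "JIT TPM / JIT PVST (predictive maintenance)",
    "Metered maintenance",
    "Fixed interval (optimized)",
    "Fixed interval (prescriptive)",
]


def select_strategy(priority: str, feasible: Iterable[str]) -> str:
    """Pick the PM strategy for a device.

    priority: PM Priority as a string ("1", "2", "3A", "3B" or "0").
    feasible: strategies from STRATEGY_ORDER that are available for the
              device (an input assumed for this sketch).
    """
    if priority in {"3B", "0"}:
        return "Run to failure"
    available = set(feasible)
    for strategy in STRATEGY_ORDER:
        if strategy in available:
            return strategy
    raise ValueError("no feasible PM strategy supplied")


print(select_strategy("0", STRATEGY_ORDER))
print(select_strategy("2", ["Fixed interval (prescriptive)", "Metered maintenance"]))
```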
58
Infusion Pump Analysis
1. Using standard FMEA analysis from the classical RCM method, the Thorburn team from The Royal Adelaide Hospital in South Australia identified 145 potential failure modes.
2. But only six were judged to be addressable by some kind of PM task.
3. One had a risk score of 8 (PM Priority 2), which the team described as “worth doing”.
59
Metrics for Monitoring
PM Effectiveness
1. What percentage of repair calls are caused by Category 7 failures (lack of PM), and what percentage were considered to be in the highest Level of Severity?
2. The frequency of occurrence and level of potential severity of equipment-related patient incidents that were attributable to a hidden failure
60
Determining PM intervals
How we do it now
• Based on the Fennigkoh-Smith EM number (No-no)
• Whatever the manufacturer recommends (?)
• Pursuant to the JC’s July 1, 2001 revision to EC.1.6.
(f) and EC.2.10.3. permitting “maintenance
strategies” other than the traditional time-based
inspection intervals.
Text change from “apply professional judgment” to
“data-driven decisions” (But which data and how?)
61
Finding Optimum PM Intervals
1) For Predictive (On-Condition) Maintenance - this
involves finding a condition monitoring technique
with a long P – F (warning) interval
2) For TPM (aka scheduled restoration or scheduled
discard tasks) – this requires knowledge of the
device’s age-related failure pattern.
3) For PVST functional testing (aka failure-finding
tasks) - this requires data on the device’s Mean
Time Between Failures (MTBF).
62
Finding the Optimum PM Interval
2) For TPM (True Preventive Maintenance)
• Requires knowledge of the device’s age-related failure pattern (interval exploration)
• The period between being put into service and the “knee” is called the Economic Life Limit.
• The most efficient interval is just less than 100% of the Economic Life Limit.
63
[Figure: age-related failure pattern, failure rate plotted against time]
• The period between being put into service and the “knee” is called the Economic Life Limit.
• Most efficient interval is just less than 100% of the Economic Life Limit.
64
Finding the Optimum PM Interval
3) For PVST (functional testing)
• Requires knowledge of the failure mode’s mean time between failures (MTBF), from the PM testing database
• And what level of confidence (LOC) is desired that the device is in a “safe operating condition” (SOC)?
• These two factors set the maximum testing interval.
65
Hypothetical data from 4 years of PM testing
100 devices were checked annually for 4 years
Hidden failure (e.g. high leakage current) found 16 times
MTBF = 400 (device-years)/ 16 = 25 years
From this data we can establish a statistical probability
(level of confidence) that, between the tests, one of these
devices was actually in a (hidden) failed state
16 devices were in a failed state for (on average) 6 months
Total hidden downtime was therefore 8 device-years
Probability that device in (hidden) failed state = 8/ 400 = 2%
Probability that device is in safe operating condition = 98%
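The arithmetic on this slide can be checked step by step. This is a minimal sketch using the slide's hypothetical figures; the variable names are illustrative, and taking the average hidden time as half the test interval is an assumption that matches the slide's "on average 6 months" for annual testing.

```python
# Hypothetical figures from the slide.
devices = 100
years_tested = 4
test_interval_years = 1.0      # devices were checked annually
hidden_failures_found = 16

device_years = devices * years_tested                  # 400 device-years
mtbf_years = device_years / hidden_failures_found      # 25 years

# Assumption: a failure is equally likely at any point between two tests,
# so on average it stays hidden for half of the test interval.
avg_hidden_years = test_interval_years / 2             # 0.5 years (6 months)
hidden_downtime = hidden_failures_found * avg_hidden_years   # 8 device-years

p_failed = hidden_downtime / device_years              # 0.02
p_safe = 1.0 - p_failed                                # 0.98

print(f"MTBF = {mtbf_years:.0f} years")
print(f"Probability of hidden failed state = {p_failed:.0%}")
print(f"Probability of safe operating condition = {p_safe:.0%}")
```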
66
According to RCM theory, the relationship between the
MTBF, the testing interval (TI), and the probability that the
device is in a (hidden) failed state (HFS) is:
HFS (%) = 50 x TI (in years) / MTBF (in years)
And the level of confidence (LOC) that the device is in a
safe operating condition is:
LOC (%) = 100 – HFS (%)
As the ratio of the test interval to the MTBF gets
smaller, the probability that the device is in a (hidden)
failed state also gets smaller.
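The two formulas above, and their rearrangement for the longest acceptable testing interval, can be sketched as follows. The function names are hypothetical; max_test_interval is simply the HFS formula solved for TI at a desired LOC and does not appear on the slides in that form.

```python
def hfs_percent(ti_years: float, mtbf_years: float) -> float:
    """Probability (%) that the device is in a hidden failed state."""
    if ti_years <= 0 or mtbf_years <= 0:
        raise ValueError("TI and MTBF must be positive")
    return 50.0 * ti_years / mtbf_years


def loc_percent(ti_years: float, mtbf_years: float) -> float:
    """Level of confidence (%) that the device is in a safe operating condition."""
    return 100.0 - hfs_percent(ti_years, mtbf_years)


def max_test_interval(mtbf_years: float, desired_loc_percent: float) -> float:
    """Longest testing interval (years) that still meets the desired LOC.

    Obtained by solving HFS = 50 * TI / MTBF for TI with HFS = 100 - LOC.
    """
    if mtbf_years <= 0:
        raise ValueError("MTBF must be positive")
    if not 0 < desired_loc_percent < 100:
        raise ValueError("desired LOC must be between 0 and 100 (exclusive)")
    return (100.0 - desired_loc_percent) * mtbf_years / 50.0


print(f"TI = 1 yr, MTBF = 25 yr: HFS = {hfs_percent(1, 25):.2f}%, "
      f"LOC = {loc_percent(1, 25):.2f}%")
print(f"MTBF = 100 yr, LOC = 99%: max TI = {max_test_interval(100, 99):.1f} yr")
```

With an MTBF of 25 years, annual testing gives the same 2% hidden-failure probability as the hypothetical testing data on the previous slide.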
67
The ratio of the test interval (TI) to the MTBF determines
the Level of Confidence (LOC) that the device is in a
Safe Operating Condition (i.e. not in a HFS)
TI (yrs)  MTBF (yrs)  HFS (%)  LOC/SOC (%)
0.5       25          1%       99%
0.5       50          0.5%     99.5%
0.5       100         0.25%    99.75%
1         25          2%       98%
1         50          1%       99%
1         100         0.5%     99.5%
2         50          2%       98%
2         100         1%       99%
4         100         2%       98%
4         200         1%       99%
68
Relationship between the LOC (that the
device is not in a HFS), the Testing
Interval (TI) and the MTBF
[Figure: HFS and LOC plotted against the Testing Interval (TI) in years, from 1 to 4 years, with levels marked at HFS 2% / LOC 98% and HFS 1% / LOC 99%]
69
Manufacturer-recommended maintenance intervals
• Legal question: “Did you follow the manufacturer’s maintenance recommendations?”
• Selection of the optimum interval requires knowledge of the NDP’s age-related failure pattern
• Extensive (pre-market) testing in a simulated environment is time consuming and costly. Therefore it is highly likely that the manufacturer’s recommendations are based more on “guestimates” than on actual testing.
70
Questions ?
71
