Systems Monitoring and EDS SLA TPTF August 26

Systems Monitoring and EDS SLA
TPTF
August 26th 2008
David Forfia
Aaron Smallwood
Systems Monitoring value and readiness criteria
August 2008
ERCOT Public
Service availability monitoring limitations
• Capabilities of third-party monitoring systems are limited
  – Can monitor availability and performance at a basic level. Example: response to a basic query to determine up/down availability and response time (see sample screenshot below)
  – Cannot determine the availability or correctness of specific functionalities
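The basic level of monitoring described above can be sketched as follows. This is a minimal illustration, not the third-party tool's actual behavior: `probe`, `ProbeResult`, and the `basic_query` callable are hypothetical names introduced here, and the query itself (an HTTP ping, a trivial SQL statement, and so on) is left to the caller.

```python
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    up: bool           # True if the basic query returned without raising
    response_s: float  # wall-clock time the query took, in seconds


def probe(basic_query: Callable[[], object]) -> ProbeResult:
    """Run one basic query and record up/down status and response time.

    Assumption: any exception raised by the query counts as "down".
    """
    start = time.perf_counter()
    try:
        basic_query()
        up = True
    except Exception:
        up = False
    return ProbeResult(up, time.perf_counter() - start)
```

A probe like this only shows that the service answered and how quickly. It says nothing about whether a specific function produced a correct result, which is the second limitation listed above.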
Service functionality monitoring is effort intensive
• Requires parsing of application logs and outputs
  – Immature code yields false positives and negatives and requires additional resources for interpretation and remediation
  – Compiling monthly availability reports from incident logs takes significant effort (see the Nodal EDS detailed incident log below for the month of July for just 3 applications: SE, LFC and SCED)
• Application code that is not ready will lead to rework of monitoring systems and divert key resources
| Month | Notification Date | Issue Start Date & Time | Issue End Date & Time | Issue Duration (Hours/Mins/Secs) | Nodal Application Impacted | SLA Item Impacted (See Instructions) | SLA Item Outage Duration | Issue Description | Root Cause Details | Market Impact | Resolution |
|---|---|---|---|---|---|---|---|---|---|---|---|
| July | 7/2/08 | 7/2/08 12:07 AM | 7/2/08 5:17 AM | 5:10:00 | MMS | SCED | 5:10:00 | Telemetry was lost. | SCED did not dispatch all CC plant physical units. | Interval of 310 minutes was lost | Need to get the EMS and MMS databases in sync. |
| July | 7/9/08 | 7/9/08 8:02 PM | 7/9/08 8:12 PM | 0:10:00 | MMS | SCED | 0:10:00 | Market database is down. | Market database is down due to node eviction. | Interval of 10 minutes was lost | An Oracle patch has been deployed for EDS MMS and the same patch will be deployed to EDS EMS. |
| July | 7/18/08 | 7/18/08 2:09 AM | 7/18/08 12:45 PM | 10:36:00 | EMS | SE | 10:36:00 | RTNET was stuck. | SCADA fails to supply input to RTNET causing it to hang. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/18/08 | 7/18/08 2:11 AM | 7/18/08 2:20 AM | 0:09:10 | EMS | LFC | 0:09:10 | LFC was stuck. | SCADA fails to supply input to RTNET causing it to hang. This caused LFC to fail. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/23/08 | 7/23/08 3:08 AM | 7/23/08 5:37 AM | 2:28:39 | EMS | SE | 2:28:39 | RTNET was stuck. | SCADA fails to supply input to RTNET causing it to hang. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/24/08 | 7/24/08 1:08 AM | 7/24/08 8:23 AM | 7:15:00 | EMS | SE | 7:15:00 | RTNET was stuck. | SCADA fails to supply input to RTNET causing it to hang. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/24/08 | 7/24/08 1:21 AM | 7/24/08 1:50 AM | 0:29:32 | EMS | LFC | 0:29:32 | LFC was stuck. | SCADA fails to supply input to RTNET causing it to hang. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/24/08 | 7/24/08 5:08 PM | 7/24/08 5:22 PM | 0:14:00 | EMS | SE | 0:14:00 | RTNET was stuck. | SCADA fails to supply input to RTNET causing it to hang. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/24/08 | 7/24/08 8:08 PM | 7/24/08 9:18 PM | 1:10:00 | EMS | SE, LFC | 1:10:00 | RTNET was stuck. | SCADA fails to supply input to RTNET causing it to hang. This caused LFC to fail. | N/A | AREVA is looking into the problem. Currently it is still under investigation. |
| July | 7/25/08 | 7/25/08 4:28 PM | 7/25/08 4:33 PM | 0:05:06 | EMS | LFC | 0:05:06 | ERPEMSF failed. | Server failure. | N/A | Windows admin brought ERPEMSF back online. |
| July | 7/30/08 | 7/30/08 3:57 AM | 7/30/08 8:37 AM | 4:40:50 | EMS | SE | 4:40:50 | OLNETSEQ did not start on ERPEMSG | NETIO communication has been lost between local CFGCTRL and CFGCTRL on ERPEMSF | N/A | System admins are looking into this issue. |
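To illustrate the compilation effort mentioned above, the sketch below totals the "SLA Item Outage Duration" values for the State Estimator (SE) rows of the July incident log and turns them into an unadjusted monthly availability figure. The function names are hypothetical, the durations are copied from the log, and no planned-outage exclusions are applied, so the result is illustrative arithmetic rather than a reported ERCOT metric.

```python
from datetime import timedelta
from typing import Iterable

# "SLA Item Outage Duration" values for the July rows where SE is the SLA item
# (the 7/24 8:08 PM row lists both SE and LFC and is counted for SE here).
JULY_SE_OUTAGES = ["10:36:00", "2:28:39", "7:15:00", "0:14:00", "1:10:00", "4:40:50"]


def parse_duration(text: str) -> timedelta:
    """Parse an H:MM:SS duration string as used in the incident log."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def availability(outages: Iterable[str], period: timedelta) -> float:
    """Return the fraction of the period not covered by the listed outages.

    Assumption: outages do not overlap and none are excluded as planned.
    """
    downtime = sum((parse_duration(o) for o in outages), timedelta())
    return 1.0 - downtime / period


july = timedelta(days=31)
print(f"SE availability for July (unadjusted): {availability(JULY_SE_OUTAGES, july):.2%}")
```

Even this small calculation depends on someone first deciding which incidents count against which SLA item, which is where most of the manual effort lies.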
EDS SLA – Availability Targets and Planned Outage Windows
• Availability Targets
  – 9/01/08 to 31 days before the 168-hour test
    • EDS infrastructure and applications (except ICCP): 98%; ICCP: 100%
  – From 30 days before the 168-hour test onwards
    • Production-level availability targets
• Planned Outage Windows
  – 9/01/08 to 31 days before the 168-hour test
    • 8:00 AM – 10:00 AM CT Wednesdays; anytime on Saturdays and holidays
    • For critical bugs: as determined by the EDS PM and the Director of IT Infrastructure & Operations, and communicated via Market Notices
  – From 30 days before the 168-hour test onwards
    • Production-level maintenance windows
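The regular planned outage windows above can be expressed as a simple check, which is useful when deciding whether downtime should count against the availability target. This is a sketch under stated assumptions: `in_planned_window` is a hypothetical helper, timestamps are assumed to already be in Central Time, the 10:00 AM boundary is treated as exclusive, and the holiday calendar is supplied by the caller. Critical-bug outages announced via Market Notices are not modeled.

```python
from datetime import date, datetime, time
from typing import AbstractSet

WEDNESDAY, SATURDAY = 2, 5  # datetime.weekday() values (Monday is 0)


def in_planned_window(ts: datetime, holidays: AbstractSet[date] = frozenset()) -> bool:
    """Return True if a Central Time timestamp falls in a regular planned outage window."""
    # Any time on Saturdays and holidays is inside the window.
    if ts.date() in holidays or ts.weekday() == SATURDAY:
        return True
    # Wednesdays: 8:00 AM up to (but not including) 10:00 AM.
    if ts.weekday() == WEDNESDAY:
        return time(8, 0) <= ts.time() < time(10, 0)
    return False
```

For example, `in_planned_window(datetime(2008, 9, 3, 9, 0))` checks a Wednesday morning, while a holiday is only recognized if its date is passed in `holidays`.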
EDS SLA - Availability Metrics collection and reporting
• EDS SLA in review today (v5.3) supports monitoring at capability level 2 discussed in the introductory slide
• Availability metrics will be collected per the following schedule
  – For some EMS and MMS applications currently in EDS: 9/5/2008
  – For all other applications currently in EDS: 10/1/2008
  – For applications not in EDS: “Deployment in EDS + 1 Month”
• Availability metrics will be reported monthly
  – Report will include current month and YTD availabilities, Detailed Incident log and Planned Outages log
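For applications not yet in EDS, the "Deployment in EDS + 1 Month" rule can be worked out as below. This is only a sketch: `collection_start` is a hypothetical name, and "+ 1 Month" is assumed to mean the same day of the following calendar month, clamped to that month's last day. The slide does not define the rule more precisely.

```python
import calendar
from datetime import date


def collection_start(deployed_in_eds: date) -> date:
    """Return the metrics collection start date: deployment date plus one calendar month.

    Assumption: a day that does not exist in the next month (e.g. the 31st)
    is clamped to that month's last day.
    """
    if deployed_in_eds.month == 12:
        year, month = deployed_in_eds.year + 1, 1
    else:
        year, month = deployed_in_eds.year, deployed_in_eds.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(deployed_in_eds.day, last_day))
```

Under this reading, an application deployed on a hypothetical date of 10/15/2008 would start metrics collection on 11/15/2008.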
EDS SLA Appendix A:
Nodal Infrastructure, Application or Service
1. EDS H/W Infrastructure
2. EMS
Start date for Availability Metrics collection
ICCP
9/5/2008
In effect
State Estimator
Network Topology Builder
Bus Load Forecast
Alarm Processor
Forced Outage Detection
LFC
Network Security Analysis
Dynamic Ratings Processor
Transmission Constraint Management
Topology Consistency Analyzer
AS Capacity Monitoring
Nodal Infrastructure, Application or Service
In effect
Timeline for metrics collection readiness
10/1/2008
Resource Limit Calc
Load Forecasting
Archiving Application
9/5/2008
SESTAT
9/5/2008
Market Operations Test Environment (MOTE)
Voltage and Stability Analysis
10/1/2008
In effect
Start date for Availability Metrics collection
Outage Evaluation
3. Wind Forecast
4. MMS
Timeline for metrics collection readiness
9/5/2008
TBD
TBD
TBD
TBD
Deployment in EDS + 1 month
Deployment in EDS + 1 month
Deployment in EDS + 1 month
Deployment in EDS + 1 month
9/5/2008
10/1/2008
10/1/2008
DAM
SASM
RUC
TBD
CCT
TBD
SCED
TBD
10/1/2008
10/1/2008
9/5/2008
Deployment in EDS + 1 month
9/5/2008
Deployment in EDS + 1 month
Deployment in EDS + 1 month
In effect
EDS SLA Appendix A:
Nodal Infrastructure, Application or Service
5. Web Services
6. MIS
Start date for Availability Metrics collection
Timeline for metrics collection readiness
Nodal Infrastructure, Application or Service
10/01/2008
MIS Portal
10/01/2008
LMP Contour Map
10/01/2008
MMS UI Application
Reports Explorer
TBD
Extracts Scheduler
TBD
Market Participant Notifications
Start date for Availability Metrics collection
7. MPIM
TBD
8. Current Day Reports
10/01/2008
9. Outage Scheduler
10/01/2008
10. NMMS
10/01/2008
11. CRR
10/01/2008
12. Settlements & Billing
10/01/2008
13. CMM
10/01/2008
14. Financial Transfer
Timeline for metrics collection readiness
Deployment in EDS + 1 month
10/01/2008
TBD
Deployment in EDS + 1 month
Deployment in EDS + 1 month
Deployment in EDS + 1 month
TBD
Deployment in EDS + 1 month