Group Disaster Recovery Architecture


Euroclear Disaster Recovery
Concepts and configuration choices
Renaud Colin
Euroclear
Corporate Technology
Technical Services - Enterprise Systems
1
Agenda
•The Euroclear disaster recovery approach was already presented at the GSE meeting of 17/06/2009 by Didier Lemaitre:
Business Continuity @ Euroclear N.V.
•This time the focus will be on :
► The technologies used at Euroclear
► The implications of the disaster recovery time objectives
► Complexity of 3 DCs versus 2 DCs
► The active / inactive Mainframe configuration
2
The technologies used at Euroclear
3
Data centre configuration
Consolidated data centres
• 3 data centres
► 2 at synchronous replication distance in Paris: 20 km (fibre optics / DWDM)
► The 3rd at asynchronous replication distance in Brussels: 300 km (TCP/IP)
• 3 mainframe production sysplexes for 3 workloads:
► EB: Euroclear Bank
► ESES: Euroclear Settlement of Euronext-zone Securities
► SSE: Single Settlement Engine
• Multiple operating system flavours
• Network connectivity through all 3 data centres.
4
Global infrastructure architecture
Key elements of the Euroclear DRP
•Euroclear uses market standards to deliver its local and regional disaster recovery services.
•The key elements of our architecture are: 1) three data centres, 2) clustering technologies, 3) data replication, 4) enterprise data consistency groups, 5) planned full data centre swaps and 6) twice-annual regional disaster tests.
5
[Diagram: the six key elements mapped onto the three data centres: DC1 and DC2 with clustering of processing and data, synchronous data replication within an enterprise consistency group, and planned swaps between them; DC3 is the target of the regional disaster recovery (RDR) tests.]
Multiple systems interconnected
Application mapping to platforms
[Diagram: application mapping to platforms, showing front-end and back-end systems per entity. EB runs on HP NSK, Windows and IBM Mainframe; ESES on AIX and IBM Mainframe; EUI SSE (Single Settlement Engine) on SUN Solaris, HP NSK and IBM Mainframe. Storage is EMC throughout, with HP storage for the HP NSK systems, and the EMC consistency group spans the EMC-attached platforms.]
• Interconnection between applications on different systems drives the need for enterprise data consistency at the storage level whenever possible. The IBM Mainframe and the distributed systems are in a single consistency group, while HP NSK requires applicative resynchronisation procedures, as summarised in the sketch below.
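As a rough illustration of this split (a sketch only, not the actual EMC or GDDR configuration; the platform names come from the slide, the table layout and function are assumptions):

```python
# Sketch: which mechanism keeps each platform's data consistent for DR.
# Platform names come from the slide; everything else is illustrative.
CONSISTENCY_MECHANISM = {
    "IBM Mainframe": "EMC enterprise consistency group",
    "Windows":       "EMC enterprise consistency group",
    "AIX":           "EMC enterprise consistency group",
    "SUN Solaris":   "EMC enterprise consistency group",
    "HP NSK":        "applicative resynchronisation procedures",
}

def consistency_mechanism(platform: str) -> str:
    """Return how a platform's data is kept consistent for disaster recovery."""
    return CONSISTENCY_MECHANISM.get(platform, "to be assessed")

if __name__ == "__main__":
    for platform, mechanism in CONSISTENCY_MECHANISM.items():
        print(f"{platform:13s} -> {mechanism}")
```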
6
Disaster recovery versus availability
DRP strategy and concepts: enterprise data consistency
The priorities for disaster recovery are data consistency, no data loss and system availability. Data consistency is key to ensuring that systems can be restarted from a known baseline.
As our application architecture is spread over many technologies, full data consistency is provided at the storage-systems level. This approach is powerful and all-encompassing, but challenging because of the infrastructure diversity and continuous technical development.
[Diagram: the mainframe, Unix and Windows systems handling processing, client communications and reports sit inside the Enterprise Consistency Group (EMC); the Tandem systems on HP storage sit outside it, with limited consistency managed by application logic.]
7
Note: the HP-based systems maintain data consistency amongst themselves, but have to be excluded from the EMC Enterprise Consistency Group as a result of incompatible technologies.
The implications of the disaster recovery time objectives
8
The disaster recovery time objectives
Base DRP concepts: RPO, RTO, “TRTO”
• Recovery Time Objective (RTO)
► describes the time within which business processes must be restored after a disaster (or disruption)
• Recovery Point Objective (RPO)
► describes the acceptable amount of data loss, measured up to the break in business continuity
Euroclear requirements:
• RTO < 2 hours, RPO = 0 for LDR
• RTO < 4 hours, RPO = 1 min for RDR
• TRTO = RTO – 1 hour
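As a small worked example of these figures (the relationship TRTO = RTO – 1 hour comes from the slide; the Python form itself is just an illustration):

```python
from datetime import timedelta

# Recovery objectives as stated above: (maximum RTO, RPO) per scenario.
OBJECTIVES = {
    "LDR": (timedelta(hours=2), timedelta(0)),          # local DR
    "RDR": (timedelta(hours=4), timedelta(minutes=1)),  # regional DR
}

def trto(rto: timedelta) -> timedelta:
    """Technical RTO: TRTO = RTO - 1 hour, the time budget left for the
    purely technical part of the recovery."""
    return rto - timedelta(hours=1)

for scenario, (rto, rpo) in OBJECTIVES.items():
    print(f"{scenario}: RTO < {rto}, RPO = {rpo}, TRTO < {trto(rto)}")
# LDR: RTO < 2:00:00, RPO = 0:00:00, TRTO < 1:00:00
# RDR: RTO < 4:00:00, RPO = 0:01:00, TRTO < 3:00:00
```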
[Timeline: from normal production, through the decision to invoke disaster recovery, to the return to normal]
• T1: an important incident occurs; production is in deep trouble but still manageable. Technicians try to solve the problem and the contingency team is kept informed.
• T2: the decision is taken to invoke disaster recovery; clean shutdown (system level).
• T3: system switch.
• T4: all applications are (re)synchronised and ready to start; restart of critical DCS-compliant applications, resumption of communication channels to members / participants, and processing of the backlog.
• T5: back to the normal situation: no more backlog and normal timing for critical applications.
The diagram distinguishes the overall RTO from the Technical RTO (TRTO), which covers only the technical part of the recovery.
TRTO: Technical RTO
LDR: Local Disaster Recovery
RDR: Regional Disaster Recovery
DCS: Data centre Consolidation Strategy
9
DR time objectives and how to reach them
• To reach these time objectives we need:
► Automation
– to simplify and qualify the procedures (make them repeatable and testable)
– to avoid relying on experts (commuting time while on watch duty, in a meeting or out for lunch during office hours, on holiday, ...)
– to speed up operations so as to reach the TRTO
– to use the same building blocks in planned swap operations as would be used in an unplanned situation.
► Frequent testing, to ensure that changes do not jeopardise DR capabilities (microcode on disk controllers, on CPUs, on SAN switches, on OSes and configurations, on hypervisors, on the DR automation, etc.).
► Awareness of the disaster-recovery-specific setup to maintain, and monitoring of the key elements of this setup.
► Frequent opportunities to train staff on the DR operations.
We use GDDR to drive 6 DC1-DC2 planned data centre swaps and 2 DC3 RDR tests per year.
10
EMC GDDR automation
• EMC GDDR provides automation of disk recovery and zOS LPAR level recovery (Activate, IPL)
• Distributed Systems and mainframe workloads are managed through GDDR exits
• GDDR identifies DR events and automates planned and unplanned pre-defined scenarios
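One way to picture this (purely a conceptual sketch, not the GDDR product interface; the class, scenario and step names are all invented for illustration) is a set of pre-defined step sequences with exit callbacks hooked onto them:

```python
# Conceptual sketch of scenario automation with exits (callbacks).
# NOT the EMC GDDR interface: class, scenario and step names are invented.
from typing import Callable, Dict, List

Exit = Callable[[str], None]  # an exit receives the step it is hooked on

class ScenarioAutomation:
    def __init__(self) -> None:
        self.scenarios: Dict[str, List[str]] = {}
        self.exits: Dict[str, List[Exit]] = {}

    def define(self, name: str, steps: List[str]) -> None:
        """Register a pre-defined, testable sequence of recovery steps."""
        self.scenarios[name] = steps

    def register_exit(self, step: str, exit_fn: Exit) -> None:
        """Hook distributed-systems / workload handling onto a step."""
        self.exits.setdefault(step, []).append(exit_fn)

    def run(self, name: str) -> None:
        for step in self.scenarios[name]:
            print(f"[{name}] {step}")
            for exit_fn in self.exits.get(step, []):
                exit_fn(step)

gddr_like = ScenarioAutomation()
gddr_like.define("planned swap DC1 -> DC2", [
    "stop workload at DC1",
    "reverse disk replication",
    "activate and IPL zOS LPARs at DC2",
    "restart workload at DC2",
])
gddr_like.register_exit("restart workload at DC2",
                        lambda step: print(f"  exit: restart distributed workloads ({step})"))
gddr_like.run("planned swap DC1 -> DC2")
```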
[Diagram: DC1 (active site) and DC2 (inactive site) each run Unix, W2K and zOS workloads; the R1 primary disks are replicated synchronously (SRDF/S) to the R2 disks and asynchronously (SRDF/A) to the R2' disks in DC3; each data centre holds BCV copies and a GDDR control system (KSYS), one of which acts as the GDDR home system.]
11
Complexity of 3 DCs versus 2 DCs
12
Complexity of 3 DCs versus 2 DCs
Disks and zOS specifics
• With 3 data centres the expectation is to maintain disaster readiness even after a disaster. For this, a basic recommendation* is that all the primary disks must be in the same site, even after a disaster.
• Couple Datasets (CDS) are not allowed in the consistency/autoswap group and are spread with primaries in one site and alternates in the other site. A snapshot of these is copied to DC3 twice a week and after a policy change.
There is one exception to this rule for the logger CDS, as it needs to be replicated to DC3 for recoverability (it contains pointers to positions in the logstreams).
• Page datasets are replicated on the synchronous leg, so that zOS systems survive an autoswap, but are not replicated on the asynchronous leg to avoid wasting bandwidth. A snapshot of these is taken to DC3 weekly. (These rules are summarised in the sketch below.)
• DC3 production LPARs are only activated during regional disaster tests (... or a real disaster).
* Alternatives exist, but they require manual reconfiguration to recover disaster readiness and manual actions to come back to normal.
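Summarising the rules above as a small policy table (a sketch: the category keys and field names are ours, only the rules themselves come from this slide):

```python
# Sketch of the per-category replication rules described on this slide.
# Only the rules come from the slide; the structure is illustrative.
REPLICATION_RULES = {
    "couple datasets (CDS)": {
        "in consistency/autoswap group": False,
        "placement": "primaries in one site, alternates in the other",
        "to DC3": "snapshot twice a week and after a policy change",
    },
    "logger CDS": {
        "in consistency/autoswap group": False,
        "to DC3": "replicated (it holds pointers into the logstreams)",
    },
    "page datasets": {
        "replicated synchronously": True,    # so zOS survives an autoswap
        "replicated asynchronously": False,  # avoids wasting DC3 bandwidth
        "to DC3": "weekly snapshot",
    },
}

for category, rules in REPLICATION_RULES.items():
    print(category)
    for key, value in rules.items():
        print(f"  {key}: {value}")
```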
13
Complexity of 3 DCs versus 2 DCs
Handling the tapes
• Tapes cannot be replicated asynchronously to DC3 any more slowly than the disk asynchronous replication, as they are needed to restore the file situation of each batch running at the moment of the disaster. But disk and tape replication technologies are independent, so we write our tapes to disk with the CA Vtape software before they go to VSM. This means two virtualisation layers.
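The constraint in this bullet boils down to a simple check: the tape image available at DC3 must never be older than the disk consistency point it has to match. A minimal sketch with invented names; writing tapes to disk first (CA Vtape) satisfies it by construction, since the tape data then travels with the disk replication:

```python
from datetime import datetime, timedelta

def tapes_usable_at_dc3(disk_consistency_point: datetime,
                        tape_consistency_point: datetime) -> bool:
    """True if the tape data at DC3 is at least as recent as the disk
    consistency point, so batches running at the moment of the disaster
    can restore their file situation from tape."""
    return tape_consistency_point >= disk_consistency_point

# Example: disks are 60 s behind production, and tapes are too
# (they ride the same asynchronous replication once written to disk).
now = datetime.now()
print(tapes_usable_at_dc3(now - timedelta(seconds=60),
                          now - timedelta(seconds=60)))  # True
```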
[Diagram: the DC1 and DC2 mainframes write to VTape; clustered VSMs in DC1 and DC2 duplex the virtual tapes to tape robots in both sites; DC3 has its own mainframe, VTape, VSMs and tape robot; the sites are linked through FICON channel extenders (CNT), Fibre Channel, and SRDF/A over Fibre Channel over IP.]
14
The active / inactive Mainframe configuration
15
Production sysplex configurations
[Diagram: three production sysplexes (PLEX01, PLXY, PLXP) spread across the DC1 and DC2 mainframes: EB runs on EOCC and EOCP (with the development and test system DVLP on EOCA in the same sysplex), ESES on SIC3 and SIC1, and SP/SSE on EOCZ and EOCY, each LPAR hosting its application and database. The coupling facilities hold data-sharing structures for system messaging, GRS, the logger, the RACF cache, the catalog cache, etc.]
• Each production sysplex runs active / passive. All the active production LPARs are located in the same data centre. The EB sysplex includes the development system to achieve PSLC CPU aggregation.
• Distributed systems implement a similar configuration with MSCS stretched clusters and AIX HACMP.
16
Managing the workload placement
•A parallel sysplex is optimised to run an active / active workload. In order to run an active / inactive sysplex, the setup is slightly different:
►The middleware must be started on one system only, with a mechanism to ensure this.
►JES2 independent mode is used to avoid execution on the inactive node.
►Network access to the applications must be transparent regardless of the site and system running them.
•Using OPSMVS System State Manager (SSM), a single task, “ACTEOC”, manages the active workload placement through SSM dependency logic. ACTEOC uses Automatic Restart Manager (ARM) services to ensure that only one instance of the workload runs in the sysplex.
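Conceptually, ACTEOC enforces a "one active instance per sysplex" rule for the whole moving part of the stack. The sketch below models only that idea; it is not OPSMVS SSM or ARM code, and apart from the EOCC/EOCP system names everything in it is invented:

```python
# Conceptual model of active/inactive workload placement in a two-system
# sysplex. NOT OPSMVS SSM or ARM code; only the EOCC/EOCP names are real.
class WorkloadPlacement:
    def __init__(self, systems):
        self.systems = list(systems)  # e.g. ["EOCC", "EOCP"]
        self.active = None            # system currently running the moving part

    def start(self, system: str) -> None:
        """Start middleware, routing tasks and applications on one system,
        refusing to start a second concurrent instance."""
        if self.active is not None and self.active != system:
            raise RuntimeError(f"workload is already active on {self.active}")
        self.active = system
        print(f"moving part started on {system}")

    def swap(self, target: str) -> None:
        """Planned swap: stop on the current system, then start on the target."""
        print(f"moving part stopped on {self.active}")
        self.active = None
        self.start(target)

eb_sysplex = WorkloadPlacement(["EOCC", "EOCP"])
eb_sysplex.start("EOCC")   # one active instance
eb_sysplex.swap("EOCP")    # planned swap to the other system
```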
17
Workload swap and « routing »
[Diagram: on each of DC1 and DC2 the software stack is split into a static part (zOS, JES2, VTAM, TCPIP, TSO, RMF, OPSMVS, SLS, etc.) and a moving part under the control of ACTEOC, made up of the middleware (DB2, CICS, MQSeries, Connect Direct, CA7, EOS, VPS, SAS, etc.), the routing tasks (DVIPA, DVTAM, JES2ACT) and the applications (batches, kernels, OICS, CIISIN).]
18
Network virtualisation
OSA design
[Diagram: OSA design. In each of DC1 and DC2, the production and clone LPARs (EOCC, EOCP, EOCZ, EOCY, EOCA, EOCG, SIC1, SIC3, SY30, SY31, SY37, ...) reach the LAN through the channel subsystem, multiple CHPIDs and shared OSA adapters (OSA x, OSA y, OSA z) connected to redundant switches (Switch A and Switch B). Each system has several physical interfaces (e.g. IP1 and IP2 for EOCC, IP3 and IP4 for EOCP) plus a static VIPA (e.g. IP5 for EOCC, IP6 for EOCP, and static VIPAs for EOCA and EOCG), and a dynamic VIPA (IP7) is active on EOCP or EOCC depending on where the workload runs.]
19
Network virtualisation
IP address and VTAM Applid virtualised (logical level)
• VIPA allows the virtualisation of the physical interfaces (static VIPA), but also the virtualisation of the application access (dynamic VIPA) by moving the IP address to where the application runs.
• Enterprise Extender removes the need for pure SNA definitions, but SNA remains...
• Dynamic VTAM applids (with ‘?’ and ‘*’ in the definitions) allow the application access to be virtualised. The applids get created when the application connects to VTAM.
• VTAM aliases are used to link a generic application name (e.g. TSOPROD) to the instance running on the system hosting the applications (e.g. EOCPTSO or EOCCTSO).
• These definitions are changed by the automation driving the workload swap.
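The net effect of these mechanisms can be modelled as two small lookup tables that the swap automation rewrites: the dynamic VIPA follows the active system, and the generic alias is repointed to the applid of the running instance. A conceptual sketch only (TSOPROD, EOCPTSO and EOCCTSO come from the slide; the table layout and function are assumptions):

```python
# Conceptual model of the logical-level virtualisation described above.
# Not VTAM or TCPIP definitions; only TSOPROD / EOCPTSO / EOCCTSO and the
# idea of a dynamic VIPA come from the slides.
routing = {
    "dynamic_vipa_owner": "EOCP",         # the system where the DVIPA is active
    "vtam_alias": {"TSOPROD": "EOCPTSO"}  # generic name -> applid of the instance
}

def swap_workload_to(system: str) -> None:
    """What the swap automation changes at the logical level: move the
    dynamic VIPA and repoint the generic alias to the new instance."""
    routing["dynamic_vipa_owner"] = system
    routing["vtam_alias"]["TSOPROD"] = f"{system}TSO"  # e.g. EOCCTSO

swap_workload_to("EOCC")
print(routing)
# Clients keep using TSOPROD and the same IP address; only the mappings moved.
```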
20
Questions?
21