PowerPoint Presentation - LHC Logging Project

LHC Logging Cluster
Nilo Segura
IT/DB
Agenda
● Hardware Components
● Software Components
● Transparent Application Failover
● Service definition
Hardware Components
● Two Sun Fire V240
  – Dual CPU 1 GHz, 4 GB memory
  – Dual internal disks, dual power supply
● One Sun StorEdge 3510FC
  – 2 Gb Fibre Channel architecture
  – 12 x 146 GB 10k RPM FC disks
  – Two RAID controllers with 1 GB cache
  – Can accept up to 2 x 3510 JBOD expansion trays
● Both machines share the same set of disks
  – The 3510 can accept up to 8 directly attached hosts (or up to 4 with a redundant configuration)
Software Components
● Sun Cluster 3.1 Update 1
● Solaris 9
● Oracle RDBMS 9.2.0.5 (Real Application Clusters)
● Oracle Distributed Lock Manager
● Veritas Volume Manager 3.5
● Sun certification completed
  – checking correct level of patches
  – shutting down one of the nodes
  – disconnecting one of the nodes from the disk system
  – etc.
High Availability
● The purpose of the cluster installation is to offer 365x24 access to the database
● No single point of failure
  – Two nodes, two disk systems, two....
● Recovery/Availability offered by the Oracle software (Real Application Clusters)
  – Transactions are recovered by the surviving instance
● Tested the following cases
  – Listener down (re-connection immediate)
  – Listener up but instance down (re-connection immediate)
  – Machine down (re-connection takes longer: 3 minutes when connecting from a Linux client, due to the TCP driver timeout)
    ● Timeout can be tweaked but...
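The "machine down" case is slow because the client waits for the operating system's TCP timeout before trying the second node. The following is a minimal sketch of how a client could bound that wait itself, assuming a plain TCP reachability probe. The helper name `first_reachable` and the 5-second timeout are illustrative choices, not part of the cluster setup. The host names and port are the ones used in the connect strings of this presentation.

```python
import socket

# Host names and port as they appear in the lhclog connect string.
NODES = [("sunlhclog01.cern.ch", 1521), ("sunlhclog02.cern.ch", 1521)]


def first_reachable(nodes, timeout_s=5.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection, or None.

    Each attempt is bounded by timeout_s, so a dead machine costs at most
    timeout_s seconds instead of the operating system's default TCP timeout.
    `connect` is injectable so the logic can be exercised without a network.
    """
    for host, port in nodes:
        try:
            sock = connect((host, port), timeout=timeout_s)
        except OSError:
            # Covers refused connections, unreachable hosts and timeouts.
            continue
        sock.close()
        return host, port
    return None
```

A client would call `first_reachable(NODES)` before opening its database session. The probe only shows that the listener port answers, not that the instance behind it is up.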
Transparent Application Failover
● For SELECT operations, if the connection is lost, the session is resumed transparently on the surviving node
  – Tested and working: the session stops for a few seconds and then resumes without the user issuing a new connect request
  – Not tested from a JDBC Thin driver... it will work with the JDBC OCI driver
● Sessions modifying data will still lose the connection and need to re-connect
  – As expected, the current transaction will be rolled back
● Possibility of LOAD BALANCING at the level of the connect string
  – Not enabled for the moment, perhaps later
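Because failover only resumes SELECT work, a session that modifies data has to reconnect and replay its transaction itself. One possible sketch of that pattern follows. `open_connection`, `ConnectionLost` and the retry count are hypothetical placeholders rather than Oracle driver APIs. A real client would map its driver's own disconnect errors onto this exception.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def run_transaction(open_connection, work, retries=3, delay_s=0.0):
    """Run work(conn) and commit; on connection loss, reconnect and replay.

    open_connection: callable returning an object with commit() and close().
    work: callable that performs all statements of one transaction.
    The server rolls back the interrupted transaction, so the whole unit of
    work is replayed from the start on a fresh connection.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        conn = None
        try:
            conn = open_connection()
            result = work(conn)
            conn.commit()
            return result
        except ConnectionLost as exc:
            last_error = exc
            time.sleep(delay_s)  # Give the surviving node time to take over.
        finally:
            if conn is not None:
                try:
                    conn.close()
                except ConnectionLost:
                    pass  # The connection is already gone.
    raise last_error
```

The `work` callable should be safe to replay, since a connection lost during `commit()` leaves the client unsure whether the commit reached the database.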
Service definition - General
● Service to run 365x24; backups will not interrupt database access
  – Export + hot backups
  – Oracle Recovery Manager will reduce the backup window time
● Problems with the service to be reported to [email protected] and/or the Oracle GSM telephone (depending on the criticality of the problem)
  – Same mechanism used for SUNSLPS and LEP Database servers
● However, the system can still collapse for other reasons (network outage, power failure, gremlins....), so applications must be able to react to these events (local buffering?)
  – Instance failure when recovering a distributed transaction
  – Surviving instance tried to recover and crashed at the same point
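The "local buffering?" idea could look roughly like the following sketch: records are queued in memory and written in order whenever the database accepts them. The class name `BufferedLogger`, the `write` callable and the `max_records` cap are assumptions for illustration. The presentation does not prescribe a design.

```python
from collections import deque


class BufferedLogger:
    """Queue records locally and flush them when the database is reachable."""

    def __init__(self, write, max_records=10000):
        # write: callable that stores one record and raises OSError on failure.
        self._write = write
        # A bounded deque drops the oldest records once the cap is reached.
        self._pending = deque(maxlen=max_records)

    def log(self, record):
        """Queue a record, then try to write everything that is pending."""
        self._pending.append(record)
        self.flush()

    def flush(self):
        """Write pending records in order; stop at the first failure."""
        while self._pending:
            try:
                self._write(self._pending[0])
            except OSError:
                return False  # Keep the record and retry on a later flush.
            self._pending.popleft()
        return True

    def pending(self):
        return len(self._pending)
```

An in-memory buffer does not survive a crash of the application itself. Spooling to a local file would be the natural next step if that matters.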
Service definition - Patches
● We may need to interrupt the service for updates...
  – If all goes well, one day of (scheduled) interruption per year
  – We should be able to apply Solaris patches one node at a time
    ● Moving applications from one node to another
  – Oracle apparently offers rolling upgrade features in their RAC patches
    ● Some patches that touch common structures used by all the instances will still require database downtime
● But: critical security patches may need to be applied at any given moment (following Sun and/or CERN SecurityChief requests)
  – Removed all unneeded Solaris services to avoid potential problems
    ● Private firewall for all the database servers, à la AIS?
lhclog=(DESCRIPTION=
    (FAILOVER=on)
    (LOAD_BALANCE=off)
    (ADDRESS=
      (PROTOCOL=TCP)
      (HOST=sunlhclog01.cern.ch)
      (PORT=1521)
    )
    (ADDRESS=
      (PROTOCOL=TCP)
      (HOST=sunlhclog02.cern.ch)
      (PORT=1521)
    )
    (CONNECT_DATA=
      (SERVICE_NAME=LHCLOGDB)
      (FAILOVER_MODE=
        (TYPE=SELECT)
        (METHOD=BASIC)
      )
    )
  )
lhclog=(DESCRIPTION=
    (FAILOVER=on)
    (LOAD_BALANCE=off)
    (ADDRESS=
      (PROTOCOL=TCP)
      (HOST=sunlhclog01.cern.ch)
      (PORT=1521)
    )
    (ADDRESS=
      (PROTOCOL=TCP)
      (HOST=sunlhclog02.cern.ch)
      (PORT=1521)
    )
    (CONNECT_DATA=
      (SERVICE_NAME=LHCLOGDB)
      (FAILOVER_MODE=
        (TYPE=SELECT)
        (METHOD=PRECONNECT)
      )
    )
  )
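The two entries above differ only in the failover METHOD (BASIC versus PRECONNECT). As an illustration, a short script can render such descriptors from parameters so that the variants stay consistent. The function name `connect_descriptor` and its arguments are hypothetical, and the output simply mirrors the parameters of the entries shown here on a single line.

```python
def connect_descriptor(alias, hosts, service_name, method="BASIC",
                       port=1521, load_balance=False):
    """Render an Oracle Net connect descriptor with SELECT-type failover."""
    # One ADDRESS clause per cluster node, in the order given.
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    balance = "on" if load_balance else "off"
    return (
        f"{alias}=(DESCRIPTION="
        f"(FAILOVER=on)(LOAD_BALANCE={balance})"
        f"{addresses}"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})"
        f"(FAILOVER_MODE=(TYPE=SELECT)(METHOD={method}))))"
    )


if __name__ == "__main__":
    nodes = ["sunlhclog01.cern.ch", "sunlhclog02.cern.ch"]
    print(connect_descriptor("lhclog", nodes, "LHCLOGDB", method="BASIC"))
    print(connect_descriptor("lhclog", nodes, "LHCLOGDB", method="PRECONNECT"))
```

Passing `load_balance=True` would produce the load-balanced variant that the presentation mentions as a later possibility.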
Database
● Space will be managed automatically by Oracle
  – No need to specify extent size
  – Unlimited number of extents