ASM without HW RAID


Implementing ASM Without HW RAID: A User's Experience
Luca Canali, CERN
Dawid Wojcik, CERN
UKOUG, Birmingham, December 2008
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Outline
• Introduction to ASM
– Disk groups, fail groups, normal redundancy
• Scalability and performance of the solution
• Possible pitfalls, sharing experiences
• Implementation details, monitoring, and tools to ease ASM deployment
Architecture and main concepts
• Why ASM?
– Provides the functionality of a volume manager and a cluster file system
– Raw access to storage for performance
• Why ASM-provided mirroring?
– Allows the use of lower-cost storage arrays
– Allows mirroring across storage arrays
• Arrays are not single points of failure
• Array (HW) maintenance can be done in a rolling fashion
– Stretch clusters
ASM and cluster DB architecture
• Oracle architecture built from redundant, low-cost components
(Diagram: servers, SAN, storage arrays)
Files, extents, and failure groups
(Diagrams: files and extent pointers; failgroups and ASM mirroring)
ASM disk groups
• Example: HW = 4 disk arrays with 8 disks each
• An ASM diskgroup is created using all available disks
– The end result is similar to a file system on RAID 1+0
– ASM allows mirroring across storage arrays
– Oracle RDBMS processes directly access the storage (raw disk access)
(Diagram: an ASM diskgroup with striping inside Failgroup1 and Failgroup2 and mirroring between the two failgroups)
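For illustration, such a diskgroup can be created with SQL from the ASM instance. This is a minimal sketch with hypothetical disk paths, diskgroup and failgroup names (one failgroup per storage array), not the actual CERN commands:

-- Normal redundancy diskgroup mirroring across two storage arrays
CREATE DISKGROUP data1 NORMAL REDUNDANCY
  FAILGROUP array1 DISK
    '/dev/mpath/array1_disk1', '/dev/mpath/array1_disk2'
  FAILGROUP array2 DISK
    '/dev/mpath/array2_disk1', '/dev/mpath/array2_disk2';
-- ASM stripes extents across the disks of each failgroup and
-- mirrors every extent between the two failgroups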
Performance and scalability
• ASM with normal redundancy
– Stress tested for CERN’s use
– Scales and performs well
Case Study: the largest cluster I have ever installed, RAC5
• The test used 14 servers
Multipathed Fibre Channel
• 8 FC switches: 4 Gbps (10 Gbps uplink)
Many spindles
• 26 storage arrays (16 SATA disks each)
Case Study: I/O metrics for the RAC5 cluster
• Measured, sequential I/O
– Read: 6 GB/s
– Read-Write: 3+3 GB/s
• Measured, small random I/O
– Read: 40K IOPS (8 KB read operations)
• Note:
– 410 SATA disks, 26 HBAs on the storage arrays
– Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM
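As a back-of-envelope check (an addition, not from the original slides), dividing the measured figures by the 410 disks gives roughly:

40,000 IOPS / 410 disks ≈ 98 IOPS per disk
6 GB/s / 410 disks ≈ 15 MB/s per disk

About 100 small random reads per second per spindle is close to what a single SATA drive can deliver, so the IOPS test is disk-bound, while the sequential rate per disk is well below SATA streaming rates, suggesting the limit is elsewhere in the I/O chain (array controllers or FC paths) rather than the spindles.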
How the test was run
• A custom SQL-based DB workload:
– IOPS: randomly probe a large table (several TBs) with many parallel query slaves, each reading a single block at a time
– MBPS: read a large table (several TBs) with parallel query
• The test table used for the RAC5 cluster was 5 TB in size
– Created inside a diskgroup of 70 TB
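A minimal sketch of what such a workload can look like in SQL (table name, index name and degree of parallelism are hypothetical; the actual CERN test harness was a custom one):

-- MBPS test: scan the large test table with parallel query
SELECT /*+ FULL(t) PARALLEL(t, 64) */ COUNT(*) FROM test_table t;

-- IOPS test: many concurrent sessions, each probing random rows by key
-- so that every access is a single-block random read
SELECT /*+ INDEX(t test_table_pk) */ t.payload
FROM   test_table t
WHERE  t.id = :random_id;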
Possible pitfalls
• Production Stories
– Sharing experiences
– 3 years in production, 550 TB of raw capacity
Rebalancing speed
• Rebalancing is performed (and is mandatory) after space management operations
– Typically after HW failures (to restore the mirror)
– Goal: balanced space allocation across disks
– Not based on performance or utilization
– ASM instances are in charge of rebalancing
• Scalability of rebalancing operations?
– In 10g, serialization wait events can limit scalability
– Even at maximum speed, rebalancing is not always I/O bound
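For reference, rebalancing speed is controlled by the rebalance power, either per operation or instance-wide; a sketch (the diskgroup name is an example):

-- Run the pending rebalance of a diskgroup at a chosen power (0-11; 0 pauses it)
ALTER DISKGROUP data1 REBALANCE POWER 8;

-- Default power used by implicit rebalances (ASM instance parameter)
ALTER SYSTEM SET asm_power_limit = 4;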
Rebalancing, an example
(Chart: ASM rebalancing performance on RAC; rebalance rate in MB/min, from 0 to 7000, versus diskgroup rebalance parallelism, from 0 to 12, for Oracle 10g and Oracle 11g)
VLDB and rebalancing
• Rebalancing operations can move more data than expected
• Example:
– 5 TB (allocated): ~100 disks, 200 GB each
– A disk is replaced (diskgroup rebalance)
• The total I/O workload is 1.6 TB (8x the disk size!)
• How to see this: query v$asm_operation; the column EST_WORK keeps growing during the rebalance
• The issue: excessive repartnering
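The growing EST_WORK can be observed directly on the ASM instance, for example:

-- Follow rebalance progress; EST_WORK keeps being revised upwards
-- while the repartnering proceeds
SELECT group_number, operation, state, power,
       sofar, est_work, est_rate, est_minutes
FROM   v$asm_operation;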
Rebalancing issues wrap-up
• Rebalancing can be slow
– Many hours for very large disk groups
• Associated risk
– A 2nd disk failure while rebalancing
– Worst case: loss of the diskgroup because partner disks fail
Fast Mirror Resync
• ASM 10g with normal redundancy does not allow part of the storage to be taken offline
– A transient error in a storage array can cause several hours of rebalancing to drop and then re-add the disks
– It is a limiting factor for scheduled maintenance
• 11g has a new feature, 'fast mirror resync'
– A great feature for rolling interventions on HW
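A minimal sketch of how fast mirror resync is used in 11g (diskgroup and failgroup names are examples); it requires the diskgroup compatibility attributes to be raised to 11.1:

ALTER DISKGROUP data1 SET ATTRIBUTE 'compatible.asm'   = '11.1';
ALTER DISKGROUP data1 SET ATTRIBUTE 'compatible.rdbms' = '11.1';

-- How long ASM tracks changes for an offlined disk before dropping it
ALTER DISKGROUP data1 SET ATTRIBUTE 'disk_repair_time' = '8h';

-- Take one failgroup (= one storage array) offline for a rolling intervention
ALTER DISKGROUP data1 OFFLINE DISKS IN FAILGROUP array1;

-- When the array is back, only the changed extents are resynchronized
ALTER DISKGROUP data1 ONLINE DISKS IN FAILGROUP array1;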
ASM and filesystem utilities
• Only a few tools can access ASM
– asmcmd, dbms_file_transfer, XDB, FTP
– Limited operations (no copy, rename, etc.)
– They require open DB instances
– File operations are difficult in 10g
• 11g asmcmd has the copy command
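As an example of 10g-style file movement, DBMS_FILE_TRANSFER can copy files between ASM and a file system through an open instance; the directory paths and file names below are hypothetical:

CREATE DIRECTORY asm_dir AS '+DATA1/orcl/datafile';
CREATE DIRECTORY fs_dir  AS '/tmp/asm_copies';

BEGIN
  -- Copy a file out of the ASM diskgroup to the local file system
  DBMS_FILE_TRANSFER.COPY_FILE(
    source_directory_object      => 'ASM_DIR',
    source_file_name             => 'users.259.657641123',
    destination_directory_object => 'FS_DIR',
    destination_file_name        => 'users_copy.dbf');
END;
/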
ASM and corruption
• ASM metadata corruption
– Can be caused by 'bugs'
– One case in production, after a disk eviction
• Physical data corruption
– ASM automatically fixes most corruption on the primary extent
– Typically while taking a full backup
– Corruption on a secondary extent goes undetected until a disk failure or rebalance exposes it
Disaster recovery
• Corruption issues were fixed by using a physical standby to move to 'fresh' storage
• For HA, our experience is that disaster recovery is needed
– Standby DB
– On-disk (flash) copy of the DB
Implementation details
Storage deployment
• Current storage deployment for Physics Databases at CERN
– SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16 per enclosure)
– Linux x86_64, no ASMLib, device mapper instead (naming persistence + HA)
– Over 150 FC storage arrays (production, integration and test) and ~2000 LUNs exposed
– Biggest DB over 7 TB (more to come when the LHC starts; estimated growth up to 11 TB/year)
Storage deployment
• ASM implementation details
– Storage in JBOD configuration (1 disk -> 1 LUN)
– Each disk partitioned at the OS level
• 1st partition: 45% of the disk size, the faster part of the disk (short stroke)
• 2nd partition: the rest, the slower part of the disk (full stroke)
(Diagram: outer sectors = short stroke, faster; inner sectors = full stroke, slower)
Storage deployment
• Two diskgroups created for each cluster
– DATA: data files and online redo logs, on the outer part of the disks
– RECO: flash recovery area destination (archived redo logs and on-disk backups), on the inner part of the disks
• One failgroup per storage array
(Diagram: DATA_DG1 and RECO_DG1 diskgroups spanning Failgroup1, Failgroup2, Failgroup3 and Failgroup4, one failgroup per storage array)
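One way to check the resulting layout (DATA on the p1 partitions, RECO on the p2 partitions, one failgroup per array) is a query like the following on the ASM instance:

-- Disk count and space per diskgroup and failgroup
SELECT g.name AS diskgroup, d.failgroup,
       COUNT(*)                    AS disks,
       ROUND(SUM(d.total_mb)/1024) AS total_gb,
       ROUND(SUM(d.free_mb)/1024)  AS free_gb
FROM   v$asm_disk d
JOIN   v$asm_diskgroup g ON g.group_number = d.group_number
GROUP  BY g.name, d.failgroup
ORDER  BY g.name, d.failgroup;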
Storage management
• SAN setup in a JBOD configuration: many steps, can be time consuming
– Storage level
• logical disks
• LUNs
• mappings
– FC infrastructure: zoning
– OS: creating the device mapper configuration
• multipath.conf: name persistency + HA
Storage management
• Storage manageability
– DBAs set up the initial configuration
– ASM: extra maintenance in case of storage maintenance (disk failure)
– Problems
• How to quickly set up the SAN configuration
• How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk
Example mapping: SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM disk TEST1_DATADG1_0016
Storage management
• Solution
– Configuration DB: a repository of FC switches, port allocations and all SCSI identifiers for all nodes and storage arrays
• Big initial effort
• Easy to maintain
• High ROI
– Custom tools
• Tools to identify
– SCSI (block) devices <-> device mapper devices <-> physical storage and FC port
– Device mapper devices <-> ASM disks
• Automatic generation of the device mapper configuration
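On the ASM side, the device mapper to ASM disk mapping that these tools report corresponds to information available in v$asm_disk; a sketch of the kind of query involved (the actual CERN scripts are custom-made and not shown here):

-- OS device path, ASM disk name, failgroup and status for every disk seen by ASM
SELECT path, name, failgroup, header_status, mode_status, mount_status, state
FROM   v$asm_disk
ORDER  BY path;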
Storage management
Custom-made script: lssdisks.py maps each SCSI id (host, channel, id) to a storage name and FC port, and each SCSI device to its block device, device mapper name and multipath status.

[ ~]$ lssdisks.py
The following storages are connected:
* Host interface 1:
Target ID 1:0:0: - WWPN: 210000D0230BE0B5 - Storage: rstor316, Port: 0
Target ID 1:0:1: - WWPN: 210000D0231C3F8D - Storage: rstor317, Port: 0
Target ID 1:0:2: - WWPN: 210000D0232BE081 - Storage: rstor318, Port: 0
Target ID 1:0:3: - WWPN: 210000D0233C4000 - Storage: rstor319, Port: 0
Target ID 1:0:4: - WWPN: 210000D0234C3F68 - Storage: rstor320, Port: 0
* Host interface 2:
Target ID 2:0:0: - WWPN: 220000D0230BE0B5 - Storage: rstor316, Port: 1
Target ID 2:0:1: - WWPN: 220000D0231C3F8D - Storage: rstor317, Port: 1
Target ID 2:0:2: - WWPN: 220000D0232BE081 - Storage: rstor318, Port: 1
Target ID 2:0:3: - WWPN: 220000D0233C4000 - Storage: rstor319, Port: 1
Target ID 2:0:4: - WWPN: 220000D0234C3F68 - Storage: rstor320, Port: 1

SCSI Id     Block DEV   MPath name     MP status   Storage    Port
---------   ---------   ------------   ---------   --------   ----
[0:0:0:0]   /dev/sda    -              -           -          -
[1:0:0:0]   /dev/sdb    rstor316_CRS   OK          rstor316   0
[1:0:0:1]   /dev/sdc    rstor316_1     OK          rstor316   0
[1:0:0:2]   /dev/sdd    rstor316_2     FAILED      rstor316   0
[1:0:0:3]   /dev/sde    rstor316_3     OK          rstor316   0
[1:0:0:4]   /dev/sdf    rstor316_4     OK          rstor316   0
[1:0:0:5]   /dev/sdg    rstor316_5     OK          rstor316   0
[1:0:0:6]   /dev/sdh    rstor316_6     OK          rstor316   0
. . .
Storage management
Custom-made script: listdisks.py maps each device mapper name to the corresponding ASM disk and its status.

[ ~]$ listdisks.py
DISK            NAME               GROUP_NAME    FG        H_STATUS  MODE    MOUNT_S  STATE   TOTAL_GB  USED_GB
--------------  -----------------  ------------  --------  --------  ------  -------  ------  --------  -------
rstor401_1p1    RAC9_DATADG1_0006  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_1p2    RAC9_RECODG1_0000  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     119.9      1.7
rstor401_2p1    --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     111.8    111.8
rstor401_2p2    --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     120.9    120.9
rstor401_3p1    RAC9_DATADG1_0007  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_3p2    RAC9_RECODG1_0005  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_4p1    RAC9_DATADG1_0002  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_4p2    RAC9_RECODG1_0002  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_5p1    RAC9_DATADG1_0001  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_5p2    RAC9_RECODG1_0006  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_6p1    RAC9_DATADG1_0005  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_6p2    RAC9_RECODG1_0007  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_7p1    RAC9_DATADG1_0000  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_7p2    RAC9_RECODG1_0001  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_8p1    RAC9_DATADG1_0004  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_8p2    RAC9_RECODG1_0004  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_CRS1   . . .
rstor401_CRS2   . . .
rstor401_CRS3   . . .
rstor402_1p1    RAC9_DATADG1_0015  RAC9_DATADG1  RSTOR402  MEMBER    ONLINE  CACHED   NORMAL     111.8     59.9
. . .
Storage management
Custom-made script: gen_multipath.py automatically generates the device mapper configuration; the aliases give naming persistency and multipathing (HA), e.g. SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor916_1.

[ ~]$ gen_multipath.py
# multipath default configuration for PDB
defaults {
    udev_dir            /dev
    polling_interval    10
    selector            "round-robin 0"
    . . .
}
. . .
multipaths {
    multipath {
        wwid     3600d0230006c26660be0b5080a407e00
        alias    rstor916_CRS
    }
    multipath {
        wwid     3600d0230006c26660be0b5080a407e01
        alias    rstor916_1
    }
    . . .
}
Storage monitoring
• ASM-based mirroring means
– Oracle DBAs need to be alerted of disk failures and evictions
– Dashboard: global overview, custom solution (RACMon)
• ASM level monitoring
– Oracle Enterprise Manager Grid Control
– RACMon: alerts on missing disks and failgroups, plus dashboard
• Storage level monitoring
– RACMon: LUNs' health and storage configuration details, dashboard
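A sketch of the kind of check such monitoring can run periodically on each ASM instance (the exact conditions used by RACMon are not shown in the slides):

-- Alert on diskgroup member disks that are not online and healthy
SELECT group_number, name, path, failgroup, mode_status, state
FROM   v$asm_disk
WHERE  group_number > 0
  AND  (mode_status <> 'ONLINE' OR state <> 'NORMAL');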
Storage monitoring
• ASM instance level monitoring
(Screenshot: alert for a new failing disk on RSTOR614)
• Storage level monitoring
(Screenshot: alert for a new disk installed on RSTOR903, slot 2)
Conclusions
• Oracle ASM diskgroups with normal redundancy
– Used at CERN instead of HW RAID
– Performance and scalability are very good
– Allow the use of low-cost HW
– Require more admin effort from the DBAs than high-end storage
– 11g has important improvements
• Custom tools to ease administration
Q&A
Thank you
• Links:
– http://cern.ch/phydb
– http://www.cern.ch/canali