RAC and ASM on Linux Forum: Performance and Maximum Availability. Sheraton Hotel, Tel Aviv, Israel, July 16, 2008. Speakers: Oded Ilan, Gadi Chen, Alejandro Vargas, Eyal Druskin, Nickita Chernovski, Saar Maoz.
Agenda
• What ASM is, how it works, and its advantages for better performance
• Configuring ASM for maximum performance
How ASM Works
[Diagram: on database file open, the RDBMS instance asks the ASM instance to create the file extent map; ASM returns file layout info, update statistics, and health info. ASM manages only metadata and is not in the I/O path, so the database performs I/O against the disk group through the operating system at RAW performance.]
ASM is an implementation of the SAME (Stripe And Mirror Everything) methodology
Disk performance properties:
• Disk drive capacity
• Rotational speed
• Seek time
• Transfer rate
ASM optimizes both sequential throughput (DWH) and random access (OLTP), without trading one off against the other.
Maximize sequential throughput (DWH)
• Use sequential transfers of 1 MB or larger to achieve a good balance of efficiency and throughput
• Place the most frequently accessed data toward the outer edge of the spindles
Optimizing random access (OLTP)
To limit the distance the disk head moves (seek) between accesses, position the most frequently accessed data in the same half or quarter of the disk drive.
ASM Striping with RAID 0+1
[Diagram: disk zones by speed, from fastest (outer) to slowest (inner): fastest 20%, next 40%, slow 20%, slowest 20%. Online redo, undo, temp, and DB files go on the faster outer zones; archived logs and dump destinations on the slower inner zones.]
ASM …
• Stripes data to equalize the workload across disks and eliminate "hot spots"
• Implements 1 MB striping to achieve high sequential bandwidth

Striping is designed to match RDBMS I/O characteristics and is defined per file type in templates:
• FINE striping: stripe size = 128 KB, for latency-sensitive and small files (redo logs, control file)
• COARSE striping: stripe size = AU size (1-64 MB), for large I/O (data files, backup files, etc.)
ASM … cont'd
• Takes advantage of LUNs created from the outer half of the disk to provide fast transfer rates and minimize seek overhead
Write: DB_FILE_DIRECT_IO_COUNT × Oracle block size = 1 MB. When creating a tablespace, a 1 MB I/O size can be 31% faster than 64 KB.
Read: DB_FILE_MULTIBLOCK_READ_COUNT × Oracle block size = 1 MB. A full table scan was 62% faster with a 1 MB I/O size than with 64 KB.
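The 1 MB arithmetic above can be checked directly. A minimal sketch, assuming an 8 KB Oracle block size (the parameter names come from the slide; the block size itself is an assumption):

```python
# Hypothetical worked example: blocks per I/O needed to reach the 1 MB target.
block_size = 8 * 1024        # assumed Oracle block size: 8 KB
target_io = 1024 * 1024      # target I/O size from the slide: 1 MB

# DB_FILE_MULTIBLOCK_READ_COUNT such that count x block_size = 1 MB
mbrc = target_io // block_size
print(mbrc)  # 128 blocks per multiblock read
```

With 16 KB blocks the same target would give a count of 64.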
ASM Performance Benefits
• Automatic I/O tuning for all loads
• Prevention of disk space fragmentation
• Online disk space reorganization (rebalance)
• Data redundancy to provide fault tolerance (up to 3-way mirroring)
Reference: http://www.oracle.com/technology/products/database/asm/pdf/take%20the%20guesswork%20out%20of%20db%20tuning%2001-06.pdf
Thank you for your attention!
E-mail: [email protected]
Agenda
• ASM allocation units
• ASM extent sizes
ASM Allocation Units - AU
ASM - Allocation Units - AU
What is an AU?
• An AU is the minimum segment of disk that can be assigned to a file.
• Each extent consists of a number of AUs.
• Each file consists of a number of extents.
ASM - Allocation Units - AU
When do we set it?
• The AU size is set at disk group creation time.
• It can be set to 1, 2, 4, 8, 16, 32, or 64 MB.
• Larger AU sizes typically provide performance advantages for data warehouse applications that use large sequential reads.
• The default size is 1 MB (_asm_ausize).
ASM Extent Sizes
ASM – Extent Allocation
What is an extent?
• A group of AUs belonging to a file.
ASM – Extent Allocation
• The extent size is dictated by the size of the AU.
• There are only three possible extent sizes within any single data file:

Extent number                Extent size
Less than 20,000             1 × AU
Between 20,000 and 40,000    8 × AU
More than 40,000             64 × AU
Extent Allocation Sample
100 GB file with AU = 1 MB:

Without variable extent sizes: 102,400 extents of 1 MB each (100 GB total).

With variable extent sizes (per tier):
# of extents   Extent size     Total size
20,000         1 MB (AU)       20 GB
20,000         8 MB (AU × 8)   160 GB
12,800         64 MB (AU × 64) 820 GB
Total: 52,800 extents
ASM – Extent Allocation
How do we set it?
• We don't: allocation is automatic, driven by the architecture.
• Variable extent size is a new 11g feature.
• It is completely automatic.
• It must be enabled at disk group creation time.
Implementing variable extent sizes

CREATE DISKGROUP FTS EXTERNAL REDUNDANCY
  DISK 'ORCL:FTS1' SIZE 102398 M,
       'ORCL:FTS2' SIZE 102398 M,
       'ORCL:FTS3' SIZE 102398 M,
       'ORCL:FTS4' SIZE 102398 M
  ATTRIBUTE 'compatible.rdbms' = '11.1',
            'compatible.asm'   = '11.1',
            'au_size'          = '64M';
Variable extent sizes
• Enable support for larger ASM files.
• Reduce the SGA memory requirement for VLDBs.
• Improve performance of file create and open operations.
ASM – Extent Allocation
• ASM extent variable array size:

File size  AU size  Without VES  With VES (11g)  % saved
100 GB     1 MB     800 KB       237 KB          70%
100 GB     16 MB    50 KB        50 KB           0%
100 GB     64 MB    12.5 KB      12.5 KB         0%
1 TB       1 MB     8192 KB      418 KB          95%
1 TB       16 MB    512 KB       201 KB          60%
1 TB       64 MB    128 KB       128 KB          0%
10 TB      1 MB     80 MB        1570 KB         98%
10 TB      16 MB    5 MB         370 KB          92%
10 TB      64 MB    1.2 MB       296 KB          76%
1 PB       1 MB     8192 MB      128 MB          98%
1 PB       16 MB    512 MB       8 MB            98%
1 PB       64 MB    128 MB       2 MB            98%

• The benefit of variable extent sizes is a smaller SGA.
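The tiering rule above (1 × AU for the first 20,000 extents, 8 × AU for the next 20,000, 64 × AU beyond that) reproduces the savings in the table. A rough sketch; the ~8 bytes of extent-map memory per extent is an inference from the table's figures, not an official number:

```python
import math

def extent_count(file_mb, au_mb, variable=True):
    """Extent count for an ASM file under the 11g tiering rule:
    extents 0-19,999 are 1 AU, 20,000-39,999 are 8 AUs, the rest 64 AUs.
    Simplified model; real allocation has metadata and rounding details."""
    n_au = math.ceil(file_mb / au_mb)
    if not variable:
        return n_au                              # one AU per extent
    tier1 = min(n_au, 20_000)                    # 1-AU extents
    n_au -= tier1
    tier2 = min(math.ceil(n_au / 8), 20_000)     # 8-AU extents
    n_au = max(n_au - tier2 * 8, 0)
    tier3 = math.ceil(n_au / 64)                 # 64-AU extents
    return tier1 + tier2 + tier3

# 100 GB file, 1 MB AU: 102,400 extents without VES vs 30,300 with VES,
# i.e. roughly the 70% saving shown in the table above.
print(extent_count(102_400, 1, variable=False), extent_count(102_400, 1))
```

At ~8 bytes per extent-map entry, 102,400 extents is about 800 KB and 30,300 is about 237 KB, matching the first row of the table.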
Near Unlimited Scalability
• 63 disk groups
• 10,000 ASM disks
• 1 million files per disk group
• 4 petabytes per ASM disk
• Maximum database size:
  • External redundancy: 140 PB
  • Normal redundancy: 42 PB
  • High redundancy: 15 PB
ASM Fast Mirror Resync
• Fraction of the time to re-establish redundancy
• Only changed blocks are resynced
[Diagram: (1) failure time < DISK_REPAIR_TIME; (2) the disk becomes accessible again: only modified extents need to be resynced]
• Benefits:
  • Fast recovery from transient failures
  • Enables proactive maintenance
ASM Preferred Mirror Read
• Allows local mirror read operations
• Eliminates network latencies in extended clusters
• Better performance
[Diagram: extended cluster with Site A holding the primary (P) copy and Site B the secondary (S); each site reads its local mirror]
New ASM Disk Group Attributes
• Compatibility of each disk group is separately controllable
• The database compatible version determines the features that are enabled

Name              Values                   Description
au_size           1|2|4|8|16|32|64 MB      Size of allocation units in the disk group
compatible.rdbms  Valid database version   Format of messages exchanged between DB and ASM
compatible.asm    Valid database version   Format of ASM metadata structures on disk
disk_repair_time  0 M to 2^32 D            Length of time before removing a disk once OFFLINE
ASMCMD Extensions
• md_backup / md_restore: back up and restore disk group metadata (user-created directories, templates, disk group compatibility, disk group name, disk names and failure groups)
• remap: re-maps unreadable blocks
• lsdsk: lists ASM disks, with or without an ASM instance running
• cp: copies files locally or across db servers

$ asmcmd help
Agenda
• What is MAA?
• MAA features and benefits
What is MAA?
• Oracle‘s best practices blueprint based on proven Oracle high-availability technologies and recommendations
Maximum availability - minimum time to recover
What is MAA?
• A 3-tier model:
  • Storage: ASM
  • Server: Clusterware
  • Database: RAC, Flashback, …
MAA Best Practices Benefits
• Recommends operational best practices to minimize downtime
• Reduces the implementation cost of an HA Oracle system by providing detailed configuration guidelines
• Reduces acquisition, management, and support costs for an HA Oracle system
• Reduces the potential cost of downtime by providing best practices to prevent, eliminate, or minimize downtime from scheduled and unscheduled outages
Oracle’s HA Solution Set
Integrated with Database – Unparalleled in the Industry!
Unplanned downtime:
• System failures: Real Application Clusters
• Data failures: ASM, Flashback, RMAN & Oracle Secure Backup, Data Guard, Streams

Planned downtime:
• System changes: online reconfiguration, rolling upgrades
• Data changes: online redefinition
Conclusion
[Diagram: a production site (primary database + storage) and an active standby site (standby database + storage):
• Real Application Clusters & Oracle Clusterware: online H/W & S/W upgrades, fault-tolerant server scale-out
• Data Guard / Streams: active-active sites, data protection, rolling upgrades, online table redefinition
• Flashback technologies: move back in time to correct errors
• Automatic Storage Management: fault-tolerant storage scale-out
• Recovery Manager & Oracle Secure Backup: data protection and archival]
Agenda
• What is Data Guard Broker?
• What's so good about DG Broker?
• Steps to create
• Things to know and be aware of
What is Data Guard Broker?
• An Oracle command-line utility for managing a Data Guard configuration
• No additional features over a normal Data Guard configuration (except for the observer)
• Installed automatically with any DB installation (no additional install required)
Why is Data Guard Broker so good?
• Easy way to manage a standby environment
• Straightforward command interface
• Just provide two identical databases; DG Broker will do the rest, including parameter setup and standby recovery
• Switchover and failover: connect only to the standby DB; DG Broker takes care of the whole role transition on both standby and primary
• Reinstatement: DG Broker automatically flashes back the destroyed ex-primary DB and reinstates it as a standby
Simple Steps to Create
• Create standby redo logs on the primary
• Copy the primary database (using RMAN duplicate or an OS copy)
• Create a standby controlfile
• Pre-create init.ora files, specifying the db_unique_name= parameter on both standby and primary
• Create an spfile from the pfile on both databases
• Mount the standby
Simple Steps to Create
• Start the dgmgrl utility and connect to the primary:

  $ dgmgrl /

• Create the configuration:

  DGMGRL> CREATE CONFIGURATION mysbconfig AS
            PRIMARY DATABASE IS nick102
            CONNECT IDENTIFIER IS nick102;

• Add the standby database (maintained as physical or logical):

  DGMGRL> ADD DATABASE nick102s AS
            CONNECT IDENTIFIER IS nick102s
            MAINTAINED AS PHYSICAL;

• Enable the configuration:

  DGMGRL> ENABLE CONFIGURATION;

• We are done!
Things to Know
• DG Broker is fully ASM- and RAC-compatible.
• Don't forget the local_listener parameter, or switchover/failover will fail.
• ALWAYS specify standby redo logs.
• Enable flashback for fast reinstatement.
• Reinstatement requires flashback to be enabled, plus a complete shutdown and startup mount of the failed DB.
• Don't use dgmgrl on Oracle versions lower than 10.2.0.3.
Michael Zoll - RAC Development (Performance), Oracle Corp.
Presented by Saar Maoz, RACPACK, Oracle Corp.
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.
The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
• Oracle RAC fundamentals and infrastructure
• Common problems and symptoms
• Application and database design
• Diagnostics and problem determination
• Summary: practical performance analysis
OBJECTIVE
• Realize that Oracle RAC performance does not require "black magic"
• General system and SQL analysis and tuning experience is practically sufficient for Oracle RAC
• Problems can be identified with a minimum of metrics and effort
• The diagnostics framework and advisories are efficient
Oracle RAC Architecture
[Diagram: nodes 1..n, each running a VIP, service, listener, database instance, ASM, Oracle Clusterware, and the operating system, connected by a public network. Shared storage managed by ASM holds the redo/archive logs of all instances and the database/control files; the OCR and voting disks reside on raw devices.]
Oracle Clusterware
[Diagram: each node runs the Clusterware daemons CSSD, OPROCD, ONS, CRSD, and EVMD, plus a VIP; CSSD runs at real-time priority. The nodes share a public network, and the OCR and voting disks sit on shared storage (raw devices).]
Under the Covers
[Diagram: instances 1..n, each with an SGA (global resource directory, dictionary cache, library cache, log buffer, buffer cache) and background processes (LMON, LMD0, DIAG, VKTM, LMS0, LGWR, SMON, DBW0, PMON), connected by the cluster private high-speed network, which runs at real-time priority. Each instance has its own redo log files; the data files and control files are shared.]
The Evolution of Cache Fusion

Oracle version  Major milestones in Cache Fusion code
Oracle 8i       CR server
Oracle 9iR1     Full Cache Fusion; process architecture / code path optimization
Oracle 9iR2     Static file affinity
Oracle 10gR1    File affinity with Dynamic Remastering (DRM)
Oracle 10gR2    Affinity locks (16-node TPC-C published); object affinity with Dynamic Remastering
Oracle 11gR1    Undo affinity, TX lock affinity; optimizations for reads and read-mostly
Global Cache Service (GCS)
• Manages coherent access to data in the buffer caches of all instances in the cluster
• Minimizes access time to data which is not in the local cache
  • Access to data in the global cache is faster than disk access
• Implements fast direct memory access over high-speed interconnects
  • For all data blocks and types
• Uses an efficient and scalable messaging protocol
  • Never more than 3 hops
  • New optimizations for read-mostly applications
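The 3-hop ceiling follows from the protocol roles: a requester asks the resource master, which at worst forwards the request to a third instance currently holding the block. A simplified model of the message count (the function and instance names are illustrative, not an Oracle API):

```python
def gc_messages(requester, master, holder):
    """Count messages for one cache-fusion block request (simplified model).
    The requester asks the resource master; the master either serves the
    block itself or forwards the request to the current holder, which
    ships the block directly back to the requester."""
    if requester == holder:      # block already in the local cache
        return 0
    if requester == master:      # 2-way: ask the holder directly
        return 2                 # request + block transfer
    if master == holder:         # 2-way: the master serves the block
        return 2
    return 3                     # 3-way: request, forward, block transfer
```

This is why AWR shows "gc … 2-way" and "gc … 3-way" events, and never anything beyond 3-way, regardless of cluster size.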
Cache Hierarchy: Data in Remote Cache
[Diagram: local cache miss → data block requested from the master → data block returned from the remote cache (remote cache hit)]
Cache Hierarchy: Data on Disk
[Diagram: local cache miss → data block requested → remote cache miss → grant returned → disk read]
Cache Hierarchy: Read Mostly
[Diagram: local cache miss → no message required → disk read]
11.1 Read-Mostly Improvements
• CPU optimizations for read-intensive operations
• Read-mostly access:
  • Message reductions
  • Latency improvements
• Read-only access:
  • No messages
  • Direct reads
• Significant gains: 50-70% reductions measured
Read Only Workload (Wait Events)
• RM=OFF
Top 5 Timed Events                                     Avg wait  %Total
Event                           Waits     Time (s)     (ms)      Call Time
------------------------------  --------  --------     --------  ---------
db file sequential read         263,952   488          2         29.1
gc cr grant 2-way               254,902   453          2         27.1
gc current block 2-way          159,394   305          2         18.2
CPU time                                  275                    16.4
gc current block 3-way          52,236    125          2         7.5
• RM=ON
Top 5 Timed Events                                     Avg wait  %Total
Event                           Waits     Time (s)     (ms)      Call Time
------------------------------  --------  --------     --------  ---------
db file sequential read         905,808   1,343        1         78.9
CPU time                                  338                    19.9
control file parallel write     321       7            21        .4
reliable message                30        3            96        .2
latch: cache buffers lru chain  1,035     1            1         .1
Read Only Workload (messages / blocks)
• RM=OFF
Global Cache Load Profile          Per Second  Per Transaction
Global Cache blocks received:      645.81      211,826.00
Global Cache blocks served:        607.19      199,157.00
GCS/GES messages received:         4,613.02    1,513,070.00
GCS/GES messages sent:             3,086.02    1,012,213.00
DBWR Fusion writes:                0.24        78.00
Estd Interconnect traffic (KB):    4,009.71
• RM=ON
Global Cache Load Profile          Per Second  Per Transaction
Global Cache blocks received:      0.77        254.00
Global Cache blocks served:        0.67        221.00
GCS/GES messages received:         9.18        3,031.00
GCS/GES messages sent:             19.81       6,538.00
DBWR Fusion writes:                0.12        38.00
Estd Interconnect traffic (KB):    8.54
Performance of Cache Fusion
[Diagram: the requester sends a ~200-byte request message (200 bytes / 1 Gb/s); the remote LMS receives it, processes the block, and initiates the send; the requester waits and receives the block (e.g. 8K: 8192 bytes / 1 Gb/s).]
Total access time: e.g. ~360 microseconds (UDP over GbE). Network propagation delay ("wire time") is a minor factor in roundtrip time (approx. 6%, vs. 52% in the OS and network stack).
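The serialization delays quoted in the diagram are easy to recompute. A back-of-envelope sketch that ignores framing and preamble overhead:

```python
def wire_time_us(nbytes, gbit_per_s=1.0):
    """Time to serialize nbytes onto a link of the given speed, in microseconds."""
    return nbytes * 8 / (gbit_per_s * 1e9) * 1e6

print(wire_time_us(200))    # request message: ~1.6 us
print(wire_time_us(8192))   # 8K block: ~65.5 us
```

Even together, the two transmissions account for well under half of the ~360 µs roundtrip, which matches the slide's point that most of the latency lives in the OS and network stack rather than on the wire.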
Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

Block size   RT (ms), UDP/GbE   RT (ms), RDS/IB
2K           0.30               0.12
4K           0.31               0.13
8K           0.36               0.16
16K          0.46               0.20

(*) Roundtrip, with blocks not "busy", i.e. no log flush and no serialization ("buffer busy"). AWR and Statspack reports show averages as if they were normally distributed; the session wait history (included in Statspack in 10.2 and in AWR in 11g) shows the actual quantities.
The minimum values in this table are the optimal values for 2-way and 3-way block transfers, and can be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).
Numbers All Database Engineers Should Know and Understand
• Storage: latency in milliseconds (0.001 s); bandwidth in gigabytes per second (1,000,000,000 bytes/s)
• Network components: latency in milliseconds (0.001 s); bandwidth in gigabits per second (1,000,000,000 bits/s)
• Memory: latency in nanoseconds to microseconds (0.000,000,001 s to 0.000,001 s)
Infrastructure: Private Interconnect
• The network between the nodes of a RAC cluster MUST be private
  • Best practice is not to share it with iSCSI storage
• Supported links: GbE, IB (IPoIB: 10.2)
• Supported transport protocols: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
• Large (jumbo) frames are recommended for GbE
  • Send an entire block in one transmit
Infrastructure: Interconnect Bandwidth
• Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
• Typical utilization is approx. 10-30% in OLTP
• 10,000-12,000 8K blocks per second saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)
• Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP
• DSS/DW systems should be designed with > 1 Gb/sec capacity
• A sizing approach with rules of thumb is described in "Project MegaGrid: Capacity Planning for Large Commodity Clusters" (http://otn.oracle.com/rac)
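The GbE saturation figure above can be sanity-checked. A sketch assuming 8 KB blocks and a 125 MB/s payload ceiling for 1 GbE (both assumptions, per the slide's numbers):

```python
GBE_BYTES_PER_SEC = 1e9 / 8          # 1 GbE payload ceiling: 125 MB/s
BLOCK = 8 * 1024                     # 8 KB database block

# At the 75-80% efficiency quoted above, the sustainable block rate is:
for efficiency in (0.75, 0.80):
    blocks_per_sec = GBE_BYTES_PER_SEC * efficiency / BLOCK
    print(f"{efficiency:.0%}: ~{blocks_per_sec:,.0f} blocks/s")
```

This lands in the 11,000-12,000 blocks/s range, consistent with the 10,000-12,000 rule of thumb.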
Infrastructure: IPC configuration
• Important settings:
  • Negotiated top bit rate and full duplex mode
  • NIC ring buffers
  • Ethernet flow control settings
  • CPU(s) receiving network interrupts
• Verify your setup:
  • CVU does checking
  • Load testing eliminates potential problems
  • AWR and ADDM give estimations of link utilization
• Buffer overflows, congested links, and flow control can have severe consequences for performance
Infrastructure: Operating System
• Block access latencies increase when CPUs are busy and run queues are long
  • Immediate LMS scheduling is critical for predictable block access latencies when CPU > 80% busy
• Fewer and busier LMS processes may be more efficient
  • Monitor their CPU utilization
  • Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
  • The default is good for most requirements
• Higher priority for LMS is the default
  • The implementation is platform-specific
Common Problems and Symptoms
• "Lost blocks": interconnect or switch problems
• Slow or bottlenecked disks
• System load and scheduling
• Contention
• Unexpectedly high latencies
A Misconfigured or Faulty Interconnect Can Cause:
• Dropped packets/fragments
• Buffer overflows
• Packet reassembly failures or timeouts
• Ethernet flow control kicking in
• TX/RX errors
"Lost blocks" at the RDBMS level are responsible for 64% of escalations.
“Lost Blocks”: NIC Receive Errors
db_block_size = 8K

$ ifconfig -a
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
      …
“Lost Blocks”: IP Packet Reassembly Failures
$ netstat -s
Ip:
    84884742 total packets received
    …
    1201 fragments dropped after timeout
    …
    3384 packet reassembles failed
Finding a Problem with the Interconnect or IPC
Top 5 Timed Events                                Avg wait  %Total
Event             Waits    Time (s)  (ms)   Call Time  Wait Class
log file sync     286,038  49,872    174    41.7       Commit
gc buffer busy    177,315  29,021    164    24.3       Cluster
gc cr block busy  110,348  5,703     52     4.8        Cluster
gc cr block lost  4,272    4,953     1159   4.1        Cluster  <- should never be here
cr request retry  6,316    4,668     739    3.9        Other    <- should never be here
Global Cache Lost Block Handling
• Detection time reduced in 11g:
  • 500 ms (around 5 secs in 10g)
  • Can be lowered if necessary
  • Robust (no false positives)
  • No extra overhead
• "cr request retry" is an event related to lost blocks; it is highly likely to appear when "gc cr block lost" shows up
Interconnect Statistics
Automatic Workload Repository (AWR):

Target    Avg Latency  Stddev    Avg Latency  Stddev
Instance  500B msg     500B msg  8K msg       8K msg
1         .79          .65       1.04         1.06
2         .75          .57       .95          .78
3         .55          .59       .53          .59
4         1.59         3.16      1.46         1.82

• Latency probes for different message sizes
• Exact throughput measurements (not shown)
• Send and receive errors, dropped packets (not shown)
"Lost Blocks": Solution
• Fix interconnect NICs and switches (firmware, cables, etc.)
• Tune IPC buffer sizes
• Ensure enough OS resources are available
  • A spinning process can consume all network ports
• Disable any firewall on the interconnect
Disk I/O Performance Issues
• Log flush I/O delays can cause "busy" buffers
• "Bad" queries on one node can saturate an interconnect link
• I/O is issued from ALL nodes to shared storage
• Use the Automatic Database Diagnostic Monitor (ADDM) / AWR for a single system image of I/O across the cluster
• The cluster-wide impact of I/O or query-plan issues is responsible for 23% of escalations
Cluster-Wide I/O Impact

Node 1 - Top 5 Timed Events:
Event             Waits    Time (s)  Avg (ms)  %Total Call Time
log file sync     286,038  49,872    174       41.7
gc buffer busy    177,315  29,021    164       24.3
gc cr block busy  110,348  5,703     52        4.8

Node 2 - an expensive query. Load profile (per second): redo size 40,982.21; logical reads 81,652.41; physical reads 51,193.37.

What happens:
1. I/O on the disk group containing the redo logs is bottlenecked
2. Block shipping for "hot" blocks is delayed by log flush I/O
3. Serialization/queues build up
Drill-down: An I/O Capacity Problem

Event                      Waits       Time (s)  Avg (ms)  %Total  Wait Class
db file scattered read     3,747,683   368,301   98        33.3    User I/O
gc buffer busy             3,376,228   233,632   69        21.1    Cluster
db file parallel read      1,552,284   225,218   145       20.4    User I/O
gc cr multi block request  35,588,800  101,888   3         9.2     Cluster
read by other session      1,263,599   82,915    66        7.5     User I/O

A symptom of full table scans and I/O contention.
I/O and/or Bad SQL Problem Fixed

Event                    Waits    Time (s)  Avg (ms)  %Total  Wait Class
CPU time                          4,580               65.4
log file sync            276,281  1,501     5         21.4    Commit
log file parallel write  298,045  923       3         13.2    System I/O
gc current block 3-way   605,628  631       1         9.0     Cluster
gc cr block 3-way        514,218  533       1         7.6     Cluster

1. Log file writes are normal
2. Global serialization has disappeared
I/O Issues: Solution
• Tune the I/O layout
• Tune queries that consume a lot of I/O
CPU Saturation or Long Run Queues

Event                       Waits      Time (s)  Avg (ms)  %Total  Wait Class
db file sequential read     1,312,840  21,590    16        21.8    User I/O
gc current block congested  275,004    21,054    77        21.3    Cluster
gc cr grant congested       177,044    13,495    76        13.6    Cluster
gc current block 2-way      1,192,113  9,931     8         10.0    Cluster
gc cr block congested       85,975     8,917     104       9.0     Cluster

"Congested": LMS could not dequeue messages fast enough. Cause: a long run queue, CPU starvation.
High CPU Load: Solution
• Run LMS at higher priority (the default)
• Start more LMS processes
  • Never use more LMS processes than CPUs
• Reduce the number of user processes
• Find the cause of the high CPU consumption
Contention

Event                   Waits    Time (s)  Avg (ms)  % Call Time
gc cr block 2-way       317,062  5,767     18        19.0
gc current block 2-way  201,663  4,063     20        13.4
gc buffer busy          111,372  3,970     36        13.1
CPU time                         2,938               9.7
gc cr block busy        40,688   1,670     41        5.5

Serialization: global contention on data. It is very likely that "gc cr block busy" and "gc buffer busy" are related.
Contention: Solution
• Identify "hot" blocks in the application
• Reduce concurrency on hot blocks
High Latencies

Event                   Waits    Time (s)  Avg (ms)  % Call Time
gc cr block 2-way       317,062  5,767     18        19.0
gc current block 2-way  201,663  4,063     20        13.4
gc buffer busy          111,372  3,970     36        13.1
CPU time                         2,938               9.7
gc cr block busy        40,688   1,670     41        5.5

Expected: seeing 2-way and 3-way events. Unexpected: averages > 1 ms (the avg should be around 1 ms). Tackle latency first, then tackle busy events.
High Latencies: Solution
• Check the network configuration
  • Private
  • Running at the expected bit rate
• Find the cause of high CPU consumption
  • Runaway or spinning processes
Health Check
Look for:
• Unexpected events, e.g. gc cr block lost (1159 ms)
• Unexpected "hints" of contention and serialization, e.g. gc cr/current block busy (52 ms)
• Load and scheduling, e.g. gc current block congested (14 ms)
• Unexpectedly high averages, e.g. gc cr/current block 2-way (36 ms)
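The checklist above can be condensed into a triage rule of thumb. A sketch, not an official Oracle check; the thresholds follow the slide's guidance (lost blocks should never appear, and 2-way/3-way averages should be around 1 ms):

```python
def gc_health(event, avg_ms):
    """Rough triage for RAC global cache wait events (illustrative only)."""
    if "lost" in event:
        return "problem: lost blocks should never appear"
    if "busy" in event:
        return "contention / serialization"
    if "congested" in event:
        return "load / scheduling (LMS starved for CPU)"
    if avg_ms > 1.0:     # 2-way / 3-way transfers should average ~1 ms
        return "unexpectedly high latency"
    return "ok"

print(gc_health("gc cr block lost", 1159))
print(gc_health("gc cr/current block 2-way", 36))
```

Applied to the tables above, this flags the lost-block event first, then the busy and congested events, and finally the inflated 2-way averages: the same order the slides recommend tackling them.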
Application and Database Design
General Principles
• No fundamentally different design and coding practices for RAC
• BUT: flaws in execution or design have a higher impact in RAC
  • Performance and scalability in RAC are more sensitive to bad plans or bad schema design
  • Serializing contention makes applications less scalable
• Standard SQL and schema tuning solves > 80% of performance problems
Scalability Pitfalls
• Serializing contention on a small set of data/index blocks
  • Monotonically increasing keys
  • Frequent updates of small cached tables
  • Segments without Automatic Segment Space Management (ASSM) or Free List Groups (FLG)
• Full table scans
  • Optimization for full scans in 11g can save CPU and latency
• Frequent invalidation and parsing of cursors
  • Requires data dictionary lookups and synchronization
• Concurrent DDL (e.g. truncate/drop)
Health Check
Look for:
• Indexes with right-growing characteristics
  • Eliminate indexes which are not needed
• Frequent updates and reads of "small" tables
  • "Small" = fits into a single buffer cache
  • Sparse blocks (PCTFREE 99) will reduce serialization
• SQL that scans large amounts of data
  • Perhaps more efficient when parallelized
  • Direct reads do not need to be globally synchronized (hence less CPU for the global cache)
Diagnostics and Problem Determination
MOST OF THE TIME, A PERFORMANCE PROBLEM IS NOT AN ORACLE RAC PROBLEM
Checklist for the Skeptical Performance Analyst (AWR-based)
• Check where most of the time in the database is spent ("Top 5")
• Check whether gc events are "busy" or "congested"
• Check the average wait times
• Drill down:
  • SQL with the highest cluster wait time
  • Segment statistics with the highest block transfers
• Or JUST USE ADDM with Oracle RAC 11g!
Drill-down: An I/O Capacity Problem

Event                      Waits       Time (s)  Avg (ms)  %Total  Wait Class
db file scattered read     3,747,683   368,301   98        33.3    User I/O
gc buffer busy             3,376,228   233,632   69        21.1    Cluster
db file parallel read      1,552,284   225,218   145       20.4    User I/O
gc cr multi block request  35,588,800  101,888   3         9.2     Cluster
read by other session      1,263,599   82,915    66        7.5     User I/O

A symptom of full table scans and I/O contention.
Drill-down: SQL Statements

"Culprit": a query that overwhelms the I/O subsystem on one node.

Physical Reads  Executions  Reads per Exec  %Total
182,977,469     1,055       173,438.4       99.3
SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC

The same query reads from the interconnect:

Cluster Wait Time (s)  CWT % of Elapsed Time  CPU Time (s)  Executions
341,080.54             31.2                   17,495.38     1,055
SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC
Drill-Down: Top Segments

Tablespace  Object    Subobject  Obj    GC Buffer  % of
Name        Name      Name       Type   Busy       Capture
ESSMLTBL    ES_SHELL  SYS_P537   TABLE  311,966    9.91
ESSMLTBL    ES_SHELL  SYS_P538   TABLE  277,035    8.80
ESSMLTBL    ES_SHELL  SYS_P527   TABLE  239,294    7.60
…

Apart from being the table with the highest I/O demand, it was also the table with the highest number of block transfers AND global serialization.
Findings Summary in EM
• Each finding type has a descriptive name
• Facilitates search / aggregation / directives etc.
Recommendations
• Most relevant data for analysis can be derived from the wait events
• Always use Enterprise Manager (EM) and ADDM reports for performance health checks and analysis
• Active Session History (ASH) can be used for session-based analysis of variation
• Export the AWR repository regularly to save all of the above
ADDM Diagnosis for RAC
• Data sources:
  • Wait events (especially the Cluster class and buffer busy)
  • ASH
  • Instance cache transfer data
  • Interconnect statistics (throughput, usage by component, pings)
• ADDM analyzes both the entire database (DATABASE analysis mode) and each instance (INSTANCE analysis mode)
• Analysis of both database and instance resources is summarized in a single report
• Allows drill-down to a specific instance
What ADDM Diagnoses for RAC
• Latency problems in the interconnect
• Congestion (identifying the top instances affecting the entire cluster)
• Contention (buffer busy, top objects, etc.)
• Top consumers of multiblock requests
• Lost blocks
• Reports information about interconnect devices; warns about using PUBLIC interfaces
• Reports the throughput of devices, and how much of it is used by Oracle and for what purpose (GC, locks, PQ)
Agenda
• Base floor: a brief introduction to our architecture
  • Disk storage subsystem
  • 10 GbE for SQL*Loader performance
• Administration & maintenance
  • Backup & recovery
  • Data Pump
  • Table compression
  • Bigfile tablespaces
  • Resumable events
  • Grid Control
Never try to tune the Oracle server before ensuring the hardware and OS have been tuned.
A Brief Introduction to Our Architecture
Initial Storage Configuration
• Symmetrix DMX-4 EMC cabinet
• RAID 5
• Many 15K RPM 146 GB disks (350+)
• RAW devices
• PowerPath as the multipathing tool
• Share-everything meta volume layout
Let's Come Together for a Solution…
• "Have you checked the Oracle server X parameter?"
• "I agree. Oracle is not a file system vendor."
• "You have to buy more disks to sustain the desired throughput."
• "ASM is too young for such a big system."
What is missing in the picture? Numbers.
How to Find the Source of the Problem?
• I am not a person capable of writing AWK, shell, Perl, etc. scripts.
• I am not an expert on the Solaris I/O architecture.
• There are tools, but they do not generate Oracle I/O (dd is the most well-known one).
• Everybody was blaming Oracle and ASM.
• I was aware that the problem was in the hardware, but I required proof.
ORION
• ORION is an advanced tool capable of emulating Oracle I/O by using the native OS I/O libraries used by Oracle itself.
• ORION allows you to define the I/O characteristics:
  • Random / sequential / mixed
  • Array cache size
  • Read % + write %
• ORION will give you the necessary proof if the problem is related to your hardware.
Sample ORION Outputs
[Chart: 1M sequential read throughput vs. number of sequential requests]

Sample ORION Outputs
[Chart: 32K random read performance vs. outstanding I/Os]
Final Storage Configuration
• EMC Symmetrix DMX-4 cabinet
• RAID 5
• Many 15K RPM 146 GB disks (350+)
• RAW devices
• PowerPath as the multipathing tool
• Share-Nothing meta volume layout
Next Step Was AweSoMe
• After solving the problem in the cabinet, ASM was ready to use.
• We rely on external redundancy handled by the storage system.
• Our UNIX administrator labeled ~900 GB raw devices with EMC PowerPath.
• We modified the ASM instance parameters advised in Metalink Note 368055.1:
  • _asm_ausize
  • _asm_stripesize
• The ASM instance has never crashed since the first deployment (more than eight months).
ASM is the AweSoMe implementation of the S.A.M.E. (Stripe And Mirror Everything) architecture.
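The setup described above might look roughly like this sketch. The parameter values follow the ones Metalink Note 368055.1 advises for very large databases, but the disk group name and device paths are placeholders, and underscore parameters are unsupported unless set on Oracle Support's advice.

```sql
-- On the ASM instance (values as advised in Note 368055.1; illustrative):
ALTER SYSTEM SET "_asm_ausize"     = 16777216 SCOPE=SPFILE;  -- 16 MB AU
ALTER SYSTEM SET "_asm_stripesize" = 1048576  SCOPE=SPFILE;  --  1 MB stripe

-- External redundancy: mirroring is left entirely to the storage array.
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY
  DISK '/dev/rdsk/emcpower1c',
       '/dev/rdsk/emcpower2c';
```

Larger allocation units favor the large sequential I/O pattern of a data warehouse.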
ASM & New Turkcell DWH
• The very first thing we did on ASM was to test and learn the recoverability of a database running on ASM under various crash scenarios. We documented it and shared it with our colleagues and managers in Turkcell.
• Compared to our previous file system vendor, ASM is 10%-15% better in various database operations.
• ASM has totally removed the need for:
  • Manual datafile distribution
  • File system boundary prevention
  • Hot spot management
• ASM allows us to take control of the I/O subsystem of our database.
10 GbE for SQL*Loader Performance
• There is a requirement of a ~1 TB/h data transfer rate from the ETL server to the database server using SQL*Loader.
• With a standard Gigabit network, only ~250 GB/h was achieved.
• Proposed solutions:
  • Clustered file system
  • Trunking
  • 10 GbE
• This implementation requires modifications to:
  • The Solaris 10 TCP stack
  • The Session Data Unit (SDU)
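On the Oracle side, the SDU change mentioned above is made in the Net configuration files. A minimal sketch, assuming 10g Oracle Net on both ends; the net service name, host, and service name are placeholders:

```
# tnsnames.ora on the ETL server -- raise the SDU so Oracle Net sends
# larger buffers per round trip (32767 is the 10g maximum):
DWH =
  (DESCRIPTION =
    (SDU = 32767)
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbserver)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = dwh))
  )

# The database server must negotiate the same size, e.g. via
# DEFAULT_SDU_SIZE=32767 in its sqlnet.ora.
```

The effective SDU is negotiated down to the smaller of the two sides, so both ends must be raised.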
Yet Another Hardware Configuration Problem
• In this implementation everybody was expecting a transfer rate of 1 GB/s (10 GbE).
• But again, before tuning any Oracle parameter, we first decided to test the network at the OS level.
• We again required an ORION-like tool, this time for network throughput testing.
• Recently, the Solaris Performance Team released a tool for this purpose called uperf.
• uperf shows that it is impossible to achieve 1 GB/s. It is as simple as ORION and guides you as much as ORION does.
• For more details on the issue, refer to the “High Speed Network Implementation for SQL*Loader Performance” article at http://husnusensoy.wordpress.com
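A uperf test is driven by an XML profile on the sender, with a plain receiver on the other side. The sketch below is modeled on the sample profiles shipped with uperf; the hostname, thread count, and durations are illustrative, not values from the talk.

```xml
<?xml version="1.0"?>
<profile name="bulk_write">
  <group nthreads="8">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=dbserver protocol=tcp"/>
    </transaction>
    <!-- push 1 MB writes for 60 s, mimicking a bulk load stream -->
    <transaction duration="60s">
      <flowop type="write" options="size=1m"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
```

Run `uperf -s` on the database server and `uperf -m bulk_write.xml` on the ETL server; the reported throughput is the hardware's answer, before any Oracle tuning.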
Moral of the Story
• It is true that hardware is a very complex issue and requires specialization. Nobody can expect you to configure a cabinet or a switch. But:
  • You can measure what you have.
  • You should know the limits of your hardware and OS configuration.
And: never try to tune the Oracle server before ensuring the hardware and OS have been tuned.
The less you spend on administration, the more…
Administration & Maintenance
Our Backup & Recovery Strategy
• Incrementally Updated Backup is used as the daily backup strategy.
• It takes only 15 min – 1 hour per day depending on the amount of data change.
• The RMAN check logical option is enabled during backup to minimize the risk of corrupted backups.
• For archiving purposes we use tape backup, with which we achieve a 3-4 TB/hour backup speed.
• Our backups are scheduled through the Grid Agent.
• For SOX compliance we restore the whole database to another machine every 3-4 months.
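The daily job described above can be sketched as a standard incrementally updated backup; the tag name here is a placeholder:

```
RUN {
  # Roll the previous level-1 changes into the on-disk image copy...
  RECOVER COPY OF DATABASE WITH TAG 'daily_merge';
  # ...then take today's level 1, validating blocks as they are read.
  BACKUP CHECK LOGICAL
    INCREMENTAL LEVEL 1
    FOR RECOVER OF COPY WITH TAG 'daily_merge'
    DATABASE;
}
```

The image copy thus stays one day behind the database, which is why the daily window is only the amount of data change (15 min - 1 hour here).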
Incremental Backup in 10g
• I have heard many administrators complaining that “RMAN incremental backup reduces the storage space but not the backup time.”
• Others say, “For incremental backup performance you need to put your static data into read-only tablespaces.”
• The Block Change Tracking feature solves these problems. As far as we have tested and observed:
  • The incremental backup strategy significantly reduces the backup duration.
  • Read-only tablespaces have a very small impact on backup speed; 10g’s genuine incremental backup technology does the job.
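Enabling the feature is a one-liner; the tracking-file destination below is a hypothetical ASM path:

```sql
-- With change tracking on, incremental backups read only changed blocks
-- instead of scanning every datafile.
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
  USING FILE '+DATA/dwh/change_tracking.f';

-- Verify it is active:
SELECT status, filename FROM v$block_change_tracking;
```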
Bugs & Advice
• Bugs:
  • Slow catalog resynchronization bug (5219484). A patch has been released for 10.2.0.4 and backported to 10.2.0.3.
  • A bug with bigfile tablespace datafiles > 4 TB prevents even show all; from running in RMAN. See patch 5448714.
• Advice:
  • Don’t schedule the backup of bigfile tablespaces together with smallfile tablespaces.
  • Do you need to back up all tablespaces? For scratching, index, etc. tablespaces, use the EXCLUDE TABLESPACE option of RMAN.
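For example (the tablespace name is a placeholder):

```
-- Persistently skip a rebuildable tablespace in whole-database backups:
CONFIGURE EXCLUDE FOR TABLESPACE scratch_ts;

-- BACKUP DATABASE now skips scratch_ts; to override the exclusion:
BACKUP DATABASE NOEXCLUDE;
```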
Data Pump
• Throughout the project we have used Data Pump for various purposes:
  • Creating copies of the production database for testing, development, etc.
  • Metadata backup of our databases (a supplementary backup strategy)
• If your project team is not experienced with RMAN, Data Pump is the best option for partial/full data transfer.
• We have cloned a 4+ TB database in ~6 hours, including unload, FTP, and load.
• Be careful about the bug related to EXCLUDE=METADATA in 10.2.0.3 (TM lock).
• Be careful about carrying data over the network: loading over the network requires a stable, high-performance infrastructure.
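A clone of the kind described above can be sketched with expdp/impdp; the directory object, schema, and file names are placeholders:

```shell
# Unload in parallel on the source (%U expands to per-worker file numbers):
expdp system DIRECTORY=dp_dir DUMPFILE=dwh_%U.dmp PARALLEL=8 \
      SCHEMAS=dwh LOGFILE=dwh_exp.log

# Transfer the dump files (e.g. ftp), then load on the target:
impdp system DIRECTORY=dp_dir DUMPFILE=dwh_%U.dmp PARALLEL=8 \
      LOGFILE=dwh_imp.log
```

The alternative NETWORK_LINK mode skips the intermediate files but, as noted above, demands a stable, high-performance network.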
Oracle Table Compression
• Turkcell has a disk pool of ~2.1 PB.
• As a result, an aggressive ILM policy is required in the data warehouse, it being the most disk-demanding department of Turkcell.
• Oracle Table Compression has a significant importance in this sense.
• Oracle achieves a compression ratio of approximately 1:7 for our data.
• It has reduced the space used by our enterprise data warehouse from 4.5 TB to 1.2 TB.
• Bug 5386204 is an ASSM bug related to compressed tables. There is already a patch for it.
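In 10g, plain table compression applies to direct-path (bulk) inserts, which fits the DWH load pattern. A sketch, with hypothetical table and tablespace names:

```sql
-- CTAS is a direct-path operation, so the result is stored compressed:
CREATE TABLE call_detail_archive
  TABLESPACE archive_ts
  COMPRESS
  AS SELECT * FROM call_detail
     WHERE call_date < DATE '2008-01-01';
```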
Compression in VLDB leads to…
• Faster backups
• Faster queries:
  • Higher CPU time
  • Less I/O time
  • Usually less elapsed time
• Less disk consumption
• Easier database management
• Better Information Lifecycle Management:
  • Easier to secure your data
  • Easier to decide whether information is valuable or not
Bigfile Tablespaces
• In our previous database we had 6500+ datafiles.
• As tablespaces grow, we continuously add more and more datafiles to them.
• In 10g we have implemented every tablespace whose size is difficult to estimate as a bigfile tablespace:
  • Scratching tablespaces
  • Temporary tablespaces
  • Undo tablespaces
  • Index tablespaces
• Since we are using ASM, we have no “File System Full” problem. So the practical size of a single datafile is defined by the number of disks we have.
• We have not observed a positive or negative performance impact from bigfile tablespaces.
Resumable Operations
• Data load operations were another nightmare for our ETL administrators.
• Assume that an ETL job running for six hours has crashed due to a full permanent/temporary tablespace.
• Resumable operations have been around since 9i, but prior to 10g they required either a code modification or a logon trigger.
• In 10g, the RESUMABLE_TIMEOUT system parameter allows us to set the suspend timeout at the system level.
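The system-level setting can be sketched as follows; the two-hour timeout is illustrative:

```sql
-- A statement that hits a space error now suspends for up to two hours
-- instead of failing, giving the DBA time to add space:
ALTER SYSTEM SET RESUMABLE_TIMEOUT = 7200;  -- seconds

-- Suspended statements are visible here while they wait:
SELECT name, status, error_msg FROM dba_resumable;
```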
Grid Control
• With the arrival of new databases, the total number of databases we administer has reached four.
• Grid Control is the single point for dealing with them all.
• I have been using Oracle for four years, but I still usually forget the syntax of many commands.
Grid Control
• We have scheduled:
  • Object statistics collection jobs
  • Daily backups
  • Resource plan switching jobs
  • ADDM-advised tasks
• We are monitoring:
  • Backup reports
  • System availability
  • ASH graphs
  • AWR reports
  • ADDM reports
• We:
  • Manage ASM
  • Create/drop/modify database objects
Moral of the Story
• Backup & Recovery is much easier in 10g. Being easy doesn’t mean everything is done by Oracle: you need to secure your backup and ensure it will work when it is required.
• Data Pump is as easy as good old EXP/IMP; moreover, it is parallel and more flexible.
• Every VLDB has some data for online archiving. Oracle table compression is totally transparent to the application and effective for storing that data efficiently.
• Bigfile tablespaces can remove the burden of dealing with many datafiles in your VLDB.
• Resumable operations at the database level will let you sleep better at night.
• If all those are the soldiers, Grid Control is the commander, and every army needs a well-qualified commander.
And
The less you spend on administration, the more Oracle documentation you can read.