
RAC and ASM on LINUX Forum: Performance and Maximum Availability. Sheraton Hotel, Tel Aviv, Israel, July 16, 2008. Speakers: Oded Ilan, Gadi Chen, Alejandro Vargas, Eyal Druskin, Nickita Chernovsky, Saar Maoz, Hüsnü Şensoy

ASM Configuration for Maximum Performance. Oded Ilan, Oracle Systems Analyst, Support

Agenda

• What ASM is, how it works, and its advantages for better performance
• Configuring ASM for maximum performance

How ASM Works

[Diagram: database file open and create flow. The RDBMS instance asks the ASM instance for file layout information and reports update statistics and health info back; on file creation the ASM instance builds the file extent map. ASM manages only metadata and is not in the I/O path, so the database reads and writes the disk group through the operating system at raw-device performance.]

ASM is an implementation of the SAME (Stripe And Mirror Everything) methodology

Disk performance properties:
• Disk drive capacity
• Rotational speed
• Seek time
• Transfer rate

ASM optimizes both sequential throughput (DWH) and random access (OLTP), without trading one off against the other.

Maximize sequential throughput (DWH)

• Use sequential transfers of 1 MB or larger to achieve a good balance of efficiency and throughput
• Place the most frequently accessed data towards the outer edge of the spindles

Optimizing random access (OLTP)

To limit how far the disk head moves (seek distance) between accesses, position the most frequently accessed data in the same half or quarter of the disk drive.

ASM Striping with RAID 0+1

[Diagram: disk zones by speed. The fastest outer zones (roughly the outer 20-40% of each spindle) hold online redo, undo, temp and database files; the slowest inner zones hold archived logs and dump destinations.]

ASM…

• Stripes data to equalize the workload across disks and eliminate "hot spots"
• Implements 1 MB striping to achieve high sequential bandwidth

Striping is designed to match RDBMS I/O characteristics and is defined per file type in templates:
• FINE striping: stripe size = 128 KB, for latency-sensitive and small files (redo logs, control file)
• COARSE striping: stripe size = AU size (1-64 MB), for large I/O (data files, backup files, etc.)
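As a sketch of how the striping type is attached to a file type, assuming an existing disk group named DATA (the disk group name is illustrative; ONLINELOG and DATAFILE are default ASM template names):

SQL> ALTER DISKGROUP data ALTER TEMPLATE onlinelog ATTRIBUTES (FINE);
SQL> ALTER DISKGROUP data ALTER TEMPLATE datafile  ATTRIBUTES (COARSE);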

ASM… cont'd

• Takes advantage of LUNs created from the outer half of the disk, to provide fast transfer rates and to minimize seek overhead

Write:

DB_FILE_DIRECT_IO_COUNT x Oracle block size = 1 MB. When performing tablespace creation, a 1 MB I/O size can be 31% faster than 64 KB.

Read:

DB_FILE_MULTIBLOCK_READ_COUNT x Oracle block size = 1 MB. A full table scan was 62% faster with 1 MB I/O than with 64 KB.
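For illustration, a minimal sketch of how the 1 MB read target maps onto the parameter for a typical 8 KB block size (128 x 8 KB = 1 MB; adjust the value for your block size):

SQL> -- 128 blocks x 8 KB block size = 1 MB multiblock reads
SQL> ALTER SYSTEM SET db_file_multiblock_read_count = 128;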

ASM Performance Benefits

• Automatic I/O tuning for all loads
• Prevention of disk space fragmentation
• Online disk space reorganization (rebalance)
• Data redundancy to provide fault tolerance (up to 3-way mirror)

Reference: http://www.oracle.com/technology/products/database/asm/pdf/take%20the%20guesswork%20out%20of%20db%20tuning%2001-06.pdf

Thanks for your attention!

E-mail: [email protected]

RAC & ASM Forum July 2008. ASM Allocation Units and Extent Sizes. Gadi Chen | System Analyst | Oracle Israel

Agenda

• ASM Allocation Units
• ASM Extent Sizes

ASM Allocation Units (AU)

What is an AU?

• An AU is the minimum segment of disk that can be assigned to a file.

• Each extent consists of a number of AUs.

• Each file consists of a number of extents.

ASM Allocation Units (AU)

When do we set it?

• The AU size is set at disk group creation time.

• It can be set to 1, 2, 4, 8, 16, 32 or 64 MB.

• Larger AU sizes typically provide performance advantages for data warehouse applications that use large sequential reads.

• Default size: 1 MB (_asm_ausize)
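As a quick check, the configured AU size of each disk group can be read from the ASM instance (a sketch; the column value is in bytes):

SQL> SELECT name, allocation_unit_size FROM v$asm_diskgroup;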

ASM Extent Sizes

ASM – Extent Allocation

What is an extent?

• A group of AUs within a file.

ASM – Extent Allocation

• The extent size is dictated by the size of the AU.

• There are only three possible sizes for an extent in any single data file.

Number of extents            Extent size
Less than 20,000             AU
Between 20,000 and 40,000    AU x 8
More than 40,000             AU x 64

Extent Allocation Sample

A 100 GB file with AU = 1 MB:

Without variable extent size:

  # of extents   AU     Total size
  102,400        1 MB   100 GB

With variable extent size:

  # of extents   Extent size     Total size
  20,000         1 MB            20 GB
  20,000         8 MB (1x8)      160 GB
  12,800         64 MB (1x64)    820 GB
  52,800 extents in total

ASM – Extent Allocation

How do we set it?

• So... we don't. It is automatically driven by the architecture.

• 11g new feature: variable extent size.

• Completely automatic.

• Must be enabled at disk group creation.

Implementing Variable Extent Size

CREATE DISKGROUP FTS EXTERNAL REDUNDANCY
  DISK 'ORCL:FTS1' SIZE 102398 M,
       'ORCL:FTS2' SIZE 102398 M,
       'ORCL:FTS3' SIZE 102398 M,
       'ORCL:FTS4' SIZE 102398 M
  ATTRIBUTE 'compatible.rdbms' = '11.1',
            'compatible.asm'   = '11.1',
            'au_size'          = '64M';

Variable Extent Size

• Enables support for larger ASM files.

• Reduces SGA memory requirements for VLDBs.

• Improves performance of file create and open operations.

ASM – Extent Allocation

• ASM extent array sizes with and without Variable Extent Size (VES):

  File size   AU size   Without VES   With VES (11g)   % saved
  100 GB      1 MB      800 KB        237 KB           70%
  100 GB      16 MB     50 KB         50 KB            0%
  100 GB      64 MB     12.5 KB       12.5 KB          0%
  1 TB        1 MB      8192 KB       418 KB           95%
  1 TB        16 MB     512 KB        201 KB           60%
  1 TB        64 MB     128 KB        128 KB           0%
  10 TB       1 MB      80 MB         1570 KB          98%
  10 TB       16 MB     5 MB          370 KB           92%
  10 TB       64 MB     1.2 MB        296 KB           76%
  1 PB        1 MB      8192 MB       128 MB           98%
  1 PB        16 MB     512 MB        8 MB             98%
  1 PB        64 MB     128 MB        2 MB             98%

• The benefit of variable extent size: a smaller SGA.

RAC, ASM Forum July 2008. Alejandro Vargas, Oracle Principal Support Consultant

Near Unlimited Scalability

• 63 disk groups
• 10,000 ASM disks
• 1 million files per disk group
• 4 petabytes per ASM disk
• Maximum database size:
  • External redundancy: 140 PB
  • Normal redundancy: 42 PB
  • High redundancy: 15 PB

ASM Fast Mirror Resync

• Redundancy is re-established in a fraction of the time
• Only changed blocks are resynced

Failure time < DISK_REPAIR_TIME: when the disk becomes accessible again, only the modified extents need to be resynced.

• Benefits:
  • Fast recovery from transient failures
  • Enables proactive maintenance
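As a hedged sketch of how this is driven in practice (the disk group and disk names are illustrative): the repair window is a disk group attribute, and a disk brought back online within it is resynced rather than rebuilt:

SQL> ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '3.6h';
SQL> ALTER DISKGROUP data ONLINE DISK 'DATA_0001';  -- triggers fast resync of changed extents only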

ASM Preferred Mirror Read

• Allows local mirror read operations
• Eliminates network latencies in extended clusters
• Better performance

[Diagram: extended cluster spanning Site A and Site B; each site reads its local mirror copy (primary or secondary) instead of crossing the inter-site link.]
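A sketch of the 11.1 setting that enables this, assuming a disk group DATA with one failure group per site (names are illustrative; each ASM instance lists its own local failure groups):

SQL> -- set on the ASM instance at site A, typically with SID scope per instance
SQL> ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.SITEA';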

New ASM Disk Group Attributes

• The compatibility of each disk group is separately controllable
• The disk group compatibility settings determine which features are enabled

Name              Values                    Description
au_size           1|2|4|8|16|32|64 MB       Size of the allocation units in the disk group
compatible.rdbms  Valid database version    Format of messages exchanged between DB and ASM
compatible.asm    Valid database version    Format of ASM metadata structures on disk
disk_repair_time  0 M to 2^32 D             Length of time before removing a disk once OFFLINE

ASMCMD Extensions

• md_backup / md_restore: back up and restore disk group metadata (user-created directories, templates, disk group compatibility, disk group name, disk names and failure groups)
• remap: re-maps unreadable blocks
• lsdsk: lists ASM disks, with or without the ASM instance running ($ asmcmd help lsdsk)
• cp: copies files locally or across database servers
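For illustration, a sketch of two of these commands (the disk group, file name, and paths are hypothetical):

$ asmcmd lsdsk -k -d DATA
$ asmcmd cp +DATA/orcl/datafile/users.259.651237963 /tmp/users.dbf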

RAC, ASM Forum July 2008. Maximum Availability Architecture. Eyal Druskin, Oracle Support Consultant

Agenda

• What is MAA?
• MAA features and benefits

What is MAA?

• Oracle's best-practices blueprint, based on proven Oracle high-availability technologies and recommendations

Maximum availability, minimum time to recover

What is MAA?

• A 3-tier model:
  • Storage: ASM
  • Server: Clusterware
  • Database: RAC, Flashback, …

MAA Best Practices Benefits

• Recommends operational best practices to minimize downtime
• Reduces the implementation cost of an HA Oracle system by providing detailed configuration guidelines
• Reduces acquisition, management, and support costs for an HA Oracle system
• Reduces the possible costs of downtime by providing best practices to prevent, eliminate, or minimize downtime caused by scheduled and unscheduled outages

Oracle’s HA Solution Set

Integrated with the database, unparalleled in the industry!

Unplanned downtime (system failures, data failures):
• Real Application Clusters
• ASM
• Flashback
• RMAN & Oracle Secure Backup
• Data Guard
• Streams

Planned downtime (system changes, data changes):
• Online reconfiguration
• Rolling upgrades
• Online redefinition

Conclusion

[Diagram: MAA reference topology.
Production site: Real Application Clusters & Oracle Clusterware for fault-tolerant server scale-out and online H/W & S/W upgrades; primary database on Automatic Storage Management for fault-tolerant storage scale-out.
Active standby site: Data Guard / Streams for active-active sites, data protection, and rolling upgrades; online table redefinition; standby database and storage.
Across both: Flashback technologies to move back in time and correct errors; Recovery Manager & Oracle Secure Backup for data protection and archival.]

RAC, ASM Forum July 2008. Performance and Maximum Availability. Nickita Chernovsky, Oracle Senior Support Consultant

Agenda

• What is Data Guard broker?
• What's so good about DG broker?
• Steps to create a configuration
• Things to know and to be aware of

What is Data Guard Broker?

• An Oracle command-line utility for managing a Data Guard configuration.
• No additional features over a normal Data Guard configuration (except for the observer).
• Installed automatically with any DB installation (no additional install required).

Why is Data Guard Broker so good?

• An easy way to manage a standby environment
• A straightforward command interface
• Just provide two identical databases; the broker will do the rest, including parameter setup and standby recovery.

• Switchover and failover: connect only to the standby DB; the broker takes care of the whole role-transition process on both standby and primary.

• Reinstatement: the broker will automatically flash back the destroyed ex-primary DB and reinstate it as a standby.

Simple Steps to Create

• Create standby redo logs on the primary
• Copy the primary database (using RMAN duplicate or an OS copy)
• Create a standby controlfile
• Pre-create the init.ora, specifying on both standby and primary:

    db_unique_name=
    dg_broker_start=true
    local_listener=''
    log_file_name_convert='',''
    db_file_name_convert='',''

• Create an spfile from the pfile on both databases
• Mount the standby

Simple Steps to Create

• Start the dgmgrl utility and connect to the primary:

  $ dgmgrl /

• Create the configuration:

  DGMGRL> CREATE CONFIGURATION mysbconfig AS
            PRIMARY DATABASE IS nick102
            CONNECT IDENTIFIER IS nick102;

• Add the standby database:

  DGMGRL> ADD DATABASE nick102s AS
            CONNECT IDENTIFIER IS nick102s
            MAINTAINED AS PHYSICAL;   -- or LOGICAL

• Enable the configuration:

  DGMGRL> ENABLE CONFIGURATION;

• We are done!

Things to Know

• DG broker is fully ASM and RAC compatible.

• Don't forget the local_listener parameter, or switchover/failover will fail.

• ALWAYS specify standby redo logs.

• Enable Flashback Database for fast reinstatement.

• Reinstatement requires flashback to be enabled, plus a complete shutdown and startup mount of the failed DB.

• Don't use dgmgrl on Oracle versions lower than 10.2.0.3.
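To make the role-transition flow concrete, a sketch using the configuration above (database names as in the earlier example; a failed ex-primary must be mounted before reinstatement):

DGMGRL> SWITCHOVER TO nick102s;
DGMGRL> FAILOVER TO nick102s;          -- if the primary is lost
DGMGRL> REINSTATE DATABASE nick102;    -- flashes back the old primary and makes it a standby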

Practical Performance Management for Oracle RAC

Michael Zoll - RAC Development (Performance), Oracle Corp.

Presented by Saar Maoz, RACPACK, Oracle Corp.

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.

The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Agenda

• Oracle RAC fundamentals and infrastructure
• Common problems and symptoms
• Application and database design
• Diagnostics and problem determination
• Summary: practical performance analysis

OBJECTIVE

• Realize that Oracle RAC performance does not require "black magic"
• General system and SQL analysis and tuning experience is practically sufficient for Oracle RAC
• Problems can be identified with a minimum of metrics and effort
• The diagnostics framework and advisories are efficient

Oracle RAC Architecture

[Diagram: Oracle RAC architecture. Nodes 1..n each run a VIP, service, listener, database instance, ASM, Oracle Clusterware, and the operating system, connected over a public network. All instances share storage managed by ASM: database and control files, plus redo and archive logs for all instances. The OCR and voting disks reside on raw devices.]

Oracle Clusterware

[Diagram: Oracle Clusterware processes. Each node runs CSSD, OPROCD, ONS, CRSD, and EVMD; CSSD runs at real-time priority. The nodes share the OCR and voting disks on raw devices in shared storage and are reached over the public network via their VIPs.]

Under the Covers

[Diagram: under the covers. Instances 1..n on nodes 1..n each hold an SGA containing the Global Resource Directory, dictionary cache, library cache, log buffer, and buffer cache, plus the background processes LMON, LMD0, DIAG, VKTM, LMS0, LGWR, SMON, DBW0, and PMON. The LMON/LMD0/LMS0 processes communicate over the cluster private high-speed network and run at real-time priority. Each instance has its own redo log files; the data files and control files are shared.]

The Evolution of Cache Fusion

Oracle version   Major milestones in cache fusion code
Oracle 8i        CR Server
Oracle 9iR1      Full cache fusion
Oracle 9iR2      Process architecture / code path optimization, static file affinity
Oracle 10gR1     File affinity with Dynamic Remastering (DRM), affinity lock (16-node TPC-C published)
Oracle 10gR2     Object affinity with Dynamic Remastering
Oracle 11gR1     Undo affinity, TX lock affinity, optimizations for reads and read-mostly

Global Cache Service (GCS)

• Manages coherent access to data in the buffer caches of all instances in the cluster
• Minimizes access time to data which is not in the local cache
  • Access to data in the global cache is faster than disk access
• Implements fast direct memory access over high-speed interconnects
  • For all data blocks and types
• Uses an efficient and scalable messaging protocol
  • Never more than 3 hops
  • New optimizations for read-mostly applications

Cache Hierarchy: Data in Remote Cache

[Diagram: local cache miss; data block requested from a remote instance; remote cache hit; data block returned.]

Cache Hierarchy: Data on Disk

[Diagram: local cache miss; data block requested; remote cache miss; grant returned; disk read.]

Cache Hierarchy: Read-Mostly

[Diagram: local cache miss; no message required; disk read.]

11.1 Read-Mostly Improvements

• CPU optimizations for read-intensive operations
• Read-mostly access
  • Message reductions
  • Latency improvements
• Read-only access
  • No messages
  • Direct reads
• Significant gains: 50-70% reductions measured

Read Only Workload (Wait Events)

• RM=OFF

Top 5 Timed Events                                    Avg wait  %Total
Event                            Waits     Time (s)   (ms)      Call Time
-------------------------------- --------- ---------- --------- ---------
db file sequential read          263,952   488        2         29.1
gc cr grant 2-way                254,902   453        2         27.1
gc current block 2-way           159,394   305        2         18.2
CPU time                                   275                  16.4
gc current block 3-way           52,236    125        2         7.5

• RM=ON

Top 5 Timed Events                                    Avg wait  %Total
Event                            Waits     Time (s)   (ms)      Call Time
-------------------------------- --------- ---------- --------- ---------
db file sequential read          905,808   1,343      1         78.9
CPU time                                   338                  19.9
control file parallel write      321       7          21        .4
reliable message                 30        3          96        .2
latch: cache buffers lru chain   1,035     1          1         .1

Read Only Workload (messages / blocks)

• RM=OFF

Global Cache Load Profile         Per Second   Per Transaction
--------------------------------- ------------ ----------------
Global Cache blocks received      645.81       211,826.00
Global Cache blocks served        607.19       199,157.00
GCS/GES messages received         4,613.02     1,513,070.00
GCS/GES messages sent             3,086.02     1,012,213.00
DBWR Fusion writes                0.24         78.00
Estd Interconnect traffic (KB)    4,009.71

• RM=ON

Global Cache Load Profile         Per Second   Per Transaction
--------------------------------- ------------ ----------------
Global Cache blocks received      0.77         254.00
Global Cache blocks served        0.67         221.00
GCS/GES messages received         9.18         3,031.00
GCS/GES messages sent             19.81        6,538.00
DBWR Fusion writes                0.12         38.00
Estd Interconnect traffic (KB)    8.54

Performance of Cache Fusion

[Diagram: timing of a cache fusion block transfer. The requester initiates the send and waits; the ~200-byte request message takes 200 bytes / (1 Gb/sec) on the wire. The LMS process on the holder receives the request, processes the block, and sends it back; an 8K block takes 8192 bytes / (1 Gb/sec). Total access time is e.g. ~360 microseconds (UDP over GbE). Network propagation delay ("wire time") is a minor factor in roundtrip time (approx. 6%, vs. 52% spent in the OS and network stack).]

Fundamentals: Minimum Latency (*), UDP/GBE and RDS/IB

Block size        2K     4K     8K     16K
RT (ms) UDP/GE    0.30   0.31   0.36   0.46
RT (ms) RDS/IB    0.12   0.13   0.16   0.20

(*) Roundtrip; blocks are not "busy", i.e. no log flush and no serialization ("buffer busy").

AWR and Statspack reports show averages as if they were normally distributed; the session wait history, included in Statspack in 10.2 and in AWR in 11g, shows the actual quantities.

The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but they can be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).

Numbers all Database Engineers Should Know and Understand

• Storage: latency in milliseconds (0.001 s); bandwidth in gigabytes per second (1,000,000,000 bytes/s)
• Network components: latency in milliseconds (0.001 s); bandwidth in gigabits per second (1,000,000,000 bits/s)
• Memory: latency in nanoseconds (0.000000001 s) to microseconds (0.000001 s)

Infrastructure: Private Interconnect

• The network between the nodes of a RAC cluster MUST be private
  • Best practice: do not share it with iSCSI storage
• Supported links: GbE, IB (IPoIB: 10.2)
• Supported transport protocols: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
• Large (jumbo) frames recommended for GbE
  • Send an entire block in one transmit
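As a sketch of the jumbo-frames point on Linux (the interface name is illustrative, and the switch must support the larger MTU as well); an 8K block then fits in a single frame:

$ ifconfig eth1 mtu 9000 up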

Infrastructure: Interconnect Bandwidth

• Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
• Typical utilization is approx. 10-30% in OLTP
  • 10,000-12,000 8K blocks per second saturate 1x Gb Ethernet (75-80% of theoretical bandwidth)
• Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP
• DSS/DW systems should be designed with > 1 Gb/sec capacity
• A sizing approach with rules of thumb is described in "Project MegaGrid: Capacity Planning for Large Commodity Clusters" (http://otn.oracle.com/rac)

Infrastructure: IPC configuration

• Important settings:
  • Negotiated top bit rate and full duplex mode
  • NIC ring buffers
  • Ethernet flow control settings
  • CPU(s) receiving network interrupts
• Verify your setup:
  • CVU does checking
  • Load testing eliminates the potential for problems
  • AWR and ADDM give estimations of link utilization
• Buffer overflows, congested links, and flow control can have severe consequences for performance

Infrastructure: Operating System

• Block access latencies increase when CPU(s) are busy and run queues are long
  • Immediate LMS scheduling is critical for predictable block access latencies when CPUs are > 80% busy
• Fewer and busier LMS processes may be more efficient
  • Monitor their CPU utilization
  • Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
  • The default is good for most requirements
• Higher priority for LMS is the default
  • The implementation is platform-specific


Common Problems and Symptoms

• "Lost blocks": interconnect or switch problems
• Slow or bottlenecked disks
• System load and scheduling
• Contention
• Unexpectedly high latencies

A Misconfigured or Faulty Interconnect Can Cause:

• Dropped packets/fragments
• Buffer overflows
• Packet reassembly failures or timeouts
• Ethernet flow control kicking in
• TX/RX errors

"Lost blocks" at the RDBMS level are responsible for 64% of escalations.

“Lost Blocks”: NIC Receive Errors

Db_block_size = 8K

$ ifconfig -a
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
      …

“Lost Blocks”: IP Packet Reassembly Failures

$ netstat -s
Ip:
    84884742 total packets received
    …
    1201 fragments dropped after timeout
    3384 packet reassembles failed

Finding a Problem with the Interconnect or IPC

Top 5 Timed Events                                  Avg wait  %Total
Event              Waits     Time(s)   (ms)    Call Time  Wait Class
------------------ --------- --------- ------- ---------- ----------
log file sync      286,038   49,872    174     41.7       Commit
gc buffer busy     177,315   29,021    164     24.3       Cluster
gc cr block busy   110,348   5,703     52      4.8        Cluster
gc cr block lost   4,272     4,953     1159    4.1        Cluster  <- should never be here
cr request retry   6,316     4,668     739     3.9        Other    <- should never be here

Global Cache Lost block handling

• Detection time reduced in 11g
  • 500 ms (around 5 seconds in 10g)
  • Can be lowered if necessary
  • Robust (no false positives)
  • No extra overhead

• "cr request retry" is an event related to lost blocks
  • It is highly likely to appear when "gc cr block lost" shows up

Interconnect Statistics

Automatic Workload Repository (AWR):

Target     Avg Latency   Stddev     Avg Latency   Stddev
Instance   500B msg      500B msg   8K msg        8K msg
---------- ------------- ---------- ------------- --------
1          .79           .65        1.04          1.06
2          .75           .57        .95           .78
3          .55           .59        .53           .59
4          1.59          3.16       1.46          1.82

• Latency probes for different message sizes
• Exact throughput measurements (not shown)
• Send and receive errors, dropped packets (not shown)

“Lost Blocks”: Solution

• Fix interconnect NICs and switches (firmware, cables, etc.)
• Tune IPC buffer sizes
• Ensure enough OS resources are available
  • A spinning process can consume all network ports
• Disable any firewall on the interconnect

Disk IO Performance Issues

• Log flush IO delays can cause "busy" buffers
• "Bad" queries on one node can saturate an interconnect link
• IO is issued from ALL nodes to shared storage
• Use the Automatic Database Diagnostic Monitor (ADDM)/AWR for a single system image of I/O across the cluster
• Cluster-wide impact of IO or query-plan issues: responsible for 23% of escalations

Cluster-Wide I/O Impact

Node 1:

Top 5 Timed Events                                Avg wait  %Total
Event              Waits     Time(s)   (ms)       Call Time
------------------ --------- --------- ---------- ---------
log file sync      286,038   49,872    174        41.7
gc buffer busy     177,315   29,021    164        24.3
gc cr block busy   110,348   5,703     52         4.8

Node 2 (running an expensive query):

Load Profile       Per Second
------------------ ------------
Redo size          40,982.21
Logical reads      81,652.41
Physical reads     51,193.37

1. IO on the disk group containing the redo logs is bottlenecked
2. Block shipping for "hot" blocks is delayed by log flush IO
3. Serialization/queues build up

Drill-down: An IO capacity problem

Top 5 Timed Events                                        Avg wait  %Total
Event                      Waits       Time(s)   (ms)     Call Time  Wait Class
-------------------------- ----------- --------- -------- ---------- ----------
db file scattered read     3,747,683   368,301   98       33.3       User I/O
gc buffer busy             3,376,228   233,632   69       21.1       Cluster
db file parallel read      1,552,284   225,218   145      20.4       User I/O
gc cr multi block request  35,588,800  101,888   3        9.2        Cluster
read by other session      1,263,599   82,915    66       7.5        User I/O

A symptom of full table scans and I/O contention.

IO and/or Bad SQL problem fixed

Top 5 Timed Events                                       Avg wait  %Total
Event                    Waits     Time (s)   (ms)       Call Time  Wait Class
------------------------ --------- ---------- ---------- ---------- ----------
CPU time                           4,580                 65.4
log file sync            276,281   1,501      5          21.4       Commit
log file parallel write  298,045   923        3          13.2       System I/O
gc current block 3-way   605,628   631        1          9.0        Cluster
gc cr block 3-way        514,218   533        1          7.6        Cluster

1. Log file writes are normal
2. Global serialization has disappeared

IO issues: Solution

• Tune the IO layout
• Tune queries consuming a lot of IO

CPU Saturation or Long Run Queues

Top 5 Timed Events                                         Avg wait  %Total
Event                       Waits      Time(s)   (ms)      Call Time  Wait Class
--------------------------- ---------- --------- --------- ---------- ----------
db file sequential read     1,312,840  21,590    16        21.8       User I/O
gc current block congested  275,004    21,054    77        21.3       Cluster
gc cr grant congested       177,044    13,495    76        13.6       Cluster
gc current block 2-way      1,192,113  9,931     8         10.0       Cluster
gc cr block congested       85,975     8,917     104       9.0        Cluster

"Congested": LMS could not dequeue messages fast enough.
Cause: long run queue, CPU starvation.

High CPU Load: Solution

• Run LMS at higher priority (the default)
• Start more LMS processes
  • Never use more LMS processes than CPUs
• Reduce the number of user processes
• Find the cause of the high CPU consumption

Contention

Event                   Waits     Time (s)   AVG (ms)   % Call Time
----------------------- --------- ---------- ---------- -----------
gc cr block 2-way       317,062   5,767      18         19.0
gc current block 2-way  201,663   4,063      20         13.4
gc buffer busy          111,372   3,970      36         13.1
CPU time                          2,938                 9.7
gc cr block busy        40,688    1,670      41         5.5

Serialization: global contention on data.

It is very likely that "gc cr block busy" and "gc buffer busy" are related.

Contention: Solution

• Identify "hot" blocks in the application
• Reduce concurrency on hot blocks

High Latencies

Event                   Waits     Time (s)   AVG (ms)   % Call Time
----------------------- --------- ---------- ---------- -----------
gc cr block 2-way       317,062   5,767      18         19.0
gc current block 2-way  201,663   4,063      20         13.4
gc buffer busy          111,372   3,970      36         13.1
CPU time                          2,938                 9.7
gc cr block busy        40,688    1,670      41         5.5

Expected: to see 2-way and 3-way events.
Unexpected: to see averages > 1 ms (AVG ms should be around 1 ms).
Tackle latency first, then tackle the busy events.

High Latencies : Solution

• Check the network configuration
  • Private
  • Running at the expected bit rate
• Find the cause of any high CPU consumption
  • Runaway or spinning processes

Health Check

Look for:

• Unexpected events: gc cr block lost (e.g. 1159 ms)
• Unexpected "hints":
  • Contention and serialization: gc cr/current block busy (e.g. 52 ms)
  • Load and scheduling: gc current block congested (e.g. 14 ms)
• Unexpectedly high averages: gc cr/current block 2-way (e.g. 36 ms)

Application and Database Design

General Principles

• No fundamentally different design and coding practices for RAC
• BUT: flaws in execution or design have a higher impact in RAC
  • Performance and scalability in RAC are more sensitive to bad plans or bad schema design
  • Serializing contention makes applications less scalable
• Standard SQL and schema tuning solves > 80% of performance problems

Scalability Pitfalls

• Serializing contention on a small set of data/index blocks
  • Monotonically increasing keys
  • Frequent updates of small cached tables
  • Segments without Automatic Segment Space Management (ASSM) or Free List Groups (FLG)
• Full table scans
  • Optimization for full scans in 11g can save CPU and latency
• Frequent invalidation and parsing of cursors
  • Requires data dictionary lookups and synchronization
• Concurrent DDL (e.g. truncate/drop)

Health Check

Look for:

• Indexes with right-growing characteristics
  • Eliminate indexes which are not needed
• Frequent updates and reads of "small" tables
  • "Small" = fits into a single buffer cache
  • Sparse blocks (PCTFREE 99) will reduce serialization
• SQL which scans large amounts of data
  • Perhaps more efficient when parallelized
  • Direct reads do not need to be globally synchronized (hence less CPU for the global cache)

Diagnostics and Problem Determination

Most of the time, a performance problem is NOT an Oracle RAC problem.

Checklist for the Skeptical Performance Analyst ( AWR based )

1. Check where most of the time in the database is spent ("Top 5")
2. Check whether gc events are "busy" or "congested"
3. Check the average wait times
4. Drill down:
   • SQL with the highest cluster wait time
   • Segment statistics with the highest block transfers

…or JUST USE ADDM with Oracle RAC 11g!

Drill-down: An IO capacity problem

Top 5 Timed Events                                        Avg wait  %Total
Event                      Waits       Time(s)   (ms)     Call Time  Wait Class
-------------------------- ----------- --------- -------- ---------- ----------
db file scattered read     3,747,683   368,301   98       33.3       User I/O
gc buffer busy             3,376,228   233,632   69       21.1       Cluster
db file parallel read      1,552,284   225,218   145      20.4       User I/O
gc cr multi block request  35,588,800  101,888   3        9.2        Cluster
read by other session      1,263,599   82,915    66       7.5        User I/O

A symptom of full table scans and I/O contention.

Drill-down: SQL Statements

"Culprit": a query that overwhelms the IO subsystem on one node:

Physical Reads   Executions   Reads per Exec   %Total
---------------- ------------ ---------------- -------
182,977,469      1,055        173,438.4        99.3

SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC

The same query reads from the interconnect:

Cluster Wait   CWT % of      CPU
Time (s)       Elapsd Time   Time (s)    Executions
-------------- ------------- ----------- -----------
341,080.54     31.2          17,495.38   1,055

SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC

Drill-Down: Top Segments

Tablespace   Object     Subobject   Obj     GC Buffer   % of
Name         Name       Name        Type    Busy        Capture
------------ ---------- ----------- ------- ----------- --------
ESSMLTBL     ES_SHELL   SYS_P537    TABLE   311,966     9.91
ESSMLTBL     ES_SHELL   SYS_P538    TABLE   277,035     8.80
ESSMLTBL     ES_SHELL   SYS_P527    TABLE   239,294     7.60

Apart from being the table with the highest IO demand, it was also the table with the highest number of block transfers AND global serialization.

Findings Summary in EM

• Each finding type has a descriptive name
• Facilitates search / aggregation / directives, etc.

Recommendations

• Most relevant data for analysis can be derived from the wait events

• Always use Enterprise Manager (EM) and ADDM reports for performance health checks and analysis

• Active Session History (ASH) can be used for session-based analysis of variation

• Export the AWR repository regularly to save all of the above

ADDM Diagnosis for RAC

• Data sources are:
  • Wait events (especially the Cluster class and buffer busy)
  • ASH
  • Instance cache transfer data
  • Interconnect statistics (throughput, usage by component, pings)

• ADDM analyzes both the entire database (DATABASE analysis mode) and each instance (INSTANCE analysis mode)

• Analysis of both database and instance resources is summarized in a single report
  • Allows drill-down to a specific instance

What ADDM Diagnoses for RAC

• Latency problems in the interconnect
• Congestion (identifying the top instances affecting the entire cluster)
• Contention (buffer busy, top objects, etc.)
• Top consumers of multiblock requests
• Lost blocks
• Reports information about interconnect devices; warns about using PUBLIC interfaces
• Reports the throughput of devices, and how much of it is used by Oracle and for what purpose (GC, locks, PQ)

RAC, ASM Forum July 2008. Building Up a 50 TB Data Warehouse on ASM. Hüsnü Şensoy, Turkcell Telecommunication Services, VLDB Specialist

Agenda

• Base Floor
  • A brief introduction to our architecture
  • Disk storage subsystem
  • 10 GbE for SQL*Loader performance
• Administration & Maintenance
  • Backup & Recovery
  • Data Pump
  • Table Compression
  • Bigfile Tablespaces
  • Resumable Events
  • Grid Control

Base Floor

Never try to tune the Oracle server before ensuring the hardware and OS have been tuned.

A Brief Introduction to Our Architecture

Initial Storage Configuration

• EMC Symmetrix DMX-4 cabinet
• RAID 5
• Many 15K RPM 146 GB disks (350+)
• Raw devices
• PowerPath as the multipathing tool
• Share-everything meta volume layout

Let's Come Together for a Solution…

• "Have you checked the Oracle server parameter X?"
• "I agree. Oracle is not a file system vendor."
• "You have to buy more disks to sustain the desired throughput."
• "ASM is too young for such a big system."

What was missing in the picture? Numbers.

How to Find the Source of the Problem?

• I am not a person capable of writing AWK, shell, Perl, etc. scripts.

• I am not an expert on the Solaris I/O architecture.

• There are tools, but they do not generate Oracle I/O (dd is the most well-known one).

Everybody was blaming Oracle and ASM. I was aware that the problem was in the hardware, but I needed proof.

ORION

• ORION is an advanced tool capable of emulating Oracle I/O by using the same native OS I/O libraries that Oracle itself uses.

• ORION allows you to define the I/O characteristics:
  • Random, sequential, or mixed
  • Array cache size
  • Read % + write %

• ORION will give you the necessary proof if the problem is related to your hardware.
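For illustration, a minimal sketch of a run (the test name and LUN list are hypothetical; the .lun file lists the raw devices to exercise):

$ cat mytest.lun
/dev/rdsk/c1t1d0s6
/dev/rdsk/c1t2d0s6
$ ./orion -run simple -testname mytest -num_disks 2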

Sample ORION Outputs

[Chart: throughput for 1 MB sequential reads vs. the number of sequential request streams.]

Sample ORION Outputs

[Chart: throughput for 32 KB random reads vs. the number of outstanding I/Os.]

Final Storage Configuration

• EMC Symmetrix DMX-4 cabinet
• RAID 5
• Many 15K RPM 146 GB disks (350+)
• Raw devices
• PowerPath as the multipathing tool
• Share-nothing meta volume layout

Next Step was AwSoMe

• After solving the problem in the cabinet, ASM was ready to use.

• We rely on external redundancy, handled by the storage system.

• Our UNIX administrator labeled ~900 GB raw devices with EMC PowerPath.

• We modified the ASM instance parameters advised in Metalink Note 368055.1:
  • _asm_ausize
  • _asm_stripesize

• The ASM instance has never crashed since the first deployment (more than eight months).

ASM is the AweSoMe implementation of the S.A.M.E. architecture.

ASM & the New Turkcell DWH

• The very first thing we did on ASM was to test and learn the recoverability of a database running on ASM under various crash scenarios. We documented it and shared it with our colleagues and managers in Turkcell.

• Compared with our previous file system vendor, ASM is 10-15% better in various database operations.

• ASM has totally removed the need for:
  • Manual datafile distribution
  • File system boundary prevention
  • Hot spot management

• ASM allows us to take control of the I/O subsystem of our database.

10 GbE for SQL*Loader Performance

• There is a requirement for a ~1 TB/h data transfer rate from the ETL server to the database server using SQL*Loader. With a standard Gigabit network, only ~250 MB/h was achieved.

• Proposed solutions:
  • Clustered file system
  • Trunking
  • 10 GbE

• This implementation requires modifications to:
  • The Solaris 10 TCP stack
  • The Session Data Unit (SDU)

Yet Another Hardware Configuration Problem

• In this implementation everybody was expecting a transfer rate of 1 GB/s (10 GbE).

• But again, before tuning any Oracle parameter, we first decided to test the network at the OS level.

• We again needed an ORION-like tool, this time for network throughput testing. Recently, the Solaris performance team released a tool for this purpose called uperf.

• uperf showed that it is impossible to achieve 1 GB/s. It is as simple as ORION and guides as much as ORION.

• For more details on the issue, refer to the "High Speed Network Implementation for SQL*Loader Performance" article at http://husnusensoy.wordpress.com

Moral of the Story

• It is true that hardware is a very complex issue and requires specialization. Nobody can expect you to configure a cabinet or a switch. But:
  • You can measure what you have.
  • You should know the limits of your hardware and OS configuration.

And: never try to tune the Oracle server before ensuring the hardware and OS have been tuned.

The less you spend on administration, the more Oracle documentation you can read…

Administration & Maintenance

Our Backup & Recovery Strategy

• Incrementally updated backup is used as the daily backup strategy.
  • It takes only 15 minutes to 1 hour per day, depending on the amount of data change.

• RMAN's "check logical" option is enabled during backup to minimize the risk of corrupted backups.

• For archiving purposes we use tape backup, with which we can achieve a 3-4 TB/hour backup speed.

• Our backups are scheduled through the Grid Agent.

• For SOX compliance we restore the whole database to another machine every 3-4 months.
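For illustration, a minimal sketch of the incrementally-updated-backup pattern with logical checking (the tag name is illustrative):

RMAN> RUN {
  # roll the previous level 1 into the image copy, then take today's level 1
  RECOVER COPY OF DATABASE WITH TAG 'daily_incr';
  BACKUP CHECK LOGICAL INCREMENTAL LEVEL 1
    FOR RECOVER OF COPY WITH TAG 'daily_incr' DATABASE;
}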

Incremental Backup in 10g

• I've heard many administrators complain that "RMAN incremental backup reduces the storage space but not the backup time."

• Others say: "For incremental backup performance you need to put your static data into read-only tablespaces."

• The Block Change Tracking feature solves these problems. As far as we've tested and observed:
  • An incremental backup strategy significantly reduces the backup duration.
  • Read-only tablespaces have a very small impact on backup speed; 10g's genuine incremental backup technology does the job.
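A sketch of enabling and verifying the feature (the tracking-file destination is illustrative):

SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '+DATA';
SQL> SELECT status, filename FROM v$block_change_tracking;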

Bugs & Advice

• Bugs:
  • Slow catalog resynchronization bug (5219484). A patch has been released for 10.2.0.4 and backported to 10.2.0.3.
  • A bug on bigfile tablespace datafiles > 4 TB even prevents running "show all;" in RMAN. Patch 5448714.

• Advice:
  • Don't schedule the backup of bigfile tablespaces together with smallfile tablespaces.
  • Do you need to back up all tablespaces? (Scratch, index, etc. tablespaces.) Use the EXCLUDE TABLESPACE option of RMAN, sketched below.
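A sketch of excluding a rebuildable tablespace from whole-database backups (the tablespace name is hypothetical):

RMAN> CONFIGURE EXCLUDE FOR TABLESPACE scratch_ts;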

Data Pump

• Throughout the project we have used Data Pump for various purposes:
  • Creating copies of the production database for testing, development, etc.
  • Metadata backup of our databases (a supplementary backup strategy)

• If your project team is not experienced with RMAN, Data Pump is the best option for partial/full data transfer.

• We have cloned a 4+ TB database in ~6 hours, including unload, FTP, and load.

• Be careful about the TM lock bug related to EXCLUDE=METADATA in 10.2.0.3.

• Be careful about carrying data over the network:
  • Loading over the network requires a stable and high-performance infrastructure.
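For illustration, a hedged sketch of a metadata-only export and a network-mode import (the directory, dump file, and database link names are hypothetical):

$ expdp system FULL=y CONTENT=metadata_only DIRECTORY=dp_dir DUMPFILE=meta.dmp
$ impdp system FULL=y NETWORK_LINK=prod_db DIRECTORY=dp_dir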

Oracle Table Compression

• Turkcell has a disk pool of ~2.1 PB.

• As a result, an aggressive ILM policy is required in the data warehouse, which is the most disk-demanding department of Turkcell.

• Oracle table compression has a significant importance in this sense.

• Oracle achieves a compression ratio of approximately 1:7 for our data.

• It has reduced the space used by our enterprise data warehouse from 4.5 TB to 1.2 TB.

• Bug 5386204 is an ASSM bug related to compressed tables. There is already a patch for the bug.
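A sketch of applying 10g-style compression to warehouse data (table names are hypothetical; in 10g this block-level compression takes effect for data written via direct-path operations such as CTAS or a MOVE):

SQL> CREATE TABLE cdr_2008 COMPRESS AS SELECT * FROM cdr_stage;  -- compressed on direct-path load
SQL> ALTER TABLE cdr_archive MOVE COMPRESS;                      -- compress an existing table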

Compression in a VLDB leads to…

• Faster backups
• Faster queries:
  • Higher CPU time
  • Less I/O time
  • Usually less elapsed time
• Less disk consumption
• Easier database management
• Better Information Lifecycle Management:
  • Easier to secure your data
  • Easier to decide whether information is valuable or not

Bigfile Tablespaces

• In our previous database we had 6500+ datafiles.

• As the size of tablespaces increases, we continuously add more and more datafiles to them.

• In 10g we have implemented every tablespace whose size is difficult to estimate as a bigfile tablespace:
  • Scratch tablespaces
  • Temporary tablespaces
  • Undo tablespaces
  • Index tablespaces

• Since we are using ASM we have no "file system full" problem, so the practical size of a single datafile is defined by the number of disks we have.

• We have not observed a positive or negative performance impact from bigfile tablespaces.
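A sketch of creating one (the tablespace name and sizes are hypothetical); a bigfile tablespace has exactly one datafile, which can simply be resized instead of adding more files:

SQL> CREATE BIGFILE TABLESPACE dwh_data DATAFILE '+DATA' SIZE 2T;
SQL> ALTER TABLESPACE dwh_data RESIZE 4T;  -- grow the single datafile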

Resumable Operations

• Data load operations were another nightmare for our ETL administrators.

• Assume that an ETL job running for six hours has crashed due to a full permanent/temporary tablespace.

• Resumable operations have been around since 9i, but prior to 10g they required either a code modification or a logon trigger.

• In 10g, the RESUMABLE_TIMEOUT system parameter allows us to set the suspend timeout at the system level, as sketched below.
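A sketch (the timeout value is illustrative): a session hitting a space error then suspends for up to the timeout instead of failing, giving the DBA time to add space:

SQL> ALTER SYSTEM SET resumable_timeout = 7200;  -- suspend sessions for up to 2 hours on space errors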

Grid Control

• With the arrival of new databases, the total number of databases we administer has reached four. Grid Control is the single point for dealing with them all.

• I have been using Oracle for four years, but I still usually forget the syntax of many commands.

Grid Control

• We have scheduled:
  • Object statistics collection jobs
  • Daily backups
  • Resource plan switching jobs
  • ADDM advised tasks

• We are monitoring:
  • Backup reports
  • System availability
  • ASH graphs
  • AWR reports
  • ADDM reports

• We:
  • Manage ASM
  • Create/drop/modify database objects

Moral of the Story

• Backup & recovery is much easier in 10g.

• Being easy doesn't mean all is done by Oracle. You need to secure your backup and ensure that it will work when it is required.

• Data Pump is as easy as the good old EXP/IMP. Moreover, it is parallel and more flexible.

• Every VLDB has some data for online archiving. Oracle table compression is totally transparent to the application and effective for efficiently storing that data.

• Bigfile tablespaces can remove the burden of dealing with many datafiles in your VLDB.

• Resumable operations at the database level will let you sleep better at night.

• If all those are the soldiers, Grid Control is the commander, and every army needs a well-qualified commander.

And: the less you spend on administration, the more Oracle documentation you can read.

Thanks for your attention!