PPT, 1.45Mb - UK Oracle User Group

Download Report

Transcript PPT, 1.45Mb - UK Oracle User Group

A Presentation
Of Sorts
Julian Dyke
Independent Consultant
OUG Scotland Conference
June 2013
1
© 2013 Julian Dyke
juliandyke.com
Motivation 1

Site A - Exadata v2 Half Rack
SELECT /*+ FULL (t1) / COUNT(*) FROM t1

takes 5 seconds
CREATE TABLE jmd1 PARALLEL 8 AS SELECT * FROM t1

takes 336 seconds
Analytic function creates
CREATE TABLE jmd2 PARALLEL 8 AS
temporary segment to perform
SELECT c1, c2, c3 … c35
the sort operation. Temporary
CASE WHEN rank = 1
segment performance is
THEN TO_DATE (‘01-JAN-1990’,’DD-MON-YYYY’)
relatively poor on Exadata
ELSE c36
compared with cell smart scans
END
FROM
(
SELECT c1,c2,c3,…,c35 RANK() OVER (PARTITION BY c1 ORDER BY p36) rank,
NVL((LEAD(c36) OVER (PARTITION BY c1 ORDER BY c36)),
TO_DATE( '31/12/9999','DD/MM/YYYY' )
)

2
takes 722 seconds
© 2013 Julian Dyke
juliandyke.com
Motivation 2

Site B – Packaged application
DESC t2
C1
C2
C3
C4
C5
C6
C7
C8
C9
C10
C11

NUMBER
DATE
NUMBER
LONG RAW
DATE
DATE
NUMBER
DATE
DATE
DATE
DATE
17,000 byte LONG RAW is
included in the sort data.
Table contains millions of rows
so sort requires a huge
temporary sort segment
c4 is always 17,000 bytes
SELECT c1, c2, c3, c4, c5, c6, c7,c8,c9,c10,c11 FROM t2 ORDER BY c1;

3
Takes days or weeks to complete
© 2013 Julian Dyke
juliandyke.com
How Sorts Work

Sorts appear to involve a two stage process


1 - Sort in local process (session heap) memory
If there is not sufficient space

2 – Spill to temporary segment
If sorts run in local memory we probably don’t need
to tune them.
If sorts spill to the temporary tablespace this
generates additional I/O and might need tuning.
4
© 2013 Julian Dyke
juliandyke.com
Test Environment


5
This presentation was developed in the following environment:

Oracle VirtualBox 4.2.0
 Base Memory = 2GB

Oracle Enterprise Linux 5 Update 6

Oracle Enterprise Edition 11.2.0.3

Automatic Memory Management
 MEMORY_TARGET = 800M
Instance was restarted for each test
 Ensures shared memory is initialized
© 2013 Julian Dyke
juliandyke.com
Sample Table

Initial tests were based on the following table:
CREATE TABLE p1
(
c1 VARCHAR2 (10) PRIMARY KEY,
c2 NUMBER,
c3 VARCHAR2 (100),
c4 VARCHAR2 (200)
c5 VARCHAR2 (300),
c6 NUMBER
)
TABLESPACE ts01;
6
© 2013 Julian Dyke
juliandyke.com
REVERSE Function

To ensure sorted data was in a different order to the table data, I decided to
REVERSE the primary key

However, the REVERSE function is undocumented
 It works in SQL
 It does not work in PL/SQL

Workaround
 Create a function called MY_REVERSE
 Cast VARCHAR to RAW
 Use UTL_RAW.REVERSE function
CREATE OR REPLACE FUNCTION my_reverse (p_str IN VARCHAR2)
RETURN VARCHAR2 IS
BEGIN
RETURN UTL_RAW.CAST_TO_VARCHAR2
(UTL_RAW.REVERSE (UTL_RAW.CAST_TO_RAW (p_str)));
END;
/
7
© 2013 Julian Dyke
juliandyke.com
Sample Data

The sample table was populated as follows:
DECLARE
l_c1 VARCHAR2(10);
l_c2 NUMBER;
l_c3 VARCHAR2(100) := LPAD ('X',100,'X')
Fixed Length
l_c4 VARCHAR2(200) := LPAD ('Y',200,'Y');
l_c5 VARCHAR2(300) := LPAD ('Z',300,'Z');
l_c6 NUMBER;
100,000 rows
BEGIN
FOR f IN 1..100000
REVERSE function
LOOP
l_c1 := MY_REVERSE (TO_CHAR (100000 + f));
l_c2 := MOD (f,10);
l_c6 := MOD (f,100);
INSERT INTO p1 VALUE (l_c1,l_c2,l_c3,l_c4,l_c5,l_c6);
END LOOP;
END;
I now realise there are lots of flaws
/

Gather optimizer statistics using:
in the above, but it was sufficient
to make some early progress
EXECUTE dbms_stats.gather_table_stats (ownname=>USER,tabname=>'P1', estimate_percent=>NULL,cascade=>TRUE);
8
© 2013 Julian Dyke
juliandyke.com
Sample Statement

The following statements were executed in a new SQL*Plus session:
SELECT sid FROM v$mystat WHERE ROWNUM < 2;
SID
42
SELECT value FROM v$diag_info
WHERE name = 'Default Trace File';
SID = 42
OS PID = 3407
VALUE
/u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3407.trc
EXECUTE dbms_monitor.session_trace_enable;
SET PAUSE ON
SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;
EXECUTE dbms_monitor.session_trace_disable;

9
SET PAUSE ON prevents all rows being returned immediately
 Allows sort areas to be inspected while operation is ongoing
© 2013 Julian Dyke
juliandyke.com
Execution Plan

Execution plan from the sample statement is:
SQL> SET AUTOTRACE TRACE EXPLAIN
SQL> SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;
Execution Plan
---------------------------------------------------------Plan hash value: 101931517
----------------------------------------------------------------------------------| Id | Operation
| Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time
|
----------------------------------------------------------------------------------|
0 | SELECT STATEMENT
|
|
100K|
58M|
| 15686
(1)| 00:03:09 |
|
1 | SORT ORDER BY
|
|
100K|
58M|
60M| 15686
(1)| 00:03:09 |
|
2 |
TABLE ACCESS FULL| P1
|
100K|
58M|
| 2741
(1)| 00:00:33 |
----------------------------------------------------------------------------------
SQL_ID is
60M / 100K = 600 bytes/row
SELECT sql_id FROM v$session WHERE sid = 42;
SQL_ID
------------9y1u8zmhjyw11
10
© 2013 Julian Dyke
juliandyke.com
Dynamic Performance Views
11

PGA usage is summarized in
 V$PGASTAT

Sort operations are described in a number of dynamic performance views
including:
 V$SQL_WORKAREA
 V$SQL_WORKAREA_ACTIVE
 V$SORT_USAGE
 V$SORT_SEGMENT

Temporary tablespaces and their extents are described in:
 V$TEMP_SPACE_HEADER
 V$TEMP_EXTENT_POOL
 V$TEMP_EXTENT_MAP
© 2013 Julian Dyke
juliandyke.com
V$PGASTAT

Reports various PGA statistics
SELECT name, value FROM v$pgastat;
Name
PGA memory freed back to OS
128188416
aggregate PGA auto target
153713664
aggregate PGA target parameter
293601280
bytes processed
cache hit percentage
extra bytes read/written
64159744
100
0
global memory bound
58720256
max processes count
36
maximum PGA allocated
maximum PGA used for auto workareas
maximum PGA used for manual workareas
12
Value
© 2013 Julian Dyke
207040512
58753024
0
juliandyke.com
V$PGASTAT

Continued:
SELECT name, value FROM v$pgastat;
Name
over allocation count
process count
recompute count (total)
0
32
1630
total PGA allocated
148678656
total PGA inuse
123791360
total PGA used for auto workareas
total PGA used for manual workareas
total freeable PGA memory
13
Value
© 2013 Julian Dyke
117760
0
14417920
juliandyke.com
Work Areas
14

Child cursor includes one work area for each sort operation
 Each child cursor may have zero or more sort operations
 Reported in V$SQL_WORKAREA

A work area is active if a child cursor owning that work area is currently
executing
 Reported in V$SQL_WORKAREA_ACTIVE
© 2013 Julian Dyke
juliandyke.com
V$SQL_WORKAREA

Returns one work area for each sort operation in the library cache
SELECT * FROM v$sql_workarea WHERE sql_id = ‘9y1u8zmhjyw11’;
15
Column
Value
ADDRESS
000000008AFBBA30
HASH_VALUE
3776933921
SQL_ID
9y1u8zmhjyw11
CHILD_NUMBER
0
WORKAREA_ADDRESS
000000008A6167B8
OPERATION_TYPE
SORT (v2)
OPERATION_ID
1
Column
Value
POLICY
AUTO
TOTAL_EXECUTIONS
0
ESTIMATED_OPTIMAL_SIZE
36852736
OPTIMAL_EXECUTIONS
0
ESTIMATED_ONEPASS_SIZE
2166784
ONEPASS_EXECUTIONS
0
LAST_MEMORY_USED
0
MULTIPASSES_EXECUTIONS
0
LAST_EXECUTION
OPTIMAL
ACTIVE_TIME
0
LAST_DEGREE
1
MAX_TEMPSEG_SIZE
0
LAST_TEMPSEG_SIZE
0
© 2013 Julian Dyke
Operation Type = SORT (V2)
juliandyke.com
Work Area Operation Types

Each work area has an operation type. Known operation types include:
Type
Name
Type
Name
1
GROUP BY (SORT)
11
BITMAP MERGE
2
IDX MAINTENANCE (SORT)
12
LOAD WRITE BUFFERS
3
WINDOW (SORT)
13
SPREADSHEET
4
ROLLUP (SORT)
14
FIC LOAD SNAKES
5
BUFFER
15
FIC LOAD TRANSACTIONS
6
CONNECT-BY (SORT)
16
FIC GENERATE CANDIDATE
7
SORT (V1)
17
FIC TREE COUNTING
8
SORT (V2)
18
FIC BITMAP ANDING
9
HASH-JOIN
19
DST BITMAP COUNTING
10
BITMAP CONSTRUCTION
20
GROUP BY (HASH)




16
© 2013 Julian Dyke
WINDOW (SORT) is used in analytic queries
SPREADSHEET is used by the Model Clause
FIC is Frequent Itemset Counting
DST is Decision Tree
juliandyke.com
V$SQL_WORKAREA_ACTIVE

Returns one row for each currently executing work area
SELECT * FROM v$sql_workarea_active WHERE sql_id = ‘9y1u8zmhjyw11’;
Column
Value
SQL_HASH_VALUE
3776933921
SQL_ID
3776933921
SQL_EXEC_START
08-JUN-13
SQL_EXEC_ID
16777216
WORKAREA_ADDRESS
000000008A6167B8
OPERATION_TYPE
SORT (v2)
OPERATION_ID
1
POLICY
AUTO
SID
42
QCINST_ID
QCSID
17
ACTIVE_TIME
223831235
WORK_AREA_SIZE
1245184
EXPECTED_SIZE
1245184
© 2013 Julian Dyke
Column
Value
ACTUAL_MEM_USED
117760
MAX_MEM_USED
58753024
NUMBER_PASSES
1
TEMPSEG_SIZE
63963136
TABLESPACE
TEMP
SEGRFNO#
1
SEGBLK#
67072
juliandyke.com
V$SORT_USAGE

Returns one row for each currently executing work area
SELECT * FROM v$sort_usage;
18
Column
Value
USERNAME
US01
USER
US01
SESSION_ADDR
00000000917A1E10
V$SESSION.SADDR
SESSION_NUM
11
V$SESSION.SERIAL#
SQLADDR
000000008AFBF7F0
V$SESSION.PREV_SQL_ADDR
SQLHASH
774405238
V$SESSION.PREV_HASH_VALUE
SQL_ID
grh681wr2hz3q
V$SESSION.PREV_SQL_ID
TABLESPACE
TEMP
CONTENTS
TEMPORARY
SEGTYPE
SORT
SEGFILE#
201
SEGBLK
67072
EXTENTS
61
BLOCKS
7808
SEGRFNO#
1
© 2013 Julian Dyke
Description
juliandyke.com
V$SORT_USAGE
19

Returns list of statements using sort areas

Very misleading
 Reports previous SQL statement executed by session
 NOT current SQL statement performing sort

Reports address of parent cursor
 Does not report child address or child number
 Useless for multiple child cursors

Reports absolute file number as 201 and above
 OK but inconsistent with other views
© 2013 Julian Dyke
juliandyke.com
V$SORT_SEGMENT

Returns one row for each currently executing work area
SELECT * FROM v$sort_segment;
20
Column
Value
TABLESPACE_NAME
TEMP
SEGMENT_FILE
0
SEGMENT_BLOCK
0
EXTENT_SIZE
128
CURRENT_USERS
1
TOTAL_EXTENTS
524
TOTAL_BLOCKS
67072
USED_EXTENTS
61
USED_BLOCKS
7808
FREE_EXTENTS
463
FREE_BLOCKS
59264
ADDED_EXTENTS
0
EXTENT_HITS
61
FREED_EXTENTS
0
FREE_REQUESTS
0
© 2013 Julian Dyke
Column
Value
MAX_SIZE
524
MAX_BLOCKS
67072
MAX_USED_SIZE
61
MAX_USED_BLOCKS
7808
MAX_SORT_SIZE
61
MAX_SORT_BLOCKS
7808
RELATIVE_FNO
0
juliandyke.com
V$SORT_SEGMENT
21

Returns list of currently active sort segments

Incorrectly reports
 Segment absolute file number
 Segment block number
 Segment relative file number
© 2013 Julian Dyke
juliandyke.com
V$TEMP_EXTENT_MAP

Returns one row for each extent in temporary tablespace
SELECT * FROM v$temp_extent_map;
Column
Extent 1
Extent 2
Extent 3
TEMP
TEMP
TEMP
1
1
1
128
256
384
1048576
1048576
1048576
BLOCKS
128
128
128
OWNER
1
1
1
RELATIVE_FNO
1
1
1
Extent 522
Extent 523
Extent 524
TEMP
TEMP
TEMP
1
1
1
66816
66944
67072
1048576
1048576
1048576
BLOCKS
128
128
128
OWNER
1
1
1
RELATIVE_FNO
1
1
1
TABLESPACE_NAME
FILE_ID
BLOCK_ID
BYTES
Column
TABLESPACE_NAME
FILE_ID
BLOCK_ID
BYTES
22
© 2013 Julian Dyke
juliandyke.com
V$TEMP_EXTENT_POOL

Returns one row for each temporary file
SELECT * FROM v$temp_extent_pool;
23
© 2013 Julian Dyke
Column
Value
TABLESPACE_NAME
TEMP
FILE_ID
1
EXTENTS_CACHEED
524
EXTENTS_USED
61
BLOCKS_CACHED
67072
BLOCKS_USED
7808
BYTES_CACHED
549453824
BYTES_USED
63963136
RELATIVE_FNO
1
juliandyke.com
V$TEMP_SPACE_HEADER

Returns one row for each temporary file
SELECT * FROM v$temp_space_header;
24
© 2013 Julian Dyke
Column
Value
TABLESPACE_NAME
TEMP
FILE_ID
1
BYTES_USED
550502400
BLOCKS_USED
67200
BYTES_FREE
0
BLOCKS_FREE
0
RELATIVE_FNO
1
juliandyke.com
Work Area Table Dump

The work area table can be dumped using:
$ sqlplus / as sysdba
SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG DUMP WORKAREATAB_DUMP 3
Statement processed.

The trace file can be copied to the local directory using:
SQL> ORADEBUG TRACEFILE_NAME
/u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3779.trc
SQL> !cp /u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3779.trc wa3.trc
25
© 2013 Julian Dyke
juliandyke.com
Work Area Table Dump

The work area table dump includes a summary of SGA/PGA use:
Dumping Work Area Table (level=3)
=================================
Global SGA Info
--------------global target:
auto target:
max pga:
pga limit:
pga limit known:
pga limit errors:
pga inuse:
pga alloc:
pga freeable:
pga freed:
pga to free:
broker request:
pga auto:
pga manual:
26
© 2013 Julian Dyke
280
145
200
0
0
0
MB
MB
MB
MB
118
143
13
112
0
0
MB
MB
MB
MB
%
0 MB
0 MB
pga alloc (max):
pga auto
(max):
pga manual (max):
197 MB
56 MB
0 MB
# workareas
:
# workareas(max):
1
3
juliandyke.com
Work Area Table Dump

The work area table is an array containing 67 elements
Work Area Table (67 entries)
---------------------------Entry 0: 0 are active (max=1)
Entry 1: 0 are active (max=1)
......
Entry 30: 0 are active (max=1)
Entry 31: 1 are active (max=2)
work area is an internal structure
Probably not externalized
Work area 0x8d7e6ec8
wid=2, trcId=-1 sid=42, ser#=11, flags=59 (AUTO|SGA|ADVICE|ADVICE_START|DATASTATS_SET)
qc_sid=65535, qc_instid=65535, slv_grp=1
curSize=115, maxSize=57376, #maxPasses=1
temp seg state obj is state object of type
tempSegStateObj=0x8f31d4d0, tempSegObjType=2
62 (SORT SEGMENT HANDLE) in
shared work area descriptor (0x8a6167b8):
process state dump
flags:52 ( AUTO SGA NOSTATS )
opType:8(SORT (v2)) opId=1 proxy=(nil)
isize=31991K, osize=31991K, rowlen=326, flags=02
opt=0, 1pass=0, mpass=0, lastMem=0KB
maxTempSegSize=0KB, lastTempSegSize=0KB
Entry 32: 0 are active (max=1)
Entry 33: 1 are active (max=2)
......
shared work area descriptor is
WORKAREA_ADDRESS in
V$SQL_WORKAREA and
V$SQL_WORKAREA_ACTIVE
Entry 65: 0 are active (max=1)
Entry 66: 1 are active (max=2)
27
© 2013 Julian Dyke
juliandyke.com
State Objects

Used by PMON to release resources when a session ends abnormally

To dump for operating system pid 3407
$ sqlplus / as sysdba
SQL> ORADEBUG SETOSPID 3407
Oracle pid: 19, Unix process pid: 3407, image: [email protected] (TNS V1-V3)
SQL> ORADEBUG DUMP PROCESSSTATE 10
Statement processed.

28
Level 10 still appears to work
© 2013 Julian Dyke
juliandyke.com
State Objects

Process State dump ONLY includes a state object for the temporary segment
(SORT SEGMENT HANDLE)
 No state objects for the work areas
SO: 0x8f31d4d0, type: 62, owner: 0x917a1e10, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x91498608, name=sort segment handle, file=ktst2.h LINE:883, pg=0
SORT SEGMENT HANDLE: tsn=3 extents=61
(enqueue) <no resource>
lock_flag: 0x0, lock: 0x8f31d5a8, res: (nil)
own: (nil), sess: (nil), prv: (nil)
(enqueue) released enqueue or enqueue in flux
lock_flag: 0x0, lock: 0x8f31d610, res: 0x90d7b3c0
own: 0x917f8418, sess: 0x917f8418, prv: 0x8f31d620
(enqueue) <no resource>
lock_flag: 0x0, lock: 0x8f31d678, res: (nil)
own: (nil), sess: (nil), prv: (nil)
29
© 2013 Julian Dyke
juliandyke.com
Direct Path Writes

Oracle uses direct path writes to write blocks to the temporary segment
 Default block size is 1MB (128 x 8192 byte blocks)

For example (event 10046 / level 8 trace):
WAIT #140079396924400: nam='direct path write temp' ela= 27071 file number=201
first dba=67072 block cnt=31 obj#=78342 tim=1370678380137850
WAIT #140079396924400: nam='direct path write temp' ela= 608 file number=201
first dba=67103 block cnt=31 obj#=78342 tim=1370678380138800
WAIT #140079396924400: nam='direct path write temp' ela= 1084 file number=201
first dba=67134 block cnt=31 obj#=78342 tim=1370678380140195
WAIT #140079396924400: nam='direct path write temp' ela= 524 file number=201
first dba=67165 block cnt=31 obj#=78342 tim=1370678380140845
WAIT #140079396924400: nam='direct path write temp' ela= 145 file number=201
first dba=67196 block cnt=4 obj#=78342 tim=1370678380141102

Oracle allows writes 31 + 31 + 31 + 31 + 4 = 128 blocks
I think this weird pattern is required because each block contains a pointer to the
next block in the temporary segment
The final block cannot be written until the next extent is allocated. When the address
of the next extent is known, the final 4 blocks can be written.
30
© 2013 Julian Dyke
juliandyke.com
Direct Path Writes

In Oracle 11.2 direct path writes are implemented using the pwrite system
call.

On Linux this can be traced using the strace utility
 Output is written to stderr

For example for operating system pid 3407
strace –p 3407 2> strace.txt;
pwrite(257, "\10\242\0\0\0\6A\0007++\0\0\0\1\f?\36\0\0\0\6A\0\0\0\0\0\1\6A\0"...,
253952, 549453824) = 253952
pwrite(257, "\10\242\0\0\37\6A\0007++\0\0\0\1\f\207Q\0\0\37\6A\0\37\0\0\0
\6A\0"..., 253952, 549707776) = 253952
pwrite(257, "\10\242\0\0>\6A\0007++\0\0\0\1\f\25\277\0\0>\6A\0>\0\0\0?\6A\0"...,
253952, 549961728) = 253952
pwrite(257, "\10\242\0\0]\6A\0007++\0\0\0\1\f\246u\0\0]\6A\0]\0\0\0^\6A\0"...,
253952, 550215680) = 253952
pwrite(257, "\10\242\0\0|\6A\0007++\0\0\0\1\f\224\247\0\0|\6A\0|\0\0\0}\6A\0"...,
32768, 550469632) = 32768
31
© 2013 Julian Dyke
juliandyke.com
Direct Path Reads

Oracle uses direct path reads to read blocks back from the temporary
segment
 Default read size is 7 x 8192 byte blocks (57344 bytes)

For example (event 10046 / level 8 trace):
WAIT #139938233130376: nam='direct path read temp' ela= 32 file number=201 first
dba=59520 block cnt=7 obj#=78527 tim=1370770967683714
WAIT #139938233130376: nam='direct path read temp' ela= 29 file number=201 first
dba=59527 block cnt=7 obj#=78527 tim=1370806245404502
WAIT #139938233130376: nam='direct path read temp' ela= 17 file number=201 first
dba=59534 block cnt=7 obj#=78527 tim=1370806248875622
WAIT #139938233130376: nam='direct path read temp' ela= 15 file number=201 first
dba=59541 block cnt=7 obj#=78527 tim=1370806251857467
WAIT #139938233130376: nam='direct path read temp' ela= 15 file number=201 first
dba=59548 block cnt=7 obj#=78527 tim=1370806255340181
32
© 2013 Julian Dyke
juliandyke.com
Direct Path Reads

In Oracle 11.2 direct path reads are implemented using the pread system call.

For example for operating system pid 3407
strace –p 3407 2> strace.txt;
pread(257, "\10\242\0\0\243\350@\0\251'.\0\0\0\1\f\246\306\0\0\243\350@\
0#\0\0\0\244\350@\0"..., 57344, 487874560) = 57344
pread(257, "\10\242\0\0\252\350@\0\251'.\0\0\0\1\f\251\306\0\0\252\350@\
0*\0\0\0\253\350@\0"..., 57344, 487931904) = 57344
33
© 2013 Julian Dyke
juliandyke.com
Dumping TEMPFILE blocks

To dump the block(s) of a TEMPFILE use:
ALTER SYSTEM DUMP TEMPFILE <file#> BLOCK <block#>;

For example:
ALTER SYSTEM DUMP TEMPFILE 1 BLOCK 67072;

To dump multiple block(s) of a TEMPFILE use:
ALTER SYSTEM DUMP TEMPFILE <file#>
BLOCK MIN <block#>
BLOCK MAX <block#>

For example:
ALTER SYSTEM DUMP TEMPFILE 1
BLOCK MIN 67072
BLOCK MAX 67079;
34
© 2013 Julian Dyke
juliandyke.com
Dumping TEMPFILE blocks

For example:
Start dump data blocks tsn: 3 file#:1 minblk 67072 maxblk 67080
Block dump from cache:
Dump of buffer cache at level 4 for tsn=3 rdba=4261376
Block dump from disk:
buffer tsn: 3 rdba: 0x00410600 (1/67072)
scn: 0x0000.002a70a0 seq: 0x01 flg: 0x0c tail: 0x70a00801
frmt: 0x02 chkval: 0x1e3e type: 0x08=unknown
Hex dump of block: st=0, typ_found=0
Dump of memory from 0x00007FBF9B56BA00 to 0x00007FBF9B56DA00
7FBF9B56BA00 0000A208 00410600 002A70A0 0C010000 [......A..p*.....]
7FBF9B56BA10 00001E3E 00410600 00000000 00410601 [>.....A.......A.]
7FBF9B56BA20 00000000 0000007F 00001FA4 00000044 [............D...]
7FBF9B56BA30 0000000E 00000266 02C10002 5A5A012C [....f.......,.ZZ]
7FBF9B56BA40 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]
Repeat 17 times
7FBF9B56BB60 5A5A5A5A 5A5A5A5A 00025A5A 006402C1 [ZZZZZZZZZZ....d.]
7FBF9B56BB70 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]
Repeat 5 times
7FBF9B56BBD0 58585858 595900C8 59595959 59595959 [XXXX..YYYYYYYYYY]
7FBF9B56BBE0 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]
Repeat 10 times
7FBF9B56BC90 59595959 59595959 59595959 00005959 [YYYYYYYYYYYYYY..]
7FBF9B56BCA0 00000266 03C10002 5A5A012C 5A5A5A5A [f.......,.ZZZZZZ]
35
© 2013 Julian Dyke
juliandyke.com
Temporary Segment Block Format

Temporary Segment Blocks contain:
Header



header – 52 bytes
body – 8136 bytes
footer – 4 bytes
Body
Footer
36
© 2013 Julian Dyke
juliandyke.com
Temporary Segment Block Header

For example block 1 / 67072 – RDBA 0x410600
A208 0000 0600 0041 70A0 002A 0000 0C01 1E3E 0000 0600 0041 0000 0000 0601 0041
0000 0000 007F 0000 1FA4 0000 0044 0000 000E 0000

37
Fields are:
Bytes
Example
Description
0-1
0000 A208
Block Type (0x08 = Temporary Segment)
4-7
0041 0600
DBA of this block
8-11
002A 70A0
Sort number?
12-15
0C01 0000
Unknown – always 0xC01 = 3073
16-19
0000 1E3E
Check digit
20-23
0041 0600
DBA of this block
24-27
0000 0000
Sequential number of block within sort (0 upwards)
28-31
0041 0601
DBA of next block
32-35
0000 0000
Sequential number of block within sort (0 upwards)
36-39
0000 007F
Blocks remaining in extent (0x7F = 127)
40-43
0000 1FA4
Bytes used (0x1FA4 = 8100)
44-47
0000 0044
Bytes unused (0x44 = 68)
48-51
0000 000E
Rows in block? (0xE = 15)
© 2013 Julian Dyke
juliandyke.com
Temporary Segment Block Header

Check word
 Calculated using bitwise XOR as follows:
 XOR of all words in block including check word is zero
ub2 getCheckWord (void *buffer)
{
ub2 *sptr = (ub2 *)buffer;
int i;
ub2 m = 0;
}
38
for (i = 0;i < BLOCK_SIZE / sizeof (ub2);i++)
{
if (i == 8) continue;
skip the check word
m = m ^ sptr[i];
}
return m;
^ is bitwise XOR operator

Byte usage
 Bytes used + bytes unused is always 8168

Rows in block
 Appears to be number of rows in block – 2
© 2013 Julian Dyke
juliandyke.com
Temporary Segment Body

39
Body contains one or more sort rows:

Rows can be split across blocks

Columns are not split in this example, but it must be possible

Row starts with 4 byte value representing length of row in block
 Rows are padded with zero byte(s) to ensure byte alignment for row
length value

Column values are preceded by a 2-byte length word
 Either maximum column length is 65536 or some longer length value
can be created

NULL values are represented by a 2-byte length word containing zeros

No indication of number of columns in row or continuation rows
 Rows may be indexed in another structure that remains in memory
© 2013 Julian Dyke
juliandyke.com
Temporary Segment Body

Columns are stored in a strange order. For the statement:
SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;

The column order is always C1, C5, C2, C3, C4
 C1 is the primary key and the ORDER BY column
 There is no apparent reason why C5 is next
This behaviour is entirely reproducible, but the reason for the re-ordering of the
selected columns is currently unknown.
One theory is that the columns in the ORDER BY clause are sorted first, followed by
the longest column not included in the ORDER BY clause. This would may the sort
output more deterministic where the ORDER BY columns are non-unique.
Further research is required
40
© 2013 Julian Dyke
juliandyke.com
Temporary Segment Body


41
Example of row storage
0000 0269
# Length = 617
0006 3030 3030 3131
# C1 = 000011
012C 5A5A 5A5A 5A5A 5A5A ... 5A5A 5A5A
# C5 = ‘ZZZZZZ...ZZZ’
0001 80
# C2 = 0
0064 5858 5858 5858 5858 ... 5858 5858
# C3 = ‘XXXXXX...XXX’
00C8 5959 5959 5959 5959 ... 5959 5959
# C4 = ‘YYYYYY...YYY’
Total size is 617 bytes
 C1 = 2 + 6
= 8 bytes
 C2 = 2 + 1
= 3 bytes
 C3 = 2 + 100 = 102 bytes
 C4 = 2 + 200 = 202 bytes
 C5 = 2 + 300 = 302 bytes
© 2013 Julian Dyke
juliandyke.com
Extents and Batches


Comparing the contents of the temporary segment to the original table we can
conclude that (for this example)

Each extent contains 128 blocks

Data is sorted and written in batches of APPROXIMATELY 8 blocks

In each extent there will therefore be approximately 16 batches, each
sorted internally

Batches do not align with block boundaries
For this example, the data is probably sorted in local process heap and then
written directly to the temporary segment
Data appears to be identical between local heap and the temporary segment.
However, there is different metadata in each structure and therefore the data does
not fully align in the temporary segment blocks
42
© 2013 Julian Dyke
juliandyke.com
Granules

MEMORY_TARGET parameter set to 800M
SQL> SELECT bytes FROM v$sgainfo WHERE name = 'Granule Size‘;
BYTES
---------4194304

Granule size is 4MB:
 800MB / 4MB = 200 granules
SQL> SELECT grantype,COUNT(*) FROM x$ksmge
GROUP BY grantype
ORDER BY grantype;
GRANTYPE
COUNT(*)
---------- ---------0
70
1
58
2
1
3
1
7
43
14
25

43
#
#
#
#
#
#
Unallocated
Shared pool
Large pool
Java pool
Buffer cache
Total 198 granules reported by X$KSMGE
© 2013 Julian Dyke
juliandyke.com
PGA Granules

Unallocated granules can be determined using:
SQL> SELECT granaddr FROM x$ksgme
WHERE grantype = 0
ORDER BY grantype;
GRANADDR
---------------0000000060800000
0000000060C00000
0000000061000000
0000000061400000
0000000061800000
0000000061C00000
....
0000000070C00000
0000000071000000
0000000071400000
0000000071800000
0000000071C00000
70 rows selected.
44
© 2013 Julian Dyke
juliandyke.com
Memory Mappings

The shared memory devices mapped to the unallocated granules can be
determined using pmap for the operating system PID (3407)
pmap 3407
0000000060800000
0000000060c00000
0000000061000000
0000000061400000
0000000061800000
0000000061c00000
0000000062000000
4096K
4096K
4096K
4096K
4096K
4096K
4096K
rwxsrwxsrwxsrwxsrwxsrwxsrwxs-
/dev/shm/ora_TEST_2392076_0
/dev/shm/ora_TEST_2392076_1
/dev/shm/ora_TEST_2392076_2
/dev/shm/ora_TEST_2392076_3
/dev/shm/ora_TEST_2392076_4
/dev/shm/ora_TEST_2392076_5
/dev/shm/ora_TEST_2392076_6
4096K
4096K
4096K
4096K
4096K
4096K
4096K
rwxsrwxsrwxsrwxsrwxsrwxsrwxs-
/dev/shm/ora_TEST_2392076_63
/dev/shm/ora_TEST_2392076_64
/dev/shm/ora_TEST_2392076_65
/dev/shm/ora_TEST_2392076_66
/dev/shm/ora_TEST_2392076_67
/dev/shm/ora_TEST_2392076_68
/dev/shm/ora_TEST_2392076_69
....
0000000070400000
0000000070800000
0000000070c00000
0000000071000000
0000000071400000
0000000071800000
0000000071c00000

45
The mappings can also be determined using strace during process startup
 However, the output is more difficult to read.
© 2013 Julian Dyke
juliandyke.com
Shared Memory

Mappings are created for all granules during process start up
 Even if shared memory is currently unallocated for that device
$ ls –l /dev/shm/ora_TEST*
-rw-r-----rw-r-----rw-r-----rw-r-----rw-r-----rw-r-----rw-r-----
1
1
1
1
1
1
1
oracle
oracle
oracle
oracle
oracle
oracle
oracle
oinstall
oinstall
oinstall
oinstall
oinstall
oinstall
oinstall
0
0
0
0
0
0
0
Jun
Jun
Jun
Jun
Jun
Jun
Jun
8
8
8
8
8
8
8
08:57
08:57
08:57
08:57
08:57
08:57
08:57
/dev/shm/ora_TEST_2392076_0
/dev/shm/ora_TEST_2392076_1
/dev/shm/ora_TEST_2392076_2
/dev/shm/ora_TEST_2392076_3
/dev/shm/ora_TEST_2392076_4
/dev/shm/ora_TEST_2392076_5
/dev/shm/ora_TEST_2392076_6
1
1
1
1
1
1
1
oracle
oracle
oracle
oracle
oracle
oracle
oracle
oinstall
oinstall
oinstall
oinstall
oinstall
oinstall
oinstall
0
0
0
0
0
0
0
Jun
Jun
Jun
Jun
Jun
Jun
Jun
8
8
8
8
8
8
8
08:57
08:57
08:57
08:57
08:57
08:57
08:57
/dev/shm/ora_TEST_2392076_63
/dev/shm/ora_TEST_2392076_64
/dev/shm/ora_TEST_2392076_65
/dev/shm/ora_TEST_2392076_66
/dev/shm/ora_TEST_2392076_67
/dev/shm/ora_TEST_2392076_68
/dev/shm/ora_TEST_2392076_69
....
-rw-r-----rw-r-----rw-r-----rw-r-----rw-r-----rw-r-----rw-r-----
When a process starts, it creates mappings between addresses and shared memory
devices for all granules (in this case 200 granules).
If granules are allocated to the SGA their size will be 4194304 (in this case); if they
are not allocated the size will be 0.
46
© 2013 Julian Dyke
juliandyke.com
Local Heap Memory

Local PGA memory for a specific process can be dumped using ORADEBUG:
$ sqlplus / as sysdba
SQL> ORADEBUG SETOSPID 3407
Oracle pid: 19, Unix process pid: 3407, image: [email protected] (TNS V1-V3)
SQL> ORADEBUG DUMP HEAPDUMP 1025
Statement processed.

In this case the section of interest is “top uga heap”
HEAP DUMP heap name="top uga heap" desc=0xbaf9700
extent sz=0xffc0 alt=232 het=32767 rec=0 flg=2 opc=2
parent=(nil) owner=(nil) nex=(nil) xsz=0xfff8 heap=(nil)
fl2=0x60, nex=(nil), dsxvers=1, dsxflg=0x0
dsx first ext=0xe8ab0008
47
© 2013 Julian Dyke
juliandyke.com
Local Heap Memory

Sorted rows appear in the heap dump:
EXTENT 0 addr=0x7f45e8927008
Chunk
7f45e8927018 sz=
65512
freeable "session heap
"
ds=0x7f45e8aa67c8
Dump of memory from 0x00007F45E8927018 to 0x00007F45E8937000
7F45E8927010
0000FFE9 10B38F00
[........]
......
7F45E8927400 34010007 39383533
7F45E8927410 5A5A5A5A 5A5A5A5A
Repeat 17 times
7F45E8927530 5A5A5A5A 5A5A5A5A
7F45E8927540 58585858 58585858
Repeat 5 times
7F45E89275A0 58585858 595900C8
7F45E89275B0 59595959 59595959
Repeat 10 times
7F45E8927660 59595959 59595959
7F45E8927670 33350100 31393835
7F45E8927680 5A5A5A5A 5A5A5A5A
Repeat 17 times
7F45E89277A0 5A5A5A5A 5A5A5A5A
7F45E89277B0 58585858 58585858
Repeat 5 times
7F45E8927810 C8585858 59595900
7F45E8927820 59595959 59595959
Repeat 10 times
7F45E89278D0 59595959 59595959
48
© 2013 Julian Dyke
02620431 5A5A012C
5A5A5A5A 5A5A5A5A
[...435891.b.,.ZZ]
[ZZZZZZZZZZZZZZZZ]
00025A5A 006405C1
58585858 58585858
[ZZZZZZZZZZ....d.]
[XXXXXXXXXXXXXXXX]
59595959 59595959
59595959 59595959
[XXXX..YYYYYYYYYY]
[YYYYYYYYYYYYYYYY]
59595959 07055959
2C026204 5A5A5A01
5A5A5A5A 5A5A5A5A
[YYYYYYYYYYYYYY..]
[..535891.b.,.ZZZ]
[ZZZZZZZZZZZZZZZZ]
C100025A 58006406
58585858 58585858
[ZZZZZZZZZ....d.X]
[XXXXXXXXXXXXXXXX]
59595959 59595959
59595959 59595959
[XXX..YYYYYYYYYYY]
[YYYYYYYYYYYYYYYY]
59595959 00070559
[YYYYYYYYYYYYY...]
We know these
rows have been
sorted as the
columns appear in
the “wrong” order
juliandyke.com
Local Heap Memory

Local heap memory shows as /dev/zero in the pmap output

For example the heap extent in the previous slide was
EXTENT 0 addr=0x7f45e8927008
Chunk
7f45e8927018 sz=
65512
freeable "session heap
ds=0x7f45e8aa67c8
Dump of memory from 0x00007F45E8927018 to 0x00007F45E8937000

"
Output of pmap for this area of memory is:
pmap 3407
....
00007f45e88f7000
00007f45e8907000
00007f45e8917000
00007f45e8927000
00007f45e8937000
00007f45e8947000
00007f45e8957000
00007f45e8967000
00007f45e8977000
49
© 2013 Julian Dyke
64K
64K
64K
64K
64K
64K
64K
64K
64K
rwx-rwx-rwx-rwx-rwx-rwx-rwx-rwx-rwx--
/dev/zero
/dev/zero
/dev/zero
/dev/zero
/dev/zero
/dev/zero
/dev/zero
/dev/zero
/dev/zero
#
#
#
#
EXTENT
EXTENT
EXTENT
EXTENT
3
2
1
0
EXTENT 0
juliandyke.com
Conclusions


50
In this example:

Sorted rows are inserted directly into temporary segment when allocated
PGA space is exhausted in local memory

All selected columns are included in sorted rows
 ORDER BY columns
 Other columns

Some dynamic performance views are unreliable
In general:

Minimize number of columns in SELECT list

For some queries it may be more efficient to order the rows first, then to
revisit the table for the remaining data
 Avoids unnecessary writes to temporary segments
© 2013 Julian Dyke
juliandyke.com
Conclusions

51
General conclusions:
© 2013 Julian Dyke
juliandyke.com
Acknowledgements
52

The original presentation as delivered in Scotland contained some
inaccuracies that have been fixed in this version

This version also contains the results of some additional research

Thank you to the following for their invaluable advice, suggestions and
additional input during the event:

Jonathan Lewis

Tony Hasler

Bjoern Rost
© 2013 Julian Dyke
juliandyke.com