I/O Performance Analysis and Tuning: From the Application to the Storage Device
Henry Newman, Instrumental, Inc.
MSP Area Computer Measurement Group, March 2, 2006
© Copyright 2006 Instrumental, Inc.SM
Slide 1 of 109
Tutorial Goal
To provide the attendee with an
understanding of the history and
techniques used in I/O performance
analysis.
“Knowledge is Power”
- Sir Francis Bacon
© Copyright 2006 Instrumental, Inc.SM
Slide 2 of 109
Agenda
Common terminology
I/O and applications
Technology trends and their impact
Why knowing the data path is important
Understanding the data path
Application performance
Performance analysis
Examples of I/O performance issues
Summary
© Copyright 2006 Instrumental, Inc.SM
Slide 3 of 109
Common
Terminology
Using the same nomenclature
© Copyright 2006 Instrumental, Inc.SM
Slide 4 of 109
The Data Path
Before you can measure system efficiency, it is important to understand how the H/W and S/W work together end-to-end, along the "data path".
Let's review some of the terminology along the data path…
© Copyright 2006 Instrumental, Inc.SM
Slide 5 of 109
Terminology/Definitions
DAS
 Direct Attached Storage
SAN
 Storage Area Network
NAS
 Network Attached Storage shared via TCP/IP
SAN Shared File System
 File system that supports shared data between multiple servers
  Sharing is accomplished via a metadata server or a distributed lock manager
© Copyright 2006 Instrumental, Inc.SM
Slide 6 of 109
Direct Attached Storage
[Diagram: clients on a Local Area Network (LAN) connect to Server 1 … Server N; each server runs an application and file system and attaches through its own F/C switch to its own disks.]
© Copyright 2006 Instrumental, Inc.SM
Slide 7 of 109
Storage Area Network
[Diagram: clients on a Local Area Network (LAN) connect to Server 1 and Server 2; each server runs an application and file system, and both servers connect through a shared F/C switch to a RAID controller and its disks.]
© Copyright 2006 Instrumental, Inc.SM
Slide 8 of 109
Network Attached Storage
[Diagram: clients and application servers (Server 1, Server 2, … Server N) share a Local Area Network (LAN) with a NAS server; the NAS server runs the O/S and file system and owns the disks.]
© Copyright 2006 Instrumental, Inc.SM
Slide 9 of 109
File System Terminology
File System Superblock
 Describes the layout of the file system
  The location and designation of volumes being used
  The type of file system, layout, and parameters
  The location of file system metadata within the file system and other file system attributes
File System Metadata
 The data which describes the layout of the files and directories within a file system
File System Inode
 File system data which describes the location, access information, and other attributes of files and directories
© Copyright 2006 Instrumental, Inc.SM
Slide 10 of 109
Other Terms
HSM - Hierarchical Storage Management
 Management of files that are viewed as if they are on
disk within the file system, but are generally stored on a
near line or off-line device such as tape, SATA, optical
and/or other lower performance media.
LUN - Logical Unit Number
 Term used in SCSI protocol to describe a target such as
a disk drive, RAID array and/or tape drive.
© Copyright 2006 Instrumental, Inc.SM
Slide 11 of 109
Other Terms (cont.)
Volume Manager (VM)
 Manages multiple LUNs grouped into a file system
 Can usually stripe data across volumes or concatenate
volumes

Filling a volume before moving to fill the next volume
 Usually have a fixed stripe size allocated to each LUN
before allocating the next LUN

For well-tuned systems this is the RAID allocation (stripe width) or a multiple of it
© Copyright 2006 Instrumental, Inc.SM
Slide 12 of 109
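To make the stripe allocation described above concrete, here is a small illustrative C sketch (not from the original slides; the 512 KB stripe size and 4-LUN group are assumed values) that maps a file byte offset to a LUN and to the offset within that LUN for a simple striped volume.

/* Illustrative only: map a byte offset to a LUN and an offset within that LUN
 * for a simple striped volume (stripe size and LUN count are assumed values). */
#include <stdio.h>

int main(void)
{
    const unsigned long stripe = 512UL * 1024;   /* 512 KB written to each LUN per pass */
    const unsigned long luns   = 4;              /* LUNs in the stripe group            */

    unsigned long offset       = 3UL * 1024 * 1024 + 100;           /* example offset   */
    unsigned long stripe_index = offset / stripe;
    unsigned long lun          = stripe_index % luns;
    unsigned long lun_offset   = (stripe_index / luns) * stripe + offset % stripe;

    printf("byte %lu -> LUN %lu, offset %lu within the LUN\n", offset, lun, lun_offset);
    return 0;
}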
Round-Robin File System Allocation
[Diagram: files 1-8 allocated round-robin across a file system stripe group of LUNs used to match bandwidth.]
© Copyright 2006 Instrumental, Inc.SM
Slide 13 of 109
Round-Robin File System
[Diagram: the same stripe group populated with three (3) files — File 1, File 2, File 3 — then shown again after File 3 is removed.]
© Copyright 2006 Instrumental, Inc.SM
Slide 14 of 109
Striped File System Allocation
[Diagram: files 1-8 striped across all LUNs in the group.]
With stripe allocation, all writes go to all devices based on the allocation within the volume manager. Each file is not allocated on a single disk, but across all disks.
© Copyright 2006 Instrumental, Inc.SM
Slide 15 of 109
Striped File System Allocation
[Diagram: the stripe group populated with three (3) files; the allocation becomes fragmented after File 3 is removed.]
© Copyright 2006 Instrumental, Inc.SM
Slide 16 of 109
Microsoft NTFS Layout
[Diagram: newly formatted NTFS volume layout — Boot | MFT | Free Space | Metadata | Free Space]
Newly formatted NTFS volume
 Data and metadata are mixed and can easily become fragmented
 Head seeks on the disks are a big issue
  Given the different access patterns for data (long block sequential) and metadata (short block random)
© Copyright 2006 Instrumental, Inc.SM
Slide 17 of 109
SAN Shared File System (SSFS)
The ability to share data between
systems directly attached to the same
devices
Accomplished through SCSI connectivity
and a specialized file system and/or
communications mechanism
 Fibre Channel
 iSCSI
 Other communications methods
© Copyright 2006 Instrumental, Inc.SM
Slide 18 of 109
SAN Shared File System (SSFS)
Different types of SAN file systems allow
multiple writes to the same file system
and the same file open from more than
one machine
POSIX limitations were never considered
for shared file systems
Let’s take a look at 2 different types of
SSFS…
© Copyright 2006 Instrumental, Inc.SM
Slide 19 of 109
Centralized Metadata SSFS
Client
Client
Client
Client
Local Area Network (LAN)
For metadata traffic
Server
Server
Server
Server
Application
Application
Application
Application
Meta Data
Client
Client
Client
File System
File System
File System
File System
F/C Switch
RAID Controller
Disk
© Copyright 2006 Instrumental, Inc.SM
Disk
Slide 20 of 109
Disk
Distributed Metadata SSFS
Client
Client
Server
Server
Application
Application
Lock Mgr.
File System
Client
(LAN) local data traffic
for file system
Depends on
implementation
Server
Server
Application
Application
Lock Mgr
Lock Mgr
Lock Mgr
File System
File System
File System
F/C Switch
RAID Controller
Disk
© Copyright 2006 Instrumental, Inc.SM
Client
Disk
Slide 21 of 109
Disk
More on SSFS
Centralized metadata server approaches do not scale to client counts over 64 as well as distributed lock managers do
 Lustre and GPFS are examples of distributed metadata
  Panasas scales similarly to the distributed metadata approaches, but views the files as objects
© Copyright 2006 Instrumental, Inc.SM
Slide 22 of 109
Definition - Direct I/O
I/O which bypasses the server's memory-mapped cache and goes directly to disk
 Some file systems can automatically switch between paged I/O and direct I/O depending on I/O size
 Some file systems require special attributes to force direct I/O for specific files or directories, or enable it via an API
 Emerging technologies often call data movement directly to the device "Direct Memory Addressing" or "DMA"
  Similar to what is done with MPI communications
© Copyright 2006 Instrumental, Inc.SM
Slide 23 of 109
Direct I/O Improvements
Direct I/O is similar to “raw” I/O used in
database transactions
CPU usage for direct I/O
 Can be as little as 5% of paged I/O
Direct I/O improves performance
 If the data is written to disk and not reused by the
application
Direct I/O is best used with large
requests
 This might not improve performance for some file
systems
© Copyright 2006 Instrumental, Inc.SM
Slide 24 of 109
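As a concrete illustration of the direct I/O described above (an added sketch, not part of the original slides), the following Linux C fragment opens a file with O_DIRECT and issues a single well-formed, aligned 1 MB write. The file name, request size, and 4 KB alignment are assumptions; some file systems may reject or quietly buffer O_DIRECT.

/* Minimal direct I/O sketch (Linux; assumes the file system honors O_DIRECT).
 * Buffer address, request size, and file offset all stay on 512-byte
 * (here 4 KB) boundaries so the request remains well formed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t req = 1 << 20;                /* 1 MB request, a multiple of 512 */
    void *buf = NULL;

    /* O_DIRECT requires an aligned buffer; 4096 covers common page sizes. */
    if (posix_memalign(&buf, 4096, req) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, req);

    int fd = open("testfile.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = write(fd, buf, req);           /* bypasses the page cache */
    if (n < 0)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}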
Well Formed I/O
The application, operating system, file
system, volume manager and storage
device all have an impact on I/O
I/O that is “well formed” must satisfy
requirements from all of these areas for
the I/O to move efficiently
 I/O that is well formed reads and writes data on
multiples of the basic blocks of these devices
© Copyright 2006 Instrumental, Inc.SM
Slide 25 of 109
Well Formed I/O and Metadata
If a file system does not separate data and metadata, and their space is co-located, data alignment can suffer because metadata is interspersed with data
Large I/O requests are not necessarily allocated sequentially
 File systems allocate data based on internal allocation algorithms
 Multiple write streams prevent sequential allocation
© Copyright 2006 Instrumental, Inc.SM
Slide 26 of 109
Well Formed & Direct I/O from the OS
Even if you use the O_DIRECT option, I/O cannot move from user space to the device unless it begins and ends on 512-byte boundaries
On some systems additional OS requirements are mandated
 Page alignment
  32 KB requests are often related to page alignment
  Just because memory is aligned does not mean that the file system or RAID is aligned
 These are out of your control
© Copyright 2006 Instrumental, Inc.SM
Slide 27 of 109
Well Formed & Direct I/O from Device
Just because data is aligned in the OS
does not mean it is aligned for the device
 I/O for disk drives must begin and end on 512 byte
boundaries
 And of course you have the RAID alignment issues

More on this later
© Copyright 2006 Instrumental, Inc.SM
Slide 28 of 109
Volume Managers (VMs)
For many file systems, the VMs control
the allocation to each device
 VMs often times have different allocations than the file
system
Making read/write requests equal to or in
multiples of the VM allocation generally
improves performance
 Some VMs have internal limits that prevent large
numbers of I/O requests from being queued
© Copyright 2006 Instrumental, Inc.SM
Slide 29 of 109
Device Alignment
Almost all modern RAID devices have a
fixed allocation per device
 Ranges from 4 KB to 512 KB are common
File systems have the same alignment issue with RAID controllers as memory does with the operating system
© Copyright 2006 Instrumental, Inc.SM
Slide 30 of 109
Direct I/O Examples (Well Formed)
Any I/O request that begins and ends on a 512-byte boundary is well formed*
Example: a request that begins at byte 0 and is 262,144 bytes long
[Diagram: byte offsets 0, 512, 1024, … 262144 marking the aligned request]
* Well formed in terms of the disk, not the RAID
© Copyright 2006 Instrumental, Inc.SM
Slide 31 of 109
Direct I/O Examples (Not Well Formed)
I/O that is not well formed can be broken into well-formed parts and non-well-formed parts by some file systems
Example: a request that begins at byte 1 and ends at byte 262,145
 1st request: bytes 0-512, of which 511 bytes are moved through the system buffer
 2nd request: bytes 513-262,144 (direct)
 3rd request: bytes 262,145-262,656, buffered
[Diagram: byte offsets 1, 512, 1024, … 262145 marking the misaligned request]
© Copyright 2006 Instrumental, Inc.SM
Slide 32 of 109
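A trivial C helper (added here for illustration, not the author's code) makes the well-formed test above explicit: both the starting offset and the length must be multiples of the 512-byte sector.

/* Illustrative helper: check whether a request is "well formed" with respect
 * to a 512-byte device sector (well formed for the disk, not the RAID). */
#include <stdbool.h>
#include <stdio.h>

#define SECTOR 512UL

static bool well_formed(unsigned long offset, unsigned long length)
{
    /* Both the starting offset and the length must be sector multiples. */
    return (offset % SECTOR == 0) && (length % SECTOR == 0);
}

int main(void)
{
    printf("offset 0, length 262144: %s\n",
           well_formed(0, 262144) ? "well formed" : "not well formed");
    printf("offset 1, length 262145: %s\n",
           well_formed(1, 262145) ? "well formed" : "not well formed");
    return 0;
}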
Well Formed I/O Impact
Having I/O that is not well formed causes
 Significant overhead in the kernel to read data that is not
aligned

Impact depends on other factors such as page alignment
 Other impacts on the RAID depend on the RAID configuration (more on this later)
Slide 33 of 109
I/O and Applications
What is the data path?
© Copyright 2006 Instrumental, Inc.SM
Slide 34 of 109
What Happens with I/O
I/O can take different paths within the
operating system depending on the type
of I/O request
 These different paths have a dramatic impact on
performance
There are two types of application I/O that take different paths
 C library buffered I/O
 System calls
© Copyright 2006 Instrumental, Inc.SM
Slide 35 of 109
I/O Data Flow Example
All data goes through the system buffer cache
High overhead, as data must compete with user operations for system cache
[Diagram: Program → C Library Buffer space → Page Cache → File System Cache (some systems) → Storage, with numbered paths:]
 1 Raw I/O: no file system, or direct I/O
 2 All I/O under file system read/write calls
 3 File system metadata and data, for most file systems
 4 As data is aged it is moved from the page cache to storage
 5 Data moved to the file system cache on some systems
© Copyright 2006 Instrumental, Inc.SM
Slide 36 of 109
C Library Buffered
Library buffer size
 The size of the stdio.h buffer. Generally, this is between
1,024 bytes and 8,192 bytes and can be changed on
some systems by calls to setvbuf()
 Moving data via the C library requires multiple memory
moves and/or memory remapping to copy data from the
user space to library buffer to the storage device
 Library I/O generally has much higher overhead than plain system calls, because the small buffer sizes force more system calls
© Copyright 2006 Instrumental, Inc.SM
Slide 37 of 109
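The setvbuf() call mentioned above can be shown with a short added sketch; the 1 MB buffer size and file name are arbitrary choices for illustration.

/* Enlarging the stdio buffer with setvbuf(); must be done before the first
 * read or write on the stream. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("output.dat", "w");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    /* Replace the default (often 4-8 KB) buffer with a 1 MB fully buffered one. */
    size_t bufsize = 1024 * 1024;
    char *buf = malloc(bufsize);
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, bufsize) != 0) {
        fprintf(stderr, "could not set stdio buffer\n");
        return 1;
    }

    for (int i = 0; i < 1000; i++)
        fprintf(fp, "record %d\n", i);   /* stays in the buffer until it fills or fclose */

    fclose(fp);                          /* flushes the buffer */
    free(buf);
    return 0;
}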
C Library I/O Performance
If I/O is random and the buffer is bigger than the request
 More data will be read than is needed
If I/O is sequential
 If the buffer is bigger than the request, data is read ahead
 If the buffer is smaller than the request, multiple system calls will be required
Library buffering only helps when the buffer is larger than the request
 It needs to be significantly larger, given the extra overhead to move the data or remap the pages
© Copyright 2006 Instrumental, Inc.SM
Slide 38 of 109
System Calls
UNIX system calls are generally more efficient for random or sequential I/O
 The exception is small sequential requests compared with C library I/O using a large setvbuf() buffer
System calls allow you to perform asynchronous I/O
 Control returns to the program immediately, and you acknowledge completion only when you need the data to be on the device
© Copyright 2006 Instrumental, Inc.SM
Slide 39 of 109
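One portable way to issue the asynchronous I/O described above is POSIX AIO. The sketch below is an added illustration (file name and request size assumed; link with -lrt on Linux): it queues a write, lets the program keep computing, and waits for completion only when the data must be on the device.

/* Minimal POSIX asynchronous I/O sketch. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("async.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    static char buf[1 << 20];                 /* 1 MB request */
    memset(buf, 'x', sizeof(buf));

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) {                /* queue the write and return at once */
        perror("aio_write");
        return 1;
    }

    /* ... the program can compute here while the I/O is in flight ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);               /* block only when the result is needed */
    printf("wrote %zd bytes\n", aio_return(&cb));

    close(fd);
    return 0;
}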
Vendor Libraries
Some vendors have custom libraries that
 Manage data alignment
 Provide circular asynchronous buffering
 Allow readahead
Cray, IBM and SGI all have libraries
which can significantly improve I/O
performance for some applications
 There is currently no standard in this area
There is an effort by DOE to develop
similar technology for Linux
© Copyright 2006 Instrumental, Inc.SM
Slide 40 of 109
Technology Trends
and Their Impact
What is changing and what is not
© Copyright 2006 Instrumental, Inc.SM
Slide 41 of 109
Block Device History
The concept of block devices has been
around for a long time…at least 35 years
A block device is a data storage or
transfer device that manipulates data in
groups of a fixed size
 For example, a disk whose data storage size is usually
512 bytes for SCSI devices
© Copyright 2006 Instrumental, Inc.SM
Slide 42 of 109
SCSI Technology History
The SCSI standard has been in place for
a long time as well
 There is an excellent historical account of SCSI
http://www.pcguide.com/ref/hdd/if/scsi/over.htm
Though the SCSI history is interesting
and the technology has been launched
by many companies
 The SCSI standard was published in 1986, which makes it about 20 years old
© Copyright 2006 Instrumental, Inc.SM
Slide 43 of 109
Changes Have Been Limited
Since the advent of block devices and the SCSI
protocol, modest changes have been made to
support
 Interface changes, new device types, and some changes for error
recovery and performance
Nothing has really changed in the basic
concepts of the protocol
Currently there is no communication regarding
data topology between block devices and SCSI
 Although one new technology has promise - more on OSD later
© Copyright 2006 Instrumental, Inc.SM
Slide 44 of 109
~ Relative Latency for Data Access
[Chart: approximate relative latency (times difference, log scale from 1 to 1E+11) for CPU registers, L1 cache, L2 cache, memory, disk, NAS, and tape, showing the minimum and maximum times increase.]
Note: Approximate values for various technologies, spanning roughly 12 orders of magnitude
© Copyright 2006 Instrumental, Inc.SM
Slide 45 of 109
~ Relative Bandwidth for Data
[Chart: approximate relative bandwidth in GB/sec (log scale from 1E-02 to 1E+04) for CPU registers, L1 cache, L2 cache, memory, disk, NAS, and tape, showing the minimum and maximum relative bandwidth reduction.]
Note: Approximate values for various technologies, spanning roughly 6 orders of magnitude
© Copyright 2006 Instrumental, Inc.SM
Slide 46 of 109
Performance Increases (1977-2005)
[Chart: relative performance increases (log scale, 1 to 10,000,000) for CPU, disk drive size, transfer rate (disk and RAID), RPMs, and seek+latency for reads and writes.]
© Copyright 2006 Instrumental, Inc.SM
Slide 47 of 109
Bandwidth per GB of Capacity
[Bar chart: bandwidth per GB of capacity (MB/sec) for a 1977 single disk (37.50) versus a modern 300 GB single disk and FC and SATA RAID 4+1 / 8+1 configurations with 300 GB drives (all well under 1 MB/sec per GB).]
© Copyright 2006 Instrumental, Inc.SM
Slide 48 of 109
Modern Bandwidth/Capacity
[Chart: bandwidth per GB of capacity (MB/sec) for disk drives introduced from 1991 through 2004, falling from about 8 MB/sec per GB toward well under 1.]
© Copyright 2006 Instrumental, Inc.SM
Slide 49 of 109
4 KB IOPS for a Single Device
[Bar chart: number of 4 KB I/Os per second — 1977 CDC Cyber 819: 270; 2005 300 GB Seagate Cheetah 10K.7: 1,655; 2005 400 GB SATA: 765.]
© Copyright 2006 Instrumental, Inc.SM
Slide 50 of 109
Density & Performance are Flattening

Year | Capacity (GB) | Height | Est. Max. Xfer Rate (MB/sec) | Increase Rate | Rate per Year since 1991 | Seek (msec) | RPM    | Latency (msec)
1991 | 0.5           | Full   | 4                            | -             | -                        | 14.0        | 4,412  | 6.80
1991 | 1.0           | Full   | 4                            | 100%          | -                        | 10.5        | 4,500  | 6.67
1992 | 2.0           | Full   | 7                            | 175%          | 75%                      | 8.0         | 7,200  | 4.17
1994 | 4.0           | Full   | 9                            | 225%          | 42%                      | 8.0         | 7,200  | 4.17
1996 | 9.0           | Full   | 16                           | 388%          | 58%                      | 8.8         | 7,200  | 4.17
1998 | 18.0          | LP     | 29                           | 722%          | 89%                      | 5.7         | 10,000 | 2.99
1999 | 36.0          | LP     | 53                           | 1,334%        | 154%                     | 5.2         | 10,000 | 2.99
2000 | 72.0          | LP     | 67                           | 1,678%        | 175%                     | 5.1         | 10,000 | 2.99
2000 | 72.0          | LP     | 71                           | 1,773%        | 186%                     | 3.6         | 10,000 | 2.00
2002 | 146.0         | LP     | 84                           | 2,100%        | 182%                     | 4.7         | 15,000 | 2.99
2002 | 146.0         | LP     | 89                           | 2,228%        | 193%                     | 3.6         | 15,000 | 2.00
2005 | 300.0         | LP     | 118                          | 2,950%        | 204%                     | 4.7         | 10,000 | 2.00

© Copyright 2006 Instrumental, Inc.SM
Slide 51 of 109
Tape is Even Worse

Vendor | Drive   | Media        | Introduced | Capacity (GB) | Peak Xfer Rate (MB/sec, uncompressed) | Perf. Increase | Capacity per BW (GB per MB/sec)
IBM    | 3420    | Reel-to-Reel | 1974       | 0.1           | 1.25                                  | 1.00           | 0.12
IBM    | 3480    | 3480         | 1984       | 0.2           | 3.00                                  | 2.40           | 0.07
IBM    | 3490    | 3480         | 1989       | 0.2           | 3.00                                  | 2.40           | 0.07
IBM    | 3490E   | 3480         | 1991       | 0.4           | 4.50                                  | 3.60           | 0.09
IBM    | 3490E   | 3490E        | 1992       | 0.8           | 4.50                                  | 3.60           | 0.17
IBM    | 3490    | 3490E        | 1995       | 0.8           | 6.00                                  | 4.80           | 0.13
IBM    | 3590    | 3590         | 1995       | 10.0          | 9.00                                  | 7.20           | 1.11
STK    | SD-3    | SD-3         | 1995       | 50.0          | 11.00                                 | 8.80           | 4.55
STK    | T9840A  | 9840         | 1998       | 20.0          | 10.00                                 | 8.00           | 2.00
IBM    | 3590E   | 3590E        | 1999       | 20.0          | 14.00                                 | 11.20          | 1.43
IBM    | 3950E   | 3590E        | 2000       | 40.0          | 14.00                                 | 11.20          | 2.86
STK    | T9940A  | 9940         | 2000       | 60.0          | 14.00                                 | 11.20          | 4.29
LTO    | LTO     | LTO          | 2000       | 100.0         | 14.00                                 | 11.20          | 7.14
STK    | T9840B  | 9480         | 2001       | 40.0          | 20.00                                 | 16.00          | 2.00
STK    | T9940B  | 9940         | 2002       | 200.0         | 30.00                                 | 24.00          | 6.67
IBM    | 3590H   | 3950         | 2002       | 60.0          | 14.00                                 | 11.20          | 4.29
LTO    | LTO-II  | LTO          | 2003       | 200.0         | 35.00                                 | 28.00          | 5.71
IBM    | 3592    | 3592         | 2004       | 300.0         | 40.00                                 | 32.00          | 7.50
LTO    | LTO-III | LTO          | 2005       | 400.0         | 80.00                                 | 64.00          | 5.00
STK    | T1000   | Titanium     | ?          | 500.0         | 130.00                                | 43.33          | 3.85

© Copyright 2006 Instrumental, Inc.SM
Slide 52 of 109
Why is Tape Important?
These tape issues impact jobs when
recalling data from HSM
A “best case” recall of a 100 GB file will
take ~1,500 seconds to load from HSM
 This is just the recall time from tape using STK T9940
running with full compression
© Copyright 2006 Instrumental, Inc.SM
Slide 53 of 109
Application Efficiency by I/O Size
[Chart: percent utilization versus request size (1 KB to 8 MB) for a 1977 disk and for today's 10K and 15K RPM disks.]
How much data can you transfer in the average seek+latency time? You need to make large I/O requests to amortize the transfer-rate increases, given the lack of seek+latency improvement.
© Copyright 2006 Instrumental, Inc.SM
Slide 54 of 109
What These Charts Show
Densities are growing but not nearly as
fast as CPU performance
 Seek, latency and transfer times are not going to change
much over the next few years
 CPU performance is increasing faster than storage
performance
Making small I/O requests is inefficient
 No technology changes on the horizon will change this
H/W limitation
© Copyright 2006 Instrumental, Inc.SM
Slide 55 of 109
What These Charts Show (cont.)
Transfer rates lag capacity significantly
 I/O performance is a combination of raw media transfer rates, access times (seek+latency), and file system efficiency
Different applications and architectures are more sensitive to different combinations of these 3 areas
You are not going to spin disks at 5.76 billion RPM
© Copyright 2006 Instrumental, Inc.SM
Slide 56 of 109
Why Knowing the
Data Path is
Important
Application I/O and System Impacts
© Copyright 2006 Instrumental, Inc.SM
Slide 57 of 109
Understanding the Data Path
With all of the H/W and tunable parameters between the application programs and the storage H/W, it is important that you understand what is happening to the data when you make an I/O request
 If I/O is a problem you need to understand what
happens after you make the request
 You cannot depend on the architecture to handle your
requests efficiently as it might be tuned for another
application
© Copyright 2006 Instrumental, Inc.SM
Slide 58 of 109
What To Do
If you want your I/O to go fast you need
to understand
 What paths the I/O takes
 What happens to the I/O as it moves down the data path
 What things you can do to impact performance
Learning to tune applications for the data
path is the only way to significantly
improve I/O performance
 The system can be well tuned, but if the application
does not take advantage of it….
© Copyright 2006 Instrumental, Inc.SM
Slide 59 of 109
Why Am I Telling You This Stuff?
You cannot change any of this yourself
 Some settings can be changed by the Sys Admins
 Some of it the Sys Admins cannot change
  e.g. file systems that do not align
Since you cannot make the changes, the key is to understand what happens
 E.g. a good rule of thumb: the fewer times you ask the OS to do something for you, the better
Let’s review the data path issues…
© Copyright 2006 Instrumental, Inc.SM
Slide 60 of 109
Understanding the
Data Path
OS, VM, File System,
PCI Bus, HBA, RAIDs
and tuning issues
© Copyright 2006 Instrumental, Inc.SM
Slide 61 of 109
Application Programmers
Like it or not, your application I/O
performance is dependent on knowing
the data path issues
If you are getting poor performance and
seemingly doing everything efficiently
from the application point of view, you
need to know what to ask of the
architecture
How it all works and what to ask…
© Copyright 2006 Instrumental, Inc.SM
Slide 62 of 109
OS Request Size Limits
Some operating systems have limits on the size of the largest I/O request
 Some of these limits are tunable, some are not
 Linux 2.4 often limited I/O requests to 128 KB
The SCSI limit is less than 32 MB, but it is not an aligned value
Since VMs are often involved, I/O is usually limited to the allocation size
© Copyright 2006 Instrumental, Inc.SM
Slide 63 of 109
OS Cache Tunables
File system caches are useful for three
reasons
 To hide latency
 To reuse data
 To allow the file system to consolidate to larger request
sizes
Using cache other than for these reasons
will increase overhead
 In some environments there is enough cache per node
that the file can fit in cache and the file is then written out
asynchronously
© Copyright 2006 Instrumental, Inc.SM
Slide 64 of 109
OS Cache Tunables
The OS still has to manage the cache
 The bigger the cache, the more overhead to manage it
 Caches often use an LRU algorithm, so they have to find the oldest page
  With lots of 8 KB pages this is a problem for large-memory systems
  Some systems have larger pages for I/O, which improves performance
© Copyright 2006 Instrumental, Inc.SM
Slide 65 of 109
OS Cache Impact
Not using direct I/O, and instead trying to cache the data, means
 Increased system CPU
 Potentially slower I/O performance
Performance could be improved if
 I/Os are small and they get consolidated into larger
requests
 There is enough memory to hide latency but not
significantly increase CPU overhead for management
 The whole file fits in cache
© Copyright 2006 Instrumental, Inc.SM
Slide 66 of 109
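For buffered I/O, one way to give the OS cache the kind of hints implied above (hide latency on a sequential scan, and do not keep pages that will not be reused) is posix_fadvise(). This is an added sketch, not something the slides prescribe, and the file name is illustrative.

/* Hinting the page cache with posix_fadvise() on a POSIX system. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Whole-file sequential scan: ask for aggressive readahead ... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... read the file here ... */

    /* ... then tell the OS the pages will not be reused, so the cache
     * does not have to age them out through its LRU machinery. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}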
Other OS Tunables
Some device drivers have request size limitations
 AIX and Solaris, for example
 This limits the size of the I/O requests to the devices
Small pages impact I/O
 Page-locking overhead hurts I/O performance
  Mostly system CPU time to lock down large numbers of small pages so they can be DMA'ed to and from
Various other caches that impact I/O
 Name cache for file system metadata
 Some file systems have an inode cache for example
© Copyright 2006 Instrumental, Inc.SM
Slide 67 of 109
OS Tunables Impact
Having I/O requests broken into small
pieces increases system CPU and
reduces I/O performance
 This reduction can be significant
Small pages increase the lock and unlock overhead required for direct I/O
© Copyright 2006 Instrumental, Inc.SM
Slide 68 of 109
File System Configuration & Tuning
Every file system can be tuned well or
tuned poorly
 Remember - tuning is not a trivial task
File systems cannot currently be well
tuned for large block sequential and
small block random in the same file
system at the same time
 This is a problem for many applications where small and
large files are on the same file system
© Copyright 2006 Instrumental, Inc.SM
Slide 69 of 109
Important Point
If the file system is tuned for large I/O
requests and you are making small
requests, I/O performance will be poor
 I am unaware of a file system that dynamically tunes
based on file size and request size

A library might be able to do this as it has more knowledge
If the file system is tuned for small I/O
requests and you are making large
requests, I/O performance will likely be
poor
 Large and small are generally mutually exclusive
© Copyright 2006 Instrumental, Inc.SM
Slide 70 of 109
VM Configuration & Tuning
Knowing the stripe size is important
 Given how volume managers allocate data on devices
Some file systems try to align to the VM
allocations but not all do
 Sometimes file systems pad allocations to begin on a stripe boundary
 These file systems tend to perform better, as the RAID does not have to read-modify-write as often
Some VMs have tunable request size limits
© Copyright 2006 Instrumental, Inc.SM
Slide 71 of 109
VM Impact
Not making requests in a multiple of the
stripe size will likely have a negative
performance impact
 The RAID controller must do more work to align the data
The impact depends on the file system
type and how well that file system deals
with the underlying RAID (block device)
allocation
 Some file systems align to the RAID controller
 This is easily tested
© Copyright 2006 Instrumental, Inc.SM
Slide 72 of 109
PCI Bus
Many PCI buses cannot run at the rated
speed
 A dual port FC 2 Gb HBA can require 800 MB/sec of
bandwidth

200 MB/sec for each port read and write
 Standard PCI 64/66 can support a maximum of 532 MB/sec
  Many do not run at the full rated speed
  Full duplex performance as low as 140 MB/sec is not uncommon
© Copyright 2006 Instrumental, Inc.SM
Slide 73 of 109
PCI Bus Impact
A bad bus will slow your performance
Often this also impacts your communications I/O, depending on the system
Measuring the performance is difficult given all of the H/W and S/W in the way, any of which could be part of the problem
 For Linux clusters you might be able to measure this more easily using MPI performance tests
© Copyright 2006 Instrumental, Inc.SM
Slide 74 of 109
HBA Command Queues
The default for some HBAs is to only
allow 16 outstanding commands
 This is a historical limitation based on older RAID
controllers that had small command queues
 If a single application is making all large I/O requests (> 4 MB), then 16 is plenty to keep a 2 Gb or 4 Gb HBA busy
 It is not enough if requests are, say, < 64 KB
© Copyright 2006 Instrumental, Inc.SM
Slide 75 of 109
HBA Command Queues Impact
If requests from your application are
large you will not see an impact
If your requests are small, or multiple
applications are making requests, and
the command queue is only 16, I/O
performance will be poor
 We have seen up to 50% reduction in performance for
databases
© Copyright 2006 Instrumental, Inc.SM
Slide 76 of 109
iSCSI/Ethernet Issues
Systems are not tuned for large TCP/IP
packets
 This will increase System CPU overhead
 This will reduce performance
Space allocated in the kernel does not
allow for large requests
 Packetization overhead
Ethernet NICs are not tuned or do not
support 9K MTUs
© Copyright 2006 Instrumental, Inc.SM
Slide 77 of 109
IP/Ethernet Impact
Without support for large TCP/IP
requests and efficient allocation of buffer
space
 TCP/IP performance can drop as much as 50%
 Can impact all types of TCP/IP communication including
NFS, ftp, etc
Large MTUs significantly improve CPU
efficiency and the transfer rate
This packetization can still be an issue
depending on the size of the metadata
requests
© Copyright 2006 Instrumental, Inc.SM
Slide 78 of 109
FC Switch Full Duplex Performance
Some FC switches cannot operate at full
rate, full duplex
 Not a big problem for systems that mostly do I/O for
checkpointing as all you are doing is writing
 A big problem for systems that require full duplex
performance
© Copyright 2006 Instrumental, Inc.SM
Slide 79 of 109
FC Switch Impact
Unless you are doing full duplex I/O this is not an issue
 Except that with the high-density blades used in some directors, a board cannot run all ports at full rate even half duplex
For full duplex I/O this must be measured
 Ronald Reagan said it best…”Trust but Verify”

This could be a big issue for some of the new 4 Gb switches and
might need to be considered
© Copyright 2006 Instrumental, Inc.SM
Slide 80 of 109
RAID Tuning & Configuration
Why is knowing the RAID layout
important?
 Without knowing the RAID layout, you do not know what size I/O request makes the most sense for your application
Is it configured as 7+1 with 512 KB segments (3,584 KB stripe width) or 8+1 with 128 KB segments (1,024 KB stripe width)?
 RAID allocations should have been set to be a multiple of the volume manager allocation

Trust but verify
© Copyright 2006 Instrumental, Inc.SM
Slide 81 of 109
RAID Stripe Allocation & Other S/W
Each layer could break up your I/O request; or, if the data is cached in memory and your request is small, the virtual memory system might make the request larger.
The key is, wherever possible, to make large requests in multiples of the stripe width and, barring that, to simply make large requests (a small arithmetic sketch follows below).
© Copyright 2006 Instrumental, Inc.SM
Slide 82 of 109
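As simple arithmetic to go with the recommendation above (using the 7+1, 512 KB segment example from the RAID slide as an assumed configuration), this added C sketch computes the stripe width and rounds a request size up to a multiple of it.

/* Compute a RAID stripe width from the segment size and the number of data
 * disks, then round a request size up to a multiple of it (assumed values). */
#include <stdio.h>

static unsigned long round_up(unsigned long value, unsigned long multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

int main(void)
{
    unsigned long segment_kb = 512;   /* per-disk segment size (KB)     */
    unsigned long data_disks = 7;     /* 7+1 RAID-5: 7 data, 1 parity   */
    unsigned long stripe_kb  = segment_kb * data_disks;     /* 3,584 KB */

    unsigned long request_kb = 4096;  /* application's preferred size   */
    printf("stripe width: %lu KB\n", stripe_kb);
    printf("request of %lu KB rounded up to %lu KB\n",
           request_kb, round_up(request_kb, stripe_kb));
    return 0;
}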
Potential RAID Configuration
Most RAID LUNs for data in HPC are
configured as RAID-5
 I have seen RAID-5 configured as 8+1, 10+1, 12+1 and even 16+1
 Each configuration needs to be considered in your tuning
 12+1 with 512 KB segments is a 6 MB stripe width!
© Copyright 2006 Instrumental, Inc.SM
Slide 83 of 109
RAID Impact
RAID devices have a fixed block size that they must physically read and write
 This impacts writing more than reading
Not writing in at least segment sizes
often causes a read-modify-write
 The amount of data that needs to be read-written
depends on the RAID vendor, but needless to say this
slows performance
© Copyright 2006 Instrumental, Inc.SM
Slide 84 of 109
Application
Performance
Requirements for High
Performance Applications
© Copyright 2006 Instrumental, Inc.SM
Slide 85 of 109
Where Do I Start?
First - have a well tuned application
 Ask questions
 Performance tools
 Big I/O requests
 Aligned I/O requests to the RAID controller
 Asynchronous I/O or threaded I/O
 Using large pages where possible
 Having I/O requests aligned on memory page
boundaries
 Sometimes all of this is not possible
© Copyright 2006 Instrumental, Inc.SM
Slide 86 of 109
Performance and the OS
A number of performance tools are
available to help you understand I/O
performance
This is often system information but
since your application is using the OS to
do I/O, this is necessary
© Copyright 2006 Instrumental, Inc.SM
Slide 87 of 109
Performance Monitoring Tools
Every vendor and OS has tools or
documentation on collecting system data
Linux
 http://redhat.activeventure.com/9/systemadminprimer/ch-resource.html
AIX
 http://www.redbooks.ibm.com/redbooks.nsf/0/fb7279fe513d39ef85256f1d0061e5f1?OpenDocument
SGI
 http://techpubs.sgi.com/library/tpl/cgibin/download.cgi?coll=linux&db=bks&docnumber=0074580-001
© Copyright 2006 Instrumental, Inc.SM
Slide 88 of 109
Performance Testing Tools
XDD
 http://www.ioperformance.com/
 My personal favorite
IOZONE
 http://www.iozone.org/
 A number of system- and file-system-specific features embedded in the source code
Others (such as bonnie)
 The key is support for I/O like your application with low
overhead communication across multiple nodes
© Copyright 2006 Instrumental, Inc.SM
Slide 89 of 109
Performance Analysis
Here are the suggested steps to find out
what works best on the system that you
are using
The following is a check list of the data
you will need if you want to achieve
maximum I/O performance
© Copyright 2006 Instrumental, Inc.SM
Slide 90 of 109
Check List
In the last section we reviewed all of the
tunables for the data path but here is a
good list of questions:
 What is the largest I/O request that can be issued to the OS?
 What are the VM stripe size settings?
 How many RAID devices are in a stripe?
 What is the configuration of the RAID LUN(s)?
  Hopefully they are all the same
Of course, all the other things covered in
the last section could be issues
© Copyright 2006 Instrumental, Inc.SM
Slide 91 of 109
Performance Analysis Step 1
Use xdd with request sizes preferably the
same size as the I/O from your
application from a single node
 Try 1x, 2x, 4x, 8x etc.

You might need to run multiple times or use very large files given
possible contention
Monitor the system performance to
understand
 I/O request size that the system sees using tools like sar
or iostat
 Monitor the performance of xdd
© Copyright 2006 Instrumental, Inc.SM
Slide 92 of 109
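xdd is the better tool for this step; purely to illustrate the request-size sweep (1x, 2x, 4x, …) in the same spirit, here is an added C sketch. The file name and sizes are arbitrary, and because the writes go through the page cache the absolute numbers will be optimistic compared with what xdd reports.

/* Time sequential writes at 1x, 2x, 4x, 8x of a base request size. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    const size_t base  = 256 * 1024;           /* 256 KB base request    */
    const size_t total = 256 * 1024 * 1024;    /* move 256 MB per pass   */
    char *buf = malloc(base * 8);
    if (buf == NULL)
        return 1;
    memset(buf, 'x', base * 8);

    for (size_t mult = 1; mult <= 8; mult *= 2) {
        size_t req = base * mult;
        int fd = open("sweep.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        double t0 = now_sec();
        for (size_t done = 0; done < total; done += req)
            if (write(fd, buf, req) < 0) {
                perror("write");
                return 1;
            }
        fsync(fd);                              /* include the flush time */
        double secs = now_sec() - t0;
        printf("%8zu KB requests: %.1f MB/sec\n",
               req / 1024, (total / 1048576.0) / secs);
        close(fd);
    }
    free(buf);
    return 0;
}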
Sar/Iostat

Device:  tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
dev3-0   3.09  3.34        58.69       646462    11375230

Divide the blocks read and written by the read and write rates:
 Average read size:  646462 / 3.34 = 193551
 Average write size: 11375230 / 58.59 = 194149

© Copyright 2006 Instrumental, Inc.SM
Slide 93 of 109
Performance Analysis Step 2
Next try a request equal to the VM stripe
size
 1x, 2x 4x, 8x etc

You might need to run multiple times or use very large files given
possible contention
Monitor the system performance to
understand
 I/O request size that the system sees using tools like sar
and iostat
 Monitor the performance of xdd
© Copyright 2006 Instrumental, Inc.SM
Slide 94 of 109
Performance Analysis Step 3
At this point you should be able to
understand the performance for a single
node and know what is the best request
size for your application on that machine
 Now we scale to multiple nodes which xdd supports
 Now repeat Test 1 with your application request size on
multiple nodes using a reasonable percentage of nodes
similar to your running job

You will likely not be able to monitor the performance
 Now repeat the best case request size
 Monitor a few nodes to ensure that the request sizes are
what is expected
© Copyright 2006 Instrumental, Inc.SM
Slide 95 of 109
Fragmentation Performance Gotcha
Data fragmentation is a major issue for
most file systems
When data is not written with sequential
block addresses it is fragmented
 Causes include the file system's allocation implementation, multiple writes to the same file system at the same time, or a file system that was filled up with most of its allocations already used
 Lots of other reasons
© Copyright 2006 Instrumental, Inc.SM
Slide 96 of 109
Fragmented? - How Do You Know?
Fragmentation is hard to determine
without having a performance baseline
 Often times acceptance tests are run to determine the
file system performance
 Performance tends to degrade over time
 Without data over time it is hard to determine whether fragmentation is impacting performance, and more often how much it is
© Copyright 2006 Instrumental, Inc.SM
Slide 97 of 109
What Works
The following slides contain common
rules of thumb that almost always
improve application performance
 These need to be considered in context of the system
configuration
 If the system is configured for small requests, large
requests will have limited impact
 In some cases you do not get the expected behavior
  For example, direct I/O will not always improve performance with some file system implementations
  To me this is counterintuitive
© Copyright 2006 Instrumental, Inc.SM
Slide 98 of 109
Big I/O Requests
Large requests are efficient for the
 OS
  Fewer system calls
 File system
  Fewer allocations
  Often less fragmentation
Large requests are much more efficient for the RAID and disks
 This assumes that the system is configured for large blocks
© Copyright 2006 Instrumental, Inc.SM
Slide 99 of 109
Aligned Requests to the RAID Controller
RAID does not have to read-modify-write
 Far better efficiency for the channels on the RAID
More efficient usage of the RAID cache
Fewer disk seeks and missed revolutions
 This will improve efficiency for all applications
© Copyright 2006 Instrumental, Inc.SM
Slide 100 of 109
Asynchronous I/O or Threaded I/O?
Better utilization of the RAID given the
increased latency in the data path
 In general 2-3 threads each writing large aligned
requests are needed for each RAID device being used
given the latency

If you have 3 RAID devices in a stripe that could be up to 9
threads or three threads with VERY large requests
 The 3 large requests will be broken up based on the stripe size in the volume manager
 Making huge requests because of a large number of RAID
devices can have a large impact on some operating systems
as they cannot handle these large requests
– Sometimes this is a page management problem
© Copyright 2006 Instrumental, Inc.SM
Slide 101 of 109
Large Pages
Large pages reduce system overhead
 The overhead required to lock and unlock pages while I/O is happening
This is not as important for buffered I/O
 But it significantly reduces system overhead for direct I/O
Some impact on I/O performance
 But mostly a reduction in system overhead
© Copyright 2006 Instrumental, Inc.SM
Slide 102 of 109
Conclusions
Final Thoughts
© Copyright 2006 Instrumental, Inc.SM
Slide 103 of 109
I/O Performance
I/O performance increases have not kept
pace with CPU and memory increases
over the years
I/O latency issues are on the rise
Tuning I/O has become critical to keep up
with CPU and memory performance
© Copyright 2006 Instrumental, Inc.SM
Slide 104 of 109
Consider
Tuning Application I/O
 Using system calls vs. standard C library I/O calls
 Vendor specific I/O libraries
Tuning System I/O
 Get a good handle on your data path
 Are you dealing with small or large block requests?
 Tune the system accordingly


Round robin vs. striped allocation
Separate data & metadata where possible and where it makes sense
© Copyright 2006 Instrumental, Inc.SM
Slide 105 of 109
Consider (cont.)
Tuning I/O is not easy
 Enormous number of variables
 Many different applications
 Tools are not the greatest
 Different products from different vendors are not well
integrated and often times training and documentation
are lacking
© Copyright 2006 Instrumental, Inc.SM
Slide 106 of 109
Short Term Outlook
Scaling of I/O performance will continue
to be the primary challenge given the
baggage we carry
 POSIX limitations
 H/W scaling issues
 Lack of standards

File system and VM interfaces are not documented
© Copyright 2006 Instrumental, Inc.SM
Slide 107 of 109
Longer Term Outlook
A number of USG agencies and the storage community have begun to realize that I/O scaling and data management are a serious problem
 Standards such as OSD could improve the situation but will not fully solve the problem
There is no easy solution and nothing on the horizon that will fix the problem
 Holographic storage has been our "savior in 3-5 years" for 15 years
Fixing this I/O problem is going to be hard work, and no breakthrough technology is expected for the foreseeable future
© Copyright 2006 Instrumental, Inc.SM
Slide 108 of 109
Thank You
© Copyright 2006 Instrumental, Inc.SM
Slide 109 of 109