
Safe and Efficient Cluster
Communication in Java using
Explicit Memory Management
Chi-Chao Chang
Dept. of Computer Science
Cornell University
Goal
High-performance cluster computing with safe languages
- parallel and distributed applications
Use off-the-shelf technologies
- Java
  - safe: “better C++”
  - “write once run everywhere”
  - growing interest for high-performance applications (Java Grande)
- User-level network interfaces (UNIs)
  - direct, protected access to network devices
  - prototypes: U-Net (Cornell), Shrimp (Princeton), FM (UIUC)
  - industry standard: Virtual Interface Architecture (VIA)
  - cost-effective clusters: new 256-processor cluster @ Cornell TC
2
Java Networking
Traditional “front-end” approach
- pick favorite abstraction (sockets, RMI, MPI) and Java VM
- write a Java front-end to custom or existing native libraries
- good performance, re-use proven code
- magic in native code, no common solution
Interface Java with network devices
- bottom-up approach
- minimizes amount of unverified code
- focus on fundamental data transfer inefficiencies due to:
  1. Storage safety
  2. Type safety
[Diagram: layered stack — Apps; RMI, RPC; Sockets; Active Messages, MPI, FM; Java above the UNI, C below; Networking Devices at the bottom]
3
Outline
Thesis Overview
- GC/Native heap separation, object serialization
Experimental Setup: VI Architecture and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to VI Architecture
- respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
- Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case For Specialization
- micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: in-place de-serialization
- micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions
4
(1) Storage Safety
Java programs are garbage-collected
- no explicit de-allocation: GC tracks and frees garbage objects
- programs are oblivious to the GC scheme used: non-copying (e.g. conservative) or copying
- no control over location of objects
Modern network and I/O devices
- direct DMA from/into user buffers
- native code is necessary to interface with hardware devices
5
(1) Storage Safety
Result: hard separation between the GC and native heaps
[Diagrams: (a) Hard Separation: Copy-on-demand — data is copied between the GC heap and a pinned buffer in the native heap before DMA to the NI; (b) Optimization: Pin-on-demand — the GC is turned off while pages in the GC heap are pinned for DMA]
Pin-on-demand only works for send/write operations
- for receive/read operations, GC must be disabled indefinitely...
6
(1) Storage Safety: Effect
[Figure: throughput (MB/s, 0-80) vs. transfer size (KBytes, 0-32) for C raw, Java copy, and Java pin]
Best-case scenario: 10-40% hit in throughput
- pick your favorite JVM, your fastest network interface, and a pair of 450MHz P-IIs with a commodity OS
- pinning on demand is expensive...
7
(2) Type Safety
Cannot forge a reference to a Java object
- b is an array of bytes
in C:
double *data = (double *)b;
in Java:
double[] data = new double[1024/8];
for (int i = 0, off = 0; i < 1024/8; i++, off += 8) {
  int upper = (((b[off]   & 0xff) << 24) +
               ((b[off+1] & 0xff) << 16) +
               ((b[off+2] & 0xff) << 8)  +
                (b[off+3] & 0xff));
  int lower = (((b[off+4] & 0xff) << 24) +
               ((b[off+5] & 0xff) << 16) +
               ((b[off+6] & 0xff) << 8)  +
                (b[off+7] & 0xff));
  data[i] = Double.longBitsToDouble((((long)upper) << 32) +
                                    (lower & 0xffffffffL));
}
8
(2) Type Safety
Objects have meta-data
- runtime safety checks (array-bounds, array-store, casts)
In C:
struct Buffer { int len; char data[1]; };
struct Buffer *b = malloc(sizeof(struct Buffer) + 1024);
b->len = 1024;
In Java:
class Buffer {
  int len; byte[] data;
  Buffer(int n) { data = new byte[n]; len = n; }
}
Buffer b = new Buffer(1024);
[Diagram: the C buffer is one flat block (len = 1024 followed by the data); the Java Buffer is an object with vtable and lock words whose data field points to a separate byte[] object with its own vtable and lock words]
9
(2) Type Safety
Result: Java objects need to be serialized and de-serialized across the network
[Diagram: objects in the GC heap are serialized (GC on) into a pinned buffer in the native heap (copy) before DMA to the NI]
10
(2) Type Safety: Effect
[Figure: round-trip latency (us, 0-1600) vs. transfer size (KBytes, 0-8) for C raw, Java copy, and Java RMI copy]
Performance hit of one order of magnitude:
- pick your favorite high-level communication abstraction (e.g. Remote Method Invocation)
- pick your favorite JVM, your fastest network interface, and a pair of 450MHz P-IIs
11
Thesis
Use explicit memory management to improve Java communication performance
Jbufs: safe and explicit management of Java buffers
- softens the GC/Native heap separation
- preserves type and storage safety
- “zero-copy” array transfers
Jstreams: extends Jbufs for optimizing serialization in clusters
- “zero-copy” de-serialization of arbitrary objects
[Diagram: pinned buffers are DMA'd directly by the NI; whether each buffer sits inside or outside the GC heap is user-controlled]
12
Outline
Thesis Overview
- GC/Native heap separation, object serialization
Experimental Setup: Giganet cluster and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to VI Architecture
- respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
- Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case For Specialization
- micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: in-place de-serialization
- micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions
13
Giganet Cluster
Configuration
- 8 P-II 450MHz, 128MB RAM
- 8 1.25 Gbps Giganet GNN-1000 adapters
- one Giganet switch
GNN-1000 adapter: user-level network interface
- Virtual Interface Architecture implemented as a library (Win32 DLL)
Baseline point-to-point performance
- 14us round-trip latency, 16us with switch
- over 100 MBytes/s peak, 85 MBytes/s with switch
14
Marmot
Java system from Microsoft Research
- not a VM
- static compiler: bytecode (.class) to x86 (.asm)
- linker: .asm files + runtime libraries -> executable (.exe)
- no dynamic loading of classes
- most Dragon-book optimizations, some OO and Java-specific optimizations
Advantages
- source code
- good performance
- two types of non-concurrent GC (copying, conservative)
- native interface “close enough” to JNI
15
Outline
Thesis Overview
- GC/Native heap separation, object serialization
Experimental Setup: Giganet cluster and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to VI Architecture
- respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
- Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case For Specialization
- micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: in-place de-serialization
- micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions
16
Javia-I
Basic Architecture
- respects heap separation: buffer mgmt in native code
- primitive array transfers only
- copying GC disabled while in native code
- Marmot as an “off-the-shelf” system
Send/Recv API
- blocking and non-blocking send/recv
- byte array refs passed through a per-Vi ticket ring
- variants: pin-on-demand, alloc-recv (allocates a new array on demand), bypass ring accesses
- cannot eliminate copying during recv
[Diagram: on the Java side, a Vi object and its send/recv ticket ring hold byte array refs in the GC heap; on the C side, descriptors, send/recv queues, and buffers sit directly over VIA]
17
Javia-I: Performance
Basic costs (P-II 450, Windows 2000 b3):
- pin + unpin = (10 + 10)us, or ~5000 machine cycles
- Marmot: native call = 0.28us, locks = 0.25us, array alloc = 0.75us
Latency (N = transfer size in bytes):
  raw               16.5us + (25ns) * N
  pin(s)            38.0us + (38ns) * N
  copy(s)           21.5us + (42ns) * N
  copy(s)+alloc(r)  18.0us + (55ns) * N
BW: 75% to 85% of raw for 16 KBytes
[Figures: round-trip latency (us, 0-8 KBytes) and bandwidth (MB/s, 0-32 KBytes) for raw, copy(s), pin(s), copy(s)+alloc(r), and pin(s)+alloc(r)]
18
jbufs
Goal
- provide buffer management capabilities to Java without violating its safety properties
- re-use is important: amortizes high pinning costs
jbuf: exposes communication buffers to Java programmers
1. lifetime control: explicit allocation and de-allocation
2. efficient access: direct access as primitive-typed arrays
3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap
- heap separation becomes soft and user-controlled
19
jbufs: Lifetime Control
public class jbuf {
  public static jbuf alloc(int bytes);  /* allocates jbuf outside of GC heap */
  public void free() throws CannotFreeException;  /* frees jbuf if it can */
}
[Diagram: a handle outside the GC heap points to the jbuf]
1. jbuf allocation does not result in a Java reference to it
   - cannot access the jbuf from the wrapper object
2. jbuf is not automatically freed if there are no Java references to it
   - free has to be explicitly called
20
jbufs: Efficient Access
public class jbuf {
  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException;  /* hands out byte[] ref */
  public int[] toIntArray() throws TypedException;    /* hands out int[] ref */
  . . .
}
[Diagram: a Java byte[] ref in the GC heap points into the jbuf outside it]
3. (Storage safety) jbuf remains allocated as long as there are array references to it
   - when can we ever free it?
4. (Type safety) jbuf cannot have two differently typed references to it at any given time
   - when can we ever re-use it (e.g. change its reference type)?
21
jbufs: Location Control
public class jbuf {
  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb);  /* app intends to free/re-use jbuf */
}
Idea: use GC to track references
unRef: application claims it has no references into the jbuf
- jbuf is added to the GC heap
- GC verifies the claim and notifies application through callback
- application can now free or re-use the jbuf
Required GC support: change scope of GC heap dynamically
[Diagram: after unRef, the jbuf and its Java byte[] ref are added to the GC heap; once the GC verifies that no references remain, the callBack fires and the jbuf leaves the GC heap again]
22
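A minimal usage sketch of the lifecycle these three slides define; the CallBack method signature and the omitted send step are assumptions for illustration, not the thesis API:

jbuf buf = jbuf.alloc(16 * 1024);    // lives outside the GC heap
int[] data = buf.toIntArray();       // typed view: buf is now ref<int>
for (int i = 0; i < data.length; i++)
  data[i] = i;                       // direct access, no staging copy
// ... transfer the buffer (send step omitted) ...
data = null;                         // drop the application's reference
buf.unRef(new CallBack() {           // claim: no refs into buf remain
  public void callBack(jbuf b) {
    b.free();                        // GC verified the claim: safe to free
  }
});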
jbufs: Runtime Checks
[State diagram: alloc puts the jbuf in the unref state, from which free is allowed; to<p>Array moves it to ref<p> (to<p>Array and GC loop there); unRef moves it to to-be-unref<p> (to<p>Array and unRef loop there); GC* returns it to unref]
Type safety: ref and to-be-unref states parameterized by primitive type
GC* transition depends on the type of garbage collector
- non-copying: transition only if all refs to array are dropped before GC
- copying: transition occurs after every GC
23
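A toy model of the state machine above, assuming a simple state field and a Runnable callback; illustrative only, since real jbufs wrap pinned native memory and need collector support for the GC* transition:

/* Toy model of the jbuf runtime checks (not the thesis code). */
public class JbufStateModel {
  enum State { UNREF, REF, TO_BE_UNREF }
  private State state = State.UNREF;
  private Class<?> refType;                 // the <p> in ref<p>
  private byte[] storage = new byte[1024];  // stands in for pinned memory

  public byte[] toByteArray() {             // to<p>Array with p = byte
    if (state == State.UNREF) {
      state = State.REF; refType = byte.class;
    } else if (refType != byte.class) {
      throw new IllegalStateException("differently typed view exists");
    }
    return storage;
  }
  public void unRef(Runnable cb) {          // app claims no refs remain
    state = State.TO_BE_UNREF;
    // GC* would verify the claim; the toy model just assumes it holds:
    state = State.UNREF;
    cb.run();
  }
  public void free() {
    if (state != State.UNREF)
      throw new IllegalStateException("cannot free: views may exist");
    storage = null;                         // real jbuf unpins/releases
  }
}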
Javia-II
Exploiting jbufs
- explicit pinning/unpinning of jbufs
- only non-blocking send/recvs
[Diagram: on the Java side, a Vi object and its send/recv ticket ring track jbuf state and array refs in the GC heap, with the jbufs themselves outside it; on the C side, descriptors and send/recv queues sit directly over VIA]
24
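To make the flow concrete, a hedged sketch of a zero-copy send over Javia-II with a jbuf; pin/unpin on the buffer and sendPost/sendWait on the Vi are illustrative names, not the actual Javia-II API:

// Hypothetical Javia-II-style zero-copy send (names are assumptions).
jbuf buf = jbuf.alloc(8 * 1024);
buf.pin();                          // explicit pinning for DMA
int[] payload = buf.toIntArray();   // fill the buffer in place
payload[0] = 42;
vi.sendPost(buf, 4);                // non-blocking: DMA straight from buf
vi.sendWait();                      // wait for send completion
buf.unpin();
buf.unRef(callback);                // hand the buffer back for re-use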
Javia-II: Performance
Basic jbuf costs:
- allocation = 1.2us, to*Array = 0.8us, unRefs = 2.3us, GC degradation = 1.2us/jbuf
Latency (n = transfer size in bytes):
  raw      16.5us + (0.025us) * n
  jbufs    20.5us + (0.025us) * n
  pin(s)   38.0us + (0.038us) * n
  copy(s)  21.5us + (0.042us) * n
BW within 1% of raw
[Figures: round-trip latency (us, 0-8 KBytes) and bandwidth (MB/s, 0-32 KBytes) for raw, jbufs, copy, and pin]
25
MM: Communication
[Figure: pMM communication time (msecs), split into comm and barrier, for 64x64 and 256x256 matrices on 8 procs, comparing copy-alloc, copy-async, pin-alloc, pin-async, jbufs, jdk copy-alloc, and jdk copy-async; bars annotated with comm percentages (13%-85%)]
pMM over Javia-II/jbufs spends at least 25% less in communication for 256x256 matrices on 8 processors
26
MM: Overall
[Figure: pMM MFLOPS for 64x64 and 256x256 matrices on 2, 4, and 8 procs, comparing copy-alloc, copy-async, pin-alloc, pin-async, jbufs, jdk copy-alloc, and jdk copy-async]
Cache effects: better communication performance does not always translate to better overall performance
27
Active Messages
Exercising jbufs:
- user supplies a list of jbufs
- upon message arrival:
  - jbuf passed to handler
  - unRef is invoked after handler invocation
  - if pool is empty, reclaim existing ones
- copying deferred to GC-time, only if needed

class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}

class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
28
AM: Performance
Latency about 15us higher than Javia
- synch access to buffer pool, endpoint header, flow control checks, handler id lookup
BW within 10% of peak for 16 KByte messages
[Figures: round-trip latency (us, 0-8 KBytes) and bandwidth (MB/s, 0-32 KBytes) for raw, jbufs, AM jbuf, AM copy, AM pin, and AM copy-alloc]
29
Jbufs: Experience
Efficient access through arrays is useful:
- no indirect access via method invocation
- promotes code re-use of large numerical kernels
- leverages compiler infrastructure for eliminating safety checks
Limitations
- still not as flexible as C buffers
- stale references may confuse programmers
Discussed in thesis:
- the necessity of explicit de-allocation
- implementation of Jbufs in Marmot’s copying collector
- impact on conservative and generational collectors
- extension to JNI to allow “portable” implementations of Jbufs
30
Outline
Thesis Overview
- GC/Native heap separation, object serialization
Experimental Setup: VI Architecture and Marmot
Part I: Array Transfers
(1) Javia-I: Java Interface to VI Architecture
- respects heap separation
(2) Jbufs: Safe and Explicit Management of Buffers
- Javia-II, matrix multiplication, Active Messages
Part II: Object Transfers
(3) A Case For Specialization on Homogeneous Clusters
- micro-benchmarks, RMI using Javia-I/II, impact on application suite
(4) Jstreams: in-place de-serialization
- micro-benchmarks, RMI using Javia-III, impact on application suite
Conclusions
31
Object Serialization and RMI
Standard JOS protocol
- “heavy-weight” class descriptors are serialized along with objects
- type-checking: classes need not be “equal”, just “compatible”
- protocol allows for user extensions
Remote Method Invocation
- object-oriented version of Remote Procedure Call
- relies on JOS for argument passing
- actual parameter object can be a sub-class of the formal parameter class
[Diagram: writeObject copies objects out of the sender's GC heap onto the network; readObject rebuilds them in the receiver's GC heap]
32
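For reference, the standard JOS path benchmarked on the next slide looks like this; plain java.io, nothing thesis-specific:

import java.io.*;

// Standard Java Object Serialization: writeObject emits class
// descriptors and copies the object graph; readObject re-allocates it.
class JosExample {
  public static void main(String[] args) throws Exception {
    double[] data = new double[62];
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(data);                // serialize: copy + descriptors
    out.flush();
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()));
    double[] copy = (double[]) in.readObject();  // allocate + copy back
    System.out.println(copy.length);      // 62
  }
}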
JOS Costs
[Figure: writeObject and readObject costs (us) under jview, jdk, and marmot for byte[] 100, double[] 12, byte[] 500, double[] 62, complex[] (per elem), and list (4 and 160 per elem); several bars run off the ~70us scale (86, 93, 117, 271, 275us)]
1. overheads in tens or hundreds of us:
   - send/recv overheads = ~3us, memcpy of 500 bytes = ~0.8us
2. double[] 50% more expensive than byte[] of similar size
3. overheads grow as object sizes grow
33
Impact of Marmot
Impact of Marmot’s optimizations:
- method inlining: up to 66% improvement (already deployed)
- no synchronization whatsoever: up to 21% improvement
- no safety checks whatsoever: up to 15% combined
Better compilation technology is unlikely to reduce overheads substantially
34
Impact on RMI
[Figure: round-trip latency (us, 0-2000) vs. transfer size (KBytes, 0-8) for raw, jbufs, RMI jbufs, RMI copy+alloc, RMI copy, RMI pin, and jdk RMI copy]

4-byte RMI round-trip latency (us):
  jbufs        150.4
  copy+alloc   161.9
  copy         164.5
  pin          211.8
  jdk copy     271.0
  sockets      482.3
  jdk sockets  520.1

Order of magnitude worse than Javia-I/II
- round-trip latency drops to about 30us in a null RMI: no JOS!
- peak bandwidth of 22 MBytes/s, about 25% of raw
35
Impact on Applications
Application    % comm time (est.)   % total time (est.)
SOR            11.76%               2.73%
EM3D arrays    10.90%               5.22%
FFT complex    14.28%               13.73%
FFT arrays     1.42%                1.37%
pMM            7.64%                5.20%

A case for specializing serialization for cluster applications:
- overheads an order of magnitude higher than send/recv and memcpy
- RMI performance degraded by one order of magnitude
- 5-15% “estimated” impact on applications
- old adage: “specialize for the common case”
36
Optimizing De-serialization
“In-place” object de-serialization
- specialization for homogeneous clusters and JVMs
Goal
- eliminate copying and allocation of objects
Challenges
- preserve the integrity of the receiving JVM
- permit de-serialization of arbitrary Java objects with unrestricted usage and without special annotations
- independent of a particular GC scheme
[Diagram: writeObject serializes from the sender's GC heap onto the network; the receiver maps the incoming buffer into its GC heap instead of copying]
37
Jstreams: write
public class Jstream extends Jbuf {
  public void writeObject(Object o)  /* serializes o onto the stream */
    throws TypedException, ReferencedException;
  public void writeClear()           /* clears the stream for writing */
    throws TypedException, ReferencedException;
}
writeObject
- deep-copy of objects: maintains in-memory layout
- deals with cyclic data structures
- swizzle pointers: offsets to a base address
- replace object meta-data with 64-bit class descriptor
- optimization: primitive-typed arrays in jbufs are not copied
38
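A toy illustration of the pointer-swizzling idea above (references stored as offsets to a base address), using a ByteBuffer as the stream; this is illustrative only, not the Jstreams wire format:

import java.nio.ByteBuffer;

// Toy pointer swizzling: a linked list laid out in one buffer, with
// each "next" reference stored as an offset from the buffer's base
// (offset -1 = null). The receiver can follow offsets in place.
class SwizzleDemo {
  static final int NODE = 8;                // 4-byte value + 4-byte next
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(3 * NODE);
    for (int i = 0; i < 3; i++) {
      buf.putInt(i * NODE, i * 10);         // value
      int next = (i < 2) ? (i + 1) * NODE : -1;
      buf.putInt(i * NODE + 4, next);       // swizzled "pointer"
    }
    // "De-serialization" = chasing offsets: no copy, no allocation.
    for (int off = 0; off != -1; off = buf.getInt(off + 4))
      System.out.println(buf.getInt(off));  // prints 0, 10, 20
  }
}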
Jstreams: read
public class Jstream extends Jbuf {
  public Object readObject() throws TypedException;  /* de-serialization */
  public boolean isJstream(Object o);  /* checks if o resides in the stream */
}
readObject
- replace class descriptors with meta-data
- unswizzle pointers, array-bounds checking
- after first readObject, add jstream to GC heap
  - tracks references coming out of read objects
  - unRef: user is willing to free or re-use
[Diagram: after readObject the jstream joins the receiver's GC heap; unRef plus the GC's callBack move it back out for re-use]
39
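A hedged usage sketch of the two halves above; only writeObject, writeClear, readObject, and unRef come from the slides, while alloc, the stream variables, and the omitted transport step are illustrative:

// Sender: serialize once into a re-usable stream (sketch).
Jstream out = Jstream.alloc(32 * 1024);   // alloc inherited from Jbuf
out.writeObject(list);                    // deep-copy, swizzle, descriptors
// ... post the stream's buffer for DMA send (transport omitted) ...
out.writeClear();                         // re-arm for the next message

// Receiver: de-serialize in place (sketch).
Object o = in.readObject();               // unswizzle; stream joins GC heap
consume(o);                               // o lives inside the stream buffer
in.unRef(cb);                             // GC callback enables re-use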
jstreams: Runtime Checks
[State diagram: alloc puts the jstream in the unref state, from which free is allowed; writeObject moves it to write mode (writeObject and GC loop there) and writeClear returns it to unref; readObject moves it to read mode (readObject and GC loop there); unRef moves it to to-be-unref (unRef loops there); GC* returns it to unref]
Modification to Javia-II: prevent DMA from clobbering de-serialized objects
- receive posts not allowed if jstream is in read mode
- no changes to Javia-II architecture
40
jstream: Performance
[Figure: readObject costs (us) for JOS jdk, JOS marmot, jstreams marmot, and jstreams (C) across byte[] 100, byte[] 500, double[] 62, list 4 p/e, and list 160 p/e; JOS bars run off the ~30us scale (39, 55, 86us)]
De-serialization costs constant w.r.t. object size
- 2.6us for arrays, 3.3us per list element
41
jstream: Impact on RMI
[Figure: bandwidth (MB/s, 0-80) vs. transfer size (KBytes, 0-32) for raw, Javia-II, AM Javia-II, RMI Javia-III, and RMI Javia-I]
4-byte round-trip latency of 45us (25us higher than Javia-II)
52 MBytes/s for 16 KByte arguments
42
jstream: Impact on Applications
Application    JOS comm  JOS total  jstreams     jstreams      % improv.  % improv.  % improv.    % improv.
               (secs)    (secs)     comm (secs)  total (secs)  comm       total      comm (est.)  total (est.)
SOR            4.59      19.78      3.99         19.08         13.20%     3.52%      11.76%       2.73%
EM3D arrays    2.20      4.60       1.99         4.37          9.50%      4.85%      10.90%       5.22%
FFT complex    18.30     19.03      16.16        17.26         11.70%     9.30%      14.28%       13.73%
FFT arrays     14.82     15.36      14.29        14.83         3.57%      3.40%      1.42%        1.37%
pMM            190.58    280.00     170.91       307.80        10.32%     -9.93%     7.64%        5.20%

3-10% improvement in SOR, EM3D, FFT
10% hit in pMM performance
- over 22,000 incoming RMIs, 1000 jstreams in receive pool, ~26 garbage collections: 15% of total execution time in GC
- generational collection will alleviate GC costs substantially
- receive pool size is hard to tune: tradeoffs between GC and locality
43
Jstreams: Experience
Implementation of readObject and writeObject integrated into the JVM
- protocol is JVM-specific
- native implementation is faster
Limitations
- not as flexible as Java streams: cannot read and write at the same time
- no “extensible” wire protocols
Discussed in thesis:
- implementation of Jstreams in Marmot’s copying collector
- support for polymorphic RMI: minor changes to the stub compiler
- JNI extensions to allow “portable” implementations of Jstreams
44
Related Work
Microsoft J-Direct
- “pinned” arrays defined using source-level annotations
- JIT produces code to “redirect” array access: expensive
Berkeley’s Jaguar: efficient code generation with JIT extensions
- security concern: JIT “hacks” may break Java or byte-code
Custom JVMs
- many “tricks” are possible (e.g. pinned array factories, pinned and non-pinned heaps, etc.): depend on a particular GC scheme
- Jbufs: isolates the minimal support needed from the GC
Memory Management
- Safe Regions (Gay and Aiken): reference counting, no GC
Fast Serialization and RMI
- KaRMI (Karlsruhe): fixed JOS, ground-up RMI implementation
- Manta (Vrije U): fast RMI but a Java dialect
45
Summary
Use of explicit memory management to improve Java communication performance in clusters
- softens the GC/Native heap separation
- preserves type and storage safety
- independent of GC scheme
- jbufs: zero-copy array transfers
- jstreams: zero-copy de-serialization of arbitrary objects
Framework for building communication software and applications in Java
- Javia-I/II
- parallel matrix multiplication
- Jam: active messages
- Java RMI
- cluster applications: TSP, IDA, SOR, EM3D, FFT, and MM
46