Transcript Slides
StreamX10: A Stream Programming Framework on X10
Haitao Wei
2012-06-14
School of Computer Science, Huazhong University of Science and Technology
Outline
1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work
Background and Motivation
Stream programming
- A high-level programming model that has been productively applied
- Usually depends on the specific architecture, which makes stream programs difficult to port between platforms
X10
- A productive parallel programming environment
- Isolates the different architecture details
- Provides a flexible parallel programming abstraction layer for stream programming
StreamX10: an attempt to make stream programs portable by building on X10
COStream Language
- stream: a FIFO queue connecting operators
- operator: the basic functional unit, an actor node in the stream graph
  - Multiple inputs and multiple outputs
  - Window with pop, peek, and push operations
  - init and work functions
- composite: connected operators forming a subgraph of actors
- A stream program is composed of composites
COStream and Stream Graph

composite Main() {
  graph
    stream<int i> S = Source() {
      state: { int x; }
      init: { x = 0; }
      work: {
        S[0].i = x;
        x++;
      }
      window S: tumbling, count(1);
    }
    stream<int j> P = MyOp(S) {
      param pn: N;
    }
    () as SinkOp = Sink(P) {
      state: { int r; }
      work: {
        r = P[0].j;
        println(r);
      }
      window P: tumbling, count(1);
    }
}

composite MyOp(output Out; input In) {
  param pn;
  graph
    stream<int j> Out = Averager(In) {
      work: {
        int sum = 0, i;
        for (i = 0; i < pn; i++)
          sum += In[i].j;
        Out[0].j = sum / pn;
      }
      window In: sliding, count(10), count(1);
      window Out: tumbling, count(1);
    }
}

[Stream graph: Source (push=1) -> S -> Averager (peek=10, pop=1, push=1) -> P -> Sink (pop=1); the figure labels mark the stream, operator, and composite constructs and the attribute pn]
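The window semantics of this example can be simulated outside COStream. The following plain-Java sketch (WindowDemo is an illustrative name, not part of StreamX10) models the rates from the stream graph: Source push=1, Averager with a sliding window count(10), count(1) (peek=10, pop=1, push=1), and Sink pop=1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class WindowDemo {
    // Simulate `steps` firings of the Source and collect what the Sink pops.
    static List<Integer> run(int steps) {
        Deque<Integer> s = new ArrayDeque<>();  // stream S: Source -> Averager
        List<Integer> out = new ArrayList<>();  // what the Sink prints
        int x = 0;                              // Source state
        final int pn = 10;                      // window size (param pn)
        for (int step = 0; step < steps; step++) {
            s.addLast(x++);                     // Source work: push one token
            if (s.size() >= pn) {               // Averager fires once 10 tokens are visible
                int sum = 0, i = 0;
                for (int v : s) {               // peek pn tokens without removing them
                    sum += v;
                    if (++i == pn) break;
                }
                s.removeFirst();                // sliding window: pop only one token
                out.add(sum / pn);              // push the average; the Sink pops it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(12));            // prints [4, 5, 6]
    }
}
```

With integer division, the first three Averager firings see windows 0..9, 1..10, and 2..11, producing 4, 5, and 6.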
Compilation flow of StreamX10
Phrase
Function
Translates the COStream syntax into abstract syntax tree.
Front-end
Instantiation
Instantiates the composites hierarchically to static flattened
operators.
Constructs static stream graph from flattened operators.
Static Stream Graph
Scheduling
Calculates initialization and steady-state execution orderings of
operators.
Partitioning
Performs partitioning based on X10 parallelism models for load
balance.
Generates X10 code for COStream programs.
Code Generation
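For the running example, the Scheduling phase can be sketched as follows. This is an assumption-laden illustration, not the StreamX10 compiler's API: for one edge with rates push, peek, pop, the balance equation reps(producer) * push = reps(consumer) * pop yields the steady-state firing counts, and the init schedule must leave peek - pop tokens buffered so the first sliding-window peek succeeds.

```java
public class ScheduleDemo {
    // Returns {steady firings of producer, steady firings of consumer,
    // init firings of producer} for one edge with the given rates.
    // Illustrative formulation, not the StreamX10 compiler's code.
    static int[] schedule(int push, int peek, int pop) {
        int g = gcd(push, pop);
        int repsProducer = pop / g;              // balance: reps_p * push = reps_c * pop
        int repsConsumer = push / g;
        int initProducer = (peek - pop) / push;  // prefill so the first peek succeeds
        return new int[]{repsProducer, repsConsumer, initProducer};
    }

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    public static void main(String[] args) {
        // Source(push=1) -> Averager(peek=10, pop=1) from the running example
        int[] s = schedule(1, 10, 1);
        System.out.println("steady: Source x" + s[0] + ", Averager x" + s[1]);
        System.out.println("init: Source x" + s[2]);
    }
}
```

For the example pipeline this gives one firing of each operator per steady-state period and 9 init firings of the Source.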
The Execution Framework

[Figure: Places 0, 1 and 2, each running activities on a thread pool; local buffer objects carry intra-place data flow, a global buffer object carries inter-place data flow]

- The nodes are partitioned among the places
- Each node is mapped to an activity
- The nodes run in a pipelined fashion to exploit parallelism
- Local and global FIFO buffers are used
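The pipelined execution model can be sketched in plain Java, with threads standing in for X10 activities and a BlockingQueue standing in for the FIFO buffer (all names here are illustrative, not StreamX10 code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineDemo {
    // A two-stage pipeline: a producer activity feeds a consumer activity
    // through a bounded FIFO; returns everything the consumer took.
    static List<Integer> run(int n) {
        BlockingQueue<Integer> fifo = new ArrayBlockingQueue<>(4);
        List<Integer> out = new ArrayList<>();
        Thread source = new Thread(() -> {       // stands in for an `async` activity
            try {
                for (int x = 0; x < n; x++) fifo.put(x);   // push, blocks when full
            } catch (InterruptedException e) { }
        });
        Thread sink = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) out.add(fifo.take()); // pop, blocks when empty
            } catch (InterruptedException e) { }
        });
        source.start(); sink.start();            // the stages run concurrently
        try {
            source.join(); sink.join();          // like X10's `finish`: wait for both
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(5));              // prints [0, 1, 2, 3, 4]
    }
}
```

The bounded queue gives the back-pressure that keeps producer and consumer rates matched, the same role the local/global FIFO buffers play in the framework.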
Work Partition Inter-place

[Figure: a stream graph whose operator weights sum to 30 is cut into three parts, each with computation work 10; the cut edges carry total communication 2]

- Speedup: 30/10 = 3
- Communication: 2
- Objective: minimize communication and balance load (using Metis)
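The compiler delegates the real partitioning to Metis; the greedy pass below only illustrates the load-balance half of the objective, using hypothetical operator weights that sum to 30 as in the figure:

```java
public class PartitionDemo {
    // Greedy contiguous assignment of operator weights to places, filling
    // each place up to total/places. Illustration only; the compiler uses
    // Metis, which also minimizes the communication on cut edges.
    static int[] partition(int[] work, int places) {
        int total = 0;
        for (int w : work) total += w;
        int target = total / places;             // balanced work per place
        int[] load = new int[places];
        int place = 0;
        for (int w : work) {
            if (load[place] >= target && place < places - 1) place++;
            load[place] += w;
        }
        return load;
    }

    public static void main(String[] args) {
        // Hypothetical weights summing to 30
        int[] load = partition(new int[]{5, 5, 2, 2, 1, 5, 2, 5, 2, 1}, 3);
        for (int p = 0; p < load.length; p++)
            System.out.println("place " + p + ": work=" + load[p]);
        // each place gets work 10, so speedup = 30/10 = 3
    }
}
```

A balanced cut bounds the pipeline's steady-state period by the most loaded place, which is where the 30/10 = 3 speedup on the slide comes from.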
Global FIFO implementation

[Figure: the producer at Place 0 pushes into a local array; full buffers are copied into a DistArray and then into the consumer's local array at Place 1, where peek/pop fetch the data]

- Each producer/consumer has its own local buffer
- The producer uses the push operation to store data into its local buffer
- The consumer uses the peek/pop operations to fetch data from its local buffer
- When the local buffer is full/empty, the data is copied automatically
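The batched-copy idea can be sketched in plain Java, with a shared queue of arrays standing in for X10's DistArray (names and sizes here are illustrative; the count is assumed to be a multiple of the local buffer size for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class GlobalFifoDemo {
    static final int N = 4;                      // local buffer size

    // The producer fills a local array and copies it out in one block each
    // time it fills up; the consumer copies a block in when its local
    // buffer is empty and then serves pops locally.
    static List<Integer> transfer(int count) {
        List<Integer> out = new ArrayList<>();
        try {
            BlockingQueue<int[]> global = new ArrayBlockingQueue<>(count / N + 1);
            // producer side: push into the local buffer, copy out when full
            int[] local = new int[N];
            int idx = 0;
            for (int x = 0; x < count; x++) {
                local[idx++] = x;
                if (idx == N) { global.put(local.clone()); idx = 0; }
            }
            // consumer side: copy a block in when the local buffer is empty
            int[] in = new int[0];
            int pos = 0;
            for (int i = 0; i < count; i++) {
                if (pos == in.length) { in = global.take(); pos = 0; }
                out.add(in[pos++]);              // pop from the local copy
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(transfer(8));         // prints [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```

Copying whole blocks instead of single tokens amortizes the inter-place communication cost over N tokens, which is the point of the local/global buffer split.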
X10 code in the Back-end

//main.x10 control code
// Spawn an activity for each node at a place according to the partition
public static def main() {
  ...
  finish for (p in Place.places())
    async at (p) {
      switch (p.id) {
      case 0:
        val a_0 = new Source_0(rc);
        a_0.run();
        break;
      case 1:
        val a_2 = new MovingAver_2(rc);
        a_2.run();
        break;
      case 2:
        val a_1 = new Sink_1(rc);
        a_1.run();
        break;
      default: break;
      }
    }
  ...
}

//Source.x10 code
...
// Define the work function
def work() {
  ...
  push_Source_0_Sink_1(0).x = x;
  x += 1.0;
  pushTokens();
  popTokens();
}

// Call the work function in the initial and steady schedules
public def run() {
  initWork(); //init
  //initSchedule
  for (var j:Int = 0; j < Source_0_init; j++)
    work();
  //steadySchedule
  for (var i:Int = 0; i < RepeatCount; i++)
    for (var j:Int = 0; j < Source_0_steady; j++)
      work();
  flush();
}
...
Experimental Platform and Benchmarks
Platform
- Intel Xeon processor (8 cores) at 2.4 GHz with 4 GB memory
- Red Hat EL5 with Linux 2.6.18
- X10 compiler and runtime version 2.2.0
Benchmarks
- 11 benchmarks rewritten from StreamIt
The throughput comparison

[Bar chart: throughput of 4 configurations with NPLACES*NTHREADS = 8 (NPLACES=1, NTHREADS=8; NPLACES=2, NTHREADS=4; NPLACES=4, NTHREADS=2; NPLACES=8, NTHREADS=1), normalized to 1 place with 8 threads]

- For most benchmarks, CPU utilization increases from 24% to 89% as the number of places varies from 1 to 4, except for the benchmarks with a low computation/communication ratio
- The benefits are small or negative when the number of places increases from 4 to 8
Observation and Analysis
- The throughput goes up as the number of places increases, because multiple places increase the CPU utilization
- Multiple places expose parallelism but also bring more communication overhead
- Benchmarks with a larger computation workload, such as DES and Serpent_full, can still benefit from increasing the number of places
Conclusion
- We proposed and implemented StreamX10, a stream programming language and compilation system on X10
- A raw partitioning optimization was proposed to exploit parallelism based on the X10 execution model
- Preliminary experiments were conducted to study the performance
Future Work
- How to automatically choose the best configuration (number of places and number of threads) for each benchmark
- How to decrease the thread-switching overhead by mapping multiple nodes to a single activity
Acknowledgment
- X10 Innovation Award funding support
- QiMing Teng, Haibo Lin and David P. Grove at IBM for their help on this research