PPT - ECE 751 Embedded Computing Systems

Lecture 11: Memory Optimizations
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
High Performance Embedded Computing
© 2007 Elsevier
Topics






List scheduling
Loop Transformations
Global Optimizations
Buffers, Data Transfers, and Storage Management
Cache and Scratch-Pad Optimizations
Main Memory Optimizations
© 2006 Elsevier
List Scheduling


Given a data flow graph (DFG), how do you assign (schedule) operations to particular slots?
Schedule based on priority function




Compute each node's longest path to a leaf as its priority
Schedule nodes with longest path first
Keep track of readiness (based on result latency)
Goal: maximize ILP, fill as many issue slots with useful work as possible (minimize NOPs)
List Scheduling Example
Remaining nodes are listed as node(priority); one node is picked per step:
3(4), 1(4), 2(3), 4(2), 5(0), 6(0): pick 3
1(4), 2(3), 4(2), 5(0), 6(0): pick 1
2(3), 4(2), 5(0), 6(0): pick 2
4(2), 5(0), 6(0): pick 4
5(0), 6(0): pick 6 (5 not ready)
5(0): pick 5


Heuristic – no guarantee of optimality
For parallel pipelines account for structural hazards
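The scheduling loop described on these slides can be sketched in C. This is a minimal illustration under stated assumptions, not the textbook's implementation: the `Dfg` struct, `MAX_NODES`, `longest_path`, and `list_schedule` are hypothetical names, the machine is assumed to have one issue slot per cycle, the graph is assumed to be acyclic, and ties are broken in favor of the lowest node index.

```c
#include <assert.h>

#define MAX_NODES 16

/* Hypothetical DFG: adjacency matrix plus per-node result latency. */
typedef struct {
    int n;                           /* number of nodes */
    int latency[MAX_NODES];          /* cycles until a node's result is ready */
    int edge[MAX_NODES][MAX_NODES];  /* edge[u][v] != 0: v consumes u's result */
} Dfg;

/* Priority function: longest latency-weighted path from u to a leaf.
 * A leaf has priority 0. Assumes the graph is acyclic. */
static int longest_path(const Dfg *g, int u) {
    int best = 0;
    for (int v = 0; v < g->n; ++v) {
        if (g->edge[u][v]) {
            int len = g->latency[u] + longest_path(g, v);
            if (len > best) best = len;
        }
    }
    return best;
}

/* Fills issue_cycle[] for a single-issue machine and returns the number
 * of issue slots used, including slots left empty (NOPs). */
int list_schedule(const Dfg *g, int issue_cycle[]) {
    int prio[MAX_NODES];
    int scheduled = 0, cycle = 0;
    assert(g->n >= 0 && g->n <= MAX_NODES);
    for (int u = 0; u < g->n; ++u) {
        prio[u] = longest_path(g, u);
        issue_cycle[u] = -1;         /* -1 marks "not yet scheduled" */
    }
    while (scheduled < g->n) {
        int pick = -1;
        for (int v = 0; v < g->n; ++v) {
            if (issue_cycle[v] >= 0) continue;
            int ready = 1;           /* ready when all inputs have arrived */
            for (int u = 0; u < g->n; ++u)
                if (g->edge[u][v] &&
                    (issue_cycle[u] < 0 ||
                     issue_cycle[u] + g->latency[u] > cycle))
                    ready = 0;
            if (ready && (pick < 0 || prio[v] > prio[pick])) pick = v;
        }
        if (pick >= 0) { issue_cycle[pick] = cycle; ++scheduled; }
        ++cycle;                     /* no ready node: the slot is a NOP */
    }
    return cycle;
}
```

With multiple issue slots per cycle, the inner selection would repeat once per slot and would also have to check structural hazards, as the slide notes.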
Local vs. Global Scheduling

Single-entry, single-exit (e.g. basic block) scope
  Limited opportunity since only 4-5 instructions
Expand scope
  Unroll loop body, inline small functions
  Construct superblocks and hyperblocks, which are single-entry, multiple-exit sequences of blocks
Code motion across control-flow divergence
  Speculative, consider safety (exceptions) and state
  Predication is useful to nullify wrong-path instructions
Memory-oriented optimizations




Memory is a key bottleneck in many embedded systems.
Memory usage can be optimized at any level of the memory hierarchy.
Can target data or instructions.
Global memory analysis can be particularly useful.

It is important to size buffers between subsystems to avoid buffer overflow and wasted memory.
Loop transformations

Data dependencies may be within or between loop iterations. Ideal loops are fully parallelizable.

A loop nest has loops enclosed by other loops.

A perfect loop nest has no conditional statements.
Types of loop transformations






Loop permutation changes order of loops.
Index rewriting changes the form of the loop indexes.
Loop unrolling copies the loop body.
Loop splitting creates separate loops for operations in the loop body.
Loop fusion or loop merging combines loop bodies.
Loop padding adds data elements to an array to change how the array maps into memory.
Polytope model


Commonly used to represent data dependencies in loop nests.
Loop transformations can be modeled as matrix operations:

Each column represents iteration bounds.
[Figure: loop iteration space plotted on axes i and j]
Loop permutation




Changes the order of loop indices
Can help reduce the time needed to access matrix elements
2-D arrays in C are stored in row-major order
  Access the data row by row.
Example of matrix-vector multiplication
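A minimal sketch of what such a matrix-vector example might look like, assuming a small fixed size `N` and the hypothetical function names `matvec_col_order` and `matvec_row_order`. Both versions compute the same product; only the loop order differs, so the second one walks each row of `a` through consecutive addresses.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* illustrative size, chosen only for this sketch */

/* Column-by-column traversal: the inner loop moves down a column,
 * so successive accesses to a[][] are N elements apart in memory. */
void matvec_col_order(int a[N][N], const int x[N], int y[N]) {
    assert(a && x && y);
    for (size_t i = 0; i < N; ++i)
        y[i] = 0;
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            y[i] += a[i][j] * x[j];
}

/* After loop permutation: the inner loop moves along a row,
 * so successive accesses to a[][] are adjacent in memory. */
void matvec_row_order(int a[N][N], const int x[N], int y[N]) {
    assert(a && x && y);
    for (size_t i = 0; i < N; ++i) {
        y[i] = 0;
        for (size_t j = 0; j < N; ++j)
            y[i] += a[i][j] * x[j];
    }
}
```

Whether the permuted version is actually faster depends on the cache and the array size; for arrays that fit entirely in the cache the difference may be negligible.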
Loop fusion

Combines loop bodies
Original loops:
for (i = 0; i < N; i++)
    x[i] = a[i] * b[i];
for (i = 0; i < N; i++)
    y[i] = a[i] * c[i];

After loop fusion:
for (i = 0; i < N; i++) {
    x[i] = a[i] * b[i];
    y[i] = a[i] * c[i];
}
How might this help improve performance?
Buffer management [Pan01]




In embedded systems, buffers are often used to communicate between subsystems
Excessive dynamic memory management wastes cycles and energy with no functional improvement.
Many embedded programs use arrays that are statically allocated
Several loop transformations have been developed to make buffer management more efficient

Before:
for (i = 0; i < N; ++i)
    for (j = 0; j < N - L; ++j)
        b[i][j] = 0;
for (i = 0; i < N; ++i)
    for (j = 0; j < N - L; ++j)
        for (k = 0; k < L; ++k)
            b[i][j] += a[i][j + k];

After (the initialization of b[i][j] moves closer to its use):
for (i = 0; i < N; ++i)
    for (j = 0; j < N - L; ++j) {
        b[i][j] = 0;
        for (k = 0; k < L; ++k)
            b[i][j] += a[i][j + k];
    }
Buffer management [Pan01]



Loop analysis can help to make data reuse more explicit.
Buffers are declared in the program
Don’t need to exist in final implementation.

int a_buf[L];
int b_buf;
for (i = 0; i < N; ++i) {
    /* initialize a_buf */
    for (j = 0; j < N - L; ++j) {
        b_buf = 0;
        a_buf[(j + L - 1) % L] = a[i][j + L - 1];
        for (k = 0; k < L; ++k)
            b_buf += a_buf[(j + k) % L];
        b[i][j] = b_buf;
    }
}
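The buffered form can be checked against the direct form with a small self-contained sketch. The sizes `N` and `L`, the function names, and in particular the way `a_buf` is initialized (preloading the first L - 1 elements of the row) are assumptions made for this illustration; the slide only says "initialize a_buf".

```c
#include <assert.h>

#define N 8  /* illustrative array dimension */
#define L 3  /* illustrative window length */

/* Direct form: each output is the sum of L consecutive inputs. */
void window_sum_direct(int a[N][N], int b[N][N - L]) {
    assert(a && b);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N - L; ++j) {
            b[i][j] = 0;
            for (int k = 0; k < L; ++k)
                b[i][j] += a[i][j + k];
        }
}

/* Buffered form: a_buf is a circular buffer holding the current window,
 * so each element of a[][] is read from the array only once per row. */
void window_sum_buffered(int a[N][N], int b[N][N - L]) {
    int a_buf[L];
    int b_buf;
    assert(a && b);
    for (int i = 0; i < N; ++i) {
        /* Assumed initialization: preload the first L - 1 row elements. */
        for (int m = 0; m < L - 1; ++m)
            a_buf[m % L] = a[i][m];
        for (int j = 0; j < N - L; ++j) {
            b_buf = 0;
            /* Bring in the one new element, overwriting the oldest slot. */
            a_buf[(j + L - 1) % L] = a[i][j + L - 1];
            for (int k = 0; k < L; ++k)
                b_buf += a_buf[(j + k) % L];
            b[i][j] = b_buf;
        }
    }
}
```

Slot `(j + k) % L` holds `a[i][j + k]` for every `k` in the window, because each element stays in the buffer for exactly L consecutive values of `j` before it is overwritten.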
Cache optimizations – [Pan97]

Strategies:



Move data to reduce the number of conflicts.
Move data to take advantage of prefetching.
Need:


Load map.
Information on access frequencies.
Cache conflicts



Assume a direct-mapped cache of size C = 2^m words with a cache line size of M words
Memory address A maps to cache line k = (A mod C)/M
If arrays a, b, and c each hold N words and are stored back to back, and N is a multiple of C, then a[i], b[i], and c[i] all map to the same cache line
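The mapping can be tried out with a few lines of C. This is a sketch under assumptions: addresses are counted in words, the three arrays are assumed to sit back to back starting at address 0, and `cache_line` is a hypothetical helper name.

```c
#include <assert.h>

/* Line index for a direct-mapped cache of cache_words words (C)
 * with line_words words per line (M): k = (A mod C) / M. */
unsigned cache_line(unsigned addr, unsigned cache_words, unsigned line_words) {
    assert(line_words > 0 && cache_words > 0);
    assert(cache_words % line_words == 0);
    return (addr % cache_words) / line_words;
}
```

For example, with C = 64, M = 4, and N = 128, element 5 of each array sits at addresses 5, 133, and 261. All three leave a remainder of 5 modulo 64, so all three land on line 1 and evict one another on every iteration.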
Reducing cache conflicts

Could increase the cache size




Why might these be a bad idea?
Add L dummy words between adjacent arrays
Let f(x) denote the cache line to which the program variable x is mapped.
For L = M, with a[i] starting at 0, we have
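The effect of padding can be checked numerically. The sketch below assumes arrays of `n` words laid out back to back from address 0 with `pad` dummy words after each array; `padded_line` and its parameter names are hypothetical and serve only this illustration.

```c
#include <assert.h>

/* Cache line of element i of the k-th array (k = 0 for a, 1 for b, ...)
 * when each array of n words is followed by pad dummy words. */
unsigned padded_line(unsigned k, unsigned i, unsigned n, unsigned pad,
                     unsigned cache_words, unsigned line_words) {
    unsigned addr = k * (n + pad) + i;  /* word address of the element */
    assert(line_words > 0 && cache_words > 0);
    assert(cache_words % line_words == 0);
    return (addr % cache_words) / line_words;
}
```

With C = 64, M = 4, N = 128, and a padding of one line (pad = M = 4), element 5 of a, b, and c maps to lines 1, 2, and 3 respectively, so the three accesses in one iteration no longer conflict. With pad = 0 all three map to line 1.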
Scalar variable placement
Place scalar variables to improve locality and reduce cache conflicts.
Build closeness graph that indicates the desirability of keeping sets of variables close in memory.
  M adjacent words are read on a single miss
Group variables into M word clusters.
Build a cluster interference graph
  Indicates which clusters map to the same cache line
Use interference graph to optimize placement.
  Try to avoid interference
Constructing the closeness graph
Generate an access sequence

Create a node for each memory access in the code
Directed edge between nodes indicates successive access
Loops weighted with number of loop iterations



Use access sequence to construct closeness graph

Connect nodes within distance M of each other




x to b is distance 4 (count x and b)
Links give # of times control flows between nodes
Requires O(Mn^2) time for n nodes
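A simplified sketch of the "connect nodes within distance M" step is shown below. It is not the algorithm from [Pan97]: it assumes a single straight-line access sequence with variables identified by small integers, ignores loop weights and branching, and uses the hypothetical names `build_closeness` and `MAX_VARS`. Distance counts both endpoints, as on the slide.

```c
#include <assert.h>
#include <string.h>

#define MAX_VARS 8  /* illustrative limit on distinct variables */

/* For every pair of accesses at most m positions apart (counting both
 * endpoints), add 1 to the closeness weight between the two variables. */
void build_closeness(const int seq[], int len, int m,
                     int w[MAX_VARS][MAX_VARS]) {
    memset(w, 0, sizeof(int) * MAX_VARS * MAX_VARS);
    for (int p = 0; p < len; ++p) {
        for (int q = p + 1; q < len && q - p + 1 <= m; ++q) {
            int u = seq[p], v = seq[q];
            assert(u >= 0 && u < MAX_VARS && v >= 0 && v < MAX_VARS);
            if (u != v) {     /* a variable is trivially close to itself */
                ++w[u][v];
                ++w[v][u];    /* keep the matrix symmetric */
            }
        }
    }
}
```

For the sequence 0, 1, 2, 3 with m = 3, variables 0 and 3 are at distance 4 and get no edge, while 0 and 2 are at distance 3 and get weight 1.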
Group variables into M word clusters.
Determine which variables to place on same line


Put variables that will frequently be accessed closely together on the same line. Why?
Form clusters to maximize the total weight of edges in all the clusters



Greedy AssignClusters algorithm has complexity O(Mn^2)
In previous example M = 3 and n = 9
Build cluster interference graph
Identify clusters that should not map to the same line

Convert variable access sequence to cluster access sequence
Weight in graph corresponds to number of times cluster access alternates along execution path

Clusters joined by high-weight edges should not be mapped to the same line
Assigning memory locations to clusters

Find an assignment of clusters in a CIG to memory locations, such that MemAssignCost (CIG) is minimized.
Array placement



Focus on arrays accessed in innermost loops. Why?
Arrays are placed to avoid conflicting accesses with other arrays.
Don’t worry about clustering, but still construct the interference graph – edges dictated by array bounds
Avoid conflicting memory locations


Given addresses X, Y.
Cache with k lines each holding M words.
Formulas for X and Y mapping to the same cache line:
Array assignment algorithm
[Pan97] © 1997 IEEE
Results from data placement [Pan97]



Data cache hit rates improve 48% on average
Results in average speedup of 34%
Results for a 256-byte cache and for kernels
On-Chip vs. Off-Chip Data Placement

[Pan00] explore how to partition data to optimize performance
On-Chip vs. Off-Chip Data Placement





Allocate static variables at compile time
Map all scalar values/constants to scratchpad
Map all arrays too large for scratchpad into DRAM
Only arrays with intersecting lifetimes will have conflicts.
Calculate several parameters:

VAC(u): variable access count – how many times is u accessed.
IAC(u): interference access count – how many times are other variables accessed during u’s lifetime.
IF(u): total interference count = VAC(u) + IAC(u).
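These three counts can be computed from an access trace. The sketch below is an illustration under assumptions: variables are identified by integers, the trace is a flat array of accesses, and the lifetime of u is taken to run from its first to its last access in the trace. The `AccessCounts` struct and `interference_counts` function are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int vac;    /* VAC(u): accesses to u */
    int iac;    /* IAC(u): accesses to other variables during u's lifetime */
    int total;  /* IF(u) = VAC(u) + IAC(u) */
} AccessCounts;

/* Scans the trace once to find u's first and last access and its
 * access count, then derives the interference count from the span. */
AccessCounts interference_counts(const int trace[], int len, int u) {
    AccessCounts c = {0, 0, 0};
    int first = -1, last = -1;
    assert(len >= 0);
    for (int t = 0; t < len; ++t) {
        if (trace[t] == u) {
            if (first < 0) first = t;
            last = t;
            ++c.vac;
        }
    }
    if (first >= 0)
        c.iac = (last - first + 1) - c.vac;  /* everything else in the span */
    c.total = c.vac + c.iac;
    return c;
}
```

For the trace 0, 1, 0, 2, 1, 0, 2, variable 0 is live over the first six accesses, giving VAC = 3, IAC = 3, and IF = 6.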
On-Chip vs. Off-Chip Data Placement

Also need to calculate LCF(u): loop conflict factor.




p is the number of loops in which u is accessed
k(u) is the number of accesses to u
K(v) is the number of accesses to variables other than u
And TCF(u): total conflict factor.
Scratch pad allocation formulation

AD(c): access density.
Scratch pad allocation algorithm
[Pan00] © 2000 ACM Press
Scratch pad allocation performance
[Pan00] © 2000 ACM Press
Main memory-oriented optimizations

Memory chips provide several useful modes:

Burst mode accesses sequential locations.
  Provide start address and length
  Reduce number of addresses sent, increase transfer rate
Paged modes allow only part of the address to be transmitted.
  Address split into page number and offset
  Store page number in register to quickly access values in same page
Access times depend on address(es) being accessed.
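The page-number and offset split can be illustrated with a short sketch that counts how many accesses in a sequence reuse the page of the previous access. The function name `count_page_hits` and the parameter `offset_bits` are hypothetical, the page size is assumed to be a power of two, and `unsigned` is assumed to be at least 32 bits wide.

```c
#include <assert.h>

/* Counts accesses whose page number (the address bits above the offset)
 * matches that of the previous access. Only the offset would need to be
 * transmitted for those accesses. */
unsigned count_page_hits(const unsigned addr[], unsigned len,
                         unsigned offset_bits) {
    unsigned hits = 0;
    assert(offset_bits < 32);
    for (unsigned t = 1; t < len; ++t)
        if ((addr[t] >> offset_bits) == (addr[t - 1] >> offset_bits))
            ++hits;
    return hits;
}
```

With 256-word pages (offset_bits = 8), the sequence 0x100, 0x104, 0x1FC, 0x200, 0x204 has three page hits, while a sequence that strides by a full page, such as 0x000, 0x100, 0x200, has none. This is one way to see why access time depends on the address pattern.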
Banked memories

Banked memories allow multiple memory banks to be accessed in parallel