Transcript Document

Lecture 39: Review Session #1
• Reminders
– Final exam, Thursday 12/18/2014 @ 3:10pm
• Sloan 150
– Course evaluation (Blue Course Evaluation)
• Access through zzusis
Problem #1
• Consider a system with two multiprocessors with the following configurations:
– Machine A: two processors, each with 512 MB of local memory; local memory access latency of 20 cycles per word and remote memory access latency of 60 cycles per word.
– Machine B: two processors with a shared memory of 1 GB; access latency of 40 cycles per word.
• Suppose an application has two threads running on the two processors, and each thread needs to access an entire array of 4096 words.
– Is it possible to partition this array across the local memories of the ‘A’ machine so that the application runs faster on it than on the ‘B’ machine?
• If so, specify the partitioning!
• If not, by how many more cycles would the ‘B’ memory latency have to worsen for a partitioning on the ‘A’ machine to enable a faster run than on the ‘B’ machine?
• Assume that memory operations dominate the execution time.
Solution #1
• Suppose we place ‘x’ words in one processor’s local memory and (T-x) words in the other’s, where T = 4096.
• Execution Time on ‘A’ = max(20x + 60(T-x), 60x + 20(T-x)) = max(60T-40x, 20T+40x)
• This max is minimized when the two terms are equal, at x = T/2, giving 40T (unit time)
• Execution Time on ‘B’ = 40T
• So, we cannot make ‘A’ faster than ‘B’. However, if ‘B’ access were even one cycle slower (that is, 41 cycles access latency), ‘A’ would be faster.
Problem #2
• Consider a multi-core processor with heterogeneous cores A, B, C and D, where core B runs twice as fast as A, core C runs three times as fast as A, and cores D and A run at the same speed (i.e., have the same processor frequency, microarchitecture, etc.). Suppose an application needs to compute the square of each element in an array of 256 elements. Consider the following two divisions of labor:
– (a) Core A: 32 elements; Core B: 128 elements; Core C: 64 elements; Core D: 32 elements
– (b) Core A: 48 elements; Core B: 128 elements; Core C: 80 elements; Core D: unused
• Compute (1) the total execution time taken in the two cases and (2) the cumulative processor utilization (the total time the processors are not idle divided by the total execution time). For case (b), if you do not consider Core D in the cumulative processor utilization (assuming we have another application to run on Core D), how would it change? Ignore cache effects by assuming that a perfect pre-fetcher is in operation.
Solution #2
• (1) Total execution time (in units of the time core A takes per element):
– (a) Total execution time = max(32/1, 128/2, 64/3, 32/1) = 64 (unit time)
– (b) Total execution time = max(48/1, 128/2, 80/3, 0/1) = 64 (unit time)
• (2) Utilization:
– (a) Utilization = (32/1 + 128/2 + 64/3 + 32/1) / (4 * 64) ≈ 0.58
– (b) Utilization = (48/1 + 128/2 + 80/3 + 0/1) / (4 * 64) ≈ 0.54
– (b) Utilization (if Core D is ignored) = (48/1 + 128/2 + 80/3) / (3 * 64) ≈ 0.72
Problem #3
• How would you rewrite the following sequential code so that it can be run as two parallel threads on a dual-core processor?
int A[80], B[80], C[80], D[80];
for (int i = 0; i < 40; i++)
{
    A[i] = B[i] * D[2*i];
    C[i] = C[i] + B[2*i];
    D[i] = 2*B[2*i];
    A[i+40] = C[2*i] + B[i];
}
Solution #3
• The code can be split into two threads (sharing the arrays) as follows:
• Thread 1:
int A[80], B[80], C[80], D[80];  /* shared by both threads */
for (int i = 0; i < 40; i++)
{
    A[i] = B[i] * D[2*i];
    C[i] = C[i] + B[2*i];
    A[i+40] = C[2*i] + B[i];
}
• Thread 2:
for (int i = 0; i < 40; i++)
{
    D[i] = 2*B[2*i];
}
RAID 0 and RAID 1
• RAID 0 has no additional redundancy (misnomer) – it
uses an array of disks and stripes (interleaves) data
across the arrays to improve parallelism and throughput
• RAID 1 mirrors or shadows every disk – every write
happens to two disks
• Reads to the mirror may happen only when the primary
disk fails – or, you may try to read both together and the
quicker response is accepted
• Expensive solution: high reliability at twice the cost
RAID 3
• Data is bit-interleaved across several disks and a separate
disk maintains parity information for a set of bits. On
failure, use parity bits to reconstruct missing data.
• For example: with 8 disks, bit 0 is on disk-0, bit 1 on disk-1, …, bit 7 on disk-7; disk-8 maintains parity for all 8 bits
• For any read, 8 disks must be accessed (as we usually
read more than a byte at a time) and for any write, 9 disks
must be accessed as parity has to be re-calculated
• High throughput for a single request, low cost for
redundancy (overhead: 12.5% in the above example), low
task-level parallelism
RAID 4 / RAID 5
• Data is block-interleaved – this allows us to get all our data from a single disk on a read; in case of a disk error, reconstruct the block by reading the surviving disks (data plus parity)
• Block interleaving reduces throughput for a single
request (as only a single disk drive is servicing the
request), but improves task-level parallelism as other
disk drives are free to service other requests.
• On a write, we access only the disk that stores the data and the parity disk – the parity can be updated from just the old and new versions of the data, by flipping the parity bits wherever the new data differs from the old data
RAID 3 vs RAID 4
[Figure: layout comparison of RAID 3 (bit-interleaved data, dedicated parity disk) and RAID 4 (block-interleaved data, dedicated parity disk)]
RAID 5
• If we have a single disk for parity, multiple writes cannot happen in parallel (as every write must update the parity info)
• RAID 5 distributes the parity block to allow simultaneous
writes
Problem #4
•
Discuss why RAID 3 is not suited for transaction processing applications. What
kind of applications is it suitable for and why?
Solution #4
•
RAID 3 is unsuited to transactional processing because each read involves
activity at all disks. In RAID 4 and 5 reads only involve activity at one disk. The
disadvantages of RAID 3 are mitigated when long sequential reads are
common, but performance never exceeds RAID 5. For this reason, RAID 3 has
been all but abandoned commercially.