Intel Core 2 Quad CPU @ 2.66 GHz

[Figure: Sam Williams diagram]
Clovertowns are marketed as Xeons
Intel Core 2 Quad CPU @ 2.66 GHz
Q6700
L2 Cache: 8 MB (4 MB shared per pair of cores)
L1 Cache: 128 KB instruction + 128 KB data in total (32 KB + 32 KB per core)
L3 Cache: None
CPU Frequency: 2.66 GHz
Bus Speed: 1.066 GHz effective (FSB = Front Side Bus; 266 MHz bus clock, quad-pumped; CPU multiplier = 10 on the 266 MHz clock)
Code Name: Kentsfield (Xeon or not? Not on my machine?)
One thread (8.87/2.66=3.33 flops/cycle)
>> maxNumCompThreads(1);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.4818
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.4852
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.5231
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.6101
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.8097
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.8310
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
8.8702
(Minimum/maximum: 8.48/8.87 = 0.96)
Two threads (17.11/2.66=6.43 flops/cycle)
>> maxNumCompThreads(2);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
14.8793
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
15.8802
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
15.5001
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
16.3604
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
16.5596
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
16.3035
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
16.8308
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
16.8309
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
17.0555
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
17.0995
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
17.0704
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
17.1110
(Minimum/maximum: 14.8793/17.111 = 0.87)
Four threads (29.59/2.66=11.1 flops/cycle)
>> maxNumCompThreads(4);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
23.9690
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
25.4798
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
25.8126
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
28.0110
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
28.0495
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
29.3411
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
29.5863
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
29.1100
(Minimum/maximum: 23.969/29.586 = 0.81)
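The three sessions above could be scripted rather than typed by hand. The following is only a sketch: the sizes, the repeat count, and the variable names are illustrative choices, not the settings behind the numbers shown, and the rates it prints will differ from machine to machine.

```matlab
% Sketch: repeat the a*a timing for several thread counts and matrix sizes.
% threads, sizes and reps are illustrative assumptions.
threads = [1 2 4];
sizes   = [1000 2000 4000];
reps    = 3;                          % keep the best of a few runs
best    = zeros(numel(threads), numel(sizes));
for i = 1:numel(threads)
    maxNumCompThreads(threads(i));    % same call as in the sessions above
    for j = 1:numel(sizes)
        n = sizes(j);
        a = randn(n);
        for r = 1:reps
            tic; c = a*a; t = toc;
            % 2*n^3 flops for an n-by-n multiply, reported in Gflops
            best(i,j) = max(best(i,j), (2*n^3)/(t*1e9));
        end
    end
end
disp(best)                            % rows: threads, columns: n
```

Keeping the best of several runs is one way to damp the run-to-run scatter visible in the n=1000 timings.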
Summary
• Threads = 1/2/4
• Maximum Gflops: 8.87/17.11/29.59
• Maximum flops/cycle: 3.33/6.43/11.1
• Maximum flops/cycle/thread: 3.33/3.21/2.78
• Minimum (n=1000) / Maximum (n=5000 or 6000)
– 0.96/0.87/0.81
All indicative of an ability to do about 4 double-precision flops
(2 mults and 2 adds) per core per cycle, but not enough memory
bandwidth to keep the processors going at full capacity.
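The per-cycle figures in the summary are just the measured rates divided by the clock. A small sketch of that arithmetic, using the largest rates printed in the sessions above (8.8702, 17.1110, 29.5863); the variable names are mine:

```matlab
% Sketch: convert the best measured Gflops into flops per cycle.
clockGHz = 2.66;
threads  = [1 2 4];
gflops   = [8.8702 17.1110 29.5863];  % largest values printed in the sessions
perCycle  = gflops / clockGHz         % flops per cycle, all threads together
perThread = perCycle ./ threads       % flops per cycle per thread
```

The per-thread values come out at roughly 3.3, 3.2 and 2.8, so each added thread contributes a little less than the one before it.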
Matrix Add
>> n=5000; a=randn(n,n); tic, c=a+0; t=toc; (2.66*1e9*t)/(2*n^2)
ans =
12.3890
>> maxNumCompThreads(4);
>> n=5000; a=randn(n,n); tic, c=a+0; t=toc; (2.66*1e9*t)/(2*n^2)
ans =
12.2825
Conclusion: Takes about 12 cycles per read or write (2*n^2 of them in total), independent of operations,
i.e. in one cycle we have (1/12) of 8 bytes moving.
In one second we have (2.66*1e9)*(1/12)*8 bytes, about 1.7 to 1.8 GB/second (seems
slow!)
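The conversion from cycles per access to bytes per second can be written out as below. This is a sketch with my own variable names; it takes the measured 12.389 cycles rather than the rounded 12, and assumes each access moves one 8-byte double.

```matlab
% Sketch: turn "cycles per read or write" into an effective bandwidth.
clockHz         = 2.66e9;   % core clock
cyclesPerAccess = 12.389;   % measured in the matrix add above
bytesPerAccess  = 8;        % one double per read or write (assumption)
bytesPerSec = clockHz / cyclesPerAccess * bytesPerAccess;
fprintf('%.2f GB/s\n', bytesPerSec/1e9);
```

With the measured value this gives about 1.7 GB/s; rounding to 12 cycles gives closer to 1.8.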
One can try a model
• Cycles = (reads + writes)*12 + (flops)/(4*p*efficiency)
• But good luck!
• (Not sure if this accounts for all that is going on,
and maybe one shouldn’t decouple the memory
starvation from the efficiency. You can see what
you can do if you like. I’m disappointed this is so
non-predictive.)
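For anyone who wants to try the model, here is one way to plug numbers in. Only the formula comes from the slide. Everything else is an assumption on my part: that p is the number of threads, that efficiency is the fraction of 4 flops/cycle actually achieved, and that a matrix multiply touches memory 3*n^2 times (two operands read, one result written).

```matlab
% Sketch of the model: Cycles = accesses*12 + flops/(4*p*efficiency).
n          = 5000;
p          = 1;          % assumed: number of threads
efficiency = 1;          % assumed: fraction of 4 flops/cycle achieved
accesses   = 3*n^2;      % assumed: two operands read, one result written
flops      = 2*n^3;      % flops in an n-by-n matrix multiply
cycles     = accesses*12 + flops/(4*p*efficiency);
seconds    = cycles / 2.66e9;
predictedGflops = flops / seconds / 1e9
```

Changing p and efficiency and comparing against the measured rates shows quickly how non-predictive the model is.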
https://agora.cs.illinois.edu/download/attachments/19925366/a38mattson.pdf
As a second point of comparison, consider Intel® Core™ 2
Quad processor CPU running at 2.66 GHz with a thermal
design power of 95W (model number Q6700) [Intel2008].
This CPU was manufactured using the same 65 nm process
technology as was used for the 80-core Terascale processor.
A Core™ 2 core includes two 128 bit wide SIMD FPU that
support the SSE3 instructions each of which can retire up to
4 single precision floating point operations per cycle.
Hence, the peak performance of this quad core CPU is: 4
core*8flop/core*2.66 GHZ = 85.12 single precision GFLOPS
This translates to 0.9 GFLOP/Watt making the 80-core
Terascale processors (19.4 GFLOP/W at 0.394 TFLOP) over
20 times more power efficient than a more traditional “big
core” multicore CPU.
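The arithmetic in the quoted passage can be checked in a few lines. The numbers are the ones quoted (4 cores, 8 single-precision flops per core per cycle, 2.66 GHz, 95 W, 19.4 GFLOP/W); the variable names are mine.

```matlab
% Sketch: reproduce the peak-rate and power-efficiency figures quoted above.
cores         = 4;
flopsPerCycle = 8;       % single precision, per core
clockGHz      = 2.66;
tdpWatts      = 95;
peakGflops    = cores * flopsPerCycle * clockGHz   % 85.12
gflopsPerWatt = peakGflops / tdpWatts              % about 0.9
ratio         = 19.4 / gflopsPerWatt               % a bit over 20
```

Note that this peak is for single precision; the MATLAB timings earlier use doubles, for which the same hardware does half as many flops per cycle.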
Wikipedia: The Kentsfields comprise two separate silicon dies (each equivalent to a
single Core 2 duo) on one MCM.[30] This results in lower costs but lesser share of the
bandwidth from each of the CPUs to the northbridge than if the dies were each to sit
in separate sockets as is the case for example with the AMD Quad FX platform
Wikipedia:
• The multiple cores of the Kentsfield most benefit applications that can easily be broken into a small
number of parallel threads (such as audio and video transcoding, data compression, video
editing, 3D rendering and ray-tracing). To take a specific example, multi-threaded games such
as Crysis and Gears of War which must perform multiple simultaneous tasks such as AI, audio and
physics benefit from the quad-core CPUs.[35] In such cases, the processing performance may
increase relative to that of a single-CPU system by a factor approaching the number of CPUs. This
should, however, be considered an upper limit as it presupposes the user-level software is well-threaded.
To return to the above example, some tests have demonstrated that Crysis fails to take
advantage of more than two cores at any given time.[36] On the other hand, the impact of this issue
on broader system performance can be significantly reduced on systems which frequently handle
numerous unrelated simultaneous tasks such as multi-user environments or desktops which
execute background processes while the user is active. There is still, however, some overhead
involved in coordinating execution of multiple processes or threads and scheduling them on
multiple CPUs which scales with the number of threads/CPUs. Finally, on the hardware level there
exists the possibility of bottlenecks arising from the sharing of memory and/or I/O bandwidth
between processors.
• I read this as: you might hopefully get 4-fold speedups,
but some people say you might only get 2; it all
depends, and nobody really seems to know for sure

Theoretical Memory Bandwidth
• (Clock Frequency) * (Data Path Width) * (Transfers per clock cycle)
• (266 MHz) * (8 bytes) * (4)
• The 4 is because the bus is “quad-pumped”: four transfers per bus clock
• The 1.066 GHz bus speed already includes the factor of 4, so it should not be multiplied by 4 again
• This would be about 8.5 GB/sec
• Sam Williams says 10.6 or 21.3 on Clovertown
• I see 1.7??
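A sketch of the theoretical figure next to the measured one. It assumes the 1.066 GHz bus speed is the effective, already quad-pumped rate, so the underlying bus clock is a quarter of it, and that the data path is 8 bytes wide. The measured value is the 12.389 cycles per access from the matrix add.

```matlab
% Sketch: theoretical FSB bandwidth versus what the matrix add achieved.
effectiveHz       = 1.066e9;             % assumed to be the quad-pumped rate
busClockHz        = effectiveHz / 4;     % about 266 MHz
bytesPerTransfer  = 8;                   % assumed 64-bit data path
transfersPerClock = 4;                   % quad-pumped
theoretical = busClockHz * bytesPerTransfer * transfersPerClock;  % bytes/s
measured    = 2.66e9 / 12.389 * 8;                                % bytes/s
fprintf('theoretical %.1f GB/s, measured %.1f GB/s (%.0f%%)\n', ...
        theoretical/1e9, measured/1e9, 100*measured/theoretical);
```

Under these assumptions the matrix add reaches only about a fifth of the theoretical bus bandwidth.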
SSE Streaming SIMD Extensions
• Cores have 128-bit registers (eight of them in 32-bit mode, sixteen in 64-bit mode)
• These allow four single-precision or two double-precision
ops per instruction
• See: http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
• Especially packed add ADDPS, and packed multiply MULPS
• See:
• http://developer.intel.com/software/products/college/ia32/strmsimd/simd.htm
• http://www.cortstratton.org/articles/OptimizingForSSE.php
• http://www.cortstratton.org/articles/HugiCode.html
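One way to probe the four-single versus two-double point from MATLAB is to repeat the multiply timing with single-precision data. This is a sketch of an experiment, not a result: n is an arbitrary choice, and the ratio you see depends on the BLAS library MATLAB uses.

```matlab
% Sketch: compare double- and single-precision matrix multiply rates.
n  = 2000;
ad = randn(n);            % double precision (MATLAB default)
as = single(ad);          % same data in single precision
tic; prodD = ad*ad; td = toc;
tic; prodS = as*as; ts = toc;
gflopsDouble = (2*n^3)/(td*1e9)
gflopsSingle = (2*n^3)/(ts*1e9)
```

If the SSE units are the limit, the single-precision rate should approach twice the double-precision one.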