Alpha 21172 Chipset - University of Michigan

Alpha 21172 Core Logic Chip Set
Jerry Huang
Alpha 21172 Inside out
Zhihui Huang (Jerry)
University of Michigan
Slide #1
Friday, October 10, 1997
Components
One 21172-CA chip
– Control, I/O, address chip (CIA)
– 388 pins, plastic ball grid array (PBGA)
Four 21172-BA chips
– data switch chip (DSW)
– 208 pins, plastic quad flat pack (PQFP)
Slide #2
Data Paths
64-bit data path between CIA and DSW
– iod
128-bit data path between 21164 and DSW
– cpu_dat
256-bit memory data path between DSW and memory
– mem_dat
Note: the slowest part has the widest data bus.
Slide #3
3-way Interface
[Figure: block diagram of the three-way interface. Labels: 21164; 21172; DSW0 to DSW3; DRAM 1 to DRAM 8; addr<39:4>; RAS, CAS, control, memadr<11:0>; 64-bit IOD bus; 64-bit PCI bus]
Slide #4
Memory
The DRAM is contained in one bank of SIMMs, whether there are 4 SIMMs or 8 SIMMs.
– 4 SIMMs fill a data bus of 128 bits
– Needs a jumper
– 8 SIMMs fill a data bus of 256 bits
[Figure: DRAM SIMMs 1 to 8, with 128-bit and 256-bit bus width labels]
Slide #5
Memory block
A 256-bit block is composed of bit slices across all the 4 DSWs.
DSW0     DSW1     DSW2     DSW3
15:0     31:16    47:32    63:48
79:64    95:80    111:96   127:112
The slices are interleaved across the 4 DSWs.
As you just saw, the DSWs together provide the lower 128-bit memory bus.
For the 256-bit configuration, the DSWs also provide the upper part of the bus.
It is better to use the 256-bit configuration; otherwise you pay the full price for the 4 DSWs and only use half of the resources.
It may now be clear why this is a one-bank scheme: all the SIMMs have the same size.
[Figure: DSW0 to DSW3 and DRAM 1 to DRAM 8 (8 SIMMs), with 128-bit and 256-bit memory bus labels]
Slide #6
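The slice layout on the memory block slide can be sketched as a small mapping function. This is only an illustration: it assumes regular 16-bit slices dealt out round-robin to the four DSWs, with the last two ranges read as 111:96 and 127:112. The names `SLICE_BITS`, `dsw_for_bit`, and `slices_for_dsw` are hypothetical and do not come from the chipset documentation.

```python
SLICE_BITS = 16   # assumed width of one bit slice
NUM_DSW = 4       # four 21172-BA data switch chips
WORD_BITS = 128   # width of the lower half of the memory bus


def dsw_for_bit(bit: int) -> int:
    """Return the index of the DSW assumed to carry one bit of a 128-bit word."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit must be in 0..127")
    # Slices are dealt out round-robin: slice 0 -> DSW0, slice 1 -> DSW1, ...
    return (bit // SLICE_BITS) % NUM_DSW


def slices_for_dsw(dsw: int) -> list[tuple[int, int]]:
    """Return the (high, low) bit ranges assumed to pass through one DSW."""
    return [
        (low + SLICE_BITS - 1, low)
        for low in range(0, WORD_BITS, SLICE_BITS)
        if dsw_for_bit(low) == dsw
    ]


if __name__ == "__main__":
    for dsw in range(NUM_DSW):
        print(f"DSW{dsw}:", slices_for_dsw(dsw))
```

Under these assumptions each DSW carries two 16-bit slices of every 128-bit word, which is why all four chips are needed even in the 128-bit configuration.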
Bcache and Memory
21164 3rd Level Cache
Attributes
– optional, external, physical, synchronous SRAM
– direct-mapped, write-back, write-allocate
– 256-bit or 512-bit block
– cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
– ECC protected
In the PC164
– supports up to 512MB of memory
– 1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36
The Scache and Bcache block size is either 64 bytes or 32 bytes. The Scache and Bcache always have identical block sizes. All the Bcache and main memory FILLs or write transactions are of the selected block size.
Direct-mapped: A cache where the cache location for a given address is determined from the middle address bits. If the cache line size is 2^n then the bottom n address bits correspond to an offset within a cache entry. If the cache can hold 2^m entries then the next m address bits give the cache location. The remaining top address bits are stored as a TAG along with the entry. In this scheme, there is no choice of which block to flush on a cache miss since there is only one place for any block to go. This simple scheme has the disadvantage that if the program alternately accesses different addresses which map to the same cache location then it will suffer a cache miss on every access to these locations. This kind of cache conflict is quite likely on a multi-processor.
Write-back: A cache architecture in which data is only written to main memory when it is forced out of the cache. Opposite of write-through.
Write-allocate: A cache line is allocated when the write data misses the cache.
Slide #7
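The direct-mapped definition on the Bcache slide (bottom n bits as offset, next m bits as cache location, the rest as TAG) can be sketched in a few lines. The function name `split_address` and the example cache geometry (1 Mbyte with 64-byte blocks) are assumptions chosen for illustration, not a description of any particular board.

```python
from typing import NamedTuple


class CacheAddress(NamedTuple):
    tag: int     # remaining top address bits, stored with the entry
    index: int   # next m bits: the cache location
    offset: int  # bottom n bits: offset within the cache line


def split_address(addr: int, line_bytes: int, num_lines: int) -> CacheAddress:
    """Split a physical address for a direct-mapped cache.

    line_bytes must be 2**n and num_lines must be 2**m.
    """
    n = line_bytes.bit_length() - 1
    m = num_lines.bit_length() - 1
    if line_bytes != 1 << n or num_lines != 1 << m:
        raise ValueError("line_bytes and num_lines must be powers of two")
    offset = addr & (line_bytes - 1)
    index = (addr >> n) & (num_lines - 1)
    tag = addr >> (n + m)
    return CacheAddress(tag, index, offset)


if __name__ == "__main__":
    LINE_BYTES = 64                          # 512-bit block
    NUM_LINES = (1 << 20) // LINE_BYTES      # assumed 1 Mbyte cache
    a = split_address(0x123456, LINE_BYTES, NUM_LINES)
    b = split_address(0x123456 + (1 << 20), LINE_BYTES, NUM_LINES)
    print(a)
    print(b)  # same index, different tag: the two addresses conflict
```

Two addresses exactly one cache size apart land on the same index with different tags, which is the conflict case the definition warns about.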
PCI features
Supports 64-bit PCI bus width
Supports 64-bit PCI addressing (DAC cycles)
Accepts PCI fast back-to-back cycles
– addr, data0, data1, data2, ..., addr_again!
– Frame# is deasserted for only one cycle to allow the last transfer to finish
Issues PCI fast back-to-back cycles in dense address space
[Figure: PCI timing diagram. Signals: clk, Frame#, addr, data, data]
Slide #8
CIA Transactions
21164 memory read miss
21164 memory read miss with victim
21164 I/O read
21164 I/O write
DMA read
DMA read (prefetch)
DMA write
Slide #9
DSW Data Paths
[Figure: DSW data paths. Labels: 21164, Bcache, Memory; SYS and MEM ports; Victim Path; Read Miss Path; Instruction Queue; DMA 0 and DMA 1 buffer sets, each with PCI, MEM, and Flush buffers; IOD. IO paths not shown.]
Slide #10
DSW Buffers
DMA Buffer Sets (0 and 1)
– PCI buffer for PCI DMA write data
– Memory buffer for memory data
– Flush buffer for system bus data
[Figure: DMA 0 and DMA 1 buffer sets, each with PCI, MEM, and Flush buffers; IOD and MEM connections]
Slide #11
DMA Writes
– Data arrives in the PCI buffer
– Memory buffer loaded at the same time
– Bcache line flushed and Flush buffer loaded
– 3 sources merged and data back at memory
As you just saw, the DMA operation causes the PCI buffer to be loaded from the IOD bus, the MEM buffer to be loaded from memory, and the flush buffer to be loaded from the system bus at the same time.
Then the 3 sources are merged and written back to main memory.
[Figure: DMA 0 buffer set (PCI, MEM, Flush) with 21164, Bcache, Memory, and IOD]
Slide #12
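The three-way merge on the DMA Writes slide can be pictured with a short sketch. The slide only says that the PCI, memory, and flush buffers are merged; the priority used here (PCI bytes win, then the flushed Bcache copy, then the memory copy) is an assumption, as are the function name `merge_dma_write` and the per-byte valid mask.

```python
from typing import Optional, Sequence


def merge_dma_write(
    mem_buf: bytes,
    flush_buf: Optional[bytes],
    pci_buf: bytes,
    pci_valid: Sequence[bool],
) -> bytes:
    """Merge the three DMA write sources into one block for memory.

    Assumed priority: a valid PCI byte overrides everything; otherwise the
    flushed Bcache copy (if the line was in the cache) overrides memory.
    """
    size = len(mem_buf)
    if len(pci_buf) != size or len(pci_valid) != size:
        raise ValueError("PCI buffer and mask must match the block size")
    if flush_buf is not None and len(flush_buf) != size:
        raise ValueError("flush buffer must match the block size")
    # The newest non-PCI copy of the block: flushed cache line, else memory.
    base = flush_buf if flush_buf is not None else mem_buf
    return bytes(p if valid else b for p, valid, b in zip(pci_buf, pci_valid, base))


if __name__ == "__main__":
    mem = bytes([0x00] * 8)
    flush = bytes([0x11] * 8)
    pci = bytes([0x99] * 8)
    mask = [True] * 4 + [False] * 4   # the DMA wrote only the first 4 bytes
    print(merge_dma_write(mem, flush, pci, mask).hex())
    print(merge_dma_write(mem, None, pci, mask).hex())
```

The sketch also shows why a partial DMA write needs a memory read first: the bytes the PCI device did not supply must come from somewhere.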
21164 Read Transaction
If the read hits in the Bcache, no memory access is required.
– Read data
– HIT !!
– Data back to CPU
[Figure: 21164, Bcache, SYS, MEM, Memory, Read Miss Path]
Slide #13
21164 Read Miss
If a read does not hit in the Bcache, a memory access is involved.
– Read data
– Miss!!
– Data back to CPU
[Figure: 21164, Bcache, 21172 CIA, 21172 BA, SYS, MEM, Memory, Read Miss Path]
Slide #14
Read Miss With Victim
Two scenarios
– write data with a different address tag into a valid cache line (write allocate!!)
– read data with a different address tag into a valid cache line (read allocate!!)
Reading the missed block and writing the victim block are indivisible in the logic design.
– Write data
– Miss!!
– Merge data
[Figure: 21164, Bcache, 21172 CIA, SYS, MEM, Memory, Victim Path, Read Miss Path]
Slide #15
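The two victim scenarios can be reproduced with a toy model of a direct-mapped, write-back, write-allocate cache. This is a sketch under stated assumptions: it tracks only tags and dirty bits, it treats a replaced line as a victim only when that line is dirty, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0


class DirectMappedCache:
    """Tag-only model of a direct-mapped, write-back, write-allocate cache."""

    def __init__(self, num_lines: int, line_bytes: int) -> None:
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.lines = [Line() for _ in range(num_lines)]

    def access(self, addr: int, is_write: bool) -> str:
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        line = self.lines[index]
        if line.valid and line.tag == tag:
            line.dirty = line.dirty or is_write
            return "hit"
        # Miss: the old contents are a victim only if they must be written back.
        has_victim = line.valid and line.dirty
        # Allocate on both read and write misses.
        self.lines[index] = Line(valid=True, dirty=is_write, tag=tag)
        return "miss with victim" if has_victim else "miss"


if __name__ == "__main__":
    cache = DirectMappedCache(num_lines=4, line_bytes=64)
    print(cache.access(0, is_write=False))       # cold miss
    print(cache.access(0, is_write=True))        # hit, line becomes dirty
    print(cache.access(4 * 64, is_write=False))  # same index, new tag
```

In the model, the last access both fetches the missed block and evicts a dirty one, which corresponds to the pair of operations the slide describes as indivisible.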
Traffic Jam on MEM bus
Let's think about this scenario: during the PCI DMA transfer, there are READ and WRITE memory accesses happening at the same time.
– Cause read miss
– Cause read miss with victim
All the circled parts compete for this resource.
Don't forget instruction fetch uses memory too.
[Figure: DSW data paths. Labels: 21164, Bcache, Memory; SYS and MEM ports; Victim Path; Read Miss Path; Instruction Queue; DMA 0 and DMA 1 buffer sets, each with PCI, MEM, and Flush buffers; IOD. IO paths not shown.]
Slide #16
How Fast can DMA be?
PCI: 33MHz, 64-bit bus
– 8 bytes / 30 ns = 266 Mbytes/s
60 ns DRAM, 256-bit bus, 2 fetches and 2 writes to memory/DMA
– 64 bytes / 240 ns = 266 Mbytes/s
33 MHz PCI has the same speed as DRAM!! Can we really do this??
Overhead, retries, read lines, read line with victim, and instruction fetch all share the same bandwidth!!
It turns out that in the worst case 17 MBytes/s is achieved, just above the bottom line.
[Figure: DMA 0 and DMA 1 buffer sets with PCI, MEM, and Flush buffers; IOD; SYS and MEM ports]
Slide #17
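The two peak figures on the "How Fast can DMA be?" slide are simple ratios, reproduced below as a check. The helper name `mbytes_per_s` is hypothetical; the inputs (8 bytes per 30 ns PCI cycle, 64 bytes per four 60 ns DRAM accesses) are the ones quoted on the slide, and 1 Mbyte is taken as 10^6 bytes.

```python
def mbytes_per_s(bytes_moved: int, time_ns: float) -> float:
    """Convert bytes per nanosecond into Mbytes/s (1 Mbyte = 10**6 bytes)."""
    return bytes_moved / time_ns * 1000.0


# 64-bit PCI at 33 MHz: 8 bytes every 30 ns cycle.
pci_peak = mbytes_per_s(8, 30)

# 60 ns DRAM on a 256-bit bus: 2 fetches + 2 writes move one 64-byte block.
dram_accesses = 2 + 2
dram_peak = mbytes_per_s(64, dram_accesses * 60)

if __name__ == "__main__":
    print(f"PCI peak:  {pci_peak:.1f} Mbytes/s")
    print(f"DRAM peak: {dram_peak:.1f} Mbytes/s")
```

Both come out at roughly 266.7 Mbytes/s, so memory has no headroom left once anything else (read lines, victims, instruction fetch) shares the MEM bus.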
Performance of the MB2PCI
Interference                                          Worst case      Best case
No interference                                       29.9 MBytes/s   95 MBytes/s
read line, instruction fetch                          25.5 MBytes/s   80 MBytes/s
read line, read line with victim, instruction fetch   17.5 MBytes/s   72 MBytes/s
Slide #18
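To put the measured numbers in perspective, the sketch below expresses each one as a percentage of the 266 Mbytes/s theoretical figure from the previous slide. The pairing of each interference scenario with a worst-case and best-case value is an assumption based on the order in which the values appear; `percent_of_peak` is a hypothetical helper.

```python
PEAK_MBYTES_PER_S = 266.0  # theoretical figure quoted on the previous slide

# scenario -> (worst case, best case) in MBytes/s; pairing is assumed
measured = {
    "no interference": (29.9, 95.0),
    "read line, instruction fetch": (25.5, 80.0),
    "read line, read line with victim, instruction fetch": (17.5, 72.0),
}


def percent_of_peak(rate: float) -> float:
    """Return a measured rate as a percentage of the theoretical peak."""
    return round(rate / PEAK_MBYTES_PER_S * 100.0, 1)


if __name__ == "__main__":
    for scenario, (worst, best) in measured.items():
        print(f"{scenario}: worst {percent_of_peak(worst)}%, "
              f"best {percent_of_peak(best)}%")
```

Even the best case stays well under half of the theoretical peak, and the worst case with all three kinds of interference falls below a tenth of it.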
Conclusion
If we want to improve
– use a 256-bit cache block instead of 512-bit
– is there a next-version 21172 chip that supports a 512-bit memory bus?
– are there DRAM chips faster than 60 ns?
– can we afford a 64M Bcache (SRAM)?
There is a trade-off here: with a smaller block, the 21164 will generate more cache miss cycles and may slow down. On the other hand, for a DMA transfer in which only 128 bits of data are transferred, there is no more 512-bit memory read overhead. There is only a 256-bit read now, which improves the worst-case performance.
Slide #19
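The block-size trade-off for DMA can be estimated with a rough cost model. It assumes, as on the earlier bandwidth slide, a 256-bit memory bus and 60 ns per DRAM access, and it assumes that a partial DMA write reads the whole block and then writes the whole block back. The function name `dma_write_cost_ns` is hypothetical, and the model ignores arbitration and all other overhead.

```python
DRAM_ACCESS_NS = 60   # access time quoted on the bandwidth slide
MEM_BUS_BITS = 256    # width of the memory data path


def dma_write_cost_ns(block_bits: int) -> int:
    """Estimate memory time for one partial DMA write to a cache block.

    Assumes the full block is fetched, merged, and written back.
    """
    accesses_each_way = -(-block_bits // MEM_BUS_BITS)  # ceiling division
    return 2 * accesses_each_way * DRAM_ACCESS_NS


if __name__ == "__main__":
    for block_bits in (512, 256):
        print(f"{block_bits}-bit block: {dma_write_cost_ns(block_bits)} ns")
```

Under this model a 128-bit DMA write costs 240 ns of memory time with a 512-bit block and 120 ns with a 256-bit block, which is the saving the conclusion points to.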