System Roadmap
Andrew B. Kahng
Core Pillar
September 29, 2006
[email protected]
Modeling Requirements for
System-Level Living Roadmap
Core Pillar Requirements (ASV)
Benefits of technology scaling can be sustained by migrating the
design process to a system scaling paradigm
design elements are IP blocks/processor cores as opposed
to devices and standard cells
New system synthesis paradigms rely on accurate yet simple
models of delay/area/power/cost trade-offs for parameterized
design elements
Models of block-level design metrics should account for
Cost and impact of design techniques to cope with variability
Cost of hardware (in terms of design metrics) required for
adaptivity/resiliency
Goal: Synthesize and abstract the impact of low-level
technology parameters (and their variabilities) on design
metrics of system-level blocks
BEOL Stack Optimization (Nagaraj, TI)
Quality assessment of BEOL interconnect stacks
Inputs: technology parameters (resistivity, ILD thickness, etc.),
geometry parameters (wire widths, pitches), Rent parameters
Stack quality assessment is required for blocks (instead of individual
wires) such as data path elements, SoC components and processor
cores
Outputs: Reports of trade-offs and models of stack performance
metrics for system-level exploration
Existing wire-length distribution models and interconnect
performance metrics optimize stack metrics but do not have a
system-level view
Stack parameter exploration and optimization should be
driven by “design-level” throughput and power considerations
E.g., area-normalized throughput and power density
Interconnect Library Modeling (Carloni)
Focus of design process is shifting from “computation” to
“communication”
Device scaling and interconnect performance scaling mismatches are
causing breakdown of traditional across-chip communication
mechanisms
New techniques: wave pipelining, stateful repeaters → communication-
and network-centric approach for future designs
Communication-driven design synthesis
System-level design requirements translated to communication
mechanism between computational blocks analogous to classic
synthesis process (design requirements translated to computational
blocks)
Mapping stage involves association of communication apparatus (links,
repeaters, buses, routers etc.) to high-level synthesis solution
similar to technology mapping of standard cells to generic netlist
New synthesis methodology requires a “characterized”
interconnect library composed of links, repeaters, routers, etc.
Modeling/metrics SIG can provide models of latency,
bandwidth, throughput, power (high-level metrics) based on
thorough characterization of library elements based on device
and process technology roadmaps.
Concurrent Theme Requirements (Keutzer)
Current technology extrapolation framework doesn’t allow
study of impact of design choices on high-level parameters
E.g., what if a vector unit is added? What if local memory size is
increased? What is the impact of architectural design choices on
chip-level attributes?
Architectural exploration work requires models of design
metrics that are accurate to within ~20%
There is a significant gap between numbers in ITRS and
technology extrapolation frameworks (BACPAC/GTX)
Design space/architectural exploration based on rule/inference chains
will run out of steam → requires models for higher-level design blocks
Specific questions:
What will be the size of an economical die in future nodes?
# RISC processors that can be implemented
# clock / power regimes (i.e., voltage islands)
Clock frequencies in future nodes
Power implications / trade-offs
Other Guidance
Questions from Intel mentors
How to model the reliability and the error rate of SRAM
How to embed technological variability and reliability issues
into the system diagnosis
How to identify the ‘hot spots’ of a design
How to efficiently validate the design under variations
Other
What are impacts of variability on NOC?
NBTI power-law modeling (Purdue-TI)
Macromodeling
The Challenge of System Projection and Design
What is impact of new technology on system macro
parameters?
Execution speed, power consumed, latency, reliability, cost, …
What macromodeling will enable system-level optimization?
System optimization : large block :: logic optimization : standard cell
“Large block” = microprocessor, memory, network, bus, …
Logic cell abstraction through 65nm WAS: size, power, delay
Block abstraction beyond 65nm MUST BE: much more
Cost and resource tradeoffs especially in the face of variability and
reliability
From latency and bandwidth to flexibility and resilience
Scaling of future systems will be dominated by nondeterminism
GSRC Modeling SIG: Toward System Scaling Theory
Towards Parameterized Scalable Macromodels
Low-level (device- or gate-level) models accurate but unusable
for system-level exploration
Macromodels:
Estimate metrics such as delay, power, area, power/performance variability,
reliability for higher-level blocks
Are scalable to novel technologies
Are scalable to different design styles, Vdd, Vth, etc.
Are parameterized by architectural parameters of higher-level blocks
Allow designers to:
Speculatively achieve highest performance given area and power budgets
Explore reliability tradeoffs with area and power
Assess system-level resiliency requirements
Develop robust designs
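As a concrete sketch of what such a parameterized, scalable macromodel interface might look like: everything below — the class name, reference-node numbers, and the first-order scaling rules — is an illustrative assumption, not part of the roadmap.

```python
from dataclasses import dataclass

# Hypothetical macromodel interface; all names and scaling rules are
# illustrative assumptions, not calibrated roadmap data.
@dataclass
class Macromodel:
    base_delay_ns: float      # delay at the reference node
    base_power_mw: float      # dynamic power at the reference node
    base_area_mm2: float
    ref_node_nm: int = 65     # technology node the model was calibrated at

    def scaled(self, node_nm, vdd_ratio=1.0):
        """Naive first-order projection to a new node (assumes ideal scaling:
        delay ~ feature size, power ~ size * Vdd^2, area ~ size^2)."""
        s = node_nm / self.ref_node_nm          # linear shrink factor
        return Macromodel(
            base_delay_ns=self.base_delay_ns * s,
            base_power_mw=self.base_power_mw * s * vdd_ratio ** 2,
            base_area_mm2=self.base_area_mm2 * s ** 2,
            ref_node_nm=node_nm,
        )

cla = Macromodel(base_delay_ns=0.8, base_power_mw=12.0, base_area_mm2=0.05)
cla45 = cla.scaled(45)   # project the same block to a hypothetical 45nm node
print(cla45)
```

A real macromodel would replace the ideal-scaling rules with fits to gate-level projections (ITRS-style tables) and add variability/reliability terms.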
Use Model: Facilitate System-Level Exploration
[Figure: use-model flow — delay macromodels set the cycle time of an instruction-set or cycle-accurate simulator (performance); power macromodels give power; area macromodels give area; reliability macromodels identify vulnerable system components; variability macromodels identify yield-determining components; all feed system-level design]
Optimizations enabled:
• Evaluation for future technologies
• Area-performance tradeoff
• Power-performance tradeoff
• Resilience requirements due to reliability and/or variability
Challenges in Macromodeling
Lots of high-level blocks, algorithms and design styles
Some identified blocks (cf. Gajski “Architecture SC” request):
Array structures: single- and multiple-port SRAMs, content-addressable
memories, register files, reservation stations, renaming units, issue
queues, branch target buffers, etc.
Complex logic blocks: adders, multipliers, dividers, vector
blocks, normalization, rounding, etc.
IP blocks: encryption/decryption, JPEG/MPEG
compression/decompression, CRC, etc.
On-chip communication: buses, NoCs (Polaris)
Clocking network
Lack of robust reliability and variability prediction
Parametric Yield Estimation and Optimization
[Figure: variability data and technology/circuit data feed Fmax variability, SER macromodeling, and statistical clock skew analyses]
Example: Carry-Lookahead Adder
Parameters: bit width, lookahead stages
Design styles: dynamic, static, pass-gate
Delay: carry generation for the MSB is slowest
based on bit width and lookahead → calculate hierarchy levels →
identify critical path →
project delay from gate-level delay projections (ITRS + BPTM)
Power: calculated using bit width and lookahead stages in terms of gates,
projected using gate-level power
Area: similar to power
Reliability and variability projections from iTunes
All metrics calibrated with implementations for a few parameter settings
and technologies
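The CLA delay/area/power recipe above can be sketched as code. The group size, stage counts, and per-gate numbers below are illustrative assumptions, not calibrated projections.

```python
# Sketch of the CLA macromodel recipe: hierarchy levels from bit width and
# lookahead group size, then delay/area from gate-level numbers. All constants
# (group size, stage counts, per-gate delay/area) are made-up placeholders.
def cla_macromodel(bits, group=4, gate_delay_ps=20.0, gate_area_um2=2.0):
    # Lookahead hierarchy depth: smallest L with group**L >= bits.
    levels = 1
    while group ** levels < bits:
        levels += 1
    # Critical path: generate/propagate, up the lookahead tree, back down,
    # final sum — roughly 2*levels + 2 gate stages.
    delay_ps = (2 * levels + 2) * gate_delay_ps
    # Gate count: ~6 gates per bit slice plus ~10 gates per lookahead unit
    # at each level of the tree.
    gates = bits * 6 + sum(-(-bits // group ** i) * 10
                           for i in range(1, levels + 1))
    return {"levels": levels, "delay_ps": delay_ps,
            "area_um2": gates * gate_area_um2}

m = cla_macromodel(64)
print(m["levels"], m["delay_ps"])
```

A calibrated version would replace `gate_delay_ps`/`gate_area_um2` with gate-level projections per node and per design style (dynamic, static, pass-gate).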
Example: Memory Array
[Figure: memory array organization — address decoder driving a memory core of 6T cells, with write column logic on one side and precharge/read column logic on the other]
Parameters: #bitlines, #wordlines, #ECC bits, etc.
Design styles: memory cell design, layouts, drive strength ratios, etc.
Delay: addr decoder delay + memory cell read/write delay + bitline mux
delay → project delay from gate-level delay projections
Power: CACTI, IDAP (uses wordline cap, bitline cap, precharge device
cap/memory cell cap, #bit flips, etc.)
Area: memory cell area dominated, easy to predict & project
Reliability and variability projections from iTunes along with #ECC bits
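The same decomposition (decoder + cell access + bitline/sense) can be sketched directly. The cell area, stage counts, and periphery overhead below are placeholder assumptions, not CACTI/IDAP output.

```python
import math

# Rough sketch of the memory-array macromodel: delay from decoder depth plus
# fixed access/sense stages, area dominated by the cell array. Cell area,
# FO4 delay, and the 30% periphery overhead are invented placeholders.
def sram_macromodel(words, bits_per_word, ecc_bits=0,
                    cell_area_um2=0.5, fo4_ps=15.0):
    cols = bits_per_word + ecc_bits          # ECC widens every row
    # Decoder depth grows with log2 of the number of wordlines.
    decoder_stages = math.ceil(math.log2(words))
    # Delay = decoder + wordline/cell access (assumed 2 stages)
    #       + bitline mux/sense (assumed 3 stages).
    delay_ps = (decoder_stages + 2 + 3) * fo4_ps
    # Area: cell-array dominated, with a fixed periphery overhead factor.
    area_um2 = words * cols * cell_area_um2 * 1.3
    return {"delay_ps": delay_ps, "area_um2": area_um2}

m = sram_macromodel(words=1024, bits_per_word=32, ecc_bits=7)
print(m["delay_ps"], m["area_um2"])
```

Note how `ecc_bits` enters both area and (via wider rows) any power term — the hook through which reliability overhead shows up in the design metrics.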
Interconnect Stack Optimization
Why New Models ?
Classic scaling laws are not aware of the system-level implications
of scaling
Models of scaling do not represent the system constraint-driven
design of the future
Hardware overheads for resiliency, power, adaptability and
tuning work against scaling → performance implications
Models of design infrastructure in future nodes
should understand implications of circuit and
interconnect unreliability
Static variations – process variations, NBTI
Dynamic variations – temperature, SEU, EM
Existing models are too low-level to be usable in a
system design scenario, even with inference-chain
analysis (e.g., GTX)
Technology Scaling : Interconnect Implications
Vdd scaling slowing → delay scaling slowing down
Subthreshold slope limit
Vt scaling has Ioff consequences
Power concerns push Vdd down
Scaling interconnect dimensions → wire delays become worse →
huge performance penalties
(because devices also are not as fast)
Global wires are the worst victims
Repeaters are of limited help
Significant area and power penalty
Global communication a costly overhead
Image source: Prof. Saraswat, Stanford Univ.
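The "repeaters are of limited help" point can be made quantitative with a small Elmore-delay sketch: minimize the delay of a global wire over repeater count and size. All electrical values below are illustrative, not tied to any particular node.

```python
# Elmore-delay sketch of repeater insertion on a global wire. The driver,
# wire, and length parameters are illustrative placeholders.
R0, C0 = 10e3, 1e-15      # min-size driver resistance (ohm), input cap (F)
r, c = 200e3, 200e-12     # wire resistance (ohm/m) and capacitance (F/m)
L = 5e-3                  # 5 mm global wire

def delay(n, s):
    """0.69*Elmore delay of L split into n segments driven by size-s repeaters."""
    seg = L / n
    stage = 0.69 * ((R0 / s) * (C0 * s + c * seg)       # driver charging stage
                    + r * seg * (c * seg / 2 + C0 * s))  # distributed wire RC
    return n * stage

# Brute-force search over repeater count and sizing.
best = min(((delay(n, s), n, s)
            for n in range(1, 40)
            for s in range(1, 200)), key=lambda t: t[0])
unrepeatered = delay(1, 1)
print(best, unrepeatered)
```

The optimum trades a large delay reduction for the area and power of the repeaters themselves (here ~7 repeaters at ~100x minimum size) — which is exactly the "significant area and power penalty" noted above.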
Design Impact of Interconnect (non) Scaling
Repeater-driven interconnect is energy-, congestion-, and performance-limited
Maximum reachable distance in a clock cycle = ? (low-swing, differential, …)
Bandwidth vs. latency envelope = ? (encoding, power, signal reliability, …)
Latency is not the only problem: temperature, power density and EM
Temperature of global interconnect
rises with low-k → performance impact
Future NoC interconnections should
address performance/thermal/reliability
issues at the fabric design and design
optimization phases
This work searches for optimal NoC
interconnect stack parameters
New Directions for System-Level Interconnects
Wire pipelining, state-aware repeaters
Methodologies ?
Globally asynchronous, locally synchronous
Latency insensitive
Design paradigm shift from “computation-driven” to
“communication-driven”
Computation is no longer the bottleneck
Computation is cheap → exploit the computation infrastructure
to develop efficient communication mechanisms
Designs transforming into distributed systems
Interconnection network performance is key for system
performance → power, bandwidth, and throughput
envelope constrained by elements in the system
Design-Centric Modeling of Interconnects
Traditional modeling techniques consider individual wires for
characterization/optimization of interconnect performance
metrics →
no notion of design specificity
Multi-core / NoC / communication-system design exploration
and synthesis methodologies should consider the interconnect
fabric in the context of the design → design-centric modeling of
interconnects
Modeling design fabrics:
Design fabrics for communication-based design: nodes, interconnect
Global interconnect of data path elements, processor cores
Point-to-point/broadcast buses, links, switch/multiplexer interfaces, and
routers
Metrics for performance analysis of design fabrics
Conventional design metrics: performance, power, area
New metrics needed
Should reflect system-level power/performance characteristics
Interconnect Metrics (IM)
Traditional interconnect performance metrics
Signaling: Delay, power, bandwidth, noise, crosstalk, area
Clocking: Skew/jitter, power, slew rate, area
Power distribution: supply fidelity
Reliability: electromigration
Some recent metrics
Interconnect architecture rank: inclusive metric combining
delay, routability, area
Bandwidth/Energy: signifies throughput as well as energy
spent in signaling with a specific bandwidth
Problems with existing metrics:
No notion of design specificity → interconnect stack
performance is heavily dependent on a design’s wire-length
distribution
IM optimization based on canonical test structures is not
valid for all wiring topologies → sub-optimal results
BEOL Stack Metrics
Design-centric BEOL interconnect stack architectures →
global interconnect topologies for NoC
Macro blocks
[Figure: interconnect library macro blocks — (1) bus, (2) curves, (3) cross w/o contacts, (4)-(6) cross w/ contacts. Source: Addino et al.]
Macro-block configurations may vary in # of wires,
geometric parameters (width, spacing) and link
structure
Stack metrics:
Traditional metrics can be adapted to macro-blocks
New metrics: area-normalized throughput, power density
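The two new metrics can be made concrete with a toy calculation for a bus macro-block; the wire count, geometry, bit rate, and power numbers below are invented for illustration.

```python
# Toy computation of the two proposed stack metrics for a bus macro-block.
# All parameter values are illustrative, not characterized data.
def stack_metrics(n_wires, width_um, spacing_um, bit_rate_gbps,
                  power_per_wire_mw, length_mm=1.0):
    # Bus footprint across the routing direction: n wires at (width + spacing).
    cross_section_mm = n_wires * (width_um + spacing_um) / 1000.0
    area_mm2 = cross_section_mm * length_mm
    throughput_gbps = n_wires * bit_rate_gbps
    return {
        # Area-normalized throughput: Gbps per mm of bus cross-section.
        "throughput_gbps_per_mm": throughput_gbps / cross_section_mm,
        # Power density: mW per mm^2 over the routed area.
        "power_density_mw_mm2": n_wires * power_per_wire_mw / area_mm2,
    }

m = stack_metrics(n_wires=64, width_um=0.2, spacing_um=0.2,
                  bit_rate_gbps=2.0, power_per_wire_mw=0.5)
print(m)
```

Doubling wire width here leaves raw bandwidth unchanged but halves the area-normalized throughput — the kind of tradeoff the per-wire metrics above cannot express.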
Recall: TI Request for BEOL Stack Optimization
WANTED: BEOL stack optimization tool (Nagaraj, TI)
Inputs:
Stack options: thickness, pitch, dielectric materials,
process variations
Class of representative designs at RT-level: logic-only,
logic+memory, datapaths, CPU cores, SoC
Cell and IP library
Outputs:
Concise summary of tradeoffs for different BEOL stack
options
Area, clock and power distribution (on die and package),
performance, reliability, cost
BEOL Stack Optimization
Stack optimization: search for the set of macro-block
parameters that yields optimum points for the
performance metrics
Methodology
Step 1: Construction of interconnect library
Step 2: Electrical characterization of library elements for
different choices of geometric, user-specified parameters
(Addino et al, PATMOS’03)
Step 3: Computation of performance metrics
Bit transfer rate across cross-section of the elements
Power density per unit area
“Traditional metrics” – latency, bandwidth, noise etc.
Best solutions for different performance metrics may be
mutually conflicting → intelligent search in the parameter
space to obtain an optimal solution
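The parameter-space search in Step 3 can be illustrated with a toy Pareto sweep over geometric choices; the electrical scoring function below is a stand-in for a real characterization, not a calibrated model.

```python
from itertools import product

# Toy sweep over (width, spacing) choices for a macro-block, scored on
# area-normalized throughput (maximize) and power density (minimize),
# keeping the Pareto-optimal set. The scoring model is invented.
def score(width_um, spacing_um):
    pitch = width_um + spacing_um
    # Toy RC model: wider wires are faster; tight spacing adds slowing coupling.
    rate_gbps = 2.0 * (width_um / (width_um + 0.1)) \
                    * (spacing_um / (spacing_um + 0.05))
    throughput = rate_gbps / pitch                        # Gbps per um of stack
    power_density = rate_gbps * (0.2 + width_um) / pitch  # mW per um^2 (toy)
    return throughput, power_density

points = [(w, s, *score(w, s))
          for w, s in product([0.1, 0.2, 0.4], [0.1, 0.2, 0.4])]

def dominated(p, q):
    """q dominates p: no worse on both metrics, strictly better on one."""
    return (q[2] >= p[2] and q[3] <= p[3]
            and (q[2] > p[2] or q[3] < p[3]))

pareto = [p for p in points if not any(dominated(p, q) for q in points)]
print(len(points), len(pareto))
```

The mutually conflicting metrics show up directly: no single (width, spacing) point wins on both axes, so the tool must report the frontier and let design-level constraints pick the operating point.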
Interconnect Characterization for Communication-based Design
BEOL stack exploration: initial step toward interconnect fabric
design and optimization for MPSoCs, CMPs and heterogeneous
systems → best interconnect stack for a specific communication
objective
Novel interconnect metrics
Capture technology scaling
Capture system scaling (design constraints): consider impact of
memory hierarchy, interface timing, power, signal swing levels
Interconnect characterization: create models of performance
metrics for interconnect structures
E.g., which structure gives the best throughput per area for given
performance constraints?
How does power density change with bus parameters and power
constraints?
Probabilistic, continuum/hierarchy of models
Dial effort/information vs. accuracy
“N+1 N+2 shrink”; “Side + Ngate + Rent p”; “run Architecture Compiler”; …
Dial guardband vs. certainty
Conclusions
Existing BEOL stack analysis/optimization is oblivious to
system design constraints
Individual wires are no longer sufficient for performance
analysis → move to higher levels of abstraction
Communication-driven design synthesis paradigm
drives system-level interconnect analysis
Standalone metrics (e.g., delay, power, bandwidth)
cannot give a complete picture of performance →
new metrics: area-normalized throughput, power density
Explore parameter space to efficiently obtain stack
parameters for optimum performance