System Roadmap
Andrew B. Kahng
Core Pillar
September 29, 2006
[email protected]
Modeling Requirements for
System-Level Living Roadmap
Core Pillar Requirements (ASV)
 Benefits of technology scaling can be sustained by migrating
design process to system scaling paradigm
 design elements are IP blocks/processor cores as opposed
to devices and standard cells
 New system synthesis paradigms rely on accurate yet simple
models of delay/area/power/cost trade-offs for parameterized
design elements
 Models of block-level design metrics should account for
 Cost and impact of design techniques to cope with variability
 Cost of hardware (in terms of design metrics) required for adaptivity/resiliency
 Goal: Synthesize and abstract the impact of low-level
technology parameters (and their variabilities) on design
metrics of system-level blocks
BEOL Stack Optimization (Nagaraj, TI)
 Quality assessment of BEOL interconnect stacks
 Inputs: Technology parameters (resistivity, ILD thickness etc.),
geometry parameters (wire widths, pitches), Rent parameters
 Stack quality assessment is required for blocks (instead of individual
wires) such as data path elements, SoC components and processor
cores
 Outputs: Reports of trade-offs and models of stack performance
metrics for system-level exploration
 Existing wire-length distribution models and interconnect
performance metrics optimize stack metrics but do not take a
system-level view
 Stack parameter exploration and optimization should be
driven by “design-level” throughput and power considerations
 E.g., area-normalized throughput and power density
Interconnect Library Modeling (Carloni)
 Focus of design process is shifting from “computation” to
“communication”
 Device scaling and interconnect performance scaling mismatches are
causing breakdown of traditional across-chip communication
mechanisms
 New techniques (wave pipelining, stateful repeaters)  communication-
and network-centric approach for future designs
 Communication-driven design synthesis
 System-level design requirements translated to communication
mechanism between computational blocks  analogous to classic
synthesis process (design requirements translated to computational
blocks)
 Mapping stage involves association of communication apparatus (links,
repeaters, buses, routers etc.) to high-level synthesis solution
 similar to technology mapping of standard cells to generic netlist
 New synthesis methodology requires “characterized”
interconnect library composed of links, repeaters, routers, etc.
 Modeling/metrics SIG can provide models of latency,
bandwidth, throughput, power (high-level metrics) based on
thorough characterization of library elements based on device
and process technology roadmaps.
Concurrent Theme Requirements (Keutzer)
 Current technology extrapolation framework doesn’t allow
study of impact of design choices on high-level parameters
 E.g., what if a vector unit is added? What if local memory size is
increased?  what is the impact of architectural design choices on
chip-level attributes?
 Architectural exploration work requires models of design
metrics that are within ~20%
 There is a significant gap between numbers in ITRS and
technology extrapolation frameworks (BACPAC/GTX)
 Design space/architectural exploration based on rule/inference chains
will run out of steam  require models for higher-level design blocks
 Specific questions:
 What will be the size of an economical die in future nodes?
 # RISC processors that can be implemented
 # clock / power regimes (i.e., voltage islands)
 Clock frequencies in future nodes
 Power implications / trade-offs
Other Guidance
 Questions from Intel mentors
 How to model the reliability and the error rate of SRAM
 How to embed technological variability and reliability issues
into the system diagnosis
 How to identify the ‘hot spots’ of a design
 How to efficiently validate the design under variations
 Other
 What are the impacts of variability on NoC?
 NBTI power-law modeling (Purdue-TI)
Macromodeling
The Challenge of System Projection and Design
 What is impact of new technology on system macro
parameters?
 Execution speed, power consumed, latency, reliability, cost, …
 What macromodeling will enable system-level optimization ?
 System optimization : large block :: logic optimization : standard cell
 “Large block” = microprocessor, memory, network, bus, …
 Logic cell abstraction through 65nm WAS: size, power, delay
 Block abstraction beyond 65nm MUST BE: much more
 Cost and resource tradeoffs especially in the face of variability and
reliability
 From latency and bandwidth to flexibility and resilience
 Scaling of future systems will be dominated by nondeterminism
  GSRC Modeling SIG: Toward System Scaling Theory
Towards Parameterized Scalable Macromodels
 Low-level (device- or gate-level) models accurate but unusable
for system-level exploration
 Macromodels:
 Estimate metrics such as delay, power, area, power/performance variability,
reliability for higher-level blocks
 Are scaleable to novel technologies
 Are scaleable to different design styles, Vdd, Vth, etc.
 Are parameterized by architectural parameters of higher-level blocks
 Allow designers to:
 Speculatively achieve highest performance given area, power budget
 Explore reliability tradeoffs with area and power
 Assess system-level resiliency requirements
 Develop robust designs
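A minimal Python sketch of what such a parameterized macromodel interface could look like; all names (`MacromodelQuery`, `delay_macromodel`) and the scaling factors are hypothetical illustrations, not calibrated models:

```python
import math
from dataclasses import dataclass

@dataclass
class MacromodelQuery:
    """Architectural and technology parameters for one block-level query."""
    bit_width: int        # architectural parameter (e.g., datapath width)
    tech_node_nm: float   # target technology node, e.g., 45.0
    vdd: float            # supply voltage in volts
    ref_delay_ps: float   # calibrated gate delay at the reference node

def delay_macromodel(q: MacromodelQuery, ref_node_nm: float = 65.0,
                     ref_vdd: float = 1.1) -> float:
    """Hypothetical delay macromodel: logic depth grows logarithmically
    with bit width, and gate delay scales first-order with node and Vdd.
    The scaling factors are illustrative placeholders, not fitted data."""
    logic_depth = math.ceil(math.log2(q.bit_width)) + 2  # e.g., tree-structured block
    node_scale = q.tech_node_nm / ref_node_nm            # faster at smaller nodes
    vdd_scale = ref_vdd / q.vdd                          # slower at lower Vdd
    return logic_depth * q.ref_delay_ps * node_scale * vdd_scale
```

The point of the interface is that the same query object can be handed to delay, power, area, reliability, and variability macromodels, so exploration tools can sweep architectural and technology parameters uniformly.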
Use Model: Facilitate System-Level Exploration
 Macromodels drive system-level design exploration:
 Delay macromodels  cycle time  (instruction-set or cycle-accurate simulator)  performance
 Power macromodels  power
 Area macromodels  area
 Reliability macromodels  vulnerable system components
 Variability macromodels  yield-determining components
 Optimizations enabled:
 Evaluation for future technologies
 Area-performance tradeoff
 Power-performance tradeoff
 Resilience requirements due to reliability and/or variability
Challenges in Macromodeling
 Lots of high-level blocks, algorithms and design styles
 Some identified blocks (cf. Gajski “Architecture SC” request):
 Array structures: single- and multiple-port SRAMs, content-addressable
memories, register files, reservation stations, renaming units, issue
queues, branch target buffers, etc.
 Complex logic blocks: adders, multipliers, dividers, vector
blocks, normalization, rounding, etc.
 IP blocks: encryption/decryption, JPEG/MPEG
compression/decompression, CRC, etc.
 On-chip communication: buses, NoCs (Polaris)
 Clocking network
 Lack of robust reliability and variability prediction
Parametric Yield Estimation and Optimization
 Variability data and technology/circuit data feed models of Fmax
variability, SER macromodeling, and statistical clock skew
Example: Carry-Lookahead Adder
 Parameters: bit width, lookahead stages
 Design styles: dynamic, static, pass-gate
 Delay: carry generation for MSB slowest
 based on bit width and lookahead calculate hierarchy levels
 identify critical path
 project delay from gate-level delay projections (ITRS + BPTM)
 Power: calculated using bit width and lookahead stages in terms of gates,
projected using gate-level power
 Area: similar to power
 Reliability and variability projections from iTunes
 All metrics calibrated with implementations for few parameters and
technologies
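The recipe above (compute lookahead hierarchy levels from bit width and group size, then project the MSB carry path from gate-level delay projections) can be sketched as follows; the stage counts and names are illustrative assumptions, which a real macromodel would calibrate against implementations as the slide notes:

```python
import math

def cla_levels(bit_width: int, group_size: int) -> int:
    """Number of lookahead hierarchy levels for a carry-lookahead adder:
    each level reduces the number of carry groups by a factor of group_size."""
    levels, groups = 0, bit_width
    while groups > 1:
        groups = math.ceil(groups / group_size)
        levels += 1
    return levels

def cla_delay(bit_width: int, group_size: int, gate_delay: float) -> float:
    """Critical path (carry into the MSB): propagate/generate setup, carry
    up and back down the lookahead tree, then the final sum stage. The
    per-level gate counts below are illustrative placeholders."""
    levels = cla_levels(bit_width, group_size)
    # 1 p/g stage + 2 gate delays per level up + 2 per level down + 1 sum stage
    return (2 + 4 * levels) * gate_delay
```

For example, a 64-bit adder with 4-bit lookahead groups has three hierarchy levels; scaling `gate_delay` from ITRS/BPTM projections then yields the projected block delay.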
Example: Memory Array
[Diagram: memory array organization  address decoder, memory core of 6T
cells, write column logic, precharge/read column logic]
 Parameters: #bitlines, #wordlines, #ECC bits, etc.
 Design styles: memory cell design, layouts, drive strength ratios, etc.
 Delay: addr decoder delay + memory cell read/write delay + bitline mux
delay  project delay from gate-level delay projections
 Power: CACTI, IDAP (uses wordline cap, bitline cap, precharge device
cap/memory cell cap, #bit flips, etc.)
 Area: memory cell area dominated, easy to predict & project
 Reliability and variability projections from iTunes along with #ECC bits
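A minimal sketch of the delay decomposition above (decoder delay + cell/bitline delay + column-mux delay); the function name and the per-line capacitance constants are hypothetical calibration inputs, projected from gate-level data as the slide suggests:

```python
import math

def sram_read_delay(num_wordlines: int, num_bitlines: int,
                    gate_delay: float, wl_cap_delay: float,
                    bl_cap_delay: float) -> float:
    """First-order SRAM read-delay macromodel following the slide's
    decomposition. wl_cap_delay / bl_cap_delay are illustrative
    delay-per-attached-cell constants, not measured values."""
    # Decoder: logic depth grows with log2 of the number of wordlines
    decoder = math.ceil(math.log2(num_wordlines)) * gate_delay
    # Wordline and bitline RC grow linearly with array dimensions
    wordline = num_bitlines * wl_cap_delay
    bitline = num_wordlines * bl_cap_delay
    column_mux = 2 * gate_delay   # illustrative fixed mux depth
    return decoder + wordline + bitline + column_mux
```

Area is simpler still (memory-cell area dominates), which is why the slide calls it easy to predict and project.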
Interconnect Stack Optimization
Why New Models ?
 Classic scaling laws do not capture the system-level implications
of scaling
 Models of scaling do not represent the system constraint-driven
design of the future
 Hardware overheads for resiliency, power, adaptability and
tuning go against scaling  performance implications
 Models of design infrastructure in future nodes
should understand implications of circuit and
interconnect unreliability
 Static variations – process variations, NBTI
 Dynamic variations – temperature, SEU, EM
 Existing models are too low-level to be usable in
system design scenario even with inference chain
analysis (e.g., GTX)
Technology Scaling : Interconnect Implications
 Vdd scaling slowing  Delay scaling slowing down
 Subthreshold slope limit
 Vt scaling has Ioff consequences
 Power concerns push Vdd down
 Scaling interconnect dimensions
 Wire delays become worse
 Huge performance penalties
(because devices also are not as fast)
 Global wires are the worst victims
 Repeaters are of limited help
 Significant area and power penalty
Global communication  a costly overhead
Image source: Prof. Saraswat, Stanford Univ.
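The limited help from repeaters can be quantified with the classic first-order repeater-insertion model (Elmore delay with the usual 0.4/0.7 coefficients): even with optimal repeater count and sizing, total delay still grows linearly with wire length. A sketch, with illustrative (not process-specific) parameter values in the test:

```python
import math

def repeated_wire_delay(length: float, r: float, c: float,
                        R0: float, C0: float) -> float:
    """Total delay of a global wire with optimally inserted repeaters.
    length: wire length; r, c: wire resistance/capacitance per unit length;
    R0, C0: driver resistance and input capacitance of a unit repeater."""
    # Closed-form optima for repeater count and size (Bakoglu-style)
    n = max(1, round(length * math.sqrt(0.4 * r * c / (0.7 * R0 * C0))))
    s = math.sqrt(R0 * c / (r * C0))
    seg = length / n
    # Elmore delay of one repeater-driven segment
    seg_delay = (0.7 * (R0 / s) * (s * C0 + c * seg)
                 + r * seg * (0.4 * c * seg + 0.7 * s * C0))
    return n * seg_delay
```

Doubling the wire length roughly doubles the optimally repeated delay, while each repeater adds area and power, which is exactly the "costly overhead" the slide refers to.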
Design Impact of Interconnect (non) Scaling
 Repeater-driven interconnect is energy, congestion, performance-limited
 Maximum reachable distance in a clock cycle = ? (low-swing, differential, …)
 Bandwidth vs. latency envelope = ? (encoding, power, signal reliability, …)
 Latency is not the only problem: temperature, power density and EM
Temperature of global interconnect
rises with low-k  performance impact
 Future NoC interconnections should
address performance/thermal/reliability
issues at fabric design, and design
optimization phases
 This work  search for optimal NoC
interconnect stack parameters
New Directions for System-Level Interconnects
 Wire pipelining, state-aware repeaters
 Methodologies ?
 Globally asynchronous, locally synchronous
 Latency insensitive
 Design paradigm shift from “computation-driven” to
“communication-driven”
  Computation is no longer the bottleneck
 Computation is cheap  exploit computation infrastructure
to develop efficient communication mechanisms
 Designs transforming into distributed systems
 Interconnection network performance key for system
performance  power, bandwidth, and throughput
envelope constrained by elements in the system
Design-Centric Modeling of Interconnects
 Traditional modeling techniques consider individual wires for
characterization / optimization of interconnect performance
metrics
  no notion of design specificity
 Multi-core / NoC / communication-system design exploration
and synthesis methodologies should consider interconnect
fabric in the context of design  design-centric modeling of
interconnects
 Modeling design fabrics:
 Design fabrics for communication-based design: nodes, interconnect
 Global interconnect of data path elements, processor cores
 Point-to-point/broadcast buses, links, switch/multiplexer interfaces, and
routers
 Metrics for performance analysis of design fabrics
 Conventional design metrics: performance, power, area
 New metrics needed
 Should reflect system-level power/performance characteristics
Interconnect Metrics (IM)
 Traditional interconnect performance metrics
 Signaling: Delay, power, bandwidth, noise, crosstalk, area
 Clocking: Skew/jitter, power, slew rate, area
 Power distribution: supply fidelity
 Reliability: electromigration
 Some recent metrics
 Interconnect architecture rank: inclusive metric combining
delay, routability, area
 Bandwidth/Energy: signifies throughput as well as energy
spent in signaling with a specific bandwidth
 Problems with existing metrics:
 No notion of design specificity  interconnect stack
performance is heavily dependent on a design’s wire length
distribution
 IM optimization based on canonical test structures is not
valid for all wiring topologies  sub-optimal results
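A sketch of the proposed design-centric metrics for a parallel-bus macro-block; the function and its inputs are hypothetical, intended only to show how area-normalized throughput, bandwidth/energy, and power density relate for one structure:

```python
def bus_metrics(num_wires: int, bit_rate_per_wire: float,
                wire_pitch: float, bus_length: float,
                energy_per_bit: float) -> dict:
    """Design-centric stack metrics for a hypothetical parallel bus.
    Units are up to the caller (e.g., bits/s, um, J/bit); the formulas
    are first-order illustrations, not a characterized library model."""
    bandwidth = num_wires * bit_rate_per_wire          # total bits/s
    area = num_wires * wire_pitch * bus_length         # routing footprint
    return {
        "throughput_per_area": bandwidth / area,       # bits/s per unit area
        "bandwidth_per_energy": bandwidth / energy_per_bit,
        "power_density": bandwidth * energy_per_bit / area,  # W per unit area
    }
```

Because these metrics are computed per macro-block rather than per wire, they inherit the design's wire-count and geometry rather than a canonical test structure's.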
BEOL Stack Metrics
 Design-centric BEOL interconnect stack architectures
 global interconnect topologies for NoC
 Interconnect library of macro-block templates (source: Addino et al.):
bus (1), curves (2), cross without contacts (3), cross with
contacts (4)-(6)
 Macro-block configurations may vary in # of wires,
geometric parameters (width, spacing) and link
structure
 Stack metrics:
 Traditional metrics can be adapted to macro-blocks
 New metrics: area-normalized throughput, power density
Recall: TI Request for BEOL Stack Optimization
 WANTED: BEOL stack optimization tool (Nagaraj, TI)
 Inputs:
 Stack options: thickness, pitch, dielectric materials,
process variations
 Class of representative designs at RT-level: logic-only,
logic+memory, datapaths, CPU cores, SoC
 Cell and IP library
 Outputs:
 Concise summary of tradeoffs for different BEOL stack
options
 Area, clock and power distribution (on die and package),
performance, reliability, cost
BEOL Stack Optimization
 Stack optimization: search for the best set of macro
block parameters which yield optimum points for
performance metrics
 Methodology
 Step 1: Construction of interconnect library
 Step 2: Electrical characterization of library elements for
different choices of geometric, user-specified parameters
(Addino et al, PATMOS’03)
 Step 3: Computation of performance metrics
 Bit transfer rate across cross-section of the elements
 Power density per unit area
 “Traditional metrics” – latency, bandwidth, noise etc.
 Best solutions for different performance metrics may be
mutually conflicting  intelligent search in parameter
space to obtain optimal solution
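Since the best solutions for different metrics conflict, the search in Step 3 naturally returns a Pareto set rather than a single optimum. A brute-force sketch of that search, with a toy `metric_fn` standing in for the electrically characterized library of Steps 1-2 (all names hypothetical):

```python
import itertools

def search_stack_params(widths, spacings, metric_fn):
    """Sweep macro-block geometry parameters (wire width, spacing) and
    keep the Pareto-optimal points under two conflicting objectives:
    maximize throughput per area, minimize power density.
    metric_fn(w, s) -> (throughput_per_area, power_density)."""
    points = [(w, s, *metric_fn(w, s))
              for w, s in itertools.product(widths, spacings)]
    # A point is dominated if another point is at least as good on both
    # objectives (higher throughput/area AND lower power density)
    return [p for p in points
            if not any(q[2] >= p[2] and q[3] <= p[3] and q != p
                       for q in points)]
```

A real implementation would replace the exhaustive sweep with an intelligent search (as the slide calls for) once the parameter space grows beyond a few dimensions.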
Interconnect Characterization for Communication-based Design
 BEOL stack exploration: initial step toward interconnect fabric
design and optimization for MPSoC, CMPs and heterogeneous
systems  best interconnect stack for specific communication
objective
 Novel interconnect metrics
 Capture technology scaling
 Capture system scaling (design constraints): consider impact of
memory hierarchy, interface timing, power, signal swing levels
 Interconnect characterization: create models of performance
metrics for interconnect structures
 E.g., which structure gives the best throughput per area for given
performance constraints?
 How does power density change with bus parameters and power
constraints?
 Probabilistic, continuum/hierarchy of models
 Dial effort/information vs. accuracy

“N+1  N+2 shrink”; “Side + Ngate + Rent p”; “run Architecture Compiler”; …
 Dial guardband vs. certainty
Conclusions
 Existing BEOL stack analysis/optimization oblivious of
system design constraints
 Individual wires no longer sufficient for performance
analysis  move to higher levels of abstraction
 Communication-driven design synthesis paradigm
drives system-level interconnect analysis
 Standalone metrics (e.g., delay, power, bandwidth)
cannot give complete picture of performance
 new metrics: area-normalized throughput, power density
 Explore parameter space to efficiently obtain stack
parameters for optimum performance