Transcript Slide 1
Analytical Clustering Score with
Application to Post-Placement
Multi-Bit Flip-Flop Merging
Chang Xu1, Peixin Li1, Guojie Luo1, Yiyu Shi2, and Iris Hui-Ru Jiang3
{changxu, gluo} @ pku.edu.cn
1
Outline
Background
• Multi-bit flip-flop
• Previous works and limitation
Our method
• Analytical score
• Discrete refinement
• Efficient implementation
Experimental results
Conclusion
2
Clock Power Optimization
Clock power dominates dynamic power
$P_{clk} = \alpha C_{clk} V_{dd}^2 f_{clk}$
Clock power optimization
Reduce $\alpha$
• Clock gating technique
Reduce $V_{dd}$
• Sub-threshold voltage
• Multi-supply-voltage
Reduce $C_{clk}$
• Multi-bit flip-flop
• Resonance clock
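As a small worked example of the clock power formula on this slide, the sketch below evaluates the product of activity factor, clock capacitance, squared supply voltage and clock frequency for two clock capacitances. All numeric values are hypothetical and chosen only to show that, with the other factors fixed, power scales linearly with the clock capacitance.

```python
def clock_power(alpha, c_clk, v_dd, f_clk):
    """P_clk = alpha * C_clk * V_dd^2 * f_clk, as on the slide."""
    return alpha * c_clk * v_dd ** 2 * f_clk

# Hypothetical numbers: 10 pF of clock capacitance before merging, 8 pF after.
baseline = clock_power(alpha=1.0, c_clk=10e-12, v_dd=1.0, f_clk=1e9)
merged = clock_power(alpha=1.0, c_clk=8e-12, v_dd=1.0, f_clk=1e9)

# Relative saving equals the relative capacitance reduction (20% here).
saving = 1.0 - merged / baseline
print(f"clock power saving: {saving:.0%}")
```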
3
Multi-Bit Flip-Flop (MBFF)
What's MBFF
• Several single-bit flip-flops (SBFFs) share common inverters in an MBFF cell
Power saving comes from
• MBFF library
• Simplified clock tree
[Figure: 2-bit flip-flop, UMC 55nm process. Source: Faraday cell library]
[Figure: (a) Common clock tree; (b) Simplified clock tree with MBFF. Source: ICCAD'10 Chang et al.]
4
Using MBFF at Different Stages
Pre-placement MBFF
• SNUG'10 Chen et al.
In-placement MBFF
• ISPD'13 Tsai et al.
• ICCAD'13 Hsu et al.
Post-placement MBFF
• ICGCS'10 Yan and Chen
• ICCAD'10 Chang et al.
• ISPD'11 Jiang et al. (INTEGRA)
[Figure: design flow. Logic Synthesis (MBFF clustering) → Placement (MBFF clustering) → Timing Analysis → Post-placement Optimization (MBFF clustering) → CTS → Routing]
5
Post-Placement MBFF Clustering
Input
• Placement of FFs and other gates
• Timing slacks
• MBFF library
Output
• FF clusters (MBFF)
Constraint
• Timing constraint
[Figure: FFs with their input pins, output pins, and TVFR]
6
Post-Placement MBFF Clustering
Timing violation free region (TVFR)
[Figure labels: FF with its input pin, output pin, and TVFR; TVFR1 and TVFR2 with a 2-bit FF]
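To make the idea behind this figure concrete, here is a minimal sketch of testing whether two FFs can share one MBFF location. It assumes each TVFR is an axis-aligned rectangle `(xmin, ymin, xmax, ymax)`; the slides do not state the region shape, so the rectangle representation and the function name `intersect` are illustrative assumptions only.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed shape

def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the common region of two rectangles, or None if they are disjoint."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin > xmax or ymin > ymax:
        return None
    return (xmin, ymin, xmax, ymax)

tvfr1 = (0.0, 0.0, 4.0, 4.0)
tvfr2 = (2.0, 1.0, 6.0, 3.0)
# A 2-bit MBFF placed anywhere in the overlap keeps both FFs timing-feasible.
print(intersect(tvfr1, tvfr2))
```

Under this assumption, a group of FFs is mergeable only if the running intersection of their regions stays non-empty.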
7
Previous Works and Limitation
Intersection graph-based searching [ICCAD'10]
[Figure: TVFRs → intersection graph; TVFRs → complete graph]
• Time consuming: $O(N^3)$
• Window-based acceleration affects power reduction
8
Previous Works and Limitation
Interval graph-based searching [ISPD'11]
[Figure: illustration of the interval graph, with a "Random Choice!" callout. Source: ISPD'11 Jiang et al.]
• Efficient: sub-quadratic time complexity
• Effective: best power reduction
• Simple: signal wirelength degradation
9
Benchmarks: C1-C6 vs. IWLS 2005
Difference
• TVFD/AFFD: a rough estimate of the number of FFs that can be covered within a TVFR
• IWLS benchmarks have many more MBFF candidates!
[Figure: FF ratio vs. TVFD/AFFD, for C1-C6 and for Vga (IWLS 2005)]
Signal wirelength degradation (for Integra)
• C1-C6: Avg. 3%
• IWLS: Avg. 932%
10
Our Contribution
Efficient and great scalability
• Sub-quadratic time complexity
Robust performance
• Power reduction: comparable to Integra
• Signal wirelength: much better than Integra, especially for real designs
Analytical fashion
• Potential integration in analytical global placement
• Potential usage for clustering algorithms
11
Optimization Flow
12
Analytical Step: Basic Idea
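Optimization problem
$\min \; \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
$\text{s.t.} \; t(\mathbf{x}, \mathbf{y}) \le T$
$f_l(\mathbf{x}, \mathbf{y})$: signal wirelength
• Weighted-average WL [DAC'11]
$f_c(\mathbf{x}, \mathbf{y})$: #FF groups
• Nontrivial to formulate
Timing constraint
• Feasible region: TVFRs
[Figure: a 2-bit group and a 3-bit group inside TVFRs]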
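The wirelength term on this slide refers to the weighted-average (WA) model. A minimal sketch of that model for a single net is shown below, assuming the usual form in which the maximum and minimum pin coordinates are replaced by exponentially weighted averages. The smoothing parameter `gamma` and the function names are illustrative choices, not taken from the slides.

```python
import math

def wa_span(coords, gamma):
    """Smooth approximation of max(coords) - min(coords) via weighted averages."""
    hi, lo = max(coords), min(coords)
    # Exponents are shifted by hi / lo so that exp() never overflows.
    w_max = [math.exp((c - hi) / gamma) for c in coords]
    w_min = [math.exp(-(c - lo) / gamma) for c in coords]
    smooth_max = sum(c * w for c, w in zip(coords, w_max)) / sum(w_max)
    smooth_min = sum(c * w for c, w in zip(coords, w_min)) / sum(w_min)
    return smooth_max - smooth_min

def wa_wirelength(pins, gamma):
    """WA wirelength of one net: smooth x-span plus smooth y-span."""
    return wa_span([p[0] for p in pins], gamma) + wa_span([p[1] for p in pins], gamma)

pins = [(0.0, 0.0), (3.0, 4.0), (1.0, 2.0)]
hpwl = 3.0 + 4.0  # exact half-perimeter wirelength of this net
print(wa_wirelength(pins, gamma=0.01), wa_wirelength(pins, gamma=1.0), hpwl)
```

A small `gamma` tracks the half-perimeter wirelength closely, while a larger `gamma` gives a smoother function that underestimates it. That smoothness is what makes the term usable inside a gradient-based solver.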
13
Analytical Step: Def. of Clustering Score
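Dirac delta function
$\delta(w, z) = \begin{cases} 1 & (w = z) \\ 0 & (w \ne z) \end{cases}$
$\delta(\|(x_i, y_i) - (x_j, y_j)\|, 0) = \begin{cases} 1 & \|(x_i, y_i) - (x_j, y_j)\| = 0 \\ 0 & \text{otherwise} \end{cases}$
Cluster size
$N_i(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{N} \delta(\|(x_i, y_i) - (x_j, y_j)\|, 0)$
[Figure: FFs inside TVFRs; $FF_i$ with $N_i = 3$ and $FF_j$ with $N_j = 2$]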
14
Analytical Step: Def. of Clustering Score
Objective function: $f_c$ term
• A 4-bit group is the most efficient
$\min -f_c = -\max f_c = -\max \sum_{i=1}^{N} \delta(N_i(\mathbf{x}, \mathbf{y}), 4)$
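The exact (non-smoothed) clustering score can be evaluated directly from its definition: the cluster size of an FF is the number of FFs at the same location (the FF itself included, since the sum runs over all j), and the score counts the FFs whose cluster size equals 4. The sketch below is only an illustration; the function names and the sample coordinates are made up.

```python
from collections import Counter

def cluster_sizes(points):
    """N_i: number of FFs located exactly at the position of FF_i (itself included)."""
    counts = Counter(points)
    return [counts[p] for p in points]

def clustering_score(points, target=4):
    """f_c: number of FFs whose cluster size equals the target bit width."""
    return sum(1 for n in cluster_sizes(points) if n == target)

# Four FFs stacked at (0, 0), two at (5, 5), one alone at (9, 9).
ffs = [(0, 0)] * 4 + [(5, 5)] * 2 + [(9, 9)]
print(cluster_sizes(ffs))     # [4, 4, 4, 4, 2, 2, 1]
print(clustering_score(ffs))  # 4
```

Because this score changes only when locations coincide exactly, it has no useful gradient, which is why the next slide smooths it.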
15
Analytical Step: Smoothing
Gaussian function
$\delta(w, z) \approx D(w, z) = \exp\left(\frac{(w - z)^2 \ln \epsilon}{d_0^2}\right)$
• $D(w, z) = 1$ when $w = z$
• $D(w, z) < \epsilon$ when $|w - z| > d_0$
[Figure: Dirac delta function vs. Gaussian function]
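The two properties of the Gaussian approximation on this slide can be checked numerically. The sketch below implements D(w, z) as written and uses it to compute a smoothed cluster size. The values `d0 = 1.0` and `eps = 0.01` are arbitrary illustrative choices, and the helper names are not from the slides.

```python
import math

def smooth_delta(w, z, d0, eps):
    """D(w, z) = exp((w - z)^2 * ln(eps) / d0^2), a smooth stand-in for delta(w, z)."""
    return math.exp((w - z) ** 2 * math.log(eps) / d0 ** 2)

def smooth_cluster_sizes(points, d0, eps):
    """Smoothed N_i: every FF contributes D(distance to FF_i, 0)."""
    return [sum(smooth_delta(math.dist(p, q), 0.0, d0, eps) for q in points)
            for p in points]

print(smooth_delta(4, 4, d0=1.0, eps=0.01))  # exactly 1 when w == z
print(smooth_delta(5, 4, d0=1.0, eps=0.01))  # equals eps when |w - z| == d0
print(smooth_delta(6, 4, d0=1.0, eps=0.01))  # below eps when |w - z| > d0

ffs = [(0.0, 0.0)] * 4 + [(100.0, 0.0)]
sizes = smooth_cluster_sizes(ffs, d0=1.0, eps=0.01)
```

With FFs either coincident or far apart, the smoothed sizes match the exact counts; in between they vary continuously, which is what provides a gradient.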
16
Analytical Step: Effectiveness
Attractive force (PULL) & repelling force (PUSH)
[Figure: attractive and repelling forces around $FF_i$]
17
Analytical Step: Preliminary Clusters
[Figure: (a) Initial FFs' distribution (Init. Loc.); (b) FFs' distribution after analytical clustering (NLP Loc.). Both axes range from 0 to 3500.]
$f_c$: maximizes MBFF group numbers
$f_l$: pulls FFs towards their "optimal locations" in terms of WL
18
Discrete Step: Basic Idea
Two-pass best-choice clustering
• First-pass: discretization
• Second-pass: refinement
[Figure, FFs A to I: (a) Proximity relation after analytical step; (b) Discrete clustering (first-pass); (c) Discrete refinement (second-pass); (d) Final MBFF groups]
19
Discrete Step: Two-Pass Best-Choice Clustering
First-pass: extract proximity relation
• Bottom-up merging
• Priority queue
• Tuple: $(FF_i, FF_j, d)$, $d = dist(FF_i, FF_j)$
• Capacity constraint: 4-bit
Second-pass: further refinement
• Improve the ratio of 4-bit groups
[Figure, FFs A to I: (a) Proximity relation after analytical step; (b) First-pass clustering; (c) Second-pass clustering; (d) Final MBFF groups]
[Figure: priority queue entries S(C,D), S(G,F), S(E,G), S(I,H), S(A,B), S(I,E), S(H,F), S(A,C)]
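The first pass described on this slide (bottom-up merging driven by a priority queue of FF-pair tuples, with a 4-bit capacity limit) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: clusters are represented by their centroids, candidate pairs are limited by a hypothetical `max_dist` threshold, timing feasibility is not checked, and all pairs are enumerated directly rather than through a bin structure.

```python
import heapq
import math

def best_choice_clusters(points, max_dist, capacity=4):
    """Greedy bottom-up merging: always merge the closest pair that fits the capacity."""
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member FF indices
    center = {i: points[i] for i in clusters}        # cluster id -> centroid
    heap = []                                        # tuples (d, id_a, id_b)

    def push_pairs(a, others):
        for b in others:
            d = math.dist(center[a], center[b])
            if d <= max_dist:
                heapq.heappush(heap, (d, a, b))

    ids = list(clusters)
    for k, a in enumerate(ids):
        push_pairs(a, ids[k + 1:])

    next_id = len(points)
    while heap:
        _, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue  # stale tuple: one side was already merged
        if len(clusters[a]) + len(clusters[b]) > capacity:
            continue  # would exceed the largest MBFF in the library
        members = clusters.pop(a) + clusters.pop(b)
        del center[a], center[b]
        clusters[next_id] = members
        center[next_id] = (sum(points[m][0] for m in members) / len(members),
                           sum(points[m][1] for m in members) / len(members))
        push_pairs(next_id, [c for c in clusters if c != next_id])
        next_id += 1
    return sorted(sorted(m) for m in clusters.values())

ffs = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (30, 30)]
groups = best_choice_clusters(ffs, max_dist=3.0)
print(groups)  # [[0, 1, 2, 3], [4, 5], [6]]
```

Stale heap entries are skipped lazily instead of being removed, which keeps every queue operation logarithmic in the queue size.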
20
MBFF Clusters
[Figure: FF locations at three stages (Init. Loc., NLP Loc., Final Loc.). Both axes range from 0 to 3500.]
21
Efficient Implementation
Sub-quadratic time complexity
Analytical Step: $\min \; \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$ s.t. $t(\mathbf{x}, \mathbf{y}) \le T$
• Gradient calculation
  Fast Gauss transform (FGT): $O(N^2) \Rightarrow O(N)$
• Nonlinear programming solver
  Nesterov method
  Placement-like problem: $O(N^{1.18})$
Discrete Refinement
• FF-pair distance
  Bin-structure searching: $O(N^2) \Rightarrow O(N)$
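The bin-structure search mentioned on this slide avoids comparing every FF pair. A minimal sketch follows, assuming a uniform grid whose bin width equals the search radius, so that only the 3×3 neighbourhood of each bin has to be scanned. The function name and the sample data are illustrative, and the linear-time behaviour additionally assumes that each bin holds a bounded number of FFs.

```python
import math
from collections import defaultdict

def close_pairs(points, radius):
    """Return sorted (distance, i, j) tuples for all FF pairs within `radius`."""
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(math.floor(x / radius), math.floor(y / radius))].append(idx)

    pairs = []
    for (bx, by), members in bins.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty bins while iterating over the dict.
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in members:
                        if i < j:  # report each unordered pair once
                            d = math.dist(points[i], points[j])
                            if d <= radius:
                                pairs.append((d, i, j))
    return sorted(pairs)

ffs = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1), (10.0, 10.0)]
near = close_pairs(ffs, radius=1.0)
print([(i, j) for _, i, j in near])  # [(2, 3), (0, 1)]
```

The resulting tuples have the same shape as the (FF, FF, distance) entries fed to the priority queue in the discrete step.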
22
Experimental Results: Setup
G++ 4.5.1 -O3
Intel Xeon CPU @ 2.4GHz with 16 logical threads
Benchmarks: C1-C6, IWLS-2005 suite
Synthesis flow for real designs
• Synopsys DC
• Cadence Encounter SOC
23
Experimental Results: C1-C6
Comparable power reduction
33% WL reduction

          Integra                  Ours
Circuit   PWR    WLR    RT (s)     PWR    WLR    RT (s)
C1        82.8   96     0.01       83.5   77.4   0.42
C2        80.9   102    0.01       82.3   76.4   0.97
C3        80.8   104    0.01       82.3   74.9   3.14
C4        81.0   104    0.02       82.4   75.6   10.59
C5        80.7   105    0.05       82.1   76.4   16.66
C6        80.7   105    1.11       82.3   82     217.4
Avg.      1      1.33   1          1.02   1      252
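The Avg. row of this table is normalized rather than a plain mean of the columns. The sketch below shows one reading that reproduces the printed values: each average is the mean of per-circuit ratios, with Ours as the reference for wirelength and Integra as the reference for power and runtime. This interpretation is inferred from the numbers, not stated on the slide.

```python
# Per-circuit (PWR, WLR, RT in seconds), copied from the C1-C6 table.
integra = {"C1": (82.8, 96, 0.01), "C2": (80.9, 102, 0.01), "C3": (80.8, 104, 0.01),
           "C4": (81.0, 104, 0.02), "C5": (80.7, 105, 0.05), "C6": (80.7, 105, 1.11)}
ours = {"C1": (83.5, 77.4, 0.42), "C2": (82.3, 76.4, 0.97), "C3": (82.3, 74.9, 3.14),
        "C4": (82.4, 75.6, 10.59), "C5": (82.1, 76.4, 16.66), "C6": (82.3, 82, 217.4)}

def mean_ratio(numerator, denominator, column):
    """Average over circuits of numerator[circuit] / denominator[circuit]."""
    ratios = [numerator[c][column] / denominator[c][column] for c in numerator]
    return sum(ratios) / len(ratios)

wl_ratio = mean_ratio(integra, ours, 1)   # Integra wirelength relative to Ours
pwr_ratio = mean_ratio(ours, integra, 0)  # Ours power relative to Integra
rt_ratio = mean_ratio(ours, integra, 2)   # Ours runtime relative to Integra
print(round(wl_ratio, 2), round(pwr_ratio, 2), round(rt_ratio))  # 1.33 1.02 252
```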
24
Experimental Results: Real Designs
Bound-Integra
[Figure: Effect of different bound factors on power ratio and WL ratio]
25
Experimental Results: Real Designs
Comparable power reduction
43% WL reduction compared with Bound-Integra

            Bound-Integra              Ours
Circuit     PWR     WLR     RT (s)     PWR     WLR     RT (s)
Tv80        78.11   109.2   0.01       78.10   95.7    0.94
Wbconmax    78.26   128     0.03       78.02   105     2.3
Pairing     78.00   132     0.03       78.00   109     6.61
Dma         78.04   124     0.05       78.02   96      5.43
Ac97        78.02   120     0.02       78.02   96      4.88
Ethernet    78.00   217     0.63       78.00   88      24.5
Avg.        1       1.43    1          0.99    1       84
26
Conclusion
We propose an analytical clustering score to merge MBFFs
• The time complexity is sub-quadratic
• We get power reduction comparable to Integra
• We reduce wirelength by about 25% compared with the original placement
Potential usage:
• Integrated in global placement
• Clustering algorithms
27
Q&A
Thanks
{changxu, gluo} @pku.edu.cn
28
Backup
[Figure: FFs with their input and output pins, and the best MBFF location]
29
Backup
Attractive force & repelling force
30
Backup
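Proof
$FF_i$ attracts $FF_j$ when $(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} > 0$
$FF_i$ repels $FF_j$ when $(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} < 0$
$\frac{\partial f_{c,i}}{\partial N_i} = 2\lambda_1 (N_i - 3) \exp((N_i - 3)^2 \lambda_1)$
$\frac{\partial N_i}{\partial x_j} = 2\lambda_2 \exp(\lambda_2((x_i - x_j)^2 + (y_i - y_j)^2)) (x_j - x_i)$
$\frac{\partial f_{c,i}}{\partial x_j} = \frac{\partial f_{c,i}}{\partial N_i} \cdot \frac{\partial N_i}{\partial x_j}$
$(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} > 0$ when $N_i < 3$
$(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} < 0$ when $N_i > 3$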
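The sign argument of this proof can be checked numerically. The sketch below evaluates the chain-rule derivative from the slide, compares it with a central finite difference, and tests the sign of the attraction indicator in a sparse and a crowded neighbourhood. Several choices are assumptions made for illustration: both smoothing constants are set to ln(0.01)/d0² with arbitrary d0 values, and the smoothed size of an FF counts only the other FFs, which is one way to read the target value 3 used on the slide.

```python
import math

LAMBDA_1 = math.log(0.01) / 3.0 ** 2  # smooths delta(N_i, 3); negative
LAMBDA_2 = math.log(0.01) / 2.0 ** 2  # smooths delta(distance, 0); negative
TARGET = 3                            # target value used on the proof slide

def smooth_size(i, pts):
    """Smoothed N_i, counting the other FFs only (assumption, see lead-in)."""
    xi, yi = pts[i]
    return sum(math.exp(LAMBDA_2 * ((xi - x) ** 2 + (yi - y) ** 2))
               for k, (x, y) in enumerate(pts) if k != i)

def score(i, pts):
    """f_{c,i} = exp(lambda_1 * (N_i - 3)^2)."""
    return math.exp(LAMBDA_1 * (smooth_size(i, pts) - TARGET) ** 2)

def dscore_dxj(i, j, pts):
    """Chain rule from the slide: df/dN_i * dN_i/dx_j."""
    (xi, yi), (xj, yj) = pts[i], pts[j]
    n_i = smooth_size(i, pts)
    df_dn = 2 * LAMBDA_1 * (n_i - TARGET) * math.exp((n_i - TARGET) ** 2 * LAMBDA_1)
    dn_dx = 2 * LAMBDA_2 * math.exp(LAMBDA_2 * ((xi - xj) ** 2 + (yi - yj) ** 2)) * (xj - xi)
    return df_dn * dn_dx

def numeric_dscore_dxj(i, j, pts, h=1e-6):
    """Central finite difference of f_{c,i} with respect to x_j."""
    plus, minus = list(pts), list(pts)
    plus[j] = (pts[j][0] + h, pts[j][1])
    minus[j] = (pts[j][0] - h, pts[j][1])
    return (score(i, plus) - score(i, minus)) / (2 * h)

def force_sign(i, j, pts):
    """Positive: FF_i attracts FF_j. Negative: FF_i repels FF_j."""
    return (pts[i][0] - pts[j][0]) * dscore_dxj(i, j, pts)

sparse = [(0.0, 0.0), (1.5, 0.0)]                       # N_0 well below 3
crowded = [(0.0, 0.0), (0.3, 0.0), (-0.3, 0.0),
           (0.0, 0.3), (0.0, -0.3), (0.2, 0.2)]         # N_0 above 3
print(force_sign(0, 1, sparse) > 0, force_sign(0, 1, crowded) < 0)
```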
31
TVFD/AFFD
Tight timing constraint
Slack distribution
[Figure: histogram of the slack distribution; horizontal axis from 0.1 to 1.7, vertical axis from 0% to 14%]
32
Performance tuning
$\alpha = \frac{N}{\text{ChipWidth}}$
$d_0 = \frac{1}{N} \sum_{i}^{N} \| FF_i - FF_{\text{secondNearestTo}FF_i} \|$
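The distance parameter on this slide is an average, over all FFs, of the distance from each FF to its second-nearest FF. A brute-force sketch is given below, assuming that "second nearest" is counted among the other FFs (the FF itself excluded); the slide does not spell this out. The function name is illustrative, and at least three FFs are required.

```python
import math

def mean_second_nearest_distance(points):
    """d0: average distance from each FF to its second-nearest other FF."""
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for k, q in enumerate(points) if k != i)
        total += dists[1]  # index 0 is the nearest FF, index 1 the second nearest
    return total / len(points)

ffs = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
d0 = mean_second_nearest_distance(ffs)
print(d0)  # (3 + 2 + 3) / 3
```

This quadratic-time version is for illustration only; on large designs the same neighbour query would typically go through a spatial index such as a bin grid.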
33
Customized FGT
34
Efficient NLP Solver
Nesterov method [DAC'14]
Projection: timing constraint
$\min \; \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
$\text{s.t.} \; t(\mathbf{x}, \mathbf{y}) \le T$
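How a Nesterov-type solver can be combined with a projection step is sketched below for a single FF. This is a generic projected accelerated-gradient loop under strong simplifying assumptions, not the DAC'14 solver itself: the feasible region is a hypothetical rectangular TVFR, the objective is a toy quadratic that pulls the FF towards a target outside that region, and the step size is fixed instead of being predicted.

```python
import math

def project(p, box):
    """Clamp a point into the rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return (min(max(p[0], xmin), xmax), min(max(p[1], ymin), ymax))

def projected_nesterov(grad, start, box, step, iters=100):
    """Accelerated gradient descent with a projection after every step."""
    x_prev = x = project(start, box)
    t = 1.0
    for _ in range(iters):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next
        # Look-ahead point built from the momentum of the last move.
        look = (x[0] + beta * (x[0] - x_prev[0]), x[1] + beta * (x[1] - x_prev[1]))
        g = grad(look)
        x_prev, x = x, project((look[0] - step * g[0], look[1] - step * g[1]), box)
        t = t_next
    return x

# Toy objective (px - 5)^2 + (py - 1)^2: the unconstrained optimum (5, 1)
# lies outside the assumed TVFR, so the solution sits on its boundary.
target = (5.0, 1.0)
grad = lambda p: (2.0 * (p[0] - target[0]), 2.0 * (p[1] - target[1]))
tvfr = (0.0, 0.0, 3.0, 3.0)
best = projected_nesterov(grad, start=(0.0, 0.0), box=tvfr, step=0.5)
print(best)  # close to (3.0, 1.0)
```

The projection keeps every iterate timing-feasible, so the constraint never has to enter the objective as a penalty term.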
35