Transcript Slide 1

Analytical Clustering Score with
Application to Post-Placement
Multi-Bit Flip-Flop Merging
Chang Xu1, Peixin Li1, Guojie Luo1, Yiyu Shi2, and Iris Hui-Ru Jiang3
{changxu, gluo} @ pku.edu.cn
1
Outline
 Background
 Multi-bit flip-flop
 Previous works and limitation
 Our method
 Analytical score
 Discrete refinement
 Efficient implementation
 Experimental results
 Conclusion
2
Clock Power Optimization
 Clock power dominates dynamic power
 $P_{clk} = \alpha C_{clk} V_{dd}^2 f_{clk}$
 Clock power optimization
 Reduce 𝜶
• Clock gating technique
 Reduce 𝑽𝒅𝒅
• Sub-threshold voltage
• Multi-supply-voltage
 Reduce 𝑪𝒄𝒍𝒌
• Multi-bit flip-flop
• Resonance clock
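A small numeric sketch of the clock power formula on this slide. The function name and the sample values are hypothetical and only illustrate that power scales linearly with $C_{clk}$, which is the term MBFF merging targets.

```python
def clock_power(alpha, c_clk, v_dd, f_clk):
    """P_clk = alpha * C_clk * V_dd^2 * f_clk (formula from the slide)."""
    return alpha * c_clk * v_dd ** 2 * f_clk


# Hypothetical values: only the ratio between the two results matters.
baseline = clock_power(alpha=1.0, c_clk=1.0, v_dd=1.2, f_clk=1e9)
reduced = clock_power(alpha=1.0, c_clk=0.8, v_dd=1.2, f_clk=1e9)
saving = 1.0 - reduced / baseline  # fraction of clock power saved
```

With every other term fixed, cutting the clock capacitance by 20% cuts clock power by the same fraction.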
3
Multi-Bit Flip-Flop (MBFF)
 What is an MBFF?
 Several single-bit flip-flops (SBFFs) share common inverters in an MBFF cell
 Power saving comes from
 MBFF library
 Simplified clock tree
[Figure: 2-bit flip-flop, UMC 55nm process]
[Figure: (a) Common clock tree; (b) Simplified clock tree with MBFF]
Sources: Faraday cell library; ICCAD'10 Chang et al.
4
Using MBFF at Different Stages
 Pre-placement MBFF
 SNUG'10 Chen et al.
 In-placement MBFF
 ISPD'13 Tsai et al.
 ICCAD'13 Hsu et al.
 Post-placement MBFF
 ICGCS'10 Yan and Chen
 ICCAD'10 Chang et al.
 ISPD'11 Jiang et al., INTEGRA
[Figure: design flow of Logic Synthesis, Placement, Timing Analysis, Post-placement Optimization, CTS, and Routing, with MBFF Clustering marked at the Logic Synthesis, Placement, and Post-placement Optimization stages]
5
Post-Placement MBFF Clustering
 Input
 Placement of FFs and other gates
 Timing slacks
 MBFF library
 Output
 FF clusters (MBFF)
 Constraint
 Timing constraint
[Figure: FFs with their input pins and output pins, and a TVFR]
6
Post-Placement MBFF Clustering
 Timing violation free region (TVFR)
[Figure: the TVFR of an FF between its input pin and output pin; TVFR1 and TVFR2 of two FFs and a 2-bit FF]
7
Previous Works and Limitation
 Intersection graph-based searching [ICCAD'10]
[Figure: TVFRs and their intersection graph; TVFRs whose intersection graph is a complete graph]
 Time consuming: $O(N^3)$
 Window-based acceleration degrades power reduction
8
Previous Works and Limitation
 Interval graph-based searching [ISPD'11]
[Figure: illustration of the interval graph, annotated "Random Choice!". Source: ISPD'11 Jiang et al.]
 Efficient: sub-quadratic time complexity
 Effective: best power reduction
 Simple: signal wirelength degradation
9
Benchmarks: C1-C6 vs. IWLS 2005
 Difference
 TVFD/AFFD: roughly estimates the number of FFs that can be covered within a TVFR
 IWLS benchmarks have many more MBFF candidates!
[Figure: FF ratio versus TVFD/AFFD for C1-C6 and for vga (IWLS 2005)]
 Signal wirelength degradation (for Integra)
 C1-C6: Avg. 3%
 IWLS: Avg. 932%
10
Our Contribution
 Efficient and highly scalable
 Sub-quadratic time complexity
 Robust performance
 Power reduction: comparable to Integra
 Signal wirelength: much better than Integra, especially for real designs
 Analytical fashion
 Potential integration in analytical global placement
 Potential usage for clustering algorithms
11
Optimization Flow
12
Analytical Step: Basic Idea
 Optimization problem
$\min\ \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
$\text{s.t.}\ t(\mathbf{x}, \mathbf{y}) \le T$
 $f_l(\mathbf{x}, \mathbf{y})$: signal wirelength
 weighted-average WL [DAC'11]
 $f_c(\mathbf{x}, \mathbf{y})$: #FF groups
 nontrivial to formulate
 Timing constraint
 feasible region
[Figure: TVFRs with a 2-bit group and a 3-bit group]
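For the $f_l$ term, a minimal sketch of a weighted-average (WA) wirelength model as commonly written: a smooth maximum minus a smooth minimum per axis. The function names and the smoothing parameter `gamma` are illustrative assumptions, not taken from the slides.

```python
from math import exp


def wa_wirelength_1d(coords, gamma):
    """Smooth approximation of max(coords) - min(coords)."""
    hi, lo = max(coords), min(coords)
    # Shift the exponents by max/min for numerical stability; the ratios are unchanged.
    w_max = [exp((c - hi) / gamma) for c in coords]
    w_min = [exp(-(c - lo) / gamma) for c in coords]
    smooth_max = sum(c * w for c, w in zip(coords, w_max)) / sum(w_max)
    smooth_min = sum(c * w for c, w in zip(coords, w_min)) / sum(w_min)
    return smooth_max - smooth_min


def wa_wirelength(pins, gamma):
    """WA wirelength of one net given as a list of (x, y) pin locations."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return wa_wirelength_1d(xs, gamma) + wa_wirelength_1d(ys, gamma)
```

A smaller `gamma` brings the value closer to the half-perimeter wirelength, at the cost of a less smooth function.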
13
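Analytical Step: Def. of Clustering Score
 Dirac delta function
$\delta(w, z) = \begin{cases} 1 & (w = z) \\ 0 & (w \ne z) \end{cases}$
$\delta(\|(x_i, y_i) - (x_j, y_j)\|, 0) = \begin{cases} 1 & \|(x_i, y_i) - (x_j, y_j)\| = 0 \\ 0 & \text{otherwise} \end{cases}$
 Cluster size
$N_i(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{N} \delta(\|(x_i, y_i) - (x_j, y_j)\|, 0)$
[Figure: TVFRs with $FF_i$ in a cluster of size $N_i = 3$ and $FF_j$ in a cluster of size $N_j = 2$]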
14
Analytical Step: Def. of Clustering Score
 Objective function: $f_c$ term
 4-bit group is most efficient
$\min\ (-f_c) = -\max f_c = -\max \sum_{i=1}^{N} \delta(N_i(\mathbf{x}, \mathbf{y}), 4)$
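A minimal sketch of the discrete (unsmoothed) definitions on these two slides, assuming FF locations are given as coordinate tuples. The function names are hypothetical.

```python
def cluster_sizes(points):
    """N_i: number of FFs (including FF_i itself) located exactly at FF_i's position."""
    return [sum(1 for q in points if q == p) for p in points]


def clustering_score(points, target=4):
    """f_c: sum over all FFs of delta(N_i, target)."""
    return sum(1 for n in cluster_sizes(points) if n == target)


# Four FFs stacked at one location and two at another.
ffs = [(0, 0)] * 4 + [(5, 5)] * 2
sizes = cluster_sizes(ffs)
score = clustering_score(ffs)
```

As written, the sum counts every FF that sits in a cluster of exactly four, so one 4-bit group contributes 4 to the score. The function is piecewise constant, which is why a smooth surrogate is needed before gradients can be used.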
15
Analytical Step: Smoothing
 Gaussian function
$\delta(w, z) \approx D(w, z) = \exp\left(\frac{(w - z)^2 \ln \epsilon}{d_0^2}\right)$
$D(w, z) = 1$ when $w = z$
$D(w, z) < \epsilon$ when $|w - z| > d_0$
[Figure: Dirac delta function versus Gaussian function]
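A minimal sketch of the Gaussian smoothing applied to both the cluster size and the score. The function names are hypothetical, and using a separate width `n0` for the outer delta $\delta(N_i, 4)$ is an assumption made here for illustration.

```python
from math import exp, hypot, log


def smooth_delta(w, z, d0, eps=0.01):
    """D(w, z) = exp((w - z)^2 * ln(eps) / d0^2); equals 1 at w == z, eps at |w - z| == d0."""
    return exp((w - z) ** 2 * log(eps) / d0 ** 2)


def smooth_cluster_size(i, points, d0, eps=0.01):
    """Smoothed N_i: FFs close to FF_i count almost fully, distant FFs almost not at all."""
    xi, yi = points[i]
    return sum(smooth_delta(hypot(xi - xj, yi - yj), 0.0, d0, eps) for xj, yj in points)


def smooth_score(points, d0, target=4, n0=1.0, eps=0.01):
    """Smoothed f_c; n0 (assumed parameter) is the width of the outer Gaussian."""
    return sum(
        smooth_delta(smooth_cluster_size(i, points, d0, eps), target, n0, eps)
        for i in range(len(points))
    )
```

Unlike the discrete score, this version is differentiable in the FF coordinates, so it can be handed to a gradient-based solver.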
16
Analytical Step: Effectiveness
 Attractive force & repelling force
[Figure: $FF_i$ exerting an attractive force (PULL) and a repelling force (PUSH) on neighbouring FFs]
17
Analytical Step: Preliminary Clusters
[Figure: (a) Initial FFs' distribution (Init. Loc.); (b) FFs' distribution after analytical clustering (NLP Loc.); both axes range from 0 to 3500]
 $f_c$: maximizes the number of MBFF groups
 $f_l$: pulls FFs towards their "optimal locations" in terms of WL
18
Discrete Step: Basic Idea
 Two-pass best-choice clustering
 First-pass: discretization
 Second-pass: refinement
[Figure: FFs A through I in four panels: (a) Proximity relation after analytical step; (b) Discrete clustering; (c) Discrete refinement; (d) Final MBFF groups; arrows labelled First-pass and Second-pass]
19
Discrete Step: Two-Pass Best-Choice Clustering
 First-pass: extract proximity relation
 Bottom-up merging
 Priority queue
• Tuple: $(FF_i, FF_j, d)$, $d = dist(FF_i, FF_j)$
 Capacity constraint: 4-bit
 Second-pass: further refinement
 Improve the ratio of 4-bit groups
[Figure: FFs A through I in four panels: (a) Proximity relation after analytical step; (b) First-pass clustering; (c) Second-pass clustering; (d) Final MBFF groups; priority-queue entries S(C,D), S(G,F), S(E,G), S(I,H), S(A,B), S(I,E), S(H,F), S(A,C)]
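The first pass can be sketched as a greedy bottom-up merge driven by a priority queue of $(d, FF_i, FF_j)$ tuples with a 4-bit capacity limit. This is a simplified illustration, not the authors' implementation: it enumerates all pairs, uses a union-find structure, and does not update distances after a merge. The name `first_pass_clusters` and the `max_dist` cutoff are hypothetical.

```python
import heapq
from itertools import combinations
from math import hypot


def first_pass_clusters(points, max_bits=4, max_dist=float("inf")):
    """Greedily merge the closest FF pairs while respecting the capacity limit."""
    parent = list(range(len(points)))
    size = [1] * len(points)

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Priority queue of (distance, i, j) tuples; the smallest distance pops first.
    heap = []
    for i, j in combinations(range(len(points)), 2):
        d = hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
        if d <= max_dist:
            heap.append((d, i, j))
    heapq.heapify(heap)

    while heap:
        _, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri != rj and size[ri] + size[rj] <= max_bits:
            parent[rj] = ri
            size[ri] += size[rj]

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


ffs = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (50, 50)]
groups = first_pass_clusters(ffs, max_bits=4, max_dist=5.0)
```

The second-pass refinement, which raises the share of 4-bit groups, is not sketched here because the slides do not spell out its rules.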
20
MBFF Clusters
[Figure: scatter plots of FF locations (Init. Loc., NLP Loc., Final Loc.); both axes range from 0 to 3500]
21
Efficient Implementation
 Sub-quadratic time complexity
 Analytical Step
$\min\ \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
$\text{s.t.}\ t(\mathbf{x}, \mathbf{y}) \le T$
• Gradient calculation: fast Gauss transform (FGT), $O(N^2) \Rightarrow O(N)$
• Nonlinear programming solver: Nesterov method, placement-like problem, $O(N^{1.18})$
 Discrete Refinement
• FF-pair distance: bin-structure searching, $O(N^2) \Rightarrow O(N)$
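A minimal sketch of bin-structure searching for nearby FF pairs: FFs are hashed into square bins of side `radius`, and each FF is compared only against FFs in its own and the eight adjacent bins. The function name and the choice of bin size are assumptions for illustration.

```python
from collections import defaultdict
from math import hypot


def close_pairs(points, radius):
    """Return sorted (distance, i, j) tuples for all FF pairs with distance <= radius."""
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(int(x // radius), int(y // radius))].append(idx)

    pairs = []
    for (bx, by), members in bins.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty bins while we iterate over the dict.
                for b in bins.get((bx + dx, by + dy), ()):
                    for a in members:
                        if a < b:  # report each unordered pair exactly once
                            d = hypot(points[a][0] - points[b][0],
                                      points[a][1] - points[b][1])
                            if d <= radius:
                                pairs.append((d, a, b))
    return sorted(pairs)
```

When FFs are spread so that each bin holds a bounded number of them, the work per FF is constant, which is how the pair search avoids the all-pairs $O(N^2)$ cost.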
22
Experimental Results:
 Setup:
 G++ 4.5.1 -O3
 Intel Xeon CPU @ 2.4GHz with 16 logical threads
 Benchmarks: C1-C6, IWLS-2005 suite
 Synthesis flow for real designs
• Synopsys DC
• Cadence Encounter SOC
23
Experimental Results: C1-C6
 Comparable power reduction
 33% WL reduction

Circuit   Integra                  Ours
          PWR    WLR    RT (s)     PWR    WLR    RT (s)
C1        82.8   96     0.01       83.5   77.4   0.42
C2        80.9   102    0.01       82.3   76.4   0.97
C3        80.8   104    0.01       82.3   74.9   3.14
C4        81.0   104    0.02       82.4   75.6   10.59
C5        80.7   105    0.05       82.1   76.4   16.66
C6        80.7   105    1.11       82.3   82     217.4
Avg.      1      1.33   1          1.02   1      252
24
Experimental Results: Real Designs
 Bound-Integra
[Figure: Effect of Different Bound Factors on Power Ratio and WL Ratio]
25
Experimental Results: Real Designs
 Comparable power reduction
 43% WL reduction compared with Bound-Integra

Circuit    Bound-Integra             Ours
           PWR     WLR     RT (s)    PWR     WLR     RT (s)
Tv80       78.11   109.2   0.01      78.10   95.7    0.94
Wbconmax   78.26   128     0.03      78.02   105     2.3
Pairing    78.00   132     0.03      78.00   109     6.61
Dma        78.04   124     0.05      78.02   96      5.43
Ac97       78.02   120     0.02      78.02   96      4.88
Ethernet   78.00   217     0.63      78.00   88      24.5
Avg.       1       1.43    1         0.99    1       84
26
Conclusion
 We propose an analytical clustering score to merge MBFFs
 The time complexity is sub-quadratic
 We get power reduction comparable to Integra
 We reduce wirelength by about 25% compared with the original placement
 Potential usage:
 Integrated in global placement
 Clustering algorithms
27
Q&A
Thanks
{changxu, gluo} @pku.edu.cn
28
Backup
[Figure: FFs with their input pins and output pins, and the best MBFF location]
29
Backup
 Attractive force & repelling force
30
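Backup
 Proof
 $FF_i$ attracts $FF_j$ when $(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} > 0$
 $FF_i$ repels $FF_j$ when $(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} < 0$
 $\frac{\partial f_{c,i}}{\partial N_i} = 2\lambda_1 (N_i - 3) \exp((N_i - 3)^2 \lambda_1)$
 $\frac{\partial N_i}{\partial x_j} = 2\lambda_2 \exp(\lambda_2((x_i - x_j)^2 + (y_i - y_j)^2))(x_j - x_i)$
 $\frac{\partial f_{c,i}}{\partial x_j} = \frac{\partial f_{c,i}}{\partial N_i} \cdot \frac{\partial N_i}{\partial x_j}$
 $(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} > 0$ when $N_i < 3$
 $(x_i - x_j) \cdot \frac{\partial f_{c,i}}{\partial x_j} < 0$ when $N_i > 3$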
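The two sign conditions follow from multiplying the partial derivatives via the chain rule. A worked version, under the assumption (not stated on the slide) that $\lambda_1$ and $\lambda_2$ are both negative, as they would be if each has the form $\ln \epsilon / d_0^2$ with $\epsilon < 1$:

```latex
\begin{aligned}
(x_i - x_j)\,\frac{\partial f_{c,i}}{\partial x_j}
&= (x_i - x_j)\cdot 2\lambda_1 (N_i - 3)\, e^{\lambda_1 (N_i - 3)^2}
   \cdot 2\lambda_2\, e^{\lambda_2\left((x_i - x_j)^2 + (y_i - y_j)^2\right)} (x_j - x_i) \\
&= -4\lambda_1\lambda_2\,(N_i - 3)\,(x_i - x_j)^2\,
   e^{\lambda_1 (N_i - 3)^2}\, e^{\lambda_2\left((x_i - x_j)^2 + (y_i - y_j)^2\right)}
\end{aligned}
```

With $\lambda_1 \lambda_2 > 0$ and $x_i \ne x_j$, every factor except $-(N_i - 3)$ is positive, so the product is positive when $N_i < 3$ and negative when $N_i > 3$. Because $f_c$ is maximized, $x_j$ moves along $+\partial f_{c,i} / \partial x_j$; a positive product means that direction points from $x_j$ towards $x_i$, which is the attraction case.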
31
TVFD/AFFD
 Tight timing constraint
 Slack distribution
[Figure: histogram of the slack distribution; horizontal axis from 0.1 to 1.7, vertical axis from 0% to 14%]
32
Performance tuning
 $\alpha = \frac{N}{\mathrm{Chipwidth}}$
 $d_0 = \frac{1}{N} \sum_{i=1}^{N} \| FF_i - FF_{\mathrm{secondNearestTo}FF_i} \|$
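A brute-force sketch of the $d_0$ estimate: the average, over all FFs, of the distance to the second-nearest other FF. Two assumptions are made here: $FF_i$ itself is excluded from the neighbour ranking, and Euclidean distance is used. The slide states neither, and the function name is hypothetical.

```python
from math import hypot


def estimate_d0(points):
    """Mean distance from each FF to its second-nearest neighbour (needs >= 3 FFs)."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i  # assumption: FF_i is not its own neighbour
        )
        total += dists[1]  # index 0 is the nearest, index 1 the second nearest
    return total / len(points)


d0 = estimate_d0([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

This version costs $O(N^2 \log N)$; a bin structure would be the natural way to speed it up on large designs.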
33
Customized FGT
34
Efficient NLP Solver
 Nesterov method [DAC'14]
 Projection: timing constraint
$\min\ \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
$\text{s.t.}\ t(\mathbf{x}, \mathbf{y}) \le T$
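A minimal sketch of Nesterov-style accelerated gradient descent with a projection step. It assumes the timing-feasible region of each variable can be approximated by a simple box, so that projection is a clamp. The box model, the fixed step size, and all names are illustrative assumptions rather than the solver used in the talk.

```python
def projected_nesterov(grad, x0, lower, upper, step=0.1, iters=200):
    """Minimize a smooth function over a box using accelerated projected gradient."""

    def project(p):
        # Clamp each coordinate into its [lower, upper] interval.
        return [min(max(pi, lo), hi) for pi, lo, hi in zip(p, lower, upper)]

    x = project(x0)
    v = list(x)  # look-ahead point where the gradient is evaluated
    a = 1.0      # Nesterov momentum parameter
    for _ in range(iters):
        g = grad(v)
        x_next = project([vi - step * gi for vi, gi in zip(v, g)])
        a_next = (1.0 + (4.0 * a * a + 1.0) ** 0.5) / 2.0
        beta = (a - 1.0) / a_next
        v = [xn + beta * (xn - xo) for xn, xo in zip(x_next, x)]
        x, a = x_next, a_next
    return x


# Toy objective: squared distance to a target that lies outside the unit box.
target = [5.0, -3.0]
solution = projected_nesterov(
    grad=lambda p: [2.0 * (pi - ti) for pi, ti in zip(p, target)],
    x0=[0.5, 0.5],
    lower=[0.0, 0.0],
    upper=[1.0, 1.0],
)
```

In the toy run the unconstrained optimum is infeasible, so the iterate settles on the nearest corner of the box.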
35