
Clocking and Timing in Fault-Tolerant Systems-on-Chip
Andreas Steininger
Outline
• The Clock as a Blessing
• The Clock as a Curse
• Alternative Synchronization Schemes
- GALS
- fully asynchronous
- the DARTS approach
• Conclusion
2
Contributors to this Work
The DARTS project team
TU Vienna
RUAG Space
Gottfried Fuchs
Matthias Fuegger
Ulrich Schmid
Thomas Handl
Gerald Kempf
Manfred Sust
Wolfgang Zangerl
3
The Need for Fault Tolerance
miniaturization is key to progress in VLSI
=> smaller structures
=> lower voltage swing
=> smaller critical charge
=> higher operating frequencies
…result in higher susceptibility to faults (SET, EMI,…)
=> cannot avoid faults, need to tolerate them
4
The Role of Time
“The only reason for time is so that everything
doesn’t happen at once”, Albert Einstein
5
The Need for Clocking
activities need to be co-ordinated
• on system level (braking of wheels, …)
• on algorithmic level (consensus, …)
• on communication level
• on logic level (state machine switching,…)
co-ordination in the time domain (synchronization)
is an efficient way to attain this
=> need a global notion of time (discrete „ticks“)
6
The Quality of Synchronization
[Figure: local time (number of ticks) plotted over real time, with the precision π marked]
7
Typical Precision Values
on system level:          ms … ms
on algorithm level:       ms … ms
on communication level:   ns … ms
on logic level:           ps … ns
8
Synchronization Requirements
phase synchronisation (for „hardware clock“ on logic level)
1µs is excellent precision for a distributed clock
at 1GHz this means 360,000° phase shift
clock synchronisation (for distributed time base on algorithmic level)
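The phase-shift figure is simple arithmetic: an offset of 1 µs at 1 GHz spans 1000 clock periods of 360° each. A minimal sketch (the function name is illustrative, not from the slides):

```python
def phase_shift_degrees(precision_s: float, clock_hz: float) -> float:
    """Phase error in degrees of a clock at clock_hz whose offset is precision_s."""
    periods = precision_s * clock_hz  # clock periods spanned by the offset
    return periods * 360.0


# 1 us precision at 1 GHz: 1000 periods, i.e. about 360,000 degrees
print(round(phase_shift_degrees(1e-6, 1e9)))
```

Keeping the offset below one clock period (360°) at 1 GHz would require sub-nanosecond precision, which is why phase synchronisation and clock synchronisation are treated as different problems.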
9
Globally Synchronous Design
• whole design is „isochronic“ („perfect“ precision)
• time conveyed by clock transitions
• perfect co-ordination of all activities
• very efficient design
• can assume consistent states
• high level of abstraction
• very efficient implementation:
• single crystal oscillator
• single control line (clock net)
10
„Isochronic“ Regions ?
speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns
[Figure: regions around a reference point (Ref) for clock rates of 1GHz, 4GHz and 8GHz; scale 2cm]
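To put the numbers in relation: at 20 cm/ns, the distance a signal can cover within one clock period shrinks with the clock rate. A small sketch using only the speed given on the slide (names are illustrative):

```python
SIGNAL_SPEED_M_PER_S = 2e8  # speed of light in the medium, as given on the slide


def distance_per_period_cm(clock_hz: float) -> float:
    """Distance in cm that a signal travels during one clock period."""
    return SIGNAL_SPEED_M_PER_S / clock_hz * 100.0


for f_ghz in (1, 4, 8):
    d = distance_per_period_cm(f_ghz * 1e9)
    print(f"{f_ghz} GHz: {d:.1f} cm per clock period")
```

At 8 GHz a full period corresponds to 2.5 cm, so a region that counts as „isochronic“ (skew well below one period) is considerably smaller than that.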
11
The Variation Problem
Designer: projected conditions, system model
User: actual conditions, actual system
? (unknown) => worst case
? (imperfections) => safety margins
Timing completely fixed after design
No way to react to actual conditions & system („PVT variations“)
12
Fault-Tolerant Architectures
- Duplication & Comparison: two FUs feed a comparator (=?) that signals ERR
- Triple-Modular Redundancy: three FUs feed a voter that produces Y
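The two schemes can be sketched behaviourally as follows. This is an illustration only, with the replica outputs assumed to be plain comparable values; the function names are not from the slides.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def duplicate_and_compare(a: Hashable, b: Hashable) -> Tuple[Hashable, bool]:
    """Duplication & Comparison: return (output, err); err is True on mismatch."""
    return a, a != b


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """TMR voter: return the value of a strict majority of replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


print(duplicate_and_compare(3, 4))  # mismatch is detected, but not corrected
print(majority_vote([3, 3, 4]))     # one faulty replica is masked
```

Duplication can only detect a fault; three replicas are needed to mask it.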
13
Lock-Step Operation
[Figure: a single clock drives three FUs; all replicas step from state „3“ to „4“ together and feed a voter producing Y]
single clock: single point of failure
good replica determinism
14
Lock-Step Operation
[Figure: three FUs with independent clocks feed a voter producing Y; the replicas step from state „3“ to „4“ at different times]
independent clocks: single fault tolerant
bad replica determinism
15
Fault-Tolerant HW-Clocking
[Figure: three FUs, each clocked through its own voter (v), feed a voter producing Y]
16
Fault-Tolerant HW-Clocking
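[Figure: three FUs, each clocked through a voter (v, two of them marked ?v), with elements labelled D, feed a voter producing Y]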
17
The Charm of SoCs
billions of transistors fit on one die
=> structuring into (IP) modules
„System-on-Chip“
BUT:
• large clock distribution networks => „isochronic“??
• FT clocking does not work with large skew
• may need individual clocks for function modules
=> clock-synchrony neither attainable nor desirable
18
Co-ordination of Data Exchange
SRC -> f(x) -> SNK
When can SNK use its input? When it is valid and consistent
When can SRC apply the next input? When SNK has consumed the previous one
19
The Synchronous Approach
SRC
f(x)
SNK
co-ordination based on (global) time
20
Alternative: Asynchronous Design
co-ordination based on handshaking
SRC -> f(x) -> SNK
REQ (SRC to SNK): „Data word valid, you can use it“
ACK (SNK to SRC): „Data word consumed, send the next“
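A minimal sketch of the REQ/ACK handshake, assuming a four-phase (return-to-zero) protocol; the slides do not fix the protocol variant, and all names here are illustrative. SRC and SNK are stepped in random order to mimic unknown delays; the transfer is correct regardless of the schedule.

```python
import random


def run_handshake(words, seed=0):
    """Transfer words from SRC to SNK with a four-phase REQ/ACK handshake."""
    rng = random.Random(seed)
    req = ack = False
    bus = None                    # data wires between SRC and SNK
    pending = list(words)
    received = []
    while pending or req or ack:
        if rng.random() < 0.5:    # SRC gets to act
            if not req and not ack and pending:
                bus = pending.pop(0)
                req = True        # "data word valid, you can use it"
            elif req and ack:
                req = False       # ACK seen: withdraw REQ
        else:                     # SNK gets to act
            if req and not ack:
                received.append(bus)
                ack = True        # "data word consumed, send the next"
            elif not req and ack:
                ack = False       # REQ withdrawn: withdraw ACK
    return received


print(run_handshake(["a", "b", "c"]))
```

No word is lost or duplicated for any seed, because each side only acts on the state of the other side's signal, never on elapsed time.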
21
Async. Design – Advantages
• closed-loop control makes timing much more robust and adaptive to PVT variations
• no need for worst-case timing
• local handshakes replace global clock
• activity only when needed
• beneficial for EMI
• tends to stop operation in case of fault
22
Async. Design – Disadvantages
• Need to handle race between REQ and data
23
Async. Design – Disadvantages
• Need to handle race between REQ and data
Solution 1: „Bundled Data“
REQ: „Data word valid, you can use it“
SRC
f(x)
SNK
25
Async. Design – Disadvantages
• Need to handle race between REQ and data
Solution 2: „Delay Insensitive“ (Coding)
REQ: „Data word valid, you can use it“
Completion detection
SRC
f(x)
SNK
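One common delay-insensitive code is dual-rail, sketched below. The slide does not name a specific code, so dual-rail and the rail order (true_rail, false_rail) are assumptions. Each bit uses two wires, and the receiver detects completion from the data itself instead of relying on a separate REQ that might race with the data.

```python
SPACER = (0, 0)                       # "no data yet" on a bit position
VALID = ((1, 0), (0, 1))              # (true_rail, false_rail) code words


def encode_dual_rail(bits):
    """Encode each bit on two wires: 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]


def is_complete(rails):
    """Completion detection: every bit position carries a valid code word."""
    return all(pair in VALID for pair in rails)


def decode_dual_rail(rails):
    """Decode a complete dual-rail word back into bits."""
    if not is_complete(rails):
        raise ValueError("word not complete yet")
    return [1 if pair == (1, 0) else 0 for pair in rails]


word = encode_dual_rail([1, 0, 1])
partial = [word[0], SPACER, word[2]]  # middle bit has not arrived yet
print(is_complete(partial), is_complete(word), decode_dual_rail(word))
```

The cost is visible in the sketch: twice the wires plus the completion-detection logic.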
26
Async. Design – Disadvantages
• Need to handle race between REQ and data
• significant HW overhead (coding, delay elements)
• „adaptive“ timing not as predictable
• more difficult to design
• classical fault-tolerance schemes not applicable
• tends to stop operation in case of fault
27
Best of Both Worlds
GALS: Globally Asynchronous Locally Synchronous
• use asynchronous principle where clock distribution too cumbersome: „inter-module“
• retain efficiency of synchronous design wherever possible: „intra-module“
First mention in PhD thesis by Chapiro / Stanford 84
28
A GALS Example
CPU: 2 GHz
DSP: 2.7 GHz
PCI-IF: 533 MHz
USB-IF: 24 MHz
29
Communication in GALS
Shared Memory
producer writes to memory, consumer reads from there
pro: control flow stays independent
• shared single-port memory
• true dual-port memory
Direct Messages (Data words)
move data word from producer‘s output register to
consumer‘s input register
• non-buffered / buffered (FIFO-queues)
• clock fixed, data-driven or pausible
30
Shared Memory
decoupling of clock domains by memory acting as a third party
=> high area overhead => unusual
for single port memory arbitration required
• arbitration problem (unbounded delay…)
• one side may block the other at the arbiter
for multiport memory problems are confined to access to the
same cell
• busy flag may become metastable
• blocking still possible for one specific address
31
Shared Memory
• perfect decoupling of data path
• potential metastability problems at arbitration logic
• potential blocking through arbitration
[Figure: CPU (2 GHz) and DSP (2.7 GHz) access a shared memory through arbitration logic]
32
Direct Messages
clock domain boundary is between producer‘s output register
and consumer‘s input register
in general a synchronizer is needed at consumer‘s input
• definitely for conventional (fixed) clock
• can be avoided by data-driven / pausible clocking
control flows of producer and consumer are strongly coupled:
a party that does not service its input/output register blocks the other party
buffers/queues/FIFOs can
• mitigate, but not avoid this problem (full/empty)
• compensate variations in the data rate on both sides, but not
different average data rates
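The FIFO behaviour can be illustrated with a small simulation. The discrete time steps, the cyclic request patterns and all names are assumptions made for this sketch.

```python
from collections import deque


def simulate_fifo(produce_pattern, consume_pattern, depth, steps):
    """Return (producer_stalls, consumer_stalls) after `steps` time steps.

    Each pattern is a list of 0/1 flags applied cyclically; 1 means that
    side wants to transfer one word in that step.
    """
    fifo = deque()
    producer_stalls = consumer_stalls = 0
    for t in range(steps):
        if produce_pattern[t % len(produce_pattern)]:
            if len(fifo) < depth:
                fifo.append(t)
            else:
                producer_stalls += 1   # FIFO full: producer is blocked
        if consume_pattern[t % len(consume_pattern)]:
            if fifo:
                fifo.popleft()
            else:
                consumer_stalls += 1   # FIFO empty: consumer is blocked
    return producer_stalls, consumer_stalls


bursty = [1, 1, 1, 1, 0, 0, 0, 0]      # average rate 0.5, in bursts
steady = [1, 0, 1, 0, 1, 0, 1, 0]      # average rate 0.5, evenly spread
print(simulate_fifo(bursty, steady, depth=1, steps=80))  # too shallow: stalls
print(simulate_fifo(bursty, steady, depth=2, steps=80))  # burst absorbed
print(simulate_fifo([1], [1, 0], depth=8, steps=100))    # rates differ: fills up
```

With equal average rates a deep enough FIFO removes the stalls; with a faster producer any finite FIFO eventually fills and the producer blocks again.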
33
Direct Messages
[Figure: CPU (2 GHz) and DSP (2.7 GHz) connected through synchronizers (S)]
data moving over clock domain boundary
metastability problems
=> need to insert handshake
…with synchronizers and (optional) buffers
34
Arbiter: Principle
purpose: ○ manage concurrent requests to shared resource
method:  ○ handle pairs of request_in / grant_out
         ○ requests may arrive in any order
         ○ arbiter must activate only one grant_out at a time
           (respond to the first requester): Mutual Exclusion (MUTEX)
problem: ○ resolve concurrent requests
           => metastability problem
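A behavioural sketch of a two-input arbiter, with hypothetical class and method names. Requests are applied one call at a time, so truly simultaneous requests, and with them the metastability problem, are outside this model.

```python
class Mutex:
    """Behavioural two-input arbiter: at most one grant is active at a time."""

    def __init__(self):
        self.request = [False, False]
        self.owner = None  # index of the requester currently holding the grant

    def set_request(self, i, active):
        """Apply a change on request_in[i]; return the (grant_out0, grant_out1) pair."""
        self.request[i] = active
        if self.owner == i and not active:
            self.owner = None              # holder released the resource
        if self.owner is None:
            # resource is free: a new request on i wins, otherwise a
            # request still pending on the other input takes over
            for j in (i, 1 - i):
                if self.request[j]:
                    self.owner = j
                    break
        return (self.owner == 0, self.owner == 1)


m = Mutex()
print(m.set_request(0, True))   # first requester gets the grant
print(m.set_request(1, True))   # second requester has to wait
print(m.set_request(0, False))  # release: grant passes to the waiting side
```

In hardware, the order of two nearly simultaneous requests has to be decided by the circuit itself, which is where metastability enters.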
35
Arbiter: Circuit
MUTEX-element: SR-latch (inputs R1, R2; outputs G1’, G2’)
„Metastability filter“: e.g., hi-threshold inverter (outputs G1, G2)
[Figure: latch output Vout,FF over time t, with the levels Vmeta and Vth,inv marked]
[from D. J. Kinniment „Synchronization and Arbitration in Digital Systems“, Wiley]
36
Arbiter: Operation
[Figure: arbiter block (R1, R2, G1, G2) and timing diagram of the signals R1, R2, G1’, G1, G2’, G2]
37
Muller C-Element
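IF a = b
THEN y = a
ELSE hold y
[Figure: C-element symbol (inputs a, b; output y), timing diagram of a, b, y, and an implementation built around an RS latch with set and reset inputs]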
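The rule above translates directly into a behavioural model (class name illustrative; the initial output value is an assumption):

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.y = initial

    def update(self, a, b):
        if a == b:
            self.y = a      # both inputs agree: take their value
        return self.y       # otherwise: hold the previous output


c = CElement()
for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    print(a, b, c.update(a, b))
```

The output rises only after both inputs have risen and falls only after both have fallen, which is what makes the element useful for joining handshake signals.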
38
Muller C-Element: Circuit
[Alain Martin, Caltech]
39
Data-Driven Clocking
Principle:
○ as soon as new data arrive => start clocking
○ determine number k of clock cycles required
to process new data
○ stop clocking after k cycles, wait for next data
Properties:
○ need to switch clock on and off
=> beware spurious clock pulses!
○ no metastability problem: data stable as soon
as consumer clock starts
○ potential for power saving
○ useful for specific applications only (no pipe!)
40
Data-Driven Clock: Circuit / 1
• CLK half period determined by D
[Figure: circuit with delay elements D generating CLK out]
41
Data-Driven Clock: Circuit / 2
• transition on REQ answered by transition on CLK out
• min CLK half period determined by D
[Figure: circuit with Muller C-element (C) and delay elements D; signals REQ, ACK, CLK out, with timing diagram]
42
Pausible Clocking
Principle:
○ producer requests consumer‘s clock to pause
○ data provided to input register during idle time
○ consumer‘s clock may resume
- free running („pausible clock“)
- with one cycle only („stoppable clock“)
Properties:
○ need to switch clock on and off
=> beware spurious clock pulses!
=> beware of clock tree delays!
○ producer controls consumer‘s clock (blocking!)
○ applications must cope with paused clock
43
Pausible Clock: Circuit / 1
• inverter generates next REQ from ACK
• self-oscillation
[Figure: circuit with Muller C-element (C) and delay elements D; signals REQ, ACK, CLK out, with timing diagram]
44
Pausible Clock: Circuit / 2
• external unit can safely stop CLK by activating REQ’
• … and gets ACK’ as a response
[Figure: circuit with arbiter (Arb), Muller C-element (C) and delay elements D; signals REQ’, ACK’, CLK out, with timing diagram]
45
Pausible Clock: Circuit / 3
• for more external sources arbiters can be added and “anded” before the Muller C-Element
• the two inverters can be eliminated by using a Muller C-Element with inverting output
[Figure: circuit with one arbiter (Arb) per external source (REQ1/ACK1 … REQn/ACKn), Muller C-element (C) and delay element D generating CLK out]
46
Advantages of GALS
• synchronous islands can be designed efficiently
• modules operate independently
• can use module-specific clock & timing
• clocking is no single point of failure
47
Problems with GALS
• operation of modules not (inherently) co-ordinated
synchrony for communication but not on system /
algorithm level
• communication has to cross clock boundaries
• potential for metastability
=> performance penalty through synchronizers
OR
=> module must handle irregular clocking
48
The DARTS Idea
Distributed Algorithms for Robust Tick Synchronization
phase synchronisation
tick synchronisation
clock synchronisation
49
The DARTS Approach
• Concept: Multiple synchronized tick generators
• Method: Distributed algorithm for fault-tolerant tick generation implemented in (asynchronous) digital logic
Advantages
• No crystal oscillator(s)
• No critical clock tree
• Clock is no single point of failure!
• Reasonable synchrony
[Figure: function units Fu1, Fu2, Fu3 on a data bus, each with a TG-Alg; the TG-Algs are connected by the TG-Net]
50
The DARTS Principle
• Every function unit Fui augmented with simple local clock unit (TG-Alg)
• TG-Algs communicate over dedicated TG-Net to generate tick-synchronized local clock signals
• Up to f TG-Algs can be Byzantine faulty => need n ≥ 3f + 2 TG-Algs
[Figure: Fu1, Fu2, Fu3 on a data bus; DARTS clocks supplied by TG-Algs connected via the TG-Net, contrasted with the clock tree of standard synchronous clocking]
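The bound n ≥ 3f + 2 can be turned around to give the number of Byzantine faults a given number of TG-Algs can tolerate. A small sketch (function names are illustrative):

```python
def min_tg_algs(f: int) -> int:
    """Smallest number of TG-Algs that tolerates f Byzantine faulty ones."""
    return 3 * f + 2


def max_byzantine_faults(n: int) -> int:
    """Largest f with n >= 3f + 2 (0 if n is too small for any fault)."""
    return max((n - 2) // 3, 0)


for n in (5, 8, 11, 12):
    print(n, "TG-Algs tolerate", max_byzantine_faults(n), "Byzantine fault(s)")
```

Five TG-Algs are the minimum for a single fault; three faults require at least eleven.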
51
A Comparison
• synchronous SoC (single oscillator and clock tree for Fu1, Fu2, Fu3)
+ global synchrony (< 1 tick)
- single point of failure
• GALS (one oscillator per Fu)
+ no single point of failure
- NO (inherent) global synchrony
• DARTS (TG-Algs connected via TG-Net)
+ global synchrony (potentially ≥ 1 tick)
+ no single point of failure
[Figure: block diagrams of the three variants, each with Fu1, Fu2, Fu3 on a data bus; the DARTS diagram shows Fu1 clk and Fu2 clk with tick(3) and tick(4)]
52
The Distributed Algorithm
[Figure: six TG-Algs (TG-Alg 1 … TG-Alg 6) connected by the TG-Net]
[Srikanth & Toueg, 87]
Initially:
    send tick(0) to all; clock := 0;
“Relay Rule”
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1),…, tick(m) to all [once]; clock := m;
“Increment Rule”
If received tick(m) from at least 2f+1 remote nodes and m >= clock:
    send tick(m+1) to all [once]; clock := m+1;
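The two rules can be tried out in a small message-passing simulation. This is a software sketch under stated assumptions, not the DARTS hardware: faulty nodes are modelled as silent only (no arbitrary Byzantine behaviour), messages are delivered in random order, and the run is cut off at `max_tick`. All names are illustrative.

```python
import random
from collections import defaultdict


def simulate_ticks(n, f, max_tick, silent=(), seed=0):
    """Run the relay/increment rules; return (final clocks, largest observed skew)."""
    rng = random.Random(seed)
    clock = {i: 0 for i in range(n) if i not in silent}
    # heard[i][m] = set of remote nodes from which node i received tick(m)
    heard = {i: defaultdict(set) for i in clock}
    in_flight = []  # undelivered messages as (sender, receiver, m)

    def broadcast(sender, m):
        in_flight.extend((sender, r, m) for r in clock if r != sender)

    for i in clock:                       # Initially: send tick(0) to all
        broadcast(i, 0)

    max_skew = 0
    while in_flight:
        sender, node, m = in_flight.pop(rng.randrange(len(in_flight)))
        heard[node][m].add(sender)
        count = len(heard[node][m])
        if count >= f + 1 and m > clock[node]:              # relay rule
            for k in range(clock[node] + 1, m + 1):
                broadcast(node, k)
            clock[node] = m
        if count >= 2 * f + 1 and m >= clock[node] and m < max_tick:  # increment rule
            broadcast(node, m + 1)
            clock[node] = m + 1
        max_skew = max(max_skew, max(clock.values()) - min(clock.values()))
    return clock, max_skew


clocks, skew = simulate_ticks(n=5, f=1, max_tick=10, silent={4})
print(clocks, "largest skew observed:", skew)
```

With n = 5 and one silent node, each correct node still hears from 2f+1 = 3 remote nodes, so every correct clock reaches `max_tick`. With a second silent node the increment rule never fires and the clocks stay at 0.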
Implementation Challenges
• k-bit messages in the software-based algorithm (TICK(0), TICK(1), ..., TICK(k-1), TICK(k)), k unbounded
  => replacement by zero-bit messages (k-bit msg vs. zero-bit tick)
• Threshold functions for fault tolerance
  => glitch-free asynchronous implementation
• Atomicity of actions
  => to be ensured by the architecture and delay constraints
54
The DARTS Prototype
ASIC design:
• radhard 180nm technology
• 2 designs:
- flexible
- fast
Prototype board:
8 chips plus fixed & programmable interconnect
55
Proof of Concept
56
Frequency Stability (Warm-up)
[Plot: frequency in [MHz] (axis 53.15 to 53.45) over time in [hours] (axis 0 to 18)]
57
Frequency Stability (detail)
[Plot: frequency in [MHz] (axis 51.94 to 52.0) and core voltage in [V] (axis 1.7968 to 1.7974) over time in [min] (axis 0 to 15)]
58
DARTS – General Properties
• Fully asynchronous implementation => NO oscillators
• Tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs; 5 to 12)
• Adapts to operating conditions (asynchronous logic)
59
Still Room for Improvements
o Transient faults are permanently stored in the
elastic pipelines
o No on-the-fly integration of TG-Alg
o Relatively low clock speed
o Interfacing to traditional synchronous designs
o Scaling with number of faults is costly
60
Summary: Trends & Needs
• Proceeding miniaturization necessitates fault tolerance
• Co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels
• SoCs are large modular designs on a single die
61
Summary: SoC Clocking
• globally synchronous clock:
+ ideal synchrony, efficient in design & implementation
- isochrony unrealistic, single point of failure
• DARTS clock
+ best attainable global synchrony, adaptive timing, FT
- high implementation efforts, frequency not stable
• GALS
+ uses best of syn & asyn, indep. & module-specific clock
- no global synchrony, metastability issues
• asynchronous design
+ power-efficient, robust against faults & PVT
- high overheads, difficult to design, timing hard to predict
62
More information on DARTS
http://ti.tuwien.ac.at/ecs/research/projects/darts
63