Information Redundancy

4. Information Redundancy
Reliable System Design 2010
by: Amir M. Rahmani
Information Redundancy

Code: a means of representing information
– e.g., Morse code
Code word: a collection of symbols or digits used to represent information according to the rules of a given code
Binary code: a code in which code words contain only symbols that are either 0 or 1
Error detection, error correction
Coding is often applied to
– Information transfer: often serial communication through a channel
– Information storage
matlab1.ir
Start with a k-bit data word
Add r code bits to the k-bit data
Total = n-bit code word (n = k + r)
Not all 2^n combinations are valid code words
For certain encoding schemes, some types of errors can also be corrected
To extract the original data, the n bits must be decoded
Overhead = r/n
– e.g., for (single-bit) parity, the overhead is 1/n
– additional bits required
– time to encode and decode

Hamming distance (d)

Number of bits in which two words differ from each other: d(x,y) = Σ_k (x_k XOR y_k)
– E.g., 0010 and 1110 have a Hamming distance of 2
Rules:
– d(x,y) = 0 iff x = y
– d(x,y) = d(y,x)
– d(x,y) + d(y,z) >= d(x,z)
For a group of code words, d is the minimum of the Hamming distances between all possible pairs of code words
– E.g., {000, 011, 101, 110} has a Hamming distance of 2
d determines the code's ability to detect and/or correct errors
– up to d-1 bit errors can be detected
– up to ⌊(d-1)/2⌋ bit errors can be corrected

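The definitions above can be checked with a short Python sketch. The function names `hamming_distance` and `code_distance` are illustrative choices, not part of the slides, and words are assumed to be given as equal-length bit strings.

```python
from itertools import combinations

def hamming_distance(x: str, y: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def code_distance(code_words) -> int:
    """Minimum Hamming distance over all pairs of code words."""
    return min(hamming_distance(x, y) for x, y in combinations(code_words, 2))

d = code_distance(["000", "011", "101", "110"])
# d - 1 errors can be detected, (d - 1) // 2 errors can be corrected
print("d =", d, "detect:", d - 1, "correct:", (d - 1) // 2)
```

For the code {000, 111} the same function gives d = 3, which matches the single-error-correcting case on the following slides.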
Hamming distance (d)
[Figure: code words drawn as nodes; two words are connected by an edge if their d is 1]
d = 2: can detect single-bit errors

Hamming distance (d)
The code {000,111} can be used to encode a single data bit. 0 can be
encoded as 000 and 1 as 111. This code is identical to TMR
d = 3: can detect single- and double-bit errors; can correct single-bit errors

Separability of a Code




A code is separable if it has separate fields for the data and the code bits
Decoding consists of disregarding the code bits
The code bits can be processed separately to verify the correctness of the data
A non-separable code has the data and code bits integrated together; extracting the data from the encoded word requires some processing

Single-bit Parity

Simplest separable error detection code
– Adds one bit of redundancy to each data word
Encoding and decoding cost is low
Even (odd) parity: add a bit such that the total number of ones in the code word is even (odd)
– E.g., 001010 gets a parity bit of 0 for even parity (1 for odd)
Can detect all single-bit errors (and all errors in an odd number of bits)
– Hamming distance >= 2
– Could be greater than 2 if the data words don't use all bit combinations
Drawbacks:
– Unable to detect errors in an even number of bits, which are common

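A minimal Python sketch of single-bit parity, assuming data words are bit strings and the parity bit is appended at the end. The names `parity_bit`, `encode_parity` and `check_parity` are hypothetical helpers for illustration only.

```python
def parity_bit(data: str, odd: bool = False) -> int:
    """Parity bit that makes the total number of 1s even (or odd)."""
    even_bit = data.count("1") % 2   # 1 only if the data has an odd number of 1s
    return even_bit ^ 1 if odd else even_bit

def encode_parity(data: str, odd: bool = False) -> str:
    """Separable code word: data bits followed by the parity bit."""
    return data + str(parity_bit(data, odd))

def check_parity(word: str, odd: bool = False) -> bool:
    """True if the received word has the expected parity."""
    return word.count("1") % 2 == (1 if odd else 0)

word = encode_parity("001010")          # even parity
single_error = "1" + word[1:]           # first bit flipped
double_error = "11" + word[2:]          # first two bits flipped
print(check_parity(word), check_parity(single_error), check_parity(double_error))
```

The double-bit error passes the check, which illustrates the drawback listed above.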
Single-bit Even Parity
Even or Odd Parity?




The decision depends on which type of all-bits error is more probable
For even parity, the parity bit for the all-zeroes data word will be 0, and an all-0's failure will go undetected because it is a valid code word
Selecting the odd parity code will allow the detection of the all-0's failure
If an all-1's failure is more likely, the odd parity code must be selected if the total number of bits (n+1) is even, and the even parity code if n+1 is odd

Byte-Interlaced Parity Code






Example: n = 64, data bits a63, a62, …, a0
Eight parity bits:
– First: parity bit of a63, a55, a47, a39, a31, a23, a15, a7 (the most significant bits of the eight bytes)
– Remaining seven parity bits: assigned so that the corresponding groups of bits are interlaced
Scheme is beneficial when shorting of adjacent bits is a common failure mode (example: a bus)
If the parity type (odd or even) is alternated between groups, unidirectional errors (all-0's or all-1's) will also be detected

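For the 64-bit example, one way to sketch the eight interlaced parity bits in Python is to XOR the eight bytes together: bit j of the result is then the even parity of bit j of every byte. This assumes even parity in every group and a big-endian byte order; the function name `interlaced_parity` is illustrative.

```python
from functools import reduce

def interlaced_parity(word: int) -> int:
    """Eight even-parity bits for a 64-bit word.

    Bit j of the result covers bit j of each of the eight bytes, so the
    top bit covers a63, a55, a47, a39, a31, a23, a15, a7.
    """
    data = word.to_bytes(8, "big")
    return reduce(lambda acc, byte: acc ^ byte, data, 0)

word = 0x0123456789ABCDEF
stored = interlaced_parity(word)
shorted = word ^ 0b11                    # two adjacent bits a1, a0 flipped
print(interlaced_parity(shorted) != stored)   # adjacent-bit error is detected
```

Because adjacent bits fall into different parity groups, an error in two neighbouring bits changes two parity bits instead of cancelling out.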
Overlapping Parity Code




Simplest scheme: data is organized in a 2-dimensional array
Bits at the end of a row: parity over that row
Bits at the bottom of a column: parity over that column
Error correcting code?
– A single-bit error anywhere will cause a row and a column to be erroneous
– This identifies a unique erroneous bit
This is an example of overlapping parity: each bit is covered by more than one parity bit

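The row/column scheme can be sketched in Python as follows. This is an illustration under stated assumptions: even parity is used for both rows and columns, the block is a list of bit lists, and the names `add_parity` and `locate_error` are hypothetical.

```python
def add_parity(rows):
    """Append an even-parity bit to each row, then an even-parity row."""
    with_row_parity = [r + [sum(r) % 2] for r in rows]
    column_parity = [sum(col) % 2 for col in zip(*with_row_parity)]
    return with_row_parity + [column_parity]

def locate_error(block):
    """Return (row, col) of a single-bit error, or None if all parities hold."""
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one bit in error; cannot correct")

block = add_parity([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
block[1][2] ^= 1                      # inject a single-bit error
row, col = locate_error(block)
block[row][col] ^= 1                  # correct it by flipping the bit back
print((row, col), locate_error(block))
```

The intersection of the failing row and the failing column pinpoints the erroneous bit, which is what turns this detection scheme into a single-error-correcting one.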
Checksum




Separable code
Checksum is the sum of the original data
All checksum schemes allow error detection but not error location; the entire block of data must be retransmitted if an error is detected
a) Single-precision checksum
– overflow problem, i.e. adding n-bit words modulo 2^n
b) Double-precision checksum
– uses double precision, i.e. computes a 2n-bit checksum from n-bit words using modulo 2^(2n) arithmetic
c) Residue checksum
– like the single-precision checksum, but overflow is now fed back as carry
d) Honeywell checksum
– compose words of double length by concatenating 2 consecutive words
– compute the checksum on these double words (done modulo 2^(2n))

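The four variants can be compared with a small Python sketch. It assumes the data block is a list of n-bit integers; padding an odd-length block with a zero word in the Honeywell variant is an assumption made here for illustration, not something stated on the slide. All function names are hypothetical.

```python
def single_precision(words, n):
    """n-bit checksum: sum of the words modulo 2^n (overflow is dropped)."""
    return sum(words) % (1 << n)

def double_precision(words, n):
    """2n-bit checksum: sum of the n-bit words modulo 2^(2n)."""
    return sum(words) % (1 << (2 * n))

def residue(words, n):
    """Like single precision, but the overflow is fed back as a carry."""
    mask = (1 << n) - 1
    total = 0
    for w in words:
        total += w
        total = (total & mask) + (total >> n)   # end-around carry
    return total

def honeywell(words, n):
    """Concatenate consecutive word pairs, then sum modulo 2^(2n)."""
    if len(words) % 2:
        words = words + [0]                     # assumption: pad with a zero word
    doubles = [(words[i] << n) | words[i + 1] for i in range(0, len(words), 2)]
    return sum(doubles) % (1 << (2 * n))

block = [0b0000, 0b0101, 0b1111, 0b0010]        # four 4-bit words
for scheme in (single_precision, double_precision, residue, honeywell):
    print(scheme.__name__, format(scheme(block, 4), "b"))
```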
Comparing the Checksum Types
Comparison - Example


In Single-precision checksum - transmitted checksum differs from
computed checksum
In Honeywell checksum computed checksum differs from received
checksum and error is detected
Cyclic Codes





Cyclic codes are often non-separable, although separable cyclic codes exist
Encoding consists of multiplying the data word by a constant number
The coded word is the product
Decoding is dividing by the same constant; if the remainder is non-zero, an error has occurred
Cyclic codes are widely used in data storage and communication

Cyclic Redundancy Checks (CRC)






CRC is based on a mathematical calculation performed on the message.
We will use the following terms:
M - Message to be sent (k bits)
F - Frame Check Sequence (FCS) or CRC to be appended to the message (n bits)
T - Transmitted message, which includes both M and F (k+n bits)
G - (n+1)-bit pattern (called the generator polynomial) used to calculate F and check T

Cyclic Redundancy Check (CRC)

Key idea
– given a k-bit frame (message)
– the transmitter generates an n-bit sequence called the frame check sequence (FCS)
– so that the resulting frame of size k+n is exactly divisible by some predetermined number
Multiply M by 2^n to shift, and add F to the padded 0s
– T = 2^n M + F
Dividing 2^n M by G gives a quotient and a remainder (the remainder is 1 bit shorter than the divisor)
– 2^n M / G = Q + R/G
Then using R as our FCS we get
– T = 2^n M + R
On the receiving end, division by G leads to
– T/G = (2^n M + R)/G = Q + R/G + R/G = Q (in mod-2 arithmetic, R/G + R/G = 0)
If the remainder is non-zero, it's an error

Cyclic Redundancy Check (CRC)

Example: assume G(X) has at least 3 terms
– G(X) has at least three 1-bits
» detects all single-bit errors
» detects all double-bit errors
» detects any odd number of errors if G(X) contains the factor (X + 1)
» detects any burst error shorter than the length of the FCS
» detects most larger burst errors

Cyclic Redundancy Check (CRC)

A polynomial view:
• Express each bit pattern as a polynomial in a variable X with binary coefficients, where the coefficients correspond to the bits in the number.
• M = 110011, M(X) = X^5 + X^4 + X + 1, and for G = 11001 we have G(X) = X^4 + X^3 + 1
• Math is still mod 2
» An error pattern E(X) goes undetected iff it is divisible by G(X)

CRC Example
M = 10110100011, G = 1101; XOR instead of minus
10110100011000 | 1101
1101
 1100
 1101
    1100
    1101
       1011
       1101
        1100
        1101
           100
=> CRC = 100

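The long division in this example can be reproduced with a short Python sketch. It assumes the message and generator are bit strings and that the generator starts with a 1; `mod2_remainder` and `crc` are illustrative names.

```python
def mod2_remainder(bits: str, generator: str) -> str:
    """Remainder of bits divided by generator, using XOR as subtraction."""
    work = [int(b) for b in bits]
    g = [int(b) for b in generator]
    for i in range(len(work) - len(g) + 1):
        if work[i]:                       # divisor "fits" only under a leading 1
            for j, gj in enumerate(g):
                work[i + j] ^= gj
    return "".join(str(b) for b in work[-(len(g) - 1):])

def crc(message: str, generator: str) -> str:
    """FCS: remainder after appending len(generator) - 1 zero bits."""
    return mod2_remainder(message + "0" * (len(generator) - 1), generator)

message, generator = "10110100011", "1101"
fcs = crc(message, generator)
frame = message + fcs                     # transmitted frame T
print(fcs, mod2_remainder(frame, generator))
```

The receiver runs the same division on the whole frame; an all-zero remainder means no error was detected.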
Cyclic Redundancy Check (CRC)

Pre-defined polynomial examples:
• CRC-12: X^12 + X^11 + X^3 + X^2 + X + 1
• CRC-16: X^16 + X^15 + X^2 + 1
• CRC-CCITT: X^16 + X^12 + X^5 + 1
Why is CRC popular?
• Easy to implement! Just need shifters and XORs
Hardware implementation:
• G(X) = 1 + a_1 X + a_2 X^2 + … + a_(n-1) X^(n-1) + a_n X^n

Hamming Code (7,4)






Class of (n,k) Hamming codes, e.g., (7,4) [r= n-k =3]
Let i1, i2, i3, i4 be the information bits
Let p1, p2, p4 be the check bits
p1 = i1 XOR i2 XOR i4
p2 = i1 XOR i3 XOR i4
p4 = i2 XOR i3 XOR i4
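A minimal Python sketch of the (7,4) code built from the three check equations above. The bit ordering p1 p2 i1 p4 i2 i3 i4 (check bits at positions 1, 2 and 4) is the conventional layout and is an assumption here, as are the function names.

```python
def hamming74_encode(i1, i2, i3, i4):
    """Encode four information bits into a 7-bit code word."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    # Assumed positions 1..7: p1 p2 i1 p4 i2 i3 i4
    return [p1, p2, i1, p4, i2, i3, i4]

def hamming74_correct(word):
    """Return (corrected word, error position); position 0 means no error."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]    # recheck p1 over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]    # recheck p2 over positions 2, 3, 6, 7
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]    # recheck p4 over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s4   # syndrome read as a binary number
    if position:
        w[position - 1] ^= 1
    return w, position

code_word = hamming74_encode(1, 0, 1, 1)
received = list(code_word)
received[4] ^= 1                      # single-bit error at position 5
print(hamming74_correct(received))
```

With this layout the syndrome directly names the erroneous position, so correction is a single bit flip.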
Unordered code

To detect all unidirectional errors
– m-of-n code
– Berger code

m-of-n codes




All code words are n bits in length and contain exactly m 1's
Simple implementation
Can detect all single errors
Can detect all unidirectional multiple errors

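Checking an m-of-n code word only requires counting 1s, as the following Python sketch suggests. Words are assumed to be bit strings, and `is_valid` and `code_words` are illustrative names.

```python
from itertools import combinations

def is_valid(word: str, m: int) -> bool:
    """A word is a valid m-of-n code word iff it contains exactly m 1s."""
    return word.count("1") == m

def code_words(m: int, n: int):
    """Generate every n-bit word with exactly m 1s."""
    for ones in combinations(range(n), m):
        yield "".join("1" if i in ones else "0" for i in range(n))

words = list(code_words(2, 5))
print(len(words), words[:3])
print(is_valid("11000", 2), is_valid("01000", 2), is_valid("11110", 2))
```

A unidirectional error only ever raises or only ever lowers the count of 1s, so it always leaves the valid set. An error that flips one 1 to 0 and one 0 to 1 keeps the count and goes undetected.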
Berger Code

Separable code
– counts the number of 1s in the word
– expresses it in binary
– complements it
– appends this quantity to the data
Example: encoding 11101
– Four 1s
– 100 in binary
– 011 after complementing
– the encoded word is 11101011
Can detect all single errors
Can detect all unidirectional bit errors: one or more 1s turn to 0s and no 0s turn to 1s (or vice versa)
Overhead = r/(k+r)
– k data bits contain at most k 1s, so r = ⌈log2(k+1)⌉ redundant bits

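The four encoding steps can be sketched in Python as below. Data words are assumed to be bit strings; `berger_encode` and `berger_check` are hypothetical names. The check-field width uses `k.bit_length()`, which equals ⌈log2(k+1)⌉ for k >= 1.

```python
def berger_encode(data: str) -> str:
    """Append the complemented binary count of 1s to the data word."""
    k = len(data)
    r = k.bit_length()                 # equals ceil(log2(k + 1)) for k >= 1
    ones = data.count("1")
    check = ones ^ ((1 << r) - 1)      # complement within r bits
    return data + format(check, f"0{r}b")

def berger_check(word: str, k: int) -> bool:
    """Recompute the check field from the first k bits and compare."""
    return berger_encode(word[:k]) == word

encoded = berger_encode("11101")
print(encoded, berger_check(encoded, 5))
print(berger_check("10101011", 5))     # one data bit dropped from 1 to 0
```

Separability shows up in `berger_check`: the data bits are read directly from the first k positions without any decoding.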
Other Coding Schemes

Many error detecting/correcting codes exist
– E.g., arithmetic codes, Reed-Solomon codes, residue codes, bi-residue codes, etc.

Many of them require more mathematics than belongs in this course

Reasons for other types of codes
– Burst errors
– Byte errors
– Cost/performance
– Multiple-bit errors
– Ease of hardware implementation

Error Recovery

Probably the most important phase of any fault-tolerance technique

Two approaches:
– Forward
– Backward

Forward Error Recovery




Forward Error Recovery continues from an erroneous state by making selective corrections to the system state
This includes making safe the controlled environment, which may be damaged because of the failure
It is system specific and depends on accurate predictions of the location and cause of errors
Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming codes

Backward Error Recovery (BER)

If an error is detected, recover backwards & re-execute
– Recover to a previous state of the system that we know is error-free
– Assumes that the error will be gone by the time of re-execution
Some terminology:
– Recovery point: the point to which we recover in case of error
– Checkpointing: periodically saving the state of the system
– Logging: saving changes made to the system state
May sacrifice performance to achieve availability
– Where might we lose performance?
– May not be suitable for real-time systems
BER also includes all-software schemes
– Nightly backups of file systems
Many commercial machines use BER
– Sequoia, Synapse N+1, Tandem NonStop
Disadvantage
– It cannot undo errors in the environment!

The Domino Effect
With concurrent processes that interact with each other, BER is more complex. Consider:
[Figure: timelines of processes P1 (recovery points R11, R12, R13) and P2 (recovery points R21, R22) along the execution-time axis, exchanging messages IPC1, IPC2, IPC3, IPC4]
If the error is detected in P1, rollback to R13.
If the error is detected in P2?

6 BER Issues
1- What state needs to be saved?
2- How do we save this state?
3- Where do we save it?
4- How often do we save it?
5- How do we recover the system to this state?
6- How do we resume execution after recovery?
1- What State needs to be saved



Need to save all state that would be necessary if this were to become the recovery point
In general, we only need to save the user-visible state
For example, microprocessors:
– Must save architectural state
– Don't have to worry about micro-architectural state

2- How to Save State

Two “hints” of BER:
– Checkpointing: periodically stop the system and save state
– Logging: log all changes to state
Checkpointing
– Only suffers overhead at periodic checkpoints
– Can only recover at coarse granularity
– Size of checkpoint is often fixed
Logging
– Finer granularity of rollback
– Suffers overhead for logging many common operations
– Amount of state logged is variable

3- Where to Save State

Have to save state where it is “reliable”
– A fault in the recovery point state could make recovery impossible
In processor (can't survive loss of processor chip)
– Processor saves registers to shadow registers
In cache (same as processor, if on-chip cache)
– Processor copies registers into cache
In memory (memory can be made very reliable)
– Processor copies registers into memory
– Write-through cache copies data into memory
In disk (maybe the safest, but slow)
– E.g., databases log updates to disks
In tape (too slow except for rare backups)

4- When to Save State

Checkpointing
– Can choose the checkpoint interval
Logging
– Continuously saving state (every time it changes)
For checkpointing, a larger checkpoint interval means
– Less overhead due to checkpointing (since it is less frequent)
– Coarser checkpoint granularity (can't recover to an arbitrary point)

5- How to Recover State



Checkpointing: copy the pre-fault recovery point checkpoint into the architectural state
Logging: unroll the log to undo the changes made since the recovery point
The tradeoff between these two depends on the system

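The two recovery styles can be contrasted with a toy Python sketch. It models the system state as a dictionary, which is purely an assumption for illustration; the class and method names are hypothetical and do not describe any of the machines named earlier.

```python
import copy

class CheckpointedState:
    """Checkpointing: keep a full copy of the state, restore it on recovery."""
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def recover(self):
        self.state = copy.deepcopy(self._checkpoint)

class LoggedState:
    """Logging: record the old value on every write, undo in reverse order."""
    _MISSING = object()                  # marks keys that did not exist before

    def __init__(self, state):
        self.state = state
        self._log = []

    def write(self, key, value):
        self._log.append((key, self.state.get(key, self._MISSING)))
        self.state[key] = value

    def recovery_point(self):
        self._log.clear()                # earlier changes can no longer be undone

    def recover(self):
        while self._log:
            key, old = self._log.pop()
            if old is self._MISSING:
                del self.state[key]
            else:
                self.state[key] = old

logged = LoggedState({"pc": 0})
logged.write("pc", 4)
logged.recovery_point()
logged.write("pc", 8)
logged.write("r2", 1)
logged.recover()
print(logged.state)
```

The checkpoint copy has a fixed cost per checkpoint regardless of activity, while the log grows with the number of writes, which mirrors the overhead tradeoff described for the two schemes.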
6- How to Resume Execution

Simply resuming execution after recovery may not be possible
– E.g., recovery due to a hard fault in an interconnection switch
May need to reconfigure before resuming, to ensure forward progress
– E.g., reconfiguring the routing in the interconnect to avoid the dead switch

Implementing EDC/ECC in Hardware

Where does EDC/ECC get used?
– Disk, CD-ROM
– Memory (DRAM, SRAM)
– Buses
Tradeoff between EDC and ECC
ECC: Forward error recovery
– Often on the critical path, so it can slow down even a fault-free system
– Useful in one-way transmission
EDC: Backward error recovery
– Detecting an error requires recovery (can be slow)