Memory Errors and Error Correction
Bruce Jacob, … David T. Wang, in Memory Systems, 2008
30.3.4 Multi-Bit Error Detection and Correction: Bossen’s b-Adjacent Algorithm
The SEC error detection and correction algorithms described thus far focus on the detection and correction of single-bit errors. However, multi-bit errors can and do occur. The trends associated with the scaling of process technology, including the shrinking geometries of basic circuits and the continuing reduction in supply voltage, increase a circuit’s sensitivity to SEUs. Thus, multi-bit error detection and correction is increasing in importance. Fortunately, multi-bit error detection and correction algorithms have been developed for systems requiring a high degree of reliability in the memory system.
In a 64-bit data word, a single-bit error can occur in 1 of 64 locations. In the same data word, two randomly located single-bit errors can occur in 2016 different combinations of bit positions. Thus, it is far more difficult to locate and correct two random errors. However, multi-bit errors are usually caused by the same event and thus typically occur in adjacent bit locations. Multi-bit error correction algorithms used in modern computer systems exploit this and focus on detecting and correcting multi-bit errors in adjacent locations.
There are numerous multi-bit error correction algorithms, based on algorithms developed by Hsiao, Carter, Kaneda, Bossen, and others. We illustrate a representative example: Bossen’s b-adjacent algorithm [Bossen 1970]. The example describes a 2-adjacent algorithm; the extension to larger values of b is simply asserted and not illustrated.
Bossen’s b-adjacent ECC algorithm uses the familiar recursive application of the exclusive-or function to generate parity protection. The difference between 2-adjacent error correction and SEC error correction is in the definition of the basic unit of error detection. The SEC ECC algorithm relies on the redundant parity checking protection of multiple check bits for each data bit. In SEC ECC, the basic unit of error detection—or symbol size—is an individual bit. In the b-adjacent ECC algorithm, the basic unit of error detection is a group of b-adjacent bits. As an example, in a 64-bit input data vector, there are 32 groups of 2-adjacent bits, or 32 symbols, with each symbol being 2 bits in size. In this algorithm, each symbol is a basic unit for the error location and correction. Figure 30.9 shows that in 2-adjacent code space, there are 2^2 = 4 basic symbols. Formally, 2-bit vectors can be described as elements of a Galois field with 2^2 elements, denoted as GF(2^2). The correction of a single error in GF(2^2) is then equivalent to correcting 2-adjacent bits in the binary field.
FIGURE 30.9. Two-bit vectors in 2-adjacent code.
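The arithmetic behind these symbols can be made concrete. The following sketch is an illustration only (the function names and bit encodings are mine, not the chapter’s): it represents the four elements 0, 1, α, and α^2 as 2-bit vectors, adds them with bitwise XOR, and multiplies them modulo the primitive polynomial x^2 + x + 1.

```python
# Illustration only: the four elements of GF(2^2) as 2-bit vectors.
# Addition is bitwise XOR; multiplication reduces by x^2 + x + 1.

ELEMENTS = {0b00: "0", 0b01: "1", 0b10: "alpha", 0b11: "alpha^2"}

def gf4_add(a, b):
    return a ^ b                      # addition in GF(2^2) is bitwise XOR

def gf4_mul(a, b):
    p = 0
    for i in range(2):                # schoolbook multiply of 2-bit polynomials
        if (b >> i) & 1:
            p ^= a << i
    if p & 0b100:                     # reduce x^2 -> x + 1
        p ^= 0b111
    return p

for a in ELEMENTS:
    for b in ELEMENTS:
        print(f"{ELEMENTS[a]} * {ELEMENTS[b]} = {ELEMENTS[gf4_mul(a, b)]}")
```

For example, the sketch reports alpha * alpha = alpha^2 and alpha^2 * alpha = 1, which is the cyclic structure the nonzero symbols of Figure 30.9 obey.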
The 2-adjacent ECC algorithm can be implemented to correct 2-adjacent bit errors with 8 check bits for 64 data bits, denoted as (64, 72). Figure 30.10 shows the parity check matrix for a single-error-correcting code over GF(2^2), transcribed from Bossen’s b-adjacent ECC algorithm. In the figure, the symbols 0, 1, α, and α^2 represent the elements in GF(2^2). The matrix can be used to generate 4 check symbols from 32 strings of GF(2^2) elements. That is, each column represents 1 of 32 length-four strings of elements in GF(2^2), and each row of the parity check matrix corresponds to 1 of 4 check symbols. The check symbols can detect and locate an error of a single element out of a data vector of 32 elements in GF(2^2). Figure 30.10 also shows a transformation applied to the parity check matrix to convert the single error correcting matrix in GF(2^2) to a 2-adjacent error correcting matrix in the binary field. The transformation yields the binary parity check matrix that can generate the necessary check bits to detect and locate 2-adjacent errors out of a data vector of 64 elements in the binary field.
FIGURE 30.10. Check bit generation for 2-adjacent error detection.
Check bits C7 through C0 are generated from the transformed parity check matrix through the recursive application of the exclusive-or function to the bit positions indicated by the parity check matrix. As an example, the top row of the parity check matrix shows that check bit C7 can be generated from the recursive application of the exclusive-or function to D0, D2, D4, D9 … D61, and D62.
Table 30.3 shows the error location table for the 2-adjacent ECC algorithm; this is taken from the patent filing for Compaq’s Advanced ECC algorithm, based on Bossen’s 2-adjacent ECC algorithm. When an ECC code word is read from memory, the check bits are recomputed from the data bits and compared to the check bits retrieved with the ECC code word. Figure 30.11 shows that, just as in the SEC ECC algorithm, the bitwise comparison of the original check bits against the recomputed check bits is used to generate the ECC syndrome, denoted as S7 through S0 in the figure. The ECC syndrome is used as an index into the error location table to determine the type and location of the error.
TABLE 30.3. Error location table for the 2-adjacent error correction algorithm, taken from US Patent #5,490,155 (Compaq’s Advanced ECC implementation)
[Rows of the table are selected by syndrome bits S3–S0 and columns by syndrome bits S7–S4. Each entry names the corrupted check bit (C0–C7), the single corrupted data bit (0–63), or the pair of adjacent corrupted data bits (e.g., 32,33) corresponding to that syndrome; an all-zero syndrome indicates no error. For example, syndrome 10100010 maps to data bit 32, and syndrome 11110011 maps to the adjacent pair 32,33.]
FIGURE 30.11. Recomputed check bits bitwise XOR’ed against retrieved check bits to generate the 2-adjacent ECC syndrome.
Figure 30.12 illustrates how both single-bit errors and double-bit errors that occur on even-bit boundaries can be located with 2-adjacent error correction.
FIGURE 30.12. Locating a single bit and 2-adjacent bit errors in a 64-bit word.
In the first matrix, bit 32 of a 64-bit block has been corrupted while stored in memory. The first matrix in Figure 30.12 shows that check bits C7, C5, and C1 would be affected by this. Note that bit 32 in the block is the only bit that would affect these check bits and only these check bits. A comparison of the retrieved check bits against the recomputed check bits results in the ECC syndrome bit vector of 10100010. A check of Table 30.3 shows that an ECC syndrome of 10100010 indicates bit 32 as being corrupt.
In the second example, assume bits 32 and 33 are both corrupted. As a result, check bits C7, C6, C5, C4, C1, and C0 are affected. A comparison of the retrieved check bits against the recomputed check bits results in the ECC syndrome 11110011. Using this as an index into the error location table identifies bits 32 and 33 as corrupted.
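A rough sketch of this read path is given below. It is an illustration only: the 8 × 64 matrix H is a randomly generated placeholder rather than the actual matrix of Figure 30.10, so the printed syndrome will not match Table 30.3; what it shows is the structure of the computation (check-bit generation on write, recomputation on read, and the XOR comparison that forms the syndrome).

```python
# Minimal sketch (placeholder matrix, not Bossen's/Compaq's): generate check
# bits from a parity check matrix, recompute them on read, and XOR the two to
# form the syndrome that would index an error location table such as Table 30.3.

import random

ROWS, COLS = 8, 64
random.seed(1)
H = [[random.randint(0, 1) for _ in range(COLS)] for _ in range(ROWS)]  # hypothetical matrix

def check_bits(data):
    """One check bit per row of H: XOR of the data bits selected by that row."""
    return [sum(d & h for d, h in zip(data, row)) & 1 for row in H]

def ecc_syndrome(data, stored_checks):
    """Bitwise comparison of stored check bits against recomputed check bits."""
    return [s ^ r for s, r in zip(stored_checks, check_bits(data))]

# Write path: encode a 64-bit word; then corrupt two adjacent bits "in memory".
data = [random.randint(0, 1) for _ in range(COLS)]
stored = check_bits(data)
corrupted = data[:]
corrupted[32] ^= 1
corrupted[33] ^= 1

# Read path: a nonzero syndrome would be looked up in the error location table
# to identify the corrupt bit or adjacent bit pair; here we simply print it.
print(ecc_syndrome(corrupted, stored))
```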
URL:
https://www.sciencedirect.com/science/article/pii/B9780123797513500321
Error Coding Techniques
In Architecture Design for Soft Errors, 2008
5.5.1 DUE FIT from Temporal Double-Bit Error with No Scrubbing
Assume that an 8-bit SECDED ECC protects 64 bits of data, which is referred to as a quadword in this section. A temporal double-bit error occurs when two bits of this 72-bit protected quadword are flipped by two separate alpha particle or neutron strikes. This analysis is only concerned with strikes in the data portion of the cache blocks (and not the tags). To compute the FIT contribution of temporal double-bit errors, the following terms need to be defined:
▪ Q = number of quadwords in the cache memory. Each quadword consists of 64 bits of data and 8 bits of ECC. Thus, there are a total of 72 bits per quadword.
▪ E = number of random single-bit errors that occur in the population of Q quadwords.
Given E single-bit errors residing in E different quadwords (out of Q), the probability that the (E+1)th error will cause a double-bit error is E/Q. Let Pd[n] be the probability that a sequence of n strikes causes n–1 single-bit errors (but no double-bit errors) followed by a double-bit error on the nth strike. Pd[1] must be 0 because a single strike cannot cause a double-bit error. Pd[2] is the probability that the second strike hits the same quadword as the first strike, or 1/Q. Pd[3] is the probability that the first two strikes hit different quadwords (i.e., 1 – Pd[2]) times the probability that the third strike hits either of the first two quadwords that got struck (i.e., 2/Q). Following this formulation and using * to represent multiplication, one gets
▪ Pd[2] = 1/Q
▪ Pd[3] = [(Q–1)/Q] * [2/Q]
▪ Pd[4] = [(Q–1)/Q] * [(Q–2)/Q] * [3/Q]
▪ …
▪ Pd[E] = [(Q–1)/Q] * [(Q–2)/Q] * [(Q–3)/Q] * … * [(Q–E+2)/Q] * [(E–1)/Q]
Then the probability of a double-bit error within a time period T is the sum over all N of Pd[N] * P[N strikes in time T]. Using this equation, one can solve for the expected value of T to derive the MTTF to a temporal double-bit error.
There is, however, an easier way to calculate MTTF to a temporal double-bit error. Assume that M is the mean number of single-bit errors needed to get a double-bit error. Then, the MTTF of a temporal double-bit error = M * MTTF of a single-bit error. (Similarly, the FIT rate for a double-bit error = 1/M * FIT rate for a single-bit error.) A simple computer program can calculate M very easily as the expected value of Pd[.].
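A sketch of such a program is shown below (an illustration, not code from the text; the function name is mine). It computes M from the Pd[.] recurrence as the expected number of strikes until one lands in an already-struck quadword.

```python
def mean_strikes_to_double_error(Q):
    """Expected number of strikes until two strikes hit the same quadword."""
    expected = 0.0
    p_all_distinct = 1.0               # P(first n-1 strikes hit n-1 different quadwords)
    n = 2
    while p_all_distinct > 1e-15:      # remaining probability mass is negligible
        pd_n = p_all_distinct * (n - 1) / Q          # Pd[n]
        expected += n * pd_n
        p_all_distinct *= (Q - (n - 1)) / Q
        n += 1
    return expected

print(mean_strikes_to_double_error(2**22))           # ~2567 for the 32 MB cache below
```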
▪
EXAMPLE
Compute the DUE rate from temporal double-bit errors in a 32 megabyte cache, assuming a FIT/bit of 1 milliFIT. Use the method described above. Assume M = 2567, which can be easily computed using a computer program.
SOLUTION The cache has 2^22 quadwords. The single-bit FIT rate for the entire cache is 0.001 × 2^22 × 72 = 3.02 × 10^5, i.e., the MTTF is 10^9/(3.02 × 10^5) = 3311 hours. Using a computer program, one can find that M = 2567. Then the MTTF to a double-bit error = 3311 × 2567 hours = 970 years.
Using a Poisson distribution, Saleh et al. [18] came up with a compact approximation for double-bit error MTTFs of large memory systems. Their derivation shows that the MTTF of such temporal double-bit errors is equal to [1/(72 * f)] * sqrt(π/(2Q)), where f = FIT rate of a single bit.
▪
EXAMPLE
Compute the DUE rate from temporal double-bit errors in a 32-megabyte cache, assuming a FIT/bit of 1 milliFIT. Use the compact equation of Saleh et al.
SOLUTION f = 0.001. Q = 2^22. Then the MTTF to a double-bit error = 0.0085 × 10^9 hours, or 970 years. The answer is the same when computed from first principles or from Saleh’s compact form.
The above calculation does not factor in the reduced error rates because of the AVF. The single-bit FIT/bit can be appropriately derated using the AVF to compute the more realistic temporal double-bit DUE rate.
It is important to note that the MTTF contribution from temporal double-bit errors for a system with multiple chips cannot be computed in the same way as can be done for single-bit errors. If chip failure rates are independent (and exponentially distributed), then a system composed of two chips, each with an MTTF of 100 years, has an overall MTTF of 100/2 = 50 years. Unfortunately, double-bit error MTTFs do not combine this way because the MTTF of a double-bit error is not a linear function of the number of bits. This is also evident in Saleh’s compact form, which shows that the MTTF of such double-bit errors is inversely proportional to the square root of the size of the cache. Thus, quadrupling the cache size halves the MTTF of double-bit errors but does not reduce it by a factor of 4.
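The scaling behavior is easy to check numerically. The sketch below (an illustration under the example’s assumptions of 1 milliFIT/bit and 72-bit quadwords) evaluates Saleh’s compact form for two cache sizes.

```python
from math import sqrt, pi

def mttf_double_hours(Q, fit_per_bit=0.001):
    """Saleh et al.'s approximation: MTTF ~ (1/(72 f)) * sqrt(pi / (2 Q))."""
    f = fit_per_bit * 1e-9             # FIT is failures per 10^9 device-hours
    return (1.0 / (72 * f)) * sqrt(pi / (2 * Q))

for megabytes in (32, 128):
    Q = megabytes * 2**20 // 8         # 8 data bytes per quadword
    years = mttf_double_hours(Q) / (24 * 365)
    print(f"{megabytes} MB cache: MTTF to a double-bit error ~ {years:.0f} years")
# Quadrupling the cache size (32 MB -> 128 MB) halves the MTTF (~970 -> ~485 years).
```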
URL:
https://www.sciencedirect.com/science/article/pii/B9780123695291500070
Communicating pictures: delivery across networks
David R. Bull, Fan Zhang, in Intelligent Image and Video Compression (Second Edition), 2021
11.2.1 Synchronization failure
As we have already seen in Chapter 7, a coded video bitstream normally comprises a sequence of variable-length codewords (VLCs). VLC methods such as Huffman or arithmetic coding are employed in all practical image and video coding standards. Fixed-length codewords (FLCs) are rarely used in practice except when symbol probabilities do not justify the use of VLC (e.g., in some header information). A further exception to this is the class of error-resilient methods, including pyramid vector quantization (PVQ), which we will study in Section 11.7.2.
As we will see in the following subsections, the extent of the problem caused will be related to the location where the error occurs and the type of encoding employed.
The effect of a single bit error
With FLCs, any codeword affected by a single bit error will be decoded incorrectly but, because of the implicit synchronization of FLCs, all other codewords remain unaffected; there is no loss of bitstream or symbol synchronization. With VLCs, in contrast, a single bit error can cause the decoder to decode a longer or shorter codeword, which means that subsequent symbols, although delivered correctly, will most likely be decoded incorrectly due to the loss of synchronization at the decoder. This loss of synchronization can continue until the next explicit resynchronization point and will often result in the wrong number of symbols being decoded. This might mean that important implicit synchronization symbols such as EOB are missed. In practice, most Huffman codes will resynchronize themselves, whereas arithmetic codes rarely do. The two scenarios in Example 11.1 illustrate the problems of synchronization loss in VLC, demonstrating the subtle differences between synchronization at the bitstream, symbol, and coding unit (e.g., transform block) levels.
Example 11.1
Loss of VLC synchronization due to a single bit error
Given the following set of symbols and their corresponding Huffman codewords,
Alphabet | A | B | C | D | E
---|---|---|---|---|---
Huffman code | 0 | 10 | 110 | 1110 | 1111
with the following transmitted message,
Message | B | A | D | E | C | B | A
---|---|---|---|---|---|---|---
Encoded bitstream | 10 | 0 | 1110 | 1111 | 110 | 10 | 0
a) Compute and comment on the decoded sequence of symbols if there is a single bit error in the sixth bit.
b) Repeat (a) for the case of a single bit error at the tenth bit position.
Solution
a) The table of decoded bits and symbols is given below with the errored bit in bold font:
Encoded message | B | A | D | E | C | B | A | – |
---|---|---|---|---|---|---|---|---|
Received bitstream | 10 | 0 | 1100 | 1111 | 110 | 10 | 0 | |
Parsed bitstream | 10 | 0 | 110 | 0 | 1111 | 110 | 10 | 0 |
Decoded message | B | A | C | A | E | C | B | A |
In this case the symbol D is incorrectly decoded as a C followed by an A and then bitstream resynchronization occurs. However, the fact that an extra symbol is decoded means that the subsequent symbols are displaced and hence symbol synchronization is lost.
b) The table of decoded bits and symbols is given below with the errored bit in bold font.
Encoded message | B | A | D | E | C | B | A |
---|---|---|---|---|---|---|---|
Received bitstream | 10 | 0 | 1110 | 1101 | 110 | 10 | 0 |
Parsed bitstream | 10 | 0 | 1110 | 110 | 1110 | 10 | 0 |
Decoded message | B | A | D | C | D | B | A |
In this case, because of the lengths and encodings of the adjacent symbols, the error is constrained to only two symbols. After this, both bitstream and symbol resynchronization are achieved. Superficially this appears a lot better than the first case. However, consider the situation where the symbols represent {run/size} encodings of transform coefficients. The fact that there are errors in the fourth and fifth symbols means that, after the third symbol, the whole block will be decoded incorrectly because the runs of zeros corresponding to symbols 4 and 5 are incorrect. However, because symbol resynchronization is regained, any subsequent EOB symbol would be correctly detected and block level synchronization will be maintained. See Example 11.2 for more details.
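A short decoding sketch (an illustration only, written against the Huffman table of this example; the helper names are mine) reproduces both scenarios and makes the loss and recovery of symbol synchronization explicit.

```python
# Greedy prefix decoding of the example's Huffman code, with and without a
# single flipped bit. Any trailing bits that do not complete a codeword are
# silently discarded, as a real decoder would be forced to do.

CODE = {"0": "A", "10": "B", "110": "C", "1110": "D", "1111": "E"}

def encode(message):
    inverse = {v: k for k, v in CODE.items()}
    return "".join(inverse[s] for s in message)

def decode(bits):
    symbols, current = [], ""
    for b in bits:
        current += b
        if current in CODE:            # prefix-free, so the first match is the codeword
            symbols.append(CODE[current])
            current = ""
    return symbols

def flip(bits, position):              # position is 1-based, as in the example
    i = position - 1
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]

stream = encode("BADECBA")             # 10 0 1110 1111 110 10 0
print(decode(stream))                  # ['B','A','D','E','C','B','A']
print(decode(flip(stream, 6)))         # extra symbol decoded: symbol sync lost
print(decode(flip(stream, 10)))        # two wrong symbols, then resynchronized
```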
URL:
https://www.sciencedirect.com/science/article/pii/B9780128203538000207
Quantum Computing and Communication
Paul E. Black, … Carl J. Williams, in Advances in Computers, 2002
4.3.1 Single-Bit-Flip Errors
To see how this is possible, consider a simple class of errors: single-bit errors that affect qubits independently. (In reality, of course, more complex problems occur, but this example illustrates the basic technique.) Consider a single qubit, a two-state system with bases |0〉 and |1〉. We will use a simple “repetition code”; that is, we represent a logical zero with three zero qubits, |0L〉=|000〉, and a logical one with three ones, |1L〉=|111〉. An arbitrary qubit in this system, written as a superposition a|0L〉+b|1L〉, becomes a|000〉+b|111〉 with repetition coding. Since we assume the system is stable in all ways except perhaps for single-bit flips, there may be either no error or one of the three qubits flipped, as shown in Table II.
Table II. Bit-Flip Errors, Syndromes, and Correctives
Error | Error state | Syndrome | Correction |
---|---|---|---|
No error | a|000〉 + b|111〉 | |000〉 | None |
qubit 1 flipped | a|100〉 + b|011〉 | |110〉 | X⊗I⊗I |
qubit 2 flipped | a|010〉 + b|101〉 | |101〉 | I⊗X⊗I |
qubit 3 flipped | a|001〉 + b|110〉 | |011〉 | I⊗I⊗X
The strategy for detecting errors is to add three “temporary” qubits |t0t1t2〉, set to |000〉, which will hold “parity” results. We then XOR various bits together, putting the results in the added three qubits: t0 is bit 0 XOR bit 1, t1 is bit 0 XOR bit 2, and t2 is bit 1 XOR bit 2. This leaves a unique pattern, called a syndrome, for each error. The third column of Table II shows the respective syndromes for each error.
Measuring the added three qubits yields a syndrome, while maintaining the superpositions and entanglements we need. Depending on which syndrome we find, we apply one of the three corrective operations given in the last column to the original three repetition encoding qubits. The operation X flips a bit; that is, it changes a 0 to a 1 and a 1 to a 0. The identity operation is I. We illustrate this process in the following example.
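The syndrome logic can be sketched classically (an illustration only; it tracks just the pattern of flipped physical bits, which is all the parity checks observe for a state a|000〉 + b|111〉, and the function names are mine).

```python
# Classical sketch of the three-qubit bit-flip code's syndrome and correction.
# The syndrome (t0, t1, t2) follows the XOR definitions in the text; the lookup
# table reproduces Table II, and the correction flips the indicated qubit (X).

SYNDROME_TO_QUBIT = {(0, 0, 0): None, (1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2}

def syndrome(bits):
    b0, b1, b2 = bits
    return (b0 ^ b1, b0 ^ b2, b1 ^ b2)   # t0, t1, t2 as defined above

def correct(bits):
    flipped = SYNDROME_TO_QUBIT[syndrome(bits)]
    if flipped is not None:
        bits[flipped] ^= 1               # apply X to the indicated qubit
    return bits

# Example: qubit 2 flipped (the a|010> + b|101> row of Table II)
print(correct([0, 1, 0]))                # -> [0, 0, 0], back to the code word
```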
URL:
https://www.sciencedirect.com/science/article/pii/S0065245802800079
Packet Optical Networks and Forward Error Correction
Walter Goralski, in The Illustrated Network (Second Edition), 2017
Burst Errors and Interleaving
So none of the methods we just examined do anything about the issue of burst errors. Single bit errors are easy enough to correct, but many errors come not as single events that smash individual bits, but as longer bursts of static or other types of impulse noise that wipe out several bits in a row: often tens of bits, sometimes hundreds, occasionally thousands. In cases where thousands of bits might be flipped or simply eradicated, one of the reasons that we have layered protocols comes into play. One of the main tasks of an upper layer is to recover from uncorrectable errors that occur at a lower layer. For example, uncorrectable errors that might occur on a hop between two intermediate systems can be detected and fixed by an end-to-end resend between the client and server at the endpoints.
But maybe we can do something about our error-correcting system to make it more robust in the face of limited burst errors (there will always be a point where the length of the burst exceeds the correction code’s designed ability to fix bit errors).
The method commonly used is called interleaving, sometimes seen as scrambling. With interleaving, we do not send the code bits sequentially as soon as we get them. We buffer the bits, both the data bits and derived parity bits, at the sender until we have a set of bits in some predefined structure. Then we can interleave the bits and send them not sequentially as they arrived, but in another pattern altogether.
Let’s look at a simple example using our initial (3,1) code, which sent every data bit three times, so 0 became 000 and 1 became 111. As we saw, single bit errors could be corrected, as in 010 -> 000 or 110 -> 111, due to the Hamming distance characteristics. But if 000 became 110, we lost that ability, because the decoder would try to correct the string to 111.
But let’s buffer five of these code words together first. So if, for instance, 0 came in and became 000, we would not send it right away. Let’s call that AAA and wait for another bit, such as 1, to become 111. That would be BBB in our simple system, with the three letters standing for a particular 000 or 111.
In fact, let’s do this five times: AAA BBB CCC DDD EEE. (Note that the spaces are simply for convenience of visualization: in practice, the bits would all run together.)
Now we’re ready to send the 15 bits…but interleaved. We would send ABCDE ABCDE ABCDE. At the receiver, even before error correction, the original “frame” structure would be recovered by “unscrambling” the bits. Only after de-interleaving would error correction be applied (010 -> 000) and the bits decoded into the sent information (000 -> 0).
Now, suppose a burst error wiped out four consecutive bits in the stream, perhaps like this:
ABCDE ABXXXX BCDE
When this sequence was un-interleaved, we would have: AAX BBB CXC DXD EXE.
We have just, through the process of interleaving, turned a four-bit burst error into four single bit errors! We have, with this simple method, created a system that can recover from bursts up to five bits long. However, we have also introduced another buffering and processing delay at the sender and receiver, a delay and processing burden that grow with complexity and robustness. But given the delay and burden of resending large chunks of errored data, the trade-off is often more than worthwhile.
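The example can be mocked up in a few lines. The sketch below (an illustration with arbitrary data bits; the helper names are mine) buffers five (3,1) code words, interleaves them, applies a four-bit burst, de-interleaves, and corrects each code word by majority vote.

```python
# 5-way block interleaving of (3,1) repetition code words, a 4-bit burst, and
# majority-vote correction after de-interleaving.

def interleave(codewords):
    return [codewords[j][i] for i in range(3) for j in range(5)]   # ABCDE ABCDE ABCDE

def deinterleave(stream):
    return [[stream[i * 5 + j] for i in range(3)] for j in range(5)]

def majority(codeword):
    return 1 if sum(codeword) >= 2 else 0

data = [0, 1, 0, 0, 1]                          # five source bits (arbitrary)
codewords = [[b, b, b] for b in data]           # (3,1) repetition coding
stream = interleave(codewords)

for i in range(7, 11):                          # a 4-bit burst flips positions 8-11
    stream[i] ^= 1

recovered = [majority(cw) for cw in deinterleave(stream)]
print(recovered == data)                        # True: each code word saw at most one error
```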
In practice, we can interleave much more complex “transmission frames” and actually nest the interleavings to create more and more robust burst error correction methods. Audio CDs nest interleavings and error correction to create a method that can recover from some 3600 consecutive bits in error. This “delay” is acceptable because audio is strictly a one-way process from disc to your ear. The interleaving delay is absorbed long before the disc is burned, and the delay to de-interleave when you press “play” is acceptable because once the audio pipeline is filled, the decoding is a continuous process.
So a more complete look at the steps in an error-correcting process includes one or more interleaving stages, as shown in Figure 4.15.
Figure 4.15. Interleaving and error correction.
URL:
https://www.sciencedirect.com/science/article/pii/B9780128110270000047
Number systems and codes
B. HOLDSWORTH BSc (Eng), MSc, FIEE, R.C. WOODS MA, DPhil, in Digital Logic Design (Fourth Edition), 2002
1.20 The Hamming code
This code provides a minimum distance of three between codewords, a necessary condition that must be provided in order to achieve single-bit error detection and correction. For any value of r, where r represents the number of check bits, 2^r − 1 codeword bits can be formed consisting of r check bits and k message bits where k = 2^r − 1 − r. For r = 3, k ≤ 4 so that four message bits are the maximum number that can be checked for r = 3.
The bit positions in the codeword are numbered from 1 to 2^r − 1 and any position in the codeword whose number is a power of 2 contains a parity bit. For a 7-bit codeword the parity bits occupy positions 1, 2 and 4 so that the format of the transmitted codeword is:
bit position: 7 6 5 4 3 2 1
C = k4 k3 k2 r3 k1 r2 r1
The bit positions occupied by the parity bits, 4, 2 and 1, when converted to binary are 100, 010 and 001. Each of these conversions contains a single 1, and each parity bit is grouped with the message bits k4k3k2k1 whose position numbers contain a 1 in the same bit position. For example, r1 in bit position 001 is grouped with the message bits that occupy the bit positions 011(3), 101(5) and 111(7). It is then arranged that for a given combination of message bits the parity bit is allocated so that even parity is achieved.
The value of parity bit r1 is given by XORing the message bits in bit positions 7, 5 and 3. Hence:
r1=k4⊕k2⊕k1
The value of parity bit r2 is obtained by XORing the message bits in positions 7, 6 and 3 so that
r2=k4⊕k3⊕k1
Finally r3 is obtained by XORing the message bits in bit positions 7, 6 and 5
r3=k2⊕k3⊕k4
Consider, as an example of the use of the Hamming code, the message bits k4k3k2k1 = 1101. For this message the parity check bits to be transmitted with the message bits are:
r1s = 1 ⊕ 0 ⊕ 1 = 0
r2s = 1 ⊕ 1 ⊕ 1 = 1
r3s = 1 ⊕ 1 ⊕ 0 = 0
and the transmitted codeword is
k4k3k2r3k1r2r1=1100110
Assuming that the message bit k1 is in error when the codeword is received, then at the receiving end
k4k3k2r3k1r2r1=1100010
Re-computing the parity at the receiving end gives
r1r = 1 ⊕ 0 ⊕ 0 = 1
r2r = 1 ⊕ 1 ⊕ 0 = 0
r3r = 1 ⊕ 1 ⊕ 0 = 0
The position of the error can be obtained by XORing the transmitted and received parity bits:
r1s ⊕ r1r = 0 ⊕ 1 = 1
r2s ⊕ r2r = 1 ⊕ 0 = 1
r3s ⊕ r3r = 0 ⊕ 0 = 0
(syndrome indicating error)
The syndrome S = 011 and indicates there is an error in the third bit position.
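The worked example can be reproduced with a short sketch (an illustration only; the function names are mine, and the bit ordering follows the codeword format k4 k3 k2 r3 k1 r2 r1 used above).

```python
# (7,4) Hamming code in the book's layout, even parity, message 1101,
# single-bit error injected in bit position 3 (k1).

def encode(k4, k3, k2, k1):
    r1 = k4 ^ k2 ^ k1           # parity over positions 7, 5, 3
    r2 = k4 ^ k3 ^ k1           # parity over positions 7, 6, 3
    r3 = k4 ^ k3 ^ k2           # parity over positions 7, 6, 5
    return [k4, k3, k2, r3, k1, r2, r1]   # positions 7 6 5 4 3 2 1

def syndrome(codeword):
    k4, k3, k2, r3, k1, r2, r1 = codeword
    s1 = r1 ^ k4 ^ k2 ^ k1
    s2 = r2 ^ k4 ^ k3 ^ k1
    s3 = r3 ^ k4 ^ k3 ^ k2
    return (s3 << 2) | (s2 << 1) | s1     # value = erroneous bit position (0 = no error)

sent = encode(1, 1, 0, 1)                  # -> [1, 1, 0, 0, 1, 1, 0], i.e. 1100110
received = sent[:]
received[4] ^= 1                           # flip k1 (bit position 3)
print(syndrome(received))                  # 3, pointing at bit position 3
```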
URL:
https://www.sciencedirect.com/science/article/pii/B9780750645829500020
Compression
Stéphane Mallat, in A Wavelet Tour of Signal Processing (Third Edition), 2009
Noise Sensitivity
Huffman and arithmetic codes are more compact than a simple fixed-length code of size log2 K, but they are also more sensitive to errors. For a constant-length code, a single bit error modifies the value of only one symbol. In contrast, a single bit error in a variable-length code may modify the whole symbol sequence. In noisy transmissions where such errors might occur, it is necessary to use an error correction code that introduces a slight redundancy in order to suppress the transmission errors [18].
URL:
https://www.sciencedirect.com/science/article/pii/B9780123743701000148
Mass Storage
Thomas Sterling, … Maciej Brodowicz, in High Performance Computing, 2018
17.4.1.3 RAID 2: Bit-Level Striping With Hamming Code
RAID 2 attempted to reduce spatial overheads caused by data mirroring by selecting a more efficient data protection code. Hamming code uses p ≥ 2 code bits to protect each group of d = 2^p − p − 1 data bits against single bit errors, hence attaining (2^p − p − 1)/(2^p − 1) efficiency or code rate. A RAID 2 array consists of d data drives and p parity (protection bits are calculated as parity for selected bits in the entire bit-stripe) drives. The minimum configuration consists of two parity drives and one data drive, although it has poor storage efficiency of 1/3; the efficiency vastly improves for larger ensembles. The drives have synchronized spindles, ensuring lock-step update and retrieval of each bit in individual stripes (denoted as a and b in Fig. 17.16). This arrangement is able to recover from a single device failure. Since the Hamming code can pinpoint the position of the erroneous bit in each stripe, RAID 2 is capable of detecting silent drive malfunctions in which the affected device may appear to work but returns invalid data. This property may also be used to correct occasional data errors on the fly due to the nonzero probability of uncorrectable read errors returned by individual disks. Due to the implementation complexity requiring specialized hardware controllers, RAID 2 is no longer used in practice. Its performance characteristic strongly depends on the implementation details, and hence is not analyzed here.
Figure 17.16. RAID 2 data layout.
d = 2^p − p − 1, p ≥ 2
C2 = d · CD
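As a quick check on the efficiency relation above, the following sketch (an illustration only) tabulates d and the code rate for a few values of p.

```python
# RAID 2 storage efficiency: p parity drives protect d = 2**p - p - 1 data
# drives, giving a code rate of d / (d + p).

for p in range(2, 6):
    d = 2**p - p - 1
    print(f"p = {p}: d = {d:2d} data drives, code rate = {d / (d + p):.2f}")
# p = 2 reproduces the minimum configuration's 1/3 efficiency noted above.
```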
URL:
https://www.sciencedirect.com/science/article/pii/B9780124201583000174
Fault Tolerance in Computer Systems—From Circuits to Algorithms*
Shantanu Dutt, … Fran Hanchek, in The Electrical Engineering Handbook, 2005
Theorem 3
If either modulo or extended-precision integer arithmetic is used in a mantissa checksum test of the form of equation 8.6 shown again here:
f(Σ_{v∈s} mant(v)) =? Σ_{v∈s} mant(f(v)),
then any single bit error in each of the scalar components of this test will be detected even in the presence of overflow in modulo (or single-precision) integer arithmetic (Dutt and Assaad, 1996).
In equation 8.6, scalars ai and bi are compared, where a := (a1, …, an)^T and b := (b1, …, bn)^T are the LHS and RHS, respectively, of equation 8.6. The above result means that single bit errors in either ai or bi can be detected for each i even when single-precision integer arithmetic is used. The following result concerns the maximum number of arbitrarily distributed errors (i.e., not necessarily restricted to one error per scalar component of the check) that can be detected by the mantissa checksum test.
URL:
https://www.sciencedirect.com/science/article/pii/B9780121709600500347
The Communication View
Richard John Anthony, in Systems Programming, 2016
3.9.1 A Brief Introduction to Error Detection and Error Correction Codes
When designing an FEC code, a main consideration is the number of bit errors that the code will be able to correct. In other words, the number of bit transitions (0-to-1 or 1-to-0) that can occur within a given-length stream of bits transmitted and yet the original data value can still be determined.
Consider the 8-bit data value 11001100. A single bit error could give rise to, for example, 10001100, 11011100, or 11001110. Two bit errors could give rise to, for example, 00001100, 11101110, or 01001101.
The number of bits that are different in the data (from the original to the modified versions) is called the Hamming distance (d); the single-bit error examples shown above have a Hamming distance of 1 (i.e., d = 1), while the two-bit error examples have a Hamming distance of 2 (i.e., d = 2).
If all bit combinations in a particular code are possible correct values, then the value of d for the data code is 1 and there is no redundancy; that is, a single bit change shifts the code from one legal value to another and therefore cannot be detected as an error. Consider a 4-bit code that holds a Customer ID (CID) and that CIDs can range from 0 to 15 (decimal). The following binary values are all acceptable: 0000, 0001, 0010, 0011, …, 1111. If, during transmission, a single bit error occurs in any of these values, the resulting code (which is wrong) has the value of another acceptable code and is thus undetectable as an error. For example, 0010 can become 0011, which are both valid codes, and thus, the receiver of the value 0011 cannot tell if this is the value that was actually sent or if an error has occurred turning a different value into 0011.
To perform error detection or correction on an original data message, additional information must be added at the point of transmission. The additional information is a form of overhead (also referred to as “redundant bits”) since it must be transmitted as part of the new larger message but does not carry any application-level information.
The simplest way to detect a single bit error in the 4-bit code is to use parity checking, in which case one additional bit must be added (the parity bit). In this case, for every four data bits transmitted, a fifth parity bit must be transmitted, so the overhead is 20%. Alternatively, you can say that the efficiency is 80%: 80% of the bits transmitted carry application data, and the remaining 20% are redundant from the application viewpoint.
Consider “even parity” in which the number of “1” bits must be even in the 5-bit code that is transmitted. CID values 0000, 0011, and 1001 are examples where the number of “1” bits is already even; thus, a parity bit of “0” will be added, turning these values into 00000, 00110, and 10010, respectively. CID values 0001, 0010, and 1101 are examples where the number of “1” bits is odd; thus, a parity bit of “1” will be added, turning these values into 00011, 00101, and 11011, respectively.
If a single bit error occurs now, for example, 00110 becomes 00100 during transmission, the error will be detected by the receiver, because the parity check will fail (there are not an even number of “1s”). Notice that although the error is detected, it cannot be corrected because there are many possible valid codes that could have been translated into the received value 00100 by a single bit translation, including 00101, 10100, and 01100.
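A small sketch (an illustration only, using the CID values above; the helper names are mine) shows the even-parity encoding and its detect-but-not-correct behavior.

```python
# Even parity over a 4-bit customer ID: the fifth bit makes the count of 1s even.

def add_even_parity(cid4):
    parity = sum(cid4) & 1             # 1 if the count of 1s is odd
    return cid4 + [parity]

def parity_ok(code5):
    return sum(code5) % 2 == 0

sent = add_even_parity([0, 0, 1, 1])   # -> [0, 0, 1, 1, 0], i.e. 00110
received = sent[:]
received[3] ^= 1                       # single bit error: 00110 -> 00100
print(parity_ok(sent), parity_ok(received))   # True False: detected, but not correctable
```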
To achieve error correction, additional redundant bits must be added such that there is only one valid data code that could be translated into any particular received value, as long as the number of bit errors does not exceed the number supported by the error correction code. The theory aspect of error correcting codes is a complex subject, so the discussion will be limited to a simple example based on the triple modular redundancy technique to provide an introduction.
In a triple modular redundancy code, each bit is repeated three times, that is, transmitted as 3 bits, so each “0” becomes “000” and each “1” becomes “111.” There are only two valid code patterns per three-bit sequence that can be transmitted, and it is not possible for a single bit error to convert one valid code into another. A single bit error turns 000 into one of 001, 010, or 100, all of which are closer to 000 than 111, and thus, all are interpreted as a 0 and not a 1 data value. Similarly, a single bit error turns 111 into one of 110, 101, or 011, all of which are closer to 111 than 000, and thus, all are interpreted as a 1 data value. Thus, a single bit error can be automatically corrected, but at the cost of a significantly increased message size. For the triple modular redundancy technique, the overhead is 67%; in other words, the code efficiency is 33%: only 33% of the bits transmitted carry useful data, and the remaining 67% are redundant. Fortunately, more efficient error-correcting codes do exist.
If triple modular redundancy were applied to the earlier customer ID example, each 4-bit CID value would be transmitted as a 12-bit message. For example, 0001 would be transmitted as 000000000111, and 0010 would be transmitted as 000000111000.
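The sketch below (an illustration only; helper names are mine) shows the triple modular redundancy encoding of the CID 0010 and majority-vote decoding correcting a single bit error.

```python
# Triple modular redundancy: each bit is sent three times and each received
# triple is decoded by majority vote.

def tmr_encode(bits):
    return [b for bit in bits for b in (bit, bit, bit)]

def tmr_decode(stream):
    return [1 if sum(stream[i:i + 3]) >= 2 else 0 for i in range(0, len(stream), 3)]

sent = tmr_encode([0, 0, 1, 0])        # 0010 -> 000000111000, as in the text
received = sent[:]
received[7] ^= 1                       # one bit error inside the third triple
print(tmr_decode(received))            # [0, 0, 1, 0]: corrected automatically
```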
URL:
https://www.sciencedirect.com/science/article/pii/B9780128007297000030
A Hamming code is a particular kind of error-correcting code (ECC) that allows single-bit errors in code words to be corrected. Such codes are used in data transmission or data storage systems in which it is not feasible to use retry mechanisms to recover the data when errors are detected. This type of error recovery is also known as forward error correction (FEC).
Constructing a Hamming code to protect, say, a 4-bit data word
Hamming codes are relatively easy to construct because they’re based on parity logic. Each check bit is a parity bit for a particular subset of the data bits, and they’re arranged so that the pattern of parity errors directly indicates the position of the bit error.
It takes three check bits to protect four data bits (the reason for this will become apparent shortly), giving a total of 7 bits in the encoded word. If you number the bit positions of an 8-bit word in binary, you see that there is one position that has no "1"s in its column, three positions that have a single "1" each, and four positions that have two or more "1"s.
If the four data bits are called A, B, C and D, and our three check bits are X, Y and Z, we place them in the columns such that the check bits are in the columns with one "1" and the data bits are in the columns with more than one "1". The bit in position 0 is not used.
Bit position: 7 6 5 4 3 2 1 0
in binary: 1 1 1 1 0 0 0 0
1 1 0 0 1 1 0 0
1 0 1 0 1 0 1 0
Bit: A B C X D Y Z -
The check bit X is set or cleared so that all of the bits with a "1" in the top row — A, B, C and X — have even parity. Similarly, the check bit Y is the parity bit for all of the bits with a "1" in the second row (A, B and D), and the check bit Z is the parity bit for all of the bits with a "1" in the third row (A, C and D).
Now all seven bits — the codeword — are transmitted (or stored), usually reordered so that the data bits appear in their original sequence: A B C D X Y Z. When they’re received (or retrieved) later, the data bits are put through the same encoding process as before, producing three new check bits X’, Y’ and Z’. If the new check bits are XOR’d with the received check bits, an interesting thing occurs. If there’s no error in the received bits, the result of the XOR is all zeros. But if there’s a single bit error in any of the seven received bits, the result of the XOR is a nonzero three-bit number called the "syndrome" that directly indicates the position of the bit error as defined in the table above. If the bit in this position is flipped, then the original 7-bit codeword is perfectly reconstructed.
A couple of examples will illustrate this. Let’s assume that the data bits are all zero, which also means that all of the check bits are zero as well. If bit "B" is set in the received word, then the recomputed check bits X’Y’Z’ (and the syndrome) will be 110, which is the bit position for B. If bit "Y" is set in the received word, then the recomputed check bits will be "000", and the syndrome will be "010", which is the bit position for Y.
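This layout is easy to reproduce in a short sketch (an illustration only; the function names are mine). Flipping bit B in an all-zero codeword yields syndrome 6, matching the first example above.

```python
# Data bits A B C D in positions 7 6 5 3, check bits X Y Z in positions 4 2 1
# (position 0 unused), even parity throughout.

def encode(a, b, c, d):
    x = a ^ b ^ c                      # parity of the bits with a "1" in the top row
    y = a ^ b ^ d                      # second row
    z = a ^ c ^ d                      # third row
    return {7: a, 6: b, 5: c, 4: x, 3: d, 2: y, 1: z}

def syndrome(word):
    x2 = word[7] ^ word[6] ^ word[5]   # recomputed check bits
    y2 = word[7] ^ word[6] ^ word[3]
    z2 = word[7] ^ word[5] ^ word[3]
    # XOR the recomputed check bits with the received ones; the result, read as
    # a binary number, is the position of the flipped bit (0 means no error).
    return ((x2 ^ word[4]) << 2) | ((y2 ^ word[2]) << 1) | (z2 ^ word[1])

word = encode(0, 0, 0, 0)
word[6] ^= 1                           # flip bit "B" (position 6)
print(syndrome(word))                  # 6
```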
Hamming codes get more efficient with larger codewords. Basically, you need enough check bits to enumerate all of the data bits plus the check bits plus one. Therefore, four check bits can protect up to 11 data bits, five check bits can protect up to 26 data bits, and so on. Eventually you get to the point where if you have 8 bytes of data (64 bits) with a parity bit on each byte, you have enough parity bits to do ECC on the 64 bits of data instead.
Different (but equivalent) Hamming codes
Given a specific number N of check bits, there are 2^N equivalent Hamming codes that can be constructed by arbitrarily choosing each check bit to have either "even" or "odd" parity within its group of data bits. As long as the encoder and the decoder use the same definitions for the check bits, all of the properties of the Hamming code are preserved.
Sometimes it’s useful to define the check bits so that an encoded word of all-zeros or all-ones is always detected as an error.
What happens when multiple bits get flipped in a Hamming codeword
Multiple bit errors in a Hamming code cause trouble. Two bit errors will always be detected as an error, but the wrong bit will get flipped by the correction logic, resulting in gibberish. If there are more than two bits in error, the received codeword may appear to be a valid one (but different from the original), which means that the error may or may not be detected.
In any case, the error-correcting logic can’t tell the difference between single bit errors and multiple bit errors, and so the corrected output can’t be relied on.
Extending a Hamming code to detect double-bit errors
Any single-error correcting Hamming code can be extended to reliably detect double bit errors by adding one more parity bit over the entire encoded word. This type of code is called a SECDED (single-error correcting, double-error detecting) code. It can always distinguish a double bit error from a single bit error, and it detects more types of multiple bit errors than a bare Hamming code does.
It works like this: All valid code words are (a minimum of) Hamming distance 3 apart. The "Hamming distance" between two words is defined as the number of bits in corresponding positions that are different. Any single-bit error is distance one from a valid word, and the correction algorithm converts the received word to the nearest valid one.
If a double error occurs, the parity of the word is not affected, but the correction algorithm still corrects the received word, which is distance two from the original valid word, but distance one from some other valid (but wrong) word. It does this by flipping one bit, which may or may not be one of the erroneous bits. Now the word has either one or three bits flipped, and the original double error is now detected by the parity checker.
Note that this works even when the parity bit itself is involved in a single-bit or double-bit error. It isn’t hard to work out all the combinations.
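The resulting decode decision can be sketched compactly (an illustration only; the syndrome itself would be computed as in the earlier example, and the function name is mine).

```python
# SECDED decode decision: the Hamming syndrome plus the overall parity check
# together distinguish no error, a correctable single-bit error, and a
# detected-but-uncorrectable double-bit error.

def classify(syndrome_value, overall_parity_ok):
    if syndrome_value == 0 and overall_parity_ok:
        return "no error"
    if not overall_parity_ok:
        return "single-bit error (correctable)"    # includes an error in the extra parity bit
    return "double-bit error (detected, uncorrectable)"

print(classify(0, True))       # no error
print(classify(6, False))      # single-bit error (correctable)
print(classify(5, True))       # double-bit error (detected, uncorrectable)
```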
Recently I’ve had some issues with my hard drive, where a file extracted from a compressed archive would get corrupted when it was extracted. Extracting the same file from the same archive once more would give me a healthy file.
Today I diffed the broken and healthy file and found that there was only a single bit difference between the files.
I’ve tried to run the Windows check disk utility to see if there was any bad sectors on the disk, but there wasn’t.
What could be the cause of these issues?
asked Dec 30, 2011 at 13:01
I’ve had similar errors with bad RAM that only showed an error in memtest86+ after about 24 hours of testing.
answered Dec 30, 2011 at 13:28
mavhc
Individual blocks of data on the disk have ECC, so at the point of reading from the disk, you have a very high probability of detecting an error.
For better or worse, the entire rest of the process of copying a file is unchecked; you’re completely dependent on the 99.99…% reliability of lots of subsequent components, and if one has a soft failure you can be screwed without knowing it for a long time.
Back in the stone age of computers (ca. 1978) we had one of the first semiconductor memories, made by Ampex for PDP-10s, which it turned out would very very occasionally miss a write and send 36 bits to the wrong place. Every file that had been written in the last few months was "possibly corrupted", as were all of our backup tapes. What a mess.
My best advice to you if you have a documented case of a single bit error like this is to take all the hardware involved and send it to a landfill.
answered Dec 30, 2011 at 21:23
ddyer