ECC DIMMs typically have nine memory chips on each side, one more than usually found on non-ECC DIMMs (some modules may have 5 or 18).[1]
Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code[a] (ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.
Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.
Description
Error correction codes protect against undetected data corruption and are used in computers where such corruption is unacceptable, examples being scientific and financial computing applications, or in database and file servers. ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems.
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.[2] Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10-12 km (the cruising altitude of commercial airplanes).[3] As a result, systems operating at high altitudes require special provisions for reliability.
As an example, the spacecraft Cassini–Huygens, launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Due to built-in EDAC functionality, the spacecraft’s engineering telemetry reported the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four on that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9.[4]
There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell’s state.[3] On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies[5] show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.
Research
Work published between 2007 and 2009 showed widely varying error rates, with over seven orders of magnitude difference, ranging from 10^−10 error/bit·h (roughly one bit error per hour per gigabyte of memory) to 10^−17 error/bit·h (roughly one bit error per millennium per gigabyte of memory).[5][6][7] A large-scale study based on Google’s very large number of servers was presented at the SIGMETRICS/Performance ’09 conference.[6] The actual error rate found was several orders of magnitude higher than in previous small-scale or laboratory studies: between 25,000 (2.5 × 10^−11 error/bit·h) and 70,000 (7.0 × 10^−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most-common hardware causes of machine crashes.[6] Memory errors can cause security vulnerabilities.[6] A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors.[8]
Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of high cell density in modern memory, altering the content of nearby memory rows that actually were not addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits.[9][10]
An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved. As a result, the "8" (0011 1000 binary) has silently become a "9" (0011 1001).
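The scenario above can be sketched in a few lines of Python (illustrative only; the helper simply models the stuck bit):

```python
# The stuck-at-1 fault from the example: the lowest bit of the stored
# byte always reads back as 1, silently turning ASCII "8" into "9".
def read_with_stuck_bit(byte):
    """Model a memory cell whose lowest bit is stuck at 1."""
    return byte | 0b0000_0001

original = ord("8")               # 56, binary 0011 1000
corrupted = read_with_stuck_bit(original)
print(chr(corrupted))             # "9" (57, binary 0011 1001)
```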
Solutions
Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most-common error correcting code, a single-error correction and double-error detection (SECDED) Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
Implementations
In 1982, this 512 KB memory board from Cromemco used 22 bits of storage per 16-bit word to allow for single-bit error correction
Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600.[11] Later, he included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers". The original IBM PC and all PCs until the early 1990s used parity checking.[12] Later ones mostly did not.
An ECC-capable memory controller can generally[a] detect and correct errors of a single bit per word[b] (the unit of bus transfer), and detect (but not correct) errors of two bits per word. The BIOS in some computers, when matched with operating systems such as some versions of Linux, BSD, and Windows (Windows 2000 and later[13]), allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
Some DRAM chips include "internal" on-chip error correction circuits, which allow systems with non-ECC memory controllers to still gain most of the benefits of ECC memory.[14][15] In some systems, a similar effect may be achieved by using EOS memory modules.
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, making two simultaneous errors improbable. This used to be the case when memory chips were one bit wide, which was typical in the first half of the 1980s; later developments moved many bits into the same chip. This weakness is addressed by various technologies, including IBM’s Chipkill, Sun Microsystems’ Extended ECC, Hewlett Packard’s Chipspare, and Intel’s Single Device Data Correction (SDDC).
DRAM memory may provide increased protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space applications, due to increased radiation. Some systems also "scrub" the memory by periodically reading all addresses and writing back corrected versions, if necessary, to remove soft errors.
Interleaving distributes the effect of a single cosmic ray, which may upset multiple physically neighboring bits, across multiple words by assigning neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained.[16]
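The effect of interleaving can be sketched as follows (an illustrative layout, not any particular controller's mapping): physically adjacent cells are assigned to different words, so a burst upset leaves at most one flipped bit in each word.

```python
# Illustrative sketch: bits are interleaved so that physically adjacent
# cells belong to different words. A burst that upsets several adjacent
# cells then flips at most one bit per word, which a
# single-error-correcting code can repair.
WORDS = 4

def cell_to_word_bit(cell):
    """Map a physical cell index to a (word, bit-within-word) pair."""
    return cell % WORDS, cell // WORDS

burst = [10, 11, 12, 13]          # 4 physically adjacent upset cells
hits_per_word = [0] * WORDS
for cell in burst:
    word, _ = cell_to_word_bit(cell)
    hits_per_word[word] += 1

print(hits_per_word)              # [1, 1, 1, 1]: one flip per word
```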
Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy (TMR). The latter is preferred because its hardware is faster than that of a Hamming error-correction scheme.[16] Space satellite systems often use TMR,[17][18][19] although satellite RAM usually uses Hamming error correction.[20]
Many early implementations of ECC memory mask correctable errors, acting "as if" the error never occurred, and only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors (UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the likelihood of uncorrectable error events.[21]
Many ECC memory systems use an «external» EDAC circuit between the CPU and the memory. A few systems with ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to correct certain errors that the internal EDAC system is unable to correct.[14] Modern desktop and server CPUs integrate the EDAC circuit into the CPU,[22] even before the shift toward CPU-integrated memory controllers, which are related to the NUMA architecture. CPU integration enables a zero-penalty EDAC system during error-free operation.
As of 2009, the most-common error-correction codes use Hamming or Hsiao codes that provide single-bit error correction and double-bit error detection (SEC-DED). Other error-correction codes have been proposed for protecting memory – double-bit error correcting and triple-bit error detecting (DEC-TED) codes, single-nibble error correcting and double-nibble error detecting (SNC-DND) codes, Reed–Solomon error correction codes, etc. However, in practice, multi-bit correction is usually implemented by interleaving multiple SEC-DED codes.[23][24]
Early research attempted to minimize the area and delay overheads of ECC circuits. Hamming first demonstrated that SEC-DED codes were possible with one particular check matrix. Hsiao showed that an alternative matrix with odd weight columns provides SEC-DED capability with less hardware area and shorter delay than traditional Hamming SEC-DED codes. More recent research also attempts to minimize power in addition to minimizing area and delay.[25][26][27]
Cache
Many CPUs use error-correction codes in the on-chip cache, including the Intel Itanium, Xeon, Core and Pentium (since P6 microarchitecture)[28][29] processors, the AMD Athlon, Opteron, all Zen-[30] and Zen+-based[31] processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264.[23][32]
As of 2006, EDC/ECC and ECC/ECC are the two most-common cache error-protection techniques used in commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If an error is detected, data is recovered from ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-protected level 1 cache and an ECC-protected level 2 cache.[33] CPUs that use the EDC/ECC technique always write-through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data cache, a copy of that data can be recovered from the level 2 cache.
Registered memory
Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is usually neither, for economy. However, unbuffered (not-registered) ECC memory is available,[34] and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC.[35] Registered memory does not work reliably in motherboards without buffering circuitry, and vice versa.
Advantages and disadvantages
Ultimately, there is a trade-off between protection against unusual loss of data and a higher cost.
ECC memory usually involves a higher price when compared to non-ECC memory, due to additional hardware required for producing ECC memory modules, and due to lower production volumes of ECC memory and associated system hardware. Motherboards, chipsets and processors that support ECC may also be more expensive.
ECC support varies among motherboard manufacturers, so ECC memory may simply not be recognized by an ECC-incompatible motherboard. Most motherboards and processors for less critical applications are not designed to support ECC. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but will also work with non-ECC memory; system firmware enables ECC functionality if ECC memory is installed.
ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and implementation, due to the additional time needed for ECC memory controllers to perform error checking.[36] However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected.[22][37][38]
ECC-supporting memory may consume additional power due to the error-correcting circuitry.
Notes
- ^ a b Most ECC memory uses a SECDED code.
- ^ While a 72-bit word with 64 data bits and 8 check bits is common, ECC is also used with smaller and larger sizes.
References
- ^ Werner Fischer. "RAM Revealed". admin-magazine.com. Retrieved October 20, 2014.
- ^ Single Event Upset at Ground Level, Eugene Normand, Member, IEEE, Boeing Defense & Space Group, Seattle, WA 98124-2499
- ^ a b "A Survey of Techniques for Modeling and Improving Reliability of Computing Systems", IEEE TPDS, 2015
- ^ Gary M. Swift and Steven M. Guertin. "In-Flight Observations of Multiple-Bit Upset in DRAMs". Jet Propulsion Laboratory.
- ^ a b Borucki, "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level", 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487
- ^ a b c d Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). DRAM Errors in the Wild: A Large-Scale Field Study (PDF). SIGMETRICS/Performance. ACM. ISBN 978-1-60558-511-6.
- Robin Harris (October 4, 2009). "DRAM error rates: Nightmare on DIMM street". ZDNet.
- ^ "A Memory Soft Error Measurement on Production Systems". Archived from the original on 2017-02-14. Retrieved 2011-06-27.
- ^ Li, Huang; Shen, Chu (2010). "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility" (PDF). Usenix Annual Tech Conference 2010.
- ^ Yoongu Kim; Ross Daly; Jeremie Kim; Chris Fallin; Ji Hye Lee; Donghyuk Lee; Chris Wilkerson; Konrad Lai; Onur Mutlu (2014-06-24). "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors" (PDF). ece.cmu.edu. IEEE. Retrieved 2015-03-10.
- ^ Dan Goodin (2015-03-10). "Cutting-edge hack gives super user status by exploiting DRAM weakness". Ars Technica. Retrieved 2015-03-10.
- ^ "CDC 6600". Microsoft Research. Retrieved 2011-11-23.
- ^ "Parity Checking". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
- ^ DOMARS. "mca — Windows drivers". docs.microsoft.com. Retrieved 2021-03-27.
- ^ a b A. H. Johnston. "Space Radiation Effects in Advanced Flash Memories" Archived 2016-03-04 at the Wayback Machine. NASA Electronic Parts and Packaging Program (NEPP). 2001.
- ^ "ECC DRAM – Intelligent Memory". intelligentmemory.com. Archived from the original on 2019-02-12. Retrieved 2021-06-12.
- ^ a b "Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua University, Beijing. Archived from the original on 2011-10-02. Retrieved 2009-02-16.
- ^ "Actel engineers use triple-module redundancy in new rad-hard FPGA". Military & Aerospace Electronics. Archived from the original on 2012-07-14. Retrieved 2009-02-16.
- ^ "SEU Hardening of Field Programmable Gate Arrays (FPGAs) For Space Applications and Device Characterization". Klabs.org. 2010-02-03. Archived from the original on 2011-11-25. Retrieved 2011-11-23.
- ^ "FPGAs in Space". Techfocusmedia.net. Retrieved 2011-11-23.[permanent dead link]
- ^ "Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment". Radhome.gsfc.nasa.gov. Archived from the original on 2001-03-04. Retrieved 2011-11-23.
- ^ Doug Thompson, Mauro Carvalho Chehab. "EDAC — Error Detection And Correction" Archived 2009-09-05 at the Wayback Machine. 2005–2009. "The 'edac' kernel module goal is to detect and report errors that occur within the computer system running under linux."
- ^ a b "AMD-762™ System Controller Software/BIOS Design Guide, p. 179" (PDF).
- ^ a b Doe Hyun Yoon; Mattan Erez. "Memory Mapped ECC: Low-Cost Error Protection for Last Level Caches". 2009. p. 3.
- ^ Daniele Rossi; Nicola Timoncini; Michael Spica; Cecilia Metra. "Error Correcting Code Analysis for Cache Memory High Reliability and Performance" Archived 2015-02-03 at the Wayback Machine.
- ^ Shalini Ghosh; Sugato Basu; Nur A. Touba. "Selecting Error Correcting Codes to Minimize Power in Memory Checker Circuits" Archived 2015-02-03 at the Wayback Machine. p. 2 and p. 4.
- ^ Chris Wilkerson; Alaa R. Alameldeen; Zeshan Chishti; Wei Wu; Dinesh Somasekhar; Shih-lien Lu. "Reducing cache power with low-cost, multi-bit error-correcting codes". doi:10.1145/1816038.1815973.
- ^ M. Y. Hsiao. "A Class of Optimal Minimum Odd-weight-column SEC-DED Codes". 1970.
- ^ Intel Corporation. "Intel Xeon Processor E7 Family: Reliability, Availability, and Serviceability". 2011. p. 12.
- ^ "Bios and Cache — Custom Build Computers". www.custom-build-computers.com. Retrieved 2021-03-27.
- ^ "AMD Zen microarchitecture — Memory Hierarchy". WikiChip. Retrieved 15 October 2018.
- ^ "AMD Zen+ microarchitecture — Memory Hierarchy". WikiChip. Retrieved 15 October 2018.
- ^ Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe. "Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding". 2007. p. 2.
- ^ Nathan N. Sadler and Daniel J. Sorin. "Choosing an Error Protection Scheme for a Microprocessor’s L1 Data Cache". 2006. p. 1.
- ^ "Typical unbuffered ECC RAM module: Crucial CT25672BA1067".
- ^ Specification of desktop motherboard that supports both ECC and non-ECC unbuffered RAM with compatible CPUs
- ^ "Discussion of ECC on pcguide". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
- ^ Benchmark of AMD-762/Athlon platform with and without ECC Archived 2013-06-15 at the Wayback Machine
- ^ "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All". Systems and Network Security Group at VU Amsterdam. Retrieved 2018-11-22.
External links
- SoftECC: A System for Software Memory Integrity Checking
- A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
- Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
- Single-Bit Errors: A Memory Module Supplier’s perspective on cause, impact and detection
- Intel Xeon Processor E3 — 1200 Product Family Memory Configuration Guide
- Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel’s "Bad Policies" Over ECC
ECC refers to error correction codes. An error happens whenever a bit flips and information is read incorrectly. The flip can affect one or two bits, causing single-bit errors or double-bit errors. Bit flips can occur because of hard errors or soft errors.
Hard errors are due to inherent defects in circuits from manufacturing, temperature variance, and general wear and tear. These issues often cause stuck-at faults, where a bit is permanently stuck at 0 or 1. Soft errors occur when radiation, such as neutrons from cosmic rays or alpha particles, strikes a cell and flips a bit. Temporary bit flips can also be caused by a noisy environment with high electronic interference, for example if a circuit happens to be close to a power supply.
There are three key parts to ECC:
ECC Generation:
ECC generation is the process of applying an algorithm to calculate extra bits that are stored with the data. The algorithm is XOR logic: each ECC bit is derived from the XOR of several bits, possibly including other ECC bits. These bits are stored along with the data in memory or an array and are retrieved later for detection and correction. The number of ECC bits depends on the size of the data and can be calculated using the formulas below:
SECDED: for k data bits, the number of check bits is r + 1, where r satisfies 2^r ≥ k + r + 1 (r bits for single-bit correction plus one extra bit for double-bit detection).

DECTED: requires further check bits beyond SECDED for the same data size.

For example:

For 8 bits of data with single-bit correction and double-bit detection (SECDED), we need 5 ECC bits: r = 4 satisfies 2^4 ≥ 8 + 4 + 1, plus 1 extra bit for double-bit detection.
ECC Detection:
Detection is the method of determining whether there was an error. At a minimum, we need to know whether there was a bit flip (i.e. a polarity change), and in some cases also how many bits were flipped. These are the two pieces of information that allow us to decide whether the error can be corrected.
For detection, the ECC bits are regenerated with the same XOR formulas that were used in generation, and these bits are compared against the original ECC bits retrieved from the memory or array. If the XOR of the original and regenerated ECC bits is not 0, this result, called the error syndrome, indicates that an error has been detected.
To determine how many bits were in error, the error syndrome is used. The syndromes are a list of codes that are stored and used as a reference. Whenever an error is detected, the XOR result is compared against these stored codes. If there is a match, the error can be corrected; otherwise, the error cannot be corrected even though it was detected.
ECC Correction:
Correction is the process of restoring the data to its original state, and it relies on the error syndrome. The syndrome is unique per bit position: if the XOR result matches a bit's syndrome, that particular bit is in error. For example, for SECDED protection on 8-bit data with 5 bits of ECC, there are 13 error syndrome codes for detecting a single-bit error at each of the 13 bit positions. If the syndrome matches, that bit's polarity is flipped, i.e. from 0 to 1 or from 1 to 0, and the original value is restored.
It is also important to note that if the XOR result is not 0 (i.e. an error was detected) but no matching error syndrome is found, then the error cannot be corrected. This typically happens when there are more bit flips than the ECC bits can correct, for example 2 bit flips under SECDED protection.
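The three parts above (generation, detection, correction) can be sketched end to end in Python. This is a hedged, illustrative construction: an extended Hamming code for 8 data bits with 4 position-parity check bits plus 1 overall parity bit (5 ECC bits total). Real memory controllers use equivalent but hardware-optimised check matrices, and all names here are made up for the example.

```python
# SECDED sketch: data in the non-power-of-two Hamming positions,
# check bits at positions 1, 2, 4, 8, plus an overall parity bit.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]

def encode(data):
    """Generation: place data bits, then derive each check bit by XOR."""
    word = [0] * 13                       # word[0] is the overall parity bit
    for i, p in enumerate(DATA_POS):
        word[p] = (data >> i) & 1
    for c in (1, 2, 4, 8):                # check bit c covers positions with bit c set
        for p in range(1, 13):
            if p != c and (p & c):
                word[c] ^= word[p]
    word[0] = sum(word[1:]) % 2
    return word

def decode(word):
    """Detection and correction via the syndrome (XOR of stored and
    regenerated check bits) plus the overall parity."""
    syndrome = 0
    for c in (1, 2, 4, 8):
        s = 0
        for p in range(1, 13):
            if p & c:
                s ^= word[p]
        if s:
            syndrome |= c
    parity_bad = (sum(word[1:]) % 2) != word[0]
    w = list(word)
    if syndrome and parity_bad:           # single-bit error at position `syndrome`
        w[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                        # two flips: detectable, not correctable
        return "uncorrectable", None
    else:                                 # clean word, or a flipped parity bit
        status = "corrected" if parity_bad else "ok"
    data = sum(w[p] << i for i, p in enumerate(DATA_POS))
    return status, data

word = encode(0b1011_0010)
word[6] ^= 1                              # inject a single-bit error
print(decode(word))                       # ('corrected', 178)
```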
Error correcting codes (ECCs) are used in computer and communication systems to
improve resiliency to bit flips caused by permanent hardware faults or
transient conditions, such as neutron particles from cosmic rays, known
generally as soft errors. This note describes the principles of Hamming codes that underpin ECC schemes, how ECC codes are constructed, focusing on single-error correction and double-error detection, and how they are implemented.
ECCs work by adding additional redundant bits to be stored or transported with
data. The bits are encoded as a function of the data in such a way that it is
possible to detect erroneous bit flips and to correct them. The ratio of the
number of data bits to the total number of bits encoded is called the code
rate, with a rate of 1 being an impossible encoding with no overhead.
Simple ECCs
Parity coding adds a single bit that indicates whether the number of set
bits in the data is odd or even. When the data and parity bit is accessed or
received, the parity can be recomputed and compared. This is sufficient to
detect any odd number of bit flips but not to correct them. For applications
where the error rate is low, so that only single bit flips are likely and
double bit flips are rare enough to be ignored, parity error detection is
sufficient and desirable due to its low overhead (just a single bit) and
simple implementation.
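A minimal even-parity sketch in Python (illustrative only):

```python
# Even parity: the extra bit makes the total number of set bits even,
# so any odd number of flips is detectable (but not locatable, hence
# not correctable).
def parity(bits):
    return sum(bits) % 2

data = [1, 0, 1, 1, 0, 1, 0, 0]
stored = data + [parity(data)]    # write: append the parity bit

stored[2] ^= 1                    # a single bit flips in memory
error_detected = parity(stored[:-1]) != stored[-1]
print(error_detected)             # True: the flip is detected
```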
Repetition coding simply repeats each data bit a fixed number of times. When
the encoded data is received, if the repeated copies of a bit are not identical,
an error has occurred. With a repetition of two, single-bit errors can be
detected but not corrected. With a repetition of three, single bit flips can be
corrected by determining each data bit as the majority value in each triple,
but double bit flips are undetectable and will cause an erroneous correction.
Repetition codes are simple to implement but have a high overhead.
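Triple repetition with majority-vote decoding can be sketched as follows (illustrative only):

```python
# Triple-repetition code: each data bit is stored three times and
# decoded by majority vote, which corrects any single flip per triple
# at the cost of a code rate of only 1/3.
def encode(bits):
    return [b for b in bits for _ in range(3)]

def decode(coded):
    return [int(sum(coded[i:i + 3]) >= 2) for i in range(0, len(coded), 3)]

coded = encode([1, 0, 1])
coded[4] ^= 1                     # flip one copy of the middle bit
print(decode(coded))              # [1, 0, 1]: majority vote restores it
```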
Hamming codes
Hamming codes are an efficient family of codes using additional redundant bits to
detect up to two-bit errors and correct single-bit errors (technically, they are
linear error-correcting codes).
In them, check bits are added to data bits to form a codeword, and the
codeword is valid only when the check bits have been generated from the data
bits, according to the Hamming code. The check bits are chosen so that there is
a fixed Hamming distance between any two valid codewords (the number of
positions in which bits differ).
When valid codewords have a Hamming distance of two, any single bit flip will invalidate the word and allow the error to be detected. For example, the valid codewords `00` and `11` are separated for single bit flips by the invalid codewords `01` and `10`. If either of the invalid words is obtained, an error has occurred, but neither can be unambiguously associated with a valid codeword. Two bit flips are undetectable since they always map to a valid codeword. Note that parity encoding is an example of a distance-two Hamming code.
```
00 < Valid codeword
10 < Invalid codeword (obtained by exactly 1 bit flip)
11 < Valid codeword
```
With Hamming distance three, any single bit flip in a valid codeword makes an
invalid one, and the invalid codeword is Hamming distance one from exactly one
valid codeword. Using this, the valid codeword can be restored, enabling single
error correction. Any two bit flips map to an invalid codeword, which would
cause correction to the wrong valid codeword.
```
000 < Valid codeword
001
011
111 < Valid codeword
```
With Hamming distance four, two bit flips moves any valid codeword Hamming
distance two from exactly two valid codewords, allowing detection of two flips
but not correction. Single bit flips can be corrected as they were for distance
three. Distance-four codes are widely used in computing, where it is often the case that single errors are frequent, double errors are rare, and triple errors occur so rarely they can be ignored. These codes are referred to as 'SECDED ECC' (single error correction, double error detection).
```
0000 < Valid codeword
0001
0011 < Two bit flips from either codeword
0111
1111 < Valid codeword
```
Double errors can be corrected with a distance-five code; detecting triple
errors as well requires a distance-six code. In general, if a Hamming code can detect $d$
errors, it must have a minimum distance of $d+1$ so there is no way $d$ errors
can change one valid codeword into another one. If a code can correct $d$
errors, it must have a minimum distance of $2d+1$ so that the originating code
is always the closest one. The following table summarises Hamming codes.
Distance | Max bits corrected | Max bits detected | Description |
---|---|---|---|
2 | 0 | 1 | Single error detection (e.g. parity code) |
3 | 1 | 1 | Single error correction (e.g. triple repetition code) |
4 | 1 | 2 | Single error correction, double error detection (a 'SECDED' code) |
5 | 2 | 2 | Double error correction |
6 | 2 | 3 | Double error correction, triple error detection |
Creating a Hamming code
A codeword includes the data bits and checkbits. Each check bit corresponds to
a subset of the data bits and it is set when the parity of those data bits is
odd. To obtain a code with a particular Hamming distance, the number of check
bits and their mapping to data bits must be chosen carefully.
To build a single-error correcting (SEC) code that requires Hamming distance three between valid codewords, it is necessary that:
- The mapping of each data bit to check bits is unique.
- Each data bit maps to at least two check bits.
To see why this works, consider two distinct codewords that necessarily
must have different data bits. If the data bits differ by:
- 1 bit: at least two check bits are flipped, giving a total of three different bits.
- 2 bits: these cause at least one flip in the check bits, since any two data bits cannot share the same check-bit mapping (i.e. the XOR of the two check-bit patterns is nonzero). This also gives a total of three different bits as required.
- 3 bits: this is already sufficient to give a Hamming distance of three.
To build a SECDED code that requires a Hamming distance of four between valid codewords, it is necessary that:
- The mapping of each data bit to check bits is unique.
- Each data bit maps to at least three check bits.
- Each check-bit pattern has an odd number of bits set.
Following a similar argument, consider two distinct codewords, with data differing by:

- 1 bit: flips three check bits, giving a total of four different bits.
- 2 bits: flip check bits in two patterns, and since any two distinct odd-weight patterns must have at least two non-overlapping bits, the result is at least two flipped check bits, giving a total of four different bits. For example:

```
Check bits: 0 1 2 3
data[a]     x x x
data[b]       x x x
-------------------
Flips       x     x

Check bits: 0 1 2 3 4
data[a]     x x x
data[b]     x x x x x
---------------------
Flips             x x
```
- 3 bits: flip check bits in three patterns, and this time it is possible to overlap odd-weight patterns in such a way that a minimum of one check bit is flipped. For example:

```
Check bits: 0 1 2 3 4
data[a]     x x x
data[b]         x x x
data[c]     x x     x
---------------------
Flips             x

Check bits: 0 1 2 3 4
data[a]     x x x x x
data[b]     x x x
data[c]         x x x
---------------------
Flips           x
```

- 4 bits: this is already sufficient to provide a Hamming distance of four.
An example SEC code for eight data bits with four parity bits:
```
Check bits: 0 1 2 3
data[0]     x x x
data[1]       x x x
data[2]     x   x x
data[3]     x x   x
data[4]     x x
data[5]       x x
data[6]         x x
data[7]     x     x
```
An example SECDED code for eight data bits with five parity bits:
Check bits:  0 1 2 3 4
data[0]      x x x
data[1]      x x   x
data[2]      x   x x
data[3]        x x x
data[4]      x x     x
data[5]      x   x   x
data[6]        x x   x
data[7]      x     x x
Note that the mappings of data bits to check bits can be chosen flexibly, provided they maintain the rules that set the Hamming distance. This flexibility is useful when implementing ECC to reduce the cost of calculating the check bits. In contrast, many descriptions of ECC that I have found in textbooks and on Wikipedia describe a specific encoding that does not acknowledge this freedom. The encoding they describe allows the syndrome to be interpreted directly as the position of a single-bit error, by having the check bit at position $2^i$ cover every codeword position whose index has bit $i$ set. Additionally, they specify that parity bits are positioned in the codeword at power-of-two positions, for no apparent benefit.
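For reference, a quick Python illustration of that textbook layout (an aside, not part of the scheme used in this article):

```python
# In the textbook layout, check bit i sits at (1-based) codeword position 2**i
# and covers every position whose index has bit i set, so the syndrome reads
# out directly as the position of a single-bit error.
def covered(check_i, n_positions):
    return [p for p in range(1, n_positions + 1) if p & (1 << check_i)]

print(covered(0, 12))  # [1, 3, 5, 7, 9, 11]
print(covered(1, 12))  # [2, 3, 6, 7, 10, 11]
print(covered(2, 12))  # [4, 5, 6, 7, 12]
```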
Implementing ECC
Given data bits, check bits, and a mapping of data bits to check bits, ECC
encoding works by calculating the check bits from the data bits, then combining
data bits and check bits to form the codeword. Decoding works by taking the
data bits from a codeword, recalculating the check bits, then calculating the
bitwise XOR between the original check bits and the recalculated ones. This
value is called the syndrome. By inspecting the number of bits set in the
syndrome, it is possible to determine whether there has been an error,
whether it is correctable, and how to correct it.
Using the SEC check-bit encoding above, creating a codeword from data[7:0], the check bits are calculated as follows (using Verilog syntax):

assign check_word[0] = data[0] ^ data[2] ^ data[3] ^ data[4] ^ data[7];
assign check_word[1] = data[0] ^ data[1] ^ data[3] ^ data[4] ^ data[5];
assign check_word[2] = data[0] ^ data[1] ^ data[2] ^ data[5] ^ data[6];
assign check_word[3] = data[1] ^ data[2] ^ data[3] ^ data[6] ^ data[7];
And the codeword formed by concatenating the check bits and data:
assign codeword = {check_word[3:0], data[7:0]};
Decoding a codeword splits it into the check word and data bits, recomputes the check bits, and calculates the syndrome:

assign {old_check_word, old_data} = codeword;
assign new_check_word[0] = ...;
assign new_check_word[1] = ...;
assign new_check_word[2] = ...;
assign new_check_word[3] = ...;
assign syndrome = new_check_word ^ old_check_word;
When a single-bit error occurs, the syndrome holds the check-bit pattern of the affected data bit, so a correction can be applied by creating a mask to flip the bit in that position (note the syndrome values here follow directly from the check-bit equations above, with syndrome[0] as the least significant bit):

always_comb begin
  unique case (syndrome)
    4'b0111: correction = 1 << 0;
    4'b1110: correction = 1 << 1;
    4'b1101: correction = 1 << 2;
    4'b1011: correction = 1 << 3;
    4'b0011: correction = 1 << 4;
    4'b0110: correction = 1 << 5;
    4'b1100: correction = 1 << 6;
    4'b1001: correction = 1 << 7;
    default: correction = 0;
  endcase
end
And using it to generate the corrected data:
assign corrected_data = data ^ correction;
The value of the syndrome can be further inspected to signal what action has been taken. If the syndrome:
- is zero, no error occurred;
- has one bit set, a single check bit was flipped, which can be ignored;
- matches one of the data-bit patterns (three bits set, or two bits set in adjacent positions), a correctable single-bit error occurred;
- matches no pattern (two bits set in the remaining non-adjacent positions, 4'b1010 and 4'b0101, or four bits set), a multi-bit uncorrectable error occurred.
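The whole decode flow can also be modelled in software. This is a Python sketch of the same SEC scheme (a model, not the Verilog itself): it encodes a byte, injects a single-bit data error, and uses the syndrome to correct it:

```python
# Python model of the SEC decode flow described above (a sketch, not the
# Verilog itself). Each data bit maps to the check bits from the SEC table.
SEC = [
    {0, 1, 2},  # data[0]
    {1, 2, 3},  # data[1]
    {0, 2, 3},  # data[2]
    {0, 1, 3},  # data[3]
    {0, 1},     # data[4]
    {1, 2},     # data[5]
    {2, 3},     # data[6]
    {0, 3},     # data[7]
]

def sec_check(data):
    """Recompute the four check bits for an 8-bit data value."""
    check = 0
    for bit, cols in enumerate(SEC):
        if (data >> bit) & 1:
            for c in cols:
                check ^= 1 << c
    return check

def decode(check, data):
    """Correct a single data-bit error using the syndrome."""
    syndrome = check ^ sec_check(data)
    if syndrome == 0:
        return data                          # no error
    for bit, cols in enumerate(SEC):
        if syndrome == sum(1 << c for c in cols):
            return data ^ (1 << bit)         # flip the identified bit back
    return data                              # check-bit error or uncorrectable

data = 0b10110100
check = sec_check(data)                      # encode
corrupted = data ^ (1 << 5)                  # inject a single-bit data error
print(decode(check, corrupted) == data)      # True
```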
The above SECDED check-bit encoding can be implemented in a similar way, but since it uses only three-bit patterns, mapping syndromes to correction masks can be done with three-input AND gates:

always_comb begin
  case (1'b1)
    syndrome[0] && syndrome[1] && syndrome[2]: correction = 1 << 0;
    syndrome[0] && syndrome[1] && syndrome[3]: correction = 1 << 1;
    syndrome[0] && syndrome[2] && syndrome[3]: correction = 1 << 2;
    syndrome[1] && syndrome[2] && syndrome[3]: correction = 1 << 3;
    syndrome[0] && syndrome[1] && syndrome[4]: correction = 1 << 4;
    syndrome[0] && syndrome[2] && syndrome[4]: correction = 1 << 5;
    syndrome[1] && syndrome[2] && syndrome[4]: correction = 1 << 6;
    syndrome[0] && syndrome[3] && syndrome[4]: correction = 1 << 7;
    default: correction = 0;
  endcase
end
A syndrome with one bit set indicates a flipped check bit, and a three-bit syndrome matching one of the data-bit patterns indicates a correctable data-bit error; any nonzero syndrome with an even number of bits set (or a three-bit syndrome matching no pattern) indicates an uncorrectable multi-bit error.
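A Python model of the SECDED decode makes the classification explicit (again a sketch, not the Verilog itself):

```python
# Python model of the SECDED decode above: odd-weight syndromes that match
# a data-bit pattern are corrected; even-weight nonzero syndromes are flagged.
SECDED = [
    {0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2, 3},
    {0, 1, 4}, {0, 2, 4}, {1, 2, 4}, {0, 3, 4},
]
PATTERNS = [sum(1 << c for c in cols) for cols in SECDED]

def secded_check(data):
    check = 0
    for bit, cols in enumerate(SECDED):
        if (data >> bit) & 1:
            for c in cols:
                check ^= 1 << c
    return check

def classify(check, data):
    syndrome = check ^ secded_check(data)
    weight = bin(syndrome).count("1")
    if weight == 0:
        return "ok"
    if weight == 1:
        return "check-bit error"     # data bits are unaffected
    if syndrome in PATTERNS:
        return "corrected"           # single data-bit error
    return "uncorrectable"           # double-bit (or worse) error

data = 0xA5
check = secded_check(data)
print(classify(check, data ^ 0b01))  # corrected      (one data bit flipped)
print(classify(check, data ^ 0b11))  # uncorrectable  (two data bits flipped)
```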
References / further reading
- Error correction code, Wikipedia.
- Hamming code, Wikipedia.
- ECC memory, Wikipedia.
- Error detecting and error correcting codes (PDF), R. W. Hamming, The Bell System Technical Journal, vol. 29, no. 2, pp. 147-160, April 1950.
- Constructing an Error Correcting Code (PDF), Andrew E. Phelps, University of Wisconsin, Madison, November 2006.
Please get in touch (mail @ this domain) with any
comments, corrections or suggestions.
tected, the I/O port does not issue the transaction on the TLSB. It simply
aborts that transaction by transmitting a UTV_ERROR_A (or B) code
across its internal TL_CMD bus to each IDR. The I/O port then posts an
IPL 17 interrupt on the TLSB, if enabled by ICCNSE<INTR_NSES>.
Data bus errors are either ECC-detected errors on data transfers or control
errors on the data bus. In addition, all I/O port transceivers on the TLSB
check the data received from the bus against the expected data driven on
the bus.
The I/O port slices the TLSB_D<255:0> and TLSB_ECC<31:0> signals into
four parts, each containing 64 bits of data and 8 bits of ECC, as follows:
• TLSB_D<63:0> and TLSB_ECC<7:0> are handled by the IDR_0 gate array
• TLSB_D<127:64> and TLSB_ECC<15:8> are handled by the IDR_1 gate array
• TLSB_D<191:128> and TLSB_ECC<23:16> are handled by the IDR_2 gate array
• TLSB_D<255:192> and TLSB_ECC<31:24> are handled by the IDR_3 gate array
The I/O port handles error detection on these signals independently in
each slice, setting error bits in a corresponding TLESRn register. The
contents of the four TLESRn registers are summarized in the TLBER
register. Broadcasting of the error is determined by the error type and
whether or not broadcasting of the error type is enabled.
6.7.8.1 Single-Bit ECC Errors
A single-bit error on a memory data transfer is detected by the I/O port’s
ECC checking logic. The I/O port both checks and corrects the data. If the
I/O port detects a single-bit ECC error, it logs the error in its TLESRn
register by setting either <CRECC> or <CWECC>, depending on whether a
read or write command failed. If the error was detected on data that the
I/O port was writing to memory, then the TLESR<TDE> and
TLBER<DTDE> bits are also set.
A CRECC error sets <CRDE> in the I/O port’s TLBER register. A CWECC
error sets <CWDE> in the I/O port’s TLBER register.
When the I/O port detects a single-bit data error, it asserts
TLSB_DATA_ERROR to signal the other nodes of the error. If correctable
error interrupts are not disabled by TLCNR<CWDD> and TLCNR<CRDD>,
and ICCNSE<INTR_NSES> is set, an IPL 17 interrupt is posted to the
processor(s).
The I/O port also latches the failing syndrome in the TLESRn registers,
indicating which cycle(s) failed during the transaction.
6.7.8.2 Double-Bit ECC Errors
A double-bit error on a data transfer is detected by the I/O port’s ECC
checking logic. The I/O port logs the error in its TLESRn register by set-
Error Correction Code (ECC) in DDR Memories
Vadhiraj Sankaranarayanan, Sr. Technical Marketing Manager, Synopsys
Introduction
Double Data Rate Synchronous Dynamic Random-Access Memory (DDR SDRAM, or simply DRAM) is the most widely used main-memory technology in almost all applications today, ranging from high-performance computing (HPC) to power- and area-sensitive mobile applications. This is due to DDR’s many advantages, including high density with a simple architecture, low latency, and low power consumption. JEDEC, the standards organization that specifies memory standards, has defined and developed four DRAM categories to guide designers to precisely meet their memory requirements: standard DDR (DDR5/4/3/2), mobile DDR (LPDDR5/4/3/2), graphics DDR (GDDR3/4/5/6), and high bandwidth memory (HBM2/2E/3). Figure 1 shows a high-level block diagram of a memory subsystem in a typical system-on-chip (SoC), which comprises a DDR memory controller, DDR PHY, DDR channel, and DDR memory. As per JEDEC’s definition, the DDR channel is composed of command/address and data lanes. The simplified DDR memory shown below can represent a DRAM memory component from any of the four categories.
Figure 1: Memory subsystem block diagram in an SoC
As with any electronic system, errors in the memory subsystem are possible due to design failures/defects or electrical noise in any one of the components. These errors are classified as either hard errors (caused by design failures) or soft errors (caused by system noise or memory-array bit flips due to alpha particles, etc.). As the names suggest, hard errors are permanent and soft errors are transient in nature. Although it is logical to expect the DRAMs (with their large memory arrays, which grow denser with every standards generation on a smaller process node) to be the bulk source of memory errors, end-to-end protection from the controller to the DRAMs is highly desirable for overall memory subsystem robustness.
To handle these memory errors during runtime, the memory subsystem must have advanced RAS (Reliability, Availability, and Serviceability) features to maximize overall system uptime when memory errors occur. Without RAS features, the system will most likely crash on a memory error. With them, the system can continue operating through correctable errors, while logging the details of uncorrectable errors for later debugging.
ECC as a Memory RAS Feature
One of the most popular RAS schemes used in the memory subsystem is Error Correction Code (ECC) memory. By generating SECDED (Single-bit Error Correction and Double-bit Error Detection) ECC codes for the actual data and storing them in additional DRAM storage, the DDR controller can correct single-bit errors and detect double-bit errors in the data received from the DRAMs.
The ECC generation and check sequence is as follows:
- The ECC codes are generated by the controller based on the actual WR (WRITE) data. The memory stores both the WR data and the ECC code.
- During a RD (READ) operation, the controller reads both the data and respective ECC code from the memory. The controller regenerates the ECC code from the received data and compares it against the received ECC code.
- If there is a match, then no errors have occurred. If there are mismatches, the ECC SECDED mechanism allows the controller to correct any single-bit error and detect double-bit errors.
Such an ECC scheme provides an end-to-end protection against single-bit errors that can occur anywhere in the memory subsystem between the controller and the memory.
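The sequence above can be sketched as a small software model. This is illustrative only: the real ECC code is a hardware SECDED code, and here a simple XOR checksum (a hypothetical stand-in) marks where generation, storage, and comparison happen:

```python
# Illustrative model of the controller-side sequence above. The real code
# is a hardware SECDED code; a simple XOR checksum stands in for it here.
def ecc(data: bytes) -> int:
    code = 0
    for b in data:
        code ^= b                     # step 1: generate the code from WR data
    return code

memory = {}                           # address -> (data, code): DRAM + ECC storage

def write(addr, data):
    memory[addr] = (data, ecc(data))  # store WR data and its ECC code together

def read(addr):
    data, stored = memory[addr]       # step 2: read data and code back
    if ecc(data) == stored:           # step 3: regenerate and compare
        return data, "ok"
    return data, "error detected"     # a SECDED code would correct 1 bit here

write(0x40, b"\x12\x34")
print(read(0x40)[1])                  # ok
memory[0x40] = (b"\x12\x35", memory[0x40][1])   # simulate a bit flip in DRAM
print(read(0x40)[1])                  # error detected
```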
Based on where the ECC codes are stored, the ECC scheme can be of two types: side-band ECC or inline ECC. In side-band ECC, the ECC codes are stored on separate DRAMs; in inline ECC, the codes are stored on the same DRAMs as the actual data.
As DDR5 and LPDDR5 support much higher data-rates than their predecessors, they support additional ECC features for enhancing the robustness of the memory subsystem. On-die ECC in DDR5 and Link-ECC in LPDDR5 are two such RAS schemes to further bolster the memory subsystem RAS capabilities.
Different Schemes of ECC
Side-band ECC
The side-band ECC scheme is typically implemented in applications using standard DDR memories (such as DDR4 and DDR5). As the name suggests, the ECC code is sent as side-band data along with the actual data to memory. For instance, for a 64-bit data width, 8 additional bits are used for ECC storage. Hence, the DDR4 ECC DIMMs commonly used in today’s enterprise-class servers and data centers are 72 bits wide. These DIMMs have two additional x4 DRAMs or a single additional x8 DRAM for the 8 bits of ECC storage. In side-band ECC, the controller writes and reads the ECC code along with the actual data, so no additional WR or RD overhead commands are required. Figure 2 describes the WR and RD operation flows with side-band ECC. When there are no errors in the received data, side-band ECC incurs minimal latency penalty compared to inline ECC.
Figure 2: WR and RD operation flows with side-band ECC
Inline ECC
The inline ECC scheme is typically implemented in applications using LPDDR memories. As LPDDR DRAMs have a fixed channel width (16 bits for an LPDDR5/4/4X channel), side-band ECC becomes an expensive solution with these memories. For instance, for a 16-bit data width, an additional 16-bit LPDDR channel would need to be allocated to carry the 7- or 8-bit ECC code word. The ECC code word fills this additional channel only partially, resulting in storage inefficiency, and also adds extra load to the command/address channel, possibly limiting performance. Hence, inline ECC is a better solution for LPDDR memories.
Instead of requiring an additional channel for ECC storage, the controller in inline ECC stores the ECC code in the same DRAM channel where the actual data is stored. Hence, the overall data-width of the memory channel remains the same as the actual data-width.
In inline ECC, the 16-bit-channel memory is partitioned such that a dedicated fraction of the memory is allocated to ECC code storage. Because the ECC code is not sent along with the WR and RD data, the controller generates separate overhead WR and RD commands for the ECC codes. Hence, every WR and RD command for the actual data is accompanied by an overhead WR or RD command for the ECC data. High-performance controllers reduce the penalty of these overhead commands by packing the ECC data of several consecutive addresses into one overhead ECC WR command. Similarly, the controller reads the ECC data of several consecutive addresses from memory in one overhead ECC RD command and can apply the read-out ECC data to the actual data from those consecutive addresses. Hence, the more sequential the traffic pattern, the lower the latency penalty from these ECC overhead commands. Figure 3 describes the WR and RD operation flows with inline ECC.
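The packing behaviour described above can be sketched with a toy model (hypothetical parameters, not any specific controller): count how many overhead ECC WR commands a stream of data WR addresses generates when one ECC command can cover up to eight consecutive addresses:

```python
# Toy model of inline-ECC command packing (hypothetical parameters, not any
# specific controller): one overhead ECC WR command can cover the ECC data
# of up to `pack` consecutive data WR addresses.
def ecc_overhead(addresses, pack=8):
    """Count overhead ECC WR commands for a stream of data WR addresses."""
    ecc_cmds = 0
    run = 0          # addresses covered by the current ECC command
    prev = None
    for a in addresses:
        if prev is not None and a == prev + 1 and run < pack:
            run += 1                  # still packable into the current command
        else:
            ecc_cmds += 1             # start a new overhead ECC command
            run = 1
        prev = a
    return ecc_cmds

sequential = list(range(16))          # 16 sequential writes
scattered = [0, 7, 21, 40, 55]        # 5 scattered writes
print(ecc_overhead(sequential))       # 2: ECC for 8 addresses packs per command
print(ecc_overhead(scattered))        # 5: no packing possible
```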
Figure 3: WR and RD operation flows with Inline ECC
On-die ECC
With each DDR generation, DRAM capacity typically increases, and DRAM vendors shrink the process technology to achieve both higher speeds and better economies of scale in production. With higher capacity and speed coupled with a smaller process technology, the likelihood of single-bit errors on the DRAM memory arrays increases. To further bolster the memory channel, DDR5 DRAMs have additional storage dedicated to ECC: for every 128 bits of data, a DDR5 DRAM has 8 additional bits of ECC storage. On-die ECC is an advanced RAS feature that a DDR5 system can enable for higher speeds.
The DRAMs internally compute the ECC for the WR data and store the ECC code in the additional storage. On a read operation, the DRAMs read out both the actual data as well as the ECC code and can correct any single-bit error on any of the read data bits. Hence, on-die ECC provides further protection against single-bit errors inside the DDR5 memory arrays. As this scheme does not offer any protection against errors occurring on the DDR channel, on-die ECC is used in conjunction with side-band ECC for enhanced end-to-end RAS on memory subsystems. Figure 4 describes the WR and RD operation flows with on-die ECC.
Figure 4: WR and RD operation flows with On-die ECC
Link-ECC
The Link-ECC scheme is an LPDDR5 feature that offers protection against single-bit errors on the LPDDR5 link, or channel. The memory controller computes the ECC for the WR data and sends the ECC on specific bits along with the data. The DRAM generates the ECC on the received data, checks it against the received ECC, and corrects any single-bit errors. The roles of the controller and the DRAM are reversed for a read operation. Note that link-ECC does not offer any protection against single-bit errors in the memory array. However, inline ECC coupled with link-ECC strengthens the robustness of LPDDR5 channels by providing end-to-end protection against single-bit errors. Figure 5 describes the WR and RD operation flows with link-ECC.
Figure 5: WR and RD operation flows with Link-ECC
Conclusion
One of the most widely used memory RAS features is the Error Correction Code (ECC) scheme. Applications using standard DDR memories typically implement side-band ECC, while applications using LPDDR memories implement inline ECC. With higher speeds, and hence more pronounced signal-integrity effects, on DDR5 and LPDDR5 channels, ECC is now supported on the DRAMs themselves, in the form of on-die ECC for DDR5 and link-ECC for LPDDR5. Synopsys’ DesignWare® DDR5/4 and LPDDR5/4 IP solutions offer advanced RAS features including all of the ECC schemes highlighted in this article.