Ecc error detected memtest86 - Исправление ошибок и поиск оптимальных решений проблем

ECC Technical Details

What is ECC memory?
How does ECC work?

What are the different ECC schemes?

What’s new in DDR5 ECC RAM?
Is ECC memory neccessary?
What system requirements are needed to enable ECC protection?

Do I need to change BIOS settings?

How do I know when ECC errors are detected?

How does MemTest86 report ECC errors?
How does Windows report ECC errors?
How does Linux report ECC errors?

What is ECC injection?

Can I use MemTest86 inject ECC errors?
How do I know if my system supports ECC injection?

Why are ECC errors not being reported on my AMD Ryzen system?

What is ECC memory?

Error correction code (ECC) is a mechanism used to detect and correct errors in memory data due to
environmental interference and physical defects. ECC memory is used in high-reliability applications
that cannot tolerate failure due to corrupted data such as medical equipment, aircraft control systems, or bank database servers.

Most memory errors are single (1-bit) errors caused by soft errors (eg. cosmic rays, alpha rays, electromagnetic interference) but some can be due to hardware faults (eg. row hammer fault).
Soft errors are more prevalent for systems that operate at higher altitudes, such as commercial aircrafts. It is said that at an altitude of approximately 10km, bit error inducing cosmic rays are 300 times higher.

Such single bit errors can be corrected by ECC memory systems. Multi-bit errors, may also be detected and/or corrected, depending on the number of symbols in error.

Symptoms of memory errors include corruption of data, system crash, and/or security vulnerabilities giving unprivileged code access to the kernel.
Memory errors are known to be one of the most common hardware causes of machine crashes in large scale data centers.

How does ECC work?

ECC is implemented by generating and storing an encrypted, parity-like code used to not only identify the bit in error but correct it as well.
This implementation-dependent ECC code is generated and stored on writes, and verified on reads.

The most common implementations use Hamming codes for single-bit correction and double-bit detection (SECDED).
Hamming codes define parity bits which cover a pre-defined set of data bits. Typically, an 8-bit hamming code is used to protect 64-bit data.

The ECC verification step, using a parity-check matrix, generates a value called a syndrome.
If the syndrome is zero, no error occurred. Otherwise, it is used to index a lookup table called the syndrome table to identifying the bits in error (if correctable), or otherwise determine if the error is uncorrectable.

An example of a syndrome table for the Hamming (7,4) code (4 data bits d_n, 3 parity-check bits p_n) encoded as p₁p₂d₁p₃d₂d₃d₄ is as follows:

Syndrome	Error Vector
000	0000000
100	1000000
010	0100000
110	0010000
001	0001000
101	0000100
011	0000010
111	0000001

For example, if the ECC verification step calculates the syndrome to be 111, the above syndrome table can be used to lookup the error vector to be 0000001.
The error vector identifies d₄ as the bit in error for the 7-bit message.

The sequences for read and write accesses are summarized below.

Memory write sequence

CPU sends write data to the memory controller (on the same chipset, for newer CPUs)
Memory controller generates ECC code based on the write data
Memory controller sends write data and ECC code across the memory channel
Write data and ECC code are stored in the memory DRAM chips

Memory read sequence

CPU issues read request to the memory controller (on the same chipset, for newer CPUs)
Memory controller reads data and ECC code from the memory
Memory controller generates ECC code based on read data
Memory controller verifies generated and stored ECC match. If not, use ECC SECDED mechanism to correct single-bit errors and detect double-bit errors.

Full end-to-end ECC memory system involves the CPU, memory controller, and DRAM modules during memory access.
Consequently, these components require additional circuitry to support ECC functionality.
For example, ECC RAM contains an additional DRAM chip to store the ECC codes as shown in the following diagram.

In addition, the memory bus not only includes the Data DQ lines (eg. DQ0-63) but also the ECC Check Bits CB lines (eg. CB0-7).

More often than not, ECC RAM is also Registered RAM which places a register between the memory controller and DRAM banks.
This reduces the electrical load especially for systems that have a large amount of RAM installed, which is often the case for server-grade machines.

What are the different ECC schemes?

An ECC-enabled memory subsystem may use one or more of the following schemes:

Side-band ECC (DDR4/DDR5)
Inline ECC (LPDDR4/LPDDR5)
On-die ECC (DDR5)
Link ECC (LPDDR5)

Side-band ECC

Side-band ECC is the most typical scheme used in ECC memory systems today.
Side-band ECC requires supporting hardware including ECC logic in the memory controller, extra bits in the memory bus and separate DRAM chips in the memory module to store the ECC code.

During write operations, the memory controller generates and transmits the ECC code alongside the write data as «side-band» without introducing extra command overhead.

On read operations, the stored ECC code accompanies the read data as «side-band» which is then verified, and if necessary, corrected by the memory controller.

Inline ECC

Inline ECC is used for LPDDR memory systems that have stricter hardware constraints. This effectively removes the requirements of extra bits in the memory bus and separate DRAM chips for the ECC code needed for Side-band ECC.

In contrast to Side-band ECC rather than bundling the data and ECC code in a single command, Inline ECC issues separate read/write commands for both the data and ECC code.
In addition, the ECC code is stored in the same DRAM chips as the data. As a result, Inline ECC introduces extra command overhead during read/write operations.

On-die ECC

On-die ECC is a new scheme introduced for DDR5 memory which is completely self-contained in the DDR5 memory module.

On-die ECC, unlike the above schemes, does not provide end-to-end protection. The purpose of On-die ECC is to protect the integrity of data stored in the memory cells of DRAM arrays; it does not detect or prevent errors that occur during transmission between the memory controller and the memory module.
All ECC detection and correction is performed internally within DRAM memory cells; it is completely invisible to the CPU and memory controller.

To provide full end-to-end protection, On-die ECC would need to be used in conjuction with Side-band ECC.

Link ECC

Link ECC is another new scheme introduced for LPDDR5 memory to augment end-to-end protection for systems with hardware constraints.

Link ECC, by itself, does not provide end-to-end protection; it provides protection for errors that occur during transmission on the channel between the memory controller and the DRAM.

On write operations, the memory controller generates and sends the ECC code along with the write data to the DRAM module.
The DRAM module receives the write data, generates its own ECC code and verifies whether it matches with the ECC code sent by the memory controller.
If necessary, single-bit errors are corrected accordingly.

In constrast to the other schemes, Link ECC does not detect or prevent errors while being stored in DRAM cells.
To provide full end-to-end protection, Link ECC would need to be used in conjuction with Inline ECC to provide full end-to-end protection.

What’s new in DDR5 ECC RAM?

Previous generation ECC memory systems Side-band and Inline ECC schemes which provides end-to-end detection and correction for errors during transmission and storage in DRAM cells

DDR5 RAM introduces two additional schemes, On-die ECC (or On-chip) and Link ECC, to compensate for higher bit error rates (BER) due to increased speed and density of DDR5 RAM.

On-die ECC detects and corrects errors in the DRAM cells that may occur during, for example, a DRAM row refresh.
Due to increasing error rates as process technology reduces the size of memory cells, on-die ECC sustains the yield of «good»
memory cells.

On-die ECC is completely invisible to the system. Its implementation, encoding/decoding algorithms, and metadata are all fully contained within the DRAM device and provide no feedback about error detection and/or correction to the rest of the system

Similarily, Link-ECC detects and corrects transmission errors on the LPDDR5 link or channel. Compared to previous generations the likelihood of errors on the DQ line is much higher, due to significant changes in speed and power usage of LPDDR5 RAM.
Link-ECC offers protection for speed and power improvements.

Is ECC memory neccessary?

In general, there is a trade-off between the higher cost of ECC hardware and slight decrease in performance versus greater system reliability and availability.
Depending on the implementation, it is said that enabling ECC may consume additional power and lower memory performance by around 2–3 percent.
For home or personal use, the consequence of memory errors may not be significant enough to justify the additional cost.

However, for highly-sensitive, industrial-grade systems, the additional hardware costs become neglible compared to the socio-economical consequences of memory failures.
As a result, ECC memory should always be used in systems where memory failure has significant consequences (eg. medical equipment, aircraft control systems, bank payment system).

What system requirements are needed to enable ECC protection?

Due to additional circuitry required for ECC protection, specialized ECC hardware support is required by the CPU chipset, motherboard and DRAM module.
This includes the following:

Server-grade CPU chipset with ECC support (Intel Xeon, AMD Ryzen)
Motherboard supporting ECC operation
ECC RAM

Consult the motherboard and/or CPU documentation for the specific model to verify whether the hardware supports ECC.
Use vendor-supplied list of certified ECC RAM, if provided.

Do I need to change BIOS settings?

Most ECC-supported motherboards allow you to configure ECC settings from the BIOS setup.
The specific option depends on the motherboard vendor or model such as the following:

DRAM ECC Enable (American Megatrends, ASUS, ASRock, MSI)
ECC Mode (ASUS)

An example of such ECC setting is shown in the following screenshot.

How do I know when ECC errors are detected?

The mechanism for how ECC errors are logged and reported to the end-user depends on the BIOS and operating system.
In most cases, corrected ECC errors are written to system/event logs. Uncorrected ECC errors may result in kernel panic or blue screen.

How does MemTest86 report ECC errors?

MemTest86 directly polls ECC errors logged in the chipset/memory controller registers and displays it to the user on-screen.
In addition, ECC errors are written to the log and report file.

During testing, MemTest86 may report ECC errors detected by the memory controller if ECC is supported and enabled.
This is demonstrated in the following screenshot:

The degree of information available for the detected ECC error depends heavily on the CPU/memory controller chipset. This includes any of the following:

Memory address
DRAM address (column, row, rank, bank)
Channel and DIMM slot number

The following examples illustrate possible outputs displayed on screen for detected ECC errors.

[ECC Error] Test: 1, Addr: 0x8F32540AC

The ECC error was detected in memory address 0x8F32540AC.

[ECC Error] Test: 1, (Ch,Sl,Rk,Bk,Rw,Cl): (1,0,2,0,17900,0)

The ECC error was detected in the DIMM module located in channel 1, slot 0 with the indicated rank address (0x2), and bank address (0x0), row address (0x17900), column address (0x0).

[ECC Error] Test: 1, Channel/Slot: 1/0

The ECC error was detected in the DIMM module located in channel 1, slot 0. No information regarding the memory address that triggered the ECC error is available.

*Note* The reported channel/slot of the ECC error is from the point of view of the memory controller and does not necessarily correspond to the expected physical slot on the motherboard.
Although this is true for most motherboards, there are boards that do not map to the expected physical slot. One particular motherboard is the Supermicro H12SSL-NT which follows the physical slot mapping below:

Reported channel	Motherboard slot
0	A1
1	B1
2	D1
3	C1
4	H1
5	G1
6	E1
7	F1

Credits to lunadesign for determining the mapping for Supermicro H12SSL-NT

Due to different memory controller architectures amongst different chipsets, there is no common ECC error framework; specific ECC polling code is required for each chipset.
In particular, this would involve polling one or more of the following hardware registers:

Machine Check Architecture (MCA) registers for x86-based systems
Integrated Memory Controller (IMC) PCI registers
Sideband registers for Intel SoC chipsets
System Management Network (SMN) registers for AMD Ryzen chipsets

Machine Check Architecture (MCA) is an x86-specific mechanism for CPUs to report generic hardware errors to higher-level software (eg. operating system).
This allows system software to handle hardware errors in a generic way, without needing to the internal details of chipset.

It defines a common set of model-specific registers (MSRs) that deterine the system response when hardware errors are detected. In the case of ECC errors,
an exception, Machine Check Exception (MCE), may be generated and the offending address, channel, and/or syndrome may be logged in the MSRs.

Some chipsets may also define a set of Integrated Memory Controller (IMC) PCI registers that record detected ECC errors such as the offending
DRAM address (rank, bank, row, column), channel and/or syndrome. These registers are accessed through standard PCI mechanism.
Unlike the Machine Check Architecture (MCA), the PCI registers are specific to each chipset and would required a separate chipset-specific implementation.

Some chipsets, such as Intel Atom SoCs, use an internal bus as an indirect method for accessing ECC registers.
Such registers are called Sideband registers which adds an additional layer of complexity on top of the previously described PCI registers.
Despite the extra layer, similar ECC error details are logged in these registers such as DRAM address (rank, bank, row, column), channel and/or syndrome.

For AMD Ryzen chipsets, an internal sideband bus called System Management Network (SMN) is used to access ECC registers.
Similar to Sideband registers in Intel SoCs, SMN registers are an indirect way to access ECC error details logged in these registers such as DRAM address (rank, bank, row, column), channel and/or syndrome.

How does Windows report ECC errors?

ECC errors detected in Windows appear in the Event Log as Microsoft Windows Hardware Error Architecture (WHEA) Warning Event.

Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          1/1/2021 12:00:00 AM
Event ID:      19
Task Category: None
Level:         Warning
Keywords:      
User:          LOCAL SERVICE
Computer:      WIN10
Description:
A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Cache Hierarchy Error
Processor APIC ID: 2

If ECC is enabled properly in the BIOS, end-users should be able to receive errors via WHEA without needing to configure anything in Windows.

How does Linux report ECC errors?

The Linux kernel supports reporting ECC errors for ECC memory via the EDAC (Error Detection And Correction) driver subsystem.

Depending on the Linux distribution, ECC errors may be reported by the following:

mcelog — collects and decodes MCA error events on x86 (deprecated)
edac-utils — fills DIMM labels data and summarizes memory errors (deprecated)
rasdaemon — monitor ECC memory and report both correctable and uncorrectable memory errors on recent Linux kernels

Installing rasdaemon

rasdaemon can be installed for most Linux distributions using the respective package manager:

apt-get install rasdaemon # Debian/Ubuntu

Enabling rasdaemon service

The rasdaemon service can be started using the systemd service manager systemctl

systemctl enable rasdaemon systemctl start rasdaemon

Querying ECC error count & summary

# ras-mc-ctl --error-count Label CE UE mc#0csrow#2channel#0 0 0 mc#0csrow#2channel#1 0 0 mc#0csrow#3channel#1 0 0 mc#0csrow#3channel#0 0 0

# ras-mc-ctl --summary Memory controller events summary: Corrected on DIMM Label(s): 'mc#0csrow#2channel#1' location: 0:2:1:-1 errors: 3 Corrected on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 3 Fatal on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 1

What is ECC injection?

ECC injection is a debugging feature introduced in the memory controller to artificially insert memory errors to verify proper system behaviour.
This feature is meanted to be used by developers and system integrators, and not meant to be used in production by end-users.

Various ECC injection options are available depending on the vendor (eg. Intel, AMD) and chipset (eg. Xeon, Ryzen, Atom).
These options include the following:

Read and/or write data path
Address range
Per chunk count
Single or Multi-bit errors
Single-shot or continuous

Can I use MemTest86 inject ECC errors?

MemTest86 Pro Edition supports ECC injection if the CPU/memory controller chipset supports error injection and the feature is not locked by BIOS.
See the current list of chipsets with ECC injection capability supported by MemTest86.

Once ECC injection is enabled in the main menu or configuration file, MemTest86 will attempt to inject single-bit ECC errors at the beginning of each test.

If ECC errors were successfully injected and detected by the system, the user shall see an [ECC Inject] message followed by an [ECC Errors Detected] message.

If [ECC Errors Detected] message does not appear, it is highly likely the ECC injection is locked or disabled by BIOS.

How do I know if my system supports ECC injection?

In general, ECC injection is not a feature that is normally accessible by end-users. Even if the chipset supports the ECC injection feature, details are often sparse and not described in publicly available datasheets.
Consult the datasheet for your CPU/memory controller chipset to determine whether the ECC injection feature is available and fully specified.

In particular, some Intel chipsets (Broadwell, Xeon Scalable) use Intel Trusted Execution Technology (Intel TXT) to lock ECC injection.
Intel TXT, using secure hardware modules, verifies the integrity of the BIOS, firmware, OS and hypervisor in order to guarantee a trusted operating environment.
As a result, this requires preventing access to specific memory controller registers from being compromised, including ECC injection registers.

Some chipsets that support ECC injection have a locking mechanism that once enabled in the BIOS, effectively disables the ECC injection capability.
For these cases, a BIOS option may be available to leave the feature unlocked. Otherwise, a custom BIOS is required for unlocking the feature.

Why are ECC errors not being reported on my AMD Ryzen system?

There is a possibility that a BIOS setting, Platform First Error Handling (PFEH), is preventing ECC errors from being reported to MemTest86.

An example of this setting is shown in the following screenshot.

If this setting is enabled, set to disabled and try running MemTest86 again.

Another explanation is the use of out-of-band (OOB) monitoring solutions such as Baseboard Management Controller (BMC) and Intelligent Platform Management Interface (IPMI),
which is used in server platforms (eg. Supermicro servers)

Источник

Troubleshooting Memory Errors

MemTest86 detected errors in my memory. Is there something wrong with my RAM?

Why am I only getting errors during Test 13 Hammer Test?
Why do I get errors only when testing RAM modules together, and not when individually tested?

MemTest86 reported the memory address of the failure. What does this mean?

How does MemTest86 report ECC errors?
If I know the address decoding scheme, can I configure MemTest86 to report the failing module?

How do I know which RAM module is failing?
How do I fix the memory errors?

Below is a video overview on how to troubleshoot bad RAM with MemTest86.

Download

Right-click to download, MP4 format, 9MB

MemTest86 detected errors in my memory. Is there something wrong with my
RAM?

Please be aware that not all errors reported by MemTest86 are due to bad memory. The
test implicitly tests the CPU, L1 and L2 caches as well as
the motherboard. It is impossible for the test to determine what causes the failure to
occur. However, most failures will be due to a problem
with memory module. When it is not, the only option is to replace parts until the
failure is corrected.

Sometimes memory errors show up due to component incompatibility. A memory module may
work fine in one system and not in another.
This is not uncommon and is a source of confusion. In these situations the components
are not necessarily bad but have marginal
conditions that when combined with other components will cause errors.

Often the memory works in a different system or the vendor insists that it is good. In
these cases the memory is not necessarily bad
but is not able to operate reliably at full speed. Sometimes more conservative memory
timings on the motherboard will correct these errors.
In other cases the only option is to replace the memory with better quality, higher
speed memory. Don’t buy cheap memory and expect it to work
reliably. On occasion «block move» test errors will occur even with name brand memory
and a quality motherboard. These errors are legitimate and
should be corrected.

All valid memory errors should be corrected. It is possible that a particular error will
never show up in normal operation. However, operating
with marginal memory is risky and can result in data loss and even disk corruption. Even
if there is no overt indication of problems you cannot
assume that your system is unaffected. Sometimes intermittent errors can cause problems
that do not show up for a long time. You can be sure that
Murphy will get you if you know about a memory error and ignore it.

We are often asked about the reliability of errors reported by MemTest86. In the vast
majority of cases errors reported by the test are valid.
There are some systems that cause MemTest86 to be confused about the size of memory and
it will try to test non-existent memory. This will cause a
large number of consecutive addresses to be reported as bad and generally there will be
many bits in error. If you have a relatively small number
of failing addresses and only one or two bits in error you can be certain that the
errors are valid. Also intermittent errors are without exception
valid. Frequently memory vendors question if MemTest86 supports their particular memory
type or a chipset. MemTest86 is designed to work with all
memory types and all chipsets.

MemTest86 cannot diagnose many types of PC failures. For example a faulty CPU that causes
Windows to crash will most likely just cause MemTest86 to crash in the same way.

Why am I only getting errors during Test 13 Hammer Test?

The Hammer Test is designed to detect RAM modules that are susceptible to disturbance
errors caused by charge leakage. This phenomenon is characterized
in the research paper
Flipping Bits in Memory
Without Accessing Them: An Experimental Study of DRAM Disturbance Errors
by
Yoongu Kim et al.
According to the research, a significant number of RAM modules manufactured 2010 or
newer are affected by this defect. In simple terms, susceptible RAM modules can be
subjected to disturbance errors
when repeatedly accessing addresses in the same memory bank but different rows in a
short period of time. Errors occur when the repeated access causes charge loss in a
memory cell, before the cell contents
can be refreshed at the next DRAM refresh interval.

Starting from MemTest86 v6.2, the user may see a warning indicating that the RAM may be
vulnerable to high frequency row hammer bit flips.
This warning appears when errors are detected during the first pass (maximum hammer
rate) but no errors are detected during the second pass (lower hammer rate).
See MemTest86 Test Algorithms for a description of
the two passes that are performed during the Hammer Test (Test 13).
When performing the second pass, address pairs are hammered only at the rate deemed as
the maximum allowable by memory vendors (200K accesses per 64ms).
Once this rate is exceeded, the integrity of memory contents may no longer be
guaranteed. If errors are detected in both passes, errors are reported as normal.

The errors detected during Test 13, albeit exposed only in extreme memory access cases,
are most certainly real errors. During typical home PC usage (eg. web browsing, word
processing, etc.),
it is less likely that the memory usage pattern will fall into the extreme case that
make it vulnerable to disturbance errors. It may be of greater concern if you were
running highly sensitive equipment
such as medical equipment, aircraft control systems, or bank database servers. It is
impossible to predict with any accuracy if these errors will occur in real life
applications. One would need
to do a major scientific study of 1000 of computers and their usage patterns, then do a
forensic analysis of each application to study how it makes use of the RAM while it
executes. To date, we have only
seen 1-bit errors as a result of running the Hammer Test.

There are several actions that can be taken when you discover that your RAM modules are
vulnerable to disturbance errors:

Do nothing
Replace the RAM modules
Use RAM modules with error-checking capabilities (eg. ECC)

Depending on your willingness to live with the possibility of these errors manifesting
itself as real problems,
you may choose to do nothing and accept the risk. For home use you may be willing to
live with the errors. In our experience, we have several machines that have been stable
for home/office use despite experiencing errors in the Hammer Test.

You may also choose to replace the RAM with modules that have been known to pass the
Hammer Test. Choose RAM modules of different brand/model as it is likely that the RAM
modules with the same model would still fail the Hammer test.

For sensitive equipment requiring high availability/reliability, you would replace the
RAM without question and would probably switch to RAM with error correction such as ECC
RAM. Even a 1-bit error can result in catastrophic consequences for say,
a bank account balance. Note that not all motherboards support ECC memory, so consult
the motherboard specifications before purchasing ECC RAM.

Detection and mitigation of row hammer errors

The ability of MemTest86 to detect and report on row hammer errors depends on several
factors and what mitigations are in place. To generate errors adjacent memory
rows must be repeatedly accessed. But hardware features such as multiple channels,
interleaving, scrambling,
Channel Hashing, NUMA & XOR schemes make it nearly impossible (for an arbitrary CPU &
RAM stick) to know which memory addresses correspond to which rows in the RAM.
Various mitigations might also be in place. Different BIOS firmware might set the
refresh interval to different values (tREFI). The shorter the interval the more
resistant the RAM will be to errors.
But shorter intervals result in higher power consumption and increased processing
overhead. Some CPUs also support pseudo target row refresh (pTRR) that can be used in
combination with pTRR-compliant RAM.
This field allows the RAM stick to indicate the MAC (Maximum Active Count) level which
is the RAM can support. A typical value might be 200,000 row activations.
Some CPUs also support the Joint Electron Design Engineering Council (JEDEC) Targeted
Row Refresh (TRR) algorithm. The TRR is an improved version of the previously
implemented
pTRR algorithm and does not inflict any performance drop or additional power usage.
As a result the row hammer test implemented in MemTest86 maybe not be the worst case
possible and vulnerabilities in the underlying RAM might be undetectable due to the
mitigations in
place in the BIOS and CPU.

Why do I get errors only when testing RAM modules together, and not
when individually tested?

Most memory systems nowadays operate in multiple channel mode in order to increase the
transfer rate between the RAM modules and the memory
controller. It is recommended that modules with identical specifications (ie. «matching
modules») when running in multi-channel mode. Some motherboards
also have compatibility issues with certain brand/models of RAM when running in
multi-channel mode.

When you see errors while running MemTest86 with multiple RAM modules installed, but not
when they are tested individually, it is likely that the multi-channel
configuration is the culprit. This could be due to mismatched RAM specifications, or
simply using brands/models of RAM that is incompatible with the motherboard.
Most motherboard vendors release a list of known compatible RAM models that have been
tested to work with your motherboard. Replace the modules with a matching set of
known good ones and see if you get better results.

MemTest86 reported the memory address of the failure. What does this
mean?

When MemTest86 detects errors during the memory tests, the memory address, actual and
expected data are reported to the user. The memory address is the location in system
memory where the data contained does not match what was expected.
This is the address that is specified by the CPU to the memory controller when
requesting data from DRAM. The memory controller then decodes this memory address to
identify the specific channel, DIMM, rank, DRAM chip, bank, row and column in DRAM using
a chipset-specific address decoding scheme.

The address decoding scheme is the process used by the memory controller to generate the
appropriate address signals to the DRAM chip. Depending on the memory controller, this
process can get fairly complex as it is not simply a a direct mapping of the system
address bits to the DRAM address bits. In order to increase
the memory performance, strategies such as channel interleaving (for Dual, Tri and Quad
channel setups), rank/bank/row interleaving, and address swizzling are used to increase
the concurrency of memory accesses. For some chipsets such as AMD, the address decoding
scheme can be configured/determined via PCI registers as
described in the chipset specifications. For other chipsets (eg. Intel), however, the
address decoding scheme is proprietary and not made available to the public. This makes
identifying the DRAM address and correspondingly, the failing module, much more
difficult. For that reason, MemTest86 only has the capability to report DRAM addresses
for supported hardware configurations.

How does MemTest86 report ECC errors?

Refer to ECC Technical Information for ECC reporting in MemTest86 and other ECC technical details.

If I know the address decoding scheme, can I configure MemTest86 to
report the failing module?

For systems where the address decoding scheme is known, MemTest86 provides several configuration file parameters to aid users in
determining the faulty module that corresponds to the memory address:

  ADDR2CHBITS=12,9,7
  ADDR2SLBITS=3,4
  ADDR2CSBITS=8

For each of these 3 parameters, a list of bit positions can be used to specify which
address bits of a memory address to exclusive-or (XOR) in order to determine the
corresponding [memory channel|slot|chip select (CS)] (0 or 1) of the failing module.
This is only useful if you know that the memory controller maps a particular address to
a [memory channel|slot|chip select (CS)] using this XOR-based decoding scheme. If these
parameters are specified and MemTest86 detects a memory error, the [memory
channel|slot|chip select (CS)] will be calculated and displayed along with the faulting
address.

How do I know which RAM module is failing?

Once a memory error has been detected, determining the failing SIMM/DIMM module is not a
clear cut procedure. Different CPUs map memory addresses
to physical memory sticks in different ways. Features like dual channel RAM (with
interleaving), channel hashing and NUMA make the mapping of addresses
to modules, banks & rows very difficult. Due to the large number of CPUs and motherboard
vendors and potential combinations of memory slots we do not have a general solution, though in some cases limited decode is possible. However, there are
steps that may be taken to determine the failing module. Here are some techniques that
you may wish to use:

Removing modules

This is simplest method for isolating a failing modules, but may only be
employed when one or more modules can be removed from the system. By
selectively removing modules from the system and then running the test you will
be able to find the bad modules.
Be sure to note exactly which modules are in the system when the test passes and
when the test fails.
Rotating modules

When none of the modules can be removed then you may wish to rotate modules to
find the failing one. This technique can only be used if there
are three or more modules in the system. Change the location of two modules at a
time. For example put the module from slot 1 into slot 2
and put the module from slot 2 in slot 1. Run the test and if either the failing
bit or address changes then you know that the failing module
is one of the ones just moved. By using several combinations of module movement
you should be able to determine which module is failing.
Replacing modules

If you are unable to use either of the previous techniques then you are left to
selective replacement of modules to find the failure.

How do I fix the memory errors?

Depending on what is causing the memory errors, you can try the following options:

Replace the RAM modules (most common solution)
Set default or conservative RAM timings
Increase the RAM voltage levels
Decrease the CPU voltage levels
Apply BIOS update to fix incompatibility issues
Flag the address ranges as ‘bad’

Once you have determined with certainty which RAM module(s) have failed, replacing them
with a new set of RAM modules usually fixes the errors. When choosing
which modules to use as a replacement, consider using one that is listed as compatible
by the motherboard vendor as it would have been verified by the vendor
itself.

Sometimes, memory errors only manifest themselves when RAM timings are set too
aggressively in the BIOS (eg. overclocking). For certain modules that support
higher performance XMP timings, consider using standard, non-XMP timings to see if you
get better results. Consult your motherboard manual on how to set or
reset your RAM timings to default settings.

For certain configurations (especially when using aggressive RAM timings), higher
voltage may be required in order to operate the RAM in stable conditions.
If you are using non-standard RAM timings, slightly increasing the voltage (eg. from
1.5V to 1.55V) may increase the stability. Increase the voltage at your
own risk as excessive voltage may damage the components of your system

A higher CPU voltage may cause overheating, resulting in memory errors that lead to system hangs/crashes.
Check with the motherboard vendor for instructions on configuring CPU voltage levels.

In certain cases, RAM incompability issues can be fixed with a BIOS update. Check the
motherboard vendor for updated BIOS with RAM compaibiliy fixes.

Several operating systems allow the user to pass in a list of ‘bad’ memory ranges to
prevent the operating system to use or allocate memory in that range.
See Blacklisting RAM Pages for more details.

Источник

Junior Member

Join Date: Sep 2016

Posts: 3
MemTest86 7.1
The message appears — [ECC Errors detected], but summary Errors=0
All information on screenshot, what does it mean???

1 Photo
Junior Member

Join Date: Sep 2014

Posts: 7
Hello Dames,

Can you post a brief system configuration, such as make/model/processors, etc? And most importantly, the details of the memory modules. This is a DDR4 ECC protected system and it’s showing vulnerability to a Rowhammer attack. ECC errors are very real errors and can show a tendency towards a system that might yield lower performance, halts, reboots, and in some cases application data corruption. Passmark appears to not count these are errors because they were correctable, but it’s my opinion that they should be regarded as errors. We pay a premium for ECC, so anomalous behaviors should be noted.

Comment
Junior Member

Join Date: Sep 2016

Posts: 3
CPU: Intel Xeon 2xE5-2620v4
MB: Supermicro MBD-X10DRL-i
RAM: 4x16GB DDR4 ECC REG Micron 36ASF2G72PZ-2G1B1

Comment
Administrator

Join Date: Jan 2003

Posts: 9610
Yes, «ECC errors detected» means there were errors, but they were corrected by the hardware. So from an applications point of view there was no error.

This page has more details

Comment
Newbie

Join Date: Sep 2016

Posts: 3
Hi, I just saw this thread which is very similar to my issue and didn’t want to make a new one.

I have x2 «Samsung DDR4-2133 32GB/4Gx72 ECC/REG CL15 Server Memory M393A4K40BB0-CPB» one is showing a lot of ECC correctable errors mostly on test 13, and a couple errors on tests 0,1,2,3 after 1 or 2 passes, and the other one is doesn’t show any errors at all. Does that mean what I think it means that I have one bad memory stick?

Comment
Administrator

Join Date: Jan 2003

Posts: 9610
If you see «ECC errors detected» but the overall error count is zero, then there were no uncorrected errors.

This page has more details (especially about test 13)

So you could decide to live with the errors, knowing that they are correcting themselves, or you could take the hard line and declare it faulty. It depends a bit on how critical the machine was. If this was in a machine being send to Mars by NASA I would replace the stick. If this was a web server running a WordPress blog I’d be tempted to leave it running if the machine was stable. Further depends on if the memory vendor will replace it for free under warranty or not.

Comment
Newbie

Join Date: Sep 2016

Posts: 3
The thing is I wish I could know what causes errors in this one stick in particular despite them being correctable or not… after 3 full passes I have got 700+ «correctable errors»… I mean the other one is completely error-free, so what gives? that is what’s bugging me.
I will try to replace it soon if I can… my system is completely new by the way.
…Yes and I forgot to mention that the reason I began testing is because my OS (FreesNAS, latest version) is keep reporting about a memory error every once in awhile (same exact error, over and over), something like this:

MCA: Bank 5, Status 0xd40000c000900090

MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000

MCA: Vendor «GenuineIntel», ID 0x406d8, APIC ID 0

MCA: CPU 0 COR OVER RD channel 0 memory error

MCA: Address 0x12ef39498
Comment
Administrator

Join Date: Jan 2003

Posts: 9610
MCA = Machine Check Architecture
COR probably means «correctable».

I wish I could know what causes errors in this one stick

Very likely is a bad bit in one of the chips. e.g. a manufacturing defect. Sometimes there are compatibility issues with certain motherboards. The RAM is marginal at certain voltages and speeds. The motherboard vendors typically publish a compatibility list of RAM they have tested.

Comment
Newbie

Join Date: Sep 2016

Posts: 3
One thing for sure, there is definitely something wrong with it. The RAM is on the motherboard’s compatibility list, so no issues there.
Thanks for the help.

Comment
Junior Member

Join Date: Feb 2020

Posts: 3
Hi,

Please advise.

DDR3 Board: Tyan S7053
CPU: Intel XEON E5-2650 @ 2.0 GHz
Memory: 16GB 1600speed SOUDIMM from SMART Modular

I am running row hammer test of V8.3 and get error

«ECC errors detected Test : 13 Channel/Slot: 0/0»
In the past, on an ASUS motherboard a Test 13 error provides the failing address and failing data

In this format

Addr:1250042EC, Expected:04612B33,Actual:04212B33.

why am I not getting the ECC failing address and data for Tyan S7053?

Comment
Administrator

Join Date: Jan 2003

Posts: 9610
Either the memory controller didn’t provide the details or MemTest86 didn’t find (or didn’t decode) the details.
Or maybe the ECC error was really in Channel & Slot 0.

Comment
Junior Member

Join Date: Feb 2020

Posts: 3
Thanks David.
We have a license. You indicated ‘Memtest didn’t find or didn’t decode the details. If Memtest can’t decode or display the data, can Memtest diagnose further and confirm if it is motherboard issue or Memtest issue ?

Comment
Administrator

Join Date: Jan 2003

Posts: 9610
Won’t be a motherboard issue. Motherboard is just wires that connects the CPU to the RAM.
ECC details are kept secret by Intel. So they are hard to support without getting the correct documentation for the particular CPU and an example of that hardware.
This is a pretty old CPU. It is possible that old CPU never supplied valid channel and slot IDs when an error occurs.

Comment

Источник

[H]ard|Ware
SSDs & Data Storage

You are using an out of date browser. It may not display this or other websites correctly.
You should upgrade or use an alternative browser.

Memtest86+ 4.20 ECC reporting

Thread starter

aamsel
Start date

Aug 15, 2011

Joined: Jun 12, 2004

Messages: 940

I am just testing some new server components, including a Supermicro X9SCL board, some Kingston Unbuffered ECC memory and a Xeon CPU, which together should fully support the ECC memory.

Running a boot disk of the current Memtest86+ 4.20 (and some old versions also), the screen reports ECC as OFF, and it can not be enabled.

The Supermicro BIOS shows the ECC memory, but I have no idea what to not trust:
the motherboard, or the test?

I am sure the Xeon is not to blame, and the memory would be very unlikely.

Please advise!

Thanks!!!

Joined: Jul 26, 2011

Messages: 110

Experienced the same issue but with an asus board. Bios shows the memory as ECC enabled but memtest doesn’t. Did you try setting the ecc option in memtest to enable(on)?

Joined: Jun 12, 2004

Messages: 940

Yes, I toggled the ECC option to ON, but it still shows as OFF.
My understanding is that the «turning it on» in Memtest just enables
or disables logging of ECC errors, and does not actually turn on the
ECC itself. The ECC should be operational, but I would like to verify it.
This would not be the first time that I have purchased a setup that was
supposed to run ECC and did not, and I don’t want to be fooled.

Anyone have any other knowledge of this?

Joined: Nov 19, 2008

Messages: 14,937

I have the same with an asus P8B-X and a xeon E3-1230 with 8GB Crucial Unbuffered DDR3 1066 ECC ram.

BTW, What is this thread doing in the Data Storage forum?

Last edited: Sep 15, 2011

Joined: Jun 12, 2004

Messages: 940

I have accepted this as being normal.
The setup obviously supports ECC.

This started as a question related to setup of my
storage server. It is not fully a memory, CPU or Mobo
question, so if a mod feels this belongs elsewhere, please move
or delete it.

Joined: Nov 19, 2008

Messages: 14,937

Thanks for the info. I just tried non-ECC ddr3 in the P8B-X and it will not even post with non-ECC so I believe ECC is working even though memtest86+4.20 says it is not.

Joined: Jul 17, 2011

Messages: 83

From feedback/discussion at: http://www.newegg.com/Product/Product.aspx?Item=N82E16813131725 (Asus P8B WS, C206 board)

Pros: Nothing that outweighs the cons.

Cons: Asus advertises this as supporting ECC memory and so does Newegg, however the motherboard while being able to use ECC memory does not utilize any of the error correction functionality of ECC memory.

From dmidecode

Handle 0x0024, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 32 GB
Error Information Handle: No Error
Number Of Devices: 4

This is simply false advertising. Newegg needs to stop listening this as supporting ECC memory because it does *NOT*

Other Thoughts: Am never going to buy an ASUS again.

and some other guy later responds to this with:

Pros: It is a pleasure to deal with this nice mobo.

Cons: Absolutely nothing

Other Thoughts: As for ECC, the dmidecode output from a previous review is likely not relevant, since the memory controller is on the processor (and not on motherboard). Version 0704 of the BIOS has an option to enable/disable the ECC support.

Could he be on to something?

I did however find this dmidecode output for an X9SCM-F motherboard (through some lucky googling): http://www.thomas-krenn.com/de/wiki/Hardwareinfos_mit_dmidecode_auslesen#Supermicro_X9SCM-F

Handle 0x0027, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	[U][I]Error Correction Type: Single-bit ECC[/I][/U]
	Maximum Capacity: 32 GB
	Error Information Handle: No Error
	Number Of Devices: 4

Joined: Nov 19, 2008

Messages: 14,937

Here is what I get for the ASUS P8B-X:

Handle 0x0030, DMI type 16, 15 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 32 GB
        Error Information Handle: No Error
        Number Of Devices: 4

I belive the BIOS is 0902 which I upgraded today from 0805.

Joined: Jul 17, 2011

Messages: 83

So,
for C202 chipset (Asus P8B-X) it reports Multi-bit ECC
for C204 chipset (SM X9SCM-F) it reports Single-bit ECC
for C206 chipset (Asus P8B WS) it reports None

Interesting trend!

Joined: Jan 9, 2010

Messages: 52

I just posted the following message in the ASUS forums regarding P8B-WS ECC:

I followed essentially the same path. Disabled or «Auto» is not terribly reassuring. Memtest86++ doesn’t add any confidence, nor did the response from Asus tech support. I’m not a happy camper …

So … Here’s the plan. I pulled out the Intel Xeon E3-1200 data sheet (volume 2). PCI device zero has lots of CPU configuration data. Use «lspci -s 0:0.0 -xxxx». From the data sheet (Table 2-7), offset 0x48-0x4F specifies MCHBAR … in my case, the value was 0xFED10001 (YMMV). (The trailing one indicates that mapping is enabled).

Since «dd if=/dev/mem» would not cooperate, you write a short program using mmap on /dev/mem to access the MCHBAR area and write it (32KB) to a file. Then examine hexdump of that file. As per tables 2.16.2 and 2.16.3 from the data sheet, we examine offsets 0x5004 and 0x5008 in the hexdump. In my case, both values were: «10 10 66 03». All of the bit fields seem reasonable. In particular, the «03» is specified as «ECC active in both I/O and ECC logic» !

Now, I have not read the entire data sheet, and there could be more to it … but it looks to me like the Sandy Bridge Xeons perform ECC operations in the processor itself, and the above referenced bits SEEM to be turning on this capability.

My GUESS is that memtest86++ has not yet been updated to support Sandy Bridge Xeons. Lots of time/anguish wasted on this issue … guess that’s why they call it the bleeding edge. Hopefully I can save someone else some trouble … enjoy ..

Joined: Jul 17, 2011

Messages: 83

I just posted the following message in the ASUS forums regarding P8B-WS ECC:

Thank you so much! I was just about to make my own thread about help on deciding mobo for my build, and the asus p8b ws was only an alternative, but now it’s a solid winner in my eyes, and I can finally finalize my build!

What you found seems to be on par with what one of the guys on newegg said, but didn’t supply proof of.

Any chance you could give us a direct link to your post on the asus forums?

Joined: Jan 9, 2010

Messages: 52

Joined: Jul 17, 2011

Messages: 83

If you have non-ECC memory available, could you perhaps try those in the board and compare results?

Joined: Jan 9, 2010

Messages: 52

If you have non-ECC memory available, could you perhaps try those in the board and compare results?

Sorry for the delay in responding. These systems have been «buttoned up» and burned-in for several days, and also I don’t have any NON-ECC memory handy … so I’m not likely to be able to perform the tests you requested.

However, my GUESS is that there wouldn’t be any difference. I suspect that the motherboard turns on the Flag in the processor whenever the motherboard is configured for AUTO. The processor then sends out the extra ECC bits, whether there is real memory there or not. When it reads it back, unless it identifies a correctable error (which probably never happens without the ECC bits populated)
it only sets some error flags which are ignored.

This is all mere speculation on my part. I don’t know if any operating systems have been modified to actually accumulate error correction information from the processor itself …. This issue probably won’t be fully settled until Memtest86 is updated and the Operating Systems are modified to support ECC error tracking and reporting.

It would be nice if the motherboards made some kind of attempt to determine whether or not the ECC is physically present … if nothing else, just to add piece of mind …

Sorry …

Joined: Jan 9, 2010

Messages: 52

Additional messages have been posted to the thread on the ASUS forum. It seems pretty clear that the P8B WS BIOS needs to be modified to include ECC information in the DMI data that is presented to the Operating System. My -guess- is that it IS performing correction, but needs to communicate the ECC configuration via the DMI data so that the Operating Systems (and possible memtest86) can proceed appropriately.

[H]ard|Ware
SSDs & Data Storage

Источник

Я обновил этот пост с тех пор, как заменил процессор, но суть моего вопроса (и, к сожалению, также и результатов) остается прежней.

Я создал свою первую коробку FreeNAS и хотел использовать оперативную память ECC, поскольку хочу хранить критические данные. Поскольку у меня ограниченный бюджет, я хотел выбрать самое доступное решение, которое по-прежнему поддерживало бы ECC RAM.

Проведя некоторые исследования, я обнаружил, что мне нужны материнская плата, память и процессор с поддержкой ECC. Моя материнская плата — Gigabyte X150M-Pro ECC с чипсетом C232, DDR4 и разъемом LGA1151.

Я также купил набор из двух модулей DIMM производства Kingston с номером модели «KVR21E15S8K2 / 8» ( спецификация ). Gigabyte опубликовал список протестированных модулей памяти, и мои модули, похоже, поддерживаются с работающим ECC ( список поддерживаемых модулей ).

Поскольку у меня ограниченный бюджет, мне понадобился недорогой процессор Skylake с поддержкой ECC. Согласно Intel, Celeron G3900 поддерживает ECC, поэтому я выбрал именно этот.

После сборки компьютера я хотел убедиться, что моя система действительно работает с памятью ECC, и вошел в BIOS материнской платы. На разных интернет-сайтах я обнаружил, что на некоторых материнских платах есть специальный раздел, в котором должно быть указано, работает ли ECC, но на моей материнской плате этого нет. Я проверил все меню и не смог найти похожий раздел.

После еще одного исследования и поиска поста об обмене стека в Unix & Linux, который не решил мою проблему. Я попробовал последний, memtest86+который, насколько я мог судить, даже не показывает значение «ECC». Я попробовал старую версию 4.20, которую использовали системы Puget, которая показала «ECC: off». Однако после прочтения ранее упомянутого поста я сомневаюсь, что он говорит правду (может быть, поэтому функция была удалена?). Обе версии также не считали правильную скорость и задержку модуля DIMM, что усиливает мои сомнения в отношении memtest86+.

Другой популярный способ выяснить, работает ли ECC, — ввести dmidecode -t memoryкоманду и прочитать Total Widthи Data Width. Мои результаты были 128 Bitsи 64 Bitsсоответственно. Одна часть вывода показала подробности о массиве памяти, который имел пару ключ-значение Error Correction Type: Single-bit ECC.

Я ожидал , что 72 bitsдля Total Width, так что я думал , что это может быть связано с двухканальной и переместили модули памяти в двух соседних слотов , которые должны предотвратить двойной канал, но результат был тот же. Вот полный выход из dmidecode -t memory.

Я даже опробовал интересную C-программу, которую опубликовали системы Puget, но результат 0показал отсутствие поддержки ECC.

Теперь я начинаю сомневаться, что данные на собственном веб-сайте Intel верны, и мой процессор фактически не поддерживает ECC. И память, и материнская плата специально обозначены знаком «ECC», так что я могу исключить это.

Возможно ли, что версия BIOS нуждается в обновлении (в настоящее время его нет), чтобы включить ECC, или ECC фактически уже работает, и я просто не смог его проверить? Или мой выбор процессора неверен, если я хочу использовать память ECC, а веб-сайт Intel ошибочен / вводит в заблуждение?

Если мой процессор окажется неправильным выбором, что будет следующим лучшим выбором для «бюджетного процессора ECC»?

ОБНОВЛЕНИЕ: я увидел некоторые новые признаки того, что моя система на самом деле может работать с включенным ECC, и dmidecodeинструмент просто сообщает странные данные. На форуме FreeNAS пользователь Dusan использует аппаратное обеспечение серверного уровня (SuperMicro MB, Xeon CPU, Kingston DIMM) и имеет аналогичный выход 128 Bits. Но он написал, что сам не уверен, действительно ли это работает.

ОБНОВЛЕНИЕ 2: Как упоминал yagmoth555 в своем ответе на этот вопрос, похоже, что моя материнская плата поддерживает ECC только с процессорами Xeon, хотя я думал, что это примечание является реликтом из предыдущих руководств, которые были скопированы. Я думаю, это означает, что мне нужно посмотреть на процессор Xeon ..: — /

ОБНОВЛЕНИЕ 3: Я купил Xeon E3-1220v5 сейчас, который, конечно, поддерживает ECC и должен соответствовать требованиям руководства. Я снова запустил все тесты, чтобы проверить функциональность ECC, и результаты в основном идентичны:

Из комментариев на Puget Systems также кажется, что ecc_check.cпрограмма не работает на процессорах Xeon и Core i7 ..: — /

На memtest86+этот раз я проверил еще кое-что, и я совершенно уверен, что он вообще не поддерживает DDR4 или чипсет C232, поскольку он сообщает не только о неправильной скорости и таймингах, но и о DDR3 вместо установленной DDR4. Тем не менее, он обнаружил процессор просто отлично, но я все же получил тот же конечный результат с обеими версиями memtest86+:

Версия 4.20 даже не определяет мой процессор должным образом ..

Любые идеи о том, как еще я могу проверить на ECC, очень ценятся.

Источник

Всем привет, в общем проблема в названии. Неделю назад обновил компьютер, купил новую оперативную память и вот на днях хотел идти уже за второй плажкой.

Комп:
Проц Intel Core i5-8400 @ 2.80GHz
Мать Gigabyte B360M H
Память CRUCIAL CT8G4DFS8266 DDR4 8GB
БП FSP PNR 500W [ATX-500PNR]

Сначала начну с того, почему я решил вообще сделать этот тест. Купил я оперативку на ддр4 в прошлый четверг (была до этого ддр3) на 8гб, проц, мать и по мелочи еще (куллера и тд). После сборки сел поиграть в ассасина последнего, и получил вылеты игры. Причем вылет может быть после часа-двух, а бывает после 15 минут игры. Вылеты продолжались с четверга по сей день, но особо я этому не придавал значение, т.к. вин 10 стоит мб уже больше 3-ех лет, многие драйвера могли быть несовместимы и т.д. Но сегодня у меня случился совсем редкий случай, синий экран. До этого он у меня был всего 1 раз на моей памяти, а тут и недели не прошло с момента обновления компа и даже после этого, я не сразу подумал на оперативку, т.к. в ошибки было сказано что-то про «хард драйв». но посмотрев дамп с помощью BlueScreenViewer и почитав в инете инфу насчет этого, я понял, что проблема может быть и также в оперативке. Прикреплю скрин, мб что-то там разглядите.

В общем скачал я Memtest, и на 3 прогоне всех тестов оперативка запоролась на 7 тесте, было обнаружено 8 ошибок, также прикреплю скрины
Что примечательно, после этого я сделал еще 4 прогона именно теста номер 7 по два раза, т.е. 8 тестов в общем и ни разу не показало ошибку

ПС. Также, стоит отметить, что после ошибок по советам обновил биос, но сомневаюсь, что в перетестах не показало ошибок именно из-за обновлений, т.к. прошлый биос был не такой старый, от августа 18 года

__________________
Помощь в написании контрольных, курсовых и дипломных работ, диссертаций здесь

Источник

Самым надежным вариантом тестирования оперативной памяти является программа Memtest.

Нам необходимо записать утилиту Memtest86 на DVD диск либо флешку или создать загрузочную флешку, в состав которой входит данная программа. Читаем статью «Как создать мультизагрузочную флешку«, следуем пошаговой инструкции. После того как создана загрузочная флешка производим загрузку с неё, обычно для этого я использую «горячие клавиши» F8, F12, F11, Esc+клавиша отвечающая за быструю загрузку, в зависимости от модели мат. платы на компьютере/ноутбуке. После загрузки видим окно:

Заходим в дополнительные утилиты и выбираем Memtest86. Если все прошло успешно видим такую картинку:

После запуска программа будет проверять вашу оперативную память бесконечно, пока вы её не остановите клавишей ESC. Проверять можно как все модули памяти сразу, так и по одной. Проверяя все модули памяти, программа не скажет на какой именно ошибки, так что если есть ошибки, проверяйте лучше по одному модулю. Для проверки лучше сделать несколько циклов. А для максимального эффекта, лучше поставить проверку на ночь . Поле Pass означает количество проделанных циклов. Если у вас будут ошибки в памяти (колонка Error), вы увидите следующее:

Исправить оперативную память при наличии ошибок в программе невозможно. Это не как в жестком диске затереть битые сектора. Я рекомендую вот что:

Самое распространённое — это выход из строя модуля оперативной памяти. Тестируем по одному модулю. Сначала этот модуль ставим в слот под оперативную память №1.

Тестируем,смотрим результат. Если ошибки есть как показано на рис. выше (там где выделено красным в программе Memtest), то ставим этот модуль слот под оперативную память №2. Тестируем,смотрим результат. Если ничего не изменилось,то модуль неисправен и подлежит замене. Меняем или ставим другой модуль памяти,тестируем. Если модуль памяти исправен,по окончанию теста в видим следующее:

Бывает что неисправен слот для оперативной памяти на мат. плате. В этом случае подключаем модуль в другой свободный слот, далее рассматриваем целесообразность работы компьютера в данной конфигурации с неисправным слотом под оперативную память, целесообразность замены материнской платы, а может вас и так всё устроит, и объёма памяти вам будет достаточно.

В программе — 9 тестов:

Test 0 — [Address test, walking ones, no cache] – тестирования для выяснения проблем в адресе памяти.

Test 1 — [Addresstest, ownaddress] – глубокий тест для выявления ошибок в адресационной прописки памяти

Test 2 — [Movinginversions, ones&zeros] – проверка на трудноуловимые и аппаратные ошибки.

Test 3 — [Movinginversions, 8 bitpat] – как и предыдущий тест, только в нем используется алгоритм в 8ми битном подходе от 0 до 1. Тестируется 20 различных методов.

Test 4 — [Moving inversions, random pattern] – Выявление ошибок связанных с data sensitive. В этом тесте 60 различных методов.

Test 5 — [Block move, 64 moves] – Поиск проблем в схемах оперативной памяти.

Test 6 — [Moving inversions, 32 bit pat] – Самый долгий тест для выявления data sensitive errors.

Test 7 — [Randomnumbersequence] – Проверка ошибок в записи памяти.

Test 8 — [Modulo 20, ones&zeros] – Выявление скрытых ошибок в оперативной памяти с помощью буферизации и кеша.

Test 9 — [Bit fade test, 90 min, 2 patterns] – Тест может быть включен вручную. Он записывает адреса в памяти, после чего уходит в сон на 1.5 часа. После выхода из сна, сверяет биты в адресах, на схожесть. Клавишей C для ручного запуска. Тест требует около 3х часов времени.

Теперь вы увидели как проводится тестирование оперативной памяти, как восстановить её работоспособность самостоятельно и проверить с помощью программы Memtest86 с приведенной инструкцией.

Источник

What is ECC memory?

How does ECC work?

Memory write sequence

Memory read sequence

What are the different ECC schemes?

Side-band ECC

Inline ECC

On-die ECC

Link ECC

What’s new in DDR5 ECC RAM?

Is ECC memory neccessary?

What system requirements are needed to enable ECC protection?

Do I need to change BIOS settings?

How do I know when ECC errors are detected?

How does MemTest86 report ECC errors?

How does Windows report ECC errors?

How does Linux report ECC errors?

Installing rasdaemon

Enabling rasdaemon service

Querying ECC error count & summary

What is ECC injection?

Can I use MemTest86 inject ECC errors?

How do I know if my system supports ECC injection?

Why are ECC errors not being reported on my AMD Ryzen system?

MemTest86 detected errors in my memory. Is there something wrong with my RAM?

MemTest86 reported the memory address of the failure. What does this mean?

How do I know which RAM module is failing?

How do I fix the memory errors?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Memtest86+ 4.20 ECC reporting

Самым надежным вариантом тестирования оперативной памяти является программа Memtest.

В программе — 9 тестов:

Читайте также:

MemTest86 detected errors in my memory. Is there something wrong with my
RAM?

MemTest86 reported the memory address of the failure. What does this
mean?