Hi everyone.
We have a Fujitsu PRIMERGY RX4770 M3 server with 32 Samsung 16 GB ECC memory modules (M393A2K40BB1-CRC).
The operating system is Oracle Linux 7.9.
Memory mirroring and memory sparing were not in use.
One day a DIMM started producing correctable errors. After a while the module went into Error state, at which point the operating system started to struggle: CPU throttling kicked in and the load on the server went up.
In the server's own logs it looked like this:
| Sat 30 Jan 2021 14:34:56 PM | Major | 19001A | iRMC S4 | 'MEM3_DIMM-B1': Memory module failure predicted | Memory | Yes |
| Sat 30 Jan 2021 15:25:44 PM | Major | 190033 | BIOS | 'MEM3_DIMM-B1': Too many correctable memory errors | Memory | No |
| Sat 30 Jan 2021 15:25:45 PM | Critical | 190035 | iRMC S4 | 'MEM3_DIMM-B1': Memory module error | Memory | Yes |
| Sat 30 Jan 2021 09:59:37 PM | Major | 190033 | BIOS | 'MEM3_DIMM-B1': Too many correctable memory errors | Memory | No |
| Sat 30 Jan 2021 09:59:37 PM | Major | 190033 | BIOS | 'MEM3_DIMM-B1': Too many correctable memory errors | Memory | No |
and in /var/log/messages like this:
Jan 30 14:34:55 server1 kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 14:34:55 server1 kernel: EDAC MC2: 213 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x3833edc offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:255)
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: event severity: corrected
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: Error 0, type: corrected
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: fru_text: Card03, ChnB, DIMM0
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: section_type: memory error
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: node: 2 card: 1 module: 0
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: event severity: corrected
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: Error 0, type: corrected
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: fru_text: Card03, ChnB, DIMM0
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: section_type: memory error
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: node: 2 card: 1 module: 0
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: error_type: 2, single-bit ECC
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: event severity: corrected
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: Error 0, type: corrected
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: fru_text: Card03, ChnB, DIMM0
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: section_type: memory error
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: node: 2 card: 1 module: 0
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: error_type: 2, single-bit ECC
Jan 30 22:08:37 server1 kernel: perf: interrupt took too long (34740 > 34456), lowering kernel.perf_event_max_sample_rate to 5000
Jan 30 22:11:54 server1 kernel: perf: interrupt took too long (43438 > 43425), lowering kernel.perf_event_max_sample_rate to 4000
Jan 30 22:15:02 server1 kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 22:15:02 server1 kernel: EDAC MC2: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#3_DIMM#1 (channel:3 page:0x32bb2cd offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c3 socket:1 ha:0 channel_mask:8 rank:255)
Jan 30 22:18:05 server1 kernel: perf: interrupt took too long (54573 > 54297), lowering kernel.perf_event_max_sample_rate to 3000
Jan 30 22:24:04 server1 kernel: perf: interrupt took too long (68810 > 68216), lowering kernel.perf_event_max_sample_rate to 2000
After we pulled the memory module and tested it with MemTest86, even fairly long runs did not produce a single error in any of the test modes.
The way I picture it:
For some reason the memory module started producing correctable memory errors. ECC handled these errors just fine, but once the error counter reached a certain threshold, the server decided to take the module out of service. I think so because if it had failed with an uncorrectable error, I would have seen the system reboot. After the module went into Error state the system degraded sharply, and the service running on this server (an Oracle DB) essentially ground to a halt.
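For reference, by "counter" I mean the correctable/uncorrectable error counts that the kernel's EDAC subsystem keeps in sysfs (the EDAC MC2 lines above show the driver is loaded). A quick sketch of how I read them:

# totals per memory controller
$ grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# per-DIMM counts plus the kernel's label for each slot
$ grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count /sys/devices/system/edac/mc/mc*/dimm*/dimm_label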
What I don't understand in all of this, and where I couldn't figure out where to dig:
- Should the OS handle this kind of memory failure gracefully? Why did losing a DIMM affect the CPU? What should I read about MCE and about how the OS handles such failures in general? (See the sketch after this list.)
- For some reason the DIMM that was faulty behaves perfectly normally in another server. Does this mean the problem may be in the server itself, or in the contact between the module and the motherboard? I honestly don't know what to think.
- Maybe I've misunderstood the whole thing; I'd be glad to hear an alternative take.
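On the MCE side, the only tooling I know of for logging and decoding these events on Oracle Linux 7 is rasdaemon; this is just a rough sketch of how I understand it is set up, not something I have fully worked through:

# rasdaemon picks up the kernel's MCE/EDAC trace events and stores them in a small sqlite DB
$ yum install -y rasdaemon
$ systemctl enable rasdaemon
$ systemctl start rasdaemon
# per-DIMM summary and the raw error list
$ ras-mc-ctl --summary
$ ras-mc-ctl --errors
# mapping of the EDAC labels (CPU_SrcID#.../channel/DIMM) to physical slots
$ ras-mc-ctl --layout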
My server running CentOS 8 randomly hangs after 2-3 days of uptime. Checking the vmcore-dmesg shows that the kernel panics because of the following error:
[210684.261133] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[210684.261134] {2}[Hardware Error]: event severity: fatal
[210684.261135] {2}[Hardware Error]: Error 0, type: fatal
[210684.261135] {2}[Hardware Error]: section_type: PCIe error
[210684.261135] {2}[Hardware Error]: port_type: 4, root port
[210684.261136] {2}[Hardware Error]: version: 3.0
[210684.261136] {2}[Hardware Error]: command: 0x0547, status: 0x4010
[210684.261136] {2}[Hardware Error]: device_id: 0000:16:01.0
[210684.261137] {2}[Hardware Error]: slot: 82
[210684.261137] {2}[Hardware Error]: secondary_bus: 0x18
[210684.261137] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2031
[210684.261138] {2}[Hardware Error]: class_code: 000406
[210684.261138] {2}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0013
[210684.261139] {2}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x00100000
[210684.261139] {2}[Hardware Error]: aer_uncor_severity: 0x00062030
[210684.261139] {2}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[210684.261140] Kernel panic - not syncing: Fatal hardware error!
lspci shows the device 16:01.0
$ lspci -s 16:01.0 -vv
16:01.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port B (rev 02) (prog-if 00 [Normal decode])
Physical Slot: 82
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 29
NUMA node: 0
Bus: primary=16, secondary=18, subordinate=1b, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: 97700000-97afffff [size=4M]
Prefetchable memory behind bridge: 0000000092000000-00000000972fffff [size=83M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity+ SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: <access denied>
Kernel driver in use: pcieport
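The "Capabilities: <access denied>" line is just because lspci was run unprivileged; as root it also dumps the port's AER capability, which is where the registers in the panic are latched. If I'm reading the AER register layout right, bit 5 (0x20) in aer_uncor_status is the Surprise Down error bit, i.e. the link below this root port dropped. A rough way to re-check it (assuming the latched bits even survive the reboot):

# AER and link status of the root port, as root
$ sudo lspci -s 16:01.0 -vvv | grep -E 'UESta|UEMsk|UESvrt|CESta|LnkSta'
# link status of the X722 ports hanging off it (bus 1a in the tree below)
$ sudo lspci -s 1a:00.0 -vv | grep -E 'LnkCap|LnkSta'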
$ lspci -s 16:01.0 -tvv
0000:16:01.0-[18-1b]----00.0-[19-1b]----03.0-[1a-1b]--+-00.0 Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.1 Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.2 Intel Corporation Ethernet Connection X722 for 1GbE
                                                      \-00.3 Intel Corporation Ethernet Connection X722 for 1GbE
My question is: is this likely something wrong with my motherboard/CPU, or is the problem in the kernel?
Any help will be appreciated.
Good afternoon all!
Fix Common Problems just notified me that my log folder was filling up, currently about 67% full. I took a look and saw that there are two or three syslog files totaling about 256 MB, so I opened the newest syslog and I'm seeing this repeated:
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
Jun 6 15:31:25 Guardian kernel: nvme 0000:02:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: event severity: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 0, type: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: section_type: PCIe error
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: port_type: 0, PCIe end point
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: version: 0.2
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: command: 0x0406, status: 0x0010
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: device_id: 0000:02:00.0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: slot: 0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: secondary_bus: 0x00
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa80a
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: class_code: 010802
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: Error 1, type: corrected
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: section_type: PCIe error
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: port_type: 0, PCIe end point
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: version: 0.2
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: command: 0x0406, status: 0x0010
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: device_id: 0000:02:00.0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: slot: 0
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: secondary_bus: 0x00
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa80a
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: class_code: 010802
Jun 6 15:31:31 Guardian kernel: {127142}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
device_id: 0000:02:00.0 is my Samsung 980Pro NVMe drive, which is the 2nd drive in my cache pool.
I haven’t noticed this error before and have been running this setup for about 3-4 months. The only thing that has changed is upgrading from 6.9 -> 6.10.1 -> 6.10.2
One thing that's weird: I looked at the attributes for the drive and see "Power on hours 315 (13d, 3h)", which is NOT right....
Its partner drive shows "Power on hours 2,605", which is ~108 days or about 3.5 months, and that sounds about right since they were installed at almost the same time (about a week apart).
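If it helps, this is the extra data I can pull from the console (the /dev/nvme1 path is my guess for this particular drive, it may enumerate differently, and I'm assuming nvme-cli is available):

# negotiated PCIe link speed/width for the 980 Pro (02:00.0 from the AER messages above)
$ lspci -s 02:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
# the drive's own SMART data and error log
$ nvme smart-log /dev/nvme1
$ nvme error-log /dev/nvme1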
Any suggestions?
guardian-diagnostics-20220606-1527.zip