edac-util error "No memory controller data found" - Fixing errors and finding optimal solutions to problems

Hi all,

One of my servers has issues: sometimes it becomes unreachable, and the only thing that helps is a power reboot. So I am testing and checking my hardware.

One of the things I want to test is the memory, obviously. So I installed 'edac-utils' and ran the command:

edac-util -v

Now the output was surprising, to say the least:

edac-util: Error: No memory controller data found.

Strange… So I looked up the error on Google, of course, and saw this:

edac-util: Error: No memory controller data found. This error is generally a result of memory pairs not matching up. Ensure that memory pairs match up on the motherboard, reboot and test using edac-util -v once more.

Okay. So maybe I am using incorrect memory pairs, right? Let's check it with:

dmidecode --type 17 | grep -e 'Size' -e 'Type' -e 'Speed' -e 'Part Number'

Output:

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170GB0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

Size: 4096 MB
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MHz
Part Number: M393B5170FH0-CH9

So total memory is 32GB, and all memory modules are matched (size, speed and even brand). Therefore the information found on the internet is incorrect, or at least doesn't apply here.
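A quick way to double-check a listing like the one above is to tally the part numbers; in that output, one module reports M393B5170GB0-CH9 while the other seven report M393B5170FH0-CH9. Below is a minimal Python sketch, assuming the text of `dmidecode --type 17` is already available as a string. The function name `count_part_numbers` is hypothetical and the sample is just the part numbers copied from the listing.

```python
from collections import Counter

def count_part_numbers(dmidecode_text: str) -> Counter:
    """Count how many DIMMs report each 'Part Number' value."""
    counts = Counter()
    for line in dmidecode_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Part Number":
            counts[value.strip()] += 1
    return counts

# Part numbers in the order they appear in the listing above.
sample = "\n".join(
    ["Part Number: M393B5170FH0-CH9"] * 4
    + ["Part Number: M393B5170GB0-CH9"]
    + ["Part Number: M393B5170FH0-CH9"] * 3
)

for part, n in count_part_numbers(sample).most_common():
    print(f"{n} x {part}")
```

Whether a differing part-number suffix matters for EDAC is not something this check can tell; it only makes the odd module visible.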

Isn’t this weird?

Other causes for my server to hang could be the RAID array (though its firmware was recently updated to the newest available version), or one of the CPUs (maybe a fault or overheating).

Does anyone else have an idea, by any chance?

Thanks in advance.
HHawk

Source

In reply to: comment by targitaj, 05.08.19 21:40:46 MSK

All Ryzen boards will boot with ECC memory, but they may treat it as ordinary memory.

dinn ★★★★★

(05.08.19 21:43:27 MSK)

In reply to: comment by dinn, 05.08.19 21:43:27 MSK

In reply to: comment by targitaj, 05.08.19 21:44:07 MSK

Exactly. For example, MSI: "Supports ECC UDIMM memory (non-ECC mode)"

dinn ★★★★★

(05.08.19 21:44:33 MSK)

Last edited: dinn, 05.08.19 21:45:15 MSK (1 edit in total)

In reply to: comment by targitaj, 05.08.19 21:44:07 MSK

Just ignore the ECC bit; I think BIOSes even have (or had) such an option.

torvn77 ★★★★★

(05.08.19 21:45:20 MSK)

In reply to: comment by dinn, 05.08.19 21:44:33 MSK

So, you are talking about the situation where "the motherboard supports ECC modules but runs them in non-ECC mode", and you want to check on the spot how exactly your RAM will work?

targitaj ★★★★★

(05.08.19 21:52:42 MSK)

In reply to: comment by torvn77, 05.08.19 21:45:20 MSK

"I think BIOSes even have (or had) such an option"

Yes, there is. When it is used, dmidecode does not mention error correction.

dinn ★★★★★

(05.08.19 22:14:51 MSK)
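A minimal Python sketch of the dmidecode check mentioned in the comment above, assuming the output of `dmidecode --type 16` (the Physical Memory Array record) is passed in as text. The helper name `error_correction_type` is hypothetical, and both sample strings are illustrative rather than captured from anyone's machine in this thread.

```python
from typing import Optional

def error_correction_type(dmidecode_text: str) -> Optional[str]:
    """Return the first 'Error Correction Type' value, or None if the field is absent."""
    for line in dmidecode_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Error Correction Type":
            return value.strip()
    return None

# Illustrative samples (assumed, not real captures).
with_ecc = (
    "Physical Memory Array\n"
    "\tLocation: System Board Or Motherboard\n"
    "\tError Correction Type: Multi-bit ECC\n"
)
without_ecc = "Physical Memory Array\n\tError Correction Type: None\n"

print(error_correction_type(with_ecc))
print(error_correction_type(without_ecc))
```

This only reports what the firmware advertises through SMBIOS; as the replies below point out, that is no guarantee that correction actually happens.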

In reply to: comment by targitaj, 05.08.19 21:52:42 MSK

More like the situation where "support exists, but nobody guarantees that it actually works". There will always be options that exist but only work during strictly defined phases of the moon.

dinn ★★★★★

(05.08.19 22:18:35 MSK)

To check and correct an error you need a buffer where that is done. The buffer can be on the memory itself, for example in registered memory, or in the memory controller. Usually all registered DIMMs have ECC, and they need the memory controller to support registered DIMMs. That is, if registered memory works, then ECC works too.

You can economize: make memory with extra bits per word for ECC but leave out the correction logic and hand it to the controller, which also means no buffer is needed on the memory itself. That is unbuffered memory. Then it depends on the controller whether it does ECC, or whether you gain at least one clock cycle and raise memory subsystem performance; the buffer is not free, and neither is error correction.

anonymous

(05.08.19 22:19:21 MSK)

In reply to: comment by anonymous, 05.08.19 22:19:21 MSK

Stop taking substances. The buffer register on the **address** lines of registered memory has nothing to do with ECC whatsoever.

NiTr0 ★★★★★

(05.08.19 22:45:34 MSK)

In reply to: comment by targitaj, 05.08.19 21:45:31 MSK

What do you actually want?
Maybe it would be simpler and more reliable to buy overclocking memory and run it at stock settings?

torvn77 ★★★★★

(05.08.19 22:48:28 MSK)

Look at dmesg.

# dmesg|grep -i edac
EDAC MC: Ver: 2.0.1 Jun  2 2015
EDAC amd64_edac: Ver: 3.4.0
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:  1024MB 1:  1024MB
EDAC amd64: MC: 2:  1024MB 3:  1024MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:  1024MB 1:  1024MB
EDAC amd64: MC: 2:  1024MB 3:  1024MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x4 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS0: Unbuffered DDR2 RAM
EDAC amd64: CS1: Unbuffered DDR2 RAM
EDAC amd64: CS2: Unbuffered DDR2 RAM
EDAC amd64: CS3: Unbuffered DDR2 RAM
EDAC MC0: Giving out device to amd64_edac F10h: DEV 0000:00:18.2

Naturally, the EDAC driver has to be present.

NiTr0 ★★★★★

(05.08.19 22:58:59 MSK)
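The dmesg check above can be sketched in Python. This is a rough illustration, assuming the kernel log has already been captured as text (for example from `dmesg`); the function name `summarize_edac_dmesg` is hypothetical, and the patterns are based only on the message wording quoted in this thread, so other drivers or kernel versions may word things differently.

```python
import re

def summarize_edac_dmesg(dmesg_text: str) -> dict:
    """Report whether ECC is announced as enabled and which MCs were registered."""
    lines = [line for line in dmesg_text.splitlines() if "EDAC" in line]
    # Matches "ECC is enabled by BIOS." as well as "DRAM ECC enabled."
    ecc_enabled = any(re.search(r"ECC (is )?enabled", line) for line in lines)
    # A registered controller shows up as "EDAC MC0: Giving out device to ..."
    controllers = re.findall(r"EDAC (MC\d+): Giving out device", dmesg_text)
    return {
        "edac_lines": len(lines),
        "ecc_enabled": ecc_enabled,
        "controllers": controllers,
    }

# A few lines copied from the dmesg output above.
sample = """EDAC MC: Ver: 2.0.1 Jun  2 2015
EDAC amd64_edac: Ver: 3.4.0
EDAC amd64: ECC is enabled by BIOS.
EDAC MC0: Giving out device to amd64_edac F10h: DEV 0000:00:18.2
"""

print(summarize_edac_dmesg(sample))
```

An empty `controllers` list together with EDAC lines present is the "driver loaded, but no memory controller registered" case that makes edac-util report "No memory controller data found".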

In reply to: comment by torvn77, 05.08.19 22:48:28 MSK

And how will that protect against errors caused by background radiation or "hot" particles in the packaging compound?

And yes, overclocking memory has its timings tightened to the minimum, just so that it works somehow and games do not crash every half hour. On AM2+ I have seen almost any SOHO memory give about 1 error per day in a memtest run: some modules only when paired with others on a channel, some on their own as well. I put ECC memory in the same board and the problem vanished as if by magic: no memtest errors and no corrected or uncorrected memory errors.

NiTr0 ★★★★★

(05.08.19 23:03:44 MSK)

In reply to: comment by NiTr0, 05.08.19 23:03:44 MSK

If memory is sold as overclocking memory, it means the chip was tested at elevated voltages and frequencies, and consequently at elevated temperature, so at stock settings it is very reliable; among other things, it can run for more than 10 hours without errors.

But of course you have to buy good, and therefore expensive, modules.

torvn77 ★★★★★

(05.08.19 23:38:03 MSK)

In reply to: comment by torvn77, 05.08.19 23:38:03 MSK

"If memory is sold as overclocking memory, it means"

That someone stuck unnecessary but pretty heatsinks on it and sold it to suckers at 3 times the price.

~~K22~~

(05.08.19 23:38:54 MSK)

In reply to: comment by K22, 05.08.19 23:38:54 MSK

Well, run ordinary memory at the same settings as overclocking memory, then post the settings and the result here.

torvn77 ★★★★★

(05.08.19 23:40:09 MSK)

Last edited: torvn77, 05.08.19 23:40:43 MSK (1 edit in total)

In reply to: comment by torvn77, 05.08.19 23:38:03 MSK

"it means"

No, it does not.

"good, and therefore expensive"

Buy a brick; it is a good one, since it costs $1000.

anonymous

(05.08.19 23:41:01 MSK)

In reply to: comment by torvn77, 05.08.19 23:40:09 MSK

No, you run it, and then talk from knowledge of the subject rather than off the top of your head.

anonymous

(05.08.19 23:43:18 MSK)

In reply to: comment by anonymous, 05.08.19 23:43:18 MSK

I have used various memory, including tightening timings on the principle of "lower it here, and if the computer works, good enough", and for me the difference between quality overclocking memory and ordinary memory is simply a fact.

What is so hard about understanding a simple thing: chips come out of manufacturing with some spread in characteristics; they are sorted, the best are sold as overclocking parts, and the rest go into the "ordinary" product group and are sold to you, including as ECC Registered.

torvn77 ★★★★★

(06.08.19 00:08:20 MSK)

Last edited: torvn77, 06.08.19 00:10:47 MSK (1 edit in total)

In reply to: comment by torvn77, 06.08.19 00:08:20 MSK

"I have used various memory"

You should have used the same memory: ordinary modules on certain chips and kiddie-overclocker modules on the same chips. With a large enough sample you will definitely find no difference, except perhaps for the very top kits at 4000+ MHz.

anonymous

(06.08.19 00:14:38 MSK)

In reply to: comment by anonymous, 06.08.19 00:14:38 MSK

"except perhaps for the very top kits at 4000+ MHz."

That caveat explains the difference in our opinions.
Though I must admit my last overclocking memory was DDR2; for DDR3 I bought ordinary memory "temporarily", it has somehow been enough for me, and the memory does not glitch…
It still has heatsinks, though.
In general I do not really know what overclocking memory is like these days; I assume it is the same: if you want a lot, pay for the TOP tier.

torvn77 ★★★★★

(06.08.19 00:32:56 MSK)

Last edited: torvn77, 06.08.19 00:34:45 MSK (2 edits in total)

In reply to: comment by torvn77, 06.08.19 00:32:56 MSK

You need to pick fast chips; otherwise you will buy junk that is worse than ordinary memory but branded for kiddie overclockers, with a factory overclock at which it will not even boot.

anonymous

(06.08.19 00:39:20 MSK)

In reply to: comment by anonymous, 06.08.19 00:39:20 MSK

Fast by what criterion: not only frequency but also timings, or some other criterion?

torvn77 ★★★★★

(06.08.19 01:00:32 MSK)

In reply to: comment by torvn77, 06.08.19 01:00:32 MSK

By exactly those criteria, and it depends on the chip type (for example Samsung B-die). It is not one single chip in the world that gets binned; there are old or simply bad chip types, and good ones too.

anonymous

(06.08.19 01:09:03 MSK)

In reply to: comment by torvn77, 05.08.19 23:38:03 MSK

"If memory is sold as overclocking memory, it means the chip was tested at elevated voltages and frequencies, and consequently at elevated temperature, so at stock settings it is very reliable"

Nonsense. For overclocking memory they take, say, 3200CL18 chips and then simply raise the frequency and cut the timings. And "at stock settings" its timings are cut too. As for how a CL18 chip will run at CL16, I think you can guess for yourself.

"among other things, it can run for more than 10 hours without errors."

Oh wow… a whole 10 hours without errors… now that is "reliability", right… And in the 11th hour a bit flips, and the database table whose cache sat in the failed cell waiting to be written to disk now holds garbage instead of data. But hey, the memory is overclocking memory, with RGB lighting and massive, utterly useless heatsinks.

NiTr0 ★★★★★

(06.08.19 01:13:17 MSK)

In reply to: comment by torvn77, 06.08.19 00:08:20 MSK

"I have used various memory, including tightening timings on the principle of 'lower it here, and if the computer works, good enough'"

Yeah, yeah, I remember that 15+ years ago I also overclocked memory on the principle of "the computer works, good enough". So what if the memtest run failed, who cares, the games did not crash; maybe some polygon would occasionally fly off to who knows where…

"What is so hard about understanding a simple thing: chips come out of manufacturing with some spread in characteristics; they are sorted, the best are sold as overclocking parts"

Nobody sorts anything anywhere. They take the very same chips, tighten the timings to the minimum (in all modes), push up the frequencies, stick on heatsinks and RGB lighting, and palm them off on the "happy owners".

NiTr0 ★★★★★

(06.08.19 01:17:04 MSK)

In reply to: comment by targitaj, 05.08.19 21:44:07 MSK

Some modules (see Samsung) can work in "ordinary" mode.

But this is all a gray area; there is little information on such functionality, because most of the target audience simply does not need it.

In reply to: comment by torvn77, 05.08.19 23:38:03 MSK

"among other things, it can run for more than 10 hours without errors"

What scary things you are telling us. If the memory in my computer misbehaved like that, I would be in real trouble over corrupted data.

peregrine ★★★★★

(06.08.19 04:31:48 MSK)

In reply to: comment by peregrine, 06.08.19 04:31:48 MSK

I was recently told here that hundreds of memory errors a day are normal and that any memory will behave that way. I was, frankly, stunned, because I am used to healthy memory not producing errors no matter what you do to it.

anonymous

(06.08.19 04:56:06 MSK)

In reply to: comment by peregrine, 06.08.19 04:31:48 MSK

"What scary things you are telling us."

Why are you all nitpicking my words?
I simply did not test for longer than 10 hours; otherwise my computer runs for several days at a time until I reboot it for some reason of my own.

torvn77 ★★★★★

(06.08.19 05:16:31 MSK)

Last edited: torvn77, 06.08.19 05:16:48 MSK (1 edit in total)

In reply to: comment by torvn77, 06.08.19 05:16:31 MSK

"until I reboot it for some reason of my own"

For example, to clean out the cockroaches that have crawled in?

anonymous

(06.08.19 07:58:17 MSK)

In reply to: comment by K22, 05.08.19 23:38:54 MSK

"That someone stuck unnecessary but pretty heatsinks on it"

If you feed DDR4 1.4 volts and set crazy frequencies, heatsinks become necessary.

Meyer ★★★★★

(06.08.19 08:06:16 MSK)

In reply to: comment by Meyer, 06.08.19 08:06:16 MSK

"If you feed DDR4 1.4 volts"

If you exceed the nominal voltage (1.2 V), you have only yourself to blame.

anonymous

(06.08.19 08:07:47 MSK)

In reply to: comment by torvn77, 05.08.19 23:38:03 MSK

"can run for more than 10 hours without errors"

Then how come my computer runs for weeks without problems?

Deleted

(06.08.19 08:07:54 MSK)

Could you turn ECC off, undervolt the memory (does the board allow that?) until it starts producing errors, and then check what happens once ECC is turned on?

Aber ★★★★★

(06.08.19 08:08:55 MSK)

In reply to: comment by anonymous, 06.08.19 08:07:47 MSK

"If you exceed the nominal voltage (1.2 V), you have only yourself to blame."

At the standard voltage my sticks did not work when I set the frequency to 3733 MHz.

Meyer ★★★★★

(06.08.19 08:09:11 MSK)

In reply to: comment by anonymous, 06.08.19 08:07:47 MSK

"If you exceed the nominal voltage (1.2 V), you have only yourself to blame."

Plus 10% is quite safe; when ICs are designed, some operating margins are built in. Here it is a bit more, of course, but I do not think it is critical. And in general, I remember buying TEAM overclocking modules back in the DDR2 days, and they already had raised voltages written into their profiles by default.

Just now I looked up the first TEAM memory module I came across:

Frequency 	2666 	3000
Voltage 	1.2V 	1.35V

Aber ★★★★★

(06.08.19 08:18:49 MSK)

Last edited: Aber, 06.08.19 08:20:24 MSK (1 edit in total)
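For reference, the "plus 10%" rule of thumb can be compared with the voltages mentioned in this exchange (1.2 V nominal, 1.35 V in the TEAM profile, 1.4 V from the earlier comment). A small Python sketch of the arithmetic; the function name `overvolt_percent` is made up for illustration.

```python
def overvolt_percent(volts: float, nominal: float = 1.2) -> float:
    """Percentage by which `volts` exceeds the nominal DDR4 voltage."""
    return (volts / nominal - 1.0) * 100.0

for v in (1.35, 1.4):
    print(f"{v} V is {overvolt_percent(v):.1f}% above 1.2 V")
```

So the 1.35 V profile sits at 12.5% over nominal and 1.4 V at roughly 16.7%, both above the 10% figure.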

Look here:

cat /sys/devices/system/edac/mc/mc0/ue_count
For other platforms (which ones?) you can install mcelog and watch its logs.

If you find anything new, add it to the wiki.

LeNiN ★★

(06.08.19 13:28:11 MSK)
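The same sysfs check can be done for every registered memory controller at once. Below is a minimal Python sketch, assuming the standard EDAC sysfs layout under /sys/devices/system/edac/mc; `ue_count` is the file named above, while reading `ce_count` (the corrected-error counter next to it) and the function name `read_edac_counts` are additions for illustration.

```python
from pathlib import Path

EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict:
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...} for each mcN directory."""
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts

if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        # Same situation edac-util reports as "No memory controller data found".
        print("no memory controllers registered under", EDAC_MC_ROOT)
    for mc_name, entry in result.items():
        print(mc_name, entry)
```

An empty result means no mcN directory exists, i.e. no EDAC driver has registered a controller, which is a different problem from the counters being zero.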

In reply to: comment by anonymous, 06.08.19 04:56:06 MSK

The longest uptime on my computer (not a server) is about 5 months. If there had been hundreds of memory errors a day, 15k errors would have had to show themselves somehow. Admittedly, my memory is ECC, working as ECC memory. A server at work ran with an uptime of 1.5 years and ~1 TB of RAM for database caching. Admittedly, I have no idea how it was set up inside or what the hardware/virtualization was. Nothing failed, as long as klutzes like me did not upload buggy code to it.

peregrine ★★★★★

(06.08.19 16:35:27 MSK)

In reply to: comment by anonymous, 06.08.19 04:56:06 MSK

Then why only in RAM and not in the processor (which itself consists of a vastly larger number of transistors working in vastly harsher conditions) and in all other chips in general? And how does it all work, then?

anonymous

(06.08.19 16:38:59 MSK)

In reply to: comment by torvn77, 05.08.19 23:38:03 MSK

"can run for more than 10 hours without errors"
"you have to buy good, and therefore expensive, modules"

And if you take the most ordinary memory, such as https://mobilespecs.net/memory/AMD/AMD_R738G1869U2-US.html, then

$ uptime -p
up 5 weeks, 2 days, 14 hours, 10 minutes

and more is so much the norm that talk of "10 hours without errors" as something outstanding is baffling.

dexpl ★★★★★

(06.08.19 17:48:43 MSK)

In reply to: comment by dexpl, 06.08.19 17:48:43 MSK

The fact that I tested for only 10 hours does not mean the memory will give an error in the 11th hour; it means that I tested the memory for 10 hours.

torvn77 ★★★★★

(06.08.19 18:07:16 MSK)

In reply to: comment by anonymous, 06.08.19 16:38:59 MSK

I will let you in on a secret: CPUs have ECC in their caches. Yes, even though that is SRAM rather than DRAM, and even though the caches are a few megabytes in size rather than tens of gigabytes.

NiTr0 ★★★★★

(06.08.19 18:39:41 MSK)

In reply to: comment by NiTr0, 06.08.19 18:39:41 MSK

I know. The rest of the CPU has no ECC.

anonymous

(06.08.19 18:40:50 MSK)

In reply to: comment by peregrine, 06.08.19 04:31:48 MSK

Hehe. Run a simple experiment. Copy a terabyte of torrents back and forth a couple of times and then re-hash them. Oh, that was a day of discovery for me… In short, data gets corrupted constantly and continuously. End-to-end ECC on all communication links is simply a must-have.

targitaj ★★★★★

(06.08.19 18:43:43 MSK)
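The copy-and-re-hash experiment can be approximated without a torrent client. This is a sketch in Python under stated assumptions: it compares SHA-256 digests of whole files in a source tree and its copy, which is a stand-in for the per-piece hash check a torrent client performs, and the function names `sha256_of` and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files do not have to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatches(src_dir: Path, dst_dir: Path) -> list:
    """Return relative paths whose copy is missing or hashes differently."""
    bad = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(str(rel))
    return bad

# Usage (paths are placeholders):
# print(find_mismatches(Path("/data/original"), Path("/mnt/copy")))
```

A mismatch only says that the bytes differ somewhere along the path (disk, controller, cable, RAM, network); it does not identify which component corrupted them.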

In reply to: comment by targitaj, 06.08.19 18:43:43 MSK

Funny, I do that and have never once seen anything get corrupted. Not terabytes, admittedly; hundreds of gigabytes.

anonymous

(06.08.19 18:46:10 MSK)

In reply to: comment by anonymous, 06.08.19 18:46:10 MSK

Yes, it depends on the hardware, that is a fact. Some people get dozens of RAM errors a day. Back then I copied 500 GB back and forth and ended up with every 10th file at 99% complete. That was unexpected, to put it mildly. That was when I really understood why end-to-end ECC is needed on all communication paths.

targitaj ★★★★★

(06.08.19 18:47:48 MSK)

Last edited: targitaj, 06.08.19 18:48:51 MSK (2 edits in total)

In reply to: comment by NiTr0, 05.08.19 22:58:59 MSK

"Look at dmesg."

This is where the fun begins:

edac-util: Error: No memory controller data found

lsmod | grep -i edac
edac_mce_amd           32768  0

dinn ★★★★★

(06.08.19 23:50:16 MSK)

Last edited: dinn, 06.08.19 23:55:19 MSK (1 edit in total)

Source

I'm running an AMD Ryzen 3600X with 64GB of ECC RAM on an ASRockRack X470D4U2-2T motherboard, and I've noticed that the kernel is throwing some EDAC amd64 errors.

Here is the output of the dmesg command:
[ 14.882566] EDAC amd64: Node 0: DRAM ECC enabled.
[ 14.882579] EDAC amd64: F17h detected (node 0).
[ 14.882595] EDAC amd64: Error: F0 not found, device 0x1460 (broken BIOS?)
[ 14.882609] EDAC amd64: Error: Error probing instance: 0

Running the ‘edac-util’ command shows this:
# edac-util -v
edac-util: Error: No memory controller data found.

It looks like EDAC amd64 isn't working properly with the kernel currently shipped with Proxmox 6.1. I've tried to compile the latest kernel, but I'm unable to boot from it (I can't figure out how to set up GRUB properly on Proxmox; there isn't much documentation on this or help in the forums). Anyone have some tips? ;-)

Any chance the Proxmox developers will backport the kernel 5.4 patches for amd64 EDAC to the current 5.3 kernel?

Anyone else running into this?

Thanks!

I have the same problem with my 3600 on an X470D4U; I think we need to wait.
However, EDAC reporting is not supported on either of these motherboards.

According to this topic https://forums.unraid.net/bug-repor…-ecc-error-with-ryzen-3700x-and-ecc-ram-r651/ this will be fixed in the newer 5.4 kernel. It would be awesome if Proxmox would backport the patches to the 5.3 kernel or release a more up to date kernel to solve this problem.

The reason this is a problem is that ZFS recommends using ECC RAM. My hardware combination supports this and so does Linux. Proxmox just needs to catch up on this. *hint hint*

In fact, ECC works; EDAC only concerns the reporting ability.

Great to hear that! So the reporting of EDAC doesn't work but the ECC does work? How do I check this exactly? I just want to be 100% sure.

So EDAC is just a reporting utility and has nothing to do with the working of ECC memory and the kernel?

This is not limited to Ryzen; it is also present on EPYC. The PVE kernel needs backported support for Zen 2 CPUs.

Source

Description

Guy Streeter

2011-03-08 22:02:45 UTC

On a DELL R610 system (dell-per610-01.lab.bos.redhat.com for example) with Nehalem-EP CPUs running RHEL-5.6, EDAC does not recognise the memory controller.

It loads the module correctly:
[root@dell-per610-01 ~]# dmesg |grep -i EDAC
EDAC MC: Ver: 2.0.1 Feb 18 2011
EDAC i7core: Driver loaded.

And you can see the mc directory in the sysfs interface:
[root@dell-per610-01 ~]# ls /sys/devices/system/edac/mc
log_ce  log_ue  panic_on_ue  poll_msec

But it's not actually seeing any memory controller; normally you'd expect something like this:
EDAC MC0: Giving out device to i7core_edac.c i7 core #0: DEV 0000:ff:03.0
EDAC MC1: Giving out device to i7core_edac.c i7 core #1: DEV 0000:fe:03.0
which is not the case here.
This is corroborated by the output from edac-util -r:
[root@dell-per610-01 ~]# /usr/bin/edac-util -r
edac-util: Error: No memory controller data found.

Source

tpetazzoni pushed a commit that referenced this issue on Jul 18, 2017

…/kernel/git/joro/iommu

Pull IOMMU updates from Joerg Roedel:
 "This update comes with:

   - Support for lockless operation in the ARM io-pgtable code.

     This is an important step to solve the scalability problems in the
     common dma-iommu code for ARM

   - Some Errata workarounds for ARM SMMU implemenations

   - Rewrite of the deferred IO/TLB flush code in the AMD IOMMU driver.

     The code suffered from very high flush rates, with the new
     implementation the flush rate is down to ~1% of what it was before

   - Support for amd_iommu=off when booting with kexec.

     The problem here was that the IOMMU driver bailed out early without
     disabling the iommu hardware, if it was enabled in the old kernel

   - The Rockchip IOMMU driver is now available on ARM64

   - Align the return value of the iommu_ops->device_group call-backs to
     not miss error values

   - Preempt-disable optimizations in the Intel VT-d and common IOVA
     code to help Linux-RT

   - Various other small cleanups and fixes"

* tag 'iommu-updates-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (60 commits)
  iommu/vt-d: Constify intel_dma_ops
  iommu: Warn once when device_group callback returns NULL
  iommu/omap: Return ERR_PTR in device_group call-back
  iommu: Return ERR_PTR() values from device_group call-backs
  iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()
  iommu/vt-d: Don't disable preemption while accessing deferred_flush()
  iommu/iova: Don't disable preempt around this_cpu_ptr()
  iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126
  iommu/arm-smmu-v3: Enable ACPI based HiSilicon CMD_PREFETCH quirk(erratum 161010701)
  iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #74
  ACPI/IORT: Fixup SMMUv3 resource size for Cavium ThunderX2 SMMUv3 model
  iommu/arm-smmu-v3, acpi: Add temporary Cavium SMMU-V3 IORT model number definitions
  iommu/io-pgtable-arm: Use dma_wmb() instead of wmb() when publishing table
  iommu/io-pgtable: depend on !GENERIC_ATOMIC64 when using COMPILE_TEST with LPAE
  iommu/arm-smmu-v3: Remove io-pgtable spinlock
  iommu/arm-smmu: Remove io-pgtable spinlock
  iommu/io-pgtable-arm-v7s: Support lockless operation
  iommu/io-pgtable-arm: Support lockless operation
  iommu/io-pgtable: Introduce explicit coherency
  iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
  ...

Source

Originally posted by Jeff

This patch does not work for the new Ryzen 3000 Zen2 series of AMD CPUs.

I got the Ryzen 3900X with ECC memory, and this EDAC patch does not support the new Ryzen as far as I can tell… The new Ryzen I have is Family 17h, model 71h, and the patch was for F17_M30H (0x30 to 0x3F only). So there is currently no ECC support in the Linux kernel for the new Ryzen CPUs that have been released. I managed to patch the Linux kernel (5.2.1) by changing the PCI device IDs, which appear to be different from those of all other devices so far in the AMD EDAC driver (different from F17 M30H as well). It appears to load the EDAC driver on boot and detect all ECC DIMMs properly, but it does not report any ECC CE or UE errors, although, based on my memory overclocking test, they appear to be happening and being corrected when ECC is enabled.

It is a bit disappointing that it does not work; when I read this some time ago, I felt that this would support all new Zen 2 CPUs, not just EPYC.
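The family/model mismatch described above can be checked mechanically. Below is a small Python sketch, assuming /proc/cpuinfo-style text in which `cpu family` and `model` are printed in decimal (23 for family 17h, 113 for model 71h); the function names are hypothetical and the sample string is illustrative, not copied from the poster's machine.

```python
def parse_cpuinfo(text: str) -> tuple:
    """Extract (family, model) as integers from /proc/cpuinfo-style text."""
    family = model = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "cpu family" and family is None:
            family = int(value)
        elif key == "model" and model is None:  # "model name" is a different key
            model = int(value)
    return family, model

def in_f17_m30h_range(family: int, model: int) -> bool:
    """True if the CPU is family 17h with a model in the range 0x30..0x3F."""
    return family == 0x17 and 0x30 <= model <= 0x3F

# Illustrative sample for a family 17h, model 71h part.
sample = "cpu family\t: 23\nmodel\t\t: 113\nmodel name\t: AMD Ryzen 9 3900X 12-Core Processor\n"

fam, mod = parse_cpuinfo(sample)
print(hex(fam), hex(mod), in_f17_m30h_range(fam, mod))
```

Model 0x71 falls outside 0x30..0x3F, which is consistent with the observation that a patch limited to that range would not cover this CPU.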

Not 100% sure what you mean by "does not support the new ryzen as far as I can tell", as you posted no actual findings, but does the output below prove, in your opinion, that Linux 5.4 does support it?

Ubuntu 19.10 (Linux kernel 5.3)

[email protected]:~# find /lib/modules/5.3.0-19-generic/ | grep -i -E 'edac'
/lib/modules/5.3.0-19-generic/kernel/drivers/edac
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i7core_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/skx_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/amd64_edac_mod.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5100_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i10nm_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/x38_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i3000_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/sb_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i3200_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i7300_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5400_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i82975x_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/edac_mce_amd.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/e752x_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/pnd2_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/ie31200_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5000_edac.ko
[email protected]:~# apt list edac-utils
Listing… Done
edac-utils/eoan,now 0.18-1build1 amd64 [installed]
edac-utils/eoan 0.18-1build1 i386
[email protected]:~# edac-util -vs
edac-util: EDAC drivers loaded. No memory controllers found
[email protected]:~# edac-util -v
edac-util: Error: No memory controller data found.
[email protected]:~#

Fedora Rawhide (Linux kernel 5.4)

[[email protected] ~]# find /lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/ | grep -i -E 'edac'
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/amd64_edac_mod.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5000_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/sb_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/skx_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/x38_edac.ko.xz
[[email protected] ~]# yum info edac-utils
Last metadata expiration check: 0:01:47 ago on Sun 27 Oct 2019 11:44:47 PM CET.
Installed Packages
Name : edac-utils
Version : 0.16
Release : 21.fc31
Architecture : x86_64
Size : 101 k
Source : edac-utils-0.16-21.fc31.src.rpm
Repository : @System
From repo : rawhide
Summary : Userspace helper for kernel EDAC drivers
URL : http://sourceforge.net/projects/edac-utils/
License : GPLv2+
Description : EDAC is the current set of drivers in the Linux kernel that handle
: detection of ECC errors from memory controllers for most chipsets
: on i386 and x86_64 architectures. This userspace component consists
: of an init script which makes sure EDAC drivers and DIMM labels
: are loaded at system startup, as well as a library and utility
: for reporting current error counts from the EDAC sysfs files.
[[email protected] ~]# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
mc0:F17h_M70h
[[email protected] ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[[email protected] ~]#
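For monitoring, the `edac-util -v` report shown above can be reduced to two totals. This is a minimal Python sketch that parses only the line shapes visible in that output; the function name `total_errors` is hypothetical, and other edac-utils versions may format their report differently.

```python
import re

# Matches e.g. "mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors"
# and "mc0: 0 Uncorrected Errors with no DIMM info".
LINE_RE = re.compile(r"^(mc\d+):.*?(\d+) (Corrected|Uncorrected) Errors")

def total_errors(edac_util_output: str) -> dict:
    """Sum the corrected and uncorrected counts over all report lines."""
    totals = {"Corrected": 0, "Uncorrected": 0}
    for line in edac_util_output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            totals[m.group(3)] += int(m.group(2))
    return totals

# The report lines from the Fedora Rawhide run above.
sample = """mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
"""

print(total_errors(sample))
```

The "No memory controller data found" error line matches nothing, so both totals stay at zero in that case; a caller should treat that message separately rather than as a clean report.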

Source

[TGL] EDAC support for TGL(IoTG)

Bug #1842227 reported by quanxian on 2019-09-01

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
intel	Fix Released	Undecided	Unassigned
Lookout-canyon-series	Fix Released	Undecided	Unassigned	intel sprint2
linux (Ubuntu)	In Progress	Undecided	Unassigned

Bug Description

Description
EDAC driver support on TGL(IoTG) for reporting ECC error and DIMM location

Commit ids: 0b7338b27e82
Target Release: 22.04
Target Kernel: 5.14

Source

slick
Bodhisattva

Joined: 20 Apr 2003
Posts: 3495

Posted: Sun May 05, 2019 11:40 pm Post subject: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d4000

During large rsync jobs (> 5 TB, millions of miscellaneous files) with a lot of I/O, I sometimes (but seldom) get this. What is happening here?

It happens only with rsync, not with cp or mv. The filesystem is ZFS over plain dm-crypt.

Code:

[52925.283857] mce: [Hardware Error]: Machine check events logged
[52925.283862] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d400008000910091
[52925.283867] mce: [Hardware Error]: TSC 0 ADDR 3e734f1e8
[52925.283872] mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1557098412 SOCKET 0 APIC 0 microcode 121
[52925.283875] mce: [Hardware Error]: Machine check events logged
[52925.283877] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: d400008000910091
[52925.283879] mce: [Hardware Error]: TSC 0 ADDR 3e734f1e8
[52925.283883] mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1557098412 SOCKET 0 APIC 2 microcode 121

Searching around, I found some commands to analyse it, but I can’t understand what the output is telling me.

Code:

# ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
MCE records summary:
10 MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error errors

Code:

# ras-mc-ctl --errors

No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:

1 2019-05-04 08:26:35 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd309a, cpuid=0x000406d8, bank=0x00000005

2 2019-05-04 08:26:35 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd309a, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005

3 2019-05-04 09:54:47 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd4546, cpu=0x00000002, cpuid=0x000406d8, apicid=0x00000004, bank=0x00000005

4 2019-05-04 09:54:47 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd4546, cpu=0x00000003, cpuid=0x000406d8, apicid=0x00000006, bank=0x00000005

5 2019-05-04 12:56:22 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1c0, walltime=0x5ccd6fd6, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005

6 2019-05-04 13:22:19 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1d0, walltime=0x5ccd75ea, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005

7 2019-05-04 15:42:24 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd96bf, cpu=0x00000002, cpuid=0x000406d8, apicid=0x00000004, bank=0x00000005

8 2019-05-05 11:14:31 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccea977, cpu=0x00000004, cpuid=0x000406d8, apicid=0x00000008, bank=0x00000005

9 2019-05-06 01:20:12 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1e8, walltime=0x5ccf6fac, cpuid=0x000406d8, bank=0x00000005

10 2019-05-06 01:20:12 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1e8, walltime=0x5ccf6fac, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
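
The two status words in these records (0x9400004000910091 and 0xd400008000910091) can be decoded by hand. The sketch below is in Python and assumes the architectural IA32_MCi_STATUS layout from Intel's documentation applies to this bank. Reading bits 52:38 as the corrected-error count is an extra assumption, since that field is only defined architecturally when the CPU advertises CMCI, although the result does agree with the n_errors values printed above. The decode_status helper is a hypothetical name.

```python
# Flag bits of IA32_MCi_STATUS (architectural layout, assumed to apply here).
FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # the error was NOT corrected
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status):
    """Split a 64-bit machine-check status word into its main fields."""
    info = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        # Assumption: bits 52:38 are the corrected-error count (see lead-in).
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": status & 0xFFFF,
    }
    code = info["mca_code"]
    if code & 0xFF80 == 0x0080:  # memory-controller error class
        info["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return info


for word in (0x9400004000910091, 0xD400008000910091):
    print(hex(word), decode_status(word))
```

Under these assumptions both words decode to a memory read error on channel 1 with UC clear, that is, a corrected error. The second word additionally has OVER set and a count of 2, which matches the "Error_overflow" and "n_errors=2" fields in records 5, 9 and 10.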

Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it’s ECC RAM.)

CPU is:

Code:

# cat /proc/cpuinfo

processor : 0

vendor_id : GenuineIntel

cpu family : 6

model    : 77

model name : Intel(R) Atom(TM) CPU C2750 @ 2.40GHz

stepping : 8

microcode : 0x121

cpu MHz    : 2599.865

cache size : 1024 KB

physical id : 0

siblings : 8

core id    : 0

cpu cores : 8

apicid    : 0

initial apicid : 0

fpu    : yes

fpu_exception : yes

cpuid level : 11

wp    : yes

flags    : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat

bugs    : cpu_meltdown spectre_v1 spectre_v2

bogomips : 4799.73

clflush size : 64

cache_alignment : 64

address sizes : 36 bits physical, 48 bits virtual

power management:

… 8 Cores

mike155
Advocate

Joined: 17 Sep 2010
Posts: 4312
Location: Frankfurt, Germany

Posted: Mon May 06, 2019 12:26 am Post subject:

Quote:

Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it’s ECC RAM.)

Looks like a memory error which was corrected by ECC logic. I would replace the faulty DIMM as soon as possible.

What does edac-util tell you?

Code:

edac-util -v

bunder
Bodhisattva

Joined: 10 Apr 2004
Posts: 5930

Posted: Mon May 06, 2019 1:42 am Post subject:

are you overclocking your memory? one thing you could try is turning off XMP in the BIOS.

slick
Bodhisattva

Joined: 20 Apr 2003
Posts: 3495

Posted: Mon May 06, 2019 8:00 am Post subject:

mike155 wrote:

Quote:

Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it’s ECC RAM.)

Looks like a memory error which was corrected by ECC logic. I would replace the faulty DIMM as soon as possible.

What does edac-util tell you?

Code:

edac-util -v

How do I identify the broken RAM module? There are 4 installed.

Freshly installed, it reports nothing. Do I have to wait for the next crash first?

Code:

# edac-util -v

edac-util: Error: No memory controller data found.

bunder wrote:

are you overclocking your memory? one thing you could try is turning off XMP in the BIOS.

Not overclocked. Defaults wherever possible.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 51961
Location: 56N 3W

Posted: Mon May 06, 2019 11:06 am Post subject:

slick, boot into memtest86 and run a few cycles. You must boot into it, as running a memory test through the kernel’s memory manager will only tell you that you have a fault, not where. You need several cycles. The same error at the same address indicates that it’s probably a RAM error. Random errors only tell you that it’s a memory subsystem error.
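
The "same error at the same address" check can already be applied to the ten ras-mc-ctl records quoted earlier in the thread. This is a minimal Python sketch that groups the addr= values by 64-byte cache line (the clflush size reported in /proc/cpuinfo above). Grouping by cache line rather than by exact address is an assumption made for illustration, and addresses_by_cache_line is a hypothetical helper.

```python
import re
from collections import Counter

ADDR_RE = re.compile(r"addr=0x([0-9a-fA-F]+)")

# The addr= fields of records 1 to 10 from the `ras-mc-ctl --errors` output.
SAMPLE = " ".join(
    [
        "addr=0x3e734f1c0", "addr=0x3e734f1c0", "addr=0x3e734f1c0",
        "addr=0x3e734f1c0", "addr=0x3e734f1c0", "addr=0x3e734f1d0",
        "addr=0x3e734f1c0", "addr=0x3e734f1c0", "addr=0x3e734f1e8",
        "addr=0x3e734f1e8",
    ]
)


def addresses_by_cache_line(log_text, line_size=64):
    """Count reported error addresses per aligned block of `line_size` bytes."""
    counts = Counter()
    for match in ADDR_RE.finditer(log_text):
        addr = int(match.group(1), 16)
        counts[addr - addr % line_size] += 1  # round down to the line start
    return counts


for base, hits in addresses_by_cache_line(SAMPLE).items():
    print(hex(base), hits)
```

All ten reported addresses fall into the single 64-byte line starting at 0x3e734f1c0, which is the "same address" pattern described above rather than a random spread.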


Source

Entz

Well-known member

Joined: Jul 17, 2011

Messages: 1,878

Location: Kelowna

JD

Moderator

Joined: Jul 16, 2007

Messages: 11,262

Location: Toronto, ON

Thanks, one of the migration issues it seems. OP has been edited.

After spending quite some time testing, I have an update on this great deep dive, but also some questions about it and issues with it.

Some background on why this is an update on this great deep dive:
I’m testing on a server-grade X470 mobo (ASRock Rack) with a Ryzen 3600 CPU and a BIOS based on AGESA 1.0.0.3 ABBA (not yet officially released by ASRock Rack), on the latest Windows 10 1903 and, for Linux, the latest Fedora Rawhide.

Windows

For the first command, the article lists the possible values: 2 (unknown), 3 (none), 4 (parity), 5 (single-bit ECC), or 6 (multi-bit ECC).
So that looks better than it used to!
For the second command, the result also looks better (TotalWidth is larger than DataWidth), except that “TotalWidth” is doubled (128) instead of the 72 that the article expected.
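
The two checks can be written down as a small lookup and a comparison. This is only a sketch in Python: the code-to-name table is the one quoted from the article, describe_ecc and has_extra_check_bits are hypothetical helper names, and the DataWidth of 64 used in the example call is an assumption, since the post does not state its value.

```python
# Value-to-meaning table as quoted from the article.
ECC_TYPES = {
    2: "unknown",
    3: "none",
    4: "parity",
    5: "single-bit ECC",
    6: "multi-bit ECC",
}


def describe_ecc(code):
    """Translate the numeric error-correction code into a readable name."""
    return ECC_TYPES.get(code, "unrecognized code %d" % code)


def has_extra_check_bits(total_width, data_width):
    """ECC modules carry check bits, so TotalWidth exceeds DataWidth."""
    return total_width > data_width


print(describe_ecc(6))
# 72 is the TotalWidth the article expected; 64 is an assumed DataWidth.
print(has_extra_check_bits(72, 64))
```

The comparison only shows that check bits are present on the modules; as the rest of this post goes on to test, it does not show that errors are actually being reported.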

CPU-Z, HWiNFO64 and AIDA64 now also recognize the ECC RAM correctly, and AIDA64 additionally reports that ECC is enabled.

Linux

Also in Linux everything looks ok: ‘DRAM ECC enabled’ and ‘using x16 syndromes’.

But then the actual testing
For this I’ve overclocked the memory from 1333 MHz to 1500 MHz, keeping all timings the same. At 1533 MHz or 1567 MHz the mobo no longer POSTs and needs a CMOS clear to recover.

These are my default settings (the memory is at the bottom right)

And these my overclocked settings

However, with the overclocked settings I’m failing to log any memory error at all on both Windows and Linux…

memtester, memtest86 and Prime95 Blend can all run for hours without error at this speed.

I suspect that ECC actually does work and corrects many errors but doesn’t report anything to any OS (because increasing the frequency only slightly further causes the board to stop POSTing altogether).

Please read further in the next post…

Last edited: Oct 24, 2019

In the IPMI I’m also not finding any errors being reported:

I also tried disabling the ECC functionality to see whether I could make any of the stress-test programs crash, or whether the OS would then receive reports of uncorrected errors (this would at least prove that my memory is “unstable” at this frequency).
But that failed too. Even with ECC disabled, I get no errors in Linux or Windows (I haven’t checked the IPMI in this scenario yet) and no crashes either.

I’ve used below BIOS settings for trying this (not sure if this is correct / sufficient though).

These settings are the defaults (but they show the BIOS maze I went through to get there):

I’ve tried to change ‘DRAM ECC Enable’ to ‘Disabled’ and after that also ‘DRAM UECC Retry’ to ‘Disabled’:

Can someone help me figure out how to fix my ECC error reporting to the OS (and/or the IPMI) or explain what I’m doing wrong?

Thanks!

Last edited: Oct 24, 2019

This is some cool stuff. After reading the article again, I can see it is in serious need of updating, as it’s 3 years old and the Ryzen platform has come a long way since then. A refresh could show the better overall picture now.

Entz

Well-known member

Joined: Jul 17, 2011

Messages: 1,878

Location: Kelowna

Can someone help me figure out how to fix my ECC error reporting to the OS (and/or the IPMI) or explain what I’m doing wrong?

Are you 100% sure you are actually having ECC errors? There is no guarantee that a bit flip will occur during stress testing or with unstable memory. Are you changing timings or just frequency? The article did it with frequency.

Hi,

No, I’m not 100% sure that I’m actually having ecc errors. But it would surprise me if I didn’t…

Also, before trying it the way described above, I accidentally left all memory timings set to auto, and when I started increasing the frequency, the board automatically loosened the timings, which gives a completely different scenario. With the loosened timings I could increase the frequency from 1333 MHz to 1967 MHz before it stopped booting. But in that scenario, too, no memory errors were reported at, for example, 1933 MHz.

If you can tell me which timing I should decrease while keeping all other settings at default, I’m of course very happy to try it and see if it makes a difference…

nToxik

Well-known member

Joined: Apr 7, 2008

Messages: 193

There was similar testing done on the Unraid forums on Reddit as well as the official Unraid forums.

Has anyone else had ECC memory function properly with X470 Taichi and Ryzen 3600? from Amd

For Ryzen builds, ECC ‘looks’ like it is functioning but it really isn’t. I’m not sure if this is motherboard/vendor specific or not.

Entz

Well-known member

Joined: Jul 17, 2011

Messages: 1,878

Location: Kelowna

If you can tell me which timing I should decrease while keeping all other settings at default, I’m of course very happy to try it and see if it makes a difference…

Yeah, I am not sure what it would take to simulate. You need to get the timings such that writes work perfectly fine and just a few reads fail. Too many errors, or too big an error (unrecoverable), and the system will crash.

I have never overclocked ECC RAM, as that is kind of counterproductive, so I am not sure what it would take.

That is assuming it is even working at all. I would expect the errors to show up on the IPMI side rather than in the OS if it is a driver issue, and if that isn’t working, it likely isn’t catching them, or you’re just extremely lucky (writing 10 = reading 10) until you hit a speed where nothing works.

There was similar testing done on the Unraid forums on Reddit as well as the official Unraid forums.

Has anyone else had ECC memory function properly with X470 Taichi and Ryzen 3600? from Amd

For Ryzen builds, ECC ‘looks’ like it is functioning but it really isn’t. I’m not sure if this is motherboard/vendor specific or not.

Just read those links, then did some testing… Could it be that support is there since Linux kernel 5.4?

Don’t know about it working well (couldn’t confirm it yet with my testing as you can read earlier).

Ubuntu 19.10 (Linux kernel 5.3)

root@nas:~# find /lib/modules/5.3.0-19-generic/ | grep -i -E 'edac'
/lib/modules/5.3.0-19-generic/kernel/drivers/edac
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i7core_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/skx_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/amd64_edac_mod.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5100_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i10nm_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/x38_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i3000_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/sb_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i3200_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i7300_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5400_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i82975x_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/edac_mce_amd.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/e752x_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/pnd2_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/ie31200_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5000_edac.ko
root@nas:~# apt list edac-utils
Listing... Done
edac-utils/eoan,now 0.18-1build1 amd64 [installed]
edac-utils/eoan 0.18-1build1 i386
root@nas:~# edac-util -vs
edac-util: EDAC drivers loaded. No memory controllers found
root@nas:~# edac-util -v
edac-util: Error: No memory controller data found.
root@nas:~#

Fedora Rawhide (Linux kernel 5.4)

[root@localhost ~]# find /lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/ | grep -i -E 'edac'
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/amd64_edac_mod.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5000_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/sb_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/skx_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/x38_edac.ko.xz
[root@localhost ~]# yum info edac-utils
Last metadata expiration check: 0:01:47 ago on Sun 27 Oct 2019 11:44:47 PM CET.
Installed Packages
Name : edac-utils
Version : 0.16
Release : 21.fc31
Architecture : x86_64
Size : 101 k
Source : edac-utils-0.16-21.fc31.src.rpm
Repository : @System
From repo : rawhide
Summary : Userspace helper for kernel EDAC drivers
URL : http://sourceforge.net/projects/edac-utils/
License : GPLv2+
Description : EDAC is the current set of drivers in the Linux kernel that handle
: detection of ECC errors from memory controllers for most chipsets
: on i386 and x86_64 architectures. This userspace component consists
: of an init script which makes sure EDAC drivers and DIMM labels
: are loaded at system startup, as well as a library and utility
: for reporting current error counts from the EDAC sysfs files.
[root@localhost ~]# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
mc0:F17h_M70h
[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[root@localhost ~]#
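
As the package description above says, edac-util reports the counters from the EDAC sysfs files, so the same numbers can be read directly. This is a rough Python sketch, assuming the usual /sys/devices/system/edac/mc/mcN/ layout with ce_count and ue_count files; read_edac_counts is a hypothetical helper and not part of edac-utils.

```python
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...}.

    An empty dict means no memory controller is registered, which is the
    situation edac-util reports as "No memory controller data found".
    """
    results = {}
    if not os.path.isdir(root):
        return results
    for name in sorted(os.listdir(root)):
        mc_dir = os.path.join(root, name)
        # Only directories named mc<number> are memory controllers.
        if not (name.startswith("mc") and name[2:].isdigit() and os.path.isdir(mc_dir)):
            continue
        counts = {}
        for fname in ("ce_count", "ue_count"):  # corrected / uncorrected totals
            try:
                with open(os.path.join(mc_dir, fname)) as handle:
                    counts[fname] = int(handle.read().strip())
            except (OSError, ValueError):
                counts[fname] = None  # file missing or unreadable
        results[name] = counts
    return results


print(read_edac_counts() or "no memory controllers registered")
```

If the assumption about the layout holds, the kernel 5.3 system above would give an empty result and the 5.4 system a single mc0 entry with both counters at 0.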

Last edited: Oct 28, 2019

Source
