Hardware error cpu 0 machine check 0 bank 6 ee0000000040110a


Hello Team,

I’ve purchased an Intel i9-10900K processor and motherboard for a new system.

During boot process, the following error is generated:

[ 0.158444] x86/cpu: SGX disabled by BIOS
[ 0.158464] mce: CPU0: Thermal monitoring enabled (TM1)
[ 0.158490] process: using mwait in idle threads
[ 0.158492] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
[ 0.158493] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
[ 0.158495] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.158496] Spectre V2 : Mitigation: Enhanced IBRS
[ 0.158497] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.158498] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[ 0.158499] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[ 0.158691] Freeing SMP alternatives memory: 40K
[ 0.160277] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1356
[ 0.160300] smpboot: CPU0: Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz (family: 0x6, model: 0xa5, stepping: 0x5)
[ 0.160352] mce: [Hardware Error]: Machine check events logged
[ 0.160353] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
[ 0.160356] mce: [Hardware Error]: TSC 0 ADDR fef20300 MISC 3880000086
[ 0.160359] mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1631860523 SOCKET 0 APIC 0 microcode ec
[ 0.160403] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[ 0.160412] ... version: 4
[ 0.160412] ... bit width: 48
[ 0.160413] ... generic registers: 4
[ 0.160413] ... value mask: 0000ffffffffffff
[ 0.160414] ... max period: 00007fffffffffff
[ 0.160414] ... fixed-purpose events: 3
[ 0.160415] ... event mask: 000000070000000f
[ 0.160482] rcu: Hierarchical SRCU implementation.
[ 0.161313] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 0.161420] smp: Bringing up secondary CPUs ...
[ 0.161478] x86: Booting SMP configuration:
[ 0.161479] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19
[ 0.183475] smp: Brought up 1 node, 20 CPUs

Processing the error with ‘mcelog’ returns the following:

# mcelog --ascii < error
Machine check events logged
Hardware event. This is not a software error.
CPU 0 BANK 6
MISC 3880000086 ADDR fef20300
TIME 1631860523 Fri Sep 17 01:35:23 2021
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-3 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 165 Step 5
SOCKET 0 APIC 0 microcode ec

It appears that the error is related to L3 Cache on the processor.
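As a cross-check of the mcelog decoding above, here is a minimal Python sketch that splits the raw STATUS value into its architectural fields. The bit positions follow the IA32_MCi_STATUS layout described in the Intel SDM as I understand it. The function and field names are made up for illustration and are not part of mcelog. The sketch deliberately does not decode the cache level or the model-specific part.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit positions assumed from the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    mca_code = status & 0xFFFF
    return {
        "flags": flags,
        "mca_error_code": mca_code,                      # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        # Compound code pattern 000F 0001 RRRR TTLL = cache hierarchy error.
        "cache_hierarchy_error": (mca_code & 0xEF00) == 0x0100,
        # F (bit 12) = correction report filtering.
        "correction_filtering": bool(mca_code & 0x1000),
    }


result = decode_mci_status(0xEE0000000040110A)
print(result["flags"])
print(hex(result["mca_error_code"]), hex(result["model_specific_code"]))
```

For the value in this post the set flags are VAL, OVER, UC, MISCV, ADDRV and PCC, which lines up with the "Error overflow", "Uncorrected error", register-valid and "Processor context corrupt" lines that mcelog printed.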

I contacted the hardware vendor, and they advised that I must contact the manufacturer.

I ran the Intel SSU utility available at the URL below.

Intel® System Support Utility for the Linux

https://www.intel.com/content/www/us/en/download/18895/26735/intel-system-support-utility-for-the-li…

SSU Output

# ./ssu.sh -d=0 -l=0 -m=0 -b=0 -n=0 -os=0 -o=CPU_Only.txt -p=0 -c=1 -s=0

# SSU Scan Information
Scan Info:
Version:"1.0.0.0"
Scan Date:"2021/09/17"
Scan Time:"06:31:40"

## Scanned Hardware
Computer:
- Processor
- "Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz"
Architecture:"x86_64"
Available:"Offline"
Byte Order:"Little Endian"
Cache Size:"20480 KB"
Caption:"Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz"
- Characteristics
64-bit capable
Enhanced Virtualization
Execute Protection
Hardware Thread
Multi-Core
Power/Performance Control
CPU Speed (Minimum):"1000.000"
CPU Speed (Maximum):"5300 MHz"
Current Voltage:"1.0 V"
External Clock:"100 MHz"
Family:"Not Available"
Flags:"Not Available"
ID:"55 06 0A 00 FF FB EB BF"
Level 1 Cache:"32K"
Level 2 Cache:"256K"
Level 3 Cache:"20480K"
Load:"load average: 0.40, 0.43, 0.18"
Manufacturer:"Intel(R) Corporation"
Model:"165"
Name:"Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz"
Number of Cores:"10"
Number of Cores - Enabled:"10"
Part Number:"To Be Filled By O.E.M."
Socket Designation:"U3E1"
Status:"Populated, Enabled"
Version:"Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz"
Voltage:"1.0 V"
Virtualization:"Not Available"

Can you please review and provide guidance regarding the next step?

Thanks in advance, your help is very much appreciated!


Denys

Posts: 3
Registered: 18 Feb 2018, 11:22

Error when installing Ubuntu

18 Feb 2018, 11:35

After I choose Install, an error comes up. What is this, and what should I do about it?

Attachments
IMG_1153.JPG


Chocobo

Posts: 9954
Registered: 27 Aug 2016, 22:57
Solved: 214
From: NN
Thanked others: 795 times
Was thanked: 2980 times

Error when installing Ubuntu

#2

18 Feb 2018, 12:10

Denys,
1. What CPU is it?
2. Which version of the distribution?
3. Why not Mint, or why isn't this post on the Ubuntu forum?


Denys

Posts: 3
Registered: 18 Feb 2018, 11:22

Error when installing Ubuntu

#3

18 Feb 2018, 13:10

i7-7700HQ
Version 17.
I tried installing Mint 18; there was some flickering and the same error.
It was recommended here.


AlexZ

Posts: 1395
Registered: 06 Jan 2018, 21:06
Solved: 3
From: Gorno-Altaysk
Thanked others: 212 times
Was thanked: 177 times

Error when installing Ubuntu

#4

18 Feb 2018, 13:13

Denys wrote: ↑

18 Feb 2018, 11:35

After I choose Install, an error comes up. What is this, and what should I do about it?

This started appearing for me on newer kernels. Here it is on 4.14:
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ae0000000040110a
kernel: mce: [Hardware Error]: TSC 0 ADDR fef86e00 MISC 78a0000086
kernel: mce: [Hardware Error]: PROCESSOR 0:40651 TIME 1518826617 SOCKET 0 APIC 0 microcode 20
It doesn't affect booting, though.
Try booting with the nomodeset parameter.


Denys

Posts: 3
Registered: 18 Feb 2018, 11:22

Error when installing Ubuntu

#5

18 Feb 2018, 13:33

How do I set these parameters?


x230

Posts: 2094
Registered: 02 Sep 2016, 22:07
Solved: 5
Thanked others: 406 times
Was thanked: 487 times

Error when installing Ubuntu

#6

18 Feb 2018, 13:47

Denys wrote: ↑

18 Feb 2018, 13:33

How do I set these parameters?

See here and here about the parameter.
If nomodeset doesn't do the trick, try disabling UEFI (that is, enable Legacy boot) and start the installation over.


Chocobo

Posts: 9954
Registered: 27 Aug 2016, 22:57
Solved: 214
From: NN
Thanked others: 795 times
Was thanked: 2980 times

Error when installing Ubuntu

#7

18 Feb 2018, 13:50

Denys wrote: ↑

18 Feb 2018, 13:10

i7-7700HQ
Version 17

Ubuntu 17.04 or 17.10? The 4.10 kernel from Zesty may not work yet. The CPU is clearly complaining here.
The 4.13 kernel from 17.10 should, in principle, already work.

You could also try the latest Fedora, or Mint 18: I rebuilt an image with the 4.13 kernel here.


x230

Posts: 2094
Registered: 02 Sep 2016, 22:07
Solved: 5
Thanked others: 406 times
Was thanked: 487 times

Error when installing Ubuntu

#8

18 Feb 2018, 15:47

Chocobo wrote: ↑

18 Feb 2018, 13:50

rebuilt

With what, and how? I'm curious.


Chocobo

Posts: 9954
Registered: 27 Aug 2016, 22:57
Solved: 214
From: NN
Thanked others: 795 times
Was thanked: 2980 times

Error when installing Ubuntu

#9

18 Feb 2018, 15:54

Off-topic

x230, by hand as usual :smile: because the tools that were supposed to help have let me down too often over the years. :smile:
Once you have worked through the process step by step, it no longer seems scary and everything falls into place. If you like, we can teach you too, if anything is unclear :)


x230

Posts: 2094
Registered: 02 Sep 2016, 22:07
Solved: 5
Thanked others: 406 times
Was thanked: 487 times

Error when installing Ubuntu

#10

18 Feb 2018, 16:22

Chocobo wrote: ↑

18 Feb 2018, 15:54

If you like, we can teach you too, if anything is unclear :)

Everything is unclear, which is why I use GUI crutches.



#1 2021-05-09 22:13:41

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I just built a new machine this weekend: an Intel i3-10320 CPU on a MSI MAG B560 Torpedo motherboard.  I’ve never built a machine before.  After building it, I immediately updated the BIOS.  So, here is the problem: the only way I can get Arch Linux to boot, whether from a live USB or as it is currently installed on my new machine, is by adding nomodeset to kernel boot line.  Per dmesg | grep Error, I am getting the following errors at boot:

[    0.107865] mce: [Hardware Error]: Machine check events logged
[    0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
[    0.107869] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086
[    0.107872] mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1620583348 SOCKET 0 APIC 0 microcode e0
[    0.794166] ACPI BIOS Error (bug): Could not resolve symbol [_SB.PC00.PGON.PBGE], AE_NOT_FOUND (20210105/psargs-330)
[    0.794174] ACPI Error: Aborting method _SB.PC00.PGON due to previous error (AE_NOT_FOUND) (20210105/psparse-529)
[    0.794178] ACPI Error: Aborting method _SB.PC00.PEG1.PG01._ON due to previous error (AE_NOT_FOUND) (20210105/psparse-529)
[    1.012493] RAS: Correctable Errors collector initialized.

I’m particularly concerned about the hardware errors. From what I’ve been able to gather so far (and I’ve done a fair bit of research over the past few days, though I should disclaim that I am a total amateur), I may actually have two issues, and I’m not sure if they are related.  The hardware errors may be due to a bad CPU or mobo socket, or they may be a firmware or microcode issue.  The ACPI errors may or may not be related, but are probably firmware or microcode driven.  I’ve updated the BIOS and added the most recent intel-ucode to my kernel boot line as well, so I think I am current.  It seems a number of people are having the ACPI issues at the moment.  Does anyone have any insights?

#2 2021-05-09 22:22:07

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,472
Website

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Are you overclocking this machine or undervolting it?  It’s been years since I played with overclocking and undervolting but the error sparks a memory, https://wiki.archlinux.org/title/Stress … ing_Errors

#3 2021-05-09 22:53:53

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Not overclocking.  I don’t think I’m undervolting, but I will double-check cpu power hookups to the psu just to make sure that is as it should be.

#4 2021-05-10 06:12:38

orlfman
Member
Registered: 2007-11-20
Posts: 121

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

as far i’m aware the i3-10320 is locked, so really no overclocking capability. outside the motherboard messing with power limits. which is common with intel boards. in the bios i would make sure power limits are truly intel stock and not messed with. tdp for the 10320 is 65 watts for the p1, p2 should be 90 watts. tau of 28 seconds.

googling i found this with a similar mce error to yours: https://community.intel.com/t5/Graphics … d-p/711594
but good ol’ intel doesn’t appear to offer any help as "linux isn’t validated by intel."

there could be something wrong with your cpu. if you can, i would test with windows first. windows unfortunately has the better monitoring / stability testing / benchmarking tools. i’m curious to see if windows picks up any whea errors with event viewer in the system pane.

Last edited by orlfman (2021-05-10 06:13:51)

#5 2021-05-10 07:59:20

d_fajardo
Member
Registered: 2017-07-28
Posts: 1,435

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

You could also run MemTest86 to check the CPU memory registers and caches for error.

#6 2021-05-10 17:40:37

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Thanks everyone for the suggestions.  I have verified the hardware is hooked up as it should be.  As graysky suggested, I tried to find the Fedora version of Intel Processor Diagnostic Tool (IPDT), but all the links to the tool on the Internet appear to be broken.  I did a fair bit of searching for the tool, but only hit dead ends.  Per orlfman, I probably will load Windows and then get IPDT for Windows just to see if that will narrow down any issues.   In the mean time, I will verify the power limits in the BIOS tonight.  May try to run stress and MemTest86 too.

#7 2021-05-10 20:28:24

Ropid
Member
Registered: 2015-03-09
Posts: 1,068

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

That MCE hardware event in your log happened early at boot. Do you also get MCE log entries later while you are using the computer?

If it’s only happening at boot, you perhaps shouldn’t worry too much about it. I remember seeing other people with those kind of mysterious MCE events that only happen at boot but don’t happen later, their computer ran fine otherwise.

The log you shared is the output of ‘dmesg’? Those entries should also be in systemd’s journal. You can then search for old entries from previous boots in the "journalctl" output.

The "nomodeset" issue should be something else. You didn’t mention what graphics card you are using.
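To compare MCE entries across boots as suggested above, a small Python sketch like the following could pull the bank and status out of saved dmesg or journal text. The function name and the returned dictionary layout are invented for illustration. The regular expression only assumes the line format shown in the logs in this thread.

```python
import re

# Matches lines such as:
#   mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def find_mce_events(log_text: str) -> list:
    """Return one dict per 'Machine Check' line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            events.append({
                "cpu": int(match["cpu"]),
                "bank": int(match["bank"]),
                "status": int(match["status"], 16),
            })
    return events


sample = (
    "[    0.107865] mce: [Hardware Error]: Machine check events logged\n"
    "[    0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a\n"
)
for event in find_mce_events(sample):
    print(event["cpu"], event["bank"], hex(event["status"]))
```

Feeding it the saved kernel log of the current boot and of an earlier one (for example text captured from `journalctl -k -b -1`) makes it easy to see whether the bank and status are identical every time or vary between boots.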

#8 2021-05-10 23:55:21

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Yes, it happens early each time I boot, and only then.   I’ve run journalctl | grep Error and confirmed it is always the same series of messages early in boot, every boot.  I’m not using a graphics card; I’m relying on the cpu’s built in graphics.  FYI… I’ve built this machine to serve as headless home NAS, and plan to leave it on 24/7.

I was just in BIOS and one anomaly sticks out in bold text below:

VOLTAGE
CPU Core: 0.970V
CPU IO: 0.956V
CPU IO2: 65.535V
CPU SA: 1.054V
System 3.3V: 3.360V
System 12V: 12.120V
DRAM: 1.204V

I don’t even know what CPU IO2 is, but 65V seems kinda high for something on the cpu. I’m guessing it is a un/dis-connected sensor.

#9 2021-05-11 06:19:17

seth
Member
Registered: 2012-09-03
Posts: 35,310

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

The error is (very most likely) from the last boot.
Do you restart cleanly, does the reboot process hang w/ some errors during the shutdown and is there a difference in MCE messages between a cold and a warm reboot?

Ceterum censeo and since it was mentioned: 3rd link in my signature.

#10 2021-05-15 01:18:43

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

The error is the same (except for the timestamp) regardless of whether it is a cold or warm boot.  No errors during shutdown.

The machine seems to ***mostly run*** despite the CPU and ACPI errors listed above, but of course I have to boot with nomodeset to use a monitor.  I’ve tried disabling fastboot: no effect.  I’ve reinstalled Arch from scratch.  I’ve replaced Arch with Ubuntu, then replaced Ubuntu with Linux Mint.  I get the exact same messages all the time at the same place during boot and have to use nomodeset in all cases.  I’m in the process of loading Windows 10 to see what happens.

#11 2021-05-15 02:51:11

Ropid
Member
Registered: 2015-03-09
Posts: 1,068

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I stumbled onto something interesting in the kernel documentation yesterday that fits with the CPU problem here:

First, I saw this here in kernel-parameters.txt:

        mce=option      [X86-64] See Documentation/x86/x86_64/boot-options.rst

Then next looking through the file boot-options.rst, one of the options it describes is this:

   mce=bootlog
                Enable logging of machine checks left over from booting.
                Disabled by default on AMD Fam10h and older because some BIOS
                leave bogus ones.
                If your BIOS doesn't do that it's a good idea to enable though
                to make sure you log even machine check events that result
                in a reboot. On Intel systems it is enabled by default.
   mce=nobootlog
                Disable boot machine check logging.

You could try this "mce=nobootlog" kernel command line parameter and see what happens. If it hides the MCE event messages in dmesg and journalctl, this should then mean that they were events from before the Linux kernel was loaded. They were then events from early at boot when the UEFI was still in control of the machine.

If this "mce=nobootlog" works, I would then not worry about this anymore. The text in the documentation mentions there are machines that always create those MCE events at boot. I guess your machine is then one of those and there’s nothing to do about it.
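To confirm that the parameter actually reached the kernel after a reboot, a minimal Python sketch can inspect the kernel command line. The helper names are hypothetical. The only assumption is that the command line is a whitespace-separated string, as exposed in /proc/cmdline on Linux.

```python
def mce_options(cmdline: str) -> list:
    """Return the value of every mce= parameter on a kernel command line."""
    return [
        token[len("mce="):]
        for token in cmdline.split()
        if token.startswith("mce=")
    ]


def boot_logging_disabled(cmdline: str) -> bool:
    """True if mce=nobootlog is present on the given command line."""
    return "nobootlog" in mce_options(cmdline)


# Example string; on a running system the text could instead be read with
# open("/proc/cmdline").read().
example = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda2 rw mce=nobootlog nomodeset"
print(mce_options(example))
print(boot_logging_disabled(example))
```

If the check reports the option as present and the MCE lines are gone from dmesg, the suppression comes from the parameter rather than from the error having disappeared by itself.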

#12 2021-05-15 13:36:14

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Yes, adding mce=nobootlog as a kernel boot parameter did suppress the MCE hardware errors.  The other errors remain, as expected.  I have one quick question: Why would suppressing these errors via the kernel boot parameters indicate the events were driven when the UEFI was still in control of the machine, and not the kernel?

#13 2021-05-15 14:34:27

seth
Member
Registered: 2012-09-03
Posts: 35,310

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

https://en.wikipedia.org/wiki/Machine-check_exception
It’s either that or the error is carried over from the last boot.
If there were no issues w/ the shutdown (you shut down the system cleanly, rather than it somehow powered off out of nowhere) and the errors are reproducible (always the same), they’re detected at boot.

Whether they’re bogus or a genuine error can’t be told, but google finds that exact error at the exact address and bank quite some times…

#14 2021-05-15 15:12:47

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

As far as I can determine, the machine runs fine despite the MCE error.  I will proceed as if they are harmless.  I have other issues (e.g. graphics problems) that I need to troubleshoot too, but those are topics for another thread.

#15 2021-07-26 12:01:39

zeronullity
Member
Registered: 2021-07-26
Posts: 3

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I’m dealing with this same exact issue, and trust me when I say it’s not "fine" or a bug; it has a high risk of leading to more serious issues down the road. This error is typically going to be a bad memory controller/cache on your CPU 75% of the time, a bad chipset on your motherboard, bent pins on your CPU socket, or a bad trace on your motherboard. On rare occasions other hardware that shorts out the motherboard, such as a bad power supply, a PCI card, or bad caps on the motherboard, can cause it as well, but unless it’s directly linked to the CPU that is extremely rare.  I’ve decided it’s time for an upgrade and to err on the side of caution and just replace my motherboard, CPUs, memory, power supply, etc. This is the second identical motherboard and CPU set I’ve had with memory/CPU related issues within a few years. I see no point in penny-pinching only to cause myself headaches in the future. I can always pinpoint the issue later if I have nothing to do and use the hardware for less important things, but as my main system it simply will not do. Also, bad hardware like this can cause a daisy chain of failures: a bad CPU damages the motherboard, you replace the motherboard, the CPU damages the motherboard again, you replace the bad CPU, and the motherboard causes the CPU to go bad again. It’s extremely rare, but I’ve seen it happen in a controlled engineering environment with other hardware.

At this point, with the same error, I can’t even mount my root device read-write: the kernel forbids it unless I force its hand, and even with a clean fsck the mount won’t work. No other hardware errors show. No issues with my RAID drives. Removing all memory sticks from CPU1 banks 1, 2 and 3, and not CPU0, fixes the error (I say fixes, but it’s not a fix; it just "hides" the error from the kernel, and it will still cause data corruption). However, trying another memory stick from the CPU0 bank fails every time. I tried 15 "known good" memory sticks with the same results. Also, I can run a CPU/memory burn test and "most" basic tests will pass without issue.

Also, I wanted to say that the forum sign-up on this server with the date/uname/hash should be changed, in my opinion, for people like me who are having hardware issues without access to another Linux system. I actually had to stop and think, notice that the security question changed every time (from hash 256 to 512, different date %V and %J outputs), and make sure my time was on par with the server’s in epoch or day-of-the-month format. It really just wasted my time, which is something I really hate. Forums are usually for people who need help, which can include date/time/RTC/hardware/kernel issues with Linux systems. Only elitist n00bs who think they are supreme/clever/elite and better than other people would use that type of captcha, but that’s only my opinion. Hunting down an online SHA tool and trying multiple "near correct date" hashes until I get the right one takes more time than most of the hardest captchas, and that’s from someone who is very experienced with Unix/Linux systems.

It’s kind of like the Arch forum admins are telling me: "Well, if you don’t have current access to a Linux system with the correct date set, you’re not getting into our forums without some work and wasted time finding a proper hash that works." It just leaves a bad taste in my mouth. I’ve been using Unix/Linux systems for over 27 years, and that’s my point of view.

Last edited by zeronullity (2021-07-26 12:25:01)

#16 2021-07-26 12:23:56

seth
Member
Registered: 2012-09-03
Posts: 35,310

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I’m dealing with this same exact issue

"Bank 6: ee2000000040110a" and *only* "Bank 6: ee2000000040110a"?

For the rest of your post, please see https://gitlab.archlinux.org/archlinux/ … opicsrants

#17 2021-07-26 12:49:53

zeronullity
Member
Registered: 2021-07-26
Posts: 3

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Yes Seth, the same error, different address range, and yes, it will affect each system differently based on the hardware in use and the ranges affected. It’s a very serious error even if you don’t have issues currently. It’s true you may never have an issue, but that doesn’t mean the hardware doesn’t have a real physical failure. How it affects your system depends on many factors.  It’s like me telling you my shirt is ripped and I’m going to throw it away, and you asking "but is it ripped where it can be seen?"  It doesn’t matter, it’s ripped, and I’m throwing it out or giving it away, because it’s most likely only going to get worse.

I’m sorry for breaking the forum rules, Seth. I just call it how I see it when things seem to be blatantly wrong from my point of view. It was more or less meant as a teaching moment, to share common human kindness and hospitality instead of turning away new Linux users because they are not smart enough to answer the question or don’t have easy access to checksum tools at the moment. And I was harsher than I normally am because I saw other forum posts with members/admins boasting about this very thing, and it irked me quite a bit that they would be careless/thoughtless about "new members".

Last edited by zeronullity (2021-07-26 13:35:09)

#18 2021-07-26 14:55:04

seth
Member
Registered: 2012-09-03
Posts: 35,310

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I don’t have that error, but it is all over the internet with this *specific* address (which is a bit specific and probably does matter, and it will likely also matter whether the address varies).
And whether there are more and different MCEs along with it.
There’re 10 hits for this forum alone (most of which do actually not concern that error but are just random dmesg posts)
Apparently it was introduced w/ linux 4.10, https://bbs.archlinux.org/viewtopic.php … 1#p1698801

I’m not saying it’s harmless for sure, but having an itching toe right before your house burned down doesn’t mean that the house burned down because of your itching toe…

#19 2021-07-26 17:40:35

zeronullity
Member
Registered: 2021-07-26
Posts: 3

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I could go into great detail explaining what the addresses mean, and how they would affect certain situations if they are on the same hardware in various scenarios, or even the same error on different hardware. It is like having a unique hardware fingerprint.

https://en.wikipedia.org/wiki/Machine-check_exception

seth: "I’m not saying it’s harmless for sure, but having an itching toe right before your house burned down doesn’t mean that the house burned down because of your itching toe…"

First of all, I personally wouldn’t use that comparison.  Even in the "rare" instances where it’s a firmware/kernel/driver bug, you can have the really bad luck of having both: a firmware/kernel bug and a hardware issue too. I’m not suggesting that the OP throw his system away. I would suggest using "known good" parts to rule out the hardware. If it’s NOT the hardware, then swapping out the hardware for an IDENTICAL part number with an IDENTICAL firmware version will make no difference to the error. I definitely wouldn’t take the approach of "oh well, a few other people had this problem, they say it’s nothing" and move on. I take the old-school approach, which rarely fails; there are not too many ways to reinvent a perfectly round circle. A square or oblong tire probably wouldn’t be very good to drive on, but hey, if you have no other choice, why not, right? What’s the worst that could happen?

#20 2021-07-28 16:22:48

seth
Member
Registered: 2012-09-03
Posts: 35,310

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I could go into great detail explaining

Don’t let me stop you…

MCE is good to analyze HW errors after the fact and preserve errors across boots.
But if the very same error with the very same addresses occurs for many people and always during the boot when the cores are initialized and without any perceivable issues, chances are it’s bogus and not a thermal issue or decay (where you’d expect more randomness)

Alternatively the OP is really active across the web ;-)
(Though there’re also many dell systems affected where idk the used chipset)

Here’s someone who got them when switching to the exact same chipset, https://forums.unraid.net/topic/103883- … rd-on-691/

#21 2021-08-11 20:10:09

johnny.honu
Member
Registered: 2021-04-18
Posts: 17

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Update on my original post… I eventually got a better Mobo and CPU.  That resolved the issue.  I did try swapping out the original Mobo with a new, identical Mobo and got the same error.  In the end, I discovered the Mobo to be very quirky in a few ways, and I don’t think it was entirely compatible with the CPU (or the GPU built into the CPU). The Mobo was really built for Intel 11th gen, and I was using Intel 10th gen.  It should have been backward compatible, but just wasn’t.

#22 2021-10-26 03:58:33

kaos77
Member
Registered: 2013-04-26
Posts: 9

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

Hopefully this thread isn’t too old to bump up again, but I ran across this thread for the exact same error.  Considering the sheer number of hits around this error and addresses, I’m going to let it ride.  Brand new build with quality parts.  Burned in the system pretty hard when I first saw the errors.  Ran stress tests 24 hours running all 20 vcores pretty solid.  Ran into absolutely no issues at all.  I’ve run Windows and Linux on this hardware with no issues, other than a few video issues due to lower quality HDMI cables.  The errors bother me because they’re there, but knowing so many other people have them also is certainly comforting in a twisted kind of way.

Same CPU gen though,  10900 (non-k) on a Gigabyte Z590 Aorus Master.  64G of Gigabyte listed RAM.  CPU runs at 30*C or below at standard load.


Error appears during installation

I can’t install Ubuntu; when I select the install option, this error appears:

Hardware error: CPU 0: Machine Check: 0 Bank 6: ee20000000

Maybe someone knows and can suggest where to dig.

The cache of one of the cores is toast, if I’m not mistaken. Can you enable and disable cores in the BIOS? Iterate through them: disable the first core and try installing; if the error appears, re-enable the first core, disable the second, and so on.

If the CPU is new, return it under warranty.

Run MemTest86; again, if I remember correctly, it can check the CPU cache.

Maybe someone can give more useful advice.

Most likely bad RAM.

The memory is bad, replace it.

Replacing the RAM made no difference.

Disabling cores is not possible; the BIOS has no such option.

Did you replace all of it?

If you replaced all of it and that didn’t help, replace the motherboard. That means the damage is somewhere in it. The error clearly points to the sixth memory bank; if the error is the same after swapping the RAM, the motherboard is damaged.

And why did you decide it’s the cache? Usually when it’s the cache, it says it’s the cache. Besides, cache usually has ECC (or I’m spoiled by Xeon), and there would be a message about an uncorrectable error in it. Here it clearly says: first channel (0), sixth bank. So it’s memory.

I’m surprised myself that I thought so, but the OP said he replaced the memory. Perhaps a memory error can be caused not only by the memory itself but also by the CPU. Though I’m making that up. The OP hasn’t yet answered whether he replaced all the memory or only the module where the bank is supposedly bad.

A defect on the motherboard, a defective slot. Though it could also be something with a contact pad on the CPU, or with a pin in the socket.

Well yes, there are many possibilities in principle... if the OP really did replace the memory and got the same crap, that’s sad, of course. Maybe do a visual inspection of the slot, the CPU and its pins/pads, and the socket. And wipe the contacts of the memory module itself with alcohol.

And what does bank 6 point to? What possible hardware problems can it indicate?

The possible causes were described to you above: problems with the CPU, with the CPU pins, with the contact pads in the socket, with the slot, or with the RAM module.

What hardware do you have? Just don’t say it’s a Chinese Xeon and motherboard.

A RAM stick is that green board with gold-plated contacts, with black blobs soldered onto it. Those blobs are the memory banks. The MCE mechanism recorded an error in the sixth blob on the stick sitting on channel zero (for which channel maps to which slot, see your motherboard manual). It could also be a problem in the data lines between the memory and the CPU (the slot the stick plugs into, the traces on the motherboard, the CPU socket, the CPU pins, components inside the CPU).

Just don’t say it’s a Chinese Xeon and motherboard.

No discrimination here, please. Two Chinese Xeons run perfectly well in a Chinese board, with all 16 slots filled with memory.

You got lucky, but overall their defect rate is higher.

Used CPU, used chipset, used components.

And the service life of all that is longer than the time the whole machine stays relevant. Is there really even one story of a chipset or a CPU wearing out from old age? A defect is a defect, it happens; the seller replaced a defective motherboard for me without any trouble. Like in a regular store, only slower.

And yes, I’ve somehow been lucky for about eight years now. This Chinese one is not my first, or even my second.

Ubuntu 18.04 does not work with an H410 motherboard with the Intel H510 Express chipset.

Arch Linux

#1 2021-05-09 22:13:41

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110

I just built a new machine this weekend: an Intel i3-10320 CPU on an MSI MAG B560 Torpedo motherboard. I’ve never built a machine before. After building it, I immediately updated the BIOS. So, here is the problem: the only way I can get Arch Linux to boot, whether from a live USB or as it is currently installed on my new machine, is by adding nomodeset to the kernel boot line. Per dmesg | grep Error, I am getting the following errors at boot:

[ 0.107865] mce: [Hardware Error]: Machine check events logged
[ 0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
[ 0.107869] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086
[ 0.107872] mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1620583348 SOCKET 0 APIC 0 microcode e0
[ 0.794166] ACPI BIOS Error (bug): Could not resolve symbol [_SB.PC00.PGON.PBGE], AE_NOT_FOUND (20210105/psargs-330)
[ 0.794174] ACPI Error: Aborting method _SB.PC00.PGON due to previous error (AE_NOT_FOUND) (20210105/psparse-529)
[ 0.794178] ACPI Error: Aborting method _SB.PC00.PEG1.PG01._ON due to previous error (AE_NOT_FOUND) (20210105/psparse-529)
[ 1.012493] RAS: Correctable Errors collector initialized.

I’m particularly concerned about the hardware errors. From what I’ve been able to gather so far (and I’ve done a fair bit of research over the past few days, though I should disclaim that I am a total amateur), I may actually have two issues, and I’m not sure if they are related. The hardware errors may be due to a bad CPU or mobo socket, or to a firmware or microcode issue. The ACPI errors may or may not be related, but are probably firmware or microcode driven. I’ve updated the BIOS and added the most recent intel-ucode to my kernel boot line as well, so I think I am current. It seems a number of people are having the ACPI issues at the moment. Does anyone have any insights?
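
For readers wondering what the hex value after "Bank 6:" encodes, here is a minimal Python sketch that splits it into flag bits. It assumes the value is the raw IA32_MCi_STATUS register of machine-check register bank 6 and that the bit positions match Intel's Software Developer's Manual; neither is stated in this thread, and the name decode_mci_status is made up for illustration. It does not try to interpret the error codes themselves.

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS.
# Assumption: taken from the Intel SDM, not from this thread.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this bank
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds an address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status value into flag bits and the two 16-bit codes."""
    fields = {
        name: bool((status >> bit) & 1)
        for name, bit in MCI_STATUS_FLAGS.items()
    }
    fields["mca_error_code"] = status & 0xFFFF              # bits 15:0
    fields["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return fields


if __name__ == "__main__":
    # Status value copied from the dmesg output in this post.
    decoded = decode_mci_status(0xEE2000000040110A)
    for name, value in decoded.items():
        shown = value if isinstance(value, bool) else hex(value)
        print(f"{name:>20}: {shown}")
```

Under those assumptions, the value from the log has EN clear, meaning reporting for that bank was not enabled when the error was latched. That is a reading of the bits, not a diagnosis.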

#2 2021-05-09 22:22:07

Are you overclocking this machine or undervolting it? It’s been years since I played with overclocking and undervolting but the error sparks a memory, https://wiki.archlinux.org/title/Stress … ing_Errors

#3 2021-05-09 22:53:53

Not overclocking. I don’t think I’m undervolting, but I will double-check cpu power hookups to the psu just to make sure that is as it should be.

#4 2021-05-10 06:12:38

as far as i’m aware the i3-10320 is locked, so there’s really no overclocking capability, outside of the motherboard messing with power limits, which is common with intel boards. in the bios i would make sure power limits are truly intel stock and not messed with. tdp for the 10320 is 65 watts for PL1; PL2 should be 90 watts, with a tau of 28 seconds.

googling i found this with a similar mce error to yours: https://community.intel.com/t5/Graphics … d-p/711594
but good ol’ intel doesn’t appear to offer any help, as "linux isn’t validated by intel."

there could be something wrong with your cpu. if you can, i would test with windows first. windows unfortunately has the better monitoring / stability testing / benchmarking tools. i’m curious to see if windows picks up any whea errors with event viewer in the system pane.

Last edited by orlfman (2021-05-10 06:13:51)

#5 2021-05-10 07:59:20

You could also run MemTest86 to check the CPU memory registers and caches for errors.

#6 2021-05-10 17:40:37

Thanks everyone for the suggestions. I have verified the hardware is hooked up as it should be. As graysky suggested, I tried to find the Fedora version of the Intel Processor Diagnostic Tool (IPDT), but all the links to the tool on the Internet appear to be broken. I did a fair bit of searching for the tool, but only hit dead ends. Per orlfman, I will probably load Windows and then get IPDT for Windows just to see if that will narrow down any issues. In the meantime, I will verify the power limits in the BIOS tonight. I may try to run stress and MemTest86 too.

#7 2021-05-10 20:28:24

That MCE hardware event in your log happened early at boot. Do you also get MCE log entries later while you are using the computer?

If it’s only happening at boot, you perhaps shouldn’t worry too much about it. I remember seeing other people with that kind of mysterious MCE event that only happens at boot and never later; their computers ran fine otherwise.

Is the log you shared the output of "dmesg"? Those entries should also be in systemd’s journal. You can then search for old entries from previous boots in the "journalctl" output.

The «nomodeset» issue should be something else. You didn’t mention what graphics card you are using.

#8 2021-05-10 23:55:21

Yes, it happens early each time I boot, and only then. I’ve run journalctl | grep Error and confirmed it is always the same series of messages early in boot, every boot. I’m not using a graphics card; I’m relying on the cpu’s built in graphics. FYI. I’ve built this machine to serve as headless home NAS, and plan to leave it on 24/7.
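
To compare events across boots without eyeballing grep output, the MCE lines can be pulled apart with a small script. This is only a sketch: the regular expression is fitted to the line format shown in this thread, the function name parse_mce_lines is hypothetical, and other kernel versions may print the line differently.

```python
import re

# Pattern fitted to the "CPU n: Machine Check: x Bank n: status" line
# as it appears in this thread (an assumption about the format).
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: (?P<mcgstatus>[0-9a-fA-F]+) "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)

# Two lines copied from the dmesg output earlier in the thread.
SAMPLE = """\
[ 0.107865] mce: [Hardware Error]: Machine check events logged
[ 0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
"""


def parse_mce_lines(text: str) -> list:
    """Return one dict per machine-check line found in the log text."""
    events = []
    for line in text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            events.append({
                "cpu": int(match["cpu"]),
                "mcgstatus": int(match["mcgstatus"], 16),
                "bank": int(match["bank"]),
                "status": int(match["status"], 16),
            })
    return events


if __name__ == "__main__":
    for event in parse_mce_lines(SAMPLE):
        print(f"cpu={event['cpu']} bank={event['bank']} "
              f"status={event['status']:#018x}")
```

The same function could be fed saved kernel log text from several boots (for example, the output of journalctl -k for each boot) to check whether the bank and status values ever change.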

I was just in BIOS and one anomaly sticks out in bold text below:

VOLTAGE
CPU Core: 0.970V
CPU IO: 0.956V
CPU IO2: 65.535V
CPU SA: 1.054V
System 3.3V: 3.360V
System 12V: 12.120V
DRAM: 1.204V

I don’t even know what CPU IO2 is, but 65V seems kinda high for something on the cpu. I’m guessing it is a un/dis-connected sensor.
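
One observation that may support the disconnected-sensor guess: 65.535 V is exactly 65535 mV, the largest value an unsigned 16-bit field can hold. The sketch below checks a reading against that pattern. It assumes the BIOS reads a 16-bit register in millivolts, which is a guess and not confirmed by MSI or anything in this thread; the function name is hypothetical.

```python
def looks_like_unconnected_sensor(volts: float, bits: int = 16) -> bool:
    """True if the reading equals the all-ones value of a `bits`-wide
    register interpreted as millivolts (an assumed encoding)."""
    raw_millivolts = round(volts * 1000)
    return raw_millivolts == (1 << bits) - 1


if __name__ == "__main__":
    # Readings copied from the BIOS screen in this post.
    readings = {"CPU Core": 0.970, "CPU IO": 0.956, "CPU IO2": 65.535, "CPU SA": 1.054}
    for name, volts in readings.items():
        print(name, looks_like_unconnected_sensor(volts))
```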

#9 2021-05-11 06:19:17

The error is (very most likely) from the last boot.
Do you restart cleanly, does the reboot process hang w/ some errors during the shutdown and is there a difference in MCE messages between a cold and a warm reboot?

Ceterum censeo and since it was mentioned: 3rd link in my signature.

#10 2021-05-15 01:18:43

The error is the same (except for the timestamp) regardless of whether it is a cold or warm boot. No errors during shutdown.

The machine seems to ***mostly run*** despite the CPU and ACPI errors listed above, but of course I have to boot with nomodeset to use a monitor. I’ve tried disabling fastboot: no effect. I’ve reinstalled Arch from scratch. I’ve replaced Arch with Ubuntu, then replaced Ubuntu with Linux Mint. I get the exact same messages every time at the same place during boot and have to use nomodeset in all cases. I’m in the process of loading Windows 10 to see what happens.

#11 2021-05-15 02:51:11

I stumbled onto something interesting in the kernel documentation yesterday that fits with the CPU problem here:

First, I saw this here in kernel-parameters.txt:

Then next looking through the file boot-options.rst, one of the options it describes is this:

You could try this "mce=nobootlog" kernel command line parameter and see what happens. If it hides the MCE event messages in dmesg and journalctl, that should mean they were events from before the Linux kernel was loaded, from early in boot when the UEFI was still in control of the machine.

If "mce=nobootlog" works, I would not worry about this anymore. The text in the documentation mentions that there are machines that always create those MCE events at boot. I guess your machine is one of those and there’s nothing to do about it.
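
After rebooting, it is easy to forget whether the parameter actually reached the kernel. A minimal sketch of a check, assuming a Linux system where /proc/cmdline holds the running kernel's command line; has_kernel_param is a made-up helper name:

```python
from pathlib import Path


def has_kernel_param(cmdline: str, param: str) -> bool:
    """True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        # Not on Linux, or /proc is unavailable; treat as "not set".
        cmdline = ""
    print("mce=nobootlog active:", has_kernel_param(cmdline, "mce=nobootlog"))
```

Matching whole tokens rather than substrings avoids a false positive from a longer parameter that merely starts with the same text.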

#12 2021-05-15 13:36:14

Yes, adding mce=nobootlog as a kernel boot parameter did suppress the MCE hardware errors. The other errors remain, as expected. I have one quick question: why would suppressing these errors via the kernel boot parameters indicate that the events were generated while the UEFI, and not the kernel, was still in control of the machine?

#13 2021-05-15 14:34:27

https://en.wikipedia.org/wiki/Machine-check_exception
It’s either that or the error is carried over from the last boot.
If there were no issues w/ the shutdown (you shut down the system cleanly, rather than it somehow powered off out of nowhere) and the errors are reproducible (always the same), they’re detected at boot.

Whether they’re bogus or a genuine error can’t be told, but google finds that exact error at the exact address and bank quite a few times…

#14 2021-05-15 15:12:47

As far as I can determine, the machine runs fine despite the MCE error. I will proceed as if they are harmless. I have other issues (e.g. graphics problems) that I need to troubleshoot too, but those are topics for another thread.

#15 2021-07-26 12:01:39

I’m dealing with this exact same issue, and trust me when I say it’s not "fine" or a bug; it carries a high risk of leading to more serious issues down the road. About 75% of the time this error is going to be a bad memory controller or cache on your CPU; otherwise it is a bad chipset on your motherboard, bent pins in your CPU socket, or a bad trace on your motherboard. On rare occasions other hardware that shorts out the motherboard, such as a bad power supply, a PCI card, or bad caps on the motherboard, can cause it as well, but unless it’s directly linked to the CPU that is extremely rare. I’ve decided it’s time for an upgrade, and to err on the side of caution and just replace my motherboard, CPUs, memory, power supply, etc. This is the second identical motherboard and CPU set I’ve had with memory/CPU-related issues within a few years. I see no point in penny-pinching only to cause myself headaches in the future. I can always pinpoint the issue later if I have nothing to do, and use the hardware for less important things, but as my main system it simply will not do. Also, bad hardware like this can cause a daisy chain of failures: a bad CPU damages the motherboard, you replace the motherboard, and the CPU damages the new motherboard; or you replace the bad CPU and the motherboard makes the new CPU go bad. It’s extremely rare, but I’ve seen it happen in a controlled engineering environment with other hardware.

At this point, with the same error, I can’t even mount my root device rw; the kernel forbids it unless I force its hand, and even with a clean fsck the mount won’t work. No other hardware errors show. No issues with my RAID drives. Removing all memory sticks from CPU1 banks 1, 2 & 3, and not CPU0, fixes the error (I say fixes, but it’s not a fix; it just "hides" the error from the kernel, and it will still cause data corruption). However, trying another memory stick from the CPU0 bank fails every time. I tried 15 "known good" memory sticks with the same results. Also, I can run CPU/memory burn tests, and "most" basic tests will pass without issue.

Also, I wanted to say that the forum sign-up on this server with the date/uname/hash should be changed, in my opinion, for people like me who are having hardware issues without access to another Linux system. I actually had to stop and think, and notice that the security question changed every time (from hash 256 to 512, different date %V %J outputs), and make sure my time was on par with the server’s, with epoch or day-of-the-month format. It really just wasted my time, which is something I really hate. Forums are usually for people who need help, which can include date/time/rtc/hardware/kernel issues with Linux systems. Only elitist n00bs who think they are supreme/clever/elite and better than other people would use that type of captcha, but that’s only my opinion. Hunting down an online sha tool and trying multiple "near correct date" hashes until I get the right one takes more time than most of the hardest captchas, and that’s from someone who is very experienced with Unix/Linux systems.

It’s kind of like the Arch forum admins are telling me: "Well, if you don’t currently have access to a Linux system with the correct date set, you’re not getting into our forums without some work and wasted time finding a proper hash that works." It just leaves a bad taste in my mouth, and I’ve been using Unix/Linux systems for over 27 years, and that’s my point of view.

Last edited by zeronullity (2021-07-26 12:25:01)

  • #1

Found my NUC with Proxmox installed in unresponsive state today (first time ever after 2 weeks of use).

On reboot see these errors:
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: xxx
mce: [Hardware Error]: TSC 0 ADDR fef1ce80 MISC xxx
mce: [Hardware Error]: PROCESSOR 0:a0660 TIME xxx SOCKET 0 APIC 0 microcode ca
(see attached pic — https://i.imgur.com/LYsQyyN.png)

The box booted and seems normal so far but see those errors on boot
Quick memory test did not show any problems so far.

rasdaemon -f, journalctl -f show no obvious problems.

==========================
root@pve:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 64036 MB
node 0 free: 42377 MB
node distances:
node 0
0: 10
root@pve:~# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

===========================
root@pve:~# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

root@pve:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 64036 MB
node 0 free: 42345 MB
node distances:
node 0
0: 10
================

I run Intel NUC 7 BXNUC10i7FNH
Here is my CPU info https://pastebin.com/MpXedi1h

Anybody had experience with such errors ? Bad RAM, motherboard ?
Can it be benign?

Thx in advance!

Last edited: Sep 6, 2020

  • #2

Other than the fact that I noticed this after pve was unresponsive (which could be coincidental and unrelated to the h/w errors), I see no issues running pve.

  • #3

Has anybody seen this? Maybe a red herring?
@wolfgang

wolfgang


  • #4

Hi,

I guess the nuc gen 10 is too new and has some problems.
But if you’d like to verify that the memory and CPU are OK, run stress-ng for 30 minutes:

Code:

stress-ng --cpu 6 --vm 6 --verify 1 --vm-bytes 80%

If this test does not crash, the likelihood is high that the NUC will work without problems.

  • #5

@wolfgang

Thank you for the good practical advice!

I ran that for ~40 min with 100% CPU
This is what I saw in the log:

Sep 09 08:04:30 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <7f>
TDT <93>
next_to_use <93>
next_to_clean <7e>
buffer_info[next_to_clean]:
time_stamp <102670de1>
next_to_watch <7f>
jiffies <1026716a8>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>

This NUC is still replaceable. Would you suggest replacing it, or do you suspect it’s more of a generic issue?

  • #7

@evg32

Thank you !

What is interesting is that after running stress-ng I did not see "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6:" during boot.

Have you seen an error like mine above too? In other words, I want to understand why your solution is good for me. (I wish I was an expert in this area :) )

(
I did try it, and it seems like much more output was generated (without GRUB_CMDLINE_LINUX_DEFAULT="quiet"). I did not see my error, but saw
pve kernel: [ 5.659369] Bluetooth: hci0: Failed to load Intel firmware file (-2)

/var/log/syslog:Sep 9 09:18:34 pve kernel: [ 6.616670] Bluetooth: hci0: Failed to load Intel firmware file (-2)
/var/log/syslog:Sep 9 09:18:34 pve systemd[1]: apparmor.service: Failed with result 'exit-code'.

Those maybe unrelated to this at all, guessing …
)

The main thing I am trying to assess now is whether my NUC h/w is bad and needs to be replaced. Based on your post, it sounds like it is not h/w related, correct?

Thanks again !

  • #9

I noticed that mce errors occurred randomly, I couldn’t correlate them with anything.
Yep, I saw the same errors except that my CPU was i9-9900K.
That’s a CPU bug, as described here https://bugzilla.kernel.org/show_bug.cgi?id=109051

You are lucky with i9-9900K, I got i9-9880H from Hystou and could not even set it up, returned and then got I7

OK I will not replace the NUC then.

Thank you !

  • #10

@evg32

Have you tried installing "intel-microcode" (apt install intel-microcode) ?
Wonder if that could help as well ?

  • #11

@evg32

Have you tried installing "intel-microcode" (apt install intel-microcode) ?
Wonder if that could help as well ?

Yep, I tried to upgrade and downgrade cpu firmware. Nothing helped me.

wolfgang


  • #12

Sep 09 08:04:30 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:

Ignore this.
This stress test is not comparable with normal load, and it is normal that other, non-tested parts get no resources and run into errors.

Have you tried installing "intel-microcode" (apt install intel-microcode) ?

As long as you keep the BIOS of your NUC updated, the microcode package will bring no benefit,
because the microcode in Debian also comes from Intel, and Intel does a good job of keeping the NUC firmware up to date.

  • #13

Yep, I tried to upgrade and downgrade cpu firmware. Nothing helped me.

Do you know what this line is actually doing ?

Code:

GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"

  • #14

It disables power saving.
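
To make that answer a little more concrete, here is a small Python sketch that splits the GRUB line into individual kernel parameters and attaches a short note to each. The notes are my reading of the kernel documentation, not something stated in this thread, and parse_grub_cmdline is a hypothetical helper name.

```python
import shlex

# Short notes on the two parameters from the post above.
# Assumption: wording is a paraphrase of the kernel parameter docs.
NOTES = {
    "consoleblank": "console blanking timeout in seconds; 0 turns blanking off",
    "intel_idle.max_cstate": "deepest C-state the intel_idle driver may use",
}


def parse_grub_cmdline(line: str) -> dict:
    """Parse a GRUB_CMDLINE_...="a=1 b" line into {name: value or None}."""
    _, _, value = line.partition("=")
    params = {}
    for token in shlex.split(value):  # shlex removes the surrounding quotes
        for item in token.split():
            key, _, val = item.partition("=")
            params[key] = val or None
    return params


if __name__ == "__main__":
    line = 'GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"'
    for key, val in parse_grub_cmdline(line).items():
        print(f"{key}={val}: {NOTES.get(key, 'no note')}")
```

So of the two parameters, only intel_idle.max_cstate=1 concerns CPU power saving; consoleblank=0 just keeps the text console from blanking.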

  • #15

It disables power saving.

I never setup sound on VMs and was reading about it today. That seems to be around grub settings as well.
Have you configured sound as well with that line ?

thx !

  • #16

Nope, I never touched the sound configs. I just needed stable VMs and a stable host server.
