My dmesg gets completely spammed with the following messages appearing over and over again, which keeps increasing the size of the log file. I can use the pci=noaer kernel parameter to suppress these annoying messages, but I'm not sure what this parameter does, and whether it will make my PC lose some functionality. Is this a bug? Below is part of the dmesg output.
——————————————————————————————————
[ 397.076509] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 397.076517] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 397.076522] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 397.076526] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 397.081907] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 397.081914] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 397.081918] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 397.081923] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 398.983368] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 398.983376] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 398.983381] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 398.983385] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 398.984773] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 398.984779] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 398.984783] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 398.984788] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 398.994170] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 398.994176] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 398.994180] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 398.994185] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 399.333957] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 399.333964] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 399.333968] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 399.333973] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 399.339347] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 399.339353] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 399.339358] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 399.339362] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-12-generic 4.15.0-12.13
ProcVersionSignature: Ubuntu 4.15.0-12.13-generic 4.15.7
Uname: Linux 4.15.0-12-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/controlC1: ven 1747 F.... pulseaudio
/dev/snd/controlC0: ven 1747 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Sat Mar 24 23:24:16 2018
InstallationDate: Installed on 2018-03-07 (17 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Alpha amd64 (20180305)
MachineType: LENOVO 81BR
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-12-generic.efi.signed root=UUID=a8a345d7-5b2e-495a-897a-351b10597e61 ro quiet splash vt.handoff=1
RelatedPackageVersions:
linux-restricted-modules-4.15.0-12-generic N/A
linux-backports-modules-4.15.0-12-generic N/A
linux-firmware 1.173
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/19/2017
dmi.bios.vendor: LENOVO
dmi.bios.version: 6KCN28WW
dmi.board.asset.tag: NO Asset Tag
dmi.board.name: LNVNB161216
dmi.board.vendor: LENOVO
dmi.board.version: SDK0L77769 WIN
dmi.chassis.asset.tag: NO Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Lenovo ideapad 720S-13ARR
dmi.modalias: dmi:bvnLENOVO:bvr6KCN28WW:bd12/19/2017:svnLENOVO:pn81BR:pvrLenovoideapad720S-13ARR:rvnLENOVO:rnLNVNB161216:rvrSDK0L77769WIN:cvnLENOVO:ct10:cvrLenovoideapad720S-13ARR:
dmi.product.family: ideapad 720S-13ARR
dmi.product.name: 81BR
dmi.product.version: Lenovo ideapad 720S-13ARR
dmi.sys.vendor: LENOVO
I installed Lubuntu on an Acer Swift; installing it on the SSD already required changing the BIOS setting for the storage controller to AHCI.
Now I'm stuck on getting the shutdown to work properly. I have already tried some options in /etc/default/grub,
like
reboot=bios
acpi=force apm=power_off
pci=nomsi,noaer
(I know this last one should only suppress the error messages)
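For context, these options go on the kernel command line via the GRUB defaults; a minimal sketch of applying one of them, assuming the stock quiet splash defaults and using pci=noaer as the example parameter:

# /etc/default/grub (only the parameter list changes)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"

# regenerate the GRUB configuration, then reboot for it to take effect
sudo update-grub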
The error that I get on shutdown is:
PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
device [10de:1d13] error status/mask=00100000/00000000
[20] UnsupReq (First)
TLP Header: 40000008 000000ff a024c010 f7f7f7f7
After displaying that error, nothing happens and it doesn't shut down properly.
I haven't mounted the extra SATA drive yet; I want to do that after fixing the shutdown.
➜ ~ uname -a
Linux blub 5.3.0-18-generic #19-Ubuntu SMP Tue Oct 8 20:14:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
➜ ~ sudo lshw
...
product: Swift SF314-56G (0000000000000000)
vendor: Acer
version: V1.08
...
The questions I have seen so far regarding this kind of error mostly had "severity=Corrected" or a different "type". If you need any more information, please write a comment and I'll update the question.
How can I get this to shut down properly?
Update
➜ ~ lspci -nnk
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e34] (rev 0b)
Subsystem: Acer Incorporated [ALI] Device [1025:1301]
Kernel driver in use: skl_uncore
00:02.0 VGA compatible controller [0300]: Intel Corporation UHD Graphics 620 (Whiskey Lake) [8086:3ea0]
Subsystem: Acer Incorporated [ALI] UHD Graphics 620 (Whiskey Lake) [1025:1301]
Kernel driver in use: i915
Kernel modules: i915
00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 0b)
Subsystem: Acer Incorporated [ALI] Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [1025:1301]
Kernel driver in use: proc_thermal
Kernel modules: processor_thermal_device
00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Point-LP Thermal Controller [8086:9df9] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Thermal Controller [1025:1301]
Kernel driver in use: intel_pch_thermal
Kernel modules: intel_pch_thermal
00:14.0 USB controller [0c03]: Intel Corporation Cannon Point-LP USB 3.1 xHCI Controller [8086:9ded] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP USB 3.1 xHCI Controller [1025:1301]
Kernel driver in use: xhci_hcd
00:14.2 RAM memory [0500]: Intel Corporation Cannon Point-LP Shared SRAM [8086:9def] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Shared SRAM [1025:1301]
00:14.3 Network controller [0280]: Intel Corporation Cannon Point-LP CNVi [Wireless-AC] [8086:9df0] (rev 30)
Subsystem: Intel Corporation Cannon Point-LP CNVi [Wireless-AC] [8086:0034]
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Point-LP Serial IO I2C Controller #0 [8086:9de8] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Serial IO I2C Controller [1025:1301]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Point-LP Serial IO I2C Controller #1 [8086:9de9] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Serial IO I2C Controller [1025:1301]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:16.0 Communication controller [0780]: Intel Corporation Cannon Point-LP MEI Controller #1 [8086:9de0] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP MEI Controller [1025:1301]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation Cannon Point-LP SATA Controller [AHCI Mode] [8086:9dd3] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP SATA Controller [AHCI Mode] [1025:1301]
Kernel driver in use: ahci
Kernel modules: ahci
00:19.0 Serial bus controller [0c80]: Intel Corporation Device [8086:9dc5] (rev 30)
Subsystem: Acer Incorporated [ALI] Device [1025:1301]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #1 [8086:9db8] (rev f0)
Kernel driver in use: pcieport
00:1c.4 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #5 [8086:9dbc] (rev f0)
Kernel driver in use: pcieport
00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #9 [8086:9db0] (rev f0)
Kernel driver in use: pcieport
00:1d.4 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #13 [8086:9db4] (rev f0)
Kernel driver in use: pcieport
00:1f.0 ISA bridge [0601]: Intel Corporation Cannon Point-LP LPC Controller [8086:9d84] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP LPC Controller [1025:1301]
00:1f.3 Multimedia audio controller [0401]: Intel Corporation Cannon Point-LP High Definition Audio Controller [8086:9dc8] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP High Definition Audio Controller [1025:1300]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel, snd_soc_skl, sof_pci_dev
00:1f.4 SMBus [0c05]: Intel Corporation Cannon Point-LP SMBus Controller [8086:9da3] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP SMBus Controller [1025:1301]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Point-LP SPI Controller [8086:9da4] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP SPI Controller [1025:1301]
02:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX250] [10de:1d13] (rev a1)
Subsystem: Acer Incorporated [ALI] GP108M [GeForce MX250] [1025:1301]
Kernel driver in use: nouveau
Kernel modules: nvidiafb, nouveau
04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black 2018/PC SN520 NVMe SSD [15b7:5003] (rev 01)
Subsystem: Sandisk Corp WD Black 2018/PC SN520 NVMe SSD [15b7:5003]
Kernel driver in use: nvme
Kernel modules: nvme
I came home to an error saying my log file was full. Turns out I have been receiving a stream of PCIe errors since I made some hardware changes over the weekend.
The first device that is throwing errors is one of two GPUs in the system. The errors look like:
Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Tower kernel: vfio-pci 0000:01:00.0: device [10de:1e84] error status/mask=00100000/00000000
Tower kernel: vfio-pci 0000:01:00.0: [20] UnsupReq (First)
Tower kernel: vfio-pci 0000:01:00.0: AER: TLP Header: 40000001 00000003 000be7c0 f7f7f7f7
Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful
The second device that is throwing errors is my LSI card. This is new. It is an LSI 9207-8i purchased from The Art of the Server on ebay. It is in a PCIe slot that was previously occupied by an NVME SSD in a PCIe adapter. Those errors look like:
Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: mpt3sas 0000:04:00.0: device [1000:0087] error status/mask=00000001/00002000
Tower kernel: mpt3sas 0000:04:00.0: [ 0] RxErr
Despite these errors, both devices are acting normally. The GPU is passed through to a VM and behaves as expected even under full load. The LSI card also appears fully functional. I went through an entire parity check which passed with zero errors. I am currently running through a drive rebuild (not because of drive failure, just swapping it out) and would rather not have to abort, but I also do not know how severe these errors are and if I need to take immediate action.
I am attaching my full diagnostics dump.
Any advice would be much appreciated.
Thank you.
tower-diagnostics-20210310-1757.zip
Contents
- 8. The PCI Express Advanced Error Reporting Driver Guide HOWTO
- 8.1. Overview
- 8.1.1. About this guide
- 8.1.2. What is the PCI Express AER Driver?
- 8.2. User Guide
- 8.2.1. Include the PCI Express AER Root Driver into the Linux Kernel
- 8.2.2. Load PCI Express AER Root Driver
- 8.2.3. AER error output
- 8.2.4. AER Statistics / Counters
- 8.3. Developer Guide
- 8.3.1. Configure the AER capability structure
- 8.3.2. Provide callbacks
- 8.3.2.1. callback reset_link to reset pci express link
- 8.3.2.2. PCI error-recovery callbacks
- 8.3.2.3. Correctable errors
- 8.3.2.4. Non-correctable (non-fatal and fatal) errors
- 8.3.3. helper functions
- 8.3.4. Frequently Asked Questions
- 8.4. Software error injection
8. The PCI Express Advanced Error Reporting Driver Guide HOWTO
© 2006 Intel Corporation
8.1. Overview
8.1.1. About this guide
This guide describes the basics of the PCI Express Advanced Error Reporting (AER) driver and provides information on how to use it, as well as how to enable the drivers of endpoint devices to conform with the PCI Express AER driver.
8.1.2. What is the PCI Express AER Driver?
PCI Express error signaling can occur on the PCI Express link itself or on behalf of transactions initiated on the link. PCI Express defines two error reporting paradigms: the baseline capability and the Advanced Error Reporting capability. The baseline capability is required of all PCI Express components providing a minimum defined set of error reporting requirements. Advanced Error Reporting capability is implemented with a PCI Express advanced error reporting extended capability structure providing more robust error reporting.
The PCI Express AER driver provides the infrastructure to support PCI Express Advanced Error Reporting capability. The PCI Express AER driver provides three basic functions:
Gathers comprehensive error information when errors occur.
Reports errors to the users.
Performs error recovery actions.
The AER driver only attaches to Root Ports that support the PCI Express AER capability.
8.2. User Guide
8.2.1. Include the PCI Express AER Root Driver into the Linux Kernel
The PCI Express AER Root driver is a Root Port service driver attached to the PCI Express Port Bus driver. If a user wants to use it, the driver has to be compiled in. The option CONFIG_PCIEAER enables this capability. It depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.
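A quick way to confirm this on a running system, assuming the distribution ships the kernel configuration under /boot (as Debian and Ubuntu do):

# check whether the running kernel was built with Port Bus and AER support
grep -E 'CONFIG_PCIEPORTBUS|CONFIG_PCIEAER' /boot/config-$(uname -r)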
8.2.2. Load PCI Express AER Root Driver
Some systems have AER support in firmware. Enabling Linux AER support at the same time the firmware handles AER may result in unpredictable behavior. Therefore, Linux does not handle AER events unless the firmware grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 Specification for details regarding _OSC usage.
8.2.3. AER error output
When a PCIe AER error is captured, an error message is output to the console. If it is a correctable error, it is output as a warning; otherwise, it is printed as an error. Users can therefore choose a different log level to filter out correctable error messages.
Below is an example:
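A representative report, reproduced from the dmesg output quoted at the top of this page (the exact fields vary with the device and the error type):

[ 397.076509] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 397.076517] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 397.076522] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 397.076526] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54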
In the example, 'Requester ID' means the ID of the device that sent the error message to the Root Port. Please refer to the PCI Express specification for the other fields.
8.2.4. AER Statistics / Counters
When PCIe AER errors are captured, the counters / statistics are also exposed in the form of sysfs attributes, which are documented in Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
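A small sketch of reading those counters for one device; the PCI address below is only an illustrative placeholder, and the attributes exist only on kernels built with AER statistics support:

# per-device AER counters (replace the address with the device of interest)
cat /sys/bus/pci/devices/0000:00:1c.0/aer_dev_correctable
cat /sys/bus/pci/devices/0000:00:1c.0/aer_dev_nonfatal
cat /sys/bus/pci/devices/0000:00:1c.0/aer_dev_fatal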
8.3. Developer Guide
Enabling AER-aware support requires a software driver to configure the AER capability structure within its device and to provide callbacks.
To support AER properly, developers first need to understand how AER works.
PCI Express errors are classified into two types: correctable errors and uncorrectable errors. This classification is based on the impacts of those errors, which may result in degraded performance or function failure.
Correctable errors pose no impacts on the functionality of the interface. The PCI Express protocol can recover without any software intervention or any loss of data. These errors are detected and corrected by hardware. Unlike correctable errors, uncorrectable errors impact functionality of the interface. Uncorrectable errors can cause a particular transaction or a particular PCI Express link to be unreliable. Depending on those error conditions, uncorrectable errors are further classified into non-fatal errors and fatal errors. Non-fatal errors cause the particular transaction to be unreliable, but the PCI Express link itself is fully functional. Fatal errors, on the other hand, cause the link to be unreliable.
When AER is enabled, a PCI Express device will automatically send an error message to the PCIe Root Port above it when the device captures an error. The Root Port, upon receiving an error reporting message, internally processes and logs the error message in its PCI Express capability structure. The error information being logged includes storing the error reporting agent's requester ID into the Error Source Identification Registers and setting the error bits of the Root Error Status Register accordingly. If AER error reporting is enabled in the Root Error Command Register, the Root Port generates an interrupt when an error is detected.
Note that the errors as described above are related to the PCI Express hierarchy and links. These errors do not include any device specific errors because device specific errors will still get sent directly to the device driver.
8.3.1. Configure the AER capability structure
AER-aware drivers of PCI Express components need to change the Device Control registers to enable AER. They can also change other AER registers, including the mask and severity registers. The helper function pci_enable_pcie_error_reporting can be used to enable AER. See section 8.3.3.
8.3.2. Provide callbacks
8.3.2.1. callback reset_link to reset pci express link
This callback is used to reset the PCI Express physical link when a fatal error happens. The Root Port AER service driver provides a default reset_link function, but different upstream ports might have different requirements for resetting the PCI Express link, so all upstream ports should provide their own reset_link functions.
Section 8.3.2.2 provides more detailed information on when to call reset_link.
8.3.2.2. PCI error-recovery callbacks
The PCI Express AER Root driver uses error callbacks to coordinate with the downstream device drivers associated with the hierarchy in question when performing error recovery actions.
The data structure pci_driver has a pointer, err_handler, that points to pci_error_handlers, which consists of a couple of callback function pointers. The AER driver follows the rules defined in PCI Error Recovery, except for the PCI Express specific parts (e.g. reset_link). Please refer to PCI Error Recovery for detailed definitions of the callbacks.
The sections below specify when to call the error callback functions.
8.3.2.3. Correctable errors
Correctable errors pose no impacts on the functionality of the interface. The PCI Express protocol can recover without any software intervention or any loss of data. These errors do not require any recovery actions. The AER driver clears the device’s correctable error status register accordingly and logs these errors.
8.3.2.4. Non-correctable (non-fatal and fatal) errors
If an error message indicates a non-fatal error, performing a link reset at the upstream port is not required. The AER driver calls error_detected(dev, pci_channel_io_normal) for all drivers associated with the hierarchy in question. For example:
If Upstream port A captures an AER error, the hierarchy consists of Downstream port B and the Endpoint.
A driver may return PCI_ERS_RESULT_CAN_RECOVER, PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on whether it can recover; if it returns PCI_ERS_RESULT_CAN_RECOVER, the AER driver calls mmio_enabled next.
If an error message indicates a fatal error, the kernel will broadcast error_detected(dev, pci_channel_io_frozen) to all drivers within the hierarchy in question. Then, performing a link reset at the upstream port is necessary. As different kinds of devices might use different approaches to reset the link, the AER port service driver is required to provide the link-reset function via the callback parameter of the pcie_do_recovery() function. If reset_link is not NULL, the recovery function will use it to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.
8.3.3. helper functions
pci_enable_pcie_error_reporting enables the device to send error messages to the Root Port when an error is detected. Note that devices do not enable error reporting by default, so device drivers need to call this function to enable it.
pci_disable_pcie_error_reporting disables the device from sending error messages to the Root Port when an error is detected.
pci_aer_clear_nonfatal_status clears non-fatal errors in the Uncorrectable Error Status register.
8.3.4. Frequently Asked Questions
What happens if a PCI Express device driver does not provide an error recovery handler (pci_driver->err_handler is equal to NULL)?
The devices attached to the driver won't be recovered. If the error is fatal, the kernel will print out warning messages. Please refer to section 8.3 for more information.
What happens if an upstream port service driver does not provide the callback reset_link?
Fatal error recovery will fail if the errors are reported by upstream ports that are attached to that service driver.
How does this infrastructure deal with a driver that is not PCI Express aware?
This infrastructure calls the error callback functions of the driver when an error happens. But if the driver is not aware of PCI Express, the device might not report its own errors to the Root Port.
What modifications will such a driver need to make it compatible with the PCI Express AER Root driver?
It could call the helper functions to enable AER in the device and to clean up the uncorrectable error status register. Please refer to section 8.3.3.
8.4. Software error injection
Debugging PCIe AER error recovery code is quite difficult because it is hard to trigger real hardware errors. Software-based error injection can be used to fake various kinds of PCIe errors.
First you should enable PCIe AER software error injection in the kernel configuration; that is, the following item should be in your .config:
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
After rebooting with the new kernel or inserting the module, a device file named /dev/aer_inject should be created.
Then, you need a user space tool named aer-inject, which can be obtained from:
More information about aer-inject can be found in the documentation that comes with its source code.
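A minimal sketch of exercising this, assuming aer_inject was built as a module and that an error-description file for aer-inject has already been prepared (the file name below is a placeholder):

# load the injection module and confirm the device node exists
modprobe aer_inject
ls -l /dev/aer_inject

# feed an error description to the user space tool
aer-inject my-error.conf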
-
#1
I have seen the many threads on this, but I promise I'm a special case here. I have Proxmox VE installed with an Ubuntu and a Windows VM, each for a different task. Long story short, I can physically see my graphics card as a passthrough option, but once I try to initialize the VM it doesn't work. I have attempted to have a monitor plugged into the card, but here is where we get to the rabbit hole. I can't check the IOMMU status with dmesg | grep -e DMAR -e IOMMU, and additionally the grub.d directory is blank. My /etc/default/grub has:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_acs_override=downstream,multifunction video=efifb:off"
The server is running an E5-2650 v3, an ASRock X99 Extreme4, and a GTX 1050 for passthrough.
I'm relatively new to all of this, and any help would be awesome!
-
#2
I can't check IOMMU status with dmesg | grep -e DMAR -e IOMMU
Uhm... why not? Type it in, run it, and post what it says?
Tip for getting help online in general: whenever you're tempted to write "doesn't work" without any context, rewrite your sentence with context.
What exactly doesn't work? What are you expecting to happen, and what happens? What do the error messages say, if there are any? What shows up in the syslog (journalctl -e)?
Your GRUB_CMDLINE_LINUX_DEFAULT looks correct, though it's a good idea to add 'iommu=pt' as well for improved host performance. Did you bind the vfio-pci driver to your GPU as indicated on our wiki?
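For reference, that binding is usually done with a modprobe options file; a minimal sketch, where 10de:1c81 is the GPU ID taken from the error log later in this thread and the GPU's audio function (if it has one) would be appended after looking it up with lspci -nn:

# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1c81

# rebuild the initramfs so the binding takes effect at boot (GRUB-booted install assumed)
update-initramfs -u -k all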
-
#3
This is what happens when I try to check IOMMU:
root@Homestead:~# dmesg | grep -e DMAR -e IOMMU
root@Homestead:~#
When I try to initialize the VM, it will attempt to start, then the log comes back with an error and it hard-freezes the VM to the point that I have to restart the entire node.
Here is the vfio in /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
here is the system log
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq >
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq >
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq >
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
lines 979-1001/1001 (END)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq (First)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq (First)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq (First)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
I appreciate the help.
-
#4
I see a lot of AER: Multiple Uncorrected (Non-Fatal) error received. Is this the only GPU in your system? Does it display the startup of the Proxmox host? It often helps to add the following kernel parameters: nomodeset textonly video=vesafb:off video=efifb:off.
-
#5
I see a lot of AER: Multiple Uncorrected (Non-Fatal) error received. Is this the only GPU in your system? Does it display the startup of the Proxmox host? It often helps to add the following kernel parameters: nomodeset textonly video=vesafb:off video=efifb:off.
Yes, it is the only GPU in the system. I get a "QEMU exited with code 1". Also, I added those parameters to GRUB and still have the same problem. Side note, in case it's important: the VM is Windows-based and I followed this guide: https://www.youtube.com/watch?v=fgx3NMk6F54&t=453s (clearly not well enough).
-
#6
So, update: I switched to BIOS 2, as the motherboard supports it. The VM boots, but the QEMU guest agent doesn't work for RDC. I'm not sure if it's hanging somewhere in the boot or if this is a side effect of something else altogether. Additionally, I noticed that an ACS option needs to be enabled at the BIOS level for Xeon E3 and E5 processors; is this something that could affect it? I haven't been able to find the option in the BIOS to enable or disable.
-
#7
What is the difference between BIOS 2 and BIOS 1?
In principle, you want ACS enabled because it determines the IOMMU groups. But you use pcie_acs_override to ignore ACS and split those groups up. Do you really need that ACS override? Did you check whether the GPU was in an IOMMU group without other devices (except bridges) before using the override?
-
#8
What is the difference between BIOS 2 and BIOS 1?
In principle, you want ACS enabled because it determines the IOMMU groups. But you use pcie_acs_override to ignore ACS and split those groups up. Do you really need that ACS override? Did you check whether the GPU was in an IOMMU group without other devices (except bridges) before using the override?
Same version of BIOS; I just wanted to see if there was a sticking point somewhere. Also, how would I check IOMMU groups? I have pulled it up before and seen the output, but how do you determine the groups?
-
#9
Same version of BIOS; I just wanted to see if there was a sticking point somewhere. Also, how would I check IOMMU groups? I have pulled it up before and seen the output, but how do you determine the groups?
dmesg | grep iommu, for example, shows all devices added to each IOMMU group. Each group has a unique number; devices are in the same group when they are added to the group with the same number. For a fancy overview use: for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nnks "${d##*/}"; done. Make sure to look at the whole group that contains the device you want to pass through.
-
#10
Hey, so here is the group list. I'm not sure if I just don't know what I am doing, but I think the VGA device 0000:02:00.0 in group 30 is the graphics card, because that is the correct address for it.
I had to use find /sys/kernel/iommu_groups, as dmesg | grep iommu did not produce a result.
/sys/kernel/iommu_groups/17
/sys/kernel/iommu_groups/17/devices
/sys/kernel/iommu_groups/17/devices/0000:00:05.2
/sys/kernel/iommu_groups/17/type
/sys/kernel/iommu_groups/17/reserved_regions
/sys/kernel/iommu_groups/7
/sys/kernel/iommu_groups/7/devices
/sys/kernel/iommu_groups/7/devices/0000:ff:14.7
/sys/kernel/iommu_groups/7/devices/0000:ff:14.5
/sys/kernel/iommu_groups/7/devices/0000:ff:14.3
/sys/kernel/iommu_groups/7/devices/0000:ff:14.1
/sys/kernel/iommu_groups/7/devices/0000:ff:14.6
/sys/kernel/iommu_groups/7/devices/0000:ff:14.4
/sys/kernel/iommu_groups/7/devices/0000:ff:14.2
/sys/kernel/iommu_groups/7/devices/0000:ff:14.0
/sys/kernel/iommu_groups/7/type
/sys/kernel/iommu_groups/7/reserved_regions
/sys/kernel/iommu_groups/25
/sys/kernel/iommu_groups/25/devices
/sys/kernel/iommu_groups/25/devices/0000:00:1b.0
/sys/kernel/iommu_groups/25/type
/sys/kernel/iommu_groups/25/reserved_regions
/sys/kernel/iommu_groups/15
/sys/kernel/iommu_groups/15/devices
/sys/kernel/iommu_groups/15/devices/0000:00:05.0
/sys/kernel/iommu_groups/15/type
/sys/kernel/iommu_groups/15/reserved_regions
/sys/kernel/iommu_groups/5
/sys/kernel/iommu_groups/5/devices
/sys/kernel/iommu_groups/5/devices/0000:ff:12.4
/sys/kernel/iommu_groups/5/devices/0000:ff:12.0
/sys/kernel/iommu_groups/5/devices/0000:ff:12.5
/sys/kernel/iommu_groups/5/devices/0000:ff:12.1
/sys/kernel/iommu_groups/5/type
/sys/kernel/iommu_groups/5/reserved_regions
/sys/kernel/iommu_groups/23
/sys/kernel/iommu_groups/23/devices
/sys/kernel/iommu_groups/23/devices/0000:00:19.0
/sys/kernel/iommu_groups/23/type
/sys/kernel/iommu_groups/23/reserved_regions
/sys/kernel/iommu_groups/13
/sys/kernel/iommu_groups/13/devices
/sys/kernel/iommu_groups/13/devices/0000:00:01.0
/sys/kernel/iommu_groups/13/type
/sys/kernel/iommu_groups/13/reserved_regions
/sys/kernel/iommu_groups/31
/sys/kernel/iommu_groups/31/devices
/sys/kernel/iommu_groups/31/devices/0000:02:00.1
/sys/kernel/iommu_groups/31/type
/sys/kernel/iommu_groups/31/reserved_regions
/sys/kernel/iommu_groups/3
/sys/kernel/iommu_groups/3/devices
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.6
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.4
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.2
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.0
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.5
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.3
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.1
/sys/kernel/iommu_groups/3/type
/sys/kernel/iommu_groups/3/reserved_regions
/sys/kernel/iommu_groups/21
/sys/kernel/iommu_groups/21/devices
/sys/kernel/iommu_groups/21/devices/0000:00:14.0
/sys/kernel/iommu_groups/21/type
/sys/kernel/iommu_groups/21/reserved_regions
/sys/kernel/iommu_groups/11
/sys/kernel/iommu_groups/11/devices
/sys/kernel/iommu_groups/11/devices/0000:ff:1f.0
/sys/kernel/iommu_groups/11/devices/0000:ff:1f.2
/sys/kernel/iommu_groups/11/type
/sys/kernel/iommu_groups/11/reserved_regions
/sys/kernel/iommu_groups/1
/sys/kernel/iommu_groups/1/devices
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.6
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.4
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.2
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.0
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.7
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.5
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.3
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.1
/sys/kernel/iommu_groups/1/type
/sys/kernel/iommu_groups/1/reserved_regions
/sys/kernel/iommu_groups/28
/sys/kernel/iommu_groups/28/devices
/sys/kernel/iommu_groups/28/devices/0000:00:1d.0
/sys/kernel/iommu_groups/28/type
/sys/kernel/iommu_groups/28/reserved_regions
/sys/kernel/iommu_groups/18
/sys/kernel/iommu_groups/18/devices
/sys/kernel/iommu_groups/18/devices/0000:00:05.4
/sys/kernel/iommu_groups/18/type
/sys/kernel/iommu_groups/18/reserved_regions
/sys/kernel/iommu_groups/8
/sys/kernel/iommu_groups/8/devices
/sys/kernel/iommu_groups/8/devices/0000:ff:16.2
/sys/kernel/iommu_groups/8/devices/0000:ff:16.0
/sys/kernel/iommu_groups/8/devices/0000:ff:16.7
/sys/kernel/iommu_groups/8/devices/0000:ff:16.3
/sys/kernel/iommu_groups/8/devices/0000:ff:16.1
/sys/kernel/iommu_groups/8/devices/0000:ff:16.6
/sys/kernel/iommu_groups/8/type
/sys/kernel/iommu_groups/8/reserved_regions
/sys/kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices
/sys/kernel/iommu_groups/26/devices/0000:00:1c.0
/sys/kernel/iommu_groups/26/type
/sys/kernel/iommu_groups/26/reserved_regions
/sys/kernel/iommu_groups/16
/sys/kernel/iommu_groups/16/devices
/sys/kernel/iommu_groups/16/devices/0000:00:05.1
/sys/kernel/iommu_groups/16/type
/sys/kernel/iommu_groups/16/reserved_regions
/sys/kernel/iommu_groups/6
/sys/kernel/iommu_groups/6/devices
/sys/kernel/iommu_groups/6/devices/0000:ff:13.2
/sys/kernel/iommu_groups/6/devices/0000:ff:13.0
/sys/kernel/iommu_groups/6/devices/0000:ff:13.7
/sys/kernel/iommu_groups/6/devices/0000:ff:13.3
/sys/kernel/iommu_groups/6/devices/0000:ff:13.1
/sys/kernel/iommu_groups/6/devices/0000:ff:13.6
/sys/kernel/iommu_groups/6/type
/sys/kernel/iommu_groups/6/reserved_regions
/sys/kernel/iommu_groups/24
/sys/kernel/iommu_groups/24/devices
/sys/kernel/iommu_groups/24/devices/0000:00:1a.0
/sys/kernel/iommu_groups/24/type
/sys/kernel/iommu_groups/24/reserved_regions
/sys/kernel/iommu_groups/14
/sys/kernel/iommu_groups/14/devices
/sys/kernel/iommu_groups/14/devices/0000:00:03.0
/sys/kernel/iommu_groups/14/type
/sys/kernel/iommu_groups/14/reserved_regions
/sys/kernel/iommu_groups/32
/sys/kernel/iommu_groups/32/devices
/sys/kernel/iommu_groups/32/devices/0000:04:00.0
/sys/kernel/iommu_groups/32/type
/sys/kernel/iommu_groups/32/reserved_regions
/sys/kernel/iommu_groups/4
/sys/kernel/iommu_groups/4/devices
/sys/kernel/iommu_groups/4/devices/0000:ff:10.0
/sys/kernel/iommu_groups/4/devices/0000:ff:10.7
/sys/kernel/iommu_groups/4/devices/0000:ff:10.5
/sys/kernel/iommu_groups/4/devices/0000:ff:10.1
/sys/kernel/iommu_groups/4/devices/0000:ff:10.6
/sys/kernel/iommu_groups/4/type
/sys/kernel/iommu_groups/4/reserved_regions
/sys/kernel/iommu_groups/22
/sys/kernel/iommu_groups/22/devices
/sys/kernel/iommu_groups/22/devices/0000:00:16.0
/sys/kernel/iommu_groups/22/type
/sys/kernel/iommu_groups/22/reserved_regions
/sys/kernel/iommu_groups/12
/sys/kernel/iommu_groups/12/devices
/sys/kernel/iommu_groups/12/devices/0000:00:00.0
/sys/kernel/iommu_groups/12/type
/sys/kernel/iommu_groups/12/reserved_regions
/sys/kernel/iommu_groups/30
/sys/kernel/iommu_groups/30/devices
/sys/kernel/iommu_groups/30/devices/0000:02:00.0
/sys/kernel/iommu_groups/30/type
/sys/kernel/iommu_groups/30/reserved_regions
/sys/kernel/iommu_groups/2
/sys/kernel/iommu_groups/2/devices
/sys/kernel/iommu_groups/2/devices/0000:ff:0d.1
/sys/kernel/iommu_groups/2/devices/0000:ff:0d.0
/sys/kernel/iommu_groups/2/type
/sys/kernel/iommu_groups/2/reserved_regions
/sys/kernel/iommu_groups/20
/sys/kernel/iommu_groups/20/devices
/sys/kernel/iommu_groups/20/devices/0000:00:11.4
/sys/kernel/iommu_groups/20/type
/sys/kernel/iommu_groups/20/reserved_regions
/sys/kernel/iommu_groups/10
/sys/kernel/iommu_groups/10/devices
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.4
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.2
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.0
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.3
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.1
/sys/kernel/iommu_groups/10/type
/sys/kernel/iommu_groups/10/reserved_regions
/sys/kernel/iommu_groups/29
/sys/kernel/iommu_groups/29/devices
/sys/kernel/iommu_groups/29/devices/0000:00:1f.2
/sys/kernel/iommu_groups/29/devices/0000:00:1f.0
/sys/kernel/iommu_groups/29/devices/0000:00:1f.3
/sys/kernel/iommu_groups/29/type
/sys/kernel/iommu_groups/29/reserved_regions
/sys/kernel/iommu_groups/0
/sys/kernel/iommu_groups/0/devices
/sys/kernel/iommu_groups/0/devices/0000:ff:0b.1
/sys/kernel/iommu_groups/0/devices/0000:ff:0b.2
/sys/kernel/iommu_groups/0/devices/0000:ff:0b.0
/sys/kernel/iommu_groups/0/type
/sys/kernel/iommu_groups/0/reserved_regions
/sys/kernel/iommu_groups/19
/sys/kernel/iommu_groups/19/devices
/sys/kernel/iommu_groups/19/devices/0000:00:11.0
/sys/kernel/iommu_groups/19/type
/sys/kernel/iommu_groups/19/reserved_regions
/sys/kernel/iommu_groups/9
/sys/kernel/iommu_groups/9/devices
/sys/kernel/iommu_groups/9/devices/0000:ff:17.7
/sys/kernel/iommu_groups/9/devices/0000:ff:17.5
/sys/kernel/iommu_groups/9/devices/0000:ff:17.3
/sys/kernel/iommu_groups/9/devices/0000:ff:17.1
/sys/kernel/iommu_groups/9/devices/0000:ff:17.6
/sys/kernel/iommu_groups/9/devices/0000:ff:17.4
/sys/kernel/iommu_groups/9/devices/0000:ff:17.2
/sys/kernel/iommu_groups/9/devices/0000:ff:17.0
/sys/kernel/iommu_groups/9/type
/sys/kernel/iommu_groups/9/reserved_regions
/sys/kernel/iommu_groups/27
/sys/kernel/iommu_groups/27/devices
/sys/kernel/iommu_groups/27/devices/0000:00:1c.3
/sys/kernel/iommu_groups/27/type
/sys/kernel/iommu_groups/27/reserved_regions
-
#11
Also, a fun new addition to this problem is that kern.log is 17 GB and my local folder is now full. I don't know which files I can and should clear, or how to do it via SSH, because I can no longer GUI my way in.
-
#12
Deleting log files once in a while should be fine: rm /var/log/kern.log. Or maybe removing the old ones is enough: rm /var/log/kern.log.*. Or all non-current logs: rm /var/log/*.log.*
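A sketch of keeping the logs bounded instead of deleting them by hand (the size limit is an arbitrary example):

# reclaim space used by the systemd journal
journalctl --vacuum-size=500M

# force an immediate rotation of the logrotate-managed files such as kern.log
logrotate -f /etc/logrotate.conf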
-
#13
So I've been thinking, and wild question here: my server won't boot without a GPU installed. Would this create an issue when PCI passthrough is enabled, as a permissions or required-hardware problem?
-
#14
So I've been thinking, and wild question here: my server won't boot without a GPU installed. Would this create an issue when PCI passthrough is enabled, as a permissions or required-hardware problem?
This is usually not a problem. The system detects the GPU during POST, and it is only passed through to a VM after the host has completed booting. Passthrough of the first or only GPU can be a bit more tricky than passthrough of additional GPUs, but that's about it.
-
#15
This is usually not a problem. The system detects the GPU during POST, and it is only passed through to a VM after the host has completed booting. Passthrough of the first or only GPU can be a bit more tricky than passthrough of additional GPUs, but that's about it.
Well, then I am stuck. I have no clue what I'm missing here; the only thing I can think of is that it's some kind of hardware issue, but this is obviously not my area of expertise. Also, thank you for all the help so far!
-
#16
Can you please share your VM configuration file from /etc/pve/qemu-server/ and show the groups in a more useful manner using for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nn "${d##*/}"; done after removing the pcie_acs_override=downstream,multifunction (because that completely invalidates all information we can gather from the IOMMU groups)?
-
#17
Can you please share your VM configuration file from /etc/pve/qemu-server/ and show the groups in a more useful manner using for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nn "${d##*/}"; done after removing the pcie_acs_override=downstream,multifunction (because that completely invalidates all information we can gather from the IOMMU groups)?
So I cleared pcie_acs_override=downstream,multifunction from GRUB and then used nano /etc/pve/qemu-server/, and the result came back saying this is a directory and the area is blank.
-
#18
ls /etc/pve/qemu-server/ should show the VM configuration files, ending in .conf and starting with the numeric VM identifier. You can look at a configuration file with cat /etc/pve/qemu-server/100.conf, for example for VM number 100.
Don't forget to run update-grub after making changes to /etc/default/grub, and to reboot for the changes to take effect.
-
#19
ls /etc/pve/qemu-server/ should show the VM configuration files, ending in .conf and starting with the numeric VM identifier. You can look at a configuration file with cat /etc/pve/qemu-server/100.conf, for example for VM number 100. Don't forget to run update-grub after making changes to /etc/default/grub, and to reboot for the changes to take effect.
I did run the GRUB update, thanks for the reminder, and here is the result!
root@Homestead:~# cat /etc/pve/qemu-server/103.conf
agent: 1
balloon: 1024
bios: ovmf
boot: order=scsi0;net0
cores: 4
cpu: host,hidden=1,flags=+pcid
machine: pc-q35-6.0
memory: 8192
name: Windows-Tester-1
net0: virtio=16:8A:36:9B:E5:42,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsi0: Local-Proxmox:vm-103-disk-0,size=40G
scsihw: virtio-scsi-pci
smbios1: uuid=fa39f51a-154d-4c7c-8dab-2facda7ceb2b
sockets: 1
unused0: Local-Proxmox:vm-103-disk-1
vmgenid: 139d7b1d-b5ef-4cce-8cdb-09274a1ae519
vmstatestorage: TrueNas
-
#20
Sorry, but I don't see any hostpci entries. Is this the VM you want to use for passthrough?
Can you please share the complete output of for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nn "${d##*/}"; done (without pcie_acs_override)?
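For reference, a passthrough entry is normally added with qm set rather than by editing the file by hand; a minimal sketch, assuming VM 103 and the GPU address 02:00 from earlier in the thread (x-vga=1 is commonly set when the card is the guest's primary display):

# attach all functions of the GPU to VM 103 as a PCIe device
qm set 103 -hostpci0 02:00,pcie=1,x-vga=1

# the resulting line in /etc/pve/qemu-server/103.conf looks like:
# hostpci0: 02:00,pcie=1,x-vga=1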
Topic: Excessively long shutdown [PCIe Bus Error] (Read 2065 times)
rekursia
Good day! The problem:
On shutdown/reboot I see the following errors repeating in the log for two (!!!) minutes.
pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00000001/00002000
pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
lspci -vv
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #17 (rev f0) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Memory behind bridge: f7200000-f72fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: <access denied>
Kernel driver in use: pcieport
Ubuntu 18.10 on a desktop:
CPU: HexaCore Intel Core i5-8400 3900 MHz
Motherboard: Prime Z370-P
RAM: 2x Corsair Vengeance 8 GB DDR4-2400
Video adapter: GeForce GTX 1060
BIOS Ver. 1410
I tried
pci=nomsi,pci=noaer (update-grub)
and the result is the same.
Morisson
pcie_aspm=off
Try this.
And by the way, do you happen to have something like tlp installed?
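Before disabling ASPM outright, the currently active policy can be checked from sysfs; a small sketch (the path is standard on mainline kernels, and the bracketed entry in the output is the active policy):

cat /sys/module/pcie_aspm/parameters/policy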
rekursia
pcie_aspm=off
Try this.
And by the way, do you happen to have something like tlp installed?
On the first reboot the same errors were spammed just as quickly.
On the second reboot there was no error log. The screen showed a message something like
Stack.
Started bpfilter
but the reboot took just as long.
Ubuntu is stock out of the box; I haven't installed anything, and this error appeared right away.
Morisson
What does journald say at the end?
And post your dmesg here?
rekursia
I copied only a part, because it's the same thing repeated over and over.
journalctl -b
dmesg
AnrDaemon
Want to get help? Take the trouble to provide the requested information in full.
Before you hit [Post], hit [Preview] and read your own message. Do you even understand what you wrote?
rekursia
Mouse, headphones, keyboard.
lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 04f3:152e Elan Microelectronics Corp.
Bus 001 Device 002: ID 17ef:608e Lenovo
Bus 001 Device 004: ID 09da:2701 A4Tech Co., Ltd.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
AnrDaemon
Have you tried disconnecting the unnecessary ones?
rekursia
I rebooted with only the mouse and the monitor connected... same thing.
DimanBG
I tried
pci=nomsi,pci=noaer
Interesting, do you understand even a little what you are trying there?
nomsi: then why did you even buy all of this:
CPU: HexaCore Intel Core i5-8400 3900 MHz
Motherboard: Prime Z370-P
RAM: 2x Corsair Vengeance 8 GB DDR4-2400
Video adapter: GeForce GTX 1060
BIOS Ver. 1410
noaer: kernel errors will simply no longer be printed. Will that make things any easier?
Try this.
First you need to look at what state it is currently in.
dmesg | grep -i aspm
By default it should say that the BIOS configuration is being used.
With properly working ACPI, and with the OS installed and booted via UEFI, there are fewer problems on desktops; standards and specifications are followed better there than on laptops.
And the errors are corrected, which is exactly what the log entries say.
Mouse, headphones, keyboard.
Bus 001 Device 003: ID 04f3:152e Elan Microelectronics Corp.
And this one? A fingerprint reader, a touchpad?
Bus 001 Device 002: ID 17ef:608e Lenovo
A smartphone?
Disconnect it and report the result.
rekursia
dmesg | grep -i aspm
(empty)
journalctl | grep -i aspm
And this one? A fingerprint reader, a touchpad?
That's the keyboard.
A smartphone?
Disconnect it and report the result.
And that's the mouse.
I disconnected everything and tried other devices.
I forgot to mention that I initially installed Windows, but on a different SSD, if that matters at all.
rekursia
I've added BIOS screenshots.
Are there any ideas at all on where to dig?