My dmesg gets completely spammed with the following messages appearing over and over again, which keeps increasing the size of the log file. I can use the pci=noaer kernel parameter to suppress these annoying messages, but I'm not sure what this parameter does, and whether it will make my PC lose some functionality. Is this a bug? Below is part of the dmesg output.
——————————————————————————————————
[ 397.076509] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 397.076517] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 397.076522] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 397.076526] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 397.081907] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 397.081914] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 397.081918] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 397.081923] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 398.983368] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 398.983376] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 398.983381] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 398.983385] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 398.984773] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 398.984779] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 398.984783] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 398.984788] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 398.994170] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 398.994176] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 398.994180] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 398.994185] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 399.333957] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 399.333964] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 399.333968] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 399.333973] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
[ 399.339347] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 399.339353] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 399.339358] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 399.339362] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54
ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-12-generic 4.15.0-12.13
ProcVersionSignature: Ubuntu 4.15.0-12.13-generic 4.15.7
Uname: Linux 4.15.0-12-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/controlC1: ven 1747 F.... pulseaudio
/dev/snd/controlC0: ven 1747 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Sat Mar 24 23:24:16 2018
InstallationDate: Installed on 2018-03-07 (17 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Alpha amd64 (20180305)
MachineType: LENOVO 81BR
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-12-generic.efi.signed root=UUID=a8a345d7-5b2e-495a-897a-351b10597e61 ro quiet splash vt.handoff=1
RelatedPackageVersions:
linux-restricted-modules-4.15.0-12-generic N/A
linux-backports-modules-4.15.0-12-generic N/A
linux-firmware 1.173
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/19/2017
dmi.bios.vendor: LENOVO
dmi.bios.version: 6KCN28WW
dmi.board.asset.tag: NO Asset Tag
dmi.board.name: LNVNB161216
dmi.board.vendor: LENOVO
dmi.board.version: SDK0L77769 WIN
dmi.chassis.asset.tag: NO Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Lenovo ideapad 720S-13ARR
dmi.modalias: dmi:bvnLENOVO:bvr6KCN28WW:bd12/19/2017:svnLENOVO:pn81BR:pvrLenovoideapad720S-13ARR:rvnLENOVO:rnLNVNB161216:rvrSDK0L77769WIN:cvnLENOVO:ct10:cvrLenovoideapad720S-13ARR:
dmi.product.family: ideapad 720S-13ARR
dmi.product.name: 81BR
dmi.product.version: Lenovo ideapad 720S-13ARR
dmi.sys.vendor: LENOVO
I installed Lubuntu on an Acer Swift; installing it on the SSD already required changing the BIOS setting for the storage controller to AHCI.
Now I'm stuck on getting the shutdown to work properly. I have already tried some options in /etc/default/grub,
like
reboot=bios
acpi=force apm=power_off
pci=nomsi,noaer
(I know this last one should only suppress the error messages)
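For context, these options go on the kernel command line via the GRUB defaults; a minimal sketch of applying one of them, assuming the stock quiet splash defaults and using pci=noaer as the example parameter:

# /etc/default/grub (only the parameter list changes)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"

# regenerate the GRUB configuration, then reboot for it to take effect
sudo update-grub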
The error that I get on shutdown is:
PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
device [10de:1d13] error status/mask=00100000/00000000
[20] UnsupReq (First)
TLP Header: 40000008 000000ff a024c010 f7f7f7f7
After displaying that error, nothing happens and it doesn't shut down properly.
I haven't mounted the extra SATA drive yet; I want to do that after fixing the shutdown.
➜ ~ uname -a
Linux blub 5.3.0-18-generic #19-Ubuntu SMP Tue Oct 8 20:14:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
➜ ~ sudo lshw
...
product: Swift SF314-56G (0000000000000000)
vendor: Acer
version: V1.08
...
The questions I have seen so far regarding this kind of error mostly had "severity=Corrected" or a different "type". If you need any more information, please write a comment and I'll update the question.
How can I get this to shut down properly?
Update
➜ ~ lspci -nnk
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e34] (rev 0b)
Subsystem: Acer Incorporated [ALI] Device [1025:1301]
Kernel driver in use: skl_uncore
00:02.0 VGA compatible controller [0300]: Intel Corporation UHD Graphics 620 (Whiskey Lake) [8086:3ea0]
Subsystem: Acer Incorporated [ALI] UHD Graphics 620 (Whiskey Lake) [1025:1301]
Kernel driver in use: i915
Kernel modules: i915
00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 0b)
Subsystem: Acer Incorporated [ALI] Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [1025:1301]
Kernel driver in use: proc_thermal
Kernel modules: processor_thermal_device
00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Point-LP Thermal Controller [8086:9df9] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Thermal Controller [1025:1301]
Kernel driver in use: intel_pch_thermal
Kernel modules: intel_pch_thermal
00:14.0 USB controller [0c03]: Intel Corporation Cannon Point-LP USB 3.1 xHCI Controller [8086:9ded] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP USB 3.1 xHCI Controller [1025:1301]
Kernel driver in use: xhci_hcd
00:14.2 RAM memory [0500]: Intel Corporation Cannon Point-LP Shared SRAM [8086:9def] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Shared SRAM [1025:1301]
00:14.3 Network controller [0280]: Intel Corporation Cannon Point-LP CNVi [Wireless-AC] [8086:9df0] (rev 30)
Subsystem: Intel Corporation Cannon Point-LP CNVi [Wireless-AC] [8086:0034]
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Point-LP Serial IO I2C Controller #0 [8086:9de8] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Serial IO I2C Controller [1025:1301]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Point-LP Serial IO I2C Controller #1 [8086:9de9] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP Serial IO I2C Controller [1025:1301]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:16.0 Communication controller [0780]: Intel Corporation Cannon Point-LP MEI Controller #1 [8086:9de0] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP MEI Controller [1025:1301]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation Cannon Point-LP SATA Controller [AHCI Mode] [8086:9dd3] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP SATA Controller [AHCI Mode] [1025:1301]
Kernel driver in use: ahci
Kernel modules: ahci
00:19.0 Serial bus controller [0c80]: Intel Corporation Device [8086:9dc5] (rev 30)
Subsystem: Acer Incorporated [ALI] Device [1025:1301]
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #1 [8086:9db8] (rev f0)
Kernel driver in use: pcieport
00:1c.4 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #5 [8086:9dbc] (rev f0)
Kernel driver in use: pcieport
00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #9 [8086:9db0] (rev f0)
Kernel driver in use: pcieport
00:1d.4 PCI bridge [0604]: Intel Corporation Cannon Point-LP PCI Express Root Port #13 [8086:9db4] (rev f0)
Kernel driver in use: pcieport
00:1f.0 ISA bridge [0601]: Intel Corporation Cannon Point-LP LPC Controller [8086:9d84] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP LPC Controller [1025:1301]
00:1f.3 Multimedia audio controller [0401]: Intel Corporation Cannon Point-LP High Definition Audio Controller [8086:9dc8] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP High Definition Audio Controller [1025:1300]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel, snd_soc_skl, sof_pci_dev
00:1f.4 SMBus [0c05]: Intel Corporation Cannon Point-LP SMBus Controller [8086:9da3] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP SMBus Controller [1025:1301]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Point-LP SPI Controller [8086:9da4] (rev 30)
Subsystem: Acer Incorporated [ALI] Cannon Point-LP SPI Controller [1025:1301]
02:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX250] [10de:1d13] (rev a1)
Subsystem: Acer Incorporated [ALI] GP108M [GeForce MX250] [1025:1301]
Kernel driver in use: nouveau
Kernel modules: nvidiafb, nouveau
04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black 2018/PC SN520 NVMe SSD [15b7:5003] (rev 01)
Subsystem: Sandisk Corp WD Black 2018/PC SN520 NVMe SSD [15b7:5003]
Kernel driver in use: nvme
Kernel modules: nvme
I came home to an error saying my log file was full. Turns out I have been receiving a stream of PCIe errors since I made some hardware changes over the weekend.
The first device that is throwing errors is one of two GPUs in the system. The errors look like:
Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: vfio-pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Tower kernel: vfio-pci 0000:01:00.0: device [10de:1e84] error status/mask=00100000/00000000
Tower kernel: vfio-pci 0000:01:00.0: [20] UnsupReq (First)
Tower kernel: vfio-pci 0000:01:00.0: AER: TLP Header: 40000001 00000003 000be7c0 f7f7f7f7
Tower kernel: pcieport 0000:00:03.0: AER: device recovery successful
The second device that is throwing errors is my LSI card. This is new. It is an LSI 9207-8i purchased from The Art of the Server on ebay. It is in a PCIe slot that was previously occupied by an NVME SSD in a PCIe adapter. Those errors look like:
Tower kernel: mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Tower kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
Tower kernel: mpt3sas 0000:04:00.0: device [1000:0087] error status/mask=00000001/00002000
Tower kernel: mpt3sas 0000:04:00.0: [ 0] RxErr
Despite these errors, both devices are acting normally. The GPU is passed through to a VM and behaves as expected even under full load. The LSI card also appears fully functional. I went through an entire parity check which passed with zero errors. I am currently running through a drive rebuild (not because of drive failure, just swapping it out) and would rather not have to abort, but I also do not know how severe these errors are and if I need to take immediate action.
I am attaching my full diagnostics dump.
Any advice would be much appreciated.
Thank you.
tower-diagnostics-20210310-1757.zip
Contents
- 8. The PCI Express Advanced Error Reporting Driver Guide HOWTO
- 8.1. Overview
- 8.1.1. About this guide
- 8.1.2. What is the PCI Express AER Driver?
- 8.2. User Guide
- 8.2.1. Include the PCI Express AER Root Driver into the Linux Kernel
- 8.2.2. Load PCI Express AER Root Driver
- 8.2.3. AER error output
- 8.2.4. AER Statistics / Counters
- 8.3. Developer Guide
- 8.3.1. Configure the AER capability structure
- 8.3.2. Provide callbacks
- 8.3.2.1. callback reset_link to reset pci express link
- 8.3.2.2. PCI error-recovery callbacks
- 8.3.2.3. Correctable errors
- 8.3.2.4. Non-correctable (non-fatal and fatal) errors
- 8.3.3. helper functions
- 8.3.4. Frequently Asked Questions
- 8.4. Software error injection
8. The PCI Express Advanced Error Reporting Driver Guide HOWTO
© 2006 Intel Corporation
8.1. Overview
8.1.1. About this guide
This guide describes the basics of the PCI Express Advanced Error Reporting (AER) driver and provides information on how to use it, as well as how to enable the drivers of endpoint devices to conform with the PCI Express AER driver.
8.1.2. What is the PCI Express AER Driver?
PCI Express error signaling can occur on the PCI Express link itself or on behalf of transactions initiated on the link. PCI Express defines two error reporting paradigms: the baseline capability and the Advanced Error Reporting capability. The baseline capability is required of all PCI Express components providing a minimum defined set of error reporting requirements. Advanced Error Reporting capability is implemented with a PCI Express advanced error reporting extended capability structure providing more robust error reporting.
The PCI Express AER driver provides the infrastructure to support PCI Express Advanced Error Reporting capability. The PCI Express AER driver provides three basic functions:
Gathers comprehensive error information when errors occur.
Reports errors to the users.
Performs error recovery actions.
The AER driver only attaches to Root Ports that support the PCI Express AER capability.
8.2. User Guide
8.2.1. Include the PCI Express AER Root Driver into the Linux Kernel
The PCI Express AER Root driver is a Root Port service driver attached to the PCI Express Port Bus driver. If a user wants to use it, the driver has to be compiled in. The option CONFIG_PCIEAER enables this capability. It depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.
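A quick way to confirm this on a running system, assuming the distribution ships the kernel configuration under /boot (as Debian and Ubuntu do):

# check whether the running kernel was built with Port Bus and AER support
grep -E 'CONFIG_PCIEPORTBUS|CONFIG_PCIEAER' /boot/config-$(uname -r)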
8.2.2. Load PCI Express AER Root Driver
Some systems have AER support in firmware. Enabling Linux AER support at the same time the firmware handles AER may result in unpredictable behavior. Therefore, Linux does not handle AER events unless the firmware grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 Specification for details regarding _OSC usage.
8.2.3. AER error output
When a PCIe AER error is captured, an error message is output to the console. If it is a correctable error, it is output as a warning; otherwise, it is printed as an error. Users can therefore choose a different log level to filter out correctable error messages.
Below is an example:
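A representative report, reproduced from the dmesg output quoted at the top of this page (the exact fields vary with the device and the error type):

[ 397.076509] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[ 397.076517] pcieport 0000:00:01.1: device [1022:15d3] error status/mask=00100000/04400000
[ 397.076522] pcieport 0000:00:01.1: [20] Unsupported Request (First)
[ 397.076526] pcieport 0000:00:01.1: TLP Header: 34000000 01001f10 00000000 8c548c54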
In the example, 'Requester ID' means the ID of the device that sent the error message to the Root Port. Please refer to the PCI Express specification for the other fields.
8.2.4. AER Statistics / Counters
When PCIe AER errors are captured, the counters / statistics are also exposed in the form of sysfs attributes, which are documented in Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
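A small sketch of reading those counters for one device; the PCI address below is only an illustrative placeholder, and the attributes exist only on kernels built with AER statistics support:

# per-device AER counters (replace the address with the device of interest)
cat /sys/bus/pci/devices/0000:00:1c.0/aer_dev_correctable
cat /sys/bus/pci/devices/0000:00:1c.0/aer_dev_nonfatal
cat /sys/bus/pci/devices/0000:00:1c.0/aer_dev_fatal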
8.3. Developer Guide
Enabling AER-aware support requires a software driver to configure the AER capability structure within its device and to provide callbacks.
To support AER properly, developers first need to understand how AER works.
PCI Express errors are classified into two types: correctable errors and uncorrectable errors. This classification is based on the impacts of those errors, which may result in degraded performance or function failure.
Correctable errors pose no impacts on the functionality of the interface. The PCI Express protocol can recover without any software intervention or any loss of data. These errors are detected and corrected by hardware. Unlike correctable errors, uncorrectable errors impact functionality of the interface. Uncorrectable errors can cause a particular transaction or a particular PCI Express link to be unreliable. Depending on those error conditions, uncorrectable errors are further classified into non-fatal errors and fatal errors. Non-fatal errors cause the particular transaction to be unreliable, but the PCI Express link itself is fully functional. Fatal errors, on the other hand, cause the link to be unreliable.
When AER is enabled, a PCI Express device will automatically send an error message to the PCIe Root Port above it when the device captures an error. The Root Port, upon receiving an error reporting message, internally processes and logs the error message in its PCI Express capability structure. The error information being logged includes storing the error reporting agent's requester ID into the Error Source Identification Registers and setting the error bits of the Root Error Status Register accordingly. If AER error reporting is enabled in the Root Error Command Register, the Root Port generates an interrupt when an error is detected.
Note that the errors as described above are related to the PCI Express hierarchy and links. These errors do not include any device specific errors because device specific errors will still get sent directly to the device driver.
8.3.1. Configure the AER capability structure
AER-aware drivers of PCI Express components need to change the Device Control registers to enable AER. They can also change other AER registers, including the mask and severity registers. The helper function pci_enable_pcie_error_reporting can be used to enable AER. See section 8.3.3.
8.3.2. Provide callbacks
8.3.2.1. callback reset_link to reset pci express link
This callback is used to reset the PCI Express physical link when a fatal error happens. The Root Port AER service driver provides a default reset_link function, but different upstream ports might have different requirements for resetting the PCI Express link, so all upstream ports should provide their own reset_link functions.
Section 8.3.2.2 provides more detailed information on when to call reset_link.
8.3.2.2. PCI error-recovery callbacks
The PCI Express AER Root driver uses error callbacks to coordinate with the downstream device drivers associated with the hierarchy in question when performing error recovery actions.
The data structure pci_driver has a pointer, err_handler, that points to pci_error_handlers, which consists of a couple of callback function pointers. The AER driver follows the rules defined in PCI Error Recovery, except for the PCI Express specific parts (e.g. reset_link). Please refer to PCI Error Recovery for detailed definitions of the callbacks.
The sections below specify when to call the error callback functions.
8.3.2.3. Correctable errors
Correctable errors pose no impacts on the functionality of the interface. The PCI Express protocol can recover without any software intervention or any loss of data. These errors do not require any recovery actions. The AER driver clears the device’s correctable error status register accordingly and logs these errors.
8.3.2.4. Non-correctable (non-fatal and fatal) errors
If an error message indicates a non-fatal error, performing a link reset at the upstream port is not required. The AER driver calls error_detected(dev, pci_channel_io_normal) for all drivers associated with the hierarchy in question. For example:
If Upstream port A captures an AER error, the hierarchy consists of Downstream port B and the Endpoint.
A driver may return PCI_ERS_RESULT_CAN_RECOVER, PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on whether it can recover; if it returns PCI_ERS_RESULT_CAN_RECOVER, the AER driver calls mmio_enabled next.
If an error message indicates a fatal error, the kernel will broadcast error_detected(dev, pci_channel_io_frozen) to all drivers within the hierarchy in question. Then, performing a link reset at the upstream port is necessary. As different kinds of devices might use different approaches to reset the link, the AER port service driver is required to provide the link-reset function via the callback parameter of the pcie_do_recovery() function. If reset_link is not NULL, the recovery function will use it to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.
8.3.3. helper functions
pci_enable_pcie_error_reporting enables the device to send error messages to the Root Port when an error is detected. Note that devices do not enable error reporting by default, so device drivers need to call this function to enable it.
pci_disable_pcie_error_reporting disables the device from sending error messages to the Root Port when an error is detected.
pci_aer_clear_nonfatal_status clears non-fatal errors in the Uncorrectable Error Status register.
8.3.4. Frequently Asked Questions
What happens if a PCI Express device driver does not provide an error recovery handler (pci_driver->err_handler is equal to NULL)?
The devices attached to the driver won't be recovered. If the error is fatal, the kernel will print out warning messages. Please refer to section 8.3 for more information.
What happens if an upstream port service driver does not provide the callback reset_link?
Fatal error recovery will fail if the errors are reported by upstream ports that are attached to that service driver.
How does this infrastructure deal with a driver that is not PCI Express aware?
This infrastructure calls the error callback functions of the driver when an error happens. But if the driver is not aware of PCI Express, the device might not report its own errors to the Root Port.
What modifications will such a driver need to make it compatible with the PCI Express AER Root driver?
It could call the helper functions to enable AER in the device and to clean up the uncorrectable error status register. Please refer to section 8.3.3.
8.4. Software error injection
Debugging PCIe AER error recovery code is quite difficult because it is hard to trigger real hardware errors. Software-based error injection can be used to fake various kinds of PCIe errors.
First you should enable PCIe AER software error injection in the kernel configuration; that is, the following item should be in your .config:
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
After rebooting with the new kernel or inserting the module, a device file named /dev/aer_inject should be created.
Then, you need a user space tool named aer-inject, which can be obtained from:
More information about aer-inject can be found in the documentation that comes with its source code.
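A minimal sketch of exercising this, assuming aer_inject was built as a module and that an error-description file for aer-inject has already been prepared (the file name below is a placeholder):

# load the injection module and confirm the device node exists
modprobe aer_inject
ls -l /dev/aer_inject

# feed an error description to the user space tool
aer-inject my-error.conf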
-
#1
I have seen the many threads on this, but I promise I'm a special case here. I have Proxmox VE installed with an Ubuntu and a Windows VM, each for a different task. Long story short, I can physically see my graphics card as a passthrough option, but once I try to initialize the VM it doesn't work. I have attempted to have a monitor plugged into the card, but here is where we get to the rabbit hole. I can't check the IOMMU status with dmesg | grep -e DMAR -e IOMMU, and additionally the grub.d directory is blank. My /etc/default/grub has:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_acs_override=downstream,multifunction video=efifb:off"
The server is running an E5-2650 v3, an ASRock X99 Extreme4, and a GTX 1050 for passthrough.
I'm relatively new to all of this, and any help would be awesome!
-
#2
I can't check IOMMU status with dmesg | grep -e DMAR -e IOMMU
Uhm... why not? Type it in, run it, and post what it says?
Tip for getting help online in general: whenever you're tempted to write "doesn't work" without any context, rewrite your sentence with context.
What exactly doesn't work? What are you expecting to happen, and what happens? What do the error messages say, if there are any? What shows up in the syslog (journalctl -e)?
Your GRUB_CMDLINE_LINUX_DEFAULT looks correct, though it's a good idea to add 'iommu=pt' as well for improved host performance. Did you bind the vfio-pci driver to your GPU as indicated on our wiki?
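For reference, that binding is usually done with a modprobe options file; a minimal sketch, where 10de:1c81 is the GPU ID taken from the error log later in this thread and the GPU's audio function (if it has one) would be appended after looking it up with lspci -nn:

# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1c81

# rebuild the initramfs so the binding takes effect at boot (GRUB-booted install assumed)
update-initramfs -u -k all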
-
#3
This is what happens when I try to check IOMMU:
root@Homestead:~# dmesg | grep -e DMAR -e IOMMU
root@Homestead:~#
When I try to initialize the VM, it will attempt to start, then the log comes back with an error and it hard-freezes the VM to the point that I have to restart the entire node.
Here is the vfio in /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
here is the system log
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq >
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq >
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq >
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 400>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery s>
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrec>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severi>
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] e>
lines 979-1001/1001 (END)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq (First)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq (First)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: [20] UnsupReq (First)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: AER: TLP Header: 40000008 000002ff f1030000 f7f7f7f7
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: device recovery successful
Oct 12 12:28:28 Homestead kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:02:00.0
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 12 12:28:28 Homestead kernel: vfio-pci 0000:02:00.0: device [10de:1c81] error status/mask=00100000/00000000
I appreciate the help.
-
#4
I see a lot of AER: Multiple Uncorrected (Non-Fatal) error received. Is this the only GPU in your system? Does it display the startup of the Proxmox host? It often helps to add the following kernel parameters: nomodeset textonly video=vesafb:off video=efifb:off.
-
#5
I see a lot of AER: Multiple Uncorrected (Non-Fatal) error received. Is this the only GPU in your system? Does it display the startup of the Proxmox host? It often helps to add the following kernel parameters: nomodeset textonly video=vesafb:off video=efifb:off.
Yes, it is the only GPU in the system. I get a "QEMU exited with code 1". Also, I added those parameters to GRUB and still have the same problem. Side note, in case it's important: the VM is Windows-based and I followed this guide: https://www.youtube.com/watch?v=fgx3NMk6F54&t=453s (clearly not well enough).
-
#6
So, update: I switched to BIOS 2, as the motherboard supports it. The VM boots, but the QEMU guest agent doesn't work for RDC. I'm not sure if it's hanging somewhere in the boot or if this is a side effect of something else altogether. Additionally, I noticed that an ACS option needs to be enabled at the BIOS level for Xeon E3 and E5 processors; is this something that could affect it? I haven't been able to find the option in the BIOS to enable or disable.
-
#7
What is the difference between BIOS 2 and BIOS 1?
In principle, you want ACS enabled because it determines the IOMMU groups. But you use pcie_acs_override to ignore ACS and split those groups up. Do you really need that ACS override? Did you check whether the GPU was in an IOMMU group without other devices (except bridges) before using the override?
-
#8
What is the difference between BIOS 2 and BIOS 1?
In principle, you want ACS enabled because it determines the IOMMU groups. But you use pcie_acs_override to ignore ACS and split those groups up. Do you really need that ACS override? Did you check whether the GPU was in an IOMMU group without other devices (except bridges) before using the override?
Same version of BIOS; I just wanted to see if there was a sticking point somewhere. Also, how would I check IOMMU groups? I have pulled it up before and seen the output, but how do you determine the groups?
-
#9
Same version of BIOS; I just wanted to see if there was a sticking point somewhere. Also, how would I check IOMMU groups? I have pulled it up before and seen the output, but how do you determine the groups?
dmesg | grep iommu, for example, shows all devices added to each IOMMU group. Each group has a unique number; devices are in the same group when they are added to the group with the same number. For a fancy overview use: for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nnks "${d##*/}"; done. Make sure to look at the whole group that contains the device you want to pass through.
-
#10
Hey, so here is the group list. I'm not sure if I just don't know what I am doing, but I think the VGA device 0000:02:00.0 in group 30 is the graphics card, because that is the correct address for it.
I had to use find /sys/kernel/iommu_groups, as dmesg | grep iommu did not produce a result.
/sys/kernel/iommu_groups/17
/sys/kernel/iommu_groups/17/devices
/sys/kernel/iommu_groups/17/devices/0000:00:05.2
/sys/kernel/iommu_groups/17/type
/sys/kernel/iommu_groups/17/reserved_regions
/sys/kernel/iommu_groups/7
/sys/kernel/iommu_groups/7/devices
/sys/kernel/iommu_groups/7/devices/0000:ff:14.7
/sys/kernel/iommu_groups/7/devices/0000:ff:14.5
/sys/kernel/iommu_groups/7/devices/0000:ff:14.3
/sys/kernel/iommu_groups/7/devices/0000:ff:14.1
/sys/kernel/iommu_groups/7/devices/0000:ff:14.6
/sys/kernel/iommu_groups/7/devices/0000:ff:14.4
/sys/kernel/iommu_groups/7/devices/0000:ff:14.2
/sys/kernel/iommu_groups/7/devices/0000:ff:14.0
/sys/kernel/iommu_groups/7/type
/sys/kernel/iommu_groups/7/reserved_regions
/sys/kernel/iommu_groups/25
/sys/kernel/iommu_groups/25/devices
/sys/kernel/iommu_groups/25/devices/0000:00:1b.0
/sys/kernel/iommu_groups/25/type
/sys/kernel/iommu_groups/25/reserved_regions
/sys/kernel/iommu_groups/15
/sys/kernel/iommu_groups/15/devices
/sys/kernel/iommu_groups/15/devices/0000:00:05.0
/sys/kernel/iommu_groups/15/type
/sys/kernel/iommu_groups/15/reserved_regions
/sys/kernel/iommu_groups/5
/sys/kernel/iommu_groups/5/devices
/sys/kernel/iommu_groups/5/devices/0000:ff:12.4
/sys/kernel/iommu_groups/5/devices/0000:ff:12.0
/sys/kernel/iommu_groups/5/devices/0000:ff:12.5
/sys/kernel/iommu_groups/5/devices/0000:ff:12.1
/sys/kernel/iommu_groups/5/type
/sys/kernel/iommu_groups/5/reserved_regions
/sys/kernel/iommu_groups/23
/sys/kernel/iommu_groups/23/devices
/sys/kernel/iommu_groups/23/devices/0000:00:19.0
/sys/kernel/iommu_groups/23/type
/sys/kernel/iommu_groups/23/reserved_regions
/sys/kernel/iommu_groups/13
/sys/kernel/iommu_groups/13/devices
/sys/kernel/iommu_groups/13/devices/0000:00:01.0
/sys/kernel/iommu_groups/13/type
/sys/kernel/iommu_groups/13/reserved_regions
/sys/kernel/iommu_groups/31
/sys/kernel/iommu_groups/31/devices
/sys/kernel/iommu_groups/31/devices/0000:02:00.1
/sys/kernel/iommu_groups/31/type
/sys/kernel/iommu_groups/31/reserved_regions
/sys/kernel/iommu_groups/3
/sys/kernel/iommu_groups/3/devices
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.6
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.4
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.2
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.0
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.5
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.3
/sys/kernel/iommu_groups/3/devices/0000:ff:0f.1
/sys/kernel/iommu_groups/3/type
/sys/kernel/iommu_groups/3/reserved_regions
/sys/kernel/iommu_groups/21
/sys/kernel/iommu_groups/21/devices
/sys/kernel/iommu_groups/21/devices/0000:00:14.0
/sys/kernel/iommu_groups/21/type
/sys/kernel/iommu_groups/21/reserved_regions
/sys/kernel/iommu_groups/11
/sys/kernel/iommu_groups/11/devices
/sys/kernel/iommu_groups/11/devices/0000:ff:1f.0
/sys/kernel/iommu_groups/11/devices/0000:ff:1f.2
/sys/kernel/iommu_groups/11/type
/sys/kernel/iommu_groups/11/reserved_regions
/sys/kernel/iommu_groups/1
/sys/kernel/iommu_groups/1/devices
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.6
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.4
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.2
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.0
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.7
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.5
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.3
/sys/kernel/iommu_groups/1/devices/0000:ff:0c.1
/sys/kernel/iommu_groups/1/type
/sys/kernel/iommu_groups/1/reserved_regions
/sys/kernel/iommu_groups/28
/sys/kernel/iommu_groups/28/devices
/sys/kernel/iommu_groups/28/devices/0000:00:1d.0
/sys/kernel/iommu_groups/28/type
/sys/kernel/iommu_groups/28/reserved_regions
/sys/kernel/iommu_groups/18
/sys/kernel/iommu_groups/18/devices
/sys/kernel/iommu_groups/18/devices/0000:00:05.4
/sys/kernel/iommu_groups/18/type
/sys/kernel/iommu_groups/18/reserved_regions
/sys/kernel/iommu_groups/8
/sys/kernel/iommu_groups/8/devices
/sys/kernel/iommu_groups/8/devices/0000:ff:16.2
/sys/kernel/iommu_groups/8/devices/0000:ff:16.0
/sys/kernel/iommu_groups/8/devices/0000:ff:16.7
/sys/kernel/iommu_groups/8/devices/0000:ff:16.3
/sys/kernel/iommu_groups/8/devices/0000:ff:16.1
/sys/kernel/iommu_groups/8/devices/0000:ff:16.6
/sys/kernel/iommu_groups/8/type
/sys/kernel/iommu_groups/8/reserved_regions
/sys/kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices
/sys/kernel/iommu_groups/26/devices/0000:00:1c.0
/sys/kernel/iommu_groups/26/type
/sys/kernel/iommu_groups/26/reserved_regions
/sys/kernel/iommu_groups/16
/sys/kernel/iommu_groups/16/devices
/sys/kernel/iommu_groups/16/devices/0000:00:05.1
/sys/kernel/iommu_groups/16/type
/sys/kernel/iommu_groups/16/reserved_regions
/sys/kernel/iommu_groups/6
/sys/kernel/iommu_groups/6/devices
/sys/kernel/iommu_groups/6/devices/0000:ff:13.2
/sys/kernel/iommu_groups/6/devices/0000:ff:13.0
/sys/kernel/iommu_groups/6/devices/0000:ff:13.7
/sys/kernel/iommu_groups/6/devices/0000:ff:13.3
/sys/kernel/iommu_groups/6/devices/0000:ff:13.1
/sys/kernel/iommu_groups/6/devices/0000:ff:13.6
/sys/kernel/iommu_groups/6/type
/sys/kernel/iommu_groups/6/reserved_regions
/sys/kernel/iommu_groups/24
/sys/kernel/iommu_groups/24/devices
/sys/kernel/iommu_groups/24/devices/0000:00:1a.0
/sys/kernel/iommu_groups/24/type
/sys/kernel/iommu_groups/24/reserved_regions
/sys/kernel/iommu_groups/14
/sys/kernel/iommu_groups/14/devices
/sys/kernel/iommu_groups/14/devices/0000:00:03.0
/sys/kernel/iommu_groups/14/type
/sys/kernel/iommu_groups/14/reserved_regions
/sys/kernel/iommu_groups/32
/sys/kernel/iommu_groups/32/devices
/sys/kernel/iommu_groups/32/devices/0000:04:00.0
/sys/kernel/iommu_groups/32/type
/sys/kernel/iommu_groups/32/reserved_regions
/sys/kernel/iommu_groups/4
/sys/kernel/iommu_groups/4/devices
/sys/kernel/iommu_groups/4/devices/0000:ff:10.0
/sys/kernel/iommu_groups/4/devices/0000:ff:10.7
/sys/kernel/iommu_groups/4/devices/0000:ff:10.5
/sys/kernel/iommu_groups/4/devices/0000:ff:10.1
/sys/kernel/iommu_groups/4/devices/0000:ff:10.6
/sys/kernel/iommu_groups/4/type
/sys/kernel/iommu_groups/4/reserved_regions
/sys/kernel/iommu_groups/22
/sys/kernel/iommu_groups/22/devices
/sys/kernel/iommu_groups/22/devices/0000:00:16.0
/sys/kernel/iommu_groups/22/type
/sys/kernel/iommu_groups/22/reserved_regions
/sys/kernel/iommu_groups/12
/sys/kernel/iommu_groups/12/devices
/sys/kernel/iommu_groups/12/devices/0000:00:00.0
/sys/kernel/iommu_groups/12/type
/sys/kernel/iommu_groups/12/reserved_regions
/sys/kernel/iommu_groups/30
/sys/kernel/iommu_groups/30/devices
/sys/kernel/iommu_groups/30/devices/0000:02:00.0
/sys/kernel/iommu_groups/30/type
/sys/kernel/iommu_groups/30/reserved_regions
/sys/kernel/iommu_groups/2
/sys/kernel/iommu_groups/2/devices
/sys/kernel/iommu_groups/2/devices/0000:ff:0d.1
/sys/kernel/iommu_groups/2/devices/0000:ff:0d.0
/sys/kernel/iommu_groups/2/type
/sys/kernel/iommu_groups/2/reserved_regions
/sys/kernel/iommu_groups/20
/sys/kernel/iommu_groups/20/devices
/sys/kernel/iommu_groups/20/devices/0000:00:11.4
/sys/kernel/iommu_groups/20/type
/sys/kernel/iommu_groups/20/reserved_regions
/sys/kernel/iommu_groups/10
/sys/kernel/iommu_groups/10/devices
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.4
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.2
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.0
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.3
/sys/kernel/iommu_groups/10/devices/0000:ff:1e.1
/sys/kernel/iommu_groups/10/type
/sys/kernel/iommu_groups/10/reserved_regions
/sys/kernel/iommu_groups/29
/sys/kernel/iommu_groups/29/devices
/sys/kernel/iommu_groups/29/devices/0000:00:1f.2
/sys/kernel/iommu_groups/29/devices/0000:00:1f.0
/sys/kernel/iommu_groups/29/devices/0000:00:1f.3
/sys/kernel/iommu_groups/29/type
/sys/kernel/iommu_groups/29/reserved_regions
/sys/kernel/iommu_groups/0
/sys/kernel/iommu_groups/0/devices
/sys/kernel/iommu_groups/0/devices/0000:ff:0b.1
/sys/kernel/iommu_groups/0/devices/0000:ff:0b.2
/sys/kernel/iommu_groups/0/devices/0000:ff:0b.0
/sys/kernel/iommu_groups/0/type
/sys/kernel/iommu_groups/0/reserved_regions
/sys/kernel/iommu_groups/19
/sys/kernel/iommu_groups/19/devices
/sys/kernel/iommu_groups/19/devices/0000:00:11.0
/sys/kernel/iommu_groups/19/type
/sys/kernel/iommu_groups/19/reserved_regions
/sys/kernel/iommu_groups/9
/sys/kernel/iommu_groups/9/devices
/sys/kernel/iommu_groups/9/devices/0000:ff:17.7
/sys/kernel/iommu_groups/9/devices/0000:ff:17.5
/sys/kernel/iommu_groups/9/devices/0000:ff:17.3
/sys/kernel/iommu_groups/9/devices/0000:ff:17.1
/sys/kernel/iommu_groups/9/devices/0000:ff:17.6
/sys/kernel/iommu_groups/9/devices/0000:ff:17.4
/sys/kernel/iommu_groups/9/devices/0000:ff:17.2
/sys/kernel/iommu_groups/9/devices/0000:ff:17.0
/sys/kernel/iommu_groups/9/type
/sys/kernel/iommu_groups/9/reserved_regions
/sys/kernel/iommu_groups/27
/sys/kernel/iommu_groups/27/devices
/sys/kernel/iommu_groups/27/devices/0000:00:1c.3
/sys/kernel/iommu_groups/27/type
/sys/kernel/iommu_groups/27/reserved_regions
-
#11
Also, a fun new addition to this problem is that kern.log is 17 GB and my local folder is now full. I don't know which files I can and should clear, or how to do it via SSH, because I can no longer GUI my way in.
-
#12
Deleting log files once in a while should be fine: rm /var/log/kern.log. Or maybe removing the old ones is enough: rm /var/log/kern.log.*. Or all non-current logs: rm /var/log/*.log.*
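A sketch of keeping the logs bounded instead of deleting them by hand (the size limit is an arbitrary example):

# reclaim space used by the systemd journal
journalctl --vacuum-size=500M

# force an immediate rotation of the logrotate-managed files such as kern.log
logrotate -f /etc/logrotate.conf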
-
#13
So I've been thinking, and wild question here: my server won't boot without a GPU installed. Would this create an issue when PCI passthrough is enabled, as a permissions or required-hardware problem?
-
#14
So I've been thinking, and wild question here: my server won't boot without a GPU installed. Would this create an issue when PCI passthrough is enabled, as a permissions or required-hardware problem?
This is usually not a problem. The system detects the GPU during POST, and it is only passed through to a VM after the host has completed booting. Passthrough of the first or only GPU can be a bit more tricky than passthrough of additional GPUs, but that's about it.
-
#15
This is usually not a problem. The system detects the GPU during POST, and it is only passed through to a VM after the host has completed booting. Passthrough of the first or only GPU can be a bit more tricky than passthrough of additional GPUs, but that's about it.
Well, then I am stuck. I have no clue what I'm missing here; the only thing I can think of is that it's some kind of hardware issue, but this is obviously not my area of expertise. Also, thank you for all the help so far!
-
#16
Can you please share your VM configuration file from /etc/pve/qemu-server/ and show the groups in a more useful manner using for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nn "${d##*/}"; done after removing the pcie_acs_override=downstream,multifunction (because that completely invalidates all information we can gather from the IOMMU groups)?
-
#17
Can you please share your VM configuration file from /etc/pve/qemu-server/ and show the groups in a more useful manner using for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nn "${d##*/}"; done after removing the pcie_acs_override=downstream,multifunction (because that completely invalidates all information we can gather from the IOMMU groups)?
So I cleared pcie_acs_override=downstream,multifunction from GRUB and then used nano /etc/pve/qemu-server/, and the result came back saying this is a directory and the area is blank.
-
#18
ls /etc/pve/qemu-server/ should show the VM configuration files, ending in .conf and starting with the numeric VM identifier. You can look at a configuration file with cat /etc/pve/qemu-server/100.conf, for example for VM number 100.
Don't forget to run update-grub after making changes to /etc/default/grub, and to reboot for the changes to take effect.
-
#19
ls /etc/pve/qemu-server/ should show the VM configuration files, ending in .conf and starting with the numeric VM identifier. You can look at a configuration file with cat /etc/pve/qemu-server/100.conf, for example for VM number 100. Don't forget to run update-grub after making changes to /etc/default/grub, and to reboot for the changes to take effect.
I did run the GRUB update, thanks for the reminder, and here is the result!
root@Homestead:~# cat /etc/pve/qemu-server/103.conf
agent: 1
balloon: 1024
bios: ovmf
boot: order=scsi0;net0
cores: 4
cpu: host,hidden=1,flags=+pcid
machine: pc-q35-6.0
memory: 8192
name: Windows-Tester-1
net0: virtio=16:8A:36:9B:E5:42,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsi0: Local-Proxmox:vm-103-disk-0,size=40G
scsihw: virtio-scsi-pci
smbios1: uuid=fa39f51a-154d-4c7c-8dab-2facda7ceb2b
sockets: 1
unused0: Local-Proxmox:vm-103-disk-1
vmgenid: 139d7b1d-b5ef-4cce-8cdb-09274a1ae519
vmstatestorage: TrueNas
-
#20
Sorry, but I don't see any hostpci entries. Is this the VM you want to use for passthrough?
Can you please share the complete output of for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nn "${d##*/}"; done (without pcie_acs_override)?
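For reference, a passthrough entry is normally added with qm set rather than by editing the file by hand; a minimal sketch, assuming VM 103 and the GPU address 02:00 from earlier in the thread (x-vga=1 is commonly set when the card is the guest's primary display):

# attach all functions of the GPU to VM 103 as a PCIe device
qm set 103 -hostpci0 02:00,pcie=1,x-vga=1

# the resulting line in /etc/pve/qemu-server/103.conf looks like:
# hostpci0: 02:00,pcie=1,x-vga=1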
Topic: Excessively long shutdown [PCIe Bus Error] (Read 2065 times)
rekursia
Good day! The problem:
On shutdown/reboot I see the following errors repeating in the log for two (!!!) minutes.
pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00000001/00002000
pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
lspci -vv
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #17 (rev f0) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Memory behind bridge: f7200000-f72fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: <access denied>
Kernel driver in use: pcieport
Ubuntu 18.10 on a desktop:
CPU: HexaCore Intel Core i5-8400 3900 MHz
Motherboard: Prime Z370-P
RAM: 2x Corsair Vengeance 8 GB DDR4-2400
Video adapter: GeForce GTX 1060
BIOS Ver. 1410
I tried
pci=nomsi,pci=noaer (update-grub)
and the result is the same.
Morisson
pcie_aspm=off
Try this.
And by the way, do you happen to have something like tlp installed?
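Before disabling ASPM outright, the currently active policy can be checked from sysfs; a small sketch (the path is standard on mainline kernels, and the bracketed entry in the output is the active policy):

cat /sys/module/pcie_aspm/parameters/policy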
rekursia
pcie_aspm=off
Try this.
And by the way, do you happen to have something like tlp installed?
On the first reboot the same errors were spammed just as quickly.
On the second reboot there was no error log. The screen showed a message something like
Stack.
Started bpfilter
but the reboot took just as long.
Ubuntu is stock out of the box; I haven't installed anything, and this error appeared right away.
Morisson
What does journald say at the end?
And post your dmesg here?
rekursia
I copied only a part, because it's the same thing repeated over and over.
journalctl -b
dmesg
AnrDaemon
Want to get help? Take the trouble to provide the requested information in full.
Before you hit [Post], hit [Preview] and read your own message. Do you even understand what you wrote?
rekursia
Mouse, headphones, keyboard.
lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 04f3:152e Elan Microelectronics Corp.
Bus 001 Device 002: ID 17ef:608e Lenovo
Bus 001 Device 004: ID 09da:2701 A4Tech Co., Ltd.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
AnrDaemon
Have you tried disconnecting the unnecessary ones?
rekursia
I rebooted with only the mouse and the monitor connected... same thing.
DimanBG
I tried
pci=nomsi,pci=noaer
Interesting, do you understand even a little what you are trying there?
nomsi: then why did you even buy all of this:
CPU: HexaCore Intel Core i5-8400 3900 MHz
Motherboard: Prime Z370-P
RAM: 2x Corsair Vengeance 8 GB DDR4-2400
Video adapter: GeForce GTX 1060
BIOS Ver. 1410
noaer: kernel errors will simply no longer be printed. Will that make things any easier?
Try this.
First you need to look at what state it is currently in.
dmesg | grep -i aspm
By default it should say that the BIOS configuration is being used.
With properly working ACPI, and with the OS installed and booted via UEFI, there are fewer problems on desktops; standards and specifications are followed better there than on laptops.
And the errors are corrected, which is exactly what the log entries say.
Mouse, headphones, keyboard.
Bus 001 Device 003: ID 04f3:152e Elan Microelectronics Corp.
And this one? A fingerprint reader, a touchpad?
Bus 001 Device 002: ID 17ef:608e Lenovo
A smartphone?
Disconnect it and report the result.
rekursia
dmesg | grep -i aspm
(empty)
journalctl | grep -i aspm
And this one? A fingerprint reader, a touchpad?
That's the keyboard.
A smartphone?
Disconnect it and report the result.
And that's the mouse.
I disconnected everything and tried other devices.
I forgot to mention that I initially installed Windows, but on a different SSD, if that matters at all.
rekursia
I've added BIOS screenshots.
Are there any ideas at all on where to dig?