|Last reviewed: 05/20/2016
|HPE iLO NMI Watchdog Driver
|NMI sourcing for iLO based ProLiant Servers
|Documentation and Driver by
|The HPE iLO NMI Watchdog driver is a kernel module that provides basic
|watchdog functionality and the added benefit of NMI sourcing. Both the
|watchdog functionality and the NMI sourcing capability need to be enabled
|by the user. The two modes are independent of one another: a user can have
|NMI sourcing without the watchdog timer and vice versa.
|All references to iLO in this document imply it also works on iLO2 and all
|Watchdog functionality is enabled like any other common watchdog driver. That
|is, an application needs to be started that kicks off the watchdog timer. A
|basic application, watchdog-test.c, exists in
|tools/testing/selftests/watchdog/. Compile the C file and run it. If the system
|gets into a bad state and hangs, the HPE ProLiant iLO timer register will
|not be updated in a timely fashion and a hardware system reset (also known as
|an Automatic Server Recovery (ASR) event) will occur.
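|The keepalive pattern described above can be sketched in C as follows. This
|is a minimal illustration, not the watchdog-test.c program itself: the helper
|names wd_open, wd_ping and wd_ping_loop are hypothetical, and the sketch
|assumes the generic WDIOC_KEEPALIVE ioctl from <linux/watchdog.h>.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Open a watchdog device node (hypothetical helper). With the generic
 * watchdog interface, opening the device is what starts the timer.
 * Returns a file descriptor, or -1 on failure. */
static int wd_open(const char *path)
{
	return open(path, O_WRONLY);
}

/* Ping the watchdog once so that the timer is reloaded.
 * Returns 0 on success, -1 on failure. */
static int wd_ping(int fd)
{
	int dummy = 0;

	return ioctl(fd, WDIOC_KEEPALIVE, &dummy) < 0 ? -1 : 0;
}

/* Ping 'count' times, sleeping 'interval' seconds after each ping.
 * Stops and returns -1 on the first failed ping, 0 otherwise. */
static int wd_ping_loop(int fd, unsigned int count, unsigned int interval)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		if (wd_ping(fd) < 0)
			return -1;
		sleep(interval);
	}
	return 0;
}
```

|A caller would typically pass /dev/watchdog to wd_open and choose an interval
|well below the configured timeout. Note that once the device has been opened,
|stopping the pings is expected to lead to the ASR event described above.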
|The hpwdt driver also has three (3) module parameters. They are the following:
|soft_margin - allows the user to set the watchdog timer value.
|              Default value is 30 seconds.
|allow_kdump - allows the user to save off a kernel dump image after an NMI.
|              Default value is 1/ON.
|nowayout    - basic watchdog parameter that does not allow the timer to
|              be restarted or an impending ASR to be escaped.
|              Default value is set when compiling the kernel. If it is set
|              to "Y", then there is no way of disabling the watchdog once
|              it has been started.
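|The timer value and the nowayout behaviour also have counterparts in the
|generic watchdog ioctl interface. The sketch below assumes the driver accepts
|WDIOC_SETTIMEOUT and honours the "magic close" character 'V', as the generic
|watchdog API describes; the helper names wd_set_timeout and wd_magic_close
|are hypothetical and are not part of hpwdt.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Request a new timeout in seconds at run time. On success the value the
 * driver actually applied is stored back in *secs.
 * Returns 0 on success, -1 on failure. */
static int wd_set_timeout(int fd, int *secs)
{
	return ioctl(fd, WDIOC_SETTIMEOUT, secs) < 0 ? -1 : 0;
}

/* Attempt an orderly stop: write the magic character 'V', then close.
 * If the write fails the descriptor is left open so that the caller can
 * keep pinging instead of closing an armed watchdog.
 * When nowayout is set, the timer is expected to keep running even
 * after a successful magic close.
 * Returns 0 on success, -1 on failure. */
static int wd_magic_close(int fd)
{
	if (write(fd, "V", 1) != 1)
		return -1;
	return close(fd) < 0 ? -1 : 0;
}
```

|A run-time timeout set this way plays the same role as the soft_margin value
|given at module load time; which one takes effect last is up to the driver.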
|NOTE: More information about watchdog drivers in general, including the ioctl
|interface to /dev/watchdog, can be found in
|Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt.
|The NMI sourcing capability is disabled by default due to the inability to
|distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the
|Linux kernel. What this means is that the hpwdt NMI handler code is called
|each time the NMI signal fires. This could amount to several thousand
|NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and
|confused" message in the logs or if the system gets into a hung state, then
|the hpwdt driver can be reloaded as follows:
|1. If the kernel has not been booted with nmi_watchdog turned off, then
|edit the boot loader configuration and append nmi_watchdog=0 to the end of
|the kernel line that is currently booted. The file to edit depends on your
|Linux distribution and platform setup:
|For non-UEFI systems
|For UEFI systems
|2. Reboot the server.
|3. Once the system comes up, run modprobe -r hpwdt.
|4. modprobe /lib/modules/`uname -r`/kernel/drivers/watchdog/hpwdt.ko
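|Before going through the steps above, it can help to check whether the
|running kernel was already booted with nmi_watchdog=0. The following is a
|minimal sketch of such a check; the function names are hypothetical, and it
|only looks for the exact token on a space-separated command line such as the
|one exposed in /proc/cmdline.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the command line contains the exact token
 * "nmi_watchdog=0", 0 otherwise. A later, conflicting nmi_watchdog=
 * setting on the same line is not taken into account. */
static int cmdline_has_nmi_watchdog_off(const char *cmdline)
{
	static const char token[] = "nmi_watchdog=0";
	const size_t len = sizeof(token) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, token)) != NULL) {
		/* The token must be delimited on both sides. */
		int starts = (p == cmdline) || p[-1] == ' ';
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p += len;
	}
	return 0;
}

/* Read the first line of 'path' (for example /proc/cmdline) and apply
 * the check above. Returns 1 or 0, or -1 if the file cannot be read. */
static int file_has_nmi_watchdog_off(const char *path)
{
	char buf[4096];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return cmdline_has_nmi_watchdog_off(buf);
}
```

|A result of 0 from file_has_nmi_watchdog_off("/proc/cmdline") suggests that
|step 1 still needs to be carried out before the driver is reloaded.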
|Now, the hpwdt driver can successfully receive and source the NMI and provide
|a log message that details the reason for the NMI (as determined by the HPE
|BIOS).
|Below is a list of NMIs the HPE BIOS understands along with the associated
|No source found                00h
|Uncorrectable Memory Error     01h
|ASR NMI                        1Bh
|PCI Parity Error               20h
|NMI Button Press               27h
|ILO Doorbell NMI               29h
|ILO IOP NMI                    2Ah
|ILO Watchdog NMI               2Bh
|Proc Throt NMI                 2Ch
|Front Side Bus NMI             2Dh
|PCI Express Error              2Fh
|DMA controller NMI             30h
|Hypertransport/CSI Error       31h
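|For tools that post-process such log messages, the list above can be
|expressed as a simple lookup table. This is an illustrative sketch only: the
|struct and function names are hypothetical, it is not the driver's internal
|representation, and how the code value is obtained from the BIOS is not shown.

```c
#include <stddef.h>

/* One entry of the NMI source list: code value and description. */
struct nmi_source {
	unsigned char code;
	const char *desc;
};

/* Values copied from the list in this document. */
static const struct nmi_source nmi_sources[] = {
	{ 0x00, "No source found" },
	{ 0x01, "Uncorrectable Memory Error" },
	{ 0x1B, "ASR NMI" },
	{ 0x20, "PCI Parity Error" },
	{ 0x27, "NMI Button Press" },
	{ 0x29, "ILO Doorbell NMI" },
	{ 0x2A, "ILO IOP NMI" },
	{ 0x2B, "ILO Watchdog NMI" },
	{ 0x2C, "Proc Throt NMI" },
	{ 0x2D, "Front Side Bus NMI" },
	{ 0x2F, "PCI Express Error" },
	{ 0x30, "DMA controller NMI" },
	{ 0x31, "Hypertransport/CSI Error" },
};

/* Return the description for a code value, or NULL if the code is
 * not in the list. */
static const char *nmi_source_desc(unsigned char code)
{
	size_t i;

	for (i = 0; i < sizeof(nmi_sources) / sizeof(nmi_sources[0]); i++)
		if (nmi_sources[i].code == code)
			return nmi_sources[i].desc;
	return NULL;
}
```

|Codes that are not listed, such as 02h, yield NULL so that the caller can
|decide how to report an unknown source.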
|-- Tom Mingarelli