| ====================== |
| Firmware-Assisted Dump |
| ====================== |
| |
| July 2011 |
| |
| The goal of firmware-assisted dump is to enable the dump of |
| a crashed system, and to do so from a fully-reset system, and |
| to minimize the total elapsed time until the system is back |
| in production use. |
| |
| - Firmware assisted dump (fadump) infrastructure is intended to replace |
| the existing phyp assisted dump. |
| - Fadump uses the same firmware interfaces and memory reservation model |
| as phyp assisted dump. |
| - Unlike phyp dump, fadump exports the memory dump through /proc/vmcore |
| in the ELF format in the same way as kdump. This helps us reuse the |
| kdump infrastructure for dump capture and filtering. |
| - Unlike phyp dump, userspace tool does not need to refer any sysfs |
| interface while reading /proc/vmcore. |
| - Unlike phyp dump, fadump allows user to release all the memory reserved |
| for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem. |
| - Once enabled through kernel boot parameter, fadump can be |
| started/stopped through /sys/kernel/fadump_registered interface (see |
| sysfs files section below) and can be easily integrated with kdump |
| service start/stop init scripts. |
| |
| Comparing with kdump or other strategies, firmware-assisted |
| dump offers several strong, practical advantages: |
| |
| - Unlike kdump, the system has been reset, and loaded |
| with a fresh copy of the kernel. In particular, |
| PCI and I/O devices have been reinitialized and are |
| in a clean, consistent state. |
| - Once the dump is copied out, the memory that held the dump |
| is immediately available to the running kernel. And therefore, |
| unlike kdump, fadump doesn't need a 2nd reboot to get back |
| the system to the production configuration. |
| |
| The above can only be accomplished by coordination with, |
| and assistance from the Power firmware. The procedure is |
| as follows: |
| |
| - The first kernel registers the sections of memory with the |
| Power firmware for dump preservation during OS initialization. |
| These registered sections of memory are reserved by the first |
| kernel during early boot. |
| |
| - When a system crashes, the Power firmware will save |
| the low memory (boot memory of size larger of 5% of system RAM |
| or 256MB) of RAM to the previous registered region. It will |
| also save system registers, and hardware PTE's. |
| |
| NOTE: |
| The term 'boot memory' means size of the low memory chunk |
| that is required for a kernel to boot successfully when |
| booted with restricted memory. By default, the boot memory |
| size will be the larger of 5% of system RAM or 256MB. |
| Alternatively, user can also specify boot memory size |
| through boot parameter 'crashkernel=' which will override |
| the default calculated size. Use this option if default |
| boot memory size is not sufficient for second kernel to |
| boot successfully. For syntax of crashkernel= parameter, |
| refer to Documentation/admin-guide/kdump/kdump.rst. If any offset is |
| provided in crashkernel= parameter, it will be ignored |
| as fadump uses a predefined offset to reserve memory |
| for boot memory dump preservation in case of a crash. |
| |
| - After the low memory (boot memory) area has been saved, the |
| firmware will reset PCI and other hardware state. It will |
| *not* clear the RAM. It will then launch the bootloader, as |
| normal. |
| |
| - The freshly booted kernel will notice that there is a new |
| node (ibm,dump-kernel) in the device tree, indicating that |
| there is crash data available from a previous boot. During |
| the early boot OS will reserve rest of the memory above |
| boot memory size effectively booting with restricted memory |
| size. This will make sure that the second kernel will not |
| touch any of the dump memory area. |
| |
| - User-space tools will read /proc/vmcore to obtain the contents |
| of memory, which holds the previous crashed kernel dump in ELF |
| format. The userspace tools may copy this info to disk, or |
| network, nas, san, iscsi, etc. as desired. |
| |
| - Once the userspace tool is done saving dump, it will echo |
| '1' to /sys/kernel/fadump_release_mem to release the reserved |
| memory back to general use, except the memory required for |
| next firmware-assisted dump registration. |
| |
| e.g.:: |
| |
| # echo 1 > /sys/kernel/fadump_release_mem |
| |
| Please note that the firmware-assisted dump feature |
| is only available on Power6 and above systems with recent |
| firmware versions. |
| |
| Implementation details: |
| ----------------------- |
| |
| During boot, a check is made to see if firmware supports |
| this feature on that particular machine. If it does, then |
| we check to see if an active dump is waiting for us. If yes |
| then everything but boot memory size of RAM is reserved during |
| early boot (See Fig. 2). This area is released once we finish |
| collecting the dump from user land scripts (e.g. kdump scripts) |
| that are run. If there is dump data, then the |
| /sys/kernel/fadump_release_mem file is created, and the reserved |
| memory is held. |
| |
| If there is no waiting dump data, then only the memory required |
| to hold CPU state, HPTE region, boot memory dump and elfcore |
| header, is usually reserved at an offset greater than boot memory |
| size (see Fig. 1). This area is *not* released: this region will |
| be kept permanently reserved, so that it can act as a receptacle |
| for a copy of the boot memory content in addition to CPU state |
| and HPTE region, in the case a crash does occur. Since this reserved |
| memory area is used only after the system crash, there is no point in |
| blocking this significant chunk of memory from production kernel. |
| Hence, the implementation uses the Linux kernel's Contiguous Memory |
| Allocator (CMA) for memory reservation if CMA is configured for kernel. |
| With CMA reservation this memory will be available for applications to |
| use it, while kernel is prevented from using it. With this fadump will |
| still be able to capture all of the kernel memory and most of the user |
| space memory except the user pages that were present in CMA region:: |
| |
| o Memory Reservation during first kernel |
| |
| Low memory Top of memory |
| 0 boot memory size | |
| | | |<--Reserved dump area -->| | |
| V V | Permanent Reservation | V |
| +-----------+----------/ /---+---+----+-----------+----+------+ |
| | | |CPU|HPTE| DUMP |ELF | | |
| +-----------+----------/ /---+---+----+-----------+----+------+ |
| | ^ |
| | | |
| \ / |
| ------------------------------------------- |
| Boot memory content gets transferred to |
| reserved area by firmware at the time of |
| crash |
| Fig. 1 |
| |
| o Memory Reservation during second kernel after crash |
| |
| Low memory Top of memory |
| 0 boot memory size | |
| | |<------------- Reserved dump area ----------- -->| |
| V V V |
| +-----------+----------/ /---+---+----+-----------+----+------+ |
| | | |CPU|HPTE| DUMP |ELF | | |
| +-----------+----------/ /---+---+----+-----------+----+------+ |
| | | |
| V V |
| Used by second /proc/vmcore |
| kernel to boot |
| Fig. 2 |
| |
| Currently the dump will be copied from /proc/vmcore to a |
| a new file upon user intervention. The dump data available through |
| /proc/vmcore will be in ELF format. Hence the existing kdump |
| infrastructure (kdump scripts) to save the dump works fine with |
| minor modifications. |
| |
| The tools to examine the dump will be same as the ones |
| used for kdump. |
| |
| How to enable firmware-assisted dump (fadump): |
| ---------------------------------------------- |
| |
| 1. Set config option CONFIG_FA_DUMP=y and build kernel. |
| 2. Boot into linux kernel with 'fadump=on' kernel cmdline option. |
| By default, fadump reserved memory will be initialized as CMA area. |
| Alternatively, user can boot linux kernel with 'fadump=nocma' to |
| prevent fadump to use CMA. |
| 3. Optionally, user can also set 'crashkernel=' kernel cmdline |
| to specify size of the memory to reserve for boot memory dump |
| preservation. |
| |
| NOTE: |
| 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead |
| use 'crashkernel=' to specify size of the memory to reserve |
| for boot memory dump preservation. |
| 2. If firmware-assisted dump fails to reserve memory then it |
| will fallback to existing kdump mechanism if 'crashkernel=' |
| option is set at kernel cmdline. |
| 3. if user wants to capture all of user space memory and ok with |
| reserved memory not available to production system, then |
| 'fadump=nocma' kernel parameter can be used to fallback to |
| old behaviour. |
| |
| Sysfs/debugfs files: |
| -------------------- |
| |
| Firmware-assisted dump feature uses sysfs file system to hold |
| the control files and debugfs file to display memory reserved region. |
| |
| Here is the list of files under kernel sysfs: |
| |
| /sys/kernel/fadump_enabled |
| This is used to display the fadump status. |
| |
| - 0 = fadump is disabled |
| - 1 = fadump is enabled |
| |
| This interface can be used by kdump init scripts to identify if |
| fadump is enabled in the kernel and act accordingly. |
| |
| /sys/kernel/fadump_registered |
| This is used to display the fadump registration status as well |
| as to control (start/stop) the fadump registration. |
| |
| - 0 = fadump is not registered. |
| - 1 = fadump is registered and ready to handle system crash. |
| |
| To register fadump echo 1 > /sys/kernel/fadump_registered and |
| echo 0 > /sys/kernel/fadump_registered for un-register and stop the |
| fadump. Once the fadump is un-registered, the system crash will not |
| be handled and vmcore will not be captured. This interface can be |
| easily integrated with kdump service start/stop. |
| |
| /sys/kernel/fadump_release_mem |
| This file is available only when fadump is active during |
| second kernel. This is used to release the reserved memory |
| region that are held for saving crash dump. To release the |
| reserved memory echo 1 to it:: |
| |
| echo 1 > /sys/kernel/fadump_release_mem |
| |
| After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region |
| file will change to reflect the new memory reservations. |
| |
| The existing userspace tools (kdump infrastructure) can be easily |
| enhanced to use this interface to release the memory reserved for |
| dump and continue without 2nd reboot. |
| |
| Here is the list of files under powerpc debugfs: |
| (Assuming debugfs is mounted on /sys/kernel/debug directory.) |
| |
| /sys/kernel/debug/powerpc/fadump_region |
| This file shows the reserved memory regions if fadump is |
| enabled otherwise this file is empty. The output format |
| is:: |
| |
| <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> |
| |
| e.g. |
| Contents when fadump is registered during first kernel:: |
| |
| # cat /sys/kernel/debug/powerpc/fadump_region |
| CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 |
| HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 |
| DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 |
| |
| Contents when fadump is active during second kernel:: |
| |
| # cat /sys/kernel/debug/powerpc/fadump_region |
| CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 |
| HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 |
| DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 |
| : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 |
| |
| NOTE: |
| Please refer to Documentation/filesystems/debugfs.txt on |
| how to mount the debugfs filesystem. |
| |
| |
| TODO: |
| ----- |
| - Need to come up with the better approach to find out more |
| accurate boot memory size that is required for a kernel to |
| boot successfully when booted with restricted memory. |
| - The fadump implementation introduces a fadump crash info structure |
| in the scratch area before the ELF core header. The idea of introducing |
| this structure is to pass some important crash info data to the second |
| kernel which will help second kernel to populate ELF core header with |
| correct data before it gets exported through /proc/vmcore. The current |
| design implementation does not address a possibility of introducing |
| additional fields (in future) to this structure without affecting |
| compatibility. Need to come up with the better approach to address this. |
| |
| The possible approaches are: |
| |
| 1. Introduce version field for version tracking, bump up the version |
| whenever a new field is added to the structure in future. The version |
| field can be used to find out what fields are valid for the current |
| version of the structure. |
| 2. Reserve the area of predefined size (say PAGE_SIZE) for this |
| structure and have unused area as reserved (initialized to zero) |
| for future field additions. |
| |
| The advantage of approach 1 over 2 is we don't need to reserve extra space. |
| |
| Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> |
| |
| This document is based on the original documentation written for phyp |
| |
| assisted dump by Linas Vepstas and Manish Ahuja. |