i.MX8X SCU reset reason / watchdog reset detection for better FOTA failure handling

The NXP i.MX8 series uses a different watchdog approach than previous i.MX processors. I’m specifically interested in the i.MX8X on the Colibri i.MX8X SOMs, which uses a System Controller Unit (SCU) to manage the watchdog and other low-level system tasks.

This watchdog is supported by the Toradex BSP in both the Linux and U-boot codebases.

I’m evaluating OTA update strategies and tools, and I am trying to sort out how mender.io, RAUC, or any other full image update tool can tell whether a reboot was intentional (e.g. a manual power cycle), unintentional (e.g. power loss), or caused by a watchdog.

It looks like a call to get this information isn’t presently implemented in the Toradex BSP in either Linux or U-boot, but I could be missing something. So in addition to being a feature request to expose this information for better FOTA failure detection and rollback, this is a request for guidance on how to get at this information through a workaround, or at least authoritative confirmation that I will have to patch the BSP myself while I wait for Toradex to include support. If I’m going down the wrong path, I would appreciate pointers on how to handle failed FOTA updates gracefully without knowing the reboot reason.

The information I’m looking for is in the ASMC_SRS register in the SCU, described in the NXP i.MX8X Applications Processor Reference Manual on page 2428.

It looks like the contents of this register are exposed to the world outside of the SCU through the RPC API call sc_pm_reset_reason(), described in the NXP document “System Controller Firmware API Reference Guide i.MX8 QXP Die (Version 1.5)”.
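To make the goal concrete, here is a minimal sketch of how the value returned by sc_pm_reset_reason() could be turned into something an update tool can act on. The numeric values are an assumption: they are what I recall from the SC_PM_RESET_REASON_* definitions in NXP’s SCFW pm/api.h and must be checked against the header that ships with the BSP in use. The RR_* names and both helper functions are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: these values mirror SC_PM_RESET_REASON_* as recalled from
 * NXP's SCFW pm/api.h. Verify them against your own header before use. */
enum {
    RR_POR        = 0,  /* power-on reset */
    RR_JTAG       = 1,  /* JTAG reset */
    RR_SW         = 2,  /* software reset */
    RR_WDOG       = 3,  /* partition watchdog reset */
    RR_LOCKUP     = 4,  /* SCU lockup reset */
    RR_SNVS       = 5,  /* SNVS reset */
    RR_TEMP       = 6,  /* temperature panic reset */
    RR_MSI        = 7,  /* MSI reset */
    RR_UECC       = 8,  /* ECC reset */
    RR_SCFW_WDOG  = 9,  /* SCFW watchdog reset */
    RR_ROM_WDOG   = 10, /* SCU ROM watchdog reset */
    RR_SECO       = 11, /* SECO reset */
    RR_SCFW_FAULT = 12  /* SCFW fault reset */
};

/* Hypothetical helper: human-readable name for a reset reason value. */
const char *reset_reason_name(uint8_t reason)
{
    static const char *const names[] = {
        "power-on", "JTAG", "software", "partition watchdog",
        "SCU lockup", "SNVS", "temperature panic", "MSI", "ECC",
        "SCFW watchdog", "SCU ROM watchdog", "SECO", "SCFW fault"
    };

    if (reason >= sizeof(names) / sizeof(names[0]))
        return "unknown";
    return names[reason];
}

/* Hypothetical helper: true if the reset came from any watchdog, which is
 * the case a FOTA rollback policy would most likely care about. */
bool reset_was_watchdog(uint8_t reason)
{
    return reason == RR_WDOG || reason == RR_SCFW_WDOG ||
           reason == RR_ROM_WDOG;
}
```

With something like this in U-boot or in a kernel driver, a value of 3 would be reported as "partition watchdog" and could be used to count the boot attempt as failed.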

U-boot already interacts with this API, but not this call. See u-boot source code, arch/arm/include/asm/arch-imx8/sci/sci.h. (U-boot exposes a similar API into the SCU to configure pads at boot time, as seen in arch/arm/include/asm/arch-imx8/sci/svc/pad/api.h.)

There is some support in the Linux kernel, but it isn’t complete. For example, an RPC named IMX_SC_PM_FUNC_RESET_REASON is defined in include/linux/firmware/imx/svc/pm.h but not used anywhere in the kernel drivers. This header was added by NXP here: [PATCH V3 1/2] firmware: imx: add pm svc headfile - Dong Aisheng

Here is a test scenario. First, we initiate a watchdog-triggered reboot by starting the built-in imx-sc watchdog and neglecting to pat it:

root@colibri-imx8x-XXXXXXXX:~# echo -n X > /dev/watchdog
[  114.018611] watchdog: watchdog0: watchdog did not stop!

After waiting for the watchdog to trigger a reboot 60 seconds later and then logging in, we check the reboot reason by using watchdog-test, a test tool from the Linux kernel:

root@colibri-imx8x-XXXXXXXX:~# ./watchdog-test -b
Last boot is caused by: Power-On-Reset.

watchdog-test (tools/testing/selftests/watchdog/watchdog-test.c in the kernel source) must be built external to Yocto using the cross SDK.

Here’s how watchdog-test reads the boot status:

ret = ioctl(fd, WDIOC_GETBOOTSTATUS, &flags);
if (!ret)
        printf("Last boot is caused by: %s.\n", (flags != 0) ?
                "Watchdog" : "Power-On-Reset");
else
        printf("WDIOC_GETBOOTSTATUS error '%s'\n", strerror(errno));
break;

The i.MX8X SCU watchdog driver (imx_sc_wdt.c) doesn’t set any boot status flags for the WDIOC_GETBOOTSTATUS ioctl in this BSP or in the mainline kernel tree, which is consistent with the ioctl succeeding above but reporting no flags. That would also have to be added to expose this information to Linux userspace.
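If the driver did report boot status, userspace could decode it with the standard flags from linux/watchdog.h instead of the plain zero/non-zero check in watchdog-test. The following is only a sketch of that: boot_cause() and read_boot_status() are hypothetical names, and the grouping of flags into causes is my own assumption about what an update tool would want.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Hypothetical helper: map WDIOC_GETBOOTSTATUS flags to a coarse cause. */
const char *boot_cause(int flags)
{
    if (flags & WDIOF_CARDRESET)
        return "watchdog";
    if (flags & WDIOF_OVERHEAT)
        return "overheat";
    if (flags & (WDIOF_POWERUNDER | WDIOF_POWEROVER))
        return "power fault";
    return "power-on reset or not reported";
}

/* Hypothetical helper: read the boot status flags from a watchdog device.
 * Returns 0 on success, -1 on error with errno set.
 * Note that opening the device starts the watchdog. Writing the magic
 * character 'V' before close asks the driver to stop it again, which only
 * works if the driver supports magic close and nowayout is not set. */
int read_boot_status(const char *dev, int *flags)
{
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, WDIOC_GETBOOTSTATUS, flags);
    int saved_errno = errno;

    if (write(fd, "V", 1) < 0) {
        /* Best effort only: the watchdog may keep running. */
    }
    close(fd);

    errno = saved_errno;
    return ret < 0 ? -1 : 0;
}
```

A caller would pass /dev/watchdog0 to read_boot_status() and hand the resulting flags to boot_cause(). On the current BSP this would presumably still end up in the last branch, because no flags are set.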

I am working with the Yocto BSP (git://git.toradex.com/meta-toradex-nxp.git, dunfell-5.x.y, most recent commit bc4b3704ab903506be7d1d2aa674aa8e5cd10037).

Hello @gregory.hancock,
First, regarding the OTA update tools: from what I remember, at least RAUC doesn’t depend on knowing the source of the reboot to do a proper job. The tool needs to be able to properly update the system despite uncontrolled reboots, and in the case of RAUC this is done with a procedure that marks the partition as good once the boot has finished. Because of this, RAUC retries the new image a preset number of times until either the partition is marked as good or the maximum number of retries is reached, at which point it rolls back to the previously working image.
Marking a partition as good is something that the software developer of the image controls, so you can make sure your services are properly running and working before executing the marking procedure.
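To illustrate the idea, here is a simplified model of that retry-and-fallback logic. It is not RAUC’s or U-boot’s actual code; the struct, the function names and the counter handling are all hypothetical and only show why the reboot reason is not needed: every boot attempt costs one try, whatever caused the previous reboot.

```c
#include <stddef.h>

/* Hypothetical model of a bootable slot with a remaining-attempts counter. */
struct slot {
    const char *name;
    int attempts_left;
};

/* Bootloader side: pick the first slot (in priority order) that still has
 * attempts left and consume one attempt. Returns NULL if all slots are
 * exhausted. */
struct slot *select_slot(struct slot *slots, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (slots[i].attempts_left > 0) {
            slots[i].attempts_left--;
            return &slots[i];
        }
    }
    return NULL;
}

/* Userspace side: once the application has verified that the system is
 * healthy, restore the counter so the slot keeps being selected. */
void mark_good(struct slot *s, int max_attempts)
{
    s->attempts_left = max_attempts;
}
```

With a freshly installed slot B placed ahead of slot A and three attempts each, three boots in a row that never reach mark_good() exhaust B, and the fourth boot falls back to A. It makes no difference whether those failed boots ended in a watchdog reset, a power loss or a kernel panic.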

As for support for the reset reason ioctl, I can see that section 2.11 of the NXP BSP reference manual states that their imx8 watchdog supports it. If you really want to try out this functionality, I would suggest using the downstream kernel from NXP instead of the mainline. From what I saw, the imx8_wdt.c file they mention in the reference manual doesn’t exist in the mainline kernel.

Regards,
Rafael Beims