Systemd services are killed by watchdog when resuming from suspend

Hello folks, after upgrading from BSP 5.7.3 to 6.4.0 (Colibri iMX7D), my systemd services that use the watchdog functionality behave strangely across suspend/resume. When the system resumes from suspend after a timespan exceeding the watchdog limit, the services are considered stuck and are killed by systemd immediately.
I’ve also implemented a thread-level supervision in C++ using deadlines on the std::chrono::steady_clock, which was working on BSP 5.7.3 and now shows behaviour similar to systemd’s. Thus the issues seem to be connected somehow.

I guess that the monotonic clock behaves differently with kernel 6; more precisely, it seems to keep running in suspend mode, but I was unable to track the issue down. Does anybody have similar problems or any clue how to deal with it?

Cheers, Marc

Hi @marc.windisch ,

I assume you’re referring to the software watchdog interface that systemd can use for its services, i.e. not the hardware one that can reboot the SoM.

I guess that the monotonic clock behaves differently with kernel 6; more precisely, it seems to keep running in suspend mode, but I was unable to track the issue down. Does anybody have similar problems or any clue how to deal with it?

I think you’re on the right track here. While searching for your issue, I found a bug report for RHEL that may describe the same problem you’re having:

In summary, when using the newer sleep mode called s2idle, the kernel’s monotonic clock can be resumed by some interrupts that wake up the kernel but not the entire system. According to the link above, this can be solved by using the old deep sleep mode.
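
One way to observe whether the monotonic clock keeps counting across a suspend cycle is to compare it against CLOCK_BOOTTIME, which on Linux is documented as CLOCK_MONOTONIC plus any time spent suspended. The following is a minimal sketch, not something from the bug report; the function names read_clock_ns and suspend_offset_ns are made up for illustration.

```cpp
#include <time.h>     // clock_gettime, CLOCK_MONOTONIC, CLOCK_BOOTTIME (Linux)
#include <cstdint>
#include <stdexcept>

// Read a POSIX clock and return its value in nanoseconds.
// (Helper name is hypothetical.)
std::int64_t read_clock_ns(clockid_t id) {
    timespec ts{};
    if (clock_gettime(id, &ts) != 0) {
        throw std::runtime_error("clock_gettime failed");
    }
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// CLOCK_BOOTTIME minus CLOCK_MONOTONIC: the time the kernel has accounted
// as "suspended" since boot. The monotonic clock is read first, so the
// result is not pushed below zero by the delay between the two reads.
std::int64_t suspend_offset_ns() {
    const std::int64_t mono = read_clock_ns(CLOCK_MONOTONIC);
    const std::int64_t boot = read_clock_ns(CLOCK_BOOTTIME);
    return boot - mono;
}
```

Printing suspend_offset_ns() before and after a suspend/resume cycle should show the offset growing by roughly the time spent asleep if the monotonic clock was stopped. If the offset barely changes although the board slept for minutes, that would suggest the monotonic clock kept running, which is the behaviour suspected here.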

Our BSP 6 minimal reference image uses s2idle by default:

root@colibri-imx7-emmc-06674594:~# cat /etc/os-release
ID=tdx-xwayland-upstream
NAME="TDX Wayland with XWayland Upstream"
VERSION="6.4.0+build.8 (kirkstone)"
VERSION_ID=6.4.0-build.8
PRETTY_NAME="TDX Wayland with XWayland Upstream 6.4.0+build.8 (kirkstone)"
DISTRO_CODENAME="kirkstone"
root@colibri-imx7-emmc-06674594:~# cat /sys/power/mem_sleep 
[s2idle]

Whereas our BSP 5 images (both downstream and upstream kernel versions) use deep:

root@colibri-imx7-emmc-06674594:~# cat /etc/os-release
ID=tdx-xwayland
NAME="TDX Wayland with XWayland"
VERSION="5.7.2+build.21 (dunfell)"
VERSION_ID=5.7.2-build.21
PRETTY_NAME="TDX Wayland with XWayland 5.7.2+build.21 (dunfell)"
DISTRO_CODENAME="dunfell"
root@colibri-imx7-emmc-06674594:~# cat /sys/power/mem_sleep 
s2idle shallow [deep]
root@colibri-imx7-emmc-06674594:~# cat /etc/os-release
ID=tdx-xwayland-upstream
NAME="TDX Wayland with XWayland Upstream"
VERSION="5.7.2+build.21 (dunfell)"
VERSION_ID=5.7.2-build.21
PRETTY_NAME="TDX Wayland with XWayland Upstream 5.7.2+build.21 (dunfell)"
DISTRO_CODENAME="dunfell"
root@colibri-imx7-emmc-06674594:~# cat /sys/power/mem_sleep
s2idle [deep]

Can you try changing the sleep mode and see if that solves your problem?
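
For an application that wants to check this at runtime, the format of /sys/power/mem_sleep shown above (space-separated modes, the active one in square brackets) is easy to parse. This is only a sketch under that assumption; MemSleepModes, parse_mem_sleep and supports_mode are hypothetical names, and reading the file itself is left out.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Result of parsing the single line found in /sys/power/mem_sleep.
struct MemSleepModes {
    std::vector<std::string> available;  // every mode the kernel lists
    std::string active;                  // the mode shown in [brackets]
};

// Parse e.g. "s2idle shallow [deep]" into the listed modes and the active one.
MemSleepModes parse_mem_sleep(const std::string& line) {
    MemSleepModes result;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        if (token.size() > 2 && token.front() == '[' && token.back() == ']') {
            token = token.substr(1, token.size() - 2);  // strip the brackets
            result.active = token;
        }
        result.available.push_back(token);
    }
    return result;
}

// True if `mode` appears in the list, i.e. it could be selected.
bool supports_mode(const MemSleepModes& modes, const std::string& mode) {
    return std::find(modes.available.begin(), modes.available.end(), mode)
           != modes.available.end();
}
```

Selecting a mode then amounts to writing its name to the same file as root, which only works for modes the kernel actually lists.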

Best regards,
Lucas Akira

Hi @lucas_a.tx, thanks for the reply!

Your cat on /sys/power/mem_sleep shows that s2idle is not only the default but also the only sleep mode available. So switching the mode is not possible.
I tried to reconfigure the kernel, but I did not find any options to enable additional modes. Was sleep mode deep dropped in kernel 6?

Cheers, Marc

Hi @marc.windisch ,

Your cat on /sys/power/mem_sleep shows that s2idle is not only the default but also the only sleep mode available. So switching the mode is not possible.

You’re right, I didn’t notice s2idle was the only option on BSP 6.

I tried to reconfigure the kernel, but I did not find any options to enable additional modes. Was sleep mode deep dropped in kernel 6?

I don’t think the S3/deep sleep mode was dropped from the kernel. The BSP 5 upstream reference minimal image and BSP 6 have pretty much the same suspend-related configs enabled, so I don’t think it’s a missing kernel config either.

BSP 5 Upstream configs:

root@colibri-imx7-emmc-06674594:~# cat /etc/os-release
ID=tdx-xwayland-upstream
NAME="TDX Wayland with XWayland Upstream"
VERSION="5.7.2+build.21 (dunfell)"
VERSION_ID=5.7.2-build.21
PRETTY_NAME="TDX Wayland with XWayland Upstream 5.7.2+build.21 (dunfell)"
DISTRO_CODENAME="dunfell"
root@colibri-imx7-emmc-06674594:~# zcat /proc/config.gz | grep -i suspend
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_SUSPEND_SKIP_SYNC is not set
CONFIG_PM_TEST_SUSPEND=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARM_CPU_SUSPEND=y
CONFIG_OLD_SIGSUSPEND3=y
# CONFIG_BT_HCIBTUSB_AUTOSUSPEND is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2

BSP 6 configs:

root@colibri-imx7-emmc-06674594:~# cat /etc/os-release
ID=tdx-xwayland-upstream
NAME="TDX Wayland with XWayland Upstream"
VERSION="6.4.0+build.8 (kirkstone)"
VERSION_ID=6.4.0-build.8
PRETTY_NAME="TDX Wayland with XWayland Upstream 6.4.0+build.8 (kirkstone)"
DISTRO_CODENAME="kirkstone"
root@colibri-imx7-emmc-06674594:~# zcat /proc/config.gz | grep -i suspend
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_SUSPEND_SKIP_SYNC is not set
CONFIG_PM_TEST_SUSPEND=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARM_CPU_SUSPEND=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_USB_AUTOSUSPEND_DELAY=2

I’ll ask the team internally about it for a more thorough investigation on this matter. Does this issue block your development?

Best regards,
Lucas Akira

Hi @lucas_a.tx,

this is indeed blocking us, to the point that we had to remove the functionality from our code. That can only be a temporary workaround, because disabling thread supervision violates our requirements. This is a medical device close to its final release, so we’re in quite a hurry to solve this issue.

Cheers, Marc

Hi @marc.windisch ,

Can you do a quick test on BSP 5 to see if this issue occurs if you change the suspend-to-RAM mode to s2idle? If it does then this newer sleep mode is most likely the cause of this problem.

I’ll see if we can reproduce this on our side in the following days.

Best regards,
Lucas Akira

@marc.windisch
We picked this up as a bug, but I was told this may be a little more complicated to solve and it will probably take some time. At the moment, I cannot give a timeline for when this is going to be solved. We will keep you informed of the progress.

Best regards,
Rafael

Hi @lucas_a.tx and @rafael.tx,

first of all, sorry for the late reply. I was too busy to check anything because of our release, which was shipped without watchdog functionality, unfortunately.

However, over the last few days I conducted some tests to pin the issue down.
In my original post I mentioned two problems: my C++ thread watcher implementation and systemd.

C++ part

For my thread watcher, I fixed the issue by implementing suspend/resume methods that flush the priority queue holding the deadlines on suspend and re-calculate the deadlines on resume. I’ve connected to the D-Bus and subscribed to systemd-logind’s PrepareForSleep signal to automatically suspend and resume thread watching, so the current implementation does not care about clocks that continue running. The thread watcher is also responsible for generating heartbeats for systemd by invoking sd_notify(0, "WATCHDOG=1"); every 5s. On resume, the watchdog is immediately reset. This solves the issue in the part of the code that is under my control.
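
The suspend/resume idea described above can be sketched roughly as follows. This is not the actual implementation: the class name ThreadWatcher and its methods are hypothetical, a std::map stands in for the priority queue, the current time is passed in explicitly so the logic can be exercised without sleeping, and the D-Bus subscription and the sd_notify call are left out. logind’s PrepareForSleep signal carries a boolean that is true before sleeping and false after resuming, which is where suspend() and resume() would be called.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical thread watcher: every supervised thread has to check in
// within its timeout. Deadlines are dropped on suspend and rebuilt on resume.
class ThreadWatcher {
public:
    // Register a thread; its first deadline starts at `now`.
    void add(const std::string& name, Clock::duration timeout, Clock::time_point now) {
        timeouts_[name] = timeout;
        if (!suspended_) deadlines_[name] = now + timeout;
    }

    // Called by a supervised thread to signal that it is alive.
    void check_in(const std::string& name, Clock::time_point now) {
        auto it = timeouts_.find(name);
        if (it != timeouts_.end() && !suspended_) deadlines_[name] = now + it->second;
    }

    // PrepareForSleep(true): forget all deadlines so nothing can expire.
    void suspend() {
        suspended_ = true;
        deadlines_.clear();
    }

    // PrepareForSleep(false): give every thread a fresh, full timeout.
    void resume(Clock::time_point now) {
        suspended_ = false;
        for (const auto& entry : timeouts_) deadlines_[entry.first] = now + entry.second;
    }

    // Names of the threads whose deadline has passed at `now`.
    std::vector<std::string> expired(Clock::time_point now) const {
        std::vector<std::string> late;
        for (const auto& entry : deadlines_) {
            if (now > entry.second) late.push_back(entry.first);
        }
        return late;
    }

private:
    std::map<std::string, Clock::duration> timeouts_;
    std::map<std::string, Clock::time_point> deadlines_;
    bool suspended_ = false;
};
```

Because deadlines are recomputed from the time of resume instead of being carried across the sleep, it no longer matters whether the underlying clock advanced while the system was suspended.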

systemd part

I backported everything to BSP 5.7.3 and switched the sleep mode to s2idle. Using this mode, BSP 5 behaves exactly the same as BSP 6. In my test scenario the system is mostly sleeping, but is woken up every two minutes. As long as a GPIO is not pulled low, the system falls asleep again after 10s. Keeping the board awake after a while and checking the logs reveals various watchdog-triggered restarts of all services that have the WatchdogSec=x parameter set:

Nov 17 15:07:35 connectivity-node systemd[1]: xxx.service: Failed with result 'watchdog'.
Nov 17 15:14:07 connectivity-node systemd[1]: xxx.service: Failed with result 'watchdog'.
Nov 17 15:14:07 connectivity-node systemd[1]: yyy.service: Failed with result 'watchdog'.
Nov 17 15:16:19 connectivity-node systemd[1]: zzz.service: Failed with result 'watchdog'.
Nov 17 15:16:19 connectivity-node systemd[1]: xxx.service: Failed with result 'watchdog'.
...

To me, it seems that systemd cannot deal with the monotonic clock continuing to run during suspend-to-idle. I verified that my implementation instantly resets systemd’s watchdog after resume, so it must be a deadline miscalculation on systemd’s side, maybe the same problem I had to solve in my C++ code.
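
The suspected miscalculation can be illustrated with a toy model. This is explicitly not systemd’s actual implementation, only the arithmetic of a deadline measured on a clock that may or may not advance during suspend; the function name and the 30s timeout used in the example below are assumptions.

```cpp
#include <chrono>

using std::chrono::seconds;

// Toy model of a watchdog deadline; NOT systemd's real logic.
// Returns true if the deadline has already passed at the moment of resume,
// i.e. before the service had any chance to send its next heartbeat.
bool expired_at_resume(seconds timeout,            // assumed WatchdogSec= value
                       seconds awake_since_ping,   // awake time since the last WATCHDOG=1
                       seconds suspended_for,      // duration of the suspend
                       bool clock_counts_suspend)  // does the clock advance while suspended?
{
    seconds elapsed = awake_since_ping;
    if (clock_counts_suspend) {
        elapsed += suspended_for;  // sleep time is charged against the deadline
    }
    return elapsed > timeout;
}
```

With an assumed 30s timeout, a last heartbeat 5s before suspend and about 110s asleep (two-minute wake cycle, 10s awake), expired_at_resume(seconds(30), seconds(5), seconds(110), true) is true, while the same call with false as the last argument is not. In this model, an immediate heartbeat after resume comes too late once the sleep time has been counted.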

Thank you for picking this up as a bug, I’m looking forward to your feedback :)

Cheers, Marc

Thank you for the information.
Just to be clear, in our test scenario, whenever we started the watchdog on systemd by adding

WatchdogSec=30

to /etc/systemd/system.conf and put the system to sleep, it would reset during sleep because the watchdog timer expired.
Your test on BSP 5 also seems to confirm that this is related to s2idle. What we’re going to investigate is why there’s no deep sleep option on BSP 6 anymore.

Best regards,
Rafael

Hello @marc.windisch,
We found the reason why the suspend mode is not working. Upon image installation, the U-Boot environment variable bootm_boot_mode is being set to the wrong value, which prevents PSCI from being initialized.

To fix that, you can set the variable in U-Boot. Just stop the module boot by pressing a key when U-Boot is starting and then run:

Colibri iMX7 # setenv bootm_boot_mode nonsec
Colibri iMX7 # saveenv

After that, you can reset the module, and it should enable the suspend modes.
To properly install images avoiding the wrong bootm_boot_mode variable setting, you can use the latest nightly Toradex Easy Installer, which includes a correction for this problem:
https://artifacts.toradex.com/artifactory/tezi-oe-prerelease-frankfurt/dunfell-5.x.y/nightly/521/colibri-imx7/tezi/tezi-run/oedeploy/Colibri-iMX7_ToradexEasyInstaller_5.7.5-devel-20240207+build.521.zip

Could you please test this and see if everything works as it should?

Thanks,
Rafael