I’m running Linux (4.14.170-3.0.4+gbaa6c24) on Colibri imx7d
(/etc/issue:TDX X11 2.6-snapshot \n \l)
I have configured the watchdog
(cat /proc/cmdline: rn5t618-wdt.timeout=8 … )
and feed it from an application by writing to /dev/watchdog through a system() call:
system("echo 1 > /dev/watchdog");
Stopping the watchdog in the program flow works well by writing the special character to /dev/watchdog:
system("echo 'V' > /dev/watchdog");
This way I can pause and resume the watchdog without problems.
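As a side note, the same feed and pause steps can be done without spawning a shell for every write. The following is only a minimal sketch, not the code used in this thread: the helper names wd_open, wd_feed and wd_stop are made up for illustration. It assumes the usual Linux watchdog character-device behaviour, where the timer is stopped only if the magic character 'V' was written before the device is closed, and the driver is not running with nowayout.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open the watchdog device for writing.
 * On a real watchdog device, opening it also starts the timer.
 * Returns a file descriptor, or -1 on error. */
int wd_open(const char *path)
{
    return open(path, O_WRONLY);
}

/* Hypothetical helper: feed the watchdog.
 * Any byte other than 'V' just restarts the countdown. */
int wd_feed(int fd)
{
    return write(fd, "1", 1) == 1 ? 0 : -1;
}

/* Hypothetical helper: pause the watchdog.
 * Writes the magic character 'V' and then closes the device;
 * it is the close after 'V' that asks the driver to stop. */
int wd_stop(int fd)
{
    int ok = (write(fd, "V", 1) == 1);

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

In an application this would be used as wd_open("/dev/watchdog") once at startup, wd_feed() from the periodic loop, and wd_stop() to pause; resuming means calling wd_open() again.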
However, when I put Linux into standby and pause the watchdog just before:
system("echo 'V' > /dev/watchdog");
system("echo standby > /sys/power/state");
then after waking from standby the watchdog causes an immediate reboot as soon as it is fed again by:
system("echo 1 > /dev/watchdog");
The system can wake from standby and is stable as long as the watchdog is not fed. As soon as I write to /dev/watchdog the reboot happens.
Am I missing something in the combination of watchdog and standby? Does the watchdog need some special disabling before, or re-initialization after, standby?
Correction: stopping the watchdog and then trying to restart it always leads to a reboot.
The watchdog is configured and, after system startup, fed cyclically. Then the watchdog is paused by writing 'V' to /dev/watchdog. When anything but 'V' is written to /dev/watchdog again, the reboot happens. This is independent of system standby.
To update my question: Is it possible to pause and resume the watchdog? If so, what needs to be done?
I’ve done a quick watchdog test using TDX Wayland with pre-built XWayland 5.2.0-devel-20210409+build.272 (dunfell) Colibri-iMX7-eMMC_Reference-Minimal-Image.
It behaves as expected. Writing "1" to /dev/watchdog after writing 'V' to the same file does not cause an immediate reboot. It just re-enables the watchdog, and a reboot happens only if /dev/watchdog is not touched for the specified period.
Kevin, ah, yes, that is why we could not get the reboot to work with the internal watchdog.
Eventually we got stopping and restarting the external watchdog to work. It took some changes to the watchdog driver. Not sure why it would not work out of the box in our image.
Hello Jaski,
agreed, but, unfortunately, this conflicts with the feature request that our device shall be sometimes in suspend state. During suspend there is no process active to toggle the watchdog. So either the WD has an input to detect low power mode and stop counting, or WD has a pause/resume feature. Otherwise, suspend state would always lead to reset.
Best regards,
Otmar
Hello Jaski,
unfortunately, we want to have both: the hardware watchdog shall restart the system if the application fails to toggle it regularly, i.e. it monitors whether the application is alive. And, occasionally, we put the device in standby, i.e. we suspend Linux and with it our application, in which case we would rather not have the watchdog trigger a reboot.
In any case, we got it working now. In normal operation the WD triggers a reboot when our application fails to toggle it. Before Linux suspend we pause the WD, and we re-enable it after waking from suspend. This is not ideal, since it does not protect us if the application does not resume properly after system wake-up, but it seems good enough.
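The pause, suspend, re-enable sequence described above could look roughly like the sketch below. It is an illustration under assumptions, not the application code from this thread: write_str and suspend_with_watchdog_paused are hypothetical names, the paths are passed in as parameters, and it presumes a driver on which stopping and restarting the watchdog actually works. Writing to /sys/power/state normally blocks until the system wakes up again, so the watchdog is re-armed right after that write returns, even if the suspend request failed.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open path, write the string s, close again.
 * Returns 0 on success, -1 on any error. */
static int write_str(const char *path, const char *s)
{
    size_t len = strlen(s);
    int fd = open(path, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, s, len);
    if (close(fd) != 0)
        return -1;
    return n == (ssize_t)len ? 0 : -1;
}

/* Hypothetical helper: pause the watchdog, request a sleep state,
 * then re-arm the watchdog once the state write has returned.
 * wd_path would be "/dev/watchdog", state_path "/sys/power/state",
 * and state e.g. "standby". */
int suspend_with_watchdog_paused(const char *wd_path,
                                 const char *state_path,
                                 const char *state)
{
    int rc;

    /* Magic character followed by close: ask the driver to stop. */
    if (write_str(wd_path, "V") != 0)
        return -1;

    /* Request the sleep state; execution continues here after wake-up. */
    rc = write_str(state_path, state);

    /* Re-arm in any case. Closing without 'V' leaves the watchdog
     * running, so the application must resume feeding it afterwards. */
    if (write_str(wd_path, "1") != 0)
        rc = -1;
    return rc;
}
```

As noted above, this still leaves a gap: if the application does not come back properly after wake-up, nothing re-arms the watchdog.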
Best regards,
Otmar
Hello Klaus,
please see the attached patch to rn5t618_wdt.c
Notably, we cleared the pending interrupt flag in various places and modified some control bits in the stop() function.
On the application level, before we issue a “shutdown -h now” we pause the WD by writing the magic 'V' into /dev/watchdog and we call
int value = WDIOS_DISABLECARD;
ioctl(<fd of opened /dev/watchdog>, WDIOC_SETOPTIONS, &value);
Basically, before shutting the system down we disable the WD. The changes to the control bits made sure it stayed down.
If the system locks up during shutdown this is very bad. But so far we haven’t encountered this.
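For reference, the ioctl call above could be wrapped as in the following sketch. The function name wd_set_enabled is made up for illustration; WDIOC_SETOPTIONS, WDIOS_DISABLECARD and WDIOS_ENABLECARD come from <linux/watchdog.h>. Whether the driver honours the request depends on the driver and its configuration (for example, it is expected to fail when nowayout is active), so the return value should be checked.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Hypothetical helper: ask the watchdog driver behind fd to stop
 * (enable == 0) or start (enable != 0) the timer.
 * Returns 0 on success, -1 on failure with errno set by ioctl(). */
int wd_set_enabled(int fd, int enable)
{
    int opt = enable ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;

    return ioctl(fd, WDIOC_SETOPTIONS, &opt);
}
```

With fd being the descriptor of the opened /dev/watchdog, wd_set_enabled(fd, 0) corresponds to the WDIOS_DISABLECARD call shown above.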
Thanks for the quick response! I did not expect that.
I looked into your patch and could see that you probed around, changed the order of register accesses, added an additional IRQ status bit reset, …
But maybe I found the real culprit for the issue:
The regmap is initialized to cache register accesses, including those to RN5T618_WATCHDOG (0x0b) and RN5T618_PWRIRQ (0x13).
So I expect that, because of this caching, the IRQ status bit was never reset.
The status register must not be cached because it is set by the RN5T567.
It is also not ideal to cache accesses to the watchdog register, which resets the counter via a read/write cycle.
debugfs shows the regmap setting for these registers:
[root@imx7d /sys/kernel/debug]# cat regmap/0-0033/access
// third column means volatile yes or no
…
0b: y y n n
…
13: y y n n
After marking these registers volatile, stopping the wdt and starting again seems to work.
static bool rn5t618_volatile_reg(struct device *dev, unsigned int reg)
{
	switch (reg) {
	case RN5T618_WATCHDOGCNT:
	case RN5T618_DCIRQ:
	case RN5T618_ILIMDATAH ... RN5T618_AIN0DATAL:
	case RN5T618_ADCCNT3:
	case RN5T618_IR_ADC1 ... RN5T618_IR_ADC3:
	case RN5T618_IR_GPR:
	case RN5T618_IR_GPF:
	case RN5T618_MON_IOIN:
	case RN5T618_INTMON:
	case RN5T618_RTC_CTRL1 ... RN5T618_RTC_CTRL2:
	case RN5T618_RTC_SECONDS ... RN5T618_RTC_YEAR:
	case RN5T618_CHGSTATE:
	case RN5T618_CHGCTRL_IRR ... RN5T618_CHGERR_MONI:
	case RN5T618_CONTROL ... RN5T618_CC_AVEREG0:
	case RN5T618_WATCHDOG: // should not be cached because of r/w cycle to reset counter
	case RN5T618_PWRIRQ: // should not be cached because it is set by hardware
		return true;
	default:
		return false;
	}
}
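To confirm that such a change took effect, the debugfs access file shown earlier can be read again and the third column checked for 0b and 13. The small user-space helper below is only a sketch of that check: the function name regmap_access_is_volatile is made up, and it assumes the line layout seen in the output above (hex register address, then four y/n flags, the third one being "volatile").

```c
#include <stdio.h>

/* Hypothetical helper: parse one line of a regmap debugfs "access"
 * file, e.g. "0b: y y n n", and store the register address in *reg.
 * Returns 1 if the third flag (volatile) is 'y', 0 if it is not,
 * and -1 if the line does not have the expected layout. */
int regmap_access_is_volatile(const char *line, unsigned int *reg)
{
    char r, w, v, p;

    if (sscanf(line, "%x: %c %c %c %c", reg, &r, &w, &v, &p) != 5)
        return -1;
    return v == 'y';
}
```

Fed with the lines from regmap/0-0033/access, it should report 0 for registers 0b and 13 before the change and 1 afterwards.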
Furthermore it is not necessary to do a RN5T618_WATCHDOG read AND write cycle to reset the wdt counter.
The source code states:
/* The counter is restarted after a R/W access to watchdog register */
The RN5T567 datasheet states:
“The count value of watchdog timer is cleared by accessing (R/W) to this register.”
Tests showed that a single read is enough. I did not check other chip variants which use the same driver.
In my opinion a write cycle is even dangerous: in some unexpected situation it could disable the wdt or change the wdt settings stored in this register.
Hello Klaus,
thank you for sharing your analysis results.
Yes, that caching might be the reason for the troubles. Caching an IRQ status bit and relying on the cache flush to clear it?
I will try your suggestions and see if it fixes the problem for me too. Perhaps I can then avoid the unfortunate total disabling of the WD.
Best regards,
Otmar