Watchdog reboot after wake from suspend

Otmar · May 20, 2021, 2:33pm

Hello,

I’m running Linux (4.14.170-3.0.4+gbaa6c24) on Colibri imx7d
(/etc/issue:TDX X11 2.6-snapshot \n \l)

I have configured the watchdog
(cat /proc/cmdline: rn5t618-wdt.timeout=8 … )
and feed it from an application by writing to /dev/watchdog through system call:
system("echo 1 > /dev/watchdog");
Stopping the watchdog in the program flow works well by writing the special character to /dev/watchdog:
system("echo 'V' > /dev/watchdog");
This way I can pause and resume the watchdog without problems.
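For comparison, the same feed and stop operations can be done without spawning a shell, by keeping the device open and writing to it directly. This is only a minimal sketch under assumptions: the helper names (wdt_open, wdt_feed, wdt_stop) are hypothetical, the path is passed in so the functions can be tried on an ordinary file, and stopping via 'V' only works if the driver supports magic close and "nowayout" is not set.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open the watchdog device. On a real /dev/watchdog, opening starts the timer.
 * The path is a parameter (normally "/dev/watchdog") so the code can be tested. */
int wdt_open(const char *path)
{
    return open(path, O_WRONLY);
}

/* Feed the watchdog: any write to the device resets the countdown. */
int wdt_feed(int fd)
{
    return write(fd, "1", 1) == 1 ? 0 : -1;
}

/* Stop the watchdog: write the magic character 'V', then close the device.
 * Returns 0 only if both the write and the close succeeded. */
int wdt_stop(int fd)
{
    int rc = write(fd, "V", 1) == 1 ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Note that each `echo ... > /dev/watchdog` opens and closes the device again, whereas this variant holds a single descriptor for the lifetime of the application.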

However, when I set Linux in standby and pause the watchdog just before:
system("echo 'V' > /dev/watchdog");
system("echo standby > /sys/power/state");
then after wake from standby the watchdog causes a reboot immediately when it gets fed again by:
system("echo 1 > /dev/watchdog");
The system can wake from standby and is stable as long as the watchdog is not fed. As soon as I write to /dev/watchdog the reboot happens.

Am I missing something in the combination between watchdog and standby? Does the watchdog need some special disabling before or re-init after the standby?

Many thanks for your suggestions.
Otmar

Otmar · May 21, 2021, 5:59am

Correction: stopping the watchdog and then trying to restart it always leads to a reboot.
The watchdog is configured and fed cyclically after system startup. It is then paused by writing ‘V’ to /dev/watchdog. As soon as anything other than ‘V’ is written to /dev/watchdog again, the reboot happens. This is independent of system standby.

To update my question: Is it possible to pause and resume the watchdog? If so, what needs to be done?

Many thanks,
Otmar

kevin.tx · May 21, 2021, 10:12am

Dear @Otmar@hexagon,

Thank you for contacting Toradex.

We are currently looking into your issue.

Best Regards
Kevin

alex.tx · May 21, 2021, 10:16pm

I’ve done a quick watchdog test using TDX Wayland with pre-built XWayland 5.2.0-devel-20210409+build.272 (dunfell) Colibri-iMX7-eMMC_Reference-Minimal-Image.

It behaves as expected. Writing “1” to /dev/watchdog after writing ‘V’ to the same file does not cause an immediate reboot. It just re-enables the watchdog, and a reboot happens only if /dev/watchdog is not touched for the specified period.

Otmar · May 25, 2021, 7:39am

Hello Alex,

thanks for the verification. Do you know whether this image is using the internal watchdog of imx7 or an external one?

My image uses the external Ricoh RN5T567 PMIC, as pointed out on this page:

We are currently trying to switch to the internal watchdog, but had some trouble getting it to correctly reboot the system.

Best regards,
Otmar

kevin.tx · May 25, 2021, 10:59am

Dear @Otmar@hexagon,

the internal watchdog cannot be used due to an erratum in the SoC.

More information about it can be found here:

iMX7 Errata

Best Regards
Kevin

Otmar · May 26, 2021, 2:16pm

Kevin, ah, yes, that is why we could not get the reboot to work with the internal watchdog.

Eventually we got stopping and restarting the external watchdog to work. It took some changes to the watchdog driver. Not sure why it did not work out of the box in our image.

Many thanks for looking into this issue.
Otmar

kevin.tx · May 27, 2021, 7:08am

Dear @Otmar@hexagon,

I am glad that your question has been answered.

I wish you a nice day.

Best Regards
Kevin

jaski.tx · June 1, 2021, 2:52pm

Hi @Otmar@hexagon

Usually a watchdog is not meant to be stopped once it has been started; it should always lead to a reset if it is not written to periodically.

Best regards,
Jaski

Otmar · June 1, 2021, 3:06pm

Hello Jaski,
agreed, but, unfortunately, this conflicts with the feature request that our device shall be sometimes in suspend state. During suspend there is no process active to toggle the watchdog. So either the WD has an input to detect low power mode and stop counting, or WD has a pause/resume feature. Otherwise, suspend state would always lead to reset.
Best regards,
Otmar

jaski.tx · June 2, 2021, 12:28pm

Hi @Otmar@hexagon

What is your requirement to have a watchdog?

Best regards,
Jaski

Otmar · June 2, 2021, 1:05pm

Hello Jaski,
unfortunately, we want to have both: the hardware watchdog that restarts the system if the application fails to toggle it regularly, i.e. monitoring whether the application is alive. And, occasionally, we put the device in standby, i.e. suspend Linux and with it our application, in which case we would rather not have the watchdog trigger a reboot.
In any case, we got it working now. In normal operation the WD triggers a reboot when our application fails to toggle it. Before Linux suspend we pause the WD, and we re-enable it after waking from suspend. This is not ideal, since it does not protect us if the application does not resume properly after system wake-up, but it seems good enough.
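The pause, suspend, re-enable sequence described above could look roughly like the following in C. This is a sketch under assumptions, not the code from our application: the function and parameter names are hypothetical, the paths are passed in (normally "/dev/watchdog" and "/sys/power/state"), and it relies on the driver actually honouring the magic close.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a string to an existing file; returns 0 on success, -1 on failure. */
static int write_str(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t len = (ssize_t)strlen(s);
    int rc = (write(fd, s, (size_t)len) == len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Pause the watchdog, request standby, and re-arm the watchdog after wake-up.
 * Returns the newly opened watchdog descriptor (>= 0), or -1 on error.
 * If suspend_rc is not NULL, it receives the result of the standby request. */
int suspend_with_wdt_paused(int wdt_fd, const char *wdt_path,
                            const char *state_path, int *suspend_rc)
{
    /* 1. Pause: magic character 'V', then close the device. */
    int paused = (write(wdt_fd, "V", 1) == 1);
    if (close(wdt_fd) != 0 || !paused)
        return -1;

    /* 2. Suspend: on a real system this write returns only after wake-up. */
    int rc = write_str(state_path, "standby");
    if (suspend_rc)
        *suspend_rc = rc;

    /* 3. Re-arm even if the standby request failed, then feed once. */
    int fd = open(wdt_path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (write(fd, "1", 1) != 1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The gap mentioned above remains: between step 1 and step 3 nothing supervises the system, so a hang during resume goes undetected.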
Best regards,
Otmar

jaski.tx · June 8, 2021, 2:09pm

Thanks for the information. Could you share how you paused the watchdog?

Thanks and best regards,
Jaski

Klaus · August 4, 2022, 11:00am

Hello Otmar!

We are experiencing the same issue.
Could you tell us which driver changes you made?

Thanks in advance!

Otmar · August 4, 2022, 11:33am

Hello Klaus,
please see the attached patch to rn5t618_wdt.c.
Notably, we cleared the pending interrupt flag in various places and modified some control bits in the stop() function.
On the application level, before issuing a “shutdown -h now” we pause the WD by writing the magic ‘V’ into /dev/watchdog and we call
int value = WDIOS_DISABLECARD;
ioctl(<fd of opened /dev/watchdog>, WDIOC_SETOPTIONS, &value);

Basically, before shutting the system down we disable the WD. The changes to the control bits made sure it stayed down.
If the system locks up during shutdown this is very bad. But so far we haven’t encountered this.
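Put together as a small helper, the application-level part might look like this. It is a sketch only: the function name wdt_disable is hypothetical, the descriptor is assumed to be an already opened /dev/watchdog, and whether WDIOC_SETOPTIONS with WDIOS_DISABLECARD has any effect depends on the driver.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Disable the watchdog on an already opened watchdog descriptor.
 * Returns 0 if both steps succeeded, -1 otherwise. */
int wdt_disable(int fd)
{
    int value = WDIOS_DISABLECARD;
    int rc = 0;

    /* Magic character, so that a later close() is allowed to stop the timer. */
    if (write(fd, "V", 1) != 1)
        rc = -1;

    /* Ask the driver to turn the watchdog off right away. */
    if (ioctl(fd, WDIOC_SETOPTIONS, &value) != 0)
        rc = -1;

    return rc;
}
```

Checking the return value before starting the shutdown at least tells the application whether the kernel accepted the request; on a file that is not a watchdog device the ioctl fails and the helper returns -1.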

Hope this helps.
Otmar

006-rn5t618_wdt.c.patch (3.5 KB)

Klaus · August 5, 2022, 5:45am

Thanks for the quick response! I did not expect that.
I looked into your patch and could see that you probed around and changed the order of register accesses, added an additional irq status bit reset,…

But maybe I found the real culprit for the issue:
The regmap is initialized to cache register accesses, and this includes RN5T618_WATCHDOG (0x0b) and RN5T618_PWRIRQ (0x13).

So I expect that because of this caching the IRQ status bit was never reset.
The status register must not be cached because it is set by the RN5T567 itself.
It is also not ideal to cache accesses to the watchdog register, since the counter is reset by a read/write cycle on that register.

debugfs shows the regmap setting for these registers:
[root@imx7d /sys/kernel/debug]# cat regmap/0-0033/access
// third column means volatile yes or no
…
0b: y y n n
…
13: y y n n

After marking these registers volatile, stopping the wdt and starting again seems to work.

static bool rn5t618_volatile_reg(struct device *dev, unsigned int reg)
{
    switch (reg) {
    case RN5T618_WATCHDOGCNT:
    case RN5T618_DCIRQ:
    case RN5T618_ILIMDATAH ... RN5T618_AIN0DATAL:
    case RN5T618_ADCCNT3:
    case RN5T618_IR_ADC1 ... RN5T618_IR_ADC3:
    case RN5T618_IR_GPR:
    case RN5T618_IR_GPF:
    case RN5T618_MON_IOIN:
    case RN5T618_INTMON:
    case RN5T618_RTC_CTRL1 ... RN5T618_RTC_CTRL2:
    case RN5T618_RTC_SECONDS ... RN5T618_RTC_YEAR:
    case RN5T618_CHGSTATE:
    case RN5T618_CHGCTRL_IRR ... RN5T618_CHGERR_MONI:
    case RN5T618_CONTROL ... RN5T618_CC_AVEREG0:
    case RN5T618_WATCHDOG: /* must not be cached: a r/w cycle resets the counter */
    case RN5T618_PWRIRQ:   /* must not be cached: it is set by hardware */
        return true;
    default:
        return false;
    }
}
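
To illustrate why a cached status register can keep an interrupt flag from ever being cleared, here is a small userspace model. It is not the kernel's regmap code: the types and function names are made up for the illustration, and the bit position of the watchdog flag is an assumption. It only mimics two behaviours: non-volatile reads are served from the cache, and a read-modify-write skips the bus write when the value appears unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

#define REG_PWRIRQ 0x13 /* register address quoted in the post above */
#define IRQ_WDOG   0x40 /* assumed bit position, for illustration only */

struct chip { uint8_t regs[256]; };

struct reg_cache {
    uint8_t value[256];
    bool valid[256];
    bool (*is_volatile)(unsigned int reg);
};

static bool nothing_volatile(unsigned int reg) { (void)reg; return false; }
static bool pwrirq_volatile(unsigned int reg) { return reg == REG_PWRIRQ; }

/* Non-volatile registers are read from the chip once, then from the cache. */
static uint8_t cached_read(struct chip *hw, struct reg_cache *c, unsigned int reg)
{
    if (!c->is_volatile(reg)) {
        if (c->valid[reg])
            return c->value[reg];
        c->value[reg] = hw->regs[reg];
        c->valid[reg] = true;
    }
    return hw->regs[reg];
}

/* Read-modify-write that skips the write when nothing seems to change. */
static void update_bits(struct chip *hw, struct reg_cache *c,
                        unsigned int reg, uint8_t mask, uint8_t val)
{
    uint8_t old = cached_read(hw, c, reg);
    uint8_t updated = (uint8_t)((old & ~mask) | (val & mask));

    if (updated == old)
        return; /* the write never reaches the chip */
    hw->regs[reg] = updated;
    if (!c->is_volatile(reg)) {
        c->value[reg] = updated;
        c->valid[reg] = true;
    }
}

/* Returns true if the flag is still set in the chip after the clear attempt. */
bool irq_still_pending(bool mark_volatile)
{
    struct chip hw = {0};
    struct reg_cache c = {0};

    c.is_volatile = mark_volatile ? pwrirq_volatile : nothing_volatile;

    (void)cached_read(&hw, &c, REG_PWRIRQ);        /* first read: flag is clear */
    hw.regs[REG_PWRIRQ] |= IRQ_WDOG;               /* chip raises the flag itself */
    update_bits(&hw, &c, REG_PWRIRQ, IRQ_WDOG, 0); /* driver tries to clear it */

    return (hw.regs[REG_PWRIRQ] & IRQ_WDOG) != 0;
}
```

In this model irq_still_pending(false) leaves the flag set, because the cached value already looks cleared and the write is skipped, while irq_still_pending(true) clears it. That matches the behaviour seen after marking the two registers volatile.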

Klaus · August 8, 2022, 5:38am

Furthermore it is not necessary to do a RN5T618_WATCHDOG read AND write cycle to reset the wdt counter.
The source code states:
/* The counter is restarted after a R/W access to watchdog register */

The RN5T567 datasheet states:
“The count value of watchdog timer is cleared by accessing (R/W) to this register.”

Tests showed that a single read is enough. I did not check other chip variants which use the same driver.

In my opinion a write cycle is even dangerous if there is some strange situation and the write cycle disables the wdt or changes the wdt settings stored in this register.

Otmar · August 16, 2022, 7:16am

Hello Klaus,
thank you for sharing your analysis results.
Yes, that caching might be the reason for the troubles. Caching an IRQ status bit and relying on the cache flush to clear it?
I will try your suggestions and see if it fixes the problem for me too. Perhaps I can then avoid the unfortunate total disabling of the WD.
Best regards,
Otmar