System gets stuck while rebooting (and watchdog fails)

vcds · March 22, 2023, 11:02am

Our application reboots the system periodically for maintenance reasons. It can also be rebooted sometimes out of schedule due to updates or for other reasons.

The reboot sequence works fine most of the time, but sometimes, randomly, the entire system will get stuck at some unknown point in the reboot sequence, and the only way to reset the system is to physically power cycle the carrier board.

I attempted to work around this issue by activating the watchdog just before calling the reboot command, but it made no difference at all.

Trying to recover the systemd-journald logs for investigation from a unit that has just been physically power cycled gets us something like this:

Sep 20 16:14:04 systemd-timesyncd[138]: System clock time unset or jumped backwards, restoring from recorded timestamp: Tue 2023-01-17 08:03:43 UTC
Jan 17 13:33:43 systemd[1]: Started Network Time Synchronization.
Jan 17 13:33:43 kernel: 8<--- cut here ---
Jan 17 13:33:43 kernel: journal-offline: unhandled page fault (7) at 0x76f6f010, code 0x817
Jan 17 13:33:43 kernel: pgd = 398080fa
Jan 17 13:33:43 kernel: [76f6f010] *pgd=94947835, *pte=00000000, *ppte=00000000
Jan 17 13:33:43 kernel: CPU: 0 PID: 144 Comm: journal-offline Not tainted 5.4.115-5.3.0-devel+git.dbdbcabf0f98 #1
Jan 17 13:33:43 kernel: Hardware name: Freescale i.MX6 Ultralite (Device Tree)
Jan 17 13:33:43 kernel: PC is at 0x76ee32ee
Jan 17 13:33:43 kernel: LR is at 0x76dbc117
Jan 17 13:33:43 kernel: pc : [<76ee32ee>]    lr : [<76dbc117>]    psr: 20070030
Jan 17 13:33:43 kernel: sp : 76840d90  ip : 00000076  fp : 7ed03180
Jan 17 13:33:43 kernel: r10: 00000000  r9 : 76841410  r8 : 768413a0
Jan 17 13:33:43 kernel: r7 : 00000002  r6 : 00000006  r5 : 007c2fe0  r4 : 007c2eb0
Jan 17 13:33:43 kernel: r3 : 00000002  r2 : 76f6f000  r1 : 00000002  r0 : 00000000
Jan 17 13:33:43 kernel: Flags: nzCv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
Jan 17 13:33:43 kernel: Control: 10c5387d  Table: 9495406a  DAC: 00000055
Jan 17 13:33:43 kernel: CPU: 0 PID: 144 Comm: journal-offline Not tainted 5.4.115-5.3.0-devel+git.dbdbcabf0f98 #1
Jan 17 13:33:43 kernel: Hardware name: Freescale i.MX6 Ultralite (Device Tree)
Jan 17 13:33:43 kernel: [<8010db4c>] (unwind_backtrace) from [<8010ae6c>] (show_stack+0x10/0x14)
Jan 17 13:33:43 kernel: [<8010ae6c>] (show_stack) from [<808cfd68>] (dump_stack+0x90/0xa4)
Jan 17 13:33:43 kernel: [<808cfd68>] (dump_stack) from [<8011192c>] (__do_user_fault+0xfc/0x100)
Jan 17 13:33:43 kernel: [<8011192c>] (__do_user_fault) from [<80111d0c>] (do_page_fault+0x354/0x394)
Jan 17 13:33:43 kernel: [<80111d0c>] (do_page_fault) from [<80111eb4>] (do_DataAbort+0x3c/0xc0)
Jan 17 13:33:43 kernel: [<80111eb4>] (do_DataAbort) from [<80101dbc>] (__dabt_usr+0x3c/0x40)
Jan 17 13:33:43 kernel: Exception stack(0x94bf3fb0 to 0x94bf3ff8)
Jan 17 13:33:43 kernel: 3fa0:                                     00000000 00000002 76f6f000 00000002
Jan 17 13:33:43 kernel: 3fc0: 007c2eb0 007c2fe0 00000006 00000002 768413a0 76841410 00000000 7ed03180
Jan 17 13:33:43 kernel: 3fe0: 00000076 76840d90 76dbc117 76ee32ee 20070030 ffffffff
Jan 17 13:33:43 systemd[1]: Reached target System Initialization.
Jan 17 13:33:43 systemd[1]: Started Daily Cleanup of Temporary Directories.
Jan 17 13:33:43 systemd[1]: Reached target System Time Set.
Jan 17 13:33:43 systemd[1]: Reached target System Time Synchronized.
Jan 17 13:33:43 systemd[1]: Reached target Timers.
Jan 17 13:33:43 systemd[1]: Listening on Avahi mDNS/DNS-SD Stack Activation Socket.
Jan 17 13:33:43 systemd[1]: Listening on D-Bus System Message Bus Socket.
Jan 17 13:33:43 systemd[1]: Starting sshd.socket.
Jan 17 13:33:43 systemd[1]: Listening on sshd.socket.
Jan 17 13:33:43 systemd[1]: Reached target Sockets.
Jan 17 13:33:44 systemd[1]: Reached target Basic System.
.
.
.

And so, whatever valuable information might have been there is already corrupt.

This problem is becoming a serious business issue for us: the larger our fleet grows, the higher the probability that some unit somewhere will get stuck sooner or later. Our units are spread across a very wide geographical area, and we cannot quickly send someone to physically reset a unit when it gets stuck like this. We have already received multiple client complaints about units failing in the field.

We hope you all will be able to help us find a way to solve this issue. Please bear in mind that it is not going to be possible for us to replace/upgrade the BSP on our field units. What we can do is push small patches/commands, or changes to our application remotely.

Colibri iMX6ULL 512MB IT V1.1A
Linux BSP 5.3.0

sahil.tx · April 18, 2023, 5:52am

Hi @vcds,
This is a continuation of our earlier discussion.

Please connect the debug port (UART_A by default) to your PC and watch the debug messages in a terminal application (minicom, gtkterm, PuTTY).
Please note that you need an RS232-to-USB converter that matches the voltage levels of your hardware's UART pins.
After that, please check at what point the system stops booting.
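
As a rough illustration of the capture step, the sketch below logs the debug console to a file and puts a timestamp at the start of every line, so the last message before a hang is easy to find. It assumes a Linux host, that the converter shows up as /dev/ttyUSB0 (a placeholder, check your own device name), and a 115200 8N1 console, which should be confirmed for your image. The function names are made up for this example; the built-in logging of minicom or PuTTY does the same job.

```python
import os
import termios
import time


def open_console(path="/dev/ttyUSB0", baud=termios.B115200):
    """Open a serial port in raw 8N1 mode and return its file descriptor.

    The device path and baud rate are assumptions, adjust them for your setup.
    """
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    attrs = termios.tcgetattr(fd)
    # attrs = [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs[0] = 0  # no input translation
    attrs[1] = 0  # no output translation
    attrs[2] = termios.CS8 | termios.CREAD | termios.CLOCAL  # 8 data bits, no parity
    attrs[3] = 0  # raw mode: no echo, no line editing
    attrs[4] = baud
    attrs[5] = baud
    attrs[6][termios.VMIN] = 1   # read() blocks until at least one byte arrives
    attrs[6][termios.VTIME] = 0
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return fd


def read_chunks(fd, size=4096):
    """Yield raw byte chunks from a file descriptor until end of stream."""
    while True:
        data = os.read(fd, size)
        if not data:
            return
        yield data


def stamp_stream(chunks, clock=time.time):
    """Yield text pieces with a timestamp inserted at the start of each line.

    Text is passed through as soon as it arrives, so a partial last line
    printed just before a hang still reaches the log.
    """
    at_line_start = True
    for chunk in chunks:
        pieces = []
        for ch in chunk.decode("ascii", errors="replace"):
            if ch == "\r":
                continue  # drop carriage returns sent by the console
            if at_line_start:
                pieces.append(f"[{clock():.3f}] ")
                at_line_start = False
            pieces.append(ch)
            if ch == "\n":
                at_line_start = True
        yield "".join(pieces)


# Example use (commented out because it needs real hardware):
# fd = open_console()
# with open("console.log", "a") as log:
#     for piece in stamp_stream(read_chunks(fd)):
#         log.write(piece)
#         log.flush()
```

Flushing after every write keeps the file current even if the capture has to be interrupted with Ctrl-C once the board stops responding.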

vcds · April 22, 2024, 8:54am

Hi @sahil.tx

We finally managed to acquire and test one of our field units which encountered this particular issue multiple times. Over the course of several tests we were able to reproduce the issue with the particular Toradex Colibri module not only on our custom carrier board but also on the Colibri Evaluation Board. It can take anywhere from about ten reboots to a hundred reboots before it gets stuck - that number seems essentially random.

I have attached in this post the console log from one of our tests. The reboot gets stuck at exactly the last line in the log.
stuck-on-reboot.log.gz (26.0 KB)
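
For anyone who wants to skim a capture like this without reading it line by line, here is a minimal sketch that counts how many boots a console log contains and returns the last non-empty line, which is where the reboot stalled. The marker string "U-Boot" is an assumption about what the bootloader prints once at the start of each boot, and the function name is hypothetical; adjust both to match the actual log.

```python
import gzip


def summarize_console_log(path, boot_marker="U-Boot"):
    """Return (number of boots seen, last non-empty line) for a console log.

    boot_marker is an assumed string that appears once per boot; change it
    if the bootloader banner in your log looks different.
    """
    opener = gzip.open if path.endswith(".gz") else open
    boots = 0
    last_line = ""
    with opener(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # ignore blank lines
            if boot_marker in line:
                boots += 1
            last_line = line
    return boots, last_line


# Example use:
# boots, last = summarize_console_log("stuck-on-reboot.log.gz")
# print(f"{boots} boots captured, last line: {last!r}")
```

Comparing the last line across several stuck runs would show whether the hang always happens at the same point in the shutdown sequence.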

sahil.tx · May 7, 2024, 10:47am

Hi @vcds,
Thanks for sending the module. We are looking into it and will update you soon.

sahil.tx · May 13, 2024, 4:22pm

Hi @vcds,

Sorry for the delay; I could not work on it last week.

Please confirm the points below:
You are not resetting the system using the watchdog; you are simply activating the watchdog and then rebooting the module from your code?
Is there any delay between activating the watchdog and rebooting the system from your code?
Can you also help me understand where the watchdog is failing?

vcds · May 13, 2024, 5:45pm

@sahil.tx

Yes, the watchdog is only activated, and the system is rebooted using the “reboot” command.
No, the code does not wait between activating the watchdog and executing the reboot command.
The watchdog is failing because it is unable to reset the system when it gets stuck while rebooting.

I think what you should first try to do is figure out why the system gets stuck while rebooting. Then you can determine why the watchdog also fails.

sahil.tx · May 22, 2024, 4:09am

Hi @vcds,
We were able to observe the issue on the module you sent, and we have started debugging it.

sahil.tx · July 29, 2024, 11:36am

Hi @vcds,

It seems our discussion was not updated here, so I am pasting my last reply.
Here are our observations based on our testing:

We were able to reproduce the issue with the module you sent, running your customized image 5.3.
We tested multiple modules, including yours, for a week with image 5.7.6 and did not observe any stuck reboots.
Our suggestion is to use this image and test for the issue.