Hello @henrique.tx, @marcel.tx,
Our application reboots the system periodically for maintenance. It can also reboot outside that schedule, for updates or other reasons.
The reboot sequence works fine most of the time, but occasionally, and seemingly at random, the entire system gets stuck at some unknown point in the reboot sequence. When that happens, the only way to recover the system is to physically power cycle the carrier board.
I attempted to work around this issue by activating the watchdog just before calling the reboot command, but it made no difference at all.
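For anyone wanting to reproduce this, here is a minimal sketch of that kind of workaround. It is not necessarily the exact code running on our units. It assumes the standard Linux watchdog character device at /dev/watchdog, the generic ioctl number encoding used on ARM and x86, and a Python interpreter on the target. The names `arm_watchdog` and `_ioc` are made up for this illustration.

```python
import fcntl
import os
import struct

# ioctl request numbers from <linux/watchdog.h>, built with the generic
# _IOC encoding (ARM, x86). Assumption: other architectures may encode
# the direction/size bits differently.
_IOC_WRITE, _IOC_READ = 1, 2


def _ioc(direction, type_char, nr, size):
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr


WDIOC_KEEPALIVE = _ioc(_IOC_READ, "W", 5, 4)                # _IOR('W', 5, int)
WDIOC_SETTIMEOUT = _ioc(_IOC_READ | _IOC_WRITE, "W", 6, 4)  # _IOWR('W', 6, int)


def arm_watchdog(device="/dev/watchdog", timeout_s=30):
    """Open the watchdog device, request a timeout and ping it once.

    Returns (fd, granted_timeout). Opening the device is what starts the
    timer; the caller keeps fd open and then triggers the reboot.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        # The driver writes the timeout it actually granted back into the buffer.
        buf = fcntl.ioctl(fd, WDIOC_SETTIMEOUT, struct.pack("i", timeout_s))
        granted = struct.unpack("i", buf)[0]
        # Restart the countdown so the full timeout applies from now on.
        fcntl.ioctl(fd, WDIOC_KEEPALIVE)
    except OSError:
        # Note: the timer may already be running at this point.
        os.close(fd)
        raise
    return fd, granted
```

The intended use is to call `fd, granted = arm_watchdog(timeout_s=60)` and then issue the reboot (for example `systemctl reboot`) without pinging the device again, so that a hung shutdown ends in a hardware reset. Two caveats apply to the sketch: the open is expected to fail with EBUSY if another process (such as systemd with its own watchdog settings) already holds the device, and whether the timer keeps running once the process is killed during shutdown depends on the driver's Magic Close and nowayout behaviour.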
When we try to recover the systemd-journald logs from a unit that has just been physically power cycled, we get something like this:
Sep 20 16:14:04 systemd-timesyncd[138]: System clock time unset or jumped backwards, restoring from recorded timestamp: Tue 2023-01-17 08:03:43 UTC
Jan 17 13:33:43 systemd[1]: Started Network Time Synchronization.
Jan 17 13:33:43 kernel: 8<--- cut here ---
Jan 17 13:33:43 kernel: journal-offline: unhandled page fault (7) at 0x76f6f010, code 0x817
Jan 17 13:33:43 kernel: pgd = 398080fa
Jan 17 13:33:43 kernel: [76f6f010] *pgd=94947835, *pte=00000000, *ppte=00000000
Jan 17 13:33:43 kernel: CPU: 0 PID: 144 Comm: journal-offline Not tainted 5.4.115-5.3.0-devel+git.dbdbcabf0f98 #1
Jan 17 13:33:43 kernel: Hardware name: Freescale i.MX6 Ultralite (Device Tree)
Jan 17 13:33:43 kernel: PC is at 0x76ee32ee
Jan 17 13:33:43 kernel: LR is at 0x76dbc117
Jan 17 13:33:43 kernel: pc : [<76ee32ee>] lr : [<76dbc117>] psr: 20070030
Jan 17 13:33:43 kernel: sp : 76840d90 ip : 00000076 fp : 7ed03180
Jan 17 13:33:43 kernel: r10: 00000000 r9 : 76841410 r8 : 768413a0
Jan 17 13:33:43 kernel: r7 : 00000002 r6 : 00000006 r5 : 007c2fe0 r4 : 007c2eb0
Jan 17 13:33:43 kernel: r3 : 00000002 r2 : 76f6f000 r1 : 00000002 r0 : 00000000
Jan 17 13:33:43 kernel: Flags: nzCv IRQs on FIQs on Mode USER_32 ISA Thumb Segment user
Jan 17 13:33:43 kernel: Control: 10c5387d Table: 9495406a DAC: 00000055
Jan 17 13:33:43 kernel: CPU: 0 PID: 144 Comm: journal-offline Not tainted 5.4.115-5.3.0-devel+git.dbdbcabf0f98 #1
Jan 17 13:33:43 kernel: Hardware name: Freescale i.MX6 Ultralite (Device Tree)
Jan 17 13:33:43 kernel: [<8010db4c>] (unwind_backtrace) from [<8010ae6c>] (show_stack+0x10/0x14)
Jan 17 13:33:43 kernel: [<8010ae6c>] (show_stack) from [<808cfd68>] (dump_stack+0x90/0xa4)
Jan 17 13:33:43 kernel: [<808cfd68>] (dump_stack) from [<8011192c>] (__do_user_fault+0xfc/0x100)
Jan 17 13:33:43 kernel: [<8011192c>] (__do_user_fault) from [<80111d0c>] (do_page_fault+0x354/0x394)
Jan 17 13:33:43 kernel: [<80111d0c>] (do_page_fault) from [<80111eb4>] (do_DataAbort+0x3c/0xc0)
Jan 17 13:33:43 kernel: [<80111eb4>] (do_DataAbort) from [<80101dbc>] (__dabt_usr+0x3c/0x40)
Jan 17 13:33:43 kernel: Exception stack(0x94bf3fb0 to 0x94bf3ff8)
Jan 17 13:33:43 kernel: 3fa0: 00000000 00000002 76f6f000 00000002
Jan 17 13:33:43 kernel: 3fc0: 007c2eb0 007c2fe0 00000006 00000002 768413a0 76841410 00000000 7ed03180
Jan 17 13:33:43 kernel: 3fe0: 00000076 76840d90 76dbc117 76ee32ee 20070030 ffffffff
Jan 17 13:33:43 systemd[1]: Reached target System Initialization.
Jan 17 13:33:43 systemd[1]: Started Daily Cleanup of Temporary Directories.
Jan 17 13:33:43 systemd[1]: Reached target System Time Set.
Jan 17 13:33:43 systemd[1]: Reached target System Time Synchronized.
Jan 17 13:33:43 systemd[1]: Reached target Timers.
Jan 17 13:33:43 systemd[1]: Listening on Avahi mDNS/DNS-SD Stack Activation Socket.
Jan 17 13:33:43 systemd[1]: Listening on D-Bus System Message Bus Socket.
Jan 17 13:33:43 systemd[1]: Starting sshd.socket.
Jan 17 13:33:43 systemd[1]: Listening on sshd.socket.
Jan 17 13:33:43 systemd[1]: Reached target Sockets.
Jan 17 13:33:44 systemd[1]: Reached target Basic System.
...
So whatever valuable information the journal might have held is already corrupted.
This problem is becoming a real business headache for us: the larger our fleet grows, the higher the probability that some unit somewhere will get stuck sooner or later. Our units are spread across very wide geographical regions, and we cannot quickly send someone to physically reset a unit when it gets stuck like this. We have already received multiple client complaints about units failing in the field.
We hope you can help us find a way to solve this issue. Please bear in mind that we cannot replace or upgrade the BSP on our field units. What we can do is remotely push small patches or commands, or changes to our application.
Colibri iMX6ULL 512MB IT V1.1A
Linux BSP 5.3.0