Unable to get persistent logs working

CPU: NXP i.MX8QM RevB A53 at 1200 MHz
BSP: TorizonCore with PREEMPT_RT 5.5.0+build.11

I’m trying to track down a random reboot and have enabled the logs as specified at Persistent Journald Logging

I’ve edited the two files as specified and rm /var/log but on reboot, the system seems to recreate /var/log pointing to /var/volatile/log.

root@apalis-imx8-06945459:~# ls -alt /var/log
lrwxrwxrwx 1 root root 17 Jan  1  1970 /var/log -> /var/volatile/log

Is there some other step that I am missing?

Here’s the boot log


U-Boot 2020.04-5.5.0+git.81bc8894031d (Jan 01 1970 - 00:00:00 +0000)

CPU:   NXP i.MX8QM RevB A53 at 1200 MHz

DRAM:  4 GiB
MMC:   FSL_SDHC: 0, FSL_SDHC: 1, FSL_SDHC: 2
Loading Environment from MMC... OK
In:    serial
Out:   serial
Err:   serial
Model: Toradex Apalis iMX8 QuadMax 4GB IT V1.1C, Serial# 06945459

 BuildInfo:
  - SCFW 778670e2, SECO-FW d63fdb21, IMX-MKIMAGE 8947fea3, ATF 835a8f6
  - U-Boot 2020.04-5.5.0+git.81bc8894031d

switch to partitions #0, OK
mmc0(part 0) is current device
flash target is MMC:0
Net:   eth0: ethernet@5b040000
Fastboot: Normal
Normal Boot
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found U-Boot script /boot.scr
1182 bytes read in 14 ms (82 KiB/s)
## Executing script at 83100000
5967 bytes read in 28 ms (208 KiB/s)
166807 bytes read in 34 ms (4.7 MiB/s)
43 bytes read in 26 ms (1000 Bytes/s)
Applying Overlay: apalis-imx8_hdmi_overlay.dtbo
2177 bytes read in 35 ms (60.5 KiB/s)
11612693 bytes read in 278 ms (39.8 MiB/s)
Uncompressed size: 28715520 = 0x1B62A00
9176228 bytes read in 227 ms (38.6 MiB/s)
## Flattened Device Tree blob at 83000000
   Booting using the fdt blob at 0x83000000
   Loading Ramdisk to fcd8d000, end fd64d4a4 ... OK
   Loading Device Tree to 00000000fcd41000, end 00000000fcd8cfff ... OK

Starting kernel ...

[    0.161562] 001: No BMan portals available!
[    0.162818] 001: No QMan portals available!
[    1.853236] 002: imx-audmix imx-audmix.0: failed to find SAI platform device
[    1.993388] 001: imx6q-pcie 5f010000.pcie: pcie_ext clock source missing or invalid
[    1.994751] 002: imx6q-pcie 5f000000.pcie: pcie_ext clock source missing or invalid
[    2.681111] 000: debugfs: Directory '59050000.sai' with parent 'apalis-imx8qm-sgtl5000' already present!
[    3.843790] 000: imx6q-pcie 5f010000.pcie: failed to initialize host
[    3.843799] 000: imx6q-pcie 5f010000.pcie: unable to add pcie port.
[    3.963620] 003: imx6q-pcie 5f000000.pcie: failed to initialize host
[    3.963629] 003: imx6q-pcie 5f000000.pcie: unable to add pcie port.
Starting version 244.5+
[   10.168412] 001: debugfs: Directory '59090000.sai' with parent 'imx-audio-hdmi-tx' already present!
[   10.187124] 003: debugfs: File 'Capture' in directory 'dapm' already present!
[   12.019231] 001: systemd-journald[731]: Failed to create new system journal: No such file or directory
[   12.019298] 001: systemd-journald[731]: Failed to open user journal directory '/var/log/journal/27b6ab78e4414b3aa6e3067b150dc6c9/': No such file or directory

TorizonCore with PREEMPT_RT 5.5.0+build.11 apalis-imx8-06945459 ttyLP1

Greetings @toebes,

This is interesting, it seems there may have been a regression here in TorizonCore. In the past persistent logging was known to work as documented. But it seems that is no longer the case. I was able to reproduce the behavior you described on my end. The logging seems to work fine in our reference BSP, but not TorizonCore perhaps something minor changed between the two.

In any case, I’ll report this issue to the team to get looked at. Thank you for bringing this to our attention.

Best Regards,
Jeremias

Thank you! I look forward to a fix. If you have any other suggestions for tracking down a random reboot in the meantime I’m all ears!

With regards to your random reboot I see you’re using the PREEMPT_RT variant of TorizonCore. I recall in the past for a time we had issues where the PREEMPT_RT image would reboot randomly as you describe here. However the issue was unable to be reproduced after some time, so we thought perhaps it resolved itself.

Is there any information you could provide about the reboot? For example what you were you doing on the device when it happens? How long has the system been running before it reboots?

Last time this issue popped up there was so little information that we couldn’t characterize it very well. So any information you can provide would be very helpful.

Best Regards,
Jeremias

Alas I am doing nothing at all, it is just sitting there. It seems to reboot about once an hour, but not on a regular schedule that I can tell. Last night I left it up and it rebooted 13 times in a period of 12 hours. I always keep the serial port connection open, so I can just count the number of U-Boot banner messages.

Let me know if you can think of anything I can do to capture the events.

What you described certainly sounds like the issue that was reported in the past with PREEMPT_RT. As a heads-up, I’ve notified the team internally about this issue appearing again. We’re discussing how best to handle this, I’ll keep you informed as the situation develops. Just to be absolutely clear, you do require PREEMPT_RT in your solution then?

As for capturing/logging such events. With such a random reboot event that produces no output it might be difficult to get any useful data. One idea would be to leave dmesg running with the -w flag so it follows new logs. Perhaps whatever causes the reboot outputs something in dmesg before the system reboots. It also may not, so I’m not entirely sure if this would even give useful information.

Best Regards,
Jeremias

I started running a top and capturing the output also in the hopes that it manages to catch something unusual. I expect it to fail sometime in the next hour and will see what we get.

dmesg doesn’t appear to have been revealing. Nothing seemed to come out with the reboot. I’ve attached the full logs watching top prior to the restart ono two occasions.

If you have anything else to monitor, this testbed seems to be pretty reliably unreliable :slight_smile:
logs - Toebes.zip (705.3 KB)

I’ll get the team’s eyes on this issue and see what they come up with. In the meantime please provide us with any further information you uncover on your end.

Best Regards,
Jeremias

Ok, I think we have a fix for the persistent logging issue. Our documentations says to remove the line d /var/volatile/log 1777 - - from the file /etc/tmpfiles.d/00-create-volatile.conf. However, the contents of this file in TorizonCore are slightly different now, they look like this:

#This goes hand-in-hand with the base-files of OE-Core. The file must
# be sorted before 'systemd.conf' becuase this attempts to create a file
# inside /var/log.

d               /var/volatile/log               -       -       -       -
d               /var/volatile/tmp               1777    -       -
L+ /var/tmp - - - - /var/volatile/tmp
L+ /var/log - - - - /var/volatile/log

You instead need to remove the following 2 lines from this file:

d               /var/volatile/log               -       -       -       -
L+ /var/log - - - - /var/volatile/log

Other than this all the other steps are the same as documented. With this persistent logging should be working. We’re currently working on updating the documentation to reflect this change.

Best Regards,
Jeremias

Ok it didn’t take long to reboot after I got it all set up.
errlog (216.3 KB)

Thank you for the log, there’s quite a bit to go through here. But, one thing I immediately noticed is towards the end. It seems various processes are stopped and that a proper software reboot is occurring. This is important to know that the system is going through the steps of a reboot, rather than just abruptly rebooting.

I don’t see a reason yet though for what is triggering this reboot process. Though I may need to comb through these logs more thoroughly. I’ll pass this on to the team as well.

Best Regards,
Jeremias

I’ll capture one more reboot and then put /var/log back to /var/volatile log. Other than putting the two files back to the original state and deleting /var/log, is there anything else I need to do? Is there a way to automatically clean up the old logs?

Here’s another one from a second run. I’ve put everything back to having the logs volatile, but if you need me to switch it back, just let me know.

log1.txt (216.3 KB)

Quick clarifying question from looking at your logs.

Have you provisioned this device for our OTA update service? Some of the lines in the logs you provided seem to suggest this, but I wanted to check with you to make sure.

Best Regards,
Jeremias

Yes. It is using OTA.

Well the good news is, I was able to reproduce this issue. Albeit a bit inconsistently. After running the PREEMPT_RT image for ~1hr the first reboot occurred. I then didn’t experience any reboots for the next several hours.

But it does seem to happen as you say. Though reproducing this is a bit hit or miss. Anyways further tests will need to be done, but just wanted to give you an update.

Best Regards,
Jeremias

That is great news that you can reproduce it. Is there an image that you recommend which is more stable as a base?

Any of the non PREEMPT_RT images should be stable. If you are not doing anything in development that strictly requires PREEMPT_RT than I would suggest using a non PREEMPT_RT image for the time being.

Best Regards,
Jeremias