Apalis iMX8QM/Ixora v1.2 spontaneous restart

Hello,

I have encountered an issue with our Apalis iMX8QMs. They reboot spontaneously, sometimes after 12 or more hours. Our setup:

  • Ixora V1.2 carrier board with custom interface hat board

  • Apalis iMX8QM compute module

  • Heatsink

  • Capacitive 10" touch display

  • torizon-core-docker-rt-apalis-imx8-Tezi_5.3.0+build.7

I downloaded the torizon-core image manually and installed it from USB using Apalis-iMX8_ToradexEasyInstaller_2.0b6-20201102, since only the pre-provisioned version seemed to be available on the EasyInstaller streams.

I then manually apply the following changes:

  • Enable 2 wifi networks, one as subscriber, one in AP mode

  • Reduce uptane polling_sec to 30s

  • Add overlays: fdt_overlays=apalis-imx8_lvds_overlay.dtbo apalis-imx8_atmel-mxt_overlay.dtbo display-lt170410_overlay.dtbo apalis-imx8_hdmi_overlay.dtbo

  • Disable the docker-compose service and define a custom service that runs docker-compose from /var/sota/storage/docker-compose with additional ExecStartPre and ExecStopPost commands

  • Provision and install docker-compose.yml from Torizon OTA

I have observed nothing in journalctl that would explain this behaviour. The systems start up normally, nothing seems out of the ordinary, and then they spontaneously restart, sometimes hours later. Adding a big fan on top of the heatsink made no difference; we also have a lot of Apalis iMX8 modules with a similar setup on TorizonCore 1.5.1 that run without a fan for weeks. Increasing the uptane polling_sec seemed to reduce the reset frequency a little, but even at polling_sec=300-1800 they reset after 6-7 hours.
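Since the journal shows nothing before a reset, one way to at least pin down when each reset happens is a small heartbeat logger that records the kernel boot ID with a timestamp and syncs every record to disk. The following is only a minimal sketch, not something taken from this setup: the log location, the 10-second interval and all function names are assumptions chosen for illustration. It relies on /proc/sys/kernel/random/boot_id, which Linux regenerates on every boot.

```python
import json
import os
import time
from pathlib import Path

# Linux exposes a random identifier that changes on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path=BOOT_ID_PATH):
    """Return the kernel's per-boot identifier as a string."""
    return Path(path).read_text().strip()


def append_heartbeat(log_path, boot_id, timestamp):
    """Append one record and force it to disk so it survives a sudden reset."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"boot_id": boot_id, "t": timestamp}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def summarize_boots(log_path):
    """Return [(boot_id, first_seen, last_seen), ...] in order of appearance."""
    boots = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # a reset can leave a torn line behind; skip it
            if boots and boots[-1][0] == rec["boot_id"]:
                # Same boot as the previous record: extend its last-seen time.
                boots[-1] = (rec["boot_id"], boots[-1][1], rec["t"])
            else:
                boots.append((rec["boot_id"], rec["t"], rec["t"]))
    return boots


def run(log_path, interval_s=10):
    """Write a heartbeat every interval_s seconds until interrupted."""
    boot_id = read_boot_id()
    while True:
        append_heartbeat(log_path, boot_id, time.time())
        time.sleep(interval_s)
```

Calling run() with a path on persistent storage (a hypothetical choice would be somewhere under /var) and later calling summarize_boots() on the same file gives, per boot, the last time the system was known to be alive. That timestamp can then be compared against the journal, the uptane polling times, or supply measurements. The fsync on every record costs some flash wear, so the interval is a trade-off.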

Is there any known issue with torizon-core-docker-rt-apalis-imx8-Tezi_5.3.0+build.7 that might be related? 5.3.0+build.7 is, to my understanding, the latest available quarterly release for docker-rt without pre-provisioned containers.

Is there a chance that not using the RT version might change anything in this regard? Our application is not currently using RT functionality but might in the future.

I don’t think the shunt resistor issue discussed here https://share.toradex.com/pykrvmjjq91unh8?direct applies, since I am using Ixora v1.2.

Just found "IMX8 apalis is not booting up - #20 by oklemencic", which might be related. Is there any way to identify by serial number whether an Ixora v1.2 might be affected? I have carrier boards from various production batches available here.

How much power is your hat interface board consuming? Although Ixora 1.2 has increased power supply ampacity, you can still overload it if additional peripherals are connected.

There is nothing on the interface board that loads the Ixora supply; the Ixora is actually supplied with 12V from the hat. The hat only carries a couple of serial interfaces that are supplied separately, and a short-term backup power supply in case power drops out.

Even when supplying 12V directly from a lab supply it reboots sometimes.

What version of the Apalis IMX8QM are you using?
Is the software setup on the other modules the same as on this one?

Best Regards,

Matthias Gohlke

We are using Apalis IMX8QM 4GB WiFi/IT V1.1C at the moment.

I was able to solve the issue in the meantime. The solution was to use the version without PREEMPT_RT applied. All problems disappeared when switching from torizon-core-docker-rt-apalis-imx8-Tezi_5.3.0+build.7 to torizon-core-docker-apalis-imx8-Tezi_5.3.0+build.7.

We have not found the underlying cause of the intermittent reset when flashing the docker-rt variant. Hope this is helpful.

Hi @grafue, thanks for sharing. Glad to hear it’s not just me. I am using a very similar setup to yours (same Apalis IMX8QM SOM version, Ixora carrier board, etc.) and I experienced the same rebooting issue when using the RT TorizonCore image. Sometimes it would last up to 12 hours before a reboot. Other times it would reboot every 10 minutes. Most of the time the system was just sitting idle, not running any application code. I lived with the issue for several months. Like you, my problem went away after switching to the non-real-time kernel. (Right now I’m at 18 days of uptime, so that’s a good sign!) Unfortunately I wasn’t able to make any progress at figuring out the root cause (and my motivation is low because my application does not have strict real-time requirements). Just wanted to share so you’d know it affects more people!

Greetings @grafue and @dcullen,

Thank you both for reporting this odd issue. I’ll report this internally and see if we can reproduce it on our side.

Just to confirm, are these the full steps for reproducing this?

  • Flash Apalis i.MX8QM with the real-time variant of TorizonCore
  • Run system idle
  • After some time of running idle the system will randomly reboot

Also, just as a sanity check: do either of you get such spontaneous reboots when running an RT image based on our reference BSP (i.e. non-Torizon)?

Did I get that all right?

Best Regards,
Jeremias

Thanks for sharing @dcullen !

Yes, those steps should suffice to reproduce, @jeremias.tx. As discussed in my initial post, I also apply some manual changes and overlays and replace the default docker-compose service, but between @dcullen and me it sounds like even the bare install should trigger the issue with this hardware combo. I can confirm that sometimes it appears more often and sometimes it takes a long time; nothing obvious influences the frequency.
One important point is that I did not flash our systems from EasyInstaller but downloaded the image manually from the server; maybe there is something wrong in the release/tagging process there.

I have not used any BSPs other than Torizon so I know nothing about reboot issues in those circumstances.

Best

Hi @jeremias.tx,

Yes, those steps look correct. I had hoped to give you some specific version information so I dug through my notes. I definitely experienced the issue with this nightly build:

  • “TorizonCore with evaluation containers (PREEMPT_RT)” 5.3.0-devel-2021-06+build.14.container

I later switched to a quarterly release (5.3.0+build.7.container I believe) in hopes of resolving it but still experienced the same behavior. Unfortunately my notes don’t have any more detail than that.

Also, I have not tried running the RT Reference BSP. (I’ve dabbled with the non-real-time Reference BSP, but not the RT version. For what it’s worth, I’m happy to report that the non-real-time version gave me no issues.)

A few more observations:

  • Like @grafue, there has been no obvious relationship between my activities on the device and the frequency of the reboots.
  • I have used various device tree overlays and various docker-compose services, but none of that seems to correlate with the frequency of reboots.
  • Like @grafue, I also first assumed a power supply issue, but tried a couple options and the issue persisted. In fact, now that I think about it, I’m pretty sure I also observed this issue with the Apalis Evaluation Board, in addition to the Ixora Carrier Board.

Thanks,
Dan

Thank you to you both for the reports and additional information regarding this issue. I think I have enough information to bring this forward internally. I’ll see if we can do some testing on the real-time images leading up to this next quarterly release of TorizonCore.

If either of you have any additional information about this issue, please continue to comment in this thread here.

Best Regards,
Jeremias

Hi @grafue and @dcullen,

Unfortunately after some testing we have not been able to reproduce this issue yet. In our tests we’ve done the following:

  • Flash an Apalis i.MX8QM with TorizonCore Preempt-RT image (5.4 nightly).
  • Instead of just leaving the module running idle, we decided to run RT stress tests to simulate some load on the device.
  • Left device running with stress test, unattended, observing occasionally to see if any reboot has occurred.
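For unattended runs like this, the occasional manual check could be replaced by periodic uptime samples: whenever the sampled uptime drops below the previous sample, a reboot happened in between. This is only a sketch under assumptions; the function names are made up for illustration, and the samples are assumed to be collected by some external means (for example over SSH or a serial console) as pairs of wall-clock time and the first field of /proc/uptime.

```python
def read_uptime(path="/proc/uptime"):
    """Seconds since boot, taken from the first field of /proc/uptime."""
    with open(path, encoding="ascii") as f:
        return float(f.read().split()[0])


def find_reboots(samples):
    """Detect reboots in (wall_clock_s, uptime_s) samples, oldest first.

    Returns a list of (last_seen_alive, estimated_boot_time) tuples, both in
    wall-clock seconds. A reboot is flagged whenever uptime decreases.
    """
    reboots = []
    prev = None
    for wall, uptime in samples:
        if prev is not None and uptime < prev[1]:
            # The device was last seen alive at prev[0]; the new boot
            # started roughly `uptime` seconds before the current sample.
            reboots.append((prev[0], wall - uptime))
        prev = (wall, uptime)
    return reboots


# Example with made-up numbers: the third sample shows a smaller uptime.
samples = [(1000, 50), (1060, 110), (1120, 20), (1180, 80)]
print(find_reboots(samples))
```

The check misses a reboot if the gap between two samples is long enough for the new uptime to exceed the old one, so the sampling interval should be short compared to the time between resets being looked for.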

One of these tests ran for over 1 day straight with no reboots. It’s hard to say if there’s an underlying issue here or not. If there is, then it doesn’t seem to be very reproducible.

If possible, could both of you try to reproduce this issue with a recent PREEMPT-RT nightly? I noticed that when you both encountered this issue it seemed to be on version 5.3.0. If you could try 5.4.0, that would be appreciated. Furthermore, when you run your tests, please connect only the bare minimum of cables to the hardware to rule out other factors.

Since we were unable to reproduce this I can only assume there might be some other factor in both of your setups. Perhaps power related? Or network possibly?

Please let me know if either of you discover any additional information that we can investigate further.

Best Regards,
Jeremias

@jeremias.tx Interesting! Maybe 5.4 fixes something subtle. Thanks for trying to reproduce. I’ll give 5.4 a try on my SOM later this week.