iMX8M Plus PREEMPT_RT gets rebooted with rt-validation stress tests

wheeler · July 19, 2021, 1:39am

Hello

The iMX8M Plus reboots on its own while running the rt-validation stress-test container. The reboot happens 15-60 min after the container is started:

Boot Output:

U-Boot SPL 2020.04-5.3.0-devel+git.2e5818d88ee0 (Jan 01 1970 - 00:00:00 +0000)
DDRINFO: start DRAM init
DDRINFO: DRAM rate 4000MTS
DDRINFO:ddrphy calibration done
DDRINFO: ddrmix config done
Normal Boot
Trying to boot from BOOTROM
Find FIT header 0x4803b000, size 969
Download 839680, total fit 840496
NOTICE:  BL31: v2.2(release):toradex_imx_5.4.70_2.3.0-g835a8f67b2
NOTICE:  BL31: Built : 00:00:00, Jan  1 1970


U-Boot 2020.04-5.3.0-devel+git.2e5818d88ee0 (Jan 01 1970 - 00:00:00 +0000)

CPU:   i.MX8MP[8] rev1.1 1600 MHz (running at 1200 MHz)
CPU:   Industrial temperature grade (-40C to 105C) at 52C
Reset cause: POR
DRAM:  4 GiB
MMC:   FSL_SDHC: 1, FSL_SDHC: 2
Loading Environment from MMC... OK
In:    serial
Out:   serial
Err:   serial
Model: Toradex Verdin iMX8M Plus Quad 4GB Wi-Fi / BT IT V1.0B, Serial# 06848946
Carrier: Toradex Verdin Development Board V1.1A, Serial# 10795103

 BuildInfo:
  - ATF 835a8f6
  - U-Boot 2020.04-5.3.0-devel+git.2e5818d88ee0

flash target is MMC:2
Net:   eth0: ethernet@30be0000, eth1: ethernet@30bf0000 [PRIME]
Fastboot: Normal
Normal Boot
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc2(part 0) is current device
Scanning mmc 2:1...
Found U-Boot script /boot.scr
1182 bytes read in 9 ms (127.9 KiB/s)
## Executing script at 46000000
4727 bytes read in 12 ms (383.8 KiB/s)
85856 bytes read in 18 ms (4.5 MiB/s)
112 bytes read in 12 ms (8.8 KiB/s)
Applying Overlay: verdin-imx8mp_native-hdmi_overlay.dtbo
1860 bytes read in 20 ms (90.8 KiB/s)
Applying Overlay: verdin-imx8mp_lt8912_overlay.dtbo
2007 bytes read in 20 ms (97.7 KiB/s)
Applying Overlay: custom-kargs_overlay.dtbo
188 bytes read in 16 ms (10.7 KiB/s)
11629598 bytes read in 54 ms (205.4 MiB/s)
Uncompressed size: 28791296 = 0x1B75200
9170412 bytes read in 42 ms (208.2 MiB/s)
## Flattened Device Tree blob at 43000000
   Booting using the fdt blob at 0x43000000
   Loading Device Tree to 00000000fdbbc000, end 00000000fdbf3fff ... OK

Starting kernel ...

[    0.109748] 001: No BMan portals available!
[    0.110504] 001: No QMan portals available!
[    1.273601] 003: samsung-hdmi-phy 32fdff00.hdmiphy: failed to get phy apb clk: -517
[    1.273816] 003: imx8-pcie-phy 32f00000.pcie-phy: failed to get imx pcie phy clock
[    1.301028] 003: imx-lcdifv3 32fc6000.lcd-controller: No irq get
[    1.304096] 003: imx-hdmi-pavi 32fc4000.hdmi-pai-pvi: No pvi clock get
[    1.367559] 003: failed to register cpuidle driver
[    1.707156] 003: imx_sec_dsim_drv 32e60000.mipi_dsi: modalias failure on /soc@0/bus@32c00000/mipi_dsi@32e60000/port@1
[    1.707469] 003: dwhdmi-imx 32fd8000.hdmi: No pavi info found
[    1.828796] 003: imx_sec_dsim_drv 32e60000.mipi_dsi: modalias failure on /soc@0/bus@32c00000/mipi_dsi@32e60000/port@1
[    2.863991] 002: imx6q-pcie 33800000.pcie: failed to initialize host
[    2.863995] 002: imx6q-pcie 33800000.pcie: unable to add pcie port.
Starting version 244.5+
[    7.105606] 002: debugfs: Directory '30cb0000.aud2htx' with parent 'audio-hdmi' already present!
[    7.782931] 003: debugfs: Directory '30c10000.sai' with parent 'imx8mp-nau8822' already present!

TorizonCore with PREEMPT_RT 5.3.0-devel-202106+build.14 verdin-imx8mp-06848946 ttymxc2

I modified the kernel command-line arguments with TorizonCore Builder to get better latency (cpuidle.off=1, cpufreq.off=1): torizoncore-builder kernel set_custom_args "cpuidle.off=1" "cpufreq.off=1" (see section 4.4, Linux, of the EC-Master 3.1 documentation)

I can see that cpuidle and cpufreq are disabled, as they no longer appear in the listing when running ls /sys/devices/system/cpu/
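A minimal sketch of how this check could be scripted, assuming the arguments show up verbatim in /proc/cmdline. The helper name has_karg is made up for illustration and is not part of any Toradex tooling.

```shell
#!/bin/sh
# Return 0 if kernel argument $2 appears as a whole word in command line $1.
# (has_karg is a hypothetical helper name.)
has_karg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the command line of the running kernel (empty if unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null || true)
for arg in cpuidle.off=1 cpufreq.off=1; do
    if has_karg "$cmdline" "$arg"; then
        echo "$arg: present"
    else
        echo "$arg: missing"
    fi
done

# With both features disabled, these sysfs directories should be absent.
for dir in cpuidle cpufreq; do
    if [ -d "/sys/devices/system/cpu/$dir" ]; then
        echo "$dir: still exposed in sysfs"
    else
        echo "$dir: not present in sysfs"
    fi
done
```

Checking /proc/cmdline as well as sysfs helps tell "the argument never reached the kernel" apart from "the argument was passed but had no effect".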

The CPU-Temperatures are around ~50°C (cat /sys/class/thermal/thermal_zone0/temp).

Without the modified kernel arguments the container runs fine, but the maximum latency is close to the upper limit. I need a latency below 100us because of EtherCAT. With the modified image I get a maximum latency of ~70us, which would be fine (I could only run the rt-validation tests for around 30min because of the reboot problem).
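As a rough sketch, the 100us budget could be checked automatically against cyclictest summary output. The sample lines below are hypothetical and only mimic the shape of cyclictest's per-thread summary. The helper name max_latency_us is also made up for illustration.

```shell
#!/bin/sh
# Print the largest "Max:" value (in microseconds) found in cyclictest
# per-thread summary lines read from stdin.
max_latency_us() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "Max:" && $(i + 1) + 0 > max) max = $(i + 1) + 0
    } END { print max + 0 }'
}

# Hypothetical sample in the shape of cyclictest summary lines.
sample='T: 0 ( 1201) P:99 I:1000 C: 100000 Min:      6 Act:   11 Avg:   10 Max:      58
T: 1 ( 1202) P:99 I:1500 C:  66667 Min:      6 Act:   12 Avg:   11 Max:      70'

limit_us=100
max=$(printf '%s\n' "$sample" | max_latency_us)
if [ "$max" -lt "$limit_us" ]; then
    echo "max latency ${max}us is within the ${limit_us}us budget"
else
    echo "max latency ${max}us exceeds the ${limit_us}us budget"
fi
```

Because a 30min run may miss rare outliers, a figure obtained this way is only a lower bound on the true worst case.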

Do you have any idea why the system gets rebooted?

Best regards,
Fabian

henrique.tx · July 20, 2021, 9:07pm

Hi @wheeler!

Does the kernel print any message (just before rebooting) that could point to the rebooting reason?

Best regards.

wheeler · July 27, 2021, 12:13pm

Hi @henrique.tx,

Unfortunately there is no message before rebooting. I can reproduce the issue with a network speedtest, so it seems network related. You can use my speedtest-cli container: docker run -it wheeler1818/speedtest-cli. In about 50% of cases the session hangs (not the whole system) and the system reboots after a few minutes:

Retrieving speedtest.net configuration...
Testing from UPC Schweiz ...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by HEIG-VD (Yverdon-Les-Bains) [75.64 km]: 32.131 ms
Testing download speed........................................................>
U-Boot SPL 2020.04-5.3.0-devel+git.2e5818d88ee0 (Jan 01 1970 - 00:00:00 +0000)
DDRINFO: start DRAM init
DDRINFO: DRAM rate 4000MTS
DDRINFO:ddrphy calibration done
DDRINFO: ddrmix config done
Normal Boot
Trying to boot from BOOTROM
Find FIT header 0x4803b000, size 969
Download 839680, total fit 840496
NOTICE:  BL31: v2.2(release):toradex_imx_5.4.70_2.3.0-g835a8f67b2
NOTICE:  BL31: Built : 00:00:00, Jan  1 1970


U-Boot 2020.04-5.3.0-devel+git.2e5818d88ee0 (Jan 01 1970 - 00:00:00 +0000)

CPU:   i.MX8MP[8] rev1.1 1600 MHz (running at 1200 MHz)
CPU:   Industrial temperature grade (-40C to 105C) at 48C
Reset cause: POR
DRAM:  4 GiB
MMC:   FSL_SDHC: 1, FSL_SDHC: 2
Loading Environment from MMC... OK
In:    serial
Out:   serial
Err:   serial
Model: Toradex Verdin iMX8M Plus Quad 4GB Wi-Fi / BT IT V1.0B, Serial# 06848946
Carrier: Toradex Verdin Development Board V1.1A, Serial# 10795103

 BuildInfo:
  - ATF 835a8f6
  - U-Boot 2020.04-5.3.0-devel+git.2e5818d88ee0

flash target is MMC:2
Net:   eth0: ethernet@30be0000, eth1: ethernet@30bf0000 [PRIME]
Fastboot: Normal
Normal Boot
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc2(part 0) is current device
Scanning mmc 2:1...
Found U-Boot script /boot.scr
1182 bytes read in 8 ms (143.6 KiB/s)
## Executing script at 46000000
4762 bytes read in 13 ms (357.4 KiB/s)
85718 bytes read in 19 ms (4.3 MiB/s)
73 bytes read in 13 ms (4.9 KiB/s)
Applying Overlay: verdin-imx8mp_lt8912_overlay.dtbo
1879 bytes read in 20 ms (90.8 KiB/s)
Applying Overlay: custom-kargs_overlay.dtbo
212 bytes read in 16 ms (12.7 KiB/s)
11629598 bytes read in 50 ms (221.8 MiB/s)
Uncompressed size: 28791296 = 0x1B75200
9170412 bytes read in 44 ms (198.8 MiB/s)
## Flattened Device Tree blob at 43000000
   Booting using the fdt blob at 0x43000000
   Loading Device Tree to 00000000fdbbc000, end 00000000fdbf3fff ... OK

Starting kernel ...

[    0.086332] 002: No BMan portals available!
[    0.087076] 002: No QMan portals available!
[    1.250934] 001: imx8-pcie-phy 32f00000.pcie-phy: failed to get imx pcie phy clock
[    1.282401] 002: imx_sec_dsim_drv 32e60000.mipi_dsi: modalias failure on /soc@0/bus@32c00000/mipi_dsi@32e60000/port@1
[    1.282418] 002: imx_sec_dsim_drv 32e60000.mipi_dsi: Failed to attach bridge: 32e60000.mipi_dsi
[    1.282423] 002: imx_sec_dsim_drv 32e60000.mipi_dsi: failed to bind sec dsim bridge: -517
[    1.339197] 002: failed to register cpuidle driver
[    1.682478] 001: imx_sec_dsim_drv 32e60000.mipi_dsi: modalias failure on /soc@0/bus@32c00000/mipi_dsi@32e60000/port@1
[    1.773235] 001: cpufreq-dt cpufreq-dt: failed register driver: -19
[    2.840469] 002: imx6q-pcie 33800000.pcie: failed to initialize host
[    2.840473] 002: imx6q-pcie 33800000.pcie: unable to add pcie port.
Starting version 244.5+
[    9.954151] 001: nau8822 3-001a: Failed to issue reset: -6

Best regards,
Fabian

henrique.tx · August 10, 2021, 2:27pm

Hi @wheeler

I’ll try to reproduce your problem.

Until then, could you try to get more information about the problem?

One way to get it could be to increase the kernel loglevel:
https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
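For reference, a small sketch for reading the console loglevel currently in effect. The first field of /proc/sys/kernel/printk is the console loglevel. The kernel parameters loglevel= and ignore_loglevel documented at the link above are the boot-time way to raise it. The helper name console_loglevel is made up for illustration.

```shell
#!/bin/sh
# Print the console loglevel: the first field of the printk sysctl value $1.
# (console_loglevel is a hypothetical helper name.)
console_loglevel() {
    set -- $1
    echo "${1:-unknown}"
}

# Read the live value (empty if the file is unavailable).
printk=$(cat /proc/sys/kernel/printk 2>/dev/null || true)
echo "console loglevel: $(console_loglevel "$printk")"
```

Reading the value back after a reboot confirms that a changed kernel argument actually took effect before a long test run is started.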

Best regards.

wheeler · August 16, 2021, 4:03pm

Hi @henrique.tx ,

I increased the loglevel to 8. Here is everything I got from the serial console:
Serialoutput.txt (37.0 KB)

Best regards,
Fabian

henrique.tx · August 18, 2021, 6:36pm

Hi @wheeler!

I’m currently running some tests to try to reproduce your issue, so we can try to investigate better.

For now, I have two comments:

From your Serialoutput.txt we can see that there is no indication of a kernel failure.
When I pulled the stress-test for the first time after installing TorizonCore 5.3.0 with PreemptRT, I saw the same behavior: a reboot without any indication on the serial output. I solved it by using a more capable power source.

So, my question is: what power source are you using? Could you try again with a more capable power source?

Best regards,
Henrique

wheeler · August 19, 2021, 9:37am

Hi @henrique.tx ,

I’m using a 12V/2A power supply. I also tried a 24V/4A one. Unfortunately the behaviour is the same. When running the stress-test container, the board reboots after 15-60min. With the speedtest-cli container the reboot comes earlier, without any stress testing. Have you applied the kernel command-line parameters cpuidle.off=1 cpufreq.off=1? Could a watchdog be rebooting the system?
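One hedged way to collect a hint on this question from a captured serial log: U-Boot prints a Reset cause: line on each boot, and both boot logs in this thread show POR. The sketch below counts the distinct values in a capture file. Whether a watchdog reset would be reported as something other than POR depends on how the board wires its reset, so this is only a hint, not proof. The helper name reset_causes and the sample file are made up for illustration.

```shell
#!/bin/sh
# Print each distinct U-Boot "Reset cause:" value found in serial log $1,
# followed by how often it occurred. Carriage returns are stripped first,
# since serial captures often use CRLF line endings.
reset_causes() {
    tr -d '\r' < "$1" \
        | sed -n 's/^Reset cause:[[:space:]]*//p' \
        | sort | uniq -c \
        | awk '{ n = $1; $1 = ""; sub(/^ /, ""); print $0 ": " n }'
}

# Hypothetical capture file; the lines mimic the U-Boot output shown above.
log=$(mktemp)
printf 'Reset cause: POR\nDRAM:  4 GiB\nReset cause: POR\n' > "$log"
reset_causes "$log"
rm -f "$log"
```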

Best regards,
Fabian

henrique.tx · August 23, 2021, 8:14pm

Hi @wheeler !

I could run two tests for at least 12 hours.

For both tests I used:

Verdin IMX8M Plus Quad 4GB WB IT V1.0B
Verdin Development Board V1.1A
TorizonCore with PREEMPT_RT 5.3.0-devel-202106+build.14
rt-validation repository from Toradex’s Github account (this specific snapshot)
a lab power source (just in case)

Before running both the stress-tests and the rt-tests, I launched a script that logs the docker stats to a file. With this, I can analyze how long the test ran. Here is the script:

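#!/bin/sh
# file: docker-logger.sh
# Appends a timestamp and a docker stats snapshot to a log file every 2 minutes.
# how to launch:
# nohup sh docker-logger.sh &
log_file=/home/torizon/docker-stats.log
sleep_time=2m
while :
do
        # Epoch seconds mark when each snapshot was taken
        echo "date $(date +%s)" >> "$log_file"
        docker stats --no-stream >> "$log_file"
        sleep "$sleep_time"
done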

Test 1:
– Default kernel arguments (cpufreq and cpuidle left enabled)
– Result: executed for 43127 seconds (or 11h59)
Test 2:
– Added kernel arguments (using TorizonCore Builder): cpufreq.off=1 and cpuidle.off=1
– Result: executed for 43056 seconds (about 11h58)
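A sketch of how such run times could be derived from the log written by docker-logger.sh, assuming the "date <epoch>" lines are the only lines starting with the word date. The helper names log_span_seconds and fmt_hms are made up for illustration. Since the logger sleeps 2 minutes between samples, the computed span has roughly that resolution.

```shell
#!/bin/sh
# Print the span in seconds between the first and last "date <epoch>" lines
# of log file $1.
log_span_seconds() {
    awk '$1 == "date" && $2 ~ /^[0-9]+$/ { if (!first) first = $2; last = $2 }
         END { print last - first }' "$1"
}

# Format a number of seconds as HHhMMmSSs.
fmt_hms() {
    printf '%02dh%02dm%02ds\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

fmt_hms 43127   # Test 1: prints 11h58m47s
fmt_hms 43056   # Test 2: prints 11h57m36s
```

On the board, something like fmt_hms "$(log_span_seconds /home/torizon/docker-stats.log)" would print the span covered by the log.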

So, I couldn’t reproduce the rebooting error.

The fact that there is no message before rebooting indicates that there could be a hardware issue.

Could you reproduce this sudden rebooting behavior on other Verdin IMX8M Plus Quad modules?

Best regards
Henrique

wheeler · August 24, 2021, 11:21am

Hi @henrique.tx ,

Many thanks for the detailed clarification. I have only one Verdin IMX8M Plus Quad module. We have ordered more modules; they should arrive in 3 weeks. I will report back once I have tested with another module.

Best regards,
Fabian

henrique.tx · August 24, 2021, 3:12pm

Hi @wheeler!

Ok. Let me know when you have more results.

Best regards,

wheeler · August 30, 2021, 11:32am

Hi @henrique.tx ,

I received a replacement module. Unfortunately, I see the same behaviour. So I tried a Dahlia carrier board, and the test has now been running for more than 3h. So it seems to be a hardware issue with the carrier board, not the CPU module itself. I will get in touch with the RMA department to get a replacement.

Thanks for the support!

Best regards,
Fabian

henrique.tx · August 30, 2021, 12:16pm

Ok, @wheeler !

Hoping for good news

Best regards,

wheeler · September 14, 2021, 12:22pm

Hi @henrique.tx,

I tested it again with a brand new carrier board, CPU module and image. Unfortunately, I see the same behaviour.

Model: Toradex Verdin iMX8M Plus Quad 4GB Wi-Fi / BT IT V1.0D, Serial# 06965602
Carrier: Toradex Verdin Development Board V1.1A, Serial# 10807547
Image: torizon-core-docker-rt-verdin-imx8mp-Tezi_5.4.0-devel-202109+build.18: Download here

I have some more hints for reproducing:

The issue seems network related and only appears with the eqos (ethernet@30bf0000) driver. This is connector X25 on the Verdin Development Board.
The issue does not appear with the fec (ethernet@30be0000) driver. This is connector X35 on the Verdin Development Board. The issue also does not appear with a USB-to-Ethernet adapter.
The issue does not appear without network load (comment out lines 29-38 in rt-validation/stress-tests.sh at bullseye · toradex/rt-validation · GitHub).
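To see which kernel driver sits behind each network interface before running a test, a sketch along these lines could help. It reads the standard sysfs driver symlink. The helper name iface_driver and its optional second argument (an alternative sysfs root, used only for testing) are made up for illustration. Interfaces without a backing device, such as lo, are reported as unknown.

```shell
#!/bin/sh
# Print the kernel driver bound to network interface $1.
# $2 optionally overrides the sysfs root (hypothetical, for testing only).
iface_driver() {
    link="${2:-/sys}/class/net/$1/device/driver"
    [ -e "$link" ] || { echo "unknown"; return 0; }
    basename "$(readlink -f "$link")"
}

# List every interface on this system together with its driver.
for dev in /sys/class/net/*; do
    [ -e "$dev" ] || continue
    name=$(basename "$dev")
    echo "$name: $(iface_driver "$name")"
done
```

Recording this output with each test run makes it clear afterwards which Ethernet controller carried the load.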

To which network connector did you connect the network cable, when you did the tests?

Can you please rerun the test with the network cable connected to connector X25 on the Verdin Development Board? Please also verify that TCP port 5201 is open for iperf3.

I don’t know if these optimizations are necessary. The latency is not really better in the measurements you did. But in any case, I think the module should not reboot with these optimizations.

Best regards,
Fabian

wheeler · September 15, 2021, 9:53am

Hi @henrique.tx,

The link to the image was wrong. I replaced it with the official Toradex image. You have to apply the kernel command-line arguments yourself, or you can use the image from your previous tests.

Best regards,
Fabian

henrique.tx · September 15, 2021, 4:15pm

Hi @wheeler !

You raised intriguing points. I really didn’t pay attention to which network interface I connected.

I’ll rerun the tests using each of them. If I get the same error, maybe you found a possible improvement point.

"Please also verify that tcp port 5201 is open for iperf3."

I don’t understand why this specific port matters. Could you explain, please?

About the latency, I agree with you: since the worst-case delay was the same for these tests, it seems that under the load caused by the stress-test the response time isn’t improved by turning off cpufreq and cpuidle.

But keep in mind that this kind of stuff does not necessarily have a linear behavior. Under a different kind of processing load, maybe applying these modifications can provide better results.

Best regards,

wheeler · September 15, 2021, 5:36pm

Hello @henrique.tx,

The stress-test container uses iperf3 to generate network load. iperf3 uses TCP port 5201 for the connection to the server, so this port must be open, otherwise no load is generated (it was blocked in my company).

Yes, under stress the latency is not really better. I found that under idle or low load the latency is slightly better. Our application will probably not use more than 25% CPU load, so I expect the modification to improve the latency.

The stress-test conditions are very unrealistic for our application, and if the problem only occurred there it would not be serious.

But the problem also happens with no load other than network load, for example when pulling a large docker image or running a speedtest (speedtest-cli). It could even happen when downloading an OTA update. In this case I think the issue is serious.

Using the other network interface is also not an option for us, as we must use the fec adapter for EtherCAT.

I hope you can reproduce it.

Best regards,
Fabian

henrique.tx · October 11, 2021, 11:46am

Hi @wheeler !

Last week we ran several tests using both TorizonCore and the Reference Images to reproduce and (possibly) track down the issue.

We reproduced your issue and it has been escalated. Our next step is to investigate and try to find out where the issue comes from.

I’ll keep you updated with the findings.

Best regards,

wheeler · October 11, 2021, 1:40pm

Hi @henrique.tx ,
thanks for the feedback.

Best regards,
Fabian

henrique.tx · November 24, 2021, 9:54pm

Hi again, @wheeler !

We carried out several tests and found that the problem can no longer be reproduced on recent nightly builds of TorizonCore. More than one module has now been running for more than a day without the rebooting problem.

I’m attaching here a quite recent nightly build of TorizonCore 5.5.0 with Preempt RT (build 465 from 22/11/2021), which is the same build running on one of the modules currently under test: https://share.toradex.com/dt5srknygx174kj

Please run some tests on your side with it to see if the reboot issue keeps happening.

If you do reproduce the rebooting issue, we will need a more deterministic test setup, because so far we have not been able to reproduce it with this image.

Best regards,

wheeler · November 30, 2021, 9:07am

Hello @henrique.tx ,

I can’t reproduce this issue with the latest build of TorizonCore. I ran the rt-validation test for 12h.

Here are my results (core 3 was isolated; kernel arguments: cpuidle.off=1 cpufreq.off=1 isolcpus=3):
[Image: latency-plot]
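To confirm that the isolcpus=3 argument took effect, the kernel's list of isolated CPUs can be read from /sys/devices/system/cpu/isolated. The sketch below parses that list, which uses the usual cpulist notation such as 0-1,3. The helper name cpu_in_list is made up for illustration.

```shell
#!/bin/sh
# Return 0 if CPU number $2 is contained in kernel cpulist $1 (e.g. "0-1,3").
# (cpu_in_list is a hypothetical helper name.)
cpu_in_list() {
    old_ifs=$IFS; IFS=,
    for range in $1; do
        # A single CPU such as "3" yields lo = hi = 3.
        lo=${range%-*}; hi=${range#*-}
        if [ "$2" -ge "$lo" ] && [ "$2" -le "$hi" ]; then
            IFS=$old_ifs; return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Read the live list (empty if nothing is isolated or the file is missing).
isolated=$(cat /sys/devices/system/cpu/isolated 2>/dev/null || true)
if [ -n "$isolated" ] && cpu_in_list "$isolated" 3; then
    echo "CPU 3 is isolated (list: $isolated)"
else
    echo "CPU 3 is not isolated (list: ${isolated:-empty})"
fi
```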

Thank you for the support.

Best regards,
Fabian