Verdin imx8mm does not reboot after thermal trip

edwaugh · June 23, 2022, 1:11pm

Hi all,

We have spotted a problem with our devices during the very hot weather where a thermal trip causes a permanent shutdown. I was expecting a reboot once the temperature cooled but that does not seem to happen. What is the expected behaviour here?

Also, is it possible to trigger the trip point in software so I don’t have to use my ‘practical setup’?

[Image: Screenshot 2022-06-23 141044]

Cheers

Ed

edwaugh · June 23, 2022, 3:22pm

After fully cooled I checked the board and can see that the 3.3 V supply is up but the 1.8 V from the SOM is not present.
@RoccoBr @drew.tx @jaski.tx

gclaudino.tx · June 23, 2022, 4:34pm

Dear @edwaugh, how are you?

Could you please share more information about the exact version of your module?

Also, what were the conditions of your test? At which temperature did you set the heater? Was the module inside a closed environment with a more evenly distributed temperature, or heated by a single-point source as your picture seems to show?

Best regards,

edwaugh · June 24, 2022, 8:39am

Hi @gclaudino.tx,
The module is a Verdin imx8mm v1.1a serial 06827566. I just warm the SOM until the CPU hits its temperature trip. We have also seen the same behaviour in a temperature chamber but I don’t have one in my office to use. The heat is quite evenly distributed and builds quite slowly.

I think the main question for me is: what is the expected behaviour? I saw some documentation that said it might simply shut down, in which case the behaviour is as expected and we just need to avoid that case. Do customers need to implement their own handling of over-temperature conditions?

Thanks

Ed

matthias.tx · June 24, 2022, 12:23pm

Hello Edwaugh,

I need a bit more information on this. In fact, the SOC temperature has to go below 85 °C to reboot again. That means the ambient temperature in your test might already be below 85 °C while the SOC is not.
Until then, the SOM should be stuck in U-Boot. Can you check the console logs during that time for me?

Best Regards,

Matthias

edwaugh · June 24, 2022, 1:04pm

Hi @matthias.tx

I can confirm the CPU temperature definitely goes under 85 C without the module rebooting. Which commands would you like me to run for the logs? I couldn’t see anything using dmesg. Any chance you could try this on a board you have?

Thanks

Ed

edwaugh · June 24, 2022, 1:42pm

Hi @matthias.tx,
I made a log of the serial port. As you can see, it trips, then I wait 10+ minutes until the CPU is cool, and then I power cycle. It boots OK, reporting a CPU temperature of 38 C, but it did not reboot by itself.

Ed
thermal trip log.txt (6.7 KB)

gclaudino.tx · June 27, 2022, 4:25pm

Dear @edwaugh,

Thanks for the log and the update, we’ll investigate it. Did you experience a similar behavior while using one of our carrier boards?

Best regards,

edwaugh · June 28, 2022, 10:22am

I haven’t tried it but I can see that 3.3 V and Vbackup are both up on our board.

gclaudino.tx · June 28, 2022, 1:45pm

Dear @edwaugh,

Thanks for the update. Could you please check it on one of our carrier boards to see if this behavior also happens there? This will help us narrow down where the issue occurs and tackle it better.

edwaugh · June 30, 2022, 7:05am

Hi @gclaudino.tx and @matthias.tx,

I tried the test with a Dahlia board and I see the same result. I didn’t have an adaptor for the serial port so could only watch over SSH on ethernet.

At 95 degC the CPU stopped as expected
The red reset light came on briefly and then went off
Only the +5V STB green LED remains on
I waited for 10 minutes until the CPU felt cold
The LED lights remained the same
The board was not accessible over SSH and did not ping
I then pressed the on/off button on the board and all the V supply LEDs came on and reset flashed briefly
The system then booted normally and reported a temperature of 38 degC

Could you please try to replicate on your side and confirm if this is the expected behaviour?

Thanks

Ed

henrique.tx · June 30, 2022, 1:44pm

Hi @edwaugh !

After this kind of shutdown, you actually need a hardware reset. In other words, there is no automatic reboot.

Best regards,

edwaugh · June 30, 2022, 2:05pm

Hi @henrique.tx,
Thanks for the info, that is very helpful. It does seem to contradict what @matthias.tx is saying above. Would the system hold at uboot until the temperature is lower as he describes? What is your recommended method for performing the reset? Ideally I would avoid adding another hardware watchdog as I believe there is already one on the SOM.
Thanks
Ed

edwaugh · July 4, 2022, 6:56am

Hi @henrique.tx and @matthias.tx can I get an update on this please?

henrique.tx · July 11, 2022, 8:38am

Hi @edwaugh!

Sorry for the delay. I was double-checking the information here.

Actually, maybe what @matthias.tx wrote was misunderstood. This is what he wrote:

He was not talking about the automatic reboot, but about the ability to even be able to boot. If the temperature exceeds the threshold, the module will not be able to turn on successfully.

If the information shared in this thread is not enough for your use case, I would like to ask you to explain your use case in as much detail as possible, so we can better support you on this.

If you don’t want to share further details publicly, we can proceed in a private thread here in the Community, or via email. Whichever you prefer.

Best regards,

rafael.tx · July 28, 2022, 2:35pm

Hello @edwaugh,
We analyzed the situation internally and here are some points:

The behavior that the Linux kernel implements is the safest one. The critical temperature point is defined as the temperature at which the processor is no longer reliable. Shutting down is the only way to guarantee that it will not keep getting hotter and possibly create additional problems.
The U-Boot behavior of waiting for the temperature to drop below the threshold before booting seems intended to keep the system from booting at all when the temperature is already too high.
I tested the behavior of DVFS on the Verdin iMX8MM, and in my environment it was very effective in lowering the temperature of the SOC. When the temperature gets higher, DVFS forces the CPU clock to 1.2 GHz regardless of the CPU load and also reduces the GPU clock. This should be enough to keep the temperature under control, especially if a heatsink is installed. It would be advisable to check under which overall conditions the temperature trip was triggered on your side and, if this is a realistic use case, to evaluate external cooling or the case design.

That said, it is possible to monitor the temperature of the SOC from userspace:

root@verdin-imx8mm-06827736:# cat /sys/class/thermal/thermal_zone0/temp  
40000

With this, you could create a script that monitors the temperature and triggers a reboot if it rises above, say, 93 degrees Celsius, to avoid tripping the critical temperature point.
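
A minimal sketch of such a monitoring script, in Python, might look like the following. The sysfs path and the 93 degree threshold come from this post; the polling interval, the function names, and the use of `systemctl reboot` (which assumes a systemd-based image and root privileges) are illustrative assumptions, not something Toradex ships.

```python
import subprocess
import time
from pathlib import Path

# Path from the post above; the zone index may differ on other boards.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
# Example threshold from the post, chosen to sit below the critical trip point.
REBOOT_THRESHOLD_C = 93.0
# Assumption: polling period picked for illustration only.
POLL_INTERVAL_S = 5.0


def parse_millidegrees(raw: str) -> float:
    """Convert the sysfs reading (millidegrees Celsius) to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_soc_temp_c(path: Path = TEMP_PATH) -> float:
    """Read the current SOC temperature in degrees Celsius."""
    return parse_millidegrees(path.read_text())


def should_reboot(temp_c: float, threshold_c: float = REBOOT_THRESHOLD_C) -> bool:
    """Return True once the temperature has reached the reboot threshold."""
    return temp_c >= threshold_c


def monitor() -> None:
    """Poll the temperature and request a reboot when it gets too high."""
    while True:
        if should_reboot(read_soc_temp_c()):
            # Assumption: systemd is present and this process runs as root.
            subprocess.run(["systemctl", "reboot"], check=False)
            return
        time.sleep(POLL_INTERVAL_S)
```

Calling `monitor()` from a small service started at boot would be one way to use it. The threshold leaves a margin below the critical trip so that the reboot is requested before the kernel shuts the system down.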

You will need to evaluate how well this solution works in your environment to check whether it is acceptable for your application, and also monitor closely how well the SOC cools down while waiting in U-Boot.

Best regards,
Rafael

edwaugh · July 29, 2022, 6:49am

Hi @rafael.tx,
Thanks so much for your time on this. Just fyi we discussed this issue with @drew.tx, @andi.tx and @michael.tx on a call last week. Hopefully I got their tags right I had to guess!

Shutdown might be safe for a device in an office, but for an embedded device that might be weeks or months away from being serviced it is not good behaviour
I set the governor to powersave at the start of the application anyway (1.2 GHz)
The application monitors the core temperature using the thermal zone you describe
We have added the behaviour you describe to sleep the system from the application when it gets too hot
I’ll respond on the “ADC read errors on Verdin 1.1b - #16 by rafael.tx” thread as well. In it you mention having trouble reaching over-temperature; that is because your ambient temperature is low. If your CPU temperature is 66 degC at an ambient of 20 degC, then at an ambient of 80 degC (the test for an industrial module) it will be at 126 degC.
We do not see any overheating in the loading test. I suspect the difference could be that my application both performs the reads and runs the CPU test on a single Python thread, as this is what we care about. I think that will be quite different to Linux managing multiple processes. Perhaps this gives us a clue that the race condition is inside Python somewhere.
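
The ambient extrapolation above (66 degC at 20 degC ambient giving 126 degC at 80 degC ambient) can be written out as a short Python sketch. It assumes the temperature rise above ambient stays constant at the same load and ignores any throttling; the function name is hypothetical.

```python
def estimate_soc_temp_c(measured_soc_c: float,
                        measured_ambient_c: float,
                        target_ambient_c: float) -> float:
    """Estimate SOC temperature at another ambient temperature.

    Assumption: the rise above ambient is constant for the same load.
    """
    rise_c = measured_soc_c - measured_ambient_c  # 66 - 20 = 46 degC
    return target_ambient_c + rise_c


print(estimate_soc_temp_c(66.0, 20.0, 80.0))  # prints 126.0
```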

Overall, my position is still that for an embedded device to shut down permanently for any reason is the wrong design. I can try to work around it in our application, but I really think it should be a reboot.

Thanks

Ed

rafael.tx · August 2, 2022, 7:02pm

Hi @edwaugh,

The default shutdown behavior is standard Linux kernel behavior and we (Toradex) simply don’t change it. If you think that the solutions you have at the moment don’t fit, you could patch the kernel to create the desired functionality.
You could also propose changes like this on the Linux kernel mailing list and try to get them approved by the maintainers, so that you don’t need to maintain the patch yourself in the future.

Regards, Rafael

edwaugh · August 4, 2022, 8:02am

Hi @rafael.tx,
Thanks, we can look into it. However, my view is that you are not selling me a standard Linux build; you are selling me one suitable for use in an embedded device, and as such it should work that way.
Seems like we have answered this question as much as we can in this thread so I will just follow up next time we have a call with the team.
Cheers
Ed

FabioEstevam · January 22, 2024, 6:12pm

Hi,

I know this thread is old, but just wanted to let you know that I have added support for triggering a reboot after the critical temperature is reached.

The patches have landed in 6.8-rc1:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.8-rc1&id=62e79e38b257a59f1e3d8aff801ae8590e2e45b4

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.8-rc1&id=79fa723ba84c2b1b3124c72df8a3b07b851a5477

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.8-rc1&id=5a0e241003b80247de59727c945bc94c848f893d

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.8-rc1&id=87f67d1747bc3ce8ace14be99b47d7731041ff03