Verdin AM62 - power off due to thermal protection at 95° junction temperature

Dear all,

We’ve been running thermal tests on our product, which uses the Verdin AM62 Solo 512MB WB IT V1.2A and runs a Yocto image built with BSP version 7.5.0.

We’ve monitored the behavior of the SoM at high temperatures, logging the thermal zones that report the junction temperature. As per the documentation, an automatic power cut-off shall occur at 105 degrees.

The values have been logged by reading the following data continuously during the test:

cat /sys/class/thermal/thermal_zone1/temp

cat /sys/class/thermal/thermal_zone2/temp
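
For anyone reproducing this, the continuous read above could be wrapped in a small loop like the one below. It is only a sketch, not the script used in the test: `THERMAL_ROOT`, `log_thermal_once` and `log_thermal_loop` are hypothetical names, and the one-second default interval is an assumption. Each sample is flushed with `sync` so that the last readings have a better chance of surviving an abrupt thermal power-off.

```shell
#!/bin/sh
# Sketch: timestamped logging of all thermal zones (names are illustrative).
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Print one line: "<epoch seconds> <zone>=<millidegrees C> ..."
log_thermal_once() {
    line="$(date +%s)"
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        # Skip zones without a readable temp file (or an unmatched glob).
        [ -r "$zone/temp" ] || continue
        line="$line $(basename "$zone")=$(cat "$zone/temp")"
    done
    printf '%s\n' "$line"
}

# Append a sample to file $1 every $2 seconds (default 1) and flush it,
# so the most recent samples are on storage if power is cut.
log_thermal_loop() {
    logfile="$1"
    interval="${2:-1}"
    while :; do
        log_thermal_once >> "$logfile"
        sync
        sleep "$interval"
    done
}
```

On the target this could be started with, for example, `log_thermal_loop /home/root/thermal.log 1 &` (the log path is just an example).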

We verified that the trip points for both zones are set at 95° for passive and 105° for critical:

root@verdin-am62:/sys/class/thermal/thermal_zone1# cat trip_point_0_hyst
2000
root@verdin-am62:/sys/class/thermal/thermal_zone1# cat trip_point_0_temp
95000
root@verdin-am62:/sys/class/thermal/thermal_zone1# cat trip_point_0_type
passive
root@verdin-am62:/sys/class/thermal/thermal_zone1# cat trip_point_1_hyst
2000
root@verdin-am62:/sys/class/thermal/thermal_zone1# cat trip_point_1_temp
105000
root@verdin-am62:/sys/class/thermal/thermal_zone1# cat trip_point_1_type
critical

root@verdin-am62:/sys/class/thermal/thermal_zone2# cat trip_point_0_hyst
2000
root@verdin-am62:/sys/class/thermal/thermal_zone2# cat trip_point_0_temp
95000
root@verdin-am62:/sys/class/thermal/thermal_zone2# cat trip_point_0_type
passive
root@verdin-am62:/sys/class/thermal/thermal_zone2# cat trip_point_1_hyst
2000
root@verdin-am62:/sys/class/thermal/thermal_zone2# cat trip_point_1_temp
105000
root@verdin-am62:/sys/class/thermal/thermal_zone2# cat trip_point_1_type
critical
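
The same check can be done for every zone in one pass. The following is a minimal sketch, assuming the standard `trip_point_<N>_{type,temp,hyst}` sysfs layout shown above; `THERMAL_ROOT` and `dump_trip_points` are hypothetical names. It converts the millidegree values to whole degrees Celsius.

```shell
#!/bin/sh
# Sketch: list all trip points of all thermal zones in degrees Celsius.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Print "<zone> trip<N> <type> <temp C> <hyst C>" for every trip point.
dump_trip_points() {
    for zone in "$THERMAL_ROOT"/thermal_zone*; do
        for type_file in "$zone"/trip_point_*_type; do
            [ -r "$type_file" ] || continue
            prefix="${type_file%_type}"          # .../trip_point_<N>
            idx="${prefix##*trip_point_}"        # <N>
            temp=$(( $(cat "${prefix}_temp") / 1000 ))
            hyst=$(( $(cat "${prefix}_hyst") / 1000 ))
            printf '%s trip%s %s %s %s\n' \
                "$(basename "$zone")" "$idx" "$(cat "$type_file")" "$temp" "$hyst"
        done
    done
}
```

With the values listed above, `dump_trip_points` would print `thermal_zone1 trip0 passive 95 2` and `thermal_zone1 trip1 critical 105 2`, followed by the same two lines for `thermal_zone2`.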

We ran a quick test by artificially heating the SoM very quickly up to a junction temperature of 105°, and the thermal power-off protection occurred instantly, as expected:

However, during a long run in a thermal chamber set at 55°, we observed that the junction temperature reached 95° and held that value for a long time (around one hour), and then the thermal protection occurred:

We are trying to understand why the thermal cutoff occurred even though the board was able to “contain” the junction temperature thanks to the governor, which takes effect when the first “passive” threshold of 95° is reached.

It was a “test to fail”, so we just wanted to reach the limit of the hardware, but we would still like an explanation for this shutdown.

Do you have any clue about this behavior?

Thanks,

Andrea

Hi, I’m the Toradex AI Assistant.

I might make mistakes or occasionally get you wrong. Our support team will step in if needed.


Based on your thermal testing observations, the behavior you’re experiencing is consistent with the thermal management design of the Verdin AM62 Solo 512MB(1).

Thermal Specifications and Behavior

According to the Verdin AM62 datasheet, the SoC junction temperature maximum is +95°C for standard temperature variants, with a thermal resistance junction-to-ambient of 22.3°C/W(1). For the IT (Industrial Temperature) variant you’re using, the maximum SoC junction temperature is +105°C(2).
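
As a rough cross-check of these figures, the steady-state relation T_j = T_a + θ_JA × P can be rearranged to estimate the dissipated power. This is only a sketch: it assumes the 22.3°C/W value quoted above applies to the setup under test, which is unlikely to hold exactly inside a product enclosure, and `dissipated_power` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: estimate dissipated power (W) from ambient and junction temperature.
# Usage: dissipated_power <ambient C> <junction C> [theta_ja C/W]
# The default theta_ja of 22.3 is the figure quoted in the text above.
dissipated_power() {
    awk -v ta="$1" -v tj="$2" -v theta="${3:-22.3}" \
        'BEGIN { printf "%.2f\n", (tj - ta) / theta }'
}
```

Under that assumption, `dissipated_power 55 95` prints `1.79` and `dissipated_power 55 105` prints `2.24`, so at 55°C ambient roughly 0.45 W of additional dissipation would separate the throttled state from the critical trip point.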

Understanding the Thermal Protection Mechanism

The thermal protection system operates on two levels(2):

  • Passive trip point (95°C): Initiates thermal throttling through Dynamic Voltage and Frequency Scaling (DVFS)
  • Critical trip point (105°C): Forces system shutdown to prevent permanent damage

Why Shutdown Occurred at 95°C

Your observation of shutdown at 95°C during prolonged exposure, despite the critical trip point being set at 105°C, can be explained by several factors:

1. Thermal Throttling Effectiveness

When the passive trip point is reached, the system implements thermal throttling to reduce heat generation(2). However, if the ambient temperature remains high (55°C in your case) and the thermal load persists, the throttling mechanism may not be sufficient to maintain the temperature below the critical threshold indefinitely.

2. Cumulative Heat Effects

During prolonged exposure to high temperatures, even with throttling active, the SoC may experience gradual temperature increases due to:

  • Reduced cooling efficiency at sustained high ambient temperatures
  • Thermal resistance characteristics of the module and carrier board design
  • Continued heat generation from essential system processes that cannot be throttled

3. Temperature Sensor Accuracy and Hysteresis

The thermal zones have a hysteresis of 2000, i.e. 2°C, since the sysfs values are in millidegrees Celsius, as shown in your readings(2). This means there can be some variation in the exact temperature at which protection mechanisms trigger.

Recommendations

For applications requiring sustained operation at high ambient temperatures:

  1. Implement Active Cooling: Consider adding heatsinks or active cooling solutions(2)
  2. Monitor Temperature Proactively: Use the thermal zone readings to implement application-level thermal management before reaching critical thresholds
  3. Optimize Power Consumption: Review your application’s CPU and GPU usage to minimize heat generation(3)
  4. Consider Carrier Board Design: Ensure proper thermal design of your carrier board with adequate copper planes and thermal vias

The behavior you observed is part of the safety mechanism designed to protect the SoC from permanent damage. The system prioritizes hardware protection over continuous operation when thermal limits are approached.

Best regards, Toradex AI Assistant.



Hello @dm-andreapedica,

Thanks for sending the graphs.

In your second test, where the ambient temperature of the SoM was 55°, at what junction temperature did the SoC shut down? I am asking because this is not visible on that graph. If the SoC die temperature reached 105° before it shut down, then it is the expected behavior. Could you please confirm if this is what happened?

One way to avoid this fast shutdown (although you said it happened after about 1 hour) would be to lower the passive trip point temperature. This is the temperature at which Linux starts to throttle the CPU down. If your application can run at a lower CPU frequency, then I would reduce the passive trip point. The passive and critical trip points are defined in the device tree file k3-am62-thermal.dtsi. By default, the passive and critical trip points are 95° and 105°. Would this be a solution for you? Or is your question more out of curiosity?
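
For a quick experiment before touching the device tree, the passive trip point can sometimes be changed at runtime through sysfs. Whether the file is writable depends on the kernel configuration (for example an option such as CONFIG_THERMAL_WRITABLE_TRIPS), which has not been verified here for BSP 7.5.0, and the change does not persist across reboots. The sketch below simply tries the write and reports failure; `THERMAL_ROOT` and `set_passive_trip` are hypothetical names.

```shell
#!/bin/sh
# Sketch: try to lower the passive trip point of one zone at runtime.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Usage: set_passive_trip <zone name> <millidegrees C>
set_passive_trip() {
    zone_dir="$THERMAL_ROOT/$1"
    for type_file in "$zone_dir"/trip_point_*_type; do
        [ -r "$type_file" ] || continue
        [ "$(cat "$type_file")" = "passive" ] || continue
        temp_file="${type_file%_type}_temp"
        # Attempt the write; it fails if the kernel exposes the trip read-only.
        if { echo "$2" > "$temp_file"; } 2>/dev/null; then
            return 0
        fi
        echo "cannot write $temp_file; change the device tree instead" >&2
        return 1
    done
    echo "no passive trip point found in $zone_dir" >&2
    return 1
}
```

For example, `set_passive_trip thermal_zone1 85000` would request throttling from 85°C on that zone; 85000 is an arbitrary illustration, not a recommended value.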

Dear @rudhi.tx ,

thanks for your answer.

To answer your question:

In your second test, where the ambient temperature of the SoM was 55°, at what junction temperature did the SoC shut down? I am asking because this is not visible on that graph

That’s exactly the question we raise: the system is configured with the standard values, with the first limit at 95° (CPU frequency lowered by the governor) and 105° for immediate shutdown. We expected (like you) to see a slow rise up to 105°, where the power cut should happen. Instead, it seems the protection was triggered by a “spike” that we were not able to log (because it happens immediately and the software cannot log it that quickly, I assume).
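
One way to look for evidence of such an unlogged event is the kernel log of the previous boot. This is a sketch under two assumptions: the journal is stored persistently on the image, and the kernel message reached storage before power was cut, which is not guaranteed. The exact wording of the kernel message varies between kernel versions, so the pattern is deliberately broad; `filter_thermal_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep only kernel log lines that mention thermal events.
# Reads log lines on stdin; returns success even when nothing matches.
filter_thermal_events() {
    grep -i -E 'thermal|critical temperature' || true
}

# Usage on the target after the reboot (needs a persistent journal):
#   journalctl -k -b -1 --no-pager | filter_thermal_events
```

If a line about a critical temperature shows up there, the shutdown was requested by the Linux thermal framework; if nothing shows up, the log alone cannot tell whether the message was lost or the power-off came from elsewhere.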

We raised the question just to get technical confirmation of what we observed. We can certainly take action to stay out of that “risky” thermal area; we just wanted assurance that the system is behaving correctly and that the protection was triggered for an expected reason, not because of some “hidden” problem we did not handle.

Thanks,

Andrea

Hi @dm-andreapedica,

I did a study on the thermal behavior of the Verdin AM62 a while ago, and I tested the module in the climate chamber at exactly 55°C, running an early version of BSP 7 at the time. I kept the test going for about 5 hours to monitor the throttling behavior of the CPU. In this test, I did not see such a sudden spike in temperature or a shutdown. I saw the device shut down only when I stressed the CPU and the junction temperature reached 105°C. Therefore, my suggestion would be to check what in your application causes such a sudden spike. Perhaps you could also monitor the power consumption and memory usage on the SoM to get an idea? I would also check any logs from your application or related systemd services.
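
A possible starting point for that kind of monitoring is to sample temperature, CPU frequency, available memory and load together, so a spike can be correlated with what the system was doing. This is a sketch only: the variable names and `sample_system_state` are hypothetical, it assumes cpufreq is exposed for `cpu0` at the usual sysfs path, and it does not measure power consumption, which would need external instrumentation.

```shell
#!/bin/sh
# Sketch: one CSV sample of temperature, CPU frequency, memory and load.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"
CPUFREQ_FILE="${CPUFREQ_FILE:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
MEMINFO="${MEMINFO:-/proc/meminfo}"
LOADAVG="${LOADAVG:-/proc/loadavg}"

# Print "<epoch>,<zone1 temp mC>,<cpu0 freq kHz>,<MemAvailable kB>,<1 min load>".
# Fields that cannot be read are reported as NA.
sample_system_state() {
    ts="$(date +%s)"
    temp="$(cat "$THERMAL_ROOT/thermal_zone1/temp" 2>/dev/null)"
    freq="$(cat "$CPUFREQ_FILE" 2>/dev/null)"
    mem="$(awk '/^MemAvailable:/ { print $2 }' "$MEMINFO" 2>/dev/null)"
    load="$(cut -d' ' -f1 "$LOADAVG" 2>/dev/null)"
    printf '%s,%s,%s,%s,%s\n' \
        "$ts" "${temp:-NA}" "${freq:-NA}" "${mem:-NA}" "${load:-NA}"
}
```

Calling `sample_system_state` once per second and appending the output to a file gives a trace in which a drop in CPU frequency marks the moment throttling starts.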

Perhaps it would also make sense for you to run the same test on the Toradex reference BSP 7.5.0 (and ideally on a Toradex carrier board as well), which you can download here: Download Links | Toradex Developer Center
This will help you isolate the problem. Please let me know how it goes if you happen to do a test on the reference image.