Thermal management of T30 (under WEC2013)

lerimini · October 21, 2016, 2:11pm

Hi community,
we are performing stress tests on our T30-based system and I would like some clarification on effects that look “strange” from our point of view. The test environment is: Iris carrier board + T30 (without external heat sink, on my desk, outside any case) + WEC2013 (2beta2), drawing 500mA@9V in idle mode.
Under these conditions, after about 3 minutes at 100% CPU load, the module reaches its maximum temperature (module 85 degrees, CPU 115) and resets itself (at the end we see 1.35A@8.8V). This behaviour occurs on three different T30 modules. The (very) strange thing is that we have an older T30 that behaves differently under the same conditions (module temperature is below 65 degrees after more than 10 minutes, CPU under 80 degrees).

It seems that DVFS is not working well in these conditions. Could you please explain the logic behind DVFS thermal management for the T30?

samuel.tx · October 24, 2016, 6:52am

Do both modules have the same revision number (e.g. V1.1S)?

lerimini · October 24, 2016, 7:02am

Yes, they both have revision number 1.1E. The “good” one has SN02790541, the others SN02819655 and later. The same behaviour is present with the 1.4 stable version.

samuel.tx · October 24, 2016, 7:52am

We have seen such behavior in the past as well. I assume that your issue is related to the production lots of the CPUs themselves: how quickly the temperature rises depends on the production lot of the NVidia Tegra 3 chips.

Besides differences in production, this issue is also caused by the fact that rising temperature increases power consumption, which in turn raises the chip temperature further. In other words: the higher the temperature, the faster it rises under high CPU load.
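
This feedback loop can be illustrated with a toy simulation. It is only a rough sketch, not a model of the real T30: the thermal resistance, heat capacity, dynamic power, and leakage parameters are all made-up assumptions, and only the 115 °C CPU reset temperature comes from this thread. It shows how two parts that differ only in leakage can behave very differently under the same load: one settles at a moderate temperature, the other never finds an equilibrium.

```python
import math

# All parameters below are illustrative assumptions, not measured T30 values.
AMBIENT_C = 25.0        # ambient temperature
R_TH_K_PER_W = 8.0      # thermal resistance to ambient, no heat sink
C_TH_J_PER_K = 20.0     # lumped heat capacity
P_DYNAMIC_W = 3.0       # load-dependent power, independent of temperature
LEAK_SCALE_K = 30.0     # leakage grows by a factor e every 30 K
RESET_LIMIT_C = 115.0   # CPU temperature at which the reset was seen in this thread


def simulate(leak_at_ambient_w, duration_s=3600, dt_s=1.0):
    """Return (final_temp_c, reset_time_s); reset_time_s is None if no reset."""
    temp_c = AMBIENT_C
    for step in range(int(duration_s / dt_s)):
        # Leakage power rises exponentially with temperature (assumed model).
        leakage_w = leak_at_ambient_w * math.exp((temp_c - AMBIENT_C) / LEAK_SCALE_K)
        heat_in_w = P_DYNAMIC_W + leakage_w
        heat_out_w = (temp_c - AMBIENT_C) / R_TH_K_PER_W
        # Explicit Euler step of the lumped thermal model.
        temp_c += (heat_in_w - heat_out_w) / C_TH_J_PER_K * dt_s
        if temp_c >= RESET_LIMIT_C:
            return temp_c, (step + 1) * dt_s
    return temp_c, None


low_leak = simulate(leak_at_ambient_w=0.2)   # hypothetical low-leakage lot
high_leak = simulate(leak_at_ambient_w=1.0)  # hypothetical high-leakage lot
print("low leakage: ", low_leak)
print("high leakage:", high_leak)
```

With these assumed numbers the low-leakage part levels off well below the reset temperature, while the high-leakage part keeps heating until it hits the limit, even though both see the same load and cooling.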

On Windows CE the T30 modules currently lack temperature throttling, which would help prevent such situations by lowering the maximum CPU frequency at high temperatures (see also the roadmap).

Possible workarounds in case you run into thermal issues:

Check the Thermal Management developer website.
Disable some cores: Set pex.cpus in the config block.
Reduce workload

lerimini · October 24, 2016, 8:26am

I see that the issue has not been scheduled yet; do you have any further information on that? Furthermore, as far as I understand, it is about “customization” of the throttling, so I suppose a “standard” throttling mechanism is available, isn’t it?
Looking at the values shown by the Colibri Monitor while the temperature increases, I can’t see any change in the CPU frequency.
By the way, we will add a cooling solution such as the ones you suggest in the “thermal management” article. Will they also be sufficient for intense CPU usage?

lerimini · October 24, 2016, 12:26pm

@samuel.tx, some further questions to help us make the right choice, since thermal throttling is very important for us to be confident that the system will not reset itself:

Will this feature be supported in the future? Could you please give us an approximate schedule?
I’ve had a look at the release roadmap. Is there a relationship with issue 7788?
What’s the relationship between DVFS and thermal throttling?
Is it possible to use FreqLib to change cpu freq at runtime for T30?

Thanks in advance

samuel.tx · October 25, 2016, 6:31am

@lerimini:

Yes, this feature will be supported in the future, but it is not planned yet. It will not be part of the next beta or the final 2.0 release. I will put a reference into the ticket noting that a customer needs this feature, so we can plan it for the 2.1 release at the beginning of next year (2017).
No, that is not related. 7788 is about the max temperature. The description of 7788 was wrong anyway. At runtime the max temperature on the modules was already high enough; we only had to increase it during boot time.
DVFS and thermal throttling have no direct relationship. DVFS is already implemented, but if your application demands full performance even when the CPU temperature is high, DVFS will not slow it down.
Yes, you can use FRQSetTEGFrequency with the TEGCPUClockID after disabling DFS using FRQSetTEGDFSState (see also the API description on the developer website).
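
A frequency cutoff built on these calls could follow a simple hysteresis rule. The sketch below shows only the decision logic, in Python for brevity; it does not call FreqLib, whose exact signatures should be taken from the developer website. The frequency steps and the two temperature thresholds are hypothetical values chosen for illustration.

```python
# Hypothetical frequency steps in MHz, highest first; not a list of valid T30 operating points.
FREQ_STEPS_MHZ = (1300, 1000, 600, 300)
THROTTLE_ABOVE_C = 95.0   # assumed: step down at or above this CPU temperature
RESTORE_BELOW_C = 80.0    # assumed: step up again at or below this CPU temperature


def next_frequency_mhz(cpu_temp_c, current_mhz):
    """Pick the next CPU frequency from the current temperature and frequency."""
    index = FREQ_STEPS_MHZ.index(current_mhz)
    if cpu_temp_c >= THROTTLE_ABOVE_C and index < len(FREQ_STEPS_MHZ) - 1:
        return FREQ_STEPS_MHZ[index + 1]   # too hot: one step slower
    if cpu_temp_c <= RESTORE_BELOW_C and index > 0:
        return FREQ_STEPS_MHZ[index - 1]   # cool again: one step faster
    return current_mhz                     # inside the hysteresis band: hold
```

A monitoring thread would call this periodically with the current CPU temperature and apply the result through FRQSetTEGFrequency. The gap between the two thresholds keeps the frequency from oscillating around a single trip point.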

lerimini · October 25, 2016, 7:16am

Thank you very much for the explanation. We look forward to seeing thermal throttling support added to your roadmap for 2.1. In the meantime, we will add a cooling solution and also a frequency cutoff as a last resort.

lerimini · October 25, 2016, 10:11am

Hi @samuel.tx, I am adding this comment with further information on the original problem. The thermal drift we saw in three different modules also occurs without stressing the CPU: it is sufficient to disable DFS and let them run at the maximum frequency (1300 MHz) while idle. This sounds strange. Could this be explained by the variability in Tegra 3 production you mentioned before?

lerimini · November 3, 2016, 2:37pm

Hi @samuel.tx, sorry to draw attention to this topic once again, but I need some clarification on the behaviour pointed out in my last comment. Is such a thermal drift possible even WITHOUT any CPU load (only disabling DFS)? If so, how could you get the results explained here? Is T30 thermal stability not possible with DFS disabled?

enafziger.procat · December 6, 2016, 9:51pm

@samuel.tx can you please elaborate on what the suggested solution is? I have some of these “bad” modules, not sure how many, I just became aware of this issue. I disabled DFS and set the CPU frequency to 300 and ran a CPU stress test (calculating primes using 4 threads). After a few minutes it rebooted. I’m not sure where to go from here.

enafziger.procat · December 6, 2016, 10:15pm

@lerimini did you have any luck with changing the CPU frequency?

samuel.tx · December 7, 2016, 7:37am

@lerimini: Sorry, I missed your reply from November. I am not sure whether this has been solved in the meantime. In any case: how did you run these tests? Was the module in one of our carrier boards with a standard image? Could some pins on your carrier board be driving against each other or floating? Can you reproduce the same behaviour with a standard image and a Toradex carrier board with these modules?

samuel.tx · December 7, 2016, 7:46am

@ed.nafziger: Did I get you right, you see this only on some modules? What is the max temperature on the modules that do not reboot? And what is the temperature on the modules that reboot? You can check the temperature with Colibri Monitor.

The same question I asked lerimini applies here: in what environment (carrier board) did you test this setup? Could there be floating pins on your carrier board that trigger interrupts? We have seen such situations occur on some modules and not on others, even though they are used on the same carrier board.

lerimini · December 7, 2016, 9:02am

Hi @samuel.tx, the problem has not been solved yet. The tests were performed on Iris Carrier Board V1.1A with WEC2013 1.4, 2beta2 and 2beta3 standard images. The behaviour is present in at least 9 modules that we purchased for evaluation, although with considerable differences in drift rate. The test has been repeated with at least 3 different Iris carrier boards (outside any case), and the GPIOs were all floating. As I wrote in my previous comment, the issue is very simple to reproduce: it is sufficient to disable DFS (setting the maximum frequency) and wait a few minutes (15-20) without any CPU load. The reset occurs when Tcpu in Colibri Monitor reaches 115°C (wasn’t the reset related to Tmodule?).
As far as I can tell from the markings printed on the “bad” modules, their NVidia cores come from different production lots.

lerimini · December 7, 2016, 10:02am

@ed.nafziger No, as you can see from my latest comment, it is impossible for us to play with the frequency, because that requires disabling DFS, and this leads to critical thermal behaviour.

enafziger.procat · December 7, 2016, 8:20pm

@samuel.tx Yes, only some of the modules have this issue. With DFS disabled and CPU set to 300, the max temp on the modules that do not reboot (from my latest test) is TModule 49C, Tcpu 58C, ambient ~75F. I can reproduce the issue using the Colibri Eval board V3.1 and also our custom base boards. In an exe that is launched from \FlashDisk\AutoRun I set all the unused pins to GPIO output. I went through the entire list in GpioConfig and made sure that all unused pins are set to GPIO output.

samuel.tx · December 9, 2016, 8:41am

@ed.nafziger: We will run some tests here to see if we can reproduce this issue. Just to make sure: do you also see this with our standard image, without reconfiguring any GPIO?

One note about the GPIOs you set to output: be careful with the multiplexed pins (there are many on the T30). If you have a function (e.g. UART TX, SDIO, ...) on one pin and drive against it with another pin that is multiplexed onto the same SODIMM pin, you may run into issues.

daniel.tx · December 9, 2016, 5:36pm

A table with all multiplexed pins is in the Colibri T30 datasheet in chapter 4.1 Function Multiplexing

enafziger.procat · December 9, 2016, 6:21pm

@samuel.tx yes, I can replicate the issue with the standard image in the eval board with no changes to GPIOs.
Sorry, my answer about the floating pins was too generic. Yes, you are correct about the multiplexed pins. For each pair of completely unused multiplexed pins, I set one of them to GPIO output and the other to GPIO input tristate. If one pin of a multiplexed pair is used, I set the unused one to GPIO input tristate.
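
The rule for multiplexed pairs described here can be written down as a small decision function. This is a sketch of the logic only, not of the Toradex GPIO API; the pin names and mode labels are hypothetical, and the case where both pins of a pair are in use is not covered by the description, so the sketch simply flags it for manual review.

```python
def plan_multiplexed_pair(pin_a, pin_b, used_pins):
    """Return a {pin: mode} plan for two pins multiplexed onto the same SODIMM pin."""
    a_used = pin_a in used_pins
    b_used = pin_b in used_pins
    if a_used and b_used:
        # Not covered by the rule: two active functions may drive against each other.
        return {pin_a: "review", pin_b: "review"}
    if a_used:
        return {pin_a: "keep", pin_b: "gpio_input_tristate"}
    if b_used:
        return {pin_a: "gpio_input_tristate", pin_b: "keep"}
    # Both unused: exactly one driver, so the shared pin is neither floating nor contended.
    return {pin_a: "gpio_output", pin_b: "gpio_input_tristate"}


# Hypothetical pin names, for illustration only.
print(plan_multiplexed_pair("PIN_X", "PIN_Y", used_pins={"PIN_X"}))
```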