MMC0 error on Apalis TK1 under heavy CPU/GPU load

When running a video analytics application under heavy load (CPU/GPU, CUDA),
we occasionally get an MMC0 error; the system hangs and can’t recover.
I monitored the 3.3V rail on the Ixora
and noticed voltage dips correlated with the MMC error (see the snapshot below).
I suspect the 3.3V converter circuit on the Ixora (maybe a sense resistor or a capacitor).
Any ideas how to solve it?
[Image: snapshot of the 3.3V rail on the Ixora]

A couple of weeks ago, another customer with an Apalis TK1 module on the Ixora carrier board reported that the system froze after several hours of heavy GPU load. Since the issue seemed quite strange at first, I investigated it extensively and found the source of the faulty behavior.

The 3.3V buck converter on the Ixora carrier board is designed to handle a continuous load of 6A. That is almost 20W for the computer module. The average power consumption of the complete system (Apalis TK1 and Ixora) was only around 12W during the GPU load tests, so we always thought this was well below the 20W maximum. The problem, however, is the module's short (hard to measure) current peaks.
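
To put numbers on that budget, here is a minimal Python sketch. It assumes, pessimistically, that the whole 12W average is drawn from the 3.3V rail; the function name is only illustrative.

```python
RAIL_VOLTAGE_V = 3.3    # nominal Ixora main rail (from the post)
CURRENT_LIMIT_A = 6.0   # continuous rating of the buck converter (from the post)
AVERAGE_POWER_W = 12.0  # measured system average during GPU load (from the post)


def rail_budget(voltage_v, limit_a, average_power_w):
    """Return (max_power_w, average_current_a, headroom_a) for the rail.

    Assumption: all of average_power_w flows through this one rail.
    """
    max_power_w = voltage_v * limit_a
    average_current_a = average_power_w / voltage_v
    headroom_a = limit_a - average_current_a
    return max_power_w, average_current_a, headroom_a


max_p, avg_i, headroom = rail_budget(RAIL_VOLTAGE_V, CURRENT_LIMIT_A, AVERAGE_POWER_W)
print(f"rail maximum:    {max_p:.1f} W")
print(f"average current: {avg_i:.2f} A")
print(f"headroom:        {headroom:.2f} A")
```

Under that assumption a peak only has to exceed the average by a bit more than 2A to hit the limit, even though the average looks comfortable.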

The problem is that the overcurrent protection of the TPS51120 buck converter on the Ixora board is actually a current limiter: as soon as the limit is reached, it holds the current at 6A by reducing the output voltage. The module has its own buck converter for the GPU and CPU rails. The GPU demands a certain current at around 1V, which means a certain power is required. If the input voltage of the on-module buck converter drops, it draws an even higher input current to deliver the same power. In response, the Ixora buck converter reduces the voltage further. This is a very unfortunate vicious cycle in which the 3.3V main voltage keeps dropping until it is too low and the system crashes. Once the system has crashed, the module's consumption drops and the output of the Ixora buck converter rises back to its regular level.
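
The feedback loop can be illustrated with a toy Python model. This is only a sketch, not a model of the real TPS51120 control loop: the load is treated as an ideal constant-power load, and `brownout_v` and `step_v` are made-up illustrative values.

```python
def simulate_rail(load_power_w, current_limit_a=6.0, nominal_v=3.3,
                  brownout_v=2.5, step_v=0.05, max_steps=1000):
    """Toy model: current-limited supply feeding a constant-power load.

    Returns the list of rail voltages visited. brownout_v (crash threshold)
    and step_v (voltage reduction per iteration) are assumptions.
    """
    voltage = nominal_v
    trace = [voltage]
    for _ in range(max_steps):
        demanded_a = load_power_w / voltage  # constant-power load: I = P / V
        if demanded_a <= current_limit_a:
            break  # the supply can deliver the current, the rail holds
        voltage -= step_v  # the limiter responds by lowering the output voltage
        trace.append(voltage)
        if voltage <= brownout_v:
            break  # rail too low, the module crashes
    return trace


print(len(simulate_rail(19.0)))  # demand stays below 6A: rail never moves
collapse = simulate_rail(20.5)   # peak just above 3.3V * 6A
print(f"{len(collapse) - 1} steps down to {collapse[-1]:.2f} V")
```

In this model, once the demanded power exceeds what the rail can supply at 6A, every voltage reduction raises the demanded current, so there is no stable operating point until the brownout threshold is reached.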

To wrap it up: everything works fine until a current peak reaches the 6A limit. As soon as that happens, it triggers the vicious cycle, which causes the TPS51120 to reduce the output voltage until the system crashes. The whole thing is very temperature dependent: the current consumption of the Apalis TK1 module increases with temperature, while the actual current limit of the TPS51120 decreases as the temperature rises.

Currently we can only offer these two workarounds. First, we found that the temperature dependency of the output current threshold of the TPS51120 is mainly caused by the solder joints of the current shunt resistor R9. By adding extra solder to R9, the temperature dependency can be dramatically reduced and the current limit increased by a couple of hundred milliamperes. In the case of the other customer, this increase of the threshold already resolved the issue.

The second workaround is to replace the shunt resistor R9 with a lower value (e.g. 10mΩ, 1%, 250mW, 1206). On the Ixora V1.1, R9 is currently 12mΩ. Instead of replacing R9, it is also possible to piggyback an additional resistor on top of R9. I suggest a value of around 24mΩ, since the resistance of the solder joints has to be taken into account.
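
As a rough estimate of what these shunt values do to the limit, here is a short Python sketch. It assumes the limiter trips at a fixed voltage across the shunt, so the limit scales with 1/R, and it takes 6A at 12mΩ as the reference point. Solder joint and trace resistance are ignored, so the real limits will be lower.

```python
def parallel(*resistances_mohm):
    """Equivalent resistance of resistors in parallel (milliohms)."""
    return 1.0 / sum(1.0 / r for r in resistances_mohm)


def scaled_limit(new_shunt_mohm, old_shunt_mohm=12.0, old_limit_a=6.0):
    """Estimated current limit for a new shunt value.

    Assumption: the trip voltage across the shunt is constant, so the
    limit is inversely proportional to the shunt resistance.
    """
    return old_limit_a * old_shunt_mohm / new_shunt_mohm


replaced = scaled_limit(10.0)          # R9 replaced with 10 mOhm
piggyback_mohm = parallel(12.0, 24.0)  # 24 mOhm soldered on top of R9
piggybacked = scaled_limit(piggyback_mohm)

print(f"10 mOhm shunt:       about {replaced:.1f} A")
print(f"12 || 24 = {piggyback_mohm:.1f} mOhm: about {piggybacked:.1f} A")
```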

Unfortunately, these are all the workarounds I can offer right now. Since we discovered this issue only very recently, we have not yet decided whether there will be an improved version of the Ixora or when it would be available.

Hi Peter,
Thanks for your detailed answer!

Just as a test, is it possible to simply short R9 with a 0 Ohm resistor?
(I don’t have a 10mOhm resistor on hand.)

I understand there is no current limit with a 0 ohm resistor!

Thanks again
Oded

Yes, shorting also works as a test. Please keep in mind that you need to use a thick wire for the short, otherwise your connection will have more than 10mΩ. If you do not have a thick wire available, a good alternative is a piece of desoldering braid soaked with solder to make the short.
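
To see why the wire thickness matters at the 10 milliohm scale, here is a small Python estimate using R = ρ·L/A for a round copper wire. The resistivity is the usual room-temperature value for copper; the wire lengths and diameters are only example values, and contact and solder joint resistance come on top.

```python
import math

COPPER_RESISTIVITY_OHM_M = 1.68e-8  # approximate value at 20 degrees C


def wire_resistance_mohm(length_mm, diameter_mm,
                         resistivity=COPPER_RESISTIVITY_OHM_M):
    """Resistance of a round solid wire in milliohms (R = rho * L / A)."""
    area_m2 = math.pi * (diameter_mm * 1e-3 / 2) ** 2
    return resistivity * (length_mm * 1e-3) / area_m2 * 1e3


# Example values only: 10 mm of thin vs. thick wire across the shunt pads.
for diameter_mm in (0.2, 1.0):
    r = wire_resistance_mohm(10.0, diameter_mm)
    print(f"10 mm of {diameter_mm} mm wire: {r:.2f} mOhm")
```

A thin 0.2mm strand already comes out at several milliohms over 10mm, which is in the same range as the shunt itself, while a 1mm wire is well below one milliohm.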

Thanks Peter, Marcel,

I hooked up a 5 mOhm resistor in parallel with the 12 mOhm one.
It works fine (no MMC0 errors)!
Will continue testing.

Thanks for your support

Oded

Thank you for your update.