ADC read errors on Verdin 1.1b

Hi all,
We have found a problem where reads of the TLA2024 ADC on the Verdin IMX8MM occasionally throw exceptions. We read the device very regularly, and fewer than 1 in 10000 reads fail. Originally I thought it was due to debugger detach/attach or some other testing-related issue; however, we have now seen this happen on devices in the field.
At the moment the thread reading the ADC is monitored and restarted if it exits due to an exception, so this is not a very big deal, but it does seem like something is wrong somewhere.

I have attached the full version of the code used to read the device. It is nothing special, just a thread and some surrounding functions to start/stop it and handle locks.
tla2024_adc.py (11.0 KB)

An example I caught of the read failing:


And another:

After these exceptions simply restarting the thread and re-attempting the read works successfully.

I am not explicitly accessing anything else on that I2C bus although I expect that Linux is talking to the RTC sometimes.

All suggestions are very welcome. Maybe there is some more detail in the exception that I should try to capture?
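
One way to capture more detail, sketched here under the assumption that the read is a plain file read and that the failure surfaces as an `OSError`: log the `errno` value, its symbolic name and the file involved before re-raising. The names `describe_os_error` and `read_raw` are hypothetical and are not taken from the attached script.

```python
import errno
import logging

log = logging.getLogger("adc")


def describe_os_error(exc):
    """Return a one-line summary of an OSError, including the symbolic errno."""
    name = errno.errorcode.get(exc.errno, "UNKNOWN")
    return "errno=%s (%s): %s, file=%s" % (exc.errno, name, exc.strerror, exc.filename)


def read_raw(path):
    """Read one integer sample from `path` (assumed: a text file holding a number)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError as exc:
        # Record the details, then let the caller decide what to do.
        log.error("ADC read failed: %s", describe_os_error(exc))
        raise
```

Logging the symbolic name (for example `ETIMEDOUT` versus `EIO`) makes it easier to tell whether the failures are always of the same kind.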

Thanks

Ed
@gauravks

Maybe I can tempt @matthias.tx to take a look at this.

Hello @edwaugh, sorry for the delay in the answer.
I’ll try to simulate the problem on my side and try to find out what’s the reason for this issue. I saw in your exception reports that there are two different error returns (I assume they come from the read call).
It would be interesting to know if the error codes always vary or if they are limited to these two. It could be that there is some bus contention from time to time, and I would guess this could turn out to be a timeout error (like the one you showed).
It would also be interesting to know if there is any error or log output in the kernel log (via dmesg) when the failure happens.
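
A minimal sketch of how the kernel log could be captured from Python at the moment the exception is caught. The helper name `kernel_log_matches` and the keywords are assumptions for illustration, and running `dmesg` may need elevated privileges, especially inside a container.

```python
import subprocess


def kernel_log_matches(keywords, cmd=("dmesg",)):
    """Run `cmd` and return the output lines containing any keyword (case-insensitive)."""
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    wanted = [k.lower() for k in keywords]
    return [
        line
        for line in result.stdout.splitlines()
        if any(k in line.lower() for k in wanted)
    ]
```

Calling `kernel_log_matches(["i2c", "timed out"])` from the exception handler and writing the result to the application log would keep the kernel messages next to the Python traceback.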

Regards,
Rafael Beims

Thanks Rafael, I will try to capture the dmesg log next time I see this error, but it is quite hard to catch. Is there any indication from the driver of the conditions under which it might trigger this error? I don’t think the bus should be very busy, and having to wait briefly for something else to use it would hopefully not cause this problem.

Hello Edwaugh,

Which BSP or Torizon version are you using?

Best Regards,

Matthias Gohlke

ID=torizon
NAME="TorizonCore"
VERSION="5.2.0-devel-202104+build.11 (dunfell)"
VERSION_ID=5.2.0-devel-202104-build.11
PRETTY_NAME="TorizonCore 5.2.0-devel-202104+build.11 (dunfell)"
BUILD_ID="11"
ANSI_COLOR="1;34"
VARIANT="Docker"
CLOUDBOX_OS_VER=1.0.1.3-2823cac

Hello @edwaugh, I tried to simulate the issue on my side but couldn’t replicate it. I made some quick changes to your script in order to be able to run it on my board directly (see attached). I added a sample counter so that I could see the number of samples already taken, as you mention that the problem probably happens less than once in 10000 samples.
I let the script run through the night and the counter reached upwards of 800000 samples and I did not see any issue.
This now brings some extra questions on my side:

  • Could you look at my modified script and check whether I did anything wrong that would prevent me from seeing the same issue as you? I’m especially concerned that maybe I didn’t properly understand how to use the monitor_status call.
  • Have you seen this problem on all the hardware you have or only in some units?
  • Do you have other stuff running on the board while the sampling is going on?

Best regards,
Rafael Beims

tla2024_adc.py (11.6 KB)

Hi @rafael.tx,
Thanks very much for taking a look. We do have a lot of other stuff going on at the same time, and there may even be rare moments when the CPU is at its limit.

  • Modifications look ok to me
  • I think on all units, but it happens so rarely that it is hard to be sure. I mistyped in my original post: I think the failure is less common than 1 in 100000 reads
  • No need to use monitor status; it just lets another thread monitor this one for failure

One thing we do have is the alias to a different folder. @gauravks is there any reason why this might get updated and clash with the ADC read?
I have added some logging now so I can start to track how often it happens.
Thanks
Ed

Any news from your tracking since you added the logging?

Regards,

Matthias

Hi @matthias.tx and @rafael.tx,

I was performing some testing using a high CPU load loop as part of the application and noticed that the ADC problem occurs more frequently. Is there any timeout on a file read from Python? Perhaps you could try loading the processor in your test?
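
On the timeout question: as far as I know, an ordinary Python file read has no timeout of its own and simply blocks until the kernel returns data or an error, so a timeout error would most likely originate below Python. For loading the processor in a test, a minimal sketch could look like the following; `burn` and `load_all_cores` are hypothetical helper names, not part of the attached script.

```python
import multiprocessing
import time


def burn(seconds):
    """Busy-loop for roughly `seconds` and return the number of iterations done."""
    deadline = time.monotonic() + seconds
    count = 0
    while time.monotonic() < deadline:
        count += 1
    return count


def load_all_cores(seconds):
    """Start one busy-loop process per CPU core and wait for all of them."""
    workers = [
        multiprocessing.Process(target=burn, args=(seconds,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Calling `load_all_cores(60.0)` under an `if __name__ == "__main__":` guard while the ADC sampling script runs alongside should show whether the failure rate rises with CPU load.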

Thanks

Ed

Hello Edwaugh,

We would recommend improving the error handling in your Python code by adding a counter in the exception handler, as is already done at the first exception.
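
A minimal sketch of what such a counter could look like, assuming the failed read raises an `OSError` and that retrying after a short pause is acceptable. The class `RetryingReader` and its parameters are hypothetical and are not taken from the attached script.

```python
import time


class RetryingReader:
    """Wrap a read function, retrying on OSError and counting the failures."""

    def __init__(self, read_fn, max_attempts=3, delay_s=0.01):
        self.read_fn = read_fn
        self.max_attempts = max_attempts
        self.delay_s = delay_s
        self.error_count = 0  # total failed attempts since start

    def read(self):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.read_fn()
            except OSError:
                self.error_count += 1
                if attempt == self.max_attempts:
                    raise  # give up and let the monitoring thread handle it
                time.sleep(self.delay_s)
```

Keeping `error_count` visible (for example in a periodic log line) means the retries do not hide how often the underlying failure occurs.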

Best Regards,

Matthias

Sure thing, although this does just seem to be masking an underlying problem somewhere, and I do think it is worth understanding it properly. Our devices need to run unattended for very long periods, so ideally there wouldn’t be any race conditions like this (if that’s what it is).