We are seeing temporary CAN problems on our AM62 boards (Dual Core/Quad Core).
We are sending cyclic CAN messages every 20ms (500 kbit/s). The latency is usually within +/- 1ms. However, we regularly see latencies of 150ms, and every few hours we see disruptions of more than 1s. We use CODESYS with SocketCAN. After the pause, the missing frames are transmitted back to back (100-message buffer in SocketCAN). So they are not lost but simply not sent in time.
Questions:
Could this be a problem of the SocketCAN layer and its task priorities?
How would we change the priorities within the Linux kernel to get better results?
Might this be a hardware issue? See Errata #4: HAR-10889 – "A manufacturing issue might affect the V1.1B functionality".
This is a show stopper for starting series production. Any ideas what we could check?
Hi, I’m the Toradex AI Assistant. The Toradex Support team will follow up with you shortly!
Hello Hartmut,
I’m sorry to hear about the CAN communication issues you’re experiencing with your Verdin AM62 boards. Let’s explore some potential causes and solutions:
SocketCAN Layer and Task Priorities:
The latency might be related to task scheduling priorities in the Linux kernel. You can adjust these priorities using the chrt command to set real-time priorities for CAN-related tasks. This adjustment can help ensure that CAN messages are processed with higher priority, potentially reducing latency.
It’s also important to monitor CPU load to ensure other processes aren’t consuming excessive resources, which could impact CAN message handling.
Changing Priorities in the Linux Kernel:
Use the chrt command to set the scheduling policy and priority of processes. For example, you can assign a higher priority to the CAN interface process:
sudo chrt -f -p 99 <pid_of_can_process>
Ensure the CAN interface driver is configured correctly and that the system isn’t experiencing CPU throttling or other resource constraints.
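For illustration, a minimal sketch of how this could look (the process name codesyscontrol is an assumption and may differ on your image; a moderate real-time priority is used here as an example):
# Find the PID of the CODESYS runtime (process name is an assumption, adjust to your image)
pgrep -l codesyscontrol
# Assign a SCHED_FIFO real-time priority to that PID
sudo chrt -f -p 50 <pid_of_can_process>
# Verify the resulting policy and priority
chrt -p <pid_of_can_process>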
Potential Hardware Issue (Errata #4: HAR-10889):
The errata you mentioned might affect your module’s functionality. Verify if your hardware version is impacted and consult the errata documentation for any recommended workarounds.
Additional Troubleshooting Steps:
Check system logs (dmesg and /var/log/syslog) for any error messages related to CAN communication.
Test CAN communication under different conditions to identify patterns or specific triggers for latency.
Consider using a different CAN interface or hardware setup temporarily to rule out hardware-specific issues.
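As a starting point for such latency tests, and assuming the can-utils package is available on your image, the timing can also be observed directly on the SocketCAN side, independent of CODESYS:
# Print received frames on can0 with the time delta between consecutive frames
candump -td can0
# Show interface state, error counters and dropped frames
ip -details -statistics link show can0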
By addressing these areas, you should be able to identify and mitigate the causes of the CAN communication disruptions. I hope this helps you move forward with your production plans.
Best regards,
Toradex AI Assistant
You have tight deadlines for a non-real-time system to handle. Since you might have other tasks/processes running concurrently on the OS (maybe even a GUI?), all of this can disrupt the CAN deadlines.
CODESYS has support for Linux with the PREEMPT_RT patch. Using the patch (plus the correct configuration for your specific case) might help your application behave according to your requirements. More here: Partner Demo Container - Codesys | Toradex Developer Center
Also, still on this topic, you can explore CPU pinning and isolation approaches (which you can try with or without the PREEMPT_RT patch). I found this nice article online about it: On pinning and isolating CPU cores - Manuel Bernhardt
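As a rough sketch of the pinning/isolation idea (the exact core numbers and boot configuration depend on your module and image):
# Pin an already-running process to core 1 (taskset is part of util-linux)
sudo taskset -cp 1 <pid_of_can_process>
# Alternatively, reserve core 1 for explicitly pinned tasks only by adding kernel boot
# arguments (how to set them depends on your bootloader/image): isolcpus=1 nohz_full=1 rcu_nocbs=1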
Regarding this: although possible, as the errata itself states, it is rare. If you perform the test described in the errata, where only the CAN process is executed, and it still doesn't behave as expected, then we can start considering this possibility.
I know that the latency can jitter depending on the workload of other processes. We have a Node.js server with a Chromium browser as our GUI.
However, I would expect the jitter to be a few ms (less with the PREEMPT_RT patch). We see jitter of up to a few seconds! This is way out of proportion: our cycle time is 20ms, so this is more than a factor of 50. Can the Linux scheduler really stop a process for more than one second? What can we do to avoid this? Can we change the priority of the CAN driver or of the process using SocketCAN?
See our trace of the CANopen protocol running in CODESYS. We can see in the logs that CODESYS overruns the SocketCAN buffer when the CAN is stalled for more than a second, which is to be expected. This shows that the problem is in SocketCAN, not in CODESYS.
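For reference, the kernel-side transmit queue of the SocketCAN interface can also be enlarged. This does not remove the latency, but it gives more headroom before frames are dropped during a stall (a sketch assuming the interface is named can0):
sudo ip link set can0 txqueuelen 1000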
To be honest, I had the same initial gut feeling. But nothing can be ruled out without testing, properly isolating the underlying issue, and then testing again. Scheduling in a non-real-time system under load is quite complex. For example: if you shut down everything except the CAN process, does the issue still occur? Can you reproduce it on a reference image?
Also, which exact Verdin AM62 are you using? Which exact carrier board? And which exact OS? Please share all full names and versions.
If you want to deeply analyze what is going on, the previous messages here can be used as a guide.
If you would like some help with such a task, we can recommend some of our proven Toradex Partners with expertise in this specific field.
Actually we did a lot more testing.
If we simply run a CODESYS container with a cycle time of 20ms, it works as expected. We have 25% CPU load, and the maximum cycle time we measured is 45ms, which is acceptable without a real-time Linux patch.
If we run all containers, this results in a high load and then all bets are off: latency goes up to 20s, and sometimes the system crashes. In the Torizon FAQ I found a paragraph stating that if Docker is overloaded it might crash the whole system. This may be what we are seeing, as all measurements indicate that we don't have a hardware problem.
We will now try to pin containers to cores and see if that helps…
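For reference, a minimal sketch of what pinning containers to cores could look like with Docker (container and image names here are placeholders):
# Run the CODESYS container only on core 0
docker run -d --name codesys --cpuset-cpus="0" <codesys_image>
# Run the GUI/Node.js containers only on core 1
docker run -d --name gui --cpuset-cpus="1" <gui_image>
# An already-running container can also be re-pinned
docker update --cpuset-cpus="1" <container_name>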
We pinned CODESYS to core 0 and our other containers to core 1 on a dual-core AM62.
This works reasonably well with latencies up to 80ms.
However, every 30 minutes or 1 hour there is exactly one measurement that is way off, like 2-10s!
Then it runs with low latency for another 30 min or so.
I have a new suspicion but I can’t prove it yet: maybe this is the Torizon CMA allocator. CMA reserves a large chunk of memory (e.g. 128 MB out of 1 GB) for later use. The Linux memory allocator may still use that memory as normal. Once CMA needs more of it, it might have to migrate all the other pages elsewhere in memory. This could take long and might cause the latency issues we are facing.
Does anybody know more about such an issue? How much CMA memory is needed for Weston and Chromium to run? Maybe decreasing the total CMA size would help, but I haven’t checked that yet.
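For what it’s worth, the current CMA pool size and usage can be checked at runtime (a sketch; the cma= value, if present, is set on the kernel command line through the bootloader):
# Total and free CMA memory
grep -i cma /proc/meminfo
# Configured CMA size on the kernel command line (if set explicitly)
grep -o 'cma=[^ ]*' /proc/cmdline
# Look for CMA allocation or page-migration messages in the kernel log
dmesg | grep -i cma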
Actually, there is no “Torizon CMA allocator”: CMA is a standard Linux feature that Torizon OS simply configures.
Regarding CMA affecting the system’s latency: if you have not actually tested and validated this hypothesis, it is really a shot in the dark. A “default” Linux kernel (i.e. without PREEMPT_RT enabled) under load usually won’t guarantee any upper bound for latency.
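One way to test such a hypothesis is to measure the scheduling latency itself under your normal load, for example with cyclictest from the rt-tests package (assuming it is available or can be installed on your image):
# Run a high-priority timer thread on every core for 10 minutes and report min/avg/max latency
sudo cyclictest --mlockall --smp --priority=80 --interval=200 -D 10m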
Real time is a really complex topic. If you need help with it, the best approach is to get support from one of our experienced Toradex partners. Real-time applications and OS configuration for real time are currently outside the scope of standard Toradex support.
The following links were shared with me as a good resource to learn more about real-time performance and tuning (plus the blog post I shared in a previous message):