CAN Rx overrun errors on Colibri iMX6DL module running Yocto

I’m working on a project that uses the CAN bus: a Colibri iMX6-based main board running Yocto communicates with 9 module boards, each of which is an STM32 MCU acting as a node.

When running the CAN bus at the maximum supported bitrate of 1Mbit/s, we’re seeing a lot of RX overruns on the main board: ip -s -d link show reports thousands after running for a minute or so, and the count keeps increasing steadily.
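To put a number on "steadily increasing", the counter can be sampled over time. The following is a minimal sketch, assuming the overrun column shown by ip -s corresponds to the standard Linux sysfs counter /sys/class/net/<iface>/statistics/rx_over_errors and that the interface is called can0 (as in the candump command quoted later in this thread). The function names are made up for illustration.

```python
import time
from pathlib import Path


def read_rx_overruns(iface: str = "can0", sysfs_root: str = "/sys/class/net") -> int:
    """Return the rx_over_errors counter of a network interface."""
    counter = Path(sysfs_root) / iface / "statistics" / "rx_over_errors"
    return int(counter.read_text().strip())


def overruns_per_second(iface: str = "can0", interval_s: float = 1.0,
                        sysfs_root: str = "/sys/class/net") -> float:
    """Sample the counter twice and return the overrun rate in events/s."""
    before = read_rx_overruns(iface, sysfs_root)
    time.sleep(interval_s)
    after = read_rx_overruns(iface, sysfs_root)
    return (after - before) / interval_s
```

Calling overruns_per_second("can0", 10.0) while changing one variable at a time (bitrate, bus load, CPU load) gives comparable figures between test runs.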

The other boards do not show any errors. However, the logs from the TShark monitor running on the main board show only about 1250 packets/s, which accounts for ~165Kbit/s at most.
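For reference, the bus load implied by a frame rate can be estimated as follows. This is a rough sketch that assumes classic CAN data frames and counts worst-case bit stuffing plus the 3-bit interframe space. The thread does not say whether standard or extended identifiers are used, nor the payload size, so 8 data bytes and 11-bit IDs are assumptions here.

```python
def worst_case_frame_bits(data_bytes: int, extended: bool = False) -> int:
    """Worst-case length of a classic CAN data frame on the wire, in bits."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN carries 0..8 data bytes")
    # Fixed overhead including the 3-bit interframe space.
    fixed = 67 if extended else 47
    # Only the bits from SOF up to the end of the CRC are subject to stuffing.
    stuffable = (54 if extended else 34) + 8 * data_bytes
    stuff_bits = (stuffable - 1) // 4
    return fixed + 8 * data_bytes + stuff_bits


def bus_load(frames_per_s: float, bitrate: float,
             data_bytes: int = 8, extended: bool = False) -> float:
    """Fraction of the bus bandwidth used by the given frame rate."""
    return frames_per_s * worst_case_frame_bits(data_bytes, extended) / bitrate
```

Under these assumptions, bus_load(1250, 1_000_000) comes out at roughly 17%, which is in the same range as the ~165Kbit/s estimate and far below saturation.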

When lowering the bus bitrate to 500Kbit/s the overruns markedly drop to <100 in the same time span, which is better but still not OK.

We’re using the flexCAN controller that is internal to the MCU. It is our understanding that the overruns indicate the inability of the kernel driver to get the packets from the controller in time, so it is unexpected to me that the overruns happen at a supported bitrate.

This is a general question for suggestions to diagnose/fix this issue.

After I posted my original question I found that we are in fact using the flexCAN controller instead of the MCP2515 over SPI as I originally reported. I’ve updated the question to reflect this.

Thank you for your answer. After I posted my original question I found that we are in fact already using the flexCAN controller instead of the MCP2515 over SPI. I’ve updated the question to reflect this. Do you have any suggestions about what could cause these issues when using flexCAN? Searching online, I found reports of driver issues for flexCAN in the past. But it seems to me that the drivers in the most recent commit of toradex_4.9-2.3.x-imx (from which we build) are relatively up to date.

@colibriuser123,

I’m not aware of any issues with FlexCAN that could cause the symptoms you mentioned.

Are you using a custom carrier board? Is your CAN bus correctly terminated? Lacking termination resistors can lead to such issues.

@gustavo.tx Thank you for your answer. Sorry for my late reply: we have also been working on other issues. Yes, we do run a custom carrier board based on the Colibri evaluation board. We confirmed a while ago that the CAN termination is indeed correct. We have been running more tests related to this issue. When a node generates an artificial load (~50%) on the bus, and “candump can0 > null” causes a CPU load of about 20% on our main board, the RX overrun errors start showing up on the main board even though it is running no other custom software. We now suspect driver issues. Can anyone confirm whether migrating to a newer driver could help? And could migrating to a newer driver be done using the same Yocto/Linux kernel version?

Ok! Thanks for the update.

Hi @colibriuser123,

Normally the driver is tied to the kernel version, which in turn is tied to a given BSP version. So, it’s hard to guarantee that it will be possible to use a newer driver with the same Yocto/kernel version.

I have a Colibri iMX6DL here with me, in which I’m also conducting tests with CAN.
I can try to reproduce your issue here with my setup, but I’ll only have results by beginning of next week, ok?

Judging by your Yocto version, you are using our BSP 2.8, which is still maintained, but is quite old.

I strongly advise you to move to the BSP 3.0.4, which is our latest LTS BSP.

Best regards,
André Curvello

Hello @andrecurvello.tx,

Thank you for spending your time to investigate this issue. For now we have found workarounds to make our application work, but we would indeed like to fix this in a better way. According to your own version tables I would assume you are correct and we are using BSP 2.8, but I cannot find a way to verify this. Is there an easy way to check the BSP version in bitbake or at image runtime?

Greetings @colibriuser123,

Which carrier board are you using?

We’ve seen issues with the MCP2515 SPI CAN controller at higher bitrates. It provides a small buffer for only two CAN messages, so the system may lose messages if the CAN is running at a high speed and three or more messages are transmitted over the bus over a short interval.

Since this is a limitation of this external controller, I would suggest using the internal CAN controller on the i.MX 6 SoC (FlexCAN). At least on the module side, you should be able to achieve higher speeds using FlexCAN. Here’s more info about using FlexCAN on the Colibri iMX6.

Hi @colibriuser123,

Yes, there is a way to check which BSP you are using.

Please paste here the output of the following files from your generated image:

  • /etc/issue
  • /etc/os-release

Regarding Yocto, you can check in our Versions Table and verify it by checking the following file in your Yocto repo:

  • layers/meta-yocto/meta-poky/conf/distro/poky.conf
  • Then search for DISTRO_CODENAME variable.

See here an example:

$ cat ./layers/meta-yocto/meta-poky/conf/distro/poky.conf | grep DISTRO_CODENAME
DISTRO_CODENAME = "dunfell"

dunfell is the Yocto release that our BSP 5 is based on.

Best regards,
André Curvello

The output for these is:

/etc/issue

SCAP (Sioux Customizable Application Platform) 2.2.0 \n \l

/etc/os-release

NAME="SCAP (Sioux Customizable Application Platform)"
VERSION="2.2.0 (Crazy Horse)"
VERSION_ID="2.2.0"
PRETTY_NAME="SCAP (Sioux Customizable Application Platform) 2.2.0 (Crazy Horse)"

poky.conf (grep’ed)

DISTRO_CODENAME = "rocko"

The first two files seem to be influenced by my company’s custom Yocto layer, which is called SCAP, as you can see. We use Yocto Rocko, so the table you linked seems to confirm that we use BSP 2.8.

Hi @colibriuser123,

You are using a custom image from your Yocto Build.

Based on your Distro, Rocko, you are using our BSP 2.8, according to our BSP Table.

Best regards,
André Curvello

Hi @colibriuser123,

  1. CAN Rx at 1Mbps doesn’t overrun on iMX6ULL and even on VF61/VF50. Perhaps it matters how you code it. Are you using select() to wait for messages? select() is much slower than a threaded app + blocking I/O.
  2. Perhaps CAN bit time error matters. In CAN at 1Mbps the bit time can’t be about 1us, it has to be strictly 1us. Don’t you get any errors and retransmissions? VF50 for 1Mbps can be clocked from IPG? clock, with no bit time error. But VF61 can’t: its CAN module clock needs to be switched to the 24MHz oscillator clock, else you get bit time error. You may check bit time correctness using an oscilloscope.
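
The bit time error mentioned in point 2 can also be checked on paper before reaching for the oscilloscope. The sketch below is only an illustration: it assumes the bit time is built from an integer prescaler and 8 to 25 time quanta per bit (a typical range for classic CAN controllers, not taken from the FlexCAN documentation). The 66.5MHz clock in the usage note is a made-up value, not the actual clock of any module named in this thread.

```python
def bit_timing_candidates(clock_hz: float, bitrate: float,
                          min_tq: int = 8, max_tq: int = 25,
                          max_prescaler: int = 256):
    """Return (error_percent, prescaler, tq_per_bit, actual_bitrate) tuples,
    sorted so that the smallest bitrate error comes first."""
    results = []
    for tq in range(min_tq, max_tq + 1):
        # Nearest integer prescaler for this number of time quanta per bit.
        prescaler = round(clock_hz / (bitrate * tq))
        prescaler = max(1, min(max_prescaler, prescaler))
        actual = clock_hz / (prescaler * tq)
        error_pct = abs(actual - bitrate) / bitrate * 100.0
        results.append((error_pct, prescaler, tq, actual))
    return sorted(results)
```

With a 24MHz clock, bit_timing_candidates(24_000_000, 1_000_000)[0] has zero error (for example 24 time quanta with prescaler 1), whereas a hypothetical 66.5MHz clock cannot hit 1Mbps exactly and leaves an error of somewhat under 1%.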

Edward

Thanks for the comments, @Edward.
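
For anyone wanting to try the "threaded app + blocking I/O" approach from point 1, here is a minimal sketch using a raw SocketCAN socket from Python’s standard socket module. The interface name can0 and the handle callback are assumptions for illustration. Whether the receive method in user space has any effect on the driver-level RX overruns discussed above is not established in this thread.

```python
import socket
import struct
import threading

CAN_FRAME_FMT = "=IB3x8s"  # can_id, data length, 3 padding bytes, 8 data bytes
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)
CAN_EFF_FLAG = 0x80000000  # set in can_id for 29-bit (extended) identifiers
CAN_EFF_MASK = 0x1FFFFFFF
CAN_SFF_MASK = 0x7FF


def parse_frame(raw: bytes):
    """Split a raw classic CAN frame into (identifier, payload)."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, raw)
    mask = CAN_EFF_MASK if can_id & CAN_EFF_FLAG else CAN_SFF_MASK
    return can_id & mask, data[:dlc]


def reader(iface, handle):
    """Block in recv() and pass every received frame to handle(id, payload)."""
    with socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as sock:
        sock.bind((iface,))
        while True:
            handle(*parse_frame(sock.recv(CAN_FRAME_SIZE)))


def start_reader(iface="can0", handle=print):
    """Run the blocking reader in a daemon thread and return the thread."""
    thread = threading.Thread(target=reader, args=(iface, handle), daemon=True)
    thread.start()
    return thread
```

The reader thread sleeps inside recv() until a frame arrives, so there is no select() call in the receive path.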