Kernel 4.14 FlexCAN loses messages

Hi,

4.14 kernels seem to have FlexCAN problems. Copying the flexcan.c driver from BSP 2.8b6 (iMX7D, 4.9.x? kernel) and recompiling the kernel and modules seems to fix the problem: CAN Rx messages are no longer lost. However, I fear there could be some FlexCAN driver incompatibility between 4.9 and 4.14. Is that so?

The issue never occurs when talking to CAN devices with slower CPUs. Talking to faster ones, which respond faster, leads to lost messages with no CAN errors detected. The issue is intermittent: there can be periods of a few seconds in which no Rx messages are lost. CPU usage is very low.
Could anyone familiar with the FlexCAN driver’s version history check which changes between 4.9 and 4.14 could lead to such problems?
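Since the messages disappear with no CAN errors reported, one thing worth ruling out is whether the kernel counts them as dropped. The sketch below snapshots the standard per-interface counters under /sys/class/net/<iface>/statistics and prints what changed between two snapshots. The interface name can0 is an assumption, and whether the FlexCAN driver increments any of these counters for this particular failure is not known from this thread.

```python
from pathlib import Path

# Standard netdev counters that can hint at frames lost inside the kernel.
COUNTERS = ("rx_packets", "rx_errors", "rx_dropped",
            "rx_over_errors", "rx_fifo_errors")


def read_counters(iface="can0", root="/sys/class/net"):
    """Return {counter: value} for the counters that exist for this interface."""
    stats_dir = Path(root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def counter_delta(before, after):
    """Difference between two snapshots, for counters present in both."""
    return {name: after[name] - before[name] for name in before if name in after}
```

Usage: call read_counters() before and after a run of request/reply traffic and pass both results to counter_delta(). A growing rx_dropped or rx_over_errors while rx_errors stays flat would point at the receive path rather than at the bus.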

Thanks
Edward

Hi,

All 3.0.x BSPs for iMX7D and iMX6ULL seem to have broken FlexCAN support. The symptoms are the same: a fast reply to a request CAN message sent from the Colibri is lost. Slower replies all seem to arrive fine. candump doesn’t show all messages either. BTW, VF61 and VF50 with the same good 2.8b6 BSP don’t lose messages either.
I tried to find which commit was at fault, but I’m lost. Older versions of flexcan.c usually don’t compile because other files need to be changed as well. Pulling whole kernel snapshots would take far too long. I could try checking one, two, perhaps a few more snapshots. But which ones?
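On the "which ones?" question, one common approach is to bisect: always test the snapshot halfway between the newest known-good and the oldest known-bad one. The sketch below only illustrates that selection order. The version list and the is_bad check are hypothetical stand-ins for building and testing a kernel by hand, and it assumes that once the problem appears it stays in all later snapshots. On a git tree, git bisect automates the same search.

```python
def first_bad(versions, is_bad):
    """Return (index of the first bad version, list of versions tested).

    Assumes versions[0] is known good, versions[-1] is known bad, and
    every version after the first bad one is bad as well.
    """
    lo, hi = 0, len(versions) - 1   # lo: known good, hi: known bad
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(versions[mid])
        if is_bad(versions[mid]):
            hi = mid                # the regression is at mid or earlier
        else:
            lo = mid                # the regression is after mid
    return hi, tested
```

With this order, roughly log2(N) builds are enough for N candidate snapshots, so about ten test builds cover a range of a thousand commits.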

Edward

Hi @Edward

Thanks for writing to Toradex Support.

Regarding your issue:

What is your application?
Have you made any changes to the software (kernel config, device tree, …)?
If yes, can you share these changes?

Did you use the BSP 3.0.4 release?

The issue never occurs when talking to CAN devices with slower CPUs. Talking to faster ones, which respond faster, leads to lost messages with no CAN errors detected.

Could you share some code for your application?
What is the delay of the faster response (a few µs, a few ms)?

Best regards,
Jaski

Hi @jaski.tx

The application is a communication server with an onboard state machine: USB, LAN, WiFi, CAN, RS232. The same machine, with higher or lower CPU power, is ported to VF50 / VF60 / iMX6ULL / iMX7D Colibri modules. All variants up to 2.8b6 worked well, even on VF50 and even with our fastest slave device. The 3.0.x BSPs seemed fine as well, but we simply hadn’t tested them with the faster slave device based on the TMS320F28377S. The rootfs images are all fine; the difference is in the corresponding kernels. Starting from the first one, released with 3.0b2, all of them lose messages from the 28377S: not all messages, but a lot of them.
You asked about the request-reply delay that is problematic for the new kernels. I haven’t measured it yet; I’ll try with a scope when time permits. It is the fastest slave device we use, so it should respond more quickly, and I assume the problem is that the responses are too fast.
First of all, I’d love to identify which change to flexcan.c led to our problem and how to properly roll back the faulty changes.

I could create some test programs, but do you have a TMS320F28377S board with CAN? Perhaps the test could be done using two Colibris; I’ll think about it, but I can’t promise yet.
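A minimal sketch of what such a test program could look like on the Colibri side, using raw SocketCAN from Python: it sends numbered requests and counts those that get no matching reply before a timeout. The CAN IDs, the interface name can0, the timeout, and the convention that the responder echoes a 4-byte sequence number are all assumptions made for illustration, not the real protocol. It also assumes standard 11-bit IDs and an otherwise quiet bus.

```python
import socket
import struct

CAN_FRAME_FMT = "=IB3x8s"   # struct can_frame: can_id, can_dlc, padding, data
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)


def build_frame(can_id, data):
    """Pack a classic CAN frame (up to 8 data bytes) for a raw CAN socket."""
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))


def parse_frame(raw):
    """Unpack a raw frame into (can_id, payload truncated to its DLC)."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, raw)
    return can_id, data[:dlc]


def count_lost_replies(sock, request_id, reply_id, rounds):
    """Send `rounds` requests; return how many got no matching reply in time."""
    lost = 0
    for seq in range(rounds):
        payload = struct.pack("<I", seq)          # hypothetical sequence number
        sock.send(build_frame(request_id, payload))
        try:
            while True:                           # skip unrelated frames
                can_id, data = parse_frame(sock.recv(CAN_FRAME_SIZE))
                if can_id == reply_id and data[:4] == payload:
                    break
        except socket.timeout:
            lost += 1
    return lost


def open_can(iface="can0", timeout_s=0.1):
    """Open a raw CAN socket with a receive timeout (Linux only)."""
    sock = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    sock.bind((iface,))
    sock.settimeout(timeout_s)
    return sock
```

Usage: count_lost_replies(open_can(), 0x200, 0x201, 10000), with a second node that answers each request as quickly as it can. Comparing the result between the 4.9 and 4.14 kernels would give a reproducible number instead of an impression.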

See the attached defconfig for the kernel from BSP 3.0.4 and the diff for the iMX7D dts.

Best regards,
Edward

Done. With the slower slave device it takes about 30 microseconds from the last recessive bit of the Colibri request to the first dominant bit of the slave response. With the faster slave this drops to 15 microseconds.
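To compare these two gaps independently of bus speed, they can be expressed in CAN bit times. The bitrate is not stated in this thread, so the values in the loop below are only assumed candidates.

```python
def gap_in_bit_times(gap_us, bitrate_bps):
    """Convert a gap measured in microseconds into bit times at a given bitrate."""
    bit_time_us = 1_000_000 / bitrate_bps
    return gap_us / bit_time_us


# Assumed candidate bitrates; replace with the one actually used on the bus.
for bitrate in (125_000, 250_000, 500_000, 1_000_000):
    slow = gap_in_bit_times(30, bitrate)
    fast = gap_in_bit_times(15, bitrate)
    print(f"{bitrate} bit/s: slow slave {slow} bit times, fast slave {fast} bit times")
```

At 1 Mbit/s one bit lasts 1 µs, so the two measurements correspond to 30 and 15 bit times; at lower bitrates the same gaps span proportionally fewer bits.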

Edward

Hi @Edward

Thanks for the information.

Starting from the first one, released with 3.0b2, all of them lose messages from the 28377S: not all messages, but a lot of them.

Okay.

First of all, I’d love to identify which change to flexcan.c led to our problem and how to properly roll back the faulty changes.

I think this won’t be easy, since the complete network stack changed between kernel 4.9 and kernel 4.14, and the changes may also not have been imported well into the NXP kernel branch. I would suggest testing with kernel 5.4 from kernel.org and checking whether you still see the issue.

I could create some test programs, but do you have a TMS320F28377S board with CAN? Perhaps the test could be done using two Colibris; I’ll think about it, but I can’t promise yet.

Creating test programs would be wonderful. We don’t have any board with a TMS320F28377S, but we do have one with an MCP2551T. Additionally, we have PEAK CAN devices.

See the attached defconfig for the kernel from BSP 3.0.4 and the diff for the iMX7D dts.

Thanks for the files. They are OK. Could you try disabling CONFIG_CAN_BCM=m and CONFIG_CAN_GW=m, if they are not needed, and check whether you still see the issue?
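To confirm that the two options are really off after rebuilding, the generated kernel config can be checked with a small script. This is only a sketch: the function name is made up for illustration, and the path to the config file depends on your build setup.

```python
def option_states(config_text, options):
    """Map each option to its value ('y', 'm', ...) or 'n' when unset or absent."""
    states = {name: "n" for name in options}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in states:
            states[name] = value
    return states
```

Usage: option_states(open(".config").read(), ["CONFIG_CAN_BCM", "CONFIG_CAN_GW"]) should return "n" for both once they are disabled.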

Thanks and best regards,
Jaski