Kernel hangs due to CAN driver

gauravks · December 15, 2021, 1:44pm

Hello all,

We are now facing a new issue with the CAN device. Our application hangs and the system reboots after some time. After some investigation, we suspect this is related to the CAN device. To identify the exact cause, we set up a test which performs the following:

Start the application in operational mode, which brings the CAN network device up
Perform CAN transfers for 5 minutes
Stop the application, which brings the CAN network device down
Repeat the above steps
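
The steps above could be scripted roughly as follows. This is only a sketch under assumptions that are not from our setup: the interface name can0, cangen from can-utils standing in for the real application, a bitrate that is already configured elsewhere, and the helper names run, can_cycle and can_stress, which are made up for illustration.

```shell
#!/bin/sh
# Sketch of an up / send / down stress cycle.
# Assumptions: cangen replaces the real application, and the CAN bitrate
# is already configured so that "ip link set ... up" succeeds.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# can_cycle IFACE SECONDS: one iteration of the test.
can_cycle() {
    iface=$1
    secs=$2
    run ip link set "$iface" up
    # timeout stops cangen after SECONDS; -g 1 asks for a 1 ms gap between frames.
    run timeout "$secs" cangen "$iface" -g 1 || true
    run ip link set "$iface" down
}

# can_stress IFACE SECONDS COUNT: repeat the cycle COUNT times.
can_stress() {
    i=0
    while [ "$i" -lt "$3" ]; do
        can_cycle "$1" "$2"
        i=$((i + 1))
    done
}
```

For example, can_stress can0 300 20 would run twenty 5-minute cycles, and DRY_RUN=1 can_stress can0 300 20 only prints the command sequence so it can be checked first.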

With the above test setup we were able to reproduce the issue and capture the kernel logs from the point where the kernel hangs and the system watchdog triggers a restart. A snapshot of the kernel logs is attached below.

As you can see, the MCP2517 driver throws an error which locks up the whole network manager, causing the system to hang and restart.

We have had a few discussions on a similar topic in the past, in the following thread. I am not sure whether they are related.

@Edward @jaski.tx Can you please let me know if you have seen similar issues with the CAN devices, and if there is some workaround or fix available which we should integrate?

FYI @edwaugh
Regards,
Gaurav

Edward · December 19, 2021, 6:54am

Hi @gauravks,

Did you try the patch from the thread you mentioned? The Toradex downstream kernel still uses the same unpatched driver with several known issues. Here’s a link to the message with the patch

You didn’t say anything about the SPI clock rate you are using. It should be as high as possible, limited to half of your MCP2517FD clock.
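
As a quick check of that rule, the bound can be computed as below. This is a minimal sketch that takes the "half clock" limit from this post at face value; the 40 MHz value in the example and the function names max_spi_hz and spi_clock_ok are assumptions for illustration, and the datasheet may impose a tighter limit.

```shell
# max_spi_hz SYSCLK_HZ: upper bound for the SPI clock, taken as half the controller clock.
max_spi_hz() {
    echo $(( $1 / 2 ))
}

# spi_clock_ok SPI_HZ SYSCLK_HZ: succeeds when SPI_HZ does not exceed that bound.
spi_clock_ok() {
    [ "$1" -le "$(max_spi_hz "$2")" ]
}
```

With a hypothetical 40 MHz controller clock, max_spi_hz 40000000 gives the value to compare against the SPI frequency configured for the device.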

I’m busy with other tasks. If you want me to reproduce your issue much later on an MCP2518FD (I don’t have an MCP2517FD), please patch the kernel first, then provide detailed instructions to reproduce your up-send-down test cycle, perhaps as a script. You may be sending messages at a different rate, and catching the same problem under other conditions may be impossible. When time permits I’ll try to test cyclic up/down on my side, but this won’t happen soon.

Regards
Edward

jaski.tx · January 4, 2022, 12:03pm

hi @gauravks

Could you provide the necessary information to reproduce this issue:

Software version
Kernel config
Any changes you have made to the software
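
One possible way to gather these details on the target is sketched below. The function name collect_repro_info and the output file names are hypothetical, and /proc/config.gz is only present when the kernel was built with CONFIG_IKCONFIG_PROC.

```shell
# collect_repro_info OUTDIR: gather the details listed above into OUTDIR.
collect_repro_info() {
    outdir=$1
    mkdir -p "$outdir"
    # Software version: kernel release plus distribution info when present.
    uname -a > "$outdir/uname.txt"
    if [ -r /etc/os-release ]; then
        cp /etc/os-release "$outdir/os-release.txt"
    fi
    # Kernel config: only exported when CONFIG_IKCONFIG_PROC is enabled.
    if [ -r /proc/config.gz ]; then
        gzip -dc /proc/config.gz > "$outdir/kernel.config"
    fi
    # Local changes cannot be detected automatically; create a file to fill in by hand.
    : > "$outdir/local-changes.txt"
}
```

Running collect_repro_info /tmp/can-report leaves a small directory that can be zipped and attached to the thread.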

On the Toradex side, @rafael.tx is working on this topic. He will be happy to reproduce the issue and help solve it.

In my case, I have never seen this issue before with kernel 5.4.

Best regards,
Jaski

Edward · January 5, 2022, 4:12pm

@gauravks , @edwaugh

I tried taking the CAN interface down and up during transfer. Yes, the TEFIF message sometimes appears, but only this message. After the following ifconfig can0 up, CAN RX/TX keeps going without the need to do anything else, such as restarting my CAN control application or rebooting.

The 5 minutes in the problem description can’t be critical. The hardware queue is very short due to the limited MCP25xxFD RAM, so the issue should be reproducible much faster. Could you provide detailed instructions on how to reproduce the issue? Perhaps a script?

I think cansend is too slow. Repeating it in a bash script leads to a 4 ms gap between messages. cangen seems better. Can you reproduce the issue with just this:

cangen can0 -g 0.2 -n 1000 ; ifconfig can0 down ; ifconfig can0 up

Please adjust the gap parameter (-g) to match your message transfer rate; it is given in milliseconds and can be fractional. You can TX messages with almost no gap using -g 0.01 -p 1. Also try increasing the -n parameter to produce longer bursts. With the above command I don’t even see the TEFIF message, perhaps because cangen doesn’t exit until the messages are sent. But you can launch cangen from the serial console and run ifconfig up/down from an SSH console. No crash on my side.
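
The two-console variant can also be approximated from a single script by running cangen in the background. This is a rough sketch, not what I ran: the function names count_tefif and toggle_under_load are made up, the 1 second delay is arbitrary, and it assumes the kernel message contains the literal string TEFIF.

```shell
# count_tefif: count lines mentioning TEFIF in a kernel log read from stdin.
count_tefif() {
    grep -c 'TEFIF' || true
}

# toggle_under_load IFACE PASSES: take IFACE down and up while cangen is sending.
toggle_under_load() {
    iface=$1
    passes=$2
    i=0
    while [ "$i" -lt "$passes" ]; do
        # Background traffic with the same gap as the command above.
        cangen "$iface" -g 0.2 &
        gen_pid=$!
        sleep 1
        ifconfig "$iface" down
        # cangen may already have stopped once the interface went down;
        # make sure it is gone before starting the next pass.
        kill "$gen_pid" 2>/dev/null || true
        wait "$gen_pid" 2>/dev/null || true
        ifconfig "$iface" up
        i=$((i + 1))
    done
    # Report how many TEFIF lines the kernel log holds afterwards.
    dmesg | count_tefif
}
```

Something like toggle_under_load can0 100 then prints the number of TEFIF lines seen after 100 down/up passes.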

Also, could you please tar or zip the *.c and *.h files from your linux-toradex/drivers/net/can/spi/mcp25xxfd after patching? Perhaps some part of the diff is missing for some reason?
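
For instance, something along these lines would pack just those files. It is a sketch with a made-up function name, pack_driver_sources; it assumes the archive path is absolute (the function changes directory) and that the driver directory contains at least one .c and one .h file.

```shell
# pack_driver_sources KERNEL_TREE ARCHIVE: archive the driver's *.c and *.h files.
# ARCHIVE should be an absolute path, because the work is done from inside KERNEL_TREE.
pack_driver_sources() {
    tree=$1
    archive=$2
    drv=drivers/net/can/spi/mcp25xxfd
    # The subshell keeps the cd local; archived paths stay relative to the tree root.
    ( cd "$tree" && tar -czf "$archive" "$drv"/*.c "$drv"/*.h )
}
```

Example: pack_driver_sources "$PWD/linux-toradex" /tmp/mcp25xxfd-src.tar.gz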

Edward