MCP2517FD/MCP2518FD CAN controller driver errors on IMX8MM Verdin

edwaugh · March 3, 2021, 8:02pm

We have been seeing some errors from the CAN controller on the Verdin boards:

[  204.915985] mcp25xxfd spi2.0 can1: tefif: fifo 0 not pending - tef data: id: 00000000 flags: 00000000, ts: 00000000 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced

There is a long thread here including a guy who seems to be writing the/a driver for raspberry Pi:
https://github.com/Seeed-Studio/pi-hats/issues/7

It seems like a genuine problem with the version of the driver they are using and then he fixes it around last May. Is it possible for someone to confirm what driver version is in use in Torizon 5.1 and what testing has been done with the boards? The errors are worse when there is moderate bus pressure.

One possible cause is a silicon bug in the MCP2517FD. The Verdin datasheet says MCP2518FD so I think this is probably not the problem.

I am currently using the interface to talk to 2 Hydac CANOpen pressure sensors at 250 kbps. At low data rates of around 5 messages per second per device it seems to work fine. But some errors start to occur at 10 mpspd and worse at 20 mpspd. Any information on the expected performance or performance testing that has been done on this interface would be appreciated.

It would also be good if someone from Toradex could check if there is an updated driver available. Maybe we are just pulling an old one with kernel 5.4.

Thanks very much

Ed

edwaugh · March 5, 2021, 2:50pm

@spasoye I noticed you are doing similar work. Have you seen this problem?

spasoye · March 5, 2021, 3:13pm

@edwaugh hey, Im currently testing Codesys CANopen interface and started getting this error now and then, I can’t find out why this happens. Im using 1 000 000 bitreate.

edwaugh · March 5, 2021, 3:20pm

@spasoye I am just at 250kbps but find it gets worse with more bus traffic. Maybe you will find the same. I think it is possibly a driver issue. Maybe there is a way we can try the latest from the 5.9 kernel. @matthias.tx did mention in an email that he might know a solution. I will make sure to post here if he does.

spasoye · March 5, 2021, 3:23pm

I think Im using 5.4 kernel

edwaugh · March 4, 2021, 7:22am

Hi had a quick look at the kernel development logs this morning and although I can’t see anything specifically relating to this issue it does seem like a lot of fixes and patches are happening over the last few years.

Age	Commit message (Expand)	Author	Files	Lines
2021-02-03	can: dev: prevent potential information leak in can_fill_info()	Dan Carpenter	1	-1/+1
2019-11-05	can: mcp251x: mcp251x_restart_work_handler(): Fix potential force_quit race c...	Timo Schlüßler	1	-1/+1
2019-09-05	Merge tag 'linux-can-next-for-5.4-20190903' of git://git.kernel.org/pub/scm/l...	David S. Miller	4	-148/+81
2019-09-03	can: mcp251x: Call wrapper instead of regulator_disable()	Andy Shevchenko	1	-4/+2
2019-09-03	can: mcp251x: Make use of device property API	Andy Shevchenko	1	-7/+5
2019-09-03	can: mcp251x: Use devm_clk_get_optional() to get the input clock	Andy Shevchenko	1	-18/+12
2019-09-03	can: mcp251x: remove deprecated board file setup example	Marc Kleine-Budde	1	-20/+0
2019-08-20	Merge tag 'linux-can-next-for-5.4-20190820' of git://git.kernel.org/pub/scm/l...	David S. Miller	6	-164/+83
2019-08-20	can: mcp251x: remove custom DMA mapped buffer	Marc Kleine-Budde	1	-48/+10
2019-08-20	can: mcp251x: Use DT-supplied interrupt flags	Phil Elwell	1	-1/+4
2019-08-20	can: mcp251x: Use dev_name() during request_threaded_irq()	Alexander Shiyan	1	-1/+2
2019-08-20	can: mcp251x: mcp251x_hw_reset(): allow more time after a reset	Marc Kleine-Budde	1	-5/+14
2019-08-20	can: mcp251x: use u8 instead of uint8_t	Marc Kleine-Budde	1	-5/+4
2019-08-20	can: mcp251x: fix print formating strings	Marc Kleine-Budde	1	-2/+1
2019-08-20	can: mcp251x: avoid long lines	Marc Kleine-Budde	1	-2/+4
2019-08-20	can: mcp251x: remove unnecessary blank lines	Marc Kleine-Budde	1	-2/+0
2019-08-20	can: mcp251x: convert block comments to network style comments	Marc Kleine-Budde	1	-10/+5
2019-07-25	can: mark expected switch fall-throughs	Gustavo A. R. Silva	4	-5/+8
2019-07-24	can: mcp251x: add error check when wq alloc failed	Weitao Hou	1	-27/+22
2019-07-24	can: mark expected switch fall-throughs	Gustavo A. R. Silva	4	-5/+8
2019-06-09	Merge tag 'linux-can-fixes-for-5.2-20190607' of git://git.kernel.org/pub/scm/...	David S. Miller	6	-22/+42
2019-06-07	can: mcp251x: add support for mcp25625	Sean Nyekjaer	2	-11/+19
2019-04-03	spi: bcm2835aux: fix corruptions for longer spi transfers	Martin Sperl	1	-4/+4
2019-01-09	spi: core: avoid waking pump thread from spi_sync instead run teardown delayed	Martin Sperl	1	-33/+89

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers?h=linux-5.4.y&qt=grep&q=mcp25

edwaugh · March 4, 2021, 7:43am

Just some links:

v5.4

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/can/spi/mcp251x.c?h=linux-5.4.y
v5.9

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/can/spi/mcp251x.c?h=linux-5.9.y

edwaugh · March 5, 2021, 2:46pm

hi @matthias.tx and @andrecurvello.tx is it possible to get a response to this?
Thanks
Ed

swisstinu · March 9, 2021, 1:19pm

Hi @edwaugh

I am using the backported driver from varicite:

This works not too bad on my side at any bitrate but not with a high message rate.
Maybe you can give this a try.

edwaugh · March 9, 2021, 4:54pm

Hi Andre, I can confirm on v1.1a with 5.1. Unless there are any changes in 5.2 that affect can I would expect it to be the same.

andrecurvello.tx · March 10, 2021, 1:46pm

Hi @edwaugh,

Thanks for the feedback.

I’m escalating this topic internally.

As soon as I have feedback I’ll update you.

Best regards,
André Curvello

Edward · March 10, 2021, 4:46pm

Hi @edwaugh,

Links you provided are about different driver for MCP2517fd/18fd and other links for unrelated MCP251x (like 2515).

Do you receive those MCP25xxFD TEF related messages with hardware chip select (CS) enabled, or indeed you are using GPIO CS? On iMX7D TEF messages appear sending burst of messages, but only when both hardware CS and ECSPI DMA are enabled.

Enabled ECSPI DMA as well breaks MCP2518FD ECC RAM initialization, which leads to ECC errors reported by driver.

Update: iMX6ULL as well has issues with HW CS + DMA=on.

Edward

Edward · March 10, 2021, 9:04pm

Well, you have GPIO CS instead of HW CS. Most likely you have DMA enabled in some lower level dtsi. So your settings are quite similar to those, which don’t trigger TEF errors for me on iMX7D and iMX6ULL… With one exception, you have very low SPI clock.

I may try lowering SPI clock on my side to check what happens. If you wish to play with DT:

Try rising spi-max-frequency to 19500000. 20MHz is real MCP2518 limit, but it may be marginal.
Try disabling DMA. Add to ecspi properties dma-names= "", "";
You could switch to HW CS, but I don’t have ready instructions for your iMX8M Verdin.

andrecurvello.tx · March 11, 2021, 6:06pm

Hi @Ed,

I think I haven’t asked you this before, but this problem is only happening with the External CAN Controller, or does it also happens with the integrated CAN Controller on Verdin module?

Best regards,
André Curvello

edwaugh · March 12, 2021, 9:47am

Hi Andre, the problem only happens on the internal controller, I have not checked external yet.

andrecurvello.tx · March 12, 2021, 2:58pm

Ok! Thanks.

andrecurvello.tx · March 12, 2021, 3:09pm

Hi @Edward,

Are you also being affected by this issue?

Edward · March 14, 2021, 8:25pm

Hi @andrecurvello.tx,

Actually I’m rather concerned than affected. Since @edwaugh doesn’t use HW CS I don’t see how he could get TEF errors except that perhaps too slow SPI clock or wrong IRQ type (IRQ_TYPE_EDGE_FALLING instead of IRQ_TYPE_LEVEL_LOW). I personally didn’t try to reduce SPI clock that low on my side.

Instead I investigated a bit why HW CS + DMA breaks MCP25xxFD. There’s a problem in spi-imx.c. It is reproducible with spidev device, visible as well in MCP25xxFD waveforms.

Sending 64 bytes file spidev_test -D /dev/spidev2.0 -i /home/root/f2:
[upload|BCIwP3x2M35hmwPkg6kvWsTutaU=]

Where 1 – CS, 2- SCK

Sending the same 64 bytes file in 32-bit word mode spidev_test -D /dev/spidev2.0 -i /home/root/f2 -b 32

[upload|jVfTxP8emTaiyjkE/XmZJpVRVQM=]

As you see CS toggles where it shouldn’t. The same in both, 8 and 32 bits per word modes. This actually breaks MCP25xx18FD when DMA transfers are triggered (>=64 bytes per message).

Disabling DMA echo N > /sys/module/spi_imx/parameters/use_dma
[upload|AUjcbATPhUop5XIqgXkjreZYmW4=]

So either DMA should be off or HW CS should be not used.

Current driver implementation sets ECSPIx_CONREG.BURSTLEN to match transfer length only for PIO path. DMA mode keeps BURSTLEN matching bits_per_word setting. And that’s the problem for HW CS, as well the problem because of unwanted gaps between bytes.

Patching driver like THIS bad patch cancels CS toggle between SPI bytes. But then DMA buswidth / .dst_addr_width (see spi-imx.c) has to be somehow forced to 4. With buswidth left as is spidev_test -D /dev/spidev2.0 -i /home/root/f2 –b8 produces following

[upload|5aaIabGmTezWHd8SEf+6kDfnm+g=]

^^ here data bytes on MOSI come in this order 0, 0, 0, <first byte from file>, 0, 0, 0, <second byte from file>, etc. And spidev_test -D /dev/spidev2.0 -i /home/root/f2 –b32 produces this

[upload|wCO3luFUkXReCjRUEzGVtzhnsoA=]

^^ here byte order matches use_dma=N case.

MCP25xxFD driver actually uses 8 bits per word setting, so no go other than GPIO CS or DMA=off or even better - fixing driver. It would be nice to get spi-imx.c driver fixed. I’m not sure yet how do it without breaking other modes like slave, 8/16/32 bits per word etc.

Edward

Update: reloaded pictures

edwaugh · March 15, 2021, 7:06am

Hi @Edward,

Thanks for the awesome investigation. What do you think the next steps are? Is there an owner of the spi-imx.c driver we should be talking to? I guess NXP supply it as part of their BSP. Maybe @andrecurvello.tx and @brandon.tx have contacts at NXP they could share this thread with?

Ed

Edward · March 16, 2021, 8:42pm

Hi @edwaugh

To all: Do you see images in my post? On one PC I see them, on another one I don’t. Hm. All 5 have url’s like https://www.toradex.com/community/storage/temp/2637-tek0010.png so should be visible. Let me know if I should repost them.

Regarding next steps. After half work of implementing ECSPI+MCP2518 on iMX7D.M4, I think I know how to fix Linux driver DMA. Since buswidth=32 mode works for 4*n bytes transfer, its just a matter of smart DMA buffer filing and ceiling transfer size to the next 32 bit boundary, ECSPI will do the rest, send not necessary 4N bytes, but as well every other lengths from 4N-3 to 4N-1. I’ll try it after done with M4.