MCP2517FD/MCP2518FD CAN controller driver errors on IMX8MM Verdin

Hi @edwaugh,

When time permits please try this patch. Can’t upload it as file, for some reason Uploading… never ends

diff --git a/drivers/net/can/spi/mcp25xxfd/mcp25xxfd_can_tx.c b/drivers/net/can/spi/mcp25xxfd/mcp25xxfd_can_tx.c
index 07526a2..ece661f 100644
--- a/drivers/net/can/spi/mcp25xxfd/mcp25xxfd_can_tx.c
+++ b/drivers/net/can/spi/mcp25xxfd/mcp25xxfd_can_tx.c
@@ -384,15 +384,21 @@ mcp25xxfd_can_tx_handle_int_tefif_optimized(struct mcp25xxfd_can_priv *cpriv,
 	MCP25XXFD_DEBUGFS_STATS_INCR(cpriv, tef_optimized_read_sizes[i]);
 
 	/* now iterate those */
-	for (i = 0, fifo = cpriv->fifos.tx.start; i < cpriv->fifos.tx.count;
-	     i++, fifo++) {
-		if (finished & BIT(fifo)) {
-			ret = mcp25xxfd_can_tx_handle_int_tefif_fifo(cpriv,
-								     false);
+	do {
+		ret = mcp25xxfd_can_tx_handle_int_tefif_fifo(cpriv, false);
+		if (ret)
+			return ret;
+		count--;
+		if(count)
+		{
+			u32 tefsta;
+		
+			ret = mcp25xxfd_cmd_read_mask(cpriv->priv->spi, MCP25XXFD_CAN_TEFSTA,
+				      &tefsta, MCP25XXFD_CAN_TEFSTA_TEFNEIF);
 			if (ret)
 				return ret;
 		}
-	}
+	} while(count);
 
 	return 0;
 } 

According to MCP25XXFD reference manual

Before reading a TEF Object, the application must check that the TEF is not empty by reading the CiTEFSTA register.

??_optimized() routines form mcp25xxfd_can_tx.c violate this “must check”. I’m not sure 100% if that’s real problem and real cause of TEFIF error, but patch on my setup seems making it much harder to catch this error, I don’t see it any more. Things still are optimized the way they were supposed: reading all flagged TEF objects in 1 or 2 SPI read bursts. So strictly speaking above mentioned “must check” is still being violated, but now every TEF read leading UA pointer increment command has matching TEF flag reading command.

Edward

Hi @Edward,

Thanks for the patch, I will ask @gauravks to take a look but realistically I think we will probably need Toradex to roll it or something similar into Torizon. Have you had any luck with NXP? Can they fix it in their BSP? Then it should just end up in all the distributions.

Thanks

Ed

Hi @Edward

I checked this again and you are right, the SYSCLK can be set to 40MHz which is needed for a data rate of 8Mbit/s.

For testing this speed, we would need to change the hardware and also adapt the driver.
I created a Ticket for this internally and I will let you know once we have further information. We will also test which value for data rate can be achieved with the current setup.

@edwaugh, it would be really nice, if you can share your python scripts to reproduce the issue.

Thanks @Edward and @edwaugh for the investigation and your help.

Best regards,
Jaski

Hi @edwaugh,

Patch is about Microchip driver, not NXP. If you mean imx-spi driver issues regarding combination of hardware SPI CS and DMA, I doubt regarding NXP involvement. eCSPI is quite old thing and should be olready polished long ago…

You wrote previously about tefif: fifo 0 not pending error:

Even when I bring the CAN interface DOWN then UP again there seem to be some faults that it is not possible to clear.

I have answer and fix for that. In configuration mode all FIFO pointers are reset. Driver doesn’t take that into account and once one message is sent, interface down/up will lead to loss of synchronization between driver expected next TEF FIFO address and real TEF FIFO address. If you would repeat interface up/down several times, checking if cansend works or not, you would recover after about 5 attempts. Apply next PATCH to fix this 2nd issue.

I see one more issue. Driver can’t probe when CAN bus is too busy.

Edward

Sorry, of course you are right, should be with Microchip.

Also, it looks like the MCP2518 is out of stock everywhere and I maybe have to place the MCP2517 instead. Do you think that will be a problem?

Hi @edwaugh,

These days everything seems being out of stock. Checking some stores looks like MCP2518 is not yet in mass production. I didn’t try MCP2517, hope it won’t be as scary as its errata.

Regarding driver, that patch in mcp25xxfd_can_tx.c helps avoiding issues but isn’t enough. With constant high bus load and at quite high CPU usage it still may break. I found possibility in the code to get TEF overflow. This perhaps helps as well but still is not enough. Only removing mcp_xxx_optimized routines and paths makes driver pretty reliable, I’m unable to break it any more. CPU load on iMX7D doesn’t suffer from that fix noticeably, so I’ll stay with it.

There’s as well problem with MCP2518 switching operation modes while bus load is high. Driver could even fail to probe. This was easy to fix.

There are some issues with interrupt handlers on MCP2518. Some conditions like bus off could lead to IRQ disable. I’m not going to polish that, just my workarounds to prevent bad things.

Give me some time, I’ll clean code and repost final patches.

Edward

Hi

Here’s my all in one PATCH .

Actually disabling not reliable optimized path slows down possible rate, at which burst of messages could be send. I see 10-15% slower burst on iMX7D, but I prefer reliability.

Edward

Hi @Edward

I am trying to reproduce these issues on Verdin iMX8MM. Till now I only reproduced the issue missing messages when the message Length is varying. Could you tell me how did you reproduce the other issues? Do you have a sample code?

Best regards,
Jaski

Hi @jaski.tx

Unfortunately my setup is not easily reproducible at your side.

Conditions:

  1. 1Mbps CAN bus.
  2. Node on the bus sends equally spaced stream of messages, message rate is variable from 0.5 to 9k+ messages per second. 5-7kmsg/s was the best setting to catch TEF overflow error, but it should depend at what rate you may send messages from SOM to the bus.
  3. Send burst of few thousands of messages from SOM as fast as possible to the bus. I was able to push up to 5kmsg/s from iMX7D SOM.

For 2 I use some MCU with dedicated FW. 3 is difficult (lua script + so library to communicate over localhost TCP with elf service running on SOM), no general purpose example available.

Edward

Hi @Edward

Thanks for your answer.

I did some test with the cangen utility and what I can see is when the SoMs is the sender then the messages are lost with a speed of 1MBit/s or higher and the BusLoad is never higher than 20%. If the SoMs is the receiver than the busload can be 100% but the SoM is losing more than 50% of the packets without any error messages, which is not helpful. I was also able to catch the tefif error.

Are the results you got acceptable for you or do you need to optimize them?

Best regards,
Jaski

Hi @jaski.tx ,

I haven’t tried cangen. Looks useful.

Regarding loss of RX messages. Are you using 2MHz SPI clock like I saw in DT for some iMX8 SOM? Of course that’s too low for 1Mbps CAN. MCP18FD click board I’m trying is clocked from 40MHz, SPI clock is 16-17MHz. On iMX7D packets receive is good up to 8++ kmsgs/s, which is not far from full bus load. TX performance isn’t very important, but about 50% bus load is certainly possible. I’ll try later cangen, I think it should produce much more than that.

Yes, it would be nice to optimize eCSPI performance. SDMA as I understand should allow sending multiple SPI messages with HW CS. SDMA transfer is broken even for single SPI message with HW CS.

Edward

Hi @Edward,

Can you please let me know which Linux kernel version have you used for this patch?

Regards,
Gaurav

Hi @gauravks

http://git.toradex.com/cgit/linux-toradex.git/log/?h=toradex_5.4-2.3.x-imx

Edward

Hi @jaski.tx,

Any progress on this issue? We still see problems with the CAN driver.

Thanks

Ed
@walter.tx @gauravks

Hello @edwaugh
We are still working on the CAN issues, making a lot of tests on our side. For the TX side the results so far are pretty close to what’s reported in this thread. I was able to test the TX side with something like 8000 packets / s on 1Mbps bus rate with no packet loss. When I lower the bus speed I start to get the TEFIF problems as reported in this thread.
I still need to add the patch from @Edward and check here if it also solves the “TEFIF - fifo not pending” issues on my side.
For the RX side, I have been able to get good results in terms of low / no packet loss in bus speeds up to 1Mbps and bus occupation up to 100%. All these tests are done using the SPI clock set as 8.5MHz. If I can confirm that the TX side is not affected by increasing the SPI clock we will be merging this change to our default device tree.
High packet rates seem to be the biggest problem related to packet loss on the RX side. When we increase the bus speed to 1Mbps we can get up to 16000 packets / s when using DLC 1. Increasing the DLC to 8 decreases the maximum packet / s count and improves the results (lower packet loss, in most cases no packet loss can be detected).

Can I ask what are the problems you still see on your side and are blocking points for your application to work correctly?

Best Regards,
Rafael Beims

Hi Rafael,
Happy new year. We are seeing driver crashes when bringing the CAN interface up and down as described here:

We do see the TEFIF message during these transistions. The reason we do this is that our device sleeps every day, so we shut down all can devices and bring the bus to the down state.
@gauravks can confirm the SPI rate we have set in the device tree.
Thanks
Ed

Hi @edwaugh,
Happy new year to you too!

Are you still seeing this problem even with the patch mentioned in the link applied?