Ethernet and RNDIS mode of USB fail on T20 after a few days of run time

MikeS · August 25, 2020, 12:06am

We have been using the T20 for years and have many units in the field. We have a mixture of images on our units including 2.1 and 1.4. We have recently started to see a problem where the Ethernet connection will stop working some time after booting. The time at which this happens seems to be random: sometimes it is 1 day, sometimes a few days, sometimes a few weeks. This makes it very hard to collect data on the circumstances around the problem.

When the Ethernet connection stops working, the USB port also stops working (we use it in RNDIS mode). The Ethernet interface to our application stops working, and so do OS services like ping, telnet, http, ftp and so on. The LEDs in the Ethernet connector also stop blinking. An important detail is that our application continues to run and we can still communicate with it over RS232. This suggests that the problem is in the OS or the hardware.
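Because the failure time is so unpredictable, one way to collect data is to have a host PC probe the unit at a fixed interval and record when it first stops answering. The following is only a minimal sketch in Python: the function names, the probe interval, and the use of the telnet port are assumptions for illustration, not anything provided by the OS image.

```python
import socket
import time
from datetime import datetime, timezone


def tcp_probe(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def watch(host, port, interval_s=60.0, max_probes=None, probe=tcp_probe):
    """Probe repeatedly until the first failure.

    Returns (successful_probes, failure_time). failure_time is a UTC
    timestamp, or None if max_probes was reached without a failure.
    """
    ok = 0
    while max_probes is None or ok < max_probes:
        if not probe(host, port):
            return ok, datetime.now(timezone.utc)
        ok += 1
        time.sleep(interval_s)
    return ok, None
```

For example, `watch("<device-ip>", 23)` would probe the telnet service once a minute and return the number of successful probes together with the time of the first failure. The address is a placeholder to fill in for the unit under test.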

We found a description of issue WC-1659 in OS image 2.4b2 which says, “Ethernet Communication becomes very slow and finally fails”. This sounds like it might be the root cause of the problem we are seeing. Do the symptoms I have described sound consistent with that issue? We are trying to reproduce this issue more often so that we can then try the new image and prove to ourselves that it fixes the problem. The problem happens so infrequently that it is hard to prove that a change has really fixed it.

Do you know if the issue which was fixed in image 2.4b2 was introduced in image 2.1? Or was this problem there for longer? If the issue was new in image 2.1 then our older systems with image 1.4 would not show it.

germano.tx · August 25, 2020, 9:43am

Hi @MikeS,

The issue you describe does not match the WC-1659 issue we fixed. That issue was only seen on T30 (4 cores) and happened because of a race condition (some counters were not incremented in a multicore-safe way). Also, if I remember correctly, it did not affect the whole network stack as in your case.

The problem you describe looks more like something gets stuck in the network stack and blocks everything network related. We saw this kind of issue with other network adapters (SDIO WiFi), and sometimes (also very rarely) it deadlocked the whole network stack (probably because the driver was still holding a critical section).

One important clue could be that you saw the Ethernet LEDs stop blinking… this points to the Ethernet controller being disabled. Did you ever check if you still see the network controller in the system? On T20 the Ethernet adapter is internally connected over USB, so it could in theory happen that it’s seen as detached and the driver is unloaded.
On images before 2.1 we also had this issue, which could explain the Ethernet disappearing:
WC-1482 (but I’m not sure it also explains the other network stuff no longer working) (also: it seems like you see the issue even on 2.1)

One thing you could also try is to enable debug messages and see if anything happens when the issue appears (maybe something crashes, giving a hint on where to look).

MikeS · August 25, 2020, 6:19pm

Hi @germano.tx,

The T20 does have two cores which are both used by CE7. Does that mean that WC-1659 could affect the T20, even if you didn’t detect it yet? If the counters are not incremented in a multicore-safe way then couldn’t the problem happen with 2 cores or with 4 cores? The issue description for WC-1659 says it affects the T20 and the T30 but the problem is much more likely to happen on the T30 because it has 4 cores.
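To illustrate the point about core count, here is a small Python simulation of a counter increment that is split into a separate load and store. It is only a sketch of the general lost-update race, not the actual CE7 driver code, and the names `run_schedule`, `SERIALIZED` and `INTERLEAVED` are made up for this example.

```python
def run_schedule(schedule):
    """Replay (core, step) pairs against a shared counter; return its final value.

    Each increment is split into a "load" and a "store", which is how a plain
    counter++ behaves when it is not performed as one atomic operation.
    """
    counter = 0
    loaded = {}  # per-core private copy of the value that core last read
    for core, step in schedule:
        if step == "load":
            loaded[core] = counter
        elif step == "store":
            counter = loaded[core] + 1
        else:
            raise ValueError(f"unknown step: {step!r}")
    return counter


# Core 0 finishes its increment before core 1 starts: both updates survive.
SERIALIZED = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]
# Both cores load the old value before either stores: one update is lost.
INTERLEAVED = [(0, "load"), (1, "load"), (0, "store"), (1, "store")]

print(run_schedule(SERIALIZED))   # 2
print(run_schedule(INTERLEAVED))  # 1
```

Two cores are enough to lose an update in this model. More cores only make the bad interleaving more likely. On Windows the usual remedy for this class of bug is an atomic increment such as `InterlockedIncrement`, although I do not know how the actual fix was implemented.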

We use the T20 in a headless application without a display, so when the Ethernet fails I don’t have a way to inspect the OS to see if the driver is still loaded. I could try to make the problem happen when the USB is in AS mode and then attempt to use the remote display over AS to inspect the system.

germano.tx · August 25, 2020, 7:31pm

Hi @MikeS ,

It’s possible that WC-1659 also affects T20, but I was not able to reproduce the issue with the customer-specific use case that was able to reproduce it on T30. Unfortunately I cannot share the customer’s application, but it was basically a use case where lots of smaller packets (100-500 bytes) were sent to the device with little wait time in between (100-100us).
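A traffic pattern like this can be approximated from a host PC. The sketch below is an assumption-heavy illustration rather than the customer application: it assumes UDP (the transport was not stated), uses a fixed fill byte for the payload, and takes 100 µs as the default gap. The function names are hypothetical.

```python
import random
import socket
import time


def make_payloads(count, min_size=100, max_size=500, seed=0):
    """Build `count` payloads with random lengths in [min_size, max_size] bytes."""
    rng = random.Random(seed)  # seeded so that a run can be repeated exactly
    return [b"\x55" * rng.randint(min_size, max_size) for _ in range(count)]


def send_burst(sock, addr, payloads, gap_us=100):
    """Send each payload as one UDP datagram with a short pause in between.

    Returns the total number of payload bytes handed to the socket.
    """
    sent = 0
    for payload in payloads:
        sent += sock.sendto(payload, addr)
        # time.sleep is coarse at this scale, so the real gap is usually longer.
        time.sleep(gap_us / 1_000_000)
    return sent
```

To use it, create a socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and call `send_burst(sock, ("<device-ip>", <port>), make_payloads(10000))`, where the address and port are placeholders for a service that is actually listening on the device.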

Also, when the problem happened RNDIS was still working. Actually the whole network was kind of still working… continuing to send packets would at some point make the Ethernet driver realize there were some packets it had not looked at yet… (that’s why it was described as “becomes very slow”; it only stopped working if the other party was waiting for a reply without sending anything more, or a timeout happened)

NOTE: I think when your problem happens you cannot start anything network related. Most probably AS will not work either, as it relies heavily on the network stack for PPP and thus also TCP.