Wifi throughput issue with Colibri IMX8QXP and TorizonCore

ychabert · October 17, 2023, 4:28pm

Hello community,

SETUP:
I am using a Colibri iMX8QXP 2GB WB IT, on a Viola Plus carrier board.
I have installed TorizonCore 6.4.0-build.5 on it.
I have set up 2 Unictron AA222 wifi antennas on it.

ISSUE:
I have an issue with the LAN wifi: it crashes even at a very low throughput.

STEPS TO REPRODUCE:
Once TorizonCore is installed, I am connecting to the LAN wifi with:

nmcli -a device wifi connect <my-wifi>

Then, I unplug the ethernet cable.
Then, I perform either (A) or (B), and it systematically provokes a wifi crash within a few minutes.

(A) Pinging the Colibri from a remote machine at 100 ping/second.

(B) Connecting to the Colibri with ssh, then pinging google.com at 100 ping/second.

What do I mean by “wifi crash”?

In (A), not being able to ping the Colibri anymore, until rebooting Colibri.
In (B), losing the ssh connection to the Colibri, and not being able to ping it anymore either, until rebooting the Colibri.

In a minority of cases, the wifi is able to “recover”, and I am able to ping again after 10 minutes without needing a reboot.

Occasionally, in (B), I could see the following message in the terminal:

ping: sendto: No buffer space available

FURTHER OBSERVATIONS AND HYPOTHESES:
The probability that the issue appears decreases as I reduce the ping frequency: at 10 ping/second the issue seldom appears within 10 minutes, and at 1 ping/second it never occurs.

Therefore, my guess is that the wifi has an abnormally low throughput, so it quickly fills its buffer, and once the buffer is filled the wifi crashes. This hypothesis is also supported by the occasional observation of the ping: sendto: No buffer space available message.
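One way to probe the buffer hypothesis is to watch the kernel's per-interface counters while the ping test runs. The sketch below is an illustration only: it assumes the Wi-Fi client interface is named mlan0 (the name used with iw elsewhere in this thread), and the helper name iface_stat is hypothetical. A rising tx_dropped value would suggest, but not prove, that packets are being discarded on the transmit path.

```shell
#!/bin/sh
# Print one kernel statistics counter for a network interface.
# $1 = interface name, $2 = counter name (e.g. tx_dropped),
# $3 = sysfs root (optional, defaults to /sys)
iface_stat() {
    cat "${3:-/sys}/class/net/$1/statistics/$2"
}

# Example: sample the counters once per second during the ping test.
# "mlan0" is an assumed interface name; adjust it to your system.
#   while sleep 1; do
#       echo "$(date +%T) tx_dropped=$(iface_stat mlan0 tx_dropped) tx_errors=$(iface_stat mlan0 tx_errors)"
#   done
```

The optional third argument only exists so the helper can be pointed at a scratch directory when trying it off-target.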

I have performed these tests using different Colibri boards (of the same model) and carrier boards (of the same model as well).

I have also performed the same tests between 2 laptops on the same LAN wifi, and did not get any issue.

Additionally, I have done some throughput calculations.

In (A) my pings are sent from Windows, so they should contain 32 bytes of data by default.
In (B) they are sent from TorizonCore, so they should be 64 bytes each (56 bytes of data plus the 8-byte ICMP header).

At 100 pings/second, that implies a throughput of at most 100 x 64 x 8 = 51200 bits/sec.
That is a very low throughput, far below the 400.0 Mbps of the wifi network the data is going through, and also far below the “up to 866.7 Mbps” that I have read in the iMX8X datasheet.
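The arithmetic above can be checked with a few lines of shell. This is a rough sketch: the function name ping_throughput_bps is made up for illustration, and it counts only the ICMP bytes in one direction, ignoring IP and 802.11 overhead as well as the echo replies.

```shell
#!/bin/sh
# One-way ping throughput in bits per second.
# $1 = pings per second, $2 = bytes per ping
ping_throughput_bps() {
    echo $(( $1 * $2 * 8 ))
}

ping_throughput_bps 100 32    # prints 25600
ping_throughput_bps 100 64    # prints 51200
ping_throughput_bps 1000 64   # prints 512000
```

Even the 1000 pings/second case stays several orders of magnitude below the link rates quoted in this post.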

LOGS
I am attaching below the journalctl logs from a test of type (B), recorded after enabling persistent journald logging.
In this test, the connection was lost around 16:12:15.

journalctl_oct_16_12.log (140.2 KB)

lucas_a.tx · October 17, 2023, 9:56pm

Hi @ychabert ,

That’s strange. The Wi-Fi should be able to handle the pings with no problems.

Can you check if this issue also happens on TorizonCore/Torizon OS 5.7.2?

On my side I’ll try to reproduce your results.

Best regards,
Lucas Akira

ychabert · October 18, 2023, 3:42pm

Hello @lucas_a.tx,

Thanks for your help!

As you recommended, today I installed TorizonCore 5.7.2 with the evaluation containers on my Colibri, and I am getting the same issues and behavior as I had with TorizonCore 6.4.0-build.5.

I am attaching 2 persistent Journald logs corresponding to 2 wifi failures in the test scenario (B) in my first message, i.e. pinging google.com from the Colibri at 100Hz.

I have noticed that these persistent logs contain errors that I was not getting with TorizonCore 6.4.0-build.5, but I don’t know if they are related to the wifi throughput issue.

Also, in my first message, I forgot to mention that I had tried to deactivate the wifi power save with iw mlan0 set power_save off, and that it did not fix the issue.

Those issues are a big problem for my company. In our use case we need to send data packets from the Colibri board to a remote machine over wifi at 60 Hz, which is currently impossible: even ping packets, which are much smaller (64 bytes) than our data packets, cannot be sent consistently at this frequency without the wifi crashing. The issue is not present with ethernet, but we need LAN wifi.

JOURNALD LOGS:
Wifi crash near 12:47:45:
journald-logs-lost-co-12h47m45s.log (161.0 KB)
Wifi crash near 13:05:15:
journald-logs-lost-co-13h05m15s.log (162.2 KB)

lucas_a.tx · October 19, 2023, 5:20pm

Hi @ychabert ,

I was not able to reproduce your issue. Using:

Colibri iMX8QXP 2GB WB IT V1.0D
Colibri Evaluation Board V3.2B
Torizon OS 6.4.0+build.5
2x Dual-Band Wi-Fi/Bluetooth Dipole Antennas with the pigtail cables

Connecting the SoM only via Wi-Fi the same way you did, I did a test similar to (B), but I ran the ping command in the module using the debug Serial connection. I also connected to it via SSH and monitored the journalctl logs there.

The ping command executed was:

ping -i 0.01 google.com

That should do 100 pings/second.

I left the Colibri iMX8X running the command for a little more than 18 hours, and the Wi-Fi connection did not crash like you described: SSH connection remained up, and it was still pinging non-stop until the end of the test:

[...]
64 bytes from 142.250.217.110: seq=61696 ttl=52 time=190.659 ms
64 bytes from 142.250.217.110: seq=61697 ttl=52 time=191.689 ms
64 bytes from 142.250.217.110: seq=61698 ttl=52 time=185.734 ms
64 bytes from 142.250.217.110: seq=61699 ttl=52 time=193.571 ms
64 bytes from 142.250.217.110: seq=61700 ttl=52 time=188.811 ms
64 bytes from 142.250.217.110: seq=61701 ttl=^C
--- google.com ping statistics ---
6615321 packets transmitted, 5704312 packets received, 13% packet loss
round-trip min/avg/max = 180.895/359.391/7058.641 ms
torizon@colibri-imx8x-06995807:~$ ping -i 0.01 google.com

No containers were running or started during the test.
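The 13% figure in the ping statistics above is rounded down. As a small illustration, the following sketch recomputes the loss from the raw counters; the function name ping_loss_percent is hypothetical, and the parsing assumes the counters sit in the first and fourth whitespace-separated fields of the summary line, as they do in the output above.

```shell
#!/bin/sh
# Recompute packet loss (percent, two decimals) from a ping summary line
# read on stdin, e.g.
# "6615321 packets transmitted, 5704312 packets received, 13% packet loss".
ping_loss_percent() {
    awk '/packets transmitted/ {
             tx = $1; rx = $4
             printf "%.2f\n", (tx - rx) * 100 / tx
         }'
}

echo "6615321 packets transmitted, 5704312 packets received, 13% packet loss" | ping_loss_percent
# prints 13.77
```

So roughly 911,000 of the 6.6 million echo requests went unanswered over the run, even though the link itself never went down for good.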

One similarity between your journalctl logs and the ones from this test is that the Wi-Fi connection would occasionally drop, then immediately go up again. However the network connection always recovered:

[...]
Oct 19 04:16:36 colibri-imx8x-06995807 systemd-networkd[682]: uap0: Link DOWN
Oct 19 04:16:36 colibri-imx8x-06995807 NetworkManager[675]: <info>  [1697688996.4718] device (uap0): set-hw-addr: set MAC address to BA:2E:AE:7C:59:54 (scanning)
Oct 19 04:16:36 colibri-imx8x-06995807 systemd-networkd[682]: uap0: Link UP
Oct 19 04:16:36 colibri-imx8x-06995807 NetworkManager[675]: <info>  [1697688996.5307] device (uap0): supplicant interface state: inactive -> disconnected
Oct 19 04:16:36 colibri-imx8x-06995807 NetworkManager[675]: <info>  [1697688996.5404] device (uap0): supplicant interface state: disconnected -> inactive
Oct 19 04:23:33 colibri-imx8x-06995807 systemd-networkd[682]: uap0: Link DOWN
Oct 19 04:23:33 colibri-imx8x-06995807 NetworkManager[675]: <info>  [1697689413.4725] device (uap0): set-hw-addr: set MAC address to 6A:2E:94:83:B4:44 (scanning)
Oct 19 04:23:33 colibri-imx8x-06995807 systemd-networkd[682]: uap0: Link UP
Oct 19 04:23:33 colibri-imx8x-06995807 NetworkManager[675]: <info>  [1697689413.4796] device (uap0): supplicant interface state: inactive -> disconnected
Oct 19 04:23:33 colibri-imx8x-06995807 NetworkManager[675]: <info>  [1697689413.4854] device (uap0): supplicant interface state: disconnected -> inactive
[...]
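When reading excerpts like the one above, it can help to tally the link drops per interface so that it is clear which interface they belong to. The following is a minimal sketch, assuming the journal lines have the same layout as in the excerpt; the function name count_link_down is invented for this example.

```shell
#!/bin/sh
# Count "Link DOWN" events per interface in systemd-networkd journal
# lines read from stdin. The interface name is the third field from the
# end ("uap0:" in "... uap0: Link DOWN"), with the trailing colon removed.
count_link_down() {
    awk '/systemd-networkd/ && /Link DOWN/ {
             iface = $(NF - 2)
             sub(/:$/, "", iface)
             n[iface]++
         }
         END { for (i in n) print i, n[i] }'
}

# Example usage on the module:
#   journalctl -b -u systemd-networkd | count_link_down
```

Fed with the lines shown above, it would report two drops, both on uap0.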

I also never saw the ping: sendto: No buffer space available message in the journalctl logs.

It’s a strange issue. Even though I used a different carrier board and different antennas for my test compared to yours, I don’t think those should affect the results.

You mentioned you tested different Colibri iMX8X modules, all of them the same model. Which specific model (e.g. V1.0D) did you test?

Have you made any changes to Torizon OS/TorizonCore? Were you running any containers during the tests?

Can you test if you see the same issue in one of our BSP reference images, like the minimal v6.4.0?

Download Links | Toradex Developer Center

Best regards,
Lucas Akira

ychabert · October 20, 2023, 12:34pm

Thanks @lucas_a.tx!

The command I am using for my test (B) is ping -i 0.01 google.com as well.

The model of the Colibri iMX8QXP 2GB WB IT is V1.0D as well.

At first I observed the issue on 3 boards on which we had made changes to TorizonCore and were running our Docker containers. But then I redeployed TorizonCore from scratch on one of them, did only the LAN wifi configuration with nmcli -a device wifi connect <my-wifi>, and encountered the issue again when testing. I have done tests with both the Viola Plus carrier board and a custom carrier board that we have built. Also, a colleague reproduced the issue on his home wifi, so we eliminated our (sometimes temperamental) office wifi from the possible causes of the problem.

You mentioned that in the logs the Wi-Fi connection occasionally drops, and I have seen that as well, but from my understanding it does not concern the LAN wifi but the access point wifi uap0.

I will test with the BSP reference images and come back to you.

Best regards

ychabert · October 20, 2023, 2:23pm

@lucas_a.tx I have just done tests with the minimal reference image 6.4.0+build.8, and I am not getting any issues with that one, even when pinging at 1000 ping/sec.

So, the issue seems to come from TorizonCore, or from the usage of TorizonCore in some conditions that are yet to be determined.

In our use case, I don’t think we could use the BSP reference images though, because we need to deploy Docker containers and we don’t have the necessary skills for customizing our images.

Maybe I can try older TorizonCore versions.

Thanks again for your help

lucas_a.tx · October 23, 2023, 3:03pm

Hi @ychabert ,

“You mentioned that in the logs the Wi-Fi connection occasionally drops, and I have seen that as well, but from my understanding it does not concern the LAN wifi but the access point wifi uap0.”

Yes you are correct, I got confused here. uap0 is the access point, not the Wi-Fi client. Sorry for the confusion.

“I have just done tests with the minimal reference image 6.4.0+build.8, and I am not getting any issues with that one, even when pinging at 1000 ping/sec.”

OK, that’s interesting to know.

“So, the issue seems to come from TorizonCore, or from the usage of TorizonCore in some conditions that are yet to be determined.”

From what you tested previously this issue also occurred to you with an unmodified TorizonCore image with no containers running, is that right?

Maybe it could be a power issue. Is the power supply used for the SoM able to provide at least 2A of current?

Best regards,
Lucas Akira

ychabert · October 23, 2023, 4:23pm

Hi @lucas_a.tx,

“From what you tested previously this issue also occurred to you with an unmodified TorizonCore image with no containers running, is that right?”

Yes, except the wifi configuration.

“Maybe it could be a power issue. Is the power supply used for the SoM able to provide at least 2A of current?”

Yes, the power supplies I am using provide 3A of current with a 5V voltage, so 15W of power.

Best regards

lucas_a.tx · October 24, 2023, 5:39pm

Hi @ychabert ,

“Yes, the power supplies I am using provide 3A of current with a 5V voltage, so 15W of power.”

OK, that should be plenty of available current for the SoM.

I did some additional tests here, and I’m still not able to reproduce your exact issue.

I’ve used the same SoM as before on a Viola Plus V1.2B running Torizon OS 6.4.0+build.5. I’ve also put a Weston and a Chromium container running during the tests to stress the CPU and the GPU.

I only connected to the SoM via SSH, the same way you did, and executed the ping command from there as well.

After a couple of minutes the SSH connection would drop, but I was always able to reconnect to it without problems i.e. the Wi-Fi never crashed completely as you reported.

I repeated the same tests on a different Wi-Fi connection and also with 1000 pings/second, with pretty much the same results.

With 1000 pings/s the ping: sendto: No buffer space available error would often occur, whereas it only happened occasionally with 100 pings/s. Either way, it’s always possible to ping again with no issues.

Without being able to reproduce your results, it’s hard to tell exactly what is happening here. Do you have any other antennas you could use for the tests?

Best regards,
Lucas Akira

ychabert · October 25, 2023, 3:53pm

Hi @lucas_a.tx,

We have just managed to fix the issue by:

Setting up the wifi with systemd-networkd instead of NetworkManager (I don’t know the details of the commands because my colleague did it).
Stopping NetworkManager with sudo systemctl stop NetworkManager.

After that, we can now ping at 100 Hz, and even 1000 Hz, without any issue.
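The exact commands used for this fix are not given in the thread. Purely as an illustration of what such a setup commonly looks like, here is a sketch that writes a DHCP station configuration for systemd-networkd. Everything specific in it is an assumption: the interface name mlan0, the file name 80-mlan0.network, the helper name write_wifi_network_unit, and the availability of the wpa_supplicant@.service template on TorizonCore.

```shell
#!/bin/sh
# Write a minimal systemd-networkd unit that requests DHCP on one interface.
# $1 = interface name (assumed: mlan0)
# $2 = root directory prefix (optional; empty means the real /etc)
write_wifi_network_unit() {
    dir="${2:-}/etc/systemd/network"
    mkdir -p "$dir"
    cat > "$dir/80-$1.network" <<EOF
[Match]
Name=$1

[Network]
DHCP=yes
EOF
}

# Remaining steps, run as root on the module. These are assumed and are
# NOT confirmed by the thread:
#   write_wifi_network_unit mlan0
#   wpa_passphrase "<my-wifi>" "<passphrase>" > /etc/wpa_supplicant/wpa_supplicant-mlan0.conf
#   systemctl enable --now wpa_supplicant@mlan0.service
#   systemctl restart systemd-networkd
#   systemctl stop NetworkManager
```

In this arrangement systemd-networkd only handles addressing, while wpa_supplicant performs the Wi-Fi association that NetworkManager would otherwise drive.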

Last week I had done one test with systemd-networkd disabled and only NetworkManager running, because I had read that network managers can conflict, so it is recommended to run only one at a time. That test did not fix the issue, so it does not seem to be a conflict problem. The issue really seems to come from NetworkManager.
Maybe the issue comes from the driver used by NM. We have configured systemd-networkd with the mwifiex_pcie driver; I don’t know if NM uses the same driver.

Best regards

lucas_a.tx · October 25, 2023, 7:08pm

Hi @ychabert ,

Glad you were able to fix your issue!

It’s still strange that I wasn’t able to reproduce it given that all tests made here were with NetworkManager i.e. the default configuration.

“We have configured systemd-networkd with the mwifiex_pcie driver; I don’t know if NM uses the same driver.”

As far as I know NetworkManager should use the same driver, as otherwise the Wi-Fi connection using nmcli would not be made at all.
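The driver is bound to the interface by the kernel, so it can be checked directly in sysfs, independently of which network manager is running. The following is a small sketch of that check; the helper name iface_driver is hypothetical and the interface name mlan0 is assumed.

```shell
#!/bin/sh
# Print the name of the kernel driver bound to a network interface by
# resolving the "driver" symlink of its device in sysfs.
# $1 = interface name, $2 = sysfs root (optional, defaults to /sys)
iface_driver() {
    basename "$(readlink -f "${2:-/sys}/class/net/$1/device/driver")"
}

# Example usage on the module ("mlan0" is an assumed interface name):
#   iface_driver mlan0
```

Running it once with NetworkManager active and once with systemd-networkd active would be expected to give the same driver name in both cases.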

If you don’t have other questions about this issue, can you mark your post as the solution for this thread?

Feel free to create a new thread here in the Community if you need help with other topics.

Best regards,
Lucas Akira