Bluetooth disconnection from external device crashes bluetooth module/driver on a Verdin AM62 board

Hello,

We are running into an issue with Bluetooth on a Verdin AM62 board.
The Bluetooth module or driver becomes unresponsive/crashes after 2 to 10 connect/disconnect cycles from a master device.

Board: Verdin AM62 Dual 1GB Wi-Fi / Bluetooth IT (u-blox MAYA-W160-00 module)
Image: Non customized Torizon OS 6 6.5.0+build.8 2024/01/17 for Verdin AM62 installed using Tezi.
Carrier board: Yavia

After installation has finished we can ssh into the board and execute the following procedure:

bluetoothctl
power on
menu advertise
name test
tx-power on
back
advertise on
  • We can see the AM62 advertising packets using nRF Sniffer.
  • We connect to the AM62 board using a laptop (Windows 10 using BLEConsole).
  • BLEConsole on the laptop and bluetoothctl on the AM62 both confirm that they are connected.
  • We confirm that the connection is successful by verifying that the advertising packets have stopped and there are now Master/Slave packets being transmitted.
  • We disconnect from the AM62 board by closing the connection on the laptop.
    After a few seconds bluetoothctl recognizes that the laptop has disconnected.
  • We verify that the Master/Slave packets also stop.
  • We then restart the advertising via
advertise off
advertise on

and repeat the connection process.

After 2 to 10 (the number of attempts appears to be random) such cycles the AM62 board does not receive the disconnect event anymore and still thinks it is connected.
After about 3 seconds the following trace is printed to the kernel ring buffer:

[  835.213717] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.219727] ------------[ cut here ]------------
[  835.219761] serial serial0: receive_buf returns -84 (count = 20)
[  835.219965] WARNING: CPU: 0 PID: 27 at ttyport_receive_buf+0xb0/0x100
[  835.220074] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter ip_tables x_tables br_netfilter bridge stp llc bnep rpmsg_ctrl rpmsg_char irq_pruss_intc pru_rproc crct10dif_ce btnxpuart pvrsrvkm(O) ti_k3_r5_remoteproc virtio_rpmsg_bus rpmsg_ns rtc_ti_k3 mwifiex_sdio ti_k3_m4_remoteproc mwifiex ti_k3_common tidss drm_dma_helper sa2ul pruss mcrc snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma pwm_tiehrpwm m_can_platform m_can snd_soc_nau8822 can_dev ina2xx ti_ads1015 industrialio_triggered_buffer kfifo_buf lontium_lt8912b industrialio lm75 tc358768 display_connector overlay drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops spi_omap2_mcspi uio_pdrv_genirq uio cfg80211 libcomposite fuse
[  835.221445] CPU: 0 PID: 27 Comm: kworker/u4:1 Tainted: G           O       6.1.46-6.5.0+git.8e6a2ddd4fe6 #1-TorizonCore
[  835.221496] Hardware name: Toradex Verdin AM62 WB on Verdin Development Board (DT)
[  835.221531] Workqueue: events_unbound flush_to_ldisc
[  835.221599] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  835.221655] pc : ttyport_receive_buf+0xb0/0x100
[  835.221716] lr : ttyport_receive_buf+0xb0/0x100
[  835.221775] sp : ffff800009e6bd40
[  835.221801] x29: ffff800009e6bd40 x28: 0000000000000000 x27: 0000000000000000
[  835.221884] x26: ffff00000ada4cb8 x25: ffff000000019005 x24: ffff000000cb36e0
[  835.221970] x23: ffff00000023ae80 x22: ffff000000cb3708 x21: ffff000000f4b800
[  835.222052] x20: 00000000ffffffac x19: 0000000000000014 x18: ffffffffffffffff
[  835.222133] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800009b7ef46
[  835.222215] x14: 0000000000000001 x13: ffff800009897638 x12: 0000000000000630
[  835.222295] x11: 0000000000000210 x10: ffff800009947638 x9 : ffff800009897638
[  835.222376] x8 : 00000000ffffdfff x7 : ffff800009947638 x6 : 0000000000000000
[  835.222456] x5 : ffff00003fd7eb60 x4 : 0000000000000000 x3 : 0000000000000027
[  835.222534] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00000023ae80
[  835.222617] Call trace:
[  835.222641]  ttyport_receive_buf+0xb0/0x100
[  835.222705]  flush_to_ldisc+0xb8/0x1d0
[  835.222751]  process_one_work+0x1d8/0x44c
[  835.222816]  worker_thread+0x14c/0x444
[  835.222864]  kthread+0x110/0x120
[  835.222909]  ret_from_fork+0x10/0x20
[  835.222963] ---[ end trace 0000000000000000 ]---
[  835.223044] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.257805] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.301071] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.346094] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.352128] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.358035] Bluetooth: hci0: Frame reassembly failed (-84)
[  835.366175] Bluetooth: hci0: Frame reassembly failed (-84)
[  869.179293] Bluetooth: hci0: Frame reassembly failed (-84)
[  869.185063] Bluetooth: hci0: Frame reassembly failed (-84)
[  871.258567] Bluetooth: hci0: command 0x0406 tx timeout

The Bluetooth module remains completely unresponsive aftewards, restarting the bluetooth service does not fix it. The only “fix” we have found so far is to power cycle the whole board. It is unclear if this issue occurs in the module or driver.

A similar issue is described here and here but the issue occurred on board boot up not during disconnection handling.

What could be causing this behaviour and how could we mitigate it?
Thanks in advance

Hi @SmartStuff!

Welcome to Toradex Community! :tada: :partying_face:
Feel free to browse around :smiley:

Thanks a lot for such detailed description of the issue you are facing. The reproducing steps are really helpful!

As you found in Linux Kernel mainling list, Torade R&D has faced an issue that seems related to the issue you are sharing here.

The investigation about the issue on the Linux Kernel mainling list is currently ongoing.

Please allows some time to investigate if the issue you are facing is related to the one being investigated.

We will get back to you as soon as possible.

Best regards,

Hi @SmartStuff !

I just tested this on my side (and @rafael.tx tested on his side).

Turns out we couldn’t reproduce the issue you pointed out in your first message.

Here is what we tested:

Modules & OS tested

  • Verdin AM62 Quad 2GB WB IT V1.1A (Torizon OS 6.5.0-build.8)
  • Verdin AM62 Quad 2GB WB IT V1.1A (TDX Wayland with XWayland 6.6.0-devel-20240125+build.506)
  • Verdin AM62 Dual 1GB WB IT V1.1A (TDX Wayland with XWayland 6.5.0+build.9 (kirkstone))

Carrier boards

  • Verdin Development Board V1.1A
  • Verdin Development Board V1.1B

Devices connected

  • LightBlue app v1.9.9(50) on Android (no Windows PC around :man_shrugging:)

We executed the steps you shared way more than 10 times and no issue happened.

Some comments about what we could see (or not see) during the tests:

  1. Verdin AM62 always managed to figure out the disconnection from LightBlue app and bluetoothctl on Verdin AM62 printed the suitable messages indicating the disconnection
  2. The kernel never showed any WARNING message
  3. Verdin AM62 never failed to connect nor disconnect (@Rafael Beims’s phone sometimes failed to connect - nothing permanent -, but it is possible that it was not Verdin’s fault)
  4. After disconnecting on LightBlue and before issuing the advertise off in bluetoothctl , a strange device called verdin-am6 (yes, it is missing the 2 in AM62) appears in the list of devices in LightBlue and the app fails to connect to it. After advertise off it disappears. After advertise on, Verdin AM62 appears in the list of devices in LightBlue as expected with the name defined in name <advertised name>.

Related to 4.: We currently do not know what is this verdin-am6. For now we just recommend not using it to connect, as the connection simply fails.

Could you please try to use something other than BLEConnect/Windows to connect to the module? Maybe this BLEConnect/Windows is misbehaving and making the connection fail?

Best regards,

Hello @henrique.tx,

I was also able to reproduce the issue with an Android device using the nRF Connect application.

Our primary use case consists of multiple centrals, usually mobile devices, connecting to a single peripheral device simultaneously.

I created a sample Python code that does not stop advertising after a device gets connected/disconnected. It works for a short period of time; however, if enough attempts are made, the same scenario as described by @SmartStuff can be achieved.

To reproduce the issue, use two Android devices and connect/disconnect from the Toradex board multiple times in random order with random intervals. It’s worth noting that the Python code might not be immediately notified when a device is disconnected, but this shouldn’t prevent you from attempting to reconnect from the same device before the disconnection message is printed in the logs.

https://gist.github.com/giorgib/a74f15e591979036c154fab5520b22c2

Hi @baska !

Thanks for sharing!

We will perform more tests targeting your use case and let you know here as soon as we have more news.

Best regards,

Hello @baska, @SmartStuff,
We’re going to report this issue to NXP as it seems to be related to the firmware. As soon as we have any information we’re going to report back here.

Hi @SmartStuff and @baska !

I would like to share the outcome of some more Bluetooth (BT) tests I did on my side with bluetoothctl on Verdin AM62 and three smartphones.

My hardware

  • Verdin AM62 Quad 2GB WB IT V1.1A (Torizon OS 6.5.0-build.8 - latest quarterly)
  • Carrier Board
    • Verdin Development Board V1.1A
    • Yavia V1.0A

Phones used

Phone Android version Nickname
Samsung Galaxy S22 14 s22
Samsung Galaxy S21 FE 14 s21
Samsung Galaxy XCover Pro 13 xcover

Android apps used

The phones are either running running Light Blue (version 1.9.9 (50)) or nRF Connect (version 4.28.0).

BT setup on AM62

The bluetoothctl setup done starts with rfkill to unblock the Bluetooth and proceeds just like your procedure from your first message:

rfkill ublock 0
bluetoothctl
power on
menu advertise
name xablau
tx-power on
back
advertise on

Initial BT scanning

Then the xablau device is listed when scanning on both s22, s21 and xcover.

Connecting the first phone

The BT connection with the first device right after the advertise on worked successfully almost everytime. One time or another I had to issue another connection attempt, but this was really rare.

Either phones worked well regardless of the app. The xcover showed some strange behavior, as the Light Blue app was showing no devices on the BT device list after scanning (not even some other BT devices I have around). But after some back and forth, Light Blue worked on xcover as well.

After connecting the first phone

As said in my previous tests, the xablau device is not visible anymore and some strange verdin-am6 device shows up in the list of scanned devices on all phones and both apps (nRF and LightBlue). Trying to connect to this verdin-am6 always fails and bluetoothctl shows nothing about it. Even the BT MAC for this verdin-am6 is different from the BT MAC of xablau. After redoing some of this tests, I found out that the BT MAC of verdin-am6 is not even constant.

Stopping AM62 from advertising (advertise off) makes this strange and useless verdin-am6 go away.

Connecting a second phone (and third, and probably forth, and etc)

I just couldn’t find a way to connect more than one phone without restarting advertising (advertise off / advertise on). Restarting advertising after each phone connection allowed me to connect all 3 phones (command bluetoothctl devices Connected):

image

One time I managed to connect 2 phones at once (without restarting advertisement between their connection) by tapping connect on both at the same time. Also, one time, I managed to reproduce the initial issue reported here by doing the same thing: tapping on 2 phones to connect at once.

Conclusion

With the setup I did, seems like the connection/disconnection using several devices is not that bad.
I don’t know if the restarting of the advertisement to connect/reconnect phones is something common or it should not be like this (therefore it might be a bug).
But restarting the advertisement was the only thing that allowed me to connect more than one phone at the same time (or even disconnect and reconnect the same phone).

Having said all this, as @rafael.tx said, as soon as we have information from NXP, we will update here.

I just wanted to share the results of my last tests.

Best regards,

Thanks for doing the tests @henrique.tx ! Here are couple of areas where we like to have a clarity.
Did you run any tests using Python code @baska provided?
Since NXP dealing with issue includes lot of uncertainty (may take days or may take months) we would like like to understand what other options we have. I had had discussion with Cyril that as long as AM62 bluetooth module has the firmware issue, we could also try to run our solution on iMX8M mini. We checked that there a no pinout conflicts, and now we would like to know does iMX8M WiFi/Bluetooth module works without experiencing the same issue that AM62 currently has? Having this knowledge would allow us to make decisions about how to proceed.
Best Regards,
Paul

Hi @paullumi!

Sorry for the delay here.

No, I didn’t run the tests using any custom code. I only used the tools available on the OS. This way, if the issue appears, only the OS can be blamed. But I still need to test it using your test application.

I didn’t test it myself, but I expect it to work well, as we have no other customers complaining about such issue. But I will also take some time to perform the same test on a Verdin iMX8M Mini. Also, be aware that you might need to adapt the device tree, as its organization is not the same between SoCs from different vendors.

Best regards,

I have been able to reproduce this error on the AM62 using python code (using bleak). The code runs fine on the IMX8.
The issue happens when we call disconnect on a connected device. I also found that I can manually disconnect the BLE device using:
$ sudo hcitool ledc <handle>
which unlocks the bluetooth module and then it works again (without needing to reboot)
Also,
$ bluetoothctl disconnect <mac_address>
does not work

Hello @paullumi, @baska,
I could reproduce the “Frame reassembly failed” problem with the provided script. In my case, I only had one device connect / disconnect, but I also left the system idle for a while. After some time I came back and tried to connect, and that’s when the issue was triggered.

We are working together with NXP to find a solution for this issue.

@izhar.hassan I didn’t try your method to recover the module, but I’ll try to reproduce it again and check if it can be brought back the way you described.

Best regards,
Rafael

1 Like

Hello @SmartStuff, @baska, @izhar.hassan,
We found that when the “Frame reassembly” error is printed, we have a UART packet that’s missing a byte. We are working to find the cause for this random error, but in the meantime, we merged a patch that makes the driver recover from this failure.
In my tests, I could continue the connection/disconnection cycles even after I saw the “Frame reassembly” error on the kernel output.

Here’s the patch:
https://git.toradex.com/cgit/linux-toradex.git/commit/?h=toradex_ti-linux-6.1.y&id=67b3310ce2f8bb2d579d94cd618f1441bbe1d9f8

Our latest nightly builds should already include it as well.

Could you test this on your side and give feedback on whether it improves the situation?

Best regards,
Rafael