Kernel panic when shutting down coprocessor (Verdin imx8mm)

davidkhess · January 31, 2024, 3:16pm

We are using the Remote Processor kernel support and RPMsg for integration and management from the Linux side to the RTOS running on the M4 coprocessor. For the most part, it works well.

Occasionally we get a kernel panic when using Remote Processor to “stop” the coprocessor:

[25016.237134] Unable to handle kernel paging request at virtual address ffff800015b3a002
[25016.245244] Mem abort info:
[25016.248053]   ESR = 0x0000000096000007
[25016.251824]   EC = 0x25: DABT (current EL), IL = 32 bits
[25016.257140]   SET = 0, FnV = 0
[25016.260216]   EA = 0, S1PTW = 0
[25016.263363]   FSC = 0x07: level 3 translation fault
[25016.268242] Data abort info:
[25016.271147]   ISV = 0, ISS = 0x00000007
[25016.274991]   CM = 0, WnR = 0
[25016.277960] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000049c2d000
[25016.284680] [ffff800015b3a002] pgd=10000000bffff003, p4d=10000000bffff003, pud=10000000bfffe003, pmd=1000000075692003, pte=0000000000000000
[25016.297273] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[25016.302859] Modules linked in: rpmsg_ctrl rpmsg_char imx_rpmsg_tty xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter ip_tables x_tables br_netfilter bridge stp llc mwifiex_sdio mwifiex bnep overlay cfg80211 mcp251xfd can_dev cm
[25016.356332] CPU: 1 PID: 95780 Comm: python Tainted: G           O      5.15.129-6.4.0+git.67c3153d20ff #1-TorizonCore
[25016.366955] Hardware name: Toradex Verdin iMX8M Mini WB on Yavia Board (DT)
[25016.373924] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[25016.380891] pc : virtqueue_get_buf_ctx_split+0x28/0x180
[25016.386132] lr : virtqueue_get_buf+0x30/0x40
[25016.390411] sp : ffff800015db3a80
[25016.393727] x29: ffff800015db3a80 x28: ffff80000a7022a0 x27: 0000000000000007
[25016.400870] x26: ffff0000077dec00 x25: ffff00000e76c0c0 x24: ffff00000709bf00
[25016.408015] x23: 0000000000000007 x22: 0000000000000100 x21: ffff0000014e1f40
[25016.415162] x20: ffff0000014e1f00 x19: ffff000006c3cd00 x18: 0000000000000000
[25016.422306] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffa5db3fb0
[25016.429452] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[25016.436596] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800015db3eb0
[25016.443742] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff0000075c6e40
[25016.450888] x5 : 0000000000000001 x4 : ffff800015db3ae0 x3 : ffff0000014e1f40
[25016.458033] x2 : 0000000000000000 x1 : 00000000000002cf x0 : ffff800015b3a000
[25016.465182] Call trace:
[25016.467631]  virtqueue_get_buf_ctx_split+0x28/0x180
[25016.472515]  virtqueue_get_buf+0x30/0x40
[25016.476441]  rpmsg_send_offchannel_raw+0x44c/0x4f0
[25016.481240]  virtio_rpmsg_send+0x28/0x34
[25016.485167]  rpmsg_send+0x20/0x40
[25016.488488]  rpmsgtty_write+0x54/0xb0 [imx_rpmsg_tty]
[25016.493551]  n_tty_write+0x2c0/0x48c
[25016.497134]  file_tty_write.constprop.0+0x130/0x294
[25016.502016]  tty_write+0x14/0x20
[25016.505248]  new_sync_write+0xec/0x18c
[25016.509004]  vfs_write+0x24c/0x2b0
[25016.512409]  ksys_write+0x6c/0x100
[25016.515817]  __arm64_sys_write+0x1c/0x30
[25016.519744]  invoke_syscall+0x48/0x114
[25016.523499]  el0_svc_common.constprop.0+0xd4/0xfc
[25016.528209]  do_el0_svc+0x28/0xa0
[25016.531526]  el0_svc+0x28/0x80
[25016.534589]  el0t_64_sync_handler+0xa4/0x130
[25016.538863]  el0t_64_sync+0x1a0/0x1a4
[25016.542533] Code: 35000700 f9403660 aa0103e4 79409261 (79400400) 
[25016.548634] ---[ end trace bc845368ab15e73f ]---
[25016.553257] Kernel panic - not syncing: Oops: Fatal exception
[25016.559009] SMP: stopping secondary CPUs
[25016.563249] Kernel Offset: disabled
[25016.566739] CPU features: 0x0,00002001,20000846
[25016.571276] Memory Limit: none
[25016.574336] Rebooting in 5 seconds..

To me, this looks like a write on the tty occurring after the coprocessor shutdown unmapped the memory region being used to communicate with it.
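As a cross-check on that reading, the ESR value from the log can be decoded by hand. The sketch below assumes the standard ARMv8-A ESR_ELx layout for a data abort (EC in bits 31:26, IL in bit 25, WnR in bit 6, fault status code in bits 5:0); the function name is purely illustrative. It reproduces the fields the kernel printed. WnR = 0 means the faulting access was a read, which would fit the tty write path reading a virtqueue ring that is no longer mapped. Note also that the faulting address, ffff800015b3a002, is x0 + 2.

```python
def decode_data_abort_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields shown in the oops.

    Field positions follow the ARMv8-A data abort encoding; this is a
    hand-written helper for illustration, not a kernel or library API.
    """
    iss = esr & 0x1FFFFFF          # ISS: bits 24:0
    return {
        "ec": (esr >> 26) & 0x3F,  # exception class: bits 31:26
        "il": (esr >> 25) & 0x1,   # instruction length: 1 = 32-bit
        "wnr": (iss >> 6) & 0x1,   # 0 = read access, 1 = write access
        "fsc": iss & 0x3F,         # fault status code: bits 5:0
    }


# ESR value taken from the log line "ESR = 0x0000000096000007"
fields = decode_data_abort_esr(0x96000007)
print({name: hex(value) for name, value in fields.items()})
```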

Seems the immediate answer is “don’t do that”, i.e. we should shut down our communication and close the tty before attempting to shut down the processor.
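A minimal sketch of that ordering on the Python side might look like the following. The class name `CoprocessorLink` is hypothetical, and the paths in the usage note are assumptions that depend on the system; the only kernel interface relied on is the remoteproc sysfs `state` file, which accepts `stop`. A lock ensures no write is in flight when the stop is issued. This is a userspace workaround only and has not been verified to close the race inside the driver.

```python
import threading


class CoprocessorLink:
    """Serialises tty writes against the remoteproc stop (illustrative only)."""

    def __init__(self, tty_path: str, state_path: str):
        self._state_path = state_path
        self._lock = threading.Lock()
        # Unbuffered, so each send() is a single write() on the tty.
        self._tty = open(tty_path, "wb", buffering=0)

    def send(self, payload: bytes) -> bool:
        """Write a message; return False if the link is already shut down."""
        with self._lock:
            if self._tty is None:
                return False
            self._tty.write(payload)
            return True

    def stop(self) -> None:
        """Close the tty first, then ask remoteproc to stop the coprocessor."""
        with self._lock:
            if self._tty is not None:
                self._tty.close()
                self._tty = None
            with open(self._state_path, "w") as state:
                state.write("stop")
```

Usage would be along the lines of `CoprocessorLink("/dev/ttyRPMSG30", "/sys/class/remoteproc/remoteproc0/state")`, where both the tty device name and the remoteproc index are assumptions to be checked on the target.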

Even so, it would be good if the Remote Processor code handled that case correctly instead of panicking.

Can anybody confirm my suspicions and if so, perhaps suggest a way to get a patch to fix this?

eric.tx · February 1, 2024, 9:35pm

Hey @davidkhess,

I can ping some of our domain experts to see if they have seen this error before. Can you give more background on your setup? Such as:

Can you isolate this error to a repeatable event?
Do you think the error pertains only to rpmsg communication? (i.e. are the error messages the same each time it occurs?)
Any device tree modifications that play a significant role in your setup?
Are you on Torizon BSP 5 or 6? (Yocto build?)
Can you share generally what you are utilizing rpmsg for? [data transfer?]
And maybe anything else you think may pertain to this.

Thanks

-Eric

davidkhess · February 1, 2024, 10:12pm

It seems to happen if we have the tty open and are writing to it when we stop the coprocessor. It doesn’t happen often, but I suppose we might be able to make it happen repeatedly if we write to the tty constantly. At the moment we write to it about once every 50 ms on average.

Since the stack trace shows it passing through tty and rpmsg kernel code and the panics occur when we stop the coprocessor, I’m pretty certain this pertains to RPMsg. This is the first panic I’ve been able to collect off of the console so far.

Device tree modifications:

Enable “hmp” (i.e. Remote Processor) overlay
Disable UART1 and UART2 on the linux side so the M4 has exclusive access
Disable GPIO 1-4 on the linux side so the M4 has exclusive access

Torizon BSP 6.

RPMsg is used to send fairly short control messages to the coprocessor and get a response back. Nothing really magical going on there. It seems to work reliably other than this panic.

davidkhess · February 7, 2024, 5:12pm

So, I reported this to the Linux Remote Proc maintainers and mailing list. They reported back that this stack trace is going through code not in the main Linux kernel. I investigated and found it in the NXP fork here:

Does Toradex have a communication channel with NXP for reporting this? Or should I just glean some email addresses from the commits to the IMX code there and try and contact them?

I should note, somebody else reported basically the same issue here in 2022:

eric.tx · February 15, 2024, 5:59pm

Hey @davidkhess,

After reviewing with some team members, it does look like this issue could be addressed in the rpmsg tty driver. It is also worth noting that there’s an rpmsg tty driver upstream @…/drivers/tty/rpmsg_tty.c, so if you’re wanting to work on a patch, it’s worth considering working on the upstream side of things.
The NXP community boards are most likely the best place to get attention from NXP. As you noted, this has been posted about before, but it may be due for another attempt.

-Eric