Eth0 TX ring dump

Dear Community

In a customer project, we are seeing crashes that we cannot explain.
An SQL server is running on a VF61 (MariaDB 5.5.57 / rocko). An external client reads and writes values from/to the database over the eth0 interface.
An executable (written in C++) also runs on the VF61. It retrieves values from the database and controls sensors and actuators via Modbus (UART configured as RS-485). It also writes values from the sensors back to the database.
The I2C interface is also used to control I/O expanders.

The system runs without problems. But unfortunately, after several hours (12h or more) the VF61 module sometimes crashes and restarts.

CPU load is about 46% and memory usage about 92MB.

Before the crash, the following entries appear in the syslog:
colibri-vf kernel: watchdog watchdog0: watchdog did not stop!
colibri-vf kernel: fec 400d1000.ethernet eth0: TX ring dump

After the crash the syslog files in /var/log/journal are corrupted.

After some research I found this patch which should fix the problem: linux-toradex.git - Linux kernel for Apalis, Colibri and Verdin modules
But this patch is already included in BSP 2.8.8.

It is very difficult to find the cause of this error, as it occurs very rarely. I am grateful for any help and tips.

Best regards Stefan

Software summary
------------------------------------------------------------
Bootloader: U-Boot
Kernel version: 4.4.220-rt197-2.8.8+ga54df82 #1 PREEMPT RT Wed Jun 26 12:12:14 UTC 2024
BSP: 2.8.8
Kernel command line: user_debug=30 ubi.mtd=ubi root=ubi0:rootfs rw rootfstype=ubifs ubi.fm_autoconvert=1 console=tty1 console=tty0,115200n8 consoleblank=0 mtdparts=vf610_nfc:128k(vf-bcb)ro,1408k(u-boot)ro,512k(u-boot-env),-(ubi)
Distro name: NAME="Angstrom"
Distro version: VERSION_ID="v2017.12"
Distro variant: -
Hostname: colibri-vf
------------------------------------------------------------

Hardware info
------------------------------------------------------------
HW model: Toradex Colibri VF61 on IFM2 Board
Toradex version: 0012 V1.2B
Serial number: 06618447
Processor arch: armv7l
------------------------------------------------------------

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!


Hello,

For the issue described with the VF61 module experiencing crashes after several hours of operation, there are several steps and considerations to help isolate and potentially resolve the problem:

  1. Kernel and Driver Updates: Ensure that your system is running the latest available kernel and drivers provided by Toradex for the VF61 module. Although you mentioned that the patch is already included in BSP 2.8.8, it’s worth checking for any newer patches or updates that might address stability issues not covered by your current version.

  2. Watchdog Timer: The message watchdog watchdog0: watchdog did not stop! suggests that the system watchdog is active. If the system is becoming unresponsive due to high CPU load or other resource constraints, the watchdog may trigger a reset. Review the watchdog configuration and consider adjusting the timeout period or temporarily disabling it for debugging purposes to see if the crashes still occur.

  3. Network Traffic Analysis: Since the issue seems related to network activity (eth0: TX ring dump), closely monitor the network traffic leading up to a crash. Tools like tcpdump can help capture the traffic. This might reveal if there’s a specific pattern or type of network activity causing the issue.

  4. Memory and Resource Management: With a memory usage reported at around 92MB, it’s crucial to ensure there’s no memory leak in the application or the database that could lead to resource exhaustion over time. Utilize tools like valgrind to check for memory leaks in your C++ application.

  5. Thermal Management: Ensure that the system is not overheating. Continuous high CPU load can lead to increased temperatures, potentially causing system instability. Check the system’s thermal management strategies and ensure adequate cooling.

  6. Power Supply Stability: Verify that the power supply is stable and capable of delivering the required current under load. Power supply issues can lead to unpredictable system behavior.

  7. I2C and RS485 Interfaces: Since both I2C and RS485 interfaces are in use, ensure that there are no conflicts or resource contention issues between these interfaces and the network operations. This includes checking for proper locking mechanisms in your application when accessing shared resources.

  8. System Logs and Core Dumps: Since the syslog files are corrupted after a crash, consider configuring an external syslog server to capture logs remotely or enable core dump capturing. This might provide more insights into what happens right before the system crashes.

  9. Stress Testing: Conduct stress tests on the system focusing on

Please, also check if these links can help you solve your problem:

Running FreeRTOS on the Cortex-M4 of a Colibri iMX7 | Toradex Developer Center.

Are you using a Watchdog timer? If so, please disable it temporarily to see if the issue persists.

Regarding the eth0: TX ring dump message, this usually occurs when TCP data communication is disconnected. If the network problem is intermittent, increasing the buffer size may help.
In fec.h, you can find the TX_RING_SIZE line and increase its value. For example:

#define TX_RING_SIZE 512 // Increase to 1024 or another suitable value

To gather more debug information, I recommend establishing a Debug UART connection with your development machine and enabling log saving in your terminal application. This way, in the event of a crash, you can easily retrieve the saved log from your development machine for further analysis.

Hi Alex

thx for your reply. I have increased the value to 1024 and will test it.

Could this problem be related to this issue?

e6358: ENET: Write to Transmit Descriptor Active Register (ENET_TDAR) is ignored
Errata type: Errata
Description: If the ready bit in the transmit buffer descriptor (TxBD[R]) is previously detected as not set
during a prior frame transmission, then the ENET_TDAR[TDAR] bit is cleared at a later time,
even if additional TxBDs were added to the ring and the ENET_TDAR[TDAR] bit is set. This
results in frames not being transmitted until there is a 0-to-1 transition on ENET_TDAR[TDAR].

Workaround: Code can use the transmit frame interrupt flag (ENET_EIR[TXF]) as a method to detect
whether the ENET has completed transmission and the ENET_TDAR[TDAR] has been
cleared. If ENET_TDAR[TDAR] is detected as cleared when packets are queued and waiting
for transmit, then a write to the TDAR bit will restart TxBD processing.

Do you know if this workaround is implemented in BSP2.8?
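
For reference, the workaround logic quoted from the erratum can be sketched in a few lines of C. This is a hardware-free model and not code from the fec driver: the struct `tx_model` and the function `tx_rekick_if_stalled` are hypothetical names, and the register is modelled as a plain variable. Mainline fec sources carry a quirk flag named FEC_QUIRK_ERR006358 for this erratum; whether it is enabled for the Vybrid in BSP 2.8 would have to be checked in the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the state the erratum refers to. */
struct tx_model {
    uint32_t tdar;   /* models ENET_TDAR[TDAR]: nonzero means TX is active */
    unsigned queued; /* descriptors queued for transmit, not yet completed */
};

/*
 * Intended to be called from the transmit-frame (ENET_EIR[TXF]) path.
 * If frames are still queued but TDAR reads as cleared, write TDAR again
 * so the controller resumes descriptor processing.
 * Returns true if a re-kick was needed.
 */
static bool tx_rekick_if_stalled(struct tx_model *m)
{
    if (m->queued > 0 && m->tdar == 0) {
        m->tdar = 1; /* on real hardware: a write to ENET_TDAR */
        return true;
    }
    return false;
}
```

In a real driver the check would read the register through the driver's I/O accessors. The point of the sketch is only the condition: frames pending and TDAR cleared.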

Hi Senge,

you didn’t answer the question about the watchdog. IIRC, I2C recovery was not implemented in the old BSPs. So if your watchdog servicing waits on an I2C call and I2C gets stuck, that may lead to a watchdog reset.
I wanted to suggest triggering a watchdog reset deliberately, as the old Knowledge Base suggested. The old KB seems to be gone; I see only KB articles for Windows CE. No more KB for Linux? So just google how to trigger a watchdog reset; perhaps it will produce the same kernel messages you are seeing.

Hello @Edward,

Thanks for helping us here! Could you please specify which developer page article you are referring to?

@Edward

Sorry for my incomplete response. The problem with the watchdog has been solved. Our application used the watchdog API and did not reset it correctly.
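
For readers who hit the same message: as far as the Linux watchdog core is concerned, "watchdog did not stop!" is logged when the watchdog device is closed while the timer is still running, for example because the process exited without sending the magic close character. A minimal sketch of the open / kick / clean-close sequence, assuming the standard /dev/watchdog character device interface, could look like this. The helper names `wd_open`, `wd_kick` and `wd_close_clean` are hypothetical and not taken from the application discussed here.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open the watchdog device. Opening it starts the countdown. */
static int wd_open(const char *path)
{
    return open(path, O_WRONLY);
}

/* Any write to the device counts as a keepalive ("kick"). */
static int wd_kick(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/*
 * Magic close: write 'V' right before close() so that a driver with
 * magic-close support stops the timer instead of leaving it running.
 * Returns 0 on success, -1 if the write or the close failed.
 */
static int wd_close_clean(int fd)
{
    int ok = (write(fd, "V", 1) == 1);

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A typical main loop would call `wd_kick()` once per cycle, well inside the configured timeout, and `wd_close_clean()` on every orderly exit path. Note that a kernel built with CONFIG_WATCHDOG_NOWAYOUT will not stop the timer even after the magic close.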

The TX ring dump problem persists. As already mentioned, I have set the buffer to 1024 and the tests are still running. I will give feedback as soon as I have further findings.

@rudhi.tx

I found the information here:
NXP Errata Page 5

Unfortunately, increasing the buffer has not improved the situation.

journal.log:

Jul 03 15:51:20 colibri-vf kernel: ------------[ cut here ]------------
Jul 03 15:51:20 colibri-vf kernel: WARNING: CPU: 0 PID: 4 at /home/dev/ifm2_project/custom_data/build/tmp-glibc/work-shared/colibri-vf/kernel-source/net/sched/sch_generic.c:306 dev_watchdog+0x268/0x274()
Jul 03 15:51:20 colibri-vf kernel: NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Jul 03 15:51:20 colibri-vf kernel: Modules linked in: virtio_rpmsg_bus usb_f_rndis u_ether vf610_rpmsg virtio virtio_ring libcomposite configfs
Jul 03 15:51:20 colibri-vf kernel: CPU: 0 PID: 4 Comm: ktimersoftd/0 Not tainted 4.4.220-rt197-2.8.8+ga54df82 #2
Jul 03 15:51:20 colibri-vf kernel: Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
Jul 03 15:51:20 colibri-vf kernel: Backtrace: 
Jul 03 15:51:20 colibri-vf kernel: [<800133e8>] (dump_backtrace) from [<80013600>] (show_stack+0x18/0x1c)
Jul 03 15:51:20 colibri-vf kernel:  r7:80534fbc r6:00000132 r5:00000009 r4:00000000
Jul 03 15:51:20 colibri-vf kernel: [<800135e8>] (show_stack) from [<802db2d0>] (dump_stack+0x24/0x28)
Jul 03 15:51:20 colibri-vf kernel: [<802db2ac>] (dump_stack) from [<80022a00>] (warn_slowpath_common+0x88/0xb4)
Jul 03 15:51:20 colibri-vf kernel: [<80022978>] (warn_slowpath_common) from [<80022a64>] (warn_slowpath_fmt+0x38/0x40)
Jul 03 15:51:20 colibri-vf kernel:  r8:808e5380 r7:808b73c0 r6:8e7417b0 r5:8e5f1800 r4:80826e2c
Jul 03 15:51:20 colibri-vf kernel: [<80022a30>] (warn_slowpath_fmt) from [<80534fbc>] (dev_watchdog+0x268/0x274)
Jul 03 15:51:20 colibri-vf kernel:  r3:8e5f1800 r2:80826e2c
Jul 03 15:51:20 colibri-vf kernel:  r4:00000000
Jul 03 15:51:20 colibri-vf kernel: [<80534d54>] (dev_watchdog) from [<80065438>] (call_timer_fn.constprop.7+0x30/0xa0)
Jul 03 15:51:20 colibri-vf kernel:  r10:8e5f1800 r9:80534d54 r8:808b73c0 r7:80534d54 r6:00000000 r5:00000000
Jul 03 15:51:20 colibri-vf kernel:  r4:ffffe000
Jul 03 15:51:20 colibri-vf kernel: [<80065408>] (call_timer_fn.constprop.7) from [<80065644>] (run_timer_softirq+0x140/0x210)
Jul 03 15:51:20 colibri-vf kernel:  r7:00000000 r6:808b7400 r5:00000000 r4:808b7400
Jul 03 15:51:20 colibri-vf kernel: [<80065504>] (run_timer_softirq) from [<80025a9c>] (do_current_softirqs+0x1ac/0x26c)
Jul 03 15:51:20 colibri-vf kernel:  r10:00000000 r9:00000001 r8:04208140 r7:00000020 r6:ffffe000 r5:808b4250
Jul 03 15:51:20 colibri-vf kernel:  r4:00000004
Jul 03 15:51:20 colibri-vf kernel: [<800258f0>] (do_current_softirqs) from [<80025db8>] (run_ksoftirqd+0x34/0x64)
Jul 03 15:51:20 colibri-vf kernel:  r10:00000000 r9:00000000 r8:00000000 r7:00000001 r6:808b421c r5:ffffe000
Jul 03 15:51:20 colibri-vf kernel:  r4:ffffe000
Jul 03 15:51:20 colibri-vf kernel: [<80025d84>] (run_ksoftirqd) from [<80041750>] (smpboot_thread_fn+0x2b4/0x2b8)
Jul 03 15:51:20 colibri-vf kernel:  r5:ffffe000 r4:8e418ac0
Jul 03 15:51:20 colibri-vf kernel: [<8004149c>] (smpboot_thread_fn) from [<8003e1c8>] (kthread+0x108/0x110)
Jul 03 15:51:20 colibri-vf kernel:  r9:00000000 r8:8004149c r7:8e418ac0 r6:8e462000 r5:8e418b40 r4:00000000
Jul 03 15:51:20 colibri-vf kernel: [<8003e0c0>] (kthread) from [<8000fb18>] (ret_from_fork+0x14/0x3c)
Jul 03 15:51:20 colibri-vf kernel:  r8:00000000 r7:00000000 r6:00000000 r5:8003e0c0 r4:8e418b40
Jul 03 15:51:20 colibri-vf kernel: ---[ end trace 0000000000000002 ]---
Jul 03 15:51:20 colibri-vf kernel: fec 400d1000.ethernet eth0: TX ring dump
Jul 03 15:51:20 colibri-vf kernel: Nr     SC     addr       len  SKB
Jul 03 15:51:20 colibri-vf kernel:   0    0x9c00 0x8e7bb800   42 87416f00
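
The SC column of the ring dump is the status/control word of each transmit buffer descriptor. A small decoder makes it easier to read. The sketch below assumes the bit values of the BD_ENET_TX_* definitions as found in the mainline fec.h (only the upper control bits are listed); they should be verified against the kernel tree actually in use. The names `tx_sc_flags` and `decode_tx_sc` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sc_flag {
    uint16_t mask;
    const char *name;
};

/* Assumed to match BD_ENET_TX_* in fec.h; verify against your tree. */
static const struct sc_flag tx_sc_flags[] = {
    { 0x8000, "READY" }, /* descriptor is owned by the controller */
    { 0x4000, "PAD"   },
    { 0x2000, "WRAP"  }, /* last descriptor in the ring */
    { 0x1000, "INTR"  },
    { 0x0800, "LAST"  }, /* last buffer of the frame */
    { 0x0400, "TC"    }, /* append CRC */
};

/* Write the names of all set flags into out, separated by spaces.
 * out_len must be at least 1. */
static void decode_tx_sc(uint16_t sc, char *out, size_t out_len)
{
    out[0] = '\0';
    for (size_t i = 0; i < sizeof tx_sc_flags / sizeof tx_sc_flags[0]; i++) {
        if (sc & tx_sc_flags[i].mask) {
            if (out[0] != '\0')
                strncat(out, " ", out_len - strlen(out) - 1);
            strncat(out, tx_sc_flags[i].name, out_len - strlen(out) - 1);
        }
    }
}
```

Under these assumptions, the value 0x9c00 from entry 0 decodes to READY, INTR, LAST and TC. A READY bit that is still set when the transmit timeout fires means the controller had not processed that descriptor yet, which would be consistent with a stalled transmitter but does not by itself prove that erratum e6358 is the cause.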

Hi @rudhi.tx,

actually, I can’t find it, so I can’t tell you what it was called. In the past there was a whole bunch of useful advice in the KB, not only for Windows CE but for Linux as well: how to trigger the watchdog, how to use sleep mode, etc. It was very easy to find by googling “toradex knowledge base”. Now even the Toradex search returns only articles for Windows CE. Could you please tell me how to navigate to the full old Knowledge Base?

Best Regards,
Edward

Edit:

found it on wayback machine:
Knowledge Base (archive.org)
Watchdog (Linux) (archive.org)

Why did all of that have to disappear?