In a customer project, we are seeing crashes that we cannot explain.
An SQL server is running on a VF61 (MariaDB 5.5.57 / rocko). An external client reads and writes values from/to the database over the eth0 interface.
An executable (written in C++) also runs on the VF61. It retrieves values from the database, controls sensors and actuators via Modbus (UART configured as RS485), and writes the sensor values back to the database.
The I2C interface is also in use to control IO expanders.
The system initially runs without problems. Unfortunately, after several hours (12 h or more), the VF61 module sometimes crashes and restarts.
CPU load is about 46% and memory usage about 92MB.
Before the crash, the following entries appear in the syslog:
colibri-vf kernel: watchdog watchdog0: watchdog did not stop!
colibri-vf kernel: fec 400d1000.ethernet eth0: TX ring dump
After the crash the syslog files in /var/log/journal are corrupted.
Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!
Handshake,
For the issue described with the VF61 module experiencing crashes after several hours of operation, there are several steps and considerations to help isolate and potentially resolve the problem:
Kernel and Driver Updates: Ensure that your system is running the latest available kernel and drivers provided by Toradex for the VF61 module. Although you mentioned that the patch is already included in BSP 2.8.8, it’s worth checking for any newer patches or updates that might address stability issues not covered by your current version.
Watchdog Timer: The message watchdog watchdog0: watchdog did not stop! suggests that the system watchdog is active. If the system is becoming unresponsive due to high CPU load or other resource constraints, the watchdog may trigger a reset. Review the watchdog configuration and consider adjusting the timeout period or temporarily disabling it for debugging purposes to see if the crashes still occur.
Network Traffic Analysis: Since the issue seems related to network activity (eth0: TX ring dump), closely monitor the network traffic leading up to a crash. Tools like tcpdump can help capture the traffic. This might reveal if there’s a specific pattern or type of network activity causing the issue.
Memory and Resource Management: With a memory usage reported at around 92MB, it’s crucial to ensure there’s no memory leak in the application or the database that could lead to resource exhaustion over time. Utilize tools like valgrind to check for memory leaks in your C++ application.
Thermal Management: Ensure that the system is not overheating. Continuous high CPU load can lead to increased temperatures, potentially causing system instability. Check the system’s thermal management strategies and ensure adequate cooling.
Power Supply Stability: Verify that the power supply is stable and capable of delivering the required current under load. Power supply issues can lead to unpredictable system behavior.
I2C and RS485 Interfaces: Since both I2C and RS485 interfaces are in use, ensure that there are no conflicts or resource contention issues between these interfaces and the network operations. This includes checking for proper locking mechanisms in your application when accessing shared resources.
System Logs and Core Dumps: Since the syslog files are corrupted after a crash, consider configuring an external syslog server to capture logs remotely or enable core dump capturing. This might provide more insights into what happens right before the system crashes.
Stress Testing: Conduct stress tests on the system focusing on
Are you using a Watchdog timer? If so, please disable it temporarily to see if the issue persists.
Regarding the eth0: TX ring dump message, this usually occurs when TCP data communication is disconnected. If the network problem is intermittent, increasing the buffer size may help.
In fec.h, you can find the TX_RING_SIZE line and increase its value. For example:
#define TX_RING_SIZE 512 // Increase to 1024 or another suitable value
To gather more debug information, I recommend establishing a Debug UART connection with your development machine and enabling log saving in your terminal application. This way, in the event of a crash, you can easily retrieve the saved log from your development machine for further analysis.
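To make a saved debug log more useful for the TX ring dump question, the application could periodically print the interface counters that Linux exposes under /sys/class/net/eth0/statistics/. The following is only a minimal sketch: the function names `readCounter` and `formatTxStats` are hypothetical, and the directory is passed in as a parameter so the code can be tried against any folder.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads a single numeric counter file such as <statsDir>/tx_errors.
// Returns false if the file is missing or does not contain a number.
bool readCounter(const std::string& statsDir, const std::string& name,
                 unsigned long long& value) {
    std::ifstream in(statsDir + "/" + name);
    return static_cast<bool>(in >> value);
}

// Builds one log line from the TX counters; unreadable counters are shown as "n/a".
// statsDir is assumed to be something like "/sys/class/net/eth0/statistics".
std::string formatTxStats(const std::string& statsDir) {
    const char* names[] = {"tx_packets", "tx_errors", "tx_dropped"};
    std::ostringstream line;
    for (const char* name : names) {
        unsigned long long value = 0;
        line << name << '=';
        if (readCounter(statsDir, name, value)) {
            line << value;
        } else {
            line << "n/a";
        }
        line << ' ';
    }
    return line.str();
}
```

Printing the result of `formatTxStats("/sys/class/net/eth0/statistics")` to the console every few seconds would show whether the error or drop counters start climbing before the crash.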
Thanks for your reply. I have increased the value to 1024 and will test it.
Could this problem be related to this issue?
e6358: ENET: Write to Transmit Descriptor Active Register (ENET_TDAR) is ignored
Errata type: Errata
Description: If the ready bit in the transmit buffer descriptor (TxBD[R]) is previously detected as not set during a prior frame transmission, then the ENET_TDAR[TDAR] bit is cleared at a later time, even if additional TxBDs were added to the ring and the ENET_TDAR[TDAR] bit is set. This results in frames not being transmitted until there is a 0-to-1 transition on ENET_TDAR[TDAR].
Workaround: Code can use the transmit frame interrupt flag (ENET_EIR[TXF]) as a method to detect whether the ENET has completed transmission and the ENET_TDAR[TDAR] has been cleared. If ENET_TDAR[TDAR] is detected as cleared when packets are queued and waiting for transmit, then a write to the TDAR bit will restart TxBD processing.
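To illustrate the logic of the workaround as I read it, here is a small sketch against a mock register block. It is not the fec driver code, and `MockEnet` and `handleTxComplete` are hypothetical names used only for illustration; a real driver would access memory-mapped registers instead.

```cpp
// Hypothetical stand-in for the ENET registers (not the real hardware layout).
struct MockEnet {
    bool tdar = false;      // models ENET_TDAR[TDAR]
    bool txf = false;       // models ENET_EIR[TXF]
    unsigned restarts = 0;  // counts TDAR writes, for illustration only

    void writeTdar() {
        tdar = true;
        ++restarts;
    }
};

// Called when a transmit-frame event may have occurred.
// pendingDescriptors is the number of TxBDs still queued for transmission.
// Returns true if TxBD processing had to be restarted.
bool handleTxComplete(MockEnet& enet, unsigned pendingDescriptors) {
    if (!enet.txf) {
        return false;  // no completed frame signalled, nothing to check
    }
    enet.txf = false;  // acknowledge the event

    // Errata condition: frames are waiting, but TDAR has been cleared.
    if (pendingDescriptors > 0 && !enet.tdar) {
        enet.writeTdar();  // 0-to-1 transition restarts TxBD processing
        return true;
    }
    return false;
}
```

The key point is that the check runs on every TXF event, so a cleared TDAR with queued packets is noticed and rewritten instead of leaving the ring stalled.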
Do you know if this workaround is implemented in BSP2.8?
You didn’t answer the question about the watchdog. On old BSPs, IIRC, I2C recovery was not implemented. So if the watchdog is not serviced while your application waits for an I2C call to return, a stuck I2C bus may lead to a watchdog reset.
I wanted to suggest deliberately triggering a watchdog reset, as the old Knowledge Base described. The old KB seems to be gone; I see only the KB for Windows CE and no longer one for Linux. So just google how to trigger a watchdog reset; perhaps it will end with the same kernel messages that you are seeing.
Sorry for my incomplete response. The problem with the watchdog has been solved. Our application used the watchdog API and did not reset it correctly.
The TX ring dump problem persists. As already mentioned, I have set the buffer to 1024 and the tests are still running. I will give feedback as soon as I have further findings.
Actually, I can’t find it, so I can’t tell you what it was called. There was a whole bunch of useful advice in the KB, not only for Windows CE but for Linux as well: how to trigger the watchdog, how to use sleep mode, and so on. It was very easy to find by googling “toradex knowledge base”. Now even the Toradex search shows only articles for Windows CE. Could you please tell me how to navigate to the full old Knowledge Base?
We migrated the documentation to a new domain and we had some link redirects from the archives to the new website.
The redirections have now been removed to allow full access to the developer archives.
You can find all the articles from the Toradex Developer Archives here:
Here is the specific article about watchdog on linux: