(Continuation) Colibri goes dark after crash: What could cause this?

Hi

This is a continuation of the following issue, since the other one was about how to generally debug this kind of problem. I did mark a solution there, but the problem suddenly reappeared.

The problem is still the same: the system goes dark and locks me out after some kind of error. I cannot write anything to the DB or to the console so that journald could log it. **If there is an error message, the system has no time to persist it.**

The only changes I made were to output some more debug info related to another problem. I did this the same way it is already done for many other messages.

So the question is: What could cause the Colibri VF61 board to shut off like it does for me? I would understand if the software crashed, even without an error message. But I don’t see a reason for the OS to crash like this too.
Are there general problems which could cause such a behavior?

  1. Network Connection not possible (while LED still on)
  2. Serial Console, with login shell on UART_A, does not respond either.

Both of these work on the running system. The problem seems rather random; it may be related to the weather for all I know. Until now, every time I think I’ve found something important, the behavior changes a few days later in such a way that my theory no longer makes sense (see the compiler options in the related thread)…

  1. Network Connection not possible (while LED still on)

That LED is controlled by the PHY, independently of whether any software is still running on the SoC, so it is not really a helpful indicator.

  1. Serial Console, with login shell on UART_A, does not respond either.

That would tell us that it probably did lock up. More interesting, however, would be the last messages the UART output before it locked up. Do you by any chance have the complete output of a failing module run?

Via UART I can get a login shell. The output would be whatever journalctl -u -f would print.
I can build an image to see if there is an error message on the UART, but I don’t believe I saw one last time.

Our regular BSP demo images would print the Linux kernel messages as well which is really what would be interesting.

I started the test with the serial console again. The first crash, while running the program directly from there, shows the regular log, which just stops at the moment of the crash. Nothing interesting there.

My second test runs cat /proc/kmsg, which hopefully displays some more info. Update: also nothing.

Maybe I can just activate the output of the kernel messages to this console, like it is in the BSP you mentioned? I cannot use any other image, because the software depends on various components to run.

My base image: angstrom-qt5-x11-image.bb
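Before rebuilding anything, it may help to check which consoles the running kernel was given and how verbose the console currently is. The following is only a sketch: `console_report` is a hypothetical helper name, and the two paths are parameters only so it can be tried on sample files. It reads the standard `/proc/cmdline` and `/proc/sys/kernel/printk` files.

```shell
#!/bin/sh
# Hypothetical helper: print the console= arguments and the console log level
# of the running kernel.
console_report() {
    cmdline="${1:-/proc/cmdline}"
    printk="${2:-/proc/sys/kernel/printk}"
    # Every console= entry on the kernel command line is listed, one per line.
    tr ' ' '\n' < "$cmdline" | grep '^console=' || true
    # The first number in printk is the console log level; kernel messages
    # with a priority value lower than this number are printed on the console.
    echo "console_loglevel=$(cut -f1 "$printk")"
}
```

If the serial port is listed but the log level is low, running `dmesg -n 8` as root raises it until the next reboot.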

Create a u-boot patch like this: Redirect Kernel Messages - Technical Support - Toradex Community
With this, the kernel messages will be shown

I’ve patched U-Boot to output the kernel messages via UART_A (ttyLP0); the patch is shown below. I tested it by plugging in a USB stick, which generated the expected output.

The crash does not output anything.

    ---
     include/configs/colibri_vf.h | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)
    
    diff --git a/include/configs/colibri_vf.h b/include/configs/colibri_vf.h
    index db610d5..5c15ebb 100644
    --- a/include/configs/colibri_vf.h
    +++ b/include/configs/colibri_vf.h
    @@ -190,7 +190,7 @@
     		"fatload ${interface} 0:1 ${loadaddr} " \
     		"${board}/flash_blk.img && source ${loadaddr}\0" \
     	"setup=setenv setupargs " \
    -		"console=tty1 console=${console}" \
    +		"console=${console}" \
     		",${baudrate}n8 ${memargs} consoleblank=0 ${mtdparts}\0" \
     	"setupdate=run setsdupdate || run setusbupdate || run setethupdate\0" \
     	"setusbupdate=usb start && setenv interface usb && " \
    -- 
    1.8.3.1

From all I read, the crash is more of a freeze. Do I see this right? There is no indication upfront whatsoever and no output from that point on?

So this freeze only happens when your application is running? As far as I understand, you used to use UART_A for RS485, but you now use it as the serial console. What other interfaces is your program using now?

It may be drastic and, depending on the load, not really possible, but you could try to run your program under strace to see whether the freeze always happens after a specific syscall.

It may very well be a freeze, yes.

In use are I2C and DI/DO, AI/AO. I’ve disabled RT for the hardware thread and for all other threads not used in my case. There is also one RS232 interface.

I don’t know what strace does exactly, but when compiling with debug symbols, the freeze does not occur. Hopefully strace does not interfere like debug mode does. I will try this later today, thank you.

I’ve also dealt with all the warnings valgrind memcheck reported and cleaned up most of the code inspection warnings from CLion, just in case.

I now believe this problem occurs more often. But then again, it could be because of the weather…

**Update:** I won’t be working on this issue for the next few days.

Thanks for the information.

I don’t know what strace does exactly, but when compiling with debug symbols, the freeze does not occur.

You really have to log the differences (build flags, RAM and CPU usage, …) between the normal and the debug build of the application.
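One way to collect comparable RAM and CPU numbers for both builds is to sample `/proc` periodically. This is a minimal sketch, not a finished tool: `sample_proc` is a hypothetical helper name, and it assumes a Linux `/proc` layout as described in proc(5).

```shell
#!/bin/sh
# Hypothetical helper: print one sample of memory and CPU usage for a PID.
# It reads only /proc, so no extra tools are needed on the target.
sample_proc() {
    pid="$1"
    # Resident memory in kB, as reported by the kernel.
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    # Drop everything up to the closing parenthesis of the command name,
    # because the name itself may contain spaces.
    stat=$(cat "/proc/$pid/stat")
    rest=${stat##*) }
    # $rest starts at the process state (field 3 of stat), so utime and
    # stime (fields 14 and 15) are words 12 and 13. Values are clock ticks.
    set -- $rest
    echo "$(date +%s) pid=$pid rss_kb=$rss utime=${12} stime=${13}"
}
```

Calling it in a loop, for example once a minute with the output appended to a file, gives one log per build that can be compared afterwards.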

Options for ARMv7 Debug and Release build:

    set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0 -Wall -pipe -g -feliminate-unused-debug-types -fno-inline")
    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -g0 -O3 -Wall")

    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--hash-style=gnu -Wl,--as-needed")
    set(CMAKE_EXE_LINKER_FLAGS_RELEASE "${CMAKE_EXE_LINKER_FLAGS} -Wl,-O1")

    if(ARC MATCHES ARMv7)
        set(CMAKE_CXX_FLAGS_ARM "-fno-tree-vectorize -mthumb-interwork -mfloat-abi=hard \
        -mtune=cortex-a5 -Wno-poison-system-directories -flto \
        -fmessage-length=0")

        if(CMAKE_BUILD_TYPE MATCHES Release)
            set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} ${CMAKE_CXX_FLAGS_ARM}")
        elseif(CMAKE_BUILD_TYPE MATCHES Debug)
            set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} ${CMAKE_CXX_FLAGS_ARM}")
        endif()
    endif()

    set(CMAKE_SYSTEM_NAME Linux)
    set(CMAKE_SYSTEM_PROCESSOR "armv7-a")

    set(CMAKE_SYSROOT /local/v2.7b5/sysroots/armv7at2hf-neon-angstrom-linux-gnueabi)
    set(CMAKE_CXX_COMPILER "/local/gcc-linaro/gcc-linaro-6.2.1/bin/arm-linux-gnueabihf-g++")

The software also crashes/freezes when idle. I set up a test to see whether it also happens when my software is not running at all, which would point to a problem with the image. After that I will try to strace the binary.
strace probably won’t help, because of the lack of output whenever the problem occurs. I imagine there won’t be time to persist the output or to display it on the UART.

Did you also consider external factors (e.g. hardware issue? Maybe the input voltage drops for a very short period?). Separating supply to the module as much as possible helps in such cases. And also using more powerful main supplies in case the regular supply’s output power is close to the expected maximum…

Linux is usually rather robust, user space crashes almost always would at least leave the console/kernel running. Kernel level crashes almost always lead to some kind of stack trace on the serial console. There are only very rare circumstances where there is really nothing to be seen. Typically when clocks or power domains get accidentally disabled…

Yes and no. The same hardware with the old image and software does not suffer from this problem. The same goes for the new image with a debug build of the software. Only the release build has this problem.

The problem also occurs when the software is just idling. In this case only the database is polled in the background.

There is no output on serial console and since the system freezes, the hardware has to be restarted. I just can’t get anything of value out of this thing.

strace is running now, but it takes a few hours for the problem to occur.

Since I have a compiler warning in Boost iostreams v1.61 in image 2.7b5, I am trying to update to 2.8 with Boost 1.64.

After a few hours, here is the last thing strace wrote:

    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "O\0\0\0\3update Value set sName='Mai"..., 83, MSG_WAITALL) = 83
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "N\0\0\0\3update Value set sName='Mai"..., 82, MSG_WAITALL) = 82
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "7\0\0\0\3select * from Value where I"..., 59, MSG_WAITALL) = 59
    read(7, "\1\0\0\1\4/\0\0\2\3def\tinterface\5Value\5Va"..., 16384) = 243
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "P\0\0\0\3select * from Attribute whe"..., 84, MSG_WAITALL) = 84
    read(7, "\1\0\0\1\0107\0\0\2\3def\tinterface\tAttribut"..., 16384) = 551
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, ";\0\0\0\3select * from Command where"..., 63, MSG_WAITALL) = 63
    read(7, "\1\0\0\1\0043\0\0\2\3def\tinterface\7Command\7"..., 16384) = 267
    write(8, "\0", 1)                       = 1
    nanosleep({0, 250000000}, 0x7e8afbb0)   = 0
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "N\0\0\0\3update Value set sName='Mai"..., 82, MSG_WAITALL) = 82
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "Y\0\0\0\3update Value set sName='Mai"..., 93, MSG_WAITALL) = 93
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "7\0\0\0\3select * from Value where I"..., 59, MSG_WAITALL) = 59
    read(7, "\1\0\0\1\4/\0\0\2\3def\tinterface\5Value\5Va"..., 16384) = 243
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "P\0\0\0\3select * from Attribute whe"..., 84, MSG_WAITALL) = 84
    read(7, "\1\0\0\1\0107\0\0\2\3def\tinterface\tAttribut"..., 16384) = 551
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, ";\0\0\0\3select * from Command where"..., 63, MSG_WAITALL) = 63
    read(7, "\1\0\0\1\0043\0\0\2\3def\tinterface\7Command\7"..., 16384) = 267
    write(8, "\0", 1)                       = 1
    nanosleep({0, 250000000}, 0x7e8afbb0)   = 0
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "N\0\0\0\3update Value set sName='Mai"..., 82, MSG_WAITALL) = 82
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "N\0\0\0\3update Value set sName='Mai"..., 82, MSG_WAITALL) = 82
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "Y\0\0\0\3update Value set sName='Mai"..., 93, MSG_WAITALL) = 93
    read(7, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "7\0\0\0\3select * from Value where I"..., 59, MSG_WAITALL) = 59
    read(7, "\1\0\0\1\4/\0\0\2\3def\tinterface\5Value\5Va"..., 16384) = 243
    poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
    send(7, "P\0\0\0\3select * from Attribute whe"..., 84, MSG_WAITALL) = 84

This is all it ever does when idling. I am using MySQL 5.5 and mariadb-connector-c.

Hm, unfortunately this is not very informative. So it really seems that the CPU just stops working at a random point, without involving hardware really…

Another idea worth trying is to run a memory tester for some hours. We typically use memtester which is preinstalled in our stock image.
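A memtester run could be sized from the memory that is currently available, so the test does not starve the rest of the system. This is a sketch under assumptions: `memtest_size_kb` is a hypothetical helper name, and testing half of the available memory is an arbitrary choice rather than a recommendation from this thread.

```shell
#!/bin/sh
# Hypothetical helper: print half of the currently available memory in kB.
# The meminfo path is a parameter only so the function can be tried on a
# sample file. Falls back to MemFree if the kernel has no MemAvailable.
memtest_size_kb() {
    awk '/^MemAvailable:/ {a = $2}
         /^MemFree:/      {f = $2}
         END {v = (a ? a : f); print int(v / 2)}' "${1:-/proc/meminfo}"
}

# Usage on the module (as root), three passes over that amount of memory:
#   memtester "$(memtest_size_kb)K" 3
```

memtester takes the amount of memory with an optional B, K, M or G suffix, followed by the number of loops; without a loop count it runs until interrupted.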

If you do have JTAG access, connecting a hardware debugger might help too. Unless the CPU/bus is really completely stuck, you should be able to get register information etc…

It took a little more work than expected to upgrade to LinuxImageV2.8b1. I had to reduce the image size in order to flash successfully. It did grow quite a lot.

Anyhow, I do have memtester installed and will run it overnight. For JTAG, I imagine I will have to alter the image, since it is used for audio by default. Will look into that.

**Update:** Memtester detects no problems.

I had an error like this; my Linux installation was corrupted.
Don’t use the on/off button to reset; send the shutdown command first.

I reinstalled Linux and now it works well.

Thanks for your insight. Unfortunately, there is no such thing as an on/off button or a way to send the shutdown command. Has power → it’s on; has no power → off. We literally just pull the plug.

Did you find the source of the corruption, or did the problem just go away after a new installation?