System is going dark after some kind of error (how to debug)

Hi

This question is only indirectly related to the Colibri VF chip: my software happens to run on it.

I’m facing a very strange behavior which causes the system to go dark after some error, making it impossible to debug because there is no way to access anything.

The story goes like this:

Of 4 identical test systems, 3 “crash” after 4 to 10 hours, whereas the last one runs for days without a problem. Sometimes all of them crash, but never at the same time and always within a broad time frame.
This “crash” somehow kills the network stack, and with it my only connection to the system, preventing me from acquiring any kind of runtime information.
If there is an exception message, I am unable to persist it. My log files do not show anything out of the ordinary.
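One way to make sure the last log lines survive a sudden death of the system is to force every entry to storage before continuing. The following is only a minimal sketch: the function name append_durable and the log path are made up for illustration, and calling fsync() per line costs time and flash wear, so it is probably only suitable for a few key checkpoints.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: appends one line to `path` and forces it to storage.
// Returns true only if every byte was written and fsync() succeeded.
bool append_durable(const std::string& path, const std::string& message) {
    assert(!path.empty());
    const int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        return false;
    }

    const std::string line = message + "\n";
    const char* data = line.data();
    std::size_t remaining = line.size();
    bool ok = true;
    while (remaining > 0) {
        const ssize_t written = ::write(fd, data, remaining);
        if (written < 0) {
            if (errno == EINTR) {
                continue;  // interrupted by a signal, retry the write
            }
            ok = false;
            break;
        }
        data += written;
        remaining -= static_cast<std::size_t>(written);
    }

    // Without fsync() the data may still sit in the page cache when the
    // system goes dark.
    if (ok && ::fsync(fd) != 0) {
        ok = false;
    }
    ::close(fd);
    return ok;
}
```

Called as append_durable("/var/log/myapp-checkpoints.log", "entering motor loop"), the last line found after a reboot would then show roughly how far the application got.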

I tried various angles without any success. I reduced the stack to 1/10 of its original size, but the application does not crash earlier. Doubling the stack size does not make it crash later either. Therefore, I don’t believe this is a stack overflow.

When running a Debug-build, the software runs indefinitely, without any problems, crushing my hope of a useful core-dump. Running a release build will never actually write a core dump.

Debugging on another platform is nearly impossible, since the crash only seems to manifest itself when other hardware, like a motor, is utilized which I cannot use on anything other than our own proprietary motherboard.

At this point, after soon to be 2 months of trial and error, I don’t know what else I could try. Without any reliable access to the system, I can only speculate on what might be the cause, but I can never really scrutinize my hypothesis.

I’m looking for any kind of input which could give me a new idea on how to tackle this issue. My money is on SIGSEGV, some problem when accessing memory.
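To test the SIGSEGV theory without a core dump, a signal handler could write a backtrace to a file before the process dies. This is a sketch under assumptions: install_crash_logger and the file path are invented names, it relies on glibc's backtrace() from execinfo.h, and on ARM the trace may be incomplete unless the build carries unwind tables. It also only helps if the process dies while the kernel keeps running.

```cpp
#include <cassert>
#include <csignal>
#include <execinfo.h>
#include <fcntl.h>
#include <unistd.h>

namespace {

int g_crash_fd = -1;  // opened at startup so the handler only has to write

void crash_handler(int signo) {
    // Only calls that are safe inside a signal handler are used here.
    static const char header[] = "fatal signal caught, backtrace follows\n";
    const ssize_t ignored = ::write(g_crash_fd, header, sizeof(header) - 1);
    (void)ignored;

    void* frames[32];
    const int depth = ::backtrace(frames, 32);
    ::backtrace_symbols_fd(frames, depth, g_crash_fd);
    ::fsync(g_crash_fd);

    // SA_RESETHAND has already restored the default action. The re-raised
    // signal is delivered once this handler returns and ends the process.
    ::raise(signo);
}

}  // namespace

// Hypothetical helper: call once early in main().
bool install_crash_logger(const char* path) {
    assert(path != nullptr);
    const int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        return false;
    }
    g_crash_fd = fd;

    // Call backtrace() once now so that any lazy initialisation it needs
    // does not happen inside the signal handler.
    void* warmup[1];
    ::backtrace(warmup, 1);

    struct sigaction sa {};
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;

    const int fatal_signals[] = {SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT};
    bool ok = true;
    for (const int signo : fatal_signals) {
        if (::sigaction(signo, &sa, nullptr) != 0) {
            ok = false;
        }
    }
    return ok;
}
```

The addresses in the resulting file can be resolved afterwards against an unstripped copy of the binary, for example with addr2line.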

Please give me some insight about how you would handle a problem such as this.

Hi

This “crash” somehow kills the network stack, and with it my only connection to the system.

Is it possible to have the console connection over UART to the system? This would help to see what the system is doing.

Did you check the CPU Load and the Ram Consumption?

Debugging on another platform is nearly impossible, since the crash only seems to manifest itself when other hardware, like a motor, is utilized which I cannot use on anything other than our own proprietary motherboard.

Which hardware? What is your use case? What is the crashing application doing?

What I know is that after this crash there is no connection whatsoever via LAN, but the LEDs of the NIC are still on. So it may somehow be killing the network stack.

My hardware skills are not that great, but the person most knowledgeable about our mainboard says there is no other option to connect. It may be possible to build a new image granting access via USB/RS232, but then again it may be not.

CPU load and RAM consumption are not out of the ordinary, at least while the system is still running. If it were a steady overflow of some kind, I would expect more reproducible behavior.
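To back this up with numbers, the application could log its own resident memory at intervals and compare the last values before a crash. A small sketch, assuming a Linux /proc filesystem; parse_vm_rss_kb and current_rss_kb are illustrative names, not part of any existing code here.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Extracts the VmRSS value in kB from text in /proc/<pid>/status format.
// Returns -1 if no VmRSS line is present or it cannot be parsed.
long parse_vm_rss_kb(std::istream& status) {
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            fields >> kb;
            return fields ? kb : -1;
        }
    }
    return -1;
}

// Reads the resident set size of the calling process, or -1 on failure.
long current_rss_kb() {
    std::ifstream status("/proc/self/status");
    return status ? parse_vm_rss_kb(status) : -1;
}
```

Logging current_rss_kb() once a minute would show whether memory creeps up before the system goes dark or stays flat until the end.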

The hardware: Colibri VF61 on top of our own custom-designed motherboard. The designer does not work here anymore and I probably cannot send you the plans :) So this sucks.

I tried to run the software on x86, which works, but nothing it does there is meaningful since no peripherals are present.
The use case is a kind of PLC for automating laboratory processes: measuring, controlling temperature and gas flow, mixing, etc.

Hopefully I can get some insight into the mainboard design today, or else this will have to wait for another week.

As you see, I’m at a bit of a loss. It’s like I’m working with a black box here, and not the airplane kind.

Update:

NIC LED left: green, steady
NIC LED right: orange, steady

It may be possible to build a new image granting access via USB/RS232, but then again it may be not.

If you take the BSP from Toradex, the connection over UART is always enabled. However, your carrier board has to support the connection, too. I would recommend you buy our Iris or Viola board.

If your software runs on x86 without errors and crashes on the VF61, this could be a kernel/BSP issue. Did you try an official Toradex BSP image?

I’m afraid it is not that easy. Without the connection to the other hardware components, the software won’t even boot. The x86 build is for debug only, and everything hardware related has to be removed in order to make it run.

Other question:
Since the Debug build runs fine and the Release build does not, are there any mandatory compiler flags I may be missing?

These are the failing flags I’m using (-pipe was added when the crash was already present):

set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O2 -Wall -pipe")
set(CMAKE_EXE_LINKER_FLAGS_RELEASE "${CMAKE_EXE_LINKER_FLAGS} -Wl,-O1")

In this document I found a list of optimization flags; maybe one of them is important? https://docs.toradex.com/102017-ecos-colibri-vf61-vybrid.pdf

set(OPTIMIZATION_FLAGS "-Wall -Wpointer-arith -Wstrict-prototypes -Wundef \
-Wno-write-strings -mthumb -g -O2 -fdata-sections \
-ffunction-sections -fno-exceptions -nostdlib \
-mcpu=cortex-m4")

I will try to set up a new test with those flags. I tried the ones from the bitbake sysroot environment file:

./environment-setup-armv7at2hf-neon-angstrom-linux-gnueabi

It may be necessary to make it run in KVM or on the Eval board. But as that will be time-consuming, I want to explore my other options first. Without any access, there is little I can do, but the compiler options may be worthwhile.

Which compiler flags differ between the release and debug builds?

Redo since i lost some stuff.

set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0 -Wall -pipe -g -feliminate-unused-debug-types -fno-inline ")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O2 -Wall -pipe")

set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--hash-style=gnu -Wl,--as-needed")
set(CMAKE_EXE_LINKER_FLAGS_RELEASE "${CMAKE_EXE_LINKER_FLAGS} -Wl,-O1")

if(ARC MATCHES ARMv7)
    set(CMAKE_CXX_FLAGS_ARM "-fno-tree-vectorize -mthumb -mfloat-abi=hard -mfpu=neon \
    -mtune=cortex-a5 -Wno-poison-system-directories -flto \
    -fmessage-length=0")

    if(CMAKE_BUILD_TYPE MATCHES Release)
        set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} ${CMAKE_CXX_FLAGS_ARM}")
    elseif(CMAKE_BUILD_TYPE MATCHES Debug)
        set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} ${CMAKE_CXX_FLAGS_ARM}")
    endif()
endif()
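Since the flags are assembled from several variables, it may help to let the binary report what the compiler actually saw. The sketch below uses predefined macros that GCC sets for these options (__OPTIMIZE__, __thumb__, __ARM_NEON); the function name build_config_summary is made up. One thing to keep in mind: as far as I know, CMake's default release flags for GCC already contain -DNDEBUG, and appending to CMAKE_CXX_FLAGS_RELEASE keeps that, so assert() checks are compiled out in the release build only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns a one-line description of the build,
// derived from macros the compiler predefines for the active flags.
std::string build_config_summary() {
    std::string summary;

#if defined(__OPTIMIZE__)
    summary += "optimize=on";   // any -O level above -O0
#else
    summary += "optimize=off";
#endif

#if defined(NDEBUG)
    summary += " ndebug=on";    // assert() is a no-op in this build
#else
    summary += " ndebug=off";
#endif

#if defined(__thumb__)
    summary += " thumb=on";     // -mthumb
#else
    summary += " thumb=off";
#endif

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    summary += " neon=on";      // e.g. -mfpu=neon
#else
    summary += " neon=off";
#endif

    return summary;
}
```

Writing this string to the log at startup makes it easy to tell afterwards which combination a given test run was really built with.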

If the kernel crashes/hangs, it usually prints a stack trace on the serial console.

You really should try to get access to the serial console. Check in your hardware design whether SODIMM 33 (UART_A RXD) and SODIMM 35 (UART_A TXD) are somehow available. If really not, worst case you could try to solder directly to the SODIMM connector (requires some soldering skill…). With an FTDI (or similar) UART TTL adapter you should then be able to get a serial console.

What Linux kernel version are you using (check uname -a)? What hardware interfaces are you using to communicate with the hardware?

So I’ve spotted another difference between now and an earlier version:

Old version had:

-mthumb-interwork

New version had:

-mthumb -mfpu=neon

But as I understand it, there are no mandatory options which always need to be present for the Colibri platform?

I will set up a test without those and start looking into a way to get access to the system via UART. Thanks for the starter info there.

uname -a:
Linux colibri-vf 4.4.88-2.7.5+ge0f2806 #1 Wed Dec 20 14:22:20 CET 2017 armv7l armv7l armv7l GNU/Linux

Using LinuxImageV2.7b5 → most current and up-/downgraded for testing purposes in this matter.

UPDATE:

I posted the above as an answer. But it isn’t.
The crash may be caused by the usage of -mfpu=neon. It is for specific purposes only and does not fully support the IEEE 754 standard.

I wonder why yocto/bb defaults to it.

export CC="arm-angstrom-linux-gnueabi-gcc  -march=armv7-a -mthumb -mfpu=neon  -mfloat-abi=hard --sysroot=$SDKTARGETSYSROOT"
export CXX="arm-angstrom-linux-gnueabi-g++  -march=armv7-a -mthumb -mfpu=neon  -mfloat-abi=hard --sysroot=$SDKTARGETSYSROOT"
export CPP="arm-angstrom-linux-gnueabi-gcc -E  -march=armv7-a -mthumb -mfpu=neon  -mfloat-abi=hard --sysroot=$SDKTARGETSYSROOT"

Well, it still does not explain why it would run with debug information and crash without, since both use the same options in this regard.

Using -mfpu=neon allows the compiler to use NEON instructions. Regular, IEEE 754 standard floating point instructions are still available and can/will be used where appropriate. NEON is often used in hand-optimized assembly sections. A compiler makes use of such instructions when doing auto-vectorization.
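For illustration, this is the kind of loop the auto-vectorizer may turn into NEON instructions when it is enabled (for example with -ftree-vectorize). The function is purely an example. Note that the flags posted earlier pass -fno-tree-vectorize, which switches that pass off. As far as I know, the GCC documentation also says that floating point loops are not auto-vectorized for NEON unless -funsafe-math-optimizations is given, because NEON treats denormal values as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise integer addition: independent iterations over contiguous
// data, which makes the loop a typical candidate for auto-vectorization.
std::vector<std::int32_t> add_elementwise(const std::vector<std::int32_t>& a,
                                          const std::vector<std::int32_t>& b) {
    assert(a.size() == b.size());
    std::vector<std::int32_t> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}
```

The result is the same with or without NEON; only the generated instructions differ.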

OpenEmbedded compiles a wealth of software with this flag without any problem. I don’t think that there is a general problem with the flag.

I suspect that the compiler arranges things/operations differently such that another issue is exacerbated due to the flag (and again suppressed when using debug mode). Typical candidates are missing initializations for objects on stack or undefined behaviors.
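A small sketch of the "missing initialization" case, with invented names (HeaterState, heater_may_run). Without the default member initializers, a locally declared HeaterState would hold indeterminate values: a debug build might happen to see zeros on the stack, while an optimized build arranges the stack differently and reads garbage.

```cpp
#include <cassert>

// Hypothetical controller state. The "= ..." initializers are the fix:
// without them, `HeaterState s;` on the stack leaves all members
// indeterminate, and reading them is undefined behavior.
struct HeaterState {
    double setpoint_celsius = 0.0;
    int retries = 0;
    bool enabled = false;
};

// Reads members that would be garbage if HeaterState were uninitialized.
bool heater_may_run(const HeaterState& state) {
    return state.enabled && state.retries < 3;
}

HeaterState make_heater_state() {
    HeaterState state;  // members take their default initializers
    return state;
}
```

GCC can flag some of these cases with -Wall at -O2 (the maybe-uninitialized warnings), and building with -fsanitize=address,undefined is another way to look for them, provided the target has enough memory for it.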

I’ve made some progress with gaining access to the system. All the boot messages are showing (Linux kernel / systemd) until the login shell is displayed.
In addition, I can now flash U-Boot and the image without the eval board.

As soon as the output changes to systemd messages, my input line shuts down and I am unable to enter username and password. How could that be?

I’ve not gone too deep into this yet, but maybe this is something known.

UPDATE:
I basically undid the changes from the thread “Modbus on UART0 RS485 - Colibri VF61” (Technical Support - Toradex Community):

&uart0 {
        status = "okay";
        linux,rs485-enabled-at-boot-time;
        dma-names = "", "";
};

Removing linux,rs485-enabled-at-boot-time did not change the behavior either.

I’m using UART_A Rx=33 and Tx=35

Thank you for explaining.

If -mthumb(-interwork) is also safe to use, my only option is gaining access.

Do you by any chance know of any low resource memory profiler? I cannot seem to get valgrind to work, it runs out of memory fast.
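If valgrind is too heavy for the target, a very crude alternative is to count allocations inside the application itself by replacing the global operator new and operator delete. This is only a sketch and not a profiler: it sees C++ allocations made through the plain operators (not direct malloc() calls or over-aligned allocations), and live_heap_blocks is an invented name.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

namespace {
// Number of blocks currently allocated through operator new.
std::atomic<std::size_t> g_live_blocks{0};
}  // namespace

// Replacement for the global allocation function; new[] ends up here too.
void* operator new(std::size_t size) {
    void* block = std::malloc(size != 0 ? size : 1);
    if (block == nullptr) {
        throw std::bad_alloc();
    }
    g_live_blocks.fetch_add(1, std::memory_order_relaxed);
    return block;
}

void operator delete(void* block) noexcept {
    if (block == nullptr) {
        return;
    }
    g_live_blocks.fetch_sub(1, std::memory_order_relaxed);
    std::free(block);
}

// Sized variant used by newer compilers; forwards to the version above.
void operator delete(void* block, std::size_t) noexcept {
    operator delete(block);
}

// Hypothetical accessor for periodic logging.
std::size_t live_heap_blocks() {
    return g_live_blocks.load(std::memory_order_relaxed);
}
```

If the value logged from live_heap_blocks() keeps climbing over hours, that points to a leak; if it stays flat, heap growth through new is probably not the cause.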

So you are still only able to receive output from the module but are not able to type anything?

Removing any RS485 properties is certainly required. Also make sure that hardware flow control on both sides is not enabled (it is disabled on Colibri side by default, but make sure to also not use hardware flow control on the host side).

Not exactly.
run setupdate
run update_uboot
run update
Those commands work. There is input.
But as soon as the system takes over, this stops.

I will look into disabling flow control.

Device tree seems to do the trick

In order to spawn a login shell on UART_A:

  • Make sure U-Boot is spawning the console on /dev/ttyLP0
  • Remove CTS/RTS from uart0 in the device tree (copy the pinmux and remove the CTS/RTS lines)
  • Disable RS485 for uart0
  • Enjoy

Formatting does not work properly, sorry:

&uart0 {
        status = "okay";
        dma-names = "", "";
        pinctrl-0 = <&pinctrl_uart0>;
};

&iomuxc {
        pinctrl-0 = <&pinctrl_hog_1>;

        vf610-colibri {
                pinctrl_uart0: uart0grp {
                        fsl,pins = <
                                VF610_PAD_PTB10__UART0_TX               0x21a2
                                VF610_PAD_PTB11__UART0_RX               0x21a1
                        >;
                };
        };
};

So, do you now have access to your device through UART? Can you debug your application?

I would like to add that after the crash, I also cannot connect to the system via UART_A. While the above is the right configuration, it does not help debug the problem.

That being said, the test where I changed the options -mthumb -mfpu=neon has been running since last Friday and is still running.

You can try to stay connected to UART_A to log what happens when the system crashes.

That being said, the test where I changed the options -mthumb -mfpu=neon has been running since last Friday and is still running.
This means that with these compiler flags your application is running without any errors and the problem seems to be solved.

Hopefully yes. But these options should not cause problems on their own. As stefan.tx mentioned, this is probably due to an error in the code which only causes problems with these options.

Leaving UART_A connected won’t show me any output past the crash. It logs until the crash, and then suddenly stops.

Ok. So after adding some log messages, the problem reappeared.

I’ve created a new issue, hope this is ok: