SIGILL on imx7 nand ubifs

Hi there,

We are currently working with two i.MX7 models: Dual 1GB (eMMC) and Dual 512MB (NAND). As far as we know, both have exactly the same CPU (2x A7 + 1x M4).

However, the same binary executable on the 512MB NAND/UBIFS version is encountering a SIGILL error:

07:52:22 [3071]: SIGILL: illegal instruction
07:52:22 [3071]: PC=0x6adca m=8 sigcode=1
07:52:22 [3071]: signal arrived during cgo execution
07:52:22 [3071]: instruction bytes: 0xfe 0xeb 0x8 0x0 0xdd 0xe5 0x2 0x0 0x50 0xe3 0x0 0x0 0xa0 0xe3 0x1 0x0
07:52:22 [3071]: goroutine 33 [syscall]:
07:52:22 [3071]: runtime.cgocall(0xa7cbc0, 0x32dfdb8)
07:52:22 [3071]:    /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0x32dfda0 sp=0x32dfd88 pc=0x1710c
07:52:22 [3071]:, 0x2e0d268)
07:52:22 [3071]:    _cgo_gotypes.go:355 +0x34 fp=0x32dfdb4 sp=0x32dfda0 pc=0x75e154
07:52:22 [3071]:*Journal).Next.func1(0x63dca519, 0x30407e0)
07:52:22 [3071]:    /go/pkg/mod/ +0x7c fp=0x32dfdcc sp=0x32dfdb4 pc=0x75ecbc

The 1GB emmc chip does not encounter this error. Running the same binary from tmpfs on the NAND version also works fine, so it could be somehow related to UBIFS.

If it were an error related to disk reads we’d understand, but as far as we know SIGILL is a CPU instruction error, which shouldn’t be affected by the different storage mechanism.

We’re curious if there are any other differences that aren’t documented that could lead to this? Or if there are any workarounds?

We’re on:

BSP 3.0
Colibri-iMX7_Console-Image 3.0b2 20220628
Linux 4.14.117-0+ge43e3a26e1b7 #1 SMP Tue Jun 7 07:43:18 UTC 2022 armv7l GNU/Linux
Build Configuration:
BB_VERSION           = "1.40.0"
BUILD_SYS            = "x86_64-linux"
NATIVELSBSTRING      = "ubuntu-18.04"
TARGET_SYS           = "arm-oe-linux-gnueabi"
MACHINE              = "colibri-imx7"
DISTRO               = "nodistro"
DISTRO_VERSION       = "nodistro.0"
TUNE_FEATURES        = "arm armv7ve vfp thumb neon callconvention-hard cortexa7"
TARGET_FPU           = "hard"

dmesg.log (19.6 KB)

Hi @Pashugan, could you compare the MD5 checksums of the files on tmpfs and on NAND using md5sum?
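A minimal sketch of that comparison, assuming the binary has been copied to tmpfs under a path such as /tmp/xxx (the helper name and both paths are placeholders, not something from this thread):

```shell
#!/bin/sh
# same_md5 FILE_A FILE_B
# Returns 0 if both files have the same MD5 digest, 1 if they differ,
# and 2 if either file cannot be read. Hypothetical helper for illustration.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Example call (placeholder paths):
#   same_md5 /tmp/xxx /usr/bin/xxx && echo "identical" || echo "different or unreadable"
```

If the file was read recently, md5sum may be served from the page cache rather than from flash. Running `echo 3 > /proc/sys/vm/drop_caches` as root before the check should force a fresh read from NAND.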

We did, and I can confirm the sums were the same.

That is interesting.
From your dmesg log, I can see that multiple volumes (e.g. data, rootfs_A, rootfs_B) have been created on the NAND. Could you check the volume usage with df -h? Which rootfs is currently in use? The MD5 checksums of the Go application were the same on tmpfs and NAND. What about the Go libraries?
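One way to answer the "which rootfs" question is to read the mount table. The sketch below uses a hypothetical helper that prints whatever device is mounted at "/"; the exact device naming on this image is an assumption:

```shell
#!/bin/sh
# root_device [MOUNTS_FILE]
# Prints the device (first column) of the last entry mounted at "/".
# Defaults to /proc/mounts. Taking the last match skips an early
# "rootfs / rootfs" placeholder line if the kernel lists one.
root_device() {
    awk '$2 == "/" { dev = $1 } END { if (dev != "") print dev }' "${1:-/proc/mounts}"
}

# Example on the target:
#   root_device      # shows which volume (e.g. rootfs_A or rootfs_B) backs "/"
#   df -h            # shows per-volume usage, as requested above
```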

Hi @benjamin.tx . Thank you for the reply. Our Go binary is statically built with some C dependencies via the CGO mechanism, and we already found out that the problem is related to the C dependencies (it doesn’t crash without them). We are still not sure how it can cause SIGILL on a particular filesystem.

# ldd -v /usr/bin/xxx (0x7efb5000) => /lib/ (0x76f84000) => /lib/ (0x76f0b000) => /lib/ (0x76ef8000) => /lib/ (0x76e05000)
	/lib/ (0x76fa8000)
	Version information:
	/usr/bin/xxx: (GLIBC_2.4) => /lib/ (GLIBC_2.4) => /lib/ (GLIBC_2.4) => /lib/ (GLIBC_2.4) => /lib/
	/lib/ (GLIBC_2.4) => /lib/ (GLIBC_PRIVATE) => /lib/ (GLIBC_PRIVATE) => /lib/ (GLIBC_2.4) => /lib/
	/lib/ (GLIBC_PRIVATE) => /lib/ (GLIBC_2.4) => /lib/
	/lib/ (GLIBC_PRIVATE) => /lib/ (GLIBC_PRIVATE) => /lib/ (GLIBC_2.4) => /lib/
	/lib/ (GLIBC_2.4) => /lib/ (GLIBC_PRIVATE) => /lib/

We suspect it can also be related to cross-compilation against a different version of glibc, but haven’t checked it yet.

I think I didn’t make it clear. The Go libraries are linked statically, and C deps are linked dynamically.

On UBIFS, all files are compressed by default. This shouldn’t lead to SIGILL, but you could try uncompressing your executables with chattr -c file.

AFAIK, by default the CRC is verified only for the metadata, not for the data. If the issue also happens on other NAND-flash-based Colibri iMX7 SoMs, it should be software related. You currently have BSP v3 installed. You could also test with our latest Linux BSP v5.x, which ships a newer glibc.

@benjamin.tx Yes, we were able to reproduce it on another nand imx7. We also think it’s likely software but SIGILL seems peculiar.

Interesting problem! I wonder if you could also try to run the binaries under strace in the configuration that works and compare it with the configuration that doesn’t.
Since you think it is related to the dependencies on the dynamically linked libraries, maybe strace could at least point to which library is being called when the problem happens.
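A rough sketch of how that comparison could be done: record a trace in each configuration, then reduce each trace to the list of shared objects that were actually opened. The helper name and the trace file paths are placeholders; /tmp/xxx stands for a copy of the binary on tmpfs.

```shell
#!/bin/sh
# Record traces first (-f follows threads and children, -o writes to a file):
#   strace -f -o /tmp/good.trace /tmp/xxx        # copy on tmpfs, works
#   strace -f -o /tmp/bad.trace  /usr/bin/xxx    # copy on UBIFS, crashes

# opened_libs TRACE_FILE
# Prints the unique .so paths that were successfully opened in the trace.
opened_libs() {
    grep -E '^([0-9]+ +)?open(at)?\(' "$1" \
        | grep -v ' = -1 ' \
        | grep -oE '"[^"]+\.so[^"]*"' \
        | tr -d '"' \
        | sort -u
}

# Compare the two runs:
#   opened_libs /tmp/good.trace > /tmp/good.libs
#   opened_libs /tmp/bad.trace  > /tmp/bad.libs
#   diff /tmp/good.libs /tmp/bad.libs
```

The last few lines of the failing trace, just before the `SIGILL` entry, are probably the most useful part, since they show the final syscalls made before the crash.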

@rafael.tx Going to try it, thanks. So far, we think it may be related to journald log rotation. Since the NAND version has much less disk space, log rotation apparently happens more often, and the error occurs on reads from a (just) deleted log file. We are still investigating it.
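To check the deleted-log-file theory, one option is to look for file descriptors that still point at files which no longer exist on disk. This is only a sketch: the helper name is made up, and it relies on Linux marking such entries in /proc/PID/fd with a " (deleted)" suffix.

```shell
#!/bin/sh
# deleted_fds PID
# Prints "FD -> PATH (deleted)" for every descriptor of PID whose
# target file has been unlinked while still open.
deleted_fds() {
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s -> %s\n' "${fd##*/}" "$target" ;;
        esac
    done
}

# Example on the target (xxx is the placeholder binary name):
#   deleted_fds "$(pidof xxx)"
```

`journalctl --disk-usage` shows how much space the journal currently takes, which may help confirm that rotation is triggered more often on the smaller NAND volume.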