Apalis iMX6 Kernel Ext4 Recovery fail

Hi,
after changing our image to the 3.0 series (Apalis-iMX6_Console-Image_3.0b2.65-20190830) we encountered kernel hangs when ext4 Recovery has to be performed on the rootfs. This basically bricks the system because a reboot will just enter the same state again. After updating the kernel to the latest Apalis-iMX6_Console-Image_3.0b4.254-20200421 the problem still occurs.

Our rootfs is unfortunately not read-only at the moment, although we try to minimize writes during runtime.

Strangely the system can be recovered if we replace the kernel on the boot partition, even if it is the same 3.0b4 kernel. This leaves some questions open for me: is it not actually the root that’s corrupted but the boot partition? And how would that even happen because boot is never written to and the kernel seems to startup fine apart from the error.

Any tips are greatly appreciated

Here’s the bootlog:

Starting kernel ...
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.170-3.0.2+gae6a94f (oe-user@oe-host
) (gcc version 8.2.0 (GCC)) #1 SMP PREEMPT Wed Apr 22 15:10:55 UTC 2020
...snip...
[    2.152485] EXT4-fs (mmcblk2p2): INFO: recovery required on readonly filesystem
[    2.159948] EXT4-fs (mmcblk2p2): write access will be enabled during recovery
[    2.244481] EXT4-fs (mmcblk2p2): recovery complete
[    7.673280] mmcblk2: error -110 sending stop command, original cmd response 0x900, card status 0x400e00
[    8.042810] random: crng init done

Unfortunately, BSP 3.0.2 is deprecated and no longer supported. Please update to the latest stable BSP 3.0.4 LTS at your earliest convenience.

Hi Marcel,
Thanks for the info, will do. However the kernel we are using is already based on 3.0b4

And why exactly are you then talking about 4.14.170? Please state exactly what hardware (module and carrier board) and software versions of things you are talking about. Thanks!

Hi Marcel, I am sorry I just took the version from the bootlog without thinking about it. I’ve updated the post to make it clear that we are using a kernel based on the Console-Image_3.0b4 tag.

I also clarified that the problem first occured when we switched to 3.0.2 (and then updated the kernel to 3.0b4 hoping it would fix the issue). Previously we were using Apalis-iMX6_LXDE-Image_2.7b4. Unfortunately the 2.7b4 kernel no longer builds with the oe toolchain in 3.0 so this issue is a bit of a blocker for us.

Thanks,
Anton

Could you please provide the full boot log attached as a textual file? Have you tried using our regular BSP 3.0.4 demo image at all?

I still suspect that you are doing something different/wrong somewhere…

Hi Marcel,

Here’s the full bootlog

We do not have any issue building or running your upstream bsp or our own customized version for that matter. It boots and runs just fine.

The condition I am showing here only occurs after multiple hours to days of operation and multiple hard poweroffs (unfortunately the system cant be shutdown cleanly), but I haven’t been able to reproduce this on purpose in the lab.

I just think that this is some regression in either the ext4 recovery code or the eMMC handling that gets triggered once the rootfs needs fsck/recovery before boot.

But that boot log still shows 4.14.170-3.0.2! As mentioned above that version is simply no longer supported, sorry.

But your image version 3.0.4 is based on 4.14.170. This is with your upstream binary image booted:

root@apalis-imx6:~# cat /etc/issue
TDX X11 2.6-snapshot \n \l

Apalis-iMX6_Console-Image 3.0b4.254 20200421

root@apalis-imx6:~# uname -a
Linux apalis-imx6 4.14.170-3.0.4+gbaa6c24240a4 #1 SMP Tue Apr 21 01:36:00 UTC 2020 armv7l GNU/Linux

What am I missing? Yes my posted bootlog still contains the text 3.0.2 because that’s what the bsp was at, but the kernel version is up to date with your current stable:

$ git rev-list ae6a94f --max-count 16 | grep baa6c24240a4
baa6c24240a4c9e871e14c056a3485f98fef4dee

Yes, BSP 3.0.4 but not BSP 3.0.2. So you seem to be mixing or matching something somewhere. Please try with genuine BSP 3.0.4. Thanks!

Hi Marcel, I have unfortunately reproduced the error with BSP 3.0.4 and am able to do so now in in a couple of minutes with timed resets.

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.170-3.0.4+g63bf5ba18210 (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP PREEMPT Tue Oct 20 08:22:02 UTC 2020
[    0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] OF: fdt: Machine model: Toradex Apalis iMX6Q/D Module
...snip...
[    1.961338] hub 1-1:1.0: USB hub found
[    1.965429] hub 1-1:1.0: 4 ports detected
[    1.970801] (NULL device *): hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[    1.983333] imx_thermal 2000000.aips-bus:tempmon: Industrial CPU temperature grade - max:105C critical:105C passive:95C
[    2.089925] asoc-simple-card sound: tlv320aic32x4-hifi <-> 2028000.ssi mapping ok
[    2.117242] snvs_rtc 20cc000.snvs:snvs-rtc-lp: setting system clock to 1970-01-01 00:00:00 UTC (0)
[    2.127087] usb_otg_vbus: disabling
[    2.130677] ALSA device list:
[    2.133697]   #0: tlv320aic3254
[    2.146252] EXT4-fs (mmcblk2p2): INFO: recovery required on readonly filesystem
[    2.153665] EXT4-fs (mmcblk2p2): write access will be enabled during recovery
[    2.228226] EXT4-fs (mmcblk2p2): recovery complete
[    2.296856] usb 1-1.1: new high-speed USB device number 3 using ci_hdrc
[    2.404242] usb-storage 1-1.1:1.0: USB Mass Storage device detected
[    2.411118] scsi host0: usb-storage 1-1.1:1.0
[    3.441749] scsi 0:0:0:0: Direct-Access     INTENSO  USB              1100 PQ: 0 ANSI: 0 CCS
[    3.453837] sd 0:0:0:0: [sda] 3915776 512-byte logical blocks: (2.00 GB/1.87 GiB)
[    3.462452] sd 0:0:0:0: [sda] Write Protect is off
[    3.468325] sd 0:0:0:0: [sda] No Caching mode page found
[    3.473691] sd 0:0:0:0: [sda] Assuming drive cache: write through
[    3.485598]  sda: sda1
[    3.490577] sd 0:0:0:0: [sda] Attached SCSI removable disk
[    7.660373] mmcblk2: error -110 sending stop command, original cmd response 0x900, card status 0x400e00
[    7.994844] random: crng init done

The full boot log is attached

I have also tried to mount the partition from Tezi, that yields the following errors:

/ # mount /dev/mmcblk0p2 /mnt/

[   57.532494] sdhci-esdhc-imx 2198000.usdhc: Card stuck in wrong state! card_busy_detect status: 0xe00
[   57.541953] mmc0: cache flush error -110
[   58.832497] mmcblk0: recovery failed!
[   58.836256] blk_update_request: I/O error, dev mmcblk0, sector 40960 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[   58.847252] Buffer I/O error on dev mmcblk0p2, logical block 0, lost sync page write
[   58.855047] EXT4-fs (mmcblk0p2): I/O error while writing superblock
[   58.861355] EXT4-fs (mmcblk0p2): mount failed
[   58.867540] mmc0: cache flush error -110
[   58.871501] blk_update_request: I/O error, dev mmcblk0, sector 3448832 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[   58.882647] Buffer I/O error on dev mmcblk0p2, logical block 425984, lost sync page write
[   58.890949] JBD2: Error -5 detected when updating journal superblock for mmcblk0p2-8.
[   58.898806] Aborting journal on device mmcblk0p2-8.
[   58.910033] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   58.916718] mmcblk0: recovery failed!
[   58.920419] blk_update_request: I/O error, dev mmcblk0, sector 3448832 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[   58.931571] Buffer I/O error on dev mmcblk0p2, logical block 425984, lost sync page write
[   58.939847] JBD2: Error -5 detected when updating journal superblock for mmcblk0p2-8.
[   58.954750] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   58.963056] mmc0: cache flush error -110
[   60.258325] mmcblk0: recovery failed!
[   60.262066] blk_update_request: I/O error, dev mmcblk0, sector 40962 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.272873] EXT4-fs (mmcblk0p2): unable to read superblock
[   60.284567] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.291196] mmcblk0: recovery failed!
[   60.294997] blk_update_request: I/O error, dev mmcblk0, sector 40962 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.305665] EXT4-fs (mmcblk0p2): unable to read superblock
[   60.317861] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.324579] mmcblk0: recovery failed!
[   60.328323] blk_update_request: I/O error, dev mmcblk0, sector 40960 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.339003] FAT-fs (mmcblk0p2): unable to read boot sector
[   60.378237] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.384910] mmcblk0: recovery failed!
[   60.388752] blk_update_request: I/O error, dev mmcblk0, sector 40962 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.399664] EXT4-fs (mmcblk0p2): unable to read superblock
[   60.414148] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.420773] mmcblk0: recovery failed!
[   60.424504] blk_update_request: I/O error, dev mmcblk0, sector 40962 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.435200] EXT4-fs (mmcblk0p2): unable to read superblock
[   60.447164] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.453933] mmcblk0: recovery failed!
[   60.457644] blk_update_request: I/O error, dev mmcblk0, sector 40962 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.468256] EXT4-fs (mmcblk0p2): unable to read superblock
[   60.480250] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.486920] mmcblk0: recovery failed!
[   60.490633] blk_update_request: I/O error, dev mmcblk0, sector 40960 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   60.501208] SQUASHFS error: squashfs_read_data failed to read block 0x0
[   60.507945] unable to read squashfs_super_block
[   60.518755] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.525409] mmcblk0: recovery failed!
[   60.529133] FAT-fs (mmcblk0p2): unable to read boot sector
[   60.540925] sdhci-esdhc-imx 2198000.usdhc: error -110 requesting status
[   60.547656] mmcblk0: recovery failed!
mount: mounting /dev/mmcblk0p2 on /mnt/ failed: Invalid argument

Reflashing the module however works fine. I am kind of stuck here, I don’t know what’s causing these mmc errors. I have verified that the issue does not occur as fast with kernel 4.1 (requires a couple thousand hard resets).

Do you have multiple modules behaving like that? If not I would guess something is just wrong with that one module’s eMMC part. Please follow RMA procedure to get it replaced:

Yeah I have multiple modules behaving like that so I think it’s a software issue.
I can also reproduce it pretty reliably in a couple of minutes now by continuously hard resetting the modules around 8-12 seconds after power-on. This only happens if the rootfs is not readonly.

I have yet to test whether that also happens with the unmodified toradex bsp.

I have yet to test whether that also happens with the unmodified toradex bsp.

Yes, please do that and report back. Thanks!