Checking quality of flash memory

Logosol · August 8, 2024, 8:01pm

Image based on tdx-reference-minimal-image, with small DTB changes. Custom carrier board, with UART and ETH interfaces only.

Software summary
------------------------------------------------------------
Bootloader:               U-Boot
Kernel version:           5.4.193-rt74-5.7.3-dev-1+git.49e4130e2197 #1 SMP PREEMPT_RT Tue Apr 11 21:14:29 UTC 2023
Kernel command line:      user_debug=30 ubi.mtd=ubi root=ubi0:rootfs rw rootfstype=ubifs ubi.fm_autoconvert=1 console=tty1 console=ttymxc0,115200n8 consoleblank=0
Distro name:              NAME="TDX Wayland with XWayland RT"
Distro version:           VERSION_ID=5.7.3-devel-20230823013514-build.0
Hostname:                 colibri-14758274
------------------------------------------------------------

Hardware info
------------------------------------------------------------
HW model:                 Toradex Colibri iMX6ULL 256MB on Colibri Evaluation Board V3
Toradex version:          0036 V1.1A
Serial number:            14758274
Processor arch:           armv7l
------------------------------------------------------------

The issue: After flashing an out-of-the-box Colibri module, strange errors appear during boot, and we believe they are due to “bad flashing” or flash errors. badblocks does not report any issues.

root@localhost:~# badblocks -s /dev/mtd4
Checking for bad blocks (read-only test): set_o_direct: Invalid argument/0/0 errors)

After re-TEZI-ing, everything is OK, no errors are reported.

Is there any way to check flash for errors? Shall we discard these modules?

As of today, this has been observed on two out of almost 100 modules we’ve used. I can provide boot log if requested.

ToradexAI · August 8, 2024, 8:02pm

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!

Hello,

Given the context of your issue with the Colibri iMX6ULL modules, it’s important to approach the situation methodically to ensure the integrity of the flash memory without hastily discarding the modules.

Firstly, the fact that re-flashing the modules with the Toradex Easy Installer (TEZI) resolves the issue suggests that the problem might not be with the flash hardware itself but could be related to the initial flashing process or the software environment.

Here are some steps and considerations to further diagnose and potentially resolve the issue:

Verify the Flashing Process: Ensure that the flashing process is performed correctly and with the latest version of TEZI. It’s crucial to follow the recommended steps provided by Toradex for flashing the modules.
Check for Power Stability: Insufficient or unstable power supply during the flashing or boot process can lead to unpredictable behavior. Verify that your power supply meets the requirements.
Use the Latest Firmware: Ensure that you are using the latest version of the firmware available for your module. Firmware updates often include fixes that could resolve your issue.
Flash Memory Wear and Tear: Although the badblocks command didn’t report issues, flash memory does have a limited number of write cycles. If these modules have been extensively used for development (flashed many times), consider this aspect. However, given the low failure rate (2 out of 100), this seems less likely.
Environmental Factors: Consider if there are any environmental factors at play that could affect the flash memory or the module operation, such as temperature extremes or electromagnetic interference during the flashing or operation process.
Detailed Logging: Look into more detailed system logs for any errors or warnings that could give clues as to what might be happening during the boot process. This can be done by enabling additional logging during boot and examining the system logs.
Memory Testing Tools: While badblocks is a useful tool, consider using other memory testing utilities that are designed specifically for NAND flash memory, as they might provide more detailed insights or be more suited to the specific characteristics of NAND flash.
Consult Toradex Support Resources: Although direct contact is not within the scope of this assistance, Toradex provides extensive documentation and community forums where similar issues may have been discussed and resolved.

Before discarding any modules, it’s advisable to exhaust all diagnostic avenues. Given the low failure rate, it’s also possible that the issue is isolated or procedural rather than indicative of a widespread

Please, also check if these links can help you solve your problem:

First Steps With Subsystem Updates | Toradex Developer Center.

alex.tx · August 8, 2024, 8:56pm

The badblocks command was initially designed to test floppy drives, so it’s not very useful on NAND memory, which the Colibri iMX6ULL utilizes. If you want to see how many bad blocks are already marked and removed from usage along with some ECC info, you can use: nanddump -o /dev/mtd4.

Could you please share a full boot log with the mentioned “strange errors”?

Logosol · August 8, 2024, 9:40pm

Flash errors 20240808.txt (29.1 KB)
Hi Alex,

Thank you for the quick response. I was not aware of “nanddump”; here are the results (after re-TEZI-ing):

# nanddump -o /dev/mtd4
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 3
Number of bbt blocks: 4
Not printing binary garbage to tty. Use '-a'
or '--forcebinary' to override.

Boot log attached as well, Ctrl-F for “FAILED”

alex.tx · August 8, 2024, 10:12pm

Looks completely OK. As for the failed messages, I can’t see anything related to a NAND problem.
Please use

systemctl status dbus.service
systemctl status connman.service
systemctl status set-hostname.service

to get more info. Please note that the connman and hostname services depend on the D-Bus service, so they can fail due to a D-Bus service failure.
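The three checks above can be wrapped in a small helper for use on a production line. This is only a sketch: the function name check_units and the SYSTEMCTL override variable are made up for illustration, while systemctl is-active is the standard systemd subcommand that prints a unit's state.

```shell
# Print the state of each unit named in the post above.
# D-Bus is listed first because the other two depend on it.
# SYSTEMCTL is a hypothetical override used for dry runs; it defaults to systemctl.
check_units() {
    for unit in dbus.service connman.service set-hostname.service; do
        # is-active prints the state and returns non-zero when the unit is not active
        state=$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null) || true
        echo "$unit: ${state:-unknown}"
    done
}

# Typical use on the target:
#   check_units
```

If dbus.service shows up as failed, journalctl -b -u dbus.service --no-pager should show its log for the current boot.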

Logosol · August 9, 2024, 12:00am

Thank you, Alex, I’ll relay these to our production team for future reference.

Edward · August 9, 2024, 7:16am

nanddump with just an MTD device specified reports the current error correction status. If the OS doesn’t touch blocks with ECC errors, you won’t be aware of the ECC errors in those unread blocks (the error statistics are accumulated from boot and lost on reset). You can let nanddump read the whole MTD partition to get a better understanding of how safe your data is. So

nanddump -o /dev/mtd4 > /dev/null

and you should see some ECC bit errors corrected. You may have quite a lot of flash pages with a single bit corrected; this is OK. But if you get numerous pages with, say, 5 bits corrected, then it could be a reason to worry.
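To turn that rule of thumb into a number, the per-page messages can be counted. The sketch below assumes that nanddump writes lines of the form "ECC: N corrected bitflip(s) at offset 0x..." to stderr while reading; please verify this against the mtd-utils version on your image. The function name summarize_bitflips and the THRESHOLD variable are made up for illustration, and the default of 5 is simply the figure mentioned above.

```shell
# Count pages with corrected bitflips in nanddump's stderr output.
# Assumed input line format (check your mtd-utils version):
#   ECC: 1 corrected bitflip(s) at offset 0x00020000
# THRESHOLD is a hypothetical knob; 5 is the rule-of-thumb value from the post.
summarize_bitflips() {
    awk -v threshold="${THRESHOLD:-5}" '
        /^ECC: [0-9]+ corrected bitflip/ {
            pages++                              # one message per affected read
            if ($2 + 0 >= threshold) worrying++  # $2 is the number of corrected bits
        }
        END {
            printf "pages with corrections: %d\n", pages
            printf "pages at or above %d bits: %d\n", threshold, worrying
        }'
}

# Typical use on the target (data goes to /dev/null, messages go to the filter):
#   nanddump -o /dev/mtd4 2>&1 >/dev/null | summarize_bitflips
```

The redirection order matters: 2>&1 first points stderr at the pipe, then >/dev/null discards the dumped data.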

I hope you are running the ubihealthd service. If not, you should certainly enable it with systemctl enable ubihealthd. It keeps reading random flash blocks that belong to UBI and moves data from blocks with a high number of ECC corrections to less worn blocks.

You should also look at the UBI erase counters. Perhaps your flash is limited to 100k erases, so just see how far you are from that limit.
ubiscan /dev/mtd4
If ubiscan is not available, then dmesg | grep mean should report at least max/mean values of erase counters.
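The dmesg route can be scripted as well. The sketch below assumes the UBI attach message looks like "ubi0: max/mean erase counter: 12/5, WL threshold: 4096, ..."; the function name erase_counter_report and the LIMIT variable are made up for illustration, and the default of 100000 is the guess from this post, not a datasheet value.

```shell
# Extract max/mean erase counters from the kernel log and compare with a limit.
# Assumed input line format:
#   ubi0: max/mean erase counter: 12/5, WL threshold: 4096, ...
# LIMIT is a hypothetical knob; take the real endurance from the NAND datasheet.
erase_counter_report() {
    sed -n 's|.*max/mean erase counter: \([0-9]*\)/\([0-9]*\).*|\1 \2|p' |
    while read -r max mean; do
        limit="${LIMIT:-100000}"
        # integer percentage of the assumed endurance already consumed
        echo "max=$max mean=$mean used=$((max * 100 / limit))% of $limit"
    done
}

# Typical use on the target:
#   dmesg | erase_counter_report
```

A large gap between the max and mean values would suggest that a few blocks are wearing much faster than the rest.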