Greetings!
Background
I’m an engineer working for a customer who has had reliability issues with a Colibri T20-based product. The problem has manifested as UBIFS filesystem corruption. We tried upgrading the Linux kernel to bring in a newer version of UBIFS, but filesystem writes were still not reliable. After seeing the recommendation to try yaffs2, we built a custom image using yaffs2 instead.
The yaffs2 filesystem has worked reliably so far.
However, in some cases we’ve seen corruption in the Linux kernel and the U-Boot environment, which are stored directly on flash without any filesystem.
We’d like to resolve the reliability issues completely, so I’ve tried to dig into how the Colibri T20 is supposed to cope with potentially unreliable flash memory. (Ideally, we don’t want to ship a “fixed” product only to receive field reports of bricked units yet again. End users are unlikely to be satisfied to learn that their units are now bricked for a slightly different reason and that the previous complaint has in fact been resolved.)
I wasn’t personally involved in the original configuration of the Linux OS on these devices, but I have the toolchain and I can see that it uses Yocto. It’s based on the official Toradex toolchain.
Flash failure points
From what I gather, possible failure points for the flash memory are (correct me if I’m wrong):
#1: The flash could have failed at block 0, causing the stage-0 bootloader to not load.
#2: The flash could have failed somewhere within U-Boot, causing the rest of U-Boot to not load.
#3: The flash could have failed where the U-Boot environment variables are stored.
#4: The flash could have failed where the Linux kernel is stored.
#5: The flash could have failed where the root filesystem is stored.
The issue we have resolved is failure at point #5. I’m unsure why yaffs2 works better. I think it might be because yaffs2 contains a “block refreshing” mechanism, which I think UBIFS lacks. Yaffs2 seems to be designed to handle read disturb and write disturb, which the UBIFS webpage explicitly says UBIFS does not yet handle.
From what I can understand, the product is still susceptible to flash faults in any of the non-yaffs2 areas. I don’t think it even has bad-block management (please correct me if I’m wrong!) for points #1-#4 above.
Questions
Question A
Is it known why UBIFS does not work on the Colibri T20 flash? Is it because of “write disturb” or “read disturb”? If it is because of “read disturb”, isn’t there a risk with every reboot that U-Boot will be corrupted? Should we implement some sort of “refresh” task for the flash failure points #1-#4?
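To make the “refresh” idea concrete, here is a rough sketch of what I have in mind, written as a plain Python model rather than real driver code. Everything in it is hypothetical: `SimBlock`, `corrected_bitflips` and `refresh_blocks` are names I made up, and I’m assuming the flash driver can report how many bitflips ECC corrected when a block was last read.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimBlock:
    """Hypothetical model of one erase block in a raw (non-filesystem) area."""
    data: bytes
    corrected_bitflips: int = 0  # assumed: bitflips ECC fixed on the last read
    bad: bool = False            # assumed: block is marked bad


def refresh_blocks(blocks: List[SimBlock], threshold: int) -> List[int]:
    """Rewrite every good block whose corrected-bitflip count reached threshold.

    Returns the indices of the blocks that were refreshed.
    """
    refreshed = []
    for index, block in enumerate(blocks):
        if block.bad or block.corrected_bitflips < threshold:
            continue
        payload = block.data          # ECC-corrected contents read back
        # On real hardware this step would be "erase block, then program it
        # again with payload"; in this model it just resets the counter.
        block.data = payload
        block.corrected_bitflips = 0
        refreshed.append(index)
    return refreshed


blocks = [SimBlock(b"spl", 0), SimBlock(b"uboot", 3), SimBlock(b"env", 2)]
print(refresh_blocks(blocks, threshold=2))
```

The open question for me is whether something like this is needed for the raw areas at all, and if so whether it can be done safely (a power cut between erase and rewrite would itself brick the unit).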
Question B
The customer’s product supports firmware upgrades in the field, triggered by the user. As I understand it, there is a risk that any part of the flash memory fails and is marked as bad. Is there a strategy for dealing with flash memory failure at failure points #1-#4 above? I know some CPUs support reading the stage-0 bootloader from the next block if the first block is marked as bad. Does the ARM CPU in the Colibri T20 do this?
The firmware updates in our case are in principle performed by U-Boot reading the image into RAM and then writing it to the flash memory. Should we place everything (including the environment variables and the Linux kernel) in yaffs2 filesystem images, to gain support for bad blocks? How about U-Boot itself?
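The kind of fallback I’m asking about could look roughly like the sketch below: several copies of the bootloader are kept, and the first copy that is readable and passes a checksum is used. This is only an illustration of the general idea, not a description of what the T20 boot ROM does. The function name `pick_boot_copy`, the use of CRC32, and representing a bad block as `None` are all my own assumptions.

```python
import zlib
from typing import List, Optional


def pick_boot_copy(copies: List[Optional[bytes]], expected_crc: int) -> Optional[int]:
    """Return the index of the first usable bootloader copy, or None.

    copies[i] is None when that block is marked bad or cannot be read
    (an assumption of this model).
    """
    for index, blob in enumerate(copies):
        if blob is None:
            continue  # skip bad or unreadable block, try the next copy
        if zlib.crc32(blob) & 0xFFFFFFFF == expected_crc:
            return index
    return None


good = b"stage-0 bootloader image"
crc = zlib.crc32(good) & 0xFFFFFFFF
# First block is bad, second copy is corrupted, third copy is intact.
print(pick_boot_copy([None, b"stage-0 bootloader imagX", good], crc))
```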
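As an alternative to wrapping everything in a filesystem, I’ve been wondering whether a plain “skip bad blocks” scheme for the raw partitions would be enough. The sketch below models what I mean: the writer places the image only in good blocks, and the reader skips the same bad blocks when loading it. It is a simplified model under my own assumptions (the partition is oversized so there are spare blocks, and the bad-block list is the same at write time and read time); `write_skipping_bad` and `read_skipping_bad` are hypothetical names, not existing U-Boot or Linux functions.

```python
from typing import Dict, List


def write_skipping_bad(image: bytes, bad: List[bool], block_size: int) -> Dict[int, bytes]:
    """Place image chunks into good blocks only; returns {block index: chunk}."""
    chunks = [image[i:i + block_size] for i in range(0, len(image), block_size)]
    good = [index for index, is_bad in enumerate(bad) if not is_bad]
    if len(chunks) > len(good):
        raise ValueError("partition has too few good blocks for this image")
    return dict(zip(good, chunks))


def read_skipping_bad(flash: Dict[int, bytes], bad: List[bool], length: int) -> bytes:
    """Reassemble the image by reading good blocks in order."""
    data = b"".join(flash[i] for i in sorted(flash) if not bad[i])
    return data[:length]


image = b"ABCDEFGHIJ"
bad_map = [False, True, False, False, False]   # block 1 is marked bad
flash = write_skipping_bad(image, bad_map, block_size=4)
print(sorted(flash))                            # block 1 is never used
print(read_skipping_bad(flash, bad_map, len(image)) == image)
```

This obviously only helps with blocks that are already marked bad when the image is written. It would not protect against a block going bad later, which is why I’m also asking about refreshing and redundancy above.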
Any hints or resources for this would be much appreciated!
Best regards,
/Anders