Colibri T20 flash reliability strategy

anders · March 30, 2021, 1:41am

Greetings!

Background

I’m an engineer working for a customer who has had some reliability issues with a Colibri T20-based product. The problem has manifested as UBIFS filesystem corruption. We tried upgrading the Linux kernel to bring in a newer version of UBIFS, but filesystem writes were still unreliable. After seeing the recommendation to try yaffs2, we built a custom image using yaffs2 instead.

The yaffs2 filesystem has worked reliably so far.

However, in some cases, we’ve seen corruption in the Linux kernel and the U-Boot environment, both of which are stored directly on flash without any filesystem.

We’d like to completely resolve the reliability issues, so I’ve tried to dig into how the Colibri T20 product is supposed to cope with potentially unreliable flash memory. (Ideally, we don’t want to ship a “fixed” product, only to receive field reports of bricked units yet again. The end user is unlikely to be satisfied to learn that the units are now bricked for a slightly different reason and that the previous complaint has in fact been resolved.)

I wasn’t personally involved in the original configuration of the Linux OS on these devices, but I have the toolchain and I can see it uses Yocto. It’s based on the official Toradex toolchain.

Flash failure points

From what I gather, possible failure points for the flash memory are (correct me if I’m wrong):

#1: The flash could have failed at block 0, causing the stage-0 bootloader to not load.

#2: The flash could have failed somewhere within U-Boot, causing the rest of U-Boot to not load.

#3: The flash could have failed where the U-Boot environment variables are stored.

#4: The flash could have failed where the Linux kernel is stored.

#5: The flash could have failed where the root filesystem is stored.

The issue we have resolved is failure at point #5. I’m unsure why yaffs2 works better. I think it might be because yaffs2 contains a “block refreshing” mechanism, which I think UBIFS lacks. Yaffs2 seems to be designed to handle read disturb and write disturb, which the UBIFS webpage explicitly says UBIFS does not yet handle.

From what I can understand, the product is still susceptible to flash faults in any of the non-yaffs2 areas. I don’t think it even has bad-block management (please correct me if I’m wrong!) for points #1-#4 above.

Questions

Question A

Is it known why UBIFS does not work for Colibri T20 flash? Is it because of “write disturb”/“read disturb”? If it is because of “read disturb”, isn’t there a risk with every reboot that U-Boot will be corrupted? Should we implement some sort of ‘refresh’ task for the flash failure points in #1-#4?

Question B

The customer’s product supports firmware upgrade in the field, triggered by the user. As I understand it, there is a risk that any part of the flash memory fails and is marked as bad. Is there a strategy for dealing with flash memory failure at failure points #1-#4 above? I know some CPUs have support for reading the stage-0 bootloader from the next block if the first block is marked as bad. Does the ARM CPU in the Colibri T20 do this?

The firmware updates in our case are in principle performed by U-Boot reading the image to RAM, and then writing it to the flash memory. Should we place everything (including env-vars and Linux kernel) in yaffs2-fs images, to gain support for bad blocks? How about U-Boot?

Any hints or resources for this would be much appreciated!

Best regards,
/Anders

marcel.tx · March 30, 2021, 2:26pm

Is it known why UBIFS does not work for Colibri T20 flash?

Yes, it is due to known issues that have long since been fixed upstream. Unfortunately, back-porting those fixes to such an ancient downstream version is almost impossible.

Is it because of “write disturb”/“read disturb”?

No, the SLC flash used on the Colibri T20 should not exhibit any such issues.

If it is because of “read disturb”, isn’t there a risk with every reboot that U-Boot will be corrupted?

No.

Should we implement some sort of ‘refresh’ task for the flash failure points in #1-#4?

This should really not be required in the Colibri T20 case, although upstream has meanwhile gained a solution for this as well.
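To make the idea of a ‘refresh’ task concrete, here is a toy Python model. It is not the upstream mechanism and not Toradex code: the `Block` class, the `bitflips` counter and the `scrub` function are hypothetical names invented for illustration. It assumes the ECC reports how many bits it corrected on each read, and that a block is rewritten once that count reaches a chosen threshold.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical model of one erase block on a raw flash partition."""
    data: bytes
    bitflips: int = 0  # correctable bit errors the ECC reported on the last read


def scrub(blocks, threshold):
    """Refresh every block whose corrected-bitflip count reached the threshold.

    Returns the indices of the blocks that were refreshed.
    """
    refreshed = []
    for index, block in enumerate(blocks):
        if block.bitflips >= threshold:
            # The ECC still delivers correct data at this point, so the block
            # can be erased and reprogrammed with the corrected contents.
            # In this model that simply resets the error counter.
            block.bitflips = 0
            refreshed.append(index)
    return refreshed


blocks = [Block(b"env", 0), Block(b"kernel-0", 3), Block(b"kernel-1", 1)]
refreshed = scrub(blocks, threshold=2)
```

In this run only the block at index 1 crosses the threshold and is refreshed; the other two are left untouched.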

The customer’s product supports firmware upgrade in the field, triggered by the user. As I understand, there is a risk that any part of the flash memory fails and is marked as bad. Is there a strategy for dealing with flash memory failure at failure points #1-#4 above?

Yes, bad blocks are just skipped over. Plus, of course, regular ECC functionality is used as well.
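As a rough illustration of what “skipping over” bad blocks means for a raw partition, the following Python sketch lays an image out on the good blocks only and reads it back the same way. The function names, the block size and the set of bad blocks are all made up for the example; this is a simulation, not the actual U-Boot or kernel code.

```python
def write_skipping_bad(image, block_size, bad_blocks, partition_blocks):
    """Place the image chunk by chunk on good blocks only.

    Returns a dict mapping physical block number to the chunk stored there.
    """
    chunks = [image[i:i + block_size] for i in range(0, len(image), block_size)]
    layout = {}
    physical = 0
    for chunk in chunks:
        # Advance past every block that is marked bad.
        while physical < partition_blocks and physical in bad_blocks:
            physical += 1
        if physical >= partition_blocks:
            raise RuntimeError("partition has too few good blocks for the image")
        layout[physical] = chunk
        physical += 1
    return layout


def read_skipping_bad(layout, bad_blocks, partition_blocks):
    """Reassemble the image by walking the partition and ignoring bad blocks."""
    return b"".join(
        layout[block]
        for block in range(partition_blocks)
        if block not in bad_blocks and block in layout
    )


image = b"AAAABBBBCC"
layout = write_skipping_bad(image, block_size=4, bad_blocks={1}, partition_blocks=5)
restored = read_skipping_bad(layout, bad_blocks={1}, partition_blocks=5)
```

With block 1 marked bad, the three chunks end up in blocks 0, 2 and 3. This is why a raw partition has to be larger than the image it holds: each bad block consumes one block of spare capacity.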

I know some CPUs have support for reading the stage-0 bootloader from the next block,

Block 0 is guaranteed good for a hundred thousand write cycles, so that should really not be an issue unless one does something ultra stupid.

if the first block is marked as bad. Does the ARM CPU in the Colibri T20 do this?

It’s actually the boot ROM that does such handling. While it would in theory support a redundant boot option, we at Toradex, or at least the Embedded Linux BSP, never integrated any such support.
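For readers unfamiliar with the idea, a redundant boot option generally means keeping more than one copy of the bootloader and falling back to the next copy when one fails an integrity check. The Python sketch below shows only that generic principle. It is not the Tegra boot ROM’s actual algorithm, and `pick_boot_copy` and the CRC-32 check are assumptions chosen for illustration.

```python
import zlib


def pick_boot_copy(copies):
    """Return the index of the first copy whose CRC-32 matches, or None.

    `copies` is a list of (payload, stored_crc) tuples, in boot order.
    """
    for index, (payload, stored_crc) in enumerate(copies):
        if zlib.crc32(payload) == stored_crc:
            return index
    return None


good = b"bootloader image"
crc = zlib.crc32(good)

# The first copy has been corrupted in flash; the second copy is intact.
copies = [(b"bootloader imagX", crc), (good, crc)]
selected = pick_boot_copy(copies)
```

Here the corrupted first copy fails the check and the second copy is selected. If every copy were damaged, the function would return `None` and the device could not boot from flash.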

The firmware updates in our case are in principle performed by U-Boot reading the image to RAM, and then writing it to the flash memory. Should we place everything (including env-vars and Linux kernel) in yaffs2-fs images, to gain support for bad blocks?

The regular strategy of just skipping over bad blocks should be sufficient. But, of course, that won’t help if new bad blocks develop afterwards. However, on SLC this should not really happen if one is only reading from it. And the write case, again, should handle it by said skip strategy.
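The write case can be pictured with a small Python simulation: when programming a block fails, that block is marked bad and the same chunk is retried on the next good block. Everything here is hypothetical, including the `program_image` name and the `failing_blocks` set, which stands in for blocks that wear out during this particular write.

```python
def program_image(chunks, partition_blocks, bad_blocks, failing_blocks):
    """Program chunks in order, marking blocks bad when programming fails.

    Returns (layout, bad) where layout maps physical block to chunk and
    bad is the updated set of bad blocks.
    """
    bad = set(bad_blocks)
    layout = {}
    physical = 0
    for chunk in chunks:
        while True:
            if physical >= partition_blocks:
                raise RuntimeError("ran out of good blocks")
            if physical in bad:
                physical += 1          # already known bad: skip it
                continue
            if physical in failing_blocks:
                bad.add(physical)      # programming failed: mark bad, retry
                physical += 1
                continue
            layout[physical] = chunk   # programming succeeded
            physical += 1
            break
    return layout, bad


layout, bad = program_image(
    chunks=[b"k0", b"k1"],
    partition_blocks=4,
    bad_blocks={0},
    failing_blocks={2},
)
```

In this run block 0 was already bad and block 2 fails during the write, so the two chunks land in blocks 1 and 3 and the bad-block set grows to `{0, 2}`. A block that goes bad after the write has finished is not covered by this scheme, which matches the caveat above.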

How about U-Boot?

Yes, that’s what I was talking about above.