Ubifs errors on Colibri VF50

sst · October 21, 2016, 1:43pm

Hello,

we sent one of our devices to a customer and after some days u-boot was not able to mount the root partition (to load the kernel), the first bytes of the ubifs filesystem only contained 0xff 0xff 0xff…

So I did some reseach (some hundred power-cuts while system runs) and the root file system is really broken (even if the system still boots).
I often get messages like this:

[  907.760817] UBIFS error (ubi0:0 pid 633): ubifs_read_node: bad node type (255 but exp
ected 2)
[  907.769791] UBIFS error (ubi0:0 pid 633): ubifs_read_node: bad node at LEB 519:20352, LEB mapping status 0
[  907.781958] Not a node, first 24 bytes:
[  907.786058] 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff                     
     ........................
[  907.799373] CPU: 0 PID: 633 Comm: systemd-tmpfile Not tainted 4.1.15-v2.5b3+ge6d111c #1
[  907.810335] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[  907.817947] Backtrace: 
[  907.820560] [<80012cf8>] (dump_backtrace) from [<80012f18>] (show_stack+0x18/0x1c)
[  907.830345]  r7:00000279 r6:00000207 r5:87ad6000 r4:00004f80
[  907.836573] [<80012f00>] (show_stack) from [<805b64c4>] (dump_stack+0x24/0x28)
[  907.845881] [<805b64a0>] (dump_stack) from [<802424fc>] (ubifs_read_node+0x294/0x314)
[  907.855136] [<80242268>] (ubifs_read_node) from [<80260560>] (ubifs_tnc_read_node+0x134/0x140)
[  907.865052]  r10:8483be00 r9:848b69c0 r8:8483be00 r7:00000002 r6:848b69c0 r5:87ad6000
[  907.874455]  r4:8483be30
[  907.877181] [<8026042c>] (ubifs_tnc_read_node) from [<80243788>] (tnc_read_node_nm+0xcc/0x1e4)
[  907.887875]  r8:8483be00 r7:87ad61c0 r6:87ad6000 r5:848b69c0 r4:8483be30
[  907.895189] [<802436bc>] (tnc_read_node_nm) from [<80247030>] (ubifs_tnc_next_ent+0x140/0x1a4)
[  907.905817]  r7:87ad61c0 r6:87ad6000 r5:84841ee8 r4:00000000
[  907.911777] [<80246ef0>] (ubifs_tnc_next_ent) from [<8023a0d0>] (ubifs_readdir+0x320/0x4a4)
[  907.922192]  r10:00000000 r9:00000000 r8:00000000 r7:861a5600 r6:848b6540 r5:84841f68
[  907.930608]  r4:800e98fc
[  907.934552] [<80239db0>] (ubifs_readdir) from [<800e972c>] (iterate_dir+0x84/0x104)
[  907.942362]  r10:00000000 r9:84840000 r8:84841f68 r7:861a5664 r6:848b6548 r5:00000000
[  907.952306]  r4:848b6540
[  907.956297] [<800e96a8>] (iterate_dir) from [<800e9bc4>] (SyS_getdents64+0x80/0xe8)
[  907.966025]  r10:00000000 r9:84840000 r8:8000f9a4 r7:848b6540 r6:848b6540 r5:560341f4
[  907.974433]  r4:00008000
[  907.977117] [<800e9b44>] (SyS_getdents64) from [<8000f800>] (ret_fast_syscall+0x0/0x3c)
[  907.987706]  r7:000000d9 r6:76f6d850 r5:560341f4 r4:560341f0
[  907.994732] UBIFS error (ubi0:0 pid 633): ubifs_readdir: cannot find next direntry, error -22
[  908.015595] UBIFS error (ubi0:0 pid 633): ubifs_read_node: bad node type (255 but expected 2)
[  908.024899] UBIFS error (ubi0:0 pid 633): ubifs_read_node: bad node at LEB 519:20352, LEB mapping status 0
[  908.037072] Not a node, first 24 bytes:
[  908.040834] 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff                     
     ........................
[  908.056352] CPU: 0 PID: 633 Comm: systemd-tmpfile Not tainted 4.1.15-v2.5b3+ge6d111c #1
[  908.065826] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[  908.072409] Backtrace: 
[  908.076146] [<80012cf8>] (dump_backtrace) from [<80012f18>] (show_stack+0x18/0x1c)
[  908.085642]  r7:00000279 r6:00000207 r5:87ad6000 r4:00004f80
[  908.091599] [<80012f00>] (show_stack) from [<805b64c4>] (dump_stack+0x24/0x28)
[  908.100792] [<805b64a0>] (dump_stack) from [<802424fc>] (ubifs_read_node+0x294/0x314)
[  908.109943] [<80242268>] (ubifs_read_node) from [<80260560>] (ubifs_tnc_read_node+0x134/0x140)
[  908.119943]  r10:8483be00 r9:85e870c0 r8:8483be00 r7:00000002 r6:85e870c0 r5:87ad6000
[  908.128366]  r4:8483be30
[  908.131038] [<8026042c>] (ubifs_tnc_read_node) from [<80243788>] (tnc_read_node_nm+0xcc/0x1e4)
[  908.142246]  r8:8483be00 r7:87ad61c0 r6:87ad6000 r5:85e870c0 r4:8483be30
[  908.149545] [<802436bc>] (tnc_read_node_nm) from [<80247030>] (ubifs_tnc_next_ent+0x140/0x1a4)
[  908.160153]  r7:87ad61c0 r6:87ad6000 r5:84841ee8 r4:00000000
[  908.166390] [<80246ef0>] (ubifs_tnc_next_ent) from [<8023a0d0>] (ubifs_readdir+0x320/0x4a4)
[  908.176629]  r10:00000000 r9:00000000 r8:00000000 r7:861a5600 r6:85e87480 r5:84841f68
[  908.185053]  r4:800e98fc
[  908.187729] [<80239db0>] (ubifs_readdir) from [<800e972c>] (iterate_dir+0x84/0x104)
[  908.198372]  r10:00000000 r9:84840000 r8:84841f68 r7:861a5664 r6:85e87488 r5:00000000
[  908.206809]  r4:85e87480
[  908.209482] [<800e96a8>] (iterate_dir) from [<800e9bc4>] (SyS_getdents64+0x80/0xe8)
[  908.219700]  r10:00000000 r9:84840000 r8:8000f9a4 r7:85e87480 r6:85e87480 r5:560341f4
[  908.228224]  r4:00008000
[  908.230910] [<800e9b44>] (SyS_getdents64) from [<8000f800>] (ret_fast_syscall+0x0/0x3c)
[  908.241482]  r7:000000d9 r6:76f6d850 r5:560341f4 r4:560341f0
[  908.248501] UBIFS error (ubi0:0 pid 633): ubifs_readdir: cannot find next direntry, error -22

I think, an embedded system should survive power cuts.

Do you know any solution for this problem?
Simply setting the read-only-rootfs image feature in OE fails because of some bugs in the BSP/Angström layers.

stefan.tx · October 21, 2016, 6:27pm

We did some power-cut tests with the V2.5 release and could not see that error. However, we did see problems when the root file system was full (or close to full). Could it be that your root file system was close to full at this point?

In the V2.5 release we mount the root file system also using U-Boot, and we suspect that U-Boot’s UBI/UBIFS implementation might corrupt the root file system. For that reason we switched to separate kernel/device tree UBI volumes with the V2.6 release. With that change, U-Boot only needs to attach the UBI partition and not mount the UBIFS. Can you try reproducing the issue with the latest V2.6.1 release?

sst · November 10, 2016, 2:34pm

Hello,
thank you for your reply.
I did a lot of tests with thousands of power cuts and with the current BSP the error seems to be gone.
I am not sure if I will switch to 2.6 or make the needed changes in our OE layer because I had to change a lot of stuff (device tree, drivers, u-boot, package modifications,…) for our device.

Are there more changes (except moving kerned/dtb to an extra volume) which could have influence to the ubifs problem?

stefan.tx · November 10, 2016, 6:24pm

We did change the kernel version, but the issue seems to be more U-Boot related. But there haven’t been changes on the UBI/UBIFS/NAND driver implementation between V2.5 Beta 3 and V2.6.1 Beta 1, so I think it is only that extra volume move which makes the difference…

Bernd · January 4, 2017, 9:13am

Hi,

we are having the same problem, i.e. after several hundreds of power cuts uboot is not capabable any more to load the kernel and dtb from the rootfs.
So, as suggested in this answer I build a new uboot and moved the kernel and dtb into their own dedicated UBIFS volumes.
After several hundreds of power cuts where the system worked fine the kernel was not able any more to mount the rootfs.
During this test, the rootfs was filled about 80-90%.

We are using a VF-50, the OS is based on 2.5 beta 3

The kernel boot messages where things start to go wrong are according to my understanding:

[ 2.660608] UBIFS error (ubi0:2 pid 1): ubifs_read_node: bad node type (255 but expected 3)
[ 2.669003] UBIFS error (ubi0:2 pid 1): ubifs_read_node: bad node at LEB 844:97008, LEB mapping status 1
[ 2.678598] Not a node, first 24 bytes:
[ 2.682363] 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
[ 2.695503] CPU: 0 PID: 1 Comm: swapper Tainted: G W 4.1.15-00001-g631715a #46
[ 2.703916] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[ 2.710443] Backtrace:
[ 2.712979] [<80012ad4>] (dump_backtrace) from [<80012cf4>] (show_stack+0x18/0x1c)
[ 2.720642] r7:00000001 r6:0000034c r5:86ad3000 r4:00017af0
[ 2.726418] [<80012cdc>] (show_stack) from [<805d92c8>] (dump_stack+0x24/0x28)
[ 2.733763] [<805d92a4>] (dump_stack) from [<80208c88>] (ubifs_read_node+0x294/0x314)
[ 2.741698] [<802089f4>] (ubifs_read_node) from [<80208db4>] (ubifs_read_node_wbuf+0xac/0x2e4)
[ 2.750369] r10:00000049 r9:86040f80 r8:00000003 r7:86ad3000 r6:0000034c r5:86bf0720
[ 2.758289] r4:00017af0

So it seems that this workaround does not work, at least not for us

stefan.tx · January 4, 2017, 3:13pm

Do I see things correct, you are using seperate UBI volumes (or UBIFS?) for storing DTB/Kernel with V2.5 Beta 3?

We renewed our effort and started multiple power cut tests, we could reproduce similar stack traces recently. We haven’t found the cause, we are still investigating at this point.

stefan.tx · January 19, 2017, 2:03am

It seems that the UBIFS has a bug which caused the above stack trace. This patch seems to solve the issue: The description matches the stack trace, and our preliminary tests show promising results.

I pushed that patch on our -next branch:
http://git1.toradex.com/cgit/linux-toradex.git/commit/?h=toradex_vf_4.4-next&id=4cceedeff6651e7d4f977c5b25ad442cd6d2f87c

Thanks @Bhuvan for hunting that issue down!

sst · February 13, 2017, 3:27pm

Hello Stefan,
I just got a device (based on SDK 2.6.1) from our customer with the following UBIFS error:

[    1.756094] ubi0: default fastmap pool size: 35
[    1.760760] ubi0: default fastmap WL pool size: 17
[    1.765981] ubi0: attaching mtd3
[    1.866052] ubi0: attached by fastmap
[    1.869841] ubi0: fastmap pool size: 35
[    1.873885] ubi0: fastmap WL pool size: 17
[    1.894746] ubi0: attached mtd3 (name "ubi", size 90 MiB)
[    1.900281] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[    1.907448] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[    1.914465] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[    1.921529] ubi0: good PEBs: 719, bad PEBs: 1, corrupted PEBs: 0
[    1.927715] ubi0: user volume: 3, internal volumes: 1, max. volumes count: 128
[    1.935167] ubi0: max/mean erase counter: 13/3, WL threshold: 4096, image sequence number: 0
[    1.943824] ubi0: available PEBs: 0, total reserved PEBs: 719, PEBs reserved for bad PEB handling: 19
[    1.953314] ubi0: background thread "ubi_bgt0d" started, PID 61
[    1.959532] ubi1: default fastmap pool size: 10
[    1.964351] ubi1: default fastmap WL pool size: 5
[    1.969464] ubi1: attaching mtd4
[    2.162743] ubi1: scanning is finished
[    2.178089] ubi1: attached mtd4 (name "config", size 36 MiB)
[    2.184082] ubi1: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[    2.191079] ubi1: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[    2.198070] ubi1: VID header offset: 2048 (aligned 2048), data offset: 4096
[    2.205230] ubi1: good PEBs: 284, bad PEBs: 4, corrupted PEBs: 0
[    2.211338] ubi1: user volume: 1, internal volumes: 1, max. volumes count: 128
[    2.218889] ubi1: max/mean erase counter: 22/4, WL threshold: 4096, image sequence number: 0
[    2.227588] ubi1: available PEBs: 0, total reserved PEBs: 284, PEBs reserved for bad PEB handling: 16
[    2.237084] ubi1: background thread "ubi_bgt1d" started, PID 62
[    2.245744] input: gpio-keys as /devices/platform/gpio-keys/input/input0
[    2.254830] snvs_rtc 400a7000.snvs:snvs-rtc-lp: setting system clock to 1970-01-01 00:00:00 UTC (0)
[    2.335057] UBIFS (ubi0:2): recovery needed
[    3.322141] UBIFS error (ubi0:2 pid 1): ubifs_read_node: bad node type (178 but expected 3)
[    3.330949] UBIFS error (ubi0:2 pid 1): ubifs_read_node: bad node at LEB 459:103328, LEB mapping status 1
[    3.340890] Not a node, first 24 bytes:
[    3.344832] 00000000: ac be 00 60 ad be 00 c0 ae be 00 20 b0 be 00 80 b1 be 00 e0 b2 be 00 40                     
     ...`....... ...........@
[    3.358341] CPU: 0 PID: 1 Comm: swapper Not tainted 4.4.21-v2.6.1b1+g7ecc29c #1
[    3.366021] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[    3.372761] Backtrace: 
[    3.375398] [<800136e8>] (dump_backtrace) from [<800138e0>] (show_stack+0x18/0x1c)
[    3.383339]  r7:00000001 r6:000001cb r5:86b1e000 r4:000193a0
[    3.389253] [<800138c8>] (show_stack) from [<802d7e80>] (dump_stack+0x24/0x28)
[    3.396882] [<802d7e5c>] (dump_stack) from [<80250afc>] (ubifs_read_node+0x29c/0x318)
[    3.405122] [<80250860>] (ubifs_read_node) from [<80250c48>] (ubifs_read_node_wbuf+0xd0/0x340)
[    3.414087]  r10:00000049 r9:00000003 r8:8611ff80 r7:86b1e000 r6:000001cb r5:86079320
[    3.422432]  r4:000193a0
[    3.425132] [<80250b78>] (ubifs_read_node_wbuf) from [<8026ec48>] (ubifs_tnc_read_node+0x50/0x144)
[    3.434456]  r10:86128d00 r9:8611ff80 r8:86128d00 r7:00000003 r6:8611ff80 r5:86b1e000
[    3.442797]  r4:86128d30
[    3.445490] [<8026ebf8>] (ubifs_tnc_read_node) from [<80251da4>] (tnc_read_node_nm+0xcc/0x1e8)
[    3.454474]  r7:86b1e1e8 r6:86b1e000 r5:8611ff80 r4:86128d30
[    3.460386] [<80251cd8>] (tnc_read_node_nm) from [<8025565c>] (ubifs_tnc_next_ent+0x144/0x1a8)
[    3.469359]  r7:86b1e1e8 r6:86b1e000 r5:86843c18 r4:00000000
[    3.475497] [<80255518>] (ubifs_tnc_next_ent) from [<80255774>] (ubifs_tnc_remove_ino+0xb4/0x144)
[    3.484731]  r10:86b1e960 r9:00001b65 r8:ffffffff r7:00000000 r6:860bca40 r5:86b1e000
[    3.493082]  r4:860c05e0
[    3.495776] [<802556c0>] (ubifs_tnc_remove_ino) from [<80258308>] (ubifs_replay_journal+0xe80/0x14c8)
[    3.505365]  r10:86b1e960 r9:86843cf0 r8:860c05c0 r7:86b1e000 r6:860bca40 r5:00000000
[    3.513703]  r4:860c05e0
[    3.516375] [<80257488>] (ubifs_replay_journal) from [<8024cd80>] (ubifs_mount+0x118c/0x18b4)
[    3.525271]  r10:00000002 r9:00000000 r8:86070280 r7:86b1e000 r6:8607e000 r5:86070280
[    3.533600]  r4:86b1e7d0
[    3.536282] [<8024bbf4>] (ubifs_mount) from [<800e35c4>] (mount_fs+0x1c/0xac)
[    3.543752]  r10:8083013c r9:8082740c r8:00000000 r7:8083013c r6:8083013c r5:86070200
[    3.551875]  r4:8024bbf4
[    3.554788] [<800e35a8>] (mount_fs) from [<800fea70>] (vfs_kern_mount+0x50/0x11c)
[    3.562641]  r6:00008001 r5:86070200 r4:86b6ecc0
[    3.567481] [<800fea20>] (vfs_kern_mount) from [<80101afc>] (do_mount+0x1a8/0xcf4)
[    3.575409]  r9:8082740c r8:860701c0 r7:86070200 r6:00008001 r5:00000060 r4:00000000
[    3.583698] [<80101954>] (do_mount) from [<801029c8>] (SyS_mount+0x9c/0xc8)
[    3.590755]  r10:86fb2f00 r9:807366a8 r8:00008001 r7:807366a8 r6:00000000 r5:86070200
[    3.599119]  r4:860701c0
[    3.601791] [<8010292c>] (SyS_mount) from [<807e52cc>] (mount_block_root+0x140/0x268)
[    3.609992]  r8:00008001 r7:86078000 r6:86078000 r5:8080f858 r4:86078000
[    3.617195] [<807e518c>] (mount_block_root) from [<807e55cc>] (prepare_namespace+0xa4/0x1a0)
[    3.626003]  r10:8080f838 r9:00000008 r8:807e4600 r7:8080f834 r6:808510c0 r5:8080f868
[    3.634346]  r4:8080f858
[    3.637026] [<807e5528>] (prepare_namespace) from [<807e4ef8>] (kernel_init_freeable+0x1d0/0x1e0)
[    3.646270]  r5:808510c0 r4:807e1064
[    3.650050] [<807e4d28>] (kernel_init_freeable) from [<805f5208>] (kernel_init+0x10/0xf0)
[    3.658596]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:805f51f8
[    3.666933]  r4:00000000
[    3.669610] [<805f51f8>] (kernel_init) from [<8000fa38>] (ret_from_fork+0x14/0x3c)
[    3.677542]  r5:805f51f8 r4:00000000
[    3.705074] List of all partitions:
[    3.708712] No filesystem could mount root, tried:  ubifs
[    3.714687] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    3.723108] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

As you can see, the difference is that the first bytes are not deleted (ff ff ff…)
Do you think, the patch could help here, too?
Did you also get this kind of error?

stefan.tx · February 13, 2017, 5:37pm

My guess is that in this situation the xattr entry could also point to not yet cleared page, so I wouldn’t read too much into the data. The stack trace shows that mounting fails in ubifs_tnc_next_ent, which I think is the key observation here. Pretty sure the patch helps in this case too.

You can easily test that if you still have the unmountable file system: Update the kernel only with that patch applied and check whether that makes your UBIFS mountable again…

sst · February 14, 2017, 8:17am

Just installing an updated kernel was also my idea - and it works. Thank you!