OSTree rolling back to "unconfigured" state after flash/deploy

bw908 · May 9, 2023, 8:22pm

We are experiencing some issues with Verdin IMX8MPs “soft-bricking” themselves after a deploy and falling back into a state where they require a re-image using easy-installer/recovery mode.

This appears to be tied to a situation where a rollback triggers but the rollback is somehow faulty (?) In some cases there is nothing to roll back to (it is the first image deployed via TEZI/USB) and in other times it is after several development images have been deployed using torizoncore-builder.

The serial console log for one such instance is as follows:

[PRODUCT]-Hub-NOT-PROGRAMMED login: 
[  131.782002] watchdog: watchdog0: watchdog did not stop!
'HH�Could not get PHY for FEC1: addr 7
Could not get PHY for FEC1: addr 7
eth0: ethernet@30bf0000 [PRIME]
Saving Environment to MMC... Writing to MMC(2)... OK
Normal Boot
Hit any key to stop autoboot:  0 
switch to partitions #0, OK
mmc2(part 0) is current device
Scanning mmc 2:1...
Found U-Boot script /boot.scr
973 bytes read in 1 ms (950.2 KiB/s)
## Executing script at 50280000
5512 bytes read in 2 ms (2.6 MiB/s)
101944 bytes read in 4 ms (24.3 MiB/s)
31 bytes read in 2 ms (14.6 KiB/s)
Applying Overlay: [PRODUCT]-hub.dtbo
3341 bytes read in 4 ms (815.4 KiB/s)
13355717 bytes read in 44 ms (289.5 MiB/s)
22592703 bytes read in 72 ms (299.3 MiB/s)
   Uncompressing Kernel Image
## Flattened Device Tree blob at 50200000
   Booting using the fdt blob at 0x50200000
   Loading Device Tree to 00000000ffac5000, end 00000000ffb00fff ... OK
Remove /soc@0/bus@30800000/ethernet@30be0000:phy-reset-gpios

Starting kernel ...

[    0.936273] caam_jr 30901000.jr: failed to flush job ring 0
[    1.062485] mdio_bus 30be0000.ethernet-1: MDIO device at address 6 is missing.
Starting version 250.5+

�Could not get PHY for FEC1: addr 7
Could not get PHY for FEC1: addr 7
eth0: ethernet@30bf0000 [PRIME]
Normal Boot
Hit any key to stop autoboot:  0 
switch to partitions #0, OK
mmc2(part 0) is current device
Scanning mmc 2:1...
Found U-Boot script /boot.scr
973 bytes read in 1 ms (950.2 KiB/s)
## Executing script at 50280000
5512 bytes read in 2 ms (2.6 MiB/s)
Rollback enabled. Booting previously deployed version.
Failed to load '/boot/boot/ostree/torizon-18d7280ec39f76ae394c0e0fa65cd2a9d90d288f559961e761a99d4b1a7ccd22/dtb//bootimx8mp-verdin-[PRODUCT].dtb'
31 bytes read in 1 ms (30.3 KiB/s)
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
No FDT memory address configured. Please configure
the FDT address via "fdt addr <address>" command.
Aborting!
Applying Overlay: [PRODUCT]-hub.dtbo
3467 bytes read in 3 ms (1.1 MiB/s)
No FDT memory address configured. Please configure
the FDT address via "fdt addr <address>" command.
Aborting!
No FDT memory address configured. Please configure
the FDT address via "fdt addr <address>" command.
Aborting!
13355717 bytes read in 43 ms (296.2 MiB/s)
22592703 bytes read in 75 ms (287.3 MiB/s)
   Uncompressing Kernel Image
ERROR: Did not find a cmdline Flattened Device Tree
Could not find a valid device tree
SCRIPT FAILED: continuing...
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Scanning disk mmc@30b50000.blk...
Scanning disk mmc@30b60000.blk...
Found 4 disks
No EFI system partition
BootOrder not defined
EFI boot manager: Cannot load any image

Of particular note here are two things:

Firstly, there was only one failed boot between the last login prompt and the rollback being triggered.
Second, the line:
Failed to load '/boot/boot/ostree/torizon-18d7280ec39f76ae394c0e0fa65cd2a9d90d288f559961e761a99d4b1a7ccd22/dtb//bootimx8mp-verdin-[PRODUCT].dtb'

This DTB is non-existent; the correct name for the DTB is imx8mp-verdin-[PRODUCT].dtb

In other instances, I have observed the system falling back completely to the “dev board” configuration and using e.g. the imx8mp-verdin-nonwifi-dev.dtb file.

This is obviously concerning as it would not be good to have a fielded device brick itself in this way due to either power interruptions or a customer being overzealous with the reset button. (I can understand that it might roll back to the previous OS, but I would not expect it to roll back to an invalid configuration)

We are using TorizonCore 6.2.0 with a custom devicetree and changes from torizoncore-builder. No updates have yet been applied to these systems, only engineering deploys via torizoncore-builder deploy for testing

What’s going on and what can we do to prevent this?

jeremias.tx · May 9, 2023, 10:03pm

Greetings @bw908,

Let me see if I can summarize what’s happening to see if I understood everything correctly.

You start with a base TorizonCore image
You then apply some changes over network using torizoncore-builder deploy
Then in some cases a rollback gets triggered and in some of these rollbacks the system tries to rollback to a non-valid configuration.

Did I more or less get the gist of things?

So now, some questions.

How reproducible is this for you? Does it happen every time, randomly, only specific sequences? If possible could you give me instructions on how I might reproduce this?

Does the type of changes you’re trying to deploy matter at all for this issue?

Is the system rolling back for a valid reason? Or is the rollback unwarranted?

Now for some thoughts/comments:

First of all I’m not sure why exactly your system is rolling-back. Rollback support needs to be manually triggered, rollback is only automatic with OSTree updates coming via the update client Aktualizr. But you say you’re not doing updates and only using torizoncore-builder deploy so this automatic rollback shouldn’t be happening (unless there’s a bug). More details on this here: OSTree | Toradex Developer Center

If possible could you see if the rollback U-Boot variables mentioned in this article are being changed somehow?

This DTB is non-existent; the correct name for the DTB is imx8mp-verdin-[PRODUCT].dtb

So the boot script parses the name and path of the DTB from /boot/loader/uEnv.txt. Could you check this file to see if this is consistent?

This is obviously concerning as it would not be good to have a fielded device brick itself in this way due to either power interruptions or a customer being overzealous with the reset button. (I can understand that it might roll back to the previous OS, but I would not expect it to roll back to an invalid configuration)

I’m confused is the issue here caused by deploying new OSTrees with torizoncore-builder deploy or is it being caused by power-cuts/power-cycles?

Best Regards,
Jeremias

jeremias.tx · May 10, 2023, 1:09am

I did some additional testing after my initial response and I think I partially reproduced this at least:

U-Boot SPL 2022.04-6.2.0+git.0e1f11392251 (Jan 01 1970 - 00:00:00 +0000)
DDRINFO: start DRAM init
DDRINFO: DRAM rate 4000MTS
Training FAILED
DDRINFO: start DRAM init
DDRINFO: DRAM rate 4000MTS
DDRINFO:ddrphy calibration done
DDRINFO: ddrmix config done
DDR configured as single rank
SEC0:  RNG instantiated
Normal Boot
WDT:   Started watchdog@30280000 with servicing (60s timeout)
Trying to boot from BOOTROM
Boot Stage: Primary boot
Find img info 0x&4802d800, size 888
Download 916480, Total size 917024
NOTICE:  BL31: v2.6(release):lf_v2.6-g3c1583ba0a
NOTICE:  BL31: Built : 00:00:00, Jan  1 1970


U-Boot 2022.04-6.2.0+git.0e1f11392251 (Jan 01 1970 - 00:00:00 +0000)

CPU:   i.MX8MP[8] rev1.1 1600 MHz (running at 1200 MHz)
CPU:   Industrial temperature grade (-40C to 105C) at 55C
Reset cause: POR
DRAM:  4 GiB
Core:  89 devices, 23 uclasses, devicetree: separate
WDT:   Started watchdog@30280000 with servicing (60s timeout)
MMC:   FSL_SDHC: 1, FSL_SDHC: 2
Loading Environment from MMC... OK
In:    serial@30880000
Out:   serial@30880000
Err:   serial@30880000
Model: Toradex 0058 Verdin iMX8M Plus Quad 4GB WB IT V1.0B
Serial#: 06849059
Carrier: Toradex Dahlia V1.1A, Serial# 10870300
Net:   eth1: ethernet@30be0000, eth0: ethernet@30bf0000 [PRIME]
Normal Boot
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc2(part 0) is current device
Scanning mmc 2:1...
Found U-Boot script /boot.scr
973 bytes read in 1 ms (950.2 KiB/s)
## Executing script at 50280000
5512 bytes read in 2 ms (2.6 MiB/s)
Rollback enabled. Booting previously deployed version.
Failed to load '/boot/boot/ostree/torizon-13ed17bdac233819053dd3d44704bd49d7c00049104bf10fa2cfdb48d54f083c/dtb//bootimx8mp-verdin-wifi-dev.dtb'
14 bytes read in 2 ms (6.8 KiB/s)
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
No FDT memory address configured. Please configure
the FDT address via "fdt addr <address>" command.
Aborting!
13355717 bytes read in 44 ms (289.5 MiB/s)
11523429 bytes read in 38 ms (289.2 MiB/s)
   Uncompressing Kernel Image
ERROR: Did not find a cmdline Flattened Device Tree
Could not find a valid device tree
SCRIPT FAILED: continuing...
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
MMC: no card present
Scanning disk mmc@30b50000.blk...
Disk mmc@30b50000.blk not ready
Scanning disk mmc@30b60000.blk...
Found 2 disks
No EFI system partition
BootOrder not defined
EFI boot manager: Cannot load any image

First I started with a default 6.2.0 TorizonCore (version 1, let’s call it). Then, what I did was create and deploy a custom device tree using TorizonCore Builder (version 2). Then I deployed another miscellaneous change with the same custom device tree (version 3). So now the system is running this version 3 and has version 2 as a rollback, however version 3 and 2 should be using the same custom device tree. Version 1 is nowhere in the system anymore since by default OSTree only tracks the current and previous deployments (versions 2 & 3 in this case).

I now manually cause a rollback to occur, this should result in version 2 being booted with the same custom device tree, but instead it results in the logs I shared above. The logs show a similar error to your logs where the system for some reason tries to boot using the version 1 device tree instead of the custom device tree that was deployed with version 2. Also the path and device tree name itself aren’t even completely correct for version 1.

This definitely seems like a bug and I will bring it up with our team internally. However, the open question for me is why was your system rolling back? In my test here I had to manually cause a rollback. Again I ask was your system rolling back because it was in a situation where it had to rollback, or was it rolling back when it should not have?

Best Regards,
Jeremias

bw908 · May 10, 2023, 1:24pm

Thanks for the reply - I’m glad to hear you already found an issue (which looks quite similar to at least one of my scenarios), and I’ll do my best to answer your questions:

Did I more or less get the gist of things?

Yep, that’s about it - though in my case I am not starting with a vanilla image, but rather I am doing a TEZI deploy of our modified image on top of which I am doing further development.

How reproducible is this for you? Does it happen every time, randomly, only specific sequences? If possible could you give me instructions on how I might reproduce this?

I have not yet found a trigger - it seems to happen at random. In the case for which I included the log, there was no explicit thing I can think of - I performed a deploy, and on reboot the device was soft-bricked. In other (but not all) cases it may have been tied to resets without allowing the system to boot fully. I am currently trying to debug an issue with intermittent hardware bringup, so naturally I am resetting the device often to try to inspect/cause the problem. I’ll continue keeping the system’s serial port active in order to capture more logs that might shed light on the issue and inspect the boot environment as recommended if it recurs.

Does the type of changes you’re trying to deploy matter at all for this issue?

In these cases I’m purely deploying devicetree/overlay changes, the rest of the configuration has remained untouched.

Is the system rolling back for a valid reason? Or is the rollback unwarranted?
// --snip–//
I’m confused is the issue here caused by deploying new OSTrees with torizoncore-builder deploy or is it being caused by power-cuts/power-cycles?

There’s a mix of both here. Several times it has been unexpected/unwarranted (example from my post shows the system rebooted only once between the deploy and the rollback being triggered - and a subsequent re-deploy of the same setup worked just fine, ruling out e.g a DT bug causing a boot failure) but in some cases (as noted previously) it could be attributed to reboots before the system was fully operational. But my take on it is it probably shouldn’t have rolled back in either case - a customer being overzealous with a reset button shouldn’t trigger a rollback of a successfully applied update/boot, should it? (though I could understand if it happens because they are resetting just after an update was deployed.)

If it offers any more context, I have been following a very similar workflow previously with 5.7.0 and never experienced this issue.

jeremias.tx · May 10, 2023, 10:57pm

Ok so here’s what I have after more testing and research. First of all for context, there’s a file on the device called /boot/loader/uEnv.txt this file contains some logic and variables that get used by the boot script. Most importantly it contains variables that tell the bootloader where to locate the kernel image and device tree binary.

When flashed with a fresh 6.2.0 image we have the following variables:

kernel_image=/boot/ostree/torizon-a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/vmlinuz-5.15.77-6.2.0+git.aa0ff7e3554e
ramdisk_image=/boot/ostree/torizon-a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/initramfs-5.15.77-6.2.0+git.aa0ff7e3554e.img
fdtdir=/boot/ostree/torizon-a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/dtb
bootargs=quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.1/torizon/a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/0
kernel_image_type=Image.gz
overlays_file="overlays.txt"
otaroot=1

After using torizoncore-builder to deploy a custom device tree we now have the following:

kernel_image=/boot/ostree/torizon-4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/vmlinuz-5.15.77-6.2.0+git.aa0ff7e3554e
ramdisk_image=/boot/ostree/torizon-4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/initramfs-5.15.77-6.2.0+git.aa0ff7e3554e.img
fdtdir=/boot/ostree/torizon-4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/dtb
bootargs=quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.0/torizon/4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/0
fdtfile=test.dtb
kernel_image_type=Image.gz
overlays_file="overlays.txt"
otaroot=1

kernel_image2=/boot/ostree/torizon-a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/vmlinuz-5.15.77-6.2.0+git.aa0ff7e3554e
ramdisk_image2=/boot/ostree/torizon-a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/initramfs-5.15.77-6.2.0+git.aa0ff7e3554e.img
fdtdir2=/boot/ostree/torizon-a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/dtb
bootargs2=quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.0/torizon/a7ab41e968c27662d7fc71a4adf501104af8a6b2ea99063c1d15ad3ea49ed099/0

As you can see we now have a 2nd set of variables that denote what binaries to use in case of a rollback. Now after another deployment with torizoncore-builder we have this:

kernel_image=/boot/ostree/torizon-9608f75d3d4318a19fea2fa0bea705a9c01a71c0a1b67e0784ab0a8d6d14f3f7/vmlinuz-5.15.77-6.2.0+git.aa0ff7e3554e
ramdisk_image=/boot/ostree/torizon-9608f75d3d4318a19fea2fa0bea705a9c01a71c0a1b67e0784ab0a8d6d14f3f7/initramfs-5.15.77-6.2.0+git.aa0ff7e3554e.img
fdtdir=/boot/ostree/torizon-9608f75d3d4318a19fea2fa0bea705a9c01a71c0a1b67e0784ab0a8d6d14f3f7/dtb
bootargs=quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.1/torizon/9608f75d3d4318a19fea2fa0bea705a9c01a71c0a1b67e0784ab0a8d6d14f3f7/0
fdtfile=test.dtb
kernel_image_type=Image.gz
overlays_file="overlays.txt"
otaroot=1

kernel_image2=/boot/ostree/torizon-4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/vmlinuz-5.15.77-6.2.0+git.aa0ff7e3554e
ramdisk_image2=/boot/ostree/torizon-4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/initramfs-5.15.77-6.2.0+git.aa0ff7e3554e.img
fdtdir2=/boot/ostree/torizon-4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/dtb
fdtfile2=/boottest.dtb
bootargs2=quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.1/torizon/4937304066160a3d7bd560eb978add3cb1f633494aeedc04fd6d3db91f500db8/0

Notice the rollback variable for fdtfile2 has an incorrect value of /boottest.dtb this causes the issue where the DTB can’t be found due to this mangled name. I did the same test with TorizonCore 5.7 and here the fdtfile2 variable had the correct name. So at least with this point it seems we had a regression somewhere between 5.7 and 6.X.

That I said there were still 2 issues you mentioned that I wasn’t able to reproduce:

Random rollbacks. I tried deploying multiple times, I tried cutting power during the reboot process before it boots into a new deployment. I couldn’t get the system to rollback unexpectedly. The only way the system rolled-back for me was when I set the U-Boot environment variable rollback to 1. Though this was something I did manually as I said previously. I never noticed the rollback variable get set without explanation.
I haven’t experienced the situation you described where “the system falling back completely to the “dev board” configuration and using e.g. the imx8mp-verdin-nonwifi-dev.dtb file.” For me the system either fell back to whatever DTB the previous deployment had, or it failed to rollback cause the name was mangled by the above error.

While I have definitely identified at least 1 bug here, I still need further details or explanation for the other 2 issues listed.

Best Regards,
Jeremias

bw908 · May 11, 2023, 1:25pm

Thanks for the update. So far I have had no luck yet with reproducing the situation under observation. I had a single instance yesterday but of course it was not the board I had connected to the serial console, so I don’t have details as to what triggered the rollback. (I was able to “resurrect” it by using TEZI, mounting the boot partition and editing uEnv.txt to fix the mangled fdtfile2 entry).

I’ll continue monitoring and keep you posted.

jeremias.tx · May 11, 2023, 5:52pm

Yes, do please let me know when you find something. Hopefully it’s not one of this race-condition like issues where it only happens “sometimes”. That would be quite difficult to debug and analyze.

In any case, for the bug I was able to confirm I’ve already made a report to the team.

Best Regards,
Jeremias

bw908 · May 19, 2023, 2:47pm

I think I have found the cause of the back-to-stock issue I was experiencing - it happens if something goes wrong with the images unpack output command; then deploy will deploy the stock image rather than the customized one.

In one case I had it occur and looked back through the command history where I noticed I’d simply forgotten to run the unpack command before deploying; I would expect similar situations might occur if (e.g) one accidentally types something other than y as a response when not using the --remove-storage argument.

jeremias.tx · May 19, 2023, 5:59pm

That would make sense. So the issue wasn’t it was rolling back to the stock image, but rather the setup was unwittingly deploying the stock image due to the aforementioned issue with images unpack. Therefore this part doesn’t sound inherently like a bug, correct?

bw908 · May 19, 2023, 6:05pm

Correct, it does not appear to be a bug, just an unexpected corner case that was conflated with the “real” bug uncovered previously in this thread.

jeremias.tx · May 19, 2023, 9:37pm

Ok, well glad we could at least clear that part up then. Just to summarize so we have the one issue that I already reported and this issue is actually a non-issue. Was there anything else that still needed to be addressed?

Best Regards,
Jeremias

bw908 · May 19, 2023, 10:12pm

I think we’re good, thanks. I haven’t been able to reproduce a false rollback caused by resets since.

bw908 · July 7, 2023, 5:48pm

Have there been any developments on fixing the mangled fdtfile path? It’s starting to become a nuisance for developers testing changes via local deploys.

jeremias.tx · July 10, 2023, 8:36pm

This should have been fixed on the TorizonCore level as seen with this commit: ostree: fix fdtfile patch · toradex/meta-toradex-torizon@9b3d6b6 · GitHub

Though you’ll need a somewhat recent version of TorizonCore that has this fix. The upcoming quarterly should have this fix.

Best Regards,
Jeremias