Failed to update Director metadata: Rollback attempt

Please help. After 5 attempts at a bulk upgrade through the Torizon OTA API, 1/3 of the fleet targeted for upgrade continues to fail the docker-compose upgrade. Here is the log from one of them:

I traced the “Failed to update Director metadata: Rollback attempt” string in the toradex/aktualizr repo to here:


Which in turn checks this JSON field:

Now, on an unupgraded unit, calling aktualizr-info --director-targets yields

Whereas on an upgraded unit, the same command yields

Could somebody please explain what’s going on and how to get our fleet to upgrade without logging on to hundreds of units to delete/modify files and/or reprovision them? Six months ago we struggled to upgrade units; it involved a significant amount of manual labour that caused project timelines to suffer, and we simply can’t afford that time again, especially considering that this is a paid service.

Software summary
Bootloader: U-Boot
Kernel version: 5.4.193-5.7.0+git.f5d73fd6e9f8 #1-TorizonCore SMP Fri Jun 24 10:15:32 UTC 2022
Kernel command line: root=LABEL=otaroot rootfstype=ext4 quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.0/torizon/82e35a379141d2c27f6fe11e9ac8663e43f2781adba31ba7da1560f275e4f54f/0
Distro name: NAME=“TorizonCore Upstream”
Distro version: VERSION_ID=5.7.0-build.17
Distro variant: VARIANT=“Docker”
Hostname: Bradford

Hardware info
HW model: Toradex Colibri iMX7D on Senceive EdgeHub Carrier Board v4
Toradex version: 0039 V1.1B
Serial number: 07209641
Processor arch: armv7l

Hope to hear back soon,
Lloyd

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!

Hello,

The issue you’re encountering, indicated by the “Failed to update Director metadata: Rollback attempt” message, typically arises when there is a mismatch between the metadata expected by the device’s update client (Aktualizr-Torizon) and the metadata available on the OTA server. This can happen for various reasons, including network issues during the update process, corrupted metadata, or discrepancies between the director and image repository metadata.

Given the context and the complexity of the issue affecting a significant portion of your fleet, here are some steps and considerations to help address the problem:

  1. Verify Connectivity and Server Status: Ensure that all devices in the fleet have reliable connectivity to the OTA server and that the server is functioning correctly. Network issues can sometimes cause updates to fail in ways that are not immediately obvious.

  2. Check for Corrupted Metadata: It’s possible that the metadata on the OTA server or on the devices has become corrupted. You can verify the integrity of the metadata on the server and compare it against the expected values. If discrepancies are found, correcting them on the server might resolve the issue.

  3. Review Update Campaign Configuration: Ensure that the update campaign is correctly configured, with the correct targets and metadata specified for the devices you’re attempting to update. Misconfiguration here could lead to devices attempting to update with incorrect or outdated information.

  4. Incremental Rollout and Monitoring: If not already doing so, consider rolling out updates in smaller batches. This allows for closer monitoring of the update process and can help isolate issues to a smaller subset of devices, making troubleshooting more manageable.

  5. Device-Side Logs and Status: On devices that fail to update, closely examine the logs for any additional errors or warnings that might provide more context on why the update is failing. The aktualizr-info command you’ve used is a good start (see the sketch after this list); also look for any logs related to network connectivity, disk space, or hardware issues that might be affecting the update process.

  6. Recovery and Manual Intervention: While manual intervention is not ideal, especially at scale, understanding the exact nature of the failure on a few devices might provide insights that can be applied fleet-wide. For a subset of the failing devices, manually checking and, if necessary, correcting the metadata or other update-related files might reveal patterns or specific issues that can then be addressed more systematically.

  7. Consultation with Toradex Support: While direct contact with support is not within the scope of this answer, sharing detailed logs from a few of the affected devices with the Toradex Support team can help identify a fleet-wide pattern.
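
As a starting point for item 5, here is a minimal sketch (assuming you can run aktualizr-info on the unit, for example over SSH) for extracting the director targets metadata version from a device, so failing and successful units can be compared quickly:

```python
# Minimal sketch: read the director targets metadata version on a device.
# Assumption: `aktualizr-info --director-targets` prints the raw metadata as
# JSON; if your aktualizr version prints extra text around it, trim that off
# before parsing.
import json
import subprocess

def director_targets_version() -> int:
    out = subprocess.run(
        ["aktualizr-info", "--director-targets"],
        capture_output=True, text=True, check=True,
    ).stdout
    meta = json.loads(out)
    return meta["signed"]["version"]

if __name__ == "__main__":
    print("director targets version:", director_targets_version())
```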

Please, also check if these links can help you solve your problem:

Torizon Updates Technical Overview | Toradex Developer Center.

Greetings @lblackbeard,

That is quite concerning and confusing. Just to make sure I understand: part of your fleet (1/3) failed to update. On the units that failed to update, the version of your director-targets metadata is “13”, whereas on the units that managed to update it is “10”, suggesting that the metadata somehow rolled back versions, which is bad.

I honestly can’t say I’ve seen this happen before. Let me investigate your case and discuss this internally with our team and see if we can figure something out.

Just in case we need it, what’s the email address for the torizon.io account that these devices and updates are connected to?

Best Regards,
Jeremias

Hi Lloyd,

I can identify what’s happening here. I’m imagining that the devices that failed were provisioned, used for a while, and then re-provisioned as new devices at some point, right? First, a short explanation of the security feature that is causing this.

Rollback protection is a really important property for a software update system to have. You don’t want someone to be able to replay a message from the software repository saying “Install my_app_v.1.0.0-buggy” to cause devices to install some previously-valid version of software. Torizon Cloud protects you against that in several ways using the Uptane security framework. One of the things it does is perform a check to ensure that the version number of the director targets metadata (which is essentially the install instruction) always monotonically increases. Because each device can be independently directed to install its own software, that version number is unique per device. When you start an update, our backend generates and signs new metadata (including an incremented version number) that the device will validate. So far, so good.
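
To make that rule concrete, here is an illustrative sketch of the check (not aktualizr’s actual code, just the logic described above):

```python
# Illustrative sketch of the Uptane rollback check on director targets
# metadata: reject any metadata whose version is lower than the version this
# component has already accepted.
class RollbackAttempt(Exception):
    pass

def validate_director_targets(new_meta: dict, stored_version: int) -> int:
    new_version = new_meta["signed"]["version"]
    if new_version < stored_version:
        # This is the condition that surfaces on the device as
        # "Failed to update Director metadata: Rollback attempt".
        raise RollbackAttempt(
            f"incoming version {new_version} is below stored version {stored_version}"
        )
    return new_version  # becomes the new high-water mark to persist
```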

Uptane also has this concept of primary and secondary components–or subsystems, as we call them in Torizon. Each subsystem independently validates the update metadata, so even if you have a subsystem that’s an external microcontroller connected to the main device over a link layer you don’t trust, you’re still protected from rollbacks and all the other threats that we consider in the software update system threat model. Unfortunately, what seems to have happened here is that the devices that are currently failing have subsystems with incorrect versions of director targets metadata–almost certainly because they were re-provisioned at some point in the past without the subsystem’s metadata being reset. I’m not 100% sure how that would have happened, but I think that older versions of aktualizr may not have dealt with the situation properly if aktualizr wasn’t shut down cleanly and the device was re-provisioned into the same repository. Today, when you re-provision a device, we make sure to reset the metadata of all subsystems as part of the provisioning process, just in case aktualizr doesn’t do it properly.

So, for example, this is what probably happened:

  1. You provision device foo
  2. It stays connected to Torizon Cloud for quite some time, gets updates, etc. Every time any subsystem of the device is updated, or the metadata expires, the director targets metadata version number gets bumped, until it reaches version N.
  3. At some point, you re-provision it as device bar. As far as the backend is concerned, this is a new device, so its targets metadata version starts back at 1. However, there was an issue with the re-provisioning process for one of the subsystems. Nothing seems to be wrong: the subsystems are all registered and reporting correct information, and updates to the base OS and any subsystems that were new (or that hadn’t been updated when the device was named foo) will work just fine.
  4. You send an update to the subsystem that still has the version N director targets metadata from when it was named foo. The metadata is still signed by the correct roots of trust for your repository, so it passes the initial checks. But since you’re sending it update instructions via director targets metadata with version number M, which is less than N, the subsystem rejects it, thinking that it’s an attacker trying to fool it by replaying metadata from the past.

So, in summary: the issue is that Torizon Cloud and aktualizr both agree that the device’s director targets metadata should be at version M. The docker-compose subsystem you’re trying to update, though, has already seen version N, which is higher than M (from when it was provisioned previously), and thus flags the update as a rollback attempt.

There are essentially two ways to fix this. One way is to stop aktualizr on the failing devices, delete the file /var/sota/storage/<subsystem-name>/sql.db, then restart aktualizr. You’ve already said that’s not a desirable solution, though. The other is to sufficiently increase the director targets metadata version for the failing devices, until it’s greater than whatever version the failing subsystem has left over in its storage. One way to do that is simply to send a bunch of updates; the director metadata version will increment by one every time the device picks up a new update, and eventually the version number will reach the number that the problematic subsystem has in its internal storage. I understand that might be equally undesirable, though. We’re looking into another possible solution on the back-end, but will need a couple of days to get back to you on that.
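
For reference, a rough device-side sketch of the first option above (stop, delete, restart), to be run as root on an affected unit; the subsystem name is a placeholder, and the service name is an assumption that may differ on your image (for example aktualizr-torizon):

```python
# Sketch of the manual fix: stop the update client, remove the subsystem's
# stale sql.db, then restart. Run as root on the affected device.
import subprocess
from pathlib import Path

SUBSYSTEM = "docker-compose"  # placeholder: the name of the failing subsystem
SERVICE = "aktualizr"         # assumption: check the actual unit name with `systemctl list-units`
DB = Path(f"/var/sota/storage/{SUBSYSTEM}/sql.db")

subprocess.run(["systemctl", "stop", SERVICE], check=True)
if DB.exists():
    DB.unlink()               # drop the metadata left over from the "past life"
subprocess.run(["systemctl", "start", SERVICE], check=True)
```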

Hi Jeremias,

Thanks for your reply.

1/3 of the fleet after 5 attempts, yes. I can’t say what is causing the issue on all of them, only on the one I’ve highlighted and on the one example of a successful upgrade. We’ve also seen a problem that is rectified by changing the gateway URL, as well as timeouts on container creation and other problems we don’t understand (fixed by reprovisioning).
Interestingly, the device I’ve highlighted upgraded successfully on the 8th attempt, as you can see from the snip here:


And thankfully, with each attempt, the number of devices requiring upgrades is reducing. On the 11th attempt (today), the proportion is down to 4%.
I would prefer to post the account email address privately to avoid scraping / malicious actors; I just need to figure out how to do that.

Best regards,
Lloyd

Hi @lblackbeard !

You can create a new Hidden Topic here on Toradex Community and refer to this thread (simply using the URL). It will be accessible/visible only to you and Toradex people. This is the preferred way.

You can also send to us via email: support@toradex.com.

Feel free to proceed as you prefer :slight_smile:

Best regards,

And thankfully, with each attempt, the number of devices requiring upgrades is reducing. On the 11th attempt (today), the proportion is down to 4%.

Given these results, I believe the hypothesis from @jon.tx is most likely correct and is what is happening here. At some point in the past these devices were probably re-provisioned in some way, without the previous data being properly wiped, as documented here: Aktualizr - Modifying the Settings of Torizon Update Client | Toradex Developer Center

If it’s okay with you, I would suggest simply continuing to push updates to bump the version of your metadata until they are all succeeding, as this seems to be working for your fleet.

Best Regards,
Jeremias


@jon.tx, thank you very much for your detailed reply - that mostly makes sense to me… does the sqldb sit in ostree? To date, the reports I’ve received and my own experience indicate that reprovisioning is a successful method for ensuring upgrades succeed, so is it possible that a/the sqldb is rolling back to a “previous life”?
We now trigger a program every hour that hooks into the OTA API to check subsystem versions and queue upgrades as needed - it seems to be working, and I would recommend the approach to others.
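
For anyone interested, here is a rough sketch of the sort of job we run. The base URL, endpoint paths, and response fields below are placeholders; replace them with the ones from the official API documentation:

```python
# Rough sketch of an hourly "check versions and queue upgrades" job against
# the Torizon OTA API. Endpoint paths and field names are PLACEHOLDERS --
# adapt them to the documented API.
import os
import requests

API_BASE = "https://app.torizon.io/api/v2beta"  # assumption: use the documented base URL
TOKEN = os.environ["TORIZON_API_TOKEN"]          # API token for your account
TARGET_VERSION = "my-app-1.4.2"                  # placeholder: the version you want fleet-wide

session = requests.Session()
session.headers["Authorization"] = f"Bearer {TOKEN}"

# 1. List devices (placeholder endpoint) and find the ones still behind.
devices = session.get(f"{API_BASE}/devices").json().get("values", [])
behind = [d for d in devices if d.get("installedVersion") != TARGET_VERSION]

# 2. Queue an update for each device not yet on the target version
#    (placeholder endpoint and payload).
for d in behind:
    session.post(f"{API_BASE}/updates", json={
        "devices": [d["deviceUuid"]],
        "targetVersion": TARGET_VERSION,
    })

print(f"queued updates for {len(behind)} of {len(devices)} devices")
```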

does the sqldb sit in ostree?

No, actually. The sql.db files that Aktualizr uses are stored under /var. This directory is explicitly not tracked by OSTree, so it is not part of the data in any OS update.

so is it possible that a/the sqldb is rolling back to a “previous life”?

As I said above, since sql.db is not part of the OS update data, it should not be possible for it to “roll back”. That said, what might have happened is that during re-provisioning the old sql.db was not deleted. This would mean that after re-provisioning the system would be working with an old sql.db from a “past life”, so to speak, which could cause exactly these kinds of issues and strange interactions.
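
If you want to audit units for leftovers, here is a quick sketch (run on the device) that simply lists the per-subsystem sql.db files under the path mentioned earlier, along with their last-modified times, which can hint at a database surviving from before a re-provisioning:

```python
# Quick audit sketch: list per-subsystem sql.db files and when they were last
# modified, to help spot databases left over from a previous provisioning.
from datetime import datetime
from pathlib import Path

for db in sorted(Path("/var/sota/storage").glob("*/sql.db")):
    mtime = datetime.fromtimestamp(db.stat().st_mtime)
    print(f"{db}  last modified {mtime:%Y-%m-%d %H:%M}")
```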

Best Regards,
Jeremias
