Offline updates - errors in docker secondary due to missing metadata

We have recently started having problems producing offline update lockboxes and applying them to systems. We’ve successfully built more than 10 lockboxes already, but recently they have started failing to apply. During installation, the updates fail with errors similar to:

validateMetadata: Key '6b8391a3e90732c48ee24e2b9ab509a148f88e467f66c7949f8a077058250693/layer.tar' not found in metamap [RE]

This seems to be an issue with data written by TorizonCore Builder (TCB) during the lockbox creation step: we’ve rebuilt several old lockboxes that were known to work, and they now exhibit the same problem. The .lock.yml files generated during the process all look reasonable, and we are not sure why TCB appears to be leaving out metadata for these docker layers.

Is this something you’ve seen before, and if so, can you offer some guidance on where the issue might lie?

Relevant section of the Aktualizr client’s journalctl output:

Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/etc/sota/conf.d/100-offline-updates.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/usr/lib/sota/conf.d/20-sota-device-cred.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/usr/lib/sota/conf.d/30-rollback.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/usr/lib/sota/conf.d/40-hardware-id.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/etc/sota/conf.d/41-mm-hwid.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/usr/lib/sota/conf.d/50-secondaries.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Reading config: "/usr/lib/sota/conf.d/60-polling-interval.toml"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Use existing SQL storage: "/var/sota/sql.db"
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: docker-compose file matches expected digest
Mar 31 18:23:51 MM-1 aktualizr-torizon[997]: Loading metadata from tarball: /var/volatile/DEV_UPDATE/update/images/9dd281bf6d3f4eaa4cd7c444c0da0ebe6ab7775cc0d41cb51c183c001efcd4fd.images/4c7bf92a09bcba4620626bb1e39f93120035d6546955852c95151ab2e6d4273c.tar
Mar 31 18:24:05 MM-1 aktualizr-torizon[997]: Found in archive a file with bad file type: 40960
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: validateMetadata: Key '6b8391a3e90732c48ee24e2b9ab509a148f88e467f66c7949f8a077058250693/layer.tar' not found in metamap [RE]
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Loading of tarballs aborted!
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Offline loading failed: Failed to load docker tarball 4c7bf92a09bcba4620626bb1e39f93120035d6546955852c95151ab2e6d4273c.tar
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Rolling back container update
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Removing not used containers, networks and images
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Running command: docker system prune -a --force
Mar 31 18:24:41 MM-1 aktualizr-torizon[1084]: Total reclaimed space: 0B
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Running command: fw_setenv rollback 1
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Event: InstallTargetComplete, Result - Error
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Event: AllInstallsComplete, Result - INTERNAL_ERROR
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Update install completed. Releasing the update lock...
Mar 31 18:24:41 MM-1 aktualizr-torizon[997]: Exiting aktualizr so that pending updates can be applied after reboot

We have tried both TCB 3.6 and 3.7 in case this issue was introduced with 3.7, but it did not appear to make a difference.

Thanks!

Greetings @bw908,

Correct me if I misunderstood, but you’re using the same TorizonCore version as before, the same TorizonCore Builder version, and the same lockboxes. All of this worked before, but now, with the same versions of everything, it suddenly no longer works?

Or did the version of something change?

Otherwise, based on your description, it sounds like nothing was changed but you’re now getting a different result somehow, which would be quite concerning if true.

Best Regards,
Jeremias

Yes, as far as I currently understand it, there are cases where nothing has changed and we are rebuilding known-good tags with known-good tools (on differing machines). We’ve even tried explicitly calling out TCB 3.6 and 3.7.0 just in case a recent update caused it.

We were previously using the early-access tag due to a bug in TCB not accepting logins for private repos, but some of these “known good” packages were built with the current “early-access” tag in the last few weeks (in other words, after its most recent release).

Well, let’s see if we can figure out what happened here. I don’t think it’s very likely that the software behaves differently when the version of everything is supposed to be the same as before.

First off, from your initial logs I noticed this line:

Found in archive a file with bad file type: 40960

This looks similar to the issue where there are symlinks in the container metadata. If I recall correctly, you had a local workaround for this. Are you still using it?

Next regarding this comment:

on differing machines

Do you mean you’re running things on a different development PC now? Do things still work with the old development machine?

Best Regards,
Jeremias

Found in archive a file with bad file type: 40960

I’m told this message is also present in updates that install successfully, so I do not think it is related to the issue at hand (but I could be wrong)

Do you mean you’re running things on a different development PC now? Do things still work with the old development machine?

Sorry, poor phrasing on my part. I meant that we have tried the build both on the original machine that produced the successful lockbox (a CI build agent) and locally on my laptop (which I know has not had any changes, such as docker updates), without success.

I’m told this message is also present in updates that install successfully, so I do not think it is related to the issue at hand (but I could be wrong)

Not sure where you heard that, but ideally you shouldn’t be seeing this message at all. Your logs look almost exactly the same as those from the symlink issue I described. See this thread to compare: [aktualizr, offline updates] Found in archive a file with bad file type: 40960

Also, I misremembered: it was the customer in the thread I just linked who had a workaround for this issue.

In that thread, the customer had the same error message with the same file type number, “40960”, which should denote a symlink. Our update client in TorizonCore 5.7.0 can’t handle docker metadata that contains a symlink.
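As a quick check of the “40960” reading, here is a minimal Python sketch that decodes the number as POSIX file-type bits. It assumes the value aktualizr logs is a raw st_mode-style file type, which this thread does not confirm, and `describe_file_type` is a hypothetical helper name used only for illustration.

```python
import stat

# Value taken from the log line "Found in archive a file with bad file type: 40960".
reported_type = 40960


def describe_file_type(mode: int) -> str:
    """Map the file-type bits of a POSIX mode value to a short name."""
    file_type = stat.S_IFMT(mode)  # keep only the file-type bits
    names = {
        stat.S_IFREG: "regular file",
        stat.S_IFDIR: "directory",
        stat.S_IFLNK: "symlink",
    }
    return names.get(file_type, "other ({})".format(oct(file_type)))


print(oct(reported_type))                 # 0o120000
print(describe_file_type(reported_type))  # symlink
```

Under that assumption, 40960 is octal 120000, which is the POSIX file-type value for a symbolic link.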

Best Regards,
Jeremias

Not sure where you heard that, but ideally you shouldn’t be seeing this message at all.

Just relaying what was reported to me by the developer who is integrating updates into the software frontend and who reported the issue (so not first-hand experience).

I did observe a symlink in the archive when I inspected it, so I will attempt the patch in the linked thread.
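For anyone who wants to repeat that inspection, a minimal Python sketch follows. It lists every symlink entry in a tarball using the standard `tarfile` module. It assumes the image file inside the lockbox is an ordinary tar archive of the kind `docker save` produces, and `find_symlinks` is a hypothetical helper, not part of TorizonCore Builder or aktualizr.

```python
import sys
import tarfile
from typing import List, Tuple


def find_symlinks(tar_path: str) -> List[Tuple[str, str]]:
    """Return (member name, link target) for every symlink in the tarball."""
    # "r:*" lets tarfile detect plain or compressed archives automatically.
    with tarfile.open(tar_path, "r:*") as archive:
        return [(m.name, m.linkname) for m in archive if m.issym()]


if __name__ == "__main__":
    # Pass one or more tarball paths on the command line.
    for path in sys.argv[1:]:
        links = find_symlinks(path)
        if not links:
            print("{}: no symlinks found".format(path))
        for name, target in links:
            print("{}: {} -> {}".format(path, name, target))
```

Any `layer.tar` entry that shows up in this list is a symlink rather than a regular file, which would be consistent with the “bad file type” message and the missing key in the log above.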

It sounds like we don’t have any concrete reason why docker sometimes decides to make symlinks and sometimes not, but this would be consistent with our observations that lockboxes which previously applied correctly, when rebuilt, start failing with similar errors.

Can confirm the linked script appears to resolve the issue for us. Thanks!

OK, so that confirms that you were experiencing the same symlink issue. We have an official fix for it that we plan to release in TorizonCore 6.X. We also plan to eventually backport this fix to a future patch release of TorizonCore 5.7.X.

One thing I do find strange is that your old containers/Lockboxes suddenly started experiencing this issue as well. We still don’t understand why some container images have symlinks in their metadata. Yours is the first case I’ve seen of a previously working container image suddenly having symlinks later on, which is very strange.

Best Regards,
Jeremias

Has this already been patched in 6.2 or not yet?

The fix for this issue should have made it into the 6.2.0 quarterly release. Please let me know if this does not seem to be the case. To clarify, we fixed this at the OS level, so the update client should now be able to handle container image metadata that contains symlinks.

Best Regards,
Jeremias

I removed the patch code from our build process, and it seems to be working without issue :+1:

Thanks for confirming the fix on your end.