Update with offline-update fails: `Error downloading image: Could not write content to file`

Dear community,

We’re using lockboxes for combined (docker-compose + OS) updates (in a specific context).
The device is configured for offline updates and we update via lockboxes regularly. The process is always the same: download the lockbox to the device, then point a symlink at it (the symlink being the path configured as offline_updates_source). So far this has worked fine with no big issues (except for the lockbox size, but that’s another story); the last update went through last week without any errors.
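For illustration, a common way to do that repointing is an atomic symlink swap; a minimal sketch of the idea (the function name and paths are made up, and our actual scripts differ):

// Sketch: atomically repoint the offline-update symlink at a new lockbox.
// Paths and names are illustrative only.
#include <filesystem>

namespace fs = std::filesystem;

void point_update_source(const fs::path& lockbox_dir, const fs::path& source_link) {
  // Create the new symlink under a temporary name first...
  const fs::path tmp_link = source_link.string() + ".new";
  fs::remove(tmp_link);  // drop any stale leftover from a previous run
  fs::create_directory_symlink(lockbox_dir, tmp_link);
  // ...then rename it over the old link; rename(2) is atomic on POSIX,
  // so the updater always sees either the old or the new lockbox path.
  fs::rename(tmp_link, source_link);
}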

A few hours ago, the CI built another lockbox. But this time, applying the lockbox on the devices ends in an error. I’ve tried it multiple times with the same lockbox (and on different devices), but the log always shows the same errors.

Aktualizr logs:

Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: fetchMetaOffUpd() called with source_path: "/var/rootdirs/media/update/aktualizr"
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: New updates found in Director metadata. Checking Image repo metadata...
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: 2 new updates found in both Director and Image repo metadata.
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: UpdateCheckComplete, Result - Updates available
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Update available. Acquiring the update lock...
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Error downloading image: Could not write content to file: /tmp/aktualizr-e441-8897-ca96-6995/3d21-37a5-tls-ca: No such file or directory
Oct 30 22:07:44 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: DownloadTargetComplete, Result - Error
Oct 30 22:07:47 colibri-imx8x-00000000 aktualizr-torizon[1025]: Error downloading image: Could not write content to file: /tmp/aktualizr-e441-8897-ca96-6995/934f-e662-tls-ca: No such file or directory
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: DownloadTargetComplete, Result - Error
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: 0 of 2 targets were successfully downloaded.
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: Correlation ID is blank, installation failure will not be logged
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: AllDownloadsComplete, Result - Error

Looks like some CA files are missing. But what does that mean?

The lockbox was created with TorizonCore Builder 3.8.1 and we’re running Torizon OS 6.4.0.

Any hints are appreciated!

Regards,

Lukas

Greetings @lzaugg_parametric,

Well, I can see where the error is happening in the code, though I’m scratching my head a bit trying to figure out how this is possible. During the update process the client loads the keys, as seen here: https://github.com/toradex/aktualizr/blob/d71f18a1b8139cd4adadb41667b73d7c92fdbd9f/src/libaktualizr/crypto/keymanager.cc#L37

Looking at this specific code chunk:

if (config_.tls_ca_source == CryptoSource::kFile) {
  std::string ca;
  if (ca_content != nullptr) {
    ca = *ca_content;          // CA was handed in directly
  } else {
    backend_->loadTlsCa(&ca);  // otherwise load it from storage
  }
  if (!ca.empty()) {
    if (tmp_ca_file == nullptr) {
      // create a uniquely named "*-tls-ca" file in the temp directory
      tmp_ca_file = std_::make_unique<TemporaryFile>("tls-ca");
    }
    tmp_ca_file->PutContents(ca);  // the write that fails in the log above
  }
}

We can see it loads the CA, then puts its contents into a temporary file. Now for some reason, in your case, this temporary file does not seem to exist. But I’m not sure how that can be the case since, as seen in the code, the temporary file does get created. The only case where it isn’t created is when a temporary file already exists; but if it already exists, then you shouldn’t be getting this error.

A few hours ago, the CI built another lockbox. But this time, applying the lockbox on the devices ends in an error. I’ve tried it multiple times with the same lockbox (and on different devices), but the log always shows the same errors.

Are you saying this error only happens with this specific Lockbox, and other Lockboxes are processed fine? If that is the case, we may need further details on this Lockbox and on why it might be causing this issue.

Otherwise can you think of anything else that may have changed recently on your systems that could be a variable or factor here?

Best Regards,
Jeremias

Hi @jeremias.tx ,

Thanks for your response. Indeed, it looks very strange.

Yes, previous lockboxes were processed fine. I ran another build this morning, with the same result upon installing (error)! I can share the lockbox (in private) if that helps.

We changed to Torizon OS 6.4.0 and successfully upgraded our test devices two weeks ago with a previous lockbox (with some device tree and rootfs changes through TorizonCore Builder). The lockbox in question is the first combined (docker-compose + OS) one applied since that upgrade.

Regards,
Lukas

@jeremias.tx

Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Error downloading image: Could not write content to file: /tmp/aktualizr-e441-8897-ca96-6995/3d21-37a5-tls-ca: No such file or directory

I’ve just checked the error message in depth, and what it tells me is that the temporary directory /tmp/aktualizr-e441-8897-ca96-6995/ does not exist when aktualizr tries to write the CA file.

So, when does this directory get created?
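To double-check that reading, here is a tiny reproduction of the failure mode (the path is hypothetical; nothing aktualizr-specific):

// Opening a file for writing fails with ENOENT ("No such file or
// directory") when the parent directory does not exist, which is
// exactly the message in the log above.
#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  const char* path = "/tmp/aktualizr-gone/0000-0000-tls-ca";  // parent dir missing
  std::FILE* f = std::fopen(path, "w");
  if (f == nullptr) {
    std::printf("Could not write content to file: %s: %s\n", path, std::strerror(errno));
    return 1;
  }
  std::fclose(f);
  return 0;
}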

Update:

After restarting aktualizr (systemctl restart aktualizr), the update went through! I’ll have to take a closer look at what state the devices were in before (aktualizr was definitely running).

We changed to Torizon OS 6.4.0 and successfully upgraded our test devices two weeks ago with a previous lockbox (with some device tree and rootfs changes through TorizonCore Builder).

What version of Torizon OS were you running before you updated to 6.4.0? Do your Lockboxes work on the previous version you were on?

After restarting aktualizr (systemctl restart aktualizr), the update went through! I’ll have to take a closer look at what state the devices were in before (aktualizr was definitely running).

That’s strange and doesn’t make a lot of sense to me. I’m probably going to need some more info here to be able to investigate further. I tried a quick test where I did an offline (docker-compose & OS) update on a device running Torizon OS 6.4.0, and it seems to work just fine on my side.

Best Regards,
Jeremias

I’ll chime in on this thread and note that we observed a similar error yesterday.

We’re still trying to understand it, but one observation we have right now is that the affected unit wasn’t rebooted after the new suite of containers came up for the first time. Does this seem plausible? I am aware there’s a staging step for new containers using docker-compose.tmp.yml, which has a slightly different workflow from “normal”, but offhand I’m not sure how that fits into the bigger picture in terms of aktualizr.

It would make sense not to allow updates during this staging step: if the containers fail in some way, you could no longer roll back, since the rollback target would have been replaced by a further update.

I’ll provide anything else we have as we uncover it; if there’s anything specific we can provide that would help narrow down the cause please let me know.

@bw908

We’re still trying to understand it, but one observation we have right now is that the affected unit wasn’t rebooted after the new suite of containers came up for the first time. Does this seem plausible?

Wait, so you’re talking about a synchronous (OS + docker-compose) update, right? The expected process for a sync update is outlined here: Torizon Updates Technical Overview | Toradex Developer Center

As stated there, the new containers shouldn’t start until after the reboot into the new OS version. I’m not sure what you mean by rebooting after the new containers come up; there shouldn’t be a reboot after the new containers start, as the reboot should have already happened.

It would make sense not to allow updates during this staging step: if the containers fail in some way, you could no longer roll back, since the rollback target would have been replaced by a further update.

I’m not sure I follow. The docker-compose.tmp.yml file is just a temporary transition file so that we can atomically write the actual docker-compose.yml during the update. Also, it should not be possible to do another update until the current one has completed (success or failure). So unless there’s an unknown bug here, there should be no danger of another unrelated update occurring while this temporary file is in use.
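The write-then-rename pattern behind this looks roughly as follows (a sketch of the general pattern with illustrative paths, not the actual update-client code):

// Write the new compose file to a .tmp path, then rename it over the
// real file. rename(2) is atomic, so a reader sees either the complete
// old file or the complete new one, never a half-written mix.
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

void replace_compose_file(const std::string& new_content) {
  const std::string tmp = "/path/to/docker-compose.tmp.yml";  // illustrative
  const std::string real = "/path/to/docker-compose.yml";     // illustrative
  {
    std::ofstream out(tmp, std::ios::trunc);
    if (!out) {
      throw std::runtime_error("cannot open " + tmp);
    }
    out << new_content;
  }  // stream flushed and closed here
  if (std::rename(tmp.c_str(), real.c_str()) != 0) {
    throw std::runtime_error("rename failed for " + real);
  }
}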

Best Regards,
Jeremias

Yes, this would be a synchronous update. Thanks for the link, it helps clarify.

Sorry for the confusion on my part. I was musing that it may have been a corner case where the current update has not “completed” yet (in that the containers are still being evaluated post-OS-update).

Per the link you provided, what is the trigger for transitioning from “Application installation finish” to “Update successful and cleanup”? Based on what you are describing, it sounds like it is not possible to apply a new update before the current one completes, and I’m wondering how long that window is present.

Per the link you provided, what is the trigger for transitioning from “Application installation finish” to “Update successful and cleanup”? Based on what you are describing, it sounds like it is not possible to apply a new update before the current one completes, and I’m wondering how long that window is present.

Basically, if docker-compose up ran successfully on the new compose file and the system is running the new OS version, then the update is successful. If either of these conditions fails, the entire synchronous update is considered a failure. In either case, failure or success, the appropriate follow-up action for cleanup and/or rollback is taken.

Only once this entire update process has reached a finished, terminal state will the update client even consider looking at any new updates. This is what I mean by there being no “window”, as you said. If the temp file still exists, then the system should still be in the middle of some kind of update and should not even entertain other updates until a finished state is reached.
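In pseudocode terms, the gating I’m describing amounts to something like this (a sketch of the behaviour only; the states and names are made up and are not the client’s actual code):

// New updates are only considered once the previous update has reached
// a terminal state (never started, succeeded, or failed and handled).
enum class UpdateState {
  Idle,                // no update in progress
  Downloading,
  InstallingOs,
  AwaitingReboot,
  StartingContainers,  // docker-compose.tmp.yml exists during this phase
  Succeeded,
  Failed
};

bool is_terminal(UpdateState s) {
  return s == UpdateState::Idle || s == UpdateState::Succeeded ||
         s == UpdateState::Failed;
}

// Hypothetical check made before fetching new update metadata.
bool may_start_new_update(UpdateState current) { return is_terminal(current); }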

Best Regards,
Jeremias

OK, our theory sounds orthogonal to the original issue then and is probably not the cause.

Sorry for the delayed answer.
Before 6.4.0 we were running 6.3.0-monthly-something. The lockboxes were working there as well.

In the meantime, we updated other systems successfully (without having to restart aktualizr).

To be honest, I have no clue what was wrong with the devices affected by this issue. Some ideas:

  • long-running system without internet access → wrong system time
  • system messed up by previous lockbox tests (no reboots, docker-compose-only lockboxes)

In the meantime, those systems were updated and are serving another purpose. Therefore I suggest closing this thread, but leaving it here as a reference, as a vague description in case the same behaviour happens again somewhere else…

Thanks for your time!

Best Regards,
Lukas

In the meantime, we updated other systems successfully (without having to restart aktualizr).

Strange, but at least it doesn’t seem to be a permanent issue. Well, thanks for reporting back. I’ll try to keep this in the back of my mind in case I see something similar in the future.

Best Regards,
Jeremias