We’re using lockboxes for combined (docker-compose + OS) updates (in a specific context).
The device is configured for offline updates, and we use lockboxes for updates regularly. The process is always the same: download the lockbox to the device, then point a symlink at it (the symlink is the path configured as offline_updates_source). This has gone fine so far with no big issues (except the lockbox size, but that’s another story); the last update went through last week without any errors.
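For reference, the apply step boils down to two operations; here is a rough sketch in shell, where the paths and file names are only illustrative (the symlink target is whatever path is configured as offline_updates_source):

# 1. Copy the new Lockbox onto the device (illustrative paths)
scp -r lockbox-2023-10-30/ torizon@device:/home/torizon/lockboxes/
# 2. Repoint the symlink that aktualizr reads as offline_updates_source
ssh torizon@device 'ln -sfn /home/torizon/lockboxes/lockbox-2023-10-30 /var/rootdirs/media/update'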
A few hours ago, the CI built another lockbox. But this time, when trying to apply the lockbox on the devices, the update process ends in an error. I’ve tried it multiple times with the same lockbox (and on different devices), but the log always shows the same errors.
Aktualizr logs:
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: fetchMetaOffUpd() called with source_path: "/var/rootdirs/media/update/aktualizr"
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: New updates found in Director metadata. Checking Image repo metadata...
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: 2 new updates found in both Director and Image repo metadata.
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: UpdateCheckComplete, Result - Updates available
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Update available. Acquiring the update lock...
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Error downloading image: Could not write content to file: /tmp/aktualizr-e441-8897-ca96-6995/3d21-37a5-tls-ca: No such file or directory
Oct 30 22:07:44 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: DownloadTargetComplete, Result - Error
Oct 30 22:07:47 colibri-imx8x-00000000 aktualizr-torizon[1025]: Error downloading image: Could not write content to file: /tmp/aktualizr-e441-8897-ca96-6995/934f-e662-tls-ca: No such file or directory
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: DownloadTargetComplete, Result - Error
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: 0 of 2 targets were successfully downloaded.
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: Correlation ID is blank, installation failure will not be logged
Oct 30 22:07:51 colibri-imx8x-00000000 aktualizr-torizon[1025]: Event: AllDownloadsComplete, Result - Error
Looks like some CA files are missing. But what does that mean?
The lockbox was created with TorizonCore Builder 3.8.1, and we’re running Torizon OS 6.4.0.
if (config_.tls_ca_source == CryptoSource::kFile) {
  std::string ca;
  if (ca_content != nullptr) {
    ca = *ca_content;
  } else {
    backend_->loadTlsCa(&ca);  // load the TLS CA from the storage backend
  }
  if (!ca.empty()) {
    if (tmp_ca_file == nullptr) {
      // Create a temporary file with the "tls-ca" suffix; per the log above it
      // ends up under aktualizr's temporary directory (/tmp/aktualizr-*/)
      tmp_ca_file = std_::make_unique<TemporaryFile>("tls-ca");
    }
    // Write the CA contents into that temporary file
    tmp_ca_file->PutContents(ca);
  }
}
We can see it loads the CA, then puts its contents into a temporary file. Now, for some reason, in your case it looks like this temporary file does not exist? I’m not sure how that can be the case, since, as seen in the code, the temporary file does get created. The only case in which it doesn’t get created is when a temporary file already exists, but if it already exists then you shouldn’t be getting this error.
A few hours ago, the CI built another lockbox. But this time, when trying to apply the lockbox on the devices, the update process ends in an error. I’ve tried it multiple times with the same lockbox (and on different devices), but the log always shows the same errors.
Are you saying this error only happens with this specific Lockbox? Other Lockboxes are processed fine? If that is the case we may need further details on this Lockbox and why it might be causing this issue.
Otherwise can you think of anything else that may have changed recently on your systems that could be a variable or factor here?
Thanks for your response. Indeed - looks very strange.
Yes, previous lockboxes were processed fine. I’ve run another build this morning, with the same result upon installing (error)! I can share the lockbox (in private) if that helps.
We changed to Torizon OS 6.4.0 two weeks ago and successfully upgraded our test devices with a previous lockbox (with some device tree and rootfs changes through tcbuilder). The lockbox in question is the first combined (docker-compose + OS) one applied after that upgrade.
Oct 30 22:07:41 colibri-imx8x-00000000 aktualizr-torizon[1025]: Error downloading image: Could not write content to file: /tmp/aktualizr-e441-8897-ca96-6995/3d21-37a5-tls-ca: No such file or directory
I just checked the error message in depth, and what it tells me is that the temporary directory /tmp/aktualizr-e441-8897-ca96-6995/ does not exist when aktualizr tries to write the CA file.
So, when does this directory get created?
Update:
After restarting aktualizr (systemctl restart aktualizr), the update went through! I’ll have to take a closer look at what state they were in before (aktualizr was definitely running).
We changed to Torizon OS 6.4.0 two weeks ago and successfully upgraded our test devices with a previous lockbox (with some device tree and rootfs changes through tcbuilder).
What version of Torizon OS were you running before you updated to 6.4.0? Do your Lockboxes work on the previous version you were on?
After restarting aktualizr (systemctl restart aktualizr), the update went through! I’ll have to take a closer look at what state they were in before (aktualizr was definitely running).
That’s strange and doesn’t make a lot of sense to me. I’m probably going to need some more info here to be able to investigate this further. I tried a quick test where I did an offline (docker-compose & OS) update on a device running Torizon OS 6.4.0, and it seems to work just fine on my side.
I’ll chime in on this thread and note that we observed a similar error yesterday.
We’re still trying to understand it but one of the observations we have right now is the affected unit wasn’t rebooted after the new suite of containers came up for the first time. Does this seem plausible? I am aware there’s a staging step for new containers using docker-compose.tmp.yml which has a slightly different workflow from “normal”, but offhand I’m not sure how that fits into the bigger picture in terms of aktualizr.
It possibly makes sense that updates shouldn’t be allowed during this staging step: if the containers fail in some way, you would no longer be able to do a rollback if the rollback target has been replaced by a further update.
I’ll provide anything else we have as we uncover it; if there’s anything specific we can provide that would help narrow down the cause please let me know.
We’re still trying to understand it but one of the observations we have right now is the affected unit wasn’t rebooted after the new suite of containers came up for the first time. Does this seem plausible?
As stated, the new containers shouldn’t start until after the reboot into the new OS version. I’m not sure what you mean by rebooting after the new containers come up; there shouldn’t be a reboot after the new containers start, as the reboot should have already happened.
It possibly makes sense that updates shouldn’t be allowed during this staging step: if the containers fail in some way, you would no longer be able to do a rollback if the rollback target has been replaced by a further update.
I’m not sure I follow. The docker-compose.tmp.yml file is just a temporary transition file so that we can atomically write the actual docker-compose.yml during the update. Also, it should not be possible to start another update until the current one has completed (success or failure). So unless there’s an unknown bug here, there should be no danger of another unrelated update occurring while this temporary file is in use.
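To illustrate the general idea (this is the generic write-then-rename pattern, not Aktualizr’s actual code or paths):

# Write the complete new file under a temporary name, then rename it over the old one.
# A rename within the same filesystem is atomic, so a reader only ever sees the
# old compose file or the complete new one, never a partially written file.
cat new-compose.yml > docker-compose.tmp.yml
sync
mv docker-compose.tmp.yml docker-compose.yml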
Yes, this would be a synchronous update. Thanks for the link, it helps clarify.
Sorry for the confusion on my part, I was musing that it may have been a corner case where the current update has not “completed” yet (in that the containers are still in the process of being evaluated post-OS-update).
Per the link you provided, what is the trigger for transitioning from "Application installation finish" to "Update successful and cleanup"? Based on what you are describing, it sounds like it is not possible to apply a new update before the current one completes, and I’m wondering how long that window is present.
Per the link you provided, what is the trigger for transitioning from "Application installation finish" to "Update successful and cleanup"? Based on what you are describing, it sounds like it is not possible to apply a new update before the current one completes, and I’m wondering how long that window is present.
Basically, if docker-compose up ran successfully on the new compose file and the system is running the new OS version, then the update is successful. If either of these conditions fails, the entire synchronous update is considered a failure. In either case, failure or success, the appropriate follow-up action for cleanup and/or rollback is taken.
Only once this entire update process has reached a finished, terminal state will the update client even consider looking at any new updates. This is what I mean by there being no “window”, as you put it. If the temp file still exists, then the system should still be in the middle of some kind of update and should not even entertain other updates until a finished state is reached.
Sorry for the delayed answer.
Before 6.4.0 we were running 6.3.0-monthly-something. The lockboxes were working before as well.
In the meantime we updated other systems successfully (without having to restart aktualizr).
To be honest, I have no clue what was wrong with the devices affected by this issue. Some ideas:
long-running system without internet access → wrong time
system messed up from previous lockbox tests (without rebooting, and with docker-compose-only lockboxes)
In the meantime, those systems have been updated and are serving another purpose. Therefore I suggest closing this thread, but leaving it here as a reference with a rough description in case the same behaviour happens again somewhere else…
In the meantime we updated other systems successfully (without having to restart aktualizr).
Strange, but at least it doesn’t seem like a permanent issue. Well, thanks for reporting back. I’ll try to keep this in the back of my mind in case I see something similar in the future.
We may have made a finding on this after we encountered it today (and earlier this week): a lockbox install failing and aktualizr complaining that its /tmp folder no longer existed.
After some digging I found that there appears to be a cleanup rule for /tmp in /usr/lib/tmpfiles.d/tmp.conf which cleans items more than 5 days old, and I don’t see any exclusion rules to prevent aktualizr’s folders from being cleaned.
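The rule in question looks roughly like this (illustrative; the type letter and field values on the image may differ, but the 5-day age is the relevant part):

# tmpfiles.d syntax: type  path  mode  user  group  age
# an age on the /tmp line enables age-based cleanup of its contents
q /tmp 1777 root root 5d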
This could explain the disappearance of Aktualizr’s temp directory on systems with long uptimes (as ours have; this unit currently has an uptime of 31 days).
We are running some experiments to confirm this theory, but I figured I’d drop an update here to further the discussion.
This could explain the disappearance of Aktualizr’s temp directory on systems with long uptimes (as ours have; this unit currently has an uptime of 31 days).
This is a known interaction and is technically possible. But to clarify, systemd-tmpfiles only cleans up files that have gone longer than 5 days without being accessed, modified, or changed, i.e. based on a file’s atime, mtime, and ctime.
In usual Aktualizr operation, the temporary files related to the Aktualizr process should be accessed at least somewhat regularly. On the online update side, I know for sure that every polling event results in the temporary files being accessed, which would prevent cleanup.
That said, there could be some circumstance in a purely offline update workflow that causes Aktualizr’s temporary files to go stale for 5 days, even on devices that are only partially offline. For example, a device may hit an NTP server upon going online and its time suddenly changes, which looks like a large time jump and could in theory trigger a cleanup.
A suggestion for your experiments: you could try manually reducing the cleanup time from 5 days to something shorter, so you can run experiments quicker without having to wait as long for the files to be considered stale. Keep in mind, though, that if you set the cleanup time to less than 1 day, you also have to change the systemd-tmpfiles-clean.timer that triggers the automatic cleanup. By default this timer triggers a cleanup 15 minutes after boot and then once a day afterwards.
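A minimal sketch of what I mean, using the standard systemd mechanisms (the file names and the 2-hour/1-hour values below are just examples):

# 1. Shadow the stock rule: a file with the same name under /etc/tmpfiles.d
#    takes precedence over the one shipped in /usr/lib/tmpfiles.d.
printf 'q /tmp 1777 root root 2h\n' > /etc/tmpfiles.d/tmp.conf

# 2. For ages below 1 day, also make the cleanup timer fire more often
#    (On*Sec settings are cumulative, so clear the default before setting a new one).
mkdir -p /etc/systemd/system/systemd-tmpfiles-clean.timer.d
cat > /etc/systemd/system/systemd-tmpfiles-clean.timer.d/override.conf <<'EOF'
[Timer]
OnUnitActiveSec=
OnUnitActiveSec=1h
EOF
systemctl daemon-reload

# 3. Or simply force a cleanup pass right now
systemd-tmpfiles --clean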
That said, there could be some circumstance in a purely offline update workflow that causes Aktualizr’s temporary files to go stale for 5 days, even on devices that are only partially offline. For example, a device may hit an NTP server upon going online and its time suddenly changes, which looks like a large time jump and could in theory trigger a cleanup.
This is consistent with our observation (sample size n=1) where the /tmp path has a timestamp that corresponds to system boot.
We’ll keep you posted with our findings.
Edit: we just checked a system with an uptime of ~3.5 days and the timestamps on the /tmp files are still consistent with the time the system booted. I think we can be reasonably certain this cleanup is the cause of the issue.
Note this device was internet connected, but is running Torizon 6.3 (and so provisioned for offline-only mode).
Additionally, two systems with uptimes greater than 5 days were observed to be missing the temporary directory.
This seems pretty definitive to me that the cleanup is the cause of the issue.
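For reference, the check we are doing is roughly the following (assuming coreutils stat and a procps uptime; the exact /tmp/aktualizr-* directory name differs per device):

# compare the three timestamps of aktualizr's temp directory against the boot time
stat -c '%n  atime=%x  mtime=%y  ctime=%z' /tmp/aktualizr-*
uptime -s   # prints the boot time, for comparison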
Interesting. Just for context: it sounds like your devices run continuously for a while (5+ days) without any kind of update or reboot. Is that correct?
Note this device was internet connected, but is running Torizon 6.3 (and so provisioned for offline-only mode).
Have you also observed this on any device running a more recent version of Torizon OS? I’m asking because the Aktualizr code has had some changes and refactoring fairly recently, especially since 6.3. I’m not sure whether this changes how often the /tmp files get accessed, but it would be good to confirm, since any potential fix would of course go into the latest version.
Thank you again for your findings and investigation here. It’s really a big help for us.
Just for context: it sounds like your devices run continuously for a while (5+ days) without any kind of update or reboot. Is that correct?
That’s correct. These are life science products and are running without reboot for the duration of an experiment, which can be on the order of weeks.
We do have a plan in place for an OS update to 6.7; we’ll revisit the issue when that is implemented.
I did some testing here using our latest Torizon OS software.
Interestingly, when Aktualizr first starts up configured for offline updates only, the /tmp/aktualizr-* directory isn’t created yet. It seems to get created when I attach my first Lockbox. That said, even after the offline update completes, the directory exists but is empty.
For context, in the online update case this directory contains TLS-related certs and files. Obviously these aren’t needed in a pure offline use case. My best guess is that there’s some erroneous code that gets called in the offline update flow even though it’s only really needed for the online update flow.
So that is one issue here. The other issue is, of course, that this behavior is only triggered because the /tmp/aktualizr-* directory disappears. Continuing my testing after the first offline update, I left the device running for a while. When I checked later, all the timestamps on /tmp/aktualizr-* were still from the time of my first update. I then performed another offline update, checked the timestamps again, and now they were updated.
Conclusion: it seems the timestamps only get updated during an actual offline update. This is unlike online updates, where the timestamps get updated whenever the server is polled, regardless of whether there is an actual update event.
So, given all this, it seems likely that on an offline device /tmp/aktualizr-* could get cleaned up by the pruning service once the device has gone 5+ days without a power-cycle or an offline update. Furthermore, this still seems to be the case even on our latest software.
I’ll report this to our team given our combined findings. Thank you again for your own input.
Thanks for the update. We’ve mitigated it in our deployment by adding an exclusion rule to /etc/tmpfiles.d, modelled on /usr/lib/tmpfiles.d/systemd-tmp.conf.
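For anyone else who runs into this, the rule is a one-liner along these lines (the file name is our choice, and the glob simply matches the aktualizr temp directory naming seen in the logs above):

# /etc/tmpfiles.d/aktualizr.conf
# 'x' excludes matching paths (and their contents) from age-based cleanup
x /tmp/aktualizr-*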