Manually stopping containers launched from docker-compose.tmp.yml

bw908 · August 26, 2024, 6:18pm

Preamble/context:

We have a series of scripts integrated into the base OS image (via TCB) which are used to kick off the update process for updates placed on USB drives when certain conditions are met (not blocked by the running software, prompt approved by the user, etc).
These scripts don’t interfere with Aktualizr, rather they just symlink the offline-update directory to the correct place for it to be seen once all conditions are met.

Issue/question:

As part of the update process, we stop the running containers via systemctl stop docker-compose. This appears to be working fine under most circumstances, except during the first reboot (or container bringup) post-update, where the containers stay running.

This is due to the first-launch of new containers coming from from docker-compose.tmp.yml instead, as part of the OS/docker rollback mechanism if the update fails.

Is it safe to stop the containers launched in this way manually via a system command? I don’t have enough insight to the inner workings to know whether this could potentially interfere with the “software has started successfully” checks if containers launched from the .tmp.yml are stopped via any “external” means.

If yes, what is the recommended way to stop these containers if the prior systemctl command does not work?

Thanks!

~BW908

ToradexAI · August 26, 2024, 6:18pm

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!

Did you like my answer?

Yes, it is generally safe to stop containers launched from docker-compose.tmp.yml manually via a system command. However, you need to ensure that this action does not interfere with the “software has started successfully” checks, especially if you are using mechanisms like Greenboot for health checks and rollbacks.

To stop these containers, you can use the docker-compose command directly, targeting the temporary YAML file. Here is a recommended approach:

Identify the Temporary Docker Compose File: Ensure you know the exact path of the docker-compose.tmp.yml file.
Stop the Containers: Use the docker-compose command to stop the containers defined in the temporary YAML file. For example:
```
docker-compose -f /path/to/docker-compose.tmp.yml down
```

This command will stop and remove the containers defined in the docker-compose.tmp.yml file.

Note: Ensure that stopping these containers does not disrupt any critical services or health checks that might be running as part of the update process. You may need to coordinate this action with your update scripts to ensure a smooth transition.

Best regards, Toradex AI Assistant

jeremias.tx · August 26, 2024, 10:11pm

Greetings @bw908,

As part of the update process, we stop the running containers via systemctl stop docker-compose.

Okay, you’re going to have to explain this part to me. Why are you running this in the first place?

This appears to be working fine under most circumstances, except during the first reboot (or container bringup) post-update, where the containers stay running.

When you say “first reboot post-update”, are you doing a synchronous update of some kind? Otherwise, a standalone docker-compose update does not cause the system to reboot.

I’m not quite sure what your full use-case is here, but let me try and explain how the process works. Given what you’ve said I’m assuming you’re doing a synchronous OS & docker-compose update. correct me if I’m wrong.

Okay so when you do a sync update, the docker-compose.yml from your update package is copied into /var/sota/storage/docker-compose/docker-compose.yml.tmp on the device. Before the reboot all Aktualizr does on the docker-compose portion of the update is pull/load the container images onto the device.

After the reboot, Aktualizr then executes docker-compose up on /var/sota/storage/docker-compose/docker-compose.yml.tmp. If this succeeds, and the OS portion of the update succeeds then the tmp file overwrites the actual file at /var/sota/storage/docker-compose/docker-compose.yml. Otherwise the sync update will rollback.

The docker-compose.service in systemd has nothing to do with this post-reboot action it’s all in Aktualizr. That is probably why you’re observing the behavior you’re seeing, systemctl stop would not affect this. You can even see from the docker-compose.service itself that it explicitly doesn’t act if the tmp file exists: meta-toradex-torizon/recipes-containers/docker-compose-up/files/docker-compose.service at kirkstone-6.x.y · torizon/meta-toradex-torizon · GitHub

If yes, what is the recommended way to stop these containers if the prior systemctl command does not work?

This goes back to my first question now, of why you’re trying to do this in the first place. My answer/suggestion may depend here on why and what you’re trying to do exactly.

Best Regards,
Jeremias

bw908 · August 27, 2024, 5:29pm

Sure, I realize it’s an off-the-beaten-path setup. We do generally do synchronous “OS + container” updates as a result of various bug-fixes/features and maintenance we do to both components. That said, I certainly expect the OS side to decrease in frequency as it matures and be limited to primarily updates that track upstream Torizon releases.

The main reason(s) we are stopping the containers manually prior to the update is preventative and avoiding resource contention. (this docker handling code is also shared with some other (diagnostic) tools where we also want to avoid race conditions between the software continuing to write certain files and those files being copied to e.g. an external USB drive.

Our update handling implementation (and diagnostic tools) are a collection of scripts that are embedded in the OS side of the image (rather than the container side) to facilitate software installs or diagnostics in cases where the containers either don’t work (bug, HW failure, or some other cause we could not forsee/prevent) or are not present yet (initial deploy of the OS, as we opt not to bundle docker images inside the basic OS image to reduce OS churn in manufacturing/deployment).

These scripts provide the user with progress feedback about the update itself, either through the use of the Plymouth system-upgrade display mode if a display is present, or other hardware (Indicator LED) if these are not available.

Naturally, the running applications will make use of both of these things under normal operation, so we want to avoid a conflict between the two operations, and it also simplifies the software’s handling surrounding the update since it assumes it will be stopped and can simply check for a special “update result” file that is written that tells it whether it was successful, failed, user-aborted, etc.

I hope that helps clarify more of the context surrounding my inquiry.

~BW908

jeremias.tx · August 27, 2024, 10:16pm

Okay, so basically the processes inside the container has some race conditions with processes outside of the container. So you stop the container after they get updated to prevent this race condition issue, then you start the containers later after some time. Is that more or less what is happening on your systems?

Given this then, let’s think what can be done for your system/use-case here.

Okay so you want to stop the containers from running after the reboot. However, Aktualizr needs to run the containers at least once successfully to validate the success of the overall update. First of all, I assume you’ve already tried to resolve this race condition type issues, and it was not possible/feasible?

As I said previously, currently Aktualizr needs to run docker-compose up on the new updated containers at least once to attest the success of the update or not. There’s not really a way to prevent this initial operation without impacting the rest of the update process.

In the case of a sync update as I said before, Aktualizr after the reboot will run docker-compose up on /var/sota/storage/docker-compose/docker-compose.yml.tmp first. If this is successful then the file overrides the non-temp file at /var/sota/storage/docker-compose/docker-compose.yml. At this point the update would be considered successful and then you can stop the containers. In which case you could try to execute docker-compose -p torizon -f /var/sota/storage/docker-compose/docker-compose.yml down. This should then bring down the new containers. It’s important though you only try to do this after the temp file has replaced the non-temp file. Otherwise the new set of containers would not be stopped/effected. Not sure how you can time this exactly, maybe you can monitor for when the *.tmp file no longer exists. As this would indicate it has overwritten the non-temp file.

Would something like this work for your needs?

Best Regards,
Jeremias

bw908 · August 28, 2024, 2:15pm

Correct.

Not quite, we are looking to stop the containers before the update. I apologize if I’m not doing a very good job of describing what I am looking for. Here’s a more detailed workflow and outline of the issue if that helps clarify the confusion:

User inserts USB drive with an update. OS scripts detect the update and launch.
Scripts execute any logic to check if the update should be installed or not
If nothing prohibits installing the update in step 2, the currently running containers are stopped, as outlined before via systemctl stop .....
The update installs as normal with progress updates being shown on-screen via plymouth. If it is a synchronous update, the system will reboot when finished and launch the new containers. If it is docker-only, they will be restarted once installed without an OS reboot (in both cases, from the docker-compose.tmp.yml to facilitate rollback, per the above).
The updater scripts are terminated either through seeing that aktualizr has finished working or via system reboot. The system is now updated.

The above appears to be working fine for us as-implemented under most circumstances. However, the issue I am trying to solve is if the user installs a second update without having shut down or rebooted the system again after step 5 - that is, the containers are still running from the docker-compose.tmp.yml instance, and so the stop command in step 3 does nothing (which is expected, though undesired).

One solution is, of course, to mandate a reboot from the software side (in step 2) if we detect the containers are still running from the .tmp.yml, however this looks “odd” from a user perspective because the system may already have rebooted as part of the synchronous update process if OS changes were installed.

Reading your previous detailed reply though, I think I can (and correct me if I am wrong) infer that because the .tmp.yml is fully managed by Aktualizr, it would either not be technically possible, or ill-advised to stop those containers, due to possible problems it would create with the rollback mechanisms. This is an acceptable response to my inquiry; we are not mandating “it must be done this way”, but rather seeking to understand whether it can be done to suit our existing workflow.

~BW908

jeremias.tx · August 28, 2024, 10:59pm

Reading your previous detailed reply though, I think I can (and correct me if I am wrong) infer that because the .tmp.yml is fully managed by Aktualizr, it would either not be technically possible, or ill-advised to stop those containers, due to possible problems it would create with the rollback mechanisms. This is an acceptable response to my inquiry; we are not mandating “it must be done this way”, but rather seeking to understand whether it can be done to suit our existing workflow.

Well again, the *.tmp file should only exist for short periods of time, when the update is actually in-progress still. Once the update is complete (success or failure) then this file should no longer exist. At this point it should be safe to stop the containers. Though as I said in the case of a sync update Aktualizr needs to run the containers at least once for the update procedure to be complete. This part can’t be avoided.

Maybe instead of using systemctl stop you should just use something like docker stop $(docker ps -a -q) to stop all running containers. The problem with systemctl stop is that it relies on docker-compose.service. But if you check the definition of this service that I linked previously it does not run at all if the *.tmp file exists on boot, which it will exist on boot in the case of sync updates. Once you reboot the system or systemctl restart the service the *.tmp file shouldn’t exist anymore and the service can behave normally again.

Best Regards,
Jeremias

bw908 · August 30, 2024, 2:48pm

Thanks, we’ll give that a try and see if it works for us.

jeremias.tx · August 30, 2024, 8:39pm

Thanks, we’ll give that a try and see if it works for us.

Alright, just let us know if we can be of any more help.

Best Regards,
Jeremias

jeremias.tx · September 13, 2024, 6:11pm

Just wanted to check if you still had any issues here, or this was more or less resolved.

bw908 · September 18, 2024, 12:36pm

Thanks for the check-in; unfortunately we haven’t yet had a chance to try it out. I’ll follow up when we have.

I am reasonably certain it should work though - I spent some time looking at the aktualizr code and I can see now where some of the earlier confusion stems from - the .tmp.yml does indeed only persist for a very short time and is “installed” as soon as the compose up command completes without error - but of course those containers spawned in this way do persist until the next reboot.

~BW

jeremias.tx · September 18, 2024, 6:05pm

Thanks for the check-in; unfortunately we haven’t yet had a chance to try it out. I’ll follow up when we have.

No worries, just wanted to check in, take your time.

I spent some time looking at the aktualizr code and I can see now where some of the earlier confusion stems from - the .tmp.yml does indeed only persist for a very short time and is “installed” as soon as the compose up command completes without error - but of course those containers spawned in this way do persist until the next reboot.

Glad you were able to understand looking at the code itself. I tried to explain earlier, but some things are harder to explain in words than code.

Best Regards,
Jeremias