Feature request: Pass --force-recreate to docker-compose up to fix weird network state

We are using TorizonCore on an i.MX8MP board with a custom stack of Docker containers. On a few machines deployed at customers, we have seen a weird issue where the Docker containers failed to start. On one of these machines, we confirmed that there is an issue with the docker-compose network that prevents the containers from being started.

What seems to have happened is that the torizon_default network was removed or recreated (getting a new id), while some (or maybe all, we are not sure right now) containers were not recreated and still pointed to the old network id. If you (or docker-compose.service) then try to bring these containers up, that fails with:

Error response from daemon: network c9c46b4cc2257dc4f0c1122e41a56293c0ce0ba8523fcbe6823d6cf2c3156ba7 not found
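
For reference, the mismatch can be made visible by comparing the network id a container still references with the id of the current network (the container name below is just a placeholder):

# network id recorded in the container's endpoint settings
docker inspect -f '{{range $name, $net := .NetworkSettings.Networks}}{{$name}}: {{$net.NetworkID}}{{"\n"}}{{end}}' some_container
# id of the network that currently exists
docker network inspect -f '{{.Id}}' torizon_default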

We are quite unsure how this situation arose (we shipped a machine back from Japan and investigated its current state, but we did not set up persistent logging and the customer has limited information on what happened). We suspect it might be caused by cutting power halfway through an update, boot, or shutdown (maybe repeatedly).

However, we found that one good way to fix this issue when it arises is to pass --force-recreate to the docker-compose up run by docker-compose.service. If the broken situation arises, this discards any existing containers and recreates them, pointing at the new/current torizon_default network. In the normal, non-broken case, a clean shutdown will already have run docker-compose down (via docker-compose.service), so there will be no containers to recreate (and this is also a good failsafe in case the down did not run, e.g. on an unclean shutdown).
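
For a unit that is already in the broken state, the same flag also works as a one-off manual fix. A sketch, using the same paths as in the reproduction steps below:

systemctl stop docker-compose.service
cd /var/sota/storage/docker-compose
# discard the stale containers and recreate them against the current network
docker-compose up -d --force-recreate
# the service's own up will then reuse the freshly created containers
systemctl start docker-compose.service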

Is this something that could perhaps be added to the default images?

I did just realize that we’re still using an older version of TorizonCore (torizon-core-docker-verdin-imx8mp-Tezi_6.7.0+build.18.tar), so maybe this issue no longer exists in a newer version (but we have no opportunity to confirm this until next week).

Reproducing

To reproduce the “broken” state, one can simply remove the network (which requires stopping the containers) without removing the containers (which requires stopping docker-compose.service first, to prevent it from removing the containers later).

So:

# stop the service so it does not bring the containers down/up behind our back
systemctl stop docker-compose.service
cd /var/sota/storage/docker-compose
# create the containers and the torizon_default network, then stop (not remove) the containers
docker-compose up -d
docker-compose stop
# remove the network while the stopped containers still reference its id
docker network rm torizon_default
# on reboot, docker-compose.service tries to up the stale containers and hits the error above
shutdown -r now
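
After the reboot, the failure from the start of this post should show up in the service log:

journalctl -u docker-compose.service | grep 'not found'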

Hello @matthijs,

Thanks for the detailed report of your issue.

In your case, a solution would be to customize the docker-compose.service file as you want, i.e. with --force-recreate passed. Then you can use the TorizonCore Builder tool to capture the changes and build a custom Torizon OS image with them, which you can then use in your further development or production.
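
As a minimal sketch, such a customization could be done with a systemd drop-in. Note that the original ExecStart line is an assumption here (including the -p torizon project name, inferred from the torizon_default network name); please check the real command on your device with systemctl cat docker-compose.service and copy it, only adding --force-recreate:

mkdir -p /etc/systemd/system/docker-compose.service.d
cat > /etc/systemd/system/docker-compose.service.d/force-recreate.conf <<'EOF'
[Service]
# clear the packaged ExecStart, then re-add it with --force-recreate;
# the rest of the packaged unit (e.g. its working directory) is kept
ExecStart=
ExecStart=/usr/bin/docker-compose -p torizon up --detach --force-recreate
EOF
systemctl daemon-reload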

Making this the default behavior could introduce breaking changes for other customers. However, I will still talk to our development team and see what they think about it.

Yeah, that’s what we’re doing now, but I think this change could benefit others (and the fewer local changes, the better, of course), hence this request.

Is there any use case in particular that you think would break from this? Or do you just mean that it can change behavior in some cases?

With --force-recreate, containers are destroyed and replaced, even if their configuration did not change. Any non-persistent data inside the container filesystem will be lost. Therefore, customers relying on container-local state (instead of docker volumes or bind-mounts) would experience data loss.
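
As a hypothetical illustration (“app” is just an example service name from a compose file):

docker-compose up -d
docker-compose exec app sh -c 'echo important > /tmp/state'
docker-compose up -d                     # no config change: container reused, /tmp/state survives
docker-compose up -d --force-recreate    # container replaced, /tmp/state is gone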

Also, it could increase downtime on restart, since containers are recreated and started from scratch rather than being reused.

That is true, but docker-compose.service already runs docker-compose down on shutdown, which also destroys the containers and causes them to be recreated on startup. So any customer relying on container-local state can never use a clean shutdown anyway (though maybe there are deployments that only ever see hard power cuts).

Hi @matthijs,

You are right. I have submitted a feature request to our development team. They will investigate and test this. I will update you with the results as soon as I hear back from them.

Thanks a lot!