We are using TorizonCore on an i.MX8MP board with a custom stack of Docker containers. On a few machines deployed at customers, we’ve seen a weird issue where the Docker containers failed to start. On one of these machines, we have confirmed that there is an issue with the docker-compose network that prevents the containers from being started.
What seems to have happened is that the torizon_default network was removed or recreated (getting a new ID), while some (or maybe all, not sure right now) containers were not recreated and still pointed to the old network ID. If you (or docker-compose.service) then try to bring these containers up, that fails with:
Error response from daemon: network c9c46b4cc2257dc4f0c1122e41a56293c0ce0ba8523fcbe6823d6cf2c3156ba7 not found
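A rough way to check a machine for this state is sketched below. It assumes the stale network ID is still visible in the docker inspect output of the affected containers (we have not verified this on every Docker version), and the function name list_stale_containers is just made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print every container that references a network ID
# which the Docker daemon no longer knows about.
list_stale_containers() {
    # All containers, including stopped ones (IDs only).
    for cid in $(docker ps -aq); do
        # Network IDs recorded in the container's settings, space separated.
        for nid in $(docker inspect --format \
            '{{range .NetworkSettings.Networks}}{{.NetworkID}} {{end}}' "$cid"); do
            # "docker network inspect" exits non-zero if the network is gone.
            if ! docker network inspect "$nid" >/dev/null 2>&1; then
                echo "$cid references missing network $nid"
            fi
        done
    done
}
```

Calling list_stale_containers on a healthy machine should print nothing; on a machine in the broken state it should list the containers that still point at the old torizon_default ID.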
We are quite unsure how this situation arose. We shipped a machine back from Japan and investigated its current state, but we had not set up persistent logs and the customer has limited info on what happened. We suspect it might be caused by cutting power halfway through an update, boot or shutdown (maybe repeatedly).
However, we found that one good way to fix this issue, if it arises, is to pass --force-recreate to the docker-compose up run by docker-compose.service. If the broken situation arises, this discards any existing containers and recreates them, pointing at the new/current torizon_default network. In the normal, non-broken case, a clean shutdown will already have run docker-compose down (via docker-compose.service), so there will be no containers to recreate. This also makes it a good failsafe in case the down did not work, e.g. on unclean shutdown.
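For anyone who wants to try this without rebuilding the image, a systemd drop-in is one option. The sketch below is only an assumption of what that could look like: the ExecStart line (binary path, -p torizon, --remove-orphans) is a guess at the stock command and should be checked against the output of systemctl cat docker-compose.service, and the function and file names are made up:

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that adds --force-recreate to the
# "up" command of docker-compose.service.
write_force_recreate_dropin() {
    # Default drop-in directory for the unit; can be overridden for testing.
    dropin_dir="${1:-/etc/systemd/system/docker-compose.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/force-recreate.conf" <<'EOF'
[Service]
# An empty ExecStart= clears the command inherited from the stock unit.
ExecStart=
# ASSUMPTION: stock command plus --force-recreate; verify with
# "systemctl cat docker-compose.service" before using this.
ExecStart=/usr/bin/docker-compose -p torizon up -d --remove-orphans --force-recreate
EOF
}
```

After running write_force_recreate_dropin as root, a systemctl daemon-reload is needed for systemd to pick up the drop-in. The change then takes effect the next time docker-compose.service starts.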
Is this something that could perhaps be added to the default images?
I did just realize that we’re still using an older version of TorizonCore (torizon-core-docker-verdin-imx8mp-Tezi_6.7.0+build.18.tar), so maybe this issue does not exist in a newer version (but I have no opportunity to confirm this until next week).
Reproducing
To reproduce the “broken” state, one can simply remove the network (requires stopping the containers) without removing the containers (requires stopping docker-compose.service first, to prevent it from removing the containers later).
So:
systemctl stop docker-compose.service
cd /var/sota/storage/docker-compose
docker-compose up -d
docker-compose stop
docker network rm torizon_default
shutdown -r now