We recently encountered a problem where one of our devices corrupted its docker state and failed to start. Log analysis seems to suggest the unit wasn’t properly shut down.
When it was powered back up, the system was stuck in an endless cycle because the docker network was gone:
Oct 11 18:02:46 Maverick-Hub-133 docker-compose[1180]: Container torizon-kiosk-1 Started
Oct 11 18:02:46 Maverick-Hub-133 docker-compose[1180]: Error response from daemon: network edda8bc27d6960d30d057ef472d52c8f66163dfc04239eafee9827eed4c0f818 not found
Oct 11 18:02:46 Maverick-Hub-133 systemd[1]: docker-compose.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 18:02:46 Maverick-Hub-133 systemd[1]: docker-compose.service: Failed with result 'exit-code'.
Oct 11 18:02:46 Maverick-Hub-133 systemd[1]: docker-compose.service: Triggering OnFailure= dependencies.
Oct 11 18:02:46 Maverick-Hub-133 systemd[1]: Starting Verify Docker images integrity and recover from Docker data corruption...
Oct 11 18:02:46 Maverick-Hub-133 docker-integrity-checker.sh[1969]: docker-compose service has failed.
Oct 11 18:02:46 Maverick-Hub-133 docker-integrity-checker.sh[1969]: Docker integrity checker is disabled. Create /etc/docker/enable-integrity-checker file to enable it.
Oct 11 18:02:46 Maverick-Hub-133 docker-integrity-checker.sh[1969]: Restarting docker-compose in 10 seconds...
We read up on the Docker integrity checker mentioned in the log, but it doesn’t seem like it would have fixed this even if it had been enabled: the documentation suggests it only handles corrupted container images, not issues with ancillary components such as networks. In our case the devices are also offline, so there’s no option to re-pull from the Docker registry anyway.
It also doesn’t sound like the Docker watchdog would have helped, as it is described as monitoring only unhealthy containers.
We were able to resolve the issue by updating the software (since that triggers a re-creation of the various components in the compose stack), but I did want to bring this up here to discuss whether there is a better way to mitigate this, and if not, whether it’s something the Docker watchdog could potentially also detect and resolve.
Hi, I’m the Toradex AI Assistant. The Toradex Support team will follow up with you shortly!
The issue you encountered with the corrupted Docker network, which led to the system failing to start, is indeed a challenging one. Based on the context provided, here are some insights and potential considerations:
Docker Integrity Checker: As you noted, the Docker integrity checker is primarily designed to handle corrupted container images and not network or ancillary component issues. Therefore, enabling it would not have resolved the network corruption problem.
Docker Watchdog: The Docker watchdog is focused on monitoring unhealthy containers, not network issues. It would not have been effective in this scenario either.
Software Update: You successfully resolved the issue by updating the software, which re-created the Docker components. This suggests that a software update can serve as a workaround for restoring functionality after a corruption incident.
Mitigation Strategies: There is a need for better solutions to mitigate Docker network corruption issues. Current tools do not adequately address these problems, highlighting a gap in Docker’s recovery mechanisms.
Future Enhancements: It may be beneficial to explore enhancements to Docker’s monitoring and recovery systems to prevent similar issues in the future. This could involve developing more robust mechanisms for detecting and recovering from network state corruption.
In summary, while the current tools like the Docker integrity checker and watchdog are not designed to handle network corruption, your approach of using a software update as a workaround is effective. However, it would be worthwhile to consider advocating for or developing enhancements to Docker’s recovery capabilities to better handle such scenarios in the future.
We do have some additional reliability settings that can be configured. These are documented here. It’s not clear if any of these would help for your issue in particular but please take a look. If we need to beef those up we would definitely like to know.
Thanks - that is the page we did already find and it’s not obvious that any of them would address the problem.
It’s possible that the integrity checker could resolve this, if it is set up such that it deletes the existing running containers and recreates them from their images as part of the process; this would also re-create the Docker network.
If any of the Docker images are identified as corrupted, they will be deleted and re-pulled from the container registry again.
Our caveat here is that the system is offline. If an image is corrupted there’s obviously not much we can do, but this service also wouldn’t be much help to us if it depends on connectivity to function at all. We’re certainly happy to try enabling it going forward if it turns out it can handle this situation.
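For reference, enabling it looks to be just the marker file that the service’s own log output points at (the path comes straight from the message quoted above; treat everything else here as a sketch):

# Enable the Docker integrity checker, per the hint in its log output:
#   "Create /etc/docker/enable-integrity-checker file to enable it."
sudo touch /etc/docker/enable-integrity-checker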
I’m trying to understand the issue here. Do you have any idea how this device could have gotten into this “corrupted” state in the first place? You said the theory is that the unit wasn’t “properly” shut down, but any more specifics?
Regarding the error in the systemd logs:
Error response from daemon: network edda8bc27d6960d30d057ef472d52c8f66163dfc04239eafee9827eed4c0f818 not found
Out of curiosity was this network truly missing? Or is this error message “incorrect”?
As for suggestions/solutions: you mentioned that updating the software fixes things since it makes the system recreate the container stack. I did notice that docker-compose up has a --force-recreate flag: docker compose up | Docker Docs
According to Docker, this flag will “Recreate containers even if their configuration and image haven’t changed”. I wonder if this has the same effect of recreating the container stack, similar to an update.
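As a minimal sketch (assuming it’s run from the compose project directory; substitute docker-compose for docker compose on systems that still ship the standalone binary):

# Recreate all containers in the project from the images already on disk,
# even if their configuration and images are unchanged
docker compose up -d --force-recreate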
That said, I’m still curious about the conditions that caused this in the first place. I’m not saying we could fix this, as it may well be Docker related, but there might be some possible mitigation or something we could do.
Unfortunately we don’t have much to go on; as the unit has been repaired, we no longer have a state we can inspect to understand whether the network was truly “gone”.
All we have are the tail end of the logs for the prior boot:
Oct 11 18:02:30 Maverick-Hub-133 systemd-logind[866]: System is powering down.
Oct 11 18:02:30 Maverick-Hub-133 systemd-networkd[879]: ethernet0: Lost carrier
-- Boot 0be3821269e34753942ad6eff143b93c --
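(Those lines were pulled from the persistent journal; for anyone following along, something like this shows the tail of the previous boot, assuming the journal survived the power-off:)

# List the boots the journal knows about, then show the last lines of the previous one
journalctl --list-boots
journalctl -b -1 -n 50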
Our suspicion is that someone instructed the unit to shut down and then started unplugging cables without waiting for it to finish. Normally there are many more messages logged after the “System is powering down” entry. Naturally, in a hard power-off the systemd journal is often not flushed to disk either, so it’s not surprising that entries which would provide more information are missing.
As per the GitHub issue, it seems my idea of using the --force-recreate flag would work to force the network and containers to be recreated from scratch. This seems to be the most straightforward recovery method, especially for your devices, which are offline.
I’ll see if we can improve our own OS with this as a recovery method for offline devices, since the integrity checker only really helps online devices, as you already noted.
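As a rough sketch of what that offline recovery path could look like (the compose file path, and the idea of wiring this into the existing OnFailure= handler, are assumptions for illustration, not something the OS does today):

#!/bin/sh
# Hypothetical offline recovery step: instead of re-pulling images, tear the
# compose project down (which also removes its networks) and bring it back up,
# forcing the containers to be recreated from the images already on disk.
COMPOSE_FILE=/path/to/docker-compose.yml   # assumption: project-specific path

docker compose -f "$COMPOSE_FILE" down --remove-orphans
docker compose -f "$COMPOSE_FILE" up -d --force-recreate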