TorizonCore docker-compose.service fails to recover from deleted images, private registry

Dear Developer Community,
we have seen docker-compose failures a couple of times, resulting in errors like the one below. Our analysis and the error loop follow; I’d like to hear your opinions and suggestions on how to avoid new occurrences of the problem.

  • Our customized TorizonCore OS is based on the 5.6 release and has the docker data integrity check enabled as described here (link).
  • Our docker images are hosted in a private registry, and the credentials are stored in /etc/docker to be used by Aktualizr (link).
  • Some days before the problem appeared, both the OS and the apps were updated via Torizon OTA, and the system was then used intensively for about a week.
  • At a certain point, both the docker containers and the docker images disappeared. We suspect that a docker container or the docker engine had an issue or corruption, and that the integrity check removed them.
  • TorizonCore cannot recover because docker-compose.service enters a failure loop (extract below) in which either the [yN] question is answered "no" by default or the private registry login is not loaded (is /etc/docker/config.json used only by Aktualizr and not by docker-compose?).
  • Are OS rollbacks and integrity-check actions logged somewhere persistently, or is there another way to tell whether they were executed?
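
To test the "registry login is not loaded" hypothesis, one option is to check which Docker config directory actually holds an entry for the private registry. The Docker client reads config.json from the directory named by DOCKER_CONFIG and falls back to ~/.docker when the variable is unset. The following is only a minimal sketch: the function name and the registry host registry.example.com are hypothetical, and it only looks at the "auths" and "credHelpers" keys.

```python
import json
import os


def registry_has_login(config_dir, registry):
    """Return True if <config_dir>/config.json names `registry` under
    "auths" or "credHelpers"."""
    path = os.path.join(config_dir, "config.json")
    try:
        with open(path, encoding="utf-8") as fh:
            config = json.load(fh)
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: no usable login here.
        return False
    if not isinstance(config, dict):
        return False
    auths = config.get("auths") or {}
    helpers = config.get("credHelpers") or {}
    return registry in auths or registry in helpers


if __name__ == "__main__":
    registry = "registry.example.com"  # hypothetical private registry host
    # /etc/docker is where our credentials live; ~/.docker is the client default.
    for config_dir in ("/etc/docker", os.path.expanduser("~/.docker")):
        print(config_dir, registry_has_login(config_dir, registry))
```

If only one of the directories reports a login, any service that points at the other directory would pull anonymously and get "access denied".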
systemd[1]: Starting Docker Compose service with docker compose...
systemd[1]: Started Docker Compose service with docker compose.
docker-compose[4857]: Creating network "torizon_default" with the default driver
docker-compose[4857]: Pulling MY_IMAGE (HASH)...
docker-compose[4857]: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.
docker-compose[4857]: Continue with the new image? [yN]pull access denied for MY_IMAGE, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
systemd[1]: docker-compose.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: docker-compose.service: Failed with result 'exit-code'.
systemd[1]: docker-compose.service: Triggering OnFailure= dependencies.
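
The extract above mixes several distinct symptoms, and the prompt and the pull error even share one line. As a rough aid for scanning longer journal exports, the sketch below tags each symptom with a regular expression. The symptom names and the helper are my own labels, not anything defined by docker-compose or systemd.

```python
import re

# Hypothetical labels for the messages seen in the extract above.
SIGNATURES = {
    "image_removed": re.compile(r"image for the service .* has been removed"),
    "prompt_unanswered": re.compile(r"Continue with the new image\? \[yN\]"),
    "pull_denied": re.compile(r"pull access denied"),
    "unit_failed": re.compile(r"Failed with result 'exit-code'"),
}


def classify(lines):
    """Return the sorted names of all symptoms found in the log lines."""
    found = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "docker-compose[4857]: Continue with the new image? [yN]pull access "
        "denied for MY_IMAGE, repository does not exist or may require "
        "'docker login': denied: requested access to the resource is denied",
    ]
    print(classify(sample))
```

Seeing "prompt_unanswered" and "pull_denied" together suggests that the pull is attempted anyway and fails for lack of credentials, so the prompt alone may not be what blocks the recovery.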

Thanks in advance for the attention and the suggestions,
ldvp

Hi @ldvp,

Just to be sure, you’re talking about the docker container/image inside the module, right?

Have you already tried this article? Some of the instructions at the link below may help with your issue.

Could you please tell us what modifications you have made to your module and docker-compose file that might have triggered this issue?
Finally, which module are you using, and which carrier board?

Best regards,
Hiago.

Hi @hfranco.tx,
thanks for the feedback. We are running TorizonCore on a Colibri iMX8X with an Iris carrier board. I confirm that the docker containers and images are those inside the module.

Thanks for the link, but I’ve already read the article about OS rollbacks. I don’t think the problem is actually related to the OS; it is more likely related to the app integrity check. That said, I have no proof that a rollback did not happen, and I don’t know how that could be verified.

The docker-compose file has a few services, two of them using images from the private registry. That docker-compose file worked correctly for about a week, then the issue happened. Do you have suggestions on how to verify that the docker engine filesystem is working, or other health checks I could run?

Bests,
ldvp

Hi @ldvp,

Thanks for the information. Could you please share your image or docker-compose file, along with a step-by-step description of what you did with your module?
With that, I can run it on my side and investigate your problem more thoroughly.
You can send us your files by email or send me a private message here.
Please use share.toradex.com to upload your files.

Let me know if doing this is fine for you.

Best regards,
Hiago.

Hi @hfranco.tx,
the root cause is now confirmed: the default systemd services involved currently set different DOCKER_CONFIG values, so they do not all see the same registry credentials. Thanks to @jeremias.tx for the help!
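
For anyone who wants to check their own system, here is a small sketch that extracts DOCKER_CONFIG from the Environment= lines of unit files and reports whether the units agree. The unit text can be obtained with `systemctl cat <unit>`. The sample contents, the name updater.service and the path /example/compose/docker are made up for illustration and are not the actual TorizonCore defaults; EnvironmentFile= entries are not handled.

```python
import re
import shlex

ENV_LINE = re.compile(r"^\s*Environment\s*=(.*)$")


def docker_config_of(unit_text):
    """Return the DOCKER_CONFIG value set by Environment= lines, or None."""
    value = None
    for line in unit_text.splitlines():
        match = ENV_LINE.match(line)
        if match is None:
            continue
        assignments = shlex.split(match.group(1))
        if not assignments:
            value = None  # an empty Environment= resets earlier assignments
            continue
        for assignment in assignments:
            key, sep, val = assignment.partition("=")
            if sep and key == "DOCKER_CONFIG":
                value = val  # later assignments override earlier ones
    return value


def compare(units):
    """Map each unit name to its DOCKER_CONFIG and say whether all agree."""
    values = {name: docker_config_of(text) for name, text in units.items()}
    return values, len(set(values.values())) <= 1


if __name__ == "__main__":
    # Hypothetical unit contents, for illustration only.
    units = {
        "docker-compose.service":
            "[Service]\nEnvironment=DOCKER_CONFIG=/example/compose/docker\n",
        "updater.service":
            '[Service]\nEnvironment="DOCKER_CONFIG=/etc/docker"\n',
    }
    values, consistent = compare(units)
    print(values, "consistent" if consistent else "MISMATCH")
```

A mismatch means that one service finds the registry login in its config.json while the other does not, which matches the "pull access denied" error in the loop.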
Cheers,
ldvp