Dead containers - docker-compose.yml

DAH · January 14, 2025, 7:24pm

Hi,

I am using a imx8mm SOM with a number of containers which get created when the unit is powered on via a docker-compose.yml file placed in /var/sota/storage/docker-compose/ directory.

This has been working fine until recently when one or more of the containers fail to start and transition to the ‘dead’ state - see attached.

When a container is in the dead state the only way to recover the container is to use the docker prune command via ssh. Power cycling the unit does not work. Once the containers have been pruned the docker-compose.yml file works as expected by recreating the containers on power up.

I have checked the docker-compose systemctl status - see attached.

Looking through the docker-compose status the command ‘docker-compose -p torizon up -d --remove-orphans’ is executed. However, this does not remove any dead containers from the previous power cycle.

Is there any way to recover the dead containers automatically without having to ssh into the unit and manually prune the containers?

How can I tell what is causing the containers to transition into the dead state?

Thanks in advance

ToradexAI · January 14, 2025, 7:25pm

Hi, I’m the Toradex AI Assistant. The Toradex Support team will follow up with you shortly!

Did you like my answer?

Hello,

I’m sorry to hear about the issues you’re experiencing with your containers on the iMX8MM SOM. Let’s explore some strategies to automatically recover these containers and diagnose the root cause of the problem.

Automatic Recovery

Docker Health Checks: Consider adding health checks to your containers in the docker-compose.yml file. This involves using the HEALTHCHECK instruction to periodically test the container’s functionality. If a container is marked as “unhealthy,” it can be automatically restarted. Ensure each container is labeled with autoheal=true to enable this feature.
Docker Watchdog Service: You can enable a watchdog service to monitor and restart containers that fail health checks. This can be set up using systemd to ensure the service is active and monitoring your containers.

Diagnosing the Cause

Examine Logs: Check the logs of both the containers and the Docker daemon for any errors or warnings that might indicate why the containers are entering the ‘dead’ state. This can provide valuable insights into potential issues such as resource constraints or configuration errors.
Resource Monitoring: Verify that your system has adequate resources (CPU, memory, storage) to support all running containers. Resource exhaustion can lead to containers failing to start properly.
Review Docker Compose Configuration: Double-check your docker-compose.yml file for any misconfigurations that might affect container startup. Ensure that dependencies between containers are correctly defined.
Inspect Docker Images: Ensure that the Docker images you are using are not corrupted or outdated. Pulling fresh copies of the images can help rule out image-related issues.

By implementing these strategies, you can improve the reliability of your containerized applications on Torizon and reduce the need for manual intervention.

I hope this helps resolve your issue. If you have any more questions, feel free to ask.

Best regards,
Toradex AI Assistant.

Please, also check if these links can help you solve your problem:

Docker-compose.yml.

jeremias.tx · January 14, 2025, 11:35pm

Greetings @DAH,

Is there any way to recover the dead containers automatically without having to ssh into the unit and manually prune the containers?

If you want the system to try and cover automatically you should take a look at our Docker integrity checker feature, as documented here: Reliability on Torizon | Toradex Developer Center

When enabled this will trigger a script if the docker-compose systemd service fails like in your case. The script will try various methods to try and recover your containers before eventually just pruning them and restarting them from scratch.

That said you probably should get to the bottom of the actual issue.

How can I tell what is causing the containers to transition into the dead state?

This is kind of tricky to determine. The “Dead” state for containers is not a normal state and there are quite a few things that can cause this. Also typically dead containers don’t have a lot of logs to parse so it’s hard to figure out what caused the state to begin with.

Some common cases that can cause this are:

System crashed or improperly shutdown while a container was running leaving the container in a strange state which causes Docker to consider it “Dead”.
The container itself was improperly shutdown/removed
Various issues with the Docker daemon like memory leaks or resource constraints

You said your system was working fine until “recently”. Did something change about your system or setup that may have caused this?

Best Regards,
Jeremias

DAH · January 17, 2025, 8:58am

Thank you for your reply.

I think I have narrowed it down to something not shutting down correctly but the investigation is still on going.

I will look into the docker integrity checker feature that you send the link to.

Thanks

jeremias.tx · January 17, 2025, 6:32pm

Thanks for giving us an update, please do let us know how your investigations turns out.

Best Regards,
Jeremias

jvieira · February 12, 2025, 8:49am

Hi @jeremias.tx ,

Pruning the containers manually and relaunching fixes it.

You mentioned above that the system would prune the containers if all else fails but I’m not sure if this is happening.
Of course that getting to the route cause (power related issue) is essential but even if there’s some power related issue when powering down/up the board, we still need a fallback mechanism like the container pruning to recover.
My question is, why would a manual prune work but not the one implemented by the toradex script?

Regards,
Joao

jeremias.tx · February 12, 2025, 6:02pm

You mentioned above that the system would prune the containers if all else fails but I’m not sure if this is happening.

What do you mean exactly by “not sure”?

You can see the script the integrity checker runs here: meta-toradex-torizon/recipes-containers/docker-integrity-checker/files/docker-integrity-checker.sh at scarthgap-7.x.y · torizon/meta-toradex-torizon · GitHub

Please notice that it only tries to attempt docker system prune starting with recovery attempt 11 onward. Before this it still tries to recover your containers via re-pulling the container image. If you’re not waiting long enough then you might not see the docker system prune executions.

Best Regards,
Jeremias