Race conditions when debugging multi-container projects with ApolloX

We have been using the multi-container projects feature of the Torizon IDE Extension v2 (ApolloX), and we wanted to report some problems we ran into. We know this feature is experimental, but since the application we are developing needs to be multi-container we don’t have another choice, or do we?

We first noticed problems when trying to debug simple test projects with two containers created from the same template, e.g. two dotnetConsole, two cppConsole, or two cppQML projects.

The first time we ran a debug session on the project, it would debug correctly, but all subsequent debug sessions failed in both build threads with the following error while executing the run-container-torizon-debug-arm64 task:

Container project 1 task terminal output:

Creating network "torizon_default" with the default driver
Creating torizon_cpptest1-debug_1 ... 
Creating torizon_cpptest1-debug_1 ... error

ERROR: for torizon_cpptest1-debug_1  Cannot start service cpptest1-debug: network torizon_default is ambiguous (2 matches found on name)

ERROR: for cpptest1-debug  Cannot start service cpptest1-debug: network torizon_default is ambiguous (2 matches found on name)
Encountered errors while bringing up the project.

 *  The terminal process "sshpass '-p', '*****', 'ssh', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', 'torizon@xxx.xxx.xxx.xxx', 'LOCAL_REGISTRY=xxx.xxx.xxx.xxx TAG=arm64 GPU= docker-compose up -d cpptest1-debug'" terminated with exit code: 1. 

Container project 2 task terminal output:

Creating network "torizon_default" with the default driver
Creating torizon_cpptest2-debug_1 ... 
Creating torizon_cpptest2-debug_1 ... error

ERROR: for torizon_cpptest2-debug_1  Cannot start service cpptest2-debug: network torizon_default is ambiguous (2 matches found on name)

ERROR: for cpptest2-debug  Cannot start service cpptest2-debug: network torizon_default is ambiguous (2 matches found on name)
Encountered errors while bringing up the project.

 *  The terminal process "sshpass '-p', '*****', 'ssh', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', 'torizon@xxx.xxx.xxx.xxx', 'LOCAL_REGISTRY=xxx.xxx.xxx.xxx TAG=arm64 GPU= docker-compose up -d cpptest2-debug'" terminated with exit code: 1. 

Looking at the output of docker network ls, there are indeed two torizon_default networks:

NETWORK ID          NAME                DRIVER              SCOPE
4c0b9db106dc        bridge              bridge              local
3ff9b4be7153        host                host                local
98f73ca1bd83        none                null                local
4fb1f6bca8ab        torizon_default     bridge              local
cc03125124f6        torizon_default     bridge              local

If we don’t delete the networks and try running another debug session, we get the following error while the pre-cleanup-arm64 task executes:

2 matches found based on name: network torizon_default is ambiguous

 *  The terminal process "sshpass '-p', '*****', 'ssh', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', 'torizon@xxx.xxx.xxx.xxx', 'LOCAL_REGISTRY=xxx.xxx.xxx.xxx TAG=arm64 GPU= docker-compose down --remove-orphans'" terminated with exit code: 1. 

Of course, this error happens because two networks with the same name got created in the first place. If we delete the networks manually and run another debug session, we are back to the first error described above.
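For reference, since the name is ambiguous at this point, the duplicate networks have to be removed by ID rather than by name. Using the IDs from our docker network ls output above, the cleanup on the device looks like this:

```shell
# "docker network rm torizon_default" fails with the same ambiguity
# error, so remove the duplicates by their network IDs instead
docker network rm 4fb1f6bca8ab cc03125124f6

# or remove every network matching the name in one go
docker network ls --filter name=torizon_default -q | xargs docker network rm
```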

We couldn’t figure out why docker-compose was creating two networks with the same name, nor why a brand new multi-container project would debug correctly on the first session but fail on all subsequent ones. If we deleted the networks and modified the container projects so that one of them took noticeably longer to build than the other, debugging worked. This got us thinking that we might be hitting a race condition between the build threads: if both threads execute the run-container-torizon-debug-arm64 task at roughly the same time, both detect that the torizon_default network doesn’t exist and both create it. Two networks with the same name end up being created (apparently Docker allows this, since networks are identified by ID, not name), but when each thread then tries to connect its container to the network by name, it runs into the ambiguity error.
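As far as we can tell, this hypothesis can be reproduced outside the IDE. Docker network names are not required to be unique (only network IDs are), and docker network create only checks for an existing name before creating, so two concurrent creates can both slip past the check. A sketch of this, run directly on the device:

```shell
# run two creates concurrently; depending on timing, both may
# succeed, leaving two networks with the same name
docker network create torizon_default &
docker network create torizon_default &
wait

# if the race was hit, two entries show up here, and any later
# operation that refers to the network by name fails as ambiguous
docker network ls --filter name=torizon_default
```

Run sequentially, the second create fails cleanly with a "network with name torizon_default already exists" style error; only the concurrent case slips through.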

We also noticed a workspace setting called wait_sync, which on multi-container projects defaults to increments of 1 second per container project, so we figured it must be a wait period that can be varied per container project to break an undesired synchronization of tasks and avoid race conditions. We tried several combinations of these settings, but because builds are incremental, when little needs to be rebuilt or the container image stays the same, the build threads end up syncing anyway. We needed to separate the threads by several seconds for it to work, and that caused other debugger errors and made debug start-up, which is already fairly slow, even slower.

Finally, we were able to work around the problem by disabling the default network that docker-compose creates and connecting the containers to a custom external network we created manually. We modified our docker-compose.yml file as follows:

version: "3.9"
services:
  cpptest1-debug:
    image: ${LOCAL_REGISTRY}:5002/cpptest1-debug:${TAG}
    ports:
      - 2231:2231
    # connect container 1 to custom external network
    networks:
      - app-network
  cpptest1:
    image: ${DOCKER_LOGIN}/cpptest1:${TAG}
  cpptest2-debug:
    image: ${LOCAL_REGISTRY}:5002/cpptest2-debug:${TAG}
    ports:
      - 2232:2232
    # connect container 2 to custom external network
    networks:
      - app-network
  cpptest2:
    image: ${DOCKER_LOGIN}/cpptest2:${TAG}

networks:
  #disable default network
  default:
    external: true
    name: none
  #define custom external network
  app-network:
    external: true

When the container projects are created from different templates, e.g. dotnetConsole and cppConsole, dotnetConsole and cppQML, or cppConsole and cppQML, the build processes seem sufficiently different that the build threads don’t sync up, and most debug sessions start successfully. We still hit the ambiguity error from time to time, but rarely.

Another race condition we found, but couldn’t find an elegant solution to, happens when two or more containers depend on a third container, e.g. two cppQML project containers that both depend on a weston container. In this scenario, the build thread that first tries to bring up its container detects that the third container is not running and tries to bring it up as well; shortly after, the second build thread tries to do the same and fails because the first thread is already doing it. The first container starts its debug session while the second fails.
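A crude workaround we can think of (not elegant either) is to bring up the shared container manually on the device before launching the debug sessions, so that neither build thread needs to create it; something like:

```shell
# start the shared weston service once, by hand, before launching
# the two debug sessions; "docker-compose up -d" is a no-op for a
# service that is already running, so the build threads then skip
# the conflicting create step (GPU is left empty here, matching
# the environment the extension passes over SSH)
GPU= docker-compose up -d weston
```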

Container project 1 task terminal output:

Pulling cpptest2-debug ... 
Pulling cpptest2-debug ... pulling from cpptest2-debug
Pulling cpptest2-debug ... digest: sha256:dde35d45d153e7607e...
Pulling cpptest2-debug ... status: image is up to date for 1...
Pulling cpptest2-debug ... done
 *  Terminal will be reused by tasks, press any key to close it. 

 *  Executing task in folder cpptest2: sshpass -p ***** ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no torizon@xxx.xxx.xxx.xxx LOCAL_REGISTRY=xxx.xxx.xxx.xxx TAG=arm64 GPU= docker-compose up -d cpptest2-debug

Container project 2 task terminal output:

Creating torizon_weston_1 ... 
Creating torizon_weston_1 ... error

ERROR: for torizon_weston_1  Cannot create container for service weston: Conflict. The container name "/torizon_weston_1" is already in use by container "29bdd092c526df476c269826822a299236bd37b9aa304275ad5f9de0a287106c". You have to remove (or rename) that container to be able to reuse that name.

ERROR: for weston  Cannot create container for service weston: Conflict. The container name "/torizon_weston_1" is already in use by container "29bdd092c526df476c269826822a299236bd37b9aa304275ad5f9de0a287106c". You have to remove (or rename) that container to be able to reuse that name.
Encountered errors while bringing up the project.

 *  The terminal process "sshpass '-p', '*****', 'ssh', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', 'torizon@xxx.xxx.xxx.xxx', 'LOCAL_REGISTRY=xxx.xxx.xxx.xxx TAG=arm64 GPU= docker-compose up -d cpptest1-debug'" terminated with exit code: 1. 

The corresponding docker-compose.yml file is as follows:

version: "3.9"
services:
  cpptest1-debug:
    image: ${LOCAL_REGISTRY}:5002/cpptest1-debug:${TAG}
    ports:
      - 2231:2231
    # connect container 1 to custom external network
    networks:
      - app-network
    depends_on:
      - weston
  cpptest1:
    image: ${DOCKER_LOGIN}/cpptest1:${TAG}
  cpptest2-debug:
    image: ${LOCAL_REGISTRY}:5002/cpptest2-debug:${TAG}
    ports:
      - 2232:2232
    # connect container 2 to custom external network
    networks:
      - app-network
    depends_on:
      - weston
  cpptest2:
    image: ${DOCKER_LOGIN}/cpptest2:${TAG}
  weston:
    image: torizon/weston${GPU}:2
    environment:
      - ACCEPT_FSL_EULA=1
    network_mode: host
    volumes:
      - type: bind
        source: /tmp
        target: /tmp
      - type: bind
        source: /dev
        target: /dev
      - type: bind
        source: /run/udev
        target: /run/udev
    cap_add:
      - CAP_SYS_TTY_CONFIG
    device_cgroup_rules:
      - c 4:0 rmw
      - c 4:1 rmw
      - c 4:7 rmw
      - c 13:* rmw
      - c 199:* rmw
      - c 226:* rmw

networks:
  #disable default network
  default:
    external: true
    name: none
  #define custom external network
  app-network:
    external: true

For this test we are using two cppConsole container projects because they were easier to set up, but we get the same error when using two cppQML container projects.

In this case, raising the wait_sync setting of one of the container projects from 3 seconds to 7 seconds solved the problem but, as we said, it doesn’t seem like the most elegant solution.

Do you have any suggestions?

Best regards.

Greetings @mmarcos.sensor,

Thank you for the very detailed investigation and report. On initial review, your analysis of why these issues occur seems to be more or less correct.

That said, we’ve gone ahead and brought this to the attention of our IDE extensions team for further analysis. We’ll let you know if we have further questions or updates regarding this.

Best Regards,
Jeremias