Local Docker Registry stops working after running torizoncore-builder

Hi,

I have set up a local Docker registry to use in conjunction with torizoncore-builder (tcb from here onwards).

I am running in WSL2 with a Windows 10 host.

The local registry works fine and I can push/pull images to it without issue.

~/tcbworkdir/dockers/custom_weston> docker build -t localhost:5000/custom_weston .

~/tcbworkdir/dockers/custom_weston> docker push localhost:5000/custom_weston
Using default tag: latest
The push refers to repository [localhost:5000/custom_weston]
c910081f00de: Pushed
b48145e02449: Pushed
latest: digest: sha256:2e34ecf654547ef2b2e641476e8523f44fdfcab6b92ba1c20ad1d318fb687f3f size: 5536

However after I run: (or any other tcb commands)

~/tcbworkdir> source tcb-env-setup.sh
~/tcbworkdir> tcb build --force

this happens:

~/tcbworkdir> docker push localhost:5000/custom_weston
Using default tag: latest
The push refers to repository [localhost:5000/custom_weston]
Get "http://localhost:5000/v2/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Something similar happens with other containers that have ports bound.
I use an nginx container to serve the Torizon images for use with Easy Installer.

docker run --rm -d -p 4321:80 --name web -v ~/easy_install_images:/usr/share/nginx/html nginx

Before running tcb I can access localhost:4321 and browse the images.
After running tcb, localhost:4321 becomes unreachable.
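
For reference, this is roughly how I check it from the WSL shell (same port as in the run command above):

curl -I http://localhost:4321/

Before tcb this returns a normal HTTP response from nginx; afterwards it just hangs until it times out.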

Is tcb messing with the port bindings somehow?

Thanks,
Jaume

Greetings @jcabecerans,

This sounds like an odd bug. As far as I know, the only command that really interacts with Docker registries is the bundle command. Technically the build command does as well, but it uses the same logic as the bundle command. With this command the only thing tcb does is send some pull requests to the registry to pull down the necessary container images. Given this, I can't imagine it does anything to mess with the ports in any way.

But first let's try to analyze the issue. First of all, it might be important to know how exactly you are setting up this local registry. Second, since you're using the tcb build command here, what is the *.yaml file that you are passing to the command?

Once I have this info I can try to reproduce the issue on my side and take a closer look at it on my setup.

Best Regards,
Jeremias

Hi @jeremias.tx,

I have narrowed it down, and you are right, it is not every command; it is just the bundle command which causes this effect. The build command does the same, but only when there is a "bundle: compose-file" entry in the yaml, which I am guessing internally calls the bundle command.

I think I am setting up the local registry the standard way:

docker run -d -p 5000:5000 --restart=always --name registry registry:2
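
Before running tcb I can also talk to it over the registry HTTP API, for example (the output below is roughly what I get after the push above):

curl http://localhost:5000/v2/_catalog
# {"repositories":["custom_weston"]}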

This is the bundle command that I use:
(jcabecerans/test_cvc2 is private, hence the login)

tcb bundle dockers/docker-compose.yml  --bundle-directory bundle --login "<user>" "<passw>" --force --platform=linux/arm64

tcb yaml input file:

input:
  easy-installer:
    local: images/torizon-core-docker-verdin-imx8mm-Tezi_5.4.0+build.10.tar
customization:
  splash-screen: splash.png
  kernel:
    modules:
      - source-dir: modules/ili/
        autoload: yes
  device-tree:
    include-dirs:
      - device-trees/include/
      - device-trees/dts-arm64/
    custom: device-trees/dts-arm64/imx8mm-verdin-wifi-dev.dts
    overlays:
      add:
        - device-trees/overlays/verdin-imx8mm_agramkow_overlay_5.4.dts
output:
  easy-installer:
    local: custom_agramkow_5_4
    bundle:
      #compose-file: dockers/docker-compose.yml
      dir: bundle

Docker compose file:

version: '2.4'
# docker-compose.yml
services:
  production_tester:
    depends_on: []
    devices:
    - /dev/spidev1.0
    - /dev/verdin-adc1
    - /dev/verdin-adc2
    - /dev/verdin-adc3
    - /dev/verdin-adc4
    - /dev/gpiochip0
    - /dev/gpiochip1
    - /dev/gpiochip2
    - /dev/gpiochip3
    - /dev/gpiochip4
    - /dev/gpiochip5
    image: jcabecerans/test_cvc2:latest
    ports:
    - 1234:1234
    privileged: 'True'
    volumes:
    - /dev:/dev-host:rw
    
  weston:
    # image: torizon/weston-vivante:2
    # image: localhost:5000/custom_weston
    image: jcabecerans/custom_weston:latest
    # Accept the EULA required to run imx8 vivante graphic drivers
    environment:
     - ACCEPT_FSL_EULA=1
    # Required to get udev events from host udevd via netlink
    network_mode: host
    volumes:
      - type: bind
        source: /tmp
        target: /tmp
      - type: bind
        source: /dev
        target: /dev
      - type: bind
        source: /run/udev
        target: /run/udev
    cap_add:
      - CAP_SYS_TTY_CONFIG
    # Add device access rights through cgroup...
    device_cgroup_rules:
      # ... for tty0
      - 'c 4:0 rmw'
      # ... for tty7
      - 'c 4:7 rmw'
      # ... for /dev/input devices
      - 'c 13:* rmw'
      - 'c 199:* rmw'
      # ... for /dev/dri devices
      - 'c 226:* rmw'
    command: --developer weston-launch --tty=/dev/tty7 --user=torizon
    healthcheck:
        test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
        interval: 5s
        timeout: 4s
        retries: 6
        start_period: 10s

  kiosk:
    image: torizon/kiosk-mode-browser:2
    security_opt:
      - seccomp:unconfined
    command: --browser-mode http://www.toradex.com
    shm_size: '256mb'
    device_cgroup_rules:
      # ... for /dev/dri devices
      - 'c 226:* rmw'
    volumes:
      - type: bind
        source: /tmp
        target: /tmp
      - type: bind
        source: /var/run/dbus
        target: /var/run/dbus
      - type: bind
        source: /dev/dri
        target: /dev/dri
    depends_on:
      weston:
        condition: service_healthy

I originally detected this issue when working with the nginx container. I didn't post anything back then and just restarted my PC, and nginx would work again (until the next tcb bundle/build).

However, now I am trying to use a local Docker registry to build the bundle, but the moment I run tcb bundle the local registry becomes inaccessible, so there is no real way around it.

Thanks,
Jaume

Okay, I was able to reproduce this. However, I'm still not sure of the root cause. I tried reproducing on a Windows machine and then on a Linux machine, and I was only able to reproduce it on the Windows machine.

I don't think tcb is causing the issue, at least not directly. Since the issue only occurs on Windows, there must be some Windows-specific component to the bug. Even stranger is that when I start a new registry container it is still inaccessible, meaning the issue isn't just with that instance of the registry; somehow it persists beyond it.

Restarting Docker Desktop at least seems to restore usability of the registry, so it seems related to Docker Desktop, perhaps its networking. If I check the logs of the registry container when it's inaccessible, I don't see any of my requests to the registry getting through.
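
(For reference, this is roughly how I check it, using the container name from your run command:)

docker logs -f registry
curl -v http://localhost:5000/v2/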

This is a really puzzling issue. It will take more time to investigate, especially since we don't have too many Windows-based developers. Please let me know if you figure out anything on your side.

Best Regards,
Jeremias

Hi @jeremias.tx,

Good to know it is not only me.
I will let you know if I stumble upon a solution.
I will use remote registries for the time being then.

I don't think tcb is causing the issue, at least not directly. Since the issue only occurs on Windows, there must be some Windows-specific component to the bug. Even stranger is that when I start a new registry container it is still inaccessible, meaning the issue isn't just with that instance of the registry; somehow it persists beyond it.

I am not a Docker expert by any means. I could be completely off, but my intuition is telling me …

Given that:

  • The containers that use ports are inaccessible.
  • Those containers keep running and detect no fault.
  • It is possible to start new containers.
  • This behaviour happens not only with the registry but also with nginx.

To me it just looks like tcb hijacks the communication “bus” and the other containers are left with a severed connection to the outside world.

(taken from tcb-env-setup.sh)

alias torizoncore-builder='docker run --rm -it'"$volumes"'-v $(pwd):/workdir -v '"$storage"':/storage --net=host -v /var/run/docker.sock:/var/run/docker.sock torizon/torizoncore-builder:'"$chosen_tag"

This line looks especially suspicious to me:

-v /var/run/docker.sock:/var/run/docker.sock

Thanks,
Jaume

This line looks especially suspicious to me:

Bind-mounting the Docker socket is a common way to allow a container to communicate with the host Docker daemon. (Docker Tips: about /var/run/docker.sock | by Luc Juggery | Better Programming)

We use it because the bundle command starts what's called a "docker in docker" (dind) container, so that we have an isolated environment to pull and bundle the containers. Anyway, it would be odd if this were what causes the issue on Windows, since it's a fairly common method.
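
Just to illustrate the general idea (this is only the common socket-mount pattern, not necessarily exactly what tcb does internally): a container that has the socket bind-mounted can drive the host's Docker daemon. For example, the following lists the host's containers from inside a container using the official Docker CLI image:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:cli docker ps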

Do you happen to know if any other containers are affected negatively other than the local registry container? I’m curious if this is a general issue or just a specific interaction between tcb and the local registry container.

Best Regards,
Jeremias

Hi @jeremias.tx,

We use it because the bundle command starts what's called a "docker in docker" (dind) container, so that we have an isolated environment to pull and bundle the containers. Anyway, it would be odd if this were what causes the issue on Windows, since it's a fairly common method.

Oh, very interesting, nice to learn new things :slight_smile:

Do you happen to know if any other containers are affected negatively other than the local registry container? I’m curious if this is a general issue or just a specific interaction between tcb and the local registry container.

I have tried a couple of extra containers that use ports and they all show the same symptoms: they keep running but the port is not accessible.

Thanks,
Jaume

Based on your observations, it seems the problem is narrowed down to container ports. This matches what I observed previously with the registry container: when it was in the "not working" state, its logs showed that it was no longer receiving any communication. This lines up with your observation of ports not working.

Possibly the container networking stack on Windows is somehow affected?

Unfortunately, there are still no good answers on the root cause. This will probably need further investigation by our team here at Toradex.

For the time being I would suggest an alternative solution. Perhaps try setting up a container registry that doesn't rely on the local registry container. I can't guarantee when we'd be able to fix this on our side, so an alternative solution may be necessary to keep you from being blocked for too long.

Best Regards,
Jeremias

Hi @jeremias.tx,

Alright, thanks for your support!
Please let me know if/when there is a fix :slight_smile:

For the time being I would suggest an alternative solution. Perhaps try setting up a container registry that doesn't rely on the local registry container. I can't guarantee when we'd be able to fix this on our side, so an alternative solution may be necessary to keep you from being blocked for too long.

I'll give it a try. I was thinking of running the Docker registry on another PC on the local network; that should also work.
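
Something along these lines, assuming the other PC is reachable at, say, 192.168.1.10 (that address is just an example) and that the clients either use TLS or list the registry under "insecure-registries" in their Docker daemon settings:

# on the other PC
docker run -d -p 5000:5000 --restart=always --name registry registry:2

# on the WSL/Docker Desktop side, the daemon config would need something like:
# { "insecure-registries": ["192.168.1.10:5000"] }

# then retag and push from here
docker tag localhost:5000/custom_weston 192.168.1.10:5000/custom_weston
docker push 192.168.1.10:5000/custom_weston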

Thanks,
Jaume

Hello @jeremias.tx, @jcabecerans,

Do you have any update on this topic? My colleagues are struggling with the same problem - they cannot use tcb with a local registry because of this error.

Best Regards,
Kacper

Hi @Kacper,

This issue hasn't been tackled yet, since it's a rather niche issue among our customer base at the moment. I'd suggest alternative registries, since I can't guarantee when this will be looked at.

Best Regards,
Jeremias

Hi @Kacper!

During the investigation, our team could not reproduce the issue anymore.

Using the setup script to set up TorizonCore Builder on WSL, after executing the bundle command it was still possible to interact with the registry via curl.

Could you please check on your side if this issue is gone?

Best regards,
Henrique