Offline Update with secondaries

Environment:
TorizonCore 5.7.0,
image built with torizoncore-builder,
a set of Docker images,
offline update,
offline pre-provisioning

Description:
We perform a synchronous update covering both the filesystem and the Docker images.
Our docker-compose relies on resources provided by the system (i.e. by the ostree deployment).
Certain devices must exist when docker-compose is started.
If they do not exist, docker-compose does not start properly and some containers stay in the CREATING state longer than greenboot expects.
In the meantime docker image prune is executed, and images that are not used by a running container are removed.
After some time greenboot decides that the update was not successful and a revert is attempted. Sadly, some Docker images/layers have already been pruned (as they were not in use at that moment).
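
For illustration, a minimal sketch of this kind of dependency; the service and device names below are hypothetical, not the ones from our real compose file:

    services:
      some_service:
        image: OUR-REPO/BRANCH/some_service:latest
        devices:
          # hypothetical device node that only exists once the new ostree
          # deployment (kernel + device tree overlay) is active
          - "/dev/some_device:/dev/some_device"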

Result:
System is stuck in a non-operating state as some images are incomplete or missing.

Hi @marek.kucinski ,

Does this happen every time you try to do a synchronous offline update with this docker compose in particular? Can you post the logs during the offline update process, i.e. the output of this command during the Docker images update:

journalctl -f -u aktualizr*

Result:
System is stuck in a non-operating state as some images are incomplete or missing.

Does the TorizonCore itself become inoperable or can you restore the Docker images by removing then pulling/loading them again?

Best regards,
Lucas Akira

Does this happen every time you try to do a synchronous offline update with this docker compose in particular?

We tried a few times and failed. We had a tight schedule, so we used the following workaround (roughly sketched after the list):

  1. aktualizr ostree update
  2. manual docker load of all containers
  3. swap the docker-compose.
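
Roughly, steps 2 and 3 looked like the sketch below; the archive path and the location of the active compose file are placeholders, not the exact ones we used:

    # load every pre-exported container image archive from the bundle (placeholder path)
    for archive in /path/to/bundle/*.tar; do
        docker load -i "$archive"
    done

    # swap in the new compose file and restart the stack
    # (the actual location of the active compose file depends on the setup)
    cp docker-compose.new.yml /path/to/active/docker-compose.yml
    docker-compose -f /path/to/active/docker-compose.yml up -d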

Can you post the logs during the offline update process

We will retry this scenario and gather the logs.

Does the TorizonCore itself become inoperable or can you restore the Docker images by removing then pulling/loading them again?

TorizonCore itself is alive. However, we lost the functionality of the device because the Docker images were pruned. After manually loading all images and restarting docker-compose, it became operational again.

Hello @marek.kucinski,

How did you create your custom image bundled with the Docker containers? Did you bundle it using the TorizonCore Builder tool as explained here: Pre-provisioning Docker Containers onto a TorizonCore image | Toradex Developer Center

Could you please try it with the following workflow:

  1. Create your bundle directory using the command torizoncore-builder bundle --platform=linux/arm/v7 docker-compose.yml --bundle-directory bundle

  2. Add the bundle/docker-compose.yml into the output section of your tcbuild.yaml as follows:

    output:
      easy-installer:
        local: MY-CUSTOM-BUNDLED-IMAGE
        bundle:
          compose-file: bundle/docker-compose.yml
    
  3. Build the image using torizoncore-builder build command

  4. Sign and push the image to the platform: Signing and Pushing TorizonCore Packages to Torizon Platform Services | Toradex Developer Center

  5. Configure device for offline updates, create and download the lockbox and perform the offline update: How to Use Secure Offline Updates with TorizonCore | Toradex Developer Center

Could you please confirm if these are the exact same steps that you have followed? If not, please give it a try and let us know if that solves your issue.
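
For reference, a minimal tcbuild.yaml covering steps 2 and 3 above could look like the sketch below; the input image name, the changes directory, and the output image name are only examples:

    input:
      easy-installer:
        # base TorizonCore Easy Installer image (example file name)
        local: torizon-core-docker-apalis-imx6-Tezi_5.7.0+build.17.tar
    customization:
      filesystem:
        # directory with local filesystem changes to merge into the image (example)
        - changes/
    output:
      easy-installer:
        local: MY-CUSTOM-BUNDLED-IMAGE
        bundle:
          compose-file: bundle/docker-compose.yml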

hi @rudhi.tx
We use the TorizonCore Builder tool, but we run the separate steps instead of the “build” command. There are some issues related to torizoncore-builder support in highly restricted corporate environments, which is why we had to extend the tool's functionality for our needs. You can find my other topics where I explain the details of our challenges and how we solved them.

Please find our build script below. Of course, the applications and containers are built beforehand.

#!/bin/bash

set -euo pipefail
### CONFIG

echo "====== Preparing environment" 
source prepareEnvironment.sh
echo "====== Setting components versions" 
source .env

### to properly handle alias created by prepareEnvironment.sh
shopt -s expand_aliases

# build the base image
#torizoncore-builder build --force --set IMAGE_NAME="$IMAGE_NAME" --set IMAGE_DESC="$IMAGE_DESC"
echo "====== Pre-build cleanup"
rm -rf rootimage out output bundle output_provisioned tuf update

#unpack image
echo "====== TorizonCore-Builder: download base image"
torizoncore-builder images --remove-storage unpack torizon-core-docker-apalis-imx6-Tezi_5.7.0+build.17.tar

#device tree from Toradex
echo "====== TorizonCore-Builder: apply DeviceTree"
torizoncore-builder dt checkout
#device tree overlays -ours
torizoncore-builder dt apply --include-dir device-trees/include/ device-trees/dts-arm32/imx6q-apalis-eval-adm-hw21.dts
torizoncore-builder dto apply device-trees/overlays/apalis-imx6_adm_hw21_overlay.dts

#merge modifications
echo "====== TorizonCore-Builder: union customizations"
torizoncore-builder union custom-branch --changes-directory changes1 --changes-directory changes2

#deploy the customized image (without containers) to the local directory
echo "====== TorizonCore-Builder: deploy image locally"
torizoncore-builder deploy custom-branch --output-directory rootimage/

#bundle the containers
echo "====== TorizonCore-Builder: bundle containers"
torizoncore-builder bundle --force --platform linux/arm/v7 --registry-certs="$(pwd)/certs.d" docker-compose.yml

echo "====== TorizonCore-Builder: combine local image with containers"
#combine base image with containers
torizoncore-builder combine rootimage/ output --image-name "$IMAGE_NAME" --image-description "$IMAGE_DESC" #\
#--image-accept-licence --image-autoinstall

echo "====== TorizonCore-Builder: provisioning images"
torizoncore-builder images provision --mode offline --force --shared-data shared-data.tar.gz output output_provisioned

# make the image name safe for use in a file name (replace non-alphanumeric characters with "_")
SANE_IMAGE_NAME="${IMAGE_NAME//[^[:alnum:]]/_}"

echo "====== Compressing output directory"
mkdir -p out
tar czvf "out/$SANE_IMAGE_NAME.tar.gz" output_provisioned

#echo "====== Push to TorizonPlatform"
cp output_provisioned/docker-compose.yml docker-compose.lock.yml
torizoncore-builder platform push --credentials credentials.zip docker-compose.lock.yml
torizoncore-builder platform push --credentials credentials.zip custom-branch

#LOCKBOX_NAME=$1
#torizoncore-builder platform lockbox --credentials credentials.zip --registry-certs $(pwd)/certs.d $LOCKBOX_NAME

Hi @marek.kucinski,

Thanks for sending us these details. I’m checking internally to gather more information about what could be causing this issue. I will get back to you as soon as I have an answer.

Thanks for your patience.

No problem, we have another task to be finished in the meantime.

I would love to get back to using the torizoncore-builder build command as soon as these requests are implemented:
TCB-300
TCB-299
TCB-119

In the meantime, I have another question regarding the lockbox that you are creating: I see that you push both the docker-compose.lock.yml and the custom-branch, which includes the filesystem changes. From which of these packages are you creating the lockbox on the platform? I assume it is the docker-compose file. Could you please confirm?

At the beginning, when we tried synchronous updates (ostree + containers), we included both. Please see the screenshot of such a lockbox below.

[screenshot of the lockbox]

When we ran into issues with the synchronous updates, we switched to doing only an ostree update via aktualizr-uptane, followed later by a manual docker load of all containers and a swap of the docker-compose.yml file.

Hi @marek.kucinski ,

Can you send the logs during the offline update?

And also, can you send the docker-compose.yml file related to your issue? If you can’t share it, can you at least send a similar Docker compose file that has the same problem so that we can try to reproduce it here?

Best regards,
Lucas Akira

Hi,
sorry for the delay - I had to reinstantiate our test setup (we had moved forward with the discrete approach).

I only have part of the logs (log_level=3, not the complete log after reboot). I will share more later.

Before restart
docker-compose (target) - anonymized, only relevant data left:

services:
    external_communication:
      container_name: external_communication
      image: OUR-REPO/BRANCH/external_communication@sha256:4b13db098a3e2026c89ef135480000e2b0077b4ba18614e7373562a94e23dacd
      ...
    historical_data_center:
      container_name: historical_data_center
      image: OUR-REPO/BRANCH/historical_data_center@sha256:246c1ab2b4258627199d940fff1cb4690ad270626d657a3d0cd3f484ddff337c
      ...
    hmi_mgr:
      container_name: hmi_mgr
      image: OUR-REPO/BRANCH/hmi_mgr@sha256:a4021c03f133c411c18985b03881304a4d92273e72bf28c6a5a03540c39e0484
      ...
    init:
      container_name: init
      image: OUR-REPO/BRANCH/init@sha256:58cf1844df494523011a2108d0916fe620dc7a3560c8f4688e851c33ae0f4366
      ...
    internal_communication:
      container_name: internal_communication
      image: OUR-REPO/BRANCH/internal_communication@sha256:08e6087a15139abe0fb8f5cf752277aaeed92fd498d04a6eb18b120a735b0d38
      ...
    local_maintenance:
      container_name: local_maintenance
      image: OUR-REPO/BRANCH/local_maintenance@sha256:4a688869d581b2d53188e53ae73d2bc4842f9c38906cebe2cc99dea5799e22a4
      ...
    meerkat:
      container_name: meerkat
      image: OUR-REPO/adm/meerkat@sha256:d3f4d3fc709a76894daaf2eca82179623c68fc1c507ee79160d2c9221574f20d
      ...
    unimp_center:
      container_name: unimp_center
      image: OUR-REPO/BRANCH/unimp_center@sha256:f5e60e30f0e9fa30ad61917a38519cf1df35b9921ecc3fd21d6795e1ad6092b6
      ...
version: '2.4'
volumes:
  ipcvolume: {}

After plugging in the lockbox:

Jan 01 00:37:03 apalis-imx6-10771760 aktualizr-torizon[7380]: Current version for ECU ID: cdd0fe2f362fe6d02af533ab4809d2b1ae5e63ba845ed9c4a93b11195a577301 is   unknown
Jan 01 00:37:03 apalis-imx6-10771760 aktualizr-torizon[7380]: Current version for ECU ID: 448efe58e98d881ae36968d1a124476bcb6c37f91bc3fad0823b93c1c0e0c57b is  unknown
Jan 01 00:37:03 apalis-imx6-10771760 aktualizr-torizon[7380]: Current version for ECU ID: cdd0fe2f362fe6d02af533ab4809d2b1ae5e63ba845ed9c4a93b11195a577301 is unknown
Jan 01 00:37:03 apalis-imx6-10771760 aktualizr-torizon[7380]: Current version for ECU ID: 448efe58e98d881ae36968d1a124476bcb6c37f91bc3fad0823b93c1c0e0c57b is unknown
Jan 01 00:37:07 apalis-imx6-10771760 aktualizr-torizon[7380]: Unable to read filesystem statistics: error code -1
Jan 01 00:37:07 apalis-imx6-10771760 aktualizr-torizon[7380]: Current version for ECU ID: cdd0fe2f362fe6d02af533ab4809d2b1ae5e63ba845ed9c4a93b11195a577301 is  unknown
Jan 01 00:37:07 apalis-imx6-10771760 aktualizr-torizon[7380]: Current version for ECU ID: 448efe58e98d881ae36968d1a124476bcb6c37f91bc3fad0823b93c1c0e0c57b is unknown
Jan 01 00:37:18 apalis-imx6-10771760 aktualizr-torizon[7380]: Copying /etc changes: 5 modified, 1 removed, 11 added
Jan 01 00:37:22 apalis-imx6-10771760 aktualizr-torizon[7380]: Transaction complete; bootconfig swap: yes; deployment count change: 1
[ 2343.131275] imx2-wdt 20bc000.wdog: Device shutdown: Expect reboot!
[ 2343.137627] reboot: Restarting system

After restart:
journalctl -u aktualizr*

Jan 01 00:00:47 apalis-imx6-10771760 aktualizr-torizon[2191]: untagged: OUR-REPO/historical_data_center:digest_sha256_246c1ab2b4258627199d940fff1cb4690ad270626d657a3d0cd3f484ddff337c
......
Jan 01 00:00:47 apalis-imx6-10771760 aktualizr-torizon[2191]: untagged: OUR-REPO/external_communication:digest_sha256_4b13db098a3e2026c89ef135480000e2b0077b4ba18614e7373562a94e23dacd

...
Jan 01 00:00:47 apalis-imx6-10771760 aktualizr-torizon[2191]: Total reclaimed space: 5.686MB
Jan 01 00:00:48 apalis-imx6-10771760 aktualizr-torizon[1075]: Current versions in storage and reported by OSTree do not match
Jan 01 00:00:48 apalis-imx6-10771760 aktualizr-torizon[1075]: No ECU version report counter, please check the database!
Jan 01 00:00:48 apalis-imx6-10771760 aktualizr-torizon[1075]: Failed to store Secondary manifest: no more rows available
Jan 01 00:00:48 apalis-imx6-10771760 aktualizr-torizon[1075]: curl error 3 (http code 0): URL using bad/illegal format or missing URL

systemctl status green*

● greenboot-task-runner.service - greenboot Success Scripts Runner
     Loaded: loaded (/usr/lib/systemd/system/greenboot-task-runner.service; enabled; vendor preset: enabled)
     Active: active (exited) since Thu 1970-01-01 00:00:22 UTC; 1min 50s ago
    Process: 1076 ExecStart=/usr/libexec/greenboot/greenboot green (code=exited, status=0/SUCCESS)
   Main PID: 1076 (code=exited, status=0/SUCCESS)

Jan 01 00:00:22 apalis-imx6-10771760 systemd[1]: Starting greenboot Success Scripts Runner...
Jan 01 00:00:22 apalis-imx6-10771760 greenboot[1076]: Boot Status is GREEN - Health Check SUCCESS
Jan 01 00:00:22 apalis-imx6-10771760 greenboot[1076]: Running Green Scripts...
Jan 01 00:00:22 apalis-imx6-10771760 greenboot[1076]: Script '00_cleanup_uboot_vars.sh' SUCCESS
Jan 01 00:00:22 apalis-imx6-10771760 greenboot[1076]: Script '01_log_rollback_info.sh' SUCCESS
Jan 01 00:00:22 apalis-imx6-10771760 systemd[1]: Started greenboot Success Scripts Runner.

● greenboot.target - Generic green boot target
     Loaded: loaded (/usr/lib/systemd/system/greenboot.target; enabled; vendor preset: enabled)
     Active: active since Thu 1970-01-01 00:00:22 UTC; 1min 51s ago

Jan 01 00:00:22 apalis-imx6-10771760 systemd[1]: Reached target Generic green boot target.

● greenboot-status.service - greenboot MotD Generator
     Loaded: loaded (/usr/lib/systemd/system/greenboot-status.service; enabled; vendor preset: enabled)
     Active: active (exited) since Thu 1970-01-01 00:00:22 UTC; 1min 50s ago
    Process: 1103 ExecStart=/usr/libexec/greenboot/greenboot-status (code=exited, status=0/SUCCESS)
   Main PID: 1103 (code=exited, status=0/SUCCESS)

Jan 01 00:00:22 apalis-imx6-10771760 systemd[1]: Starting greenboot MotD Generator...
Jan 01 00:00:22 apalis-imx6-10771760 greenboot-status[1111]: Boot Status is GREEN - Health Check SUCCESS
Jan 01 00:00:22 apalis-imx6-10771760 systemd[1]: Started greenboot MotD Generator.

● greenboot-healthcheck.service - greenboot Health Checks Runner
     Loaded: loaded (/usr/lib/systemd/system/greenboot-healthcheck.service; enabled; vendor preset: enabled)
     Active: active (exited) since Thu 1970-01-01 00:00:11 UTC; 2min 1s ago
    Process: 596 ExecStart=/usr/libexec/greenboot/greenboot check (code=exited, status=0/SUCCESS)
   Main PID: 596 (code=exited, status=0/SUCCESS)

Jan 01 00:00:11 apalis-imx6-10771760 systemd[1]: Starting greenboot Health Checks Runner...
Jan 01 00:00:11 apalis-imx6-10771760 greenboot[596]: Running Required Health Check Scripts...
Jan 01 00:00:11 apalis-imx6-10771760 greenboot[596]: Running Wanted Health Check Scripts...
Jan 01 00:00:11 apalis-imx6-10771760 systemd[1]: Started greenboot Health Checks Runner.

Hi @marek.kucinski ,

We recently made some bug fixes on Offline Updates, including one where the containers don’t load correctly after a synchronous update. Can you install the latest TorizonCore 6 nightly version to see if this issue persists?

I also suggest verifying if this problem happens when doing an Offline Update with only the container images, without including an OS update.

Best regards,
Lucas Akira

Hello @marek.kucinski ,

Have you already tested the offline updates on the TorizonCore 6 nightly version as @lucas_a.tx suggested? Do you have any updates?

Hello @rudhi.tx
I made an attempt at this, but it turned out to be more complicated than expected (see below). I had to postpone it for later diagnosis.

  1. a move to TC 6.x is needed,
  2. TC6 recommends using Debian Bookworm based containers,
  3. no migration path from 5.7.0 to 6.x has been prepared.

Therefore I need more time to test it.