Torizon OTA: constant TLS handshake timeout error

RoccoBr · July 4, 2022, 1:34pm

hello,
I’m trying to update some devices in the field through app.torizon, but they all fail in the same way, with a TLS handshake timeout error.

here is the aktualizr log:

Jul 04 09:27:16 CB20300050 aktualizr[290808]: Use existing SQL storage: "/var/sota/sql.db"
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Initializing docker-compose Secondaries...
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Adding Secondary with ECU serial: 81e8c96b459631898cab81789f0f9a11195cc04f51aca5ea692d37bc52e01a01 with hardware ID: docker-compose
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Primary ECU serial: c79b565b406f5fdfc1fc4d3b269e5aee4fe5a3c0ef5b50503696764a348d9381 with hardware ID: verdin-imx8mm
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Device ID: d68e5b52-f639-40ec-b038-f9649002e0c8
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Device Gateway URL: https://ota-ce.torizon.io
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Certificate subject: CN=d68e5b52-f639-40ec-b038-f9649002e0c8
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Certificate issuer: CN=ota-devices-CA
Jul 04 09:27:16 CB20300050 aktualizr[290808]: Certificate valid from: Sep  2 13:29:56 2021 GMT until: Aug  9 13:29:56 2120 GMT
Jul 04 09:27:18 CB20300050 aktualizr[290808]: Event: SendDeviceDataComplete
Jul 04 09:27:24 CB20300050 aktualizr[290808]: New updates found in Director metadata. Checking Image repo metadata...
Jul 04 09:27:26 CB20300050 aktualizr[290808]: 1 new update found in both Director and Image repo metadata.
Jul 04 09:27:26 CB20300050 aktualizr[290808]: Event: UpdateCheckComplete
Jul 04 09:27:26 CB20300050 aktualizr[290808]: Update available. Acquiring the update lock...
Jul 04 09:27:32 CB20300050 aktualizr[290808]: Event: DownloadProgressReport
Jul 04 09:27:32 CB20300050 aktualizr[290808]: Event: DownloadTargetComplete
Jul 04 09:27:32 CB20300050 aktualizr[290808]: Event: AllDownloadsComplete
Jul 04 09:27:32 CB20300050 aktualizr[290808]: Waiting for Secondaries to connect to start installation...
Jul 04 09:27:33 CB20300050 aktualizr[290808]: No update to install on Primary
Jul 04 09:27:33 CB20300050 aktualizr[290808]: Event: InstallStarted
Jul 04 09:27:33 CB20300050 aktualizr[290808]: Updating containers via docker-compose
Jul 04 09:27:33 CB20300050 aktualizr[290808]: Running docker-compose pull
Jul 04 09:27:33 CB20300050 aktualizr[290808]: Running command: /usr/bin/docker-compose --file /var/sota/storage/docker-compose/docker-compose.yml.tmp pull --no-parallel
Jul 04 09:27:36 CB20300050 aktualizr[291062]: Pulling v3_device_application (cloudcycleltd/v3_device_application@sha256:bc9a98acc7bd5ed10e4d0504c13ee9329184f276ab837d24d24ff9335f3de221)...
Jul 04 09:27:50 CB20300050 aktualizr[291062]: sha256:bc9a98acc7bd5ed10e4d0504c13ee9329184f276ab837d24d24ff9335f3de221: Pulling from cloudcycleltd/v3_device_application
Jul 04 09:28:09 CB20300050 aktualizr[291062]: error pulling image configuration: Get "https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/53/536b4498eb83bad4d9367edbe3e6855c930fb0e7076e5ba403d4df0e3ffebebf/data?verify=1656929878-D%2B53i%2BfZ9JbxEaKD19AITeJ88UM%3D": net/http: TLS handshake timeout
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Error running docker-compose pull
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Rolling back container update
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Removing not used containers, networks and images
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Running command: docker system prune -a --force
Jul 04 09:28:10 CB20300050 aktualizr[291552]: Total reclaimed space: 0B
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Event: InstallTargetComplete
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Event: AllInstallsComplete
Jul 04 09:28:10 CB20300050 aktualizr[290808]: Update install completed. Releasing the update lock...
Jul 04 09:28:19 CB20300050 aktualizr[290808]: Event: PutManifestComplete

looking online I found this old post, I wonder if it’s related to this issue:

gclaudino.tx · July 4, 2022, 2:03pm

Hi @RoccoBr, how are you?

From your logs, it seems that the TLS error happens when trying to pull the image from the internet. How are your modules connected to the internet?

Also, which module and carrier board are you using? It seems that you have tagged Verdin iMX6 but this module does not exist.

Finally, does this happen with every module in the field? Has it also happened with a module in a more controlled debug environment?

Best regards,

RoccoBr · July 4, 2022, 2:48pm

hi,
sorry about the wrong tag. I’m using a Verdin iMX8M Mini with a custom carrier board.
the modules are connected to the internet through a modem. My first thought was bad connectivity, but I also get this error on devices that use the fast Cat M1 network and that I can easily reach over ssh, so it doesn’t appear to be a connectivity issue.

the problem is present on 8-10% of all devices deployed in the field, and I can also see the error on some devices on our test bench in the lab

Regards,
Rocco

gclaudino.tx · July 4, 2022, 4:06pm

Hi @RoccoBr,

Thanks for the explanation. Is the fast cat network also modem-based? Have you seen the same error using an ethernet connection, for instance?

RoccoBr · July 4, 2022, 4:08pm

hi,
yes, the Cat M1 connection is modem-based, and no, I don’t see the error with a wired ethernet connection

RoccoBr · July 5, 2022, 3:50pm

on other devices I’m getting another error : net/http: request canceled (Client.Timeout exceeded while awaiting headers)

looking online it seems that both errors are related to the way DNS names are resolved. I have checked my /etc/resolv.conf file and I can see two nameservers:
nameserver 10.80.4.10
nameserver 10.80.5.10
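
One way to check whether the time is lost in name resolution or in the TLS handshake itself is to time each stage separately on an affected device. The following is only a minimal sketch in Python, assuming Python 3 is available on the device or in a container. The `timed` and `probe` helpers are illustrative names, not part of any Torizon or Docker tooling, and the 10 second timeout is an arbitrary choice.

```python
import socket
import ssl
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds, error)."""
    start = time.monotonic()
    try:
        return fn(*args, **kwargs), time.monotonic() - start, None
    except OSError as exc:  # covers DNS errors, socket timeouts and ssl.SSLError
        return None, time.monotonic() - start, exc


def probe(host, port=443, timeout=10.0):
    """Time DNS lookup, TCP connect and TLS handshake as separate stages."""
    report = {"host": host}

    # Stage 1: name resolution through the system resolver (/etc/resolv.conf).
    addrs, report["dns_s"], err = timed(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM)
    if err:
        report["failed_stage"], report["error"] = "dns", repr(err)
        return report

    family, socktype, proto, _, sockaddr = addrs[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # Stage 2: plain TCP connection to the first resolved address.
        _, report["tcp_s"], err = timed(sock.connect, sockaddr)
        if err:
            report["failed_stage"], report["error"] = "tcp", repr(err)
            return report

        # Stage 3: TLS handshake on top of the established connection.
        ctx = ssl.create_default_context()
        tls, report["tls_s"], err = timed(
            ctx.wrap_socket, sock, server_hostname=host)
        if err:
            report["failed_stage"], report["error"] = "tls", repr(err)
            return report

        report["tls_version"] = tls.version()
        sock = tls  # make sure the wrapped socket is the one that gets closed
        return report
    finally:
        sock.close()
```

Calling `print(probe("production.cloudflare.docker.com"))` (the host that appears in the failing pull above) returns a dictionary with the per-stage timings, plus a `failed_stage` entry if one of the stages raised an error.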

I have tried to manually add the Google DNS servers to the resolv.conf file but, as reported online, that is not enough because the file is managed dynamically by the OS (TorizonCore in my case) through NetworkManager.

any suggestions?

gclaudino.tx · July 5, 2022, 4:16pm

Hi @RoccoBr,

Thanks for the update. Do you have the links from where you got this suggestion to change the DNS nameservers?

We’re investigating the issue internally in the meantime to see how we can help you.

Best regards,

RoccoBr · July 5, 2022, 5:02pm

gclaudino.tx · July 5, 2022, 5:20pm

Dear @RoccoBr,

If you need to change the DNS nameservers on Torizon, you may need to follow the procedure described in the Server Fault question "How to manage DNS in NetworkManager via console (nmcli)?". I just ran a test on a module using TorizonCore, and modifying the resolv.conf file this way worked on my side. Could you please give it a try?
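
For reference, the usual nmcli approach is to set the DNS servers on the connection profile, tell NetworkManager to ignore the servers received automatically, and then bring the connection up again. The Python sketch below only builds and (by default) prints those commands. It assumes the interface is managed by NetworkManager; the connection name `my-connection` is hypothetical, and the DNS addresses are just examples.

```python
import shlex
import subprocess


def nmcli_dns_commands(connection, servers):
    """Return the nmcli invocations that pin DNS servers on a connection profile."""
    dns = " ".join(servers)  # nmcli accepts a space-separated list for ipv4.dns
    return [
        ["nmcli", "connection", "modify", connection,
         "ipv4.dns", dns,
         "ipv4.ignore-auto-dns", "yes"],
        # Re-activate the profile so the new DNS settings are applied.
        ["nmcli", "connection", "up", connection],
    ]


def apply_dns(connection, servers, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in nmcli_dns_commands(connection, servers):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "my-connection" is a placeholder; `nmcli connection show` lists real names.
    apply_dns("my-connection", ["1.1.1.1", "8.8.8.8"])
```

With `dry_run=True` nothing is changed on the system, which makes it easy to review the commands before running them on a device.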

gclaudino.tx · July 14, 2022, 1:27pm

Dear @RoccoBr,

Do you have any news on this topic?

Best regards,

RoccoBr · July 14, 2022, 1:38pm

hi @gclaudino.tx ,
I was just doing some tests in the morning.

our modem interface is managed by neither NetworkManager nor ModemManager, because it needs to be controlled by our docker application, but I managed to add the Cloudflare DNS server (1.1.1.1) and the Google DNS server (8.8.8.8) with the resolvconf command. Although I saw the same error once, I eventually managed to update 3 out of 5 devices; I lost communication with the other two during the process.

so even though it might not be the definitive solution to the problem, adding those nameservers to the modem interface improves the situation.

do you know where the current nameservers in /etc/resolv.conf come from?
nameserver 10.80.4.10
nameserver 10.80.5.10
I can’t find them anywhere in my project, so I guess they are coming from the TorizonCore image

gclaudino.tx · July 14, 2022, 8:06pm

Dear @RoccoBr,

Thanks for the update. I don’t know exactly where these two configurations come from. We tested on a module with a fresh TorizonCore image and the file looked like this:

nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
# Too many DNS servers configured, the following entries may be ignored.
nameserver 8.8.4.4

We’ll check where this could be coming from.
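
The comment in that listing reflects the resolver's limit on how many nameserver entries it reads from resolv.conf. A small Python sketch, assuming the usual glibc limit of three (MAXNS) applies on the device, shows which entries from the file above would actually be consulted. The function name is illustrative only.

```python
MAXNS = 3  # glibc resolver limit on nameserver entries; assumed to apply here


def split_nameservers(resolv_conf_text, limit=MAXNS):
    """Return (used, ignored) nameserver lists from resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:limit], servers[limit:]


# Content of resolv.conf as posted above for the fresh TorizonCore image.
FRESH_IMAGE = """\
nameserver 1.1.1.1
nameserver 8.8.8.8
nameserver 1.0.0.1
# Too many DNS servers configured, the following entries may be ignored.
nameserver 8.8.4.4
"""

used, ignored = split_nameservers(FRESH_IMAGE)
print("used:", used)        # used: ['1.1.1.1', '8.8.8.8', '1.0.0.1']
print("ignored:", ignored)  # ignored: ['8.8.4.4']
```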

gclaudino.tx · July 26, 2022, 12:58pm

Dear @RoccoBr,

I couldn’t find a reason for the DNS nameservers you saw. Do you have any other news from your side regarding this topic?

Best regards,

RoccoBr · July 27, 2022, 1:33pm

hi @gclaudino.tx ,
next week I will deploy the new firmware release, which includes a possible fix for this issue, to all our fleets. I will let you know if I still have problems.

drew.tx · July 27, 2022, 3:27pm

@RoccoBr for what it’s worth, the resolv.conf file is usually populated by the DHCP server. I wonder if those DNS addresses were populated when you generated your image and just have not been modified. Are you using torizoncore-builder to generate your image?

Regarding the fix for the TLS issues, can you share details?

Drew

RoccoBr · August 15, 2022, 7:42am

Hello,
I have verified that my changes are effective against this issue.
I have updated the firmware of our fleets via OTA and the error is basically gone, with just a few retries on a small number of devices due to loss of connectivity.

my solution consists of modifying the Docker daemon configuration file, /usr/lib/docker/daemon.json, with the following settings:

set the maximum number of concurrent downloads to 1 (basically no concurrent downloads, just 1 layer at a time)
increase the number of download attempts
set the DNS servers

these changes seem to guarantee a better stability even in case of poor connectivity

daemon.json (138 Bytes)
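
The attachment itself is not reproduced in this thread. As a rough sketch of what a daemon.json with those three settings could look like, the Python snippet below prints one using the dockerd options `max-concurrent-downloads`, `max-download-attempts` and `dns`. The attempt count of 10 and the DNS addresses (the ones mentioned earlier in the thread) are assumptions for illustration, not the values from the attached file.

```python
import json


def build_daemon_config(max_attempts=10, dns_servers=("1.1.1.1", "8.8.8.8")):
    """Build a Docker daemon configuration with the three settings described above."""
    return {
        # Pull one layer at a time instead of several in parallel.
        "max-concurrent-downloads": 1,
        # Retry each layer download more often; the value here is an assumption.
        "max-download-attempts": max_attempts,
        # DNS servers configured through the daemon's `dns` option (example values).
        "dns": list(dns_servers),
    }


# Print the JSON that would go into the daemon configuration file.
print(json.dumps(build_daemon_config(), indent=2))
```

The Docker daemon has to be restarted after the file is changed for all of the settings to take effect.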

gclaudino.tx · August 15, 2022, 12:01pm

Hi @RoccoBr ! Thanks for updating this topic. I’m glad you could find a working solution for your use case.

Please feel free to come back in the future if you need support with any other topic.

I’ll tag your answer as the solution

Best regards,