TCB deploy fails randomly

Hi,

I’m connected to my i.MX 8 module over WiFi, and I want to update my Torizon image.

I use the following command:

torizoncore-builder deploy main --remote-host $TORADEX_IP --remote-password $TORIZON_USER_PWD --reboot

This works “sometimes”, without any change to the code or setup on my side, but it often fails with the following error:

$ torizoncore-builder deploy main --remote-host $TORADEX_IP --remote-password $TORIZON_USER_PWD --reboot
Pulling OSTree with ref main (checksum 14c688b091113e45400bb6ea989396c5440f6a054f8d5710a3865fe36b79e7e7) from local archive repository…
Starting http server to serve OSTree.
OSTree server listening on “localhost:38869”.
Starting OSTree pull on the device…
Password: error: While fetching http://localhost:40715/objects/b0/d3237dc91f3d1219ebacbef2c89ff633d3a3069b284a0f848d8ada10f1d4a0.filez: [18] Transferred a partial file
Failed to run command on module: ostree pull tcbuilder:14c688b091113e45400bb6ea989396c5440f6a054f8d5710a3865fe36b79e7e7

I guess that “Password: error” has nothing to do with my password, since I’m sure it’s correct and every now and then the deployment works.

Any idea how this can be improved, or what I might be doing wrong?

It’s really hard to work with Torizon in this way…

Thank you!

Hi, I’m the Toradex AI Assistant. The Toradex Support team will follow up with you shortly!

Hello there!

When deploying over WiFi with TorizonCore Builder, you might find the --mdns-source parameter particularly useful. It helps when multiple network interfaces are in use and mDNS multicast requests are being sent through the wrong interface. You can find more details in the TorizonCore Builder Tool Commands Manual.

To address this, you can modify your deployment command as follows:

torizoncore-builder deploy main --remote-host $TORADEX_IP --remote-password $TORIZON_USER_PWD --mdns-source <your-wifi-interface-ip> --reboot
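
For example, assuming your WiFi interface is named wlan0 (a placeholder; check the actual name with ip link), you can look up its IPv4 address and substitute it for <your-wifi-interface-ip> above:

ip -4 addr show wlan0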

Additionally, ensure that you are using the correct default values for the remote connection:

  • Remote username: The default is “torizon”.
  • Remote port: The default is 22.

If necessary, you can explicitly set these using the --remote-username and --remote-port parameters.
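
With those defaults spelled out explicitly (the values shown are just the documented defaults), the command would look like this:

torizoncore-builder deploy main --remote-host $TORADEX_IP --remote-username torizon --remote-password $TORIZON_USER_PWD --remote-port 22 --reboot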

Should you encounter further issues, consider these alternative deployment methods:

  1. Toradex Easy Installer: This tool allows for direct flashing of the device. More information can be found here.
  2. Torizon Cloud Updates: This method can offer more reliability for remote updates. Details are available here.

If you need further assistance or information about alternative deployment methods, feel free to ask!

Best regards,
Toradex AI Assistant.

I think we’ve also seen this particular error in our organization. IIRC it mostly (maybe exclusively) occurs when deploying from a Mac, but I’m not entirely sure. What is your client OS?

Interesting, I’m on PopOS.

I think that I’ve been able to improve the situation by disabling the LAN connection on my dev machine and only using the WiFi connection.
For some reason the LAN over my docking station seems to be buggy on my laptop (it never connects, stays in “connecting” all the time, and maybe caused interference in the Torizon deployment…)

Greetings @Giona,

This does seem like a network-related issue, as you’ve already discovered. When you use torizoncore-builder deploy to deploy your image over the network, it uses OSTree as the medium to transfer the changes. An OSTree deployment sends many small object files, so a stable network connection is very helpful here.
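
As a quick sanity check before deploying (just a generic suggestion, nothing Toradex-specific), you can watch for packet loss on the link:

ping -c 50 $TORADEX_IP

Sustained packet loss there would point at the network itself rather than at TorizonCore Builder.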

I think that I’ve been able to improve the situation by disabling the LAN connection on my dev machine and only using the WiFi connection.

After using the WiFi connection does the issue more or less go away for your setup?

Best Regards,
Jeremias

Hi Jeremias

I didn’t try N times, but it worked 2 times in a row… something that never happened before, so I guess it’s good now.

This worries me a bit though… isn’t it the same mechanism used for OTA?
There, e.g. when running on LTE, the connection might not be really robust and stable at the time of the update.

This worries me a bit though… isn’t it the same mechanism used for OTA?
There, e.g. when running on LTE, the connection might not be really robust and stable at the time of the update.

We have made improvements in Torizon OS to better handle cases where the bandwidth of the connection is low or the connection is poor. As documented here, there are some additional options that can be configured to improve stability in such cases: Aktualizr - Modifying the Settings of Torizon Update Client | Toradex Developer Center
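
As a rough illustration of the override mechanism only (the exact option names and recommended values are in the linked article), Aktualizr settings on Torizon OS can be adjusted with a small TOML drop-in on the device; for example, changing the server polling interval:

# /etc/sota/conf.d/99-custom.toml (the file name is just an example)
[uptane]
polling_sec = 300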

Of course, you should still test in an environment that emulates the conditions you expect for your product. While we try to account for most connection scenarios, it’s not possible to cover all of them.

Also worth considering: if the connection to the device is so poor at the time of the update that the update just can’t be completed, then the update will simply fail. The device will continue to run the old software, which is preferable to the device possibly bricking itself or ending up in some weird state due to a “half-update”. The update can then be re-attempted or re-scheduled for a time when the connection is more stable.

Best Regards,
Jeremias

Thank you for the link, we’ll look into it.

If I understand you correctly, OSTree & Aktualizr do not keep partial downloads around until the new SW package is fully downloaded; rather, the update fails if it is not downloaded fully “in one go”.

If this is the case, is it technically possible to change this behavior?

In our product use case, it’s probably very uncommon for a connection to remain stable long enough to download the ~100-200 MB (or more?) SW packages in one go when the LTE connection is poor (which is very often the case, even in EU countries like Germany).

Let me try to understand your use-case and perspective.

For context, with our update framework we aim to make updates as atomic as possible, meaning there is a binary state: the update is either “done” or not done. This avoids the issues that come with “partial” states, which can be problematic with power cuts and similar scenarios. This way it can be guaranteed that the system won’t end up in a weird, invalid half-way state.

From your side though, it sounds like you don’t trust that your system will have a stable enough connection to reliably complete an update. Because of this, your worry is that an update may never complete and will just cause a cycle of failing and restarting the update. Is that more or less correct?

If my understanding is correct, what would be your hypothetical ideal scenario?

Realistically, we still want the atomic aspect of our updates. Are you thinking about caching the progress, so that if a download fails, it picks up where it left off on the next try?

Keep in mind that during an actual update there is a limited number of retries before it gives up. It doesn’t fail on the first instance of a connection error. This makes the update a lot more robust against poor connections compared to what you may have seen with TorizonCore Builder.

Furthermore, as I said previously, it may help to test your system under a simulated version of the kind of connection environment you expect to see. It would help to see what issues you may realistically face. I say this because we do have other customers with bad connections or low bandwidth; while not ideal, they are able to perform updates in their cases.

Best Regards,
Jeremias

You basically got all the points right.

What I would expect is of course not a partial update, in which a sudden reboot would brick the device, but a partial download of the update into a sort of local cache, which can then be installed once the download phase is complete.
We implemented this update method on our current system, and we see that it helps a lot. With a mobile device like ours, the connection is not stable at all and sudden disconnections are normal.
Additionally, with Torizon updates likely being bigger than an optimized Yocto image update, this “never-ending update” is even more likely to happen.

Testing in a real-world scenario makes sense, but it’s hard to predict how the LTE network is going to behave from Mexico to Singapore, passing through a rural German place in which 4G is still a luxury…

Do you think that a sort of caching is technically possible?
Is it something we could develop as a standalone service… and then feed Aktualizr from a local registry or so?

We implemented this update method on our current system, and we see that it helps a lot. With a mobile device like ours, the connection is not stable at all and sudden disconnections are normal.

Could you elaborate on this point? Do you have another update method alongside Torizon Cloud?

Additionally, with Torizon updates likely being bigger than an optimized Yocto image update, this “never-ending update” is even more likely to happen.

I’m not sure what you mean by Torizon updates being bigger than a Yocto image. When the OS/filesystem is updated via OSTree, it should only need to download the differences between what is on the device and what is on the server. Unless the difference is 100%, which is unlikely in practical scenarios, you’re never going to download an entire image’s worth of data.

This amount of data is even further reduced if you make use of static deltas: Signing and Pushing Torizon OS Packages to Torizon Cloud | Toradex Developer Center
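
To sketch the idea with the upstream ostree CLI (the actual Torizon Cloud workflow is what the linked article describes; the repo path and revision names below are placeholders), a static delta precomputes the difference between two commits so the client fetches one compact file instead of many small objects:

ostree static-delta generate --repo=my-ostree-repo --from=OLD_COMMIT NEW_COMMIT
ostree static-delta list --repo=my-ostree-repo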

Do you think that a sort of caching is technically possible?
Is it something we could develop as a standalone service… and then feed Aktualizr from a local registry or so?

I wouldn’t say it’s “impossible”, but we’re not talking about trivial work and changes to our current update framework. The reason I suggest initial testing in this case is that we may otherwise go down a road that requires a non-trivial amount of time and effort for a potential problem that hasn’t been analyzed yet.

To give further context, the link I shared here before: Aktualizr - Modifying the Settings of Torizon Update Client | Toradex Developer Center

We’ve tested these configurations before with a low-bandwidth, unstable modem connection that had download speeds around 10 kB/s, so far from an ideal network connection. Via the documented configurations we set a timeout period of about 1 hour. We then performed several OSTree-based update tests where the size of the update ranged from around 100-200 MB. In one test where the connection was particularly bad, the update took around 16 hours, but it still succeeded.

Keep in mind that the timeout configuration applies to each individual curl request during the download process, during which most errors will trigger OSTree to retry the current download chunk. So with a high enough timeout set, in theory the device will keep retrying the download during the update.
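
As an illustration of where such knobs live (the option names below are an assumption on my part; follow the linked article for the exact settings we recommend), libostree allows per-remote tuning of the underlying curl transfers in the OSTree repo config, along these lines:

[remote "torizon"]
connect-timeout-secs = 3600
low-speed-limit-bytes = 1
low-speed-time-seconds = 3600

With settings like these, a transfer only counts as failed if it stays below low-speed-limit-bytes for low-speed-time-seconds, so generous values make OSTree very patient on flaky links.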

I think it’s worth trying our current solutions to see if they already satisfy your use case. If we see that they can’t solve your potential problems, then we can of course discuss other potential solutions/work.

Best Regards,
Jeremias

Could you elaborate on this point? Do you have another update method alongside Torizon Cloud?

No, I was referring to our pre-Torizon solution based on a custom Yocto installation. Sorry for the confusion here.

I’m not sure what you mean by Torizon updates being bigger than a Yocto image.

Our old full Yocto image was ~65 MB.
We don’t have enough experience to say whether a Torizon SW update (mostly containers only) will be smaller or not. Somehow, with the experience we have built up on Torizon so far, I have the feeling that it will be bigger. The fact that this is not very visible is not really a good thing, but that’s another topic :slight_smile:

We’ll definitely look into the static-deltas topic, thanks for sharing.

With regard to the “caching feature”, adding such a big timeout and many retries might be a viable replacement.
Just to give you an example of the challenges we faced: our (old) SIM provider dropped the connection every 10 MB, resulting in no connection for ~10 minutes. So bandwidth is not the only limiting factor here…
You would reckon that in such conditions it’s going to be pretty hard to define the magic numbers (timeout and retries) for a successful update.
We’ll definitely perform more tests in real-world scenarios; experience tells me, though, that without a caching mechanism it’s going to be hard.
The main problem is that once you have deployed your SW and a customer can’t update, you can’t easily change the configuration to make the SW update more stable.

We don’t have enough experience to say whether a Torizon SW update (mostly containers only) will be smaller or not. Somehow, with the experience we have built up on Torizon so far, I have the feeling that it will be bigger. The fact that this is not very visible is not really a good thing, but that’s another topic

Oh, I thought you were only talking about OS updates, since we’ve just been talking about OSTree up till now. The issue we’ve been discussing is definitely more of a concern for container updates. Currently the container engine we use (Docker) does not really have good retry logic or download caching for container images. It’s a known limitation of the engine.

Do you plan to do more OS updates or container updates? Or an equal mix of both?

With regard to the “caching feature”, adding such a big timeout and many retries might be a viable replacement.

Well, as I said, such a feature is still on the table for discussion on our side. Though even if we decided today to implement such a feature, it’s not something that would be ready anytime soon. This is another reason I urge you to evaluate our current features for poor connections during updates. It’s not that we don’t want to build the new feature; it’s that it will take time and might not be ready in a time frame that lets you make use of it at the start. Basically, we’re just trying to think of what can help you now.

Best Regards,
Jeremias

My apologies for not being clear; I was referring to updates in general.

We are definitely going to have many more container updates, 4-5 times a year at least, whereas OS updates will basically happen only when there is a new base image or security issues to be addressed.
If Docker updates are problematic in such conditions, I guess we might have to consider going without Docker :frowning:. If I understood you correctly, the current retry/timeout settings are OS-related only.