Unexpected Base OS Rollbacks after Upgrades

DuncanNapier · May 20, 2024, 9:33am

We see unexpected rollbacks on a device after an upgrade. We suspect our code is doing something to trigger this; however, we are not 100% sure.
To remedy the rollback we have to remove the device, delete the contents of /var/sota/storage and reprovision the device.
However, we find a file /var/sota/storage.lock left behind, suggesting the upgrade didn’t complete. What is responsible for this file, and what would cause it to be left behind?

ToradexAI · May 20, 2024, 9:34am

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!

Hello,

The presence of /var/sota/storage.lock suggests that an operation related to software updates was in progress but did not complete successfully. This file is typically created by the software update management system (in the context of Torizon, this would be part of the Aktualizr or OSTree update process) to prevent concurrent update processes from interfering with each other. If the process is interrupted unexpectedly, due to a system crash, power loss, or a kernel panic, the lock file may not be removed properly.

Several factors could lead to the unexpected rollbacks you’re experiencing, including but not limited to:

Incomplete or Corrupted Updates: If an update process is interrupted or if the update package is corrupted, it might leave the system in an inconsistent state, triggering a rollback during the next boot.
System Stability Issues: If the new software causes system instability, crashes, or kernel panics, Torizon’s automatic rollback mechanism would be triggered to revert to the last known good state.
Critical Service Failure: Torizon is configured to reboot if a critical service (like docker.service) fails to start. If the new update adversely affects any critical services, this could also lead to a rollback.

To address the issue with the lock file and prevent unnecessary rollbacks, consider the following steps:

Ensure Clean Update Cycles: Make sure that the update process is allowed to complete without interruption. This may involve checking for power stability and ensuring that the device is not manually rebooted during an update.
Investigate System Logs: Check system logs for any errors related to the update process, critical services failures, or system stability issues. This can provide clues as to why the update was not successful.
Manual Cleanup: If you’re certain that no update process is currently running and the system is stable, you can manually remove the lock file. However, do this with caution, as it could potentially cause issues if an update process is actually still in progress.
Review Update Packages: Ensure that the update packages are correctly built and not corrupted. Testing updates in a controlled environment before deploying them widely can help catch issues early.
Monitor Critical Services: If specific services are critical for your application, ensure they are properly configured and tested with the new update before deployment.
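
As a rough illustration of the "Manual Cleanup" caution above, the sketch below checks whether a lock file exists while no update client appears to be running. It is only a sketch under stated assumptions: the process name "aktualizr-torizon" is an assumption about what the client is called on the device, and the helper names are hypothetical. It reads command names from /proc, so it applies to Linux only.

```python
from pathlib import Path


def running_process_names(proc_root="/proc"):
    """Return the set of command names found under a Linux /proc tree."""
    names = set()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            names.add((entry / "comm").read_text().strip())
        except OSError:
            continue  # the process exited while we were scanning
    return names


def lock_looks_stale(lock_path, client_names, proc_root="/proc"):
    """True if the lock file exists and none of the named clients is running.

    client_names are assumptions supplied by the caller, for example
    ["aktualizr-torizon"]. The kernel truncates /proc/<pid>/comm to
    15 characters, so the names are truncated the same way before comparing.
    """
    if not Path(lock_path).exists():
        return False
    running = running_process_names(proc_root)
    return not any(name[:15] in running for name in client_names)
```

Calling lock_looks_stale("/var/sota/storage.lock", ["aktualizr-torizon"]) and getting True would only suggest that the file is a leftover. It does not prove it, so checking the logs before deleting anything is still advisable.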

If the issue persists despite these steps, it may be necessary to delve deeper into the specific configurations and logs of your device to identify the root cause of the unexpected rollbacks and lock file residue.

Best regards

Please, also check if these links can help you solve your problem:

OSTree | Toradex Developer Center.

bruno.tx · May 20, 2024, 6:57pm

Hello @DuncanNapier,

Can you clarify the following points?

Which version of Torizon OS are you using?
What type of updates are you performing?
Do you only see rollbacks in one type of update? Or in both OS and application updates?
Why do you think your code may be causing this rollback?

Best Regards,
Bruno

jeremias.tx · May 20, 2024, 8:09pm

Just to chime in regarding the storage.lock file: this file is created by the update client, Aktualizr. It is created when Aktualizr is actively accessing the SQL storage database and is removed once that access ends. This prevents multiple processes or clients from accessing the storage database at the same time, which could cause incorrect information to be recorded in the database or even data corruption.

The fact that the file is still there may mean that Aktualizr was in the middle of accessing the storage database when you stopped it and re-provisioned the device. Since the file was left behind, it might prevent Aktualizr from using the storage database, because Aktualizr sees the lock file and assumes something else is accessing the database. It is strange that it got left behind, though, as Aktualizr should have cleaned it up beforehand, assuming normal operation.
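
To illustrate the general lock-file pattern described here, below is a minimal generic sketch in Python. It is not Aktualizr's actual implementation, and the function name exclusive_lock is hypothetical. It shows how a lock file is created before access and removed afterwards, and why a process that dies in between leaves the file behind and blocks the next attempt.

```python
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold a lock file for the duration of a `with` block (generic sketch)."""
    # O_CREAT | O_EXCL makes creation atomic: os.open raises FileExistsError
    # if the lock file is already present, i.e. someone else holds the lock
    # or a previous holder died without cleaning up.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        # Runs on normal exit and on Python exceptions, but NOT if the
        # process is killed or the device loses power, which is how a
        # stale lock file can be left on disk.
        os.close(fd)
        os.remove(lock_path)
```

With this pattern, a second caller that finds the file gets a FileExistsError and cannot tell a live holder from a stale leftover, which matches the behaviour described above.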

In addition to Bruno’s questions, could you also share the Aktualizr logs from around the time you observed this issue?

Best Regards,
Jeremias