Weird issue here I don’t fully understand. Today we encountered an issue where after applying an update via lockbox, the system breaks in a way that SSH keys no longer work.
Preamble: We’ve configured /etc/ssh/sshd_config with the following change so we can push authorized_keys as part of the custom overlay in TorizonCore Builder (since /home is not under OSTree control):
AuthorizedKeysFile /etc/ssh/.auth/%u/authorized_keys .ssh/authorized_keys
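For reference, the keys end up in the image roughly like this via the overlay (the paths and file names below are illustrative; only the sshd_config line above is our actual configuration):
# Illustrative only: stage a user's key inside the TCB filesystem overlay
mkdir -p my_overlay/usr/etc/ssh/.auth/torizon
install -m 0600 torizon_id.pub my_overlay/usr/etc/ssh/.auth/torizon/authorized_keys
# sshd's StrictModes check (on by default) also requires every directory leading to
# this file, including / itself, to be owned by root (or the user) and not
# group- or world-writable.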
This has worked fine for many, many update cycles and OS versions during internal testing.
Here’s where it gets weird:
Our systems have been running an OS build, 1.0.0-RC3, for quite a while. This build works fine.
Last week, we created an empty commit in our OS repository (no file deltas whatsoever) and tagged it 1.0.0 to produce a new build in preparation for final release. This repository contains a pre-built TorizonCore image and our TCB customizations, so no Yocto build happened.
Applying the 1.0.0 OS lockbox to any of our systems breaks SSH because root (/) is suddenly mounted with rwxrwxr-x permissions instead of the previous rwxr-xr-x permissions, and sshd gets upset:
debug1: userauth_pubkey: publickey test pkalg rsa-sha2-512 pkblob RSA SHA256:JNZay6QgFieD5RwH9SyaZwC/cOiSREpV+J/ZdNntzyI [preauth]
debug1: temporarily_use_uid: 1000/1000 (e=0/0)
debug1: trying public key file /etc/ssh/.auth/torizon/authorized_keys
debug1: fd 4 clearing O_NONBLOCK
---> Authentication refused: bad ownership or modes for directory / <---
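For anyone else hitting this error: a quick way to see which path component sshd is objecting to (with StrictModes on, the default, it checks every directory leading to the key file) is something along these lines; namei is part of util-linux, and plain ls -ld on each component works just as well:
namei -l /etc/ssh/.auth/torizon/authorized_keys
# Every listed directory must be owned by root (or the key's user) and must not be
# group- or world-writable; in our case it is the 0775 on / itself that trips the check.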
It gets even weirder:
Checking out the code and doing a local build/deploy with TorizonCore Builder works just fine.
Forcing a rollback with fw_setenv rollback 1 and rebooting fixes the permissions and everything is happy again.
What would cause the mount permissions to be altered suddenly in this way? There is no OS delta on our end that would explain it.
This is a very strange-sounding phenomenon you’ve run into. I attempted to see if I could reproduce it.
I flashed Torizon 6.4.0 on my device and prepped it for offline updates. Then for my Lockbox I took the 6.4.0 image as a base and used torizoncore-builder union to generate a new commit, but otherwise no actual changes. I pushed this commit to the platform and generated a Lockbox for it.
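Roughly the commands I used, for reference (treat this as a sketch; the image name and branch are placeholders, and exact arguments can differ between TorizonCore Builder versions):
torizoncore-builder images unpack torizon-core-docker-<machine>-Tezi_6.4.0.tar
torizoncore-builder union repro-branch
torizoncore-builder platform push --credentials credentials.zip repro-branch
# The Lockbox itself was then defined against this package on the platform and
# fetched with the platform lockbox subcommand.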
Before I did any update I took note of the permissions of the / directory after initially flashing my device. I can see the permissions are drwxr-xr-x.
I then performed the offline update with my previously generated Lockbox. The update succeeded and the device rebooted. I checked the permissions of / again and I still see they are drwxr-xr-x.
So I failed to reproduce this observation of yours. But at the least it doesn’t seem like anything inherent to the offline update process itself. Off the top of my head I can’t think of anything that might cause this on your side.
Since you’ve successfully done this before without issues, was there perhaps something different this time? Previous update cycles went through just fine, and as you said, deploying the commit with TorizonCore Builder also works. These are all just OSTree operations underneath, so I don’t see why the behavior would suddenly change now.
Yes, it makes no sense to us either. Nothing in the process has changed that I’m aware of, we are using CI to produce these artifacts from git repositories so it should be a consistent process every time.
It gets even weirder, in that I attempted to re-build the same lockbox locally and it applied without issue. After this I attempted to re-apply the original affected 1.0.0 lockbox and it reported “No updates found”, suggesting there is no difference between what I built locally and the “official” CI artifact that manifests this issue.
Currently we have a sample size of 3-4 units where this was confirmed, so it is not an isolated instance.
What’s also interesting is this is an issue that isn’t likely to be found/observed except in circumstances like ours where we have locked the system to only allow SSH via private key and those keys are pre-baked into the image. Password authentication is unaffected and so even if this manifests elsewhere it may go undetected.
I have some additional data to add regarding affected lockboxes; so far we’ve identified the following as affected (not sure how useful it is, but adding it just for context):
The OS 1.0.0 lockbox
A “system” (OS + containers) lockbox 1.0.0 which contains OS package 1.0.0
A system lockbox 1.0.0-RC5, also containing OS package 1.0.0
So it’s something about that package itself in the platform rather than the lockbox creation process.
> So it’s something about that package itself in the platform rather than the lockbox creation process.
Okay, so that narrows things down quite a bit. But now the question is: what makes this specific 1.0.0 package of yours special?
All I have to go on is that somehow your package changes the permissions of the / directory to rwxrwxr-x. I’m not sure how easy this is on your side, but maybe you need to dissect the various steps in your process to see what exactly triggers this permission change.
I assume your other “good” OS packages don’t change the permissions like this right?
Correct, this is the first time this has happened and we’ve had our SSH keys configured this way from very early on.
I did dig deeper and found that the OSTree deploy directory is similarly affected and is likely the cause here (bind-mount?):
drwxr-xr-x 13 root root 4.0K Nov 16 19:18 4813771708af972e90246b9c22bfd96b8d0e22863ca3c2a92607cd40241983a2.0
-rw-r--r-- 1 root root 82 Nov 16 19:18 4813771708af972e90246b9c22bfd96b8d0e22863ca3c2a92607cd40241983a2.0.origin
drwxrwxr-x 13 root root 4.0K Nov 16 19:23 e7d42cccd0d77a285eeff5bdec69b3f1df9f8c5b72acd9526330d2918d200774.0
-rw-r--r-- 1 root root 82 Nov 16 19:23 e7d42cccd0d77a285eeff5bdec69b3f1df9f8c5b72acd9526330d2918d200774.0.origin
I’m not even sure how it’s possible for me (or a build configuration) to accidentally make that change, since as I understand it, “/” doesn’t really exist in the traditional sense but is rather a construct of an OSTree checkout, so the permissions must be determined internally, in a place the regular user doesn’t have access to.
Note that the subdirectories all seem fine; for some unknown reason it’s just the root of that checkout that’s configured with a permission mask of 775 instead of 755. This is also present in the ota.tar.zst file we use for TEZI deploys.
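For what it’s worth, the stray mode also appears to be recorded in the commit itself rather than introduced at deploy time; something along these lines shows it on the device (commit hash taken from the deploy directory name above, output format from memory):
ostree --repo=/ostree/repo ls -d e7d42cccd0d77a285eeff5bdec69b3f1df9f8c5b72acd9526330d2918d200774 /
# a good commit prints d00755 0 0 ... for the root entry; the affected one prints d00775 0 0 ...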
A CI rebuild of the affected package from the same git tag seems fine, so I think we’ve conclusively ruled out that anything about the build infrastructure is at fault here (?)…
This is literally a bit-flip error (0755 vs. 0775 is a single bit in the mode), and at this point I’m at a loss; I’m inclined to think it’s either a very obscure latent bug somewhere in OSTree or, unless it recurs, to just blame it on cosmic rays.
> A CI rebuild of the affected package from the same git tag seems fine, so I think we’ve conclusively ruled out that anything about the build infrastructure is at fault here (?)…
Okay, this is just weird now.
> This is literally a bit-flip error (0755 vs. 0775 is a single bit in the mode), and at this point I’m at a loss; I’m inclined to think it’s either a very obscure latent bug somewhere in OSTree or, unless it recurs, to just blame it on cosmic rays.
I mean, I’m stumped, so I’m not against blaming cosmic rays at this point haha. As you said anyway, I’m not even sure how one would change the permissions of / with OSTree, given the way it works.
We found the cause when it occurred to me to try re-running the affected build on the same agent it was originally built on, and voilà, the incorrect 775 mask reappeared.
(How it got into this state we’ll never know; these agents are all clones of the same original Docker VM, and this affected agent had been making flawless builds forever, as it was our first agent.)
Comparing the two agents involved, I noticed the permissions on our custom FS overlay folder were also 775 (as opposed to 755 on the “working” agent).
I changed the mask to 0735 as a silly test… lo and behold, suddenly the OSTree deploy folder also had permission mask 0735. The same happens with 0777.
So the way to reproduce this issue is to have a custom overlay folder in your TCB yaml: the root filesystem folder (/) of your deployment will then inherit the permission mask of that custom folder, and this seems like a rather undesirable (and potentially security-breaking) effect.
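A minimal way to see the effect, assuming a TCB project with a filesystem customization (the folder name is just an example):
mkdir -p my_custom/usr/etc
chmod 0775 my_custom            # group-write bit set on the overlay root
stat -c '%a %n' my_custom       # -> 775 my_custom
# Build with my_custom/ listed under customization/filesystem in tcbuild.yaml and the
# root directory (/) of the resulting deployment ends up 0775 as well; with the
# overlay root at 0755 it stays at 0755.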
Was this other agent configured incorrectly somehow? Or how did it end up getting different permissions than the “good” agent?
> custom FS overlay folder
Also just to clarify, you’re talking about the changes directory here right? The directory that gets generated by isolate and used by union in TorizonCore Builder?
> this seems like a rather undesirable (and potentially security-breaking) effect.
It seems OSTree must allow such things, then. On one hand I can see some niche case where, for some reason, someone would actually want to change the permissions of /; I guess it’s an at-your-own-risk kind of thing. I’ll mention this internally, but at least for now it seems like you’ve solved the issue.
> Was this other agent configured incorrectly somehow? Or how did it end up getting different permissions than the “good” agent?
That will forever remain a mystery, I think.
> Also just to clarify, you’re talking about the changes directory here right? The directory that gets generated by isolate and used by union in TorizonCore Builder?
Yes, via e.g. the following in your tcbuild.yaml file:
customization:
  filesystem:
    - my_custom/
A viable “fix” is to add an appropriate .tcattr entry for file: . in the root of your custom folder, but this is still an ugly mechanism and an easy way for unsuspecting users to shoot themselves in the foot; the default tree you get from torizoncore-builder isolate places the .tcattr under [custom]/usr/etc, and I suspect that means the permissions for /, /usr and /usr/etc are at the whim of the default umask of the system checking out the repository to do the build.
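For instance, something like this in the overlay root ought to pin / back to 0755, assuming the getfacl-style entry format that isolate itself generates (verify against a real generated .tcattr before relying on it):
cat > my_custom/.tcattr <<'EOF'
# file: .
# owner: 0
# group: 0
user::rwx
group::r-x
other::r-x
EOF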
As such, it seems a good preventative measure would be for the isolate command to always generate the .tcattr at the top level and include an entry for . and any folders involved, regardless of whether they were modified on the device. It wouldn’t break or alter any functionality, but it would likely save someone else a few hours of head-scratching in the future when dealing with permission-sensitive programs.
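In the meantime, a cheap guard on the build side is to refuse to build whenever the overlay contains group- or world-writable directories, along these lines (GNU find syntax):
if find my_custom -type d -perm /022 | grep -q .; then
    echo "ERROR: group/world-writable directories found in the overlay" >&2
    exit 1
fi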