Upstream metadata expiry breaks lockboxes?

bw908 · October 11, 2023, 4:42pm

We are producing lockboxes for offline updates, which are fully self-contained with both the OS and docker images.

However, today we encountered a failure to update which appears to be due to an expiry in the Toradex metadata on 9/25/2023 having broken a lockbox that’s only a few weeks old (well before our desired 1 year expiry).

[DEBUG   ]: >> Oct 11 15:32:15 Hub-101 aktualizr-torizon[1361]: fetchMetaOffUpd() called with source_path: "/var/volatile/DEV_UPDATE/update"
[DEBUG   ]: >> Oct 11 15:32:15 Hub-101 aktualizr-torizon[1361]: Current version for ECU ID: 101 is unknown
[DEBUG   ]: >> Oct 11 15:32:15 Hub-101 aktualizr-torizon[1361]: New updates found in Director metadata. Checking Image repo metadata...
[DEBUG   ]: >> Oct 11 15:32:15 Hub-101 aktualizr-torizon[1361]: Failed to update Image repo metadata: The targets metadata was expired.
[DEBUG   ]: >> Oct 11 15:32:15 Hub-101 aktualizr-torizon[1361]: Event: UpdateCheckComplete, Result - Error

and diving into the JSON files, the culprit appears to be the snapshots.json:

{"signatures":<snip>,"signed":{"_type":"Snapshot","expires":"2023-09-25T15:20:03Z", <snip>

This seems like an unnecessary restriction, I don’t think there’s any reason to block an offline update based on metadata that isn’t relevant to it.

This seems to be a persistent problem as the snapshot.json of a more recent lockbox we built has an expiry only a few weeks in the future so this will continue to rear its head for any lockbox that has been built more than a few weeks ago. What can we do to prevent this?

jeremias.tx · October 11, 2023, 5:53pm

Greetings @bw908,

I believe we’ve had similar reports in the past regarding metadata in Lockboxes expiring before the actual Lockbox expiry date. However now that I check, the past report was for root metadata and not snapshot metadata like in your case. So perhaps any improvement we did in the past did not affect snapshot metadata.

Let me bring this up with our team and see if anything can be done here and we will get back to you.

Best Regards,
Jeremias

bw908 · October 11, 2023, 5:56pm

Thanks - I did some more digging and it appears the targets.json file is similarly affected (same date as the snapshots.json file)

I did confirm the lockbox is set to expire in on 09/21/2024 1:49:57 PM according to the online platform

jeremias.tx · October 12, 2023, 5:11pm

The team here is trying to look into this but need additional information. They want:

The email address associated with the account where you’re creating the Lockbox.
The Device ID for the device you’re trying to do the update with.
All the *.json metadata files in the problematic Lockbox you have.

Best Regards,
Jeremias

bw908 · October 12, 2023, 9:10pm

OK - I believe I have your contact information, I’ll send you an email with the requested details tomorrow

jeremias.tx · October 12, 2023, 11:02pm

I’ll keep an eye out for your information.

In other news I did a small test where I generated a Lockbox using TorizonCore Builder and examined the expiry dates of all the metadata files. The Lockbox I used was created a few months back and has an expiry of 7/14/2024. The Lockbox I just downloaded, the closest expiry date amongst all the metadata files was 1/9/2024.

It doesn’t seem to be I’m seeing the same behavior as you where your downloaded Lockboxes seem to have metadata expiring just weeks after being downloaded. This is another reason we need your information as we may need to dig into your account on our Platform and see if there’s anything strange going on.

Best Regards,
Jeremias

bw908 · October 18, 2023, 7:13pm

Just to follow up on this - we are also considering a use case where we will need extended expiry dates for lockboxes (2+ years).
How will this be affected by the expiry dates in any of the tdx feeds which are typically only a year or two int he future? Or is the issue here only because the targets.json has expired and this is the only relevant metadata for lockboxes comprised solely of custom packages we have added to the platform ourselves?

jeremias.tx · October 18, 2023, 7:51pm

Just to follow up on this - we are also considering a use case where we will need extended expiry dates for lockboxes (2+ years).

So what is your use-case here? Is your idea that you download the Lockbox and then it remains valid for the entire 2 years without ever re-downloading the Lockbox during that time?

How will this be affected by the expiry dates in any of the tdx feeds which are typically only a year or two int he future?

Well when a Lockbox is downloaded the server should bump the expiry dates of the related metadata such that the downloaded Lockbox won’t have metadata that expires “soon”. Though this seems to not be working in your case with some of the metadata not being bumped far in advance. With the reason still being under investigation.

Best Regards,
Jeremias

bw908 · October 18, 2023, 7:58pm

Yes - generally we build our software lockboxes once via CI and publish them per our ECO processes. Once built they’ll remain untouched for their lifespan.

However, this use case is a little different:

Device A, which is the primary device in the system and has some functionality.
Device B, which runs TorizonCore, is connected to A, and runs a lightweight application acting as a bridge between device A’s functionality and other infrastructure.

The software on B should stay synchronized with A due to the protocol/features supported. This is complicated because one of the desired use cases allows A to be swapped out for a different A possibly running a different version of the software. (Obligatory car analogy is “I’ve bought a car and a trailer, and you also have a car with a tow hitch, and would like to borrow my trailer for a few days”)

Therefore, the plan is to embed the lockbox (it’s only about 10MB) into the software on A, and offer it up to B anytime the two connect, to ensure the software on B is always paired regardless of what A is running. In an ideal world, A would always be running a relatively recent version of software, but we know that’s not always the case. The default expiry of 1 year feels a little too short for us when considering release schedules - it might not be that far behind the current version. And of course, the lockbox is baked into the software release, so it’s pinned at whatever was produced during that CI cycle.
Hence, we’d like to set an expiry date arbitrarily far enough into the future that it’s viable for all practical purposes - somewhere from 2-10 years. Hopefully we never actually see that kind of gap, but we also don’t want things to break if we do encounter it - swap outs of A should not cause B to no longer work with an older version of A because its lockbox has expired.

Granted, we could likely circumvent the metadata via e.g. manual docker save and docker load commands, but that feels like it’s working around a system that’s already in place and does the same job, in addition to a possible security hole.

jeremias.tx · October 18, 2023, 10:02pm

I see, I believe I understand your use-case but I’ll need to discuss it further internally. I don’t think we’ve ever dealt with such long expiry periods before. So I’m unsure if there would be any pitfalls or other issues in our system with this.

Best Regards,
Jeremias

bw908 · October 30, 2023, 1:11pm

Any developments on the initial issue of lockboxes expiring in a handful of weeks instead of a year?

Thanks!

jeremias.tx · October 30, 2023, 5:42pm

I haven’t heard anything from the team yet, let me check in with them and let you know.

jeremias.tx · November 3, 2023, 3:45pm

@bw908 Just to follow-up here. Our team found a possible bug that may be the cause of what you’re seeing here with regards to short-lived Lockboxes. I’ll let you know when a fix is available and you can try to see if it addresses your issue.

Best Regards,
Jeremias

bw908 · November 16, 2023, 2:29pm

Can we please get some traction on this? It’s become a blocking problem for us as we attempted to build a final SQA production candidate release yesterday only to find that the lockbox metadata has an expiry date of TODAY.

"expires": "2023-11-16T19:44:07Z",

This date appears in both image-repo/targets.json and image-repo/snapshot.json

jon.tx · November 16, 2023, 3:53pm

Apologies for the slow communication here. We have identified the issue, and a fix should be deployed to production soon.

After initially not being able to reproduce the issue, we narrowed it down to lockboxes created using the API. When you create a lockbox through the web UI, the expiration of all relevant repository metadata will be bumped so that it does not expire before the lockbox itself does, but that process was not working properly when they are created via API. So, as a workaround for today you can simply use the web UI to ensure the metadata doesn’t expire.

We should be deploying the fix on the API gateway to production on Monday or Tuesday.

Oh, and I should note that image-repo/snapshot.json will not have its expiry bumped, but that’s normal expected behaviour, and will not cause the update to be rejected. (c.f. Uptane PURE-2 metadata verification procedures, step 7(iv).)

bw908 · November 16, 2023, 4:32pm

Thanks, I can confirm the workaround that if I edit the existing lockbox in the web UI and then rebuild it the expiry in targets.json appears to be correct.

ben.tx · November 30, 2023, 3:26am

Just jumping in here to let you know this is deployed. Using the API V2 to create lockboxes should bump the uptane metadata expirations.

bw908 · December 1, 2023, 2:49pm

Thanks. I’ll update our API calls as we were still using v1 in our scripts.

bw908 · December 6, 2023, 6:00pm

As a follow-on to this, while updating our scripts I’ve noticed the v2 API is not very script friendly.

Previously, it would return hard HTTP error codes on failure, which made it easy to check the exit code of e.g. a curl --POST call.

However, the new v2beta API appears to return soft errors as messages, for example:

{code: "400" .... }

and no machine-parseable error status (i.e., the following script proceeds and thinks nothing is wrong)

curl -s --max-time 30  -H "Authorization: Bearer ${TOKEN}" "https://app.torizon.io/api/v2beta/non_existent_endpoint" > ota.json

if [ $? -ne 0 ]; then
	echo "Getting package list failed"
	exit 1;
fi

Would it be possible to revert or restore the “proper” error code behaviour? APIs are almost always used via scripts and error checking capability without attempting to “guess” based on returned text is critical.

jeremias.tx · December 6, 2023, 7:43pm

Thank you for the feedback regarding the V2 API. Let me bring this up to our cloud team.