Weston-vivante launch with gray screen

Hi! We have a .NET application. When we flash the image, it already integrates our containers as explained in (Pre-provisioning Docker Containers onto a Torizon OS image | Toradex Developer Center). The situation is that after several restarts of the application, Weston comes up with a gray screen, as if we were passing the --developer flag. We have already reviewed the case Weston-vivante:2 launches differently, and we verified that our application depends on Weston having started; nevertheless, the gray screen still appears occasionally.
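For context, the dependency between our application and Weston is expressed in the compose file roughly like this (only a sketch with placeholder names; the actual docker-compose.yml is attached further down):

services:
  weston:
    image: torizon/weston-vivante:2
    # ... rest of the Weston configuration ...

  libera_8:
    image: registry.example.com/libera-app   # placeholder for our .NET application image
    depends_on:
      - weston   # the application container is only started after the Weston container
    # ... rest of the application configuration ...

Note that a plain depends_on only orders container startup; it does not wait for Weston's Wayland socket to actually be available.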

Here are the logs of the application when it launches correctly.

Mar 19 23:02:03 verdin-imx8mm-14684133 systemd[1]: Starting Docker Compose service with docker compose…
Mar 19 23:02:03 verdin-imx8mm-14684133 systemd[1]: Started Docker Compose service with docker compose.
Mar 19 23:02:04 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-weston-1 Creating
Mar 19 23:02:04 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-weston-1 Created
Mar 19 23:02:04 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-libera_8-1 Creating
Mar 19 23:02:05 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-libera_8-1 Created
Mar 19 23:02:05 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-weston-1 Starting
Mar 19 23:02:05 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-weston-1 Started
Mar 19 23:02:05 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-libera_8-1 Starting
Mar 19 23:02:06 verdin-imx8mm-14684133 docker-compose[904]: Container torizon-libera_8-1 Started

Here are the logs of the application when the gray screen appears.

Mar 22 15:14:05 verdin-imx8mm-14684208 systemd[1]: Starting Docker Compose service with docker compose…
Mar 22 15:14:05 verdin-imx8mm-14684208 systemd[1]: Started Docker Compose service with docker compose.
Mar 22 15:14:06 verdin-imx8mm-14684208 docker-compose[913]: Container torizon-weston-1 Created
Mar 22 15:14:06 verdin-imx8mm-14684208 docker-compose[913]: Container torizon-libera_8-1 Created
Mar 22 15:14:06 verdin-imx8mm-14684208 docker-compose[913]: Container torizon-weston-1 Starting
Mar 22 15:14:07 verdin-imx8mm-14684208 docker-compose[913]: Container torizon-weston-1 Started
Mar 22 15:14:07 verdin-imx8mm-14684208 docker-compose[913]: Container torizon-libera_8-1 Starting
Mar 22 15:14:07 verdin-imx8mm-14684208 docker-compose[913]: Container torizon-libera_8-1 Started

Here is the information from the tdx-info script

Software summary

Bootloader: U-Boot
Kernel version: 5.15.129-6.5.0-devel+git.6f8fd49366db #1-TorizonCore SMP PREEMPT Tue Feb 27 22:23:06 UTC 2024
Kernel command line: root=LABEL=otaroot rootfstype=ext4 quiet logo.nologo vt.global_cursor_default=0 plymouth.ignore-serial-consoles splash fbcon=map:3 ostree=/ostree/boot.1/torizon/f2e676822951ef81195d8dd948be456e336cf286e09243987eb8b68bb417a685/0
Distro name: NAME="TorizonCore"
Distro version: VERSION_ID=6.5.0-devel-20240228202213-build.0
Distro variant: VARIANT="Docker"
Hostname: verdin-imx8mm-14684133

Hardware info

HW model: Toradex Verdin iMX8M Mini on Verdin Development Board
Toradex version: 0057 V1.1B
Serial number: 14684133
Processor arch: aarch64

docker-compose.yml (2.0 KB)

Greetings @Isaga,

I haven’t heard of such an observation before from others. Your logs don’t reveal too much info, so it’s not clear at the moment what might be happening on your system.

This “grey screen” that appears, how often would you say this occurs?

Also can you determine whether it’s Weston or your .NET application that is outputting the grey screen?

Finally, when this issue occurs could you try and get logs from the individual Weston and .NET containers? Perhaps one of these containers is throwing some kind of error or warning.

Best Regards,
Jeremias

As a data point, we also observed a single instance of a system somehow entering (or crashing into?) developer-mode Weston (with no application) under normal operation. In our case we are using Cog or Chromium to serve a browser-based UI, on TorizonCore 6.3.0 build 4 on a Verdin 8MP.

We have not yet seen this recur beyond this sole occurrence and did not get to inspect it at the time.

Thanks Jeremias. We agree that it's something strange, because we've been working with the same application since TC 5.5.0 and we hadn't identified that problem before.

This “grey screen” that appears, how often would you say this occurs?

At the moment, we can say that around 40% of the flashed modules are experiencing this issue. We have been conducting tests, and when it occurs we remove the containers and perform a 'systemctl restart docker'. This usually gets the application to start, but not in all cases.
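For reference, the recovery steps we run when it happens look roughly like this (container names taken from the logs above; the exact names depend on the compose project):

docker stop torizon-weston-1 torizon-libera_8-1
docker rm torizon-weston-1 torizon-libera_8-1
systemctl restart docker
# after this, the containers get recreated the next time the compose service starts them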

Finally, when this issue occurs could you try and get logs from the individual Weston and .NET containers? Perhaps one of these containers is throwing some kind of error or warning.

Sure, I’ll attach the logs
weston_bad.txt (6.7 KB)
weston_ok.txt (7.8 KB)
app-ok.txt (828 Bytes)

Also can you determine whether it’s Weston or your .NET application that is outputting the grey screen?

We’re currently running tests to confirm that.

Best Regards,

we’ve been working with the same application since TC 5.5.0 and we hadn’t identified that problem before

I see you’re now on 6.5.0. Did you only start seeing this issue when transitioning to this newer version of the OS?

Looking at your Weston logs, I see one major difference between the “ok” and “bad” states: in the “bad” state it looks like XWayland never starts. The following lines are present in the “ok” log but missing from the “bad” one:

...
[17:32:59.751] Loading module '/usr/lib/aarch64-linux-gnu/libweston-9/xwayland.so'
[17:32:59.874] Registered plugin API 'weston_xwayland_v1' of size 32
[17:32:59.875] Registered plugin API 'weston_xwayland_surface_v1' of size 16
[17:32:59.877] xserver listening on display :0
...
[17:33:01.105] Spawned Xwayland server, pid 47
[     1] wl_drm_is_format_supported, format = 0x30335241
[     2] wl_drm_is_format_supported, format = 0x30335258
[     3] wl_drm_is_format_supported, format = 0x30334241
[     4] wl_drm_is_format_supported, format = 0x30334258
Disabling glamor and dri3 support, XWAYLAND_NO_GLAMOR is set
Failed to initialize glamor, falling back to sw
[17:33:01.273] xfixes version: 5.0
[17:33:01.297] created wm, root 71
The XKEYBOARD keymap compiler (xkbcomp) reports:
> Warning:          Unsupported maximum keycode 569, clipping.
>                   X11 cannot support keycodes above 255.
Errors from xkbcomp are not fatal to the X server

If your application relies on X then not having XWayland running would be problematic indeed.

Looking at your docker-compose.yml I noticed, based on the way you're starting the weston-vivante container, that this must be using the 2/bullseye tag for this container. Is that correct?

If that is the case, could you switch to the 3/bookworm tag instead? Our 2/bullseye-based containers were designed and optimized to run on Torizon OS 5.X. While they can run on 6.X, it is much preferable to use the 3/bookworm-based containers, as these were designed and optimized for 6.X. Perhaps this mismatch is the issue itself.

Please try using the 3/bookworm-based weston-vivante container instead and let me know if the issue still appears. Be aware that this newer version of the container requires slightly different launch arguments; you can't just reuse the arguments you've been using. You can refer to our articles, such as Debian Containers for Torizon | Toradex Developer Center, to see how to launch the 3/bookworm-based weston-vivante container.
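For reference, a weston service for the 3/bookworm-based container looks roughly like the following. Treat this only as a sketch; please check the article above for the exact volumes, environment variables and device_cgroup_rules recommended for your module rather than copying it verbatim:

services:
  weston:
    image: torizon/weston-vivante:3
    environment:
      - ACCEPT_FSL_EULA=1
    network_mode: host
    cap_add:
      - CAP_SYS_TTY_CONFIG
    volumes:
      - type: bind
        source: /tmp
        target: /tmp
      - type: bind
        source: /dev
        target: /dev
      - type: bind
        source: /run/udev
        target: /run/udev
    device_cgroup_rules:
      # access to tty, input, GPU and DRM device nodes; verify the exact set for your module
      - "c 4:* rmw"
      - "c 13:* rmw"
      - "c 199:* rmw"
      - "c 226:* rmw"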

Best Regards,
Jeremias

@jeremias.tx We’ve just started seeing this semi-reliably on one of our products, and I have the unit in hand to dig deeper. Note it’s in a similar state as above where the Weston container is still using the 2 tag (and we are rectifying this) but I wanted to provide additional info and do an analysis of what is going on.

At the core this seems to be a container corruption issue - docker ps shows the container as unhealthy, and upon using docker exec to enter the container, it appears that the contents of /etc/xdg/weston/weston.ini (and its weston-dev counterpart) have been erased.

/home/torizon# cat /etc/xdg/weston/weston.ini
/home/torizon# cat /etc/xdg/weston-dev/weston.ini 
/home/torizon# exit
torizon@dev$>docker diff torizon-weston-1
C /etc
C /etc/xdg
C /etc/xdg/weston
C /etc/xdg/weston/weston.ini
C /etc/xdg/weston-dev
C /etc/xdg/weston-dev/weston.ini
<snip>

There are also a number of .core files in /home/torizon, which suggest it may have been precipitated by weston crashing and clobbering the configuration files.
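In case it helps, the evidence can be preserved before the container gets recreated, roughly along these lines (container and image names assumed from the output above):

docker cp torizon-weston-1:/home/torizon ./weston-home-backup   # grab the .core files
docker commit torizon-weston-1 weston-corrupted                  # snapshot the damaged container as an image
docker save weston-corrupted -o weston-corrupted.tar             # export it for later inspection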

We’ve just started seeing this semi-reliably on one of our products

Could you specify how often you are seeing this occur? Also when you say “one of our products”, do you mean seeing this on one device or multiple devices on the same product line?

Note it’s in a similar state as above where the Weston container is still using the 2 tag

I’m curious if this is reproducible with the 3 tag, or not.

At the core this seems to be a container corruption issue - docker ps shows the container as unhealthy, and upon using docker exec to enter the container, it appears that the contents of /etc/xdg/weston/weston.ini (and its weston-dev counterpart) have been erased.

Interesting, I don’t recall ever seeing a similar corruption issue like this with containers before. If you remove and restart the container does it persist or does the issue go away?

At the moment it’s hard to say what exactly might be going on here, but what I can say is that almost every night we run tests on our Weston containers to ensure they display what we expect them to. Last I checked we weren’t seeing any issues like this in our tests. Though we only test tag 3 with Torizon 6.X and tag 2 with Torizon 5.X, so I wonder if that’s a possible variable here. At the very least, something must differ between our test setup and your devices, since it sounds like you see this fairly regularly, at least often enough for it not to be a “rare” issue.

It’d be helpful if you could distill this to somewhat reproducible steps. For example is it enough to just restart the Weston container a bunch of times? Does Cog/Chromium need to be involved? Any other factors?

Best Regards,
Jeremias

Hi Jeremias,

Unfortunately, it’s somewhat of a heisen-bug and at this point in time all I have are our post-mortem analysis and observations. I’ll outline what I can provide below:

Also when you say “one of our products”, do you mean seeing this on one device or multiple devices on the same product line?

We have observed it on exactly two devices so far. One was an SQA unit that did it once or twice and never did it again. We didn’t get a chance to inspect it before it was re-deployed (which recreates the containers and therefore “fixes” it). The other is the one I was looking at yesterday, which would boot into this state fairly reliably (though I don’t think that’s conclusive, since the underlying issue only needs to happen once and then the damaged container will persist). Both are the same product. Note that in our case both weston and weston-dev have been customized with branding, so if the “grey” background is being observed it means there is definitely some corruption or loss of Weston’s configuration rather than just an incorrect entry into the “developer” state.

I’m curious if this is reproducible with the 3 tag, or not.

As are we; unfortunately we have not found the steps to reproduce the actual issue yet (see below). We have prepared an update with the new weston image and will continue to monitor for reports of this issue.

Interesting, I don’t recall ever seeing a similar corruption issue like this with containers before. If you remove and restart the container does it persist or does the issue go away?

Yes, that fixed it and I was unable to cause the issue to recur after restarting the device ~10 times. What’s weird here is that our representative reported this unit would occasionally boot into a “correct” state. Presumably at some point the container would fail to the point that Docker decided to re-create it, and this would fix the issue again for a little while.

For example is it enough to just restart the Weston container a bunch of times? Does Cog/Chromium need to be involved? Any other factors?

Our device does run a Cog container. The only additional data point I have is that this was a unit used for expos, and at some point the LVDS display connector became loose enough that the display would flicker and become unstable. My current best guess as to what happened is that interference or back-EMF from the unreliable connection upset the iMX8 GPU in some way, which caused Weston to crash. As far as I know, LVDS doesn’t have an active detection mechanism for the presence of a display, so I don’t think it was a case of Weston seeing the display repeatedly appearing and vanishing in a short time. It might be possible to reproduce the issue by re-introducing this problem, but at this point I’m not sure it’s worth the risk of hardware damage to the iMX or the display. If it is at all useful to you, I did save a snapshot of the damaged Docker image and I could pull some of the core files.

I did find this related issue here - (we don’t use any of the mentioned variables, but I noticed entry.sh does run dos2unix repeatedly on the configuration files at container startup). So it’s plausible that an inopportune crash or power loss could result in an empty or partially written Weston configuration file remaining on disk and persisting across further boots.
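To illustrate the hypothesis (this is not the actual entry.sh, just a sketch): an in-place conversion that gets interrupted can leave the file truncated, whereas converting into a temporary file and renaming it over the original is atomic on the same filesystem:

# in-place conversion, as hypothesised above: an inopportune crash could leave weston.ini empty or truncated
dos2unix /etc/xdg/weston/weston.ini

# safer pattern: convert into a temporary file, then atomically replace the original
dos2unix -n /etc/xdg/weston/weston.ini /etc/xdg/weston/weston.ini.tmp &&
    mv /etc/xdg/weston/weston.ini.tmp /etc/xdg/weston/weston.ini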

Regards,
~BW

Well, thanks for the information, but that still doesn’t give me a lot to go on in terms of investigation and debugging. As a side note, I did confirm with a developer here that using tag 2 on Torizon 6.X has been known to cause weird issues with rendering and other strange behavior.

I additionally did a little test on my side here. I took a device and created a script that runs the Weston container (with tag 3 on Torizon OS 6.5.0), checks both of the weston.ini files inside the container to make sure their contents match what is expected, then stops and removes the Weston container and repeats. The script went through 3000+ loops of this without failure. So either the issue can’t be reproduced by just restarting the Weston container, or it doesn’t occur with tag 3, or it is exceedingly rare, or I’m missing some other factor/variable needed to reproduce it.
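For reference, the loop was roughly along these lines (a simplified sketch, not the exact script; it assumes the compose service is named weston, the container comes up as torizon-weston-1 as in this thread, and a known-good copy of weston.ini was saved beforehand as weston.ini.golden):

#!/bin/sh
# simplified sketch of the restart-loop test described above
i=1
while [ "$i" -le 3000 ]; do
    docker compose up -d weston
    sleep 5
    # compare the config inside the container against the known-good copy
    # (the real test also checked the weston-dev counterpart)
    docker exec torizon-weston-1 cat /etc/xdg/weston/weston.ini > /tmp/weston.ini.current
    if ! diff -q ./weston.ini.golden /tmp/weston.ini.current > /dev/null; then
        echo "iteration $i: weston.ini does not match the expected contents"
        exit 1
    fi
    docker compose down
    i=$((i + 1))
done
echo "completed $((i - 1)) iterations without a mismatch"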

Though I’m afraid that’s all I have for now at the moment. If you are able to uncover anymore information that would be appreciated. But for now there’s not much else I can look into here given the current information available.

Best Regards,
Jeremias

Understood - I’m not expecting anything conclusive to come out of this, rather just documenting our own observations on the issue for future reference 🙂

I understand, I do appreciate you sharing this information despite not having anything we can act upon yet. Please do continue to share more on this topic if you happen to discover anything new.

Best Regards,
Jeremias
