Weston crashes on IMX8mp verdin + Dahlia with HDMI quick plugin/plugout sequence

Hello @mathijs,

I have some updates for you. After some testing on BSP versions 6 and 7, we found that the issue was never reproducible on BSP 7. Looks like some changes on the newer release of NXP BSP have fixed it. Also, BSP 7 uses a new version of Weston. Could you please quickly evaluate it? You can already install BSP 7 from the feeds of Toradex Easy Installer.

Hello @rudhi.tx,

Thank you for your reply. I did a small test last night on a Dahlia + imx8mp Verdin, and I think I did see crashes. I’ll try to reproduce with more logging enabled and get more information.

But before I do that, one question. I am working on a corporate network to which I can’t connect development hardware. Therefore, I could not use the Toradex Easy Installer with feeds. Instead, I looked at the Toradex Artifactory and downloaded the files from the latest monthly for BSP 7.0.
I used the path: tdxref-oe-prerelease-frankfurt/scarthgap-7.x.y/monthly/2/verdin-imx8mp/tdx-xwayland/tdx-reference-multimedia-image/

Can you check whether that is a build in which the issue was assumed to be fixed?

thanks in advance!

Hi @mathijs !

Yes. This is a Reference Multimedia Image from BSP 7.

Our tests were specifically done on build 1 (this is build 2). You can try on build 2.

Best regards,

Hi, thank you for your reply.
I indeed tested with build 2 of the monthly. I did see a crash; however, getting more logging is tricky, as enabling logging seems to affect the timing and makes the crash less likely to happen.

Summary of the setup / test steps:

Hardware:

  • Toradex Dahlia carrier
  • Verdin imx8mp
  • Toradex DSI to LVDS board
  • Toradex 10-inch display + capacitive touch
  • External HDMI screen attached to the Dahlia HDMI port

Software

  • tdx-reference-multimedia-image from scarthgap-7.x.y monthly build number 2
  • Adapted /boot/overlays.txt to use the mentioned DSI to LVDS board and screen:
    fdt_overlays=verdin-imx8mp_hdmi_overlay.dtbo verdin-imx8mp_dsi-to-lvds_panel-cap-touch-10inch-lvds_overlay.dtbo verdin-imx8mp_spidev_overlay.dtbo
  • Adapted /usr/lib/systemd/system/wayland-app-launch to not run in fullscreen mode (removed --fullscreen from the service file). This makes the Qt cinematic demo draw on both screens.

Test steps
I had to modify the previously used script a bit, as the timing seems a bit different with this Weston. I also added a quick check on whether Weston is still running:

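#!/bin/bash
# Toggle the HDMI connector state via sysfs until Weston stops running.
counter=1

while [ $counter -le 5000 ]
do
    counter=$((counter + 1))
    # Force the connector off, then on again, with a short pause in between.
    echo off > /sys/class/drm/card1-HDMI-A-1/status
    sleep 0.60
    echo on > /sys/class/drm/card1-HDMI-A-1/status
    sleep 0.60
    # Stop as soon as the weston service is no longer active.
    if ! systemctl is-active --quiet weston; then
        break
    fi
done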

Test results
After running the above script for a while, I see a crash similar to the earlier one:

Feb 28 09:24:08 verdin-imx8mp-07154657 weston[22326]: [09:24:08.567] Output 'HDMI-A-1' enabled with head(s) HDMI-A-1
Feb 28 09:24:08 verdin-imx8mp-07154657 weston[22326]: [09:24:08.568] DRM: head 'HDMI-A-1' updated, connector 40 is disconnected.
Feb 28 09:24:08 verdin-imx8mp-07154657 weston[22326]: [09:24:08.568] Output 'HDMI-A-1' no heads left, disabling.
Feb 28 09:24:08 verdin-imx8mp-07154657 weston[22326]: [09:24:08.568] Disabling output HDMI-A-1
Feb 28 09:24:08 verdin-imx8mp-07154657 systemd-logind[382]: Session c22 logged out. Waiting for processes to exit.
Feb 28 09:24:08 verdin-imx8mp-07154657 Qt5_CinematicExperience[22396]: The Wayland connection broke. Did the Wayland compositor die?
Feb 28 09:24:08 verdin-imx8mp-07154657 systemd[1]: weston.service: Main process exited, code=dumped, status=11/SEGV
Feb 28 09:24:08 verdin-imx8mp-07154657 systemd[1]: weston.service: Failed with result 'core-dump'.
Feb 28 09:24:08 verdin-imx8mp-07154657 systemd-logind[382]: Failed to restore VT, ignoring: Input/output error
Feb 28 09:24:08 verdin-imx8mp-07154657 systemd[1]: wayland-app-launch.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 09:24:08 verdin-imx8mp-07154657 systemd[1]: wayland-app-launch.service: Failed with result 'exit-code'.
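
For anyone trying to line up the crash with the hotplug events that precede it in a long journal, a small filter may help. This is only a sketch: the function name crash_context is made up for illustration, and it assumes the journal contains the same "weston.service: Main process exited, code=dumped" line as in the excerpt above.

```bash
#!/bin/bash
# crash_context N: read journal text on stdin and print each Weston core-dump
# line together with the N lines that precede it.
# (Hypothetical helper name; the matched text is taken from the log excerpt.)
crash_context() {
    local before="${1:-10}"
    grep -B "$before" 'weston\.service: Main process exited, code=dumped'
}
```

It could be used as "journalctl -b | crash_context 10" on the target, or fed a saved log file on a development machine.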

I attached a longer log and output from dmesg as separate files as well.

I can continue my attempts to reproduce it with more detailed logging enabled, but so far I have not seen the crash with logging enabled.
journalctl weston.log (232.7 KB)
dmesg.txt (35.6 KB)

Hello @mathijs,
This is a very complex issue to solve, and as you already noticed, it seems to only be reproducible in an artificial way. Do you expect the connection / disconnection cycle to happen so fast on your final application?
I tried to reproduce the issue by manipulating the HPD pin manually on my Verdin Development Board, and I couldn’t make it crash after 50 attempts. Even manipulating the HPD this way might be a stretch, considering I can “connect” and “disconnect” the cable much faster than by actually plugging and unplugging the HDMI connector.

Hi Rafael,
Thank you for your reply.

It is indeed a complex issue. In normal use we don’t expect the HDMI cable to be plugged in/out this fast. However, we want our device to keep running when foreseeable misuse or fault conditions occur.
Fault conditions that could trigger this issue could be, e.g., a bad HDMI cable or HDMI peripheral.
Misuse could, for example, be a person continuously toggling an HDMI mux by pressing its switch button (I was able to produce this issue that way initially while playing around with BSP 6).

In all cases we would need our final application to keep running. If Weston crashes, then the application also crashes at this moment.

What is hard to determine is whether there is a timing and ‘change’ related element that causes the issue, so that the artificial way of reproducing it just hits it every now and then, or whether the artificial way of reproducing the issue is itself part of the cause. Earlier, on BSP 6, I did see a similar crash with real hardware. Currently I am preparing a hardware tool that can also toggle the HPD line. I hope to do some more testing with that as well.

Hope you don’t mind me asking, but your FAEs mentioned a potential fix for BSP 6 a few times. Is it possible to see the attempted fix somewhere? Or can you share any details about the approach taken there to solve it? We also investigated options to fix or work around the problem ourselves and were wondering whether our ideas align with yours.

Kind regards,

Mathijs

I was able to reproduce this issue in BSP 6 by toggling the HPD line, and now on BSP 7 I can’t. I agree with your concern about something that might fail and wanting to make sure the app is robust in any situation. However, I think we must always balance the amount of effort it takes to reach any solution with the potential of a bad outcome. I’m not saying you have to agree with me about the outcome of this specific case, but in my perspective we’re dealing with a mostly theoretical failure at this point.

On another point, would it be possible to restart your application to cover the now unlikely possibility of weston crashing again?

I executed some tests in my setup with your script, and I reproduced the issue 2 or 3 times after running the script for a long time, usually more than 1 hour. I configured the system to capture the core dump from weston, and with it I discovered the crash seems to be related to the g2d renderer:

Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `weston --backend=drm-backend.so'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000007fa3344c90 in ?? () from /usr/lib/libweston-12/g2d-renderer.so
[Current thread is 1 (Thread 0x7fa445d020 (LWP 10786))]
(gdb) bt
#0  0x0000007fa3344c90 in ?? () from /usr/lib/libweston-12/g2d-renderer.so
#1  0x0000007fa383dc78 in ?? () from /usr/lib/libweston-12/drm-backend.so
#2  0x0000007fa383dee0 in ?? () from /usr/lib/libweston-12/drm-backend.so
#3  0x0000007fa41e5d84 in weston_output_disable () from /usr/lib/libweston-12.so.0
#4  0x0000007fa440b4c8 in ?? () from /usr/lib/weston/libexec_weston.so.0
#5  0x0000007fa41dd064 in ?? () from /usr/lib/libweston-12.so.0
#6  0x0000007fa415baa4 in wl_event_loop_dispatch_idle () from /usr/lib/libwayland-server.so.0
#7  0x0000007fa415bc00 in wl_event_loop_dispatch () from /usr/lib/libwayland-server.so.0
#8  0x0000007fa41590f4 in wl_display_run () from /usr/lib/libwayland-server.so.0
#9  0x0000007fa440d174 in wet_main () from /usr/lib/weston/libexec_weston.so.0
#10 0x0000007fa42784f4 in ?? () from /usr/lib/libc.so.6
#11 0x0000007fa42785cc in __libc_start_main () from /usr/lib/libc.so.6
#12 0x0000005580ad07b0 in ?? ()
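
The post does not say how the core dump was captured. As a rough sketch of one way to check where core dumps end up on a target, the helper below inspects the kernel's core_pattern setting. The function name core_dump_handler is hypothetical, and whether the image ships systemd-coredump, coredumpctl and gdb is an assumption that this thread does not confirm.

```bash
#!/bin/bash
# core_dump_handler [FILE]: report how the kernel handles core dumps, based on
# the contents of core_pattern. FILE defaults to the real procfs entry and is
# only a parameter so the function can be tried on a sample file.
core_dump_handler() {
    local pattern_file="${1:-/proc/sys/kernel/core_pattern}"
    local pattern
    pattern=$(cat "$pattern_file")
    case "$pattern" in
        *systemd-coredump*) echo "systemd-coredump" ;;  # dumps go to the journal/coredump store
        "|"*)               echo "pipe" ;;              # piped to some other helper program
        *)                  echo "file" ;;              # written as a plain core file
    esac
}

# If this prints "systemd-coredump" (assumption: it is installed on the image),
# a stored Weston dump could then be listed and opened with:
#   coredumpctl list weston
#   coredumpctl debug weston
```

The unresolved "??" frames in the backtrace above suggest that debug symbols were not available when it was taken, so a dump opened this way would likely look the same unless symbols are installed.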

I remember we had some issues in the past with the g2d renderer in Weston (on older BSP versions), and at some point NXP even recommended turning it off. I decided to just try disabling it and see if the issue was still reproducible. I just commented out use-g2d=true in /etc/xdg/weston/weston.ini:

[core]
#use-g2d=true

After that, I couldn’t reproduce the issue anymore, even with the script (at least it got much harder since I left it running for 2 or 3 hours). Could you try this on your side and see if it makes a difference?
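
For repeated test runs, the same edit can be scripted. The following is a minimal sketch, not an official procedure: the function name disable_g2d is made up here, and it assumes the option appears as an uncommented use-g2d=true line, as shown above.

```bash
#!/bin/bash
# disable_g2d FILE: comment out an active "use-g2d=true" line in a
# weston.ini-style file. (Hypothetical helper; FILE would normally be
# /etc/xdg/weston/weston.ini.)
disable_g2d() {
    local ini="$1"
    # Keep a backup next to the file so the change is easy to revert.
    cp "$ini" "$ini.bak"
    # Only touch a line that is not already commented; keep any indentation.
    sed -i 's/^\([[:space:]]*\)use-g2d=true/\1#use-g2d=true/' "$ini"
}
```

After running "disable_g2d /etc/xdg/weston/weston.ini", Weston has to be restarted (for example with "systemctl restart weston") before the change takes effect.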

Hope you don’t mind me asking, but your FAE’s mentioned a potential fix for BSP6 a few times.

Not at all, and that’s true. We had one team member testing a possible solution, but we found that this proposed solution was also reverting the solution of another potential crash related to HDMI connections. I went through it and noticed it was a collection of 3 or 4 patches that were backported from newer NXP kernel versions.

I can try to get more details at a later point.

Best regards,
Rafael

I understand your point. Can NXP help here in any way? I did see a post there, related to this issue, that also seemed to be posted by Toradex: https://community.nxp.com/t5/i-MX-Processors/Weston-crashes-after-several-HDMI-connection-cycles/td-p/1908706. However, it seems NXP did not reply to it?

In my case, I did see it crash with the BSP 7 version. I am still testing with the setup that physically toggles the HPD line.

I am also testing now with G2D disabled. However, we planned to use some G2D implementations to get screen cloning working (based on this NXP post: https://community.nxp.com/t5/i-MX-Graphics-Knowledge-Base/Weston-clone-mode-on-i-MX8MPlus/ta-p/1791853). We looked around and that seemed the only way that screen cloning is supported on imx8mp. So if we have to disable G2D, then we don’t have any solution for screen cloning which our device needs to support. Do you know any other solution for it? (Just to clarify, we did not apply anything of that patch to any of the tests where Weston crashed)

Our application has critical parts in it that do not allow it to crash and restart.

Of course, we prefer to have a stable compositor solution. But if there is no other way, one thing that we want to investigate is to disable the hotplug detection in the kernel by patching it out, and to run a userspace application that reads the HPD line and triggers the enable/disable of the HDMI screen (as shown in the test scripts, enabling/disabling is quite easy with an echo to the correct sysfs node). That would allow us both to perform, e.g., rate limiting/filtering to stabilize the HPD signal, and to only allow HPD to work when our application is in a non-critical state.
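
To make the filtering idea a bit more concrete, here is a minimal sketch of a debounce stage in shell. It is an illustration under assumptions, not a tested solution: the function name debounce and the sampling helper read_hpd_samples mentioned in the comments are hypothetical, and how the HPD line would actually be read from userspace is not covered in this thread.

```bash
#!/bin/bash
# debounce N: read one raw HPD sample per line ("on" or "off") from stdin and
# print a state only after it has been seen N times in a row and it differs
# from the last state that was printed.
debounce() {
    local need="$1" last="" run=0 stable="" sample
    while read -r sample; do
        if [ "$sample" = "$last" ]; then
            run=$((run + 1))
        else
            last="$sample"
            run=1
        fi
        if [ "$run" -ge "$need" ] && [ "$sample" != "$stable" ]; then
            stable="$sample"
            echo "$stable"
        fi
    done
}

# Possible wiring on the target (read_hpd_samples is a hypothetical program
# that prints the raw HPD state at a fixed interval):
#   read_hpd_samples | debounce 5 | while read -r state; do
#       echo "$state" > /sys/class/drm/card1-HDMI-A-1/status
#   done
```

With a fixed sampling interval, N sets the minimum time a new state has to be stable before it is forwarded, which is where a check for the application's critical state could also be added.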

Any ideas on that approach?

Would be nice to see your approach, thanks in advance!

Yes, this question was asked by the person that was investigating the issue on our side. Unfortunately, it’s sometimes very difficult to get answers from NXP regarding software issues…

Just to clarify, I did reproduce it in BSP 7 by using your script. I didn’t reproduce it by manually toggling the HPD line. Overall, it might be that it was not reproduced just because you have to try so many times even with the script. In my tests, it took more than an hour to reproduce the failure when it reproduced.

That’s unfortunate…

That looks like it should be possible. I think it might even be possible to disable the HPD pin on the device tree without having to patch the kernel.

Another thing that occurred to me that might be worth a test is to use the upstream kernel. HDMI support landed on the iMX8MP in BSP 7 upstream and because the implementation is very different there, the problem may not be reproducible. If this works, I would not know if this approach would be compatible with the previously mentioned screen cloning patch.

Hello @mathijs,

Have you had a chance to test this on the upstream kernel? Do you also have any updates regarding your tests disabling the HPD? Let me know if there are still any open questions :)

Hi,

I took a quick look at the upstream build and installed it on my Dahlia board. However, when I enabled the DSI to LVDS + cap. touch overlay (as in the tests above), the system didn’t boot anymore. I did not have time to investigate why that happens.

Do you know whether this hardware combination was tested / working at your side?

In an earlier post it was mentioned that you would look into whether the attempted fix for Weston could be shared. Do you know whether there is any progress on that? It would be great to have as a reference.

Kind regards,

Mathijs

Hi @mathijs,

To the best of my knowledge, this was tested on the Verdin Development Board. In theory, it should not happen just because you are using a Dahlia board. However, I will do a test to verify this with our BSP 7 upstream build and will let you know what I find.

Hello @mathijs,

I was just able to reproduce this behavior on a Verdin iMX8MP with the BSP 7 upstream image. I also checked internally to find the status of the DSI to LVDS overlay for this upstream image, and I can only confirm that it is a work in progress. As you know, this upstream image is also currently in an experimental state, and we are working on adding and improving things in it. Anyway, I see that this was not the original topic of discussion in this thread. Were you able to get an HDMI display working successfully with the upstream image?
For the behavior on the DSI to LVDS display, I suggest that you create a new thread which would also help us to track it better.

Hi,

Yes, it is understandable that it is a work in progress / experimental. I tested it as a side step based on the suggestion @rafael.tx made; however, it did not make sense for me to look further into it at this moment, as I need two displays to be able to reproduce the issue. With the DSI to LVDS overlay not working, I would only have the HDMI interface left.

Do you have any update about sharing the attempted fix for the weston.imx version with us? Although it may not work, I am still interested in looking at it as a reference.

Thanks in advance,

Mathijs

Hello @mathijs,
I’m attaching here the changes that I mentioned were being tested. As discussed before, we found these brought regressions to previous issues we had solved.

changes_hdmi_crash.diff (11.3 KB)