Device freezes and needs to be restarted

Hi
We are using the Colibri iMX8X with custom images based on TorizonCore 5.1.0+build.1, 5.3.0+build.7 and 5.5.0+build.11 and Docker containers. We build 2 products based on this platform:
1x Controller with display
1x Controller without display
Both products run the same software with a different configuration. Our software communicates with our backend: it receives commands from it and sends states to it. For the controller with display, we implemented a simple browser without its own controls (hardened kiosk mode) that shows our UI as a web page served by our software.
Now we sometimes have the problem that the whole device freezes. We first noticed it on the device with display, because no interaction was possible anymore, and we thought that we might have a problem with the display. But then we saw the same behaviour on devices without a display.
Behaviour is:

  • No UI interaction possible (no reaction on touch screen)
  • No connection to backend (TCP/IP)
  • Device is offline in app.torizon.io
  • It’s possible to connect with ssh but there is no error in our own logs

Are there any system log files I could check, or what is the recommended way to analyse such problems?

You can start by collecting the boot log from the debug UART/serial console

Then you can login to Linux using that Serial Console and check other logs at

/var/log

Hi @syntom,

When the device is in the hung state and you are connected over local ssh, you should be able to use the systemd journaling commands to view the logs of various services. Examples:

$ sudo journalctl -u docker-compose --no-pager -l
$ sudo journalctl -u aktualizr-torizon  --no-pager -l
$ sudo systemctl status --no-pager -l

@alex.tx’s suggestion of using a serial console is advisable as well in case there is a kernel oops or something on the console. Of course that means that you need to be connected to the console before the issue happens.

Additionally the output of sudo dmesg might be instructive.
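
To capture all of these outputs in one go while the device is in the hung state, a small helper along the following lines could be used. This is only a sketch: the function names save_output and collect_logs and the output file names are made up for illustration, and it assumes it is run as root, since the commands above were run with sudo.

```shell
# save_output FILE CMD [ARGS...]
# Runs CMD and writes its stdout and stderr to FILE, preceded by a header
# line naming the command, so that each file is self-describing.
save_output() {
    out="$1"; shift
    {
        echo "### $*"
        "$@" 2>&1
    } > "$out"
}

# collect_logs DIR
# Gathers the outputs suggested in this thread into DIR, one file each.
# The file names are arbitrary (hypothetical) choices.
collect_logs() {
    dir="$1"
    mkdir -p "$dir"
    save_output "$dir/journal-docker-compose.txt" journalctl -u docker-compose --no-pager -l
    save_output "$dir/journal-aktualizr-torizon.txt" journalctl -u aktualizr-torizon --no-pager -l
    save_output "$dir/systemctl-status.txt" systemctl status --no-pager -l
    save_output "$dir/dmesg.txt" dmesg
}
```

For example, after sourcing these functions in a root shell, collect_logs /tmp/freeze-logs would leave one text file per command in that directory (the path is just an example).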

Drew

Finally I have a device in freeze state in my office:

sudo journalctl -u docker-compose --no-pager -l
-- Logs begin at Mon 2023-05-08 10:23:09 UTC, end at Wed 2023-05-10 11:35:27 UTC. --
-- No entries --

sudo journalctl -u aktualizr-torizon --no-pager -l
-- Logs begin at Mon 2023-05-08 10:23:09 UTC, end at Wed 2023-05-10 11:37:24 UTC. --
-- No entries --

sudo systemctl status --no-pager -l
colibri-imx8x-06995787
State: running
Jobs: 0 queued
Failed: 0 units
Since: Wed 2023-04-26 15:44:38 UTC; 1 weeks 6 days ago
CGroup: /
├─user.slice
│ └─user-1000.slice
│ ├─user@1000.service …
│ │ └─init.scope
│ │ ├─58226 /usr/lib/systemd/systemd --user
│ │ └─58227 (sd-pam)
│ ├─session-c10.scope
│ │ ├─58236 sshd: torizon [priv]
│ │ ├─58238 sshd: torizon@pts/0
│ │ ├─58239 -sh
│ │ ├─58527 sudo systemctl status --no-pager -l
│ │ └─58528 systemctl status --no-pager -l
│ └─session-c9.scope
│ ├─58223 sshd: torizon [priv]
│ ├─58233 sshd: torizon@notty
│ └─58234 /usr/libexec/sftp-server
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─rngd.service
│ └─515 /usr/sbin/rngd -f -r /dev/hwrng
├─systemd-networkd.service
│ └─594 /usr/lib/systemd/systemd-networkd
├─systemd-udevd.service
│ └─528 /usr/lib/systemd/systemd-udevd
├─system-serial\x2dgetty.slice
│ └─serial-getty@ttyLP3.service
│ └─900 /sbin/agetty -8 -L ttyLP3 115200 xterm
├─docker.service …
│ ├─ 713 /usr/bin/dockerd -H fd://
│ ├─ 721 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
│ ├─ 926 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/fc3db69efdf883ce3e90ff76f69bd891d712a7bd486dd054c98ea30f08eb3e29 -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─ 930 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/46ed3eecc29ca071563b61d6609f3f70da82ed9f703a8d0b13019cb614acc1a5 -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─ 931 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/03752ae348e1c21aef7a80856b7764e4c3ce552a9945613a5f6f7051324a15d4 -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─ 982 /usr/bin/ucp-qt
│ ├─ 983 /bin/bash -l /usr/bin/entry.sh
│ ├─ 984 java -jar app.jar
│ ├─1079 setsid -w -f openvt -w -f -s -c 7 -e -- bash -c /usr/bin/weston-launch --tty=/dev/tty7 --user=torizon -- --current-mode > >(tee /proc/1/fd/1) 2> >(tee /proc/1/fd/2)
│ ├─1080 bash -c /usr/bin/weston-launch --tty=/dev/tty7 --user=torizon -- --current-mode > >(tee /proc/1/fd/1) 2> >(tee /proc/1/fd/2)
│ ├─1084 /usr/bin/weston-launch --tty=/dev/tty7 --user=torizon -- --current-mode
│ ├─1085 bash -c /usr/bin/weston-launch --tty=/dev/tty7 --user=torizon -- --current-mode > >(tee /proc/1/fd/1) 2> >(tee /proc/1/fd/2)
│ ├─1086 bash -c /usr/bin/weston-launch --tty=/dev/tty7 --user=torizon -- --current-mode > >(tee /proc/1/fd/1) 2> >(tee /proc/1/fd/2)
│ ├─1087 tee /proc/1/fd/2
│ ├─1088 tee /proc/1/fd/1
│ ├─1094 /usr/bin/weston --current-mode
│ ├─1099 /usr/lib/aarch64-linux-gnu/weston-keyboard
│ ├─1101 /usr/lib/aarch64-linux-gnu/weston-desktop-shell
│ ├─1106 /usr/lib/aarch64-linux-gnu/qt5/libexec/QtWebEngineProcess --type=zygote --no-sandbox --webengine-schemes=qrc:sLV --lang=en-US
│ └─1248 /usr/lib/aarch64-linux-gnu/qt5/libexec/QtWebEngineProcess --type=zygote --no-sandbox --webengine-schemes=qrc:sLV --lang=en-US
├─bluetooth.service
│ └─687 /usr/libexec/bluetooth/bluetoothd
├─wpa_supplicant.service
│ └─684 /usr/sbin/wpa_supplicant -u
├─ModemManager.service
│ └─578 /usr/sbin/ModemManager
├─systemd-journald.service
│ └─516 /usr/lib/systemd/systemd-journald
├─NetworkManager.service
│ └─593 /usr/sbin/NetworkManager --no-daemon
├─rpcbind.service
│ └─532 /usr/sbin/rpcbind -w -f
├─usermount.service
│ └─698 /usr/bin/usermount
├─systemd-resolved.service
│ └─649 /usr/lib/systemd/systemd-resolved
├─udisks2.service
│ └─592 /usr/libexec/udisks2/udisksd
├─dbus.service
│ └─579 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
├─systemd-timesyncd.service
│ └─545 /usr/lib/systemd/systemd-timesyncd
├─system-getty.slice
│ └─getty@tty1.service
│ └─899 /sbin/agetty -o -p -- \u --noclear tty1 linux
├─avahi-daemon.service
│ ├─669 avahi-daemon: running [colibri-imx8x-06995787.local]
│ └─672 avahi-daemon: chroot helper
└─systemd-logind.service
└─645 /usr/lib/systemd/systemd-logind

The device is still in freeze state. I attached the output of command “sudo dmesg”
dmesg_output_10.5.2023.txt (40.3 KB)
On this device we have the following software setup:

  • 1 Container with business logic software that provides the UI as simple HTML (available on http://localhost:9090)
  • 1 container based on Weston that implements a simple browser in
    kiosk mode and shows the UI provided by the first container

We connected a mouse, and the website is usable and seems to work. With the touchscreen, however, no navigation is possible. There is no error or warning in our logs, so it seems that there is a problem with the touch input device.
What else can we do? Where can we find system log files (maybe some driver log files)?

Hi @syntom !

We would like to ask a couple of questions:

You can also check the output of free -h and docker stats for long periods to see if there is any interesting behavior.
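
Since the problem can take about two weeks to appear, it may help to record these values periodically to a file instead of watching them live. The following is a minimal sketch; the function names log_sample and monitor are made up for illustration, and the log file location and interval are left to the caller.

```shell
# log_sample FILE CMD [ARGS...]
# Appends a UTC timestamp line followed by the output of CMD to FILE.
log_sample() {
    out="$1"; shift
    {
        date -u '+=== %Y-%m-%d %H:%M:%S UTC ==='
        "$@" 2>&1
    } >> "$out"
}

# monitor FILE INTERVAL_SECONDS
# Samples memory usage and per-container statistics until interrupted.
# "docker stats --no-stream" prints a single snapshot and then exits.
monitor() {
    while true; do
        log_sample "$1" free -h
        log_sample "$1" docker stats --no-stream
        sleep "$2"
    done
}
```

For instance, monitor /home/torizon/mem.log 300 would take one sample every five minutes (path and interval are only examples). The file should live on persistent storage so that it survives the reboot needed to recover the device, and comparing early and late samples could show whether memory use grows steadily over time.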

Best regards,

Hi @henrique.tx

  • I don’t know how to reproduce it. It appears from time to time on devices that are in production use (customer installations). On our test installation in the office we had to wait 2 weeks until the problem appeared again.
  • We can give it a try and create a new image based on the latest LTS version, but we need time for that
  • It occurs some time after the device boots; it’s hard to tell under which conditions
  • Yes, usually the devices work as expected. We delivered our first device 2 years ago, and so far we have delivered about 35 devices to our customers.
  • Please find the output of the commands attached
    One more thing: today we found out that we have 2 issues and not 1. The issue with the controller with display is different from the one with the controller without display.
    Issue 1 (controller with display): Our application is working and we have a connection (to our backend and to Toradex OTA), and we can start an ssh connection. It seems that this is an issue with the display / display driver.
    Issue 2 (controller without display): There is no connection (to our backend or to the Toradex OTA system), we don’t know the state, and the device has to be restarted. This happens on all our devices with an image based on TorizonCore 5.5; here we want to downgrade to 5.3, which in our experience works.
    I think it’s better if I open a new thread for issue 2.
    dockerps.txt (393 Bytes)
    docker-stats (586 Bytes)
    free-h.txt (207 Bytes)
    tdx-info.txt (27.6 KB)

@henrique.tx Is there a way to check if the touch driver has crashed?

Hi @syntom !

It seems you are using this display, right? Resistive Touch Display 7" Parallel (Toradex Developer Center)

Please be aware that this product is currently in the sample stage and is not recommended for use in the end product.

Since you found out that the two issues are different, could you please create another thread here in Community? So we can focus on one issue here and on the other issue on the new thread?

Best regards,

Hi @henrique.tx

We are aware of this, and it has been discussed with Toradex technical sales.

However, after a reboot of the system, the touch input works again as expected, which supports our assumption that this is rather a problem at the software level (outside Docker).
Therefore it would be helpful to have a way to check whether something has crashed at the operating system level, e.g. the touch driver.
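
One possible check, sketched here under assumptions, is to watch whether the interrupt counter of the touch controller in /proc/interrupts still increases while the screen is being touched. If it no longer changes, touch events are not reaching the kernel driver at all; if it does change, the problem is more likely further up (input device, compositor or browser). The helper name irq_count is made up, and the label of the touch controller's line in /proc/interrupts is not known from this thread, so it has to be looked up on the device first.

```shell
# irq_count PATTERN [FILE]
# Sums the per-CPU counters of every line in FILE (default: /proc/interrupts)
# that matches PATTERN. The counters are the numeric columns that directly
# follow the IRQ number; summing stops at the first non-numeric column.
irq_count() {
    awk -v pat="$1" '
        $0 ~ pat {
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        }
        END { print total + 0 }
    ' "${2:-/proc/interrupts}"
}
```

Calling irq_count with the label found on the device, touching the screen a few times, and calling it again shows whether the count moved. In addition, sudo dmesg | grep -i ad7879 may show driver messages, assuming the driver logs under that name.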

Hi @mnubdi !

So, as you said, let’s make this thread about the controller with the display.
And you can open another thread about the controller without display, ok?

Could you please share with us a minimal way of reproducing the issue you are facing?

Best regards,

Hello @mnubdi,

I hope you are doing well! May I know if you have any updates on this topic?

Hi @rudhi.tx
Last week we had a talk with Michael and Drew and we agreed on this approach:

  • We will try to set up a new image based on the latest TorizonCore 5.7.2 LTS image (and maybe on the latest 6)
  • Then we will test the image for about 1-2 weeks to see whether the problem is gone. We will come back with the results then.
  • As soon as we know how to reproduce it, we will inform you immediately. At the moment we are not able to reproduce the issue (or rather, we always have to wait about 2 weeks until it occurs)

best regards

Hi @syntom!

Another idea is to unbind and bind the resistive touch driver (AD7879) when the problem occurs.

Before and after unbinding, please check the memory usage to see if the amount of memory available (or used) has changed.

After binding it again, please check if the touch started working again.

Best regards,

Hi @henrique.tx
very interesting. How can I unbind and bind the driver?
Best regards

Hi @syntom!

You can refer to this page, for example: Manual driver binding and unbinding [LWN.net]
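
As a minimal sketch of the unbind/bind steps, the following helper writes each bound device name to the driver's unbind file and then to its bind file in sysfs. The function names list_bound and rebind are made up for illustration. The driver directory used in the usage note, /sys/bus/i2c/drivers/ad7879, is an assumption (AD7879 attached over I2C, driver registered as ad7879) and should be verified on the device first.

```shell
# list_bound DRIVER_DIR
# Prints the devices currently bound to the driver. In sysfs each bound
# device appears as a symlink inside the driver directory; the "module"
# symlink, if present, is not a device and is skipped.
list_bound() {
    for entry in "$1"/*; do
        [ -L "$entry" ] || continue
        name=$(basename "$entry")
        [ "$name" = "module" ] || echo "$name"
    done
}

# rebind DRIVER_DIR
# Unbinds and then re-binds every device bound to the driver. The device
# list is collected up front because the symlinks vanish after unbinding.
rebind() {
    devices=$(list_bound "$1")
    for dev in $devices; do
        echo "unbinding $dev"
        echo "$dev" > "$1/unbind"
        sleep 1
        echo "binding $dev"
        echo "$dev" > "$1/bind"
    done
}
```

Run as root, the sequence free -h, then rebind /sys/bus/i2c/drivers/ad7879, then free -h again gives the memory comparison before and after that was asked for above; afterwards the touchscreen can be tested.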

Best regards,

Hello @syntom ,
Were you able to solve your issue with the info provided by @henrique.tx ?
If so, could you please mark his answer as solution?

Best regards,
Josep

Hi @josep.tx, no, the issue is not solved and we are still investigating.
I think binding and unbinding the driver is not a solution; we want to find the root cause.

Hello @syntom ,

We agree, this is only a temporary workaround. If it works, though, it might give us a clue about where to continue our investigation.
Did you try it to see if it works?

Best regards,
Josep

Hello @syntom

Were you able to try this workaround?

Best regards,
Josep