Colibri iMX8QXP gets bricked when trying to add network device from ApolloX extension

mmarcos.sensor · September 7, 2023, 6:47pm

We are using a Colibri iMX8QXP SoM with a custom carrier board and a custom TorizonCore image built with TorizonCore Builder (based on torizon-core-docker-colibri-imx8x-Tezi_5.7.2+build.20), with a custom device tree and overlays and an external driver module. The board is connected to the company’s local network through Ethernet, and the company’s local network is connected to the internet through a corporate firewall/proxy/router (we point this out because it has caused us problems when working with Torizon tools before).

We are using the ApolloX extension v2.1.2 on an Ubuntu distribution running on WSL2.

First we do a clean install of our custom TorizonCore image on the board using Easy Installer. We log in to the board through SSH and change the factory password for the torizon user.

Then we open VS Code, connect to WSL, open the ApolloX Torizon side bar and try to add the network device manually by pressing the “+” button (tooltip: “Manually Connect Device”), which prompts for the IP, username and password. When we confirm, the board reboots and then keeps rebooting endlessly, approximately every 15 seconds. It never stops rebooting, even after disconnecting it from the network.

We can’t access TorizonCore by ssh before it reboots and we can’t even get a user prompt on the serial console before it reboots. All we can do from the serial console is to stop U-Boot from continuing to boot the kernel and get a U-Boot prompt.

This failure mode doesn’t always happen. Sometimes the device gets added and connected to the extension correctly and we use it normally for development. It happens with the Colibri iMX6DL SoM as well.

The following is the output from the serial console while booting:

CPU:   NXP i.MX8QXP RevC A35 at 1200 MHz at 44C

DRAM:  2 GiB
MMC:   FSL_SDHC: 0, FSL_SDHC: 1
Loading Environment from MMC... OK
In:    serial
Out:   serial
Err:   serial
Model: Toradex Colibri iMX8 QuadXPlus 2GB Wi-Fi / BT IT V1.0D, Serial# 07329488

 BuildInfo:
  - SCFW 216a2c2e, SECO-FW c9de51c0, IMX-MKIMAGE 6a315dbc, ATF 2fa8c63
  - U-Boot 2020.04-5.7.2+git.33bb8e968332

flash target is MMC:0
Net:   eth0: ethernet@5b040000 [PRIME]
Fastboot: Normal
Normal Boot
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found U-Boot script /boot.scr
973 bytes read in 10 ms (94.7 KiB/s)
## Executing script at 83200000
4688 bytes read in 22 ms (208 KiB/s)
127969 bytes read in 36 ms (3.4 MiB/s)
164 bytes read in 30 ms (4.9 KiB/s)
Applying Overlay: colibri-imx8x_display-lcdif_overlay.dtbo
1376 bytes read in 38 ms (35.2 KiB/s)
Applying Overlay: display-g104xce-l01_overlay.dtbo
739 bytes read in 38 ms (18.6 KiB/s)
Applying Overlay: colibri-imx8_goodix-gt928-ts_overlay.dtbo
398 bytes read in 36 ms (10.7 KiB/s)
Applying Overlay: colibri-imx8_flexcan1_overlay.dtbo
274 bytes read in 37 ms (6.8 KiB/s)
12197548 bytes read in 295 ms (39.4 MiB/s)
Uncompressed size: 30724608 = 0x1D4D200
9186396 bytes read in 232 ms (37.8 MiB/s)
## Flattened Device Tree blob at 83100000
   Booting using the fdt blob at 0x83100000
   Loading Ramdisk to fcda9000, end fd66bc5c ... OK
   Loading Device Tree to 00000000fcd66000, end 00000000fcda8fff ... OK

Starting kernel ...

[    0.135338] No BMan portals available!
[    0.136529] No QMan portals available!
[    1.530310] imx8qxp-lpcg-clk 37620000.clock-controller: failed to get clock parent names
[    2.015456] imx6q-pcie 5f010000.pcie: pcie_ext clock source missing or invalid
[    2.079674] debugfs: Directory '59040000.sai' with parent 'imx8qxp-sgtl5000' already present!
Starting version 244.5+
[    7.296084] Goodix-TS 17-0014: i2c test failed attempt 1: -5
[    7.333973] Goodix-TS 17-0014: i2c test failed attempt 2: -5
[    7.373544] Goodix-TS 17-0014: I2C communication failure: -5
[   18.511355] Bluetooth: hci0: unexpected event for opcode 0x0000
[   23.242237] reboot: Restarting system
▒

U-Boot 2020.04-5.7.2+git.33bb8e968332 (Jan 01 1970 - 00:00:00 +0000)

CPU:   NXP i.MX8QXP RevC A35 at 1200 MHz at 47C

DRAM:  2 GiB
MMC:   FSL_SDHC: 0, FSL_SDHC: 1
Loading Environment from MMC... OK
In:    serial
Out:   serial
Err:   serial
Model: Toradex Colibri iMX8 QuadXPlus 2GB Wi-Fi / BT IT V1.0D, Serial# 07329488

 BuildInfo:
  - SCFW 216a2c2e, SECO-FW c9de51c0, IMX-MKIMAGE 6a315dbc, ATF 2fa8c63
  - U-Boot 2020.04-5.7.2+git.33bb8e968332

flash target is MMC:0
Net:   eth0: ethernet@5b040000 [PRIME]
Fastboot: Normal
Normal Boot
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found U-Boot script /boot.scr
973 bytes read in 10 ms (94.7 KiB/s)
## Executing script at 83200000
4688 bytes read in 22 ms (208 KiB/s)
127969 bytes read in 36 ms (3.4 MiB/s)
164 bytes read in 30 ms (4.9 KiB/s)
Applying Overlay: colibri-imx8x_display-lcdif_overlay.dtbo
1376 bytes read in 38 ms (35.2 KiB/s)
Applying Overlay: display-g104xce-l01_overlay.dtbo
739 bytes read in 38 ms (18.6 KiB/s)
Applying Overlay: colibri-imx8_goodix-gt928-ts_overlay.dtbo
398 bytes read in 37 ms (9.8 KiB/s)
Applying Overlay: colibri-imx8_flexcan1_overlay.dtbo
274 bytes read in 37 ms (6.8 KiB/s)
12197548 bytes read in 296 ms (39.3 MiB/s)
Uncompressed size: 30724608 = 0x1D4D200
9186396 bytes read in 232 ms (37.8 MiB/s)
## Flattened Device Tree blob at 83100000
   Booting using the fdt blob at 0x83100000
   Loading Ramdisk to fcda9000, end fd66bc5c ... OK
   Loading Device Tree to 00000000fcd66000, end 00000000fcda8fff ... OK

Starting kernel ...

What could be the problem?

jeremias.tx · September 7, 2023, 7:38pm

Greetings @mmarcos.sensor,

This behavior sounds somewhat similar to a report from another user here: ApolloX extension periodically restarts imx8 board

Though, as you can see, we never got enough information from this user to investigate the issue further. We were also never able to reproduce the issue ourselves.

That said I’ll need some information from you before we can try and investigate this further.

First of all, it sounds like you have a lot of customizations to the OS image here. As a sanity check, does this issue still happen if you flash a standard default TorizonCore image and connect with that instead?

Next, for context when ApolloX connects to a device it modifies that device’s /etc/docker/daemon.json. If possible could you try to get the content of this file on the devices where this issue occurs? I know you said the device reboots too quickly to access, but maybe you can try the following:

Stop the device boot at the U-Boot prompt.
In the U-Boot prompt run ums 0 mmc 0

This command makes the device’s eMMC available as a mountable USB device. You can then connect the device to a development PC via the USB OTG port and it should mount the device’s filesystem like a USB drive. Then from your development PC you can browse the filesystem and get the content of /etc/docker/daemon.json that way.

Now as a final question: on the devices that don’t experience this failure mode, are there any notable differences between these devices and the devices that do experience the issue? Any setup or environmental differences maybe?

Best Regards,
Jeremias

mmarcos.sensor · September 11, 2023, 5:35pm

Yes, we saw it and it is indeed similar, but since the board and the conditions were not exactly the same we decided to post our own question instead of replying to theirs.

We downloaded the following image: torizon-core-docker-colibri-imx8x-Tezi_5.7.2+build.20.tar, repeated the same procedure and got the same error.

We were able to do this but we couldn’t find the file exactly at the path you provided /etc/docker/daemon.json, instead we found it at the following path: /ostree/deploy/torizon/deploy/caf872fb557ded8c5bd945986031eda45aecabfa6729a66162dedd7435d33439.0/etc/docker/daemon.json.
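For anyone repeating this step, the search for the file can be sketched as below. This is a minimal sketch, assuming the board’s root filesystem partition is already mounted on the development PC; the function name find_daemon_json and the mount point /mnt/colibri are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: lists every daemon.json found under the OSTree
# deployments of a root filesystem mounted at the given directory.
find_daemon_json() {
    rootfs="$1"
    find "$rootfs/ostree/deploy" -type f -path '*/etc/docker/daemon.json' 2>/dev/null
}

# Usage (placeholder mount point, adjust to where the partition is mounted):
#   find_daemon_json /mnt/colibri
```

Searching by path suffix avoids having to know the deployment checksum in advance.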

The content of the file is the following:

{
   "insecure-registries" : [":5002"]
}

We compared it to the contents of the daemon.json file from one of the boards that was added correctly to the ApolloX extension, and it seems to be missing the IP address of the development PC that hosts the local development Docker registry.
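A quick way to spot this condition is to check whether each registry entry has a host part in front of the port. The following is only a sketch, assuming entries of the form host:port (IPv6 literals are not handled); the function name registry_entry_has_host is hypothetical and 192.168.1.10 is a placeholder address, not one taken from our setup.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if a registry entry of the form
# "host:port" has a non-empty host part.
registry_entry_has_host() {
    entry="$1"
    host="${entry%%:*}"   # drop everything from the first ':' onwards
    [ -n "$host" ]
}

# Compare the entry found on the failing board with a well-formed one.
for entry in ":5002" "192.168.1.10:5002"; do
    if registry_entry_has_host "$entry"; then
        echo "ok:      $entry"
    else
        echo "no host: $entry"
    fi
done
```

If jq is available, the entries could be read from the file with something like jq -r '.["insecure-registries"][]' daemon.json and passed to the function one by one.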

We tried modifying the file by inserting the development PC’s IP address and the board booted correctly. After that we tried once more to add the device to the ApolloX extension and the board started rebooting again.

Apparently, when daemon.json gets modified, the extension sources the development PC’s IP address from somewhere, and either the result is empty or the lookup fails. How does the ApolloX extension source the IP address? That might give us a clue to what is failing.

By devices do you mean the Colibri boards? We don’t see how the Colibris might be different: they were all purchased at the same time, and we have seen it both fail and not fail on our custom board and also on the Colibri Evaluation Board v3.2A. The development PCs, on the other hand, might have several differences, but none notable except that the PC on which it fails most of the time has Ubuntu 20.04.6 LTS, while the PCs on which it rarely fails have Ubuntu 22.04.3 LTS. Both are running on WSL2, of course.

Best regards.

jeremias.tx · September 11, 2023, 6:28pm

Okay, that seems to be the root cause here. Since the daemon.json isn’t valid, Docker probably fails to start or crashes on boot. If Docker fails to start properly, our system counts the boot as failed and tries to reboot. Since this can’t be resolved by rebooting, the system gets into the loop you’re experiencing.

Apparently, when daemon.json gets modified, the extension sources the development PC’s IP address from somewhere, and either the result is empty or the lookup fails. How does the ApolloX extension source the IP address? That might give us a clue to what is failing.

I’m not quite sure how the extension does this in the backend, so I’ll need to bring this up with our extension team. Though I’m pretty confident you found the cause of this.

I’ll let you know if there are any updates here or if the team has further questions for you.

Best Regards,
Jeremias

mmarcos.sensor · September 11, 2023, 7:01pm

The IP that was missing from the daemon.json file is the WSL host’s IP. So the extension that is running on the WSL would have to somehow detect the host’s IP from the interface connected to the local network where the colibri board is also connected.

Digging around in the ~/.apollox/ folder in WSL we found the ~/.apollox/scripts folder and in particular the ~/.apollox/scripts/utils/network.ps1 PowerShell script, which includes the GetHostIp() function:

# include
. "$env:HOME/.apollox/scripts/utils/execHostCommand.ps1"

function GetHostIp() {
    if (! [string]::IsNullOrEmpty($env:WSL_DISTRO_NAME)) {
        $_out = ExecHostCommand `
                    /mnt/c/Windows/System32/Wbem/WMIC.exe `
                    NICCONFIG WHERE IPEnabled=true GET IPAddress | grep -oE '((1?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.){3}(1?[0-9][0-9]?|2[0-4][0-9]|25[0-5])'

        return $_out.split("`n")[0]
    } else {
        $_out = ExecHostCommand `
                    hostname -I | awk '{print $1}'

        return $_out
    }
}

The function seems to use the WMIC.exe Windows application to obtain the host system’s IP addresses and then parse the output with grep, returning the first match.

When we tried to run this command, we obtained the following error:
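Note that nothing in the listing above checks whether an address was actually found. As an illustration only (this is not how the extension is implemented), a more defensive lookup could look like the sketch below. It reuses the IPv4 pattern from GetHostIp(); the helper name first_ipv4 is hypothetical.

```shell
#!/bin/sh
# Same IPv4 pattern as in the GetHostIp() listing.
IPV4_RE='((1?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.){3}(1?[0-9][0-9]?|2[0-4][0-9]|25[0-5])'

# Hypothetical helper: reads command output on stdin, prints the first IPv4
# address found, and returns non-zero if there is none.
first_ipv4() {
    ip=$(grep -oE "$IPV4_RE" | head -n 1) || true
    [ -n "$ip" ] || return 1
    printf '%s\n' "$ip"
}

# Example with empty input, as when the host command produces no output.
if host_ip=$(printf '' | first_ipv4); then
    echo "host IP: $host_ip"
else
    echo "could not determine host IP; not touching daemon.json" >&2
fi
```

Failing early in this way would avoid writing an entry with an empty host when the host command produces no usable output.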

run-detectors: unable to find an interpreter for /mnt/c/Windows/System32/Wbem/WMIC.exe

In fact, we couldn’t run any Windows .exe executable from WSL. We found this very odd, since we are certain that we were recently able to run explorer.exe . from WSL to open Windows Explorer in the current directory, and now we were getting the following error:

run-detectors: unable to find an interpreter for /mnt/c/Windows/explorer.exe

We searched for the error message online and none of the suggested solutions seemed to solve our problem, until we found this comment on a GitHub issue. The commenter found that he had

installed nuget package manager in the wsl2-container, and nuget required mono-runtime, which registered another entry (named cli) with /proc/sys/fs/binfmt_misc

We had also installed nuget on the problematic PC, so it seemed possible that we had the same problem.

The commenter suggested

disabling the cli interpreter using the command:
sudo update-binfmts --disable cli

We executed the command and now we were able to run .exe executables again.
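To check whether another PC is affected, the registered handlers can be inspected directly. The following is a sketch, assuming that both WSL’s own interop entry and the cli entry are registered on the “MZ” magic (hex 4d5a) of Windows executables, as the quoted comment suggests; the function name list_mz_handlers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the names of binfmt_misc entries registered for
# the "MZ" magic (hex 4d5a), which marks Windows executables.
list_mz_handlers() {
    dir="${1:-/proc/sys/fs/binfmt_misc}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Entry files contain a line such as "magic 4d5a"; files that
        # cannot be read are silently skipped.
        if grep -q '^magic 4d5a' "$f" 2>/dev/null; then
            basename "$f"
        fi
    done
}

list_mz_handlers
```

If this prints cli in addition to the WSL interop entry, the two handlers compete for .exe files, which would match the run-detectors error shown above.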

We tried adding the Colibri board to the ApolloX extension once again, and so far this seems to have solved our problem.

Best regards.

jeremias.tx · September 11, 2023, 8:32pm

That’s interesting, though I don’t personally know enough about the intricate workings of Windows or WSL to comment further on your solution. But at least it’s working for you now, so that’s good.

That said, this doesn’t quite explain why some devices worked for you before and others didn’t, or why you even had to make this change in the first place. On my Windows PC things just worked without any extra changes, and everything is more or less default.

Best Regards,
Jeremias