Extremely slow boot with latest TorizonCore build

Hello @RoccoBr ,
We are still not able to reproduce the problem. To be sure we have a setup as similar as possible to yours:

Can you please provide this information? We have used a Verdin iMX8M Mini Q v1.1B, but maybe you have a different revision.
Also, which carrier board (name and version) did you use as a reference to build your custom carrier board? Could you share the schematics of your carrier board with us? (via private message, and mentioning this thread)

Best regards,
Josep

hi @josep.tx ,
we are using Verdin iMX8M mini v1.1A mostly and the reference board would be the verdin development board

Hello @RoccoBr ,
We have tried again with a module the same version as yours, using the reference-minimal image. The results on our side are normal: booting takes about 5 seconds and it takes about 20 seconds from a reset until SSH connection is allowed.

  • Do you have another module (same model and revision) which you can test? If so, test it with the reference-minimal image and send us the BSP log as before.

  • Can you ping the module continuously and measure (approximately) the time it takes to be available again from the time it is reset?

This entry in the log also raises some concerns. It is related to the current/voltage power monitor:

[    5.005549] ina2xx 3-0040: error configuring the device: -6

What are the specifications of your power supply in terms of voltage and current?

Best regards,
Josep

So, I have done some tests with verdin SOM rev 1.1A and rev1.1B and the results are the same.

I have tested all TorizonCore images released and I have found out that the problem started with the torizon-core-docker-verdin-imx8mm-Tezi_5.5.0+build.11.tar image and following.

in details:

  • torizon-core-docker-verdin-imx8mm-Tezi_5.2.0-devel-202104+build.11.tar boots OK

  • torizon-core-docker-verdin-imx8mm-Tezi_5.3.0+build.7.tar boots OK

  • torizon-core-docker-verdin-imx8mm-Tezi_5.4.0+build.10.tar boots OK (this is the fastest, it boots in 16 seconds)

  • torizon-core-docker-verdin-imx8mm-Tezi_5.5.0+build.11.tar delay of 100 seconds appears

In attachment the BSP log for each build
what has been changed between 5.4 and 5.5?
TorizonCore5.3_boot_log.txt (29.5 KB)
TorizonCore5.4_boot_log.txt (33.7 KB)
TorizonCore5.5_boot_log.txt (31.1 KB)

Hello @RoccoBr ,
Having a look at your log with version 5.5, I see some entries that seem to indicate that this image has been modified. For example :

[    4.057594] systemd[1]: Set hostname to <CB20300084>.

Can you please try again with an unmodified image and send us the log?

Best regards,
Josep

hi @josep.tx ,
what do you mean by “unmodified” image? I necessarily need to modify the downloaded image with TorizonCore build in order to pass the device tree of my custom carrier board.
this is my tcbuild.yaml file passed to Torizoncore build

# >> The input section specifies the image to be taken as the base for the
# >> customization.
input:
  easy-installer:
    local: ${OS_IMAGE?Please specify the base image to use}

customization:
  device-tree:
    include-dirs:
      - baseOS/device-trees/include/
      - baseOS/device-trees/dts-arm64/
    custom: baseOS/device-trees/dts-arm64/imx8mm-verdin-nonwifi-v1.1-cloudbox_v3.dts
    overlays:
      add:
        - baseOS/m4/dt/imx8mm-verdin-cloudbox_v3-rpmsg-overlay.dts
  filesystem:
     - baseOS/changes/
output:
  ostree:
    branch: ${REF_SPEC:-cloudbox_v3_baseOS}
    commit-subject: "${COMMIT_SUBJECT:-cloudbox_v3_baseOS}"
  easy-installer:
    name: "Torizon for Cloudbox V3"
    description: "TorizonCore Linux with Cloudbox V3 Application containers"
    local: firmware/cloudbox-${IMAGE_VERSION?Please specify the image version}
    bundle:
      compose-file : docker-compose-release.yml
      username : ${DOCKER_USER?Please provide a docker username}
      password : ${DOCKER_PASS?Please provide a docker password}

comparing all the logs for the different images (5.3, 5.4, 5.5, 5.7) the only relevant difference I can spot is the order: in working images (5.2, 5.3, 5.4) the messages

REG_3P3V: disabling
REG_MIPI_PHY: disabling

appear very late, in particular after phy-ethernet and CAN driver have been initialized. Instead in the “faulty” images 5.5, 5.6 and 5.7 those messages appear way before phy-ethernet and CAN initializations

Hello @RoccoBr ,

We need to start from a “base point” in order to discard any other causes for this slow boot that you experience. And this “base point” is an unmodified image, just as it has been downloaded.

Sure, there might be some errors during the boot time, but we have to start with the bare minimum, see if it boots in a reasonable time, and if so, then add one change at a time until the slow boot issue appears.

Best regards,
Josep

Hi @josep.tx
I have installed TorizonCore images unmodified directly from EasyInstaller. I have tested 5.5, 5.6 and 5.7. none of them show the delay in booting.
in attachment the boot logs
TorizonCore5.5_build19_EasyInstaller_log.txt (28.3 KB)
TorizonCore5.6_build22_EasyInstaller_log.txt (30.1 KB)
TorizonCore5.7_build24_EasyInstaller_log.txt (30.0 KB)

Hi,
I have done some investigation with TorizonCore v5.5 following this guide TorizonCore Boot Time Optimization | Toradex Developer Center

and I have found this:

CB20300084:~$ systemd-analyze time
Startup finished in 4.025s (kernel) + 2min 13.236s (userspace) = 2min 17.262s
multi-user.target reached after 2min 8.153s in userspace
CB20300084:~$ systemd-analyze blame
2min 575ms systemd-udev-settle.service
    7.855s modem.service
    5.908s sdcard@mmcblk1p1-add.service
    4.172s plymouth-quit-wait.service
    3.588s NetworkManager-wait-online.service
    1.818s docker.service
    1.560s dev-disk-by\x2dlabel-otaroot.device
    1.525s systemd-logind.service
     800ms udisks2.service
     765ms usermount.service
     745ms systemd-resolved.service
     721ms systemd-networkd.service
     691ms systemd-timesyncd.service
     653ms systemd-udevd.service
     634ms systemd-journald.service
     556ms user@1000.service
     436ms systemd-modules-load.service
     420ms ModemManager.service
     370ms systemd-udev-trigger.service
     211ms NetworkManager.service
     137ms greenboot-task-runner.service
     127ms kmod-static-nodes.service
     124ms dev-hugepages.mount
     123ms tmp.mount
     123ms rpcbind.service
     121ms dev-mqueue.mount
     119ms sys-kernel-debug.mount
     100ms mwifiex-delay.service
      97ms systemd-remount-fs.service
      95ms greenboot-healthcheck.service
      88ms systemd-random-seed.service
      78ms iptables.service
      76ms ip6tables.service
      75ms greenboot-status.service
      34ms sys-kernel-config.mount
      34ms plymouth-quit.service
      34ms var-volatile.mount
      33ms systemd-tmpfiles-setup.service
      33ms systemd-sysctl.service
      33ms avahi-daemon.service
      31ms ostree-remount.service
      30ms systemd-tmpfiles-setup-dev.service
      27ms systemd-update-utmp-runlevel.service
      27ms user-runtime-dir@1000.service
      26ms plymouth-start.service
      25ms systemd-journal-flush.service
      23ms plymouth-read-write.service
      23ms systemd-update-utmp.service
      22ms sshdgenkeys.service
      20ms var.mount
      19ms docker-compose.service
      16ms systemd-user-sessions.service
      16ms systemd-vconsole-setup.service
      15ms sshd.socket
      12ms docker.socket
       1ms m4.service

CB20300084:~$ systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

multi-user.target @2min 8.153s
└─greenboot-status.service @2min 8.075s +75ms
  └─greenboot-task-runner.service @2min 7.929s +137ms
    └─boot-complete.target @2min 7.900s
      └─docker.service @2min 6.080s +1.818s
        └─network-online.target @2min 6.077s
          └─NetworkManager-wait-online.service @2min 2.486s +3.588s
            └─NetworkManager.service @2min 2.270s +211ms
              └─network-pre.target @2min 2.263s
                └─iptables.service @2min 2.155s +78ms
                  └─basic.target @2min 2.119s
                    └─sockets.target @2min 2.118s
                      └─sshd.socket @2min 2.098s +15ms
                        └─sysinit.target @2min 2.071s
                          └─rngd.service @2min 2.069s
                            └─systemd-journald.socket @902ms
                              └─system.slice @807ms
                                └─-.slice @807ms
CB20300084:~$ systemd-analyze critical-chain docker.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

docker.service +1.818s
└─network-online.target @2min 6.077s
  └─NetworkManager-wait-online.service @2min 2.486s +3.588s
    └─NetworkManager.service @2min 2.270s +211ms
      └─network-pre.target @2min 2.263s
        └─iptables.service @2min 2.155s +78ms
          └─basic.target @2min 2.119s
            └─sockets.target @2min 2.118s
              └─sshd.socket @2min 2.098s +15ms
                └─sysinit.target @2min 2.071s
                  └─rngd.service @2min 2.069s
                    └─systemd-journald.socket @902ms
                      └─system.slice @807ms
                        └─-.slice @807ms
CB20300084:~$

comparing the output of the above commands with previous TorizonCore images, I can see that systemd-udev-settle.service is not present before v5.5.

there are some posts online regarding systemd-udev-settle.service causing slow boot.
My question is: why this is happening from TorizonCore v5.5 and not before?

as reported here https://www.freedesktop.org/software/systemd/man/systemd-udev-settle.service.html: Using this service is not recommended.

I can confirm the cause of the slow boot is systemd-udev-settle.service. as suggested here Topicbox I disabled the service

CB20300084:~$ sudo systemctl mask systemd-udev-settle.service
Created symlink /etc/systemd/system/systemd-udev-settle.service → /dev/null.

when I reboot, the system comes up in 16 seconds.

is it possible to disable this service directly in TorizonCore image? is it necessary for other things?

Hi @RoccoBr,

Just wanted to chime in here. We had a discussion here internally about your issue and I have some comments and suggestions.

I have installed TorizonCore images unmodified directly from EasyInstaller. I have tested 5.5, 5.6 and 5.7. none of them show the delay in booting.

First of all given your comment here, it seems pretty clear that this issue seems to be caused/made worse by whatever changes are made in your custom TorizonCore image.

Now to be clear the systemd-udev-settle.service is not being explicitly included in our images by us. Probably what happened is that some tangential package that we use in our builds got updated or changed, which caused this service to be added. That said we could remove this on our side. But, we don’t understand why the issue is happening to begin with, and why it only affects your setup.

Basically we prefer not to make such a change without understanding why first. Though this investigation is difficult for us to do since as you’ve seen we can’t reproduce this boot delay you’re experiencing as it seems to rely on your software configuration.

If it’s not too much to ask, it’d be appreciated if you could help investigate what specific change in the software causes this issue on your side. As it’s not know to us the full list of changes you did to the base image.

With all that said, you can easily just disable this service on your custom image specifically, then capture this change. Since systemd services are enabled/disabled in /etc such modifications can be captured using TorizonCore Builder.

Best Regards,
Jeremias

Hi Jeremias,
I want to reiterate that there is no change in our software/setup that causes the issue: we are being using the same configuration since TorizonCore v5.2, and that configuration is working fine with TorizonCore v5.2, v5.3, v5.4. The only change was switching to latest TorizonCore images, NOTHING ELSE.

I can disable the service in /etc with TorizonCore builder but my worry is that it’s just a workaround for the following reboot, I suspect I will have this problem every time I do an OS update via the OTA.

In any case I am more than happy to help you investigating this issue, let me know how I can support you.

Best regards,
Rocco

Hi @RoccoBr I’m just spitballing here but can you try to boot with a wired ethernet connection available? I’m wondering if this is related to not having enough entropy available and I seem to recall hearing that having a wired ethernet with packets on it would provide more entropy.

Also, if you are able to get a shell prompt (it’s not clear if you get to a login prompt in this situation or not), can you run cat /proc/sys/kernel/random/entropy_avail both before and after the boot delay?

Drew

I think you misunderstand. You tested with out of the box 5.5/5.6/5.7 and had no issues with the boot delay, correct?

You then apply some changes and have a custom TorizonCore image that is based on 5.5. There is now a boot delay. I am asking you to investigate what customization you are applying that causes this reaction.

Then we can have a discussion whether it makes sense to remove this service by default.

I can disable the service in /etc with TorizonCore builder but my worry is that it’s just a workaround for the following reboot, I suspect I will have this problem every time I do an OS update via the OTA.

Also I’m not sure I understand the issue here. Wouldn’t you just continue to disable this service going forward in your updates? Or why would the system be updating to an OS that has this service enabled?

Best Regards,
Jeremias

Hi Jeremias,
I understand that you want to focus on the differences between the unmodified image and my custom image, using the same base image for example TorizonCore v5.5
In my opinion this is not the best approach, it would be useful if I had this issue with ANY image: if I use TorizonCore v5.4 both the unmodified image and my custom image boot properly.

But if you think it can help you investigate better the issue, I’ll prepare a document where I try to highlight as much as possible my customisation and possibly I would send also a copy of my device tree and/or schematics of our carrier board.

Regards,
Rocco

hi @drew.tx ,
I am already using an active wired ethernet connection

what do you mean by “before and after the boot delay”? I don’t get a shell prompt until the device has fully booted
I can run the command cat /proc/sys/kernel/random/entropy_avail when the boot delay is present and when the delay has been eliminated with the command systemctl mask systemd-udev-settle.service. In bith cases entropy is 4034

OK. I guess it’s not what I was thinking. We probably need to understand what is taking so long in udev in your case. The workaround is to disable systemd-udev-settle but I’d like to understand the root cause if possible.

Have you had any luck reproducing on the Ixora board?

hi @drew.tx ,
I have tried to install our custom image on the dahlia board, but unfortunately it doesn’t boot at all