Eth0 not initializing occasionally on bootup

I have one TK1 module that fails to initialize the Eth0 wired Ethernet port on 3 out of 10 resets. It happens regardless of reset type (power-up, button press, GUI-commanded restart). It also happens on different carrier boards, following the SOM in all cases. I have not seen this with any of the other 20 TK1 modules I’ve worked with. However, our customer experienced a similar Eth0 failure with a TK1 module on an Ixora carrier; that was reportedly a single event and they did not investigate. Most of the time, Eth0 of this particular module does initialize correctly and works without problems until the next reset.

I have not seen any posts on this forum regarding this and wonder if it’s just a case of a defective SOM, but I have one observation to make in light of another recently uncovered aspect of the TK1. It has been confirmed that the TK1 is one of the Apalis modules that shut down their local power rails on reset, for about 60 ms by my measurement. Since the SOM boots from embedded flash rather than a hard disk, it probably reaches the network init code quite quickly after reset. Assuming any crystal oscillators are also powered down during reset, can you confirm that the on-module designed time delay from the POWER_ENABLE_MOCI rising edge to when the CPU starts executing is sufficient to allow all clocks to stabilize?

Is that with downstream or mainline BSP? What exact software version?

Sorry, I’m a hardware guy and not that familiar with OS, BSP, etc. Below is the boot log at the very start. Our customer is using an older OS kernel due to the need to support the NVIDIA JetPack drivers which haven’t been updated for newer releases:

U-Boot 2016.11-2.7.2+g60021a4 (Apr 10 2017 - 07:55:49 +0200)
TEGRA124
DRAM:  2 GiB
MMC:   Tegra SD/MMC: 0, Tegra SD/MMC: 1, Tegra SD/MMC: 2
In:    serial
Out:   serial
Err:   serial
Model: Toradex Apalis TK1 2GB V1.1A, Serial# 02938385
Net:   No ethernet found.
Hit any key to stop autoboot:  0
Booting from internal eMMC chip...
reading tegra124-apalis-eval.dtb
49673 bytes read in 16 ms (3 MiB/s)
reading uImage
5461776 bytes read in 138 ms (37.7 MiB/s)
## Booting kernel from Legacy Image at 81000000 ...
   Image Name:   Linux-3.10.40-2.7.2+g0fd5295
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:    5461712 Bytes = 5.2 MiB
   Load Address: 80008000
   Entry Point:  80008000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 82000000
   Booting using the fdt blob at 0x82000000
   Loading Kernel Image ... OK
   Using Device Tree in place at 82000000, end 8200f208
Starting kernel ...
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.40-2.7.2+g0fd5295 (linuxdev@linuxdev.toradex.int) (gcc version 6.2.1 20161016 (Linaro GCC 6.2-2016.11) ) #1 SMP PREEMPT Mon Apr 10 07:57:12 CEST 2017
[    0.000000] CPU: ARMv7 Processor [413fc0f3] revision 3 (ARMv7), cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[    0.000000] Machine: apalis-tk1, model: Toradex Apalis TK1, serial: 0
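When comparing good and bad boots, the relevant facts in a log like the one above are the U-Boot version, the kernel version, and the `Net:` line. The following is a minimal sketch, assuming the serial console output has been captured as text; the function name `summarize_boot_log` and the shortened `SAMPLE` string are illustrative, not part of any Toradex tooling.

```python
import re

def summarize_boot_log(text):
    """Pull the version strings and the U-Boot Ethernet status out of a boot log."""
    uboot = re.search(r"^U-Boot (\S+)", text, re.MULTILINE)
    kernel = re.search(r"Linux version (\S+)", text)
    net = re.search(r"^Net:\s+(.*)$", text, re.MULTILINE)
    net_status = net.group(1).strip() if net else None
    return {
        "uboot": uboot.group(1) if uboot else None,
        "kernel": kernel.group(1) if kernel else None,
        "net": net_status,
        # True only when U-Boot itself reported that it found no Ethernet device.
        "eth_missing": net_status is not None and "No ethernet found" in net_status,
    }

# Shortened excerpt of the log above, used only to exercise the function.
SAMPLE = (
    "U-Boot 2016.11-2.7.2+g60021a4 (Apr 10 2017 - 07:55:49 +0200)\n"
    "Net:   No ethernet found.\n"
    "[    0.000000] Linux version 3.10.40-2.7.2+g0fd5295 (linuxdev@linuxdev.toradex.int)\n"
)

if __name__ == "__main__":
    print(summarize_boot_log(SAMPLE))
```

Running this over the logs of many resets would make it easy to count how often the `Net:` line reports a missing device and to confirm which software versions were actually booted.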

Sorry, I’m a hardware guy and not that familiar with OS, BSP, etc.

Understood. However, as you may know, such “hardware” issues are oftentimes worked around in software, as is the case here as well.

Below is the boot log at the very start.

You are using the long since unsupported early beta BSP 2.7b2.

Our customer is using an older OS kernel due to the need to support the NVIDIA JetPack drivers which haven’t been updated for newer releases:

That is not quite accurate. While NVIDIA did not update their JetPack, they very recently released Linux for Tegra (L4T) R21.7 with several fixes, including some Spectre/Meltdown mitigations. You may want to upgrade at your earliest convenience, which you can also do using the new Toradex Easy Installer. However, for the time being the latest out-of-the-box images available are still based on NVIDIA’s L4T R21.6 and our BSP 2.7b5. We are in the process of updating them to NVIDIA’s L4T R21.7 and our BSP 2.8b3.

Marcel, thanks for clarifying. I was not familiar with the details regarding driver compatibility. Our customer is responsible for all software design of their product and restrict themselves to using official releases from their suppliers, Toradex in this case, to avoid modifying released/supported code themselves. The update to BSP 2.8b3, once Toradex releases it, will be a painfully slow and cautious process as they are producing a medical device.

I am curious about your reply: “…such “hardware” issues are oftentimes worked around in software, as is the case here as well.” If this is a workaround case as you imply, what exactly is the code trying to fix? Is there a known problem related to Ethernet startup, or do I just have a “funny” TK1 SOM that I should warranty swap?

Marcel, thanks for clarifying.

You are very welcome.

I was not familiar with the details regarding driver compatibility. Our customer is responsible for all software design of their product and restrict themselves to using official releases from their suppliers, Toradex in this case, to avoid modifying released/supported code themselves. The update to BSP 2.8b3, once Toradex releases it, will be a painfully slow and cautious process as they are producing a medical device.

Understood. However, maybe they should also not be using any beta-quality BSPs. That said, for the TK1 there have only been beta BSPs so far, with the upcoming Q3 BSP 2.8b4 scheduled to become the first stable one.

I am curious about your reply: “…such “hardware” issues are oftentimes worked around in software, as is the case here as well.” If this is a workaround case as you imply, what exactly is the code trying to fix?

It is basically a combination of a few Intel i210/i211 errata and our TK1 design, which requires e.g. level-shifters on certain internal and external signals. To fully comply with all errata we had to use completely separate rails for that MACPHY and make sure the power-up sequence is kosher (e.g. avoid adverse back-feeding, do a proper reset, wait for rail stabilisation, release reset and so forth). Otherwise it could happen that some MACPHY-internal PLLs do not lock and the link never comes up.
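The ordering described above (hold reset, bring up the dedicated rail, wait for it to settle, then release reset) can be sketched as follows. This is only an illustration of the sequence, not the actual BSP code: the hooks `set_rail` and `set_reset` and both delay values are hypothetical placeholders, and real timings would have to come from the Intel i210/i211 documentation and the module design.

```python
import time

def power_up_macphy(set_rail, set_reset,
                    rail_settle_s=0.010, reset_hold_s=0.001,
                    sleep=time.sleep):
    """Run a reset-first power-up sequence using caller-supplied hardware hooks.

    set_rail(on) and set_reset(asserted) are hypothetical callbacks; the
    default delays are placeholders, not datasheet values.
    """
    set_reset(True)       # keep the device in reset before any rail comes up
    set_rail(True)        # enable the dedicated supply rail
    sleep(rail_settle_s)  # wait for the rail to stabilise
    sleep(reset_hold_s)   # keep reset asserted a little longer on a stable rail
    set_reset(False)      # release reset so the internal PLLs start cleanly

if __name__ == "__main__":
    # Dry run with fake hooks that only record what would happen.
    events = []
    power_up_macphy(lambda on: events.append(("rail", on)),
                    lambda asserted: events.append(("reset", asserted)),
                    sleep=lambda s: events.append(("sleep", s)))
    print(events)
```

Passing the hardware access in as callbacks keeps the ordering testable without real hardware, which is the only point this sketch tries to make.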

Is there a known problem related to Ethernet startup,

As I mentioned above, it is now all covered by some errata. During early validation & verification in our temperature chambers we did face some issues and subsequently made sure our hardware design can fully cope with them. Unfortunately, some errata workarounds are only possible with quite delicate hardware/software interaction, which is what that link I sent you above is doing.

or do I just have a “funny” TK1 SOM that I should warranty swap?

I assume it has one of them “funny” MACPHYs refusing to work right unless the power-up sequence is kosher. However, we assume it can be worked around e.g. by using a later BSP. That said, should you continue to face any issues if already running the latest BSP we are of course happy to RMA that particular module.

So there is some internal history around this particular problem. This SOM will not end up in a sold system so I can live with the Eth0 failing one out of 3 or 4 power-ups. I’ll assume that a later BSP will contain the delicate software fix for this and hang onto the module for now. This will become a big issue with our customer, however, if a newer BSP does not eliminate the problem, as that means we received at least two SOMs in 50 units bought that exhibit the issue, not a particularly good manufacturing failure statistic. It would be useful if you had utility code that could recover the PHY without having to reset the SOM. That would alleviate any concern our customer may have.

So there is some internal history around this particular problem.

Yes, but we fixed this almost a year ago and consider it fully solved.

This SOM will not end up in a sold system so I can live with the Eth0 failing one out of 3 or 4 power-ups.

You should still update it to a later BSP to make sure it really is this issue. Depending on your exact use case, carrier board, environment and so forth, it could of course also be something else which we may not even know about yet.
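Because the fault is intermittent, a handful of good boots after an update proves little. As a rough guide, assuming resets are independent and the failure rate is the roughly 3 in 10 reported at the start of this thread, the number of consecutive clean resets needed before a lucky streak becomes unlikely can be estimated as below. The function name `resets_needed` and the 1% threshold are illustrative choices.

```python
import math

def resets_needed(failure_rate, max_chance=0.01):
    """Smallest n such that n clean resets in a row would happen by chance
    with probability at most max_chance if the fault were still present,
    i.e. (1 - failure_rate) ** n <= max_chance. Assumes independent resets."""
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    if not 0 < max_chance < 1:
        raise ValueError("max_chance must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(max_chance) / math.log(1 - failure_rate))

if __name__ == "__main__":
    for rate in (0.30, 0.25):
        print(rate, resets_needed(rate))
```

With a 30% failure rate this gives 13 resets, and with 25% it gives 17, so a test run of a few dozen resets on the updated BSP would be a reasonable minimum under these assumptions.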

I’ll assume that a later BSP will contain the delicate software fix for this and hang onto the module for now. This will become a big issue with our customer, however, if a newer BSP does not eliminate the problem,

Definitely, which is another reason one should really update it to find out.

Please also note that it is the customer’s full responsibility to make sure they are running the latest BSP with all applicable fixes. That said, future modules will no longer ship with any BSP pre-installed at all but will instead feature the Toradex Easy Installer, forcing the customer to first install an image suitable for them.

as that means we received at least two SOMs in 50 units bought that exhibit the issue, not a particularly good manufacturing failure statistic.

Unfortunately, such reliability issues are generally rather hard to catch during minimalistic mass production tests. This is why we separately do run extensive validation & verification tests in our temperature chambers to make sure our products perform reliably as per their specification.

It would be useful if you had utility code that could recover the PHY without having to reset the SOM. That would alleviate any concern our customer may have.

Unfortunately, in this case, this is very low-level BSP code which is integrated in the bootloader and the Linux kernel, and it is rather difficult to do with any such “utility code”. That said, if you are afraid of breaking any user space software on your modules, you could simply upgrade the bootloader and/or the Linux kernel and accompanying device tree separately, and the issue should be gone for good.