Physical link in Linux image not reliably established after power-on

Florian_K · July 31, 2017, 1:05pm

The physical link is not established reliably when the TK1 SOM is powered on with the Linux LXDE image running on it. The link LED of the switch the SOM is connected to does not light up (a lit LED indicates that the physical link was established properly). The behaviour occurs randomly.

I wrote a test script which pings the SOM every minute and, whenever the board is pingable, switches a power supply off and on via a Phidget DigitalOutput. If the board is not pingable, the script halts in an endless loop and the SOM is kept powered on so that it can be investigated over the serial interface.
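For readers who want to reproduce the setup, the loop described above could look roughly like the following Python sketch. It is not the actual test script: the function names, the 5 second off time, and the `set_power` callback (which stands in for whatever drives the Phidget DigitalOutput) are assumptions made for illustration. Unlike the original, it returns on the first missed ping instead of spinning in an endless loop, but it likewise leaves the board powered on.

```python
import subprocess
import time
from typing import Callable, Optional


def is_pingable(host: str) -> bool:
    """Send one ICMP echo request (Linux iputils ping syntax) and report success."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def run_power_cycle_test(
    ping: Callable[[], bool],
    set_power: Callable[[bool], None],  # hypothetical hook for the power switch
    interval_s: float = 60.0,           # "every minute" from the description
    off_time_s: float = 5.0,            # assumed off time, not from the thread
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Power-cycle the board as long as it answers pings.

    Stops at the first missed ping and leaves the board powered on so it
    can be inspected over the serial console. Returns the number of
    completed power cycles.
    """
    cycles = 0
    set_power(True)
    while max_cycles is None or cycles < max_cycles:
        sleep(interval_s)
        if not ping():
            return cycles  # board stays powered for investigation
        set_power(False)
        sleep(off_time_s)
        set_power(True)
        cycles += 1
    return cycles
```

A call such as `run_power_cycle_test(lambda: is_pingable("192.168.10.2"), my_switch)` would wire it up, where both the address and `my_switch` are placeholders.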

This issue relates to a similar issue in U-Boot.

Florian_K · August 1, 2017, 4:09am

I attached a journalctl log file and a /var/log/Xorg log file (the issue occurred approx. 6 minutes after the test script was started).

The journalctl log file indicates that the link is down due to PCIe:

Jun 18 22:24:49 apalis-tk1 kernel: PCIE.C: tegra_pcie_enable_regulators : regulator hvdd_pex
Jun 18 22:24:49 apalis-tk1 kernel: PCIE.C: tegra_pcie_enable_regulators : regulator pexio
Jun 18 22:24:49 apalis-tk1 kernel: PCIE.C: tegra_pcie_enable_regulators : regulator avdd_plle
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, ignoring
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, ignoring
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: no ports detected

Florian_K · August 1, 2017, 4:09am

To investigate a possible impact of EEE I disabled it during kernel boot:

Apalis TK1 # printenv defargs 
defargs=lp0_vec=2064@0xf46ff000 core_edp_mv=1150 core_edp_ma=4000 usb_port_owner_info=2 lane_owner_info=6 emc_max_dvfs=0
Apalis TK1 # setenv defargs lp0_vec=2064@0xf46ff000 core_edp_mv=1150 core_edp_ma=4000 usb_port_owner_info=2 lane_owner_info=6 emc_max_dvfs=0 igb.EEE=0
Apalis TK1 # saveenv
Saving Environment to MMC...
Writing to MMC(0)... done

After power-on, EEE is disabled:

root@apalis-tk1:~# ethtool --show-eee enp1s0 
EEE Settings for enp1s0:
        EEE status: disabled
        Tx LPI: disabled
        Supported EEE link modes:  100baseT/Full 
                                   1000baseT/Full 
        Advertised EEE link modes:  Not reported
        Link partner advertised EEE link modes:  100baseT/Full 
                                                 1000baseT/Full
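If this check is to be automated inside a test run, the status line can be pulled out of the `ethtool --show-eee` output with a few lines of Python. This is only a sketch based on the output format shown above; `eee_status` is a name chosen here, and the function returns whatever text follows `EEE status:` without interpreting it.

```python
from typing import Optional


def eee_status(ethtool_output: str) -> Optional[str]:
    """Return the text after 'EEE status:' or None if the line is absent."""
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "EEE status":
            return value.strip()
    return None


sample = """EEE Settings for enp1s0:
        EEE status: disabled
        Tx LPI: disabled
"""
print(eee_status(sample))
```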

I ran the test script again. Unfortunately the test script halted after approx. 10 minutes with a missing link (see journalctl log file with EEE disabled).

marcel.tx · July 31, 2017, 7:58pm

I believe disabling Energy Efficient Ethernet (EEE) in the Linux kernel igb driver fixed this, didn’t it?

Florian_K · August 1, 2017, 9:29am

Unfortunately not. I reproduced the issue again and attached another full journalctl log file (EEE disabled). (Note that the timestamps are not correct until time synchronization has taken place.) The log file contains the same log messages for PCIe as the other log file:

cat journalctl_EEE_disabled_again_full.log | grep PCI

Jun 18 22:24:49 apalis-tk1 kernel: PCIE.C: tegra_pcie_enable_regulators : regulator hvdd_pex
Jun 18 22:24:49 apalis-tk1 kernel: PCIE.C: tegra_pcie_enable_regulators : regulator pexio
Jun 18 22:24:49 apalis-tk1 kernel: PCIE.C: tegra_pcie_enable_regulators : regulator avdd_plle
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, ignoring
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, ignoring
Jun 18 22:24:49 apalis-tk1 kernel: PCIE: no ports detected
Jun 18 22:24:49 apalis-tk1 kernel: ehci-pci: EHCI PCI platform driver
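To classify many boot logs without reading them by hand, the relevant messages can be matched programmatically. The sketch below relies only on the two message strings visible in the excerpts above (`link down, ignoring` and `no ports detected`); other kernel versions may word them differently, and `summarize_pcie_boot` is a name invented for this example.

```python
import re

# Patterns copied from the kernel messages quoted in this thread.
_LINK_GIVEN_UP = re.compile(r"PCIE: port (\d+): link down, ignoring")
_NO_PORTS = "PCIE: no ports detected"


def summarize_pcie_boot(log_text: str) -> dict:
    """Report which PCIe ports were given up on and whether none came up."""
    ports = sorted({int(m.group(1)) for m in _LINK_GIVEN_UP.finditer(log_text)})
    return {
        "ports_given_up": ports,
        "no_ports_detected": _NO_PORTS in log_text,
    }


sample_log = "\n".join([
    "Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, retrying",
    "Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 0: link down, ignoring",
    "Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, retrying",
    "Jun 18 22:24:49 apalis-tk1 kernel: PCIE: port 1: link down, ignoring",
    "Jun 18 22:24:49 apalis-tk1 kernel: PCIE: no ports detected",
])
print(summarize_pcie_boot(sample_log))
```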

Florian_K · August 1, 2017, 9:30am

If the link is missing, listing the PCI devices produces the following output:

root@apalis-tk1:~# lspci
root@apalis-tk1:~#
root@apalis-tk1:~# lspci -t
-[0000:00]-

If the link is established, listing the PCI devices produces the following output:

root@apalis-tk1:~# lspci
00:00.0 PCI bridge: NVIDIA Corporation TegraK1 PCIe x1 Bridge (rev a1)
01:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
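The difference between the two outputs lends itself to a simple on-target check, for example logged at every boot. The following is a minimal sketch assuming the `lspci` text format shown above; the helper names and the default search string are choices made for this example.

```python
import subprocess


def ethernet_controller_present(
    lspci_output: str,
    needle: str = "I210 Gigabit Network Connection",
) -> bool:
    """True if any lspci line names an Ethernet controller matching needle."""
    return any(
        "Ethernet controller" in line and needle in line
        for line in lspci_output.splitlines()
    )


def read_lspci() -> str:
    """Run lspci on the target and return its standard output."""
    return subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout


link_up = (
    "00:00.0 PCI bridge: NVIDIA Corporation TegraK1 PCIe x1 Bridge (rev a1)\n"
    "01:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)\n"
)
print(ethernet_controller_present(link_up))
```

On the board itself, `ethernet_controller_present(read_lspci())` would perform the live check.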

marcel.tx · August 10, 2017, 3:47pm

Please find a fix for this on our -next branch as well now.

Florian_K · August 11, 2017, 8:49am

Thanks a lot.

marcel.tx · August 11, 2017, 11:25am

You are very welcome and any feedback is also welcome (;-p).

BTW: I refined the patch a little just now.

Florian_K · August 16, 2017, 7:40am

I ran the test script over the long weekend (from last Friday on). Unfortunately the test script halted again.

But since the device is kept powered on after the script halts, I was able to check the physical link LED on the switch, which was lit as it should be. lspci lists the Ethernet controller (Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)) as it should, and the device responds to ping. That means the likely root cause for the script halting is either (a) some network issue over the weekend, (b) too short a delay between power cycles in the test script, or (c) that we missed your refinement of the patches. I think (a) is the most probable.

I will check whether we included your patch refinement, increase the power cycle delay in the test script, and let the test script run again until tomorrow.

Florian_K · August 16, 2017, 7:46am

I appreciate that

Florian_K · August 16, 2017, 7:57am

Which commit(s) exactly do you mean by “refined the patch”?

marcel.tx · August 16, 2017, 8:23am

OK, thanks for letting us know. We actually also ran a collection of modules in our temperature chamber over the full temperature range (e.g. -25 to 85 deg C) during the Assumption of Mary holiday. However, at first just with stock 2.7b3 to have a baseline. I am about to update them all with my fixes and we will see…

flixr · August 18, 2017, 4:01pm

After merging the next branch, I’m now seeing this in the dmesg log:

[    6.740520] gpio wake14 for gpio=235
[    6.746196] gpio-keys gpio-keys.3: Failed to request GPIO 117, error -16
[    6.754984] gpio-keys: probe of gpio-keys.3 failed with error -16

It seems there is still something missing for the added GPIOs…

marcel.tx · August 21, 2017, 12:54pm

As mentioned in your other thread we did successfully test the U-Boot fix and therefore validated the hardware to be fine. Further validation & verification running the full BSP will be conducted over time also with the upcoming V1.2A hardware revision.

Florian_K · September 28, 2017, 2:22pm

I would like to run the test script again with:

Toradex TK1 non-mainline V2.7b3 image with u-boot, dtb, kernel, modules, etc. replaced with latest:
u-boot commit
linux commit
Apalis Eval Board v1.1A
external power supply

Which of the available power supply alternatives (Apalis Eval Board data sheet, p. 11) should I use to rule out an influence of the power supply on the test? The alternative must support continuous power supply, see this related question.

marcel.tx · September 29, 2017, 7:25pm

I believe @diego.tx answered that one already, right? Please note that we are just wrapping up the Q3 2.7b4 release and should have it available shortly.

Florian_K · October 5, 2017, 8:48am

Yes. I used the Ixora Carrier Board instead of the Apalis Eval Board and powered it with 10V, with the current limit set to a sufficiently high upper bound.

marcel.tx · October 9, 2017, 1:02pm

Our temperature chamber test setup is actually also based on Ixoras.

BTW: Meanwhile, we have released 2.7b4.