Thanks a lot for that hint.
I wrote the script and let it run over the weekend for the warm-start scenario: the issue did not reproduce within 70 hours.
I am blocked right now because the DC power switch, which I need for the cold-start scenario, has not arrived yet. I will let you know about any status changes…
I reproduced the behaviour in Linux and attached /var/log/kern.log, /var/log/dmesg and the output of sudo ethtool eth0. The output for the good case (physical link is present) is attached here.
If one tries to establish a link in U-Boot with
setenv autoload false; if env exists ethaddr; then; else setenv ethaddr 00:14:2d:00:00:00; fi; pci enum; dhcp; run setethupdate;
and the link is not established
e1000: no NVM
e1000: e1000#0: ERROR: Valid Link not detected: -8
then executing reset multiple times and trying to establish the link again each time always fails… A power off/on cycle is required to get a link again.
If the link is not established in U-Boot, the PCI devices are still listed as usual:
Apalis TK1 # pci
Scanning PCI devices on bus 0
BusDevFun VendorId DeviceId Device Class Sub-Class
_____________________________________________________________
00.02.00 0x10de 0x0e13 Bridge device 0x04
Apalis TK1 # pci header 00.02.00
vendor ID = 0x10de
device ID = 0x0e13
command register ID = 0x0007
status register = 0x0010
revision ID = 0xa1
class code = 0x06 (Bridge device)
sub class code = 0x04
programming interface = 0x00
cache line = 0x08
latency time = 0x00
header type = 0x01
BIST = 0x00
base address 0 = 0x00000000
base address 1 = 0x00000000
primary bus number = 0x00
secondary bus number = 0x01
subordinate bus number = 0x01
secondary latency timer = 0x00
IO base = 0x11
IO limit = 0x11
secondary status = 0x0000
memory base = 0x1300
memory limit = 0x1300
prefetch memory base = 0x2001
prefetch memory limit = 0x1ff1
prefetch memory base upper = 0x00000000
prefetch memory limit upper = 0x00000000
IO base upper 16 bits = 0x0000
IO limit upper 16 bits = 0x0000
expansion ROM base address = 0x00000000
interrupt line = 0x00
interrupt pin = 0x01
bridge control = 0x0000
Within Linux, however, a colleague found in rarer cases that the PCI device was missing…
“Energy Efficient Ethernet” (EEE) is mentioned as another possible root cause for the observed behaviour, and it is enabled by default in Angström:
root@apalis-tk1:~# ethtool --show-eee enp1s0
EEE Settings for enp1s0:
EEE status: enabled - active
Tx LPI: 0 (us)
Supported EEE link modes: 100baseT/Full
1000baseT/Full
Advertised EEE link modes: 100baseT/Full
1000baseT/Full
Link partner advertised EEE link modes: 100baseT/Full
1000baseT/Full
I temporarily disabled EEE with
ethtool --set-eee enp1s0 eee off
and then disconnected and reconnected the Ethernet cable at the switch over and over again. I was not able to reproduce the missing link (the behaviour my colleague observed).
Unfortunately I am not able to test the same in the warm-start and cold-start scenarios, because EEE is enabled again after every reset or power cycle.
Do you know how to disable EEE persistently over reboots?
I will open a separate question for that, because it is a different topic…
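For the persistence question above, one possible approach is to re-apply the setting at every boot. The following is only a minimal sketch and not something confirmed in this thread: the function names eee_is_enabled and disable_eee are made up for illustration, and how the script gets called at boot (for example from a systemd unit or an init script) is an assumption that depends on the image. It only uses the two ethtool invocations already shown above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any BSP.

# Returns 0 if the `ethtool --show-eee` text on stdin reports EEE as enabled.
eee_is_enabled() {
    grep -q 'EEE status: enabled'
}

# Disable EEE on the given interface if it is currently enabled.
# Assumption: this is called from some boot-time hook on the target.
disable_eee() {
    iface="$1"
    if ethtool --show-eee "$iface" | eee_is_enabled; then
        ethtool --set-eee "$iface" eee off
    fi
}
```

Calling disable_eee enp1s0 early during boot would turn EEE off again after each reset or power cycle. The status check is kept separate so that it can be tried against saved ethtool output without touching the hardware.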
Yesterday I received the Phidgets “Digital Output” to control the DC power supply. I wrote a test script and was able to reproduce the issue with the Apalis TK1 Linux BSP v2.7b3 in the cold-start scenario as well (after 2 1/2 hours of power cycling, with one cycle approx. every 45 seconds). I can reproduce it again, so please let me know which log files would be valuable for you and I will provide them.
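As a rough illustration of the cold-start test described above, the following sketch shows the arithmetic (2 1/2 hours at roughly 45 seconds per cycle is about 200 power cycles) and the general shape of such a loop. It is not the actual test script: cycles_in, run_until_failure and the two hook commands are hypothetical names, and the real power switching (via the Phidgets output) and the link check are left to whatever commands are passed in.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical.

# Number of complete power cycles that fit into a run.
# Usage: cycles_in <run_minutes> <seconds_per_cycle>
cycles_in() {
    echo $(( $1 * 60 / $2 ))
}

# Repeat power cycles until the check fails or max cycles are done.
# Usage: run_until_failure <max_cycles> <power_cmd> <check_cmd>
#   power_cmd: command that power-cycles the board and waits for boot
#   check_cmd: command that returns 0 if the Ethernet link came up
# Prints the number of the first failing cycle, or 0 if none failed.
run_until_failure() {
    max="$1"; power_cmd="$2"; check_cmd="$3"
    i=1
    while [ "$i" -le "$max" ]; do
        "$power_cmd" || return 2      # power switching itself failed
        if ! "$check_cmd"; then
            echo "$i"                 # link missing in this cycle
            return 1
        fi
        i=$((i + 1))
    done
    echo 0
    return 0
}
```

For example, cycles_in 150 45 prints 200, which could be passed as the first argument to run_until_failure together with the site-specific power and link-check commands.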
I will run the test script over the weekend with our image (which has EEE disabled). Hopefully the issue is not reproducible with that change anymore…
Over the weekend, in a run of approx. 60 hours with our image (EEE disabled during kernel boot), the issue did not occur again. I can also run the same script overnight until tomorrow morning (approx. 12 hours) with your image v2.7b3 and EEE disabled in U-Boot.
I created a new question specific to the user-space link establishment issue.
It turns out that the current PCIe reset implementation in the PCIe board init function is not quite working reliably due to PCIe reset timing violations. Fix this by overriding the tegra_pcie_board_port_reset() function.
Please find the respective patches on our U-Boot -next branch.
Great. Thanks a lot for the patch. Does this patch fix the issue at the Linux (kernel/user) level as well? (related forum question)
I guess that depends. Most probably yes, if one already brings up the link in U-Boot. However, a regular boot won’t do that. I’m actually working on an improved solution for Linux as well and will update the respective thread shortly.
BTW: Please note that my -next stuff already went through multiple iterations with the latest one dating back to yesterday evening.
Ok. We will figure it out either way when we run the tests again.
I’m in the final stages of testing and will commit the Linux kernel part soon as well.
Great. Thanks.
What exactly do you mean by “Also allow optionally bringing up the PCIe switch as found on the Apalis Evaluation board. Note however that the Apalis PCIe port is also left disabled in the device tree by default.” in the commit message of the U-Boot bugfix?
It means exactly that. One may optionally bring up the PCIe switch also in U-Boot if desired/required or whatever. But regular booting does not require any of that, and in fact regular booting does not touch PCIe at all. Basically, unless one explicitly runs pci enum, PCIe won’t be touched.