Ethernet does not come back up after standby

edwaugh · March 12, 2021, 11:54am

Hi,
As part of testing the suspend to ram function I noticed that ethernet does not come back up on my board after sleep. The command I use is:

sudo rtcwake -m mem -d rtc1 -s10

The board sleeps and the command prompt returns on the serial interface after 10 seconds.

However, after that Ethernet is unresponsive. I can’t ping the board, ping from it to hostnames or IP addresses, or connect over SSH. ifconfig shows:

ethernet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500  metric 1
        inet 192.168.1.108  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::e9a5:de17:e3f5:c58c  prefixlen 64  scopeid 0x20<link>
        ether 00:14:2d:67:28:91  txqueuelen 1000  (Ethernet)
        RX packets 679  bytes 62912 (61.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 896  bytes 82279 (80.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

route shows:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    100    0        0 ethernet0
172.17.0.0      *               255.255.0.0     U     0      0        0 docker0
192.168.1.0     *               255.255.255.0   U     100    0        0 ethernet0

All suggestions very welcome.
Thanks
Ed

edwaugh · March 12, 2021, 2:33pm

I can see that the ethernet connection does not recover by itself but taking the link down and up again does fix it.

    verdin-imx8mm-06760593:~$ sudo rtcwake -m mem -d rtc1 -s10
    rtcwake: assuming RTC uses UTC ...
    rtcwake: wakeup from "mem" using rtc1 at Thu Jan  1 04:56:14 1970
    verdin-imx8mm-06760593:~$
    verdin-imx8mm-06760593:~$ ping google.co.uk
    ping: bad address 'google.co.uk'
    verdin-imx8mm-06760593:~$ sudo ip link set ethernet0 down
    verdin-imx8mm-06760593:~$ sudo ip link set ethernet0 up
    verdin-imx8mm-06760593:~$ ping google.co.uk
    PING google.co.uk (216.58.206.131): 56 data bytes
    64 bytes from 216.58.206.131: seq=0 ttl=117 time=12.257 ms

Maybe there is some script that runs when it comes back up which is not correct? I am not sure where to look for that. @andrecurvello.tx how does it work for you?

jeremias.tx · March 12, 2021, 7:17pm

Hi @edwaugh,

I did some digging and it seems this is a known issue in our backlog. I don’t see much more than that however. Let me mention your issue and see if we can get things moving again.

Best Regards,
Jeremias

edwaugh · March 13, 2021, 7:13am

Hi @jeremias.tx,

Good news that it is logged somewhere. Can you make it visible here:

Can it be booked in for the next monthly? I can probably work around it but it is a bug that impacts our system design.

Thanks

Ed

jeremias.tx · March 15, 2021, 8:45pm

Understood, I’ll see what I can do. This did bring some more attention to the issue and it’s being actively looked at currently. Also, just a correction: this is actually an issue with the underlying BSP that has been inherited by Torizon. So future information about this will appear here when we publish it: Toradex System/Computer on Modules - Linux BSP Release

Best Regards,
Jeremias

edwaugh · March 16, 2021, 7:21am

Hi @jeremias.tx,

I included a workaround in my code that seems OK. It is not well tested, but bringing ethernet0 down and then up seems to sort it out.

    # Module-level imports this method needs
    import subprocess
    import time

    def do_sleep_now(self):
      ''' Set the system to sleep and bring ethernet back up afterwards '''
      self.do_sleep = False
      print('Going to sleep for 10 seconds')
      time.sleep(1)
      response = subprocess.run(['rtcwake', '-m', 'mem', '-d', 'rtc1', '-s10'], stdout=subprocess.PIPE, text=True)
      if response.returncode != 0:
        print('\nreturncode: {0}, {1}'.format(response.returncode, response.stdout))
      # Another process triggers the sleep so we will do our own pause here
      time.sleep(2)
      print('Wake Up')

      # Put the ethernet link down (stderr is captured too so it can be printed on failure)
      response = subprocess.run(['ip', 'link', 'set', 'ethernet0', 'down'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
      if response.returncode == 0:
        print('\nreturncode: {0}, status: {1}'.format(response.returncode, response.stdout))
      else:
        print('\nreturncode: {0}, error: {1}'.format(response.returncode, response.stderr))

      # Bring ethernet back up, CAN also affected but not enabled during idle
      response = subprocess.run(['ip', 'link', 'set', 'ethernet0', 'up'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
      if response.returncode == 0:
        print('\nreturncode: {0}, status: {1}'.format(response.returncode, response.stdout))
      else:
        print('\nreturncode: {0}, error: {1}'.format(response.returncode, response.stderr))

andrecurvello.tx · March 16, 2021, 3:44pm

Thanks for the support on this @jeremias.tx.

@edwaugh, in the meantime, running udhcpc on eth0 can bring the interface back to life after resume.

Can you check whether that works for you until the real fix is ready, @edwaugh?
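
For anyone who wants to call this from Python code like the workaround earlier in the thread, here is a minimal sketch. It assumes BusyBox udhcpc is on the PATH and that the process has enough privileges to configure the interface. The function names are hypothetical, and the default interface name ethernet0 is taken from the ifconfig output in the first post, so adjust it if your interface is called eth0.

```python
import subprocess


def build_udhcpc_command(interface="ethernet0"):
    """Return the argv list for a one-shot BusyBox udhcpc lease request."""
    # -i selects the interface, -n exits if no lease is obtained,
    # -q exits once a lease has been obtained.
    return ["udhcpc", "-i", interface, "-n", "-q"]


def renew_lease(interface="ethernet0", run=subprocess.run):
    """Request a new DHCP lease; return True if udhcpc exits with status 0.

    `run` is injectable so the function can be exercised without touching
    the network (assumption: it behaves like subprocess.run).
    """
    result = run(build_udhcpc_command(interface), capture_output=True, text=True)
    return result.returncode == 0
```

Calling renew_lease() after the rtcwake call returns would be the equivalent of running the command by hand. Whether that alone is enough on a given image is something to verify on the board.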

edwaugh · April 20, 2021, 6:55am

hi @jeremias.tx,

Reposting to this thread as I have found an odd sleep problem with my setup today. After a number of sleeps the device never comes back up. I will investigate further, but we have updated to the latest BSP since I did my testing on sleep. Also, the Ethernet problem is not fixed in the new BSP; I get this message:

Going to sleep for 10 seconds
[  406.292362] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.332362] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.372639] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.452359] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.492338] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.532347] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.612335] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.652345] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.692354] fec 30be0000.ethernet ethernet0: MDIO write timeout
[  406.732365] fec 30be0000.ethernet ethernet0: MDIO write timeout
Wake Up

Could you tell me where this is in the backlog? Would be good if someone could take a look.

Thanks

Ed

andrecurvello.tx · April 20, 2021, 2:40pm

Hi @edwaugh,

Can you please confirm which TorizonCore version you are using?

I checked internally and this fix is in the latest Monthly of TorizonCore 5.2.0.

Best regards,
André Curvello

edwaugh · April 20, 2021, 3:14pm

I am on build 11 so the very latest. I think the messages are new, maybe they are related to the fix?

jeremias.tx · April 21, 2021, 8:52pm

As far as we can tell the latest monthly should have the fix incorporated and it was tested to work well. As for these new error messages, there might be some other issue here affecting this.

I believe the last time I saw this was on the Apalis i.MX8. I think power-related issues caused the Ethernet PHY to not start up correctly, which sounds somewhat related to what you’re seeing here.

Can you reliably reproduce these messages? Or is it somewhat random?

Best Regards,
Jeremias

edwaugh · April 22, 2021, 6:24am

Hi @jeremias.tx,
I will put it on my list for today to try recovery from sleep without bringing the interface down and up. I don’t think I should have a power problem; the SOM is always supplied during the sleep cycle.
Ed

jeremias.tx · April 22, 2021, 5:00pm

Please keep us updated. If you can come up with a method/process that reliably reproduces this issue, it will be a lot easier for us to start debugging/investigating.
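
One possible way to make this repeatable is sketched below: run several suspend cycles and count how many new "MDIO write timeout" lines show up in the kernel log after each one. This is only a sketch under assumptions: the function names are hypothetical, it assumes dmesg and rtcwake can be run with sufficient privileges, and the regular expression is based on the log lines posted earlier in this thread.

```python
import re
import subprocess

# Matches lines such as:
# [  406.292362] fec 30be0000.ethernet ethernet0: MDIO write timeout
MDIO_TIMEOUT = re.compile(r"fec \S+ \S+: MDIO write timeout")


def count_mdio_timeouts(kernel_log):
    """Count fec 'MDIO write timeout' lines in a kernel log string."""
    return sum(1 for line in kernel_log.splitlines() if MDIO_TIMEOUT.search(line))


def run_cycles(cycles=5, seconds=10, run=subprocess.run):
    """Suspend `cycles` times; return the number of new timeout lines per cycle.

    `run` is injectable for testing (assumption: it behaves like subprocess.run).
    If the kernel ring buffer wraps between reads, a difference can be too
    small or even negative, so treat the numbers as a rough indicator.
    """
    def total():
        out = run(["dmesg"], capture_output=True, text=True, check=True).stdout
        return count_mdio_timeouts(out)

    new_per_cycle = []
    previous = total()
    for _ in range(cycles):
        # Same rtcwake invocation as used in this thread, with a variable duration.
        run(["rtcwake", "-m", "mem", "-d", "rtc1", "-s", str(seconds)], check=True)
        current = total()
        new_per_cycle.append(current - previous)
        previous = current
    return new_per_cycle
```

A list such as the result of run_cycles(cycles=20) would show whether the messages appear on every resume or only occasionally.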

gauravks · April 27, 2021, 2:11pm

hi @jeremias.tx @andrecurvello.tx,

Can you share the link to the fix commit?

Regards,
Gaurav

jeremias.tx · April 27, 2021, 7:52pm

Hi @gauravks,

According to the team this was the commit for the fix in the kernel: linux-toradex.git - Linux kernel for Apalis, Colibri and Verdin modules

Best Regards,
Jeremias

edwaugh · April 27, 2021, 7:56pm

Thanks @jeremias.tx, we are still seeing the error message I reported above when sleeping the device, my code looks like:

import datetime
import subprocess
import time

# do_sleep is set elsewhere in the application
if do_sleep:
    print('{0}, Going to sleep for 20 seconds'.format(datetime.datetime.now()))
    response = subprocess.run(['rtcwake', '-m', 'mem', '-d', 'rtc1', '-s20'], stdout=subprocess.PIPE, text=True)
    if response.returncode != 0:
        print('\nreturncode: {0}, {1}'.format(response.returncode, response.stdout))
    time.sleep(0.5)
    print('{0}, Wake Up for 5 seconds'.format(datetime.datetime.now()))
    time.sleep(5)

Are you able to replicate this? The error messages come through STDERR, so you need to be connected to the A53 serial port or somewhere else where you can see them.

jeremias.tx · April 27, 2021, 10:06pm

Let me see if I or someone else internally can replicate this then we can get back to you on this.

Best Regards,
Jeremias

jeremias.tx · April 28, 2021, 6:30pm

So far we’ve been unable to reproduce these error messages on our side. Just to confirm, you are on 5.2, correct? Can you provide the output of uname -a and cat /etc/issue? I just want to make sure the versions here are all correct.

Best Regards,
Jeremias

edwaugh · April 28, 2021, 6:38pm

Linux verdin-imx8mm-06760593 5.4.91-5.2.0-devel+git.3ae7ec26415b #1-TorizonCore SMP PREEMPT Fri Apr 9 10:59:52 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

TorizonCore 5.2.0-devel-202104+build.11 \n \l

I don’t think there is anything unusual about our setup, I think it matches the Verdin development board on the hardware side. Are you definitely monitoring STDERR and not just STDOUT?

This guy sees the same message as part of his problem, not sure if it is related:

Similar and old thread here:
https://lkml.org/lkml/2014/5/7/587

jaski.tx · April 30, 2021, 2:09pm

Hi @edwaugh

I tested this on BSP 5.2 (Linux verdin-imx8mm 5.4.91-5.2.0+git.6afb048a71e3) using the following instructions and it is working fine. The timeout messages do appear, but the eth0 interface then comes back up.

[  172.659435] fec 30be0000.ethernet eth0: MDIO write timeout
[  172.695446] fec 30be0000.ethernet eth0: MDIO write timeout
[  172.773112] PM: resume devices took 0.716 seconds
[  172.854780] OOM killer enabled.
[  172.857927] Restarting tasks ... done.
[  172.865117] PM: suspend exit
root@verdin-imx8mm:~#  [  175.562621] fec 30be0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
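
In the log above, "Link is Up" is printed roughly 2.7 seconds after "PM: suspend exit", so a check made immediately after resume may be too early. A minimal sketch of waiting for the link from Python is shown here. It polls the standard /sys/class/net/<interface>/operstate file; the function names, timeout, and poll interval are assumptions for illustration.

```python
import time
from pathlib import Path


def read_operstate(interface, sysfs_root="/sys/class/net"):
    """Return the kernel's operstate string for an interface, or None if unreadable."""
    try:
        return (Path(sysfs_root) / interface / "operstate").read_text().strip()
    except OSError:
        return None


def wait_for_link(interface, timeout=10.0, poll=0.5, sysfs_root="/sys/class/net"):
    """Poll operstate until it reads 'up'; return False if `timeout` seconds elapse first."""
    deadline = time.monotonic() + timeout
    while True:
        if read_operstate(interface, sysfs_root) == "up":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

For example, wait_for_link("eth0") could be called right after rtcwake returns, and only if it comes back False would a link down/up workaround need to run.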

Best regards,
Jaski