Apalis T30 OS lockup and/or networking stops working

We recently rolled out WEC7 OS 2.3 to 100 units in the field.

Our setup has an Apalis T30 IT connected to an Emcraft SmartFusion SOM via a Microchip KSZ8895 10/100 Ethernet switch, all on the same PCB. That Ethernet switch also connects over Ethernet cable to another x86 WEC7 platform and to an embedded cellular modem platform, for a total of 4 networked devices on the local wired LAN. We also have an SDIO8787 u-blox ella131 WiFi module on the Apalis platform for wireless.

Sometimes we get one of two things:

  • The Apalis OS is unresponsive. Logging from two different apps stops at the same instant, which looks like an OS crash. All 3 other devices on the LAN can still ping each other; only the Apalis module is unresponsive.
  • Or, sometimes logging continues and the apps still run, but we can’t ping the Apalis platform, while the other 3 devices can still talk to each other.

We’ve ruled out the Ethernet switch and related hardware, because the other 3 platforms can still communicate. We thought that the SDIO stack patch had fixed this, but perhaps not?

We will try to reproduce this with debug logs and find a way to make it happen faster. So far, out of 100 units, we have had 10-15 crashes in a few days. It appears that OS 2.3 performs worse than OS 2.1.

Any ideas?

Dear @kswain

The problem does not sound familiar to me. To track down the problem I can give you some generic hints:

  • Try to disable the WiFi to be sure it is not an issue of routing tables or similar issues. Do you still see the network drops?
  • Enable debug messages. Maybe you can see some hints there.
  • In the situation where you suspect a crashed OS: are you still able to do simple things such as moving the mouse pointer? Or is it your application which crashed?

Regards, Andy

We can’t disable the WiFi in the field, because logs are uploaded to the back-end server when the vehicle enters the garage each day. Note that we are using NETUI with a 1-minute connection interval.

We do add to the routing table. Something like:
route add 10.20.1.29 192.168.0.1

This provides a route for the Verizon 3G modem, so that we use the modem only when we need to and use WiFi as the default connection.
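To illustrate how such a table is resolved, here is a minimal sketch of longest-prefix route selection. It assumes the `route add` command above installs a host route (a /32), which is what the Windows `route` tool does when no MASK is given. Only the 10.20.1.29 via 192.168.0.1 entry comes from this post; the default gateway address and the labels are hypothetical placeholders.

```python
import ipaddress

# Hypothetical routing table: (destination network, gateway, label).
# The default-route gateway below is an invented placeholder for the WiFi side.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1", "wifi-default"),
    (ipaddress.ip_network("10.20.1.29/32"), "192.168.0.1", "3g-modem"),
]


def pick_route(dest):
    """Return the matching route with the longest prefix for dest."""
    addr = ipaddress.ip_address(dest)
    matches = [route for route in ROUTES if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen)


if __name__ == "__main__":
    print(pick_route("10.20.1.29")[2])  # host route wins over the default
    print(pick_route("8.8.8.8")[2])     # everything else uses the default
```

If both interfaces ever end up with overlapping routes of equal prefix length, the interface metric decides, which this sketch does not model.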

In one case, a crash was observed on site where the mouse pointer still worked, but you couldn’t do anything else. In most cases it looks and feels like a high-priority driver thread is stuck, as opposed to a crash. The result is that our lower-priority app, at priority 252-255, doesn’t run.

We will try the things you have suggested in the lab and try to provide a log.

Andy,
I work with Kory and have a few updates. A unit was set up to capture debug messages while power cycling (2 minutes powered on, then 2 minutes powered off).

We are seeing the following error being reported:

EMEM_DEC_ERR by mpcorew @ 0xd9d93ff0 (Status = 0x20030039)
PCIe: AXI response decode error (0xdd9c)

In putty20191021134450.log there are 85 instances on lines 954-1040.
Do you have any further details on the error that is being reported?
Thanks,
Dave

Here are a few more instances of the same error messages on different days.

In putty20191028153206.log there are 77 instances on lines 13841-13920.

In putty20191029155327.log there are 55 instances on lines 17611-17669.

In putty20191022153318.log there is 1 instance on lines 1431-1434.

Thanks,
Dave
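For anyone who wants to tally these the same way, a minimal sketch of a log scanner follows. It counts the lines containing `EMEM_DEC_ERR` and reports the first and last line number where the string appears. The function names are made up for illustration, and it assumes the PuTTY logs are plain text with one message per line.

```python
import re

PATTERN = re.compile(r"EMEM_DEC_ERR")


def summarize_errors(lines):
    """Return (count, first_line, last_line) for matching lines, 1-based."""
    hits = [i for i, line in enumerate(lines, start=1) if PATTERN.search(line)]
    if not hits:
        return (0, None, None)
    return (len(hits), hits[0], hits[-1])


def summarize_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return summarize_errors(handle)


if __name__ == "__main__":
    sample = [
        "boot message",
        "EMEM_DEC_ERR by mpcorew @ 0xd9d93ff0 (Status = 0x20030039)",
        "PCIe: AXI response decode error (0xdd9c)",
        "EMEM_DEC_ERR by mpcorew @ 0xd9d93ff0 (Status = 0x20030039)",
    ]
    print(summarize_errors(sample))
```

Usage would be something like `summarize_file("putty20191021134450.log")` for each capture.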

Dear @ddufresne

Unfortunately the error message does not give us enough hints to trace back to the source of the error:

EMEM_DEC_ERR by mpcorew @ 0xd9d93ff0 (Status = 0x20030039) tells us that one of the main CPUs was trying to write to the physical address 0xd9d93ff0. On a module with 2GB of RAM, there would be RAM at this location, but on a 1GB module there’s nothing at this address.
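The address arithmetic can be checked with a short sketch. It assumes DRAM is mapped starting at physical address 0x80000000; that base is an assumption for illustration and is not stated in this thread.

```python
DRAM_BASE = 0x8000_0000  # assumption: physical address where DRAM starts
GIB = 1 << 30
FAULT_ADDR = 0xD9D93FF0  # address from the EMEM_DEC_ERR message


def in_dram(addr, ram_bytes):
    """True if addr falls inside a DRAM window of ram_bytes at DRAM_BASE."""
    return DRAM_BASE <= addr < DRAM_BASE + ram_bytes


if __name__ == "__main__":
    print(in_dram(FAULT_ADDR, 1 * GIB))  # False: above the end of a 1GB module
    print(in_dram(FAULT_ADDR, 2 * GIB))  # True: inside a 2GB module
```

Under that assumption a 1GB module ends at 0xC0000000, so the faulting write lands in the unpopulated upper half of the window.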

We are trying to modify our BSP so that we get a proper exception in this error case. We hope that the exception will lead us to the code which tries to access this invalid location.
I will get back to you as soon as we know more.

Regards, Andy

There is another similar ticket here:

If we could get additional debugging info to track down the offending code, that would be ideal. Do you think this is a problem in the OS/drivers or application code?

Dear @kswain
We cannot say whether the error is caused by your application or by the OS/drivers.

Just a gut feeling: as we had issues with the WiFi driver in the past, and no other customer has reported the same problem, WiFi support could be the origin of the fault.

I will get back to you as soon as we have a test image which can generate better debug output. Do you have a particular deadline by which the problem needs to be solved?

Regards, Andy

Andy,
With the issue locking up fielded units in service, the deadline is always yesterday:-)

@andy.tx We are going to collect more data on a different version of our application. This issue has been in the field for 1.5 years now and is causing lockups on a regular basis. We’ve already had a patch for the SDIO stack twice, and I think the silicon calibration issue may have affected some units, but we still have some problem with networking locking up. We are also constructing a loopback test to stress test the networking.

Based purely on observation and timing, it looks like our application starts using the wired network just when WiFi connects with NETUI, at about the 30-second mark into bootup. That window seems like the most likely time for a lockup.

I agree, the SDIO8787 WiFi is different from what other customers use, so it’s likely either a WiFi/SDIO issue or an application problem on our end.

Any extra debugging help you can provide would be very helpful.

Any update on this? We are still trying to resolve the issue.

Dear @kswain
I’m afraid it will take a few more days. Adding the debug output is not as straightforward as we hoped. We need to tweak the memory mapping in order to make the error trigger a different exception.
Regards, Andy

Andy, Any updates?
Thanks, Dave

Dear @ddufresne
We just had our weekly planning meeting. I’m afraid we can only find time next week to look into this.
Regards, Andy

Dear @gnicholson
Yes, your issue was scheduled for analysis for this week. I will get back to you as soon as we have any results.
Regards, Andy

Hi Andy,

I am an Engineering Manager at Trapeze and Kory (@kswain) reports to me. Dave (@ddufresne) is assisting me with this investigation.

Did your most recent weekly planning meeting include a review of this issue?

Andy,

I have an update for you.

We ran a different test. We removed the application and then ran a loopback test which simply sends a file to the Apalis Tegra T30 and the file is returned. A checksum is used to confirm the file is not corrupted during the transfer. The test ran for 9 days continuously before failing. I have attached the log that was captured just before the test failure.
Does this log file point to why the test failed after running for over 200 hours?
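For reference, the shape of such a loopback test can be sketched as below. This is not our actual test code: the echo server here is a local stand-in for the Apalis side, SHA-256 is chosen for illustration (the checksum we used is not specified above), and all function names are hypothetical.

```python
import hashlib
import os
import socket
import threading


def echo_once(server_sock):
    """Accept one connection and send back every byte received."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            conn.sendall(chunk)


def loopback_ok(payload, host, port):
    """Send payload, read the echo concurrently, compare SHA-256 digests."""
    received = bytearray()
    with socket.create_connection((host, port), timeout=10) as sock:
        def reader():
            # Read in parallel so neither side blocks on a full buffer.
            while len(received) < len(payload):
                chunk = sock.recv(65536)
                if not chunk:
                    break
                received.extend(chunk)

        worker = threading.Thread(target=reader)
        worker.start()
        sock.sendall(payload)
        worker.join()
    return hashlib.sha256(payload).digest() == hashlib.sha256(received).digest()


def run_local_demo(size=1 << 20):
    """Run one round trip against a local echo server on an ephemeral port."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    threading.Thread(target=echo_once, args=(server,), daemon=True).start()
    try:
        return loopback_ok(os.urandom(size), "127.0.0.1", port)
    finally:
        server.close()


if __name__ == "__main__":
    print("round trip intact:", run_local_demo())
```

In a soak test, `loopback_ok` would be called in a loop with a timestamp logged per iteration, so the last successful round trip before a lockup is easy to find.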