We recently rolled out WEC7 OS 2.3 to 100 units in the field.
Our setup has an Apalis T30 IT connected to an Emcraft SmartFusion SOM via a Microchip KSZ8895 10/100 Ethernet switch, all on the same PCB. That switch also connects over Ethernet cable to another x86 WEC7 platform and to an embedded cellular modem platform, for a total of 4 networked devices on the local wired LAN. We also have an SDIO8787 u-blox ELLA-W131 WiFi module on the Apalis platform for wireless.
Sometimes we get one of two things:
The Apalis OS is unresponsive. Logging from two different apps stops at the same instant, which looks like an OS crash. All 3 other devices on the LAN can still ping each other. Only the Apalis module is unresponsive.
Or, sometimes logging continues and the apps still run, but we still can’t ping the Apalis platform, while the other 3 can talk to each other.
We’ve ruled out the Ethernet switch and related hardware, because the other 3 platforms can still communicate. We thought that the SDIO stack patch fixed this, but perhaps not?
We will try to reproduce this with debug logs and find a way to make it happen faster. So far, across 100 units, we have had 10-15 crashes in a few days. It appears that performance is worse than with OS 2.1.
In the situation where you suspect a crashed OS: are you still able to do simple things such as moving the mouse pointer? Or is it your application that crashed?
We can’t disable the WiFi in the field, because logs are uploaded to the back-end server when the vehicle enters the garage each day. Note that we are using NETUI with a 1 min connection interval.
We do add to the routing table. Something like:
route add 10.20.1.29 192.168.0.1
This is to provide a route for the Verizon 3G modem, so that we use it only when we need to and otherwise use WiFi as the default connection.
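To illustrate why a route like this only affects traffic to that one address, here is a minimal sketch of longest-prefix route selection. It assumes the `route add` above creates a host route (a /32, which is the default on Windows when no mask is given) and that 192.168.0.1 is the gateway toward the modem. The default-route gateway 192.168.1.1 and the labels are made-up placeholders, not values from our units, and real route selection also considers metrics.

```python
import ipaddress

# Hypothetical routing table: (destination network, gateway, label).
# Only the host route 10.20.1.29 via 192.168.0.1 comes from the post above;
# the default-route gateway 192.168.1.1 is an assumed placeholder for WiFi.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1", "wifi-default"),
    (ipaddress.ip_network("10.20.1.29/32"), "192.168.0.1", "modem-host-route"),
]


def pick_route(dest):
    """Return the matching route with the longest prefix (most specific)."""
    addr = ipaddress.ip_address(dest)
    matches = [route for route in ROUTES if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen)


if __name__ == "__main__":
    print(pick_route("10.20.1.29")[2])  # modem-host-route
    print(pick_route("8.8.8.8")[2])     # wifi-default
```

Only traffic to 10.20.1.29 matches the /32 entry; everything else falls through to the default route.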
In one case, a crash was observed on site where the mouse pointer still worked, but you couldn’t do anything else. In most cases it looks and feels like a high-priority driver thread is stuck, as opposed to a crash. The result is that our lower-priority app, at priority 252-255, doesn’t run.
We will try the things you have suggested in the lab and try to provide a log.
Andy,
I work with Kory and have a few updates. A unit was set up to capture debug messages while power cycling (2 minutes powered on, then 2 minutes powered off).
Unfortunately the error message does not give us enough hints to trace back to the source of the error:
EMEM_DEC_ERR by mpcorew @ 0xd9d93ff0 (Status = 0x20030039) tells us that one of the main CPUs was trying to write to the physical address 0xd9d93ff0. On a module with 2GB of RAM there would be RAM at this location, but on a 1GB module there’s nothing at this address.
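As a quick sanity check of that address argument, here is a minimal sketch that tests whether the faulting address falls inside the RAM window. It assumes DRAM is mapped starting at physical address 0x80000000 on the Tegra 3; that base is an assumption added for illustration and is not taken from the log.

```python
# Assumed DRAM base for Tegra 3; not stated in the error message itself.
DRAM_BASE = 0x80000000
GIB = 1 << 30

# Faulting physical address reported by EMEM_DEC_ERR.
FAULT_ADDR = 0xD9D93FF0


def in_ram(phys_addr, ram_bytes, base=DRAM_BASE):
    """Return True if phys_addr lies inside [base, base + ram_bytes)."""
    return base <= phys_addr < base + ram_bytes


if __name__ == "__main__":
    print(in_ram(FAULT_ADDR, 1 * GIB))  # False: 1GB ends at 0xBFFFFFFF
    print(in_ram(FAULT_ADDR, 2 * GIB))  # True: 2GB ends at 0xFFFFFFFF
```

Under that assumption, the address sits about 1.4GB above the DRAM base, which is past the end of a 1GB module.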
We are trying to modify our BSP, so we would get a proper exception in this error case. We hope that the exception would lead us to the code which tries to access this invalid location.
I will get back to you as soon as we know more.
If we could get additional debugging info to track down the offending code, that would be ideal. Do you think this is a problem in the OS/drivers or application code?
Dear @kswain
We cannot say whether the error is caused by your application or by the OS/drivers.
Just a gut feeling: as we had issues with the WiFi driver in the past, and no other customer reported the same problem, WiFi support could be the origin of the fault.
I will get back to you as soon as we have a test image which could generate better debug output. Do you have a particular deadline until when the problem needs to be solved?
@andy.tx We are going to collect more data on a different version of our application. This issue has been in the field for 1.5 years now and has been causing lockups on a regular basis. We’ve already had the SDIO stack patched twice, and I think the silicon calibration issue may have affected some units, but we still have some problem with networking locking up. We are also constructing a loopback test to stress test the networking.
Based purely on observation and timing, it looks like our application starts using the wired network just as WiFi connects with NETUI, at about the 30 second mark into bootup. That window seems like the most likely time for a lockup.
I agree, the SDIO8787 WiFi is different from what other customers use, so it’s likely either a WiFi/SDIO issue or an application problem on our end.
Any extra debugging help you can provide would be very helpful.
Dear @kswain
I’m afraid it will take a few more days. Adding the debug output is not as straightforward as we hoped. We need to tweak the memory mapping in order to make the error trigger a different exception.
Regards, Andy
We ran a different test. We removed the application and then ran a loopback test which simply sends a file to the Apalis Tegra T30 and the file is returned. A checksum is used to confirm the file is not corrupted during the transfer. The test ran for 9 days continuously before failing. I have attached the log that was captured just before the test failure.
Does this log file point to why the test failed after running for over 200 hours?
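For anyone who wants to reproduce this kind of test, the loopback check described above can be sketched roughly as follows. This is not our actual test code. It is a host-side Python illustration that assumes a plain TCP echo with an 8-byte length prefix and a SHA-256 checksum; the function names, the framing, and the local echo thread standing in for the Apalis are all hypothetical.

```python
import hashlib
import socket
import threading


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before all bytes arrived")
        buf.extend(chunk)
    return bytes(buf)


def echo_once(server_sock):
    """Stand-in for the device side: accept one client and return its payload."""
    conn, _ = server_sock.accept()
    with conn:
        size = int.from_bytes(recv_exact(conn, 8), "big")
        conn.sendall(recv_exact(conn, size))


def loopback_check(host, port, payload):
    """Send payload, read it back, and compare checksums of both copies."""
    expected = hashlib.sha256(payload).hexdigest()
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(len(payload).to_bytes(8, "big") + payload)
        returned = recv_exact(sock, len(payload))
    return hashlib.sha256(returned).hexdigest() == expected


def run_demo():
    """Run one round trip against a local echo thread on an ephemeral port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=echo_once, args=(server,), daemon=True)
    worker.start()
    ok = loopback_check("127.0.0.1", port, bytes(range(256)) * 1024)
    worker.join(timeout=5)
    server.close()
    return ok


if __name__ == "__main__":
    print("transfer intact:", run_demo())
```

In a long-running stress test, `loopback_check` would be called in a loop with a timestamped log line per iteration, so the last successful transfer before a lockup is easy to find.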