Colibri T30, WinCE 8, System freeze

Hello.

I’m supporting a legacy product running Windows CE 8.0 on Colibri T30.
The product runs a C# application interfacing the legacy Windows CE libraries.
We are aware that all of this was deprecated years ago and are in the process of developing new Linux-based solutions…

Anyway, an increasing number of customers are reporting random system freezes that can only be resolved by power cycling the device.
We have managed to reproduce the issue, but only by chance…
The serial debug interface outputs nothing between our application's last log entry (which is a different one each time) and the freeze. DrWatson does not generate a file in this case either.

The application uses CAN, Ethernet and similar interfaces. The hardware watchdog is disabled; it's currently not feasible to have the system reboot.

Currently our main root cause suspects are:

1

WC-1747: “Threads with higher priority (<101) can freeze the device”.

  • We don't explicitly force any of our threads to run on core 1; however, in most cases everything just runs on core 1.
  • We don't explicitly start any threads with such a high priority.
    However, when examining the application process in the task manager, we see a thread with priority “001”. Its average CPU load is close to 0.01%. I suspect the thread is spawned by a (legacy) driver.
    Can you confirm whether this is the case, and whether it could be the root cause?

2

Since the application uses the legacy Windows CE libraries, I had a look at the release notes for the newer libraries (the single API/DLL one) and stumbled across:
WCL-1194: “Re-Initializing CAN Library could freeze the system”.

  • The CAN communication to the slave device can sometimes time out, etc., in which case we may re-initialize the CAN library.
    However, since we are using the legacy libraries (SJA1000 v3.3.2), we would like clarification on whether this issue is relevant for that library.
    We are unable to make the system freeze by forcing a lot of CAN re-inits…

Thanks in advance for taking the time to provide feedback. As I mentioned, we are aware that the solutions used in this product were deprecated years ago…

My user account does not have file upload permissions, so here is the boot-up debug log:
"“6/25/2024 3:13:28 PM”, Toradex Bootloader 2.4b2 for Tegra Built Feb 16 2020 16:19:11

“6/25/2024 3:13:29 PM”,Press [SPACE] to enter Bootloader Menu

“6/25/2024 3:13:29 PM”,Colibri T30 1GB 1.1G Serial: 7281994
“6/25/2024 3:13:29 PM”,RAM: 1024 MB, CarveOut: 2 MB
“6/25/2024 3:13:29 PM”,Loading IM1 image from flash…Done(279ms)
“6/25/2024 3:13:29 PM”,Checking Image CRC…OK(60ms)
“6/25/2024 3:13:29 PM”,Decompressing IMAGE(25100684, 42310380) from 8286FB00 to 80016000…Done(368ms)
“6/25/2024 3:13:30 PM”,Jumping to image at 0x80016000…

“6/25/2024 3:13:30 PM”,Toradex Windows CE 8.0 2.4b2 for Tegra Built Feb 16 2020 16:10:21
“6/25/2024 3:13:30 PM”,INFO:OALLogSetZones: dpCurSettings.ulZoneMask: 0xb
“6/25/2024 3:13:30 PM”,L2 cache enabled
“6/25/2024 3:13:30 PM”,MainMemoryEndAddress adjusted from 0x9F000000 to 0x9FE00000
“6/25/2024 3:13:30 PM”,Main Phys Mem: 0x80000000:0x9FDFFFFF
“6/25/2024 3:13:30 PM”,Carveout Phys: 0x9FE00000:0x9FFFFFFF
“6/25/2024 3:13:30 PM”,Cold boot selected
“6/25/2024 3:13:30 PM”,SMP: Active CPUs = 4
“6/25/2024 3:13:30 PM”,Extended Mem : 0xA0000000:0xBFFFFFFF
“6/25/2024 3:13:30 PM”,Chip Id: 0x30 (Handheld SOC) Major: 0x1 Minor: 0x3 SKU: 0xb1
“6/25/2024 3:13:30 PM”,ATE prog ver 4.0
“6/25/2024 3:13:30 PM”,Speedo: CPU: 354 (Corner: 1), Core: 216 (Corner: 0)
“6/25/2024 3:13:30 PM”,NVRM Initialized shmoo database
“6/25/2024 3:13:30 PM”,PllClocks(Mhz): X=1300, M=800, C=600, P=408, A=24.576
“6/25/2024 3:13:30 PM”,SysClocks(Mhz): CPU=1300, AVP=240, SysBus=240, Mem=400, EMem=800
“6/25/2024 3:13:30 PM”,GraphicClocks(Mhz): Host=133, 3D=133, 2D=133, Epp=133, Mpe=133, Vde=408
“6/25/2024 3:13:30 PM”,Loading ResourceManager…
“6/25/2024 3:13:30 PM”,Load_Init_Syncpoint_Values:
“6/25/2024 3:13:30 PM”,Done Loading ResourceManager (3 ms)
“6/25/2024 3:13:30 PM”,Loading FlashFileSystem(NAND)…
“6/25/2024 3:13:30 PM”,Failed(0) Loading FlashFileSystem(NAND) (0 ms)
“6/25/2024 3:13:30 PM”,Loading FlashFileSystem(eMMC)…
“6/25/2024 3:13:30 PM”,Done Loading FlashFileSystem(eMMC) (0 ms)
“6/25/2024 3:13:30 PM”,Waiting for flash disk \FlashDisk for 300 seconds
“6/25/2024 3:13:30 PM”,Folder mounted
“6/25/2024 3:13:30 PM”,Loading MultimediaManager…
“6/25/2024 3:13:30 PM”,Done Loading MultimediaManager (0 ms)
“6/25/2024 3:13:30 PM”,Loading Audio…
“6/25/2024 3:13:30 PM”,Loading Serial3…
“6/25/2024 3:13:30 PM”,Done Loading Serial3 (1 ms)
“6/25/2024 3:13:30 PM”,Loading Serial2…
“6/25/2024 3:13:30 PM”,Loading Serial1…
“6/25/2024 3:13:30 PM”,Done Loading Serial2 (1 ms)
“6/25/2024 3:13:30 PM”,Failed(0) Loading Serial1 (0 ms)
“6/25/2024 3:13:30 PM”,Loading SDIO…
“6/25/2024 3:13:30 PM”,Done Loading SDIO (5 ms)
“6/25/2024 3:13:30 PM”,Loading USBOTG(Port1)…
“6/25/2024 3:13:30 PM”,Loading USBHost(Port3)…
“6/25/2024 3:13:30 PM”,Loading USBHost(Port2)…
“6/25/2024 3:13:30 PM”,Done Loading USBOTG(Port1) (5 ms)
“6/25/2024 3:13:30 PM”,VBUS=0, ID=0
“6/25/2024 3:13:30 PM”,ChangeMode(0, 2)
“6/25/2024 3:13:30 PM”,OTG(0): Loading HCD…
“6/25/2024 3:13:30 PM”,SDHCDIndicateSlotStateChange(DeviceInserted)
“6/25/2024 3:13:30 PM”,Loading USBHost(Port1)…
“6/25/2024 3:13:30 PM”,Loading DisplayDriver…
“6/25/2024 3:13:30 PM”,Done Loading DisplayDriver (37 ms)
“6/25/2024 3:13:30 PM”,Set LCD to 800x480 (51Hz)
“6/25/2024 3:13:30 PM”,Set HDMI to 800x480 (60Hz)
“6/25/2024 3:13:30 PM”,Set HDMI to 800x480 (60Hz)
“6/25/2024 3:13:30 PM”,Set HDMI to 800x480 (60Hz)
“6/25/2024 3:13:30 PM”,Done Loading USBHost(Port3) (198 ms)
“6/25/2024 3:13:30 PM”,Loading Keyboard/MouseDriver…
“6/25/2024 3:13:31 PM”,Done Loading Keyboard/MouseDriver
“6/25/2024 3:13:31 PM”,NvDdkSdioSetClockFrequency: Requested: 100 KHz, Actual: 100 KHz
“6/25/2024 3:13:31 PM”,Loading TouchDriver…
“6/25/2024 3:13:31 PM”,Maximum Allowed Error 5:
“6/25/2024 3:13:31 PM”,Done Loading TouchDriver
“6/25/2024 3:13:31 PM”,Done Loading Audio (346 ms)
“6/25/2024 3:13:31 PM”,SoftRTC enabled
“6/25/2024 3:13:31 PM”,Done Loading USBHost(Port2) (336 ms)
“6/25/2024 3:13:31 PM”,Done Loading USBHost(Port1) (452 ms)
“6/25/2024 3:13:31 PM”,OTG(0) Done Loading HCD
“6/25/2024 3:13:31 PM”,No external RTC found
“6/25/2024 3:13:32 PM”,Loading NETUI…
“6/25/2024 3:13:32 PM”,Done NETUI
“6/25/2024 3:13:32 PM”,AX88772B: instantiate [AX88772B1]
“6/25/2024 3:13:36 PM”,AX88772B: MediaChange()
“6/25/2024 3:13:39 PM”,AX88772B: MediaChange()
“6/25/2024 3:13:41 PM”,AX88772B: MediaChange()

“6/25/2024 3:14:38 PM”,Using CANLib Version 3.3.2

“6/25/2024 3:14:38 PM”,Using IntLib Version 2.6.0

“6/25/2024 3:14:38 PM”,Using ClkLib Version 3.3.0

“6/25/2024 3:14:38 PM”,Message Buffer created. Size 100 Messages Physical Address 0xbb85c000

“6/25/2024 3:14:38 PM”,Starting IST:"

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!


Hello.

Addressing the issues you’ve outlined with your Colibri T30 running Windows CE 8.0, let’s delve into the potential causes and their implications.

1. Threads with Higher Priority Freezing the Device

The behavior you’re observing, where threads with higher priority can potentially freeze the device, is indeed plausible. In Windows CE, threads with very high priority can starve other threads, including those managing critical system functions, if not carefully managed. The thread with priority “001” you’ve observed could indeed be spawned by a legacy driver or system process. While it’s difficult to confirm without a deeper analysis of the system and its running processes, this scenario could contribute to system freezes if that thread is monopolizing CPU resources or causing a deadlock situation.

2. Re-Initializing CAN Library Freezing the System

Regarding the CAN library issue (WCL-1194), if you’re using an older version of the library (SJA1000 v3.3.2), it’s essential to note that the specific fix mentioned might not be applied to your version. While you’ve mentioned that forcing many CAN re-initializations doesn’t reproduce the freeze, intermittent issues like this can be challenging to replicate consistently. It’s possible that under certain conditions, not easily replicated in a controlled environment, the re-initialization could contribute to system instability.

General Recommendations

Given the legacy nature of the system and the complexity of diagnosing intermittent freezes, especially without direct access to debug outputs or error logs indicating a clear cause, the approach should be multifaceted:

  1. Review Thread Priorities: Carefully review the priorities of all threads, especially those with higher priorities, to ensure they are justified and not adversely affecting system stability. Adjusting priorities to more conservative levels, where feasible, could help mitigate potential issues.

  2. CAN Library Analysis: If possible, consider testing with a newer version of the CAN library, even temporarily, to see if the issue persists. This could help isolate whether the library version is a contributing factor.

  3. System Monitoring: Implement additional logging around critical operations, especially those involving CAN communication and thread management. This could help identify patterns or specific conditions leading to the freezes.

  4. Hardware Watchdog: While currently disabled, if feasible, enabling the hardware watchdog could provide a recovery mechanism in case of system freezes, minimizing downtime until a more permanent solution is found.

  5. Incremental Isolation: Try to isolate the issue by selectively disabling or modifying components of the application or

Please, also check if these links can help you solve your problem:

Apalis iMX6 | Toradex Developer Center.

The WCL-1194 issue, “Re-Initializing CAN Library could freeze the system,” is only applicable to the new Toradex Libraries and has already been fixed. The new library has a completely different structure and rewritten codebase compared to the legacy library, so it’s not possible to determine if the legacy library may have a similar issue without significant effort in analyzing and testing the legacy code, which we are not inclined to do due to its deprecated status.

Does the log you provided reflect a system “freeze” state?
Does the boot log from a normally running system have any other debug output past the “Starting IST” message?
If yes, it is most likely that the freeze happened in the Interrupt Dispatch Thread of the CANLib-SJA1000. Invoking the DeInitSJA1000 function should clear this.

First of all, thanks for taking the time to reply.
As previously stated, I'm aware that everything in this project is deprecated, from the OS to the hardware to the libraries. Basically it's an ancient project that we are still maintaining… so I don't expect you to put significant effort into analyzing and testing.

The log I provided does NOT reflect a system freeze state, just a normal boot-up. I mainly included it because it contains a lot of details about the specific hardware, OS, libraries, etc.

The logs from a normally running system contain various debug output from our own application and the occasional

CANReceiveMessages WaitforSingleObject Timeout.
CANReceiveMessages timeout

From a debug log perspective, there is no difference between a normally running system and a system that freezes. The last few log entries before a freeze are arbitrary application log entries and do not follow a distinct pattern. There are no system logs, exceptions or similar indicators in the log output prior to a freeze.
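
To check more systematically whether the entries before a freeze really follow no pattern, the captured serial logs could be scanned for long silences. The following is a minimal Python sketch, assuming the capture tool prefixes every line with a quoted timestamp as in the boot log above. The names `parse_log` and `find_gaps` and the 30-second threshold are illustrative choices, not part of any Toradex tool.

```python
import re
from datetime import datetime

# Matches captured lines such as:
#   “6/25/2024 3:13:30 PM”,SMP: Active CPUs = 4
# Both curly and straight quotes around the timestamp are accepted.
LINE_RE = re.compile(r'^"?[“"](?P<ts>[^”"]+)[”"],\s*(?P<msg>.*)$')

# Two lines taken from the boot log posted earlier in this thread.
SAMPLE = """\
“6/25/2024 3:13:41 PM”,AX88772B: MediaChange()

“6/25/2024 3:14:38 PM”,Using CANLib Version 3.3.2
"""


def parse_log(text):
    """Return a list of (timestamp, message) for every line that parses."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # blank lines and unprefixed output are skipped
        stamp = datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p")
        entries.append((stamp, match.group("msg")))
    return entries


def find_gaps(entries, threshold_s=30.0):
    """Return (seconds, message_before, message_after) for each long silence."""
    gaps = []
    for (t0, before), (t1, after) in zip(entries, entries[1:]):
        delta = (t1 - t0).total_seconds()
        if delta >= threshold_s:
            gaps.append((delta, before, after))
    return gaps


if __name__ == "__main__":
    for seconds, before, after in find_gaps(parse_log(SAMPLE)):
        print(f"{seconds:.0f} s of silence after: {before!r} (next: {after!r})")
```

In a capture that spans a freeze and the following power cycle, the entry just before the gap is the last thing the unit logged, so collecting those entries across several units makes it easier to compare them.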

On some systems the issue is reported to occur about once every 1-2 weeks, while other systems do not suffer from it at all. This of course makes it quite hard to reproduce, hence the strange questions in my original post.
From an application point of view the systems are identical. We have of course tried to analyze the differences between the systems in detail, but haven't found any reasonable explanation.
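
Because the freeze is this rare, it helps to estimate how long a test has to run before a freeze-free result means anything. Below is a rough Python sketch under the assumption (not confirmed anywhere in this thread) that freezes on an affected unit occur independently at a constant average rate, i.e. as a Poisson process. The 10.5-day mean is simply the midpoint of the reported 1-2 weeks, and both function names are made up for this example.

```python
import math


def p_at_least_one_freeze(mean_days_between_freezes, device_days):
    """Probability of seeing one or more freezes in the given device time."""
    return 1.0 - math.exp(-device_days / mean_days_between_freezes)


def device_days_needed(mean_days_between_freezes, confidence):
    """Device time needed to see at least one freeze with this probability."""
    return -mean_days_between_freezes * math.log(1.0 - confidence)


if __name__ == "__main__":
    mean_days = 10.5  # assumed: midpoint of "once every 1-2 weeks"
    days = device_days_needed(mean_days, 0.95)
    print(f"device-days for 95% chance of a freeze: {days:.1f}")
    print(f"check: {p_at_least_one_freeze(mean_days, days):.2f}")
```

Under these assumptions, an affected unit would need to run for about 31 device-days without freezing before a change could be called effective with 95% confidence. Running several affected units in parallel shortens the calendar time accordingly.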

Regarding topic 1: WC-1747: “Threads with higher priority (<101) can freeze the device”.

  • Can you confirm whether any of the legacy libraries spawn threads with priority 001 (which would violate WC-1747)?
  • Can you provide or refer to some more technical details on how this could freeze the device? (I'm guessing it's because the kernel has threads with priority 101 which, if stalled by a higher-priority thread, may freeze the system depending on some specific details unknown to me.)

Again, thanks for taking the time to reply :slight_smile:

  1. Yes, the mentioned DispatchIST has a default priority of 1.
  2. If a thread runs with a priority higher than 101 (that is, a numeric value below 101), the device can freeze on Tegra or become unstable on iMX6. This only happens if the affinity of the high-priority thread is set to core 1 and the thread is running all the time. The reason for the freeze lies in the interrupt handling: when “InterruptDone” is called, the interrupt handler moves the affinity to core 1. If the thread calling InterruptDone has a lower priority than the thread occupying core 1, it never gets the chance to return from InterruptDone. This seems to affect all kinds of drivers (USB, mouse, graphics, …), as most of them are interrupt based and run with a priority value above 100 (i.e. at a lower priority).
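
The starvation mechanism described above can be illustrated with a toy model. The Python sketch below simulates a single core with strict fixed-priority scheduling, using the Windows CE convention that a lower number means a higher priority. It is not how the CE kernel is implemented: the `Thread` class, the tick granularity and the priority 103 chosen for the interrupt thread are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int        # Windows CE convention: 0 is highest, 255 is lowest
    work_left: int       # CPU ticks still needed; -1 means "never finishes"
    busy_every: int = 1  # runnable only on ticks where tick % busy_every == 0


def run_core(threads, ticks):
    """Simulate one core that always runs the highest-priority runnable thread."""
    for tick in range(ticks):
        runnable = [t for t in threads
                    if t.work_left != 0 and tick % t.busy_every == 0]
        if not runnable:
            continue  # core is idle on this tick
        current = min(runnable, key=lambda t: t.priority)
        if current.work_left > 0:
            current.work_left -= 1
    return {t.name: t.work_left for t in threads}


if __name__ == "__main__":
    # Case 1: a priority-1 thread that is runnable on every tick.
    always_busy = [Thread("hog", 1, -1), Thread("ist", 103, 5)]
    print(run_core(always_busy, 1000))

    # Case 2: the same priority-1 thread, but runnable on only 1 tick in 100.
    mostly_blocked = [Thread("hog", 1, -1, busy_every=100), Thread("ist", 103, 5)]
    print(run_core(mostly_blocked, 1000))
```

In the first case the lower-priority thread never gets a single tick, which mirrors a thread stuck in InterruptDone behind a busy high-priority thread on core 1. In the second case the priority-1 thread is almost always blocked, similar to the thread with roughly 0.01% load seen in the task manager, and the lower-priority thread finishes its work in this model.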

OK, thanks for the clarification. We have managed to reduce the occurrence of the freeze to a somewhat acceptable level by reducing the application's CPU and driver load.
Again, thanks for taking the time to support this ancient stuff.