Debugchk failed in rpmsg lib

Hi all,

On the WinCE side I sometimes get this error from Rpmsg:

Unknown: DEBUGCHK failed in file .\src\RpMsg_imx7\src\imx7_platform.c at line 236

As I said, the error occurs sporadically.
When it occurs, the application on the A7 side crashes.
On the M4 side, the application seems to keep running.

What does this error mean?
And how can I solve the issue?

Kind regards,
kuzco

Dear @Kuzco

The Rpmsg library keeps an internal counter of how often interrupts were disabled and enabled (this happens each time any mutex is locked / unlocked).

The DEBUGCHK is hit if the counter falls below zero, i.e. there were more mutex-unlock calls than mutex-lock calls.
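
To illustrate the idea, here is a minimal sketch of such a nesting counter. It is not the actual code from imx7_platform.c: the names sketch_interrupt_disable_all(), sketch_interrupt_enable_all(), disable_depth and irq_masked are made up for this example, and where the real library hits the DEBUGCHK, the sketch only prints a message and returns -1.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins, not the real library state. */
static int disable_depth = 0;  /* number of outstanding "disable" calls */
static int irq_masked = 0;     /* 1 while interrupts count as masked */

/* Assumed to run on every mutex lock: mask on the outermost call only. */
void sketch_interrupt_disable_all(void)
{
    if (disable_depth == 0) {
        irq_masked = 1;
    }
    disable_depth++;
}

/*
 * Assumed to run on every mutex unlock.
 * Returns 0 on success, -1 if there was no matching disable call
 * (the situation in which the real library would hit its DEBUGCHK).
 */
int sketch_interrupt_enable_all(void)
{
    if (disable_depth <= 0) {
        fprintf(stderr, "unbalanced unlock: no matching lock\n");
        return -1;
    }
    disable_depth--;
    if (disable_depth == 0) {
        irq_masked = 0;  /* unmask only when the outermost lock is released */
    }
    return 0;
}
```

With two lock calls followed by three unlock calls, the third unlock is the one that would be reported. Anything that runs an unlock without its lock (or runs the same unlock twice) ends up in this path.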

Mutexes are used in various places in the library, therefore I cannot tell you on a higher level which function caused the problem.
Knowing the callstack when the DEBUGCHK is hit would be helpful. Can you see this in the debugger? Or can you provide me a test project to reproduce the problem?

Regards, Andy

Dear @andy.tx,
got the error today.
Visual Studio shows me only the call stack in the following picture:
1654-call-stack-debugcheck.png
Maybe it will help?

Regards, kuzco

Dear @Kuzco

Somehow your screenshot didn’t make it into the post. Can you please try again?

If you write your message in the “Your answer” section, you will see a live preview. You can finally copy the text into the “add comment” editor to submit it. (Or submit it in the “Your answer” editor, then I will turn it into a comment).

Regards, Andy

1654-call-stack-debugcheck.png

Hope this works now

Dear @Kuzco

The top two lines are expected. As I wrote in my initial answer, the DEBUGCHK is hit as a result of releasing a mutex.
The bottom line tells us more, but unfortunately not the full story: According to the current content of the stack, imx7_env_unlock_mutex() was called from address 0x00e9fdf0. There is no symbol at this location, so something got screwed up.
Either that location 0x00e9fdf0 was reached by jumping to a wrong function pointer, or the stack got messed up, resulting in a wrong return address.

I’m afraid there’s no simple way to trace back the source of the problem. A few approaches to learn more:

  • disassemble the range around 0x00e9fdf0 to guess what happened
  • examine the stack contents to guess what happened
  • try to figure out from top-down what your application is doing when the error happens.
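
For the second point, one common trick is to dump the raw stack words and look for values that fall inside the code range of your module, since those are candidates for real return addresses. A rough sketch of that filter, assuming you have copied the stack words out of the debugger's memory window into an array; the function name find_code_pointers() and all parameters are hypothetical, and the code range has to come from your own map file:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan a captured stack dump for words that lie inside [code_start, code_end).
 * The index of each matching word is stored in hits[]; at most max_hits
 * entries are written. Returns the number of matches stored.
 */
size_t find_code_pointers(const uint32_t *stack_words, size_t count,
                          uint32_t code_start, uint32_t code_end,
                          size_t *hits, size_t max_hits)
{
    size_t found = 0;

    assert(stack_words != NULL && hits != NULL);

    for (size_t i = 0; i < count && found < max_hits; i++) {
        if (stack_words[i] >= code_start && stack_words[i] < code_end) {
            hits[found++] = i;  /* candidate return address at this slot */
        }
    }
    return found;
}
```

A match is only a hint: stale values from earlier calls also remain on the stack, so each candidate still needs to be checked against the disassembly.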

Regards, Andy

Dear @andy.tx,
it’s very hard to debug that failure.
Got the same error but with another address:

a7m4_com.exe!imx7_platform_interrupt_enable_all() Line 236	C
a7m4_com.exe!imx7_env_unlock_mutex(void * lock) Line 297	C
001cfa48()	Unknown

If I disassemble address 0x001cfa48, it shows me the following:

001CFA2C  bvc         001CF9DE  
001CFA2E  ands        r2,r2,r0  
001CFA30  movs        r0,r0  
001CFA32  lsls        r0,r1,#1  
001CFA34  movs        r0,r0  
001CFA36  movs        r0,r0  
001CFA38  adr         r3,001CFCFC  
001CFA3A  movs        r5,r3  
001CFA3C  movs        r0,r0  
001CFA3E  movs        r0,r0  
001CFA40  ?cmp        r4,r2  
001CFA42  movs        r5,r3  
001CFA44  movs        r0,r0  
001CFA46  movs        r0,r0  
001CFA48  ?? ?? 
001CFA4A  movs        r4,r3  
001CFA4C  str         r1,[r1,#0x14]  
001CFA4E  movs        r5,r0  
001CFA50  str         r3,[sp,#0x100]  
001CFA52  movs        r5,r3  
001CFA54  movs        r0,r0  
001CFA56  movs        r0,r0  
001CFA58  movs        r0,r0  
001CFA5A  movs        r0,r0  
001CFA5C  ldr         r7,001CFA80  

As you can see, at address 0x001cfa48 there are only question marks.
Am I doing it right?

It occurred to me that this failure might happen because I disable the MU interrupt on the M4 while I send data over SPI (referring to this community post).

Is it possible that there are more mutex unlocks than locks because the MU interrupt is disabled on the M4?
That would fit, because I get this error only very sporadically and can’t reproduce it.
So could it be a timing problem?

Best regards,
kuzco

Dear @Kuzco

I agree this issue is hard to track down. Everything before and after 0x001cfa48 does not look like reasonable program code.

So still one possible explanation is stack corruption.

Another possibility that just came to mind, since you mentioned changing the interrupt configuration:

Is there maybe an interrupt vector pointing to an invalid address, so that the DEBUGCHK() is reached because the system triggers an interrupt which you don’t expect and jumps to the undefined interrupt vector?

Regards, Andy