Sporadic issue with RPMSG on VF61

I developed my application on VF61 and it uses RPMSG communication between A5 and M4 cores.

Basically it works as expected, but the communication is not 100% reliable and some of the packets are lost.

In my scenario it’s always the A5 core that sends a message and waits for an answer from M4 core.

The A5 core sends a new message every second with Rpmsg_Write(), then waits on the handle hReceiveGlobal created for "DataAvailableEvent". The timeout for this event is quite long (0.1 s).
When the event is received, Rpmsg_Read() is called.

As far as I can see, roughly one or two messages in every 1000 get no answer from the M4 core, and the wait on the handle times out.
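A minimal sketch of how a missing answer could be told apart from a late or mismatched one, assuming each request carries a sequence number that the M4 echoes back. The `msg_t` layout, `classify_reply()` and the counters are hypothetical and not part of the Toradex Rpmsg library; a `NULL` reply stands for the event wait timing out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout: not defined by the Rpmsg library. */
typedef struct {
    uint32_t seq;          /* incremented by the A5 for every request */
    uint8_t  payload[32];  /* application data (size is an assumption) */
} msg_t;

typedef enum { REPLY_OK, REPLY_TIMEOUT, REPLY_STALE } reply_status_t;

typedef struct {
    uint32_t sent;      /* requests issued */
    uint32_t timeouts;  /* no answer within the event timeout */
    uint32_t stale;     /* answer arrived but belongs to another request */
} stats_t;

/* Classify one request/answer cycle.
 * reply == NULL means the wait on the event handle timed out. */
reply_status_t classify_reply(const msg_t *request, const msg_t *reply,
                              stats_t *stats)
{
    assert(request != NULL && stats != NULL);
    stats->sent++;
    if (reply == NULL) {
        stats->timeouts++;
        return REPLY_TIMEOUT;
    }
    if (reply->seq != request->seq) {
        stats->stale++;
        return REPLY_STALE;
    }
    return REPLY_OK;
}
```

Logging the three counters over a long run would show whether answers are really lost or only arrive after the timeout and get picked up by the next cycle.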

So I started debugging on the M4 core using a GPIO: the M4 core sets it high every time it receives a new message from the A5 with rpmsg_rtos_recv_nocopy(), and sets it low after it sends the answer, when it calls rpmsg_rtos_recv_nocopy_free().

I see that M4 core receives every message sent by A5 core and it calls rpmsg_rtos_recv_nocopy_free() every time after 4 or 5 ms.

So I suspect two possible reasons for the missing answers:

  • some errors inside M4 functions (but in this case the behavior should be similar when A5 runs Linux)
  • some race conditions or whatever in CE 6 RPMSG functions

Can someone suggest how to debug this scenario in more depth?

Dear @vix

Is it possible that you send me your application (or of course a stripped-down extract) so I can debug into the rpmsg library to search for the problem? I would need:

  • The M4 application
    (binary should be fine for the moment)
  • Source code of the WinCe application
    (I prefer a complete VS project)

You can mark your reply as private if you don’t want the public to see it.

Regards, Andy

Hello @andy.tx

thanks for your interest in this topic.
I’ll try to do my best to narrow down the application because on both cores it’s a little bit complicated.

At the moment I don’t understand whether (and how) the issue is related to this one (which I had some months ago, but which then disappeared).

It seems a little bit different, but it requires deeper investigation from my side.

I’ll let you know.

Dear @vix

You already figured out that the communication M4 → A7 is the issue.
You could try to find out whether the problem is related to bidirectional communication, or whether it is still there if you just send messages from the M4 to the A7 only, on a timer base.

Regards, Andy

Hi @andy.tx

since I use VF61, the communication is M4 <–> A5 (and not A7).
But I’m going to investigate

Hi @andy.tx

is it possible to use the Rpmsg library on the WinCE side without loading and executing the firmware (only to create the communication channel)?

I’m trying to load and debug firmware on M4 core through JTAG while my application runs on A5 core…

I spent the last couple of days debugging the issue and I found something really strange.

When the firmware on the M4 core runs standalone (i.e. without the application on the A5), everything works fine.

In this scenario no communication happens between the two cores (I haven’t found a way to debug the M4 through JTAG while the application on the A5 is running and communicating with the M4).

When the application on the A5 loads and starts the firmware on the M4, communication happens between the two cores and the OCRAM region from 0x3F040000 to 0x3F070000 is unexpectedly written (some bytes every 512 bytes).

Since the RPMSG buffers start at 0x3F070000, I wonder whether a bug in the RPMSG communication could be using this portion of OCRAM.

The same memory region is not written when no communication happens between the two cores.
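A minimal sketch of one way to locate such unexpected writes, assuming the region can be pre-filled with a known pattern and scanned later. `FILL_PATTERN`, `fill_region()` and `scan_region()` are illustrative names; on the real target the pointer would refer to the OCRAM range in question, while here an ordinary buffer is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_PATTERN 0xA5u  /* arbitrary marker value (assumption) */

/* Fill a region with the marker pattern before communication starts. */
void fill_region(uint8_t *base, size_t len)
{
    assert(base != NULL);
    memset(base, FILL_PATTERN, len);
}

/* Count bytes that no longer hold the marker.
 * *first and *last receive the offsets of the first and last modified
 * byte; they are left untouched when nothing was modified. */
size_t scan_region(const uint8_t *base, size_t len,
                   size_t *first, size_t *last)
{
    size_t dirty = 0;
    assert(base != NULL && first != NULL && last != NULL);
    for (size_t i = 0; i < len; i++) {
        if (base[i] != FILL_PATTERN) {
            if (dirty == 0) {
                *first = i;
            }
            *last = i;
            dirty++;
        }
    }
    return dirty;
}
```

A write that happens to store the marker value itself goes unnoticed, so running the scan with two different patterns gives a more reliable picture.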

Can someone from Toradex side help with this issue?

Dear @vix

The current implementation does not allow skipping the firmware loading. I will check how easy it would be to add this feature.

It’s not only about the actual firmware loading (which is simple to avoid). But you probably want the relevant clocks to be enabled by the library before loading the firmware, and possibly the M4 firmware needs to be up and running before the actual channel initialization.

Regards, Andy

Dear @vix
Thank you for the detailed report. I will try to find some time to debug the issue. Please allow me some days for this.
Regards, Andy

I can add some other info (in case this helps).

It seems that OCRAM addresses from 0x3F040000 to 0x3F060000 are not “dirty”.

If I’m right, starting from 0x3F060000 I can see 16 buffers of 512 bytes each, with the rpmsg messages sent from M4 to A5.

I added a temporary workaround to my project, leaving the 0x4000 bytes from 0x3F060000 to 0x3F064000 unused, and it seems that everything works fine…
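The numbers above can be checked with a little arithmetic. This is only a sketch based on what I observed (16 buffers of 512 bytes starting at 0x3F060000); the macro and function names are mine, not taken from any library.

```c
#include <assert.h>
#include <stdint.h>

#define RPMSG_BUF_SIZE  512u         /* observed size of one buffer */
#define RPMSG_NUM_BUFS  16u          /* observed number of buffers */
#define OBSERVED_BASE   0x3F060000u  /* first buffer seen in OCRAM */
#define RESERVED_SIZE   0x4000u      /* range skipped by the workaround */

/* Start address of buffer `index` in a pool beginning at `base`. */
uint32_t buffer_addr(uint32_t base, uint32_t index)
{
    assert(index < RPMSG_NUM_BUFS);
    return base + index * RPMSG_BUF_SIZE;
}

/* Total size of one pool of buffers. */
uint32_t pool_size(void)
{
    return RPMSG_NUM_BUFS * RPMSG_BUF_SIZE;
}
```

One pool is 16 × 512 = 0x2000 bytes, so it ends at 0x3F062000. The 0x4000 bytes skipped by the workaround would hold exactly two such pools; that the second one belongs to the opposite direction is an assumption at this point.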

Fingers crossed.

If you could investigate this behavior, it would help a lot.

Dear @vix
I uploaded a new preliminary version of the Toradex CE Libraries:

In this version I added a new allowed “FileFormat” value “NoM4boot” (for VF61 only). If you add the following code to your application

Rpmsg_SetConfigString(hRpmsg, L"FileFormat", L"NoM4boot", StoreVolatile);

the function Rpmsg_Open() will neither load nor start the M4 firmware, so you can use JTAG to do this before calling Rpmsg_Open().

The approach works, but I found that it is not always stable. For example, I observed that my JTAG environment (I’m using a SEGGER J-Link) stops both cores while downloading the firmware, which sometimes caused the Visual Studio debugger to lose the connection.

Regards, Andy

Hi @andy.tx

thank you for this preliminary version.

I’m going to test it in the next few days and I’ll let you know.

In the meantime I’ve been doing other tests on my side and I found that, after fixing some kind of heap and/or stack overflow in my M4 code, I can build the DEBUG version of the M4 firmware and load and execute it from the A5 core (using the official library 2.3-20181011).

Then I can use JTAG to debug the M4 in “connect-only” mode (connection to M4 while it’s running).

After this progress I was able to see the unexpected memory corruption that I described below.

Did you have a chance to look into this behavior?

Hi @andy.tx

do you have news on your side?

I did additional tests and it seems that the issue is related to what A5 core writes at address 0x3F070000.

This should contain (from M4 point of view) rdev->tvq->vq_ring.desc[x].addr and I see that addr is 0x3F060000.

1569-capture.png

I can’t find any explanation for this value (which is outside the expected area).

Could you double-check and verify, please?

The issue is quite urgent, from my side

Moreover rdev->rvq->vq_ring.desc[x].addr points to 0x3F062000.

So it seems that both TX and RX put their messages at addresses 0x3F06xxxx.

Is this expected behavior?

Should I configure M4 so that it doesn’t use 0x3F06xxxx?
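A minimal sketch of the check I am doing by hand, assuming the descriptor layout of the standard virtio ring (addr, len, flags, next); the firmware’s own definition should be compared against it. `desc_in_region()` is a hypothetical helper, and the region size used in the example below is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descriptor layout as in the standard virtio ring (assumed here). */
struct vring_desc {
    uint64_t addr;   /* address of the message buffer */
    uint32_t len;    /* length of the buffer in bytes */
    uint16_t flags;
    uint16_t next;
};

/* Return 1 if the buffer described by `d` lies completely inside
 * [base, base + size), otherwise 0. Written to avoid overflow. */
int desc_in_region(const struct vring_desc *d, uint64_t base, uint64_t size)
{
    assert(d != NULL);
    return d->addr >= base &&
           d->len <= size &&
           d->addr - base <= size - d->len;
}
```

With the expected area starting at 0x3F070000 and an assumed size of 0x10000, descriptors with addr 0x3F060000 or 0x3F062000 fail this check, while one at 0x3F070000 passes.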

One more thing, that could be useful to Toradex engineers:

I leave my application running for at least a couple of hours, with the A5 core sending a request to the M4 every second; the M4 prints the messages it receives from and transmits to the A5 on the UART.

Suddenly Rpmsg_Read() on the A5 core returns a wrong buffer.

Both the UART and the debugger connected to the M4 confirm that the M4 core received the right request and wrote the expected (right) answer to memory.

It seems that rpmsg on WinCE is buggy…

Could it be that for some reasons A5 core reads from the wrong message buffer in the FIFO?

Could the Rpmsg library return the memory address of the message buffer it reads from? That way I could compare it with the address the M4 writes to (which I send to the UART).

I’ve double checked and it’s not a matter of memory leak on A5 side.

Hi @vix
Unfortunately I didn’t find too much time to debug your issue.
One thing I could easily do is generate a version of the RpmsgLib which outputs more information on the debug serial port. This will slow down the message transfer, but I think this is not relevant for your application.
Regards, Andy

Hi @andy.tx

if you mean printing out info on UARTA (the one used by the bootloader), please go ahead.

I can try everything you need and everything can be useful.

I found a workaround on the M4 side but I have no way to work around it on the A5, so this is a blocking issue for me.

Hi @vix
Please try the debug version of the preliminary Toradex CE Libraries:

There’s an additional buffer where the WinCe-RpmsgLib stores the incoming messages before you read them from your application. Therefore you see two sets of addresses displayed on UARTA.

BTW: I implemented the additional debug messages in a way that I can activate them by setting a preprocessor #define DBGBUFFERS. I will remove this #define again for any public version of the RpMsgLib. If you need it again in the future, let me know and I will build a temporary library version again.

Regards, Andy

Hi @andy.tx

I’ve just rebuilt my application with this new version of the library but I can’t see any debug messages on UARTA.

I only see the messages printed out by the bootloader.

Can you verify, please?