Sporadic issue with RPMSG on VF61

Hi @vix

You also need to enable debug messages in the bootloader config block. First thing you will notice is, that you see some driver loading messages from the image booting into WinCe.

Regards, Andy

Hi @andy.tx

I’ve alread enabled debug messages in the bootloader and I see them.

No messages from Rpmsg at all

Hi @vix

Sorry for not doing a final test before - I turned out that the #define DBGBUFFERS got lost in the process of creating the preliminary release package.
Here’s the updated libraries:

Regards, Andy

Hi @andy.tx

now the messages are printed out as expected.

First of all, one strange (but I can’t say unexpected thing): this is the first message

RpmsgLib can buffer 1024 incoming messages.
Rpmsg: Map sharedMem phys 0x3f060000 -> logic 0x00bd0000
Rpmsg: Map vring[0] phys 0x3f070000 -> logic 0x00be0000
Rpmsg: Map vring[1] phys 0x3f074000 -> logic 0x00bf0000

Does this mean that 0x3f060000 is used as shared memory between cores?
And so M4 should not use it (as I found during my debug)?

A couple of years ago I asked about strange memory overwriting at addresses above 0x3f060000 (see here).

The answer from Toradex side was not clear (at that time).

Can we say now that memory between 0x3f060000 and 0x3f064000 is used by rpmsg library?

And so M4 core can’t use it?

Is it possible to add this info to the documentation clearly?

Another minor thing: the firsl line is

RpmsgLib can buffer 1024 incoming messages

Is this right? Shouldn’t be 32 incoming messages (512 bytes each)?

Dear @vix

I analyzed the rpmsg behavior in more detail in the recent days, so I meanwhile have a better insight int what’s happening. For example there are multiple buffers working together, and which are not perfectly adjusted to each other in size. However, I didn’t want to change too much now.
Here’s a (still somewhat simplified) view:

  • For the communication between the M4 and A5, there are 3 shared memory ranges used:
    • shared memory @ 0x3f060000 contains the actual message data (512 bytes each)
    • vring[0] and [1] are two linked lists of fixed size, each element containing basically a pointer to one message. (The lists are much longer than the number of buffers in the shared memory, which is basically a waste of space…)

You are correct: memory between 0x3f060000 and 0x3f064000[there was a typo in your comment above] is used by rpmsg library, and therefore not available for anything else. The space reserved is used for 32 buffers, 512 bytes each.
I will add this to the documentation.

The 1024 messages reported in the log output are about the local buffer inside the RpMsgLib - the library reads the messages from the M4 in an interrupt routine. Your application could read these 1024 messages long after they arrived.

Regards, Andy

Hi @andy.tx

your investigation clarifies some points.

I see that the log messages from the library are like this:

 Rpmsg: got M4 data[0] @shared 0x00bd0210, store @local buf 0x00525ec0
 Rpmsg: App read local buf @0x00525ec0
 Rpmsg: got M4 data[1] @shared 0x00bd0410, store @local buf 0x00df7880
 Rpmsg: App read local buf @0x00df7880

Can you clarify a little bit hte meaning of the seveal addresses?

I think that @shared points to the original buffer in the shared memory.
Otherwise store @local and App read local buf should be the memory addresses where rpmsg reads the message.

Is this right?

Which kind of checks should I do on this addresses?

I have a full log (more or less 2500 messages) which exhibits the issue and I like finding the reason.

Hi @andy.tx
I think it can be useful.

Dear @vix

Your assumptions are correct.
You have to look at these addresses in context with the initialization messages in the beginning:

Rpmsg: Map sharedMem phys 0x3f060000 -> logic 0x000f0000
Rpmsg: Map vring[0] phys 0x3f070000 -> logic 0x00100000
Rpmsg: Map vring[1] phys 0x3f074000 -> logic 0x00110000
Rpmsg: got M4 data[0] @shared 0x000f0210, store @local buf 0x00043ee0
Rpmsg: App read local buf @0x00043ee0
Rpmsg: got M4 data[1] @shared 0x000f0410, store @local buf 0x00043ee0
Rpmsg: App read local buf @0x00043ee0

Please consider

  • Applications always work based on logical (“translated”) memory addresses. The imporant one in your case is the first line: Map sharedMem phys 0x3f060000 -> logic 0x000f0000, so all accesses to the memory range 0x000f0000 further down do actually access the physical memory @ 0x3f060000. The translated address can change, in your example above, it seems to be 0x00bd0000.
  • The local buffer is an additional buffer inside the RpMsgLib, and not relevant for your investigations.

This means you can check all addresses @ 0x000f0000, translate them and verify that your M4 software has written to the translated address just before.

Regards, Andy

Hi @andy.tx

translated addresses seems ok because at every message the address increments by 0x200 (i.e. 512 bytes) and it overlaps after 0x2000 (as expected).

Does the application (i.e. Rpmsg_Read) read data from the translated addresse or from the local buffers?

Is it possible to print out the number of received bytes?

Rpmsg: got M4 data[0] xxx bytes @shared 0x000f0210, store @local buf 0x00043ee0

It would be useful for my investigation

Dear @vix

  • The Rpmsg_Read() itself reads only from the local buffer.
  • But there’s an interrupt-driven thread inside the RpMsgLib, which copies the messages from the translated address (shared memory) to the local buffer. This thread outputs the message got M4 data[0] ....

It is possible to print the number of bytes, but it is probably not helpful: The message data from M4 to A5 always uses exactly 512 bytes. The number of bytes would just be an indication how many of the 512 bytes are useful data, and how many are unused. If I am correct these unused bytes are cleared to zero on the M4 side.

Regards, Andy

Hi @andy.tx

I modified something in my application and I added a messageID to every message sent from A5 to M4.

M4 copies this messageID into the answer message and sends to A5.

I found what I expected: randomly (sometimes after some seconds, sometimes after minutes) A5 core is one message back: it sends messageID = X but it reads back messageID = X-1.

The translated addresses seems ok.

How can I explain this behavior?

Do you mean copying inside ToradexCElibraries\libs\lib folder (side by side to Toradex_CE600 (ARMv4I)Release folder)?

Hi @andy.tx

my investigation is going on and I have more info (useful, I hope).

For some reasons (I can’t guess why) sometimes WaitForSingleObject returns an error (I’m going to investigate which kind of error) even if M4 core received the rpmsg and answered in the expected way. Let’s suppose this message has ID = X.

The following call to Rpmsg_Write() sends a new message to M4 core (ID = X+1) but hte matching Rpms_Read() returns the previous message ID = X (which has been received properly).

And so on. Tx and Rx are now misaligned…

I’m going to investigate on WaitForSingleObject, but you’re welcome if you have ideas.

Dear @vix,

The most common error is a TIMEOUT. And as it seems that your application continues successfully after this error, you simply need to change your algorithm to repeat the WaitForSingleObject () if it was not successful (or use a longer timeout).

Regards, Andy

Yes, I confirm it’s a timeout (100 ms).

The strange thing (from my side) is that I used some GPIO to debug M4 core and I’m sure the answer takes less than 5 ms. And 100 ms seems longer enough to me.

Anyway I’m going to increse it and see what happens.

Dear @vix

If a WinCe thread with a higher priority than your application is busy for more than 100ms, your application is blocked for this time. Even if there are multiple busy threads of equal priority, they will alternate in a cycle of 100ms each.

You can either enlarge the timeout, or give your application a higher priority.

Regards, Andy

Since the topic is quite long, I summarize the final results.
The sporadic issues I saw came from two different issues:

  • the documentation from Toradex for rpmsg communication was wrong (as @andy.tx pointed out): memory between 0x3f060000 and 0x3f064000 is used by rpmsg library, and therefore not available for anything else. The space reserved is used for 32 buffers, 512 bytes each and must be left empty by the M4 firmware.
  • on A5 side, a timeout of 100 ms for Rpmsg_Read() is sometimes too short. I increased it to 500 ms (that is ok for my needs)

Fixing these two issues solved my problem.