RPMSG stops working after given time IMX7D

vhmm100 · August 12, 2022, 1:31pm

Hi,

I am using a Colibri IMX7D 512MB with a custom Linux Yocto image running on A7 and the default freeRTOS from Toradex on the M4.

We had some trouble establishing RPMsg communication between the two cores. After a while we figured out that we were sending some broken characters from the M4 to the A7 through the RPMsg channel, which made the communication work or fail at random. To avoid that, we implemented a function that checks each character's range before sending it (printable ASCII, codes 32 to 126), and with that RPMsg became stable, sending and receiving data on both cores. We implemented the same function on the A7 side, but it didn’t change the results.

Our problem is that after some time (about 30 min), RPMsg stops working. From what we have seen while debugging, the problem seems to be on the A7 side: the things we control from the M4 (like a buzzer when a stop button is pressed) still work fine. Also, our screen freezes and we can’t do anything until we restart the device. If we leave RPMsg out, everything runs fine without breaking.

We tried to debug the A7 in real time by logging the number of RPMsg bytes written and read to a file. We can check the file at any time by logging in to the board through SFTP. Before RPMsg breaks, the file shows the expected number of bytes written and read. Once it breaks, the file shows ‘-1’ for bytes written and ‘0’ for bytes read. Since the A7 writes first and only then reads incoming data from the M4, we suppose from the ‘-1’ that it is the A7 that is breaking RPMsg, after which all our code stops working.

Would you have any idea of what might be causing the RPMsg to stop after a while?

Thanks for the help !

Edward · August 12, 2022, 2:39pm

RPMSG has no limitations on characters used. Check your TTY settings, perhaps TTY layer is not in RAW mode.
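For reference, a minimal sketch of putting the TTY into raw mode from the A7 application, so the line discipline does not echo, translate, or drop bytes. The helper names `tty_make_raw` and `tty_set_raw` are made up for this example, and it assumes the descriptor was opened on /dev/ttyRPMSG0 and that glibc's `cfmakeraw()` is available:

```c
#define _DEFAULT_SOURCE 1   /* exposes cfmakeraw() in glibc */
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/* Clear line-discipline processing in a termios struct (pure, no I/O). */
void tty_make_raw(struct termios *tio)
{
    assert(tio != NULL);
    cfmakeraw(tio);          /* no echo, no canonical mode, no CR/LF translation */
    tio->c_cc[VMIN]  = 0;    /* read() returns whatever is available...          */
    tio->c_cc[VTIME] = 0;    /* ...without waiting on an inter-byte timer        */
}

/* Apply raw mode to an already-open TTY descriptor.
 * Returns 0 on success or -1 with errno set. */
int tty_set_raw(int fd)
{
    struct termios tio;

    if (tcgetattr(fd, &tio) != 0)
        return -1;
    tty_make_raw(&tio);
    return tcsetattr(fd, TCSANOW, &tio);
}
```

Calling `tty_set_raw(fd)` once, right after `open()`, should be enough. If binary data then passes unchanged, the character filter on the M4 side is probably not needed.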

RPMSG works well on iMX7. However, look at the excerpt from the kernel docs that @hfranco.tx mentioned in a recent thread here. Once your A-side TX buffers are exhausted, RPMSG will sleep for 15 seconds… I guess that is what you are seeing: not a complete death of RPMSG, but this 15-second pause.

How do you avoid wasting buffers unexpectedly and hitting this 15-second pause? When you have a lot of data to send, don’t send one byte or a few bytes at a time. Instead, send as much as you can, up to 496 bytes, in one write() call. 496 comes from the maximum buffer size of 512 bytes minus 16 bytes for the RPMSG header. Both sides, Cortex-A and Cortex-M, should follow this rule. One transfer triggers one interrupt, and a high interrupt rate is a performance killer for every CPU. Sending single bytes instead of bursts multiplies the interrupt events. The interrupt rate is not such a big problem for the Cortex-M itself, but each delivery to the Cortex-M also sends an interrupt back to the Cortex-A.
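The burst rule above can be sketched roughly like this for the A7 side. The function names are hypothetical; the only numbers taken from this thread are the 512-byte buffer and the 16-byte header:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define RPMSG_BUF_SIZE    512u
#define RPMSG_HDR_SIZE    16u
#define RPMSG_MAX_PAYLOAD (RPMSG_BUF_SIZE - RPMSG_HDR_SIZE)   /* 496 bytes */

/* Number of write() calls needed for len bytes at up to 496 bytes each. */
size_t rpmsg_chunks_needed(size_t len)
{
    return (len + RPMSG_MAX_PAYLOAD - 1) / RPMSG_MAX_PAYLOAD;
}

/* Send len bytes using as few write() calls as possible.
 * Returns the number of bytes sent, or -1 on the first error (errno set). */
ssize_t rpmsg_write_burst(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t sent = 0;

    assert(buf != NULL || len == 0);
    while (sent < len) {
        size_t chunk = len - sent;
        if (chunk > RPMSG_MAX_PAYLOAD)
            chunk = RPMSG_MAX_PAYLOAD;

        ssize_t n = write(fd, p + sent, chunk);
        if (n < 0)
            return -1;
        if (n == 0)
            break;              /* nothing accepted; stop instead of spinning */
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```

A 173-byte message fits in a single call, so one buffer and one interrupt per message is the best case here.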

vhmm100 · August 12, 2022, 2:57pm

Hi @Edward ,

Thanks for your response. The fix we did for the “broken” characters worked just fine (it is just on the M4 side), we didn’t check our TTY settings though.

It seems that the problem is on the A7 side, since we can still control some things with the M4 (e.g. the buzzer). It might make sense that the A7 side is going to sleep, since we are getting ‘-1’ for the written bytes, but RPMsg doesn’t stop for just 15 s: it stops permanently and our code never resumes.

We initialize RPMsg at the beginning of our code (on both sides) and wait for the RPMsg channel to be established before going on, but we never implemented anything to reset RPMsg if it stops. Would RPMsg come back automatically after the A7’s 15 s sleep?

Also, could you explain why the A7 TX buffers would get exhausted? We need RPMsg to work continuously, without any interruption.

… Having a lot of data to send, don’t send a byte or few bytes at a time. Instead send as much as you can up to 496 bytes at a time in one write() call. 496 comes from max buffer size 512 minus 16 bytes for RPMSG header.

We don’t think that this might be a problem now, because we are sending everything at once (currently we have 173 bytes). Both sides have equal RPMsg configurations regarding buffer sizes, RPMsg structs …

Thanks.

hfranco.tx · August 12, 2022, 5:16pm

Hi @vhmm100,

thanks for jumping in @Edward!

I have some questions in order to better understand your scenario.

Is the “-1” value only written to memory, or is it also the return code from your read/write function?
If it is the return code, I would recommend including errno.h and then printing errno with printf, or calling perror, to get more information about what is happening in your code.
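As a rough sketch of that suggestion: a small wrapper (the name `logged_write` is made up for this example) that records why `write()` failed, instead of only storing the -1. The `log` stream could be the debug file already used on the A7:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* write() wrapper that logs the reason for a failure.
 * Returns whatever write() returned; errno is preserved for the caller. */
ssize_t logged_write(int fd, const void *buf, size_t len, FILE *log)
{
    assert(log != NULL);

    ssize_t n = write(fd, buf, len);
    if (n < 0) {
        int saved = errno;      /* fprintf() may overwrite errno */
        fprintf(log, "write failed: errno=%d (%s)\n", saved, strerror(saved));
        fflush(log);            /* keep the line even if the process hangs next */
        errno = saved;
    }
    return n;
}
```

With a descriptor opened using O_NONBLOCK, an errno of EAGAIN would suggest that the driver had no room for the data at that moment, while EINTR would point to an interrupted call.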

From the Kernel Documentation about RPMSG, after 15 seconds the function will return -ERESTARTSYS, which means (according to this link):

-ERESTARTSYS is connected to the concept of a restartable system call. 
A restartable system call is one that can be transparently re-executed by 
the kernel when there is some interruption.

For instance the user space process which is sleeping in a system call can 
get a signal, execute a handler, and then when the handler returns, it 
appears to go back into the kernel and keeps sleeping on the original 
system call.

Using the POSIX sigaction API's SA_RESTART flag, processes can arrange the 
restart behavior associated with signals.

In the Linux kernel, when a driver or other module blocking in the context 
of  a system call detects that a task has been woken because of a signal,
it can return -EINTR. But -EINTR will bubble up to user space and cause 
the system call to return -1 with errno set to EINTR.

If you return -ERESTARTSYS instead, it means that your system call is 
restartable. The ERESTARTSYS code will not necessarily be seen in user
 space. It either gets translated to a -1 return and errno set to EINTR 
(then, obviously, seen in user space), or it is translated into a system 
call restart behavior, which means that your syscall is called again 
with the same arguments (by no action on part of the user space process: 
the kernel does this by stashing the info in a special restart block).
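To illustrate the two outcomes described in that excerpt, here is a minimal user-space sketch. The helper names are hypothetical. The first function retries by hand when the call comes back with EINTR; the second installs a signal handler with SA_RESTART so that the kernel may restart the call transparently:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Repeat write() for as long as it fails only because a signal interrupted it. */
ssize_t write_retry_eintr(int fd, const void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL || len == 0);
    do {
        n = write(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* Install a handler whose interrupted system calls are restarted by the kernel.
 * Returns 0 on success or -1 with errno set. */
int install_restarting_handler(int signo, void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

Neither helper changes what happens when the TX buffers are genuinely exhausted; they only keep a stray signal from being mistaken for an RPMsg failure.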

From what I understand, your Cortex-A is completely frozen and you can’t do anything, am I correct?
How have you compiled your code? Are you using the demo from NXP, ttyrpmsg?

Best Regards,
Hiago.

Edward · August 12, 2022, 8:30pm

No matter what side, both M4 and A7 support all characters. I’m using RAW binary data, no problems.

No problems on A7, unless you flood it with short messages too often. Tx to M4 counts here as well.

Once it becomes unresponsive, perhaps it needs 15 seconds of silence without further attempts to send data. I’m not sure; I solved that problem long ago.

Yes, it should recover. But I’m not sure whether you must avoid sending more data during those seconds. In my bad SW variant I could stop sending the data stream, try sending a single message at will, and see whether it reached the M4. It was clearly recovering after 15 s. Yes, automatically.

Simple. You have a limited number of buffers. If you send messages faster than the M4 side confirms receiving them (when the A7 receives a message ack, it can free the corresponding Tx buffer), you will run out of spare buffers and RPMSG will be stuck for those 15 s. Your M4 code should read messages from the A7 ASAP and store them in a FIFO, instead of reading one message, doing some slow work, and only then reading the next message.
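A minimal sketch of such a FIFO in plain C, so that the receive path only copies the payload and returns. The names and the depth of 8 slots are assumptions for this example, and the critical section needed when the producer runs in an interrupt or another task is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FIFO_SLOTS   8u      /* hypothetical depth */
#define FIFO_MSG_MAX 496u    /* largest RPMsg payload (512 - 16) */

struct msg_fifo {
    unsigned char data[FIFO_SLOTS][FIFO_MSG_MAX];
    size_t        len[FIFO_SLOTS];
    size_t        head;      /* next slot to write */
    size_t        tail;      /* next slot to read  */
    size_t        count;     /* slots currently occupied */
};

/* Copy a received payload into the FIFO; false if it is full or the payload is too big. */
bool fifo_push(struct msg_fifo *f, const void *payload, size_t len)
{
    assert(f != NULL && payload != NULL);
    if (f->count == FIFO_SLOTS || len > FIFO_MSG_MAX)
        return false;
    memcpy(f->data[f->head], payload, len);
    f->len[f->head] = len;
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count++;
    return true;
}

/* Copy the oldest message out; returns its length, or 0 if the FIFO is
 * empty or the output buffer is too small. */
size_t fifo_pop(struct msg_fifo *f, void *out, size_t out_size)
{
    assert(f != NULL && out != NULL);
    if (f->count == 0 || f->len[f->tail] > out_size)
        return 0;

    size_t len = f->len[f->tail];
    memcpy(out, f->data[f->tail], len);
    f->tail = (f->tail + 1) % FIFO_SLOTS;
    f->count--;
    return len;
}
```

The idea is to call `fifo_push()` as soon as a message arrives and release the RPMsg buffer right away, then let the slower application code drain the queue with `fifo_pop()` at its own pace.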

Then what about message rate? 1kHz should be fine anyway, provided M4 side is able to pick messages at the same rate. Try increasing/decreasing message rate a bit, your 30 min period should change as well.

vhmm100 · August 13, 2022, 11:54pm

Hi @hfranco.tx ,

The “-1” value, it’s only written on the memory or it’s also the return code from your read/write function?
In this case, I would recommend using the library errno, then call a printf to errno or use perror to get more information about what is happening with your code.

So, after looking a bit more into the code, we did see that the ‘-1’ is coming from the write function. We are using the default RPMsg from Toradex, we didn’t change it.

/* Write N bytes of BUF to FD.  Return the number written, or -1.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern ssize_t write (int __fd, const void *__buf, size_t __n) __wur;

Since we couldn’t access the A7 over UART and a serial terminal while it is running, we made a quick solution that prints the written and read byte counts to a .txt file. It is there that we see ‘-1’ after RPMsg crashes; before that, the file shows the expected written bytes (173) and read bytes (174). For further information, we also confirmed that the A7 is the one crashing RPMsg: the M4 still works, but it won’t try to write because it is the A7’s turn to write.

From what I understand, your Cortex-A is completely frozen and you can’t do anything, am I correct?

Yes, our A7 is completely frozen and we can’t do anything.

How have you compiled your code?

We are compiling our A7 code with:

Make

‘. /usr/local/oecore-x86_64/environment-setup-armv7at2hf-neon-angstrom-linux-gnueabi’

OS Ubuntu 18.04.6 LTS

And our M4 code with:

CMake

GCC 9.3.1 arm-none-eabi

OS Windows 10

Linker: MCIMX7D_M4_tcm.ld

Are you using the demo from NXP, ttyrpmsg?

We are using all the RPMsg provided by the Toradex repository with freeRTOS.

Kind regards.

hfranco.tx · August 16, 2022, 5:01pm

Hi @vhmm100,

Thanks for your information!

From what I understand, you are using the imx_rpmsg_tty driver, am I correct? You are sending data from M4 and then reading it through /dev/ttyRPMSG0 and also writing data to this device and sending it to M4, right?

From the imx_rpmsg_tty driver source code that I’ve sent you, apparently, the max size defined for sending strings is 256 bytes, which shouldn’t be a problem as you’re sending fewer bytes.

Taking a deeper look into the rpmsg_rtos.h source code (FreeRTOS-Colibri-iMX7/middleware/multicore/open-amp/rpmsg/), I could see there are different functions to send and also read data between the two cores:

For example:

 /*!
 * @brief Allocates the tx buffer for message payload.
 *
 * This API can only be called at process context to get the tx buffer in vring. By this way, the
 * application can directly put its message into the vring tx buffer without copy from an application buffer.
 * It is the application responsibility to correctly fill the allocated tx buffer by data and passing correct
 * parameters to the rpmsg_rtos_send_nocopy() function to perform data no-copy-send mechanism.
 *
 * @param[in] ept   Pointer to the RPMsg endpoint that requests tx buffer allocation
 * @param[out] size Pointer to store tx buffer size
 *
 * @return The tx buffer address on success and NULL on failure
 * 
 * @see rpmsg_rtos_send_nocopy
 */
void *rpmsg_rtos_alloc_tx_buffer(struct rpmsg_endpoint *ept, unsigned long *size);

/*!
 * @brief Sends a message across to the remote processor.
 *  
 * This function sends data of length len to the remote dst address.
 * In case there are no TX buffers available, the function will block until
 * one becomes available, or a timeout of 15 seconds elapses. When the latter
 * happens, -ERESTARTSYS is returned.
 *  
 * @param[in] ept  Pointer to the RPMsg endpoint
 * @param[in] data Pointer to the application buffer containing data to be sent
 * @param[in] len  Size of the data, in bytes, to transmit
 * @param[in] dst  Destination address of the message
 *
 * @return 0 on success and an appropriate error value on failure
 * 
 * @see rpmsg_rtos_send_nocopy
 */
int rpmsg_rtos_send(struct rpmsg_endpoint *ept, void *data, int len, unsigned long dst);

/*!
 * @brief Sends a message in tx buffer allocated by rpmsg_rtos_alloc_tx_buffer() 
 * to the remote processor.
 * 
 * This function sends txbuf of length len to the remote dst address.
 * The application has to take the responsibility for:
 *  1. tx buffer allocation (rpmsg_rtos_alloc_tx_buffer() )
 *  2. filling the data to be sent into the pre-allocated tx buffer
 *  3. not exceeding the buffer size when filling the data
 *  4. data cache coherency
 *
 * After the rpmsg_rtos_send_nocopy() function is issued the tx buffer is no more owned 
 * by the sending task and must not be touched anymore unless the rpmsg_rtos_send_nocopy() 
 * function fails and returns an error. In that case the application should try 
 * to re-issue the rpmsg_rtos_send_nocopy() again and if it is still not possible to send 
 * the message and the application wants to give it up from whatever reasons 
 * the rpmsg_rtos_recv_nocopy_free function could be called, 
 * passing the pointer to the tx buffer to be released as a parameter.     
 *
 * @param[in] ept   Pointer to the RPMsg endpoint
 * @param[in] txbuf Tx buffer with message filled
 * @param[in] len   Size of the data, in bytes, to transmit
 * @param[in] dst   Destination address of the message
 *
 * @return 0 on success and an appropriate error value on failure
 * 
 * @see rpmsg_rtos_alloc_tx_buffer
 * @see rpmsg_rtos_send
 */
int rpmsg_rtos_send_nocopy(struct rpmsg_endpoint *ept, void *txbuf, int len, unsigned long dst);

Which exact functions are you using in your RTOS code?

I made a quick test here, a loop sending data to M4 and it worked. The only thing is that when it becomes too fast (3 ms period), M4 stops echoing back my messages, but the Cortex-A or M4 are not frozen. I can simply run my code again and it works:

COUNTER=0

while true; do
        echo "abcdefghijklmnopqrstuvwxyz${COUNTER}" > /dev/ttyRPMSG0
        COUNTER=$(( COUNTER + 1 ))
        if [ $COUNTER -eq 256 ]; then
                COUNTER=0
        fi
        sleep 0.01
done

Please, explain how the data is being transferred so we can provide better support.

Thanks!

Best Regards,
Hiago.

vhmm100 · August 16, 2022, 6:35pm

Hi @hfranco.tx ,

From what I understand, you are using the imx_rpmsg_tty driver, am I correct?

No, everything we got for our RPMsg to run on M4, we did take from this Toradex Repository.

You are sending data from M4 and then reading it through /dev/ttyRPMSG0 and also writing data to this device and sending it to M4, right?

Exactly that, we are reading/writing through /dev/ttyRPMSG0 from both cores.

Which exact functions are you using in your RTOS code?

We are using the following functions to write & read on the M4 side:

int app_rpmsg_read(void* message)
{
    int message_lenght;

    if(message == NULL){
        return -1;
    }

    /* If no message available, return zero */
    if (msg_count == 0){
        return 0;
    }

    /* Copy string from RPMsg rx buffer */
    message_lenght = app_msg[app_idx].len;
    assert(message_lenght < sizeof(app_buf));
    memcpy(app_buf, app_msg[app_idx].data, message_lenght);
    memcpy(message, app_buf, message_lenght);// copy app_buf to string input pointer "message"

    /* Release held RPMsg rx buffer */
    //rpmsg_release_rx_buffer(app_chnl, app_msg[app_idx].data); //bug, this here instead of the interrupt can cause the program to break
    app_idx = (app_idx + 1) % STRING_BUFFER_CNT;
    /* Once a message is consumed, minus the msg_count and might enable MU interrupt again */
    app_rpmsg_enable_rx_int(true);

    return message_lenght;
}

void app_rpmsg_write(void* message, int message_lenght)
{
    void *tx_buffer;
    unsigned long size;

    if(message_lenght < 1){
        PRINTF("message_lenght < 1\r\n");
        return;
    }
    if(message == NULL){
        PRINTF("message is a NULL pointer\r\n");
        return;
    }
    /* Get tx buffer from RPMsg */
    tx_buffer = rpmsg_alloc_tx_buffer(app_chnl, &size, RPMSG_TRUE);
    assert(tx_buffer);
    assert((unsigned long)message_lenght <= size); /* payload must fit the allocated tx buffer */
    /* Copy string to RPMsg tx buffer */
    memcpy(tx_buffer, message, message_lenght);
    /* Echo back received message with nocopy send */
    rpmsg_sendto_nocopy(app_chnl, tx_buffer, message_lenght, app_chnl->dst);
}

And the following functions on A7 side:

static char readBuffer[512];
static int rpmsgFileDescriptor = -1;

int rpmsgInit()
{
    if(rpmsgFileDescriptor > -1)
    {
        return 0;
    }
    rpmsgFileDescriptor = open("/dev/ttyRPMSG0", O_RDWR | O_NOCTTY | O_NONBLOCK);
    if(rpmsgFileDescriptor < 0)   /* open() returns -1 on failure; 0 is a valid descriptor */
    {
        printf("Not able to open RPMSG0\r\n");
        return -1;
    }
    return 0;
}

/*!
*   @brief Deinit module
*/
void rpmsgDeinit()
{
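    if(rpmsgFileDescriptor >= 0)   /* close only a descriptor that was actually opened */
    {
        close(rpmsgFileDescriptor);
        rpmsgFileDescriptor = -1;   /* allow rpmsgInit() to reopen later */
    }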
    return;
}

int rpmsgWriteToM4(void* message, int size)
{
    int bytesWritten = -2;
    if(message != NULL)
    {
        bytesWritten = write(rpmsgFileDescriptor, message, size);
    }
    return bytesWritten;
}

int rpmsgReadFromM4(void* message, int size)
{
    int bytesRead = -2;
    if(message != NULL)
    {
        bytesRead = read(rpmsgFileDescriptor, message, size);
    }
    return bytesRead;
}
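Because the descriptor above is opened with O_NONBLOCK, `write()` can return -1 as soon as the driver has no room, rather than waiting. One hedged way to tell "no room right now" apart from "dead" is to wait for writability with a timeout before writing. This sketch assumes the RPMsg TTY driver reports writability through `poll()`, which has not been verified in this thread, and the function name is made up for the example:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to accept data, then write once.
 * Returns bytes written, or -1 with errno set (ETIMEDOUT if never writable). */
ssize_t write_with_timeout(int fd, const void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int ready;

    assert(buf != NULL);
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);   /* a signal is not a failure */

    if (ready < 0)
        return -1;
    if (ready == 0) {
        errno = ETIMEDOUT;                   /* still not writable */
        return -1;
    }
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
        errno = EIO;                         /* the descriptor itself is in trouble */
        return -1;
    }
    return write(fd, buf, len);
}
```

A timeout somewhat above the 15 s mentioned earlier in the thread would show whether the channel recovers on its own or stays blocked.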

Please, explain how the data is being transferred so we can provide better support.

Our RPMsg starts from the M4 side creating the channel /dev/ttyRPMSG0 where the communication will be done and waiting for a handshake from A7. So, the A7 will do a handshake with M4 and wait for its response, after that our code will go on as the RPMsg channel was successfully created and the communication was established.

Each core sends/receives packets when it is its turn. The turns are controlled by a flag that tells each core whether it is its turn to listen or to write.

Kind regards,

hfranco.tx · August 16, 2022, 8:27pm

Hi @vhmm100,

Ok, thanks for the information. I will get everything and work on this on my side.
One last question, what changes have you done to the device tree? Can you send the device tree source code for us?
You can use this link: share.toradex.com

Thanks!

Best Regards,
Hiago.

Edward · August 17, 2022, 6:46am

@vhmm100,

You’ve commented out rpmsg_rtos_recv_nocopy_free. You still need to free allocated rx buffers. The rpmsg recv call allocates the buffer, and you need to free it. With tx buffers it is the opposite: you allocate the tx buffer, and the rpmsg send call frees it.

This is probably why it stops after some time.
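A toy model of that ownership rule, not the OpenAMP implementation: a fixed pool of rx buffers, where every received message takes one buffer and the application is expected to hand it back. All names and the pool depth of 4 are made up for the illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BUFFERS 4u   /* hypothetical pool depth */

struct rx_pool {
    bool in_use[POOL_BUFFERS];
};

/* Hand out a free buffer index, or -1 when every buffer is still held. */
int pool_acquire(struct rx_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return (int)i;
        }
    }
    return -1;
}

/* Give a buffer back so that it can carry the next message. */
void pool_release(struct rx_pool *p, int idx)
{
    assert(p != NULL && idx >= 0 && (size_t)idx < POOL_BUFFERS);
    p->in_use[idx] = false;
}

/* Receive up to 'messages' messages; returns how many were accepted
 * before the pool ran dry. */
size_t simulate_receive(struct rx_pool *p, size_t messages, bool release_after_use)
{
    size_t accepted = 0;

    for (size_t i = 0; i < messages; i++) {
        int idx = pool_acquire(p);
        if (idx < 0)
            break;                      /* no buffer left: the sender stalls */
        accepted++;
        if (release_after_use)
            pool_release(p, idx);
    }
    return accepted;
}
```

In this model, skipping the release stops reception after exactly POOL_BUFFERS messages, while releasing each buffer lets it run indefinitely. Releasing the same buffer from two places would be a different bug, which this sketch does not cover.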

Edward

vhmm100 · August 17, 2022, 5:45pm

Hi @hfranco.tx,

I would like to thank you for your fast responses and time.

One last question, what changes have you done to the device tree? Can you send the device tree source code for us?

Unfortunately, I can’t tell you exactly what was done to our device tree, because it wasn’t me that made it, and the person who did it isn’t working with us anymore. I did upload our device tree (.dtb) here Device Tree. From what I know, our device tree was made from the file imx7d-colibri-custom.dtb from Toradex repositories, although I am not aware if any changes were made.

Kind regards,

vhmm100 · August 17, 2022, 5:50pm

Hi @Edward ,

Did you mean this function rpmsg_release_rx_buffer(app_chnl, app_msg[app_idx].data) ?

Apparently it was causing a bug (//bug, this here instead of the interrupt can cause the program to break), so we didn’t use it.

Kind regards,

Edward · August 17, 2022, 6:25pm

The bug is exactly the absence of that call. And this must be why you get a heap overflow after some time.

alex.tx · August 17, 2022, 6:32pm

3.2.7 int rpmsg_rtos_recv_nocopy_free ( struct rpmsg_endpoint ∗ ept, void ∗ data)
This function frees a buffer previously returned by rpmsg_rtos_recv_nocopy().
Once the zero-copy mechanism of receiving data is used, this function has to be called to free a buffer and
to make it available for the next data transfer.

vhmm100 · August 18, 2022, 4:20pm

Hi everyone,

We are glad to report that the issue is solved so far.

I was mistaken when I said that we had commented out the function rpmsg_release_rx_buffer. After commenting it out, we left our code running overnight, and it didn’t freeze or have any issues. We don’t know why this function was breaking our code, but so far it appears to have been responsible. We are going to leave our code running for longer to see whether any other RPMsg problems show up.

If you have any questions, feel free to contact us.

Thank you all for your help @hfranco.tx , @Edward and @alex.tx .

hfranco.tx · August 18, 2022, 5:14pm

Hi @vhmm100,

I’m glad it is solved now!

I’ve marked @Edward’s answer as a solution. Feel free to return and ask any questions that you might have in the future.

Best regards,
Hiago.