Memory Allocation Issue: Cortex M7 - m_data Insufficient Size

crivi · May 30, 2024, 8:20am

Hello everyone,

I’m currently working with the Verdin iMX8M Plus and facing an issue regarding memory allocation while trying to apply a FIR filter to a signal, utilizing cmsis-dsp on the Cortex M7. The filter is a float32_t of 256 samples, while the signal is a float32_t of 46039 samples (~180Kb). Unfortunately, the m_data partition available is only 128Kb, which is insufficient to store everything required.

In my initial attempt, I tried to utilize “__attribute__((section(”.ncache")))" to allocate the signal’s memory on m_data2, which offers 16Mb of space. However, something strange occurred. Upon running the ./build_release step, the output indicated:

Memory region         Used Size  Region Size  %age Used
    m_interrupts:         680 B         1 KB     66.41%
          m_text:       17344 B       127 KB     13.34%
          m_data:        2584 B       128 KB      1.97%
         m_data2:      196208 B        16 MB      1.17%

But in the release directory, I observed the .bin file built was approximately 2 GB! whereas the .elf file was 238.48 KB.

Can anyone offer insight into what might be going wrong? And how I shoud use m_data2?
Additionally, I’m seeking advice on how to increase the size of m_data. I saw the given constraints of the Cortex-M7 Memory Map, where both the upper and lower addresses of memory around m_data (DTCM) are reserved:

2002_0000 - 2005_FFFF Reserved (256KB)
2000_0000 - 2001_FFFF DTCM (128KB)
1FFE_0000 - 1FFF_FFFF Reserved (128KB)

Any suggestions or guidance would be greatly appreciated.

Thank you!

ToradexAI · May 30, 2024, 8:21am

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!

Did you like my answer?

Hello,

It seems you’re encountering a common challenge when working with embedded systems, especially regarding memory allocation for large data sets on constrained devices like the Cortex-M7 on the Verdin iMX8M Plus.

Regarding the issue with m_data2 and the large .bin file size:

The discrepancy in the .bin file size could be due to how the linker script is configured or how the memory regions are defined. When you specify a section to be placed in a large memory area like m_data2 (16MB), but only a small portion is used, the linker might still allocate the entire section as defined, leading to a significantly larger binary. This is because the binary format does not differentiate between used and unused space within a defined memory region; it allocates the entire block as specified.

To properly use m_data2, ensure your linker script correctly defines this memory region and that your attributes (__attribute__((section(".ncache")))) are correctly applied to the variables you want to store there. Review the linker script to ensure that m_data2 is defined to start at the correct address and that it does not inadvertently reserve more space than necessary.

Increasing the size of m_data (DTCM):

Unfortunately, the size of the DTCM (Data Tightly Coupled Memory) is hardware-defined and cannot be increased beyond its specified limit (128KB in your case). The memory map you provided indicates that the regions around the DTCM are reserved, which further limits the flexibility in memory allocation.

However, you can explore a few strategies to work within these constraints:

Optimize your data: Review your data structures and processing algorithms to minimize their memory footprint. For example, can the filter or signal be processed in smaller chunks?
External Memory: Consider using external memory interfaces available on the Verdin iMX8M Plus for storing larger data sets. While this may introduce latency compared to DTCM, it could be a viable solution for your application.
Memory Pooling: For critical sections that must use DTCM for performance reasons, implement a memory pooling mechanism to efficiently manage the available DTCM space. This involves dynamically allocating and deallocating memory blocks as needed rather than statically allocating all required memory upfront.

Given the constraints of the Cortex-M7 and the Verdin iMX8M Plus, creative solutions and optimizations are key to managing memory effectively. While hardware limitations define the maximum capacity of tightly coupled memory, software strategies can help mitigate these limitations to

Please, also check if these links can help you solve your problem:

Cortex-M and Memory Areas on Toradex SoMs | Toradex Developer Center.

rudhi.tx · June 10, 2024, 11:00am

Hello @crivi,

Thanks for reaching out on Toradex Community!

But in the release directory, I observed the .bin file built was approximately 2 GB! whereas the .elf file was 238.48 KB.

I don’t think you are doing anything wrong here. Bin file represents continuous raw image, so there are probably large gaps in the image containing unnecessary data dumped (like all 0xFF) which can make the image bigger.

For your application, it makes sense that you use elf file. You can use elf binaries with remoteproc: How to Use RemoteProc | Toradex Developer Center

crivi · June 13, 2024, 9:22am

Thank you @rudhi.tx! I’ve successfully loaded and run hello_world.elf using remote proc.

However, I’m facing issues with m_data2—the board crashes and reboots itself when I attempt to use it. Could someone provide a simple example demonstrating how to allocate an array in m_data2, both statically and dynamically? It would also be very helpful to see how to load a function into that memory section.

Additionally, since m_data2 is non-cacheable (ncache), will accessing data in this section be slower compared to m_data?

Thank you in advance for your assistance!

rudhi.tx · June 13, 2024, 4:52pm

Hello @crivi,

Here is an example code that you can add in your source code (for example, in hello_world.c) to allocate an array in m_data2 statically:

#define ARRAY_SIZE 1024000

// Place myArray in the non-cacheable memory region m_data2
__attribute__((section(".ncache"))) uint8_t myArray[ARRAY_SIZE];

int main(void) {
    // Fill the array with values
    for (uint32_t i = 0; i < ARRAY_SIZE; i++) {
        myArray[i] = (uint8_t)i;
    }
}

Here is how you would allocate it dynamically:

#define M_DATA2_SIZE (16 * 1024 * 1024)  // 16 MB
#define ARRAY_SIZE 102400

__attribute__((section(".ncache"))) uint8_t m_data2_pool[M_DATA2_SIZE];

static uint8_t* m_data2_alloc_ptr = m_data2_pool;


// Custom allocator function for m_data2 region
void* m_data2_malloc(size_t size) {
    void* ptr = NULL;
    if ((m_data2_alloc_ptr + size) <= (m_data2_pool + M_DATA2_SIZE)) {
        ptr = m_data2_alloc_ptr;
        m_data2_alloc_ptr += size;
    }
    return ptr;
}

int main(void) {
    uint8_t* myArray = (uint8_t*)m_data2_malloc(ARRAY_SIZE * sizeof(uint8_t));


    for (uint32_t i = 0; i < ARRAY_SIZE; i++) {
        myArray[i] = (uint8_t)i;
    }
}

You may also check the MIMX8ML8xxxxx_cm7_ram.ld: linker script defining memory regions, including the m_data2 section for a better understanding.

As you have already guessed, accessing data in non-cacheable memory (m_data2) will generally be slower compared to cacheable memory (m_data) since it bypasses the CPU cache, directly accessing the memory.

crivi · June 14, 2024, 8:00am

Hello @rudhi.tx , thank you so much for the help.

The memory allocation works well, but I’m still having a hard time with the amount of time it takes to execute. I’d like to perform a complex FIR filtering on the signals below, and I’m wondering if the following is a feasible application or if it’s too much workload for the Cortex-M7.

I have two arrays of 76,000 float32 samples:

float32_t signal_I[76000];
float32_t signal_Q[76000];

I also have a complex filter with:

float32_t filter_I[256];
float32_t filter_Q[256];

The filtering will be:

for (uint32_t n = 0; n < 76000; n++)
{
    for (uint16_t k = 0; k < 256; k++)
    {
        if (n >= k)
        {
            res_I[n] += signal_I[n - k] * filter_I[k] - signal_Q[n - k] * filter_Q[k];
            res_Q[n] += signal_I[n - k] * filter_Q[k] + signal_Q[n - k] * filter_I[k];
        }
    }
}

The resulting arrays are also:

float32_t res_I[76000];
float32_t res_Q[76000];

All the arrays are stored in m_data2 ncache as you suggested, using attribute((section(“.ncache”))) in their declarations. It takes around 15 seconds to execute and ofc is not acceptable for any application.

Could someone kindly reproduce this on a Verdin iMX8M Plus in the Cortex-M7 core? This will help me determine if there’s a configuration change I need to make or if it’s simply too many instructions to handle for the core. Thank you so much in advance!

vix · June 14, 2024, 1:33pm

Hi @crivi
I don’t know if it helps in your case, but isn’t there any useful filtering function in library CMSIS-DSP that performs what you need?
Usually they’re optimized and take the advantage of hardware registers of the architecture (when possible).

Moreover, if the filter coefficients are constant and known at compile time (and I think so), I strongly suggest you to define them as

const float32_t filter_I[256];
const float32_t filter_Q[256];

When you define/declare as const as much as you can, you give the compiler more chances to optimize your code in terms of performance (-O2 flag).
I cannot say how fast I would expect, but 15 seconds seems too much to me to execute that code.

rudhi.tx · June 14, 2024, 3:07pm

Hello @crivi,

As @vix suggested, you can use CMSIS-DSP library functions optimized for the Cortex-M7 which can significantly speed up the computation. I believe what you need is to use the CMSIS-DSP library’s arm_fir_f32 function, which is optimized for FIR filtering on ARM Cortex-M processors. You will find an example here: CMSIS-DSP/Examples/ARM/arm_fir_example/arm_fir_example_f32.c at main · ARM-software/CMSIS-DSP · GitHub

crivi · June 18, 2024, 10:01am

Hello @rudhi.tx and @vix, thank you for your advice. I tried making the variables const, but it didn’t change much. However, using the CMSIS-DSP library showed improvement, reducing execution time by ~2 seconds. Despite this, the total execution time of 13 seconds is still too high. Here is the function I’m using. Could someone from support benchmark this and help me understand if you have the same result or maybe there are issues with my build configuration, memory allocation, etc.? Your assistance would be greatly appreciated.

PS: Sorry for the confusion in the previous comment where I mentioned 76000 samples for the array signals. The correct size I’m using is 46000 samples.

#define BLOCK_SIZE 460 
#define BLOCK_NUM 100
#define FILTER_SIZE 256
#define SIGNAL_SIZE 46000

// Allocate memory for the result of the FIR filtering
__attribute__((section(".ncache"))) float32_t res_i[SIGNAL_SIZE];
__attribute__((section(".ncache"))) float32_t res_q[SIGNAL_SIZE];

// Allocate memory for the partial results of filtering
__attribute__((section(".ncache"))) float32_t fir_real_real[SIGNAL_SIZE];
__attribute__((section(".ncache"))) float32_t fir_real_imag[SIGNAL_SIZE];
__attribute__((section(".ncache"))) float32_t fir_imag_imag[SIGNAL_SIZE];
__attribute__((section(".ncache"))) float32_t fir_imag_real[SIGNAL_SIZE];

// Initialize filter coefficient
__attribute__((section(".ncache"))) float32_t firStateF32R[BLOCK_SIZE + FILTER_SIZE - 1];
__attribute__((section(".ncache"))) float32_t firStateF32I[BLOCK_SIZE + FILTER_SIZE - 1];


typedef struct
{
    float32_t *i;
    float32_t *q;
    
} cmplx_signal;

void complex_filtering(void)
{

    cmplx_signal *res = (cmplx_signal *)malloc(sizeof(cmplx_signal));
    res->i = &res_i[0];
    res->q = &res_q[0];

    cmplx_signal *sig1 = (cmplx_signal *)malloc(sizeof(cmplx_signal));
    sig1->i = &pss_filter_I[0];
    sig1->q = &pss_filter_Q[0];

    cmplx_signal *sig2 = (cmplx_signal *)malloc(sizeof(cmplx_signal));
    sig2->i = &pss_search_inproc_I[0];
    sig2->q = &pss_search_inproc_Q[0];


    arm_fir_instance_f32 FIR_real, FIR_imag;
    arm_fir_init_f32(&FIR_real, FILTER_SIZE, sig1->i, firStateF32R, BLOCK_SIZE);
    arm_fir_init_f32(&FIR_imag, FILTER_SIZE, sig1->q, firStateF32I, BLOCK_SIZE);


    // Apply FIR filtering to real and imaginary parts separately
    for (int i = 0; i < BLOCK_NUM; i++)
    {
        arm_fir_f32(&FIR_real, sig2->i + (i * BLOCK_SIZE), fir_real_real + (i * BLOCK_SIZE), BLOCK_SIZE);
        arm_fir_f32(&FIR_imag, sig2->q + (i * BLOCK_SIZE), fir_imag_imag + (i * BLOCK_SIZE), BLOCK_SIZE);
        arm_fir_f32(&FIR_real, sig2->q + (i * BLOCK_SIZE), fir_real_imag + (i * BLOCK_SIZE), BLOCK_SIZE);
        arm_fir_f32(&FIR_imag, sig2->i + (i * BLOCK_SIZE), fir_imag_real + (i * BLOCK_SIZE), BLOCK_SIZE);
    }

    // Combine filtered real and imaginary parts
    arm_sub_f32(fir_real_real, fir_imag_imag, res->i, SIGNAL_SIZE);
    arm_add_f32(fir_imag_real, fir_real_imag, res->q, SIGNAL_SIZE);

    // Free allocated memory
    free(sig1);
    free(sig2);
}

crivi · June 24, 2024, 7:18am

Hi @rudhi.tx, any update on this?

rudhi.tx · June 24, 2024, 7:27am

Hello @crivi,

In my opinion, you need to do some code optimization, especially around memory allocation and usage. The first recommendation would be to avoid dynamic memory allocation, if possible. Other than that, I can suggest you to set the optimization flags in CMakeLists.txt ( Use higher optimization levels such as -O2 or -O3 in the compiler settings.):

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mcpu=cortex-m7 -mfpu=fpv5-sp-d16 -mfloat-abi=hard -O2 -DARM_MATH_CM7 -D__FPU_PRESENT=1")

I have not tested this. Please give it a try and let me know how it goes.

crivi · June 27, 2024, 9:45am

Hello @rudhi.tx,
I’m already using these optimization flags and I’m not using dynamic memory allocation in this case

vix · June 27, 2024, 1:18pm

Hi @crivi,
first of all, another simple but important check is that you must enable hardware support for single-precision floating point. Otherwise everything is emulated by software and the performances are really low.
It depends on what is you starting point, but usually this is done with some specific compiler and linker options.
If I’m right this should be --cpu=Cortex-M7.fp.sp but, please, check the technical docs.

Once you’re sure about this, you can have a look to the ASM code generated by the compiler to see how many instructions are generated for every iteration of your inner loop.
As a coarse indication of performance you can assume that every ASM instruction take 1 CPU cycle to execute (I know it’s not true, but it’s to get an idea on the magnitude order).
Every execution of your inner loop has 4 float32_t * float32_t, 3 float32_t + float32_t and 1 float32_t - float32_t for a total number of 8 simple operations on floating points.
With floating point hardware support, every simple operation (+, -, *) should be mapped into a single ASM instruction.
And so, for a coarse benchmark:
46000 * 256 * 8 = 94 millions of simple operations in you processing stage.
You should consider some additonal instructions to handle the if, and the for() loops, but you should consider that M7 is able to run at 400 MHz (you can approx to 400 millions of instructions per second).

Please, I know that my indication is not correct 100%, but it’s only to give an idea.
13 seconds to execute that computation is too high, based on my experience.
Unlesse you forgot to enable floating point hardware support (maybe you started from an example for a different Cortex-M that doesn’t have this support, and you didn’t enable it).