Cortex-M7 - CMSIS-DSP performance

Hello everyone,

signals.c (5.7 KB)
I’ve been working on the Verdin iMX8M Plus and recently began testing a simple convolution operation as a benchmark on the Cortex-M7. I successfully ran the application and integrated the CMSIS DSP library for optimized operations. However, I’ve encountered an interesting observation and would appreciate some insight.

I’ve noticed that when comparing my own convolutional function to the CMSIS function arm_conv_f32(), there isn’t a significant difference in the cycles taken by the processor to execute them. Upon examining the assembly code, it appears that my convolution function generates approximately 143,000 instructions, and the cycles taken by the Cortex-M7 are as follows:

  • For my simple convolution: 107,000 cycles, roughly equivalent to 0.13 ms.
  • For the optimized CMSIS-DSP convolution function: 76,000 cycles, approximately 0.095 ms.

I’m seeking advice from those with more experience to confirm if these numbers are credible. Could it be possible that I need to enable something beyond the FPU for better performance? Perhaps there are optimizations or configurations I may have overlooked that could further improve the execution time for this specific convolution operation.

For clarity, here’s my simple convolution function:


#define KHZ1_15_SIG_LEN		320
#define IMP_RSP_LENGTH		29

void convolution(float result[]) {
    int i, j;

    for (i = 0; i < KHZ1_15_SIG_LEN + IMP_RSP_LENGTH - 1; i++) {
        result[i] = 0;
        for (j = 0; j < IMP_RSP_LENGTH; j++) {
            if (i - j >= 0 && i - j < KHZ1_15_SIG_LEN) {
                result[i] += input_signal_f32_1kHz_15kHz[i - j] * impulse_response[j];
            }
        }
    }
}

I’ve also attached the signal as a text file to this message for completeness.

Your expertise and guidance on this matter would be greatly appreciated. Thank you in advance for your help!


Your function can’t be 143k instructions in size unless you fully unrolled your loops; do you perhaps mean 143k executed instructions? You mentioned your code size but not the CMSIS-DSP version, so it isn’t entirely clear what you are asking: why does the simpler (and smaller?) function take longer than the big one? It could be due to cache organization, for example; perhaps the CMSIS-DSP code is better tuned for cache behavior. Try searching for loop tiling. The ARM Cortex-A programming guide has a chapter about it.

If you need assistance with the CMSIS-DSP library, you can open an issue on the GitHub repo.
The developers usually answer questions there.

Thank you for the response! I will also look into loop tiling. Just to clarify: I’m curious whether the optimized convolution function arm_conv_f32() provided by the CMSIS library (v1.15.0) should actually require even fewer cycles than what I measured, which is currently a 30% reduction compared to my own convolution implementation.

But as vix said, this might be more of a question pertaining to the performance of the CMSIS library itself.

Hi @crivi

The question is most likely how much the compiler can optimize. So, what compiler and linker flags did you use? I would guess the 30% reduction from arm_conv_f32() compared to your code is realistic, but you would have to compare the assembly code for an exact answer. However, I don’t think CMSIS-DSP uses any special extension on the Cortex-M7 (like NEON on Cortex-A) to improve speed; that’s why the results are so close.
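For reference, a typical GCC invocation for a Cortex-M7 with its double-precision FPv5 FPU enabled might look like the sketch below. This is only an illustration; the exact flags, file names, and whether you use -ffast-math (which relaxes IEEE float semantics) depend on your toolchain and SDK:

```shell
# Hypothetical example flags for arm-none-eabi-gcc; adjust to your SDK.
arm-none-eabi-gcc -mcpu=cortex-m7 -mthumb \
    -mfpu=fpv5-d16 -mfloat-abi=hard \
    -O3 \
    -c convolution.c -o convolution.o
```

Without -mfpu/-mfloat-abi the compiler may emit software floating-point calls, which would dwarf any difference between the two convolution implementations.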