Cortex-M4 and cached memories

Hi.
I’m developing bare-metal firmware for the CM4 and I’m having some issues with the cache memory.
My firmware so far is based on the sample rpmsg_str_echo_bm, but with only the initialisation and the following code.

#define MODE  0
uint32_t x = 0;
to_t ti, tf;
while (1) {

#if (MODE == 0)
  dbgout("----- CODE : OFF  ---  SYS: OFF -----\r\n");
  Dly (100 mS);
  __disable_irq();
  LMEM_DisableSystemCache(LMEM);
  LMEM_DisableCodeCache(LMEM);
#elif (MODE == 1)
  dbgout("----- CODE : OFF  ---  SYS: ON -----\r\n");
  Dly (100 mS);
  __disable_irq();
  LMEM_EnableSystemCache(LMEM);
  LMEM_DisableCodeCache(LMEM);
#elif (MODE == 2)
  dbgout("----- CODE : ON  ---  SYS: OFF -----\r\n");
  Dly (100 mS);
  __disable_irq();
  LMEM_DisableSystemCache(LMEM);
  LMEM_EnableCodeCache(LMEM);
#elif (MODE == 3)
  dbgout("----- CODE : ON  ---  SYS: ON -----\r\n");
  Dly (100 mS);
  __disable_irq();
  LMEM_EnableSystemCache(LMEM);
  LMEM_EnableCodeCache(LMEM);
#else
  dbgout("--- UNDEFINED ---\r\n");
  Dly (100 mS);
  __disable_irq();
#endif
  uint32_t i = 160000000;
  ti = clock();
  do {
  } while (--i);
  tf = clock();
  __enable_irq();
  tf -= ti;
  tf /= (1 uS);
  dbgout("(x:%d -> %d uS)\r\n", ++x, tf);
  Dly (100 mS);
}

I ran the same code on an STM32L476 @ 48 MHz for reference and got the following results:

// S -> System Cache   C -> Code Cache
// STM32 Code: Flash Data: RAM    -> 10000000 uS -> 48 MHz (baseline)
// iMX7D Code: TCML  Data: TCML   ->  2000000 uS (S: xx  / C: xx  ) -> F = 240 MHz (Ok)
// iMX7D Code: OCRAM Data: OCRAM  -> 15112001 uS (S: xx  / C: Off ) -> F = 31.76 MHz
// iMX7D Code: OCRAM Data: OCRAM  -> 20800000 uS (S: xx  / C: On  ) -> F = 23.08 MHz
// iMX7D Code: DDR   Data: DDR    -> 47741988 uS (S: xx  / C: Off ) -> F = 10.06 MHz
// iMX7D Code: DDR   Data: DDR    -> 52414589 uS (S: xx  / C: On  ) -> F =  9.16 MHz

I’d like to know what’s wrong.
(The code is respectively 7.5x, 10.4x, 23.9x and 26.2x slower than on TCM.)
With the cache disabled, the results look similar to what I saw in an NXP forum (OCRAM is 8 times slower than TCM and DDR is 4 times slower than OCRAM).
The problem is that when I enable the cache, performance gets worse instead of better. I was expecting something around 70% of TCM.
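
For reference, the effective frequencies and slowdown factors quoted above can be reproduced with a little arithmetic. The sketch below assumes the loop costs 3 cycles per iteration, which is what the TCM result implies (160,000,000 iterations in 2,000,000 uS at 240 MHz). The function names are made up for illustration and are not part of any SDK.

```c
/* Total cycles spent in the loop, inferred from the TCM run:
 * 2,000,000 us at 240 MHz = 480,000,000 cycles (3 cycles per iteration).
 * This is an assumption derived from the measurements, not a datasheet value. */
#define TOTAL_CYCLES      480000000.0
#define TCM_REFERENCE_US    2000000.0

/* Effective core frequency in MHz for a run that took elapsed_us microseconds.
 * Cycles per microsecond is the same number as MHz. */
static double effective_mhz(double elapsed_us)
{
    return TOTAL_CYCLES / elapsed_us;
}

/* How many times slower a run is compared with the TCM reference run. */
static double slowdown_vs_tcm(double elapsed_us)
{
    return elapsed_us / TCM_REFERENCE_US;
}
```

Calling effective_mhz(15112001.0) gives roughly 31.76 and slowdown_vs_tcm(20800000.0) gives 10.4, matching the OCRAM rows of the table.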

Any ideas? My firmware will not fit into the 32/64 KB of TCM, and with this kind of performance it is unusable in OCRAM.

Another, unrelated question is about the clock frequency. The examples use SysPLL/2 (240 MHz) for the CM4 clock, but the datasheet says that the maximum CM4 frequency is 200 MHz. Is that right? Is it reliable to run it at 240 MHz?

Thanks, Mauricio

Dear @mscaff

Even though I didn’t personally play with the cache so far, I might have some hints for you:

The i.MX7 Reference Manual states in section 4.2.9.3.5:

To use cache, user needs to configure MPU to set those memories as cacheable and all the other memories set as non-cacheable.

It looks like the FreeRTOS implementation provides code for this in the SystemInit() function, which is located at \platform\devices\MCIMX7D\startup\system_MCIMX7D_M4.c. I attached this file for reference.
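
To illustrate what "setting a memory as cacheable" means at the register level, here is a minimal sketch that builds the attribute word for an ARMv7-M MPU region (the MPU_RASR register). The bit positions come from the ARMv7-M architecture. The helper names and the choice of write-back attributes are assumptions for illustration, not necessarily what SystemInit() in the NXP file does.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the ARMv7-M MPU_RASR register. */
#define RASR_ENABLE      (1u << 0)   /* region enable                  */
#define RASR_SIZE_SHIFT  1u          /* SIZE field, bits [5:1]         */
#define RASR_B           (1u << 16)  /* bufferable                     */
#define RASR_C           (1u << 17)  /* cacheable                      */
#define RASR_TEX_SHIFT   19u         /* TEX field, bits [21:19]        */
#define RASR_AP_FULL     (3u << 24)  /* AP = 0b011: full access        */

/* SIZE field for a region of 2^log2_bytes bytes.
 * The architecture defines region size = 2^(SIZE + 1), minimum 32 bytes. */
static uint32_t rasr_size_field(uint32_t log2_bytes)
{
    assert(log2_bytes >= 5u && log2_bytes <= 32u);
    return (log2_bytes - 1u) << RASR_SIZE_SHIFT;
}

/* Hypothetical helper: normal memory, write-back cacheable (TEX=0, C=1, B=1). */
static uint32_t rasr_cacheable(uint32_t log2_bytes)
{
    return RASR_AP_FULL | RASR_C | RASR_B |
           rasr_size_field(log2_bytes) | RASR_ENABLE;
}

/* Hypothetical helper: normal memory, non-cacheable (TEX=1, C=0, B=0). */
static uint32_t rasr_noncacheable(uint32_t log2_bytes)
{
    return RASR_AP_FULL | (1u << RASR_TEX_SHIFT) |
           rasr_size_field(log2_bytes) | RASR_ENABLE;
}
```

For a 128 KB region (2^17 bytes), rasr_cacheable(17) evaluates to 0x03030021. On the target this value would be written to MPU->RASR (CMSIS naming) after selecting the region with MPU->RNR and setting its base address in MPU->RBAR. The attribute values the SDK actually uses should be taken from the attached file.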

It would be great if you could post your results here.

About the operating frequency:

The examples were provided by NXP, and all implementations I saw so far were using the CM4 at 240MHz. Therefore I strongly assume that the datasheet is not fully correct, and it is safe to operate the M4 at 240MHz.

Regards, Andy

Just an update.
I think that I found the problem.
In the examples, when OCRAM is used it is defined in the linker script at address 0x00920000, but it seems that although that is the OCRAM location, it can only be cached when accessed at 0x20200000.
Here is an example using code and data inside the first 128 KB of OCRAM, with much improved test results.

 ROM   (rx)    : ORIGIN = 0x20200000, LENGTH = 64k
 SRAM1 (rwx)   : ORIGIN = 0x20210000, LENGTH = 64k

Dear @mscaff

Thank you for reporting this. I will look into the topic and update the linker scripts if I don’t find any critical issue.

Side note: Make sure you don’t run into the cache bug described in the Article

Regards, Andy

Thanks for the heads up.

I guess that to be safe I’ll place code in OCRAM and data in TCM:

  ROM   (rx)   : ORIGIN = 0x20200000, LENGTH = 128k
  SRAM1 (rw)   : ORIGIN = 0x20000000, LENGTH = 32k

Regards, Mauricio

Dear @mscaff

In general your linker setup is correct. But regarding the bug described above, it seems to be unsafe to place any code containing STREX/LDREX instructions into cached memory.
By placing all your code into the ROM section (OCRAM, 0x20200000), you might run into this issue.
The solution is to ensure that critical code is placed in either TCMU or TCML. There are compiler #pragmas to place single functions into different linker sections. This gives you control over which code ends up in which physical memory.
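
As a sketch of the per-function placement mentioned above: with GCC-style toolchains this is usually done with a section attribute rather than a #pragma. The section name .tcm_code and the function below are hypothetical. The name only has an effect on the target if the linker script contains an output section that collects *(.tcm_code) and maps it into the TCM region.

```c
#include <stdint.h>

/* Hypothetical section name; it must match an output section that the
 * linker script places in TCML or TCMU. On non-GCC or non-ELF toolchains
 * the macro expands to nothing, so the code still compiles. */
#if defined(__GNUC__) && defined(__ELF__)
#define TCM_CODE __attribute__((section(".tcm_code"), noinline))
#else
#define TCM_CODE
#endif

/* Example of a "critical" function pinned to the TCM code section.
 * The body is a placeholder; real critical code would go here. */
TCM_CODE uint32_t critical_increment(volatile uint32_t *counter)
{
    *counter += 1u;
    return *counter;
}
```

Functions without the TCM_CODE marker stay in the default .text section, so only the marked ones end up in TCM. The map file produced by the linker can be used to confirm where each function landed.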

Regards, Andy

As I understand it, the problem with STREX and LDREX occurs when the source/destination memory is in a cached region, not when the instructions themselves are located there.
My idea is to place the code (which may contain STREX/LDREX) in the cached region, but to keep the data in TCM, which is not cached. That way any volatile variable will be located in TCM.
Am I wrong? This is my first time working with caches.
By the way, I’m not sure, but I guess that since I’m working in bare metal, the compiler has no reason to emit those instructions unless I use them explicitly. Is this correct?
Regards, Mauricio
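
On the last question, a small sketch may help show where exclusive accesses can come from without writing them by hand. As far as I know, GCC and Clang do not emit LDREX/STREX for ordinary C code on a Cortex-M4, but they typically do for C11 atomic read-modify-write operations (and for the __atomic/__sync builtins or the CMSIS __LDREXW/__STREXW intrinsics). The function and variable names below are made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast32_t shared_counter;   /* zero-initialised */
static volatile uint32_t plain_counter;

/* Atomic read-modify-write. On ARMv7-M targets, GCC and Clang typically
 * lower this to an LDREX/STREX retry loop, so the location of
 * shared_counter matters for the cache issue discussed above. */
uint32_t counter_increment_atomic(void)
{
    return (uint32_t)atomic_fetch_add(&shared_counter, 1u) + 1u;
}

/* Plain read-modify-write on a volatile variable. This compiles to
 * ordinary load/add/store instructions with no exclusive access, but it
 * is not atomic with respect to interrupts. */
uint32_t counter_increment_plain(void)
{
    plain_counter = plain_counter + 1u;
    return plain_counter;
}
```

To check a real build, disassembling the firmware (for example with arm-none-eabi-objdump -d) and searching for ldrex/strex shows whether any library or application code uses them.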