Correct way to adapt NXP SDK linker script & startup for remoteproc on i.MX8MP?

Hello everyone,

This is a follow-up to this earlier thread, where @bruno.tx helped investigate remoteproc on the Verdin iMX8M Plus. That thread ended with the recommendation to stick with U-Boot bootaux for loading. We’ve been doing that: we split the ELF into separate itcm/dtcm/ddr binaries via objcopy and loaded them from U-Boot. It works, but we’re now moving our M7 build into Yocto so the firmware is part of the image directly, and @daniel_m.tx recommended trying remoteproc again to load it from there. We got it working, but we had to modify the NXP MCUXpresso SDK linker script and startup assembly in ways we haven’t seen documented anywhere, so we’d appreciate a sanity check from you guys. :slight_smile:

Our situation is different now from the DDR entry-point issue discussed in the earlier thread. We’re using the TCM linker script (MIMX8ML8xxxLZ_ram.ld) on the Verdin iMX8M Plus with the Toradex 6.x BSP (kernel toradex_6.6-2.2.x-imx) and NXP MCUXpresso SDK 25.06. We think that the problem we hit is about how the NXP SDK packs .data initializers into ITCM, which doesn’t work with remoteproc’s ATT validation.

The stock SDK linker script packs .data initializers right after .text in ITCM using AT(__DATA_ROM):

.data : AT(__DATA_ROM)  { ... } > m_data      /* VMA in DTCM, but LMA in ITCM after .text */
.ncache.init : AT(__NDATA_ROM) { ... } > m_data2  /* VMA in DDR, LMA also in ITCM */

The startup assembly then copies that data from ITCM to DTCM/DDR before main(). This works fine for U-Boot and for NXP’s small demo apps, but our firmware is bigger: .text is ~107KB and .data is ~91KB, so ~198KB total – way more than the 128KB ITCM. Looking at imx_rproc.c, the driver checks that each ELF PT_LOAD segment fits within a single ATT entry (ITCM = 128KB, DTCM = 128KB, DDR = 2GB), and remoteproc_elf_loader.c loads each segment to its LMA. Since the .data initializers are loaded right after .text and therefore run past the end of the 128KB ITCM entry, the driver rejects the segment:

Translation failed: da = 0x1a1d0 len = 0x16b90
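
As a quick sanity check of the numbers in that error line, here is a small Python sketch of the kind of range check involved. It is not the actual imx_rproc.c code; the region table is an assumption built from the sizes quoted above, with ITCM at 0x0 and DTCM at 0x20000000, and find_region is a hypothetical helper name.

```python
# Illustrative ATT-style range check, NOT the actual imx_rproc.c code.
# Region bases and sizes are assumptions based on the figures in this post.
ATT = [
    ("ITCM", 0x00000000, 0x20000),  # 128 KiB
    ("DTCM", 0x20000000, 0x20000),  # 128 KiB
]

ITCM_END = 0x20000

def find_region(da, length):
    """Return the region that fully contains [da, da + length), or None."""
    for name, base, size in ATT:
        if base <= da and da + length <= base + size:
            return name
    return None

# Values from the "Translation failed" line above.
da, length = 0x1A1D0, 0x16B90
end = da + length             # first address past the segment
overshoot = end - ITCM_END    # bytes beyond the ITCM entry

print(find_region(da, length))           # segment as laid out by the stock script
print(find_region(0x20000000, length))   # same length placed in DTCM
print(hex(end), hex(overshoot))
```

The .data segment starts at 0x1a1d0 inside ITCM and would end at 0x30d60, which is 0x10d60 bytes past the 128KB boundary at 0x20000. Placed at 0x20000000, the same length fits within the DTCM entry.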

We made two changes to fix this. First, we removed AT(__DATA_ROM) and AT(__NDATA_ROM) from the linker script so that each section’s load address equals its runtime address (LMA = VMA). This way remoteproc should load each ELF segment directly where it belongs – .data goes straight to DTCM, .ncache.init goes straight to DDR:

--- a/MIMX8ML8xxxLZ_ram.ld  (NXP SDK default)
+++ b/MIMX8ML8xxxLZ_ram.ld  (our version)
@@ -145,9 +145,8 @@
 
   __etext = .;    /* define a global symbol at end of code */
-  __DATA_ROM = .; /* Symbol is used by startup for data initialization */
 
-  .data : AT(__DATA_ROM)
+  .data :
   {
     . = ALIGN(4);
     __DATA_RAM = .;
@@ -160,9 +159,8 @@
 
   __CACHE_REGION_START = ORIGIN(m_interrupts);
   __CACHE_REGION_SIZE  = 0;
-  __NDATA_ROM = __DATA_ROM + SIZEOF(.data);
 
-  .ncache.init : AT(__NDATA_ROM)
+  .ncache.init :
   {
     __noncachedata_start__ = .;
     *(NonCacheable.init)
@@ -178,9 +176,8 @@
     __noncachedata_end__ = .;
   } > m_data2
 
-  __DATA_END = __DATA_ROM + (__data_end__ - __data_start__);
   text_end = ORIGIN(m_text) + LENGTH(m_text);
-  ASSERT(__DATA_END <= text_end, "region m_text overflowed with text and data")
+  ASSERT(__etext <= text_end, "region m_text overflowed with text")

As far as we understand, this gives us the following ELF segment layout, where each segment fits within its ATT entry:

Segment        LMA          Region   ATT entry
.text etc.     0x00000000   ITCM     128KB – fits
.data          0x20000000   DTCM     128KB – fits
.ncache.init   0x80000000   DDR      2GB – fits
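
One way to double-check this layout on the build host is to read the ELF program headers, which is the information `readelf -l` prints. The sketch below is a minimal parser using only the Python standard library. It assumes the firmware is a 32-bit little-endian ELF, and load_segments and lma_equals_vma are hypothetical helper names.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments

def load_segments(elf_bytes):
    """Return (vaddr, paddr, filesz) for each PT_LOAD entry of an ELF32 LE image."""
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 1 or elf_bytes[5] != 1:
        raise ValueError("expected a little-endian ELF32 file")
    # ELF32 header: e_phoff at offset 28, e_phentsize at 42, e_phnum at 44.
    (e_phoff,) = struct.unpack_from("<I", elf_bytes, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 42)
    segs = []
    for i in range(e_phnum):
        # First five fields of Elf32_Phdr: type, offset, vaddr, paddr, filesz.
        p_type, _p_offset, p_vaddr, p_paddr, p_filesz = struct.unpack_from(
            "<5I", elf_bytes, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD:
            segs.append((p_vaddr, p_paddr, p_filesz))
    return segs

def lma_equals_vma(elf_bytes):
    """True if every loadable segment has physical address == virtual address."""
    return all(vaddr == paddr for vaddr, paddr, _ in load_segments(elf_bytes))
```

Calling lma_equals_vma(open("firmware.elf", "rb").read()) (file name assumed) should return True for a build with the modified linker script, while a stock SDK build would be expected to show .data with a physical address inside ITCM.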

Second, we removed the data copy loops from the startup assembly. We had actually already removed the .data copy loop (ITCM → DTCM) in our earlier U-Boot approach, because U-Boot was loading split binaries directly to the right addresses. For remoteproc we additionally removed the .ncache.init copy loop (ITCM → DDR). The reasoning is the same for both: with LMA = VMA, there are no initializers sitting in ITCM after .text anymore – the loader already placed them at the right addresses. If we keep the copy loops, they’d read from __etext in ITCM (where there’s nothing useful) and overwrite the correctly-loaded data with garbage. We kept the .bss zeroing and .ncache zero-fill since those are still needed:

--- a/startup_MIMX8ML8_cm7.S  (NXP SDK default)
+++ b/startup_MIMX8ML8_cm7.S  (our version)
@@ -225,59 +225,22 @@
     ldr   r0,=SystemInit
     blx   r0
 #endif
-/*     Loop to copy data from read only memory to RAM. The ranges
- *      of copy from/to are specified by following symbols evaluated in
- *      linker script.
- *      __etext: End of code section, i.e., begin of data sections to copy from.
- *      __data_start__/__data_end__: RAM address range that data should be
- *      __noncachedata_start__/__noncachedata_end__ : none cachable region
- *      copied to. Both must be aligned to 4 bytes boundary.  */
+/*     Data initialization loops removed — not needed for remoteproc ELF loading.
+ *     Remoteproc loads each PT_LOAD segment directly to its LMA. With LMA=VMA
+ *     (no AT() in the linker script), .data is already at 0x20000000 (DTCM) and
+ *     .ncache.init is already at 0x80000000 (DDR). Copying from __etext would
+ *     read garbage from ITCM and corrupt the correctly-loaded data.
+ */
 
-    ldr    r1, =__etext
-    ldr    r2, =__data_start__
-    ldr    r3, =__data_end__
-
-#ifdef __PERFORMANCE_IMPLEMENTATION
-    subs    r3, r2
-    ble    .LC1
-.LC0:
-    subs    r3, #4
-    ldr    r0, [r1, r3]
-    str    r0, [r2, r3]
-    bgt    .LC0
-.LC1:
-#else
-.LC0:
-    cmp     r2, r3
-    ittt    lt
-    ldrlt   r0, [r1], #4
-    strlt   r0, [r2], #4
-    blt    .LC0
-#endif
 #ifdef __STARTUP_INITIALIZE_NONCACHEDATA
-    ldr    r2, =__noncachedata_start__
-    ldr    r3, =__noncachedata_init_end__
-#ifdef __PERFORMANCE_IMPLEMENTATION
-    subs    r3, r2
-    ble    .LC3
-.LC2:
-    subs    r3, #4
-    ldr    r0, [r1, r3]
-    str    r0, [r2, r3]
-    bgt    .LC2
-.LC3:
-#else
-.LC2:
-    cmp     r2, r3
-    ittt    lt
-    ldrlt   r0, [r1], #4
-    strlt   r0, [r2], #4
-    blt    .LC2
-#endif
 /* zero inited ncache section initialization */
+    ldr r2, =__noncachedata_init_end__
     ldr r3, =__noncachedata_end__
     movs    r0,0
 .LC4:
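
To illustrate why the copy loops have to go once LMA = VMA, here is a toy Python model, not the real startup code. Memory regions are plain bytearrays, copy_loop is a hypothetical stand-in for the SDK loop, and the 0xFF fill for the ITCM area after __etext is an arbitrary assumption standing in for whatever happens to be there.

```python
def copy_loop(src, dst):
    """Word-by-word copy in the style of the startup loop: dst[i] = src[i]."""
    for i in range(0, len(dst), 4):
        dst[i:i + 4] = src[i:i + 4]

# With LMA = VMA the loader has already placed the initializers in DTCM.
dtcm_data = bytearray(b"\x11\x22\x33\x44" * 2)

# Nothing useful sits after __etext in ITCM any more (modelled as 0xFF fill).
itcm_after_etext = bytearray(b"\xff" * 8)

before = bytes(dtcm_data)
copy_loop(itcm_after_etext, dtcm_data)  # what the stock startup would still do
after = bytes(dtcm_data)

print(before.hex(), "->", after.hex())
```

In this model the correctly loaded values are replaced by the filler bytes, which matches the reasoning in the comment added to the startup file.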

Remoteproc can load the firmware, and everything is running smoothly with these changes. But we couldn’t find any NXP or Toradex documentation describing this adaptation. We checked the linker scripts shipped with the MCUXpresso SDK 25.06, and all of them use AT(__DATA_ROM). So we’re a bit nervous about running these changes in production without a second opinion.

So our question is: Is removing AT() and the copy loops the right way to make the TCM linker script work with remoteproc? Are there any side effects we might be missing?

Thanks in advance for any input!

Best,
Chris

Hi, I’m the Toradex AI Assistant.

I might make mistakes or occasionally get you wrong. Our support team will step in if needed.


Hi Chris,

Your analysis and approach for adapting the NXP MCUXpresso SDK linker script for RemoteProc is technically sound and aligns with the known constraints of the Linux remoteproc framework on Toradex modules.

Your Solution is Correct

Your modifications address a fundamental issue with how the Linux remoteproc loader validates ELF segments against Address Translation Table (ATT) entries(1). The remoteproc framework requires that each ELF PT_LOAD segment fits entirely within a single memory region (ITCM = 128KB, DTCM = 128KB, DDR = 2GB)(1).

The stock NXP SDK approach of packing .data initializers after .text in ITCM works fine for smaller applications but fails for larger firmware when the combined size exceeds the 128KB ITCM limit(1). Your error message confirms this:

Translation failed: da = 0x1a1d0 len = 0x16b90

Why RemoteProc is Different from U-Boot

The key difference is that with U-Boot bootaux you loaded pre-split binaries straight to their target addresses, so no per-segment check was involved, while the Linux remoteproc framework enforces strict memory region validation(1)(2). RemoteProc loads each segment to its Load Memory Address (LMA) and validates that each segment fits within the corresponding ATT entry(1).

Your Changes Are Appropriate

Removing AT() directives: By setting LMA = VMA, you ensure each section loads directly to its runtime location, avoiding the ITCM overflow issue(1).

Removing data copy loops: Since remoteproc loads segments directly to their final addresses, the traditional startup copy from ITCM to DTCM/DDR becomes unnecessary and would indeed corrupt the correctly-loaded data(1).

Known Issue Context

Your approach aligns with community findings. There’s a documented case where similar ELF segment validation issues occur with larger firmware(3). The remoteproc framework’s stricter validation compared to U-Boot is a known characteristic(3).

Production Considerations

Your changes are architecturally sound for remoteproc usage. The modifications:

  • Respect the ATT memory boundaries
  • Eliminate unnecessary data copying that would corrupt remoteproc-loaded segments
  • Maintain proper .bss zeroing and .ncache zero-fill

Since you’re integrating into Yocto and need the firmware as part of the image, remoteproc is the appropriate choice over U-Boot bootaux(1).

Recommendation

Your approach is correct for production use with remoteproc. The lack of documentation for this specific adaptation reflects that most NXP examples target smaller applications that fit within ITCM constraints. Your larger firmware requires the sensible adaptations you’ve implemented.

Best regards, Toradex AI Assistant.



Hello @Fungiking,

One thing to be aware of is that the generated .bin file is likely very large, as it will span the full address range of the code and data being used.
This is something that you likely already noticed when splitting the firmware and loading its parts separately before.
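
As a rough illustration of the size issue: a raw binary (for example from objcopy -O binary) carries no address information, so the gaps between the lowest and highest load address are padded. The sketch below uses the LMAs from the first post; the .text size is approximated from the error line, the .ncache.init size is a made-up placeholder, and flat_image_size is a hypothetical helper name.

```python
# (name, LMA, size). LMAs come from the layout in the first post;
# the .ncache.init size is a placeholder chosen for illustration only.
segments = [
    (".text",        0x00000000, 0x1A1D0),
    (".data",        0x20000000, 0x16B90),
    (".ncache.init", 0x80000000, 0x1000),
]

def flat_image_size(segs):
    """Size of one raw image spanning the lowest LMA to the highest segment end."""
    start = min(lma for _, lma, _ in segs)
    end = max(lma + size for _, lma, size in segs)
    return end - start

flat = flat_image_size(segments)
payload = sum(size for _, _, size in segments)

print(f"flat image: {flat} bytes, actual payload: {payload} bytes")
```

With these numbers the flat image is a little over 2 GiB for roughly 200 KB of real content, which is why loading an ELF or separate per-region binaries is the practical option.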

Your changes look sensible, and are a way to implement this configuration.

One worry I would have is the correct configuration of addresses, considering that the address mapping is different between Cortex-A and Cortex-M.
However, if the firmware is working as expected, this should be correctly configured.

Otherwise, the main thing to be aware of is that, as this is not a default configuration, updates to the MCUXpresso SDK may require further changes later on if you plan to update.

Best Regards,
Bruno

Hi @bruno.tx, thanks for your answer. So if I understand you correctly, the only way to solve it and stay with the default configuration is to reduce the code and data size?
And about the configuration of addresses: this part is handled in the remoteproc driver, am I correct?

Hi @Fungiking,

Yes, that is correct.
The default configuration is to copy the data from ITCM to the different segments at the start of the firmware.
If this is not suitable due to firmware size, changes are needed.

With different hardware configurations, it is somewhat expected that the MCUXpresso SDK defaults need to be changed.
The DDR address is a good example, as a lower address would be needed if the SoM only had 1 GB of RAM installed.
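
To make the DDR example concrete, here is a small Python check. It assumes DDR starts at 0x40000000 in the SoC memory map, and in_ddr is a hypothetical helper name; the 0x80000000 address is the .ncache.init location from the first post.

```python
DDR_BASE = 0x40000000  # assumed start of DDR in the SoC memory map
GiB = 1024 ** 3

def in_ddr(addr, ram_bytes):
    """True if addr lies inside DDR for a module with ram_bytes of RAM."""
    return DDR_BASE <= addr < DDR_BASE + ram_bytes

ncache_addr = 0x80000000  # .ncache.init address from the first post

print(in_ddr(ncache_addr, 1 * GiB))  # module with 1 GB of RAM
print(in_ddr(ncache_addr, 2 * GiB))  # module with 2 GB of RAM
```

Under that assumption, 1 GB of RAM ends exactly at 0x80000000, so the address used in the linker script would be the first byte past the end and a lower one would have to be chosen.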


Yes, that is correct as well.
The remoteproc driver handles the address mapping to load the firmware in the correct place.

Best Regards,
Bruno

Hi @bruno.tx, thanks for the clarification. Then we will stay with the modified SDK :+1:

Btw, just for completeness: I linked the wrong thread in the first sentence of this post. The thread I was referring to is: IMX8MP Remote proc fails to load M7
