Hello everyone,
this is a follow-up to this earlier thread where @bruno.tx helped investigate remoteproc on the Verdin iMX8M Plus. That thread ended with the recommendation to stick with U-Boot bootaux for loading. We’ve been doing that: we split the ELF into separate itcm/dtcm/ddr binaries via objcopy and loaded them from U-Boot. It works, but we’re now moving our M7 build into Yocto so the firmware is part of the image directly, and @daniel_m.tx recommended trying remoteproc again to load it from there. We got it working, but we had to modify the NXP MCUXpresso SDK linker script and startup assembly in ways we haven’t seen documented anywhere, so we’d appreciate a sanity check from you guys.
Our situation is different now from the DDR entry-point issue discussed in the earlier thread. We’re using the TCM linker script (MIMX8ML8xxxLZ_ram.ld) on the Verdin iMX8M Plus with the Toradex 6.x BSP (kernel toradex_6.6-2.2.x-imx) and NXP MCUXpresso SDK 25.06. We think that the problem we hit is about how the NXP SDK packs .data initializers into ITCM, which doesn’t work with remoteproc’s ATT validation.
The stock SDK linker script packs .data initializers right after .text in ITCM using AT(__DATA_ROM):
.data : AT(__DATA_ROM) { ... } > m_data /* VMA in DTCM, but LMA in ITCM after .text */
.ncache.init : AT(__NDATA_ROM) { ... } > m_data2 /* VMA in DDR, LMA also in ITCM */
The startup assembly then copies that data from ITCM to DTCM/DDR before main(). This works fine for U-Boot and for NXP’s small demo apps, but our firmware is bigger: .text is ~107KB and .data is ~91KB, so ~198KB total – way more than the 128KB ITCM. Looking at imx_rproc.c, the driver checks that each ELF PT_LOAD segment fits within a single ATT entry (ITCM = 128KB, DTCM = 128KB, DDR = 2GB), and remoteproc_elf_loader.c loads each segment to its LMA. Since the combined .text + .data LMA exceeds 128KB, the driver rejects it:
Translation failed: da = 0x1a1d0 len = 0x16b90
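The numbers in that message can be checked by hand. Below is a minimal sketch of a "does this range fit inside one region" check. It is a simplified model, not the driver's code, and the exact boundary comparison in imx_rproc.c may differ. The region sizes are the ones quoted above; the DDR base address and the names `REGIONS` and `find_region` are assumptions made for illustration.

```python
# Simplified model of a "does [da, da + len) fit in one region" check.
# Sizes are taken from the post; the DDR base address is an assumption.
REGIONS = [
    ("ITCM", 0x00000000, 128 * 1024),
    ("DTCM", 0x20000000, 128 * 1024),
    ("DDR",  0x40000000, 2 * 1024**3),  # assumed base address
]

def find_region(da, length):
    """Return the name of the region that fully contains the range, else None."""
    for name, base, size in REGIONS:
        if base <= da and da + length <= base + size:
            return name
    return None

# Values from the "Translation failed" message.
da, length = 0x1a1d0, 0x16b90
end = da + length  # end of the range the loader tried to translate
print(hex(end), find_region(da, length))
```

The range starts inside ITCM but ends at 0x30d60, past the 0x20000 end of the 128KB ITCM window, so no single region contains it.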
We made two changes to fix this. First, we removed AT(__DATA_ROM) and AT(__NDATA_ROM) from the linker script so that each section’s load address equals its runtime address (LMA = VMA). This way remoteproc should load each ELF segment directly where it belongs – .data goes straight to DTCM, .ncache.init goes straight to DDR:
--- a/MIMX8ML8xxxLZ_ram.ld (NXP SDK default)
+++ b/MIMX8ML8xxxLZ_ram.ld (our version)
@@ -145,9 +145,8 @@
__etext = .; /* define a global symbol at end of code */
- __DATA_ROM = .; /* Symbol is used by startup for data initialization */
- .data : AT(__DATA_ROM)
+ .data :
{
. = ALIGN(4);
__DATA_RAM = .;
@@ -160,9 +159,8 @@
__CACHE_REGION_START = ORIGIN(m_interrupts);
__CACHE_REGION_SIZE = 0;
- __NDATA_ROM = __DATA_ROM + SIZEOF(.data);
- .ncache.init : AT(__NDATA_ROM)
+ .ncache.init :
{
__noncachedata_start__ = .;
*(NonCacheable.init)
@@ -178,9 +176,8 @@
__noncachedata_end__ = .;
} > m_data2
- __DATA_END = __DATA_ROM + (__data_end__ - __data_start__);
text_end = ORIGIN(m_text) + LENGTH(m_text);
- ASSERT(__DATA_END <= text_end, "region m_text overflowed with text and data")
+ ASSERT(__etext <= text_end, "region m_text overflowed with text")
As far as we understand, this gives us the following ELF segment layout, where each segment fits within its ATT entry:
| Segment | LMA | Region | ATT Entry |
|---|---|---|---|
| .text etc. | 0x00000000 | ITCM | 128KB – fits |
| .data | 0x20000000 | DTCM | 128KB – fits |
| .ncache.init | 0x80000000 | DDR | 2GB – fits |
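One way to double-check this layout on the host is to read the program headers out of the built ELF, which is the same information `readelf -l` prints. The sketch below assumes a 32-bit little-endian ELF and treats `p_paddr` as the LMA and `p_vaddr` as the VMA. `load_segments` and `lma_differs` are hypothetical helper names, not part of the SDK or the kernel.

```python
import struct

PT_LOAD = 1
EHDR = struct.Struct("<16sHHIIIIIHHHHHH")  # ELF32 little-endian file header
PHDR = struct.Struct("<IIIIIIII")          # ELF32 program header

def load_segments(elf_bytes):
    """Return (paddr, vaddr, memsz) for each non-empty PT_LOAD segment."""
    fields = EHDR.unpack_from(elf_bytes, 0)
    ident, phoff, phentsize, phnum = fields[0], fields[5], fields[9], fields[10]
    # ident[4] == 1 means 32-bit, ident[5] == 1 means little-endian.
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("expected a 32-bit little-endian ELF")
    segments = []
    for i in range(phnum):
        p_type, _off, vaddr, paddr, _filesz, memsz, _flags, _align = \
            PHDR.unpack_from(elf_bytes, phoff + i * phentsize)
        if p_type == PT_LOAD and memsz:
            segments.append((paddr, vaddr, memsz))
    return segments

def lma_differs(segments):
    """Return the segments whose load address is not their runtime address."""
    return [seg for seg in segments if seg[0] != seg[1]]
```

Feeding it the bytes of the firmware ELF and getting an empty list from `lma_differs` would indicate that no segment still carries a separate load address.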
Second, we removed the data copy loops from the startup assembly. We had actually already removed the .data copy loop (ITCM → DTCM) in our earlier U-Boot approach, because U-Boot was loading split binaries directly to the right addresses. For remoteproc we additionally removed the .ncache.init copy loop (ITCM → DDR). The reasoning is the same for both: with LMA = VMA, there are no initializers sitting in ITCM after .text anymore – the loader already placed them at the right addresses. If we keep the copy loops, they’d read from __etext in ITCM (where there’s nothing useful) and overwrite the correctly-loaded data with garbage. We kept the .bss zeroing and .ncache zero-fill since those are still needed:
--- a/startup_MIMX8ML8_cm7.S (NXP SDK default)
+++ b/startup_MIMX8ML8_cm7.S (our version)
@@ -225,59 +225,22 @@
ldr r0,=SystemInit
blx r0
#endif
-/* Loop to copy data from read only memory to RAM. The ranges
- * of copy from/to are specified by following symbols evaluated in
- * linker script.
- * __etext: End of code section, i.e., begin of data sections to copy from.
- * __data_start__/__data_end__: RAM address range that data should be
- * __noncachedata_start__/__noncachedata_end__ : none cachable region
- * copied to. Both must be aligned to 4 bytes boundary. */
+/* Data initialization loops removed — not needed for remoteproc ELF loading.
+ * Remoteproc loads each PT_LOAD segment directly to its LMA. With LMA=VMA
+ * (no AT() in the linker script), .data is already at 0x20000000 (DTCM) and
+ * .ncache.init is already at 0x80000000 (DDR). Copying from __etext would
+ * read garbage from ITCM and corrupt the correctly-loaded data.
+ */
- ldr r1, =__etext
- ldr r2, =__data_start__
- ldr r3, =__data_end__
-
-#ifdef __PERFORMANCE_IMPLEMENTATION
- subs r3, r2
- ble .LC1
-.LC0:
- subs r3, #4
- ldr r0, [r1, r3]
- str r0, [r2, r3]
- bgt .LC0
-.LC1:
-#else
-.LC0:
- cmp r2, r3
- ittt lt
- ldrlt r0, [r1], #4
- strlt r0, [r2], #4
- blt .LC0
-#endif
#ifdef __STARTUP_INITIALIZE_NONCACHEDATA
- ldr r2, =__noncachedata_start__
- ldr r3, =__noncachedata_init_end__
-#ifdef __PERFORMANCE_IMPLEMENTATION
- subs r3, r2
- ble .LC3
-.LC2:
- subs r3, #4
- ldr r0, [r1, r3]
- str r0, [r2, r3]
- bgt .LC2
-.LC3:
-#else
-.LC2:
- cmp r2, r3
- ittt lt
- ldrlt r0, [r1], #4
- strlt r0, [r2], #4
- blt .LC2
-#endif
/* zero inited ncache section initialization */
+ ldr r2, =__noncachedata_init_end__
ldr r3, =__noncachedata_end__
movs r0,0
.LC4:
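To illustrate why the copy loops had to go once LMA = VMA, here is a toy model of the removed word-by-word copy. The memory sizes, offsets, and the `copy_loop` name are illustrative assumptions only; real ITCM contents past `__etext` are whatever happens to be there, modeled here as zeros.

```python
def copy_loop(itcm, dtcm, etext, data_start, data_end):
    """Model of the removed loop: copy 4-byte words from ITCM offset `etext`
    into DTCM offsets [data_start, data_end)."""
    src = etext
    for dst in range(data_start, data_end, 4):
        dtcm[dst:dst + 4] = itcm[src:src + 4]
        src += 4

# Toy memories; sizes and contents are illustrative only.
itcm = bytearray(64)            # no initializers stored after etext anymore
dtcm = bytearray(b"\xAA" * 16)  # .data as the loader already placed it
copy_loop(itcm, dtcm, etext=32, data_start=0, data_end=16)
print(dtcm.hex())
```

After the call, the 0xAA pattern that stood in for the correctly loaded .data has been replaced by whatever sat behind `etext` in the modeled ITCM.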
Remoteproc can load the firmware, and everything is running smoothly with these changes. But we couldn’t find any NXP or Toradex documentation describing this adaptation. We checked the linker scripts provided in the MCUXpresso SDK 25.06, and all of them use AT(__DATA_ROM). So we’re a bit nervous about running these changes in production without a second opinion.
So basically our question is now: Is removing AT() and the copy loops the right way to make the TCM linker script work with remoteproc? Are there any side effects we might be missing?
Thanks in advance for any input!
Best,
Chris