Rpmsg not running stable on m4

andriscewo · August 20, 2018, 11:57am

memset(dst + phdr->p_filesz, 0x00, phdr->p_memsz - phdr->p_filesz);

This clears the leftovers of the memory segment it loaded too.

I understand that, but if you do it there would be a check necessary if it doesn’t overwrite the memory section. At the moment it can overwrite the memory section anytime and that should not be done! If the m_text and m_data are not continuous it could overwrite memory from any program.

Yes and no, I guess. The first
hardcoded part would be shared memory
where both sides expect to find things

messages to each other. But it does not need M4 running, or more
specifically, RPMsg enabled M4 code
running on the other side before/while
Linux boots. You can delay M4 startup
& firmware boot. And do your imx
remote proc boot you busy with trying
to make work .. (otherwise, whats the
point).

Well, rpmsg never worked with remoteproc for me, thought that was the issue, but then probably there was just something other missing. I checked it and it doesn’t matter when the m4 code starts the rpmsg. Could delay it without any issues.

As for the DDR region,

This must be already handled in device tree by simply disabling the area used for the M4 code.

Well, the bug I encountered shows that this is not sufficient. I would rather have the linux aware of the memory section and setting it readonly, than it being “not existent” for the linux and having potential undefined behavior regarding caches.

For the OCRAM this could be with the RDC ..

The RTOS hw init code does RDC setup for OCRAM, TCM, and so likely also happens on Linux side.

TCM can only be accessed from the m4 if I understand it correctly. As for the OCRAM I’ve never seen some RDC initialization and even if there is one, it would be still to late (as long as you don’t want to do it in assembler in the startup code). The memory protection has to be in place before the m4 is started else there is still room for violations

dry · August 21, 2018, 9:41am

@andriscewo May I ask, when booting the M4 code from U-boot, do you follow the required step of ‘dcache flush’ command, before bootaux ..?

dry · August 21, 2018, 10:43am

The m_data load address is inside the m_text memory region. So it would have to consider the memory boundaries of the m_text region and not the m_data region. It doesn’t do that!

I do not see what you do. I think you read this wrong.

Looking where you pointed, git link text, please then point out where your concern happens.

It has 2 issues: It zeros out the load
address of the .bss, .heap and .stack
instead of the execution address

Where? (also, what do you mean instead of execution address?)

The zeroing out part, which @stefan.tx also commented, is this what you are caught on?

if (phdr->p_filesz != phdr->p_memsz)
             memset(dst + phdr->p_filesz, 0x00,
                    phdr->p_memsz - phdr->p_filesz);

( While the check can be more safe, filesz must be less eq memsiz, if not, your linker config or linker are fkd heavy..)

Again, this zeros out correctly within the left-over of the segment.

Note, exactly same zeroing is done in the remoteproc_elf_loader : link text

/*
		 * Zero out remaining memory for this segment.
		 *
		 * This isn't strictly required since dma_alloc_coherent already
		 * did this for us. albeit harmless, we may consider removing
		 * this.
		 */
		if (memsz > filesz)
			memset(ptr + filesz, 0, memsz - filesz);

So why ? Because it does not hurt.

stefan.tx · August 21, 2018, 10:47am

I understand that, but if you do it there would be a check necessary if it doesn’t overwrite the memory section. At the moment it can overwrite the memory section anytime and that should not be done! If the m_text and m_data are not continuous it could overwrite memory from any program.

I am not sure I follow completely on that topic, but afaik, if m_data and m_text are not continuous it should lead to two Elf32_Phdr…

If you have 2 GIG of physical RAM, but it is shared between processors. Do you want some hardware attached to it (cache) that isn’t managed by anyone and could potentially write something into your RAM? This isn’t about remoteproc or not. The memory is there, the cache has access to it and I would reason that it’s more dangerous to not let your linux know that it’s actually there.

I see what you are saying in general, and I agree it would be preferable if Linux has a complete view of all memory, and also is aware of where the remoteproc firmware is located. As far as I know this is mostly the case with upstream/remoteproc.

There is a SRAM driver to specify SRAM area, which Linux does not touch by default. However, doing the same with DDR turned out to be more complex than I thought. Also you cannot simply request a specific physical memory location are from within a driver. I think there have been some changes around that which might allow it, but that did not work back in 4.9 times.

Anyway, the chosen solution is actually quite solid. Typically when you write beyond the available memory (e.g. beyond 1GiB), it also leads to a write in the lower memory area (since the highest address bit is simply ignored). Despite Linux not being aware of the region above 1GiB, it does not touch/use it, and this is considered safe.

Not specifying a section in the middle is not much different from that. U-Boot carves out 2MiB, which is a deliberate chosen size since it corresponds to the memory size which a complete table full of PTE’s manages. This way Linux’ memory manage will simply avoid setting up a complete PTE table. This makes sure that there is no side effects from adjacent memory regions.

No, it should not assign a memory region as default, but based on the address you can easily check if your elf file will be loaded into the OCRAM region. If it is done one could protect the .bss, .heap and .stack by the RDC. If this was done there wouldn’t have been this issue in the first place.

Using RDC to protect is definitely a good idea. That said, I am not sure how that particular case actually would behave: U-Boot (needs to) write to the area before the firmware boots and RDC would get setup. Hence at the point where RDC prohibits writes to those sections, the zeros are already in the primary cores cache… It could be that those cache lines then still hit those sections, depending on where exactly RDC prohibits writes…

dry · August 21, 2018, 11:00am

I don’t know if OP uses M4 FreeRTOS based bsp as provided by FSL/NXP, or by Toradex, but in FSL/NXP bsp documentation it is stated in few places that dcache flush must happen before bootaux.

May be the answer to this question is please check the docs..?

On the other hand, in Toradex git commit link text, the log message has :

 tftp ${loadaddr} low_power_demo.elf && bootaux ${loadaddr}

Which skips required flush (unless hidden somewhere..). Thus OP does point to valid problem.

However, @stefan.tx in your ticket that you mentioned above you will raise, you should then ask to do full flush for the bin case too, in do_bootaux not in load_elf_image_phdr.
I think …

EDIT: @andriscewo nicely caught… Hopefully you moved on & the code works for you.

andriscewo · August 21, 2018, 12:27pm

I give up. I tried to create a small example where the problem happens, but no luck. I tried also to recompile some old code from us, but no luck in getting the error.

stefan.tx · August 21, 2018, 12:58pm

Yeah NXP requires the user to manually flush. This has the downside that it flushes more than necessary, but it works of course…

We added the flush_cache to bootaux for the elf case to avoid this… Unfortunately, it turned out to introduce the bug @andriscewo found.

andriscewo · August 23, 2018, 8:20am

Thanks a lot. I will take another look at it here and hopefully it solves our problems

andriscewo · August 23, 2018, 8:34am

Let me try again. This are the memory regions for the OCRAM linker file

/* Specify the memory areas */
MEMORY
{
  m_interrupts          (RX)  : ORIGIN = 0x00000000, LENGTH = 0x00000240
  m_text                (RX)  : ORIGIN = 0x00910000, LENGTH = 0x00010000
  m_data                (RW)  : ORIGIN = 0x00920000, LENGTH = 0x00020000 /* EPDC */
}

I did now increase the .bss a lot by increasing the freeRTOS heap size. When I look now at the created elf file with objdump -h exec.elf I get this

The code above will write 0s until .stack + its size. Here it’s not that much of a problem because the region is still inside the m_data, but you can’t assume that in general

dry · August 24, 2018, 12:29am

@andriscewo I’m like a giraffe , but I finally caught on with what you pointed out about that clearing.

( Bah! So ironic, after I spent time making you nice screenshots explaining how that specific FSL linker script organized memory in your other thread … )

You right, that clearing of memory will go over memory segment. Because at this iteration, your destination address, for the data segment, is within the previous memory segment. The zeroing out may only be harmless for contiguous segments, but it makes less sense now, as there can be quite a bit of un-used memory left in the text segment. Which means only part of data segment will be zeroed out.

dry · August 27, 2018, 1:32am

That objdump picture never made it / cannot see it.

But note, its only init data that gets into m_text. Thus I’m not sure I’m with you on

because the region is still inside the m_data

The only data region which should be there is the init data. Not bss. Unless you meant something else ? (no dragging with ).
See screenshot (my map uses ocram + tcm, but it doesn’t matter for point I’m trying to make; in fact probably more clear as tcm & ocram are not contiguous address spaces).

The effect here of you increasing .bss should be just more of potential wrong zeroing out cross-memory by the load elf method.

If not for that caveat with init data in text & wrong zeroing cross memory, that cache flush probably should have worked just fine ..

jaski.tx · August 28, 2018, 8:33am

Thanks for you feedback. Let us know, when you can reproduce the issue.

andriscewo · August 29, 2018, 8:37am

Hm, strange about the image, but based on your comment I think you understand now what I was worrying about.

I did understand your point about flushing the caches. I’m very happy you mentioned it as I didn’t know this! I will try this as soon as possible, I just didn’t have the time yet.

andriscewo · August 30, 2018, 6:29am

I left scewo. Can you please keep the other guys updated about your progress

stefan.tx · August 30, 2018, 9:02am

@dry, @andriscewo,

I looked closer into this issue, and I am pretty sure @andriscewo is right. The p_memsz member of the program header stores the size of the memory image, this is everything we use > m_data in the linker file. This number can well grew out of m_text and grew into the m_data section, at least if they are just next of each other. Here is an example:

I use the MCIMX7D_M4_tcm.ld linker file to compile the str_echo_freertos example. To better illustriate the issue I use the following section alignment (note this does not make sense logically, but is a valid layout otherwise):

MEMORY
{
  m_text                (RX)  : ORIGIN = 0x1FFF8000, LENGTH = 0x00008000
  m_interrupts          (RX)  : ORIGIN = 0x20000000, LENGTH = 0x00000240
  m_data                (RW)  : ORIGIN = 0x20000240, LENGTH = 0x00007000
}

arm-linux-gnueabihf-objdump -x rpmsg_str_echo_freertos.elf looks like this:

Program Header:
    LOAD off    0x00001000 vaddr 0x1fff8000 paddr 0x1fff8000 align 2**12
         filesz 0x00005cdc memsz 0x00005cdc flags rwx
    LOAD off    0x00007240 vaddr 0x20000240 paddr 0x1fffdcdc align 2**12
         filesz 0x000001d0 memsz 0x000055c4 flags rw-
    LOAD off    0x00008000 vaddr 0x20000000 paddr 0x20000000 align     2**12
         filesz 0x00000240 memsz 0x00000240 flags r--
    LOAD off    0x00008804 vaddr 0x20005804 paddr 0x200032a0 align 2**12
         filesz 0x00000000 memsz 0x00000804 flags rw-

The second program header has a paddr of 0x1fffdcdc, filesz 0x000001d0 and a memsz of 0x000055c4… With that U-Boot would zero out up until 0x200032A0!

Thanks to the order of the program header it probably would not hurt in practice since U-Boot will load the m_interrupt area after the zero out… But it is definitely not what we want!

I think the reason for that is that memsz is mean in vaddr space, which we do not really use in U-Boot. Maybe it is also related to how we use the AT linker keyword. Anyway, it seems safer to not clear the additional section, so I will remove the memset.

dry · August 30, 2018, 9:55am

@stefan.tx please see my comment above (3days ago), and below (6 days ago or so .. search “Bah”. I confirm what he spotted). with linker screenshots, it’s covered what you mentioned. (Also we had a thread discussing linker script in question).

Yes he is right mostly.

The problem is worse when memory segments are non continuous .

The p_memsz member of the program header stores the size of the memory image,

The code/data memory image size, yes. (But not the total physical mem at least .. ).

this is everything we use > m_data in the linker file.

The init data goes to text, the rest (bss, heap, stack) to m_data. The init data gets copied to m_data on startup (which is inefficient but that’s another issue).

The problem with memset is mainly due to the fact that on that last iteration (for this example of linker script) , the load address is actually inside previous segment (m_text), while iteration assumes , which would be ok normally, that you are loading at the start of m_data segment.

The clearing happens cross m_text - m_data iff ( phdr->p_memsz - phdr->p_filesz ) does not fit into the left overs of m_text.

EDIT:
And at least for the default/provided (more/less same between NXP & Toradex as provided) linker script we’ve been looking at, there is a chance that the clearing will not go over m_text boundary. It means this ‘best’ case is not what the clearing intention was, but it should not hurt.

Most often or at least in the problem we looking at, the clearing will go over the memory segments - from m_text to m_data (I’m referring at least to examples we’ve been discussion before stefan’s last example, which is almost same but he shifted interrupts..). And here it is more harry with what that zero’ing will do:
[And even IGNORING the chaos case when m_text, m_data memories are not contiguous]

If it does not go over the init data area at startup of m_data, then it clears the area which will be loaded by the startup code anyway (because iMX startup code loads that RO data to memory); So unless the 0’s start getting flushed after startup code wrote it, there is no problem.
If it goes over data and into .bss, well, .bss should be cleared/set by the C library anyway on the startup, and if C library doesn’t do it there is the compile macro you can set in the startup code (as also @andriscewo mentioned), which will clear out .bss for us; So again, unless the non flushed 0’s get flushed after startup code , no issue.
If it goes over the data+ .bss and into heap + stack, this is probably the “worst” (as far as I thought about it ..). As far as I know, nothing should touch that memory on the init, e.g. nothing clears it as such (e.g. unless calloc is called, no heap will be cleared ) . If the cache wasn’t flushed after the firmware was loaded, if somehow, after the code has started that area is modified outside of M4 control .. well. This is how I understand the possible problem manifested by what @andriscewo reported.

Probably how the non-flushing of the cache for the zero’ed out memory from that elf load will trigger weird or undefined behavior.

Now I’m a bit sketchy about why would that cached memory (which was zeored out by the elf loader ..) stay so long un-flushed / or out of sync.

If you understand differently & clearly how the wrong clearing + cache non flush wrong mechanics work here @andriscewo @stefan.tx please share.

stefan.tx · August 30, 2018, 10:39am

Storing the .data section in “ROM” is actually anyway not necessary in the i.MX 7 case. There is no real ROM, and the boot loader is able to load the .data section into the final place directly. This patch removes the copy loop in the initialization code in platform/devices/MCIMX7D/startup/gcc/startup_MCIMX7D_M4.S and removes the AT keyword for the .data section in the linker file platform/devices/MCIMX7D/linker/gcc/MCIMX7D_M4.ld.

With that the program headers vaddr and paddr are well aligned and a zeroing out would be safe:

Program Header:
    LOAD off    0x00001000 vaddr 0x00000000 paddr 0x00000000 align 2**12
         filesz 0x00000240 memsz 0x00000240 flags r--
    LOAD off    0x00002000 vaddr 0x1fff8000 paddr 0x1fff8000 align 2**12
         filesz 0x00005cbc memsz 0x00005cbc flags rwx
    LOAD off    0x00008000 vaddr 0x20000000 paddr 0x20000000 align 2**12
         filesz 0x000001d0 memsz 0x00005dc8 flags rw-

Also, this actually leaves more room for code as it avoids unnecessary duplication of the .data section.

dry · August 30, 2018, 11:26am

@stefan.tx @andriscewo

Sorry this is not answer but I didnt’ want this to get lost in comments.

We agree now ( full bonus to @andriscewo) on the problems reported (mostly I guess?) . The linker script or scripts in question, originating from FSL/NXP, turned out to be problematic for the U-boot and remote proc elf loaders. I have to admit I saw no issues with how these linker scripts organize memory, and I saw/see perhaps a plus in doing it this way.

The linker issue @andriscewo reported here error in linker, and we also discussed it too, and in relation to remote proc elf loader.

Motivated, I reported what I saw as problem with the elf loader in remote proc (see thread referred above) to the Linux kernel remote proc mailing list. I also mentioned the clearing/zeroing out problem as was discussed here, because as was pointed out, that code also does it.

I think the fixing can be done easily , yet , from the reply I got, people just re-work those NXP’s linker scripts. And because one can adjust how that data is placed relatively easy, there is no need or want to change how the elf loader works in remote proc kernel driver (including it’s memset()). Or may be I should just send the patch.
Basically, it may be , when using Linux based loaders, avoid such elfs produced with “strange” placements and start-up code. Linux firmware elf loaders won’t like it.

(So I have to admit too, @andriscewo was saying it’s a strange linker script in his thread, but , may be because I’m coming from RTOS side with ‘smarter’ external loaders, I was arguing it’s just fine … Well, I have to give him bonus on that discussion too. )

Regards

stefan.tx · August 30, 2018, 12:07pm

Now I’m a bit sketchy about why would that cached memory (which was zeored out by the elf loader ..) stay so long un-flushed / or out of sync.

This is pretty much in line what I expect actually:

Caches do not get flushed automatically after a timeout. Caches get only flushed when another entry takes its place.

Since the A7 has a L1 and a rather large L2 cache of 512KiB, chances are quite high that a cache line only gets flushed during Linux boot, or even later…

stefan.tx · August 30, 2018, 12:21pm

Unfortunately I did not had the time to follow the discussion regarding remoteproc fully, but as far as I understand by dropping the AT keyword it should resolve the remoteproc too, is this correct? I pushed the change to our git tree since I really feel it is a sensible change and even saves space in the text section: