I want to use kexec to gather crash logs when a kernel panic occurs. I’m using the Verdin IMX8MM module. I’m using a modified TorizonCore 5.7.0: 5.4.193-5.7.0-devel+git.90bfeac00dbe
To do this, I’ve followed the instructions below
I’ve also gone through this helpful thread:
To quickly summarize what I’ve done:
Made sure the following kernel config options were set:
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PROC_VMCORE=y
CONFIG_RELOCATABLE=y
Added IMAGE_INSTALL_append = " kexec-tools makedumpfile" to my image recipe.
Added the cmdline argument crashkernel=128M to the main kernel by executing setenv defargs "crashkernel=128M" and saveenv in U-Boot.
Verified the above steps took effect with cat /proc/cmdline and cat /proc/iomem to check the cmdline arguments and the reserved memory, respectively.
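The same verification can be scripted so it is easy to repeat after every image change. The following is a minimal sketch, not part of the original setup: the function names `crashkernel_arg` and `crash_kernel_region` are made up for illustration, and it assumes the reservation shows up in /proc/iomem under the label "Crash kernel", as it does on mainline kernels.

```python
import re
from typing import Optional, Tuple


def crashkernel_arg(cmdline: str) -> Optional[str]:
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def crash_kernel_region(iomem: str) -> Optional[Tuple[int, int]]:
    """Return (start_address, size_in_bytes) of the 'Crash kernel' entry.

    Expects the text of /proc/iomem. Returns None if no such entry exists,
    which suggests the crashkernel= reservation did not take effect.
    """
    pattern = re.compile(
        r"\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*Crash kernel\s*$"
    )
    for line in iomem.splitlines():
        m = pattern.match(line)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2), 16)
            return start, end - start + 1  # the range is inclusive
    return None
```

On the target, pass in the contents of /proc/cmdline and /proc/iomem. Read /proc/iomem as root, since unprivileged reads typically show all addresses as zero.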
Here, the kernel always seems to hang after “Bus freq driver module loaded”. I’ve included the full log as an attachment.
What am I missing to get this working? The goal is to load the kernel with kexec -p instead of kexec -l so that I can use it to gather a crash dump after a kernel panic. This, of course, also doesn’t work. Debugging with kexec -p and echo c > /proc/sysrq-trigger gives me no output after the kernel panic, so the crash kernel does not seem to load properly.
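Before triggering a panic, it can help to confirm that the kernel actually accepted the image loaded with kexec -p. The kernel exposes this through /sys/kernel/kexec_crash_loaded (1 when a crash kernel is loaded) and /sys/kernel/kexec_crash_size (the reserved size in bytes). A minimal sketch follows; the function name `crash_kernel_status` and the `sysfs_root` parameter are illustrative assumptions, the latter existing only so the check can be tried against a test directory.

```python
from pathlib import Path


def crash_kernel_status(sysfs_root: str = "/sys/kernel") -> dict:
    """Report whether a crash kernel is loaded and how much memory is reserved.

    Reads kexec_crash_loaded and kexec_crash_size below sysfs_root.
    """
    root = Path(sysfs_root)
    loaded = (root / "kexec_crash_loaded").read_text().strip() == "1"
    reserved = int((root / "kexec_crash_size").read_text().strip())
    return {"loaded": loaded, "reserved_bytes": reserved}
```

If `loaded` is false right after running kexec -p, the load itself failed, and the silence after echo c > /proc/sysrq-trigger is expected. If it is true, the problem is more likely in the boot of the crash kernel, which would match the hang seen with kexec -l.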
I have several questions about your situation so I can better understand your setup.
Correct me if I misunderstand anything. So you have a custom-built TorizonCore image. This image is experiencing some kind of kernel crash/panic. You then added several debugging utilities to this image in order to debug this further. Is that more or less correct?
If yes, then here are my questions:
Why are you custom building TorizonCore and what changes have you done to the default image?
What is this kernel panic/crash you are experiencing in the first place? Is it related to the custom changes you did or does default TorizonCore do this as well?
Here, the kernel always seems to hang after “Bus freq driver module loaded”. I’ve included the full log as an attachment.
I think you also forgot to attach the logs for this.
The custom TorizonCore was necessary to change some kernel configuration options, add drivers (the display driver only seemed to work when built into the kernel instead of loaded as a module), edit recipes such as Plymouth to change the orientation of the splash screen and position of the spinner (our screen is rotated), and so on. I also wrote a patch for the NXP/Freescale I2S driver to enable BCLK before our speaker driver’s BCLK is enabled as that driver IC needs the clock to be enabled before configuration.
I don’t think the crash is related to TorizonCore. It seems to be an Out Of Memory issue related to Docker. I would just like to be able to retrieve the logs so I can be sure memory is the problem, and can see when exactly it occurs, and maybe even why it ran out of memory at that time and decided to kill the Docker daemon instead of a container. This would be helpful in debugging our containers or pinpointing a memory leak if that is the issue. The exact problem is described here: Container limit? - #15 by henrique.tx. We’ve now tested with Verdin IMX8MM modules with more memory, and it does seem to alleviate, but not eliminate the problem. I’ll also do some testing with different memory limit settings per container soon.
I indeed forgot the attachment; it should be included this time: kexec_log.txt (10.2 KB)
Given the information you provided I have some comments.
First of all, at the moment I have no idea why kexec is causing the behavior you are seeing and not causing the kernel to boot. I personally have not tried using this on TorizonCore and do not know of anyone else who has.
Second, the container limit issue you referenced. During initial investigations, what our team found was that for some reason, when initializing a container, Docker requests a fixed amount of memory. This doesn’t seem to be related to how heavy or light the container is. For that reason, even starting several lightweight containers in parallel seems to use up all the memory and cause the observed crash. Now this is still being investigated internally, so more details may pop up.
Though on that note, what is your container setup where you are seeing this error? What kinds of containers are you starting and how many?
With that all said, here are my thoughts: we can try to debug your kexec issue, but this may take time and doesn’t really tackle the actual issue you want to debug. We can try alternative debugging methods and see where we get with that. For example, you could try enabling persistent logging: Persistent Journald Logging | Toradex Developer Center
This might get you useful logs from the docker daemon and other processes.
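Once persistent logging is enabled, the kernel messages from the previous boot (for example the output of journalctl -k -b -1) can be scanned for OOM-killer activity. Here is a minimal sketch, assuming the kernel logs a line containing "Killed process <pid> (<name>)" when the OOM killer picks a victim. The exact wording varies between kernel versions, and the helper name `oom_victims` is made up for illustration.

```python
import re
from typing import List, Tuple

# Matches e.g. "Out of memory: Killed process 612 (dockerd) ..."
# The surrounding text differs between kernel versions, so only the
# "Killed process <pid> (<name>)" part is matched.
_KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_victims(log_text: str) -> List[Tuple[int, str]]:
    """Return (pid, process_name) for every OOM kill found in the log text."""
    victims = []
    for line in log_text.splitlines():
        m = _KILLED.search(line)
        if m:
            victims.append((int(m.group(1)), m.group(2)))
    return victims
```

Feeding it the saved kernel log from the boot that crashed should show whether the Docker daemon or a container process was chosen as the victim, provided the log was flushed to storage before the device went down.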
Too bad that no one has tried kexec before, it seems like a valuable option.
The container setup is actually exactly the same as in the linked thread; we’re working on the same product. We’re running a Cog container, and then a few applications. We’ve reduced the number of containers and are now working with modules with more RAM, which has helped. When I have some time to look into it more, I would also like to try limiting the container memory to see if that helps when booting them in parallel. Timing them does also seem to help but is finicky, and then we can’t use the restart option in docker-compose, lest the device crashes and all containers start at boot, crashing the device and getting stuck in a boot loop.
I’ve already enabled persistent logging; booting a kernel from the kernel was not my first choice.
For now I will put the kexec idea on the backburner and see if our issues get resolved without needing to resort to it. I’m sure it can be useful to others as well tho, not just for debugging panics.
I didn’t realize you and the other customer in that thread were from the same group. That makes sense now.
limiting the container memory to see if that helps when booting them in parallel.
From our testing/investigation so far this doesn’t seem to help. Limiting a container’s available memory only seems to come into effect once the container is running. This doesn’t seem to limit the memory used on container startup. Though your mileage may vary.
In short it just seems to be how Docker works with requiring a sort of fixed amount of memory when a container initializes. Strangely enough changing the version/variant of Docker affects the max limit of containers that can be started in parallel as well. In other words there is something very strange going on with what Docker is doing under the hood.
Unfortunately we probably won’t be able to do much more investigation here in the short-term on our side since our 5.7.0 release is coming and we will be moving on to our 6.X releases thereafter. Therefore in the short-term we can only suggest working around this “container limit” at least until we have more time to investigate this behavior.
Thanks for the extra info and for saving me the time of checking that Docker behaviour. It’s interesting that there would be differences between Docker versions. I’m curious to see possible solutions, but we will continue with the workaround for now.