Trouble getting kernel crash dumps working on Colibri IMX7

conoroneill · October 22, 2019, 5:01am

Hello

I am attempting to get kernel crash dumps working on the Colibri iMX7. I’m using Toradex BSP 2.7, with some slight modifications to the kernel and rootfs configuration. I’ve been unable to get crash dumps working, and I’m hoping someone can point me to what I’m doing wrong.

The mechanism for kernel crash dumps is that a second “dump kernel” will start up when the first kernel crashes and collect state information about the running kernel. However I’m seeing the dump kernel hang as soon as it starts. From the digging I’ve done into the issue, the hang in the dump kernel seems to happen as it jumps to the start_kernel routine in init/main.c. The new kernel executes the assembly leading up to the jump to __enter_kernel in arch/arm/boot/compressed/head.S, but after it jumps to the entry point I see no response from it. I see the same behavior when I attempt to induce a crash and have a crash kernel take over (kexec -p), and when I load the dump kernel using ‘kexec -l’ and execute it without a crash using ‘kexec -e’.

Is there any documentation on getting kernel crash dumps working on the Colibri iMX7, or is anyone able to diagnose what I’m doing incorrectly? Information on this feature is sparse on the internet, and what is there often conflicts.

Here is what I am doing to reproduce.

Sync the yocto project system used to build the BSP: “repo init -u Index of /toradex-bsp-platform.git -b LinuxImageV2.7”, and “repo sync”.
Make the following modifications to the recipe files:

2.1) Add the following to layers/meta-toradex-nxp/recipes-kernel/linux/linux-toradex-4.1-2.0.x/defconfig. I also modified linux-toradex-4.1-2.0.x/mx7/defconfig and several other kernel defconfigs, as I was unsure which one would be picked up. Confirm that these features are enabled in your build by checking build/tmp-glibc/work-shared/colibri-imx7/kernel-build-artifacts/.config.

CONFIG_KEXEC=y    
CONFIG_SYSFS=y    
CONFIG_DEBUG_INFO=y        
CONFIG_CRASH_DUMP=y      
CONFIG_PROC_VMCORE=y    
CONFIG_DEBUG_LL=y    
CONFIG_EARLY_PRINTK=y      
CONFIG_MAGIC_SYSRQ=y
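
To make that check repeatable, something like the following shell helper could be used. This is only a sketch: the function name check_kconfig is made up for illustration, and it simply looks for an exact "OPTION=y" line in whatever .config file you point it at.

```shell
# Hypothetical helper: report which of the listed options are not "=y"
# in a kernel .config file. Returns non-zero if any option is missing.
# Usage: check_kconfig <path-to-.config> OPTION...
check_kconfig() {
    config_file="$1"
    shift
    missing=0
    for opt in "$@"; do
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "ok       ${opt}"
        else
            echo "MISSING  ${opt}"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (path taken from the step above):
# check_kconfig build/tmp-glibc/work-shared/colibri-imx7/kernel-build-artifacts/.config \
#     CONFIG_KEXEC CONFIG_CRASH_DUMP CONFIG_PROC_VMCORE
```

An option that was silently dropped by the build shows up as MISSING, which is easier to spot than scanning the file by eye.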

2.2) Add the corresponding userspace utilities to your rootfs by modifying layers/openembedded-core/meta/recipes-core/images/core-image-minimal.bb and adding the following:

IMAGE_INSTALL_append = " kexec-tools makedumpfile"

2.3) Change your machine to colibri-imx7 in local.conf

bitbake core-image-minimal linux-toradex u-boot-toradex
Add the resulting zImage and zImage-imx7d-colibri-aster.dtb to your root filesystem at the base location. These are the binaries that will be used as the dump kernel; they are identical copies of the versions in NAND that are used to boot the system. Do this by mounting core-image-minimal-colibri-imx7.ubifs, copying zImage and the dtb to the root of that mount point, and saving the result as a new UBIFS filesystem.

Side note: I am doing it this way because the kernel and dtb are stored as raw binaries spread across a UBI backing in our system. They are not on a UBIFS that can be mounted at runtime, and hence are not accessible to the kexec command. I believe providing copies of the binaries in this way should work.

Flash zImage, u-boot-toradex.imx and the new filesystem you created in the previous step to the device.
Boot into U-Boot. If you want to reproduce using the method of inducing a kernel panic, you will have to reserve a section of memory for the dump kernel. Do this by interrupting the boot (press Enter) and adding the following argument to the ‘bootargs’ U-Boot environment variable:

crashkernel=128M@2058M

This will reserve a 128M block of memory in a place that doesn’t overlap with the reserved kernel memory (past the 0x80000000 start of DDR). You can verify it worked from the running system by issuing ‘cat /proc/iomem’ and noting the reserved block of memory for crash kernel.
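
For reference, the arithmetic behind that value can be sketched in shell. The helper name crashkernel_range is made up for illustration, and the sketch assumes a shell with 64-bit integer arithmetic. The printed range is what I would expect to see next to the crash kernel entry in /proc/iomem.

```shell
# Convert a "crashkernel=SIZE@OFFSET" value given in megabytes into the
# physical address range it reserves (hex, inclusive end address).
# Usage: crashkernel_range <size_mb> <offset_mb>
crashkernel_range() {
    size_mb="$1"
    offset_mb="$2"
    start=$((offset_mb * 1024 * 1024))
    end=$((start + size_mb * 1024 * 1024 - 1))
    printf '%08x-%08x\n' "$start" "$end"
}

crashkernel_range 128 2058   # prints 80a00000-889fffff
```

So 128M@2058M starts 10 MB past the 0x80000000 start of DDR and ends just below 0x88a00000.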

This isn’t necessary if you want to reproduce using ‘kexec -l’ followed by ‘kexec -e’, though I’m not totally certain the hangs are caused by the same problem.

Let the system boot into Linux. Once the init output has gone by and you have a root prompt, issue the following:

export DUMPK_CMDLINE="1 console=tty1 console=ttymxc0,115200n8 consoleblank=0 root=ubi0:rootfs rootfstype=ubifs rootwait init=/sbin/init maxcpus=1 reset_devices"

kexec --type zImage -l /zImage --dtb=/zImage-imx7d-colibri-aster.dtb --append="${DUMPK_CMDLINE}"

kexec -e

[ 120.679519] kexec: Starting new kernel
[ 120.685639] Disabling non-boot CPUs ...
[ 120.721259] CPU1: shutdown
[ 120.751820] Bye!
Uncompressing Linux... done, booting the kernel.

This is the last thing I see. I expect to see the new kernel doing its early init, but it appears to hang instead. If you wish to reproduce using kernel panic method, ensure you have the ‘crashkernel’ argument to your kernel as described above, and issue the following:

# export DUMPK_CMDLINE="1 console=tty1 console=ttymxc0,115200n8 consoleblank=0 root=ubi0:rootfs rootfstype=ubifs rootwait init=/sbin/init maxcpus=1 reset_devices"

# kexec --type zImage -p /zImage --dtb=/zImage-imx7d-colibri-aster.dtb --append="${DUMPK_CMDLINE}"

# echo c > /proc/sysrq-trigger
[   87.302114] sysrq: SysRq : Trigger a crash
[   87.310430] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[   87.323020] pgd = 94840000
[   87.327862] [00000000] *pgd=94885831, *pte=00000000, *ppte=00000000
[   87.336382] Internal error: Oops: 817 [#1] SMP ARM
[   87.343282] Modules linked in:
[   87.348419] CPU: 0 PID: 233 Comm: sh Not tainted 4.1.44-2.7.6+g18717e2b1ca9 #1
[   87.359822] Hardware name: Freescale i.MX7 Dual (Device Tree)
[   87.367712] task: 94310a80 ti: 9487e000 task.ti: 9487e000
[   87.375251] PC is at sysrq_handle_crash+0x48/0x50
[   87.382072] LR is at __handle_sysrq+0x120/0x174
[   87.388661] pc : [<80346b7c>]    lr : [<803473d8>]    psr: 60080013
[   87.388661] sp : 9487feb0  ip : 00000000  fp : 7eeca668
[   87.404273] r10: 00000000  r9 : 00000002  r8 : 00000000
[   87.411540] r7 : 808dbfcc  r6 : 00000007  r5 : 00000063  r4 : 808c5ba8
[   87.420079] r3 : 00000000  r2 : 00000001  r1 : 97b8031c  r0 : 00000063
[   87.428585] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   87.437722] Control: 10c5387d  Table: 9484006a  DAC: 00000015
[   87.445434] Process sh (pid: 233, stack limit = 0x9487e210)
[   87.452975] Stack: (0x9487feb0 to 0x94880000)
[   87.459270] fea0:                                     00000002 00000000 00000000 94327b80
[   87.471280] fec0: 00b37d88 00000002 00000000 80347818 803477e0 8013f8d0 9474ba80 8013f874
[   87.483317] fee0: 9487ff80 00000002 00b37d88 800ea4b0 9400ac00 00000001 808bb880 9400ad88
[   87.495413] ff00: 00000001 00000000 00000000 800ec9ac 00000020 00000301 00000020 000000bc
[   87.507649] ff20: 00000000 00000000 94310a78 9487ff60 94310a80 9474ba80 00b37d88 9474ba80
[   87.520016] ff40: 00b37d88 9487ff80 00000002 00b37d88 00000002 800ead28 00000000 800300dc
[   87.532421] ff60: 00000003 9474ba80 9474ba80 00000000 00000000 00b37d88 00000002 800eb5e4
[   87.544945] ff80: 00000000 00000000 00000200 00070878 00000002 00b37d88 00000004 8000f4c4
[   87.557637] ffa0: 9487e000 8000f340 00070878 00000002 00000001 00b37d88 00000002 00070878
[   87.570477] ffc0: 00070878 00000002 00b37d88 00000004 00000020 00b37d88 00000000 7eeca668
[   87.583527] ffe0: 00000000 7eeca434 0000dcb1 76e997f0 40080010 00000001 97fbe821 97fbec21
[   87.596668] [<80346b7c>] (sysrq_handle_crash) from [<00000000>] (  (null))
[   87.606105] Code: e5c32000 e8bd8010 e3a03000 e3a02001 (e5c32000)
[   87.614787] CPU 1 will stop doing anything useful since another CPU has crashed
[   87.627964] Loading crashdump kernel...
[   87.634395] Bye!
Uncompressing Linux... done, booting the kernel.

Again, that is the last thing I see.
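
Before triggering the panic, it may be worth confirming that ‘kexec -p’ actually left a crash kernel loaded. The kernel exposes this as a 0/1 flag in /sys/kernel/kexec_crash_loaded. A minimal sketch follows; the function name crash_kernel_loaded is made up for illustration, and the optional argument exists only so the check can be pointed at another file.

```shell
# Succeeds only if the flag file exists and contains "1".
# With no argument, checks the kernel's crash-kernel flag in sysfs.
crash_kernel_loaded() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Example: only trigger the crash if a crash kernel is in place.
# if crash_kernel_loaded; then echo c > /proc/sysrq-trigger; fi
```

This does not explain the hang by itself, but it rules out the case where the panic path has nothing to jump to.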

Can anyone offer any pointers as to how to correctly configure your system for crash dump?

Thanks in advance,

Conor

raja.tx · October 23, 2019, 4:57pm

Dear @conoroneill,

Thank you for contacting support. I am looking at this issue. Could you please wait a couple of days? We will get back to you soon.

jaski.tx · October 28, 2019, 8:22am

Hi @conoroneill

I tried this out with kernel 4.1 and I am getting the same output as you. It works a bit better with the 4.9 kernel, but not completely. You might try kernel 4.14 and check whether you get a crash dump from the kernel.

Best regards,
Jaski

conoroneill · October 31, 2019, 9:25pm

Hello

I tried with kernel 4.14 and I get exactly the same output as described above. I am now attempting to build kernel 4.9 to validate @jaski.tx’s results.

When you say it works better with the 4.9 kernel, do you mean when you use 4.9 kernel as your running kernel, when you use 4.9 kernel as your crash kernel, or both? At the moment I can only use 4.1 for my running kernel.

In your experiments with the 4.9 kernel, did you do anything differently from what I described?

jaski.tx · October 31, 2019, 9:29pm

Hi

I used 4.9 for both the running and the crash kernel. What did you do for 4.14?
I met someone at ELCE 2019 who was using a Colibri iMX7 with kernel 4.19 for kernel crash dumps, and it was working fine for him.

Best regards,
Jaski

conoroneill · November 6, 2019, 11:04pm

Hello

I tried 4.14 for both the running kernel and the dump kernel. I’m getting the exact same behavior as I originally outlined. Here is the output when I run ‘kexec -e’ with the new kernel.

root@phoenix-univ:~# kexec -e
[   25.800230] ci_hdrc ci_hdrc.0: remove, state 4
[   25.806732] usb usb1: USB disconnect, device number 1
[   25.814805] ci_hdrc ci_hdrc.0: USB bus 1 deregistered
[   25.825747] kexec_core: Starting new kernel
[   25.832412] Disabling non-boot CPUs ...
[   25.930200] IRQ 21: no longer affine to CPU1
[   25.931019] Bye!
Uncompressing Linux... done, booting the kernel.

I also tried with 4.9 to similar effect. You mentioned you made more progress with 4.9. Can you please outline the specific steps to reproduce the progress you made?

jaski.tx · November 7, 2019, 1:44pm

Hi @conoroneill

I created a custom kernel with a specific .config. This kernel config and the output log can be found in the attached file.

Best regards,
Jaski

jaski.tx · November 25, 2019, 12:36pm

Hi @conoroneill

Meanwhile I managed to make the kernel crash dump work on the Colibri iMX7 eMMC module. Unfortunately, on the NAND version of the iMX7, I’m having some issues related to UBIFS.

These are the commands I used:

export DUMPK_CMDLINE="1 console=tty1 console=ttymxc0,115200n8 consoleblank=0 root=mtdblock0 rootfstype=ubifs rootwait init=/sbin/init maxcpus=1 reset_devices"

kexec --type zImage -l /zImage --dtb /imx7d-colibri-eval-v3.dtb --append="${DUMPK_CMDLINE}"
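
Note that the quotes around ${DUMPK_CMDLINE} matter here. Without them the shell splits the variable on spaces, so only the first word ends up as the value of --append and the rest are passed to kexec as separate arguments. A minimal illustration (count_args is a throwaway helper and the command line is shortened):

```shell
# Print how many arguments the shell actually passes to a command.
count_args() { echo "$#"; }

DUMPK_CMDLINE="1 console=ttymxc0,115200n8 maxcpus=1 reset_devices"

count_args --append=${DUMPK_CMDLINE}     # unquoted: prints 4
count_args --append="${DUMPK_CMDLINE}"   # quoted:   prints 1
```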

You can find the kernel with the corresponding device tree here.

I think one solution could be to use an SD card or NFS boot to write the kernel crash dump to.
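
As a rough sketch of what that could look like once the dump kernel is up: with CONFIG_PROC_VMCORE the crashed kernel's memory should be readable as /proc/vmcore, and it can be copied to external storage. The function name save_vmcore and the /media/sdcard mount point are assumptions for illustration, and I have not tested this on the module.

```shell
# Copy a crash dump to a destination directory under a timestamped name.
# Prints the path of the saved file on success.
# Usage: save_vmcore <dump-file> <destination-directory>
save_vmcore() {
    src="$1"
    dest_dir="$2"
    [ -r "$src" ] || { echo "no readable dump at $src" >&2; return 1; }
    out="$dest_dir/vmcore-$(date +%Y%m%d-%H%M%S)"
    cp "$src" "$out" && sync && echo "$out"
}

# Example (mount point is an assumption):
# save_vmcore /proc/vmcore /media/sdcard
```

makedumpfile could be used in place of cp to shrink the dump, provided it supports the target in your build.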

Best regards,
Jaski

conoroneill · November 25, 2019, 10:12pm

Thank you! This is a really good start. Let me use this, along with the config you provided earlier, to see if I can get the NAND version working. Just getting the new kernel through its early init will be a big success.

jaski.tx · November 26, 2019, 7:07am

You are welcome. I think for the NAND version you might need a special partition, which I have not tried yet. Let me know if you have any news or questions.

Best regards,
Jaski

philippe.tx · December 6, 2019, 3:39pm

Hi @conoroneill

I debugged this issue further and would like to update you on the latest findings.

U-Boot passes the NAND layout to the kernel on boot. So if you want to load a kernel from a kernel, you have to tell the new kernel how the NAND is structured. The best way to do this, in my opinion, is with the mtdparts kernel parameter. This kexec command worked for me on an iMX7s:

kexec --type zImage -l zImage --dtb=imx7s-colibri-eval-v3.dtb --append="ubi.mtd=ubi root=ubi0:rootfs rootfstype=ubifs ubi.fm_autoconvert=1 console=tty1 console=ttymxc0,115200n8 consoleblank=0 video=mxsfb:640x480M-16@60 mtdparts=gpmi-nand:512k(mx7-bcb),1536k(u-boot1),1536k(u-boot2),512k(u-boot-env),-(ubi)"
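
Since the mtdparts string has to match the layout the running kernel was given, one option is to copy it from the running kernel's command line rather than typing it by hand. A small sketch follows; the function name extract_mtdparts is made up, and it assumes the running kernel was itself booted with an mtdparts= argument, which may not hold on every setup.

```shell
# Print the mtdparts=... token from a kernel command line string.
# Returns non-zero if the command line contains no such token.
# Usage: extract_mtdparts "$(cat /proc/cmdline)"
extract_mtdparts() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^mtdparts='
}
```

The result can then be pasted into the --append string in place of the hand-written mtdparts value.
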
There is an issue in NXP’s code that prevents the kernel from booting if it is not loaded from U-Boot. I found out that this is caused by CONFIG_MXC_PXP_V3. If you turn this off, everything works on my side.

I will debug further and hope I can solve the issue that is present in this driver. Otherwise if you don’t need this Pixel Processing Pipeline stuff you could just disable it for your kernel.

I hope this helps. I would also be interested to hear whether you were able to make it work on your side.

Best regards,
Philippe

conoroneill · December 10, 2019, 7:16pm

Hi Philippe

I tried exactly as you recommended. Unfortunately I still see the same behavior. When I run ‘kexec -e’ I get the following:

kexec: Starting new kernel
[ 161.889315] Disabling non-boot CPUs ...
[ 161.964274] CPU1: shutdown
[ 161.994984] Bye!
Uncompressing Linux... done, booting the kernel.

And it hangs forever.

I built a kernel with exactly the configs I mentioned above, set CONFIG_MXC_PXP_V3=N in my defconfig, and used kexec with your cmdline. I also verified that your mtdparts are correct and match what U-Boot is passing to the kernel.
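
One thing that may be worth double-checking is that the option really ended up disabled in the generated .config, since a defconfig entry does not always survive into the final configuration (for example if another option selects it). In the final .config a disabled option normally shows up as "# CONFIG_MXC_PXP_V3 is not set" or is absent altogether. A sketch of such a check, with a made-up function name:

```shell
# Print the effective state of one option in a kernel .config file:
# the "OPTION=value" line if present, otherwise "OPTION is not set".
# Usage: kconfig_state <path-to-.config> <OPTION>
kconfig_state() {
    config_file="$1"
    opt="$2"
    if grep -q "^${opt}=" "$config_file"; then
        grep "^${opt}=" "$config_file"
    else
        echo "${opt} is not set"
    fi
}

# Example:
# kconfig_state build/tmp-glibc/work-shared/colibri-imx7/kernel-build-artifacts/.config CONFIG_MXC_PXP_V3
```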

I’m using kernel 4.1, which is a requirement of the system I’m working with. Did you reproduce on a different kernel version?

jaski.tx · December 10, 2019, 9:48pm

Hi @conoroneill

Philippe tried this on kernel 4.14, which is our current development kernel. Kernel 4.1 is part of BSP 2.7, which is no longer supported.

Best regards,
Jaski
