Intermittent reboot, kernel debug options

luciolis · September 26, 2022, 10:29am

Dear everyone,

I’m building a Yocto-based OS for the i.MX8MP on a custom carrier board. My current issue is that the SoM reboots randomly, but only after specific actions that involve either I2C or RPMsg. I suspect a memory corruption error or a driver failure.
However I have no way to get the kernel stack trace or any kind of log before the SOM reboots. I tried to activate some kernel modules and systemd.conf options, but it is not working. Do you have a way to enable kernel stacktraces (on UART3 or file) in the current BSP?

Many thanks
Adrian

andrejs.tx · September 26, 2022, 11:48am

Hi Adrian,

The default Linux console will print the relevant crash information when it happens. I suspect your issue might be SCU related, but of course it’s impossible to guess since you didn’t provide many details.

Best regards,
Andrejs Cainikovs.

seasoned_geek · September 26, 2022, 12:21pm

I would seriously be looking for a hardware problem on the custom carrier board first. Odds are high there is power being sent to a pin that either isn’t connected to anything or is connected to logic ground when it should be earth ground.

I’m not a hardware guy, but I worked at a shop that had an established board they used. One day the place they had making them sent a new file and requested they make one of the mounting holes a couple thousandths of an inch bigger. They were having a really low success rate drilling that hole. I think they could no longer obtain quality bits of that size so about 40% of the time the bit would snap and ruin the board.

Everyone looked at the new drawings and signed off. The way the board was mounted it was a tiny change or no change at all . . . I forget. Roughly a week after receiving the first new batch of boards they started having random crashes in the post assembly test rack. Sometimes they would run days all the way to a week or more. Other times they would crash within minutes.

They spent about a month tracking this down. Finally my boss found the issue by visually reading every layer of the file. The software the board manufacturer used tried to “help them out” by quietly connecting earth ground to logic ground in a middle layer of the board. There was no way to jumper around or fix it. The “random” crashes were happening any time a device needed to throw power to earth ground “up-hill” from that joining. It then backfed on logic ground and caused the crash.

I’m not a hardware guy and only remember bits and pieces of what was said in disgust around me. Modern board layout software tries to quietly fix mistakes and it doesn’t always do it correctly.

matthias.tx · September 26, 2022, 1:28pm

Hello @luciolis,

We could check your schematic first to rule out a hardware-related problem.
Did you check the supply rails with an oscilloscope to see if there are any big voltage drops before the reboots?
We could set up a 30-minute call to check together whether there is a hardware-related problem.

Best Regards,

Matthias

luciolis · September 27, 2022, 10:26am

Hello,

Thanks for the answers.
Here is my defconfig. I tried to enable some debug options to get kernel debug output, but I still get nothing. Did I do something wrong (or maybe it is indeed a hardware issue)? Also, is it possible to see the watchdog status at boot (why it was rebooted, etc.)? In other U-Boot forks we can see that in the boot message, but I don’t see it there.

We are trying to replicate the issue with an oscilloscope but currently we have a quite stable output. We have very short deviations from 5V to 4.7V when the motors start but nothing less than the 3.3V that the SoM should accept.

defconfig (27.9 KB)

andrejs.tx · September 27, 2022, 10:37am

Hi Adrian,

Can you describe more precisely when exactly the reboot happens? You wrote in your first post that it’s “only after specific actions that involve either i2c or rpmsg”. Which I2C exactly? What about rpmsg: do you have some communication between Linux and the M7 core? If yes, have you made any modifications to the SCU firmware, in particular to resource allocation?

Best regards,
Andrejs Cainikovs.

luciolis · September 27, 2022, 11:01am

Hi,

The intermittent reboots happen about 1 time out of 30 when we initialize external systems using I2C and RPMsg. The I2C bus primarily used is I2C2 (Verdin I2C_2).
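
A failure rate of roughly 1 in 30 also determines how many clean runs are needed before a change in the setup can be trusted to have removed the problem. The following is a minimal sketch of that arithmetic, assuming each initialization is an independent attempt with a constant failure probability (an assumption, not something measured in this thread). `clean_runs_needed` is a hypothetical helper name.

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Return the smallest n for which n consecutive clean runs would occur
    with probability <= 1 - confidence if the failure rate were unchanged."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(no failure in n runs) = (1 - p)^n, solve (1 - p)^n <= 1 - confidence
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Assumed rate: about 1 reboot per 30 initializations, as reported above.
    print(clean_runs_needed(1 / 30))
```

With these assumptions, a handful of clean runs after a change says very little; the count has to be several times the 30 attempts in which one failure is normally expected.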

We used FreeRTOS with rpmsg support from NXP’s example and it works pretty well (only rpmsg, not remoteproc). The device tree and memory addresses used are from another Toradex thread about rpmsg.

andrejs.tx · September 27, 2022, 12:25pm

In this case I am also starting to suspect a hardware issue (you wrote “when we initialize external systems”, which points solidly away from the SoC), and I will step out, as this might be outside my competence (I’m a BSP engineer, not hardware). I will leave this to our support engineers.

Couple of answers to your previous questions:

“I tried to enable some debug options to get a kernel debug output, but still have nothing.”

As I wrote before, you will get crash info on your default console if there’s a kernel failure (aka oops). No need to enable anything extra in the defconfig. If you see absolutely nothing on the console and the system just silently reboots, in most cases this is a hardware issue.

“Did I do something wrong (or maybe indeed it’s a hardware issue)?”

This requires further analysis.

“Also, is it possible to see the watchdog status at boot (why it was rebooted, etc…)? In other U-boot forks we can see that in the boot message, but I don’t see it there.”

Please check CONFIG_DISPLAY_CPUINFO option in U-Boot.
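
As a Linux-side complement to the U-Boot option, the kernel watchdog framework can expose a `bootstatus` value in sysfs. The sketch below decodes it using the `WDIOF_*` flag values from the kernel header `linux/watchdog.h`. Whether this file exists on a given image depends on `CONFIG_WATCHDOG_SYSFS` being enabled, and whether the i.MX watchdog driver in this BSP reports a watchdog reset there is an assumption that would need checking. The path and the helper names are illustrative.

```python
from pathlib import Path
from typing import List, Optional

# Flag values as defined in the kernel UAPI header linux/watchdog.h.
WDIOF_FLAGS = {
    0x0001: "WDIOF_OVERHEAT",
    0x0002: "WDIOF_FANFAULT",
    0x0004: "WDIOF_EXTERN1",
    0x0008: "WDIOF_EXTERN2",
    0x0010: "WDIOF_POWERUNDER",
    0x0020: "WDIOF_CARDRESET",  # the last reboot was caused by the watchdog
    0x0040: "WDIOF_POWEROVER",
}

# Assumed location; only present when CONFIG_WATCHDOG_SYSFS is enabled.
BOOTSTATUS_PATH = "/sys/class/watchdog/watchdog0/bootstatus"


def decode_bootstatus(value: int) -> List[str]:
    """Return the names of all WDIOF_* flags set in a bootstatus value."""
    return [name for bit, name in WDIOF_FLAGS.items() if value & bit]


def read_bootstatus(path: str = BOOTSTATUS_PATH) -> Optional[int]:
    """Read the bootstatus value, or return None if it is unavailable."""
    try:
        return int(Path(path).read_text().strip(), 0)
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    status = read_bootstatus()
    if status is None:
        print("bootstatus not available on this system")
    else:
        print(decode_bootstatus(status) or ["no flags set"])
```

If `WDIOF_CARDRESET` never shows up after a silent reboot, that would be one more hint that the reset comes from outside the SoC (for example a supply dip or an external reset line), although the absence of the flag alone proves nothing.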

Best regards.
Andrejs Cainikovs.

luciolis · September 28, 2022, 9:54am

Thanks! I’ll try that. Do you think that if a memory corruption occurs and the kernel crashes, the SoM will always be rebooted because of the watchdog? Or can it be another reason?

henrique.tx · September 28, 2022, 1:20pm

Hi @luciolis !

We will need to check internally whether someone can answer this question.

Also, I would like to ask more questions about your setup and about what you have already tried.

- Which exact Verdin iMX8M Plus are you using? Please add the version as well.
- Do you see the bad behavior only on one specific module, or do you also see it on other modules?
- Which BSP version are you using?
- Have you tried to reproduce the bad behavior on a Toradex Carrier Board? If you have not, I would like to ask you to do so.

Best regards,

luciolis · September 29, 2022, 11:07am

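- Which exact Verdin iMX8M Plus are you using? Please add the version as well.
  → It’s the 1.1A 4GB WB, but with 1.0E it does the same thing.
- Do you see the bad behavior only on one specific module?
  → It happens with every module.
- Which BSP version are you using?
  → 5.7.0, almost the latest git version from Toradex.
- Have you tried to reproduce the bad behavior on a Toradex Carrier Board?
  → It’s in our plans to reproduce it on the development board, but it’s quite tedious to set up since we use 3 I2C ports, 2 SPI, PCI-Express and native LVDS.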

henrique.tx · September 29, 2022, 11:31am

Hi @luciolis !

Thanks for the answers.

I would like to ask you to try to reproduce it in a step-by-step fashion instead of connecting and testing everything right away. Testing different setups will probably help to understand which interface usage (if any) is the source of the issue.

Also, have you tried to test which I2C or RPMsg might be causing this? Could you elaborate on this?

Since it seems like the issue is related to I2C or RPMsg, maybe you don’t need to assemble the other interfaces/devices (“2 SPI, PCI-Express and native LVDS”).

It will be very helpful if you manage to come up with a very minimal hardware setup (and source code, if needed) that is able to reproduce your issue. This way we can also create a setup exactly like yours to reproduce and better investigate the issue.

Regarding the Toradex thread about rpmsg that you took the device tree and memory addresses from: could you share/point out which one you used?

Best regards,

hfranco.tx · September 29, 2022, 2:32pm

Hi @luciolis,

Please, also check if your device tree is the same as the patch below:

diff --git a/arch/arm64/boot/dts/freescale/imx8mp-verdin-rpmsg.dtsi b/arch/arm64/boot/dts/freescale/imx8mp-verdin-rpmsg.dtsi
new file mode 100644
index 000000000000..d690127bb5ed
--- /dev/null
+++ b/arch/arm64/boot/dts/freescale/imx8mp-verdin-rpmsg.dtsi
@@ -0,0 +1,76 @@
+
+// SPDX-License-Identifier: GPL-2.0-or-later OR MIT
+/*
+ * Copyright 2022 Toradex
+ */
+
+#include <dt-bindings/clock/imx8mp-clock.h>
+
+// Enable RPMSG support 
+
+/ {
+	reserved-memory {
+
+		#address-cells = <2>;
+		#size-cells = <2>;
+		ranges;
+
+        /* use linux config instead */
+		/delete-node/ linux,cma;
+
+        /* Allocate 16MB DDR RAM memory for cortex M -> check the ram drr linker file for details */
+		m7_reserved: m7@0x80000000 {
+			no-map;
+			reg = <0 0x80000000 0 0x1000000>;
+		};
+
+        /* Allocate resource table from Cortex-M7 -> check copyResourceTable inside rsc_table.c for details */
+		rsc_table: rsc_table@550ff000 {
+			reg = <0 0x550ff000 0 0x1000>;
+			no-map;
+		};
+
+        /* VDEV0_VRING_BASE 0 comes from FreeRTOS rsc_table.c */
+		vdev0vring0: vdev0vring0@55000000 {
+			reg = <0 0x55000000 0 0x8000>;
+			no-map;
+		};
+
+        /* VDEV0_VRING_BASE 1 comes from FreeRTOS rsc_table.c */
+		vdev0vring1: vdev0vring1@55008000 {
+			reg = <0 0x55008000 0 0x8000>;
+			no-map;
+		};
+
+        /* Buffers to use with RPMSG */
+		vdevbuffer: vdevbuffer@55400000 {
+            compatible = "shared-dma-pool";
+			reg = <0 0x55400000 0 0x100000>;
+			no-map;
+		};
+
+	};
+
+	imx8mp-cm7 {
+		compatible = "fsl,imx8mp-cm7";
+		rsc-da = <0x55000000>;
+		clocks = <&clk IMX8MP_CLK_M7_DIV>;
+		mbox-names = "tx", "rx", "rxdb";
+		mboxes = <&mu 0 1
+			  &mu 1 1
+			  &mu 3 1>;
+		memory-region = <&vdev0vring0>, <&vdev0vring1>, <&vdevbuffer>, <&rsc_table>, <&m7_reserved>;
+		status = "okay";
+	};
+};
+
+&rpmsg{
+	/*
+	 * 64K for one rpmsg instance:
+	 * --0x55000000~0x5500ffff: pingpong
+	 */
+	vdev-nums = <1>;
+	reg = <0x0 0x55000000 0x0 0x10000>;
+	memory-region = <&vdevbuffer>, <&rsc_table>, <&m7_reserved>;
+	status = "disabled";
+};
diff --git a/arch/arm64/boot/dts/freescale/imx8mp-verdin-wifi-dahlia.dts b/arch/arm64/boot/dts/freescale/imx8mp-verdin-wifi-dahlia.dts
index 4dafa67f2d6d..bc2c462d3ec1 100755
--- a/arch/arm64/boot/dts/freescale/imx8mp-verdin-wifi-dahlia.dts
+++ b/arch/arm64/boot/dts/freescale/imx8mp-verdin-wifi-dahlia.dts
@@ -8,6 +8,7 @@
 #include "imx8mp-verdin.dtsi"
 #include "imx8mp-verdin-wifi.dtsi"
 #include "imx8mp-verdin-dahlia.dtsi"
+#include "imx8mp-verdin-rpmsg.dtsi"
 
 / {
 	model = "Toradex Verdin iMX8M Plus WB on Dahlia Board";
diff --git a/arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi b/arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi
index 3a9f54f98aea..8932005ab8cc 100755
--- a/arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi
+++ b/arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi
@@ -124,20 +124,6 @@
 		off-on-delay = <100000>;
 		vin-supply = <&buck4_reg>;
 	};
-
-	reserved-memory {
-		#address-cells = <2>;
-		#size-cells = <2>;
-		ranges;
-
-		/* use the kernel configuration settings instead */
-		/delete-node/ linux,cma;
-
-		rpmsg_reserved: rpmsg@55800000 {
-			no-map;
-			reg = <0 0x55800000 0 0x800000>;
-		};
-	};
 };
 
 &A53_0 {
@@ -728,7 +714,6 @@
 	ext_osc = <0>;
 	pinctrl-names = "default";
 	pinctrl-0 = <&pinctrl_pcie>;
-	reserved-region = <&rpmsg_reserved>;
 	/* PCIE_1_RESET# (SODIMM 244) */
 	reset-gpio = <&gpio4 19 GPIO_ACTIVE_LOW>;
 };

This patch enables rpmsg for the Verdin iMX8M Plus WiFi on the Dahlia carrier board. If you are not using this board, just add #include "imx8mp-verdin-rpmsg.dtsi" to your board device tree and compile it.

For this device tree, the M7 binary should be loaded to DDR memory, so you need to use the build_ddr_release or build_ddr_debug script and load it with U-Boot to address 0x80000000:

ext4load mmc 2:2 0x44500000 /var/<your_binary>.bin; cp.b 0x44500000 0x80000000 <your_binary_size>; dcache flush; bootaux 0x80000000;
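
The `<your_binary_size>` placeholder in the command above is a byte count, which U-Boot interprets as hexadecimal. A small host-side sketch for producing it, and for checking that the binary fits into the 16 MB `m7_reserved` region from the patch, could look like this. `cp_b_size` is a hypothetical helper and the file name on the command line is whatever your build produced.

```python
import os
import sys

M7_LOAD_ADDR = 0x80000000    # DDR load address used in the U-Boot command
M7_REGION_SIZE = 0x1000000   # 16 MB, size of m7_reserved in the patch


def cp_b_size(path: str, region_size: int = M7_REGION_SIZE) -> str:
    """Return the file size as a hex string suitable for the cp.b count,
    raising ValueError if the file would not fit into the reserved region."""
    size = os.path.getsize(path)
    if size > region_size:
        raise ValueError(
            f"{path} is {size:#x} bytes, larger than the {region_size:#x} byte region"
        )
    return hex(size)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: cp_b_size.py <your_binary>.bin")
    print(f"cp.b 0x44500000 {M7_LOAD_ADDR:#x} {cp_b_size(sys.argv[1])}")
```

U-Boot's load commands normally also store the size of the last loaded file in the `filesize` environment variable, so `${filesize}` may work in place of a hand-typed value; it is worth verifying this on your U-Boot build.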

Let me know if you need any help with that.

Best Regards,
Hiago.

henrique.tx · September 29, 2022, 2:55pm

Hi @luciolis !

I would like to ask some other questions:

Could you please elaborate on the type of motor (DC, Brushless, AC?) you are using?
How is the motor connected to the carrier board?
Are you controlling the motor with the Cortex M7?

Best regards,

luciolis · October 4, 2022, 11:11am

Hi @henrique.tx,

The motors are OEM modules (18V DC) controlled over I2C through an external MCU; we are using i2c_dev from the Linux kernel. They are not driven by the carrier board but share the same power supply.

Best

josep.tx · October 4, 2022, 2:06pm

Hello @luciolis ,
Are you able to do a test with the motor and the carrier board each using its own power supply?
The minimum operating voltage for the Dahlia carrier board is 4.5V (5V-10%), and with 4.7 you are quite close to the limit.
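
The margin can be written out explicitly. This is only a sketch of the arithmetic, using the 5 V nominal and 10% tolerance quoted above for the Dahlia board and the 4.7 V minimum reported from the oscilloscope; the limit for a custom carrier board may be different, and a short dip that the scope did not capture would not show up in this number. `supply_margin_mv` is a hypothetical helper.

```python
def supply_margin_mv(measured_min_mv: int,
                     nominal_mv: int = 5000,
                     tolerance_pct: int = 10) -> int:
    """Return the distance in millivolts between the lowest measured supply
    voltage and the minimum allowed voltage (negative means out of spec)."""
    min_allowed_mv = nominal_mv * (100 - tolerance_pct) // 100
    return measured_min_mv - min_allowed_mv


if __name__ == "__main__":
    # 4.7 V is the lowest value reported when the motors start.
    print(supply_margin_mv(4700), "mV of margin")
```

Integer millivolts are used to keep the comparison exact and free of floating-point rounding.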

Best regards,
Josep

luciolis · October 5, 2022, 8:25am

I tried to put the RSC table at 0x550ff000, but then it does not load in U-Boot.

With the RSC table at 0x55000000 it works, but the kernel does not like it (an overflow message at startup). Without the RSC table in the FDT it also works.
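
One thing that can be checked offline is whether the reserved-memory regions collide. The sketch below uses the base addresses and sizes from the device tree patch earlier in this thread and tests every pair for overlap. It shows that a resource table moved to 0x55000000 lands inside `vdev0vring0`; whether that overlap is what the kernel complains about is an assumption, not something confirmed here. `find_overlaps` is a hypothetical helper.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# (base, size) pairs copied from the reserved-memory node in the patch.
REGIONS: Dict[str, Tuple[int, int]] = {
    "m7_reserved": (0x80000000, 0x1000000),
    "rsc_table": (0x550FF000, 0x1000),
    "vdev0vring0": (0x55000000, 0x8000),
    "vdev0vring1": (0x55008000, 0x8000),
    "vdevbuffer": (0x55400000, 0x100000),
}


def find_overlaps(regions: Dict[str, Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Return every pair of region names whose address ranges intersect."""
    overlaps = []
    for (a, (a_base, a_size)), (b, (b_base, b_size)) in combinations(regions.items(), 2):
        # Half-open ranges [base, base + size) intersect if each starts
        # before the other one ends.
        if a_base < b_base + b_size and b_base < a_base + a_size:
            overlaps.append((a, b))
    return overlaps


if __name__ == "__main__":
    print("patch layout:", find_overlaps(REGIONS))
    # Same layout, but with the resource table moved to 0x55000000.
    moved = dict(REGIONS, rsc_table=(0x55000000, 0x1000))
    print("rsc_table at 0x55000000:", find_overlaps(moved))
```

The same check can be run against the regions in your own device tree and against the addresses in the M7 linker file and `rsc_table.c`, which have to agree with each other.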

rafael.tx · October 5, 2022, 9:12am

Hello @luciolis,
I think we need to be more systematic here in order to isolate the source of the problem. As my colleagues already mentioned, this could very well be a hardware issue. My guess is that if this were a failure that the Linux kernel could detect (e.g. memory corruption on the Cortex-A side, an unrecoverable processor exception, a software bug), you would be seeing logs of it. Because you don’t see any log when the system reboots, I think the focus of the investigation should be “external” to Linux, at least for now.

Here are some examples of what should be looked into:

- Analyze the power pins to the module with an oscilloscope and compare the behavior when the module resets and when it doesn’t.
- Look at the external module reset lines as well.
- Remove the motors but leave the rest of the system running and give the commands to start them. Does it still happen then?
- What are you running on the M7 side? Is it the motor control? If you remove the part of the code that processes the rpmsg, does it still happen?
- What about the code that handles the I2C?

Each one of these steps should be done independently so that we start ruling out possible causes for this problem.
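
When a reboot leaves no log at all, one way to keep track of which step was running is to write a marker to persistent storage and force it to disk before each step starts. This is a minimal sketch of that idea, not something taken from the BSP: the `Breadcrumbs` class, the `last_step` helper, the log path and the step names are all hypothetical, and how reliably `fsync` survives a sudden reset depends on the storage device and filesystem.

```python
import os
import time
from typing import Optional


class Breadcrumbs:
    """Append one line per step and fsync it before the step is executed."""

    def __init__(self, path: str) -> None:
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def mark(self, step: str) -> None:
        os.write(self._fd, f"{time.time():.3f} {step}\n".encode("utf-8"))
        os.fsync(self._fd)  # ask the kernel to push the line to storage now

    def close(self) -> None:
        os.close(self._fd)


def last_step(path: str) -> Optional[str]:
    """Return the last step recorded before the previous shutdown, if any."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return None
    if not lines:
        return None
    parts = lines[-1].split(" ", 1)
    return parts[1] if len(parts) == 2 else parts[0]


if __name__ == "__main__":
    log_path = "/var/tmp/init-steps.log"  # hypothetical location
    print("last step before this boot:", last_step(log_path))
    crumbs = Breadcrumbs(log_path)
    crumbs.mark("i2c2: init external MCU")    # hypothetical step name
    crumbs.mark("rpmsg: send start command")  # hypothetical step name
    crumbs.close()
```

After an unexpected reboot, the last recorded line narrows down which of the independently tested steps was in progress.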

Best regards,
Rafael

henrique.tx · October 13, 2022, 12:57pm

Hi @luciolis !

Do you have news on your issue? Did you have time to look into @rafael.tx’s questions?

Best regards,

luciolis · October 13, 2022, 4:27pm

Hi! Thanks for the tip and the reminder. We are currently trying to reproduce the problem under better-controlled conditions. However, we are also busy with other parts of the development, so I expect to have news in about 1-2 months.