I’m building a yocto-type OS for the imx8mp on a custom carrier board. My current issue is that the SOM reboots randomly but only after specific actions that involve either i2c or rpmsg. I suspect a memory corruption error or a driver failure.
However, I have no way to get the kernel stack trace or any kind of log before the SOM reboots. I tried to enable some kernel modules and systemd.conf options, but it is not working. Do you have a way to enable kernel stack traces (on UART3 or to a file) in the current BSP?
The default Linux console will print the relevant crash information when it happens. I suspect your issue might be SCU related, but of course it’s impossible to guess since you didn’t provide many details.
I would seriously be looking for a hardware problem on the custom carrier board first. Odds are high there is power being sent to a pin that either isn’t connected to anything or is connected to logic ground when it should be earth ground.
I’m not a hardware guy, but I worked at a shop that had an established board they used. One day the place they had making them sent a new file and requested they make one of the mounting holes a couple thousandths of an inch bigger. They were having a really low success rate drilling that hole. I think they could no longer obtain quality bits of that size, so about 40% of the time the bit would snap and ruin the board.
Everyone looked at the new drawings and signed off. The way the board was mounted it was a tiny change or no change at all . . . I forget. Roughly a week after receiving the first new batch of boards they started having random crashes in the post assembly test rack. Sometimes they would run days all the way to a week or more. Other times they would crash within minutes.
They spent about a month tracking this down. Finally my boss found the issue by visually reading every layer of the file. The software the board manufacturer used tried to “help them out” by quietly connecting earth ground to logic ground in a middle layer of the board. There was no way to jumper around or fix it. The “random” crashes were happening any time a device needed to throw power to earth ground “up-hill” from that joining. It then backfed on logic ground and caused the crash.
I’m not a hardware guy and only remember bits and pieces of what was said in disgust around me. Modern board layout software tries to quietly fix mistakes and it doesn’t always do it correctly.
We might check your schematic first to see whether this is a hardware-related problem.
Did you check the supply rails with an oscilloscope to see if there are any big voltage drops before the reboots?
We could set up a 30-minute call to check together whether there is any hardware-related problem.
Thanks for the answers.
Here is my defconfig. I tried to enable some debug options to get kernel debug output, but I still have nothing. Did I do something wrong (or maybe it is indeed a hardware issue)? Also, is it possible to see the watchdog status at boot (why it was rebooted, etc…)? In other U-Boot forks we can see that in the boot message, but I don’t see it there.
We are trying to replicate the issue with an oscilloscope, but currently we have quite a stable output. We see very short dips from 5V to 4.7V when the motors start, but nothing below the 3.3V that the SoM should accept.
Can you describe more precisely when exactly the reboot happens? You wrote in your first post that it’s “only after specific actions that involve either i2c or rpmsg”. Which I2C exactly? What about rpmsg: do you have some communication between Linux and the M7 core? If yes, have you made any modifications to the SCU firmware, in particular to resource allocation?
In this case I am also starting to suspect this might be a hardware issue (you wrote “when we initialize external systems”, which points solidly away from the SoC), so I will step out, as this might be outside my competence (I’m a BSP engineer, not hardware). I will leave this to our support engineers.
Couple of answers to your previous questions:
I tried to enable some debug options to get a kernel debug output, but still have nothing.
As I wrote before, you will get crash info on your default console if there’s a kernel failure (aka oops). There is no need to enable anything extra in the defconfig. If you see absolutely nothing in the console and the system just silently reboots, in most cases this is a hardware issue.
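If the console is being captured to a file on a host PC, a small script can check the capture for crash output after each unexpected reboot. This is only a rough sketch: the pattern list is my own selection of strings the kernel commonly prints on a crash and is not exhaustive, and find_crash_lines is a made-up helper name.

```python
import re

# Strings the kernel commonly prints when it crashes. This list is an
# illustrative assumption, not an exhaustive catalogue.
CRASH_PATTERNS = [
    r"Kernel panic - not syncing",
    r"Internal error: Oops",
    r"Unable to handle kernel",
    r"Call [Tt]race:",
    r"\bBUG:",
]
CRASH_RE = re.compile("|".join(CRASH_PATTERNS))


def find_crash_lines(log_text):
    """Return (line_number, line) pairs that look like kernel crash output."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if CRASH_RE.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical usage with a capture saved as console.log:
#     with open("console.log", errors="replace") as handle:
#         for number, line in find_crash_lines(handle.read()):
#             print(f"{number}: {line}")
```

An empty result right before a reboot banner is consistent with the silent-reboot case described above, which points away from a kernel oops.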
Did I do something wrong (or maybe indeed it’s an hardware issue)?
This requires further analysis.
Also, is it possible to see the watchdog status at boot (why it was rebooted, etc…)? In other U-boot forks we can see that in the boot message, but I don’t see it there.
Please check CONFIG_DISPLAY_CPUINFO option in U-Boot.
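As a Linux-side complement to the U-Boot option, the watchdog sysfs interface can also report whether the last reset came from the watchdog. The sketch below assumes the kernel is built with CONFIG_WATCHDOG_SYSFS, that the device shows up as watchdog0, and that the watchdog driver in this BSP actually fills in bootstatus; none of that is confirmed in this thread. The flag values are the WDIOF_* constants from the kernel's linux/watchdog.h header, and the function names are mine.

```python
from pathlib import Path

# WDIOF_* status bits from the kernel UAPI header linux/watchdog.h.
WDIOF_FLAGS = {
    0x0001: "OVERHEAT",
    0x0002: "FANFAULT",
    0x0004: "EXTERN1",
    0x0008: "EXTERN2",
    0x0010: "POWERUNDER",
    0x0020: "CARDRESET",  # the last reboot was caused by the watchdog
    0x0040: "POWEROVER",
}


def decode_bootstatus(value):
    """Return the names of the status bits set in a bootstatus value."""
    return [name for bit, name in sorted(WDIOF_FLAGS.items()) if value & bit]


def read_bootstatus(path="/sys/class/watchdog/watchdog0/bootstatus"):
    """Return bootstatus as an int, or None if the sysfs node is absent.

    The default path is an assumption about how the device is enumerated.
    """
    node = Path(path)
    if not node.exists():
        return None
    return int(node.read_text().strip(), 0)
```

On the target, calling decode_bootstatus(read_bootstatus()) after a reboot would show CARDRESET if the watchdog fired. A None result only means the sysfs node is missing, not that the watchdog was uninvolved.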
I would like to ask you to try to reproduce it in a step-by-step fashion instead of connecting and testing everything right away. Testing different setups will probably help to understand which interface usage (if any) is the source of the issue.
Also, have you tried to test which I2C or RPMsg might be causing this? Could you elaborate on this?
Since it seems like the issue is related to I2C or RPMsg, maybe you don’t need to assemble the other interfaces/devices (“2 SPI, PCI-Express and native LVDS”).
It will be very helpful if you manage to come up with a very minimal hardware setup (and source code, if needed) that is able to reproduce your issue. This way we can also create a setup exactly like yours to reproduce and better investigate the issue.
Hello @luciolis ,
Are you able to do a test with the motor and the carrier board each running from its own power supply?
The minimum operating voltage for the Dahlia carrier board is 4.5V (5V-10%), and at 4.7V you are quite close to the limit.
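To put numbers on that margin, here is a rough sketch that computes the headroom and scans a list of voltage samples for dips below the 4.5V minimum. The 5V nominal and 10% tolerance come from the figures above; the sample list format and the function names are assumptions for illustration, and a real scope export would need its own parsing.

```python
NOMINAL_V = 5.0
TOLERANCE = 0.10
MIN_OPERATING_V = NOMINAL_V * (1 - TOLERANCE)  # 4.5 V


def margin(measured_v, minimum_v=MIN_OPERATING_V):
    """Headroom in volts between a measured level and the minimum."""
    return measured_v - minimum_v


def find_dips(samples, threshold_v=MIN_OPERATING_V):
    """Return (start_index, end_index, lowest_v) for each run below the threshold."""
    dips = []
    start = None
    lowest = None
    for index, volts in enumerate(samples):
        if volts < threshold_v:
            if start is None:
                start, lowest = index, volts
            else:
                lowest = min(lowest, volts)
        elif start is not None:
            # The run of low samples ended on the previous index.
            dips.append((start, index - 1, lowest))
            start = None
    if start is not None:
        # The capture ended while still below the threshold.
        dips.append((start, len(samples) - 1, lowest))
    return dips
```

With the reported 4.7V, margin(4.7) is about 0.2V. A dip that is shorter than the scope's sample interval would not appear in the samples at all, so an empty result does not prove the rail never went below the minimum.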
I think we need to be more systematic here in order to isolate the source of the problem. As my colleagues already mentioned, this could very well be a hardware issue. My guess is that if this were a failure that the Linux kernel could detect, you would be seeing logs of it (e.g. memory corruption on the Cortex-A side, an unrecoverable processor exception, a software bug). Because you don’t see any log when the system reboots, I think the focus of the investigation should be “external” to Linux, at least for now.
Here are some examples of what should be looked into:
Analyze the power pins to the module with an oscilloscope and compare the behavior when the module resets and when it doesn’t;
Look at the external module reset lines as well;
Remove the motors but leave the rest of the system running and give the commands to start them. Does it still happen then?
What are you running on the M7 side? Is it the motor control? If you remove the part of the code that processes the rpmsg, does it still happen?
What about the code that handles the i2c?
Each one of these steps should be done independently so that we start ruling out possible causes for this problem.
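Since nothing reaches the console before the reset, one way to make these tests more informative is to write a "breadcrumb" to storage just before each I2C or rpmsg action, so that after a silent reboot the last line shows which action was in flight. This is only a sketch: the log path and the step names are placeholders, and the two helper functions are mine.

```python
import os
import time


def log_step(path, message):
    """Append one timestamped line and force it to storage before returning."""
    line = f"{time.time():.3f} {message}\n"
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line)
        handle.flush()
        # fsync so the line survives a reset that happens right afterwards.
        os.fsync(handle.fileno())


def last_step(path):
    """Return the message of the last breadcrumb, or None if there is none."""
    try:
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
    except FileNotFoundError:
        return None
    if not lines:
        return None
    # Drop the timestamp and keep the message part.
    return lines[-1].partition(" ")[2]


# Hypothetical usage around the suspect operations:
#     log_step("/var/log/steps.log", "before i2c read")
#     ... perform the I2C transaction ...
#     log_step("/var/log/steps.log", "before rpmsg send")
```

After a reboot, last_step on the same file tells you which step was logged last. The fsync on every line is slow, which is acceptable here because the goal is to not lose the final entry.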
Hi! Thanks for the tip and reminder. We are currently trying to reproduce the problem with better control. However, we are also busy with other parts of the development, so I should have some news in about 1-2 months.