Rogue power-off pulse at bootup

RPHILLE · March 16, 2020, 5:08pm

We have a TK1-based system with V1.2A TK1 SOMs that boots to a non-Ubuntu GUI. Specifically, it is a Yocto build from your template, using Zeus, with the Nvidia 3.10 kernel plus patches, including the GPIO7 ‘fix’. On a number of systems with this build, up to and including build 3.1.0RCf, we have experienced system power shutdowns just before the GUI becomes active: the display shows the usual bootup log scroll right up to the login prompt, blanks, and then the system shuts down. Please note this happens very infrequently across a population of maybe 6 systems, and has never been experienced when booting to the Ubuntu desktop under any preceding BSP generation.

Since the SOM is the sole driver of the POWER_ENABLE_MOCI signal which commands the host power system to shut off, it appeared to be the prime suspect. Our host board (SOM carrier) implements a 120msec masking filter on the POWER_ENABLE_MOCI signal to tolerate the 60msec software reset pulse per Question # 24926 in this forum. This has worked very well since being implemented and we’ve never experienced shutdowns on software or pushbutton resets of the SOM.
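For readers unfamiliar with this kind of masking, here is a minimal sketch of the idea in Python. It assumes the filter simply ignores any low period shorter than the mask time and passes a shutdown once the signal has stayed low for the full mask time. The function names, the 1 ms sampling step and the test waveforms are hypothetical; the real filter is carrier-board hardware.

```python
def mask_filter(samples, mask_ms=120, step_ms=1):
    """Return True if the signal stays low for at least mask_ms.

    samples: iterable of 0/1 levels of POWER_ENABLE_MOCI, one per step_ms.
    Assumption: any return to high restarts the mask timer.
    """
    low_ms = 0
    for level in samples:
        if level == 0:
            low_ms += step_ms
            if low_ms >= mask_ms:
                return True   # low long enough: shutdown passes the filter
        else:
            low_ms = 0        # signal recovered: restart the mask timer
    return False


def pulse(width_ms, lead_ms=10, tail_ms=10):
    """Build a hypothetical 1 ms/sample waveform with one low pulse."""
    return [1] * lead_ms + [0] * width_ms + [1] * tail_ms


print(mask_filter(pulse(60)))    # False: 60 ms reset pulse is masked
print(mask_filter(pulse(1000)))  # True: a sustained low passes the filter
```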

Taking one system which experienced self-shutdown several times in a short period, a slight modification was made to gate off the POWER_ENABLE_MOCI signal from the host board power system. The modification was not directly applied to the signal itself in order to preserve the electrical conditions present on POWER_ENABLE_MOCI. We then set up some automation that would power cycle the unit and trigger on any falling transitions of POWER_ENABLE_MOCI. Once triggered, the power cycle timer was halted so the system is left powered after the POWER_ENABLE_MOCI falling edge took place. A scope was also connected and triggered by the falling edge of POWER_ENABLE_MOCI. After days of power cycles, we were fortunate enough to capture a single shutdown event, the scope trace of which I have attached here.
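The trigger logic described above amounts to watching for falling edges and measuring how long the signal stays low. A rough Python sketch of that idea follows, assuming the signal is available as a list of sampled 0/1 levels. The function name, the sampling step and the sample data are hypothetical; the actual jig used a hardware trigger and a scope.

```python
def find_low_pulses(samples, step_ms=1.0):
    """Return a list of (start_ms, width_ms) for each low pulse.

    samples: sequence of 0/1 levels, one per step_ms, assumed to start high.
    width_ms is None if the signal is still low when the capture ends.
    """
    pulses = []
    start = None
    prev = 1
    for i, level in enumerate(samples):
        if prev == 1 and level == 0:
            start = i                       # falling edge: pulse begins
        elif prev == 0 and level == 1 and start is not None:
            pulses.append((start * step_ms, (i - start) * step_ms))
            start = None                    # rising edge: pulse ends
        prev = level
    if start is not None:
        pulses.append((start * step_ms, None))  # never returned high
    return pulses


# Hypothetical capture sampled every 10 ms
capture = [1] * 5 + [0] * 3 + [1] * 4 + [0] * 2
print(find_low_pulses(capture, step_ms=10.0))
# [(50.0, 30.0), (120.0, None)]
```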

The lower trace is POWER_ENABLE_MOCI from the SOM. Two questions are raised here. If this was a software reset, why is the pulse longer than the 60msec described in Q/A 24926? If this was a deliberate shutdown request, why does the pulse return high after 250msec? The system remained live and responsive despite not having executed a full power-cycle restart. On actual commanded shutdowns, POWER_ENABLE_MOCI remains low and the system is unresponsive if the power remains active.
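To make the distinction between the three cases explicit, here is a small Python sketch. It assumes a software reset shows up as a roughly 60 ms low pulse and a commanded shutdown as a low that never returns high. The function name, the labels and the 10 ms tolerance are hypothetical choices for illustration, not Toradex specifications.

```python
def classify_pulse(width_ms, returned_high, reset_ms=60.0, tol_ms=10.0):
    """Classify a low pulse seen on POWER_ENABLE_MOCI.

    width_ms: measured low time, or None if the signal stayed low.
    returned_high: True if the signal went high again.
    reset_ms / tol_ms: nominal reset pulse width and an assumed tolerance.
    """
    if not returned_high:
        return "commanded shutdown"
    if abs(width_ms - reset_ms) <= tol_ms:
        return "software reset"
    return "unexplained"


print(classify_pulse(60, True))     # software reset
print(classify_pulse(None, False))  # commanded shutdown
print(classify_pulse(250, True))    # unexplained
```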

The client’s GUI did come up on display and was responsive to touch, indicating the USB subsystem was initialized and active, and the client’s application had started up correctly and was running stably (we let the system sit powered for a couple days and it remained responsive). What are the possible mechanisms in the circuitry of the TK1 SOM that could produce a rogue 250msec low pulse? The description provided in Q/A 24926 by Toradex seems to imply a hardware determined maximum pulse width of 60msec.

For your interest, the upper trace is the control signal from the power cycle timer; this signal is not used in place of POWER_ENABLE_MOCI but actuates a button input on the power controller. It asserts low for 7 seconds to trip a 6-second power-down timer. It runs on a 50-second cycle, long enough to let the system boot up to the client GUI. Once off, it powers the system up 42 seconds later to ensure local rails discharge substantially and any DRAM data gets scrubbed. In the trace, the short low transition coincident with the falling edge of the lower trace (POWER_ENABLE_MOCI) simply reflects the timer gating function once a trigger occurs.

RPHILLE · March 19, 2020, 2:22pm

Our client’s software developers checked and are finding the GPIO2 (and all others) of the PMIC being defined as an input on bootup. This disagrees with the statement by Toradex Support in Q/A 24926:
“The POWER_ENABLE_MOCI is enabled together with the on-module 3.3V rail. This rail is switched by the GPIO2 of the PMIC. In the power up sequence of the PMIC, the GPIO is fused to be in slot 7 while the slot interval is set to 4ms. This means the POWER_ENABLE_MOCI is released 28ms after the power up sequence has started.”
Please explain the operation of the GPIO2 and maybe provide a schematic excerpt showing its connection to the POWER_ENABLE_MOCI signal.

jaski.tx · March 23, 2020, 4:16pm

Hi @RPHILLE

Sorry for the delayed answer.

Could you provide the exact software version of your module?
What changes have you done to the kernel and devicetree?

Please provide more information: how can we reproduce this issue on our side?

Best regards,
Jaski

RPHILLE · March 24, 2020, 5:04am

As outlined, we have a single system set up to run a 50 second power-up power-down cycle, hence a full power-cycle every 100 seconds. Frustratingly, so far we have only captured the single trace I posted here; it appears to be an extremely elusive event and we have no sense of what external stimuli may invoke the issue. The best I can suggest is that you build a load exactly as we have specified and perform a series of power-up cycles.
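To give a feel for the test throughput, here is a small arithmetic sketch in Python based only on the 100-second full cycle stated above. The function names are hypothetical, and the 10,000-cycle figure is just an example value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def cycles_per_day(period_s):
    """Whole power cycles completed in 24 hours at the given period."""
    return SECONDS_PER_DAY // period_s


def days_for_cycles(n_cycles, period_s):
    """Days of continuous running needed to complete n_cycles."""
    return n_cycles * period_s / SECONDS_PER_DAY


print(cycles_per_day(100))                     # 864
print(round(days_for_cycles(10_000, 100), 1))  # 11.6
```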

A bigger question here remains; in Q/A 24926, we were informed that PMIC GPIO2 switches the on-module 3.3V rail as well as the POWER_ENABLE_MOCI signal using a fused timing interval (28msec). It was confirmed that the observed 60msec pulse following reset added up correctly with the stated timing sources (28msec + 32msec) which were purely hardware originating within the PMIC. How can the kernel version and devicetree affect what had been stated to be entirely hardware determined? From our perspective, we should only ever need to deal with a 60msec (plus tolerance) pulse, irrespective of kernel version and possible modifications.
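The arithmetic behind those figures, written out as a short Python sketch. The slot number, slot interval and the 32 ms term are taken from the statements quoted in this thread; the origin of the 32 ms term is not described here, and the variable names are hypothetical.

```python
SLOT = 7               # power-up sequence slot GPIO2 is fused to
SLOT_INTERVAL_MS = 4   # slot interval stated in Q/A 24926
SECOND_DELAY_MS = 32   # second hardware delay quoted in this thread

release_ms = SLOT * SLOT_INTERVAL_MS            # delay before release
expected_pulse_ms = release_ms + SECOND_DELAY_MS
observed_ms = 250                               # from the scope capture

print(release_ms)                                # 28
print(expected_pulse_ms)                         # 60
print(round(observed_ms / expected_pulse_ms, 2)) # 4.17
```

So the captured pulse is roughly four times longer than the hardware timing alone would explain.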

I am preemptively widening the pulse mask filter to around 500msec (it is currently around 100-130msec) based on the observed pulse width of 250msec. I figure a 2X margin would be a good guess, but we really need to know from you folks how software can touch that signal and what action could possibly drive it low for 1/4sec. If POWER_ENABLE_MOCI really reflects the state of the on-module 3.3V rail, what condition other than reset could call for such a long shutdown of a local rail?
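As a quick sanity check on those margins, here is a sketch in Python that expresses margin as mask time divided by pulse width, so anything below 1.0 means the pulse gets through the filter. The function name is hypothetical; the numbers are the ones quoted in this thread.

```python
def mask_margin(mask_ms, pulse_ms):
    """Ratio of mask time to pulse width; below 1.0 the pulse is not masked."""
    return mask_ms / pulse_ms


# Current filter (about 100-130 ms) against the 60 ms reset pulse
print(round(mask_margin(100, 60), 2))   # 1.67
print(round(mask_margin(130, 60), 2))   # 2.17

# Nominal 120 ms filter against the observed 250 ms pulse
print(round(mask_margin(120, 250), 2))  # 0.48

# Proposed 500 ms filter against the observed 250 ms pulse
print(mask_margin(500, 250))            # 2.0
```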

peter.tx · March 24, 2020, 8:10am

Hi @RPHILLE,

The PMIC GPIO2 goes directly to a FET power switch which enables the on-module 3.3V rail. The 3.3V rail is directly used as the POWER_ENABLE_MOCI signal. The GPIO2 is fused to be in the power-up (and power-down) sequence. If POWER_ENABLE_MOCI goes low for 250ms, this would also mean the 3.3V rail is going down. Do you know whether other power rails on the module and/or the Reset also went down? Please find here a description of the test points on the module.

Unfortunately, I currently have no clue why POWER_ENABLE_MOCI goes down for 250ms. Maybe it is a kind of power failure. Even though the GPIO2 is fused to be in the power-up and power-down sequence, it is still possible to control the GPIO from software at runtime. However, disabling the 3.3V rail at runtime would be fatal, since a lot of essential on-module peripherals (including the eMMC) are powered from this rail.

I am currently working from home. Therefore, I do not have any hardware or measurement tools around. This means I am not able to run my own tests.

RPHILLE · March 24, 2020, 2:30pm

Peter, thank you for the further details. It is indeed troubling that the local 3.3V rail is shutting off for 1/4 sec. Would this unconditionally trigger a reset of the CPU or just shut down a few peripheral devices that may not happen to be in use, hence occur transparently to the running code? It had appeared that the system remained running after the pulse as my setup stops the power cycling after triggering, but is not sophisticated enough to match scope captures with captured bootup logs. Since the pulse capture did not occur while I was watching, I don’t know if the board just kept running after the pulse or executed a reset cycle. The final two bootup cycles in the log capture don’t appear different or reveal anything odd.

Thanks also for revealing that GPIO2 can be controlled by runtime software. As you point out, it is fatal to shut it off while the module is operating, which begs the question, why isn’t it write protected? Even if this proves ultimately not to be the cause of the pulse, exposing such a critical control point to the runtime environment is, in my opinion, a serious reliability hazard. Is there any way that the GPIO can be masked off from access through a hardware interlock method? This is now the prime suspect, and although it could be a purely random event, the close proximity to 250msec gives it the flavor of a deliberately timed code action.

peter.tx · March 25, 2020, 3:59pm

Hi @RPHILLE,

The 3.3V is neither directly used by the CPU nor monitored. Therefore, I do not think a missing 3.3V would directly trigger a reset. The 3.3V is only used by the on-module peripherals as well as for the I/O voltage of the 3.3V GPIOs.

It would be extremely helpful to know whether there are other voltage rails which are going down during these 250ms. This might give us a clue about what is causing the failure of the 3.3V rail.

The GPIO2, as well as any other PMIC registers, can be controlled over the I2C interface. This means the kernel can access them. Theoretically, the kernel could accidentally also change the state of any other voltage rail of the PMIC. As long as the kernel makes sure the application software does not have access to the PMIC I2C bus, the GPIO2 or any other PMIC registers, I do not see a reliability issue here. It is a common approach in ARM-based systems.

BTW, are you using the POWER_ENABLE_MOCI signal for enabling also the main input voltage rail of the module? We recommend using a GPIO in combination with the kill input of the power button IC for shutting down the system. More information, you will find here: High performance, low power Embedded Computing Systems | Toradex Developer Center

RPHILLE · March 25, 2020, 5:22pm

Thanks for confirming that the loss of the local 3.3V rail would not invoke a reset, thus confirming that an errant (or intentional?) software process could toggle the control without too much effect on program execution. In addition to passing through a reset pulse filter, the signal also directly turns off two peripheral rails on our board; however, their momentary loss does not invoke a system reset. As they are lightly loaded with significant capacitance, the drop during the 1/4 sec disable is not enough to disrupt devices powered from those rails. The main 3.3V rail applied to the SOM, however, remains untouched and thus does not trigger our main board reset.

My concern about the reliability of the design approach taken comes from a long background of designing high-availability, high-reliability systems. As you stated earlier, de-asserting POWER_ENABLE_MOCI in a non-shutdown context would be a fatal event; my view is that something so fundamentally critical to system integrity should be a) implemented exclusively in hardware if possible, and b) not be made available to the runtime environment if a) is not possible. I liken the signal to a “maintenance access only” lever which should be kept locked up in a restricted access closet.

Our power-ON mechanism is independent of the POWER_ENABLE_MOCI, allowing the signal to be used to shut power OFF without jamming the power-ON. This has served well for the last couple of years we’ve had this design in use.

jaski.tx · March 31, 2020, 8:33pm

Hi @RPHILLE

“As you stated earlier, de-asserting POWER_ENABLE_MOCI in a non-shutdown context would be a fatal event; my view is that something so fundamentally critical to system integrity should be a) implemented exclusively in hardware if possible, and b) not be made available to the runtime environment if a) is not possible. I liken the signal to a “maintenance access only” lever which should be kept locked up in a restricted access closet.”

I can understand your concern, but usually you just need to configure the kernel correctly so that the user application cannot reset this pin.

Best regards,
Jaski

RPHILLE · April 1, 2020, 2:23am

Can you at least tell us if any of the Toradex provided boot and/or Kernel code requires access to any of the PMIC GPIOs or registers, or is it all hardware auto-pilot and the GPIO runtime accessibility is simply the side-effect of having the PMIC I2C tied to the SoC for potential maintenance/update? This issue did not present itself until we stopped using the stock Ubuntu load.

jaski.tx · April 3, 2020, 5:54am

Hi @RPHILLE

I created an internal Ticket for this and I will come back to you once I know more.

Thanks for your patience and best regards,
Jaski

RPHILLE · April 15, 2020, 4:05pm

Any progress on this? Briefly looking at the AS3722 datasheet, doesn’t look like there is a write-protect feature on the registers, and I suspect the TK1 I2C port likely doesn’t have any sort of I2C port write cycle blocking feature. A crude protection could be implemented through the I2C-SPI selector if regular code does not access the PMIC, but requires resetting the PMIC.

jaski.tx · April 16, 2020, 2:07pm

No, there is no progress. This issue is planned to be processed in one month.

Thanks and best regards,
Jaski

RPHILLE · May 13, 2020, 3:31pm

Hi, We are coming up on 1 month since you created an internal ticket for this issue. Please inform us of your intended action schedule for this problem. To refresh, we need an answer to the questions I posted April 1. Our client’s developers are justifiably reluctant to make any code patches in the kernel init settings until they are clear as to what changes should be made and understand how the GPIOs in question are used by the Kernel and/or any other Toradex supplied code.

We only have one system where this problem is reliably repeatable, and have isolated it in order to conduct deep analysis, possibly in coordination with Toradex support. This is not the system which yielded the scope trace posted here March 16. That particular unit has been in continuous power-cycle testing since that post date with NO further occurrences. The isolated system continues to exhibit the shutdown and has exhibited instances where shutdown occurred over several consecutive power-ups. Once up, no systems, including the isolated one, have ever spontaneously shut down, nor shut down following a soft-commanded or pushbutton reset. This issue appears to be related to power-up bootup well before activation of any of the high-power elements within the system.

Despite the frustrating rarity of shutdown occurrences, our client’s verification team anecdotally hold that most if not all systems in the total test population have exhibited a bootup shutdown at least once, too infrequent to log and initially attributed to user error or some other suspected problem. Now that they understand that this is a “thing” to watch for, some incidents have been logged on a few other systems, however, typically single event instances.

alex.tx · June 2, 2020, 2:58pm

Our engineer created a setup with an Apalis TK1 1.2A and the Ixora 1.1A carrier board in a climate chamber. The climate chamber control was programmed to power on the TK1, wait until it had booted and then power it off. The LeCroy scope was connected to POWER_ENABLE_MOCI, RESET_MOCI#, SYS_RESET#1V8 (on testpoint TP39) and +V1.8 (on testpoint TP25). If POWER_ENABLE_MOCI went to 0V outside of a power-down, the scope would trigger and save a screenshot.
Over the whole 5 days of testing, the scope never captured the reported behavior, and the climate chamber control never failed to boot the module.

It looks like the described issue is related to the custom carrier board hardware.

RPHILLE · June 6, 2020, 9:46pm

Alex

Thanks for confirming what we had seen as well. No systems were seen to exhibit the issue with your stock BSP Ubuntu Linux. The systems that were causing trouble are running a Yocto build of the latest head-less Linux release for the TK1. We have had great difficulty reproducing the problem and have only a single system out of dozens that consistently exhibits this. I am not surprised you did not see the problem: I have a client system in which the problem initially occurred several power-ups in a row, and I was finally able to catch the scope trace I posted here back in March. I set up a similar power-cycle jig and had that system run continuous power cycles (10’s of thousands) until recently without another incident.

The one system that is consistently exhibiting this is at our client site in another country and we’ve arranged to perform an investigation, following a carefully crafted plan to ensure we don’t make the problem disappear through multiple simultaneous changes. If it turns out the problem follows the TK1 SOM, we will likely send it to Toradex for analysis.

What I needed from Toradex however is an answer to my question of April 01, namely, what Toradex code is dependent upon access to the GPIO2 of the PMIC? We want our developers to delist the port from the IO Tree but need to be sure there are no kernel or driver dependencies and that the code does not re-map the GPIO sometime later in bootup.

RPHILLE · July 3, 2020, 6:25pm

Is there anything your support team can offer? One of our client’s software developers commented that the 250msec pulse duration looks very deliberate, as if there was a code-driven attempt to recover a peripheral that may have stopped responding by power-cycling it. This would apply to a device that is on-SOM powered from the internal IO rail from which POWER_ENABLE_MOCI is derived. Can anyone at Toradex even just comment on this possibility? We have a possible work-around solution by extending the pulse filter time but we need to know if any longer pulses may be possible through low-level code.

alex.tx · July 10, 2020, 4:20pm

PMIC GPIO2 should not be used by Linux. Could you tell me from exactly which file you are going to ask the developers to “delist the port from the IO Tree”?

Unfortunately, we were not able to reproduce the issue, so it is hard to say what is causing it. It could even be a problem with power supply overcurrent. Some modules can consume more power at peak load, causing a voltage drop. Could you also check the main input voltage to the module with a scope while monitoring POWER_ENABLE_MOCI?

matthias.tx · October 19, 2020, 10:52am

Hello RPHILLE,

I have looked over the whole ticket quickly, and to me it looks like a power issue. I still have to read deeper into it.

What I am missing is some kind of typical engineering test matrix.

Does the same TK1 module behave the same in different carrier boards?
Does the carrier board show the same issue with a different module?
Please measure the module supply voltage and, if possible, the current during bootup, and plot them over the digital signals so that you can see if there is a correlation.
Please provide the schematic of your carrier board for review. You can send it to support.eu@toradex.com
What is the max current of your carrier board DC/DC converters? And have you tested this?
What is the additional current consumption of your carrier board peripherals when the POWER_ENABLE_MOCI signal switches them on?
Measure the dynamic max current when you switch on and the static one.

Best regards,

Matthias Gohlke

RPHILLE · October 21, 2020, 4:28pm

We have examined this problem exactly as you suggest: point 3 has been confirmed clean, point 4 has been addressed by schematic and actual sample unit provided to Toradex US support - the very unit from which the scope capture posted here in March was derived. The 3.3V converter on our host can support 8Amps continuous with 2x330uF low-ESR polymer electrolytic decoupling in addition to the ref-design ceramic caps. Voltage transients have been measured to stay below around +/-75mV with high load transients (4A step). Other draws on the 3.3V are trivial (<200mA) at the time the shutdown occurs.

Initial testing of points 1 and 2 seemed to indicate the problem followed the SOM, but unfortunately, the very few SOMs with the problem stopped exhibiting the issue, exactly as happened with the original unit that Toradex have at their support lab. With no systems on which the problem can be analyzed, we have suspended pursuit of this issue and will keep close watch on new system builds. So far, the problem has not manifested in recent hardware/software builds.