Intermittent reboots (temperature or power related?) in iMX8 QuadMax

Our system is rebooting intermittently, frequently accompanied by some subset of the following dmesg logs.

The SoM is in a custom carrier board based on the Apalis evaluation board.

We are looking for insight as to whether temperature could be an instigating factor (we’re running with a heat sink but no fan in an open environment), or if there are any known QA issues with this part.
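To test the temperature theory, one option would be to log the SoC temperature continuously so the last reading before each reboot survives on disk. Below is a minimal sketch, assuming the kernel exposes its sensors through the standard Linux thermal sysfs interface (`/sys/class/thermal/thermal_zone*/temp`, in millidegrees Celsius). The function names, the log path `/var/log/soc-temp.log`, and the 5-second interval are hypothetical choices for illustration, not anything specific to this board.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"<zone dir>:<zone type>": degrees Celsius} for each readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())  # sysfs reports millidegrees C
        except (OSError, ValueError):
            # Skip zones that cannot be read or return non-numeric data.
            continue
        readings[f"{os.path.basename(zone)}:{name}"] = millideg / 1000.0
    return readings


def log_forever(path="/var/log/soc-temp.log", interval_s=5):
    """Append a timestamped reading every interval_s seconds (hypothetical path)."""
    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        with open(path, "a") as out:
            out.write(f"{stamp} {read_thermal_zones()}\n")
            out.flush()
            # Force the line to storage so it is not lost in an abrupt reset.
            os.fsync(out.fileno())
        time.sleep(interval_s)
```

Calling `log_forever()` from a background service and comparing the final entries with the reboot times should show whether the resets cluster at high temperatures. Since the eMMC itself shows timeouts in the logs, writing the file to different storage may be safer.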

The issue seems to be getting more frequent over time (about once per day two weeks ago; 10 times this morning).

Example 1:

[   24.307410] apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)
[   24.326436] panel-simple lvds1_panel: lvds1_panel supply power not found, using dummy regulator
[   29.517671] apex 0000:01:00.0: Apex performance not throttled due to temperature 

Example 2:

[   80.818660] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   82.354676] mmc0: cqhci: timeout for tag 31
[   82.358882] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
[   82.365343] mmc0: cqhci: Caps:      0x0000310a | Version:  0x00000510
[   82.371796] mmc0: cqhci: Config:    0x00001001 | Control:  0x00000000
[   82.378248] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000006
[   82.384704] mmc0: cqhci: Int sig:   0x00000006 | Int Coal: 0x00000000
[   82.391156] mmc0: cqhci: TDL base:  0xffffb000 | TDL up32: 0x00000000
[   82.397612] mmc0: cqhci: Doorbell:  0xfffffffe | TCN:      0x00000000
[   82.404069] mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000001
[   82.410522] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00011000
[   82.416977] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000000
[   82.423432] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
[   82.429887] mmc0: cqhci: Resp idx:  0x0000000d | Resp arg: 0x00000000
[   82.436343] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[   82.442799] mmc0: sdhci: Sys addr:  0xffade000 | Version:  0x00000002
[   82.449249] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000008
[   82.455698] mmc0: sdhci: Argument:  0x00018000 | Trn mode: 0x00000033
[   82.462152] mmc0: sdhci: Present:   0x01fd8008 | Host ctl: 0x00000030
[   82.468607] mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000080
[   82.475064] mmc0: sdhci: Wake-up:   0x00000008 | Clock:    0x0000000f
[   82.481517] mmc0: sdhci: Timeout:   0x0000008f | Int stat: 0x00000000
[   82.487972] mmc0: sdhci: Int enab:  0x107f4000 | Sig enab: 0x107f4000
[   82.494426] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000502
[   82.500881] mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x8000b407
[   82.507337] mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00ffffff
[   82.513791] mmc0: sdhci: Resp[0]:   0x00000000 | Resp[1]:  0xffffffff
[   82.520246] mmc0: sdhci: Resp[2]:   0x328f5903 | Resp[3]:  0x00d02701
[   82.526700] mmc0: sdhci: Host ctl2: 0x00000008
[   82.531155] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xffff5c08
[   82.537608] mmc0: sdhci: ============================================
[   82.544164] mmc0: running CQE recovery
[   85.940064] apex 0000:01:00.0: Apex performance not throttled due to temperature
[   87.202070] kauditd_printk_skb: 113 callbacks suppressed
[   87.202074] audit: type=1006 audit(1600977524.613:43): pid=4585 uid=0 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=1 res=1
[   87.630101] audit: type=1006 audit(1600977525.041:44): pid=4461 uid=0 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=2 res=1
[   91.061684] apex 0000:01:00.0: Apex performance not throttled due to temperature

Hi @Gwen

Thanks for writing to Toradex Support.

Could you provide the hardware version of your module (including the carrier board) and the software version it is running?
Which application are you running on the SoM?

Best regards,
Jaski