eMMC blocking error on iMX8X

Hello,

We are experiencing what we suspect are eMMC-related problems on our iMX8X module when running a custom database setup in a container. Initially everything seems fine, but after 10-20 minutes the kernel panics (see picture) with an error stating that the jbd2/mmcblk0p2 task has been blocked. In the meantime, the system becomes increasingly unresponsive.

Does this indicate that eMMC might be corrupted on the module? Or could this be a device tree/TorizonCore related issue?

Greetings @DominykasD,

The jbd2/mmcblk0p2 error does seem to indicate some sort of eMMC issue. Whether it’s corruption or something else is hard to tell at the moment.
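
For anyone collecting logs for a report like this, blocked-task messages can be pulled out of a saved kernel log with a short script. The following is only a minimal Python sketch: it assumes the usual wording of the kernel hung-task message ("INFO: task <name>:<pid> blocked for more than <N> seconds"), and the function name is hypothetical.

```python
import re

# Assumed wording of the kernel hung-task detector message.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append((match["name"], int(match["pid"]), int(match["secs"])))
    return hits
```

To use it, save the output of dmesg to a file and pass the file contents to find_hung_tasks(); any entry whose task name starts with jbd2/ points at the ext4 journal thread of that partition.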

I have a couple of questions/suggestions that might narrow down what exactly is going on here.

  1. Judging by your initial post, it seems this error is consistently reproducible, correct? Is it also reproducible on other iMX8X modules, or only on this specific one? This would help us determine whether this is a broader systematic issue or specific to one piece of hardware.
  2. Next, could you share the details of the “custom-setup database” container that causes this issue? If not, could you explain in greater detail what this container is doing, or provide a minimal version of the container that still reproduces the error? This is in case we need to do further testing or reproduce it on the Toradex side.

Best Regards,
Jeremias

Answering your questions:

  1. Yes, once I load the containers with all the related data to a new module running the exact same OS, the error appears once again with the exact same behaviour.

  2. It consists of an InfluxDB database container with complementary Telegraf and Chronograf containers (they are essentially unmodified, only with some custom graphs loaded specific to our application), and also an intermediate container which requests data from hardware provider containers (they are used for communication between the module and the (sub-)devices attached to the module via CAN, SPI, I2C etc.) and pushes that data to the InfluxDB container. Those custom containers are intellectual property, so sadly we cannot publish them for you to recreate the behaviour on your side.

Unfortunately, we are not currently aware of any known issues with the eMMC on the i.MX8X.

Apologies, but we really require your container sources to be able to investigate/debug this effectively from our side. It may be the case that there’s a very specific set of sequences that causes this eMMC error.

You can either provide a minimal version of your solution, without sensitive information, that is still capable of producing this error, or we can handle this more discreetly and have you share the sources with us under an agreement/NDA, if this is acceptable.

Best Regards,
Jeremias

@DominykasD,

I spoke a bit with the team internally about your possible eMMC issues. It is possible that certain Docker workloads could cause eMMC-related issues. Therefore, I have to ask again for the source of your container stack so that we can accurately debug and investigate this from our side.
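
One way to get a rough idea of how heavily a container workload writes to the eMMC is to sample /proc/diskstats before and after a run. The sketch below is an illustration only, not a Toradex tool: the function names are hypothetical, the device name mmcblk0 is taken from the jbd2/mmcblk0p2 message above, and it relies on /proc/diskstats reporting sectors in 512-byte units.

```python
def parse_diskstats(text, device="mmcblk0"):
    """Return (sectors_read, sectors_written) for device, or None if absent."""
    for line in text.splitlines():
        parts = line.split()
        # Columns: major minor name reads merged sectors_read ms_read
        #          writes merged sectors_written ...
        if len(parts) >= 10 and parts[2] == device:
            return int(parts[5]), int(parts[9])
    return None


def written_mib(before, after):
    """MiB written between two parse_diskstats() samples (512-byte sectors)."""
    return (after[1] - before[1]) * 512 / (1024 * 1024)
```

A typical use would be to read /proc/diskstats once when the stack starts and again shortly before the error is expected, then pass both samples to written_mib().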

Additionally, one team member has requested the following information. In the U-Boot prompt of the i.MX8X, could you run the following commands:

# fuse read 0 18
# fuse read 0 19

And then please give us the output from both of these read commands.

Best Regards,
Jeremias

It is possible to reproduce the error without an NDA or anything custom; the error also occurs with this stack:

IMAGE                                                        
arm64v8/debian:stable-slim                                      
mcr.microsoft.com/dotnet/core/runtime:3.0-buster-slim-arm64v8  
mcr.microsoft.com/dotnet/core/runtime:3.0-buster-slim-arm64v8  
mcr.microsoft.com/dotnet/core/aspnet:3.0-buster-slim-arm64v8   
mcr.microsoft.com/dotnet/core/aspnet:3.0-buster-slim-arm64v8   
arm64v8/telegraf:latest                                        
arm64v8/influxdb:latest                                       
arm64v8/chronograf:latest

For InfluxDB, Telegraf and Chronograf, the docker-compose file should be like this:

version: '2'

services:
    influxdb:
        image: arm64v8/influxdb:latest
        container_name: influxdb
        restart: always
        network_mode: host
        ports:
            - "8086:8086"
        volumes:
            -  /<any folder>/influxdb:/var/lib/influxdb
    chronograf:
        image: arm64v8/chronograf:latest
        container_name: chronograf
        restart: always
        network_mode: host
        ports:
            - "8888:8888"
        environment:
            - INFLUXDB_URL=http://influxdb:8086
        volumes:
            - /<any folder>/chronograf:/var/lib/chronograf

    telegraf:
        image: arm64v8/telegraf:latest
        container_name: telegraf
        restart: always
        network_mode: host
        volumes:
            - /<any folder>/telegraf.conf:/etc/telegraf/telegraf.conf:ro
            - /var/run/docker.sock:/var/run/docker.sock
            - /:/hostfs:ro

Other containers are run with -d -it and with the entrypoint being /bin/sh.
We also have an application which pushes some random data similar to what we use in our workload. To make it work, an empty file should be created at C:\imx8-emmc-test\settings\dia\csv.log (otherwise, the application will exit). Once the application is unzipped, there is an appsettings.json file inside the folder. There, the IP address of the remote database should be updated with the IP of the module:

  "InfluxApi": {
    "ApiUrl": "http://<any IP>:8086"
  },

Once updated, SendInx.exe should be launched, and if everything is working, “Sending data” should appear in the console every half second.
Note - it may take a few hours until the error appears.
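
For anyone who cannot run the Windows test application, a comparable load can be approximated with a short Python script. This is not the SendInx.exe source, only a rough stand-in under stated assumptions: it assumes an InfluxDB 1.x server exposing the /write endpoint on port 8086, and the database name emmc_test, the measurement name and the field names are all hypothetical. The database is assumed to exist already.

```python
import random
import time
import urllib.request


def make_line(measurement, fields, timestamp_ns):
    """Format one point in InfluxDB 1.x line protocol (float fields only)."""
    field_part = ",".join(
        f"{key}={float(value)}" for key, value in sorted(fields.items())
    )
    return f"{measurement} {field_part} {timestamp_ns}"


def send_point(api_url, database, line, timeout=5.0):
    """POST one line-protocol point to the InfluxDB 1.x /write endpoint."""
    request = urllib.request.Request(
        f"{api_url}/write?db={database}",
        data=line.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run(api_url, database="emmc_test", interval_s=0.5):
    """Send random data forever at the half-second cadence described above."""
    while True:
        fields = {
            "value_a": random.uniform(0, 100),  # hypothetical field names
            "value_b": random.uniform(0, 100),
        }
        send_point(api_url, database, make_line("random_load", fields, time.time_ns()))
        print("Sending data")
        time.sleep(interval_s)
```

Calling run("http://<any IP>:8086") with the IP of the module starts the loop; it runs until interrupted.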

And the outputs from fuse read are as follows:
[attached image: 2503-fuse.png]

Hi @DominykasD,

Thanks for the feedback.

Could you check your picture uploads?

They are not appearing in your last comment.

Best regards, André Curvello

Hi,

Should be good now.

DominykasD

Hi @DominykasD,

We have already escalated your stack/setup to our Validation and Verification team to add it to our test environment.

Can you please check the results of these two commands as well?

fuse read 0 765
fuse read 0 766

Thank you for your patience.

Best regards,
André Curvello

Hi,

These are the results:
[attached image: 2514-fuse2.png]

DominykasD

Hi @DominykasD,

We’ll be analyzing the details and environment.

Thanks for the feedback.

Best regards,
André Curvello

Hi,
Is there any news? Has the problem been reproduced on your side?

Hi @DominykasD,

So far we have been unable to reproduce this problem on our side.

We have escalated this topic internally to combine the efforts of different engineers, and as soon as we have results we’ll provide feedback.

It would really help if you could provide an image or the containers that you are using, so that we can debug under more realistic conditions.

We understand the point of the NDA, and we will conduct everything with full discretion.

CC @jaski.tx @walter.tx.

Best regards,
André Curvello