OpenCL on i.MX8

Hi,

After failing to run an OpenCL kernel on the i.MX8, I queried the device info with the code below and got some rather interesting output.

	// Getting platform and device information
	cl_platform_id platformId = NULL;
	cl_device_id deviceID = NULL;
	cl_uint retNumDevices;
	cl_uint retNumPlatforms;
	cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
	ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

	// Get GPU device information
	char cBuffer[1024];
	cl_uint maxComputeUnits;
	cl_uint maxClockFreq;
	cl_ulong globalMemSize;
	cl_ulong localMemSize;
	cl_ulong maxMemAllocSize;
	size_t maxWorkGroupSize;
	size_t maxWorkItemSizes[3];
	cl_uint maxWorkItemDims;
	cl_device_type deviceType;

	printf("\n=== OpenCL Device Information ===\n");

	// Device name
	clGetDeviceInfo(deviceID, CL_DEVICE_NAME, sizeof(cBuffer), &cBuffer, NULL);
	printf("Device Name: %s\n", cBuffer);

	// Device vendor
	clGetDeviceInfo(deviceID, CL_DEVICE_VENDOR, sizeof(cBuffer), &cBuffer, NULL);
	printf("Device Vendor: %s\n", cBuffer);

	// Device type
	clGetDeviceInfo(deviceID, CL_DEVICE_TYPE, sizeof(deviceType), &deviceType, NULL);
	printf("Device Type: ");
	if (deviceType & CL_DEVICE_TYPE_GPU) printf("GPU\n");
	else if (deviceType & CL_DEVICE_TYPE_CPU) printf("CPU\n");
	else if (deviceType & CL_DEVICE_TYPE_ACCELERATOR) printf("Accelerator\n");
	else printf("Unknown\n");

	// OpenCL version
	clGetDeviceInfo(deviceID, CL_DEVICE_OPENCL_C_VERSION, sizeof(cBuffer), &cBuffer, NULL);
	printf("OpenCL C Version: %s\n", cBuffer);

	// Driver version
	clGetDeviceInfo(deviceID, CL_DRIVER_VERSION, sizeof(cBuffer), &cBuffer, NULL);
	printf("Driver Version: %s\n", cBuffer);

	// Compute units
	clGetDeviceInfo(deviceID, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(maxComputeUnits), &maxComputeUnits, NULL);
	printf("Max Compute Units: %u\n", maxComputeUnits);

	// Clock frequency
	clGetDeviceInfo(deviceID, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(maxClockFreq), &maxClockFreq, NULL);
	printf("Max Clock Frequency: %u MHz\n", maxClockFreq);

	// Global memory
	clGetDeviceInfo(deviceID, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMemSize), &globalMemSize, NULL);
	printf("Global Memory Size: %lu MB\n", globalMemSize / (1024 * 1024));

	// Max memory allocation
	clGetDeviceInfo(deviceID, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxMemAllocSize), &maxMemAllocSize, NULL);
	printf("Max Memory Allocation: %lu MB\n", maxMemAllocSize / (1024 * 1024));

	// Local memory
	clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL);
	printf("Local Memory Size: %lu KB\n", localMemSize / 1024);

	// Max work group size
	clGetDeviceInfo(deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
	printf("Max Work Group Size: %zu\n", maxWorkGroupSize);

	// Max work item dimensions
	clGetDeviceInfo(deviceID, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(maxWorkItemDims), &maxWorkItemDims, NULL);
	printf("Max Work Item Dimensions: %u\n", maxWorkItemDims);

	// Max work item sizes
	clGetDeviceInfo(deviceID, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(maxWorkItemSizes), &maxWorkItemSizes, NULL);
	printf("Max Work Item Sizes: %zu x %zu x %zu\n", maxWorkItemSizes[0], maxWorkItemSizes[1], maxWorkItemSizes[2]);

	printf("=================================\n\n");

On the i.MX8 module, this prints:

ids-1  | === OpenCL Device Information ===
ids-1  | Device Name: 
ids-1  | Device Vendor: 
ids-1  | Device Type: CPU
ids-1  | OpenCL C Version: 
ids-1  | Driver Version: 
ids-1  | Max Compute Units: 0
ids-1  | Max Clock Frequency: 1 MHz
ids-1  | Global Memory Size: 268434198 MB
ids-1  | Max Memory Allocation: 268435394 MB
ids-1  | Local Memory Size: 0 KB
ids-1  | Max Work Group Size: 281473657739148
ids-1  | Max Work Item Dimensions: 65535
ids-1  | Max Work Item Sizes: 281473657909640 x 281473657671680 x 281474911768320
ids-1  | =================================

On my PC, the same code produces the following output:

=== OpenCL Device Information ===
Device Name: NVIDIA GeForce RTX 4060
Device Vendor: NVIDIA Corporation
Device Type: GPU
OpenCL C Version: OpenCL C 1.2 
Driver Version: 580.65.06
Max Compute Units: 24
Max Clock Frequency: 2460 MHz
Global Memory Size: 7782 MB
Max Memory Allocation: 1945 MB
Local Memory Size: 48 KB
Max Work Group Size: 1024
Max Work Item Dimensions: 3
Max Work Item Sizes: 1024 x 1024 x 64
=================================
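
One quick way to judge the two outputs above is to compare them against the minimum values the OpenCL 1.2 specification gives for a conformant (non-custom) device: at least 1 compute unit, at least 3 work-item dimensions, and a work-group size of at least 1. The sketch below is only an illustration; the struct and function names are made up for this example, and passing the check is necessary but not sufficient for the data to be real.

```c
#include <stdbool.h>

/* Subset of the values printed by the test program.
   The struct and field names are illustrative, not part of OpenCL. */
struct device_limits {
    unsigned compute_units;              /* CL_DEVICE_MAX_COMPUTE_UNITS */
    unsigned work_item_dims;             /* CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS */
    unsigned long long work_group_size;  /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
};

/* Returns true if the values reach the spec minimums for a non-custom device.
   A false result strongly suggests the query never filled the variables. */
bool limits_meet_spec_minimums(const struct device_limits *d)
{
    return d->compute_units >= 1
        && d->work_item_dims >= 3
        && d->work_group_size >= 1;
}
```

Fed with the i.MX8 numbers (0 compute units), the check fails; fed with the RTX 4060 numbers, it passes. That points to the i.MX8 figures being leftover memory contents rather than values reported by a driver.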

The OpenCL kernel (and application) in question has been tested on both AMD and NVIDIA GPUs and works perfectly well.

My Verdin module has been flashed with a standard Torizon OS image; I have not made any customisations (drivers, device tree, U-Boot, etc.).

torizon@verdin-imx8mm-15400884:~$ uname -a
Linux verdin-imx8mm-15400884 5.15.177-6.8.2+git.5a6602cc6c32 #1-TorizonCore SMP PREEMPT Fri Mar 21 16:22:58 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Is there a document or guide on how to properly prepare the device for OpenCL?

Hi, I’m the Toradex AI Assistant.

I might make mistakes or occasionally get you wrong. Our support team will step in if needed.


OpenCL Setup on Verdin iMX8M Plus with Torizon

Your OpenCL device information output indicates that the GPU drivers aren’t properly loaded or configured. The empty device name and vendor strings, along with the CPU device type detection, suggest that OpenCL is falling back to a software implementation rather than using the GPU hardware acceleration (1).

Prerequisites and Supported Hardware

OpenCL 1.2 support on Torizon is available for NXP i.MX8/8X/8M Plus SoCs, including your Verdin iMX8M Plus module (1). However, you need to use a properly configured container environment with the necessary GPU libraries and device access.

Proper Setup Using Containers

The standard approach is to use a containerized application with the required OpenCL libraries. Here’s the complete setup process (1):

1. Clone the Sample Repository

$ git clone -b bookworm https://github.com/toradex/torizon-samples.git  
$ cd torizon-samples/opencl  

(1)

2. Build the Container Image

$ docker build -t <your-dockerhub-username>/opencl-image .  

(1)

3. Deploy to Your Module

$ docker push <your-dockerhub-username>/opencl-image  

(1)

On your Verdin module:

# docker pull <your-dockerhub-username>/opencl-image  

(1)

4. Run with Proper Device Access

# docker run -e ACCEPT_FSL_EULA=1 -it --rm --name=clpeak-container --device /dev/galcore:/dev/galcore <your-dockerhub-username>/opencl-image  

(1)

Key Requirements

The container setup includes several critical components (1):

  • Debian base container with Toradex-specific GPU packages
  • OpenCL runtime libraries optimized for i.MX8 GPUs
  • Device access to /dev/galcore for GPU communication
  • NXP EULA acceptance via environment variable

Testing Your Setup

The sample includes two verification tools (1):

  • clpeak: Benchmarking tool to measure OpenCL device performance
  • clinfo: Debugging tool that displays all available OpenCL platforms and devices

When properly configured, you should see output similar to this from a working Apalis iMX8 (1):

Platform: Vivante OpenCL Platform  
  Device: Vivante OpenCL Device GC7000XSVX.6009.0000  
    Driver version  : OpenCL 1.2 V6.2.4.p4.190076 (Linux ARM64)  
    Compute units   : 1  
    Clock frequency : 996 MHz  

(1)

Why Your Current Approach Isn’t Working

Running OpenCL directly on the host system without the proper container environment and GPU libraries will result in the software fallback behavior you’re experiencing. The Torizon approach uses containers with pre-configured GPU access and the necessary Vivante OpenCL libraries (1).

The complete Dockerfile implementation and additional examples are available in the Torizon samples repository (1).

Best regards,
Toradex AI Assistant.



Hello @vmetodiev ,

Was the information from Toradex AI enough to address your questions?

If anything is still unclear or if you need further support, please let me know. I will be happy to help.

Best regards,

Hi @daniel_m.tx ,

Please excuse my delayed response to this thread!

Today I had some time to try the flow suggested by the Toradex AI.

I managed to assemble a somewhat Frankenstein-like Dockerfile as follows:

# --- GLOBAL ARGS ---
ARG IMAGE_ARCH=arm64
ARG DOCKER_REGISTRY=torizon
ARG APP_ROOT=/home/torizon/app
ARG CROSS_TC_IMAGE_NAME=debian-cross-toolchain-arm64-vivante
ARG CROSS_TC_IMAGE_TAG=3-bookworm
ARG BASE_NAME=debian
ARG IMAGE_TAG=3-bookworm

# BUILD STAGE ------------------------------------------------------------------
FROM ${DOCKER_REGISTRY}/${CROSS_TC_IMAGE_NAME}:${CROSS_TC_IMAGE_TAG} AS build

ARG IMAGE_ARCH
ARG APP_ROOT
ARG SDK_COMPILER_PREFIX=aarch64-linux-gnu-

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    cmake \
    imx-gpu-viv-wayland-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# 1. Build clpeak
# Note: Cloning to /clpeak (root)
RUN git clone https://github.com/krrishnarraj/clpeak.git /clpeak && \
    cd /clpeak && \
    git submodule update --init --recursive && \
    mkdir build && cd build && \
    cmake \
        -DCMAKE_C_COMPILER=${SDK_COMPILER_PREFIX}gcc \
        -DCMAKE_CXX_COMPILER=${SDK_COMPILER_PREFIX}g++ \
        .. && \
    cmake --build .

# 2. Build your Application
COPY . ${APP_ROOT}
WORKDIR ${APP_ROOT}

RUN if [ "$IMAGE_ARCH" = "arm64" ] ; then \
        make ARCH=${IMAGE_ARCH} CC=${SDK_COMPILER_PREFIX}g++ ; \
    else \
        echo "Add other arch logic here" ; \
    fi


# DEPLOY STAGE -----------------------------------------------------------------
FROM --platform=linux/${IMAGE_ARCH} ${DOCKER_REGISTRY}/${BASE_NAME}:${IMAGE_TAG} AS deploy

ARG IMAGE_ARCH
ARG APP_ROOT

# Install runtime GPU libraries
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    imx-gpu-viv-wayland \
    libwayland-client0 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

WORKDIR ${APP_ROOT}

# COPY FROM BUILD STAGE TO DEPLOY STAGE
# -------------------------------------
# Copy your app
COPY --from=build ${APP_ROOT}/build-${IMAGE_ARCH}/bin ${APP_ROOT}

# Copy OpenCL kernels
COPY --from=build ${APP_ROOT}/src/*.cl ${APP_ROOT}/

# Copy clpeak binary to a standard system path
COPY --from=build /clpeak/build/clpeak /usr/bin/clpeak

# CMD ["./ids"]
CMD ["clpeak"]

The output from clpeak is:

[+] Running 1/1
 ✔ Container torizon-ids-1  Created                                                                                                                                              0.8s 
Attaching to ids-1
ids-1  | 
ids-1  | Platform: Vivante OpenCL Platform
ids-1  | clCreateContextFromType (-30)
ids-1 exited with code 255
 *  Terminal will be reused by tasks, press any key to close it. 
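
The number clpeak prints in parentheses is the raw OpenCL status code returned by clCreateContextFromType. As a reading aid, here is a small lookup for a handful of codes, using the names and values from the standard Khronos CL/cl.h header. The function name cl_status_name is invented for this sketch, and the list is deliberately incomplete.

```c
/* Maps a few OpenCL status codes to their names from CL/cl.h.
   Only a subset is listed; anything else is reported as unlisted. */
const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -31: return "CL_INVALID_DEVICE_TYPE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    default:  return "unlisted status";
    }
}
```

So the -30 above is CL_INVALID_VALUE. The log does not show what exactly the Vivante runtime rejected, only that the platform was found but a context could not be created on it.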

And again, from my test app I get the same abnormal values:

[+] Running 1/1
 ✔ Container torizon-ids-1  Created                                                                                                                                              1.2s 
Attaching to ids-1
ids-1  | Generated buffer len: 62208
ids-1  | 
ids-1  | === OpenCL Device Information ===
ids-1  | Device Name: 
ids-1  | Device Vendor: 
ids-1  | Device Type: CPU
ids-1  | OpenCL C Version: 
ids-1  | Driver Version: 
ids-1  | Max Compute Units: 0
ids-1  | Max Clock Frequency: 1 MHz
ids-1  | Global Memory Size: 268434216 MB
ids-1  | Max Memory Allocation: 268434807 MB
ids-1  | Local Memory Size: 0 KB
ids-1  | Max Work Group Size: 281473677108732
ids-1  | Max Work Item Dimensions: 65535
ids-1  | Max Work Item Sizes: 281473677257720 x 281473677041664 x 281474297144544
ids-1  | =================================
ids-1  | 
ids-1 exited with code 0
 *  Terminal will be reused by tasks, press any key to close it. 

I should add that this is neither critical nor urgent for me. Please just let me know whether OpenCL is indeed supported and tested on my module:

Linux verdin-imx8mm-15400884 5.15.177-6.8.2+git.5a6602cc6c32 #1-TorizonCore SMP PREEMPT Fri Mar 21 16:22:58 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

If not, I will purchase a new model that you recommend.

Thanks!

Hello @vmetodiev ,

Thanks for the detailed follow-up, and no worries at all regarding urgency.

To answer your question directly: OpenCL is not officially tested or validated on Torizon for the Verdin iMX8MM (or our other modules). That said, OpenCL libraries are present as part of the GPU stack, and the general expectation is that it can work on a best-effort basis, without formal guarantees.

What your tests show is consistent with what we see in practice on iMX8MM: OpenCL depends on the Vivante GPU runtime, and when that runtime does not initialize correctly (especially in containerized setups), the behavior can be inconsistent or incomplete. There is no independent CPU OpenCL backend on this platform.

As a practical next step, you may want to try running the same workload directly on the host (outside a container) to confirm whether the behavior is container-related or inherent to the runtime.
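
Whichever environment is used for that comparison, it helps if the test program checks the status of every call, since the device-info code earlier in this thread discards all return values and would print whatever happens to be in its local variables when a query fails. The sketch below shows one possible pattern. TRY_STATUS, read_compute_units, and the two stub functions are hypothetical names; the stubs merely stand in for real calls such as clGetDeviceIDs or clGetDeviceInfo (which return a cl_int, with CL_SUCCESS being 0) so that the example builds without the OpenCL headers.

```c
#include <stdio.h>

/* Runs one status-returning call. On a non-zero status it logs the call
   and returns that status from the enclosing function. */
#define TRY_STATUS(call)                                          \
    do {                                                          \
        int status_ = (call);                                     \
        if (status_ != 0) {                                       \
            fprintf(stderr, "%s failed: %d\n", #call, status_);   \
            return status_;                                       \
        }                                                         \
    } while (0)

/* Hypothetical stand-ins for an OpenCL query (assumption: not real API). */
int stub_query_ok(unsigned *out)   { *out = 24; return 0; }
int stub_query_fail(unsigned *out) { (void)out; return -33; }

/* Fills *units only when the query succeeds; otherwise *units stays 0
   and the failing status is passed back to the caller. */
int read_compute_units(int (*query)(unsigned *), unsigned *units)
{
    *units = 0;                 /* known value instead of stack leftovers */
    TRY_STATUS(query(units));
    return 0;
}
```

With this pattern, a runtime that fails to initialize produces an explicit error line and a non-zero return instead of a table of implausible numbers, which makes the host and container runs easier to compare.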

Happy to continue the discussion if you’d like to explore this further.

Best regards,
Daniel Morais


Hi @daniel_m.tx ,

Thank you for the detailed and clear explanations!

No further questions from my side… so you can close this thread.

Best regards,

Varban