Apalis iMX8: OpenCL related questions

Hi team,

A Japanese partner is looking to implement their AI framework on our Apalis iMX8 using OpenCL and forwarded me some questions.

They are using SqueezeNet for CPU and yolo_tiny for GPU.

Can we change the GPU memory size?

Currently, the GPU memory size is set to 256 MB.
How can we change the setting to 512 MB, 1 GB, etc.?

We already tried the following methods, but the memory size did not seem to change:

  • adding the gpumem argument to the Linux kernel parameters
  • changing the register mapping node of imx8_gpu_ss in the device tree

OpenCL driver doesn’t support fp16

According to the document for the Vivante GC7000, fp16 GFLOPS throughput seems to be twice that of fp32.

Vivante GC7000 GPUs Deliver Desktop-Class Graphics to Mobile Devices

However, the OpenCL driver doesn’t seem to support fp16.
CL_DEVICE_EXTENSIONS only indicates the following flags:


If the driver supported fp16, the cl_khr_fp16 flag would be included.
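
For reference, a minimal sketch of how the presence of cl_khr_fp16 could be checked on the host side. It assumes the extension string has already been read with clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ...). The helper name has_extension is hypothetical and not part of OpenCL. It matches whole space-separated tokens rather than substrings, since one extension name can be a prefix of another.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `name` appears as a complete, space-separated token in
 * `ext_list` (the string reported for CL_DEVICE_EXTENSIONS).
 * has_extension is a hypothetical helper, not an OpenCL API call. */
bool has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* The match must begin at the start of the list or after a space... */
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        /* ...and must end at a space or at the end of the list. */
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return true;
        p++; /* keep searching after this partial match */
    }
    return false;
}
```

Calling has_extension(exts, "cl_khr_fp16") on a string that lists only other extensions returns false, which matches the behaviour described above.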

Do we have any further information about enabling the fp16 feature on i.MX8?

We’ve checked some old information but are not sure if there is anything newer on the topic:

Kindly let us know.

OpenCL kernel compiler hangs on specific array size

The following kernel code will hang during clBuildProgram.

__kernel void test(__global int* dst, __global int* src, int mode)
{
    int pos = (int)get_global_id(0);
    const int ofstbl[9] = {1, 3, 7, 5, 6, 4, 8, 0, 2}; // NG (build fails)
//    const int ofstbl[8] = {1, 3, 7, 5, 6, 4, 8, 0};    // OK (build succeeds)
    dst[pos] = src[pos] + ofstbl[mode];
}

The following errors are printed to stdout:

double free or corruption (!prev)
Aborted (core dumped)

It seems that it hangs when the array has 9 or more elements.
Could you please check and fix this issue?

Which benchmark software do we use for validating OpenCL performance?

Basically, which benchmark software do we recommend for OpenCL on the i.MX8 platform?

Copy from GPU to CPU is around 10 times slower compared to a write from CPU to GPU

We tested the performance of data copies between the CPU and the GPU.
A transfer from GPU to CPU is 10 times slower than one from CPU to GPU.
We use clEnqueueWriteBuffer for copying from CPU to GPU and
clEnqueueReadBuffer for copying from GPU to CPU.

Do we know why this difference occurs and how to solve it?
We think it is related to DMA, but any help would be appreciated.
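
One way to put numbers on the two directions is to time each transfer command individually with OpenCL event profiling and convert the result to a throughput. The sketch below assumes the start and end timestamps were obtained with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, both in nanoseconds) on a queue created with CL_QUEUE_PROFILING_ENABLE. The helper name transfer_mib_per_s is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Throughput in MiB/s for `bytes` transferred between two timestamps given
 * in nanoseconds (the unit used by OpenCL profiling counters).
 * transfer_mib_per_s is a hypothetical helper, not an OpenCL API call. */
double transfer_mib_per_s(size_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns > start_ns);

    double seconds = (double)(end_ns - start_ns) / 1e9;
    double mib = (double)bytes / (1024.0 * 1024.0);
    return mib / seconds;
}
```

Comparing the value computed for a clEnqueueReadBuffer event with the one for a clEnqueueWriteBuffer event of the same size would show whether the factor of 10 comes from the copy command itself or from overhead around it.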

Finally, they also asked for the reference manual for i.MX8 and the latest version I have is from June 2018. Is there any newer version?

Thanks and kind regards,

Hi Alvaro

Can we change the GPU memory size?

I tried the following:

--- a/arch/arm64/boot/dts/freescale/fsl-imx8qm-apalis-eval.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-imx8qm-apalis-eval.dtsi
@@ -199,6 +199,8 @@
 &imx8_gpu_ss {
+imx8_gpu_ss: imx8_gpu_ss {
+       reg = <0x0 0x80000000 0x0 0x80000000>, <0x0 0x0 0x0 0x20000000>;
        status = "okay";

After compilation and deployment of the device tree I see a change:

root@apalis-imx8:~# cat /sys/module/galcore/parameters/contiguousSize           

Whether or not that changed anything in the GPU behaviour I do not know.
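
To sanity-check the numbers in the reg property, the sketch below decodes one tuple into an address and a size. It assumes the node uses two address cells and two size cells (#address-cells = <2>, #size-cells = <2>), which is not verified here for this device tree. decode_reg and size_in_mib are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One <address size> entry of a `reg` property, ASSUMING two 32-bit cells
 * for the address and two for the size. */
struct reg_entry {
    uint64_t address;
    uint64_t size;
};

/* Combine four 32-bit cells (addr_hi, addr_lo, size_hi, size_lo) into
 * 64-bit values. decode_reg is a hypothetical helper. */
struct reg_entry decode_reg(const uint32_t cells[4])
{
    assert(cells != NULL);

    struct reg_entry e;
    e.address = ((uint64_t)cells[0] << 32) | cells[1];
    e.size = ((uint64_t)cells[2] << 32) | cells[3];
    return e;
}

/* Size in MiB, rounded down. */
uint64_t size_in_mib(uint64_t bytes)
{
    return bytes >> 20;
}
```

Under that assumption, the second tuple <0x0 0x0 0x0 0x20000000> describes a size of 0x20000000 bytes, i.e. 512 MiB, and the first tuple describes a 2048 MiB region starting at 0x80000000.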

OpenCL driver doesn’t support fp16

The article you linked is still true. We provide whatever NXP provides and the features are as they are. Nothing we can do here.

With the BSP 3.0 images we integrated the driver version from NXP’s GA release, so if you have that, you have the latest and greatest.

Also, in NXP’s imx-yocto-L4.14.98_2.0.0_ga documentation set, i.MX_Graphics_User’s_Guide.pdf states that that particular OpenCL extension is not supported.

OpenCL kernel compiler hangs on specific array size

It’s probably best to discuss that in the NXP community.

Which benchmark software do we use for validating OpenCL performance?

We run clpeak to see if the OpenCL related libraries have been correctly deployed.

Copy from GPU to CPU is around 10 times slower compared to a write from CPU to GPU

We cannot comment on that as we simply have no experience with OpenCL whatsoever.

Reference Manual

Rev. E, 06/2018 is the latest we got from NXP.


Many thanks @max.tx !