Run PyTorch Model on the internal NPU of the IMX95

I am trying to run a PyTorch model on the internal NPU of the IMX95. I followed these steps:

  1. Export the PyTorch model as an ONNX file (opset version 18; dynamo=True; model weights: Float32); a sketch of this export call is shown after this list
  2. Quantize the model to int8 using onnx2quant and a calibration dataset with 500 samples
  3. Convert the ONNX file to a TFLite file using eiq-converter.exe (--plugin eiq-converter-onnx2tflite)
    (eIQ_Toolkit_v1.17.0\bin\eiq-converter.exe)
  4. Convert the TFLite file for the IMX95 using neutron-converter.exe
    (eIQ_Toolkit_v1.17.0\bin\neutron-converter\MCU_SDK_25.09.00+Linux_6.12.34_2.1.0\neutron-converter.exe)
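
For reference, the export call in step 1 looks roughly like this (a sketch with a tiny stand-in network; my real model and input shape differ):

```
import torch
import torch.nn as nn

# Tiny stand-in for the real network (illustration only).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()

example_input = torch.randn(1, 3, 224, 224, dtype=torch.float32)

# Step 1: export to ONNX with the dynamo-based exporter and opset 18.
torch.onnx.export(
    model,
    (example_input,),
    "model.onnx",
    opset_version=18,
    dynamo=True,
)
```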

All steps run without error, and I have created a PowerShell script that I use to run steps 2–4. It can be downloaded here:
script.ps1
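
Step 2 in that script calls onnx2quant from the eIQ Toolkit. Conceptually it performs static int8 quantization driven by a calibration data reader; the rough onnxruntime equivalent would look like this (a sketch only, not my actual script; the input name and the random calibration data are placeholders):

```
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class ImageCalibrationReader(CalibrationDataReader):
    """Feeds calibration samples to the quantizer one at a time."""

    def __init__(self, images, input_name):
        self._samples = iter([{input_name: img} for img in images])

    def get_next(self):
        return next(self._samples, None)


# Placeholder calibration data; in practice these are ~500 preprocessed images.
calibration_images = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(500)]

quantize_static(
    model_input="model.onnx",        # output of step 1
    model_output="model_int8.onnx",
    calibration_data_reader=ImageCalibrationReader(calibration_images, input_name="input"),
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```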

The ONNX file (Output step 1) can be downloaded here:
model.onnx

The issue I have:
When I run the model on the NPU of the IMX95, the output is always the same despite providing different input images. However, when I run the TFLite model (Output of step 3) on the CPU of the IMX95, the output is correct for each image. This suggests that the conversion and quantization steps worked correctly, but the final step introduces issues.

I successfully converted (using the neutron-converter) and ran the example model provided by NXP (mobilenet_v1_1.0_224_quant.tflite) on the NPU of the IMX95. So the inference code is not the issue.

I suspect there is either a problem with the layers used in the model or something incorrect in the export or conversion process.

Thank you for any input or suggestions on how to solve this.

Hi, I’m the Toradex AI Assistant.

I might make mistakes or occasionally misunderstand you. Our support team will step in if needed.


I understand you’re experiencing issues with your PyTorch model producing identical outputs for different inputs when running on the i.MX95 NPU, despite working correctly on the CPU. This is a known challenge with neural network model conversion pipelines.

Analysis of Your Conversion Pipeline

Your conversion workflow (PyTorch → ONNX → Quantized ONNX → TFLite → Neutron) is complex and introduces multiple potential points of failure. The fact that your model works correctly on CPU but produces identical outputs on NPU suggests the issue occurs during the neutron-converter step.

Common Causes and Solutions

1. Layer Compatibility Issues

The Neutron-S NPU has specific operator constraints. Based on the community discussion (1), certain layers like fully connected layers can encounter conversion issues even when they appear to meet documented constraints.

2. Quantization Artifacts

The quantization process (step 2) might be introducing artifacts that only manifest when running on NPU hardware. The NPU’s quantized inference path may be more sensitive to these artifacts than CPU execution.

3. Input/Output Tensor Handling

Verify that your input preprocessing and output postprocessing remain consistent between CPU and NPU execution paths. The neutron-converter may alter tensor layouts or scaling factors.
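
For example, you can print the quantization parameters (scale and zero point) and dtypes of the input and output tensors of the TFLite file from step 3 and check that your inference script feeds data in exactly that format (a sketch; the file name is a placeholder):

```
from tensorflow.lite.python.interpreter import Interpreter

# Placeholder path: point this at the TFLite file produced in step 3.
interpreter = Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_input_details() + interpreter.get_output_details():
    scale, zero_point = detail["quantization"]
    print(f"{detail['name']}: dtype={detail['dtype']}, shape={detail['shape']}, "
          f"scale={scale}, zero_point={zero_point}")
```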

Recommended Debugging Steps

Step 1: Simplify the Model

Create a minimal version of your model with fewer layers to isolate which operations are causing issues.

Step 2: Compare Intermediate Outputs

If possible, extract and compare intermediate layer outputs between CPU (TFLite) and NPU (Neutron) versions to identify where divergence occurs.
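
On the CPU (TFLite) side, intermediate tensors can be inspected with the stock interpreter; a minimal sketch, assuming the step-3 TFLite file (the path and the zero input are placeholders):

```
import numpy as np
from tensorflow.lite.python.interpreter import Interpreter

# Keep intermediate tensors alive so they can be read back after invoke().
interpreter = Interpreter(model_path="model_int8.tflite",
                          experimental_preserve_all_tensors=True)
interpreter.allocate_tensors()

input_detail = interpreter.get_input_details()[0]
dummy_input = np.zeros(input_detail["shape"], dtype=input_detail["dtype"])
interpreter.set_tensor(input_detail["index"], dummy_input)
interpreter.invoke()

# Print basic statistics for every tensor; divergence from the NPU run
# narrows down the first layer that misbehaves.
for detail in interpreter.get_tensor_details():
    try:
        tensor = interpreter.get_tensor(detail["index"])
        print(detail["index"], detail["name"], tensor.min(), tensor.max())
    except ValueError:
        pass  # some tensors have no readable data
```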

Step 3: Verify Quantization Parameters

Ensure your calibration dataset of 500 samples is representative of your actual inference data and covers the full range of expected inputs.

Step 4: Test Alternative Conversion Paths

Consider converting directly from TensorFlow/Keras to TFLite, then to Neutron, bypassing the PyTorch → ONNX conversion if feasible.
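
If the model can be rebuilt in Keras, a full-integer conversion would look roughly like this (a sketch; the toy model and random calibration data are placeholders):

```
import tensorflow as tf

# Toy stand-in model (illustration only).
keras_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_dataset():
    # Replace with real calibration images; random data is for illustration only.
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3), dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_int8_keras.tflite", "wb") as f:
    f.write(converter.convert())
```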

Step 5: Enable Verbose Logging

Use verbose logging in the neutron-converter to identify any warnings or constraints being applied during conversion.

Model Architecture Considerations

Since you mentioned the NXP MobileNet example works correctly, compare your model architecture against MobileNet:

  • Activation functions: Ensure you’re using NPU-supported activations
  • Layer types: Verify all your layers are in the supported operator list (see the sketch after this list for a quick way to dump them)
  • Tensor shapes: Check for any dynamic shapes that might not be handled correctly
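
A quick way to dump the operators that actually ended up in the converted TFLite file (to compare against the Neutron supported-operator list) is TensorFlow’s model analyzer (the path is a placeholder):

```
import tensorflow as tf

# Placeholder path: the TFLite file produced in step 3 (before neutron-converter).
tf.lite.experimental.Analyzer.analyze(model_path="model_int8.tflite")
```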

Next Steps

Given the complexity of your conversion pipeline, I’d recommend starting with the simplest possible model that reproduces your issue, then gradually adding complexity while monitoring where the problem emerges.

The community discussion (1) shows similar conversion challenges with classifier models, suggesting this is a known area where the neutron-converter may need refinement.

Best regards,
Toradex AI Assistant.


Did you like my answer? :+1: :-1:

If I answered your question, please tick the “Solution” checkbox.

For follow-up questions to me, please use @ToradexAI in your post.

Hello @TobiasAtAGI,

Just to confirm, which hardware are you using?
And which BSP version?


Did you use the same converter tool version as you did for your model?
If yes, did you get accurate results with this model?
I ask because I am aware of issues with the converter for 6.12 that affected even the example model.
If those are solved, it could be related to the specific model you are using.
NXP provides the NPU and the tools used to convert the model, so we depend on them for this.
However, if you are having problems, we could raise this with NXP.

Best Regards,
Bruno

Hello @bruno.tx,
Thank you for the response.

I am using the i.MX95 EVK.

The BSP version is 6.12.20.
I checked in /proc/version:

Linux version 6.12.20-lts-next-gdfaf2136deb2 (oe-user@oe-host) (aarch64-poky-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.44) #1 SMP PREEMPT Wed Jun 4 10:15:09 UTC 2025

Yes, I used the same tool to convert the mobilenet.tflite file. When converting the model with the MCU_SDK_25.09.00+Linux_6.12.34_2.1.0 converter, the output looks more useful, even if it is not the same as with the CPU model. However, when I use the Linux_6.6.3_1.0.0 converter, the same issue happens as with the custom model. The output changes depending on the converter version:

NPU (Linux_6.6.3_1.0.0):

— Inference Results —
Inference Time (average): 0.50 ms
Model Output: (1, 1001)
[[0 0 0 … 0 0 0]]
Non-zero class scores:
All class scores are zero!

NPU (MCU_SDK_25.09.00+Linux_6.12.34_2.1.0):

— Inference Results —
Inference Time (average): 2.00 ms
Model Output: (1, 1001)
[[0 6 0 … 0 0 0]]
Non-zero class scores:
Class ID: 1 Score: 6
Class ID: 114 Score: 6
Class ID: 117 Score: 6
Class ID: 118 Score: 6
Class ID: 123 Score: 6
Class ID: 125 Score: 6
Class ID: 242 Score: 6
Class ID: 245 Score: 6
Class ID: 246 Score: 6
Class ID: 250 Score: 6
Class ID: 251 Score: 6
Class ID: 253 Score: 6
Class ID: 256 Score: 6
Class ID: 257 Score: 6
Class ID: 370 Score: 6
Class ID: 373 Score: 6
Class ID: 374 Score: 6
Class ID: 379 Score: 6
Class ID: 381 Score: 6
Class ID: 457 Score: 6
Class ID: 498 Score: 6
Class ID: 501 Score: 6
Class ID: 502 Score: 6
Class ID: 507 Score: 6
Class ID: 508 Score: 5
Class ID: 509 Score: 6
Class ID: 513 Score: 6
Class ID: 514 Score: 6
Class ID: 626 Score: 6
Class ID: 629 Score: 6
Class ID: 630 Score: 6
Class ID: 635 Score: 6
Class ID: 637 Score: 6
Class ID: 709 Score: 3
Class ID: 754 Score: 6
Class ID: 757 Score: 6
Class ID: 758 Score: 6
Class ID: 763 Score: 6
Class ID: 765 Score: 6
Class ID: 882 Score: 6
Class ID: 885 Score: 6
Class ID: 886 Score: 6
Class ID: 890 Score: 6
Class ID: 891 Score: 6
Class ID: 893 Score: 2

Predicted class ID: 1
Predicted class score: 6
Min score: 0 Max score: 6

CPU:

— Inference Results —
Inference Time (average): 116.31 ms
Model Output: (1, 1001)
[[0 0 0 … 0 0 0]]
Non-zero class scores:
Class ID: 439 Score: 18
Class ID: 441 Score: 1
Class ID: 454 Score: 1
Class ID: 496 Score: 1
Class ID: 505 Score: 3
Class ID: 506 Score: 7
Class ID: 527 Score: 2
Class ID: 528 Score: 1
Class ID: 544 Score: 1
Class ID: 549 Score: 1
Class ID: 551 Score: 1
Class ID: 554 Score: 1
Class ID: 573 Score: 15
Class ID: 599 Score: 2
Class ID: 605 Score: 112
Class ID: 620 Score: 11
Class ID: 627 Score: 1
Class ID: 630 Score: 1
Class ID: 645 Score: 1
Class ID: 665 Score: 2
Class ID: 697 Score: 6
Class ID: 712 Score: 13
Class ID: 726 Score: 1
Class ID: 738 Score: 2
Class ID: 744 Score: 1
Class ID: 774 Score: 3
Class ID: 783 Score: 8
Class ID: 800 Score: 1
Class ID: 805 Score: 2
Class ID: 847 Score: 5
Class ID: 852 Score: 3
Class ID: 895 Score: 1
Class ID: 899 Score: 3
Class ID: 900 Score: 5
Class ID: 906 Score: 3
Class ID: 908 Score: 1
Class ID: 967 Score: 3

Predicted class ID: 605
Predicted class score: 112
Min score: 0 Max score: 112

Python inference code:

```
import time

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

from tensorflow.lite.python.interpreter import Interpreter
from tensorflow.lite.python.interpreter import load_delegate

from tqdm import tqdm

MODEL_PATH_NPU = './mobilenet_v1_1.0_224_quant_IMX_95.tflite'
MODEL_PATH_CPU = './mobilenet_v1_1.0_224_quant.tflite'
IMAGE_PATH = './train.jpg'

USE_NPU_DELEGATE = True
NPU_DELEGATE_PATH = '/usr/lib/liblitert_neutron_delegate.so'

INFERENCE_WARMUP_STEPS = 10
INFERENCE_STEPS = 10


class TFLiteModel:

    def __init__(self, model_path, use_npu_delegate, npu_delegate_path):
        # Load the model either with the Neutron NPU delegate or on the CPU.
        if use_npu_delegate:
            ext_delegate = [load_delegate(npu_delegate_path)]
            self.interpreter = Interpreter(model_path=model_path, experimental_delegates=ext_delegate)
        else:
            self.interpreter = Interpreter(model_path=model_path)

        self.interpreter.allocate_tensors()

        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()

    def load_image(self, image_path, input_mean, input_std):
        # Resize the image to the model's input resolution.
        # Note: input_mean and input_std are currently unused because the
        # quantized model expects raw uint8 pixel values.
        image_height = self.input_details[0]['shape'][1]
        image_width = self.input_details[0]['shape'][2]

        img = Image.open(image_path).resize((image_width, image_height))

        img_array = np.array(img)
        input_data = np.expand_dims(img_array, axis=0)

        return input_data

    def infer(self, input_tensor):
        self.interpreter.set_tensor(self.input_details[0]['index'], input_tensor)
        self.interpreter.invoke()
        output_data = self.interpreter.get_tensor(self.output_details[0]['index'])
        return output_data


if __name__ == '__main__':
    if not USE_NPU_DELEGATE:
        MODEL_PATH = MODEL_PATH_CPU
    else:
        MODEL_PATH = MODEL_PATH_NPU

    tf_lite_model = TFLiteModel(MODEL_PATH, USE_NPU_DELEGATE, NPU_DELEGATE_PATH)
    input_tensor = tf_lite_model.load_image(IMAGE_PATH, input_mean=[0.485, 0.456, 0.406], input_std=[0.229, 0.224, 0.225])

    # Warm up
    for _ in tqdm(range(INFERENCE_WARMUP_STEPS), desc="Warming up", total=INFERENCE_WARMUP_STEPS):
        tf_lite_model.infer(input_tensor)

    # Actual inference
    average_inference_time = 0

    for _ in tqdm(range(INFERENCE_STEPS), desc="Inference", total=INFERENCE_STEPS):
        start_time = time.time()
        output_data = tf_lite_model.infer(input_tensor)
        stop_time = time.time()
        average_inference_time += (stop_time - start_time) * 1000
    average_inference_time /= INFERENCE_STEPS

    print('\n--- Inference Results ---')

    print(f'Inference Time (average): {average_inference_time:.2f} ms')
    print(f'Model Output: {output_data.shape} \n {output_data}')

    # List all non-zero class scores
    print('Non-zero class scores:')
    number_of_zero_scores = 0
    for class_id, score in enumerate(output_data[0]):
        if np.abs(score) > 1e-6:
            print(f'    Class ID: {class_id}   Score: {score}')
        else:
            number_of_zero_scores += 1

    if number_of_zero_scores == len(output_data[0]):
        print('    All class scores are zero!')
    else:
        id = np.argmax(output_data)
        print(f'\nPredicted class ID: {id}')
        print(f'Predicted class score: {output_data[0][id]}')
        print(f'Min score: {np.min(output_data)}    Max score: {np.max(output_data)}')

    print('-------------------------\n')

    # Plot a heat map with the logits for each class
    class_width_pixels = 100
    plt.figure(figsize=(20, 4))
    plt.imshow(output_data, aspect='auto', cmap='viridis', extent=[0, output_data.shape[1]*class_width_pixels, 0, 1])
    plt.colorbar()
    plt.xlabel('Class ID')
    plt.title('Model Output Scores per Class')
    plt.savefig(f'model_output_scores_{"NPU" if USE_NPU_DELEGATE else "CPU"}.png')
    plt.close()
```

Hi @TobiasAtAGI,

I have not tried your custom model yet. However, I have been experimenting with the sample TensorFlow Lite model (/usr/bin/tensorflow-lite-2.xx.2/example/mobilenet_v1_1.0_224_quant.tflite) from NXP across different NXP BSP and converter versions. I copied the MobileNet model from the runtimes of different BSP versions on the i.MX95 to my PC, converted it with the neutron-converter from different SDK versions in eIQ_Toolkit_v1.17.0, and copied the converted models back to the module.

One thing I noticed is that the model converted with MCU_SDK_25.09.00+Linux_6.12.34_2.1.0 does not work on BSP 6.12.34 or on BSP 6.12.20 on the NPU.
This is the result I get when trying to run the model converted with MCU_SDK_25.09.00+Linux_6.12.34_2.1.0 on BSP 6.12.34 for NPU:

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m mobilenet_v1_1.0_224_quant_neutron61234.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib/libneutron_delegate.so 
INFO: Loaded model mobilenet_v1_1.0_224_quant_neutron61234.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 1 nodes delegated out of 4 nodes with 1 partitions.

INFO: Neutron delegate version: v1.0.0-be8bf399, zerocp enabled.
INFO: Applied EXTERNAL delegate.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 0.392 ms

If I try to run the same converted model on BSP 6.12.20, I get a warning for microcode version mismatch:

 # ON NPU (converted with MCU_SDK_25.09.00+Linux_6.12.34_2.1.0/)

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.18.0/examples# ./label_image -m mobilenet_v1_1.0_224_quant_neutron61234.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib/libneutron_delegate.so 
INFO: Loaded model mobilenet_v1_1.0_224_quant_neutron61234.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 1 nodes delegated out of 4 nodes with 1 partitions.

INFO: Neutron delegate version: v1.0.0-a5d640e6, zerocp enabled.
INFO: Applied EXTERNAL delegate.
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Warning: microcode version mismatch! 0xaf140cf5 (expected 0xcebb80a)
INFO: invoked
INFO: average time: 0.36 ms

The model converted with MCU_SDK_25.06.00+Linux_6.12.20_2.0.0 shows some results on the corresponding BSP 6.12.20; however, the results are not correct:

# ON NPU (converted with MCU_SDK_25.06.00+Linux_6.12.20_2.0.0/)

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.18.0/examples# ./label_image -m mobilenet_v1_1.0_224_quant_neutron.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib/libneutron_delegate.so 
INFO: Loaded model mobilenet_v1_1.0_224_quant_neutron.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 1 nodes delegated out of 4 nodes with 1 partitions.

INFO: Neutron delegate version: v1.0.0-a5d640e6, zerocp enabled.
INFO: Applied EXTERNAL delegate.
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 1.851 ms
INFO: 0.0313726: 858 throne
INFO: 0.0313726: 850 teapot
INFO: 0.0313726: 766 rocking chair
INFO: 0.0313726: 730 plate rack
INFO: 0.0313726: 722 pillow

You posted your results from NPU inference on BSP 6.12.20, with a model converted with MCU_SDK_25.09.00+Linux_6.12.34_2.1.0 - are you sure this result is accurate? To the best of my knowledge, the versions of BSP and SDK that works as of today is Linux_6.6.36_2.1.0. So our recommendation is to try it out on this BSP and a converted model with the corresponding SDK. For the newer BSPs, there are some known issues and we are in contact with NXP to find a solution.