Verdin iMX95 Neutron Converter "No more TCM space for data"

We are using a NXP i.MX 95 Computer on Module (Verdin iMX95) connected to a Dahlia Carrier Board with HDMI Adapter.

The BSP Version is:
Linux version 6.6.101-7.4.0-devel

We are trying to run a custom deep learning model on the system. The model was originally trained in PyTorch and converted to a quantized TFLite model following the NXP guide (Deploy PyTorch Models to NXP Supported Inference Engines Using ONNX).

As far as we understand, with BSP version 6.6 we do not need to convert the model for a specific BSP version using the Neutron Converter (neutron-converter.exe).

So we can run inference with the quantized model directly. To test the model performance, we use the provided ./benchmark_model:

root@verdin-imx95-12594073:/usr/bin/tensorflow-lite-2.16.2/examples# ./benchmark_model --graph=/root/lld_ops15_640.tflite --external_delegate_path=/usr/lib/libneutron_delegate.so --verbose=true

When we run this, we get a warning:

WARNING: De-extracting subgraph ‘subgraph_031’ because of a failure!

WARNING: The failure occurred with message ‘No more TCM space for data.’ at line 4241 in file ‘/home/osc/jiwei/neutron/neutron-software-dev/neutron-converter/src/NeutronResourceManager.cpp’ in domain ‘Generator’.

And zero nodes are delegated to the NPU:

INFO: NeutronDelegate delegate: 0 nodes delegated out of 44 nodes with 0 partitions.

If we run the same tests with the built-in model mobilenet_v1_1.0_224_quant.tflite, there are no warnings and the nodes get delegated to the NPU.

The lld_ops15_640.tflite model takes a 640x640 RGB image as input, uses a ResNet18 backbone with some fully connected layers, and has about 11 million parameters. Since the NXP guide only mentions ResNet50 with an input size of 224x224x3, we also tried a version with only the backbone model, downscaled to a 224x224x3 input. All with the same results.

The models can be downloaded:
lld_ops15_640.tflite
lld_ops15_640_backbone.tflite
lld_ops15_224.tflite
lld_ops15_224_backbone.tflite

Thank you for any suggestions on how to solve this issue.

Hi, I’m the Toradex AI Assistant.

I might make mistakes or occasionally misunderstand you. Our support team will step in if needed.


Let me help you troubleshoot this issue with your custom deep learning model on the Verdin iMX95.

Understanding the TCM Space Issue

The error message “No more TCM space for data” indicates that your model requires more Tightly Coupled Memory (TCM) than is available on the Neutron-S NPU (1). The TCM is a dedicated memory space used by the NPU for efficient processing, and when it’s exhausted, the delegate falls back to CPU execution.

Key Differences in Model Requirements

The reason your custom models fail while the built-in mobilenet_v1_1.0_224_quant.tflite works is likely due to:

  1. Model Size: Your models have 11 million parameters, compared to MobileNet v1's roughly 4.2 million
  2. Input Size: 640x640 inputs require significantly more memory than 224x224
  3. Architecture Complexity: ResNet18 with fully connected layers may have memory allocation patterns that don’t fit well in the available TCM
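
To make the first two points concrete, here is a back-of-envelope calculation in plain Python (a sketch of the memory arithmetic only; the converter's actual tiling strategy is more involved). With int8 quantization, each element takes one byte:

```python
def activation_bytes(height, width, channels, bytes_per_elem=1):
    """Size of one feature map in bytes (int8 quantization: one byte per element)."""
    return height * width * channels * bytes_per_elem

small = activation_bytes(224, 224, 3)  # input of the working MobileNet example
large = activation_bytes(640, 640, 3)  # input of lld_ops15_640.tflite

print(f"224x224x3 input: {small / 1024:.0f} KiB")  # 147 KiB
print(f"640x640x3 input: {large / 1024:.0f} KiB")  # 1200 KiB
print(f"ratio: {large / small:.1f}x")              # 8.2x

# On top of the activations, the ~11 million int8 weights of the ResNet18-based
# model add roughly 11 MB that must be staged through the NPU's on-chip TCM.
```

Early convolutional feature maps scale the same way, so a 640x640 input multiplies every intermediate buffer by roughly the same factor.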

Running Examples on iMX95

For the iMX95 platform, the documentation shows the correct command format (1):

# ./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib64/libneutron_delegate.so

Note that on iMX95, the delegate path should be /usr/lib64/libneutron_delegate.so rather than /usr/lib/libneutron_delegate.so (1).

Potential Solutions

1. Model Optimization

  • Reduce Input Size: Try further reducing your input size below 224x224 if your use case allows
  • Model Pruning: Remove unnecessary layers or reduce the number of parameters
  • Quantization Optimization: Ensure your quantization is optimal for the NPU

2. Architecture Considerations

Based on community discussions, there are known issues with fully connected layers in the Neutron converter (2). Consider:

  • Replacing fully connected layers with convolutional equivalents where possible
  • Simplifying the head of your network
  • Using depthwise separable convolutions instead of standard convolutions
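
As an illustration of the first bullet, here is a plain-Python sketch (hypothetical tiny dimensions, no ML framework) of why that rewrite is lossless: a fully connected layer applied to a C-dimensional vector computes exactly the same values as a 1x1 convolution over a 1x1xC feature map.

```python
def fully_connected(x, w, b):
    # x: [c_in], w: [c_out][c_in], b: [c_out]
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def conv1x1(fmap, w, b):
    # fmap: [h][w][c_in]; a 1x1 kernel only mixes channels at each position,
    # so per pixel it is the same dot product as the fully connected layer
    return [[fully_connected(pixel, w, b) for pixel in row] for row in fmap]

x = [1.0, -2.0, 0.5]                     # hypothetical 3-channel input
w = [[0.2, 0.1, -0.3], [1.0, 0.0, 0.5]]  # 2 output channels
b = [0.1, -0.2]

fc_out = fully_connected(x, w, b)
conv_out = conv1x1([[x]], w, b)[0][0]    # vector wrapped as a 1x1 feature map
assert fc_out == conv_out
```

In a real model this means replacing an nn.Linear head with an equivalent 1x1 Conv2d before export, which tends to map onto convolution-oriented accelerators more reliably.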

3. Memory Management

  • Try processing smaller batches or tiles of your input image
  • Consider splitting your model into multiple smaller models if architecturally feasible

4. Verification Steps

  1. Test with even smaller input sizes (e.g., 128x128) to confirm TCM constraints
  2. Try running just the backbone without the fully connected layers
  3. Compare memory usage patterns with working models

Framework Information

You’re using the correct approach with TensorFlow Lite 2.16.2 and the Neutron delegate for iMX95 (1). The fact that you can successfully run the MobileNet example confirms your setup is correct.

The issue appears to be specifically related to your model’s memory requirements exceeding the NPU’s TCM capacity, which is a hardware limitation rather than a software configuration issue.

Best regards, Toradex AI Assistant.



If I answered your question, please tick the “Solution” checkbox.

For follow-up questions to me, please use @ToradexAI in your post.

Hello @TobiasAtAGI,

Thanks for sending your tflite models. After analysing one of them in Netron, it looks correctly quantized, so I would say your conversion from PyTorch to a tflite model looks correct.

Now, regarding converting the model with neutron-converter: as you figured out, this step is not necessary for BSP version 6.6 (I believe for BSP >= 6.6.52), so it is not needed on the version you have, which is 6.6.101. I tested the standard mobilenet model in the examples directory, both on CPU and with NPU delegation, without any Neutron conversion on a Verdin iMX95 module running Toradex BSP 7.4.0 with the 6.6.94 kernel, and that worked fine.

Testing your tflite model on the above-mentioned BSP version on a Verdin iMX95 SoM with Neutron delegation, I see the same "Out of TCM memory!" warning message printed out. When I try to convert the model with the Neutron converter from eIQ SDK 1.14 (/opt/nxp/eIQ_Toolkit_v1.14.0/neutron-tuning/neutron-converter, which is supposed to work with NXP BSP 6.6.52), I get the warning "WARNING: None of the operators from the graph is supported by Neutron!":

rudhi@rudhi-nb:/opt/nxp/eIQ_Toolkit_v1.14.0/neutron-tuning$ ./neutron-converter --input lld_ops15_224.tflite --output lld_ops15_224_neutron.tflite --target imx95 --dump-statistics
WARNING: Deltas for subsystem are deactivated for now.
Converting model with the following options:
  Input  = lld_ops15_224.tflite
  Output = lld_ops15_224_neutron.tflite
  Target = imx95
WARNING: De-extracting subgraph 'subgraph_031' because of a failure!
======================================================== Model: ========================================================

Performance estimates:
	Clock Frequency: 0.00 MHz
	Clock cycles per inference: 0
	Latency per inference: -nan ms 
	Inferences per second: -nan

Memory  footprint: 
	Variables size: 0.000000 MB
	Constants size: 0.000000 MB
	Microcode size: 0.000000 MB

Conversion statistics:
  Number of operators after import    = 44
  Number of operators after optimize  = 57
    Number of operators converted     = 0
    Number of operators NOT converted = 57
  Number of operators after extract   = 57
    Number of Neutron graphs          = 0
    Number of operators NOT converted = 57
  Operator conversion ratio           = 0 / 57 = 0
WARNING: None of the operators from the graph is supported by Neutron!
Time for optimization = 0.0183315 (seconds)
Time for extraction   = 0.00630146 (seconds)
Time for generation   = 0.936474 (seconds) 

This gives me the indication that your model (specifically, its operators) is not supported by this BSP version. So I tried to run your model on a higher NXP BSP version, 6.12 Walnascar. However, we do not have a Walnascar BSP from Toradex for Verdin iMX95 yet, so I had to rely on the NXP BSP and the Verdin iMX95 EVK. I can see some promising results there: the model needs to be converted with neutron-converter from a compatible eIQ SDK version (I used SDK 2.2.3) for the NPU delegation to work.

I converted the model with eIQ SDK 2.2.3:
rudhi@rudhi-nb:~/Toradex/ML_Projects/eiq-neutron-sdk-linux-2.2.3-ext/eiq-neutron-sdk-linux-2.2.3/bin$ ./neutron-converter --input lld_ops15_224.tflite --output lld_ops15_224_neutron.tflite --target imx95 --dump-statistics

Infer on CPU:

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m lld_ops15_224.tflite -i grace_hopper.bmp -l labels.txt
INFO: Loaded model lld_ops15_224.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 32.53 ms
INFO: 0.565003: 0 background

Infer on NPU (model converted with SDK 2.2.3 as mentioned above):

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m lld_ops15_224_neutron.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib/libneutron_delegate.so
INFO: Loaded model lld_ops15_224_neutron.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 1 nodes delegated out of 9 nodes with 1 partitions.

INFO: Neutron delegate version: v1.0.0-d98743a7, zerocp enabled.
INFO: Applied EXTERNAL delegate.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 0.249 ms
INFO: 0.500431: 0 background

Infer on NPU with the default NON-converted model:

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m lld_ops15_224.tflite -i grace_hopper.bmp -l labels.txt --external_delegate_path=/usr/lib/libneutron_delegate.so
INFO: Loaded model lld_ops15_224.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 0 nodes delegated out of 44 nodes with 0 partitions.

INFO: Applied EXTERNAL delegate.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 32.48 ms
INFO: 0.565003: 0 background

Looking at the average time printed out in the output, it is clear that the converted model is getting delegated correctly to the NPU.

I think if you really want to get this model working on a BSP with kernel 6.6, you need to use a backbone model that enables the Neutron optimizations. In the eIQ Toolkit User Guide that comes with eIQ 1.14 (which they say is supported by BSP 6.6.52), I see the following list of models with such optimizations enabled:

Not sure if you are already using one of these models.

Otherwise, the option would be to wait until Toradex releases a Walnascar BSP image for Verdin iMX95, which should be sometime soon.

@rudhi.tx Thank you for the response.

It is good to know that the PyTorch → TensorFlow Lite conversion is not the problem.

I have used our i.MX95 EVK with BSP version

Linux version 6.12.49-lts-next-gdf24f9428e38

to run the experiments you proposed. I used the Neutron Converter version 2.2.3 as you specified and got the same results as you:

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m /root/liquid_level_detector_ops15_224_imx95.tflite --external_delegate_path=/usr/lib/libneutron_delegate.so
INFO: Loaded model /root/liquid_level_detector_ops15_224_imx95.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 1 nodes delegated out of 9 nodes with 1 partitions.
INFO: Neutron delegate version: v1.0.0-f24d08e5, zerocp enabled.
INFO: Applied EXTERNAL delegate.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 0.247 ms
INFO: 0.498833: 0 background

However, when I look at the output logits of the model, they all have the same value (which is wrong), and they do not change even when the input image is changed.

Running the same script with the original (non-converted) model produces the correct output.

(Python Script)
test_model.py (9.1 KB)

This can also be observed with the built-in model.

Original Model (Correct Output):

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m /root/mobilenet_v1_1.0_224_quant.tflite --external_delegate_path=/usr/lib/libneutron_delegate.so
INFO: Loaded model /root/mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 0 nodes delegated out of 31 nodes with 0 partitions.

INFO: Applied EXTERNAL delegate.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 16.768 ms
INFO: 0.764706: 653 military uniform
INFO: 0.121569: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

Converted Model with the specified Neutron Converter:

root@imx95-19x19-verdin:/usr/bin/tensorflow-lite-2.19.0/examples# ./label_image -m /root/mobilenet_imx95.tflite --external_delegate_path=/usr/lib/libneutron_delegate.so
INFO: Loaded model /root/mobilenet_imx95.tflite
INFO: resolved reporter
INFO: EXTERNAL delegate created.
INFO: NeutronDelegate delegate: 1 nodes delegated out of 5 nodes with 1 partitions.

INFO: Neutron delegate version: v1.0.0-f24d08e5, zerocp enabled.
INFO: Applied EXTERNAL delegate.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 0.294 ms

We have already raised this issue with NXP (NXP Community).

I will look into the models you have recommended and will do some more experiments with these. We currently use a ResNet18 model as the backbone, which is just a smaller version of the ResNet50 model, and since no limitations are mentioned, it should be supported. We just add three fully connected layers, batch normalization, ReLU, and a softmax activation. The softmax function is not supported, but it is not executed on the NPU anyway, and the batch normalization gets folded into the fully connected layers when quantizing the model. So there should not be any unsupported layers in the model.
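
The batch-norm folding mentioned above can be sketched in plain Python (hypothetical single-channel numbers; real folding is done per output channel): with scale s = gamma / sqrt(var + eps), the folded parameters are w' = w*s and b' = (b - mean)*s + beta, after which BN no longer exists as a separate operator.

```python
import math

def linear(x, w, b):
    return w * x + b

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return (y - mean) / math.sqrt(var + eps) * gamma + beta

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold the BN statistics into the preceding layer's weight and bias
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# hypothetical single-channel numbers
x, w, b = 2.0, 0.5, 0.1
gamma, beta, mean, var = 1.5, -0.3, 0.4, 0.25

reference = batchnorm(linear(x, w, b), gamma, beta, mean, var)
w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var)
folded = linear(x, w_folded, b_folded)
assert abs(reference - folded) < 1e-9  # both paths compute the same output
```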

Update: We have changed the backbone model from ResNet-18 to ResNet-50 and retrained the model. We also lowered the input image size from 640 × 640 px to 320 × 320 px. With this model, the conversion works, and 76 out of 87 nodes are delegated to the NPU (BSP version Linux 6.6.101). Since this model is much larger than the old one, the setup time is very long (about 400 seconds). The inference time is excellent at about 160 ms, and the output is also accurate.

This approach works for the immediate future. However, the setup time is a major pain point and would force our users to wait 6–7 minutes after startup for the ML models to work, which is unacceptable in the long term. With lower input resolutions (224 × 224 px), this time is reduced. However, we have not yet managed to train a model at this resolution (the model does not converge). I suspect that this long setup time is due to the Neutron converter needing to convert this large model at startup. Since there are no changes to the model, is there a way to cache the converted model to reduce startup time? Or are there any other optimizations we could apply?

Thanks for any suggestions.

Hi @TobiasAtAGI,

Apologies for the delay. Somehow, I overlooked the follow-up questions in your reply. It’s good to hear that the situation improved by using a supported backbone model (ResNet-50).

I just checked the machine learning user guide from NXP to see if they mention any caching for the Neutron delegate. Unfortunately, for Neutron, I can't find any. They do support caching for the GPU/NPU on the i.MX 8 series and for the i.MX 93 Ethos-U series (explained in sections 7.1.3 and 7.2.6.4) to improve the hardware accelerators' warmup time. I think you need something similar for Neutron. I can check with NXP to see if they have anything for Neutron in this regard.

Another thing you could try is NPU performance tuning, explained in section 7.4.3. Please let me know if you see any improvement in case you happen to test that.

FYI, I have asked this question on NXP community forum: https://community.nxp.com/t5/Other-NXP-Products/Long-Neutron-setup-time-for-large-TFLite-model-on-Neutron-NPU-on/td-p/2302332

Hi @rudhi.tx

Thank you for the response. No worries.

Thanks for raising the issue with NXP. Let us see what they come back with. We will also have a look at the NPU performance tuning in the meantime.

Best Regards