NPU Assignment Degradation and Inference Time Increase When Changing MobileNetV2 Input Resolution

Dear Toradex Support Team,

We are running MobileNetV2 inference (Keras → TFLite) on the i.MX95 and have observed large differences in NPU operator assignment depending on input resolution. We would like to confirm whether this is a specification constraint or an issue.

Observed Phenomenon

When using MobileNetV2:

At 224×224 input resolution:

  • NPU conversion proceeds without issues

  • High NPU conversion ratio (121/127 operators)

  • Short inference time (CPU: 37.21ms → NPU: 3.17ms)

At 320×320 or higher input resolution:

  • NPU conversion ratio decreases significantly (many operators fall back to CPU execution)

  • Inference time increases dramatically as a result (CPU: 76.08ms → NPU: 87.38ms)

We confirmed the above differences when only changing the resolution with the same model architecture. (We also observed the same phenomenon with ResNet50V2, indicating this is not model-specific.)

Detailed experimental environment, inference times, and conversion results are shown below.

■ Experimental Environment

| Item | Details |
| --- | --- |
| BSP | 6.6.36_2.1.0 |
| eIQ Toolkit | 1.16.0 |
| Model | MobileNet_v2 (tf.keras) |
| Conversion Method | tensorflow.lite.TFLiteConverter (2.15.0) |
| Quantization Method | Static quantization, INT8 (per-channel) |
| Calibration Data | ImageNet-V2, 100 samples |
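
For reference, the arithmetic behind the per-channel INT8 quantization listed above can be sketched as follows. This is a simplified illustration of what symmetric per-output-channel weight quantization does; `quantize_per_channel` is our own name for illustration, not part of the TFLite converter API:

```python
import numpy as np

# Illustrative sketch (not the converter's actual implementation) of
# symmetric per-channel INT8 weight quantization: one scale per output
# channel, values mapped into [-127, 127].
def quantize_per_channel(weights, axis=-1):
    w = np.moveaxis(weights, axis, 0)                  # channel axis first
    scales = np.abs(w).max(axis=tuple(range(1, w.ndim))) / 127.0
    scales = np.where(scales == 0, 1.0, scales)        # guard all-zero channels
    scales_b = scales.reshape((-1,) + (1,) * (w.ndim - 1))
    q = np.clip(np.round(w / scales_b), -127, 127).astype(np.int8)
    return np.moveaxis(q, 0, axis), scales

# Each column (output channel) gets its own scale, so a large-magnitude
# channel does not crush the resolution of a small-magnitude one.
q, s = quantize_per_channel(np.array([[1.0, -2.0],
                                      [0.5,  4.0]]))
```

Per-channel scales only change the weight encoding, not tensor shapes, which is why the same quantization scheme was used unchanged at both resolutions.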

■ Results Comparison by Resolution

| Metric | 224×224 | 320×320 | Change |
| --- | --- | --- | --- |
| Operator conversion ratio | 82/88 (93.2%) | 22/88 (25.0%) | −73.2% |
| Inference time (CPU) | 37.21 ms | 76.08 ms | ×2.04 |
| Inference time (NPU) | 3.17 ms | 87.38 ms | ×27.6 |
| NPU speedup vs CPU | 11.7× faster | 1.15× slower | performance reversal |

Key Observations:

  • NPU conversion ratio drops dramatically from 93% to 25%

  • NPU inference time increases 27.6×, far exceeding the ~2.04× expected from the compute increase alone ((320/224)² ≈ 2.04)

  • At 320×320, NPU becomes slower than CPU, defeating the purpose of hardware acceleration
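
The mismatch between expected and observed scaling can be checked directly from the measured numbers above:

```python
# Sanity check on the measured timings quoted in the table (ms).
cpu_224, cpu_320 = 37.21, 76.08
npu_224, npu_320 = 3.17, 87.38

# Naive compute scaling from the resolution change alone.
compute_scale = (320 / 224) ** 2                 # ~2.04

cpu_ratio = cpu_320 / cpu_224                    # tracks compute_scale
npu_ratio = npu_320 / npu_224                    # far exceeds it
print(round(compute_scale, 2), round(cpu_ratio, 2), round(npu_ratio, 1))
# → 2.04 2.04 27.6
```

The CPU slowdown matches the naive compute scaling almost exactly, so the 27.6× NPU slowdown cannot be explained by the larger input alone; it points at CPU fallback overhead.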

■ Neutron Converter Log (CLI)

At 320×320, only operators near the output layer are converted to NPU, while early and intermediate layers fall back to CPU execution.

(Converter log screenshot: 224×224)

(Converter log screenshot: 320×320)

Questions

Q1. Are there any upper limits or constraints on input resolution or intermediate tensor sizes for the i.MX95 NPU?

For example:

  • Maximum feature map size (width × height × channels)

  • Total number of elements in convolution input/output tensors

  • Memory bandwidth or internal SRAM block constraints

  • Other undocumented limitations

Q2. Is it possible for specific layers to become NPU-incompatible when resolution increases?

For instance:

  • At 224×224: A certain layer can be assigned to NPU

  • At 320×320: The same layer falls back to CPU execution

Q3. Is this behavior due to undocumented internal specifications or a known issue?

We could not find such constraints in the current documentation. We would like to confirm whether:

  • Unpublished internal specification constraints exist

  • This is a known bug or limitation in BSP 6.6.36_2.1.0

  • There are any workarounds or recommended settings

Hi, I’m the Toradex AI Assistant.

I might make mistakes or occasionally misunderstand you. Our support team will step in if needed.


I’d be glad to help with your NPU performance question! To provide the most accurate guidance for your setup, could you let me know which OS and version you are running? You can use the `tdx-info` command to get detailed information about your hardware and software configuration; more details at Getting Device Information with Tdx-Info | Toradex Developer Center.

Understanding NPU Resolution Constraints

Based on the available documentation, your observations about NPU performance degradation at higher resolutions align with known characteristics of the i.MX95 NPU architecture. The documentation indicates that NPU performance is influenced by several factors related to tensor sizes and memory constraints.

Key NPU Considerations

The documentation shows that NPU acceleration effectiveness can vary significantly based on model characteristics (1). For the i.MX95 platform specifically, the NPU delegate behavior you’re observing suggests internal memory or computational constraints that aren’t fully documented in the public materials.

Performance Comparison Context

Your performance measurements show a pattern consistent with what’s documented for other i.MX platforms. The documentation shows significant performance variations between CPU and NPU execution depending on model characteristics (1). For example, on Verdin iMX8M Plus, NPU performance can range from 3.05 ms to much higher values depending on the specific operations being performed (1).

Resolution-Dependent Behavior

The dramatic drop in NPU conversion ratio from 93% to 25% when moving from 224×224 to 320×320 suggests that the NPU delegate is encountering size-based constraints that force fallback to CPU execution for many operators. This behavior indicates that certain layers exceed internal NPU limitations when tensor dimensions increase.

Recommendations

  1. Verify Platform Configuration: Ensure your NPU delegate is properly configured using the external delegate path as documented (1)

  2. Check Memory Configuration: The documentation indicates that NPU performance can be affected by memory bandwidth and internal constraints

  3. Consider Model Optimization: The significant performance difference suggests that model architecture optimization for the specific NPU constraints might be beneficial
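
For item 1, loading an external delegate with `tflite_runtime` can be sketched as below. The delegate library name and path are assumptions for illustration; check which `.so` your BSP image actually ships before relying on it:

```python
import os

# Assumed delegate path for illustration only; verify the actual library
# name shipped in your BSP image.
NPU_DELEGATE = "/usr/lib/libneutron_delegate.so"

def make_interpreter(model_path, use_npu=True):
    # Imported lazily so the sketch can be read and tested off-target.
    import tflite_runtime.interpreter as tflite
    delegates = []
    if use_npu and os.path.exists(NPU_DELEGATE):
        delegates.append(tflite.load_delegate(NPU_DELEGATE))
    return tflite.Interpreter(model_path=model_path,
                              experimental_delegates=delegates)
```

Timing the same inputs with `use_npu=True` and `use_npu=False` isolates the delegate's contribution and makes CPU-fallback regressions like the one above easy to spot.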

Next Steps

To get more specific guidance about the i.MX95 NPU constraints and potential workarounds for your resolution requirements, I recommend checking the latest NXP i.MX Machine Learning User’s Guide documentation, as it contains detailed information about NPU limitations and optimization strategies that may not be fully covered in the general Toradex documentation.

The behavior you’re observing appears to be related to undocumented internal NPU specifications rather than a software bug, given the consistent pattern across different model architectures.

Best regards, Toradex AI Assistant.



Hi @alpha, unfortunately we don’t know the answers to questions 1 and 2. As the AI recommended, your best bet is to check the NXP i.MX Machine Learning User’s Guide.

That said, regarding question 3, it is a known issue that the NPU still does not perform as well as expected, even on the more recent 6.12.20: https://community.toradex.com/t/inference-degradation-with-quantized-resnet-mobilenetv2-models-on-i-mx95-neutron-npu/29302/6

I recommend trying a B0-silicon i.MX95 and the newest NXP BSP.

Kind regards,
Alvaro.
