How to Improve GPIO Read Speed Using libgpiod for DPRAM Access on Verdin iMX8MP

I’m currently working on a Verdin iMX8MP-based project where I interface a dual-port RAM (DPRAM) using GPIOs. The DPRAM interface has:

  • 16-bit address lines
  • 8-bit data lines
  • Control lines (OE, CE, etc.)

I’m using the libgpiod library in C to read the data from DPRAM. The typical read operation includes:

  1. Writing an address via GPIOs.
  2. Asserting control signals.
  3. Reading 8-bit data via GPIO input lines.

The issue I’m facing is performance. When I read 40,000 address locations, it takes approximately 4 seconds just to complete the read operation. That’s about 100 microseconds per read, which is too slow for my application.

What is the best way to optimize GPIO access speed using libgpiod on the Verdin iMX8MP?
Can pinctrl or GPIO configuration in the device tree (such as slew rate or drive strength) significantly affect speed, and how should I tune it? (In my pinctrl I have used the pad register value 0x00000146.)

Hi, I’m the Toradex AI Assistant.

I might make mistakes. Our support team will step in if needed. If I answered your question, please tick the “Solution” checkbox. For follow-up questions to me, please use @ToradexAI in your post.



Hello! Thank you for your detailed description of the project and your clear explanation of the performance challenge.

GPIO access speed in user space (libgpiod):

The main performance bottleneck is generally not in pin or drive strength configuration, but in the inherent latency of user space GPIO frameworks like libgpiod. Each operation (setting address, toggling control lines, reading data) incurs overhead due to:

  • User-to-kernel context switches
  • Generic, non-optimized access to GPIO controller hardware

On typical Linux platforms (like Verdin iMX8MP), user space GPIO access is slower than direct register access, which limits the rate at which you can use GPIOs for parallel interfaces.

Device Tree Tuning and Its Effects

1. Pad Control Register (Slew Rate, Drive Strength)

  • These settings mainly affect signal integrity and electrical characteristics, not the fundamental CPU/software speed of GPIO toggling.
  • Fast slew rate and higher drive strength can slightly decrease pin rise/fall times and thus can impact signaling at very high speeds, but for standard GPIO toggling via software, the improvement is marginal compared to software overhead.
  • Your register value 0x00000146 can be decoded:
    • Bits 0x6 (DSE=6): Indicates a certain drive strength.
    • Bit 0x40 (FSE): Fast slew rate enabled.
    • These are reasonable electrical settings for fast switching.

2. SION Setting

  • If you need to read back data lines, make sure SION (Software Input On) is set if required, but again, this will not increase software-controlled toggling speed.


Approaches to Significant Speed Improvement

A. Move to Kernel Space Driver or UIO

  • Custom kernel module: By implementing the interface logic in the kernel, you eliminate user-kernel context switching, resulting in orders-of-magnitude faster GPIO access.
  • UIO (User-space I/O) or direct memory-mapped GPIO: If your board supports this securely, you can map the GPIO controller registers into user space for direct, faster access (a minimal sketch follows below).
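For illustration, a minimal user-space sketch of this idea is shown below. It assumes the usual i.MX8M-family GPIO register layout (GPIO1 at physical address 0x30200000, DR at offset 0x00, PSR at offset 0x08) and a hypothetical wiring of the address lines to GPIO1[15:0]; verify the addresses and offsets against the i.MX 8M Plus Reference Manual. /dev/mem access requires root, bypasses the kernel GPIO framework, and should only touch pads that no other driver claims.

```c
/* Sketch only: map i.MX8MP GPIO1 registers into user space via /dev/mem.
 * Base address, offsets, and pin wiring are assumptions - check the RM. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO1_BASE 0x30200000UL   /* assumed GPIO1 physical base */
#define GPIO_DR    0x00           /* data register               */
#define GPIO_PSR   0x08           /* pad status (input) register */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    volatile uint32_t *gpio1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, GPIO1_BASE);
    if (gpio1 == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* After mmap(), register accesses are plain loads/stores - no syscall
     * per GPIO operation. Example: set a test address, sample the inputs. */
    gpio1[GPIO_DR / 4] = (gpio1[GPIO_DR / 4] & ~0xFFFFu) | 0x1234;
    printf("PSR = 0x%08x\n", gpio1[GPIO_PSR / 4]);

    munmap((void *)gpio1, 0x1000);
    close(fd);
    return 0;
}
```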

B. Use a Parallel Bus Peripheral/Interface

  • For high-speed data, the Linux GPIO framework is not designed for bus interfaces. Consider using an FPGA, or using an SoC peripheral more suited for parallel data, like an EIM/LCD bus or FlexIO (if available).

C. Hardware Electrical Optimization (as secondary consideration)

  • You already have fast slew and high drive.
  • Ensure signal traces are short and well-terminated to minimize propagation delay (but this rarely is the bottleneck for software-based GPIO toggling).

Summary Table

| Optimization | Expected Impact | How to Apply |
|---|---|---|
| User-space pinctrl tweak | Minor | Confirm fast slew / high drive (done) |
| Kernel-space driver | Major | Write a kernel module doing the read cycles |
| Direct register access | Major | Possibly via /dev/mem or UIO (if safe/legal in your use case) |
| Use a bus peripheral | Very major | Re-architect HW/SW to use a suitable parallel interface |

Device Tree Example

Your configuration is reasonable:

fsl,pins = <MX8MP_IOMUXC_GPIO1_IO01__GPIO1_IO01 0x146>;


But as stated, this alone will not get you the speed-up you need.

Recommendation: To dramatically increase throughput, move your DPRAM control logic from user space to a custom kernel module, or pursue direct hardware (memory-mapped) access to GPIO registers, if possible.

If you need more details on writing a kernel driver or further tuning, please let me know your software requirements.

Best regards,
Toradex AI Assistant.

Using libgpiod for bit-banging a dual-port RAM (DPRAM) interface is not optimal.
Libgpiod is intended for general-purpose GPIO control and is not suitable for high-speed or timing-critical tasks such as external memory interfacing.
libgpiod introduces substantial overhead due to system calls and user-space/kernel context switching, which typically limits throughput to around 100 µs per read.
While the i.MX8MP’s GPIO controller operates at ~100–200 MHz, bit-banging from user space is typically constrained to 10–50 kHz, mainly due to software bottlenecks.

Recommended Alternatives:

If your application allows, consider using memory devices that support high-speed interfaces such as Quad SPI (QSPI). The i.MX8MP SoC supports QSPI and can offer significantly higher throughput and lower latency than GPIO-based communication.

If DPRAM access is essential, the best approach is to implement a custom kernel driver for GPIO-based bit-banging (a minimal sketch follows after this list). This enables:

  • Direct register access using ioremap()
  • Elimination of user-space overhead
  • Substantially improved performance, typically in the range of 1–10 µs per read — up to 100x faster than user-space GPIO access
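To make the ioremap() path concrete, here is a minimal, hypothetical kernel-module sketch. The GPIO1 base address (0x30200000), the register offsets, and the example address value are assumptions to be checked against the i.MX 8M Plus Reference Manual and your schematic; a real driver would loop over all addresses, drive the control lines, and expose the data to user space (e.g., via a character device).

```c
/* Sketch only: kernel-space GPIO register access for a DPRAM read path.
 * Addresses, offsets, and wiring are hypothetical. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/io.h>

#define GPIO1_BASE  0x30200000UL  /* assumed GPIO1 physical base */
#define GPIO_DR     0x00          /* data register               */
#define GPIO_PSR    0x08          /* pad status (input) register */

static void __iomem *gpio1;

static int __init dpram_demo_init(void)
{
    u32 dr;

    gpio1 = ioremap(GPIO1_BASE, 0x100);
    if (!gpio1)
        return -ENOMEM;

    /* Drive a test address on GPIO1[15:0] (hypothetical wiring), then
     * sample the input pads - no user/kernel transition per access. */
    dr = readl(gpio1 + GPIO_DR);
    writel((dr & ~0xFFFFu) | 0x1234, gpio1 + GPIO_DR);
    pr_info("dpram_demo: PSR = 0x%08x\n", readl(gpio1 + GPIO_PSR));

    return 0;
}

static void __exit dpram_demo_exit(void)
{
    iounmap(gpio1);
}

module_init(dpram_demo_init);
module_exit(dpram_demo_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Hypothetical DPRAM-over-GPIO read sketch");
```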

Optimization Tips (if continuing with libgpiod):

  • Use struct gpiod_line_bulk: Perform batch address/data reads and writes using gpiod_line_set_value_bulk() and gpiod_line_get_value_bulk() to reduce syscall overhead (see the sketch after this list).
  • Check the device tree (DT) configuration: Ensure GPIO pins are configured for fast slew rates.
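As a minimal sketch (assuming libgpiod v1.x, a placeholder chip name, and hypothetical line offsets), the snippet below requests the eight data lines as one bulk object and samples them with a single call; the address and control lines could be handled the same way with gpiod_line_request_bulk_output() and gpiod_line_set_value_bulk().

```c
/* Sketch only: libgpiod v1.x bulk read of 8 DPRAM data lines.
 * Chip name and line offsets are placeholders - adapt to your pinout. */
#include <gpiod.h>
#include <stdio.h>

int main(void)
{
    struct gpiod_chip *chip = gpiod_chip_open_by_name("gpiochip0");
    if (!chip)
        return 1;

    unsigned int data_offsets[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    struct gpiod_line_bulk data_bulk;
    int values[8];
    int ret = 1;

    if (gpiod_chip_get_lines(chip, data_offsets, 8, &data_bulk) < 0)
        goto out;
    if (gpiod_line_request_bulk_input(&data_bulk, "dpram-data") < 0)
        goto out;

    /* One call samples all 8 lines instead of eight separate
     * gpiod_line_get_value() calls. */
    if (gpiod_line_get_value_bulk(&data_bulk, values) == 0) {
        unsigned int byte = 0;
        for (int i = 0; i < 8; i++)
            byte |= (unsigned int)(values[i] & 1) << i;
        printf("data = 0x%02x\n", byte);
        ret = 0;
    }

out:
    gpiod_chip_close(chip);
    return ret;
}
```

Even with bulk calls, each DPRAM read still needs several syscalls (set address, toggle control lines, read data), so this reduces, but does not remove, the user-space overhead.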

Hi @alex.tx ,
Thanks for the response.
Currently, I have shifted the DPRAM read logic to the Cortex-M7 core. I’m using the NXP SDK’s GPIO driver as an example on M7 (RTOS/bare-metal) to read the GPIO-mapped address and data lines connected to the DPRAM. I’ve developed a binary that reads the DPRAM contents via GPIO on M7, and it is working as expected.

I now want to confirm:

  1. Is using the Cortex-M7 core considered a preferred/optimal method for high-speed GPIO-based DPRAM reads, compared to implementing a custom Linux kernel driver on Cortex-A53?
  2. Can I realistically achieve the performance target of reading 64K locations (e.g., 4096 addresses × 16-bit data) in ≤4 milliseconds using the M7 core approach?
  • (That’s approximately 250 nanoseconds per read cycle.)
  • The GPIO lines are directly interfaced to DPRAM address/data/control signals.

Any insights or hardware-specific optimizations (e.g., cache settings, DTCM usage, etc.) you can share would be appreciated.

Thanks & regards,
prem0919

Using GPIO bit-banging for high-speed DPRAM reads is not optimal by default. However, utilizing the Cortex-M7 core (bare metal or under FreeRTOS) definitely has advantages over a Linux driver approach.

Your target of 250 ns per read cycle (including address setup, OE/CE toggling, and data read) looks feasible.

  • i.MX8MP M7 Core: Runs at 800MHz (default), enabling 1.25ns per clock cycle.
  • GPIO Timing:
    • Address setup (~50ns) + OE pulse (~50ns) + Data read (~50ns) = ~150ns/cycle (theoretical).
    • With loop overhead (branching, GPIO writes), 250ns/cycle is realistic.

Optimization Recommendations:

  • Use direct register writes (e.g., GPIOx->DR) instead of SDK functions to minimize overhead (see the sketch after this list).
  • Try to fit the whole M7 code into TCM (Tightly Coupled Memory), the fastest RAM available to the M core.
  • Ensure the DPRAM data buffer is 32-bit aligned for efficient writes.
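As a sketch of the direct-register style (assuming the MCUXpresso SDK device headers; the GPIO instances, bit positions for the address/data/control lines, and the NOP-based wait are hypothetical and must match your actual wiring and the DPRAM's access time):

```c
/* Sketch only: one DPRAM read cycle with direct GPIO register access on the M7.
 * GPIO instances and bit assignments are hypothetical. */
#include "fsl_device_registers.h"

static inline uint8_t dpram_read(uint16_t addr)
{
    /* Drive all 16 address lines with one register write (assumes GPIO1[15:0]). */
    GPIO1->DR = (GPIO1->DR & ~0xFFFFu) | addr;

    /* R/W = 1 (read), then assert CE and OE (active low); assumed GPIO4 bits 0-2. */
    GPIO4->DR |=  (1u << 2);
    GPIO4->DR &= ~((1u << 0) | (1u << 1));

    /* Access-time wait: each NOP is roughly one 1.25 ns cycle at 800 MHz -
     * size this to the DPRAM's tACC from its datasheet. */
    __NOP(); __NOP(); __NOP(); __NOP();

    /* Sample all 8 data lines in one read of the pad status register
     * (assumes the data bus on GPIO3[7:0]). */
    uint8_t data = (uint8_t)(GPIO3->PSR & 0xFFu);

    /* Deassert CE/OE to end the cycle. */
    GPIO4->DR |= (1u << 0) | (1u << 1);

    return data;
}
```

Keeping this loop and its destination buffer in TCM avoids DDR latency in the inner loop, which is what makes a ~250 ns cycle plausible.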

Hi @alex.tx ,
We are able to read the 40 KB of data in ~300 ms, which is not optimal.
As you suggested, I have used direct register writes to set the address and read data from that address.
I can see that toggling a pin takes ~250 ns, which is minimal, so with this we should still be able to achieve less than 50 ms.
It is strange that it is taking ~300 ms.
Could you please suggest where the issue could be?
Flow:

  • Address lines → 16 pins, as digital outputs
  • Data lines → 8 pins, as digital inputs
  • Control lines (CE, OE, RW) → as digital outputs
    Address setup
    CE = 0, OE = 0, R/W = 1
    ndelay(15)
    Read byte at address

regards,
prem0919

Hi @alex.tx ,
From the NXP thread
https://community.nxp.com/t5/i-MX-RT-Crossover-MCUs-Knowledge/RT1060-Normal-GPIO-vs-Fast-GPIO/ta-p/1119856

Can we change GPIO1/2/3/4/5 to GPIO6/7/8/9/10?
As they mentioned, those also have the same pad controls.

Can you provide any examples of using DMA directly to improve the speed?

Waiting for the response!

Thanks & regards,
Prem

@prem0919

The thread you’re referring to discusses the RT1060 SoC, which is significantly different from the i.MX8M Plus you’re using, so the information there is not applicable in this case.

Can we change the gpio1/2/3/4/5 to 6/7/8/9/10?

I’m sorry, but I don’t fully understand your question — could you please clarify what you’re trying to achieve?

As for DMA, it cannot increase GPIO speed; it can only offload CPU usage by handling data transfers in parallel with other processing tasks.
We currently don’t have any specific examples for direct DMA access. For detailed information, please refer to the i.MX8M Plus Reference Manual.

Hello @alex.tx

My goal is to read 40K memory locations within 4ms. However, since I’m still new to MCUXpresso and the Cortex-M7 architecture, I’m not yet fully familiar with the optimization techniques required to achieve this performance. Any suggestions or guidance on how to improve memory access speed would be greatly appreciated.

Thanks & regards,
Prem0919

I don’t have any additional suggestions beyond what I’ve already listed here.

Hi @alex.tx ,

1. Can you provide any information on how to place all the code and data into TCM?
2. Currently the toggling time is ~250 ns, which means the effective toggle rate is only ~2 MHz. Can you please provide information on how to set the clock configuration to achieve at least 133 MHz? (I haven't found any relevant documentation on setting the clock.)

Thanks & regards,
prem0919

Hi @alex.tx ,

The binary is built from the debug/release target, which means it will run from TCM. I used the following commands to boot it:
setenv load_cmd "ext4load mmc 2:2"
setenv cm_image "/lib/firmware/igpio_led_output.bin"
setenv cm_isize 20000
setenv load_cm_image "${load_cmd} ${loadaddr} ${cm_image}"
setenv cm_boot "${load_cm_image}; cp.b ${loadaddr} 0x7e0000 ${cm_isize}; dcache flush; mw.w 0x550ff000 0 64; bootaux 0x7e0000"

Note: I am generating the binary from the SDK's gpio_driver example.

This is done via the linker script (example for i.MX7).

As for your other questions, they are not specific to Toradex modules but rather pertain to the i.MX 8M Plus SoC itself. Therefore, I recommend familiarizing yourself with the official NXP documentation and using the NXP community forums for further support.

Hi @alex.tx ,

Thanks for your response.

I understand that placing code and data into TCM is done via the linker script, but unfortunately, I haven’t been able to successfully modify it. The structure of the linker script is quite difficult for me to follow, and my attempts to move code (.text, .interrupts) to ITCM and data (.data, .bss) to DTCM have resulted in linker errors and overlap issues.

I would really appreciate it if you could provide a working example linker script for the i.MX8MP Cortex-M7 that:

  • Places code and interrupt vectors into ITCM (0x00000000)
  • Places data, BSS, stack into DTCM (0x20000000)

Even a minimal sample would help me understand how this is done correctly.

Thanks again for your support!

Since your questions are not specific to Toradex modules but rather pertain to the i.MX 8M Plus SoC itself, I recommend familiarizing yourself with the official NXP documentation and using the NXP community forums for further support. If you have any Toradex-specific questions (e.g., carrier board design, module configuration, etc.), feel free to ask here!