V4L2 camera limitations on Verdin iMX8MP

Hello dear community,

I am currently benchmarking a camera setup on a Toradex device. Specifically, I am trying to maximize the bandwidth/frame rate of a C++ application that receives frames from the v4l2 driver. The application achieves fewer frames per second (fps) than the v4l2-ctl and gstreamer command-line tools do when streaming video on the same setup.

My setup is as follows:

The problem is that my C++ application only reaches around 40 fps (Full HD) through the v4l2 driver interface, although the camera fact sheet specifies 65 fps and I have confirmed that rate on this setup with the v4l2-ctl and gstreamer command-line tools. In the C++ application, I copy each frame from the v4l2 driver memory pointer into local process memory with std::memcpy(). In this setup, that call consumes more than 95% of a single CPU core and yields 40 fps, or around 160 Mbit/s of copied memory. However, when I use std::memcpy() to copy arbitrary memory blocks of the same size that do not point to v4l2 driver memory, I reach around 2.2 Gbit/s of copied memory.
So the question is whether something is blocking the streaming process when v4l2 driver memory is the source for std::memcpy(). This setup can only use the v4l2 memory mapping functionality (see "3.2. Streaming I/O (Memory Mapping)" in the Linux kernel documentation), since DMA-buffer and user pointer modes are not among the driver/camera capabilities.
Also, I could not observe any other process consuming significant amounts of CPU resources.
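A baseline of the kind described above can be reproduced with a short sketch like the following. The function names required_bytes_per_second and memcpy_throughput are hypothetical, and the pixel size passed in is an assumption, since the pixel format is not stated in this post.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bytes per second that must be copied to move every frame once.
// bytes_per_pixel depends on the pixel format (assumption supplied by the caller).
constexpr std::uint64_t required_bytes_per_second(std::uint64_t width,
                                                  std::uint64_t height,
                                                  std::uint64_t bytes_per_pixel,
                                                  std::uint64_t fps) {
    return width * height * bytes_per_pixel * fps;
}

// Copies frame_bytes between two ordinary heap buffers `iterations` times
// and returns the measured throughput in bytes per second.
// This is the "not v4l2 memory" baseline, not a measurement of the driver buffers.
double memcpy_throughput(std::size_t frame_bytes, int iterations) {
    assert(frame_bytes > 0 && iterations > 0);
    std::vector<std::uint8_t> src(frame_bytes, 0xAB);
    std::vector<std::uint8_t> dst(frame_bytes, 0x00);
    volatile std::uint8_t sink = 0;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        src[0] = static_cast<std::uint8_t>(i);  // change the source a little each round
        std::memcpy(dst.data(), src.data(), frame_bytes);
        sink = dst[frame_bytes - 1];            // read the result so the copy is not optimized away
    }
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(frame_bytes) * iterations / seconds;
}
```

For example, required_bytes_per_second(1920, 1080, 2, 40) gives 165,888,000 bytes/s (about 1.33 Gbit/s) for Full HD at 40 fps under an assumed 2 bytes per pixel. That value can be compared with memcpy_throughput(1920 * 1080 * 2, 100) measured on the same machine.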
Here is the snippet that dequeues the v4l2 driver buffer and copies its contents to local memory. It won’t compile as shown, but it contains the relevant parts:

struct v4l2_buffer buf;
std::memset(&buf, 0, sizeof(buf));
buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
buf.memory = V4L2_MEMORY_MMAP;

struct v4l2_plane planes[1];
buf.m.planes = planes;
buf.length = 1;

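// One-time setup for each buffer index i (the surrounding loop and the ioctl that fills 'buf' are omitted)
Buffer* buffers = new Buffer[16];
buffers[i].length = buf.m.planes[0].length;
buffers[i].start = mmap(nullptr, buf.m.planes[0].length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, buf.m.planes[0].m.mem_offset);
// Destination of the copy, in ordinary process memory
uint8_t* local_frame_buffer = new uint8_t[frame_width * frame_height * pixel_size];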

if (ioctl(fd, VIDIOC_DQBUF, &buf) == -1) {
    std::cerr << "Error dequeuing buffer: " << strerror(errno) << std::endl;
    break;
}

// Significantly slower if 'buffers' is memory mapped (mmap) to a v4l2 buffer
std::memcpy(local_frame_buffer, buffers[buf.index].start, buffers[buf.index].length);

if (ioctl(fd, VIDIOC_QBUF, &buf) == -1) {
    std::cerr << "Error requeuing buffer: " << strerror(errno) << std::endl;
    break;
}

Any advice on what could be causing this? I suspect it might also be a caching issue. So far, I have tried manual cache invalidation, prefetching, compiler optimisation flags (-march=armv8-a+simd -mtune=cortex-a53) and madvise().
Thank you in advance!

Best regards,
Aljoscha

Hi, I’m the Toradex AI Assistant. The Toradex Support team will follow up with you shortly!

Hello Aljoscha,

Thank you for reaching out to the community with your question. It sounds like you’re encountering a performance bottleneck when using the v4l2 driver in your C++ application compared to command line tools like v4l2-ctl or GStreamer. Let’s explore some potential areas to investigate:

  1. Memory Mapping and Access:

    • Since you’re using mmap, ensure that the memory is correctly aligned and that your access patterns are optimized for the CPU cache. The high CPU usage suggests that the memory copying process might be inefficient. Consider exploring alternatives to std::memcpy() that could better utilize the CPU’s capabilities.

  2. CPU and Cache Optimization:

    • You’ve already tried manual cache invalidation and prefetching, which is great. Double-check that these operations are correctly implemented, as incorrect cache handling can degrade performance. Additionally, while you’ve used -march=armv8-a+simd and -mtune=cortex-a53, experimenting with different compiler flags or combinations might yield better results.

  3. Driver and Kernel Considerations:

    • Verify that the v4l2 driver and kernel are optimally configured for your use case. Sometimes, driver settings or kernel parameters can impact performance. Ensure you have the latest updates or patches for the driver that might address performance issues.

  4. Benchmarking and Profiling:

    • Use profiling tools to identify where the bottlenecks occur in your application. This can provide insights into whether the issue is with memory copying, CPU usage, or another part of the process.
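As a complement to external profilers, the time spent in each stage of the capture loop can be accumulated inside the application. The following is a minimal sketch under that assumption. StageTimer and the stage names are hypothetical and not part of any V4L2 API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Accumulates wall-clock time per named stage of a loop.
class StageTimer {
public:
    // Runs fn() and adds its duration to the named stage.
    template <typename Fn>
    void measure(const std::string& stage, Fn&& fn) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        Stats& s = stats_[stage];
        s.total += t1 - t0;
        ++s.calls;
    }

    // Number of times the stage was measured (0 if never).
    std::uint64_t calls(const std::string& stage) const {
        const auto it = stats_.find(stage);
        return it == stats_.end() ? 0 : it->second.calls;
    }

    // Mean duration of the stage in milliseconds (0.0 if never measured).
    double mean_ms(const std::string& stage) const {
        const auto it = stats_.find(stage);
        if (it == stats_.end() || it->second.calls == 0) return 0.0;
        const double total_ms =
            std::chrono::duration<double, std::milli>(it->second.total).count();
        return total_ms / static_cast<double>(it->second.calls);
    }

private:
    struct Stats {
        std::chrono::steady_clock::duration total{};
        std::uint64_t calls = 0;
    };
    std::map<std::string, Stats> stats_;
};
```

In a capture loop, the VIDIOC_DQBUF ioctl, the std::memcpy() call and the VIDIOC_QBUF ioctl could each be wrapped in timer.measure("dqbuf", ...), timer.measure("memcpy", ...) and timer.measure("qbuf", ...). Note that with a blocking file descriptor the "dqbuf" stage also includes the time spent waiting for the next frame, so a mean "memcpy" time close to the frame period would point at the copy itself.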

Given the complexity of the issue, a combination of these approaches might be necessary to achieve the desired performance. If the problem persists, further investigation into the specific interactions between your application and the v4l2 driver may be required.

I hope these suggestions help you improve the performance of your application.

Best regards,
Toradex AI Assistant

Please, also check if these links can help you solve your problem:

https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf

Hello AI Assistant,

I have already used gprof and perf for profiling, without gaining any further insight into other processes consuming CPU resources. Please specify how I can (1) ensure that the memory is correctly aligned and that access patterns are optimized for the CPU cache and (2) verify that the v4l2 driver and kernel are optimally configured. Thank you!
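For point (1), the alignment part at least can be checked directly in code. The following is a minimal sketch, assuming a 64-byte cache line (the line size of the Cortex-A53). The names kCacheLine, is_aligned and alloc_frame_buffer are hypothetical. A pointer returned by mmap() is page-aligned, so in practice the check mainly matters for the destination buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumption: 64-byte cache lines, as on the Cortex-A53.
constexpr std::size_t kCacheLine = 64;

// Returns true if p is a multiple of `alignment`, which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Allocates a destination buffer that starts on a cache-line boundary.
// std::aligned_alloc expects the size to be a multiple of the alignment,
// so the requested size is rounded up. Release the buffer with std::free().
inline std::uint8_t* alloc_frame_buffer(std::size_t bytes) {
    const std::size_t rounded = (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
    return static_cast<std::uint8_t*>(std::aligned_alloc(kCacheLine, rounded));
}
```

With this, is_aligned(buffers[buf.index].start, kCacheLine) and is_aligned(local_frame_buffer, kCacheLine) could be checked once before the capture loop starts. A buffer from alloc_frame_buffer() would replace the plain new uint8_t[...] allocation if the latter turns out not to be aligned.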

Best regards,
Aljoscha