Hello dear community,
I am currently benchmarking a camera setup on a Toradex device. Specifically, I am trying to maximize the bandwidth/frame rate of a C++ application that receives frames from the V4L2 driver. I am seeing fewer frames per second (fps) than the same setup achieves with the v4l2-ctl and GStreamer command-line streaming tools.
My setup is as follows:
- Verdin IMX8M Plus on Dahlia Carrier Board
- Yocto-built tdx-reference-multimedia image (OS) with an added BitBake layer for the camera driver and device tree overlays (as described at First Steps with CSI Camera Set 5MP AR0521 Color (Linux) | Toradex Developer Center)
- CSI Camera Set 5MP AR0521 Color (V1.0A)
- Toradex-provided v4l2 driver for ar0521 (meta-toradex-econ/meta-ar0521/recipes-kernel/linux/linux-toradex/0002-add-driver.patch at kirkstone-6.x.y · toradex/meta-toradex-econ · GitHub)
Specifically, using the V4L2 driver interface from a C++ application I can only achieve around 40 fps at Full HD, although 65 fps are possible according to the camera fact sheet and were also confirmed on this setup with the v4l2-ctl and GStreamer command-line tools. In the C++ application, I copy each frame from the V4L2 driver's memory-mapped buffer into local process memory using std::memcpy(). Looking at CPU resource consumption, this single call consumes more than 95% of one core, yielding 40 fps, i.e. around 160 Mbit/s of copied data. However, when I copy random memory blocks of the same size with std::memcpy(), not pointing into V4L2 driver memory, I reach around 2.2 Gbit/s.
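For reference, the 2.2 Gbit/s baseline comes from a plain heap-to-heap copy of the same frame size. A minimal sketch of that benchmark (frame size, pixel depth, and iteration count are illustrative):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t frame_size = 1920 * 1080 * 2;  // Full HD, assuming 2 bytes/pixel
    constexpr int iterations = 500;
    std::vector<uint8_t> src(frame_size, 0x55);  // plain heap memory, not V4L2-mapped
    std::vector<uint8_t> dst(frame_size);

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        std::memcpy(dst.data(), src.data(), frame_size);
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "memcpy throughput: "
              << (8.0 * frame_size * iterations) / seconds / 1e9 << " Gbit/s\n";
    // Read back one byte so the compiler cannot drop the copies entirely
    std::cout << "check byte: " << static_cast<int>(dst.back()) << std::endl;
}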
So the question is whether something slows down the copy when the source of std::memcpy() is V4L2 driver memory. This setup is limited to the V4L2 memory-mapping streaming I/O (3.2. Streaming I/O (Memory Mapping) — The Linux Kernel documentation), since the driver does not report the DMA-buffer or user-pointer modes in its capabilities.
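For completeness, the supported I/O modes can be probed via the capabilities field that VIDIOC_REQBUFS fills in on recent kernels (4.20 and later); a minimal sketch, with the helper name being mine:

#include <cerrno>
#include <cstring>
#include <iostream>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

// Request MMAP buffers and report which other I/O modes the driver claims.
void probe_io_modes(int fd) {
    struct v4l2_requestbuffers req;
    std::memset(&req, 0, sizeof(req));
    req.count  = 4;
    req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    req.memory = V4L2_MEMORY_MMAP;
    if (ioctl(fd, VIDIOC_REQBUFS, &req) == -1) {
        std::cerr << "VIDIOC_REQBUFS failed: " << strerror(errno) << std::endl;
        return;
    }
    // req.capabilities is filled in by drivers since kernel 4.20
    if (!(req.capabilities & V4L2_BUF_CAP_SUPPORTS_USERPTR))
        std::cout << "V4L2_MEMORY_USERPTR not supported" << std::endl;
    if (!(req.capabilities & V4L2_BUF_CAP_SUPPORTS_DMABUF))
        std::cout << "V4L2_MEMORY_DMABUF not supported" << std::endl;
}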
Also, no other process can be observed consuming significant amounts of CPU resources.
Here is the snippet that dequeues a V4L2 driver buffer and copies its contents to local memory; it won't compile as-is but shows the relevant parts:
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>

struct Buffer {
    void*  start;
    size_t length;
};

// --- Setup: map each driver buffer i (after VIDIOC_REQBUFS/VIDIOC_QUERYBUF) ---
struct v4l2_buffer buf;
std::memset(&buf, 0, sizeof(buf));
buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
buf.memory = V4L2_MEMORY_MMAP;
struct v4l2_plane planes[1];
buf.m.planes = planes;
buf.length = 1;
Buffer* buffers = new Buffer[16];
buffers[i].length = buf.m.planes[0].length;
buffers[i].start = mmap(nullptr, buf.m.planes[0].length,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, buf.m.planes[0].m.mem_offset);
uint8_t* local_frame_buffer = new uint8_t[frame_width * frame_height * pixel_size];

// --- Capture loop body: dequeue, copy, requeue ---
if (ioctl(fd, VIDIOC_DQBUF, &buf) == -1) {
    std::cerr << "Error dequeuing buffer: " << strerror(errno) << std::endl;
    break;
}
// Significantly slower if the source is memory mapped (mmap) to a v4l2 buffer
std::memcpy(local_frame_buffer, buffers[buf.index].start, buffers[buf.index].length);
if (ioctl(fd, VIDIOC_QBUF, &buf) == -1) {
    std::cerr << "Error requeuing buffer: " << strerror(errno) << std::endl;
    break;
}
Any advice on what could be the cause here? I suspect it could also be a caching issue. So far, I have tried manual cache invalidation, prefetching, compiler optimisation flags (-march=armv8-a+simd -mtune=cortex-a53), and madvise(), all without success.
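For reference, the madvise() hints were applied right after mmap(), roughly like the following sketch (the chosen advice flags are illustrative):

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <sys/mman.h>

// Advise the kernel about the expected access pattern on the mapped V4L2
// buffer; mmap() returns page-aligned addresses, as madvise() requires.
void advise_mapping(void* start, std::size_t length) {
    if (madvise(start, length, MADV_SEQUENTIAL) == -1 ||
        madvise(start, length, MADV_WILLNEED) == -1) {
        std::cerr << "madvise failed: " << strerror(errno) << std::endl;
    }
}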
Thank you in advance!
Best regards,
Aljoscha