Real-time scheduler in TorizonCore container

I’m working with a Verdin iMX8M Plus. It has TorizonCore (PREEMPT_RT) 5.7.0 installed. We have a container based on torizon/weston-vivante:2.

We have a real-time embedded application we’re running in that container. We have a thread that appears to be getting starved and dropping a lot of sensor data. Looking around for the right way to get real-time working, I found this:

It was referenced from another Toradex community thread. Per that, I tried adding

    --cpu-rt-runtime=950000 \
    --ulimit rtprio=99 \
    --cap-add=sys_nice \

to our container startup. And the cpu-rt-runtime option is causing this response:

docker: Error response from daemon: Your kernel does not support CPU real-time scheduler.

Per that, the TorizonCore PREEMPT_RT version we’re using doesn’t support real-time?! Isn’t that what the PREEMPT_RT versions are for?

Hi, I’m the Toradex AI Assistant. I’m in early beta, and I would appreciate your feedback.
The Toradex Support team will follow up with you shortly!




The TorizonCore with PREEMPT_RT is indeed designed to support real-time operations, and the PREEMPT_RT patch set is specifically aimed at reducing the maximum latency of the system. The error you’re encountering, “Your kernel does not support CPU real-time scheduler,” suggests there might be a misunderstanding or a misconfiguration rather than a lack of real-time capabilities in the kernel itself.

Firstly, ensure that your TorizonCore version is correctly installed and that the PREEMPT_RT patch is active. You can verify the PREEMPT_RT status by checking the kernel version with the command uname -a and looking for the PREEMPT_RT string in the output. This confirms that the real-time kernel is running.

The Docker runtime options you’re trying to use, such as --cpu-rt-runtime=950000, are correct for setting real-time scheduling policies for containers. However, the error message indicates that the Docker daemon might not be recognizing the real-time capabilities of your kernel. This could be due to several reasons:

  1. Docker Version: Ensure your Docker version supports these real-time options. Docker’s support for real-time scheduling has evolved, and ensuring you’re on a version that supports these flags is crucial.

  2. Kernel Configuration: The kernel needs to be configured with certain options to support real-time scheduling. Specifically, CONFIG_RT_GROUP_SCHED should be enabled. You can check the current kernel configuration by looking at /proc/config.gz (if available) or the kernel configuration file used during the build process.

  3. Docker Daemon Configuration: The Docker daemon itself might need to be configured to allow real-time scheduling. This can involve editing the Docker daemon configuration file (usually found at /etc/docker/daemon.json) to ensure it’s not restricting real-time scheduling capabilities (see the sketch after this list).

  4. Security and Capabilities: The error message and your Docker run command suggest you’re on the right track with --cap-add=sys_nice. However, ensure that no other security profiles or Docker daemon configurations are inadvertently blocking real-time scheduling capabilities.
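
For point 3, a minimal sketch of what that daemon-level configuration could look like, assuming your Docker version supports the cpu-rt-period / cpu-rt-runtime daemon options (worth verifying against the dockerd reference for your release) and keeping in mind it still requires CONFIG_RT_GROUP_SCHED in the kernel:

    # Hypothetical /etc/docker/daemon.json reserving 950 ms of real-time runtime
    # per 1 s period for Docker's parent cgroup
    {
      "cpu-rt-period": 1000000,
      "cpu-rt-runtime": 950000
    }

    # Restart the daemon afterwards so the change takes effect
    sudo systemctl restart docker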

If after checking these points you still face issues, it might be helpful to directly test real-time capabilities outside of Docker to ensure the kernel is behaving as expected. This can be done by setting real-time scheduling policies on a simple test process using chrt or similar tools. If this test succeeds, the issue likely lies with Docker’s interaction with the kernel rather than the kernel itself.
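
For example, a quick host-side check (outside any container) might look like this, with sleep standing in for any test process:

    # Try running a trivial command under SCHED_FIFO at priority 50 on the host
    sudo chrt -f 50 sleep 1 && echo "real-time policy accepted by the kernel"

    # Inspect the policy and priority of an already-running process by PID
    chrt -p <pid>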

Remember, while TorizonCore with PREEMPT_RT provides the kernel-side foundation for real-time behavior, the kernel configuration, the Docker daemon, and the container settings all have to line up for real-time scheduling to work end to end.

Please also check if these links can help you solve your problem:

High performance, low power Embedded Computing Systems | Toradex Developer Center.

Have you tried simply increasing the priority of the thread that is experiencing starvation?

As for your question about real time, please check whether your kernel is configured to support real-time scheduling:

zcat /proc/config.gz | grep CONFIG_RT_GROUP_SCHED

If CONFIG_RT_GROUP_SCHED is not set, you cannot use Docker’s --cpu-rt-runtime and related options. However, you can still achieve real-time capabilities:

  • You can set real-time scheduling policies such as SCHED_RR (round-robin) or SCHED_FIFO (first-in, first-out) directly within your application or through tools like chrt:

chrt -r 99 your_application

This command sets your application to use the SCHED_RR policy with the highest priority (99).

If you want to set the scheduling policy inside a Docker container, you might need to run the container with elevated privileges and use the --cap-add=sys_nice option:

docker run -it --privileged --cap-add=sys_nice your_container_image

and then inside the container use

chrt -r 99 your_application

Ensure that your system limits and capabilities allow for real-time scheduling:
Edit /etc/security/limits.conf to allow your user or group to use real-time priorities:

your_user_name - rtprio 99
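
To double-check that the limit is actually in effect for your session (or inside a container started with --ulimit rtprio=99), a bash shell can report it with:

    # Maximum real-time priority the current shell is allowed to request
    ulimit -r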

Greetings @techczech,

Looking at the Docker documentation, it seems these runtime options require the CONFIG_RT_GROUP_SCHED kernel config to be enabled. I just checked the PREEMPT_RT image for Torizon, and this config is not enabled by default:

torizon@verdin-imx8mp-06849059:~$ zcat /proc/config.gz | grep RT_GROUP
# CONFIG_RT_GROUP_SCHED is not set

I could make a request to our team to enable this configuration in our default image. Though this would only be on the latest versions of Torizon OS, not the older 5.7.0 you are on. Would you still like for me to put in this request?

Best Regards,
Jeremias

Have you tried simply increasing the priority of the thread that is experiencing starvation?

I should have given more background. We’re bringing over an application from a Colibri T30 with Yocto Linux. That application has thousands of hours of testing, uses the same thread scheduling and priorities, and does not exhibit this issue at all.

Another data point: there’s another thread with the same priority talking to a sensor with a much higher rate output, and there are no issues with it. Both threads are talking on serial ports.

CPU use per top is only about 25% total for all processes.

And here’s something weird - before we found sys_nice, our calls to set thread scheduling to SCHED_RR and to set thread priorities were failing. So all threads are at the same priority with the default non-real-time scheduling. And we have no data loss there.

Hi @techczech ,

Does this mean that applying the CAP_SYS_NICE capability to your container resolved the issue?

Just to summarize: you have two sensors running in two separate threads. One sensor has a higher rate output and has no issues with it. The other sensor has a lower rate output but is somehow dropping data. The CPU is not close to stressed at all with the current processes running on your system.

And here’s something weird - before we found sys_nice, our calls to set thread scheduling to SCHED_RR and to set thread priorities were failing. So all threads are at the same priority with the default non-real-time scheduling. And we have no data loss there.

When you say you “have no data loss”, do you mean on both sensors or just the higher-rate output one?

It’s hard for me to say for sure what could be the issue given the current information. Would you still want to try the real-time scheduler options for Docker?

Best Regards,
Jeremias

Here are some updates.

I don’t see anything from “zcat /proc/config.gz | grep CONFIG_RT_GROUP”.

We’re not using chrt; we are using pthread_setschedparam in the application, and this appears to be working. We’re setting SCHED_RR via that, and per the process and thread priorities shown in top, it’s working.

When we start the container without SYS_NICE, pthread_setschedprio fails. All our threads are at PR 20 (user, not real-time). And there’s no data loss with this. And yes, the processor has plenty of spare CPU; we’re not overloading it.

When we start the container with SYS_NICE, pthread_setschedprio works and in top we now see negative PR values which means we’re running real-time. And that is where we have issues on one of the two serial ports.

About the serial ports, we’re using the second native serial port (ttymxc1) to talk to a GPS receiver, and we’re using a USB/serial chip port (ttyACM0) to talk to the higher output sensor. The issue is on the native serial port.

We were using thread priority ranges between -30 and -75 for our threads. We’ve set those to lower priorities (e.g. -10), and the problem goes away. We’re still doing trial and error to find out where the danger zone is. From that it looks like our higher priority threads are interfering with something in the OS for the native serial port. And again, this is an application we’re bringing over from Colibri T30 + Yocto Linux, with the same thread scheduling scheme, and no similar issues there.

Are you aware of any issues with running threads with PR -50 or higher priority on that version of TorizonCore? Could this be happening because CONFIG_RT_GROUP_SCHED is not set? Yes, I’d like that as part of the default image. That might help us, and I’d guess we’re not the only ones with real-time applications.

We’re working on migrating to the latest TorizonCore (6.7); we know that will get us over a year of fixes and improvements.

Oh, and the thread pulling the lower-volume GPS data isn’t getting starved. It’s fairly high priority compared to our other threads. Data is coming in at 10 Hz (every 100 milliseconds); the longest we see the thread go dormant is 2 milliseconds.

I don’t see anything from “zcat /proc/config.gz | grep CONFIG_RT_GROUP”.

Really? You don’t even see it listed as not set, like in my case? That’s strange. You are running this on the device, outside of any container, yes?

When we start the container with SYS_NICE, pthread_setschedprio works and in top we now see negative PR values which means we’re running real-time.

That’s expected, since you need to grant the container the SYS_NICE capability to affect things like process priority and scheduling from inside the container.
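
If you want to cross-check what top is reporting, the per-thread scheduling class and real-time priority can also be listed with ps, for example (your_application is just a placeholder here):

    # cls = scheduling class (TS, FF, RR), rtprio = real-time priority per thread
    ps -eLo pid,tid,cls,rtprio,pri,comm | grep your_application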

About the serial ports, we’re using the second native serial port (ttymxc1) to talk to a GPS receiver, and we’re using a USB/serial chip port (ttyACM0) to talk to the higher output sensor. The issue is on the native serial port.

Just to confirm, have you tried swapping which serial port each of your devices is connected to? For example, if you put the GPS receiver on ttyACM0, does it behave better, or is the issue still present? I want to make sure the issue is indeed with this specific serial port itself and not some other factor.

Are you aware of any issues with running threads with PR -50 or higher priority on that version of TorizonCore?

I am not. Though to be fair, I don’t hear from many users who set process priorities like this. Real-time in general is not used by many of our customers, so it’s harder for us to discover issues related to real-time.

Could this be happening because CONFIG_RT_GROUP_SCHED is not set?

There’s no proof that this configuration is at all related to the issues you are observing. The only reason I brought it up is that this configuration is required to use the Docker runtime options you were looking at at the start of this thread. Setting this kernel configuration would allow you to use those runtime options, but there’s no guarantee that it will actually help your situation.

Yes, I’d like that as part of the default image. That might help us, and I’d guess we’re not the only ones with real-time applications.
We’re working on migrating to the latest TorizonCore (6.7); we know that will get us over a year of fixes and improvements.

Before we explore this option, could you try a little experiment for me? So you’re currently on an older Torizon OS version (5.7.0), which uses an older version of the Linux kernel as well. If possible, could you try your setup on the latest Torizon (6.7.0)? That version of the OS uses a newer version of the Linux kernel. I’m curious whether just using a newer kernel fixes, or at least improves, your situation. This could be the case, since there are hundreds of fixes and changes to the Linux kernel with every new version.

This would at least be worth a try before we go and try adding any new kernel configurations that we’re not even sure will help.

Best Regards,
Jeremias

I’m in an SSH session, so on the TorizonCore “torizon” account, not in the Docker container. And I get nothing with RT_GROUP at all in /proc/config.gz.

It would be difficult to do; this is baked into our mainboard design, not cables and DB9 connectors. So it would require a respin of the board or substantial hardware mods to try this.

Another person here is working on that. I’ll try to remember to report back when that gets done. It apparently wasn’t straightforward; several things have changed since 5.7.0 that we need.

Here’s what has fixed the problem in testing so far. We have about a dozen threads, running at different priorities. Here are the priorities that work on the T30 and exhibit the serial port noise/corruption on the iMX8M Plus:

    THREAD_PRIORITY_BELOW_NORMAL  30
    THREAD_PRIORITY_NORMAL        50
    THREAD_PRIORITY_ABOVE_NORMAL  55
    THREAD_PRIORITY_HIGHEST       75

The threads pulling data from sensors off the serial ports are ABOVE_NORMAL. There’s one other thread at HIGHEST.

Investigating, we noticed that our high priority thread(s) appeared to be somehow interfering with the native serial port driver, though the ACM driver was OK. So we switched the relative priorities to this:

    THREAD_PRIORITY_BELOW_NORMAL  10
    THREAD_PRIORITY_NORMAL        15
    THREAD_PRIORITY_ABOVE_NORMAL  20
    THREAD_PRIORITY_HIGHEST       25

Using those priorities gets all the thread priorities at or under 25, and the serial port corruption goes away in testing done so far. We have to do testing next to see how this affects us in timing-critical applications.

It would be difficult to do, this is baked into our mainboard design, not cables & DB9 connectors. So would require a respin of the board or substantial hardware mods to try this.

That’s fair; it would have been a nice data point though, especially since we suspect the issue may be with the native serial port driver and not necessarily the sensor that was attached.

Another person here is working on that. I’ll try to remember to report back when that gets done. It apparently wasn’t straighforward, several things have changed since 5.7.0 that we need.

Please do let me know; it would be very good to know whether it helps or not. In any case, any fix or change we make would require you to migrate to a newer version anyway.

Investigating, we noticed that our high priority thread(s) appeared to be somehow interfering with the native serial port driver, though the ACM driver was OK.

This is an interesting observation. Some kernel or driver change between the T30-era kernel and the current one may have caused a regression or change in behavior with regard to your use case here. That’s why I’m hoping another kernel version change may improve the situation.

By the way, how did you come to this conclusion that high priority threads were interfering with the serial port driver?

We have to do testing next to see how this affects us in timing-critical applications.

Let us know if this ends up being a major blocker. I can’t guarantee anything since at the moment it’s not clear where the root issue is, but further information could change the story here.

Best Regards,
Jeremias

The first indicator was that the problem doesn’t happen when the container is run without privilege or SYS_NICE. All our threads are then at 20 (non-real-time) and there’s no data dropout. Then someone here suggested lowering them all to -2 (very low priority real-time, requires SYS_NICE). We did that, and there’s no data dropout. We tried with all threads at -51; the problem was there, but happened much less often. We also noticed the problem wasn’t there when we only ran using the GPS, which cuts out two other high priority threads for the other sensor.

Taking all that into account, we tried the lowered thread priority values, and things look good so far.

Appreciate the explanation. I can see how that would be a reasonable conclusion given the current information. Keep me updated if you uncover anything else, or if you get a chance to try on the newer version.

Best Regards,
Jeremias

Greetings,

Just checking in, were there any further findings or updates here on your end?

Best Regards,
Jeremias

Still working on migrating to 6.7.0… will report back when we’re there.

Still working on migrating to 6.7.0… will report back when we’re there.

Thanks for letting me know. Just wanted to check in.

Best Regards,
Jeremias

Reporting in - we’ve migrated to 6.8.0 now. This behavior is still present: using higher real-time thread priorities is somehow interfering with low-rate native serial port input, but the much higher-rate USB/serial input has no issues. So we have to stay with the lowered thread priorities I gave previously.

Also, I now see a setting for CONFIG_RT_GROUP_SCHED from “zcat /proc/config.gz | grep CONFIG_RT_GROUP”. It’s not set: “# CONFIG_RT_GROUP_SCHED is not set”. Would turning that on help us in any way? I’m going to investigate a timing-critical application we need next. We’re having trouble getting this to work with anywhere near the stability of ancient Windows CE on a Colibri T30.

Would turning that on help us in any way?

Honestly, it’s hard to say with any certainty without just trying it.

We’re having trouble getting this to work with anywhere near the stability of ancient Windows CE on a Colibri T30.

This is a tough comparison. Windows CE is, by technical definition, considered a real-time operating system. Unfortunately, Linux is not. Even with the real-time patch applied, it still doesn’t meet the requirements of a real-time OS.

The issue I see here is that there are a lot of potential places to look for what might be causing the behavior you’re observing. It could be the kernel itself, the serial drivers, or maybe even an issue with how the container is handling the priorities.

Let me confer internally and see if any of my colleagues has a potential idea here.

Best Regards,
Jeremias