SMP breaks my realtime

Hi everyone,

I did some tests with a realtime-patched kernel on a single-core Colibri iMX6 and found that I get much shorter latencies when I deactivate SMP in the kernel.

To get more CPU power for my application and future enhancements, I am now using a dual-core iMX6. The idea was to dedicate one core to my application using taskset, or to do it in my application with sched_setaffinity.
With SMP deactivated in the kernel the second core is not usable, so I had to reactivate it.
But even with a second core, SMP breaks my realtime.

In top I see very low CPU usage (both cores >90% idle), but my latencies are way over my limits.

Why is the kernel latency so much longer with SMP, even on a dual-core CPU?

Is there something wrong with my kernel-config?

Anything that has to be activated, deactivated, adjusted to use smp with preempt-rt?

Thanks,
Grimme

.config

Dear @Grimme

I’m not entirely surprised that the latency changes with SMP enabled. The overhead for managing the caches is higher with more than one CPU.

However, can you give us some more details:

  1. What latency do you measure?
  2. What latency do you expect?
  3. How do you measure the latency?
  4. What priorities did you set?

According to measurements from OSADL, a latency of around 100us should be achievable:
https://www.osadl.org/Long-term-latency-plot-of-system-in-rack.qa-3d-latencyplot-rbs7.0.html?shadow=0

Can you also upload the config file again? It’s not downloadable; please rename it to config instead of .config.

Regards,
Stefan

Hi Stefan,

there are several threads in my application that run periodically. On startup they get the current time using clock_gettime(CLOCK_MONOTONIC, &ts_next); before they enter a while(TRUE) loop.

In every iteration of the loop they first get the current time, again using clock_gettime(CLOCK_MONOTONIC, &ts_now);, and calculate the difference between the current time (ts_now) and the target time (ts_next).

Then they do their task, add an amount of time (500us for the most rapidly running thread) to ts_next, and go to sleep with clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts_next, NULL);.

On an iMX6Solo in most cases the differences between ts_now and ts_next are between 10 and 20us, in worst cases between 50 and 60us.

On an iMX6DualLite the differences are mostly between 20 and 30us, in worst cases up to 180us.

The extra 10us of latency in the typical case is not fatal for my application, but the outliers up to 180us are.

My expectation was that the latency would be the same or better with 2 cores.
Of course there is some more overhead, but I expected the lower load with 2 cores to compensate for that.

My threads are running with priorities of 0, 10 and 20. There is also a thread dedicated to handling SPI communication, running with a priority of 30.

Do you think it is a matter of priority? But why is it better on a single core?

my renamed config

Thanks,
Grimme

Hi @Grimme,

Thanks for the additional information. I have three ideas that you could try.

  1. Can you set the priority of the latency-measuring task to the highest priority (just to see if the latency changes)?

  2. Set the governor to performance mode (this is my biggest hope):

    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    
  3. Disable CONFIG_PM_SLEEP_SMP (I don’t think this will help, but who knows).
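Note that on a dual-core system the governor should be set for every CPU, not only cpu0. A small shell sketch that covers all cores (set_performance is a made-up helper; the sysfs base path is an argument only so it can be pointed at a test directory for a dry run):

```shell
# Write the performance governor to every CPU's scaling_governor.
# Default path is the real sysfs location; pass another directory to dry-run.
set_performance() {
    dir="${1:-/sys/devices/system/cpu}"
    for gov in "$dir"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$gov" ] && echo performance > "$gov"
    done
}
```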

I could imagine that because the second CPU has no load, it throttles its speed, which could lead to the higher latency. If you only have one core, the load is higher, so it doesn’t throttle as early.

Regards,
Stefan

Hi Stefan,

for my single-core setup I used low priorities, because it does not make much sense to pass data to the kernel in time but then not give it the time to write it to the hardware.
For dual-core that should not matter, I hope. So I increased the priorities of my highest-priority threads to 80 and 90.

I already set the scaling governor to performance in an init script, but forgot to adapt it for multi-core and to copy it into my Yocto recipe (shame on me).

I also disabled CONFIG_PM_SLEEP_SMP but the results are not very encouraging.

The latencies are now <140us. That is 40us better than before, but still 80us worse than with a single core.

Isn’t there any way to use the 2nd core without SMP?
Something like AMP?

If I “glue” my application to a specific core using taskset or something similar and leave the other core for everything else, I don’t need any load balancing.

Regards,

Grimme

Hi @Grimme, Stefan will be back next week and then he can answer your question.