That’s something I don’t understand. Latency three to four orders of magnitude greater than the clock period (10 µs compared to less than 700 ps) is absolutely ridiculous.
Not really. Have you ever heard of hardware/software partitioning? That’s exactly the underlying reason for this.
I can understand some overhead, but at this level… In fact, even though the latency is huge, the jitter - a much more important problem - is even worse.
To achieve jitter as low as you expect, you would definitely need an RTOS, or even have to run this on a dedicated MCU.
Now, using the microcontroller, why not? But how many GPIOs are available?
There are actually more than 40 GPIOs available, as both the parallel camera and the parallel display interfaces are not supported on Apalis TK1, and we just routed those MXM3 pins to the K20.
How can one send data to the Arm cores without adding latency?
I’m afraid that is not really possible.
Are they using shared memory ?
No, they are using a dedicated SPI interface.
Is there a possibility to use DMA to copy data directly to the RAM, independently of the CPU ?
No.
This is something I don’t understand. It is very easy to use, for example, a high-performance audio codec. How can audio signals (sometimes multichannel at 384 kS/s) be handled correctly without a huge amount of jitter?
This involves buffering and dedicated hardware, e.g. the AVP.
If this works, it means that there are low-latency mechanisms to get data. How can a stereo audio stream at 192 kS/s with 24-bit data (2.6 µs per sample) be handled with a latency of 10 µs and random jitter that can be even larger?
None of them process data in chunks as small as 64 bytes; they buffer a lot more.
That’s not logical: codecs, DACs and ADCs are widely used without any problems, even using the main processor for signal processing.
Yes, using kilobytes if not megabytes of buffering.
How can HDMI connections work with such awful timings? That’s not logical: HDMI connections work very well even at high resolutions.
All done in hardware, really.
How can PCIe multi-channel audio cards work flawlessly on a standard PC with software plugins, with only the latency of the signal processing (which can be large, but pretty constant)? That’s not logical.
Just buffering, really. Plus they may even drop some samples without you ever noticing.
So, there must exist some way to acquire and emit data with better timings than that! At least, I could accept a large latency if it were guaranteed to be stable, but that is very far from being the case!
Not without running an RTOS or changing the way you plan to go about transferring data. Which makes me wonder what exactly you plan on doing with those 64-byte chunks, and, if they are coming off an FPGA, why one could not pre-process and buffer them some more, so that this whole senseless discussion would be rendered obsolete.
I know that; this seems mainly related to Linux, so are there alternatives for the OS?
No, this is not related to Linux at all. I don’t think any other general-purpose OS will meet what you expect, be it M$ or fruity.
Even an RTOS could be OK, as I only need to do signal processing on several cores. Are there systems and toolchains other than Linux available for the TK1 board with a minimal set of drivers (at least GPIO and Ethernet)?
No, so far I have not heard of any. The biggest hurdle on TK1 would be how to make use of all the GPU functionality in such a case. However, depending on what exactly you are trying to achieve, you may just run U-Boot and get away with your own dedicated native stuff, making use of certain low-level drivers available there, e.g. for PCIe.