M4 - ADC sampling (and bug?)

Hi,
I want to use the M4 to capture ADC values continuously at a fixed rate. The sampling rate should be at least 100 kSps, but 1 MSps would be better. I wrote a FreeRTOS-based program that stores the conversion results and the timestamps (using the GPT at 6 MHz, which means 60 ticks per ADC conversion at 100 kSps). My program text is located in OCRAM. The data segment is split into OCRAM_EPDC for general data and DDR RAM for measurement data. However, when I run an acquisition of 10k samples, there are some timing violations between consecutive timestamps, e.g. 180 instead of 60 ticks. This is not acceptable for my application. I guess this issue occurs because the GPT, ADC and OCRAM are connected via the AXI bus, right? Is there any way to improve the timing accuracy? I already noticed that DMA is not supported by the FreeRTOS ADC driver. Thanks in advance.
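
For reference, the timestamp check described above could be sketched roughly as follows. This is only an illustration, not the actual program: the function name, the tolerance parameter and the assumption of a 32-bit free-running GPT counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GPT at 6 MHz, ADC at 100 kSps: consecutive timestamps should
 * differ by 6 000 000 / 100 000 = 60 ticks. */
#define GPT_HZ          6000000u
#define SAMPLE_RATE_HZ  100000u
#define EXPECTED_TICKS  (GPT_HZ / SAMPLE_RATE_HZ)

/* Count the intervals whose length deviates from EXPECTED_TICKS by more
 * than `tolerance` ticks. Unsigned subtraction keeps the delta correct
 * across a single wrap-around of the (assumed 32-bit) counter. */
size_t count_timing_violations(const uint32_t *ts, size_t n, uint32_t tolerance)
{
    size_t violations = 0;

    assert(ts != NULL || n < 2);
    for (size_t i = 1; i < n; ++i) {
        uint32_t delta = ts[i] - ts[i - 1];
        uint32_t err = (delta > EXPECTED_TICKS) ? delta - EXPECTED_TICKS
                                                : EXPECTED_TICKS - delta;
        if (err > tolerance) {
            ++violations;
        }
    }
    return violations;
}
```

A delta of 180 ticks, as in the example above, would be counted as one violation.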

And another question: Is there a bug at adc_imx7d.c line 289? The reference manual states that: The conversion rate = (CHA_TIMER + 1) times sample rate. I changed the code to the following:

uint32_t convertDiv = sampleRate / convertRate - 1;  // Added -1

Without that change, the measured times are twice the value I would expect (in my case ~120 instead of 60).
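
To illustrate the effect, here is a small host-side sketch of the divider arithmetic. It assumes that the hardware divides the sample rate by (CHA_TIMER + 1), which is how I read the reference manual and which matches the observed factor of two; the function names are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Divider as computed by the unpatched driver line (no "- 1"). */
uint32_t convert_div_original(uint32_t sample_rate, uint32_t convert_rate)
{
    assert(convert_rate != 0 && sample_rate >= convert_rate);
    return sample_rate / convert_rate;
}

/* Divider with the "- 1" correction shown above. */
uint32_t convert_div_patched(uint32_t sample_rate, uint32_t convert_rate)
{
    assert(convert_rate != 0 && sample_rate >= convert_rate);
    return sample_rate / convert_rate - 1u;
}

/* Resulting conversion rate, ASSUMING the hardware applies a
 * factor of (CHA_TIMER + 1) to the sample rate. */
uint32_t effective_rate(uint32_t sample_rate, uint32_t cha_timer)
{
    return sample_rate / (cha_timer + 1u);
}
```

With sampleRate = convertRate = 100000, the unpatched divider is 1 and the effective rate under this assumption is 50 kSps; the patched divider is 0 and the effective rate is 100 kSps.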

Hi,
I wrote a little demo program to show both issues mentioned above. Most important: the sample rate equals the conversion rate of 100 kSps, and without the fix above you will notice that 50k samples are acquired within 1 s although the sample rate was set to 100 kSps. Secondly, when setting the ADC to 100 kSps, ~100k samples are acquired without timing violations (the first sample may be reported as an error, but that's OK). When setting the sample rate to 1 MSps, the values are not processed fast enough. The whole program is located in TCM. So the question is: how could I ever reach the specified sample rate of 1 MSps?

PS: It's getting even worse when using FreeRTOS.

Hi @andy.tx
Are there any updates yet?

Dear @qojote

The investigation is taking longer than expected, and I could spend less time on this issue than I wanted. Hence I can only come up with partial results. However, they should be sufficient to point you in the right direction.

CPU Performance Figures

To achieve 1 MSps you will need to optimize your code, so it is all about execution time. For my descriptions below, let’s use an M4 CPU cycle as the unit of time. The M4 is clocked at 240 MHz, so

  • 1 cyc = 1 / 240 MHz ≊ 4ns

  • The ADC sampling period is 240 cyc (@ 1MSps)

Let’s focus on what the M4 can do in this time:

  • A typical assembly instruction is executed in 1 cyc (if executed in TCM memory)

    • The same instruction executed from OCRAM takes 10 cyc.
      Enabling the cache makes it worse: the instruction then takes 12-16 cyc to execute.
  • Load and store operations (LDR, STR) to TCM take 2 cyc

    • A load (LDR) from a peripheral register takes 60 cyc
    • A store to a peripheral register takes 60 cyc. However, the write is buffered. This means the CPU continues executing instructions after 2 cyc, except if there is a read from the same peripheral, which requires the system to wait until the previous write has finished.
      (I didn’t analyze the exact conditions under which the system waits for a write to finish, so my previous statement might be somewhat imprecise).

As you can see, the really expensive operations are reading (and writing) peripheral registers: at 1 MSps, each register read consumes 25% of the available CPU time (60 of 240 cyc). The major goal must be to minimize these operations!
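
The budget arithmetic can be written down as a small sketch. The constants are the figures quoted above (240 MHz core clock, about 60 cyc per peripheral register read); the function names are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define M4_CLOCK_HZ      240000000u  /* M4 core clock quoted above */
#define PERIPH_READ_CYC  60u         /* cost of one peripheral register read */

/* CPU cycles available between two consecutive ADC samples. */
uint32_t cycles_per_sample(uint32_t sample_rate_hz)
{
    assert(sample_rate_hz != 0);
    return M4_CLOCK_HZ / sample_rate_hz;
}

/* Percentage of the per-sample budget consumed by `reads`
 * peripheral register reads. */
uint32_t periph_read_share_percent(uint32_t sample_rate_hz, uint32_t reads)
{
    return (100u * reads * PERIPH_READ_CYC) / cycles_per_sample(sample_rate_hz);
}
```

At 1 MSps this gives 240 cyc per sample, so four peripheral reads per sample would already use up the entire budget, while at 100 kSps a single read costs only about 2% of it.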

Achieving 1 MSps

My approach would be the following:

  1. Set up the ADC to do cyclic measurements, generating an interrupt after each sample.
  2. The interrupt service routine (ISR) should

    • Read the ADC sample value.
    • Clear the interrupt flag by writing 0x00000000 to the ADC status register.
      Don’t do a bit-clear operation, as this would require reading the register before modifying and writing it!
    • Process the sample for at least 60 cyc.
      This allows the flag-clear operation to take effect before you leave the ISR.
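
The ISR steps above could look roughly like the following sketch. It is untested on hardware and runs on a host only because the two ADC registers are replaced by plain variables; all names (`fake_adc_sample_reg`, `fake_adc_status_reg`, `adc_isr_sketch`, the buffer) are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the real ADC registers so that the sketch runs on a host.
 * On the target these would be volatile accesses into the ADC register
 * block; the names are hypothetical. */
volatile uint32_t fake_adc_sample_reg;
volatile uint32_t fake_adc_status_reg;

#define BUF_LEN 1024u
uint32_t sample_buf[BUF_LEN];   /* would be placed in TCM on the target */
size_t sample_count;

void adc_isr_sketch(void)
{
    /* The single expensive peripheral read (about 60 cyc). */
    uint32_t raw = fake_adc_sample_reg;

    /* Plain write of zero instead of a read-modify-write, so no second
     * peripheral read is needed. The write is buffered. */
    fake_adc_status_reg = 0x00000000u;

    /* Placeholder for the sample processing. On the target, the work done
     * here should last at least about 60 cyc so that the flag clear has
     * taken effect before the ISR returns. */
    assert(sample_count <= BUF_LEN);
    if (sample_count < BUF_LEN) {
        sample_buf[sample_count++] = raw;
    }
}
```

Storing one word to TCM takes only a couple of cycles, so on real hardware the processing step would have to do more work (or wait) to cover the 60 cyc mentioned above.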

With this approach you should get away with only one expensive read operation. Even the write operation time can be hidden by using it for processing the sample, so there are approximately 160 cycles available to process each sample.

FreeRTOS Performance Impact

As far as I can estimate, there are a few major impacts on performance caused by FreeRTOS:

  • Interrupts used by FreeRTOS. In the basic configuration, there’s only the timer interrupt which occurs once every tick (typically every 1ms).
  • Any pending register read must be finished and can delay the execution of the ADC interrupt by 60 cyc.
  • Task switches are rather expensive operations and reduce the CPU time which is available outside of the ISR.
    • There are times when interrupts are disabled during task switches. I didn’t analyze how long such interrupt-disabled-periods can last.

You may consider not using FreeRTOS task switching at all. If you use FreeRTOS only to initialize the system and don’t start the scheduler, there should be no performance impact at all.

Further Options

If you need additional performance improvements, you could use DMA to move ADC samples to RAM, and even do some preprocessing.
The i.MX7 features a smart DMA engine which is quite powerful. However, it would add complexity to your system, and NXP suggests that you approach them in order to let them create DMA scripts.

I hope this information helps you to reach the required performance goals.

Regards, Andy

Dear @andy.tx
Thanks a lot for your detailed investigation. The section about “CPU Performance Figures” is quite useful. I already knew that I have only 240 instructions to handle the ADC interrupt, but I was not aware that loading the ADC value (and peripheral registers in general) itself takes 60 cycles. So now, when I just count the ADC interrupts within one second and skip the timing validation, I get approximately the expected number of events.

What about the possible bug in the ADC driver code? Can you confirm it? When I undo my change, the number of received samples is always half of what I would expect…

You suggest to “Process the sample for at least 60 cyc.” in order to use the time the register write takes, right? Do you know by any chance how long a storage operation to the RAM takes? That would be the best place for me to store those values.

I would prefer to use DMA, but there isn’t much documentation or guidance on how to use the DMA engine, and there is no example either.

Regards

Dear @qojote

Thank you for reporting the issue in adc_imx7d.c.
I confirm this is a bug, I was able to reproduce the same wrong behavior. I added the issue to our bug tracker, we will fix it in our next FreeRTOS release.

Regards, Andy

Dear @qojote

I added a comment to the question itself to confirm the bug, so it is easier to find for others.

I suggested to “Process the sample for at least 60 cyc.”, because I assume if you leave the ISR earlier, the interrupt is not yet cleared, and the system would immediately enter the ISR again. However, I didn’t test this.

When you refer to “storage operation to the RAM”, which RAM do you mean?

  • Writing to TCM takes 2 cyc.
  • Writing to OCRAM takes 10 cyc.
    (not measured, but I assume this from the code execution performance)
  • Writing to DRAM is even slower, I didn’t do any measurements.
    However, if caching is set up properly, sequential writes might be very fast. If you try this, make sure to do it properly. There’s a note in the i.MX7 Reference Manual:

To use cache, user needs to configure MPU to set those memories as cacheable
and all the other memories set as non-cacheable.

For the DMA it appears that NXP designed it “for internal use only”. We also don’t get more information, and there is a pretty clear statement that one should contact NXP for DMA scripts to support new peripherals (such as the ADC).

Regards, Andy

I am referring to the DRAM. I am aware that this memory is the slowest option, but I need to store several MBytes of data. Caching is only possible for the first 2 MB of DRAM, but this area is used for Linux (and maybe U-Boot itself), and I guess it’s not easy to move things around. Are there any tutorials on how to use the cacheable regions of DRAM?
As long as I am able to acquire and store the data fast enough, I do not need DMA.

Dear @qojote

To use the cacheable DRAM area, you need to move the Linux kernel to another location. One of my colleagues wrote a blog post with the (unverified) concept of how to do it. There are also some additional M4 performance measurements in the blog post.

Regards, Andy