SPI quirks and timing

Hi, it’s me again.

In a previous post about “SPI demo not working”, valter.tx helped me splendidly to get SPI working. After that I wrote some code to interface with an SPI IO expander (MCP23S08). While writing this code I came across some odd quirks, or maybe even bugs, since I couldn’t find anything documented about this behaviour. Let me explain.

When initializing the SPI with the following command:

Imx6Spi_SetConfigInt(SPI, L"BitsPerWord", 32, StoreVolatile);

With this setting, the output on MOSI is as expected:
Bytes to send: 0x01 0x02 0x03 0x04
Bytes read on an oscilloscope: 0x01 0x02 0x03 0x04

*The downside is that you need to send data in multiples of 4 bytes; otherwise you risk overwriting another register if the device allows sequential writes/reads.

However, when initialising the SPI with the following command:

Imx6Spi_SetConfigInt(SPI, L"BitsPerWord", 8, StoreVolatile);

With this setting, the output on MOSI is NOT as expected:
Bytes to send: 0x01 0x02 0x03 0x04
Bytes read on an oscilloscope: 0x04 0x03 0x02 0x01
As you can see, the bytes are swapped. It isn’t a real problem, but you have to discover it first.

*The advantage of this is that you can send a number of bytes that is not a multiple of 4. But there is a catch I discovered.

The catch is that you still have to send out a multiple of 4 bytes. When I tried to send 3 bytes, they were sent in a completely different (maybe even random) order. If I remember correctly:
Bytes to send: 0x01 0x02 0x03 (0x04 ← added to pad it to a full DWORD, but should not be sent).
Bytes read on an oscilloscope: 0x03 0x02 0x04 0x01
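For what it’s worth, a caller-side workaround is possible when sending multiples of 4 bytes. This is only a sketch under the assumption that the 8 “BitsPerWord” mode reverses the bytes within each 32-bit FIFO word, as measured above; `swap_bytes_per_dword` is my own hypothetical helper, not part of the Toradex library:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical workaround (not part of the Toradex library): reverse
 * the byte order inside each 32-bit group before calling Spi_Write,
 * so that after the driver's own per-DWORD swap the bytes appear on
 * MOSI in buffer order. len is assumed to be a multiple of 4. */
static void swap_bytes_per_dword(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint8_t t;
        t = buf[i];     buf[i]     = buf[i + 3]; buf[i + 3] = t;
        t = buf[i + 1]; buf[i + 1] = buf[i + 2]; buf[i + 2] = t;
    }
}
```

Passing 0x01 0x02 0x03 0x04 through this gives 0x04 0x03 0x02 0x01 in memory, which the driver then swaps back to 0x01 0x02 0x03 0x04 on the wire.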

I measured it multiple times, and the same results were visible every time. Keeping those quirks in mind, I got my IO expander to work.

In addition I also measured the execution time of my SPI routine (writing to six IO expanders as fast as possible). The routine sends 4 bytes to each of the 6 IO expanders to update the outputs, then inverts the value and sends again, all in a loop.

To my surprise there was also a big time difference between using 32 and 8 “BitsPerWord”.
For one step in the loop with 32 “BitsPerWord” the execution time is ~0.28 ms.
For one step in the loop with 8 “BitsPerWord” the execution time is ~31.6 ms.

This led me to wonder how fast, or in my case how slow, the SPI is. Going with ~0.28 ms for 6 SPI write cycles of 4 bytes each, this results in ~46.6 us for one 4-byte SPI write.
Calculating how long the SPI transfer itself should take gives ~3.2 us, so there is an overhead of ~43.4 us.
I know there is some overhead involved in processing the code and sending it out over SPI, but in my opinion the overhead takes rather long.
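To make the arithmetic explicit (assuming the 10 MHz SPI clock from my setup; the helper name is my own, just for illustration):

```c
/* Raw wire time of an SPI transfer: bits divided by clock rate.
 * 4 bytes * 8 bits at 10 MHz = 3.2 us, so at ~46.6 us per 4-byte
 * write, roughly 43.4 us must be software/driver overhead. */
static double spi_wire_time_us(unsigned bytes, double clock_hz)
{
    return (double)bytes * 8.0 * 1e6 / clock_hz;
}
```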

Am I correct to assume this is the minimum overhead, or is there a way to make it much faster? Or would Linux be a better option for this?

I know it’s a long post, but I appreciate the help and want to explain it as fully as I can. You can find the code I used for this here: https://share.toradex.com/wt2hfsi47row4cn
I also added the measured times to the code as comments.

Kind regards,


For the byte ordering: we discovered the issue a couple of weeks ago and fixed it.
The fix will be included in release 1.1; sorry for the problem.
The performance difference is due to the fact that the SPI controller of the i.MX6 has no concept of word length and frame length, as most SPI controllers do. It has only a generic packet length, and data must be properly aligned in a 32-bit FIFO to be sent out in the right order. It also can’t send more than 512 bytes without toggling the chip select, so we had to control it as a GPIO.
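Given that 32-bit FIFO constraint, the alignment can also be handled on the caller side. A minimal sketch (my own hypothetical helper, assuming the target device tolerates zero padding; as noted earlier in the thread, a device in sequential write mode may not):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round the transmit length up to a multiple
 * of 4 and zero-fill the tail, so the data fills whole 32-bit FIFO
 * words. dst must have room for the rounded-up length. */
static size_t pad_to_dword(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t padded = (len + 3u) & ~(size_t)3u;  /* round up to multiple of 4 */
    memcpy(dst, src, len);
    memset(dst + len, 0, padded - len);
    return padded;
}
```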
How do you measure cycle times? At what transfer speed?
You can improve jitter by increasing the priority of your thread, this will prevent it from being interrupted from other threads using the CPU.

Weird, I thought I had already posted an answer to this.

But anyway, good to hear that it’s not only me seeing that quirk and that it will be solved in the near future.

But with regard to measuring the SPI: I use just the default pin configuration with Spi_Init(L"SPI2"), in mode 0, with a clock of 10 MHz and 32-bit word alignment. Then, in a small routine that sends 1 DWORD and reads 1 DWORD, I measure the time. I placed this routine in its own thread. See the code below for how I measure:

while (TRUE) {
	for (j = 0; j < 10; j++) {
		clock_t begin_time = clock();
		for (i = 0; i < 1000; i++) {
			Gpio_SetLevel(hGPIO, Probe_1, ioLow);
			Spi_Read(hSPI, Read_array_T, 1);
			Read_array = Read_array_T[0];
			Write_array_T[0] = Write_array;
			Spi_Write(hSPI, Write_array_T, 1);
			Gpio_SetLevel(hGPIO, Probe_1, ioHigh);
		}
		clock_t end_time = clock() - begin_time;	/* elapsed ticks for 1000 transfers */
		array[j] = end_time;
	}
}

I measure in a loop of 1000 or more, because otherwise you can’t get any good numbers: the clock_t resolution is in ms.
When measuring the whole loop, I get ~92us per loop.
When I comment out the Spi_Read and Spi_Write (so I measure my own code), one loop will take roughly 2us max.
Calculating how long it takes to read 32 bits and write 32 bits at the hardware level gives roughly 6.4 us.
Subtracting the 6.4 us and 2 us from the 92 us leaves a remainder of 83.6 us. In my opinion that is rather long when you are only writing 1 DWORD and reading 1 DWORD.

Release 1.1 is now available and should solve your issue.
Some points about the test:

  • To do precise measurements you may use QueryPerformanceCounter/QueryPerformanceFrequency; those return high-resolution values.
  • You can try to increase the priority of your send/receive thread using CeSetThreadPriority; this will prevent application threads from pre-empting your thread while it’s waiting for completion of the SPI operation.
    Consider also that for SPI the driver starts the operation and then waits for an interrupt notification in an IST. This has some overhead compared to a busy loop on the completion flag, but it also leaves the CPU free for other tasks. The overhead is more or less fixed, not depending on the amount of data exchanged, so it’s proportionally much higher when you transfer a small amount of data.
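A minimal sketch of the first suggestion (standard Win32/WEC2013 timing APIs; the timed section is a placeholder for your own Spi_Read/Spi_Write loop):

```c
#include <windows.h>

/* Sketch: time a code section with the high-resolution performance
 * counter instead of clock(), which only has millisecond resolution. */
void time_section(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    /* ... code under test, e.g. the Spi_Read/Spi_Write loop ... */

    QueryPerformanceCounter(&t1);
    double elapsed_us = (double)(t1.QuadPart - t0.QuadPart)
                        * 1e6 / (double)freq.QuadPart;
    (void)elapsed_us;  /* report or store the result */
}
```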

I have updated WEC2013 to the latest version, and verified that it was indeed correctly updated.

After the update I added code to give the thread priority 0 (the highest possible). After it was compiled and deployed on the device, I ran the program. Some functions didn’t work any more, but that has to be expected when you do not allow other programs/threads to run. Anyway, after some tests I still got the same timing as before (also checked with an oscilloscope; the time periods matched the software measurements).

After this I think I have run into the required overhead. I knew that an OS would have a higher overhead to process hardware-related stuff, but I didn’t think it would be so much.