I am building a (video) image processing application on Windows Embedded 2013 using the T30, but I find that low-level image processing is rather slow. E.g. simply subtracting two 8-bit grayscale images (640x480) takes more than 20 ms per image. That is far too slow for our application (there is more to be done). I tried several things to improve the performance. I parallelized the code in three different ways: using std::thread, using a thread pool, and using OpenMP. All three run correctly but are slightly slower than the sequential version. I also vectorized the loop using NEON intrinsics; that works too, but gives only a very slight performance gain.
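To make the operation concrete, here is a minimal sketch of the kind of subtraction loop being timed. It is not the exact code from the application: the names `subtract_u8` and `average_subtract_ms` are hypothetical, and it assumes the difference should saturate at 0 rather than wrap around. The NEON path is only compiled when the compiler defines `__ARM_NEON` (or `__ARM_NEON__`); other toolchains may need a different check, and otherwise the plain scalar loop handles everything.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: saturating per-pixel difference, dst[i] = max(a[i] - b[i], 0).
void subtract_u8(const std::uint8_t* a, const std::uint8_t* b,
                 std::uint8_t* dst, std::size_t n)
{
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Process 16 pixels per iteration with a saturating unsigned subtract.
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t va = vld1q_u8(a + i);
        const uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqsubq_u8(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or the whole image when NEON is unavailable.
    for (; i < n; ++i) {
        dst[i] = a[i] > b[i] ? static_cast<std::uint8_t>(a[i] - b[i])
                             : static_cast<std::uint8_t>(0);
    }
}

// Hypothetical timing helper: average milliseconds per image over `repeats` runs.
double average_subtract_ms(int width, int height, int repeats)
{
    const std::size_t n = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    std::vector<std::uint8_t> a(n, 200), b(n, 50), dst(n);
    volatile std::uint8_t sink = 0;  // keeps the result observable to the optimizer

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        subtract_u8(a.data(), b.data(), dst.data(), n);
        sink = dst[n / 2];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    return std::chrono::duration<double, std::milli>(stop - start).count() / repeats;
}
```

Calling `average_subtract_ms(640, 480, 100)` would give the average per-image time for the frame size mentioned above, which makes it easier to compare the scalar and NEON builds on the same board.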
Do you have any tips, or should I switch to a board with a faster processor with SSE2? The same code runs more than 20x faster on a regular desktop…
Thanks in advance,
P.S. Using a function that works on floats is dramatically slower.