UART communication failure rate is higher on Colibri IMX6 Dual than on Colibri IMX6 Solo

Hi!

In our devices we interchangeably use the Colibri iMX6 Solo 256 MB and the Colibri iMX6 Dual 512 MB. Our software communicates with our hardware via UART by sending request packets and receiving response packets (COM2, 921 kBaud; the high baud rate is necessary because the transmitted data volume is significant).

We noticed that when a Solo module is used, the communication failure rate is exactly 0: all requests are successfully responded to. When we use a Dual module, the failure rate is never 0 and reaches 0.2% (1 failure for every 500 transactions).

Everything but the SOM is identical: our hardware, our software, OS image, configuration block.

Could you please advise what we can do (maybe some fine-tuning?) to make UART communication on the Dual SOM as reliable as on the Solo SOM?

Colibri iMX6S 256 MB
Colibri iMX6DL 512 MB
BSP 1.6 CE 7.0

Here are some completely unfounded ideas and statements from Captain Obvious and his witless cohort, neither of whom use Windows for anything if there is any way around it. (Given quitting is an option, there is always a way around it :wink: )

Sharing like a Two Year Old
Your code has threads that use some common (not contained within the thread) resources like a shared buffer or the COM port itself. When running on the single-core SOM, it's like mom sitting between her two kids so they don't fight in church. You don't have enough horsepower for them to try running at the same time, so they don't trash each other. When you have a second core, fisticuffs can reign supreme! They both scream like two year olds: "Me me me! Mine Mine Mine!" End result: they trash the common resource that isn't shared properly, and you get a corrupted or null packet some of the time.

The infamous OK-FLAG
During my days of programming in COBOL for Crescent Counties Foundation for Medical Care (no longer around) one programmer, Scott, wrote this entry screen that processed a large (for the day) chunk of data. It was massive. The entire house of cards was controlled by a single OK-FLAG. Every PERFORM paragraph was controlled by and returned state in the OK-FLAG. We told him two bytes of memory burned out every time the program ran; the byte he was using and the byte closest to it due to the excess heat.

In the world of C/C++-developed embedded systems I've seen a lot of OK-FLAG type programs. It's kind of the "sharing like a two year old" problem, but it's not. This is just bad design! This is also why you never use AGILE in any project. When you are adding features a User Story at a time, it is too easy and convenient to simply re-use an OK-FLAG from some higher scope. If everything had its own OK-FLAG you could run everything independently without fear. "Sharing like a two year old" is a big resource problem, fighting over something actually needed to do one's job. The OK-FLAG syndrome is an accidental resource problem caused by a stupid/lazy mistake you knew was wrong when you did it. Adding insult to injury, the OK-FLAG syndrome can also be caused by the combination of this stupid mistake and an uninitialized pointer/buffer overrun/etc. The house of cards would "work" if something wasn't walking on the OK-FLAG. You can find and fix this with good static source code analysis tools. You need the really good commercial ones, though.

Mom got in the way - also known as I got here first!
In an effort to stop two year olds from fighting, someone went full-on nanny state with mutex locking. Now, instead of trashing a resource and causing null/corrupted packets, processes have to cool their heels waiting for a resource to free up. 921 kBaud isn't leaving much time for heel cooling, but with two cores you have a lot more things happening, and they all want their mutex locks.

Unrealistic heartbeat
You never said "what" the comm failure was. Many times a comm failure is due to an unrealistic heartbeat. Say we have a heartbeat constraint of N packets every Y clock ticks. It's generally derived by someone holding a thumb up to the screen, squinting over it with a single eye, and saying "Eh? That looks good." When more things start happening on the box, the comm "failure" is due to an arbitrary time limit. The whole packet got there, just a bit late.

It ain’t your fault
This is the one we always reach for way too soon. It's also possible this is not your fault. I actually ran into this on my last project with Toradex products, just not an iMX6. The Dahlia carrier board had things that weren't properly grounded/tied off, and when load started creeping up, all kinds of "board noise" would invade places that had nothing to do with it. You need a very diligent and determined hardware person to go through your carrier board, looking at everything you're not using, to ensure it has all been properly tied off so it can neither throw nor catch "board noise."

===== Potential next steps.
Run the above-mentioned static analysis on all code
You probably won’t be lucky enough for it to be an uninitialized something or an overflow somewhere, but one can always hold out hope.

Have a meeting of the minds about the length of your heartbeat timeout
Is there an actual business-case reason for it, or did it "just seem to work"? This can also indicate a sh*t design. I've seen developers create really horrible packet communications without any well-defined packet boundary markers. Very common with developers who came from the Internet era, where TCP/IP would define a packet for them and they could stuff whatever they wanted inside. As a result they try to "guess" when a full packet will be in the I/O buffer and bake that value into a timer.

  1. Use a ring buffer that can hold at least 3 full packets.
  2. Create a protocol of well-defined packets that either
    a) all start with the same 3 (or more) bytes that cannot appear in the data and are all the same fixed size, or
    b) are bounded by 3 (or more) unique start-of-packet bytes and 3 (or more) unique end-of-packet bytes, where neither byte series can occur within the data.

Your UART I/O code reads bytes off the UART, stuffing them into the next available slots of the ring buffer (wrapping when it hits the end), and it scans the buffer looking for a "full packet." When a full packet is found, it signals/calls a callback with the boundary information, which puts that information on a queue and then quickly returns. Some other thread notices there is a new entry in the queue and processes it.

No need for the wish-upon-a-star heartbeat logic.
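To make that concrete, here is a minimal sketch of such a scanner, assuming (purely for illustration) a 3-byte start marker of 0xAA 0x55 0xA5 that cannot occur in the payload and a fixed 200-byte packet size; adapt the framing to whatever protocol you actually define:

```cpp
#include <stddef.h>
#include <stdint.h>

#define PACKET_SIZE 200                 /* illustrative fixed packet length */
#define RING_SIZE   (PACKET_SIZE * 3)   /* hold at least 3 full packets     */

typedef struct {
    uint8_t buf[RING_SIZE];
    size_t  head;   /* next write position */
    size_t  tail;   /* next scan position  */
} Ring;

static const uint8_t MARKER[3] = { 0xAA, 0x55, 0xA5 };  /* assumed start marker */

static size_t ring_used(const Ring *r)
{
    return (r->head + RING_SIZE - r->tail) % RING_SIZE;
}

/* Called from the UART read thread for every byte pulled off the port.
 * (No overflow handling here; a real version must deal with a full ring.) */
void ring_put(Ring *r, uint8_t b)
{
    r->buf[r->head] = b;
    r->head = (r->head + 1) % RING_SIZE;
}

/* Scan for a complete packet; copy it to 'out' and return 1 when found.
 * The caller queues 'out' for another thread and returns immediately. */
int ring_scan(Ring *r, uint8_t out[PACKET_SIZE])
{
    while (ring_used(r) >= PACKET_SIZE) {
        size_t i;
        int ok = 1;
        for (i = 0; i < 3; i++) {           /* start marker at current tail? */
            if (r->buf[(r->tail + i) % RING_SIZE] != MARKER[i]) { ok = 0; break; }
        }
        if (!ok) {                          /* not a packet start: resync by one byte */
            r->tail = (r->tail + 1) % RING_SIZE;
            continue;
        }
        for (i = 0; i < PACKET_SIZE; i++)   /* copy the complete packet out */
            out[i] = r->buf[(r->tail + i) % RING_SIZE];
        r->tail = (r->tail + PACKET_SIZE) % RING_SIZE;
        return 1;
    }
    return 0;   /* not enough data yet; no timer guessing required */
}
```

The point of the resync loop is that a partial or corrupted packet costs you one packet, not the whole stream: the scanner just slides forward until it finds the next marker.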

Search your code for an OK-FLAG type problem
There is no tool I know of for this. Any place where you are keeping track of state is a good place to start. Believe it or not, full Doxygen comments in your code can be used to generate a diagram showing the relationships of every class/module. Sometimes you can find the problem just by looking at the picture. Most times not.
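For what it's worth, the Doxygen switches that turn those relationship diagrams on are roughly these (Graphviz's dot has to be installed; treat the excerpt as a starting point, not a complete Doxyfile):

```
# Doxyfile excerpt: enable class/call relationship diagrams (requires Graphviz)
EXTRACT_ALL         = YES
HAVE_DOT            = YES
CLASS_GRAPH         = YES
COLLABORATION_GRAPH = YES
CALL_GRAPH          = YES
CALLER_GRAPH        = YES
```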

Cut the size of your packets in half
Does the problem remain roughly the same, or does it zero out? If it zeroes out, there is a high probability mom is definitely getting in the way or you had an unrealistic heartbeat. It's counter-intuitive to think that things can run worse with dual or multiple processors than with a single one, but I've seen it. There wasn't enough left over with one CPU and 256 MB of RAM to allow another thread/process to run, but with either 512 MB or the extra CPU, the problem child comes to life. If there were an iMX6 Solo with 512 MB, that could give you an interesting data point: was the problem child dormant due to RAM or CPU? If it stays dormant with 512 MB RAM and a Solo SOM, you would at least know the additional processor brought this problem child to life.

Sorry for the massive ramble, but it is Sunday and I was looking for an excuse to avoid doing what I’m supposed to be doing in the office this morning. Hopefully some of this will actually help.

Thank you for your reply! I’ll consider your points.

You never said “what” the comm failure was.

My bad.

The software sends a request and expects a fixed-length response (200 bytes). In 0.2% of cases, instead of 200 bytes, ReadFile() retrieves 180…199 bytes with correct content, and the remaining 1…20 bytes never arrive.

Comm timeouts, total and interval, are large enough not to be the cause. For requests with shorter responses (< 50 bytes), such failures never happen.
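(For reference, by "total and interval" I mean the Win32 COMMTIMEOUTS fields; the values below are illustrative, not our production settings.)

```cpp
#include <windows.h>

/* Illustrative timeout setup; our real values are far larger than any
 * plausible transmission time for a 200-byte response at 921 kBaud. */
BOOL configure_timeouts(HANDLE hPort)
{
    COMMTIMEOUTS to = { 0 };
    to.ReadIntervalTimeout         = 100;   /* max ms allowed between two bytes */
    to.ReadTotalTimeoutMultiplier  = 10;    /* ms added per requested byte      */
    to.ReadTotalTimeoutConstant    = 1000;  /* ms added per ReadFile() call     */
    to.WriteTotalTimeoutMultiplier = 10;
    to.WriteTotalTimeoutConstant   = 1000;
    return SetCommTimeouts(hPort, &to);
}
```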

Could it have something to do with the UART buffer length? Recently we dealt with a short sound playback problem on Colibri T20 SOMs, and the problem was fixed by changing the sound driver’s buffer size, as suggested by the Toradex documentation.

Never? Or are they simply thrown away because you have a start-of-packet header and, after the timeout, everything that occurs before it gets thrown away?

You see, unless you have a lot more threads/processes using the COM port with the Dual SOM, or you were maxing out the Solo so completely that an overrun could not happen, I don't see how it could be the buffer size in the driver. If you have the exact same number of threads reading and writing the exact same number of messages, and those messages are basically the same as when it works, you weren't overflowing the buffer with a single core. A second core that isn't generating messages or using the COM port in any way wouldn't make a difference. Some threads would run on it and others on the original core.

I’m thinking of Linux though. Win CE has a loooong sordid history of not-logical issues and design flaws.

Aren’t you using handshaking?

How long is your cable? That should produce garbage, though, not consistently drop the last few bytes of a packet.

https://social.msdn.microsoft.com/Forums/vstudio/en-US/5264f591-5b61-4938-bfff-3cc20927516d/difference-between-ceoverrun-and-cerxover-in-comstat-errors?forum=vclanguage

Have you run a full static source code analysis on all your code to make certain you aren’t chasing a gremlin caused by a completely unrelated part of the source? Right now that is the best bang for the buck you could ever get.

I have no real desire to work in Windows of any kind. Just from what I've seen in this message thread, this has all of the earmarks of a comm design being run without any kind of handshaking or flow control. It feels like you have a timer problem causing a new read thread to launch before the old read thread has had a chance to complete. The reason it feels like that is you have a perfect packet just missing a few bytes at the end. Proper handshaking and flow control will fix that, along with some form of mutex or whatever that blocks a new read thread/process from starting before the last one bails.
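If you do try hardware handshaking, the Win32 side is set through the DCB; a minimal sketch, assuming the CTS/RTS lines are actually wired on your COM2:

```cpp
#include <windows.h>

/* Enable CTS/RTS hardware flow control on an already-open port.
 * Only useful if the carrier board actually routes those lines. */
BOOL enable_hw_flow_control(HANDLE hPort)
{
    DCB dcb = { 0 };
    dcb.DCBlength = sizeof(dcb);
    if (!GetCommState(hPort, &dcb))
        return FALSE;
    dcb.fOutxCtsFlow = TRUE;                  /* pause TX while CTS is deasserted  */
    dcb.fRtsControl  = RTS_CONTROL_HANDSHAKE; /* drop RTS when the RX buffer fills */
    dcb.fOutX = FALSE;                        /* no XON/XOFF on a binary protocol  */
    dcb.fInX  = FALSE;
    return SetCommState(hPort, &dcb);
}
```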

Brute force debugging:
At the start of each read thread dump out "**** starting read: " and a timestamp line to the terminal/log.
At the end of each read thread dump out "---- completed read: " and a timestamp line and packet length to terminal/log.

If the problem is a too-soon start, you will see two "**** starting read:" lines without a "----" line between them.

Do the same thing with your WRITE logic in case a second write is walking on an unfinished write. In a system without proper flow control this can easily happen.
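In code, the bracketing could look something like this (printf stands in for whatever logging you have; GetTickCount() gives a cheap millisecond timestamp):

```cpp
#include <stdio.h>
#include <windows.h>

/* Call at the very top of the read logic. */
void log_read_start(void)
{
    printf("**** starting read: %lu\n", (unsigned long)GetTickCount());
}

/* Call at the very end of the read logic, with the byte count obtained. */
void log_read_done(DWORD bytesRead)
{
    printf("---- completed read: %lu len=%lu\n",
           (unsigned long)GetTickCount(), (unsigned long)bytesRead);
}
```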

In a Linux world one would look to see if mgetty somehow got started and told to read COM2. That’s the process that lets one hit a return and gives you a terminal login prompt. Windows CE might have something similar that is somehow magically getting clock cycles when you install the extra horsepower.

Sending a request and receiving its response are already performed within a critical section. No thread can access the port until the running transaction has fully completed. If not all bytes have arrived, the OS never provides the remaining bytes, even with second-long comm timeouts. So invalid access synchronization is not the cause. The cable is several cm long, so that is not a problem.
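Roughly, the transaction looks like this (a sketch with illustrative names, not our actual code):

```cpp
#include <windows.h>

static CRITICAL_SECTION g_portLock;   /* InitializeCriticalSection() at startup */

/* One request/response transaction; no other thread touches the port
 * until this returns. */
BOOL transact(HANDLE hPort, const BYTE *req, DWORD reqLen,
              BYTE *resp, DWORD respLen)
{
    DWORD written = 0, total = 0, got = 0;
    BOOL ok;

    EnterCriticalSection(&g_portLock);
    ok = WriteFile(hPort, req, reqLen, &written, NULL) && written == reqLen;
    /* Keep reading until the full fixed-length response arrives or
     * ReadFile() times out (success with fewer bytes than requested). */
    while (ok && total < respLen) {
        ok = ReadFile(hPort, resp + total, respLen - total, &got, NULL);
        if (ok && got == 0)
            break;                     /* timeout: the missing-tail case */
        total += got;
    }
    LeaveCriticalSection(&g_portLock);
    return ok && total == respLen;
}
```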

Flow control could possibly be the cause; it is, however, strange (given that the synchronization exists) that the problem occurs on the more powerful module.

I hoped there were some UART fine-tuning parameters that we could play with to see what would happen. As far as I understand, there are no such parameters, and without a deep design inspection, locating and fixing the issue is hardly possible.

Please check Chapter 16, Serial Driver, of https://community.nxp.com/pwmxy87654/attachments/pwmxy87654/imx-processors/79792/1/WCE700_MX51_ER_1106_ReferenceManual.pdf for details about serial port fine-tuning.

Not really.

On a weak module, it had all it could do to execute the task at hand. There wasn't extra horsepower to run anything that might interfere.

Had this on a Qt project. The off-shore team had ancient laptops that were gasping and wheezing trying to run the application inside a VM. Qt garbage collection (deferred deletion) never got an opportunity to run because there wasn't any "idle" time. Here in the States we had the latest and greatest laptops. The thing would stack dump on every machine we had, though off-shore swore they couldn't reproduce it. We actually reproduced it for them on a video conference: just let the application sit there idle after you had done a few things and boom!

I see Alex sent you the tweaking stuff.

I’ve been doing serial comm since the days of DOS. When you are getting perfect short packets the order of things to look for is:

  1. port use conflict
  2. flow control - far too many designs completely ignore it or, worse, hard-wire CTS so the UART can't defend itself when its internal buffer is full.
  3. OK-FLAG syndrome
  4. Someone nuked the source

For that last one, this is why I mentioned the garbage collection stuff above. You need to verify that your source packet (well, the object or buffer containing it) didn't go out of scope and get garbage collected. You said your actual transmission code is in a "critical section." Does this also mean you are running with the privs of God in that section, so an access violation would go unreported?

Various libraries have deadly things people don't fully comprehend. One such thing is deleteLater() in Qt. It queues an event to delete the object on a "do during idle time" queue. Other libraries have similar things. You didn't say (or I didn't read) what language you are using. C# and the DOT-NET world have funky garbage collection.
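To illustrate the deleteLater() trap (a hypothetical snippet, not your code):

```cpp
#include <QObject>

/* deleteLater() does not delete immediately: it posts a DeferredDelete
 * event that the owning thread's event loop processes later. */
void risky(QObject *buffer)
{
    buffer->deleteLater();   // queues the deletion
    // 'buffer' is still alive here, so any use "works"...
    // ...but once control returns to the event loop the object is gone,
    // and anything still holding this pointer now walks on freed memory.
}
```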