Here are some completely unfounded ideas and statements from Captain Obvious and his witless cohort, neither of whom uses Windows for anything if there is any way around it. (Given that quitting is an option, there is always a way around it.)
Sharing like a Two Year Old
Your code has threads that use some common (not contained within the thread) resources like a shared buffer or the COM port itself. When running on the single SOM, it’s like mom sitting between her two kids so they don’t fight in church. You don’t have enough horsepower for them to try running at the same time, so they don’t trash each other. When you have a second SOM, fisticuffs can reign supreme! They both scream like two-year-olds, “Me me me! Mine Mine Mine!” End result: they trash the common resource that isn’t shared properly and you get corrupted or null packets some of the time.
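For what it's worth, here is a minimal sketch of "sharing properly" for a buffer that two threads touch. Everything in it (the `SharedBuffer` name, the `append`/`take_all` methods) is made up for illustration; I have no idea what your actual buffer looks like. The only point is that every access goes through one lock and the lock is held for as short a time as possible.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical shared packet buffer; all names are illustrative only.
class SharedBuffer {
public:
    // Append bytes while holding the lock so two writers cannot interleave.
    void append(const std::uint8_t* data, std::size_t len) {
        std::lock_guard<std::mutex> guard(lock_);
        bytes_.insert(bytes_.end(), data, data + len);
    }

    // Swap the contents out under the lock and return them.
    // The caller then processes the bytes without holding the lock.
    std::vector<std::uint8_t> take_all() {
        std::vector<std::uint8_t> out;
        std::lock_guard<std::mutex> guard(lock_);
        out.swap(bytes_);
        return out;
    }

private:
    std::mutex lock_;                  // guards bytes_
    std::vector<std::uint8_t> bytes_;  // the shared resource
};
```

A reader thread would call `take_all()` and parse the result at its leisure, while the writers only ever wait for the length of a copy or a swap.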
The infamous OK-FLAG
During my days of programming in COBOL for Crescent Counties Foundation for Medical Care (no longer around), one programmer, Scott, wrote this entry screen that processed a large (for the day) chunk of data. It was massive. The entire house of cards was controlled by a single OK-FLAG. Every PERFORM paragraph was controlled by and returned state in the OK-FLAG. We told him two bytes of memory burned out every time the program ran: the byte he was using and the byte closest to it, due to the excess heat.
In the world of C/C++ developed embedded systems I’ve seen a lot of OK-FLAG type programs. It’s kind of the “sharing like a two year old” problem but it’s not. This is just bad design! This is also why you never use AGILE in any project. When you are adding features a User Story at a time it is too easy and convenient to simply re-use an OK-FLAG from some higher scope. If everything had its own OK-FLAG you could run everything independently without fear. “Sharing like a two year old” is a real resource problem: the thing being fought over is actually needed to do one’s job. The OK-FLAG syndrome is an accidental resource problem caused by a stupid/lazy mistake you knew was wrong when you did it. Adding insult to injury, the OK-FLAG syndrome can also be caused by the combination of this stupid mistake and an uninitialized pointer/buffer overrun/etc. The house of cards would “work” if something wasn’t walking on the OK-FLAG. You can find and fix the uninitialized pointers and overruns with good static source code analysis tools. You need the really good commercial ones though.
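To make “everything had its own OK-FLAG” concrete, here is a rough sketch where each step hands back its own status instead of writing into one flag shared by the whole program. The `Status` enum, the function names and the checksum-in-the-last-byte layout are all invented for the example; they are not from anybody's real code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical status type. Each call returns its own result, so no
// other thread or stray pointer can stomp on a shared flag.
enum class Status { Ok, BadLength, BadChecksum };

// Check the length only. Reads and writes no global state.
Status check_length(std::size_t len, std::size_t expected) {
    return len == expected ? Status::Ok : Status::BadLength;
}

// Check a simple additive checksum kept in the last byte (assumed layout).
Status check_checksum(const std::uint8_t* pkt, std::size_t len) {
    if (len == 0) return Status::BadLength;
    std::uint8_t sum = 0;
    for (std::size_t i = 0; i + 1 < len; ++i) {
        sum = static_cast<std::uint8_t>(sum + pkt[i]);
    }
    return sum == pkt[len - 1] ? Status::Ok : Status::BadChecksum;
}

// Each step's status is a local value that is simply passed up the stack.
Status validate_packet(const std::uint8_t* pkt, std::size_t len,
                       std::size_t expected) {
    const Status s = check_length(len, expected);
    if (s != Status::Ok) return s;
    return check_checksum(pkt, len);
}
```

Because nothing here lives at a higher scope, two threads can validate two different packets at the same time without knowing about each other.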
Mom got in the way - also known as I got here first!
In an effort to stop two-year-olds from fighting, someone went full-on nanny state with mutex locking. Now, instead of trashing a resource and causing null/corrupted packets, processes have to cool their heels waiting for a resource to free up. 921 kBaud isn’t leaving much time for heel cooling, but with two SOM you have a lot more things happening and they all want their mutex locks.
Unrealistic heartbeat
You never said “what” the comm failure was. Many times a comm failure is due to an unrealistic heartbeat. We have a heartbeat constraint of N packets every Y clock ticks. It’s generally derived by someone holding a thumb up to the screen and squinting over it with a single eye saying “Eh? That looks good.” When more things start happening on the box the comm “failure” is due to an arbitrary time limit. The whole packet got there, just a bit late.
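If you want something better than the thumb-and-squint method, you can at least compute the floor: how long the bytes physically take to cross the wire. The sketch below assumes 8N1 framing (10 bits on the wire per byte) and that “921 kBaud” means the usual 921600; both are my assumptions, so plug in your real numbers. The function names are made up.

```cpp
#include <cstdint>

// Microseconds needed to clock `bytes` bytes over the UART, rounded up.
// bits_per_byte = 10 assumes 8N1 framing (start + 8 data + stop).
std::uint64_t wire_time_us(std::uint64_t bytes, std::uint64_t baud,
                           std::uint64_t bits_per_byte = 10) {
    const std::uint64_t bit_count = bytes * bits_per_byte;
    return (bit_count * 1000000ULL + baud - 1) / baud;  // ceiling division
}

// Lower bound for a heartbeat window that must contain `packets` packets
// of `packet_bytes` bytes each. A real timeout needs margin on top of this
// for scheduling latency and everything else running on the box.
std::uint64_t min_heartbeat_us(std::uint64_t packets,
                               std::uint64_t packet_bytes,
                               std::uint64_t baud) {
    return wire_time_us(packets * packet_bytes, baud);
}
```

As a data point, `wire_time_us(256, 921600)` comes out to 2778, so a 256-byte packet needs roughly 2.8 ms just to arrive. Any heartbeat window close to that floor will fail as soon as the box gets busy.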
It ain’t your fault
This is the one we always reach for way too soon. It’s also possible this is not your fault. I actually ran into this on my last project with Toradex products, just not an IMX6. The Dahlia carrier board had things that weren’t properly grounded/tied-off and when load started creeping up all kinds of “board noise” would invade places that had nothing to do with it. You need a very diligent and determined hardware person to go through your carrier board looking at everything you’re not using to ensure it has all been properly tied off so it can neither throw nor catch “board noise.”
===== Potential next steps.
Run the above mentioned static analysis on all code
You probably won’t be lucky enough for it to be an uninitialized something or an overflow somewhere, but one can always hold out hope.
Have a meeting of the minds about the length of your heartbeat timeout
Is there an actual business case reason for it or did it “just seem to work”? This can also indicate a sh*t design. I’ve seen developers create really horrible packet communications without any well defined packet boundary markers. Very common with developers who came from the Internet era where TCP/IP would define a packet for them and they could stuff whatever they wanted inside. As a result they try to “guess” when a full packet will be in the I/O buffer and bake that value into a timer.
- Use a ring buffer that can hold at least 3 full packets.
- Create a protocol of well defined packets that either
a) all start with the same 3 (or more) bytes, which cannot appear in the data, and are all the same fixed size, or
b) are bounded by 3 (or more) unique start of packet bytes and 3 (or more) unique end of packet bytes. Neither series of bytes can occur within the data.
Your UART I/O code reads bytes off the UART stuffing them into the next available bytes of the ring buffer (wrapping when hitting end) and it scans the buffer looking for a “full packet.” When a full packet is found it signals/calls a call-back with boundary information that puts such information on a queue then quickly returns. Some other thread notices there is a new entry in the queue and processes it.
No need for the wish upon a star heartbeat logic.
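Here is a bare-bones sketch of option a): a ring buffer that holds four fixed-size packets and hunts for a 3-byte start marker. The marker bytes, the 16-byte packet size and the `PacketRing` name are all invented for the example. To keep it short it copies the packet out instead of queueing boundary information for another thread, and it does no locking, so treat it as the scanning logic only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPacketSize = 16;            // assumed fixed size
constexpr std::size_t kRingSize = 4 * kPacketSize; // room for more than 3 packets
// Assumed start marker; the protocol must keep it out of the data.
constexpr std::array<std::uint8_t, 3> kSync = {0xAA, 0x55, 0xC3};

class PacketRing {
public:
    // Store one byte from the UART. Returns false (byte dropped) when full.
    bool push(std::uint8_t b) {
        if (count_ == kRingSize) return false;
        buf_[(head_ + count_) % kRingSize] = b;  // wraps at the end
        ++count_;
        return true;
    }

    // Copy out one full packet if there is one. Bytes in front of the
    // start marker are thrown away so the scanner resynchronizes itself.
    bool pop_packet(std::array<std::uint8_t, kPacketSize>& out) {
        while (count_ >= kSync.size()) {
            if (peek(0) == kSync[0] && peek(1) == kSync[1] &&
                peek(2) == kSync[2]) {
                if (count_ < kPacketSize) return false;  // rest not here yet
                for (std::size_t i = 0; i < kPacketSize; ++i) out[i] = peek(i);
                drop(kPacketSize);
                return true;
            }
            drop(1);  // not a packet start, skip one byte
        }
        return false;
    }

private:
    std::uint8_t peek(std::size_t i) const {
        return buf_[(head_ + i) % kRingSize];
    }
    void drop(std::size_t n) {
        head_ = (head_ + n) % kRingSize;
        count_ -= n;
    }

    std::array<std::uint8_t, kRingSize> buf_{};
    std::size_t head_ = 0;   // index of the oldest byte
    std::size_t count_ = 0;  // bytes currently stored
};
```

The UART code calls `push()` for every byte and then `pop_packet()` until it returns false. Notice there is no timer anywhere: a packet is complete when the marker and the full byte count are in the buffer, however long that took.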
Search your code for an OK-FLAG type problem
There is no tool I know of that finds this one. Any place where you are keeping track of a state is a good place to start. Believe it or not, full Doxygen comments in your code can be used to generate a diagram showing the relationships of every class/module. Sometimes you can find the problem just by looking at the picture. Most times not.
Cut the size of your packets in half
Does the problem remain roughly the same or does it zero out? If it zeroes out there is a high probability mom is getting in the way or you had an unrealistic heartbeat. It’s counterintuitive to think that things can run worse with dual or multiple processors than with single, but I’ve seen it. There wasn’t enough left over with one CPU and 256 MB of RAM to allow another thread/process to run, but with either 512 MB or the extra CPU, the problem child comes to life. If there was an IMX6 Solo with 512 MB, that could give you an interesting data point. Was the problem child dormant due to RAM or CPU? If it stays dormant with 512 MB RAM and Solo SOM, you would at least know the additional processor brought this problem child to life.
Sorry for the massive ramble, but it is Sunday and I was looking for an excuse to avoid doing what I’m supposed to be doing in the office this morning. Hopefully some of this will actually help.