Fixing native RS485 DE toggling on UART6

tobiasjakobi-compleo · October 2, 2023, 9:28am

Dear Toradex support,

we are using the Colibri i.MX7 SoM (eMMC, 1GB DRAM) in our product and have recently upgraded our OS to Yocto Kirkstone.

This means we are now using the kirkstone-6.x.y branch of the meta-toradex-bsp-common Git repository. More precisely we use the linux-toradex-mainline recipe to build our kernel.

We have a .bbappend with some patches that we apply to the kernel, but more on that later.

So first of all, Kirkstone is working pretty well. No larger problems so far. I was also pleasantly surprised that the amount of patches applied to the mainline kernel is pretty small. So I guess, a big thanks for upstreaming!

Now for the problem itself. On the board that the Colibri SoM is mounted on, we have an RS485 transceiver (some Renesas thing, ISL8xxx series). The transceiver is connected to UART6. We talk ModBus RTU over this bus, the SoM is the ModBus master.

Here’s how stuff is connected:

SODIMM 152 (with 22K PD) → ISL8xxx DE
SODIMM 103 → ISL8xxx DI (driver input)
SODIMM 101 → ISL8xxx RO (receiver output)

We used to use this pinctrl configuration in the DT:

pinctrl_uart6_p5x: uart6-grp {
    fsl,pins = <
        MX7D_PAD_ECSPI1_MOSI__UART6_DCE_TX  0x79	/* SODIMM 103 */
        MX7D_PAD_ECSPI1_SCLK__UART6_DCE_RX  0x79	/* SODIMM 101 */
        MX7D_PAD_EPDC_DATA11__UART6_DTE_RTS 0x79	/* SODIMM 152 */
    >;
};

Together with this UART configuration:

&uart6 {
	status = "okay";

	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_uart6_p5x>;

	assigned-clocks = <&clks IMX7D_UART6_ROOT_SRC>;
	assigned-clock-parents = <&clks IMX7D_OSC_24M_CLK>;

	linux,rs485-enabled-at-boot-time;
	uart-has-rtscts;
};

This was the configuration that was working with the old 5.4 downstream kernel: toradex_5.4-2.3.x-imx

This configuration stopped working with the 6.1 upstream kernel. Stopped working, as in, regular ModBus read register calls (that previously worked) no longer work (either due to timeouts, or corrupted data). I tracked the problem down to the following commit by Marek Vasut: Link

As a first step I configured the RTS pin as GPIO and used the rts-gpios property. This restores functionality, but I wasn’t really satisfied with this approach. Why has this worked all the time, was the question that went through my head.
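For readers who want to try the same workaround, the GPIO variant might look roughly like the fragment below. This is only a sketch, not the device tree actually used here: the pad macro MX7D_PAD_EPDC_DATA11__GPIO2_IO11, the GPIO number (gpio2 11), the &iomuxc parent and the label pinctrl_uart6_rts_gpio are assumptions that should be checked against imx7d-pinfunc.h and the board schematic. The pad setting 0x79 is simply carried over from the original pin group.

```dts
/* Sketch only: pad macro, GPIO number and labels are assumptions. */
#include <dt-bindings/gpio/gpio.h>

&iomuxc {
	pinctrl_uart6_rts_gpio: uart6-rts-gpio-grp {
		fsl,pins = <
			MX7D_PAD_ECSPI1_MOSI__UART6_DCE_TX	0x79	/* SODIMM 103 */
			MX7D_PAD_ECSPI1_SCLK__UART6_DCE_RX	0x79	/* SODIMM 101 */
			MX7D_PAD_EPDC_DATA11__GPIO2_IO11	0x79	/* SODIMM 152, DE as GPIO */
		>;
	};
};

&uart6 {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_uart6_rts_gpio>;

	/* DE is toggled through a GPIO instead of the UART's own RTS output. */
	rts-gpios = <&gpio2 11 GPIO_ACTIVE_HIGH>;
	linux,rs485-enabled-at-boot-time;
	/* uart-has-rtscts is left out: the serial binding describes it as
	 * mutually exclusive with rts-gpios. */
};
```

The trade-off is that DE is then switched by software rather than by the UART hardware, so the turnaround timing depends on driver latency.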

So this commit, which attempts to fix the RS485 DE active high scenario, seems to break the configuration we use on our hardware. Further analysis reveals that it’s the loopback mode that the commit uses to fix things.

I’ve put my changes on the FD GitLab:

So I’ve introduced a new DT property that disables the use of loopback mode. Once I disable loopback things work again. Of course the scenario that Marek was trying to fix is then broken again, but this scenario doesn’t matter for us (we can live with the transceiver blocking the bus if nobody has opened the corresponding tty).

For reference here’s the DTSI we are using:

What I don’t understand is how loopback mode could’ve broken anything to begin with. Hence me posting here to maybe get some insight from Toradex.

With best wishes,
Tobias

P.S.: As I’ve said above, we have applied some other patches to the kernel, see the FD GitLab. But the other patches have no influence on the UART (just a bunch of DRM stuff).

henrique.tx · October 3, 2023, 11:01am

Hi @tobiasjakobi-compleo !

Thanks a lot for the detailed description of the issue.

Some questions:

Can you please share the output of tdx-info? Reference: Getting Device Information with Tdx-Info | Toradex Developer Center
Which carrier board are you using? Is it one from Toradex or a custom one? Could you please try to reproduce it on a Toradex carrier board?

In the meantime, I am also trying to reproduce the issue.

Best regards,

DaveM · October 3, 2023, 3:05pm

Does this newer commit do anything to help your issue?

github.com/torvalds/linux

tty: serial: imx: fix rs485 rx after tx

committed 01:41PM - 19 Jun 23 UTC

+13 -5

Since commit 79d0224f6bf2 ("tty: serial: imx: Handle RS485 DE signal active high…") RS485 reception no longer works after a transmission.

The following scenario shows the problem:
1) Open a port in RS485 mode
2) Receive data from remote (OK)
3) Transmit data to remote (OK)
4) Receive data from remote (Nothing received)

In RS485 mode, imx_uart_start_tx() calls imx_uart_stop_rx() and, when the transmission is complete, imx_uart_stop_tx() calls imx_uart_start_rx().

Since the above commit imx_uart_stop_rx() now sets the loopback bit but imx_uart_start_rx() does not clear it causing the hardware to remain in loopback mode and not receive external data.

Fix this by moving the existing loopback disable code to a helper function and calling it from imx_uart_start_rx() too.

Fixes: 79d0224f6bf2 ("tty: serial: imx: Handle RS485 DE signal active high")
Cc: stable@vger.kernel.org
Signed-off-by: Martin Fuzzey <martin.fuzzey@flowbird.group>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20230616104838.2729694-1-martin.fuzzey@flowbird.group
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tobiasjakobi-compleo · October 4, 2023, 8:05am

@henrique.tx Hi, here’s the output of tdx-info:

Software summary
------------------------------------------------------------
Bootloader:               U-Boot
Kernel version:           6.1.42-compleo #1 SMP Thu Jul 27 06:50:53 UTC 2023
Kernel command line:      fbcon=rotate:2,logo-pos:center vt.global_cursor_default=0 vt.color=0xf0 console=ttymxc0,115200n8 ro rootwait panic=3 root=PARTUUID=34fc9833-fd99-4139-97c7-9e501c63c370 rootfstype=squashfs init=/linuxrc compleo_bootslot=b
Distro name:              NAME="TDX Compleo"
Distro version:           VERSION_ID=6.106.0-devel-20230906110805-build.0
Hostname:                 p52-tobi0
------------------------------------------------------------

Hardware info
------------------------------------------------------------
HW model:                 Toradex Colibri iMX7D on Compleo P52
Toradex version:          0039 V1.1A
Serial number:            06801287
Processor arch:           armv7l
------------------------------------------------------------

Concerning the carrier board: This is a custom one. I have access to the Colibri evaluation board, but I don’t see how testing on the eval board is feasible. As far as I can see in the eval board datasheet, the specific UART6 we are using is not exposed on the board. And even if it was, there is a different RS485 transceiver chip installed there (the datasheet mentions some Analog Devices chip, we use a Renesas one). If it helps, I can check whether I can provide the circuit diagram for the board (or at least an excerpt of the diagram).

@DaveM Hi, I’m aware of this commit. Actually it was among the first things I noticed when going through the commit log. To clarify: the Linux kernel is v6.1.42 (vanilla), plus the Toradex patches from the .bb, and patches from our .bbappend.
And v6.1.42 includes this particular commit. So all our tests were performed with this commit applied.

DaveM · October 4, 2023, 2:56pm

@tobiasjakobi-compleo A few cents’ worth regarding 485.
I’ve been around rs485 systems for many years and through many iterations of different OSs, microcontrollers, comm chips, transceivers, etc. Anything with the prefix “auto” has been nothing but trouble for me. I usually give it a try thinking it’ll save me some code and that the hardware folks have flushed out all the scenarios. In the end, it’s back to setting up the RTS line manually as soon as possible, keeping generic code away from it, and handling it where necessary. GPIO is your friend.

The combination of device tree bindings for active-high and active-low should set RTS appropriately during startup. I’ve also found the rts-delay binding quite necessary when connecting to different (older) pieces of hardware in the field. It’s a crapshoot if all of the bindings are implemented by the necessary driver though. But if you’re in there adding bindings anyway, no problem.

Good luck!

emanuele.tx · October 5, 2023, 6:39am

Hello Tobias,
Marek’s commit is quite interesting…
Let me try to explain. Having the serial peripheral handle RTS (RS485 DE) is a good thing. It knows exactly when the transmission has ended and enables RX as soon as possible.
If you have a “request/response” protocol where your iMX7 is sending the request, this is a good thing, because if the answering device answers “too early” (just after the request), the RS485 line is already “free”.
If your iMX7 moves the RTS signal with a delay, there is a possibility that the “answer” is corrupted.
Marek’s patch does not change this behavior, BUT it enables loopback while transmitting and disables loopback at the end of transmission. It also plays with RX enable (RXEN), which may be the worst thing (it enables/disables the internal clock and, IMHO, is supposed to be used to definitively disable RX if it is not needed).
And this is done in interrupts, which for sure have some latency.
Well, I do not know the ModBus specification, or whether it specifies a delay before the answer… but from my understanding it is a request-response protocol. And if the answering device is fast, we can run into problems.

In the end, from my point of view, with Marek’s patch we lose the benefit of having RTS automatically handled by the UART peripheral.

Well, another patch (Linux-Kernel Archive: [PATCH 5.15 501/846] tty: serial: imx: disable UCR4_OREN in .stop_rx() instead of .shutdown()) is related to the problem.

I think you have these options:

revert patches
remove all this playing around with LOOP and RXEN in imx_uart_stop_rx / imx_uart_start_rx (pay attention to imx_uart_disable_loopback_rs485, you have to get rid of this)
open the uart with SER_RS485_RX_DURING_TX but you have to manage the fact you receive back your data
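
The third option can also be expressed in the device tree instead of from userspace. A minimal sketch, assuming the generic rs485-rx-during-tx property from the kernel’s RS485 binding is honored by the driver in use. It has not been tested in this thread, and an application that later reconfigures the port through the TIOCSRS485 ioctl may override it:

```dts
&uart6 {
	linux,rs485-enabled-at-boot-time;
	/* Keep the receiver active while transmitting
	 * (corresponds to the SER_RS485_RX_DURING_TX flag). */
	rs485-rx-during-tx;
};
```

Depending on how the transceiver’s receiver enable is wired, the master may then read back its own request and has to discard that echo.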

Emanuele

PS: Read also this email, it seems to confirm my thoughts Re: [PATCH] tty: serial: imx: fix rs485 rx after tx - Sebastien Laveze

tobiasjakobi-compleo · October 5, 2023, 10:22am

Hey Emanuele,

thanks for looking into this!

Concerning Modbus RTU, it is indeed a request-response protocol. The Modbus master (which is the i.MX7) is sending a request, and the Modbus slave (which is an insulation monitoring device in our case) then sends its response back. There is some 3.5 character delay thing in the Modbus specs, so slaves are not allowed to “immediately” answer.

I’m not completely sure how to read your analysis. So you’re saying that the additional delay introduced by enabling/disabling loopback (which happens when starting Rx) is the cause for this issue? To be honest, I don’t find this very plausible. I mean, it’s just one read of the UTS register, some bit manipulation, and one write to the register again. That can’t be that bad. At least I don’t see this introducing several milliseconds of latency here.

Did you take a look at my patch that introduces the fsl,no-loopback binding?

With the fsl,no-loopback property set, RXEN is cleared in imx_uart_stop_rx(), and then set again in imx_uart_start_rx(). If toggling RXEN were really the problem, then we should see the problem also in this scenario. But my tests show that setting the DT property makes our Modbus communication stable again. So this doesn’t support the suspicion that RXEN is the culprit.

I still want to look into potential register write order problems. E.g. in imx_uart_start_rx() I would first disable loopback, and then write UCR1/2 (which in turn sets the RXEN bit). Doing it the other way around looks error prone to me.

Any thoughts?

emanuele.tx · October 10, 2023, 11:38am

Hello Tobias,
you are right: if the specification states something like that (3.5 char, at which bitrate?) my thoughts are probably wrong.
I looked at your patch. And I would ask: is the modbus library keeping the serial port “on” the whole time, or is it opening the file descriptor at every transaction (or every group of transactions)?
Some oscilloscope captures and/or a detailed description of the “error” would help (is something missing or corrupted in the received message? Is the request not correctly received by the counterpart?).

Kind regards.

rafael.tx · October 13, 2023, 7:09pm

Hello @tobiasjakobi-compleo
I also looked into the issue you describe, and I tend to agree with Emanuele. We need to investigate a little bit better what the application is doing to try to understand how the addition of the loopback mode could affect it.

If we’re considering just standard register writes you are correct, but I can imagine a scenario where the hardware takes a lot of time to leave loopback mode for instance, and this could affect the performance.

Is there a way you could monitor the serial data with an oscilloscope, maybe also with another channel looking at the DE pin to check if the driver is indeed being disabled before the slave starts to transmit?

Regards,
Rafael

rafael.tx · October 23, 2023, 2:13pm

@tobiasjakobi-compleo do you have any news regarding the investigation?

Thanks,
Rafael

tobiasjakobi-compleo · October 23, 2023, 2:51pm

Hello @emanuele.tx @rafael.tx

Sorry, haven’t had time yet to do further tests.

Concerning the application and the file descriptor question: The application is using plain libmodbus, which opens the device (in the sense of a tty device, i.e. /dev/ttymxc), and keeps it open during the entire runtime of the application. So it’s not a matter of the application constantly closing and opening the tty device.

If I want to properly measure things, I would need to measure before the transceiver, which would probably mean modifying our board. Not exactly sure how invasive this is, as I’m not a hardware engineer. I’ll let you know once I have further info.

rafael.tx · October 23, 2023, 3:55pm

What kind of modbus device are you connecting to?

tobiasjakobi-compleo wrote: “If I want to properly measure things, I would need to measure before the transceiver, which would probably mean modifying our board. Not exactly sure how invasive this is, as I’m not a hardware engineer. I’ll let you know once I have further info.”

Another option would be to use our eval board, but I don’t know if your application can run at all on it.

tobiasjakobi-compleo · October 24, 2023, 9:55am

@rafael.tx This is the insulation monitor that we communicate with: ISOMETER® isoCHA425HV with AGH420-1

Concerning testing with the eval board: It is certainly possible to run the application on the eval board. But maybe I’m missing something: we neither have the same transceiver chip on the eval board, nor do we have access to UART6 through exactly the same pins we are currently using. Or maybe you’re proposing a different test here?

rafael.tx · October 24, 2023, 10:33am

You’re right, I forgot you’re using a different UART port and a different transceiver so this will make it harder to test.
However, considering your previous findings, I would say it would be possible to replicate the same problem using the RS485 that’s provided on the eval board. If this is the case, it’s very easy to access the pins before the transceiver using the pin bars present there. This is what I did to connect the logic analyzer on mine when I was trying to reproduce this issue.

rudhi.tx · November 10, 2023, 12:25pm

Hello @tobiasjakobi-compleo,

Do you have any updates regarding this topic?

tobiasjakobi-compleo · March 7, 2024, 10:11am

Hello guys/gals,

sorry for the very late reply. I got swamped with other topics at work, and didn’t have any time to have a deeper look at the problem in the past weeks.

Things have settled down a bit now, and I’ve picked up the issue again. And it looks like we can consider this one resolved.

So I never got around to debugging this with a scope. Instead I built a small ModBus bus sniffer with a cheap dongle and modified an existing sniffer tool to also print timing information, i.e. how long it takes from the end of the previous ModBus frame to the start of the current frame. I had the hunch that we might not honor the ModBus interframe gap/pause correctly.

Turns out that we actually have two problems here.

1. Rx not working (at all) on master side
2. Interframe gap too small

Problem 1:

The sniffer shows that communication between our ModBus master (the i.MX7) and the slave (the insulation monitor) actually works. You can see the request frame appearing on the bus, and then some milliseconds later the slave sends its reply frame.
What doesn’t work is that apparently the master doesn’t see this reply. But it’s definitely there on the bus.

As I’ve already described above this problem disappears when disabling this loopback logic in the driver. And while I was busy hacking away at other code, other people have also encountered this problem:
https://lore.kernel.org/all/20240220061243.4169045-1-rickaran@axis.com/T/#me6095e4a14125df6034b89413764fbcc7fc7dc25

So it’s not a problem with just our hardware setup, but a general one. If you look at the mailing list thread, there are currently two patches being discussed. I have only tested the second one (the one by Christoph Niedermaier), and can confirm that it works. So given enough time, this fix is going to land upstream. For now I’m just going to proceed with my solution. Once the fix is upstream, I’m going to drop my code.

Problem 2:

Even with the loopback logic disabled, there were instances where we were running into a timeout waiting for the slave to send its reply. This usually happened when querying a bunch of ModBus registers in sequence. The first query worked, but the second query ran into a timeout. The third query worked again.

The sniffer analysis shows that the gap between the end of the slave’s reply and the start of the master’s next query is sometimes cutting it dangerously close to the specification. Baud rate is 19200, the ModBus spec says that the gap should be at least 1.75ms.
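
As a rough cross-check of that figure, the 3.5-character silent interval can be computed from the character time. This assumes 11 bits per RTU character (start bit, 8 data bits, parity, stop bit), which is an assumption based on the Modbus serial line specification and not something measured here:

```latex
t_{3.5} = 3.5 \times \frac{11\ \text{bits}}{19200\ \text{bit/s}} \approx 2.0\ \text{ms}
```

With 10-bit characters (8N1) the same formula gives about 1.8 ms. The fixed 1.75 ms value is, as far as I know, what the specification recommends for baud rates above 19200, so at exactly 19200 the required gap is, if anything, slightly larger.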

We have fixed this by setting rs485-rts-delay to <2 0> in the DeviceTree.
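
In device tree terms this could look like the fragment below. It is a sketch: only the rs485-rts-delay value comes from this post, and the surrounding node simply mirrors the UART configuration shown at the top of the thread. Per the generic RS485 binding, the first cell is the delay in milliseconds between asserting RTS and the start of transmission, the second cell the delay between the end of transmission and deasserting RTS.

```dts
&uart6 {
	linux,rs485-enabled-at-boot-time;
	uart-has-rtscts;
	/* <before-send after-send> in ms: wait 2 ms after asserting DE
	 * before the first byte goes out, release DE right after the last. */
	rs485-rts-delay = <2 0>;
};
```

The delay before sending pushes the start of each request back, which is what widens the gap after the slave’s previous reply.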

With these two changes we no longer see any problems on the RS485 bus.

With best wishes,
Tobias

tobiasjakobi-compleo · March 7, 2024, 10:15am

Forgot to add: the library we use, libmodbus, doesn’t honor the interframe gap:
https://groups.google.com/g/libmodbus/c/xZR66Gk_G2g?pli=1

So it’s the responsibility of the user to add certain delays.

rafael.tx · March 28, 2024, 12:52pm

Hello @tobiasjakobi-compleo,
thank you for the update! Whenever the changes for one of these patches are merged, I would expect it to be backported to the stable releases as well, and because of that, it will end up in one of our future releases.

Your plan to keep your changes for now and drop them when the fix comes from upstream looks good.

Thank you again for sharing the details with us.

Best regards,
Rafael