Wayland proxy load balancer

Updated Dec 23

Wayland clients (applications) may face various difficulties that are not primarily caused by them. There are three main Wayland compositors (Mutter/GNOME, KWin/KDE and wlroots/Sway) and each one behaves differently in some corner cases that are not exactly defined by the Wayland standards (or bends the specification somehow).

In the X11 world the underlying X.Org implementation is the same for every desktop environment, and the differences are introduced by window managers. Wayland merged the window manager (like Metacity) and the renderer (X11) into one block.

One of the top Firefox crash bugs on Wayland is the “Lost connection to Wayland compositor” one.

Mutter (and maybe other compositors) terminates a Wayland client if it’s recognized as stalled. That usually means the client doesn’t read messages from the Wayland display socket fast enough and the compositor’s output buffer for that client is full. It may be a bug in the application itself (the event loop is not being processed) or it may be caused by input devices like a 1000 Hz mouse which generates too many events.
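To make the “output buffer is full” condition concrete, here is a small sketch in Python (it is not Firefox or Mutter code). It assumes a plain Unix socket pair standing in for the compositor/client connection, and the name fill_until_full and its parameters are made up for illustration. The “compositor” end keeps writing while the “client” end never reads, until the kernel refuses further writes:

```python
import socket


def fill_until_full(message_size=64, limit=64 * 1024 * 1024):
    """Write to a peer that never reads.

    Returns (bytes_queued, stalled). `stalled` is True when the kernel
    refused a write because the socket buffer was full (EAGAIN).
    """
    # Hypothetical stand-ins for the two ends of a Wayland display socket.
    compositor, client = socket.socketpair()
    compositor.setblocking(False)
    payload = b"\0" * message_size
    queued = 0
    stalled = False
    try:
        while queued < limit:
            try:
                # send() may accept fewer bytes than requested, so count
                # what was actually queued.
                queued += compositor.send(payload)
            except BlockingIOError:
                # The peer is not reading and the buffer is full.
                stalled = True
                break
    finally:
        compositor.close()
        client.close()
    return queued, stalled


if __name__ == "__main__":
    queued, stalled = fill_until_full()
    print(f"queued {queued} bytes before stalling: {stalled}")
```

How many bytes fit depends on the kernel’s socket buffer settings. Once a compositor reaches this point it has to decide what to do with the client, and as described above, Mutter drops it.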

Unfortunately the Wayland protocol doesn’t implement any kind of display connection management. Once the connection is lost or disconnected, there isn’t any way to restore it and the Wayland client is terminated. The most visible example is a Wayland compositor crash, which takes down all applications; there isn’t any recovery point available.

There are various discussions going on [1],[2] but they’re stalled too 😀, as is the discussion about the PIP (Picture-in-Picture) Wayland protocol extension [3].

But Firefox needs to solve the crashes caused by the message jam right now, as it’s shipped to a wide audience. An initial idea popped up in the discussion: create a proxy between Firefox and the Wayland compositor to cache messages and prevent the compositor’s message queue from overflowing. It’s a very simplified case of Waypipe, which routes Wayland communication over a network and adds network transparency to the Wayland protocol.
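The caching idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual wayland-proxy code: the class name OneWayRelay and its methods are hypothetical, it relays in one direction only (compositor to application), and it ignores the file descriptors that Wayland passes as ancillary data, which a real proxy has to forward too.

```python
import collections


class OneWayRelay:
    """Forward bytes from `source` to `sink`, queueing what `sink` can't take yet."""

    def __init__(self, source, sink):
        self.source = source          # socket connected to the compositor
        self.sink = sink              # socket connected to the application
        self.pending = collections.deque()
        self.source_closed = False
        source.setblocking(False)
        sink.setblocking(False)

    def queued_bytes(self):
        """Number of bytes cached in the proxy and not yet delivered."""
        return sum(len(chunk) for chunk in self.pending)

    def pump(self):
        """Drain the source completely, then flush as much as the sink accepts."""
        # Reading first keeps the compositor's output buffer empty even
        # when the application is not reading at all.
        while not self.source_closed:
            try:
                chunk = self.source.recv(4096)
            except BlockingIOError:
                break                  # nothing more to read right now
            if not chunk:
                self.source_closed = True
                break
            self.pending.append(chunk)
        # Deliver in order; stop as soon as the sink's buffer is full.
        while self.pending:
            try:
                sent = self.sink.send(self.pending[0])
            except BlockingIOError:
                break
            if sent < len(self.pending[0]):
                self.pending[0] = self.pending[0][sent:]
            else:
                self.pending.popleft()
```

A real proxy would wait in poll() instead of being called in a loop, and would relay the application’s requests in the other direction the same way. Note that the queue here is unbounded: the memory pressure moves from the compositor into the proxy, it does not disappear.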

There’s an initial, successful proof of concept written in Rust, and I reimplemented it in C++ as a wayland-proxy module which can be shipped with Firefox (mainly because my Rust knowledge is non-existent).

Wayland-proxy can be used as a standalone application or as a library included in a Wayland application. It’s shipped with Fedora Firefox 121.0 right now, so let’s see how it works.

UPDATE: Thanks for the valuable feedback in the comments, it’s definitely worth a look! There are more issues which need a different approach, and the Firefox problem also depends on Gtk3 and its way of handling the Wayland connection. I was also pointed to the Qt 6.6 robustness project, which implements Wayland reconnection at the Qt toolkit level.

12 thoughts on “Wayland proxy load balancer”

      1. Please do give me a shout if you want to talk about that approach. I had written patches for other toolkits, I never tried GTK3 as it was feature frozen, but it’s definitely doable.

  1. GTK also immediately kills the application with an abort if it ever gets an EAGAIN from the socket, which is… suboptimal. If the compositor itself gets stalled for any reason, applications like Firefox can crash.

    Even making the socket blocking and stalling the application would be better than this.

  2. https://wayland-book.com/xdg-shell-basics.html
    Sorry, this is part of the Wayland protocol. There is a ping from the Wayland compositor that the application needs to answer with the correct pong, and if your application does not respond, the compositor is within its rights under the protocol to kill the connection, because it gets to class your application as deadlocked.

    Nothing that is happening to Firefox here is in fact outside the protocol or undefined.

    “Unfortunately Wayland protocol doesn’t implement any kind of display connection management. Once the connection is lost / disconnected, there isn’t any way how it can be restored and Wayland client is terminated.”

    This is not in fact true.
    https://blog.davidedmundson.co.uk/blog/qt6_wayland_robustness/

    There is the Wayland robustness work. Yes, the application could treat this event the same way as a compositor crash and redo the connection.

    Yes, a lost connection is a dead connection, but nothing in the Wayland protocol rules says that an application that has been incorrectly declared deadlocked cannot open a new connection to the compositor and keep on running, using the parts made for Wayland robustness.

    This is not a Mutter bug. Any Wayland compositor implemented exactly to the protocol is going to terminate connections to applications that do not keep up with their processing, because the application will be failing to send back the required pongs.

    The Wayland proxy means you are more likely to respond with the required pongs, so it reduces the issue, but you have not fixed the issue.

    The issue is how to handle the case where the compositor has declared Firefox deadlocked when Firefox is not deadlocked.

    When the compositor closes the socket between the application and itself, there is no reversing this action, so you can only go forwards and either terminate or create a new connection. The reality is that if an application wants to remain running after being declared deadlocked, it has to open a new connection to the compositor and pull a Wayland robustness, re-sending everything to set its output back up.

    1. Well, the proxy targets a real issue with Wayland/Firefox/Gtk3. We can discuss where the problem is, but we also need a solution which prevents Firefox from crashing right now. See the referenced issues at freedesktop: there isn’t any solution available.

      1. You have not targeted the whole issue.
        https://wayland.app/protocols/xdg-shell#xdg_wm_base:event:ping

        I don’t see your proxy handling the Wayland ping/pong that is required to tell the compositor not to kill your program’s connection.

        On a failure to respond to a ping, the compositor raises an unresponsive error; on a failure to respond to that, the compositor proceeds as per protocol to kill the connection.

        The Wayland protocol has a watchdog, and this is how compositors normally detect an application as stalled/deadlocked. Yes, this does mean the proxy needs to at least process enough of what it is being sent to detect pings and respond with pongs. Failure to respond with pongs means the connection is terminated at the compositor’s own discretion, and this is written into the protocol.

        With the way you have done the proxy, since it does not handle the compositor’s pings, all you have done is reduce the probability of being terminated, not fix it.

        A full fix will be something like the Wayland robustness work, for the case where the compositor cuts off the application’s connection.

        “arbitrary amounts of time”: this is the problem. Because the Wayland protocol has a watchdog, you cannot just queue up and store messages while processing none of them and expect nothing bad to happen.

        You have hit the first bug: the compositor kills your connection because the buffers end up full, since you are not processing them fast enough. The second bug is that not processing the messages at all will result in the compositor closing the connection when the ping/pong watchdog kicks in.

        This watchdog in the Wayland protocol is why just “increase the socket buffer size” will not reliably fix the problem either.

        Yes, the Wayland watchdog is most likely not 100 percent avoidable. This means you now need robustness code to deal with the case where the connection gets cut off.

        Fun point: Wayland compositors are free to decide when they will ping your application; normally they do this when their code thinks there is something wrong with the application. It could be possible that Firefox lacks ping/pong processing completely. So the watchdog behavior is basically unique per compositor.

        This is the problem with the “right now” fix. It becomes really easy to not go through the protocol and map out the path of failure to see what needs doing.

        https://gitlab.freedesktop.org/wayland/wayland/-/issues/159

        The issue has been debated many times. The reality is that at some point the Wayland compositor has to kill applications’ connections.

        Wayland compositors are not going to do what X11 servers did and just keep allocating more and more memory, until the system kills the X11 server or runs out of memory, because an application is not keeping up with processing input.

        One of the realities of the way the Wayland protocol is written is that the compositor is free to rug-pull the application at any time. This is where Wayland robustness comes in.
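For readers wondering what “process enough to detect pings” would involve, here is a rough Python sketch of the parsing side. It relies on the Wayland wire format, where each message starts with a 32-bit object ID followed by a 32-bit word holding the message size (upper 16 bits, header included) and the opcode (lower 16 bits), in host byte order. The function names are made up, and wm_base_id is an assumption: a real proxy would have to learn the xdg_wm_base object ID by following the client’s registry bind request.

```python
import struct

# Object ID, then (size << 16 | opcode), both in host byte order.
HEADER = struct.Struct("=II")

# ping is the only event defined on xdg_wm_base, so its opcode is 0.
XDG_WM_BASE_PING = 0


def split_messages(buffer):
    """Split raw bytes into (object_id, opcode, payload) tuples.

    Returns the complete messages and the leftover bytes of a trailing
    incomplete message, which the caller keeps until more data arrives.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        object_id, size_opcode = HEADER.unpack_from(buffer, offset)
        size = size_opcode >> 16
        opcode = size_opcode & 0xFFFF
        if size < HEADER.size:
            raise ValueError("malformed message: size smaller than header")
        if len(buffer) - offset < size:
            break  # message not fully received yet
        payload = bytes(buffer[offset + HEADER.size:offset + size])
        messages.append((object_id, opcode, payload))
        offset += size
    return messages, bytes(buffer[offset:])


def find_pings(messages, wm_base_id):
    """Return the serials of xdg_wm_base.ping events sent to wm_base_id."""
    return [
        struct.unpack("=I", payload[:4])[0]
        for object_id, opcode, payload in messages
        if object_id == wm_base_id and opcode == XDG_WM_BASE_PING
    ]
```

Whether a proxy should answer those pings on the application’s behalf is a separate design question, since it would also hide a client that really is deadlocked.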

  3. I suspect “Mutter (and maybe other ones) terminates Wayland client if it’s recognized as stalled.” is a misunderstanding of who terminates what. There is no need to terminate the client program, disconnecting relieves the compositor already and even that is not usually necessary.

    “Once the connection is lost / disconnected, there isn’t any way how it can be restored and Wayland client is terminated.” is false.

    Once the connection is lost, libwayland functions start returning errors. It does not terminate the program. You can choose to handle those errors by e.g. saving your work and doing a clean exit, or by making a new Wayland connection and even re-creating your windows, not to mention that Qt developers have been demonstrating recovery. If you choose to abort your program on disconnection, that’s on you or your choice of a toolkit.

    There is a very simple solution to the program main event loop stalls causing Wayland event overflows: create another thread whose only job is to read the Wayland connection and queue incoming events. Then you have unlimited incoming event buffering where it belongs, and you can stall as much as you like. How to integrate that with a toolkit of your choice is another question. Did anyone suggest this? The Mozilla bug has too many comments to read.

    The opposite problem (your link [2]), the client flooding the compositor, is the client sending more requests than the compositor can handle. This is completely orthogonal to the event overflow (your link [1]).
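The reader-thread suggestion from this comment can be sketched roughly like this in Python. It is a toy model, not libwayland or GTK code: the function start_reader is hypothetical, a plain socket stands in for the Wayland connection, and raw chunks stand in for decoded events.

```python
import queue
import threading


def start_reader(connection, events):
    """Start a thread whose only job is to read `connection` into `events`.

    `events` is an unbounded queue.Queue; a None entry marks the end of
    the connection.
    """
    def run():
        while True:
            try:
                chunk = connection.recv(4096)
            except OSError:
                chunk = b""            # treat a broken socket as closed
            if not chunk:
                events.put(None)       # tell the main loop we are done
                return
            # Queue immediately so the sender's buffer never fills up,
            # no matter how long the main loop is stalled.
            events.put(chunk)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def drain(events):
    """Main-loop side: collect everything queued so far, without blocking."""
    chunks = []
    while True:
        try:
            chunks.append(events.get_nowait())
        except queue.Empty:
            return chunks
```

The main loop calls drain() whenever it gets around to it. As the comment says, wiring this into a toolkit’s own event loop is the hard part and is not shown here.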
