If you read the above title, you might wonder how the switch to Wayland (yes, the graphical stack replacing the venerable X11) can possibly relate to SSH agents. The answer is easy.
For as long as I can remember, as a long-time user of gpg-agent as my SSH agent (because my SSH key is a GPG sub-key), I relied on /etc/X11/Xsession.d/90gpg-agent, which configures the SSH_AUTH_SOCK environment variable (pointing to gpg-agent’s socket) provided that I added enable-ssh-support in ~/.gnupg/gpg-agent.conf.
When I switched to Wayland, that shell script, which is part of the startup sequence of Xorg, was no longer run. For a while I cheated a bit by setting SSH_AUTH_SOCK directly in my ~/.bashrc. But that only works for terminals, and not for other applications that are started by the session manager (which is basically systemd --user).
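For reference, here is a minimal sketch of what such a ~/.bashrc workaround can look like. It is not necessarily the exact line I used: it assumes that gpgconf is installed and that enable-ssh-support is set, and it asks gpgconf for the socket path instead of hardcoding it.

```shell
# Sketch of a ~/.bashrc workaround (an assumption, not the exact original line).
# It only affects shells that source this file, which is precisely its limitation.
if command -v gpgconf >/dev/null 2>&1; then
    # Ask GnuPG where gpg-agent's SSH socket lives instead of hardcoding the path.
    SSH_AUTH_SOCK="$(gpgconf --list-dirs agent-ssh-socket)"
    export SSH_AUTH_SOCK
fi
```

Graphical applications launched by systemd --user never read ~/.bashrc, so they will not see this value.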
So how is that supposed to work out of the box nowadays? The SSH agents (as packaged in Debian) have all adopted the same trick: their .socket units have an ExecStartPost setting which runs systemctl --user set-environment SSH_AUTH_SOCK=some-value. This command dynamically modifies the environment of the running systemd daemon and thus influences the environment of the units started afterwards. Putting this in a socket unit ensures that it runs early, before most applications are started, so it’s a good choice. The units also tend to enforce this ordering explicitly with a directive like Before=graphical-session-pre.target.
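To see the result of that trick on a running session, you can look at the environment of the systemd user manager. The helper below is a small sketch (the function name is mine, purely illustrative) that extracts SSH_AUTH_SOCK from the output of systemctl --user show-environment read on standard input.

```shell
# Print the value of SSH_AUTH_SOCK found in "systemctl --user show-environment"
# output read from stdin. The function name is hypothetical.
ssh_auth_sock_from_env() {
    sed -n 's/^SSH_AUTH_SOCK=//p'
}

# Usage on a live session (needs a running systemd user manager):
#   systemctl --user show-environment | ssh_auth_sock_from_env
# To read the unit itself, including its ExecStartPost line:
#   systemctl --user cat gcr-ssh-agent.socket
```

If the printed path differs from what echo "$SSH_AUTH_SOCK" shows in a terminal, something in your shell startup files is overriding the value set by the socket unit.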
However, in a typical installation you end up with multiple SSH agents (right now I have ssh-agent, gpg-agent, and gcr-ssh-agent). Which one does the user end up using? Well, that is not clearly defined: the one that wins is the one that runs last… because each of them overwrites the value in the systemd environment.
Some of them fight to have that place (cf. #1079246 for gcr-ssh-agent) by setting explicit After directives. In that bug I argue that we should let gpg-agent.socket have the priority since that’s the only one that is not enabled by default and that requires the user to opt in. However, ultimately there will always be cases where you will want to be explicit about the SSH agent that should win.
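A quick way to find out which agent won on your session is to look at the socket path itself. The sketch below maps a path to an agent name; the path patterns are my assumptions about the default socket locations and may not match your system, so check them against your own units.

```shell
# Guess which agent owns a given socket path.
# The patterns below are assumptions about default socket locations;
# adjust them to what your units actually listen on.
agent_for_sock() {
    case "$1" in
        */S.gpg-agent.ssh) echo gpg-agent ;;
        */gcr/ssh)         echo gcr-ssh-agent ;;
        */openssh_agent)   echo ssh-agent ;;
        "")                echo none ;;
        *)                 echo unknown ;;
    esac
}

# Report the agent behind the current shell's SSH_AUTH_SOCK.
agent_for_sock "${SSH_AUTH_SOCK:-}"
```

Running it in a terminal and again after a fresh login shows whether the winner is stable or changes from one session to the next.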
You could rely on systemd overrides to add/remove ordering directives but that’s pretty fragile. Instead the right way to deal with this is to “mask” the socket units of the SSH agents that you don’t want. Note that disabling (i.e. systemctl --user disable) either will not work[1] or will not be sufficient[2]. In my case, I wanted to keep gpg-agent.socket so I masked gcr-ssh-agent.socket and ssh-agent.socket:
$ systemctl --user mask ssh-agent.socket gcr-ssh-agent.socket
Created symlink '/home/rhertzog/.config/systemd/user/ssh-agent.socket' → '/dev/null'.
Created symlink '/home/rhertzog/.config/systemd/user/gcr-ssh-agent.socket' → '/dev/null'.
Note that if you want that behaviour to apply to all users of your computer, you can use sudo systemctl --global mask ssh-agent.socket gcr-ssh-agent.socket. On next login, only a single SSH agent socket unit will run and the SSH_AUTH_SOCK value will thus be predictable again!
Hopefully you will find that useful, as it’s already the second time that I have stumbled upon this, either for me or for a relative. Next time, I will know where to look it up. 🙂
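If you want to double-check the per-user masks afterwards, remember that a masked unit is just a symlink to /dev/null, as the output above shows. The following sketch tests for exactly that; the helper name is illustrative, and it only looks in the per-user configuration directory, not at masks set with --global.

```shell
# Succeed if the given unit file path is a symlink to /dev/null, i.e. masked.
# The function name is hypothetical.
is_masked() {
    [ "$(readlink -- "$1" 2>/dev/null)" = "/dev/null" ]
}

# Per-user unit directory, where "systemctl --user mask" creates its symlinks.
unit_dir="${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user"
for unit in ssh-agent.socket gcr-ssh-agent.socket; do
    if is_masked "$unit_dir/$unit"; then
        echo "$unit: masked"
    else
        echo "$unit: not masked"
    fi
done
```

systemctl --user is-enabled ssh-agent.socket should tell the same story by printing masked, and systemctl --user unmask followed by the unit names reverts the change.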
[1]: If you try to run systemctl --user disable gcr-ssh-agent.socket, you will get a message saying that it will not work because the unit is enabled for all users at the “global” level. You can do it with --global instead of --user but it doesn’t help, see [2].
[2]: Disabling a unit basically means no longer explicitly scheduling its startup as part of a desired target. However, the unit can still be started as a dependency of other units, and that’s the case here because a socket unit will typically be pulled in by its corresponding service unit.