I'm rebuilding my home server in NixOS.

Rather than configuring the various services natively in NixOS, I decided to run containers via virtualisation.oci-containers whenever possible, mostly to be able to update the system and the various services independently.
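
For reference, a container definition in my config looks roughly like this (a minimal sketch; the image matches the log below, the port mapping is purely illustrative):

virtualisation.oci-containers = {
  backend = "podman";
  containers.apprise = {
    image = "docker.io/caronc/apprise:1.1.8";
    # illustrative port mapping, not my real one
    ports = [ "8000:8000" ];
  };
};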

Everything is going smoothly, but whenever I (for whatever reason) run nixos-rebuild boot and reboot after adding a container, instead of running nixos-rebuild switch, I hit this issue where podman isn't able to resolve the registry host (below it's the Docker Hub host, but it also happened with ghcr.io):

podman-apprise-start[1352]: Trying to pull docker.io/caronc/apprise:1.1.8...
podman-apprise-start[1352]: Pulling image //caronc/apprise:1.1.8 inside systemd: setting pull timeout to 5m0s
podman-apprise-start[1352]: Error: initializing source docker://caronc/apprise:1.1.8: pinging container registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io: no such host

I thought that my podman-* services were missing a dependency on network-online.target and were being started before the network was available, but that isn't the case:

# systemctl list-dependencies podman-apprise.service 
podman-apprise.service
● ├─system.slice
● ├─network-online.target
● │ └─systemd-networkd-wait-online.service
● └─sysinit.target
●   ├─dev-hugepages.mount
[...snip...]
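
For reference, this is how I would have added such a dependency myself in NixOS config (a sketch; as the output above shows, the generated podman-* units already have it):

systemd.services.podman-apprise = {
  wants = [ "network-online.target" ];
  after = [ "network-online.target" ];
};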

Do you happen to know what the issue is?

PS: Manually running systemctl start podman-whatever once fixes the issue, of course, but I wonder if there's a more robust solution?


update:

After investigating based on balsoft's input below, the issue seems to be that systemd-networkd-wait-online doesn't behave as I expected.

Basically, systemd-networkd-wait-online waits for network interfaces to have a carrier (a working ethernet cable) and an IP address. This is what the systemd-networkd docs call the "degraded" state (no, it doesn't mean that something got worse than before... don't read too much into what "degraded" implies in English).
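
You can see which state each interface is currently in via the OPERATIONAL column of:

$ networkctl list

(networkctl status lan1 shows the same information for a single link, in more detail.)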

In my case, I have an interface that is set up via DHCP and that also has static IPs assigned:

$ cat /etc/systemd/network/00-lan1.network 
[Match]
Name=lan1

[Network]
DHCP=ipv4
IPv6AcceptRA=no
LinkLocalAddressing=no

[Address]
Address=192.168.10.10/24

[Address]
Address=192.168.10.99/24

If you are wondering, the reason I do this is that I want static IPs for my DNS server and reverse proxy, but I also want my home server to use DHCP to fetch some network-wide configuration which, critically, includes the default route.

Back to the issue: IIUC, since the interface has a non-link-local address (which systemd-networkd confusingly calls a "routable" address), it is immediately considered "routable" (a state considered "better" than "degraded"), so not only is it basically ignored by the default systemd-networkd-wait-online configuration, but even adding

[Link]
RequiredForOnline=routable

to /etc/systemd/network/00-lan1.network doesn't make a difference whatsoever.

For now, my stopgap solution is to explicitly set the default route for the "lan1" network:

[Network]
Gateway=192.168.10.1

this seems to solve the issue with podman and, while the system still thinks it is "online" before being fully configured, it will suffice until I find a more elegant/robust solution (ping me in a while if you are interested).
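
For completeness, since that file is generated from my NixOS config, the stopgap looks roughly like this there (a sketch assuming the systemd.network module; only the Gateway line is new):

systemd.network.networks."00-lan1" = {
  matchConfig.Name = "lan1";
  networkConfig = {
    DHCP = "ipv4";
    IPv6AcceptRA = false;
    LinkLocalAddressing = "no";
    Gateway = "192.168.10.1"; # the stopgap default route
  };
  address = [ "192.168.10.10/24" "192.168.10.99/24" ];
};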

refs:
systemd-networkd-wait-online man page
systemd-networkd docs on "RequiredForOnline"
networkctl man page

Comments
[email protected] 2 points 5 days ago (last edited 5 days ago)

As a relatively heavy solution, you can use a container orchestrator that treats a failure to pull an image as a temporary, transient situation. The lightest orchestrator I've used on NixOS in a homelab is k3s:

services.k3s.enable = true;

It is more-or-less a cut-down Kubernetes, and as such it will not fail to start merely because one Pod had an error pulling an image. It will also offer a path forward if you want to continue building up software-defined networking.

That all said, I'd re-examine what you want from service isolation during OS upgrades; it's possible for routine NixOS updates to only restart affected services and not reboot. In production, nixos-rebuild switch can do things like upgrade shared libraries, switch webroots, or flip feature flags. Containerization might be unnecessary overhead, and I say that as a Kubernetes proponent.

[email protected] 1 point 4 days ago

I too experimented with k3s, but abandoned the idea after I realized that the proper way to run postgres on it was (IIUC) to use Bitnami's helm chart. I like to have some level of understanding of how my homelab and its config work, and that humongous amount of unreadable templates was not appealing in the least.

As for containers, I am not really looking for service isolation (IIUC, until #368565 lands, all virtualisation.oci-containers basically run as root, and I'm fine with that*)... I just want to be able to run different versions of services (usually more recent ones, though in NixOS one also can't easily "pin" an older version of a package if the need arises**) than those packaged in NixOS. Also, not all services I want to run are available as NixOS packages, and even fewer have modules.

* I know what risk I'm running (more or less): nothing in my homelab is accessible from outside my lan and, even if the container host was somehow pwned, that machine can't really do much harm (the important stuff is on a separate one).

** I guess I could import an older version of nixpkgs in my flake, but that requires way too much editing just to pin a package (time I'd rather spend solving the actual issue).
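
For the curious, the pinning would look roughly like this (a sketch; the extra input name and the pinned package are made up):

# flake.nix
inputs.nixpkgs-old.url = "github:NixOS/nixpkgs/nixos-24.05";

# in a module, assuming inputs reaches it via specialArgs
environment.systemPackages = [
  inputs.nixpkgs-old.legacyPackages.${pkgs.system}.some-service # hypothetical
];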

[email protected] 2 points 1 week ago

The issue is that network-online.target does not necessarily mean that you actually have internet access; it usually just means that you have an IP address. I've found that it can take a dozen seconds to actually get internet connectivity, depending on your setup. As a hack, you can add while ! ping -c1 ghcr.io; do sleep 1; done (or similar) to the preStart of your container services.
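
A sketch of that hack in NixOS config, assuming the generated unit is called podman-apprise as in your logs, and using lib.mkBefore so the wait runs before the unit's own preStart:

systemd.services.podman-apprise.preStart = lib.mkBefore ''
  # crude: loop until name resolution and connectivity actually work
  while ! ${pkgs.iputils}/bin/ping -c1 ghcr.io; do sleep 1; done
'';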

[email protected] 2 points 1 week ago

Thanks, that was really helpful!

In my case, the system did not have a default route - I've updated the post with details.

[email protected] 2 points 1 week ago

Glad to hear you got it working!