coily channel: pure-Go resolver bypasses Tailscale MagicDNS on Linux #101

Open
opened 2026-05-26 19:54:37 +00:00 by coilysiren · 0 comments
Owner

Symptom

coily channel read <id> fails from kai-server with:

coily: channel: GET /agent-channel/VHGC: Get "http://api/agent-channel/VHGC": dial tcp: lookup api on 127.0.0.53:53: server misbehaving

curl http://api from the same shell on the same host works fine and resolves api to a tailnet IP (100.126.181.22 at the time of writing).

Cause

Coily is built with Go's default pure-Go resolver. The pure-Go resolver reads /etc/resolv.conf directly and sends queries to whatever nameserver is listed there. On Ubuntu kai-server that is 127.0.0.53 (systemd-resolved), which does not know Tailscale MagicDNS names. The pure-Go resolver does not call into libc or load NSS modules (it reads /etc/nsswitch.conf only to decide lookup order), so it bypasses the Tailscale resolver entry that libc-based clients (curl, getent hosts, ssh, anything that calls getaddrinfo) honor.

This is a known Go-on-Linux footgun. Reference: golang/go#57757 (https://github.com/golang/go/issues/57757) and Tailscale's MagicDNS doc (https://tailscale.com/kb/1081/magicdns/), which notes the systemd-resolved + Go interaction.

Fix options, ranked

  1. Build with cgo resolver - add -tags netcgo to the release build, or bake the setting in with a //go:debug netdns=cgo directive in the main package (Go 1.21+). Either way the build needs cgo enabled. This routes Go's net package through libc getaddrinfo, picking up nsswitch and the Tailscale entry. Lowest blast radius, no runtime config, and it fixes every coily verb that opens an HTTP connection, not just channel. Cost: enabling cgo requires a C toolchain at build time. The release builds already use cgo on macOS for keychain access, so this is incremental on Linux only.

  2. Runtime env in the systemd unit / shell rc - set GODEBUG=netdns=cgo in whatever launches coily. Works, but every host has to be configured, including ad-hoc shells, so it drifts.

  3. Resolve via the Tailscale local API - tailscaled exposes a Unix socket at /var/run/tailscale/tailscaled.sock and Tailscale's Go library has tailscale.com/client/tailscale.LocalClient with a WhoIs / LookupHostname surface. Coily could ask tailscaled directly when the requested host is a bare hostname or *.ts.net. More invasive, only helps the tailnet path, and breaks on hosts where tailscaled isn't local. Mostly mentioned for completeness - not the right call here.
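For option 3, the "bare hostname or *.ts.net" condition could be expressed as a small predicate. This is a sketch only: isTailnetCandidate is a hypothetical name, it is not part of coily or the Tailscale library, and it deliberately does not talk to tailscaled.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// isTailnetCandidate reports whether host looks like a name MagicDNS might
// answer: a bare single-label hostname or a name under ts.net.
// Hypothetical helper for illustration only.
func isTailnetCandidate(host string) bool {
	h := strings.ToLower(strings.TrimSuffix(host, "."))
	if h == "" {
		return false
	}
	// IP literals (including dotless IPv6 forms such as ::1) are not names.
	if net.ParseIP(h) != nil {
		return false
	}
	if !strings.Contains(h, ".") {
		return true // bare hostname such as "api"
	}
	return strings.HasSuffix(h, ".ts.net")
}

func main() {
	for _, h := range []string{"api", "example.com", "::1"} {
		fmt.Printf("%s -> %v\n", h, isTailnetCandidate(h))
	}
}
```

A check like this would only decide when to ask tailscaled; the drawbacks listed for option 3 still apply.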

Option 1 is the right move. Option 2 unblocks Kai today (GODEBUG=netdns=cgo coily channel read <id> works, verified).
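Because option 2 depends on the environment of whatever launches coily, a startup check can make drift visible. The sketch below is an assumption about how such a check might look, not existing coily behavior: netdnsSetting is a hypothetical helper that extracts the netdns value from a GODEBUG string, taking the last occurrence if the key is repeated. It only sees the environment variable, so it would not reflect a default compiled into the binary.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// netdnsSetting returns the value of the netdns key in a GODEBUG-style
// string ("k1=v1,k2=v2"), or "" if the key is absent. When the key is
// repeated, the last occurrence is returned.
// Hypothetical helper for illustration only.
func netdnsSetting(godebug string) string {
	value := ""
	for _, kv := range strings.Split(godebug, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(kv), "=")
		if ok && k == "netdns" {
			value = v
		}
	}
	return value
}

func main() {
	if v := netdnsSetting(os.Getenv("GODEBUG")); v == "" {
		fmt.Fprintln(os.Stderr, "GODEBUG does not set netdns; resolver choice is left to the build")
	} else {
		fmt.Fprintf(os.Stderr, "GODEBUG requests netdns=%s\n", v)
	}
}
```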

Acceptance

  • coily channel read <id> against a MagicDNS hostname succeeds from kai-server with no per-shell env tweak.
  • A regression test that resolves a tailnet hostname (or a mocked equivalent that exercises the same code path) fails before the fix and passes after it.

Provenance

Surfaced in agentic-os-kai#292 while teaching Claude about the o2r agent-channel surface.

coilysiren added P3 and removed P2 labels 2026-05-31 06:59:43 +00:00
Reference: coilyco-bridge/coily#101