forgejo runner docker bridge churn breaks host TCP intermittently #151


Closed

opened 2026-05-27 03:10:50 +00:00 by coilysiren · 0 comments

coilysiren commented

2026-05-27 03:10:50 +00:00

Owner

Problem

The Forgejo Actions runner on kai-server causes intermittent host-network outages. During workflow runs, docker user-defined bridges churn (create/destroy bursts of br-XXXXXXXX interfaces) and race k3s kube-proxy's iptables sync. Result: host-namespace TCP listeners (sshd:22, kube-apiserver:6443, tailscaled PeerAPI:53096) become unreachable for windows of roughly 5 minutes or more. Pod-network ingress (caddy 80/443, public services) keeps working throughout, which is what made the failure hard to diagnose at first.

Evidence (2026-05-26 incident)

Diag snapshot: /Users/kai/projects/coilysiren/output.txt (kai-server, 02:53Z).

Bridge churn clusters in dmesg line up exactly with observed outages:

17:18 - 17:27 PT - bridge create/destroy cluster
17:57 PT - k3s: "apiserver was unable to write a JSON response: http: Handler timeout"
18:58 PT and 19:16 - 19:17 PT - more bridge churn
19:21 PT - second apiserver "Handler timeout", followed by a "use of closed network connection" loop on 127.0.0.1:6443 at 19:27 - 19:28 PT
19:44 - 19:45 PT - heavy bridge churn, coincident with reported "it's down again"
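The clustering above can be reproduced mechanically. The following is a minimal sketch, not the tooling used for this incident: it assumes each kernel log line has already been exported with a leading ISO 8601 timestamp (for example "2026-05-26T17:18:03 ...") followed by a space, and it treats any line mentioning a br-<hex> interface as a bridge event. The function name and the 2-minute gap are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumption: docker user-defined bridges show up as "br-" plus hex digits.
BRIDGE_RE = re.compile(r"\bbr-[0-9a-f]+\b")


def bridge_bursts(lines, max_gap=timedelta(minutes=2)):
    """Group log lines that mention a docker bridge into bursts.

    Each line is assumed to start with an ISO 8601 timestamp and a space,
    and lines are assumed to be in chronological order.
    Returns a list of (first_seen, last_seen, event_count) tuples.
    """
    bursts = []
    for line in lines:
        if not BRIDGE_RE.search(line):
            continue  # not a bridge event
        stamp, _, _ = line.partition(" ")
        when = datetime.fromisoformat(stamp)
        if bursts and when - bursts[-1][1] <= max_gap:
            # Close enough to the previous event: extend the current burst.
            first, _, count = bursts[-1]
            bursts[-1] = (first, when, count + 1)
        else:
            # Gap too large (or first event): start a new burst.
            bursts.append((when, when, 1))
    return bursts
```

Each returned tuple gives the start, end, and size of one churn cluster, which can then be lined up by hand against the apiserver timeouts.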

Ruled out: sshd is healthy (NRestarts=0, up since 2026-05-18). Conntrack 1758/917504 - nowhere near full. Zero OOM hits. fail2ban / ufw not installed.
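For the conntrack figure, the arithmetic is small enough to sketch. This assumes a Linux host with the nf_conntrack module loaded, where the counters are exposed under /proc/sys/net/netfilter; the helper names are illustrative.

```python
from pathlib import Path

# procfs locations of the netfilter connection-tracking counters.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count, limit):
    """Return the fraction of the conntrack table currently in use."""
    if limit <= 0:
        raise ValueError("conntrack limit must be positive")
    return count / limit


def read_conntrack():
    """Read (count, limit) from procfs; Linux only, needs nf_conntrack loaded."""
    return int(COUNT_PATH.read_text()), int(MAX_PATH.read_text())


# Figures quoted above: 1758 entries against a limit of 917504.
print(f"{conntrack_utilization(1758, 917504):.2%}")  # prints 0.19%
```

At roughly 0.2% of the table, conntrack exhaustion cannot explain dropped connections.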

Failure signature during outage:

tailscale ping kai-server - pong in <10ms (tailscaled responsive)
TCP/22, TCP/6443, TCP/53096 - all timeout (not refused)
Public HTTPS to caddy ingress - 200 OK
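The timeout-versus-refused distinction matters: a refusal means the host answered with a reset, while a timeout means packets were silently dropped, typically by a firewall rule or a missing route. A minimal sketch of a probe that separates the two, assuming only the Python standard library (the function name is illustrative):

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connect attempt as 'open', 'refused', or 'timeout'.

    Other failures (DNS errors, no route to host) propagate as exceptions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: nothing is listening, but packets arrive.
        return "refused"
    except socket.timeout:
        # No reply at all within the deadline: packets are being dropped.
        return "timeout"
```

Calling it against kai-server for ports 22, 6443, and 53096 during an outage would be expected to return "timeout" for all three, matching the signature above.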

That combination (tailscale control plane up, host TCP dead, pod TCP up) is consistent with host-namespace forward rules being torn down and rebuilt by competing iptables managers (docker + k3s kube-proxy + flanneld). The pod-net path survives because flannel + kube-proxy own those rules and re-sync within their own loop.
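One way to test this hypothesis on a recurrence would be to compare two iptables-save dumps, one taken while healthy and one taken inside the outage. The sketch below is an assumption about how such a comparison could look, not something that was run for this incident. It relies on the iptables-save text format (a "*table" header, then "-A CHAIN ..." rule lines) and ignores packet counters and chain policy lines; the function names are illustrative.

```python
def rules(snapshot):
    """Extract (table, rule) pairs from iptables-save output."""
    table = None
    found = set()
    for line in snapshot.splitlines():
        line = line.strip()
        if line.startswith("*"):
            table = line[1:]  # e.g. "*filter" starts the filter table
        elif line.startswith("-A "):
            found.add((table, line))  # an appended rule in the current table
    return found


def rule_diff(before, after):
    """Return (removed, added) rule sets between two snapshots."""
    old, new = rules(before), rules(after)
    return old - new, new - old
```

A non-empty "removed" set covering host-namespace chains during an outage would support the competing-managers explanation; an empty one would point elsewhere.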

Options

1. Move forgejo runner off kai-server. Run it on kai-desktop-tower or a small VPS. Eliminates the docker-vs-k3s iptables race entirely.
2. Switch runner from docker mode to kubernetes mode. Job pods scheduled directly by k3s, no docker bridges. Still on kai-server but eliminates the bridge churn class.
3. Pin docker to userland-proxy / disable docker iptables management. dockerd --iptables=false, then own the rules manually. Fragile.

Recommend option 2 if the runner image supports it, otherwise option 1. Option 3 is a trap.

Out of scope here

The public SSH brute-force exposure (router DMZ leak). Fixed during the same session, can be closed separately or rolled in.
A diagnostic capture cron on kai-server so the next recurrence has an inside-the-outage snapshot. Worth filing as a sibling.

How to apply

Pick option 1 or 2, then write the migration plan as Martin-Fowler-style tiny commits.


coilysiren referenced this issue from a commit

2026-05-27 03:14:23 +00:00

fix(forgejo): pin runner to kai-desktop-tower-wsl, off kai-server

coilysiren closed this issue

2026-05-27 03:14:23 +00:00

coilysiren referenced this issue

2026-05-27 03:18:11 +00:00

scripts/host-watch.sh: generic tailnet-host SSH watchdog with diag capture on recovery #152

coilysiren referenced this issue from a commit

2026-05-27 03:18:52 +00:00

feat(scripts): host-watch + host-diag for tailnet-host SSH watchdog

coilysiren referenced this issue