forgejo runner docker bridge churn breaks host TCP intermittently #151
Problem
Forgejo Actions runner on kai-server causes intermittent host-network outages. During workflow runs, docker user-defined bridges churn (create/destroy bursts of br-XXXXXXXX interfaces) and race k3s kube-proxy's iptables sync. Result: host-namespace TCP listeners (sshd:22, kube-apiserver:6443, tailscaled PeerAPI:53096) become unreachable for ~5+ minute windows. Pod-network ingress (caddy 80/443, public services) keeps working throughout, which is what made the failure hard to diagnose at first.
Evidence (2026-05-26 incident)
Diag snapshot: /Users/kai/projects/coilysiren/output.txt (kai-server, 02:53Z).
Bridge churn clusters in dmesg line up exactly with observed outages:
"apiserver was unable to write a JSON response: http: Handler timeout", "Handler timeout", followed by a "use of closed network connection" loop on 127.0.0.1:6443 at 19:27 - 19:28 PT
Ruled out: sshd is healthy (NRestarts=0, up since 2026-05-18). Conntrack 1758/917504 - nowhere near full. Zero OOM hits. fail2ban / ufw not installed.
Failure signature during outage:
tailscale ping kai-server - pong in <10ms (tailscaled responsive)
That combination (tailscale control plane up, host TCP dead, pod TCP up) is consistent with host-namespace forward rules being torn down and rebuilt by competing iptables managers (docker + k3s kube-proxy + flanneld). The pod-net path survives because flannel + kube-proxy own those rules and re-sync within their own loop.
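To catch the next occurrence without relying on someone noticing a dead ssh session, the failure signature above could be checked from a second machine. The following is a minimal sketch, not a tested monitor: the port numbers come from this issue, but the function names, the host/pod grouping, and the "signature" / "healthy" / "other" labels are assumptions made for illustration. It also assumes the probing machine can normally reach all of these ports (the PeerAPI port is typically only reachable over the tailnet).

```python
import socket

# Ports named in this issue. Splitting them into two groups is this
# sketch's own labeling, not something the runner or k3s exposes.
HOST_PORTS = {"sshd": 22, "kube-apiserver": 6443, "tailscaled PeerAPI": 53096}
POD_PORTS = {"caddy http": 80, "caddy https": 443}


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connect to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def classify(host_up: dict[str, bool], pod_up: dict[str, bool]) -> str:
    """Label one probe round by which class of listener answered."""
    host_dead = not any(host_up.values())
    pod_alive = any(pod_up.values())
    if host_dead and pod_alive:
        return "signature"  # host TCP dead, pod TCP up
    if all(host_up.values()) and pod_alive:
        return "healthy"
    return "other"


def probe_round(target: str) -> str:
    """Probe every listed port on target once and classify the result."""
    host_up = {name: tcp_reachable(target, port) for name, port in HOST_PORTS.items()}
    pod_up = {name: tcp_reachable(target, port) for name, port in POD_PORTS.items()}
    return classify(host_up, pod_up)
```

Calling `probe_round("kai-server")` on a timer and logging the timestamp whenever it returns "signature" would give outage windows that can be lined up against the bridge churn in dmesg.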
Options
3. dockerd --iptables=false, then own the rules manually. Fragile.
Recommend option 2 if the runner image supports it, otherwise option 1. Option 3 is a trap.
Out of scope here
How to apply
Pick option 1 or 2, write the migration plan as Martin-Fowler-style tiny commits.