forgejo runners not serving CI — scale back up + verify after #151 repin #163
Summary
Forgejo Actions runners are not serving CI. Today's `coily` release pipeline shows `bump-formula` jobs failing, and the homebrew formula is stuck at v2.43.0 while tags advanced to v2.44.0. The `forgejo-runner` StatefulSet (namespace: forgejo) appears to be down: the desired-state manifest says `replicas: 2`, but live capacity is not running jobs. We need to find out why, scale back up, and verify CI runs green.
Background (root cause already understood, see #151)
#151 ("forgejo runner docker bridge churn breaks host TCP intermittently", closed 2026-05-27) established that each runner pod runs a privileged `docker:28-dind` sidecar whose dockerd created and destroyed `br-XXXXXXXX` bridges in bursts, racing k3s kube-proxy's iptables sync. That tore down host-namespace TCP forward rules on kai-server (sshd:22, apiserver:6443, tailscaled PeerAPI) during CI runs: SSH timed out (not refused), while caddy pod-net ingress stayed healthy. The fix that shipped (commit `0ae7bf3`) repinned the runner `nodeSelector` from `kai-server` → `kai-desktop-tower-wsl` so the bridge churn lives on a worker node, off kai-server's host netns. DinD was kept. Manifest: `infrastructure/deploy/forgejo-runner.yml`, `replicas: 2`.
The repin fix was committed and #151 closed, but the runners are evidently not serving CI now. Either the likely scale-to-zero (an emergency `kubectl scale --replicas=0` mitigation during the Monday 2026-05-25 SSH-flapping incident) was never reconciled, or the worker node can't schedule the pods.
Hypotheses to confirm
1. The runners were scaled to `--replicas=0` during the incident and never reapplied (manifest still says 2 → live drift).
2. The runners are pinned to `kai-desktop-tower-wsl` but that WSL node is `NotReady`/offline, so the 2 replicas can't schedule.
Tasks
- Confirm `tailscale up` is done and `ssh kai@kai-server` works from the tailnet (kubectl needs the tailnet).
- `kubectl -n forgejo get statefulset forgejo-runner -o wide`: compare desired vs ready replicas.
- `kubectl -n forgejo get pods -o wide | grep runner`: running? pending? on which node?
- `kubectl get nodes -o wide`: is `kai-desktop-tower-wsl` `Ready`?
- `kubectl -n forgejo describe pod <runner-pod>` for scheduling events.
- If scaled to zero: `kubectl -n forgejo apply -f deploy/forgejo-runner.yml` (restores `replicas: 2`). Avoid a bare `kubectl scale` that leaves manifest/live drift.
- If `kai-desktop-tower-wsl` is offline: bring that node back online (it's a WSL node). Do NOT repin the runner back to `kai-server`; that reintroduces the #151 SSH-flapping regression.
- Trigger a `coily` `bump-formula` task or push a trivial commit, and confirm it completes `success`.
- Watch host TCP during that run (e.g. `ssh kai@kai-server` stays up) to confirm moving runners to the worker node actually fixed the host-TCP flapping.
Guardrails
- Do not move the runners back to `kai-server`. Keep them on a worker node (per #151).
- The `bump-formula` job failed (not "pending"), which can also mean jobs run but the bump-formula action itself errors (e.g. `FORGEJO_PAT` / forgejo Contents API). If runners come up and jobs still fail, that's a distinct problem in `coilysiren/agentic-os/actions/bump-formula`: note it and link a follow-up rather than conflating it with the scale-up.
Pointers
- `infrastructure/deploy/forgejo-runner.yml`: the StatefulSet (replicas, nodeSelector, dind sidecar).
- `infrastructure/docs/k3s-deploy-notes.md`: triage sections on runner bridge churn / host TCP flapping.
Diagnosis: runners can schedule but cross-node flannel is dead; repin to a worker is currently non-functional
Worked this from the tasks/hypotheses in the issue. Findings, in order:
1. Live scale-to-zero drift (hypothesis 1 — confirmed)
`kubectl -n forgejo get statefulset forgejo-runner` showed `0/0` while the manifest says `replicas: 2`. The emergency `scale --replicas=0` from the Monday incident was never reconciled. Fixed by reapplying the manifest with `kubectl apply` (not a bare `scale`).
2. Stale kai-server-bound PVCs blocked scheduling (new, not in hypotheses)
After scaling back up, `forgejo-runner-0` went `Pending` with `volume node affinity conflict`. The two `local-path` PVCs (`data-forgejo-runner-{0,1}`) were provisioned 3d ago when the runner still lived on kai-server, so their PVs are node-pinned to kai-server. The repinned pod must land on `kai-desktop-tower-wsl`, so the bound PVCs conflict. Cleared them by deleting both PVCs.
The StatefulSet recreated fresh PVCs on the worker (the `.runner` state file is regenerable: the initContainer re-registers from an empty volume). The pod then scheduled on `kai-desktop-tower-wsl`.
3. The actual blocker: cross-node flannel VXLAN is down
Once scheduled, the `register` initContainer fails → `Init:CrashLoopBackOff`. Logs are unreadable via the API (apiserver→kubelet `502` on the WSL node, see below), so I diagnosed via a debug pod's termination message. From a pod on `kai-desktop-tower-wsl`:
- `nslookup … 10.43.0.10` (kube-dns ClusterIP) → `connection timed out; no servers could be reached`
- `nslookup … 10.42.0.153` (coredns pod IP, direct) → also times out
- `ping 10.42.0.153` (a kai-server pod IP) → 100% packet loss
So it's not kube-proxy: the worker simply cannot reach kai-server's pod network at all. Same result from `kai-macbook-pro-vm` (the other worker), so this is cluster-wide, not WSL-specific.
Root cause: worker nodes advertise non-routable InternalIPs
Flannel builds VXLAN tunnels to each node's InternalIP. kai-server (LAN endpoint `192.168.0.194`) cannot send VXLAN (UDP 8472) to a WSL-internal `172.27.x` address, and the tunnel to the VM's tailnet IP isn't forming either. No tunnel → no cross-node pod traffic → runners can't reach forgejo to register. This also explains the apiserver→kubelet `502` on the WSL node (its `172.27.x` kubelet IP isn't routable from kai-server, so `kubectl logs`/`exec` fail there).
What I changed
- Restored `replicas: 2` (manifest ↔ live drift fixed).
- Updated `docs/k3s-deploy-notes.md` §7/§9 + change log.
- Left the StatefulSet at `replicas: 2` so the desired state matches the manifest; the pods will stay `Init:CrashLoopBackOff` until the networking is fixed (honest failure state rather than hidden drift).
What's still needed (blocked: needs a decision + worker-host access)
The fix is a flannel networking change I can't make from a workstation session: put flannel on a common routable plane across all nodes by setting each k3s agent's `--node-ip`/`--flannel-iface` to its tailnet address and interface (kai-server included), since the tailnet is the only network all three nodes share. That requires reconfiguring the k3s agent on `kai-desktop-tower-wsl` (Windows/WSL host) and `kai-macbook-pro-vm`, plus likely kai-server. `kai-macbook-pro-vm` is also currently `SchedulingDisabled` and is a laptop VM, so it's not a reliable CI home either.
Separate follow-up to file
Per the issue's guardrail: today's `bump-formula` failures can't be re-tested until runners are healthy. Once they are, if jobs still fail, that's a distinct `coilysiren/agentic-os/actions/bump-formula` problem (likely `FORGEJO_PAT` / Contents API), not this scale-up.
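A minimal sketch of the "runners are healthy" gate that the re-test depends on, assuming "healthy" means the `forgejo-runner` StatefulSet has every desired replica Ready. The function names `runners_healthy` and `check_runners` are hypothetical and not part of the repo; `.spec.replicas` and `.status.readyReplicas` are standard StatefulSet fields read through `kubectl`'s jsonpath output.

```sh
#!/bin/sh
# runners_healthy WANT HAVE
# Succeeds only when at least one replica is desired and all of them are Ready.
# An empty value counts as 0: the API may omit .status.readyReplicas when no pod is Ready.
runners_healthy() {
  want="${1:-0}"
  have="${2:-0}"
  [ "$want" -gt 0 ] && [ "$have" -eq "$want" ]
}

# check_runners (hypothetical helper): reads live state from the cluster.
# Needs kubectl and tailnet access, so it is defined here but not called.
check_runners() {
  desired=$(kubectl -n forgejo get statefulset forgejo-runner -o jsonpath='{.spec.replicas}')
  ready=$(kubectl -n forgejo get statefulset forgejo-runner -o jsonpath='{.status.readyReplicas}')
  if runners_healthy "$desired" "$ready"; then
    echo "runners healthy (${ready}/${desired}): bump-formula can be re-tested"
  else
    echo "runners not healthy (${ready:-0}/${desired:-0}): a bump-formula failure is inconclusive"
    return 1
  fi
}
```

Calling `check_runners` from a shell with tailnet access returns non-zero while the pods sit in `Init:CrashLoopBackOff`, which keeps a failing `bump-formula` run from being misread as an action bug before the runners are actually up.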