coilyco-flight-deck/infrastructure

forgejo runners not serving CI — scale back up + verify after #151 repin #163

Closed

opened 2026-05-28 06:16:01 +00:00 by coilysiren · 1 comment

coilysiren commented

2026-05-28 06:16:01 +00:00

Owner

Summary

Forgejo Actions runners are not serving CI. Today's coily release pipeline shows bump-formula jobs failing and the homebrew formula is stuck at v2.43.0 while tags advanced to v2.44.0. The forgejo-runner StatefulSet (namespace: forgejo) appears to be down — desired-state manifest says replicas: 2, but live capacity is not running jobs. We need to find out why, scale back up, and verify CI runs green.

Background (root cause already understood — see #151)

#151 ("forgejo runner docker bridge churn breaks host TCP intermittently", closed 2026-05-27) established: each runner pod runs a privileged docker:28-dind sidecar whose dockerd created/destroyed br-XXXXXXXX bridges in bursts, racing k3s kube-proxy's iptables sync. That tore down host-namespace TCP forward rules on kai-server (sshd:22, apiserver:6443, tailscaled PeerAPI) during CI runs — SSH timed out (not refused), caddy pod-net ingress stayed healthy. The fix that shipped (commit 0ae7bf3) repinned the runner nodeSelector from kai-server → kai-desktop-tower-wsl so the bridge churn lives on a worker node, off kai-server's host netns. DinD was kept. Manifest: infrastructure/deploy/forgejo-runner.yml, replicas: 2.

The repin fix was committed and #151 closed, but the runners are evidently not serving CI now. Either the emergency kubectl scale --replicas=0 mitigation from the Monday 2026-05-25 SSH-flapping incident was never reconciled, or the runner pods cannot be scheduled onto the worker node.

Hypotheses to confirm

1. Live StatefulSet was manually scaled to --replicas=0 during the incident and never reapplied (manifest still says 2 → live drift).
2. Manifest is pinned to kai-desktop-tower-wsl but that WSL node is NotReady/offline, so the 2 replicas can't schedule.

Tasks

1. Connectivity: ensure tailscale up is done and ssh kai@kai-server works from the tailnet (kubectl needs the tailnet).
2. Inspect live state:
   - kubectl -n forgejo get statefulset forgejo-runner -o wide (compare desired vs ready replicas).
   - kubectl -n forgejo get pods -o wide | grep runner (running? pending? on which node?).
   - kubectl get nodes -o wide (is kai-desktop-tower-wsl Ready?).
   - If pods pending: kubectl -n forgejo describe pod <runner-pod> for scheduling events.
3. Diagnose which hypothesis is true (live scale=0 vs unschedulable).
4. Fix:
   - If manually scaled to 0: reconcile by reapplying the manifest with kubectl -n forgejo apply -f deploy/forgejo-runner.yml (restores replicas: 2). Avoid a bare kubectl scale that leaves manifest/live drift.
   - If kai-desktop-tower-wsl is offline: bring that node back online (it's a WSL node). Do NOT repin the runner back to kai-server; that reintroduces the #151 SSH-flapping regression.
5. Verify CI: confirm both runner pods register in forgejo (Actions → Runners), then trigger/observe a real job (e.g. re-run the failed coily bump-formula task or push a trivial commit) and confirm it completes with status success.
6. Confirm #151 holds: during a CI run, watch kai-server SSH stability (ssh kai@kai-server stays up) to confirm moving runners to the worker node actually fixed the host-TCP flapping.
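The diagnosis step above can be sketched as a small decision helper. This is only an illustration: classify_runner_state is a hypothetical function, not something in the repo, and the commented kubectl wiring assumes the StatefulSet and node names from this issue.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): map the observed state to one
# of the hypotheses in this issue.
#   $1 = replicas in the manifest
#   $2 = live .spec.replicas
#   $3 = live .status.readyReplicas (may be empty when no replica is ready)
#   $4 = status of the pinned node's Ready condition (True/False/Unknown)
classify_runner_state() {
  manifest="$1"
  live="${2:-0}"
  ready="${3:-0}"
  node_ready="$4"
  if [ "$live" -ne "$manifest" ]; then
    echo "drift"           # hypothesis 1: live spec diverged from the manifest
  elif [ "$node_ready" != "True" ]; then
    echo "unschedulable"   # hypothesis 2: the pinned node is not Ready
  elif [ "$ready" -lt "$live" ]; then
    echo "not-ready"       # neither hypothesis: inspect with describe pod
  else
    echo "healthy"
  fi
}

# Possible wiring, assuming cluster access over the tailnet:
#   live=$(kubectl -n forgejo get statefulset forgejo-runner -o jsonpath='{.spec.replicas}')
#   ready=$(kubectl -n forgejo get statefulset forgejo-runner -o jsonpath='{.status.readyReplicas}')
#   node=$(kubectl get node kai-desktop-tower-wsl -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
#   classify_runner_state 2 "$live" "$ready" "$node"
```

A "not-ready" result means the replica count and the node look fine but the pods are still not serving, so the scheduling events and init container status are the next place to look.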

Guardrails

- Do NOT repin runners to kai-server. Keep them on a worker node (per #151).
- Reconcile manifest ↔ live state; don't leave drift.
- Separate issue to flag, not fix here: today's bump-formula job failed (not "pending"), which can also mean jobs run but the bump-formula action itself errors (e.g. FORGEJO_PAT / forgejo Contents API). If runners come up and jobs still fail, that's a distinct problem in coilysiren/agentic-os/actions/bump-formula; note it and link a follow-up rather than conflating it with the scale-up.

Pointers

- infrastructure/deploy/forgejo-runner.yml: the StatefulSet (replicas, nodeSelector, dind sidecar).
- infrastructure/docs/k3s-deploy-notes.md: triage sections on runner bridge churn / host TCP flapping.
- Root-cause history: #151.

coilysiren commented

2026-05-28 06:27:04 +00:00

Author

Owner

Diagnosis: runners can schedule but cross-node flannel is dead — repin to a worker is currently non-functional

Worked this from the tasks/hypotheses in the issue. Findings, in order:

1. Live scale-to-zero drift (hypothesis 1 — confirmed)

kubectl -n forgejo get statefulset forgejo-runner showed 0/0 while the manifest says replicas: 2. The emergency scale --replicas=0 from the Monday incident was never reconciled. Fixed by reapplying the manifest (not a bare scale):

kubectl -n forgejo apply -f deploy/forgejo-runner.yml   # restores replicas: 2

2. Stale kai-server-bound PVCs blocked scheduling (new, not in hypotheses)

After scaling back up, forgejo-runner-0 went Pending with volume node affinity conflict. The two local-path PVCs (data-forgejo-runner-{0,1}) were provisioned 3d ago when the runner still lived on kai-server, so their PVs are node-pinned to kai-server. The repinned pod must land on kai-desktop-tower-wsl, so the bound PVCs conflict. Cleared them:

kubectl -n forgejo delete pvc data-forgejo-runner-0 data-forgejo-runner-1

The StatefulSet recreated fresh PVCs on the worker (the .runner state file is regenerable — the initContainer re-registers from an empty volume). Pod then scheduled on kai-desktop-tower-wsl.

3. The actual blocker: cross-node flannel VXLAN is down

Once scheduled, the register initContainer fails → Init:CrashLoopBackOff. Logs are unreadable via the API (apiserver→kubelet 502 on the WSL node, see below), so I diagnosed via a debug pod's termination message. From a pod on kai-desktop-tower-wsl:

- nslookup … 10.43.0.10 (kube-dns ClusterIP) → connection timed out; no servers could be reached
- nslookup … 10.42.0.153 (coredns pod IP, direct) → also times out
- ping 10.42.0.153 (a kai-server pod IP) → 100% packet loss

So it's not kube-proxy — the worker simply cannot reach kai-server's pod network at all. Same result from kai-macbook-pro-vm (the other worker), so this is cluster-wide, not WSL-specific.

Root cause: worker nodes advertise non-routable InternalIPs

kai-server             192.168.0.194     (LAN)
kai-desktop-tower-wsl   172.27.244.126   (WSL2-internal NAT — unroutable off the Windows host)
kai-macbook-pro-vm      100.96.209.24    (tailnet CGNAT)

Flannel builds VXLAN tunnels to each node's InternalIP. kai-server (LAN endpoint 192.168.0.194) cannot send VXLAN (UDP 8472) to a WSL-internal 172.27.x address, and the tunnel to the VM's tailnet IP isn't forming either. No tunnel → no cross-node pod traffic → runners can't reach forgejo to register. This also explains the apiserver→kubelet 502 on the WSL node (its 172.27.x kubelet IP isn't routable from kai-server, so kubectl logs/exec fail there).
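To make the mismatch concrete, here is a minimal sketch that labels each InternalIP from the table above by the reserved range it falls in. ip_plane is a hypothetical helper written for this comment; the ranges are the standard RFC 1918 private blocks and the 100.64.0.0/10 shared (CGNAT) block that tailnet addresses come from.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): name the reserved range an
# IPv4 address belongs to, using only its first two octets.
ip_plane() {
  o1=${1%%.*}        # first octet
  rest=${1#*.}
  o2=${rest%%.*}     # second octet
  case "$o1" in
    10)  echo "rfc1918-10/8" ;;
    172) if [ "$o2" -ge 16 ] && [ "$o2" -le 31 ]; then echo "rfc1918-172.16/12"; else echo "other"; fi ;;
    192) if [ "$o2" -eq 168 ]; then echo "rfc1918-192.168/16"; else echo "other"; fi ;;
    100) if [ "$o2" -ge 64 ] && [ "$o2" -le 127 ]; then echo "cgnat-100.64/10"; else echo "other"; fi ;;
    *)   echo "other" ;;
  esac
}

# Node names and InternalIPs copied from the table above.
while read -r node ip; do
  printf '%-22s %-16s %s\n' "$node" "$ip" "$(ip_plane "$ip")"
done <<'EOF'
kai-server 192.168.0.194
kai-desktop-tower-wsl 172.27.244.126
kai-macbook-pro-vm 100.96.209.24
EOF
```

Each node lands in a different range, which is consistent with the finding that no single address plane is currently shared by all three VXLAN endpoints.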

What I changed

- Reconciled the StatefulSet to replicas: 2 (manifest ↔ live drift fixed).
- Deleted the stale kai-server-bound PVCs.
- Documented the full trap + triage path in docs/k3s-deploy-notes.md §7/§9 + change log.
- Did not repin to kai-server (per the #151 guardrail). Left replicas: 2 so the desired state matches the manifest; the pods will stay Init:CrashLoopBackOff until the networking is fixed (honest failure state rather than hidden drift).

What's still needed (blocked — needs a decision + worker-host access)

The fix is a flannel networking change I can't make from a workstation session: put flannel on a common routable plane across all nodes. On every node (kai-server included), set k3s's --node-ip to that node's tailnet IP and --flannel-iface to its tailnet interface, since the tailnet is the only network all three nodes share. That requires reconfiguring the k3s agent on kai-desktop-tower-wsl (Windows/WSL host) and kai-macbook-pro-vm, plus likely the k3s server on kai-server. kai-macbook-pro-vm is also currently SchedulingDisabled and is a laptop VM, so it's not a reliable CI home either.
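As a rough sketch of what that per-node change could look like, the helper below prints a k3s config fragment with the two options. Assumptions, none of them confirmed in this thread: the tailnet interface is named tailscale0, the fragment is merged into /etc/rancher/k3s/config.yaml by hand, and render_k3s_network_config is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print the k3s config keys that
# correspond to --node-ip and --flannel-iface for one node.
#   $1 = the node's tailnet IPv4 address
#   $2 = the tailnet interface name (assumed default: tailscale0)
render_k3s_network_config() {
  node_ip="$1"
  iface="${2:-tailscale0}"
  printf 'node-ip: %s\n' "$node_ip"
  printf 'flannel-iface: %s\n' "$iface"
}

# Possible use on a node, assuming the tailscale CLI is installed there:
#   render_k3s_network_config "$(tailscale ip -4)"
# The output is meant to be reviewed and merged manually into
# /etc/rancher/k3s/config.yaml, followed by a restart of the k3s service.
```

Printing the fragment instead of writing the file keeps the sketch side-effect free; whether an existing node accepts a changed node IP without being re-registered is something to verify on one worker before touching kai-server.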

Separate follow-up to file

Per the issue's guardrail: today's bump-formula failures can't be re-tested until runners are healthy. Once they are, if jobs still fail, that's a distinct coilysiren/agentic-os/actions/bump-formula problem (likely FORGEJO_PAT / Contents API), not this scale-up.

coilysiren referenced this issue

2026-05-28 09:32:43 +00:00

Forgejo Actions: find (or build) an API endpoint for action job/step logs #166

coilysiren referenced this issue from a commit

2026-05-28 09:45:10 +00:00

fix(k3s): bind flannel VXLAN to the tailnet plane for cross-node networking

coilysiren closed this issue

2026-05-28 09:45:10 +00:00

coilysiren referenced this issue from a commit

2026-06-01 05:45:06 +00:00

docs(k3s): forgejo-runner repin traps - PVC node affinity + flannel VXLAN

coilysiren referenced this issue