forgejo-runner PVC pinned to kai-server blocks WSL migration - volume node affinity conflict #153


Open

opened 2026-05-27 04:38:15 +00:00 by coilysiren · 0 comments

coilysiren commented

2026-05-27 04:38:15 +00:00

Owner

Problem

Pinning the forgejo-runner StatefulSet to kai-desktop-tower-wsl (commit 0ae7bf3, applied 2026-05-27T03:55:22Z) does not actually move the pod. PVC data-forgejo-runner-1 is bound to a local-path PV pinned to kai-server, while the new pod's nodeSelector requires kai-desktop-tower-wsl. No node satisfies both constraints, so the scheduler is deadlocked.

Evidence

Captured via coily ops kubectl -n forgejo describe pod forgejo-runner-1 (see /tmp/runner-decision.log for full output):

Node-Selectors: kubernetes.io/hostname=kai-desktop-tower-wsl
Events:
  Warning  FailedScheduling  ...  0/3 nodes are available:
    1 node(s) didn't match Pod's node affinity/selector,
    1 node(s) had volume node affinity conflict,
    1 node(s) were unschedulable.

"didn't match" = kai-server (correctly rejected by the new nodeSelector)
"volume node affinity conflict" = kai-desktop-tower-wsl (the PV for data-forgejo-runner-1 is pinned to kai-server, so the pod can't schedule on WSL even though the nodeSelector permits it)
"were unschedulable" = kai-macbook-pro-vm (cordoned, SchedulingDisabled)

kai-desktop-tower-wsl itself is Ready and a 12d-old worker. The blocker is the PV, not the node.
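The conflict can be reproduced offline as a rough sketch. It assumes the PV expresses its pinning as a `kubernetes.io/hostname` `In` term under `spec.nodeAffinity.required`. The manifest below is a hypothetical stand-in, not the real PV captured from the cluster, and the helper names are illustrative.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"


def pv_allowed_hostnames(pv):
    """Return the hostnames a PV's required node affinity permits.

    Simplification: only `kubernetes.io/hostname` / `In` expressions are
    read. Returns None when no such constraint is found, which this sketch
    treats as "unconstrained".
    """
    affinity = pv.get("spec", {}).get("nodeAffinity") or {}
    required = affinity.get("required") or {}
    allowed = set()
    for term in required.get("nodeSelectorTerms", []):
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == HOSTNAME_KEY and expr.get("operator") == "In":
                allowed.update(expr.get("values", []))
    return allowed or None


def schedulable_nodes(pv, pod_hostname, ready_nodes):
    """Nodes that satisfy both the pod's hostname selector and the PV pinning."""
    candidates = {n for n in ready_nodes if n == pod_hostname}
    allowed = pv_allowed_hostnames(pv)
    if allowed is None:
        return candidates
    return candidates & allowed


# Hypothetical PV manifest, shaped like a node-pinned local volume.
example_pv = {
    "spec": {
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": HOSTNAME_KEY,
                                "operator": "In",
                                "values": ["kai-server"],
                            }
                        ]
                    }
                ]
            }
        }
    }
}

nodes = ["kai-server", "kai-desktop-tower-wsl"]
print(schedulable_nodes(example_pv, "kai-desktop-tower-wsl", nodes))  # set()
```

An empty result matches the FailedScheduling event: the nodeSelector leaves only kai-desktop-tower-wsl, and the PV leaves only kai-server.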

What was done overnight

Scaled forgejo-runner StatefulSet to 0 replicas (Kai-authorized safety move). Bridge churn from the DinD sidecar stops, and the kai-server host network is expected to stabilize.
The watch (coily exec host-watch host=kai-server) is still running and will log whether the outage cadence actually stops post-scale.

Options for the real fix

1. Delete PVC, accept re-provision. data-forgejo-runner-* is a Docker layer cache from DinD. An empty cache means the first workflow run is slow but nothing else breaks. Cleanest. Single-line.
2. Switch to a portable StorageClass. Longhorn or NFS via a fronting share. Heavier infra change, but it lets the runner move freely between nodes.
3. Drop the persistent volume entirely. Use an emptyDir for the DinD cache. Runner state (the .runner registration file) is the only thing that needs to persist, and it's an init-container output that can be recreated. Smallest steady-state surface.

Recommend option 3 if the registration init-container is truly idempotent, otherwise option 1.

Out of scope

Diagnosing why host-network TCP outages were occurring even before the apply (those continued post-power-cycle until the scale-to-0). Possibly an unrelated host-side issue. Diff the recovery snapshots in /tmp/host-watch-kai-server/recovery-*.txt before vs after the scale to confirm whether the runner was the only cause.

How to apply

Pick an option, write the migration as a Martin-Fowler-style tiny commit on top of 2e7b0b4, scale runner back to >=1, verify pods land on kai-desktop-tower-wsl.
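The final verification step could be scripted along these lines. This is a sketch that assumes the pod list is available as parsed JSON in the shape produced by `kubectl get pods -o json`. Whether the coily wrapper passes `-o json` through is an assumption. The sample data is hypothetical, and `misplaced_pods` is an illustrative name.

```python
def misplaced_pods(pod_list, expected_node):
    """Return (pod name, node) pairs for pods not running on expected_node.

    The node is None for pods the scheduler has not placed yet.
    """
    misplaced = []
    for item in pod_list.get("items", []):
        name = item["metadata"]["name"]
        node = item.get("spec", {}).get("nodeName")
        if node != expected_node:
            misplaced.append((name, node))
    return misplaced


# Hypothetical pod list: one pod landed on WSL, one is still unscheduled.
sample = {
    "items": [
        {
            "metadata": {"name": "forgejo-runner-0"},
            "spec": {"nodeName": "kai-desktop-tower-wsl"},
        },
        {
            "metadata": {"name": "forgejo-runner-1"},
            "spec": {},
        },
    ]
}

print(misplaced_pods(sample, "kai-desktop-tower-wsl"))
# [('forgejo-runner-1', None)]
```

An empty list means every runner pod landed on the intended node. A `None` node indicates a pod that is still unscheduled, which is the symptom described under Evidence.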


coilysiren referenced this issue

2026-05-27 04:44:37 +00:00

morning checkin after 2026-05-26 kai-server outage session #154

coilysiren referenced this issue

2026-05-27 04:59:41 +00:00

postmortem - 2026-05-26 kai-server outage session - three independent issues stacked #155

coilysiren added the

label

2026-06-04 08:16:53 +00:00
