forgejo-runner PVC pinned to kai-server blocks WSL migration - volume node affinity conflict #153

Open
opened 2026-05-27 04:38:15 +00:00 by coilysiren · 0 comments
Owner

Problem

Pinning the forgejo-runner StatefulSet to kai-desktop-tower-wsl (commit 0ae7bf3, applied 2026-05-27T03:55:22Z) does not actually move the pod. PVC data-forgejo-runner-1 is bound to a local-path PV pinned to kai-server, while the new pod's nodeSelector requires kai-desktop-tower-wsl. Scheduler deadlock: no node satisfies both constraints, so the pod stays Pending.

Evidence

Captured via coily ops kubectl -n forgejo describe pod forgejo-runner-1 (see /tmp/runner-decision.log for full output):

Node-Selectors: kubernetes.io/hostname=kai-desktop-tower-wsl
Events:
  Warning  FailedScheduling  ...  0/3 nodes are available:
    1 node(s) didn't match Pod's node affinity/selector,
    1 node(s) had volume node affinity conflict,
    1 node(s) were unschedulable.
  • "didn't match" = kai-server (correctly rejected by the new nodeSelector)
  • "volume node affinity conflict" = kai-desktop-tower-wsl (the PV for data-forgejo-runner-1 is pinned to kai-server, so the pod can't schedule on WSL even though the nodeSelector permits it)
  • "were unschedulable" = kai-macbook-pro-vm (cordoned, SchedulingDisabled)

kai-desktop-tower-wsl itself is Ready and a 12d-old worker. The blocker is the PV, not the node.
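
To make the three rejection reasons concrete, here is a minimal sketch of the filtering logic. It is a simplified model, not the real kube-scheduler code. It assumes each node's `kubernetes.io/hostname` label equals its node name, and the check order is chosen only so that it reproduces the event message above. `Node`, `rejection_reason`, and `explain` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    schedulable: bool = True  # False models a cordoned node (SchedulingDisabled)


# The three nodes named in the FailedScheduling event.
NODES = [
    Node("kai-server"),
    Node("kai-desktop-tower-wsl"),
    Node("kai-macbook-pro-vm", schedulable=False),
]


def rejection_reason(node, selector_hostname, pv_hostname):
    """Return why `node` is filtered out, or None if the pod could land there."""
    if not node.schedulable:
        return "unschedulable"
    if node.name != selector_hostname:
        return "didn't match Pod's node affinity/selector"
    if node.name != pv_hostname:
        return "volume node affinity conflict"
    return None


def explain(selector_hostname, pv_hostname):
    """Map each node name to its rejection reason (None means schedulable)."""
    return {
        node.name: rejection_reason(node, selector_hostname, pv_hostname)
        for node in NODES
    }


if __name__ == "__main__":
    # Current state: selector wants WSL, the PV is pinned to kai-server.
    for name, reason in explain("kai-desktop-tower-wsl", "kai-server").items():
        print(f"{name}: {reason}")
```

With the selector and the PV pointing at different hosts, every node returns a reason, which matches "0/3 nodes are available". Passing the same hostname for both arguments leaves kai-desktop-tower-wsl with no rejection, which is the state any of the fixes below needs to reach.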

What was done overnight

  • Scaled the forgejo-runner StatefulSet to 0 replicas (Kai-authorized safety move). This stops the bridge churn from the DinD sidecar, so the kai-server host network is expected to stabilize.
  • The watch (coily exec host-watch host=kai-server) is still running and will log whether the outage cadence actually stops post-scale.

Options for the real fix

  1. Delete PVC, accept re-provision. data-forgejo-runner-* is a docker layer cache from DinD. Empty cache means the first workflow run is slow but nothing else breaks. Cleanest. Single-line.
  2. Switch to a portable StorageClass. Longhorn or NFS via a fronting share. Heavier infra change but moves the runner freely between nodes.
  3. Drop the persistent volume entirely. Use an emptyDir for the DinD cache. Runner state (the .runner registration file) is the only thing that would otherwise need to persist, and it's an init-container output that can be recreated. Smallest steady-state surface.

Recommend option 3 if the registration init-container is truly idempotent, otherwise option 1.
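
A rough sketch of what option 3 changes in the manifest, written as a transformation on the StatefulSet loaded as a Python dict. The manifest shape is an assumption since the actual forgejo-runner spec is not shown here. The volume name `data` is inferred from the PVC name `data-forgejo-runner-1` (StatefulSet PVCs are named `<template>-<statefulset>-<ordinal>`). `to_empty_dir` is a hypothetical helper.

```python
import copy


def to_empty_dir(statefulset, volume_name="data"):
    """Return a copy of the manifest with the claim template swapped for an emptyDir.

    Existing volumeMounts keep working because they reference the volume by name.
    """
    sts = copy.deepcopy(statefulset)
    spec = sts["spec"]

    # Drop the claim template that produced data-forgejo-runner-*.
    remaining = [
        template
        for template in spec.get("volumeClaimTemplates", [])
        if template["metadata"]["name"] != volume_name
    ]
    if remaining:
        spec["volumeClaimTemplates"] = remaining
    else:
        spec.pop("volumeClaimTemplates", None)

    # Add a pod-level emptyDir volume under the same name.
    pod_spec = spec["template"]["spec"]
    volumes = [v for v in pod_spec.get("volumes", []) if v["name"] != volume_name]
    volumes.append({"name": volume_name, "emptyDir": {}})
    pod_spec["volumes"] = volumes
    return sts
```

Two caveats. Kubernetes rejects in-place changes to `volumeClaimTemplates` on an existing StatefulSet, so applying this would likely mean deleting and recreating the StatefulSet (cheap while it is scaled to 0). And with an emptyDir the `.runner` file disappears whenever the pod is rescheduled, which is why this only works if the registration init-container really is idempotent.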

Out of scope

  • Diagnosing why host-network TCP outages were occurring even before the apply (those continued post-power-cycle until the scale-to-0). Possibly an unrelated host-side issue. Diff the recovery snapshots in /tmp/host-watch-kai-server/recovery-*.txt before vs after the scale to confirm whether the runner was the only cause.
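
If someone does pick this up, the before/after comparison could be sketched roughly as follows. The snapshot format is not described in this issue, so the sketch treats each file as plain text and uses file modification time as the snapshot time. The scale-to-0 timestamp is not recorded here either, so it has to be supplied as `cutoff_epoch`. All function names are hypothetical.

```python
import difflib
from pathlib import Path


def load_snapshots(directory="/tmp/host-watch-kai-server"):
    """Read recovery-*.txt files as (mtime, text) pairs, oldest first."""
    paths = sorted(Path(directory).glob("recovery-*.txt"), key=lambda p: p.stat().st_mtime)
    return [(p.stat().st_mtime, p.read_text()) for p in paths]


def split_by_cutoff(snapshots, cutoff_epoch):
    """Partition (mtime, text) pairs into those before and at/after the cutoff."""
    before = [s for s in snapshots if s[0] < cutoff_epoch]
    after = [s for s in snapshots if s[0] >= cutoff_epoch]
    return before, after


def diff_texts(old_text, new_text):
    """Unified diff between two snapshot bodies, as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="before-scale",
            tofile="after-scale",
            lineterm="",
        )
    )
```

Diffing the last snapshot before the cutoff against the first one after it gives a quick first look. If no recovery snapshots appear after the cutoff at all, that is itself a sign the outages stopped with the runner.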

How to apply

Pick an option, write the migration as a Martin-Fowler-style tiny commit on top of 2e7b0b4, scale runner back to >=1, verify pods land on kai-desktop-tower-wsl.
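
For the final verification step, a small sketch that checks pod placement from the JSON output of `kubectl -n forgejo get pods -o json`. It relies only on the standard `metadata.name` and `spec.nodeName` fields. The `forgejo-runner-` name prefix is an assumption about how the runner pods are identified, and `misplaced_pods` is a hypothetical helper.

```python
import json


def misplaced_pods(pod_list_json, expected_node="kai-desktop-tower-wsl",
                   name_prefix="forgejo-runner-"):
    """Return {pod name: node name} for runner pods not on the expected node.

    A pod that is still Pending has no spec.nodeName and is reported as None.
    """
    pods = json.loads(pod_list_json)["items"]
    return {
        pod["metadata"]["name"]: pod["spec"].get("nodeName")
        for pod in pods
        if pod["metadata"]["name"].startswith(name_prefix)
        and pod["spec"].get("nodeName") != expected_node
    }
```

An empty result means every runner pod landed on kai-desktop-tower-wsl. A `None` value means the pod is still unscheduled, in which case the FailedScheduling events are the next place to look.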

coilysiren added P3 and removed P2 labels 2026-05-31 07:00:37 +00:00
Reference
coilyco-flight-deck/infrastructure#153