morning checkin after 2026-05-26 kai-server outage session #154
Anchor
Tomorrow-morning checkin after the 2026-05-26 kai-server outage session. Catches up future-Kai and any dispatched agent.
State as of bedtime
- `nc -zv 99.110.50.213 22` should refuse/timeout.
- `forgejo-runner` StatefulSet scaled to 0 replicas. Reason: PVC pinned to kai-server prevented the WSL move from working (see #153). Bridge churn stops while at 0.
- `coily exec host-watch host=kai-server` left running (pid was 64331 at start, may have changed). Logs to `/tmp/host-watch-kai-server/watch.log`, recovery snapshots in same dir. Kill with `pkill -f host-watch.sh` when no longer needed.
- `0ae7bf3` (runner-pin), `2e7b0b4` (host-watch+host-diag). Pushed to forgejo+github fan-out.

First checks on wake
- `tail -30 /tmp/host-watch-kai-server/watch.log` - did the outage cadence stop after the scale-to-0? If yes, runner was the sole cause. If no, there's a second host-network issue to chase.
- `coily ops kubectl -n forgejo get pods -o wide` - confirm no forgejo-runner-* pods are running. forgejo, forgejo-db, ts-forgejo should still be up.
- `coily ops aws ssm get-parameter --name /coilysiren/home/public-ip` - sanity check the recorded public IP still matches.

Open work
- `data-forgejo-runner-*` PVC off node-pinned local-path so the WSL move actually takes. Three options listed in the body, recommend option 3 (emptyDir, ditch PVC entirely) if the runner registration init-container is idempotent.
- `/tmp/host-watch-kai-server/recovery-*.txt` before vs after the scale - if outage cadence didn't stop, the diff names the second cause.

Skip if
How to apply
Read the three related issues in order: #153 (PVC), #152 (host-watch, closed), #151 (runner-pin, closed). Decide whether to dig into #153 or close it WONTFIX in favor of the skip-if path.
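
The pod check from "First checks on wake" could be wrapped in a small shell helper so a dispatched agent gets a pass/fail answer instead of eyeballing the listing. This is only a sketch: the function names `list_runner_pods` and `assert_no_runner_pods` are hypothetical, and it assumes the listing puts the pod name in the first column, as the default `kubectl get pods` output does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of coily or host-watch.

# Reads a `kubectl get pods` style listing on stdin and prints the name of
# every pod whose name starts with "forgejo-runner-" (first column).
list_runner_pods() {
  awk '$1 ~ /^forgejo-runner-/ { print $1 }'
}

# Returns 0 when the listing on stdin contains no forgejo-runner-* pods,
# otherwise prints the offenders to stderr and returns 1.
assert_no_runner_pods() {
  local found
  found="$(list_runner_pods)"
  if [ -n "$found" ]; then
    printf 'runner pods still present:\n%s\n' "$found" >&2
    return 1
  fi
  echo "ok: no forgejo-runner-* pods"
}
```

Assumed usage, piping in the command already listed above: `coily ops kubectl -n forgejo get pods -o wide | assert_no_runner_pods`. It only checks for stray runner pods. Confirming that forgejo, forgejo-db, and ts-forgejo are up is still a manual read of the same listing.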