morning checkin after 2026-05-26 kai-server outage session #154

Open
opened 2026-05-27 04:44:37 +00:00 by coilysiren · 0 comments
Owner

Anchor

Tomorrow-morning checkin after the 2026-05-26 kai-server outage session. Catches up future-Kai and any dispatched agent.

State as of bedtime

  • Public SSH closed. Router DMZ off. Virtual Servers allowlist: 80/tcp, 443/tcp, 3000/udp, 3001/tcp, 3002/tcp, 3003/?, 34197/udp - all to 192.168.0.194. Verify from cellular: `nc -zv 99.110.50.213 22` should be refused or time out.
  • `forgejo-runner` StatefulSet scaled to 0 replicas. Reason: the PVC is pinned to kai-server, which prevented the WSL move from working (see #153). Bridge churn stops while replicas stay at 0.
  • `coily exec host-watch host=kai-server` left running (PID was 64331 at start, may have changed). Logs go to `/tmp/host-watch-kai-server/watch.log`; recovery snapshots are in the same directory. Kill with `pkill -f host-watch.sh` when no longer needed.
  • Commits on main: 0ae7bf3 (runner-pin), 2e7b0b4 (host-watch+host-diag). Pushed to forgejo+github fan-out.
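
The SSH verification above can be wrapped in a small helper so it is easy to repeat from an outside network. This is a minimal sketch, not part of the repo: the function name `check_closed` is hypothetical, and it assumes the local `nc` supports `-z` and `-w` (most netcat variants do). It only covers TCP ports.

```shell
# Sketch only: report whether a TCP port refuses or times out.
# check_closed is a hypothetical helper, not an existing coily command.
check_closed() {
  host="$1"
  port="$2"
  # -z: probe without sending data; -w 3: give up after 3 seconds.
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "open $host:$port"
    return 1
  fi
  echo "closed $host:$port"
  return 0
}
```

Run from cellular, `check_closed 99.110.50.213 22` should print a `closed` line and exit 0. A missing `nc` binary would also be reported as closed, so confirm `nc` is installed before trusting the result.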

First checks on wake

  1. `tail -30 /tmp/host-watch-kai-server/watch.log` - did the outage cadence stop after the scale-to-0? If yes, the runner was most likely the sole cause. If no, there's a second host-network issue to chase.
  2. `coily ops kubectl -n forgejo get pods -o wide` - confirm no `forgejo-runner-*` pods are running. `forgejo`, `forgejo-db`, and `ts-forgejo` should still be up.
  3. `coily ops aws ssm get-parameter --name /coilysiren/home/public-ip` - sanity check that the recorded public IP still matches the actual one.
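
Check 2 can be reduced to a single number rather than eyeballing the pod table. The following is a rough sketch: `count_runner_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` table layout, where the pod name is the first column.

```shell
# Sketch only: count forgejo-runner-* rows in `kubectl get pods` output.
# Reads the pod table on stdin; assumes the pod name is the first column.
count_runner_pods() {
  awk '$1 ~ /^forgejo-runner-/ { n++ } END { print n + 0 }'
}
```

Assuming `coily ops` passes kubectl's stdout through unchanged, `coily ops kubectl -n forgejo get pods -o wide | count_runner_pods` should print 0 while the StatefulSet stays scaled down.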

Open work

  • #153 - migrate the `data-forgejo-runner-*` PVC off node-pinned local-path storage so the WSL move actually takes. Three options are listed in the issue body; recommend option 3 (emptyDir, ditch the PVC entirely) if the runner registration init-container is idempotent.
  • Diff `/tmp/host-watch-kai-server/recovery-*.txt` before vs after the scale - if the outage cadence didn't stop, the diff should name the second cause.
  • (Lower priority) Audit router UPnP service list. Decide whether to leave UPnP on or off.
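
For the snapshot diff, one possible starting point is to compare the oldest and newest recovery files by modification time. This is only a sketch under assumptions: `diff_oldest_newest` is a hypothetical helper, the snapshot file names contain no whitespace, and oldest-vs-newest is a stand-in for "before vs after the scale", so pick the two files by hand if several snapshots were taken on either side.

```shell
# Sketch only: unified diff of the oldest vs newest recovery snapshot.
# Assumes file names without whitespace; ordering is by modification time.
diff_oldest_newest() {
  dir="$1"
  # ls -tr lists oldest first, so head gives the oldest and tail the newest.
  oldest=$(ls -tr "$dir"/recovery-*.txt 2>/dev/null | head -n 1)
  newest=$(ls -tr "$dir"/recovery-*.txt 2>/dev/null | tail -n 1)
  if [ -z "$oldest" ] || [ "$oldest" = "$newest" ]; then
    echo "need at least two snapshots in $dir" >&2
    return 2
  fi
  diff -u "$oldest" "$newest"
}
```

Usage would be `diff_oldest_newest /tmp/host-watch-kai-server`. As with plain `diff`, a non-zero exit status of 1 just means the two snapshots differ.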

Skip if

  • Outage cadence stopped overnight AND #153 still feels like more effort than warranted: leave the runner at 0 indefinitely and run Forgejo CI from a different runner (GitHub Actions on external-facing repos, or none for personal repos). Forgejo Actions isn't load-bearing.

How to apply

Read the three related issues in order: #153 (PVC, open), #152 (host-watch, closed), #151 (runner-pin, closed). Decide whether to dig into #153 or close it WONTFIX in favor of the skip-if path.

coilysiren added P4 and removed P3 labels 2026-05-31 07:00:37 +00:00
Reference: coilyco-flight-deck/infrastructure#154