postmortem - 2026-05-26 kai-server outage session - three independent issues stacked #155


Closed

opened 2026-05-27 04:59:41 +00:00 by coilysiren · 1 comment

coilysiren commented

2026-05-27 04:59:41 +00:00

Owner

Anchor

Postmortem of the 2026-05-26 evening session on kai-server. Three independent issues stacked and looked identical from the outside: public SSH brute force, control-plane host-network flapping, and a monitoring tool that started lying when its underlying CLI tightened. Filing the whole timeline because conflating these in the future would burn hours again.

Symptom that opened the session

Kai dictated "ssh to kai server just locked me out? help me debug." Initial probe: `coily ssh kai-server` returned `tcp dial kai-server:22: i/o timeout`. Tailscale ping returned pong in 6ms. Eco-mcp public HTTPS returned 200. Classic host-network-down-but-pod-network-up signature.

Cause 1 - DinD bridge churn on the control plane (resolved by 0ae7bf3, see #151)

Forgejo Actions runner was a StatefulSet pinned to kai-server with a DinD sidecar. Each workflow job created docker user-bridges (`br-XXXXXXXX`) which raced k3s kube-proxy iptables sync. During the race, host-namespace TCP listeners (sshd:22, kube-apiserver:6443, tailscaled PeerAPI:53096) got dropped from the INPUT path for 5+ minute windows. Pod-net survived because flannel+kube-proxy own those rules separately.

Smoking gun: kernel ring buffer showed bursts of `br-XXXXXXXX entered blocking/forwarding/disabled state` lines, and the bursts time-correlated 1:1 with reported outages (17:18-17:27, 18:58, 19:16-19:17, 19:44-19:45 all PT). K3s apiserver logged `apiserver was unable to write a JSON response: http: Handler timeout` at 17:57 and 19:21 PT, inside two of those windows.
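The 1:1 claim can be checked mechanically rather than by eye. A minimal sketch, assuming the bridge-state timestamps have already been pulled out of the kernel ring buffer as `HH:MM` strings in the same time zone as the outage windows. `correlate` and `slack` are illustrative names, not part of any existing script, and the event times in the usage note are hypothetical.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def correlate(event_times, windows, slack=1):
    """Compare event timestamps against outage windows.

    Returns (uncovered, stray):
      uncovered - windows that contain no event at all
      stray     - events that fall outside every window
    `slack` widens each window by that many minutes on both sides.
    """
    spans = [(to_minutes(start) - slack, to_minutes(end) + slack)
             for start, end in windows]
    hit = [False] * len(spans)
    stray = []
    for stamp in event_times:
        minute = to_minutes(stamp)
        matched = False
        for index, (low, high) in enumerate(spans):
            if low <= minute <= high:
                hit[index] = True
                matched = True
        if not matched:
            stray.append(stamp)
    uncovered = [window for window, seen in zip(windows, hit) if not seen]
    return uncovered, stray


# Outage windows as reported in this postmortem (PT).
OUTAGE_WINDOWS = [("17:18", "17:27"), ("18:58", "18:58"),
                  ("19:16", "19:17"), ("19:44", "19:45")]
```

With hypothetical event times such as `["17:19", "17:25", "18:58", "19:16", "19:44"]`, `correlate(events, OUTAGE_WINDOWS)` returns two empty lists: every window has a burst and no burst falls outside a window, which is what "1:1" means here. A non-empty `stray` list would point at bridge churn that did not cause a reported outage.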

Fix: pin runner to kai-desktop-tower-wsl instead of kai-server (commit 0ae7bf3). DinD churn still happens, but on a worker node where no critical host-network listeners live.

Cause 2 - PVC pins prevent the runner from actually moving (open as #153)

Apply succeeded but the new pod stayed Pending with `volume node affinity conflict`. The `data-forgejo-runner-1` PVC is backed by a local-path PV pinned to kai-server. New nodeSelector says WSL, PV says kai-server, scheduler can't satisfy both. **The cause-1 fix didn't take effect at all** until we scaled the StatefulSet to 0 replicas at 04:37:13Z.
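The conflict is an empty intersection of two constraints. A minimal model of it, not the scheduler's real algorithm: it assumes both the nodeSelector and the PV's node affinity are expressed on the `kubernetes.io/hostname` label, which the issue does not confirm. `feasible_nodes` and the `NODES` table are illustrative.

```python
HOSTNAME_LABEL = "kubernetes.io/hostname"

# Assumed node labels; only the hostname label matters for this sketch.
NODES = {
    "kai-server": {HOSTNAME_LABEL: "kai-server"},
    "kai-desktop-tower-wsl": {HOSTNAME_LABEL: "kai-desktop-tower-wsl"},
}


def feasible_nodes(nodes, node_selector, pv_hostnames=None):
    """Return nodes that satisfy BOTH the pod's nodeSelector and the PV pin.

    pv_hostnames is the set of hostnames the volume may attach on,
    or None when the pod has no node-pinned volume.
    """
    result = []
    for name, labels in nodes.items():
        selector_ok = all(labels.get(key) == value
                          for key, value in node_selector.items())
        volume_ok = (pv_hostnames is None
                     or labels.get(HOSTNAME_LABEL) in pv_hostnames)
        if selector_ok and volume_ok:
            result.append(name)
    return sorted(result)


# New selector points at WSL, PV is pinned to kai-server: nothing fits.
print(feasible_nodes(NODES, {HOSTNAME_LABEL: "kai-desktop-tower-wsl"},
                     {"kai-server"}))  # prints []
```

An empty result corresponds to the Pending pod. Dropping the volume constraint (`pv_hostnames=None`) makes the WSL node feasible again, which is why the move needs the PVC dealt with and not just a selector swap.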

After scale-to-0, all forgejo-runner pods terminated, DinD bridge churn stopped, host network stabilized. Verified via direct nc to 22 and 6443 plus coily whoami over SSH at ~04:50Z - all healthy.

#153 holds the morning fix.

Cause 3 - public SSH exposed via router DMZ (resolved at router level)

While debugging, the first diagnostic snapshot showed sshd absorbing a brute-force scan from 60.163.139.198 at one attempt every ~8 seconds, dozens of usernames. Public SSH was never supposed to be exposed. Audit found:

- Router Virtual Servers table: `factorio (34197/udp), eco (3000-3003)` - clean, no SSH forward.
- Router DMZ: **ON**, pointed at kai-server's LAN IP. Every port on kai-server was public.

Public services should be an explicit allowlist, not a DMZ default-allow. Fix applied during the session:

- Added explicit Virtual Servers for `80/tcp` and `443/tcp` to 192.168.0.194.
- Turned DMZ off.
- Verified public HTTPS still works through caddy (200 OK from eco-mcp, forgejo, coilysiren).

Brute force surface closed. The brute force itself was not load-bearing for the host-network outages; it ran for hours without affecting them. Pure background hardening.
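An allowlist is only useful if something compares it against what is actually reachable. A rough sketch of that comparison for TCP, to be run from outside the LAN. `tcp_open` and `unexpected_ports` are hypothetical helpers, and the host name in the usage note is a placeholder, not a real address from this setup.

```python
import socket


def tcp_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


def unexpected_ports(host, candidates, allowed, probe=tcp_open):
    """Return candidate ports that answer but are not on the allowlist."""
    return sorted(port for port in candidates
                  if port not in allowed and probe(host, port))
```

Called as `unexpected_ports("public-host.example", [22, 80, 443, 6443], {80, 443})`, a non-empty result such as `[22]` would have flagged the DMZ exposure before the brute-force scan did. UDP forwards like factorio need a different check, since a plain connect test says nothing about UDP.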

Cause 4 - monitoring tool lied silently after a coily upgrade

`scripts/host-watch.sh` (committed 2e7b0b4 via #152) polls SSH liveness via `coily ssh kai-server -- echo alive` and on dead->alive transitions streams `scripts/host-diag.sh` into the remote via `bash -s`. Both forms relied on `coily ssh` accepting free-form remote argv.

Mid-session, `coily` upgraded from 2.42.0 to 2.43.0 and tightened the ssh lockdown: `coily ssh` now only ships args to a remote `coily` subcommand, not to free-form bash. The probe started returning coily's help dump (exit 1) and the diag started capturing coily help instead of the actual diagnostic. Watch logged a metronomic 4-min "outage" cycle that didn't exist - all post-04:37 watch.log transitions are probe artifacts, not real host state.

This is the regression that scares me most. The tool kept producing output that looked plausible to the operator (timestamped state transitions, recovery snapshot files, file sizes that grew). The outages were real until 04:37, then turned into pure noise. A reader could waste an hour chasing a phantom 4-min cron.
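One way to keep a broken probe from being logged as a dead host is a third state. A sketch under stated assumptions: the probe is expected to print exactly the sentinel `alive`, and a genuine network failure is recognised by the `i/o timeout` text seen in the opening symptom. `classify_probe`, `run_probe`, and the `probe-broken` state are illustrative, not how `scripts/host-watch.sh` works today.

```python
import subprocess
import sys


def classify_probe(returncode, stdout, stderr,
                   sentinel="alive", down_markers=("i/o timeout",)):
    """Map a finished probe to 'alive', 'dead', or 'probe-broken'."""
    if returncode == 0 and stdout.strip() == sentinel:
        return "alive"
    combined = stdout + stderr
    if any(marker in combined for marker in down_markers):
        return "dead"
    # The command ran but said something unexpected (for example a help
    # dump): the probe itself is suspect, so do not record an outage.
    return "probe-broken"


def run_probe(argv, timeout=10):
    """Run the probe command and classify its result."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return "dead"
    except OSError:
        # The probe binary is missing or not executable.
        return "probe-broken"
    return classify_probe(done.returncode, done.stdout, done.stderr)


# Stand-in for the real probe command, so the sketch runs anywhere.
print(run_probe([sys.executable, "-c", "print('alive')"]))  # prints alive
```

Under this scheme the help dump with exit 1 would have surfaced as `probe-broken` rather than as a dead->alive cycle. The list of down markers is the fragile part and would need to match whatever the replacement probe actually emits.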

Kai's call: remove coily ssh entirely rather than chase it. scripts/host-watch.sh and scripts/host-diag.sh will be broken until rewritten to a non-coily-ssh path (or deleted). Tracked separately as the followup issues filed alongside this one.

Timeline (UTC)

- ~02:50 - session opens, host-network outage in progress
- 02:53 - first diag captured to /tmp/output.txt, 1448 lines, real data. Names DinD bridge churn as cause.
- 03:11 - dispatch claude headless on #151 to fix the runner pin
- 03:13 - dispatch commits 0ae7bf3 (nodeSelector swap)
- ~03:20 - router DMZ off, explicit Virtual Servers for 80/443 added
- 03:55:22 - apply runs successfully, but new runner pod stuck Pending (cause 2)
- 04:08 onwards - host-network outages continue; assumed bridge churn was still in play
- 04:30 - decision script written; on next recovery, scale runner to 0
- 04:36 - scale-to-0 fires successfully; all runner pods terminate
- 04:37 onwards - host network stable; watch.log lies about metronomic outages because of cause 4
- ~04:50 - cause 4 identified via direct nc + curl + `coily whoami` confirming host is fine
- Filed #151, #152 (closed by 2e7b0b4), #153, #154 (morning checkin)
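The outage windows under cause 1 are in PT while this timeline is in UTC, so lining them up takes a conversion. A small sketch, assuming "PT" on 2026-05-26 means Pacific Daylight Time (UTC-7); `pt_to_utc` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: late May falls under daylight time, so PT is UTC-7.
PDT = timezone(timedelta(hours=-7), "PDT")


def pt_to_utc(date_str, hhmm):
    """Convert a PT wall-clock time to a 'YYYY-MM-DD HH:MM' UTC string."""
    local = datetime.strptime(f"{date_str} {hhmm}", "%Y-%m-%d %H:%M")
    local = local.replace(tzinfo=PDT)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


print(pt_to_utc("2026-05-26", "19:44"))  # prints 2026-05-27 02:44
```

Under that assumption the last reported window (19:44-19:45 PT) lands at 02:44-02:45 UTC, a few minutes before the session opens at ~02:50.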

Lessons / regression bait

These would each be worth a sibling issue. Filing the most surprising one separately; the rest go here.

1. **Probes must validate output, not just exit code.** This session's monitor logged false state for ~30 min before we noticed. See sibling issue (filed alongside this one).
2. **Public allowlist must live in the repo, not just on the router.** DMZ flip was archaeology - nobody remembers turning it on. Documented as a sibling issue.
3. **Local-path PVs lock statefulsets to one node.** Any time we pin a statefulset to a node and assume future moves are cheap, this bites. Add a section to docs/k3s-deploy-notes.md §9 traps.
4. **Control-plane nodes must not host docker workloads that mutate iptables.** Forgejo runner with DinD is the canonical anti-pattern. Add a section to docs/k3s-deploy-notes.md §9 traps.
5. **kai-desktop-tower-wsl was joined as a k3s worker but never confirmed end-to-end.** Latent infra debt - we have a node that might or might not be schedulable for any nontrivial pod. Should be either confirmed-and-documented or removed.
6. **Coily breaking-changes (like the ssh lockdown tightening) need a downstream audit.** Anything that builds on top of coily ssh free-form is now broken. Need a cross-repo grep when shipping breaking coily changes.
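The cross-repo grep in the last lesson could start as something like the sketch below. It assumes free-form calls look like the `coily ssh <host> -- <argv>` form quoted under cause 4 and that the callers are `*.sh` files. `find_freeform_calls` is a hypothetical helper, and its hits are candidates for review, not confirmed breakage.

```python
import re
from pathlib import Path

# Matches "coily ssh <host> -- <something>", the free-form shape quoted above.
FREEFORM = re.compile(r"\bcoily\s+ssh\s+\S+\s+--\s+\S+")


def find_freeform_calls(root, pattern="*.sh"):
    """Return (relative_path, line_number, line) for each suspect call."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if FREEFORM.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Run against each checkout, for example `find_freeform_calls("/path/to/repo")`, it lists the file, line number, and text of every match. The postmortem does not say whether any `--` form stays valid after the lockdown, so each hit still needs a human look.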

Out of scope

- Long-term observability for kai-server (Prometheus host-network metrics, Grafana dashboard for INPUT chain packet rate). Would have made cause 1 visible in 30 seconds. Not blocking.
- Migrating off DinD entirely (kubernetes-mode forgejo runner if/when forgejo ships it).
- A `tooling-supply-chain-audit` for coily-as-dependency in scripts.

How to apply

Close as documentation-only once filed. Lessons 3 and 4 should land as doc edits to docs/k3s-deploy-notes.md §9 in a follow-up commit. Lessons 1 and 2 have sibling issues for the prevention work. Lessons 5 and 6 are worth filing if/when prioritized.


coilysiren added the

label

2026-06-04 08:16:53 +00:00

coilysiren commented

2026-06-17 08:22:40 +00:00

Author

Owner

Backlog burndown 2026-06-17: closing low-priority (P3/P4) to bring the open count to a manageable level. Nothing lost — reopen if this resurfaces. Batch tag: burndown-2026-06.


coilysiren

2026-06-17 08:22:41 +00:00