postmortem - 2026-05-26 kai-server outage session - three independent issues stacked #155
Postmortem of the 2026-05-26 evening session on kai-server. Three independent issues stacked and looked identical from the outside: public SSH brute force, control-plane host-network flapping, and a monitoring tool that started lying when its underlying CLI tightened. Filing the whole timeline because conflating these in the future would burn hours again.
Symptom that opened the session
Kai dictated "ssh to kai server just locked me out? help me debug." Initial probe:
`coily ssh kai-server` returned `tcp dial kai-server:22: i/o timeout`. Tailscale ping returned pong in 6ms. Eco-mcp public HTTPS returned 200. Classic host-network-down-but-pod-network-up signature.
Cause 1 - DinD bridge churn on the control plane (resolved by `0ae7bf3`, see #151)
Forgejo Actions runner was a StatefulSet pinned to kai-server with a DinD sidecar. Each workflow job created docker user-bridges (`br-XXXXXXXX`) which raced k3s kube-proxy iptables sync. During the race, host-namespace TCP listeners (sshd:22, kube-apiserver:6443, tailscaled PeerAPI:53096) got dropped from the INPUT path for 5+ minute windows. Pod-net survived because flannel+kube-proxy own those rules separately.
Smoking gun: kernel ring buffer showed bursts of `br-XXXXXXXX entered blocking/forwarding/disabled state` lines, and the bursts time-correlated 1:1 with reported outages (17:18-17:27, 18:58, 19:16-19:17, 19:44-19:45 all PT). K3s apiserver logged `apiserver was unable to write a JSON response: http: Handler timeout` at 17:57 and 19:21 PT, inside two of those windows.
Fix: pin runner to `kai-desktop-tower-wsl` instead of kai-server (commit `0ae7bf3`). DinD churn still happens, but on a worker node where no critical host-network listeners live.
Cause 2 - PVC pins prevent the runner from actually moving (open as #153)
Apply succeeded but the new pod stayed Pending with `volume node affinity conflict`. The `data-forgejo-runner-1` PVC is backed by a local-path PV pinned to kai-server. New nodeSelector says WSL, PV says kai-server, scheduler can't satisfy both. The cause-1 fix didn't take effect at all until we scaled the StatefulSet to 0 replicas at 04:37:13Z.
After scale-to-0, all forgejo-runner pods terminated, DinD bridge churn stopped, host network stabilized. Verified via direct nc to 22 and 6443 plus `coily whoami` over SSH at ~04:50Z - all healthy.
#153 holds the morning fix.
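The nodeSelector-vs-PV conflict could have been caught before apply with a pre-flight check. A minimal sketch in Python, assuming the PV pins its node through a required `kubernetes.io/hostname` `In` expression and that the runner's nodeSelector uses the same label key. Neither detail is confirmed in this issue, and the function names and `EXAMPLE_PV` are hypothetical:

```python
from typing import Optional, Set


def pv_allowed_hostnames(pv: dict) -> Optional[Set[str]]:
    """Hostnames a PV's required node affinity allows, or None if it is unpinned.

    Simplification (assumption): only `kubernetes.io/hostname` expressions with
    operator `In` are considered, and all terms are unioned.
    """
    terms = (
        pv.get("spec", {})
        .get("nodeAffinity", {})
        .get("required", {})
        .get("nodeSelectorTerms", [])
    )
    hosts: Set[str] = set()
    for term in terms:
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == "kubernetes.io/hostname" and expr.get("operator") == "In":
                hosts.update(expr.get("values", []))
    return hosts or None


def affinity_conflict(pod_node_selector: dict, pv: dict) -> bool:
    """True when the pod is pinned to a host that the PV does not allow."""
    wanted = pod_node_selector.get("kubernetes.io/hostname")
    allowed = pv_allowed_hostnames(pv)
    if wanted is None or allowed is None:
        return False  # one side is unpinned, so the hostname cannot conflict
    return wanted not in allowed


# Hypothetical PV object shaped like `kubectl get pv -o json` output.
EXAMPLE_PV = {
    "spec": {
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": ["kai-server"],
                            }
                        ]
                    }
                ]
            }
        }
    }
}

if __name__ == "__main__":
    selector = {"kubernetes.io/hostname": "kai-desktop-tower-wsl"}
    print(affinity_conflict(selector, EXAMPLE_PV))
```

With the selector pointing at `kai-desktop-tower-wsl` and the PV pinned to kai-server, the check reports a conflict, which is the Pending state seen here.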
Cause 3 - public SSH exposed via router DMZ (resolved at router level)
While debugging, the first diagnostic snapshot showed sshd absorbing a brute-force scan from `60.163.139.198` at one attempt every ~8 seconds, dozens of usernames. Public SSH was never supposed to be exposed. Audit found: `factorio (34197/udp), eco (3000-3003)` - clean, no SSH forward.
Public services should be an explicit allowlist, not a DMZ default-allow. Fix applied during the session: `80/tcp` and `443/tcp` to 192.168.0.194.
Brute force surface closed. The brute force itself was not load-bearing for the host-network outages; it ran for hours without affecting them. Pure background hardening.
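The allowlist rule can be turned into a periodic audit. A minimal sketch in Python, assuming the externally reachable ports can be listed as `port/proto` strings (from a router export or an outside scan). That input format and the function names are hypothetical, and the allowlist uses only ports named in this issue:

```python
from typing import Iterable, List, Tuple


def parse_port(spec: str) -> Tuple[int, str]:
    """Split a 'port/proto' string such as '443/tcp' into (443, 'tcp')."""
    port, _, proto = spec.partition("/")
    return int(port), proto.lower()


def unexpected_exposures(exposed: Iterable[str], allowlist: Iterable[str]) -> List[str]:
    """Return exposed ports that are not on the explicit allowlist, sorted."""
    allowed = {parse_port(spec) for spec in allowlist}
    return sorted(spec for spec in exposed if parse_port(spec) not in allowed)


# Allowlist built from ports mentioned in this postmortem.
ALLOWLIST = ["80/tcp", "443/tcp", "34197/udp"]

if __name__ == "__main__":
    # Hypothetical scan result taken while the DMZ was still active.
    print(unexpected_exposures(["22/tcp", "80/tcp", "443/tcp"], ALLOWLIST))
```

Anything this returns is exposure nobody signed off on. Under the DMZ default-allow, `22/tcp` would have shown up here.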
Cause 4 - monitoring tool lied silently after a coily upgrade
`scripts/host-watch.sh` (committed `2e7b0b4` via #152) polls SSH liveness via `coily ssh kai-server -- echo alive` and on dead->alive transitions streams `scripts/host-diag.sh` into the remote via `bash -s`. Both forms relied on `coily ssh` accepting free-form remote argv.
Mid-session, `coily` upgraded from 2.42.0 to 2.43.0 and tightened the ssh lockdown: `coily ssh` now only ships args to a remote `coily` subcommand, not to free-form bash. The probe started returning coily's help dump (exit 1) and the diag started capturing coily help instead of the actual diagnostic. Watch logged a metronomic 4-min "outage" cycle that didn't exist - all post-04:37 watch.log transitions are probe artifacts, not real host state.
This is the regression that scares me most. The tool kept producing output that looked plausible to the operator (timestamped state transitions, recovery snapshot files, file sizes that grew). The outages were real until 04:37, then turned into pure noise. A reader could waste an hour chasing a phantom 4-min cron.
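Whatever replaces the probe should distinguish "host is dead" from "the probe itself broke". A minimal sketch in Python of that classification, assuming a real connection failure leaves stdout empty (errors go to stderr) and that the probe command echoes a known sentinel on success. The function name and the three state labels are hypothetical, not taken from `host-watch.sh`:

```python
def classify_probe(exit_code: int, stdout: str, sentinel: str = "alive") -> str:
    """Classify one liveness probe result.

    Returns "alive", "dead", or "probe-broken". Only an exact sentinel match
    with exit 0 counts as alive; unexpected output is never treated as a
    host state, because it means the probe can no longer be trusted.
    """
    out = stdout.strip()
    if exit_code == 0 and out == sentinel:
        return "alive"
    if exit_code != 0 and not out:
        # Assumption: timeouts and refused connections produce no stdout.
        return "dead"
    # Non-sentinel output (for example a help dump) or a silent success.
    return "probe-broken"


if __name__ == "__main__":
    print(classify_probe(0, "alive\n"))
    print(classify_probe(1, ""))
    print(classify_probe(1, "some unexpected text"))
```

A "probe-broken" result should halt the watcher or alert, not be logged as a state transition. That would have turned the phantom 4-min cycle into a single loud failure at the moment of the upgrade.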
Kai's call: remove `coily ssh` entirely rather than chase it. `scripts/host-watch.sh` and `scripts/host-diag.sh` will be broken until rewritten to a non-coily-ssh path (or deleted). Tracked separately as the followup issues filed alongside this one.
Timeline (UTC)
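[Timeline table: timestamps and most row text not recoverable. Surviving event fragments, in order:]
`0ae7bf3` (nodeSelector swap)
`coily whoami` confirming host is fine
... `2e7b0b4`), #153, #154 (morning checkin)
Lessons / regression bait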
These would each be worth a sibling issue. Filing the most surprising one separately; the rest go here.
Out of scope
`tooling-supply-chain-audit` for coily-as-dependency in scripts.
How to apply
Close as documentation-only once filed. Lessons 3 and 4 should land as doc edits to docs/k3s-deploy-notes.md §9 in a follow-up commit. Lessons 1 and 2 have sibling issues for the prevention work. Lessons 5 and 6 worth filing if/when prioritized.