Alert on leading-edge memory exhaustion (swap drain + PSI) before livelock blackout #184

Open
opened 2026-05-30 17:53:52 +00:00 by coilysiren · 0 comments
Owner

Problem

kai-server livelocked from memory exhaustion two nights running (2026-05-29 ~06:22 and 2026-05-30 ~06:51). Root cause: a 3am brew upgrade (via coily-update.service) spawned a swarm of ~16 cc1plus compilers (~1GB RSS each) building a formula from source, which drove anon memory to ~30.7GB on a 31GB-usable box with only 2GB swap. Once swap hit 0, the kernel entered unbounded __alloc_pages_slowpath reclaim and the box thrashed for ~3 hours (a single OOM stack trace took from 03:14 to 06:51 to finish printing) until a power-cycle recovered it at ~07:00.

Key constraint: in-host telemetry goes dark during the livelock

node_exporter on kai-server has a hard data gap from ~03:15 to ~07:10 PDT on 2026-05-30, which is exactly the livelock window. SwapFree was observably draining (0.72 -> 0.23 GiB) right up to 03:15, followed by a total blackout until reboot. You can only ever see up to the cliff, never through it, so post-hoc logging cannot diagnose the terminal phase. The only useful signal is an alert that fires on the leading edge (the 03:00-03:13 ramp) while the box can still emit a notification.

Ask

Add a vmalert rule (or equivalent) on data that already exists in vmsingle from node_exporter:

  • node_memory_SwapFree_bytes{instance="kai-server"} below a threshold (e.g. < 15% of node_memory_SwapTotal_bytes) for N minutes.
  • rate(node_pressure_memory_waiting_seconds_total{instance="kai-server"}[1m]) rising, i.e. a growing share of wall-clock time spent stalled on memory (PSI memory stall = reclaim thrash starting).
  • Optionally node_memory_MemAvailable_bytes collapsing toward zero.

Route to a push notification (Discord / the existing alert path) so the spiral is caught before the blackout.
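
A minimal sketch of what such a rule group could look like, assuming vmalert is loading Prometheus-style rule files. Every threshold, `for` duration, the evaluation interval, the group and alert names, and the `severity` label are placeholders, not values taken from the investigation. The only hard constraint from the incident is that N must stay well under the ~13-minute ramp (03:00-03:13), or the alert will not fire before the blackout. "Rising" PSI is approximated here by a fixed threshold on the stall rate, which is a simplification.

```yaml
# Sketch only: names, thresholds, and durations below are assumptions.
groups:
  - name: kai-server-memory-leading-edge   # hypothetical group name
    interval: 15s                          # assumed evaluation interval
    rules:
      # Swap nearly drained: free swap below 15% of total.
      - alert: KaiServerSwapNearlyDrained
        expr: |
          node_memory_SwapFree_bytes{instance="kai-server"}
            / node_memory_SwapTotal_bytes{instance="kai-server"}
            < 0.15
        for: 2m                            # must be far shorter than the ~13 min ramp
        labels:
          severity: page                   # assumed label used for routing
        annotations:
          summary: "Swap free below 15% on {{ $labels.instance }}"
          description: "SwapFree / SwapTotal = {{ $value }}"

      # PSI memory stall: fraction of time some task waited on memory.
      # rate() of a seconds counter yields seconds stalled per second (0..1).
      - alert: KaiServerMemoryPressureStall
        expr: |
          rate(node_pressure_memory_waiting_seconds_total{instance="kai-server"}[1m])
            > 0.10
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Memory PSI stall above 10% on {{ $labels.instance }}"
          description: "Memory waiting rate = {{ $value }}"

      # Optional: available memory collapsing.
      - alert: KaiServerMemAvailableLow
        expr: |
          node_memory_MemAvailable_bytes{instance="kai-server"}
            / node_memory_MemTotal_bytes{instance="kai-server"}
            < 0.05
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "MemAvailable below 5% of MemTotal on {{ $labels.instance }}"
```

The `[1m]` window only yields a meaningful rate if it spans several scrapes, so it may need widening depending on the scrape interval. Because all three rules are evaluated from data already in vmsingle, they keep working until the last sample before the gap; they cannot fire once node_exporter stops reporting.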

Notes

  • node_exporter already exposes node_memory_Swap{Total,Free,Cached}_bytes and node_pressure_memory_{waiting,stalled}_seconds_total for kai-server. No new exporter needed.
  • Pairs with the coily-update cap (separate issue) and the heartbeat-ingestion fix (separate issue).

Found during the 2026-05-30 crash investigation.

Reference
coilyco-flight-deck/infrastructure#184