kai-server hard-hangs under EcoServer + observability I/O load (kine watchdog cascade) #48

Open
opened 2026-05-23 20:54:33 +00:00 by coilysiren · 0 comments

Originally filed by @coilysiren on 2026-05-18T22:15:10Z - https://github.com/coilysiren/infrastructure/issues/182

Symptom

kai-server hard-wedged twice in ~16h on 2026-05-18 (boots at 12:39 and 14:55 with no preceding clean shutdown; both followed a manual power-cycle by Kai). The output of last reboot shows the same pattern on Feb 6, Feb 7, and Apr 13, so this is a long-running, low-frequency problem.

Diagnosis

Not OOM, not a kernel panic, not a hardware failure. The signature in journalctl -b -<N> of each crashed boot is a cascade of systemd watchdog timeouts:

  • systemd-journald.service: Watchdog timeout (limit 3min)! recurring at 00:24, 05:26, 06:38, 09:54 on 2026-05-18
  • snapd.service: Watchdog timeout (limit 5min)!
  • k3s/kine Slow SQL ... total time: 2m7s for trivial SELECTs
  • tailscaled ... subscriber for health.Change is slow (3m20s elapsed)

The system stalled on I/O for minutes at a time until the watchdogs tripped and systemd started killing critical services; the cascade then took the box down.
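To make the signature above easier to spot across boots, here is a minimal sketch that counts watchdog timeouts per unit in a block of journal text. It assumes the lines look like the ones quoted above; the function name `count_watchdog_timeouts` and the regex are illustrative, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this issue, e.g.
# "systemd-journald.service: Watchdog timeout (limit 3min)!"
WATCHDOG_RE = re.compile(
    r"(?P<unit>\S+\.service): Watchdog timeout \(limit (?P<limit>[^)]+)\)!"
)


def count_watchdog_timeouts(journal_text: str) -> Counter:
    """Count watchdog timeouts per systemd unit in a block of journal text."""
    hits = Counter()
    for line in journal_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits[match.group("unit")] += 1
    return hits
```

One way to use it is to save the output of journalctl -b -<N> for a crashed boot to a file, read the file, and pass its contents to `count_watchdog_timeouts`. A boot showing several units at once is the cascade described here.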

Hardware ruled out

nvme smart-log /dev/nvme0n1: critical_warning 0, available_spare 100%, percentage_used 20%, media_errors 0, num_err_log_entries 0, temperature 39C. NVMe is healthy. Notable side observation: unsafe_shutdowns 41 / power_cycles 51 — this hang-and-power-cycle pattern has been going on a long time.

Root cause

EcoServer's working set is ~11 GiB at 8 min of uptime and grows over a world's life. Combined with k3s kine (embedded SQLite) and the OTLP firehose into vmagent → VictoriaMetrics, this saturates the NVMe write path. The resulting chain: kine fsync stalls → control plane stalls → journald can't flush → watchdogs trip → wedge.
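The fsync-stall step in that chain could be checked directly with a small latency probe. The sketch below was not part of the original investigation, and the function name `fsync_latencies` is hypothetical. It writes a small block to a temporary file in a given directory and times each `os.fsync` call.

```python
import os
import tempfile
import time


def fsync_latencies(directory: str, samples: int = 5,
                    payload: bytes = b"x" * 4096) -> list[float]:
    """Write a small block and fsync it `samples` times; return seconds per fsync."""
    latencies = []
    # The temporary file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(samples):
            handle.write(payload)
            handle.flush()  # push Python's buffer to the OS
            start = time.perf_counter()
            os.fsync(handle.fileno())  # ask the OS to persist the data to disk
            latencies.append(time.perf_counter() - start)
    return latencies
```

Pointed at a directory on the same filesystem as kine's database while EcoServer and the OTLP pipeline are under load, fsync times in the range of seconds rather than milliseconds would be consistent with the stall described above. Low numbers would suggest looking elsewhere.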

Fix candidates, ranked

  1. Cap EcoServer memory via systemctl edit eco-server + MemoryMax=16G. Forces a known failure mode (EcoServer OOMs and restarts) instead of dragging the host down.
  2. Reduce EcoServer OTLP export volume. Currently exports System.Runtime metrics every few seconds with large resource blocks. Drop frequency to 30-60s or sample.
  3. Raise the journald watchdog (WatchdogSec=15min override) so a transient stall doesn't escalate. Cosmetic, not root-cause.
  4. Migrate k3s from kine SQLite to embedded etcd or move kine onto a separate disk. Overkill for homelab; skip unless 1+2 don't hold.
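For candidates 1 and 3, this is roughly what the two systemd drop-ins would contain. It is a sketch only: the helper `render_dropin` is hypothetical, and it renders the text without writing any files. The directive names and values (`MemoryMax=16G`, `WatchdogSec=15min`) come from the list above; placing them under `[Service]` follows normal systemd unit layout.

```python
def render_dropin(directives: dict[str, str], section: str = "Service") -> str:
    """Render a systemd drop-in body: a section header followed by key=value lines."""
    lines = [f"[{section}]"]
    lines += [f"{key}={value}" for key, value in directives.items()]
    return "\n".join(lines) + "\n"


# Candidate 1: cap EcoServer memory so it OOMs and restarts instead of wedging the host.
ECO_SERVER_DROPIN = render_dropin({"MemoryMax": "16G"})

# Candidate 3: give journald more headroom before its watchdog fires.
JOURNALD_DROPIN = render_dropin({"WatchdogSec": "15min"})
```

`systemctl edit eco-server` opens an editor for exactly this kind of override and reloads systemd on save. If the file is written by hand instead, a `systemctl daemon-reload` and a restart of the unit are needed before the new limit applies. Note that `MemoryMax=` is a cgroup v2 setting, so it is worth confirming the host uses the unified hierarchy.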

Meta-improvement

This is the third or fourth time this has happened (Feb 6/7, Apr 13, May 18), and no prior diagnosis was captured anywhere: no skill, no issue, no vault digest. Adding an unsafe_shutdowns delta check and a new Watchdog timeout line check to daily-operational would let the routine surface this signal without manual investigation. Related: agentic-os-kai#281, #282 (daily-operational source errors for k3s_pods/k3s_nodes, likely silent symptoms of these wedges).
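A minimal sketch of what such a check could look like follows. How daily-operational actually loads its sources is not described here, so the function `daily_wedge_check` and its inputs are hypothetical. It assumes the SMART counters arrive as dicts with an `unsafe_shutdowns` key (for example, parsed from nvme-cli's JSON output) and that the journal text covers only the period since the previous run.

```python
def daily_wedge_check(previous_smart: dict, current_smart: dict,
                      journal_text: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing to report."""
    findings = []

    # A rising unsafe_shutdowns counter means the box lost power without a clean shutdown.
    before = int(previous_smart["unsafe_shutdowns"])
    now = int(current_smart["unsafe_shutdowns"])
    if now > before:
        findings.append(f"unsafe_shutdowns increased by {now - before} (now {now})")

    # Any watchdog timeout in the window is an early sign of the I/O stall.
    watchdog_lines = [
        line for line in journal_text.splitlines() if "Watchdog timeout" in line
    ]
    if watchdog_lines:
        findings.append(f"{len(watchdog_lines)} Watchdog timeout line(s) in the journal")

    return findings
```

Persisting the previous counter between runs is left out. Any small state file would do, as long as the comparison is against the last recorded value rather than a fixed baseline.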

Related

  • EcoServer-process-crash cascade tracked separately on coilysiren/eco-mods (filed concurrently).
Reference
coilyco-flight-deck/infrastructure#48