kai-server hard-hangs under EcoServer + observability I/O load (kine watchdog cascade) #48
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally filed by @coilysiren on 2026-05-18T22:15:10Z - https://github.com/coilysiren/infrastructure/issues/182
Symptom
kai-server hard-wedged twice in ~16h on 2026-05-18 (boots at 12:39 and 14:55 with no preceding clean shutdown — both manually power-cycled by Kai).
last rebootshows the same pattern at Feb 6, Feb 7, Apr 13 — long-running, low-frequency.Diagnosis
Not OOM, not a kernel panic, not a hardware failure. The signature in
journalctl -b -<N>of each crashed boot is a cascade of systemd watchdog timeouts:systemd-journald.service: Watchdog timeout (limit 3min)!recurring at 00:24, 05:26, 06:38, 09:54 on 2026-05-18snapd.service: Watchdog timeout (limit 5min)!Slow SQL ... total time: 2m7sfor trivial SELECTstailscaled ... subscriber for health.Change is slow (3m20s elapsed)System wedged on I/O for minutes at a time until watchdogs tripped and started killing critical services, then the cascade took the box down.
Hardware ruled out
nvme smart-log /dev/nvme0n1:critical_warning 0,available_spare 100%,percentage_used 20%,media_errors 0,num_err_log_entries 0,temperature 39C. NVMe is healthy. Notable side observation:unsafe_shutdowns 41 / power_cycles 51— this hang-and-power-cycle pattern has been going on a long time.Root cause
EcoServer working set is ~11 GiB at 8 min uptime, grows over a world's life. Combined with k3s kine (embedded SQLite) and the OTLP firehose into vmagent → VictoriaMetrics, the NVMe write path saturates. kine fsync stalls → control plane stalls → journald can't flush → watchdogs trip → wedge.
Fix candidates, ranked
systemctl edit eco-server+MemoryMax=16G. Forces a known failure mode (EcoServer OOMs and restarts) instead of dragging the host down.System.Runtimemetrics every few seconds with large resource blocks. Drop frequency to 30-60s or sample.WatchdogSec=15minoverride) so a transient stall doesn't escalate. Cosmetic, not root-cause.Meta-improvement
This is the third or fourth time this has happened (Feb 6/7, Apr 13, May 18) and no prior diagnosis was captured anywhere — no skill, no issue, no vault digest. Adding a
unsafe_shutdownsdelta + newWatchdog timeoutline check todaily-operationalwould let the routine surface this signal without manual investigation. Related: agentic-os-kai#281, #282 (daily-operational source errors for k3s_pods/k3s_nodes — likely silent symptoms of these wedges).Related