kai-server hard-hangs under EcoServer + observability I/O load (kine watchdog cascade) #48

Open

opened 2026-05-23 20:54:33 +00:00 by coilysiren · 0 comments

Originally filed by @coilysiren on 2026-05-18T22:15:10Z - https://github.com/coilysiren/infrastructure/issues/182

Symptom

kai-server hard-wedged twice in ~16h on 2026-05-18 (boots at 12:39 and 14:55 with no preceding clean shutdown; Kai manually power-cycled the box both times). `last reboot` shows the same pattern on Feb 6, Feb 7, and Apr 13, so this is a long-running, low-frequency problem.

Diagnosis

Not OOM, not a kernel panic, not a hardware failure. The signature in `journalctl -b -<N>` of each crashed boot is a cascade of systemd watchdog timeouts:

- `systemd-journald.service: Watchdog timeout (limit 3min)!` recurring at 00:24, 05:26, 06:38, 09:54 on 2026-05-18
- `snapd.service: Watchdog timeout (limit 5min)!`
- k3s/kine `Slow SQL ... total time: 2m7s` for trivial SELECTs
- `tailscaled ... subscriber for health.Change is slow (3m20s elapsed)`

The system wedged on I/O for minutes at a time until the watchdogs tripped and started killing critical services. The resulting cascade then took the box down.
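The watchdog signature described above can be counted mechanically. The following is a minimal sketch, not an existing tool: the function name `count_watchdog_timeouts`, the regular expression, and the sample text are illustrative assumptions. The sample contains only the message portion of the lines quoted in this issue, without the timestamp and host prefix that real journal output carries.

```python
import re
from collections import Counter

# Matches the message shape quoted in this issue, e.g.
# "systemd-journald.service: Watchdog timeout (limit 3min)!"
# The pattern is an assumption derived from those quoted lines only.
WATCHDOG_RE = re.compile(
    r"(?P<unit>[\w@.\-]+\.service): Watchdog timeout \(limit (?P<limit>[^)]+)\)!"
)


def count_watchdog_timeouts(journal_text: str) -> Counter:
    """Count watchdog timeout lines per systemd unit in a block of journal text."""
    counts: Counter = Counter()
    for line in journal_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group("unit")] += 1
    return counts


# Hand-built sample (hypothetical ordering, message text as quoted in the issue).
WATCHDOG_SAMPLE = """\
systemd-journald.service: Watchdog timeout (limit 3min)!
snapd.service: Watchdog timeout (limit 5min)!
systemd-journald.service: Watchdog timeout (limit 3min)!
"""

watchdog_counts = count_watchdog_timeouts(WATCHDOG_SAMPLE)
for unit, hits in watchdog_counts.most_common():
    print(f"{unit}: {hits}")
```

In practice the input would be the text of a previous boot's journal, for example the captured output of `journalctl -b -1 --no-pager`.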

Hardware ruled out

`nvme smart-log /dev/nvme0n1`: `critical_warning 0`, `available_spare 100%`, `percentage_used 20%`, `media_errors 0`, `num_err_log_entries 0`, `temperature 39C`. NVMe is healthy. Notable side observation: `unsafe_shutdowns 41 / power_cycles 51`, so this hang-and-power-cycle pattern has been going on for a long time.

Root cause

The EcoServer working set is ~11 GiB at 8 min of uptime and grows over a world's life. Together with k3s kine (embedded SQLite) and the OTLP firehose into vmagent → VictoriaMetrics, this saturates the NVMe write path. kine fsync stalls → control plane stalls → journald can't flush → watchdogs trip → wedge.

Fix candidates, ranked

1. **Cap EcoServer memory** via `systemctl edit eco-server` + `MemoryMax=16G`. Forces a known failure mode (EcoServer OOMs and restarts) instead of dragging the host down.
2. **Reduce EcoServer OTLP export volume.** Currently exports `System.Runtime` metrics every few seconds with large resource blocks. Drop frequency to 30-60s or sample.
3. **Raise the journald watchdog** (`WatchdogSec=15min` override) so a transient stall doesn't escalate. Cosmetic, not root-cause.
4. **Migrate k3s from kine SQLite to embedded etcd** or move kine onto a separate disk. Overkill for homelab; skip unless 1+2 don't hold.

Meta-improvement

This is the third or fourth time this has happened (Feb 6/7, Apr 13, May 18), and no prior diagnosis was captured anywhere: no skill, no issue, no vault digest. Adding an `unsafe_shutdowns` delta check and a check for new `Watchdog timeout` lines to `daily-operational` would let the routine surface this signal without manual investigation. Related: agentic-os-kai#281, #282 (daily-operational source errors for k3s_pods/k3s_nodes, likely silent symptoms of these wedges).
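The `unsafe_shutdowns` delta check could look roughly like the sketch below. Everything here is an assumption for illustration: `daily-operational` is not shown in this issue, so the function names, the `field : value` text layout assumed for `nvme smart-log` output, and the stored previous value of 39 are hypothetical. Only the counter values 41 and 51 come from the issue.

```python
import re


def parse_counter(smart_log_text: str, field: str) -> int:
    """Extract an integer field (e.g. unsafe_shutdowns) from smart-log text.

    Assumes a "field : value" layout, one field per line; thousands
    separators in the value are tolerated and stripped.
    """
    pattern = re.compile(rf"^\s*{re.escape(field)}\s*:\s*([\d,]+)", re.MULTILINE)
    match = pattern.search(smart_log_text)
    if match is None:
        raise ValueError(f"field {field!r} not found in smart-log text")
    return int(match.group(1).replace(",", ""))


def unsafe_shutdown_delta(previous: int, smart_log_text: str) -> int:
    """Return how many unsafe shutdowns occurred since the last recorded value."""
    return parse_counter(smart_log_text, "unsafe_shutdowns") - previous


# Sample text using the counters reported in this issue (layout is assumed).
SMART_LOG_SAMPLE = """\
power_cycles : 51
unsafe_shutdowns : 41
"""

# 39 is a hypothetical value remembered from an earlier run.
delta = unsafe_shutdown_delta(39, SMART_LOG_SAMPLE)
if delta > 0:
    print(f"unsafe_shutdowns rose by {delta} since the last check")
```

A nonzero delta means the box lost power or was power-cycled without a clean shutdown since the previous run, which is the signal this issue had to reconstruct by hand.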

Related

EcoServer-process-crash cascade tracked separately on coilysiren/eco-mods (filed concurrently).


coilysiren referenced this issue

2026-05-30 05:43:29 +00:00

Self-hosted Backstage on kai-server #94


Reference

coilyco-flight-deck/infrastructure#48
