systemd unit logging + monitoring on kai-server, prereq for auto-deploys from infrastructure repo #55

Open

opened 2026-05-23 20:54:34 +00:00 by coilysiren · 0 comments


Originally filed by @coilysiren on 2026-05-13T07:54:27Z - https://github.com/coilysiren/infrastructure/issues/130

Problem

Multiple coily / infrastructure features are gated on "my operational confidence with systemd timers isn't high enough yet to trust auto-pull + reload loops." Until we can see what systemd units are doing (restarts, failures, restart loops, mass-restart events), we keep manual-only as the safe default. That keeps surfacing in adjacent issues:

coilysiren/infrastructure#129 - caddy tailnet shortcuts: manual git pull + caddy reload today.
coilysiren/infrastructure#127 - audit dashboard timer is shipped, but we have no automated signal if it stops firing or if its underlying service starts failing silently.
Cluster ingress generation in general (next-tier follow-ups behind #129) wants the same auto-pull pattern.

The blocker is observability, not the automation itself.

What's needed

Feed systemd unit lifecycle into the existing VictoriaMetrics + Grafana stack on kai-server (per docs/k3s-deploy-notes.md §1). Then we can:

See per-unit restart counts and last-failure timestamps in Grafana.
Alert on restart loops (e.g. >3 restarts in 5 min).
Alert when a load-bearing timer (coily-audit-dashboard.timer, future infrastructure-auto-pull.timer) hasn't fired in N minutes.
Get a daily summary in daily-operational so failures are noticed even without an alert page.
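
A minimal sketch of the two alert conditions above, written as plain Python predicates so the thresholds can be reasoned about before they become alert rules. The function names, the sample format, and the default thresholds are assumptions for illustration, not existing code in this repo.

```python
from typing import Sequence, Tuple


def restart_loop(
    samples: Sequence[Tuple[float, int]],
    window_s: float = 300.0,
    threshold: int = 3,
) -> bool:
    """Return True if the restart counter grew by more than `threshold`
    within the last `window_s` seconds.

    `samples` is a time-ordered list of (unix_timestamp, restart_count).
    """
    if not samples:
        return False
    latest_ts, latest_count = samples[-1]
    # Keep only samples inside the window ending at the latest sample.
    in_window = [count for ts, count in samples if latest_ts - ts <= window_s]
    baseline = in_window[0]
    # A counter reset (e.g. after a reboot) yields a negative delta,
    # which is treated as "no loop" rather than as an alert.
    return latest_count - baseline > threshold


def timer_stale(seconds_since_fired: float, max_silence_s: float) -> bool:
    """Return True if a timer has been silent longer than allowed."""
    return seconds_since_fired > max_silence_s
```

For example, `restart_loop([(0, 1), (60, 2), (290, 5)])` flags a loop because the counter rose by 4 inside 5 minutes, while `timer_stale(60, 900)` does not fire for a timer that ran a minute ago with a 15 minute allowance.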

Shape

scripts/systemd-heartbeat.py - same stdlib + OTLP/HTTP protobuf pattern as process-memory-heartbeat.py and thermal-heartbeat.py. Reads systemctl list-units --state=failed, plus a configurable allowlist of "load-bearing" units. Emits:
- systemd_unit_active{unit,sub_state,host} gauge (1 active / 0 inactive).
- systemd_unit_restart_count_total{unit,host} counter (cumulative NRestarts from systemctl show).
- systemd_unit_last_exit_status{unit,host} gauge.
- systemd_unit_seconds_since_active{unit,host} gauge (so we can alert on stale timers).
systemd/systemd-heartbeat.{service,timer} - a oneshot service whose timer fires every 30s, like the sibling heartbeats.
deploy/observability/grafana/systemd.json dashboard - one row per load-bearing unit, with restart-count and last-state panels. Imported via the existing make observability workflow.
Alert rules - VMAlert rules co-located with the existing observability stack. Two starters: restart-loop (NRestarts increases by more than 3 within 5m) and timer-stale (no state transition for more than 2x the timer interval).
daily-operational skill update - pull systemd failures from VictoriaMetrics in the daily routine.
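
A rough sketch of the collection half of systemd-heartbeat.py, assuming the four metrics are derived from `systemctl show` properties (`ActiveState`, `SubState`, `NRestarts`, `ExecMainStatus`, `ActiveEnterTimestampMonotonic`). The function names are hypothetical, "seconds since active" is read here as "seconds since the unit last entered the active state" (one possible interpretation), and the OTLP/HTTP export step is left out because it should reuse whatever the sibling heartbeats already do.

```python
import subprocess
import time
from typing import Dict, Optional

PROPERTIES = (
    "ActiveState,SubState,NRestarts,ExecMainStatus,ActiveEnterTimestampMonotonic"
)


def read_unit(unit: str) -> str:
    """Return raw KEY=VALUE output of `systemctl show` for one unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, f"--property={PROPERTIES}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines into a dict, skipping malformed lines."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def unit_metrics(props: Dict[str, str], now_monotonic_usec: int) -> Dict[str, object]:
    """Map systemctl properties onto the metric names proposed above."""

    def as_int(name: str) -> int:
        # Properties missing for a unit type (e.g. NRestarts on a timer)
        # fall back to 0.
        try:
            return int(props.get(name) or 0)
        except ValueError:
            return 0

    entered = as_int("ActiveEnterTimestampMonotonic")  # microseconds
    since: Optional[float] = None  # None means "never been active"
    if entered:
        since = max(0, now_monotonic_usec - entered) / 1_000_000
    return {
        "systemd_unit_active": 1 if props.get("ActiveState") == "active" else 0,
        "sub_state": props.get("SubState", "unknown"),
        "systemd_unit_restart_count_total": as_int("NRestarts"),
        "systemd_unit_last_exit_status": as_int("ExecMainStatus"),
        "systemd_unit_seconds_since_active": since,
    }


def collect(unit: str) -> Dict[str, object]:
    """Collect metrics for one unit. Assumes time.monotonic() and systemd's
    monotonic timestamps share a clock, which should hold on Linux."""
    now_usec = int(time.monotonic() * 1_000_000)
    return unit_metrics(parse_show_output(read_unit(unit)), now_usec)
```

Keeping `parse_show_output` and `unit_metrics` free of subprocess calls means they can be unit-tested with canned `systemctl show` output, without a live systemd.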

Considerations

Should ride alongside the existing heartbeat scripts so the operational pattern stays uniform. No new daemon. No new framework.
Allowlist-of-units rather than "every unit on the box" - kai-server has dozens of units, most are noise.
Initial allowlist: coily-audit-dashboard, coily-update, repo-recall, repo-recall-update, claude-remote-control, claude-remote-control-restart, the game-server units, plus k3s.
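
One way the allowlist could be normalized before querying systemd, sketched under the assumption that bare names mean `.service` units and that entries with an explicit suffix (such as `.timer`) are kept as written. `normalize_allowlist` is a hypothetical helper, and the game-server units are left out of the example because this issue does not name them.

```python
from typing import Iterable, List

# Unit-type suffixes treated as "already fully qualified".
KNOWN_SUFFIXES = (".service", ".timer", ".socket", ".target", ".mount", ".path")


def normalize_allowlist(entries: Iterable[str]) -> List[str]:
    """Turn allowlist entries into full unit names, dropping blanks,
    comment lines, and duplicates while preserving order."""
    units: List[str] = []
    for entry in entries:
        name = entry.strip()
        if not name or name.startswith("#"):
            continue
        if not name.endswith(KNOWN_SUFFIXES):
            name += ".service"  # assumption: bare names are services
        if name not in units:
            units.append(name)
    return units


# Names taken from the initial allowlist above (game-server units omitted).
INITIAL_ALLOWLIST = normalize_allowlist(
    [
        "coily-audit-dashboard",
        "coily-audit-dashboard.timer",
        "coily-update",
        "repo-recall",
        "repo-recall-update",
        "claude-remote-control",
        "claude-remote-control-restart",
        "k3s",
    ]
)
```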

What this unblocks

Once we have eyes on systemd, we can ship the follow-up:

infrastructure-auto-pull.timer - polls coilysiren/infrastructure main, pulls, reloads caddy if caddy/ changed. Closes the loop from #129.
Same pattern for any other systemd-driven config-from-repo workflow we add later.
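
The "reloads caddy if caddy/ changed" decision could look roughly like this. It is a sketch only: the helper names are hypothetical, and the actual pull and reload commands are deliberately left out since they depend on how caddy is run on kai-server.

```python
import subprocess
from typing import Iterable, List


def changed_paths(repo_dir: str, old_rev: str, new_rev: str) -> List[str]:
    """List files that differ between two revisions of the checkout."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", old_rev, new_rev],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def needs_caddy_reload(paths: Iterable[str], prefix: str = "caddy/") -> bool:
    """Return True if any changed path sits under the top-level caddy/ dir."""
    return any(path.startswith(prefix) for path in paths)
```

Matching on the `caddy/` prefix rather than a substring keeps unrelated paths such as `docs/caddy/notes.md` from triggering a reload.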

Out of scope

Pushing journalctl logs themselves into VictoriaMetrics. Metrics first; log shipping is a separate decision and a different storage shape (Loki, journal-export, etc.).
Cross-host (the Mac, other future Linux hosts) systemd visibility. kai-server only for now.
Replacing the heartbeat scripts with systemd-exporter or a similar prebuilt exporter. Keep the stdlib pattern coherent across the existing heartbeats; revisit if it gets unwieldy.

Cross-links

coilysiren/infrastructure#127 - audit dashboard timer (already running, this would monitor it).
coilysiren/infrastructure#129 - caddy shortcut framework (auto-pull blocked on this).
scripts/process-memory-heartbeat.py, scripts/thermal-heartbeat.py - the pattern this follows.
docs/k3s-deploy-notes.md §1 - VictoriaMetrics + Grafana topology.
coilyco-ai daily-operational skill - downstream consumer of the same data.



Reference

coilyco-flight-deck/infrastructure#55
