systemd unit logging + monitoring on kai-server, prereq for auto-deploys from infrastructure repo #55
Originally filed by @coilysiren on 2026-05-13T07:54:27Z - https://github.com/coilysiren/infrastructure/issues/130
Problem
Multiple coily / infrastructure features are gated on "my operational confidence with systemd timers isn't high enough yet to trust auto-pull + reload loops." Until we can see what systemd units are doing (restarts, failures, restart loops, mass-restart events), we keep manual-only as the safe default. That keeps surfacing in adjacent issues:
- git pull + caddy reload today.
The blocker is observability, not the automation itself.
What's needed
Feed systemd unit lifecycle into the existing VictoriaMetrics + Grafana stack on kai-server (per docs/k3s-deploy-notes.md §1). Then we can:
- coily-audit-dashboard.timer, future infrastructure-auto-pull.timer) hasn't fired in N minutes.
- daily-operational so failures are noticed even without an alert page.
Shape
- scripts/systemd-heartbeat.py - same stdlib + OTLP/HTTP protobuf pattern as process-memory-heartbeat.py and thermal-heartbeat.py. Reads systemctl list-units --state=failed, plus a configurable allowlist of "load-bearing" units. Emits:
  - systemd_unit_active{unit,sub_state,host} gauge (1 active / 0 inactive).
  - systemd_unit_restart_count_total{unit,host} counter (cumulative NRestarts from systemctl show).
  - systemd_unit_last_exit_status{unit,host} gauge.
  - systemd_unit_seconds_since_active{unit,host} gauge (so we can alert on stale timers).
- systemd/systemd-heartbeat.{service,timer} - oneshot, fire every 30s like the siblings.
- deploy/observability/grafana/systemd.json dashboard - one row per load-bearing unit, with restart-count and last-state panels. Imported via the existing make observability workflow.
- daily-operational skill update - pull systemd failures from VictoriaMetrics in the daily routine.
Considerations
- coily-audit-dashboard, coily-update, repo-recall, repo-recall-update, claude-remote-control, claude-remote-control-restart, the game-server units, plus k3s.
What this unblocks
Once we have eyes on systemd, we can ship the follow-up:
- infrastructure-auto-pull.timer - polls coilysiren/infrastructure main, pulls, reloads caddy if caddy/ changed. Closes the loop from #129.
Out of scope
- systemd-exporter or similar prebuilt. Keep the stdlib pattern coherent across the existing heartbeats; revisit if it gets unwieldy.
Cross-links
- scripts/process-memory-heartbeat.py, scripts/thermal-heartbeat.py - the pattern this follows.
- docs/k3s-deploy-notes.md §1 - VictoriaMetrics + Grafana topology.
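
To make the metric list under Shape concrete, here is a rough stdlib-only sketch of how scripts/systemd-heartbeat.py might turn `systemctl show` output into the four proposed samples. It is not the actual script: the function names, the tuple layout, and the choice of ActiveEnterTimestampMonotonic as the "last active" source are all assumptions for illustration, and the OTLP/HTTP export step used by the sibling heartbeats is left out. For oneshot services a different timestamp property may turn out to be a better fit.

```python
import subprocess
import time

# Properties requested from `systemctl show`. Which ones the real script
# reads is an assumption made for this sketch.
PROPERTIES = (
    "ActiveState",
    "SubState",
    "NRestarts",
    "ExecMainStatus",
    "ActiveEnterTimestampMonotonic",
)


def parse_show_output(text):
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def to_int(value, default=0):
    """Convert a property value to int, falling back when it is missing or empty."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def unit_metrics(unit, host, props, now_monotonic_usec):
    """Map parsed properties onto the four metrics proposed in this issue.

    Returns a list of (metric_name, labels, value) tuples (hypothetical layout).
    """
    labels = {"unit": unit, "host": host}
    entered = to_int(props.get("ActiveEnterTimestampMonotonic"))
    # A value of 0 is treated as "has not entered the active state since boot".
    since = (now_monotonic_usec - entered) / 1e6 if entered else None
    return [
        ("systemd_unit_active",
         {**labels, "sub_state": props.get("SubState", "")},
         1 if props.get("ActiveState") == "active" else 0),
        ("systemd_unit_restart_count_total", labels, to_int(props.get("NRestarts"))),
        ("systemd_unit_last_exit_status", labels, to_int(props.get("ExecMainStatus"))),
        ("systemd_unit_seconds_since_active", labels, since),
    ]


def collect(unit, host):
    """Query one unit on the local machine; requires systemctl to be present."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=" + ",".join(PROPERTIES)],
        capture_output=True, text=True, check=False,
    ).stdout
    # Assumes Python's monotonic clock matches the clock systemd reports
    # (CLOCK_MONOTONIC on Linux); worth verifying on kai-server.
    return unit_metrics(unit, host, parse_show_output(out), time.monotonic_ns() // 1000)
```

Keeping the parsing and mapping separate from the `systemctl` call means the metric logic can be exercised with canned text, without a running systemd. A stale-timer alert would then compare systemd_unit_seconds_since_active against the chosen N-minute threshold on the VictoriaMetrics side.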