thermal heartbeat for kai-server: dual push to Sentry + VM/Grafana #75

Open
opened 2026-05-23 20:54:38 +00:00 by coilysiren · 0 comments

Originally filed by @coilysiren on 2026-05-01T17:17:36Z - https://github.com/coilysiren/infrastructure/issues/85

Goal

Ship a thermal heartbeat from kai-server that dual-pushes:

  1. VictoriaMetrics / Grafana via node-exporter's textfile collector (no new ingress, no HTTP loopback into k3s).
  2. Sentry cron monitor for missed-beat paging + threshold-breach events with thermal payload as tags.

A "thermal heartbeat" here means: every 30s, read all available temperature sensors, write the readings to a Prometheus textfile, and ping a Sentry cron monitor. Missed beats and out-of-range temperatures both page.

Sources to read

  • sensors -j (lm-sensors): CPU package + per-core, motherboard, ambient.
  • nvme smart-log -o json /dev/nvmeXnY for each NVMe.
  • /sys/class/thermal/thermal_zone*/temp as a backstop / sanity check.
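
As a rough illustration of the sysfs backstop above, the sketch below walks the thermal zones and prints one reading per line. The kernel exposes these values as integer millidegrees Celsius, so they need dividing by 1000. The function names and the optional root-directory argument are hypothetical, added only so the function can be pointed at a fixture directory; none of this is taken from the eventual scripts/thermal-heartbeat.sh.

```shell
# Sketch only: read kernel thermal zones and print "zone type celsius".
# Function names and the root-dir argument are illustrative assumptions.

millideg_to_celsius() {
  # sysfs reports integer millidegrees Celsius, e.g. 45500 -> 45.5
  awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

read_thermal_zones() {
  # $1: directory holding thermal_zone*/ (defaults to the real sysfs path)
  local root="${1:-/sys/class/thermal}"
  local zone raw type
  for zone in "$root"/thermal_zone*; do
    # Skip zones that are missing or whose temp file cannot be read.
    [ -r "$zone/temp" ] || continue
    raw="$(cat "$zone/temp" 2>/dev/null)" || continue
    type="$(cat "$zone/type" 2>/dev/null || echo unknown)"
    printf '%s %s %s\n' "$(basename "$zone")" "$type" "$(millideg_to_celsius "$raw")"
  done
}
```

Calling `read_thermal_zones` with no argument reads the live sysfs tree. Because zone names and types vary by board, this is probably best treated as the sanity check the list describes rather than as the primary source.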

Push side: VM/Grafana

  • New systemd oneshot thermal-heartbeat.service + thermal-heartbeat.timer (30s interval, matches vmagent scrape).
  • Script writes /var/lib/node-exporter/textfile/thermal.prom atomically (write-then-rename).
  • Metrics:
    • node_thermal_celsius{source,chip,sensor} gauge
    • node_thermal_heartbeat_seconds gauge (unix ts of last successful collection)
  • Update deploy/observability/node-exporter-values.yml:
    • Remove --no-collector.hwmon from extraArgs (the hwmon collector was disabled to cut label noise; now we want it).
    • Add --collector.textfile.directory=/var/lib/node-exporter/textfile.
    • Add hostPath volume + mount for the textfile dir.
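
The write-then-rename step and the two metrics above could look roughly like the sketch below. It assumes readings arrive on stdin as whitespace-separated "source chip sensor celsius" lines, which is a made-up intermediate format, and the function names are hypothetical. The temporary file is created in the same directory as the target so the rename stays on one filesystem, and its name does not end in .prom, so the textfile collector should ignore it while it is being written.

```shell
# Sketch only: render the metrics and publish them with write-then-rename.
# The stdin line format and function names are illustrative assumptions.

render_metrics() {
  # $1: unix timestamp of this collection; stdin: "source chip sensor celsius"
  local now="$1" source chip sensor celsius
  echo '# HELP node_thermal_celsius Temperature reading in degrees Celsius.'
  echo '# TYPE node_thermal_celsius gauge'
  while read -r source chip sensor celsius; do
    printf 'node_thermal_celsius{source="%s",chip="%s",sensor="%s"} %s\n' \
      "$source" "$chip" "$sensor" "$celsius"
  done
  echo '# HELP node_thermal_heartbeat_seconds Unix time of the last successful collection.'
  echo '# TYPE node_thermal_heartbeat_seconds gauge'
  printf 'node_thermal_heartbeat_seconds %s\n' "$now"
}

write_textfile() {
  # $1: textfile directory; $2: optional timestamp (defaults to now)
  local dir="$1" now="${2:-$(date +%s)}" tmp
  tmp="$(mktemp "$dir/.thermal.prom.XXXXXX")" || return 1
  if render_metrics "$now" > "$tmp"; then
    chmod 0644 "$tmp"                 # mktemp creates 0600; the exporter must be able to read it
    mv -f "$tmp" "$dir/thermal.prom"  # rename within one directory replaces the file in one step
  else
    rm -f "$tmp"
    return 1
  fi
}
```

A call such as `collect_readings | write_textfile /var/lib/node-exporter/textfile` would publish one snapshot, where `collect_readings` stands in for whatever gathers the sensor lines.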

Push side: Sentry

  • Same script, after writing the textfile, POSTs to a Sentry cron monitor check-in endpoint.
  • DSN/URL stashed in SSM, loaded by the systemd unit via EnvironmentFile=.
  • Threshold breach (configurable per source, default: CPU pkg > 85C, NVMe > 70C) emits a Sentry event with thermal payload as tags so Sentry alert rules can fire on level:warning.
  • Sentry cron monitor configured for a 30s interval with a grace period, with the missed-check-in alert wired up.
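
A minimal sketch of the threshold check and the check-in ping described above follows. The defaults (85 for CPU package, 70 for NVMe) come from this issue. The source names `cpu_pkg` and `nvme`, the override variables `THERMAL_MAX_CPU_PKG` and `THERMAL_MAX_NVME`, and the `SENTRY_CRON_URL` variable are all hypothetical. Appending `?status=ok` or `?status=error` to the monitor URL reflects my understanding of Sentry's HTTP check-in interface and should be verified against the current Sentry Crons documentation before use.

```shell
# Sketch only: per-source thresholds plus a cron check-in helper.
# Source names, env var names, and function names are illustrative assumptions.

threshold_for() {
  # Prints the limit in degrees Celsius for a source, or fails if none is set.
  case "$1" in
    cpu_pkg) echo "${THERMAL_MAX_CPU_PKG:-85}" ;;
    nvme)    echo "${THERMAL_MAX_NVME:-70}" ;;
    *)       return 1 ;;
  esac
}

is_breach() {
  # $1: source, $2: reading in Celsius. Succeeds only if reading > limit.
  local limit
  limit="$(threshold_for "$1")" || return 1
  awk -v t="$2" -v l="$limit" 'BEGIN { exit !(t > l) }'
}

checkin_url() {
  # $1: monitor check-in URL (as loaded from EnvironmentFile=), $2: status
  printf '%s?status=%s\n' "$1" "$2"
}

send_checkin() {
  # $1: status. Assumes SENTRY_CRON_URL is set in the unit's environment.
  curl --silent --show-error --fail --max-time 10 -X POST \
    "$(checkin_url "$SENTRY_CRON_URL" "$1")" > /dev/null
}
```

In the script, something like `is_breach cpu_pkg "$temp" && emit_warning_event` would gate the threshold event, with `emit_warning_event` left as a placeholder for however the Sentry event ends up being sent.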

Grafana side

  • Add a thermal panel to the node-exporter dashboard (or a new small dashboard) plotting node_thermal_celsius by chip,sensor.
  • vmalert is currently off (per deploy/observability/README.md); leave it off for v1, since Sentry is the paging path. Revisit if VM-side alerting becomes desired.

Files touched

  • scripts/thermal-heartbeat.sh (new)
  • systemd/thermal-heartbeat.service (new)
  • systemd/thermal-heartbeat.timer (new)
  • deploy/observability/node-exporter-values.yml (textfile collector + hwmon)
  • deploy/observability/README.md (deploy notes)
  • SSM.md (new param for Sentry cron monitor URL)

Out of scope

  • vmalert wiring.
  • kai-desktop-tower thermal coverage (node-exporter runs there, but no systemd timer is deployed there yet).
  • Public-facing thermal dashboard.

coilysiren added the P4 label and removed the P3 label on 2026-05-31 07:00:44 +00:00
Reference
coilyco-flight-deck/infrastructure#75