thermal heartbeat for kai-server: dual push to Sentry + VM/Grafana #75
Originally filed by @coilysiren on 2026-05-01T17:17:36Z - https://github.com/coilysiren/infrastructure/issues/85
Goal
Ship a thermal heartbeat from kai-server that dual-pushes to Sentry and to VM/Grafana.
A "thermal heartbeat" here means: every 30s, read all available temp sensors, write them to a Prom textfile, and ping a Sentry cron monitor. Missed beats and out-of-band temps both page.
Sources to read
- sensors -j (lm-sensors): CPU package + per-core, motherboard, ambient.
- nvme smart-log -o json /dev/nvmeXnY for each NVMe.
- /sys/class/thermal/thermal_zone*/temp as a backstop / sanity check.
Push side: VM/Grafana
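A minimal sketch of the write path this section describes, not the actual scripts/thermal-heartbeat.sh. It reads only the sysfs backstop source; lm-sensors and NVMe parsing are left out. The label values (source="sysfs", chip taken from each zone's type file, sensor set to the zone directory name) and the THERMAL_ROOT and TEXTFILE_DIR variables are assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical overridable defaults; the output directory comes from the issue.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node-exporter/textfile}"

# Print one gauge sample per readable thermal zone.
# sysfs reports millidegrees Celsius, so divide by 1000.
collect_sysfs() {
  local zone raw chip
  for zone in "$THERMAL_ROOT"/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    # Some zones return an error on read; skip them instead of aborting.
    raw="$(cat "$zone/temp" 2>/dev/null)" || continue
    # Accept only an optionally negative integer.
    case "${raw#-}" in
      ''|*[!0-9]*) continue ;;
    esac
    chip="$(cat "$zone/type" 2>/dev/null || echo unknown)"
    awk -v raw="$raw" -v chip="$chip" -v sensor="${zone##*/}" 'BEGIN {
      printf "node_thermal_celsius{source=\"sysfs\",chip=\"%s\",sensor=\"%s\"} %.3f\n", chip, sensor, raw / 1000
    }'
  done
}

# Write thermal.prom via write-then-rename so a scrape never sees a partial file.
write_textfile() {
  local out tmp
  out="$TEXTFILE_DIR/thermal.prom"
  # The temp file lives in the same directory, so mv is a same-filesystem rename.
  # Its name does not end in .prom, so the textfile collector ignores it.
  tmp="$(mktemp "$TEXTFILE_DIR/.thermal.prom.XXXXXX")"
  {
    printf '# TYPE node_thermal_celsius gauge\n'
    collect_sysfs
    printf '# TYPE node_thermal_heartbeat_seconds gauge\n'
    printf 'node_thermal_heartbeat_seconds %s\n' "$(date +%s)"
  } > "$tmp"
  # mktemp creates the file 0600; make it readable for node-exporter.
  chmod 0644 "$tmp"
  mv -f "$tmp" "$out"
}
```

Calling write_textfile once per timer tick would refresh both gauges. Because the heartbeat timestamp is written only after collection completes, a stalled collector shows up as a stale node_thermal_heartbeat_seconds value.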
- thermal-heartbeat.service + thermal-heartbeat.timer (30s interval, matches vmagent scrape).
- Write /var/lib/node-exporter/textfile/thermal.prom atomically (write-then-rename).
  - node_thermal_celsius{source,chip,sensor} gauge
  - node_thermal_heartbeat_seconds gauge (unix ts of last successful collection)
- deploy/observability/node-exporter-values.yml:
  - Remove --no-collector.hwmon from extraArgs (was dropped for label noise; now we want it).
  - Add --collector.textfile.directory=/var/lib/node-exporter/textfile.
Push side: Sentry
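A hedged sketch of how the Sentry side could be wired, assuming the cron monitor URL stored in SSM is exposed to the unit as an environment variable. The variable name SENTRY_CRON_URL and the helper names are hypothetical. The `?status=` query parameter with the values in_progress, ok, and error follows Sentry's HTTP check-in convention for cron monitors.

```shell
#!/bin/sh
set -eu

# Build the check-in URL for a given status.
# $1 = monitor URL (expected to end in "/"), $2 = in_progress | ok | error
checkin_url() {
  case "$2" in
    in_progress|ok|error) ;;
    *) printf 'unknown status: %s\n' "$2" >&2; return 2 ;;
  esac
  printf '%s?status=%s\n' "$1" "$2"
}

# Send one check-in. A failed request is logged but not treated as fatal
# (an assumption): a missed check-in is what the monitor alerts on anyway.
send_checkin() {
  local url
  url="$(checkin_url "${SENTRY_CRON_URL:?SENTRY_CRON_URL is not set}" "$1")" || return 2
  curl --silent --show-error --fail --max-time 10 --output /dev/null "$url" || {
    printf 'sentry check-in failed (status=%s)\n' "$1" >&2
    return 0
  }
}

# Run the given collection command, then report ok or error to Sentry.
# The collector's exit code is passed through so systemd sees the failure.
heartbeat() {
  local rc
  if "$@"; then
    send_checkin ok
  else
    rc=$?
    send_checkin error
    return "$rc"
  fi
}
```

A unit could then run something like `heartbeat /path/to/collector`, where the collector path is a placeholder. Reporting error explicitly, rather than only going silent, lets Sentry distinguish a failed collection from a host that stopped beating.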
- EnvironmentFile=.
- level:warning.
Grafana side
- node_thermal_celsius by chip, sensor.
- deploy/observability/README.md); leave it off for v1, since Sentry is the paging path. Revisit if VM-side alerting becomes desired.
Files touched
- scripts/thermal-heartbeat.sh (new)
- systemd/thermal-heartbeat.service (new)
- systemd/thermal-heartbeat.timer (new)
- deploy/observability/node-exporter-values.yml (textfile collector + hwmon)
- deploy/observability/README.md (deploy notes)
- SSM.md (new param for Sentry cron monitor URL)
Out of scope