process-memory-heartbeat: OTLP push silently dropped by vmsingle, add swap metrics #185
Problem
`process-memory-heartbeat.service` (samples per-process RSS every 30s, POSTs OTLP/protobuf to vmsingle at `localhost:30428/opentelemetry/v1/metrics`) reports success on every run ("Deactivated successfully", exit 0) but its data is not queryable in vmsingle:
- `process_memory_rss_bytes` is absent from the `__name__` catalog.
- The `process_name` label has zero values.
- `Connection refused` appears at 07:05-07:06 right after reboot (vmsingle pod not up yet), which is expected and unrelated.
So the POST returns 2xx and the service believes it is working, while VM drops or mis-names the payload. This is an opaque-success bug: the per-process memory history we would need to autopsy an OOM event does not actually exist. Discovered when the 2026-05-30 crash investigation tried to pull the `cc1plus` RSS spike and found nothing.
Likely cause
The script hand-encodes a minimal OTLP protobuf (`scripts/process-memory-heartbeat.py`, see the `build_protobuf` / `_metric` helpers). Candidates:
- The metric is ingested under a name other than `process_memory_rss_bytes` (unit `By` handling, scope, or dots->underscores edge case). Confirm the actual ingested name via the `__name__` catalog after a known POST.
Ask
- Add swap metrics. The script already parses `/proc/meminfo` but only emits total/available/free/buffers/cached, never `Swap{Total,Free,Cached}`. Swap exhaustion was the trigger of the 2026-05-30 livelock. (Note: node_exporter already covers system swap, so this is lower priority than the per-proc fix, but cheap given meminfo is already parsed.)
Context
- Unit: `/etc/systemd/system/process-memory-heartbeat.service`, timer `OnUnitActiveSec=30s`.
- Script: `infrastructure/scripts/process-memory-heartbeat.py`.
- Series are aggregated by `(process_name, user)`, top 20, so a swarm of identical procs (e.g. 16x `cc1plus`) correctly collapses to one series, which is exactly what we want for OOM autopsy once ingestion works.
Found during the 2026-05-30 crash investigation.
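For the name check suggested under Likely cause, the following is a minimal sketch, not the project's tooling. It assumes vmsingle serves its Prometheus-compatible query API on the same port the issue gives for OTLP ingestion (`localhost:30428`); the helper names `names_from_response`, `candidates`, `fetch_names`, and `main` are hypothetical. It lists every ingested metric name and flags ones that look like a mangled variant of `process_memory_rss_bytes`.

```python
import json
import urllib.request

# Assumption: the query API lives on the same host/port as OTLP ingestion.
VM_BASE = "http://localhost:30428"
EXPECTED = "process_memory_rss_bytes"


def names_from_response(body):
    """Extract metric names from a /api/v1/label/__name__/values response body."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise RuntimeError(f"label query failed: {doc!r}")
    return list(doc.get("data", []))


def candidates(names, needle="process_memory"):
    """Return ingested names that contain the needle, treating '.' and '_' alike."""
    key = needle.replace(".", "_").lower()
    return sorted(n for n in names if key in n.replace(".", "_").lower())


def fetch_names(base=VM_BASE):
    """Query the __name__ catalog over HTTP (requires a reachable vmsingle)."""
    url = f"{base}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return names_from_response(resp.read())


def main():
    names = fetch_names()
    print("exact match present:", EXPECTED in names)
    print("similar names:", candidates(names))
```

Calling `main()` right after a known POST shows whether the exact name exists and, if not, which similar names were ingested instead. An empty "similar names" list would point toward the payload being dropped rather than renamed.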
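For the swap ask, a rough sketch of pulling `SwapTotal`, `SwapFree`, and `SwapCached` out of `/proc/meminfo` text follows. The script's real parsing code is not shown in this issue, so the function names and the metric names (`system_memory_swap_*_bytes`) are assumptions chosen for illustration only.

```python
# Hypothetical helpers; the names below are not taken from the real script.
SWAP_FIELDS = ("SwapTotal", "SwapFree", "SwapCached")


def parse_swap_bytes(meminfo_text):
    """Return the swap fields found in /proc/meminfo content, in bytes."""
    out = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or key not in SWAP_FIELDS:
            continue
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        # These fields carry a "kB" suffix; the kernel's unit is 1024 bytes.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        out[key] = value
    return out


def swap_metrics(meminfo_text):
    """Map swap fields to illustrative metric names (names are assumptions)."""
    swap = parse_swap_bytes(meminfo_text)
    total = swap.get("SwapTotal", 0)
    free = swap.get("SwapFree", 0)
    return {
        "system_memory_swap_total_bytes": total,
        "system_memory_swap_free_bytes": free,
        "system_memory_swap_cached_bytes": swap.get("SwapCached", 0),
        # Derived value: swap in use, the signal relevant to exhaustion.
        "system_memory_swap_used_bytes": total - free,
    }
```

Passing the result of `open("/proc/meminfo").read()` to `swap_metrics` yields a dict that could be fed into whatever encoder the script already uses for the other meminfo values. Emitting a derived "used" value is a design choice of this sketch, not something the issue requests.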