process-memory-heartbeat: OTLP push silently dropped by vmsingle, add swap metrics #185

Open

opened 2026-05-30 17:53:58 +00:00 by coilysiren · 0 comments

coilysiren commented

2026-05-30 17:53:58 +00:00

Owner

Problem

process-memory-heartbeat.service (samples per-process RSS every 30s, POSTs OTLP/protobuf to vmsingle at localhost:30428/opentelemetry/v1/metrics) reports success on every run ("Deactivated successfully", exit 0) but its data is not queryable in vmsingle:

- process_memory_rss_bytes is absent from the __name__ catalog.
- The process_name label has zero values.
- The only logged failures are Connection refused at 07:05-07:06 right after reboot (vmsingle pod not up yet), which is expected and unrelated.

So the POST returns 2xx and the service believes it is working, while VM drops or mis-names the payload. This is an opaque-success bug: the per-process memory history we would want to autopsy an OOM event does not actually exist. Discovered when the 2026-05-30 crash investigation tried to pull the cc1plus RSS spike and found nothing.

Likely cause

The script hand-encodes a minimal OTLP protobuf (scripts/process-memory-heartbeat.py, see the build_protobuf / _metric helpers). Candidates:

- VM's OTLP ingestion names the metric differently than process_memory_rss_bytes (unit By handling, scope, or dots->underscores edge case). Confirm the actual ingested name via the __name__ catalog after a known POST.
- A malformed field in the hand-rolled protobuf that VM accepts (200) but silently drops.
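
To tell the two candidates apart, the catalog can be scanned for near-miss names after a known POST. A minimal sketch, assuming vmsingle serves the Prometheus-compatible label values API on the same localhost:30428 address used for ingestion; the helper names and the default search substrings are hypothetical, not part of the existing script:

```python
import json
import urllib.request

# Assumption: the query API shares the address this issue gives for ingestion.
VM_BASE = "http://localhost:30428"


def candidate_names(names, needles=("process_memory", "rss")):
    """Return catalog entries containing any needle, case-insensitively."""
    lowered = [needle.lower() for needle in needles]
    return sorted(n for n in names if any(x in n.lower() for x in lowered))


def fetch_metric_names(base=VM_BASE, timeout=5.0):
    """Fetch every metric name via the Prometheus-style label values API."""
    url = f"{base}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # The response body is JSON with the names under the "data" key.
        return json.load(resp).get("data", [])
```

Running candidate_names(fetch_metric_names()) right after a heartbeat run would show whether the payload landed under another name. An empty result would point toward the malformed-protobuf candidate instead.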

Ask

1. Make the heartbeat verify ingestion, not just POST status - e.g. exit non-zero (or log loudly) if a follow-up query for its own metric returns empty, so silent drops surface.
2. Fix the metric so per-process-by-name RSS is actually queryable in vmsingle.
3. Add swap to the system sample. The script reads /proc/meminfo but only emits total/available/free/buffers/cached - never Swap{Total,Free,Cached}. Swap exhaustion was the trigger of the 2026-05-30 livelock. (Note: node_exporter already covers system swap, so this is lower priority than the per-proc fix, but cheap given meminfo is already parsed.)
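
The ingestion check and the swap fields could look roughly like the sketch below. It is not taken from the existing script: parse_swap_bytes, series_count and ingestion_ok are hypothetical helper names, and the query endpoint is assumed to be the Prometheus-compatible /api/v1/query on the same localhost:30428 address.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the query API shares the address this issue gives for ingestion.
VM_BASE = "http://localhost:30428"
SWAP_KEYS = ("SwapTotal", "SwapFree", "SwapCached")


def parse_swap_bytes(meminfo_text):
    """Extract Swap{Total,Free,Cached} from /proc/meminfo text, in bytes."""
    out = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in SWAP_KEYS:
            # /proc/meminfo reports these fields in kB (1024-byte units).
            out[key] = int(rest.split()[0]) * 1024
    return out


def series_count(query_response):
    """Count series in a Prometheus-style instant query response."""
    if query_response.get("status") != "success":
        return 0
    return len(query_response.get("data", {}).get("result", []))


def ingestion_ok(metric, base=VM_BASE, timeout=5.0):
    """Return True if an instant query for `metric` yields any series."""
    url = f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return series_count(json.load(resp)) > 0
```

The heartbeat could call ingestion_ok("process_memory_rss_bytes") after its POST and exit non-zero on False, so systemd records the failure. Freshly pushed samples may take a moment to become queryable, so checking for the previous run's data is likely more robust than checking immediately after the current POST.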

Context

- Service/unit: /etc/systemd/system/process-memory-heartbeat.service, timer OnUnitActiveSec=30s.
- Script: infrastructure/scripts/process-memory-heartbeat.py.
- Aggregates RSS by (process_name, user), top 20 - so a swarm of identical procs (e.g. 16x cc1plus) correctly collapses to one series, which is exactly what we want for OOM autopsy once ingestion works.
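
For reference, the aggregation described in the last item can be sketched as follows. This illustrates the grouping behaviour only and is not the script's actual code; top_rss and its tuple input shape are hypothetical.

```python
from collections import defaultdict


def top_rss(samples, limit=20):
    """Sum RSS per (process_name, user) and keep the largest `limit` groups.

    `samples` is an iterable of (process_name, user, rss_bytes) tuples.
    """
    totals = defaultdict(int)
    for name, user, rss_bytes in samples:
        totals[(name, user)] += rss_bytes
    # Sort by descending RSS, then by key so ties come out deterministically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Sixteen cc1plus processes owned by one user yield a single (cc1plus, user) entry carrying their combined RSS, which is the series an OOM autopsy would chart.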

Found during the 2026-05-30 crash investigation.

coilysiren added the

label

2026-06-04 08:16:53 +00:00

coilysiren referenced this issue

2026-07-04 10:38:26 +00:00

Apply priority triage to 78 un-triaged flight-deck+bridge tickets (labels out of director tier) #470

coilysiren referenced this issue

2026-07-04 10:38:33 +00:00

Apply priority triage to 78 un-triaged flight-deck+bridge tickets (labels out of director tier) #471