Cap coily-update.service memory and disable brew build-from-source (3am cc1plus OOM livelock) #186

Open
opened 2026-05-30 17:54:05 +00:00 by coilysiren · 0 comments

Problem

coily-update.service runs brew update && brew upgrade (all formulae) daily at 03:00 on kai-server. When linuxbrew lacks a prebuilt bottle for a formula, it builds from source, spawning a parallel cc1plus compiler swarm. On 2026-05-30 that produced ~16 concurrent cc1plus processes at ~1-1.5GB RSS each (~16-20GB), stacked on the resident workload (EcoServer ~4GB + k3s/SigNoz/ClickHouse/gitea/grafana ~8-10GB). Total committed memory blew past 31GB usable RAM + 2GB swap, free swap hit 0, and the host fell into a multi-hour reclaim-thrash livelock that only a power cycle cleared. The 2026-05-29 incident had the same fingerprint.

The OOM killer never recovered the box because it only ever killed the small k8s pods with a high oom_score_adj (java, repo-recall, coredns, signoz-otel-col). The brew build, which runs uncapped at the default oom_score_adj of 0, was never targeted.

Ask (stop-the-bleeding, host-level fix)

  1. Cap the unit with a systemd cgroup limit. Add to coily-update.service:
    • MemoryMax=6G (a from-source build then gets OOM-killed inside its own cgroup; the host survives untouched).
    • OOMScoreAdjust=1000 (make the build the preferred OOM victim, not the k8s pods).
  2. Stop unattended source builds. Set in the unit env:
    • HOMEBREW_NO_BUILD_FROM_SOURCE=1 (skip formulae lacking a bottle rather than compiling), or at minimum
    • HOMEBREW_MAKE_JOBS=2 (cap compiler fan-out so a source build cannot saturate all cores/RAM).

Either (1) or (2) alone prevents the host kill. Doing both is belt-and-suspenders.
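
The two changes above could be applied together as a systemd drop-in rather than by editing the unit file directly. This is a minimal sketch, not a tested configuration. The drop-in file name `override.conf` is an assumption (it is the name `systemctl edit coily-update.service` uses by default), and the Homebrew variable names are copied from this issue as written, so they should be checked against the installed Homebrew version before relying on them.

```ini
# /etc/systemd/system/coily-update.service.d/override.conf
# (assumed path; created by `systemctl edit coily-update.service`)
[Service]
# (1) Hard memory cap for the unit's cgroup: a runaway build is
#     OOM-killed inside this cgroup instead of exhausting the host.
MemoryMax=6G
# (1) Make this unit's processes the preferred victim of the global OOM killer.
OOMScoreAdjust=1000
# (2) Skip formulae that have no bottle instead of compiling them.
Environment=HOMEBREW_NO_BUILD_FROM_SOURCE=1
# (2, fallback) If source builds must stay, cap compiler parallelism instead:
# Environment=HOMEBREW_MAKE_JOBS=2
```

After writing the drop-in, `systemctl daemon-reload` picks it up, and `systemctl show coily-update.service -p MemoryMax -p OOMScoreAdjust` should echo the new values. Comments sit on their own lines because systemd unit files do not support trailing inline comments.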

Related

  • Swap is being bumped 2GB -> 32GB separately (converts a livelock into a recoverable OOM, defense in depth, not the primary fix).
  • Leading-edge alert: separate issue.
  • The 02:45-03:06 timer cluster (coilysiren-pull-all, coily-update, claude-remote-control-restart) piles up at 3am; consider staggering if source builds are kept.
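
If source builds are kept, one way to pull coily-update out of the 3am cluster is a timer drop-in. This is a sketch only: the 04:30 start time and the 15-minute jitter are placeholder values chosen for illustration, not taken from the host, and the drop-in path is assumed.

```ini
# /etc/systemd/system/coily-update.timer.d/override.conf (assumed path)
[Timer]
# An empty assignment clears the OnCalendar= inherited from coily-update.timer.
OnCalendar=
# Hypothetical new slot, outside the 02:45-03:06 cluster.
OnCalendar=*-*-* 04:30:00
# Add up to 15 minutes of random delay so runs do not all start at the same instant.
RandomizedDelaySec=15min
```

`systemd-analyze calendar '*-*-* 04:30:00'` prints the next elapse time for an expression, which is a quick way to check a schedule before reloading.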

Context

  • Unit: /etc/systemd/system/coily-update.service -> infrastructure/scripts/coily-update.sh.
  • Timer: coily-update.timer, OnCalendar=*-*-* 03:00:00 (this fires daily, even though the script comment says "weekly").

Found during the 2026-05-30 crash investigation.


Reference
coilyco-flight-deck/infrastructure#186