monitoring probes must validate output shape, not just exit code #156
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Anchor
Sibling to the 2026-05-26 postmortem (filed alongside). The monitoring tool we built mid-session lied for ~30 min before we caught it. The failure mode is general enough that any future probe written the same way will repeat the bug. Capturing as a standalone issue so the design rule doesn't get lost.
What happened
scripts/host-watch.sh(committed2e7b0b4via #152) polls a remote host with:The function returns 0 if the probe succeeded, non-0 if not. Watch state is tracked solely on
$?. Output is discarded.Mid-session,
coilyupgraded from 2.42.0 to 2.43.0 and tightenedcoily sshto only forward args to a remotecoilysubcommand, not to free-form bash. The probe started returning exit 1 with coily's--helpdump on stdout/stderr (both redirected to /dev/null). Watch logged "dead". Sometimes the probe still returned 0 for reasons not fully understood (maybe a race during coily binary swap, maybe an intermittent path through the ssh wrapper that bypasses the lockdown). Watch logged "alive". Net effect: metronomic 4-min "outage" cycle in watch.log that didn't correspond to anything real.The real host had been stable since 04:37Z (after scale-to-0 on the runner). The watch logged ~7 false dead/alive cycles between 04:37 and ~04:50. The diag-on-recovery captures ran (we saw recovery-*.txt files appear with non-zero size) but contained only coily help text, not actual diagnostic output. From the operator's vantage, everything looked like a working monitor producing data.
Why this is dangerous
The failure shape is more general than this specific bug: any probe whose contract is "exit 0 means alive" can be broken by an upstream tool change that returns 0 for the wrong reason, or non-0 for the wrong reason, without any guarantee that the operator notices.
Design rule
Probes must validate the shape of the output, not just the exit code. The watch should check that the output contains an expected sentinel, not just that the command succeeded. Minimal version:
The sentinel makes the probe self-validating against tool-shape drift. If
coily whoamiever stops emittinglogin:, the probe loudly returns dead instead of silently flapping. Cheap, mechanical, no new dependency.Bigger rule
This isn't unique to host-watch. Any future script that walks an external CLI and decides flow based on
$?should be reviewed against the same lens. Probably worth:>/dev/null 2>&1immediately followed byif [ "$?"... or&&/||short-circuits where the command output isn't otherwise checked. Or just apylint-style nag for shell scripts.Scope of this issue
lib/probe.shhelper that wraps the sentinel-matching pattern so future scripts don't reinvent it.Out of scope
coily sshis gone. Whatever replaces it inherits the sentinel rule.How to apply
Write the doc, optionally write the helper. Apply to host-watch.sh when it gets touched again. Close once the doc lands.