monitoring probes must validate output shape, not just exit code #156

Open
opened 2026-05-27 04:59:42 +00:00 by coilysiren · 0 comments
Owner

Anchor

Sibling to the 2026-05-26 postmortem (filed alongside). The monitoring tool we built mid-session lied for ~30 min before we caught it. The failure mode is general enough that any future probe written the same way will repeat the bug. Capturing as a standalone issue so the design rule doesn't get lost.

What happened

scripts/host-watch.sh (committed 2e7b0b4 via #152) polls a remote host with:

probe() {
  coily ssh "$HOST" -- echo alive >/dev/null 2>&1
}

The function returns 0 if the probe succeeded, non-0 if not. Watch state is tracked solely on $?. Output is discarded.

Mid-session, coily upgraded from 2.42.0 to 2.43.0 and tightened coily ssh to only forward args to a remote coily subcommand, not to free-form bash. The probe started returning exit 1 with coily's --help dump on stdout/stderr (both redirected to /dev/null). Watch logged "dead". Sometimes the probe still returned 0 for reasons not fully understood (maybe a race during coily binary swap, maybe an intermittent path through the ssh wrapper that bypasses the lockdown). Watch logged "alive". Net effect: metronomic 4-min "outage" cycle in watch.log that didn't correspond to anything real.

The real host had been stable since 04:37Z (after scale-to-0 on the runner). The watch logged ~7 false dead/alive cycles between 04:37 and ~04:50. The diag-on-recovery captures ran (we saw recovery-*.txt files appear with non-zero size) but contained only coily help text, not actual diagnostic output. From the operator's vantage, everything looked like a working monitor producing data.

Why this is dangerous

  1. Silent failure during an active incident. Operator is trying to decide whether to escalate based on monitor signal; monitor says system is flapping; operator burns time chasing a phantom 4-min cron.
  2. False data persists in the log forever. Post-incident review of watch.log will treat the 4-min cycles as a real signal unless the reader knows about the coily upgrade.
  3. The recovery snapshots got overwritten on each cycle, displacing the real ones we wanted.

The failure shape is more general than this specific bug: any probe whose contract is "exit 0 means alive" can be broken by an upstream tool change that returns 0 for the wrong reason, or non-0 for the wrong reason, without any guarantee that the operator notices.

Design rule

Probes must validate the shape of the output, not just the exit code. The watch should check that the output contains an expected sentinel, not just that the command succeeded. Minimal version:

probe() {
  local out
  out="$(coily ssh "$HOST" -- coily whoami 2>&1)" || return 1
  [[ "$out" == *"login:"* ]] || return 1   # any per-tool sentinel works
  return 0
}

The sentinel makes the probe self-validating against tool-shape drift. If coily whoami ever stops emitting login:, the probe loudly returns dead instead of silently flapping. Cheap, mechanical, no new dependency.

Bigger rule

This isn't unique to host-watch. Any future script that walks an external CLI and decides flow based on $? should be reviewed against the same lens. Probably worth:

  • A short doc in docs/ on probe semantics (sentinel matching, why exit code alone isn't enough).
  • A pre-commit hook that flags >/dev/null 2>&1 immediately followed by if [ "$?"... or && / || short-circuits where the command output isn't otherwise checked. Or just a pylint-style nag for shell scripts.

Scope of this issue

  • Land the probe-output-validation rule in docs/ (small).
  • Apply the sentinel pattern to host-watch.sh when it gets rewritten as part of the coily-ssh removal work.
  • Optional: ship a tiny lib/probe.sh helper that wraps the sentinel-matching pattern so future scripts don't reinvent it.

Out of scope

  • The host-watch rewrite itself. Probably going to be deleted or rebuilt on a different transport entirely once coily ssh is gone. Whatever replaces it inherits the sentinel rule.
  • General observability for kai-server. Different problem.

How to apply

Write the doc, optionally write the helper. Apply to host-watch.sh when it gets touched again. Close once the doc lands.

**Anchor** Sibling to the 2026-05-26 postmortem (filed alongside). The monitoring tool we built mid-session lied for ~30 min before we caught it. The failure mode is general enough that any future probe written the same way will repeat the bug. Capturing as a standalone issue so the design rule doesn't get lost. **What happened** `scripts/host-watch.sh` (committed 2e7b0b4 via #152) polls a remote host with: ```bash probe() { coily ssh "$HOST" -- echo alive >/dev/null 2>&1 } ``` The function returns 0 if the probe succeeded, non-0 if not. Watch state is tracked solely on `$?`. **Output is discarded.** Mid-session, `coily` upgraded from 2.42.0 to 2.43.0 and tightened `coily ssh` to only forward args to a remote `coily` subcommand, not to free-form bash. The probe started returning exit 1 with coily's `--help` dump on stdout/stderr (both redirected to /dev/null). Watch logged "dead". Sometimes the probe still returned 0 for reasons not fully understood (maybe a race during coily binary swap, maybe an intermittent path through the ssh wrapper that bypasses the lockdown). Watch logged "alive". Net effect: metronomic 4-min "outage" cycle in watch.log that didn't correspond to anything real. The real host had been stable since 04:37Z (after scale-to-0 on the runner). The watch logged ~7 false dead/alive cycles between 04:37 and ~04:50. The diag-on-recovery captures ran (we saw recovery-*.txt files appear with non-zero size) but contained only coily help text, not actual diagnostic output. **From the operator's vantage, everything looked like a working monitor producing data.** **Why this is dangerous** 1. Silent failure during an active incident. Operator is trying to decide whether to escalate based on monitor signal; monitor says system is flapping; operator burns time chasing a phantom 4-min cron. 2. False data persists in the log forever. Post-incident review of watch.log will treat the 4-min cycles as a real signal unless the reader knows about the coily upgrade. 3. The recovery snapshots got overwritten on each cycle, displacing the real ones we wanted. The failure shape is more general than this specific bug: **any probe whose contract is "exit 0 means alive" can be broken by an upstream tool change that returns 0 for the wrong reason, or non-0 for the wrong reason, without any guarantee that the operator notices.** **Design rule** Probes must validate the shape of the output, not just the exit code. The watch should check that the output contains an expected sentinel, not just that the command succeeded. Minimal version: ```bash probe() { local out out="$(coily ssh "$HOST" -- coily whoami 2>&1)" || return 1 [[ "$out" == *"login:"* ]] || return 1 # any per-tool sentinel works return 0 } ``` The sentinel makes the probe self-validating against tool-shape drift. If `coily whoami` ever stops emitting `login:`, the probe loudly returns dead instead of silently flapping. Cheap, mechanical, no new dependency. **Bigger rule** This isn't unique to host-watch. Any future script that walks an external CLI and decides flow based on `$?` should be reviewed against the same lens. Probably worth: - A short doc in docs/ on probe semantics (sentinel matching, why exit code alone isn't enough). - A pre-commit hook that flags `>/dev/null 2>&1` immediately followed by `if [ "$?"`... or `&&` / `||` short-circuits where the command output isn't otherwise checked. Or just a `pylint`-style nag for shell scripts. **Scope of this issue** - Land the probe-output-validation rule in docs/ (small). - Apply the sentinel pattern to host-watch.sh when it gets rewritten as part of the coily-ssh removal work. - Optional: ship a tiny `lib/probe.sh` helper that wraps the sentinel-matching pattern so future scripts don't reinvent it. **Out of scope** - The host-watch rewrite itself. Probably going to be deleted or rebuilt on a different transport entirely once `coily ssh` is gone. Whatever replaces it inherits the sentinel rule. - General observability for kai-server. Different problem. **How to apply** Write the doc, optionally write the helper. Apply to host-watch.sh when it gets touched again. Close once the doc lands.
coilysiren added
P3
and removed
P2
labels 2026-05-31 07:00:36 +00:00
Sign in to join this conversation.
No labels
P0
P1
P2
P3
P4
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
coilyco-flight-deck/infrastructure#156
No description provided.