scripts/host-watch.sh: generic tailnet-host SSH watchdog with diag capture on recovery #152

Closed
opened 2026-05-27 03:18:11 +00:00 by coilysiren · 0 comments
Owner

Problem

scripts/host-diag.sh exists ad-hoc in this session's /tmp/ but isn't installed in the repo, isn't reachable via coily exec, and is hard-coded to one operator's Mac. The watch loop that fires it on dead->alive recovery has the same problem. Promote both into the repo as a single generic verb.

Scope

  • scripts/host-diag.sh - runs on the remote host, fed over stdin via ssh -- bash -s <. Captures listening sockets, sshd state, journals (ssh / k3s / tailscaled), dmesg, conntrack stats, iptables filter + nat, nft ruleset, ufw, fail2ban, interfaces, top RSS, auth.log.
  • scripts/host-watch.sh - polls coily ssh <alias> -- echo alive every POLL_INTERVAL seconds (default 15). On a dead->alive transition, streams host-diag.sh to the remote host and writes the output locally as recovery-<ts>.txt. The state log and snapshots live under OUT_DIR (default /tmp/host-watch-<alias>).
  • Makefile target host-watch with host=<alias> arg.
  • .coily/coily.yaml verb host-watch delegating to the Make target.
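
A minimal sketch of the watch loop described above, to make the dead->alive edge detection concrete. It is not the actual scripts/host-watch.sh. The function names, the state.log filename, the timestamp format, and the exact way host-diag.sh is streamed (coily ssh <alias> -- bash -s < scripts/host-diag.sh) are assumptions for illustration; only the probe command, POLL_INTERVAL, OUT_DIR, and the recovery-<ts>.txt naming come from this issue.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, state.log, and the timestamp format are
# hypothetical, not taken from the real script.

POLL_INTERVAL="${POLL_INTERVAL:-15}"

# Print "alive" if the host answers over coily ssh, otherwise "dead".
probe() {
  if coily ssh "$1" -- echo alive >/dev/null 2>&1; then
    echo alive
  else
    echo dead
  fi
}

# Succeed only on a dead->alive edge. The initial "unknown" state does not
# count, so starting the watcher against a healthy host captures nothing.
is_recovery() {
  [ "$1" = dead ] && [ "$2" = alive ]
}

watch_host() {
  host="$1"
  out_dir="${OUT_DIR:-/tmp/host-watch-$host}"
  prev=unknown
  mkdir -p "$out_dir"
  while true; do
    cur="$(probe "$host")"
    ts="$(date -u +%Y%m%dT%H%M%SZ)"
    echo "$ts $cur" >> "$out_dir/state.log"
    if is_recovery "$prev" "$cur"; then
      # Assumed invocation: feed the diag script to a remote bash over stdin
      # and keep its output locally.
      coily ssh "$host" -- bash -s < scripts/host-diag.sh \
        > "$out_dir/recovery-$ts.txt" 2>&1
    fi
    prev="$cur"
    sleep "$POLL_INTERVAL"
  done
}

# Start polling only when a host alias is given as the first argument.
if [ "$#" -gt 0 ]; then
  watch_host "$1"
fi
```

Tracking the previous state, rather than capturing on every successful probe, is what limits snapshots to one per recovery.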

Use

coily exec host-watch host=kai-server

Runs interactively until killed (Ctrl-C or pkill -f host-watch.sh). Originally written to catch the kai-server host-namespace outage during coilysiren/infrastructure#151. Generic enough to point at any tailnet host where coily ssh works.

Out of scope

  • Daemonizing as a systemd unit. Operator-driven loop, not a service.
  • Capturing during the outage window (currently captures post-recovery only). The next step would be a privileged DaemonSet pod that can capture host iptables/conntrack mid-outage, but the recovery snapshot already names the cause in the kernel ring buffer.
Reference
coilyco-flight-deck/infrastructure#152