kai-server runs a stale GitHub-only infra checkout - k3s-start.sh diverged from canonical Forgejo clone #172

Open
opened 2026-05-28 12:44:51 +00:00 by coilysiren · 1 comment
Owner

Summary

kai-server has two separate local checkouts of the infrastructure repo, and the one the live k3s.service systemd unit actually runs is a stale GitHub-only clone that is behind the canonical Forgejo clone on scripts/k3s-start.sh. The two clones also share a GitHub push remote despite divergent histories, which makes pushing hazardous.

The two checkouts

  • Deployed (GitHub-only): /home/kai/projects/infrastructure
    • origin = git@github.com:coilysiren/infrastructure.git (fetch + push, GitHub only)
    • This is the path the systemd unit runs: ExecStart=bash -c /home/kai/projects/infrastructure/scripts/k3s-start.sh
    • Its k3s-start.sh is the short wrapper (bare k3s server, no tailnet flags).
  • Canonical (Forgejo-primary): /home/kai/projects/coilysiren/infrastructure
    • origin fetch = forgejo.coilysiren.me/coilysiren/infrastructure, push fans out to both Forgejo and GitHub.
    • Its k3s-start.sh is the full wrapper (tailnet node-ip resolution + --write-kubeconfig-mode=0644 + --flannel-iface=tailscale0).
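
A byte-level comparison is one way to confirm that the two wrappers really differ. The following is a minimal sketch, not something from the repo: `wrapper_status` is a hypothetical helper name, and the default relative path is taken from this issue.

```shell
#!/usr/bin/env bash
# Sketch: report whether the same relative file is byte-identical in two checkouts.
# wrapper_status is a hypothetical helper name (an assumption, not part of the repo).
wrapper_status() {
  local deployed="$1" canonical="$2" rel="${3:-scripts/k3s-start.sh}"
  # Both files must exist before a comparison means anything.
  if [ ! -f "$deployed/$rel" ] || [ ! -f "$canonical/$rel" ]; then
    echo "missing"
    return 2
  fi
  # cmp -s is silent and exits 0 only when the files are byte-identical.
  if cmp -s "$deployed/$rel" "$canonical/$rel"; then
    echo "identical"
  else
    echo "drifted"
    return 1
  fi
}

# Usage on kai-server (paths from this issue):
# wrapper_status /home/kai/projects/infrastructure /home/kai/projects/coilysiren/infrastructure
```

While the deployed clone is behind, the call in the usage comment would be expected to print `drifted`.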

Drift on scripts/k3s-start.sh

The deployed (GitHub) wrapper is 2 commits behind the canonical (Forgejo) one and is missing:

  • f3461dd fix(k3s): pin --write-kubeconfig-mode=0644 in start script
  • a9f7f62 fix(k3s): bind flannel VXLAN to the tailnet plane for cross-node networking
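
The list of missing commits can be reproduced without moving any branch in the deployed clone. This is a sketch under assumptions: `missing_commits` is a hypothetical helper, and it relies on a local-path fetch so that the canonical clone's objects become visible from the deployed clone. The fetch writes `FETCH_HEAD` and adds objects, but leaves the working tree and branches alone.

```shell
#!/usr/bin/env bash
# Sketch: list commits that touch a file in the canonical clone but are
# absent from the deployed clone's HEAD.
# missing_commits is a hypothetical helper name (an assumption).
missing_commits() {
  local deployed="$1" canonical="$2" file="${3:-scripts/k3s-start.sh}"
  # Fetch the canonical clone's HEAD by local path; the result lands in FETCH_HEAD.
  git -C "$deployed" fetch --quiet "$canonical" HEAD || return 1
  # Commits reachable from FETCH_HEAD but not from the deployed HEAD,
  # limited to those that changed the given file.
  git -C "$deployed" log --oneline HEAD..FETCH_HEAD -- "$file"
}

# Usage on kai-server (paths from this issue):
# missing_commits /home/kai/projects/infrastructure /home/kai/projects/coilysiren/infrastructure
```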

So the live deployment runs a wrapper without the --node-ip, --write-kubeconfig-mode=0644, and --flannel-iface=tailscale0 flags. The cluster is currently Ready cross-node (k3s appears to have persisted the networking config from a prior init), but this is a latent footgun: a k3s data wipe followed by a restart would bring the node up without the tailnet flannel binding.

Push hazard

Both clones push to the same GitHub remote (git@github.com:coilysiren/infrastructure.git) but have divergent k3s-start.sh histories. The canonical clone's origin fans out to GitHub, so a routine push from it would collide with whatever the deployed clone has pushed. Reconciliation has to happen before any push.

Current state (notify fix already applied to both)

The unrelated k3s systemd-notify bug was just fixed by adding exec to both wrappers (see coilysiren/agentic-os-kai#480). Both commits are local, unpushed:

  • deployed clone: dd03a53 (exec on short wrapper)
  • canonical clone: 43468a0 (exec on full wrapper)

Proposed resolution (needs a human call)

  1. Decide the canonical wrapper content (presumably the Forgejo full version + the exec fix).
  2. Make the deployment run that canonical wrapper - either repoint the systemd unit's ExecStart at the canonical checkout, or sync the deployed checkout to canonical. Avoid maintaining two clones on the same host long-term.
  3. Reconcile the GitHub/Forgejo k3s-start.sh history divergence, then push once.
  4. Restart k3s to pick up the canonical flags (control-plane blip) so the deployed runtime config matches the wrapper, removing the data-wipe footgun.

Files / paths

  • Unit: /etc/systemd/system/k3s.service (ExecStart path: /home/kai/projects/infrastructure/scripts/k3s-start.sh)
  • Deployed wrapper: /home/kai/projects/infrastructure/scripts/k3s-start.sh
  • Canonical wrapper: /home/kai/projects/coilysiren/infrastructure/scripts/k3s-start.sh
Author
Owner

Resolved on kai-server (claude-linux-kai-server session)

Reconciled the wrapper drift and synced both checkouts. This turned out simpler than feared: the divergence was a single superseded commit.

What the divergence actually was

The canonical (Forgejo) clone already contained every commit the deployed (GitHub) clone had, except dd03a53 (exec on the short wrapper). That commit is fully superseded by canonical's 43468a0 (exec on the full wrapper). GitHub's main (3441f82) is a clean fast-forward ancestor of 43468a0, so there was never a real merge conflict - just an unpushed, superseded commit.
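
The fast-forward claim is something git can check directly. A minimal sketch, assuming both commits are present in the repo being queried; `is_fast_forward` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) only if OLD is an ancestor of NEW, i.e. moving a
# branch from OLD to NEW would be a fast-forward.
# is_fast_forward is a hypothetical helper name (an assumption).
is_fast_forward() {
  local repo="$1" old="$2" new="$3"
  git -C "$repo" merge-base --is-ancestor "$old" "$new"
}

# Usage with the hashes from this comment:
# is_fast_forward /home/kai/projects/coilysiren/infrastructure 3441f82 43468a0 \
#   && echo "clean fast-forward"
```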

Done

  1. Canonical wrapper content = the full version (43468a0): tailnet --node-ip resolution + --write-kubeconfig-mode=0644 + --flannel-iface=tailscale0 + the exec notify fix. No content decision needed: the full wrapper is a strict superset of the short one.
  2. Pushed canonical to Forgejo (primary): 8eb5b40..43468a0. Source of truth is reconciled.
  3. Synced the deployed clone (/home/kai/projects/infrastructure) to 43468a0 via a local-path fetch + git reset --hard (dropping the superseded dd03a53). The deployed scripts/k3s-start.sh that systemd runs is now byte-identical to canonical (full tailnet wrapper). This removes the data-wipe footgun: any future k3s restart now reads the correct --node-ip / --flannel-iface=tailscale0 flags.
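
The exact commands used for step 3 are not recorded here, so the following is a reconstruction under assumptions rather than a transcript. `sync_to_canonical` is a hypothetical helper name, and the dirty-tree guard is an addition of this sketch, included because `git reset --hard` discards uncommitted changes.

```shell
#!/usr/bin/env bash
# Sketch: make a deployed clone match a canonical clone via a local-path
# fetch followed by a hard reset. Destructive: local commits that are not in
# the canonical clone are dropped from the current branch.
# sync_to_canonical is a hypothetical helper name (an assumption).
sync_to_canonical() {
  local deployed="$1" canonical="$2"
  # Guard added by this sketch: refuse to run when there are uncommitted changes.
  if [ -n "$(git -C "$deployed" status --porcelain)" ]; then
    echo "refusing: $deployed has uncommitted changes" >&2
    return 1
  fi
  # Fetch the canonical clone's HEAD by filesystem path; the result is FETCH_HEAD.
  git -C "$deployed" fetch --quiet "$canonical" HEAD || return 1
  # Move the current branch and working tree to exactly that commit.
  git -C "$deployed" reset --quiet --hard FETCH_HEAD
}

# Usage on kai-server (paths from this issue):
# sync_to_canonical /home/kai/projects/infrastructure /home/kai/projects/coilysiren/infrastructure
```

Afterwards, `git rev-parse HEAD` in both clones should print the same hash, and `cmp` on the two `scripts/k3s-start.sh` files should report no difference.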

Note: I did not repoint the systemd unit, because /home/kai/projects/infrastructure is referenced by 6 systemd units (k3s + core-keeper / eco-server / factorio-backup / factorio-server / icarus-server), not just k3s. Syncing the clone in place fixes the drift without disturbing the game-server units. Consolidating to a single clone is a larger, separate effort.
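
The unit count can be rechecked with a fixed-string search over the unit directory. This is a sketch, not the command that was actually run: `units_referencing` is a hypothetical helper name, and the default directory is the one named in the issue's "Files / paths" section.

```shell
#!/usr/bin/env bash
# Sketch: list files under a unit directory that mention a given path.
# units_referencing is a hypothetical helper name (an assumption).
units_referencing() {
  local needle="$1" unit_dir="${2:-/etc/systemd/system}"
  # -r recurse, -l print file names only, -s silence unreadable-file errors,
  # -F treat the needle as a fixed string rather than a regex.
  grep -rlsF -- "$needle" "$unit_dir" | sort
}

# Usage on kai-server (path from this issue):
# units_referencing /home/kai/projects/infrastructure
```

The deployed path is not a substring of the canonical path (`/home/kai/projects/coilysiren/infrastructure`), so units pointing at the canonical checkout would not be counted by mistake.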

Remaining / deferred

  • k3s NOT restarted (Kai's call - deferred). The cluster is currently Ready. Because the wrapper file is now correct, the footgun is already gone; the restart only matters to make the currently running process match the flags. Run sudo systemctl restart k3s (brief control-plane blip) at a convenient time.
  • GitHub mirror is behind at 3441f82 (clean FF target 43468a0, no divergence). The fan-out push to GitHub failed: ssh: connect to host github.com port 22: Connection timed out. kai-server's outbound SSH:22 to GitHub is blocked (HTTPS reaches github.com fine, 200). Since Forgejo is primary and both clones now agree on 43468a0 with 3441f82 as a FF ancestor, this is no longer a hazard - just a lagging mirror. Filing the SSH:22 egress block as a follow-up.
coilysiren added P2 and removed P1 labels 2026-05-31 07:00:33 +00:00
Reference
coilyco-flight-deck/infrastructure#172