kai-server runs a stale GitHub-only infra checkout - k3s-start.sh diverged from canonical Forgejo clone #172
Summary
kai-server has two separate local checkouts of the infrastructure repo, and the one the live k3s.service systemd unit actually runs is a stale GitHub-only clone that is behind the canonical Forgejo clone on scripts/k3s-start.sh. The two clones also share a GitHub push remote despite divergent histories, which makes pushing hazardous.

The two checkouts
- /home/kai/projects/infrastructure (deployed)
  - origin = git@github.com:coilysiren/infrastructure.git (fetch + push, GitHub only)
  - k3s.service runs it: ExecStart=bash -c /home/kai/projects/infrastructure/scripts/k3s-start.sh
  - k3s-start.sh is the short wrapper (bare k3s server, no tailnet flags).
- /home/kai/projects/coilysiren/infrastructure (canonical)
  - origin fetch = forgejo.coilysiren.me/coilysiren/infrastructure, push fans out to both Forgejo and GitHub.
  - k3s-start.sh is the full wrapper (tailnet node-ip resolution + --write-kubeconfig-mode=0644 + --flannel-iface=tailscale0).

Drift on scripts/k3s-start.sh
The deployed (GitHub) wrapper is 2 commits behind the canonical (Forgejo) one and is missing:
- f3461dd fix(k3s): pin --write-kubeconfig-mode=0644 in start script
- a9f7f62 fix(k3s): bind flannel VXLAN to the tailnet plane for cross-node networking

So the live deployment runs a wrapper without the --node-ip / --flannel-iface=tailscale0 flags. The cluster is currently Ready cross-node (k3s appears to have persisted the networking config from a prior init), but this is a latent footgun: a k3s data wipe + restart would bring the node up without the tailnet flannel binding.

Push hazard
Both clones push to the same GitHub remote (git@github.com:coilysiren/infrastructure.git) but have divergent k3s-start.sh histories. The canonical clone's origin fans out to GitHub, so a routine push from it would collide with whatever the deployed clone has pushed. Reconciliation has to happen before any push.

Current state (notify fix already applied to both)
The unrelated k3s systemd-notify bug was just fixed by adding exec to both wrappers (see coilysiren/agentic-os-kai#480). Both commits are local, unpushed:
- dd03a53 (exec on short wrapper)
- 43468a0 (exec on full wrapper)

Proposed resolution (needs a human call)
- Repoint ExecStart at the canonical checkout, or sync the deployed checkout to canonical. Avoid maintaining two clones on the same host long-term.
- Reconcile the k3s-start.sh history divergence, then push once.
- Restart k3s to pick up the canonical flags (control-plane blip) so the deployed runtime config matches the wrapper, removing the data-wipe footgun.

Files / paths
- /etc/systemd/system/k3s.service (ExecStart path: /home/kai/projects/infrastructure/scripts/k3s-start.sh)
- /home/kai/projects/infrastructure/scripts/k3s-start.sh
- /home/kai/projects/coilysiren/infrastructure/scripts/k3s-start.sh

Resolved on kai-server (claude-linux-kai-server session)
Reconciled the wrapper drift and synced both checkouts. Worked out simpler than feared: the divergence was a single superseded commit.
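For telling a real divergence apart from a single unpushed commit, a minimal sketch along these lines may help. It assumes both histories are reachable from one repository (for example after a local-path fetch of the other clone); the helper names divergence_counts and is_ancestor are illustrative and not part of the infrastructure repo.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and argument order are hypothetical.

# divergence_counts REPO LEFT RIGHT
# Prints "<only-in-LEFT> <only-in-RIGHT>": the number of commits reachable
# from one side but not from the other.
divergence_counts() {
  git -C "$1" rev-list --left-right --count "$2...$3" | tr '\t' ' '
}

# is_ancestor REPO OLD NEW
# Succeeds when OLD is an ancestor of NEW, i.e. moving OLD to NEW
# would be a plain fast-forward.
is_ancestor() {
  git -C "$1" merge-base --is-ancestor "$2" "$3"
}
```

A left count of 1 would correspond to a single commit that exists only on the deployed side, and a successful is_ancestor from the mirror's tip to the canonical tip would mean the mirror can be fast-forwarded without a merge.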
What the divergence actually was
The canonical (Forgejo) clone already contained every commit the deployed (GitHub) clone had, except dd03a53 (exec on the short wrapper). That commit is fully superseded by canonical's 43468a0 (exec on the full wrapper). GitHub's main (3441f82) is a clean fast-forward ancestor of 43468a0, so there was never a real merge conflict - just an unpushed, superseded commit.

Done
- Kept the canonical full wrapper (43468a0): tailnet --node-ip resolution + --write-kubeconfig-mode=0644 + --flannel-iface=tailscale0 + the exec notify fix. No content decision needed - the full wrapper is a strict superset.
- Pushed canonical to Forgejo: 8eb5b40..43468a0. Source of truth is reconciled.
- Synced the deployed clone (/home/kai/projects/infrastructure) to 43468a0 via a local-path fetch + git reset --hard (dropping the superseded dd03a53). The deployed scripts/k3s-start.sh that systemd runs is now byte-identical to canonical (full tailnet wrapper). This removes the data-wipe footgun: any future k3s restart now reads the correct --node-ip / --flannel-iface=tailscale0 flags.

Note: I did not repoint the systemd unit, because /home/kai/projects/infrastructure is referenced by 6 systemd units (k3s + core-keeper / eco-server / factorio-backup / factorio-server / icarus-server), not just k3s. Syncing the clone in place fixes the drift without disturbing the game-server units. Consolidating to a single clone is a larger, separate effort.

Remaining / deferred
- k3s has not been restarted yet; the cluster is still Ready. Because the wrapper file is now correct, the footgun is already gone; the restart only matters to make the currently running process match the flags. Run sudo systemctl restart k3s (brief control-plane blip) at a convenient time.
- The GitHub mirror is still at 3441f82 (clean FF target 43468a0, no divergence). The fan-out push to GitHub failed: ssh: connect to host github.com port 22: Connection timed out. kai-server's outbound SSH:22 to GitHub is blocked (HTTPS reaches github.com fine, 200). Since Forgejo is primary and both clones now agree on 43468a0 with 3441f82 as a FF ancestor, this is no longer a hazard - just a lagging mirror. Filing the SSH:22 egress block as a follow-up.
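
The gap between the blocked SSH path and the working HTTPS path can be checked at the TCP level. The sketch below is only an illustration under assumptions: port_open and report are hypothetical helpers, they rely on bash's /dev/tcp redirection and the coreutils timeout command, and they only test whether a TCP connection opens, not whether SSH or HTTPS works end to end.

```shell
#!/usr/bin/env bash
# Sketch only: port_open and report are hypothetical helper names.

# port_open HOST PORT [SECONDS]
# Succeeds if a TCP connection to HOST:PORT opens within SECONDS (default 5).
port_open() {
  local host=$1 port=$2 secs=${3:-5}
  # The inner bash sees HOST as $0 and PORT as $1.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# report HOST PORT
# Prints a one-line status for HOST:PORT.
report() {
  if port_open "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 blocked or timed out"
  fi
}
```

Running report github.com 22 and report github.com 443 from kai-server would, if the observations above still hold, be expected to show port 22 failing and port 443 succeeding, which would point at an egress rule rather than a GitHub outage.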