Finish operator-to-static-keys migration - eco-mcp, eco-spec, galaxy-gen, backend #41

Open
opened 2026-05-23 20:54:32 +00:00 by coilysiren · 0 comments
Owner

Originally filed by @coilysiren on 2026-05-20T08:22:21Z - https://github.com/coilysiren/infrastructure/issues/210

Multi-session tracker for finishing the operator-to-static-keys migration started this session.

Current state

Live on new pattern (operator-free):

  • repo-recall - in-pod sidecar (canary). #201, #202 (nodeSelector), #203 (port 80). http://repo-recall returns 200.
  • vmsingle - standalone ts-proxy Deployment (helm-managed workload pattern). #204. http://vmsingle and http://vmsingle:8428/health return 200.
  • forgejo - standalone ts-proxy Deployment (rootless-workload pattern). #209. Joined as forgejo-1 because the old forgejo device record still holds the name in the Tailscale admin console; deleting that record manually and restarting the ts-forgejo pod will reclaim the name. Old forgejo device is offline.

Manifest changes pushed, blocked from landing:

  • galaxy-gen (coilysiren/galaxy-gen#72). CI's build-and-publish.yml uses kubectl set image only, never apply -f deploy/main.yml, so the sidecar manifest never reaches the cluster. Workflow needs to add an envsubst+apply step, OR we ship the apply manually.
  • eco-mcp (coilysiren/eco-mcp-app#60). CI uses the legacy TS_OAUTH_CLIENT_ID/SECRET pattern. That credential is broken (related to but distinct from #198's operator-oauth). Workflow also reads config.yml which doesn't exist locally or on kai-server. CI has been failing for ~7 days; the workload pod is 7 days old.
  • eco-spec (coilysiren/eco-jobs-tracker#42). Same as eco-mcp.
  • backend (coilysiren/backend#64). Same as eco-mcp, plus a pre-existing ImagePullBackOff unrelated to either thread.

The terraform side is fully landed: all seven SSM auth keys exist at /coilysiren/{repo-recall,vmsingle,eco-mcp,eco-spec,galaxy-gen,backend,forgejo}/ts-authkey. terraform-tailscale-devices is the source of truth.
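
A quick way to confirm the keys are present is to loop over the paths above. This is a minimal sketch, assuming the AWS CLI is installed and already authenticated against the right account and region; the function names are illustrative, not from the repo. It checks only that each parameter exists and never prints the key value.

```shell
# Service names copied from the SSM path list in this issue.
SERVICES="repo-recall vmsingle eco-mcp eco-spec galaxy-gen backend forgejo"

# Print the SSM parameter path for one service.
authkey_path() {
  printf '/coilysiren/%s/ts-authkey\n' "$1"
}

# Report whether each auth key exists, without printing its value.
# Assumption: the caller has ssm:GetParameter on these paths.
check_authkeys() {
  for svc in $SERVICES; do
    path=$(authkey_path "$svc")
    if aws ssm get-parameter --name "$path" \
        --query 'Parameter.Name' --output text >/dev/null 2>&1; then
      echo "ok      $path"
    else
      echo "MISSING $path"
    fi
  done
}
```

Run `check_authkeys` after sourcing the snippet; any `MISSING` line points at a key that terraform-tailscale-devices has not created.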

What unblocks finishing

Three of the four blocked CIs (eco-mcp, eco-spec, backend) share a root cause with #198: a Tailscale OAuth credential that was minted long ago and silently aged out. galaxy-gen already has the modern shape (federated identity via tailscale-oidc/, no long-lived bearer) and is blocked only by its missing apply step. The three eco-/backend CIs need the same migration. config.yml also needs to be created, or its yq lookup needs a default.
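
The "yq lookup needs a default" option could look roughly like the helper below. It is a sketch under assumptions: mikefarah yq v4 (which provides the `//` alternative operator), a top-level key, and a hypothetical function name. The real key the workflows read is not stated in this issue, so the key is left as a parameter.

```shell
# Look up a top-level key in a YAML file, falling back to a default when
# the file or the key is missing.
# Assumption: mikefarah yq v4; the key names passed in are up to the caller.
config_get() {
  file=$1
  key=$2
  default=$3
  if [ ! -f "$file" ]; then
    # config.yml is absent (the current state on CI), so skip yq entirely.
    printf '%s\n' "$default"
    return 0
  fi
  # `//` yields the right-hand side when the left-hand side is null or false.
  yq ".${key} // \"${default}\"" "$file"
}
```

For example, `config_get config.yml name fallback-name` prints `fallback-name` while config.yml does not exist, so the workflow step stops failing on the missing file.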

Two paths to finish:

  1. Modernize the four CIs first. Port eco-mcp-app, eco-jobs-tracker, backend to the OIDC pattern in tailscale-oidc/. Add the missing envsubst+apply step to galaxy-gen's workflow. Then a no-op push reruns each CI and the sidecar lands.
  2. Manual applies, defer CI fixes. Render each manifest locally with envsubst and apply directly. Faster, but it doesn't fix the underlying CI debt: the same problem will recur on the next ordinary push.

Recommend path 1.
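
Both paths come down to the same render-and-apply command, whether it runs as a workflow step or by hand. A minimal sketch, assuming `envsubst` (from gettext) and `kubectl` are on the PATH, the kube context already points at the cluster, and every variable the manifest references is exported beforehand; `apply_manifest` is an illustrative name:

```shell
# Render a manifest with envsubst and apply it to the current kube context.
# Assumption: all ${VAR} placeholders in the manifest are exported already;
# unset variables are silently replaced with empty strings by envsubst.
apply_manifest() {
  manifest=$1
  if [ ! -f "$manifest" ]; then
    echo "missing manifest: $manifest" >&2
    return 1
  fi
  envsubst < "$manifest" | kubectl apply -f -
}
```

For galaxy-gen this would be `apply_manifest deploy/main.yml`, run in addition to the existing `kubectl set image` step.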

Once all seven are migrated

  • Verify each new tailnet device is reachable: curl -s -o /dev/null -w '%{http_code}\n' http://<name>/healthz for all seven.
  • Confirm all operator-managed StatefulSets in the tailscale namespace are gone (the operator reconciles them away when each Service drops tailscale.com/expose).
  • Deprovision the operator (#198, #198 follow-up): helm uninstall tailscale-operator -n tailscale, delete the operator-oauth Secret, revoke the OAuth client in the Tailscale admin console.
  • Delete the orphan forgejo device record in the Tailscale admin console; restart the ts-forgejo Deployment so it claims the freed name.
  • Document the static-device pattern in docs/tailscale-static-devices.md (in-pod sidecar vs standalone proxy decision rule, RBAC shape, port-80 conventions, rotation runbook). Tracked as the doc task in the original session.
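
The first two checks in this list can be scripted. The sketch below assumes it runs from a machine on the tailnet, that each device name matches the service name in the SSM path list, and that every service answers on /healthz as the checklist states; the function names and the 10-second timeout are illustrative choices.

```shell
# Device names assumed to match the service names used for the SSM paths.
SERVICES="repo-recall vmsingle eco-mcp eco-spec galaxy-gen backend forgejo"

# Print the HTTP status code for one tailnet device (000 if unreachable).
device_status() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "http://$1/healthz"
}

# Check every device; return non-zero if any does not answer 200.
check_devices() {
  failed=0
  for name in $SERVICES; do
    code=$(device_status "$name")
    echo "$name $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}

# List StatefulSets still present in the tailscale namespace.
# Empty output means the operator has reconciled them all away.
remaining_statefulsets() {
  kubectl get statefulsets -n tailscale -o name
}
```

Running `check_devices && [ -z "$(remaining_statefulsets)" ]` gives one pass/fail signal before the operator is uninstalled.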

Related issues

  • coilysiren/infrastructure#198 - operator OAuth 401 (root cause)
  • coilysiren/infrastructure#199 - terraform-tailscale-devices module
  • coilysiren/infrastructure#200 - drop tag:k8s-operator from ACL
  • coilysiren/infrastructure#201 - repo-recall sidecar canary
  • coilysiren/infrastructure#202 - repo-recall nodeSelector
  • coilysiren/infrastructure#203 - repo-recall port 80
  • coilysiren/infrastructure#204 - vmsingle standalone proxy
  • coilysiren/infrastructure#205, #206, #207, #208 - per-service terraform additions
  • coilysiren/infrastructure#209 - forgejo standalone proxy
  • coilysiren/eco-mcp-app#60 - eco-mcp sidecar manifest
  • coilysiren/eco-jobs-tracker#42 - eco-spec sidecar manifest
  • coilysiren/galaxy-gen#72 - galaxy-gen sidecar manifest
  • coilysiren/backend#64 - backend sidecar manifest

coilysiren added the P2 label and removed the P1 label 2026-05-31 07:00:51 +00:00
Reference
coilyco-flight-deck/infrastructure#41