Forgejo deploy path: DinD->registry unreachable + heavy build starves control-plane node #18

Open
opened 2026-05-28 17:16:01 +00:00 by coilysiren · 0 comments
Owner

The repo-side Forgejo deploy migration (#17) is merged: .forgejo/workflows/build-publish-deploy.yml, the deployer SA/Role/RoleBinding/token in deploy/main.yml (applied to the coilysiren-galaxy-gen namespace, RBAC verified), registry image ref + imagePullPolicy: Always, and the dead .github/workflows/build-and-publish.yml removed. The test job is green. The deploy job does not yet land a pod. Two infra blockers remain (both shared with coilysiren/backend's deploy, which is iterating on the same path) plus one manual credential step.

Blocker 1 - DinD daemon cannot reach the in-cluster registry (shared with backend)

backend's deploy (run#29) builds fine but the push to 192.168.0.194:30500 times out:

The push refers to repository [192.168.0.194:30500/coilysiren-backend]
Get "http://192.168.0.194:30500/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

The DinD sidecar's network namespace has no route to the NodePort registry. The registry is verified reachable from kai-server's containerd (coilysiren/infrastructure#168, #171), but containerd on the node takes a different network path than the DinD daemon, so that check does not cover this case. This needs an infra fix: either route/NodePort reachability from the runner's DinD netns, or an in-cluster ClusterIP registry endpoint the DinD daemon can hit.
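
To compare the two network paths described above, a small probe of the registry's `/v2/` endpoint can be run once on kai-server and once inside the DinD sidecar. This is only a sketch: the function name `probe_registry` is hypothetical, and it assumes `curl` is available wherever it runs, which may not hold for the sidecar image.

```shell
#!/bin/sh
# Sketch: probe a registry's /v2/ endpoint with a bounded timeout.
# probe_registry is a hypothetical helper name, not part of any existing tooling.
probe_registry() {
  host_port="$1"        # e.g. 192.168.0.194:30500
  timeout_s="${2:-5}"   # total seconds allowed for the request
  # Any HTTP response counts as "reachable"; only connect/timeout errors fail.
  if curl -sS -m "$timeout_s" -o /dev/null "http://$host_port/v2/"; then
    echo "reachable: $host_port"
  else
    curl_rc=$?
    echo "unreachable: $host_port (curl exit $curl_rc)"
    return 1
  fi
}

# Usage (run on the node, then again inside the DinD sidecar):
#   probe_registry 192.168.0.194:30500
```

If the probe succeeds on the node but fails in the sidecar, that would be consistent with the push timeout coming from the DinD netns rather than from the registry itself.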

Blocker 2 - heavy build starves the control-plane node (galaxy-gen-specific)

galaxy-gen's image is a full Rust release compile + wasm-pack + webpack. Building it on the in-cluster DinD (on kai-server) destabilized the node: during run#13's build the k3s API and Forgejo API both went unreachable (Bad Gateway / TLS handshake timeout) and the build was killed mid-Dockerfile. kai-server's node Ready condition shows a NotReady->Ready transition at 2026-05-28 07:49 PT, matching the build window. kai-server hosts the control plane + Forgejo + the app pods, so a CPU/memory-hungry build there is self-defeating.

Mitigations to weigh (infra): pin the runner / DinD build to a non-control-plane node (kai-desktop-tower-wsl), add cgroup CPU/memory caps to the DinD sidecar, or build with --cpu-quota/nice. Precompiling or caching the Rust build could also cut peak load.
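
As a rough illustration of the `--cpu-quota` option, the sketch below prints a capped `docker build` command. The helper name `capped_build_cmd` and the cap values (2 CPUs, 4g memory) are assumptions for illustration, not measured numbers. These flags belong to the classic builder; a BuildKit-based build may ignore them, so it is worth verifying against the builder the runner actually uses.

```shell
#!/bin/sh
# Sketch: print a docker build command with CPU and memory caps.
# capped_build_cmd and the cap values are illustrative assumptions.
capped_build_cmd() {
  image="$1"     # image tag to build
  context="$2"   # build context directory
  # quota / period = 200000 / 100000, i.e. at most about 2 CPUs of time
  echo docker build \
    --cpu-period 100000 \
    --cpu-quota 200000 \
    --memory 4g \
    -t "$image" "$context"
}

# Usage: inspect the command first, then run it:
#   capped_build_cmd 192.168.0.194:30500/coilysiren-galaxy-gen .
```

Printing the command rather than running it keeps the caps reviewable before they reach the workflow. Caps on the build alone would not help if the DinD daemon itself is unconstrained, so this complements rather than replaces the node-pinning and cgroup options.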

Manual step - DEPLOY_KUBECONFIG secret

coily has no verb to set a Forgejo Actions secret (only actions task list/logs), so this is a one-time web-UI step, the same way backend#25's secret was set. The kubeconfig has been generated from the deployer SA token (server https://192.168.0.194:6443, namespace-scoped, RBAC verified) and base64-encoded. Set repo secret DEPLOY_KUBECONFIG on coilysiren/galaxy-gen to that base64 value. The Roll deployment step has not been reached yet (it is gated behind the build/push), so the missing secret is not yet what fails, but it is required before the first successful deploy.
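
For reference, a kubeconfig of the shape described above could be rendered and encoded roughly as follows. This is a sketch, not the commands that were actually used: the helper names, the cluster/context/user names (`k3s`, `deployer`), and the use of `certificate-authority-data` are assumptions, and the token and CA are passed in rather than looked up.

```shell
#!/bin/sh
# Sketch: render a namespace-scoped kubeconfig and base64-encode it on one line.
# Helper names and the cluster/context names are hypothetical.
render_kubeconfig() {
  token="$1"     # deployer ServiceAccount token
  ca_data="$2"   # base64-encoded cluster CA certificate (assumed input)
  cat <<EOF
apiVersion: v1
kind: Config
clusters:
- name: k3s
  cluster:
    server: https://192.168.0.194:6443
    certificate-authority-data: ${ca_data}
contexts:
- name: deployer
  context:
    cluster: k3s
    namespace: coilysiren-galaxy-gen
    user: deployer
current-context: deployer
users:
- name: deployer
  user:
    token: ${token}
EOF
}

encode_secret() {
  # Strip newlines so the value pastes cleanly into the web UI secret field.
  base64 | tr -d '\n'
}

# Usage:
#   render_kubeconfig "$TOKEN" "$CA_B64" | encode_secret
```

The single-line output is what would be pasted as the `DEPLOY_KUBECONFIG` value; the workflow is then expected to decode it before use.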

Also

GitHub main holds the pre-rebase SHA 8ec8d94 from the first push (a sibling lockdown commit raced in on Forgejo, forcing a rebase to fbd681e). The origin remote pushes to both GitHub and Forgejo; GitHub has since diverged. Reconciling it needs a force-update of GitHub main to match canonical Forgejo, which was not done autonomously per the no-force-push rule.
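
When someone with authority does the reconciliation by hand, a guarded force-update is one option. The sketch below assumes separate remotes for GitHub and Forgejo (the names `github` and `forgejo` are hypothetical, since `origin` currently pushes to both) and uses `--force-with-lease` so the push is refused if GitHub `main` is no longer at the expected pre-rebase SHA.

```shell
#!/bin/sh
# Sketch: force-update a remote's main only if it still points at the expected SHA.
# reconcile_main and the remote names in the usage line are hypothetical.
reconcile_main() {
  remote="$1"         # remote to update
  expected_old="$2"   # SHA the remote's main is expected to hold right now
  new_tip="$3"        # commit that main should point at afterwards
  git push --force-with-lease="refs/heads/main:$expected_old" \
    "$remote" "$new_tip:refs/heads/main"
}

# Usage:
#   reconcile_main github 8ec8d94 "$(git rev-parse forgejo/main)"
```

The lease makes the overwrite conditional, which fits the spirit of the no-force-push rule better than a bare `--force`.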

refs #17

Reference
coilyco-flight-deck/galaxy-gen#18