Forgejo deploy path: DinD->registry unreachable + heavy build starves control-plane node #18
The repo-side Forgejo deploy migration (#17) is merged: `.forgejo/workflows/build-publish-deploy.yml`, the `deployer` SA/Role/RoleBinding/token in `deploy/main.yml` (applied to the `coilysiren-galaxy-gen` namespace, RBAC verified), registry image ref + `imagePullPolicy: Always`, and the dead `.github/workflows/build-and-publish.yml` removed. The `test` job is green. The `deploy` job does not yet land a pod. Two infra blockers remain (the first shared with coilysiren/backend's deploy, which is iterating on the same path; the second specific to galaxy-gen) plus one manual credential step.

Blocker 1 - DinD daemon cannot reach the in-cluster registry (shared with backend)
backend's deploy (run #29) builds fine but the push to `192.168.0.194:30500` times out.

The DinD sidecar's network namespace has no route to the NodePort registry. The registry is verified reachable from kai-server's containerd (coilysiren/infrastructure#168, #171), but that is a different network path than the DinD daemon. Needs an infra fix (route/NodePort reachability from the runner's DinD netns, or an in-cluster ClusterIP registry endpoint the DinD can hit).
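
To compare the two network paths directly, a minimal TCP probe could be run once from the node and once from inside the DinD sidecar's network namespace. This is only a sketch: the function name `can_connect` is hypothetical, it assumes a Python interpreter is available in both places, and it checks only that the TCP port opens, not that the registry API answers.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and "no route to host".
        return False


# Example call with the registry address from this issue:
#   can_connect("192.168.0.194", 30500)
```

If the probe returns `True` from the node and `False` from the DinD netns, the failure is in routing between that namespace and the NodePort rather than in the registry itself.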
Blocker 2 - heavy build starves the control-plane node (galaxy-gen-specific)
galaxy-gen's image is a full Rust release compile + wasm-pack + webpack. Building it on the in-cluster DinD (on kai-server) destabilized the node: during run #13's build the k3s API and Forgejo API both went unreachable (Bad Gateway / TLS handshake timeout) and the build was killed mid-Dockerfile. kai-server's node `Ready` condition shows a NotReady->Ready transition at 2026-05-28 07:49 PT, matching the build window. kai-server hosts the control plane + Forgejo + the app pods, so a CPU/memory-hungry build there is self-defeating.

Mitigations to weigh (infra): pin the runner / DinD build to a non-control-plane node (`kai-desktop-tower-wsl`), add cgroup CPU/memory caps to the DinD sidecar, or build with `--cpu-quota`/`nice`. Possibly precompile or cache the Rust build to cut peak load.

Manual step - DEPLOY_KUBECONFIG secret
coily has no verb to set a Forgejo Actions secret (only `actions task list/logs`), so this is a one-time web-UI step (same as how backend#25's secret was set). The kubeconfig has been generated from the `deployer` SA token (server `https://192.168.0.194:6443`, namespace-scoped, RBAC verified) and base64-encoded. Set repo secret `DEPLOY_KUBECONFIG` on coilysiren/galaxy-gen to that base64 value. The `Roll deployment` step has not been reached yet (it is gated behind the build/push), so this is not yet the failing step but is required before the first successful deploy.

Also
GitHub `main` holds the pre-rebase SHA `8ec8d94` from the first push (a sibling lockdown commit raced in on Forgejo, forcing a rebase to `fbd681e`). The `origin` remote pushes to both GitHub and Forgejo; GitHub has since diverged. Reconciling it needs a force-update of GitHub `main` to match canonical Forgejo, which was not done autonomously per the no-force-push rule.

refs #17