Deploy path broken: WSL runner can't reach kai-server LAN IP 192.168.0.194 (registry push + k3s API time out) #175

Closed

opened 2026-05-28 13:31:41 +00:00 by coilysiren · 0 comments

coilysiren commented

2026-05-28 13:31:41 +00:00

Owner

Summary

The GitHub-free deploy path (infra #168 / backend #25) is broken end-to-end: the
Forgejo deploy job builds the image fine but docker push to
192.168.0.194:30500 times out, because the runner is pinned to the
WSL node (kai-desktop-tower-wsl) and pods there have no route to
kai-server's LAN IP 192.168.0.194.

Found while running backend #26 (independent verification). The deployer was
authored assuming 192.168.0.194:30500 is "reachable from both the WSL-node
runner's DinD and kai-server's containerd" (k3s-deploy-notes §11). That is
false for the WSL runner.

Evidence

Forgejo run #29 (commit 8f8d9b2), deploy job: build + tag succeed, push fails:

Successfully tagged 192.168.0.194:30500/coilysiren-backend:8f8d9b2a...
The push refers to repository [192.168.0.194:30500/coilysiren-backend]
Get "http://192.168.0.194:30500/v2/": net/http: request canceled while
  waiting for connection (Client.Timeout exceeded while awaiting headers)

Reachability probes from the runner DinD (forgejo-runner-0, on
kai-desktop-tower-wsl):

http://10.43.131.232:5000/v2/ (registry ClusterIP) -> 200 OK
http://100.69.164.66:30500/v2/ (registry NodePort via kai-server tailnet IP) -> 200 OK
https://100.69.164.66:6443/healthz (k3s API via kai-server tailnet IP) -> 401 (reachable)
http://192.168.0.194:30500/v2/ (registry NodePort via kai-server LAN IP) -> timeout
https://192.168.0.194:6443/healthz (k3s API via kai-server LAN IP) -> timeout
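
For anyone re-running these probes, a minimal sketch, not necessarily the commands originally used. It assumes curl is available inside the runner's DinD container; the function name probe is hypothetical.

```shell
# Hypothetical helper: print "<url> -> <http status>", or "no response" when
# curl cannot connect or times out (curl reports http_code 000 in that case).
probe() {
  # -k: skip TLS verification (the k3s API cert is not publicly trusted)
  # -m 5: give up after 5 seconds; -o /dev/null: discard the response body
  code=$(curl -k -s -o /dev/null -m 5 -w '%{http_code}' "$1")
  if [ "$code" = "000" ]; then
    echo "$1 -> no response"
  else
    echo "$1 -> $code"
  fi
}

# Usage, from inside the runner's DinD container:
#   probe http://192.168.0.194:30500/v2/
#   probe https://192.168.0.194:6443/healthz
```

A 401 from /healthz still counts as reachable: the API server answered, it just rejected the unauthenticated request.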

Node InternalIPs are all tailnet 100.x (flannel VXLAN over tailscale):

kai-desktop-tower-wsl   100.107.172.77
kai-server              100.69.164.66

So the cluster fabric is tailscale; the WSL node only reaches kai-server over
the tailnet, never its LAN IP.

Why the runner can't just move

Runner is pinned to the WSL node on purpose (infra #151): DinD bridge churn on
kai-server knocked out host TCP listeners (sshd / apiserver / tailscaled).
Pinning back to kai-server is the known-bad option. So the registry + API must
be addressed by something the WSL node can reach.

Fix direction

Address the registry and API by kai-server's tailnet IP (SSM
/coilysiren/kai-server/tailnet-ip), not the LAN IP. The tailnet IP is opaque
per AGENTS "Configs go in SSM / IPs and tailnet FQDNs count as opaque", so it
must not be hardcoded in committed YAML or the workflow - resolve from SSM
(external-secrets for the manifest, a resolve step or templated secret for the
workflow).

Concretely, three coupled changes:

1. DinD --insecure-registry (deploy/forgejo-runner.yml) must include
   <tailnet-ip>:30500 (currently only 192.168.0.194:30500), so the
   plain-http push to the tailnet-IP NodePort is allowed.
2. Push target / image ref (backend/.forgejo/workflows/build-publish-deploy.yml)
   -> <tailnet-ip>:30500/coilysiren-backend:<sha> (runner-reachable). The
   registry pod/storage is shared across every NodePort address, so kai-server
   can still pull the same repo+tag via its own locally-reachable endpoint:
   add a registries.yaml mirror mapping <tailnet-ip>:30500 ->
   endpoint: http://192.168.0.194:30500 (avoids kai-server's own
   tailnet-loopback gap, §7).
3. Roll-deployment kubeconfig server -> https://<tailnet-ip>:6443
   (the tailnet IP is in the k3s cert SANs; the LAN IP is not WSL-reachable).
   Can be a workflow --server= override; the CA/token in DEPLOY_KUBECONFIG
   stay valid.
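
Rough sketch of the values the three changes would produce, once the tailnet IP has been resolved into a variable. All function names here are hypothetical and not from the repo. The registries.yaml layout follows the k3s mirrors/endpoint format; how the real manifests and workflow template these values is still open.

```shell
# Hypothetical helpers; $1 is the tailnet IP resolved from SSM, never hardcoded.

# Change 1: extra dockerd flag for the DinD sidecar (plain-http registry).
insecure_registry_flag() {
  printf '%s%s:30500\n' '--insecure-registry=' "$1"
}

# Change 2a: runner-reachable image reference for a commit SHA ($2).
image_ref() {
  printf '%s:30500/coilysiren-backend:%s\n' "$1" "$2"
}

# Change 2b: k3s registries.yaml mirror, so kai-server pulls the
# <tailnet-ip>:30500 ref through its locally-reachable endpoint ($2).
render_registries_mirror() {
  cat <<EOF
mirrors:
  "$1:30500":
    endpoint:
      - "$2"
EOF
}

# Change 3: API server URL for the kubectl --server= override.
api_server_url() {
  printf 'https://%s:6443\n' "$1"
}
```

In a workflow step, TAILNET_IP could come from something like `aws ssm get-parameter --name /coilysiren/kai-server/tailnet-ip --query Parameter.Value --output text`, assuming the AWS CLI and credentials are available on the runner (not confirmed here). The roll step would then pass `--server="$(api_server_url "$TAILNET_IP")"` to kubectl, and kai-server's mirror would be rendered with `render_registries_mirror "$TAILNET_IP" http://192.168.0.194:30500`.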

Also update k3s-deploy-notes §11, which currently asserts the LAN IP is
reachable from the WSL runner.

Blocks

backend #26 (end-to-end verification) - cannot pass until this lands.

Refs: infra #168, backend #25, backend #26, infra #151, infra #147.

coilysiren referenced this issue from coilyco-flight-deck/backend

2026-05-28 13:32:16 +00:00

Verify backend deploy via Forgejo + in-cluster registry succeeds end-to-end #26

coilysiren added a label

2026-06-04 08:16:53 +00:00

coilysiren closed this issue

2026-06-17 07:44:06 +00:00