Deploy path broken: WSL runner can't reach kai-server LAN IP 192.168.0.194 (registry push + k3s API time out) #175
Summary
The GitHub-free deploy path (infra #168 / backend #25) is broken end-to-end: the
Forgejo deploy job builds the image fine but docker push to 192.168.0.194:30500
times out, because the runner is pinned to the WSL node (kai-desktop-tower-wsl)
and pods there have no route to kai-server's LAN IP 192.168.0.194.

Found while running backend #26 (independent verification). The deployer was
authored assuming 192.168.0.194:30500 is "reachable from both the WSL-node
runner's DinD and kai-server's containerd" (k3s-deploy-notes §11). That is
false for the WSL runner.
Evidence
Forgejo run #29 (commit 8f8d9b2) deploy job: build + tag succeed, push fails.

Reachability probes from the runner DinD (forgejo-runner-0, on kai-desktop-tower-wsl):
- http://10.43.131.232:5000/v2/ (registry ClusterIP) -> 200 OK
- http://100.69.164.66:30500/v2/ (registry NodePort via kai-server tailnet IP) -> 200 OK
- https://100.69.164.66:6443/healthz (k3s API via kai-server tailnet IP) -> 401 (reachable)
- http://192.168.0.194:30500/v2/ (registry NodePort via kai-server LAN IP) -> timeout
- https://192.168.0.194:6443/healthz (k3s API via kai-server LAN IP) -> timeout

Node InternalIPs are all tailnet 100.x (flannel VXLAN over tailscale).

So the cluster fabric is tailscale; the WSL node only reaches kai-server over
the tailnet, never its LAN IP.
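For re-running the reachability check later, a minimal Python sketch follows. It is not the tooling used for the probes above: it only tests whether a TCP connection opens, so it can separate "timeout" from "reachable" but does not report HTTP status codes such as 200 or 401. The function names tcp_reachable and probe_all are hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and "no route to host".
        return False


def probe_all(targets, timeout: float = 3.0) -> dict:
    """Map each (host, port) pair to 'reachable' or 'unreachable'."""
    return {
        f"{host}:{port}": "reachable" if tcp_reachable(host, port, timeout) else "unreachable"
        for host, port in targets
    }
```

Run from inside the runner DinD, probe_all([("192.168.0.194", 30500), ("192.168.0.194", 6443)]) would be expected to report both LAN targets as unreachable if the behaviour described above still holds.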
Why the runner can't just move
Runner is pinned to the WSL node on purpose (infra #151): DinD bridge churn on
kai-server knocked out host TCP listeners (sshd / apiserver / tailscaled).
Pinning back to kai-server is the known-bad option. So the registry + API must
be addressed by something the WSL node can reach.
Fix direction
Address the registry and API by kai-server's tailnet IP (SSM
/coilysiren/kai-server/tailnet-ip), not the LAN IP. The tailnet IP is opaque
per AGENTS "Configs go in SSM / IPs and tailnet FQDNs count as opaque", so it
must not be hardcoded in committed YAML or the workflow - resolve from SSM
(external-secrets for the manifest, a resolve step or templated secret for the
workflow).
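One possible shape for the workflow-side resolve step, sketched in Python. It assumes the AWS CLI is installed on the runner and has credentials that can read the parameter; neither is confirmed by this issue. The function name resolve_tailnet_ip and the injectable run argument are hypothetical, and --with-decryption is included only in case the parameter is a SecureString.

```python
import ipaddress
import subprocess

# Parameter name taken from this issue.
SSM_PARAM = "/coilysiren/kai-server/tailnet-ip"


def resolve_tailnet_ip(run=subprocess.run) -> str:
    """Fetch the tailnet IP from SSM via the AWS CLI and validate it.

    `run` is injectable so the function can be exercised without AWS access.
    """
    result = run(
        [
            "aws", "ssm", "get-parameter",
            "--name", SSM_PARAM,
            "--with-decryption",
            "--query", "Parameter.Value",
            "--output", "text",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    value = result.stdout.strip()
    # Raises ValueError if SSM returned something that is not an IP address.
    ipaddress.ip_address(value)
    return value
```

Resolving at job time keeps the address out of committed YAML; the value would still need to be masked in job logs if it is treated as opaque.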
Concretely, three coupled changes:
1. --insecure-registry (deploy/forgejo-runner.yml) must include <tailnet-ip>:30500 (currently only 192.168.0.194:30500), so the plain-http push to the tailnet-IP NodePort is allowed.
2. Image push target (backend/.forgejo/workflows/build-publish-deploy.yml) -> <tailnet-ip>:30500/coilysiren-backend:<sha> (runner-reachable). The registry pod/storage is shared across every NodePort address, so kai-server can still pull the same repo+tag via its own locally-reachable endpoint: add a registries.yaml mirror mapping <tailnet-ip>:30500 -> endpoint: http://192.168.0.194:30500 (avoids kai-server's own tailnet-loopback gap, §7).
3. k3s API server -> https://<tailnet-ip>:6443 (the tailnet IP is in the k3s cert SANs; the LAN IP is not WSL-reachable). Can be a workflow --server= override; the CA/token in DEPLOY_KUBECONFIG stay valid.
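Because all three changes hang off the same resolved address, they can be derived from one input. The Python sketch below shows the derived values under that assumption; the function names are hypothetical, the tailnet IP is passed in rather than hardcoded, and the mirror layout follows the k3s registries.yaml "mirrors / endpoint" format.

```python
REGISTRY_PORT = 30500
API_PORT = 6443
# kai-server's own locally-reachable registry endpoint, from this issue.
LAN_REGISTRY_ENDPOINT = "http://192.168.0.194:30500"


def insecure_registry_flag(tailnet_ip: str) -> str:
    """dockerd flag that allows plain-http pushes to the tailnet-IP NodePort."""
    return f"--insecure-registry={tailnet_ip}:{REGISTRY_PORT}"


def image_ref(tailnet_ip: str, sha: str) -> str:
    """Runner-reachable image reference for the push."""
    return f"{tailnet_ip}:{REGISTRY_PORT}/coilysiren-backend:{sha}"


def api_server(tailnet_ip: str) -> str:
    """Value for a --server= override pointing at the k3s API."""
    return f"https://{tailnet_ip}:{API_PORT}"


def registries_yaml(tailnet_ip: str) -> str:
    """Mirror entry so kai-server pulls the tailnet-addressed repo over its LAN endpoint."""
    return (
        "mirrors:\n"
        f'  "{tailnet_ip}:{REGISTRY_PORT}":\n'
        "    endpoint:\n"
        f'      - "{LAN_REGISTRY_ENDPOINT}"\n'
    )
```

Keeping the derivation in one place means the image reference the runner pushes and the mirror key kai-server matches on cannot drift apart, which is the coupling the three changes depend on.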
Also update k3s-deploy-notes §11, which currently asserts the LAN IP is
reachable from the WSL runner.
Blocks
backend #26 (end-to-end verification) - cannot pass until this lands.
Refs: infra #168, backend #25, backend #26, infra #151, infra #147.