WSL k3s agent registers Hyper-V NIC as InternalIP, breaks apiserver-mediated kubectl from kai-server #147
Symptom
From kai-server (the k3s control plane), every apiserver-mediated operation (exec, logs, port-forward) against a pod on the `kai-desktop-tower-wsl` agent node fails.

Pod-local traffic is fine - the openclaw pod's own readiness probe (which runs inside the pod) passes, the pod is 1/1 Ready, and Ollama on the Windows side of the same tower is reachable from the tailnet.
Cause
The WSL k3s agent has registered its `InternalIP` as `172.27.244.126`, which is the address of its Hyper-V NIC inside Windows. That address is not reachable from kai-server over the tailnet (it's a Microsoft-assigned RFC1918 NAT inside the WSL guest). The k3s apiserver uses the kubelet's InternalIP when proxying exec / logs / port-forward, so all of those break for pods scheduled on this node. The kubelet itself is fine - traffic that goes through the CNI (pod-to-pod, the in-pod liveness probe) works.

This was masked until VHGC's openclaw rollout because every other workload on the tower has either run in-pod-only or been driven from a session on the WSL node directly.
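To confirm the diagnosis, it can help to read back the address the node actually registered and check which range it falls in. The following is a minimal sketch, not part of the original report: the function names `node_internal_ip` and `ip_scope` are hypothetical, and it assumes plain `kubectl` access from kai-server. The only facts it relies on are that Tailscale assigns addresses from 100.64.0.0/10 and that 172.16.0.0/12 is one of the RFC1918 ranges.

```shell
#!/bin/sh
# Print the InternalIP a node has registered with the apiserver.
# Requires kubectl access to the cluster (for example from kai-server).
node_internal_ip() {
  kubectl get node "$1" \
    -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
}

# Classify a dotted IPv4 address by its first two octets:
#   tailnet  100.64.0.0/10 (the CGNAT range Tailscale assigns from)
#   rfc1918  10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
#   other    anything else, including input that is not dotted
ip_scope() {
  case $1 in
    *.*.*.*) ;;
    *) echo other; return 0 ;;
  esac
  scope_old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$scope_old_ifs
  if [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]; then
    echo tailnet
  elif [ "$1" -eq 10 ]; then
    echo rfc1918
  elif [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ]; then
    echo rfc1918
  elif [ "$1" -eq 192 ] && [ "$2" -eq 168 ]; then
    echo rfc1918
  else
    echo other
  fi
}
```

With the node name in `$NODE`, `ip_scope "$(node_internal_ip "$NODE")"` would report `rfc1918` for the address described above and `tailnet` once the node advertises its tailnet IP.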
Fix options
1. Re-register the kubelet with `--node-ip=<wsl-tailnet-ip>` - the cleanest fix. The WSL box's tailnet IP is `100.107.172.77` (resolved from `tailscale status`; SSM the FQDN, not the IP). Set `INSTALL_K3S_EXEC` / `K3S_NODE_IP` or a systemd unit override on the WSL agent so the kubelet advertises the tailnet IP as its InternalIP. The apiserver then dials the kubelet over the tailnet, which already works for any other tailnet-to-tailnet traffic.
2. Accept it and document - kubectl exec / logs / port-forward to WSL-node pods only work from a session on that node. Workable, but every cross-node verification step then has to coordinate through the agent channel.
Option 1 is the right call. Option 2 is what we're doing today, and it added 20 minutes of channel-handoff overhead to a routine verify.
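One way Option 1 might be applied is sketched below. Note that it uses a different mechanism from the ones named above: instead of `INSTALL_K3S_EXEC` or a unit override, it writes the `node-ip` key into the k3s config file, which k3s reads from `/etc/rancher/k3s/config.yaml` by default and which corresponds to the `--node-ip` flag. The helper name `set_node_ip` is hypothetical, and the sketch assumes any existing `node-ip` entry is a single scalar line, not a YAML list. Whether the WSL agent runs k3s as a systemd-managed `k3s-agent` service is also an assumption.

```shell
#!/bin/sh
# Set (or replace) the node-ip key in a k3s config file.
#   $1: path to the config file
#   $2: the IP the kubelet should advertise as its InternalIP
set_node_ip() {
  config_path=$1
  node_ip=$2
  mkdir -p "$(dirname "$config_path")"
  touch "$config_path"
  # Keep every line except an existing top-level node-ip entry.
  grep -v '^node-ip:' "$config_path" > "$config_path.tmp" || true
  printf 'node-ip: %s\n' "$node_ip" >> "$config_path.tmp"
  mv "$config_path.tmp" "$config_path"
}

# On the WSL agent, as root (assumes a systemd-managed k3s-agent service):
#   set_node_ip /etc/rancher/k3s/config.yaml 100.107.172.77
#   systemctl restart k3s-agent
```

Keeping the file path as a parameter makes the helper easy to dry-run against a scratch file before touching the real config.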
Acceptance
- `coily ops kubectl logs -n openclaw deploy/openclaw` from kai-server returns logs.
- `coily ops kubectl port-forward -n openclaw svc/openclaw 18789:18789` from kai-server stays up long enough for a curl.
- `docs/k3s-deploy-notes.md` under the existing MagicDNS / tailnet-IP traps section.
Provenance
Surfaced while applying `deploy/openclaw.yml` for VHGC. Comms #38 on the channel has the full apply context. The deploy itself is up; only the apiserver-mediated verification path is broken.
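The two kubectl checks listed under Acceptance could be scripted roughly as follows, to be run from kai-server. This is a sketch under stated assumptions: the function name `verify_openclaw`, the `PF_WAIT` delay, and the curl target `http://127.0.0.1:18789/` are illustrative choices, not taken from the issue. The `coily ops kubectl` commands are the ones quoted in the criteria.

```shell
#!/bin/sh
# Run the two kubectl acceptance checks from kai-server.
# Returns 0 only if logs come back and the port-forward answers a curl.
verify_openclaw() {
  coily ops kubectl logs -n openclaw deploy/openclaw >/dev/null || return 1

  # Start the forward in the background and remember its PID.
  coily ops kubectl port-forward -n openclaw svc/openclaw 18789:18789 \
    >/dev/null 2>&1 &
  pf_pid=$!

  # Give the forward a moment to bind; the default delay is a guess.
  sleep "${PF_WAIT:-3}"

  # Any HTTP response counts: the check is that the tunnel carries traffic.
  curl -s -o /dev/null --max-time 5 http://127.0.0.1:18789/ && rc=0 || rc=$?

  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$rc"
}
```

A non-zero return means either the logs call failed or curl could not get a response through the forward.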