tailscale loopback: kai-server host userspace cannot reach its own ts-proxy at 100.115.195.2:8428 #89

Closed

opened 2026-05-23 20:54:40 +00:00 by coilysiren · 1 comment

coilysiren commented

2026-05-23 20:54:40 +00:00

Owner

Originally filed by @coilysiren on 2026-04-28T09:23:32Z - https://github.com/coilysiren/infrastructure/issues/71

🤖 Filed by Claude Code on Kai's behalf.

The eco-server systemd unit on kai-server cannot reach the vmsingle ts-proxy at 100.115.195.2:8428 despite kai-server itself being a tailnet member. Confirmed via in-process synchronous HTTP probe at startup:

2026-04-28T09:22:01.6725262Z smoke probe FAILED: TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 5 seconds elapsing.

Probe target was the exact URL configured for OtlpMetricsEndpoint. Same URL responds 200 from a laptop on the tailnet. So the failure is specific to host-userspace traffic going to a ts-proxy peer that lives on the same physical host as the originating tailscale daemon. Tailscale subnet routing back to a local-host peer doesn't appear to work from outside-cluster userspace.
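For anyone reproducing this outside the Eco process, the same kind of check can be sketched in a few lines of Python. This is a minimal stand-in for the in-process probe, not the eco-telemetry code itself: the function name `smoke_probe` and its output format are assumptions, and only the 5 second timeout is taken from the log line above.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone


def smoke_probe(url: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Issue one blocking GET and report whether the endpoint answered at all.

    Hypothetical helper; the real probe lives in the .NET plugin.
    """
    stamp = datetime.now(timezone.utc).isoformat()
    # Ignore any proxy environment variables so the direct network path is tested.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return True, f"{stamp} smoke probe OK: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # A non-2xx status still proves that packets reached the server.
        return True, f"{stamp} smoke probe OK: HTTP {exc.code}"
    except OSError as exc:
        # URLError, connection refused and socket timeouts all derive from OSError.
        return False, f"{stamp} smoke probe FAILED: {type(exc).__name__}: {exc}"
```

Run it from kai-server's host userspace and again from a laptop on the tailnet with the configured OtlpMetricsEndpoint URL; differing results between the two vantage points point at the network path rather than at the exporter.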

This explains everything in coilysiren/eco-telemetry#5: the OTLP exporter was firing on schedule, the SDK was generating valid metric payloads (eco_players_online + the System.Runtime fan), the URL construction was correct, vmsingle was reachable from the rest of the network. The TCP packets just weren't arriving.

Reachable paths from eco-server's userspace:

- localhost (127.0.0.1)
- kai-server LAN IP (192.168.0.194)
- kai-server's own tailnet IP (100.69.164.66)
- ClusterIPs in 10.43.0.0/16 (k3s installs iptables rules at host level so host-userspace can hit them)

Not reachable:

- Other tailnet peers' IPs when those peers are ts-proxies running on kai-server itself (loopback gap).
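The reachable / not-reachable split above can be re-checked with a plain TCP sweep. The sketch below separates a timeout (packets dropped somewhere on the path) from a refusal (host reached, nothing listening), which is the distinction that matters for the loopback gap. The function names are hypothetical, and the port used for the kai-server addresses is an assumption: the issue only gives a port for the ts-proxy.

```python
import socket

# Assumption: some service (sshd here) listens on kai-server's own addresses.
HYPOTHETICAL_PORT = 22

# Addresses are from the issue; only the ts-proxy port (8428) is given there.
CANDIDATES = [
    ("localhost", "127.0.0.1", HYPOTHETICAL_PORT),
    ("kai-server LAN", "192.168.0.194", HYPOTHETICAL_PORT),
    ("kai-server tailnet", "100.69.164.66", HYPOTHETICAL_PORT),
    ("vmsingle ts-proxy", "100.115.195.2", 8428),
]


def tcp_check(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Classify one TCP connect attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"  # the host answered, but nothing listens on the port
    except (socket.timeout, TimeoutError):
        return "timeout"  # no answer at all: packets are being dropped
    except OSError as exc:
        return f"error: {exc}"


def sweep(candidates, timeout_s: float = 3.0) -> dict:
    """Return {label: result} for every (label, host, port) candidate."""
    return {label: tcp_check(host, port, timeout_s) for label, host, port in candidates}
```

Calling `sweep(CANDIDATES)` from the eco-server host gives one result per path; a `timeout` for the ts-proxy entry next to `open` or `refused` for the others would be consistent with the gap described here.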

Suggested fixes, ranked:

1. NodePort or hostPort on vmsingle's :8428. Simplest. Add a vmsingle-host-service.yml with type: NodePort (or extend the existing tailscale ClusterIP service). Eco-server points at http://localhost:<nodeport>/opentelemetry/api/v1/push. No tailscale traversal at all - same-host traffic goes through k3s's iptables rules.
2. Use vmsingle's ClusterIP directly. Resolve the cluster IP via kubectl once and point eco-telemetry's config at it. Brittle - the ClusterIP can change when a helm upgrade recreates the Service. Pair with a DNS alias in /etc/hosts on kai-server so that only one line needs updating when it does.
3. Run an OTel Collector on the host. Receives OTLP locally, batches, remote-writes to vmsingle via in-cluster path. Heavyweight.
4. Investigate the tailscale loopback - might be a tailscale set --advertise-routes configuration that's missing the cluster CIDR, or a sysctl setting. Depends on how Tailscale handles ts-proxies hosted on the same machine as the originating daemon.
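As a rough illustration of the NodePort option, the sketch below builds a Service manifest as JSON (which kubectl accepts in place of YAML) and the matching push URL. Only port 8428, the `vmsingle-host-service` name and the `/opentelemetry/api/v1/push` path come from this issue; the namespace, the pod selector labels and the node port number are hypothetical and would need to match the actual vmsingle deployment.

```python
import json
import sys

# Hypothetical values, not taken from the cluster:
NAMESPACE = "monitoring"
SELECTOR = {"app.kubernetes.io/name": "vmsingle"}
NODE_PORT = 30428


def nodeport_service(name, namespace, selector, target_port, node_port):
    """Build a core/v1 Service of type NodePort as a plain dict."""
    # 30000-32767 is the Kubernetes default range; assumes it was not reconfigured.
    if not 30000 <= node_port <= 32767:
        raise ValueError(f"nodePort {node_port} is outside the default 30000-32767 range")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "NodePort",
            "selector": selector,
            "ports": [
                {
                    "name": "http",
                    "protocol": "TCP",
                    "port": target_port,
                    "targetPort": target_port,
                    "nodePort": node_port,
                }
            ],
        },
    }


def otlp_push_url(node_port):
    """Endpoint eco-server would be pointed at once the Service exists."""
    return f"http://localhost:{node_port}/opentelemetry/api/v1/push"


if __name__ == "__main__":
    manifest = nodeport_service("vmsingle-host-service", NAMESPACE, SELECTOR, 8428, NODE_PORT)
    print(json.dumps(manifest, indent=2))  # manifest on stdout
    print(otlp_push_url(NODE_PORT), file=sys.stderr)  # URL on stderr, keeps stdout pipeable
```

The manifest goes to stdout so it can be reviewed or piped into `kubectl apply -f -`; the URL goes to stderr so it does not end up in the applied document.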

For tonight's session I've kept the failing endpoint configured because flipping to a different one needs a cluster-side change that's destructive enough to wait on a clear-headed pass. Diagnostic surface added in eco-telemetry will keep working once a reachable endpoint is configured.

What ships in eco-telemetry from tonight (closes the upstream half of #5):

- Plugin discovery fix - IModKitPlugin declaration. Eco's scanner now sees the plugin. (Bug behind everything else.)
- EmitConsoleAlongsideOtlp config flag - mirror metrics to console for debugging.
- Manual OtlpMetricExporter + PeriodicExportingMetricReader construction. The AddOtlpExporter helper was silently no-op'ing in our setup; manual wiring is what put the OTLP reader in the pipeline at all.
- OTel OTEL_DIAGNOSTICS.json self-diagnostics enabled by install-eco-mod.sh.
- Smoke probe at startup that writes its result to Logs/EcoTelemetry/smoke-probe.txt (this is how I caught the loopback issue).

🤖 Filed by Claude Code on Kai's behalf.


coilysiren added the

label

2026-06-04 08:16:55 +00:00

coilysiren commented

2026-06-17 08:22:45 +00:00

Author

Owner

Backlog burndown 2026-06-17: closing low-priority (P3/P4) to bring the open count to a manageable level. Nothing lost — reopen if this resurfaces. Batch tag: burndown-2026-06.

