tailscale loopback: kai-server host userspace cannot reach its own ts-proxy at 100.115.195.2:8428 #89

Open
opened 2026-05-23 20:54:40 +00:00 by coilysiren · 0 comments

Originally filed by @coilysiren on 2026-04-28T09:23:32Z - https://github.com/coilysiren/infrastructure/issues/71

🤖 Filed by Claude Code on Kai's behalf.

The eco-server systemd unit on kai-server cannot reach the vmsingle ts-proxy at 100.115.195.2:8428 despite kai-server itself being a tailnet member. Confirmed via in-process synchronous HTTP probe at startup:

2026-04-28T09:22:01.6725262Z smoke probe FAILED: TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 5 seconds elapsing.

The probe target was the exact URL configured for OtlpMetricsEndpoint. The same URL returns 200 from a laptop on the tailnet, so the failure is specific to host-userspace traffic addressed to a ts-proxy peer that lives on the same physical host as the originating tailscale daemon. Tailscale subnet routing back to a local-host peer doesn't appear to work from outside-cluster userspace.

This explains everything in coilysiren/eco-telemetry#5: the OTLP exporter was firing on schedule, the SDK was generating valid metric payloads (eco_players_online + the System.Runtime fan), the URL construction was correct, and vmsingle was reachable from the rest of the network. The TCP packets from eco-server just weren't arriving.

Reachable paths from eco-server's userspace:

  • localhost (127.0.0.1)
  • kai-server LAN IP (192.168.0.194)
  • kai-server's own tailnet IP (100.69.164.66)
  • ClusterIPs in 10.43.0.0/16 (k3s installs iptables rules at host level so host-userspace can hit them)

Not reachable:

  • Other tailnet peers' IPs when those peers are ts-proxies running on kai-server itself (loopback gap).
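One way to reproduce the reachability matrix above from kai-server is a plain TCP connect test. The following is a minimal sketch, not the probe that eco-telemetry ships: the IPs are the ones listed in this issue, while probing port 8428 on every address, the helper names `classify_tcp` and `report`, and the 5-second timeout are assumptions for illustration. A refused connection still shows the address is reachable (something answered with a reset); a timeout usually suggests packets are being dropped along the way.

```python
import socket

# IPs are copied from this issue. Using port 8428 for every row is an
# assumption: nothing may listen there on the non-proxy addresses, in which
# case "refused" still indicates the address itself is reachable.
TARGETS = {
    "localhost": ("127.0.0.1", 8428),
    "kai-server LAN IP": ("192.168.0.194", 8428),
    "kai-server tailnet IP": ("100.69.164.66", 8428),
    "vmsingle ts-proxy": ("100.115.195.2", 8428),
}


def classify_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connect and return 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def report(targets: dict, timeout: float = 5.0) -> dict:
    """Probe every target, print one line per target, and return the results."""
    results = {}
    for label, (host, port) in targets.items():
        results[label] = classify_tcp(host, port, timeout)
        print(f"{label:24} {host}:{port:<6} {results[label]}")
    return results
```

If the loopback diagnosis is right, calling `report(TARGETS)` on kai-server would be expected to show `timeout` only for the ts-proxy row.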

Suggested fixes, ranked:

  1. NodePort or hostPort on vmsingle's :8428. Simplest. Add a vmsingle-host-service.yml with type: NodePort (or extend the existing tailscale ClusterIP service). Eco-server points at http://localhost:<nodeport>/opentelemetry/api/v1/push. No tailscale traversal at all - same-host traffic goes through k3s's iptables rules.
  2. Use vmsingle's ClusterIP directly. Resolve the cluster IP via kubectl once and point eco-telemetry's config at it. Brittle - a ClusterIP changes whenever the Service is recreated, so it isn't guaranteed to survive helm upgrades. Pair with a DNS alias in /etc/hosts on kai-server.
  3. Run an OTel Collector on the host. Receives OTLP locally, batches, remote-writes to vmsingle via in-cluster path. Heavyweight.
  4. Investigate the tailscale loopback - might be a tailscale set --advertise-routes configuration that's missing the cluster CIDR, or a missing sysctl setting. Depends on how Tailscale behaves with self-hosted ts-proxies.
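Fix 1 could look roughly like the following. This is a sketch under assumptions: the Service name `vmsingle-host`, the `monitoring` namespace, the `app: vmsingle` selector, and nodePort 30428 are all hypothetical placeholders, and the real selector has to match the labels on the vmsingle pods. The manifest is built as a Python dict and printed as JSON, which `kubectl apply -f` accepts in place of YAML.

```python
import json


def nodeport_service(name: str, namespace: str, selector: dict,
                     port: int, node_port: int) -> dict:
    """Build a NodePort Service manifest exposing `port` on every node at `node_port`."""
    # 30000-32767 is the Kubernetes default NodePort range; a cluster may
    # be configured with a different one.
    if not 30000 <= node_port <= 32767:
        raise ValueError("node_port is outside the default 30000-32767 range")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "NodePort",
            "selector": selector,
            "ports": [{
                "name": "http",
                "protocol": "TCP",
                "port": port,
                "targetPort": port,
                "nodePort": node_port,
            }],
        },
    }


def push_url(node_port: int) -> str:
    """Endpoint eco-server would use, following the URL shape given in fix 1."""
    return f"http://localhost:{node_port}/opentelemetry/api/v1/push"


if __name__ == "__main__":
    # All four values below are hypothetical; adjust them to the real cluster.
    manifest = nodeport_service(
        name="vmsingle-host",
        namespace="monitoring",
        selector={"app": "vmsingle"},
        port=8428,
        node_port=30428,
    )
    print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` would create the Service; eco-telemetry's OtlpMetricsEndpoint would then be set to the value of `push_url(30428)`.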

For tonight's session I've kept the failing endpoint configured, because switching to a different one needs a cluster-side change that's destructive enough to wait for a clear-headed pass. The diagnostic surface added in eco-telemetry will keep working once a reachable endpoint is configured.

What ships in eco-telemetry from tonight (closes the upstream half of #5):

  • Plugin discovery fix - IModKitPlugin declaration. Eco's scanner now sees the plugin. (Bug behind everything else.)
  • EmitConsoleAlongsideOtlp config flag - mirror metrics to console for debugging.
  • Manual OtlpMetricExporter + PeriodicExportingMetricReader construction. The AddOtlpExporter helper was silently no-op'ing in our setup; manual wiring is what put the OTLP reader in the pipeline at all.
  • OTel OTEL_DIAGNOSTICS.json self-diagnostics enabled by install-eco-mod.sh.
  • Smoke probe at startup that writes its result to Logs/EcoTelemetry/smoke-probe.txt (this is how I caught the loopback issue).
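The startup smoke probe in the last bullet can be approximated as below. This is a Python sketch of the idea, not the plugin's actual C# code: the function names, the use of a GET request, the proxy bypass, and the log line format (modeled on the failure line quoted earlier in this issue) are assumptions. Only the 5-second timeout and the Logs/EcoTelemetry/smoke-probe.txt path come from the text above.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

# Ignore any proxy environment variables so the probe exercises the same
# direct network path the exporter would use (an assumption of this sketch).
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def smoke_probe(url: str, timeout: float = 5.0) -> str:
    """Issue one synchronous GET and return a single timestamped result line."""
    stamp = datetime.now(timezone.utc).isoformat()
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return f"{stamp} smoke probe OK: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so the network path works.
        return f"{stamp} smoke probe REACHED: HTTP {exc.code}"
    except OSError as exc:
        # URLError, timeouts and refused connections are all OSError subclasses.
        return f"{stamp} smoke probe FAILED: {type(exc).__name__}: {exc}"


def write_result(line: str, path: str = "Logs/EcoTelemetry/smoke-probe.txt") -> None:
    """Overwrite the result file with the latest probe line, creating directories."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(line + "\n", encoding="utf-8")
```

At startup this would be called as `write_result(smoke_probe(endpoint_url))`, where `endpoint_url` is whatever is configured for OtlpMetricsEndpoint. Treating an HTTP error status as "reached" is a deliberate choice here: a push endpoint may reject a bare GET while the network path is still fine.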


coilysiren added the P4 label and removed the P3 label 2026-05-31 07:00:44 +00:00
Reference
coilyco-flight-deck/infrastructure#89