Home 5GHz WiFi: Tailscale tunnel cycles up/down on ~2-5 min intervals from this network only #106

Open
opened 2026-05-24 18:52:31 +00:00 by coilysiren · 1 comment
Owner

Problem

From Kai's phone (Pixel 9, Termux), the Tailscale tunnel to kai-server cycles between reachable and unreachable on ~2-5 min intervals. Mosh sessions disconnect, then reattach on their own when the path comes back, then disconnect again. No phone-side intervention triggers the transitions - they happen on their own.

The localization is the key finding

  • Home 5GHz WiFi: cycles up/down every 2-5 min. The problem network.
  • Home 2.4GHz WiFi: in guest-network mode (round 2 finding), no LAN access, can't even test.
  • Cellular: works perfectly, no breaks.
  • Other house WiFi: works perfectly.
  • Work WiFi: works perfectly.

So this is not a Tailscale Android bug, not an Android per-app routing bug, not a kai-server bug. It is exclusively the home 5GHz radio + router combination doing something hostile to Tailscale's WireGuard UDP from non-Chrome apps.

Curious sub-finding

During the cycles, Chrome on the same phone keeps reaching http://api/ (MagicDNS) and https://forgejo.coilysiren.me without interruption. Tailscale's tunnel for Chrome stays up. Tailscale's tunnel for Termux (and probably any other app) cycles. Both are on the same 5GHz radio, same physical link, same Tailscale daemon.

That is genuinely weird and points at the router doing per-flow / per-app-fingerprint differentiation - QoS or WMM or HTTP/3 (QUIC) special-casing. Chrome uses QUIC over UDP/443 to reach forgejo; Tailscale uses WireGuard UDP on a different port. The router may be aggressively expiring WireGuard's UDP NAT mapping but keeping QUIC's alive.

Diagnostic evidence gathered

  • ping -c 3 100.69.164.66 from Termux during a 'down' window: 100% packet loss
  • ssh -4 -v kai@100.69.164.66: hangs at Connecting to ... port 22, SYN never arrives at server (zero entries in auth.log for the attempt)
  • mosh kai@100.69.164.66: connects during 'up' windows, holds session across 'down' windows
  • Tailscale Android exclude-list: Termux not excluded (tunneled)
  • Always-on VPN: enabled. Battery optimization: Unrestricted for Tailscale and Termux
  • Tailscale Android health: occasional 'San Francisco relay unavailable' warning, otherwise nominal
  • Server-side: sshd healthy, mosh-server healthy, ufw allowed UDP 60000:61000 on tailscale0, mosh-server bound to 100.69.164.66:60001
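
The "~2-5 min" figure above is an eyeball estimate. A minimal sketch of a logger that would put numbers on the cycle, assuming Python is available in Termux; the function names, probe interval, and timeout are illustrative choices, not something from the original session. It probes TCP port 22 on the tailnet address and reports each up/down transition with how long the previous state lasted.

```python
import socket
import time
from datetime import datetime, timezone


def probe_tcp(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable routes
        return False


def transitions(samples):
    """Collapse (timestamp, reachable) samples into state changes.

    Returns a list of (timestamp, new_state, seconds_in_previous_state).
    """
    changes = []
    if not samples:
        return changes
    state_start, state = samples[0]
    for ts, reachable in samples[1:]:
        if reachable != state:
            changes.append((ts, reachable, ts - state_start))
            state_start, state = ts, reachable
    return changes


def watch(host: str, interval: float = 5.0) -> None:
    """Probe forever, printing one line per up/down transition."""
    state, since = None, time.time()
    while True:
        now = time.time()
        reachable = probe_tcp(host)
        if reachable != state:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            label = "up" if reachable else "down"
            held = "" if state is None else f" (previous state held {now - since:.0f}s)"
            print(f"{stamp} {label}{held}", flush=True)
            state, since = reachable, now
        time.sleep(interval)
```

Calling watch("100.69.164.66") from Termux on each network (home 5GHz, cellular, other WiFi) would give comparable logs. transitions() does the same collapsing offline if the samples were recorded some other way.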

Hypotheses to test next time at home

  1. Router UDP NAT timeout / connection-tracking exhaustion: log into router admin, look for NAT table or connection-tracking limits. Bump UDP timeout high.
  2. Router QoS / WMM dropping WireGuard UDP: disable any QoS / smart-queue / app-prioritization features.
  3. DFS radar events on 5GHz: check router logs for 'DFS radar detected' or channel-change events. If present, pin to a non-DFS channel (36-48 in the US).
  4. Router firmware: check for updates. Track which firmware version was running when this started 'a few weeks ago.'
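  5. Tailscale DERP fallback: if the direct path through the home router is what's flaky, forcing DERP-only might paradoxically be more stable. The tailscale set --no-direct flag originally suggested here is unverified and may not exist; on desktop tailscaled, the debug environment variable TS_DEBUG_ALWAYS_USE_DERP=true is the closest known mechanism, and there may be no Android equivalent. Worth testing.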
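
For hypothesis 5 it helps to know whether the phone's path is direct or relayed at the moment it drops. A rough sketch, assuming the check is run somewhere the tailscale CLI exists (for example on kai-server, looking at the phone's peer entry) and that the output of tailscale status --json has a Peer map whose entries carry TailscaleIPs, CurAddr and Relay. Those field names are from memory of Tailscale's JSON status and should be checked against real output. The helper name and the example addresses are hypothetical.

```python
import json


def peer_path(status: dict, tailnet_ip: str) -> str:
    """Classify the current path to a peer as 'direct', 'relay', or 'unknown'.

    `status` is the parsed output of `tailscale status --json` (assumed shape).
    """
    for peer in (status.get("Peer") or {}).values():
        if tailnet_ip in (peer.get("TailscaleIPs") or []):
            if peer.get("CurAddr"):
                return "direct"   # a concrete UDP endpoint is in use
            if peer.get("Relay"):
                return "relay"    # no endpoint, only a DERP region recorded
            return "unknown"
    return "unknown"


def peer_path_from_json(text: str, tailnet_ip: str) -> str:
    """Same as peer_path, but takes the raw JSON text from the CLI."""
    return peer_path(json.loads(text), tailnet_ip)
```

CurAddr is checked first because Relay is expected to be populated even when a direct path is active. Logging this once a minute next to the up/down state would show whether the drops line up with direct-to-relay flips.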

Workaround until fixed

  • On home 5GHz: use mosh exclusively (it survives the cycles). ssh from Termux is unusable.
  • Anywhere else: ssh works fine, mosh still recommended.

Related

  • Today's debug session originally tracked in a local file coilysiren/mobile-ssh-debug.md on kai-server. Pattern note: future debug threads of this shape should live as Forgejo issues from the start, not local files.
  • infrastructure#102, #103, #104, #105 - related hygiene work surfaced during this session.

Filed by Claude.

Author
Owner

Investigation status update

Confirmed root cause

Not the home 5GHz radio itself - it's the router's UPnP port-mapping table being saturated by a BitTorrent client on the LAN, causing Tailscale's UDP mapping to get evicted on a ~2-5 min cycle.

Evidence chain

  1. Router UPnP table dump showed ~22+ entries from a single device, labeled Transmission at 51413 plus the BitTorrent port range 6881-6891 (TCP and UDP variants of each), saturating the table.
  2. The device fingerprints as a Synology NAS (OUI 00:11:32 = Synology Inc), running Transmission (confirmed via curl -I http://<host>:9091/transmission/rpc returning Server: Transmission and WWW-Authenticate: Basic realm="Transmission").
  3. Cycle timing (~2-5 min) matches typical UPnP entry churn rate on a saturated table.
  4. Why phone is more affected than laptop: mobile Tailscale on Android does not request its own UPnP mapping (Android VPN service constraint), so it relies entirely on outbound-keepalive to hold its NAT entry. Laptop Tailscale actively re-asserts UPnP mappings and is more resilient to table eviction.
  5. Why Chrome stays up on phone while ssh/mosh cycles: Chrome's QUIC connections to public IPs traverse a different NAT mapping path than Tailscale's WireGuard UDP to a tailnet peer; the latter is what's getting evicted.
  6. Why only on home WiFi: cellular and other-network WiFi don't have a UPnP-table-saturating torrent client on the LAN.
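
As a sanity check on the "~22+" count in item 1, a small sketch that enumerates the mappings the dump describes. It assumes exactly one mapping per port per protocol and no stale duplicates, which the dump may not bear out; the helper name is illustrative.

```python
from itertools import product


def expected_mappings(ports, protocols=("TCP", "UDP")):
    """One (port, protocol) pair per UPnP entry, assuming no duplicates."""
    return sorted(product(ports, protocols))


bt_range = range(6881, 6892)                       # 6881-6891 inclusive, 11 ports
mappings = expected_mappings([51413, *bt_range])   # Transmission peer port + range

print(len(expected_mappings(bt_range)))  # 22 entries from the range alone
print(len(mappings))                     # 24 including 51413 on TCP and UDP
```

This only counts entries. Whether that many is enough to fill the table depends on the router's cap, which is undocumented.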

Router specifics

  • Model: TP-Link Archer A20 v1.0, firmware 1.1.1 build 20191026.
  • Public UPnP entry-table cap is undocumented; anecdotally similar TP-Link consumer routers cap at ~64 entries (community thread: https://community.tp-link.com/us/home/forum/topic/154961).

Fix options, lowest-friction first

  1. Disable UPnP/NAT-PMP on the NAS's Transmission install (Settings -> Network -> uncheck "Use UPnP / NAT-PMP port forwarding from my router"). Torrents keep working, just no inbound peer optimization. Stops the table churn at source. Requires NAS admin access / housemate cooperation.
  2. Disable UPnP on the router globally. Tailscale loses port-mapping assistance and has to rely on STUN-based hole punching, falling back to DERP relay where that fails, but the eviction war stops. Easy and reversible.
  3. Replace the router with one that has a higher UPnP entry cap (or runs OpenWrt, which is configurable). Real fix, biggest disruption.
  4. Static port forward for Tailscale: pin Tailscale to a fixed port on each device and create a manual port-forward on the router that survives UPnP eviction. Requires per-device Tailscale config.
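
If the NAS's Transmission UI is not reachable, option 1 can also be applied by editing Transmission's settings.json, where the checkbox corresponds to the port-forwarding-enabled key. A minimal sketch, assuming shell access to the NAS; the location of settings.json in the Synology package is not known here, so the path is left to the caller, and the function names are illustrative. Stop the Transmission daemon first, since it rewrites settings.json on exit and would undo the edit.

```python
import json
from pathlib import Path


def disable_port_forwarding(settings: dict) -> dict:
    """Return a copy of Transmission settings with UPnP/NAT-PMP mapping off."""
    updated = dict(settings)
    updated["port-forwarding-enabled"] = False
    return updated


def rewrite_settings(path: Path) -> bool:
    """Disable port forwarding in settings.json; return True if the file changed."""
    settings = json.loads(path.read_text())
    if not settings.get("port-forwarding-enabled", False):
        return False  # already off, leave the file untouched
    path.write_text(
        json.dumps(disable_port_forwarding(settings), indent=4, sort_keys=True)
    )
    return True
```

After restarting the daemon, the Transmission entries should age out of the router's UPnP table, or can be cleared by toggling UPnP on the router once.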

Adjacent findings filed separately

  • #102 - sshd ClientAlive (zombie session reaping)
  • #103 - umbrella sshd security pass (public exposure question)
  • #104 - fail2ban quick win
  • #105 - earlier hypothesis (Tailscale Android per-app routing) now superseded by this UPnP finding

Citations supporting the diagnosis

  • tailscale#18348 - UPnP mapping breaks direct connections without restart: https://github.com/tailscale/tailscale/issues/18348
  • tailscale#10771 - Portmapping (UPnP) issues on consumer gateways: https://github.com/tailscale/tailscale/issues/10771
  • qBittorrent#20247 - Torrent client floods UPnP table with conflicting entries: https://github.com/qbittorrent/qBittorrent/issues/20247
  • Tailscale: How NAT traversal works: https://tailscale.com/blog/nat-traversal-improvements-pt-1

coilysiren added P4 and removed P3 labels 2026-05-31 07:00:41 +00:00
Reference
coilyco-flight-deck/infrastructure#106