coilyco-flight-deck/infrastructure

public-port allowlist must live in the repo and be audited externally #157

Open

opened 2026-05-27 04:59:43 +00:00 by coilysiren · 0 comments

Anchor

Sibling to the 2026-05-26 postmortem (filed alongside). Public SSH was reachable from the internet via router DMZ for an unknown stretch of time. Brute-force traffic from 60.163.139.198 had been hammering sshd for at least hours, possibly days, before we noticed. Closing the hole was a one-toggle fix at the router. The bigger problem is that nothing in the repo or in any document said which public ports should be open, so a regression (DMZ flip, new router, UPnP punch, port-triggering rule) could re-expose the host without anyone noticing for a long time.

What happened (short version)

Router was a TP-Link Archer AX20.
Virtual Servers table had only factorio (34197/udp) and eco (3000-3003). Looked clean.
DMZ was on, pointed at kai-server's LAN IP (192.168.0.194). Every port on kai-server was public.
sshd had been absorbing brute-force attempts from 60.163.139.198 at roughly one attempt every 8 seconds for hours. No fail2ban, no ufw, just sshd handling it. No logins succeeded (key-only auth), but the noise crowded legitimate lines out of the journal.
Fix: added explicit Virtual Servers for 80/tcp and 443/tcp to 192.168.0.194, turned DMZ off, verified public HTTPS still works.
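
To put the journal noise in perspective, the observed rate can be turned into attempt counts. This is a back-of-the-envelope sketch: the 8-second interval is the approximate figure quoted above, and the function name is made up for illustration.

```python
def brute_force_attempts(duration_hours: float, seconds_per_attempt: float = 8.0) -> int:
    """Estimate how many login attempts fit into a window at a steady rate."""
    return int(duration_hours * 3600 // seconds_per_attempt)


# One hour and one full day at roughly one attempt every 8 seconds.
per_hour = brute_force_attempts(1)    # 450
per_day = brute_force_attempts(24)    # 10800
```

Each attempt can produce more than one journal line, so these counts are a lower bound on the log volume rather than a measurement of it.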

Why the postmortem rates this high

Public exposure is the highest-blast-radius regression class on the homelab. A misconfigured port forward outranks every other operational fault by orders of magnitude. Bridge churn kills SSH for 5 minutes; a publicly reachable kube-apiserver could be a takeover. The closest thing we have to a defense is the memory of which ports are supposed to be open, and that memory currently lives only in this chat transcript.

Proposed allowlist (canonical)

These are the only ports that should be reachable from the public internet, forwarded to 192.168.0.194 (kai-server):

80/tcp - caddy HTTP, ACME challenges + redirect to HTTPS
443/tcp - caddy HTTPS, serves all *.coilysiren.me via reverse proxy
3000/udp - Eco game traffic
3001/tcp - Eco web UI
3002/tcp - Eco game protocol
3003/? - Eco (Kai mentioned 3000-3003 range; protocol unverified, document when confirmed)
34197/udp - factorio default port

Explicitly not allowlisted:

22/tcp - SSH is tailnet-only via kai@kai-server over tailscale
6443/tcp - kube-apiserver is tailnet-only via kubeconfig pinned to kai-server:6443
53096/tcp - tailscale PeerAPI, never public
41641/udp - tailscale itself, NATs through cleanly with no port-forward needed
anything else not in the allowlist above
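
The two lists above could also be kept in a machine-readable form so that a probe or audit script can consume them. The following is a minimal sketch, not an existing module in this repo: the names PortRule, parse_rule, ALLOWLIST, and is_allowed are hypothetical, and treating the unverified 3003/? entry as matching either protocol is an assumption that should be tightened once the protocol is confirmed.

```python
from typing import List, NamedTuple, Optional


class PortRule(NamedTuple):
    port: int
    proto: Optional[str]  # "tcp", "udp", or None while the protocol is unverified
    note: str


def parse_rule(spec: str, note: str = "") -> PortRule:
    """Parse a 'port/proto' entry such as '443/tcp' or '3003/?'."""
    port_text, sep, proto = spec.partition("/")
    if not sep:
        raise ValueError(f"missing protocol in {spec!r}")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range in {spec!r}")
    if proto == "?":
        return PortRule(port, None, note)
    if proto not in ("tcp", "udp"):
        raise ValueError(f"unknown protocol in {spec!r}")
    return PortRule(port, proto, note)


# Entries copied from the proposed allowlist above.
ALLOWLIST: List[PortRule] = [
    parse_rule("80/tcp", "caddy HTTP"),
    parse_rule("443/tcp", "caddy HTTPS"),
    parse_rule("3000/udp", "Eco game traffic"),
    parse_rule("3001/tcp", "Eco web UI"),
    parse_rule("3002/tcp", "Eco game protocol"),
    parse_rule("3003/?", "Eco, protocol unverified"),
    parse_rule("34197/udp", "factorio"),
]


def is_allowed(port: int, proto: str, allowlist: List[PortRule] = ALLOWLIST) -> bool:
    """True if port/proto is allowlisted; an unverified rule matches either protocol."""
    return any(r.port == port and r.proto in (proto, None) for r in allowlist)
```

With this shape, is_allowed(443, "tcp") is true, while is_allowed(22, "tcp") and is_allowed(6443, "tcp") are false because anything absent from the list is denied by default.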

Where the canonical list should live

Two layers, both load-bearing:

docs/public-ports.md in this repo. Plain markdown. Source of truth. References this issue, links to the postmortem.
A periodic external probe that hits the public IP from off-LAN and asserts that the set of open ports matches the allowlist. If a port outside the list is reachable, alert. Implementation: a coily verb (coily exec audit-public-ports) plus a GitHub Actions or Forgejo Actions workflow that runs it from a cloud runner so the source IP is off-LAN. Output goes to the vault inbox or Slack/Discord/wherever Kai actually reads.
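
The core of such a probe could look roughly like the sketch below. It is not the coily verb itself; the function names are hypothetical. It covers TCP only, on the assumption that a plain connect attempt is an acceptable reachability test. The UDP entries (3000/udp, 34197/udp) cannot be verified this way and would need a protocol-aware check.

```python
import socket
from typing import Callable, Iterable, List, Set


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or unresolvable: treat as closed.
        return False


def unexpected_open_ports(
    host: str,
    candidates: Iterable[int],
    allowed_tcp: Set[int],
    probe: Callable[[str, int], bool] = tcp_port_open,
) -> List[int]:
    """List candidate TCP ports that answer even though they are not allowlisted."""
    return sorted(
        port for port in set(candidates)
        if port not in allowed_tcp and probe(host, port)
    )
```

A run from an off-LAN machine might call unexpected_open_ports(public_ip, [22, 6443, 53096], {80, 443, 3001, 3002}) and alert on any non-empty result. The probe parameter is injectable so the comparison logic can be exercised without touching the network.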

Regression prevention checklist

When ANY of these happen, re-audit:

Router replaced or factory-reset
Router firmware update that resets settings
New port-forward added for any reason (game, lab service)
UPnP toggled
DMZ touched
ISP changes the public IP (/coilysiren/home/public-ip in SSM should track this)
New kai-server LAN IP

Audit = walk Virtual Servers, walk UPnP service list, walk Port Triggering, walk DMZ, walk DDNS, compare to allowlist.
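
The comparison step of that walk can be sketched as a small function over rules transcribed by hand from the router admin UI. Everything here is illustrative: audit_router and its input shape are hypothetical, and nothing in it talks to the router.

```python
from typing import Dict, List, Set, Tuple

Rule = Tuple[int, str]  # (port, protocol), e.g. (443, "tcp")


def audit_router(
    sections: Dict[str, Set[Rule]],
    dmz_enabled: bool,
    allowlist: Set[Rule],
) -> List[str]:
    """Compare forwarding rules from each router section against the allowlist."""
    findings: List[str] = []
    if dmz_enabled:
        findings.append("DMZ is on: every port on the DMZ host is public")
    # Anything forwarded by any section but missing from the allowlist.
    for name in sorted(sections):
        for port, proto in sorted(sections[name] - allowlist):
            findings.append(f"{name}: {port}/{proto} is not in the allowlist")
    # Allowlisted ports with no explicit rule are only reachable by accident, if at all.
    forwarded: Set[Rule] = set().union(*sections.values())
    for port, proto in sorted(allowlist - forwarded):
        findings.append(f"allowlisted {port}/{proto} has no forwarding rule")
    return findings
```

Fed the state described in this issue (DMZ on, Virtual Servers holding only the factorio and Eco entries), it would report the DMZ plus the missing 80/tcp and 443/tcp rules, which is the same gap the fix closed. The unverified 3003 entry is left out of this picture until its protocol is known.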

What I didn't audit tonight

UPnP service list - might have stale punched holes. Should be reviewed and probably disabled entirely if not in use.
Port Triggering rules.
DDNS / dynamic DNS settings.
WAN ping / WAN access settings.
Guest network isolation.

Each one is a potential bypass of the Virtual Servers allowlist.

Out of scope

The brute-force traffic itself. Even with port 22 closed publicly, putting fail2ban or similar on kai-server is still good hygiene for the brief windows when something is exposed by accident. Not blocking, separate hardening task.
Public-facing reverse proxy hardening (caddy WAF rules, rate limits, etc.). Different layer.

How to apply

Land docs/public-ports.md with the allowlist and the regression-prevention checklist above.
File a follow-up issue for the external-probe audit script (or implement here if scope creep is OK).
Audit UPnP service list at the router next time Kai is near the admin UI.

