Three-node k3s HA control plane with embedded etcd quorum #31

Closed
opened 2026-05-23 20:54:31 +00:00 by coilysiren · 1 comment
Owner

Originally filed by @coilysiren on 2026-05-22T07:17:21Z - https://github.com/coilysiren/infrastructure/issues/237

Goal - Move the k3s homelab onto a dedicated, highly-available control plane: 3 Linux server nodes running embedded etcd, tolerant of any single node going down.

Surfaced by - 2026-05-21 night planning session. Kai wants kai-server, the desktop tower, and a "third thing" to each validate the other two. The concept is correct (etcd Raft quorum). One correction below changes the topology.

Load-bearing constraint - the Windows tower cannot be a control-plane node

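K3s server (control-plane) nodes are Linux-only. There is no Windows k3s server. At best, Windows can join a cluster as an agent (worker) node, and Windows agent support should be confirmed against the current k3s documentation before the tower join is attempted (RKE2 is the related distribution that documents Windows workers). Either way, the tower is a GPU agent, not an etcd member. This is also the right call on its own merits: the tower is a part-time gaming box, and an etcd member that drops out every time Kai boots into a game would churn quorum.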

Corrected topology

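  • 3 Linux control-plane nodes carry etcd quorum (quorum = 2 of 3, tolerates losing any 1): kai-server plus two small dedicated Linux mini PCs. In the bring-up order under Proposed work, the first mini PC is initialized first and kai-server joins as server node 2.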
  • The Windows tower joins as a GPU agent node and runs the Qwen + OpenClaw workload (see the OpenClaw deploy issue).
  • The first server node starts with --cluster-init; the other two join with --server https://<first-server>:6443 and the same cluster token.
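
The "2 of 3, tolerates losing any 1" figure follows from the majority rule that Raft-based etcd uses. A minimal sketch of that arithmetic in bash; the function names `quorum` and `fault_tolerance` are illustrative, not part of any k3s or etcd tooling:

```shell
#!/usr/bin/env bash
# Majority quorum for a cluster of N voting etcd members: floor(N/2) + 1.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can be down while a majority is still reachable.
fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

# Compare small cluster sizes side by side.
for n in 1 2 3 4 5; do
  printf 'members=%d quorum=%d tolerates=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

Two members tolerate no failures at all, which is why the plan goes from one dedicated control node straight to three rather than stopping at two.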

Cost-benefit

  • Minimum viable (~$150-200): one N100-class mini PC, Ubuntu pre-loaded, 16 GB RAM, wired ethernet. Make it the sole dedicated control node. Biggest improvement per dollar - the control plane stops sharing kai-server with game servers.
  • Proper 3-node HA (~$300-400): two mini PCs + kai-server. Cluster API and scheduling survive any one node rebooting. Given the tower is deliberately part-time, the always-on Linux nodes keep the cluster healthy regardless.
  • Skip: Raspberry Pi (etcd on SD-card I/O is miserable), Kat's storage box (storage appliance, not an etcd member).
  • Running cost: N100 mini PCs idle ~6-10 W, negligible. etcd over a wired LAN is fine under ~10 ms latency.
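
To put the idle-power figure in annual terms, a rough sketch that converts a constant draw in watts to kWh per year. It assumes the node idles around the clock; `annual_kwh` is a hypothetical helper, and no electricity price is assumed:

```shell
#!/usr/bin/env bash
# Annual energy (kWh) for a device drawing a constant number of watts,
# assuming it runs 24 hours a day, 365 days a year.
annual_kwh() {
  local watts=$1
  awk -v w="$watts" 'BEGIN { printf "%.1f\n", w * 24 * 365 / 1000 }'
}

# Low and high ends of the ~6-10 W idle range quoted above.
annual_kwh 6
annual_kwh 10
```

Multiplying the result by the local price per kWh gives the yearly cost per mini PC.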

Proposed work

  • Buy one mini PC (Ubuntu pre-loaded ideal), bring up as a dedicated k3s server node with --cluster-init.
  • Migrate kai-server's k3s to embedded-etcd HA mode and join it as server node 2.
  • Add a second mini PC as server node 3 for full quorum.
  • Join the Windows tower as a GPU agent node.
  • All node bring-up scripted (bash) and committed - generic scripts to coilysiren/agentic-os, infra-specific to coilysiren/infrastructure. No ad-hoc one-off setup.
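
A possible starting shape for the scripted server bring-up, offered as a sketch rather than the final script. It only prints the install command for each role so it can be reviewed before anything runs. The `get.k3s.io` installer, `K3S_TOKEN`, `--cluster-init`, and `--server` are standard k3s usage; the hostname `node1.example.lan`, the function name `k3s_server_cmd`, and the token value are placeholders invented for illustration. The Windows agent join is out of scope here.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the k3s install command for a server node.
# FIRST_SERVER_URL is a hypothetical placeholder for the first node's API address.
FIRST_SERVER_URL="https://node1.example.lan:6443"

k3s_server_cmd() {
  local role=$1 token=$2
  case "$role" in
    init)
      # First server: bootstraps a new embedded-etcd cluster.
      echo "curl -sfL https://get.k3s.io | K3S_TOKEN=${token} sh -s - server --cluster-init"
      ;;
    join)
      # Additional servers: join the existing cluster as etcd members.
      echo "curl -sfL https://get.k3s.io | K3S_TOKEN=${token} sh -s - server --server ${FIRST_SERVER_URL}"
      ;;
    *)
      echo "unknown role: ${role}" >&2
      return 1
      ;;
  esac
}

# REPLACE_ME stands in for the real shared cluster token.
k3s_server_cmd init "REPLACE_ME"
k3s_server_cmd join "REPLACE_ME"
```

Keeping the command construction in one function means the same script can serve all three server nodes, with the role passed as an argument.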

Related - blocks/enables the OpenClaw-on-tower deploy. Sibling of infrastructure #73 (self-hosted LLM stack).

Author
Owner

Iceboxed in the 2026-05-29 backlog burn-down: three-node HA control plane, far-future hardware play. Reopen anytime if it becomes real.

Reference
coilyco-flight-deck/infrastructure#31