Buy two Beelink SER8 mini PCs, expand K3s to a 3-node HA control plane #130

Closed
opened 2026-05-26 05:23:01 +00:00 by coilysiren · 1 comment
Owner

Problem

kai-server is currently the only K3s control-plane node. If it goes down, the whole control plane is down and nothing reschedules anywhere. That makes it a single point of failure: the cluster cannot tolerate any kai-server downtime.

Plan

Buy two Beelink SER8 mini PCs and join them as control-plane (server) nodes alongside kai-server. That makes a 3-node etcd cluster: N=3, quorum 2, tolerates one machine failure. Any single box, including kai-server, can go down and the cluster keeps scheduling. The two SER8s also host the public-facing services so blue-green deploys stop being manual.
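The "N=3, quorum 2, tolerates one machine failure" arithmetic follows from the majority rule used by Raft-based stores such as etcd. A minimal sketch of that rule is below; the function names are illustrative and are not part of K3s or etcd.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3):
    print(f"N={n}: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule a 2-node cluster tolerates no more failures than a single node, which is why the plan adds two machines rather than one.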

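Hardware

  • Beelink SER8 (AMD Ryzen 7 8845HS, 8C/16T, 32GB DDR5, 1TB NVMe, 2x M.2 slots, 2.5GbE): https://www.bee-link.com/products/beelink-ser8-8845hs
  • Buy two. Each well under the ~$1k/machine budget.
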
What this gives, and what it does not

  • Control-plane HA: yes. The 3-node cluster removes kai-server as a single point of failure.
  • Both SER8s live at the current house, while kai-server is at the other house. The single point of failure moves from "one machine" to "the 2-node site." If the current house goes fully dark, kai-server alone (1 of 3 members) cannot hold quorum and scheduling stops. Still a real upgrade, since single-box failures and routine reboots are far more common than a whole-house outage.
  • Physical-distance backup is a separate goal. kai-server should still hold independent data backups and replicas of the public services so a true house-level disaster leaves it able to serve, even if degraded and manual.
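The single-machine versus whole-house trade-off above can be checked by enumerating failures. This is a rough sketch, not a model of real etcd behavior: `ser8-a` and `ser8-b` are hypothetical names for the two new machines, and the site labels simply mirror the wording in this issue.

```python
# Node-to-site layout. kai-server is named in the issue; the SER8 names are made up.
SITES = {
    "kai-server": "other house",
    "ser8-a": "current house",
    "ser8-b": "current house",
}


def has_quorum(alive: set[str], members: int) -> bool:
    """True when the surviving nodes still form a strict majority."""
    return len(alive) >= members // 2 + 1


def survives_node_loss(sites: dict[str, str], node: str) -> bool:
    """Does the cluster keep quorum if exactly this one node is down?"""
    return has_quorum(set(sites) - {node}, len(sites))


def survives_site_loss(sites: dict[str, str], site: str) -> bool:
    """Does the cluster keep quorum if every node at this site is down?"""
    alive = {name for name, location in sites.items() if location != site}
    return has_quorum(alive, len(sites))


for node in SITES:
    print(f"lose {node}: quorum held = {survives_node_loss(SITES, node)}")
for site in sorted(set(SITES.values())):
    print(f"lose {site}: quorum held = {survives_site_loss(SITES, site)}")
```

Every single-node loss keeps quorum, and so does losing the other house, but losing the current house leaves one member of three, which is the residual site-level risk described in the list.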

Notes

  • etcd Raft heartbeats will cross the inter-house link. Run them over the Tailscale tailnet. Expect kai-server to briefly drop out of quorum and rejoin whenever the link flaps. That is churn, not data loss.
  • Consider a small UPS per site. etcd dislikes unclean shutdowns.
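Because the heartbeats cross a WAN link, etcd's timing defaults (100 ms heartbeat interval, 1000 ms election timeout) may be worth revisiting once the real inter-house round-trip time is known. The sketch below applies the rule of thumb from etcd's tuning guidance (heartbeat roughly equal to the RTT, election timeout at least 10x the RTT). The helper name, the choice never to go below the defaults, and the sample RTTs are all assumptions for illustration, not measurements or settings from this cluster.

```python
import math

DEFAULT_HEARTBEAT_MS = 100   # etcd default for --heartbeat-interval
DEFAULT_ELECTION_MS = 1000   # etcd default for --election-timeout
MAX_ELECTION_MS = 50_000     # upper limit documented by etcd


def suggest_timings(rtt_ms: float) -> tuple[int, int]:
    """Return (heartbeat_ms, election_ms) for a measured peak inter-house RTT."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    rtt = math.ceil(rtt_ms)
    # Assumption: never tune below the defaults, only loosen them for slow links.
    heartbeat = max(DEFAULT_HEARTBEAT_MS, rtt)
    # At least 10x the RTT, and at least 5x the heartbeat interval.
    election = max(DEFAULT_ELECTION_MS, 10 * rtt, 5 * heartbeat)
    return heartbeat, min(election, MAX_ELECTION_MS)


# 40 ms and 250 ms are made-up RTTs, not measurements of the real link.
for rtt in (40, 250):
    hb, el = suggest_timings(rtt)
    print(f"rtt={rtt}ms -> heartbeat={hb}ms election={el}ms")
```

A longer election timeout trades slower failover for fewer spurious leader elections when the link flaps, so any change should be driven by measured RTT over the tailnet rather than by these placeholder numbers.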

Out of scope

  • The Beelink GTR9 Pro (Ryzen AI Max+ 395, ~$1,799) for local LLM compute. Possible separate purchase in a year or two.

Kai purchases the two SER8s at her convenience, then the cluster-join work happens.


Ported from coilysiren/infrastructure#278.

Author
Owner

Iceboxed in the 2026-05-29 backlog burn-down: 3-node HA control plane, overlaps #31 (cluster fully iceboxed). Reopen anytime if it becomes real.

Reference
coilyco-flight-deck/infrastructure#130