stand up self-hosted LLM on desktop tower 3090 #87
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally filed by @coilysiren on 2026-04-28T18:20:25Z - https://github.com/coilysiren/infrastructure/issues/73
Goal
Stand up a self-hosted LLM inference stack on the desktop tower once the hardware upgrade chain (#74 SSD, #75 3090, #76 PSU) lands. Production-ish bar - OpenAI-compatible API, reachable from elsewhere on Tailscale, picks up models from a known store, doesn't fall over.
Not the bar - hosting at scale for other users, training, fine-tuning. That's later.
What an RTX 3090 24GB can actually run
24GB VRAM is the load-bearing feature. It's the biggest single capability jump from the RTX 2080's 8GB.
On HuggingFace specifically
Model hub: yes, canonical. huggingface.co is where weights live for almost every open model.
transformerslibrary is the canonical Python interface.huggingface-clifor downloads. This part is fine.TGI (Text Generation Inference): no, dead. HuggingFace put TGI into maintenance mode in December 2025 and now recommend vLLM or SGLang for new deployments. Don't start there.
So "HuggingFace is good" resolves to "use the hub for weights, ignore the server."
Framework options
TGIProductionRecommended path: Ollama week one (proves the hardware works, gives an API to point things at), then vLLM or mistral.rs once load is real. Don't try to start at vLLM - the setup cost on the first day is too high.
k3s topology - the open question
The GPU is on the desktop tower (Windows). The k3s cluster lives on kai-server (Linux). They are not the same host. Three forks:
Run inference natively on Windows, expose via Tailscale. Ollama or LM Studio on Windows, Tailscale exposes an OpenAI-compatible endpoint, k3s services on kai-server consume the endpoint over Tailnet. Lowest setup cost, no k8s integration, but the GPU never appears in the cluster.
WSL2 + k3s agent on the desktop tower. WSL2 supports CUDA passthrough as of recent NVIDIA drivers. Run a k3s agent inside WSL2, join the kai-server cluster, install NVIDIA Container Runtime + device plugin daemonset, schedule LLM workloads onto the desktop. Real k8s integration, but WSL2 GPU passthrough is a non-trivial path with its own debugging surface.
Dual-boot Linux on the desktop tower (or replace Windows). Cleanest k3s integration, hardest lifestyle change. Probably not worth it for a daily-driver gaming box.
Recommendation: start with fork 1 (native Windows + Tailscale). Get the model running, prove the API works, point a real workload at it. Decide on fork 2 only if the lack of k3s integration becomes load-bearing - and at that point WSL2 is the right next step, not dual-boot.
The k3s integration path is well-documented when you get there. NVIDIA Container Runtime + k8s-device-plugin daemonset is the standard recipe.
Open questions
Acceptance
openaiSDK) works.desktop-tower.mdRole-in-the-stack section updated.Related
coilysiren/eco-telemetryfor OTel-instrumenting the inference server once it's up.References
Iceboxed in the 2026-05-29 backlog burn-down: self-hosted LLM, gated on hardware chain. Reopen anytime if it becomes real.