stand up self-hosted LLM on desktop tower 3090 #87

New issue

Closed

opened 2026-05-23 20:54:40 +00:00 by coilysiren · 1 comment

coilysiren commented

2026-05-23 20:54:40 +00:00

Owner

Originally filed by @coilysiren on 2026-04-28T18:20:25Z - https://github.com/coilysiren/infrastructure/issues/73

🤖 Filed by Claude Code on Kai's behalf.

Goal

Stand up a self-hosted LLM inference stack on the desktop tower once the hardware upgrade chain (#74 SSD, #75 3090, #76 PSU) lands. Production-ish bar - OpenAI-compatible API, reachable from elsewhere on Tailscale, picks up models from a known store, doesn't fall over.

Not the bar - hosting at scale for other users, training, fine-tuning. That's later.

What an RTX 3090 24GB can actually run

Quant	Model size ceiling	Examples
FP16	~13B	Llama 3.1 13B, Qwen 2.5 14B at full precision
Q8	~24B	Llama 3.3 22B, Qwen 2.5 32B-Coder Q8 (tight)
Q4	~32B	Qwen 2.5 32B, Mixtral 8x7B at Q4
Aggressive Q (Q3/Q2)	~70B	Llama 3.1 70B at Q2_K, marginal quality

24GB VRAM is the load-bearing feature. It's the biggest single capability jump from the RTX 2080's 8GB.

On HuggingFace specifically

Model hub: yes, canonical. huggingface.co is where weights live for almost every open model. transformers library is the canonical Python interface. huggingface-cli for downloads. This part is fine.

TGI (Text Generation Inference): no, dead. HuggingFace put TGI into maintenance mode in December 2025 and now recommend vLLM or SGLang for new deployments. Don't start there.

So "HuggingFace is good" resolves to "use the hub for weights, ignore the server."

Framework options

Framework	Tier	Best for
Ollama	Dev / local	One-command pull-and-run, OpenAI-compatible API on localhost. Built on llama.cpp. No PagedAttention, no continuous batching - falls over under concurrent load. Right starting point.
llama.cpp	Low-level / bleeding-edge	GGUF-native, default for Qwen 3.6 and recent MoE work. More setup, more control. Reach for when Ollama's defaults don't fit.
vLLM	Production	PagedAttention, continuous batching, OpenAI-compatible API. 3-24x faster than Ollama under concurrent load. Right end-state when latency under load matters.
mistral.rs	Adjacent	Rust implementation, CUDA + Metal, supports vision models. @cagyirey is a contributor. Worth a real look both for the engineering quality and as a way to keep the distributed-prompting thread warm.
SGLang	Production-alt	HF's other recommended replacement for TGI. Worth knowing about, probably not worth chasing first.
~~TGI~~	~~Production~~	Dead. Skip.

Recommended path: Ollama week one (proves the hardware works, gives an API to point things at), then vLLM or mistral.rs once load is real. Don't try to start at vLLM - the setup cost on the first day is too high.

k3s topology - the open question

The GPU is on the desktop tower (Windows). The k3s cluster lives on kai-server (Linux). They are not the same host. Three forks:

Run inference natively on Windows, expose via Tailscale. Ollama or LM Studio on Windows, Tailscale exposes an OpenAI-compatible endpoint, k3s services on kai-server consume the endpoint over Tailnet. Lowest setup cost, no k8s integration, but the GPU never appears in the cluster.
WSL2 + k3s agent on the desktop tower. WSL2 supports CUDA passthrough as of recent NVIDIA drivers. Run a k3s agent inside WSL2, join the kai-server cluster, install NVIDIA Container Runtime + device plugin daemonset, schedule LLM workloads onto the desktop. Real k8s integration, but WSL2 GPU passthrough is a non-trivial path with its own debugging surface.
Dual-boot Linux on the desktop tower (or replace Windows). Cleanest k3s integration, hardest lifestyle change. Probably not worth it for a daily-driver gaming box.

Recommendation: start with fork 1 (native Windows + Tailscale). Get the model running, prove the API works, point a real workload at it. Decide on fork 2 only if the lack of k3s integration becomes load-bearing - and at that point WSL2 is the right next step, not dual-boot.

The k3s integration path is well-documented when you get there. NVIDIA Container Runtime + k8s-device-plugin daemonset is the standard recipe.

Open questions

Is fork 1 (native Windows + Tailscale) good enough long-term, or is k3s integration worth the WSL2 cost?
First model target - Qwen 2.5 32B (coder-strong, Q4 fits cleanly), Llama 3.1 13B FP16 (full precision, lower ceiling), or something else?
Storage layout - keep models on the new SN850X 2TB, or relocate to the 860 EVO 2TB? GGUFs benefit from NVMe load times.
How does this fit with the eco-telemetry observability story? Self-hosted LLM with OTel traces would be a real demo artifact.
Talk to @cagyirey about mistral.rs - is she running it locally, what's her take, does it interop with the distributed-prompting thread?

Acceptance

Hardware upgrade chain landed (#74, #75, #76).
Ollama installed on the desktop tower, at least one model pulled (Qwen 2.5 32B or Llama 3.1 13B).
OpenAI-compatible API reachable from the desktop tower locally.
API reachable from kai-server over Tailscale.
First end-to-end completion from a real client (curl, Python openai SDK) works.
desktop-tower.md Role-in-the-stack section updated.
Decision on fork 1 vs fork 2 documented (vault note or follow-up issue).

depends on #74 (SSD), #75 (3090), #76 (PSU).
Adjacent: coilysiren/eco-telemetry for OTel-instrumenting the inference server once it's up.

References

🤖 Filed by Claude Code on Kai's behalf.

_Originally filed by @coilysiren on 2026-04-28T18:20:25Z - [https://github.com/coilysiren/infrastructure/issues/73](https://github.com/coilysiren/infrastructure/issues/73)_ > 🤖 Filed by Claude Code on Kai's behalf. ## Goal Stand up a self-hosted LLM inference stack on the desktop tower once the hardware upgrade chain (#74 SSD, #75 3090, #76 PSU) lands. Production-ish bar - OpenAI-compatible API, reachable from elsewhere on Tailscale, picks up models from a known store, doesn't fall over. Not the bar - hosting at scale for other users, training, fine-tuning. That's later. ## What an RTX 3090 24GB can actually run | Quant | Model size ceiling | Examples | |-------|-------------------|----------| | FP16 | ~13B | Llama 3.1 13B, Qwen 2.5 14B at full precision | | Q8 | ~24B | Llama 3.3 22B, Qwen 2.5 32B-Coder Q8 (tight) | | Q4 | ~32B | Qwen 2.5 32B, Mixtral 8x7B at Q4 | | Aggressive Q (Q3/Q2) | ~70B | Llama 3.1 70B at Q2_K, marginal quality | 24GB VRAM is the load-bearing feature. It's the biggest single capability jump from the RTX 2080's 8GB. ## On HuggingFace specifically **Model hub: yes, canonical.** [huggingface.co](https://huggingface.co) is where weights live for almost every open model. `transformers` library is the canonical Python interface. `huggingface-cli` for downloads. This part is fine. **TGI (Text Generation Inference): no, dead.** HuggingFace put TGI into maintenance mode in December 2025 and now recommend vLLM or SGLang for new deployments. Don't start there. So "HuggingFace is good" resolves to "use the hub for weights, ignore the server." ## Framework options | Framework | Tier | Best for | |-----------|------|----------| | **Ollama** | Dev / local | One-command pull-and-run, OpenAI-compatible API on localhost. Built on llama.cpp. No PagedAttention, no continuous batching - falls over under concurrent load. Right starting point. | | **llama.cpp** | Low-level / bleeding-edge | GGUF-native, default for Qwen 3.6 and recent MoE work. More setup, more control. Reach for when Ollama's defaults don't fit. | | **vLLM** | Production | PagedAttention, continuous batching, OpenAI-compatible API. 3-24x faster than Ollama under concurrent load. Right end-state when latency under load matters. | | **mistral.rs** | Adjacent | Rust implementation, CUDA + Metal, supports vision models. @cagyirey is a contributor. Worth a real look both for the engineering quality and as a way to keep the distributed-prompting thread warm. | | **SGLang** | Production-alt | HF's other recommended replacement for TGI. Worth knowing about, probably not worth chasing first. | | ~~TGI~~ | ~~Production~~ | Dead. Skip. | **Recommended path:** Ollama week one (proves the hardware works, gives an API to point things at), then vLLM or mistral.rs once load is real. Don't try to start at vLLM - the setup cost on the first day is too high. ## k3s topology - the open question The GPU is on the **desktop tower (Windows)**. The k3s cluster lives on **kai-server (Linux)**. They are not the same host. Three forks: 1. **Run inference natively on Windows, expose via Tailscale.** Ollama or LM Studio on Windows, Tailscale exposes an OpenAI-compatible endpoint, k3s services on kai-server consume the endpoint over Tailnet. Lowest setup cost, no k8s integration, but the GPU never appears in the cluster. 2. **WSL2 + k3s agent on the desktop tower.** WSL2 supports CUDA passthrough as of recent NVIDIA drivers. Run a k3s agent inside WSL2, join the kai-server cluster, install NVIDIA Container Runtime + device plugin daemonset, schedule LLM workloads onto the desktop. **Real k8s integration, but WSL2 GPU passthrough is a non-trivial path** with its own debugging surface. 3. **Dual-boot Linux on the desktop tower** (or replace Windows). Cleanest k3s integration, hardest lifestyle change. Probably not worth it for a daily-driver gaming box. **Recommendation:** start with fork 1 (native Windows + Tailscale). Get the model running, prove the API works, point a real workload at it. Decide on fork 2 only if the lack of k3s integration becomes load-bearing - and at that point WSL2 is the right next step, not dual-boot. The k3s integration path is well-documented when you get there. NVIDIA Container Runtime + [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) daemonset is the standard recipe. ## Open questions - [ ] Is fork 1 (native Windows + Tailscale) good enough long-term, or is k3s integration worth the WSL2 cost? - [ ] First model target - Qwen 2.5 32B (coder-strong, Q4 fits cleanly), Llama 3.1 13B FP16 (full precision, lower ceiling), or something else? - [ ] Storage layout - keep models on the new SN850X 2TB, or relocate to the 860 EVO 2TB? GGUFs benefit from NVMe load times. - [ ] How does this fit with the eco-telemetry observability story? Self-hosted LLM with OTel traces would be a real demo artifact. - [ ] Talk to @cagyirey about mistral.rs - is she running it locally, what's her take, does it interop with the distributed-prompting thread? ## Acceptance - [ ] Hardware upgrade chain landed (#74, #75, #76). - [ ] Ollama installed on the desktop tower, at least one model pulled (Qwen 2.5 32B or Llama 3.1 13B). - [ ] OpenAI-compatible API reachable from the desktop tower locally. - [ ] API reachable from kai-server over Tailscale. - [ ] First end-to-end completion from a real client (curl, Python `openai` SDK) works. - [ ] `desktop-tower.md` Role-in-the-stack section updated. - [ ] Decision on fork 1 vs fork 2 documented (vault note or follow-up issue). ## Related - depends on #74 (SSD), #75 (3090), #76 (PSU). - Adjacent: `coilysiren/eco-telemetry` for OTel-instrumenting the inference server once it's up. ## References - [Ollama vs vLLM vs TGI benchmark 2026 - Medium](https://medium.com/@anupkawarase.akz/ollama-vs-vllm-vs-tgi-local-llm-serving-benchmark-2026-ba7d8474fea7) - [vLLM vs TGI vs TensorRT-LLM vs Ollama - Hivenet](https://www.hivenet.com/post/vllm-vs-tgi-vs-tensorrt-llm-vs-ollama) - [Battle-tested LLM-on-3090 guide - keturk/llm_on_rtx_3090](https://github.com/keturk/llm_on_rtx_3090) - [RTX 3090 LLM benchmarks - Hardware Corner](https://www.hardware-corner.net/gpu-llm-benchmarks/rtx-3090/) - [k3s GPU workloads guide - OneUptime](https://oneuptime.com/blog/post/2026-03-20-k3s-gpu-workloads/view) - [NVIDIA k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) - [mistral.rs](https://github.com/EricLBuehler/mistral.rs) > 🤖 Filed by Claude Code on Kai's behalf.

coilysiren commented

2026-05-30 05:43:20 +00:00

Author

Owner

Iceboxed in the 2026-05-29 backlog burn-down: self-hosted LLM, gated on hardware chain. Reopen anytime if it becomes real.

coilysiren closed this issue

2026-05-30 05:43:20 +00:00