Mirror ghcr.io images to a backup registry for outage resilience #50

Closed
opened 2026-05-23 20:54:33 +00:00 by coilysiren · 1 comment
Owner

Originally filed by @coilysiren on 2026-05-15T04:37:29Z - https://github.com/coilysiren/infrastructure/issues/166

Problem

When GitHub is unreliable (API rate limiting or an outage), k3s deploys fail, because ghcr.io is the only image registry for the coilysiren-backend, eco-mcp-app, and galaxy-gen namespaces. The 2026-05-14 incident made this concrete: a PAT rotation broke ghcr.io pulls in all three namespaces simultaneously, and the dashboard had no backup pull path.

Goal

Every image published to ghcr.io/coilysiren/* also lands in a second registry. Deploys can fail over to the backup without a code change.
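As a rough illustration of the goal, the sketch below lists the references a single release would need to push so that both registries carry the same tags. It assumes the `:latest` and `:<sha>` tag scheme listed under Tasks. The function name, the image name `example-app`, and the backup prefix `registry.example.com/coilysiren` are hypothetical placeholders, since no backup registry has been picked.

```python
# Hypothetical sketch: compute the image references for a dual push.
# PRIMARY_PREFIX comes from the issue; the backup prefix is a placeholder
# because the backup registry is still undecided.
PRIMARY_PREFIX = "ghcr.io/coilysiren"


def mirror_refs(image: str, sha: str, backup_prefix: str) -> dict[str, list[str]]:
    """Return the primary and backup references to push for one release."""
    tags = ["latest", sha]  # tag scheme matches ghcr: :latest and :<sha>
    return {
        "primary": [f"{PRIMARY_PREFIX}/{image}:{tag}" for tag in tags],
        "backup": [f"{backup_prefix}/{image}:{tag}" for tag in tags],
    }


if __name__ == "__main__":
    refs = mirror_refs("example-app", "abc1234", "registry.example.com/coilysiren")
    for registry, names in refs.items():
        for name in names:
            print(registry, name)
```

A CI job could loop over both lists and push each reference, so the backup never lags behind the primary by a tag.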

Scope

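In-tree workloads currently pulling from ghcr.io:

  • coilysiren-backend namespace (per backend/deploy/main.yml: https://github.com/coilysiren/backend/blob/main/deploy/main.yml)
  • eco-mcp-app namespace (per eco-mcp-app/deploy/main.yml: https://github.com/coilysiren/eco-mcp-app/blob/main/deploy/main.yml)
  • galaxy-gen namespace (per galaxy-gen/deploy/main.yml: https://github.com/coilysiren/galaxy-gen/blob/main/deploy/main.yml)
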
Pull credential is the shared /github/pat in SSM, synced to k8s via external-secrets.
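For reference, a pull secret covering both registries could carry a payload shaped like the one built below. This is a minimal sketch of the standard `.dockerconfigjson` layout used by `kubernetes.io/dockerconfigjson` secrets. It does not describe how external-secrets is configured here. The backup host `registry.example.com` and all usernames and passwords are placeholders.

```python
import base64
import json


def dockerconfigjson(credentials: dict[str, tuple[str, str]]) -> str:
    """Build a .dockerconfigjson payload from {registry: (username, password)}."""
    auths = {}
    for registry, (username, password) in credentials.items():
        # The "auth" field is base64("username:password").
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        auths[registry] = {
            "username": username,
            "password": password,
            "auth": token,
        }
    return json.dumps({"auths": auths}, indent=2)


if __name__ == "__main__":
    # Placeholder credentials; real values would come from the secret store.
    print(
        dockerconfigjson(
            {
                "ghcr.io": ("example-user", "example-pat"),
                "registry.example.com": ("example-user", "example-token"),
            }
        )
    )
```

Keeping both registries in one payload is only one option. Separate secrets per registry would let a broken ghcr.io credential be rotated without touching the backup one.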

Tasks

  • Pick a backup registry. Candidates: Docker Hub (free, public, simple), AWS ECR (already authenticated via coily, private), fly.io registry (already paying), GHCR-but-different-org-as-PoC (defeats the purpose). Decision criteria: independent failure domain from GitHub, cheap, fast to pull from kai-server.
  • CI: dual-push every release tag from each repo's existing build workflow. Tag scheme matches ghcr (:latest and :<sha>).
  • Add imagePullSecrets for the backup registry alongside the existing ghcr.io one in each namespace.
  • Add a manifest-level switch (env var or kustomize overlay) that flips image: from ghcr.io/... to <backup>/... without re-pushing manifests.
  • Local deploy script that pulls from the backup registry by default, so kai can roll out from her laptop when GitHub is wedged.
  • Document the failover procedure in infrastructure/docs/architecture.md (https://github.com/coilysiren/infrastructure/blob/main/docs/architecture.md) and the per-repo READMEs.
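The manifest-level switch in the task list could take several forms. The sketch below is a plain text rewrite, not the env var or kustomize overlay the task names, and is meant only to show the intended effect on `image:` lines. It assumes unquoted `image:` values. The function name and the backup prefix `registry.example.com/coilysiren/` are hypothetical.

```python
import re

PRIMARY_PREFIX = "ghcr.io/coilysiren/"

# Matches the start of an `image:` line (optionally a list item) whose
# value begins with the primary registry prefix.
_IMAGE_LINE = re.compile(
    r"^(\s*(?:-\s*)?image:\s*)" + re.escape(PRIMARY_PREFIX),
    re.MULTILINE,
)


def flip_images(manifest: str, backup_prefix: str) -> str:
    """Point ghcr.io/coilysiren/ images at backup_prefix; leave others alone."""
    return _IMAGE_LINE.sub(lambda m: m.group(1) + backup_prefix, manifest)


if __name__ == "__main__":
    manifest = (
        "containers:\n"
        "  - name: app\n"
        "    image: ghcr.io/coilysiren/example-app:latest\n"
        "  - name: db\n"
        "    image: postgres:16\n"
    )
    print(flip_images(manifest, "registry.example.com/coilysiren/"))
```

Third-party images such as `postgres` are left untouched, which matches the out-of-scope note about upstream images.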

Acceptance

A simulated ghcr.io outage (e.g. kubectl edit secret docker-registry to break the auth) does not block a fresh deploy. The backup registry serves the image; a documented one-liner flips manifests over.

Out of scope

  • Mirroring third-party images (postgres, redis, etc). Those should pull from upstream Docker Hub already.
  • Replacing ghcr.io as primary. Goal is redundancy, not migration.

Refs

  • 2026-05-14 GitHub API rate-limit incident, plus PAT-rotation cascade.
  • coilyco-ai discussion (this session): https://github.com/coilysiren/coilyco-ai/issues/526 - related work on the rate-limit dashboard and CI poll cadence.

Author
Owner

Iceboxed in the 2026-05-29 backlog burn-down: mirror ghcr images, superseded by in-cluster registry #168. Reopen anytime if it becomes real.

Reference
coilyco-flight-deck/infrastructure#50