docs: create Netgrimoire/K3s_Convert

This commit is contained in:
Administrator 2026-04-18 16:04:11 +00:00 committed by John Smith
parent a16af2b407
commit 4f76f7aacb

---
title: Kubernetes Conversion
description:
published: true
date: 2026-04-18T16:04:01.503Z
tags:
editor: markdown
dateCreated: 2026-04-18T16:04:01.503Z
---
# Netgrimoire → k3s Migration Plan
> **Status:** PLANNED — Pre-conditions not yet met. Do not begin Phase 1 until all prerequisites are checked off.
>
> **Last Updated:** 2026-04-18
> **Author:** graymutt
---
## Overview
Netgrimoire currently runs on Docker Swarm, a mature and stable orchestrator that has served the lab well. The Docker v29 release introduced breaking changes that disproportionately affect Swarm environments, and Swarm's development pace has effectively stalled. Mirantis has committed support through 2030, so there is no emergency — but the long-term trajectory points toward Kubernetes.
This document captures the **strategic decision, prerequisites, architecture plan, and phased migration path** for moving the Netgrimoire service catalog from Docker Swarm to k3s. The migration will be gradual, parallel, and Gremlin-assisted. No cutover deadline exists. Services move when they are ready.
### What Stays on Docker Compose/Swarm Permanently
**MailCow is explicitly excluded from this migration.** It runs as a native Docker Compose stack on docker4/hermes, is maintained by its upstream authors as a Compose-first project, and has complex internal networking that should not be disturbed. It will receive a dedicated static IP (ATT_Mail) and direct nginx-mailcow exposure — cleaner than routing through any reverse proxy. CrowdSec integration for MailCow is handled separately via host-level agent on docker4/hermes feeding signals back to the OPNsense bouncer.
Any other service delivered exclusively as a Docker Compose file, with no Helm chart and no homelab k8s community support, may also remain on Swarm at the operator's discretion.
---
## Why k3s
k3s is the appropriate Kubernetes distribution for Netgrimoire because:
- Lightweight single-binary install — no external etcd, no complex bootstrapping
- Well-maintained by Rancher/SUSE with strong homelab adoption
- Ships with Traefik ingress and local-path storage class out of the box
- Strong ARM64 support — better than Swarm's for the Pi nodes
- Large ecosystem of Helm charts covering the majority of the Netgrimoire service catalog
- Compatible with the planned Gremlin CI/CD pipeline (Forgejo webhooks → n8n → `kubectl apply`)
---
## Prerequisites — Complete Before Phase 1
These are the conditions that must be true before any migration work begins. The migration is not the priority right now. Getting the lab into a known-good, fully monitored, fully backed-up state is.
### Monitoring Stack
- Prometheus + Grafana + Alertmanager fully operational
- cAdvisor and node-exporter running on all Swarm nodes with correct hostname resolution
- Blackbox Exporter configured for all external endpoints
- Alertmanager routing to `netgrimoire-alerts` ntfy topic confirmed working
- All critical services covered by Uptime Kuma checks
### Backup Stack
- Kopia two-tier backup (local + offsite) operational for all stateful services
- Immich pg_dump + Kopia wrapper confirmed
- Nextcloud AIO BorgBackup confirmed
- MailCow backup_and_restore.sh schedule confirmed
- ZFS syncoid replication to Pocket Grimoire vault confirmed
- At least one full restore test completed per critical service
### Security Baseline
- OPNsense Spamhaus + GeoIP rules re-enabled and confirmed working
- CrowdSec Caddy bouncer deployed
- Suricata configured
- Zenarmor installed
- dnscrypt-proxy installed
- os-git-backup installed
- SSH authorized keys pulling from LLDAP on all cluster hosts
### Gremlin Stack
- Ollama + Open WebUI + Qdrant + n8n fully operational on znas
- Forgejo webhook → n8n → `docker stack deploy` CI/CD pipeline working for Swarm
- n8n Kompose conversion workflow built and tested on at least 3 representative stacks
- Conversion prompt/rules file committed to Forgejo as `gremlin/k3s-conversion-rules.md`
- ntfy integration confirmed for CI/CD pipeline notifications
### Documentation
- All current Swarm stacks audited against canonical template standard
- All non-conforming stacks corrected and committed to Forgejo
- Wiki pages current for all production services
---
## Architecture: Target State
### Node Roles
| Node | Current Role | k3s Role |
|---|---|---|
| znas (192.168.5.10) | Swarm primary manager | k3s server (control plane) |
| docker3 | Swarm worker | k3s agent |
| docker4/hermes (192.168.5.16) | Swarm worker + MailCow host | MailCow stays here — k3s agent optional |
| docker5 (192.168.5.18) | Swarm worker + Jellyfin host | k3s agent |
| DockerPi1 (Pi 4, aarch64) | Swarm worker | k3s agent (ARM64) |
| Pi 3 (arm) | Swarm worker | k3s agent (ARM) or retire |
znas remains the primary node in k3s as it was in Swarm — it holds the ZFS vault pool and NFS exports that everything depends on.
### Ingress Architecture
The five ATT static IPs enable a clean parallel ingress setup during transition:
| IP | Assignment |
|---|---|
| ATT_Web (current) | Existing Caddy — Swarm services during transition |
| ATT_Mail | MailCow nginx-mailcow direct exposure (permanent) |
| ATT_WireGuard | WireGuard (unchanged) |
| ATT_Spare | **k3s Traefik ingress during migration** |
| ATT_Admin | Admin/management access |
This means Caddy and Traefik run simultaneously on separate external IPs with no conflicts. Services migrate by moving their public DNS record from the Caddy IP to the Traefik IP. Rollback is a single DNS change. No cutover deadline pressure exists.
Internal DNS at 192.168.5.7 mirrors this — internal hostnames can point at either ingress independently of external DNS, enabling LAN-side testing before any external DNS change.
### Storage Strategy
| Current Pattern | k3s Equivalent |
|---|---|
| `/data/nfs/znas/Docker/<service>` bind mount | NFS PersistentVolume via NFS CSI driver pointing at existing znas exports |
| `/DockerVol/<service>` with placement constraint | hostPath volume with nodeSelector pinning to same host |
| Docker named volume (`local` driver) | hostPath or local-path PV on constrained node |
The NFS CSI driver (`csi-driver-nfs`) is preferred over the older nfs-subdir-external-provisioner. Install via Helm into the `kube-system` namespace. Point at `192.168.5.10:/export/` with a StorageClass named `nfs-znas`.
No data moves during migration. The NFS exports on znas remain exactly as they are. k3s pods mount the same paths that Swarm containers did.
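A minimal sketch of the pattern: the StorageClass name (`nfs-znas`) and export target match the plan above; the provisioner name is the one csi-driver-nfs registers. The PVC name and size are purely illustrative.

```yaml
# Illustrative only — service name and sizes are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-znas
provisioner: nfs.csi.k8s.io
parameters:
  server: 192.168.5.10
  share: /export
reclaimPolicy: Retain
---
# A migrated service then claims storage instead of bind-mounting:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myservice-config
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs-znas
  resources:
    requests:
      storage: 5Gi
```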
### Security Integration
**CrowdSec:** The OPNsense bouncer continues to provide network-level blocking. The k3s Traefik ingress supports the CrowdSec bouncer middleware natively via the `crowdsec-bouncer-traefik-plugin`. This replaces the `caddy.import_1=crowdsec` label pattern.
**Authentik:** Traefik supports Authentik forward auth as a middleware. The `caddy.import_2=authentik` label pattern becomes a Traefik `ForwardAuth` middleware annotation on each Ingress resource.
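A sketch of what that middleware could look like as a Traefik CRD. The Authentik service name, namespace, and port here are assumptions to be matched against the actual deployment; the `/outpost.goauthentik.io/auth/traefik` path is Authentik's documented forward-auth endpoint. Older bundled Traefik versions use the `traefik.containo.us/v1alpha1` API group instead.

```yaml
# Hypothetical names — adjust service/namespace to the real Authentik install.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forwardauth
  namespace: kube-system
spec:
  forwardAuth:
    address: http://authentik-server.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
      - X-authentik-groups
```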
**PUID/PGID:** Replaced by `securityContext.runAsUser` and `securityContext.runAsGroup` at the pod spec level. Value remains 1964:1964.
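In pod-spec terms the 1964:1964 mapping looks like the fragment below (container name and image are illustrative; whether `fsGroup` is also needed depends on the volume type, so treat it as an assumption):

```yaml
spec:
  securityContext:
    runAsUser: 1964
    runAsGroup: 1964
    fsGroup: 1964        # assumption: helps make mounted volumes group-writable
  containers:
    - name: app           # illustrative
      image: example/app:latest
      # PUID/PGID environment variables are no longer required here
```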
**ARM exclusion:** The Swarm constraint `node.platform.arch != aarch64` becomes a Kubernetes node affinity rule:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/arch
              operator: NotIn
              values:
                - arm64
                - arm
```
---
## Gremlin-Assisted Conversion Pipeline
### How It Works
The existing Forgejo → n8n CI/CD pipeline gains a migration branch track:
```
Push Swarm YAML to migration/ branch in Forgejo
→ Forgejo webhook fires to n8n
→ n8n calls Kompose for initial structural conversion
→ n8n calls Ollama with conversion-rules.md + Kompose output
→ AI applies Netgrimoire-specific transformations (labels, storage, security)
→ n8n runs kubeval / kubectl --dry-run=client against output
→ If valid: write manifests to k3s-manifests/ repo in Forgejo
→ Second webhook: kubectl apply on k3s cluster
→ ntfy notification to netgrimoire-alerts with diff summary
→ If invalid: ntfy alert with error + original YAML for human review
```
Gremlin handles the mechanical conversion. Human review is triggered only on validation failure or for intentionally flagged complex cases.
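The validation gate in the middle of that pipeline amounts to something like the following (filenames, directory layout, and the loop structure are illustrative — the real logic lives in the n8n workflow):

```shell
# Hypothetical sketch of the conversion + validation steps; paths are assumptions.
kompose convert -f swarm-stack.yml -o converted/      # initial structural pass
# ... Ollama transformation step rewrites converted/*.yaml per the rules file ...
for f in converted/*.yaml; do
  # Schema/syntax gate — any failure routes to the human-review ntfy alert
  kubectl apply --dry-run=client -f "$f" || exit 1
done
```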
### Conversion Rules File
The file `gremlin/k3s-conversion-rules.md` in Forgejo is the authoritative mapping document. It defines:
- Caddy label → Traefik Ingress annotation translation
- `caddy.import_1=crowdsec` → CrowdSec middleware annotation
- `caddy.import_2=authentik` → ForwardAuth middleware annotation
- Homepage label → k8s Homepage annotation schema
- DIUN label → DIUN k8s annotation format
- Placement constraint → node affinity translation
- `/DockerVol/` → hostPath + nodeSelector pattern
- `/data/nfs/znas/` → NFS PVC pattern
- `PUID/PGID` env vars → securityContext
- `deploy.restart_policy` → pod restartPolicy + liveness probe
This file is living documentation — update it each time a new edge case is discovered during migration. The quality of Gremlin's conversion output is directly proportional to the quality of this file.
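As an example of the kind of mapping the rules file encodes, here is a before/after sketch for the Authentik forward-auth case. The Traefik annotation uses the standard `<namespace>-<middleware>@kubernetescrd` reference form; the hostname, service name, port, and middleware name are illustrative assumptions.

```yaml
# BEFORE — Swarm deploy labels (illustrative label form):
# deploy:
#   labels:
#     caddy: service.example.net
#     caddy.import_2: authentik
#
# AFTER — k3s Ingress with a Traefik middleware annotation:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: service
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: kube-system-authentik-forwardauth@kubernetescrd
spec:
  rules:
    - host: service.example.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: service
                port:
                  number: 80
```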
### What Gremlin Handles Well
- Simple stateless services with NFS mounts
- Standard images with known Helm charts (use chart instead of conversion)
- Services with straightforward single-ingress Caddy config
- PUID/PGID, placement constraints, restart policies
- Homepage, DIUN, Kuma label translation
### What Requires Human Review
- Multi-network services with complex isolation requirements
- Services using `network_mode: host`
- Stacks with tight `depends_on` ordering requirements
- Any service where the compose file has custom build steps
- Services with non-standard volume arrangements
---
## Phased Migration
### Phase 0 — k3s Cluster Bootstrap (Prerequisites Complete)
- Install k3s on znas as server node
- Join docker3, docker5 as agent nodes
- Install NFS CSI driver, point at znas exports
- Install Traefik (ships with k3s — verify version and configure)
- Configure CrowdSec bouncer Traefik plugin
- Configure Authentik ForwardAuth middleware
- Assign ATT_Spare IP to OPNsense NAT rule → k3s Traefik (ports 80/443)
- Verify internal DNS can resolve test service on k3s
- Run 1-2 throwaway test services to confirm storage, ingress, and auth all work
- Do not touch any production Swarm services
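The bootstrap steps above follow the standard k3s quick-start flow; a sketch, using the node IPs from the table and the default token path:

```shell
# On znas (server / control plane):
curl -sfL https://get.k3s.io | sh -

# Retrieve the join token (default k3s location):
sudo cat /var/lib/rancher/k3s/server/node-token

# On docker3 and docker5 (agents) — substitute the real token:
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.5.10:6443 K3S_TOKEN=<token> sh -
```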
### Phase 1 — Gremlin Stack (Self-Referential, Low Risk)
Migrate the Gremlin stack first. It is pinned to znas, relatively self-contained, and having Gremlin running on k3s means the migration pipeline itself runs on the target platform.
Services: Ollama, Open WebUI, Qdrant, n8n
Validate: Gremlin briefings, n8n workflows, ntfy integration all continue working on k3s.
### Phase 2 — Monitoring Stack
Services: Prometheus, Grafana, Alertmanager, cAdvisor, node-exporter, Blackbox Exporter
Helm charts exist for all of these. Use the kube-prometheus-stack chart rather than converting individual Swarm stacks — it is purpose-built for k8s and far superior to the Swarm equivalent. The cAdvisor hostname workaround used in Swarm (metric_relabel_configs with node ID extraction) goes away — k8s has native node labels.
Validate: All existing dashboards and alerts still fire correctly.
### Phase 3 — Watch Grimoire
Services: Jellyfin and associated services on docker5
Jellyfin has a mature Helm chart. Hardware transcoding passthrough requires node affinity to docker5 and a device plugin or direct device mount — document the specific approach before proceeding.
Validate: Playback, transcoding, library scanning all working.
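One candidate form of the direct-device-mount approach, for an Intel/AMD VAAPI path (an NVIDIA GPU would use the device plugin route instead). This is a hypothetical pod-spec fragment, not a decided design — the `privileged` flag is the blunt option, and a device plugin or `supplementalGroups` entry for the render group would be the tighter alternative:

```yaml
spec:
  nodeSelector:
    kubernetes.io/hostname: docker5
  containers:
    - name: jellyfin
      image: jellyfin/jellyfin:latest
      securityContext:
        privileged: true      # simplest form; assumption — a device plugin avoids this
      volumeMounts:
        - name: dri
          mountPath: /dev/dri
  volumes:
    - name: dri
      hostPath:
        path: /dev/dri
```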
### Phase 4 — Shadow Grimoire
Services: *arr suite, NZBGet/SABnzbd, NZBHydra, indexers
All well-established in homelab k8s community. Helm charts available via Björn's charts or TrueCharts. These are mostly stateless beyond their config directories — straightforward migration.
Validate: Full acquisition pipeline working end-to-end.
### Phase 5 — Core Infrastructure Services
Services: Authentik, Homepage, Wiki.js, Forgejo, Portainer, Uptime Kuma, ntfy, internal DNS
These are the services that everything else depends on, so they go last. Migrate one at a time with extended parallel-run periods. Homepage in particular requires the k8s annotation schema — update all previously migrated service manifests with correct Homepage annotations before migrating Homepage itself.
Validate: Auth, documentation, monitoring UI, notifications all working from k3s.
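For reference, Homepage's Kubernetes discovery reads annotations with the `gethomepage.dev/` prefix on Ingress resources. A sketch with illustrative values:

```yaml
metadata:
  annotations:
    gethomepage.dev/enabled: "true"
    gethomepage.dev/name: "Wiki.js"        # illustrative
    gethomepage.dev/group: "Core"
    gethomepage.dev/icon: "wikijs.png"
    gethomepage.dev/description: "Documentation"
```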
### Phase 6 — Remaining Services + Swarm Decommission
Migrate any remaining services not covered above. Once all services are confirmed stable on k3s:
- Archive Swarm stack YAML files in Forgejo (do not delete — historical reference)
- Leave docker4/hermes running MailCow on Compose — this is permanent
- Decommission Swarm: `docker swarm leave --force` on all nodes
- Reclaim ATT_Web IP — reassign to k3s Traefik or consolidate
- Update all Wiki documentation to reflect k3s as the platform
---
## Green Grimoire and Pocket Grimoire
**Green Grimoire** runs on its own dedicated host separate from the main Swarm. It can be evaluated for k3s independently or remain on Docker Compose — the isolation from the main cluster is a feature, not a problem. No urgency to migrate.
**Pocket Grimoire** is the travel lab running on a laptop with a ZFS pool. It runs Stash (read-only), Jellyfin, Wiki.js (pull-only), and Filebrowser. These are simple enough that Docker Compose is appropriate for the travel context — low overhead, easy to manage offline. No migration planned.
---
## Rollback Strategy
At any point during the migration, rollback of a specific service is a single DNS change — point the internal or external hostname back at the Caddy/Swarm IP. The Swarm stack is not torn down until Phase 6. This is the primary reason for the parallel ingress approach.
If k3s itself has a critical failure, Swarm is still running and services can be redirected back in minutes.
---
## Key Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Traefik behavior differs from Caddy in subtle ways | Extended parallel-run period per service before DNS cutover |
| NFS PVC binding issues on multi-node cluster | Test storage class thoroughly in Phase 0 before any production data |
| MailCow isolation on docker4/hermes affected by k3s agent install | Evaluate carefully — may leave docker4/hermes out of k3s cluster entirely |
| Helm chart values diverge from Swarm stack config | Document all value overrides in Forgejo alongside the chart |
| Gremlin conversion produces subtly wrong manifests | kubeval + dry-run gate in pipeline; human review on all first-pass migrations |
| Pi 3 (arm) incompatibility | Evaluate at Phase 0 — may retire or keep as Swarm-only node indefinitely |
---
## Reference: Useful Commands
```bash
# k3s cluster status
kubectl get nodes -o wide
# Check all pods across namespaces
kubectl get pods -A
# Apply a manifest
kubectl apply -f service.yaml
# Dry run validation
kubectl apply -f service.yaml --dry-run=client
# Check NFS storage class
kubectl get storageclass
# Check PVCs
kubectl get pvc -A
# Tail pod logs
kubectl logs -f deployment/<name> -n <namespace>
# Node affinity check
kubectl describe node <nodename>
# Rollout status
kubectl rollout status deployment/<name> -n <namespace>
# Helm install example (kube-prometheus-stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f values/monitoring.yaml
```
---
## Related Wiki Pages
- Netgrimoire Infrastructure Overview
- Docker Swarm Canonical Template Standard
- Gremlin Stack
- Gremlin CI/CD Pipeline
- MailCow Configuration
- OPNsense Firewall Configuration
- CrowdSec Integration
- Caddy Reverse Proxy
- Grimoire Vault — Backup Strategy