diff --git a/Netgrimoire/Gremlin-Grimoire/CICD_Architecture.md b/Netgrimoire/Gremlin-Grimoire/CICD_Architecture.md
index 807e972..b2455c1 100644
--- a/Netgrimoire/Gremlin-Grimoire/CICD_Architecture.md
+++ b/Netgrimoire/Gremlin-Grimoire/CICD_Architecture.md
@@ -2,218 +2,400 @@
 title: Gremlin CI/CD Pipepline
 description: N8N with LLAMA
 published: true
-date: 2026-04-28T20:55:22.848Z
+date: 2026-05-06T21:08:52.500Z
 tags:
 editor: markdown
-dateCreated: 2026-04-28T20:55:22.848Z
+dateCreated: 2026-05-03T04:16:15.155Z
 ---

-# Gremlin CI/CD Pipeline
+---
+title: Gremlin CI/CD — Operations Guide
+description: Complete operations reference for the Gremlin CI/CD pipeline. Use this to rebuild or onboard into a new project.
+published: true
+date: 2026-05-06
+tags: gremlin, cicd, n8n, docker, swarm, compose, netgrimoire
+editor: markdown
+dateCreated: 2026-05-06
+---
+
+# Gremlin CI/CD — Operations Guide

 > **NetGrimoire Infrastructure Reference**
-> Automated validation, auto-fix, and deployment pipeline for Docker Swarm stacks.
-> Runs on n8n (docker4). Triggered by Forgejo push webhooks on `traveler/services`.
+> Complete reference for rebuilding, reconfiguring, or onboarding the Gremlin CI/CD pipeline in a new project. Covers infrastructure requirements, n8n setup, Forgejo configuration, service account setup, and the full node chain.

 ---

 ## Overview

-The Gremlin CI/CD pipeline is an n8n workflow that intercepts every push to `traveler/services`, validates changed Swarm compose files against NetGrimoire standards, automatically fixes common issues, and deploys clean stacks to the Swarm cluster. It is the enforcement layer for stack consistency across the homelab.
+Gremlin CI/CD is an n8n workflow that watches the `traveler/services` Forgejo repository for pushes, validates Docker Swarm stacks and Compose files against NetGrimoire standards, auto-fixes violations, prepares volumes, deploys via SSH, and syncs Caddy and Gatus configuration. All automated operations are attributed to the `gremlin` service account.

-The pipeline is modular — each check and fix is a discrete node. Adding a new rule means adding one checker node and one fixer node. Nothing else changes.
+**Pipeline triggers:** any push to `swarm/**/*.yaml` or `compose/**/*.yaml` in `traveler/services`.
+
+**Pipeline outputs:** validated and deployed stacks, updated Caddy static file, updated Gatus config, deploy log row in `traveler/Netgrimoire/Logs/.md`, ntfy notifications.

 ---

-## Architecture
+## Infrastructure Requirements

-### File Detection
+### Nodes

-Every push to `traveler/services` fires a webhook to n8n. The pipeline detects changed files via three passes:
-
-1. **Standard arrays** — `commit.added` and `commit.modified` from the Forgejo payload
-2. **Gremlin commit messages** — extracts file path from `gremlin: auto-fix swarm/foo.yaml (N issues fixed)` messages, handling Forgejo's habit of sending empty file arrays for programmatic commits
-3. **Compare API fallback** — calls Forgejo's `/api/v1/repos/traveler/services/compare/before...after` if both passes find nothing
-
-Only files under `swarm/` with `.yml` or `.yaml` extensions are processed.
-
-### File Classification
-
-After fetching the file content, Build Envelope classifies it:
-
-| Type | Detection | Route |
+| Node | Role | IP |
 |---|---|---|
-| **Pocket** | `pocket.include: "true"` label | Silent exit |
-| **Swarm** | Any `deploy:` block present | Full checker chain |
-| **Compose** | No `deploy:` block | ntfy warning, skip |
+| `znas` | Swarm manager, Caddy host, primary deploy target | `192.168.5.10` |
+| `docker3` | Swarm worker | `192.168.5.15` |
+| `docker4` | Swarm worker, n8n host, Ollama host | `192.168.5.16` |
+| `docker5` | Swarm worker | `192.168.5.18` |
+| `dockerpi1` | Swarm worker (ARM) | `192.168.5.8` |

-### Pipeline Flow
+### Required Services

-```
-Forgejo Push
-  └─ Parse Push Payload
-       └─ Build Envelope
-            └─ Switch (Pocket / Swarm / Compose)
-                 ├─ Pocket → (silent exit)
-                 ├─ Swarm → Checker Chain
-                 │    └─ Evaluate Checks
-                 │         └─ Switch (hasFailed)
-                 │              ├─ Failed → ntfy: Blocked
-                 │              │    └─ Switch (canFix)
-                 │              │         ├─ Yes → Fixer Chain → Commit → ntfy: Auto-fixed
-                 │              │         └─ No → (stop)
-                 │              └─ Passed → Ollama Audit
-                 │                   └─ Switch (ollamaVerdict)
-                 │                        ├─ Fail → ntfy: Blocked — Ollama
-                 │                        └─ Pass → Deploy Gate
-                 │                             └─ Deploy Enabled?
-                 │                                  ├─ No → ntfy: Deploy Skipped
-                 │                                  └─ Yes → Prepare Volumes
-                 │                                       └─ Git Pull + Deploy
-                 │                                            └─ Gatus Sync
-                 │                                                 └─ ntfy: Deploy Complete
-                 └─ Compose → ntfy: Non-Swarm
-```
-
-### The Envelope
-
-Every checker and fixer operates on a single shared object called the **envelope**. It is built once per file and passed through the entire chain, accumulating issues and fixes.
-
-Key fields:
-
-| Field | Description |
-|---|---|
-| `stackName` | Derived from file path |
-| `filePath` | Relative path in repo |
-| `composeRaw` | Original file content — never modified |
-| `fixedRaw` | Accumulates fixer changes — null until first fixer runs |
-| `issues[]` | All checker findings |
-| `fixes[]` | All fixer actions taken |
-| `checkResults{}` | Pass/fail per checker ID |
-| `hasFailed` | True if any checker failed |
-| `canFix` | True if all issues are fixable and there are issues to fix |
-| `isPocket` | True if pocket.include: "true" found |
-| `isSwarm` | True if any deploy: block found |
-| `directives{}` | Parsed gremlin.* label values |
-
----
-
-## Checker Chain
-
-Checkers run in this order. All checkers append to `envelope.issues` and set `envelope.checkResults[id]`.
-
-| Order | ID | What it checks |
+| Service | Location | Purpose |
 |---|---|---|
-| 1 | `swarm-syntax` | Forbidden fields: version, container_name, hostname (service-level), restart, depends_on, dnsrr |
-| 2 | `identity` | PUID/PGID must be 1964, or user: "1964:1964" |
-| 3 | `network` | netgrimoire overlay declared and attached |
-| 4 | `placement` | ARM exclusions, DockerVol/hostname rules, restart_policy |
-| 5 | `caddy` | caddy: label, reverse_proxy format, import_1/import_2 |
-| 6 | `homepage` | group, name, icon, href, description labels |
-| 7 | `monitor` | monitor.name, monitor.url, optional type/interval |
-| 8 | `legacy-labels` | Flags any kuma.* labels for removal |
-| 9 | `diun` | diun.enable: "true" present |
+| n8n | `docker4` | Pipeline execution |
+| Forgejo | `git.netgrimoire.com` | Git hosting, webhook source |
+| Ollama | `docker4` | Audit model inference |
+| ntfy | `ntfy.netgrimoire.com` | Notifications |
+| Gatus | `znas` | Service monitoring |
+| Caddy | `znas` | Reverse proxy |

-### Fixable vs Unfixable
+---

-Auto-fix only runs when **all** issues in the file are fixable. A single unfixable issue blocks the fix chain entirely.
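
The all-or-nothing `canFix` rule can be illustrated with a minimal Python sketch. The `Issue` and `Envelope` classes and the per-issue `fixable` flag are hypothetical stand-ins for the envelope fields; the actual n8n Code node may represent them differently, and treating "any recorded issue" as a failure is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Issue:
    check_id: str   # checker that raised it, e.g. "identity"
    message: str
    fixable: bool   # hypothetical flag: can a fixer node repair this?


@dataclass
class Envelope:
    issues: List[Issue] = field(default_factory=list)
    has_failed: bool = False
    can_fix: bool = False


def evaluate_checks(envelope: Envelope) -> Envelope:
    """Set has_failed / can_fix from the accumulated issues."""
    # Assumption: a check has failed whenever at least one issue was recorded.
    envelope.has_failed = bool(envelope.issues)
    # Fix only if there is something to fix and nothing is unfixable.
    envelope.can_fix = envelope.has_failed and all(i.fixable for i in envelope.issues)
    return envelope


if __name__ == "__main__":
    mixed = Envelope(issues=[
        Issue("swarm-syntax", "version: key present", fixable=True),
        Issue("swarm-syntax", "dnsrr endpoint mode", fixable=False),
    ])
    print(evaluate_checks(mixed).can_fix)
```

A file with one fixable and one unfixable issue therefore skips the fixer chain entirely, while a clean file has nothing to fix and also leaves `can_fix` false.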
+## Service Account Setup

-| Fixable | Unfixable |
+The `gremlin` service account must be created and configured on every node before the pipeline will function.
+
+### Create the account
+
+On each node:
+
+```bash
+sudo useradd -m -s /bin/bash gremlin
+sudo usermod -aG docker gremlin
+```
+
+### SSH key
+
+Generate once, distribute everywhere:
+
+```bash
+# On docker4 (n8n host) or your workstation
+ssh-keygen -t ed25519 -C "gremlin@netgrimoire" -f ~/.ssh/gremlin_ed25519 -N ""
+```
+
+Copy the public key to every node:
+
+```bash
+for NODE in znas docker3 docker4 docker5 dockerpi1; do
+  ssh-copy-id -i ~/.ssh/gremlin_ed25519.pub gremlin@$NODE
+done
+```
+
+The private key must be placed at the path configured in `gremlin/config.yaml` (default `/home/gremlin/.ssh/id_ed25519`) inside the n8n container. Mount it via the n8n stack:
+
+```yaml
+volumes:
+  - /DockerVol/n8n/keys:/home/gremlin/.ssh:ro
+```
+
+### Passwordless sudo
+
+On each node, create `/etc/sudoers.d/gremlin-deploy`:
+
+```
+gremlin ALL=(ALL) NOPASSWD: /bin/mkdir, /bin/chown, /bin/chmod
+```
+
+On `znas` additionally:
+
+```
+gremlin ALL=(ALL) NOPASSWD: /usr/bin/docker
+```
+
+### Forgejo account
+
+Create a `gremlin` local account in Forgejo. Grant collaborator write access to:
+- `traveler/services` — stack files
+- `traveler/Netgrimoire` — deploy logs
+
+Generate two tokens:
+- **Read token** — for fetching files (`FORGEJO_TOKEN`)
+- **Write token** — for committing fixes and logs (`FORGEJO_WRITE_TOKEN`)
+
+---
+
+## n8n Stack Configuration
+
+### Required environment variables
+
+Add to the n8n stack compose `environment:` block:
+
+```yaml
+environment:
+  # Forgejo
+  FORGEJO_TOKEN:
+  FORGEJO_WRITE_TOKEN:
+  FORGEJO_URL: https://forgejo.netgrimoire.com
+
+  # ntfy
+  NTFY_URL: https://ntfy.netgrimoire.com
+
+  # Node IPs (overlay DNS gives overlay IPs — store host IPs explicitly)
+  ZNAS_IP: 192.168.5.10
+  DOCKER3_IP: 192.168.5.15
+  DOCKER4_IP: 192.168.5.16
+  DOCKER5_IP: 192.168.5.18
+  DOCKERPI1_IP: 192.168.5.8
+
+  # Required for SSH and child_process in Code nodes
+  NODE_FUNCTION_ALLOW_BUILTIN: child_process
+```
+
+### Volume mounts
+
+```yaml
+volumes:
+  - /DockerVol/n8n:/home/node/.n8n
+  - /DockerVol/n8n/keys:/home/gremlin/.ssh:ro
+```
+
+---
+
+## Gremlin Config File
+
+The pipeline reads runtime configuration from `gremlin/config.yaml` in the `traveler/services` repo. This file is fetched on every push and overrides hardcoded defaults.
+
+```yaml
+# gremlin/config.yaml
+version: "2026-04-1"
+deploy: true
+autofix: true
+ollama_model: "qwen2.5-coder:14b"
+ollama_audit_model: "gemma3:4b"
+ntfy_alerts_topic: "gremlin-alerts"
+ntfy_monitor_topic: "gremlin-watch"
+checks_skip: ""
+maintenance: false
+maintenance_message: "Gremlin is in maintenance mode"
+ssh_key_path: "/home/gremlin/.ssh/id_ed25519"
+repo_path: "/home/gremlin/services"
+forgejo_owner: "traveler"
+forgejo_repo: "services"
+gatus_config_path: "/DockerVol/gatus/config/config.yaml"
+caddy_file_path: "swarm/stack/caddy/Caddyfile"
+caddy_static_path: "/export/Docker/caddy/Caddyfile"
+```
+
+Set `maintenance: true` to silently drop all pushes and send a single ntfy notification.
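
As a rough illustration of "repo config overrides hardcoded defaults" and the maintenance gate, here is a minimal Python sketch. It assumes the YAML has already been parsed into a dict; the `DEFAULTS` values and the function names `effective_config` and `gate_push` are hypothetical and only mirror the sample file above, not the workflow's real internals.

```python
from typing import Optional, Tuple

# Hypothetical hardcoded defaults; the real ones live inside the workflow.
DEFAULTS = {
    "deploy": True,
    "autofix": True,
    "maintenance": False,
    "maintenance_message": "Gremlin is in maintenance mode",
}


def effective_config(repo_config: Optional[dict]) -> dict:
    """Overlay parsed gremlin/config.yaml values on top of the defaults."""
    return {**DEFAULTS, **(repo_config or {})}


def gate_push(config: dict) -> Tuple[bool, Optional[str]]:
    """Return (proceed, notice).

    In maintenance mode the push is dropped and only the notice text
    is handed back for a single ntfy message.
    """
    if config["maintenance"]:
        return False, config["maintenance_message"]
    return True, None
```

Keys missing from the repo file fall back to the defaults, so a partial `gremlin/config.yaml` still yields a complete configuration under this scheme.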
+
+---
+
+## Repository Structure
+
+```
+traveler/services/
+ swarm/
+ stack/
+ caddy/
+ Caddyfile ← static Caddy config (managed by Gremlin Caddy Sync)
+ /
+ .yaml ← Swarm stack file
+ compose/
+ /
+ .yaml ← Compose file for that node
+ -override.yaml ← Auto-generated by Gremlin (override mode)
+ gremlin/
+ config.yaml ← Pipeline runtime config
+ logs/ ← (unused — logs now in traveler/Netgrimoire)
+
+traveler/Netgrimoire/
+ Logs/
+ .md ← Deploy log per stack (table + detail)
+
+gatus/
+ config.yaml ← Gatus monitor config (managed by Gremlin Gatus Sync)
+```
+
+---
+
+## Forgejo Webhook
+
+In Forgejo → `traveler/services` → Settings → Webhooks → Add Webhook:
+
+| Field | Value |
 |---|---|
-| version: key | dnsrr endpoint mode |
-| container_name, hostname (service), restart, depends_on | Missing deploy: block |
-| Wrong or missing PUID/PGID | Invalid node.hostname value |
-| Missing netgrimoire network | hostname missing when DockerVol present |
-| ARM exclusion issues | — |
-| Hostname present without DockerVol | — |
-| Missing restart_policy | — |
-| caddy: protocol prefix | — |
-| Missing caddy.import_1/import_2 | — |
-| Missing homepage labels (derived) | — |
-| Missing monitor labels (derived) | — |
-| Legacy kuma.* labels (removed) | — |
-| Missing diun.enable | — |
+| Target URL | `https://n8n.netgrimoire.com/webhook/gremlin-cicd` |
+| Content type | `application/json` |
+| Secret | _(optional, not currently validated)_ |
+| Trigger | Push events only |

 ---

-## Fixer Chain
+## Pipeline Node Chain

-Fixers run in the same order as checkers. Each fixer reads from `fixedRaw` (or `composeRaw` if first) and writes its changes back to `fixedRaw`. Changes accumulate correctly across the chain.
+### Swarm path

-When all fixers complete, the pipeline commits `fixedRaw` back to Forgejo with the message:
 ```
-gremlin: auto-fix swarm/foo.yaml (N issues fixed)
-
-  - Removed version: key
-  - Added PUID/PGID 1964 to "app"
-  - ...
+Forgejo Push Webhook
+  → Parse Push Payload       extract files, branch, SHA, pusher, commit directives
+  → Build Envelope           fetch file from Forgejo, detect type, parse directives
+  → Switch2                  route: Compose / Swarm / Pocket / fallback
+  → Check: Swarm Syntax      forbidden keys (version, container_name, restart, depends_on)
+  → Check: Identity          PUID/PGID or user: 1964:1964 per service
+  → Check: Network           netgrimoire network on every service
+  → Check: Placement         ARM exclusions, node.hostname constraints, DockerVol placement
+  → Check: Caddy             caddy:, import_1/2, reverse_proxy per service
+  → Check: Homepage          all 5 homepage.* labels per service
+  → Check: Monitor           monitor.name, monitor.url (http://) per service
+  → Check: Legacy Labels     kuma.* labels (flagged for removal)
+  → Check: Version           gremlin.version label per service
+  → Check: Diun              diun.enable: "true" per service
+  → Evaluate Checks          hasFailed, canFix flags
+  → Switch (hasFailed)
+      true  → ntfy: Blocked — Schema
+              → Switch1 (canFix)
+                  true  → Fix chain (Syntax→Identity→Network→Placement→Caddy→
+                                     Homepage→Monitor→Legacy→Version→Diun)
+                          → Commit Fixed File
+                          → ntfy: Auto-fixed
+                  false → (dead end — manual fix required)
+      false → Ollama Audit
+              → Ollama Failed?
+                  true  → ntfy: Blocked — Ollama
+                  false → Deploy Gate (deploy directive)
+                            enabled  → Prepare Volumes
+                                       → Git Pull + Deploy
+                                       → Gatus Sync
+                                       → ntfy: Deploy Complete (+ log write)
+                            disabled → ntfy: Deploy Skipped
 ```

-This commit re-triggers the webhook, and the pipeline runs again on the now-fixed file.
+### Compose path

-### Smart Fix Derivation
-
-Homepage and monitor labels are derived from existing labels rather than placeholders:
-
-- `homepage.name` / `monitor.name` → `capitalize(serviceName)`
-- `homepage.href` / `monitor.url` → `https://` + `caddy:` hostname (falls back to `https://servicename.netgrimoire.com`)
-- `homepage.group` → `"New"` when missing
-- `homepage.icon` → `servicename.png`
-- `homepage.description` → `"Servicename service"`
+```
+  → Check: Compose Syntax    forbidden keys (version, container_name, depends_on)
+  → Check: Compose Identity  PUID/PGID per service (no restart_policy check)
+  → Check: Compose Network   netgrimoire network
+  → Check: Compose Caddy     caddy labels
+  → Check: Compose Homepage  homepage labels
+  → Check: Compose Monitor   monitor labels
+  → Check: Compose Diun      diun.enable
+  → Check: Compose Version   gremlin.version
+  → Evaluate Compose Checks
+  → Switch: Compose (hasFailed)
+      true  → ntfy: Compose Blocked
+              → Switch: Compose1 (canFix)
+                  true  → Fix chain (writes to original or override file)
+                          → Commit Compose Fixed File
+                          → ntfy: Compose Auto-fixed
+      false → Compose Deploy Gate
+                enabled  → Compose Prepare Volumes (SSH to composeNode)
+                           → Compose Deploy (docker compose up -d on target node)
+                           → Compose Caddy Sync (upsert static Caddyfile + git commit + cp)
+                           → Compose Gatus Sync
+                           → ntfy: Compose Deploy Complete (+ log write)
+                disabled → ntfy: Compose Deploy Skipped
+```

 ---

-## Ollama Audit
+## Caddy Sync

-After all checkers pass, the file is sent to Ollama (`qwen2.5-coder:7b`) for a semantic audit. The prompt explicitly instructs Ollama to:
+**Swarm:** caddy-docker-proxy reads labels directly from Swarm services. No Gremlin involvement needed for Swarm caddy entries — they are auto-generated by caddy-docker-proxy.

-- **Ignore:** environment variables, volume paths, port mappings, OIDC/OAuth config, secrets, application-specific settings
-- **Check only:** clearly wrong image names, structural errors preventing startup, obviously broken network config
+**Compose:** Gremlin reads `caddy:` and related labels from the compose file, builds a Caddyfile block, upserts it into `swarm/stack/caddy/Caddyfile` in Forgejo, commits it, then SSHes to `znas` to `git pull` and `cp` the file to `/export/Docker/caddy/Caddyfile`. Caddy reloads via inotify and the admin API.

-Ollama is conservative by design — when in doubt it passes. False positives can be suppressed with `gremlin.context`.
+The Caddyfile path in the repo is configured via `caddy_file_path` in `gremlin/config.yaml`.

 ---

 ## Gatus Sync

-After successful deploy, Gatus Sync reads `monitor.*` labels from the deployed compose file and upserts endpoints into `/DockerVol/gatus/config/config.yaml` on znas using base64-encoded SSH writes. Gatus hot-reloads the config automatically.
+Both Swarm and Compose pipelines read `monitor.name` and `monitor.url` labels, build Gatus endpoint blocks, upsert them into `gatus/config.yaml` in `traveler/services`, commit via the Forgejo API, then SSH to `znas` to `git pull` and copy the file to the configured `gatus_config_path`.

-Alerts from Gatus go to the `gremlin-watch` ntfy topic.
+Monitor URLs must use `http://servicename:port` — Gatus is on the `netgrimoire` overlay and reaches services directly without going through Caddy.

 ---

-## Infrastructure
+## Deploy Logs

-| Component | Value |
+Each stack gets a log file at `traveler/Netgrimoire/Logs/.md`. The file is created on first deploy and a row is appended on each subsequent deploy. The ntfy message body is also written as a detail block below the table row.
+
+**Swarm table format:**
+```
+| Timestamp | Commit | Branch | Pusher | Outcome | Volumes |
+```
+
+**Compose table format:**
+```
+| Timestamp | Commit | Branch | Pusher | Outcome | Node | Volumes |
+```
+
+Failed deploys are also logged with the error output.
+
+---
+
+## Volume Management
+
+Gremlin's Prepare Volumes node SSHes to the target node and handles `/DockerVol/` and `/data/nfs/` paths found in the stack file.
+
+| Scenario | Action |
 |---|---|
-| n8n host | `docker4` (192.168.5.16) |
-| Swarm manager | `znas` (192.168.5.10) |
-| Service account | `gremlin` |
-| SSH key | `/home/gremlin/.ssh/id_ed25519` |
-| Repo path on znas | `/home/gremlin/services` |
-| Webhook path | `gremlin-cicd` |
-| ntfy pipeline alerts | `gremlin-alerts` |
-| ntfy monitoring alerts | `gremlin-watch` |
-| Gatus config | `/DockerVol/gatus/config/config.yaml` |
+| Directory does not exist | `mkdir -p` + `chown -R 1964:1964` + `chmod -R 775` |
+| Directory exists, correct perms | Silent — no action |
+| Directory exists, wrong perms | Added to `permWarnings` — manual fix command included in ntfy and deploy log |
+| Service has `gremlin.uid.exempt: "true"` | `mkdir -p` only — no chown or chmod |
+
+Permission warnings appear in both the ntfy Deploy Complete message and the deploy log detail block.
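
The four scenarios in the Volume Management table can be sketched as a pure decision function in Python. This is illustrative only: `plan_volume`, `VolumePlan`, and the `expected_*` parameters are hypothetical names, and the values that `stat` reports for a correctly owned directory on a given node are an assumption. The sketch only builds command strings and a warning; it performs no SSH and omits any `sudo` prefix.

```python
import shlex
from typing import List, NamedTuple, Optional


class VolumePlan(NamedTuple):
    commands: List[str]      # commands to run on the target node
    warning: Optional[str]   # entry for permWarnings, if any


def plan_volume(path: str, exists: bool, uid_exempt: bool = False,
                owner: Optional[str] = None, group: Optional[str] = None,
                mode: Optional[str] = None,
                expected_owner: str = "1964", expected_group: str = "1964",
                expected_mode: str = "775") -> VolumePlan:
    """Map one volume path to the action listed in the scenario table."""
    q = shlex.quote(path)
    if uid_exempt:
        # gremlin.uid.exempt: "true" -> create only, never chown/chmod
        return VolumePlan([f"mkdir -p {q}"], None)
    if not exists:
        return VolumePlan(
            [f"mkdir -p {q}", f"chown -R 1964:1964 {q}", f"chmod -R 775 {q}"],
            None,
        )
    if (owner, group, mode) == (expected_owner, expected_group, expected_mode):
        return VolumePlan([], None)  # correct perms: silent, no action
    # Wrong perms: change nothing, report a manual fix instead
    fix = f"chown -R 1964:1964 {q} && chmod -R 775 {q}"
    return VolumePlan([], f"{path}: found {owner}:{group} {mode}; manual fix: {fix}")
```

Keeping the decision separate from the SSH calls makes the table easy to check in isolation; an existing directory with wrong permissions is never modified, only reported.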

 ---

-## ntfy Notifications
+## Troubleshooting

-| Event | Topic | Priority |
-|---|---|---|
-| Schema blocked | `gremlin-alerts` | 4 (high) |
-| Ollama blocked | `gremlin-alerts` | 4 (high) |
-| Auto-fixed | `gremlin-alerts` | 3 (default) |
-| Deploy complete | `gremlin-alerts` | 3 (default) |
-| Deploy skipped | `gremlin-alerts` | 2 (low) |
-| Non-Swarm file | `gremlin-alerts` | 2 (low) |
-| Service down/up | `gremlin-watch` | 3 (default) |
+### Pipeline not triggering
+
+- Verify webhook is active in Forgejo → `traveler/services` → Settings → Webhooks
+- Check n8n workflow is active
+- Check that changed files are under `swarm/` or `compose/` paths
+
+### SSH failures
+
+- Verify `gremlin` has SSH key access to the target node: `ssh -i /DockerVol/n8n/keys/id_ed25519 gremlin@`
+- Verify `NODE_FUNCTION_ALLOW_BUILTIN=child_process` is set in n8n env
+- Check that the key mount is read-only: `/DockerVol/n8n/keys:/home/gremlin/.ssh:ro`
+
+### Forgejo API 404 on file fetch
+
+- Verify `FORGEJO_TOKEN` has read access to `traveler/services`
+- Verify file path in the push payload matches actual repo path
+- Check that `gremlin/config.yaml` exists in the repo
+
+### Caddy Sync 404
+
+- Verify `caddy_file_path` in `gremlin/config.yaml` matches the actual path in the repo (`swarm/stack/caddy/Caddyfile`)
+- Verify the Caddyfile exists at that path in Forgejo
+
+### Gatus corruption
+
+- The upsert logic splits on ` - name:` boundaries. If the config has non-standard indentation the split may fail. Reset the file manually in Forgejo and push again.
+
+### stat command failing in Prepare Volumes
+
+- The stat calls use three separate SSH invocations (`stat -c %U`, `stat -c %G`, `stat -c %a`) to avoid shell quoting issues. If these fail, check that `stat` is available on the target node.

 ---

-## Related
+## Rebuilding in a New Project
+
+1. Create service account `gremlin` on all nodes with docker group and passwordless sudo
+2. Generate SSH keypair, distribute public key to all nodes
+3. Place private key at `/DockerVol/n8n/keys/id_ed25519` on the n8n host
+4. Create `gremlin` Forgejo account, grant write access to services and docs repos
+5. Generate read and write tokens, add to n8n environment variables
+6. Add all node IPs and other env vars to n8n stack compose, redeploy n8n
+7. Import `gremlin-cicd-v2.json` workflow into n8n, activate it
+8. Create `gremlin/config.yaml` in the services repo with your values
+9. Create `gatus/config.yaml` in the services repo (empty endpoints block)
+10. Create `Logs/` directory in the docs repo (`touch Logs/.gitkeep && git push`)
+11. Add webhook to Forgejo pointing at `https://n8n.netgrimoire.com/webhook/gremlin-cicd`
+12. Push a test stack file and verify the pipeline fires

-- [Gremlin CI/CD — Operator Guide](gremlin-cicd-guide.md)
-- [NetGrimoire Stack Standards](stack-standards.md)
-- [Gatus](gatus.md)
-- [n8n](n8n.md)
\ No newline at end of file