docs: update Netgrimoire/Gremlin-Grimoire/CICD_Architecture

Administrator 2026-05-06 21:08:57 +00:00 committed by John Smith
parent b9c843fb52
commit 52e4ab5e0f


@ -2,218 +2,400 @@
title: Gremlin CI/CD Pipeline title: Gremlin CI/CD Pipeline
description: N8N with LLAMA description: N8N with LLAMA
published: true published: true
date: 2026-04-28T20:55:22.848Z date: 2026-05-06T21:08:52.500Z
tags: tags:
editor: markdown editor: markdown
dateCreated: 2026-04-28T20:55:22.848Z dateCreated: 2026-05-03T04:16:15.155Z
--- ---
# Gremlin CI/CD Pipeline ---
title: Gremlin CI/CD — Operations Guide
description: Complete operations reference for the Gremlin CI/CD pipeline. Use this to rebuild or onboard into a new project.
published: true
date: 2026-05-06
tags: gremlin, cicd, n8n, docker, swarm, compose, netgrimoire
editor: markdown
dateCreated: 2026-05-06
---
# Gremlin CI/CD — Operations Guide
> **NetGrimoire Infrastructure Reference** > **NetGrimoire Infrastructure Reference**
> Automated validation, auto-fix, and deployment pipeline for Docker Swarm stacks. > Complete reference for rebuilding, reconfiguring, or onboarding the Gremlin CI/CD pipeline in a new project. Covers infrastructure requirements, n8n setup, Forgejo configuration, service account setup, and the full node chain.
> Runs on n8n (docker4). Triggered by Forgejo push webhooks on `traveler/services`.
--- ---
## Overview ## Overview
The Gremlin CI/CD pipeline is an n8n workflow that intercepts every push to `traveler/services`, validates changed Swarm compose files against NetGrimoire standards, automatically fixes common issues, and deploys clean stacks to the Swarm cluster. It is the enforcement layer for stack consistency across the homelab. Gremlin CI/CD is an n8n workflow that watches the `traveler/services` Forgejo repository for pushes, validates Docker Swarm stacks and Compose files against NetGrimoire standards, auto-fixes violations, prepares volumes, deploys via SSH, and syncs Caddy and Gatus configuration. All automated operations are attributed to the `gremlin` service account.
The pipeline is modular — each check and fix is a discrete node. Adding a new rule means adding one checker node and one fixer node. Nothing else changes. **Pipeline triggers:** any push to `swarm/**/*.yaml` or `compose/**/*.yaml` in `traveler/services`.
**Pipeline outputs:** validated and deployed stacks, updated Caddy static file, updated Gatus config, deploy log row in `traveler/Netgrimoire/Logs/<stack>.md`, ntfy notifications.
--- ---
## Architecture ## Infrastructure Requirements
### File Detection ### Nodes
Every push to `traveler/services` fires a webhook to n8n. The pipeline detects changed files via three passes: | Node | Role | IP |
1. **Standard arrays**: `commit.added` and `commit.modified` from the Forgejo payload
2. **Gremlin commit messages**: extracts the file path from `gremlin: auto-fix swarm/foo.yaml (N issues fixed)` messages, which covers Forgejo's habit of sending empty file arrays for programmatic commits
3. **Compare API fallback**: calls Forgejo's `/api/v1/repos/traveler/services/compare/before...after` if both earlier passes find nothing
Only files under `swarm/` with `.yml` or `.yaml` extensions are processed.
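The commit-message pass can be sketched in shell. This is a minimal illustration rather than the workflow's actual Code node: the function name `gremlin_commit_path` is hypothetical, and the pattern assumes the message format quoted in the list above.

```bash
# Hypothetical helper: pull the file path out of a Gremlin auto-fix commit message.
# Assumes the first line looks like "gremlin: auto-fix <path> (N issues fixed)".
gremlin_commit_path() {
  local first_line path
  # Only the first line of the commit message carries the path.
  first_line=$(printf '%s\n' "$1" | head -n 1)
  # sed prints the captured path, or nothing when the line does not match.
  path=$(printf '%s\n' "$first_line" |
    sed -n -E 's/^gremlin: auto-fix ([^ ]+) \([0-9]+ issues? fixed\)$/\1/p')
  # Apply the same filter as the other passes: swarm/ files ending in .yml or .yaml.
  case "$path" in
    swarm/*.yml|swarm/*.yaml) printf '%s\n' "$path" ;;
    *) return 1 ;;
  esac
}
```

A non-zero return means the message is not a Gremlin auto-fix commit for a Swarm file, so the caller would fall through to the Compare API pass.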
### File Classification
After fetching the file content, Build Envelope classifies it:
| Type | Detection | Route |
|---|---|---| |---|---|---|
| **Pocket** | `pocket.include: "true"` label | Silent exit | | `znas` | Swarm manager, Caddy host, primary deploy target | `192.168.5.10` |
| **Swarm** | Any `deploy:` block present | Full checker chain | | `docker3` | Swarm worker | `192.168.5.15` |
| **Compose** | No `deploy:` block | ntfy warning, skip | | `docker4` | Swarm worker, n8n host, Ollama host | `192.168.5.16` |
| `docker5` | Swarm worker | `192.168.5.18` |
| `dockerpi1` | Swarm worker (ARM) | `192.168.5.8` |
### Pipeline Flow ### Required Services
``` | Service | Location | Purpose |
Forgejo Push
└─ Parse Push Payload
└─ Build Envelope
└─ Switch (Pocket / Swarm / Compose)
├─ Pocket → (silent exit)
├─ Swarm → Checker Chain
│ └─ Evaluate Checks
│ └─ Switch (hasFailed)
│ ├─ Failed → ntfy: Blocked
│ │ └─ Switch (canFix)
│ │ ├─ Yes → Fixer Chain → Commit → ntfy: Auto-fixed
│ │ └─ No → (stop)
│ └─ Passed → Ollama Audit
│ └─ Switch (ollamaVerdict)
│ ├─ Fail → ntfy: Blocked — Ollama
│ └─ Pass → Deploy Gate
│ └─ Deploy Enabled?
│ ├─ No → ntfy: Deploy Skipped
│ └─ Yes → Prepare Volumes
│ └─ Git Pull + Deploy
│ └─ Gatus Sync
│ └─ ntfy: Deploy Complete
└─ Compose → ntfy: Non-Swarm
```
### The Envelope
Every checker and fixer operates on a single shared object called the **envelope**. It is built once per file and passed through the entire chain, accumulating issues and fixes.
Key fields:
| Field | Description |
|---|---|
| `stackName` | Derived from file path |
| `filePath` | Relative path in repo |
| `composeRaw` | Original file content — never modified |
| `fixedRaw` | Accumulates fixer changes — null until first fixer runs |
| `issues[]` | All checker findings |
| `fixes[]` | All fixer actions taken |
| `checkResults{}` | Pass/fail per checker ID |
| `hasFailed` | True if any checker failed |
| `canFix` | True if all issues are fixable and there are issues to fix |
| `isPocket` | True if pocket.include: "true" found |
| `isSwarm` | True if any deploy: block found |
| `directives{}` | Parsed gremlin.* label values |
---
## Checker Chain
Checkers run in this order. All checkers append to `envelope.issues` and set `envelope.checkResults[id]`.
| Order | ID | What it checks |
|---|---|---| |---|---|---|
| 1 | `swarm-syntax` | Forbidden fields: version, container_name, hostname (service-level), restart, depends_on, dnsrr | | n8n | `docker4` | Pipeline execution |
| 2 | `identity` | PUID/PGID must be 1964, or user: "1964:1964" | | Forgejo | `git.netgrimoire.com` | Git hosting, webhook source |
| 3 | `network` | netgrimoire overlay declared and attached | | Ollama | `docker4` | Audit model inference |
| 4 | `placement` | ARM exclusions, DockerVol/hostname rules, restart_policy | | ntfy | `ntfy.netgrimoire.com` | Notifications |
| 5 | `caddy` | caddy: label, reverse_proxy format, import_1/import_2 | | Gatus | `znas` | Service monitoring |
| 6 | `homepage` | group, name, icon, href, description labels | | Caddy | `znas` | Reverse proxy |
| 7 | `monitor` | monitor.name, monitor.url, optional type/interval |
| 8 | `legacy-labels` | Flags any kuma.* labels for removal |
| 9 | `diun` | diun.enable: "true" present |
### Fixable vs Unfixable ---
Auto-fix only runs when **all** issues in the file are fixable. A single unfixable issue blocks the fix chain entirely. ## Service Account Setup
| Fixable | Unfixable | The `gremlin` service account must be created and configured on every node before the pipeline will function.
### Create the account
On each node:
```bash
sudo useradd -m -s /bin/bash gremlin
sudo usermod -aG docker gremlin
```
### SSH key
Generate once, distribute everywhere:
```bash
# On docker4 (n8n host) or your workstation
ssh-keygen -t ed25519 -C "gremlin@netgrimoire" -f ~/.ssh/gremlin_ed25519 -N ""
```
Copy the public key to every node:
```bash
for NODE in znas docker3 docker4 docker5 dockerpi1; do
  ssh-copy-id -i ~/.ssh/gremlin_ed25519.pub "gremlin@$NODE"
done
```
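The private key must be visible inside the n8n container at the path configured in `gremlin/config.yaml` (default `/home/gremlin/.ssh/id_ed25519`). Copy the generated `~/.ssh/gremlin_ed25519` to `/DockerVol/n8n/keys/id_ed25519` on the n8n host, then mount that directory via the n8n stack: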
```yaml
volumes:
- /DockerVol/n8n/keys:/home/gremlin/.ssh:ro
```
### Passwordless sudo
On each node, create `/etc/sudoers.d/gremlin-deploy`:
```
gremlin ALL=(ALL) NOPASSWD: /bin/mkdir, /bin/chown, /bin/chmod
```
On `znas` additionally:
```
gremlin ALL=(ALL) NOPASSWD: /usr/bin/docker
```
### Forgejo account
Create a `gremlin` local account in Forgejo. Grant collaborator write access to:
- `traveler/services` — stack files
- `traveler/Netgrimoire` — deploy logs
Generate two tokens:
- **Read token** — for fetching files (`FORGEJO_TOKEN`)
- **Write token** — for committing fixes and logs (`FORGEJO_WRITE_TOKEN`)
---
## n8n Stack Configuration
### Required environment variables
Add to the n8n stack compose `environment:` block:
```yaml
environment:
  # Forgejo
  FORGEJO_TOKEN: <gremlin read token>
  FORGEJO_WRITE_TOKEN: <gremlin write token>
  FORGEJO_URL: https://forgejo.netgrimoire.com
  # ntfy
  NTFY_URL: https://ntfy.netgrimoire.com
  # Node IPs (overlay DNS gives overlay IPs, so store host IPs explicitly)
  ZNAS_IP: 192.168.5.10
  DOCKER3_IP: 192.168.5.15
  DOCKER4_IP: 192.168.5.16
  DOCKER5_IP: 192.168.5.18
  DOCKERPI1_IP: 192.168.5.8
  # Required for SSH and child_process in Code nodes
  NODE_FUNCTION_ALLOW_BUILTIN: child_process
```
### Volume mounts
```yaml
volumes:
- /DockerVol/n8n:/home/node/.n8n
- /DockerVol/n8n/keys:/home/gremlin/.ssh:ro
```
---
## Gremlin Config File
The pipeline reads runtime configuration from `gremlin/config.yaml` in the `traveler/services` repo. This file is fetched on every push and overrides hardcoded defaults.
```yaml
# gremlin/config.yaml
version: "2026-04-1"
deploy: true
autofix: true
ollama_model: "qwen2.5-coder:14b"
ollama_audit_model: "gemma3:4b"
ntfy_alerts_topic: "gremlin-alerts"
ntfy_monitor_topic: "gremlin-watch"
checks_skip: ""
maintenance: false
maintenance_message: "Gremlin is in maintenance mode"
ssh_key_path: "/home/gremlin/.ssh/id_ed25519"
repo_path: "/home/gremlin/services"
forgejo_owner: "traveler"
forgejo_repo: "services"
gatus_config_path: "/DockerVol/gatus/config/config.yaml"
caddy_file_path: "swarm/stack/caddy/Caddyfile"
caddy_static_path: "/export/Docker/caddy/Caddyfile"
```
Set `maintenance: true` to silently drop all pushes and send a single ntfy notification.
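The override-with-defaults behaviour can be illustrated with a small shell sketch. It is not the pipeline's own parser: `config_get` and `in_maintenance` are hypothetical names, and the helper only understands flat `key: value` lines like the ones in the example file.

```bash
# Hypothetical helper: read one top-level scalar from a flat YAML file such as
# gremlin/config.yaml, falling back to a default when the key is absent or empty.
# This is not a YAML parser; it only handles simple "key: value" lines.
config_get() {
  local file="$1" key="$2" default="$3" value
  value=$(sed -n -E "s/^${key}:[[:space:]]*(.*)$/\1/p" "$file" | head -n 1)
  # Strip trailing whitespace and one pair of surrounding double quotes.
  value=$(printf '%s' "$value" | sed -E 's/[[:space:]]+$//; s/^"(.*)"$/\1/')
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
  else
    printf '%s\n' "$default"
  fi
}

# Succeeds only when the config file sets maintenance: true.
in_maintenance() {
  [ "$(config_get "$1" maintenance false)" = "true" ]
}
```

For example, `config_get gremlin/config.yaml ollama_audit_model gemma3:4b` would print the configured audit model, or the supplied default if the key were missing.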
---
## Repository Structure
```
traveler/services/
swarm/
stack/
caddy/
Caddyfile ← static Caddy config (managed by Gremlin Caddy Sync)
<stackname>/
<stackname>.yaml ← Swarm stack file
compose/
<node>/
<stackname>.yaml ← Compose file for that node
<stackname>-override.yaml ← Auto-generated by Gremlin (override mode)
gremlin/
config.yaml ← Pipeline runtime config
logs/ ← (unused — logs now in traveler/Netgrimoire)
traveler/Netgrimoire/
Logs/
<stackname>.md ← Deploy log per stack (table + detail)
gatus/
config.yaml ← Gatus monitor config (managed by Gremlin Gatus Sync)
```
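Given this layout, a changed path can be classified and its stack name derived roughly as follows. This is a sketch under assumptions: `parse_stack_path` is a hypothetical name, and taking the stack name from the file's basename without `.yaml` is an assumption, since the source only says the name is derived from the file path.

```bash
# Hypothetical helper: classify a changed path and derive the stack name.
# Prints "<kind> <node> <stack>"; node is "-" for Swarm stacks.
# Assumption: the stack name is the file's basename without ".yaml".
parse_stack_path() {
  local path="$1" stack rest node
  stack=$(basename "$path" .yaml)
  case "$path" in
    swarm/*.yaml)
      printf 'swarm - %s\n' "$stack" ;;
    compose/*/*.yaml)
      rest="${path#compose/}"   # <node>/<stackname>.yaml
      node="${rest%%/*}"        # first directory after compose/
      printf 'compose %s %s\n' "$node" "$stack" ;;
    *)
      return 1 ;;               # not a pipeline file (e.g. gremlin/config.yaml)
  esac
}
```

Paths outside `swarm/` and `compose/`, or without a `.yaml` extension, return non-zero and would be ignored.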
---
## Forgejo Webhook
In Forgejo → `traveler/services` → Settings → Webhooks → Add Webhook:
| Field | Value |
|---|---| |---|---|
| version: key | dnsrr endpoint mode | | Target URL | `https://n8n.netgrimoire.com/webhook/gremlin-cicd` |
| container_name, hostname (service), restart, depends_on | Missing deploy: block | | Content type | `application/json` |
| Wrong or missing PUID/PGID | Invalid node.hostname value | | Secret | _(optional, not currently validated)_ |
| Missing netgrimoire network | hostname missing when DockerVol present | | Trigger | Push events only |
| ARM exclusion issues | — |
| Hostname present without DockerVol | — |
| Missing restart_policy | — |
| caddy: protocol prefix | — |
| Missing caddy.import_1/import_2 | — |
| Missing homepage labels (derived) | — |
| Missing monitor labels (derived) | — |
| Legacy kuma.* labels (removed) | — |
| Missing diun.enable | — |
--- ---
## Fixer Chain ## Pipeline Node Chain
Fixers run in the same order as checkers. Each fixer reads from `fixedRaw` (or `composeRaw` if first) and writes its changes back to `fixedRaw`. Changes accumulate correctly across the chain. ### Swarm path
When all fixers complete, the pipeline commits `fixedRaw` back to Forgejo with the message:
``` ```
gremlin: auto-fix swarm/foo.yaml (N issues fixed) Forgejo Push Webhook
→ Parse Push Payload extract files, branch, SHA, pusher, commit directives
- Removed version: key → Build Envelope fetch file from Forgejo, detect type, parse directives
- Added PUID/PGID 1964 to "app" → Switch2 route: Compose / Swarm / Pocket / fallback
- ... → Check: Swarm Syntax forbidden keys (version, container_name, restart, depends_on)
→ Check: Identity PUID/PGID or user: 1964:1964 per service
→ Check: Network netgrimoire network on every service
→ Check: Placement ARM exclusions, node.hostname constraints, DockerVol placement
→ Check: Caddy caddy:, import_1/2, reverse_proxy per service
→ Check: Homepage all 5 homepage.* labels per service
→ Check: Monitor monitor.name, monitor.url (http://) per service
→ Check: Legacy Labels kuma.* labels (flagged for removal)
→ Check: Version gremlin.version label per service
→ Check: Diun diun.enable: "true" per service
→ Evaluate Checks hasFailed, canFix flags
→ Switch (hasFailed)
true → ntfy: Blocked — Schema
→ Switch1 (canFix)
true → Fix chain (Syntax→Identity→Network→Placement→Caddy→
Homepage→Monitor→Legacy→Version→Diun)
→ Commit Fixed File
→ ntfy: Auto-fixed
false → (dead end — manual fix required)
false → Ollama Audit
→ Ollama Failed?
true → ntfy: Blocked — Ollama
false → Deploy Gate (deploy directive)
enabled → Prepare Volumes
→ Git Pull + Deploy
→ Gatus Sync
→ ntfy: Deploy Complete (+ log write)
disabled → ntfy: Deploy Skipped
``` ```
This commit re-triggers the webhook, and the pipeline runs again on the now-fixed file. ### Compose path
### Smart Fix Derivation ```
→ Check: Compose Syntax forbidden keys (version, container_name, depends_on)
Homepage and monitor labels are derived from existing labels rather than placeholders: → Check: Compose Identity PUID/PGID per service (no restart_policy check)
→ Check: Compose Network netgrimoire network
- `homepage.name` / `monitor.name` → `capitalize(serviceName)` → Check: Compose Caddy caddy labels
- `homepage.href` / `monitor.url` → `https://` + `caddy:` hostname (falls back to `https://servicename.netgrimoire.com`) → Check: Compose Homepage homepage labels
- `homepage.group` → `"New"` when missing → Check: Compose Monitor monitor labels
- `homepage.icon` → `servicename.png` → Check: Compose Diun diun.enable
- `homepage.description``"Servicename service"` → Check: Compose Version gremlin.version
→ Evaluate Compose Checks
→ Switch: Compose (hasFailed)
true → ntfy: Compose Blocked
→ Switch: Compose1 (canFix)
true → Fix chain (writes to original or override file)
→ Commit Compose Fixed File
→ ntfy: Compose Auto-fixed
false → Compose Deploy Gate
enabled → Compose Prepare Volumes (SSH to composeNode)
→ Compose Deploy (docker compose up -d on target node)
→ Compose Caddy Sync (upsert static Caddyfile + git commit + cp)
→ Compose Gatus Sync
→ ntfy: Compose Deploy Complete (+ log write)
disabled → ntfy: Compose Deploy Skipped
```
--- ---
## Ollama Audit ## Caddy Sync
After all checkers pass, the file is sent to Ollama (`qwen2.5-coder:7b`) for a semantic audit. The prompt explicitly instructs Ollama to: **Swarm:** caddy-docker-proxy reads labels directly from Swarm services. No Gremlin involvement needed for Swarm caddy entries — they are auto-generated by caddy-docker-proxy.
- **Ignore:** environment variables, volume paths, port mappings, OIDC/OAuth config, secrets, application-specific settings **Compose:** Gremlin reads `caddy:` and related labels from the compose file, builds a Caddyfile block, upserts it into `swarm/stack/caddy/Caddyfile` in Forgejo, commits it, then SSHes to `znas` to `git pull` and `cp` the file to `/export/Docker/caddy/Caddyfile`. Caddy reloads via inotify and the admin API.
- **Check only:** clearly wrong image names, structural errors preventing startup, obviously broken network config
Ollama is conservative by design — when in doubt it passes. False positives can be suppressed with `gremlin.context`. The Caddyfile path in the repo is configured via `caddy_file_path` in `gremlin/config.yaml`.
--- ---
## Gatus Sync ## Gatus Sync
After successful deploy, Gatus Sync reads `monitor.*` labels from the deployed compose file and upserts endpoints into `/DockerVol/gatus/config/config.yaml` on znas using base64-encoded SSH writes. Gatus hot-reloads the config automatically. Both Swarm and Compose pipelines read `monitor.name` and `monitor.url` labels, build Gatus endpoint blocks, upsert them into `gatus/config.yaml` in `traveler/services`, commit via the Forgejo API, then SSH to `znas` to `git pull` and copy the file to the configured `gatus_config_path`.
Alerts from Gatus go to the `gremlin-watch` ntfy topic. Monitor URLs must use `http://servicename:port` — Gatus is on the `netgrimoire` overlay and reaches services directly without going through Caddy.
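The `http://servicename:port` rule for monitor URLs could be checked with a pattern like the one below. This is only a sketch: `is_valid_monitor_url` is a hypothetical name, and allowing an optional path after the port is an assumption.

```bash
# Hypothetical check: accept only http://<service>:<port>, optionally followed by a path.
is_valid_monitor_url() {
  printf '%s\n' "$1" |
    grep -Eq '^http://[A-Za-z0-9][A-Za-z0-9_.-]*:[0-9]{1,5}(/.*)?$'
}
```

A public `https://` hostname fails this check, which matches the intent that Gatus reaches services over the overlay network instead of through Caddy.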
--- ---
## Infrastructure ## Deploy Logs
| Component | Value | Each stack gets a log file at `traveler/Netgrimoire/Logs/<stackname>.md`. The file is created on first deploy and a row is appended on each subsequent deploy. The ntfy message body is also written as a detail block below the table row.
**Swarm table format:**
```
| Timestamp | Commit | Branch | Pusher | Outcome | Volumes |
```
**Compose table format:**
```
| Timestamp | Commit | Branch | Pusher | Outcome | Node | Volumes |
```
Failed deploys are also logged with the error output.
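A row in either table format can be produced by a formatter along these lines. The function name and argument order are hypothetical; only the column order comes from the two table formats in this section.

```bash
# Hypothetical formatter for one deploy-log table row.
# Arguments: timestamp commit branch pusher outcome volumes [node]
# With a 7th argument the Compose layout (extra Node column) is used.
format_log_row() {
  if [ "$#" -ge 7 ]; then
    # Compose: ... | Outcome | Node | Volumes |
    printf '| %s | %s | %s | %s | %s | %s | %s |\n' "$1" "$2" "$3" "$4" "$5" "$7" "$6"
  else
    # Swarm: ... | Outcome | Volumes |
    printf '| %s | %s | %s | %s | %s | %s |\n' "$1" "$2" "$3" "$4" "$5" "$6"
  fi
}
```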
---
## Volume Management
Gremlin's Prepare Volumes node SSHes to the target node and handles `/DockerVol/` and `/data/nfs/` paths found in the stack file.
| Scenario | Action |
|---|---| |---|---|
| n8n host | `docker4` (192.168.5.16) | | Directory does not exist | `mkdir -p` + `chown -R 1964:1964` + `chmod -R 775` |
| Swarm manager | `znas` (192.168.5.10) | | Directory exists, correct perms | Silent — no action |
| Service account | `gremlin` | | Directory exists, wrong perms | Added to `permWarnings` — manual fix command included in ntfy and deploy log |
| SSH key | `/home/gremlin/.ssh/id_ed25519` | | Service has `gremlin.uid.exempt: "true"` | `mkdir -p` only — no chown or chmod |
| Repo path on znas | `/home/gremlin/services` |
| Webhook path | `gremlin-cicd` | Permission warnings appear in both the ntfy Deploy Complete message and the deploy log detail block.
| ntfy pipeline alerts | `gremlin-alerts` |
| ntfy monitoring alerts | `gremlin-watch` |
| Gatus config | `/DockerVol/gatus/config/config.yaml` |
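The Volume Management scenarios can be sketched as two small shell helpers: one that lists candidate host paths in a stack file and one that picks the action. Both names are hypothetical, and treating "correct perms" as owner `1964:1964` with mode `775` is an assumption inferred from the create action.

```bash
# Hypothetical helper: list unique /DockerVol/ and /data/nfs/ paths found in a stack file.
# A plain text match, so it can also pick up container-side paths with these prefixes.
stack_volume_paths() {
  grep -Eo '(/DockerVol|/data/nfs)/[^:"[:space:]]*' "$1" | LC_ALL=C sort -u
}

# Hypothetical decision helper mirroring the Volume Management scenarios.
# Arguments: exists(yes|no) owner(uid:gid) mode(octal) exempt(true|false)
volume_action() {
  local exists="$1" owner="$2" mode="$3" exempt="$4"
  if [ "$exempt" = "true" ]; then
    echo "mkdir-only"   # gremlin.uid.exempt: mkdir -p only, no chown or chmod
  elif [ "$exists" = "no" ]; then
    echo "create"       # mkdir -p + chown -R 1964:1964 + chmod -R 775
  elif [ "$owner" = "1964:1964" ] && [ "$mode" = "775" ]; then
    echo "none"         # already correct, stay silent
  else
    echo "warn"         # existing directory is left alone; report in permWarnings
  fi
}
```

Keeping the decision separate from the SSH calls makes the rule easy to test without touching a real node.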
--- ---
## ntfy Notifications ## Troubleshooting
| Event | Topic | Priority | ### Pipeline not triggering
|---|---|---|
| Schema blocked | `gremlin-alerts` | 4 (high) | - Verify webhook is active in Forgejo → `traveler/services` → Settings → Webhooks
| Ollama blocked | `gremlin-alerts` | 4 (high) | - Check n8n workflow is active
| Auto-fixed | `gremlin-alerts` | 3 (default) | - Check that changed files are under `swarm/` or `compose/` paths
| Deploy complete | `gremlin-alerts` | 3 (default) |
| Deploy skipped | `gremlin-alerts` | 2 (low) | ### SSH failures
| Non-Swarm file | `gremlin-alerts` | 2 (low) |
| Service down/up | `gremlin-watch` | 3 (default) | - Verify `gremlin` has SSH key access to the target node: `ssh -i /DockerVol/n8n/keys/id_ed25519 gremlin@<node_ip>`
- Verify `NODE_FUNCTION_ALLOW_BUILTIN=child_process` is set in n8n env
- Check that the key mount is read-only: `/DockerVol/n8n/keys:/home/gremlin/.ssh:ro`
### Forgejo API 404 on file fetch
- Verify `FORGEJO_TOKEN` has read access to `traveler/services`
- Verify file path in the push payload matches actual repo path
- Check that `gremlin/config.yaml` exists in the repo
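To reproduce a file fetch by hand, something like the following may help. It is a sketch that assumes the fetch goes through Forgejo's repository contents endpoint (the workflow may use a different one); the function names are hypothetical, and `FORGEJO_URL` and `FORGEJO_TOKEN` are the environment variables from the n8n stack.

```bash
# Build the contents-API URL for a file in a repository.
# Arguments: owner repo path ref
forgejo_contents_url() {
  local base="${FORGEJO_URL%/}" owner="$1" repo="$2" path="$3" ref="$4"
  printf '%s/api/v1/repos/%s/%s/contents/%s?ref=%s\n' \
    "$base" "$owner" "$repo" "$path" "$ref"
}

# Print only the HTTP status code returned for that file.
forgejo_fetch_status() {
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: token ${FORGEJO_TOKEN}" \
    "$(forgejo_contents_url "$@")"
}
```

For instance, `forgejo_fetch_status traveler services gremlin/config.yaml <branch>` would show whether the config file is reachable with the read token.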
### Caddy Sync 404
- Verify `caddy_file_path` in `gremlin/config.yaml` matches the actual path in the repo (`swarm/stack/caddy/Caddyfile`)
- Verify the Caddyfile exists at that path in Forgejo
### Gatus corruption
- The upsert logic splits on ` - name:` boundaries. If the config has non-standard indentation the split may fail. Reset the file manually in Forgejo and push again.
### stat command failing in Prepare Volumes
- The stat calls use three separate SSH invocations (`stat -c %U`, `stat -c %G`, `stat -c %a`) to avoid shell quoting issues. If these fail, check that `stat` is available on the target node.
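The three-call pattern looks roughly like this when run by hand. It is a sketch, not the node's code: `remote_perms`, `GREMLIN_SSH`, and `GREMLIN_KEY` are hypothetical names, and `stat -c` assumes GNU coreutils on the target node.

```bash
# Hypothetical sketch: read owner, group, and mode with three separate SSH calls.
# GREMLIN_SSH can point at a stub for dry runs; it defaults to the real ssh client.
remote_perms() {
  local host="$1" path="$2" run="${GREMLIN_SSH:-ssh}"
  local key="${GREMLIN_KEY:-/home/gremlin/.ssh/id_ed25519}"
  local owner group mode
  # One format specifier per call keeps each remote command trivially simple.
  owner=$("$run" -i "$key" "gremlin@$host" stat -c %U "$path") || return 1
  group=$("$run" -i "$key" "gremlin@$host" stat -c %G "$path") || return 1
  mode=$("$run" -i "$key" "gremlin@$host" stat -c %a "$path") || return 1
  printf '%s:%s %s\n' "$owner" "$group" "$mode"
}
```

If any of the three calls fails, the function returns non-zero, which points at SSH access or a missing `stat` on the node.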
--- ---
## Related ## Rebuilding in a New Project
1. Create service account `gremlin` on all nodes with docker group and passwordless sudo
2. Generate SSH keypair, distribute public key to all nodes
3. Place private key at `/DockerVol/n8n/keys/id_ed25519` on the n8n host
4. Create `gremlin` Forgejo account, grant write access to services and docs repos
5. Generate read and write tokens, add to n8n environment variables
6. Add all node IPs and other env vars to n8n stack compose, redeploy n8n
7. Import `gremlin-cicd-v2.json` workflow into n8n, activate it
8. Create `gremlin/config.yaml` in the services repo with your values
9. Create `gatus/config.yaml` in the services repo (empty endpoints block)
10. Create `Logs/` directory in the docs repo (`touch Logs/.gitkeep && git push`)
11. Add webhook to Forgejo pointing at `https://n8n.netgrimoire.com/webhook/gremlin-cicd`
12. Push a test stack file and verify the pipeline fires
- [Gremlin CI/CD — Operator Guide](gremlin-cicd-guide.md)
- [NetGrimoire Stack Standards](stack-standards.md)
- [Gatus](gatus.md)
- [n8n](n8n.md)