docs: update Netgrimoire/Gremlin-Grimoire/CICD_UserGuide

Administrator 2026-05-06 21:10:09 +00:00 committed by John Smith
parent 52e4ab5e0f
commit 35a791f69d


@ -2,53 +2,74 @@
title: Gremlin CI/CD User Guide title: Gremlin CI/CD User Guide
description: description:
published: true published: true
date: 2026-04-30T18:33:09.881Z date: 2026-05-06T21:10:05.745Z
tags: tags:
editor: markdown editor: markdown
dateCreated: 2026-04-28T20:56:45.863Z dateCreated: 2026-05-03T04:16:19.938Z
--- ---
# Gremlin CI/CD — Operator Guide ---
title: Gremlin CI/CD — User Guide
description: Day-to-day usage guide for the Gremlin CI/CD pipeline. How to write stacks, use directives, interpret notifications, and manage deployments.
published: true
date: 2026-05-06
tags: gremlin, cicd, docker, swarm, compose, netgrimoire
editor: markdown
dateCreated: 2026-05-06
---
# Gremlin CI/CD — User Guide
> **NetGrimoire Infrastructure Reference** > **NetGrimoire Infrastructure Reference**
> How to write, structure, and manage Swarm stacks for the Gremlin CI/CD pipeline. > Day-to-day guide for working with the Gremlin CI/CD pipeline. Covers stack file conventions, directive usage, the auto-fix cycle, commit directives, notifications, and deploy logs.
> For pipeline architecture, see [Gremlin CI/CD Pipeline](gremlin-cicd-wiki.md).
--- ---
## How It Works ## How the Pipeline Works
Push any `.yml` or `.yaml` file under `swarm/` to `traveler/services` and Gremlin takes over: Push a `.yaml` file to `traveler/services` under `swarm/` or `compose/<node>/` and Gremlin takes over. It validates the file, fixes what it can automatically, prepares volume directories on the target node, deploys the stack, and notifies you via ntfy.
1. Fetches the file and classifies it (Swarm, Pocket, or plain Compose) You never need to SSH to a node to deploy — push to Forgejo and Gremlin handles the rest.
2. Runs all schema checkers
3. If issues found and all are fixable — auto-fixes and recommits
4. If issues found and unfixable — sends ntfy alert, stops
5. If all checks pass — runs Ollama audit, then deploys
6. After deploy — updates Gatus monitoring config
You get ntfy notifications at every stage. A clean push produces one notification: ✅ Deploy Complete.
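The decision flow in the numbered steps above can be sketched as a small function. This is an illustration only: the `Issue` shape, the `next_action` name, and the returned action strings are hypothetical and are not taken from Gremlin's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One checker finding (hypothetical shape, for illustration only)."""
    checker: str
    message: str
    fixable: bool


def next_action(issues: list[Issue]) -> str:
    """Map checker results to the next pipeline step described in the list."""
    if not issues:
        # Step 5: all checks pass, so run the Ollama audit and then deploy.
        return "audit-and-deploy"
    if all(issue.fixable for issue in issues):
        # Step 3: every issue is fixable, so auto-fix and recommit.
        return "autofix-and-recommit"
    # Step 4: at least one unfixable issue, so send an ntfy alert and stop.
    return "alert-and-stop"
```

A single unfixable issue is enough to stop the run, even when the other issues in the same file could be fixed.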
--- ---
## Required Stack Structure ## File Locations
Every Swarm service must have these elements. Missing any will block deployment. | Type | Path | Example |
|---|---|---|
| Swarm stack | `swarm/stack/<name>/<name>.yaml` | `swarm/stack/wiki/wiki.yaml` |
| Compose file | `compose/<node>/<name>.yaml` | `compose/znas/namer.yaml` |
| Compose override | `compose/<node>/<name>-override.yaml` | auto-generated |
| Gremlin config | `gremlin/config.yaml` | pipeline settings |
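As a rough sketch of how the path layout in the table could be turned into a file type and target node, the function below classifies a repository path. The function name and the returned dictionary shape are assumptions made for this example, and it only accepts `.yaml` files, matching the paths in the table.

```python
from pathlib import PurePosixPath
from typing import Optional


def classify_path(path: str) -> Optional[dict]:
    """Classify a repo path using the layout in the File Locations table."""
    if not path.endswith(".yaml"):
        return None
    if path == "gremlin/config.yaml":
        return {"kind": "config", "name": "config", "node": None}
    parts = PurePosixPath(path).parts
    # swarm/stack/<name>/<name>.yaml
    if len(parts) == 4 and parts[:2] == ("swarm", "stack"):
        return {"kind": "swarm", "name": parts[2], "node": None}
    # compose/<node>/<name>.yaml or compose/<node>/<name>-override.yaml
    if len(parts) == 3 and parts[0] == "compose":
        stem = PurePosixPath(parts[2]).stem
        suffix = "-override"
        if stem.endswith(suffix):
            return {"kind": "compose-override",
                    "name": stem[:-len(suffix)], "node": parts[1]}
        return {"kind": "compose", "name": stem, "node": parts[1]}
    return None
```

For Compose files the node comes straight from the second path segment, which is how the target node can be derived from the file path alone.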
---
## Swarm Stack Standards
Every service in a Swarm stack must comply with these standards. Gremlin will auto-fix most violations; unfixable ones block the deploy with an ntfy alert.
### Forbidden fields (auto-removed)
```yaml
version: "3.8" # ← removed
container_name: foo # ← removed
hostname: foo # ← removed
restart: unless-stopped # ← removed (use restart_policy)
depends_on: # ← removed
endpoint_mode: dnsrr # ← removed
```
### Required per service
```yaml ```yaml
services: services:
myservice: myservice:
image: vendor/image:tag image: myimage:latest
environment:
PUID: "1964"
PGID: "1964"
TZ: America/Chicago
volumes:
- /DockerVol/myservice:/data # pinned — requires node.hostname
# or
- /data/nfs/znas/Docker/myservice:/data # floating — no hostname needed
networks: networks:
- netgrimoire - netgrimoire
environment:
PUID: "1964" # ← required (or user: "1964:1964")
PGID: "1964"
deploy: deploy:
restart_policy: restart_policy:
condition: any condition: any
@ -57,25 +78,73 @@ services:
window: 120s window: 120s
placement: placement:
constraints: constraints:
- node.platform.arch != aarch64 - node.platform.arch != aarch64 # ← required (unless arm.allow)
- node.platform.arch != arm - node.platform.arch != arm
- node.hostname == znas # required when using /DockerVol/
labels: labels:
caddy: myservice.netgrimoire.com gremlin.version: "2026-04-1" # ← required
caddy: myservice.netgrimoire.com # ← required (or caddy.skip)
caddy.reverse_proxy: myservice:8080 caddy.reverse_proxy: myservice:8080
caddy.import_1: crowdsec caddy.import_1: crowdsec
caddy.import_2: authentik caddy.import_2: authentik
homepage.group: "My Group" # ← required (or homepage.skip)
homepage.name: "My Service"
homepage.icon: "myservice.png"
homepage.href: "https://myservice.netgrimoire.com"
homepage.description: "My service description"
monitor.name: "My Service" # ← required (or monitor.skip)
monitor.url: "http://myservice:8080"
diun.enable: "true" # ← required (or diun.skip)
monitor.name: MyService networks:
monitor.url: http://myservice:8080 # internal URL preferred netgrimoire:
external: true
```
homepage.group: NetGrimoire ### DockerVol placement constraint
homepage.name: MyService
homepage.icon: myservice.png
homepage.href: https://myservice.netgrimoire.com
homepage.description: My service description
diun.enable: "true" Any service with a `/DockerVol/` volume must also have:
```yaml
placement:
constraints:
- node.hostname == znas # or whichever node owns that DockerVol path
```
---
## Compose File Standards
Compose files follow the same label standards as Swarm with these differences:
- No `deploy:` block required (no restart_policy, no ARM exclusions, no placement)
- `restart:` is valid and left untouched
- Deploy is via `docker compose up -d` on the target node (derived from file path)
- Caddy entries are written to the static Caddyfile by Gremlin
```yaml
services:
myservice:
image: myimage:latest
networks:
- netgrimoire
environment:
PUID: "1964"
PGID: "1964"
restart: unless-stopped
labels:
gremlin.version: "2026-04-1"
caddy: myservice.netgrimoire.com
caddy.reverse_proxy: myservice:8080
caddy.import_1: crowdsec
caddy.import_2: authentik
homepage.group: "My Group"
homepage.name: "My Service"
homepage.icon: "myservice.png"
homepage.href: "https://myservice.netgrimoire.com"
homepage.description: "My service description"
monitor.name: "My Service"
monitor.url: "http://myservice:8080"
diun.enable: "true"
networks: networks:
netgrimoire: netgrimoire:
@ -84,355 +153,202 @@ networks:
--- ---
## Volume Path Rules ## The Auto-Fix Cycle
| Path type | Example | Placement constraint | When Gremlin detects fixable violations it:
|---|---|---|
| `/DockerVol/` | `/DockerVol/myservice:/data` | `node.hostname` **required** |
| `/data/nfs/znas/` | `/data/nfs/znas/Docker/myservice:/data` | `node.hostname` **not required** |
Valid hostnames for `node.hostname`: `docker3`, `docker4`, `docker5`, `znas`, `dockerpi1` 1. Fixes them in memory
2. Commits the corrected file back to Forgejo as the `gremlin` account
3. Sends ntfy: **Gremlin: Auto-fixed** listing what changed
4. The fix commit triggers the webhook again
5. The pipeline re-runs on the fixed file
6. If it passes, the stack deploys
You will receive two ntfy messages for an auto-fixed deploy: the fix notification and the deploy complete notification.
### What gets auto-fixed
| Violation | Fix |
|---|---|
| `version:` key present | Removed |
| `container_name:` present | Removed |
| `restart:` present (Swarm) | Removed |
| `depends_on:` present | Removed |
| Missing `PUID`/`PGID` | Added to environment block |
| Missing network | Added |
| Missing `caddy.import_1: crowdsec` | Added |
| Missing `caddy.import_2: authentik` | Added |
| Missing `caddy.reverse_proxy` | Added (derived from ports or gremlin.port) |
| Missing ARM exclusions | Added to placement constraints |
| Missing homepage labels | Added with placeholder values |
| Missing `monitor.name`/`monitor.url` | Added (derived from ports or gremlin.port) |
| Missing `diun.enable` | Added |
| Missing `gremlin.version` | Stamped |
| `kuma.*` labels present | Removed |
### What cannot be auto-fixed (blocks deploy)
| Violation | Why |
|---|---|
| Missing `caddy:` label | Gremlin cannot invent a hostname |
| Invalid `node.hostname` constraint | Must reference a real cluster node |
--- ---
## Identity Rules ## Override Mode for Vendor Files
**Method 1** — LinuxServer.io and homelab images (preferred): For downloaded vendor compose files you do not want Gremlin to modify, add to any service label:
```yaml
environment:
PUID: "1964"
PGID: "1964"
```
**Method 2** — Official Docker Hub images:
```yaml
user: "1964:1964"
```
**Exemption** — Images that manage their own users (Authentik, MailCow, Postgres, Redis):
```yaml ```yaml
labels: labels:
gremlin.uid.exempt: "true" gremlin.autofix.mode: "override"
gremlin.uid.reason: "Postgres manages its own user — requires UID 999"
``` ```
When `uid.exempt` is set, Prepare Volumes will `mkdir` the service's volume paths but will **not** `chown` them. The image manages its own ownership. Gremlin will:
- Never touch the original file
- Write all fixes to `<filename>-override.yaml` in the same directory
- Deploy with `docker compose -f original.yaml -f original-override.yaml up -d`
- Report unfixable issues (like `version:` key) as manual fixes needed in the original
The override file is committed to Forgejo automatically and updated on subsequent pushes.
--- ---
## Caddy Label Rules ## Commit Directives
```yaml Control the pipeline from the commit message without modifying any file. Use this for vendor files, one-off overrides, or emergency controls.
caddy: myservice.netgrimoire.com # hostname only — no https:// prefix
caddy.reverse_proxy: myservice:8080 # service name and port — no IP addresses **Syntax:** append `[gremlin: key=value key2=value2]` anywhere in the commit message.
caddy.import_1: crowdsec # always required
caddy.import_2: authentik # required unless gremlin.authentik.skip is set ```bash
git commit -m "deploy stash after config change [gremlin: deploy=true autofix=false checks.skip=all]"
git commit -m "WIP refactor [gremlin: deploy=false]"
git commit -m "bump image [gremlin: notify=false]"
git commit -m "fix label [gremlin: checks.skip=caddy,homepage]"
``` ```
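To make the `[gremlin: key=value key2=value2]` syntax concrete, here is a minimal parsing sketch. The function name is hypothetical, and two behaviours are assumptions rather than documented Gremlin behaviour: only the first directive block in a message is read, and tokens without `=` are ignored.

```python
import re

# Matches the first "[gremlin: ...]" block anywhere in a commit message.
_DIRECTIVE_BLOCK = re.compile(r"\[gremlin:\s*([^\]]*)\]")


def parse_commit_directives(message: str) -> dict[str, str]:
    """Return the key=value pairs from a commit message, or {} if none."""
    match = _DIRECTIVE_BLOCK.search(message)
    if match is None:
        return {}
    directives: dict[str, str] = {}
    for token in match.group(1).split():
        key, sep, value = token.partition("=")
        if sep and key:  # skip tokens that are not key=value (assumption)
            directives[key] = value
    return directives
```

Values are kept as strings, so a comma-separated value such as `checks.skip=caddy,homepage` still has to be split by whatever consumes it.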
Services without a public URL (internal sidecars, databases): **Commit directives override file directives and config defaults** — they are the highest priority in the pipeline.
```yaml
gremlin.caddy.skip: "true" ### Supported commit directives
| Directive | Values | Effect |
|---|---|---|
| `deploy` | `true` / `false` | Enable or disable deploy for this push |
| `autofix` | `true` / `false` | Enable or disable auto-fix |
| `checks.skip` | comma-separated checker IDs | Skip specific checks |
| `notify` | `true` / `false` | Enable or disable ntfy notifications |
| `notify.level` | `all` / `errors` | Notification verbosity |
| `enable` | `true` / `false` | Enable or disable the entire pipeline for this push |
### Checker IDs for checks.skip
`syntax`, `identity`, `network`, `placement`, `caddy`, `homepage`, `monitor`, `legacy-labels`, `version`, `diun`
Compose checker IDs: `compose-syntax`, `compose-identity`, `compose-network`, `compose-caddy`, `compose-homepage`, `compose-monitor`, `compose-diun`, `compose-version`
---
## Notifications
All ntfy notifications go to `gremlin-alerts` (errors, blocks, fixes) or `gremlin-watch` (deploy complete, skipped).
| Message | Topic | Meaning |
|---|---|---|
| Gremlin: Blocked — Schema | alerts | Unfixable violation — manual fix required |
| Gremlin: Auto-fixed | alerts | Violations fixed and recommitted |
| Gremlin: Blocked — Ollama | alerts | Ollama audit flagged a problem |
| Gremlin: Deploy Complete | watch | Stack deployed successfully |
| Gremlin: Deploy Skipped | watch | Deploy disabled via directive |
| Gremlin: Deploy Failed | alerts | Deploy command failed — check SSH/Docker |
| Gremlin: Compose Blocked | alerts | Compose file violation |
| Gremlin: Compose Auto-fixed | alerts | Compose violations fixed |
| Gremlin: Compose Deploy | watch | Compose stack deployed |
| Gremlin: Maintenance Mode | alerts | Push received during maintenance mode |
### Permission warnings in ntfy
If a volume directory exists with wrong ownership or permissions, the Deploy Complete message will include:
```
Manual permission fixes needed on znas:
• /DockerVol/myservice (current: root:root 755)
sudo chown -R 1964:1964 /DockerVol/myservice && sudo chmod -R 775 /DockerVol/myservice
``` ```
Services that should bypass Authentik but still go through CrowdSec: Gremlin never changes permissions on existing directories — you must apply these manually.
```yaml
gremlin.authentik.skip: "true" ---
## Deploy Logs
Each stack has a log at `traveler/Netgrimoire/Logs/<stackname>.md`. Every deploy (success or failure) appends a table row and a detail block containing the full ntfy message.
```markdown
# Deploy Log — wiki
| Timestamp | Commit | Branch | Pusher | Outcome | Volumes |
|---|---|---|---|---|---|
| 2026-05-06 00:15 UTC | abc1234 | master | traveler | ✅ deployed | 3 vol(s) |
**2026-05-06 00:15 UTC** — ✅ deployed
```
wiki deployed successfully
Branch: master
Pushed by: traveler
SHA: abc1234
Gatus monitors: Wiki
```
``` ```
--- ---
## Monitor Labels ## Maintenance Mode
Gremlin writes monitor endpoints to Gatus after each successful deploy. Monitor URLs should use the internal service name and port so Gatus checks the container directly without depending on Caddy or Authentik being up. Set `maintenance: true` in `gremlin/config.yaml` to pause the pipeline. Each push then produces a single ntfy notification and is otherwise dropped: no deploys, no fixes, no errors.
```yaml ```yaml
monitor.name: MyService # display name in Gatus maintenance: true
monitor.url: http://myservice:8080 # internal URL preferred maintenance_message: "Upgrading ZFS pool — pipeline paused"
monitor.type: http # optional: http | tcp | ping | dns (default: http)
monitor.interval: "60" # optional: seconds, minimum 20 (default: 60)
``` ```
For non-HTTP services (mail, databases): Push the config change to activate maintenance mode, then push again with `maintenance: false` to resume.
```yaml
monitor.type: tcp
monitor.url: tcp://myservice:5432
```
Services that should not be monitored:
```yaml
gremlin.monitor.skip: "true"
```
Gatus determines the check condition from the URL scheme:
- `http://` or `https://` → `[STATUS] == 200`
- `tcp://` or `type: tcp` → `[CONNECTED] == true`
- `type: ping` → `[CONNECTED] == true`
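The mapping in the list above can be written as a small lookup. This is a sketch, not Gatus or Gremlin code: the function name is hypothetical, and letting `monitor.type` take priority over the URL scheme is an assumption. No condition is documented here for `dns`, so the sketch raises an error for it instead of guessing.

```python
from typing import Optional


def gatus_condition(url: str, monitor_type: Optional[str] = None) -> str:
    """Pick the Gatus check condition for a monitor.url / monitor.type pair."""
    kind = (monitor_type or "").strip().lower()
    if kind in ("tcp", "ping") or url.startswith("tcp://"):
        return "[CONNECTED] == true"
    if kind in ("", "http") and url.startswith(("http://", "https://")):
        return "[STATUS] == 200"
    # Anything else (for example type dns) has no documented condition.
    raise ValueError(f"no known condition for url={url!r} type={monitor_type!r}")
```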
--- ---
## Homepage Labels ## Common Workflows
```yaml ### Deploying a new stack
homepage.group: Media # dashboard group
homepage.name: MyService # display name 1. Write the stack file following the standards above
homepage.icon: myservice.png # icon filename 2. Place it at `swarm/stack/<name>/<name>.yaml` or `compose/<node>/<name>.yaml`
homepage.href: https://myservice.netgrimoire.com 3. `git add`, `git commit`, `git push`
homepage.description: Brief description 4. Watch ntfy — Gremlin will auto-fix any violations and deploy
### Redeploying an existing stack
Just push any change to the file. Even a comment or whitespace change triggers the pipeline.
Or use a commit directive to force a redeploy without changing the file:
```bash
git commit --allow-empty -m "redeploy wiki [gremlin: deploy=true]"
git push
``` ```
Services that should not appear on Homepage: ### Deploying a vendor file without modifying it
```yaml
gremlin.homepage.skip: "true" 1. Add `gremlin.autofix.mode: "override"` to one service label
2. Push — Gremlin creates the override file with all required labels
3. Subsequent pushes use the override file automatically
Or use commit directives to skip checks entirely:
```bash
git commit -m "deploy vendor stack [gremlin: autofix=false checks.skip=all deploy=true]"
``` ```
> **Auto-fix note:** If homepage labels are missing, Gremlin derives them from the caddy: label and service name. Group defaults to "New", icon defaults to "servicename.png". Review and correct after auto-fix. ### Checking what Gremlin did
--- - ntfy notifications for immediate status
- `traveler/Netgrimoire/Logs/<stackname>.md` for full history
## Gremlin Directives Reference - Forgejo commit history on `traveler/services` — look for `gremlin:` prefix commits
All directives go inside `deploy.labels`. All are opt-out — a stack with no `gremlin.*` labels gets full treatment.
### Pipeline Control
| Directive | Default | Description |
|---|---|---|
| `gremlin.enable` | `true` | Set `false` to have Gremlin ignore this file entirely on push |
| `gremlin.checks` | `all` | Comma-separated checker IDs to run, or `all` |
| `gremlin.checks.skip` | _(none)_ | Comma-separated checker IDs to skip |
| `gremlin.version` | _(auto)_ | Stamped automatically — do not set manually |
| `gremlin.context` | _(none)_ | Free text passed to Ollama as ground truth — Ollama will not flag anything this explains |
### Auto-fix Control
| Directive | Default | Description |
|---|---|---|
| `gremlin.autofix` | `true` | Set `false` to disable all auto-fixing |
| `gremlin.autofix.skip` | `false` | Set `true` to notify but never attempt to fix |
| `gremlin.autofix.skip_fields` | _(none)_ | Comma-separated fields to skip during fix (e.g. `uid,hostname`) |
### Deploy Control
| Directive | Default | Description |
|---|---|---|
| `gremlin.deploy` | `true` | Set `false` to run checks and fixes but never deploy |
| `gremlin.deploy.strategy` | `stack` | Deployment method — currently only `stack` is implemented |
| `gremlin.port` | _(none)_ | Internal container port when no `ports:` mapping exists — used to derive `caddy.reverse_proxy` and `monitor.url` |
### Identity
| Directive | Default | Description |
|---|---|---|
| `gremlin.uid.exempt` | `false` | Skip PUID/PGID/user checks and skip chown on volumes for this service |
| `gremlin.uid.reason` | _(none)_ | Documents why uid.exempt is set — include with every exemption |
### Placement
| Directive | Default | Description |
|---|---|---|
| `gremlin.arm.allow` | `false` | Allow ARM/Pi deployment — removes ARM exclusion constraint requirement |
### Caddy
| Directive | Default | Description |
|---|---|---|
| `gremlin.caddy.skip` | `false` | Skip all Caddy label checks for this service |
| `gremlin.authentik.skip` | `false` | Skip `caddy.import_2: authentik` requirement only — CrowdSec still required |
### Homepage
| Directive | Default | Description |
|---|---|---|
| `gremlin.homepage.skip` | `false` | Skip Homepage label checks for this service |
### Monitor
| Directive | Default | Description |
|---|---|---|
| `gremlin.monitor.skip` | `false` | Skip monitor label checks for this service |
### Network
| Directive | Default | Description |
|---|---|---|
| `gremlin.network.skip` | `false` | Skip netgrimoire network checks for this service |
### Diun
| Directive | Default | Description |
|---|---|---|
| `gremlin.diun.skip` | `false` | Skip `diun.enable` check for this service |
### Notifications
| Directive | Default | Description |
|---|---|---|
| `gremlin.notify` | `true` | Set `false` to suppress all ntfy notifications for this stack |
| `gremlin.notify.level` | `all` | `all` \| `failures` \| `none` |
---
## Checker IDs
Use these IDs with `gremlin.checks` and `gremlin.checks.skip`:
| ID | What it checks |
|---|---|
| `swarm-syntax` | Forbidden fields: version, container_name, hostname, restart, depends_on, dnsrr |
| `identity` | PUID/PGID 1964 or user: "1964:1964" |
| `network` | netgrimoire overlay network attached |
| `placement` | ARM exclusions, DockerVol/hostname rules, restart_policy |
| `caddy` | caddy: label, reverse_proxy format, import_1/import_2 |
| `homepage` | group, name, icon, href, description |
| `monitor` | monitor.name, monitor.url, optional type/interval |
| `legacy-labels` | Flags kuma.* labels for removal |
| `version` | gremlin.version stamp matches current config version |
| `diun` | diun.enable: "true" present |
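A minimal sketch of how `gremlin.checks` and `gremlin.checks.skip` might combine, using the IDs from the table above. The function names are hypothetical. Treating `all` in the skip list as "run nothing" is an assumption borrowed from the `checks.skip=all` commit directive example, and unknown IDs are silently ignored here.

```python
ALL_CHECKERS = [
    "swarm-syntax", "identity", "network", "placement", "caddy",
    "homepage", "monitor", "legacy-labels", "version", "diun",
]


def _split(value: str) -> list[str]:
    """Split a comma-separated label value into trimmed, non-empty IDs."""
    return [item.strip() for item in value.split(",") if item.strip()]


def select_checkers(labels: dict[str, str]) -> list[str]:
    """Return the checker IDs to run for a service's gremlin.* labels."""
    requested = _split(labels.get("gremlin.checks", "all"))
    skipped = _split(labels.get("gremlin.checks.skip", ""))
    if "all" in skipped:  # assumption: skipping "all" disables every check
        return []
    if "all" in requested:
        selected = list(ALL_CHECKERS)
    else:
        selected = [c for c in ALL_CHECKERS if c in requested]
    return [c for c in selected if c not in skipped]
```

With no `gremlin.*` labels at all, the result is the full list, which matches the opt-out behaviour described in the directives reference.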
---
## Common Patterns
### Internal sidecar (database, cache)
```yaml
postgres:
image: postgres:15
environment:
POSTGRES_USER: myapp
POSTGRES_PASSWORD: secret
volumes:
- /DockerVol/myapp/postgres:/var/lib/postgresql/data
networks:
- netgrimoire
deploy:
restart_policy:
condition: any
delay: 5s
max_attempts: 3
window: 120s
placement:
constraints:
- node.platform.arch != aarch64
- node.platform.arch != arm
- node.hostname == docker4
labels:
gremlin.uid.exempt: "true"
gremlin.uid.reason: "Postgres requires UID 999"
gremlin.caddy.skip: "true"
gremlin.homepage.skip: "true"
gremlin.monitor.skip: "true"
diun.enable: "true"
```
### Service without Authentik (remote browser, public endpoint)
```yaml
labels:
caddy: firefox.netgrimoire.com
caddy.reverse_proxy: firefox:5800
caddy.import_1: crowdsec
gremlin.authentik.skip: "true"
# ... other labels
```
### Service with no web UI and no public port
```yaml
labels:
gremlin.caddy.skip: "true"
gremlin.homepage.skip: "true"
gremlin.monitor.skip: "true"
diun.enable: "true"
```
### Test stack (never deployed)
```yaml
labels:
gremlin.deploy: "false"
# ... other labels
```
### ARM/Pi service
```yaml
labels:
gremlin.arm.allow: "true"
# ... other labels
placement:
constraints:
- node.hostname == dockerpi1
```
### Service with no ports: mapping
```yaml
labels:
gremlin.port: "8080"
# tells Gremlin the internal port for caddy and monitor derivation
# ... other labels
```
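The derivation that `gremlin.port` enables could look roughly like the sketch below. The function name is hypothetical, and the `http://` scheme for `monitor.url` is an assumption based on the internal-URL examples in this guide.

```python
def derive_proxy_and_monitor(service: str, labels: dict[str, str]) -> dict[str, str]:
    """Derive missing caddy.reverse_proxy and monitor.url from gremlin.port."""
    port = labels.get("gremlin.port")
    if port is None:
        return {}
    derived: dict[str, str] = {}
    # Only fill in labels the author has not already set.
    if "caddy.reverse_proxy" not in labels:
        derived["caddy.reverse_proxy"] = f"{service}:{port}"
    if "monitor.url" not in labels:
        derived["monitor.url"] = f"http://{service}:{port}"
    return derived
```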
### Ollama false positive suppression
```yaml
labels:
gremlin.context: "shm_size is set to 1gb — required for this browser application"
# ... other labels
```
---
## Forbidden Fields
These fields are automatically removed by Gremlin:
| Field | Reason |
|---|---|
| `version:` (top-level) | Obsolete under the Compose Specification |
| `container_name:` | Conflicts with Swarm service naming |
| `hostname:` (service-level) | Conflicts with Swarm DNS |
| `restart:` (service-level) | Use `deploy.restart_policy` instead |
| `depends_on:` | Not supported in Swarm mode |
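As an illustration of the removal step, the sketch below strips the five fields in the table above from an already-parsed stack file. It is not Gremlin's implementation: the function name and the returned change list are hypothetical, and the input is assumed to be the dictionary a YAML parser would produce.

```python
import copy

TOP_LEVEL_FORBIDDEN = ("version",)
SERVICE_FORBIDDEN = ("container_name", "hostname", "restart", "depends_on")


def strip_forbidden(stack: dict) -> tuple[dict, list[str]]:
    """Return a cleaned copy of the stack plus the list of removed keys."""
    fixed = copy.deepcopy(stack)  # never mutate the caller's data
    removed: list[str] = []
    for key in TOP_LEVEL_FORBIDDEN:
        if key in fixed:
            del fixed[key]
            removed.append(key)
    for name, service in (fixed.get("services") or {}).items():
        if not isinstance(service, dict):
            continue  # empty or malformed service entry
        for key in SERVICE_FORBIDDEN:
            if key in service:
                del service[key]
                removed.append(f"services.{name}.{key}")
    return fixed, removed
```

Returning the list of removed keys alongside the cleaned copy gives a notification something to report about what changed.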
These fields cause an **unfixable** block — Gremlin cannot fix them automatically:
| Field | Reason |
|---|---|
| `endpoint_mode: dnsrr` | Breaks internal DNS resolution — VIP mode required |
| Missing `deploy:` block | File treated as plain Compose, not Swarm |
| `/DockerVol/` without `node.hostname` | Gremlin cannot guess the target node |
---
## Troubleshooting
**"Missing deploy: block" — file skipped as non-Swarm**
Your compose file has no `deploy:` section. Add a `deploy:` block to each service.
**"uses /DockerVol/ but has no node.hostname constraint" — unfixable**
Add a `node.hostname` constraint to `deploy.placement.constraints`. Gremlin cannot guess which node to pin it to.
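The rule behind this error can be sketched as a check on a parsed service definition. The function name and message strings are hypothetical, the hostname list is copied from the volume path rules in this guide, and only short-form string volumes are inspected (an assumption).

```python
import re

VALID_HOSTNAMES = {"docker3", "docker4", "docker5", "znas", "dockerpi1"}
_HOSTNAME = re.compile(r"^node\.hostname\s*==\s*(\S+)$")


def check_dockervol_placement(service: dict) -> list[str]:
    """Return placement errors for one parsed service definition."""
    volumes = service.get("volumes") or []
    uses_dockervol = any(
        isinstance(v, str) and v.startswith("/DockerVol/") for v in volumes
    )
    placement = (service.get("deploy") or {}).get("placement") or {}
    constraints = placement.get("constraints") or []
    hostnames = [
        m.group(1) for c in constraints if (m := _HOSTNAME.match(c.strip()))
    ]
    errors = [
        f"invalid node.hostname constraint: {h}"
        for h in hostnames if h not in VALID_HOSTNAMES
    ]
    if uses_dockervol and not hostnames:
        errors.append("uses /DockerVol/ but has no node.hostname constraint")
    return errors
```

A service that mounts only `/data/nfs/znas/` paths passes without any hostname constraint, which is what makes it floating.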
**PUID/PGID landing under volumes:**
Your service has no `environment:` block. Gremlin now creates one before `volumes:` automatically. If it still happens, add an `environment:` block manually with at least one entry.
**Ollama keeps blocking on legitimate config**
Add `gremlin.context` explaining the situation. Ollama treats it as ground truth.
**Auto-fix loop — same issues reappear after fix**
Check label indentation — labels inside `deploy.labels` must be indented 8 spaces consistently.
**Deploy skipped every time**
Check `gremlin.deploy` in the stack labels and in `gremlin/config.yaml`. Global `deploy: false` overrides all stacks unless the stack explicitly sets `gremlin.deploy: "true"`.
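One way to reason about a stack that never deploys is to write the precedence out explicitly. The sketch below assumes the order described in this guide (commit directive, then stack label, then global config, then the default of `true`). The function and parameter names are hypothetical.

```python
from typing import Optional


def _as_bool(value: str) -> bool:
    """Interpret a label or directive string such as "true" or "false"."""
    return value.strip().lower() == "true"


def resolve_deploy(
    config_deploy: Optional[bool] = None,
    stack_label: Optional[str] = None,
    commit_directive: Optional[str] = None,
) -> bool:
    """Decide whether to deploy; the most specific setting wins."""
    if commit_directive is not None:   # [gremlin: deploy=...] in the commit
        return _as_bool(commit_directive)
    if stack_label is not None:        # gremlin.deploy label in the stack
        return _as_bool(stack_label)
    if config_deploy is not None:      # deploy: in gremlin/config.yaml
        return config_deploy
    return True                        # documented default
```

If this returns `False` for your inputs, the first branch that fired tells you which setting to change.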
**Service shows up as "netgrimoire" in checker errors**
The file has a blank line between `services:` and the first service name — this was a known bug fixed in pipeline v2026-04-30.
---
## Related
- [Gremlin CI/CD Pipeline](gremlin-cicd-wiki.md)
- [NetGrimoire Stack Standards](stack-standards.md)
- [Gatus](gatus.md)