diff --git a/Netgrimoire/Gremlin-Grimoire/CICD_UserGuide.md b/Netgrimoire/Gremlin-Grimoire/CICD_UserGuide.md index 2e18912..2ab02a5 100644 --- a/Netgrimoire/Gremlin-Grimoire/CICD_UserGuide.md +++ b/Netgrimoire/Gremlin-Grimoire/CICD_UserGuide.md @@ -2,53 +2,74 @@ title: Gremlin CI/CD User Guide description: published: true -date: 2026-04-30T18:33:09.881Z +date: 2026-05-06T21:10:05.745Z tags: editor: markdown -dateCreated: 2026-04-28T20:56:45.863Z +dateCreated: 2026-05-03T04:16:19.938Z --- -# Gremlin CI/CD — Operator Guide +--- +title: Gremlin CI/CD — User Guide +description: Day-to-day usage guide for the Gremlin CI/CD pipeline. How to write stacks, use directives, interpret notifications, and manage deployments. +published: true +date: 2026-05-06 +tags: gremlin, cicd, docker, swarm, compose, netgrimoire +editor: markdown +dateCreated: 2026-05-06 +--- + +# Gremlin CI/CD — User Guide > **NetGrimoire Infrastructure Reference** -> How to write, structure, and manage Swarm stacks for the Gremlin CI/CD pipeline. -> For pipeline architecture, see [Gremlin CI/CD Pipeline](gremlin-cicd-wiki.md). +> Day-to-day guide for working with the Gremlin CI/CD pipeline. Covers stack file conventions, directive usage, the auto-fix cycle, commit directives, notifications, and deploy logs. --- -## How It Works +## How the Pipeline Works -Push any `.yml` or `.yaml` file under `swarm/` to `traveler/services` and Gremlin takes over: +Push a `.yaml` file to `traveler/services` under `swarm/` or `compose//` and Gremlin takes over. It validates the file, fixes what it can automatically, prepares volume directories on the target node, deploys the stack, and notifies you via ntfy. -1. Fetches the file and classifies it (Swarm, Pocket, or plain Compose) -2. Runs all schema checkers -3. If issues found and all are fixable — auto-fixes and recommits -4. If issues found and unfixable — sends ntfy alert, stops -5. If all checks pass — runs Ollama audit, then deploys -6. 
After deploy — updates Gatus monitoring config - -You get ntfy notifications at every stage. A clean push produces one notification: ✅ Deploy Complete. +You never need to SSH to a node to deploy — push to Forgejo and Gremlin handles the rest. --- -## Required Stack Structure +## File Locations -Every Swarm service must have these elements. Missing any will block deployment. +| Type | Path | Example | +|---|---|---| +| Swarm stack | `swarm/stack//.yaml` | `swarm/stack/wiki/wiki.yaml` | +| Compose file | `compose//.yaml` | `compose/znas/namer.yaml` | +| Compose override | `compose//-override.yaml` | auto-generated | +| Gremlin config | `gremlin/config.yaml` | pipeline settings | + +--- + +## Swarm Stack Standards + +Every service in a Swarm stack must comply with these standards. Gremlin will auto-fix most violations; unfixable ones block the deploy with an ntfy alert. + +### Forbidden fields (auto-removed) + +```yaml +version: "3.8" # ← removed +container_name: foo # ← removed +hostname: foo # ← removed +restart: unless-stopped # ← removed (use restart_policy) +depends_on: # ← removed +endpoint_mode: dnsrr # ← removed +``` + +### Required per service ```yaml services: myservice: - image: vendor/image:tag - environment: - PUID: "1964" - PGID: "1964" - TZ: America/Chicago - volumes: - - /DockerVol/myservice:/data # pinned — requires node.hostname - # or - - /data/nfs/znas/Docker/myservice:/data # floating — no hostname needed + image: myimage:latest networks: - netgrimoire + environment: + PUID: "1964" # ← required (or user: "1964:1964") + PGID: "1964" deploy: restart_policy: condition: any @@ -57,25 +78,73 @@ services: window: 120s placement: constraints: - - node.platform.arch != aarch64 + - node.platform.arch != aarch64 # ← required (unless arm.allow) - node.platform.arch != arm - - node.hostname == znas # required when using /DockerVol/ labels: - caddy: myservice.netgrimoire.com + gremlin.version: "2026-04-1" # ← required + caddy: myservice.netgrimoire.com # ← 
required (or caddy.skip) caddy.reverse_proxy: myservice:8080 caddy.import_1: crowdsec caddy.import_2: authentik + homepage.group: "My Group" # ← required (or homepage.skip) + homepage.name: "My Service" + homepage.icon: "myservice.png" + homepage.href: "https://myservice.netgrimoire.com" + homepage.description: "My service description" + monitor.name: "My Service" # ← required (or monitor.skip) + monitor.url: "http://myservice:8080" + diun.enable: "true" # ← required (or diun.skip) - monitor.name: MyService - monitor.url: http://myservice:8080 # internal URL preferred +networks: + netgrimoire: + external: true +``` - homepage.group: NetGrimoire - homepage.name: MyService - homepage.icon: myservice.png - homepage.href: https://myservice.netgrimoire.com - homepage.description: My service description +### DockerVol placement constraint - diun.enable: "true" +Any service with a `/DockerVol/` volume must also have: + +```yaml +placement: + constraints: + - node.hostname == znas # or whichever node owns that DockerVol path +``` + +--- + +## Compose File Standards + +Compose files follow the same label standards as Swarm with these differences: + +- No `deploy:` block required (no restart_policy, no ARM exclusions, no placement) +- `restart:` is valid and left untouched +- Deploy is via `docker compose up -d` on the target node (derived from file path) +- Caddy entries are written to the static Caddyfile by Gremlin + +```yaml +services: + myservice: + image: myimage:latest + networks: + - netgrimoire + environment: + PUID: "1964" + PGID: "1964" + restart: unless-stopped + labels: + gremlin.version: "2026-04-1" + caddy: myservice.netgrimoire.com + caddy.reverse_proxy: myservice:8080 + caddy.import_1: crowdsec + caddy.import_2: authentik + homepage.group: "My Group" + homepage.name: "My Service" + homepage.icon: "myservice.png" + homepage.href: "https://myservice.netgrimoire.com" + homepage.description: "My service description" + monitor.name: "My Service" + monitor.url: 
"http://myservice:8080" + diun.enable: "true" networks: netgrimoire: @@ -84,355 +153,202 @@ networks: --- -## Volume Path Rules +## The Auto-Fix Cycle -| Path type | Example | Placement constraint | -|---|---|---| -| `/DockerVol/` | `/DockerVol/myservice:/data` | `node.hostname` **required** | -| `/data/nfs/znas/` | `/data/nfs/znas/Docker/myservice:/data` | `node.hostname` **not required** | +When Gremlin detects fixable violations it: -Valid hostnames for `node.hostname`: `docker3`, `docker4`, `docker5`, `znas`, `dockerpi1` +1. Fixes them in memory +2. Commits the corrected file back to Forgejo as the `gremlin` account +3. Sends ntfy: **Gremlin: Auto-fixed** listing what changed +4. The fix commit triggers the webhook again +5. The pipeline re-runs on the fixed file +6. If it passes, the stack deploys + +You will receive two ntfy messages for an auto-fixed deploy: the fix notification and the deploy complete notification. + +### What gets auto-fixed + +| Violation | Fix | +|---|---| +| `version:` key present | Removed | +| `container_name:` present | Removed | +| `restart:` present (Swarm) | Removed | +| `depends_on:` present | Removed | +| Missing `PUID`/`PGID` | Added to environment block | +| Missing network | Added | +| Missing `caddy.import_1: crowdsec` | Added | +| Missing `caddy.import_2: authentik` | Added | +| Missing `caddy.reverse_proxy` | Added (derived from ports or gremlin.port) | +| Missing ARM exclusions | Added to placement constraints | +| Missing homepage labels | Added with placeholder values | +| Missing `monitor.name`/`monitor.url` | Added (derived from ports or gremlin.port) | +| Missing `diun.enable` | Added | +| Missing `gremlin.version` | Stamped | +| `kuma.*` labels present | Removed | + +### What cannot be auto-fixed (blocks deploy) + +| Violation | Why | +|---|---| +| Missing `caddy:` label | Gremlin cannot invent a hostname | +| Invalid `node.hostname` constraint | Must reference a real cluster node | --- -## Identity Rules +## 
Override Mode for Vendor Files -**Method 1** — LinuxServer.io and homelab images (preferred): -```yaml -environment: - PUID: "1964" - PGID: "1964" -``` +For downloaded vendor compose files you do not want Gremlin to modify, add to any service label: -**Method 2** — Official Docker Hub images: -```yaml -user: "1964:1964" -``` - -**Exemption** — Images that manage their own users (Authentik, MailCow, Postgres, Redis): ```yaml labels: - gremlin.uid.exempt: "true" - gremlin.uid.reason: "Postgres manages its own user — requires UID 999" + gremlin.autofix.mode: "override" ``` -When `uid.exempt` is set, Prepare Volumes will `mkdir` the service's volume paths but will **not** `chown` them. The image manages its own ownership. +Gremlin will: +- Never touch the original file +- Write all fixes to `-override.yaml` in the same directory +- Deploy with `docker compose -f original.yaml -f original-override.yaml up -d` +- Report unfixable issues (like `version:` key) as manual fixes needed in the original + +The override file is committed to Forgejo automatically and updated on subsequent pushes. --- -## Caddy Label Rules +## Commit Directives -```yaml -caddy: myservice.netgrimoire.com # hostname only — no https:// prefix -caddy.reverse_proxy: myservice:8080 # service name and port — no IP addresses -caddy.import_1: crowdsec # always required -caddy.import_2: authentik # required unless gremlin.authentik.skip is set +Control the pipeline from the commit message without modifying any file. Use this for vendor files, one-off overrides, or emergency controls. + +**Syntax:** append `[gremlin: key=value key2=value2]` anywhere in the commit message. 
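As a rough illustration of the syntax, the following Python sketch extracts a `[gremlin: ...]` block from a commit message. It is not Gremlin's actual parser: the function name `parse_commit_directives`, the regular expression, and the choice to read only the first block and to ignore tokens without `=` are all assumptions made for this example.

```python
import re

# Assumed pattern: a "[gremlin: ...]" block anywhere in the message.
# Only the first block is read; how Gremlin treats several blocks is not documented here.
DIRECTIVE_BLOCK = re.compile(r"\[gremlin:\s*([^\]]*)\]")


def parse_commit_directives(message: str) -> dict[str, str]:
    """Return the key/value pairs from a [gremlin: ...] block, or {} if there is none."""
    match = DIRECTIVE_BLOCK.search(message)
    if match is None:
        return {}
    directives: dict[str, str] = {}
    # Pairs are separated by whitespace; a value may itself contain commas.
    for token in match.group(1).split():
        key, sep, value = token.partition("=")
        if sep and key:  # skip tokens that are not key=value (an assumption)
            directives[key] = value
    return directives


if __name__ == "__main__":
    print(parse_commit_directives("fix label [gremlin: checks.skip=caddy,homepage]"))
    # {'checks.skip': 'caddy,homepage'}
```

Values are kept as strings, so a caller would still need to interpret `true`/`false` and split comma-separated checker IDs.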
+ +```bash +git commit -m "deploy stash after config change [gremlin: deploy=true autofix=false checks.skip=all]" +git commit -m "WIP refactor [gremlin: deploy=false]" +git commit -m "bump image [gremlin: notify=false]" +git commit -m "fix label [gremlin: checks.skip=caddy,homepage]" ``` -Services without a public URL (internal sidecars, databases): -```yaml -gremlin.caddy.skip: "true" +**Commit directives override file directives and config defaults** — they are the highest priority in the pipeline. + +### Supported commit directives + +| Directive | Values | Effect | +|---|---|---| +| `deploy` | `true` / `false` | Enable or disable deploy for this push | +| `autofix` | `true` / `false` | Enable or disable auto-fix | +| `checks.skip` | comma-separated checker IDs | Skip specific checks | +| `notify` | `true` / `false` | Enable or disable ntfy notifications | +| `notify.level` | `all` / `errors` | Notification verbosity | +| `enable` | `true` / `false` | Enable or disable the entire pipeline for this push | + +### Checker IDs for checks.skip + +`syntax`, `identity`, `network`, `placement`, `caddy`, `homepage`, `monitor`, `legacy-labels`, `version`, `diun` + +Compose checker IDs: `compose-syntax`, `compose-identity`, `compose-network`, `compose-caddy`, `compose-homepage`, `compose-monitor`, `compose-diun`, `compose-version` + +--- + +## Notifications + +All ntfy notifications go to `gremlin-alerts` (errors, blocks, fixes) or `gremlin-watch` (deploy complete, skipped). 
+ +| Message | Topic | Meaning | +|---|---|---| +| Gremlin: Blocked — Schema | alerts | Unfixable violation — manual fix required | +| Gremlin: Auto-fixed | alerts | Violations fixed and recommitted | +| Gremlin: Blocked — Ollama | alerts | Ollama audit flagged a problem | +| Gremlin: Deploy Complete | watch | Stack deployed successfully | +| Gremlin: Deploy Skipped | watch | Deploy disabled via directive | +| Gremlin: Deploy Failed | alerts | Deploy command failed — check SSH/Docker | +| Gremlin: Compose Blocked | alerts | Compose file violation | +| Gremlin: Compose Auto-fixed | alerts | Compose violations fixed | +| Gremlin: Compose Deploy | watch | Compose stack deployed | +| Gremlin: Maintenance Mode | alerts | Push received during maintenance mode | + +### Permission warnings in ntfy + +If a volume directory exists with wrong ownership or permissions, the Deploy Complete message will include: + +``` +Manual permission fixes needed on znas: +• /DockerVol/myservice (current: root:root 755) + sudo chown -R 1964:1964 /DockerVol/myservice && sudo chmod -R 775 /DockerVol/myservice ``` -Services that should bypass Authentik but still go through CrowdSec: -```yaml -gremlin.authentik.skip: "true" +Gremlin never changes permissions on existing directories — you must apply these manually. + +--- + +## Deploy Logs + +Each stack has a log at `traveler/Netgrimoire/Logs/.md`. Every deploy (success or failure) appends a table row and a detail block containing the full ntfy message. + +```markdown +# Deploy Log — wiki + +| Timestamp | Commit | Branch | Pusher | Outcome | Volumes | +|---|---|---|---|---|---| +| 2026-05-06 00:15 UTC | abc1234 | master | traveler | ✅ deployed | 3 vol(s) | + +**2026-05-06 00:15 UTC** — ✅ deployed +​``` +wiki deployed successfully +Branch: master +Pushed by: traveler +SHA: abc1234 +Gatus monitors: Wiki +​``` ``` --- -## Monitor Labels +## Maintenance Mode -Gremlin writes monitor endpoints to Gatus after each successful deploy. 
Monitor URLs should use the internal service name and port so Gatus checks the container directly without depending on Caddy or Authentik being up. +Set `maintenance: true` in `gremlin/config.yaml` to pause the pipeline. Any push will receive a single ntfy notification and be silently dropped. No deploys, no fixes, no errors. ```yaml -monitor.name: MyService # display name in Gatus -monitor.url: http://myservice:8080 # internal URL preferred -monitor.type: http # optional: http | tcp | ping | dns (default: http) -monitor.interval: "60" # optional: seconds, minimum 20 (default: 60) +maintenance: true +maintenance_message: "Upgrading ZFS pool — pipeline paused" ``` -For non-HTTP services (mail, databases): -```yaml -monitor.type: tcp -monitor.url: tcp://myservice:5432 -``` - -Services that should not be monitored: -```yaml -gremlin.monitor.skip: "true" -``` - -Gatus determines the check condition from the URL scheme: -- `http://` or `https://` → `[STATUS] == 200` -- `tcp://` or `type: tcp` → `[CONNECTED] == true` -- `type: ping` → `[CONNECTED] == true` +Push the config change to activate, push again with `maintenance: false` to resume. --- -## Homepage Labels +## Common Workflows -```yaml -homepage.group: Media # dashboard group -homepage.name: MyService # display name -homepage.icon: myservice.png # icon filename -homepage.href: https://myservice.netgrimoire.com -homepage.description: Brief description +### Deploying a new stack + +1. Write the stack file following the standards above +2. Place it at `swarm/stack//.yaml` or `compose//.yaml` +3. `git add`, `git commit`, `git push` +4. Watch ntfy — Gremlin will auto-fix any violations and deploy + +### Redeploying an existing stack + +Just push any change to the file. Even a comment or whitespace change triggers the pipeline. 
+ +Or use a commit directive to force a redeploy without changing the file: + +```bash +git commit --allow-empty -m "redeploy wiki [gremlin: deploy=true]" +git push ``` -Services that should not appear on Homepage: -```yaml -gremlin.homepage.skip: "true" +### Deploying a vendor file without modifying it + +1. Add `gremlin.autofix.mode: "override"` to one service label +2. Push — Gremlin creates the override file with all required labels +3. Subsequent pushes use the override file automatically + +Or use commit directives to skip checks entirely: + +```bash +git commit -m "deploy vendor stack [gremlin: autofix=false checks.skip=all deploy=true]" ``` -> **Auto-fix note:** If homepage labels are missing, Gremlin derives them from the caddy: label and service name. Group defaults to "New", icon defaults to "servicename.png". Review and correct after auto-fix. +### Checking what Gremlin did ---- - -## Gremlin Directives Reference - -All directives go inside `deploy.labels`. All are opt-out — a stack with no `gremlin.*` labels gets full treatment. - -### Pipeline Control - -| Directive | Default | Description | -|---|---|---| -| `gremlin.enable` | `true` | Set `false` to have Gremlin ignore this file entirely on push | -| `gremlin.checks` | `all` | Comma-separated checker IDs to run, or `all` | -| `gremlin.checks.skip` | _(none)_ | Comma-separated checker IDs to skip | -| `gremlin.version` | _(auto)_ | Stamped automatically — do not set manually | -| `gremlin.context` | _(none)_ | Free text passed to Ollama as ground truth — Ollama will not flag anything this explains | - -### Auto-fix Control - -| Directive | Default | Description | -|---|---|---| -| `gremlin.autofix` | `true` | Set `false` to disable all auto-fixing | -| `gremlin.autofix.skip` | `false` | Set `true` to notify but never attempt to fix | -| `gremlin.autofix.skip_fields` | _(none)_ | Comma-separated fields to skip during fix (e.g. 
`uid,hostname`) | - -### Deploy Control - -| Directive | Default | Description | -|---|---|---| -| `gremlin.deploy` | `true` | Set `false` to run checks and fixes but never deploy | -| `gremlin.deploy.strategy` | `stack` | Deployment method — currently only `stack` is implemented | -| `gremlin.port` | _(none)_ | Internal container port when no `ports:` mapping exists — used to derive `caddy.reverse_proxy` and `monitor.url` | - -### Identity - -| Directive | Default | Description | -|---|---|---| -| `gremlin.uid.exempt` | `false` | Skip PUID/PGID/user checks and skip chown on volumes for this service | -| `gremlin.uid.reason` | _(none)_ | Documents why uid.exempt is set — include with every exemption | - -### Placement - -| Directive | Default | Description | -|---|---|---| -| `gremlin.arm.allow` | `false` | Allow ARM/Pi deployment — removes ARM exclusion constraint requirement | - -### Caddy - -| Directive | Default | Description | -|---|---|---| -| `gremlin.caddy.skip` | `false` | Skip all Caddy label checks for this service | -| `gremlin.authentik.skip` | `false` | Skip `caddy.import_2: authentik` requirement only — CrowdSec still required | - -### Homepage - -| Directive | Default | Description | -|---|---|---| -| `gremlin.homepage.skip` | `false` | Skip Homepage label checks for this service | - -### Monitor - -| Directive | Default | Description | -|---|---|---| -| `gremlin.monitor.skip` | `false` | Skip monitor label checks for this service | - -### Network - -| Directive | Default | Description | -|---|---|---| -| `gremlin.network.skip` | `false` | Skip netgrimoire network checks for this service | - -### Diun - -| Directive | Default | Description | -|---|---|---| -| `gremlin.diun.skip` | `false` | Skip `diun.enable` check for this service | - -### Notifications - -| Directive | Default | Description | -|---|---|---| -| `gremlin.notify` | `true` | Set `false` to suppress all ntfy notifications for this stack | -| `gremlin.notify.level` | `all` | `all` \| 
`failures` \| `none` | - ---- - -## Checker IDs - -Use these IDs with `gremlin.checks` and `gremlin.checks.skip`: - -| ID | What it checks | -|---|---| -| `swarm-syntax` | Forbidden fields: version, container_name, hostname, restart, depends_on, dnsrr | -| `identity` | PUID/PGID 1964 or user: "1964:1964" | -| `network` | netgrimoire overlay network attached | -| `placement` | ARM exclusions, DockerVol/hostname rules, restart_policy | -| `caddy` | caddy: label, reverse_proxy format, import_1/import_2 | -| `homepage` | group, name, icon, href, description | -| `monitor` | monitor.name, monitor.url, optional type/interval | -| `legacy-labels` | Flags kuma.* labels for removal | -| `version` | gremlin.version stamp matches current config version | -| `diun` | diun.enable: "true" present | - ---- - -## Common Patterns - -### Internal sidecar (database, cache) - -```yaml - postgres: - image: postgres:15 - environment: - POSTGRES_USER: myapp - POSTGRES_PASSWORD: secret - volumes: - - /DockerVol/myapp/postgres:/var/lib/postgresql/data - networks: - - netgrimoire - deploy: - restart_policy: - condition: any - delay: 5s - max_attempts: 3 - window: 120s - placement: - constraints: - - node.platform.arch != aarch64 - - node.platform.arch != arm - - node.hostname == docker4 - labels: - gremlin.uid.exempt: "true" - gremlin.uid.reason: "Postgres requires UID 999" - gremlin.caddy.skip: "true" - gremlin.homepage.skip: "true" - gremlin.monitor.skip: "true" - diun.enable: "true" -``` - -### Service without Authentik (remote browser, public endpoint) - -```yaml - labels: - caddy: firefox.netgrimoire.com - caddy.reverse_proxy: firefox:5800 - caddy.import_1: crowdsec - gremlin.authentik.skip: "true" - # ... 
other labels -``` - -### Service with no web UI and no public port - -```yaml - labels: - gremlin.caddy.skip: "true" - gremlin.homepage.skip: "true" - gremlin.monitor.skip: "true" - diun.enable: "true" -``` - -### Test stack (never deployed) - -```yaml - labels: - gremlin.deploy: "false" - # ... other labels -``` - -### ARM/Pi service - -```yaml - labels: - gremlin.arm.allow: "true" - # ... other labels - placement: - constraints: - - node.hostname == dockerpi1 -``` - -### Service with no ports: mapping - -```yaml - labels: - gremlin.port: "8080" - # tells Gremlin the internal port for caddy and monitor derivation - # ... other labels -``` - -### Ollama false positive suppression - -```yaml - labels: - gremlin.context: "shm_size is set to 1gb — required for this browser application" - # ... other labels -``` - ---- - -## Forbidden Fields - -These fields are automatically removed by Gremlin: - -| Field | Reason | -|---|---| -| `version:` (top-level) | Obsolete in Compose v3 | -| `container_name:` | Conflicts with Swarm service naming | -| `hostname:` (service-level) | Conflicts with Swarm DNS | -| `restart:` (service-level) | Use `deploy.restart_policy` instead | -| `depends_on:` | Not supported in Swarm mode | - -These fields cause an **unfixable** block — Gremlin cannot fix them automatically: - -| Field | Reason | -|---|---| -| `endpoint_mode: dnsrr` | Breaks internal DNS resolution — VIP mode required | -| Missing `deploy:` block | File treated as plain Compose, not Swarm | -| `/DockerVol/` without `node.hostname` | Gremlin cannot guess the target node | - ---- - -## Troubleshooting - -**"Missing deploy: block" — file skipped as non-Swarm** -Your compose file has no `deploy:` section. Add a `deploy:` block to each service. - -**"uses /DockerVol/ but has no node.hostname constraint" — unfixable** -Add a `node.hostname` constraint to `deploy.placement.constraints`. Gremlin cannot guess which node to pin it to. 
- -**PUID/PGID landing under volumes:** -Your service has no `environment:` block. Gremlin now creates one before `volumes:` automatically. If it still happens, add an `environment:` block manually with at least one entry. - -**Ollama keeps blocking on legitimate config** -Add `gremlin.context` explaining the situation. Ollama treats it as ground truth. - -**Auto-fix loop — same issues reappear after fix** -Check label indentation — labels inside `deploy.labels` must be indented 8 spaces consistently. - -**Deploy skipped every time** -Check `gremlin.deploy` in the stack labels and in `gremlin/config.yaml`. Global `deploy: false` overrides all stacks unless the stack explicitly sets `gremlin.deploy: "true"`. - -**Service shows up as "netgrimoire" in checker errors** -The file has a blank line between `services:` and the first service name — this was a known bug fixed in pipeline v2026-04-30. - ---- - -## Related - -- [Gremlin CI/CD Pipeline](gremlin-cicd-wiki.md) -- [NetGrimoire Stack Standards](stack-standards.md) -- [Gatus](gatus.md) +- ntfy notifications for immediate status +- `traveler/Netgrimoire/Logs/.md` for full history +- Forgejo commit history on `traveler/services` — look for `gremlin:` prefix commits
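
The precedence rule from the Commit Directives section (commit directives override file directives and config defaults) can be pictured as a layered merge. The Python sketch below is illustrative only. The function name `resolve_settings`, the idea of stripping a `gremlin.` prefix so that file labels and commit keys line up, and the ordering of file directives above config defaults are assumptions, not a description of Gremlin's internals.

```python
def normalize_label_key(key: str) -> str:
    """Strip an assumed 'gremlin.' prefix so file labels match commit directive keys."""
    prefix = "gremlin."
    return key[len(prefix):] if key.startswith(prefix) else key


def resolve_settings(
    config_defaults: dict[str, str],
    file_directives: dict[str, str],
    commit_directives: dict[str, str],
) -> dict[str, str]:
    """Merge settings so that later sources win: config < file < commit (assumed order)."""
    resolved = dict(config_defaults)
    # File-level gremlin.* labels are assumed to sit between config and commit.
    for key, value in file_directives.items():
        resolved[normalize_label_key(key)] = value
    # Commit directives are documented as the highest priority.
    resolved.update(commit_directives)
    return resolved


if __name__ == "__main__":
    settings = resolve_settings(
        config_defaults={"deploy": "true", "notify": "true"},
        file_directives={"gremlin.deploy": "false"},
        commit_directives={"deploy": "true"},
    )
    print(settings["deploy"])  # true
```

In this example a stack whose labels disable deploy is still deployed for one push, because the commit directive `deploy=true` wins the merge.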