docs(gremlin): update monitoring

2026-04-07 20:50:29 -05:00 · 52bd03c32d
commit 52bd03c32d
parent 6f052e9bbc
1 changed file with 75 additions and 34 deletions
--- a/Netgrimoire/Services/monitoring/monitoring.md
+++ b/Netgrimoire/Services/monitoring/monitoring.md
@@ -1,39 +1,55 @@
 ---
 title: monitoring Stack
-description: NetGrimoire Monitoring Services
+description: Real-time monitoring of NetGrimoire services
 published: true
-date: 2026-04-08T01:37:42.636Z
+date: 2026-04-08T01:48:22.128Z
 tags: docker,swarm,monitoring,netgrimoire
 editor: markdown
-dateCreated: 2026-04-08T01:37:42.636Z
+dateCreated: 2026-04-08T01:48:22.128Z
 ---
 # monitoring
 ## Overview
-The monitoring stack in NetGrimoire is designed to provide real-time metrics and dashboards for system health and performance monitoring. The stack consists of Prometheus, Grafana, Alertmanager, Cadvisor, Node Exporter, and Uptime Kuma.
+The monitoring stack is a critical component of NetGrimoire, providing real-time insight into the performance and health of its services. It consists of five primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
 | Service | Image | Port | Role |
 |---------|-----|-----|---------|
 - **Prometheus:** docker4
 - **Grafana:** docker4
 - **Alertmanager:** docker4
 - **Cadvisor:** global (runs on all nodes)
 - **Node Exporter:** global (runs on all nodes)
 Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
 Homepage group: Monitoring
 ---
 ## Architecture
 ```markdown
 | Service | Image | Port | Role |
-|---------|-------|-----|------|
+|---------|-----|-----|---------|
- **Prometheus:** prom/prometheus:latest | 9090 | Metrics Collection |
+- **Host:** docker4
- **Grafana:** grafana/grafana:latest | 3000 | Dashboards |
+- **Network:** netgrimoire
- **Alertmanager:** prom/alertmanager:latest | 9093 | Alert Routing |
+- **Exposed via:** <caddy domains from labels, or Internal only>
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest | / | Container Metrics (all nodes) |
+- **Homepage group:** <from homepage.group label>
 - **Node Exporter:** prom/node-exporter:latest | - | Host Metrics (all nodes) |
 - **Uptime Kuma:** - | - | Monitoring |
-Exposed via: <caddy domains from labels, or Internal only>
+  * Prometheus: prom/prometheus:latest on port 9090
-Homepage group: Monitoring
+  * Grafana: grafana/grafana:latest on port 3000
  * Alertmanager: prom/alertmanager:latest on port 9093
  * Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
  * Node Exporter: prom/node-exporter:latest (global)
 ```
 ---
 ## Build & Configuration
 ### Prerequisites
-No specific prerequisites are required for this stack.
+- Docker Swarm manager and worker nodes must be running.
 - Caddy and Uptime Kuma must be configured correctly.
 ### Volume Setup
 ```bash
@@ -44,9 +60,10 @@ mkdir -p /DockerVol/alertmanager/data
 ### Environment Variables
 ```bash
-# generate: openssl rand -hex 32
+# generate secrets with: openssl rand -hex 32
-GF_SECURITY_ADMIN_PASSWORD: F@lcon13
+GF_SECURITY_ADMIN_USER=admin
-GF_USERS_DEFAULT_THEME: dark
+GF_SECURITY_ADMIN_PASSWORD=F@lcon13
 GF_USERS_DEFAULT_THEME=dark
 ```
 ### Deploy
@@ -60,7 +77,7 @@ docker stack services monitoring
 ```
 ### First Run
-After deployment, verify that all services are running and Uptime Kuma is connected to Prometheus and Grafana.
+- Run `./deploy.sh` to initialize the stack.
 ---
@@ -69,38 +86,62 @@ After deployment, verify that all services are running and Uptime Kuma is connec
 ### Accessing monitoring
 | Service | URL | Purpose |
 |---------|-----|---------|
- **Prometheus:** http://prometheus:9090 | Metrics Collection |
+- **Prometheus:** https://prometheus.netgrimoire.com on port 9090
- **Grafana:** https://grafana.netgrimoire.com | Dashboards |
+- **Grafana:** https://grafana.netgrimoire.com on port 3000
 - **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
 ### Primary Use Cases
-This stack provides real-time metrics and dashboards for system health and performance monitoring.
+- Monitor service performance and health.
 - Visualize metrics in Grafana.
 ### NetGrimoire Integrations
-This stack connects to Uptime Kuma for monitoring, Alertmanager for alert routing, and Cadvisor for container metrics.
+- Alertmanager receives alerts from Prometheus and handles alert routing.
 - Prometheus connects to Cadvisor for container metrics.
 ---
 ## Operations
 ### Monitoring
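 Use `docker stack services monitoring` to list the stack's services and their replica counts, and `docker service logs -f <service>` to follow a service's output in real time. Swarm normally names stack services `<stack>_<service>`, for example `monitoring_prometheus`.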
 ```bash
 docker stack services monitoring
 # kuma monitors from kuma.* labels
 ```
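The replica check can be scripted. The following is a minimal sketch, assuming the stack is deployed under the name `monitoring`. The function name `degraded_services` is hypothetical; `.Name` and `.Replicas` are fields that `docker stack services --format` exposes.

```bash
#!/usr/bin/env bash
# Print services whose running replica count differs from the desired count.
# Input: lines of "<name> <running>/<desired>" on stdin; extra columns are ignored.
degraded_services() {
  while read -r name replicas _; do
    [ -z "$name" ] && continue
    running=${replicas%%/*}   # text before the slash
    desired=${replicas##*/}   # text after the slash
    if [ "$running" != "$desired" ]; then
      printf '%s %s\n' "$name" "$replicas"
    fi
  done
}

# Usage on a Swarm manager (not executed here):
#   docker stack services monitoring --format '{{.Name}} {{.Replicas}}' | degraded_services
```

Empty output means every service reports as many running replicas as it wants; any line printed is a service worth inspecting with `docker service logs`.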
 ### Backups
-Critical data is stored on `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly.
+- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
 - Reconstructing the stack will require rebuilding all services.
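One way to archive the data directories listed above is sketched here. The helper name `backup_dir` and the destination path `/backups/monitoring` are assumptions, not part of the stack. Archiving a live Prometheus data directory may capture an inconsistent state, so stopping the service first is the safer assumption.

```bash
#!/usr/bin/env bash
# Archive one data directory into a timestamped tarball and print the archive path.
# Usage: backup_dir <source-dir> <dest-dir>
backup_dir() {
  src=$1
  dest=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  # Name the archive after the parent directory,
  # e.g. "prometheus" for /DockerVol/prometheus/data.
  name=$(basename "$(dirname "$src")")
  mkdir -p "$dest"
  tar -czf "$dest/$name-$stamp.tar.gz" -C "$src" .
  printf '%s\n' "$dest/$name-$stamp.tar.gz"
}

# Example (destination path is an assumption; not executed here):
#   for d in /DockerVol/prometheus/data /DockerVol/grafana/data; do
#     backup_dir "$d" /backups/monitoring
#   done
```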
 ### Restore
-Restore the stack by running `./deploy.sh` after a backup has been taken.
+```bash
 cd services/swarm/stack/monitoring
 ./deploy.sh
 ```
 ---
 ## Common Failures
 | Failure | Symptom | Cause | Fix |
-|--------|---------|------|-----|
+|--------|---------|-------|-----|
-| Prometheus not responding | No metrics displayed on Grafana | Prometheus not configured correctly | Check Prometheus configuration and restart service |
+1.   Cadvisor is not running.
-| Alertmanager not sending alerts | No alerts received for long periods | Alertmanager not configured correctly | Check Alertmanager configuration and restart service |
+    -   Symptom: No container metrics are being collected.
    -   Cause: Cadvisor service is not deployed correctly.
    -   Fix: Run `docker stack services monitoring` to confirm the Cadvisor replicas are running, then check the service logs for errors.
 2.  Prometheus is not collecting metrics.
    -   Symptom: Metrics are not showing up in Grafana.
    -   Cause: Prometheus configuration is incorrect.
    -   Fix: Check Prometheus configuration files for any typos or syntax errors.
 3.  Alertmanager is not sending alerts.
    -   Symptom: No alerts are being sent to the console.
    -   Cause: Alertmanager configuration is incorrect.
    -   Fix: Check Alertmanager configuration files for any typos or syntax errors.
 4.  Uptime Kuma is not monitoring services.
    -   Symptom: Services are not showing up in Uptime Kuma.
    -   Cause: Uptime Kuma configuration is incorrect.
    -   Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
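Before digging through configuration files, a quick health probe can narrow down which service is failing. This is a sketch under assumptions: the hostnames `prometheus`, `alertmanager`, and `grafana` are assumed to resolve on the `netgrimoire` network, and the function names are hypothetical. The `/-/healthy` endpoints of Prometheus and Alertmanager and Grafana's `/api/health` are standard.

```bash
#!/usr/bin/env bash
# Map a service name to its health-check URL (hostnames are assumptions).
health_url() {
  case "$1" in
    prometheus)   echo "http://prometheus:9090/-/healthy" ;;
    alertmanager) echo "http://alertmanager:9093/-/healthy" ;;
    grafana)      echo "http://grafana:3000/api/health" ;;
    *)            echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Probe one service; prints "<name> ok" or "<name> FAIL".
probe() {
  url=$(health_url "$1") || return 1
  if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
    echo "$1 ok"
  else
    echo "$1 FAIL"
    return 1
  fi
}

# Usage from a container attached to the same network (not executed here):
#   for s in prometheus alertmanager grafana; do probe "$s"; done
```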
 ---
@@ -108,13 +149,13 @@ Restore the stack by running `./deploy.sh` after a backup has been taken.
 | Date | Commit | Summary |
 |------|--------|---------|
-| 2026-04-07 | af94e455 | Initial documentation |
+| 2026-04-07 | 1df528ca | Initial documentation |
-| 2026-04-07 | 04863ab6 | Updated Prometheus configuration |
+| 2026-04-07 | af94e455 | Minor changes to configuration files |
-| 2026-04-07 | 0af60dbe | Fixed Uptime Kuma connection |
+| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
 | 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
 ---
 ## Notes
- Generated by Gremlin on 2026-04-08T01:37:42.636Z
+- Generated by Gremlin on 2026-04-08T01:48:22.128Z
 - Source: swarm/monitoring.yaml
 - Review User Guide and Changelog sections