diff --git a/Netgrimoire/Services/monitoring/monitoring.md b/Netgrimoire/Services/monitoring/monitoring.md
index 8af63d1..346ea3d 100644
--- a/Netgrimoire/Services/monitoring/monitoring.md
+++ b/Netgrimoire/Services/monitoring/monitoring.md
@@ -1,39 +1,55 @@
 ---
 title: monitoring Stack
-description: NetGrimoire Monitoring Services
+description: Real-time monitoring of NetGrimoire services
 published: true
-date: 2026-04-08T01:37:42.636Z
+date: 2026-04-08T01:48:22.128Z
 tags: docker,swarm,monitoring,netgrimoire
 editor: markdown
-dateCreated: 2026-04-08T01:37:42.636Z
+dateCreated: 2026-04-08T01:48:22.128Z
 ---
 # monitoring
 ## Overview
-The monitoring stack in NetGrimoire is designed to provide real-time metrics and dashboards for system health and performance monitoring. The stack consists of Prometheus, Grafana, Alertmanager, Cadvisor, Node Exporter, and Uptime Kuma.
+The monitoring stack is a critical component of NetGrimoire, providing real-time insights into the performance and health of its services. This stack consists of four primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
+
+| Service | Image | Port | Role |
+|---------|-----|-----|---------|
+- **Prometheus:** docker4
+- **Grafana:** docker4
+- **Alertmanager:** docker4
+- **Cadvisor:** global (runs on all nodes)
+- **Node Exporter:** global (runs on all nodes)
+
+Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
+
+Homepage group: Monitoring
 ---
 ## Architecture
+```markdown
 | Service | Image | Port | Role |
-|---------|-------|-----|------|
-- **Prometheus:** prom/prometheus:latest | 9090 | Metrics Collection |
-- **Grafana:** grafana/grafana:latest | 3000 | Dashboards |
-- **Alertmanager:** prom/alertmanager:latest | 9093 | Alert Routing |
-- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest | / | Container Metrics (all nodes) |
-- **Node Exporter:** prom/node-exporter:latest | - | Host Metrics (all nodes) |
-- **Uptime Kuma:** - | - | Monitoring |
+|---------|-----|-----|---------|
+- **Host:** docker4
+- **Network:** netgrimoire
+- **Exposed via:**
+- **Homepage group:**
-Exposed via:
-Homepage group: Monitoring
+ * Prometheus: prometheus:latest on port 9090
+ * Grafana: grafana/grafana:latest on port 3000
+ * Alertmanager: alertmanager:latest on port 9093
+ * Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
+ * Node Exporter: prom/node-exporter:latest (global)
+```
 ---
 ## Build & Configuration
 ### Prerequisites
-No specific prerequisites are required for this stack.
+- Docker Swarm manager and worker nodes must be running.
+- Caddy and Uptime Kuma must be configured correctly.
 ### Volume Setup
 ```bash
@@ -44,9 +60,10 @@ mkdir -p /DockerVol/alertmanager/data
 ### Environment Variables
 ```bash
-# generate: openssl rand -hex 32
-GF_SECURITY_ADMIN_PASSWORD: F@lcon13
-GF_USERS_DEFAULT_THEME: dark
+# generate: openssl rand -hex 32 for secrets
+GF_SECURITY_ADMIN_USER=admin
+GF_SECURITY_ADMIN_PASSWORD=F@lcon13
+GF_USERS_DEFAULT_THEME=dark
 ```
 ### Deploy
@@ -60,7 +77,7 @@ docker stack services monitoring
 ```
 ### First Run
-After deployment, verify that all services are running and Uptime Kuma is connected to Prometheus and Grafana.
+- Run `./deploy.sh` to initialize the stack.
 ---
@@ -69,38 +86,62 @@ After deployment, verify that all services are running and Uptime Kuma is connec
 ### Accessing monitoring
 | Service | URL | Purpose |
 |---------|-----|---------|
-- **Prometheus:** http://prometheus:9090 | Metrics Collection |
-- **Grafana:** https://grafana.netgrimoire.com | Dashboards |
+- **Prometheus:** https://prometheus.netgrimoire.com on port 9090
+- **Grafana:** https://grafana.netgrimoire.com on port 3000
+- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
 ### Primary Use Cases
-This stack provides real-time metrics and dashboards for system health and performance monitoring.
+- Monitor service performance and health.
+- Visualize metrics in Grafana.
 ### NetGrimoire Integrations
-This stack connects to Uptime Kuma for monitoring, Alertmanager for alert routing, and Cadvisor for container metrics.
+- Alertmanager connects to Cadvisor for container metrics.
+- Prometheus connects to Cadvisor for container metrics.
 ---
 ## Operations
 ### Monitoring
-Use `docker stack services monitoring` to view service logs and `docker service logs -f monitoring` to monitor service output in real-time.
 ```bash
 docker stack services monitoring
+# kuma monitors from kuma.* labels
 ```
 ### Backups
-Critical data is stored on `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`.
These volumes are backed up regularly.
+- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
+- Reconstructing the stack will require rebuilding all services.
 ### Restore
-Restore the stack by running `./deploy.sh` after a backup has been taken.
+```bash
+cd services/swarm/stack/monitoring
+./deploy.sh
+```
 ---
 ## Common Failures
 | Failure | Symptom | Cause | Fix |
-|--------|---------|------|-----|
-| Prometheus not responding | No metrics displayed on Grafana | Prometheus not configured correctly | Check Prometheus configuration and restart service |
-| Alertmanager not sending alerts | No alerts received for long periods | Alertmanager not configured correctly | Check Alertmanager configuration and restart service |
+|--------|---------|-------|-----|
+1. Cadvisor is not running.
+ - Symptom: No container metrics are being collected.
+ - Cause: Cadvisor service is not deployed correctly.
+ - Fix: Run `docker stack services monitoring` and check the logs for any errors.
+
+2. Prometheus is not collecting metrics.
+ - Symptom: Metrics are not showing up in Grafana.
+ - Cause: Prometheus configuration is incorrect.
+ - Fix: Check Prometheus configuration files for any typos or syntax errors.
+
+3. Alertmanager is not sending alerts.
+ - Symptom: No alerts are being sent to the console.
+ - Cause: Alertmanager configuration is incorrect.
+ - Fix: Check Alertmanager configuration files for any typos or syntax errors.
+
+4. Uptime Kuma is not monitoring services.
+ - Symptom: Services are not showing up in Uptime Kuma.
+ - Cause: Uptime Kuma configuration is incorrect.
+ - Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
 ---
@@ -108,13 +149,13 @@ Restore the stack by running `./deploy.sh` after a backup has been taken.
 | Date | Commit | Summary |
 |------|--------|---------|
-| 2026-04-07 | af94e455 | Initial documentation |
-| 2026-04-07 | 04863ab6 | Updated Prometheus configuration |
-| 2026-04-07 | 0af60dbe | Fixed Uptime Kuma connection |
+| 2026-04-07 | 1df528ca | Initial documentation |
+| 2026-04-07 | af94e455 | Minor changes to configuration files |
+| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
+| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
 ---
 ## Notes
-- Generated by Gremlin on 2026-04-08T01:37:42.636Z
-- Source: swarm/monitoring.yaml
-- Review User Guide and Changelog sections
\ No newline at end of file
+- Generated by Gremlin on 2026-04-08T01:48:22.128Z
+- Source: swarm/monitoring.yaml
\ No newline at end of file