From f53e81ed750d2664616b1a77ed08cfb25364f13f Mon Sep 17 00:00:00 2001
From: traveler
Date: Tue, 7 Apr 2026 21:10:17 -0500
Subject: [PATCH] docs(gremlin): update monitoring

---
 Netgrimoire/Services/monitoring/monitoring.md | 139 +++++++-----------
 1 file changed, 55 insertions(+), 84 deletions(-)

diff --git a/Netgrimoire/Services/monitoring/monitoring.md b/Netgrimoire/Services/monitoring/monitoring.md
index 346ea3d..b066173 100644
--- a/Netgrimoire/Services/monitoring/monitoring.md
+++ b/Netgrimoire/Services/monitoring/monitoring.md
@@ -1,67 +1,60 @@
----
-title: monitoring Stack
-description: Real-time monitoring of NetGrimoire services
-published: true
-date: 2026-04-08T01:48:22.128Z
-tags: docker,swarm,monitoring,netgrimoire
-editor: markdown
-dateCreated: 2026-04-08T01:48:22.128Z
----
-
 # monitoring

 ## Overview
-The monitoring stack is a critical component of NetGrimoire, providing real-time insights into the performance and health of its services. This stack consists of four primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
-
-| Service | Image | Port | Role |
-|---------|-----|-----|---------|
-- **Prometheus:** docker4
-- **Grafana:** docker4
-- **Alertmanager:** docker4
-- **Cadvisor:** global (runs on all nodes)
-- **Node Exporter:** global (runs on all nodes)
-
-Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
-
-Homepage group: Monitoring
+The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.

 ---

 ## Architecture
-```markdown
-| Service | Image | Port | Role |
-|---------|-----|-----|---------|
-- **Host:** docker4
-- **Network:** netgrimoire
-- **Exposed via:**
-- **Homepage group:**
- * Prometheus: prometheus:latest on port 9090
- * Grafana: grafana/grafana:latest on port 3000
- * Alertmanager: alertmanager:latest on port 9093
- * Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
- * Node Exporter: prom/node-exporter:latest (global)
-```
+| Service | Image | Port | Role |
+|---------|-------|-----|------|
+- **Prometheus:** prom/prometheus:latest
+- **Grafana:** grafana/grafana:latest
+- **Alertmanager:** prom/alertmanager:latest
+- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
+- **Node Exporter:** prom/node-exporter:latest
+
+Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
+
+Homepage group: Monitoring

 ---

 ## Build & Configuration

 ### Prerequisites
-- Docker Swarm manager and worker nodes must be running.
-- Caddy and Uptime Kuma must be configured correctly.
+No specific prerequisites are required for this stack.

 ### Volume Setup
 ```bash
 mkdir -p /DockerVol/prometheus/data
+chown -R 1964:1964 /DockerVol/prometheus
+```
+
+```bash
 mkdir -p /DockerVol/grafana/data
+chown -R 1964:1964 /DockerVol/grafana
+```
+
+```bash
 mkdir -p /DockerVol/alertmanager/data
+chown -R 1964:1964 /DockerVol/alertmanager
+```
+
+```bash
+mkdir -p /DockerVol/cadvisor/
+chown -R 1964:1964 /DockerVol/cadvisor/
+```
+
+```bash
+mkdir -p /DockerVol/node-exporter/
+chown -R 1964:1964 /DockerVol/node-exporter
 ```

 ### Environment Variables
 ```bash
-# generate: openssl rand -hex 32 for secrets
-GF_SECURITY_ADMIN_USER=admin
+# generate: openssl rand -hex 32
 GF_SECURITY_ADMIN_PASSWORD=F@lcon13
 GF_USERS_DEFAULT_THEME=dark
 ```
@@ -77,7 +70,7 @@ docker stack services monitoring
 ```

 ### First Run
-- Run `./deploy.sh` to initialize the stack.
+Run `./deploy.sh` after the initial deployment to complete any necessary setup.

 ---

@@ -86,62 +79,38 @@ docker stack services monitoring
 ### Accessing monitoring
 | Service | URL | Purpose |
 |---------|-----|---------|
-- **Prometheus:** https://prometheus.netgrimoire.com on port 9090
-- **Grafana:** https://grafana.netgrimoire.com on port 3000
-- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
+- **Prometheus:** https://prometheus.netgrimoire.com
+- **Grafana:** https://grafana.netgrimoire.com
+- **Alertmanager:** https://alertmanager.netgrimoire.com

 ### Primary Use Cases
-- Monitor service performance and health.
-- Visualize metrics in Grafana.
+This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.

 ### NetGrimoire Integrations
-- Alertmanager connects to Cadvisor for container metrics.
-- Prometheus connects to Cadvisor for container metrics.
+The Cadvisor service connects to the Node Exporter to collect host metrics from all nodes in the cluster, including Pi. The Alertmanager service connects to the Prometheus server to forward alerts. The Grafana service connects to the Prometheus server to display dashboards.

 ---

 ## Operations

 ### Monitoring
-```bash
-docker stack services monitoring
-# kuma monitors from kuma.* labels
-```
+Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring`.

 ### Backups
-- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
-- Reconstructing the stack will require rebuilding all services.
+Regular backups are recommended for critical data stored in `/DockerVol/`. Use `docker stack services monitoring` and inspect the volumes for specific data.

 ### Restore
-```bash
-cd services/swarm/stack/monitoring
-./deploy.sh
-```
+Restore the services by running `./deploy.sh` after any changes to the configuration.

 ---

 ## Common Failures
+
 | Failure | Symptom | Cause | Fix |
-|--------|---------|-------|-----|
-1. Cadvisor is not running.
- - Symptom: No container metrics are being collected.
- - Cause: Cadvisor service is not deployed correctly.
- - Fix: Run `docker stack services monitoring` and check the logs for any errors.
-
-2. Prometheus is not collecting metrics.
- - Symptom: Metrics are not showing up in Grafana.
- - Cause: Prometheus configuration is incorrect.
- - Fix: Check Prometheus configuration files for any typos or syntax errors.
-
-3. Alertmanager is not sending alerts.
- - Symptom: No alerts are being sent to the console.
- - Cause: Alertmanager configuration is incorrect.
- - Fix: Check Alertmanager configuration files for any typos or syntax errors.
-
-4. Uptime Kuma is not monitoring services.
- - Symptom: Services are not showing up in Uptime Kuma.
- - Cause: Uptime Kuma configuration is incorrect.
- - Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
+|--------|---------|------|-----|
+- **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Restart Prometheus service: `docker service restart monitoring prometheus` |
+- **Alerts not forwarding to Alertmanager** | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
+- **Grafana dashboard not displaying data** | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |

 ---

@@ -149,13 +118,15 @@ cd services/swarm/stack/monitoring

 | Date | Commit | Summary |
 |------|--------|---------|
-| 2026-04-07 | 1df528ca | Initial documentation |
-| 2026-04-07 | af94e455 | Minor changes to configuration files |
-| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
-| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
+| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
+| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
+| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
+| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
+| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |

 ---

 ## Notes
-- Generated by Gremlin on 2026-04-08T01:48:22.128Z
-- Source: swarm/monitoring.yaml
\ No newline at end of file
+- Generated by Gremlin on 2026-04-08T02:08:17.740Z
+- Source: swarm/monitoring.yaml
+- Review User Guide and Changelog sections
\ No newline at end of file