From aa3f11b7f934b4cf9f561054e81ae1c1cdfbb6df Mon Sep 17 00:00:00 2001
From: traveler
Date: Tue, 7 Apr 2026 22:36:47 -0500
Subject: [PATCH] docs(gremlin): update monitoring

---
 Netgrimoire/Services/monitoring/monitoring.md | 81 +++++++++----------
 1 file changed, 40 insertions(+), 41 deletions(-)

diff --git a/Netgrimoire/Services/monitoring/monitoring.md b/Netgrimoire/Services/monitoring/monitoring.md
index b066173..c708d0d 100644
--- a/Netgrimoire/Services/monitoring/monitoring.md
+++ b/Netgrimoire/Services/monitoring/monitoring.md
@@ -1,21 +1,20 @@
 # monitoring
 
 ## Overview
-The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.
+This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
 
 ---
 
 ## Architecture
-
 | Service | Image | Port | Role |
 |---------|-------|-----|------|
-- **Prometheus:** prom/prometheus:latest
-- **Grafana:** grafana/grafana:latest
-- **Alertmanager:** prom/alertmanager:latest
-- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
-- **Node Exporter:** prom/node-exporter:latest
+- **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection |
+- **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards |
+- **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing |
+- **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
+- **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics |
 
-Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
+Exposed via: `caddy.netgrimoire.com`
 
 Homepage group: Monitoring
 
@@ -24,32 +23,28 @@ Homepage group: Monitoring
 ## Build & Configuration
 
 ### Prerequisites
-No specific prerequisites are required for this stack.
+No specific prerequisites for this stack.
 
 ### Volume Setup
 ```bash
 mkdir -p /DockerVol/prometheus/data
-chown -R 1964:1964 /DockerVol/prometheus
+chown -R 1964:1964 /DockerVol/prometheus/data
 ```
-
 ```bash
 mkdir -p /DockerVol/grafana/data
-chown -R 1964:1964 /DockerVol/grafana
+chown -R 1964:1964 /DockerVol/grafana/data
 ```
-
 ```bash
 mkdir -p /DockerVol/alertmanager/data
-chown -R 1964:1964 /DockerVol/alertmanager
+chown -R 1964:1964 /DockerVol/alertmanager/data
 ```
-
 ```bash
-mkdir -p /DockerVol/cadvisor/
-chown -R 1964:1964 /DockerVol/cadvisor/
+mkdir -p /DockerVol/cadvisor/data
+chown -R 1964:1964 /DockerVol/cadvisor/data
 ```
-
 ```bash
-mkdir -p /DockerVol/node-exporter/
-chown -R 1964:1964 /DockerVol/node-exporter
+mkdir -p /DockerVol/node-exporter/data
+chown -R 1964:1964 /DockerVol/node-exporter/data
 ```
 
 ### Environment Variables
@@ -57,6 +52,8 @@ chown -R 1964:1964 /DockerVol/node-exporter
 # generate: openssl rand -hex 32
 GF_SECURITY_ADMIN_PASSWORD=F@lcon13
 GF_USERS_DEFAULT_THEME=dark
+GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
+GF_FEATURE_TOGGLES_ENABLE=publicDashboards
 ```
 
 ### Deploy
@@ -70,7 +67,7 @@ docker stack services monitoring
 ```
 
 ### First Run
-Run `./deploy.sh` after the initial deployment to complete any necessary setup.
+After the initial deployment, ensure that Prometheus is scraped by Grafana and Alertmanager is configured to forward alerts to Cadvisor.
 
 ---
 
@@ -79,38 +76,41 @@ Run `./deploy.sh` after the initial deployment to complete any necessary setup.
 ### Accessing monitoring
 | Service | URL | Purpose |
 |---------|-----|---------|
-- **Prometheus:** https://prometheus.netgrimoire.com
-- **Grafana:** https://grafana.netgrimoire.com
-- **Alertmanager:** https://alertmanager.netgrimoire.com
+- **Grafana** | https://grafana.netgrimoire.com | Dashboards |
+- **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing |
 
 ### Primary Use Cases
-This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.
+Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
 
 ### NetGrimoire Integrations
-The Cadvisor service connects to the Node Exporter to collect host metrics from all nodes in the cluster, including Pi. The Alertmanager service connects to the Prometheus server to forward alerts. The Grafana service connects to the Prometheus server to display dashboards.
+This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
 
 ---
 
 ## Operations
 
 ### Monitoring
-Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring`.
+```bash
+docker stack services monitoring
+docker service logs -f monitoring prometheus
+```
 
 ### Backups
-Regular backups are recommended for critical data stored in `/DockerVol/`. Use `docker stack services monitoring` and inspect the volumes for specific data.
+Critical backups are required for Prometheus and Grafana data. Reconstructing from backup is possible but may require manual configuration.
 
 ### Restore
-Restore the services by running `./deploy.sh` after any changes to the configuration.
+```bash
+cd services/swarm/stack/monitoring
+./deploy.sh
+```
 
 ---
 
 ## Common Failures
-
-| Failure | Symptom | Cause | Fix |
-|--------|---------|------|-----|
-- **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Restart Prometheus service: `docker service restart monitoring prometheus` |
-- **Alerts not forwarding to Alertmanager** | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
-- **Grafana dashboard not displaying data** | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
+| Failure Mode | Symptoms | Cause | Fix |
+|-------------|----------|-------|------|
+| Prometheus down | No metrics available in Grafana | Prometheus not scraped | Check Prometheus configuration and restart service |
+| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
 
 ---
 
@@ -118,15 +118,14 @@ Restore the services by running `./deploy.sh` after any changes to the configura
 
 | Date | Commit | Summary |
 |------|--------|---------|
-| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
-| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
-| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
-| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
-| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
+| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
+| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
+
+
 
 ---
 
 ## Notes
-- Generated by Gremlin on 2026-04-08T02:08:17.740Z
+- Generated by Gremlin on 2026-04-08T03:34:50.852Z
 - Source: swarm/monitoring.yaml
 - Review User Guide and Changelog sections
\ No newline at end of file