docs(gremlin): update monitoring

2026-04-07 22:36:47 -05:00 · 2026-04-07 22:36:47 -05:00 · aa3f11b7f9
commit aa3f11b7f9
parent 1fbad41f86
1 changed files with 40 additions and 41 deletions
--- a/Netgrimoire/Services/monitoring/monitoring.md
+++ b/Netgrimoire/Services/monitoring/monitoring.md
@ -1,21 +1,20 @@
 # monitoring
 ## Overview
-The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.
+This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
 ---
 ## Architecture
 | Service | Image | Port | Role |
 |---------|-------|-----|------|
- **Prometheus:** prom/prometheus:latest
+| Prometheus | `prom/prometheus:latest` | 9090 | Metrics Collection |
- **Grafana:** grafana/grafana:latest
+| Grafana | `grafana/grafana:latest` | 3000 | Dashboards |
- **Alertmanager:** prom/alertmanager:latest
+| Alertmanager | `prom/alertmanager:latest` | 9093 | Alert Routing |
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
+| Cadvisor | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
- **Node Exporter:** prom/node-exporter:latest
+| Node Exporter | `prom/node-exporter:latest` | - | Host Metrics |
-Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
+Exposed via: `caddy.netgrimoire.com`
 Homepage group: Monitoring
@ -24,32 +23,28 @@ Homepage group: Monitoring
 ## Build & Configuration
 ### Prerequisites
-No specific prerequisites are required for this stack.
+No specific prerequisites for this stack.
 ### Volume Setup
 ```bash
 mkdir -p /DockerVol/prometheus/data
-chown -R 1964:1964 /DockerVol/prometheus
+chown -R 1964:1964 /DockerVol/prometheus/data
 ```
 ```bash
 mkdir -p /DockerVol/grafana/data
-chown -R 1964:1964 /DockerVol/grafana
+chown -R 1964:1964 /DockerVol/grafana/data
 ```
 ```bash
 mkdir -p /DockerVol/alertmanager/data
-chown -R 1964:1964 /DockerVol/alertmanager
+chown -R 1964:1964 /DockerVol/alertmanager/data
 ```
 ```bash
-mkdir -p /DockerVol/cadvisor/
+mkdir -p /DockerVol/cadvisor/data
-chown -R 1964:1964 /DockerVol/cadvisor/
+chown -R 1964:1964 /DockerVol/cadvisor/data
 ```
 ```bash
-mkdir -p /DockerVol/node-exporter/
+mkdir -p /DockerVol/node-exporter/data
-chown -R 1964:1964 /DockerVol/node-exporter
+chown -R 1964:1964 /DockerVol/node-exporter/data
 ```
 ### Environment Variables
@ -57,6 +52,8 @@ chown -R 1964:1964 /DockerVol/node-exporter
 # generate: openssl rand -hex 32
 GF_SECURITY_ADMIN_PASSWORD=F@lcon13
 GF_USERS_DEFAULT_THEME=dark
 GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
 GF_FEATURE_TOGGLES_ENABLE=publicDashboards
 ```
 ### Deploy
@ -70,7 +67,7 @@ docker stack services monitoring
 ```
 ### First Run
-Run `./deploy.sh` after the initial deployment to complete any necessary setup.
+After the initial deployment, ensure that Grafana has Prometheus configured as a data source and that Prometheus is configured to forward alerts to Alertmanager.
 ---
@ -79,38 +76,41 @@ Run `./deploy.sh` after the initial deployment to complete any necessary setup.
 ### Accessing monitoring
 | Service | URL | Purpose |
 |---------|-----|---------|
- **Prometheus:** https://prometheus.netgrimoire.com
+| Grafana | https://grafana.netgrimoire.com | Dashboards |
- **Grafana:** https://grafana.netgrimoire.com
+| Alertmanager | https://alertmanager.netgrimoire.com | Alert Routing |
 - **Alertmanager:** https://alertmanager.netgrimoire.com
 ### Primary Use Cases
-This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.
+Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
 ### NetGrimoire Integrations
-The Cadvisor service connects to the Node Exporter to collect host metrics from all nodes in the cluster, including Pi. The Alertmanager service connects to the Prometheus server to forward alerts. The Grafana service connects to the Prometheus server to display dashboards.
+This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
 ---
 ## Operations
 ### Monitoring
-Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring`.
+```bash
 docker stack services monitoring
 docker service logs -f monitoring_prometheus
 ```
 ### Backups
-Regular backups are recommended for critical data stored in `/DockerVol/`. Use `docker stack services monitoring` and inspect the volumes for specific data.
+Back up the Prometheus and Grafana data directories under `/DockerVol/`. Restoring from backup is possible but may require manual reconfiguration.
 ### Restore
-Restore the services by running `./deploy.sh` after any changes to the configuration.
+```bash
 cd services/swarm/stack/monitoring
 ./deploy.sh
 ```
 ---
 ## Common Failures
-
+| Failure Mode | Symptoms | Cause | Fix |
-| Failure | Symptom | Cause | Fix |
+|-------------|----------|-------|------|
-|--------|---------|------|-----|
+| Prometheus down | No metrics available in Grafana | Prometheus not running | Check Prometheus configuration and restart service |
- **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Restart Prometheus service: `docker service restart monitoring prometheus` |
+| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
 | Alerts not forwarding to Alertmanager | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
 | Grafana dashboard not displaying data | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
 ---
@ -118,15 +118,14 @@ Restore the services by running `./deploy.sh` after any changes to the configura
 | Date | Commit | Summary |
 |------|--------|---------|
-| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
+| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
-| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
+| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
-| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
+
-| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
 | 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
 ---
 ## Notes
- Generated by Gremlin on 2026-04-08T02:08:17.740Z
+- Generated by Gremlin on 2026-04-08T03:34:50.852Z
 - Source: swarm/monitoring.yaml
 - Review User Guide and Changelog sections
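
Review note: the First Run and Common Failures sections in this change describe checking that Prometheus, Alertmanager, and Grafana are up. A minimal sketch of that check follows. It assumes the ports from the Architecture table (9090, 9093, 3000) are reachable on a host named by `BASE_HOST`, which is a hypothetical variable and may not match how the stack is actually published. The `/-/healthy` and `/api/health` paths are the standard health endpoints of Prometheus, Alertmanager, and Grafana. By default the script only lists the URLs; it touches the network only when `PROBE=1` is set.

```shell
#!/usr/bin/env bash
# Sketch: map each web-facing service in the monitoring stack to its
# health endpoint. BASE_HOST is an assumption, not part of the stack
# definition; the ports are taken from the Architecture table.
BASE_HOST="${BASE_HOST:-localhost}"

# Print the health URL for a service name; return 1 for unknown names.
health_url() {
  case "$1" in
    prometheus)   echo "http://${BASE_HOST}:9090/-/healthy" ;;
    alertmanager) echo "http://${BASE_HOST}:9093/-/healthy" ;;
    grafana)      echo "http://${BASE_HOST}:3000/api/health" ;;
    *)            return 1 ;;
  esac
}

# Probe one service with curl; prints "<name> ok" or "<name> FAILED (<url>)".
probe() {
  local name="$1" url
  url="$(health_url "$name")" || { echo "$name unknown"; return 1; }
  if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
    echo "$name ok"
  else
    echo "$name FAILED ($url)"
    return 1
  fi
}

# Only touch the network when PROBE=1; otherwise just list the URLs.
for svc in prometheus alertmanager grafana; do
  if [ "${PROBE:-0}" = "1" ]; then
    probe "$svc" || true
  else
    echo "$svc -> $(health_url "$svc")"
  fi
done
```

Running it with `PROBE=1 BASE_HOST=<swarm-node>` performs the actual checks. Cadvisor and Node Exporter are left out because the Architecture table lists no port for them.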