docs(gremlin): update monitoring

traveler 2026-04-07 22:36:47 -05:00
parent 1fbad41f86
commit aa3f11b7f9


@ -1,21 +1,20 @@
# monitoring
## Overview
The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.
This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
---
## Architecture
| Service | Image | Port | Role |
|---------|-------|-----|------|
- **Prometheus:** prom/prometheus:latest
- **Grafana:** grafana/grafana:latest
- **Alertmanager:** prom/alertmanager:latest
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
- **Node Exporter:** prom/node-exporter:latest
| **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection |
| **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards |
| **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing |
| **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
| **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics |
Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
Exposed via: `caddy.netgrimoire.com`
Homepage group: Monitoring
@ -24,32 +23,28 @@ Homepage group: Monitoring
## Build & Configuration
### Prerequisites
No specific prerequisites are required for this stack.
No specific prerequisites for this stack.
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus
chown -R 1964:1964 /DockerVol/prometheus/data
```
```bash
mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana
chown -R 1964:1964 /DockerVol/grafana/data
```
```bash
mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager
chown -R 1964:1964 /DockerVol/alertmanager/data
```
```bash
mkdir -p /DockerVol/cadvisor/
chown -R 1964:1964 /DockerVol/cadvisor/
mkdir -p /DockerVol/cadvisor/data
chown -R 1964:1964 /DockerVol/cadvisor/data
```
```bash
mkdir -p /DockerVol/node-exporter/
chown -R 1964:1964 /DockerVol/node-exporter
mkdir -p /DockerVol/node-exporter/data
chown -R 1964:1964 /DockerVol/node-exporter/data
```
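The per-service commands above can be collapsed into one loop. This is a minimal sketch, not the stack's actual setup script: `setup_volumes` is a hypothetical helper name, and the sketch assumes every service uses a `data` subdirectory and the `1964:1964` owner shown above. It skips `chown` when not run as root, since changing ownership to another user requires root.

```bash
# Hypothetical helper: create the data directory for each monitoring service.
# Arguments (both optional): volume root and owner, defaulting to the
# values used in the commands above.
setup_volumes() {
  local root="${1:-/DockerVol}" owner="${2:-1964:1964}"
  local svc
  for svc in prometheus grafana alertmanager cadvisor node-exporter; do
    mkdir -p "${root}/${svc}/data"
    # Only root can hand ownership to another user; skip otherwise.
    if [ "$(id -u)" -eq 0 ]; then
      chown -R "${owner}" "${root}/${svc}/data"
    fi
  done
}

# Usage (as root on the Swarm node):
#   setup_volumes /DockerVol 1964:1964
```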
### Environment Variables
@ -57,6 +52,8 @@ chown -R 1964:1964 /DockerVol/node-exporter
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
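The `openssl rand -hex 32` hint above can be turned into a small helper that generates the admin password instead of committing a fixed one. This is a sketch under assumptions: `write_grafana_env` and the output path are hypothetical, and how the stack actually loads its environment file is not shown in this document.

```bash
# Hypothetical helper: write a Grafana env file containing a freshly
# generated admin password. The file is created with owner-only permissions.
write_grafana_env() {
  local out="$1" pw
  pw="$(openssl rand -hex 32)"   # 32 random bytes, hex-encoded (64 characters)
  (
    umask 077                    # new file is readable by the owner only
    printf 'GF_SECURITY_ADMIN_PASSWORD=%s\n' "$pw" > "$out"
  )
}

# Usage (path is illustrative):
#   write_grafana_env ./grafana.env
```

The remaining `GF_*` variables from the block above can then be appended to the same file.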
### Deploy
@ -70,7 +67,7 @@ docker stack services monitoring
```
### First Run
Run `./deploy.sh` after the initial deployment to complete any necessary setup.
After the initial deployment, confirm that Grafana has Prometheus configured as a data source and that Prometheus is configured to send alerts to Alertmanager.
---
@ -79,38 +76,41 @@ Run `./deploy.sh` after the initial deployment to complete any necessary setup.
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
- **Prometheus:** https://prometheus.netgrimoire.com
- **Grafana:** https://grafana.netgrimoire.com
- **Alertmanager:** https://alertmanager.netgrimoire.com
| **Grafana** | https://grafana.netgrimoire.com | Dashboards |
| **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing |
### Primary Use Cases
This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.
Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
### NetGrimoire Integrations
The Cadvisor service connects to the Node Exporter to collect host metrics from all nodes in the cluster, including Pi. The Alertmanager service connects to the Prometheus server to forward alerts. The Grafana service connects to the Prometheus server to display dashboards.
This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
---
## Operations
### Monitoring
Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring`.
```bash
docker stack services monitoring
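# Swarm names stack services <stack>_<service>; "prometheus" is assumed to be the service name in the stack file
docker service logs -f monitoring_prometheus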
```
### Backups
Regular backups are recommended for critical data stored in `/DockerVol/`. Use `docker stack services monitoring` and inspect the volumes for specific data.
Backups are critical for the Prometheus and Grafana data. Restoring from a backup is possible but may require manual configuration.
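A simple way to back up a data directory is a dated tarball. The sketch below is illustrative only: `backup_volume` and the destination path are hypothetical, and it assumes the `/DockerVol/<service>/data` layout from Volume Setup. Archiving a directory while its service is still writing to it may capture an inconsistent state, so consider scaling the service down first.

```bash
# Hypothetical helper: archive one service's data directory into a dated tarball
# and print the path of the archive it created.
backup_volume() {
  local src="$1" dest_dir="$2"
  local name stamp archive
  name="$(basename "$(dirname "$src")")"   # /DockerVol/grafana/data -> grafana
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="${dest_dir}/${name}-${stamp}.tar.gz"
  mkdir -p "$dest_dir"
  tar -czf "$archive" -C "$src" .          # paths inside the archive are relative
  printf '%s\n' "$archive"
}

# Usage (destination path is illustrative):
#   backup_volume /DockerVol/grafana/data /backups/monitoring
#   backup_volume /DockerVol/prometheus/data /backups/monitoring
```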
### Restore
Restore the services by running `./deploy.sh` after any changes to the configuration.
```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
---
## Common Failures
| Failure | Symptom | Cause | Fix |
|--------|---------|------|-----|
| **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Force a restart of the Prometheus service: `docker service update --force monitoring_prometheus` |
| **Alerts not forwarding to Alertmanager** | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
| **Grafana dashboard not displaying data** | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
| Failure Mode | Symptoms | Cause | Fix |
|-------------|----------|-------|------|
| Prometheus down | No metrics available in Grafana | Prometheus service not running or misconfigured | Check Prometheus configuration and restart service |
| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
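To tell which service is down before restarting anything, a quick probe of each health endpoint helps. This is a sketch, not part of the stack: `check_endpoint` is a hypothetical helper. Prometheus and Alertmanager serve `/-/healthy` and Grafana serves `/api/health`. Whether those paths are reachable through the public hostnames depends on the reverse proxy configuration, which this document does not show.

```bash
# Hypothetical helper: probe one health endpoint; succeeds only on HTTP 200.
check_endpoint() {
  local name="$1" url="$2" code
  # -w prints only the status code; "|| true" keeps a connection failure
  # from aborting the caller when "set -e" is active.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
  if [ "$code" = "200" ]; then
    echo "ok   ${name}"
  else
    echo "FAIL ${name} (HTTP ${code:-none})"
    return 1
  fi
}

# Usage:
#   check_endpoint grafana      https://grafana.netgrimoire.com/api/health
#   check_endpoint alertmanager https://alertmanager.netgrimoire.com/-/healthy
```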
---
@ -118,15 +118,14 @@ Restore the services by running `./deploy.sh` after any changes to the configura
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
---
## Notes
- Generated by Gremlin on 2026-04-08T02:08:17.740Z
- Generated by Gremlin on 2026-04-08T03:34:50.852Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections