docs(gremlin): update monitoring

traveler 2026-04-07 20:38:00 -05:00
parent fc68d883d6
commit b157b3d064

@ -1,58 +1,50 @@
# monitoring
# Monitoring Stack
Overview
---------------
## Overview
The monitoring stack in NetGrimoire is a collection of services that provide metrics collection, dashboards, alert routing, and container metrics.
The NetGrimoire monitoring stack covers metrics collection, dashboards, alert routing, container metrics, and host metrics. It consists of Prometheus (metrics collection), Grafana (dashboards), Alertmanager (alert routing), cAdvisor (container metrics), and Node Exporter (host metrics).
---
Architecture
-------------
## Architecture
| Service | Image | Port | Role |
|---------|-------|-----|------|
- **Prometheus:** prom/prometheus:latest
- exposed via: `grafana.netgrimoire.com`
- Homepage group: Monitoring
|---------|-------|------|------|
| **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
| **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
| **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
| **cAdvisor** | gcr.io/cadvisor/cadvisor:latest | Internal only | Container Metrics |
| **Node Exporter** | prom/node-exporter:latest | Internal only | Host Metrics |
- **Grafana:** grafana/grafana:latest
- exposed via: `grafana.netgrimoire.com`
- Homepage group: Monitoring
Exposed via:
- `prometheus.netgrimoire.com`
- `grafana.netgrimoire.com`
- `alertmanager.netgrimoire.com`
- **Alertmanager:** prom/alertmanager:latest
- exposed via: `alertmanager.netgrimoire.com`
- Homepage group: Monitoring
Homepage group: Monitoring
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
- exposed via: `cadvisor.netgrimoire.com`
- Homepage group: Monitoring
---
- **Node Exporter:** prom/node-exporter:latest
- exposed via: `node-exporter.netgrimoire.com`
- Homepage group: Monitoring
Build & Configuration
---------------------
## Build & Configuration
### Prerequisites
- Docker and Docker Swarm installed on docker4
No specific prerequisites for this stack.
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
mkdir -p /DockerVol/grafana/data
mkdir -p /DockerVol/alertmanager/data
```
### Environment Variables
```bash
# generate a strong value, e.g.: openssl rand -hex 32 (never commit the real password)
GF_SECURITY_ADMIN_PASSWORD=CHANGE_ME
GF_USERS_DEFAULT_THEME=dark
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
### Deploy
```bash
cd services/swarm/stack/monitoring
set -a && source .env && set +a
@ -63,30 +55,55 @@ docker stack services monitoring
```
### First Run
- Post-deploy steps specific to these services include configuring the network, Caddy, and Uptime Kuma.
Run the following command after deployment: `./deploy.sh`
---
## User Guide
### Accessing Monitoring
| Service | URL | Purpose |
|---------|-----|---------|
- **Prometheus:** https://prometheus.netgrimoire.com
- **Grafana:** https://grafana.netgrimoire.com
- **Alertmanager:** https://alertmanager.netgrimoire.com
- **Cadvisor:** `cadvisor.netgrimoire.com` (Container metrics)
- **Node Exporter:** `node-exporter.netgrimoire.com` (Host metrics)
| **Prometheus** | http://prometheus.netgrimoire.com | Metrics Collection |
| **Grafana** | http://grafana.netgrimoire.com | Dashboards |
### Primary Use Cases
- Monitoring system performance and health.
- Configuring alerts for critical issues.
- Visualizing metrics in real-time.
To access the monitoring dashboard, navigate to `http://grafana.netgrimoire.com` and log in with the admin credentials.
### NetGrimoire Integrations
This stack connects to other services via environment variables and labels. Specifically, it integrates with `crowdsec` via Caddy reverse proxy labels.
- Connects to CrowdSec via the Caddy reverse proxy.
- Uptime Kuma monitors the services and detects errors.
---
## Operations
### Monitoring
```bash
docker stack services monitoring
# Swarm names stack services <stack>_<service>, not <stack>/<service>
docker service logs -f monitoring_prometheus
```
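
The service listing above can also be scanned automatically. The following is a minimal sketch, not part of the stack itself: it assumes the stack name `monitoring` used above, and the helper names `replicas_ok` and `report_degraded` are hypothetical. It reads `NAME REPLICAS` lines and prints every service whose running replica count differs from the desired count.

```bash
#!/bin/sh
# Sketch: flag stack services that are not running all desired replicas.
# Helper names are illustrative only; they are not part of the repository.

# replicas_ok RUNNING/DESIRED: exit 0 when both counts match (e.g. "1/1").
replicas_ok() {
    running=${1%%/*}
    desired=${1##*/}
    [ "$running" -eq "$desired" ]
}

# report_degraded: read "NAME REPLICAS [extra...]" lines from stdin and
# print one line for each service that is below its desired replica count.
report_degraded() {
    while read -r name replicas rest; do
        replicas_ok "$replicas" || echo "degraded: $name ($replicas)"
    done
}
```

One way to feed it real data is `docker stack services monitoring --format '{{.Name}} {{.Replicas}}' | report_degraded`; no output means every service reports its full replica count.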
### Backups
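Critical data is stored in `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These are the host directories created under Volume Setup rather than named volumes, and Docker has no `docker volume backup` command, so back them up with a file-level tool such as `tar` or `rsync`.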
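
A file-level backup of those directories could look like the sketch below. It is an illustration under assumptions, not the project's backup procedure: the function name `backup_dir` and the destination `/backup/monitoring` used in the usage note are hypothetical. Archiving a directory while its service is writing to it may capture an inconsistent state, so consider scaling the service down first.

```bash
#!/bin/sh
# Sketch: archive one data directory into a timestamped tarball.
# backup_dir SRC DEST_DIR
#   SRC      data directory, e.g. /DockerVol/prometheus/data
#   DEST_DIR directory that receives the archive (hypothetical location)
# Prints the path of the archive on success.
backup_dir() {
    src=$1
    dest=$2
    # Name the archive after the service directory, e.g. "prometheus"
    # for /DockerVol/prometheus/data.
    name=$(basename "$(dirname "$src")")
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/$name-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps paths inside the archive relative to the data directory.
    tar -czf "$archive" -C "$src" . || return 1
    echo "$archive"
}
```

For example, `for svc in prometheus grafana alertmanager; do backup_dir "/DockerVol/$svc/data" /backup/monitoring; done` would write one archive per service.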
### Restore
Restore the stack by running: `./deploy.sh`
---
## Common Failures
| Service | Symptom | Cause | Fix |
|---------|---------|-------|-----|
| Prometheus | No data in Grafana | Grafana cannot reach Prometheus | Check the Caddy reverse proxy labels and confirm the services can connect to each other |
| Grafana | Blank dashboard | Missing configuration | Check whether the `GF_SERVER_ROOT_URL` environment variable is set |
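
When working through the failures above, it can help to first confirm that each service answers on its health endpoint. The sketch below assumes the standard endpoints (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana) and that `curl` is installed; the helper names `health_path` and `check_service` are hypothetical.

```bash
#!/bin/sh
# Sketch: probe the health endpoint of a monitoring service.
# Helper names are illustrative only.

# health_path SERVICE: print the health endpoint path for a known service.
health_path() {
    case "$1" in
        prometheus|alertmanager) echo "/-/healthy" ;;
        grafana) echo "/api/health" ;;
        *) return 1 ;;
    esac
}

# check_service SERVICE BASE_URL: exit 0 when the endpoint returns HTTP 2xx,
# exit 2 for an unknown service, otherwise curl's non-zero exit code.
check_service() {
    path=$(health_path "$1") || { echo "unknown service: $1" >&2; return 2; }
    curl -fsS --max-time 5 -o /dev/null "$2$path"
}
```

For example, `check_service grafana http://grafana.netgrimoire.com && echo "grafana ok"` reports success only if Grafana responds; a failure here points at the service or the proxy rather than at dashboards or data sources.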
---
## Changelog
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | 04863ab6 | Initial documentation creation |
| 2026-04-07 | 0af60dbe | Updated monitoring services to use latest images and fixed a minor bug |
This is the initial documentation for the monitoring stack in NetGrimoire. The changelog records two commits dated 2026-04-07: one creating the initial documentation and one updating the services to use the latest images.