docs(gremlin): update monitoring

traveler 2026-04-07 20:50:29 -05:00
parent 6f052e9bbc
commit 52bd03c32d


@@ -1,39 +1,55 @@
---
title: Monitoring Stack
description: NetGrimoire Monitoring Services
description: Real-time monitoring of NetGrimoire services
published: true
date: 2026-04-08T01:37:42.636Z
date: 2026-04-08T01:48:22.128Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-08T01:37:42.636Z
dateCreated: 2026-04-08T01:48:22.128Z
---
# monitoring
## Overview
The monitoring stack in NetGrimoire is designed to provide real-time metrics and dashboards for system health and performance monitoring. The stack consists of Prometheus, Grafana, Alertmanager, Cadvisor, Node Exporter, and Uptime Kuma.
The monitoring stack is a critical component of NetGrimoire, providing real-time insights into the performance and health of its services. This stack consists of five primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
| Service | Image | Port | Role |
|---------|-----|-----|---------|
- **Prometheus:** docker4
- **Grafana:** docker4
- **Alertmanager:** docker4
- **Cadvisor:** global (runs on all nodes)
- **Node Exporter:** global (runs on all nodes)
Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
Homepage group: Monitoring
---
## Architecture
```markdown
| Service | Image | Port | Role |
|---------|-------|-----|------|
| Prometheus | prom/prometheus:latest | 9090 | Metrics Collection |
| Grafana | grafana/grafana:latest | 3000 | Dashboards |
| Alertmanager | prom/alertmanager:latest | 9093 | Alert Routing |
| Cadvisor | gcr.io/cadvisor/cadvisor:latest | / | Container Metrics (all nodes) |
| Node Exporter | prom/node-exporter:latest | - | Host Metrics (all nodes) |
| Uptime Kuma | - | - | Monitoring |
|---------|-----|-----|---------|
- **Host:** docker4
- **Network:** netgrimoire
- **Exposed via:** <caddy domains from labels, or Internal only>
- **Homepage group:** <from homepage.group label>
Exposed via: <caddy domains from labels, or Internal only>
Homepage group: Monitoring
* Prometheus: prometheus:latest on port 9090
* Grafana: grafana/grafana:latest on port 3000
* Alertmanager: alertmanager:latest on port 9093
* Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
* Node Exporter: prom/node-exporter:latest (global)
```
---
## Build & Configuration
### Prerequisites
No specific prerequisites are required for this stack.
- Docker Swarm manager and worker nodes must be running.
- Caddy and Uptime Kuma must be configured correctly.
### Volume Setup
```bash
@@ -44,9 +60,10 @@ mkdir -p /DockerVol/alertmanager/data
### Environment Variables
```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD: F@lcon13
GF_USERS_DEFAULT_THEME: dark
# generate: openssl rand -hex 32 for secrets
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
```
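The comment in the block above points at `openssl rand -hex 32` for generating secrets. A minimal sketch of that step follows. The helper name `gen_secret` is hypothetical, and the `/dev/urandom` fallback is an assumption for hosts without `openssl`, not something the stack file is known to use.

```bash
# gen_secret is a hypothetical helper name, not part of the stack.
# It prints 32 random bytes as 64 lowercase hexadecimal characters.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    # Fallback (assumption): read 32 bytes from the kernel and hex-encode them.
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

# Emit a line in the same KEY=value form used above.
printf 'GF_SECURITY_ADMIN_PASSWORD=%s\n' "$(gen_secret)"
```

A generated value can stand in for a hand-picked password, which keeps the real credential out of version-controlled documentation.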
### Deploy
@@ -60,7 +77,7 @@ docker stack services monitoring
```
### First Run
After deployment, verify that all services are running and Uptime Kuma is connected to Prometheus and Grafana.
- Run `./deploy.sh` to initialize the stack.
---
@@ -69,38 +86,62 @@ After deployment, verify that all services are running and Uptime Kuma is connec
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
| Prometheus | http://prometheus:9090 | Metrics Collection |
| Grafana | https://grafana.netgrimoire.com | Dashboards |
- **Prometheus:** https://prometheus.netgrimoire.com on port 9090
- **Grafana:** https://grafana.netgrimoire.com on port 3000
- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
### Primary Use Cases
This stack provides real-time metrics and dashboards for system health and performance monitoring.
- Monitor service performance and health.
- Visualize metrics in Grafana.
### NetGrimoire Integrations
This stack connects to Uptime Kuma for monitoring, Alertmanager for alert routing, and Cadvisor for container metrics.
- Alertmanager receives alerts fired by Prometheus and handles alert routing.
- Prometheus connects to Cadvisor for container metrics.
---
## Operations
### Monitoring
Use `docker stack services monitoring` to list the stack's services and their replica counts, and `docker service logs -f <service>` to follow a service's output in real time.
```bash
docker stack services monitoring
# Uptime Kuma monitors are derived from kuma.* labels
```
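The service listing can also be filtered so that only services running fewer replicas than desired are shown. This is a sketch under stated assumptions: `find_degraded` is a hypothetical helper, and the service names in the usage comment depend on the names used in the stack file.

```bash
# find_degraded is a hypothetical helper. It reads "NAME REPLICAS" lines
# (REPLICAS looks like "1/1" or "0/1") on stdin and prints the name of every
# service whose running count differs from its desired count.
find_degraded() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage on a swarm manager node:
#   docker stack services monitoring --format '{{.Name}} {{.Replicas}}' | find_degraded
```

Empty output means every service in the stack reports its full replica count.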
### Backups
Critical data is stored on `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly.
- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
- Reconstructing the stack will require rebuilding all services.
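The data directories named above can be archived with a small helper. This is only a sketch: `backup_volumes` and the destination path in the usage comment are hypothetical, and archiving a live Prometheus data directory may produce an inconsistent copy, so scaling the service down first is the safer assumption.

```bash
# backup_volumes is a hypothetical helper. For each service it archives
# <root>/<service>/data into <dest>/<service>-<timestamp>.tar.gz and skips
# services whose data directory does not exist.
# The function body runs in a subshell so its variables do not leak.
backup_volumes() (
  root="$1"
  dest="$2"
  stamp="$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || exit 1
  for svc in prometheus grafana alertmanager; do
    if [ -d "$root/$svc/data" ]; then
      tar -czf "$dest/$svc-$stamp.tar.gz" -C "$root/$svc" data
    else
      echo "skipping $svc: $root/$svc/data not found" >&2
    fi
  done
)

# Usage (destination path is an assumption):
#   backup_volumes /DockerVol /backups/monitoring
```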
### Restore
Restore the stack by running `./deploy.sh` after a backup has been taken.
```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
---
## Common Failures
| Failure | Symptom | Cause | Fix |
|--------|---------|------|-----|
| Prometheus not responding | No metrics displayed on Grafana | Prometheus not configured correctly | Check Prometheus configuration and restart service |
| Alertmanager not sending alerts | No alerts received for long periods | Alertmanager not configured correctly | Check Alertmanager configuration and restart service |
|--------|---------|-------|-----|
| Cadvisor is not running | No container metrics are being collected | Cadvisor service is not deployed correctly | Run `docker stack services monitoring` to check replica counts, then check the service logs for any errors |
| Prometheus is not collecting metrics | Metrics are not showing up in Grafana | Prometheus configuration is incorrect | Check Prometheus configuration files for any typos or syntax errors |
| Alertmanager is not sending alerts | No alerts are being sent to the console | Alertmanager configuration is incorrect | Check Alertmanager configuration files for any typos or syntax errors |
| Uptime Kuma is not monitoring services | Services are not showing up in Uptime Kuma | Uptime Kuma configuration is incorrect | Check Uptime Kuma configuration files for any typos or syntax errors |
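Before digging into configuration files, it can help to confirm whether each service answers at all. The sketch below uses the health endpoints that Prometheus and Alertmanager (`/-/healthy`) and Grafana (`/api/health`) provide. The helper names are hypothetical, and the `grafana` and `alertmanager` hostnames are assumptions about how the services resolve on the `netgrimoire` network.

```bash
# health_url is a hypothetical helper that maps a service name to its
# health endpoint. Ports come from this page; hostnames are assumptions.
health_url() {
  case "$1" in
    prometheus)   echo "http://prometheus:9090/-/healthy" ;;
    alertmanager) echo "http://alertmanager:9093/-/healthy" ;;
    grafana)      echo "http://grafana:3000/api/health" ;;
    *)            echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# check_all probes every endpoint. curl -f turns HTTP error statuses into a
# non-zero exit code, so a failing service is reported as FAILED.
check_all() (
  for svc in prometheus alertmanager grafana; do
    if curl -fsS --max-time 5 "$(health_url "$svc")" >/dev/null 2>&1; then
      echo "$svc: ok"
    else
      echo "$svc: FAILED"
    fi
  done
)
```

`check_all` has to run from a container or host that can resolve those names, for example one attached to the same overlay network.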
---
@@ -108,13 +149,13 @@ Restore the stack by running `./deploy.sh` after a backup has been taken.
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | af94e455 | Initial documentation |
| 2026-04-07 | 04863ab6 | Updated Prometheus configuration |
| 2026-04-07 | 0af60dbe | Fixed Uptime Kuma connection |
| 2026-04-07 | 1df528ca | Initial documentation |
| 2026-04-07 | af94e455 | Minor changes to configuration files |
| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
---
## Notes
- Generated by Gremlin on 2026-04-08T01:37:42.636Z
- Generated by Gremlin on 2026-04-08T01:48:22.128Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections