docs(gremlin): update monitoring

traveler 2026-04-07 21:10:17 -05:00
parent 52bd03c32d
commit f53e81ed75

@ -1,67 +1,60 @@
---
title: monitoring Stack
description: Real-time monitoring of NetGrimoire services
published: true
date: 2026-04-08T01:48:22.128Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-08T01:48:22.128Z
---
# monitoring
## Overview
The monitoring stack is a critical component of NetGrimoire, providing real-time insight into the performance and health of its services. The stack consists of five services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
| Service | Image | Port | Role |
|---------|-----|-----|---------|
- **Prometheus:** docker4
- **Grafana:** docker4
- **Alertmanager:** docker4
- **Cadvisor:** global (runs on all nodes)
- **Node Exporter:** global (runs on all nodes)
Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
Homepage group: Monitoring
The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.
---
## Architecture
```markdown
| Service | Image | Port | Role |
|---------|-----|-----|---------|
- **Host:** docker4
- **Network:** netgrimoire
- **Exposed via:** <caddy domains from labels, or Internal only>
- **Homepage group:** <from homepage.group label>
* Prometheus: prom/prometheus:latest on port 9090
* Grafana: grafana/grafana:latest on port 3000
* Alertmanager: prom/alertmanager:latest on port 9093
* Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
* Node Exporter: prom/node-exporter:latest (global)
```
| Service | Image | Port | Role |
|---------|-------|-----|------|
- **Prometheus:** prom/prometheus:latest
- **Grafana:** grafana/grafana:latest
- **Alertmanager:** prom/alertmanager:latest
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
- **Node Exporter:** prom/node-exporter:latest
Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
Homepage group: Monitoring
---
## Build & Configuration
### Prerequisites
- Docker Swarm manager and worker nodes must be running.
- Caddy and Uptime Kuma must be configured correctly.
No specific prerequisites are required for this stack.
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus
```
```bash
mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana
```
```bash
mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager
```
```bash
mkdir -p /DockerVol/cadvisor/
chown -R 1964:1964 /DockerVol/cadvisor/
```
```bash
mkdir -p /DockerVol/node-exporter/
chown -R 1964:1964 /DockerVol/node-exporter
```
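The directory blocks above all follow one pattern, so they could be collapsed into a single loop. The following is a minimal sketch rather than the project's own tooling: the function name `make_volumes` is hypothetical, and skipping `chown` when not running as root is an assumption made so the helper can be tried unprivileged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the monitoring volume directories in one pass.
# Usage: make_volumes [root] [owner]
# Defaults mirror the path and UID:GID used in the commands above.
make_volumes() {
  local root="${1:-/DockerVol}"
  local owner="${2:-1964:1964}"
  local dir

  # Same directories as the individual mkdir commands in this section.
  for dir in prometheus/data grafana/data alertmanager/data cadvisor node-exporter; do
    mkdir -p "${root}/${dir}" || return 1
  done

  # chown requires root; skip it with a warning when run unprivileged.
  if [ "$(id -u)" -eq 0 ]; then
    for dir in prometheus grafana alertmanager cadvisor node-exporter; do
      chown -R "${owner}" "${root}/${dir}" || return 1
    done
  else
    echo "warning: not root, skipping chown to ${owner}" >&2
  fi
}
```

Calling `make_volumes` with no arguments targets `/DockerVol` and `1964:1964`; passing a scratch directory as the first argument allows a dry run without touching the real volumes.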
### Environment Variables
```bash
# generate: openssl rand -hex 32 for secrets
GF_SECURITY_ADMIN_USER=admin
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
```
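The `# generate:` comments suggest that the admin password is meant to be a random value rather than a fixed string. The sketch below shows one way that step might look, assuming `openssl` is installed. The function names `gen_secret` and `grafana_env`, and the idea of printing the variables for an env file, are illustrative assumptions and not something `deploy.sh` is known to do.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for producing the Grafana variables listed above.

# Print 32 random bytes as 64 hex characters (the command named in the comments).
gen_secret() {
  openssl rand -hex 32
}

# Print the three variables in KEY=value form with a freshly generated password.
grafana_env() {
  printf 'GF_SECURITY_ADMIN_USER=%s\n' "admin"
  printf 'GF_SECURITY_ADMIN_PASSWORD=%s\n' "$(gen_secret)"
  printf 'GF_USERS_DEFAULT_THEME=%s\n' "dark"
}
```

For example, `grafana_env > grafana.env` would write the values to a file (the file name is an assumption); keeping that file out of version control avoids committing the password.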
@ -77,7 +70,7 @@ docker stack services monitoring
```
### First Run
- Run `./deploy.sh` to initialize the stack.
Run `./deploy.sh` after the initial deployment to complete any necessary setup.
---
@ -86,62 +79,38 @@ docker stack services monitoring
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
- **Prometheus:** https://prometheus.netgrimoire.com on port 9090
- **Grafana:** https://grafana.netgrimoire.com on port 3000
- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
- **Prometheus:** https://prometheus.netgrimoire.com
- **Grafana:** https://grafana.netgrimoire.com
- **Alertmanager:** https://alertmanager.netgrimoire.com
### Primary Use Cases
- Monitor service performance and health.
- Visualize metrics in Grafana.
This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.
### NetGrimoire Integrations
- Alertmanager receives alerts from Prometheus.
- Prometheus connects to Cadvisor for container metrics.
Prometheus scrapes Cadvisor for container metrics and Node Exporter for host metrics on every node in the cluster, including Pi. Prometheus forwards firing alerts to Alertmanager, which handles notification. Grafana queries Prometheus to display dashboards.
---
## Operations
### Monitoring
```bash
docker stack services monitoring
# kuma monitors from kuma.* labels
```
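Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring_<service>`. Swarm names stack services `<stack>_<service>`, so this would be `monitoring_prometheus` if the service is called `prometheus` in the stack file.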
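Beyond checking replica counts, the three web-facing services can be probed over HTTP. The sketch below assumes the hostnames listed under "Accessing monitoring" and uses the health endpoints that Prometheus and Alertmanager (`/-/healthy`) and Grafana (`/api/health`) expose. The function names are hypothetical, and the probe assumes the endpoints are reachable through the proxy without authentication.

```shell
#!/usr/bin/env bash
# Hypothetical health probe for the three web-facing services.

# Map a service name to its health endpoint path.
health_path() {
  case "$1" in
    prometheus|alertmanager) echo "/-/healthy" ;;
    grafana)                 echo "/api/health" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Build the full URL from the documented <service>.netgrimoire.com hostnames.
health_url() {
  local path
  path="$(health_path "$1")" || return 1
  echo "https://$1.netgrimoire.com${path}"
}

# Probe each service; curl -f turns HTTP error codes into a non-zero exit status.
check_all() {
  local svc status=0
  for svc in prometheus grafana alertmanager; do
    if curl -fsS --max-time 10 -o /dev/null "$(health_url "$svc")"; then
      echo "ok    ${svc}"
    else
      echo "FAIL  ${svc}"
      status=1
    fi
  done
  return "$status"
}
```

Running `check_all` prints one line per service and returns non-zero if any probe fails, which makes it usable from cron or a deploy script.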
### Backups
- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
- Reconstructing the stack will require rebuilding all services.
Regular backups are recommended for critical data stored under `/DockerVol/`. To see which host paths a service mounts, run `docker service inspect monitoring_<service>` and check the `Mounts` entries in its container spec.
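A backup of one data directory might be sketched as follows. The function name `backup_dir` and the destination path in the usage note are hypothetical. Copying a live Prometheus data directory may capture an inconsistent state, so scaling the service down first is the cautious choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive one data directory into a timestamped tarball.
# Usage: backup_dir <source-dir> <dest-dir>
# Prints the path of the archive it created.
backup_dir() {
  local src="$1" dest="$2"
  local name stamp archive

  [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
  mkdir -p "$dest" || return 1

  # /DockerVol/prometheus/data becomes "prometheus-data".
  name="$(basename "$(dirname "$src")")-$(basename "$src")"
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="${dest}/${name}-${stamp}.tar.gz"

  # -C keeps paths inside the archive relative to the parent directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}
```

For example, `backup_dir /DockerVol/grafana/data /backup/monitoring` would produce an archive named like `grafana-data-<timestamp>.tar.gz` under the (assumed) `/backup/monitoring` directory.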
### Restore
```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
After restoring the data under `/DockerVol/` or changing the configuration, redeploy the stack by running `./deploy.sh`.
---
## Common Failures
| Failure | Symptom | Cause | Fix |
|--------|---------|-------|-----|
1. Cadvisor is not running.
- Symptom: No container metrics are being collected.
- Cause: Cadvisor service is not deployed correctly.
- Fix: Run `docker stack services monitoring` and check the logs for any errors.
2. Prometheus is not collecting metrics.
- Symptom: Metrics are not showing up in Grafana.
- Cause: Prometheus configuration is incorrect.
- Fix: Check Prometheus configuration files for any typos or syntax errors.
3. Alertmanager is not sending alerts.
- Symptom: No alerts are being sent to the console.
- Cause: Alertmanager configuration is incorrect.
- Fix: Check Alertmanager configuration files for any typos or syntax errors.
4. Uptime Kuma is not monitoring services.
- Symptom: Services are not showing up in Uptime Kuma.
- Cause: Uptime Kuma configuration is incorrect.
- Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
|--------|---------|------|-----|
| No metrics from Prometheus | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Force a restart of the Prometheus service: `docker service update --force monitoring_prometheus` (Swarm has no `docker service restart`; the service name assumes `prometheus` in the stack file). |
| Alerts not forwarding to Alertmanager | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check the `alerting` section and rule files in the Prometheus configuration and adjust as needed. |
| Grafana dashboard not displaying data | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check the Prometheus configuration and scrape targets and adjust as needed. |
---
@ -149,13 +118,15 @@ cd services/swarm/stack/monitoring
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | 1df528ca | Initial documentation |
| 2026-04-07 | af94e455 | Minor changes to configuration files |
| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
---
## Notes
- Generated by Gremlin on 2026-04-08T01:48:22.128Z
- Source: swarm/monitoring.yaml
- Generated by Gremlin on 2026-04-08T02:08:17.740Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections