docs(gremlin): update monitoring

traveler 2026-04-07 20:50:29 -05:00
parent 6f052e9bbc
commit 52bd03c32d


@@ -1,39 +1,55 @@
--- ---
title: monitoring Stack title: monitoring Stack
description: NetGrimoire Monitoring Services description: Real-time monitoring of NetGrimoire services
published: true published: true
date: 2026-04-08T01:37:42.636Z date: 2026-04-08T01:48:22.128Z
tags: docker,swarm,monitoring,netgrimoire tags: docker,swarm,monitoring,netgrimoire
editor: markdown editor: markdown
dateCreated: 2026-04-08T01:37:42.636Z dateCreated: 2026-04-08T01:48:22.128Z
--- ---
# monitoring # monitoring
## Overview ## Overview
The monitoring stack in NetGrimoire is designed to provide real-time metrics and dashboards for system health and performance monitoring. The stack consists of Prometheus, Grafana, Alertmanager, Cadvisor, Node Exporter, and Uptime Kuma. The monitoring stack is a critical component of NetGrimoire, providing real-time insights into the performance and health of its services. This stack consists of five primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
| Service | Image | Port | Role |
|---------|-----|-----|---------|
- **Prometheus:** docker4
- **Grafana:** docker4
- **Alertmanager:** docker4
- **Cadvisor:** global (runs on all nodes)
- **Node Exporter:** global (runs on all nodes)
Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
Homepage group: Monitoring
--- ---
## Architecture ## Architecture
```markdown
| Service | Image | Port | Role | | Service | Image | Port | Role |
|---------|-------|-----|------| |---------|-----|-----|---------|
- **Prometheus:** prom/prometheus:latest | 9090 | Metrics Collection | - **Host:** docker4
- **Grafana:** grafana/grafana:latest | 3000 | Dashboards | - **Network:** netgrimoire
- **Alertmanager:** prom/alertmanager:latest | 9093 | Alert Routing | - **Exposed via:** <caddy domains from labels, or Internal only>
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest | / | Container Metrics (all nodes) | - **Homepage group:** <from homepage.group label>
- **Node Exporter:** prom/node-exporter:latest | - | Host Metrics (all nodes) |
- **Uptime Kuma:** - | - | Monitoring |
Exposed via: <caddy domains from labels, or Internal only> * Prometheus: prometheus:latest on port 9090
Homepage group: Monitoring * Grafana: grafana/grafana:latest on port 3000
* Alertmanager: alertmanager:latest on port 9093
* Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
* Node Exporter: prom/node-exporter:latest (global)
```
--- ---
## Build & Configuration ## Build & Configuration
### Prerequisites ### Prerequisites
No specific prerequisites are required for this stack. - Docker Swarm manager and worker nodes must be running.
- Caddy and Uptime Kuma must be configured correctly.
### Volume Setup ### Volume Setup
```bash ```bash
@@ -44,9 +60,10 @@ mkdir -p /DockerVol/alertmanager/data
### Environment Variables ### Environment Variables
```bash ```bash
# generate: openssl rand -hex 32 # generate: openssl rand -hex 32 for secrets
GF_SECURITY_ADMIN_PASSWORD: F@lcon13 GF_SECURITY_ADMIN_USER=admin
GF_USERS_DEFAULT_THEME: dark GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
``` ```
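The comment in the environment block mentions `openssl rand -hex 32` for secrets. A minimal sketch of generating the Grafana admin password that way, rather than committing a literal value, might look like the following. The helper name `gen_secret` and the `/dev/urandom` fallback are assumptions for illustration, not part of the stack.

```shell
#!/bin/sh
# Sketch only: gen_secret is a hypothetical helper, not part of the monitoring stack.
# It prints 32 random bytes as 64 lowercase hex characters.
gen_secret() {
  # Prefer openssl; fall back to /dev/urandom if openssl is unavailable.
  openssl rand -hex 32 2>/dev/null ||
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Assumption: the deploy step reads this variable from the environment.
GF_SECURITY_ADMIN_PASSWORD="$(gen_secret)"
export GF_SECURITY_ADMIN_PASSWORD
```

How the value is then handed to the stack (an untracked env file, a Swarm secret, or similar) depends on how `deploy.sh` is written, which this page does not show.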
### Deploy ### Deploy
@@ -60,7 +77,7 @@ docker stack services monitoring
``` ```
### First Run ### First Run
After deployment, verify that all services are running and Uptime Kuma is connected to Prometheus and Grafana. - Run `./deploy.sh` to initialize the stack.
--- ---
@@ -69,38 +86,62 @@ After deployment, verify that all services are running and Uptime Kuma is connec
### Accessing monitoring ### Accessing monitoring
| Service | URL | Purpose | | Service | URL | Purpose |
|---------|-----|---------| |---------|-----|---------|
- **Prometheus:** http://prometheus:9090 | Metrics Collection | - **Prometheus:** https://prometheus.netgrimoire.com on port 9090
- **Grafana:** https://grafana.netgrimoire.com | Dashboards | - **Grafana:** https://grafana.netgrimoire.com on port 3000
- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
### Primary Use Cases ### Primary Use Cases
This stack provides real-time metrics and dashboards for system health and performance monitoring. - Monitor service performance and health.
- Visualize metrics in Grafana.
### NetGrimoire Integrations ### NetGrimoire Integrations
This stack connects to Uptime Kuma for monitoring, Alertmanager for alert routing, and Cadvisor for container metrics. - Alertmanager handles alert routing.
- Prometheus connects to Cadvisor for container metrics.
--- ---
## Operations ## Operations
### Monitoring ### Monitoring
Use `docker stack services monitoring` to list the stack's services and `docker service logs -f <service>` to follow a service's output in real time.
```bash ```bash
docker stack services monitoring docker stack services monitoring
# kuma monitors from kuma.* labels
``` ```
### Backups ### Backups
Critical data is stored on `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly. - Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
- Reconstructing the stack will require rebuilding all services.
### Restore ### Restore
Restore the stack by running `./deploy.sh` after a backup has been taken. ```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
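The Backups section names the data directories but not a backup procedure, and the restore step above only redeploys the stack. One possible sketch of archiving those directories before a restore is shown here. The function name `backup_volumes` and the `/backup` destination are hypothetical. Copying a live database directory may capture an inconsistent state, so stopping or scaling down the service first is a reasonable precaution.

```shell
#!/bin/sh
# Sketch only: backup_volumes is a hypothetical helper, not part of the stack.
# Usage: backup_volumes DEST_DIR SRC_DIR...
# Writes one timestamped .tar.gz per existing source directory into DEST_DIR.
backup_volumes() {
  local dest="$1"
  shift
  local stamp src name
  stamp="$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  for src in "$@"; do
    # Skip directories that do not exist instead of failing the whole run.
    [ -d "$src" ] || { echo "skip: $src not found" >&2; continue; }
    # Turn /DockerVol/grafana/data into DockerVol_grafana_data.
    name="$(echo "${src#/}" | tr '/' '_')"
    tar -czf "$dest/${name}-${stamp}.tar.gz" -C "$src" . || return 1
  done
}

# Example call (source paths from this page; /backup is an assumed destination):
# backup_volumes /backup /DockerVol/prometheus/data /DockerVol/grafana/data /DockerVol/alertmanager/data
```

To restore data, an archive would be unpacked into the matching directory with `tar -xzf <archive> -C <directory>` before running `./deploy.sh`.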
--- ---
## Common Failures ## Common Failures
| Failure | Symptom | Cause | Fix | | Failure | Symptom | Cause | Fix |
|--------|---------|------|-----| |--------|---------|-------|-----|
| Prometheus not responding | No metrics displayed on Grafana | Prometheus not configured correctly | Check Prometheus configuration and restart service | 1. Cadvisor is not running.
| Alertmanager not sending alerts | No alerts received for long periods | Alertmanager not configured correctly | Check Alertmanager configuration and restart service | - Symptom: No container metrics are being collected.
- Cause: Cadvisor service is not deployed correctly.
- Fix: Run `docker stack services monitoring` and check the logs for any errors.
2. Prometheus is not collecting metrics.
- Symptom: Metrics are not showing up in Grafana.
- Cause: Prometheus configuration is incorrect.
- Fix: Check Prometheus configuration files for any typos or syntax errors.
3. Alertmanager is not sending alerts.
- Symptom: No alerts are being sent to the console.
- Cause: Alertmanager configuration is incorrect.
- Fix: Check Alertmanager configuration files for any typos or syntax errors.
4. Uptime Kuma is not monitoring services.
- Symptom: Services are not showing up in Uptime Kuma.
- Cause: Uptime Kuma configuration is incorrect.
- Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
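The fixes listed above mostly come down to inspecting a service's task state and logs. A minimal sketch of that check follows. It relies on Swarm naming stack services `<stack>_<service>`; the service keys used in the examples (`cadvisor`, `prometheus`) are assumptions, since `swarm/monitoring.yaml` is not shown here, and `svc_name` and `check_service` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; service keys are assumed.
STACK="monitoring"

# Swarm names a stack's services "<stack>_<service>".
svc_name() {
  printf '%s_%s\n' "$1" "$2"
}

# check_service SERVICE: show task state and recent log lines for one service.
check_service() {
  local svc
  svc="$(svc_name "$STACK" "$1")"
  docker service ps --no-trunc "$svc"    # task state, including the full error column
  docker service logs --tail 50 "$svc"   # the 50 most recent log lines
}

# Example calls, run on a Swarm manager node:
# check_service cadvisor
# check_service prometheus
```

The `--no-trunc` flag is useful here because task errors such as failed mounts or missing images are otherwise cut off in the `docker service ps` output.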
--- ---
@@ -108,13 +149,13 @@ Restore the stack by running `./deploy.sh` after a backup has been taken.
| Date | Commit | Summary | | Date | Commit | Summary |
|------|--------|---------| |------|--------|---------|
| 2026-04-07 | af94e455 | Initial documentation | | 2026-04-07 | 1df528ca | Initial documentation |
| 2026-04-07 | 04863ab6 | Updated Prometheus configuration | | 2026-04-07 | af94e455 | Minor changes to configuration files |
| 2026-04-07 | 0af60dbe | Fixed Uptime Kuma connection | | 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
--- ---
## Notes ## Notes
- Generated by Gremlin on 2026-04-08T01:37:42.636Z - Generated by Gremlin on 2026-04-08T01:48:22.128Z
- Source: swarm/monitoring.yaml - Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections