docs(gremlin): update monitoring

traveler 2026-04-07 21:10:17 -05:00
parent 52bd03c32d
commit f53e81ed75

@@ -1,67 +1,60 @@
---
title: monitoring Stack
description: Real-time monitoring of NetGrimoire services
published: true
date: 2026-04-08T01:48:22.128Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-08T01:48:22.128Z
---
# monitoring # monitoring
## Overview ## Overview
The monitoring stack is a critical component of NetGrimoire, providing real-time insights into the performance and health of its services. This stack consists of four primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter. The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.
| Service | Image | Port | Role |
|---------|-----|-----|---------|
- **Prometheus:** docker4
- **Grafana:** docker4
- **Alertmanager:** docker4
- **Cadvisor:** global (runs on all nodes)
- **Node Exporter:** global (runs on all nodes)
Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
Homepage group: Monitoring
--- ---
## Architecture ## Architecture
```markdown
| Service | Image | Port | Role |
|---------|-----|-----|---------|
- **Host:** docker4
- **Network:** netgrimoire
- **Exposed via:** <caddy domains from labels, or Internal only>
- **Homepage group:** <from homepage.group label>
* Prometheus: prometheus:latest on port 9090 | Service | Image | Port | Role |
* Grafana: grafana/grafana:latest on port 3000 |---------|-------|-----|------|
* Alertmanager: alertmanager:latest on port 9093 - **Prometheus:** prom/prometheus:latest
* Cadvisor: gcr.io/cadvisor/cadvisor:latest (global) - **Grafana:** grafana/grafana:latest
* Node Exporter: prom/node-exporter:latest (global) - **Alertmanager:** prom/alertmanager:latest
``` - **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
- **Node Exporter:** prom/node-exporter:latest
Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
Homepage group: Monitoring
--- ---
## Build & Configuration ## Build & Configuration
### Prerequisites ### Prerequisites
- Docker Swarm manager and worker nodes must be running. No specific prerequisites are required for this stack.
- Caddy and Uptime Kuma must be configured correctly.
### Volume Setup ### Volume Setup
```bash ```bash
mkdir -p /DockerVol/prometheus/data mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus
```
```bash
mkdir -p /DockerVol/grafana/data mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana
```
```bash
mkdir -p /DockerVol/alertmanager/data mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager
```
```bash
mkdir -p /DockerVol/cadvisor/
chown -R 1964:1964 /DockerVol/cadvisor/
```
```bash
mkdir -p /DockerVol/node-exporter/
chown -R 1964:1964 /DockerVol/node-exporter
``` ```
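The directory setup above can also be scripted as one loop. This is a minimal sketch, not the stack's actual provisioning script: the function name `create_monitoring_dirs` is hypothetical, and the `1964:1964` owner is taken from the earlier version of this section, so it is applied only when the script runs as root.

```bash
# Hypothetical helper: create the data directories for the three stateful services.
# $1 = base directory (default /DockerVol)
# $2 = owner (default 1964:1964, an assumption carried over from the old docs)
create_monitoring_dirs() {
    base="${1:-/DockerVol}"
    owner="${2:-1964:1964}"
    for svc in prometheus grafana alertmanager; do
        mkdir -p "$base/$svc/data" || return 1
        # chown requires root; skip it otherwise rather than fail.
        if [ "$(id -u)" -eq 0 ]; then
            chown -R "$owner" "$base/$svc" || return 1
        fi
    done
}

# Usage: create_monitoring_dirs /DockerVol 1964:1964
```

Taking the base directory as a parameter makes it possible to try the helper against a scratch directory before touching `/DockerVol`.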
### Environment Variables ### Environment Variables
```bash ```bash
# generate: openssl rand -hex 32 for secrets # generate: openssl rand -hex 32
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=F@lcon13 GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark GF_USERS_DEFAULT_THEME=dark
``` ```
@@ -77,7 +70,7 @@ docker stack services monitoring
``` ```
### First Run ### First Run
- Run `./deploy.sh` to initialize the stack. Run `./deploy.sh` after the initial deployment to complete any necessary setup.
--- ---
@@ -86,62 +79,38 @@ docker stack services monitoring
### Accessing monitoring ### Accessing monitoring
| Service | URL | Purpose | | Service | URL | Purpose |
|---------|-----|---------| |---------|-----|---------|
- **Prometheus:** https://prometheus.netgrimoire.com on port 9090 - **Prometheus:** https://prometheus.netgrimoire.com
- **Grafana:** https://grafana.netgrimoire.com on port 3000 - **Grafana:** https://grafana.netgrimoire.com
- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093 - **Alertmanager:** https://alertmanager.netgrimoire.com
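A quick way to confirm that the three URLs above respond is to probe each tool's health endpoint. The sketch below assumes the reverse proxy passes these paths through unchanged; `/-/healthy` (Prometheus, Alertmanager) and `/api/health` (Grafana) are the tools' standard health endpoints, while the helper names `health_url` and `check_health` are hypothetical.

```bash
# Print the health-check URL for a service.
# Hostnames come from the table above; that the proxy exposes
# these paths on them is an assumption.
health_url() {
    case "$1" in
        prometheus)   echo "https://prometheus.netgrimoire.com/-/healthy" ;;
        alertmanager) echo "https://alertmanager.netgrimoire.com/-/healthy" ;;
        grafana)      echo "https://grafana.netgrimoire.com/api/health" ;;
        *)            echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Probe one service; fails on HTTP 4xx/5xx responses, timeouts, or connection errors.
check_health() {
    url="$(health_url "$1")" || return 1
    curl --fail --silent --show-error --max-time 10 --output /dev/null "$url"
}

# Usage:
#   for s in prometheus grafana alertmanager; do
#       check_health "$s" && echo "$s ok"
#   done
```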
### Primary Use Cases ### Primary Use Cases
- Monitor service performance and health. This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.
- Visualize metrics in Grafana.
### NetGrimoire Integrations ### NetGrimoire Integrations
- Alertmanager connects to Cadvisor for container metrics. Prometheus scrapes Cadvisor for container metrics and Node Exporter for host metrics from all nodes in the cluster, including Pi. Prometheus forwards firing alerts to Alertmanager. Grafana queries Prometheus to display dashboards.
- Prometheus connects to Cadvisor for container metrics.
--- ---
## Operations ## Operations
### Monitoring ### Monitoring
```bash Monitor the services using `docker stack services monitoring` and view logs for an individual service with `docker service logs -f monitoring_<service>`.
docker stack services monitoring
# kuma monitors from kuma.* labels
```
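Because `docker service logs` takes a service name rather than a stack name, checking the whole stack means iterating over its services. The following is a rough sketch of that loop; the function names are hypothetical, and the stack name `monitoring` is taken from the command above.

```bash
STACK="${STACK:-monitoring}"

# List the names of all services in the stack, one per line.
stack_service_names() {
    docker stack services --format '{{.Name}}' "$STACK"
}

# Show the last N log lines (default 20) of every service in the stack.
tail_stack_logs() {
    lines="${1:-20}"
    stack_service_names | while read -r name; do
        echo "== $name =="
        # </dev/null keeps docker from consuming the service list on stdin.
        docker service logs --tail "$lines" "$name" 2>&1 </dev/null
    done
}

# Usage: tail_stack_logs 50
```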
### Backups ### Backups
- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`. Regular backups are recommended for critical data stored in `/DockerVol/`. To see which paths a service mounts, run `docker service inspect monitoring_<service>` and check its mounts.
- Reconstructing the stack will require rebuilding all services.
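One simple way to back up the data directories named above is a timestamped tarball. This is only a sketch under stated assumptions: the helper name `backup_monitoring` and the archive naming scheme are hypothetical, and archiving a live Prometheus data directory may capture an inconsistent state, so scaling the service down first is the safer option.

```bash
# Hypothetical helper: archive the Prometheus and Grafana data directories.
# $1 = base directory (default /DockerVol), $2 = destination directory (default .)
# Prints the path of the archive it created.
backup_monitoring() {
    base="${1:-/DockerVol}"
    dest="${2:-.}"
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest/monitoring-$stamp.tar.gz"
    # -C keeps paths inside the archive relative to the base directory.
    tar -czf "$archive" -C "$base" prometheus/data grafana/data || return 1
    echo "$archive"
}

# Usage: backup_monitoring /DockerVol /path/to/backups
```

Restoring would then be a matter of extracting the archive back under the base directory with `tar -xzf <archive> -C /DockerVol` before redeploying.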
### Restore ### Restore
```bash Restore the services by running `./deploy.sh` after any changes to the configuration.
cd services/swarm/stack/monitoring
./deploy.sh
```
--- ---
## Common Failures ## Common Failures
| Failure | Symptom | Cause | Fix | | Failure | Symptom | Cause | Fix |
|--------|---------|-------|-----| |--------|---------|------|-----|
1. Cadvisor is not running. - **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Force a restart of the Prometheus service, e.g. `docker service update --force monitoring_prometheus` (service name assumed from the stack name) |
- Symptom: No container metrics are being collected. - **Alerts not forwarding to Alertmanager** | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
- Cause: Cadvisor service is not deployed correctly. - **Grafana dashboard not displaying data** | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
- Fix: Run `docker stack services monitoring` and check the logs for any errors.
2. Prometheus is not collecting metrics.
- Symptom: Metrics are not showing up in Grafana.
- Cause: Prometheus configuration is incorrect.
- Fix: Check Prometheus configuration files for any typos or syntax errors.
3. Alertmanager is not sending alerts.
- Symptom: No alerts are being sent to the console.
- Cause: Alertmanager configuration is incorrect.
- Fix: Check Alertmanager configuration files for any typos or syntax errors.
4. Uptime Kuma is not monitoring services.
- Symptom: Services are not showing up in Uptime Kuma.
- Cause: Uptime Kuma configuration is incorrect.
- Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
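For the failures above, the usual first steps are to look at a service's task history and, if needed, force it to restart. Docker has no `docker service restart` subcommand; `docker service update --force` is the closest equivalent. The sketch below assumes Swarm's default `<stack>_<service>` naming and that the services are named `prometheus`, `grafana`, and so on in the stack file; the helper names are hypothetical.

```bash
STACK="${STACK:-monitoring}"

# Show task history (including error messages) for one service in the stack.
service_tasks() {
    docker service ps --no-trunc "${STACK}_$1"
}

# Force a restart of one service's tasks without changing its definition.
restart_service() {
    docker service update --force "${STACK}_$1"
}

# Usage:
#   service_tasks prometheus
#   restart_service prometheus
```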
--- ---
@@ -149,13 +118,15 @@ cd services/swarm/stack/monitoring
| Date | Commit | Summary | | Date | Commit | Summary |
|------|--------|---------| |------|--------|---------|
| 2026-04-07 | 1df528ca | Initial documentation | | 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
| 2026-04-07 | af94e455 | Minor changes to configuration files | | 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment | | 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration | | 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
--- ---
## Notes ## Notes
- Generated by Gremlin on 2026-04-08T01:48:22.128Z - Generated by Gremlin on 2026-04-08T02:08:17.740Z
- Source: swarm/monitoring.yaml - Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections