docs(gremlin): update monitoring
parent
6f052e9bbc
commit
52bd03c32d
1 changed files with 75 additions and 34 deletions
|
|
@ -1,39 +1,55 @@
|
||||||
---
|
---
|
||||||
title: monitoring Stack
|
title: monitoring Stack
|
||||||
description: NetGrimoire Monitoring Services
|
description: Real-time monitoring of NetGrimoire services
|
||||||
published: true
|
published: true
|
||||||
date: 2026-04-08T01:37:42.636Z
|
date: 2026-04-08T01:48:22.128Z
|
||||||
tags: docker,swarm,monitoring,netgrimoire
|
tags: docker,swarm,monitoring,netgrimoire
|
||||||
editor: markdown
|
editor: markdown
|
||||||
dateCreated: 2026-04-08T01:37:42.636Z
|
dateCreated: 2026-04-08T01:48:22.128Z
|
||||||
---
|
---
|
||||||
|
|
||||||
# monitoring
|
# monitoring
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
The monitoring stack in NetGrimoire is designed to provide real-time metrics and dashboards for system health and performance monitoring. The stack consists of Prometheus, Grafana, Alertmanager, Cadvisor, Node Exporter, and Uptime Kuma.
|
The monitoring stack is a critical component of NetGrimoire, providing real-time insight into the performance and health of its services. The stack consists of five primary services: Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter.
|
||||||
|
|
||||||
|
| Service | Image | Port | Role |
|
||||||
|
|---------|-----|-----|---------|
|
||||||
|
- **Prometheus:** docker4
|
||||||
|
- **Grafana:** docker4
|
||||||
|
- **Alertmanager:** docker4
|
||||||
|
- **Cadvisor:** global (runs on all nodes)
|
||||||
|
- **Node Exporter:** global (runs on all nodes)
|
||||||
|
|
||||||
|
Exposed via: alertmanager.netgrimoire.com, grafana.netgrimoire.com
|
||||||
|
|
||||||
|
Homepage group: Monitoring
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
```markdown
|
||||||
| Service | Image | Port | Role |
|
| Service | Image | Port | Role |
|
||||||
|---------|-------|-----|------|
|
|---------|-----|-----|---------|
|
||||||
- **Prometheus:** prom/prometheus:latest | 9090 | Metrics Collection |
|
- **Host:** docker4
|
||||||
- **Grafana:** grafana/grafana:latest | 3000 | Dashboards |
|
- **Network:** netgrimoire
|
||||||
- **Alertmanager:** prom/alertmanager:latest | 9093 | Alert Routing |
|
- **Exposed via:** <caddy domains from labels, or Internal only>
|
||||||
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest | / | Container Metrics (all nodes) |
|
- **Homepage group:** <from homepage.group label>
|
||||||
- **Node Exporter:** prom/node-exporter:latest | - | Host Metrics (all nodes) |
|
|
||||||
- **Uptime Kuma:** - | - | Monitoring |
|
|
||||||
|
|
||||||
Exposed via: <caddy domains from labels, or Internal only>
|
* Prometheus: prom/prometheus:latest on port 9090
|
||||||
Homepage group: Monitoring
|
* Grafana: grafana/grafana:latest on port 3000
|
||||||
|
* Alertmanager: prom/alertmanager:latest on port 9093
|
||||||
|
* Cadvisor: gcr.io/cadvisor/cadvisor:latest (global)
|
||||||
|
* Node Exporter: prom/node-exporter:latest (global)
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Build & Configuration
|
## Build & Configuration
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
No specific prerequisites are required for this stack.
|
- Docker Swarm manager and worker nodes must be running.
|
||||||
|
- Caddy and Uptime Kuma must be configured correctly.
|
||||||
|
|
||||||
### Volume Setup
|
### Volume Setup
|
||||||
```bash
|
```bash
|
||||||
|
|
@ -44,9 +60,10 @@ mkdir -p /DockerVol/alertmanager/data
|
||||||
|
|
||||||
### Environment Variables
|
### Environment Variables
|
||||||
```bash
|
```bash
|
||||||
# generate: openssl rand -hex 32
|
# generate: openssl rand -hex 32 for secrets
|
||||||
GF_SECURITY_ADMIN_PASSWORD: F@lcon13
|
GF_SECURITY_ADMIN_USER=admin
|
||||||
GF_USERS_DEFAULT_THEME: dark
|
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
|
||||||
|
GF_USERS_DEFAULT_THEME=dark
|
||||||
```
|
```
|
||||||
|
|
||||||
### Deploy
|
### Deploy
|
||||||
|
|
@ -60,7 +77,7 @@ docker stack services monitoring
|
||||||
```
|
```
|
||||||
|
|
||||||
### First Run
|
### First Run
|
||||||
After deployment, verify that all services are running and Uptime Kuma is connected to Prometheus and Grafana.
|
- Run `./deploy.sh` to initialize the stack.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -69,38 +86,62 @@ After deployment, verify that all services are running and Uptime Kuma is connec
|
||||||
### Accessing monitoring
|
### Accessing monitoring
|
||||||
| Service | URL | Purpose |
|
| Service | URL | Purpose |
|
||||||
|---------|-----|---------|
|
|---------|-----|---------|
|
||||||
- **Prometheus:** http://prometheus:9090 | Metrics Collection |
|
- **Prometheus:** https://prometheus.netgrimoire.com on port 9090
|
||||||
- **Grafana:** https://grafana.netgrimoire.com | Dashboards |
|
- **Grafana:** https://grafana.netgrimoire.com on port 3000
|
||||||
|
- **Alertmanager:** https://alertmanager.netgrimoire.com on port 9093
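
A quick reachability check for the endpoints listed above can be sketched as follows. The hostnames are taken from this list; the health paths (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana) are the upstream health endpoints. The function names `health_url` and `check_all` are illustrative and not part of the stack, and the sketch assumes the hosts are reachable through the reverse proxy on the default HTTPS port.

```bash
#!/usr/bin/env bash
# Map a service name to its health-check URL.
# Hostnames come from the list above; function names are hypothetical.
health_url() {
  case "$1" in
    prometheus)   echo "https://prometheus.netgrimoire.com/-/healthy" ;;
    grafana)      echo "https://grafana.netgrimoire.com/api/health" ;;
    alertmanager) echo "https://alertmanager.netgrimoire.com/-/healthy" ;;
    *)            echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Probe every service and print one OK/FAIL line per service.
check_all() {
  local svc url
  for svc in prometheus grafana alertmanager; do
    url="$(health_url "$svc")"
    # -f makes curl fail on HTTP errors; --max-time bounds each probe.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $svc"
    else
      echo "FAIL $svc ($url)"
    fi
  done
}
```

Sourcing the file and running `check_all` prints one line per service. Nothing is probed until the function is called.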
|
||||||
|
|
||||||
### Primary Use Cases
|
### Primary Use Cases
|
||||||
This stack provides real-time metrics and dashboards for system health and performance monitoring.
|
- Monitor service performance and health.
|
||||||
|
- Visualize metrics in Grafana.
|
||||||
|
|
||||||
### NetGrimoire Integrations
|
### NetGrimoire Integrations
|
||||||
This stack connects to Uptime Kuma for monitoring, Alertmanager for alert routing, and Cadvisor for container metrics.
|
- Alertmanager receives alerts from Prometheus and routes them to their receivers.
|
||||||
|
- Prometheus connects to Cadvisor for container metrics.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Operations
|
## Operations
|
||||||
|
|
||||||
### Monitoring
|
### Monitoring
|
||||||
Use `docker stack services monitoring` to list the stack's services and their replica counts, and `docker service logs -f monitoring_<service>` to follow a single service's output in real time.
|
|
||||||
```bash
|
```bash
|
||||||
docker stack services monitoring
|
docker stack services monitoring
|
||||||
|
# kuma monitors from kuma.* labels
|
||||||
```
|
```
|
||||||
|
|
||||||
### Backups
|
### Backups
|
||||||
Critical data is stored on `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly.
|
- Critical data is stored in `/DockerVol/prometheus/data` and `/DockerVol/grafana/data`.
|
||||||
|
- Reconstructing the stack will require rebuilding all services.
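
One way to capture the data directories is a dated tarball. This is a minimal sketch rather than the backup procedure actually used here: the function name `backup_monitoring`, the archive naming, and the destination directory are assumptions. It includes `alertmanager/data` because the volume setup creates that directory. Copying a live database directory can catch it mid-write, so scaling the services down first is the cautious option.

```bash
#!/usr/bin/env bash
# Archive the monitoring data directories into one tarball.
# Usage: backup_monitoring <volume-root> <dest-dir>
# (hypothetical helper; e.g. backup_monitoring /DockerVol /some/backup/dir)
backup_monitoring() {
  local root="$1" dest="$2" svc
  local archive="$dest/monitoring-$(date +%Y%m%d-%H%M%S).tar.gz"
  local -a dirs=()
  for svc in prometheus grafana alertmanager; do
    # Skip services whose data directory is absent on this host.
    if [ -d "$root/$svc/data" ]; then
      dirs+=("$svc/data")
    fi
  done
  if [ "${#dirs[@]}" -eq 0 ]; then
    echo "nothing to back up under $root" >&2
    return 1
  fi
  mkdir -p "$dest"
  # -C keeps the paths inside the archive relative to the volume root.
  tar -czf "$archive" -C "$root" "${dirs[@]}" && echo "$archive"
}
```

The function prints the path of the archive it wrote, so the result can be passed to a copy or upload step.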
|
||||||
|
|
||||||
### Restore
|
### Restore
|
||||||
Restore the stack by running `./deploy.sh` after a backup has been taken.
|
```bash
|
||||||
|
cd services/swarm/stack/monitoring
|
||||||
|
./deploy.sh
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Common Failures
|
## Common Failures
|
||||||
| Failure | Symptom | Cause | Fix |
|
| Failure | Symptom | Cause | Fix |
|
||||||
|--------|---------|------|-----|
|
|--------|---------|-------|-----|
|
||||||
| Prometheus not responding | No metrics displayed on Grafana | Prometheus not configured correctly | Check Prometheus configuration and restart service |
|
1. Cadvisor is not running.
|
||||||
| Alertmanager not sending alerts | No alerts received for long periods | Alertmanager not configured correctly | Check Alertmanager configuration and restart service |
|
- Symptom: No container metrics are being collected.
|
||||||
|
- Cause: Cadvisor service is not deployed correctly.
|
||||||
|
- Fix: Run `docker stack services monitoring` to check the replica counts, then inspect the affected service with `docker service logs monitoring_<service>` for errors.
|
||||||
|
|
||||||
|
2. Prometheus is not collecting metrics.
|
||||||
|
- Symptom: Metrics are not showing up in Grafana.
|
||||||
|
- Cause: Prometheus configuration is incorrect.
|
||||||
|
- Fix: Check Prometheus configuration files for any typos or syntax errors.
|
||||||
|
|
||||||
|
3. Alertmanager is not sending alerts.
|
||||||
|
- Symptom: No alerts are being sent to the console.
|
||||||
|
- Cause: Alertmanager configuration is incorrect.
|
||||||
|
- Fix: Check Alertmanager configuration files for any typos or syntax errors.
|
||||||
|
|
||||||
|
4. Uptime Kuma is not monitoring services.
|
||||||
|
- Symptom: Services are not showing up in Uptime Kuma.
|
||||||
|
- Cause: Uptime Kuma configuration is incorrect.
|
||||||
|
- Fix: Check Uptime Kuma configuration files for any typos or syntax errors.
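
For most of the failures above, a useful first check is whether every service has all of its tasks running. The sketch below filters the output of `docker stack services` for services whose running count differs from the desired count. The function name `degraded_services` is hypothetical, and the service names in the comments assume the usual `<stack>_<service>` naming.

```bash
#!/usr/bin/env bash
# Read "<name> <running>/<desired>" lines on stdin and print the services
# that are not running all desired tasks. Returns 1 if any are degraded.
degraded_services() {
  local name replicas _rest running desired status=0
  while read -r name replicas _rest; do
    [ -n "$name" ] || continue
    running="${replicas%%/*}"   # text before the slash
    desired="${replicas##*/}"   # text after the slash
    if [ "$running" != "$desired" ]; then
      echo "$name $replicas"
      status=1
    fi
  done
  return "$status"
}

# Against a live swarm (run on a manager node):
#   docker stack services monitoring --format '{{.Name}} {{.Replicas}}' | degraded_services
# A line such as "monitoring_cadvisor 2/3" would point at failure 1 above.
```

An empty result with exit status 0 means every service reports its full replica count. The configuration checks listed above are then the next place to look.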
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -108,13 +149,13 @@ Restore the stack by running `./deploy.sh` after a backup has been taken.
|
||||||
|
|
||||||
| Date | Commit | Summary |
|
| Date | Commit | Summary |
|
||||||
|------|--------|---------|
|
|------|--------|---------|
|
||||||
| 2026-04-07 | af94e455 | Initial documentation |
|
| 2026-04-07 | 1df528ca | Initial documentation |
|
||||||
| 2026-04-07 | 04863ab6 | Updated Prometheus configuration |
|
| 2026-04-07 | af94e455 | Minor changes to configuration files |
|
||||||
| 2026-04-07 | 0af60dbe | Fixed Uptime Kuma connection |
|
| 2026-04-07 | 04863ab6 | Fixed Cadvisor service deployment |
|
||||||
|
| 2026-04-07 | 0af60dbe | Fixed Prometheus configuration |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
- Generated by Gremlin on 2026-04-08T01:37:42.636Z
|
- Generated by Gremlin on 2026-04-08T01:48:22.128Z
|
||||||
- Source: swarm/monitoring.yaml
|
- Source: swarm/monitoring.yaml
|
||||||
- Review User Guide and Changelog sections
|
|
||||||