docs(gremlin): update monitoring

Commit `6f052e9bbc` (parent `b157b3d064`): 1 changed file with 46 additions and 35 deletions.

---
title: monitoring Stack
description: NetGrimoire Monitoring Services
published: true
date: 2026-04-08T01:37:42.636Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-08T01:37:42.636Z
---

# monitoring

## Overview

The monitoring stack in NetGrimoire collects host, container, and service metrics, presents them as dashboards, routes alerts, and tracks service uptime. It consists of Prometheus, Grafana, Alertmanager, cAdvisor, Node Exporter, and Uptime Kuma.

---

## Architecture

| Service | Image | Port | Role |
|---------|-------|------|------|
| Prometheus | prom/prometheus:latest | 9090 | Metrics collection |
| Grafana | grafana/grafana:latest | 3000 | Dashboards |
| Alertmanager | prom/alertmanager:latest | 9093 | Alert routing |
| cAdvisor | gcr.io/cadvisor/cadvisor:latest | Internal only | Container metrics (all nodes) |
| Node Exporter | prom/node-exporter:latest | Internal only | Host metrics (all nodes) |
| Uptime Kuma | - | - | Uptime monitoring |

Exposed via Caddy: `prometheus.netgrimoire.com`, `grafana.netgrimoire.com`, and `alertmanager.netgrimoire.com`. cAdvisor and Node Exporter are internal only.

Homepage group: Monitoring

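cAdvisor and Node Exporter run on every node and publish no ports, so Prometheus has to find them over the stack's overlay network. A minimal sketch of a matching `prometheus.yml`, assuming the services are named `cadvisor`, `node-exporter`, and `alertmanager` in `swarm/monitoring.yaml`, share a network with Prometheus, and listen on the images' default ports (8080 for cAdvisor, 9100 for Node Exporter). Swarm's `tasks.<service>` DNS name resolves to one address per running task, so each node's instance becomes a target. The helper name and output path are hypothetical.

```bash
# Hypothetical helper: write a prometheus.yml that discovers every cAdvisor and
# Node Exporter task through Swarm's tasks.<service> DNS records.
# Service names and ports below are assumptions, not taken from swarm/monitoring.yaml.
write_prometheus_config() {
  local out="${1:?usage: write_prometheus_config <output-file>}"
  cat > "$out" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # One A record per running task, i.e. one target per node for global services
  - job_name: cadvisor
    dns_sd_configs:
      - names: ['tasks.cadvisor']
        type: A
        port: 8080

  - job_name: node-exporter
    dns_sd_configs:
      - names: ['tasks.node-exporter']
        type: A
        port: 9100

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
EOF
}
```

For example, run `write_prometheus_config ./prometheus.yml` and check the result with `promtool check config ./prometheus.yml` before mounting it into the Prometheus service.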

---

## Build & Configuration

### Prerequisites

This stack has no specific prerequisites.

### Volume Setup

```bash
# …
mkdir -p /DockerVol/alertmanager/data
```

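The data paths listed under Backups are host directories, which suggests bind mounts. If so, each directory has to exist on the node that runs its service and be writable by the user the image runs as: UID 472 for `grafana/grafana` and `nobody` (UID 65534) for `prom/prometheus` and `prom/alertmanager`. A minimal sketch under those assumptions; the function name and the optional base-directory argument are hypothetical.

```bash
# Hypothetical helper: create the data directories named in the Backups section and
# hand each one to the UID its upstream image runs as (assumes bind mounts).
prepare_monitoring_volumes() {
  local base="${1:-/DockerVol}"
  # prom/prometheus and prom/alertmanager run as nobody (65534); grafana/grafana as 472
  local -A owner=(
    [prometheus/data]=65534
    [grafana/data]=472
    [alertmanager/data]=65534
  )
  local dir
  for dir in "${!owner[@]}"; do
    mkdir -p "$base/$dir" || return 1
    if [[ $EUID -eq 0 ]]; then
      chown "${owner[$dir]}" "$base/$dir" || return 1
    else
      # chown needs root; leave ownership alone when run unprivileged
      echo "not root: skipped chown ${owner[$dir]} $base/$dir" >&2
    fi
  done
}
```

Run it as root on each node, for example `prepare_monitoring_volumes /DockerVol`.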
### Environment Variables

```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
```

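The comment in the block above says the admin password should come from `openssl rand -hex 32`. A minimal sketch that does exactly that and keeps the value out of version control by writing it to an env file readable only by its owner. The helper name and file path are hypothetical, and the stack file would still need to load the file (for example through an `env_file:` entry).

```bash
# Hypothetical helper: write Grafana's variables to an env file, generating the
# admin password with `openssl rand -hex 32` instead of hard-coding it.
write_grafana_env() {
  local out="${1:?usage: write_grafana_env <env-file>}"
  local password
  password="$(openssl rand -hex 32)" || return 1
  # umask 077 in a subshell: a newly created file gets mode 600 and the caller's umask is untouched
  (
    umask 077
    printf '%s\n' \
      "GF_SECURITY_ADMIN_PASSWORD=${password}" \
      "GF_USERS_DEFAULT_THEME=dark" > "$out"
  )
}
```

Grafana applies `GF_SECURITY_ADMIN_PASSWORD` only when it first initializes its database, so changing the value later does not change an existing admin password. Grafana can also read any setting from a file when the variable name ends in `__FILE`, which pairs with Docker secrets (e.g. `GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/<secret-name>`).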
### Deploy

```bash
# …
docker stack services monitoring
```

### First Run

After deployment, confirm that every service reports its full replica count in `docker stack services monitoring` and that Uptime Kuma can reach Prometheus and Grafana.

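The replica check can be scripted. A minimal sketch that consumes `docker stack services` output formatted with the `{{.Name}}` and `{{.Replicas}}` placeholders, read from stdin so it can be tested without a swarm, and prints each service whose running count differs from the desired count. The function name is hypothetical.

```bash
# Hypothetical helper: read "<name> <running>/<desired>" lines and print the
# services that have not converged yet. Prints nothing when the stack is healthy.
unconverged_services() {
  local name replicas running desired
  while read -r name replicas; do
    [[ -n "$name" ]] || continue
    running="${replicas%%/*}"
    desired="${replicas#*/}"
    # Drop any annotation after the count, e.g. "(max 1 per node)"
    desired="${desired%% *}"
    if [[ "$running" != "$desired" ]]; then
      printf '%s %s/%s\n' "$name" "$running" "$desired"
    fi
  done
}
```

Usage: `docker stack services --format '{{.Name}} {{.Replicas}}' monitoring | unconverged_services`. Empty output means every service has converged.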

---

## User Guide

### Accessing Monitoring

| Service | URL | Purpose |
|---------|-----|---------|
| Prometheus | https://prometheus.netgrimoire.com (in-cluster: `http://prometheus:9090`) | Metrics collection |
| Grafana | https://grafana.netgrimoire.com | Dashboards |
| Alertmanager | https://alertmanager.netgrimoire.com | Alert routing |

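A quick reachability check for the URLs above. Prometheus and Alertmanager both serve a `/-/healthy` endpoint and Grafana serves `/api/health`. This minimal sketch assumes the public hostnames answer over HTTPS through Caddy; the function names are hypothetical.

```bash
# Hypothetical helpers: probe each service's health endpoint and print one line per service.
check_endpoint() {
  local name="$1" url="$2"
  # -f fails on HTTP errors, -sS hides progress but keeps errors, --max-time caps the wait
  if curl -fsS --max-time 5 -o /dev/null "$url"; then
    echo "$name OK"
  else
    echo "$name FAIL ($url)"
    return 1
  fi
}

check_monitoring() {
  local rc=0
  check_endpoint prometheus   https://prometheus.netgrimoire.com/-/healthy   || rc=1
  check_endpoint alertmanager https://alertmanager.netgrimoire.com/-/healthy || rc=1
  check_endpoint grafana      https://grafana.netgrimoire.com/api/health     || rc=1
  return "$rc"
}
```

`check_monitoring` returns non-zero if any probe fails, so it can be used from scripts or cron jobs.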
### Primary Use Cases

Open Grafana at https://grafana.netgrimoire.com and log in with the admin credentials (user `admin` by default, password from `GF_SECURITY_ADMIN_PASSWORD`) to view dashboards of host and container metrics. Use Prometheus for direct metric queries, Alertmanager for alert routing, and Uptime Kuma for availability checks.

### NetGrimoire Integrations

The stack is published through the Caddy reverse proxy using service labels, which also integrate it with `crowdsec`, and it appears on Homepage in the Monitoring group. Inside the stack, Prometheus collects metrics from cAdvisor and Node Exporter, Grafana visualizes the Prometheus data, Alertmanager routes alerts, and Uptime Kuma monitors availability.

---

## Operations

### Monitoring

Use `docker stack services monitoring` to list the stack's services and their replica counts, and `docker service logs -f monitoring_<service>` to follow one service's output in real time. Swarm names each stack service `<stack>_<service>`, so the Prometheus logs are under `monitoring_prometheus`.

```bash
docker stack services monitoring
docker service logs -f monitoring_prometheus
```

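To skim recent output from every service at once instead of following one, the service names can be read from the stack itself rather than guessed. A minimal sketch; the function name and the default of 50 lines are hypothetical choices.

```bash
# Hypothetical helper: print the last N log lines of every service in a stack.
tail_stack_logs() {
  local stack="${1:-monitoring}" lines="${2:-50}"
  local -a services
  local svc
  # Service names come from Swarm (<stack>_<service>), so nothing is hard-coded here
  mapfile -t services < <(docker stack services --format '{{.Name}}' "$stack")
  for svc in "${services[@]}"; do
    echo "==> $svc"
    docker service logs --tail "$lines" "$svc"
  done
}
```

For example, `tail_stack_logs monitoring 100` prints the last 100 lines from each service, one block per service.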
### Backups

Critical data is stored in `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These directories are backed up regularly.

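Because the data sits in plain host directories, a backup can be as simple as a compressed tarball of them. A minimal sketch; the function name and archive naming are hypothetical. Copying a live Prometheus TSDB or Grafana database can produce an inconsistent snapshot, so it is safer to stop the services first (for example `docker service scale monitoring_prometheus=0`, assuming the stack's service names) and scale them back afterwards.

```bash
# Hypothetical helper: archive the stack's data directories into a timestamped
# .tar.gz under <backup-dir> and print the archive path.
backup_monitoring() {
  local dest="${1:?usage: backup_monitoring <backup-dir> [data-root]}"
  local root="${2:-/DockerVol}"
  local archive
  archive="$dest/monitoring-$(date +%Y%m%dT%H%M%S).tar.gz"
  mkdir -p "$dest" || return 1
  # -C makes the stored paths relative to the data root (e.g. grafana/data/...)
  tar -czf "$archive" -C "$root" prometheus/data grafana/data alertmanager/data || return 1
  echo "$archive"
}
```

For example, `backup_monitoring /srv/backups` writes `/srv/backups/monitoring-<timestamp>.tar.gz` from the default `/DockerVol` root.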
### Restore

To restore, copy the backed-up data back into the directories listed under Backups, then redeploy the stack by running `./deploy.sh`.

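Restoring is the reverse of a tarball backup: with the services stopped, unpack the archive over the data root and redeploy. A minimal sketch for an archive whose paths are relative to `/DockerVol` (such as `grafana/data/...`); the function name is hypothetical.

```bash
# Hypothetical helper: unpack a backup archive into the data root. Run it while the
# services are stopped, then redeploy the stack (e.g. with ./deploy.sh).
restore_monitoring() {
  local archive="${1:?usage: restore_monitoring <archive.tar.gz> [data-root]}"
  local root="${2:-/DockerVol}"
  if [[ ! -f "$archive" ]]; then
    echo "no such archive: $archive" >&2
    return 1
  fi
  mkdir -p "$root" || return 1
  # -p keeps file modes; when run as root, GNU tar also restores the stored owners
  tar -xzpf "$archive" -C "$root"
}
```

Running it as root matters when the archive carries the container UIDs (472, 65534); otherwise the extracted files belong to the invoking user.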

---

## Common Failures

| Failure | Symptom | Cause | Fix |
|---------|---------|-------|-----|
| Prometheus not responding | No metrics displayed in Grafana | Invalid or incorrect Prometheus configuration | Validate the Prometheus configuration (e.g. with `promtool check config`) and restart the service |
| Alertmanager not sending alerts | No alerts received for long periods | Invalid or incorrect Alertmanager configuration | Validate the Alertmanager configuration (e.g. with `amtool check-config`) and restart the service |

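The Fix column can be made concrete with the validators shipped in the official images (`promtool` in `prom/prometheus`, `amtool` in `prom/alertmanager`) and a forced rolling restart with `docker service update --force`. A minimal sketch, assuming you pass the host path of each configuration file and that the stack services are named `monitoring_<service>`; the helper names are hypothetical.

```bash
# Hypothetical helpers: validate configs with the tools bundled in the official
# images, then force Swarm to redeploy a service's tasks.
validate_prometheus_config() {
  local cfg="${1:?usage: validate_prometheus_config <path/to/prometheus.yml>}"
  local dir
  dir="$(cd "$(dirname "$cfg")" && pwd)" || return 1
  # Mount the config directory read-only so relative rule_files paths still resolve
  docker run --rm -v "$dir:/cfg:ro" --entrypoint promtool \
    prom/prometheus:latest check config "/cfg/$(basename "$cfg")"
}

validate_alertmanager_config() {
  local cfg="${1:?usage: validate_alertmanager_config <path/to/alertmanager.yml>}"
  local dir
  dir="$(cd "$(dirname "$cfg")" && pwd)" || return 1
  docker run --rm -v "$dir:/cfg:ro" --entrypoint amtool \
    prom/alertmanager:latest check-config "/cfg/$(basename "$cfg")"
}

restart_monitoring_service() {
  # --force redeploys the tasks even though the service spec is unchanged
  docker service update --force "monitoring_${1:?usage: restart_monitoring_service <service>}"
}
```

For example, `validate_prometheus_config ./prometheus.yml && restart_monitoring_service prometheus` only restarts Prometheus when the configuration passes.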

---

## Changelog

| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | af94e455 | Initial documentation |
| 2026-04-07 | 04863ab6 | Updated Prometheus configuration |
| 2026-04-07 | 0af60dbe | Fixed Uptime Kuma connection |

---

## Notes

- Generated by Gremlin on 2026-04-08T01:37:42.636Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections