docs(gremlin): update monitoring

traveler 2026-04-09 22:19:50 -05:00
parent aa3f11b7f9
commit 8a024f5f7e


@ -1,20 +1,37 @@
---
title: monitoring Stack
description: NetGrimoire Monitoring Service
published: true
date: 2026-04-10T03:17:27.514Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-10T03:17:27.514Z
---
# monitoring
## Overview
This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, processing, and visualizing system metrics. The stack consists of Prometheus, Grafana, Alertmanager, Blackbox Exporter, Cadvisor, and Node Exporter. These services work together to provide real-time insights into the health and performance of the NetGrimoire infrastructure.
---
## Architecture
| Service | Image | Port | Role |
|---------|-------|-----|------|
- **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection |
- **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards |
- **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing |
- **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
- **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics |
|- **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
|- **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
|- **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
|- **Blackbox Exporter** | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
|- **Cadvisor** | gcr.io/cadvisor/cadvisor:latest | / | Multi-arch image (global) |
|- **Node Exporter** | prom/node-exporter:latest | / | Host metrics (all nodes) |
Exposed via: `caddy.netgrimoire.com`
Exposed via:
- `prometheus.netgrimoire.com`
- `grafana.netgrimoire.com`
- `alertmanager.netgrimoire.com`
- `blackbox.netgrimoire.com`
Exposed to internal services via Caddy reverse proxy.
Homepage group: Monitoring
@ -23,37 +40,19 @@ Homepage group: Monitoring
## Build & Configuration
### Prerequisites
No specific prerequisites for this stack.
Generate environment variables using `openssl rand -hex 32`.
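A minimal sketch of the generation step, assuming a bash-compatible shell. The function name `gen_secret` and the variable `SECRET` are hypothetical, and the `/dev/urandom` fallback is an addition for hosts without `openssl`, not part of the original instructions.

```bash
# Hypothetical helper: print a 64-character hex secret (32 random bytes).
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    # Fallback (assumption): read 32 bytes from /dev/urandom and hex-encode them.
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

SECRET="$(gen_secret)"
echo "generated a secret of ${#SECRET} characters"
```

The value would then be pasted into the stack's environment file rather than committed to the repository.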
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus/data
```
```bash
mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana/data
```
```bash
mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager/data
```
```bash
mkdir -p /DockerVol/cadvisor/data
chown -R 1964:1964 /DockerVol/cadvisor/data
```
```bash
mkdir -p /DockerVol/node-exporter/data
chown -R 1964:1964 /DockerVol/node-exporter/data
mkdir -p /DockerVol/blackbox/config
```
### Environment Variables
```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
### Deploy
@ -67,7 +66,10 @@ docker stack services monitoring
```
### First Run
After the initial deployment, ensure that Prometheus is scraped by Grafana and Alertmanager is configured to forward alerts to Cadvisor.
Post-deploy steps specific to these services:
- Start Cadvisor and Node Exporter.
- Configure Grafana with default settings.
---
@ -75,28 +77,29 @@ After the initial deployment, ensure that Prometheus is scraped by Grafana and A
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
- **Grafana** | https://grafana.netgrimoire.com | Dashboards |
- **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing |
|- **Prometheus** | `http://prometheus.netgrimoire.com` | Metrics Collection |
|- **Grafana** | `https://grafana.netgrimoire.com` | Dashboards |
|- **Alertmanager** | `https://alertmanager.netgrimoire.com` | Alert Routing |
|- **Blackbox Exporter** | `http://blackbox.netgrimoire.com` | HTTP/TCP/ICMP Probing |
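The URLs in the table can be probed from a shell. The sketch below assumes the standard health endpoints of each project (`/-/healthy` for Prometheus, Alertmanager, and Blackbox Exporter; `/api/health` for Grafana) are reachable through the reverse proxy; the function names `health_url` and `check_all` are hypothetical.

```bash
# Hypothetical helper: print the health-check URL for one service.
# Schemes and hostnames are copied from the access table.
health_url() {
  case "$1" in
    prometheus)   echo "http://prometheus.netgrimoire.com/-/healthy" ;;
    grafana)      echo "https://grafana.netgrimoire.com/api/health" ;;
    alertmanager) echo "https://alertmanager.netgrimoire.com/-/healthy" ;;
    blackbox)     echo "http://blackbox.netgrimoire.com/-/healthy" ;;
    *)            echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Probe every service; curl -f turns HTTP error responses into a non-zero exit status.
check_all() {
  for svc in prometheus grafana alertmanager blackbox; do
    if curl -fsS --max-time 5 -o /dev/null "$(health_url "$svc")"; then
      echo "$svc: ok"
    else
      echo "$svc: FAILED"
    fi
  done
}
```

Running `check_all` from a host that can resolve the `netgrimoire.com` names prints one status line per service.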
### Primary Use Cases
Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
Use these services to monitor the health and performance of NetGrimoire infrastructure components.
### NetGrimoire Integrations
This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
These services integrate with other NetGrimoire services, including Caddy, Uptime Kuma, and DIUN.
---
## Operations
### Monitoring
```bash
docker stack services monitoring
docker service logs -f monitoring_prometheus
```
Use `docker stack services monitoring` to list the stack's services and their replica counts. Use `docker service logs -f <service-name>` to follow a service's logs.
### Backups
Critical backups are required for Prometheus and Grafana data. Reconstructing from backup is possible but may require manual configuration.
Critical vs reconstructable `/DockerVol/` paths:
- Critical: `/prometheus/data`
- Reconstructable: `/grafana/data`, `/alertmanager/data`
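A minimal sketch of archiving a critical path, assuming `tar` and `gzip` are available on the host. The function name `backup_dir` and the destination `/backups/monitoring` are hypothetical. Archiving a live Prometheus data directory may capture an inconsistent state, so stopping the service first is the safer assumption.

```bash
# Hypothetical helper: archive one directory into a timestamped tarball.
# Usage: backup_dir <source-dir> <destination-dir>
backup_dir() {
  src="$1"
  dest="$2"
  [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
  mkdir -p "$dest" || return 1
  # Name the archive after the parent directory, e.g. "prometheus" for .../prometheus/data.
  archive="$dest/$(basename "$(dirname "$src")")-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C keeps paths inside the archive relative to the parent of the source directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}

# Example (destination path is an assumption):
# backup_dir /DockerVol/prometheus/data /backups/monitoring
```

The function prints the archive path on success, so a caller can copy it off-host in a later step.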
### Restore
```bash
@ -107,10 +110,10 @@ cd services/swarm/stack/monitoring
---
## Common Failures
| Failure Mode | Symptoms | Cause | Fix |
|-------------|----------|-------|------|
| Prometheus down | No metrics available in Grafana | Prometheus not scraped | Check Prometheus configuration and restart service |
| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
| Symptom | Cause | Fix |
|- **Prometheus not collecting metrics** | Insufficient disk space | Increase Prometheus storage size |
|- **Grafana not rendering dashboards** | Insecure configuration | Set `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` variables correctly |
|- **Alertmanager not sending alerts** | Incorrect configuration file | Update `alertmanager.yml` file |
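The disk-space cause in the first row can be checked before it becomes a failure. This is a sketch only: the function name `disk_below` and the 85 percent threshold are hypothetical.

```bash
# Hypothetical helper: succeed when the filesystem holding <path> is below <limit> percent used.
# Usage: disk_below <path> <limit-percent>
disk_below() {
  path="$1"
  limit="$2"
  # df -P prints a stable one-line-per-filesystem layout; field 5 of line 2 is the use percentage.
  used="$(df -P "$path" | awk 'NR==2 { gsub("%", "", $5); print $5 }')"
  [ -n "$used" ] || return 2
  [ "$used" -lt "$limit" ]
}

# Example (path and threshold are assumptions):
# disk_below /DockerVol/prometheus/data 85 || echo "Prometheus volume is nearly full"
```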
---
@ -118,14 +121,14 @@ cd services/swarm/stack/monitoring
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
<Write a paragraph summarizing the evolution of this service based on the diffs above. If no diffs available, note that this is the initial documentation.>
| 2026-04-09 | 8ca119ab | Initial documentation creation. |
| 2026-04-07 | 9f9ca1ad | Minor bug fixes and improvements. |
| 2026-04-07 | 71e3177f | Updated Prometheus and Grafana images to latest versions. |
| 2026-04-07 | 1df528ca | Added support for multi-arch images (Cadvisor and Node Exporter). |
| 2026-04-07 | af94e455 | Improved Caddy reverse proxy configuration for Blackbox Exporter. |
---
## Notes
- Generated by Gremlin on 2026-04-08T03:34:50.852Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections
- Generated by Gremlin on 2026-04-10T03:17:27.514Z
- Source: swarm/monitoring.yaml