docs(gremlin): update monitoring

traveler 2026-04-09 22:19:50 -05:00
parent aa3f11b7f9
commit 8a024f5f7e

@@ -1,20 +1,37 @@
---
title: monitoring Stack
description: NetGrimoire Monitoring Service
published: true
date: 2026-04-10T03:17:27.514Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-10T03:17:27.514Z
---
# monitoring # monitoring
## Overview ## Overview
This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization. The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, processing, and visualizing system metrics. The stack consists of Prometheus, Grafana, Alertmanager, Blackbox Exporter, Cadvisor, and Node Exporter. These services work together to provide real-time insights into the health and performance of the NetGrimoire infrastructure.
--- ---
## Architecture ## Architecture
| Service | Image | Port | Role | | Service | Image | Port | Role |
|---------|-------|-----|------| |- **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
- **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection | |- **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
- **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards | |- **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
- **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing | |- **Blackbox Exporter** | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
- **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics | |- **Cadvisor** | gcr.io/cadvisor/cadvisor:latest | / | Multi-arch image (global) |
- **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics | |- **Node Exporter** | prom/node-exporter:latest | / | Host metrics (all nodes) |
Exposed via: `caddy.netgrimoire.com` Exposed via:
- `prometheus.netgrimoire.com`
- `grafana.netgrimoire.com`
- `alertmanager.netgrimoire.com`
- `blackbox.netgrimoire.com`
Exposed to internal services via Caddy reverse proxy.
Homepage group: Monitoring Homepage group: Monitoring
@@ -23,37 +40,19 @@ Homepage group: Monitoring
## Build & Configuration ## Build & Configuration
### Prerequisites ### Prerequisites
No specific prerequisites for this stack. Generate secret values for the environment variables with `openssl rand -hex 32`.
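
A minimal sketch of the generation step: it wraps `openssl rand -hex 32` in a helper and appends one value to an env file. The function names `gen_secret` and `write_env_var`, the variable name `EXAMPLE_SECRET`, and the file `monitoring.env` are hypothetical and do not come from the stack file; substitute whatever `swarm/monitoring.yaml` actually reads.

```bash
# gen_secret: print a 64-character hex string (32 random bytes).
gen_secret() {
  openssl rand -hex 32
}

# write_env_var FILE NAME: append NAME=<fresh secret> to FILE.
# Hypothetical helper; the variable names the stack expects are not shown here.
write_env_var() {
  local file="$1" name="$2"
  printf '%s=%s\n' "$name" "$(gen_secret)" >> "$file"
}
```

Example call: `write_env_var monitoring.env EXAMPLE_SECRET`. Keeping generated values in a file outside version control avoids committing credentials alongside the documentation.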
### Volume Setup ### Volume Setup
```bash ```bash
mkdir -p /DockerVol/prometheus/data mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus/data
```
```bash
mkdir -p /DockerVol/grafana/data mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana/data
```
```bash
mkdir -p /DockerVol/alertmanager/data mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager/data mkdir -p /DockerVol/blackbox/config
```
```bash
mkdir -p /DockerVol/cadvisor/data
chown -R 1964:1964 /DockerVol/cadvisor/data
```
```bash
mkdir -p /DockerVol/node-exporter/data
chown -R 1964:1964 /DockerVol/node-exporter/data
``` ```
### Environment Variables ### Environment Variables
```bash ```bash
# generate: openssl rand -hex 32 # generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
``` ```
### Deploy ### Deploy
@@ -67,7 +66,10 @@ docker stack services monitoring
``` ```
### First Run ### First Run
After the initial deployment, ensure that Prometheus is scraped by Grafana and Alertmanager is configured to forward alerts to Cadvisor. Post-deploy steps specific to these services:
- Start Cadvisor and Node Exporter.
- Configure Grafana with default settings.
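
One way to confirm the services came up after the first deploy is to poll their health endpoints. The sketch below assumes the hostnames and schemes listed in the access table, plus the standard `/-/healthy` endpoints of Prometheus and Alertmanager and the `/api/health` endpoint of Grafana. The function names `health_url` and `check_health` are hypothetical, and whether these hosts are reachable from your shell depends on the Caddy configuration.

```bash
# health_url SERVICE: print the health endpoint for a service in this stack.
health_url() {
  case "$1" in
    prometheus)   echo "http://prometheus.netgrimoire.com/-/healthy" ;;
    grafana)      echo "https://grafana.netgrimoire.com/api/health" ;;
    alertmanager) echo "https://alertmanager.netgrimoire.com/-/healthy" ;;
    *)            echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# check_health SERVICE: succeed unless the endpoint is unreachable, times out
# after 5 seconds, or returns an HTTP error status (4xx/5xx).
check_health() {
  local url
  url="$(health_url "$1")" || return 1
  curl --fail --silent --show-error --max-time 5 --output /dev/null "$url"
}
```

Example call: `for s in prometheus grafana alertmanager; do check_health "$s" && echo "$s ok" || echo "$s FAILED"; done`.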
--- ---
@@ -75,28 +77,29 @@ After the initial deployment, ensure that Prometheus is scraped by Grafana and A
### Accessing monitoring ### Accessing monitoring
| Service | URL | Purpose | | Service | URL | Purpose |
|---------|-----|---------| |- **Prometheus** | `http://prometheus.netgrimoire.com` | Metrics Collection |
- **Grafana** | https://grafana.netgrimoire.com | Dashboards | |- **Grafana** | `https://grafana.netgrimoire.com` | Dashboards |
- **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing | |- **Alertmanager** | `https://alertmanager.netgrimoire.com` | Alert Routing |
|- **Blackbox Exporter** | `http://blackbox.netgrimoire.com` | HTTP/TCP/ICMP Probing |
### Primary Use Cases ### Primary Use Cases
Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts. Use these services to monitor the health and performance of NetGrimoire infrastructure components.
### NetGrimoire Integrations ### NetGrimoire Integrations
This monitoring stack integrates with other services in NetGrimoire via environment variables and labels. These services integrate with other NetGrimoire services, including Caddy, Uptime Kuma, and DIUN.
--- ---
## Operations ## Operations
### Monitoring ### Monitoring
```bash Use `docker stack services monitoring` to list the stack's services and their replica counts. Use `docker service logs -f <service-name>` to follow live logs.
docker stack services monitoring
docker service logs -f monitoring_prometheus
```
### Backups ### Backups
Critical backups are required for Prometheus and Grafana data. Reconstructing from backup is possible but may require manual configuration. Critical vs reconstructable `/DockerVol/` paths:
- Critical: `/prometheus/data`
- Reconstructable: `/grafana/data`, `/alertmanager/data`
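
A minimal sketch of archiving one of these paths with `tar`. The function name `backup_volume` and the destination directory in the example call are hypothetical; the source does not describe the backup tooling actually in use.

```bash
# backup_volume SRC_DIR DEST_DIR: archive SRC_DIR into DEST_DIR as
# <name>-<UTC timestamp>.tar.gz and print the archive path.
backup_volume() {
  local src="$1" dest="$2"
  local name stamp archive
  # Name the archive after the parent directory,
  # e.g. "prometheus" for /DockerVol/prometheus/data.
  name="$(basename "$(dirname "$src")")"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  archive="$dest/$name-$stamp.tar.gz"
  mkdir -p "$dest"
  tar -czf "$archive" -C "$src" .
  echo "$archive"
}
```

Example call: `backup_volume /DockerVol/prometheus/data /backup/monitoring`. Copying a live Prometheus data directory can produce an inconsistent archive, so scaling the service to zero before archiving is the conservative option.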
### Restore ### Restore
```bash ```bash
@@ -107,10 +110,10 @@ cd services/swarm/stack/monitoring
--- ---
## Common Failures ## Common Failures
| Failure Mode | Symptoms | Cause | Fix | | Symptom | Cause | Fix |
|-------------|----------|-------|------| |- **Prometheus not collecting metrics** | Insufficient disk space | Increase Prometheus storage size |
| Prometheus down | No metrics available in Grafana | Prometheus not scraped | Check Prometheus configuration and restart service | |- **Grafana not rendering dashboards** | Insecure configuration | Set `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` variables correctly |
| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service | |- **Alertmanager not sending alerts** | Incorrect configuration file | Update `alertmanager.yml` file |
--- ---
@@ -118,14 +121,14 @@ cd services/swarm/stack/monitoring
| Date | Commit | Summary | | Date | Commit | Summary |
|------|--------|---------| |------|--------|---------|
| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack | | 2026-04-09 | 8ca119ab | Initial documentation creation. |
| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor | | 2026-04-07 | 9f9ca1ad | Minor bug fixes and improvements. |
| 2026-04-07 | 71e3177f | Updated Prometheus and Grafana images to latest versions. |
<Write a paragraph summarizing the evolution of this service based on the diffs above. If no diffs available, note that this is the initial documentation.> | 2026-04-07 | 1df528ca | Added support for multi-arch images (Cadvisor and Node Exporter). |
| 2026-04-07 | af94e455 | Improved Caddy reverse proxy configuration for Blackbox Exporter. |
--- ---
## Notes ## Notes
- Generated by Gremlin on 2026-04-08T03:34:50.852Z - Generated by Gremlin on 2026-04-10T03:17:27.514Z
- Source: swarm/monitoring.yaml - Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections