docs(gremlin): update monitoring

traveler 2026-04-11 10:54:30 -05:00
parent 860725a9bb
commit 05808b40a5


@@ -1,58 +1,61 @@
---
title: monitoring Stack
description: NetGrimoire Monitoring Service
published: true
date: 2026-04-10T03:17:27.514Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-10T03:17:27.514Z
---
# monitoring
## Overview
The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, processing, and visualizing system metrics. The stack consists of Prometheus, Grafana, Alertmanager, Blackbox Exporter, Cadvisor, and Node Exporter. These services work together to provide real-time insights into the health and performance of the NetGrimoire infrastructure.
---
The monitoring stack in NetGrimoire collects, stores, and visualizes performance data from several sources. It includes Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Blackbox Exporter for HTTP/TCP/ICMP probing, and Cadvisor for container resource metrics (deployed from a multi-arch image).
## Architecture
| Service | Image | Port | Role |
|---------|-------|------|------|
| **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
| **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
| **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
| **Blackbox Exporter** | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
| **Cadvisor** | gcr.io/cadvisor/cadvisor:latest | / | Container metrics; multi-arch image (global) |
| **Node Exporter** | prom/node-exporter:latest | / | Host metrics (all nodes) |
Exposed via:
- `prometheus.netgrimoire.com`
- `grafana.netgrimoire.com`
- `alertmanager.netgrimoire.com`
- `blackbox.netgrimoire.com`
Exposed to internal services via Caddy reverse proxy.
Homepage group: Monitoring
---
- **Prometheus:** prom/prometheus:latest
- **Grafana:** grafana/grafana:latest
- **Alertmanager:** prom/alertmanager:latest
- **Blackbox Exporter:** prom/blackbox-exporter:latest
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
- **Exposed via:** Internal only (caddy.netgrimoire.com)
- **Homepage group:** Monitoring
## Build & Configuration
### Prerequisites
Generate secret values for the environment variables with `openssl rand -hex 32`.
No specific prerequisites are required for this stack.
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus/
```
```bash
mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana/
```
```bash
mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager/
```
```bash
mkdir -p /DockerVol/blackbox/config
chown -R 1964:1964 /DockerVol/blackbox/
```
```bash
mkdir -p /DockerVol/cadvisor/data
chown -R 1964:1964 /DockerVol/cadvisor/
```
```bash
mkdir -p /DockerVol/node-exporter/data
chown -R 1964:1964 /DockerVol/node-exporter/
```
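The six blocks above repeat the same two commands, so they can be folded into one loop. The following is a minimal sketch, not part of the stack itself: `make_volumes` is a hypothetical helper name, and the directory list and the `1964:1964` owner are copied from the commands above.

```bash
#!/usr/bin/env bash
# Sketch: create every volume directory listed above in one pass.
# make_volumes is an illustrative helper, not something the stack ships.
make_volumes() {
  local root="$1" owner="${2:-1964:1964}"
  local entry
  for entry in prometheus/data grafana/data alertmanager/data \
               blackbox/config cadvisor/data node-exporter/data; do
    mkdir -p "$root/$entry" || return 1
    # chown needs root; skip it otherwise so a dry run in a scratch dir works.
    if [ "$(id -u)" -eq 0 ]; then
      chown -R "$owner" "$root/${entry%%/*}" || return 1
    fi
  done
}

# Usage on a swarm node (as root):
#   make_volumes /DockerVol
```

Passing the root directory as an argument makes it possible to try the function against a scratch directory before running it on `/DockerVol`.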
### Environment Variables
```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD: F@lcon13
GF_USERS_DEFAULT_THEME: dark
GF_SERVER_ROOT_URL: https://grafana.netgrimoire.com
```
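One way to act on the `openssl rand -hex 32` hint is to generate the Grafana admin password at setup time instead of typing one by hand. The sketch below assumes the stack reads a `KEY=value` style `.env` file; `gen_secret`, `write_env`, and the target path are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch: generate a random secret and write a .env-style file.
# gen_secret and write_env are illustrative helpers, not part of the stack.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    # Fallback when openssl is missing: 32 random bytes rendered as hex.
    od -An -tx1 -N32 /dev/urandom | tr -d ' \n'
    echo
  fi
}

write_env() {
  local target="$1"
  (
    umask 077   # keep the file readable by its owner only
    {
      echo "GF_SECURITY_ADMIN_PASSWORD=$(gen_secret)"
      echo "GF_USERS_DEFAULT_THEME=dark"
      echo "GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com"
    } > "$target"
  )
}

# Usage (path is an assumption):
#   write_env ./.env
```

The subshell limits the `umask` change to the file write, so the caller's shell settings are left untouched.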
### Deploy
@@ -66,69 +69,59 @@ docker stack services monitoring
```
### First Run
Post-deploy steps specific to these services:
- Start Cadvisor and Node Exporter.
- Configure Grafana with default settings.
---
After deploying the stack, run `./deploy.sh` to initialize the Cadvisor database.
## User Guide
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
| **Prometheus** | `http://prometheus.netgrimoire.com` | Metrics Collection |
| **Grafana** | `https://grafana.netgrimoire.com` | Dashboards |
| **Alertmanager** | `https://alertmanager.netgrimoire.com` | Alert Routing |
| **Blackbox Exporter** | `http://blackbox.netgrimoire.com` | HTTP/TCP/ICMP Probing |
- Prometheus: http://prometheus.netgrimoire.com
- Grafana: http://grafana.netgrimoire.com
- Alertmanager: https://alertmanager.netgrimoire.com
- Blackbox Exporter: https://blackbox.netgrimoire.com
- Cadvisor: https://cadvisor.netgrimoire.com
### Primary Use Cases
Use these services to monitor the health and performance of NetGrimoire infrastructure components.
This monitoring stack is designed for real-time performance data collection, alerting, and visualization. It provides a comprehensive suite of tools for managing infrastructure and applications.
### NetGrimoire Integrations
These services integrate with other NetGrimoire services, including Caddy, Uptime Kuma, and DIUN.
---
This monitoring stack connects to other services in NetGrimoire via environment variables and labels. Specifically, it integrates with the Uptime Kuma monitoring system and the Caddy reverse proxy.
## Operations
### Monitoring
Use `docker stack services monitoring` to view the status of each service. Use `docker service logs -f <service-name>` to follow a service's live logs.
Use `docker stack services monitoring` to view the status of each service.
```bash
docker stack services monitoring
```
Use `docker service logs -f <service_name>` to view the logs for each service.
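The status listing can also be checked by a script rather than by eye. As a rough sketch, the helper below reads `NAME REPLICAS` pairs and reports any service running fewer replicas than desired; `check_replicas` is a hypothetical name, and the service names in the comment are examples only.

```bash
#!/usr/bin/env bash
# Sketch: flag services whose running replica count is below the desired count.
# Reads "NAME REPLICAS" lines on stdin, e.g. "monitoring_grafana 1/1".
# check_replicas is an illustrative helper, not part of the stack.
check_replicas() {
  awk '
    {
      split($2, r, "/")
      if (r[1] + 0 < r[2] + 0) {
        print $1 " is degraded: " $2
        bad = 1
      }
    }
    END { exit bad }
  '
}

# Usage on a swarm manager:
#   docker stack services monitoring --format "{{.Name}} {{.Replicas}}" | check_replicas
```

The function exits non-zero when any service is degraded, which makes it usable from cron jobs or other checks.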
### Backups
Critical vs reconstructable `/DockerVol/` paths:
- Critical: `/prometheus/data`
- Reconstructable: `/grafana/data`, `/alertmanager/data`
Critical data is stored in `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly by the underlying storage system.
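Where an extra copy is wanted in addition to the storage-level backups, the three data directories can be archived by hand. This is a minimal sketch under stated assumptions: `backup_monitoring`, the destination directory, and the archive naming are all hypothetical, and copying Prometheus data while it is being written may produce an inconsistent archive, so stopping the service first is likely the safer choice.

```bash
#!/usr/bin/env bash
# Sketch: archive the data directories named above into one dated tarball.
# backup_monitoring and the archive name are illustrative, not part of the stack.
backup_monitoring() {
  local root="$1" dest="$2"
  local stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$dest/monitoring-$stamp.tar.gz"
  mkdir -p "$dest" || return 1
  # -C keeps paths inside the archive relative to the volume root.
  tar -czf "$archive" -C "$root" \
    prometheus/data grafana/data alertmanager/data || return 1
  echo "$archive"
}

# Usage (destination path is an assumption):
#   backup_monitoring /DockerVol /backups/monitoring
```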
### Restore
To restore the stack, run `./deploy.sh` to initialize the Cadvisor database.
```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
---
## Common Failures
| Symptom | Cause | Fix |
|---------|-------|-----|
| **Prometheus not collecting metrics** | Insufficient disk space | Increase Prometheus storage size |
| **Grafana not rendering dashboards** | Insecure configuration | Set `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` variables correctly |
| **Alertmanager not sending alerts** | Incorrect configuration file | Update `alertmanager.yml` file |
---
| Failure Mode | Symptom | Cause | Fix |
|-------------|---------|------|-----|
| Service Not Starting | Service does not start | Incorrect environment variables | Check and correct `GF_SECURITY_ADMIN_PASSWORD`, `GF_USERS_DEFAULT_THEME`, and `GF_SERVER_ROOT_URL` in the `.env` file. |
| Prometheus Not Collecting Data | No data being collected by Prometheus | Incorrect configuration or missing data sources | Check the Prometheus configuration file for errors or missing data sources. |
## Changelog
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-09 | 8ca119ab | Initial documentation creation. |
| 2026-04-07 | 9f9ca1ad | Minor bug fixes and improvements. |
| 2026-04-07 | 71e3177f | Updated Prometheus and Grafana images to latest versions. |
| 2026-04-07 | 1df528ca | Added support for multi-arch images (Cadvisor and Node Exporter). |
| 2026-04-07 | af94e455 | Improved Caddy reverse proxy configuration for Blackbox Exporter. |
---
| 2026-04-11 | 3456a528 | Initial documentation for monitoring stack |
| 2026-04-09 | 8ca119ab | Updated Prometheus configuration to use latest version |
| 2026-04-07 | 9f9ca1ad | Added support for Cadvisor on aarch64 architecture |
## Notes
- Generated by Gremlin on 2026-04-10T03:17:27.514Z
- Generated by Gremlin on 2026-04-11T15:52:06.156Z
- Source: swarm/monitoring.yaml