docs(gremlin): update monitoring
parent aa3f11b7f9
commit 8a024f5f7e
1 changed file with 53 additions and 50 deletions
@@ -1,20 +1,37 @@
---
title: monitoring Stack
description: NetGrimoire Monitoring Service
published: true
date: 2026-04-10T03:17:27.514Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-10T03:17:27.514Z

---

# monitoring

## Overview
This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, processing, and visualizing system metrics. The stack consists of Prometheus, Grafana, Alertmanager, Blackbox Exporter, Cadvisor, and Node Exporter. These services work together to provide real-time insights into the health and performance of the NetGrimoire infrastructure.

---

## Architecture
| Service | Image | Port | Role |
|---------|-------|-----|------|
- **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection |
- **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards |
- **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing |
- **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
- **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics |
|- **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
|- **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
|- **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
|- **Blackbox Exporter** | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
|- **Cadvisor** | gcr.io/cadvisor/cadvisor:latest | / | Multi-arch image (global) |
|- **Node Exporter** | prom/node-exporter:latest | / | Host metrics (all nodes) |

Exposed via: `caddy.netgrimoire.com`
Exposed via:
- `prometheus.netgrimoire.com`
- `grafana.netgrimoire.com`
- `alertmanager.netgrimoire.com`
- `blackbox.netgrimoire.com`

Exposed to internal services via Caddy reverse proxy.

Homepage group: Monitoring

@@ -23,37 +40,19 @@ Homepage group: Monitoring
## Build & Configuration

### Prerequisites
No specific prerequisites for this stack.
Generate environment variables using `openssl rand -hex 32`.

### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus/data
```
```bash
mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana/data
```
```bash
mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager/data
```
```bash
mkdir -p /DockerVol/cadvisor/data
chown -R 1964:1964 /DockerVol/cadvisor/data
```
```bash
mkdir -p /DockerVol/node-exporter/data
chown -R 1964:1964 /DockerVol/node-exporter/data
mkdir -p /DockerVol/blackbox/config
```
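
The per-service commands above repeat one pattern, so they could be collapsed into a single helper. The following is a minimal sketch, not part of the stack: `make_volume_dirs` is a hypothetical function name, and applying the owner to `blackbox/config` is an assumption (the commands above only create that directory).

```bash
# Sketch: create the bind-mount directories from the Volume Setup commands in one pass.
# make_volume_dirs is a hypothetical helper, not something shipped with the stack.
make_volume_dirs() {
  local base="$1"        # base directory, e.g. /DockerVol
  local owner="${2:-}"   # optional owner such as 1964:1964; empty skips chown
  local dir
  for dir in prometheus/data grafana/data alertmanager/data \
             cadvisor/data node-exporter/data blackbox/config; do
    mkdir -p "$base/$dir" || return 1
    if [ -n "$owner" ]; then
      # Assumption: the same owner is applied to every directory, including blackbox/config.
      chown -R "$owner" "$base/$dir" || return 1
    fi
  done
}
```

A typical call would be `make_volume_dirs /DockerVol 1964:1964`, run as a user allowed to change ownership.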

### Environment Variables
```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
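
The `# generate: openssl rand -hex 32` comment suggests the secret is meant to be random rather than typed by hand. A minimal sketch of that step follows; `gen_env_secret` is a hypothetical helper name, and only the `openssl rand -hex 32` call comes from the document.

```bash
# Sketch: emit NAME=<random hex> in env-file form.
# gen_env_secret is a hypothetical helper; it relies on the openssl CLI being installed.
gen_env_secret() {
  local name="$1"   # variable name, e.g. GF_SECURITY_ADMIN_PASSWORD
  # 32 random bytes, printed as 64 hexadecimal characters
  printf '%s=%s\n' "$name" "$(openssl rand -hex 32)"
}
```

For example, `gen_env_secret GF_SECURITY_ADMIN_PASSWORD >> .env` would append a fresh value, where `.env` is an assumed file name rather than one the stack is known to read.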

### Deploy
@@ -67,7 +66,10 @@ docker stack services monitoring
```

### First Run
After the initial deployment, ensure that Prometheus is scraped by Grafana and Alertmanager is configured to forward alerts to Cadvisor.
Post-deploy steps specific to these services:

- Start Cadvisor and Node Exporter.
- Configure Grafana with default settings.

---

@@ -75,28 +77,29 @@ After the initial deployment, ensure that Prometheus is scraped by Grafana and A

### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
- **Grafana** | https://grafana.netgrimoire.com | Dashboards |
- **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing |
|- **Prometheus** | `http://prometheus.netgrimoire.com` | Metrics Collection |
|- **Grafana** | `https://grafana.netgrimoire.com` | Dashboards |
|- **Alertmanager** | `https://alertmanager.netgrimoire.com` | Alert Routing |
|- **Blackbox Exporter** | `http://blackbox.netgrimoire.com` | HTTP/TCP/ICMP Probing |

### Primary Use Cases
Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
Use these services to monitor the health and performance of NetGrimoire infrastructure components.

### NetGrimoire Integrations
This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
These services integrate with other NetGrimoire services, including Caddy, Uptime Kuma, and DIUN.

---

## Operations

### Monitoring
```bash
docker stack services monitoring
docker service logs -f monitoring_prometheus
```
Use `docker stack services monitoring` to list the stack's services and their replica status. Use `docker service logs -f <service-name>` to follow live logs.
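
Docker Swarm names each service in a stack `<stack>_<service>`, so the name passed to `docker service logs` can be derived from the stack name. A minimal sketch, assuming the service key in `swarm/monitoring.yaml` is `prometheus`; both function names are hypothetical.

```bash
# Sketch: derive the Swarm service name for a stack member and follow its logs.
stack_service_name() {
  # Swarm prefixes each service with its stack name: <stack>_<service>
  printf '%s_%s\n' "$1" "$2"
}

follow_service_logs() {
  # Requires the docker CLI on a Swarm manager node.
  docker service logs -f "$(stack_service_name "$1" "$2")"
}
```

With these definitions, `follow_service_logs monitoring prometheus` runs `docker service logs -f monitoring_prometheus`.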

### Backups
Critical backups are required for Prometheus and Grafana data. Reconstructing from backup is possible but may require manual configuration.
Critical vs reconstructable `/DockerVol/` paths:

- Critical: `/prometheus/data`
- Reconstructable: `/grafana/data`, `/alertmanager/data`
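
One way to back up the critical path listed above is a dated tarball. This is a sketch under stated assumptions: `backup_volume` is a hypothetical helper, the archive naming scheme is invented for illustration, and copying a live Prometheus data directory may capture an inconsistent state, so stopping the service first is the safer assumption.

```bash
# Sketch: archive one volume directory into a dated tarball and print its path.
# backup_volume is a hypothetical helper; the naming scheme is illustrative only.
backup_volume() {
  local src="$1"    # directory to archive, e.g. /DockerVol/prometheus/data
  local dest="$2"   # directory that receives the archive
  local name
  # Name the archive after the parent directory (the service) plus today's date.
  name="$(basename "$(dirname "$src")")-$(date +%Y%m%d).tar.gz"
  mkdir -p "$dest" || return 1
  tar -czf "$dest/$name" -C "$src" . || return 1
  printf '%s\n' "$dest/$name"
}
```

For example, `backup_volume /DockerVol/prometheus/data /backups` would write `/backups/prometheus-<date>.tar.gz`, where `/backups` is an assumed destination.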

### Restore
```bash
@@ -107,10 +110,10 @@ cd services/swarm/stack/monitoring
---

## Common Failures
| Failure Mode | Symptoms | Cause | Fix |
|-------------|----------|-------|------|
| Prometheus down | No metrics available in Grafana | Prometheus not scraped | Check Prometheus configuration and restart service |
| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
| Symptom | Cause | Fix |
|- **Prometheus not collecting metrics** | Insufficient disk space | Increase Prometheus storage size |
|- **Grafana not rendering dashboards** | Insecure configuration | Set `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` variables correctly |
|- **Alertmanager not sending alerts** | Incorrect configuration file | Update `alertmanager.yml` file |
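
Before working through the table above, it can help to confirm which service is actually unreachable. The sketch below probes each service's built-in health endpoint (`/-/healthy` for Prometheus, Alertmanager, and Blackbox Exporter; `/api/health` for Grafana). The hostnames and schemes are taken from the access table in this document; the function names are hypothetical, and whether these paths are reachable through the Caddy proxy is an assumption.

```bash
# Sketch: map a service to its health URL, then probe it with curl.
health_url() {
  case "$1" in
    prometheus)   printf '%s\n' "http://prometheus.netgrimoire.com/-/healthy" ;;
    grafana)      printf '%s\n' "https://grafana.netgrimoire.com/api/health" ;;
    alertmanager) printf '%s\n' "https://alertmanager.netgrimoire.com/-/healthy" ;;
    blackbox)     printf '%s\n' "http://blackbox.netgrimoire.com/-/healthy" ;;
    *)            return 1 ;;   # unknown service name
  esac
}

check_service() {
  local url
  url="$(health_url "$1")" || { echo "unknown service: $1" >&2; return 2; }
  # -f fails on HTTP errors, -sS stays quiet except for errors, 5 s timeout
  if curl -fsS --max-time 5 -o /dev/null "$url"; then
    echo "$1: ok"
  else
    echo "$1: FAILED ($url)"
    return 1
  fi
}
```

Running `check_service prometheus` from a host that can resolve the NetGrimoire hostnames prints either `prometheus: ok` or a failure line with the probed URL.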

---

@@ -118,14 +121,14 @@ cd services/swarm/stack/monitoring

| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |

<Write a paragraph summarizing the evolution of this service based on the diffs above. If no diffs available, note that this is the initial documentation.>
| 2026-04-09 | 8ca119ab | Initial documentation creation. |
| 2026-04-07 | 9f9ca1ad | Minor bug fixes and improvements. |
| 2026-04-07 | 71e3177f | Updated Prometheus and Grafana images to latest versions. |
| 2026-04-07 | 1df528ca | Added support for multi-arch images (Cadvisor and Node Exporter). |
| 2026-04-07 | af94e455 | Improved Caddy reverse proxy configuration for Blackbox Exporter. |

---

## Notes
- Generated by Gremlin on 2026-04-08T03:34:50.852Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections
- Generated by Gremlin on 2026-04-10T03:17:27.514Z
- Source: swarm/monitoring.yaml