docs(gremlin): update monitoring
parent aa3f11b7f9
commit 8a024f5f7e
1 changed file with 53 additions and 50 deletions
@@ -1,20 +1,37 @@
+---
+title: monitoring Stack
+description: NetGrimoire Monitoring Service
+published: true
+date: 2026-04-10T03:17:27.514Z
+tags: docker,swarm,monitoring,netgrimoire
+editor: markdown
+dateCreated: 2026-04-10T03:17:27.514Z
+
+---
+
 # monitoring
 
 ## Overview
-This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
+The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, processing, and visualizing system metrics. The stack consists of Prometheus, Grafana, Alertmanager, Blackbox Exporter, Cadvisor, and Node Exporter. These services work together to provide real-time insights into the health and performance of the NetGrimoire infrastructure.
 
 ---
 
 ## Architecture
 | Service | Image | Port | Role |
-|---------|-------|-----|------|
-- **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection |
-- **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards |
-- **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing |
-- **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
-- **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics |
+|- **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
+|- **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
+|- **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
+|- **Blackbox Exporter** | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
+|- **Cadvisor** | gcr.io/cadvisor/cadvisor:latest | / | Multi-arch image (global) |
+|- **Node Exporter** | prom/node-exporter:latest | / | Host metrics (all nodes) |
 
-Exposed via: `caddy.netgrimoire.com`
+Exposed via:
+- `prometheus.netgrimoire.com`
+- `grafana.netgrimoire.com`
+- `alertmanager.netgrimoire.com`
+- `blackbox.netgrimoire.com`
+
+Exposed to internal services via Caddy reverse proxy.
 
 Homepage group: Monitoring
 
@@ -23,37 +40,19 @@ Homepage group: Monitoring
 ## Build & Configuration
 
 ### Prerequisites
-No specific prerequisites for this stack.
+Generate environment variables using `openssl rand -hex 32`.
 
 ### Volume Setup
 ```bash
 mkdir -p /DockerVol/prometheus/data
-chown -R 1964:1964 /DockerVol/prometheus/data
-```
-```bash
 mkdir -p /DockerVol/grafana/data
-chown -R 1964:1964 /DockerVol/grafana/data
-```
-```bash
 mkdir -p /DockerVol/alertmanager/data
-chown -R 1964:1964 /DockerVol/alertmanager/data
-```
-```bash
-mkdir -p /DockerVol/cadvisor/data
-chown -R 1964:1964 /DockerVol/cadvisor/data
-```
-```bash
-mkdir -p /DockerVol/node-exporter/data
-chown -R 1964:1964 /DockerVol/node-exporter/data
+mkdir -p /DockerVol/blackbox/config
 ```
 
 ### Environment Variables
 ```bash
 # generate: openssl rand -hex 32
-GF_SECURITY_ADMIN_PASSWORD=F@lcon13
-GF_USERS_DEFAULT_THEME=dark
-GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
-GF_FEATURE_TOGGLES_ENABLE=publicDashboards
 ```
 
 ### Deploy
@@ -67,7 +66,10 @@ docker stack services monitoring
 ```
 
 ### First Run
-After the initial deployment, ensure that Prometheus is scraped by Grafana and Alertmanager is configured to forward alerts to Cadvisor.
+Post-deploy steps specific to these services:
+
+- Start Cadvisor and Node Exporter.
+- Configure Grafana with default settings.
 
 ---
 
@@ -75,28 +77,29 @@ After the initial deployment, ensure that Prometheus is scraped by Grafana and A
 
 ### Accessing monitoring
 | Service | URL | Purpose |
-|---------|-----|---------|
-- **Grafana** | https://grafana.netgrimoire.com | Dashboards |
-- **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing |
+|- **Prometheus** | `http://prometheus.netgrimoire.com` | Metrics Collection |
+|- **Grafana** | `https://grafana.netgrimoire.com` | Dashboards |
+|- **Alertmanager** | `https://alertmanager.netgrimoire.com` | Alert Routing |
+|- **Blackbox Exporter** | `http://blackbox.netgrimoire.com` | HTTP/TCP/ICMP Probing |
 
 ### Primary Use Cases
-Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
+Use these services to monitor the health and performance of NetGrimoire infrastructure components.
 
 ### NetGrimoire Integrations
-This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
+These services integrate with other NetGrimoire services, including Caddy, Uptime Kuma, and DIUN.
 
 ---
 
 ## Operations
 
 ### Monitoring
-```bash
-docker stack services monitoring
-docker service logs -f monitoring prometheus
-```
+Use `docker stack services monitoring` to view service logs. Use `docker logs -f <service-name>` to view live logs.
 
 ### Backups
-Critical backups are required for Prometheus and Grafana data. Reconstructing from backup is possible but may require manual configuration.
+Critical vs reconstructable `/DockerVol/` paths:
+
+- Critical: `/prometheus/data`
+- Reconstructable: `/grafana/data`, `/alertmanager/data`
 
 ### Restore
 ```bash
@@ -107,10 +110,10 @@ cd services/swarm/stack/monitoring
 ---
 
 ## Common Failures
-| Failure Mode | Symptoms | Cause | Fix |
-|-------------|----------|-------|------|
-| Prometheus down | No metrics available in Grafana | Prometheus not scraped | Check Prometheus configuration and restart service |
-| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
+| Symptom | Cause | Fix |
+|- **Prometheus not collecting metrics** | Insufficient disk space | Increase Prometheus storage size |
+|- **Grafana not rendering dashboards** | Insecure configuration | Set `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` variables correctly |
+|- **Alertmanager not sending alerts** | Incorrect configuration file | Update `alertmanager.yml` file |
 
 ---
 
@@ -118,14 +121,14 @@ cd services/swarm/stack/monitoring
 
 | Date | Commit | Summary |
 |------|--------|---------|
-| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
-| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
-
-<Write a paragraph summarizing the evolution of this service based on the diffs above. If no diffs available, note that this is the initial documentation.>
+| 2026-04-09 | 8ca119ab | Initial documentation creation. |
+| 2026-04-07 | 9f9ca1ad | Minor bug fixes and improvements. |
+| 2026-04-07 | 71e3177f | Updated Prometheus and Grafana images to latest versions. |
+| 2026-04-07 | 1df528ca | Added support for multi-arch images (Cadviser and Node Exporter). |
+| 2026-04-07 | af94e455 | Improved Caddy reverse proxy configuration for Blackbox Exporter. |
 
 ---
 
 ## Notes
-- Generated by Gremlin on 2026-04-08T03:34:50.852Z
+- Generated by Gremlin on 2026-04-10T03:17:27.514Z
 - Source: swarm/monitoring.yaml
-- Review User Guide and Changelog sections