Netgrimoire/Watch-Grimoire/Monitoring/Monitoring-Config.md
2026-04-12 09:53:51 -05:00

---
title: Monitoring Stack
description: NetGrimoire Monitoring Stack Documentation
published: true
date: 2026-04-12T01:10:17.109Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-12T01:10:17.109Z
---
# monitoring
## Overview
This stack provides monitoring for NetGrimoire. It consists of five services: Prometheus collects and stores metrics, Grafana renders dashboards, Alertmanager routes alerts, Blackbox Exporter performs HTTP/TCP/ICMP probing, and cAdvisor provides host-level container metrics.
---
## Architecture
| Service | Image | Port | Role |
|---------|-------|------|------|
| Prometheus | prom/prometheus:latest | 9090 | Metrics Collection |
| Grafana | grafana/grafana:latest | 3000 | Dashboards |
| Alertmanager | prom/alertmanager:latest | 9093 | Alert Routing |
| Blackbox Exporter | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
| cAdvisor | gcr.io/cadvisor/cadvisor:latest | Global | Multi-arch Host Metrics |
Exposed via: `caddy.netgrimoire.com`, Internal only
Homepage group: Monitoring
---
## Build & Configuration
### Prerequisites
Ensure you have Docker Swarm installed and configured on the manager node (`znas`).
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
mkdir -p /DockerVol/grafana/data
mkdir -p /DockerVol/alertmanager/data
mkdir -p /DockerVol/blackbox/config
chown -R 1964:1964 /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/blackbox/config
```
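The same setup can be written as a loop so the directory list lives in one place. This is a minimal sketch, not part of the stack: the function name `setup_volumes` and its arguments are hypothetical, and the `1964:1964` owner is taken from the commands above. The `chown` step is skipped when not running as root so the function can be tried against a scratch directory.

```bash
# Sketch: create the stack's bind-mount directories under a base path.
# setup_volumes and its arguments are illustrative names, not part of the stack.
setup_volumes() {
  local base="${1:-/DockerVol}" owner="${2:-1964:1964}" dir
  for dir in prometheus/data grafana/data alertmanager/data blackbox/config; do
    mkdir -p "$base/$dir" || return 1
    # chown needs root; skip it otherwise so a trial run in a scratch directory still works.
    if [ "$(id -u)" -eq 0 ]; then
      chown -R "$owner" "$base/$dir" || return 1
    fi
  done
}
```

Calling `setup_volumes` with no arguments targets `/DockerVol` with the owner used above.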
### Environment Variables
```bash
# Generate a value with: openssl rand -hex 32 (do not commit a real password)
GF_SECURITY_ADMIN_PASSWORD=replace-with-generated-secret
GF_SECURITY_ADMIN_USER=admin
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
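Before deploying, it can help to confirm that the `.env` file actually defines the values Grafana needs. The helper below is a sketch under assumptions: `missing_env_vars` is a hypothetical name, and treating these three variables as required is a judgment call based on the list above rather than something the stack enforces.

```bash
# Sketch: report which Grafana variables are missing or empty in an env file.
# missing_env_vars is a hypothetical helper; the variable names come from the list above.
missing_env_vars() {
  local file="$1" name
  for name in GF_SECURITY_ADMIN_USER GF_SECURITY_ADMIN_PASSWORD GF_SERVER_ROOT_URL; do
    # A variable counts as set only when it has a non-empty value.
    grep -Eq "^${name}=.+" "$file" || echo "$name"
  done
}
```

Running `missing_env_vars .env` prints one variable name per line and prints nothing when all three are present.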
### Deploy
```bash
cd services/swarm/stack/monitoring
set -a && source .env && set +a
docker stack config --compose-file monitoring-stack.yml > resolved.yml
docker stack deploy --compose-file resolved.yml monitoring
rm resolved.yml
docker stack services monitoring
```
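The final `docker stack services monitoring` call shows a replica count such as `1/1` per service. As a sketch, that output can be filtered to show only services that have not converged yet. The helper name `unconverged_services` is hypothetical, and the input format assumes the `--format` template shown in the comment.

```bash
# Sketch: list services whose running replica count differs from the desired count.
# Reads "NAME CURRENT/DESIRED" lines on stdin; unconverged_services is a hypothetical helper.
# Intended use:
#   docker stack services monitoring --format '{{.Name}} {{.Replicas}}' | unconverged_services
unconverged_services() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}
```

Empty output means every service reports as many running tasks as it wants.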
### First Run
After deploying, no manual start-up is needed: Swarm launches each process inside its own container. Confirm that every task reached the `Running` state:
```bash
docker stack ps monitoring
# Process flags recorded for this stack, for reference (passed to the containers, not run by hand):
#   prometheus   --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
#   alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager
```
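Because Prometheus is started with `--web.enable-lifecycle`, its configuration can be reloaded over HTTP instead of restarting the container. A minimal sketch follows. The function name `reload_prometheus` and the `PROM_URL` variable are assumptions, and the default host is the Prometheus URL listed in the User Guide.

```bash
# Sketch: ask Prometheus to re-read its configuration without a restart.
# The /-/reload endpoint only answers when --web.enable-lifecycle is set.
# reload_prometheus and PROM_URL are hypothetical names.
reload_prometheus() {
  local base="${1:-${PROM_URL:-http://prometheus.netgrimoire.com:9090}}"
  # Strip a trailing slash from the base URL before appending the endpoint path.
  curl --fail --silent --show-error --request POST "${base%/}/-/reload"
}
```

A non-zero exit status indicates that the reload request was rejected or the endpoint was unreachable.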
---
## User Guide
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
| Prometheus | http://prometheus.netgrimoire.com:9090 | Metrics collection |
| Grafana | https://grafana.netgrimoire.com:3000 | Dashboards |
| Alertmanager | https://alertmanager.netgrimoire.com:9093 | Alert routing |
### Primary Use Cases
Use Prometheus to collect metrics from services in NetGrimoire, Grafana to view them on dashboards, and Alertmanager to route the resulting alerts.
### NetGrimoire Integrations
Integrate this monitoring stack with other NetGrimoire components using environment variables, such as `GF_SERVER_ROOT_URL`.
---
## Operations
### Monitoring
```bash
docker stack services monitoring
# Monitor Prometheus for errors and performance issues
```
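Beyond the replica counts, each service exposes an HTTP health endpoint that can be polled. The sketch below assumes the URLs from the User Guide table are reachable from where it runs. `health_urls` and `check_health` are hypothetical names, while `/-/healthy` (Prometheus, Alertmanager) and `/api/health` (Grafana) are the endpoints those projects document.

```bash
# Sketch: probe each service's health endpoint and report the ones that fail.
# Hostnames and ports come from the User Guide table; adjust them if Caddy exposes different URLs.
health_urls() {
  cat <<'EOF'
http://prometheus.netgrimoire.com:9090/-/healthy
https://grafana.netgrimoire.com:3000/api/health
https://alertmanager.netgrimoire.com:9093/-/healthy
EOF
}

check_health() {
  local url failed=0
  for url in $(health_urls); do
    if curl --fail --silent --max-time 5 --output /dev/null "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  # Non-zero exit status when at least one endpoint failed.
  return "$failed"
}
```

`check_health` prints one line per endpoint, so it can be run by hand or from a cron job that alerts on a non-zero exit status.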
### Backups
Critical: back up the Prometheus, Grafana, and Alertmanager data directories and the Blackbox Exporter config under `/DockerVol`. Reconstructable: cAdvisor has no volume in the setup above, so it has nothing to back up.
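One way to capture the directories created under Volume Setup is a single dated tarball. This is a sketch under assumptions: `backup_volumes`, its arguments, and the archive name are hypothetical. Copying the Prometheus data directory while the service is writing to it may produce an inconsistent archive, so stopping the service first is the safer choice.

```bash
# Sketch: archive the stack's bind-mount directories into one dated tarball.
# backup_volumes, its arguments, and the archive name are hypothetical.
backup_volumes() {
  local base="${1:-/DockerVol}" dest="${2:-.}" stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$dest/monitoring-$stamp.tar.gz"
  # -C keeps paths in the archive relative to the base directory.
  tar -czf "$archive" -C "$base" prometheus grafana alertmanager blackbox || return 1
  echo "$archive"
}
```

The function prints the archive path, which makes it easy to hand the file to whatever copies backups off the host.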
### Restore
```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
---
## Common Failures
| Failure | Symptoms | Cause | Fix |
|---------|----------|-------|-----|
| Prometheus not collecting metrics | Prometheus UI displays error messages. | Insufficient disk space, or wrong ownership on `/DockerVol/prometheus/data`. | Free or add disk space, and re-apply the `chown` from Volume Setup. |
| Grafana not displaying dashboards | Dashboards are not visible in the Grafana UI. | Grafana cannot reach its data source, or `GF_SERVER_ROOT_URL` does not match the URL used to open Grafana. | Verify that the Prometheus data source is reachable from Grafana and that `GF_SERVER_ROOT_URL` is correct. |
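To check the disk-space and permission causes listed for Prometheus, two small helpers can be pointed at `/DockerVol/prometheus/data`. Both names are hypothetical, and `stat -c` assumes GNU coreutils, as found on most Linux hosts.

```bash
# Sketch: print the used-space percentage of the filesystem that holds a path.
disk_used_percent() {
  # -P forces the POSIX one-line-per-filesystem layout; column 5 is the capacity.
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Sketch: print the numeric owner of a path as UID:GID (GNU stat).
owner_of() {
  stat -c '%u:%g' "$1"
}
```

If `owner_of /DockerVol/prometheus/data` prints anything other than `1964:1964`, re-running the `chown` from Volume Setup is the likely fix.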
---
## Changelog
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-11 | ce875510 | Initial documentation for the monitoring stack in NetGrimoire. |
| 2026-04-11 | 3456a528 | Updated Prometheus configuration to use `--web.enable-lifecycle`. |
| 2026-04-09 | 8ca119ab | Added support for Cadvisor services. |
| 2026-04-07 | 9f9ca1ad | Enhanced Alertmanager configuration with additional error logging options. |
| 2026-04-07 | 71e3177f | Updated Grafana to version 10.0.1 for improved performance and stability. |
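The recorded history spans 2026-04-07 to 2026-04-11. Grafana was updated to 10.0.1 and Alertmanager gained additional error logging options on 2026-04-07, cAdvisor support was added on 2026-04-09, and on 2026-04-11 the Prometheus configuration was switched to use `--web.enable-lifecycle` alongside the initial documentation for the stack.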
---
## Notes
- Generated by Gremlin on 2026-04-12T01:10:17.109Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections