---
title: monitoring Stack
description: NetGrimoire Monitoring Stack Documentation
published: true
date: 2026-04-12T01:10:17.109Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-12T01:10:17.109Z
---

# monitoring

## Overview

This stack provides monitoring for NetGrimoire. Prometheus collects and stores metrics, Grafana renders dashboards, Alertmanager routes alerts, Blackbox Exporter performs HTTP/TCP/ICMP probing, and cAdvisor exposes host-level container metrics.

---

## Architecture

| Service | Image | Port | Role |
|---------|-------|------|------|
| Prometheus | `prom/prometheus:latest` | 9090 | Metrics collection |
| Grafana | `grafana/grafana:latest` | 3000 | Dashboards |
| Alertmanager | `prom/alertmanager:latest` | 9093 | Alert routing |
| Blackbox Exporter | `prom/blackbox-exporter:latest` | 9115 | HTTP/TCP/ICMP probing |
| cAdvisor | `gcr.io/cadvisor/cadvisor:latest` | Global (no port listed) | Multi-arch host metrics |

Exposed via: `caddy.netgrimoire.com`, Internal only

Homepage group: Monitoring

---

## Build & Configuration

### Prerequisites

Docker must be installed, and Swarm mode initialized, on the manager node (`znas`).

### Volume Setup

```bash
mkdir -p /DockerVol/prometheus/data
mkdir -p /DockerVol/grafana/data
mkdir -p /DockerVol/alertmanager/data
mkdir -p /DockerVol/blackbox/config
chown -R 1964:1964 /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/blackbox/config
```
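
Before deploying, it can help to confirm that each directory exists with the expected owner. The following is a minimal sketch, assuming GNU `stat` (Linux) and that `1964:1964` is the user the containers run as. `check_owner` is a hypothetical helper, not part of the stack.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if DIR exists and is owned by UID:GID.
check_owner() {
  local dir="$1" want="$2" have
  [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
  have="$(stat -c '%u:%g' "$dir")" || return 1   # GNU stat syntax (assumption)
  [ "$have" = "$want" ] || { echo "owner $have != $want: $dir" >&2; return 1; }
}

# Example (not executed here):
# for d in prometheus/data grafana/data alertmanager/data blackbox/config; do
#   check_owner "/DockerVol/$d" 1964:1964
# done
```

The function prints the offending path to stderr and returns non-zero, so it can gate a deploy script with `check_owner ... || exit 1`.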

### Environment Variables

```bash
# Do not commit a real password. Generate one with: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=replace-with-generated-secret
GF_SECURITY_ADMIN_USER=admin
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
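
The `openssl rand -hex 32` comment above suggests the admin password is meant to be generated rather than typed. A minimal sketch of that step, assuming `openssl` is on the `PATH`; `gen_secret` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a 64-character hex secret (32 random bytes).
gen_secret() {
  openssl rand -hex 32
}

# Example (not executed here): append the secret to .env without echoing it,
# creating the file readable by the current user only.
# umask 077
# printf 'GF_SECURITY_ADMIN_PASSWORD=%s\n' "$(gen_secret)" >> .env
```

Hex output contains no shell metacharacters, so the resulting line stays safe to load with `source .env`.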

### Deploy

```bash
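cd services/swarm/stack/monitoring
# Export every variable in .env so the compose file can interpolate them
set -a && source .env && set +a
# Render the fully resolved stack definition, deploy it, then discard the file
docker stack config --compose-file monitoring-stack.yml > resolved.yml
docker stack deploy --compose-file resolved.yml monitoring
rm resolved.yml
# Confirm that every service reports its expected replica count
docker stack services monitoring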
```
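
`docker stack deploy` can return before the services have converged, so a single `docker stack services` call may show replicas still starting. The sketch below polls until every service reports all of its replicas. `replicas_ready` and `stack_ready` are hypothetical helpers, and the two-minute budget in the example is an arbitrary assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a Replicas value such as "1/1" shows
# running == desired and desired > 0.
replicas_ready() {
  local running="${1%%/*}" desired="${1##*/}"
  [ "$desired" -gt 0 ] 2>/dev/null && [ "$running" = "$desired" ]
}

# Hypothetical helper: succeed when every service in the stack has converged.
stack_ready() {
  local out replicas rest
  out="$(docker stack services "$1" --format '{{.Replicas}}')" || return 1
  [ -n "$out" ] || return 1            # no services listed: not ready
  while read -r replicas rest; do      # "rest" absorbs any trailing annotation
    replicas_ready "$replicas" || return 1
  done <<EOF
$out
EOF
}

# Example (not executed here): poll for up to about two minutes.
# for _ in $(seq 1 24); do stack_ready monitoring && break; sleep 5; done
```

Keeping the string comparison in `replicas_ready` separate from the Docker call makes the parsing logic easy to check without a running Swarm.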

### First Run

The containers start Prometheus, Grafana, and Alertmanager when the stack deploys, so none of these commands are run on the host. They are listed for reference as the flags the Prometheus and Alertmanager containers are expected to run with; Grafana takes its settings from the `GF_*` environment variables above.

```bash
# Expected container commands (reference only; do not run on the host)
prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager
```

---

## User Guide

### Accessing monitoring

| Service | URL | Purpose |
|---------|-----|---------|
| Prometheus | http://prometheus.netgrimoire.com:9090 | Metrics collection |
| Grafana | https://grafana.netgrimoire.com | Dashboards |
| Alertmanager | https://alertmanager.netgrimoire.com:9093 | Alert routing |

### Primary Use Cases

Use Prometheus to scrape metrics from NetGrimoire services, Grafana to visualize them, and Alertmanager to route the resulting alerts.

### NetGrimoire Integrations

Integration points are set through environment variables. For example, `GF_SERVER_ROOT_URL` tells Grafana the public URL it is served from (`https://grafana.netgrimoire.com`).

---

## Operations

### Monitoring

```bash
docker stack services monitoring
# Monitor Prometheus for errors and performance issues
```
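
Beyond the replica counts, Prometheus exposes management endpoints that can be checked directly. The sketch below assumes the Prometheus URL listed in this document is reachable from where the commands run. `prom_url`, `prom_healthy`, `prom_reload`, and the `PROM_URL` override are hypothetical names.

```bash
#!/usr/bin/env bash
# Assumed base URL, taken from this document; override PROM_URL if it differs.
prom_url() { printf '%s' "${PROM_URL:-http://prometheus.netgrimoire.com:9090}"; }

# Hypothetical helper: succeed if Prometheus answers its health endpoint.
prom_healthy() { curl -fsS "$(prom_url)/-/healthy" >/dev/null; }

# Hypothetical helper: ask Prometheus to re-read its configuration.
# POST /-/reload only works when --web.enable-lifecycle is set.
prom_reload() { curl -fsS -X POST "$(prom_url)/-/reload"; }
```

With `--web.enable-lifecycle` enabled, `prom_reload` applies an edited `prometheus.yml` without restarting the service, and `prom_healthy` can serve as a quick check afterwards.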
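
### Backups

Critical: back up the volume directories created in Volume Setup, that is `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, `/DockerVol/alertmanager/data`, and `/DockerVol/blackbox/config`. Reconstructable: the containers themselves, which are recreated from the stack file on redeploy; cAdvisor has no volume directory in the setup above.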
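
One simple way to capture these directories is a dated tarball per volume. This is a minimal sketch rather than the stack's established backup procedure: `backup_dir` is a hypothetical helper and `/backup/monitoring` is a placeholder destination.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive one volume directory into DEST as a dated tarball.
backup_dir() {
  local src="$1" dest="$2" parent leaf name
  parent="$(dirname "$src")"
  leaf="$(basename "$src")"
  name="$(basename "$parent")-$leaf"          # e.g. grafana-data
  mkdir -p "$dest" &&
    tar -czf "$dest/$name-$(date +%F).tar.gz" -C "$parent" "$leaf"
}

# Example (not executed here; destination path is a placeholder):
# for d in prometheus/data grafana/data alertmanager/data blackbox/config; do
#   backup_dir "/DockerVol/$d" /backup/monitoring
# done
```

Archiving a database directory while its service is writing to it can produce an inconsistent copy, so it may be safer to scale the service down first.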

### Restore

```bash
cd services/swarm/stack/monitoring
./deploy.sh
```

---

## Common Failures

| Failure | Symptoms | Cause | Fix |
|---------|----------|-------|-----|
| Prometheus not collecting metrics | The Prometheus UI displays error messages. | Insufficient disk space, or wrong permissions on `/DockerVol/prometheus/data`. | Free or add disk space and confirm the data directory is owned by `1964:1964`. |
| Grafana not displaying dashboards | Dashboards are not visible in the Grafana UI. | Grafana is misconfigured or cannot reach its data source. | Verify that `GF_SERVER_ROOT_URL` matches the URL used to reach Grafana and that the Prometheus data source is reachable from the Grafana container. |

---

## Changelog

| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-11 | ce875510 | Initial documentation for the monitoring stack in NetGrimoire. |
| 2026-04-11 | 3456a528 | Updated Prometheus configuration to use `--web.enable-lifecycle`. |
| 2026-04-09 | 8ca119ab | Added support for cAdvisor services. |
| 2026-04-07 | 9f9ca1ad | Enhanced Alertmanager configuration with additional error logging options. |
| 2026-04-07 | 71e3177f | Updated Grafana to version 10.0.1 for improved performance and stability. |

---

## Notes

- Generated by Gremlin on 2026-04-12T01:10:17.109Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections