---
title: Monitoring Stack
description: NetGrimoire Monitoring Stack Documentation
published: true
date: 2026-04-12T01:10:17.109Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-12T01:10:17.109Z
---

# monitoring

## Overview

This stack provides monitoring for NetGrimoire. It consists of five services: Prometheus collects and stores metrics, Grafana provides dashboards, Alertmanager routes alerts, Blackbox Exporter performs HTTP/TCP/ICMP probing, and cAdvisor provides host metrics.

---

## Architecture

| Service | Image | Port | Role |
|---------|-------|------|------|
| Prometheus | `prom/prometheus:latest` | 9090 | Metrics collection |
| Grafana | `grafana/grafana:latest` | 3000 | Dashboards |
| Alertmanager | `prom/alertmanager:latest` | 9093 | Alert routing |
| Blackbox Exporter | `prom/blackbox-exporter:latest` | 9115 | HTTP/TCP/ICMP probing |
| cAdvisor | `gcr.io/cadvisor/cadvisor:latest` | Global (deploy mode) | Multi-arch host metrics |

**Exposed via:** `caddy.netgrimoire.com`, internal only

**Homepage group:** Monitoring

---

## Build & Configuration

### Prerequisites

Ensure Docker Swarm is installed and configured on the manager node (`znas`).

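
The prerequisite above can be checked before deploying. The following is a minimal sketch, assuming the `docker` CLI is available on the node where it runs. The helper name `swarm_status` is hypothetical and not part of the stack's tooling. It relies on the `docker info` template fields `.Swarm.LocalNodeState` and `.Swarm.ControlAvailable`.

```bash
# swarm_status STATE IS_MANAGER
# Hypothetical helper: turns the two values reported by `docker info`
# into a one-word verdict. Prints "ready" only for an active manager.
swarm_status() {
  if [ "$1" != "active" ]; then
    echo "not-in-swarm"
  elif [ "$2" != "true" ]; then
    echo "worker-only"
  else
    echo "ready"
  fi
}

# Query the local Docker engine only when the CLI is present.
if command -v docker >/dev/null 2>&1; then
  swarm_status \
    "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" \
    "$(docker info --format '{{.Swarm.ControlAvailable}}' 2>/dev/null)"
fi
```

Run on `znas`, anything other than `ready` suggests the node should be initialized or promoted before the stack is deployed.
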
### Volume Setup

```bash
# Create the persistent data and config directories
mkdir -p /DockerVol/prometheus/data
mkdir -p /DockerVol/grafana/data
mkdir -p /DockerVol/alertmanager/data
mkdir -p /DockerVol/blackbox/config

# Hand ownership to UID/GID 1964
chown -R 1964:1964 /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/blackbox/config
```

### Environment Variables

Place these in the `.env` file that the deploy step sources. Do not store the real admin password in documentation or version control.

```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=REPLACE_WITH_GENERATED_VALUE
GF_SECURITY_ADMIN_USER=admin
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```

### Deploy

```bash
cd services/swarm/stack/monitoring
# Export every variable from .env so the compose file can interpolate them
set -a && source .env && set +a
# Render the compose file with variables resolved, deploy it, then remove the rendered copy
docker stack config --compose-file monitoring-stack.yml > resolved.yml
docker stack deploy --compose-file resolved.yml monitoring
rm resolved.yml
# List the stack's services and their replica counts
docker stack services monitoring
```

### First Run

No manual start-up is needed after deploying the stack: each container launches its own process. For reference, Prometheus and Alertmanager are expected to run with the following arguments:

```bash
# Reference only: these processes run inside the containers and are not started by hand.
prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager
# Grafana is configured through the GF_* environment variables above and serves on port 3000.
```

---

## User Guide

### Accessing monitoring

| Service | URL | Purpose |
|---------|-----|---------|
| Prometheus | http://prometheus.netgrimoire.com:9090 | Metrics collection |
| Grafana | https://grafana.netgrimoire.com:3000 | Dashboards |
| Alertmanager | https://alertmanager.netgrimoire.com:9093 | Alert routing |

### Primary Use Cases

Use Prometheus to collect metrics from NetGrimoire services, Grafana to visualize them, and Alertmanager to route the resulting alerts.

### NetGrimoire Integrations

Integrate this monitoring stack with other NetGrimoire components through environment variables. For example, `GF_SERVER_ROOT_URL` sets the public URL that Grafana uses when it generates links.

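
The endpoints listed under Accessing monitoring can be probed from the command line to confirm that each service is up. The sketch below assumes `curl` is installed and that the hostnames resolve from where it runs. The function names `health_url` and `check_all` are hypothetical. The health paths used are `/-/healthy` for Prometheus and Alertmanager and `/api/health` for Grafana.

```bash
# health_url SERVICE: print the health-check URL for one service.
# Returns non-zero for a service name it does not know.
health_url() {
  case "$1" in
    prometheus)   echo "http://prometheus.netgrimoire.com:9090/-/healthy" ;;
    grafana)      echo "https://grafana.netgrimoire.com:3000/api/health" ;;
    alertmanager) echo "https://alertmanager.netgrimoire.com:9093/-/healthy" ;;
    *)            return 1 ;;
  esac
}

# check_all: request each health URL and report OK or FAIL per service.
check_all() {
  for svc in prometheus grafana alertmanager; do
    if curl --fail --silent --max-time 5 --output /dev/null "$(health_url "$svc")"; then
      echo "$svc OK"
    else
      echo "$svc FAIL"
    fi
  done
}
```

After loading the functions into a shell, calling `check_all` prints one line per service. A `FAIL` only means the request did not succeed within five seconds, so DNS, TLS, and proxy problems show up the same way as a stopped service.
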
---

## Operations

### Monitoring

```bash
# List the stack's services and their replica counts
docker stack services monitoring
# Monitor Prometheus for errors and performance issues
# (assumes the service is named "prometheus" in the stack file)
docker service logs --tail 100 monitoring_prometheus
```

### Backups

**Critical:** Back up the persistent directories created during Volume Setup: `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, `/DockerVol/alertmanager/data`, and `/DockerVol/blackbox/config`. cAdvisor keeps no persistent data of its own.

**Reconstructable:** Volume data can be restored from backup; the services themselves are recreated by redeploying the stack.

### Restore

After restoring the backed-up directories to `/DockerVol`, redeploy the stack:

```bash
cd services/swarm/stack/monitoring
./deploy.sh
```

---

## Common Failures

| Failure | Symptoms | Cause | Fix |
|---------|----------|-------|-----|
| Prometheus not collecting metrics | Prometheus UI displays error messages. | Insufficient disk space, or incorrect permissions on the Prometheus data directory. | Free or add disk space and make sure `/DockerVol/prometheus/data` is owned by `1964:1964`. |
| Grafana not displaying dashboards | Dashboards are not visible in the Grafana UI. | Grafana cannot reach its data source, or `GF_SERVER_ROOT_URL` does not match the address used to open Grafana. | Check that Grafana can reach Prometheus and that `GF_SERVER_ROOT_URL` matches the address used to open Grafana. |

---

## Changelog

| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-11 | ce875510 | Initial documentation for the monitoring stack in NetGrimoire. |
| 2026-04-11 | 3456a528 | Updated Prometheus configuration to use `--web.enable-lifecycle`. |
| 2026-04-09 | 8ca119ab | Added support for cAdvisor services. |
| 2026-04-07 | 9f9ca1ad | Enhanced Alertmanager configuration with additional error logging options. |
| 2026-04-07 | 71e3177f | Updated Grafana to version 10.0.1 for improved performance and stability. |

---

## Notes

- Generated by Gremlin on 2026-04-12T01:10:17.109Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections