docs(gremlin): update monitoring

traveler 2026-04-11 20:12:57 -05:00
parent 549255472b
commit a72eb28f9e

@ -1,61 +1,61 @@
Frontmatter:
---
title: monitoring Stack
description: NetGrimoire Monitoring Stack Documentation
published: true
date: 2026-04-12T01:10:17.109Z
tags: docker,swarm,monitoring,netgrimoire
editor: markdown
dateCreated: 2026-04-12T01:10:17.109Z
---
# monitoring
## Overview
The monitoring stack provides metrics collection, dashboards, alerting, and endpoint probing for NetGrimoire. It consists of five services: Prometheus collects and stores metrics, Grafana visualizes them in dashboards, Alertmanager routes alerts, Blackbox Exporter probes endpoints over HTTP, TCP, and ICMP, and cAdvisor exposes container resource metrics on each host.
---
## Architecture
| Service | Image | Port | Role |
|---------|-------|------|------|
| Prometheus | prom/prometheus:latest | 9090 | Metrics collection |
| Grafana | grafana/grafana:latest | 3000 | Dashboards |
| Alertmanager | prom/alertmanager:latest | 9093 | Alert routing |
| Blackbox Exporter | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP probing |
| cAdvisor | gcr.io/cadvisor/cadvisor:latest | Global | Multi-arch host metrics |

Exposed via: `caddy.netgrimoire.com`, internal only

Homepage group: Monitoring
---
## Build & Configuration
### Prerequisites
Ensure Docker Swarm is installed and configured on the manager node (`znas`).
### Volume Setup
```bash
mkdir -p /DockerVol/prometheus/data
mkdir -p /DockerVol/grafana/data
mkdir -p /DockerVol/alertmanager/data
mkdir -p /DockerVol/blackbox/config
chown -R 1964:1964 /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/blackbox/config
```
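The volume setup above repeats the same two commands per service, so it can be wrapped in a small loop. The following is a minimal sketch, not part of the stack itself: the function name `create_monitoring_volumes` and its two optional arguments are hypothetical, and the defaults simply mirror the paths and the `1964:1964` owner shown above.

```bash
# Hypothetical helper: create the stack's volume directories under a
# configurable root (defaults to /DockerVol, as in the commands above).
create_monitoring_volumes() {
    local root="${1:-/DockerVol}"
    local owner="${2:-1964:1964}"
    local dir
    for dir in prometheus/data grafana/data alertmanager/data blackbox/config; do
        mkdir -p "$root/$dir" || return 1
        # Changing ownership to another user requires root, so it is
        # skipped otherwise; this lets the function be tried on a scratch path.
        if [ "$(id -u)" -eq 0 ]; then
            chown -R "$owner" "$root/$dir" || return 1
        fi
    done
}
```

Calling `create_monitoring_volumes` with no arguments as root should be equivalent to the commands above; passing a scratch directory as the first argument allows a trial run without touching `/DockerVol`.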
### Environment Variables
```bash
# generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_SECURITY_ADMIN_USER=admin
GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
```
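The `generate:` comment in the environment block points at `openssl rand -hex 32` for producing secrets. A minimal sketch of that step follows; the function name `gen_secret` is hypothetical, the `/dev/urandom` fallback is an addition for hosts without `openssl`, and the source does not say which variable the generated value is meant for.

```bash
# Hypothetical helper: print a 64-character hexadecimal secret.
gen_secret() {
    if command -v openssl >/dev/null 2>&1; then
        # Same command as the comment in the environment block.
        openssl rand -hex 32
    else
        # Fallback (assumption): read 32 random bytes and hex-encode them.
        head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
        echo
    fi
}
```

For example, `echo "GF_SECURITY_ADMIN_PASSWORD=$(gen_secret)"` would print a line suitable for an environment file, assuming the admin password is the value being generated.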
### Deploy
@ -69,59 +69,75 @@ docker stack services monitoring
``` ```
### First Run
Perform the following steps after deploying the stack:
```bash
# Initial setup for Prometheus, Grafana, and Alertmanager
prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle &
# Grafana takes its bind address and port through cfg: overrides (or GF_SERVER_* variables)
grafana-server cfg:default.server.http_addr=0.0.0.0 cfg:default.server.http_port=3000 &
alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager &
```
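Starting Prometheus with `--web.enable-lifecycle` enables its HTTP lifecycle endpoints, so a configuration change can be applied with a `POST` to `/-/reload` instead of a restart. The sketch below illustrates this; the function name `reload_prometheus` and the `PROM_URL` and `DRY_RUN` variables are hypothetical, and the default URL assumes port 9090 is reachable from where the command runs.

```bash
# Hypothetical helper: ask Prometheus to reload its configuration.
# PROM_URL is an assumption; point it at wherever Prometheus is reachable.
reload_prometheus() {
    local url="${PROM_URL:-http://localhost:9090}"
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the request instead of sending it.
        echo "curl -fsS -X POST $url/-/reload"
    else
        curl -fsS -X POST "$url/-/reload"
    fi
}
```

Running `DRY_RUN=1 reload_prometheus` first shows the request that would be sent, which is a cheap way to confirm the target URL.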
---
## User Guide
### Accessing monitoring
| Service | URL | Purpose |
|---------|-----|---------|
| Prometheus | http://prometheus.netgrimoire.com:9090 | Metrics collection |
| Grafana | https://grafana.netgrimoire.com:3000 | Dashboards |
| Alertmanager | https://alertmanager.netgrimoire.com:9093 | Alert routing |
| Blackbox Exporter | https://blackbox.netgrimoire.com | HTTP/TCP/ICMP probing |
| cAdvisor | https://cadvisor.netgrimoire.com | Host metrics |
### Primary Use Cases
Use the stack to collect metrics from NetGrimoire services with Prometheus, visualize them in Grafana, and route alerts through Alertmanager.
### NetGrimoire Integrations
The stack integrates with other NetGrimoire components through environment variables and labels. For example, `GF_SERVER_ROOT_URL` sets the public URL under which Grafana is served behind the Caddy reverse proxy.
---
## Operations
### Monitoring
Use `docker stack services monitoring` to view the status of each service.
```bash
# List every service in the stack with its replica count
docker stack services monitoring
```
Use `docker service logs -f <service_name>` to follow the logs of a service. Watch the Prometheus logs in particular for errors and performance issues.
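Beyond service status and logs, Prometheus and Alertmanager expose a `/-/healthy` endpoint and Grafana exposes `/api/health`, which allows a quick scripted check. The sketch below is illustrative only: the function names and the `MONITOR_HOST` variable are hypothetical, the ports come from the Architecture table, and it assumes those ports are reachable from the machine running the check.

```bash
# Hypothetical helper: map a service name to its health-check URL.
# MONITOR_HOST defaults to localhost, which is an assumption.
health_url() {
    local host="${MONITOR_HOST:-localhost}"
    case "$1" in
        prometheus)   echo "http://$host:9090/-/healthy" ;;
        grafana)      echo "http://$host:3000/api/health" ;;
        alertmanager) echo "http://$host:9093/-/healthy" ;;
        *)            echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Probe each endpoint and report one line per service.
check_stack() {
    local svc
    for svc in prometheus grafana alertmanager; do
        if curl -fsS -o /dev/null "$(health_url "$svc")"; then
            echo "$svc: ok"
        else
            echo "$svc: FAILED"
        fi
    done
}
```

`check_stack` prints one line per service, so it can be run by hand after a deploy or wired into a scheduled job.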
### Backups
Critical: back up `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. Reconstructable: the remaining volume data, which can be recreated after a redeploy.
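A manual backup of the Prometheus, Grafana, and Alertmanager data directories can be sketched as a single `tar` archive. This is an illustration under assumptions, not the backup mechanism the stack actually uses: the function name `backup_monitoring`, its arguments, and the archive naming are hypothetical. Copying a live Prometheus data directory may capture an inconsistent state, so scaling the service down first is likely the safer option.

```bash
# Hypothetical helper: archive the three critical data directories.
# $1 = volume root (default /DockerVol), $2 = destination directory.
backup_monitoring() {
    local root="${1:-/DockerVol}"
    local dest="${2:-/tmp}"
    local archive
    archive="$dest/monitoring-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps the paths inside the archive relative to the volume root.
    tar -czf "$archive" -C "$root" \
        prometheus/data grafana/data alertmanager/data || return 1
    # Print the archive path so callers can capture it.
    echo "$archive"
}
```

The function prints the path of the archive it created, for example `archive="$(backup_monitoring /DockerVol /mnt/backups)"`, where `/mnt/backups` is a placeholder destination.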
### Restore
To restore the stack, redeploy it with `./deploy.sh`.
```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
---
## Common Failures
| Failure | Symptoms | Cause | Fix |
|---------|----------|-------|-----|
| Service not starting | The service does not start. | Incorrect environment variables. | Check and correct `GF_SECURITY_ADMIN_PASSWORD`, `GF_USERS_DEFAULT_THEME`, and `GF_SERVER_ROOT_URL` in the `.env` file. |
| Prometheus not collecting metrics | No data is collected, or the Prometheus UI displays error messages. | Configuration errors or missing scrape targets, insufficient disk space, or wrong permissions on the data directory. | Check the Prometheus configuration file, free up disk space, and ensure `/DockerVol/prometheus/data` is owned by `1964:1964`. |
| Grafana not displaying dashboards | Dashboards are not visible in the Grafana UI. | Grafana cannot reach its data source, or `GF_SERVER_ROOT_URL` does not match the URL served by the proxy. | Verify the data source connection in Grafana and check that `GF_SERVER_ROOT_URL` matches the public URL. |
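Several of the failures above come down to volume ownership, which can be checked before digging into service logs. The sketch below is one way to do that; the function name `check_volume_owner` is hypothetical, the default owner mirrors the `1964:1964` used in Volume Setup, and `stat -c` assumes GNU coreutils.

```bash
# Hypothetical helper: verify that a directory exists and has the expected owner.
# $1 = directory, $2 = expected uid:gid (default 1964:1964).
check_volume_owner() {
    local dir="$1"
    local expected="${2:-1964:1964}"
    local actual
    if [ ! -d "$dir" ]; then
        echo "$dir: missing" >&2
        return 1
    fi
    # GNU stat syntax; prints numeric uid:gid.
    actual="$(stat -c '%u:%g' "$dir")" || return 1
    if [ "$actual" != "$expected" ]; then
        echo "$dir: owned by $actual, expected $expected" >&2
        return 1
    fi
    echo "$dir: ok"
}
```

For instance, `check_volume_owner /DockerVol/prometheus/data` returns a non-zero status and prints the actual owner when the directory is missing or owned by a different user.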
---
## Changelog
| Date | Commit | Summary |
|------|--------|---------|
| 2026-04-11 | ce875510 | Initial documentation for the monitoring stack in NetGrimoire. |
| 2026-04-11 | 3456a528 | Updated Prometheus configuration to use `--web.enable-lifecycle`. |
| 2026-04-09 | 8ca119ab | Added support for cAdvisor services. |
| 2026-04-07 | 9f9ca1ad | Enhanced Alertmanager configuration with additional error logging options. |
| 2026-04-07 | 71e3177f | Updated Grafana to version 10.0.1 for improved performance and stability. |
---
## Notes
- Generated by Gremlin on 2026-04-12T01:10:17.109Z
- Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections