docs(gremlin): update monitoring

traveler 2026-04-07 22:36:47 -05:00
parent 1fbad41f86
commit aa3f11b7f9


@ -1,21 +1,20 @@
# monitoring # monitoring
## Overview ## Overview
The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior. This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
--- ---
## Architecture ## Architecture
| Service | Image | Port | Role | | Service | Image | Port | Role |
|---------|-------|-----|------| |---------|-------|-----|------|
- **Prometheus:** prom/prometheus:latest - **Prometheus** | `prom/prometheus:latest` | 9090 | Metrics Collection |
- **Grafana:** grafana/grafana:latest - **Grafana** | `grafana/grafana:latest` | 3000 | Dashboards |
- **Alertmanager:** prom/alertmanager:latest - **Alertmanager** | `prom/alertmanager:latest` | 9093 | Alert Routing |
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest - **Cadvisor** | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
- **Node Exporter:** prom/node-exporter:latest - **Node Exporter** | `prom/node-exporter:latest` | - | Host Metrics |
Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com Exposed via: `caddy.netgrimoire.com`
Homepage group: Monitoring Homepage group: Monitoring
@ -24,32 +23,28 @@ Homepage group: Monitoring
## Build & Configuration ## Build & Configuration
### Prerequisites ### Prerequisites
No specific prerequisites are required for this stack. No specific prerequisites for this stack.
### Volume Setup ### Volume Setup
```bash ```bash
mkdir -p /DockerVol/prometheus/data mkdir -p /DockerVol/prometheus/data
chown -R 1964:1964 /DockerVol/prometheus chown -R 1964:1964 /DockerVol/prometheus/data
``` ```
```bash ```bash
mkdir -p /DockerVol/grafana/data mkdir -p /DockerVol/grafana/data
chown -R 1964:1964 /DockerVol/grafana chown -R 1964:1964 /DockerVol/grafana/data
``` ```
```bash ```bash
mkdir -p /DockerVol/alertmanager/data mkdir -p /DockerVol/alertmanager/data
chown -R 1964:1964 /DockerVol/alertmanager chown -R 1964:1964 /DockerVol/alertmanager/data
``` ```
```bash ```bash
mkdir -p /DockerVol/cadvisor/ mkdir -p /DockerVol/cadvisor/data
chown -R 1964:1964 /DockerVol/cadvisor/ chown -R 1964:1964 /DockerVol/cadvisor/data
``` ```
```bash ```bash
mkdir -p /DockerVol/node-exporter/ mkdir -p /DockerVol/node-exporter/data
chown -R 1964:1964 /DockerVol/node-exporter chown -R 1964:1964 /DockerVol/node-exporter/data
``` ```
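The five volume-setup blocks above repeat the same two commands per service. A minimal sketch of the same setup as a loop, assuming the new layout (`<service>/data` under `/DockerVol`) and the `1964:1964` owner shown in the diff. The function name `make_monitoring_dirs` is hypothetical and not part of the repository:

```bash
# Sketch: create one data directory per monitoring service under a base path.
# make_monitoring_dirs is a hypothetical helper, not something from the repo.
make_monitoring_dirs() {
    base="$1"      # e.g. /DockerVol
    owner="$2"     # e.g. 1964:1964; pass an empty string to skip chown
    for svc in prometheus grafana alertmanager cadvisor node-exporter; do
        mkdir -p "$base/$svc/data" || return 1
        if [ -n "$owner" ]; then
            chown -R "$owner" "$base/$svc/data" || return 1
        fi
    done
}
```

Usage would look like `make_monitoring_dirs /DockerVol 1964:1964`. Passing an empty owner allows a trial run against a scratch directory without root.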
### Environment Variables ### Environment Variables
@ -57,6 +52,8 @@ chown -R 1964:1964 /DockerVol/node-exporter
# generate: openssl rand -hex 32 # generate: openssl rand -hex 32
GF_SECURITY_ADMIN_PASSWORD=F@lcon13 GF_SECURITY_ADMIN_PASSWORD=F@lcon13
GF_USERS_DEFAULT_THEME=dark GF_USERS_DEFAULT_THEME=dark
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
``` ```
### Deploy ### Deploy
@ -70,7 +67,7 @@ docker stack services monitoring
``` ```
### First Run ### First Run
Run `./deploy.sh` after the initial deployment to complete any necessary setup. After the initial deployment, ensure that Prometheus is scraped by Grafana and Alertmanager is configured to forward alerts to Cadvisor.
--- ---
@ -79,38 +76,41 @@ Run `./deploy.sh` after the initial deployment to complete any necessary setup.
### Accessing monitoring ### Accessing monitoring
| Service | URL | Purpose | | Service | URL | Purpose |
|---------|-----|---------| |---------|-----|---------|
- **Prometheus:** https://prometheus.netgrimoire.com - **Grafana** | https://grafana.netgrimoire.com | Dashboards |
- **Grafana:** https://grafana.netgrimoire.com - **Alertmanager** | https://alertmanager.netgrimoire.com | Alert Routing |
- **Alertmanager:** https://alertmanager.netgrimoire.com
### Primary Use Cases ### Primary Use Cases
This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting. Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
### NetGrimoire Integrations ### NetGrimoire Integrations
The Cadvisor service connects to the Node Exporter to collect host metrics from all nodes in the cluster, including Pi. The Alertmanager service connects to the Prometheus server to forward alerts. The Grafana service connects to the Prometheus server to display dashboards. This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
--- ---
## Operations ## Operations
### Monitoring ### Monitoring
Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring`. ```bash
docker stack services monitoring
docker service logs -f monitoring prometheus
```
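Docker Swarm names each service in a stack `<stack>_<service>`, and `docker service logs` takes a single service name, so the log command would normally be written with one name such as `monitoring_prometheus`. A small sketch, assuming the stack is named `monitoring` and the service keys match the names in the Architecture table (neither is confirmed by `swarm/monitoring.yaml` here). `swarm_service_name` and `follow_service_logs` are hypothetical helpers:

```bash
# Hypothetical helper: build the Swarm service name from stack and service key.
swarm_service_name() {
    printf '%s_%s\n' "$1" "$2"
}

# Follow logs for one service in a stack.
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
follow_service_logs() {
    ${DOCKER:-docker} service logs -f "$(swarm_service_name "$1" "$2")"
}
```

For example, `follow_service_logs monitoring prometheus` would run `docker service logs -f monitoring_prometheus`.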
### Backups ### Backups
Regular backups are recommended for critical data stored in `/DockerVol/`. Use `docker stack services monitoring` and inspect the volumes for specific data. Critical backups are required for Prometheus and Grafana data. Reconstructing from backup is possible but may require manual configuration.
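Neither version of the Backups text gives a concrete procedure. One possible approach, sketched under assumptions, is to archive each service's `data` directory into a dated tarball. The `/DockerVol/<service>/data` layout comes from the Volume Setup section, while the destination directory and the function name `backup_service_data` are hypothetical. Archiving Prometheus data while the service is still writing may produce an inconsistent copy, so this sketch assumes the service has been stopped or scaled to zero first:

```bash
# Archive one service's data directory into a dated tarball and print its path.
# Arguments: base dir (e.g. /DockerVol), service name, destination dir.
# backup_service_data is a hypothetical helper, not part of the repo.
backup_service_data() {
    base="$1"; svc="$2"; dest="$3"
    mkdir -p "$dest" || return 1
    archive="$dest/$svc-$(date +%Y%m%d).tar.gz"
    # -C keeps paths inside the archive relative ("data/...")
    tar -czf "$archive" -C "$base/$svc" data || return 1
    printf '%s\n' "$archive"
}
```

A call such as `backup_service_data /DockerVol grafana /some/backup/dir` would write `grafana-<date>.tar.gz` to the destination; the backup path here is a placeholder.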
### Restore ### Restore
Restore the services by running `./deploy.sh` after any changes to the configuration. ```bash
cd services/swarm/stack/monitoring
./deploy.sh
```
--- ---
## Common Failures ## Common Failures
| Failure Mode | Symptoms | Cause | Fix |
| Failure | Symptom | Cause | Fix | |-------------|----------|-------|------|
|--------|---------|------|-----| | Prometheus down | No metrics available in Grafana | Prometheus not scraped | Check Prometheus configuration and restart service |
- **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Restart Prometheus service: `docker service restart monitoring prometheus` | | Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
- **Alerts not forwarding to Alertmanager** | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
- **Grafana dashboard not displaying data** | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
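The Docker CLI has no `docker service restart` subcommand. A common way to restart a Swarm service is to force an update, which recreates its tasks. A minimal sketch, assuming the `<stack>_<service>` naming convention and a stack named `monitoring`; `restart_service` is a hypothetical helper:

```bash
# Force Swarm to recreate the tasks of one service in a stack.
# Assumption: the service is named <stack>_<service>.
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
restart_service() {
    ${DOCKER:-docker} service update --force "${1}_${2}"
}
```

For example, `restart_service monitoring prometheus` would run `docker service update --force monitoring_prometheus`.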
--- ---
@ -118,15 +118,14 @@ Restore the services by running `./deploy.sh` after any changes to the configura
| Date | Commit | Summary | | Date | Commit | Summary |
|------|--------|---------| |------|--------|---------|
| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. | | 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. | | 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. | <Write a paragraph summarizing the evolution of this service based on the diffs above. If no diffs available, note that this is the initial documentation.>
| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
--- ---
## Notes ## Notes
- Generated by Gremlin on 2026-04-08T02:08:17.740Z - Generated by Gremlin on 2026-04-08T03:34:50.852Z
- Source: swarm/monitoring.yaml - Source: swarm/monitoring.yaml
- Review User Guide and Changelog sections - Review User Guide and Changelog sections