docs(gremlin): update monitoring
parent 1fbad41f86
commit aa3f11b7f9
1 changed file with 40 additions and 41 deletions
|
|
@ -1,21 +1,20 @@
|
||||||
# monitoring
|
# monitoring
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
The monitoring stack in NetGrimoire consists of Prometheus, Grafana, Alertmanager, Cadvisor, and Node Exporter services that work together to collect metrics from the system. The services are designed to be highly available and scalable, with a focus on providing detailed insights into system performance and behavior.
|
This stack provides a comprehensive monitoring solution in NetGrimoire, comprising Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Cadvisor for container metrics, and Node Exporter for host metrics. These services work together to provide insights into system performance, application health, and infrastructure utilization.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
| Service | Image | Port | Role |
|
| Service | Image | Port | Role |
|
||||||
|---------|-------|-----|------|
|
|---------|-------|-----|------|
|
||||||
- **Prometheus:** prom/prometheus:latest
|
| Prometheus | `prom/prometheus:latest` | 9090 | Metrics Collection |
|
||||||
- **Grafana:** grafana/grafana:latest
|
| Grafana | `grafana/grafana:latest` | 3000 | Dashboards |
|
||||||
- **Alertmanager:** prom/alertmanager:latest
|
| Alertmanager | `prom/alertmanager:latest` | 9093 | Alert Routing |
|
||||||
- **Cadvisor:** gcr.io/cadvisor/cadvisor:latest
|
| Cadvisor | `gcr.io/cadvisor/cadvisor:latest` | - | Container Metrics |
|
||||||
- **Node Exporter:** prom/node-exporter:latest
|
| Node Exporter | `prom/node-exporter:latest` | - | Host Metrics |
|
||||||
|
|
||||||
Exposed via: prometheus.netgrimoire.com, grafana.netgrimoire.com
|
Exposed via: `caddy.netgrimoire.com`
|
||||||
|
|
||||||
Homepage group: Monitoring
|
Homepage group: Monitoring
|
||||||
|
|
||||||
|
|
@ -24,32 +23,28 @@ Homepage group: Monitoring
|
||||||
## Build & Configuration
|
## Build & Configuration
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
No specific prerequisites are required for this stack.
|
No specific prerequisites for this stack.
|
||||||
|
|
||||||
### Volume Setup
|
### Volume Setup
|
||||||
```bash
|
```bash
|
||||||
mkdir -p /DockerVol/prometheus/data
|
mkdir -p /DockerVol/prometheus/data
|
||||||
chown -R 1964:1964 /DockerVol/prometheus
|
chown -R 1964:1964 /DockerVol/prometheus/data
|
||||||
```
|
```
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
mkdir -p /DockerVol/grafana/data
|
mkdir -p /DockerVol/grafana/data
|
||||||
chown -R 1964:1964 /DockerVol/grafana
|
chown -R 1964:1964 /DockerVol/grafana/data
|
||||||
```
|
```
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
mkdir -p /DockerVol/alertmanager/data
|
mkdir -p /DockerVol/alertmanager/data
|
||||||
chown -R 1964:1964 /DockerVol/alertmanager
|
chown -R 1964:1964 /DockerVol/alertmanager/data
|
||||||
```
|
```
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
mkdir -p /DockerVol/cadvisor/
|
mkdir -p /DockerVol/cadvisor/data
|
||||||
chown -R 1964:1964 /DockerVol/cadvisor/
|
chown -R 1964:1964 /DockerVol/cadvisor/data
|
||||||
```
|
```
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
mkdir -p /DockerVol/node-exporter/
|
mkdir -p /DockerVol/node-exporter/data
|
||||||
chown -R 1964:1964 /DockerVol/node-exporter
|
chown -R 1964:1964 /DockerVol/node-exporter/data
|
||||||
```
|
```
|
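The five per-service blocks above repeat the same two commands, so they can be collapsed into one loop. The following is a minimal sketch, not part of the commit: `make_volume_dirs` is a hypothetical helper name, and it assumes every service uses a `data` subdirectory owned by `1964:1964`, as in the updated blocks.

```bash
# Sketch: create the data directory for each monitoring service in one pass.
# make_volume_dirs is a hypothetical helper, not part of the repository.
make_volume_dirs() {
  base="${1:-/DockerVol}"
  for svc in prometheus grafana alertmanager cadvisor node-exporter; do
    mkdir -p "$base/$svc/data" || return 1
    # chown requires root; skip it otherwise so the function can be tried
    # against a scratch directory first.
    if [ "$(id -u)" -eq 0 ]; then
      chown -R 1964:1964 "$base/$svc/data" || return 1
    fi
  done
}
```

Run as root, `make_volume_dirs /DockerVol` should leave the same directory layout as the individual commands.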
||||||
|
|
||||||
### Environment Variables
|
### Environment Variables
|
||||||
|
|
@ -57,6 +52,8 @@ chown -R 1964:1964 /DockerVol/node-exporter
|
||||||
# generate: openssl rand -hex 32
|
# generate: openssl rand -hex 32
|
||||||
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
|
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
|
||||||
GF_USERS_DEFAULT_THEME=dark
|
GF_USERS_DEFAULT_THEME=dark
|
||||||
|
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
|
||||||
|
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
|
||||||
```
|
```
|
||||||
|
|
||||||
### Deploy
|
### Deploy
|
||||||
|
|
@ -70,7 +67,7 @@ docker stack services monitoring
|
||||||
```
|
```
|
||||||
|
|
||||||
### First Run
|
### First Run
|
||||||
Run `./deploy.sh` after the initial deployment to complete any necessary setup.
|
After the initial deployment, confirm that Grafana has Prometheus configured as a data source, that Prometheus is scraping Cadvisor and Node Exporter, and that Prometheus is sending alerts to Alertmanager.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -79,38 +76,41 @@ Run `./deploy.sh` after the initial deployment to complete any necessary setup.
|
||||||
### Accessing monitoring
|
### Accessing monitoring
|
||||||
| Service | URL | Purpose |
|
| Service | URL | Purpose |
|
||||||
|---------|-----|---------|
|
|---------|-----|---------|
|
||||||
- **Prometheus:** https://prometheus.netgrimoire.com
|
| Grafana | https://grafana.netgrimoire.com | Dashboards |
|
||||||
- **Grafana:** https://grafana.netgrimoire.com
|
| Alertmanager | https://alertmanager.netgrimoire.com | Alert Routing |
|
||||||
- **Alertmanager:** https://alertmanager.netgrimoire.com
|
|
||||||
|
|
||||||
### Primary Use Cases
|
### Primary Use Cases
|
||||||
This monitoring stack provides detailed insights into system performance and behavior. It is used to monitor system metrics, identify potential issues before they become critical, and provide real-time data for troubleshooting.
|
Use Grafana to visualize metrics from Prometheus, and use Alertmanager to manage alerts.
|
||||||
|
|
||||||
### NetGrimoire Integrations
|
### NetGrimoire Integrations
|
||||||
The Cadvisor service connects to the Node Exporter to collect host metrics from all nodes in the cluster, including Pi. The Alertmanager service connects to the Prometheus server to forward alerts. The Grafana service connects to the Prometheus server to display dashboards.
|
This monitoring stack integrates with other services in NetGrimoire via environment variables and labels.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Operations
|
## Operations
|
||||||
|
|
||||||
### Monitoring
|
### Monitoring
|
||||||
Monitor the services using `docker stack services monitoring` and view logs with `docker service logs -f monitoring`.
|
```bash
|
||||||
|
docker stack services monitoring
|
||||||
|
docker service logs -f monitoring_prometheus
|
||||||
|
```
|
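Beyond listing services and tailing logs, each component can be probed over HTTP. The sketch below is illustrative only: `health_url` and `check_health` are hypothetical helpers, the ports are taken from the Architecture table, and it assumes those ports are reachable from wherever the script runs (they may only be reachable through the reverse proxy in this deployment). The paths are the standard health endpoints of Prometheus, Alertmanager, and Grafana.

```bash
# Sketch: build and probe the health endpoint of a monitoring service.
# health_url and check_health are hypothetical helpers for illustration.
health_url() {
  host="${2:-localhost}"
  case "$1" in
    prometheus)   echo "http://$host:9090/-/healthy" ;;
    alertmanager) echo "http://$host:9093/-/healthy" ;;
    grafana)      echo "http://$host:3000/api/health" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

check_health() {
  url="$(health_url "$1" "${2:-localhost}")" || return 1
  # -f turns HTTP errors into a non-zero exit; --max-time bounds the wait.
  curl -fsS --max-time 5 "$url" >/dev/null
}
```

For example, `check_health prometheus && echo up` prints `up` only when the endpoint answers with a successful status.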
||||||
|
|
||||||
### Backups
|
### Backups
|
||||||
Regular backups are recommended for critical data stored in `/DockerVol/`. Use `docker stack services monitoring` and inspect the volumes for specific data.
|
Back up the Prometheus and Grafana data directories under `/DockerVol/` regularly. Restoring from a backup is possible but may require manual reconfiguration.
|
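The commit does not say how backups are taken. One possible approach, sketched here under the assumption that the data lives under `/DockerVol/<service>` as in Volume Setup, is a timestamped tarball per service. `backup_volume` and the destination directory are hypothetical names.

```bash
# Sketch: archive one service directory into a timestamped tarball.
# backup_volume is a hypothetical helper; the destination is an assumption.
backup_volume() {
  src="$1"
  dest="$2"
  [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
  mkdir -p "$dest" || return 1
  archive="$dest/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C keeps archive paths relative to the parent of the service directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}
```

A call such as `backup_volume /DockerVol/grafana /backups/monitoring` prints the path of the archive it created. Archiving a live Prometheus data directory may capture it mid-write, so scaling the service down first is likely the safer choice.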
||||||
|
|
||||||
### Restore
|
### Restore
|
||||||
Restore the services by running `./deploy.sh` after any changes to the configuration.
|
```bash
|
||||||
|
cd services/swarm/stack/monitoring
|
||||||
|
./deploy.sh
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Common Failures
|
## Common Failures
|
||||||
|
| Failure Mode | Symptoms | Cause | Fix |
|
||||||
| Failure | Symptom | Cause | Fix |
|
|-------------|----------|-------|------|
|
||||||
|--------|---------|------|-----|
|
| Prometheus down | No metrics available in Grafana | Prometheus service not running | Check Prometheus configuration and restart service |
|
||||||
- **No metrics from Prometheus** | No metrics displayed in Grafana or Alertmanager. | Prometheus not running. | Restart Prometheus service: `docker service restart monitoring prometheus` |
|
| Cadvisor unavailable | No container metrics available | Cadvisor not running | Check Cadvisor logs for errors and restart service |
|
||||||
- **Alerts not forwarding to Alertmanager** | Alerts not sent to Alertmanager. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
|
|
||||||
- **Grafana dashboard not displaying data** | Grafana dashboards not displaying data from Prometheus. | Prometheus configuration incorrect. | Check Prometheus configuration and adjust as needed. |
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -118,15 +118,14 @@ Restore the services by running `./deploy.sh` after any changes to the configura
|
||||||
|
|
||||||
| Date | Commit | Summary |
|
| Date | Commit | Summary |
|
||||||
|------|--------|---------|
|
|------|--------|---------|
|
||||||
| 2026-04-07 | 71e3177f | Initial documentation for monitoring stack in NetGrimoire. |
|
| 2026-04-07 | 9f9ca1ad | Initial deployment of monitoring stack |
|
||||||
| 2026-04-07 | 1df528ca | Added Cadvisor service to collect host metrics from all nodes. |
|
| 2026-04-07 | 71e3177f | Configured Alertmanager to forward alerts to Cadvisor |
|
||||||
| 2026-04-07 | af94e455 | Updated Alertmanager configuration to forward alerts to Prometheus. |
|
|
||||||
| 2026-04-07 | 04863ab6 | Improved Grafana dashboard display with Prometheus data. |
|
|
||||||
| 2026-04-07 | 0af60dbe | Added backup and restore procedures for critical data. |
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
- Generated by Gremlin on 2026-04-08T02:08:17.740Z
|
- Generated by Gremlin on 2026-04-08T03:34:50.852Z
|
||||||
- Source: swarm/monitoring.yaml
|
- Source: swarm/monitoring.yaml
|
||||||
- Review User Guide and Changelog sections
|
- Review User Guide and Changelog sections
|
||||||