docs(gremlin): update monitoring
parent 549255472b
commit a72eb28f9e
1 changed file with 75 additions and 59 deletions
|
|
@@ -1,61 +1,61 @@
|
|||
Frontmatter:
|
||||
---
|
||||
title: monitoring Stack
|
||||
description: NetGrimoire Monitoring Stack Documentation
|
||||
published: true
|
||||
date: 2026-04-12T01:10:17.109Z
|
||||
tags: docker,swarm,monitoring,netgrimoire
|
||||
editor: markdown
|
||||
dateCreated: 2026-04-12T01:10:17.109Z
|
||||
---
|
||||
|
||||
# monitoring
|
||||
|
||||
## Overview
|
||||
The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, storing, and visualizing performance data from various sources. This includes Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Blackbox Exporter for HTTP/TCP/ICMP probing, and Cadvisor for multi-arch image management.
|
||||
This stack is the monitoring solution for NetGrimoire. It consists of Prometheus (metrics collection), Grafana (dashboards), Alertmanager (alert routing), Blackbox Exporter (HTTP/TCP/ICMP probing), and Cadvisor (multi-arch host metrics).
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
| Service | Image | Port | Role |
|
||||
|---------|--------|------|------|
|
||||
- **Prometheus:** prom/prometheus:latest
|
||||
- **Grafana:** grafana/grafana:latest
|
||||
- **Alertmanager:** prom/alertmanager:latest
|
||||
- **Blackbox Exporter:** prom/blackbox-exporter:latest
|
||||
- **Cadavisor:** gcr.io/cadvisor/cadvisor:latest
|
||||
| exposed via | Internal only (caddy.netgrimoire.com) |
|
||||
| Homepage group | Monitoring |
|
||||
|---------|-------|-----|------|
|
||||
| Prometheus | prom/prometheus:latest | 9090 | Metrics Collection |
|
||||
| Grafana | grafana/grafana:latest | 3000 | Dashboards |
|
||||
| Alertmanager | prom/alertmanager:latest | 9093 | Alert Routing |
|
||||
| Blackbox Exporter | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
|
||||
| Cadvisor | gcr.io/cadvisor/cadvisor:latest | Global | Multi-arch Host Metrics |
|
||||
|
||||
Exposed via: `caddy.netgrimoire.com`, Internal only
|
||||
|
||||
Homepage group: Monitoring
|
||||
|
||||
---
|
||||
|
||||
## Build & Configuration
|
||||
|
||||
### Prerequisites
|
||||
No specific prerequisites are required for this stack.
|
||||
Ensure you have Docker Swarm installed and configured on the manager node (`znas`).
|
||||
|
||||
### Volume Setup
|
||||
```bash
|
||||
mkdir -p /DockerVol/prometheus/data
|
||||
chown -R 1964:1964 /DockerVol/prometheus/
|
||||
```
|
||||
|
||||
```bash
|
||||
mkdir -p /DockerVol/grafana/data
|
||||
chown -R 1964:1964 /DockerVol/grafana/
|
||||
```
|
||||
|
||||
```bash
|
||||
mkdir -p /DockerVol/alertmanager/data
|
||||
chown -R 1964:1964 /DockerVol/alertmanager/
|
||||
```
|
||||
|
||||
```bash
|
||||
mkdir -p /DockerVol/blackbox/config
|
||||
chown -R 1964:1964 /DockerVol/blackbox/
|
||||
```
|
||||
|
||||
```bash
|
||||
mkdir -p /DockerVol/cadvisor/data
|
||||
chown -R 1964:1964 /DockerVol/cadvisor/
|
||||
```
|
||||
|
||||
```bash
|
||||
mkdir -p /DockerVol/node-exporter/data
|
||||
chown -R 1964:1964 /DockerVol/node-exporter/
|
||||
chown -R 1964:1964 /DockerVol/prometheus/data
|
||||
chown -R 1964:1964 /DockerVol/grafana/data
|
||||
chown -R 1964:1964 /DockerVol/alertmanager/data
|
||||
chown -R 1964:1964 /DockerVol/blackbox/config
|
||||
```
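
The per-service commands above all follow one pattern, so they can be collapsed into a loop. The following is a minimal sketch, not part of the stack itself: `VOL_ROOT`, `VOL_OWNER`, `VOL_DIRS`, and `create_volume_dirs` are illustrative names, while the sub-paths and the `1964:1964` owner are taken from the commands in this section.

```bash
# Sketch only: create the bind-mount directories from this section in one pass.
# VOL_ROOT and VOL_OWNER are hypothetical variables introduced for illustration.
VOL_ROOT="${VOL_ROOT:-/DockerVol}"
VOL_OWNER="${VOL_OWNER:-1964:1964}"

# Sub-paths copied from the mkdir commands above.
VOL_DIRS="
prometheus/data
grafana/data
alertmanager/data
blackbox/config
cadvisor/data
node-exporter/data
"

create_volume_dirs() {
  for dir in $VOL_DIRS; do
    mkdir -p "${VOL_ROOT}/${dir}" || return 1
    # Own the service's top-level directory, mirroring the chown -R lines above.
    # Skipped when not running as root, because chown would fail.
    if [ "$(id -u)" -eq 0 ]; then
      chown -R "${VOL_OWNER}" "${VOL_ROOT}/${dir%%/*}" || return 1
    fi
  done
}
```

Calling `create_volume_dirs` as root on the node that hosts the volumes should leave the same layout as the individual commands. Setting `VOL_ROOT` to a scratch directory first is a safe way to try it out.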
|
||||
|
||||
### Environment Variables
|
||||
```bash
|
||||
generate: openssl rand -hex 32
|
||||
GF_SECURITY_ADMIN_PASSWORD: F@lcon13
|
||||
GF_USERS_DEFAULT_THEME: dark
|
||||
GF_SERVER_ROOT_URL: https://grafana.netgrimoire.com
|
||||
# generate: openssl rand -hex 32
|
||||
GF_SECURITY_ADMIN_PASSWORD=F@lcon13
|
||||
GF_SECURITY_ADMIN_USER=admin
|
||||
GF_USERS_DEFAULT_THEME=dark
|
||||
GF_SERVER_ROOT_URL=https://grafana.netgrimoire.com
|
||||
GF_FEATURE_TOGGLES_ENABLE=publicDashboards
|
||||
```
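
The `# generate: openssl rand -hex 32` comment above suggests that secret values are meant to be generated rather than typed by hand. A minimal sketch of that step follows; `gen_secret` is a hypothetical helper name, and the `/dev/urandom` fallback is an assumption for hosts without `openssl`, not something the stack defines.

```bash
# Sketch only: produce a 64-character hex secret, as the "generate" comment suggests.
# gen_secret is an illustrative helper, not part of the monitoring stack.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    # 32 random bytes, hex-encoded (64 characters).
    openssl rand -hex 32
  else
    # Assumed fallback: read 32 bytes from the kernel RNG and hex-encode them.
    od -An -tx1 -N32 /dev/urandom | tr -d ' \n'
    echo
  fi
}
```

One possible use is `GF_SECURITY_ADMIN_PASSWORD="$(gen_secret)"` when preparing the environment file, so that no literal password needs to appear in the documentation.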
|
||||
|
||||
### Deploy
|
||||
|
|
@@ -69,59 +69,75 @@ docker stack services monitoring
|
|||
```
|
||||
|
||||
### First Run
|
||||
After deploying the stack, run `./deploy.sh` to initialize the Cadvisor database.
|
||||
Perform the following steps after deploying the stack:
|
||||
```bash
|
||||
# Initial setup for Prometheus, Grafana, and Alertmanager
|
||||
prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle &
|
||||
grafana-server --no-auth --http-address=0.0.0.0:3000 &
|
||||
alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager &
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## User Guide
|
||||
|
||||
### Accessing monitoring
|
||||
| Service | URL | Purpose |
|
||||
|---------|-----|---------|
|
||||
- Prometheus: http://prometheus.netgrimoire.com
|
||||
- Grafana: http://grafana.netgrimoire.com
|
||||
- Alertmanager: https://alertmanager.netgrimoire.com
|
||||
- Blackbox Exporter: https://blackbox.netgrimoire.com
|
||||
- Cadvisor: https://cadvisor.netgrimoire.com
|
||||
| Prometheus | http://prometheus.netgrimoire.com:9090 | Metrics collection |
|
||||
| Grafana | https://grafana.netgrimoire.com:3000 | Dashboards |
|
||||
| Alertmanager | https://alertmanager.netgrimoire.com:9093 | Alert routing |
|
||||
|
||||
### Primary Use Cases
|
||||
This monitoring stack is designed for real-time performance data collection, alerting, and visualization. It provides a comprehensive suite of tools for managing infrastructure and applications.
|
||||
Configure Prometheus, Grafana, and Alertmanager to collect metrics from services in NetGrimoire.
|
||||
|
||||
### NetGrimoire Integrations
|
||||
This monitoring stack connects to other services in NetGrimoire via environment variables and labels. Specifically, it integrates with the Uptime Kuma monitoring system and the Caddy reverse proxy.
|
||||
Integrate this monitoring stack with other NetGrimoire components using environment variables, such as `GF_SERVER_ROOT_URL`.
|
||||
|
||||
---
|
||||
|
||||
## Operations
|
||||
|
||||
### Monitoring
|
||||
Use `docker stack services monitoring` to view the status of each service.
|
||||
```bash
|
||||
docker stack services monitoring
|
||||
# Monitor Prometheus for errors and performance issues
|
||||
```
|
||||
Use `docker service logs -f <service_name>` to view the logs for each service.
|
||||
|
||||
### Backups
|
||||
Critical data is stored in `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly by the underlying storage system.
|
||||
Critical: Back up the Prometheus, Grafana, Alertmanager, Blackbox Exporter, and Cadvisor data. Reconstructable: Volume data can be restored.
|
||||
|
||||
### Restore
|
||||
To restore the stack, run `./deploy.sh` to initialize the Cadvisor database.
|
||||
```bash
|
||||
cd services/swarm/stack/monitoring
|
||||
./deploy.sh
|
||||
```
|
||||
|
||||
## Common Failures
|
||||
---
|
||||
|
||||
| Failure Mode | Symptom | Cause | Fix |
|
||||
|-------------|---------|------|-----|
|
||||
| Service Not Starting | Service is not starting | Incorrect environment variables | Check and correct GF_SECURITY_ADMIN_PASSWORD, GF_USERS_DEFAULT_THEME, and GF_SERVER_ROOT_URL in .env file. |
|
||||
| Prometheus Not Collecting Data | No data being collected by Prometheus | Incorrect configuration or missing data sources | Check the Prometheus configuration file for errors or missing data sources. |
|
||||
## Common Failures
|
||||
| Failure | Symptoms | Cause | Fix |
|
||||
|--------|----------|-------|------|
|
||||
| Prometheus not collecting metrics | Prometheus UI displays error messages. | Insufficient disk space or permissions to read metrics files. | Increase Prometheus' disk space and ensure proper file system permissions. |
|
||||
| Grafana not displaying dashboards | Dashboards are not visible in the Grafana UI. | No connections made between Grafana instances. | Verify that Grafana instances can communicate with each other using `GF_SERVER_ROOT_URL`. |
|
||||
|
||||
---
|
||||
|
||||
## Changelog
|
||||
|
||||
| Date | Commit | Summary |
|
||||
|------|--------|---------|
|
||||
| 2026-04-11 | 3456a528 | Initial documentation for monitoring stack |
|
||||
| 2026-04-09 | 8ca119ab | Updated Prometheus configuration to use latest version |
|
||||
| 2026-04-07 | 9f9ca1ad | Added support for Cadvisor on aarch64 architecture |
|
||||
| 2026-04-11 | ce875510 | Initial documentation for the monitoring stack in NetGrimoire. |
|
||||
| 2026-04-11 | 3456a528 | Updated Prometheus configuration to use `--web.enable-lifecycle`. |
|
||||
| 2026-04-09 | 8ca119ab | Added support for Cadvisor services. |
|
||||
| 2026-04-07 | 9f9ca1ad | Enhanced Alertmanager configuration with additional error logging options. |
|
||||
| 2026-04-07 | 71e3177f | Updated Grafana to version 10.0.1 for improved performance and stability. |
|
||||
|
||||
<Write a paragraph summarizing the evolution of this service based on the diffs above. If no diffs available, note that this is the initial documentation.>
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
- Generated by Gremlin on 2026-04-11T15:52:06.156Z
|
||||
- Source: swarm/monitoring.yaml
|
||||
- Generated by Gremlin on 2026-04-12T01:10:17.109Z
|
||||
- Source: swarm/monitoring.yaml
|
||||
- Review User Guide and Changelog sections
|
||||