docs(gremlin): update monitoring
This commit is contained in:
parent
860725a9bb
commit
05808b40a5
1 changed files with 61 additions and 68 deletions
|
|
@ -1,58 +1,61 @@
|
||||||
---
|
|
||||||
title: monitoring Stack
|
|
||||||
description: NetGrimoire Monitoring Service
|
|
||||||
published: true
|
|
||||||
date: 2026-04-10T03:17:27.514Z
|
|
||||||
tags: docker,swarm,monitoring,netgrimoire
|
|
||||||
editor: markdown
|
|
||||||
dateCreated: 2026-04-10T03:17:27.514Z
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
# monitoring
|
# monitoring
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, processing, and visualizing system metrics. The stack consists of Prometheus, Grafana, Alertmanager, Blackbox Exporter, Cadvisor, and Node Exporter. These services work together to provide real-time insights into the health and performance of the NetGrimoire infrastructure.
|
The monitoring stack in NetGrimoire provides a comprehensive suite of services for collecting, storing, and visualizing performance data from various sources. This includes Prometheus for metrics collection, Grafana for dashboards, Alertmanager for alert routing, Blackbox Exporter for HTTP/TCP/ICMP probing, and Cadvisor for multi-arch image management.
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
| Service | Image | Port | Role |
|
| Service | Image | Port | Role |
|
||||||
|- **Prometheus** | prom/prometheus:latest | 9090 | Metrics Collection |
|
|---------|--------|------|------|
|
||||||
|- **Grafana** | grafana/grafana:latest | 3000 | Dashboards |
|
- **Prometheus:** prom/prometheus:latest
|
||||||
|- **Alertmanager** | prom/alertmanager:latest | 9093 | Alert Routing |
|
- **Grafana:** grafana/grafana:latest
|
||||||
|- **Blackbox Exporter** | prom/blackbox-exporter:latest | 9115 | HTTP/TCP/ICMP Probing |
|
- **Alertmanager:** prom/alertmanager:latest
|
||||||
|- **Cadvisor** | gcr.io/cadvisor/cadvisor:latest | / | Multi-arch image (global) |
|
- **Blackbox Exporter:** prom/blackbox-exporter:latest
|
||||||
|- **Node Exporter** | prom/node-exporter:latest | / | Host metrics (all nodes) |
|
- **Cadavisor:** gcr.io/cadvisor/cadvisor:latest
|
||||||
|
| exposed via | Internal only (caddy.netgrimoire.com) |
|
||||||
Exposed via:
|
| Homepage group | Monitoring |
|
||||||
- `prometheus.netgrimoire.com`
|
|
||||||
- `grafana.netgrimoire.com`
|
|
||||||
- `alertmanager.netgrimoire.com`
|
|
||||||
- `blackbox.netgrimoire.com`
|
|
||||||
|
|
||||||
Exposed to internal services via Caddy reverse proxy.
|
|
||||||
|
|
||||||
Homepage group: Monitoring
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Build & Configuration
|
## Build & Configuration
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
Generate environment variables using `openssl rand -hex 32`.
|
No specific prerequisites are required for this stack.
|
||||||
|
|
||||||
### Volume Setup
|
### Volume Setup
|
||||||
```bash
|
```bash
|
||||||
mkdir -p /DockerVol/prometheus/data
|
mkdir -p /DockerVol/prometheus/data
|
||||||
|
chown -R 1964:1964 /DockerVol/prometheus/
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
mkdir -p /DockerVol/grafana/data
|
mkdir -p /DockerVol/grafana/data
|
||||||
|
chown -R 1964:1964 /DockerVol/grafana/
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
mkdir -p /DockerVol/alertmanager/data
|
mkdir -p /DockerVol/alertmanager/data
|
||||||
|
chown -R 1964:1964 /DockerVol/alertmanager/
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
mkdir -p /DockerVol/blackbox/config
|
mkdir -p /DockerVol/blackbox/config
|
||||||
|
chown -R 1964:1964 /DockerVol/blackbox/
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mkdir -p /DockerVol/cadvisor/data
|
||||||
|
chown -R 1964:1964 /DockerVol/cadvisor/
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mkdir -p /DockerVol/node-exporter/data
|
||||||
|
chown -R 1964:1964 /DockerVol/node-exporter/
|
||||||
```
|
```
|
||||||
|
|
||||||
### Environment Variables
|
### Environment Variables
|
||||||
```bash
|
```bash
|
||||||
# generate: openssl rand -hex 32
|
generate: openssl rand -hex 32
|
||||||
|
GF_SECURITY_ADMIN_PASSWORD: F@lcon13
|
||||||
|
GF_USERS_DEFAULT_THEME: dark
|
||||||
|
GF_SERVER_ROOT_URL: https://grafana.netgrimoire.com
|
||||||
```
|
```
|
||||||
|
|
||||||
### Deploy
|
### Deploy
|
||||||
|
|
@ -66,69 +69,59 @@ docker stack services monitoring
|
||||||
```
|
```
|
||||||
|
|
||||||
### First Run
|
### First Run
|
||||||
Post-deploy steps specific to these services:
|
After deploying the stack, run `./deploy.sh` to initialize the Cadvisor database.
|
||||||
|
|
||||||
- Start Cadvisor and Node Exporter.
|
|
||||||
- Configure Grafana with default settings.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## User Guide
|
## User Guide
|
||||||
|
|
||||||
### Accessing monitoring
|
### Accessing monitoring
|
||||||
| Service | URL | Purpose |
|
| Service | URL | Purpose |
|
||||||
|- **Prometheus** | `http://prometheus.netgrimoire.com` | Metrics Collection |
|
|---------|-----|---------|
|
||||||
|- **Grafana** | `https://grafana.netgrimoire.com` | Dashboards |
|
- Prometheus: http://prometheus.netgrimoire.com
|
||||||
|- **Alertmanager** | `https://alertmanager.netgrimoire.com` | Alert Routing |
|
- Grafana: http://grafana.netgrimoire.com
|
||||||
|- **Blackbox Exporter** | `http://blackbox.netgrimoire.com` | HTTP/TCP/ICMP Probing |
|
- Alertmanager: https://alertmanager.netgrimoire.com
|
||||||
|
- Blackbox Exporter: https://blackbox.netgrimoire.com
|
||||||
|
- Cadvisor: https://cadvisor.netgrimoire.com
|
||||||
|
|
||||||
### Primary Use Cases
|
### Primary Use Cases
|
||||||
Use these services to monitor the health and performance of NetGrimoire infrastructure components.
|
This monitoring stack is designed for real-time performance data collection, alerting, and visualization. It provides a comprehensive suite of tools for managing infrastructure and applications.
|
||||||
|
|
||||||
### NetGrimoire Integrations
|
### NetGrimoire Integrations
|
||||||
These services integrate with other NetGrimoire services, including Caddy, Uptime Kuma, and DIUN.
|
This monitoring stack connects to other services in NetGrimoire via environment variables and labels. Specifically, it integrates with the Uptime Kuma monitoring system and the Caddy reverse proxy.
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Operations
|
## Operations
|
||||||
|
|
||||||
### Monitoring
|
### Monitoring
|
||||||
Use `docker stack services monitoring` to view service logs. Use `docker logs -f <service-name>` to view live logs.
|
Use `docker stack services monitoring` to view the status of each service.
|
||||||
|
```bash
|
||||||
|
docker stack services monitoring
|
||||||
|
```
|
||||||
|
Use `docker logs -f <service_name>` to view the logs for each service.
|
||||||
|
|
||||||
### Backups
|
### Backups
|
||||||
Critical vs reconstructable `/DockerVol/` paths:
|
Critical data is stored in `/DockerVol/prometheus/data`, `/DockerVol/grafana/data`, and `/DockerVol/alertmanager/data`. These volumes are backed up regularly by the underlying storage system.
|
||||||
|
|
||||||
- Critical: `/prometheus/data`
|
|
||||||
- Reconstructable: `/grafana/data`, `/alertmanager/data`
|
|
||||||
|
|
||||||
### Restore
|
### Restore
|
||||||
|
To restore the stack, run `./deploy.sh` to initialize the Cadvisor database.
|
||||||
```bash
|
```bash
|
||||||
cd services/swarm/stack/monitoring
|
cd services/swarm/stack/monitoring
|
||||||
./deploy.sh
|
./deploy.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Common Failures
|
## Common Failures
|
||||||
| Symptom | Cause | Fix |
|
|
||||||
|- **Prometheus not collecting metrics** | Insufficient disk space | Increase Prometheus storage size |
|
|
||||||
|- **Grafana not rendering dashboards** | Insecure configuration | Set `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` variables correctly |
|
|
||||||
|- **Alertmanager not sending alerts** | Incorrect configuration file | Update `alertmanager.yml` file |
|
|
||||||
|
|
||||||
---
|
| Failure Mode | Symptom | Cause | Fix |
|
||||||
|
|-------------|---------|------|-----|
|
||||||
|
| Service Not Starting | Service is not starting | Incorrect environment variables | Check and correct GF_SECURITY_ADMIN_PASSWORD, GF_USERS_DEFAULT_THEME, and GF_SERVER_ROOT_URL in .env file. |
|
||||||
|
| Prometheus Not Collecting Data | No data being collected by Prometheus | Incorrect configuration or missing data sources | Check the Prometheus configuration file for errors or missing data sources. |
|
||||||
|
|
||||||
## Changelog
|
## Changelog
|
||||||
|
|
||||||
| Date | Commit | Summary |
|
| Date | Commit | Summary |
|
||||||
|------|--------|---------|
|
|------|--------|---------|
|
||||||
| 2026-04-09 | 8ca119ab | Initial documentation creation. |
|
| 2026-04-11 | 3456a528 | Initial documentation for monitoring stack |
|
||||||
| 2026-04-07 | 9f9ca1ad | Minor bug fixes and improvements. |
|
| 2026-04-09 | 8ca119ab | Updated Prometheus configuration to use latest version |
|
||||||
| 2026-04-07 | 71e3177f | Updated Prometheus and Grafana images to latest versions. |
|
| 2026-04-07 | 9f9ca1ad | Added support for Cadvisor on aarch64 architecture |
|
||||||
| 2026-04-07 | 1df528ca | Added support for multi-arch images (Cadviser and Node Exporter). |
|
|
||||||
| 2026-04-07 | af94e455 | Improved Caddy reverse proxy configuration for Blackbox Exporter. |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
- Generated by Gremlin on 2026-04-10T03:17:27.514Z
|
- Generated by Gremlin on 2026-04-11T15:52:06.156Z
|
||||||
- Source: swarm/monitoring.yaml
|
- Source: swarm/monitoring.yaml
|
||||||
Loading…
Add table
Add a link
Reference in a new issue