docs: create docker_dns_issues

Administrator 2026-02-13 15:45:01 +00:00 committed by John Smith
parent 97c8135fc8
commit 24a05fa08e

docker_dns_issues.md

@ -0,0 +1,612 @@
---
title: Docker DNS Fix
description: Override docker VIP for dns
published: true
date: 2026-02-13T15:44:48.521Z
tags:
editor: markdown
dateCreated: 2026-02-13T15:44:48.521Z
---
# Docker Swarm DNS Fix: endpoint_mode dnsrr
## Problem Overview
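In Docker Swarm's default `vip` endpoint mode, the embedded DNS server resolves each service name to a stable Virtual IP (VIP), and Swarm load-balances traffic from that VIP to the current task containers. When containers restart frequently or get new IP addresses, the entries behind the VIP can go stale and keep pointing at addresses that no longer exist. Clients then see connection timeouts and "connection pool full" errors.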
**Common symptoms:**
- Services cannot connect to databases despite correct configuration
- DNS resolves to wrong/old IP addresses
- "Knex: Timeout acquiring a connection" errors
- Issues worsen with frequent container restarts/rebuilds
- Problems occur across all nodes (not architecture-specific)
## The Solution: endpoint_mode dnsrr
`endpoint_mode: dnsrr` (DNS Round Robin) bypasses Swarm's VIP layer entirely. DNS queries for the service name return the IP addresses of the running task containers directly, so there is no intermediate VIP mapping to go stale.
**Benefits:**
- No stale DNS entries
- Each DNS lookup returns the current task IPs
- Works with existing overlay networks
- No additional software required
- Can be implemented gradually
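To see the difference between the two modes on a running cluster, you can compare what a service name and its `tasks.<service>` name resolve to. The sketch below is illustrative only: `compare_dns` is a hypothetical helper name, and it assumes the container has `nslookup` available (as the `alpine` image does). In `vip` mode the plain name returns a single virtual IP while `tasks.<service>` returns the task IPs; in `dnsrr` mode both should return task IPs.

```bash
#!/usr/bin/env bash
# compare_dns CONTAINER SERVICE
# Hypothetical helper: prints what SERVICE and tasks.SERVICE resolve to
# from inside CONTAINER, which must be attached to the same overlay network.
compare_dns() {
  local container="$1" service="$2"
  echo "== ${service} =="
  docker exec "$container" nslookup "$service"
  echo "== tasks.${service} =="
  docker exec "$container" nslookup "tasks.${service}"
}

# Usage (placeholders as elsewhere in this guide):
#   compare_dns <container-id> <service-name>
```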
## Implementation Guide
### Step 1: Update Docker Daemon (ALL Nodes)
This step applies to every container created after the daemon restarts, so it should be done on every node in your Swarm cluster.
**On each node:**
```bash
sudo nano /etc/docker/daemon.json
```
**Add these keys (merge them into any existing settings rather than overwriting the file):**
```json
{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opt": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```
**Configuration explanation:**
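- `dns`: Upstream resolvers (Google and Cloudflare) that Docker's embedded DNS forwards to for names it cannot resolve itself
- `ndots:0`: Treats every name as absolute and queries it as-is first, before any search-domain suffixes are tried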
- `timeout:2`: 2-second timeout per DNS query
- `attempts:3`: Retry up to 3 times
- `log-opts`: Prevents logs from filling disk (optional but recommended)
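A syntax error in `daemon.json` prevents the Docker daemon from starting, so it is worth checking the file before the restart. The sketch below assumes `jq` or `python3` is installed on the node; `validate_daemon_json` is a hypothetical helper name, not a Docker command.

```bash
# validate_daemon_json [FILE]
# Returns 0 if FILE parses as JSON, 1 if it is missing or does not parse,
# and 2 if neither jq nor python3 is available to check it.
validate_daemon_json() {
  local file="${1:-/etc/docker/daemon.json}"
  if command -v jq >/dev/null 2>&1; then
    jq . "$file" >/dev/null 2>&1 || return 1
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$file" >/dev/null 2>&1 || return 1
  else
    echo "neither jq nor python3 found" >&2
    return 2
  fi
  return 0
}

# Usage: only restart when the file is valid
#   validate_daemon_json && sudo systemctl restart docker
```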
**Restart Docker (one node at a time):**
```bash
sudo systemctl restart docker
```
⚠️ **Important:** Wait 2-3 minutes between node restarts for services to stabilize.
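Instead of waiting a fixed interval, one option is to poll until every service reports its desired replica count before moving on to the next node. The following is a minimal sketch intended to run on a manager node; the function names and the 180-second default timeout are illustrative choices, not Docker defaults.

```bash
# unconverged_services: reads "NAME RUNNING/DESIRED" lines on stdin, prints the
# names whose running count differs from the desired count, and exits non-zero
# if it printed any.
unconverged_services() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) { print $1; bad = 1 } }
       END { exit bad + 0 }'
}

# wait_for_convergence [TIMEOUT_SECONDS]
# Polls `docker service ls` every 5 seconds until all services have converged.
# Returns 1 if the timeout (default 180) is reached first.
wait_for_convergence() {
  local timeout="${1:-180}" waited=0
  while ! docker service ls --format '{{.Name}} {{.Replicas}}' \
      | unconverged_services >/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  return 0
}
```

After restarting Docker on one node, running `wait_for_convergence` on a manager and checking its exit status gives a clearer signal than a fixed pause that the next node can be restarted.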
### Step 2: Update Docker Compose Files
Add `endpoint_mode: dnsrr` to the `deploy:` section of each service in your compose files.
**Before:**
```yaml
services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any
```
**After:**
```yaml
services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      endpoint_mode: dnsrr # ADD THIS LINE
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any
```
### Step 3: Redeploy Services
After updating compose files, redeploy each stack:
```bash
docker stack deploy -c your-compose-file.yml your-stack-name
```
**Note:** You can do this gradually. Services without `dnsrr` will continue working (but may still have DNS issues).
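One thing to check before switching a service: Swarm does not allow `dnsrr` on a service that publishes ports through the ingress routing mesh. The sketch below is a hypothetical helper (`has_ingress_ports` is not a Docker command) that reads the `PublishMode` of each entry under `.Spec.EndpointSpec.Ports` from `docker service inspect`.

```bash
# has_ingress_ports SERVICE
# Returns 0 if SERVICE publishes at least one port in ingress mode,
# 1 if it does not, and 2 if the service could not be inspected.
has_ingress_ports() {
  local service="$1" modes
  modes="$(docker service inspect "$service" \
    --format '{{range .Spec.EndpointSpec.Ports}}{{.PublishMode}} {{end}}')" || return 2
  case " $modes " in
    *" ingress "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   has_ingress_ports <service-name> && echo "publishes ingress ports"
```

If the helper reports ingress ports, those ports would need to be published with `mode: host` (or served through a reverse proxy) before `endpoint_mode: dnsrr` is added.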
## Complete Example: Database + Application
```yaml
version: "3.8"
networks:
  app-network:
    external: true
services:
  database:
    image: postgres:16-alpine
    networks:
      - app-network
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
    volumes:
      - /data/postgres:/var/lib/postgresql/data
    deploy:
      endpoint_mode: dnsrr # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s
  application:
    image: myapp:latest
    networks:
      - app-network
    environment:
      DB_HOST: database # Uses service name
      DB_PORT: "5432"
      DB_USER: appuser
      DB_PASS: secret
      DB_NAME: myapp
    deploy:
      endpoint_mode: dnsrr # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s
```
## Adding New Nodes to Your Swarm
When adding new nodes to your cluster, follow these steps to ensure DNS works correctly:
### 1. Prepare the New Node
**Before joining Swarm, configure Docker daemon:**
```bash
# On the new node
sudo nano /etc/docker/daemon.json
```
Add the same configuration as existing nodes:
```json
{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opt": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```
**Restart Docker:**
```bash
sudo systemctl restart docker
```
### 2. Join the Node to Swarm
**On a manager node, get the join token:**
```bash
# For worker nodes
docker swarm join-token worker
# For manager nodes
docker swarm join-token manager
```
**On the new node, run the join command:**
```bash
docker swarm join --token SWMTKN-xxx-xxx manager-ip:2377
```
### 3. Verify the Node
**On a manager node:**
```bash
# Check node is visible
docker node ls
# Check node status
docker node inspect <node-name> --pretty
```
### 4. Add Node Labels (if needed)
If you use placement constraints based on labels:
```bash
# Add CPU architecture label
docker node update --label-add cpu=arm <node-name>
# or
docker node update --label-add cpu=x86 <node-name>
# Add custom labels as needed
docker node update --label-add role=database <node-name>
```
### 5. Test Network Connectivity
**Deploy a test service on the new node:**
```bash
docker service create \
  --name test-dns \
  --constraint 'node.hostname==<new-node-name>' \
  --network <network-name> \
  alpine sleep 3600
```
**Test DNS resolution from the test service (run these on the new node, where the container is running):**
```bash
# Get container ID
docker ps | grep test-dns
# Test DNS lookup of existing service
docker exec <container-id> nslookup <existing-service-name>
# Test connectivity
docker exec <container-id> ping -c 3 <existing-service-name>
```
**Clean up:**
```bash
docker service rm test-dns
```
## Troubleshooting
### Issue: Service can't resolve DNS
**Symptoms:**
- `nslookup` or `ping` fails to resolve service names
- Connection timeouts
**Diagnosis:**
```bash
# Check if service is running
docker service ls | grep <service-name>
# Check service details
docker service inspect <service-name>
# Get container ID
docker ps -f name=<service-name>
# Test DNS from inside container
docker exec <container-id> nslookup <target-service-name>
docker exec <container-id> cat /etc/resolv.conf
```
**Solutions:**
1. **Verify both services are on the same network:**
```bash
docker network inspect <network-name>
```
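Both services should appear in the containers list. The list only shows containers running on the node where you run the command, so check each node that hosts a task.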
2. **Check if endpoint_mode is set:**
```bash
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
```
Should return `dnsrr` or `vip`.
3. **Restart the service:**
```bash
docker service update --force <service-name>
```
4. **Check Docker daemon config is correct:**
```bash
cat /etc/docker/daemon.json
sudo systemctl status docker
```
### Issue: DNS resolves to wrong IP
**Symptoms:**
- DNS returns an old/incorrect IP address
- Service was recently restarted and got a new IP
**Diagnosis:**
```bash
# Find the actual container IP
docker inspect <container-id> | grep IPAddress
# Check what DNS returns
docker exec <other-container> nslookup <service-name>
# Compare the two
```
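The comparison can be scripted. This sketch assumes BusyBox-style `nslookup` output (as in `alpine`), where the answer records follow a `Name:` line; `resolved_ips` and `ip_is_current` are hypothetical helper names introduced here.

```bash
# resolved_ips: reads nslookup output on stdin and prints the IPv4 addresses
# from the answer section (Address lines after the first "Name:" line), so the
# resolver's own address at the top of the output is skipped.
resolved_ips() {
  awk '
    /^Name:/ { answer = 1; next }
    answer && /^Address/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print $i
    }'
}

# ip_is_current ACTUAL_IP: succeeds if ACTUAL_IP is one of the addresses on stdin.
ip_is_current() {
  grep -qxF "$1"
}

# Usage:
#   docker exec <other-container> nslookup <service-name> \
#     | resolved_ips | ip_is_current <actual-ip> || echo "DNS is stale"
```

In `vip` mode the service name resolves to the virtual IP rather than a container IP, so the check only makes sense against `tasks.<service-name>` or for services already running in `dnsrr` mode.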
**Solutions:**
1. **Verify endpoint_mode is dnsrr:**
```bash
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
```
2. **Force update to refresh DNS:**
```bash
docker service update --force <service-name>
```
3. **If still wrong, check for stale network entries:**
```bash
# Disconnect and reconnect service to network (requires downtime)
docker service update --network-rm <network-name> <service-name>
docker service update --network-add <network-name> <service-name>
```
### Issue: Multiple services have stale DNS
**Symptoms:**
- Widespread DNS issues across cluster
- Affects multiple services/nodes
**Solutions:**
1. **Force update all services (rolling restart; slower, and single-replica services are briefly unavailable while their task restarts):**
```bash
docker service ls --format "{{.Name}}" | while read -r service; do
  echo "Updating $service..."
  docker service update --force "$service"
done
```
2. **Restart Docker on affected nodes (one at a time):**
```bash
# On each node
sudo systemctl restart docker
```
3. **Nuclear option - recreate the network (requires downtime):**
```bash
# Stop all stacks
docker stack ls
docker stack rm <stack-name>
# Remove network
docker network rm <network-name>
# Recreate network
docker network create --driver overlay --attachable <network-name>
# Redeploy stacks
docker stack deploy -c <compose-file> <stack-name>
```
### Issue: New node can't reach services on other nodes
**Symptoms:**
- Services on new node can't connect to services on existing nodes
- DNS works locally but not cross-node
**Diagnosis:**
```bash
# Check if node is properly connected to overlay network
docker network inspect <network-name>
# Verify node is in Swarm
docker node ls
# Check firewall rules (on new node)
sudo iptables -L -n | grep 4789 # VXLAN port
sudo iptables -L -n | grep 7946 # Serf port
```
**Solutions:**
1. **Ensure required ports are open:**
- TCP port 2377 (cluster management)
- TCP/UDP port 7946 (node communication)
- UDP port 4789 (overlay network traffic)
2. **Check MTU settings:**
```bash
# On all nodes, check MTU
ip link show
# If MTU issues, recreate network with explicit MTU
docker network create \
  --driver overlay \
  --attachable \
  --opt com.docker.network.driver.mtu=1450 \
  <network-name>
```
3. **Verify Docker daemon is configured identically:**
```bash
# Compare daemon.json across nodes
cat /etc/docker/daemon.json
```
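To check the TCP ports from solution 1 from another node, a quick probe can be sketched with bash's `/dev/tcp` redirection. `check_tcp_port` and `check_swarm_tcp` are hypothetical helpers; they assume `bash` and the coreutils `timeout` command are available, and they cannot verify the UDP ports (7946/udp and 4789/udp).

```bash
# check_tcp_port HOST PORT: succeeds if a TCP connection opens within 2 seconds.
check_tcp_port() {
  local host="$1" port="$2"
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# check_swarm_tcp HOST: probes the two Swarm TCP ports on HOST and returns
# non-zero if either is unreachable. Port 2377 only listens on manager nodes,
# so it is expected to be unreachable on workers.
check_swarm_tcp() {
  local host="$1" port status=0
  for port in 2377 7946; do
    if check_tcp_port "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
      status=1
    fi
  done
  return "$status"
}

# Usage (from the new node):
#   check_swarm_tcp <manager-ip>
```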
### Issue: Intermittent DNS failures
**Symptoms:**
- DNS works sometimes, fails other times
- No consistent pattern
**Diagnosis:**
```bash
# Check Docker daemon logs
sudo journalctl -u docker -f
# Check for resource constraints
free -h
df -h
# Monitor DNS queries
docker exec <container> sh -c 'for i in 1 2 3 4 5; do nslookup <service-name>; sleep 2; done'
```
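For intermittent failures, a failure count over many attempts tends to be more informative than reading five lookups by eye. The sketch below wraps any command and reports how often it failed; `probe` and the `PROBE_INTERVAL` variable are hypothetical names introduced here, and the approach assumes the wrapped command (for example `nslookup`) exits non-zero when it fails.

```bash
# probe COUNT COMMAND [ARGS...]
# Runs COMMAND COUNT times, sleeping PROBE_INTERVAL seconds (default 2) between
# runs, then prints "FAILED/COUNT failed". Returns non-zero if any run failed.
probe() {
  local count="$1" failed=0 i=0
  shift
  while [ "$i" -lt "$count" ]; do
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "${PROBE_INTERVAL:-2}"
    fi
  done
  echo "${failed}/${count} failed"
  [ "$failed" -eq 0 ]
}

# Usage:
#   probe 20 docker exec <container> nslookup <service-name>
```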
**Solutions:**
1. **Increase DNS timeout in daemon.json:**
```json
{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opt": ["ndots:0", "timeout:5", "attempts:5"]
}
```
2. **Add health checks to services:**
```yaml
services:
  my-service:
    healthcheck:
      # Assumes curl is installed in the image and the app serves /health on port 8080
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
3. **Check for network congestion:**
```bash
# Monitor network traffic
sudo iftop -i docker_gwbridge
```
## Verification and Testing
### Test DNS Resolution
```bash
# From inside a container
docker exec <container-id> nslookup <service-name>
# Should return the current IP of the target service
```
### Test Service Connectivity
```bash
# Ping test
docker exec <container-id> ping -c 3 <service-name>
# Port connectivity test
docker exec <container-id> nc -zv <service-name> <port>
# or
docker exec <container-id> telnet <service-name> <port>
```
### Monitor DNS Changes
```bash
# Run this to watch DNS resolution over time
watch -n 5 'docker exec <container-id> nslookup <service-name>'
```
### Verify Endpoint Mode
```bash
# Check if dnsrr is active
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
# Should return: dnsrr
```
## Migration Strategy
### Gradual Migration (Recommended)
1. **Update Docker daemon on all nodes first** (one at a time)
2. **Fix critical services with DNS issues immediately**
3. **Add `endpoint_mode: dnsrr` to other services during normal maintenance**
4. **No rush - both modes work together**
### Full Migration (If Preferred)
1. **Update Docker daemon on all nodes** (one at a time)
2. **Update all compose files** to include `endpoint_mode: dnsrr`
3. **Redeploy all stacks** during a maintenance window
4. **Test each service** after deployment
## Best Practices
1. **Always use `endpoint_mode: dnsrr` for database services** - They're most affected by stale DNS
2. **Use health checks** to prevent services from accepting traffic before they're ready
3. **Add node labels** for better placement control
4. **Document your network topology** including which services run on which nodes
5. **Keep daemon.json consistent** across all nodes
6. **Monitor DNS resolution** during deployments
7. **Test new nodes** before moving production workloads to them
## Additional Resources
### Useful Commands
```bash
# List all services and their endpoint modes
for service in $(docker service ls --format "{{.Name}}"); do
  mode=$(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')
  printf '%s\t%s\n' "$service" "$mode"
done
# Find services without dnsrr
for service in $(docker service ls --format "{{.Name}}"); do
  mode=$(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')
  if [ "$mode" != "dnsrr" ]; then
    echo "$service: $mode"
  fi
done
# Check which nodes are running which services
docker service ps <service-name> --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"
# View all containers on current node
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```
### Network Inspection
```bash
# View network details
docker network inspect <network-name> --format '{{json .}}' | jq
# List containers attached to the network on this node
docker network inspect <network-name> --format '{{range .Containers}}{{.Name}} {{end}}'
# Check network driver and options
docker network inspect <network-name> --format '{{.Driver}} {{.Options}}'
```
## Summary
The `endpoint_mode: dnsrr` solution:
- ✅ Eliminates stale DNS entries
- ✅ Works with existing infrastructure
- ✅ Requires no additional software
- ✅ Can be implemented gradually
- ✅ Compatible with mixed VIP/dnsrr environments
- ✅ Simple to troubleshoot
By combining the Docker daemon configuration changes with `endpoint_mode: dnsrr`, you create a robust DNS solution that handles frequent restarts and multi-node deployments reliably.