Administrator 24a05fa08e docs: create docker_dns_issues

2026-02-13 15:45:01 +00:00

title	description	published	date	tags	editor	dateCreated
Docker DNS Fix	Override docker VIP for dns	true	2026-02-13T15:44:48.521Z		markdown	2026-02-13T15:44:48.521Z

Docker Swarm DNS Fix: endpoint_mode dnsrr

Problem Overview

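Docker Swarm's overlay networks use an embedded DNS server. In the default mode (endpoint_mode: vip), a service name resolves to a single Virtual IP (VIP), and Swarm's load balancer forwards traffic from that VIP to the current task containers. When containers restart frequently or get new IP addresses, the mapping behind the VIP can go stale, so clients are sent to addresses that no longer exist. The result is connection timeouts and "connection pool full" errors.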

Common symptoms:

Services cannot connect to databases despite correct configuration
DNS resolves to wrong/old IP addresses
"Knex: Timeout acquiring a connection" errors
Issues worsen with frequent container restarts/rebuilds
Problems occur across all nodes (not architecture-specific)

The Solution: endpoint_mode dnsrr

endpoint_mode: dnsrr (DNS Round Robin) bypasses Swarm's VIP layer entirely. DNS queries resolve directly to actual container IPs, eliminating the caching layer that causes stale entries.

Benefits:

No stale DNS entries
Each lookup returns the current container IPs (clients that cache DNS results themselves still need to re-resolve)
Works with existing overlay networks
No additional software required
Can be implemented gradually
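
One way to see the difference between the two modes is to compare the address a lookup returns against the service's virtual IP. The sketch below is illustrative only: `is_vip_answer` is a hypothetical helper name and the addresses are made up. It takes a DNS answer plus the `Addr` values that `docker service inspect` reports in CIDR form and says whether the answer is the VIP. A service running in dnsrr mode should have no virtual IP, so the check fails for every answer.

```shell
# is_vip_answer ANSWER VIP_CIDR...
# Succeeds when ANSWER equals one of the virtual IPs (given in CIDR form).
is_vip_answer() {
  answer=$1
  shift
  for vip in "$@"; do
    # Strip the "/24"-style prefix length before comparing
    [ "${vip%/*}" = "$answer" ] && return 0
  done
  return 1
}

# Made-up addresses: 10.0.1.2 is the VIP, 10.0.1.7 is a task IP
is_vip_answer 10.0.1.2 10.0.1.2/24 && echo "lookup returned the VIP"
is_vip_answer 10.0.1.7 10.0.1.2/24 || echo "lookup returned a task IP"
```

On a manager node, the real VIP list can be read with `docker service inspect <service-name> --format '{{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}'`.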

Implementation Guide

Step 1: Update Docker Daemon (ALL Nodes)

This step benefits all containers immediately and should be done on every node in your Swarm cluster.

On each node:

sudo nano /etc/docker/daemon.json

Merge the following keys into the existing file (replacing the whole file discards any settings already there):

{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opts": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Configuration explanation:

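dns: Upstream resolvers (Google and Cloudflare) used for names that Docker's embedded DNS cannot answer itself
dns-opts: Resolver options passed to containers; note that the daemon.json key is dns-opts (plural), unlike the --dns-opt CLI flag
ndots:0: Tries every name exactly as written before appending search domains, which avoids extra lookups for FQDNs
timeout:2: 2-second timeout per DNS query
attempts:3: Try each query up to 3 times
log-driver / log-opts: Rotates container logs so they do not fill the disk (optional but recommended, unrelated to DNS)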
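
The `ndots` option is easy to misread. By the resolver convention described in resolv.conf(5), a name with at least `ndots` dots is tried as an absolute name before any search domain is appended. The sketch below restates that rule as a small shell function (`tried_absolute_first` is a made-up name for illustration). With `ndots:0` every name qualifies, including single-label service names such as `database`.

```shell
# tried_absolute_first NAME NDOTS
# Succeeds when NAME has at least NDOTS dots, i.e. the resolver
# queries it as written before appending any search domain.
tried_absolute_first() {
  name=$1
  ndots=$2
  # Delete everything except dots, then count what is left
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  [ "$((dots))" -ge "$ndots" ]
}

for n in 0 1; do
  if tried_absolute_first database "$n"; then
    echo "ndots:$n -> 'database' is tried as written first"
  else
    echo "ndots:$n -> search domains are tried first for 'database'"
  fi
done
```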

Restart Docker (one node at a time):

sudo systemctl restart docker

⚠️ Important: Wait 2-3 minutes between node restarts for services to stabilize.
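
Instead of waiting a fixed interval, you can poll until every service reports its desired replica count. The helper below is a sketch under assumptions: `all_converged` is a hypothetical name, and it only parses the `running/desired` strings that `docker service ls --format '{{.Replicas}}'` prints, so it can be tried without a cluster.

```shell
# all_converged: read "running/desired" pairs on stdin, one per line.
# Succeeds only if every pair has running equal to desired.
all_converged() {
  while IFS=/ read -r running desired; do
    # Drop suffixes such as " (max 1 per node)"
    desired=${desired%% *}
    [ "$running" = "$desired" ] || return 1
  done
  return 0
}

# Sample input standing in for real `docker service ls` output
if printf '1/1\n3/3\n' | all_converged; then
  echo "all services converged"
fi
if ! printf '1/1\n0/2\n' | all_converged; then
  echo "still waiting"
fi
```

On a manager node this could be used as `until docker service ls --format '{{.Replicas}}' | all_converged; do sleep 5; done` before moving on to the next node.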

Step 2: Update Docker Compose Files

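Add endpoint_mode: dnsrr to the deploy: section of each service in your compose files. Note that dnsrr cannot be combined with ports published through the ingress routing mesh: a service that publishes ports must either stay on VIP mode or publish them with mode: host.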

Before:

services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any

After:

services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      endpoint_mode: dnsrr  # ADD THIS LINE
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any

Step 3: Redeploy Services

After updating compose files, redeploy each stack:

docker stack deploy -c your-compose-file.yml your-stack-name

Note: You can do this gradually. Services without dnsrr will continue working (but may still have DNS issues).

Complete Example: Database + Application

version: "3.8"

networks:
  app-network:
    external: true

services:
  database:
    image: postgres:16-alpine
    networks:
      - app-network
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
    volumes:
      - /data/postgres:/var/lib/postgresql/data
    deploy:
      endpoint_mode: dnsrr  # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s

  application:
    image: myapp:latest
    networks:
      - app-network
    environment:
      DB_HOST: database  # Uses service name
      DB_PORT: "5432"
      DB_USER: appuser
      DB_PASS: secret
      DB_NAME: myapp
    deploy:
      endpoint_mode: dnsrr  # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s

Adding New Nodes to Your Swarm

When adding new nodes to your cluster, follow these steps to ensure DNS works correctly:

1. Prepare the New Node

Before joining the Swarm, configure the Docker daemon:

# On the new node
sudo nano /etc/docker/daemon.json

Add the same configuration as existing nodes:

{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opts": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Restart Docker:

sudo systemctl restart docker

2. Join the Node to Swarm

On a manager node, get the join token:

# For worker nodes
docker swarm join-token worker

# For manager nodes
docker swarm join-token manager

On the new node, run the join command:

docker swarm join --token SWMTKN-xxx-xxx <manager-ip>:2377

3. Verify the Node

On a manager node:

# Check node is visible
docker node ls

# Check node status
docker node inspect <node-name> --pretty

4. Add Node Labels (if needed)

If you use placement constraints based on labels:

# Add CPU architecture label
docker node update --label-add cpu=arm <node-name>
# or
docker node update --label-add cpu=x86 <node-name>

# Add custom labels as needed
docker node update --label-add role=database <node-name>

5. Test Network Connectivity

Deploy a test service on the new node:

docker service create \
  --name test-dns \
  --constraint 'node.hostname==<new-node-name>' \
  --network <network-name> \
  alpine sleep 3600

Test DNS resolution from the test service:

# Get container ID (run this on the new node, where the task is scheduled)
docker ps | grep test-dns

# Test DNS lookup of existing service
docker exec <container-id> nslookup <existing-service-name>

# Test connectivity
docker exec <container-id> ping -c 3 <existing-service-name>

Clean up:

docker service rm test-dns

Troubleshooting

Issue: Service can't resolve DNS

Symptoms:

nslookup or ping fails to resolve service names
Connection timeouts

Diagnosis:

# Check if service is running
docker service ls | grep <service-name>

# Check service details
docker service inspect <service-name>

# Get container ID
docker ps -f name=<service-name>

# Test DNS from inside container
docker exec <container-id> nslookup <target-service-name>
docker exec <container-id> cat /etc/resolv.conf

Solutions:

Verify both services are on the same network:
```
docker network inspect <network-name>
```
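The containers list only shows containers running on the node where you run the command. Check each service from its own node, or add --verbose on a manager node to see service tasks across the cluster. Both services should be attached to the network.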

Check if endpoint_mode is set:

docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'

Returns either dnsrr or vip (vip is the default).

Restart the service:

docker service update --force <service-name>

Check Docker daemon config is correct:

cat /etc/docker/daemon.json
sudo systemctl status docker

Issue: DNS resolves to wrong IP

Symptoms:

DNS returns an old/incorrect IP address
Service was recently restarted and got a new IP

Diagnosis:

# Find the actual container IP
docker inspect <container-id> | grep IPAddress

# Check what DNS returns
docker exec <other-container> nslookup <service-name>

# Compare the two
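
The comparison can be scripted. The sketch below assumes nslookup output in the BusyBox layout (a `Name:` line followed by `Address` lines); `answer_ips` and `dns_matches` are hypothetical helper names, and the sample addresses are made up.

```shell
# answer_ips: read nslookup output on stdin and print the IPv4 addresses
# that appear after the first "Name:" line (this skips the resolver's own address).
answer_ips() {
  awk '
    /^Name:/ { in_answer = 1; next }
    in_answer && /^Address/ {
      if (match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/))
        print substr($0, RSTART, RLENGTH)
    }
  '
}

# dns_matches ACTUAL_IP: succeeds if ACTUAL_IP is among the answers on stdin
dns_matches() {
  answer_ips | grep -qxF -- "$1"
}

# Made-up sample output
sample='Server:    127.0.0.11
Address:   127.0.0.11:53

Name:      database
Address:   10.0.1.7'

printf '%s\n' "$sample" | dns_matches 10.0.1.7 && echo "DNS matches the container IP"
printf '%s\n' "$sample" | dns_matches 10.0.1.5 || echo "DNS is stale or wrong"
```

In practice, pipe `docker exec <other-container> nslookup <service-name>` into `dns_matches`, passing the address reported by `docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-id>`.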

Solutions:

Verify endpoint_mode is dnsrr:

docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'

Force update to refresh DNS:

docker service update --force <service-name>

If still wrong, check for stale network entries:

# Disconnect and reconnect service to network (requires downtime)
docker service update --network-rm <network-name> <service-name>
docker service update --network-add <network-name> <service-name>

Issue: Multiple services have stale DNS

Symptoms:

Widespread DNS issues across cluster
Affects multiple services/nodes

Solutions:

Force update all services (a rolling restart of every task; slower, and single-replica services see a brief interruption):

docker service ls --format "{{.Name}}" | while read -r service; do
  echo "Updating $service..."
  docker service update --force "$service"
done

Restart Docker on affected nodes (one at a time):
```
# On each node
sudo systemctl restart docker
```

Nuclear option - recreate the network (requires downtime):

# Stop all stacks
docker stack ls
docker stack rm <stack-name>

# Remove network
docker network rm <network-name>

# Recreate network
docker network create --driver overlay --attachable <network-name>

# Redeploy stacks
docker stack deploy -c <compose-file> <stack-name>

Issue: New node can't reach services on other nodes

Symptoms:

Services on new node can't connect to services on existing nodes
DNS works locally but not cross-node

Diagnosis:

# Check if node is properly connected to overlay network
docker network inspect <network-name>

# Verify node is in Swarm
docker node ls

# Check firewall rules (on new node)
sudo iptables -L -n | grep 4789  # VXLAN port
sudo iptables -L -n | grep 7946  # Serf port

Solutions:

Ensure required ports are open:
- TCP port 2377 (cluster management)
- TCP/UDP port 7946 (node communication)
- UDP port 4789 (overlay network traffic)
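
As an illustration, the port list above can be turned into firewall rules. This sketch assumes the nodes use `ufw`, which the original setup may not; `open_swarm_ports` is a hypothetical helper that only prints the commands unless `APPLY=1` is set.

```shell
# Ports Swarm needs between nodes, as listed above
SWARM_PORTS="2377/tcp 7946/tcp 7946/udp 4789/udp"

# open_swarm_ports: print (default) or run one `ufw allow` per port.
# Set APPLY=1 to execute the commands instead of printing them.
open_swarm_ports() {
  for port in $SWARM_PORTS; do
    if [ "${APPLY:-0}" = "1" ]; then
      sudo ufw allow "$port"
    else
      echo "ufw allow $port"
    fi
  done
}

open_swarm_ports
```

Printing first makes it easy to review the rules, or to translate them for a different firewall, before anything changes on the node.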

Check MTU settings:

# On all nodes, check MTU
ip link show

# If MTU issues, recreate network with explicit MTU
docker network create \
  --driver overlay \
  --attachable \
  --opt com.docker.network.driver.mtu=1450 \
  <network-name>
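
The value 1450 comes from subtracting VXLAN's encapsulation overhead from a standard 1500-byte Ethernet MTU. A sketch of that arithmetic, assuming the commonly cited 50-byte overhead for VXLAN over IPv4 and a hypothetical helper name `overlay_mtu`:

```shell
# VXLAN over IPv4 adds about 50 bytes of headers
# (outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8)
VXLAN_OVERHEAD=50

# overlay_mtu UNDERLAY_MTU: print the largest MTU the overlay can use
overlay_mtu() {
  echo $(( $1 - VXLAN_OVERHEAD ))
}

overlay_mtu 1500   # standard Ethernet, prints 1450
overlay_mtu 1400   # an underlay that is already tunnelled, prints 1350
```

If `ip link show` reports an underlay MTU below 1500 on any node, pass the smallest value to get a safe setting for `com.docker.network.driver.mtu`.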

Verify Docker daemon is configured identically:

# Compare daemon.json across nodes
cat /etc/docker/daemon.json

Issue: Intermittent DNS failures

Symptoms:

DNS works sometimes, fails other times
No consistent pattern

Diagnosis:

# Check Docker daemon logs
sudo journalctl -u docker -f

# Check for resource constraints
free -h
df -h

# Monitor DNS queries
docker exec <container> sh -c 'for i in 1 2 3 4 5; do nslookup <service-name>; sleep 2; done'
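
To put a number on "intermittent", it can help to count how many lookups fail out of a fixed number of attempts. The helper below is a generic sketch (`count_failures` is a made-up name); it runs any command repeatedly and prints the failure count, shown here with stand-in commands so it runs anywhere.

```shell
# count_failures N CMD...: run CMD N times and print how many runs failed
count_failures() {
  runs=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed"
}

# Stand-in commands; replace with a real lookup on a Swarm node
count_failures 5 true    # prints 0
count_failures 5 false   # prints 5
```

On a node, something like `count_failures 20 docker exec <container> nslookup <service-name>` gives a failure count you can compare before and after a configuration change.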

Solutions:

Increase DNS timeout in daemon.json:

{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opts": ["ndots:0", "timeout:5", "attempts:5"]
}

Add health checks to services:

services:
  my-service:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

Check for network congestion:

# Monitor network traffic
sudo iftop -i docker_gwbridge

Verification and Testing

Test DNS Resolution

# From inside a container
docker exec <container-id> nslookup <service-name>

# Should return the current IP of the target service

Test Service Connectivity

# Ping test
docker exec <container-id> ping -c 3 <service-name>

# Port connectivity test
docker exec <container-id> nc -zv <service-name> <port>
# or
docker exec <container-id> telnet <service-name> <port>

Monitor DNS Changes

# Run this to watch DNS resolution over time
watch -n 5 'docker exec <container-id> nslookup <service-name>'

Verify Endpoint Mode

# Check if dnsrr is active
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'

# Should return: dnsrr

Migration Strategy

Gradual Migration (Recommended)

Update Docker daemon on all nodes first (one at a time)
Fix critical services with DNS issues immediately
Add endpoint_mode: dnsrr to other services during normal maintenance
No rush - both modes work together

Full Migration (If Preferred)

Update Docker daemon on all nodes (one at a time)
Update all compose files to include endpoint_mode: dnsrr
Redeploy all stacks during a maintenance window
Test each service after deployment

Best Practices

Always use endpoint_mode: dnsrr for database services, since they are the most affected by stale DNS
Use health checks to prevent services from accepting traffic before they're ready
Add node labels for better placement control
Document your network topology including which services run on which nodes
Keep daemon.json consistent across all nodes
Monitor DNS resolution during deployments
Test new nodes before moving production workloads to them

Additional Resources

Useful Commands

# List all services and their endpoint modes
for service in $(docker service ls --format "{{.Name}}"); do
  echo "$service: $(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')"
done

# Find services without dnsrr
for service in $(docker service ls --format "{{.Name}}"); do
  mode=$(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')
  if [ "$mode" != "dnsrr" ]; then
    echo "$service: $mode"
  fi
done

# Check which nodes are running which services
docker service ps <service-name> --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"

# View all containers on current node
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Network Inspection

# View network details
docker network inspect <network-name> --format '{{json .}}' | jq

# List the containers attached to a network on the current node
docker network inspect <network-name> --format '{{range .Containers}}{{.Name}} {{end}}'

# Check network driver and options
docker network inspect <network-name> --format '{{.Driver}} {{.Options}}'

Summary

The endpoint_mode: dnsrr solution:

✅ Eliminates stale DNS entries
✅ Works with existing infrastructure
✅ Requires no additional software
✅ Can be implemented gradually
✅ Compatible with mixed VIP/dnsrr environments
✅ Simple to troubleshoot

By combining the Docker daemon configuration changes with endpoint_mode: dnsrr, you create a robust DNS solution that handles frequent restarts and multi-node deployments reliably.
