Netgrimoire/docker_dns_issues.md
2026-02-13 15:45:01 +00:00

title: Docker DNS Fix
description: Override docker VIP for dns
published: true
date: 2026-02-13T15:44:48.521Z
tags:
editor: markdown
dateCreated: 2026-02-13T15:44:48.521Z

Docker Swarm DNS Fix: endpoint_mode dnsrr

Problem Overview

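In the default endpoint mode (vip), Docker Swarm's embedded DNS server resolves each service name to a stable Virtual IP (VIP), and a load balancer on each node forwards traffic from that VIP to the current task IPs. When containers restart frequently and receive new IP addresses, the mapping behind the VIP can go stale, so connections are sent to addresses that no longer exist. This results in connection timeouts and "connection pool full" errors.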

Common symptoms:

  • Services cannot connect to databases despite correct configuration
  • DNS resolves to wrong/old IP addresses
  • "Knex: Timeout acquiring a connection" errors
  • Issues worsen with frequent container restarts/rebuilds
  • Problems occur across all nodes (not architecture-specific)

The Solution: endpoint_mode dnsrr

endpoint_mode: dnsrr (DNS Round Robin) bypasses Swarm's VIP layer entirely. The embedded DNS server returns the IP addresses of the service's running tasks directly, so there is no VIP mapping left to go stale.

Benefits:

  • No stale DNS entries
  • Each lookup returns the current task IPs (applications that cache DNS results still need to re-resolve)
  • Works with existing overlay networks
  • No additional software required
  • Can be implemented gradually

Implementation Guide

Step 1: Update Docker Daemon (ALL Nodes)

This step applies to every container started after Docker restarts and should be done on every node in your Swarm cluster.

On each node:

sudo nano /etc/docker/daemon.json

Add these settings, merging them with any keys already in the file:

{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opt": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Configuration explanation:

  • dns: Sets Google and Cloudflare as the upstream DNS servers for containers
  • ndots:0: Treats every name as absolute first, so lookups skip search-domain expansion
  • timeout:2: 2-second timeout per DNS query
  • attempts:3: Try each query up to 3 times
  • log-opts: Prevents logs from filling disk (optional but recommended)
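
A malformed daemon.json prevents the Docker daemon from starting, so it can be worth checking the file before the restart. The following is a minimal sketch, assuming python3 or jq is installed on the node; the function name validate_daemon_json is illustrative and not part of Docker.

```shell
# Hypothetical helper: exits 0 if the file parses as JSON, non-zero otherwise.
# Assumption: python3 or jq is available on the node.
validate_daemon_json() {
  json_file=${1:-/etc/docker/daemon.json}
  if command -v python3 >/dev/null 2>&1; then
    python3 -c 'import json, sys; json.load(open(sys.argv[1]))' "$json_file" >/dev/null 2>&1
  elif command -v jq >/dev/null 2>&1; then
    jq empty "$json_file" >/dev/null 2>&1
  else
    echo "no JSON parser found (install python3 or jq)" >&2
    return 2
  fi
}
```

One way to use it is to gate the restart: validate_daemon_json && sudo systemctl restart docker. This only checks JSON syntax, not whether Docker accepts each key.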

Restart Docker (one node at a time):

sudo systemctl restart docker

⚠️ Important: Wait 2-3 minutes between node restarts for services to stabilize.
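
The wait between restarts can be backed by a check rather than a timer alone. The sketch below polls docker node ls on a manager until every node reports Ready. The function names are hypothetical, and a Ready node does not guarantee that its services have finished converging, so treat it as a lower bound on the wait.

```shell
# Hypothetical helpers; run on a manager node.
# all_nodes_ready: exit 0 only if every node's Status reads "Ready".
all_nodes_ready() {
  node_lines=$(docker node ls --format '{{.Hostname}} {{.Status}}') || return 1
  [ -n "$node_lines" ] || return 1
  # grep -v succeeds if any line does NOT end in " Ready"; invert that.
  ! printf '%s\n' "$node_lines" | grep -qv ' Ready$'
}

# wait_for_nodes MAX_TRIES DELAY_SECONDS: poll until all nodes are Ready.
wait_for_nodes() {
  tries_left=$1
  while [ "$tries_left" -gt 0 ]; do
    all_nodes_ready && return 0
    tries_left=$((tries_left - 1))
    sleep "$2"
  done
  return 1
}
```

For example, wait_for_nodes 30 10 checks every 10 seconds for up to 5 minutes before moving on to the next node.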

Step 2: Update Docker Compose Files

Add endpoint_mode: dnsrr to the deploy: section of each service in your compose files.

Before:

services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any

After:

services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      endpoint_mode: dnsrr  # ADD THIS LINE
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any

Step 3: Redeploy Services

After updating compose files, redeploy each stack:

docker stack deploy -c your-compose-file.yml your-stack-name

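Note: You can do this gradually. Services without dnsrr continue to work in VIP mode (but may still have DNS issues). A service in dnsrr mode cannot publish ports through the ingress routing mesh, so any published ports on it must use mode: host.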

Complete Example: Database + Application

version: "3.8"

networks:
  app-network:
    external: true

services:
  database:
    image: postgres:16-alpine
    networks:
      - app-network
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
    volumes:
      - /data/postgres:/var/lib/postgresql/data
    deploy:
      endpoint_mode: dnsrr  # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s

  application:
    image: myapp:latest
    networks:
      - app-network
    environment:
      DB_HOST: database  # Uses service name
      DB_PORT: "5432"
      DB_USER: appuser
      DB_PASS: secret
      DB_NAME: myapp
    deploy:
      endpoint_mode: dnsrr  # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s
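
In a setup like this one, the application can start before the database name resolves, or while the database task is being replaced. A minimal sketch of a generic retry helper that an entrypoint script could call before launching the app follows. wait_until is a hypothetical name, and the probe command is supplied by the caller.

```shell
# Hypothetical helper: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_try=1
  while [ "$wu_try" -le "$wu_attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    # Sleep only if another attempt is coming.
    if [ "$wu_try" -lt "$wu_attempts" ]; then
      sleep "$wu_delay"
    fi
    wu_try=$((wu_try + 1))
  done
  return 1
}
```

A possible call, assuming getent is present in the application image: wait_until 30 2 getent hosts "$DB_HOST". This complements the restart_policy above rather than replacing it.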

Adding New Nodes to Your Swarm

When adding new nodes to your cluster, follow these steps to ensure DNS works correctly:

1. Prepare the New Node

Before joining Swarm, configure Docker daemon:

# On the new node
sudo nano /etc/docker/daemon.json

Add the same configuration as existing nodes:

{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opt": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Restart Docker:

sudo systemctl restart docker

2. Join the Node to Swarm

On a manager node, get the join token:

# For worker nodes
docker swarm join-token worker

# For manager nodes
docker swarm join-token manager

On the new node, run the join command:

docker swarm join --token SWMTKN-xxx-xxx manager-ip:2377

3. Verify the Node

On a manager node:

# Check node is visible
docker node ls

# Check node status
docker node inspect <node-name> --pretty

4. Add Node Labels (if needed)

If you use placement constraints based on labels:

# Add CPU architecture label
docker node update --label-add cpu=arm <node-name>
# or
docker node update --label-add cpu=x86 <node-name>

# Add custom labels as needed
docker node update --label-add role=database <node-name>

5. Test Network Connectivity

Deploy a test service on the new node:

docker service create \
  --name test-dns \
  --constraint 'node.hostname==<new-node-name>' \
  --network netgrimoire \
  alpine sleep 3600

Test DNS resolution from the test service:

# Get container ID (run this on the new node, where the task is scheduled)
docker ps | grep test-dns

# Test DNS lookup of existing service
docker exec <container-id> nslookup <existing-service-name>

# Test connectivity
docker exec <container-id> ping -c 3 <existing-service-name>

Clean up:

docker service rm test-dns

Troubleshooting

Issue: Service can't resolve DNS

Symptoms:

  • nslookup or ping fails to resolve service names
  • Connection timeouts

Diagnosis:

# Check if service is running
docker service ls | grep <service-name>

# Check service details
docker service inspect <service-name>

# Get container ID
docker ps -f name=<service-name>

# Test DNS from inside container
docker exec <container-id> nslookup <target-service-name>
docker exec <container-id> cat /etc/resolv.conf

Solutions:

  1. Verify both services are on the same network:

    docker network inspect <network-name>
    

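    Tasks of both services should appear in the Containers list. The list only shows containers running on the node where you run the command, so check each node that hosts one of the services.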

  2. Check if endpoint_mode is set:

    docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
    

    Should return dnsrr or vip.

  3. Restart the service:

    docker service update --force <service-name>
    
  4. Check Docker daemon config is correct:

    cat /etc/docker/daemon.json
    sudo systemctl status docker
    

Issue: DNS resolves to wrong IP

Symptoms:

  • DNS returns an old/incorrect IP address
  • Service was recently restarted and got a new IP

Diagnosis:

# Find the actual container IP
docker inspect <container-id> | grep IPAddress

# Check what DNS returns
docker exec <other-container> nslookup <service-name>

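# Compare the two. In vip mode the service name resolves to the virtual IP,
# so look up tasks.<service-name> to see the task IPs instead.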
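
The comparison can be scripted once both address lists are in hand. The sketch below prints every resolved address that does not belong to a live task. stale_ips is a hypothetical helper, and it takes the two lists as plain whitespace-separated strings so that it does not depend on any particular nslookup output format.

```shell
# Hypothetical helper: stale_ips "RESOLVED_IPS" "LIVE_TASK_IPS"
# Prints each IP from the first list that is absent from the second.
stale_ips() {
  # Normalise the live list to " ip1 ip2 " (word splitting is intentional).
  live_list=" $(printf '%s ' $2)"
  for resolved_ip in $1; do
    case "$live_list" in
      *" $resolved_ip "*) ;;              # matches a live task IP
      *) printf '%s\n' "$resolved_ip" ;;  # DNS answer with no matching task
    esac
  done
}
```

For example, stale_ips '10.0.1.5 10.0.1.9' '10.0.1.9 10.0.1.12' prints 10.0.1.5. The live list can be collected per task with docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-id>. The addresses here are made up for illustration.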

Solutions:

  1. Verify endpoint_mode is dnsrr:

    docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
    
  2. Force update to refresh DNS:

    docker service update --force <service-name>
    
  3. If still wrong, check for stale network entries:

    # Disconnect and reconnect service to network (requires downtime)
    docker service update --network-rm <network-name> <service-name>
    docker service update --network-add <network-name> <service-name>
    

Issue: Multiple services have stale DNS

Symptoms:

  • Widespread DNS issues across cluster
  • Affects multiple services/nodes

Solutions:

  1. Force update all services (restarts every task; single-replica services are briefly unavailable):

    docker service ls --format "{{.Name}}" | while read -r service; do
      echo "Updating $service..."
      docker service update --force "$service"
    done
    
  2. Restart Docker on affected nodes (one at a time):

    # On each node
    sudo systemctl restart docker
    
  3. Nuclear option - recreate the network (requires downtime):

    # Stop all stacks
    docker stack ls
    docker stack rm <stack-name>
    
    # Remove network
    docker network rm <network-name>
    
    # Recreate network
    docker network create --driver overlay --attachable <network-name>
    
    # Redeploy stacks
    docker stack deploy -c <compose-file> <stack-name>
    

Issue: New node can't reach services on other nodes

Symptoms:

  • Services on new node can't connect to services on existing nodes
  • DNS works locally but not cross-node

Diagnosis:

# Check if node is properly connected to overlay network
docker network inspect <network-name>

# Verify node is in Swarm
docker node ls

# Check firewall rules (on new node)
sudo iptables -L -n | grep 4789  # VXLAN port
sudo iptables -L -n | grep 7946  # Serf port

Solutions:

  1. Ensure required ports are open:

    • TCP port 2377 (cluster management)
    • TCP/UDP port 7946 (node communication)
    • UDP port 4789 (overlay network traffic)
  2. Check MTU settings:

    # On all nodes, check MTU
    ip link show
    
    # If MTU issues, recreate network with explicit MTU
    docker network create \
      --driver overlay \
      --attachable \
      --opt com.docker.network.driver.mtu=1450 \
      <network-name>
    
  3. Verify Docker daemon is configured identically:

    # Compare daemon.json across nodes
    cat /etc/docker/daemon.json
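
For the MTU check in step 2, the 1450 value follows from the overlay's VXLAN encapsulation, which adds 50 bytes of headers over IPv4 on a standard 1500-byte link. A small sketch for deriving the value on links with a different MTU follows. The function names are illustrative, and the interface name in the usage note is an assumption.

```shell
# VXLAN over IPv4 adds 50 bytes of headers to each packet.
VXLAN_OVERHEAD=50

# overlay_mtu UNDERLAY_MTU: print the largest overlay MTU that still fits.
overlay_mtu() {
  echo $(( $1 - VXLAN_OVERHEAD ))
}

# iface_mtu IFACE: read an interface's MTU from sysfs (Linux only).
iface_mtu() {
  cat "/sys/class/net/$1/mtu"
}
```

For example, overlay_mtu "$(iface_mtu eth0)" prints 1450 on a host whose eth0 has an MTU of 1500. Use the lowest underlay MTU found across all nodes.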
    

Issue: Intermittent DNS failures

Symptoms:

  • DNS works sometimes, fails other times
  • No consistent pattern

Diagnosis:

# Check Docker daemon logs
sudo journalctl -u docker -f

# Check for resource constraints
free -h
df -h

# Monitor DNS queries
docker exec <container> sh -c 'for i in 1 2 3 4 5; do nslookup <service-name>; sleep 2; done'
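
To put a number on "sometimes", the lookup loop can count failures instead of printing every answer. This is a sketch under the assumption that nslookup inside the container exits non-zero when a lookup fails; dns_failures is a hypothetical helper name.

```shell
# Hypothetical helper: dns_failures CONTAINER NAME TRIES DELAY_SECONDS
# Prints how many of TRIES lookups failed.
# Assumption: nslookup in the container exits non-zero on a failed lookup.
dns_failures() {
  failed=0
  done_tries=0
  while [ "$done_tries" -lt "$3" ]; do
    docker exec "$1" nslookup "$2" >/dev/null 2>&1 || failed=$((failed + 1))
    done_tries=$((done_tries + 1))
    sleep "$4"
  done
  echo "$failed"
}
```

For example, dns_failures <container> <service-name> 20 2 runs 20 lookups two seconds apart. A non-zero result on an otherwise idle cluster points at the resolver path rather than at the application.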

Solutions:

  1. Increase the DNS timeout and attempts in daemon.json, then restart Docker:

    {
      "dns": ["8.8.8.8", "1.1.1.1"],
      "dns-opt": ["ndots:0", "timeout:5", "attempts:5"]
    }
    
  2. Add health checks to services:

    services:
      my-service:
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s
    
  3. Check for network congestion:

    # Monitor network traffic
    sudo iftop -i docker_gwbridge
    

Verification and Testing

Test DNS Resolution

# From inside a container
docker exec <container-id> nslookup <service-name>

# Should return the current IP of the target service

Test Service Connectivity

# Ping test
docker exec <container-id> ping -c 3 <service-name>

# Port connectivity test
docker exec <container-id> nc -zv <service-name> <port>
# or
docker exec <container-id> telnet <service-name> <port>

Monitor DNS Changes

# Run this to watch DNS resolution over time
watch -n 5 'docker exec <container-id> nslookup <service-name>'

Verify Endpoint Mode

# Check if dnsrr is active
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'

# Should return: dnsrr

Migration Strategy

  1. Update Docker daemon on all nodes first (one at a time)
  2. Fix critical services with DNS issues immediately
  3. Add endpoint_mode: dnsrr to other services during normal maintenance
  4. No rush - both modes work together

Full Migration (If Preferred)

  1. Update Docker daemon on all nodes (one at a time)
  2. Update all compose files to include endpoint_mode: dnsrr
  3. Redeploy all stacks during a maintenance window
  4. Test each service after deployment

Best Practices

  1. Always use endpoint_mode: dnsrr for database services, which are most affected by stale DNS
  2. Use health checks to prevent services from accepting traffic before they're ready
  3. Add node labels for better placement control
  4. Document your network topology including which services run on which nodes
  5. Keep daemon.json consistent across all nodes
  6. Monitor DNS resolution during deployments
  7. Test new nodes before moving production workloads to them

Additional Resources

Useful Commands

# List all services with their scheduling mode (replicated/global) and replica counts
docker service ls --format "table {{.Name}}\t{{.Mode}}\t{{.Replicas}}"

# Find services without dnsrr (prints each service name and its endpoint mode)
for service in $(docker service ls --format "{{.Name}}"); do
  mode=$(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')
  if [ "$mode" != "dnsrr" ]; then
    echo "$service: $mode"
  fi
done

# Check which nodes are running which services
docker service ps <service-name> --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"

# View all containers on current node
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Network Inspection

# View network details
docker network inspect <network-name> --format '{{json .}}' | jq

# List containers attached to the network on this node
docker network inspect <network-name> --format '{{range .Containers}}{{.Name}} {{end}}'

# Check network driver and options
docker network inspect <network-name> --format '{{.Driver}} {{.Options}}'

Summary

The endpoint_mode: dnsrr solution:

  • Eliminates stale DNS entries
  • Works with existing infrastructure
  • Requires no additional software
  • Can be implemented gradually
  • Compatible with mixed VIP/dnsrr environments
  • Simple to troubleshoot

By combining the Docker daemon configuration changes with endpoint_mode: dnsrr, you create a robust DNS solution that handles frequent restarts and multi-node deployments reliably.