| title | description | published | date | tags | editor | dateCreated |
|---|---|---|---|---|---|---|
| Docker DNS Fix | Override docker VIP for dns | true | 2026-02-13T15:44:48.521Z | | markdown | 2026-02-13T15:44:48.521Z |
Docker Swarm DNS Fix: endpoint_mode dnsrr
Problem Overview
By default, Docker Swarm's embedded DNS server resolves each service name to a Virtual IP (VIP), and the VIP layer forwards traffic to the service's containers. The mappings behind this VIP layer can become stale when containers restart frequently or get new IP addresses. The result is connection timeouts and "connection pool full" errors.
Common symptoms:
- Services cannot connect to databases despite correct configuration
- DNS resolves to wrong/old IP addresses
- "Knex: Timeout acquiring a connection" errors
- Issues worsen with frequent container restarts/rebuilds
- Problems occur across all nodes (not architecture-specific)
The Solution: endpoint_mode dnsrr
endpoint_mode: dnsrr (DNS Round Robin) bypasses Swarm's VIP layer entirely. DNS queries resolve directly to actual container IPs, eliminating the caching layer that causes stale entries.
Benefits:
- No stale DNS entries
- Fresh DNS lookups every time
- Works with existing overlay networks
- No additional software required
- Can be implemented gradually
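To see the difference between the two modes on a running cluster, one option is to look at the virtual IPs Swarm has allocated to a service. The sketch below assumes it runs on a manager node; the function names are illustrative and not part of Docker.

```shell
# Print the virtual IPs Swarm has allocated to a service, one per line.
# A service running in dnsrr mode is expected to print nothing.
service_vips() {
  docker service inspect "$1" \
    --format '{{range .Endpoint.VirtualIPs}}{{.Addr}}{{println}}{{end}}'
}

# Summarise whether a service sits behind a VIP or resolves straight to tasks.
describe_endpoint() {
  vips=$(service_vips "$1")
  if [ -n "$vips" ]; then
    echo "$1: vip ($(printf '%s' "$vips" | tr '\n' ' '))"
  else
    echo "$1: no VIP allocated (dnsrr, or no network attached)"
  fi
}
```

Calling `describe_endpoint <service-name>` before and after switching a service to dnsrr should show the VIP disappearing.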
Implementation Guide
Step 1: Update Docker Daemon (ALL Nodes)
This step benefits all containers immediately and should be done on every node in your Swarm cluster.
On each node:
sudo nano /etc/docker/daemon.json
Add or replace with:
{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opts": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Configuration explanation:
- dns: Uses Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS as upstream resolvers for names outside the Swarm
- ndots:0: Tries every name as an absolute name first instead of expanding it through the search domains, which avoids extra lookups
- timeout:2: 2-second timeout per DNS query
- attempts:3: Retry up to 3 times
- log-opts: Prevents logs from filling disk (optional but recommended)
Restart Docker (one node at a time):
sudo systemctl restart docker
⚠️ Important: Wait 2-3 minutes between node restarts for services to stabilize.
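Instead of relying only on a fixed wait, you can check whether services have settled before moving to the next node. This is a minimal sketch, assuming it runs on a manager node; the function names and the POLL_INTERVAL variable are illustrative.

```shell
# Print the name of every service whose running replica count differs
# from its desired count, based on the "running/desired" Replicas column.
unconverged_services() {
  docker service ls --format '{{.Name}} {{.Replicas}}' |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Poll until every service has converged or the timeout (seconds) expires.
# Returns 0 on convergence, 1 on timeout.
wait_for_convergence() {
  timeout=${1:-180}
  waited=0
  while [ -n "$(unconverged_services)" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "still converging: $(unconverged_services | tr '\n' ' ')" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL:-5}"
    waited=$((waited + ${POLL_INTERVAL:-5}))
  done
  return 0
}
```

For example, run `wait_for_convergence 300` after each `sudo systemctl restart docker` and only continue with the next node when it returns successfully.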
Step 2: Update Docker Compose Files
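Add endpoint_mode: dnsrr to the deploy: section of each service in your compose files. Services that publish ports through the ingress routing mesh cannot use dnsrr; publish those ports in host mode or leave the service on vip.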
Before:
services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any
After:
services:
  my-service:
    image: some-image
    networks:
      - my-network
    deploy:
      endpoint_mode: dnsrr # ADD THIS LINE
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any
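Swarm rejects dnsrr for a service that publishes ports in ingress mode, so it can help to check an existing service before editing its compose file. A minimal sketch, assuming a manager node; the function name is illustrative.

```shell
# Print the published ports of a service that use ingress mode, one per line.
# Empty output suggests the service can switch to dnsrr without port changes.
ingress_published_ports() {
  docker service inspect "$1" \
    --format '{{range .Spec.EndpointSpec.Ports}}{{.PublishMode}} {{.PublishedPort}}{{println}}{{end}}' |
    awk '$1 == "ingress" { print $2 }'
}
```

If `ingress_published_ports <service-name>` prints anything, those ports need to be republished in host mode (or the service left on vip) before adding endpoint_mode: dnsrr.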
Step 3: Redeploy Services
After updating compose files, redeploy each stack:
docker stack deploy -c your-compose-file.yml your-stack-name
Note: You can do this gradually. Services without dnsrr will continue working (but may still have DNS issues).
Complete Example: Database + Application
version: "3.8"
networks:
  app-network:
    external: true
services:
  database:
    image: postgres:16-alpine
    networks:
      - app-network
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
    volumes:
      - /data/postgres:/var/lib/postgresql/data
    deploy:
      endpoint_mode: dnsrr # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s
  application:
    image: myapp:latest
    networks:
      - app-network
    environment:
      DB_HOST: database # Uses service name
      DB_PORT: "5432"
      DB_USER: appuser
      DB_PASS: secret
      DB_NAME: myapp
    deploy:
      endpoint_mode: dnsrr # Prevents stale DNS
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == node1
      restart_policy:
        condition: any
        delay: 5s
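Because docker stack deploy ignores depends_on, the application in this example may start before the database name resolves. One possible workaround is a small wait loop in the application's entrypoint. This sketch assumes the image ships an nc that supports -z (as used in the connectivity tests later on this page); the function name and WAIT_INTERVAL variable are illustrative.

```shell
# Wait until host:port accepts TCP connections, re-resolving the name on
# every attempt. Returns 0 once reachable, 1 after the retries run out.
wait_for_port() {
  host=$1
  port=$2
  retries=${3:-30}
  attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    if nc -z -w 2 "$host" "$port" 2>/dev/null; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${WAIT_INTERVAL:-2}"
  done
  echo "gave up waiting for $host:$port after $retries attempts" >&2
  return 1
}
```

In an entrypoint script this could be called as `wait_for_port "$DB_HOST" "$DB_PORT"` before starting the application process.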
Adding New Nodes to Your Swarm
When adding new nodes to your cluster, follow these steps to ensure DNS works correctly:
1. Prepare the New Node
Before joining Swarm, configure Docker daemon:
# On the new node
sudo nano /etc/docker/daemon.json
Add the same configuration as existing nodes:
{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opts": ["ndots:0", "timeout:2", "attempts:3"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Restart Docker:
sudo systemctl restart docker
2. Join the Node to Swarm
On a manager node, get the join token:
# For worker nodes
docker swarm join-token worker
# For manager nodes
docker swarm join-token manager
On the new node, run the join command:
docker swarm join --token SWMTKN-xxx-xxx manager-ip:2377
3. Verify the Node
On a manager node:
# Check node is visible
docker node ls
# Check node status
docker node inspect <node-name> --pretty
4. Add Node Labels (if needed)
If you use placement constraints based on labels:
# Add CPU architecture label
docker node update --label-add cpu=arm <node-name>
# or
docker node update --label-add cpu=x86 <node-name>
# Add custom labels as needed
docker node update --label-add role=database <node-name>
5. Test Network Connectivity
Deploy a test service on the new node:
docker service create \
  --name test-dns \
  --constraint 'node.hostname==<new-node-name>' \
  --network <network-name> \
  alpine sleep 3600
Test DNS resolution from the test service:
# Get container ID (run on the new node, where the test task is scheduled)
docker ps | grep test-dns
# Test DNS lookup of existing service
docker exec <container-id> nslookup <existing-service-name>
# Test connectivity
docker exec <container-id> ping -c 3 <existing-service-name>
Clean up:
docker service rm test-dns
Troubleshooting
Issue: Service can't resolve DNS
Symptoms:
- nslookup or ping fails to resolve service names
- Connection timeouts
Diagnosis:
# Check if service is running
docker service ls | grep <service-name>
# Check service details
docker service inspect <service-name>
# Get container ID
docker ps -f name=<service-name>
# Test DNS from inside container
docker exec <container-id> nslookup <target-service-name>
docker exec <container-id> cat /etc/resolv.conf
Solutions:
- Verify both services are on the same network:
docker network inspect <network-name>
Both services should appear in the containers list.
- Check if endpoint_mode is set:
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
Should return dnsrr or vip.
- Restart the service:
docker service update --force <service-name>
- Check Docker daemon config is correct:
cat /etc/docker/daemon.json
sudo systemctl status docker
Issue: DNS resolves to wrong IP
Symptoms:
- DNS returns an old/incorrect IP address
- Service was recently restarted and got a new IP
Diagnosis:
# Find the actual container IP
docker inspect <container-id> | grep IPAddress
# Check what DNS returns
docker exec <other-container> nslookup <service-name>
# Compare the two
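The comparison step can be scripted so that stale addresses stand out. This is a sketch with illustrative function names. It assumes you copy the addresses from the nslookup answer yourself, since nslookup output differs between images.

```shell
# Print the IP addresses a container holds on its attached networks.
container_ips() {
  docker inspect "$1" \
    --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{println}}{{end}}'
}

# Print every address in the DNS answer ($1) that is not in the list of
# actual container addresses ($2). Both arguments are newline-separated.
stale_addresses() {
  printf '%s\n' "$1" | grep -vxF -e "$2" || true
}
```

For example, `stale_addresses "<ips-from-nslookup>" "$(container_ips <container-id>)"` prints nothing when DNS and the container agree, and prints the outdated addresses otherwise.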
Solutions:
- Verify endpoint_mode is dnsrr:
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
- Force update to refresh DNS:
docker service update --force <service-name>
- If still wrong, check for stale network entries:
# Disconnect and reconnect service to network (requires downtime)
docker service update --network-rm <network-name> <service-name>
docker service update --network-add <network-name> <service-name>
Issue: Multiple services have stale DNS
Symptoms:
- Widespread DNS issues across cluster
- Affects multiple services/nodes
Solutions:
- Force update all services (no downtime but slower):
docker service ls --format "{{.Name}}" | while read -r service; do
  echo "Updating $service..."
  docker service update --force "$service"
done
- Restart Docker on affected nodes (one at a time):
# On each node
sudo systemctl restart docker
- Nuclear option - recreate the network (requires downtime):
# Stop all stacks
docker stack ls
docker stack rm <stack-name>
# Remove network
docker network rm <network-name>
# Recreate network
docker network create --driver overlay --attachable <network-name>
# Redeploy stacks
docker stack deploy -c <compose-file> <stack-name>
Issue: New node can't reach services on other nodes
Symptoms:
- Services on new node can't connect to services on existing nodes
- DNS works locally but not cross-node
Diagnosis:
# Check if node is properly connected to overlay network
docker network inspect <network-name>
# Verify node is in Swarm
docker node ls
# Check firewall rules (on new node)
sudo iptables -L -n | grep 4789 # VXLAN port
sudo iptables -L -n | grep 7946 # Serf port
Solutions:
- Ensure required ports are open:
  - TCP port 2377 (cluster management)
  - TCP/UDP port 7946 (node communication)
  - UDP port 4789 (overlay network traffic)
- Check MTU settings:
# On all nodes, check MTU
ip link show
# If MTU issues, recreate network with explicit MTU
docker network create \
  --driver overlay \
  --attachable \
  --opt com.docker.network.driver.mtu=1450 \
  <network-name>
- Verify Docker daemon is configured identically:
# Compare daemon.json across nodes
cat /etc/docker/daemon.json
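The TCP ports from the list above can be probed from the new node with a short loop. This is a minimal sketch, assuming nc with -z support is installed on the host; the function name is illustrative. Port 2377 only answers on manager nodes, and the UDP ports (7946, 4789) cannot be confirmed with a simple connect test like this one.

```shell
# Print each required Swarm TCP port that cannot be reached on the given host.
unreachable_swarm_tcp_ports() {
  for port in 2377 7946; do
    nc -z -w 3 "$1" "$port" 2>/dev/null || echo "$port"
  done
}
```

Running `unreachable_swarm_tcp_ports <manager-ip>` from the new node should print nothing when both TCP ports are open.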
Issue: Intermittent DNS failures
Symptoms:
- DNS works sometimes, fails other times
- No consistent pattern
Diagnosis:
# Check Docker daemon logs
sudo journalctl -u docker -f
# Check for resource constraints
free -h
df -h
# Monitor DNS queries
docker exec <container> sh -c 'for i in 1 2 3 4 5; do nslookup <service-name>; sleep 2; done'
Solutions:
- Increase DNS timeout in daemon.json:
{
  "dns": ["8.8.8.8", "1.1.1.1"],
  "dns-opts": ["ndots:0", "timeout:5", "attempts:5"]
}
- Add health checks to services:
services:
  my-service:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
- Check for network congestion:
# Monitor network traffic
sudo iftop -i docker_gwbridge
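To tell whether a change actually reduced intermittent failures, it helps to put a number on them. The sketch below repeats a lookup from inside a container and counts the failures. The function name and LOOKUP_INTERVAL variable are illustrative, and it assumes nslookup in the container exits non-zero when resolution fails.

```shell
# Run a number of lookups of a name from inside a container and print
# how many of them failed.
count_dns_failures() {
  container=$1
  name=$2
  attempts=${3:-10}
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    docker exec "$container" nslookup "$name" >/dev/null 2>&1 ||
      failures=$((failures + 1))
    i=$((i + 1))
    sleep "${LOOKUP_INTERVAL:-2}"
  done
  echo "$failures/$attempts lookups failed"
}
```

Running `count_dns_failures <container-id> <service-name> 20` before and after a configuration change gives a rough comparison.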
Verification and Testing
Test DNS Resolution
# From inside a container
docker exec <container-id> nslookup <service-name>
# Should return the current IP of the target service
Test Service Connectivity
# Ping test
docker exec <container-id> ping -c 3 <service-name>
# Port connectivity test
docker exec <container-id> nc -zv <service-name> <port>
# or
docker exec <container-id> telnet <service-name> <port>
Monitor DNS Changes
# Run this to watch DNS resolution over time
watch -n 5 'docker exec <container-id> nslookup <service-name>'
Verify Endpoint Mode
# Check if dnsrr is active
docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
# Should return: dnsrr
Migration Strategy
Gradual Migration (Recommended)
- Update Docker daemon on all nodes first (one at a time)
- Fix critical services with DNS issues immediately
- Add endpoint_mode: dnsrr to other services during normal maintenance
- No rush - both modes work together
Full Migration (If Preferred)
- Update Docker daemon on all nodes (one at a time)
- Update all compose files to include endpoint_mode: dnsrr
- Redeploy all stacks during a maintenance window
- Test each service after deployment
Best Practices
- Always use endpoint_mode: dnsrr for database services, since they're most affected by stale DNS
- Use health checks to prevent services from accepting traffic before they're ready
- Add node labels for better placement control
- Document your network topology including which services run on which nodes
- Keep daemon.json consistent across all nodes
- Monitor DNS resolution during deployments
- Test new nodes before moving production workloads to them
Additional Resources
Useful Commands
# List all services and their endpoint modes
docker service ls --format "{{.Name}}" | while read -r service; do
  echo "$service: $(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')"
done
# Find services without dnsrr
for service in $(docker service ls --format "{{.Name}}"); do
  mode=$(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')
  if [ "$mode" != "dnsrr" ]; then
    echo "$service: $mode"
  fi
done
# Check which nodes are running which services
docker service ps <service-name> --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"
# View all containers on current node
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
Network Inspection
# View network details
docker network inspect <network-name> --format '{{json .}}' | jq
# List containers attached to a network (only those on the current node are shown)
docker network inspect <network-name> --format '{{range .Containers}}{{.Name}} {{end}}'
# Check network driver and options
docker network inspect <network-name> --format '{{.Driver}} {{.Options}}'
Summary
The endpoint_mode: dnsrr solution:
- ✅ Eliminates stale DNS entries
- ✅ Works with existing infrastructure
- ✅ Requires no additional software
- ✅ Can be implemented gradually
- ✅ Compatible with mixed VIP/dnsrr environments
- ✅ Simple to troubleshoot
By combining the Docker daemon configuration changes with endpoint_mode: dnsrr, you create a robust DNS solution that handles frequent restarts and multi-node deployments reliably.