docs: create docker_dns_issues

2026-02-13 15:45:01 +00:00 · 2026-02-13 15:45:01 +00:00 · 24a05fa08e
commit 24a05fa08e
parent 97c8135fc8
1 changed files with 612 additions and 0 deletions
--- a/docker_dns_issues.md
+++ b/docker_dns_issues.md
@@ -0,0 +1,612 @@
+---
+title: Docker DNS Fix
+description: Override docker VIP for dns
+published: true
+date: 2026-02-13T15:44:48.521Z
+tags: 
+editor: markdown
+dateCreated: 2026-02-13T15:44:48.521Z
+---
+
+# Docker Swarm DNS Fix: endpoint_mode dnsrr
+
+## Problem Overview
+
+In the default `vip` endpoint mode, Docker Swarm's embedded DNS server resolves each service name to a single Virtual IP (VIP), and the overlay network load-balances connections from that VIP to the service's containers. When containers restart frequently or get new IP addresses, the mappings behind the VIP can become stale. This results in connection timeouts and "connection pool full" errors.
+
+**Common symptoms:**
+- Services cannot connect to databases despite correct configuration
+- DNS resolves to wrong/old IP addresses
+- "Knex: Timeout acquiring a connection" errors
+- Issues worsen with frequent container restarts/rebuilds
+- Problems occur across all nodes (not architecture-specific)
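+
+One way to see the two layers side by side is to compare what the service name resolves to with what `tasks.<service-name>` resolves to. In VIP mode the service name returns a single virtual IP, while `tasks.<service-name>` returns the individual task IPs. The following is a minimal sketch: `compare_dns` is a hypothetical helper name, and it assumes `nslookup` is available inside the container image.
+
+```bash
+# Hypothetical helper: print both views of a service from inside a container.
+# Usage: compare_dns <container-id> <service-name>
+compare_dns() {
+  local container="$1" service="$2"
+  echo "== $service (one VIP in vip mode, task IPs in dnsrr mode) =="
+  docker exec "$container" nslookup "$service"
+  echo "== tasks.$service (individual task IPs) =="
+  docker exec "$container" nslookup "tasks.$service"
+}
+```
+
+If the task IPs listed under `tasks.<service-name>` do not match the containers that are actually running, the stale-entry problem described above is a likely suspect.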
+
+## The Solution: endpoint_mode dnsrr
+
+`endpoint_mode: dnsrr` (DNS Round Robin) bypasses Swarm's VIP layer entirely. DNS queries resolve directly to actual container IPs, eliminating the caching layer that causes stale entries.
+
+**Benefits:**
+- No stale DNS entries
+- Fresh DNS lookups every time
+- Works with existing overlay networks
+- No additional software required
+- Can be implemented gradually
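+
+One caveat worth checking before the rollout: Docker rejects `dnsrr` on services that publish ports through the ingress routing mesh, so those services need their ports switched to host mode (or placed behind a proxy) first. The sketch below lists the affected services. `find_ingress_services` is a hypothetical helper name, and it assumes it is run on a manager node.
+
+```bash
+# Hypothetical helper: list services that publish at least one port in
+# ingress mode and therefore cannot be switched to dnsrr as-is.
+find_ingress_services() {
+  local service modes
+  for service in $(docker service ls --format '{{.Name}}'); do
+    # Prints one publish mode ("ingress" or "host") per published port.
+    modes=$(docker service inspect "$service" \
+      --format '{{range .Spec.EndpointSpec.Ports}}{{.PublishMode}} {{end}}')
+    case "$modes" in
+      *ingress*) echo "$service" ;;
+    esac
+  done
+}
+```
+
+An empty result suggests that every service can take `endpoint_mode: dnsrr` without changes to its `ports:` section.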
+
+## Implementation Guide
+
+### Step 1: Update Docker Daemon (ALL Nodes)
+
+This step benefits all containers immediately and should be done on every node in your Swarm cluster.
+
+**On each node:**
+
+```bash
+sudo nano /etc/docker/daemon.json
+```
+
+**Add or replace with:**
+
+```json
+{
+  "dns": ["8.8.8.8", "1.1.1.1"],
+  "dns-opts": ["ndots:0", "timeout:2", "attempts:3"],
+  "log-driver": "json-file",
+  "log-opts": {
+    "max-size": "10m",
+    "max-file": "3"
+  }
+}
+```
+
+**Configuration explanation:**
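+- `dns`: Sets Google and Cloudflare as the upstream resolvers for names outside the Swarm (service names are still answered by Docker's embedded DNS)
+- `ndots:0`: Treats every name as absolute and queries it as-is before any search domain is appended, which avoids extra lookups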
+- `timeout:2`: 2-second timeout per DNS query
+- `attempts:3`: Retry up to 3 times
+- `log-opts`: Prevents logs from filling disk (optional but recommended)
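+
+Because a malformed `daemon.json` prevents the Docker daemon from starting, it may be worth checking the file before the restart. The sketch below only verifies JSON syntax. `check_daemon_json` is a hypothetical helper name, and it assumes `python3` is installed on the node. Recent Docker Engine releases also offer `dockerd --validate`, which additionally checks the option names, if your version supports it.
+
+```bash
+# Hypothetical helper: fail early if the config file is not valid JSON.
+# Usage: check_daemon_json [path]   (defaults to /etc/docker/daemon.json)
+check_daemon_json() {
+  local file="${1:-/etc/docker/daemon.json}"
+  if python3 -m json.tool "$file" > /dev/null; then
+    echo "OK: $file is valid JSON"
+  else
+    echo "ERROR: $file is not valid JSON" >&2
+    return 1
+  fi
+}
+```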
+
+**Restart Docker (one node at a time):**
+
+```bash
+sudo systemctl restart docker
+```
+
+⚠️ **Important:** Wait 2-3 minutes between node restarts for services to stabilize.
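+
+As a complement to the fixed wait, a manager can poll until the restarted node reports itself as ready again. This is a minimal sketch: `wait_for_node` is a hypothetical helper name, and the retry count and delay are arbitrary defaults. A ready node does not guarantee that its services have stabilized, so the wait above still applies.
+
+```bash
+# Hypothetical helper: poll from a manager until a node is ready again.
+# Usage: wait_for_node <node-hostname> [max-tries] [sleep-seconds]
+wait_for_node() {
+  local node="$1" tries="${2:-30}" delay="${3:-10}" status
+  while [ "$tries" -gt 0 ]; do
+    # The inspect output reports the state in lowercase, e.g. "ready" or "down".
+    status=$(docker node inspect "$node" --format '{{.Status.State}}' 2>/dev/null) || status=""
+    if [ "$status" = "ready" ]; then
+      echo "$node is ready"
+      return 0
+    fi
+    tries=$((tries - 1))
+    sleep "$delay"
+  done
+  echo "$node did not become ready in time" >&2
+  return 1
+}
+```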
+
+### Step 2: Update Docker Compose Files
+
+Add `endpoint_mode: dnsrr` to the `deploy:` section of each service in your compose files.
+
+**Before:**
+```yaml
+services:
+  my-service:
+    image: some-image
+    networks:
+      - my-network
+    deploy:
+      mode: replicated
+      replicas: 1
+      restart_policy:
+        condition: any
+```
+
+**After:**
+```yaml
+services:
+  my-service:
+    image: some-image
+    networks:
+      - my-network
+    deploy:
+      endpoint_mode: dnsrr  # ADD THIS LINE
+      mode: replicated
+      replicas: 1
+      restart_policy:
+        condition: any
+```
+
+### Step 3: Redeploy Services
+
+After updating compose files, redeploy each stack:
+
+```bash
+docker stack deploy -c your-compose-file.yml your-stack-name
+```
+
+**Note:** You can do this gradually. Services without `dnsrr` will continue working (but may still have DNS issues).
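+
+For a service that needs the fix before its compose file can be updated, the endpoint mode can also be changed directly on the running service. The sketch below assumes it is run on a manager node, and `switch_to_dnsrr` is a hypothetical helper name. The next `docker stack deploy` reapplies whatever the compose file says, so the compose file should still be updated afterwards.
+
+```bash
+# Hypothetical helper: switch one running service to dnsrr and print the
+# endpoint mode that Swarm reports afterwards.
+# Usage: switch_to_dnsrr <service-name>
+switch_to_dnsrr() {
+  local service="$1"
+  docker service update --endpoint-mode dnsrr "$service" || return 1
+  docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}'
+}
+```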
+
+## Complete Example: Database + Application
+
+```yaml
+version: "3.8"
+
+networks:
+  app-network:
+    external: true
+
+services:
+  database:
+    image: postgres:16-alpine
+    networks:
+      - app-network
+    environment:
+      POSTGRES_DB: myapp
+      POSTGRES_USER: appuser
+      POSTGRES_PASSWORD: secret
+    volumes:
+      - /data/postgres:/var/lib/postgresql/data
+    deploy:
+      endpoint_mode: dnsrr  # Prevents stale DNS
+      mode: replicated
+      replicas: 1
+      placement:
+        constraints:
+          - node.hostname == node1
+      restart_policy:
+        condition: any
+        delay: 5s
+
+  application:
+    image: myapp:latest
+    networks:
+      - app-network
+    environment:
+      DB_HOST: database  # Uses service name
+      DB_PORT: "5432"
+      DB_USER: appuser
+      DB_PASS: secret
+      DB_NAME: myapp
+    deploy:
+      endpoint_mode: dnsrr  # Prevents stale DNS
+      mode: replicated
+      replicas: 1
+      placement:
+        constraints:
+          - node.hostname == node1
+      restart_policy:
+        condition: any
+        delay: 5s
+```
+
+## Adding New Nodes to Your Swarm
+
+When adding new nodes to your cluster, follow these steps to ensure DNS works correctly:
+
+### 1. Prepare the New Node
+
+**Before joining Swarm, configure Docker daemon:**
+
+```bash
+# On the new node
+sudo nano /etc/docker/daemon.json
+```
+
+Add the same configuration as existing nodes:
+
+```json
+{
+  "dns": ["8.8.8.8", "1.1.1.1"],
+  "dns-opts": ["ndots:0", "timeout:2", "attempts:3"],
+  "log-driver": "json-file",
+  "log-opts": {
+    "max-size": "10m",
+    "max-file": "3"
+  }
+}
+```
+
+**Restart Docker:**
+
+```bash
+sudo systemctl restart docker
+```
+
+### 2. Join the Node to Swarm
+
+**On a manager node, get the join token:**
+
+```bash
+# For worker nodes
+docker swarm join-token worker
+
+# For manager nodes
+docker swarm join-token manager
+```
+
+**On the new node, run the join command:**
+
+```bash
+docker swarm join --token SWMTKN-xxx-xxx manager-ip:2377
+```
+
+### 3. Verify the Node
+
+**On a manager node:**
+
+```bash
+# Check node is visible
+docker node ls
+
+# Check node status
+docker node inspect <node-name> --pretty
+```
+
+### 4. Add Node Labels (if needed)
+
+If you use placement constraints based on labels:
+
+```bash
+# Add CPU architecture label
+docker node update --label-add cpu=arm <node-name>
+# or
+docker node update --label-add cpu=x86 <node-name>
+
+# Add custom labels as needed
+docker node update --label-add role=database <node-name>
+```
+
+### 5. Test Network Connectivity
+
+**Deploy a test service on the new node:**
+
+```bash
+docker service create \
+  --name test-dns \
+  --constraint 'node.hostname==<new-node-name>' \
+  --network <network-name> \
+  alpine sleep 3600
+```
+
+**Test DNS resolution from the test service:**
+
+```bash
+# Get container ID (run on the new node, where the task is scheduled)
+docker ps | grep test-dns
+
+# Test DNS lookup of existing service
+docker exec <container-id> nslookup <existing-service-name>
+
+# Test connectivity
+docker exec <container-id> ping -c 3 <existing-service-name>
+```
+
+**Clean up:**
+
+```bash
+docker service rm test-dns
+```
+
+## Troubleshooting
+
+### Issue: Service can't resolve DNS
+
+**Symptoms:**
+- `nslookup` or `ping` fails to resolve service names
+- Connection timeouts
+
+**Diagnosis:**
+
+```bash
+# Check if service is running
+docker service ls | grep <service-name>
+
+# Check service details
+docker service inspect <service-name>
+
+# Get container ID
+docker ps -f name=<service-name>
+
+# Test DNS from inside container
+docker exec <container-id> nslookup <target-service-name>
+docker exec <container-id> cat /etc/resolv.conf
+```
+
+**Solutions:**
+
+1. **Verify both services are on the same network:**
+   ```bash
+   docker network inspect <network-name>
+   ```
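+   Both services should appear in the containers list. This list only shows containers running on the node where you run the command, so check each node that hosts one of the tasks.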
+
+2. **Check if endpoint_mode is set:**
+   ```bash
+   docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
+   ```
+   Should return `dnsrr` or `vip`.
+
+3. **Restart the service:**
+   ```bash
+   docker service update --force <service-name>
+   ```
+
+4. **Check Docker daemon config is correct:**
+   ```bash
+   cat /etc/docker/daemon.json
+   sudo systemctl status docker
+   ```
+
+### Issue: DNS resolves to wrong IP
+
+**Symptoms:**
+- DNS returns an old/incorrect IP address
+- Service was recently restarted and got a new IP
+
+**Diagnosis:**
+
+```bash
+# Find the actual container IP
+docker inspect <container-id> | grep IPAddress
+
+# Check what DNS returns
+docker exec <other-container> nslookup <service-name>
+
+# Compare the two
+```
+
+**Solutions:**
+
+1. **Verify endpoint_mode is dnsrr:**
+   ```bash
+   docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
+   ```
+
+2. **Force update to refresh DNS:**
+   ```bash
+   docker service update --force <service-name>
+   ```
+
+3. **If still wrong, check for stale network entries:**
+   ```bash
+   # Disconnect and reconnect service to network (requires downtime)
+   docker service update --network-rm <network-name> <service-name>
+   docker service update --network-add <network-name> <service-name>
+   ```
+
+### Issue: Multiple services have stale DNS
+
+**Symptoms:**
+- Widespread DNS issues across cluster
+- Affects multiple services/nodes
+
+**Solutions:**
+
+1. **Force update all services (rolling restart, slower; single-replica services are briefly interrupted):**
+   ```bash
+   docker service ls --format "{{.Name}}" | while read -r service; do
+     echo "Updating $service..."
+     docker service update --force "$service"
+   done
+   ```
+
+2. **Restart Docker on affected nodes (one at a time):**
+   ```bash
+   # On each node
+   sudo systemctl restart docker
+   ```
+
+3. **Nuclear option - recreate the network (requires downtime):**
+   ```bash
+   # Stop all stacks
+   docker stack ls
+   docker stack rm <stack-name>
+   
+   # Remove network
+   docker network rm <network-name>
+   
+   # Recreate network
+   docker network create --driver overlay --attachable <network-name>
+   
+   # Redeploy stacks
+   docker stack deploy -c <compose-file> <stack-name>
+   ```
+
+### Issue: New node can't reach services on other nodes
+
+**Symptoms:**
+- Services on new node can't connect to services on existing nodes
+- DNS works locally but not cross-node
+
+**Diagnosis:**
+
+```bash
+# Check if node is properly connected to overlay network
+# (run on the new node; the overlay only exists there once a task using it is scheduled)
+docker network inspect <network-name>
+
+# Verify node is in Swarm
+docker node ls
+
+# Check firewall rules (on new node)
+sudo iptables -L -n | grep 4789  # VXLAN port
+sudo iptables -L -n | grep 7946  # Serf port
+```
+
+**Solutions:**
+
+1. **Ensure required ports are open:**
+   - TCP port 2377 (cluster management)
+   - TCP/UDP port 7946 (node communication)
+   - UDP port 4789 (overlay network traffic)
+
+2. **Check MTU settings:**
+   ```bash
+   # On all nodes, check MTU
+   ip link show
+   
+   # If MTU issues, recreate network with explicit MTU
+   docker network create \
+     --driver overlay \
+     --attachable \
+     --opt com.docker.network.driver.mtu=1450 \
+     <network-name>
+   ```
+
+3. **Verify Docker daemon is configured identically:**
+   ```bash
+   # Compare daemon.json across nodes
+   cat /etc/docker/daemon.json
+   ```
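+
+The TCP ports listed above can be probed from the new node with `nc`. This is a minimal sketch: `check_swarm_tcp_ports` is a hypothetical helper name, it assumes a netcat build that supports `-z` and `-w`, and it targets a manager node, since port 2377 only listens on managers. The UDP ports (7946 and 4789) are not covered, because a UDP probe cannot reliably distinguish an open port from a filtered one.
+
+```bash
+# Hypothetical helper: probe the Swarm TCP ports on a manager node.
+# Usage: check_swarm_tcp_ports <manager-ip>
+check_swarm_tcp_ports() {
+  local host="$1" port rc=0
+  for port in 2377 7946; do
+    if nc -z -w 3 "$host" "$port"; then
+      echo "tcp/$port reachable on $host"
+    else
+      echo "tcp/$port NOT reachable on $host"
+      rc=1
+    fi
+  done
+  return "$rc"
+}
+```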
+
+### Issue: Intermittent DNS failures
+
+**Symptoms:**
+- DNS works sometimes, fails other times
+- No consistent pattern
+
+**Diagnosis:**
+
+```bash
+# Check Docker daemon logs
+sudo journalctl -u docker -f
+
+# Check for resource constraints
+free -h
+df -h
+
+# Monitor DNS queries
+docker exec <container> sh -c 'for i in 1 2 3 4 5; do nslookup <service-name>; sleep 2; done'
+```
+
+**Solutions:**
+
+1. **Increase DNS timeout in daemon.json:**
+   ```json
+   {
+     "dns": ["8.8.8.8", "1.1.1.1"],
+     "dns-opts": ["ndots:0", "timeout:5", "attempts:5"]
+   }
+   ```
+
+2. **Add health checks to services:**
+   ```yaml
+   services:
+     my-service:
+       healthcheck:
+         test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
+         interval: 30s
+         timeout: 10s
+         retries: 3
+         start_period: 40s
+   ```
+
+3. **Check for network congestion:**
+   ```bash
+   # Monitor network traffic
+   sudo iftop -i docker_gwbridge
+   ```
+
+## Verification and Testing
+
+### Test DNS Resolution
+
+```bash
+# From inside a container
+docker exec <container-id> nslookup <service-name>
+
+# Should return the current IP of the target service
+```
+
+### Test Service Connectivity
+
+```bash
+# Ping test
+docker exec <container-id> ping -c 3 <service-name>
+
+# Port connectivity test
+docker exec <container-id> nc -zv <service-name> <port>
+# or
+docker exec <container-id> telnet <service-name> <port>
+```
+
+### Monitor DNS Changes
+
+```bash
+# Run this to watch DNS resolution over time
+watch -n 5 'docker exec <container-id> nslookup <service-name>'
+```
+
+### Verify Endpoint Mode
+
+```bash
+# Check if dnsrr is active
+docker service inspect <service-name> --format '{{.Spec.EndpointSpec.Mode}}'
+
+# Should return: dnsrr
+```
+
+## Migration Strategy
+
+### Gradual Migration (Recommended)
+
+1. **Update Docker daemon on all nodes first** (one at a time)
+2. **Fix critical services with DNS issues immediately**
+3. **Add `endpoint_mode: dnsrr` to other services during normal maintenance**
+4. **No rush - both modes work together**
+
+### Full Migration (If Preferred)
+
+1. **Update Docker daemon on all nodes** (one at a time)
+2. **Update all compose files** to include `endpoint_mode: dnsrr`
+3. **Redeploy all stacks** during a maintenance window
+4. **Test each service** after deployment
+
+## Best Practices
+
+1. **Always use `endpoint_mode: dnsrr` for database services** - They're most affected by stale DNS
+2. **Use health checks** to prevent services from accepting traffic before they're ready
+3. **Add node labels** for better placement control
+4. **Document your network topology** including which services run on which nodes
+5. **Keep daemon.json consistent** across all nodes
+6. **Monitor DNS resolution** during deployments
+7. **Test new nodes** before moving production workloads to them
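+
+For practice 5, a quick consistency check is to compare a checksum of `daemon.json` on every node. The sketch below assumes SSH access to each node under the hostname passed in and that `sha256sum` is installed there. `compare_daemon_json` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: print one "<node>  <checksum>" line per node.
+# Differing checksums mean the daemon configs have drifted apart.
+# Usage: compare_daemon_json <node> [<node> ...]
+compare_daemon_json() {
+  local node
+  for node in "$@"; do
+    printf '%s  ' "$node"
+    ssh "$node" sha256sum /etc/docker/daemon.json | cut -d' ' -f1
+  done
+}
+```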
+
+## Additional Resources
+
+### Useful Commands
+
+```bash
+# List all services and their endpoint modes
+docker service ls --format "{{.Name}}" | while read -r service; do
+  echo "$service: $(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')"
+done
+
+# Find services without dnsrr
+for service in $(docker service ls --format "{{.Name}}"); do
+  mode=$(docker service inspect "$service" --format '{{.Spec.EndpointSpec.Mode}}')
+  if [ "$mode" != "dnsrr" ]; then
+    echo "$service: $mode"
+  fi
+done
+
+# Check which nodes are running which services
+docker service ps <service-name> --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"
+
+# View all containers on current node
+docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
+```
+
+### Network Inspection
+
+```bash
+# View network details
+docker network inspect <network-name> --format '{{json .}}' | jq .
+
+# List containers on this node that are attached to the network
+docker network inspect <network-name> --format '{{range .Containers}}{{.Name}} {{end}}'
+
+# Check network driver and options
+docker network inspect <network-name> --format '{{.Driver}} {{.Options}}'
+```
+
+## Summary
+
+The `endpoint_mode: dnsrr` solution:
+- ✅ Eliminates stale DNS entries
+- ✅ Works with existing infrastructure
+- ✅ Requires no additional software
+- ✅ Can be implemented gradually
+- ✅ Compatible with mixed VIP/dnsrr environments
+- ✅ Simple to troubleshoot
+
+By combining the Docker daemon configuration changes with `endpoint_mode: dnsrr`, you create a robust DNS solution that handles frequent restarts and multi-node deployments reliably.