Troubleshooting Guide¶
Tags: Reference, Troubleshooting
A comprehensive guide for diagnosing and resolving common issues in the EgyGeeks infrastructure.
404 Errors¶
Problem¶
Requests to a service return 404 Not Found errors, or the service is unreachable.
Causes & Solutions¶
Container Health Issues¶
Diagnosis:
# Check if container is running
docker ps | grep SERVICE_NAME
# Check container logs
docker logs CONTAINER_ID
# Inspect container health status
docker ps --format "table {{.Names}}\t{{.Status}}"
Solution:
- Ensure the container is in the Up state
- Check for restart loops: `docker inspect CONTAINER_ID | grep -A 5 State`
- Review container logs for startup errors
- Restart the container if necessary: `docker restart CONTAINER_ID`
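The restart-loop check can be wrapped in a small shell function. This is a minimal sketch, not part of the EgyGeeks tooling: the function name `check_restart_loop` and the default threshold of 3 restarts are assumptions chosen for illustration.

```shell
# Hypothetical helper: flag a likely restart loop.
# Usage: check_restart_loop CONTAINER [MAX_RESTARTS]
check_restart_loop() {
  container="$1"
  max_restarts="${2:-3}"   # assumed threshold, adjust as needed
  # .State.Status and .RestartCount are fields in `docker inspect` output.
  state=$(docker inspect --format '{{.State.Status}}' "$container") || return 2
  restarts=$(docker inspect --format '{{.RestartCount}}' "$container") || return 2
  echo "state=$state restarts=$restarts"
  if [ "$state" = "restarting" ] || [ "$restarts" -gt "$max_restarts" ]; then
    echo "possible restart loop; review: docker logs --tail 50 $container"
    return 1
  fi
  return 0
}
```

Calling `check_restart_loop CONTAINER_ID` returns a non-zero status when the container is restarting or has exceeded the threshold, which makes it usable in scripts.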
Traefik Labels & Routing Configuration¶
Diagnosis:
# Verify Traefik is running
docker ps | grep traefik
# Check Traefik dashboard (typically http://server-ip:8080)
# Look for routes in the HTTP section
# Verify container has proper labels
docker inspect CONTAINER_ID | grep -A 20 Labels
Required labels for Traefik:
labels:
- "traefik.enable=true"
- "traefik.http.routers.SERVICE.rule=Host(`domain.com`)"
- "traefik.http.routers.SERVICE.entrypoints=web,websecure"
- "traefik.http.services.SERVICE.loadbalancer.server.port=PORT"
Solution:
- Verify all required labels are present in docker-compose.yml
- Check that SERVICE is spelled identically in every label (by convention it matches the Compose service name)
- Ensure the target port matches the application's listening port
- Redeploy with `docker-compose up -d` to apply label changes
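To confirm that a running container actually carries the required labels, a check along the following lines may help. It is a sketch under assumptions: `check_traefik_labels` is a hypothetical name, and the label keys are the four listed above with SERVICE passed as the second argument.

```shell
# Hypothetical helper: report which of the required Traefik labels are missing.
# Usage: check_traefik_labels CONTAINER SERVICE
check_traefik_labels() {
  container="$1"
  service="$2"
  # Labels are printed as one JSON object, e.g. {"traefik.enable":"true",...}
  labels=$(docker inspect --format '{{json .Config.Labels}}' "$container") || return 2
  missing=0
  for key in \
    "traefik.enable" \
    "traefik.http.routers.$service.rule" \
    "traefik.http.routers.$service.entrypoints" \
    "traefik.http.services.$service.loadbalancer.server.port"
  do
    # Fixed-string match on the quoted JSON key.
    if ! printf '%s' "$labels" | grep -q -F "\"$key\":"; then
      echo "missing label: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

The function only checks that each key is present; it does not validate the values, so the rule and port still need a manual look.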
Network Connectivity¶
Diagnosis:
# Check if container is on correct network
docker network inspect traefik_public | grep CONTAINER_NAME
# Test connectivity from Traefik to service
docker exec TRAEFIK_CONTAINER ping -c 3 SERVICE_CONTAINER_NAME
# Check DNS resolution within network
docker exec TRAEFIK_CONTAINER nslookup SERVICE_CONTAINER_NAME
Solution:
- Verify the service is connected to the traefik_public network in docker-compose.yml (networks: - traefik_public)
- Check that the network exists: `docker network ls | grep traefik_public`
- Create the network if it is missing: `docker network create traefik_public`
- Ensure Traefik can reach the service on the configured port
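Network membership can also be read from the container side. The sketch below assumes the network is named exactly `traefik_public`; the function name `on_network` is hypothetical. If Compose created the network with a project prefix, pass the full name as the second argument.

```shell
# Hypothetical helper: succeed only if CONTAINER is attached to NETWORK.
# Usage: on_network CONTAINER [NETWORK]
on_network() {
  container="$1"
  network="${2:-traefik_public}"
  # .NetworkSettings.Networks is a map keyed by network name.
  networks=$(docker inspect --format '{{json .NetworkSettings.Networks}}' "$container") || return 2
  if printf '%s' "$networks" | grep -q -F "\"$network\":"; then
    echo "$container is attached to $network"
  else
    echo "$container is NOT attached to $network"
    return 1
  fi
}
```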
Unhealthy Containers¶
Problem¶
Containers show as "Unhealthy" in docker ps output or fail to start.
Diagnosis & Solutions¶
Check health status:
View detailed health check output:
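One possible way to run both checks uses `docker inspect` with a format template. The function names `health_status` and `health_log` are hypothetical; the `.State.Health` field is only populated for containers that define a health check.

```shell
# Usage: health_status CONTAINER -> prints healthy, unhealthy, starting, or none
health_status() {
  # The template guards against containers without a health check,
  # where .State.Health is nil.
  docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$1"
}

# Usage: health_log CONTAINER -> prints status, failing streak, and recent probe output as JSON
health_log() {
  docker inspect --format '{{json .State.Health}}' "$1"
}
```

Piping `health_log CONTAINER_ID` through `jq` makes the recorded probe output easier to read.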
Common issues:
| Issue | Diagnosis | Solution |
|---|---|---|
| Failed health checks | docker logs CONTAINER_ID shows errors | Fix the underlying service error; check ports, dependencies |
| Startup takes too long | Container exits before passing health check | Increase start_period in health check configuration |
| Port not listening | docker exec CONTAINER_ID netstat -tlnp or lsof -i | Verify application started; check port in health check URL |
| Dependency not ready | Container depends on database/service not yet running | Use depends_on with condition: service_healthy, or a wait script |
Example health check configuration:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:PORT/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
SSL/TLS Issues¶
Problem¶
SSL certificates not provisioning, HTTPS not working, or Let's Encrypt errors.
Diagnosis¶
Check Traefik certificate storage:
docker exec traefik ls -la /letsencrypt/
docker exec traefik cat /letsencrypt/acme.json | jq '.[] | keys'
View Traefik logs for Let's Encrypt activity:
Test certificate renewal:
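For the log review, a filter like the one below narrows Traefik's output to certificate-related lines. It is a sketch: the function name `acme_log_lines` and the search terms are assumptions, and the container name defaults to `traefik` as used in the commands above.

```shell
# Usage: acme_log_lines [TRAEFIK_CONTAINER]
acme_log_lines() {
  # docker logs replays the container's stderr on stderr,
  # so merge both streams before filtering.
  docker logs "${1:-traefik}" 2>&1 | grep -i -E 'acme|certificate|letsencrypt'
}
```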
Solutions¶
Certificate Not Provisioning¶
- Verify domain is publicly accessible and resolves to server IP
- Check that Traefik has internet access for Let's Encrypt validation
- Ensure the acme.json file exists and is writable (Traefik requires mode 600): `docker exec traefik ls -la /letsencrypt/`
- Verify Traefik configuration includes ACME/Let's Encrypt provider
- Check DNS propagation: `nslookup domain.com`
Certificate Renewal Failing¶
- Check the expiry of the certificate actually being served, from the host: `echo | openssl s_client -connect domain.com:443 -servername domain.com 2>/dev/null | openssl x509 -noout -dates`
- Manually trigger renewal by redeploying Traefik
- Review Traefik logs for rate limiting or DNS validation errors
HTTPS Not Working¶
- Verify entrypoints include websecure: `traefik.http.routers.SERVICE.entrypoints=web,websecure`
- Check that port 443 is accessible and not blocked by firewall
- Ensure certificate is valid: `curl -vI https://domain.com`
Database Connection Issues¶
Problem¶
Application cannot connect to database; "Connection refused" or timeout errors.
Diagnosis¶
Check if database container is running:
Test connectivity from application container:
Verify database service is listening:
Check connection environment variables:
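Two of these checks can be sketched as shell functions. The names `db_env` and `db_port_open` are hypothetical, the variable prefixes are taken from the names used elsewhere in this guide, and `db_port_open` assumes `nc` is available inside the application image.

```shell
# Usage: db_env APP_CONTAINER -> database-related variables, password values masked
db_env() {
  docker exec "$1" env \
    | grep -E '^(MYSQL|POSTGRES|DATABASE)_' \
    | sed -E 's/^([^=]*PASSWORD[^=]*)=.*/\1=****/'
}

# Usage: db_port_open APP_CONTAINER DB_HOST DB_PORT
# Assumption: the app image ships nc with -z support; otherwise use another probe.
db_port_open() {
  docker exec "$1" nc -z -w 3 "$2" "$3"
}
```

Masking the password values keeps the output safe to paste into a ticket or chat while still showing which variables are set.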
Solutions¶
| Issue | Solution |
|---|---|
| Container not running | Confirm with docker ps -a, then start it with docker-compose up -d DATABASE_SERVICE_NAME |
| Port mismatch | Verify DATABASE_PORT matches listening port in container |
| Wrong hostname | Use service name from docker-compose (not localhost): DATABASE_SERVICE_NAME:PORT |
| Network not shared | Ensure both containers on same network: networks: - traefik_public |
| Credentials incorrect | Verify environment variables: docker inspect APP_CONTAINER_ID \| grep -E "MYSQL\|POSTGRES" |
| Database not initialized | Check logs for initialization errors; may need to reset volume |
Test connection manually:
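# For MySQL (no space after -p, otherwise PASSWORD is read as the database name)
docker exec APP_CONTAINER_ID mysql -h DATABASE_HOST -u USER -pPASSWORD -e "SELECT 1;"
# For PostgreSQL (password supplied through the PGPASSWORD environment variable)
docker exec -e PGPASSWORD=PASSWORD APP_CONTAINER_ID psql -h DATABASE_HOST -U USER -d DATABASE -c "SELECT 1;"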
Port Conflicts¶
Problem¶
Container fails to start with "port already in use" error or multiple services can't bind to same port.
Diagnosis¶
Find what's using a port:
Check all exposed ports:
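To see which running containers publish a given host port, something like the following could be used. The function name `containers_publishing_port` is hypothetical, and the match covers single-port mappings only, not published port ranges.

```shell
# Usage: containers_publishing_port PORT
containers_publishing_port() {
  port="$1"
  # .Ports renders like "0.0.0.0:8080->80/tcp"; match the host side of the mapping.
  docker ps --format '{{.Names}}|{{.Ports}}' | grep -F ":$port->" | cut -d'|' -f1
}

# Host processes outside Docker can be listed with, for example:
#   sudo ss -tlnp | grep ":8080 "
#   sudo lsof -i :8080
```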
Solutions¶
- Identify the conflicting service using the diagnosis steps above.
- Choose an action:
  - If it is an old container: `docker rm CONTAINER_ID` (stop it first if needed)
  - If it is a system process: `kill -9 PID` (use with caution)
  - If it is a different service: change the port mapping in docker-compose.yml
- Update the port mapping in docker-compose.yml.
- Redeploy with `docker-compose up -d`.
Build Failures¶
Problem¶
docker-compose build or docker build fails with errors.
Diagnosis¶
View full build logs:
Check Dockerfile syntax:
Common Build Issues¶
| Error | Cause | Solution |
|---|---|---|
| FROM image not found | Base image doesn't exist or wrong registry | Verify image name; check Docker Hub; use docker pull to verify access |
| ADD/COPY failed | File path incorrect or file doesn't exist | Check relative paths; verify file exists in build context |
| RUN command not found | Command doesn't exist in base image | Check base image; install package with package manager |
| Permission denied | Script not executable | Add chmod +x before copying or in RUN command |
| Out of disk space | Build creating large intermediate layers | Clean up with docker system prune |
Solutions¶
Clean build:
Remove build cache:
Check available disk space:
Build verbosely to see where it fails:
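For the disk space check, a minimal sketch is shown below. The function names and the 90% threshold are assumptions. The default path is `/`; pass the Docker data directory instead if it lives on a separate filesystem.

```shell
# Usage: disk_used_percent [PATH] -> integer percentage used on the filesystem holding PATH
disk_used_percent() {
  # -P forces the POSIX single-line format; column 5 is the capacity, e.g. "42%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: warn_if_disk_full [PATH] [THRESHOLD]
warn_if_disk_full() {
  used=$(disk_used_percent "${1:-/}")
  threshold="${2:-90}"   # assumed threshold
  if [ "$used" -ge "$threshold" ]; then
    echo "disk ${used}% full: consider 'docker builder prune' or 'docker system prune'"
    return 1
  fi
  echo "disk ${used}% full: ok"
}
```

Running `warn_if_disk_full` before a large build gives an early signal before the build fails partway through with an out-of-space error.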
General Troubleshooting Tips¶
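- Always check logs first: `docker logs CONTAINER_ID`
- View service status across the stack: `docker ps --format "table {{.Names}}\t{{.Status}}"`
- Inspect actual vs expected configuration: `docker inspect CONTAINER_ID`
- Restart everything (nuclear option): `docker-compose down` followed by `docker-compose up -d`
- Network inspection: `docker network inspect traefik_public`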
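As a quick first pass across the whole host, the status check can be reduced to only the containers that need attention. This is a sketch; `problem_containers` is a hypothetical name, and containers whose health check is still starting are not flagged.

```shell
# Usage: problem_containers -> name and status of containers that are not running cleanly
problem_containers() {
  # List every container (including stopped ones) as NAME|STATUS, then keep rows
  # whose status does not start with "Up" or that report "unhealthy".
  docker ps -a --format '{{.Names}}|{{.Status}}' \
    | awk -F'|' '$2 !~ /^Up/ || $2 ~ /unhealthy/ { print $1 ": " $2 }'
}
```

An empty result means every container is up and none is reporting an unhealthy state.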
Related References¶
- Server Setup: See server.md for server details and installed software
- Infrastructure Architecture: See docs/journeys/infrastructure.md for system overview
- Docker Compose Configuration: See docker-compose.yml for current service definitions