Docker Compose Production Deployment: Health Checks, Restart Policies, and Log Management
3 AM, and server alerts are blowing up your phone. You SSH in to find disk space at 99%—container logs are eating up 50GB.
That’s not even the worst part. Last year, a project’s API container showed “running” status, but the database connection had been dead for hours. Every request returned a 500 error. It took three hours to track down the issue. According to Last9’s research, container “zombie” states waste an average of 3.2 hours of troubleshooting time per incident.
Let’s be honest—many teams deploying Docker Compose to production just configure port mapping and volume mounts, then throw containers into the wild. No health checks, no log rotation, restart policy is just a lazy restart: always. The result: containers look like they’re running but are actually dead; log files grow uncontrollably until they fill the disk; crashing services restart infinitely, consuming all CPU and memory.
In this article, I’ll break down three core production configurations—health checks, restart policies, and log management—into clear, actionable steps. Not just configuration examples, but health check commands for common services, troubleshooting steps, and a complete docker-compose.yml template you can copy and use.
Health Checks — Make Containers Truly Alive
A container showing running status doesn’t mean the application is actually working. Database connection failures, ports not listening, frozen processes—Docker has no idea about these. Health checks act as a “heartbeat monitor” for containers, regularly checking whether the application can still respond normally.
Configuration Syntax
In docker-compose.yml, the healthcheck configuration looks like this:
```yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 10s      # Check every 10 seconds
  timeout: 5s        # Wait up to 5 seconds per check
  retries: 5         # Mark as unhealthy after 5 consecutive failures
  start_period: 30s  # Give the container 30 seconds of warm-up time after startup
```
These parameters need to work together. timeout can’t be larger than interval, otherwise the next check starts before the previous one finishes. start_period shouldn’t be skipped—services like databases start slowly, and if the warm-up time is too short, health checks will falsely mark the container as dead.
Health Check Commands for Common Services
Different services require different check methods. Here are some common ones:
PostgreSQL
```yaml
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres -d mydb"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```
pg_isready is PostgreSQL’s built-in check tool, specifically designed to determine if the database is ready to accept connections.
MySQL / MariaDB
```yaml
healthcheck:
  test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```
Note the password uses $$ for escaping, otherwise YAML treats $ as a variable reference.
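For context, the $$ form assumes MYSQL_ROOT_PASSWORD is present in the container’s environment, which it is when you set it the usual way. A minimal sketch (the service name and password value here are illustrative only):

```yaml
services:
  mysql:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: changeme  # illustrative; use a secret in production
    healthcheck:
      # Compose collapses $$ to a single $, so the shell inside the
      # container expands $MYSQL_ROOT_PASSWORD at check time
      test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
```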
Redis
```yaml
healthcheck:
  test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
  interval: 10s
  timeout: 3s
  retries: 3
```
Redis’s ping command returns PONG, filtered with grep to ensure the result is correct.
Web Server (HTTP Check)
```yaml
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 10s
```
The -f flag makes curl return a non-zero exit code when HTTP status code is not 2xx, triggering a health check failure.
Common Pitfall: Minimal images like Alpine may not have curl installed. Either install it (apk add curl) or use wget instead:
```yaml
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1"]
```
Startup Order Control
The database isn’t ready yet, but the API container starts up, resulting in connection failures, errors, and crashes—I’ve seen this too many times. Using depends_on with condition: service_healthy solves this:
```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  api:
    build: ./api
    depends_on:
      postgres:
        condition: service_healthy  # Wait for postgres's health check to pass before starting
```
This way, Docker Compose waits for postgres’s health check to return healthy before starting the api container. No more awkward “database not ready, API tries to connect” situations.
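Recent Compose versions also support a long-form restart flag under depends_on, which restarts the dependent container whenever the dependency is restarted. A sketch, assuming a Compose version new enough to support depends_on.restart (check your version):

```yaml
api:
  build: ./api
  depends_on:
    postgres:
      condition: service_healthy
      restart: true  # if postgres is restarted, restart api as well
```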
Restart Policies — Graceful Recovery After Failure
Container crashed—what to do? Automatic restart seems like a good idea. But here’s the problem: if the root cause isn’t resolved, restarting just creates an infinite loop, wasting CPU and memory, and masking the real failure.
Configuration Syntax
Restart policy is configured in the deploy block:
```yaml
deploy:
  restart_policy:
    condition: on-failure  # Only restart on failure
    delay: 5s              # Wait 5 seconds before restarting
    max_attempts: 3        # Maximum of 3 restart attempts
    window: 120s           # 120 seconds without another failure counts as recovery
```
condition has three options:

- none: never restart; leave the container down
- on-failure: restart only when the container exits abnormally (non-zero exit code)
- any: restart regardless of how the container exited
Production Recommendation
For production environments, use on-failure instead of always.
Why? restart: always makes containers restart no matter what. Application code has a bug causing crash? Restart. Database connection fails causing process exit? Restart. Configuration file error prevents startup? Still restart. The result is a crash loop, logs filling up, CPU being consumed repeatedly.
on-failure with max_attempts is different—restart at most 3 times, then stop if still failing. Operations can see the container ultimately died and investigate the real problem.
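One caveat worth knowing: deploy.restart_policy originates from Swarm mode, and plain docker compose up may not honor every field (max_attempts and window in particular). The portable single-host alternative is the top-level restart key; a sketch (verify which fields your Compose version supports):

```yaml
services:
  api:
    build: ./api
    # Portable equivalent of condition: on-failure for plain docker compose up
    restart: on-failure
```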
Parameter Tuning
delay is the restart interval. Too short, and the container might not be fully cleaned up before restarting; too long extends recovery time. Generally 5-10 seconds works well.
window is an easily overlooked parameter. It defines: how long after restart without another failure counts as successful restart. For example, setting window: 120s means if the container crashes again within 120 seconds after restart, the max_attempts counter doesn’t reset. This avoids false positives from “restarts successfully for one second then crashes again.”
Health Check and Restart Policy Coordination
Health checks and restart policies are related, but the link is looser than it looks:

- A health check that fails retries times in a row marks the container unhealthy.
- On a plain Docker host, unhealthy status by itself triggers nothing: restart policies react to the container process exiting (and to its exit code), not to health status.
- Under Swarm, the orchestrator does stop and replace unhealthy tasks, which is where the two mechanisms genuinely chain together.
- After any restart, health check state starts fresh; if checks pass, the container returns to healthy, and repeated crashes are bounded by max_attempts.

Used together deliberately, this gives failures a degree of auto-recovery while limiting the risk of infinite restart loops.
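On a single host, one common pattern for acting on unhealthy containers is a watchdog sidecar. A sketch using the third-party willfarrell/autoheal image; the label name and behavior below are assumptions based on that project’s conventions, so verify against its documentation:

```yaml
services:
  autoheal:
    image: willfarrell/autoheal
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # watchdog needs Docker API access

  api:
    build: ./api
    labels:
      - autoheal=true  # opt this container into watchdog restarts when unhealthy
```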
Log Management — Prevent Disk Space Exhaustion
The 3 AM alert I mentioned at the beginning, disk at 99% with logs consuming 50GB, is something I’ve experienced more than once. Docker’s default json-file log driver never cleans up old logs on its own, so the files grow indefinitely. Without log rotation configured, the disk fills up sooner or later.
Log Rotation Configuration
Add logging configuration in docker-compose.yml:
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"   # Single log file max 10MB
    max-file: "3"     # Keep a maximum of 3 log files
    compress: "true"  # Compress rotated logs to save space
```
With this configuration, each container’s logs occupy at most about 30MB uncompressed (10MB × 3). When the active file exceeds 10MB, Docker rotates to a new file; once the file count exceeds 3, the oldest is deleted, and rotated files are compressed when compress is enabled.
Log files are stored at /var/lib/docker/containers/<container-id>/<container-id>-json.log. Use du to check actual usage:

```shell
du -sh /var/lib/docker/containers/*/*-json.log
```
Driver Selection
Docker supports multiple log drivers: json-file, syslog, fluentd, journald, local, etc. For most scenarios, json-file or local are sufficient.
Docker’s official documentation notes that the local driver is more efficient than json-file and performs log rotation by default, so you don’t need to configure max-size/max-file yourself. If you have large log volumes (say tens of GB per day), consider local:

```yaml
logging:
  driver: "local"
```

The trade-off: local stores logs in an internal format. docker logs still works fine, but external tools that expect to tail the plain JSON files on disk can no longer read them directly.
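If the defaults don’t fit, the local driver accepts the same style of rotation options as json-file; a sketch (the values here are illustrative):

```yaml
logging:
  driver: "local"
  options:
    max-size: "20m"
    max-file: "5"
```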
Centralized Log Collection (Optional)
Single-machine deployment works fine with json-file or local. But if you have dozens of servers and hundreds of containers, logs scattered everywhere become hard to manage. Consider centralized logging solutions:
- Fluentd: Lightweight log collection, suitable for small clusters
- ELK Stack (Elasticsearch + Logstash + Kibana): Powerful but high deployment cost
- Loki + Grafana: Cloud-native solution, integrates well with Prometheus ecosystem
These solutions are more complex to configure and are outside this article’s scope, but here is the shape of a Fluentd logging configuration:

```yaml
logging:
  driver: "fluentd"
  options:
    fluentd-address: "localhost:24224"
    tag: "docker.{{.Name}}"
```
Fluentd forwards logs to the specified address, where you can collect and analyze them on another server.
Complete Configuration Template
Combine health checks, restart policies, and log management to create a production-grade docker-compose.yml. Here’s a complete example with PostgreSQL database, Redis cache, and API service:
```yaml
version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
      POSTGRES_DB: mydb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
        compress: "true"

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://myuser:mypassword@postgres:5432/mydb
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
        window: 120s
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
        compress: "true"

volumes:
  postgres_data:
```
Configuration Key Points
Startup Order: The api container’s depends_on waits for both postgres and redis health checks to pass. Database and cache are ready before API starts, avoiding startup connection errors.
Log Size Differences: postgres and redis logs are usually small, 10MB × 3 is sufficient; API service logs may be larger, set to 50MB × 5. Adjust based on actual log volume, don’t use one size fits all.
Restart Delay Differences: postgres gets delay: 5s, enough time for cleanup before a retry; the API gets delay: 10s so that a briefly unavailable database or cache has a moment to recover before the API comes back and reconnects.
Startup Warm-up Time: postgres start_period: 30s gives database enough initialization time; redis start_period: 5s, Redis starts fast anyway; API start_period: 10s, application startup usually takes just a few seconds.
This template can be copied and used directly—just replace environment variables and images with your own. If your project has other services (like MongoDB, MinIO), add health checks, restart policies, and log configuration following the same pattern.
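As a template like this grows, the repeated logging blocks get tedious. Compose ignores top-level keys that start with x-, so a YAML anchor can hold the shared configuration; a sketch:

```yaml
x-default-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
    compress: "true"

services:
  postgres:
    image: postgres:16
    logging: *default-logging

  redis:
    image: redis:7-alpine
    logging: *default-logging
```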
Common Pitfalls and Troubleshooting
Even with the configuration written and deployed, problems can still appear. Here are common pitfalls and how to troubleshoot them.
Health Check Keeps Failing
Symptom: Container status always shows unhealthy, but application seems to work normally.
Troubleshooting Steps:

1. First check whether the health check tool exists inside the container:

   ```shell
   docker exec <container> which curl
   docker exec <container> which pg_isready
   ```

   Alpine images often don’t ship curl; install it manually or switch to wget.

2. Run the health check command by hand and inspect the output:

   ```shell
   docker exec <container> curl -f http://localhost:8080/health
   ```

   If it returns an error, the health check endpoint itself may have issues.

3. View detailed health check status:

   ```shell
   docker inspect --format='{{json .State.Health}}' <container> | jq
   ```

   This shows recent check results, failure reasons, and timestamps.
Container Restarting Repeatedly
Symptom: Container starts and dies after a few seconds, logs filled with restart records.
Troubleshooting Steps:

1. Check the container’s exit reason:

   ```shell
   docker inspect --format='{{.State.ExitCode}}' <container>
   docker inspect --format='{{.State.Error}}' <container>
   ```

   The exit code tells you roughly what went wrong (1 = general error, 137 = killed by SIGKILL, often the OOM killer, 139 = segmentation fault).

2. Check the restart count:

   ```shell
   docker inspect --format='{{.RestartCount}}' <container>
   ```

   If the count is large, verify that max_attempts is actually taking effect.

3. Read the container logs for the specific error:

   ```shell
   docker logs --tail 100 <container>
   ```
Log Disk Full
Symptom: Disk space alert, discover /var/lib/docker/containers directory is very large.
Troubleshooting Steps:

1. Find the largest log files:

   ```shell
   du -sh /var/lib/docker/containers/*/*-json.log | sort -rh | head -5
   ```

2. Check whether the log rotation configuration took effect:

   ```shell
   docker inspect --format='{{.HostConfig.LogConfig}}' <container>
   ```

   If the options map comes back empty (e.g. map[]), log rotation isn’t configured for that container.

3. Manually truncate logs (a temporary measure):

   ```shell
   truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log
   ```

   This only buys time; the long-term fix is still to configure log rotation.
Quick Troubleshooting Command List
When problems occur, these commands help quickly locate issues:
```shell
# View all container health statuses
docker ps --format "table {{.Names}}\t{{.Status}}"

# View a specific container's health check history
docker inspect --format='{{json .State.Health}}' <container>

# View container exit code and restart count
docker inspect --format='ExitCode: {{.State.ExitCode}}, RestartCount: {{.RestartCount}}' <container>

# Check log file sizes
du -sh /var/lib/docker/containers/*/*-json.log | sort -rh

# View a container's last 100 log lines
docker logs --tail 100 <container>
```
Summary
For production deployment with Docker Compose, these three configurations aren’t optional, they’re essential: health checks ensure containers don’t just look alive, restart policies give failures a chance at auto-recovery while limiting infinite loops, and log management prevents disk exhaustion.
Core Configuration Checklist:
- Health Check: test + interval + timeout + retries + start_period
- Restart Policy: condition: on-failure + max_attempts: 3
- Log Rotation: max-size: 10m + max-file: 3 + compress: true
Three-Step Action Plan:
- Check your existing docker-compose.yml for these three configurations. If missing, at least add health checks and log rotation.
- Deploy a test service using the complete template above, observe if health checks work and logs are rotating.
- Save the troubleshooting commands. Next time you get a 3 AM alert, you can quickly locate the problem.
Don’t let your containers run naked in production. Configure these three protective shields so that when problems occur, services can at least recover automatically, be diagnosed quickly, and never fill up your disk.
10 min read · Published on: Apr 12, 2026 · Modified on: Apr 12, 2026