Nginx Load Balancing in Practice: upstream Configuration and Health Checks
2 AM, phone buzzing like crazy. I open the monitoring dashboard—backend1 status bar is all red. A single application server went down.
That year during the Double 11 sale, we only had two backend servers. The Nginx config read upstream backend { server backend1; server backend2; }, which looked symmetric enough. But we were running plain default round-robin: no weights tuned, and, crucially, no health check configured.
When backend1 crashed, user order requests kept going to the dead server; Nginx had no idea it was down and kept forwarding requests there. What users saw were 500 error pages. By the time ops manually removed backend1 from the upstream, 15 minutes had passed.
After that incident, I caught up on what I’d missed: Nginx upstream isn’t just about listing server addresses. Weight distribution, health checks, failover—these are what production environments actually need. This article compiles the pitfalls I hit and the configuration methods I learned afterward, including how to implement active health checks with open-source Nginx (without paying for NGINX Plus).
1. upstream Basic Configuration: From Single Server to Cluster
The core purpose of an upstream block is simple: package multiple servers into a logical group so Nginx knows where to forward requests. But its parameters are richer than many people imagine.
upstream backend {
zone backend 64k;
server backend1.example.com weight=3 max_fails=2 fail_timeout=30s;
server backend2.example.com;
server backup1.example.com backup;
}
Let me explain line by line:
zone backend 64k: Shared memory zone. Nginx worker processes need to share backend server state (who’s alive, who’s down). 64k is a reasonable starting value; increase it if you have many servers. Without this line, each worker tracks state on its own, so one worker can keep sending requests to a server that another worker has already seen fail.
weight=3: Weight. backend1 has weight 3, backend2 defaults to 1. This means out of 4 requests, 3 go to backend1, 1 to backend2. Suitable for heterogeneous backend servers—for example, backend1 is 8-core 16GB, backend2 is 4-core 8GB.
max_fails=2: Failure threshold. If 2 requests to this server fail within the fail_timeout window, Nginx marks it unavailable. The default is 1, which is too sensitive: a single network blip trips it. Production should use 2 or 3.
fail_timeout=30s: Dual meaning. First, the failure counting window is 30 seconds; second, after a server is marked unavailable, Nginx will try connecting it again after 30 seconds. Default 10 seconds may not be enough for slow-starting services.
backup: Backup server. Only when all primary servers are unavailable does the backup server receive requests. Useful for keeping a lower-spec machine as a fallback.
There’s also a down parameter for manually marking a server offline, often used during maintenance:
server backend3.example.com down; # Temporarily offline for maintenance
In practice, I’ve seen many people skip the zone config. The result: worker processes each maintain state separately. One worker discovers a server is down, others still send requests there. Adding zone solves the state sync problem.
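To build intuition for how weight shapes traffic distribution, here is a minimal Python sketch of the smooth weighted round-robin algorithm nginx uses internally. The server names and weights mirror the example above; this is an illustration for intuition, not nginx's actual C implementation.

```python
def smooth_wrr(peers, n):
    """peers: dict of server name -> weight. Returns n picks using
    nginx-style smooth weighted round-robin bookkeeping."""
    current = {name: 0 for name in peers}
    total = sum(peers.values())
    picks = []
    for _ in range(n):
        for name, weight in peers.items():
            current[name] += weight           # raise every peer by its weight
        best = max(current, key=current.get)  # pick the highest current weight
        current[best] -= total                # penalize the chosen peer
        picks.append(best)
    return picks

# With weight=3 vs the default 1: 3 of every 4 requests go to backend1
print(smooth_wrr({"backend1": 3, "backend2": 1}, 4))
```

The "smooth" part is why backend1 does not receive three requests in a row: the chosen peer is penalized by the total weight, so the picks interleave.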
2. Five Load Balancing Strategies: When to Use Each?
Is the default round-robin strategy enough? It depends.
I’ve seen plenty of projects run the default round-robin for years without issues. But once you hit WebSocket long connections, shopping-cart sessions, or cache-penetration scenarios, you’ll find the default strategy isn’t ideal. Here’s a selection guide I put together:
| Scenario | Recommended Strategy | Reason |
|---|---|---|
| Stateless API | round-robin | Uniform distribution, no special handling needed |
| WebSocket Service | least_conn | Dynamic connection monitoring, avoid overload on one server |
| Shopping Cart | ip_hash | Same user requests go to same server |
| Cache Proxy | hash $uri consistent | Stable key-to-server mapping preserves cache hit rate |
| Test Environment | random | Quick validation, simple config |
round-robin (Default Round-Robin)
No configuration means round-robin. Requests go to each server in sequence:
upstream backend {
server backend1.example.com;
server backend2.example.com;
server backend3.example.com;
}
Suitable for stateless services. Each request is independent, doesn’t depend on previous request state. Most REST APIs work with this.
least_conn (Least Connections)
Prioritize sending requests to the server with fewest current connections:
upstream websocket_app {
least_conn;
server ws1.example.com:8080;
server ws2.example.com:8080;
}
Typical for WebSocket services. Each user holds one long-lived connection, so connection counts fluctuate. With round-robin, one server might accumulate many long connections while new requests still land on it. least_conn tracks connection counts in real time and sends new requests to the least loaded server.
ip_hash (IP Hash)
Calculate hash value from client IP address—requests from same IP always go to same server:
upstream shopping_cart {
ip_hash;
server cart1.example.com;
server cart2.example.com;
}
Suitable for scenarios needing session consistency. Like e-commerce shopping cart—user added items on cart1, if next request goes to cart2, cart data is gone (unless you use distributed session storage). ip_hash solves this.
But ip_hash has a limitation: if one server goes down, users originally hashed to it get reassigned. Their sessions get lost. So ip_hash fits scenarios where session data isn’t critical, or paired with session sharing storage.
hash (Consistent Hash)
Custom hash key, supports consistent hashing algorithm:
upstream cache_proxy {
hash $uri consistent;
server cache1.example.com;
server cache2.example.com;
}
First choice for cache proxy scenarios. $uri uses request path as hash key. consistent parameter enables consistent hashing—when servers change, only some keys get remapped, not all shuffled. Cache hit rate won’t drop significantly.
random
Simple random distribution:
upstream test_backend {
random;
server test1.example.com;
server test2.example.com;
}
Good enough for test environments. Not recommended for production—lacks control.
Honestly, my experience: most web apps work fine with round-robin or least_conn. ip_hash and hash are solutions for specific scenarios—don’t force them just to “look advanced.”
3. Passive Health Check: max_fails and fail_timeout
“Passive” means: Nginx doesn’t proactively probe backend server health status, but judges by observing actual request success/failure. Like you wouldn’t knock on your neighbor’s door asking “are you okay”—you watch if they walk the dog, receive packages—judge indirectly through daily behavior.
max_fails and fail_timeout parameters configure this “observation mechanism”:
upstream backend {
server backend1.example.com max_fails=3 fail_timeout=30s;
server backend2.example.com max_fails=3 fail_timeout=30s;
}
location / {
proxy_pass http://backend;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
}
proxy_next_upstream specifies what counts as a “failure”: error is a connection error, timeout is a timeout, and http_500 through http_504 are the various HTTP error status codes. When one of these happens, Nginx forwards the request to the next server while recording a failure against the current one.
Three failures within 30 seconds and the server is marked unavailable. For the next 30 seconds, Nginx stops sending requests there. After that, Nginx tries once more: if it succeeds, the server recovers; if it fails, Nginx waits another 30 seconds.
This mechanism’s problem: slow fault response. At least 3 real requests must fail against a server before it gets removed. With proxy_next_upstream configured those requests are retried on healthy backends, but each retry adds latency; without it, those 3 users simply receive errors.
A more extreme case: a server that has just started and is still initializing. With max_fails=1, a single startup failure marks it unavailable, and it can keep flapping in and out of the pool as Nginx retries it every fail_timeout.
My suggestions:
- Set max_fails to 2 or 3, tolerate occasional network jitter
- Set fail_timeout to 30+ seconds, give server recovery chance
- Configure complete error type list in proxy_next_upstream, avoid missing failure cases
Passive checking’s advantage: simple config, supported by open-source Nginx out of the box. Its disadvantage: it relies on real user requests to trigger, so users feel the failure first. If you need faster fault detection, you have to probe the backend proactively. That’s active health checking.
4. Active Health Check: NGINX Plus and Open-Source Alternatives
Active health check logic: Nginx periodically sends probe requests to backend servers (like GET /health), judging server health by response status. No need to wait for user request failures—Nginx itself discovers faulty servers and removes them ahead of time.
Official solution is NGINX Plus (commercial), annual fee $3,675/instance. 10 instances costs $36,750/year. Honestly, that price is steep for many companies.
Open-source solution: nginx_upstream_check_module, developed by Taobao tech team. Requires recompiling Nginx to add this module, but functionality is quite complete:
| Feature | NGINX Plus | nginx_upstream_check_module |
|---|---|---|
| Price | $3,675/year | Open-source free |
| HTTP Check | Supported | Supported |
| TCP Check | Supported | Supported |
| MySQL Check | Not supported | Supported |
| FastCGI Check | Not supported | Supported |
| Status Page | Supported | Supported (check_status) |
The open-source module supports MySQL and FastCGI checks—features NGINX Plus doesn’t have. If your backend is PHP-FPM or MySQL, this module is more suitable.
nginx_upstream_check_module Configuration Example
upstream backend {
server backend1.example.com:8080;
server backend2.example.com:8080;
check interval=3000 rise=2 fall=5 timeout=1000 type=http;
check_http_send "GET /health HTTP/1.0\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;
}
server {
location / {
proxy_pass http://backend;
}
location /upstream_status {
check_status json;
allow 127.0.0.1;
allow 10.0.0.0/8;
deny all;
}
}
Parameter explanation:
- interval=3000: Send probe request every 3 seconds
- rise=2: After 2 consecutive successes, server marked healthy (newly started servers might be unstable, need consecutive success confirmation)
- fall=5: After 5 consecutive failures, server marked unavailable (tolerate occasional timeout)
- timeout=1000: Probe request timeout 1 second
- type=http: Use HTTP protocol for probing (also supports tcp, ssl_hello, mysql, ajp, fastcgi)
check_http_send defines the probe request content; here, a simple GET /health. The backend needs to implement a /health endpoint that returns a 200 or 3xx status code.
check_http_expect_alive specifies which status codes count as “healthy.” http_2xx and http_3xx mean 200-299 and 300-399 status codes all count as success.
check_status provides a status monitoring page. The JSON output is convenient for Prometheus or Zabbix integration. The allow/deny rules that follow are access control; this page must never be casually exposed.
Module Installation Method
nginx_upstream_check_module has to be compiled into Nginx. Rough steps:
# Download module source
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git
# Download Nginx source
wget http://nginx.org/download/nginx-1.24.0.tar.gz
tar -zxvf nginx-1.24.0.tar.gz
# Apply patch (choose based on Nginx version)
cd nginx-1.24.0
patch -p1 < ../nginx_upstream_check_module/check_1.20.1+.patch
# Compile
./configure --add-module=../nginx_upstream_check_module
make && make install
If you deploy with Docker, you can build an image containing the module yourself, or find community-ready images.
5. Production Practice: Security and Monitoring
Health check configuration in production has several pitfalls. I summarized three principles:
Security Configuration: Three Key Points
1. check_status page must have access control
The status page exposes the backend server list and health status. If it is reachable externally, attackers can map your internal topology and see which server is currently down: perfect timing for an attack.
location /upstream_status {
check_status json;
allow 127.0.0.1; # Local access
allow 10.0.0.0/8; # Internal IP
deny all; # Deny others
}
Or stricter—only allow specific monitoring server IPs.
2. Use dedicated health check port
Health check requests are frequent (every 3 to 5 seconds). If you probe the business port directly, the backend access log fills up with /health entries; log files bloat and performance suffers.
Recommend backend service listen on two ports: business port (like 8080) and health check port (like 8888). Health check port only returns simple status code, doesn’t handle business logic, doesn’t write logs.
check interval=5000 rise=2 fall=3 timeout=2000 type=http port=8888;
port=8888 specifies dedicated probe port.
3. Health check endpoint doesn’t return sensitive info
/health endpoint only needs to return status code. Don’t return version number, config info, memory usage, or other internal data. Attackers will use this info to locate vulnerabilities.
# Backend implementation example (Flask)
from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health():
    return '', 200  # Only return a status code, no body
Monitoring Integration: JSON Output
check_status supports multiple formats. JSON format works well with monitoring systems:
curl http://127.0.0.1/upstream_status
Output example:
{
"servers": {
"total": 3,
"generation": 12,
"server": [
{"index": 0, "name": "10.0.0.1:8080", "status": "up", "rise": 5, "fall": 0, "type": "http"},
{"index": 1, "name": "10.0.0.2:8080", "status": "up", "rise": 3, "fall": 0, "type": "http"},
{"index": 2, "name": "10.0.0.3:8080", "status": "down", "rise": 0, "fall": 5, "type": "http"}
]
}
}
generation is a config-change counter. Every time the upstream config is modified and reloaded, the generation value increases. Monitoring scripts can compare this value to confirm a config change actually took effect.
Parameter Tuning Suggestions
interval not below 3000ms
Probing too frequently stresses the backend. A 3 to 5 second interval works well: worst-case detection delay is roughly interval x fall (for example, 5 s x 3 = 15 s), which is fast enough not to hurt user experience.
rise and fall threshold balance
- rise too small (like 1): a server still warming up gets marked healthy after one lucky probe, fails real traffic, gets removed again, and oscillates in and out of the pool
- fall too small (like 1), one network jitter triggers removal, too sensitive
My experience values: rise=2, fall=3 or fall=5. Tolerate transient faults, confirm sustained faults before removal.
Complete Production Configuration
upstream web_app {
zone web_app 64k;
server 10.0.0.1:8080 weight=3;
server 10.0.0.2:8080;
server 10.0.0.3:8080 backup;
check interval=5000 rise=2 fall=3 timeout=2000 type=http port=8888;
check_http_send "GET /health HTTP/1.1\r\nHost: app.example.com\r\n\r\n";
check_http_expect_alive http_2xx;
}
server {
listen 80;
server_name app.example.com;
location / {
proxy_pass http://web_app;
proxy_set_header Host $host;
proxy_next_upstream error timeout http_502 http_503 http_504;
}
location /upstream_status {
check_status json;
allow 127.0.0.1;
allow 10.0.0.0/8;
deny all;
}
}
This configuration:
- zone shared memory ensures worker state sync
- Active health check probes dedicated port every 5 seconds
- Status page only allows internal network access
- proxy_next_upstream ensures requests on faulty server get forwarded to healthy backends
Conclusion
Back to that Double 11 incident. Afterward we added zone shared memory to the upstream, configured max_fails and fail_timeout, then compiled in nginx_upstream_check_module for active health checks. Now when a server crashes, Nginx detects and removes it within about 15 seconds (three failed probes at a 5-second interval), so users rarely see error responses.
Load balancing strategy choice, in one sentence: stateless services use round-robin or least_conn, stateful services use ip_hash or hash. Health check choice: production must have it, open-source solution uses nginx_upstream_check_module, don’t forget access control on check_status page.
If your Nginx still uses default round-robin without health check, suggest starting with passive check (max_fails + fail_timeout). Minimal change, immediate effect. After validating stability, consider upgrading to active health check. Try in test environment first, confirm config works before production deployment.
11 min read · Published on: Apr 27, 2026 · Modified on: Apr 29, 2026
Series: Nginx Practice Guide (Part 3 of 4). Previous post: Nginx SSL/TLS Configuration in Practice: From HTTPS Certificates to A+ Security Hardening.