Nginx Load Balancing in Practice: upstream Configuration and Health Checks
2 AM, phone buzzing like crazy. I open the monitoring dashboard—backend1 status bar is all red. A single application server went down.
That year during the Double 11 sale, we only had two backend servers. The Nginx config read upstream backend { server backend1; server backend2; }, which looked symmetric enough. But we were running plain default round-robin: no weights tuned, and, crucially, no health check configured.
When backend1 crashed, user order requests kept going to the dead server; Nginx had no idea it was down and kept forwarding requests there. What users saw were 500 error pages. By the time ops manually removed backend1 from the upstream, 15 minutes had passed.
After that incident, I caught up on what I’d missed: Nginx upstream isn’t just about listing server addresses. Weight distribution, health checks, failover—these are what production environments actually need. This article compiles the pitfalls I hit and the configuration methods I learned afterward, including how to implement active health checks with open-source Nginx (without paying for NGINX Plus).
1. upstream Basic Configuration: From Single Server to Cluster
The core purpose of an upstream block is simple: package multiple servers into a logical group so Nginx knows where to forward requests. But its parameters are richer than many people imagine.
upstream backend {
zone backend 64k;
server backend1.example.com weight=3 max_fails=2 fail_timeout=30s;
server backend2.example.com;
server backup1.example.com backup;
}
Let me explain line by line:
zone backend 64k: Shared memory zone. Nginx worker processes need to share backend server state (who’s alive, who’s down). 64k is a reasonable starting value; increase it if you have many servers. Without this line, each worker tracks state on its own, so one worker can keep sending requests to a server that another worker has already seen fail.
weight=3: Weight. backend1 has weight 3, backend2 defaults to 1. This means out of 4 requests, 3 go to backend1, 1 to backend2. Suitable for heterogeneous backend servers—for example, backend1 is 8-core 16GB, backend2 is 4-core 8GB.
max_fails=2: Failure threshold. If 2 requests to this server fail within the fail_timeout window, Nginx marks it unavailable. The default is 1, which is too sensitive: a single network blip trips it. Production should use 2 or 3.
fail_timeout=30s: Dual meaning. First, the failure counting window is 30 seconds; second, after a server is marked unavailable, Nginx will try connecting it again after 30 seconds. Default 10 seconds may not be enough for slow-starting services.
backup: Backup server. Only when all primary servers are unavailable does the backup server receive requests. Useful for keeping a lower-spec machine as a fallback.
There’s also a down parameter for manually marking a server offline, often used during maintenance:
server backend3.example.com down; # Temporarily offline for maintenance
In practice, I’ve seen many people skip the zone config. The result: worker processes each maintain state separately. One worker discovers a server is down, others still send requests there. Adding zone solves the state sync problem.
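To build intuition for how weight shapes traffic distribution, here is a minimal Python sketch of the smooth weighted round-robin algorithm nginx uses internally. The server names and weights mirror the example above; this is an illustration for intuition, not nginx's actual C implementation.

```python
def smooth_wrr(peers, n):
    """peers: dict of server name -> weight. Returns n picks using
    nginx-style smooth weighted round-robin bookkeeping."""
    current = {name: 0 for name in peers}
    total = sum(peers.values())
    picks = []
    for _ in range(n):
        for name, weight in peers.items():
            current[name] += weight           # raise every peer by its weight
        best = max(current, key=current.get)  # pick the highest current weight
        current[best] -= total                # penalize the chosen peer
        picks.append(best)
    return picks

# With weight=3 vs the default 1: 3 of every 4 requests go to backend1
print(smooth_wrr({"backend1": 3, "backend2": 1}, 4))
```

The "smooth" part is why backend1 does not receive three requests in a row: the chosen peer is penalized by the total weight, so the picks interleave.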
2. Five Load Balancing Strategies: When to Use Each?
Is the default round-robin strategy enough? It depends.
I’ve seen plenty of projects run the default round-robin for years without issues. But once you hit WebSocket long connections, shopping-cart sessions, or cache-penetration scenarios, you’ll find the default strategy isn’t ideal. Here’s a selection guide I put together:
| Scenario | Recommended Strategy | Reason |
|---|---|---|
| Stateless API | round-robin | Uniform distribution, no special handling needed |
| WebSocket Service | least_conn | Dynamic connection monitoring, avoid overload on one server |
| Shopping Cart | ip_hash | Same user requests go to same server |
| Cache Proxy | hash $uri consistent | Stable key-to-server mapping preserves cache hit rate |
| Test Environment | random | Quick validation, simple config |
round-robin (Default Round-Robin)
No configuration means round-robin. Requests go to each server in sequence:
upstream backend {
server backend1.example.com;
server backend2.example.com;
server backend3.example.com;
}
Suitable for stateless services. Each request is independent, doesn’t depend on previous request state. Most REST APIs work with this.
least_conn (Least Connections)
Prioritize sending requests to the server with fewest current connections:
upstream websocket_app {
least_conn;
server ws1.example.com:8080;
server ws2.example.com:8080;
}
Typical for WebSocket services. Each user holds one long-lived connection, so connection counts fluctuate. With round-robin, one server might accumulate many long connections while new requests still land on it. least_conn tracks connection counts in real time and sends new requests to the least loaded server.
ip_hash (IP Hash)
Calculate hash value from client IP address—requests from same IP always go to same server:
upstream shopping_cart {
ip_hash;
server cart1.example.com;
server cart2.example.com;
}
Suitable for scenarios needing session consistency. Like e-commerce shopping cart—user added items on cart1, if next request goes to cart2, cart data is gone (unless you use distributed session storage). ip_hash solves this.
But ip_hash has a limitation: if one server goes down, users originally hashed to it get reassigned. Their sessions get lost. So ip_hash fits scenarios where session data isn’t critical, or paired with session sharing storage.
hash (Consistent Hash)
Custom hash key, supports consistent hashing algorithm:
upstream cache_proxy {
hash $uri consistent;
server cache1.example.com;
server cache2.example.com;
}
First choice for cache proxy scenarios. $uri uses request path as hash key. consistent parameter enables consistent hashing—when servers change, only some keys get remapped, not all shuffled. Cache hit rate won’t drop significantly.
random
Simple random distribution:
upstream test_backend {
random;
server test1.example.com;
server test2.example.com;
}
Good enough for test environments. Not recommended for production—lacks control.
Honestly, my experience: most web apps work fine with round-robin or least_conn. ip_hash and hash are solutions for specific scenarios—don’t force them just to “look advanced.”
3. Passive Health Check: max_fails and fail_timeout
“Passive” means: Nginx doesn’t proactively probe backend server health status, but judges by observing actual request success/failure. Like you wouldn’t knock on your neighbor’s door asking “are you okay”—you watch if they walk the dog, receive packages—judge indirectly through daily behavior.
max_fails and fail_timeout parameters configure this “observation mechanism”:
upstream backend {
server backend1.example.com max_fails=3 fail_timeout=30s;
server backend2.example.com max_fails=3 fail_timeout=30s;
}
location / {
proxy_pass http://backend;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
}
proxy_next_upstream specifies what counts as a “failure”: error is a connection error, timeout is a timeout, and http_500 through http_504 are the various HTTP error status codes. When one of these happens, Nginx forwards the request to the next server while recording a failure against the current one.
Three failures within 30 seconds and the server is marked unavailable. For the next 30 seconds, Nginx stops sending requests there. After that, Nginx tries once more: if it succeeds, the server recovers; if it fails, Nginx waits another 30 seconds.
This mechanism’s problem: slow fault response. At least 3 real requests must fail against a server before it gets removed. With proxy_next_upstream configured those requests are retried on healthy backends, but each retry adds latency; without it, those 3 users simply receive errors.
A more extreme case: a server that has just started and is still initializing. With max_fails=1, a single startup failure marks it unavailable, and it can keep flapping in and out of the pool as Nginx retries it every fail_timeout.
My suggestions:
- Set max_fails to 2 or 3, tolerate occasional network jitter
- Set fail_timeout to 30+ seconds, give server recovery chance
- Configure complete error type list in proxy_next_upstream, avoid missing failure cases
Passive checking’s advantage: simple config, supported by open-source Nginx out of the box. Its disadvantage: it relies on real user requests to trigger, so users feel the failure first. If you need faster fault detection, you have to probe the backend proactively. That’s active health checking.
4. Active Health Check: NGINX Plus and Open-Source Alternatives
Active health check logic: Nginx periodically sends probe requests to backend servers (like GET /health), judging server health by response status. No need to wait for user request failures—Nginx itself discovers faulty servers and removes them ahead of time.
Official solution is NGINX Plus (commercial), annual fee $3,675/instance. 10 instances costs $36,750/year. Honestly, that price is steep for many companies.
Open-source solution: nginx_upstream_check_module, developed by Taobao tech team. Requires recompiling Nginx to add this module, but functionality is quite complete:
| Feature | NGINX Plus | nginx_upstream_check_module |
|---|---|---|
| Price | $3,675/year | Open-source free |
| HTTP Check | Supported | Supported |
| TCP Check | Supported | Supported |
| MySQL Check | Not supported | Supported |
| FastCGI Check | Not supported | Supported |
| Status Page | Supported | Supported (check_status) |
The open-source module supports MySQL and FastCGI checks—features NGINX Plus doesn’t have. If your backend is PHP-FPM or MySQL, this module is more suitable.
nginx_upstream_check_module Configuration Example
upstream backend {
server backend1.example.com:8080;
server backend2.example.com:8080;
check interval=3000 rise=2 fall=5 timeout=1000 type=http;
check_http_send "GET /health HTTP/1.0\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;
}
server {
location / {
proxy_pass http://backend;
}
location /upstream_status {
check_status json;
allow 127.0.0.1;
allow 10.0.0.0/8;
deny all;
}
}
Parameter explanation:
- interval=3000: Send probe request every 3 seconds
- rise=2: After 2 consecutive successes, server marked healthy (newly started servers might be unstable, need consecutive success confirmation)
- fall=5: After 5 consecutive failures, server marked unavailable (tolerate occasional timeout)
- timeout=1000: Probe request timeout 1 second
- type=http: Use HTTP protocol for probing (also supports tcp, ssl_hello, mysql, ajp, fastcgi)
check_http_send defines the probe request content; here, a simple GET /health. The backend needs to implement a /health endpoint that returns a 200 or 3xx status code.
check_http_expect_alive specifies which status codes count as “healthy.” http_2xx and http_3xx mean 200-299 and 300-399 status codes all count as success.
check_status provides a status monitoring page. The JSON output is convenient for Prometheus or Zabbix integration. The allow/deny rules that follow are access control; this page must never be casually exposed.
Module Installation Method
nginx_upstream_check_module has to be compiled into Nginx. Rough steps:
# Download module source
git clone https://github.com/yaoweibin/nginx_upstream_check_module.git
# Download Nginx source
wget http://nginx.org/download/nginx-1.24.0.tar.gz
tar -zxvf nginx-1.24.0.tar.gz
# Apply patch (choose based on Nginx version)
cd nginx-1.24.0
patch -p1 < ../nginx_upstream_check_module/check_1.20.1+.patch
# Compile
./configure --add-module=../nginx_upstream_check_module
make && make install
If you deploy with Docker, you can build an image containing the module yourself, or find community-ready images.
5. Production Practice: Security and Monitoring
Health check configuration in production has several pitfalls. I summarized three principles:
Security Configuration: Three Key Points
1. check_status page must have access control
The status page exposes the backend server list and health status. If it is reachable externally, attackers can map your internal topology and see which server is currently down: perfect timing for an attack.
location /upstream_status {
check_status json;
allow 127.0.0.1; # Local access
allow 10.0.0.0/8; # Internal IP
deny all; # Deny others
}
Or stricter—only allow specific monitoring server IPs.
2. Use dedicated health check port
Health check requests are frequent (every 3 to 5 seconds). If you probe the business port directly, the backend access log fills up with /health entries; log files bloat and performance suffers.
Recommend backend service listen on two ports: business port (like 8080) and health check port (like 8888). Health check port only returns simple status code, doesn’t handle business logic, doesn’t write logs.
check interval=5000 rise=2 fall=3 timeout=2000 type=http port=8888;
port=8888 specifies dedicated probe port.
3. Health check endpoint doesn’t return sensitive info
/health endpoint only needs to return status code. Don’t return version number, config info, memory usage, or other internal data. Attackers will use this info to locate vulnerabilities.
# Backend implementation example (Flask)
from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health():
    return '', 200  # Only return a status code, no body
Monitoring Integration: JSON Output
check_status supports multiple formats. JSON format works well with monitoring systems:
curl http://127.0.0.1/upstream_status
Output example:
{
"servers": {
"total": 3,
"generation": 12,
"server": [
{"index": 0, "name": "10.0.0.1:8080", "status": "up", "rise": 5, "fall": 0, "type": "http"},
{"index": 1, "name": "10.0.0.2:8080", "status": "up", "rise": 3, "fall": 0, "type": "http"},
{"index": 2, "name": "10.0.0.3:8080", "status": "down", "rise": 0, "fall": 5, "type": "http"}
]
}
}
generation is a config-change counter. Every time the upstream config is modified and reloaded, the generation value increases. Monitoring scripts can compare this value to confirm a config change actually took effect.
Parameter Tuning Suggestions
interval not below 3000ms
Probing too frequently stresses the backend. A 3 to 5 second interval works well: worst-case detection delay is roughly interval x fall (for example, 5 s x 3 = 15 s), which is fast enough not to hurt user experience.
rise and fall threshold balance
- rise too small (like 1): a server still warming up gets marked healthy after one lucky probe, fails real traffic, gets removed again, and oscillates in and out of the pool
- fall too small (like 1), one network jitter triggers removal, too sensitive
My experience values: rise=2, fall=3 or fall=5. Tolerate transient faults, confirm sustained faults before removal.
Complete Production Configuration
upstream web_app {
zone web_app 64k;
server 10.0.0.1:8080 weight=3;
server 10.0.0.2:8080;
server 10.0.0.3:8080 backup;
check interval=5000 rise=2 fall=3 timeout=2000 type=http port=8888;
check_http_send "GET /health HTTP/1.1\r\nHost: app.example.com\r\n\r\n";
check_http_expect_alive http_2xx;
}
server {
listen 80;
server_name app.example.com;
location / {
proxy_pass http://web_app;
proxy_set_header Host $host;
proxy_next_upstream error timeout http_502 http_503 http_504;
}
location /upstream_status {
check_status json;
allow 127.0.0.1;
allow 10.0.0.0/8;
deny all;
}
}
This configuration:
- zone shared memory ensures worker state sync
- Active health check probes dedicated port every 5 seconds
- Status page only allows internal network access
- proxy_next_upstream ensures requests on faulty server get forwarded to healthy backends
Conclusion
Back to that Double 11 incident. Afterward we added zone shared memory to the upstream, configured max_fails and fail_timeout, then compiled in nginx_upstream_check_module for active health checks. Now when a server crashes, Nginx detects and removes it within about 15 seconds (three failed probes at a 5-second interval), so users rarely see error responses.
Load balancing strategy choice, in one sentence: stateless services use round-robin or least_conn, stateful services use ip_hash or hash. Health check choice: production must have it, open-source solution uses nginx_upstream_check_module, don’t forget access control on check_status page.
If your Nginx still uses default round-robin without health check, suggest starting with passive check (max_fails + fail_timeout). Minimal change, immediate effect. After validating stability, consider upgrading to active health check. Try in test environment first, confirm config works before production deployment.
11 min read · Published on: Apr 27, 2026 · Modified on: Apr 29, 2026
Series: Nginx Practice Guide (Part 3 of 4). Previous post: Nginx SSL/TLS Configuration in Practice: From HTTPS Certificates to A+ Security Hardening.