Nginx Dynamic Upstream: Real-Time Service Discovery with Lua

It’s 3 AM, and production alerts are blaring.

Docker containers restarted. IPs changed. Your nginx.conf still has the old addresses.

You have to drag yourself out of bed, manually update the config, and run nginx -s reload. Online QPS hiccups, monitoring curves dip. If you’re lucky, recovery takes seconds. If not, customer complaints start flooding in.

Honestly, I’ve been there more than once. Every time, I wondered: Isn’t there a way for Nginx to discover backend services automatically? Like Consul does—when backend IPs change, it updates automatically without me having to wake up at midnight to tweak configs?

Actually, OpenResty has been able to do this for years. Its Lua scripts can modify upstream configuration at runtime without any reload. Cloudflare uses exactly this mechanism—their CDN edge nodes rely on it for dynamic traffic scheduling.

This article explains how to implement dynamic upstream using a three-layer architecture (ngx.balancer + lua-resty-balancer + health checks), compares two mainstream health check libraries, and provides complete integration code for Consul, Nacos, and etcd service discovery. By the end, you’ll be ready to deploy this to production.

Why Dynamic Upstream is Essential

Nginx’s upstream configuration is static. The server addresses you write in nginx.conf are loaded once at startup. Want to change them later? You need a reload.

This becomes frustrating in containerized environments. Docker containers restart, IP addresses change. K8s pods get rescheduled, IPs change again. You can’t manually update nginx.conf every time, can you? The technical team at Zhubajie.com learned this the hard way: they evolved from manual configuration to template rendering, and finally implemented Consul-based dynamic service discovery because nothing else kept up.

Some suggest using NGINX Plus, the commercial version that supports dynamic upstream. True, but that costs tens of thousands of dollars per year in licensing fees, and the code isn’t open source. When issues arise, you’re stuck waiting for official fixes. For most teams, this isn’t a viable choice.

OpenResty offers another path. It embeds a LuaJIT VM into Nginx, allowing you to use Lua scripts to modify upstream configuration at runtime. No reload needed—backend servers can be dynamically switched during request processing.

The killer feature is balancer_by_lua_block. It intercepts during Nginx’s upstream server selection phase, letting your Lua code decide which backend handles this request. The backend IP list can be stored in shared memory, Redis, or Consul. When a backend fails, Lua code automatically removes it. When new services come online, Lua code automatically discovers them.

There are quite a few applicable scenarios:

  • K8s Ingress Gateway: Pod IPs change frequently; Nginx as Ingress needs dynamic awareness
  • Microservice Canary Deployment: Old and new versions coexist; dynamic routing based on request headers or cookies
  • Automatic Failover: Backend services slow down or crash; Nginx proactively detects and removes them from the pool
  • Cross-Datacenter Scheduling: Dynamically select the nearest datacenter based on user location or latency

Cloudflare’s CDN edge nodes rely on this mechanism. Hundreds of nodes worldwide, processing tens of millions of requests per second, all controlled dynamically through OpenResty traffic scheduling. They’ve open-sourced portions of the implementation—you can find the relevant code on GitHub.

Three-Layer Architecture and Core Components

OpenResty’s dynamic upstream isn’t a single breakthrough—it’s three layers working together:

┌─────────────────────────────────────────┐
│  Third Layer: Health Checks              │
│  lua-resty-healthcheck                   │
│  - Actively probe backend health status  │
│  - Update upstream status in shared mem  │
└──────────────┬──────────────────────────┘
               │ Status sync
┌──────────────▼──────────────────────────┐
│  Second Layer: Load Balancing Algorithms │
│  lua-resty-balancer                      │
│  - resty.roundrobin (round-robin)       │
│  - resty.chash (consistent hashing)     │
│  - Read healthy backend list from shmem  │
└──────────────┬──────────────────────────┘
               │ Selection result
┌──────────────▼──────────────────────────┐
│  First Layer: Low-Level API              │
│  ngx.balancer                            │
│  - set_current_peer(host, port)         │
│  - get_last_failure()                   │
│  - set_more_tries(n)                    │
│  - Called in balancer_by_lua phase      │
└─────────────────────────────────────────┘

ngx.balancer: Low-Level API

This layer is closest to Nginx’s core. The ngx.balancer module provides three core APIs:

  • set_current_peer(host, port): Specifies which backend to forward this request to
  • get_last_failure(): Gets failure information from the last attempt (for retry logic)
  • set_more_tries(n): Sets additional retry attempts

These APIs must be called within balancer_by_lua_block. This phase is when Nginx selects an upstream server—once your Lua code intervenes, it takes over routing decisions completely.

A minimal example:

upstream backend {
    server 0.0.0.1;  # Placeholder address, must have one server directive
    balancer_by_lua_block {
        local balancer = require "ngx.balancer"

        -- Dynamically select backend
        local host = "192.168.1.10"
        local port = 8080

        local ok, err = balancer.set_current_peer(host, port)
        if not ok then
            ngx.log(ngx.ERR, "failed to set peer: ", err)
            return ngx.exit(500)
        end
    }
}

Note: server 0.0.0.1 is a placeholder. Nginx requires at least one server directive in the upstream block, but since we’re using Lua to dynamically select the real backend, this address will never be accessed.

lua-resty-balancer: Load Balancing Algorithms

Using ngx.balancer directly is too primitive. You’d have to write your own round-robin, your own hashing, maintain your own backend list. lua-resty-balancer packages these algorithms, ready to use out of the box.

It provides two load balancers:

  • resty.roundrobin: Round-robin, selecting backend servers in sequence
  • resty.chash: Consistent hashing, routing the same client’s requests to the same backend (suitable for session persistence)

Before use, initialize in init_worker_by_lua_block:

init_worker_by_lua_block {
    local roundrobin = require "resty.roundrobin"

    -- Backend server list (can be dynamically fetched from Consul/Nacos).
    -- resty.roundrobin expects a map of id => weight; "host:port" works
    -- well as the id.
    local servers = {
        ["192.168.1.10:8080"] = 10,
        ["192.168.1.11:8080"] = 5,
        ["192.168.1.12:8080"] = 3,
    }

    -- Create the round-robin load balancer
    local rr_upstream = roundrobin:new(servers)

    -- Stash it per worker for the balancer phase to use. Note that
    -- ngx.shared dicts can only hold strings, numbers, and booleans,
    -- not Lua objects, so the object lives in the worker's Lua VM.
    package.loaded.backend_rr = rr_upstream
}

Then use in balancer_by_lua_block:

upstream backend {
    server 0.0.0.1;
    balancer_by_lua_block {
        local balancer = require "ngx.balancer"

        -- Fetch the per-worker balancer created in init_worker
        local rr_upstream = package.loaded.backend_rr

        -- Pick the next server; find() returns the id ("host:port")
        local server = rr_upstream:find()
        local host, port = server:match("^(.+):(%d+)$")

        balancer.set_current_peer(host, tonumber(port))
    }
}
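
resty.chash works the same way. For session persistence, hash on a stable request attribute (the client IP, a cookie) so the same client keeps landing on the same backend. A minimal sketch, assuming the same "host:port" => weight map as above:

upstream backend_sticky {
    server 0.0.0.1;
    balancer_by_lua_block {
        local balancer = require "ngx.balancer"
        local chash = require "resty.chash"

        -- In practice, create this once in init_worker and reuse it
        local ch = chash:new({
            ["192.168.1.10:8080"] = 10,
            ["192.168.1.11:8080"] = 5,
        })

        -- Hash on the client IP so one client sticks to one backend
        local server = ch:find(ngx.var.remote_addr)
        local host, port = server:match("^(.+):(%d+)$")
        balancer.set_current_peer(host, tonumber(port))
    }
}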

Runtime Phases Explained

Nginx processes requests through a strict sequence of phases. Understanding these phases is essential for placing Lua code correctly:

1. init_by_lua_block     → When Nginx master process starts
2. init_worker_by_lua    → When each worker process starts
3. ssl_certificate_by_lua → SSL handshake phase
4. set_by_lua            → Variable assignment processing
5. rewrite_by_lua        → URL rewriting phase
6. access_by_lua         → Access control phase
7. balancer_by_lua       → Select upstream server (core)
8. header_filter_by_lua  → Process response headers
9. body_filter_by_lua    → Process response body
10. log_by_lua           → Logging phase

balancer_by_lua_block runs in phase 7. At this point, the request hasn’t been forwarded yet—you can decide where to send it. In retry scenarios (backend returns error), get_last_failure() tells you why the last attempt failed, so you can select another backend.

Health Check Implementation Comparison

The final layer of dynamic upstream is health checking. Backend servers can fail at any time—you need proactive probing rather than discovering failures only through failed requests.

The OpenResty community has two mainstream solutions: the official lua-resty-upstream-healthcheck and the more comprehensive lua-resty-healthcheck. After hitting pitfalls with both, I strongly recommend the latter.

lua-resty-upstream-healthcheck: Official Solution

This is the health check library officially maintained by OpenResty. It provides active checking, periodically sending HTTP requests in the background to probe backend status.

Configuration example:

# Configure shared memory in the nginx.conf http block
lua_shared_dict healthcheck 1m;

# Start the health checker in init_worker_by_lua_block
init_worker_by_lua_block {
    local hc = require "resty.upstream.healthcheck"

    local ok, err = hc.spawn_checker{
        shm = "healthcheck",             -- Shared memory name
        upstream = "backend",            -- Upstream name
        type = "http",                   -- Check type (http or tcp)

        -- Health check request content
        http_req = "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n",

        interval = 2000,   -- Probe interval: 2000 milliseconds (2 seconds)
        timeout = 1000,    -- Single probe timeout: 1 second
        fall = 3,          -- Mark down after 3 consecutive failures
        rise = 2,          -- Mark up after 2 consecutive successes

        valid_statuses = { 200, 302 },   -- HTTP status codes considered successful
    }

    if not ok then
        ngx.log(ngx.ERR, "failed to spawn health checker: ", err)
    end
}

After starting, the library sends requests to each backend server’s /health path every 2 seconds. If 3 consecutive failures occur, the server is marked as down and subsequent load balancing won’t select it. After recovery, 2 consecutive successes are needed to mark it up again.

Its status data is stored in the shared memory you configured (lua_shared_dict healthcheck). You can read this status during the balancer_by_lua_block phase to decide whether to select a particular backend.
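
The library also ships a status_page() helper that renders the current up/down state of all monitored peers, which is handy for debugging. A minimal status endpoint sketch (the listen port and path here are arbitrary choices):

server {
    listen 127.0.0.1:8090;

    location = /status {
        access_log off;
        default_type text/plain;
        content_by_lua_block {
            local hc = require "resty.upstream.healthcheck"
            ngx.say("Nginx Worker PID: ", ngx.worker.pid())
            ngx.print(hc.status_page())
        }
    }
}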

The official library works, but isn’t feature-complete. It only supports active checking, not passive checking (dynamic adjustment based on actual request failures). Plus, it has bugs in certain edge cases.

lua-resty-healthcheck is a community-enhanced version with more features:

  • Active checking: Periodically sends HTTP/TCP probe requests
  • Passive checking: Adjusts status based on the outcomes of real requests that your code reports to the checker
  • More flexible configuration: Supports custom check logic and callback functions
  • More stable: Production-validated at scale by projects like Apache APISIX

Configuration example:

# Also needs shared memory
lua_shared_dict healthcheck 2m;

init_worker_by_lua_block {
    local healthcheck = require "resty.healthcheck"

    local checker = healthcheck.new({
        name = "backend_checker",
        shm_name = "healthcheck",

        checks = {
            active = {
                type = "http",
                http_path = "/health",
                healthy = {
                    interval = 2,     -- Probe every 2 seconds
                    successes = 2,    -- Mark up after 2 consecutive successes
                },
                unhealthy = {
                    interval = 1,     -- Probe every 1 second (more frequent when down)
                    tcp_failures = 1, -- Mark down immediately on TCP connection failure
                    http_failures = 3, -- Mark down after 3 HTTP failures
                },
            },
            passive = {
                healthy = {
                    successes = 3,    -- Auto-mark up after 3 successful normal requests
                },
                unhealthy = {
                    tcp_failures = 2, -- Auto-mark down after 2 TCP failures
                    http_failures = 3, -- Auto-mark down after 3 HTTP failures
                },
            },
        },
    })

    -- Add backend servers to check
    checker:add_target("192.168.1.10", 8080, "backend", true)
    checker:add_target("192.168.1.11", 8080, "backend", true)
    checker:add_target("192.168.1.12", 8080, "backend", true)

    -- Stash the checker per worker so later phases (balancer, log) can
    -- reach it; shared dicts cannot hold Lua objects
    package.loaded.backend_checker = checker
}

The power of passive checking: even if your active probes haven’t caught a problem yet, a burst of failing real requests pushes the backend to down, which reacts faster to sudden failures. One thing to know: passive checks count the request outcomes your own code reports back to the checker, as sketched below.
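
A sketch of that reporting from the log phase, assuming the checker object was stashed in package.loaded as in the init_worker block above (report_http_status is part of the lua-resty-healthcheck API; the address parsing is illustrative):

log_by_lua_block {
    local checker = package.loaded.backend_checker

    -- On retries, $upstream_addr / $upstream_status hold comma-separated
    -- lists; take the last entry (the peer that actually answered)
    local addr = ngx.var.upstream_addr
    local status = ngx.var.upstream_status

    if checker and addr then
        local ip, port = addr:match("([%d%.]+):(%d+)%s*$")
        local code = status and tonumber(status:match("(%d+)%s*$"))
        if ip and code then
            -- "backend" matches the hostname used in add_target
            checker:report_http_status(ip, tonumber(port), "backend",
                                       code, "passive")
        end
    end
}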

Comparison of Both Solutions

| Comparison Point | lua-resty-upstream-healthcheck | lua-resty-healthcheck |
| --- | --- | --- |
| Maintainer | OpenResty official | Community (APISIX-validated) |
| Active checking | Supported | Supported |
| Passive checking | Not supported | Supported |
| Configuration flexibility | Lower | High (callbacks, custom logic) |
| Production stability | Average (known bugs) | High (large-scale validation) |
| Documentation quality | Official docs | Detailed, with examples |
| Recommendation | Good for learning | Production recommended |

Honestly, I started with the official library. Later, in production, I encountered an issue: a backend service returned 200 status code but the response body contained error messages (internal service failure). The official library couldn’t detect this “fake healthy” state. After switching to lua-resty-healthcheck, I customized the check logic to parse response body content and determine true health—the problem was solved.

My recommendation: Use lua-resty-healthcheck directly. Its code is cleaner too, and Apache APISIX is built on it—you can reference APISIX’s health check configuration.

Service Discovery Integration in Practice

Health checking solves the “what if a backend fails” problem. But there’s a prerequisite: where does the backend list come from?

In containerized environments, backend service IPs change frequently. You can’t hard-code them in configuration. You need a service registry to tell Nginx which services are currently running.

Three common solutions exist: Consul, Nacos, and etcd. I’ll provide integration code for each.

Consul Integration: Most Mature Solution

Consul is HashiCorp’s service discovery tool, widely used in microservice architectures. It provides service registration, health checking, KV storage, and more.

The integration approach: periodically pull service lists from Consul API in the background, update to shared memory.

Complete implementation code:

# Configure shared memory (stores the service list)
lua_shared_dict upstream_servers 5m;

# Periodically pull the service list from Consul
init_worker_by_lua_block {
    local http = require "resty.http"
    local cjson = require "cjson.safe"

    -- Consul service discovery API address
    local consul_host = "consul.service.consul"
    local consul_port = 8500
    local service_name = "backend"

    -- Function to update service list
    local function update_upstream(premature)
        if premature then return end

        local httpc = http.new()
        httpc:set_timeout(1000)  -- 1 second timeout

        -- Call the Consul Health API; "?passing" returns only instances
        -- whose Consul health checks are currently passing
        local res, err = httpc:request_uri(
            "http://" .. consul_host .. ":" .. consul_port ..
            "/v1/health/service/" .. service_name .. "?passing",
            {
                method = "GET",
                headers = { Accept = "application/json" }
            }
        )

        if not res then
            ngx.log(ngx.ERR, "failed to query consul: ", err)
            return
        end

        -- Parse service list returned by Consul
        local services = cjson.decode(res.body)
        if not services or #services == 0 then
            ngx.log(ngx.WARN, "no backend services found in consul")
            return
        end

        -- Build the backend server list
        local servers = {}
        for _, entry in ipairs(services) do
            -- Each entry carries Node and Service objects; the service
            -- address falls back to the node address when empty
            local svc = entry.Service
            servers[#servers + 1] = {
                (svc.Address ~= "" and svc.Address) or entry.Node.Address,
                svc.Port,
                weight = 10  -- Default weight
            }
        end

        -- Store in shared memory
        local shared_dict = ngx.shared.upstream_servers
        local packed = cjson.encode(servers)
        shared_dict:set("backend_servers", packed)

        ngx.log(ngx.INFO, "updated upstream servers: ", #servers, " instances")
    end

    -- Update the service list every 5 seconds (ngx.timer is a built-in
    -- API, no require needed)
    ngx.timer.every(5, update_upstream)

    -- Execute immediately on startup
    update_upstream(false)
}

Read this data in balancer_by_lua_block:

upstream backend {
    server 0.0.0.1;
    balancer_by_lua_block {
        local cjson = require "cjson.safe"
        local roundrobin = require "resty.roundrobin"
        local shared_dict = ngx.shared.upstream_servers

        -- Read service list from shared memory
        local packed = shared_dict:get("backend_servers")
        if not packed then
            ngx.log(ngx.ERR, "no upstream servers available")
            return ngx.exit(503)
        end

        local servers = cjson.decode(packed)

        -- resty.roundrobin wants a map of "host:port" => weight. Building
        -- it per request is simple but wasteful; in production, cache the
        -- object per worker and rebuild only when the list changes.
        local nodes = {}
        for _, srv in ipairs(servers) do
            nodes[srv[1] .. ":" .. srv[2]] = srv.weight or 1
        end

        local rr = roundrobin:new(nodes)
        local server = rr:find()
        local host, port = server:match("^(.+):(%d+)$")

        -- Set backend
        local balancer = require "ngx.balancer"
        local ok, err = balancer.set_current_peer(host, tonumber(port))
        if not ok then
            ngx.log(ngx.ERR, "failed to set peer: ", err)
            return ngx.exit(500)
        end
    }
}

This approach has an advantage: Consul has built-in health checking. When registering services, you can configure HTTP health check paths, and Consul will probe automatically. Querying the Health API with ?passing returns only healthy instances, so Nginx gets a pre-filtered list.
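
For reference, a registration payload with such a check looks roughly like this (sent to Consul via PUT /v1/agent/service/register; the address and intervals are illustrative):

{
  "Name": "backend",
  "Address": "192.168.1.10",
  "Port": 8080,
  "Check": {
    "HTTP": "http://192.168.1.10:8080/health",
    "Interval": "5s",
    "Timeout": "1s",
    "DeregisterCriticalServiceAfter": "1m"
  }
}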

Nacos Integration: Common in China

Nacos is Alibaba’s open-source service discovery and configuration management platform, very popular in China’s microservice community. Spring Cloud Alibaba uses Nacos by default.

Nacos’s service discovery API is similar to Consul’s, but with slightly different formatting.

Integration code:

lua_shared_dict upstream_servers 5m;

init_worker_by_lua_block {
    local http = require "resty.http"
    local cjson = require "cjson.safe"

    -- Nacos configuration
    local nacos_host = "nacos.service.nacos"
    local nacos_port = 8848
    local namespace_id = "public"  -- Nacos namespace
    local service_name = "backend-service"
    local group_name = "DEFAULT_GROUP"

    local function update_from_nacos(premature)
        if premature then return end

        local httpc = http.new()
        httpc:set_timeout(2000)

        -- Nacos service discovery API
        local url = "http://" .. nacos_host .. ":" .. nacos_port ..
                    "/nacos/v1/ns/instance/list?serviceName=" .. service_name ..
                    "&groupName=" .. group_name ..
                    "&namespaceId=" .. namespace_id

        local res, err = httpc:request_uri(url, { method = "GET" })

        if not res then
            ngx.log(ngx.ERR, "failed to query nacos: ", err)
            return
        end

        local data = cjson.decode(res.body)
        if not data or not data.hosts then
            ngx.log(ngx.WARN, "no instances found in nacos")
            return
        end

        -- Nacos returns service instance list in hosts field
        local servers = {}
        for _, instance in ipairs(data.hosts) do
            -- Only use instances where healthy=true
            if instance.healthy then
                servers[#servers + 1] = {
                    instance.ip,
                    instance.port,
                    weight = instance.weight or 10
                }
            end
        end

        local shared_dict = ngx.shared.upstream_servers
        shared_dict:set("backend_servers", cjson.encode(servers))

        ngx.log(ngx.INFO, "updated from nacos: ", #servers, " instances")
    end

    ngx.timer.every(5, update_from_nacos)
    update_from_nacos(false)
}

Nacos has a unique feature: supports dynamic weight adjustment. You modify an instance’s weight in the Nacos console, and Nginx senses it on the next pull—the traffic distribution ratio adjusts accordingly. This is perfect for canary deployment scenarios—you want the new service version to handle a small amount of traffic initially, then gradually increase.

etcd Integration: Lightweight Solution

etcd is a distributed KV store originally developed by CoreOS, used by Kubernetes to store cluster state. If your backend service registration information lives in etcd, you can read it directly from there.

Integration code:

lua_shared_dict upstream_servers 5m;

init_worker_by_lua_block {
    local http = require "resty.http"
    local cjson = require "cjson.safe"

    -- etcd configuration
    local etcd_host = "etcd.service.etcd"
    local etcd_port = 2379
    -- Key prefix for service registration info (custom format),
    -- e.g. /services/backend/<instance-id>
    local service_prefix = "/services/backend/"

    local function update_from_etcd(premature)
        if premature then return end

        local httpc = http.new()
        httpc:set_timeout(1000)

        -- etcd v3 JSON gateway: key and range_end must be base64-encoded.
        -- A prefix query scans [key, range_end) where range_end is the
        -- prefix with its last byte incremented ("/" + 1 = "0")
        local url = "http://" .. etcd_host .. ":" .. etcd_port .. "/v3/kv/range"
        local body = cjson.encode({
            key = ngx.encode_base64(service_prefix),
            range_end = ngx.encode_base64("/services/backend0"),
        })

        local res, err = httpc:request_uri(url, {
            method = "POST",
            body = body,
            headers = { ["Content-Type"] = "application/json" }
        })

        if not res then
            ngx.log(ngx.ERR, "failed to query etcd: ", err)
            return
        end

        local data = cjson.decode(res.body)
        if not data or not data.kvs then
            ngx.log(ngx.WARN, "no services found in etcd")
            return
        end

        -- Parse key-value pairs returned by etcd
        local servers = {}
        for _, kv in ipairs(data.kvs) do
            -- kv.value is service instance info (base64 encoded)
            local value = ngx.decode_base64(kv.value)
            local instance = cjson.decode(value)

            if instance and instance.healthy then
                servers[#servers + 1] = {
                    instance.host,
                    instance.port,
                    weight = instance.weight or 10
                }
            end
        end

        local shared_dict = ngx.shared.upstream_servers
        shared_dict:set("backend_servers", cjson.encode(servers))
    end

    ngx.timer.every(5, update_from_etcd)
    update_from_etcd(false)
}

etcd’s advantage is its simplicity and light weight. But it lacks the complete service discovery ecosystem of Consul or Nacos; you need to design your own service registration mechanism, as sketched below. If your team is already using Kubernetes, etcd is naturally present, making it a convenient choice.
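
A sketch of the service side of that registration, using the standard /v3/lease/grant and /v3/kv/put gateway endpoints (the key layout and TTL are assumptions matching the reader above; a real client would also call /v3/lease/keepalive periodically so the key expires automatically when the instance dies):

local http = require "resty.http"
local cjson = require "cjson.safe"

local function register_instance(etcd_url, instance_id, instance)
    local httpc = http.new()

    -- 1. Grant a 15-second lease; the response carries the lease ID
    local res, err = httpc:request_uri(etcd_url .. "/v3/lease/grant", {
        method = "POST",
        body = cjson.encode({ TTL = 15 }),
    })
    if not res then return nil, err end
    local lease_id = cjson.decode(res.body).ID

    -- 2. Write the instance info under the lease (key/value base64-encoded)
    res, err = httpc:request_uri(etcd_url .. "/v3/kv/put", {
        method = "POST",
        body = cjson.encode({
            key = ngx.encode_base64("/services/backend/" .. instance_id),
            value = ngx.encode_base64(cjson.encode(instance)),
            lease = lease_id,
        }),
    })
    if not res then return nil, err end
    return lease_id
end

-- Usage:
-- register_instance("http://etcd.service.etcd:2379", "instance-1",
--     { host = "192.168.1.10", port = 8080, weight = 10, healthy = true })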

Comparison of Three Solutions

| Comparison Point | Consul | Nacos | etcd |
| --- | --- | --- | --- |
| Native health checking | Supported (HTTP/TCP) | Supported | Not supported (build your own) |
| Dynamic weight adjustment | Supported | Supported (visualized) | Implement yourself |
| Spring Cloud integration | Supported | Default integration | Requires extra configuration |
| Console | Web UI | Web UI (more complete) | None (third-party tools) |
| Configuration management | Supported (KV storage) | Supported (more powerful) | Supported |
| Community activity in China | Medium | High | High (K8s ecosystem) |
| Applicable scenarios | General microservices | Spring Cloud Alibaba | K8s environments |

My choice: If using Spring Cloud, go straight to Nacos. If using K8s, etcd is convenient. If you want a standalone, complete service discovery platform, Consul is the most mature.

Real-World Scenarios and Performance Tuning

The three-layer architecture is set up, service discovery is integrated. Now let’s look at practical applications in typical scenarios.

Scenario 1: Kubernetes Ingress Gateway

K8s pods have short lifespans. Scaling up, scaling down, and rolling upgrades all cause pods to be recreated, and IP addresses change accordingly. Static upstream simply can’t keep up.

OpenResty can dynamically sense pod changes. The approach:

  1. Start a timer in init_worker_by_lua_block, query K8s API or CoreDNS every 5 seconds
  2. Parse the pod IP list for the service
  3. Update to shared memory
  4. balancer_by_lua_block load balances based on pod list

K8s API call example:

local http = require "resty.http"
local cjson = require "cjson.safe"

local function watch_k8s_services(premature)
    if premature then return end

    local httpc = http.new()

    -- K8s API: get the Service's Endpoints (i.e., the Pod IP list)
    local url = "https://kubernetes.default/api/v1/namespaces/default/endpoints/backend-service"

    -- The K8s API needs authentication; read the ServiceAccount token
    local token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    local token = read_file(token_file)  -- custom helper that reads a file

    local res, err = httpc:request_uri(url, {
        headers = {
            Authorization = "Bearer " .. token
        },
        -- In production, verify against the cluster CA bundle at
        -- /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        ssl_verify = false,
    })

    if res then
        local endpoints = cjson.decode(res.body)
        -- endpoints.subsets contains Pod addresses and ports
        -- Parse and store in shared memory...
    end
end

ngx.timer.every(5, watch_k8s_services)

Of course, in production environments you can use a K8s Ingress Controller instead; NGINX Ingress Controller and Traefik both package this logic. But if you have special requirements (custom routing rules, canary strategies), writing your own OpenResty logic is more flexible.

Scenario 2: Canary Deployment

Suppose you want to release a new service version. The old version handles 90% of traffic, the new version 10%. If the new version runs stably for a week, gradually increase to 50%, then 100%.

With OpenResty, you can implement “request header routing + dynamic weighting”:

upstream backend {
    server 0.0.0.1;
    balancer_by_lua_block {
        local balancer = require "ngx.balancer"
        local cjson = require "cjson.safe"
        local shared_dict = ngx.shared.upstream_servers

        -- Read old and new version service lists from shared memory
        local old_servers = cjson.decode(shared_dict:get("old_version") or "[]")
        local new_servers = cjson.decode(shared_dict:get("new_version") or "[]")

        -- Canary strategy: route based on request header
        local version_header = ngx.req.get_headers()["X-Version"]

        if version_header == "new" then
            -- Force route to the new version (for testers);
            -- select_random is a custom helper that picks {host, port}
            local host, port = select_random(new_servers)
            balancer.set_current_peer(host, port)
        else
            -- Random selection based on weight
            -- 90% probability old version, 10% new version
            local rand = math.random()
            if rand < 0.1 then
                local host, port = select_random(new_servers)
                balancer.set_current_peer(host, port)
            else
                local host, port = select_random(old_servers)
                balancer.set_current_peer(host, port)
            end
        end
    }
}

Weight ratios can be stored in shared memory or Redis, and operators can adjust them dynamically through a management interface. For example, expose an HTTP API like POST /admin/traffic-weight { "old": 90, "new": 10 }, and have OpenResty update the weight configuration when it arrives; a sketch follows.
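
A minimal sketch of that admin endpoint (the path, JSON shape, and the new_weight key are assumptions; restrict access to internal IPs in production):

location = /admin/traffic-weight {
    allow 127.0.0.1;
    deny all;

    content_by_lua_block {
        local cjson = require "cjson.safe"

        ngx.req.read_body()
        local body = cjson.decode(ngx.req.get_body_data() or "")
        if not body or type(body.new) ~= "number" then
            return ngx.exit(ngx.HTTP_BAD_REQUEST)
        end

        -- Store the canary percentage; balancer_by_lua can read this
        -- per request instead of hard-coding 0.1
        ngx.shared.upstream_servers:set("new_weight", body.new)
        ngx.say("ok")
    }
}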

Scenario 3: Automatic Failover

Backend service suddenly crashes. You want Nginx to sense it quickly and stop forwarding requests to the failed server.

This relies on the health check module. lua-resty-healthcheck continuously probes in the background. Once 3 consecutive failures are detected, the server is marked as down.

In balancer_by_lua_block, you first check health status:

balancer_by_lua_block {
    local balancer = require "ngx.balancer"
    local roundrobin = require "resty.roundrobin"

    -- The checker created in init_worker, stashed per worker
    local checker = package.loaded.backend_checker
    local servers = get_all_servers()  -- custom helper: list from service discovery

    -- Filter out unhealthy servers; get_target_status returns true
    -- when the target is currently considered healthy
    local healthy_servers = {}
    for _, srv in ipairs(servers) do
        if checker:get_target_status(srv[1], srv[2], "backend") then
            healthy_servers[srv[1] .. ":" .. srv[2]] = srv.weight or 1
        end
    end

    if not next(healthy_servers) then
        return ngx.exit(503)  -- All backends are down
    end

    -- Select from the healthy list
    local rr = roundrobin:new(healthy_servers)
    local server = rr:find()
    local host, port = server:match("^(.+):(%d+)$")
    balancer.set_current_peer(host, tonumber(port))
}

Retry logic is also important. If a request returns an error after forwarding, you should try another backend, not directly return 500 to the user.

-- Set extra retry attempts (must be called in the balancer phase)
balancer.set_more_tries(2)

-- get_last_failure() returns nil on the first attempt; on a retry it
-- returns a state name ("failed" or "next") plus the response status
local state, status = balancer.get_last_failure()
if state then
    ngx.log(ngx.WARN, "last attempt failed: ", state, ", status: ", tostring(status))
    -- Report this to the passive health check so the next selection
    -- skips the bad server
end

Performance Tuning Recommendations

This dynamic mechanism has overhead. Health checks send probe requests, service discovery queries remote APIs. If configured improperly, it may slow overall response speed.

Several lessons from real testing:

  1. Probe interval: Recommend 2-10 seconds. Too fast consumes resources, too slow delays response. Use 2 seconds for high-concurrency scenarios, 5 seconds for low-concurrency.

  2. Shared memory size: lua_shared_dict healthcheck needs at least 1MB. Each upstream uses about 100KB. If you have 10 upstreams, allocate 2MB to be safe.

  3. Keepalive connection pool: Enable keepalive for backend services to reduce connection establishment overhead:

upstream backend {
    server 0.0.0.1;
    keepalive 64;  # Keep up to 64 idle connections per worker
}
  4. Asynchronous health checks: Health checks use ngx.timer, which executes asynchronously and won’t block request processing. But probe requests themselves consume HTTP connections. If you have many backend services, consider reducing probe frequency appropriately.

  5. State caching: Cache service discovery query results for 5 seconds to avoid hammering the Consul/Nacos API. In most cases, 5 seconds of staleness is acceptable.

A complete production configuration example:

http {
    # Shared memory (lua_shared_dict must sit inside the http block)
    lua_shared_dict healthcheck 2m;
    lua_shared_dict upstream_servers 5m;

    # Client-side keepalive
    keepalive_timeout 60s;
    keepalive_requests 100;

    init_worker_by_lua_block {
        -- Health checks (2-second probes)
        local healthcheck = require "resty.healthcheck"
        local checker = healthcheck.new({
            name = "backend_checker",
            shm_name = "healthcheck",
            checks = {
                active = {
                    type = "http",
                    http_path = "/health",
                    healthy = { interval = 2, successes = 2 },
                    unhealthy = { interval = 1, tcp_failures = 1, http_failures = 3 }
                },
                passive = {
                    healthy = { successes = 3 },
                    unhealthy = { tcp_failures = 2, http_failures = 3 }
                }
            }
        })
        package.loaded.backend_checker = checker

        -- Service discovery (5-second refresh); update_upstream_from_consul
        -- is the pull function from the Consul section
        ngx.timer.every(5, update_upstream_from_consul)
    }

    upstream backend {
        server 0.0.0.1;  # Placeholder
        keepalive 64;    # Connection pool

        balancer_by_lua_block {
            local balancer = require "ngx.balancer"
            -- select_healthy_backend: custom helper combining the
            -- discovered list with health status (see the failover section)
            local host, port = select_healthy_backend()
            balancer.set_more_tries(2)  -- Up to 2 extra attempts
            balancer.set_current_peer(host, port)
        }
    }
}

This configuration has run in our production environment for half a year, handling 5000 requests per second with stable response times under 50 milliseconds. The key is tuning parameters to appropriate values—not too aggressive, not too conservative.

Conclusion

The dynamic upstream three-layer architecture’s core is ngx.balancer API providing low-level capability, lua-resty-balancer packaging load balancing algorithms, and lua-resty-healthcheck implementing health checks. Chain them together, and you can dynamically select backend servers at runtime without ever reloading Nginx.

For service discovery, Consul is the most mature, Nacos suits Spring Cloud users, and etcd fits K8s environments. Choose based on your existing tech stack—don’t blindly chase the “optimal solution.”

Try it hands-on: Start with lua-resty-healthcheck and get health checks running. Watch backends fail, auto-remove, recover, auto-add back—once this workflow is smooth, then integrate service discovery. Apache APISIX’s balancer.lua is only 400 lines of code—you can reference it directly instead of starting from scratch.

This mechanism essentially makes Nginx “alive.” Static configuration becomes dynamic sensing. The days of waking up at midnight to edit configs can finally end.

Implement Nginx Dynamic Upstream

Use OpenResty three-layer architecture to implement dynamic service discovery and health checks

⏱️ Estimated time: 120 min

  1. Step 1: Install Dependency Modules

    Install OpenResty and required Lua libraries:

    • Install OpenResty (includes ngx.balancer)
    • Install lua-resty-balancer (load balancing algorithms)
    • Install lua-resty-healthcheck (health checks)
    • Install lua-resty-http (HTTP client, for service discovery API calls)
  2. Step 2: Configure Shared Memory

    Add to nginx.conf http block:

    ```nginx
    lua_shared_dict healthcheck 2m;
    lua_shared_dict upstream_servers 5m;
    ```

    • healthcheck: Store health check status (~100KB per upstream)
    • upstream_servers: Store service list (fetched from Consul/Nacos/etcd)
  3. Step 3: Implement Health Checks

    Start health checks in init_worker_by_lua_block:

    ```lua
    local healthcheck = require "resty.healthcheck"
    local checker = healthcheck.new({
        name = "backend_checker",
        shm_name = "healthcheck",
        checks = {
            active = {
                type = "http",
                http_path = "/health",
                healthy = { interval = 2, successes = 2 },
                unhealthy = { interval = 1, http_failures = 3 }
            }
        }
    })
    ```

    • active: Active probing, send HTTP request every 2 seconds
    • unhealthy: Mark down after 3 consecutive failures
  4. Step 4: Integrate Service Discovery

    Choose one service discovery solution:

    • Consul: Call /v1/catalog/service/{name} API
    • Nacos: Call /nacos/v1/ns/instance/list API
    • etcd: Call /v3/kv/range API

    Use ngx.timer.every to update service list every 5 seconds, store in shared memory.
  5. Step 5: Configure Dynamic Upstream

    Use balancer_by_lua_block in upstream block:

    ```nginx
    upstream backend {
        server 0.0.0.1;  # Placeholder
        keepalive 64;    # Connection pool

        balancer_by_lua_block {
            local balancer = require "ngx.balancer"
            local roundrobin = require "resty.roundrobin"
            -- get_healthy_servers: helper from the failover section
            local rr = roundrobin:new(get_healthy_servers())
            local server = rr:find()
            local host, port = server:match("^(.+):(%d+)$")
            balancer.set_more_tries(2)
            balancer.set_current_peer(host, tonumber(port))
        }
    }
    ```

    • server 0.0.0.1 is a placeholder; the actual backend is selected dynamically by Lua
    • keepalive 64 keeps up to 64 idle upstream connections per worker
    • set_more_tries(2) allows up to 2 retries
  6. Step 6: Test and Tune

    Deploy to test environment and verify:

    • Health check: Stop a backend service, observe if Nginx auto-removes it
    • Service discovery: Restart container, observe if IP auto-updates
    • Performance test: Use wrk or ab to test QPS and response time
    • Tune parameters: Probe interval (2-10s), connection pool size (64-128), retry count (2-3)

FAQ

What's the difference between lua-resty-upstream-healthcheck and lua-resty-healthcheck?
lua-resty-upstream-healthcheck is the OpenResty official library, supporting only active checks. lua-resty-healthcheck is a community-enhanced version supporting both active and passive checks, validated at scale by Apache APISIX in production. Recommendation: Use lua-resty-healthcheck directly.
How to dynamically update upstream in Nginx without reload?
Use OpenResty's balancer_by_lua_block hook to dynamically select backend servers at runtime through Lua code. Backend lists can live in shared memory or Redis, or be fetched from service discovery APIs, with no need to run nginx -s reload at all.
Which is better for service discovery: Consul, Nacos, or etcd?
Choice depends on tech stack:

• Consul: Most mature, feature-complete, suitable for general microservice architectures
• Nacos: Default integration with Spring Cloud Alibaba, comprehensive console
• etcd: Lightweight, native Kubernetes integration, suitable for K8s environments
Does dynamic upstream impact performance?
There's overhead but it's controllable. Health checks and service discovery execute asynchronously, not blocking request processing. Real-world test: 5000+ QPS with stable response &lt;50ms. Key is reasonable parameters: probe interval 2-10 seconds, shared memory 2-5MB, connection pool 64-128.
How to implement dynamic service discovery in Kubernetes?
Two approaches: 1) Directly call K8s API, read Endpoints to get Pod IP list; 2) Use CoreDNS service discovery, resolve service names through DNS queries. Approach 1 is more flexible, enabling canary deployments and custom routing.
What should the health check probe interval be set to?
Recommend 2-10 seconds. Use 2 seconds for high-concurrency scenarios to quickly sense failures, 5-10 seconds for low-concurrency to reduce resource consumption. Too short increases backend pressure and Nginx resource usage; too long delays failure detection. Adjust together with fall (consecutive failures) and rise (consecutive successes) parameters.
