
Load Balancing Techniques — A Complete Engineer's Guide#

Every backend system beyond a certain scale has one. But most engineers treat load balancers as a black box — configure it, forget it, hope it works. This post breaks open that box: how load balancers actually think, what algorithms they use, and how to choose the right one for your system.


What Is a Load Balancer, Really?#

A load balancer sits in front of a pool of servers and distributes incoming traffic across them. That sounds simple. But its actual job has three distinct responsibilities:

  1. Distribute — spread work so no single server gets overwhelmed
  2. Detect failure — stop sending traffic to servers that are unhealthy or dead
  3. Hide complexity — the outside world sees one address; what's behind it is your problem to manage

The third job is underappreciated. Load balancers are the foundation of horizontal scaling. Without them, scaling means giving users a new IP address every time you add a server. That's not a product — that's chaos.


The Two Dimensions of Load Balancing#

Every load balancing decision can be understood along two axes:

Dimension 1: WHERE does it operate? (OSI Layer)
Dimension 2: HOW does it decide where to send traffic? (Algorithm)

Get comfortable with both. They're independent — you can run a Least Connections algorithm at Layer 4 or Layer 7. The layer determines what information the balancer has available; the algorithm determines what it does with that information.


Dimension 1 — Where Does It Operate?#

Layer 4 — Transport Layer Load Balancing#

An L4 load balancer operates at the TCP/UDP level. It sees:

  • Source IP, Destination IP
  • Source Port, Destination Port
  • Protocol (TCP or UDP)

It does not read the actual content of the request. It doesn't know if it's HTTP, a database query, or a game packet. It just sees the envelope, not the letter inside.

Incoming packet: [IP: 103.x.x.x] [Port: 443] [TCP]
LB decision: route to Server B
LB action: rewrite destination IP → Server B's IP

Internally, L4 LBs use one of three forwarding mechanisms:

  • NAT — rewrites the destination IP on the packet
  • DSR (Direct Server Return) — the server responds directly to the client, bypassing the LB on the way back (great for high-throughput, asymmetric traffic like video)
  • IP Tunneling — encapsulates the original packet and forwards it to the backend

Use L4 when:

  • You need minimal latency overhead (we're talking microseconds)
  • You're dealing with non-HTTP traffic — databases, message queues, game servers, media streaming
  • Raw throughput is the priority and you don't need content-aware routing

The tradeoff: Speed comes at the cost of intelligence. You can't route /api/payments to your payment cluster and /api/search to your search cluster at L4. It simply doesn't see URLs.


Layer 7 — Application Layer Load Balancing#

An L7 load balancer operates at the HTTP/HTTPS level. It fully reads and parses the request:

  • URL path (/api/users, /api/payments)
  • HTTP headers (Host, Authorization, Content-Type)
  • Cookies (session IDs, feature flags)
  • Request body (in some cases)

This is what makes modern microservices architectures possible.

Incoming request:
  GET /api/payments/charge HTTP/1.1
  Host: api.myapp.com
  Cookie: session_id=abc123

LB decision:
  → Path starts with /api/payments → route to Payment Service cluster
  → Cookie has session_id → route to same server as last time (sticky session)

L7 load balancers also handle SSL/TLS termination — they decrypt HTTPS once at the edge, then forward plain HTTP to your backend servers. This offloads expensive crypto operations from your app servers and simplifies certificate management significantly.

Use L7 when:

  • You're building microservices and different paths need to go to different services
  • You need sticky sessions
  • You want meaningful health checks (more on this below)
  • You need to inspect, modify, or rewrite requests and responses

The tradeoff: More intelligence means more overhead. L7 LBs are slower than L4 and can't handle non-HTTP protocols. For most web applications, this overhead is irrelevant. For ultra-high-frequency trading systems, it might matter.


Layer 3 — Network Layer Load Balancing#

Operates purely at the IP level. Extremely specialized, rarely seen in typical application architectures. Mentioned here for completeness — if you're not building a carrier-grade network, you won't need it.


Dimension 2 — The Routing Algorithms#

This is where it gets interesting. The algorithm is what defines the load balancer's "personality" — its decision-making logic for every single request.


1. Round Robin#

The simplest algorithm. Distribute requests sequentially, cycling through all servers.

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A  ← cycle repeats
Request 5 → Server B

Where it works well: All servers are identical hardware, requests are roughly uniform in processing time, stateless application.

Where it falls apart: Imagine Server A is grinding through a 10-second video encoding job. Round Robin doesn't know or care — it keeps sending new requests to Server A at the same rate. Meanwhile, Server B has been idle for 9 seconds.

Round Robin distributes requests evenly, not load. Those are very different things.

Verdict: Good starting point. Don't use it in production for workloads with variable request complexity.
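The whole algorithm fits in a few lines. A minimal sketch (server names are illustrative):

```python
from itertools import cycle

# Round robin: cycle through the pool in a fixed order, one pick per request.
servers = ["server_a", "server_b", "server_c"]
ring = cycle(servers)

def pick_server():
    return next(ring)

picks = [pick_server() for _ in range(5)]
# picks == ["server_a", "server_b", "server_c", "server_a", "server_b"]
```

Note there is no feedback loop anywhere in this code — that absence is exactly the weakness described above.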


2. Weighted Round Robin#

Same as Round Robin, but each server gets a weight. Higher weight = more traffic share.

Server A: weight 3  →  gets 3 out of every 6 requests
Server B: weight 2  →  gets 2 out of every 6 requests
Server C: weight 1  →  gets 1 out of every 6 requests

Where this shines: Heterogeneous fleets where some servers are more powerful than others. Also great for traffic migration — set the new server to weight 1, the old to weight 9, then gradually shift. This is the primitive behind blue-green deployments and canary releases.

The limitation: Weights are statically configured. If Server A suddenly gets slow due to a noisy neighbour on the VM, Weighted Round Robin has no idea. It keeps sending the same proportion of traffic. It's a manual knob, not an adaptive one.
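One simple way to sketch weighted round robin is to expand each server into the rotation `weight` times (weights mirror the example above):

```python
# Weighted round robin via an expanded rotation (illustrative sketch).
weights = {"server_a": 3, "server_b": 2, "server_c": 1}

def build_rotation(weights):
    # Each server appears `weight` times per cycle of sum(weights) requests.
    rotation = []
    for server, w in weights.items():
        rotation.extend([server] * w)
    return rotation

rotation = build_rotation(weights)
idx = 0

def pick_server():
    global idx
    server = rotation[idx % len(rotation)]
    idx += 1
    return server
```

This naive expansion clusters a server's turns together (A, A, A, B, B, C); production implementations such as nginx's smooth weighted round robin interleave the picks while preserving the same per-cycle proportions.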


3. Least Connections#

Track the number of active connections on each server. Always route to the server with the fewest.

Server A: 150 active connections
Server B: 23 active connections   ← new request goes here
Server C: 87 active connections

This is the first algorithm on this list that reacts to real-time load rather than just counting requests.

Where this shines: When request processing times vary significantly. Long-lived connections like WebSockets, video streams, or database queries. Any workload where "number of requests" is a poor proxy for "actual server load."

The limitation: The LB now needs to maintain state — a connection count per server. Minor overhead, but worth knowing. Also, connection count alone doesn't tell the full story. A server with 5 connections could be more loaded than one with 50, if those 5 are each doing heavy computation.
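The state the LB maintains is just a counter per server, incremented on dispatch and decremented on close. A minimal sketch, using the counts from the example above:

```python
# Least connections: track active connections, route to the minimum.
active = {"server_a": 150, "server_b": 23, "server_c": 87}

def pick_server(active):
    return min(active, key=active.get)

choice = pick_server(active)  # picks the least-loaded server
active[choice] += 1           # LB increments on dispatch...
# ...and decrements when the connection closes.
```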


4. Weighted Least Connections#

Combines the best of Weighted Round Robin and Least Connections. The routing score:

Score = active_connections / weight

Server A: 150 connections, weight 5 → Score = 30
Server B: 23 connections,  weight 1 → Score = 23  ← lowest score, wins
Server C: 87 connections,  weight 3 → Score = 29

More powerful servers (higher weight) can absorb more connections before getting penalised in the score. The algorithm naturally routes more traffic to beefier servers while still adapting to real-time load.

Best for: Mixed fleets with variable request processing times — which is most realistic production environments.
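The score formula translates directly into code. A sketch using the numbers from the worked example:

```python
# Weighted least connections: score = active_connections / weight, lowest wins.
servers = {
    "server_a": {"conns": 150, "weight": 5},  # score 30
    "server_b": {"conns": 23,  "weight": 1},  # score 23
    "server_c": {"conns": 87,  "weight": 3},  # score 29
}

def pick_server(servers):
    return min(servers, key=lambda s: servers[s]["conns"] / servers[s]["weight"])
```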


5. IP Hash (Source IP Affinity)#

Hash the client's IP address. The same IP always maps to the same server.

hash(103.21.x.x) % 3 = 1 → always Server B
hash(49.207.x.x) % 3 = 0 → always Server A
hash(122.xx.x.x) % 3 = 2 → always Server C

Why this exists — The Session Problem:

Some applications store session data locally on the server (in-memory or on local disk), not in a shared store like Redis. If a user's requests get routed to different servers, their session data isn't there — they get logged out or lose state. This is called the sticky session problem.

IP Hash is a brute-force solution: the same user always hits the same server, so local session data is always found.

The problems with IP Hash:

  • If Server B dies, everyone hashed to it loses their session — no graceful recovery
  • Corporate NAT and mobile carrier NAT mean thousands of users can share a single IP. Your "even" distribution suddenly isn't
  • Traffic distribution can be wildly uneven depending on how IPs are distributed

IP Hash solves a real problem, but it's a workaround. The real fix is externalising session state to Redis or a shared DB — then any server can handle any request and the problem disappears entirely.
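A sketch of the hashing step. One detail worth knowing: Python's built-in `hash()` is randomised per process, so a stable digest (MD5 here, purely for illustration) is needed to keep the mapping consistent across LB restarts:

```python
import hashlib

# IP hash: a stable hash of the client IP, modulo the pool size.
servers = ["server_a", "server_b", "server_c"]

def pick_server(client_ip):
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

# The same IP always maps to the same server:
assert pick_server("103.21.0.5") == pick_server("103.21.0.5")
```

The `% len(servers)` is also where the fragility lives: change the pool size and almost every IP remaps to a different server. Consistent hashing exists to soften exactly this.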


6. Sticky Sessions (Cookie-Based)#

A smarter version of IP Hash. On the first request, the LB picks a server and injects a routing cookie into the response:

Set-Cookie: LB_ROUTE=server_b; Path=/

Every subsequent request from that browser includes the cookie. The LB reads it and routes to server_b every time.

Why this beats IP Hash:

  • Works correctly even when thousands of users share a single IP (NAT, corporate proxies)
  • If server_b dies, the LB can detect it, reassign the user to another server, and update the cookie — clean recovery
  • Tracks individual browser sessions, not IPs — much more granular

The cost: Requires L7 (cookie inspection), and the LB needs to maintain a routing table. But for any application that needs sticky sessions, this is the right primitive.
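The routing logic can be sketched as follows (the cookie name `LB_ROUTE` comes from the example above; the rest of the names are made up for illustration):

```python
import random

# Cookie-based stickiness: honour an existing pin if the server is healthy,
# otherwise assign a new server and emit a cookie for the response.
servers = ["server_a", "server_b", "server_c"]
healthy = set(servers)

def route(request_cookies):
    target = request_cookies.get("LB_ROUTE")
    if target in healthy:
        return target, None                       # existing, still-healthy pin
    target = random.choice(sorted(healthy))       # reassign (any algorithm works here)
    return target, f"LB_ROUTE={target}; Path=/"   # Set-Cookie value for the response
```

The `if target in healthy` check is the clean-recovery property: a pin to a dead server is silently replaced rather than failing, which is precisely what IP Hash cannot do.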


7. Least Response Time#

Goes one step further than Least Connections. Routes to the server with the best combined score of active connections and response time:

Score = active_connections × average_response_time

Server A: 20 connections × 50ms  = 1,000
Server B: 5 connections  × 300ms = 1,500
Server C: 10 connections × 80ms  = 800   ← wins

Why Least Connections isn't enough:

A server might have few connections because it's slow and holding them open for a long time. Least Connections would happily route more traffic to it. Least Response Time catches this — if a server is slow, it gets penalised regardless of connection count.

The tradeoff: The LB needs to actively measure response times. More overhead, more moving parts. Use this when accuracy matters more than operational simplicity.
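A sketch of the scoring, with the numbers from the example above:

```python
# Least response time: score = active_connections * average_response_time.
stats = {
    "server_a": {"conns": 20, "avg_ms": 50},   # score 1,000
    "server_b": {"conns": 5,  "avg_ms": 300},  # score 1,500
    "server_c": {"conns": 10, "avg_ms": 80},   # score 800
}

def pick_server(stats):
    return min(stats, key=lambda s: stats[s]["conns"] * stats[s]["avg_ms"])
```

In a real LB, `avg_ms` would be a moving average the balancer measures itself, which is where the extra operational overhead comes from.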


8. Random#

Pick a backend server at random. That's it.

rand() % num_servers → pick that server

This sounds like a joke, but it's not. At large scale — thousands of servers, millions of requests per second — random selection statistically approximates round robin. The law of large numbers evens it out. Twitter's Finagle library used random selection internally for exactly this reason.

Pros: Zero state to maintain, trivially simple, works at massive scale.
Cons: High variance at small scale. Don't use this with 3 servers.


9. Power of Two Random Choices#

A clever hybrid that elegantly solves the core tradeoff between random and least connections.

How it works:

  1. Pick 2 servers at random
  2. Of those 2, route to the one with fewer active connections

Random picks: Server A (150 connections) vs Server C (30 connections)
Decision: Server C

Why this is clever:

Pure Least Connections requires checking all servers on every request — O(n) overhead. Pure random is cheap but naive. Power of Two Choices gets you most of the benefit of Least Connections (you always avoid the worse option) with near-random overhead (only 2 comparisons).

It's been proven mathematically that this approach reduces the maximum load by an exponential factor compared to pure random, while adding almost no overhead. This is used in Nginx, HAProxy, and Envoy configurations in production at scale.
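The implementation is almost as short as pure random — two samples and one comparison:

```python
import random

# Power of two choices: sample two distinct servers, take the less loaded.
active = {"server_a": 150, "server_b": 23, "server_c": 87}

def pick_server(active):
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b
```

Notice that with these counts, `server_a` (the most loaded) can never win its pairing — the "always avoid the worse option" property falls straight out of the comparison.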


10. Resource-Based / Adaptive#

The most sophisticated approach. Servers actively report their own resource utilisation — CPU percentage, memory usage, queue depth — to the load balancer via an agent or API. The LB routes based on actual capacity metrics.

Server A: CPU 90%, Memory 85% → heavily loaded, avoid
Server B: CPU 20%, Memory 40% → lightly loaded, prefer
Server C: CPU 55%, Memory 60% → moderate

Where this wins: When workload characteristics vary wildly — some requests are CPU-bound, some are memory-bound, some are I/O-bound. Connection count is a poor proxy for all of these. Actual resource metrics are the ground truth.

The cost: Every server needs instrumentation. There's reporting lag. The setup is significantly more complex. Reserve this for systems where the accuracy justification is strong.
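A sketch of the routing side, assuming servers have already reported their metrics. The scoring rule here — prefer the server whose most-constrained resource is least used — is one illustrative choice among many:

```python
# Resource-based routing sketch over reported utilisation (0.0–1.0).
reported = {
    "server_a": {"cpu": 0.90, "mem": 0.85},  # heavily loaded
    "server_b": {"cpu": 0.20, "mem": 0.40},  # lightly loaded
    "server_c": {"cpu": 0.55, "mem": 0.60},  # moderate
}

def pick_server(reported):
    # Score each server by its tightest bottleneck; route to the lowest.
    return min(reported, key=lambda s: max(reported[s].values()))
```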


Health Checks — The Foundation of Everything#

Here's the thing: every algorithm above is completely useless if the LB keeps routing traffic to a dead server. Health checks are how the LB knows which servers are actually available.

Three Levels of Health Checking#

TCP Health Check (Layer 4):

LB → Server: [TCP SYN]
Server → LB: [TCP SYN-ACK]  ← network stack is alive

This only tells you the server is reachable and accepting connections. The application could be completely broken and this check would still pass. Use this as a baseline, not a guarantee.

HTTP Health Check (Layer 7):

LB → GET /health HTTP/1.1
Server → HTTP 200 OK    ← application is healthy
Server → HTTP 500       ← application is running but broken
Server → [timeout]      ← server is unreachable

Verifies the actual application is responding correctly, not just the network layer. This is the minimum bar for web applications.

Custom Health Check (Application-Aware):

Your /health endpoint does real work:

{
  "status": "healthy",
  "database": "connected",
  "cache": "connected",
  "disk_free_gb": 42,
  "queue_depth": 150
}

Returns 200 only if every dependency checks out. Returns 503 if the database connection pool is exhausted, even if the server itself is technically running. This is the highest-fidelity signal — and it's the one most teams neglect to build properly.
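A sketch of such an endpoint's logic, framework-agnostic. The probe functions (`check_database`, `check_cache`) are hypothetical hooks your application would implement:

```python
# Dependency-aware health check: 200 only if every dependency checks out.
def health(check_database, check_cache, disk_free_gb, queue_depth):
    deps = {
        "database": "connected" if check_database() else "down",
        "cache": "connected" if check_cache() else "down",
        "disk_free_gb": disk_free_gb,
        "queue_depth": queue_depth,
    }
    ok = deps["database"] == "connected" and deps["cache"] == "connected"
    status_code = 200 if ok else 503
    return status_code, {"status": "healthy" if ok else "unhealthy", **deps}
```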

Health Check Configuration That Actually Matters#

interval:    5s   → check every 5 seconds
timeout:     2s   → if no response in 2s, count as a failure
threshold:   3    → mark unhealthy after 3 consecutive failures
recovery:    2    → mark healthy again after 2 consecutive successes

The recovery threshold is often overlooked. You don't want a flapping server (healthy → unhealthy → healthy in rapid cycles) to cause request storms. Requiring 2 consecutive successes before re-adding a server to the pool gives it time to stabilise.
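The threshold/recovery pair is a small state machine. A sketch using the thresholds above (3 consecutive failures to mark down, 2 consecutive successes to mark up):

```python
# Flap damping: consecutive-failure and consecutive-success thresholds.
class HealthTracker:
    def __init__(self, fail_threshold=3, recovery_threshold=2):
        self.fail_threshold = fail_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self.fails = 0
        self.successes = 0

    def record(self, check_passed):
        if check_passed:
            self.fails = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.recovery_threshold:
                self.healthy = True
        else:
            self.successes = 0
            self.fails += 1
            if self.healthy and self.fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy
```

Resetting the opposite counter on every result is what makes the thresholds *consecutive* — a single success in the middle of a failure streak starts the failure count over.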


Comparison at a Glance#

| Algorithm | State Required | Handles Variable Load | Handles Mixed Server Capacity | Complexity |
| --- | --- | --- | --- | --- |
| Round Robin | None | No | No | Very Low |
| Weighted Round Robin | Weights | No | Yes | Low |
| Least Connections | Connection counts | Yes | No | Medium |
| Weighted Least Connections | Counts + Weights | Yes | Yes | Medium |
| IP Hash | Hash table | No | No | Low |
| Cookie Sticky Sessions | Cookie store | No | No | Medium |
| Least Response Time | Counts + Timings | Yes | Yes | High |
| Random | None | Statistically | No | Very Low |
| Power of Two Choices | Connection counts | Yes | No | Low |
| Resource-Based / Adaptive | Live metrics | Yes | Yes | Very High |

How to Choose — A Decision Framework#

Use this as a starting point, not gospel. Every system has its own constraints.

Is your application stateless?
├── YES → Round Robin, Least Connections, or Random
└── NO  → IP Hash or Cookie Sticky Sessions
          (Better answer: make it stateless with Redis session store)

Are all servers identical hardware?
├── YES → Round Robin or Least Connections
└── NO  → Use Weighted variants

Are requests uniform in processing time?
├── YES → Round Robin is perfectly fine
└── NO  → Least Connections or Least Response Time

Operating at massive scale (10,000+ servers)?
└── Power of Two Choices or Random

Need maximum accuracy, overhead is acceptable?
└── Resource-Based / Adaptive

When in doubt, start with Least Connections. It's simple, stateful but not excessively so, adapts to real load, and handles variable workloads gracefully. Most teams that reach for Round Robin in production and then hit problems end up migrating to Least Connections anyway.


Common Misconceptions — Busted#

| Misconception | Reality |
| --- | --- |
| "Round Robin distributes load evenly" | It distributes requests evenly, not load. A server with one heavy request can be more loaded than one handling fifty light ones |
| "Sticky sessions require IP Hash" | Cookie-based stickiness is more reliable, more granular, and handles NAT correctly. Prefer it over IP Hash |
| "Least Connections is always better than Round Robin" | At high scale with uniform requests, Round Robin's simplicity and zero state overhead actually wins |
| "My /health endpoint tells me if my server is healthy" | Only if your /health endpoint checks your actual dependencies. A 200 from a server that can't reach its database is worse than useless — it's actively misleading |

Key Takeaways#

  • Layer determines visibility. L4 LBs are fast but blind to content. L7 LBs are content-aware but heavier. Pick based on what routing intelligence you actually need.
  • Round Robin distributes requests, not load. The distinction matters the moment your workload has any variability in processing time.
  • Sticky sessions are a workaround. The real fix is externalising session state. If you can make your application stateless, do it — it unlocks every routing algorithm and simplifies your architecture.
  • Health checks are the bedrock. A smart algorithm routing to dead servers is worse than a dumb algorithm routing to healthy ones. Invest in /health endpoints that actually reflect application health.
  • Power of Two Choices is underrated. Excellent performance characteristics, minimal overhead, production-proven. If you need Least Connections at scale, start here.
  • Most teams should use Least Connections. Unless you have a specific reason to do otherwise, it's the default that ages well.

Load balancers are one of those topics where understanding the internals pays dividends repeatedly — in incident debugging, in capacity planning, in architecture reviews, and in system design interviews. The algorithms aren't complex. The nuance is knowing which tradeoffs matter for your workload.


Further reading: HAProxy Configuration, Nginx upstream module, Envoy load balancing policies, The Power of Two Random Choices — original paper
