Load Balancing and Rate Limiting

Tung Nguyen

Jun 24, 2026

12 min read

A server can answer a request, route it to the right site, and send the client elsewhere. But all of that assumes one server, calmly handling whatever shows up. Production rarely looks like that. Traffic spikes, machines crash, and one badly behaved client can try to make a thousand requests a second. This chapter covers the two tools that keep things standing: load balancing, which spreads traffic across many servers, and rate limiting, which caps how much any single caller can demand.

They solve different problems, but they pair naturally. Load balancing lets you grow sideways and survive failures. Rate limiting protects whatever is behind it from being swamped. Most real systems run both.

Load balancing: one address, many servers

A single server has a ceiling. There is only so much CPU, memory, and bandwidth in one box, and when you hit that ceiling the usual fix is not a bigger box. It is more boxes. You run several identical copies of your app and put something in front of them that decides which copy answers each request. That something is a load balancer: a piece of software, a hardware appliance, or a managed cloud service that takes incoming requests and distributes them across a pool of backend servers.

This is almost always the reverse proxy from the architecture section wearing a second hat. The clients connect to one public address. Behind it sit Server 1, Server 2, Server 3, and so on, and the load balancer picks one per request. The clients never know how many servers there are, and you can add or remove servers without anyone outside noticing.
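To make the "one address, many servers" idea concrete, here is a minimal sketch in Python. The class name `LoadBalancer`, its methods, and the server names are hypothetical and exist only for illustration. The sketch rotates through the pool in order purely as a placeholder, and it forwards nothing over the network. The point is only that callers talk to a single object while the pool behind it changes.

```python
class LoadBalancer:
    """Toy front door: callers see one object, never the backends behind it."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._next = 0  # counter used to rotate through the pool

    def add(self, name):
        """Put a new backend into the pool."""
        self._backends.append(name)

    def remove(self, name):
        """Take a backend out of the pool."""
        self._backends.remove(name)

    def handle(self, request):
        """Pick one backend for this request and return its (simulated) reply."""
        if not self._backends:
            raise RuntimeError("no backends in the pool")
        # Placeholder choice: plain rotation. A real balancer may choose differently.
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return f"{backend} handled {request}"


lb = LoadBalancer(["server-1", "server-2", "server-3"])

# Three requests land on three different backends.
replies = [lb.handle(f"GET /page/{i}") for i in range(3)]

# Grow the pool; the caller's code does not change.
lb.add("server-4")
reply_after_add = lb.handle("GET /page/3")  # "server-4 handled GET /page/3"

# Shrink the pool; later requests never reach the removed backend.
lb.remove("server-2")
later = [lb.handle(f"GET /page/{i}") for i in range(4, 10)]
```

The caller only ever holds `lb`. Whether three or thirty backends sit behind it is invisible from the outside, which is what makes adding and removing servers safe.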

A load balancer spreads requests across three backends and skips an unhealthy one

Spreading traffic this way buys you two things at once:

Horizontal scale. Need to handle more traffic? Add more servers to the pool. The load balancer starts sending them work as soon as they join. This is "scaling out" (more machines) instead of "scaling up" (a bigger machine), and its ceiling is far higher than any single machine's.
Surviving a failure. If one server crashes, the load balancer notices and stops sending it requests. The others absorb the load. From the outside, the site stays up even though a machine just died.
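Both benefits can be seen in a small simulation. This is a sketch under stated assumptions: the function `distribute`, the `healthy` dictionary, and the server names are made up for illustration, requests are rotated evenly across live servers, and a crash is simulated by flipping a flag by hand. How a real load balancer detects the crash is not modeled here.

```python
from collections import Counter


def distribute(num_requests, servers, healthy):
    """Rotate requests across the servers currently marked healthy.

    Returns a dict mapping each live server to the number of requests it got.
    """
    live = [s for s in servers if healthy[s]]
    if not live:
        raise RuntimeError("no healthy servers left")
    counts = Counter()
    for i in range(num_requests):
        counts[live[i % len(live)]] += 1
    return dict(counts)


servers = ["server-1", "server-2", "server-3"]
healthy = {s: True for s in servers}

# Normal operation: 600 requests split three ways, 200 each.
before = distribute(600, servers, healthy)

# Simulated crash: server-2 is marked unhealthy and gets nothing.
healthy["server-2"] = False
after_crash = distribute(600, servers, healthy)  # 300 each on the two survivors

# Scale out: add a fourth machine and the load per server drops again.
servers.append("server-4")
healthy["server-4"] = True
after_scale_out = distribute(600, servers, healthy)  # 200 each on three live servers
```

Note the cost of the crash: each survivor goes from 200 to 300 requests. The site stays up only if the remaining servers have enough headroom to absorb that extra share.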

That second point only works if the load balancer can tell a dead server from a live one. We will get to that in a moment. First, how does it decide which server to pick?

About the Author

Tung Nguyen

Software Engineer at Autograb

