Cloud Native Patterns · intermediate · 6 min

Rate Limiting

Smooth your own outbound calls so you stay under a downstream service's limits instead of slamming into them.

Imagine pouring a large jug of water into a funnel. Dump it all at once and it overflows everywhere; pour it at a controlled pace and every drop makes it through. When your application calls an external API — a payment provider, a mapping service, a partner feed — you're pouring requests into someone else's funnel, and they've decided exactly how fast it can drain.

The Rate Limiting pattern is how you do the controlled pour. Rather than firing requests the instant you have them and hoping the provider keeps up, you deliberately pace your own outbound traffic to stay within the limit the downstream service allows.

The problem

Almost every external service caps how often you may call it — say, 100 requests per second, or 10,000 per day. Exceed it and you don't just get a polite slowdown: you get rejected requests, 429 Too Many Requests errors, temporary bans, or even billing penalties. Your work fails not because anything was wrong with it, but because you sent it too fast.

This is easy to confuse with throttling, but the direction is opposite. Throttling is defensive on the inbound side — it protects your service from being overwhelmed by callers. Rate limiting is considerate on the outbound side — it protects a downstream service from being overwhelmed by you. Naively retrying rejected calls only makes it worse, hammering an already-saturated limiter and triggering ever-longer penalties.

Figure: Without rate limiting (flooding the funnel). Bursts of outbound calls go straight to the downstream API at full speed. Past its quota it answers with 429s and bans: work fails not because it was wrong, but because it was sent too fast.

How it works

The classic mechanism is a token bucket. The bucket holds tokens up to a fixed capacity and refills at the rate the downstream service permits (say, 100 tokens per second). Every outbound call must spend a token. If tokens are available, the call goes immediately; if the bucket is empty, the call waits until a token refills rather than being fired off to fail.
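The mechanism can be sketched in a few lines of Python. This is a minimal, single-threaded illustration rather than a production limiter: the class name `TokenBucket`, its `rate` and `capacity` parameters, and the injectable `clock` and `sleep` hooks are all assumptions made for this example, and a real worker pool would need a lock around the bucket's state.

```python
import time


class TokenBucket:
    """Token bucket that paces outbound calls (illustrative sketch)."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # most tokens the bucket can hold
        self._clock = clock              # injectable so tests can fake time
        self._sleep = sleep
        self._tokens = float(capacity)   # assumption: the bucket starts full
        self._last = clock()

    def _refill(self):
        # Add the tokens earned since the last check, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)

    def try_acquire(self):
        """Spend a token if one is available; never blocks."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def acquire(self):
        """Wait until a token is available, then spend it."""
        while not self.try_acquire():
            # Sleep roughly long enough for the missing part of a token to refill.
            self._sleep((1.0 - self._tokens) / self.rate)
```

A caller would create one bucket per downstream service, for example `TokenBucket(rate=100, capacity=100)`, and call `acquire()` immediately before each outbound request.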

The bucket's size sets how big a burst can go out at once before pacing kicks in, while the refill rate enforces the long-run average. Work that can't go out right now sits in a buffer until its turn, so a sudden spike of 1,000 requests drains out smoothly at the allowed pace instead of being rejected en masse. The diagram below shows requests arriving in bursts, being metered by the limiter, and leaving as a steady, compliant stream to the downstream service.

Figure: Rate limiting (pour at a pace the funnel can take). Bursts of outbound work meet a token bucket that releases calls at the allowed rate, so the downstream API receives a steady, quota-compliant stream instead of a flood.
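To put numbers on the burst-then-pace behaviour, the sketch below computes when each request in a single spike would leave the limiter. It assumes every request arrives at the same instant and that the bucket starts full. The function name `departure_times` and the bucket size of 100 are assumptions for illustration; the 1,000-request spike and the 100-per-second rate are the figures used in the text.

```python
def departure_times(n_requests, rate, capacity):
    """Seconds after the burst at which each request leaves the limiter.

    Assumes all requests arrive together at t=0 and the bucket starts full.
    """
    times = []
    for i in range(n_requests):
        if i < capacity:
            # Covered by a token that was already sitting in the bucket.
            times.append(0.0)
        else:
            # Each further request waits for one more token to refill.
            times.append((i - capacity + 1) / rate)
    return times
```

Under these assumptions, `departure_times(1000, rate=100, capacity=100)` sends the first 100 requests immediately and the last one 9 seconds later, so the whole spike is delivered without a single call being fired ahead of its token.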
Tip

Coordinate the bucket across instances. A per-process limiter is fine for one worker, but ten workers each pacing to the full limit can collectively send up to ten times what the provider allows. When you scale out, the token bucket usually needs to live in shared state (a cache like Redis) so the whole fleet shares one budget.
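One way to share a budget across workers is sketched below. Note the swap: for brevity it uses a fixed-window counter rather than a true token bucket, so it caps calls per window but does not smooth them within the window. The `client` argument is assumed to be any object exposing Redis-style `incr(key)` and `expire(key, seconds)` methods; the function name, the key naming scheme, and the parameters are hypothetical.

```python
import time


def try_acquire_shared(client, name, limit, window_seconds=1, now=None):
    """Return True if this call fits in the fleet-wide budget for the current window.

    `client` is assumed to expose Redis-style `incr(key)` and `expire(key, seconds)`.
    """
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"ratelimit:{name}:{window}"  # hypothetical key naming scheme
    count = client.incr(key)            # atomic counter shared by every worker
    if count == 1:
        # First call in this window: let the key clean itself up later.
        client.expire(key, window_seconds * 2)
    return count <= limit
```

With the redis-py package, a `redis.Redis` instance provides both methods. Because every worker increments the same key, the fleet as a whole stays within `limit` calls per window no matter how many instances are running; a worker that gets `False` should wait for the next window rather than send.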

When to use it

Use rate limiting whenever you call a service that publishes a quota and you'd rather pace yourself than be cut off. It pairs naturally with the Queue-Based Load Leveling pattern: a queue absorbs the bursty work, and the rate limiter drains it at a sustainable speed. It also makes your retry logic far gentler: instead of retrying immediately into a wall, retries wait for the next available token, so you stop amplifying the very congestion you're trying to recover from.
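The queue-plus-limiter pairing might look like the following sketch. It assumes a blocking `acquire()` callable supplied by whatever limiter you use, and a `send(item)` callable that returns `True` on success and `False` when the provider rejects the call (for example with a 429). The function name `drain` and the `max_attempts` parameter are hypothetical.

```python
from collections import deque


def drain(work, send, acquire, max_attempts=3):
    """Send every item in `work`, spending one token per attempt.

    `send(item)` is assumed to return True on success and False on a
    rate-limit rejection (for example an HTTP 429).
    """
    queue = deque((item, 1) for item in work)
    delivered, gave_up = [], []
    while queue:
        item, attempt = queue.popleft()
        acquire()  # blocks until the limiter grants a token
        if send(item):
            delivered.append(item)
        elif attempt < max_attempts:
            # Retry later, behind the other queued work, and only with a fresh token.
            queue.append((item, attempt + 1))
        else:
            gave_up.append(item)
    return delivered, gave_up
```

Because a retry goes back through `acquire()` like any other call, rejected work never re-enters the downstream service faster than the allowed rate.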

Where it's overkill: low-volume calls that never approach any limit, or fire-and-forget traffic where the occasional rejection genuinely doesn't matter. But the moment you're doing bulk work against a metered API — sending notifications, syncing records, scraping a feed — pacing your outbound flow is the difference between steady throughput and a stream of rejections.

Key takeaways

  • Rate limiting governs the calls *you* make outward, keeping you inside a downstream service's published quota.
  • It's the mirror image of throttling: throttling protects your own service from its callers; rate limiting protects someone else's service from you.
  • A common mechanism is a token bucket — you spend a token per call and refill at the allowed rate, smoothing bursts into a steady stream.
  • Excess work waits in a buffer or queue rather than being fired off and rejected, so you avoid wasted calls and error storms.
  • It turns unpredictable bursts into a predictable, compliant flow — and avoids the penalty of tripping a provider's limiter.
