
Rate Limiting

A field guide to the algorithms, tradeoffs, and operational sharp edges.

DATE:
APR.25.2026
READ:
25 MIN

The lifecycle of one request

A request leaves a client. It crosses the public internet, lands at a CDN, gets routed to a gateway, then to your application, which queries a database, returns a response, and propagates back. At every one of those hops, something could decide to refuse the request — because you’ve sent too many, because the system is overloaded, because somebody else’s traffic is starving yours. Rate limiting is the discipline of refusing requests on purpose, predictably, and explaining why.

It is also the discipline where most engineers reach for the wrong tool, because “limit the rate” is three English words that map to at least five different algorithms.

Each layer matters because each layer fails differently. Edge limiters protect bandwidth and origin compute. Gateway limiters enforce per-key fairness. Application limiters apply business rules a gateway doesn’t see. Database concurrency caps protect the last resource that can’t autoscale. A serious system runs all of them, and they fail independently — a Redis blip at the gateway shouldn’t take down your DB pool’s semaphore.

Throughout this post we’ll talk about accepted, rejected, and queued requests, plus tokens and capacity; the figures below use those terms consistently.

Why count at all

Three problems share one shape:

  • Abuse. A bad actor scripts a login endpoint and tries 10,000 passwords in a minute.
  • Cost. A search endpoint hits an LLM at $0.02 per call. A user with infinite enthusiasm can run up four-figure bills in an afternoon.
  • Cascading load. A dependency slows from 80ms to 800ms. Open connections pile up, threads block, the queue grows, and the whole system tips over despite a constant input rate.

In all three, the lever is the same: refuse some traffic, deliberately, before the resource you’re protecting collapses. How you choose which to refuse is what the rest of this post is about.

Stripe’s Scaling your API with rate limiters and Cloudflare’s Counting things are the two posts to read alongside this one.

Fixed window

Bucket time into fixed intervals. Maintain a counter per (key, window). On a request: INCR, reject if it exceeds the limit. The window ID is floor(now / window_size), so old keys expire naturally.
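As a minimal in-memory sketch (production versions keep the counter in Redis; the names here are illustrative):

```javascript
// Fixed-window counter, in-memory sketch. One counter per (key, window id);
// in a real store, old (key, window) entries expire via TTL.
function fixedWindow(limit, windowMs) {
  const counts = new Map(); // "key:windowId" -> count
  return function allow(key, now = Date.now()) {
    const windowId = Math.floor(now / windowMs); // floor(now / window_size)
    const slot = `${key}:${windowId}`;
    const n = (counts.get(slot) ?? 0) + 1;       // the INCR
    counts.set(slot, n);
    return n <= limit;                           // reject once over the limit
  };
}
```

The Redis version is the same logic: INCR the (key, window) counter with a TTL of one window.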

It’s a single Redis op. It uses one integer of memory per active key. It’s the implementation people default to when they “just need rate limiting.” And it has a problem.

[Figure: fixed window — window 1 [0–1000ms] and window 2 [1000–2000ms] each show counter 5 / 5 ✓, but a 1s sliding view across the boundary sees 10 / 5 ✗. Both windows are compliant; the sliding view is not.]

A client with bad intentions (or bad timing) can fire a full window’s limit of requests at the very end of one window and another full limit at the very start of the next, doubling their effective rate over any 1-second sliding view. The limiter sees two compliant windows; the backend sees a 2× burst. The other failure mode is synchronized resets: every limited client retries at exactly T + window, producing a periodic CPU spike on the backend, every minute on the dot. (Mitigation: randomize window alignment per key, e.g. window_start = hash(key) % window.)
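The alignment jitter can be sketched like this (the string hash is illustrative; any stable per-key hash works):

```javascript
// Per-key window offset: each key's windows start at a stable offset,
// so limited clients don't all reset (and retry) at the same instant.
function windowId(key, now, windowMs) {
  let h = 0;
  for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0; // stable string hash
  const offset = h % windowMs;                               // hash(key) % window
  return Math.floor((now - offset) / windowMs);
}
```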

Fixed window is fine for coarse abuse protection where 2× burst tolerance doesn’t matter, and for daily/monthly billing quotas where the “thundering herd at the boundary” isn’t a real concern. Anywhere a client has incentive to game the boundary, reach for something else.

Sliding window log

Store a sorted set of request timestamps per key. On each request: drop entries older than the window, count what’s left, allow if under limit. This is the algorithm that admits exactly the limit, no more, no boundary exploit. It’s also the algorithm that costs O(limit) memory per key and multiple Redis ops per request.
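An in-memory sketch of the same logic (Redis versions use a sorted set: ZREMRANGEBYSCORE to drop old entries, ZCARD to count, ZADD to record):

```javascript
// Sliding-window log: store timestamps of allowed requests, count what's
// still inside the window. Exact, but O(limit) memory per key.
function slidingLog(limit, windowMs) {
  const logs = new Map(); // key -> timestamps of allowed requests
  return function allow(key, now = Date.now()) {
    const ts = (logs.get(key) ?? []).filter(t => t > now - windowMs); // drop expired
    const ok = ts.length < limit; // admits exactly `limit` per sliding window
    if (ok) ts.push(now);
    logs.set(key, ts);
    return ok;
  };
}
```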

Reach for it when accuracy is auditable (login throttling, abuse cases you’d defend in court) and limits are low (≤100/window). Don’t reach for it when limits are 10k/minute across millions of users — you’ll OOM your store.

Sliding window counter

Keep two fixed-window counters and interpolate. Cloudflare popularized this; it’s the default when you want production-grade behaviour without sliding-log’s bill.

[Figure: sliding window counter — previous window count 8, current window count 3, with “now” partway into the current window; est = prev × (1 − elapsed / window) + curr.]

The interpolation assumes requests were uniformly distributed within the previous window. If the previous window had a spike at its end, you under-count; if at its start, you over-count. In Cloudflare’s measurements on real traffic the error was under 0.003% versus sliding log — see their “Counting Things, a lot of different things” post. For most production HTTP APIs, this is the correct default.
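The estimate itself is one line; a sketch:

```javascript
// Sliding-window-counter estimate: weight the previous window's count by the
// fraction of it still covered by the sliding view, then add the current count.
function estimate(prevCount, currCount, elapsedMs, windowMs) {
  const prevWeight = 1 - elapsedMs / windowMs;
  return prevCount * prevWeight + currCount;
}
```

Two integers per key replace the full timestamp log, at the cost of the uniformity assumption above.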

Token bucket

A bucket of capacity B refills at r tokens/sec. Each request consumes one. If the bucket is empty, reject (or queue). Stored lazily as (last_tokens, last_refill_ts) — the math tokens = min(B, last_tokens + (now − last_refill) × r) runs on read.

[Figure: token bucket — a bucket of capacity B (the burst limit) refills at rate r; each request consumes a token. tokens ≥ 1: allow · empty: reject (or wait).]

This is the model most engineers actually want when they say “I need rate limiting.” B controls how big a burst you’ll absorb; r controls the long-run ceiling. AWS API Gateway exposes both: burstLimit and rateLimit. Two parameters is also the cost — clients sometimes find the burst behaviour surprising.
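The lazy-refill state machine from above, as a sketch:

```javascript
// Lazy token-bucket refill: state is only (tokens, lastRefill); no background
// timer. The refill math runs on read.
function tryTake(state, now, capacity, ratePerMs) {
  // tokens = min(B, last_tokens + (now − last_refill) × r)
  const tokens = Math.min(capacity, state.tokens + (now - state.lastRefill) * ratePerMs);
  if (tokens < 1) return { allow: false, state: { tokens, lastRefill: now } };
  return { allow: true, state: { tokens: tokens - 1, lastRefill: now } };
}
```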

Leaky bucket

This is the place people get confused. “Leaky bucket” refers to two genuinely different things.

[Figure: leaky bucket — user actions add “water” to a bucket of capacity τ; it drips out at a constant rate.]

The meter is mathematically equivalent to token bucket inverted: requests fill the bucket; it leaks at a constant rate; overflow is rejected. This is what Shopify’s API docs describe (bucket size 40, leak rate 2/sec).

The queue is a real queue: accepted requests are delayed, drained one-by-one at the leak rate. Output is perfectly smooth — the price is latency. Use the queue variant for outbound calls to a third party with a hard rate cap (Stripe, Twilio): you’d rather wait than fail. Don’t use it for synchronous user-facing APIs unless you’ve thought about queue depth, timeouts, and head-of-line blocking.
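A sketch of the queue variant (capacity and drain interval are illustrative parameters; a production version also needs per-job timeouts and shutdown draining):

```javascript
// Leaky-bucket queue: accept up to `capacity` jobs, drain one every `drainMs`.
// Overflow is rejected immediately; accepted jobs trade latency for smoothness.
function leakyQueue(capacity, drainMs, handler) {
  const q = [];
  const timer = setInterval(() => {
    const job = q.shift();
    if (job !== undefined) handler(job); // constant-rate drain
  }, drainMs);
  return {
    offer(job) {
      if (q.length >= capacity) return false; // bucket full: overflow
      q.push(job);
      return true;
    },
    stop() { clearInterval(timer); },
  };
}
```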

GCRA

The Generic Cell Rate Algorithm comes from ATM networking and is the elegant answer when you want token-bucket semantics with O(1) state and no background timer. Store a single value: the Theoretical Arrival Time (TAT) of the next conforming request. Define T = 1/r and τ = burst × T.

[Figure: GCRA, allowed request — t₀ falls inside the tolerance band of width τ + T that ends at tat. Legend: T = emission interval; τ = capacity of bucket; T + τ = delay variation tolerance; tat = theoretical arrival time; t₀ = actual time of request.]

A request at time t₀ is allowed if it falls inside the tolerance band that ends at tat and is τ + T wide. If it does, advance tat by T.

[Figure: GCRA, denied request — t₀ arrives before the τ + T band opens; the allow-at point lies in the future.]

If t₀ arrives before the band opens — meaning the bucket would have to look further into the future than τ allows — reject. The retry-after is exactly allow-at − t₀.

function tryConsume(state, now, T, tau) { // T = 1/r, tau = burst × T (see above)
  const tat = Math.max(now, state.tat);   // never schedule in the past
  const newTat = tat + T;
  if (newTat - now > tau) {
    // rejected: allow-at = newTat − tau, so retry-after = allow-at − now
    return { allow: false, retryAfter: newTat - tau - now };
  }
  return { allow: true, state: { tat: newTat } };
}

That’s it. One integer per key. One atomic CAS. Brandur Leach’s post on the algorithm (Stripe internal) is the canonical reference; the redis-cell Redis module exposes it as CL.THROTTLE. Default choice for Redis-backed limiters.

Concurrency limiting

Rate limiting bounds frequency. Concurrency limiting bounds parallelism. They are not interchangeable.

[Figure: concurrency pool — λ requests/sec enter a pool of N slots and are served after time W; L = λ · W (Little’s Law).]

A 1 req/sec rate limit doesn’t help if each request holds a DB connection for 30 seconds — your pool exhausts at concurrency, not arrival rate. Little’s Law: L = λW. Average concurrency equals arrival rate × service time. If service time blows up under load, your rate-limit numbers become meaningless. The cheapest insurance against latency-induced collapse is a concurrency cap at every resource boundary, sized to your downstream pool.

Always run one. It’s the line of defense rate limits cannot replace.
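The cap itself can be as small as a counting semaphore. A non-blocking sketch (a blocking variant would park waiters instead of rejecting):

```javascript
// Counting semaphore: at most `n` requests in flight past this boundary.
// tryAcquire() rejects instead of queueing — load shedding, not backpressure.
function semaphore(n) {
  let free = n;
  return {
    tryAcquire() {
      if (free === 0) return false;
      free--;
      return true;
    },
    release() { free++; }, // call in a finally block so slots always return
  };
}
```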

Adaptive limiting

What’s your service’s actual capacity right now? Probably not what you guessed in the config file last quarter. Capacity changes with deploys, dependency latencies, traffic mix.

AIMD (additive increase, multiplicative decrease) borrows TCP congestion control: maintain a dynamic limit L. On success, L += 1. On failure or timeout, L *= 0.5. The limit converges to the system’s actual capacity without you guessing. Netflix’s concurrency-limits library uses RTT measurements as the signal — latency rises before throughput collapses, so latency is the early warning.

Adaptive limiters are great for internal RPCs and load-shedding at the LB. They’re a bad fit for billing or fairness use cases — limits are non-deterministic per client. Pair them with a static per-client cap.
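The AIMD core loop is tiny. A sketch with the parameters from above (the +1 step and the halving factor are the usual TCP-style defaults, not taken from any specific library):

```javascript
// AIMD concurrency limit: probe upward additively, back off multiplicatively.
function aimd(initial, min = 1, max = 1000) {
  let limit = initial;
  return {
    current: () => limit,
    onSuccess() { limit = Math.min(max, limit + 1); },             // additive increase
    onFailure() { limit = Math.max(min, Math.floor(limit / 2)); }, // multiplicative decrease
  };
}
```

The asymmetry is the point: recovery from a bad guess is fast (one failure halves the limit), while growth is slow enough not to re-trigger the overload.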

Distributed reality

Once you have more than one instance of your limiter, things get interesting.

  • Redis INCR + EXPIRE. Cheap, but EXPIRE on a fresh key is a separate op — use SET key 0 EX window NX then INCR, or accept a microscopic race.
  • Lua scripts. Atomic, single round trip, can implement GCRA or sliding-window-counter exactly. This is what redis-cell, Upstash Ratelimit, and most production limiters use.
  • CRDTs. For multi-region: accept that limits become eventually consistent. A G-Counter CRDT lets each region count locally and reconcile. You over-admit by up to (regions × propagation_delay × rate), which is usually fine.
  • Probabilistic structures. HyperLogLog for unique-visitor caps; Count-Min Sketch for top-k abuser detection. Cloudflare uses sketches at edge for high-cardinality detection.
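A G-Counter sketch (region names illustrative): each region increments only its own slot, and merge takes the per-slot max, so reconciliation is order-independent and idempotent.

```javascript
// Grow-only counter CRDT: per-region counts, merged by element-wise max.
function gCounter(regions) {
  const counts = Object.fromEntries(regions.map(r => [r, 0]));
  return {
    incr(region, n = 1) { counts[region] += n; }, // each region bumps its own slot
    merge(other) { for (const r of regions) counts[r] = Math.max(counts[r], other[r]); },
    value() { return Object.values(counts).reduce((a, b) => a + b, 0); },
    state() { return { ...counts }; }, // what gets gossiped between regions
  };
}
```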

When the limit store goes down, the question is fail-open or fail-closed. Fail-open (allow traffic) for non-critical endpoints — preserves availability. Fail-closed (reject) for billing or abuse-critical endpoints — preserves correctness. Make it a per-limiter knob, not a global default. And always have a local in-memory fallback limiter so a Redis outage doesn’t yield infinite throughput.
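A sketch of that knob (both `primary` and `fallback` are caller-supplied check functions; the names are illustrative):

```javascript
// Per-limiter failure policy: on a store outage, fail-open routes the decision
// to a local in-memory fallback; fail-closed rejects outright.
function guarded(primary, fallback, failOpen) {
  return key => {
    try {
      return primary(key);              // shared-store decision (e.g. Redis)
    } catch {
      return failOpen ? fallback(key)   // fail-open: local per-instance cap
                      : false;          // fail-closed: reject
    }
  };
}
```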

The wire protocol

What you send back matters as much as what you compute.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
RateLimit-Limit: 60
RateLimit-Remaining: 0
RateLimit-Reset: 1777134688
Content-Type: application/json

{
  "error": "rate_limited",
  "message": "Too many requests. Slow down."
}

The CLIENT exceeded its quota. Use when a specific identity (key, user, IP) is over its limit.

Much limiter middleware treats 429 and 503 as interchangeable, which is wrong. RFC 6585 defines 429 for “this client is over its limit.” 503 is for “the server is overloaded right now, regardless of who you are.” The distinction is observable to clients: a 429 means “back off, you specifically”; a 503 means “the system is unhealthy, and incidentally everyone’s getting them.” Different runbooks, different alerts.

Retry-After is non-negotiable on both. Clients can be much more polite when they know how long to wait. The IETF RateLimit-* headers draft standardizes the metadata; GitHub’s X-RateLimit-* variants are the most common in the wild.

When rate limiting is the wrong tool

Four patterns get conflated with rate limiting and produce systems that 429 healthy traffic while the actual contended resource still melts:

  • Bursty load on a slow dependency → you want a queue with backpressure, not 429s.
  • Protecting a finite resource → use a concurrency limit.
  • Per-customer entitlements → that’s quota accounting (durable, billing-aware, not in-memory).
  • Fair scheduling among tenants → weighted fair queueing or DRR, not a per-tenant rate cap.

Rate limiting handles request frequency. The other patterns handle different invariants. Knowing which you have is half the work.

Decision matrix

| Use case | Traffic | Fairness | Accuracy | Pick |
|---|---|---|---|---|
| Public API, authed | Bursty | per-key | approx ok | GCRA / token bucket |
| Public API, anon abuse | Spiky, hostile | per-IP/ASN | approx | Sliding window counter at edge |
| Billing quotas (daily) | Any | per-account | exact | Fixed window, durable store |
| Outbound to 3rd-party | Yours | n/a | exact | Leaky bucket queue |
| Internal RPC | Variable | per-caller + global | doesn’t matter | AIMD / Netflix concurrency-limits |
| Login / auth | Low, abuse-prone | per-account + per-IP | exact | Sliding window log |
| Expensive queries (search, AI) | Low, costly | per-tenant | approx | Token bucket, cost-weighted consume |
| DB / connection protection | Any | global | n/a | Concurrency limit (semaphore) |
| Multi-region API | Global | per-key | eventually consistent | Local GCRA + CRDT reconcile |
| Detect top abusers | High cardinality | n/a | approx | Count-Min Sketch + heavy hitter |

The defaults a senior engineer reaches for: GCRA via redis-cell for authed APIs, sliding window counter at the edge, concurrency limit at every resource boundary, AIMD for internal load shedding. Everything else is a specialization.

Where to read more