Rate limits and quotas
Rate limits and quotas are the core mechanics for managing demand on shared resources: CPU, network, databases, queues, external APIs. The goals are fairness, predictable SLOs, and protection from bursts, abuse, and "noisy neighbors."
1) Basic concepts
Rate limit - caps the rate of requests/operations (req/s, msg/min, bytes/sec).
Burst - a permissible short-term spike above the average rate.
Quota - a volume limit per time window (documents/day, GB/month).
Concurrency cap - a limit on simultaneous operations (in-flight requests/jobs).
Scope - the key a limit applies to: per-tenant, per-user, per-token, per-endpoint, per-IP, per-region, per-feature.
2) Limiting algorithms
2.1 Token Bucket
Parameters: `rate` (tokens/sec), `burst` (bucket size).
Works like "credit": accumulated tokens allow short peaks.
Suitable for external APIs and user requests.
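A minimal in-process sketch of the algorithm, with `rate` and `burst` as described above (single-threaded; a production limiter would need locking or atomic state):

```python
import time

class TokenBucket:
    """Token bucket: tokens accrue at `rate`/sec up to `burst`; each request spends one."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst                 # start full: permits an initial peak
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=10, burst=5`, five back-to-back calls succeed on the accumulated "credit" and the sixth is rejected until tokens refill.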
2.2 Leaky Bucket
Drains the flow at a constant rate, smoothing it out.
Good for smoothing traffic to sensitive backends.
2.3 Fixed/Sliding Window
Fixed window: simple, but vulnerable to a double hit at window boundaries.
Sliding window: more accurate, but computationally more expensive.
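The accuracy/cost trade-off is visible in a log-based sliding-window sketch, which stores one timestamp per admitted request (hence the extra cost); times are passed explicitly for clarity:

```python
from collections import deque

class SlidingWindowLog:
    """Sliding-window log: exact, but stores up to `limit` timestamps per key."""
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window = limit, window_s
        self.events = deque()               # timestamps of admitted requests

    def allow(self, now: float) -> bool:
        # Evict timestamps that slid out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```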
2.4 GCRA (Generic Cell Rate Algorithm)
Equivalent to Token Bucket, formulated via a theoretical (virtual) arrival time.
Accurate and stable for distributed limiters (less contended state).
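The single-value state is what makes GCRA attractive for shared stores: one compare-and-set on a timestamp replaces a token count plus a refill time. A sketch under one common formulation (times passed explicitly; parameterization is illustrative):

```python
class GCRA:
    """GCRA: conformance is judged against a theoretical arrival time (TAT)."""
    def __init__(self, rate: float, burst: int):
        self.interval = 1.0 / rate              # T: seconds between conforming requests
        self.tolerance = self.interval * burst  # tau: how far ahead of schedule we may run
        self.tat = 0.0                          # the single piece of state

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)
        if tat - now > self.tolerance - self.interval:
            return False                        # request arrived too early
        self.tat = tat + self.interval          # in a shared store: one compare-and-set
        return True
```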
2.5 Concurrency Limits
Limiting concurrent operations.
Protects against depletion of thread/connection pools and head-of-line blocking.
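A concurrency cap is naturally a non-blocking semaphore: excess work is rejected rather than queued, so pools are never exhausted. A minimal sketch:

```python
import threading

class ConcurrencyCap:
    """Caps in-flight operations; excess work is rejected, not queued."""
    def __init__(self, max_inflight: int):
        self._sem = threading.BoundedSemaphore(max_inflight)

    def try_acquire(self) -> bool:
        # Non-blocking: fail fast rather than let callers pile up (head-of-line blocking).
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()
```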
3) Where to apply limits
At the edge (L7 / API gateway): the main barrier; fail fast (429/503); cheap checks.
Inside services: additional caps for heavy operations (exports, reports, transformations).
On egress to external systems: separate limits per third party (to avoid their penalties).
On queues/workers: fairness across shared pools.
4) Scopes and priorities (multi-tenant)
Hierarchy: Global → Region → Tenant/Plan → User/Token → Endpoint/Feature → IP/Device.
Priority-aware: VIP/Enterprise get more `burst` and weight, but must not break overall SLOs.
Limit composition: effective allowance = `min(global, regional, tenant, user, endpoint)`.
5) Volume quotas
Daily/monthly quotas: documents/day, GB/month, messages/min.
Soft/hard thresholds: warnings (at 80/90%) and a hard stop at 100%.
Roll-up: usage accounting per object (tables, files, events) and export to billing.
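The soft/hard distinction can be expressed as a small decision function (the 80/90% thresholds are the ones mentioned above; the return values are illustrative):

```python
def quota_decision(used: int, quota: int, soft_pcts=(0.8, 0.9)) -> str:
    """Map usage against a quota to allow / warn / deny (hard stop)."""
    if used >= quota:
        return "deny"                       # hard threshold: e.g. respond 403
    frac = used / quota
    crossed = [p for p in soft_pcts if frac >= p]
    return f"warn@{round(max(crossed) * 100)}%" if crossed else "allow"
```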
6) Distributed limiters
Requirements: low latency, consistency, fault tolerance, horizontal scaling.
Local + probabilistic sync: per-shard local buckets plus periodic synchronization.
Central store: Redis/KeyDB/Memcached with Lua/atomic ops (INCR/PEXPIRE).
Sharding: keys of the form `limit:{scope}:{id}:{window}` with uniform distribution.
Clock skew: keep the source of truth on the limiter server, not on clients.
Idempotency: idempotency keys prevent double-charging retried requests.
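The INCR/PEXPIRE pattern above can be sketched end-to-end; an in-memory stand-in replaces Redis so the example is self-contained (with real Redis, the two commands should run atomically, e.g. inside a Lua script):

```python
import time

class FakeRedis:
    """In-memory stand-in exposing just the two commands the pattern uses."""
    def __init__(self):
        self.data, self.expiry = {}, {}

    def incr(self, key: str) -> int:
        if key in self.expiry and time.monotonic() >= self.expiry[key]:
            self.data.pop(key, None)        # window expired: counter resets
            self.expiry.pop(key, None)
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def pexpire(self, key: str, ms: int) -> None:
        self.expiry[key] = time.monotonic() + ms / 1000.0

def fixed_window_allow(r, key: str, limit: int, window_ms: int) -> bool:
    """Count requests per window key; the first request opens the window's TTL."""
    count = r.incr(key)
    if count == 1:
        r.pexpire(key, window_ms)
    return count <= limit
```

Key naming would follow the sharding scheme above, e.g. `limit:{scope}:{id}:{window}`.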
7) Anti-abuse and protection
Per-IP + device fingerprint for public endpoints.
Proof-of-Work/CAPTCHA on anomalies.
Slowdown (throttling) instead of outright rejection where UX matters more (e.g., search suggestions).
Adaptive limits: dynamically lower thresholds during incidents or expensive degradations.
8) Client behavior and protocol
Codes: `429 Too Many Requests` (rate), `403` (quota/plan exceeded), `503` (protective degradation).
Best practices:
- `Retry-After: <seconds>` - when to try again.
- `RateLimit-Limit: <limit>;w=<window>` - the active limit and its window.
- `RateLimit-Remaining: <n>` - requests left in the current window.
- `RateLimit-Reset: <seconds>` - when the window resets.
- Backoff: exponential + jitter (full jitter, equal jitter).
- Idempotency: an `Idempotency-Key` header and retry-safe operations.
- Timeouts and cancellation: abort stalled requests properly so they do not hold limiter slots.
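Client-side backoff with full jitter is short enough to get exactly right (a `Retry-After` header, when present, should override the computed delay):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5) -> list[float]:
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n)]."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```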
9) Observability and testing
Tags: `tenant_id`, `plan`, `user_id`, `endpoint`, `region`, `decision` (allow/deny), `reason` (quota/rate/concurrency).
Metrics: throughput, 429/403/503 rejection rate, p95/p99 limiter latency, key cache hit ratio, distribution by plan.
Audit logs: block reasons, top "noisy" keys.
Tests: load profiles (sawtooth/burst/plateau); chaos: Redis/shard failure, clock desynchronization.
10) Integration with billing
Usage counters are collected at the edge and aggregated in batches (every N minutes) idempotently.
Plan reconciliation: overage → overage charges or a temporary plan upgrade.
Discrepancies: reconcile usage vs. invoice; alert on the delta.
11) Fairness inside (queues, workers)
Weighted Fair Queuing/DRR: Allocating slots to tenants by plan weight.
Per-tenant worker pools: hard isolation of VIP/noisy tenants.
Admission control: reject before execution when quotas are exhausted, so queues do not swell.
Caps on concurrency: limit simultaneous heavy jobs.
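A compact deficit-round-robin sketch shows how plan weight translates into served jobs (tenant names, quanta, and job costs are illustrative):

```python
from collections import deque

def drr_schedule(queues: dict, quantum: dict, rounds: int) -> list:
    """Deficit round robin: per round each tenant's deficit grows by its quantum
    and is spent on job costs, so throughput is proportional to weight."""
    deficit = {t: 0 for t in queues}
    served = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                continue
            deficit[tenant] += quantum[tenant]
            while q and q[0][1] <= deficit[tenant]:
                job, cost = q.popleft()
                deficit[tenant] -= cost
                served.append(job)
            if not q:
                deficit[tenant] = 0     # an emptied queue forfeits leftover deficit
    return served
```

With quanta 2:1, a VIP tenant is served twice as many unit-cost jobs per round as a standard one.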
12) Typical plan profiles (example)
```yaml
plans:
  starter:
    rate: 50          # req/s
    burst: 100
    concurrency: 20
    quotas:
      daily_requests: 100_000
      monthly_gb_egress: 50
  business:
    rate: 200
    burst: 400
    concurrency: 100
    quotas:
      daily_requests: 1_000_000
      monthly_gb_egress: 500
  enterprise:
    rate: 1000
    burst: 2000
    concurrency: 500
    quotas:
      daily_requests: 10_000_000
      monthly_gb_egress: 5000
```
13) Architectural reference (verbal scheme)
1. Edge/API gateway: TLS → extract context (tenant/plan) → check limits/quotas → place RateLimit headers → log/trace.
2. Policy Engine: priority rules (VIP), adaptive thresholds.
3. Limiter Store: Redis/KeyDB (atomic ops, LUA), key sharding, replication.
4. Services: secondary limits and caps for heavy operations; idempotency; queues with WFQ/DRR.
5. Usage/Billing: collection, aggregation, invoicing, threshold alerts.
6. Observability: tagged metrics/logs/traces, per-tenant dashboards.
14) Pre-production checklist
- Limit scopes (tenant/user/token/endpoint/IP) and their hierarchy are defined.
- An algorithm is chosen (Token Bucket/GCRA) with `rate`/`burst` parameters.
- Implemented concurrency caps and admission control for heavy operations.
- `RateLimit-*` and `Retry-After` headers are emitted; clients support backoff + jitter.
- The limiter is distributed and fault tolerant (shards, replication, degradation).
- Usage collection is idempotent; it is wired to billing, with alerts on overage.
- Observability: metrics/traces/tagged logs, top "noisy" keys, alerting.
- Tests: bursts, sawtooth, store failure, clock skew, cold start.
- Customer documentation: plan limits, 429/Retry-After examples, retry best practices.
- Exception policy: how and when to temporarily raise limits.
15) Typical errors
Using only a fixed window → a double hit at the window boundary.
Limits only at the border, without caps in services/queues → internal "traffic jams."
Global limit without per-tenant/per-endpoint - "noisy neighbor" breaks all SLOs.
Lack of `burst`: UX suffers during short spikes.
No idempotency and no jittered retries → retry storms.
Limits not reflected in responses (no `Retry-After`, `RateLimit-*`) → clients cannot adapt.
Storing limiter state in the OLTP database → high latency and hot locks.
16) Quick strategy selection
Public APIs with peaks: Token Bucket + large `burst`, `RateLimit-*` headers, CDN/edge cache.
Internal heavy jobs: concurrency caps + WFQ/DRR, admission control.
Third-party integrations: separate egress limits, buffering/retries.
SaaS multi-tenant: limit hierarchy (global→tenant→user→endpoint), VIP prioritization, monthly quotas.
Conclusion
Good rate limits and quotas are a system-level contract between the platform and its clients: a fair share of resources, resilience to spikes, predictable SLOs, and transparent billing. Combine algorithms (Token Bucket/GCRA + concurrency caps), implement a hierarchy of scopes, expose clear headers and metrics, and regularly validate the schemes under real traffic profiles - then the platform stays stable even under aggressive load growth.