Rate limits and quotas
Rate limits and quotas are the core mechanics for managing demand on shared resources: CPU, network, databases, queues, external APIs. The goals are fairness, predictable SLOs, and protection from bursts, abuse, and "noisy neighbors."
1) Basic concepts
Rate limit - caps the rate of requests/operations (req/s, msg/min, bytes/sec).
Burst - a permissible short-term spike above the average rate.
Quota - a volume limit per time window (documents/day, GB/month).
Concurrency cap - a limit on simultaneous operations (in-flight requests/jobs).
Scope - the key a limit applies to: per-tenant, per-user, per-token, per-endpoint, per-IP, per-region, per-feature.
2) Limiting algorithms
2.1 Token Bucket
Parameters: `rate` (tokens/sec), `burst` (bucket size).
Works like "credit": accumulated tokens allow short peaks.
Suitable for external APIs and user requests.
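A minimal in-process sketch of the algorithm, with `rate` and `burst` as described above (single-threaded; a production limiter would need locking or atomic state):

```python
import time

class TokenBucket:
    """Token bucket: tokens accrue at `rate`/sec up to `burst`; each request spends one."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst                 # start full: permits an initial peak
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=10, burst=5`, five back-to-back calls succeed on the accumulated "credit" and the sixth is rejected until tokens refill.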
2.2 Leaky Bucket
Drains the flow at a constant rate, smoothing it out.
Good for smoothing traffic to sensitive backends.
2.3 Fixed/Sliding Window
Fixed window: simple, but vulnerable to a double hit at window boundaries.
Sliding window: more accurate, but computationally more expensive.
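The accuracy/cost trade-off is visible in a log-based sliding-window sketch, which stores one timestamp per admitted request (hence the extra cost); times are passed explicitly for clarity:

```python
from collections import deque

class SlidingWindowLog:
    """Sliding-window log: exact, but stores up to `limit` timestamps per key."""
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window = limit, window_s
        self.events = deque()               # timestamps of admitted requests

    def allow(self, now: float) -> bool:
        # Evict timestamps that slid out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```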
2.4 GCRA (Generic Cell Rate Algorithm)
Equivalent to Token Bucket, formulated via a theoretical (virtual) arrival time.
Accurate and stable for distributed limiters (less contended state).
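The single-value state is what makes GCRA attractive for shared stores: one compare-and-set on a timestamp replaces a token count plus a refill time. A sketch under one common formulation (times passed explicitly; parameterization is illustrative):

```python
class GCRA:
    """GCRA: conformance is judged against a theoretical arrival time (TAT)."""
    def __init__(self, rate: float, burst: int):
        self.interval = 1.0 / rate              # T: seconds between conforming requests
        self.tolerance = self.interval * burst  # tau: how far ahead of schedule we may run
        self.tat = 0.0                          # the single piece of state

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)
        if tat - now > self.tolerance - self.interval:
            return False                        # request arrived too early
        self.tat = tat + self.interval          # in a shared store: one compare-and-set
        return True
```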
2.5 Concurrency Limits
Limiting concurrent operations.
Protects against depletion of thread/connection pools and head-of-line blocking.
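A concurrency cap is naturally a non-blocking semaphore: excess work is rejected rather than queued, so pools are never exhausted. A minimal sketch:

```python
import threading

class ConcurrencyCap:
    """Caps in-flight operations; excess work is rejected, not queued."""
    def __init__(self, max_inflight: int):
        self._sem = threading.BoundedSemaphore(max_inflight)

    def try_acquire(self) -> bool:
        # Non-blocking: fail fast rather than let callers pile up (head-of-line blocking).
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()
```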
3) Where to apply limits
At the edge (L7 / API gateway): the main barrier; fail fast (429/503); cheap checks.
Inside services: additional caps for heavy operations (exports, reports, transformations).
On egress to external systems: separate limits per third party (to avoid their penalties).
On queues/workers: fairness across shared pools.
4) Scopes and priorities (multi-tenant)
Hierarchy: Global → Region → Tenant/Plan → User/Token → Endpoint/Feature → IP/Device.
Priority-aware: VIP/Enterprise get more `burst` and weight, but must not break overall SLOs.
Limit composition: effective allowance = `min(global, regional, tenant, user, endpoint)`.
5) Volume quotas
Daily/monthly quotas: documents/day, GB/month, messages/min.
Soft/hard thresholds: warnings (at 80/90%) and a hard stop at 100%.
Roll-up: usage accounting per object (tables, files, events) and export to billing.
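The soft/hard distinction can be expressed as a small decision function (the 80/90% thresholds are the ones mentioned above; the return values are illustrative):

```python
def quota_decision(used: int, quota: int, soft_pcts=(0.8, 0.9)) -> str:
    """Map usage against a quota to allow / warn / deny (hard stop)."""
    if used >= quota:
        return "deny"                       # hard threshold: e.g. respond 403
    frac = used / quota
    crossed = [p for p in soft_pcts if frac >= p]
    return f"warn@{round(max(crossed) * 100)}%" if crossed else "allow"
```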
6) Distributed limiters
Requirements: low latency, consistency, fault tolerance, horizontal scaling.
Local + probabilistic sync: per-shard local buckets plus periodic synchronization.
Central store: Redis/KeyDB/Memcached with Lua/atomic ops (INCR/PEXPIRE).
Sharding: keys of the form `limit:{scope}:{id}:{window}` with uniform distribution.
Clock skew: keep the source of truth on the limiter server, not on clients.
Idempotency: idempotency keys prevent double-charging retried requests.
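The INCR/PEXPIRE pattern above can be sketched end-to-end; an in-memory stand-in replaces Redis so the example is self-contained (with real Redis, the two commands should run atomically, e.g. inside a Lua script):

```python
import time

class FakeRedis:
    """In-memory stand-in exposing just the two commands the pattern uses."""
    def __init__(self):
        self.data, self.expiry = {}, {}

    def incr(self, key: str) -> int:
        if key in self.expiry and time.monotonic() >= self.expiry[key]:
            self.data.pop(key, None)        # window expired: counter resets
            self.expiry.pop(key, None)
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def pexpire(self, key: str, ms: int) -> None:
        self.expiry[key] = time.monotonic() + ms / 1000.0

def fixed_window_allow(r, key: str, limit: int, window_ms: int) -> bool:
    """Count requests per window key; the first request opens the window's TTL."""
    count = r.incr(key)
    if count == 1:
        r.pexpire(key, window_ms)
    return count <= limit
```

Key naming would follow the sharding scheme above, e.g. `limit:{scope}:{id}:{window}`.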
7) Anti-abuse and protection
Per-IP + device fingerprint for public endpoints.
Proof-of-Work/CAPTCHA on anomalies.
Slowdown (throttling) instead of outright rejection where UX matters more (e.g., search suggestions).
Adaptive limits: dynamically lower thresholds during incidents or expensive degradations.
8) Client behavior and protocol
Codes: `429 Too Many Requests` (rate), `403` (quota/plan exceeded), `503` (protective degradation).
Best practices:
- `Retry-After: <seconds>` - when to try again.
- `RateLimit-Limit: <limit>;w=<window>` - the active limit and its window.
- `RateLimit-Remaining: <n>` - requests left in the current window.
- `RateLimit-Reset: <seconds>` - when the window resets.
- Backoff: exponential + jitter (full jitter, equal jitter).
- Idempotency: an `Idempotency-Key` header and retry-safe operations.
- Timeouts and cancellation: abort stalled requests properly so they do not hold limiter slots.
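Client-side backoff with full jitter is short enough to get exactly right (a `Retry-After` header, when present, should override the computed delay):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5) -> list[float]:
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n)]."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```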
9) Observability and testing
Tags: `tenant_id`, `plan`, `user_id`, `endpoint`, `region`, `decision` (allow/deny), `reason` (quota/rate/concurrency).
Metrics: throughput, 429/403/503 rejection rate, p95/p99 limiter latency, key cache hit ratio, distribution by plan.
Audit logs: block reasons, top "noisy" keys.
Tests: load profiles (sawtooth/burst/plateau); chaos: Redis/shard failure, clock desynchronization.
10) Integration with billing
Usage counters are collected at the edge and aggregated in batches (every N minutes) idempotently.
Plan reconciliation: overage → overage charges or a temporary plan upgrade.
Discrepancies: reconcile usage vs. invoice; alert on the delta.
11) Fairness inside (queues, workers)
Weighted Fair Queuing/DRR: Allocating slots to tenants by plan weight.
Per-tenant worker pools: hard isolation of VIP/noisy tenants.
Admission control: reject before execution when quotas are exhausted, so queues do not swell.
Caps on concurrency: limit simultaneous heavy jobs.
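A compact deficit-round-robin sketch shows how plan weight translates into served jobs (tenant names, quanta, and job costs are illustrative):

```python
from collections import deque

def drr_schedule(queues: dict, quantum: dict, rounds: int) -> list:
    """Deficit round robin: per round each tenant's deficit grows by its quantum
    and is spent on job costs, so throughput is proportional to weight."""
    deficit = {t: 0 for t in queues}
    served = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                continue
            deficit[tenant] += quantum[tenant]
            while q and q[0][1] <= deficit[tenant]:
                job, cost = q.popleft()
                deficit[tenant] -= cost
                served.append(job)
            if not q:
                deficit[tenant] = 0     # an emptied queue forfeits leftover deficit
    return served
```

With quanta 2:1, a VIP tenant is served twice as many unit-cost jobs per round as a standard one.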
12) Typical plan profiles (example)
```yaml
plans:
  starter:
    rate: 50          # req/s
    burst: 100
    concurrency: 20
    quotas:
      daily_requests: 100_000
      monthly_gb_egress: 50
  business:
    rate: 200
    burst: 400
    concurrency: 100
    quotas:
      daily_requests: 1_000_000
      monthly_gb_egress: 500
  enterprise:
    rate: 1000
    burst: 2000
    concurrency: 500
    quotas:
      daily_requests: 10_000_000
      monthly_gb_egress: 5000
```
13) Architectural reference (verbal scheme)
1. Edge/API gateway: TLS → extract context (tenant/plan) → check limits/quotas → place RateLimit headers → log/trace.
2. Policy Engine: priority rules (VIP), adaptive thresholds.
3. Limiter Store: Redis/KeyDB (atomic ops, LUA), key sharding, replication.
4. Services: secondary limits and caps for heavy operations; idempotency; queues with WFQ/DRR.
5. Usage/Billing: collection, aggregation, invoicing, threshold alerts.
6. Observability: tagged metrics/logs/traces, per-tenant dashboards.
14) Pre-production checklist
- Limit scopes (tenant/user/token/endpoint/IP) and their hierarchy are defined.
- An algorithm is chosen (Token Bucket/GCRA) with `rate`/`burst` parameters.
- Implemented concurrency caps and admission control for heavy operations.
- `RateLimit-*` and `Retry-After` headers are emitted; clients support backoff + jitter.
- The limiter is distributed and fault tolerant (shards, replication, degradation).
- Usage collection is idempotent; it is wired to billing, with alerts on overage.
- Observability: metrics/traces/tagged logs, top "noisy" keys, alerting.
- Tests: bursts, sawtooth, store failure, clock skew, cold start.
- Customer documentation: plan limits, 429/Retry-After examples, retry best practices.
- Exception policy: how and when to temporarily raise limits.
15) Typical errors
Using only a fixed window → a double hit at the window boundary.
Limits only at the border, without caps in services/queues → internal "traffic jams."
Global limit without per-tenant/per-endpoint - "noisy neighbor" breaks all SLOs.
Lack of `burst`: UX suffers during short spikes.
No idempotency and no jittered retries → retry storms.
Limits not reflected in responses (no `Retry-After`, `RateLimit-*`) → clients cannot adapt.
Storing limiter state in the OLTP database → high latency and hot locks.
16) Quick strategy selection
Public APIs with peaks: Token Bucket + large `burst`, `RateLimit-*` headers, CDN/edge cache.
Internal heavy jobs: concurrency caps + WFQ/DRR, admission control.
Third-party integrations: separate egress limits, buffering/retries.
SaaS multi-tenant: limit hierarchy (global→tenant→user→endpoint), VIP prioritization, monthly quotas.
Conclusion
Good rate limits and quotas are a system-level contract between the platform and its clients: a fair share of resources, resilience to spikes, predictable SLOs, and transparent billing. Combine algorithms (Token Bucket/GCRA + concurrency caps), implement a hierarchy of scopes, expose clear headers and metrics, and regularly validate the schemes under real traffic profiles - then the platform stays stable even under aggressive load growth.