GH GambleHub

Rate limits and quotas

Rate limits and quotas are the fundamental mechanisms for managing demand on shared resources: CPU, network, database, queues, external APIs. The goals are fairness, predictable SLOs, and protection from bursts, abuse, and the "noisy neighbor" problem.

1) Basic concepts

Rate limit - caps the rate of requests/operations (req/s, msg/min, bytes/sec).
Burst - a permissible short-term spike above the average rate.
Quota - a volume limit per time window (documents/day, GB/month).
Concurrency cap - a limit on simultaneous operations (in-flight requests/jobs).
Scope - where a limit applies: per-tenant, per-user, per-token, per-endpoint, per-IP, per-region, per-feature.

2) Limiting algorithms

2.1 Token Bucket

Parameters: `rate` (tokens/sec), `burst` (bucket size).
Works like "credit": accumulated tokens allow short peaks.
Suitable for external APIs and user requests.
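The "credit" behavior can be shown with a minimal single-process sketch (class and parameter names are illustrative; a production limiter also needs locking and shared state):

```python
import time

class TokenBucket:
    """Token bucket: tokens refill at `rate` per second, up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.burst = burst        # bucket capacity (max accumulated credit)
        self.tokens = burst       # start full: allow an initial burst
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

An idle client banks up to `burst` tokens and may spend them in one spike; after that, sustained throughput is bounded by `rate`.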

2.2 Leaky Bucket

Drains requests at a constant rate, smoothing the flow.
Good for smoothing traffic to sensitive backends.
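The "leaky bucket as a meter" variant can be sketched the same way (names are illustrative): requests pour "water" in, the level drains at a constant rate, and overflow is rejected:

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the level drains at `rate` units/sec;
    a request that would overflow `capacity` is rejected."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Drain at a constant rate since the last update.
        self.level = max(0.0, self.level - (now - self.updated) * self.rate)
        self.updated = now
        if self.level + cost <= self.capacity:
            self.level += cost
            return True
        return False
```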

2.3 Fixed/Sliding Window

Fixed window: simple, but vulnerable to double bursts at window boundaries.

Sliding window: more accurate, but more expensive computationally.
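A common compromise is the sliding-window counter, which approximates the true sliding window by weighting the previous fixed window's count by its remaining overlap (a sketch; `now` is passed explicitly to keep it deterministic and testable):

```python
class SlidingWindowCounter:
    """Sliding-window approximation over two adjacent fixed windows."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.idx = 0          # index of the current fixed window
        self.current = 0      # hits in the current fixed window
        self.previous = 0     # hits in the previous fixed window

    def allow(self, now: float) -> bool:
        idx = int(now // self.window)
        if idx != self.idx:
            # Carry the count over only if the windows are adjacent.
            self.previous = self.current if idx == self.idx + 1 else 0
            self.current = 0
            self.idx = idx
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        estimate = self.current + self.previous * overlap
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```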

2.4 GCRA (Generic Cell Rate Algorithm)

Equivalent to Token Bucket, formulated via a theoretical (virtual) arrival time.
Accurate and stable for distributed limiters (less contended state).
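A sketch of GCRA in terms of a theoretical arrival time (TAT); here the burst tolerance is chosen as `(burst - 1)` emission intervals so that exactly `burst` back-to-back requests pass (names are illustrative):

```python
class GCRA:
    """GCRA: a request is allowed if it does not run ahead of the
    theoretical schedule by more than the burst tolerance."""

    def __init__(self, rate: float, burst: int):
        self.interval = 1.0 / rate                   # emission interval T
        self.tolerance = (burst - 1) * self.interval  # burst tolerance tau
        self.tat = 0.0                               # theoretical arrival time

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)
        if tat - now <= self.tolerance:
            self.tat = tat + self.interval
            return True
        return False
```

Only one float per key (`tat`) needs to be stored, which is why GCRA travels well to distributed stores.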

2.5 Concurrency Limits

Limiting concurrent operations.
Protects against depletion of thread/connection pools and head-of-line blocking.
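A concurrency cap reduces to a semaphore with non-blocking acquisition, so callers fail fast instead of queueing (illustrative names):

```python
import threading

class ConcurrencyCap:
    """Cap on simultaneous operations: reject instead of waiting,
    so pools are never exhausted by queued callers."""

    def __init__(self, limit: int):
        self.slots = threading.Semaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when no slot is free.
        return self.slots.acquire(blocking=False)

    def release(self) -> None:
        self.slots.release()
```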

3) Where to apply limits

At the edge (L7/API gateway): the main barrier; fail fast (429/503); cheap checks.
Inside services: additional caps for heavy operations (exports, reports, transformations).
On egress to external systems: dedicated limits per third party (to avoid penalties/bans).
On queues/workers: fairness across shared pools.

4) Scopes and priorities (multi-tenant)

Hierarchy: Global → Region → Tenant/Plan → User/Token → Endpoint/Feature → IP/Device.
Priority-aware: VIP/Enterprise get more `burst` and weight, but must not break overall SLOs.
Limit composition: the effective allowance = `min(global, regional, tenant, user, endpoint)`.
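Composition can be sketched as "allow only if every applicable scope has headroom" (a toy in-memory version; the scope names and limits are illustrative):

```python
def allow_request(counts: dict[str, int], limits: dict[str, int]) -> bool:
    """Allow only if every scope in the hierarchy still has headroom;
    the effective allowance is the strictest (min) across scopes."""
    if all(counts.get(scope, 0) < limit for scope, limit in limits.items()):
        # Charge every scope atomically only on success.
        for scope in limits:
            counts[scope] = counts.get(scope, 0) + 1
        return True
    return False
```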

5) Volume quotas

Daily/monthly quotas: documents/day, GB/month, messages/min.
Soft/hard thresholds: warnings (at 80/90%) and a hard stop.
Roll-up: accounting per object (tables, files, events) and exporting usage to billing.
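Soft/hard thresholds reduce to a three-way decision; the 80%/100% cut-offs below are an assumed policy, not a standard:

```python
def quota_decision(used: float, quota: float,
                   soft: float = 0.8, hard: float = 1.0) -> str:
    """Soft/hard quota check: warn near the limit, stop at the limit."""
    ratio = used / quota
    if ratio >= hard:
        return "block"   # hard stop, e.g. HTTP 403 quota exceeded
    if ratio >= soft:
        return "warn"    # notify the tenant, keep serving
    return "allow"
```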

6) Distributed limiters

Requirements: low latency, consistency, fault tolerance, horizontal scaling.

Local + probabilistic sync: local shard buckets + periodic synchronization.
Central store: Redis/KeyDB/Memcached with Lua/atomic ops (INCR/PEXPIRE).
Sharding: keys of the form `limit:{scope}:{id}:{window}` with uniform distribution.
Clock skew: keep the source of truth on the limiter server, not on clients.
Idempotency: idempotency keys prevent double-charging on retries.
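The central-store pattern reduces to the classic increment-with-TTL recipe on keys like `limit:{scope}:{id}:{window}`. In the sketch below an in-memory dict stands in for Redis so it is self-contained; in real Redis the increment and expiry must run atomically (e.g. a Lua script):

```python
class FixedWindowStore:
    """In-memory stand-in for Redis: emulates INCR + expiry on a key.
    In Redis, run both operations atomically (Lua / MULTI)."""

    def __init__(self):
        self.data: dict[str, tuple[int, float]] = {}  # key -> (count, expires_at)

    def incr(self, key: str, ttl: float, now: float) -> int:
        count, expires = self.data.get(key, (0, 0.0))
        if now >= expires:                  # window expired: start fresh
            count, expires = 0, now + ttl
        count += 1
        self.data[key] = (count, expires)
        return count

def allowed(store: FixedWindowStore, scope: str, ident: str,
            limit: int, window: float, now: float) -> bool:
    # Window index in the key name keeps counters disjoint per window.
    key = f"limit:{scope}:{ident}:{int(now // window)}"
    return store.incr(key, window, now) <= limit
```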

7) Anti-abuse and protection

Per-IP + device fingerprint for public endpoints.
Proof-of-Work/CAPTCHA on anomalies.
Slowdown (throttling) instead of outright rejection where UX matters more (e.g. search suggestions).
Adaptive limits: dynamically lower thresholds during incidents and expensive degradations.

8) Client behavior and protocol

Codes: `429 Too Many Requests` (rate), `403` (quota/plan exceeded), `503` (protective degradation).

Best practice:
  • `Retry-After` - tells the client when it may try again.
  • `RateLimit-*` header family (IETF):
    • `RateLimit-Limit` (optionally with a `;w=` window parameter)
    • `RateLimit-Remaining`
    • `RateLimit-Reset`
  • Backoff: exponential + jitter (full jitter, equal jitter).
  • Idempotency: `Idempotency-Key` header and repeatable safe operations.
  • Timeouts and cancellations: interrupt hung requests promptly so they do not hold on to limits.

9) Observability and testing

Tags: `tenant_id`, `plan`, `user_id`, `endpoint`, `region`, `decision` (allow/deny), `reason` (quota/rate/concurrency).
Metrics: throughput, 429/403/503 rejection rate, p95/p99 limiter latency, key cache hit ratio, distribution by plan.
Audit logs: reasons for blocks, top "noisy" keys.
Tests: load profiles (sawtooth/burst/plateau); chaos: Redis/shard failure, clock desynchronization.

10) Integration with billing

Usage counters are collected at the edge and aggregated in batches (every N minutes) with idempotency.
Plan reconciliation: overuse → overage charges or a temporary plan upgrade.
Discrepancies: reconcile usage vs. invoices; alert on the delta.

11) Fairness inside (queues, workers)

Weighted Fair Queuing/DRR: allocate slots to tenants by plan weight.
Per-tenant worker pools: hard isolation of VIP and noisy tenants.
Admission control: reject before execution when quotas are exhausted, so queues do not swell.
Concurrency caps: limit concurrent heavy jobs.
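Deficit Round Robin can be sketched for unit-cost jobs (in real DRR the deficit is measured in bytes or cost units and decreases by each job's actual cost; names and weights below are illustrative):

```python
from collections import deque

def drr_schedule(queues: dict[str, deque], weights: dict[str, int],
                 rounds: int) -> list[str]:
    """Deficit Round Robin: each round a tenant's deficit grows by its
    weight (quantum); it may dequeue one unit-cost job per deficit point."""
    deficits = {tenant: 0 for tenant in queues}
    served: list[str] = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficits[tenant] = 0   # empty queues must not bank credit
                continue
            deficits[tenant] += weights[tenant]
            while q and deficits[tenant] >= 1:
                served.append(q.popleft())
                deficits[tenant] -= 1
    return served
```

A tenant with weight 2 gets twice the slots of a weight-1 tenant while both have work queued, yet neither is starved.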

12) Typical plan profiles (example)

```yaml
plans:
  starter:
    rate: 50          # req/s
    burst: 100
    concurrency: 20
    quotas:
      daily_requests: 100_000
      monthly_gb_egress: 50
  business:
    rate: 200
    burst: 400
    concurrency: 100
    quotas:
      daily_requests: 1_000_000
      monthly_gb_egress: 500
  enterprise:
    rate: 1000
    burst: 2000
    concurrency: 500
    quotas:
      daily_requests: 10_000_000
      monthly_gb_egress: 5000
```

13) Architectural reference (verbal scheme)

1. Edge/API gateway: TLS → extract context (tenant/plan) → check limits/quotas → set `RateLimit-*` headers → log/trace.
2. Policy Engine: priority rules (VIP), adaptive thresholds.
3. Limiter Store: Redis/KeyDB (atomic ops, Lua), key sharding, replication.
4. Services: secondary limits and caps for heavy operations; idempotency; queues with WFQ/DRR.
5. Usage/Billing: collection, aggregation, invoicing, threshold alerts.
6. Observability: tagged metrics/logs/traces, per-tenant dashboards.

14) Pre-sale checklist

  • Limit scopes (tenant/user/token/endpoint/IP) and their hierarchy are defined.
  • Algorithm chosen (Token Bucket/GCRA) with `rate`/`burst` parameters.
  • Concurrency caps and admission control implemented for heavy operations.
  • `RateLimit-*` and `Retry-After` headers included; clients support backoff + jitter.
  • The limiter is distributed and fault-tolerant (shards, replication, graceful degradation).
  • Usage collection is idempotent; wired into billing, with overspend alerts.
  • Observability: tagged metrics/traces/logs, top "noisy" keys, alerting.
  • Tests: bursts, "sawtooth" profiles, store failure, clock skew, cold start.
  • Customer documentation: plan limits, 429/Retry-After examples, retry best practices.
  • Exception policy: how and when to temporarily raise limits.

15) Typical errors

A global limit without per-tenant/per-endpoint scopes - a "noisy neighbor" breaks everyone's SLOs.
No `burst` - UX suffers on short spikes.
Only a fixed window - a double burst across the window boundary.
No idempotency or jittered retries - retry storms.
Limits only at the edge, with no caps in services/queues - internal congestion.
Limits not reflected in responses (no `Retry-After`, `RateLimit-*`) - clients cannot adapt.
Limiter state stored in the OLTP database - high latency and hot locks.

16) Quick strategy selection

Public APIs with peaks: Token Bucket + generous `burst`, `RateLimit-*` headers, CDN/edge cache.
Internal heavy jobs: concurrency caps + WFQ/DRR, admission control.
Third-party integrations: separate egress limits, buffering/retries.
Multi-tenant SaaS: limit hierarchy (global → tenant → user → endpoint), VIP prioritization, monthly quotas.

Conclusion

Good rate limits and quotas are a system-level contract between the platform and the client: a fair share of resources, resilience to spikes, predictable SLOs, and transparent billing. Combine algorithms (Token Bucket/GCRA + concurrency caps), build a hierarchy of scopes, expose clear headers and metrics, and regularly test your schemes under real traffic profiles - then the platform stays stable even under aggressive load growth.
