Tenant isolation and limits
Tenant isolation and limits are the foundation of multi-tenant architecture. Their purpose: the actions of one tenant must never affect the data, security, or SLOs of another, and resources must be distributed fairly and predictably. Below is a practical map of solutions, from the data layer through compute scheduling to incident management.
1) Threat model and targets
Threats
Data leakage between tenants (logical, via caches, via logs).
"Noisy neighbor": performance degradation caused by load spikes from one client.
Privilege escalation (access-policy errors).
Billing drift (mismatch between usage and charges).
Cascading failures (one tenant's incident causes downtime for many).
Objectives
Strict isolation of data and secrets.
Hard limits/quotas and fair scheduling.
Transparent auditing, observability and billing.
Incident localization and rapid recovery per tenant.
2) Isolation levels (end-to-end model)
1. Data
'tenant_id' in keys and indexes, Row-Level Security (RLS).
Encryption: KMS hierarchy → tenant key (KEK) → data keys (DEK).
Separate schemas/DBs for tenants with high requirements (Silo); a shared cluster with RLS for efficiency (Pool).
Retention policies and the "right to be forgotten" per tenant; crypto-shredding of keys.
2. Compute
CPU/RAM/IO quotas, worker pools per tenant, weighted queues.
GC/heap isolation (JVM/runtime settings, containers), parallelism limits.
Per-tenant autoscaling + backpressure.
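The compute-side caps above can be sketched with per-tenant semaphores; a minimal illustration in which tenant names and cap values are hypothetical, and excess work is rejected (backpressure) rather than queued:

```python
import threading
from contextlib import contextmanager

class TenantConcurrencyLimiter:
    """Caps the number of in-flight jobs per tenant; excess requests fail fast."""

    def __init__(self, caps):
        # caps: {tenant_id: max_concurrent_jobs}
        self._sems = {t: threading.BoundedSemaphore(n) for t, n in caps.items()}

    @contextmanager
    def acquire(self, tenant_id):
        sem = self._sems[tenant_id]
        if not sem.acquire(blocking=False):  # backpressure: reject instead of queueing
            raise RuntimeError(f"tenant {tenant_id}: concurrency cap reached")
        try:
            yield
        finally:
            sem.release()

limiter = TenantConcurrencyLimiter({"acme": 2, "globex": 5})
with limiter.acquire("acme"):
    pass  # the tenant's work runs here
```

Rejecting at the cap keeps one tenant's backlog from consuming shared worker capacity; a queued variant would instead trade latency for throughput.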
3. Network
Segmentation: private endpoints/VPC, ACLs by 'tenant_id'.
Rate limiting and per-tenant connection caps at the border.
DDoS/bot protection that takes plan/priority into account.
4. Operations and Processes
Tenant migrations, backups, DR, feature-flags.
Incidents with a "micro blast radius": circuit breaking by 'tenant_id'.
3) Access control and tenant context
AuthN: OIDC/SAML; tokens carry 'tenant_id', 'org_id', 'plan', 'scopes'.
AuthZ: RBAC/ABAC (roles + attributes of project, department, region).
Context at the border: the API gateway extracts and validates the tenant context, enriches it with limits/quotas, and writes it to audit trails.
The "double lock" principle: a check in the service plus an RLS policy in the database.
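A minimal sketch of the border-side half of this "double lock", assuming token claims have already been cryptographically verified upstream; plan names and limit values are illustrative:

```python
# Hypothetical sketch: the gateway builds the tenant context from verified
# token claims and fails closed when the context is missing or invalid.

PLAN_LIMITS = {"starter": {"req_per_sec": 50}, "business": {"req_per_sec": 200}}

class TenantContextError(Exception):
    pass

def build_tenant_context(claims: dict) -> dict:
    """Validate claims and return the context every downstream service receives."""
    tenant_id = claims.get("tenant_id")
    plan = claims.get("plan")
    if not tenant_id or plan not in PLAN_LIMITS:
        raise TenantContextError("missing or invalid tenant context")
    return {
        "tenant_id": tenant_id,
        "org_id": claims.get("org_id"),
        "plan": plan,
        "limits": PLAN_LIMITS[plan],  # attached here so services need not re-derive them
    }

ctx = build_tenant_context({"tenant_id": "acme", "org_id": "o1", "plan": "starter"})
```

The second lock, an RLS policy in the database, still applies even if this check is bypassed; neither layer alone is sufficient.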
4) Data: schemas, caches, logs
Schemas:
- Shared-schema (row-level): maximum efficiency; strict RLS required.
- Per-schema: an isolation/operability tradeoff.
- Per-DB/cluster (Silo): for VIP/regulated tenants.
Cache: key prefixes 'tenant:{id}:...', TTL by plan, cache-stampede protection (locking/early refresh).
Logs/metadata: full pseudonymization of PII, filters by 'tenant_id', no mixing of logs from different tenants.
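The cache conventions above can be illustrated with a toy in-process cache; the TTL values and tenant names here are made up:

```python
import time

class TenantCache:
    """In-process cache with tenant-prefixed keys and per-plan TTLs (toy sketch)."""

    def __init__(self, ttl_by_plan):
        self._store = {}
        self._ttl = ttl_by_plan  # e.g. {"starter": 30, "enterprise": 300} seconds

    def _key(self, tenant_id, key):
        return f"tenant:{tenant_id}:{key}"  # prefix makes cross-tenant hits impossible

    def set(self, tenant_id, plan, key, value):
        expires = time.monotonic() + self._ttl[plan]
        self._store[self._key(tenant_id, key)] = (value, expires)

    def get(self, tenant_id, key):
        entry = self._store.get(self._key(tenant_id, key))
        if entry is None or entry[1] < time.monotonic():
            return None  # miss or expired
        return entry[0]

cache = TenantCache({"starter": 30})
cache.set("acme", "starter", "profile", {"name": "ACME"})
```

Because the tenant id is part of the key itself, a lookup for another tenant can never return this entry, regardless of bugs in the calling code.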
5) Limiting traffic and operations
Basic mechanics
Token Bucket: smooths bursts; parameterized by 'rate'/'burst'.
Leaky Bucket: stabilizes throughput.
Fixed Window/Sliding Window: simple/precise quotas over a time window.
Concurrency limits: caps on simultaneous requests/jobs.
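The Token Bucket mechanic can be sketched in a few lines; the optional `now` parameter exists only to make the behavior deterministic and testable:

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `burst`; a request costs one token."""

    def __init__(self, rate, burst, now=None):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=50, burst=100)  # e.g. a "Starter"-style profile
```

A request that finds the bucket empty is rejected immediately, which is exactly the "quick failure" at the border described below.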
Where to apply
At the border (L7/API gateway): basic protection and "fast failure."
In the core (services/queues): the second line of defense and "fair share."
Policies
By tenant/plan/endpoint/type of operation (public APIs, heavy exports, admin actions).
Priority-aware: VIP gets a larger 'burst' and more weight in arbitration.
Idempotency keys for safe retries.
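A minimal illustration of idempotency keys protecting against double execution on retries; the key and operation names are hypothetical:

```python
class IdempotentHandler:
    """Stores the first result per idempotency key; retries return it instead of re-running."""

    def __init__(self):
        self._results = {}

    def handle(self, idempotency_key, operation):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # safe retry: no double execution
        result = operation()
        self._results[idempotency_key] = result
        return result

calls = []
handler = IdempotentHandler()

def charge():
    calls.append(1)
    return "charged"

handler.handle("req-123", charge)
handler.handle("req-123", charge)  # client retry after a timeout: runs only once
```

In a real system the result store would be shared and durable (e.g. a database keyed by tenant and idempotency key), not an in-process dict.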
Sample profiles (concepts)
Starter: 50 req/s, burst 100, 2 parallel exports.
Business: 200 req/s, burst 400, 5 exports.
Enterprise/VIP: 1000 req/s, burst 2000, dedicated workers.
6) Quotas and fair planning (fairness)
Resource quotas: storage, objects, messages/min, jobs/hour, queue size.
Weighted Fair Queuing/Deficit Round Robin: "Weighted" access to shared workers.
Per-tenant worker pools: rigid isolation for noisy/critical customers.
Admission control: reject or degrade before execution when quotas are exhausted.
Backoff + jitter: exponential delays to keep bursts out of sync.
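Deficit Round Robin from the list above, in toy form: each backlogged tenant's deficit grows by its quantum every round, and heavier-weighted tenants are served proportionally more. Job names, weights, and costs here are invented:

```python
from collections import deque

def deficit_round_robin(queues, quanta, budget):
    """Serve per-tenant queues in DRR order until `budget` jobs have been run."""
    deficit = {t: 0 for t in queues}
    served = []
    while budget > 0 and any(queues.values()):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0  # no backlog: deficit must not accumulate
                continue
            deficit[tenant] += quanta[tenant]
            # Serve jobs whose cost fits within the accumulated deficit.
            while q and q[0][1] <= deficit[tenant] and budget > 0:
                job, cost = q.popleft()
                deficit[tenant] -= cost
                served.append(job)
                budget -= 1
    return served

queues = {
    "vip":     deque([("v1", 1), ("v2", 1), ("v3", 1), ("v4", 1)]),
    "starter": deque([("s1", 1), ("s2", 1), ("s3", 1), ("s4", 1)]),
}
order = deficit_round_robin(queues, quanta={"vip": 2, "starter": 1}, budget=6)
```

With quanta 2:1, the VIP tenant gets roughly twice the throughput of the starter tenant while the starter is never starved, which is the "weighted access to shared workers" described above.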
7) Observability and billing per tenant
Required tags: 'tenant_id', 'plan', 'region', 'endpoint', 'status'.
SLI/SLO per tenant: p95/p99 latency, error rate, availability, utilization, saturation.
Usage metrics: counters of operations/bytes/CPU-seconds → aggregator → invoices.
Billing idempotence: snapshots at the border, protection against double charges and lost events.
Dashboards in segments: VIP/regulated/new tenants.
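A toy sketch of idempotent usage collection with per-tenant tags, combining the usage-metrics and billing-idempotence points above; event ids and metric names are invented:

```python
from collections import defaultdict

class UsageMeter:
    """Aggregates billable usage per (tenant, metric); duplicate events
    (same event_id) are ignored, so collector retries cannot double-bill."""

    def __init__(self):
        self._seen = set()
        self.totals = defaultdict(int)

    def record(self, event_id, tenant_id, metric, amount):
        if event_id in self._seen:
            return False  # replayed delivery: already counted
        self._seen.add(event_id)
        self.totals[(tenant_id, metric)] += amount
        return True

meter = UsageMeter()
meter.record("e1", "acme", "api_requests", 10)
meter.record("e1", "acme", "api_requests", 10)  # duplicate delivery
meter.record("e2", "acme", "bytes_out", 4096)
```

Deduplicating by event id at ingestion is what lets the billing pipeline tolerate at-least-once delivery from the collectors.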
8) Incidents, degradation and DR "by tenant"
Circuit breaking by 'tenant_id': emergency shutdown or throttling of a specific tenant without affecting the rest.
Graceful Degradation: read-only mode, sandbox queues, deferred tasks.
RTO/RPO per tenant: recovery and loss targets for each plan.
Drills: regular "game days" that cut off a noisy tenant and verify DR.
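The per-tenant "fuse" can be sketched as a trivial consecutive-failure circuit breaker; the threshold and tenant names are illustrative, and a production breaker would also add timeouts and half-open probing:

```python
class TenantBreaker:
    """Per-tenant circuit breaker: after `threshold` consecutive failures the
    tenant is tripped and its requests are rejected; other tenants are unaffected."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}
        self.open = set()

    def allow(self, tenant_id):
        return tenant_id not in self.open

    def report(self, tenant_id, ok):
        if ok:
            self.failures[tenant_id] = 0
            self.open.discard(tenant_id)  # a success closes the breaker again
            return
        self.failures[tenant_id] = self.failures.get(tenant_id, 0) + 1
        if self.failures[tenant_id] >= self.threshold:
            self.open.add(tenant_id)

breaker = TenantBreaker(threshold=2)
breaker.report("noisy", ok=False)
breaker.report("noisy", ok=False)  # trips only this tenant
```

Because the state is keyed by tenant, tripping "noisy" has zero effect on any other tenant's traffic, which is exactly the micro-blast-radius property.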
9) Compliance (residency, privacy)
Pinning each tenant to a region; clear rules for cross-regional flows.
Key/data access audit, admin logging.
Manage retention and data export per tenant.
10) Mini reference: how to put it together
Request flow
1. Edge (API gateway): TLS → extract 'tenant_id' → validate token → apply rate limits/quotas → write audit trails.
2. Policy engine: 'tenant_id'/'plan'/'features' context → decision on routing and limits.
3. Service: permission checks + 'tenant_id' labels → database access under RLS → cache with tenant prefix.
4. Usage collection: counters of operations/bytes → aggregator → billing.
Data
Schema/DB by strategy (row-level/per-schema/per-DB).
KMS: tenant keys, rotation, crypto-shredding on deletion.
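Crypto-shredding in miniature: deleting a tenant's data key makes its ciphertext permanently unrecoverable. The XOR "cipher" below is a deliberately toy construction for illustration only; a real system would use a KMS with AES-GCM envelope encryption:

```python
import hashlib
import secrets

def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy XOR cipher (SHA-256 counter keystream). Illustration only, not secure."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)

class TenantKeyring:
    """Per-tenant data keys; deleting a key crypto-shreds everything encrypted with it."""

    def __init__(self):
        self._deks = {}

    def encrypt(self, tenant_id, plaintext: bytes) -> bytes:
        dek = self._deks.setdefault(tenant_id, secrets.token_bytes(32))
        return _keystream_xor(dek, plaintext)

    def decrypt(self, tenant_id, ciphertext: bytes) -> bytes:
        if tenant_id not in self._deks:
            raise KeyError("key shredded: data is unrecoverable")
        return _keystream_xor(self._deks[tenant_id], ciphertext)

    def shred(self, tenant_id):
        self._deks.pop(tenant_id, None)  # "right to be forgotten": drop the key, not the blobs

ring = TenantKeyring()
blob = ring.encrypt("acme", b"customer record")
```

The point is operational: honoring a deletion request means destroying one small key, not hunting down every backup that contains the tenant's ciphertext.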
Computing
Queues with weights, per-tenant worker pools, caps on concurrency.
Autoscaling by per-tenant metrics.
11) Pseudo-politics (for orientation)
yaml limits:
starter:
req_per_sec: 50 burst: 100 concurrency: 20 exports_parallel: 2 business:
req_per_sec: 200 burst: 400 concurrency: 100 exports_parallel: 5 enterprise:
req_per_sec: 1000 burst: 2000 concurrency: 500 exports_parallel: 20
quotas:
objects_max: { starter: 1_000_000, business: 20_000_000, enterprise: 100_000_000 }
storage_gb: { starter: 100, business: 1000, enterprise: 10000 }
12) Pre-production checklist
- Single source of truth for 'tenant_id'; propagated and logged everywhere.
- RLS/ACL enabled at DB level + service check (double lock).
- Encryption keys per tenant, crypto-shredding documented.
- Limits/quotas at the border and inside; bursts and spikes tested.
- Fair queuing and/or dedicated VIP workers; caps on concurrency.
- Per-tenant SLOs and alerts; dashboards by segment.
- Usage-collection is idempotent; billing rollup verified.
- DR/incidents are localized to the tenant; circuit breaking by 'tenant_id' works.
- Caches/logs are separated by tenant; PII masked.
- Migration/backup/export procedures are tenant-based.
13) Typical errors
RLS disabled or bypassed by a "service" user → risk of leakage.
Single global limiter → "noisy neighbor" and SLO violation.
Shared caches/queues without prefixes → data intersection.
Billing based on logs that get lost at peak load.
No per-tenant circuit breaking → cascading failures.
Big-bang migrations with no way to stop a problematic 'tenant_id'.
14) Quick strategy selection
Regulated/VIP: Silo data (per-DB), dedicated workers, strict quotas and residency.
Mass SaaS: Shared-schema + RLS, strong limits at the border, fair-queuing inside.
"Noisy/bursty" load: a large 'burst' plus hard concurrency caps, backpressure, and plan-based priorities.
Conclusion
Tenant isolation and limits are about boundaries and fairness. A consistent 'tenant_id' through the stack, RLS and encryption on the data, rate limiting and quotas at the border and in the core, a fair scheduler, observability, and incident localization: together these give security, predictable quality, and transparent billing for every tenant, even under aggressive platform growth.