GH GambleHub

Graceful degradation

1) The essence of the approach

Graceful degradation is the managed transition of a system to a simpler but useful mode when resources are scarce, dependencies fail, or load peaks. The goal is to preserve the core of user value and platform resilience by sacrificing secondary capabilities and quality.

Key properties:
  • Predictability: predefined scenarios and degradation "ladders."
  • Limited blast radius: features and dependencies are isolated.
  • Observability: metrics, logs, and traces that answer "which degradation level is active and why."
  • Reversibility: a rapid return to normal operation.

2) Principles and boundaries

1. Protect the core: the primary SLA/SLO (e.g., "purchase," "login," "search") takes priority over secondary features (avatars, recommendations, animations).

2. Fail-open vs fail-closed:
  • Security, payments, permissions: fail-closed (refusal is better than a violation).
  • Cached content, hints, avatars: fail-open with a fallback.

3. Time budgets: cascading timeouts from the top down (client → gateway → service → dependency), so no layer waits longer than its caller.

4. Cost control: degradation should reduce CPU/IO/network consumption, not just "hide" errors.

3) Degradation levels

3.1 Client/UX

Skeletons/placeholders and "lazy" loading of secondary widgets.
Partial UI: critical blocks are loaded, secondary blocks are hidden/simplified.

Client-side cache: last-known-good (LKG) data, marked "this data may be stale."

Offline mode: a command queue replayed later (idempotence!).
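The offline mode above depends on idempotence: each queued command carries a client-generated key, so replaying it after reconnection cannot apply the same mutation twice. A minimal sketch (class and field names are illustrative, not from this platform):

```kotlin
import java.util.UUID

// A queued command carries a client-generated idempotency key so that
// replaying it after reconnection cannot apply the mutation twice.
data class QueuedCommand(val idempotencyKey: String, val payload: String)

class OfflineQueue {
    private val pending = ArrayDeque<QueuedCommand>()
    private val applied = mutableSetOf<String>() // dedup store (sketch; server-side in reality)

    fun enqueue(payload: String) =
        pending.addLast(QueuedCommand(UUID.randomUUID().toString(), payload))

    // Replay everything when connectivity returns; already-applied keys are skipped.
    fun flush(send: (QueuedCommand) -> Boolean) {
        while (pending.isNotEmpty()) {
            val cmd = pending.first()
            if (cmd.idempotencyKey in applied || send(cmd)) {
                applied.add(cmd.idempotencyKey)
                pending.removeFirst()
            } else return // still offline: stop and retry later
        }
    }
}
```

In a real system the dedup store lives server-side (e.g., keyed by the idempotency key), so a replayed command is acknowledged without re-executing.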

3.2 Edge/CDN/WAF/API Gateway

Stale-while-revalidate: serve from the cache, refresh in the background.
Rate limiting & load shedding: under overload, shed background/anonymous traffic first.
Geofence/weighted routing: divert traffic to the nearest healthy region.

3.3 Service layer

Partial response: return partial data plus a `warnings` field.
Read-only mode: temporarily prohibit mutations (behind a flag).
Brownout: temporarily disable resource-intensive features (recommendations, enrichment).
Adaptive concurrency: dynamically reduce the concurrency limit under pressure.
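Adaptive concurrency can be approximated with an AIMD controller: grow the limit additively while observed latency stays under target, and halve it as soon as it crosses. A sketch with illustrative thresholds (not the platform's actual algorithm):

```kotlin
// AIMD concurrency limiter: additive increase while p95 latency is healthy,
// multiplicative decrease as soon as it crosses the target.
class AimdLimiter(
    private val targetLatencyMs: Long = 300,
    private val minLimit: Int = 1,
    private val maxLimit: Int = 1000,
) {
    var limit: Int = 100
        private set

    fun onSample(observedP95Ms: Long) {
        limit = if (observedP95Ms <= targetLatencyMs) {
            (limit + 1).coerceAtMost(maxLimit)  // additive increase
        } else {
            (limit / 2).coerceAtLeast(minLimit) // multiplicative decrease
        }
    }
}
```

The multiplicative cut is what makes the limiter safe: it backs off faster than congestion builds, at the cost of briefly under-utilizing the service after a spike.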

3.4 Data/Streaming

Cache as a temporary source of truth with a TTL: "better approximate than nothing."

Reduced accuracy of models/algorithms (fast path vs accurate path).
Defer/queue: move heavy tasks to the background (outbox/job queue).
Priority queues: critical events go in a separate class.

4) Degradation "ladders" (playbooks)

Example for search API:
  • L0 (normal) → L1: hide personalization and banners → L2: disable synonyms/fuzzy search → L3: cap response size and tighten the timeout to 300 ms → L4: serve results from a 5-minute-old cache → L5: "read-only & cached only" + queue requests for later recalculation.
For each level the following is recorded:
  • Triggers: CPU > 85%, p95 latency > target, error rate > threshold, Kafka lag > threshold, dependency health flag.
  • Actions: enable flag X, lower concurrency to N, switch source Y to the cache.
  • Exit criteria: 10 minutes of green metrics and restored resource headroom.
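The asymmetry implied above (escalate immediately on a trigger, step down only after a sustained green window) can be captured in a small controller. The thresholds below are illustrative, not the platform's real ones:

```kotlin
// Degradation ladder controller: escalate on triggers at once,
// step down one level only after `greenWindow` consecutive healthy ticks.
class LadderController(private val greenWindow: Int = 10) {
    var level: Int = 0
        private set
    private var greenTicks = 0

    fun tick(cpuPercent: Double, p95Ms: Long, errorRate: Double) {
        val desired = when {
            errorRate > 0.05 || p95Ms > 1000 -> 3
            cpuPercent > 85.0 || p95Ms > 500 -> 2
            cpuPercent > 70.0                -> 1
            else                             -> 0
        }
        if (desired > level) {        // escalate immediately
            level = desired
            greenTicks = 0
        } else if (desired < level) { // de-escalate only after the green window
            greenTicks++
            if (greenTicks >= greenWindow) {
                level--
                greenTicks = 0
            }
        } else {
            greenTicks = 0
        }
    }
}
```

Stepping down one level at a time (rather than jumping straight to L0) avoids oscillation when the load that triggered degradation is still close by.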

5) Decision-making policies

5.1 Error budget and SLO

Use error-budget burn rate as a brownout/shedding trigger.

Policy: "if the burn rate exceeds 4x for 15 minutes, enable L2 degradation."
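As a worked example of that policy: with a 99.9% SLO the allowed error ratio is 0.001, so 4 failures per 1000 requests is exactly a 4x burn rate. A sketch of the computation (function names are illustrative):

```kotlin
// Burn rate = observed error ratio / error ratio allowed by the SLO.
// For a 99.9% SLO the allowed ratio is 0.001, so 4 errors per 1000
// requests burns the budget 4x faster than sustainable.
fun burnRate(failed: Long, total: Long, sloTarget: Double): Double {
    require(total > 0 && sloTarget in 0.0..1.0)
    val allowed = 1.0 - sloTarget
    return (failed.toDouble() / total) / allowed
}

fun shouldDegrade(failed: Long, total: Long, sloTarget: Double, threshold: Double = 4.0) =
    burnRate(failed, total, sloTarget) > threshold
```

In practice this is evaluated over two windows (e.g., 5 min and 1 h) to filter out short blips while still reacting quickly.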

5.2 Admission control

Limit incoming RPS on critical paths to guarantee p99 latency and prevent queue collapse.
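A common implementation of admission control is a token bucket in front of the critical path: requests beyond the sustainable rate are rejected immediately (so the caller can answer `429` with `Retry-After`) instead of queueing. A sketch with illustrative parameters:

```kotlin
// Token bucket admission control: admit at most `ratePerSec` requests
// sustained, with bursts up to `capacity`; excess is rejected immediately
// instead of queueing behind the critical path.
class TokenBucket(private val ratePerSec: Double, private val capacity: Double) {
    private var tokens = capacity
    private var lastNanos = System.nanoTime()

    @Synchronized
    fun tryAdmit(now: Long = System.nanoTime()): Boolean {
        val elapsedSec = (now - lastNanos) / 1e9
        lastNanos = now
        tokens = minOf(capacity, tokens + elapsedSec * ratePerSec) // refill, capped
        return if (tokens >= 1.0) { tokens -= 1.0; true } else false
    }
}
```

Rejecting up front is the point: a request that would only time out in a queue still costs CPU, memory, and a connection slot while it waits.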

5.3 Prioritization

Classes: interactive > system > background.
Per-tenant priorities (Gold/Silver/Bronze) with fair share scheduling.

6) Patterns and implementations

6.1 Load shedding

Drop requests before they consume resources deep in the stack.
Return `429`/`503` with `Retry-After` and an explanation of the policy (for clients).

Envoy (adaptive concurrency + circuit breaking). An illustrative fragment; note that the adaptive-concurrency HTTP filter and cluster-level circuit breakers live in different parts of the Envoy config:

```yaml
http_filters:
  - name: envoy.filters.http.adaptive_concurrency
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.adaptive_concurrency.v3.AdaptiveConcurrency
      gradient_controller_config:
        sample_aggregate_percentile:
          value: 90

circuit_breakers:
  thresholds:
    - max_requests: 2000
      max_pending_requests: 500
      max_connections: 1000
```

6.2 Brownout (temporary simplification)

The idea: reduce the "brightness" (cost) of a feature when resources run low.

```kotlin
class Brownout(val level: Int) { // 0..3
    fun recommendationsEnabled() = level < 2
    fun imagesQuality() = if (level >= 2) "low" else "high"
    fun timeoutMs() = if (level >= 1) 150 else 300
}
```

6.3 Partial response and warnings

A `warnings`/`degradation` field in the response:

```json
{
  "items": [...],
  "degradation": {
    "level": 2,
    "applied": ["cache_only", "no_personalization"],
    "expiresAt": "2025-10-31T14:20:00Z"
  }
}
```
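On the server side this envelope can be an explicit type, so degradation is reported consistently rather than ad hoc. The field names mirror the JSON above; the class names are illustrative:

```kotlin
// Response envelope mirroring the JSON above: the payload plus an explicit
// degradation block, so clients can render staleness honestly.
data class Degradation(
    val level: Int,
    val applied: List<String>,
    val expiresAt: String, // ISO-8601 timestamp after which clients should re-check
)

data class SearchResponse<T>(
    val items: List<T>,
    val degradation: Degradation? = null, // null means a full-fidelity response
) {
    val isDegraded: Boolean get() = degradation != null
}
```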

6.4 Stale-while-revalidate at the edge (Nginx)

```nginx
proxy_cache_valid 200 10m;
proxy_cache_use_stale error timeout http_500 http_502 http_504 updating;
proxy_cache_background_update on;
```

6.5 Read-only switch (Kubernetes + flag)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-mode   # illustrative name
data:
  MODE: "read_only"
```

The application code should check MODE and reject mutations with a friendly message.
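That check might look like a guard executed before every mutation handler. The function and message below are illustrative, not the platform's actual code:

```kotlin
// Guard that runs before any mutation, reading the mode injected from the
// ConfigMap (passed here as a supplier so it can be re-read dynamically).
class ReadOnlyModeException(msg: String) : RuntimeException(msg)

fun <T> withMutationAllowed(mode: () -> String, action: () -> T): T {
    if (mode() == "read_only") {
        throw ReadOnlyModeException(
            "The platform is temporarily in read-only mode; your change was not saved. Please try again later."
        )
    }
    return action()
}
```

Reading the mode through a supplier (rather than caching it at startup) is what lets the switch take effect without a restart.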

6.6 Kafka: backpressure and queue classes

Switch heavy consumers to a smaller `max.poll.records`; limit producer batch sizes.
Separate "critical" and "bulk" events by topic and quota.
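Under backpressure the switch is mostly a configuration change. The sketch below builds degraded consumer/producer settings; the keys (`max.poll.records`, `fetch.max.bytes`, `batch.size`, `linger.ms`) are standard Kafka client options, the values are illustrative:

```kotlin
import java.util.Properties

// Degraded Kafka settings: heavy consumers poll smaller batches, and the
// producer batches longer to cut request rate. Values are illustrative.
fun consumerProps(degraded: Boolean) = Properties().apply {
    put("max.poll.records", if (degraded) "50" else "500")
    put("fetch.max.bytes", if (degraded) "1048576" else "52428800")
}

fun producerProps(degraded: Boolean) = Properties().apply {
    put("batch.size", if (degraded) "16384" else "131072")
    put("linger.ms", if (degraded) "100" else "5") // wait longer to send fewer, fuller batches
}
```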

6.7 UI: graceful fallback

Hide "heavy" widgets, show the cache/skeleton, and clearly label outdated data.

7) Configuration examples

7.1 Istio outlier detection + priority pools

```yaml
outlierDetection:
  consecutive5xx: 5
  interval: 10s
  baseEjectionTime: 30s
  maxEjectionPercent: 50
```

7.2 Nginx: background traffic is shed first

```nginx
map $http_x_priority $bucket { default low; high high; }

limit_req_zone $binary_remote_addr zone=perip:10m rate=20r/s;
limit_req_status 429;

server {
    location /api/critical/   { limit_req zone=perip burst=40 nodelay; }
    location /api/background/ { limit_req zone=perip burst=5 nodelay; }  # stricter
}
```

7.3 Feature flags / kill-switches

Store them in dynamic configuration (ConfigMap/Consul) and update without a release.
Separate per-feature and global flags; log every activation.

8) Observability

8.1 Metrics

`degradation_level{service}`: the current degradation level.
`shed_requests_total{route, reason}`: how many requests were shed and why.
`stale_responses_total`: how many responses were served from stale cache.
`read_only_mode_seconds_total`.
`brownout_activations_total{feature}`.
Error budget: burn rate and the share of SLO violations.

8.2 Tracing

Span attributes: `degraded=true`, `level=2`, `reason=upstream_timeout`.
Links between retries/hedged requests, to see their contribution to the latency tails.

8.3 Logs/Alerts

Log degradation-level switch events with the cause of the change and an owner.
Alert on a "stuck" level (degradation that lasts too long).

9) Risk management and security

Do not degrade authentication/authorization/data integrity: failure is better.
PII masking must be preserved in every mode.
Finance/payments: idempotent transactions only, strict timeouts and rollbacks; when in doubt, go read-only/hold.

10) Anti-patterns

Silent degradation with no user notice and no telemetry.
Retry storms instead of load shedding and short timeouts.
Global "switches" without segmentation: a huge blast radius.
Mixing production and lightweight paths in the same cache/queue.
Eternal degradation: brownout as the "new normal," with forgotten exit criteria.
Stale writes: attempting to write based on stale data.

11) Implementation checklist

  • Core value and critical user scenarios defined.
  • Degradation ladders compiled per service/domain, with triggers and exit criteria.
  • Timeouts/limits and server-side load shedding in place.
  • Rate limits and priority traffic classes configured.
  • Partial response, read-only mode, and stale-while-revalidate implemented.
  • Feature flags/kill-switches integrated, with auditing.
  • Metrics/tracing/alerts for degradation levels and causes.
  • Regular game-day exercises with simulated overload/failures.
  • SLOs and an error-budget → degradation policy documented.

12) FAQ

Q: When should I choose brownout, and when load shedding?
A: If the goal is to reduce the cost of requests without refusing them, use brownout. If the goal is to protect the system when even simplification does not help, use load shedding.

Q: Do I report degradation to the user?
A: For critical scenarios, yes (a "limited mode" badge). Transparency reduces support load and user frustration.

Q: Can a cache be made a source of truth?
A: Temporarily, yes, with explicit SLAs and staleness labels. For mutations it is prohibited.

Q: How do I keep retries from making things worse?
A: Short timeouts, exponential backoff with jitter, idempotency, and an attempt limit; retry only safe operations.
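That answer as code: a hard attempt cap, exponential backoff with full jitter, and the contract that only idempotent operations are passed in. A sketch (names are illustrative):

```kotlin
import kotlin.random.Random

// Retry only idempotent operations, with exponential backoff + full jitter
// and a hard attempt cap, so failed retries cannot turn into a retry storm.
fun <T> retryIdempotent(
    maxAttempts: Int = 3,
    baseDelayMs: Long = 100,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    op: () -> T,
): T {
    var lastError: Exception? = null
    for (attempt in 0 until maxAttempts) {
        try {
            return op()
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts - 1) {
                val cap = baseDelayMs shl attempt  // 100, 200, 400 ms ...
                sleep(Random.nextLong(0, cap + 1)) // full jitter in [0, cap]
            }
        }
    }
    throw lastError!!
}
```

Full jitter (a uniformly random delay up to the exponential cap) spreads synchronized clients apart, which is exactly what prevents the retry storms listed among the anti-patterns.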

13) Summary

Graceful degradation is an architectural contract and a set of controlled operating modes, activated by metric signals and the error budget. Properly designed ladders, strict timeouts and shedding, caching and brownout, plus strong observability keep your platform useful and economical even in a storm.
