Operations and Maintenance → Service Dependency Management

Service Dependencies
1) Why you need it
Any production platform is a graph: users → Edge/API → domain services → queues/streams → DB/caches → external providers (payments, KYC, game providers). A failure on one edge of the graph often "walks" through the whole network: latencies grow, retries fire, queues back up, cascading failures follow. Dependency management shrinks the "blast radius" and makes releases predictable.
Objectives:
- See the full call graph and understand who depends on whom.
- Prevent cascading failures and "retry storms."
- Plan releases with compatibility and SLO propagation in mind.
- Reduce MTTR: find the true root cause faster.
2) Types of dependencies
Synchronous (RPC: REST/gRPC/GraphQL): tight coupling on latency/availability. Requires timeouts, breakers, and a retry budget.
Asynchronous (event/stream: Kafka/Rabbit/Pulsar): looser coupling, but there is lag/backlog to watch and delivery semantics to handle (at-least-once, idempotency).
Storage (DB/cache/object store): shared resources → contention, connection/IOPS limits, eviction, replication.
External providers (PSP/KYC/game providers): quotas, paid calls, maintenance windows, contractual SLAs.
Operational (releases, feature flags, configs): indirect dependencies through settings, secrets, schema registry.
3) Service catalog and dependency graphs
What we record in the catalog (Backstage/Service Catalog/CMDB):
- Owners (squad/chat/on-call rota), repo, environments, artifacts.
- API contracts (OpenAPI/AsyncAPI), versions, compatibility (backward/forward).
- Inbound/outbound dependencies (upstream/downstream) with type (sync/async), criticality, SLO expectations.
- Timeout/retry budgets, breakers, bulkhead pools.
- Data on quotas and limits of external integrations.
Example catalog entry:
- `service: payments-api`
- Upstream: `user-profile` (sync), `risk-score` (async).
- Downstream: `PSP-X` (sync, quota 2k RPS), `ledger` (async).
- SLO: p99 ≤ 300 ms, 99.9% uptime.
- Timeouts: 200 ms to `PSP-X`, 150 ms to `user-profile`.
- Retries: 2, with exponential backoff and jitter.
- Breaker: open for 30 s at 5% errors over 10 s.
4) SLO propagation and the "latency budget"
In a chain of synchronous calls, the end-to-end SLO is composed from the sum of the latencies and the combined failure probabilities of the links.
Principles:
- The request budget is split top-down: frontend SLO 500 ms → Edge 50 ms → API 150 ms → domain services 200 ms → provider 100 ms.
- Timeouts "outside shorter than inside": the caller's timeout is less than the total internal timeout, so resources are released and zombie calls do not accumulate.
- Retries only for safe codes/exceptions, and always with jitter; no retries on timeouts at a bottleneck (that is how a "storm" starts).
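The arithmetic behind these principles can be sketched in a few lines. This is an illustrative Python sketch, not a real SLO tool; the service names and numbers are taken from the example split above, and serial availability is simply the product of the links' availabilities.

```python
# Sketch: composing SLOs along a chain of synchronous calls.
# Numbers mirror the illustrative budget split in the text.

def chain_availability(availabilities):
    """Serial availability: every link must succeed, so probabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def remaining_budget(total_ms, spent_ms):
    """Latency budget left for the remaining downstream hops."""
    return total_ms - sum(spent_ms)

# Three "good" links still compose into something worse end to end.
links = [0.9999, 0.999, 0.9995]                # edge, api, provider
print(round(chain_availability(links), 5))     # → 0.9984

# 500 ms frontend budget, split top-down: Edge 50 + API 150 + domain 200.
print(remaining_budget(500, [50, 150, 200]))   # → 100 ms left for the provider
```

This is why a chain of four-nines services does not give a four-nines product, and why the deepest hop gets the smallest slice of the budget.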
5) Contracts and interoperability
API versioning: SemVer for contracts; backward-compatible changes via optional fields and schema extensions; removal only through a deprecation period.
Consumer-driven contracts (CDC): consumer tests (Pact-like) run against the provider in CI; the release is blocked if incompatible.
Schema registry (async): versioning of topics/events, schema evolution (Avro/JSON-Schema), can-read-old/can-write-new policies.
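The CDC idea can be shown without a real Pact broker. Below is a minimal, hypothetical sketch: the consumer declares which fields it relies on, and CI fails the provider's build if a response stops satisfying them. The contract shape and field names are invented for illustration.

```python
# Minimal consumer-driven-contract idea (no Pact dependency).
# Contract and response shapes are hypothetical examples.

CONSUMER_CONTRACT = {              # what one consumer relies on
    "required_fields": {"id": str, "amount": int, "currency": str},
}

def check_contract(provider_response, contract):
    """Return problems if the provider breaks a field a consumer depends on."""
    problems = []
    for field, ftype in contract["required_fields"].items():
        if field not in provider_response:
            problems.append(f"missing field: {field}")
        elif not isinstance(provider_response[field], ftype):
            problems.append(f"wrong type for {field}")
    return problems

# Green: the provider may freely ADD fields (backward compatible).
ok = check_contract(
    {"id": "p1", "amount": 100, "currency": "EUR", "extra": 1},
    CONSUMER_CONTRACT)
# Red: removing or renaming a field would block the release in CI.
bad = check_contract({"id": "p1", "amount": 100}, CONSUMER_CONTRACT)
print(ok, bad)   # → [] ['missing field: currency']
```

In a real setup the contract is published by the consumer's test suite and verified in the provider's pipeline, which is exactly the release gate described above.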
6) Engineering stability patterns
Timeouts: separate business SLAs from technical waits; every outgoing call has an explicit timeout.
Retries + backoff + jitter: no more than 2-3 attempts, and only where idempotency allows.
Circuit Breaker: fail fast when a downstream degrades; half-open probes for recovery.
Bulkhead (pool isolation): separate thread/semaphore/connection pools for different downstreams.
Rate-limit/leaky-bucket: so peaks do not kill the downstreams.
Idempotency & deduplication: idempotency keys at the request/message level; dead-letter and retry queues.
Caching and fallbacks: local/distributed caches, stale-while-revalidate, graceful content degradation.
```yaml
outbound:
  psp-x:
    timeout_ms: 200
    retries: 2
    retry_on: [5xx, connect_error]
    backoff: exponential
    jitter: true
    circuit_breaker:
      error_rate_threshold: 0.05
      window_s: 10
      open_s: 30
    pool: dedicated-psp   # max_conns: 200
```
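The same settings can be sketched in code. This is a deliberately simplified, illustrative Python sketch (not a production client): an error-rate circuit breaker over a count window plus exponential backoff with full jitter, assuming retries fire only on a "safe" `ConnectionError` and there is no half-open state.

```python
import random

class CircuitBreaker:
    """Opens when the error rate over the last `window` calls exceeds `threshold`."""
    def __init__(self, threshold=0.05, window=100):
        self.threshold = threshold
        self.window = window
        self.outcomes = []            # True = error, most recent last
        self.open = False

    def record(self, error):
        self.outcomes = (self.outcomes + [error])[-self.window:]
        if len(self.outcomes) >= self.window:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.threshold:
                self.open = True

def backoff_delays(retries=2, base_ms=100, jitter=True):
    """Exponential backoff with full jitter: base, 2*base, ... (randomized)."""
    return [random.uniform(0, base_ms * 2 ** i) if jitter else base_ms * 2 ** i
            for i in range(retries)]

def call_with_retries(fn, breaker, retries=2):
    """At most `retries` extra attempts, only on 'safe' connection errors."""
    if breaker.open:
        raise RuntimeError("circuit open: failing fast")
    for attempt in range(retries + 1):
        try:
            result = fn()
            breaker.record(False)
            return result
        except ConnectionError:       # the only error class we retry
            breaker.record(True)
            if attempt == retries:
                raise

print(backoff_delays(2, jitter=False))   # → [100, 200]
```

A real implementation would add half-open probes and a time-based window; libraries like resilience4j or Envoy's outlier detection provide this out of the box.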
7) Observability of dependencies
Distributed traces (trace ID, baggage): see the request's path across the graph; spans for outgoing calls with tags `peer.service`, `retry`, `timeout`.
Per-dependency metrics: `outbound_latency_p99`, `outbound_error_rate`, `open_circuit`, `retry_count`, `queue_lag`.
- Service map with edges color-coded by SLO and error rate.
- "Top N problem dependencies" over the last week.
- "Blast radius": the list of services affected if X goes down.
- Log correlation: include `trace_id`/`span_id` in logs.
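The "blast radius" above is just reachability over reversed edges in the dependency graph. Here is a small, illustrative Python sketch; the graph (reusing service names from the catalog example) is invented, and edges point from caller to callee.

```python
from collections import deque

# Hypothetical dependency graph: caller → list of callees.
DEPENDS_ON = {
    "edge":         ["payments-api", "user-profile"],
    "payments-api": ["user-profile", "PSP-X", "ledger"],
    "user-profile": ["db"],
    "ledger":       ["db"],
}

def blast_radius(graph, failed):
    """All services transitively affected if `failed` goes down."""
    callers = {}                      # reverse the edges: who calls whom
    for src, deps in graph.items():
        for d in deps:
            callers.setdefault(d, []).append(src)
    affected, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

print(sorted(blast_radius(DEPENDS_ON, "PSP-X")))   # → ['edge', 'payments-api']
```

Fed with the actual graph from traces, the same traversal produces the "who is affected by the fall of X" list for the dashboard.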
8) Dependency-aware release management
Dependency-aware pipelines: the provider's release is blocked if consumer CDC tests are red.
Gradual rollout (feature flags): new fields/endpoints → to 1% of consumers → 10% → 100%.
Canary releases: verify key dependencies and the "latency budget" on a share of traffic.
Schema compatibility: the producer writes `vNew`, consumers read `vOld`/`vNew`; after the migration, garbage-collect the old fields.
9) Incidents and escalation across the graph
Identify the true culprit: alert correlation. If `PSP-X` has degraded, page the owner of the integration, not the entire payments subtree.
Auto-degradation: a "minimal mode" feature flag (lighter endpoints, trimmed bundles, non-critical features disabled).
Cascade guards: limit parallelism, disable retries on the hot path, open the breaker proactively (pre-open).
Runbook for each critical dependency:
- Diagnostics: which dashboards/metrics to look at, how to check quotas/limits.
- Actions: reduce RPS, switch to a backup provider, temporarily serve cached responses.
- Rollback and validation: revert the parameters, confirm p95/p99 and error rate are back to normal.
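The "minimal mode" switch above can be sketched very simply. This is a hypothetical illustration: the flag name and feature lists are invented, and a real setup would read flags from a flag service rather than a module-level dict.

```python
# Sketch of a "minimal mode" degradation switch (names are hypothetical).
FLAGS = {"minimal_mode": False}
NON_CRITICAL = {"recommendations", "live-activity-feed", "rich-avatars"}

def enabled(feature):
    """Non-critical features switch off together when minimal mode is on."""
    if FLAGS["minimal_mode"] and feature in NON_CRITICAL:
        return False
    return True

FLAGS["minimal_mode"] = True                       # incident: shed optional load
print(enabled("checkout"), enabled("recommendations"))   # → True False
```

The point of the single flag is operational: during an incident the on-call flips one switch instead of hunting down individual features.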
10) Dependency criticality matrix
Evaluate each edge along axes such as business criticality and failure impact. Rules:
- For "critical": redundant provisioning, breakers, dedicated pools, chaos tests.
- For "high": at least graceful degradation and a kill switch for the feature.
- For "medium/low": retry limits and a queue budget.
11) Process: from inventory to operation
1. Map the graph: collect actual calls (from traces) plus declared dependencies from the catalog.
2. Assign owners: a responsible on-call for each service and external integration.
3. Define SLOs and budgets: latency/errors, timeouts/retries/pools.
4. Formalize contracts: OpenAPI/AsyncAPI, schemas and CDC.
5. Enable stability patterns: timeouts/retries/circuit/bulkhead.
6. Configure dashboards and alerts per-dependency.
7. Install release gates: block by CDC/compatibility/canary.
8. Run regular game days: chaos experiments that drop key edges.
9. Post-mortems with a focus on communication: what strengthened the cascade, how to narrow the radius.
12) Per-dependency alerts (rule ideas)
Synchronous downstreams:
- `outbound_error_rate{to="X"} > 3% FOR 10m` → warning; `> 5% FOR 5m` → critical.
- `outbound_p99_latency{to="X"} > SLO × 1.3 FOR 10m` → warning.
- `circuit_open{to="X"} == 1 FOR 1m` → page the integration owner.
- `retry_rate{to="X"} > baseline × 2 FOR 5m` + `outbound_rps > 0` → storm risk.
Asynchronous:
- `consumer_lag{topic="Y"} growth > threshold FOR 10m` + `HPA at max` → critical.
- `usage_quota{provider="PSP-X"} > 90% of window` → warning, auto-switch routes.
13) Anti-patterns
"One common stream pool for all downstreams." Total: head-of-line blocking. Divide pools.
No timeouts/with endless retraces. So a storm is born.
Blind retrays of non-idempotent surgeries. Duplicate write-offs/bets.
Hidden "shared DB" as a connectivity point. Strong competition and blockages.
The API version changes without CDC and deprecate plan. Catch mass falls.
Observability only by services, not by connections. It is not visible where the chain breaks.
14) Dashboards: minimum set
Service Map: an interactive map of services with edge metrics (latency/error/volume).
Upstream/Downstream Overview: for the service owner: incoming dependencies (who calls us), outgoing (whom we call), "top problems."
Dependency Drilldown: a card for a specific edge: p50/p95/p99, error classes, breaker open percentage, retries, connection pool, quotas/cost.
Release Context: release/feature-flag annotations on the dependency graphs.
15) Implementation checklist
- Service catalog with owners and contracts (OpenAPI/AsyncAPI).
- Full graph of dependencies from traces (update daily).
- SLO by service and "latency budgets" down the chain.
- Explicit timeouts, retries with jitter, breakers, bulkhead isolation.
- CDC tests in CI as a release gate.
- Dashboards per-dependency and service card.
- Alerts on edges + suppression by root cause.
- Game-days: provider/cluster/topic drop and degradation check.
- Degradation plan: which features we turn off, which caches we turn on.
- Regular postmortems with actions to reduce connectivity.
16) Dependency Management Quality KPIs
Dependency MTTR: median time to restore an edge.
Blast Radius Index: the average number of affected services when one falls.
Coupling Score: the proportion of sync dependencies among all; downward trend.
CDC Pass Rate: % of releases without contract violations.
Retry Storms/Month: target → 0.
Cost of External Calls: cost of external calls per 1k RPS (shows the effect of caching/fallbacks).
17) Fast start (defaults)
Timeouts: 70-80% of the link's budget; the caller's overall timeout < the sum of the internal ones.
Retries: max 2, only for idempotent operations on 5xx/network errors, with backoff + jitter.
Breaker: threshold of 5% errors over 10 s, open = 30 s, half-open probes.
Bulkhead: dedicated pools/connection limits per downstream.
CDC: Mandatory for all public APIs and topics.
Async preference: where possible, move to events/queues (decoupling in time).
18) FAQ
Q: Which matters more: retries or a breaker?
A: Both. Retries cover short-lived failures; the breaker protects against sustained degradation and storms.
Q: How do you know the connection is "too fragile"?
A: High error correlation, little timeout headroom, frequent retries, no fallbacks/caches, long synchronous chains.
Q: Why CDC if we have integration tests?
A: CDC captures consumer expectations and fails the provider's release on incompatibility, before the code reaches production.