SLA/OLA with providers
1) Terms and boundaries
SLI - measurable indicator (availability, p99 latency, successfully processed webhooks, RPO/RTO).
SLO - target SLI value per measurement window (for example, 99.9% over 30 days).
SLA - legally binding document (SLO + procedures + reimbursement).
OLA - internal goals and processes that ensure compliance with SLAs.
UC (Underpinning Contract) - the contract with third parties that "underpins" the SLA (network carriers, data centers, CDN, etc.).
Boundaries: clearly separate the provider's area of responsibility (cloud/WAF/CDN/payment gateway/KYC provider) from your area (code, config, client settings).
2) Criticality matrix and model selection
Segment providers by business impact: the matrix determines the depth of the SLA, the scope of checks, and the requirements for OLAs/UCs.
3) Metrics and measurement windows
Availability: the percentage of time the service handles requests within agreed tolerances.
Latency: p95/p99 for key operations; a "slow success" still counts against the target.
Data reliability: RPO (maximum allowable data loss) and RTO (recovery time).
Bandwidth/limits: guaranteed quotas (RPS/MBps).
Quality of integrations: share of webhooks delivered within ≤ X minutes, share of 2xx responses, retries and deduplication.
Measurement window: calendar month or rolling 30 days; exceptions (planned maintenance) with explicit caps.
- `Availability_ext = 1 − (Downtime_confirmed_outages / Total_minutes_in_window)`
- Where an outage is unavailability confirmed by external monitoring, not merely by the provider's status page.
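A minimal sketch of this calculation (function and variable names are illustrative, not from the text):

```python
def availability(downtime_minutes: float, window_minutes: float = 30 * 24 * 60) -> float:
    """External availability over a window, per the formula above:
    1 - (confirmed downtime / total minutes in window)."""
    return 1.0 - downtime_minutes / window_minutes

# 21.6 minutes of confirmed downtime in a 30-day window -> 99.95%
print(round(availability(21.6) * 100, 2))  # 99.95
```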
4) SLA content (section template)
1. Subject and scope (services, regions, API versions).
2. Definitions (SLI/SLO, "incident," "planned work," "force majeure").
3. Service objectives (SLOs) by request category and region.
4. Monitoring and evidence base: in what way, whose sensors, with what frequency.
5. Incidents and escalations: channels, response/update times, roles.
6. Refunds: credits/fines/bonuses, thresholds, formulas.
7. Security and privacy: DPA, encryption, logs, violation notifications.
8. Service changes: deprecations, notification window, compatibility.
9. Continuity and DR: RPO/RTO, recovery tests.
10. Audit and compliance: the right to audit, reporting, certification.
11. Exit Plan: data export, dates, format, migration assistance.
12. Legal provisions: jurisdiction, force majeure, confidentiality, validity period.
5) Examples of wording (fragments)
5.1 Availability and measurement
"The Provider guarantees 99.95% availability in each calendar month. Availability is measured by the Customer's external synthetic monitoring from ≥3 regions at intervals of ≤1 minute. Unavailability recorded simultaneously in ≥2 regions is classified as a SEV2 incident and counted toward Downtime."
5.2 Key API latency
"p99 response time of `POST /payments/authorize` ≤ 450 ms on 95% of the days of the month. A root-cause analysis report is provided for the share of requests exceeding the threshold."
5.3 Incidents and escalations
"S1: ack ≤ 15 min, updates every ≤ 30 min, target recovery ≤ 2 h; S2: ack ≤ 30 min, updates ≤ 60 min; S3: next business day. Channels: phone 24×7, chat bridge, email."
5.4 Refunds (credits)
- `Availability_ext` < 99.95% → credit of 10% of the monthly fee
- < 99.9% → 25%
- < 99.5% → 50%
Credits do not exclude other remedies for damages in cases of gross negligence.
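The tiered credit scale can be encoded as a simple lookup; the thresholds below are the illustrative ones from the fragment:

```python
def credit_percent(availability_pct: float) -> int:
    """Map measured monthly availability to a credit on the monthly fee,
    per the tiered scale above (illustrative thresholds)."""
    if availability_pct < 99.5:
        return 50
    if availability_pct < 99.9:
        return 25
    if availability_pct < 99.95:
        return 10
    return 0

print(credit_percent(99.93))  # falls in the 99.9-99.95 band -> 10
```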
5.5 Deprecation and compatibility
"At least 180 days' notice for backward-incompatible changes. Concurrent support of vN and vN+1 for at least 90 days."
5.6 Exit
"Within 30 days after termination, the Provider delivers a full export of data in Parquet/JSON formats free of charge; additional migration services at tariff X. Destruction of remaining copies is confirmed by a written certificate."
6) OLA: internal support for external SLA
Example OLA between the Platform team and the Payment team:
- Targets: gateway p99 ≤ 200 ms, error rate ≤ 0.3%; DR: RPO 0, RTO 30 min.
- Responsibility: SRE on-call 24×7; shared dashboards and alerts.
- Processes: chaos smoke tests in releases, perf smoke tests on PRs, load-shedding heuristics.
- Gates: deploy blocked when an SLO/chaos test fails; mandatory runbook updates.
7) Monitoring and evidence
Synthetics: external probes (HTTP/TCP), user path, "slow success."
RUM: real user monitoring to confirm impact.
Correlation: `provider`, `region`, `api_method`, `incident_id` labels.
Artifacts: screenshots/traces/logs, KPI exports, escalation timeline.
```rego
package policy.sla

deny["Release blocked: provider SLO risk"] {
    p := input.release.affects_providers[_]
    input.slo.forecast[p].breach == true
}
```
8) Incidents and Interactions
Playbook:
1. SEV classification, war-room opened, Incident Commander (IC) assigned.
2. Provider notified via the "hot channel," artifacts handed over.
3. Bypass modes/feature flags (stale cache, load shedding, rate caps).
4. Shared timeline, recovery.
5. Postmortem + actions: updated config limits, keys, backup routes.
6. SLA credits claimed and recorded in billing.
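Step 3 of the playbook (bypass modes) can be sketched as a stale-cache fallback; the `fetch` callable and the staleness window are assumptions for illustration:

```python
import time

class StaleCacheFallback:
    """Serve the last known value when the provider fails, as long as
    it is not older than `max_stale_s`. `fetch` is caller-supplied."""

    def __init__(self, fetch, max_stale_s: float = 300.0):
        self.fetch = fetch
        self.max_stale_s = max_stale_s
        self._value = None
        self._ts = 0.0

    def get(self):
        try:
            self._value = self.fetch()       # happy path: refresh cache
            self._ts = time.monotonic()
            return self._value, "fresh"
        except Exception:
            age = time.monotonic() - self._ts
            if self._value is not None and age <= self.max_stale_s:
                return self._value, "stale"  # degraded mode
            raise                            # nothing usable cached
```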
9) Security and DPA
DPA/privacy: controller/processor roles, data categories, legal basis, purposes and retention periods of processing, sub-processors and their SLAs.
Encryption: TLS 1.2+ with PFS in transit; encryption at rest; key management (KMS/HSM) and rotation.
Audit: access logs, breach notifications ≤ 72 hours, pentest reports on request.
Localization: storage region, no data export without consent.
10) Supply Chain and interoperability
SBOM/vulnerabilities: CVSS-threshold policy and fix times (critical ≤ 7 days, high ≤ 14).
API compatibility: contract tests, sandboxes and stable fixtures.
Provider changes: early release notes, previews/beta windows, backward compatibility.
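The fix-time policy above can be checked mechanically; the record shape, IDs, and dates here are illustrative:

```python
from datetime import date

# Illustrative policy: fix windows per severity (critical <= 7 days, high <= 14)
FIX_WINDOW_DAYS = {"critical": 7, "high": 14}

def overdue(findings, today: date):
    """Return ids of findings whose fix window has elapsed."""
    out = []
    for f in findings:
        window = FIX_WINDOW_DAYS.get(f["severity"])
        if window is not None and (today - f["opened"]).days > window:
            out.append(f["id"])
    return out

findings = [
    {"id": "CVE-A", "severity": "critical", "opened": date(2025, 10, 1)},
    {"id": "CVE-B", "severity": "high", "opened": date(2025, 10, 10)},
]
print(overdue(findings, date(2025, 10, 12)))  # CVE-A is 11 days old -> ['CVE-A']
```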
11) Multi-provider and failover
Active/Active: Harder and more expensive, but higher availability (consider consistency).
Active/Passive: cold/warm standby, DR; regular drills.
Abstractions/adapters: single contract, health/cost/carbon routing (if relevant).
License/commercial terms: portability, data-export restrictions, egress cost.
12) Exit plan and periodic rehearsals
Data/schema catalog and volumes.
SDK/API portability scenario (at minimum, a second source).
Dry exit test: export/import, restore, check invariants.
Legal retention/disposal periods after release.
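A dry exit test can compare an order-independent digest of the exported and restored datasets to check invariants; record shapes are illustrative:

```python
import hashlib
import json

def export_digest(records):
    """(row count, order-independent sha256) of a dataset, for comparing
    a source export against a restored import in a dry exit test."""
    h = hashlib.sha256()
    for line in sorted(json.dumps(r, sort_keys=True) for r in records):
        h.update(line.encode())
    return len(records), h.hexdigest()

src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
restored = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]  # same data, new order
assert export_digest(src) == export_digest(restored)   # invariants hold
```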
13) Contract tests and conformance
API samples: positive/negative cases, limits, errors, and retries.
Delivery of events/webhooks: signatures, timestamps, deduplication, retries.
Perf baselines: p99, bandwidth; regression tests on release notes of the provider.
Cross-region: the degradation of one region should not violate SLO globally.
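A perf-baseline regression check might look like this; the 10% tolerance is an assumed policy, not from the text:

```python
def p99(samples):
    """99th-percentile latency (nearest-rank, good enough for a gate)."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def regression(baseline_ms: float, current, tolerance: float = 0.10) -> bool:
    """True if current p99 exceeds the recorded baseline by > tolerance."""
    return p99(current) > baseline_ms * (1 + tolerance)

latencies = [100] * 99 + [200]  # one slow outlier
print(p99(latencies))  # 200
```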
14) Anti-patterns
SLA measured only from the provider's status page, without external measurements.
Same goals for all regions/endpoints.
Lack of audit rights and detailed incident logs.
No OLA/UC → there is no one to fulfill external obligations inside.
Undefined exit plan → hostage to the supplier.
Remedies limited to credits, with no right to terminate for systematic violations.
Deprecations without a transition window.
15) Architect checklist
1. Defined SLI/SLO for key flow and regions?
2. Selected external monitoring method and evidence base?
3. Are incidents, escalations, planned-work windows, and exception caps spelled out in the SLA?
4. Is there a credit/penalty scale and a right to terminate after N violations?
5. DPA/security: encryption, logs, notifications, sub-processors, localization?
6. Contract tests and sandboxes in the pipeline?
7. Internal OLAs/UCs enable external SLOs?
8. DR: RPO/RTO declared, training conducted, reports available?
9. Exit plan: export formats, timing, dry exit practice?
10. Are gates in CI/CD blocking releases that increase the risk of SLA violation?
16) Mini-examples (sketches)
16.1 Deploy-gate policy on provider risk
```yaml
gate: provider-slo-risk
checks:
  - name: forecasted-slo-breach
    input: slo_forecast/providers.json
    deny_if: any(.providers[].breach == true)
action_on_deny: "block-release"
```
16.2 Exporting incident evidence
```bash
curl -s "https://probe.example.com/export?from=2025-10-01&to=2025-10-31" \
  | jq -c '{region, endpoint, status, latency_ms, trace_id, ts}' > evidence.jsonl
```
16.3 Contract webhook test (pseudocode)
```python
evt = sign(make_event(id=uuid4(), ts=now()))
res = post(provider_url, evt)
assert res.status in (200, 202)
assert replay(provider_url, evt).status == 200  # idempotency
```
Conclusion
SLA/OLA is not only "legal paper," but an architectural mechanism for managing risk and quality. The right metrics and windows, external monitoring, clear incident and reimbursement procedures, internal OLAs/UCs, pipeline gates, a multi-provider strategy, and a real exit plan turn provider dependency into a controlled, measurable, and economically predictable part of your platform.