SLA/OLA with providers
1) Terms and boundaries
SLI - measurable indicator (availability, p99 latency, successfully processed webhooks, RPO/RTO).
SLO - target SLI value per measurement window (for example, 99.9% over 30 days).
SLA - legally binding document (SLO + procedures + reimbursement).
OLA - internal goals and processes that ensure compliance with SLAs.
UC (Underpinning Contract) - the contract with third parties that "underpins" the SLA (network carriers, data centers, CDN, etc.).
Boundaries: clearly separate the provider's area of responsibility (cloud/WAF/CDN/payment gateway/KYC provider) from your area (code, config, client settings).
2) Criticality matrix and model selection
Segment providers by business impact: the matrix determines the depth of the SLA, the scope of checks, and the requirements for OLAs/UCs.
3) Metrics and measurement windows
Availability: the percentage of time the service handles requests within agreed tolerances.
Latency: p95/p99 for key operations; a "slow success" still counts against the target.
Data reliability: RPO (maximum allowable data loss) and RTO (recovery time).
Bandwidth/limits: guaranteed quotas (RPS/MBps).
Quality of integrations: share of webhooks delivered within ≤ X minutes, share of 2xx responses, retries and deduplication.
Measurement window: calendar month or rolling 30 days; exceptions (planned maintenance) with explicit caps.
- `Availability_ext = 1 − (Downtime_confirmed_outages / Total_minutes_in_window)`
- Where an outage is unavailability confirmed by external monitoring, not merely by the provider's status page.
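A minimal sketch of this calculation (function and variable names are illustrative, not from the text):

```python
def availability(downtime_minutes: float, window_minutes: float = 30 * 24 * 60) -> float:
    """External availability over a window, per the formula above:
    1 - (confirmed downtime / total minutes in window)."""
    return 1.0 - downtime_minutes / window_minutes

# 21.6 minutes of confirmed downtime in a 30-day window -> 99.95%
print(round(availability(21.6) * 100, 2))  # 99.95
```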
4) SLA content (section template)
1. Subject and scope (services, regions, API versions).
2. Definitions (SLI/SLO, "incident," "planned work," "force majeure").
3. Service objectives (SLOs) by request category and region.
4. Monitoring and evidence base: in what way, whose sensors, with what frequency.
5. Incidents and escalations: channels, response/update times, roles.
6. Refunds: credits/fines/bonuses, thresholds, formulas.
7. Security and privacy: DPA, encryption, logs, violation notifications.
8. Service changes: deprecations, notification window, compatibility.
9. Continuity and DR: RPO/RTO, recovery tests.
10. Audit and compliance: the right to audit, reporting, certification.
11. Exit Plan: data export, dates, format, migration assistance.
12. Legal provisions: jurisdiction, force majeure, confidentiality, validity period.
5) Examples of wording (fragments)
5.1 Availability and measurement
"The Provider guarantees 99.95% availability in each calendar month. Availability is measured by the Customer's external synthetic monitoring from ≥3 regions at intervals of ≤1 minute. Unavailability recorded simultaneously in ≥2 regions is classified as a SEV2 incident and counted toward Downtime."
5.2 Key API latency
"p99 response time of `POST /payments/authorize` ≤ 450 ms on 95% of the days of the month. A root-cause analysis report is provided for the share of requests exceeding the threshold."
5.3 Incidents and escalations
"S1: ack ≤ 15 min, updates every ≤ 30 min, target recovery ≤ 2 h; S2: ack ≤ 30 min, updates ≤ 60 min; S3: next business day. Channels: phone 24×7, chat bridge, email."
5.4 Refunds (credits)
- `Availability_ext` < 99.95% → credit of 10% of the monthly fee
- < 99.9% → 25%
- < 99.5% → 50%
Credits do not exclude other remedies for damages in cases of gross negligence.
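The tiered credit scale can be encoded as a simple lookup; the thresholds below are the illustrative ones from the fragment:

```python
def credit_percent(availability_pct: float) -> int:
    """Map measured monthly availability to a credit on the monthly fee,
    per the tiered scale above (illustrative thresholds)."""
    if availability_pct < 99.5:
        return 50
    if availability_pct < 99.9:
        return 25
    if availability_pct < 99.95:
        return 10
    return 0

print(credit_percent(99.93))  # falls in the 99.9-99.95 band -> 10
```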
5.5 Deprecation and compatibility
"At least 180 days' notice for backward-incompatible changes. Concurrent support of vN and vN+1 for at least 90 days."
5.6 Exit
"Within 30 days after termination, the Provider delivers a full export of data in Parquet/JSON formats free of charge; additional migration services at tariff X. Destruction of remaining copies is confirmed by a written certificate."
6) OLA: internal support for external SLA
Example OLA between the Platform team and the Payment team:
- Targets: gateway p99 ≤ 200 ms, error rate ≤ 0.3%; DR: RPO 0, RTO 30 min.
- Responsibility: SRE on-call 24×7; shared dashboards and alerts.
- Processes: chaos smoke tests in releases, perf smoke tests on PRs, load-shedding heuristics.
- Gates: deploy blocked when an SLO/chaos test fails; mandatory runbook updates.
7) Monitoring and evidence
Synthetics: external probes (HTTP/TCP), user path, "slow success."
RUM: real user monitoring to confirm impact.
Correlation: `provider`, `region`, `api_method`, `incident_id` labels.
Artifacts: screenshots/traces/logs, KPI exports, escalation timeline.
```rego
package policy.sla

deny["Release blocked: provider SLO risk"] {
    p := input.release.affects_providers[_]
    input.slo.forecast[p].breach == true
}
```
8) Incidents and Interactions
Playbook:
1. SEV classification, war-room opened, Incident Commander (IC) assigned.
2. Provider notified via the "hot channel," artifacts handed over.
3. Bypass modes/feature flags (stale cache, load shedding, rate caps).
4. Shared timeline, recovery.
5. Postmortem + actions: updated config limits, keys, backup routes.
6. SLA credits claimed and recorded in billing.
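Step 3 of the playbook (bypass modes) can be sketched as a stale-cache fallback; the `fetch` callable and the staleness window are assumptions for illustration:

```python
import time

class StaleCacheFallback:
    """Serve the last known value when the provider fails, as long as
    it is not older than `max_stale_s`. `fetch` is caller-supplied."""

    def __init__(self, fetch, max_stale_s: float = 300.0):
        self.fetch = fetch
        self.max_stale_s = max_stale_s
        self._value = None
        self._ts = 0.0

    def get(self):
        try:
            self._value = self.fetch()       # happy path: refresh cache
            self._ts = time.monotonic()
            return self._value, "fresh"
        except Exception:
            age = time.monotonic() - self._ts
            if self._value is not None and age <= self.max_stale_s:
                return self._value, "stale"  # degraded mode
            raise                            # nothing usable cached
```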
9) Security and DPA
DPA/privacy: controller/processor roles, data categories, legal basis, purposes and retention periods of processing, sub-processors and their SLAs.
Encryption: TLS 1.2+ with PFS in transit; encryption at rest; key management (KMS/HSM) and rotation.
Audit: access logs, breach notifications ≤ 72 hours, pentest reports on request.
Localization: storage region, no data export without consent.
10) Supply Chain and interoperability
SBOM/vulnerabilities: CVSS-threshold policy and fix times (critical ≤ 7 days, high ≤ 14).
API compatibility: contract tests, sandboxes and stable fixtures.
Provider changes: early release notes, previews/beta windows, backward compatibility.
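The fix-time policy above can be checked mechanically; the record shape, IDs, and dates here are illustrative:

```python
from datetime import date

# Illustrative policy: fix windows per severity (critical <= 7 days, high <= 14)
FIX_WINDOW_DAYS = {"critical": 7, "high": 14}

def overdue(findings, today: date):
    """Return ids of findings whose fix window has elapsed."""
    out = []
    for f in findings:
        window = FIX_WINDOW_DAYS.get(f["severity"])
        if window is not None and (today - f["opened"]).days > window:
            out.append(f["id"])
    return out

findings = [
    {"id": "CVE-A", "severity": "critical", "opened": date(2025, 10, 1)},
    {"id": "CVE-B", "severity": "high", "opened": date(2025, 10, 10)},
]
print(overdue(findings, date(2025, 10, 12)))  # CVE-A is 11 days old -> ['CVE-A']
```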
11) Multi-provider and failover
Active/Active: Harder and more expensive, but higher availability (consider consistency).
Active/Passive: cold/warm standby, DR; regular drills.
Abstractions/adapters: single contract, health/cost/carbon routing (if relevant).
License/commercial terms: portability, data-export restrictions, egress cost.
12) Exit plan and periodic rehearsals
Data/schema catalog and volumes.
SDK/API portability scenario (at minimum, a second source).
Dry exit test: export/import, restore, check invariants.
Legal retention/disposal periods after release.
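A dry exit test can compare an order-independent digest of the exported and restored datasets to check invariants; record shapes are illustrative:

```python
import hashlib
import json

def export_digest(records):
    """(row count, order-independent sha256) of a dataset, for comparing
    a source export against a restored import in a dry exit test."""
    h = hashlib.sha256()
    for line in sorted(json.dumps(r, sort_keys=True) for r in records):
        h.update(line.encode())
    return len(records), h.hexdigest()

src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
restored = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]  # same data, new order
assert export_digest(src) == export_digest(restored)   # invariants hold
```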
13) Contract tests and conformance
API samples: positive/negative cases, limits, errors, and retries.
Delivery of events/webhooks: signatures, timestamps, deduplication, retries.
Perf baselines: p99, bandwidth; regression tests on release notes of the provider.
Cross-region: the degradation of one region should not violate SLO globally.
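A perf-baseline regression check might look like this; the 10% tolerance is an assumed policy, not from the text:

```python
def p99(samples):
    """99th-percentile latency (nearest-rank, good enough for a gate)."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def regression(baseline_ms: float, current, tolerance: float = 0.10) -> bool:
    """True if current p99 exceeds the recorded baseline by > tolerance."""
    return p99(current) > baseline_ms * (1 + tolerance)

latencies = [100] * 99 + [200]  # one slow outlier
print(p99(latencies))  # 200
```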
14) Anti-patterns
SLA measured only from the provider's status page, without external measurements.
Same goals for all regions/endpoints.
Lack of audit rights and detailed incident logs.
No OLA/UC → there is no one to fulfill external obligations inside.
Undefined exit plan → hostage to the supplier.
Remedies limited to credits, with no right to terminate for systematic violations.
Deprecations without a transition window.
15) Architect checklist
1. Defined SLI/SLO for key flow and regions?
2. Selected external monitoring method and evidence base?
3. Are incidents, escalations, planned-work windows, and exception caps spelled out in the SLA?
4. Is there a credit/penalty scale and a right to terminate after N violations?
5. DPA/security: encryption, logs, notifications, sub-processors, localization?
6. Contract tests and sandboxes in the pipeline?
7. Internal OLAs/UCs enable external SLOs?
8. DR: RPO/RTO declared, training conducted, reports available?
9. Exit plan: export formats, timing, dry exit practice?
10. Are gates in CI/CD blocking releases that increase the risk of SLA violation?
16) Mini-examples (sketches)
16.1 Deploy-gate policy on provider risk
```yaml
gate: provider-slo-risk
checks:
  - name: forecasted-slo-breach
    input: slo_forecast/providers.json
    deny_if: any(.providers[].breach == true)
action_on_deny: "block-release"
```
16.2 Exporting incident evidence
```bash
curl -s "https://probe.example.com/export?from=2025-10-01&to=2025-10-31" \
  | jq -c '{region, endpoint, status, latency_ms, trace_id, ts}' > evidence.jsonl
```
16.3 Contract webhook test (pseudocode)
```python
evt = sign(make_event(id=uuid4(), ts=now()))
res = post(provider_url, evt)
assert res.status in (200, 202)
assert replay(provider_url, evt).status == 200  # idempotency
```
Conclusion
SLA/OLA is not only "legal paper," but an architectural mechanism for managing risk and quality. The right metrics and windows, external monitoring, clear incident and reimbursement procedures, internal OLAs/UCs, pipeline gates, a multi-provider strategy, and a real exit plan turn provider dependency into a controlled, measurable, and economically predictable part of your platform.