Integrations with external tools
1) Why you need it
Almost every product platform relies on an external ecosystem: payment providers, KYC/AML, anti-fraud, email/SMS/push, analytics, game studio providers, BI, CDP, task managers, marketing tools. Well-designed integrations increase conversion and uptime; poorly designed ones multiply cascading failures, surprise bills and SLA penalties.
Objectives:
- Connect providers quickly and safely.
- Meet business SLOs (deposit, bet, withdrawal, game launch).
- Manage quotas/limits and costs.
- Reduce blast radius and MTTR.
2) Integration taxonomy
Synchronous APIs (REST/gRPC/GraphQL): instant responses, but tight coupling to the provider's latency and availability.
Asynchronous (webhook/event/queue): delivery of events and confirmations, with looser temporal coupling.
SDK/client libraries: fast to adopt, but at the risk of hidden dependencies and "magic."
Batch/ETL/SFTP/file exchange: reports, reconciliation, nightly uploads.
iFrame/Redirect/Hosted page: fast, but with less control over UX and security.
Hybrid: synchronous call + asynchronous confirmation (common for payments/KYC).
3) Governance model
Integration directory: owner, contacts, on-call, contracts (OpenAPI/AsyncAPI), versions, environments, keys/secrets, quotas and pricing.
SLO/OLA agreements: what we guarantee to the user and what the provider promises; an explicit SLO ↔ OLA/SLA mapping.
Release gates: consumer-driven contracts (CDC), compatibility tests, canary rollouts, feature flags.
Data policies: PII, financial data, GDPR/CCPA, storage regions, DPA with vendors.
4) Security and secrets
Secret storage: KMS/Secrets Manager, rotation, principle of least privilege, access through role-based accounts.
Signature and verification: HMAC/JWS for webhooks, mutual TLS for server-to-server calls.
IP allowlist/mTLS/WAF: protect inbound and outbound connections.
Token scope: narrowly scoped API keys, separate keys per environment.
Audit trail: all outgoing calls and config changes go to the audit log.
5) Quotas, rate limits and reliability
Explicit per-provider rate limit: to avoid hitting 429s or getting banned.
Bulkhead isolation: dedicated thread/connection pools for each provider.
Timeouts: set explicitly on every outbound call and sized to the latency budget.
Retries with backoff + jitter: only for idempotent operations and retryable status codes.
Circuit breaker: fail fast and switch to a fallback on degradation.
Queue + Outbox: guaranteed delivery and redelivery for critical operations.
providers:
  psp_x:
    timeout_ms: 200
    rate_limit_rps: 1500
    retries: 2
    retry_on: [5xx, connect_error]
    backoff: exponential
    jitter: true
    circuit_breaker:
      error_rate_threshold: 0.05
      window_s: 10
      open_s: 30
    pool:
      name: dedicated-psp-x
      max_conns: 300
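The retry policy in the config above (`retries: 2`, exponential backoff, jitter) could look roughly like the following Python sketch. The names `RetryableError` and `call_with_retries`, the delay values, and the "full jitter" variant are illustrative assumptions, not part of any provider SDK; the caller is expected to raise `RetryableError` only for 5xx/connection errors on idempotent operations.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (5xx, connection errors)."""


def call_with_retries(operation, retries=2, base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`; on RetryableError retry with exponential backoff and full jitter."""
    attempt = 0
    while True:
        try:
            return operation()
        except RetryableError:
            if attempt >= retries:
                raise  # retry budget exhausted: surface the error to the caller
            # Exponential ceiling: base * 2^attempt, bounded by max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)
            attempt += 1
```

The `sleep` and `rng` parameters are injectable only to make the sketch testable without real waiting.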
6) Contracts, versions and compatibility
OpenAPI/AsyncAPI + SemVer: additions are backward-compatible; removals go through a deprecation period.
CDC tests: the consumer pins its expectations; a provider release is blocked if it is incompatible.
Schema Registry (events): schema evolution (Avro/JSON-Schema); can-read-old/can-write-new policy.
Change control: changelog, migration guides, sunset date for the old version.
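To illustrate the idea behind a consumer-driven contract check, here is a deliberately minimal Python sketch: the consumer declares the fields and types it relies on, and a provider response sample is checked against them. The function name `check_contract` and the field names are hypothetical; real CDC tooling does considerably more (request matching, provider states, versioned pacts).

```python
def check_contract(expected_fields, response):
    """Return a list of violations of the consumer's expectations.

    expected_fields maps field name -> required Python type.
    Extra fields in the response are allowed, because additions
    are treated as backward-compatible.
    """
    violations = []
    for name, expected_type in expected_fields.items():
        if name not in response:
            violations.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            violations.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(response[name]).__name__}"
            )
    return violations
```

A release gate would fail the provider build whenever the returned list is non-empty.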
7) Environments and sandboxes
Vendor-provided Sandbox/Stage/Prod environments are required.
Test data: PII-like generators, test cards/documents, test wallets.
Contract & integration tests: run against stage with realistic limits.
Golden-path & chaos-path: happy-path and negative scenarios (timeouts/4xx/5xx/webhook retries).
8) Observability and dashboards
Per-integration metrics: `outbound_rps`, `p95/p99`, `error_rate`, `retry_rate`, `circuit_open`, `cost_per_1k_calls`.
Webhook health: delivery delay, redelivery percentage, signature/validation failures.
Release/feature-flag events: annotations on graphs.
Dependency map: who calls which provider and where the bottlenecks are.
9) Incidents and escalations
Alert correlation: if the provider is down, page the integration owner, not every consumer.
Automatic degradation: "minimum mode" feature flags (lightweight content, simplified KYC flow, queued processing).
Failover/multi-vendor: PSP-X ⇄ PSP-Y, KYC-A ⇄ KYC-B; manual and automatic switching.
Runbook: how to confirm an incident with the vendor, raise quotas, enable an alternative route, roll back.
- Diagnostics: integration dashboard, vendor status page, our logs with `trace_id`.
- Action: lower the RPS, open the breaker, enable failover, flip the feature flag.
- Communications: incident channel, update template for business/support.
- Rollback/verification: p95 and error rate are back to normal, the queue is drained, spend is within limits.
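A priority-ordered failover between two providers with a manual switch might be sketched as follows. All names here (`ProviderUnavailable`, `charge_with_failover`, `is_enabled`) are hypothetical. The sketch assumes a provider client raises `ProviderUnavailable` only when the call definitely did not execute (for example, breaker open or connection refused); for payments, an ambiguous outcome such as a timeout should not be blindly replayed on a second provider.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error: the provider could not accept the call at all."""


def charge_with_failover(providers, request, is_enabled=lambda name: True):
    """Try providers in priority order; return (provider_name, result) from the first success.

    providers: ordered list of (name, callable) pairs, primary first.
    is_enabled: manual switch (feature flag) that takes a provider out of rotation.
    """
    errors = {}
    for name, call in providers:
        if not is_enabled(name):
            errors[name] = "disabled by flag"
            continue
        try:
            return name, call(request)
        except ProviderUnavailable as exc:
            errors[name] = str(exc)  # remember why, then try the next route
    raise ProviderUnavailable(f"all providers failed: {errors}")
```

Returning the provider name alongside the result keeps the route visible for logs, metrics and reconciliation.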
10) Cost management
CPM/CPA/CPC/per-call model: track `cost_per_1k_calls` and the cost per successful action.
Quotas and soft caps: protective thresholds and warnings.
Caching and dedup: cut unnecessary calls (idempotency keys).
Reports and reconciliation: daily reconciliation of vendor invoices against our logs.
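The cost metrics and soft-cap check above reduce to simple arithmetic. The sketch below is illustrative only: the function names and the 80% soft-cap ratio are assumptions, not values from a vendor contract.

```python
def cost_per_1k_calls(total_cost_usd, calls):
    """Cost normalized to 1000 outbound calls; 0.0 when there were no calls."""
    return 1000.0 * total_cost_usd / calls if calls else 0.0


def cost_per_success(total_cost_usd, successes):
    """Cost per successful business action; infinite when nothing succeeded."""
    return total_cost_usd / successes if successes else float("inf")


def quota_status(used, quota, soft_cap_ratio=0.8):
    """Classify quota usage: 'ok', 'soft_cap' (warn) or 'hard_cap' (stop)."""
    if used >= quota:
        return "hard_cap"
    if used >= soft_cap_ratio * quota:
        return "soft_cap"
    return "ok"
```

Tracking both numbers matters because a provider with cheap calls but a low success rate can still be the more expensive one per completed action.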
11) Working with webhooks
Delivery: at-least-once, redelivery with exponential delay, dedup by `event_id`.
Security: signature (HMAC/JWS), timestamp, mTLS/allowlist.
Reliability: return 2xx only after the event is written to the outbox/transaction; otherwise the provider will retry.
Idempotency: handlers are idempotent and keep a store of "seen events."
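The webhook rules above (verify signature and timestamp, persist before acknowledging, dedup by `event_id`) can be combined in a small handler. This is a sketch under assumptions: the signing scheme (HMAC-SHA256 over `timestamp.body`, hex-encoded), the 300-second tolerance, and the in-memory `seen`/`outbox` stand-ins are all hypothetical; each provider documents its own scheme, and production code would use durable storage.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """HMAC-SHA256 over 'timestamp.body', hex-encoded (assumed scheme)."""
    return hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, body, signature, now_s, tolerance_s=300):
    """Accept only fresh, correctly signed payloads."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now_s - sent_at) > tolerance_s:
        return False  # stale or future-dated timestamp: possible replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)  # constant-time comparison


class WebhookHandler:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen = set()   # stand-in for a durable "seen events" table
        self.outbox = []    # stand-in for a transactional outbox

    def handle(self, event_id, timestamp, body, signature, now_s):
        """Process one delivery and return the HTTP status code to send back."""
        if not verify_webhook(self.secret, timestamp, body, signature, now_s):
            return 401
        if event_id in self.seen:
            return 200  # duplicate delivery: acknowledge without reprocessing
        self.outbox.append((event_id, body))  # persist first...
        self.seen.add(event_id)
        return 200  # ...then acknowledge
```

In a real service the outbox write and the seen-event insert would share one database transaction, so a crash between them cannot lose or double-process an event.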
12) Data, privacy and compliance
Data minimization: request only what you need.
PII/financial data: masking in logs, tokenization, encryption.
Data residency: where data is stored and processed (regions).
DPA/SCC: data processing agreements, sub-processors.
Right to delete/export: API/processes on the vendor side.
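As a rough illustration of masking in logs, the sketch below redacts email addresses and card-like digit runs before a message is written. The regular expressions and the name `mask_pii` are simplified assumptions; they will miss some formats and are no substitute for tokenization or a vetted redaction library.

```python
import re

# Keep the first character of the local part and the domain; hide the rest.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
# 16 to 19 consecutive digits: keep only the last four.
PAN_RE = re.compile(r"\b\d{12,15}(\d{4})\b")


def mask_pii(message: str) -> str:
    """Return `message` with emails and card-like numbers masked."""
    message = EMAIL_RE.sub(r"\1***@\2", message)
    message = PAN_RE.sub(r"****\1", message)
    return message
```

For example, `mask_pii("deposit failed for alice@example.com card 4111111111111111")` returns `"deposit failed for a***@example.com card ****1111"`.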
13) Anti-patterns
Retrying on timeouts against an already saturated bottleneck → a retry storm.
A shared connection pool for all vendors → head-of-line blocking.
No webhook signature/validation → fraud and forged events.
Secrets in environment variables without rotation or explicit access rights.
No CDC or contract versioning → mass failures on vendor updates.
Tight coupling to an SDK without observability → a black box.
14) Implementation checklist
- Integration card in the directory: owner, SLA/OLA, pricing, contacts, keys, schemas.
- OpenAPI/AsyncAPI + CDC; tests against stage, canary rollout.
- Timeouts, retries (idempotency!), circuit breaker, bulkhead, rate limit.
- Secrets: KMS/SM, rotation, separate keys per environment.
- Webhook: signature, dedup, redelivery, outbox.
- Dashboard and alerts per-integration; release annotations.
- Failover plan (second provider/manual switch), runbook and contacts.
- Cost reporting and reconciliation.
- DPA/compliance, data policy, audit logs.
- Game days/chaos exercises for key vendors.
15) Integration Quality KPIs
Success rate for critical operations (deposit/bet/withdrawal).
p95/p99 latency of outgoing calls.
Retry storm count/month (target → 0).
MTTD/MTTR on provider incidents.
Cost per 1k calls/successful action.
CDC pass rate and the proportion of releases without integration incidents.
Webhook latency and redelivery rate.
16) Quick defaults
Timeout = 70-80% of the latency budget for that hop; the top-level request timeout is shorter than the sum of the internal timeouts.
Retries ≤ 2, only on 5xx/network errors, with backoff + jitter.
Circuit breaker: trip at > 5% errors over 10s, stay open for 30s, then probe in half-open state.
Per-provider rate limit and a separate connection pool.
Webhook: acknowledge only after persisting, dedup by `event_id`.
Feature flag for a quick switch to "minimum mode."
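The circuit breaker defaults above (more than 5% errors over a 10s window, 30s open, then a half-open probe) can be sketched as a small state machine. This is one possible shape, not a reference implementation: `min_calls` is an added assumption to avoid tripping on tiny samples, only one probe is allowed at a time, and the clock is injectable for testing.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, error_rate_threshold=0.05, window_s=10.0, open_s=30.0,
                 min_calls=20, clock=time.monotonic):
        self.error_rate_threshold = error_rate_threshold
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: ignore windows with too few calls
        self.clock = clock
        self.results = deque()      # (timestamp, succeeded) pairs inside the window
        self.opened_at = None       # None means the breaker is closed
        self.probing = False        # True while a half-open probe is in flight

    def allow(self):
        """Return True if a call may go out right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_s and not self.probing:
            self.probing = True     # half-open: let a single probe through
            return True
        return False                # open: fail fast

    def record(self, succeeded):
        """Record a call outcome and update the breaker state."""
        now = self.clock()
        if self.probing:
            self.probing = False
            if succeeded:
                self.opened_at = None   # probe succeeded: close the breaker
                self.results.clear()
            else:
                self.opened_at = now    # probe failed: stay open for another open_s
            return
        if self.opened_at is not None:
            return                      # late result from before the trip: ignore
        self.results.append((now, succeeded))
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()      # drop outcomes older than the window
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) > self.error_rate_threshold):
            self.opened_at = now        # error rate above threshold: open
```

Callers check `allow()` before each outbound call, fall back when it returns False, and report every outcome through `record()`.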
17) Example alerts (ideas)
ALERT ProviderErrorRateHigh
  IF outbound_error_rate{provider="psp_x"} > 0.05 FOR 5m
  LABELS {severity="critical", team="payments"}
ALERT ProviderLatencySLO
  IF outbound_p99_latency_ms{provider="kyc_a"} > 300 FOR 10m
  LABELS {severity="warning", team="risk"}
ALERT WebhookDeliveryDelayed
  IF webhook_delivery_p95_s{provider="studio_y"} > 20 FOR 15m
  LABELS {severity="warning", team="games"}
ALERT ProviderCostSpike
  IF rate(provider_cost_usd_total[15m]) > 2 * baseline_1w
  LABELS {severity="info", team="finops"}
18) FAQ
Q: How do we tell a temporary provider failure from our own problems?
A: Look for symmetry: errors rise for all consumers of the provider, the breaker opens, and there are no internal errors or regressions. Traces and logs tagged with `peer.service` help.
Q: Do you always need a second provider?
A: For critical paths, yes (PSP/KYC). For less critical ones, degradation and caches are enough.
Q: Vendor SDK or our own client?
A: An SDK speeds up the start, but insist on observability, configurable timeouts/retries and pinned versions. Otherwise, write your own client over HTTP/gRPC.