Multi-cloud strategy and migrations

1) Why multi-cloud and when is it justified

Objectives: business continuity (provider redundancy), data/jurisdiction sovereignty, cost/discount optimization, access to best-in-class managed services (ML/anti-fraud/analytics).
Trade-offs: more complex operations, duplicated competencies, network egress costs.
Key point: decide in advance where portability is required and where vendor lock-in is acceptable in exchange for speed/price.

2) Target architectural models

Portable Core: the critical core (API, domain services, data) is portable (K8s, Postgres, Kafka, Redis, MinIO/Vault); the periphery uses provider-native managed services.
Active-Active Multi-cloud: two clouds serve traffic at the same time (difficult: data conflicts, global routing).
Active-Passive (Hot/Warm): one cloud is primary, the second is a hot/warm DR site.
Hybrid: part on-prem, part in the clouds (often because of legal/PII restrictions).

3) Portability patterns

Kubernetes as the base platform (variants: EKS/GKE/AKS/on-prem K8s).
Service Mesh (mTLS, traffic shifting, locality/failover; Istio/Linkerd).
IaC: Terraform + modular abstractions; for K8s: Helm/Kustomize + GitOps (Argo/Flux).
Secrets: HashiCorp Vault/External Secrets Operator; abstraction over KMS/HSM.
Data stores: Postgres (operators/Patroni), Kafka (operators/MirrorMaker2), Redis (sentinel/cluster), S3-compatible (MinIO) for API uniformity.
Observability: OpenTelemetry + vendor-neutral backends (Prometheus/Tempo/Loki/ClickHouse).
Authentication: OIDC/OAuth2 (Keycloak/Auth0/Entra/Google), unified federation.
API layer: Envoy/NGINX/Contour + shared policies (CORS, mandatory headers, rate limits).

4) Migration strategies (7R - brief)

Rehost (Lift-and-Shift): fast, no rework; good for stateless/VM workloads, bad for cost.
Replatform: move to K8s/simplify dependencies (less risky than a refactor).
Refactor/Repurchase: rewrite around portable patterns or replace with a SaaS service.
Retain/Retire: keep in place or decommission what does not need to be moved.

Practice: start with a service inventory (criticality, RTO/RPO, SLO, dependencies), then plan migration waves (by domain).
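One way to turn the inventory into waves is to order services by their dependencies, so that a service moves only after everything it depends on has moved. The sketch below is a minimal illustration under that assumption; the service names and the `migration_waves` helper are hypothetical, and grouping by domain or criticality would be layered on top.

```python
from graphlib import TopologicalSorter


def migration_waves(dependencies):
    """Group services into waves; each wave depends only on earlier waves.

    `dependencies` maps a service name to the set of services it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already migrated.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical inventory, not taken from a real system.
inventory = {
    "postgres": set(),
    "kafka": set(),
    "wallet-api": {"postgres", "kafka"},
    "reports": {"wallet-api"},
}

print(migration_waves(inventory))
# [['kafka', 'postgres'], ['wallet-api'], ['reports']]
```

A circular dependency surfaces as an exception, which is usually a signal that those services have to move together in one wave.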

5) Data and consistency

Replication/CDC: Debezium/log streaming for Postgres/MySQL; Kafka MirrorMaker2 for topics.
Bidirectional synchronization: only with strict idempotency and version keys (vector clock/updated_at).
Dual-write with deduplication: records are tagged with 'Idempotency-Key'/'event_id', plus outbox/inbox for guaranteed delivery.
Ownership partitioning: a leader region/cloud per key/tenant to avoid conflicts.
Cache: local/regional; global only via events/TTL (no "shared" global cache with strong consistency).
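The deduplication side of dual-write can be illustrated with an inbox that remembers which 'event_id' values it has already applied. This is a minimal in-memory sketch: the `Inbox` class and the event fields `account` and `amount` are assumptions made for the example. In a real system the inbox would be a database table, and the dedup record and the state change would be committed in one transaction.

```python
class Inbox:
    """In-memory stand-in for an inbox table keyed by event_id."""

    def __init__(self):
        self._seen = set()   # event_ids that were already applied
        self.balances = {}   # example state mutated by events

    def handle(self, event):
        """Apply the event once; return False when it is a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # redelivery or second write path: ignore
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self._seen.add(event_id)
        return True


inbox = Inbox()
deposit = {"event_id": "evt-1", "account": "acc-42", "amount": 100}
inbox.handle(deposit)  # applied
inbox.handle(deposit)  # same event arriving from the other cloud: skipped
```

Because the same event may arrive from both clouds, the consumer stays correct as long as both write paths carry the same 'event_id'.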

6) Global traffic and network

GSLB/DNS: latency/geo-routing + health checks, weights for canaries/failover.
Anycast/Edge/CDN for proximity to the user, then routing to the nearest healthy region/cloud.
Direct channels: Interconnect/ExpressRoute/Direct Connect between clouds/on-prem to reduce egress/latency.
Client policies: short timeouts, retries with exponential backoff + jitter, idempotent write operations.
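The client policy above can be sketched as a retry wrapper that reuses a single idempotency key across attempts. This is an illustration only: `TransientError`, `call_with_retries`, the `send` callable and its `idempotency_key` parameter are hypothetical names, and the delay values are arbitrary defaults rather than recommendations.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delays: each one is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_retries(send, payload, attempts=4, sleep=time.sleep):
    """Retry `send` on TransientError, reusing one idempotency key."""
    # The same key on every attempt lets the server deduplicate repeated writes.
    key = str(uuid.uuid4())
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return send(payload, idempotency_key=key)
        except TransientError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the error
            sleep(delays[attempt])
```

Generating the key once, outside the loop, is the important design choice: a new key per attempt would defeat server-side deduplication.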

7) Security and compliance

mTLS everywhere (mesh + SPIFFE/SPIRE or native PKI).
KMS/HSM: abstract API through Vault; key segmentation per jurisdiction/tenant.
IAM: unified role and group model (SCIM/SSO), least-privilege policies, temporary credentials (STS).
Secrets/rotation: automatic rotation of tokens/passwords; long-lived static keys are prohibited.
Compliance: PCI DSS/GDPR - data residency, isolated audit logs, geo-blocks.

8) Observability, SLO and Error Budgets

RED/USE signals + traces + profiles in all clouds; a single log format (JSON + 'trace_id').
Tail-based trace sampling: keep errors/p99; segment by 'cloud', 'region', 'tenant'.
SLO per cloud/region + total aggregate; alerts by burn-rate (multi-window).
Canary dashboards "before/after migration," regression report.
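A multi-window burn-rate alert can be sketched as follows. The burn rate is the observed error rate divided by the error budget (1 minus the SLO target), and the alert fires only when both a long and a short window exceed the threshold. The function names and the default threshold of 14.4 (a commonly used value for a 1h/5m window pair) are assumptions for the example, not values from this article.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed error fraction, e.g. 0.001 for 99.9%
    return (errors / requests) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn above the threshold.

    Each window is an (errors, requests) pair, e.g. the last 1h and the last 5m.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


# Example per-cloud check with made-up counters.
page = should_page(long_window=(200, 10_000), short_window=(30, 1_000), slo_target=0.999)
```

Requiring the short window as well keeps the alert from paging on an incident that has already stopped burning budget.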

9) CI/CD and config management

GitOps: a single image artifact for all environments; configs are per-environment/region via Helm values/Kustomize overlays.
Secrets through External Secrets Operator (bridges to AWS/GCP/Azure secret stores).
Promotion flow: dev → staging → canary (cloud A) → canary (cloud B) → full.
Release gates: SLO/synthetic/contract-test checks before increasing traffic weight.

10) Cost and FinOps

Consider egress rates between the clouds, RI/CUD/Savings Plans discounts, marketplace bundles.
80/20 rule: make portable only the 20% that carries the greatest business risk; keep the rest wherever it is cheaper/easier.
Metric downsampling, cold storage for logs, limits on traces (budget-aware sampling).
Resource tagging: 'env', 'team', 'service', 'tenant', 'cost_center' - for transparent billing.
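The tagging rule is easy to enforce mechanically, for example in CI before resources are applied. A minimal sketch, assuming resource tags are available as a plain dictionary; the `missing_tags` helper is hypothetical.

```python
REQUIRED_TAGS = ("env", "team", "service", "tenant", "cost_center")


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or empty."""
    return [
        tag for tag in REQUIRED_TAGS
        if not str(resource_tags.get(tag, "")).strip()
    ]


# Example: a resource that would be rejected for untraceable cost.
problems = missing_tags({"env": "prod", "team": "payments", "service": "wallet-api"})
```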

11) Migration plans (playbook)

11.1 Preparation

1. Inventory of services/data/dependencies; target RTO/RPO/SLO.
2. Select model (active-active vs active-passive) and network layer (GSLB/Anycast).
3. Sandbox preparation in the target cloud: K8s cluster, mesh, observability, secrets.

11.2 Run and validation

4. Shadow traffic: mirroring requests without affecting production.
5. Contract tests (OpenAPI/gRPC/CDC) and synthetics along key routes.
6. CDC/replication: hot data synchronization, consistency reconciliation.

11.3 Switching

7. Dual-write (idempotent) to a limited percentage of users/tenants.
8. Phased traffic shifting (1%→10%→50%→100%) with SLO gates.
9. Freeze and move stateful components; perform the final cutover; keep the old environment read-only until the final reconciliation.
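The phased shift with SLO gates can be sketched as a small control loop: raise the weight of the target cloud, evaluate the gate, and roll back if it fails. Both callables are hypothetical. `set_weight` stands for whatever changes GSLB or mesh weights, and `gate_passes` for the SLO/synthetic checks evaluated at that stage.

```python
STAGES = (1, 10, 50, 100)  # percent of traffic on the target cloud


def run_rollout(set_weight, gate_passes, stages=STAGES):
    """Advance through the stages; roll back to 0% when a gate fails.

    Returns the final traffic weight of the target cloud.
    """
    for percent in stages:
        set_weight(percent)
        if not gate_passes(percent):
            set_weight(0)  # send all traffic back to the source cloud
            return 0
    return stages[-1]


# Example with stand-in callables: the gate fails at 50%.
history = []
final = run_rollout(history.append, lambda percent: percent < 50)
# history == [1, 10, 50, 0], final == 0
```

In practice each stage would also hold for a soak period before the gate is evaluated; that is omitted here to keep the sketch short.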

11.4 After migration

10. Review audit logs/logs, archive old snapshots, optimize egress/cache.
11. Update runbooks and train the on-call team.

12) DR and fault tolerance tests

GameDay: disconnecting an entire cloud/region; measurement of actual RTO/RPO.
Chaos injections: packet loss/increased cross-link latency, broker/database failure.
Automatic degradation flags: disabling "expensive" features, switching the cache to 'stale-while-revalidate'.
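The 'stale-while-revalidate' mode can be sketched as a cache that serves an expired value for a limited grace period and marks the key for a later refresh instead of blocking on the backend. This is a simplified single-threaded illustration; the `SWRCache` class, its parameters and the `pending` set are assumptions for the example, and a real implementation would refresh in a background worker.

```python
import time


class SWRCache:
    def __init__(self, ttl, stale_ttl, clock=time.monotonic):
        self.ttl = ttl              # seconds a value counts as fresh
        self.stale_ttl = stale_ttl  # extra seconds a stale value may be served
        self.clock = clock
        self._data = {}             # key -> (value, stored_at)
        self.pending = set()        # keys waiting for a background refresh

    def get(self, key, load):
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.ttl:
                return value
            if age <= self.ttl + self.stale_ttl:
                self.pending.add(key)  # serve stale now, refresh later
                return value
        # Missing or too old: load synchronously.
        value = load(key)
        self._data[key] = (value, now)
        self.pending.discard(key)
        return value

    def refresh(self, key, load):
        """Meant to be called by a background worker for keys in `pending`."""
        self._data[key] = (load(key), self.clock())
        self.pending.discard(key)
```

During a degradation, raising `stale_ttl` is what keeps reads fast while the backend in the other cloud is slow or unavailable.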

13) Antipatterns

"Pure" active-active without data ownership agreements → conflicts/duplicates.
Shared global cache with strong consistency → latency/congestion.
Retries without idempotency → duplicate charges/orders.
Different log/trace formats in the clouds → loss of correlation.
Lack of a single IAM/secret model.
Big-bang ("all at once") migration without waves and gates.

14) Specifics of iGaming/Finance

Jurisdictions and data residency: PII/payment logs stay "within the country/region"; only aggregates/anonymized data cross clouds.
Payment providers: multi-PSP and smart-routing by cloud/region; webhooks - through a global broker with deduplication.
Sanctions/compliance filters: regional profiles; fast failover to an allowed PSP.
SLOs for "money paths" are stricter than the general ones; separate alerts/dashboards per provider/region.
Audit: immutable transaction logs, synchronous writing to two independent storages (WORM/S3 Object Lock).
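Region-aware PSP failover can be sketched as a filter over an ordered preference list: take the first provider that is both allowed in the region and currently healthy. The function, the PSP names and the data shapes below are hypothetical and serve only to illustrate the rule.

```python
def pick_psp(region, preferences, allowed, healthy):
    """Return the first preferred PSP that is allowed in the region and healthy.

    Returns None when no provider qualifies, so the caller can fail closed.
    """
    permitted = allowed.get(region, set())
    for psp in preferences:
        if psp in permitted and psp in healthy:
            return psp
    return None


# Made-up configuration: psp-a is down, so "eu" fails over to psp-b.
preferences = ["psp-a", "psp-b", "psp-c"]
allowed = {"eu": {"psp-a", "psp-b"}, "br": {"psp-c"}}
healthy = {"psp-b", "psp-c"}
choice = pick_psp("eu", preferences, allowed, healthy)
```

Returning None for a region without an allowed provider, rather than falling back to any healthy PSP, keeps the compliance filter from being bypassed during an outage.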

15) Prod Readiness Checklist

Target model selected (portable core/active-active/standby); RTO/RPO/SLO are described.
IaC/GitOps: modular Terraform/Helm/Kustomize; single mesh and security policies.
Observability: OTel in all environments; common log format; tail sampling by errors/p99.
Data: CDC configured; dual-write is idempotent; there is a conflict-resolution plan.
GSLB/DNS/Anycast and health checks; phased traffic shifting with SLO gates.
Secrets and KMS: Abstraction via Vault; rotation; segmentation by region.
FinOps: cost models, egress limits, tags and quotas; cost reports.
DR exercises conducted; actual RTO/RPO measured; runbooks updated.
API/event contracts are verified in both clouds; webhooks are monitored.
For iGaming/Finance: data residency, multi-PSP routing, WORM logs.

16) TL;DR

Build a portable core (K8s + IaC + mesh + OTel + Vault) and choose a multi-cloud pattern that fits your RTO/RPO/SLO business goals and cost. Migrate in waves: shadow traffic → CDC → dual-write → phased traffic shifting with SLO gates. Manage data through idempotency and outbox/inbox, traffic through GSLB/Anycast, security through mTLS/KMS/Vault. For iGaming: strict data residency and multi-PSP rules, separate SLOs for "money" paths.
