Multi-cloud strategy and migrations
1) Why multi-cloud, and when it is justified
Objectives: business continuity (a fallback provider), data/jurisdiction sovereignty, cost/discount optimization, access to the best managed services (ML/anti-fraud/analytics).
Trade-offs: more complex operations, duplicated skill sets, network egress costs.
Key point: decide in advance where portability is required and where vendor lock-in is acceptable in exchange for speed/price.
2) Target architectural models
Portable Core: the critical core (API, domain services, data) is portable (K8s, Postgres, Kafka, Redis, MinIO/Vault); the periphery uses cloud-native managed services.
Active-Active Multi-cloud: two clouds serve traffic at the same time (hard: data conflicts, global routing).
Active-Passive (Hot/Warm): one cloud is primary, the other is a hot/warm DR standby.
Hybrid: part on-prem, part in the clouds (often because of legal/PII restrictions).
3) Portability patterns
Kubernetes as the base platform (variants: EKS/GKE/AKS/on-prem K8s).
Service Mesh (mTLS, traffic shifting, locality/failover; Istio/Linkerd).
IaC: Terraform + modular abstractions; for K8s - Helm/Kustomize + GitOps (Argo/Flux).
Secrets: HashiCorp Vault/External Secrets Operator; abstraction over KMS/HSM.
Data stores: Postgres (operators/Patroni), Kafka (operators/MirrorMaker2), Redis (sentinel/cluster), S3-compatible storage (MinIO) for API uniformity.
Observability: OpenTelemetry + vendor-neutral backends (Prometheus/Tempo/Loki/ClickHouse).
Authentication: OIDC/OAuth2 (Keycloak/Auth0/Entra/Google) behind a single federation layer.
API layer: Envoy/NGINX/Contour + shared policies (CORS, mandatory headers, rate limits).
4) Migration strategies (7R - brief)
Rehost (Lift-and-Shift): fast, no rework; good for stateless/VM workloads, bad for cost.
Replatform: move to K8s and simplify dependencies (less risky than a refactor).
Refactor/Repurchase: rewrite for portable patterns or replace with a SaaS service.
Retain/Retire: leave in place or decommission whatever does not need to move.
Practice: start with a service registry (criticality, RTO/RPO, SLO, dependencies), then group services into migration waves (by domain).
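One possible way to derive waves from the registry is to order services by their dependencies, so that a service moves only after everything it depends on. The sketch below uses Python's standard graphlib module; the registry contents and the plan_waves helper are hypothetical, and a real plan would also weigh criticality, RTO/RPO and domain boundaries.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical registry: service -> services it depends on.
registry = {
    "payments-api": {"ledger-db", "auth"},
    "auth": {"user-db"},
    "ledger-db": set(),
    "user-db": set(),
}
waves = plan_waves(registry)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With this registry the data stores land in the first wave, auth in the second and payments-api in the third.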
5) Data and consistency
Replication/CDC: Debezium/log streaming for Postgres/MySQL; Kafka MirrorMaker2 for topics.
Bidirectional synchronization: only with strict idempotency and versioning keys (vector clock/updated_at).
Dual-write with deduplication: records carry an 'Idempotency-Key'/'event_id', plus outbox/inbox for guaranteed delivery.
Ownership partitioning: a leader region/cloud per key/tenant to avoid conflicts.
Cache: local to a region; global only via events/TTL (no "shared" global cache with strong consistency).
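The deduplication idea can be illustrated with a minimal consumer-side inbox, assuming every event carries a unique event_id. The Inbox class and its fields are hypothetical; in a real system the "seen" marker and the state change would be committed in one database transaction rather than kept in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Inbox:
    """Applies each event at most once, keyed by event_id."""
    seen: set[str] = field(default_factory=set)
    balances: dict[str, int] = field(default_factory=dict)

    def apply(self, event_id: str, account: str, amount: int) -> bool:
        if event_id in self.seen:
            return False  # duplicate delivery (retry or cross-cloud replay)
        self.balances[account] = self.balances.get(account, 0) + amount
        self.seen.add(event_id)
        return True


inbox = Inbox()
inbox.apply("evt-1", "acc-42", 100)  # first delivery is applied
inbox.apply("evt-1", "acc-42", 100)  # redelivery is ignored
print(inbox.balances)
```

Because the second delivery is dropped, at-least-once transport (outbox relay, MirrorMaker2, retries) does not double-apply the change.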
6) Global traffic and network
GSLB/DNS: latency/geo-routing + health checks, weights for canaries/failover.
Anycast/Edge/CDN for proximity to the user, then routing to the nearest healthy region/cloud.
Direct channels: Interconnect/ExpressRoute/Direct Connect between clouds/on-prem to reduce egress/latency.
Client policies: short timeouts, exponential backoff + jitter, a bounded number of retries, idempotent write operations.
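A minimal sketch of the client retry policy, assuming the operation is idempotent and signals transient failures with ConnectionError. The retry_with_backoff helper and its default values are illustrative, not recommendations; sleep and rng are injectable so the behavior can be tested without waiting.

```python
import random
import time


def retry_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call op(); on ConnectionError retry with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted, surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # "full jitter": random delay in [0, cap)
```

The jitter spreads retries from many clients over time, so a recovering region is not hit by synchronized retry waves.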
7) Security and compliance
mTLS everywhere (mesh + SPIFFE/SPIRE or native PKI).
KMS/HSM: abstract the API through Vault; segment keys per jurisdiction/tenant.
IAM: unified role and group model (SCIM/SSO), least-privilege policies, temporary credentials (STS).
Secrets/rotation: automatic rotation of tokens/passwords; long-lived static keys are blocked.
Compliance: PCI DSS/GDPR - data residency, isolated audit logs, geo-blocks.
8) Observability, SLO and Error Budgets
RED/USE signals + traces + profiles in all clouds; a single log format (JSON + 'trace_id').
Tail-based trace sampling: keep errors/p99; segment by 'cloud', 'region', 'tenant'.
SLO per cloud/region + an overall aggregate; burn-rate alerts (multi-window).
Canary dashboards "before/after migration"; regression report.
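The multi-window burn-rate check can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 - SLO target); an alert fires only when both a long and a short window exceed the threshold, which filters out brief spikes and already-recovered incidents. The 99.9% target and the 14.4 threshold (about 2% of a 30-day budget consumed in one hour) are assumed example values, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows (e.g. 1h and 5m) burn above the threshold."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# 2% errors in both windows: sustained fast burn.
print(should_page(0.02, 0.02))
# 2% over the long window but back to 0.1% in the short one: already recovering.
print(should_page(0.02, 0.001))
```

Running the same check per cloud/region and on the aggregate matches the per-cloud SLO split described above.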
9) CI/CD and config management
GitOps: one image artifact for all environments; configs are per environment/region via Helm values/Kustomize overlays.
Secrets through External Secrets Operator (bridges to AWS/GCP/Azure secret stores).
Promotion flow: dev → staging → canary (cloud A) → canary (cloud B) → full.
Release gates: SLO/synthetic/contract-test checks must pass before the traffic weight is increased.
10) Cost and FinOps
Account for inter-cloud egress rates, RI/CUD/Savings Plans discounts, marketplace bundles.
80/20 rule: move only the 20% that carries the greatest business risk; keep the rest where it is cheaper/easier.
Downsampled metrics, cold storage for logs, limits on traces (budget-aware sampling).
Resource tagging: 'env', 'team', 'service', 'tenant', 'cost_center' - for transparent billing.
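A tagging policy like this is easy to enforce with a small check in CI or in an inventory job. The sketch below is cloud-agnostic and assumes resources are already exported as plain dictionaries of tags; missing_tags is a hypothetical helper.

```python
REQUIRED_TAGS = {"env", "team", "service", "tenant", "cost_center"}


def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Return required tags that are absent or empty on a resource."""
    return {tag for tag in REQUIRED_TAGS if not resource_tags.get(tag)}


# Hypothetical resource exported from a cloud inventory.
bucket_tags = {"env": "prod", "team": "payments", "service": "wallet"}
print(sorted(missing_tags(bucket_tags)))
```

Resources with a non-empty result can be reported or blocked before they start generating unattributed cost.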
11) Migration plan (playbook)
11.1 Preparation
1. Inventory of services/data/dependencies; target RTO/RPO/SLO.
2. Select the model (active-active vs active-passive) and the network layer (GSLB/Anycast).
3. Sandbox preparation in the target cloud: K8s cluster, mesh, observability, secrets.
11.2 Dry run and validation
4. Shadow traffic: mirror requests without affecting production.
5. Contract tests (OpenAPI/gRPC/consumer-driven contracts) and synthetic checks along key routes.
6. CDC/replication: hot data synchronization, consistency reconciliation.
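Consistency reconciliation in step 6 can be approximated by comparing per-row digests between the source and the target. The sketch below works on in-memory dictionaries keyed by primary key; row_digest and reconcile are hypothetical helpers, and a real job would stream rows in key ranges instead of loading whole tables.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Stable digest of a row: columns are sorted so ordering does not matter."""
    canonical = "|".join(f"{column}={row[column]}" for column in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict[str, dict], target: dict[str, dict]) -> dict[str, list[str]]:
    """Report keys missing in target, extra in target, and with differing content."""
    shared = source.keys() & target.keys()
    return {
        "missing": sorted(source.keys() - target.keys()),
        "extra": sorted(target.keys() - source.keys()),
        "mismatched": sorted(
            key for key in shared if row_digest(source[key]) != row_digest(target[key])
        ),
    }


source_rows = {"1": {"id": 1, "balance": 100}, "2": {"id": 2, "balance": 200},
               "3": {"id": 3, "balance": 300}}
target_rows = {"1": {"balance": 100, "id": 1}, "2": {"id": 2, "balance": 250},
               "4": {"id": 4, "balance": 400}}
report = reconcile(source_rows, target_rows)
print(report)
```

An empty report on a frozen snapshot is a reasonable precondition for moving on to the switching phase.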
11.3 Switching
7. Dual-write (idempotent) to a limited percentage of users/tenants.
8. Phased traffic shifting (1%→10%→50%→100%) with SLO gates.
9. Freeze and move stateful components; perform the final cutover; keep the old environment read-only until the final reconciliation.
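The phased shift in step 8 can be sketched as a loop that raises the traffic weight step by step and rolls back as soon as an SLO gate fails. Both callbacks are hypothetical: set_weight stands for whatever changes the GSLB/mesh weight of the new cloud, and slo_ok for a check against burn rate, synthetic checks and contract tests. Rolling back to 0% rather than to the previous step is an assumption made to keep the sketch short.

```python
STEPS = [1, 10, 50, 100]  # percent of traffic sent to the new cloud


def shift_traffic(set_weight, slo_ok, steps=STEPS) -> int:
    """Advance through the steps; return the final weight (0 after rollback)."""
    for percent in steps:
        set_weight(percent)
        if not slo_ok(percent):
            set_weight(0)  # gate failed: send all traffic back to the old cloud
            return 0
    return steps[-1]


# Example with stub callbacks: the gate fails once 50% of traffic has moved.
history = []
final = shift_traffic(history.append, lambda percent: percent < 50)
print(history, final)
```

In practice each step would also include a soak period before the gate is evaluated.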
11.4 After migration
10. Review audit and application logs, archive old snapshots, optimize egress/cache.
11. Update runbooks and train the on-call team.
12) DR and fault tolerance tests
GameDay: disconnect an entire cloud/region; measure the actual RTO/RPO.
Chaos injections: packet loss, increased cross-link latency, broker/database failure.
Automatic degradation flags: disabling "expensive" features, switching to the 'stale-while-revalidate' cache.
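As an illustration of the 'stale-while-revalidate' mode, the sketch below serves a fresh entry directly, serves a stale entry immediately while scheduling a refresh, and reloads synchronously only when the entry is too old. SWRCache, its parameters and the injectable clock/schedule hooks are hypothetical; the sketch has no locking and does not deduplicate concurrent refreshes.

```python
import threading
import time


class SWRCache:
    def __init__(self, loader, ttl, stale_window, clock=time.monotonic, schedule=None):
        self.loader = loader              # function key -> value (the expensive call)
        self.ttl = ttl                    # seconds an entry counts as fresh
        self.stale_window = stale_window  # extra seconds a stale entry may be served
        self.clock = clock
        # By default refreshes run in a background thread.
        self.schedule = schedule or (
            lambda task: threading.Thread(target=task, daemon=True).start())
        self.entries = {}                 # key -> (value, stored_at)

    def _refresh(self, key):
        self.entries[key] = (self.loader(key), self.clock())

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            self._refresh(key)            # cold miss: load synchronously
            return self.entries[key][0]
        value, stored_at = entry
        age = self.clock() - stored_at
        if age <= self.ttl:
            return value                  # fresh
        if age <= self.ttl + self.stale_window:
            self.schedule(lambda: self._refresh(key))
            return value                  # stale, refresh happens in the background
        self._refresh(key)                # too old to serve: reload synchronously
        return self.entries[key][0]
```

A degradation flag could simply widen stale_window, trading freshness for lower load on a struggling backend.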
13) Antipatterns
"Pure" active-active without data ownership agreements → conflicts/duplicates.
Shared global cache with strong consistency - latency/congestion.
Retries without idempotency → duplicate charges/orders.
Different log/trace formats in the clouds - loss of correlation.
Lack of a single IAM/secret model.
Migration "all at once" without waves and gates.
14) Specifics of iGaming/Finance
Jurisdictions and data residency: PII/payment logs remain "within the country/region"; only aggregates/anonymized data cross clouds.
Payment providers: multi-PSP and smart-routing by cloud/region; webhooks - through a global broker with deduplication.
Sanction/compliance filters: regional profiles; fast failover to an allowed PSP.
SLOs for "money paths" are stricter than the general ones; separate alerts/dashboards per provider/region.
Audit: immutable transaction logs, synchronous writing to two independent storages (WORM/S3 Object Lock).
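Multi-PSP routing with regional compliance profiles can be sketched as a priority list filtered by region and health. The PSP names, regions and the pick_psp helper are hypothetical; a real router would also consider cost, success rate and payment method.

```python
from typing import Optional


def pick_psp(region: str, psps: list[tuple[str, set[str]]],
             healthy: set[str]) -> Optional[str]:
    """Return the first PSP (in priority order) allowed in the region and healthy."""
    for name, allowed_regions in psps:
        if region in allowed_regions and name in healthy:
            return name
    return None  # no compliant PSP available: fail the payment, do not reroute


# Priority-ordered PSPs with the regions where each may be used.
PSPS = [("psp-a", {"eu"}), ("psp-b", {"eu", "uk"}), ("psp-c", {"uk"})]

# psp-a is down, so EU traffic fails over to the next allowed provider.
print(pick_psp("eu", PSPS, healthy={"psp-b", "psp-c"}))
```

Returning None instead of falling back to a provider outside the region keeps failover inside the compliance profile.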
15) Prod Readiness Checklist
- Target model selected (portable core/active-active/standby); RTO/RPO/SLO are described.
- IaC/GitOps: modular Terraform/Helm/Kustomize; single mesh and security policies.
- Observability: OTel in all environments; common log format; tail-sampling by errors/p99.
- Data: CDC configured; dual-write is idempotent; there is a conflict-resolution plan.
- GSLB/DNS/Anycast and health checks configured; phased traffic shifting with SLO gates.
- Secrets and KMS: Abstraction via Vault; rotation; segmentation by region.
- FinOps: cost models, egress limits, tags and quotas; cost reports.
- DR exercises conducted; actual RTO/RPO measured; runbooks updated.
- API/event contracts are verified in both clouds; webhooks are monitored.
- For iGaming/Finance: data residency, multi-PSP routing, WORM logs.
16) TL;DR
Build a portable core (K8s + IaC + mesh + OTel + Vault) and choose a multi-cloud pattern that fits your RTO/RPO/SLO business goals and cost. Migrate in waves: shadow traffic → CDC → dual-write → phased traffic shifting with SLO gates. Manage data through idempotency and outbox/inbox, traffic through GSLB/Anycast, security through mTLS/KMS/Vault. For iGaming - strict data residency and multi-PSP rules, separate SLOs for "money" paths.