Hybrid cloud: on-prem + cloud
1) Why a hybrid and when is it justified
Drivers: regulatory requirements (data residency/PII), existing on-prem investments, low latency to proprietary/legacy systems, cost control, access to managed cloud services.
Trade-offs: network and security complexity, duplicated competencies, synchronization of data and configs, operational risk.
Motto: portable where it is critical; cloud-native where it pays off.
2) Hybrid models
On-prem extension: the cloud as a data-center extension (new microservices, analytics, frontends).
Cloud-first with local anchors: the core in the cloud; on-prem keeps accounting systems, payment gateways, PII storage.
Cloud-bursting: elastic load peaks go to the cloud (batch jobs, promo peaks); the base volume stays local.
DR to cloud: hot/warm cloud standby for on-prem (managed RTO/RPO).
Edge + core: PoP/edge nodes closer to the user; root data/ML in the cloud.
3) Network and connectivity
3.1 Channels
Site-to-Site VPN (IPsec/SSL) - quick to start; higher latency and jitter.
Dedicated links (Direct Connect / ExpressRoute / Interconnect, MPLS) - predictable SLAs, lower latency, more expensive.
Dual links + BGP - fault tolerance and routing control.
3.2 Addressing and routes
A single RFC1918 addressing scheme with no overlaps; a CIDR plan with headroom for years ahead.
NAT only at the borders; east-west traffic without NAT.
Segments/VRFs to isolate environments (dev/stage/prod), tenants, and providers.
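Overlap checks belong in CI, not in an incident postmortem. A minimal sketch of validating a hybrid CIDR plan with the Python standard library; site names and ranges are illustrative:

```python
# Sketch: verify that a hybrid CIDR plan has no overlapping ranges
# before any channels are built (site names and ranges are illustrative).
import ipaddress
from itertools import combinations

def overlapping_pairs(plan: dict) -> list:
    """Return pairs of sites whose CIDR blocks overlap."""
    nets = {site: ipaddress.ip_network(cidr) for site, cidr in plan.items()}
    return [(a, b) for (a, na), (b, nb) in combinations(nets.items(), 2)
            if na.overlaps(nb)]

plan = {
    "onprem-prod": "10.0.0.0/16",
    "cloud-prod":  "10.1.0.0/16",
    "cloud-stage": "10.1.128.0/17",   # overlaps cloud-prod -> must be fixed
}
print(overlapping_pairs(plan))  # [('cloud-prod', 'cloud-stage')]
```

Running this as a pre-merge check on the address plan repository catches "NAT chaos" (see the antipatterns section) before it reaches the routers.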
3.3 Time and DNS policies
A single NTP hierarchy (clock skew is fatal for cryptography/signatures).
Split-horizon DNS: internal zones (svc.cluster.local, corp.local) resolved internally; external zones are public.
Health-based GSLB for inbound traffic.
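The GSLB decision itself is simple: order sites by preference and route to the first one that passes health checks. A minimal sketch, with illustrative site names and a plain boolean health flag standing in for real probe results:

```python
# Sketch: health-based GSLB routing for inbound traffic — pick the most
# preferred healthy site; fail over when health checks fail (illustrative).
def pick_site(sites: list) -> str:
    """Sites are ordered by preference; return the first healthy one."""
    for s in sites:
        if s["healthy"]:
            return s["name"]
    raise RuntimeError("no healthy site")

sites = [
    {"name": "onprem-eu", "healthy": False},  # failed its health check
    {"name": "cloud-eu",  "healthy": True},
]
print(pick_site(sites))  # cloud-eu
```

Real GSLB products add weights, TTL tuning, and hysteresis so traffic does not flap between sites, but the ordering-by-preference core is the same.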
4) Identity and access
SSO Federation: OIDC/SAML, on-prem IdP ↔ cloud IdP; SCIM provisioning.
Roles follow the principle of least privilege; break-glass accounts with MFA.
Machine identity: SPIFFE/SPIRE or mesh-PKI for mTLS.
RBAC "end-to-end": Git/CI/CD → cluster/mesh → brokers/DB → logs.
5) Platform: Kubernetes + GitOps
5.1 Single execution layer
Clusters on-prem and in the cloud run the same versions/CRDs.
GitOps (Argo CD/Flux): shared charts/overlays, drift detection, promotion flows.
5.2 Service mesh
Istio/Linkerd: mTLS by default, locality-aware balancing, inter-cluster failover.
L7 policies (JWT, headers, rate limits, retry/circuit-breaker/timeout) as code in manifests.
5.3 Example (K8s topology & mesh)
Anti-affinity and zone spreading for the on-prem cluster:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api
```

Istio DestinationRule: prefer the local cluster, then fail over to the cloud:

```yaml
trafficPolicy:
  outlierDetection:
    consecutive5xxErrors: 5
    interval: 5s
    baseEjectionTime: 30s
```
6) Data and storage
6.1 Databases
On-prem master, cloud read replica (analytics/directories).
Cloud master + on-prem cache (low latency for local integrations).
Distributed SQL/NoSQL (CockroachDB/Cassandra) with local quorums.
CDC/log-based replication (Debezium) between the sites; idempotent handlers.
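The "idempotent handlers" requirement is what makes CDC replication safe: replays and duplicate deliveries must not double-apply a change. A minimal sketch, with an in-memory dedup store and illustrative event fields (in production the seen-ID store would be durable, with TTL or compaction):

```python
# Sketch: idempotent applier for CDC events replicated between sites
# (e.g. via Debezium). Event IDs already applied are remembered, so a
# replayed or duplicated event is a no-op. All names are illustrative.
class IdempotentApplier:
    def __init__(self):
        self.seen = set()        # in prod: durable store with TTL/compaction
        self.balances = {}

    def apply(self, event: dict) -> bool:
        """Apply a CDC event exactly once; return False on a duplicate."""
        if event["id"] in self.seen:
            return False
        self.seen.add(event["id"])
        acct = event["account"]
        self.balances[acct] = self.balances.get(acct, 0) + event["delta"]
        return True

h = IdempotentApplier()
e = {"id": "evt-1", "account": "acc-7", "delta": 100}
h.apply(e)
h.apply(e)                    # duplicate delivery — silently ignored
print(h.balances["acc-7"])    # 100, not 200
```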
6.2 Object/file/block
S3-compatible object stores (on-prem MinIO + cloud S3/GCS) with replication/versioning; WORM for audit.
Backups: 3-2-1 (3 copies, 2 media types, 1 offsite), with regular restore verification.
6.3 Cache and queues
Redis/KeyDB cluster per site; a global cache only via events/TTL.
Kafka/Pulsar: MirrorMaker 2/replicator; the key is deduplication/idempotent consumers.
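"Global cache only via events/TTL" means each site serves its own cache, and cross-site consistency comes from two bounded mechanisms: entries expire on their own, and replicated invalidation events evict them early. A minimal sketch under those assumptions (key names and TTLs are illustrative):

```python
# Sketch: per-site cache with no cross-site locks — consistency is bounded
# by TTL plus replicated invalidation events, as the section suggests.
import time

class SiteCache:
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.data = {}                 # key -> (stored_at, value)

    def put(self, key: str, value: str):
        self.data[key] = (time.monotonic(), value)

    def get(self, key: str):
        entry = self.data.get(key)
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            return None                # expired -> refetch from source of truth
        return entry[1]

    def on_invalidation_event(self, key: str):
        self.data.pop(key, None)       # event replicated from the other site

cache = SiteCache(ttl=30.0)
cache.put("odds:match-1", "1.95")
cache.on_invalidation_event("odds:match-1")
print(cache.get("odds:match-1"))       # None — evicted by the event
```

The trade-off is deliberate: readers may see data up to one TTL stale if an invalidation event is lost, which is exactly the failure mode a "strongly consistent global cache" antipattern tries (and fails) to avoid with locks.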
7) Security and compliance (Zero Trust)
mTLS everywhere (mesh), TLS 1.2+ on the perimeter; unencrypted channels disabled.
Secrets: HashiCorp Vault/ESO; short-lived tokens; auto-rotation.
KMS/HSM: keys segmented per jurisdiction/tenant; scheduled crypto rotations.
Segmentation: NetworkPolicies, micro-segmentation (NSX/Calico), ZTNA for admin access.
Logs: immutable (Object Lock), end-to-end `trace_id`, PII/PAN masking.
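PAN masking should happen before logs leave the process or the collector. A minimal sketch of the idea; the regex below is simplified (real PAN detection should add a Luhn check and card-scheme prefixes):

```python
# Sketch: mask PAN-like digit runs in log lines before shipping them,
# keeping only BIN (first 6) and last 4. Simplified — production detection
# should also apply a Luhn check to avoid masking arbitrary numbers.
import re

PAN_RE = re.compile(r"\b(\d{6})\d{6,9}(\d{4})\b")   # 16–19 digit PANs

def mask_pan(line: str) -> str:
    return PAN_RE.sub(r"\1******\2", line)

print(mask_pan("payment ok pan=4111111111111111 trace_id=abc"))
# payment ok pan=411111******1111 trace_id=abc
```

Applying this in an OpenTelemetry Collector processor (or an equivalent log pipeline stage) keeps raw PANs out of both the on-prem and cloud log stores.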
8) Observability, SLO and incident management
OpenTelemetry SDK everywhere; Collector on-prem and in the cloud.
Tail-sampling: 100% of errors and p99, labels `site=onprem|cloud`, `region`, `tenant`.
SLOs and error budgets by slices (route/tenant/provider/site); alerts by burn rate.
End-to-end dashboards: RED/USE, dependency maps, canary comparisons (before/after migrations).
9) CI/CD and configs
A single artifact registry (with an on-prem pull-through cache).
Promotion flow: dev → stage (on-prem) → canary (cloud) → prod; or the reverse, depending on the goal.
Checks: contract tests (OpenAPI/gRPC/CDC), static analysis, IaC linting, image scanning, SLO gates.
10) DR/BCP (continuity plan)
RTO/RPO per service. Examples:
- catalogs/landing pages: RTO 5-15 min, RPO ≤ 5 min;
- payments/wallets: RTO ≤ 5 min, RPO ≈ 0-1 min (quorum/synchronous within the site).
Runbooks: switching GSLB weights, promoting standby in a cluster, "lightweight mode" feature flags.
GameDays: quarterly site/link disconnection drills, verifying actual RTO/RPO.
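A GameDay only proves something if the measured numbers are compared against the plan. A minimal sketch of that comparison, with illustrative targets and timestamps; in practice `last_replicated_at` would come from replication-lag metrics:

```python
# Sketch: GameDay verification — compare the measured RTO/RPO of a drill
# against per-service targets from the DR plan (all values illustrative).
from datetime import datetime, timedelta

TARGETS = {
    "payments": {"rto": timedelta(minutes=5), "rpo": timedelta(minutes=1)},
}

def drill_result(service, failed_at, recovered_at, last_replicated_at):
    rto = recovered_at - failed_at          # how long the outage lasted
    rpo = failed_at - last_replicated_at    # how much data was at risk
    t = TARGETS[service]
    return {"rto_ok": rto <= t["rto"], "rpo_ok": rpo <= t["rpo"]}

t0 = datetime(2024, 1, 1, 12, 0, 0)
print(drill_result("payments",
                   failed_at=t0,
                   recovered_at=t0 + timedelta(minutes=4),
                   last_replicated_at=t0 - timedelta(seconds=30)))
# {'rto_ok': True, 'rpo_ok': True}
```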
11) Cost and FinOps
Egress between on-prem and the cloud is the main "hidden" expense; cache aggressively and keep cross-boundary round trips to a minimum (SWR, edge).
Tagging: `service`, `env`, `site`, `tenant`, `cost_center`.
The 80/20 rule: keep the ~20% "critical core" portable; run the rest wherever it is cheaper.
Downsample metrics, tier logs hot/cold, use budget-aware trace sampling.
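The tagging scheme above only pays off when reports roll spend up by those tags. A minimal sketch of such a roll-up over billing line items; the items and amounts are illustrative:

```python
# Sketch: FinOps roll-up — aggregate spend by the mandated tags
# (`service`, `site`, `cost_center`). Line items are illustrative.
from collections import defaultdict

items = [
    {"service": "api",    "site": "cloud",  "cost_center": "cc-1", "usd": 120.0},
    {"service": "api",    "site": "onprem", "cost_center": "cc-1", "usd": 40.0},
    {"service": "egress", "site": "cloud",  "cost_center": "cc-1", "usd": 310.0},
]

def rollup(items, key):
    total = defaultdict(float)
    for it in items:
        total[it[key]] += it["usd"]
    return dict(total)

print(rollup(items, "site"))      # {'cloud': 430.0, 'onprem': 40.0}
print(rollup(items, "service"))   # egress dominating is the classic hybrid smell
```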
12) Workload placement patterns
13) Examples of configs
13.1 IPsec S2S (sketch)
onprem ↔ cloud: IKEv2, AES-GCM, PFS group 14, rekey ≤ 1 h, DPD 15 s; monitor jitter/packet loss against the SLA.
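The "monitor jitter/packet loss" part can be as simple as probing the tunnel and checking the results against thresholds. A minimal sketch, with illustrative probe data and SLA limits (jitter approximated as the standard deviation of RTTs):

```python
# Sketch: evaluate a tunnel's probe results against SLA thresholds.
# A probe RTT of None means the packet was lost. Values are illustrative;
# jitter here is approximated as the population stddev of received RTTs.
import statistics

def link_health(rtts_ms, max_jitter_ms=10.0, max_loss=0.01):
    received = [r for r in rtts_ms if r is not None]
    loss = 1 - len(received) / len(rtts_ms)
    jitter = statistics.pstdev(received) if len(received) > 1 else 0.0
    return {"loss": loss, "jitter_ms": jitter,
            "ok": loss <= max_loss and jitter <= max_jitter_ms}

probes = [21.0, 22.0, 21.5, None, 23.0]   # one lost probe out of five
print(link_health(probes)["ok"])          # False — 20% loss breaches the SLA
```

Feeding this verdict into BGP weights (or the GSLB) is what turns the dual-link setup from section 3.1 into actual automatic failover.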
13.2 Terraform (tag/label snippet)

```hcl
resource "kubernetes_namespace" "payments" {
  metadata {
    name = "payments"
    labels = {
      "site"        = var.site    # onprem | cloud
      "tenant"      = var.tenant
      "cost_center" = var.cc
    }
  }
}
```
13.3 Vault + ESO (secret from on-prem to a cloud cluster)

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-store
  target:
    name: psp-hmac
    creationPolicy: Owner
  data:
    - secretKey: hmac
      remoteRef:
        key: kv/data/payments
        property: HMAC_SECRET
```
14) Antipatterns
Overlapping CIDRs → NAT chaos; the address plan comes first, then the channels.
A single "shared" global cache with strong consistency → latency and split-brain.
Retries without idempotency → double charges/orders.
A "bare" VPN without mTLS/Zero Trust inside → lateral movement after a compromise.
No DR exercises → plans that do not work in reality.
Version drift across K8s/CRDs/operators → uniform charts become impossible.
Free-form logs without `trace_id` and masking → investigations and audits become impossible.
15) Specifics of iGaming/Finance
Data residency: PII/payment events stay in the on-prem/regional circuit; only aggregates/anonymized data go to the cloud.
PSP/KYC: multiple providers; smart routing from the cloud to local gateways with fallback to a backup; webhooks through a broker with deduplication.
"Money paths": their own SLOs, stricter than the aggregate; HMAC/mTLS, `Retry-After`, `Idempotency-Key` are mandatory.
Audit: WORM storage (Object Lock), immutable transaction logs, dual recording (on-prem + cloud) for critical events.
Jurisdictions: KMS/Vault key segmentation per country/brand; geo-blocking on the perimeter.
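The webhook requirements above (HMAC verification plus `Idempotency-Key` deduplication) combine into one small handler. A minimal sketch; the secret, header names, and in-memory dedup store are simplifications (in production the secret comes from Vault/ESO and the seen-key store is shared and TTL-bounded):

```python
# Sketch: verify a PSP webhook's HMAC signature and deduplicate by
# Idempotency-Key, as required for "money paths". Names simplified.
import hashlib
import hmac

SECRET = b"psp-hmac-secret"        # in prod: pulled from Vault/ESO, rotated
seen_keys = set()                  # in prod: shared store with TTL

def handle_webhook(body: bytes, signature: str, idempotency_key: str) -> str:
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return "rejected"          # invalid HMAC — do not process
    if idempotency_key in seen_keys:
        return "duplicate"         # already applied — never apply twice
    seen_keys.add(idempotency_key)
    return "processed"

body = b'{"payout": 100}'
sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
print(handle_webhook(body, sig, "key-1"))  # processed
print(handle_webhook(body, sig, "key-1"))  # duplicate
```

Note `hmac.compare_digest` for constant-time comparison: comparing signatures with `==` leaks timing information.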
16) Prod Readiness Checklist
- Address plan, DNS, NTP - unified; S2S links plus redundant paths with BGP failover.
- Single identity (SSO/OIDC/SAML), MFA, least privilege; SPIFFE/SPIRE for services.
- K8s at all sites, GitOps, same operators/CRDs; service mesh with mTLS and locality-aware LB.
- Data: CDC, consistency tests, RPO/RTO policies, 3-2-1 backups and regular restore drills.
- Security: Vault/ESO, rotation, NetworkPolicies, ZTNA; immutable logs.
- Observability: OTel, tail-sampling, SLO/budgets by site/region/tenant; canary dashboards.
- CI/CD: contract tests, IaC linting, image scan; release-gates by SLO.
- DR-runbooks, GameDays, measured actual RTO/RPO; cutover/roll-back buttons.
- FinOps: egress limits, tags and reports, metrics/logs/trails retention policy.
- iGaming specifics: data residency, multi-PSP, WORM audit, individual SLOs for payments.
17) TL;DR
Hybrid = a common execution platform (K8s + GitOps + mesh + OTel + Vault) across two worlds: on-prem and cloud. Plan the network and identity first, make data portable via CDC and idempotency, enforce security through Zero Trust, measure reliability with SLOs/error budgets, and rehearse DR. For iGaming, keep data and payments within the jurisdiction, use multi-PSP smart routing, and immutable audit logs.