AI infrastructure and GPU pools
(Section: Technology and Infrastructure)
Brief Summary
Production AI is not "one model on one server" but a cluster of GPU nodes, shared accelerator pools, unified serving, data/feature infrastructure, observability, and cost management. For iGaming this matters in real time: anti-fraud, personalization, chatbots, LLM assistants, game/promotion recommendations. The basic building blocks are Kubernetes/Slurm for scheduling, workload isolation, a high-speed network (100/200/400G with RDMA), fast storage, mature MLOps, and rock-solid SLOs.
1) Architectural map
Layers:
1. Compute cluster: GPU nodes (NVIDIA A/H class, AMD/ROCm, Intel Gaudi, etc.), CPU nodes for preprocessing/feature engineering.
2. Network: 100G+ Ethernet/InfiniBand, RDMA (RoCEv2), NCCL topologies, QoS.
3. Storage: object storage (S3-compatible), distributed POSIX (e.g., CephFS), local NVMe scratch.
4. Data/features: feature store (online/offline), vector databases (ANN), cache (Redis), queues.
5. ML platform: artifact and model registry, pipelines (CI/CD), version control, features as code.
6. Serving layer: Triton/KServe/vLLM/text-generation-inference (TGI), A/B and canary deploys, autoscaling.
7. Governance and security: PII, secrets, audit, export policies, weight/dataset licenses.
Typical workloads:
- Online scoring (p95 ≤ 50-150 ms): anti-fraud, recommendations, ranking.
- LLM serving (p95 ≤ 200-800 ms for 128-512 tokens): chat/agents/prompts.
- Batch analytics/re-training: night windows, offline metrics.
- Fine-tuning/adaptation: periodic, at a lower priority than online traffic.
2) GPU pools and scheduling
Pool model
Serving pool: short requests, high batching efficiency, strict SLOs.
Training/fine-tuning pool: long jobs, distributed training (DDP/FSDP).
R&D/experiments pool: quotas/limits, preemption allowed.
CPU pre-/post-processing pool: normalization, tokenization, reranking on CPU.
Schedulers
Kubernetes (+ device-plugin, NodeFeatureDiscovery, taints/tolerations, PriorityClass, PodPriority/Preemption).
Slurm (often for HPC training) - can be mixed with K8s through separate workers.
Fair share and quotas: namespace quotas for GPU, CPU, and memory; GPU-hour budgets ("banks"); per-namespace/per-project limits.
GPU partitioning
MIG (Multi-Instance GPU): slicing the accelerator into isolated instances (for serving/multi-tenancy).
MPS: SM sharing for small tasks (monitor interference).
NVLink/PCIe: use topology-aware scheduling.
Example (K8s pod pinned to the serving pool):
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: "ai-serving"
spec:
  nodeSelector: { gpu-pool: serving }
  tolerations: [{ key: "gpu", operator: "Exists", effect: "NoSchedule" }]
  priorityClassName: ai-serving-critical
```
3) Network and interconnect performance
RDMA (RoCEv2) for NCCL all-reduce collectives; ECN/PFC tuning, traffic-class isolation.
Locality: training within a single network fabric (pod/rack/optics); serving closer to the user (edge/region).
Congestion control: tuned profiles, jumbo frames, NIC/IRQ pinning.
4) Storage and data
Weight/artifact storage: object storage (versioning, immutability).
Datasets/features: lakehouse (Delta/Iceberg/Hudi) plus an offline feature store; online feature store with millisecond SLAs.
Vector databases (ANN): Faiss/ScaNN/accelerators or vendor vector engines; sharding, HNSW/IVF, replication (a minimal ANN sketch follows this list).
Local NVMe cache: warming up weights/embeddings to avoid cold starts.
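To make the vector-database item concrete, below is a minimal sketch of building and querying an HNSW index with Faiss; the dimensionality, data, and parameters are illustrative assumptions, and a managed vector engine could fill the same role. Sharding by tenant/locale and replication would sit on top of such an index.

```python
# A minimal ANN sketch with Faiss (HNSW); dimensionality, data and parameters
# are illustrative assumptions, not production values.
import numpy as np
import faiss

DIM = 384  # hypothetical embedding dimensionality

index = faiss.IndexHNSWFlat(DIM, 32)   # HNSW graph with 32 links per node
index.hnsw.efConstruction = 200        # build-time accuracy/speed trade-off
index.hnsw.efSearch = 64               # query-time accuracy/speed trade-off

# Index a batch of L2-normalized embeddings (e.g. game/promo descriptions).
embeddings = np.random.rand(10_000, DIM).astype("float32")
faiss.normalize_L2(embeddings)
index.add(embeddings)

# Query: top-10 nearest neighbors for a user/context embedding.
query = np.random.rand(1, DIM).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 10)
print(ids[0], distances[0])
```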
5) Model serving
Frameworks
Triton Inference Server (multi-model, multi-framework, dynamic batching).
KServe (K8s-native, HPA/KPA autoscaling, canary).
vLLM/TGI for high-throughput LLM decoding (PagedAttention, KV-cache offload).
ONNX Runtime/TensorRT-LLM - for compilation and acceleration.
Optimizations
Quantization: INT8/FP8/INT4 (percentile calibration, AWQ/GPTQ); apply online with care and measure quality.
Graph compilation: TensorRT, TorchInductor/XLA, fused kernels.
Batching/micro-batching: dynamic and static; for LLMs, continuous batching.
KV cache: sharing across requests, offload to CPU/NVMe for long contexts.
Speculative decoding: a draft model plus a verifier to speed up token generation.
Token/context limits, early stopping, stop words, a time budget per request (a client-side sketch follows this list).
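As a complement to the limits item above, here is a minimal client-side sketch of enforcing token and time budgets against an OpenAI-compatible endpoint such as the ones vLLM and TGI expose; the URL, model name, and budget values are assumptions, not recommendations.

```python
# A minimal sketch of per-request limits against an OpenAI-compatible
# endpoint; URL, model name and limits are illustrative assumptions.
import requests

ENDPOINT = "http://llm-serving.internal:8000/v1/chat/completions"  # hypothetical
TIME_BUDGET_S = 2.0   # hard client-side budget per request
MAX_TOKENS = 256      # cap response length to protect p99 and cost

payload = {
    "model": "assistant-8b",  # hypothetical model name
    "messages": [{"role": "user", "content": "Suggest three slots similar to X."}],
    "max_tokens": MAX_TOKENS,
    "temperature": 0.2,
    "stop": ["\n\n"],         # early stop on a blank line
}

try:
    resp = requests.post(ENDPOINT, json=payload, timeout=TIME_BUDGET_S)
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
except requests.Timeout:
    answer = None  # fall back to a cached/rule-based response on budget overrun
```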
Deploy policies
A/B, canary, shadow - comparison of latency/quality/business metrics.
Blue/green - zero-downtime releases.
Rollback on SLO violations/errors.
6) Training and fine-tuning
DDP/FSDP/ZeRO: sharding of memory/gradients across workers, with NVLink/topology awareness (a minimal DDP sketch follows this list).
Checkpoints: incremental/full; frequency vs I/O cost.
Mixed precision: bf16/fp16 + loss scaling; profile numerical stability.
Dataset sharding: deterministic iterators, replication across nodes.
Priorities: interruptible (preemptible) jobs that yield to serving.
End-to-end pipelines: data → train → eval → register → promote to PROD by gate criteria.
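To ground the DDP and mixed-precision items, a minimal PyTorch DDP sketch is shown below; it assumes a torchrun launch (so RANK/LOCAL_RANK/WORLD_SIZE are set), and the model, data, and checkpoint paths are placeholders. A launch would look like `torchrun --nproc_per_node=8 train.py`.

```python
# A minimal DDP sketch (PyTorch) with mixed precision and rank-0 checkpoints;
# the model and data are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # NCCL over RDMA/NVLink
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()              # fp16 loss scaling

    for step in range(100):                           # placeholder data/loop
        x = torch.randn(64, 512, device=f"cuda:{local_rank}")
        y = torch.randint(0, 2, (64,), device=f"cuda:{local_rank}")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()                 # gradients all-reduced by DDP
        scaler.step(optimizer)
        scaler.update()
        if step % 20 == 0 and dist.get_rank() == 0:
            torch.save(model.module.state_dict(), f"ckpt_{step}.pt")  # checkpoint

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```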
7) MLOps and platform
Model registry: versions, signatures, dependencies, licenses/rights to use the weights (a registry example follows this list).
Model CI/CD: compatibility tests, performance regressions, quality gates, safe deploys.
Feature store: offline/online consistency (feature parity), TTL and backfill.
Data/model lineage: traceability from dataset to report/experiment.
Catalog of prompts/templates for LLMs (versioned).
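The section does not prescribe a specific registry tool; as one possible illustration, the sketch below logs a run and registers a model version with MLflow. The experiment and model names are hypothetical.

```python
# A minimal sketch of logging and registering a model version, assuming
# MLflow as the registry; names and data are illustrative.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("antifraud-scoring")  # hypothetical experiment name

with mlflow.start_run() as run:
    X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("C", model.C)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # store the artifact with the run

# Promote the logged artifact into the registry as a new version.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "antifraud-scoring")
```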
8) Observability and SLO
Online metrics:
- Latency p50/p95/p99, tokens/s, batch occupancy, queue wait, GPU utilization/SM occupancy, memory, errors (a metrics-export sketch follows this section).
- LLM specifics: input/output tokens, average response length, share of requests cut off by limits, KV-cache hit rate.
- Quality: automatic regression tests (offline), online telemetry (content flags, toxicity, answer accuracy on gold sets).
- Business SLOs: personalization conversion, anti-fraud accuracy, retention.
Alerts: p99/queue growth, tokens/s drop, batch-fill degradation, VRAM exhaustion, PCIe/thermal throttling, growth in rate-limit rejections.
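As a sketch of the system-metrics side, the snippet below exposes latency, token-throughput, and queue-depth metrics with prometheus_client; metric names, bucket boundaries, and the scrape port are assumptions rather than a fixed convention.

```python
# A minimal metrics-export sketch with prometheus_client; the handler is a
# placeholder standing in for a real serving path.
import time
import random
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency",
    buckets=(0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2),
)
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Output tokens produced")
QUEUE_DEPTH = Gauge("llm_queue_depth", "Requests waiting for a batch slot")

def handle_request():
    """Placeholder handler that records the metrics alerting rules would use."""
    QUEUE_DEPTH.inc()
    start = time.time()
    try:
        n_tokens = random.randint(32, 256)  # stand-in for real generation
        time.sleep(0.05)
        TOKENS_GENERATED.inc(n_tokens)
    finally:
        REQUEST_LATENCY.observe(time.time() - start)
        QUEUE_DEPTH.dec()

if __name__ == "__main__":
    start_http_server(9100)                 # /metrics endpoint for Prometheus
    while True:
        handle_request()
```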
9) Security, compliance and privacy
PII/financial data: segmentation of compute and data by region, encryption at rest/in transit, tokenization.
Secrets/keys: KMS/Secrets Manager; never store them in images or code.
LLM output policies: safety filters, red-teaming, logging of prompts/responses (with anonymization).
Licenses: compliance with dataset/weight licenses; "no-redistribute"/commercial-use restrictions.
Tenant isolation: namespace RBAC, network policies, MIG slices, limits and quotas.
10) Cost and FinOps
Capacity planning: load profiles (RPS, tokens/s), traffic spikes from tournaments and campaigns.
Reserved/spot: mixed pools (reserved + spot/preemptible) with job requeueing and checkpointing.
Autoscaling: HPA/KPA on RPS/queue depth/GPU utilization; warm starts with preloaded weights.
Model zoo: reduce the number of variants; use adapters (LoRA/PEFT) instead of full copies.
Caching: embeddings and results of expensive requests, KV-cache sharing for LLMs (a caching sketch follows this list).
Token optimization: prompt compression, retrieval-augmented generation (RAG), reranking before generation.
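To illustrate the caching item, here is a minimal sketch of a TTL'd Redis cache keyed by a hash of the canonicalized input; the host, TTL, key scheme, and the `generate` callback are assumptions.

```python
# A minimal sketch of caching expensive results (embeddings, LLM answers) in
# Redis with a TTL; connection details and the key scheme are illustrative.
import hashlib
import json
import redis

r = redis.Redis(host="redis.internal", port=6379, db=0)  # hypothetical host
TTL_SECONDS = 3600

def cache_key(prefix: str, text: str) -> str:
    canonical = " ".join(text.lower().split())            # cheap canonicalization
    return f"{prefix}:{hashlib.sha256(canonical.encode()).hexdigest()}"

def cached_answer(prompt: str, generate) -> str:
    """Return a cached answer if present, otherwise generate and store it."""
    key = cache_key("llm", prompt)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    answer = generate(prompt)                              # expensive call
    r.setex(key, TTL_SECONDS, json.dumps(answer))
    return answer
```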
11) Multi-region, HA and DR
Active/active serving close to the user, global routing (latency-based).
Replication of weights and features with integrity checks; cache warm-up during releases.
DR plan: loss of an AZ/region, evacuation to a backup pool, control of dependencies on a centralized registry.
Chaos days: GPU node/network domain/storage failure drills.
12) Configuration templates (concepts)
Triton - dynamic batching:
```text
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 2000
}
instance_group { count: 2 kind: KIND_GPU }
```
KServe - canary:
```yaml
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat: { name: triton }
      resources:
        limits: { nvidia.com/gpu: "1" }
```
vLLM - launch flags (illustrative):
```text
--tensor-parallel-size 2
--max-num-seqs 512
--gpu-memory-utilization 0.9
--enforce-eager
```
13) LLM specifics: RAG and retrieval loop
Indexing: chunking, embeddings, ANN sharding by tenant/locale (a retrieval/rerank sketch follows this list).
Reranking: a lightweight model on CPU or a GPU slice to improve accuracy.
Prompt/context cache: deduplication, canonicalization.
Citation and liability policies for sensitive domains (responsible-gaming/compliance rules).
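A minimal sketch of the retrieve → rerank → prompt loop is shown below, assuming sentence-transformers models for embedding and cross-encoder reranking and an ANN index like the Faiss example in section 4; the model names and helper signatures are illustrative.

```python
# A minimal RAG-loop sketch: embed the query, retrieve from an ANN index,
# rerank with a cross-encoder, and build the prompt. Model names and the
# `index`/`documents` objects are illustrative assumptions.
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")                    # assumed embedder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")       # assumed reranker

def retrieve_and_rerank(query: str, index, documents: list[str], k: int = 20, top: int = 4):
    # ANN retrieval (e.g. the Faiss HNSW index sketched in section 4).
    q = embedder.encode([query], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k)
    candidates = [documents[i] for i in ids[0] if i != -1]
    # Cross-encoder rerank: higher score means more relevant.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
    return ranked[:top]

def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context and cite sources.\n\n{context}\n\nQuestion: {query}"
```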
14) Implementation checklist
1. Pin down SLOs (p95 latency, tokens/s, availability) and load profiles.
2. Split the cluster into pools (serving/train/R&D); introduce quotas and priorities.
3. Enable RDMA/NCCL and topology-aware scheduling.
4. Set up storage: weights, datasets, feature store (online/offline), vector databases.
5. Choose the serving stack (Triton/KServe/vLLM); add batching/KV cache/quantization.
6. Stand up the model registry, CI/CD, canary/shadow deploys.
7. Add observability: system and business metrics, quality, tracing.
8. Introduce security/PII policies, licensing, auditing.
9. Optimize TCO: reserved + spot, autoscaling, caching, PEFT instead of full clones.
10. Prepare HA/DR and run game days.
15) Antipatterns
"One big GPU for all" without pools and priorities.
Lack of dynamic butching and KV cache for LLM → explosion of p99 and cost.
Training and serving on the same pool without preemption → SLO incidents.
Zero quality/safety telemetry → subtle degradation and risks.
Centralized monolith without phichester/model register → no reproducibility.
Ignoring scale/data licenses.
Summary
A successful AI infrastructure combines well-scheduled GPU pools, a fast network and the right storage, efficient serving (batching, caching, quantization, compilation), mature MLOps, and strict SLOs. Combined with security/PII controls, multi-region HA/DR, and deliberate FinOps, the platform delivers a stable p99, controlled $/request, and fast rollout of new models - from anti-fraud to personalization and LLM assistants.