AI infrastructure and GPU pools
(Section: Technology and Infrastructure)
Brief Summary
Production AI is not "one model on one server" but a cluster of GPU nodes, shared accelerator pools, unified serving, data/feature infrastructure, observability, and cost management. For iGaming this matters in real time: anti-fraud, personalization, chatbots, LLM assistants, game/promotion recommendations. The basic building blocks are Kubernetes/Slurm for scheduling, workload isolation, a high-speed network (100/200/400G with RDMA), fast storage, mature MLOps, and rock-solid SLOs.
1) Architectural map
Layers:
1. Compute cluster: GPU nodes (NVIDIA A/H class, AMD/ROCm, Intel Gaudi, etc.), CPU nodes for preprocessing/feature engineering.
2. Network: 100G+ Ethernet/InfiniBand, RDMA (RoCEv2), NCCL topologies, QoS.
3. Storage: object storage (S3-compatible), distributed POSIX (e.g., CephFS), local NVMe scratch.
4. Data/features: feature store (online/offline), vector databases (ANN), cache (Redis), queues.
5. ML platform: artifact and model registry, pipelines (CI/CD), version control, features as code.
6. Serving layer: Triton/KServe/vLLM/text-generation-inference (TGI), A/B and canary deploys, autoscaling.
7. Governance and security: PII, secrets, audit, export policies, weight/dataset licenses.
Typical workloads:
- Online scoring (p95 ≤ 50-150 ms): anti-fraud, recommendations, ranking.
- LLM serving (p95 ≤ 200-800 ms for 128-512 tokens): chat/agents/prompts.
- Batch analytics/re-training: night windows, offline metrics.
- Fine-tuning/adaptation: periodic, at a lower priority than online traffic.
2) GPU pools and scheduling
Pool model
Serving pool: short requests, high batching efficiency, strict SLOs.
Training/fine-tuning pool: long jobs, distributed training (DDP/FSDP).
R&D/experiments pool: quotas/limits, preemption allowed.
CPU pre-/post-processing pool: normalization, tokenization, reranking on CPU.
Schedulers
Kubernetes (+ device-plugin, NodeFeatureDiscovery, taints/tolerations, PriorityClass, PodPriority/Preemption).
Slurm (often for HPC training) - can be mixed with K8s through separate workers.
Fair share and quotas: namespace quotas for GPU, CPU, and memory; GPU-hour budgets ("banks"); per-namespace/per-project limits.
GPU partitioning
MIG (Multi-Instance GPU): slicing the accelerator into isolated instances (for serving/multi-tenancy).
MPS: SM sharing for small tasks (monitor interference).
NVLink/PCIe: use topology-aware scheduling.
Example (K8s pod pinned to the serving pool):
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: "ai-serving"
spec:
  nodeSelector: { gpu-pool: serving }
  tolerations: [{ key: "gpu", operator: "Exists", effect: "NoSchedule" }]
  priorityClassName: ai-serving-critical
```
3) Network and interconnect performance
RDMA (RoCEv2) for NCCL all-reduce collectives; ECN/PFC tuning, traffic-class isolation.
Locality: training within a single network fabric (pod/rack/optics); serving closer to the user (edge/region).
Congestion control: tuned profiles, jumbo frames, NIC/IRQ pinning.
4) Storage and data
Weight/artifact storage: object storage (versioning, immutability).
Datasets/features: lakehouse (Delta/Iceberg/Hudi) plus an offline feature store; online feature store with millisecond SLAs.
Vector databases (ANN): Faiss/ScaNN/accelerators or vendor vector engines; sharding, HNSW/IVF, replication (a minimal ANN sketch follows this list).
Local NVMe cache: warming up weights/embeddings to avoid cold starts.
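To make the vector-database item concrete, below is a minimal sketch of building and querying an HNSW index with Faiss; the dimensionality, data, and parameters are illustrative assumptions, and a managed vector engine could fill the same role. Sharding by tenant/locale and replication would sit on top of such an index.

```python
# A minimal ANN sketch with Faiss (HNSW); dimensionality, data and parameters
# are illustrative assumptions, not production values.
import numpy as np
import faiss

DIM = 384  # hypothetical embedding dimensionality

index = faiss.IndexHNSWFlat(DIM, 32)   # HNSW graph with 32 links per node
index.hnsw.efConstruction = 200        # build-time accuracy/speed trade-off
index.hnsw.efSearch = 64               # query-time accuracy/speed trade-off

# Index a batch of L2-normalized embeddings (e.g. game/promo descriptions).
embeddings = np.random.rand(10_000, DIM).astype("float32")
faiss.normalize_L2(embeddings)
index.add(embeddings)

# Query: top-10 nearest neighbors for a user/context embedding.
query = np.random.rand(1, DIM).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 10)
print(ids[0], distances[0])
```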
5) Model serving
Frameworks
Triton Inference Server (multi-model, multi-framework, dynamic batching).
KServe (K8s-native, HPA/KPA autoscaling, canary).
vLLM/TGI for high-throughput LLM decoding (PagedAttention, KV-cache offload).
ONNX Runtime/TensorRT-LLM - for compilation and acceleration.
Optimizations
Quantization: INT8/FP8/INT4 (percentile calibration, AWQ/GPTQ); apply online with care and measure quality.
Graph compilation: TensorRT, TorchInductor/XLA, fused kernels.
Batching/micro-batching: dynamic and static; for LLMs, continuous batching.
KV cache: sharing across requests, offload to CPU/NVMe for long contexts.
Speculative decoding: a draft model plus a verifier to speed up token generation.
Token/context limits, early stopping, stop words, a time budget per request (a client-side sketch follows this list).
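As a complement to the limits item above, here is a minimal client-side sketch of enforcing token and time budgets against an OpenAI-compatible endpoint such as the ones vLLM and TGI expose; the URL, model name, and budget values are assumptions, not recommendations.

```python
# A minimal sketch of per-request limits against an OpenAI-compatible
# endpoint; URL, model name and limits are illustrative assumptions.
import requests

ENDPOINT = "http://llm-serving.internal:8000/v1/chat/completions"  # hypothetical
TIME_BUDGET_S = 2.0   # hard client-side budget per request
MAX_TOKENS = 256      # cap response length to protect p99 and cost

payload = {
    "model": "assistant-8b",  # hypothetical model name
    "messages": [{"role": "user", "content": "Suggest three slots similar to X."}],
    "max_tokens": MAX_TOKENS,
    "temperature": 0.2,
    "stop": ["\n\n"],         # early stop on a blank line
}

try:
    resp = requests.post(ENDPOINT, json=payload, timeout=TIME_BUDGET_S)
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
except requests.Timeout:
    answer = None  # fall back to a cached/rule-based response on budget overrun
```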
Deploy policies
A/B, canary, shadow - comparison of latency/quality/business metrics.
Blue/green - zero-downtime releases.
Rollback on SLO violations/errors.
6) Training and fine-tuning
DDP/FSDP/ZeRO: sharding of memory/gradients across workers, with NVLink/topology awareness (a minimal DDP sketch follows this list).
Checkpoints: incremental/full; frequency vs I/O cost.
Mixed precision: bf16/fp16 + loss scaling; profile numerical stability.
Dataset sharding: deterministic iterators, replication across nodes.
Priorities: interruptible (preemptible) jobs that yield to serving.
End-to-end pipelines: data → train → eval → register → promote to PROD by gate criteria.
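To ground the DDP and mixed-precision items, a minimal PyTorch DDP sketch is shown below; it assumes a torchrun launch (so RANK/LOCAL_RANK/WORLD_SIZE are set), and the model, data, and checkpoint paths are placeholders. A launch would look like `torchrun --nproc_per_node=8 train.py`.

```python
# A minimal DDP sketch (PyTorch) with mixed precision and rank-0 checkpoints;
# the model and data are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # NCCL over RDMA/NVLink
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()              # fp16 loss scaling

    for step in range(100):                           # placeholder data/loop
        x = torch.randn(64, 512, device=f"cuda:{local_rank}")
        y = torch.randint(0, 2, (64,), device=f"cuda:{local_rank}")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()                 # gradients all-reduced by DDP
        scaler.step(optimizer)
        scaler.update()
        if step % 20 == 0 and dist.get_rank() == 0:
            torch.save(model.module.state_dict(), f"ckpt_{step}.pt")  # checkpoint

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```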
7) MLOps and platform
Model registry: versions, signatures, dependencies, licenses/rights to use the weights (a registry example follows this list).
Model CI/CD: compatibility tests, performance regressions, quality gates, safe deploys.
Feature store: offline/online consistency (feature parity), TTL and backfill.
Data/model lineage: traceability from dataset to report/experiment.
Catalog of prompts/templates for LLMs (versioned).
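The section does not prescribe a specific registry tool; as one possible illustration, the sketch below logs a run and registers a model version with MLflow. The experiment and model names are hypothetical.

```python
# A minimal sketch of logging and registering a model version, assuming
# MLflow as the registry; names and data are illustrative.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("antifraud-scoring")  # hypothetical experiment name

with mlflow.start_run() as run:
    X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("C", model.C)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # store the artifact with the run

# Promote the logged artifact into the registry as a new version.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "antifraud-scoring")
```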
8) Observability and SLO
Online metrics:
- Latency p50/p95/p99, tokens/s, batch occupancy, queue wait, GPU utilization/SM occupancy, memory, errors (a metrics-export sketch follows this section).
- LLM specifics: input/output tokens, average response length, share of requests cut off by limits, KV-cache hit rate.
- Quality: automatic regression tests (offline), online telemetry (content flags, toxicity, answer accuracy on gold sets).
- Business SLOs: personalization conversion, anti-fraud accuracy, retention.
Alerts: p99/queue growth, tokens/s drop, batch-fill degradation, VRAM exhaustion, PCIe/thermal throttling, growth in rate-limit rejections.
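As a sketch of the system-metrics side, the snippet below exposes latency, token-throughput, and queue-depth metrics with prometheus_client; metric names, bucket boundaries, and the scrape port are assumptions rather than a fixed convention.

```python
# A minimal metrics-export sketch with prometheus_client; the handler is a
# placeholder standing in for a real serving path.
import time
import random
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency",
    buckets=(0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2),
)
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Output tokens produced")
QUEUE_DEPTH = Gauge("llm_queue_depth", "Requests waiting for a batch slot")

def handle_request():
    """Placeholder handler that records the metrics alerting rules would use."""
    QUEUE_DEPTH.inc()
    start = time.time()
    try:
        n_tokens = random.randint(32, 256)  # stand-in for real generation
        time.sleep(0.05)
        TOKENS_GENERATED.inc(n_tokens)
    finally:
        REQUEST_LATENCY.observe(time.time() - start)
        QUEUE_DEPTH.dec()

if __name__ == "__main__":
    start_http_server(9100)                 # /metrics endpoint for Prometheus
    while True:
        handle_request()
```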
9) Security, compliance and privacy
PII/financial data: segmentation of compute and data by region, encryption at rest/in transit, tokenization.
Secrets/keys: KMS/Secrets Manager; never store them in images or code.
LLM output policies: safety filters, red-teaming, logging of prompts/responses (with anonymization).
Licenses: compliance with dataset/weight licenses; "no-redistribute"/commercial-use restrictions.
Tenant isolation: namespace RBAC, network policies, MIG slices, limits and quotas.
10) Cost and FinOps
Capacity planning: load profiles (RPS, tokens/s), traffic spikes from tournaments and campaigns.
Reserved/spot: mixed pools (reserved + spot/preemptible) with job requeueing and checkpointing.
Autoscaling: HPA/KPA on RPS/queue depth/GPU utilization; warm starts with preloaded weights.
Model zoo: reduce the number of variants; use adapters (LoRA/PEFT) instead of full copies.
Caching: embeddings and results of expensive requests, KV-cache sharing for LLMs (a caching sketch follows this list).
Token optimization: prompt compression, retrieval-augmented generation (RAG), reranking before generation.
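To illustrate the caching item, here is a minimal sketch of a TTL'd Redis cache keyed by a hash of the canonicalized input; the host, TTL, key scheme, and the `generate` callback are assumptions.

```python
# A minimal sketch of caching expensive results (embeddings, LLM answers) in
# Redis with a TTL; connection details and the key scheme are illustrative.
import hashlib
import json
import redis

r = redis.Redis(host="redis.internal", port=6379, db=0)  # hypothetical host
TTL_SECONDS = 3600

def cache_key(prefix: str, text: str) -> str:
    canonical = " ".join(text.lower().split())            # cheap canonicalization
    return f"{prefix}:{hashlib.sha256(canonical.encode()).hexdigest()}"

def cached_answer(prompt: str, generate) -> str:
    """Return a cached answer if present, otherwise generate and store it."""
    key = cache_key("llm", prompt)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    answer = generate(prompt)                              # expensive call
    r.setex(key, TTL_SECONDS, json.dumps(answer))
    return answer
```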
11) Multi-region, HA and DR
Active/active serving close to the user, global routing (latency-based).
Replication of weights and features with integrity checks; cache warm-up during releases.
DR plan: loss of an AZ/region, evacuation to a backup pool, control of dependencies on a centralized registry.
Chaos days: GPU node/network domain/storage failure drills.
12) Configuration templates (concepts)
Triton - dynamic batching:
```text
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 2000
}
instance_group { count: 2 kind: KIND_GPU }
```
KServe - canary:
```yaml
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat: { name: triton }
      resources:
        limits: { nvidia.com/gpu: "1" }
```
vLLM - launch flags (illustrative):
```text
--tensor-parallel-size 2
--max-num-seqs 512
--gpu-memory-utilization 0.9
--enforce-eager
```
13) LLM specifics: RAG and retrieval loop
Indexing: chunking, embeddings, ANN sharding by tenant/locale (a retrieval/rerank sketch follows this list).
Reranking: a lightweight model on CPU or a GPU slice to improve accuracy.
Prompt/context cache: deduplication, canonicalization.
Citation and liability policies for sensitive domains (responsible-gaming/compliance rules).
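A minimal sketch of the retrieve → rerank → prompt loop is shown below, assuming sentence-transformers models for embedding and cross-encoder reranking and an ANN index like the Faiss example in section 4; the model names and helper signatures are illustrative.

```python
# A minimal RAG-loop sketch: embed the query, retrieve from an ANN index,
# rerank with a cross-encoder, and build the prompt. Model names and the
# `index`/`documents` objects are illustrative assumptions.
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")                    # assumed embedder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")       # assumed reranker

def retrieve_and_rerank(query: str, index, documents: list[str], k: int = 20, top: int = 4):
    # ANN retrieval (e.g. the Faiss HNSW index sketched in section 4).
    q = embedder.encode([query], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k)
    candidates = [documents[i] for i in ids[0] if i != -1]
    # Cross-encoder rerank: higher score means more relevant.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
    return ranked[:top]

def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context and cite sources.\n\n{context}\n\nQuestion: {query}"
```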
14) Implementation checklist
1. Pin down SLOs (p95 latency, tokens/s, availability) and load profiles.
2. Split the cluster into pools (serving/train/R&D); introduce quotas and priorities.
3. Enable RDMA/NCCL and topology-aware scheduling.
4. Set up storage: weights, datasets, feature store (online/offline), vector databases.
5. Choose the serving stack (Triton/KServe/vLLM); add batching/KV cache/quantization.
6. Stand up the model registry, CI/CD, canary/shadow deploys.
7. Add observability: system and business metrics, quality, tracing.
8. Introduce security/PII policies, licensing, auditing.
9. Optimize TCO: reserved + spot, autoscaling, caching, PEFT instead of full clones.
10. Prepare HA/DR and run game days.
15) Antipatterns
"One big GPU for all" without pools and priorities.
Lack of dynamic butching and KV cache for LLM → explosion of p99 and cost.
Training and serving on the same pool without preemption → SLO incidents.
Zero quality/safety telemetry → subtle degradation and risks.
Centralized monolith without phichester/model register → no reproducibility.
Ignoring scale/data licenses.
Summary
A successful AI infrastructure combines well-scheduled GPU pools, a fast network and the right storage, efficient serving (batching, caching, quantization, compilation), mature MLOps, and strict SLOs. Combined with security/PII controls, multi-region HA/DR, and deliberate FinOps, the platform delivers a stable p99, controlled $/request, and fast rollout of new models - from anti-fraud to personalization and LLM assistants.