GPU nodes and ML workloads
Brief Summary
A successful ML stack on GPUs is a combination of hardware, software, scheduling, data and observability decisions. The cluster should be equally good at:
1. training models (high utilization, fast checkpoints, tolerance to interruptions),
2. serving inference (low p95 latency at high throughput),
3. costing a predictable amount of money (FinOps, quotas, multi-tenancy),
4. staying secure (isolation, supply chain, control over weights/datasets).
Hardware and topologies
GPU and memory
HBM capacity and bandwidth matter more than "raw TFLOPS" for LLM/RecSys workloads.
For inference of many small requests, prioritize on-board memory (KV-cache) and high clocks/power limit.
Connectivity
NVLink/NVSwitch - inside the node for fast all-reduce.
InfiniBand/RoCE - inter-node exchange for DDP/FSDP (≥ 100-200 Gb/s).
PCIe tree: try to keep the NIC and GPU on the same NUMA node; avoid bottlenecks on an oversubscribed PCIe switch.
Basic BIOS/Host Tuning
Performance mode, deep C-states disabled (or a raised minimum C-state), NUMA awareness, ASPM off on critical PCIe links.
Power: stable profiles, no aggressive power-save - otherwise p99 jitters.
Base software stack
Drivers matched to the NVIDIA CUDA + cuDNN/TensorRT compatibility matrix.
NVIDIA Container Toolkit for GPUs inside containers.
NCCL (collectives), UCX (transport), Apex/xFormers/Flash-Attention - for speed.
Optional GDS (GPUDirect Storage) on fast NVMe/IB - speeds up data flow.
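A quick way to confirm that the stack inside a container is consistent is to query the framework itself; a minimal sketch with PyTorch (versions and device names will differ per cluster):

```python
# Minimal sanity check of the driver/CUDA/cuDNN/NCCL stack as seen by PyTorch.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA (toolkit):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory // 2**20} MiB HBM")
```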
Kubernetes for GPU
Key Components
NVIDIA GPU Operator (drivers, DCGM, device-plugin).
NVIDIA Device Plugin - exports the `nvidia.com/gpu` resource.
MIG (A100/H100) - partitioning of one physical GPU into isolated profiles (for example, `1g.10gb`).
Time-slicing - logical time-sharing of a GPU for small inference tasks.
Node Feature Discovery - labels by GPU type/topology.
Scheduling and isolation
Taints/Tolerations/NodeSelectors to separate training/inference/experiments.
Topology Manager and CPU Manager (static) for NUMA alignment.
Volcano/Slurm on K8s/Ray - queues, priorities, preemption for big jobs.
```yaml
resources:
  limits:
    nvidia.com/gpu: 1        # or a MIG profile: nvidia.com/mig-1g.10gb: 1
  requests:
    nvidia.com/gpu: 1
```
Example of taint/affinity for a dedicated training pool:
```yaml
tolerations:
  - key: "gpu-train"
    operator: "Exists"
    effect: "NoSchedule"
nodeSelector:
  gpu.pool: "train"
```
Training: scale and resilience
Parallelism
DDP - standard data parallelism.
FSDP/ZeRO - sharding of parameters/gradients/optimizer state, reduces memory footprint.
Tensor/Pipeline Parallel - for very large LLMs; requires NVLink/IB.
Gradient Accumulation - increases the effective batch size without raising memory peaks (DDP + accumulation sketch below).
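A minimal sketch of DDP with gradient accumulation, assuming a launch via torchrun (which sets LOCAL_RANK) and pre-existing `model`, `dataset` and `loss_fn`; `no_sync()` skips the all-reduce on intermediate micro-batches:

```python
# Sketch: DDP + gradient accumulation (assumes launch via torchrun).
import os, contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, loss_fn, epochs=1, accum_steps=4):
    dist.init_process_group(backend="nccl")               # NCCL collectives over NVLink/IB
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler, pin_memory=True)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                           # reshuffle shards each epoch
        for step, (x, y) in enumerate(loader):
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            last_micro = (step + 1) % accum_steps == 0
            # Skip the gradient all-reduce on intermediate micro-batches.
            ctx = contextlib.nullcontext() if last_micro else model.no_sync()
            with ctx:
                (loss_fn(model(x), y) / accum_steps).backward()
            if last_micro:
                opt.step()
                opt.zero_grad(set_to_none=True)
    dist.destroy_process_group()
```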
Mixed precision and memory optimizations
AMP (bf16/fp16) + loss scaling; on H100/newer - FP8 where possible (example after this list).
Activation/Gradient Checkpointing, Flash-Attention for long sequences.
Paged/Chunked KV-cache to prepare for inference.
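A minimal AMP sketch, assuming `model`, `optimizer` and `loader` already exist; fp16 needs the GradScaler, while bf16 on A100/H100 can usually drop it:

```python
# Sketch: mixed-precision training step with fp16 + loss scaling.
import torch

scaler = torch.cuda.amp.GradScaler()                      # not needed for pure bf16

for x, y in loader:
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                          # scale to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```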
Checkpoints and fault tolerance
Frequent incremental checkpoints to fast NVMe/object storage with retention.
Idempotent jobs (reproducible run identifiers).
Spot resilience: catch SIGTERM, quickly flush state; the scheduler returns the job to the queue.
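One possible wiring, as a sketch (the checkpoint path and interval are assumptions): trap SIGTERM, finish the current step, write the checkpoint atomically, and let the queue restart the job from it.

```python
# Sketch: SIGTERM-aware checkpointing for spot/preemptible nodes.
import os, signal
import torch

CKPT = "/checkpoints/run-001/last.pt"      # hypothetical fast-NVMe path
stop_requested = False

def _on_sigterm(signum, frame):
    global stop_requested
    stop_requested = True                   # finish the current step, then flush state

signal.signal(signal.SIGTERM, _on_sigterm)

def save_ckpt(model, optimizer, step):
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "opt": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT)                   # atomic rename: never a half-written file

def load_ckpt(model, optimizer):
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["opt"])
    return state["step"] + 1

# In the training loop:
#   if step % 500 == 0 or stop_requested: save_ckpt(model, optimizer, step)
#   if stop_requested: break                # exit cleanly; the scheduler re-queues the job
```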
Important NCCL/Network Variables (Example)
```bash
NCCL_IB_HCA=mlx5_0
NCCL_SOCKET_IFNAME=eth1
NCCL_P2P_LEVEL=NVL
NCCL_MIN_NRINGS=8
NCCL_NET_GDR_LEVEL=SYS
```
Inference: low latency, high throughput
Serving frameworks
Triton Inference Server - a single serving layer for TensorRT/ONNX/TorchScript/PyTorch backends.
vLLM/TGI/TensorRT-LLM - LLM-focused servers (paged attention, efficient KV-cache, continuous batching).
Acceleration techniques
Quantization: INT8/FP8/quantization-aware (AWQ, GPTQ) - lower VRAM, higher TPS.
Batching/Continuous batching: absorb bursts of requests without p95 growth (vLLM sketch after this list).
KV-cache pinning in HBM, context reduction; speculative decoding (draft model).
Concurrency on the GPU: multiple streams/models with MIG/time-slicing.
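As an illustration of continuous batching and a paged KV-cache, a sketch built on vLLM's offline API; the model id, memory fraction and sampling settings are placeholders:

```python
# Sketch: continuous batching + paged KV-cache via vLLM (placeholders throughout).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model id
    gpu_memory_utilization=0.90,                # leave HBM headroom for the KV-cache
    max_model_len=4096,
)
params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM schedules these requests together (continuous batching), so short and
# long prompts share the GPU without blocking each other.
outputs = llm.generate(
    ["Hello!", "Summarize what NCCL does in one sentence."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```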
Target profiles (SLO example)
p95 latency of a chat-model response ≤ 300 ms per prefill/token;
throughput ≥ 200 tokens/s/GPU on the target profile;
p99 tails kept in check by scheduling (QoS classes and context-length limits).
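A tiny offline helper for checking such targets against collected latencies; the thresholds mirror the numbers above, everything else is an assumption:

```python
# Sketch: p95/p99 and error-budget check from a list of per-request latencies (ms).
import numpy as np

def slo_report(latencies_ms, p95_target_ms=300.0, success_target=0.995):
    lat = np.asarray(latencies_ms, dtype=float)
    within = float((lat <= p95_target_ms).mean())
    return {
        "p95_ms": float(np.percentile(lat, 95)),
        "p99_ms": float(np.percentile(lat, 99)),
        "share_within_target": within,
        "budget_ok": within >= success_target,
    }
```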
Triton deployment (fragment)
```yaml
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"
  - name: TRITONSERVER_MODEL_CONTROL
    value: "explicit"
args: ["--backend-config=tensorrt,output_memory_pool_size=1024"]
```
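For completeness, a sketch of calling such a server from Python with the `tritonclient` package; the model name, tensor names and shapes are hypothetical and must match your model repository:

```python
# Sketch: HTTP inference request to Triton (names/shapes are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

tokens = np.zeros((1, 128), dtype=np.int64)               # dummy input batch
inp = httpclient.InferInput("input_ids", list(tokens.shape), "INT64")
inp.set_data_from_numpy(tokens)

result = client.infer(model_name="my_llm", inputs=[inp])
logits = result.as_numpy("logits")                        # output tensor by name
print(logits.shape)
```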
Data and pipelines
Formats: Parquet/Arrow, webdataset (tar-shards) for streaming reading.
Prefetch/Async I/O: DataLoaders with pin-memory, prefetch pipelines, GDS (see the DataLoader sketch below).
Feature Store for online features (anti-fraud/recommendations).
Versioning: DVC/LakeFS/MLflow Model Registry; capture datasets, code, and hyperparameters.
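The DataLoader sketch referenced above, assuming `dataset` yields CPU tensors (e.g. decoded Parquet/webdataset shards):

```python
# Sketch: async input pipeline with pinned memory and prefetch.
import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,            # parallel decode/augmentation workers
    pin_memory=True,          # page-locked host buffers -> faster H2D copies
    prefetch_factor=4,        # batches queued ahead per worker
    persistent_workers=True,  # keep workers alive across epochs
)

for x, y in loader:
    x = x.cuda(non_blocking=True)   # overlap the copy with GPU compute
    y = y.cuda(non_blocking=True)
    # ...forward/backward...
```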
Observability and SLO
DCGM/Prometheus metrics (minimum)
`dcgm_sm_util`, `dcgm_fb_used`, `dcgm_power_usage`, `dcgm_pcie_rx/tx`, `dcgm_dram_bw`
Temperatures/frequencies and ECC errors (alert for growth).
Achieved occupancy and stall reasons (to pinpoint bottleneck kernels).
Service Metrics
Generative models: tokens/sec, p50/p95/p99, queue depth, out-of-memory errors.
Training: steps/sec, epoch time, all-reduce efficiency, % of time spent in I/O.
SLO panel: p95 compliance, "error budget" (≥ 99.5% "successful" inference).
Alerting (ideas)
`fb_used / fb_total > 0.95` for 5 min → throttle/scale out.
TPS drop of N% at unchanged utilization - model/code degradation.
ECC errors/temperature growth → migrate jobs off the hardware, open an incident.
Security and isolation
Multi-tenancy: MIG profiles or per-team nodes, namespaces/quotas.
IOMMU/PSP, cgroups, no privileged containers, restricted Linux capabilities (CAP_*).
MPS (Multi-Process Service) - use with care: higher utilization, but weaker isolation than MIG.
Supply chain: container signatures (cosign), verification of artifacts, control of model uploads.
Data/weights: encryption on disk, access control (ABAC/RBAC), watermarks/hash registers of models.
FinOps: cost, quotas, autoscale
Node pools: `train` (on-demand/reserved), `infer` (mix of on-demand + spot), `exp` (spot-heavy).
Spot resilience: frequent checkpoints, fast restart logic, Volcano queues with priorities.
Reserved instances/Savings Plans for the stable baseline; auto-shutdown of idle nodes.
Right-sizing models: quantization/LoRA adapters instead of the "full" model; pick MIG profiles to match the SLA.
Budget guardrails: per-team GPU-hour quotas, "cost per 1k requests/tokens" (quick estimate below).
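A back-of-envelope estimate of "cost per 1k tokens" from the GPU-hour price and sustained throughput; the numbers are illustrative:

```python
# Sketch: FinOps unit economics for generation (illustrative numbers).
def cost_per_1k_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600.0
    return gpu_hour_usd / tokens_per_hour * 1000.0

# E.g. a $2.50/h GPU sustaining 200 tokens/s ≈ $0.0035 per 1k tokens.
print(round(cost_per_1k_tokens(2.50, 200.0), 4))
```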
YAML Patterns and Artifacts
1) MIG profile (conceptual)
```yaml
apiVersion: nvidia.com/v1
kind: MigStrategy
metadata: { name: mig-a100-1g10gb }
spec:
  deviceFilter: "a100"
  mode: single
  resources:
    - profile: "1g.10gb"
      count: 7
```
2) Volcano queue for training
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: train-q }
spec:
  weight: 100
  reclaimable: true
  capability:
    nvidia.com/gpu: 64
```
3) KEDA for queue-depth-based inference autoscaling
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: llm-infer }
spec:
  scaleTargetRef: { name: llm-deploy }
  pollingInterval: 5
  minReplicaCount: 2
  maxReplicaCount: 80
  triggers:
    - type: rabbitmq
      metadata:
        queueName: infer-queue
        mode: QueueLength
        value: "200"
```
GPU Cluster Startup Checklist
- NVLink/IB topology map; NIC and GPU on the same NUMA node.
- Drivers/CUDA consistent, Operator/Device-plugin installed.
- MIG/time-slicing profiles and quotas for namespaces.
- DDP/FSDP pipeline tested on staging; checkpoints are fast.
- Triton/vLLM with continuous batching; p95 and TPS targets are set.
- DCGM/Prometheus/Grafana + alerts on ECC/temperature/memory/TPS.
- Security policies (PSP, cosign, weight obfuscation/control).
- FinOps: spot/RI pools, $/1k-tokens report, idle auto-shutdown.
Common errors
Training and inference mixed on the same nodes without taints → workloads contend for GPU/IO.
No checkpoints and preemption logic → loss of progress on spot.
No DCGM metrics → utilization and overheating stay invisible.
Ignoring NUMA/PCIe topology → low NCCL bandwidth.
Poorly chosen MIG/time-slicing profiles → p99 latency spikes and out-of-memory errors.
HPA on CPU instead of TPS/latency → scaling kicks in too late.
iGaming/fintech specifics
Antifraud/scoring: inference SLA ≤ 50 ms p95 on critical paths (payments/withdrawals); keep a lightweight "fallback" model.
Recommendations/personalization: on-policy/off-policy training at night, online features with low latency.
Chat assistants/RAG: content caching, request deduplication, guardrails; shard vector-search indices.
Peaks (matches/tournaments): pre-warm models/KV-cache, raise minReplicas, QoS classes for VIPs.
Summary
A GPU computing stack becomes truly efficient when hardware (HBM/NVLink/IB), the software matrix (CUDA/NCCL), scheduling (MIG, queues, taints), data (fast pipelines/GDS), observability (DCGM/SLO) and cost (FinOps/quotas) work in concert. Pin this down in IaC and cluster policy, and you get predictable training speed, stable low-p95-latency inference, and a transparent GPU-hour economy.