GPU nodes and ML workloads
Brief Summary
A successful ML stack on GPUs is a combination of hardware, software, scheduling, data and observability decisions. The cluster should be equally good at:
1. training models (high utilization, fast checkpoints, tolerance to interruptions),
2. serving inference (low p95 latency at high throughput),
3. costing a predictable amount of money (FinOps, quotas, multi-tenancy),
4. staying secure (isolation, supply chain, control over weights/datasets).
Hardware and topologies
GPU and memory
HBM capacity and bandwidth matter more than "raw TFLOPS" for LLM/RecSys workloads.
For inference of many small requests, prioritize on-board memory (KV-cache) and high clocks/power limit.
Connectivity
NVLink/NVSwitch - inside the node for fast all-reduce.
InfiniBand/RoCE - inter-node exchange for DDP/FSDP (≥ 100-200 Gb/s).
PCIe tree: try to keep the NIC and GPU on the same NUMA node; avoid bottlenecks on an oversubscribed PCIe switch.
Basic BIOS/Host Tuning
Performance mode, deep C-states disabled (or a raised minimum C-state), NUMA awareness, ASPM off on critical PCIe links.
Power: stable profiles, no aggressive power-save - otherwise p99 jitters.
Base software stack
Drivers matched to the NVIDIA CUDA + cuDNN/TensorRT compatibility matrix.
NVIDIA Container Toolkit for GPUs inside containers.
NCCL (collectives), UCX (transport), Apex/xFormers/Flash-Attention - for speed.
Optional GDS (GPUDirect Storage) on fast NVMe/IB - speeds up data flow.
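A quick way to confirm that the stack inside a container is consistent is to query the framework itself; a minimal sketch with PyTorch (versions and device names will differ per cluster):

```python
# Minimal sanity check of the driver/CUDA/cuDNN/NCCL stack as seen by PyTorch.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA (toolkit):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory // 2**20} MiB HBM")
```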
Kubernetes for GPU
Key Components
NVIDIA GPU Operator (drivers, DCGM, device-plugin).
NVIDIA Device Plugin - exports the `nvidia.com/gpu` resource.
MIG (A100/H100) - partitioning of one physical GPU into isolated profiles (for example, `1g.10gb`).
Time-slicing - logical time-sharing of a GPU for small inference tasks.
Node Feature Discovery - labels by GPU type/topology.
Scheduling and isolation
Taints/Tolerations/NodeSelectors to separate training/inference/experiments.
Topology Manager and CPU Manager (static) for NUMA alignment.
Volcano/Slurm on K8s/Ray - queues, priorities, preemption for big jobs.
```yaml
resources:
  limits:
    nvidia.com/gpu: 1        # or a MIG profile: nvidia.com/mig-1g.10gb: 1
  requests:
    nvidia.com/gpu: 1
```
Example of taint/affinity for a dedicated training pool:
```yaml
tolerations:
  - key: "gpu-train"
    operator: "Exists"
    effect: "NoSchedule"
nodeSelector:
  gpu.pool: "train"
```
Training: scale and resilience
Parallelism
DDP - standard data parallelism.
FSDP/ZeRO - sharding of parameters/gradients/optimizer state, reduces memory footprint.
Tensor/Pipeline Parallel - for very large LLMs; requires NVLink/IB.
Gradient Accumulation - increases the effective batch size without raising memory peaks (DDP + accumulation sketch below).
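A minimal sketch of DDP with gradient accumulation, assuming a launch via torchrun (which sets LOCAL_RANK) and pre-existing `model`, `dataset` and `loss_fn`; `no_sync()` skips the all-reduce on intermediate micro-batches:

```python
# Sketch: DDP + gradient accumulation (assumes launch via torchrun).
import os, contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, loss_fn, epochs=1, accum_steps=4):
    dist.init_process_group(backend="nccl")               # NCCL collectives over NVLink/IB
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler, pin_memory=True)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                           # reshuffle shards each epoch
        for step, (x, y) in enumerate(loader):
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            last_micro = (step + 1) % accum_steps == 0
            # Skip the gradient all-reduce on intermediate micro-batches.
            ctx = contextlib.nullcontext() if last_micro else model.no_sync()
            with ctx:
                (loss_fn(model(x), y) / accum_steps).backward()
            if last_micro:
                opt.step()
                opt.zero_grad(set_to_none=True)
    dist.destroy_process_group()
```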
Mixed precision and memory optimizations
AMP (bf16/fp16) + loss scaling; on H100/newer - FP8 where possible (example after this list).
Activation/Gradient Checkpointing, Flash-Attention for long sequences.
Paged/Chunked KV-cache to prepare for inference.
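A minimal AMP sketch, assuming `model`, `optimizer` and `loader` already exist; fp16 needs the GradScaler, while bf16 on A100/H100 can usually drop it:

```python
# Sketch: mixed-precision training step with fp16 + loss scaling.
import torch

scaler = torch.cuda.amp.GradScaler()                      # not needed for pure bf16

for x, y in loader:
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                          # scale to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```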
Checkpoints and fault tolerance
Frequent incremental checkpoints to fast NVMe/object storage with retention.
Idempotent jobs (reproducible run identifiers).
Spot resilience: catch SIGTERM, quickly flush state; the scheduler returns the job to the queue.
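One possible wiring, as a sketch (the checkpoint path and interval are assumptions): trap SIGTERM, finish the current step, write the checkpoint atomically, and let the queue restart the job from it.

```python
# Sketch: SIGTERM-aware checkpointing for spot/preemptible nodes.
import os, signal
import torch

CKPT = "/checkpoints/run-001/last.pt"      # hypothetical fast-NVMe path
stop_requested = False

def _on_sigterm(signum, frame):
    global stop_requested
    stop_requested = True                   # finish the current step, then flush state

signal.signal(signal.SIGTERM, _on_sigterm)

def save_ckpt(model, optimizer, step):
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "opt": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT)                   # atomic rename: never a half-written file

def load_ckpt(model, optimizer):
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["opt"])
    return state["step"] + 1

# In the training loop:
#   if step % 500 == 0 or stop_requested: save_ckpt(model, optimizer, step)
#   if stop_requested: break                # exit cleanly; the scheduler re-queues the job
```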
Important NCCL/Network Variables (Example)
```bash
NCCL_IB_HCA=mlx5_0
NCCL_SOCKET_IFNAME=eth1
NCCL_P2P_LEVEL=NVL
NCCL_MIN_NRINGS=8
NCCL_NET_GDR_LEVEL=SYS
```
Inference: low latency, high throughput
Serving frameworks
Triton Inference Server - a single serving layer for TensorRT/ONNX/TorchScript/PyTorch backends.
vLLM/TGI/TensorRT-LLM - LLM-focused servers (paged attention, efficient KV-cache, continuous batching).
Acceleration techniques
Quantization: INT8/FP8/quantization-aware (AWQ, GPTQ) - lower VRAM, higher TPS.
Batching/Continuous batching: absorb bursts of requests without p95 growth (vLLM sketch after this list).
KV-cache pinning in HBM, context reduction; speculative decoding (draft model).
Concurrency on the GPU: multiple streams/models with MIG/time-slicing.
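As an illustration of continuous batching and a paged KV-cache, a sketch built on vLLM's offline API; the model id, memory fraction and sampling settings are placeholders:

```python
# Sketch: continuous batching + paged KV-cache via vLLM (placeholders throughout).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model id
    gpu_memory_utilization=0.90,                # leave HBM headroom for the KV-cache
    max_model_len=4096,
)
params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM schedules these requests together (continuous batching), so short and
# long prompts share the GPU without blocking each other.
outputs = llm.generate(
    ["Hello!", "Summarize what NCCL does in one sentence."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```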
Target profiles (SLO example)
p95 latency of a chat-model response ≤ 300 ms per prefill/token;
throughput ≥ 200 tokens/s/GPU on the target profile;
p99 tails kept in check by scheduling (QoS classes and context-length limits).
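A tiny offline helper for checking such targets against collected latencies; the thresholds mirror the numbers above, everything else is an assumption:

```python
# Sketch: p95/p99 and error-budget check from a list of per-request latencies (ms).
import numpy as np

def slo_report(latencies_ms, p95_target_ms=300.0, success_target=0.995):
    lat = np.asarray(latencies_ms, dtype=float)
    within = float((lat <= p95_target_ms).mean())
    return {
        "p95_ms": float(np.percentile(lat, 95)),
        "p99_ms": float(np.percentile(lat, 99)),
        "share_within_target": within,
        "budget_ok": within >= success_target,
    }
```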
Triton deployment (fragment)
```yaml
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "0"
  - name: TRITONSERVER_MODEL_CONTROL
    value: "explicit"
args: ["--backend-config=tensorrt,output_memory_pool_size=1024"]
```
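For completeness, a sketch of calling such a server from Python with the `tritonclient` package; the model name, tensor names and shapes are hypothetical and must match your model repository:

```python
# Sketch: HTTP inference request to Triton (names/shapes are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

tokens = np.zeros((1, 128), dtype=np.int64)               # dummy input batch
inp = httpclient.InferInput("input_ids", list(tokens.shape), "INT64")
inp.set_data_from_numpy(tokens)

result = client.infer(model_name="my_llm", inputs=[inp])
logits = result.as_numpy("logits")                        # output tensor by name
print(logits.shape)
```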
Data and pipelines
Formats: Parquet/Arrow, webdataset (tar-shards) for streaming reading.
Prefetch/Async I/O: DataLoaders with pin-memory, prefetch pipelines, GDS (see the DataLoader sketch below).
Feature Store for online features (anti-fraud/recommendations).
Versioning: DVC/LakeFS/MLflow Model Registry; capture datasets, code, and hyperparameters.
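The DataLoader sketch referenced above, assuming `dataset` yields CPU tensors (e.g. decoded Parquet/webdataset shards):

```python
# Sketch: async input pipeline with pinned memory and prefetch.
import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,            # parallel decode/augmentation workers
    pin_memory=True,          # page-locked host buffers -> faster H2D copies
    prefetch_factor=4,        # batches queued ahead per worker
    persistent_workers=True,  # keep workers alive across epochs
)

for x, y in loader:
    x = x.cuda(non_blocking=True)   # overlap the copy with GPU compute
    y = y.cuda(non_blocking=True)
    # ...forward/backward...
```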
Observability and SLO
DCGM/Prometheus metrics (minimum)
`dcgm_sm_util`, `dcgm_fb_used`, `dcgm_power_usage`, `dcgm_pcie_rx/tx`, `dcgm_dram_bw`
Temperatures/frequencies and ECC errors (alert for growth).
Achieved occupancy and stall reasons (to pinpoint bottleneck kernels).
Service Metrics
Generative models: tokens/sec, p50/p95/p99, queue depth, out-of-memory errors.
Training: steps/sec, epoch time, all-reduce efficiency, % of time spent in I/O.
SLO panel: p95 compliance, "error budget" (≥ 99.5% "successful" inference).
Alerting (ideas)
`fb_used / fb_total > 0.95` for 5 min → throttle/scale out.
TPS drop of N% at unchanged utilization - model/code degradation.
ECC errors/temperature growth → migrate jobs off the hardware, open an incident.
Security and isolation
Multi-tenancy: MIG profiles or per-team nodes, namespaces/quotas.
IOMMU/PSP, cgroups, no privileged containers, restricted Linux capabilities (CAP_*).
MPS (Multi-Process Service) - use with care: higher utilization, but weaker isolation than MIG.
Supply chain: container signatures (cosign), verification of artifacts, control of model uploads.
Data/weights: encryption on disk, access control (ABAC/RBAC), watermarks/hash registers of models.
FinOps: cost, quotas, autoscale
Node pools: `train` (on-demand/reserved), `infer` (mix of on-demand + spot), `exp` (spot-heavy).
Spot resilience: frequent checkpoints, fast restart logic, Volcano queues with priorities.
Reserved instances/Savings Plans for the stable baseline; auto-shutdown of idle nodes.
Right-sizing models: quantization/LoRA adapters instead of the "full" model; pick MIG profiles to match the SLA.
Budget guardrails: per-team GPU-hour quotas, "cost per 1k requests/tokens" (quick estimate below).
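A back-of-envelope estimate of "cost per 1k tokens" from the GPU-hour price and sustained throughput; the numbers are illustrative:

```python
# Sketch: FinOps unit economics for generation (illustrative numbers).
def cost_per_1k_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600.0
    return gpu_hour_usd / tokens_per_hour * 1000.0

# E.g. a $2.50/h GPU sustaining 200 tokens/s ≈ $0.0035 per 1k tokens.
print(round(cost_per_1k_tokens(2.50, 200.0), 4))
```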
YAML Patterns and Artifacts
1) MIG profile (conceptual)
```yaml
apiVersion: nvidia.com/v1
kind: MigStrategy
metadata: { name: mig-a100-1g10gb }
spec:
  deviceFilter: "a100"
  mode: single
  resources:
    - profile: "1g.10gb"
      count: 7
```
2) Volcano queue for training
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata: { name: train-q }
spec:
  weight: 100
  reclaimable: true
  capability:
    nvidia.com/gpu: 64
```
3) KEDA for queue-depth-based inference autoscaling
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: llm-infer }
spec:
  scaleTargetRef: { name: llm-deploy }
  pollingInterval: 5
  minReplicaCount: 2
  maxReplicaCount: 80
  triggers:
    - type: rabbitmq
      metadata:
        queueName: infer-queue
        mode: QueueLength
        value: "200"
```
GPU Cluster Startup Checklist
- NVLink/IB topology map; NIC and GPU on the same NUMA node.
- Drivers/CUDA consistent, Operator/Device-plugin installed.
- MIG/time-slicing profiles and quotas for namespaces.
- DDP/FSDP pipeline tested on staging; checkpoints are fast.
- Triton/vLLM with continuous batching; p95 and TPS targets are set.
- DCGM/Prometheus/Grafana + alerts on ECC/temperature/memory/TPS.
- Security policies (PSP, cosign, weight obfuscation/control).
- FinOps: spot/RI pools, $/1k-tokens report, idle auto-shutdown.
Common errors
Training and inference mixed on the same nodes without taints → workloads contend for GPU/IO.
No checkpoints and preemption logic → loss of progress on spot.
No DCGM metrics → utilization and overheating stay invisible.
Ignoring NUMA/PCIe topology → low NCCL bandwidth.
Poorly chosen MIG/time-slicing profiles → p99 latency spikes and out-of-memory errors.
HPA on CPU instead of TPS/latency → scaling kicks in too late.
iGaming/fintech specifics
Antifraud/scoring: inference SLA ≤ 50 ms p95 on critical paths (payments/withdrawals); keep a lightweight "fallback" model.
Recommendations/personalization: on-policy/off-policy training at night, online features with low latency.
Chat assistants/RAG: content caching, request deduplication, guardrails; shard vector-search indices.
Peaks (matches/tournaments): pre-warm models/KV-cache, raise minReplicas, QoS classes for VIPs.
Summary
A GPU computing stack becomes truly efficient when hardware (HBM/NVLink/IB), the software matrix (CUDA/NCCL), scheduling (MIG, queues, taints), data (fast pipelines/GDS), observability (DCGM/SLO) and cost (FinOps/quotas) work in concert. Pin this down in IaC and cluster policy, and you get predictable training speed, stable low-p95-latency inference, and a transparent GPU-hour economy.