GambleHub

Block Storage and Performance

Brief Summary

Block storage exposes raw devices (LUNs/volumes) on top of which you build a file system, LVM/ZFS, etc. Performance is determined by media type, access protocol, queues and queue depth, block size, coding scheme (RAID/EC), caches and write barriers, network fabric, and the application's I/O pattern (random/sequential, read/write, sync/async). The goal is to deliver the required p95/p99 latency and IOPS/bandwidth with robustness and predictability.

Block access taxonomy

Local: NVMe (PCIe), SAS/SATA SSD/HDD. Minimal latency, no network bottlenecks.

Network:
  • iSCSI (Ethernet, LUNs, MPIO, ALUA).
  • Fibre Channel (FC) (16-64G, low latency, zoning).
  • NVMe-oF: NVMe/TCP, NVMe/RoCE, NVMe/FC - "native" NVMe over the network, less overhead.
  • HCI/distributed (Ceph RBD, vSAN): convenient scalability, but latency is higher; the network and coding scheme are critical.
Selection (signals):
  • p99 latency ≤ 1-2 ms, very high IOPS → local NVMe or NVMe-oF.
  • Stable average latency of 2-5 ms, mature ecosystem → FC or NVMe/FC fabric.
  • Unified on Ethernet, easier to operate → iSCSI or NVMe/TCP.

Protocols and their features

iSCSI: versatile, MPIO/ALUA; sensitive to TCP configuration (MTU, offloads, qdepth).
FC: isolation, lossless fabric, WWPN zoning, HBA queues and credits.
NVMe-oF: parallelism via multiple Submission/Completion Queues, low CPU overhead; TLS is available for NVMe/TCP (where required).
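As a sketch of the NVMe-oF workflow, connecting an NVMe/TCP namespace with nvme-cli might look like this (the address, port, and NQN below are placeholders, not real infrastructure):

```shell
# Load the NVMe/TCP transport and discover targets (addresses are placeholders)
sudo modprobe nvme-tcp
sudo nvme discover -t tcp -a 10.0.0.10 -s 4420

# Connect to a discovered subsystem; the NQN below is hypothetical
sudo nvme connect -t tcp -a 10.0.0.10 -s 4420 \
    -n nqn.2024-01.com.example:storage:subsys1

# Verify the namespace appeared as a local block device
sudo nvme list
```

After connecting, the namespace shows up as a regular /dev/nvmeXnY device and is tuned like a local NVMe drive.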

RAID/EC and Media

RAID10 - minimal latency and predictable IOPS; optimal for databases/wallets.
RAID5/6 - better capacity efficiency, but a write penalty; IOPS drop for sync writes.

Erasure Coding in distributed arrays wins on capacity, but writes are "more expensive."

NVMe SSD - top-tier p99; SAS SSD - a compromise; HDD - good sequential bandwidth, but poor random I/O.
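The RAID write penalty turns into a rough capacity-planning number: each random write costs about 2 back-end I/Os on RAID10, 4 on RAID5, and 6 on RAID6. A minimal sketch (the function name and figures are illustrative; real arrays vary with caching):

```shell
# Rough effective random-write IOPS for a RAID set:
# effective = raw back-end IOPS / write penalty
# (RAID10 ≈ 2, RAID5 ≈ 4, RAID6 ≈ 6)
effective_write_iops() {
  awk -v raw="$1" -v penalty="$2" 'BEGIN { printf "%d\n", raw / penalty }'
}

effective_write_iops 100000 2   # RAID10 → 50000
effective_write_iops 100000 4   # RAID5  → 25000
effective_write_iops 100000 6   # RAID6  → 16666
```

This is why RAID5/6 under sync-write OLTP shows much worse p99 write latency than RAID10 at the same raw media speed.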

File Systems and Alignment

XFS is an excellent choice for large database files/logs; tunable 'agcount', 'realtime' for logs.
ext4 - versatile; set 'stride'/'stripe-width' carefully for RAID.
ZFS - CoW, integrity checking, snapshots/replication, ARC/ZIL/SLOG; for sync loads, put the SLOG on NVMe with PLP.
Alignment: 1MiB-aligned partitions; correct 'recordsize'/'blocksize' for the workload.
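For ext4 on RAID, 'stride' and 'stripe-width' follow directly from the chunk size and the number of data disks. A small helper sketch (the function name is illustrative; it assumes a 4 KiB filesystem block):

```shell
# ext4 RAID tuning: stride = chunk_KiB / 4 (4 KiB FS block assumed),
# stripe-width = stride * number of data disks
# (e.g. 8 data disks for RAID6 over 10 drives)
ext4_raid_opts() {
  local chunk_kib=$1 data_disks=$2
  local stride=$(( chunk_kib / 4 ))
  echo "stride=${stride},stripe-width=$(( stride * data_disks ))"
}

ext4_raid_opts 256 8   # → stride=64,stripe-width=512
# Usage: mkfs.ext4 -E "$(ext4_raid_opts 256 8)" /dev/md0
```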

Queues, depth and block size

IOPS rise with queue depth (QD), but so does latency; target the QD that delivers the required IOPS while keeping p95/p99 under control.
Block size: small (4-16K) - more IOPS, worse bandwidth; large (128K-1M) - better end-to-end throughput.
NVMe qpairs: allocate per core/NUMA node; iSCSI/FC: HBA/initiator qdepth, MPIO policies.
Barriers and FUA: enabled write barriers increase reliability but raise p99; SLOG/PLP offsets this.
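The QD/latency trade-off follows Little's Law: in-flight concurrency ≈ IOPS × mean latency. A quick sketch for sizing the target queue depth (the function name is illustrative):

```shell
# Little's Law: required concurrency (queue depth) ≈ IOPS * latency (seconds).
# Example: sustaining 200k IOPS at 0.2 ms mean latency needs QD ≈ 40.
required_qd() {
  awk -v iops="$1" -v lat_ms="$2" 'BEGIN { printf "%d\n", iops * lat_ms / 1000 }'
}

required_qd 200000 0.2   # → 40
required_qd 50000 1.0    # → 50
```

Running at a much higher QD than this buys no extra IOPS and only inflates the latency tail.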

Multipath and availability

MPIO/DM-Multipath: path aggregation, fault tolerance.

Policies: 'round-robin' (balancing), 'queue-length' (smarter), 'failover' (active-passive).
ALUA: "preferred" paths to the active controller.
Important: 'no_path_retry', 'queue_if_no_path' - use carefully so I/O doesn't "freeze" for long minutes.
FC zoning: "one initiator - one target" per zone (reduces blast radius).
NVMe-oF: ANA (Asymmetric Namespace Access) - the analog of ALUA.

TRIM/Discard and Caching

TRIM/Discard frees SSD blocks (lowers write amplification, stabilizes latency). Run it periodically (e.g., via cron or a systemd timer) or use online discard where appropriate.
Read-ahead helps sequential reads; it hurts random I/O.
Write-back controller caches - only with BBU/PLP; otherwise you risk data loss.
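On a systemd-based distro, periodic TRIM might be set up as follows (a sketch; online discard stays off, so the mount omits the 'discard' option):

```shell
# Periodic TRIM instead of online discard
sudo systemctl enable --now fstrim.timer   # weekly by default on most distros

# One-off: trim all mounted filesystems that support discard, verbosely
sudo fstrim -av
```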

Network Stack (for iSCSI/NVMe-TCP)

Separate VLAN/VRF for the storage fabric; isolate it from client traffic.
MTU 9000 end-to-end; RSS/RPS and IRQ pinning per NUMA node.
QoS/priority for RoCE (if lossless), ECN/RED for TCP bursts.
Two independent fat trees to the storage (dual ToRs, separate power feeds).
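A quick sanity check that MTU 9000 really holds end-to-end is a do-not-fragment ping sized to the jumbo payload (the target address is a placeholder):

```shell
# 8972 = 9000 MTU - 20 (IPv4 header) - 8 (ICMP header); -M do forbids fragmentation
ping -M do -s 8972 -c 3 10.0.0.10
# If this fails while an ordinary ping works, some hop's MTU is below 9000.
```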

Linux/Host Tuning (Sample)

```bash
# Scheduler for NVMe
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
echo 1024 | sudo tee /sys/block/nvme0n1/queue/nr_requests
echo 0    | sudo tee /sys/block/nvme0n1/queue/add_random
echo 0    | sudo tee /sys/block/nvme0n1/queue/iostats

# Read-ahead (sequential loads)
blockdev --setra 4096 /dev/nvme0n1

# iSCSI: example of aggressive timeouts and retries
iscsiadm -m node --op update -n node.session.timeo.replacement_timeout -v 10
iscsiadm -m node --op update -n node.conn[0].timeo.noop_out_interval -v 5
iscsiadm -m node --op update -n node.conn[0].timeo.noop_out_timeout -v 5
```

Multipath (fragment of 'multipath.conf'):

```conf
defaults {
    find_multipaths yes
    polling_interval 5
    no_path_retry 12
}
devices {
    device {
        vendor "(PURE|DELL|NETAPP|HITACHI)"
        path_checker tur
        features "1 queue_if_no_path"
        path_grouping_policy group_by_prio
        prio alua
    }
}
```

Benchmarking and profiling

fio - minimum set of profiles:

```bash
# Random read 4K, queue depth 32, 4 threads
fio --name=randread --filename=/dev/nvme0n1 --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=60

# Random write 4K (sync), log-like loads
fio --name=randwrite --rw=randwrite --bs=4k --iodepth=16 --numjobs=4 \
    --fsync=1 --direct=1 --runtime=60

# Large-block sequential write (backups/dumps)
fio --name=seqwrite --rw=write --bs=1M --iodepth=64 --numjobs=2 --runtime=60
```
Tips:
  • Separate warm-up from measurement; record temperature/thermal throttling.
  • Test on the LUN/volume, not the FS (if the target is raw hardware).
  • Measure p95/p99 latency and the 99.9% tail - these are what "kill" the database.
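fio can emit those percentiles directly. A sketch of pulling p99 completion latency out of its JSON output with jq (the device path is a placeholder; completion-latency percentiles are reported by default in fio 3.x):

```shell
# Run a short profile with JSON output and extract the p99 read latency (ns)
fio --name=p99check --filename=/dev/nvme0n1 --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --runtime=30 --time_based \
    --output-format=json --output=result.json

jq '.jobs[0].read.clat_ns.percentile."99.000000"' result.json
```

Tracking this one number across firmware updates and config changes catches regressions that average latency hides.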

Monitoring and SLO

Metrics:
  • Latency p50/p95/p99 (read/write), IOPS, throughput, queue depth, device busy%, merges, discards.
  • At the network level: drops, retransmits, ECN marks, interface errors.
  • At the array level: replication lag, rebuild/resilver progress, write amplification, SSD wear level.
SLO (examples):
  • Database LUN (OLTP): p99 write ≤ 1.5 ms, p99 read ≤ 1.0 ms, availability ≥ 99.95%.
  • Logs: p95 append ≤ 2.5 ms, bandwidth ≥ 400 MB/s per volume.
  • Backups: seq write ≥ 1 GB/s (aggregated), recovery RTO ≤ 15 minutes.
Alerts:
  • p99 latency > threshold for N minutes, IOPS degradation at the same QD, growth of read-modify-write in RAID5/6, SSD overheating/thermal throttling, started/stuck rebuilds.
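The availability SLOs above translate into an error budget; for example, 99.95% over 30 days allows roughly 21.6 minutes of downtime. A minimal sketch (the function name is illustrative):

```shell
# Allowed downtime (minutes) for an availability SLO over a window of N days
allowed_downtime_min() {
  awk -v slo="$1" -v days="$2" \
      'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100 }'
}

allowed_downtime_min 99.95 30   # → 21.6
allowed_downtime_min 99.9  30   # → 43.2
```

Burn-rate alerts then fire when the budget is being consumed faster than the window allows.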

Kubernetes and CSI

PVC/StorageClass parameters: 'reclaimPolicy', 'volumeBindingMode=WaitForFirstConsumer' (correct placement), 'allowVolumeExpansion'.
Vendor CSI plugins: snapshots/clones, QoS/performance policies, volume topology.
AccessModes: RWO for databases/state; RWX - use carefully (usually via file/network storage).
Topology/Affinity: pin pods to nodes close to the storage (low latency).
Important: HPA/VPA will not "cure" a bad drive; plan volume SLOs, use PodDisruptionBudget for stateful workloads.
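A StorageClass carrying the parameters above might look like this (the class name and provisioner are placeholders for your vendor's CSI driver):

```shell
# Hypothetical StorageClass for an OLTP-grade volume class
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-nvme                    # placeholder name
provisioner: csi.vendor.example.com  # placeholder CSI driver
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```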

Snapshots, clones, Consistency Groups

Crash-consistent snapshots are fast, but database inconsistencies are possible.
App-consistent - via quiesce scripts (fsfreeze, DB pre/post hooks).
Consistency Group (CG) - snapshots several LUNs at the same time (transactional systems).
Clones - quick dev/test environments without copying data.
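An app-consistent snapshot sequence can be sketched as follows (the mountpoint and snapshot command are placeholders; for databases, prefer the engine's own backup hooks where available):

```shell
# Quiesce the filesystem so in-flight writes are flushed
sudo fsfreeze -f /var/lib/postgresql    # placeholder mountpoint

# ...trigger the array/LVM snapshot here, e.g.:
# sudo lvcreate -s -n db_snap -L 10G /dev/vg0/db

# Thaw as quickly as possible; writes block while frozen
sudo fsfreeze -u /var/lib/postgresql
```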

Safety and compliance

iSCSI CHAP/Mutual CHAP, VLAN/VRF isolation.
NVMe/TCP with TLS - for cross-center/multi-lease scenarios.
Encryption "at rest": LUKS/dm-crypt, self-encrypting drives (TCG Opal), keys in KMS.
Audit: who mapped the LUN, FC zoning changes, multipath configuration changes.
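Encryption at rest with LUKS/dm-crypt might be set up like this (the device path and mapping name are placeholders; wiring the key into a KMS is outside this sketch):

```shell
# One-time: format the LUN with LUKS2 (DESTROYS existing data on the device)
sudo cryptsetup luksFormat --type luks2 /dev/mapper/mpatha

# Open the encrypted device under a chosen mapping name, then build the FS on it
sudo cryptsetup open /dev/mapper/mpatha securevol
sudo mkfs.xfs /dev/mapper/securevol
```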

DR and operations

Synchronous replication (RPO ≈ 0) - increases latency; short distances only.
Asynchronous (RPO = N min) - geo-distance; acceptable for most databases with logs.
Runbooks: MPIO path loss, controller loss, disk rebuild, pool degradation, site switchover.
Service windows: "rolling" controller updates; rebuild rate limits so rebuilds don't eat production I/O.

FinOps (cost per performance)

$/IOPS and $/ms of p99 are more useful than $/TB for OLTP.
Tiering: hot OLTP - NVMe/RAID10; reports/archive - HDD/EC.
Provisioning and headroom: plan for 30-50% IOPS growth; keep reserve for rebuilds/scrubs.
Egress/fabric: a separate budget for the storage network and HBA/NIC upgrades.
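Comparing tiers by $/IOPS instead of $/TB is simple arithmetic; a sketch (the function name and prices are made up for illustration):

```shell
# Cost per 1000 delivered IOPS: monthly cost / (IOPS / 1000)
cost_per_kiops() {
  awk -v cost="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", cost / (iops / 1000) }'
}

cost_per_kiops 2400 150000   # NVMe tier → 16.00
cost_per_kiops 600 8000      # HDD tier  → 75.00
```

On random-I/O workloads the "cheap" tier often loses once cost is normalized to delivered IOPS.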

Implementation checklist

  • Protocol (NVMe-oF/FC/iSCSI) selected and fabric isolated.
  • Pools designed per RAID/EC and load type (OLTP/log/backup).
  • MPIO/ALUA/ANA and timeouts configured; failover/restore tested.
  • FS/alignment for RAID; TRIM/Discard enabled per policy.
  • Queue/qdepth/read-ahead tuned; validated with fio profiles (randread/write 4k, seq 1M).
  • Disk/path/latency p95/p99 monitoring; alerts on rebuilds and throttling.
  • Snapshots (app-consistent) and CGs; DR/recovery tested.
  • Encryption and CHAP/TLS; keys in KMS; audit of operations.
  • Kubernetes/CSI parameters, topology, and QoS per volume.

Common errors

One path without MPIO → single point of failure.
RAID5/6 under sync-write OLTP → high p99 write.
No TRIM → write-amplification growth and SSD degradation.
QD too large → "beautiful" IOPS and a terrible latency tail for the database.
Online discard on "hot" OLTP volumes → latency spikes.
'queue_if_no_path' without a timeout → "frozen" services in a disaster.
Mixing NVMe and HDD in the same pool → unpredictable latency.

iGaming/fintech specific

Wallet/transactional databases: NVMe + RAID10, synchronous log on a separate SLOG/NVMe, p99 write ≤ 1.5 ms, CG snapshots.
Payment queues/anti-fraud: sequential logs → large blocks, high bandwidth, separate LUNs for logs and data.
Peak TPS (tournaments/matches): pre-warm database caches, headroom ≥ 30%, thermal-throttle control, SLO burn rate.
Regulatory: LUN encryption, LUN-mapping audit log, DR exercises, RPO/RTO reporting.

Total

Productive block storage is the right protocol + correctly tuned queues and qdepth + an adequate RAID/EC scheme + cache/barrier discipline + an isolated fabric. Pin everything down in runbooks, measure p95/p99, validate with fio profiles, automate snapshots and DR - and you get the predictable latency and IOPS needed for critical product and cash-flow paths.

Contact

Get in Touch

Reach out with any questions or support needs. We are always ready to help!

Telegram
@Gamble_GC