Block Storage and Performance
Brief Summary
Block storage exposes raw devices (LUN/volume) on top of which you build a file system, LVM/ZFS, etc. Performance is determined by media type, access protocol, queues and queue depth, block size, coding scheme (RAID/EC), caches and write barriers, network fabric, and the application's I/O pattern (random/sequential, read/write, sync/async). The goal is to deliver the required p95/p99 latency and IOPS/bandwidth with robustness and predictability.
Block access taxonomy
Local: NVMe (PCIe), SAS/SATA SSD/HDD. Minimal latency, no network bottlenecks.
Network:
- iSCSI (Ethernet, LUN, MPIO, ALUA).
- Fibre Channel (FC) (16-64G, low latency, zoning).
- NVMe-oF: NVMe/TCP, NVMe/RoCE, NVMe/FC - "native" NVMe over the network, less overhead.
- HCI/distributed (Ceph RBD, vSAN): convenient scalability, but latency is higher, network/coding is critical.
Choosing by requirements:
- p99 latency ≤ 1-2 ms, very high IOPS → local NVMe or NVMe-oF.
- Stable mean latency of 2-5 ms, mature ecosystem → FC or an NVMe/FC fabric.
- Unified Ethernet, simpler operations → iSCSI or NVMe/TCP.
Protocols and their features
iSCSI: versatility, MPIO/ALUA, sensitive to TCP configuration (MTU, offloads, qdepth).
FC: isolation, lossless flows, WWPN zoning, HBA queues and credits.
NVMe-oF: parallelism through multiple Submission/Completion Queues, low CPU overhead; TLS is available for NVMe/TCP where needed.
RAID/EC and Media
RAID10 - minimal latency and predictable IOPS; optimal for databases/wallets.
RAID5/6 - better capacity efficiency, but a write penalty; IOPS drop for sync writes.
Erasure Coding in distributed arrays wins on capacity, but writes are more expensive.
NVMe SSD - best p99; SAS SSD - a compromise; HDD - good sequential bandwidth, poor random I/O.
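The write-penalty arithmetic behind these recommendations is easy to sketch. A minimal Python illustration using the classic textbook penalty factors (the drive count and per-drive IOPS below are hypothetical, and real arrays cache and coalesce writes):

```python
# Effective random-write IOPS under classic RAID write penalties:
# RAID0 = 1, RAID10 = 2, RAID5 = 4 (read old data + parity,
# write new data + parity), RAID6 = 6.
WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4, "raid6": 6}

def effective_write_iops(drives: int, iops_per_drive: int, level: str) -> int:
    return drives * iops_per_drive // WRITE_PENALTY[level]

# Hypothetical shelf: 8 SSDs at 50k write IOPS each.
for level in ("raid10", "raid5", "raid6"):
    print(level, effective_write_iops(8, 50_000, level))
# raid10 200000, raid5 100000, raid6 66666
```

The 3x gap between RAID10 and RAID6 for sync writes is why RAID10 keeps winning for OLTP despite the capacity cost.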
File Systems and Alignment
XFS is an excellent choice for large database files/logs; tunable `agcount`, realtime subvolume for log workloads.
ext4 - versatile; mind `stride`/`stripe_width` for RAID.
ZFS - CoW, integrity checking, snapshots/replication, ARC/ZIL/SLOG; for sync loads, put the SLOG on NVMe with PLP.
Alignment: 1 MiB-aligned partitions; pick the right `recordsize`/`blocksize` for the workload.
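A quick way to sanity-check the 1 MiB alignment rule (the sector numbers below are just the common modern default and the legacy DOS offset):

```python
# A partition is 1 MiB-aligned when its byte offset is a multiple
# of 1048576 (i.e. start sector 2048 with 512-byte sectors).
def aligned_1mib(start_sector: int, sector_bytes: int = 512) -> bool:
    return (start_sector * sector_bytes) % (1 << 20) == 0

print(aligned_1mib(2048))  # True  - the modern default offset
print(aligned_1mib(63))    # False - the legacy DOS offset
```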
Queues, depth and block size
IOPS rise with queue depth, but so does latency; target the QD that delivers the required IOPS while keeping p95/p99 in check.
Block size: small (4-16K) - more IOPS, lower bandwidth; large (128K-1M) - better end-to-end throughput.
NVMe qpairs: allocate per core/NUMA node; iSCSI/FC: HBA/initiator qdepth, MPIO policies.
Barriers and FUA: enabled write barriers improve reliability but raise p99; a SLOG/PLP offsets the cost.
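Little's law ties these knobs together: outstanding I/Os ≈ IOPS × latency, and bandwidth = IOPS × block size. A small sketch (the target figures are hypothetical):

```python
# Little's law: outstanding I/Os (QD) ~= IOPS x latency.
def required_qd(target_iops: float, latency_s: float) -> float:
    return target_iops * latency_s

# Bandwidth follows directly from IOPS and block size.
def bandwidth_mb_s(iops: float, block_size_bytes: int) -> float:
    return iops * block_size_bytes / 1e6

# Hypothetical target: 100k IOPS at 0.5 ms mean latency
# needs ~50 I/Os in flight...
print(required_qd(100_000, 0.0005))   # 50.0
# ...and at 4K blocks delivers only ~410 MB/s.
print(bandwidth_mb_s(100_000, 4096))  # 409.6
```

If measured latency at that QD is higher than the budget, pushing QD further only inflates the tail; the bottleneck is elsewhere.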
Multipath and availability
MPIO/DM-Multipath: path aggregation, fault tolerance.
Policies: `round-robin` (load balancing), `queue-length` (smarter), `failover` (active-passive).
ALUA: "preferred" paths to the active controller.
Important: `no_path_retry`, `queue_if_no_path` - use with care so I/O does not "freeze" for long minutes.
FC zoning: "one initiator zone - one target" (reduces blast radius).
NVMe-oF: ANA (Asymmetric Namespace Access) is the analogue of ALUA.
TRIM/Discard and Caching
TRIM/Discard frees SSD blocks (lowers write amplification, stabilizes latency). Run it regularly (cron/systemd timer) or use online discard where appropriate.
Read-ahead helps sequential reads; it hurts random I/O.
Write-back controller caches - only with BBU/PLP; otherwise you risk data loss.
Network Stack (for iSCSI/NVMe-TCP)
Separate VLAN/VRF for the storage fabric; isolate it from client traffic.
MTU 9000 end-to-end; RSS/RPS and IRQ pinning to NUMA.
QoS/priority for RoCE (if lossless), ECN/RED for TCP peaks.
Two independent fabrics to the storage (dual ToR switches, separate power feeds).
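Jumbo frames help because header overhead is amortized over a bigger payload. A rough estimate, counting only Ethernet/IPv4/TCP framing and ignoring iSCSI PDU headers:

```python
# TCP payload efficiency per frame: payload = MTU - 40 bytes
# (IPv4 + TCP headers); wire cost ~= MTU + 38 bytes (Ethernet
# header/FCS, preamble, inter-frame gap).
def tcp_efficiency(mtu: int) -> float:
    return (mtu - 40) / (mtu + 38)

print(round(tcp_efficiency(1500), 3))  # ~0.949
print(round(tcp_efficiency(9000), 3))  # ~0.991
```

A few percent of bandwidth, but the bigger win from MTU 9000 is fewer packets per second and less interrupt/CPU load on the initiator.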
Linux/Host Tuning (Sample)
```bash
# Scheduler for NVMe
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
echo 1024 | sudo tee /sys/block/nvme0n1/queue/nr_requests
echo 0 | sudo tee /sys/block/nvme0n1/queue/add_random
echo 0 | sudo tee /sys/block/nvme0n1/queue/iostats

# Read-ahead (sequential loads)
blockdev --setra 4096 /dev/nvme0n1

# iSCSI: example of aggressive timeouts and retries
iscsiadm -m node --op update -n node.session.timeo.replacement_timeout -v 10
iscsiadm -m node --op update -n node.conn[0].timeo.noop_out_interval -v 5
iscsiadm -m node --op update -n node.conn[0].timeo.noop_out_timeout -v 5
```
Multipath (fragment of `multipath.conf`):

```conf
defaults {
    find_multipaths yes
    polling_interval 5
    no_path_retry 12
}
devices {
    device {
        # Adjust the vendor regex to your array
        vendor "PURE|DELL|NETAPP|HITACHI"
        path_checker tur
        features "1 queue_if_no_path"
        path_grouping_policy group_by_prio
        prio alua
    }
}
```
Benchmarking and profiling
fio - a minimum set of profiles:

```bash
# Random read 4K, queue depth 32, 4 jobs
fio --name=randread --filename=/dev/nvme0n1 --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=60

# Random write 4K (sync), log-like loads
fio --name=randwrite --filename=/dev/nvme0n1 --rw=randwrite --bs=4k --iodepth=16 --numjobs=4 \
    --fsync=1 --direct=1 --runtime=60

# Large-block sequential write (backups/dumps)
fio --name=seqwrite --filename=/dev/nvme0n1 --rw=write --bs=1M --iodepth=64 --numjobs=2 --runtime=60
```
Tips:
- Separate warm-up from measurement; record temperature/thermal throttling.
- Test against the LUN/volume, not the FS (if the target is raw hardware).
- Measure p95/p99 latency and the 99.9% tail - they are what "kills" the database.
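To automate the p95/p99 checks, fio can emit JSON via `--output-format=json`; recent versions report completion latency under `clat_ns` with percentile keys like `99.000000` (older builds report µs fields, so verify against your fio version). A minimal extraction sketch over a trimmed-down result document:

```python
import json

def p99_ms(fio_json: dict, op: str = "read") -> float:
    """99th-percentile completion latency in ms for the first job."""
    pct = fio_json["jobs"][0][op]["clat_ns"]["percentile"]
    return pct["99.000000"] / 1e6  # ns -> ms

# Trimmed-down stand-in for `fio --output-format=json` output:
doc = json.loads(
    '{"jobs": [{"read": {"clat_ns": {"percentile": {"99.000000": 1800000}}}}]}'
)
print(p99_ms(doc))  # 1.8
```

Wiring this into CI lets you fail a build when a storage change pushes p99 over the SLO threshold.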
Monitoring and SLO
Metrics:
- Latency p50/p95/p99 (read/write), IOPS, throughput, queue depth, device busy%, merges, discards.
- At the network level: drops, retransmits, ECN marks, interface errors.
- At the array level: replication lag, rebuild/resilver progress, write-amp, SSD wear level.
SLO examples:
- DB LUN (OLTP): p99 write ≤ 1.5 ms, p99 read ≤ 1.0 ms, availability ≥ 99.95%.
- Logs: p95 append ≤ 2.5 ms, bandwidth ≥ 400 MB/s per volume.
- Backups: seq write ≥ 1 GB/s (aggregate), recovery RTO ≤ 15 minutes.
Alerts:
- p99 latency > threshold for N minutes, IOPS degradation at the same QD, growth of read-modify-write in RAID5/6, SSD overheating/thermal throttling, started/stuck rebuilds.
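A tiny sketch of why tail percentiles, not averages, should drive these alerts (nearest-rank percentile; the latency samples are synthetic):

```python
import math

# Nearest-rank percentile over raw latency samples (ms).
def percentile(samples: list[float], q: float) -> float:
    s = sorted(samples)
    idx = max(0, math.ceil(q * len(s) / 100) - 1)
    return s[idx]

def breaches_slo(samples: list[float], q: float, threshold_ms: float) -> bool:
    return percentile(samples, q) > threshold_ms

# 100 synthetic samples: mostly fast, with a heavy tail.
lat = [0.4] * 97 + [0.9, 2.0, 6.0]
print(percentile(lat, 50))         # 0.4 - the median looks healthy
print(percentile(lat, 99))         # 2.0 - the tail tells another story
print(breaches_slo(lat, 99, 1.5))  # True
```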
Kubernetes and CSI
PVC/StorageClass: parameters `reclaimPolicy`, `volumeBindingMode: WaitForFirstConsumer` (correct placement), `allowVolumeExpansion`.
Vendor CSI plugins: snapshots/clones, QoS/performance policies, volume-topology.
AccessModes: RWO for databases/state; RWX - use carefully (usually via file/network storage).
Topology/Affinity: pin pods to nodes close to the storage (low latency).
Important: HPA/VPA will not "cure" a slow volume; plan volume SLOs, use PodDisruptionBudget for stateful workloads.
Snapshots, clones, Consistency Groups
Crash-consistent snapshots are fast, but database inconsistencies are possible.
App-consistent - via quiesce scripts (fsfreeze, DB pre/post hooks).
Consistency Group (CG) - snapshots several LUNs at the same instant (transactional systems).
Clones are quick dev/test environments without copying.
Safety and compliance
iSCSI CHAP/Mutual CHAP, VLAN/VRF isolation.
NVMe/TCP with TLS - for cross-datacenter/multi-tenant scenarios.
Encryption "at rest": LUKS/dm-crypt, self-encrypting drives (TCG Opal), keys in KMS.
Audit: who mapped the LUN, FC zone change, multipath changes.
DR and operations
Synchronous replication (RPO≈0) - adds latency, short distances only.
Asynchronous (RPO = N min) - geo-distance, acceptable for most databases with logs.
Runbooks: MPIO path loss, controller loss, disk rebuild, pool degradation, site switch.
Service windows: "rolling" controller updates, rebuild rate limits so rebuilds do not eat prod.
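The sync-vs-async trade-off can be budgeted with back-of-envelope arithmetic: worst-case data loss ≈ sustained write rate × replication lag. A sketch with hypothetical numbers:

```python
# Worst-case data at risk ~= write rate x replication lag (the RPO).
def data_at_risk_mb(write_mb_s: float, rpo_s: float) -> float:
    return write_mb_s * rpo_s

# Hypothetical: 50 MB/s of writes and a 5-minute RPO -> ~15 GB at risk.
print(data_at_risk_mb(50, 5 * 60))  # 15000.0
```

Framing RPO as "gigabytes of transactions at risk" makes the conversation with the business far more concrete than "N minutes".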
FinOps (cost per performance)
$/IOPS and $/ms of p99 are more useful than "$/TB" for OLTP.
Tiering: hot OLTP - NVMe/RAID10; reports/archive - HDD/EC.
Provisioning and headroom: plan for 30-50% IOPS growth; keep reserve for rebuilds/scrubs.
Egress/fabric: a separate budget for the storage network and HBA/NIC upgrades.
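A sketch of comparing tiers on $/IOPS versus $/TB (the shelf prices and IOPS figures below are made up for illustration):

```python
# Normalize tier cost by performance and by capacity.
def cost_metrics(price_usd: float, iops: float, tb: float) -> dict:
    return {"usd_per_iops": price_usd / iops, "usd_per_tb": price_usd / tb}

nvme = cost_metrics(12_000, 800_000, 15)  # hypothetical NVMe shelf
hdd = cost_metrics(6_000, 2_000, 120)     # hypothetical HDD shelf
print(nvme)  # $0.015/IOPS, $800/TB - wins on $/IOPS
print(hdd)   # $3.00/IOPS,  $50/TB - wins on $/TB
```

The crossover is exactly the tiering argument: hot OLTP belongs on the tier that is cheap per IOPS, archives on the tier that is cheap per TB.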
Implementation checklist
- Protocol (NVMe-oF/FC/iSCSI) and isolated fabric selected.
- Pools designed per RAID/EC and load type (OLTP/log/backup).
- MPIO/ALUA/ANA and timeouts configured; checked failover/restore.
- FS/alignment for RAID; TRIM/Discard enabled per policy.
- Queues/qdepth/read-ahead tuned; validated with fio profiles (randread/randwrite 4k, seq 1M).
- Disk/path p95/p99 latency monitoring; alerts on rebuilds and throttling.
- Snapshots (app-consistent) and CG; DR/recovery test.
- Encryption and CHAP/TLS; keys in a KMS; operations audited.
- Kubernetes/CSI parameters, topology and QoS per volume.
Common errors
One path without MPIO → single point of failure.
RAID5/6 under sync-write OLTP → high p99 write.
No TRIM → write-amp growth and SSD degradation.
QD too large → "beautiful" IOPS and a terrible tail for the database.
Online discard on "hot" volumes with OLTP → latency jumps.
`queue_if_no_path` without a timeout → "frozen" services in a disaster.
Mixing NVMe and HDD in the same pool → unpredictable latency.
iGaming/fintech specific
Wallet/transactional databases: NVMe + RAID10, synchronous log on a separate SLOG/NVMe, p99 write ≤ 1.5 ms, CG snapshots.
Payment queues/anti-fraud: sequential logs → large blocks, high bandwidth, separate LUNs for log and data.
Peak TPS (tournaments/matches): pre-warm database caches, headroom ≥ 30%, thermal throttle control, burn-rate SLO.
Regulatory: LUN encryption, mapping audit log, DR exercises, RPO/RTO reporting.
Summary
A performant block store is the right protocol + correctly tuned queues and qdepth + adequate RAID/EC + cache/barrier discipline + an isolated fabric. Capture everything in runbooks, measure p95/p99, validate with fio profiles, automate snapshots and DR - and you get the predictable latency and IOPS needed for critical product and cash-flow paths.