Database Sharding and Replication
1) Why you need it
When vertical scaling of the database runs into CPU/IO/RAM limits, or a single cluster becomes a SPOF, replication (for reads/HA) and sharding (for writes/data distribution) come into play. Objectives:
- Throughput (horizontal growth of write QPS).
- Availability (fast failover, no single point of failure).
- Data localization (multi-region, low latency).
- Isolation of noisy neighbors (hot tenants/hot keys).
2) Basic terms and consistency models
Primary/Leader ↔ Replica/Follower: write to the leader, read from the replicas.
Synchronous replication: the transaction is confirmed only after it is written on N nodes (low RPO, higher latency).
Asynchronous: the leader commits and ships the log later (RPO > 0, low latency).
Quorum (Raft/Paxos): writes go to a majority of nodes; a single log, automatic leader election.
Read-after-write: a client is guaranteed to read its own writes (see § 4).
In practice, CAP reads like this: under a network partition you choose consistency (CP) or availability (AP) for critical operations, often combining different levels on different request paths.
3) Replication: Options and Practices
3.1 Physical and logical
Physical (WAL/redo/binlog): operates close to the block/WAL level; simple and fast, but limited to a homogeneous topology/version.
Logical: a DML/DDL stream at the row/table level; allows partial replicas, cross-version migrations, and CDC for DWH/streaming.
3.2 Setup and management
Monitor lag (time/bytes/LSN); a minimal probe sketch appears at the end of this subsection.
Be careful with hot_standby_feedback and long-running queries on replicas (they can hold back VACUUM/cleanup).
For MySQL - GTID and Orchestrator; for PostgreSQL - Patroni, replication slots, synchronous_standby_names.
PostgreSQL synchronous commit (fragment):
```sql
-- on the leader
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby_a, standby_b)';
SELECT pg_reload_conf();
```
MySQL GTID (global transaction ID):
```sql
SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = ON; -- restart required
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
```
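As a reference point for the lag monitoring mentioned above, here is a minimal probe sketch. It assumes psycopg2 and the standard `pg_last_xact_replay_timestamp()` function; the DSN and function name are placeholders, not part of any tool listed here.
```python
# Minimal replica-lag probe (sketch): run against a replica; DSN is a placeholder.
import psycopg2

def replication_lag_seconds(dsn: str) -> float:
    """Seconds since the last WAL record was replayed on this replica."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
        (lag,) = cur.fetchone()
        return float(lag) if lag is not None else 0.0  # NULL if no WAL has been replayed yet

print(replication_lag_seconds("host=replica_a dbname=app user=monitor"))
```
Note that on an idle primary this time-based view keeps growing even with zero real lag; comparing `pg_last_wal_receive_lsn()` with `pg_last_wal_replay_lsn()` gives the byte-distance view.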
3.3 Topologies
1→N (leader → replicas) + cascades (a replica relays the stream further to its own downstream replicas).
Multi-primary (active-active) - avoid in OLTP without strict conflict management.
Quorum cluster (Raft) - CockroachDB/Yugabyte/PG-Raft add-ons.
4) Read/Write Split and Routing
Always write to the leader; read from replicas, but account for lag.
Read-after-write strategies:
1. Session stickiness: after a successful write, the client reads from the leader for a window ΔT.
2. LSN/GTID gate: the client says "give me nothing older than LSN = X"; the router sends the read to a replica whose replayed LSN ≥ X.
3. Stale-ok: some queries tolerate stale data (catalogs/feeds).
Tools: PgBouncer/Pgpool-II (Postgres), ProxySQL/MaxScale (MySQL), Vitess (shard routing).
Example of an LSN gate (idea): store the result of `pg_current_wal_lsn()` in an HTTP header/cookie and require the router to pick a replica with `pg_last_wal_replay_lsn()` ≥ that LSN.
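A minimal router-side sketch of this gate, assuming psycopg2 and a plain list of replica DSNs; `caught_up`, `choose_replica` and the DSNs are illustrative names, not part of any proxy above.
```python
# LSN-gate routing (sketch): send the read to a replica that has replayed at least
# the LSN the client saw after its write; otherwise fall back to the leader.
import psycopg2

def caught_up(replica_dsn: str, min_lsn: str) -> bool:
    with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_last_wal_replay_lsn() >= %s::pg_lsn", (min_lsn,))
        return bool(cur.fetchone()[0])

def choose_replica(replica_dsns, min_lsn: str, leader_dsn: str) -> str:
    for dsn in replica_dsns:
        if caught_up(dsn, min_lsn):
            return dsn
    return leader_dsn  # nobody has caught up yet: read from the leader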
5) Sharding strategies
5.1 Key selection
The key should provide uniform distribution and locality of requests:
- Hash by `tenant_id`/`user_id` - even distribution, but you lose range scans.
- Range by time/ID - great for time-series/archives, but risk of a hot shard.
- Consistent hashing - makes adding/removing shards easy (see the sketch after this list).
- Directory/lookup table - flexible (any algorithm), but one more table/cache to maintain.
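A minimal consistent-hashing sketch; the `Ring` class, the number of virtual nodes and MD5 as the hash are assumptions for illustration, not a production recommendation.
```python
# Consistent-hash ring (sketch): adding or removing a shard remaps only ~1/N of the keys.
import bisect
import hashlib

class Ring:
    def __init__(self, shards, vnodes: int = 64):
        self._points = []            # sorted (hash, shard) points on the ring
        for shard in shards:
            self.add_shard(shard, vnodes)

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, shard: str, vnodes: int = 64) -> None:
        for i in range(vnodes):      # virtual nodes smooth out the distribution
            bisect.insort(self._points, (self._h(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        i = bisect.bisect(self._points, (self._h(key), ""))
        return self._points[i % len(self._points)][1]   # wrap around the ring

ring = Ring(["shard-0", "shard-1", "shard-2"])
print(ring.shard_for("tenant:42:order:1001"))
```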
5.2 Patterns
Shared-nothing: each shard is a separate database/cluster, the application knows routing.
Middleware sharding: Vitess (MySQL), Citus (Postgres); the proxy layer hides the topology.
Federation: separation of data domains by services (catalog, payments, auth).
5.3 Composite keys
Use a key space like '{tenant}:{entity}:{id}' and keep it consistent in the application and cache. For Postgres - hash partitioning + LIST/RANGE subpartitions.
PostgreSQL partitioning (fragment):
```sql
CREATE TABLE orders (
    tenant_id  int,
    id         bigint,
    created_at timestamptz,
    ...,
    PRIMARY KEY (tenant_id, id)
) PARTITION BY HASH (tenant_id);
CREATE TABLE orders_t0 PARTITION OF orders FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... t1..t15
```
6) ID generation
Avoid "hot" monotonic auto-increments when sharding.
Use Snowflake-like 64-bit IDs (time + region + shard + seq) or ULID/KSUID (monotonicity plus distribution).
For Postgres - a sequence per shard; for MySQL - auto_increment_increment/auto_increment_offset (different offsets on shard leaders).
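A Snowflake-like generator sketch; the exact bit layout (41-bit timestamp, 10-bit shard, 12-bit sequence) and the custom epoch are assumptions chosen for illustration.
```python
# Snowflake-like 64-bit ID (sketch): time-ordered, shard-aware, no central coordinator.
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (assumption)

class IdGenerator:
    def __init__(self, shard_id: int):
        assert 0 <= shard_id < 1024           # 10 bits for the shard
        self.shard_id = shard_id
        self.seq = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            ms = int(time.time() * 1000) - EPOCH_MS
            if ms == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence within one millisecond
                if self.seq == 0:                   # sequence exhausted: spin to the next ms
                    while ms <= self.last_ms:
                        ms = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = ms
            return (ms << 22) | (self.shard_id << 12) | self.seq

gen = IdGenerator(shard_id=3)
print(gen.next_id())
```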
7) Online resharding and migrations
Key principles: dual-write, idempotency, temporary double routing.
Steps (generalized):
1. Add the new shard/cluster.
2. Enable dual-read (consistency check).
3. Enable dual-write (to both shards) and record discrepancies (see the sketch below).
4. Backfill historical data (batches, logical/CDC replication).
5. Switch the "source of truth" to a new shard; leave "tail" synchronization.
6. Turn off the old one.
Tools: Vitess Resharding, Citus move shards, pg_logical/pgoutput, Debezium (CDC), gh-ost/pt-online-schema-change (DDL without locks).
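A dual-write/dual-read sketch for step 3 above; `write_old`, `write_new`, `read_old` and `read_new` are placeholders for your shard clients, and the reconciliation policy is an assumption.
```python
# Dual-write / dual-read during resharding (sketch): the old shard remains the source
# of truth until cutover; the new shard is best-effort, discrepancies are only logged.
import logging

log = logging.getLogger("resharding")

def dual_write(key, value, write_old, write_new):
    write_old(key, value)                  # must succeed: old shard is still authoritative
    try:
        write_new(key, value)              # should be idempotent (e.g. keyed by version)
    except Exception as exc:               # never fail the user request on the new path
        log.warning("dual-write failed on new shard key=%s err=%s", key, exc)

def dual_read(key, read_old, read_new):
    old, new = read_old(key), read_new(key)
    if old != new:
        log.warning("dual-read mismatch key=%s", key)   # feed into backfill/reconciliation
    return old                             # always serve from the source of truth
```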
8) Multi-region and geo-distribution
Leader-follower per region: local reads; writes go through the global leader (simple model, but cross-region RTT on writes).
Multi-leader: writes in both regions - you need conflict merging (timestamps/versions/CRDTs).
True distributed SQL (Raft): CockroachDB/Yugabyte - data is pinned to a region, queries go to the local quorum.
- Money/orders - CP (quorum/leader); catalogs/feeds - AP (cache, eventual consistency).
- Always plan write fencing (unique keys/versioning) for a possible split-brain.
9) Consistency in practice
Read-your-writes: the leader, or a replica that has "caught up" to the LSN/GTID.
Monotonic reads: never older than the last LSN already read.
Write-conflict control: `SELECT ... FOR UPDATE`, versions (`xmin`/`rowversion`), UPSERT with a version check (see the sketch after this list).
Idempotency: idempotency keys for payments/events.
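A sketch of the version-checked update mentioned above, assuming psycopg2 and an `orders` table with `status` and `version` columns (both are illustrative).
```python
# Optimistic concurrency (sketch): the UPDATE wins only if the row still carries the
# version the client read; a concurrent writer bumps the version and rowcount becomes 0.
import psycopg2

def update_order_status(dsn: str, order_id: int, new_status: str, expected_version: int) -> bool:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE orders
               SET status = %s, version = version + 1
               WHERE id = %s AND version = %s""",
            (new_status, order_id, expected_version),
        )
        return cur.rowcount == 1   # False: someone else won the race, retry or report conflict
```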
10) Observability, SLO and alerts
Replica lag: time (seconds), LSN distance (bytes), seconds_behind_master (MySQL).
Forced rollbacks/conflicts, replication errors.
p95/p99 latency per route (read from leader vs replica, write).
Throughput: TPS/locks/row-contended tables.
Bloat/VACUUM (PG), InnoDB buffer pool hit ratio (MySQL).
Dashboards: per-shard load, "hot" shards, key distribution.
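A toy imbalance report matching the dashboard idea above and the < 20% target in § 16; the QPS numbers and threshold are illustrative and would normally come from your metrics store.
```python
# Hot-shard detection (sketch): flag shards whose QPS exceeds the mean by > 20%.
def imbalance_report(qps_by_shard: dict, threshold: float = 0.20):
    mean = sum(qps_by_shard.values()) / len(qps_by_shard)
    hot = {s: q for s, q in qps_by_shard.items() if q > mean * (1 + threshold)}
    return mean, hot

mean, hot = imbalance_report({"shard-0": 900, "shard-1": 1100, "shard-2": 2400})
print(f"mean QPS = {mean:.0f}, hot shards = {hot}")   # shard-2 is a hot-shard candidate
```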
11) Backups, PITR and DR
Full backup + WAL/binlog for PITR (point-in-time recovery).
Store in another region/cloud, do restore tests regularly.
For shards, take a consistent "slice" (coordinated by time/LSN) or rely on application-level idempotency during recovery.
RPO/RTO are written down and tested on game-days.
```bash
pg_basebackup -D /backups/base -X stream -C -S slot_replica_1
# WAL archiving via archive_command or pgBackRest/Barman
```
12) Security and access
Network segmentation via VPC/ACL, mTLS at the proxy layer.
Roles/grants on the principle of least privilege; separate users per shard/role.
Audit DDL/DCL; limits on "heavy" queries on replicas.
Encryption at rest (KMS) and in transit (TLS).
Panic button: a global "READ ONLY" mode for the duration of the incident/investigation (a sketch follows below).
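A sketch of such a panic button for PostgreSQL using the standard `default_transaction_read_only` setting; the DSN and function name are placeholders, and note that individual sessions can still override the setting per transaction.
```python
# Global "read only" switch (sketch): new transactions become read-only by default
# until the incident is over; requires a superuser connection.
import psycopg2

def set_global_read_only(dsn: str, enabled: bool = True) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        cur.execute("ALTER SYSTEM SET default_transaction_read_only = %s",
                    ("on" if enabled else "off",))
        cur.execute("SELECT pg_reload_conf()")
    conn.close()
```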
13) Tools and bricks
PostgreSQL: Patroni (HA), PgBouncer (pooling/RO routing), repmgr, pgBackRest/Barman (backups), Citus (sharding), pglogical/Logical Replication, pgbadger/pg_stat_statements.
MySQL: Orchestrator (topologies/auto-failover), ProxySQL/MaxScale (routing), Percona XtraBackup (backup), Group Replication/InnoDB Cluster, Vitess (sharding/resharding).
Distributed SQL: CockroachDB, YugabyteDB (quorum, built-in sharding/geolocation).
CDC: Debezium + Kafka/Pulsar for events/ETL.
14) Anti-patterns
Single-primary without auto-failover and without DR tests.
"Magic" read/write splitting that ignores lag → phantom reads and hard-to-reproduce bugs.
Sharding "for the sake of sharding": premature complexity instead of vertical scaling/indexes/caching.
Hot range (time-range) keys without a time bucket/hash salt → one shard melts down.
Global 2PC transactions across dozens of shards in OLTP - heavy p99 tails and frequent locks.
No dual-write/dual-read during migrations → data loss/divergence.
DDL in prod without online tools and without compatibility feature flags.
15) Implementation checklist (0-60 days)
0-15 days
Define DB SLO, RPO/RTO.
Enable replication, lag monitoring, basic backups + PITR.
Introduce a router (PgBouncer/ProxySQL) and a read-after-write policy.
16-30 days
Choose a sharding strategy; describe the keys and schemas.
Prepare resharding tools (Vitess/Citus/CDC).
Build a catalog of services/tables marked "read-stale-ok" vs "strict".
31-60 days
Run a pilot shard, dual-read, and backfill.
Game-day: leader failover, recovery from PITR, region switch.
Automate reporting of hot shard keys and distribution skew.
16) Maturity metrics
Replica lag p95 < target (e.g. 500 ms) for critical reads.
Successful DR tests ≥ 1 per quarter (restore ≤ RTO, data loss ≤ RPO).
Load distribution across shards: imbalance < 20% by QPS/storage.
Percentage of requests with strict-consistency correctly routed = 100%.
Zero-data-loss in incidents requiring CP guarantees (money/orders).
Online DDL/migration without downtime, with compatibility flags.
17) Recipe examples
Hash salt for a time-range key, so one shard does not overheat (illustrative table names):
```sql
-- bucket = hash(user_id) % 16 is computed in the application; store (bucket, created_at)
-- in PostgreSQL a LIST partition can itself be partitioned by RANGE
CREATE TABLE events (..., bucket int, created_at timestamptz) PARTITION BY LIST (bucket);
CREATE TABLE events_b0 PARTITION OF events FOR VALUES IN (0) PARTITION BY RANGE (created_at);
```
Read-my-writes middleware (pseudocode):
```python
# after a successful write on the leader: remember its current LSN and pin the session
lsn = db.leader_query("SELECT pg_current_wal_lsn()")
ctx.sticky_until = now() + 5s
ctx.min_lsn = lsn
# in the read router: pick a replica with last_lsn >= ctx.min_lsn, otherwise go to the leader
```
Vitess VSchema (fragment):
```json
{
  "tables": {
    "orders": { "column_vindexes": [{ "column": "tenant_id", "name": "hash" }] }
  }
}
```
18) Conclusion
Sharding and replication are not only technology but also process: consistency-aware routing, migration discipline (dual-write/read, backfill), regular DR tests, and observability of lag and hot shards. Start with a simple leader→replica setup plus read-after-write routing, then add sharding where the load profile truly requires it. Use ready-made platforms (Vitess/Citus/Distributed SQL) and keep business-critical data in CP mode - this way the database stops being a bottleneck and becomes a predictable, elastic foundation of the platform.