Database Sharding and Replication
1) Why you need it
When vertical scaling of the database runs into CPU/IO/RAM limits, or a single cluster becomes a SPOF, replication (for reads/HA) and sharding (for writes/data distribution) come into play. Objectives:
- Throughput (horizontal growth of write QPS).
- Availability (fast failover, no single point of failure).
- Data localization (multi-region, low latency).
- Isolation of noisy neighbors (hot tenants/hot keys).
2) Basic terms and consistency models
Primary/Leader ↔ Replica/Follower: write to the leader, read from the replicas.
Synchronous replication: the transaction is confirmed only after it is written on N nodes (low RPO, higher latency).
Asynchronous: the leader commits and ships the log later (RPO > 0, low latency).
Quorum (Raft/Paxos): writes go to a majority of nodes; a single log, automatic leader election.
Read-after-write: a client is guaranteed to read its own writes (see § 4).
In practice, CAP reads like this: under a network partition you choose consistency (CP) or availability (AP) for critical operations, often combining different levels on different request paths.
3) Replication: Options and Practices
3.1 Physical and logical
Physical (WAL/redo/binlog): operates close to the block/WAL level; simple and fast, but limited to a homogeneous topology/version.
Logical: a DML/DDL stream at the row/table level; allows partial replicas, cross-version migrations, and CDC for DWH/streaming.
3.2 Setup and management
Monitor lag (time/bytes/LSN); a minimal probe sketch appears at the end of this subsection.
Be careful with hot_standby_feedback and long-running queries on replicas (they can hold back VACUUM/cleanup).
For MySQL - GTID and Orchestrator; for PostgreSQL - Patroni, replication slots, synchronous_standby_names.
PostgreSQL synchronous commit (fragment):
```sql
-- on the leader
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby_a, standby_b)';
SELECT pg_reload_conf();
```
MySQL GTID (global transaction ID):
```sql
SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = ON; -- restart required
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
```
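As a reference point for the lag monitoring mentioned above, here is a minimal probe sketch. It assumes psycopg2 and the standard `pg_last_xact_replay_timestamp()` function; the DSN and function name are placeholders, not part of any tool listed here.
```python
# Minimal replica-lag probe (sketch): run against a replica; DSN is a placeholder.
import psycopg2

def replication_lag_seconds(dsn: str) -> float:
    """Seconds since the last WAL record was replayed on this replica."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
        (lag,) = cur.fetchone()
        return float(lag) if lag is not None else 0.0  # NULL if no WAL has been replayed yet

print(replication_lag_seconds("host=replica_a dbname=app user=monitor"))
```
Note that on an idle primary this time-based view keeps growing even with zero real lag; comparing `pg_last_wal_receive_lsn()` with `pg_last_wal_replay_lsn()` gives the byte-distance view.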
3.3 Topologies
1→N (leader → replicas) + cascades (a replica relays the stream further to its own downstream replicas).
Multi-primary (active-active) - avoid in OLTP without strict conflict management.
Quorum cluster (Raft) - CockroachDB/Yugabyte/PG-Raft add-ons.
4) Read/Write Split and Routing
Always write to the leader; read from replicas, but account for lag.
Read-after-write strategies:
1. Session stickiness: after a successful write, the client reads from the leader for a window ΔT.
2. LSN/GTID gate: the client says "give me nothing older than LSN = X"; the router sends the read to a replica whose replayed LSN ≥ X.
3. Stale-ok: some queries tolerate stale data (catalogs/feeds).
Tools: PgBouncer/Pgpool-II (Postgres), ProxySQL/MaxScale (MySQL), Vitess (shard routing).
Example of an LSN gate (idea): store the result of `pg_current_wal_lsn()` in an HTTP header/cookie and require the router to pick a replica with `pg_last_wal_replay_lsn()` ≥ that LSN.
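A minimal router-side sketch of this gate, assuming psycopg2 and a plain list of replica DSNs; `caught_up`, `choose_replica` and the DSNs are illustrative names, not part of any proxy above.
```python
# LSN-gate routing (sketch): send the read to a replica that has replayed at least
# the LSN the client saw after its write; otherwise fall back to the leader.
import psycopg2

def caught_up(replica_dsn: str, min_lsn: str) -> bool:
    with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_last_wal_replay_lsn() >= %s::pg_lsn", (min_lsn,))
        return bool(cur.fetchone()[0])

def choose_replica(replica_dsns, min_lsn: str, leader_dsn: str) -> str:
    for dsn in replica_dsns:
        if caught_up(dsn, min_lsn):
            return dsn
    return leader_dsn  # nobody has caught up yet: read from the leader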
5) Sharding strategies
5.1 Key selection
The key should provide uniform distribution and locality of requests:
- Hash by `tenant_id`/`user_id` - even distribution, but you lose range scans.
- Range by time/ID - great for time-series/archives, but risk of a hot shard.
- Consistent hashing - makes adding/removing shards easy (see the sketch after this list).
- Directory/lookup table - flexible (any algorithm), but one more table/cache to maintain.
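A minimal consistent-hashing sketch; the `Ring` class, the number of virtual nodes and MD5 as the hash are assumptions for illustration, not a production recommendation.
```python
# Consistent-hash ring (sketch): adding or removing a shard remaps only ~1/N of the keys.
import bisect
import hashlib

class Ring:
    def __init__(self, shards, vnodes: int = 64):
        self._points = []            # sorted (hash, shard) points on the ring
        for shard in shards:
            self.add_shard(shard, vnodes)

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, shard: str, vnodes: int = 64) -> None:
        for i in range(vnodes):      # virtual nodes smooth out the distribution
            bisect.insort(self._points, (self._h(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        i = bisect.bisect(self._points, (self._h(key), ""))
        return self._points[i % len(self._points)][1]   # wrap around the ring

ring = Ring(["shard-0", "shard-1", "shard-2"])
print(ring.shard_for("tenant:42:order:1001"))
```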
5.2 Patterns
Shared-nothing: each shard is a separate database/cluster, the application knows routing.
Middleware sharding: Vitess (MySQL), Citus (Postgres); the proxy layer hides the topology.
Federation: separation of data domains by services (catalog, payments, auth).
5.3 Composite keys
Use a key space like '{tenant}:{entity}:{id}' and keep it consistent in the application and cache. For Postgres - hash partitioning + LIST/RANGE subpartitions.
PostgreSQL partitioning (fragment):
```sql
CREATE TABLE orders (
    tenant_id  int,
    id         bigint,
    created_at timestamptz,
    ...,
    PRIMARY KEY (tenant_id, id)
) PARTITION BY HASH (tenant_id);
CREATE TABLE orders_t0 PARTITION OF orders FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... t1..t15
```
6) ID generation
Avoid "hot" monotonic auto-increments when sharding.
Use Snowflake-like 64-bit IDs (time + region + shard + seq) or ULID/KSUID (monotonicity plus distribution).
For Postgres - a sequence per shard; for MySQL - auto_increment_increment/auto_increment_offset (different offsets on shard leaders).
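A Snowflake-like generator sketch; the exact bit layout (41-bit timestamp, 10-bit shard, 12-bit sequence) and the custom epoch are assumptions chosen for illustration.
```python
# Snowflake-like 64-bit ID (sketch): time-ordered, shard-aware, no central coordinator.
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (assumption)

class IdGenerator:
    def __init__(self, shard_id: int):
        assert 0 <= shard_id < 1024           # 10 bits for the shard
        self.shard_id = shard_id
        self.seq = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            ms = int(time.time() * 1000) - EPOCH_MS
            if ms == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence within one millisecond
                if self.seq == 0:                   # sequence exhausted: spin to the next ms
                    while ms <= self.last_ms:
                        ms = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = ms
            return (ms << 22) | (self.shard_id << 12) | self.seq

gen = IdGenerator(shard_id=3)
print(gen.next_id())
```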
7) Online resharding and migrations
Key principles: dual-write, idempotency, temporary double routing.
Steps (generalized):
1. Add the new shard/cluster.
2. Enable dual-read (consistency check).
3. Enable dual-write (to both shards) and record discrepancies (see the sketch below).
4. Backfill historical data (batches, logical/CDC replication).
5. Switch the "source of truth" to a new shard; leave "tail" synchronization.
6. Turn off the old one.
Tools: Vitess Resharding, Citus move shards, pg_logical/pgoutput, Debezium (CDC), gh-ost/pt-online-schema-change (DDL without locks).
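A dual-write/dual-read sketch for step 3 above; `write_old`, `write_new`, `read_old` and `read_new` are placeholders for your shard clients, and the reconciliation policy is an assumption.
```python
# Dual-write / dual-read during resharding (sketch): the old shard remains the source
# of truth until cutover; the new shard is best-effort, discrepancies are only logged.
import logging

log = logging.getLogger("resharding")

def dual_write(key, value, write_old, write_new):
    write_old(key, value)                  # must succeed: old shard is still authoritative
    try:
        write_new(key, value)              # should be idempotent (e.g. keyed by version)
    except Exception as exc:               # never fail the user request on the new path
        log.warning("dual-write failed on new shard key=%s err=%s", key, exc)

def dual_read(key, read_old, read_new):
    old, new = read_old(key), read_new(key)
    if old != new:
        log.warning("dual-read mismatch key=%s", key)   # feed into backfill/reconciliation
    return old                             # always serve from the source of truth
```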
8) Multi-region and geo-distribution
Leader-follower per region: local reads; writes go through the global leader (simple model, but cross-region RTT on writes).
Multi-leader: writes in both regions - you need conflict merging (timestamps/versions/CRDTs).
True distributed SQL (Raft): CockroachDB/Yugabyte - data is pinned to a region, queries go to the local quorum.
- Money/orders - CP (quorum/leader); catalogs/feeds - AP (cache, eventual consistency).
- Always plan write fencing (unique keys/versioning) for a possible split-brain.
9) Consistency in practice
Read-your-writes: the leader, or a replica that has "caught up" to the LSN/GTID.
Monotonic reads: never older than the last LSN already read.
Write-conflict control: `SELECT ... FOR UPDATE`, versions (`xmin`/`rowversion`), UPSERT with a version check (see the sketch after this list).
Idempotency: idempotency keys for payments/events.
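A sketch of the version-checked update mentioned above, assuming psycopg2 and an `orders` table with `status` and `version` columns (both are illustrative).
```python
# Optimistic concurrency (sketch): the UPDATE wins only if the row still carries the
# version the client read; a concurrent writer bumps the version and rowcount becomes 0.
import psycopg2

def update_order_status(dsn: str, order_id: int, new_status: str, expected_version: int) -> bool:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE orders
               SET status = %s, version = version + 1
               WHERE id = %s AND version = %s""",
            (new_status, order_id, expected_version),
        )
        return cur.rowcount == 1   # False: someone else won the race, retry or report conflict
```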
10) Observability, SLO and alerts
Replica lag: time (seconds), LSN distance (bytes), seconds_behind_master (MySQL).
Forced rollbacks/conflicts, replication errors.
p95/p99 latency per route (read from leader vs replica, write).
Throughput: TPS/locks/row-contended tables.
Bloat/VACUUM (PG), InnoDB buffer pool hit ratio (MySQL).
Dashboards: per-shard load, "hot" shards, key distribution.
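A toy imbalance report matching the dashboard idea above and the < 20% target in § 16; the QPS numbers and threshold are illustrative and would normally come from your metrics store.
```python
# Hot-shard detection (sketch): flag shards whose QPS exceeds the mean by > 20%.
def imbalance_report(qps_by_shard: dict, threshold: float = 0.20):
    mean = sum(qps_by_shard.values()) / len(qps_by_shard)
    hot = {s: q for s, q in qps_by_shard.items() if q > mean * (1 + threshold)}
    return mean, hot

mean, hot = imbalance_report({"shard-0": 900, "shard-1": 1100, "shard-2": 2400})
print(f"mean QPS = {mean:.0f}, hot shards = {hot}")   # shard-2 is a hot-shard candidate
```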
11) Backups, PITR and DR
Full backup + WAL/binlog for PITR (point-in-time recovery).
Store in another region/cloud, do restore tests regularly.
For shards, take a consistent "slice" (coordinated by time/LSN) or rely on application-level idempotency during recovery.
RPO/RTO are written down and tested on game-days.
```bash
pg_basebackup -D /backups/base -X stream -C -S slot_replica_1
# WAL archiving via archive_command or pgBackRest/Barman
```
12) Security and access
Network segmentation via VPC/ACL, mTLS at the proxy layer.
Roles/grants on the principle of least privilege; separate users per shard/role.
Audit DDL/DCL; limits on "heavy" queries on replicas.
Encryption at rest (KMS) and in transit (TLS).
Panic button: a global "READ ONLY" mode for the duration of the incident/investigation (a sketch follows below).
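A sketch of such a panic button for PostgreSQL using the standard `default_transaction_read_only` setting; the DSN and function name are placeholders, and note that individual sessions can still override the setting per transaction.
```python
# Global "read only" switch (sketch): new transactions become read-only by default
# until the incident is over; requires a superuser connection.
import psycopg2

def set_global_read_only(dsn: str, enabled: bool = True) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        cur.execute("ALTER SYSTEM SET default_transaction_read_only = %s",
                    ("on" if enabled else "off",))
        cur.execute("SELECT pg_reload_conf()")
    conn.close()
```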
13) Tools and bricks
PostgreSQL: Patroni (HA), PgBouncer (pooling/RO routing), repmgr, pgBackRest/Barman (backups), Citus (sharding), pglogical/Logical Replication, pgbadger/pg_stat_statements.
MySQL: Orchestrator (topologies/auto-failover), ProxySQL/MaxScale (routing), Percona XtraBackup (backup), Group Replication/InnoDB Cluster, Vitess (sharding/resharding).
Distributed SQL: CockroachDB, YugabyteDB (quorum, built-in sharding/geolocation).
CDC: Debezium + Kafka/Pulsar for events/ETL.
14) Anti-patterns
Single-primary without auto-failover and without DR tests.
"Magic" read/write splitting that ignores lag → phantom reads and hard-to-reproduce bugs.
Sharding "for the sake of sharding": premature complexity instead of vertical scaling/indexes/caching.
Hot range (time-range) keys without a time bucket/hash salt → one shard melts down.
Global 2PC transactions across dozens of shards in OLTP - heavy p99 tails and frequent locks.
No dual-write/dual-read during migrations → data loss/divergence.
DDL in prod without online tools and without compatibility feature flags.
15) Implementation checklist (0-60 days)
0-15 days
Define DB SLO, RPO/RTO.
Enable replication, lag monitoring, basic backups + PITR.
Introduce a router (PgBouncer/ProxySQL) and a read-after-write policy.
16-30 days
Choose a sharding strategy; describe the keys and schemas.
Prepare resharding tools (Vitess/Citus/CDC).
Build a catalog of services/tables marked "read-stale-ok" vs "strict".
31-60 days
Run a pilot shard, dual-read, and backfill.
Game-day: leader failover, recovery from PITR, region switch.
Automate reporting of hot shard keys and distribution skew.
16) Maturity metrics
Replica lag p95 < target (e.g. 500 ms) for critical reads.
Successful DR tests ≥ 1 per quarter (restore ≤ RTO, data loss ≤ RPO).
Load distribution across shards: imbalance < 20% by QPS/storage.
Percentage of requests with strict-consistency correctly routed = 100%.
Zero-data-loss in incidents requiring CP guarantees (money/orders).
Online DDL/migration without downtime, with compatibility flags.
17) Recipe examples
Hash salt for a time-range key, so one shard does not overheat (illustrative table names):
```sql
-- bucket = hash(user_id) % 16 is computed in the application; store (bucket, created_at)
-- in PostgreSQL a LIST partition can itself be partitioned by RANGE
CREATE TABLE events (..., bucket int, created_at timestamptz) PARTITION BY LIST (bucket);
CREATE TABLE events_b0 PARTITION OF events FOR VALUES IN (0) PARTITION BY RANGE (created_at);
```
Read-my-writes middleware (pseudocode):
```python
# after a successful write on the leader: remember its current LSN and pin the session
lsn = db.leader_query("SELECT pg_current_wal_lsn()")
ctx.sticky_until = now() + 5s
ctx.min_lsn = lsn
# in the read router: pick a replica with last_lsn >= ctx.min_lsn, otherwise go to the leader
```
Vitess VSchema (fragment):
```json
{
  "tables": {
    "orders": { "column_vindexes": [{ "column": "tenant_id", "name": "hash" }] }
  }
}
```
18) Conclusion
Sharding and replication are not only technology but also process: consistency-aware routing, migration discipline (dual-write/read, backfill), regular DR tests, and observability of lag and hot shards. Start with a simple leader→replica setup plus read-after-write routing, then add sharding where the load profile truly requires it. Use ready-made platforms (Vitess/Citus/Distributed SQL) and keep business-critical data in CP mode - this way the database stops being a bottleneck and becomes a predictable, elastic foundation of the platform.