Generating IDs
1) Why pay attention to identifiers
An identifier (ID) is the fundamental key of an entity: a database row, a message, a file, an order. Its properties determine:
- Uniqueness and scale (collisions, horizontal growth).
- Order and sorting (time correlation, replication, dedup).
- Storage performance (indexes, hot pages, key size).
- Security (unpredictability, leaks, guessing).
- Usability/integration (short, URL-safe, case-insensitive).
Choosing an ID format is a trade-off between entropy, sortability, length, generation rate, and operational cost.
2) Key requirements and terms
Uniqueness: the probability of collision must be lower than the acceptable risk.
Entropy: how much randomness the ID contains, in bits.
Time-sortable/k-sortable: lexicographic order approximately matches order by creation time.
Monotonicity: a non-decreasing sequence within a node/stream.
Insert locality: how strongly new inserts concentrate at the "tail" of the index (risk of hot pages).
Predictability: whether neighboring IDs can be guessed (important for security/APIs).
Representation: binary/string, Base16/32/36/58/64, hyphens, case.
3) Major identifier families
3.1 UUID
v4 (random): 122 bits of entropy. Unordered; good for security and simplicity. Downside: random distribution scatters inserts across the index (poor locality), although it also spreads load evenly and avoids "hot pages."
v1 (time + MAC): time-based, but embeds the MAC address and timestamp (a privacy concern); often avoided.
v7 (time-ordered): millisecond timestamp + random part. Designed for lexicographic sorting by time and good compression in the database. Trade-off: a "hot tail" appears in the index; mitigated by sharding/prefixes/increment.
Tips
For external APIs and loose ordering requirements: v4.
For event/log databases and "sorted" keys: v7.
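To make the v7 layout concrete, here is a minimal Python sketch that packs a 48-bit millisecond timestamp, the version and variant bits, and random bits into a 128-bit value. The function name `uuid7_sketch` is hypothetical, and this is an illustration of the bit layout rather than a replacement for a maintained library.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value: 48-bit ms timestamp + version + random bits + variant."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits; 74 of them are used
    rand_a = rand >> 68                            # 12 bits placed after the version
    rand_b = rand & ((1 << 62) - 1)                # 62 bits placed after the variant
    value = (unix_ms & ((1 << 48) - 1)) << 80      # timestamp occupies the top 48 bits
    value |= 0x7 << 76                             # version = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


earlier = uuid7_sketch(1_000)
later = uuid7_sketch(2_000)
print(earlier, later, str(earlier) < str(later))
```

Because the timestamp sits in the most significant bits, both the binary form and the canonical string form sort by creation time at millisecond granularity; IDs created within the same millisecond are ordered randomly in this sketch.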
3.2 ULID (Crockford Base32)
128 bits: 48 bits of time (ms) + 80 bits of randomness. Sorts lexicographically by time, human-friendly (the alphabet omits I, L, O, U), URL-safe. There is a monotonic variant: within the same timestamp, the random part is incremented.
Pros: readability, sortability, portability.
Cons: at a very high insert rate within the same time window, a "hot tail" forms in the index.
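The 48 + 80 bit structure and the Crockford alphabet can be sketched in a few lines of Python. The name `ulid_sketch` is hypothetical, and the sketch implements only the plain (non-monotonic) form; a real project would normally use an existing ULID library.

```python
import os
import time

# Crockford Base32 alphabet: digits plus letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_sketch(unix_ms=None):
    """Encode 48 bits of time (ms) + 80 random bits as 26 Crockford Base32 characters."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):                 # 26 characters * 5 bits cover the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


print(ulid_sketch())
```

The alphabet is in ascending ASCII order, so comparing the strings gives the same result as comparing the underlying 128-bit numbers.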
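3.3 KSUID
160 bits: a 32-bit timestamp (seconds) relative to a custom epoch + 128 bits of randomness. The 32-bit seconds counter covers more than 100 years from that epoch, and sorting by time is stable at second granularity. The string form is not shorter than a ULID: it is slightly longer (27 characters versus 26) and uses its own Base62 encoding. Good for distributed logs and objects.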
3.4 Snowflake-like (k-sortable flake IDs)
Classic layout (customizable):
[ timestamp bits ][ region/datacenter bits ][ worker bits ][ sequence bits ]
Properties: monotonic growth on a node, near-global uniqueness, short (64-bit) binary representation.
Risks: clock dependence (time drift/regression), sequence exhaustion within one tick, coordination of region/worker bits.
Mitigations: protection against the clock moving backwards, spare sequence capacity, a clock-drift detector, PTP/NTP discipline.
3.5 DB sequences (SEQUENCE/IDENTITY)
The simplest monotonic generation within one DBMS/shard.
Pros: short, fast, convenient for local tables.
Cons: hard to make global in a distributed cluster; predictable (unsafe as a public identifier); creates a hot tail in the index.
3.6 Content-addressed IDs (content hash)
SHA-256/BLAKE3 of the content → stable ID, deduplication, integrity checking, caching.
Pros: determinism, protection against substitution; collisions are practically impossible.
Cons: generation costs CPU, no time sorting, long IDs.
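A minimal Python sketch of a content-addressed ID using the standard `hashlib` module follows. The `sha256:` prefix and the function name `content_id` are illustrative conventions assumed here, not something the text prescribes.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive a stable ID from the content itself (prefix naming the hash is an assumption)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Identical content always maps to the same ID, which gives deduplication for free.
store = {}
for blob in (b"build-artifact-1", b"build-artifact-2", b"build-artifact-1"):
    store[content_id(blob)] = blob

print(len(store))  # 2 distinct blobs are stored
```

Recomputing the hash when the content is read back doubles as an integrity check: a mismatch means the stored bytes were altered.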
4) Collisions and the "birthday paradox" (intuitive)
The collision probability for random IDs of b bits after n generations is approximately:
p ≈ 1 - exp(-n(n-1) / 2^(b+1)) ≈ n^2 / 2^(b+1) (for small p)
Examples:
- UUIDv4 (122 bits) at n = 10^12 (a trillion) → p ≈ 1e-13 (negligible).
- 64-bit random at n = 10^9 → p ≈ 0.027 (a notable risk).
- Conclusion: 64 random bits are often not enough for very large systems; use 96/128 bits.
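The birthday approximation is easy to evaluate for your own values of b and n. The helper below is a small Python sketch (the name `collision_probability` is hypothetical); it uses `math.expm1` so that very small probabilities are not lost to floating-point rounding.

```python
import math


def collision_probability(bits: int, n: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n * (n - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)   # equals 1 - exp(-exponent), accurate for tiny values


print(collision_probability(122, 10**12))  # UUIDv4, a trillion IDs
print(collision_probability(64, 10**9))    # 64-bit random, a billion IDs
```

Plugging in your expected record count before choosing a format turns "is 64 bits enough?" into a number you can compare against an explicit risk threshold.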
5) Indexes, hot pages and storage
Random keys (v4) spread inserts evenly across the index tree → there is no "tail," but cache locality is worse.
Time-sorted keys (v7/ULID/Snowflake) are inserted at the "tail" → better locality and compression, but a risk of hot pages under highly parallel writes.
Mitigations for the hot tail:
- prefixes/sharding by tenant/region (add 1-2 bytes before the time);
- interleaving: put part of the randomness in the high bits;
- batch inserts, fillfactor in the B-tree, switching to BRIN/clustering for large logs.
Key size: 'BIGINT'/'INT8' (8 B) instead of 'UUID' (16 B) saves memory/cache; Base32/58/64 strings are about 33-60% larger than the binary form. In the database, store binary; serialize to a string at the edge.
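As an illustration of the prefix idea, the Python sketch below prepends a 2-byte big-endian tenant number to a binary ID. The function name `prefixed_key` and the 2-byte width are assumptions made for the example.

```python
import struct


def prefixed_key(tenant_id: int, id_bytes: bytes) -> bytes:
    """Prepend a 2-byte big-endian tenant prefix so each tenant gets its own index tail."""
    return struct.pack(">H", tenant_id) + id_bytes


key = prefixed_key(7, bytes(16))
print(len(key), key[:2].hex())  # 18 bytes, prefix 0007
```

With this layout, time-sorted IDs from different tenants land in different regions of the index, so concurrent writers no longer compete for the same rightmost pages; the cost is 2 extra bytes per key and loss of global time order across tenants.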
6) Security and privacy
Do not use SEQUENCE/INT values as public IDs in URLs/APIs: they are guessable, which allows resource enumeration.
Use random, unpredictable IDs (v4/v7/ULID/KSUID) for external references.
Do not encode PII into an ID. If you need to embed an attribute, encrypt/sign it (for example, JWE/JWS) or use opaque tokens.
URL-safe encodings: Base32 Crockford, Base58 (without '0OIl'), Base64url.
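A minimal Python sketch of the opaque-token approach: the internal numeric ID never leaves the service, and clients only see a random URL-safe token. The names `issue_public_id`, `resolve`, and the in-memory dictionary are hypothetical; in practice the mapping would be a unique indexed column.

```python
import secrets

# In-memory stand-in for a lookup table (assumption: a real system stores this in the DB).
public_to_internal = {}


def issue_public_id(internal_id: int) -> str:
    """Create a 128-bit random, URL-safe token that maps to an internal sequence value."""
    token = secrets.token_urlsafe(16)   # 16 random bytes -> 22 Base64url characters
    public_to_internal[token] = internal_id
    return token


def resolve(token: str):
    """Return the internal ID, or None for unknown (for example, guessed) tokens."""
    return public_to_internal.get(token)


token = issue_public_id(42)
print(token, resolve(token))
```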
7) Multi-tenancy, prefixes and routing
Format: '[TENANT_PREFIX]-[ID]', or in binary: 'tenant_id || id'.
Pros: fast per-tenant filters/partitions, protection against N+1 scans.
Cons: may reduce the entropy density in the high bits → check the distribution (hash the prefix if needed).
A hash suffix (2-3 bytes) reduces collisions and helps shard routing: 'shard = hash(id) % N'.
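The routing rule can be sketched in Python as follows. One assumption worth stating: the hash must be stable across processes, so the sketch uses `hashlib.blake2b` rather than Python's built-in `hash()`, which is randomized per process for bytes and strings. The function name `shard_for` is hypothetical.

```python
import hashlib


def shard_for(id_bytes: bytes, num_shards: int) -> int:
    """Map an ID to a shard with a process-independent hash: shard = hash(id) % N."""
    digest = hashlib.blake2b(id_bytes, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_shards


print(shard_for(b"order-123", 16))
```

Note that plain modulo routing remaps most keys when N changes; if shards are added often, consistent hashing or a fixed number of virtual shards is the usual alternative.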
8) Practical recommendations for selection
API, public links, distributed services without strict order: UUIDv4, ULID/KSUID.
Logs/events/orders, where we often sort by time: UUIDv7 or ULID (monotone).
Very high throughput with local monotonicity and a short key: Snowflake-like 64-bit (clock discipline required).
Artifact/build/blob stores: content-addressable (SHA-256), with a short human-friendly alias on top (Hashids/link).
Local tables in one database: SEQUENCE/IDENTITY + external "wrapper" for public links (masking).
9) Implementations and examples
9.1 PostgreSQL
Store UUIDs in binary form (the 'uuid' type); index with 'btree' or 'hash' as needed.
```sql
-- uuid-ossp is only needed for uuid_generate_v4();
-- gen_random_uuid() is built in since PostgreSQL 13.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE orders (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(), -- or uuid_generate_v4()
    created_at timestamptz NOT NULL DEFAULT now(),
    tenant smallint NOT NULL
);
-- For time-sortable IDs (UUIDv7), store them as binary (uuid) and generate them in the application.
-- If you need ordering by time:
CREATE INDEX ON orders (created_at DESC);
```
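Hot-tail mitigation: for time-sorted IDs, add a "salt" to the upper bits or partition by tenant:
```sql
-- Requires orders to be declared with PARTITION BY LIST (tenant)
-- and a primary key that includes tenant, e.g. PRIMARY KEY (tenant, id).
CREATE TABLE orders_t1 PARTITION OF orders FOR VALUES IN (1);
CREATE TABLE orders_t2 PARTITION OF orders FOR VALUES IN (2);
```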
9.2 Redis (atomic counters/monotonicity)
```bash
INCR "seq:orders"   # local sequence
# combine: (epoch_ms << 20) | (worker_id << 10) | (seq & 1023)
```
9.3 Snowflake-like generator (pseudocode)
```pseudo
const EPOCH = 1704067200000  # custom epoch (ms)
state: last_ms=0, seq=0, worker=7, region=3
next():
  now = epoch_ms()
  if now < last_ms:
    now = wait_until(last_ms)        # protection against the clock moving back
  if now == last_ms:
    seq = (seq + 1) & ((1<<12)-1)    # 12 bits
    if seq == 0:
      now = wait_next_ms()           # sequence exhausted in this millisecond
  else:
    seq = 0
  last_ms = now
  return (now-EPOCH)<<22 | region<<17 | worker<<12 | seq
```
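The pseudocode above can be turned into a runnable Python sketch. The class name `SnowflakeSketch`, the 5/5/12 split of region/worker/sequence bits (implied by the shifts 17, 12, and 0), the injectable `clock`, and the sleep-based waiting are assumptions made for illustration; a production generator would also need worker-ID coordination and alerting on clock regressions.

```python
import threading
import time

EPOCH_MS = 1704067200000                      # custom epoch from the pseudocode (2024-01-01 UTC)
SEQ_BITS, WORKER_BITS, REGION_BITS = 12, 5, 5


class SnowflakeSketch:
    def __init__(self, region, worker, clock=lambda: time.time_ns() // 1_000_000):
        assert 0 <= region < (1 << REGION_BITS)
        assert 0 <= worker < (1 << WORKER_BITS)
        self.region, self.worker, self.clock = region, worker, clock
        self.last_ms, self.seq = 0, 0
        self.lock = threading.Lock()

    def _wait_until(self, target_ms):
        """Block until the clock reaches target_ms and return the fresh reading."""
        now = self.clock()
        while now < target_ms:
            time.sleep(0.0005)
            now = self.clock()
        return now

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:                       # clock moved backwards
                now = self._wait_until(self.last_ms)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & ((1 << SEQ_BITS) - 1)
                if self.seq == 0:                        # sequence exhausted in this ms
                    now = self._wait_until(self.last_ms + 1)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.region << 17) | (self.worker << 12) | self.seq


gen = SnowflakeSketch(region=3, worker=7)
print(gen.next_id())
```

Waiting is the simplest reaction to a clock regression; for large jumps it can stall the generator, so some implementations instead fail fast and let the caller retry on another node.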
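9.4 ULID/UUID in applications
Go
```go
// Assumes github.com/oklog/ulid/v2 and github.com/google/uuid;
// the source does not name the packages. Also needs "math/rand" and "time".

// ULID
t := time.Now().UTC()
entropy := ulid.Monotonic(rand.New(rand.NewSource(t.UnixNano())), 0)
id := ulid.MustNew(ulid.Timestamp(t), entropy)

// UUID v7 (if the library supports it)
id7 := uuid.Must(uuid.NewV7())
```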
Node.js
```js
import { ulid } from 'ulid';
import { v4 as uuidv4 } from 'uuid';

const id1 = ulid();   // ULID
const id2 = uuidv4(); // UUID v4
```
Python
```python
import uuid

id_v4 = uuid.uuid4()
# For v7, use a library (for example, third-party uuid6/uuid7 packages).
```
10) Encodings and representations
Binary in the database ('BYTEA', 'UUID') → compact and fast. At the edge, convert to:
- Base32 Crockford (ULID): case-insensitive, no visually similar characters.
- Base58: shorter than Base32, no visually similar characters, good for human-readable tokens, URL-safe.
- Base64url: the shortest, but case-sensitive and includes '-' and '_'.
Normalize case and format (with or without hyphens) to avoid duplicates when comparing strings.
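The size trade-off and the normalization step can be checked with the Python standard library. In this sketch, `encoded_lengths` and `canonical` are hypothetical helper names, and the standard RFC 4648 Base32 alphabet stands in for Crockford's (the length is the same).

```python
import base64
import uuid

raw = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11").bytes   # 16 bytes in the database


def encoded_lengths(data: bytes) -> dict:
    """String lengths of common encodings of the same binary ID (padding stripped)."""
    return {
        "hex": len(data.hex()),
        "base32": len(base64.b32encode(data).rstrip(b"=")),
        "base64url": len(base64.urlsafe_b64encode(data).rstrip(b"=")),
    }


def canonical(text: str) -> str:
    """Normalize case and hyphenation so equal IDs compare equal as strings."""
    return str(uuid.UUID(text))


print(encoded_lengths(raw))
print(canonical("A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11"))
```

Applying a single canonicalization function at every boundary (API input, logs, caches) is what prevents the "same ID, different spelling" class of duplicates.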
11) Test playbooks and observability
Collisions: metric 'id_collision_total' (must be 0), alert when > 0.
Prefix distribution: histogram of the high bytes; look for skew.
Generation rate: 'ids_per_sec', p99 generator latency.
Clock skew (for Snowflake): clock offset between nodes, "clock went back" events.
Index tails: p95/p99 'INSERT' latency; share of lock waits/hot pages.
Test scenarios:
- Inject clock drift/rollback → make sure the generator waits or switches over.
- 'sequence' overflow within one millisecond → check that it waits for the next millisecond.
- Massive parallelism → check for lock storms in the index.
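The prefix-distribution check can be prototyped offline before wiring it into metrics. The Python sketch below computes a simple skew ratio over the first byte of each ID; the function name `high_byte_skew` and the ratio itself are illustrative assumptions, not a standard metric.

```python
import os
from collections import Counter


def high_byte_skew(ids) -> float:
    """Ratio of the most frequent first byte to the count expected under a uniform spread."""
    counts = Counter(raw[0] for raw in ids)
    expected = len(ids) / 256
    return max(counts.values()) / expected


random_ids = [os.urandom(16) for _ in range(51200)]
# Stand-in for time-prefixed IDs generated close together: identical leading byte.
time_like_ids = [b"\x01" + os.urandom(15) for _ in range(51200)]

print(high_byte_skew(random_ids))     # close to 1 for random keys
print(high_byte_skew(time_like_ids))  # 256: every insert shares one leading byte
```

A ratio near 1 means inserts are spread across the key space; a large ratio is expected for time-sorted formats and is only a problem when it coincides with rising 'INSERT' latency or lock waits.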
12) Anti-patterns
AUTO_INCREMENT/SEQUENCE as a public ID: guessable and leaks information. Expose an opaque public ID on top of the internal one.
Exposing UUIDv1 (MAC/time) externally: privacy leak.
64-bit random IDs for a trillion records: a real risk of collisions.
A global "central generator" without HA: SPOF and bottleneck.
Time-sorted IDs without protection against the clock moving back: duplicates/order regression.
Mixing different ID formats without an explicit version/prefix → chaos in debugging/migrations.
Storing IDs as strings with inconsistent case/formats → hidden duplicates.
13) Implementation checklist
- A format (v4/v7/ULID/KSUID/Snowflake/SEQ/hash) is selected to match the domain requirements.
- Ordering requirements are defined (whether sortability is needed).
- The collision probability (b bits, n generations) is estimated and a risk threshold is set.
- The encoding is designed (binary in the DB + human-readable external representation).
- For time-sorted IDs: protection against the clock moving back, sequence limits, and NTP/PTP discipline.
- For public IDs: unpredictability (random/ULID/KSUID), no PII.
- Shard routing (hash(id) % N) and multi-tenant prefixes are designed.
- Observability: collision, distribution, latency, clock skew metrics.
- Test cases for sequence overflow, contention, and window length.
- Documentation of the format, version, epoch, bit layout, and migration plan.
14) FAQ
Q: What should be the default choice for microservices?
A: UUIDv7 or ULID: time ordering, plenty of entropy, simple generation at the edge. For external APIs, ULID/UUIDv4 is also appropriate.
Q: I need a short, human-readable ID.
A: ULID/KSUID, or a Base58 encoding of a 128-bit random/time-based ID. Keep length and collisions in mind.
Q: Is it possible to have "short numeric" IDs that are still safe?
A: Yes: keep the internal SEQ, and expose an opaque token externally (random 96-128 bits) or Hashids with salt + signature.
Q: How do I migrate from SEQ to UUIDv7?
A: Add a new column 'id_new' (UUID), dual-write, publish references to the new ID, then switch the primary/foreign keys and drop the old column.
Q: Why did my ULID inserts get "hot"?
A: Because strictly increasing keys are all inserted into the tail of one index. Partition by tenant, mix randomness into the high-order bits, use batch inserts.
15) Summary
A good ID has the right set of properties for the problem: enough entropy, predictable sorting (if needed), safe public exposure, and healthy index behavior in operation. Choose UUIDv4/ULID/UUIDv7/KSUID for simplicity and distribution, Snowflake for dense monotonic sequences and short keys (given clock discipline), sequences for local tables, content hashes for artifacts. Build in observability and tests, and identifiers will stop being a source of surprises.