Event deduplication
1) Why deduplication
Duplicates appear due to retries, network timeouts, failover, and replay of historical data. If they are not controlled:
- invariants are violated (double debits, repeated email/SMS, an order "created twice");
- costs increase (repeated writes/reprocessing);
- analytics are distorted.
The goal of deduplication is to ensure a one-time observed effect despite acceptable transport-level repetitions, often together with idempotency.
2) Where to place deduplication (tiers)
1. Edge/API gateway - reject obvious duplicates by `Idempotency-Key`/body + signature.
2. Broker/stream - logical dedup by key/sequence, coalescing via compaction (used less often due to cost).
3. Event receiver (consumer) - main location: Inbox/key table/cache.
4. Sink (DB/cache) - unique keys/UPSERT/versions/compaction.
5. ETL/analytics - dedup by time window and key in columnar stores.
The rule: dedup as early as possible, but weigh the cost of false positives and the need for replay.
3) Deduplication keys
3.1 Natural (preferred)
`payment_id`, `order_id`, `saga_id#step`, `aggregate_id#seq`.
These are stable and carry business meaning.
3.2 Composite
`(tenant_id, type, external_id, version)` or `(user_id, event_ts_truncated, payload_hash)`.
3.3 Fingerprint
A hash of a deterministic subset of fields (normalize field order and character case), optionally `HMAC(secret, payload)`.
3.4 Sequences/versions
A monotonic `seq` per aggregate (optimistic locking/versioning).
Anti-pattern: a random UUID with no link to the business entity - deduplication becomes impossible.
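As a sketch of the fingerprint approach above (function and field names are illustrative): canonicalize the payload deterministically before hashing, optionally signing with an HMAC secret so keys cannot be forged.

```python
import hashlib
import hmac
import json

def fingerprint(payload: dict, secret: bytes = b"") -> str:
    """Hash a canonical form of the payload: sorted keys and fixed
    separators make the bytes deterministic regardless of field order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    if secret:
        # Optional HMAC: prevents an attacker from crafting colliding keys.
        return hmac.new(secret, canonical, hashlib.sha256).hexdigest()
    return hashlib.sha256(canonical).hexdigest()
```

The same logical payload always yields the same key, while any change to field values changes it.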
4) Time windows and order
The deduplication window is the period during which an event may arrive again (usually 24-72 hours; longer for finance).
Out-of-order: allow for lateness. In streaming frameworks - event time + watermarks.
Sliding/fixed-window dedup: "have we seen this key in the last N minutes?"
Sequence-aware: if `seq` ≤ the last processed value, it is a duplicate/repeat.
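A minimal in-memory illustration of the sequence-aware rule (production code would keep this state in a durable store):

```python
class SequenceDedup:
    """Tracks the last processed seq per aggregate;
    seq <= last seen means duplicate or out-of-order repeat."""

    def __init__(self) -> None:
        self.last: dict[str, int] = {}  # aggregate_id -> last processed seq

    def admit(self, aggregate_id: str, seq: int) -> bool:
        if seq <= self.last.get(aggregate_id, 0):
            return False  # duplicate/repeat -> skip
        self.last[aggregate_id] = seq
        return True       # new event -> process
```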
5) Data structures and implementations
5.1 Exact tracking
Redis key + TTL: `SET key 1 NX EX 86400` → "first time - process, otherwise - SKIP."
LRU/LFU cache (in-process): fast, but volatile → better only as a first barrier.
SQL unique indexes + UPSERT: "insert or update" (idempotent effect).
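A sketch of the in-process first barrier mentioned above, mimicking the semantics of Redis `SET key 1 NX EX ttl` without a network hop (volatile: a restart loses the state, so it must be backed by a durable layer):

```python
import time
from typing import Optional

class TtlBarrier:
    """In-process dedup barrier with per-key TTL, modeled on SET NX EX."""

    def __init__(self) -> None:
        self._seen: dict[str, float] = {}  # key -> expiry timestamp

    def first_seen(self, key: str, ttl: float = 86400.0,
                   now: Optional[float] = None) -> bool:
        t = time.monotonic() if now is None else now
        expiry = self._seen.get(key)
        if expiry is not None and expiry > t:
            return False  # duplicate inside the window -> SKIP
        self._seen[key] = t + ttl  # (re)admit and restart the window
        return True       # first time -> PROCESS
```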
5.2 Approximate (probabilistic) structures
Bloom/Cuckoo filter: memory-cheap, but false positives are possible. Suitable for dropping obviously "noisy" streams (e.g., telemetry), not for finance/orders.
Count-Min Sketch: frequency estimates to guard against "hot" duplicates.
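To make the trade-off concrete, here is a toy Bloom filter (sizes are illustrative): no false negatives, but a new key can occasionally be reported as seen, which is why it fits telemetry and not payments.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: k hash positions per key over a bit array."""

    def __init__(self, bits: int = 1 << 20, hashes: int = 5) -> None:
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def seen_before(self, key: str) -> bool:
        """Test membership, then add. May return True for a new key (FP),
        never False for a key that was actually added."""
        hit = True
        for p in self._positions(key):
            byte, bit = divmod(p, 8)
            if not (self.array[byte] >> bit) & 1:
                hit = False
                self.array[byte] |= 1 << bit
        return hit
```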
5.3 Streaming state
Kafka Streams/Flink: keyed state store with TTL, dedup by key within a window; checkpoint/restore.
Watermark + allowed lateness: controls the window for late events.
6) Transactional patterns
6.1 Inbox (incoming table)
Persist the `message_id`/key and the result before side effects:

```pseudo
BEGIN;
ins = INSERT INTO inbox(id, received_at) ON CONFLICT DO NOTHING;
IF ins_not_inserted THEN RETURN cached_result;
result = handle(event);
UPSERT sink with result; -- idempotent sink write
UPDATE inbox SET status='done', result_hash=... WHERE id=...;
COMMIT;
```

On replay, the existing row is found and the effect is not repeated.
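The pseudocode above can be sketched with SQLite standing in for PostgreSQL (the table shape mirrors the one from this document; `handler` represents the business side effect):

```python
import sqlite3

def handle_once(conn: sqlite3.Connection, msg_id: str, handler) -> str:
    """Inbox pattern: insert-or-ignore the message id; if the row already
    existed, skip the side effect entirely."""
    conn.execute("""CREATE TABLE IF NOT EXISTS inbox(
        id TEXT PRIMARY KEY,
        status TEXT DEFAULT 'received')""")
    with conn:  # one transaction covers the inbox row and the effect
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox(id) VALUES (?)", (msg_id,))
        if cur.rowcount == 0:
            return "SKIP"      # already seen -> do not repeat the effect
        handler()              # the (idempotent) side effect goes here
        conn.execute("UPDATE inbox SET status='done' WHERE id=?", (msg_id,))
    return "PROCESS"
```

Redelivering the same `msg_id` hits the primary key and the handler never runs twice.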
6.2 Outbox
Write the business record and the event in one transaction → a publisher ships it to the broker. Does not eliminate duplicates on the consumer side, but eliminates "holes" (lost events).
6.3 Unique indexes/UPSERT

```sql
INSERT INTO payments(id, status, amount)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO NOTHING; -- "create once"
```

Or a controlled version upgrade:

```sql
UPDATE orders
SET status = $new, version = version + 1
WHERE id = $id AND version = $expected; -- optimistic locking
```
6.4 Aggregate versioning
An event is applicable if `event.version = aggregate.version + 1`. Otherwise it is a duplicate/repeat/conflict.
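The version gate above, as a small sketch (dictionary-based aggregates for illustration; a real system would persist the version with the aggregate):

```python
def apply_event(aggregate: dict, event: dict) -> str:
    """Apply an event only if event.version == aggregate.version + 1."""
    if event["version"] <= aggregate["version"]:
        return "DUPLICATE"   # already applied -> skip
    if event["version"] > aggregate["version"] + 1:
        return "GAP"         # out of order -> buffer or refetch
    aggregate.update(event["data"])
    aggregate["version"] = event["version"]
    return "APPLIED"
```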
7) Dedup and brokers/streams
7.1 Kafka
Idempotent producer eliminates duplicate writes on entry.
Transactions allow committing offsets and output records atomically.
Compaction: keeps the last value per key - post-factum dedup/coalescing (not for payments).
Consumer side: state store/Redis/DB for window keys.
7.2 NATS / JetStream
Ack/redelivery → at-least-once. Dedup on the consumer (Inbox/Redis).
JetStream sequence numbers and consumer state make repeats easier to identify.
7.3 Queues (RabbitMQ/SQS)
Visibility timeout + redeliveries → you need a key + a dedup store.
SQS FIFO with `MessageGroupId`/`MessageDeduplicationId` helps, but the TTL window is provider-limited - keep keys longer if the business requires it.
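A hedged sketch of the SQS FIFO parameters just mentioned: the helper below only builds the `send_message` keyword arguments (the actual boto3 call, commented out, assumes a queue URL and client you would supply). SQS drops repeats of the same `MessageDeduplicationId` within its roughly 5-minute provider window.

```python
def fifo_message(body: str, group: str, dedup_id: str) -> dict:
    """Build SQS FIFO send_message kwargs with an explicit dedup id."""
    return {
        "MessageBody": body,
        "MessageGroupId": group,             # ordering scope
        "MessageDeduplicationId": dedup_id,  # provider-side dedup key
    }

# With boto3 (illustrative; queue_url/sqs client are assumptions):
# sqs.send_message(QueueUrl=queue_url,
#                  **fifo_message("payload", "tenant-1", "payment-42"))
```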
8) Storage and analytics
8.1 ClickHouse/BigQuery
Dedup by window: `ORDER BY key, ts` and `argMax`/`anyLast` with a condition.
ClickHouse:

```sql
SELECT key,
       anyLast(value) AS v
FROM t
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY key;
```

Or a materialized layer of "unique" events (merged by key/version).
8.2 Logs/telemetry
Approximate dedup (Bloom) at ingest is acceptable → saves network/disk.
9) Reprocessing, replay and backfill
Dedup keys must survive replay (TTL ≥ replay window).
For backfill, use a versioned key space (`key#source=batch2025`) or separate streams so as not to interfere with the online window.
Store result artifacts (hash/version) - this enables fast-skip on replays.
10) Metrics and observability
`dedup_hit_total`/`dedup_hit_rate` - the share of duplicates caught.
`dedup_fp_rate` for probabilistic filters.
`window_size_seconds` - actual window size (from late-arrival telemetry).
`inbox_conflict_total`, `upsert_conflict_total`.
`replayed_events_total`, `skipped_by_inbox_total`.
Profiles by tenant/key/type: where duplicates concentrate and why.
Logs: `message_id`, `idempotency_key`, `seq`, `window_id`, `action=process|skip`.
11) Security and privacy
Do not put PII in keys; use hashes/aliases.
Sign the fingerprint - `HMAC(secret, canonical_payload)` - to prevent collisions/forgery.
Align key retention with compliance (GDPR retention).
12) Performance and cost
In-proc LRU ≪ Redis ≪ SQL in latency/cost per operation.
Redis: cheap and fast, but watch key volume and TTLs; shard by tenant/hash.
SQL: expensive at p99, but gives strong guarantees and auditability.
Probabilistic filters: very cheap, but FPs are possible - use where an extra SKIP is not critical.
13) Anti-patterns
"We have Kafka exactly-once - no key needed." A key is still needed in the DB/business layer.
Too short a key TTL → replays/delays let duplicates through.
A single global dedup store → hotspot and SPOF; shard by tenant/key instead.
Dedup only in memory - a process crash means a wave of duplicates.
Bloom for money/orders - a false positive drops a legitimate operation.
Inconsistent payload canonicalization - different hashes for messages that are identical in meaning.
Ignoring out-of-order - late events get marked as duplicates erroneously.
14) Implementation checklist
- Define a natural key (or composite/fingerprint).
- Set the dedup window and the lateness policy.
- Choose the tier(s): edge, consumer, sink; plan for sharding.
- Implement Inbox/UPSERT; for streams - keyed state + TTL.
- If an approximate barrier is needed - Bloom/Cuckoo (only for non-critical domains).
- Ensure replay compatibility (TTL ≥ replay/backfill window).
- Metrics: `dedup_hit_rate`, conflicts, and window lag; per-tenant dashboards.
- Game Day: timeouts/retries, replay, out-of-order, cache loss.
- Document payload canonicalization and key versioning.
- Load-test hot keys and long windows.
15) Sample Configurations/Code
15.1 Redis SET NX + TTL (barrier)

```lua
-- KEYS[1] = "dedup:{tenant}:{key}"
-- ARGV[1] = ttl_seconds
local ok = redis.call("SET", KEYS[1], "1", "NX", "EX", ARGV[1])
if ok then
  return "PROCESS"
else
  return "SKIP"
end
```
15.2 PostgreSQL Inbox

```sql
CREATE TABLE inbox (
  id text PRIMARY KEY,
  received_at timestamptz DEFAULT now(),
  status text DEFAULT 'received',
  result_hash text
);
-- In the handler: INSERT ... ON CONFLICT DO NOTHING -> check, then UPSERT into the sink.
```
15. 3 Kafka Streams
java var deduped = input
.selectKey((k,v) -> v.idempotencyKey())
.groupByKey()
.windowedBy(TimeWindows. ofSizeWithNoGrace(Duration. ofHours(24)))
.reduce((oldV,newV) -> oldV) // first wins
.toStream()
.map((wKey,val) -> KeyValue. pair(wKey. key(), val));
15.4 Flink (keyed state + TTL, pseudo)

```java
ValueState<Boolean> seen;
env.enableCheckpointing(10000);
// onEvent(e):
if (seen.value() == null) { process(e); seen.update(true); }
```
15. 5 NGINX/API gateway (Idempotency-Key on edge)
nginx map $http_idempotency_key $idkey { default ""; }
Proxy the key to the backend; backend solves deadup (Inbox/Redis).
16) FAQ
Q: Which to choose: dedup or pure idempotency?
A: Usually both: dedup is a fast "filter" (saves work), idempotency guarantees the correct effect.
Q: Which TTL to set?
A: ≥ the maximum possible redelivery time plus a safety margin. Typically 24-72 hours; for finance and deferred tasks - days/weeks.
Q: How to handle late events?
A: Configure allowed lateness and alert on `late_event`; route later arrivals through a separate branch (recompute/skip).
Q: Can an entire telemetry stream be deduplicated?
A: Yes - approximate filters (Bloom) at the edge, but account for FPs and do not apply them to critical business effects.
Q: Does dedup get in the way of backfill?
A: Use separate key spaces (`key#batch2025`) or disable the barrier for the duration of the backfill; key TTLs should only cover online windows.
17) Summary
Deduplication is a composition: the right key, window, and state structure, plus transactional patterns (Inbox/Outbox/UPSERT) and careful handling of ordering and late events. Place barriers where they are cheapest, ensure idempotency in the DB, measure `dedup_hit_rate`, and test replays/failures - this yields "effectively exactly-once" without unnecessary latency and cost.