Event deduplication
1) Why deduplication
Duplicates appear due to retries, network timeouts, failover, and replay of historical data. If they are not controlled:
- invariants are violated (double debits, repeated email/SMS, an order "created twice");
- costs increase (repeated writes/reprocessing);
- analytics are distorted.
The goal of deduplication is to ensure a one-time observed effect despite acceptable transport-level repetitions, often together with idempotency.
2) Where to place deduplication (tiers)
1. Edge/API gateway - reject obvious duplicates by `Idempotency-Key`/body + signature.
2. Broker/stream - logical dedup by key/sequence, coalescing via compaction (used less often due to cost).
3. Event receiver (consumer) - main location: Inbox/key table/cache.
4. Sink (DB/cache) - unique keys/UPSERT/versions/compaction.
5. ETL/analytics - dedup by time window and key in columnar stores.
The rule: dedup as early as possible, but weigh the cost of false positives and the need for replay.
3) Deduplication keys
3.1 Natural (preferred)
`payment_id`, `order_id`, `saga_id#step`, `aggregate_id#seq`.
These are stable and carry business meaning.
3.2 Composite
`(tenant_id, type, external_id, version)` or `(user_id, event_ts_truncated, payload_hash)`.
3.3 Fingerprint
A hash of a deterministic subset of fields (normalize field order and character case), optionally `HMAC(secret, payload)`.
3.4 Sequences/versions
A monotonic `seq` per aggregate (optimistic locking/versioning).
Anti-pattern: a random UUID with no link to the business entity - deduplication becomes impossible.
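As a sketch of the fingerprint approach above (function and field names are illustrative): canonicalize the payload deterministically before hashing, optionally signing with an HMAC secret so keys cannot be forged.

```python
import hashlib
import hmac
import json

def fingerprint(payload: dict, secret: bytes = b"") -> str:
    """Hash a canonical form of the payload: sorted keys and fixed
    separators make the bytes deterministic regardless of field order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    if secret:
        # Optional HMAC: prevents an attacker from crafting colliding keys.
        return hmac.new(secret, canonical, hashlib.sha256).hexdigest()
    return hashlib.sha256(canonical).hexdigest()
```

The same logical payload always yields the same key, while any change to field values changes it.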
4) Time windows and order
The deduplication window is the period during which an event may arrive again (usually 24-72 hours; longer for finance).
Out-of-order: allow for lateness. In streaming frameworks - event time + watermarks.
Sliding/fixed-window dedup: "have we seen this key in the last N minutes?"
Sequence-aware: if `seq` ≤ the last processed value, it is a duplicate/repeat.
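A minimal in-memory illustration of the sequence-aware rule (production code would keep this state in a durable store):

```python
class SequenceDedup:
    """Tracks the last processed seq per aggregate;
    seq <= last seen means duplicate or out-of-order repeat."""

    def __init__(self) -> None:
        self.last: dict[str, int] = {}  # aggregate_id -> last processed seq

    def admit(self, aggregate_id: str, seq: int) -> bool:
        if seq <= self.last.get(aggregate_id, 0):
            return False  # duplicate/repeat -> skip
        self.last[aggregate_id] = seq
        return True       # new event -> process
```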
5) Data structures and implementations
5.1 Exact tracking
Redis key + TTL: `SET key 1 NX EX 86400` → "first time - process, otherwise - SKIP."
LRU/LFU cache (in-process): fast, but volatile → better only as a first barrier.
SQL unique indexes + UPSERT: "insert or update" (idempotent effect).
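A sketch of the in-process first barrier mentioned above, mimicking the semantics of Redis `SET key 1 NX EX ttl` without a network hop (volatile: a restart loses the state, so it must be backed by a durable layer):

```python
import time
from typing import Optional

class TtlBarrier:
    """In-process dedup barrier with per-key TTL, modeled on SET NX EX."""

    def __init__(self) -> None:
        self._seen: dict[str, float] = {}  # key -> expiry timestamp

    def first_seen(self, key: str, ttl: float = 86400.0,
                   now: Optional[float] = None) -> bool:
        t = time.monotonic() if now is None else now
        expiry = self._seen.get(key)
        if expiry is not None and expiry > t:
            return False  # duplicate inside the window -> SKIP
        self._seen[key] = t + ttl  # (re)admit and restart the window
        return True       # first time -> PROCESS
```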
5.2 Approximate (probabilistic) structures
Bloom/Cuckoo filter: memory-cheap, but false positives are possible. Suitable for dropping obviously "noisy" streams (e.g., telemetry), not for finance/orders.
Count-Min Sketch: frequency estimates to guard against "hot" duplicates.
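To make the trade-off concrete, here is a toy Bloom filter (sizes are illustrative): no false negatives, but a new key can occasionally be reported as seen, which is why it fits telemetry and not payments.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: k hash positions per key over a bit array."""

    def __init__(self, bits: int = 1 << 20, hashes: int = 5) -> None:
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def seen_before(self, key: str) -> bool:
        """Test membership, then add. May return True for a new key (FP),
        never False for a key that was actually added."""
        hit = True
        for p in self._positions(key):
            byte, bit = divmod(p, 8)
            if not (self.array[byte] >> bit) & 1:
                hit = False
                self.array[byte] |= 1 << bit
        return hit
```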
5.3 Streaming state
Kafka Streams/Flink: keyed state store with TTL, dedup by key within a window; checkpoint/restore.
Watermark + allowed lateness: controls the window for late events.
6) Transactional patterns
6.1 Inbox (incoming table)
Persist the `message_id`/key and the result before side effects:

```pseudo
BEGIN;
ins = INSERT INTO inbox(id, received_at) ON CONFLICT DO NOTHING;
IF ins_not_inserted THEN RETURN cached_result;
result = handle(event);
UPSERT sink with result; -- idempotent sink write
UPDATE inbox SET status='done', result_hash=... WHERE id=...;
COMMIT;
```

On replay, the existing row is found and the effect is not repeated.
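The pseudocode above can be sketched with SQLite standing in for PostgreSQL (the table shape mirrors the one from this document; `handler` represents the business side effect):

```python
import sqlite3

def handle_once(conn: sqlite3.Connection, msg_id: str, handler) -> str:
    """Inbox pattern: insert-or-ignore the message id; if the row already
    existed, skip the side effect entirely."""
    conn.execute("""CREATE TABLE IF NOT EXISTS inbox(
        id TEXT PRIMARY KEY,
        status TEXT DEFAULT 'received')""")
    with conn:  # one transaction covers the inbox row and the effect
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox(id) VALUES (?)", (msg_id,))
        if cur.rowcount == 0:
            return "SKIP"      # already seen -> do not repeat the effect
        handler()              # the (idempotent) side effect goes here
        conn.execute("UPDATE inbox SET status='done' WHERE id=?", (msg_id,))
    return "PROCESS"
```

Redelivering the same `msg_id` hits the primary key and the handler never runs twice.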
6.2 Outbox
Write the business record and the event in one transaction → a publisher ships it to the broker. Does not eliminate duplicates on the consumer side, but eliminates "holes" (lost events).
6.3 Unique indexes/UPSERT

```sql
INSERT INTO payments(id, status, amount)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO NOTHING; -- "create once"
```

Or a controlled version upgrade:

```sql
UPDATE orders
SET status = $new, version = version + 1
WHERE id = $id AND version = $expected; -- optimistic locking
```
6.4 Aggregate versioning
An event is applicable if `event.version = aggregate.version + 1`. Otherwise it is a duplicate/repeat/conflict.
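The version gate above, as a small sketch (dictionary-based aggregates for illustration; a real system would persist the version with the aggregate):

```python
def apply_event(aggregate: dict, event: dict) -> str:
    """Apply an event only if event.version == aggregate.version + 1."""
    if event["version"] <= aggregate["version"]:
        return "DUPLICATE"   # already applied -> skip
    if event["version"] > aggregate["version"] + 1:
        return "GAP"         # out of order -> buffer or refetch
    aggregate.update(event["data"])
    aggregate["version"] = event["version"]
    return "APPLIED"
```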
7) Dedup and brokers/streams
7.1 Kafka
Idempotent producer eliminates duplicate writes on entry.
Transactions allow committing offsets and output records atomically.
Compaction: keeps the last value per key - post-factum dedup/coalescing (not for payments).
Consumer side: state store/Redis/DB for window keys.
7.2 NATS / JetStream
Ack/redelivery → at-least-once. Dedup on the consumer (Inbox/Redis).
JetStream sequence numbers and consumer state make repeats easier to identify.
7.3 Queues (RabbitMQ/SQS)
Visibility timeout + redeliveries → you need a key + a dedup store.
SQS FIFO with `MessageGroupId`/`MessageDeduplicationId` helps, but the TTL window is provider-limited - keep keys longer if the business requires it.
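A hedged sketch of the SQS FIFO parameters just mentioned: the helper below only builds the `send_message` keyword arguments (the actual boto3 call, commented out, assumes a queue URL and client you would supply). SQS drops repeats of the same `MessageDeduplicationId` within its roughly 5-minute provider window.

```python
def fifo_message(body: str, group: str, dedup_id: str) -> dict:
    """Build SQS FIFO send_message kwargs with an explicit dedup id."""
    return {
        "MessageBody": body,
        "MessageGroupId": group,             # ordering scope
        "MessageDeduplicationId": dedup_id,  # provider-side dedup key
    }

# With boto3 (illustrative; queue_url/sqs client are assumptions):
# sqs.send_message(QueueUrl=queue_url,
#                  **fifo_message("payload", "tenant-1", "payment-42"))
```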
8) Storage and analytics
8.1 ClickHouse/BigQuery
Dedup by window: `ORDER BY key, ts` and `argMax`/`anyLast` with a condition.
ClickHouse:

```sql
SELECT key,
       anyLast(value) AS v
FROM t
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY key;
```

Or a materialized layer of "unique" events (merged by key/version).
8.2 Logs/telemetry
Approximate dedup (Bloom) at ingest is acceptable → saves network/disk.
9) Reprocessing, replay and backfill
Dedup keys must survive replay (TTL ≥ replay window).
For backfill, use a versioned key space (`key#source=batch2025`) or separate streams so as not to interfere with the online window.
Store result artifacts (hash/version) - this enables fast-skip on replays.
10) Metrics and observability
`dedup_hit_total`/`dedup_hit_rate` - the share of duplicates caught.
`dedup_fp_rate` for probabilistic filters.
`window_size_seconds` - actual window size (from late-arrival telemetry).
`inbox_conflict_total`, `upsert_conflict_total`.
`replayed_events_total`, `skipped_by_inbox_total`.
Profiles by tenant/key/type: where duplicates concentrate and why.
Logs: `message_id`, `idempotency_key`, `seq`, `window_id`, `action=process|skip`.
11) Security and privacy
Do not put PII in keys; use hashes/aliases.
Sign the fingerprint - `HMAC(secret, canonical_payload)` - to prevent collisions/forgery.
Align key retention with compliance (GDPR retention).
12) Performance and cost
In-proc LRU ≪ Redis ≪ SQL in latency/cost per operation.
Redis: cheap and fast, but watch key volume and TTLs; shard by tenant/hash.
SQL: expensive at p99, but gives strong guarantees and auditability.
Probabilistic filters: very cheap, but FPs are possible - use where an extra SKIP is not critical.
13) Anti-patterns
"We have Kafka exactly-once - no key needed." A key is still needed in the DB/business layer.
Too short a key TTL → replays/delays let duplicates through.
A single global dedup store → hotspot and SPOF; shard by tenant/key instead.
Dedup only in memory - a process crash means a wave of duplicates.
Bloom for money/orders - a false positive drops a legitimate operation.
Inconsistent payload canonicalization - different hashes for messages that are identical in meaning.
Ignoring out-of-order - late events get marked as duplicates erroneously.
14) Implementation checklist
- Define a natural key (or composite/fingerprint).
- Set the dedup window and the lateness policy.
- Choose the tier(s): edge, consumer, sink; plan for sharding.
- Implement Inbox/UPSERT; for streams - keyed state + TTL.
- If an approximate barrier is needed - Bloom/Cuckoo (only for non-critical domains).
- Ensure replay compatibility (TTL ≥ replay/backfill window).
- Metrics: `dedup_hit_rate`, conflicts, and window lag; per-tenant dashboards.
- Game Day: timeouts/retries, replay, out-of-order, cache loss.
- Document payload canonicalization and key versioning.
- Load-test hot keys and long windows.
15) Sample Configurations/Code
15.1 Redis SET NX + TTL (barrier)

```lua
-- KEYS[1] = "dedup:{tenant}:{key}"
-- ARGV[1] = ttl_seconds
local ok = redis.call("SET", KEYS[1], "1", "NX", "EX", ARGV[1])
if ok then
  return "PROCESS"
else
  return "SKIP"
end
```
15.2 PostgreSQL Inbox

```sql
CREATE TABLE inbox (
  id text PRIMARY KEY,
  received_at timestamptz DEFAULT now(),
  status text DEFAULT 'received',
  result_hash text
);
-- In the handler: INSERT ... ON CONFLICT DO NOTHING -> check, then UPSERT into the sink.
```
15. 3 Kafka Streams
java var deduped = input
.selectKey((k,v) -> v.idempotencyKey())
.groupByKey()
.windowedBy(TimeWindows. ofSizeWithNoGrace(Duration. ofHours(24)))
.reduce((oldV,newV) -> oldV) // first wins
.toStream()
.map((wKey,val) -> KeyValue. pair(wKey. key(), val));
15.4 Flink (keyed state + TTL, pseudo)

```java
ValueState<Boolean> seen;
env.enableCheckpointing(10000);
// onEvent(e):
if (seen.value() == null) { process(e); seen.update(true); }
```
15. 5 NGINX/API gateway (Idempotency-Key on edge)
nginx map $http_idempotency_key $idkey { default ""; }
Proxy the key to the backend; backend solves deadup (Inbox/Redis).
16) FAQ
Q: Which to choose: dedup or pure idempotency?
A: Usually both: dedup is a fast "filter" (saves work), idempotency guarantees the correct effect.
Q: Which TTL to set?
A: ≥ the maximum possible redelivery time plus a safety margin. Typically 24-72 hours; for finance and deferred tasks - days/weeks.
Q: How to handle late events?
A: Configure allowed lateness and alert on `late_event`; route later arrivals through a separate branch (recompute/skip).
Q: Can an entire telemetry stream be deduplicated?
A: Yes - approximate filters (Bloom) at the edge, but account for FPs and do not apply them to critical business effects.
Q: Does dedup get in the way of backfill?
A: Use separate key spaces (`key#batch2025`) or disable the barrier for the duration of the backfill; key TTLs should only cover online windows.
17) Summary
Deduplication is a composition: the right key, window, and state structure, plus transactional patterns (Inbox/Outbox/UPSERT) and careful handling of ordering and late events. Place barriers where they are cheapest, ensure idempotency in the DB, measure `dedup_hit_rate`, and test replays/failures - this yields "effectively exactly-once" without unnecessary latency and cost.