
Event deduplication

1) Why deduplication

Duplicates appear due to retries, network timeouts, failover and replays of historical data. If they are not controlled:
  • invariants are violated (double debits, repeated email/SMS, a "twice created" order);
  • costs increase (repeated writes/reprocessing);
  • analytics are distorted.

The goal of deduplication is to ensure a single observed effect despite repeated transport-level deliveries, usually in combination with idempotency.

2) Where to place deduplication (tiers)

1. Edge/API gateway - cut off obvious duplicates by `Idempotency-Key`/request body + signature.
2. Broker/stream - logical deduplication by key/sequence, compaction/coalescing (used less often, due to cost).
3. Event receiver (consumer) - the main location: Inbox/key table/cache.
4. Sink (DB/cache) - unique keys/UPSERT/versions/compaction.
5. ETL/analytics - dedup by time window and key in columnar stores.

The rule: deduplicate as early as possible, but weigh the cost of false positives and the need for replay.

3) Deduplication keys

3.1 Natural (preferred)

`payment_id`, `order_id`, `saga_id#step`, `aggregate_id#seq`.
They guarantee stability and carry business meaning.

3.2 Composite

`(tenant_id, type, external_id, version)` or `(user_id, event_ts_truncated, payload_hash)`.

3.3 Fingerprint

A hash over a deterministic subset of fields (normalize field order and letter case), optionally `HMAC(secret, payload)`.
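
A minimal Java sketch of such a fingerprint, assuming the payload has already been reduced to a flat field map; the `Fingerprint` class and its method names are illustrative, the canonicalization step is the part that must stay deterministic:
java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;               // Java 17+
import java.util.Map;
import java.util.TreeMap;

public final class Fingerprint {
    // Canonicalize: sort fields by name and lower-case values so that
    // semantically identical payloads always produce the same bytes.
    static String canonicalize(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        new TreeMap<>(fields).forEach((k, v) ->
                sb.append(k).append('=').append(v.trim().toLowerCase()).append('\n'));
        return sb.toString();
    }

    // HMAC-SHA256 over the canonical form; the secret protects against forged keys.
    static String of(Map<String, String> fields, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] digest = mac.doFinal(canonicalize(fields).getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest); // use as the dedup key, e.g. "dedup:{tenant}:{hex}"
    }
}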

3.4 Sequences/Versions

A monotonic `seq` per aggregate (optimistic locking/versioning).

Anti-pattern: a "random UUID" with no link to the business entity - such a key cannot be deduplicated.

4) Time windows and order

Deduplication window - the period during which an event may arrive again (usually 24-72 hours; longer for finance).
Out-of-order: allow for lateness. In streaming frameworks - event time + watermarks.
Sliding/fixed-window dedup: "has this key been seen in the last N minutes?"
Sequence-aware: if `seq` ≤ the last processed one - it is a duplicate/replay.

5) Data structures and implementations

5.1 Exact accounting

Redis SET/STRING + TTL: `SET key 1 NX EX 86400` → seen for the first time - process; otherwise - SKIP (Java sketch below).
LRU/LFU cache (in-proc): fast but volatile → best used only as a first barrier.
SQL unique indexes + UPSERT: "insert or update" (idempotent effect).
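
A sketch of the Redis barrier in Java with Jedis 4.x (an assumed client choice; the `dedup:{tenant}:{key}` naming follows the Lua example in 15.1):
java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public final class RedisDedupBarrier {
    private final Jedis jedis;
    private final long ttlSeconds;

    RedisDedupBarrier(Jedis jedis, long ttlSeconds) {
        this.jedis = jedis;
        this.ttlSeconds = ttlSeconds;
    }

    /** Returns true if the key is seen for the first time (process), false if it is a duplicate (skip). */
    boolean firstSeen(String tenant, String dedupKey) {
        // SET key 1 NX EX ttl: atomically "create if absent" with an expiry.
        String reply = jedis.set("dedup:" + tenant + ":" + dedupKey, "1",
                SetParams.setParams().nx().ex(ttlSeconds));
        return "OK".equals(reply);
    }
}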

5.2 Approximate structures (probabilistic)

Bloom/Cuckoo filter: cheap in memory, but false positives are possible. Suitable for obviously "noisy" streams (for example, telemetry), not for finance/orders (sketch below).
Count-Min Sketch: estimates frequencies to protect against "hot" duplicates.
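
A sketch of an approximate first barrier with Guava's BloomFilter (Guava is an assumed choice; the sizing numbers are illustrative):
java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public final class ApproximateBarrier {
    // ~10M expected keys, 1% false positives: a "maybe seen" answer may wrongly skip
    // an event, so use this only for non-critical streams such as telemetry.
    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    boolean probablyDuplicate(String dedupKey) {
        if (seen.mightContain(dedupKey)) {
            return true;          // possibly a duplicate (or a false positive)
        }
        seen.put(dedupKey);       // remember the key; an exact layer makes the final decision
        return false;
    }
}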

5.3 Streaming states

Kafka Streams/Flink: keyed state store with TTL, dedup by key in the window; checkpoint/restore.
Watermark + allowed lateness: controls the window for late events.

6) Transactional patterns

6.1 Inbox (incoming table)

Record the `message_id`/key and the result in the same transaction as the side effects:
pseudo
BEGIN;
ins = INSERT INTO inbox(id, received_at) VALUES (:id, now()) ON CONFLICT DO NOTHING;
IF ins_not_inserted THEN RETURN cached_result;
result = handle(event);
UPSERT sink with result; -- idempotent write to the sink
UPDATE inbox SET status='done', result_hash=... WHERE id=...;
COMMIT;

A replay will see the existing record and will not repeat the effect.
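
A minimal JDBC sketch of this flow against the inbox table defined in 15.2; `handle` and `upsertSink` are placeholders for the business logic and the idempotent sink write:
java
import java.sql.Connection;
import java.sql.PreparedStatement;

public final class InboxHandler {
    /** Processes an event at most once per message id; returns false if it was a duplicate. */
    boolean processOnce(Connection conn, String messageId, byte[] event) throws Exception {
        conn.setAutoCommit(false);
        try {
            // 1) Claim the message id; a duplicate inserts 0 rows.
            try (PreparedStatement claim = conn.prepareStatement(
                    "INSERT INTO inbox(id) VALUES (?) ON CONFLICT (id) DO NOTHING")) {
                claim.setString(1, messageId);
                if (claim.executeUpdate() == 0) {
                    conn.rollback();
                    return false;              // already seen: skip, the result was produced earlier
                }
            }
            // 2) Business effect + idempotent write to the sink (placeholders).
            String resultHash = handle(event);
            upsertSink(conn, messageId, event);
            // 3) Record completion in the same transaction.
            try (PreparedStatement done = conn.prepareStatement(
                    "UPDATE inbox SET status = 'done', result_hash = ? WHERE id = ?")) {
                done.setString(1, resultHash);
                done.setString(2, messageId);
                done.executeUpdate();
            }
            conn.commit();
            return true;
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    private String handle(byte[] event) { /* domain logic */ return ""; }
    private void upsertSink(Connection conn, String id, byte[] event) { /* idempotent UPSERT */ }
}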

6.2 Outbox

The business record and the event are written in one transaction → a publisher then sends the event to the broker. This does not eliminate duplicates on the consumer side, but it excludes "holes" (lost events).

6.3 Unique indexes/UPSERT

sql
INSERT INTO payments(id, status, amount)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO NOTHING; -- "create once"

Or a controlled version upgrade:

sql
UPDATE orders
SET status = $new, version = version + 1
WHERE id = $id AND version = $expected; -- optimistic locking

6.4 Versioning of aggregates

The event is applicable if `event.version = aggregate.version + 1`. Otherwise - duplicate/replay/conflict.
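
A sketch of this check over the `orders` table from 6.3 via JDBC; 0 updated rows means the event was already applied or arrived out of order:
java
import java.sql.Connection;
import java.sql.PreparedStatement;

final class VersionedApply {
    /** Applies the event only if it carries the next expected version. */
    static boolean apply(Connection conn, String orderId, long eventVersion, String newStatus) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE orders SET status = ?, version = version + 1 " +
                "WHERE id = ? AND version = ?")) {
            ps.setString(1, newStatus);
            ps.setString(2, orderId);
            ps.setLong(3, eventVersion - 1);   // event.version must equal aggregate.version + 1
            return ps.executeUpdate() == 1;    // false → duplicate or out-of-order event
        }
    }
}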

7) Dedup and brokers/streams

7.1 Kafka

The idempotent producer reduces duplicates on the producer side (config sketch below).
Transactions allow committing consumed offsets and output records atomically.
Compaction: keeps only the last value per key - post-factum dedup/coalescing (not for payments).
Consumer-side: a state store/Redis/DB for keys within the window.
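
A sketch of the producer-side settings with the Java client (the bootstrap server and transactional id are placeholder values):
java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

final class IdempotentProducerConfig {
    static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotent producer: the broker drops duplicate writes caused by producer retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Transactions: commit consumed offsets and produced records atomically (read-process-write).
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-processor-1");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        return producer;
    }
}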

7.2 NATS / JetStream

Ack/redelivery → at-least-once. Dedup in the consumer (Inbox/Redis).
JetStream stream/consumer sequence numbers make it easier to identify redeliveries.

7.3 Queues (RabbitMQ/SQS)

Visibility timeout + redeliveries → you need a key + a dedup store.
SQS FIFO with `MessageGroupId`/`MessageDeduplicationId` helps, but the dedup window is provider-limited - keep keys longer yourself if the business requires it (example below).
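
A sketch of publishing to an SQS FIFO queue with an explicit deduplication id (AWS SDK for Java v2; the queue URL is a placeholder, and SQS itself deduplicates only within its 5-minute window):
java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

final class SqsFifoPublisher {
    static void publish(SqsClient sqs, String paymentId, String body) {
        SendMessageRequest request = SendMessageRequest.builder()
                .queueUrl("https://sqs.eu-west-1.amazonaws.com/123456789012/payments.fifo") // placeholder
                .messageBody(body)
                .messageGroupId(paymentId)            // ordering scope
                .messageDeduplicationId(paymentId)    // SQS drops duplicates within its window
                .build();
        sqs.sendMessage(request);
    }
}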

8) Storage and analytics

8.1 ClickHouse/BigQuery

Dedup within a window: `ORDER BY key, ts` and `argMax`/`anyLast` with a condition.

ClickHouse:
sql
SELECT key,
       anyLast(value) AS v
FROM t
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY key;

Or a materialized layer of "unique" events (merge by key/version).

8.2 Logs/telemetry

Approximate dedup (Bloom) on ingest is acceptable → it saves network/disk.

9) Reprocessing, replay and backfill

Dedup keys must survive a replay (TTL ≥ replay window).
For backfill, use a versioned key space (`key#source=batch2025`) or separate processing streams, so as not to interfere with the online window.
Store result artifacts (hash/version) - this speeds up a "fast skip" on replays.

10) Metrics and observability

`dedup_hit_total`/`dedup_hit_rate` - the share of duplicates caught.
`dedup_fp_rate` - for probabilistic filters.
`window_size_seconds` - the actual window (from telemetry on late arrivals).
`inbox_conflict_total`, `upsert_conflict_total`.
`replayed_events_total`, `skipped_by_inbox_total`.
Profiles by tenant/key/type: where most duplicates occur and why.

Logs: `message_id`, `idempotency_key`, `seq`, `window_id`, `action=process|skip`.
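
A sketch of recording these counters with Micrometer (an assumed metrics library; `dedup_processed_total` is an added denominator for computing `dedup_hit_rate`):
java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

final class DedupMetrics {
    private final Counter hits;
    private final Counter processed;

    DedupMetrics(MeterRegistry registry, String tenant) {
        this.hits = Counter.builder("dedup_hit_total")
                .tag("tenant", tenant)
                .register(registry);
        this.processed = Counter.builder("dedup_processed_total")
                .tag("tenant", tenant)
                .register(registry);
    }

    void record(boolean duplicate) {
        if (duplicate) {
            hits.increment();       // caught duplicate → contributes to dedup_hit_rate
        } else {
            processed.increment();
        }
    }
}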

11) Security and privacy

Do not put PII in the key; use hashes/aliases.
Sign the fingerprint - `HMAC(secret, canonical_payload)` - to avoid collisions/forgery.
Align key retention with compliance requirements (GDPR retention).

12) Performance and cost

In-proc LRU ≪ Redis ≪ SQL by latency/cost per operation.
Redis: cheap and fast, but watch the key volume and TTLs; shard by `tenant/hash`.
SQL: expensive at p99, but provides strong guarantees and auditability.
Probabilistic filters: very cheap, but FPs are possible - use them where an "extra SKIP" is not critical.

13) Anti-patterns

"We have Kafka exactly-once - no key needed." Needed - in a bruise/business layer.
Too short TTL for keys → replays/delay will deliver a double.
Global single dedup → hotspot and SPOF; not sharded by tenant/key.
Dedup only in memory - loss of process = wave of takes.
Bloom for money/orders - false positive will deprive the legitimate operation.
Inconsistent payload canonization - different hashes for messages that are identical in meaning.
Ignoring out-of-order - late events are marked with duplicates erroneously.

14) Implementation checklist

  • Define a natural key (or a composite/fingerprint).
  • Set the dedup window and the lateness policy.
  • Select the level(s): edge, consumer, sink; provide for sharding.
  • Implement Inbox/UPSERT; for streams - keyed state + TTL.
  • If an approximate barrier is needed - Bloom/Cuckoo (only for non-critical domains).
  • Ensure replay compatibility (TTL ≥ replay/backfill window).
  • Metrics: `dedup_hit_rate`, conflicts and window lags; dashboards per tenant.
  • Game Day: timeouts/retries, replay, out-of-order, cache loss.
  • Document payload canonicalization and key versioning.
  • Perform load tests on hot keys and long windows.

15) Sample Configurations/Code

15.1 Redis SETNX + TTL (barrier)

lua
-- KEYS[1] = "dedup:{tenant}:{key}"
-- ARGV[1] = ttl_seconds
local ok = redis.call("SET", KEYS[1], "1", "NX", "EX", ARGV[1])
if ok then return "PROCESS"
else return "SKIP"
end

15.2 PostgreSQL Inbox

sql
CREATE TABLE inbox (
  id text PRIMARY KEY,
  received_at timestamptz DEFAULT now(),
  status text DEFAULT 'received',
  result_hash text
);
-- In the handler: INSERT ... ON CONFLICT DO NOTHING -> check the result, then UPSERT into the sink.

15.3 Kafka Streams

java
var deduped = input
    .selectKey((k, v) -> v.idempotencyKey())
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(24)))
    .reduce((oldV, newV) -> oldV)   // first wins
    .toStream()
    .map((wKey, val) -> KeyValue.pair(wKey.key(), val));

15.4 Flink (keyed state + TTL, pseudo)

java
ValueState<Boolean> seen;            // keyed state; attach a TTL via StateTtlConfig
env.enableCheckpointing(10000);
onEvent(e):
  if (seen.value() == null) { process(e); seen.update(true); }

15.5 NGINX/API gateway (Idempotency-Key on edge)

nginx
proxy_set_header Idempotency-Key $http_idempotency_key;

Pass the key through to the backend; the backend performs the dedup (Inbox/Redis).

16) FAQ

Q: What to choose: dedup or pure idempotence?
A: Usually both: dedup is a fast "filter" (cost savings), idempotence guarantees the correct effect.

Q: What TTL should I set?
A: ≥ the maximum possible redelivery time + a margin. Typically 24-72 hours; for finance and deferred tasks - days/weeks.

Q: How do you handle late events?
A: Configure allowed lateness and alert on `late_event`; route the later ones through a separate branch (recompute/skip).

Q: Can an entire telemetry stream be deduplicated?
A: Yes, with approximate filters (Bloom) at the edge, but account for FPs and do not apply them to critical business effects.

Q: Does dedup get in the way of backfill?
A: Use separate key spaces (`key#batch2025`) or disable the barrier for the duration of the backfill; key TTLs only need to cover the online windows.

17) Totals

Deduplication is a composition: the right key, window and state structure, plus transactional patterns (Inbox/Outbox/UPSERT) and deliberate handling of ordering and late events. Place barriers where they are cheapest, ensure idempotence in the sink, measure `dedup_hit_rate`, and test replays and failures - this is how you get "effectively exactly-once" without unnecessary latency and cost overhead.
