Webhooks and event idempotency
TL;DR
A good webhook is a signed (HMAC/mTLS), compact, idempotent event delivered on an at-least-once model, with exponential backoff on the sender and deduplication on the recipient. Agree on an envelope (`event_id`, `type`, `ts`, `version`, `attempt`, `signature`), a time window (≤ 5 minutes), response codes, retries, a DLQ, and a status endpoint.
1) Roles and delivery model
Sender (you/provider): generates the event, signs it, keeps delivering until it receives a 2xx, retries on 3xx/4xx/5xx (except explicit "do not accept" rejections), maintains a DLQ, and exposes a replay API.
Recipient (partner/your service): checks the signature and time window, deduplicates and processes idempotently, responds with the correct code, and provides `/status` and `/ack` for replays by `event_id`.
Guarantees: at-least-once. The recipient must be able to handle duplicates and reordering.
2) Event envelope
json
{
  "event_id": "01HF7H9J9Q3E7DYT5Y6K3ZFD6M",
  "type": "payout.processed",
  "version": "2025-01-01",
  "ts": "2025-11-03T12:34:56.789Z",
  "attempt": 1,
  "producer": "payments",
  "tenant": "acme",
  "data": {
    "payout_id": "p_123",
    "status": "processed",
    "amount_minor": 10000,
    "currency": "EUR"
  }
}
Required fields: `event_id`, `type`, `version`, `ts`, `attempt`.
Evolution rules: adding fields is always allowed; removing fields or changing types requires a new `version`.
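The required-field rule can be checked mechanically on receipt. The sketch below is one possible shape, not a prescribed implementation: `validate_envelope` and its error strings are hypothetical names. Unknown fields are deliberately ignored, so a sender that adds fields does not break the recipient.

```python
from datetime import datetime

REQUIRED_FIELDS = ("event_id", "type", "version", "ts", "attempt")


def validate_envelope(event: dict) -> list:
    """Return a list of problems; an empty list means the envelope is acceptable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if problems:
        return problems
    if not isinstance(event["attempt"], int) or event["attempt"] < 1:
        problems.append("attempt must be a positive integer")
    try:
        # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalize it.
        datetime.fromisoformat(event["ts"].replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        problems.append("ts must be an ISO 8601 timestamp")
    # Fields outside REQUIRED_FIELDS are not inspected (tolerant reader).
    return problems
```

A recipient would typically answer `400` or `422` when the returned list is non-empty.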
3) Security: signatures and binding
3.1 HMAC signature (recommended default)
Headers:
X-Signature: v1=base64(hmac_sha256(<secret>, <canonical>))
X-Timestamp: 2025-11-03T12:34:56Z
X-Event-Id: 01HF7...
Canonical string:
<timestamp>\n<method>\n<path>\n<sha256(body)>
Checks on the recipient side:
- abs(now − `X-Timestamp`) ≤ 300s
- `X-Event-Id` has not been processed before (dedup)
- `X-Signature` matches (constant-time comparison)
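A minimal Python sketch of the signing scheme and of the time-window and signature checks, using only the standard library. Several details are assumptions the text does not fix: `sha256(body)` is hex-encoded in the canonical string, header names are looked up case-sensitively in a plain dict, and the function names `sign` and `verify` are illustrative. The dedup check on `X-Event-Id` is not covered here.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Optional

MAX_SKEW_SECONDS = 300  # the 5-minute window


def sign(secret: bytes, timestamp: str, method: str, path: str, body: bytes) -> str:
    """Build the X-Signature value for one request."""
    # Assumption: sha256(body) is hex-encoded inside the canonical string.
    body_hash = hashlib.sha256(body).hexdigest()
    canonical = "\n".join([timestamp, method, path, body_hash]).encode()
    digest = hmac.new(secret, canonical, hashlib.sha256).digest()
    return "v1=" + base64.b64encode(digest).decode()


def verify(secret: bytes, headers: dict, method: str, path: str, body: bytes,
           now: Optional[datetime] = None) -> bool:
    """Check the time window first, then the signature in constant time."""
    now = now or datetime.now(timezone.utc)
    timestamp = headers.get("X-Timestamp", "")
    try:
        sent_at = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    except ValueError:
        return False
    if sent_at.tzinfo is None:
        return False  # reject timestamps without an explicit UTC offset
    if abs((now - sent_at).total_seconds()) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, timestamp, method, path, body)
    received = headers.get("X-Signature", "")
    return hmac.compare_digest(expected.encode(), received.encode())
```

Because the timestamp, method, and path are part of the signed string, a captured request cannot be replayed later or against a different endpoint without invalidating the signature.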
3.2 Additional measures
mTLS for highly sensitive webhooks.
IP/ASN allow-list.
DPoP (optional) for sender-constrained tokens, if the webhook triggers callbacks.
4) Idempotency and deduplication
4.1 Event idempotency
An event with the same `event_id` must not change state a second time. The recipient:
- stores `event_id` in an idempotency cache (KV/Redis/DB) with a TTL of 24-72 hours or longer;
- saves the processing result (success/error, artifacts) so it can be returned on redelivery.
4.2 Command idempotency (callbacks)
If the webhook prompts the client to call back into the API (for example, "confirm payout"), send an `Idempotency-Key` on the REST call and store the result on the service side (exactly-once outcome).
KV model (minimum):
key: idempotency:event:01HF7...
val: { status: "ok", processed_at: "...", handler_version: "..." }
TTL: 3d
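The KV model above can be sketched as a small in-memory store with expiry. This is an illustration only: `IdempotencyStore` is a hypothetical class, and a real deployment would more likely rely on Redis or a database table with its own TTL handling. The clock is injectable so expiry can be tested without waiting.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a KV store with per-key TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float = 3 * 24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._records = {}  # key -> (expires_at, record)

    @staticmethod
    def _key(event_id: str) -> str:
        return f"idempotency:event:{event_id}"

    def get(self, event_id: str):
        """Return the stored result, or None if unseen or expired."""
        entry = self._records.get(self._key(event_id))
        if entry is None:
            return None
        expires_at, record = entry
        if self._clock() >= expires_at:
            del self._records[self._key(event_id)]
            return None
        return record

    def put(self, event_id: str, record: dict) -> None:
        """Remember the processing result for ttl_seconds."""
        self._records[self._key(event_id)] = (self._clock() + self._ttl, record)
```

On redelivery the handler calls `get(event_id)` first and, if a record exists, returns the stored result instead of running the domain logic again.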
5) Retries and backoff
Recommended schedule (exponential with jitter):
- `5s, 15s, 30s, 1m, 2m, 5m, 10m, 30m, 1h, 3h, 6h, 12h, 24h` (then daily for up to N days)
Handling by response code:
- 2xx: success, stop retrying.
- `400/401/403/404/422`: not retryable if the signature and format are OK (client error).
- `429`: retry according to `Retry-After` or the backoff schedule.
- 5xx/network errors: retry.
Sender headers: `User-Agent`, `X-Webhook-Producer`, `X-Attempt`.
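The schedule and the response-code rules can be expressed as two small functions. This is a sketch under assumptions: the ±20% jitter width, the function names, and the choice to retry every status outside the explicit non-retryable set (which covers 3xx, in line with section 1) are not specified by the text.

```python
import random

# Delays in seconds: 5s, 15s, 30s, 1m, 2m, 5m, 10m, 30m, 1h, 3h, 6h, 12h, 24h
SCHEDULE_SECONDS = [5, 15, 30, 60, 120, 300, 600, 1800, 3600, 10800, 21600, 43200, 86400]
NON_RETRYABLE = {400, 401, 403, 404, 422}


def next_delay(attempt: int, retry_after=None, jitter: float = 0.2, rng=random.random) -> float:
    """Delay before the retry that follows failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        return float(retry_after)  # the recipient's Retry-After wins over the schedule
    # Past the end of the schedule, keep using the last entry (daily retries).
    base = SCHEDULE_SECONDS[min(attempt, len(SCHEDULE_SECONDS)) - 1]
    # Spread the delay uniformly within +/- jitter of the base value.
    return base * (1 - jitter + 2 * jitter * rng())


def should_retry(status) -> bool:
    """status is an HTTP status code, or None for a network error."""
    if status is None:
        return True
    if 200 <= status < 300:
        return False
    # Assumption: everything outside the explicit client-error set is retried.
    return status not in NON_RETRYABLE
```

Jitter keeps many failed deliveries from retrying at the same instant once a recipient recovers.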
6) Receiver-side processing
Pseudo-pipeline:
pseudo
if not verify_signature(): return 401
if abs(now - X-Timestamp) > 300s: return 401
if seen(event_id):
    return 200 // idempotent response
begin transaction
if seen(event_id): commit; return 200
handle(data) // domain logic
mark_seen(event_id) // write to KV/DB
commit
return 200
Transactionality: the "seen" marker must be written atomically with the operation's effect (or only after the result has been committed), so that a failure cannot cause double processing.
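One way to get this atomicity is to keep the "seen" marker and the domain state in the same database and write both in one transaction. The sketch below uses SQLite from the Python standard library; the table names `seen_events` and `payouts` and the function names are hypothetical, and the domain effect is reduced to storing a payout status.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS seen_events (event_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS payouts (payout_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.commit()
    return db


def process_event(db: sqlite3.Connection, event: dict) -> bool:
    """Apply the event at most once. Returns False for a duplicate."""
    with db:  # commits on success, rolls back if the block raises
        cur = db.execute(
            "INSERT OR IGNORE INTO seen_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            return False  # marker already present: duplicate delivery
        data = event["data"]
        # Domain effect, written in the same transaction as the marker.
        db.execute(
            "INSERT OR REPLACE INTO payouts (payout_id, status) VALUES (?, ?)",
            (data["payout_id"], data["status"]),
        )
    return True
```

If the handler fails after the marker is inserted, the rollback removes the marker as well, so the sender's next retry is processed normally instead of being swallowed as a duplicate.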
7) Ordering guarantees and snapshots
Order is not guaranteed. Use `ts` and a domain-level `seq`/`version` in `data` to check whether an event is still current.
For long lags or losses, add `/replay` on the sender and `/resync` on the receiver (fetch a snapshot plus deltas for a time or ID window).
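A staleness check based on a per-entity sequence number might look like the sketch below. It assumes the sender puts a monotonically increasing `seq` into `data` for each payout, which the envelope example does not show; `apply_if_newer` and the in-memory `state` dict are illustrative.

```python
def apply_if_newer(state: dict, event: dict) -> bool:
    """Apply the event only if its seq is newer than what is stored for the entity."""
    data = event["data"]
    key = data["payout_id"]
    current = state.get(key)
    if current is not None and data["seq"] <= current["seq"]:
        return False  # stale or duplicate: a newer state has already been applied
    state[key] = {"seq": data["seq"], "status": data["status"]}
    return True
```

With this check, a late `created` event that arrives after `processed` is acknowledged but does not overwrite the newer status.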
8) Status, replay and DLQ
8.1 Sender endpoints
`POST /webhooks/replay`: replay by a list of `event_id` values or by a time window.
`GET /webhooks/events/:id`: show the original payload and the history of delivery attempts.
DLQ: "dead" events (retry limit exhausted) go to separate storage and trigger alerts.
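The sender-side flow, with attempt history and a DLQ once retries are exhausted, can be sketched as follows. Everything here is hypothetical scaffolding: `send` is a transport callback returning an HTTP status (or `None` on a network error), `dead_letters` is a plain list standing in for DLQ storage, and the waits between attempts and the handling of non-retryable codes are left out for brevity.

```python
def deliver_with_dlq(event: dict, send, dead_letters: list, max_attempts: int = 5) -> list:
    """Try to deliver an event; park it in the DLQ when attempts run out.

    Returns the attempt history, which is the kind of record an events
    endpoint could expose for diagnostics.
    """
    history = []
    for attempt in range(1, max_attempts + 1):
        status = send(event, attempt)
        history.append({"attempt": attempt, "status": status})
        if status is not None and 200 <= status < 300:
            return history  # delivered, stop retrying
        # A real sender would wait here according to its backoff schedule.
    # Retry limit exhausted: keep the event and its history for manual replay.
    dead_letters.append({"event": event, "history": history})
    return history
```

Keeping the history next to the dead event lets an operator see why delivery failed before triggering a replay.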
8.2 Recipient endpoints
`GET /webhooks/status/:event_id`: returns `seen=true/false`, `processed_at`, `handler_version`.
`POST /webhooks/ack`: (optional) confirms manual processing of an event from the DLQ.
9) Error contracts (receiver response)
http
HTTP/1.1 422 Unprocessable Entity
Content-Type: application/json
Retry-After: 120
X-Trace-Id: 4e3f...

{
  "error": "invalid_state",
  "error_description": "payout not found",
  "trace_id": "4e3f..."
}
Recommendations: always return a clear status code and, where possible, `Retry-After`. Do not expose security details in responses.
10) Monitoring and SLO
Metrics (sender):
- delivery latency p50/p95, success rate, retries per event, DLQ drop rate, share of 2xx/4xx/5xx, time until the first 2xx.
Metrics (recipient):
- verification failure rate (signature/time), duplicate rate, handler latency p95, 5xx rate.
SLO:
- Delivery: ≥ 99.9% of events receive a 2xx in < 3 s p95 (after the first successful attempt).
- Cryptographic verification: signature validation ≤ 2-5 ms p95.
- Dedup: 0 repeated effects (exactly-once outcome at the domain level).
11) Data security and privacy
Do not transmit PAN/PII in the webhook body; send IDs and let the recipient pull the details from an authorized API.
Mask sensitive fields in logs; store event bodies only as long as necessary, with a TTL.
Encrypt DLQ and replay storage.
12) Versioning and compatibility
Carry the version in `version` (envelope) and in the path: `/webhooks/v1/payments`.
New fields are optional; fields are removed only after a `Sunset` period.
Document changes in a machine-readable changelog (for automated checks).
13) Test cases (UAT checklist)
- Redelivering the same `event_id` → one effect and `200` for duplicates.
- Signature: correct key, incorrect key, old key (rotation), `X-Timestamp` out of window.
- Backoff: the recipient returns `429` with `Retry-After` → correct pause.
- Order: `...processed` events arrive before `...created` → correct processing/waiting.
- Database failure at the receiver between the effect and `mark_seen` → atomicity/retry.
- DLQ and manual replay → successful delivery.
- Mass "storm" (provider sends bursts) → no loss, and rate limits do not starve critical events.
14) Mini snippets
Sender signature (pseudo):
pseudo
body = json(event)
canonical = ts + "\n" + "POST" + "\n" + path + "\n" + sha256(body)
sig = base64(hmac_sha256(secret, canonical))
headers = {"X-Timestamp": ts, "X-Event-Id": event.event_id, "X-Signature": "v1="+sig}
POST(url, body, headers)
Verification on the recipient (pseudo):
pseudo
assert abs(now - X-Timestamp) <= 300
assert timingSafeEqual("v1=" + base64(hmac_sha256(secret, canonical)), X-Signature)
if kv.exists("idemp:"+event_id): return 200
begin tx
if kv.exists("idemp:"+event_id): commit; return 200
handle(event.data) // domain logic
kv.set("idemp:"+event_id, "ok", ttl=259200) // 3 days
commit
return 200
15) Common mistakes
No deduplication → repeated effects (double refunds/payouts).
Signature without a timestamp/window → replay vulnerability.
Sharing one HMAC secret across all partners.
Responding `200` before the result is committed → events lost on a crash.
Leaking security details into responses/logs.
No DLQ/replay → incidents cannot be resolved.
16) Implementation cheat sheet
Security: HMAC v1 + `X-Timestamp` + `X-Event-Id`, window ≤ 5 minutes; mTLS/IP allow-list as required.
Envelope: `event_id`, `type`, `version`, `ts`, `attempt`, `data`.
Delivery: at-least-once, backoff with jitter, `Retry-After`, DLQ + replay API.
Idempotency: KV cache for 24-72 h, atomic commit of the effect + `mark_seen`.
Observability: delivery, signature, and duplicate metrics; `trace_id`.
Documentation: version, response codes, examples, UAT checklist.
Summary
Reliable webhooks rest on three pillars: a signed envelope, at-least-once delivery, and idempotent processing. Formalize the contract, enable HMAC/mTLS and the time window, implement retries plus a DLQ and replay, store idempotency markers, and commit effects atomically. Events then stay reliable through network failures, load peaks, and the occasional duplicate.