Webhooks: replays and acknowledgements
1) Basic delivery model
At-least-once (default): the event is delivered one or more times. Exactly-once effects are achieved through receiver idempotency.
Acknowledgement (ACK): only a 2xx response (usually 200/204) from the recipient means success. Anything else is treated as a failure and triggers a retry.
Fast ACK: respond with 2xx after placing the event in a queue, not after full business processing.
2) Event format and mandatory headers
Payload (example)
```json
{
  "id": "evt_01HXYZ",
  "type": "order.created",
  "occurred_at": "2025-11-03T18:10:12Z",
  "sequence": 128374,
  "source": "orders",
  "data": { "order_id": "o_123", "amount": "49.90", "currency": "EUR" },
  "schema_version": 1
}
```
Sender headers
`X-Webhook-Id: evt_01HXYZ` - unique event ID (use it for deduplication).
`X-Webhook-Seq: 128374` - monotonic sequence number (per subscription/topic).
`X-Signature: sha256=<base64(hmac_sha256(body, secret))>` - HMAC signature.
`X-Retry: 0, 1, 2, ...` - attempt number.
`X-Webhook-Version: 1` - contract version.
(optional) `Traceparent` - trace correlation.
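As an illustration of how these headers might be assembled, here is a minimal sketch using only the Python standard library. The function names `sign_body` and `build_headers` and the demo secret are assumptions made for this example, not part of any particular sender's API.

```python
import base64
import hashlib
import hmac
import json


def sign_body(body: bytes, secret: bytes) -> str:
    """Return the X-Signature value: 'sha256=' + base64(HMAC-SHA256(body))."""
    digest = hmac.new(secret, body, hashlib.sha256).digest()
    return "sha256=" + base64.b64encode(digest).decode("ascii")


def build_headers(event: dict, body: bytes, secret: bytes, attempt: int = 0) -> dict:
    """Assemble the sender headers for one delivery attempt (hypothetical helper)."""
    return {
        "X-Webhook-Id": event["id"],
        "X-Webhook-Seq": str(event["sequence"]),
        "X-Signature": sign_body(body, secret),
        "X-Retry": str(attempt),
        "X-Webhook-Version": "1",
        "Content-Type": "application/json",
    }


event = {"id": "evt_01HXYZ", "type": "order.created", "sequence": 128374}
# Serialize once and sign exactly the bytes that go on the wire.
body = json.dumps(event, separators=(",", ":")).encode("utf-8")
headers = build_headers(event, body, secret=b"demo-secret")
```

Signing the serialized bytes rather than the parsed object matters: the receiver can only verify the signature against the raw body it actually received.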
Response from recipient
2xx - accepted successfully (no further retries for this `id`).
410 Gone - endpoint deleted/inactive → the sender stops retrying and deactivates the subscription.
429/5xx/timeout - the sender retries according to the retry policy.
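The response rules above can be sketched as a small classifier on the sender side. The `Action` enum and the `classify_response` name are hypothetical, and representing a timeout as `None` is an assumption of this example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ACK = "ack"                # delivered, no further attempts
    DEACTIVATE = "deactivate"  # stop retrying and disable the subscription
    RETRY = "retry"            # schedule another attempt


def classify_response(status: Optional[int]) -> Action:
    """Map a receiver response to a sender action.

    `status` is None when the request timed out or the connection failed.
    """
    if status is None:
        return Action.RETRY
    if 200 <= status < 300:
        return Action.ACK
    if status == 410:
        return Action.DEACTIVATE
    # 429, 5xx and any other non-2xx code count as a failure and are retried.
    return Action.RETRY
```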
3) Retry policy
Recommended backoff ladder (+ jitter)
`1s, 3s, 10s, 30s, 2m, 10m, 30m, 2h, 6h, 24h` (stop after the limit, for example 48-72 hours).
Rules:
- Exponential backoff + random jitter (±20-30%) to avoid the thundering-herd effect.
- Treat temporary failures as retryable (for example, 5xx or a network timeout).
- Respect 429: wait at least `max(Retry-After header, next backoff window)`.
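A minimal sketch of the delay calculation, assuming the ladder from the text and a jitter of ±25%. The `next_delay` name and its parameters are hypothetical, and the sketch treats `Retry-After` as a lower bound on the wait, which is one reading of the 429 rule.

```python
import random
from typing import Optional

# Ladder from the text, in seconds: 1s, 3s, 10s, 30s, 2m, 10m, 30m, 2h, 6h, 24h.
LADDER = [1, 3, 10, 30, 120, 600, 1800, 7200, 21600, 86400]


def next_delay(attempt: int, retry_after: Optional[float] = None,
               jitter: float = 0.25,
               rng: Optional[random.Random] = None) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (1-based).

    Returns None once the ladder is exhausted, i.e. the event should go to the DLQ.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > len(LADDER):
        return None
    source = rng if rng is not None else random.Random()
    # Spread retries around the ladder step so senders do not fire in lockstep.
    delay = LADDER[attempt - 1] * source.uniform(1 - jitter, 1 + jitter)
    if retry_after is not None:
        # Assumption: never retry sooner than the receiver asked for.
        delay = max(delay, retry_after)
    return delay
```

For example, `next_delay(1, retry_after=60)` yields 60 seconds, because the receiver's request outweighs the roughly one-second first step.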
Timeouts and sizes
Connection timeout ≤ 3-5 seconds; total response timeout ≤ 10 seconds.
Body size is capped by the contract (for example, ≤ 256 KB); larger bodies get 413 → fall back to chunking or a pull URL.
4) Idempotency and deduplication
Idempotent application: processing the same `id` again must return the same result and must not change state again.
Dedup storage on the recipient's side: store `(X-Webhook-Id, processed_at, checksum)` with TTL ≥ the retry window (24-72 hours).
Composite key: with several topics, use `(subscription_id, event_id)`.
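To make the dedup storage concrete, here is an in-memory sketch keyed by `(subscription_id, event_id)` with a TTL. The `DedupStore` class is hypothetical and keeps only an expiry time per key (not `processed_at` or a checksum); a production receiver would more likely use a database or a cache with native expiry.

```python
import time
from typing import Callable, Dict, Tuple


class DedupStore:
    """In-memory dedup store keyed by (subscription_id, event_id) with a TTL."""

    def __init__(self, ttl_seconds: float = 72 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[Tuple[str, str], float] = {}

    def first_time(self, subscription_id: str, event_id: str) -> bool:
        """Record the key; return True only if it was not seen within the TTL."""
        now = self._clock()
        key = (subscription_id, event_id)
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires[key] = now + self._ttl
        return True

    def purge(self) -> int:
        """Drop expired keys and return how many were removed."""
        now = self._clock()
        stale = [k for k, exp in self._expires.items() if exp <= now]
        for k in stale:
            del self._expires[k]
        return len(stale)
```

The injectable `clock` exists only to make the TTL behavior testable without sleeping.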
5) Order and "exactly-once effects"
Strict ordering is hard to guarantee in distributed systems. Use:
- Partition by key: the same logical entity (for example, `order_id`) always goes through one delivery "channel."
- Sequence: reject events with a stale `X-Webhook-Seq`, and hold events that arrive early in a "parking lot" until the missing ones arrive.
Exactly-once effects are achieved with:
- a log of applied operations (outbox/inbox pattern),
- a transactional upsert by `event_id` in the database,
- sagas/compensations for complex processes.
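The sequence-plus-parking-lot idea can be sketched as a small reorder buffer, one instance per partition key. The `Sequencer` class is hypothetical, and it assumes sequence numbers within a partition increase by exactly 1 with no gaps, which the text does not guarantee.

```python
from typing import Any, Dict, List, Tuple


class Sequencer:
    """Releases the events of one partition in strict sequence order."""

    def __init__(self, next_seq: int) -> None:
        self.next_seq = next_seq          # next sequence number we expect
        self.parked: Dict[int, Any] = {}  # "parking lot" for early arrivals

    def accept(self, seq: int, event: Any) -> List[Tuple[int, Any]]:
        """Return the events that are now ready to apply, in order."""
        if seq < self.next_seq:
            return []  # stale or duplicate sequence number: drop it
        self.parked[seq] = event
        ready = []
        # Drain the parking lot for as long as the next expected event is there.
        while self.next_seq in self.parked:
            ready.append((self.next_seq, self.parked.pop(self.next_seq)))
            self.next_seq += 1
        return ready
```

In practice the parking lot also needs a size or age limit, so that a permanently missing event triggers a replay instead of blocking the partition forever.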
6) Error resolution by status codes (Table)
7) Channel security
HMAC signature on every message; verify it at the receiver within a time window (protects against MITM and replay attacks).
mTLS for sensitive domains (LCC/payments).
IP allowlist of outgoing addresses, TLS 1.2+, HSTS.
PII minimization: do not send unnecessary personal data; mask it in logs.
Secret rotation: two valid keys (active/next) and the `X-Key-Id` header to indicate which one is in use.
8) Queues, DLQs and Replays
Events must be written to an outgoing queue/log on the sender side (for reliable replay).
When the retry limit is exceeded, the event goes to the DLQ (Dead Letter Queue) together with the failure cause.
Replay API (for recipient/operator): resend by `id`/time range/topic, with an RPS limit and additional signature/authorization.
POST /v1/webhooks/replay
{ "subscription_id": "sub_123", "from": "2025-11-03T00:00:00Z", "to": "2025-11-03T12:00:00Z" }
→ 202 Accepted
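On the sender side, a replay request like the one above might be resolved against the event log roughly as follows. This is a sketch under assumptions: the log is a list of dicts carrying `subscription_id`, `occurred_at` and `sequence`, the time range is treated as half-open (`from` inclusive, `to` exclusive), and `select_for_replay` with its `max_events` cap is a hypothetical helper.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2025-11-03T00:00:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def select_for_replay(log: Iterable[Dict], request: Dict,
                      max_events: int = 1000) -> List[Dict]:
    """Pick logged events for one subscription inside [from, to), in sequence order."""
    start, end = parse_utc(request["from"]), parse_utc(request["to"])
    if start >= end:
        raise ValueError("'from' must be earlier than 'to'")
    matches = [
        e for e in log
        if e["subscription_id"] == request["subscription_id"]
        and start <= parse_utc(e["occurred_at"]) < end
    ]
    matches.sort(key=lambda e: e["sequence"])
    return matches[:max_events]  # cap the batch; the RPS limit applies on send
```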
9) Contract and version
Version the event (the `schema_version` field) and the transport (`X-Webhook-Version`).
Add fields only as optional; when removing a field, use a minor migration and a transition period (dual-write).
Document event types, examples, schemas (JSON Schemas), error codes.
10) Observability and SLO
Sender key metrics:
- `delivery_success_rate` (2xx/all attempts), `first_attempt_success_rate`
- `retries_total`, `max_retry_age_seconds`, `dlq_count`
- `latency_p50/p95` (occurred_at → ack_received_at)
Receiver key metrics:
- `ack_latency` (receive → 2xx), `processing_latency` (enqueue → done)
- `duplicates_total`, `invalid_signature_total`, `out_of_order_total`
SLO (example):
- 99.9% of events receive the first ACK within 60 seconds (28-day window).
- DLQ ≤ 0.1% of total events; DLQ replay within 24 hours.
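As an illustration, the sender-side rates could be derived from per-attempt records as in this sketch. The `Attempt` record shape is hypothetical, and `first_attempt_success_rate` is assumed to mean the share of first attempts (`X-Retry: 0`) that got a 2xx.

```python
from typing import Dict, Iterable, NamedTuple


class Attempt(NamedTuple):
    event_id: str
    retry: int   # X-Retry value: 0 for the first attempt
    status: int  # HTTP status, or 0 for a timeout / network error


def delivery_metrics(attempts: Iterable[Attempt]) -> Dict[str, float]:
    """Compute success rates and the retry count from delivery attempts."""
    attempts = list(attempts)
    ok = [a for a in attempts if 200 <= a.status < 300]
    first = [a for a in attempts if a.retry == 0]
    first_ok = [a for a in first if 200 <= a.status < 300]
    return {
        # 2xx responses divided by all attempts, as defined in the list above.
        "delivery_success_rate": len(ok) / len(attempts) if attempts else 0.0,
        "first_attempt_success_rate": len(first_ok) / len(first) if first else 0.0,
        "retries_total": float(sum(1 for a in attempts if a.retry > 0)),
    }
```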
11) Timing and network breaks
Use UTC in the time fields; synchronize NTP.
Send `occurred_at` and record `delivered_at` to measure the lag.
During long network or endpoint outages, accumulate events in the queue and limit its growth (backpressure + quotas).
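A small sketch of the lag calculation, assuming `occurred_at` arrives as an ISO 8601 UTC string and `delivered_at` is recorded as a timezone-aware `datetime`. The function name `delivery_lag_seconds` is hypothetical.

```python
from datetime import datetime, timezone


def delivery_lag_seconds(occurred_at: str, delivered_at: datetime) -> float:
    """Seconds between the event's occurred_at and the recorded delivered_at."""
    if delivered_at.tzinfo is None:
        raise ValueError("delivered_at must be timezone-aware (use UTC)")
    # Replace the 'Z' suffix so older Python versions can parse the timestamp.
    occurred = datetime.fromisoformat(occurred_at.replace("Z", "+00:00"))
    return (delivered_at - occurred).total_seconds()


lag = delivery_lag_seconds(
    "2025-11-03T18:10:12Z",
    datetime(2025, 11, 3, 18, 10, 15, tzinfo=timezone.utc),
)
```

Rejecting naive datetimes up front keeps a local-time value from silently skewing the lag by the host's UTC offset.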
12) Recommended limits and hygiene
RPS per subscription (e.g. 50 RPS, burst 100) + concurrency (e.g. 10).
Max. body: 64-256 KB; for larger payloads, send "notification + URL" and sign the download.
Event names in `snake_case` or `dot.type` (`order.created`).
Strict idempotency of the receiver's write operations.
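One common way to enforce a per-subscription RPS limit with a burst allowance is a token bucket. The sketch below uses the example figures from the list (50 RPS, burst 100); the `TokenBucket` class is hypothetical, and the concurrency cap would need a separate mechanism such as a semaphore.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` sends per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float = 50.0, burst: float = 100.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self._clock = clock
        self._tokens = burst   # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; otherwise the caller should delay the send."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```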
13) Examples: Sender and Receiver
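13.1 Sender (pseudocode)
```python
import base64
import hashlib
import hmac
import json

def hmac_sha256_base64(body: bytes, secret: bytes) -> str:
    return base64.b64encode(hmac.new(secret, body, hashlib.sha256).digest()).decode("ascii")

# http, endpoint, secret, mark_delivered, deactivate_subscription and
# schedule_retry are placeholders for the sender's own infrastructure.
def send_event(event, attempt=0):
    body = json.dumps(event).encode("utf-8")  # sign exactly the bytes that are sent
    sig = hmac_sha256_base64(body, secret)
    headers = {
        "X-Webhook-Id": event["id"],
        "X-Webhook-Seq": str(event["sequence"]),
        "X-Retry": str(attempt),
        "X-Signature": f"sha256={sig}",
        "Content-Type": "application/json",
    }
    res = http.post(endpoint, body, headers, timeout=10)
    if 200 <= res.status < 300:
        mark_delivered(event["id"])
    elif res.status == 410:
        deactivate_subscription()
    else:
        schedule_retry(event, attempt + 1)  # backoff + jitter, respect 429 Retry-After
```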
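13.2 Receiver (pseudocode)
```python
# Flask-style handler; verify_hmac, dedup_store, enqueue_for_processing and
# secret are placeholders for the receiver's own components.
@app.post("/webhooks")
def handle():
    body = request.data
    headers = request.headers
    if not verify_hmac(body, headers["X-Signature"], secret):
        return "", 401  # reject unsigned or tampered requests
    evt_id = headers["X-Webhook-Id"]
    if dedup_store.exists(evt_id):
        return "", 204  # duplicate: acknowledge without reprocessing
    enqueue_for_processing(body)  # fast path: defer business logic
    dedup_store.put(evt_id, ttl=72 * 3600)
    return "", 202  # or 204
```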
14) Testing and chaos practices
Negative cases: invalid signature, 429/5xx, timeout, 410, large payloads.
Behavioral: out-of-order delivery, duplicates, delays of 1-10 minutes, a 24-hour outage.
Load: 10× burst; check backpressure and DLQ persistence.
Contracts: JSON Schema, mandatory headers, stable event types.
15) Implementation checklist
- 2xx = ACK, and quick return after enqueue
- Exponential backoff + jitter, respect `Retry-After`
- Receiver idempotency and deduplication by `X-Webhook-Id` (TTL ≥ retry window)
- HMAC signatures, secret rotation, optional mTLS
- DLQ + Replay API, Monitoring and Alerts
- Limits: Timeouts, RPS, Body Size
- Order: partition by key or `sequence` + "parking lot"
- Documentation: schemas, examples, error codes, versions
- Chaos tests: delays, duplicates, network failure, long replay
16) Mini-FAQ
Do I always need to answer 200?
Any 2xx counts as a success. 202/204 is normal practice for "accepted to queue."
Can retries be stopped?
Yes: with a 410 response and/or via the sender's console/API (unsubscribe).
What about large payloads?
Send a "notification + secure URL," sign the download request, and set a TTL.
How do I ensure order?
Partition by key + `sequence`; when a gap appears, use the "parking lot" and replay.
Summary
Reliable webhooks come down to clear ACK (2xx) semantics, sensible retries with backoff + jitter, strict idempotency and deduplication, sound security (HMAC/mTLS), queues + DLQ + replays, and transparent observability. Pin down the contract, set limits and metrics, and run chaos scenarios regularly, and your integrations will stop falling apart at the first failure.