GH GambleHub

Webhook Delivery Guarantees

Webhooks are asynchronous system-to-subscriber notifications over HTTP(S). The network is unreliable: responses get lost, and requests arrive duplicated or out of order. Delivery guarantees therefore cannot rest on TCP alone; they are built at the level of the webhook protocol and domain idempotency.

The key goal: provide at-least-once delivery with per-key ordering (where necessary), give the subscriber what it needs for idempotent processing, and offer a reconcile tool for recovery.


1) Guarantee levels

Best-effort - a single attempt, without retries. Acceptable only for "unimportant" events.
At-least-once (recommended) - duplicates and out-of-order delivery are possible, but the event will be delivered provided the subscriber becomes available within the retry window.
Effectively-exactly-once (at the effect level) - achieved by combining idempotency with dedup storage on the subscriber/sender side. True "exactly-once" HTTP transport is not possible.


2) Webhook contract: minimum required

Headers (example):

X-Webhook-Id: 5d1e6a1b-4f7d-4a3d-8b3a-6c2b2f0f3f21  # global event ID
X-Delivery-Attempt: 3                               # attempt number
X-Event-Type: payment.authorized.v1                 # type/version
X-Event-Time: 2025-10-31T12:34:56Z                  # ISO 8601
X-Partition-Key: psp_tx_987654                      # ordering key
X-Seq: 418                                          # monotonic number per key
X-Signature-Alg: HMAC-SHA256
X-Signature: t=1730378096,v1=hex(hmac(secret, t || body))
Content-Type: application/json

Body (example, JSON):
{
  "id": "5d1e6a1b-4f7d-4a3d-8b3a-6c2b2f0f3f21",
  "type": "payment.authorized.v1",
  "occurred_at": "2025-10-31T12:34:56Z",
  "partition_key": "psp_tx_987654",
  "sequence": 418,
  "data": {
    "payment_id": "psp_tx_987654",
    "amount": "10.00",
    "currency": "EUR",
    "status": "AUTHORIZED"
  },
  "schema_version": 1
}

Requirement for the recipient: respond with '2xx' quickly, right after validating the signature and buffering the event, and do the business processing asynchronously.


3) Order and causality

Per-key order: ordering is guaranteed only within a single 'partition_key' (e.g. 'player_id', 'wallet_id', 'psp_tx_id').
Global order is not guaranteed.
On the sender side there is a queue serialized by key (one consumer per key/shard); on the recipient side there is an inbox keyed by '(source, event_id)' that can optionally wait for a missing 'seq'.

If gaps are critical, provide a pull API, 'GET /events?after=checkpoint', so the subscriber can catch up and reconcile.
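
The recipient-side "wait for a missing seq" step can be sketched as a small reorder buffer. This is only an illustration: the class and method names are hypothetical, and it assumes that 'seq' starts at a known value and increases by exactly 1 per 'partition_key', which the contract above does not state explicitly.

```python
from collections import defaultdict


class KeyedReorderBuffer:
    """Releases events in seq order per partition key, parking early arrivals.

    ASSUMPTION: seq is contiguous (+1 per event) within each key.
    """

    def __init__(self, first_seq=1):
        self.next_seq = defaultdict(lambda: first_seq)  # next expected seq per key
        self.parked = defaultdict(dict)                 # key -> {seq: event}

    def accept(self, key, seq, event):
        """Return the events that are now safe to process, in seq order."""
        if seq < self.next_seq[key]:
            return []  # already released earlier: a late duplicate
        self.parked[key][seq] = event
        ready = []
        # Drain every event that continues the unbroken sequence.
        while self.next_seq[key] in self.parked[key]:
            ready.append(self.parked[key].pop(self.next_seq[key]))
            self.next_seq[key] += 1
        return ready

    def gap(self, key):
        """Return the seq being waited for if events are parked, else None."""
        return self.next_seq[key] if self.parked[key] else None
```

When 'gap(key)' keeps returning the same value for too long, that is the signal to stop waiting and fetch the missing events through the pull API.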


4) Idempotency and deduplication

Each webhook carries a stable 'X-Webhook-Id'.
The recipient keeps an 'inbox' table with primary key 'source + event_id'; repeated deliveries are a no-op.

Side effects (writes to the database/wallet) are performed only once, when the event is first seen.

For commands with side effects, use an Idempotency-Key and cache the result for the duration of the retry window.
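
A minimal sketch of the inbox idea, using SQLite from the Python standard library. The table layout and function names are assumptions made for illustration; the only part taken from the text is the primary key on '(source, event_id)' and the rule that a repeat is a no-op.

```python
import sqlite3


def open_inbox(path=":memory:"):
    """Create the inbox table; the composite primary key enforces dedup."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " source TEXT NOT NULL, event_id TEXT NOT NULL, payload TEXT NOT NULL,"
        " PRIMARY KEY (source, event_id))"
    )
    return db


def record_once(db, source, event_id, payload):
    """Return True only the first time this (source, event_id) is stored."""
    with db:  # commits on success, rolls back on exception
        cur = db.execute(
            "INSERT OR IGNORE INTO inbox (source, event_id, payload)"
            " VALUES (?, ?, ?)",
            (source, event_id, payload),
        )
    return cur.rowcount == 1  # 0 rows changed means the key already existed


def handle(db, source, event_id, payload, apply_effect):
    """Run the side effect only for first-seen events."""
    if record_once(db, source, event_id, payload):
        apply_effect(payload)
        return "processed"
    return "duplicate"
```

In this sketch the inbox write and the side effect are separate steps, so a crash between them would leave the event recorded but unapplied. A production version would likely run the effect in the same transaction as the insert, or track a per-row processing status.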


5) Retries, backoff and windows

Retry policy (reference):
  • Retry on '5xx', timeouts, connection errors, '409 Conflict' (when retryable) and '429'.
  • Do not retry on other '4xx' codes; the exceptions are '409/423/429', and only when sender and recipient agree on their semantics.
  • Exponential backoff with full jitter: base delays of 0.5 s, 1 s, 2 s, 4 s, 8 s, … capped at 'max = 10-15 min'; retry window TTL: for example, 72 hours.
  • Respect 'Retry-After' from the recipient.
  • Set an overall deadline: after it, mark the event as undelivered and move it to the DLQ.

Config (example, YAML):

retry:
  initial_ms: 500
  multiplier: 2.0
  jitter: full
  max_delay_ms: 900000
  ttl: 72h
  retry_on: [TIMEOUT, 5xx, 429]
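
The delay calculation behind this policy might look like the sketch below. The function names are hypothetical; the defaults mirror 'initial_ms', 'multiplier' and 'max_delay_ms' from the reference config. "Full jitter" is read here as drawing the delay uniformly between zero and the exponential cap.

```python
import random


def backoff_cap_ms(attempt, initial_ms=500, multiplier=2.0, max_delay_ms=900_000):
    """Upper bound of the delay before retry number `attempt` (1-based)."""
    return min(max_delay_ms, initial_ms * multiplier ** (attempt - 1))


def next_delay_ms(attempt, retry_after_s=None, rng=random):
    """Full jitter: uniform in [0, cap], but never sooner than Retry-After."""
    delay = rng.uniform(0, backoff_cap_ms(attempt))
    if retry_after_s is not None:
        # The recipient asked for a minimum pause; honor it.
        delay = max(delay, retry_after_s * 1000)
    return delay
```

Randomizing over the whole interval, rather than adding a small random offset to a fixed delay, spreads retries from many senders more evenly after an outage.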

6) DLQ and redrive

DLQ - a "graveyard" for poison or TTL-expired events, stored with full metadata (payload, headers, errors, attempts, hashes).
Web console/API for redrive (targeted re-delivery), with the option to edit the endpoint/secret first.
Redrive is rate-limited; batch redrive supports prioritization.


7) Security

mTLS (if possible) or TLS 1.2+.

Body signature (HMAC with a secret per tenant/endpoint). Verification:

1. Extract 't' (timestamp) from the header and check it against a sliding window (for example, ± 5 minutes).

2. Rebuild the signed string from 't' and the raw body, compute the HMAC, and compare it with the received signature in constant time.
Anti-replay: store '(event_id, t)' and reject requests that are too old or repeated.
Secret rotation: support two active secrets during the rotation period.
Optional: IP allowlist, a 'User-Agent' header, dedicated egress IPs.
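
The verification steps could be sketched as follows, using only the standard library. Two things here are assumptions rather than part of the contract: the function names, and the exact signed string, which this sketch takes to be the timestamp, a dot, and the raw body. Whatever format the real contract defines must be used on both sides.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, t: int, body: bytes) -> str:
    # ASSUMPTION: signed string is b"<t>.<body>"; follow the real contract.
    msg = str(t).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def parse_signature_header(value: str):
    """Split 't=...,v1=...' into (timestamp, hex_digest)."""
    parts = dict(item.split("=", 1) for item in value.split(","))
    return int(parts["t"]), parts["v1"]


def verify(header: str, body: bytes, secrets, tolerance_s=300, now=None) -> bool:
    """Check the time window, then the HMAC against each active secret.

    `secrets` holds every currently valid secret (two during rotation).
    """
    try:
        t, digest = parse_signature_header(header)
    except (KeyError, ValueError):
        return False  # malformed header
    now = time.time() if now is None else now
    if abs(now - t) > tolerance_s:
        return False  # outside the sliding window
    return any(
        hmac.compare_digest(sign(s, t, body).encode(), digest.encode())
        for s in secrets
    )
```

The sketch covers the time window and the constant-time comparison only. Anti-replay still needs the '(event_id, t)' store described above, since a captured request stays valid for the whole tolerance window.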

8) Quotas, rate limits and fairness

Fair queue per tenant/subscriber, so that one subscriber/tenant cannot clog the shared pool.
Quotas and burst limits on outgoing traffic, overall and per endpoint.
Reaction to '429': honor 'Retry-After' and throttle the stream; under prolonged limiting, degrade (send only critical event types).
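
One simple way to get per-tenant fairness is round-robin over per-tenant queues. The sketch below is an illustration under that assumption; the class name is hypothetical, and real schedulers often add weights or token buckets on top.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin across tenants so one backlog cannot starve the others."""

    def __init__(self):
        self._queues = OrderedDict()  # tenant -> deque of events, in service order

    def put(self, tenant, event):
        self._queues.setdefault(tenant, deque()).append(event)

    def get(self):
        """Pop one event from the tenant at the head, then rotate that tenant."""
        if not self._queues:
            return None
        tenant, q = next(iter(self._queues.items()))
        event = q.popleft()
        if q:
            self._queues.move_to_end(tenant)  # go to the back of the line
        else:
            del self._queues[tenant]          # nothing left for this tenant
        return tenant, event
```

With three events queued for a noisy tenant and one for a quiet tenant, the quiet tenant is served second rather than fourth.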


9) Subscription lifecycle

Register/Verify: POST endpoint → challenge/response or out-of-band confirmation.
Lease (optional): the subscription is valid until 'valid_to'; renewal is explicit.
Secret rotation: `current_secret`, `next_secret` with `switch_at`.
Test ping: a synthetic event to test the route before enabling the main topics.
Health probes: periodic HEAD/GET with latency and TLS profile checks.


10) Schema evolution (event versions)

Event type versioning: `payment.authorized.v1` → `…v2`.
Evolution is additive (new fields → MINOR API version); breaking changes → a new type.
Schema registry (JSON-Schema/Avro/Protobuf) + automatic validation before sending.
The 'X-Event-Type' header and the 'schema_version' field in the body are both required.


11) Observability and SLO

Metrics (by type/tenant/subscriber):
  • `deliveries_total`, `2xx/4xx/5xx_rate`, `timeout_rate`, `signature_fail_rate`.
  • `attempts_avg`, `p50/p95/p99_delivery_latency_ms` (from publish to 2xx).
  • `dedup_rate`, `out_of_order_rate`, `dlq_rate`, `redrive_success_rate`.
  • `queue_depth`, `oldest_in_queue_ms`, `throttle_events`.
SLO (reference):
  • Share of deliveries completed within 60 s (p95): 99.5% for critical events.
  • DLQ ≤ 0.1% in 24 hours.
  • Signature failures ≤ 0.05%.

Logs/tracing: `event_id`, `partition_key`, `seq`, `attempt`, `endpoint`, `tenant_id`, `schema_version`, `trace_id`.


12) Sender reference algorithm

1. Write the event to a transactional outbox.
2. Determine 'partition_key' and 'seq'; enqueue.
3. A worker picks events up by key, builds the request, signs it, and sends it with timeouts (connect/read).
4. On '2xx' - mark as delivered, record latency and the seq checkpoint.
5. On '429/5xx/timeout' - retry according to policy.
6. On TTL expiry → DLQ and alert.
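
Steps 3 to 6 can be sketched as a single delivery loop. Everything injected here ('send', 'next_delay', 'sleep') is a hypothetical hook so the loop stays testable. The sketch also simplifies in two places: it uses an attempt budget as a stand-in for the TTL window, and it routes non-retryable responses straight to the DLQ.

```python
def deliver(event, send, next_delay, sleep, max_attempts=8):
    """Try to deliver one event; return ("delivered" | "dlq", attempts_made).

    send(event, attempt) -> HTTP status code, or None on timeout/connection error.
    next_delay(attempt)  -> pause before the next attempt.
    ASSUMPTION: max_attempts stands in for the TTL-based retry window.
    """
    for attempt in range(1, max_attempts + 1):
        status = send(event, attempt)
        if status is not None and 200 <= status < 300:
            return "delivered", attempt
        retryable = status is None or status == 429 or status >= 500
        if not retryable:
            return "dlq", attempt  # retrying cannot fix this response
        if attempt < max_attempts:
            sleep(next_delay(attempt))
    return "dlq", max_attempts  # retry budget exhausted
```

Running one such loop per partition key at a time preserves per-key order, because the next event for that key is not sent until the current one is delivered or parked in the DLQ.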


13) Receiver reference algorithm

1. Accept the request, check TLS/protocol.
2. Validate the signature and the time window.
3. Fast ACK with 2xx (after a synchronous write to the local inbox/queue).
4. An asynchronous worker reads the 'inbox', checks 'event_id' (dedup) and, if necessary, orders by 'seq' within 'partition_key'.
5. It performs the effects and writes an offset/seq checkpoint for reconcile.
6. On error - local retries; "poison" tasks → local DLQ with alerts.


14) Reconcile (pull loop)

For incidents that push delivery cannot recover from:
  • 'GET /events?partition_key=...&after_seq=...&limit=...' - returns all missed events.
  • Token checkpoint: 'after=opaque_token' instead of seq.
  • Idempotent redelivery: the same 'event_id', signed the same way with a new 't'.
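
On the subscriber side, the pull loop might look like the sketch below. 'fetch_page' is a hypothetical client wrapper around the 'GET /events' call. The sketch assumes pages come back sorted by 'sequence' and that a page shorter than 'limit' means the subscriber has caught up.

```python
def reconcile(fetch_page, apply_event, checkpoint, limit=100):
    """Pull missed events after `checkpoint`; return the new checkpoint.

    fetch_page(after_seq, limit) -> list of event dicts sorted by "sequence"
    (hypothetical client for the pull API).
    """
    while True:
        page = fetch_page(checkpoint, limit)
        for event in page:
            apply_event(event)  # must be idempotent: pulls may overlap pushes
            checkpoint = event["sequence"]
        if len(page) < limit:
            return checkpoint   # short page: nothing more to fetch
```

Because the same event can arrive both by push and by pull, 'apply_event' should go through the same inbox dedup as regular deliveries.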

15) Useful headers and codes

2xx - accepted (even if business processing happens later).
410 Gone - the endpoint is closed (the sender stops delivery and marks the subscription as "archived").
409/423 - the resource is temporarily locked → a retry is reasonable.
429 - too many requests → throttle and back off.
400/401/403/404 - configuration error; stop retrying and open a ticket.
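
The sender's reaction to each response can be expressed as a small lookup. The function name and the action labels are hypothetical. Treating every unlisted code (other 4xx, redirects) as "stop and alert" is an assumption of this sketch.

```python
def classify(status):
    """Map a delivery outcome to the sender's next action.

    `status` is an HTTP status code, or None for a timeout/connection error.
    """
    if status is None:
        return "retry"
    if 200 <= status < 300:
        return "delivered"
    if status == 410:
        return "archive_subscription"  # endpoint closed for good
    if status in (409, 423, 429) or status >= 500:
        return "retry"                 # temporary condition
    return "stop_and_alert"            # ASSUMPTION: everything else needs a human
```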


16) Multi-tenant and regions

Separate queues and limits per tenant/endpoint.
Data residency: send from within the data's region; pass-through headers 'X-Tenant', 'X-Region'.
Failure isolation: one subscriber going down does not affect the rest (separate pools).


17) Testing

Contract tests: fixed examples of bodies/signatures, validation checks.
Chaos: drops/duplicates, shuffled order, network delays, 'RST', TLS errors.
Load: burst storms, measuring p95/p99.
Security: anti-replay, stale timestamps, wrong secrets, rotation.
DR/Replay: mass redrive from the DLQ in an isolated environment.


18) Playbooks (runbooks)

1. 'signature_fail_rate' is growing

Check clock drift, an exceeded 'tolerance' window, and secret rotation; temporarily enable "dual secret."

2. The queue is aging ('oldest_in_queue_ms' ↑)

Add workers, prioritize critical topics, temporarily reduce the frequency of "noisy" types.

3. Storm of '429' from a subscriber

Enable throttling and pauses between attempts; defer less critical event types.

4. Mass '5xx'

Open the circuit breaker for the specific endpoint, switch to defer & batch; notify the subscriber.

5. DLQ is filling up

Stop non-critical publishing, enable batch redrive at low RPS, alert the subscription owners.


19) Typical mistakes

Heavy synchronous processing before the 2xx response → timeouts, retries and duplicates.
No body signature or time window → vulnerable to tampering/replay.
No 'event_id' and no 'inbox' → processing cannot be made idempotent.
Attempting a "global order" → perpetual queue locks.
Retries without jitter/limits → the incident gets amplified (thundering herd).
A single shared pool for all subscribers → one "noisy" subscriber takes everyone down.


20) Pre-production checklist

  • Contract: 'event_id', 'partition_key', 'seq', 'event_type.vN', HMAC signature and timestamp.
  • Sender: outbox, serialization by key, retries with backoff + jitter, TTL, DLQ and redrive.
  • Recipient: fast write to inbox + 2xx; idempotent processing; local DLQ.
  • Security: TLS, signatures, anti-replay, dual-secret, rotation.
  • Quotas/limits: fair queue per tenant/endpoint, respect 'Retry-After'.
  • Reconcile APIs and checkpoints; documentation for subscribers.
  • Observability: p95/throughput/errors/DLQ, tracing by 'event_id'.
  • Event versioning and schema evolution policy.
  • Incident playbooks and a global pause/resume switch.

Conclusion

Reliable webhooks are a protocol on top of HTTP, not just "POST with JSON." A clear contract (ID, ordering key, signature), idempotency, retries with jitter, fair queues and well-rehearsed playbooks turn best-effort into a predictable and measurable delivery mechanism. Build at-least-once + per-key order + reconcile, and the system will calmly survive network failures, load peaks and human errors.
