Retry strategies and idempotency
1) Why it is needed
In networks, failures are the norm: timeouts, transient errors, network flapping, overload. Retries improve reliability only if:
1. the repeated operation is safe (idempotent),
2. delays between attempts are observed,
3. limits/quotas and the health of dependencies are respected.
The goal is effectively-once behavior at the level of business operations, without duplicate executions or races.
2) Taxonomy of delivery semantics
At-most-once: no repetition, risk of loss (logging, fire-and-forget).
At-least-once: duplicates are possible → consumer idempotence is needed (most queues, webhooks).
Effectively-once: duplicates are possible, but deduplicated correctly (keys, transactions, outbox).
3) When to retry and when not to
Retrying makes sense for: '408', '429' (honoring 'Retry-After'), '425' (Too Early), '499' (client closed the connection, seen at the perimeter), '5xx' (including '502' and '504' at the gateway), network timeouts and dropped connections, "connection reset".
Do not retry without changing the request: '400', '401', '403', '404', '422'.
Borderline cases: '409 Conflict' (usually not retried; first read the status of the operation or reconfirm the intent).
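The rules above can be sketched as a small classifier. This is a minimal illustration, not a definitive policy: the names `Decision` and `classify` are hypothetical, and treating every status not listed above as non-retryable is an assumption.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    RETRY = "retry"
    NO_RETRY = "no_retry"
    CHECK_FIRST = "check_first"  # read the operation status before deciding


# Status codes taken from the lists above.
RETRYABLE = {408, 425, 429, 499}
NON_RETRYABLE = {400, 401, 403, 404, 422}


def classify(status: Optional[int]) -> Decision:
    """Map an HTTP status to a retry decision.

    `None` stands for a network-level failure (timeout, connection reset).
    """
    if status is None:
        return Decision.RETRY
    if status == 409:
        return Decision.CHECK_FIRST
    if status in RETRYABLE or 500 <= status <= 599:
        return Decision.RETRY
    # Assumption: anything else (including NON_RETRYABLE) is not retried.
    return Decision.NO_RETRY
```

A caller would still apply backoff and honor 'Retry-After' before acting on a `RETRY` decision.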
4) Timeouts, backoff and jitter
4.1 Rules
Timeout first, then retry: each request must have a deadline.
Exponential backoff: 'delay_n = base * 2^n', capped at 'max_delay'.
Jitter is required: add randomness to break up synchronized waves of retries.
4.2 Jitter patterns
Full jitter: 'sleep = rand(0, base * 2^n)', a good default choice.
Decorrelated jitter: 'sleep = min(max_delay, rand(base, sleep_prev * 3))', for long-running exchanges.
Equal jitter: 'sleep = base * 2^n / 2 + rand(0, base * 2^n / 2)', a milder variation.
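The three formulas can be written out as follows. This is a sketch under stated assumptions: the function names and the `rng` parameter are illustrative, and the exponential term is capped at `max_delay` in every variant, which the formulas above only state explicitly for decorrelated jitter.

```python
import random


def exp_delay(base: float, attempt: int, max_delay: float) -> float:
    """Capped exponential delay: min(max_delay, base * 2^attempt)."""
    return min(max_delay, base * (2 ** attempt))


def full_jitter(base, attempt, max_delay, rng=random):
    """Uniform in [0, capped exponential delay]."""
    return rng.uniform(0.0, exp_delay(base, attempt, max_delay))


def equal_jitter(base, attempt, max_delay, rng=random):
    """Half of the delay is fixed, the other half is random."""
    half = exp_delay(base, attempt, max_delay) / 2
    return half + rng.uniform(0.0, half)


def decorrelated_jitter(base, prev_sleep, max_delay, rng=random):
    """Next sleep depends on the previous sleep, not on the attempt number."""
    return min(max_delay, rng.uniform(base, prev_sleep * 3))
```

For decorrelated jitter the caller keeps the previous sleep between attempts, starting from `base`.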
4.3 Retry budget
Limit the share of retries:
- `retry_budget_per_min = max(α * success_rps, β)`, where β is a floor; usually `α = 0.1–0.2`.
- If the budget is exhausted, fail fast or move the circuit breaker to the open state.
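One possible reading of the budget rule is a sliding-window counter. The class below is a hypothetical sketch: `RetryBudget`, its parameters, and the choice to count successes inside the window (rather than a measured requests-per-second rate) are assumptions made for illustration.

```python
import time
from collections import deque


class RetryBudget:
    """Permits retries up to max(alpha * successes_in_window, floor) per window."""

    def __init__(self, alpha=0.1, floor=10, window_s=60.0, clock=time.monotonic):
        self.alpha = alpha
        self.floor = floor
        self.window_s = window_s
        self.clock = clock          # injectable for tests
        self.successes = deque()    # timestamps of successful requests
        self.retries = deque()      # timestamps of permitted retries

    def _trim(self, q, now):
        # Drop entries that fell out of the sliding window.
        while q and now - q[0] > self.window_s:
            q.popleft()

    def record_success(self):
        self.successes.append(self.clock())

    def try_acquire(self) -> bool:
        now = self.clock()
        self._trim(self.successes, now)
        self._trim(self.retries, now)
        budget = max(self.alpha * len(self.successes), self.floor)
        if len(self.retries) < budget:
            self.retries.append(now)
            return True
        return False  # budget exhausted: the caller should fail fast
```

The sketch is not thread-safe; a shared instance would need a lock around `try_acquire` and `record_success`.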
5) Interaction with rate limiting and Circuit Breaker
Respect 'Retry-After' and 'RateLimit-Reset' and factor them into the backoff.
When the rate of '5xx' responses or timeouts is high, lower the retry frequency and overall concurrency.
Circuit breaker states:
- Closed: normal operation.
- Open: rejects immediately (saves resources).
- Half-open: allows a limited number of probe requests.
On write operations, it is preferable to return 409/503 with a clear hint rather than retry aggressively.
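The three breaker states can be illustrated with a small state machine. This is a minimal sketch, assuming a consecutive-failure threshold, a fixed cooldown, and a single probe in the half-open state; the class and parameter names are hypothetical, and production breakers usually track error rates over a window instead.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock              # injectable for tests
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False            # reject instantly, save resources
            self.state = "half_open"    # cooldown elapsed: start probing
            self.probe_in_flight = False
        if self.state == "half_open":
            if self.probe_in_flight:
                return False            # only one probe at a time
            self.probe_in_flight = True
        return True

    def on_success(self):
        self.state = "closed"
        self.failures = 0
        self.probe_in_flight = False

    def on_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

A caller checks `allow()` before each request and reports the outcome with `on_success()` or `on_failure()`; when `allow()` returns False, the request is rejected without touching the dependency.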
6) Idempotency of write operations
6.1 General idea
The same intent → one result. The basis is an idempotency key plus a store of execution records.
6.2 HTTP contract
The client sends the header:
Idempotency-Key: 7a6b7f9e-2a46-4d0b-9c3a-2b30e1c3c9e3
Idempotency-Key-Expiry: 24h # optional
Server:
- on first success, saves the key together with the result (status, response body) and the request body hash;
- on a repeat, returns the stored response with the header 'Idempotency-Replay: true';
- on a body conflict (the same key, but a different payload), returns '409 Conflict'.
6.3 Storage and TTL
Table or key-value store: 'idempotency_key', 'request_hash', 'result', 'status', 'expiry_at'.
TTL = window of possible replays and late deliveries (usually 24-72 hours for payments).
Index on 'idempotency_key'; under high load, shard by hash.
6.4 Example schema (SQL)
```sql
CREATE TABLE idempo_store (
    key        UUID PRIMARY KEY,
    req_hash   BYTEA NOT NULL,
    status     INT NOT NULL,
    response   JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    expiry_at  TIMESTAMPTZ NOT NULL
);
```
6.5 Handler pseudocode
```pseudo
handle_write(req):
    k = req.headers["Idempotency-Key"]
    h = hash(req.body)
    rec = idempo_store.get(k)
    if rec and rec.req_hash == h:
        return rec.status, rec.response, {"Idempotency-Replay": "true"}
    if rec and rec.req_hash != h:
        return 409, problem("IDEMPOTENT_CONFLICT")
    begin tx
        result = apply_business_mutation(req)   # state change
        upsert_once(idempo_store, key=k, req_hash=h, status=201, response=result, expiry=now() + 2d)
    commit
    return 201, result
```
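A runnable counterpart of the pseudocode, as a sketch only. It assumes an in-memory SQLite database, a hypothetical `accounts` table, and a debit as the business mutation; the column name `idem_key` and the omission of the expiry column are simplifications. The point it illustrates is that the mutation and the idempotency record commit in one transaction.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
db.execute("""CREATE TABLE idempo_store (
    idem_key TEXT PRIMARY KEY, req_hash TEXT NOT NULL,
    status INTEGER NOT NULL, response TEXT NOT NULL)""")
db.execute("INSERT INTO accounts VALUES ('acc-1', 100)")
db.commit()


def handle_write(key: str, body: dict):
    """Debit an account once per idempotency key.

    Returns (status, response_body, extra_headers).
    """
    # Canonical JSON so that key order does not change the hash.
    req_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    row = db.execute(
        "SELECT req_hash, status, response FROM idempo_store WHERE idem_key = ?",
        (key,),
    ).fetchone()
    if row and row[0] == req_hash:
        return row[1], json.loads(row[2]), {"Idempotency-Replay": "true"}
    if row:
        return 409, {"error": "IDEMPOTENT_CONFLICT"}, {}
    with db:  # one transaction: mutation and idempotency record commit together
        db.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (body["amount"], body["account"]),
        )
        result = {"debited": body["amount"]}
        db.execute(
            "INSERT INTO idempo_store VALUES (?, ?, ?, ?)",
            (key, req_hash, 201, json.dumps(result)),
        )
    return 201, result, {}
```

Calling `handle_write` twice with the same key and body debits the account once and replays the stored response the second time. With concurrent writers, the primary key on `idem_key` makes the second insert fail, which rolls back its duplicate mutation.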
7) Effectively-once patterns
Transactional outbox: the business change and the outgoing message are written in the same database transaction; a background relay then sends the message; the consumer is idempotent.
Inbox/processed table at the consumer: save 'event_id' to ignore duplicates.
Exactly-once on Kafka ≠ exactly-once in business: even with producer/consumer EOS, application logic should still be idempotent.
Compensating transactions (Saga): if steps are retried and cause side effects, compensation returns the system to its invariant.
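The outbox and inbox patterns can be sketched together. This is an illustration under assumptions: a single in-memory SQLite database stands in for both the producer and consumer databases, the `orders` table is hypothetical, and `publish`/`apply` are caller-supplied callbacks.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER NOT NULL);
CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL,
                     sent INTEGER NOT NULL DEFAULT 0);
CREATE TABLE inbox (event_id TEXT PRIMARY KEY);
""")


def create_order(order_id: str, total: int) -> str:
    """Producer: the business row and the outbox event commit in one transaction."""
    event_id = str(uuid.uuid4())
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps({"order_id": order_id, "total": total})),
        )
    return event_id


def relay(publish) -> None:
    """Background relay: publish unsent events, then mark them as sent.

    A crash between the two steps leads to a re-send (at-least-once).
    """
    rows = db.execute("SELECT event_id, payload FROM outbox WHERE sent = 0").fetchall()
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE event_id = ?", (event_id,))


def consume(event_id: str, payload: dict, apply) -> bool:
    """Consumer: the inbox primary key rejects duplicates. Returns True if applied."""
    try:
        with db:
            db.execute("INSERT INTO inbox VALUES (?)", (event_id,))
            apply(payload)
    except sqlite3.IntegrityError:
        return False  # already processed
    return True
```

Because the relay may re-send, the inbox check on the consumer side is what turns at-least-once delivery into effectively-once processing.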
8) Special cases: payments and financial transactions
Strong idempotency: the key is bound to the operation logic (e.g. 'external_payment_id').
Deduplication at the PSP: store 'merchant_reference'; on a repeat, the PSP returns the same result.
Client-side retries: allow only with an 'Idempotency-Key'; otherwise there is a risk of a double charge.
Concurrency: lock the account/instrument/contract for the duration of execution; on a concurrent repeat, return 409/423.
Observability: metrics 'idempo_replay_total', 'idempo_conflict_total'.
9) Webhooks and external calls
HMAC signatures and a timestamp window: verify first, then process.
Sender retries: exponential backoff + jitter, 'max_attempts' and a DLQ.
Idempotent consumer: 'event_id' → table or in-memory cache; strict ordering is not guaranteed.
Codes: 2xx = success, 4xx = do not repeat, 5xx/timeout = repeat.
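A receiver following these rules might look like the sketch below. The signing scheme (HMAC-SHA256 over `<timestamp>.<body>`), the 300-second tolerance, and all function names are assumptions for illustration; real providers each define their own scheme.

```python
import hashlib
import hmac
import time

TOLERANCE_S = 300  # assumed maximum clock difference between sender and receiver


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>", hex-encoded (assumed scheme)."""
    msg = str(timestamp).encode() + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, now=None) -> bool:
    """Check the time window first, then the signature in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_S:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)


seen_event_ids = set()  # stand-in for a table or a cache with TTL


def handle_webhook(secret, timestamp, body, signature, event_id, now=None) -> int:
    """Return the HTTP status the receiver would send back."""
    if not verify(secret, timestamp, body, signature, now):
        return 401            # 4xx: the sender should not repeat
    if event_id in seen_event_ids:
        return 200            # duplicate: acknowledge without reprocessing
    seen_event_ids.add(event_id)
    # ... process the event here ...
    return 200
```

Returning 2xx for a duplicate matters: a non-2xx answer would make the sender keep retrying an event that has already been handled.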
10) Queues and background tasks
At-least-once by default → duplicates are inevitable.
Store 'task_id'/'event_id' and the execution status; on a duplicate, take the short replay path.
DLQ and poison messages: attempt counter, quarantine, manual review.
Concurrency limits (semaphores) and idempotent workers.
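The duplicate check, attempt counter, and DLQ can be sketched with an in-memory queue. Everything here is hypothetical scaffolding: the tuple layout, `MAX_ATTEMPTS`, and the use of a `set` and a `list` in place of a durable status store and a real dead-letter queue. Concurrency limits are left out for brevity.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit before a task is quarantined


def run_worker(queue: deque, handler, done: set, dlq: list) -> None:
    """Drain `queue` of (task_id, payload, attempts) tuples."""
    while queue:
        task_id, payload, attempts = queue.popleft()
        if task_id in done:
            continue  # duplicate delivery: short replay path, no side effects
        try:
            handler(payload)
            done.add(task_id)
        except Exception as exc:
            if attempts + 1 >= MAX_ATTEMPTS:
                # Poison message: quarantine it for manual review.
                dlq.append((task_id, payload, repr(exc)))
            else:
                queue.append((task_id, payload, attempts + 1))
```

In a real worker the re-enqueue step would also apply backoff with jitter rather than retrying immediately.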
11) Versioning and "natural" keys
Natural keys (account number + date + document number) make operations more robust to repeats.
When the schema or version changes, include the version in the 'Idempotency-Key' or in the request hash.
12) HTTP headers and hints to the client
'Idempotency-Key', 'Idempotency-Replay', 'Retry-After', 'Prefer: wait=<sec>' (for long operations), 'If-Match'/'ETag' (optimistic locking).
409 for a key conflict; 425/429/503 with a valid 'Retry-After'.
For long operations, accept asynchronously: '202 Accepted' + 'Location' pointing to a status resource.
13) Testing and chaos scenarios
Negative tests: double submission, a repeat with a different body, clock skew.
Out of order: 't2' arrives before 't1'.
Injection of timeouts, 'RST', 'EOF', and partial requests (slow POST).
Idempotency store down → fail-closed behavior (a failure is better than a double charge).
14) Metrics and alerts
`retries_total{reason}`, `retry_budget_used{route}`, `backoff_seconds_bucket`.
`idempo_replay_total`, `idempo_conflict_total`, `duplicate_detected_total`.
Share of 409/425/429/5xx by route; p95/p99 time to success including retries.
Alerts: retry budget burn rate, surge in idempotency conflicts, DLQ growth.
15) Antipatterns
Retrying every error indiscriminately.
No jitter → synchronized waves of retries.
Long-lived keys without TTL and cleanup.
Saving the idempotency result after the side effect has already been committed, rather than in the same transaction (outbox violation).
Logs without 'trace_id'/'idempotency_key' → impossible to trace.
Aggressive parallel retries on write operations.
16) Prod Readiness Checklist
- Unified policy: what is retried and what is not; status codes and client hints.
- Exponential backoff + full jitter; 'retry_budget' set.
- 'Idempotency-Key' contract + result storage with TTL.
- Outbox/Inbox for events; DLQ; concurrency limits.
- Integration with the circuit breaker; 'Retry-After' respected.
- Metrics/alerts for retries, duplicates, and conflicts.
- A set of chaos tests and network failure emulation.
- Client documentation: examples of backoff and status codes.
17) TL;DR
Retries are only useful together with idempotency. Introduce 'Idempotency-Key' and result storage, apply exponential backoff with jitter and a retry budget, respect 'Retry-After', and integrate with the circuit breaker. For events, use outbox/inbox; for payments, strict deduplication and locks. Measure retries and conflicts; test duplicates and timeouts.