GambleHub

DLQ and poison message handling

A Dead Letter Queue (DLQ) is an isolated queue/topic for messages that a regular consumer could not process after a given number of attempts, or that failed for obvious technical/business reasons (invalid schema, timeout, version conflict, etc.). A poison message is a record whose reprocessing consistently fails and threatens the stability of the pipeline.

The purpose of a DLQ is to preserve SLOs, localize failures, prevent blocking of the main stream, and guarantee the ability to analyze and safely replay (redrive) messages.

1) Where poison messages come from

Schemas/contracts: incompatible changes, missing required fields, incorrect types.
Business validations: duplicates, violated invariants, expired events.
Order and causality: an "Update" arrives before its "Create," missing correlations, out-of-order delivery.
Idempotency: reprocessing generates side effects.
External dependencies: rate limits/timeouts, API unavailability.
Data: payload corruption, incorrect encoding, oversized payloads.

2) DLQ submission criteria

A message enters the DLQ if one or more of the following conditions are met:
  • Processing exceeded maxAttempts at the consumer/worker.
  • The error is classified as non-retryable: invalid schema, missing critical resource, business prohibition.
  • The message deadline (TTL/expiration) has expired.
  • A circuit breaker or admission control policy was triggered for this key/tenant.
  • Explicit operator decision (manual ejection from the main stream).

3) DLQ topologies and patterns

Per-queue DLQ: each queue/topic has its own DLQ. Simple and transparent.
Central DLQ (parking lot): a shared "parking lot" for complex cases; convenient for unified analysis tools.
DLT (Dead Letter Topic): for log-oriented buses (event log), a separate topic carrying failure-reason metadata.
Quarantine: a restricted-access buffer with PII sanitization for manual analysis.
Shadow stream: duplicate problematic messages into a "shadow" stream for safe experiments with fixes.

4) Metadata that must accompany DLQ messages

Minimum set:
  • Failure reason: error code/class, stack/trace id.
  • Retry context: 'attempt', 'maxAttempts', 'first_seen_ts', 'last_attempt_ts'.
  • Correlation: 'trace_id', 'span_id', 'tenant_id', 'entity_id', partition key.
  • Original offset/partition/sequence (for log buses) or message-id.
  • Contract/schema/payload version.
  • Idempotency-key/Request-id (if any).
  • Routing source: queue/topic name, consumer group.

5) Retry policies before DLQ

Use correct retries before sending to the DLQ:
  • Short consumer retries: 'maxAttempts' 2-5, exponential backoff + jitter, caps on concurrency.
  • Cooperative backpressure: reduce concurrency as errors grow.
  • Error classification: retryable ('5xx', timeout) vs non-retryable (validation, schema mismatch).
  • Delay queues: 5s → 30s → 2m for temporary failures.
  • Per-key isolation: if a specific key is "noisy," do not block the entire partition.
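Exponential backoff with full jitter, as listed above, fits in a few lines. The defaults (200 ms initial, 5 s cap) mirror the configuration template later in this document; the function name is illustrative.

```python
import random

def backoff_ms(attempt: int, initial_ms: int = 200, max_ms: int = 5000) -> int:
    """Full-jitter backoff: uniform delay in [0, min(max_ms, initial_ms * 2^attempt)]."""
    cap = min(max_ms, initial_ms * (2 ** attempt))
    return random.randint(0, cap)
```

Full jitter spreads retries evenly across the window, which avoids synchronized retry storms when many consumers fail at once.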

6) Safe Redrive (Redelivery from DLQ)

Redrive is the controlled return of messages from DLQ to processing.

Principles:

1. Fix check: redrive only after fixing the code/configuration/schema or after external dependencies have recovered.

2. Idempotency: handlers must tolerate repetition (upserts, effect-tolerant operations).

3. Deduplication by 'idempotency_key' / 'message_id' / 'business_key'.

4. Batching and windows: batches of N messages, rate limits on redrive, time/partition windows.

5. Local validation: quick schema verification before redriving (reject clearly invalid cases early).

6. Priority: redrive must not displace production traffic (low-priority workers/a separate pool).

7. Observability: dedicated metrics and traces for redrive; an outcome report (success/repeat DLQ/loss).
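The batching, rate-limit, and early-validation principles combine into a simple redrive loop. This is a sketch under stated assumptions: `fetch_batch`, `validate`, `republish`, and `to_dlq` are hypothetical hooks you would wire to your actual bus, and the rate limiting here is deliberately crude.

```python
import time

def redrive(fetch_batch, validate, republish, to_dlq,
            batch_size: int = 500, rate_limit_per_sec: float = 50):
    """Pull DLQ messages in batches, validate early, republish at a capped rate."""
    while True:
        batch = fetch_batch(batch_size)
        if not batch:                 # DLQ drained
            break
        for msg in batch:
            if not validate(msg):     # reject early: still fails local validation
                to_dlq(msg)           # goes back (or to quarantine) for analysis
                continue
            republish(msg)
            time.sleep(1.0 / rate_limit_per_sec)  # crude per-message rate limit
```

In production this loop would also run in a low-priority worker pool and emit per-batch outcome metrics, per principles 6 and 7.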

7) Delivery semantics and order

At-least-once: the most common mode; idempotency and deduplication are required.
At-most-once: a DLQ can be disabled; risk of loss. Use only when losses are acceptable.
Exactly-once (effectively): achieved via transactions and deduplication in the business store; expensive and niche.
Ordering: a DLQ usually breaks ordering for a given key/partition. If order is critical, redrive per key and sequentially.

8) Schemes, contracts and validation

Schema registry/contracts: clear versioning, evolution with backward/forward compatibility.
Validation at the entrance: a cheap check via JSON Schema/Protobuf/Avro before heavy steps.
Incompatibility policy: on a breaking change, send straight to the DLQ with the code 'SCHEMA_INCOMPATIBLE'.
PII redaction: store only what you need in the DLQ; mask sensitive fields.

9) Idempotency and deduplication

Idempotency key: form it on the producer from "business meaning" (tenant + entity + operation + ts-bucket).
Dedup logs: keep the last 'N' keys with a TTL; remember the "effect" of the operation.
Upsert/merge: avoid unconstrained "insert-only" writes.
Side effects: for external calls, log and check for a "repeat" before calling.

10) Observability and SLO

Metrics (per queue/tenant/key):
  • DLQ rate (msg/s), share of messages, mean/median "age" in the DLQ.
  • Redrive success rate (%), share of repeat DLQ entries.
  • Classification of causes: schema, validation, timeout, dependency, unknown.
  • p95/p99 processing latency in the main stream vs in redrive.
  • DLQ size, risk of overflow.
Logs/tracing:
  • Required tags: 'message_id', 'entity_id', 'tenant_id', 'attempt', 'reason', 'redrive_batch_id'.
  • Trace the "DLQ branch": from source to eventual success.
SLO:
  • The percentage of messages successfully processed ≥ X% in T minutes.
  • Investigation and correction time for DLQ case ≤ Y hours.
  • The maximum "age" of the message in DLQ ≤ Z hours (with alert).
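The "maximum age ≤ Z hours" SLO reduces to one comparison against the oldest message in the DLQ. A hedged sketch: `oldest_ts` would come from your DLQ's head message, and the default threshold is an example value, not a recommendation.

```python
def dlq_age_alert(oldest_ts: float, now: float, max_age_hours: float = 6.0) -> bool:
    """True when the oldest DLQ message exceeds the allowed age Z (in hours)."""
    return (now - oldest_ts) > max_age_hours * 3600
```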

11) Safety and compliance

Least-privilege access: redrive is for operators/playbooks only.
Audit: who triggered a redrive/deletion/metadata edit, and when.
Sanitization: when transferring to a central DLQ, strip unnecessary PII/secrets.
Retention: separate retention and deletion policies for the DLQ.

12) Multi-tenancy

Tags 'tenant_id'/'plan': differentiate limits, redrive priorities, and reports.
Per-tenant DLQs or partitions: so a "noisy" client does not clog the shared DLQ.
Billing/quotas: account for DLQ volume and redrive cost in usage.

13) Configuration templates (example)

```yaml
consumer:
  max_attempts: 4
  backoff:
    strategy: exponential_full_jitter
    initial_ms: 200
    max_ms: 5000
  classify_errors:
    retryable: [TIMEOUT, DEP_UNAVAILABLE, 5xx]
    nonretryable: [SCHEMA_INCOMPATIBLE, VALIDATION_FAILED, DUPLICATE]
  concurrency_caps:
    per_partition: 8
    per_tenant: 50

dlq:
  type: topic
  name: myapp.events.dlq
  metadata:
    include: [reason, stack, attempt, first_seen_ts, last_attempt_ts, schema_version,
              tenant_id, entity_id, trace_id, source_topic, partition, offset]
  retention_hours: 168
  pii_redaction: true

redrive:
  mode: batch
  batch_size: 500
  rate_limit_per_sec: 50
  priority: low
  validate_schema_before_redrive: true
  idempotency:
    dedup_ttl_hours: 24
  ordering:
    by_key: true
```

14) Operational playbooks (runbooks)

1. Abnormal DLQ growth: throttle the production consumer, analyze the top causes, check recent releases/schema changes.
2. Schema mismatch: roll back or fix the schema, migrate via an adapter, redrive after validation.
3. External dependency unavailable: pause retries, enable a delay queue, redrive after recovery.
4. Repeat DLQ entries after redrive: enable a "shadow" handler/simulator, verify idempotency, narrow the batch.
5. DLQ overflow: evacuate to archive storage, enable selective redrive by keys/causes.

15) Testing and chaos

Error injection: schema breaks, validation failures, timeouts, 1-in-N sticky errors.
Mass redrive: verify batching/rate limits and the impact on production.
Out-of-order sequences: ensure correct per-key handling.
Payload corruption: validation and safe failure.
Recovery after a redrive worker crash: idempotency of batch operations.

16) Typical errors

Missing metadata in the DLQ → impossible to cluster causes and redrive safely.
Mass redrive without limits → renewed degradation of production.
No idempotency/deduplication → duplicates and side effects.
PII mixed into a central DLQ without sanitization.
No schemas/contracts → "surprises" as messages evolve.
A single shared DLQ without tenant/key partitioning.
Infinite retries instead of early DLQ for non-retryable errors.

17) Quick recipes

Normal business flow: 3-4 attempts, exponential backoff with jitter, early error classification, DLQ with full metadata.
Critical events (payments): strict idempotency, short timeouts, minimal attempts, fast DLQ and manual triage.
Mass redrive after a fix: small batches (100-500), a rate limit, a separate worker pool, monitor success > 95% before increasing speed.
Multi-tenant: per-tenant redrive limits, reports on the top DLQ-generating tenants.

Conclusion

A DLQ is not a "trash can" but a controlled reliability loop. Clear submission rules, rich metadata, idempotency and deduplication, safe metered redrive, schema discipline, and observability turn poison messages from a threat to SLOs into a manageable engineering process, with understandable playbooks, predictable costs, and minimal impact on users.
