GH GambleHub

Distribution of signals and metrics

(Section: Ecosystem and Network)

1) Purpose and area

Signal and metric distribution is a consistent way to collect, normalize and deliver telemetry (events, metrics, logs, traces, health statuses) to all interested participants: operators, content providers, payment/CCM services, bridges, network nodes, affiliates and SRE/BI/Compliance teams. Objectives:
  • Unified telemetry language and data contracts.
  • Managed QoS channels: priority of critical signals.
  • Transparent SLI/SLO and predictable alerting.
  • Privacy, isolation, and metrics-budget savings.

2) Signal taxonomy

1. Business events: onboarding, deposits/payments, gaming events, attribution.
2. Tech metrics: latency/throughput/error rate, queue depth, CPU/RAM/IO usage.
3. Logs: structured entries about operations and errors.
4. Traces: request/topic spans, hop-to-hop correlation.
5. Health statuses: synthetic probes, readiness/liveness, node heartbeats.
6. Risk/compliance signals: KYC/KYB/AML hits, sanctions events.

Each class has its own criticality level and storage/delivery policy.

3) Distribution architecture (reference)

Edge collectors (SDK/agents) → Ingress (HTTP/OTLP/gRPC/QUIC) → Bus (Kafka/Pulsar) → Processors (stream jobs) → Storage (TSDB for metrics; object/columnar stores for logs/events; a trace store) → Materialized views/dashboards/alerts.
Multi-tenancy: namespace/tenant-id in keys, individual quota/limits/ACL.
QoS segmentation: critical (P0), important (P1), background (P2).
Egress: subscribers (Ops/BI/third-party) via topic subscriptions and materialized views.

4) Contracts and schemes (events/metrics/trails)

4.1 Events (simplified, YAML)

yaml
event:
  id: uuid
  kind: business | ops | risk
  ts: timestamp          # ISO8601
  tenant: string         # org_id/namespace
  source: string         # service/peer-id
  trace_id: string
  type: string           # deposit.created | payout.failed | probe.ok ...
  attrs: object          # semantic fields (no PII)
  severity: info | warn | error | critical
  qos: P0 | P1 | P2
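The contract above can be enforced at ingest time. A minimal validation sketch in Python, assuming the field names and enum values of the simplified YAML contract; the function name is illustrative:

```python
# Minimal event-contract validator (illustrative sketch; fields and
# enum values follow the simplified YAML contract in this section).
ALLOWED_KIND = {"business", "ops", "risk"}
ALLOWED_SEVERITY = {"info", "warn", "error", "critical"}
ALLOWED_QOS = {"P0", "P1", "P2"}
REQUIRED = {"id", "kind", "ts", "tenant", "source", "type", "severity", "qos"}

def validate_event(event: dict) -> list[str]:
    """Return a list of contract violations (empty list == valid)."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - event.keys())]
    if event.get("kind") not in ALLOWED_KIND:
        errors.append("kind must be business|ops|risk")
    if event.get("severity") not in ALLOWED_SEVERITY:
        errors.append("bad severity")
    if event.get("qos") not in ALLOWED_QOS:
        errors.append("bad qos")
    return errors
```

A schema-registry service would typically run such checks before admitting events to the bus, rejecting or quarantining invalid payloads.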

4.2 Metrics (OpenMetrics/OTLP)

Gauge/Counter/Histogram with stable labels (limited cardinality).
Identifiers: 'metric_name{service, region, tenant, version, route}'.
Histograms for latencies/sizes instead of computing p99 in application code.
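The "histograms instead of p99 in code" rule can be sketched as a fixed-bucket cumulative histogram from which quantiles are read at query time; the bucket bounds below are illustrative assumptions:

```python
import bisect

# Fixed-bucket latency histogram — quantiles are approximated from the
# bucket counts rather than computed per-request in application code.
BOUNDS = [5, 10, 25, 50, 100, 250, 500, 1000]  # ms; +Inf bucket implicit

class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * (len(BOUNDS) + 1)
        self.total = 0

    def observe(self, latency_ms: float) -> None:
        # bucket i holds observations <= BOUNDS[i]; the last bucket is +Inf
        self.counts[bisect.bisect_left(BOUNDS, latency_ms)] += 1
        self.total += 1

    def quantile(self, q: float) -> float:
        """Upper bucket bound that contains quantile q (an approximation)."""
        rank, seen = q * self.total, 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return BOUNDS[i] if i < len(BOUNDS) else float("inf")
        return float("inf")
```

Because buckets are mergeable across instances, p99 can be aggregated fleet-wide, which per-process quantile values cannot.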

4.3 Traces

Required fields are 'trace_id', 'span_id', 'parent_id', 'service', 'peer', 'route', 'qos'.
Links between domains (consumer/producer) and network hops (relay/bridge).

5) QoS and prioritization

P0 (critical): payments/payouts SLIs, bridge/node statuses, SLO burn rate → strict delivery (acks, retries, idempotency), minimal timeouts.
P1 (important): product events/key metrics → guaranteed delivery within SLO.
P2 (background): detailed logs, debugging → best-effort, may be dropped under overload.

Policies: separate queues, producer quotas, backpressure, rate limits, deduplication by 'idempotency_key'.
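Deduplication by 'idempotency_key' can be sketched as a consumer-side seen-set with a TTL window; the class name and default TTL are illustrative assumptions:

```python
# Consumer-side deduplication by idempotency_key within a TTL window.
# Timestamps are passed in explicitly to keep the sketch deterministic.
class Deduplicator:
    def __init__(self, ttl_s: float = 3600.0):
        self.ttl_s = ttl_s
        self.seen: dict[str, float] = {}  # key -> first-seen timestamp (s)

    def accept(self, idempotency_key: str, now: float) -> bool:
        """Return True only the first time a key is seen within the TTL."""
        # evict entries older than the TTL window
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl_s}
        if idempotency_key in self.seen:
            return False              # duplicate delivery — drop
        self.seen[idempotency_key] = now
        return True                   # first delivery — process
```

In production the seen-set usually lives in a shared store (e.g. a keyed state backend) so retried deliveries across consumer restarts are still suppressed.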

6) Cardinality and metrics budget

Rule of 6 labels: no more than 6 label keys per metric, fixed value dictionaries.
Cardinality ≤ 10k time series per metric per tenant.
Sampling: head-/tail-based for traces; metric downsampling 10s→1m→5m→1h.
Quotas: points/sec and bytes/sec limits per tenant and per QoS class.
Schema linter: rejects metrics with "exploding" labels (id, email, ip, etc.).
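The schema linter and the rule of 6 labels together can be sketched as a small check run at registration time; the allowed/forbidden sets mirror the metrics policy config later in this document:

```python
# Metrics-schema linter: enforces the "rule of 6 labels" and rejects
# known high-cardinality ("exploding") label keys.
ALLOWED_LABELS = {"service", "route", "code", "region", "tenant", "version"}
FORBIDDEN_LABELS = {"user_id", "email", "ip", "session_id"}
MAX_LABELS = 6

def lint_metric(name: str, labels: dict[str, str]) -> list[str]:
    """Return a list of lint issues (empty list == metric accepted)."""
    issues = []
    if len(labels) > MAX_LABELS:
        issues.append(f"{name}: more than {MAX_LABELS} label keys")
    for key in labels:
        if key in FORBIDDEN_LABELS:
            issues.append(f"{name}: forbidden high-cardinality label '{key}'")
        elif key not in ALLOWED_LABELS:
            issues.append(f"{name}: label '{key}' not in the allowed set")
    return issues
```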

7) Collect and deliver: push vs pull

Push (OTLP/StatsD/HTTP): flexibility, mobile/edge clients, P0 channels.
Pull (Prometheus): internal infrastructure, predictable targets.
Hybrid: exporters→gateway→TSDB; federated scrapes for regions.
Transport: QUIC/HTTP/2, compression, batching, TLS/mTLS, retries with jitter.
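"Retries with jitter" commonly means exponential backoff with full jitter, which spreads retry storms across time. A sketch; the base, cap, and attempt count are illustrative assumptions:

```python
import random

# Exponential backoff with full jitter: each retry waits a uniformly
# random delay in [0, min(cap, base * 2**attempt)).
def backoff_delays(attempts: int, base_ms: float = 100.0,
                   cap_ms: float = 5000.0, rng=random.random):
    """Yield one randomized delay (ms) per retry attempt."""
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        yield rng() * ceiling  # rng() is uniform in [0, 1)
```

Injecting `rng` keeps the schedule testable; with the real `random.random` the delays are randomized, which is the point of jitter.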

8) SLI/SLO and alerting

8.1 Basic SLIs

Availability % of endpoints/gateways.
Latency p50/p95/p99 on critical routes.
Error rate (5xx/timeout/abort).
Delivery lag on the bus; queue depth.
Freshness of materialized views (ingest→serve delay).

8.2 SLO examples

P0 pipelines: Availability ≥ 99.95%, p99 latency ≤ 400 ms, delivery lag p95 ≤ 2 s.
P1: Availability ≥ 99.9%, freshness p95 ≤ 3 min.
P2: Freshness p95 ≤ 15 min, no-page.

8.3 Burn-rate alerts (example)

2-hour window: 'error_budget_burn ≥ 2×' → page.
6-hour window: 'error_budget_burn ≥ 1×' → page/escalation.
Combine with P0 'queue_lag' and 'drop_rate'.
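The burn-rate figure is the observed error rate divided by the error budget implied by the SLO (e.g. a 99.95% availability SLO leaves a 0.05% budget). A minimal sketch; function names and the paging threshold of 2× follow the example above:

```python
# Burn rate: how many times faster than budget the error budget is spent.
def burn_rate(errors: int, total: int, slo: float) -> float:
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo          # e.g. 1 - 0.9995 = 0.0005
    return (errors / total) / error_budget

def should_page(errors: int, total: int, slo: float,
                threshold: float = 2.0) -> bool:
    """Page on-call when the windowed burn rate meets the threshold."""
    return burn_rate(errors, total, slo) >= threshold
```

For example, 10 errors in 10,000 requests against a 99.95% SLO gives a burn rate of 2.0: at that pace the monthly error budget is consumed in half a month.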

9) Storage and retention

TSDB metrics: high-frequency, 7-14 days; aggregates, 6-12 months.
Events/logs: hot storage 7-30 days; cold (object) 6-24 months.
Traces: 1-10% sampling; retained "slow/erroneous" spans (tail-based).
Deletion/redaction policies for PII and data-subject requests.

10) Privacy, security and isolation

PII minimization: tokenization/pseudonymization of fields, prohibition of "raw" identifiers in metrics.
mTLS/event signatures, producer key pinning.
ACL/ABAC on topics/services/tenants, separate keys for write/read.
Tenant sandboxing: logical/physical separation, limits and rate-limit per tenant.
Audit trail: immutable logs of access and config changes.
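Per-tenant limits and rate limits are often implemented as token buckets, one per tenant. A sketch; the class name, refill rate, and burst size are illustrative assumptions:

```python
# Token-bucket rate limiter, one bucket per tenant. Tokens refill at
# rate_per_s up to a burst cap; each accepted request spends one token.
class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last = burst, 0.0   # start full

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# one bucket per tenant id (default quota values are assumptions)
buckets: dict[str, TokenBucket] = {}

def tenant_allow(tenant: str, now: float) -> bool:
    bucket = buckets.setdefault(tenant, TokenBucket(rate_per_s=100.0, burst=10.0))
    return bucket.allow(now)
```

Passing `now` explicitly keeps the limiter deterministic and testable; a real deployment would feed it a monotonic clock.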

11) Processing streams (stream jobs)

Enrich: normalization, geo/version/traffic class.
Aggregate: windows 10s/1m/5m, histograms, quantile sketches.
Detect: anomalies (EWMA/ESD), drift of distributions, bursts of queues.
Route: fan-out to materialized views/alerts/partner webhooks.
Guard: "red button" - throttling/kill-switch by source/topic.
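The "Detect" job's anomaly detection via EWMA can be sketched as an exponentially weighted mean/variance tracker that flags samples deviating by more than a few sigmas; alpha and the threshold are illustrative assumptions:

```python
# EWMA-based burst detector: maintains exponentially weighted mean and
# variance and flags samples deviating more than `threshold` sigmas.
class EwmaDetector:
    def __init__(self, alpha: float = 0.3, threshold: float = 3.0):
        self.alpha, self.threshold = alpha, threshold
        self.mean = None
        self.var = 0.0

    def update(self, x: float) -> bool:
        """Feed one sample; return True if it looks anomalous."""
        if self.mean is None:          # first sample seeds the mean
            self.mean = x
            return False
        deviation = x - self.mean
        std = self.var ** 0.5
        anomaly = std > 0 and abs(deviation) > self.threshold * std
        # update EW mean and variance after the check
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomaly
```

In a stream job one detector instance would run per key (e.g. per topic or tenant), feeding flagged samples to the alerting fan-out.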

12) Dashboards (reference layouts)

Ops Core (hourly/real-time): p95 latency, error rate, delivery lag, queue depth, ingest success rate.
Pipelines Health: freshness per pipeline, drop rate, backpressure, SLO burn rate.
Tenant Usage: rows/sec, bytes/sec, cardinality, top labels.
Security/Compliance: mTLS statuses, key expirations, access grants, PII redactions.
Business Lens: conversion/payout/bridge SLIs next to tech metrics.

13) Configuration examples

QoS Classes and Limits (YAML)

yaml
telemetry:
  qos:
    P0:
      topics: [payout.sli, bridge.finality, gateway.availability]
      delivery: guaranteed
      retry:
        attempts: 3
        backoff_ms: [100, 400, 800]
      max_queue_lag_ms: 2000
    P1:
      topics: [product.events, api.metrics]
      delivery: at-least-once
      sampling: 1.0
    P2:
      topics: [debug.logs, verbose.traces]
      delivery: best-effort
      sampling: 0.1
  quotas:
    tenant_default:
      metrics_points_per_sec: 50_000
      logs_mb_per_hour: 500
      traces_spans_sampled_pct: 5

Metric Labels (Policy)

yaml
metrics_policy:
  allowed_labels: [service, route, code, region, tenant, version]
  forbidden_labels: [user_id, email, ip, session_id]
  max_label_value_count: 1000

Burn-rate alerts

yaml
alerts:
  - name: "p0_error_burn_2h"
    expr: burn_rate_p0_2h > 2
    action: [page_oncall, open_incident]
  - name: "queue_lag_p0"
    expr: queue_lag_ms_p95 > 2000
    action: [page_oncall]

14) Data schemas and queries

Metric catalog (registry)

sql
CREATE TABLE metric_catalog (
  name        TEXT PRIMARY KEY,
  unit        TEXT,
  description TEXT,
  labels      JSONB,
  owner       TEXT,
  qos         TEXT,
  sla         JSONB
);

Queues and lag

sql
SELECT topic,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_ms) AS lag_p95,
SUM(dropped) AS drops
FROM queue_metrics
WHERE ts >= now() - INTERVAL '24 hours'
GROUP BY topic;

Tenant cardinality

sql
SELECT tenant, metric_name, COUNT(DISTINCT series_id) AS series
FROM tsdb_series
WHERE day = current_date
GROUP BY tenant, metric_name
ORDER BY series DESC
LIMIT 50;

15) Processes and roles

Telemetry Owner - schemas/policies/quotas, cardinality control.
SRE/Ops - SLO, alerts, incidents, scaling.
Security/Compliance - keys, accesses, PII, audits.
Product/BI - KPI showcases, analytics, A/B metrics.
Tenants (partners) - correct SDK integration, contract compliance.

16) Incident playbooks

A. Cardinality explosion

1) Auto-block the producer/metric, 2) cut off the "bad" labels, 3) retro-aggregate, 4) post-mortem and linter rules.

B. P0 queue lag growth

1) Enable prioritization, 2) scale out partitions/consumers, 3) temporarily reduce P2 sampling, 4) analyze the bottleneck.

C. Freshness drop in materialized views

1) Switch to the backup connector, 2) enable degradation mode ("last finalized"), 3) notify the source owners.

D. PII leakage in metrics

1) Immediately block the flow, 2) redact on the hot layer, 3) notify DPO/Compliance, 4) update the linter/SDK.

E. Massive 5xx/trace errors

1) Page, 2) increase tail-based sampling for errors, 3) diagnose traces on critical routes, 4) roll back the release/flip the feature flag.

17) Implementation checklist

1. Approve event/metric/trace contracts and a list of acceptable labels.
2. Create QoS classes, topics/queues, quotas and metrics budget.
3. Set up ingest (push/pull), TLS/mTLS, retries and idempotency.
4. Include metrics/event directories and schema linters.
5. Define SLI/SLO, burn-rate alerts and escalations.
6. Build dashboards Ops/Pipelines/Tenant/Security.
7. Run telemetry chaos tests (loss/jitter/duplication).
8. Regularly review cardinality, retention and storage costs.

18) Glossary

QoS - delivery quality/priority class.
Freshness - delay before data appears in the materialized views.
Burn-rate - rate of error-budget consumption relative to the SLO.
Cardinality - the number of unique metric series (label combinations).
Tail-based sampling - selecting "slow/erroneous" traces after completion.
Idempotency key - key for deduplicating repeated events.

Bottom line: the distribution of signals and metrics is not just "collect and show graphs," but a discipline of contracts, QoS channels and budgets. By following this framework, the ecosystem gains observability that is predictable, resistant to surges, privacy-preserving, and useful for decisions in both the operational and business contours.
