GH GambleHub

Root Cause Analysis

1) What is RCA and why is it needed

Root Cause Analysis (RCA) is a structured process for identifying the root causes of an incident in order to prevent recurrence. It centers on facts, causal relationships, and systemic improvements (processes, architecture, tests), not on assigning blame.
Objectives: prevent recurrence, reduce MTTR and incident rate, improve SLO attainment, and build trust with regulators and partners.


2) Principles (Just Culture)

No blame. We address risky practices, not people.
Factuality. Only verifiable data and artifacts.
E2E view. From the customer through the backend to the providers.
Testable hypotheses. Every claim is backed by a test or experiment.
CAPA closure. Corrective and preventive actions with owners and deadlines.


3) Input artifacts and preparation

UTC timeline: T0 detection → T+ actions → T+ recovery.
Observability data: logs, metrics (including per cohort), traces, synthetics, status page.
Changes: releases, feature flags, configs, provider events.
Environment: versions, artifact hashes, SBOM, infrastructure tags.
Incident record: description of the impact (SLO/SLA, customers, turnover), decisions made, workarounds.
Chain of custody: who collected or modified evidence, and when (important for compliance).
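The UTC timeline and chain of custody above can be kept as structured records rather than free text; a minimal Python sketch (field names and events are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TimelineEvent:
    """One entry on the incident timeline; all timestamps are UTC."""
    at: datetime   # when it happened (UTC, timezone-aware)
    kind: str      # "detection", "action", "recovery", ...
    note: str      # what happened / who did it (chain of custody)

def build_timeline(events):
    """Return events ordered by time, rejecting naive (non-UTC) timestamps."""
    for e in events:
        if e.at.tzinfo is None:
            raise ValueError(f"naive timestamp in event: {e.note!r}")
    return sorted(events, key=lambda e: e.at)

t0 = datetime(2024, 5, 1, 19, 5, tzinfo=timezone.utc)
timeline = build_timeline([
    TimelineEvent(datetime(2024, 5, 1, 19, 26, tzinfo=timezone.utc),
                  "recovery", "rollback completed, SLI back to normal"),
    TimelineEvent(t0, "detection", "burn-rate alert fired (T0)"),
])
```

Rejecting naive timestamps at the point of entry is one simple way to enforce the "UTC only" convention before the report is assembled.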


4) RCA methods: when which

1. 5 Whys - quickly trace the causal chain for narrow problems. Risk: collapsing a complex system into a single line.
2. Fishbone (Ishikawa) - categorize factors as People/Process/Platform/Policy/Partner/Product. Useful at the start.
3. Fault Tree Analysis (FTA) - deduction from the top event down to sets of causes (AND/OR gates). Suited to infrastructure and cascading failures.
4. Causal Graph / Event Chain - a dependency graph with probabilities and contribution weights. Good for microservices and external providers.
5. FMEA (Failure Modes & Effects Analysis) - preventive: failure modes, severity (S), occurrence (O), detectability (D), RPN = S × O × D.
6. Change Analysis - "as it was / as it became" comparison (config diffs, schema, versions).
7. Human Factors Review - the context of people's decisions (alert fatigue, poor playbooks, overload).

Recommended combination: Fishbone → Change Analysis → Causal Graph/FTA → 5 Whys on the key branches.
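The FMEA scoring from method 5 can be sketched in a few lines; the failure modes and S/O/D values below are illustrative, not from a real analysis:

```python
# Minimal FMEA sketch: rank failure modes by RPN = S * O * D
# (severity, occurrence, detectability, each on a 1-10 scale).
failure_modes = [
    # (mode, S, O, D) -- illustrative values only
    ("card validator timeout to provider", 9, 4, 6),
    ("stale feature-flag config",          6, 5, 3),
    ("missing alert on BIN-range SLI",     7, 3, 8),
]

def rpn(s: int, o: int, d: int) -> int:
    """Risk Priority Number: higher means address it sooner."""
    return s * o * d

ranked = sorted(failure_modes, key=lambda m: rpn(*m[1:]), reverse=True)
for mode, s, o, d in ranked:
    print(f"RPN={rpn(s, o, d):3d}  {mode}")
```

Ranking by RPN gives a first-pass prioritization; in practice teams often also flag any mode with S at the top of the scale regardless of its RPN.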


5) Step-by-step RCA process

1. Initiate: appoint an RCA owner, set a deadline for the report (for example, 5 working days), and assemble the team (IC, TL, scribe, provider representatives).
2. Collect facts: timeline, graphs, releases, logs, artifacts; record versions and checksums.
3. Map the impact: which SLIs/SLOs were affected, which cohorts (countries, providers, VIPs).
4. Build hypotheses: primary and alternative; check which are verifiable now.
5. Test hypotheses: reproduction on staging, simulation, or canary; trace analysis; fault injection.
6. Determine root and contributing causes: technological, process, organizational.
7. Form CAPA: corrective (fix) and preventive (prevent); success metrics and timelines.
8. Review and publish the report: internal knowledge base plus, if needed, an external version for clients or the regulator.
9. Verify the effect: checkpoints after 14/30 days; close out the actions.
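Step 5 (testing hypotheses with fault injection) can be illustrated with a toy harness that wraps an upstream call and injects failures at a configured rate; the names and the failure model here are assumptions, not a real chaos tool:

```python
import random

def flaky_upstream(call, failure_rate: float, rng=random.Random(42)):
    """Wrap an upstream call so it raises TimeoutError with the given
    probability -- a toy fault-injection harness for replaying a
    'slow/failing provider' hypothesis on staging."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected upstream timeout")
        return call(*args, **kwargs)
    return wrapped

# Inject ~30% timeouts into a stand-in provider call and observe behavior.
provider = flaky_upstream(lambda amount: "ok", failure_rate=0.3)
results = []
for _ in range(100):
    try:
        results.append(provider(10))
    except TimeoutError:
        results.append("timeout")
```

The point of such an experiment is that the hypothesis ("timeouts to the provider cause the SLI drop") is confirmed or rejected by observed behavior, not by intuition.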


6) What counts as "root cause"

Not "human error," but the conditions that made the error possible and invisible:
  • weak tests/feature flags, missing limits/alerts, ambiguous documentation, incorrect defaults, fragile architecture.
  • Often it is a combination of factors (configuration × missing gate × load × provider).

7) CAPA: corrective and preventive measures

Corrective:
  • code/config fix, rollback of the change, adjusted limits/timeouts, added indexes, replicas/sharding, traffic redistribution, certificate renewal.
Preventive:
  • tests (contract tests, chaos cases), alerts (burn rate, synthetics quorum), release policy (canary/blue-green), GitOps for configs, training/checklists, provider redundancy, DR exercises.

Each action needs an owner, a deadline, the expected effect, and a verification metric (for example, change-failure-rate reduced by X%, no recurrence within 90 days).
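The per-action requirements (owner, deadline, verification metric) map naturally onto a small record type; a sketch with hypothetical actions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CapaAction:
    """One CAPA item: every field below is required by the process."""
    description: str
    kind: str          # "corrective" or "preventive"
    owner: str
    deadline: date
    metric: str        # how success is verified
    closed: bool = False

    def overdue(self, today: date) -> bool:
        return not self.closed and today > self.deadline

# Illustrative actions, not from a real incident.
actions = [
    CapaAction("introduce canary 1/5/25% for provider releases", "preventive",
               "release-eng", date(2024, 6, 1),
               "change-failure-rate down 20%"),
    CapaAction("roll back card validator", "corrective",
               "payments-tl", date(2024, 5, 2),
               "p95 latency back under SLO", closed=True),
]
open_overdue = [a.description for a in actions if a.overdue(date(2024, 6, 10))]
```

An `overdue` check like this is what feeds the D+7/D+30/D+90 checkpoints: any open action past its deadline is surfaced automatically instead of being rediscovered at review time.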


8) Verification of hypotheses and effects

Experiments: fault injection/chaos, shadow traffic, A/B configs, load tests with realistic traffic profiles.
Success metrics: SLO recovery, p95/p99 stabilization, no error-rate spikes, MTTR reduction, burn-rate trend and zero reopens over 30 days.
Checkpoints: D+7, D+30, D+90 - review of CAPA implementation and impact.
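The burn-rate metric used for verification can be computed as the observed error rate divided by the error budget (1 − SLO target); a minimal sketch:

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Error-budget burn rate: observed error rate divided by the
    budgeted error rate (1 - SLO target).
    1.0 means the window burns budget exactly on schedule."""
    if total == 0:
        return 0.0
    error_rate = errors / total
    budget = 1.0 - slo_target
    return error_rate / budget

# 99.9% SLO, 50 errors out of 10_000 requests: error rate 0.5%,
# budget 0.1% -> burn rate ~5x (budget consumed five times too fast).
print(burn_rate(50, 10_000, 0.999))
```

A sustained burn rate well above 1 after CAPA rollout is a direct signal that the fix did not hold, which is why burn-rate trend appears in the success metrics above.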


9) RCA Report Template (Internal)

1. Summary: what happened, when, and who was affected.
2. Impact: SLIs/SLOs, users, regions, turnover/penalties (if any).
3. Timeline (UTC): key events (alerts, decisions, releases, fixes).
4. Observations and data: graphs, logs, traces, configs (diffs), provider statuses.
5. Hypotheses and tests: accepted/rejected, with references to experiments.
6. Root causes: technological, process, organizational.
7. Contributing factors: why it was not noticed or stopped earlier.
8. CAPA plan: table of actions with owners/deadlines/metrics.
9. Risks and residual vulnerabilities: what still needs monitoring/testing.
10. Appendices: artifacts, links, graphs (list).


10) Example (short, generalized)

Event: payment success rate dropped by 35% during 19:05-19:26 UTC (SEV-1).
Impact: e2e SLO violated for 21 minutes, 3 countries affected, refunds/compensation.
Cause 1 (technical): a new version of the card validator raised latency to 1.2 s → timeouts to the provider.
Cause 2 (process): there was no canary for provider "A"; the release went to 100% immediately.
Cause 3 (organizational): the alert threshold on the business SLI did not cover a specific BIN range (VIP cohort).

CAPA: roll back to the previous validator version; introduce canary 1/5/25%; add business SLIs per BIN cohort; agree a 30% failover to provider "B"; add a "slow upstream" chaos case.
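The canary 1/5/25% rollout from the CAPA can be sketched as a gated promotion; the stages, tolerance, and function names are illustrative, not a real deployment system:

```python
# Hypothetical canary promotion gate: promote to the next traffic share
# only while the canary error rate stays within a tolerance of baseline;
# otherwise roll the release back to 0%.
CANARY_STAGES = [0.01, 0.05, 0.25, 1.0]

def next_stage(current: float, canary_err: float, baseline_err: float,
               tolerance: float = 0.005) -> float:
    """Return the next traffic share, or 0.0 on regression (rollback)."""
    if canary_err > baseline_err + tolerance:
        return 0.0  # abort: roll back the release
    i = CANARY_STAGES.index(current)
    return CANARY_STAGES[min(i + 1, len(CANARY_STAGES) - 1)]
```

With such a gate in place, the "release went to 100% immediately" process cause from the example becomes structurally impossible rather than merely discouraged.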


11) RCA process maturity metrics

On-time CAPA completion (% closed within 30 days).
Reopen rate (incidents reopened within 90 days).
Change-failure-rate before/after.
Share of incidents where systemic causes were identified (not just "human error").
Test coverage of new scenarios derived from RCAs.
Report turnaround time (publication SLA).
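The first two maturity metrics can be computed directly from incident records; a sketch with hypothetical data:

```python
from datetime import date, timedelta

def capa_on_time_rate(actions) -> float:
    """% of CAPA actions closed within 30 days of the report date.
    Each action is (report_date, closed_date or None)."""
    on_time = sum(1 for opened, closed in actions
                  if closed is not None
                  and closed - opened <= timedelta(days=30))
    return 100.0 * on_time / len(actions)

def reopen_rate(incidents) -> float:
    """% of incidents reopened within 90 days of closure.
    Each incident is (closed_date, reopened_date or None)."""
    reopened = sum(1 for closed, reopened in incidents
                   if reopened is not None
                   and reopened - closed <= timedelta(days=90))
    return 100.0 * reopened / len(incidents)

# Illustrative data: one action closed in 19 days, one still open.
actions = [(date(2024, 1, 1), date(2024, 1, 20)),
           (date(2024, 1, 1), None)]
incidents = [(date(2024, 1, 10), None),
             (date(2024, 1, 10), date(2024, 2, 1))]
```

Tracking these as trends (per quarter, per service) is usually more informative than any single snapshot value.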


12) Features of regulated domains (fintech/iGaming, etc.)

External reporting: client/regulator versions of the report without sensitive details, but with a plan to prevent recurrence.
Audit log and immutability: stored artifacts, signed reports, links to tickets, CMDB, release logs.
User data: depersonalization/masking in log samples.
Notification deadlines: tied to contracts and regulations (e.g., N hours for the initial notice).
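Depersonalization of log samples can be as simple as masking card numbers while keeping the BIN and the last four digits; the regex and masking policy below are illustrative assumptions, not a compliance standard:

```python
import re

# Match 13-19 digit runs optionally separated by spaces or hyphens
# (an approximation of PAN formats; tune for your own log shapes).
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def mask_pan(text: str) -> str:
    """Mask card numbers in a log sample, keeping BIN (first 6)
    and last 4 digits so cohort analysis by BIN range still works."""
    def repl(m):
        digits = re.sub(r"\D", "", m.group(0))
        return digits[:6] + "*" * (len(digits) - 10) + digits[-4:]
    return PAN_RE.sub(repl, text)

masked = mask_pan("declined: card 4111 1111 1111 1111, BIN range VIP")
```

Keeping the BIN visible is a deliberate trade-off here: it preserves the per-BIN-cohort analysis mentioned in the example RCA while removing the full PAN from anything shared externally.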


13) Anti-patterns

"Vasya is to blame" - stopping at human error without looking for systemic causes.
No hypothesis testing - conclusions drawn by intuition.
Overly general RCA ("the service was overloaded") - no specific changes follow.
No CAPA, or CAPA without owners/deadlines - a report for the report's sake.
Hiding information - loss of trust; the organization cannot learn.
Overloading the report with metrics unrelated to SLOs or business SLIs.


14) Tools and practices

RCA repository (wiki/knowledge base) with metadata: service, SEV, causes, CAPA, status.
Templates and bots: generating a report skeleton from the incident (timeline, graphs, releases).
Causality graph: building an event-causality map (for example, from logs/traces).
Chaos catalog: scripts for reproducing past incidents on staging.
Post-RCA dashboards: dedicated widgets that confirm the CAPA effect.


15) Checklist "ready for publication"

  • Timelines and artifacts are complete and verified.
  • Root causes identified and proven by tests/experiments.
  • Root and contributing causes are separated.
  • CAPA contains owners, deadlines, measurable effect metrics.
  • There is a verification plan at 14/30 days.
  • The version for external stakeholders is prepared (if necessary).
  • The report has passed technical and process review.

16) The bottom line

RCA is not a retrospective for formality's sake but a mechanism through which the system learns. When the facts are collected, causality is proven, and CAPAs are tied to metrics and tested by experiments, the organization grows more resilient with every incident: SLOs are more stable, the risk of recurrence is lower, and user and regulatory trust is higher.
