Cohort analysis
Cohort analysis
Cohort analysis groups objects (usually users) by a single start event and compares how and for how long they remain active and valuable. This approach separates the effect of time in the system (seasons, stocks) from the effect of the age of the cohort (days from the start).
1) Basic definitions
Cohort: many players united by the event "birth" - registration, first deposit, first game, first purchase.
Calendar time axis: actual dates (2025-10-01,...).
Cohort age axis: days/weeks since birth (D0, D1,...).
Retention metrics: D1/D7/D30 (Exact and Rolling), WAU/MAU, Stickiness (DAU/MAU).
Monetization: ARPU/ARPPU, cumulative LTV (on D7/D30/D90).
Accounting unit: user (user/master_id) - record in the passport.
2) Types of cohorts and when to choose them
Acquisition cohorts: by date of registration/first visit - assessment of recruitment and onboarding channels.
Activation/Monetization-cohorts: by first deposit/purchase - early-monetization evaluation and promo.
Feature cohorts: for the first use of the feature/game category - the effect of releases.
Behavior cohorts: by RFM/start pattern (for example, "night mobile").
3) Axes and grids: how to watch the matrix
Cohort matrix: rows - cohorts (calendar), columns - age (D0... D90).
Seasonality: Compare diagonals (same calendar day) to separate seasonal effects.
Normalization: relative metrics (CR, fractions) + cumulative (LTV), show both.
4) Cohort passport and metrics (template)
5) Pseudo-SQL: retention matrix (Exact Dn)
sql
WITH regs AS (
SELECT user_id, DATE_TRUNC('day', MIN(ts)) AS cohort_day
FROM event_register
GROUP BY 1
),
act AS (
SELECT user_id, DATE_TRUNC('day', ts) AS act_day
FROM event_activity
),
ages AS (
SELECT r. user_id, r. cohort_day, a. act_day,
(a. act_day - r. cohort_day) AS age_days
FROM regs r
JOIN act a ON a. user_id = r. user_id
),
exact AS (
SELECT cohort_day,
age_days,
COUNT(DISTINCT user_id) AS users_active
FROM ages
GROUP BY 1,2
),
coh_size AS (
SELECT cohort_day, COUNT(DISTINCT user_id) AS cohort_size
FROM regs GROUP BY 1
)
SELECT e. cohort_day,
e. age_days,
e. users_active::decimal / NULLIF(c. cohort_size,0) AS exact_retention
FROM exact e
JOIN coh_size c USING (cohort_day)
WHERE age_days IN (1,7,30,90)
ORDER BY cohort_day, age_days;
Rolling Dn (activity on day 1... n)
sql
WITH days AS (... as above...),
roll AS (
SELECT cohort_day,
CASE WHEN age_days BETWEEN 1 AND 7 THEN 7
WHEN age_days BETWEEN 1 AND 30 THEN 30 END AS bucket,
COUNT(DISTINCT user_id) AS any_active
FROM days
WHERE age_days BETWEEN 1 AND 30
GROUP BY 1,2
)
SELECT r. cohort_day, r. bucket AS Dn,
r. any_active::decimal / s. cohort_size AS rolling_retention
FROM roll r
JOIN (SELECT cohort_day, COUNT(DISTINCT user_id) cohort_size FROM regs GROUP BY 1) s USING (cohort_day)
ORDER BY cohort_day, Dn;
6) Cohort LTV and monetization
Cumulative LTV (Dn): sum of income per cohort user by Dn.
ARPU/ARPPU: revenue per user/per Dn payer.
% paying: share with ≥1 payment to Dn.
sql
WITH reg AS (
SELECT user_id, DATE_TRUNC('day', MIN(ts)) AS cohort_day
FROM event_register GROUP BY 1
),
pay AS (
SELECT user_id, amount, DATE_TRUNC('day', ts) AS pay_day
FROM fact_payments
),
ltv AS (
SELECT r. cohort_day,
(pay_day - r. cohort_day) AS age_days,
SUM(amount) AS rev
FROM reg r JOIN pay p USING (user_id)
WHERE pay_day >= r. cohort_day
GROUP BY 1,2
),
cum AS (
SELECT cohort_day, age_days,
SUM(rev) OVER (PARTITION BY cohort_day ORDER BY age_days ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rev_cum
FROM ltv
)
SELECT c. cohort_day, c. age_days,
c. rev_cum::decimal / NULLIF(sz. cohort_size,0) AS ltv_per_user
FROM cum c
JOIN (SELECT cohort_day, COUNT(DISTINCT user_id) cohort_size FROM reg GROUP BY 1) sz USING (cohort_day)
WHERE age_days IN (7,30,90)
ORDER BY cohort_day, age_days;
7) Survival/hazard for retention
Kaplan-Meier: non-model survival curve (S (t)) - proportion not "drained."
Hazard models: influence of characteristics (channel, country, platform, bonuses, content) on outflow risk.
Practice: we build KM by segments, then explain the difference with the hazard model.
8) Seasonality, TZ and calendar
TZ: store events in UTC, analyze in the local TZ of the market; be consistent.
Calendar: holidays/salary/matches/releases - like flags; compare cohorts of similar weeks.
Sliding window: for weekly/monthly cohorts - multiplicity of holidays and reporting periods.
9) Segmentation and attribution
Segments: attraction channel, platform/OS, geo, first content, price/limits, payment method.
Cohort attribution: "who brought" the user - fix the algorithm (last non-direct, data-driven).
LTV weighting: compare not only CR, but also LTV (D30/D90) by channel/segment.
10) Visualization
Heat map of the cohort matrix (CR/LTV).
Trend lines D1/D7/D30 by calendar.
Survival/Hazard charts.
Bridge "what changed LTV to D30": payer contribution, frequency, average check.
11) Experiments and causality
A/B: onboarding, tutorials, paywall, offers. The main metric is D7/D30 retention and LTV (D30).
Quasi-experiments: DiD/synthetic control for roll-out to markets.
Uplift models: target the return gain in re-activation (Qini/AUUC, uplift @ k).
12) Operation and governance
Versioning: 'RET _ D7 _ vN', 'LTV _ D30 _ vN'; changelog when changing activity/currency definition.
SLO freshness: daily cohorts - readiness until 06:00 lock.; data log ≤ 1 hour.
Quality: coverage of events, proportion of duplicates, proportion of bots/fraud outside cohorts.
Access: RLS/CLS, PII masking; export - aggregates only.
Runbooks: D1 drop (onboarding), D7 (content), event/identity scrapping.
13) Frequent errors (anti-patterns)
Axis mixing: Compare different ages of cohorts in different seasons without adjustment.
Rolling vs Exact: treated as the same thing.
Mixing units: sessions in denominator, users in numerator.
Aggregation of "means": instead of summing numerators/denominators.
Ignoring TZ/calendar: D1 offset at day/holiday boundaries.
No bot/fraud/QA filter.
Unaccounted restarts: split/merge accounts without identity bridges.
14) Checklist before publication of cohort report
- Birth event, unit, TZ, activity windows defined
- Excluded bots/fraud/QA; identities mixed (golden record)
- Built CR (Exact/Rolling) and LTV matrices for D7/D30/D90
- Calendar/holidays taken into account; segments by channel/platform/geo
- Added survival/hazard graphics and bridge LTV
- Documented metric versions and attribution algorithm
- Fresh SLOs configured, coverage/duplicate/error monitoring
- Ready runbooks for D1/D7 drops and event breaks
Total
Cohort analyses are two axes and discipline: fixed "moment of birth," correct windows and TZ, retention and LTV matrices, segmentation, and causal verification of change. This approach helps not only to observe curves, but also to make decisions: where to fix onboarding, which channels to scale, what content and offers keep players longer and increase LTV.