Cohort Analysis
Learning Focus
Use this lesson to build cohort retention tables — a foundational analytics technique for measuring user loyalty and product stickiness.
What is Cohort Analysis?
Cohort analysis groups users by their acquisition date (signup month) and tracks what percentage of each cohort is still active in subsequent months.
Cohort Month │ Month 0 │ Month 1 │ Month 2 │ Month 3
─────────────┼─────────┼─────────┼─────────┼────────
2024-01 │ 100% │ 65% │ 45% │ 32%
2024-02 │ 100% │ 70% │ 50% │ —
2024-03 │ 100% │ 68% │ — │ —
Sample Data Setup
CREATE TABLE user_events (
user_id INTEGER,
event_type VARCHAR,
event_date DATE
);
-- Simulate signups and activity
INSERT INTO user_events VALUES
(1, 'signup', '2024-01-05'), (1, 'active', '2024-02-10'),
(1, 'active', '2024-03-15'), (2, 'signup', '2024-01-10'),
(2, 'active', '2024-02-20'), (3, 'signup', '2024-02-01'),
(3, 'active', '2024-03-05'), (4, 'signup', '2024-01-20');
Step 1: Identify Cohort (Signup Month)
CREATE VIEW user_cohorts AS
SELECT
user_id,
MIN(event_date) AS signup_date,
date_trunc('month', MIN(event_date)) AS cohort_month
FROM user_events
WHERE event_type = 'signup'
GROUP BY user_id;
SELECT * FROM user_cohorts LIMIT 5;
Step 2: Calculate Activity Months
CREATE VIEW user_activity AS
SELECT DISTINCT
e.user_id,
date_trunc('month', e.event_date) AS activity_month
FROM user_events e
WHERE event_type = 'active';
Step 3: Build the Cohort Retention Matrix
WITH cohort_data AS (
SELECT
c.cohort_month,
-- How many months after signup is this activity?
DATEDIFF('month', c.cohort_month, a.activity_month) AS period,
COUNT(DISTINCT a.user_id) AS active_users
FROM user_cohorts c
INNER JOIN user_activity a ON c.user_id = a.user_id
GROUP BY c.cohort_month, period
),
cohort_sizes AS (
SELECT cohort_month, COUNT(*) AS cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
cd.cohort_month,
cs.cohort_size,
cd.period,
cd.active_users,
ROUND(cd.active_users * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.period;
Export Cohort Table to CSV
COPY (
-- same query as above
SELECT cohort_month, period, retention_pct
FROM cohort_data cd JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cohort_month, period
) TO 'cohort_retention.csv' (FORMAT CSV, HEADER true);
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| Including re-signup events in cohort | Duplicated users in cohort | Use MIN(event_date) for signup |
| Not deduplicating activity per month | Inflated retention % | Use DISTINCT user_id in activity CTE |
| DATEDIFF in days instead of months | Wrong period column | Use DATEDIFF('month', ...) explicitly |
Quick Reference
-- Cohort = first event per user
MIN(event_date) AS signup_date
-- Period = months since signup
DATEDIFF('month', cohort_month, activity_month) AS period
-- Retention %
active_users * 100.0 / cohort_size AS retention_pct
What's Next
- Next: Time-Series Analysis - Analyze trends, seasonality, and rolling metrics.
- Module Overview - Return to this module index.