Approximate Aggregation
Learning Focus
Use this lesson to understand approximate aggregation — a DuckDB superpower for analyzing massive datasets at speed.
Why Approximate Aggregation?
When datasets are billions of rows, exact aggregations can:
- Require GBs of memory
- Take minutes to compute
Approximate algorithms give ~99% accurate results in a fraction of the time and memory.
approx_count_distinct (HyperLogLog)
-- Exact (slow on massive tables)
SELECT COUNT(DISTINCT user_id) FROM events;
-- Approximate (fast, ~1% error rate)
SELECT approx_count_distinct(user_id) FROM events;
-- Per group
SELECT
event_type,
approx_count_distinct(user_id) AS approx_unique_users
FROM events
GROUP BY event_type;
approx_quantile
-- Approximate median (much faster on large tables)
SELECT approx_quantile(salary, 0.5) AS approx_median FROM employees;
-- Approximate percentiles
SELECT
approx_quantile(response_time_ms, 0.50) AS p50,
approx_quantile(response_time_ms, 0.95) AS p95,
approx_quantile(response_time_ms, 0.99) AS p99
FROM request_logs;
Reservoir Sampling
-- Random sample of N rows (no full sort required)
SELECT * FROM large_table USING SAMPLE 1000;
-- Percentage-based sample
SELECT * FROM large_table USING SAMPLE 5 PERCENT;
-- Fixed seed for reproducibility
SELECT * FROM large_table USING SAMPLE 1000 (reservoir, 42);
When to Use Approximate vs. Exact
| Scenario | Use |
|---|---|
| BI dashboards with billions of events | approx_count_distinct |
| Financial auditing requiring exact count | COUNT(DISTINCT ...) |
| SLA latency percentiles (P95/P99) | approx_quantile |
| Compliance reports | quantile (exact) |
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| Using approx for financial reporting | Incorrect totals | Use exact aggregations for monetary data |
| Not documenting approximate columns | Consumers assume exact values | Always label approximate columns |
| Using SAMPLE without a seed | Non-reproducible results | Add (reservoir, seed) for reproducibility |
Quick Reference
SELECT approx_count_distinct(col) FROM t;
SELECT approx_quantile(col, 0.95) FROM t;
SELECT * FROM t USING SAMPLE 1000;
SELECT * FROM t USING SAMPLE 10 PERCENT;
What's Next
- Next: Joins and Set Operations - Combine data from multiple tables.
- Module Overview - Return to this module index.