Approximate Aggregation

Learning Focus

Use this lesson to understand approximate aggregation — a DuckDB superpower for analyzing massive datasets at speed.

Why Approximate Aggregation?

When datasets are billions of rows, exact aggregations can:

Require GBs of memory
Take minutes to compute

Approximate algorithms give ~99% accurate results in a fraction of the time and memory.

approx_count_distinct (HyperLogLog)

-- Exact (slow on massive tables)
SELECT COUNT(DISTINCT user_id) FROM events;

-- Approximate (fast, ~1% error rate)
SELECT approx_count_distinct(user_id) FROM events;

-- Per group
SELECT
    event_type,
    approx_count_distinct(user_id) AS approx_unique_users
FROM events
GROUP BY event_type;

approx_quantile

-- Approximate median (much faster on large tables)
SELECT approx_quantile(salary, 0.5) AS approx_median FROM employees;

-- Approximate percentiles
SELECT
    approx_quantile(response_time_ms, 0.50) AS p50,
    approx_quantile(response_time_ms, 0.95) AS p95,
    approx_quantile(response_time_ms, 0.99) AS p99
FROM request_logs;

Reservoir Sampling

-- Random sample of N rows (no full sort required)
SELECT * FROM large_table USING SAMPLE 1000;

-- Percentage-based sample
SELECT * FROM large_table USING SAMPLE 5 PERCENT;

-- Fixed seed for reproducibility
SELECT * FROM large_table USING SAMPLE 1000 (reservoir, 42);

When to Use Approximate vs. Exact

Scenario	Use
BI dashboards with billions of events	`approx_count_distinct`
Financial auditing requiring exact count	`COUNT(DISTINCT ...)`
SLA latency percentiles (P95/P99)	`approx_quantile`
Compliance reports	`quantile` (exact)

Common Pitfalls

Pitfall	Consequence	Prevention
Using approx for financial reporting	Incorrect totals	Use exact aggregations for monetary data
Not documenting approximate columns	Consumers assume exact values	Always label approximate columns
Using SAMPLE without a seed	Non-reproducible results	Add `(reservoir, seed)` for reproducibility

Quick Reference

SELECT approx_count_distinct(col) FROM t;
SELECT approx_quantile(col, 0.95) FROM t;
SELECT * FROM t USING SAMPLE 1000;
SELECT * FROM t USING SAMPLE 10 PERCENT;

What's Next

Next: Joins and Set Operations - Combine data from multiple tables.
Module Overview - Return to this module index.

Why Approximate Aggregation?​

approx_count_distinct (HyperLogLog)​

approx_quantile​

Reservoir Sampling​

When to Use Approximate vs. Exact​

Common Pitfalls​

Quick Reference​

What's Next​