Skip to main content

Approximate Aggregation

Learning Focus

Use this lesson to understand approximate aggregation — a DuckDB superpower for analyzing massive datasets at speed.

Why Approximate Aggregation?

When datasets are billions of rows, exact aggregations can:

  • Require GBs of memory
  • Take minutes to compute

Approximate algorithms give ~99% accurate results in a fraction of the time and memory.

approx_count_distinct (HyperLogLog)

-- Exact (slow on massive tables)
SELECT COUNT(DISTINCT user_id) FROM events;

-- Approximate (fast, ~1% error rate)
SELECT approx_count_distinct(user_id) FROM events;

-- Per group
SELECT
event_type,
approx_count_distinct(user_id) AS approx_unique_users
FROM events
GROUP BY event_type;

approx_quantile

-- Approximate median (much faster on large tables)
SELECT approx_quantile(salary, 0.5) AS approx_median FROM employees;

-- Approximate percentiles
SELECT
approx_quantile(response_time_ms, 0.50) AS p50,
approx_quantile(response_time_ms, 0.95) AS p95,
approx_quantile(response_time_ms, 0.99) AS p99
FROM request_logs;

Reservoir Sampling

-- Random sample of N rows (no full sort required)
SELECT * FROM large_table USING SAMPLE 1000;

-- Percentage-based sample
SELECT * FROM large_table USING SAMPLE 5 PERCENT;

-- Fixed seed for reproducibility
SELECT * FROM large_table USING SAMPLE 1000 (reservoir, 42);

When to Use Approximate vs. Exact

ScenarioUse
BI dashboards with billions of eventsapprox_count_distinct
Financial auditing requiring exact countCOUNT(DISTINCT ...)
SLA latency percentiles (P95/P99)approx_quantile
Compliance reportsquantile (exact)

Common Pitfalls

PitfallConsequencePrevention
Using approx for financial reportingIncorrect totalsUse exact aggregations for monetary data
Not documenting approximate columnsConsumers assume exact valuesAlways label approximate columns
Using SAMPLE without a seedNon-reproducible resultsAdd (reservoir, seed) for reproducibility

Quick Reference

SELECT approx_count_distinct(col) FROM t;
SELECT approx_quantile(col, 0.95) FROM t;
SELECT * FROM t USING SAMPLE 1000;
SELECT * FROM t USING SAMPLE 10 PERCENT;

What's Next