
When to Use DuckDB

Learning Focus

Use this lesson to develop a practical decision framework for when DuckDB is the right tool.

Real-World Use Cases

Use Case 1: Local Data Exploration

Scenario: A data analyst receives a 2GB CSV file and needs quick summary statistics.

-- No database setup needed
SELECT
    department,
    COUNT(*) AS headcount,
    AVG(salary) AS avg_salary,
    MAX(salary) AS top_salary
FROM read_csv_auto('employees_2024.csv')
GROUP BY department
ORDER BY avg_salary DESC;

DuckDB processes this in seconds without importing into any database.

Use Case 2: Analytical Pipeline on S3

Scenario: A data engineer needs to query Parquet files stored in S3.

INSTALL httpfs;
LOAD httpfs;

SET s3_region = 'us-east-1';

SELECT year, SUM(revenue)
FROM read_parquet('s3://my-bucket/sales/*.parquet')
GROUP BY year
ORDER BY year;
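Access to a private bucket also needs credentials. One way to supply them (assuming the httpfs extension's S3 settings; the values below are placeholders, not working keys) is:

```sql
-- Placeholder credentials; substitute your own
SET s3_access_key_id = 'YOUR_KEY_ID';
SET s3_secret_access_key = 'YOUR_SECRET';
```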

Use Case 3: Python Data Science Workflow

Scenario: A data scientist combines DuckDB with pandas for fast transformations.

import duckdb
import pandas as pd

df = pd.read_parquet("large_dataset.parquet")

result = duckdb.query("""
SELECT category, AVG(price) AS avg_price
FROM df
GROUP BY category
ORDER BY avg_price DESC
""").df()

print(result)

Use Case 4: Embedded Analytics in Applications

Scenario: A Go or Node.js application needs embedded analytical queries without a database server.

-- In-memory DuckDB, no file needed
CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv');
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;

Use Case 5: ETL Intermediate Layer

Scenario: Transform raw CSV data and export clean Parquet files.

COPY (
    SELECT
        id,
        UPPER(name) AS name,
        salary * 1.1 AS adjusted_salary,
        strftime(hire_date, '%Y-%m') AS hire_month
    FROM read_csv_auto('raw_employees.csv')
    WHERE salary > 0
) TO 'clean_employees.parquet' (FORMAT PARQUET);

Decision Framework

Need analytics on files?            → DuckDB ✓
Need multi-user concurrent writes?  → PostgreSQL ✓
Embedded app, small dataset?        → SQLite ✓
Need a data warehouse in the cloud? → Snowflake / BigQuery ✓

Learning Check

Questions:

  1. You need to analyze 10GB of CSV logs daily in a Python script. Which tool?
  2. Your team needs concurrent write access from 20 web app instances. Which tool?
  3. What DuckDB feature makes S3 Parquet queries possible?

Common Pitfalls

Pitfall                                                | Consequence                      | Prevention
Using DuckDB for real-time transactional data          | Data inconsistency               | Use an OLTP database as the source of truth
Querying remote files without checking network latency | Slow queries                     | Cache remote files locally when querying repeatedly
Not filtering early in file queries                    | Full file scan on large datasets | Always push WHERE conditions before aggregations
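For the last pitfall, the fix is to put the filter in the same query as the file scan, so DuckDB can prune rows (and, for Parquet, skip whole row groups) while reading instead of scanning everything first. The file and column names here are illustrative:

```sql
-- Filter applied during the scan, before aggregation
SELECT category, COUNT(*)
FROM read_parquet('events.parquet')
WHERE event_date >= DATE '2024-01-01'
GROUP BY category;
```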

Quick Reference

-- Query a local CSV
SELECT * FROM read_csv_auto('file.csv') LIMIT 5;

-- Query a Parquet file
SELECT * FROM read_parquet('file.parquet') LIMIT 5;

-- Query a JSON file
SELECT * FROM read_json_auto('file.json') LIMIT 5;

# Shell command: launching duckdb with no file argument opens an in-memory database
duckdb

What's Next