When to Use DuckDB
Learning Focus
Use this lesson to develop a practical decision framework for when DuckDB is the right tool.
Real-World Use Cases
Use Case 1: Local Data Exploration
Scenario: A data analyst receives a 2GB CSV file and needs quick summary statistics.
-- No database setup needed
SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
MAX(salary) AS top_salary
FROM read_csv_auto('employees_2024.csv')
GROUP BY department
ORDER BY avg_salary DESC;
DuckDB runs this query in seconds, with no import step and no database server to set up.
Use Case 2: Analytical Pipeline on S3
Scenario: A data engineer needs to query Parquet files stored in S3.
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
-- For private buckets, also set s3_access_key_id and s3_secret_access_key
SELECT year, SUM(revenue)
FROM read_parquet('s3://my-bucket/sales/*.parquet')
GROUP BY year
ORDER BY year;
Use Case 3: Python Data Science Workflow
Scenario: A data scientist combines DuckDB with pandas for fast transformations.
import duckdb
import pandas as pd
df = pd.read_parquet("large_dataset.parquet")
result = duckdb.query("""
SELECT category, AVG(price) AS avg_price
FROM df
GROUP BY category
ORDER BY avg_price DESC
""").df()
print(result)
Use Case 4: Embedded Analytics in Applications
Scenario: A Go or Node.js application needs to run analytical queries in-process, without a separate database server.
-- In-memory DuckDB, no file needed
CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv');
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;
Use Case 5: ETL Intermediate Layer
Scenario: Transform raw CSV data and export clean Parquet files.
COPY (
SELECT
id,
UPPER(name) AS name,
salary * 1.1 AS adjusted_salary,
strftime(hire_date, '%Y-%m') AS hire_month
FROM read_csv_auto('raw_employees.csv')
WHERE salary > 0
) TO 'clean_employees.parquet' (FORMAT PARQUET);
Decision Framework
- Need analytics on files? → DuckDB ✓
- Need multi-user concurrent writes? → PostgreSQL ✓
- Embedded app, small dataset? → SQLite ✓
- Need a data warehouse in the cloud? → Snowflake / BigQuery ✓
Learning Check
Questions:
- You need to analyze 10GB of CSV logs daily in a Python script. Which tool?
- Your team needs concurrent write access from 20 web app instances. Which tool?
- What DuckDB feature makes S3 Parquet queries possible?
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| Using DuckDB for real-time transactional data | Data inconsistency | Use an OLTP database as the source of truth |
| Querying remote files without checking network latency | Slow queries | Cache remote files locally when querying repeatedly |
| Not filtering early in file queries | Full file scan on large datasets | Always push WHERE conditions before aggregations |
Quick Reference
-- Query a local CSV
SELECT * FROM read_csv_auto('file.csv') LIMIT 5;
-- Query a Parquet file
SELECT * FROM read_parquet('file.parquet') LIMIT 5;
-- Query a JSON file
SELECT * FROM read_json_auto('file.json') LIMIT 5;
# In-memory database (no .db file): start the CLI with no file argument
duckdb
What's Next
- Next: Installing DuckDB - Get DuckDB running on your machine.
- Module Overview - Return to the module index.