When to Use DuckDB
Learning Focus
Use this lesson to develop a practical decision framework for when DuckDB is the right tool.
Real-World Use Cases
Use Case 1: Local Data Exploration
Scenario: A data analyst receives a 2GB CSV file and needs quick summary statistics.
-- No database setup needed
SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
MAX(salary) AS top_salary
FROM read_csv_auto('employees_2024.csv')
GROUP BY department
ORDER BY avg_salary DESC;
DuckDB runs this query in seconds, with no import step and no database server to set up.
Use Case 2: Analytical Pipeline on S3
Scenario: A data engineer needs to query Parquet files stored in S3.
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
-- For private buckets, also set s3_access_key_id and s3_secret_access_key
SELECT year, SUM(revenue)
FROM read_parquet('s3://my-bucket/sales/*.parquet')
GROUP BY year
ORDER BY year;
Use Case 3: Python Data Science Workflow
Scenario: A data scientist combines DuckDB with pandas for fast transformations.
import duckdb
import pandas as pd
df = pd.read_parquet("large_dataset.parquet")
result = duckdb.query("""
SELECT category, AVG(price) AS avg_price
FROM df
GROUP BY category
ORDER BY avg_price DESC
""").df()
print(result)
Use Case 4: Embedded Analytics in Applications
Scenario: A Go or Node.js application needs to run analytical queries in-process, without a separate database server.
-- In-memory DuckDB, no file needed
CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv');
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;
Use Case 5: ETL Intermediate Layer
Scenario: Transform raw CSV data and export clean Parquet files.
COPY (
SELECT
id,
UPPER(name) AS name,
salary * 1.1 AS adjusted_salary,
strftime(hire_date, '%Y-%m') AS hire_month
FROM read_csv_auto('raw_employees.csv')
WHERE salary > 0
) TO 'clean_employees.parquet' (FORMAT PARQUET);
Decision Framework
- Need analytics on files? → DuckDB ✓
- Need multi-user concurrent writes? → PostgreSQL ✓
- Embedded app, small dataset? → SQLite ✓
- Need a data warehouse in the cloud? → Snowflake / BigQuery ✓
Learning Check
Questions:
- You need to analyze 10GB of CSV logs daily in a Python script. Which tool?
- Your team needs concurrent write access from 20 web app instances. Which tool?
- What DuckDB feature makes S3 Parquet queries possible?
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| Using DuckDB for real-time transactional data | Data inconsistency | Use an OLTP database as the source of truth |
| Querying remote files without checking network latency | Slow queries | Cache remote files locally when querying repeatedly |
| Not filtering early in file queries | Full file scan on large datasets | Always push WHERE conditions before aggregations |
Quick Reference
-- Query a local CSV
SELECT * FROM read_csv_auto('file.csv') LIMIT 5;
-- Query a Parquet file
SELECT * FROM read_parquet('file.parquet') LIMIT 5;
-- Query a JSON file
SELECT * FROM read_json_auto('file.json') LIMIT 5;
# In-memory database (no .db file): start the CLI with no file argument
duckdb
What's Next
- Next: Installing DuckDB - Get DuckDB running on your machine.
- Module Overview - Return to the module index.