Reading Parquet Files
Learning Focus
Use this lesson to query Parquet files, the most efficient format for DuckDB's columnar engine.
Why Parquet with DuckDB?
Parquet is the ideal format for DuckDB because:
- Columnar: Only reads columns referenced in the query
- Compressed: Smaller files, faster I/O
- Predicate pushdown: DuckDB pushes WHERE filters into the file reader
- Schema embedded: No type inference needed
- Widely supported: Spark, pandas, Hive, BigQuery all export Parquet
Basic Usage
```sql
-- Read a Parquet file
SELECT * FROM 'data.parquet' LIMIT 10;

-- Using the read_parquet() function
SELECT * FROM read_parquet('data.parquet') LIMIT 10;

-- Read all Parquet files in a directory
SELECT * FROM read_parquet('data/*.parquet');
```
Column Pruning (Automatic)
DuckDB automatically reads only the columns you SELECT:
```sql
-- Only 'department' and 'salary' columns are read from disk
SELECT department, AVG(salary) AS avg_salary
FROM 'large_dataset.parquet'
GROUP BY department;
```
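You can confirm column pruning with EXPLAIN: the Parquet scan node in the plan should list only the columns the query touches. A minimal sketch (the exact plan text and labels vary by DuckDB version):

```sql
-- The scan operator's projections should list only
-- 'department' and 'salary', not every column in the file
EXPLAIN
SELECT department, AVG(salary) AS avg_salary
FROM 'large_dataset.parquet'
GROUP BY department;
```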
Predicate Pushdown
```sql
-- DuckDB pushes WHERE into the Parquet reader — only relevant row groups are read
SELECT * FROM 'sales.parquet'
WHERE year = 2024 AND region = 'APAC';
```
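EXPLAIN also surfaces pushed-down predicates: the conditions should appear attached to the Parquet scan itself rather than as a separate filter operator above it. A sketch (plan layout varies by DuckDB version):

```sql
-- Row groups whose min/max statistics exclude the
-- predicate values are skipped entirely
EXPLAIN
SELECT * FROM 'sales.parquet'
WHERE year = 2024 AND region = 'APAC';
```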
Reading Hive-Partitioned Parquet
```sql
-- Auto-detect Hive partitioning: data/year=2024/month=01/data.parquet
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning = true);

-- Partitions become filterable columns
SELECT year, month, SUM(revenue)
FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
WHERE year = 2024
GROUP BY year, month;
```
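When debugging partitioned reads, it helps to know which file each row came from; `read_parquet` accepts a `filename` option that adds the source path as a column. A sketch, assuming the same `data/` layout as above:

```sql
-- The 'filename' column holds the path each row was read from
SELECT filename, COUNT(*) AS rows_per_file
FROM read_parquet('data/**/*.parquet',
                  hive_partitioning = true,
                  filename = true)
GROUP BY filename
ORDER BY rows_per_file DESC;
```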
Convert CSV to Parquet
```sql
-- Best practice: convert CSV to Parquet for repeated analysis
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data.parquet' (FORMAT PARQUET);
```
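COPY also accepts tuning options, such as the compression codec, the row-group size, and Hive-style partitioned output. A sketch (option names as in recent DuckDB releases; `clean_data_partitioned` is a hypothetical output directory, and the partitioned write assumes `raw_data.csv` has `year` and `month` columns):

```sql
-- Choose the codec and row-group size explicitly
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);

-- Write a Hive-partitioned directory tree instead of one file
-- (requires 'year' and 'month' columns in the input)
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data_partitioned'
(FORMAT PARQUET, PARTITION_BY (year, month));
```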
Inspect Parquet File Metadata
```sql
-- View schema
DESCRIBE SELECT * FROM 'data.parquet';

-- View file metadata (row groups, compression, etc.)
SELECT * FROM parquet_metadata('data.parquet');

-- View schema metadata
SELECT * FROM parquet_schema('data.parquet');
```
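`parquet_metadata` returns one row per column chunk per row group, which makes it easy to check how a file is compressed. A sketch (column names as in recent DuckDB versions):

```sql
-- Compression codec and size per column chunk
SELECT row_group_id,
       path_in_schema,
       compression,
       total_compressed_size,
       total_uncompressed_size
FROM parquet_metadata('data.parquet');
```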
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| Using CSV when Parquet is available | 5-10× slower queries | Convert large CSVs to Parquet |
| Reading all columns from wide Parquet | No column pruning benefit | Always SELECT only needed columns |
| Mixing compression codecs across files | Not an error, just suboptimal | Standardize on SNAPPY or ZSTD compression |
Quick Reference
```sql
SELECT * FROM 'file.parquet' LIMIT 10;
SELECT * FROM read_parquet('dir/*.parquet');
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning=true);
COPY (SELECT * FROM t) TO 'out.parquet' (FORMAT PARQUET);
SELECT * FROM parquet_metadata('file.parquet');
```
What's Next
- Next: Reading JSON Files - Ingest and query JSON data natively.
- Module Overview - Return to this module index.