Reading Parquet Files

Learning Focus

Use this lesson to query Parquet files — the most efficient format for DuckDB's columnar engine.

Why Parquet with DuckDB?

Parquet is the ideal format for DuckDB because:

  • Columnar: Only reads columns referenced in the query
  • Compressed: Smaller files, faster I/O
  • Predicate pushdown: DuckDB pushes WHERE filters into the file reader
  • Schema embedded: No type inference needed
  • Widely supported: Spark, pandas, Hive, BigQuery all export Parquet

Basic Usage

-- Read a Parquet file
SELECT * FROM 'data.parquet' LIMIT 10;

-- Using the read_parquet() function
SELECT * FROM read_parquet('data.parquet') LIMIT 10;

-- Read all Parquet files in a directory
SELECT * FROM read_parquet('data/*.parquet');
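
read_parquet() also accepts an explicit list of paths, and its filename option adds a column recording which file each row came from — useful when combining many files. (File names below are illustrative.)

-- Read an explicit list of files
SELECT * FROM read_parquet(['jan.parquet', 'feb.parquet']);

-- Add a 'filename' column with each row's source file
SELECT filename, COUNT(*) AS rows
FROM read_parquet('data/*.parquet', filename = true)
GROUP BY filename;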

Column Pruning (Automatic)

DuckDB automatically reads only the columns you SELECT:

-- Only 'department' and 'salary' columns are read from disk
SELECT department, AVG(salary) AS avg_salary
FROM 'large_dataset.parquet'
GROUP BY department;
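
You can verify the pruning with EXPLAIN: the Parquet scan operator in the plan lists only the projected columns (the exact plan output varies by DuckDB version).

-- Inspect the plan; the scan node shows just 'department' and 'salary'
EXPLAIN SELECT department, AVG(salary) AS avg_salary
FROM 'large_dataset.parquet'
GROUP BY department;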

Predicate Pushdown

-- DuckDB pushes WHERE into the Parquet reader — only relevant row groups are read
SELECT * FROM 'sales.parquet'
WHERE year = 2024 AND region = 'APAC';
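
Pushdown skips row groups using their min/max statistics, so it is most effective when the data is sorted on the common filter columns before writing. One way to do that when producing the file (the `sales` source table here is illustrative):

-- Sort on filter columns so row-group min/max stats become selective
COPY (SELECT * FROM sales ORDER BY year, region)
TO 'sales.parquet' (FORMAT PARQUET);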

Reading Hive-Partitioned Parquet

-- Auto-detect Hive partitioning: data/year=2024/month=01/data.parquet
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning = true);

-- Partitions become filterable columns
SELECT year, month, SUM(revenue)
FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
WHERE year = 2024
GROUP BY year, month;
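
DuckDB can also write Hive-partitioned layouts with the PARTITION_BY option of COPY. A minimal sketch, assuming a `sales` table with year and month columns:

-- Writes data/year=2024/month=1/... style directories
COPY sales TO 'data'
(FORMAT PARQUET, PARTITION_BY (year, month));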

Convert CSV to Parquet

-- Best practice: convert CSV to Parquet for repeated analysis
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data.parquet' (FORMAT PARQUET);
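
COPY also accepts a COMPRESSION option if you want to standardize the codec; ZSTD typically trades a little extra CPU for noticeably smaller files than the SNAPPY default.

-- Same conversion, with an explicit codec
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);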

Inspect Parquet File Metadata

-- View schema
DESCRIBE SELECT * FROM 'data.parquet';

-- View file metadata (row groups, compression, etc.)
SELECT * FROM parquet_metadata('data.parquet');

-- View schema metadata
SELECT * FROM parquet_schema('data.parquet');
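
parquet_metadata() returns one row per column chunk per row group; its statistics columns expose the min/max values that power predicate pushdown. A focused query (column names as in recent DuckDB releases):

-- Per-row-group statistics used for row-group skipping
SELECT row_group_id, path_in_schema, num_values,
       stats_min_value, stats_max_value
FROM parquet_metadata('data.parquet');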

Common Pitfalls

| Pitfall | Consequence | Prevention |
| --- | --- | --- |
| Using CSV when Parquet is available | 5-10× slower queries | Convert large CSVs to Parquet |
| Reading all columns from a wide Parquet file | No column pruning benefit | SELECT only the columns you need |
| Mixing compression codecs across files | Not an error, just suboptimal | Standardize on SNAPPY or ZSTD compression |

Quick Reference

SELECT * FROM 'file.parquet' LIMIT 10;
SELECT * FROM read_parquet('dir/*.parquet');
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning=true);
COPY (SELECT * FROM t) TO 'out.parquet' (FORMAT PARQUET);
SELECT * FROM parquet_metadata('file.parquet');

What's Next