Reading Parquet Files

Learning Focus

Use this lesson to query Parquet files — the most efficient format for DuckDB's columnar engine.

Why Parquet with DuckDB?

Parquet is the ideal format for DuckDB because:

  • Columnar: Only reads columns referenced in the query
  • Compressed: Smaller files, faster I/O
  • Predicate pushdown: DuckDB pushes WHERE filters into the file reader
  • Schema embedded: No type inference needed
  • Widely supported: Spark, pandas, Hive, BigQuery all export Parquet

Basic Usage

-- Read a Parquet file
SELECT * FROM 'data.parquet' LIMIT 10;

-- Using the read_parquet() function
SELECT * FROM read_parquet('data.parquet') LIMIT 10;

-- Read all Parquet files in a directory
SELECT * FROM read_parquet('data/*.parquet');
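
read_parquet() also accepts an explicit list of paths, and its filename option adds a column recording which file each row came from — useful when combining many files. (File names below are illustrative.)

-- Read an explicit list of files
SELECT * FROM read_parquet(['jan.parquet', 'feb.parquet']);

-- Add a 'filename' column with each row's source file
SELECT filename, COUNT(*) AS rows
FROM read_parquet('data/*.parquet', filename = true)
GROUP BY filename;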

Column Pruning (Automatic)

DuckDB automatically reads only the columns you SELECT:

-- Only 'department' and 'salary' columns are read from disk
SELECT department, AVG(salary) AS avg_salary
FROM 'large_dataset.parquet'
GROUP BY department;
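
You can verify the pruning with EXPLAIN: the Parquet scan operator in the plan lists only the projected columns (the exact plan output varies by DuckDB version).

-- Inspect the plan; the scan node shows just 'department' and 'salary'
EXPLAIN SELECT department, AVG(salary) AS avg_salary
FROM 'large_dataset.parquet'
GROUP BY department;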

Predicate Pushdown

-- DuckDB pushes WHERE into the Parquet reader — only relevant row groups are read
SELECT * FROM 'sales.parquet'
WHERE year = 2024 AND region = 'APAC';
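
Pushdown skips row groups using their min/max statistics, so it is most effective when the data is sorted on the common filter columns before writing. One way to do that when producing the file (the `sales` source table here is illustrative):

-- Sort on filter columns so row-group min/max stats become selective
COPY (SELECT * FROM sales ORDER BY year, region)
TO 'sales.parquet' (FORMAT PARQUET);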

Reading Hive-Partitioned Parquet

-- Auto-detect Hive partitioning: data/year=2024/month=01/data.parquet
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning = true);

-- Partitions become filterable columns
SELECT year, month, SUM(revenue)
FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
WHERE year = 2024
GROUP BY year, month;
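
DuckDB can also write Hive-partitioned layouts with the PARTITION_BY option of COPY. A minimal sketch, assuming a `sales` table with year and month columns:

-- Writes data/year=2024/month=1/... style directories
COPY sales TO 'data'
(FORMAT PARQUET, PARTITION_BY (year, month));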

Convert CSV to Parquet

-- Best practice: convert CSV to Parquet for repeated analysis
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data.parquet' (FORMAT PARQUET);
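
COPY also accepts a COMPRESSION option if you want to standardize the codec; ZSTD typically trades a little extra CPU for noticeably smaller files than the SNAPPY default.

-- Same conversion, with an explicit codec
COPY (SELECT * FROM read_csv_auto('raw_data.csv'))
TO 'clean_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);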

Inspect Parquet File Metadata

-- View schema
DESCRIBE SELECT * FROM 'data.parquet';

-- View file metadata (row groups, compression, etc.)
SELECT * FROM parquet_metadata('data.parquet');

-- View schema metadata
SELECT * FROM parquet_schema('data.parquet');
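
parquet_metadata() returns one row per column chunk per row group; its statistics columns expose the min/max values that power predicate pushdown. A focused query (column names as in recent DuckDB releases):

-- Per-row-group statistics used for row-group skipping
SELECT row_group_id, path_in_schema, num_values,
       stats_min_value, stats_max_value
FROM parquet_metadata('data.parquet');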

Common Pitfalls

| Pitfall | Consequence | Prevention |
| --- | --- | --- |
| Using CSV when Parquet is available | 5-10× slower queries | Convert large CSVs to Parquet |
| Reading all columns from a wide Parquet file | No column pruning benefit | SELECT only the columns you need |
| Mixing compression codecs across files | Not an error, just suboptimal | Standardize on SNAPPY or ZSTD compression |

Quick Reference

SELECT * FROM 'file.parquet' LIMIT 10;
SELECT * FROM read_parquet('dir/*.parquet');
SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning=true);
COPY (SELECT * FROM t) TO 'out.parquet' (FORMAT PARQUET);
SELECT * FROM parquet_metadata('file.parquet');

What's Next