HTTPFS and S3

Learning Focus

Use this lesson to query files over HTTP(S) and in cloud object storage (S3, GCS, Azure) without downloading them locally first.

Installing and Loading HTTPFS

-- Install once per DuckDB installation
INSTALL httpfs;

-- Load per session
LOAD httpfs;
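
To confirm the extension is installed and loaded, you can inspect DuckDB's built-in extension catalog via the `duckdb_extensions()` table function:

```sql
-- Check the install/load status of httpfs
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'httpfs';
```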

Query Files Over HTTP

LOAD httpfs;

-- Query a CSV file directly from a URL
SELECT * FROM read_csv_auto('https://example.com/data/employees.csv') LIMIT 10;

-- Query a Parquet file from GitHub
SELECT * FROM read_parquet('https://github.com/user/repo/raw/main/data.parquet');
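
If you plan to query the same remote file repeatedly, it is usually cheaper to materialize it once into a local table rather than re-fetch it per query. A sketch, assuming the hypothetical URL from above:

```sql
-- Cache remote data locally so repeated queries don't re-download the file
CREATE TABLE employees AS
SELECT * FROM read_csv_auto('https://example.com/data/employees.csv');
```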

Configure S3 Access

LOAD httpfs;

-- Option 1: Set credentials explicitly
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_ACCESS_KEY';
SET s3_secret_access_key = 'YOUR_SECRET_KEY';

-- Option 2: Load a profile from ~/.aws/credentials via the aws extension
INSTALL aws;
LOAD aws;
CALL load_aws_credentials('default');

-- Option 3: Anonymous access (public buckets)
SET s3_access_key_id = '';
SET s3_secret_access_key = '';
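
Newer DuckDB releases (0.10+) also offer the Secrets Manager, which keeps credentials in named, scoped secrets instead of session-wide SET options. A sketch with placeholder keys:

```sql
-- Equivalent to the SET options above, but scoped to a named secret
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'YOUR_ACCESS_KEY',
    SECRET 'YOUR_SECRET_KEY',
    REGION 'us-east-1'
);
```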

Query S3 Files

LOAD httpfs;

-- CSV from S3
SELECT * FROM read_csv_auto('s3://my-bucket/data/employees.csv');

-- Parquet from S3
SELECT * FROM read_parquet('s3://my-bucket/warehouse/sales/*.parquet');

-- Hive-partitioned Parquet on S3
SELECT * FROM read_parquet(
    's3://my-bucket/events/**/*.parquet',
    hive_partitioning = true
);

-- Aggregate directly from S3
SELECT year, SUM(revenue) AS total
FROM read_parquet('s3://my-bucket/sales/**/*.parquet', hive_partitioning=true)
GROUP BY year
ORDER BY year;
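
Before running a heavy aggregation, it can help to see which objects a glob pattern actually matches. DuckDB's `glob` table function works with S3 paths once httpfs is loaded (hypothetical bucket name):

```sql
-- List the files the ** pattern would read
SELECT * FROM glob('s3://my-bucket/sales/**/*.parquet');
```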

Write Back to S3

-- Export query results to S3
COPY (SELECT * FROM employees WHERE active = true)
TO 's3://my-bucket/output/active_employees.parquet' (FORMAT PARQUET);
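
COPY can also write Hive-partitioned output, which pairs with hive_partitioning = true on the read side. A sketch, assuming the employees table has a year column:

```sql
-- Write one Parquet file per year value under the target prefix
COPY (SELECT * FROM employees)
TO 's3://my-bucket/output/employees'
(FORMAT PARQUET, PARTITION_BY (year));
```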

GCS and Azure

-- Google Cloud Storage (uses S3-compatible API)
SET s3_endpoint = 'storage.googleapis.com';
SET s3_url_style = 'path';
SELECT * FROM read_parquet('s3://gcs-bucket/data.parquet');

-- Azure Blob Storage
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;...';
SELECT * FROM read_parquet('azure://container/path/data.parquet');
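
For non-public GCS buckets, access through the S3-compatible API typically uses HMAC keys generated in the Google Cloud Console. A sketch with placeholder values:

```sql
-- GCS via the S3 interoperability API with HMAC credentials
SET s3_endpoint = 'storage.googleapis.com';
SET s3_url_style = 'path';
SET s3_access_key_id = 'GOOG_HMAC_KEY_ID';
SET s3_secret_access_key = 'GOOG_HMAC_SECRET';
SELECT * FROM read_parquet('s3://gcs-bucket/data.parquet');
```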

Common Pitfalls

| Pitfall | Consequence | Prevention |
| --- | --- | --- |
| Forgetting LOAD httpfs | Errors when opening http:// or s3:// paths | Always LOAD httpfs before remote queries |
| Reading many small files from S3 | High latency (one request per file) | Consolidate small files into larger Parquet files |
| Public bucket without clearing credentials | Auth error | Set s3_access_key_id = '' for anonymous access |

Quick Reference

INSTALL httpfs; LOAD httpfs;

-- HTTP
SELECT * FROM read_csv_auto('https://host/file.csv');

-- S3
SET s3_region = 'us-east-1';
SELECT * FROM read_parquet('s3://bucket/path/*.parquet');

-- Write to S3
COPY (SELECT * FROM t) TO 's3://bucket/out.parquet' (FORMAT PARQUET);

What's Next