HTTPFS and S3
Learning Focus
Use this lesson to query remote files from the web, S3, or cloud storage without downloading them locally.
Installing and Loading HTTPFS
-- Install once per DuckDB installation
INSTALL httpfs;
-- Load per session
LOAD httpfs;
Query Files Over HTTP
LOAD httpfs;
-- Query a CSV file directly from a URL
SELECT * FROM read_csv_auto('https://example.com/data/employees.csv') LIMIT 10;
-- Query a Parquet file from GitHub
SELECT * FROM read_parquet('https://github.com/user/repo/raw/main/data.parquet');
Configure S3 Access
LOAD httpfs;
-- Option 1: Set credentials explicitly
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_ACCESS_KEY';
SET s3_secret_access_key = 'YOUR_SECRET_KEY';
-- Option 2: Use a profile from ~/.aws/credentials via the credential chain
-- (there is no SET s3_profile option; this needs the aws extension: INSTALL aws; LOAD aws;)
CREATE SECRET aws_creds (TYPE S3, PROVIDER CREDENTIAL_CHAIN, PROFILE 'default');
-- Option 3: Anonymous access (public buckets)
SET s3_access_key_id = '';
SET s3_secret_access_key = '';
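Newer DuckDB releases (0.10+) also support a secrets-based configuration that replaces the individual SET commands; a minimal sketch, where the secret name my_s3 and the key values are illustrative placeholders:

```sql
-- Secrets-based S3 configuration (DuckDB 0.10+); name and values are illustrative
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'YOUR_ACCESS_KEY',
    SECRET 'YOUR_SECRET_KEY',
    REGION 'us-east-1'
);
```

A secret applies to all s3:// paths in the session unless narrowed with a SCOPE parameter, which keeps credentials out of individual queries.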
Query S3 Files
LOAD httpfs;
-- CSV from S3
SELECT * FROM read_csv_auto('s3://my-bucket/data/employees.csv');
-- Parquet from S3
SELECT * FROM read_parquet('s3://my-bucket/warehouse/sales/*.parquet');
-- Hive-partitioned Parquet on S3
SELECT * FROM read_parquet(
's3://my-bucket/events/**/*.parquet',
hive_partitioning = true
);
-- Aggregate directly from S3
SELECT year, SUM(revenue) AS total
FROM read_parquet('s3://my-bucket/sales/**/*.parquet', hive_partitioning=true)
GROUP BY year
ORDER BY year;
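Before scanning a large prefix, it can help to see which files a glob pattern actually matches; a sketch using DuckDB's glob() table function (the bucket path is illustrative):

```sql
-- List matching files without reading their contents
SELECT file
FROM glob('s3://my-bucket/sales/**/*.parquet');
```

Each matched object costs at least one HTTP request when scanned, so checking the file list first helps explain slow queries over many small files.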
Write Back to S3
-- Export query results to S3
COPY (SELECT * FROM employees WHERE active = true)
TO 's3://my-bucket/output/active_employees.parquet' (FORMAT PARQUET);
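DuckDB's COPY can also write Hive-style partitioned output, producing one directory per partition value; a sketch, assuming a sales table with year and month columns:

```sql
-- Write partitioned Parquet (layout: .../year=2024/month=1/data_0.parquet)
COPY (SELECT * FROM sales)
TO 's3://my-bucket/output/sales'
(FORMAT PARQUET, PARTITION_BY (year, month));
```

Files written this way can be read back with hive_partitioning = true, as shown in the earlier S3 read examples.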
GCS and Azure
-- Google Cloud Storage via its S3-compatible API
-- (requires GCS HMAC keys in s3_access_key_id / s3_secret_access_key)
SET s3_endpoint = 'storage.googleapis.com';
SET s3_url_style = 'path';
SELECT * FROM read_parquet('s3://gcs-bucket/data.parquet');
-- Azure Blob Storage
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;...';
SELECT * FROM read_parquet('azure://container/path/data.parquet');
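On newer DuckDB versions, GCS can also be configured with a secret and queried through gs:// URLs instead of the S3 endpoint settings; a sketch, where the secret name and key values are illustrative and the HMAC keys come from the GCS console:

```sql
-- GCS secret (DuckDB 0.10+); name and values are illustrative
CREATE SECRET my_gcs (
    TYPE GCS,
    KEY_ID 'YOUR_HMAC_KEY',
    SECRET 'YOUR_HMAC_SECRET'
);
SELECT * FROM read_parquet('gs://gcs-bucket/data.parquet');
```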
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| Forgetting LOAD httpfs | Error opening s3:// or https:// paths | Always LOAD httpfs before remote queries |
| Reading many small files from S3 | High latency (one request per file) | Consolidate small files into larger Parquet files |
| Public bucket without clearing credentials | Auth error | Set s3_access_key_id = '' for anonymous access |
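To rule out the first pitfall, the extension's status can be checked from within DuckDB itself using the built-in duckdb_extensions() table function:

```sql
-- Confirm httpfs is installed and loaded in the current session
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'httpfs';
```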
Quick Reference
INSTALL httpfs; LOAD httpfs;
-- HTTP
SELECT * FROM read_csv_auto('https://host/file.csv');
-- S3
SET s3_region = 'us-east-1';
SELECT * FROM read_parquet('s3://bucket/path/*.parquet');
-- Write to S3
COPY (SELECT * FROM t) TO 's3://bucket/out.parquet' (FORMAT PARQUET);
What's Next
- Next: COPY FROM and Export - Import and export data with COPY.
- Module Overview - Return to this module index.