HTTPFS and S3

Learning Focus

Use this lesson to query files over HTTP(S) and in cloud object storage (S3, GCS, Azure) without downloading them locally first.

Installing and Loading HTTPFS

-- Install once per DuckDB installation
INSTALL httpfs;

-- Load per session
LOAD httpfs;
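
To confirm the extension is installed and loaded, you can inspect DuckDB's built-in extension catalog via the `duckdb_extensions()` table function:

```sql
-- Check the install/load status of httpfs
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'httpfs';
```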

Query Files Over HTTP

LOAD httpfs;

-- Query a CSV file directly from a URL
SELECT * FROM read_csv_auto('https://example.com/data/employees.csv') LIMIT 10;

-- Query a Parquet file from GitHub
SELECT * FROM read_parquet('https://github.com/user/repo/raw/main/data.parquet');
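
If you plan to query the same remote file repeatedly, it is usually cheaper to materialize it once into a local table rather than re-fetch it per query. A sketch, assuming the hypothetical URL from above:

```sql
-- Cache remote data locally so repeated queries don't re-download the file
CREATE TABLE employees AS
SELECT * FROM read_csv_auto('https://example.com/data/employees.csv');
```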

Configure S3 Access

LOAD httpfs;

-- Option 1: Set credentials explicitly
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_ACCESS_KEY';
SET s3_secret_access_key = 'YOUR_SECRET_KEY';

-- Option 2: Load a profile from ~/.aws/credentials via the aws extension
INSTALL aws;
LOAD aws;
CALL load_aws_credentials('default');

-- Option 3: Anonymous access (public buckets)
SET s3_access_key_id = '';
SET s3_secret_access_key = '';
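
Newer DuckDB releases (0.10+) also offer the Secrets Manager, which keeps credentials in named, scoped secrets instead of session-wide SET options. A sketch with placeholder keys:

```sql
-- Equivalent to the SET options above, but scoped to a named secret
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'YOUR_ACCESS_KEY',
    SECRET 'YOUR_SECRET_KEY',
    REGION 'us-east-1'
);
```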

Query S3 Files

LOAD httpfs;

-- CSV from S3
SELECT * FROM read_csv_auto('s3://my-bucket/data/employees.csv');

-- Parquet from S3
SELECT * FROM read_parquet('s3://my-bucket/warehouse/sales/*.parquet');

-- Hive-partitioned Parquet on S3
SELECT * FROM read_parquet(
    's3://my-bucket/events/**/*.parquet',
    hive_partitioning = true
);

-- Aggregate directly from S3
SELECT year, SUM(revenue) AS total
FROM read_parquet('s3://my-bucket/sales/**/*.parquet', hive_partitioning=true)
GROUP BY year
ORDER BY year;
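
Before running a heavy aggregation, it can help to see which objects a glob pattern actually matches. DuckDB's `glob` table function works with S3 paths once httpfs is loaded (hypothetical bucket name):

```sql
-- List the files the ** pattern would read
SELECT * FROM glob('s3://my-bucket/sales/**/*.parquet');
```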

Write Back to S3

-- Export query results to S3
COPY (SELECT * FROM employees WHERE active = true)
TO 's3://my-bucket/output/active_employees.parquet' (FORMAT PARQUET);
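
COPY can also write Hive-partitioned output, which pairs with hive_partitioning = true on the read side. A sketch, assuming the employees table has a year column:

```sql
-- Write one Parquet file per year value under the target prefix
COPY (SELECT * FROM employees)
TO 's3://my-bucket/output/employees'
(FORMAT PARQUET, PARTITION_BY (year));
```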

GCS and Azure

-- Google Cloud Storage (uses S3-compatible API)
SET s3_endpoint = 'storage.googleapis.com';
SET s3_url_style = 'path';
SELECT * FROM read_parquet('s3://gcs-bucket/data.parquet');

-- Azure Blob Storage
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;...';
SELECT * FROM read_parquet('azure://container/path/data.parquet');
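
For non-public GCS buckets, access through the S3-compatible API typically uses HMAC keys generated in the Google Cloud Console. A sketch with placeholder values:

```sql
-- GCS via the S3 interoperability API with HMAC credentials
SET s3_endpoint = 'storage.googleapis.com';
SET s3_url_style = 'path';
SET s3_access_key_id = 'GOOG_HMAC_KEY_ID';
SET s3_secret_access_key = 'GOOG_HMAC_SECRET';
SELECT * FROM read_parquet('s3://gcs-bucket/data.parquet');
```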

Common Pitfalls

| Pitfall | Consequence | Prevention |
| --- | --- | --- |
| Forgetting LOAD httpfs | Errors when opening http:// or s3:// paths | Always LOAD httpfs before remote queries |
| Reading many small files from S3 | High latency (one request per file) | Consolidate small files into larger Parquet files |
| Public bucket without clearing credentials | Auth error | Set s3_access_key_id = '' for anonymous access |

Quick Reference

INSTALL httpfs; LOAD httpfs;

-- HTTP
SELECT * FROM read_csv_auto('https://host/file.csv');

-- S3
SET s3_region = 'us-east-1';
SELECT * FROM read_parquet('s3://bucket/path/*.parquet');

-- Write to S3
COPY (SELECT * FROM t) TO 's3://bucket/out.parquet' (FORMAT PARQUET);

What's Next