DuckDB for AI Agents
Learning Focus
Use this lesson to understand how DuckDB enables AI agents to query, analyze, and reason over structured data efficiently — without requiring a database server.
Why DuckDB for AI Agents?
AI agents and LLM-powered applications need to:
- Query structured data in response to natural language requests
- Run analytics on-the-fly over CSV / Parquet / JSON files
- Operate without infrastructure — no database to spin up
- Return results fast — LLM reasoning loops need low-latency tool calls
- Handle arbitrary datasets — agents don't know the schema upfront
DuckDB satisfies all five requirements out of the box.
The Text-to-SQL Pattern
The most common AI agent pattern with DuckDB:
User Prompt
↓
LLM generates SQL
↓
Agent calls DuckDB tool
↓
DuckDB executes query on local file
↓
Result returned to LLM
↓
LLM formulates natural language answer
import duckdb
import openai
def run_analytics_query(user_question: str, data_path: str) -> str:
# Step 1: Get schema so the LLM knows the table structure
schema = duckdb.sql(f"DESCRIBE SELECT * FROM '{data_path}'").df().to_string()
# Step 2: Ask LLM to generate SQL
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": f"You are a SQL expert. Table schema:\n{schema}\nWrite DuckDB SQL to answer the question."},
{"role": "user", "content": user_question}
]
)
sql = response.choices[0].message.content.strip()
# Step 3: Execute with DuckDB
result = duckdb.sql(f"SELECT * FROM '{data_path}' -- {sql}").df()
return result.to_markdown()
# Usage
answer = run_analytics_query(
"Which product category had the highest revenue last quarter?",
"sales_2024.parquet"
)
print(answer)
Use Case 1: Agent Tool — Query Any File
import duckdb
def query_data_file(file_path: str, sql_query: str) -> dict:
"""
DuckDB tool for AI agents: executes SQL over any supported file.
Supports: .csv, .parquet, .json, .xlsx
"""
try:
con = duckdb.connect()
# Register the file as a virtual table
result = con.execute(
sql_query.replace("FROM data", f"FROM read_csv_auto('{file_path}')")
).df()
return {"status": "success", "rows": len(result), "data": result.to_dict()}
except Exception as e:
return {"status": "error", "message": str(e)}
# Example agent tool call
result = query_data_file(
"sales.csv",
"SELECT region, SUM(revenue) AS total FROM data GROUP BY region"
)
Use Case 2: Auto Schema Discovery for LLM Context
import duckdb
def get_schema_for_llm(file_path: str) -> str:
"""Return a compact schema description for LLM prompt injection."""
con = duckdb.connect()
# Get column names and types
schema = con.execute(
f"DESCRIBE SELECT * FROM '{file_path}'"
).fetchdf()
# Get sample rows
sample = con.execute(
f"SELECT * FROM '{file_path}' USING SAMPLE 3"
).fetchdf()
return f"""
Table: data (from {file_path})
Columns:
{schema[['column_name','column_type']].to_string(index=False)}
Sample rows:
{sample.to_string(index=False)}
"""
# Inject into LLM system prompt
schema_context = get_schema_for_llm("customers.parquet")
Use Case 3: Agentic ETL Pipeline
import duckdb
def transform_and_analyze(input_path: str, output_path: str) -> str:
"""Agent pipeline: clean, transform, analyze, export."""
con = duckdb.connect()
# Step 1: Load raw data
raw_count = con.execute(f"SELECT COUNT(*) FROM '{input_path}'").fetchone()[0]
# Step 2: Clean and transform
con.execute(f"""
CREATE TABLE clean_data AS
SELECT
id,
UPPER(TRIM(name)) AS name,
LOWER(TRIM(email)) AS email,
COALESCE(revenue, 0) AS revenue,
strptime(signup_date, '%d/%m/%Y')::DATE AS signup_date
FROM '{input_path}'
WHERE id IS NOT NULL AND email ILIKE '%@%.%'
""")
# Step 3: Analyze
summary = con.execute("""
SELECT COUNT(*) AS clean_rows, SUM(revenue) AS total_revenue, AVG(revenue) AS avg_revenue
FROM clean_data
""").fetchdf()
# Step 4: Export
con.execute(f"COPY clean_data TO '{output_path}' (FORMAT PARQUET)")
return f"Processed {raw_count} raw rows → {summary.to_string()}"
Use Case 4: RAG Data Retrieval (Structured)
In Retrieval-Augmented Generation (RAG) pipelines, DuckDB serves as the structured data retrieval layer alongside vector databases for unstructured content.
import duckdb
def retrieve_structured_context(user_question: str, db_path: str) -> str:
"""
Hybrid RAG: use DuckDB for structured data, vector DB for unstructured.
This handles the structured side.
"""
con = duckdb.connect(db_path)
# Keywords extracted from user question (simplified)
keywords = [w for w in user_question.split() if len(w) > 4]
pattern = f"%{'%'.join(keywords[:3])}%"
result = con.execute(f"""
SELECT title, summary, published_date, source
FROM articles
WHERE body ILIKE ?
ORDER BY published_date DESC
LIMIT 5
""", [pattern]).fetchdf()
return result.to_markdown(index=False)
Use Case 5: Agent Memory with DuckDB
DuckDB can serve as a lightweight persistent memory store for agents:
import duckdb
from datetime import datetime
class AgentMemory:
def __init__(self, db_path: str = "agent_memory.db"):
self.con = duckdb.connect(db_path)
self.con.execute("""
CREATE TABLE IF NOT EXISTS memories (
id INTEGER DEFAULT nextval('memory_seq'),
session_id VARCHAR,
role VARCHAR,
content TEXT,
created_at TIMESTAMP DEFAULT current_timestamp
)
""")
def save(self, session_id: str, role: str, content: str):
self.con.execute(
"INSERT INTO memories (session_id, role, content) VALUES (?, ?, ?)",
[session_id, role, content]
)
def recall(self, session_id: str, limit: int = 10) -> list:
return self.con.execute("""
SELECT role, content, created_at
FROM memories
WHERE session_id = ?
ORDER BY created_at DESC
LIMIT ?
""", [session_id, limit]).fetchall()
DuckDB as AI Agent Tool — Summary
| Agent Capability | DuckDB Role |
|---|---|
| Text-to-SQL execution | Execute LLM-generated SQL on any file |
| Schema discovery | DESCRIBE + SUMMARIZE for LLM context |
| Result summarization | Return DataFrames to LLM for reasoning |
| ETL in agent pipelines | Clean, transform, export without a server |
| Structured RAG | Query structured data alongside vector search |
| Agent memory | Lightweight persistent state in a .db file |
| Analytics tools | Window functions, aggregations, JOIN across files |
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| LLM generates MySQL syntax | DuckDB syntax errors | Include "Use DuckDB SQL" in system prompt |
| Agent runs no LIMIT on large files | Huge result returned to LLM | Add LIMIT 100 guard in the tool wrapper |
| No error handling in tool wrapper | Agent crashes on bad SQL | Wrap execute in try/except and return error message |
Quick Reference
import duckdb
# Query any file
duckdb.sql("SELECT * FROM 'data.csv' LIMIT 10").df()
# Get schema for LLM context
duckdb.sql("DESCRIBE SELECT * FROM 'data.parquet'").df()
# Summarize data for agent context
duckdb.sql("SUMMARIZE SELECT * FROM 'data.parquet'").df()
# Execute LLM-generated SQL safely
try:
result = duckdb.sql(llm_generated_sql).df()
except Exception as e:
result = f"SQL Error: {e}"
What's Next
- Next: Module 2 — DuckDB Data Types - Understand the full DuckDB type system.
- Module Overview - Return to this module index.