DuckDB for AI Agents

Learning Focus

Use this lesson to understand how DuckDB enables AI agents to query, analyze, and reason over structured data efficiently — without requiring a database server.

Why DuckDB for AI Agents?

AI agents and LLM-powered applications need to:

  1. Query structured data in response to natural language requests
  2. Run analytics on-the-fly over CSV / Parquet / JSON files
  3. Operate without infrastructure — no database to spin up
  4. Return results fast — LLM reasoning loops need low-latency tool calls
  5. Handle arbitrary datasets — agents don't know the schema upfront

DuckDB satisfies all five requirements out of the box.

The Text-to-SQL Pattern

The most common AI agent pattern with DuckDB:

User prompt → LLM generates SQL → agent calls DuckDB tool → DuckDB executes the query on a local file → result returned to LLM → LLM formulates a natural-language answer.

import duckdb
import openai

def run_analytics_query(user_question: str, data_path: str) -> str:
    # Step 1: Get schema so the LLM knows the table structure
    schema = duckdb.sql(f"DESCRIBE SELECT * FROM '{data_path}'").df().to_string()

    # Step 2: Ask LLM to generate SQL
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a DuckDB SQL expert. Table schema:\n{schema}\nWrite DuckDB SQL to answer the question. Query the data as FROM '{data_path}'. Return only the SQL."},
            {"role": "user", "content": user_question}
        ]
    )
    sql = response.choices[0].message.content.strip()

    # Step 3: Execute the generated SQL with DuckDB
    result = duckdb.sql(sql).df()
    return result.to_markdown()

# Usage
answer = run_analytics_query(
    "Which product category had the highest revenue last quarter?",
    "sales_2024.parquet"
)
print(answer)
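One practical wrinkle the pattern above glosses over: chat models often wrap generated SQL in markdown fences, which `strip()` alone does not remove. A small sanitizer, sketched here with a hypothetical helper name, can run between steps 2 and 3.

```python
import re

def extract_sql(llm_output: str) -> str:
    """Strip a markdown code fence that chat models often wrap around SQL."""
    match = re.search(r"```(?:sql)?\s*(.*?)```", llm_output, re.DOTALL)
    return (match.group(1) if match else llm_output).strip()
```

If no fence is present, the input passes through unchanged, so the helper is safe to apply unconditionally.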

Use Case 1: Agent Tool — Query Any File

import duckdb

def query_data_file(file_path: str, sql_query: str) -> dict:
    """
    DuckDB tool for AI agents: executes SQL over any supported file.
    Supports: .csv, .parquet, .json — write queries against a table named `data`.
    """
    try:
        con = duckdb.connect()
        # Point the placeholder table name at the actual file;
        # DuckDB infers the format from the extension.
        result = con.execute(
            sql_query.replace("FROM data", f"FROM '{file_path}'")
        ).df()
        return {"status": "success", "rows": len(result), "data": result.to_dict()}
    except Exception as e:
        return {"status": "error", "message": str(e)}

# Example agent tool call
result = query_data_file(
    "sales.csv",
    "SELECT region, SUM(revenue) AS total FROM data GROUP BY region"
)
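For a model to call `query_data_file` on its own, the tool also needs a declared schema. A sketch in the OpenAI chat-completions `tools` format; the description strings are assumptions, not from this lesson.

```python
# Hypothetical tool schema for exposing query_data_file via function calling.
QUERY_TOOL = {
    "type": "function",
    "function": {
        "name": "query_data_file",
        "description": "Run DuckDB SQL over a local data file. "
                       "Refer to the file as a table named 'data'.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string",
                              "description": "Path to the data file"},
                "sql_query": {"type": "string",
                              "description": "DuckDB SQL using FROM data"},
            },
            "required": ["file_path", "sql_query"],
        },
    },
}
```

The schema would be passed in the `tools` list of a chat-completions request, alongside the system prompt.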

Use Case 2: Auto Schema Discovery for LLM Context

import duckdb

def get_schema_for_llm(file_path: str) -> str:
    """Return a compact schema description for LLM prompt injection."""
    con = duckdb.connect()

    # Get column names and types
    schema = con.execute(
        f"DESCRIBE SELECT * FROM '{file_path}'"
    ).fetchdf()

    # Get sample rows
    sample = con.execute(
        f"SELECT * FROM '{file_path}' USING SAMPLE 3"
    ).fetchdf()

    return f"""
Table: data (from {file_path})
Columns:
{schema[['column_name','column_type']].to_string(index=False)}

Sample rows:
{sample.to_string(index=False)}
"""

# Inject into LLM system prompt
schema_context = get_schema_for_llm("customers.parquet")

Use Case 3: Agentic ETL Pipeline

import duckdb

def transform_and_analyze(input_path: str, output_path: str) -> str:
    """Agent pipeline: clean, transform, analyze, export."""
    con = duckdb.connect()

    # Step 1: Count the raw rows before cleaning
    raw_count = con.execute(f"SELECT COUNT(*) FROM '{input_path}'").fetchone()[0]

    # Step 2: Clean and transform
    con.execute(f"""
        CREATE TABLE clean_data AS
        SELECT
            id,
            UPPER(TRIM(name)) AS name,
            LOWER(TRIM(email)) AS email,
            COALESCE(revenue, 0) AS revenue,
            strptime(signup_date, '%d/%m/%Y')::DATE AS signup_date
        FROM '{input_path}'
        WHERE id IS NOT NULL AND email ILIKE '%@%.%'
    """)

    # Step 3: Analyze
    summary = con.execute("""
        SELECT COUNT(*) AS clean_rows,
               SUM(revenue) AS total_revenue,
               AVG(revenue) AS avg_revenue
        FROM clean_data
    """).fetchdf()

    # Step 4: Export
    con.execute(f"COPY clean_data TO '{output_path}' (FORMAT PARQUET)")

    return f"Processed {raw_count} raw rows → {summary.to_string()}"

Use Case 4: RAG Data Retrieval (Structured)

In Retrieval-Augmented Generation (RAG) pipelines, DuckDB serves as the structured data retrieval layer alongside vector databases for unstructured content.

import duckdb

def retrieve_structured_context(user_question: str, db_path: str) -> str:
    """
    Hybrid RAG: use DuckDB for structured data, vector DB for unstructured.
    This handles the structured side.
    """
    con = duckdb.connect(db_path)

    # Naive keyword extraction from the user question (simplified);
    # the resulting pattern matches the keywords in order of appearance.
    keywords = [w for w in user_question.split() if len(w) > 4]
    pattern = f"%{'%'.join(keywords[:3])}%"

    result = con.execute("""
        SELECT title, summary, published_date, source
        FROM articles
        WHERE body ILIKE ?
        ORDER BY published_date DESC
        LIMIT 5
    """, [pattern]).fetchdf()

    return result.to_markdown(index=False)

Use Case 5: Agent Memory with DuckDB

DuckDB can serve as a lightweight persistent memory store for agents:

import duckdb

class AgentMemory:
    def __init__(self, db_path: str = "agent_memory.db"):
        self.con = duckdb.connect(db_path)
        # The id column draws from a sequence, which must exist first
        self.con.execute("CREATE SEQUENCE IF NOT EXISTS memory_seq")
        self.con.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id INTEGER DEFAULT nextval('memory_seq'),
                session_id VARCHAR,
                role VARCHAR,
                content TEXT,
                created_at TIMESTAMP DEFAULT current_timestamp
            )
        """)

    def save(self, session_id: str, role: str, content: str):
        self.con.execute(
            "INSERT INTO memories (session_id, role, content) VALUES (?, ?, ?)",
            [session_id, role, content]
        )

    def recall(self, session_id: str, limit: int = 10) -> list:
        return self.con.execute("""
            SELECT role, content, created_at
            FROM memories
            WHERE session_id = ?
            ORDER BY created_at DESC
            LIMIT ?
        """, [session_id, limit]).fetchall()

DuckDB as AI Agent Tool — Summary

| Agent Capability | DuckDB Role |
| --- | --- |
| Text-to-SQL execution | Execute LLM-generated SQL on any file |
| Schema discovery | DESCRIBE + SUMMARIZE for LLM context |
| Result summarization | Return DataFrames to LLM for reasoning |
| ETL in agent pipelines | Clean, transform, export without a server |
| Structured RAG | Query structured data alongside vector search |
| Agent memory | Lightweight persistent state in a .db file |
| Analytics tools | Window functions, aggregations, JOINs across files |

Common Pitfalls

| Pitfall | Consequence | Prevention |
| --- | --- | --- |
| LLM generates MySQL syntax | DuckDB syntax errors | Include "Use DuckDB SQL" in the system prompt |
| Agent runs no LIMIT on large files | Huge result returned to LLM | Add a LIMIT 100 guard in the tool wrapper |
| No error handling in tool wrapper | Agent crashes on bad SQL | Wrap execute in try/except and return the error message |

Quick Reference

import duckdb

# Query any file
duckdb.sql("SELECT * FROM 'data.csv' LIMIT 10").df()

# Get schema for LLM context
duckdb.sql("DESCRIBE SELECT * FROM 'data.parquet'").df()

# Summarize data for agent context
duckdb.sql("SUMMARIZE SELECT * FROM 'data.parquet'").df()

# Execute LLM-generated SQL safely
try:
    result = duckdb.sql(llm_generated_sql).df()
except Exception as e:
    result = f"SQL Error: {e}"

What's Next