DuckDB for AI Agents

Learning Focus

Use this lesson to understand how DuckDB enables AI agents to query, analyze, and reason over structured data efficiently — without requiring a database server.

Why DuckDB for AI Agents?

AI agents and LLM-powered applications need to:

Query structured data in response to natural language requests
Run analytics on-the-fly over CSV / Parquet / JSON files
Operate without infrastructure — no database to spin up
Return results fast — LLM reasoning loops need low-latency tool calls
Handle arbitrary datasets — agents don't know the schema upfront

DuckDB satisfies all five requirements out of the box.

The Text-to-SQL Pattern

The most common AI agent pattern with DuckDB:

User Prompt
    ↓
LLM generates SQL
    ↓
Agent calls DuckDB tool
    ↓
DuckDB executes query on local file
    ↓
Result returned to LLM
    ↓
LLM formulates natural language answer

import duckdb
import openai

def run_analytics_query(user_question: str, data_path: str) -> str:
    # Step 1: Get schema so the LLM knows the table structure
    schema = duckdb.sql(f"DESCRIBE SELECT * FROM '{data_path}'").df().to_string()

    # Step 2: Ask LLM to generate SQL
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a SQL expert. Table schema:\n{schema}\nWrite DuckDB SQL to answer the question."},
            {"role": "user",   "content": user_question}
        ]
    )
    sql = response.choices[0].message.content.strip()

    # Step 3: Execute with DuckDB
    result = duckdb.sql(f"SELECT * FROM '{data_path}' -- {sql}").df()
    return result.to_markdown()

# Usage
answer = run_analytics_query(
    "Which product category had the highest revenue last quarter?",
    "sales_2024.parquet"
)
print(answer)

Use Case 1: Agent Tool — Query Any File

import duckdb

def query_data_file(file_path: str, sql_query: str) -> dict:
    """
    DuckDB tool for AI agents: executes SQL over any supported file.
    Supports: .csv, .parquet, .json, .xlsx
    """
    try:
        con = duckdb.connect()
        # Register the file as a virtual table
        result = con.execute(
            sql_query.replace("FROM data", f"FROM read_csv_auto('{file_path}')")
        ).df()
        return {"status": "success", "rows": len(result), "data": result.to_dict()}
    except Exception as e:
        return {"status": "error", "message": str(e)}

# Example agent tool call
result = query_data_file(
    "sales.csv",
    "SELECT region, SUM(revenue) AS total FROM data GROUP BY region"
)

Use Case 2: Auto Schema Discovery for LLM Context

import duckdb

def get_schema_for_llm(file_path: str) -> str:
    """Return a compact schema description for LLM prompt injection."""
    con = duckdb.connect()

    # Get column names and types
    schema = con.execute(
        f"DESCRIBE SELECT * FROM '{file_path}'"
    ).fetchdf()

    # Get sample rows
    sample = con.execute(
        f"SELECT * FROM '{file_path}' USING SAMPLE 3"
    ).fetchdf()

    return f"""
Table: data (from {file_path})
Columns:
{schema[['column_name','column_type']].to_string(index=False)}

Sample rows:
{sample.to_string(index=False)}
"""

# Inject into LLM system prompt
schema_context = get_schema_for_llm("customers.parquet")

Use Case 3: Agentic ETL Pipeline

import duckdb

def transform_and_analyze(input_path: str, output_path: str) -> str:
    """Agent pipeline: clean, transform, analyze, export."""
    con = duckdb.connect()

    # Step 1: Load raw data
    raw_count = con.execute(f"SELECT COUNT(*) FROM '{input_path}'").fetchone()[0]

    # Step 2: Clean and transform
    con.execute(f"""
        CREATE TABLE clean_data AS
        SELECT
            id,
            UPPER(TRIM(name))         AS name,
            LOWER(TRIM(email))        AS email,
            COALESCE(revenue, 0)      AS revenue,
            strptime(signup_date, '%d/%m/%Y')::DATE AS signup_date
        FROM '{input_path}'
        WHERE id IS NOT NULL AND email ILIKE '%@%.%'
    """)

    # Step 3: Analyze
    summary = con.execute("""
        SELECT COUNT(*) AS clean_rows, SUM(revenue) AS total_revenue, AVG(revenue) AS avg_revenue
        FROM clean_data
    """).fetchdf()

    # Step 4: Export
    con.execute(f"COPY clean_data TO '{output_path}' (FORMAT PARQUET)")

    return f"Processed {raw_count} raw rows → {summary.to_string()}"

Use Case 4: RAG Data Retrieval (Structured)

In Retrieval-Augmented Generation (RAG) pipelines, DuckDB serves as the structured data retrieval layer alongside vector databases for unstructured content.

import duckdb

def retrieve_structured_context(user_question: str, db_path: str) -> str:
    """
    Hybrid RAG: use DuckDB for structured data, vector DB for unstructured.
    This handles the structured side.
    """
    con = duckdb.connect(db_path)

    # Keywords extracted from user question (simplified)
    keywords = [w for w in user_question.split() if len(w) > 4]
    pattern  = f"%{'%'.join(keywords[:3])}%"

    result = con.execute(f"""
        SELECT title, summary, published_date, source
        FROM articles
        WHERE body ILIKE ?
        ORDER BY published_date DESC
        LIMIT 5
    """, [pattern]).fetchdf()

    return result.to_markdown(index=False)

Use Case 5: Agent Memory with DuckDB

DuckDB can serve as a lightweight persistent memory store for agents:

import duckdb
from datetime import datetime

class AgentMemory:
    def __init__(self, db_path: str = "agent_memory.db"):
        self.con = duckdb.connect(db_path)
        self.con.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id          INTEGER DEFAULT nextval('memory_seq'),
                session_id  VARCHAR,
                role        VARCHAR,
                content     TEXT,
                created_at  TIMESTAMP DEFAULT current_timestamp
            )
        """)

    def save(self, session_id: str, role: str, content: str):
        self.con.execute(
            "INSERT INTO memories (session_id, role, content) VALUES (?, ?, ?)",
            [session_id, role, content]
        )

    def recall(self, session_id: str, limit: int = 10) -> list:
        return self.con.execute("""
            SELECT role, content, created_at
            FROM memories
            WHERE session_id = ?
            ORDER BY created_at DESC
            LIMIT ?
        """, [session_id, limit]).fetchall()

DuckDB as AI Agent Tool — Summary

Agent Capability	DuckDB Role
Text-to-SQL execution	Execute LLM-generated SQL on any file
Schema discovery	`DESCRIBE` + `SUMMARIZE` for LLM context
Result summarization	Return DataFrames to LLM for reasoning
ETL in agent pipelines	Clean, transform, export without a server
Structured RAG	Query structured data alongside vector search
Agent memory	Lightweight persistent state in a `.db` file
Analytics tools	Window functions, aggregations, JOIN across files

Common Pitfalls

Pitfall	Consequence	Prevention
LLM generates MySQL syntax	DuckDB syntax errors	Include "Use DuckDB SQL" in system prompt
Agent runs no LIMIT on large files	Huge result returned to LLM	Add `LIMIT 100` guard in the tool wrapper
No error handling in tool wrapper	Agent crashes on bad SQL	Wrap execute in try/except and return error message

Quick Reference

import duckdb

# Query any file
duckdb.sql("SELECT * FROM 'data.csv' LIMIT 10").df()

# Get schema for LLM context
duckdb.sql("DESCRIBE SELECT * FROM 'data.parquet'").df()

# Summarize data for agent context
duckdb.sql("SUMMARIZE SELECT * FROM 'data.parquet'").df()

# Execute LLM-generated SQL safely
try:
    result = duckdb.sql(llm_generated_sql).df()
except Exception as e:
    result = f"SQL Error: {e}"

What's Next

Next: Module 2 — DuckDB Data Types - Understand the full DuckDB type system.
Module Overview - Return to this module index.

Why DuckDB for AI Agents?​

The Text-to-SQL Pattern​

Use Case 1: Agent Tool — Query Any File​

Use Case 2: Auto Schema Discovery for LLM Context​

Use Case 3: Agentic ETL Pipeline​

Use Case 4: RAG Data Retrieval (Structured)​

Use Case 5: Agent Memory with DuckDB​

DuckDB as AI Agent Tool — Summary​

Common Pitfalls​

Quick Reference​

What's Next​