DuckDB for SEO
Use this lesson to master DuckDB for SEO analytics, from keyword research and GSC analysis to log file mining and rank tracking, all with fast, in-process SQL.
Why DuckDB for SEO?
SEO professionals regularly work with:
- Google Search Console (GSC) keyword exports
- Keyword research tool exports (Ahrefs, SEMrush, Moz)
- Web server log files (hundreds of MB to GBs daily)
- Crawl data from Screaming Frog or Sitebulb
- Backlink data exports
- GA4 BigQuery exports
All of these are large tabular datasets — exactly what DuckDB is built for.
| SEO Data Source | Typical Size | DuckDB Fit |
|---|---|---|
| Keyword research export (CSV) | 10K–500K rows | Under 1 second |
| GSC performance export | Up to 25K rows/query | Instant aggregation |
| Server access logs | 1M–100M rows/day | Streamed via read_csv_auto |
| Crawl CSV (Screaming Frog) | 50K–500K rows | Instant |
| Backlink dataset (Ahrefs) | 1M–10M rows | Direct Parquet query |
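For the larger sources above, a one-time Parquet conversion makes every later query faster. A minimal sketch; the file names are placeholders for your own exports:

```sql
-- One-time conversion: large CSV export -> compressed Parquet
-- ('backlinks.csv' is a placeholder file name)
COPY (SELECT * FROM read_csv_auto('backlinks.csv', header=true))
TO 'backlinks.parquet' (FORMAT PARQUET);

-- Later queries hit the Parquet file directly
SELECT count(*) AS total_links FROM 'backlinks.parquet';
```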
Part 1: Keyword Research with DuckDB
What Data Do You Need?
Start by exporting keyword data from your preferred tool:
- Ahrefs: Export from Keyword Explorer → CSV
- SEMrush: Export from Keyword Magic Tool → CSV
- Google Keyword Planner: Download keyword ideas → CSV
- GSC: Performance → Search type: Web → Export
Typical keyword CSV columns:
keyword, search_volume, keyword_difficulty, cpc, clicks, impressions, ctr, position
Load Keyword Data
-- Preview keyword export
SELECT * FROM read_csv_auto('keyword_research.csv', header=true) LIMIT 10;
-- Check schema and data types
DESCRIBE SELECT * FROM read_csv_auto('keyword_research.csv', header=true);
-- Load into a persistent table for repeated analysis
CREATE TABLE keywords AS
SELECT * FROM read_csv_auto('keyword_research.csv', header=true);
-- Quick stats on your keyword set
SUMMARIZE keywords;
Keyword Opportunity Scoring
Find keywords with high volume, low difficulty, and decent CPC — the classic SEO sweet spot:
-- Opportunity score: balance volume vs difficulty
SELECT
keyword,
search_volume,
keyword_difficulty AS kd,
cpc,
ROUND(
(search_volume * 1.0 / NULLIF(keyword_difficulty, 0)) * (cpc + 0.5),
2
) AS opportunity_score
FROM keywords
WHERE search_volume >= 100
AND keyword_difficulty <= 50
AND cpc >= 0.5
ORDER BY opportunity_score DESC
LIMIT 50;
Low-Hanging Fruit Analysis
Keywords you already rank for (positions 5–20), where a small ranking improvement yields a large traffic gain:
-- From GSC data: keywords in positions 5-20 with decent impressions
SELECT
query AS keyword,
ROUND(AVG(position), 1) AS avg_position,
SUM(impressions) AS total_impressions,
SUM(clicks) AS total_clicks,
ROUND(SUM(clicks) * 100.0 / NULLIF(SUM(impressions), 0), 2) AS ctr_pct
FROM read_csv_auto('gsc_export.csv', header=true)
WHERE date >= DATE '2024-01-01'
GROUP BY query
HAVING avg_position BETWEEN 5 AND 20
AND total_impressions > 200
ORDER BY avg_position ASC
LIMIT 100;
Keyword Clustering by Topic
Group semantically related keywords together to identify content pillars:
-- Cluster keywords by their first two words as a rough topic label
SELECT
-- Extract first 2 words as topic cluster label
regexp_extract(lower(keyword), '^(\w+\s+\w+)', 1) AS topic_cluster,
COUNT(*) AS keyword_count,
SUM(search_volume) AS cluster_volume,
ROUND(AVG(keyword_difficulty), 1) AS avg_difficulty,
ROUND(AVG(cpc), 2) AS avg_cpc,
string_agg(keyword, ' | ' ORDER BY search_volume DESC) FILTER (WHERE search_volume > 100) AS top_keywords
FROM keywords
WHERE search_volume > 50
GROUP BY topic_cluster
HAVING keyword_count >= 3
ORDER BY cluster_volume DESC
LIMIT 30;
Keyword Gap Analysis
Find keywords your competitors rank for but you don't:
-- Load your keyword list and competitor keyword list
CREATE TABLE my_keywords AS
SELECT lower(trim(keyword)) AS keyword, search_volume, position
FROM read_csv_auto('my_site_keywords.csv', header=true);
CREATE TABLE competitor_keywords AS
SELECT lower(trim(keyword)) AS keyword, search_volume, competitor_position
FROM read_csv_auto('competitor_keywords.csv', header=true);
-- Keywords competitor ranks for that you don't cover
SELECT
c.keyword,
c.search_volume,
c.competitor_position
FROM competitor_keywords c
LEFT JOIN my_keywords m ON c.keyword = m.keyword
WHERE m.keyword IS NULL
AND c.search_volume >= 100
AND c.competitor_position <= 20
ORDER BY c.search_volume DESC
LIMIT 100;
Long-Tail Keyword Discovery
Identify long-tail opportunities (4+ word phrases) with low difficulty:
SELECT
keyword,
search_volume,
keyword_difficulty AS kd,
-- Count words in keyword
length(keyword) - length(replace(keyword, ' ', '')) + 1 AS word_count,
cpc
FROM keywords
WHERE length(keyword) - length(replace(keyword, ' ', '')) + 1 >= 4
AND keyword_difficulty <= 30
AND search_volume >= 20
ORDER BY search_volume DESC, kd ASC
LIMIT 100;
Search Intent Classification
Classify keywords by intent using pattern matching:
SELECT
keyword,
search_volume,
keyword_difficulty AS kd,
CASE
WHEN keyword ILIKE '%buy%' OR keyword ILIKE '%price%'
OR keyword ILIKE '%cheap%' OR keyword ILIKE '%order%'
OR keyword ILIKE '%deal%' OR keyword ILIKE '%discount%'
THEN 'Transactional'
WHEN keyword ILIKE '%how to%' OR keyword ILIKE '%what is%'
OR keyword ILIKE '%guide%' OR keyword ILIKE '%tutorial%'
OR keyword ILIKE '%step%' OR keyword ILIKE '%learn%'
THEN 'Informational'
WHEN keyword ILIKE '%best%' OR keyword ILIKE '%review%'
OR keyword ILIKE '% vs %' OR keyword ILIKE '%compare%'
OR keyword ILIKE '%top%' OR keyword ILIKE '%alternative%'
THEN 'Commercial Investigation'
WHEN keyword ILIKE '%near me%' OR keyword ILIKE '%login%'
OR keyword ILIKE '%website%' OR keyword ILIKE '%official%'
THEN 'Navigational'
ELSE 'Unclear'
END AS search_intent,
cpc
FROM keywords
ORDER BY search_volume DESC;
Aggregate Keyword Intent Distribution
-- Summarize your keyword set by intent
SELECT
CASE
WHEN keyword ILIKE '%buy%' OR keyword ILIKE '%price%' OR keyword ILIKE '%order%' THEN 'Transactional'
WHEN keyword ILIKE '%how to%' OR keyword ILIKE '%guide%' OR keyword ILIKE '%what is%' THEN 'Informational'
WHEN keyword ILIKE '%best%' OR keyword ILIKE '% vs %' OR keyword ILIKE '%review%' THEN 'Commercial'
WHEN keyword ILIKE '%near me%' OR keyword ILIKE '%login%' THEN 'Navigational'
ELSE 'Unclear'
END AS intent,
COUNT(*) AS keyword_count,
SUM(search_volume) AS total_volume,
ROUND(AVG(keyword_difficulty), 1) AS avg_kd
FROM keywords
GROUP BY intent
ORDER BY total_volume DESC;
SERP Feature Targeting
Identify keywords likely to have featured snippets (question-based):
SELECT
keyword,
search_volume,
keyword_difficulty AS kd,
CASE
WHEN keyword ILIKE 'what%' THEN 'Definition Snippet'
WHEN keyword ILIKE 'how%' THEN 'How-To Snippet'
WHEN keyword ILIKE 'why%' THEN 'Explanation Snippet'
WHEN keyword ILIKE 'best%' THEN 'List Snippet'
WHEN keyword ILIKE 'when%' THEN 'Factual Snippet'
ELSE 'Regular Result'
END AS target_serp_feature
FROM keywords
WHERE search_volume >= 100
ORDER BY search_volume DESC;
Part 2: Google Search Console Analysis
Load GSC Export and Clean
CREATE TABLE gsc AS
SELECT
query,
page,
CAST(clicks AS INTEGER) AS clicks,
CAST(impressions AS INTEGER) AS impressions,
CAST(ctr AS DOUBLE) AS ctr,
CAST(position AS DOUBLE) AS position,
strptime(date, '%Y-%m-%d')::DATE AS date
FROM read_csv_auto('gsc_export.csv', header=true);
Top Declining Keywords (Comparing Two Periods)
WITH period_a AS (
SELECT query, SUM(clicks) AS clicks_a, ROUND(AVG(position),1) AS pos_a
FROM gsc WHERE date BETWEEN DATE '2024-01-01' AND DATE '2024-03-31'
GROUP BY query
),
period_b AS (
SELECT query, SUM(clicks) AS clicks_b, ROUND(AVG(position),1) AS pos_b
FROM gsc WHERE date BETWEEN DATE '2024-04-01' AND DATE '2024-06-30'
GROUP BY query
)
SELECT
a.query,
a.clicks_a, b.clicks_b,
b.clicks_b - a.clicks_a AS click_change,
a.pos_a, b.pos_b,
b.pos_b - a.pos_a AS position_change
FROM period_a a
JOIN period_b b ON a.query = b.query
WHERE a.clicks_a > 50
ORDER BY click_change ASC -- worst decliners first
LIMIT 50;
Part 3: Server Log Analysis (Bot vs Human)
SELECT
-- log_time arrives as '[25/Dec/2024:10:00:00'; strip the bracket and parse without the offset
date_trunc('day', strptime(trim(log_time, '['), '%d/%b/%Y:%H:%M:%S')) AS day,
CASE
WHEN user_agent ILIKE '%Googlebot%' THEN 'Googlebot'
WHEN user_agent ILIKE '%bingbot%' THEN 'Bingbot'
WHEN user_agent ILIKE '%bot%' THEN 'Other Bots'
ELSE 'Human'
END AS visitor_type,
COUNT(*) AS requests
FROM read_csv_auto('access.log',
header = false,
delim = ' ',
columns = {
'ip': 'VARCHAR', 'dash1': 'VARCHAR', 'dash2': 'VARCHAR',
'log_time': 'VARCHAR', 'offset': 'VARCHAR', 'request': 'VARCHAR',
'status': 'INTEGER', 'bytes': 'INTEGER',
'referer': 'VARCHAR', 'user_agent': 'VARCHAR'
}
)
GROUP BY day, visitor_type
ORDER BY day, visitor_type;
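As a follow-up sketch, assuming the parsed log lines above have been materialized into a table named logs (a name chosen here for illustration), you can see where Googlebot spends its crawl budget by splitting the request line into a path:

```sql
-- Googlebot crawl budget per top-level site section
-- 'logs' is a hypothetical table holding the parsed columns from the query above
SELECT
-- request looks like 'GET /blog/post-1 HTTP/1.1'; take the path's first segment
regexp_extract(split_part(request, ' ', 2), '^(/[^/?]*)', 1) AS section,
COUNT(*) AS googlebot_hits,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct_of_crawl
FROM logs
WHERE user_agent ILIKE '%Googlebot%'
GROUP BY section
ORDER BY googlebot_hits DESC
LIMIT 20;
```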
The DuckDB SEO Analytics Workflow
flowchart LR
A[Keyword Tool Export\nGSC / Ahrefs / SEMrush] --> B[DuckDB read_csv_auto]
B --> C[Keyword Research\nOpportunity Scoring]
C --> D[Intent Classification\nClustering]
D --> E[Content Gap Analysis\nCompetitor Comparison]
E --> F[COPY TO Parquet\nor CSV Report]
F --> G[SEO Action Plan]
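The final COPY step of the workflow might look like this; the table and file names are placeholders:

```sql
-- Step F: export the analysis as a shareable CSV report
-- ('opportunities' and the output file name are placeholders)
COPY (
SELECT keyword, search_volume, keyword_difficulty, opportunity_score
FROM opportunities
ORDER BY opportunity_score DESC
) TO 'seo_action_plan.csv' (HEADER, DELIMITER ',');
```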
Practical Tips
- Convert large CSV exports to Parquet — once converted, keyword analysis runs in milliseconds
- Use SUMMARIZE on every new export to check for NULLs, range anomalies, and data quality issues
- Use ILIKE for all keyword matching, since keyword casing is rarely consistent across tools
- Use string_agg() to concatenate keyword lists per cluster for content briefs
- Use window functions for ranking and MoM/YoY comparisons without round-tripping through Excel
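The window-function tip can be sketched against the gsc table from Part 2, using LAG for a month-over-month click comparison:

```sql
-- Month-over-month clicks per query, no spreadsheet required
WITH monthly AS (
    SELECT query, date_trunc('month', date) AS month, SUM(clicks) AS clicks
    FROM gsc
    GROUP BY query, month
)
SELECT
    query,
    month,
    clicks,
    clicks - LAG(clicks) OVER (PARTITION BY query ORDER BY month) AS mom_change
FROM monthly
ORDER BY query, month;
```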
Common Pitfalls
| Pitfall | Consequence | Prevention |
|---|---|---|
| GSC date column not parsed as DATE | String comparisons silently fail | Cast: strptime(date_col, '%Y-%m-%d')::DATE |
| Log file with irregular delimiters | Wrong column mapping | Use explicit columns = {...} in read_csv_auto |
| Schema mismatch across monthly CSVs | Error merging files | Check with DESCRIBE SELECT * FROM 'month1.csv' |
| Division by zero in opportunity score | NULL or crash | Always NULLIF(kd, 0) before division |
| Long-tail word count off by one | Incorrect filtering | Use length(kw) - length(replace(kw,' ','')) + 1 |
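For the schema-mismatch pitfall, DuckDB's union_by_name option aligns columns by name when globbing multiple files; the file pattern here is a placeholder:

```sql
-- Merge monthly exports even if column order differs between files
CREATE TABLE gsc_all AS
SELECT * FROM read_csv_auto('gsc_export_2024-*.csv', header=true, union_by_name=true);
```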
Quick Reference
-- Load keyword export
CREATE TABLE keywords AS SELECT * FROM read_csv_auto('keywords.csv', header=true);
-- Opportunity score
SELECT keyword, search_volume / NULLIF(keyword_difficulty,0) * cpc AS score
FROM keywords ORDER BY score DESC LIMIT 50;
-- Long-tail keywords (4+ words)
SELECT keyword FROM keywords
WHERE length(keyword) - length(replace(keyword,' ','')) + 1 >= 4 AND keyword_difficulty <= 30;
-- Keyword gap (yours vs competitor)
SELECT c.keyword FROM competitor_keywords c
LEFT JOIN my_keywords m ON c.keyword = m.keyword WHERE m.keyword IS NULL;
-- Intent classification
SELECT keyword, CASE WHEN keyword ILIKE '%how%' THEN 'Informational'
WHEN keyword ILIKE '%buy%' THEN 'Transactional' ELSE 'Other' END
FROM keywords;
-- GSC: low-hanging fruit
SELECT query, ROUND(AVG(position), 1) AS avg_pos, SUM(impressions) AS impressions FROM gsc
GROUP BY query HAVING avg_pos BETWEEN 5 AND 20 ORDER BY avg_pos;
-- Convert to Parquet for speed
COPY (SELECT * FROM keywords) TO 'keywords.parquet' (FORMAT PARQUET);
What's Next
- Next: DuckDB for AI Agents - Use DuckDB as the analytical backbone for AI agent pipelines.
- Module Overview - Return to this module index.