DEV Community

Cover image for I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀

I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀

What If You Could Just... Ask Your Data a Question? 🤔

Most people who need insights from a data file are blocked by one simple thing: they don't know SQL. Even technically strong users often don't want to stop, inspect schema manually, write queries, debug syntax, and format results just to answer a quick question like "Which category has the highest revenue?" or "Show me null rates by column."

This project removes that friction entirely. Upload your file, type your question in plain English, and let a Gemma 4-powered agentic backend inspect the schema, generate DuckDB SQL, execute it, and return the results — right inside a clean chat interface. 🎯

👉 Try it live here: https://databro.dev/backend/ai-data-chat/


🛠️ The Full Stack at a Glance

Before diving deep, here's the architecture that powers this tool:

Layer Technology
Frontend Next.js (Portfolio App)
Edge Middleware Supabase Edge Functions
Backend API FastAPI on Google Cloud Run
AI Agents CrewAI
SQL Engine DuckDB
LLM Gemma 4 via Hugging Face Inference API
File Formats CSV, XLSX, Parquet, JSON, Arrow IPC

🌟 Meet the Gemma 4 Model Family

Gemma 4 is Google DeepMind's latest family of open models — and they are genuinely exciting for agentic builders. Unlike earlier iterations, Gemma 4 is built with logic-heavy, reasoning-oriented workflows in mind, which makes it a natural fit for tasks like natural-language-to-SQL generation.

Here's a builder-friendly overview of the lineup:

Model Context Window Why It's Interesting
Gemma 4 E2B 128K tokens Smallest effective-size model — ultra-deployable, great for edge and lightweight inference
Gemma 4 E4B 128K tokens Stronger capability with multimodal notes — a sweet middle ground for multi-step tasks
Gemma 4 26B A4B 256K tokens Mixture-of-Experts architecture — advanced reasoning with long context, perfect for agentic workflows
Gemma 4 31B 256K tokens Largest dense instruct-tuned model — maximum open-model capability for complex tasks

The AI Data Chat tool exposes google/gemma-4-31B-it and google/gemma-4-26B-A4B-it via Hugging Face endpoints. Once you see these models turn a fuzzy natural-language question into a working DuckDB query over your own uploaded file, it becomes very hard not to imagine ten more use cases. 🔥


🤖 A Brief Foundation on CrewAI

CrewAI is an open-source framework for orchestrating autonomous AI agents in structured multi-agent workflows. Instead of a single giant prompt doing everything, CrewAI lets you break a problem into specialized, coordinated responsibilities.

Three building blocks to understand:

🧑‍💼 Agents

Specialized workers, each with a defined role. Think of them as employees on your AI team — one might be a "Schema Inspector," another a "SQL Writer," another a "Result Formatter."

📋 Tasks

Units of work assigned to agents. A task has a clear description, expected output, and the agent responsible for it. Examples:

  • "Analyze the uploaded file and return its schema"
  • "Given this schema and user intent, write a valid DuckDB SQL query"
  • "Execute the SQL and format the result for the user"

🔧 Tools

Capabilities agents can invoke to act on the world — file inspection utilities, DuckDB query executors, schema extractors, etc.

This model is powerful because it creates transparent, inspectable pipelines instead of black-box AI magic. Every step has a purpose, and every output is traceable.


🏗️ How the Tool Is Built — End to End

Step 1 — The Next.js Frontend

The portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI:

  • Left panel: File upload zone (drag & drop), attached file display with name and size, LLM Settings panel with Provider + Model selectors
  • Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables

The user picks a file, selects a Gemma 4 model via Hugging Face, and types a natural-language question. The frontend packages everything into FormData (file + user intent + provider + model) and sends it to a Supabase Edge Function.

Step 2 — Supabase Edge Function

A lean TypeScript Edge Function validates the multipart request and proxies it to the Google Cloud Run backend. This keeps the frontend thin, avoids exposing Cloud Run endpoints directly to the browser, and centralizes auth concerns at the edge.

Step 3 — FastAPI on Google Cloud Run

The Cloud Run backend receives the uploaded file and user intent. It:

  1. Stores the file in a temporary directory
  2. Detects file type (CSV, Parquet, Arrow, JSON, XLSX)
  3. Loads the data into DuckDB as a data table
  4. Runs DESCRIBE data to extract the full schema
  5. Counts total rows
  6. Passes schema context + row count + user intent to the CrewAI agent pipeline

Step 4 — CrewAI Agent Pipeline

The agentic pipeline runs two coordinated tasks:

Task 1 — Schema Analysis
An agent uses a DuckDB tool to introspect the uploaded file and returns column names, types, and a row count. This grounding step is critical — the LLM sees real column names before generating SQL, which dramatically reduces hallucination.

Task 2 — SQL Generation + Execution
The schema and user intent are passed to the Gemma 4 model via Hugging Face. The LLM returns a valid DuckDB SQL query. The agent then executes that SQL against the uploaded file and serializes the results.

Step 5 — Response Back to the UI

The backend returns a structured JSON response containing:

  • model — the resolved model used
  • sql — the exact DuckDB SQL that was generated and run
  • schema — column definitions
  • total_rows — row count of the uploaded file
  • result — the query output rows

The frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in — you always see exactly what query was run.


🎬 Live Demo — Real Queries, Real Results

I tested the tool with a real e-commerce sales CSV (49 rows, 18 columns covering orders, customers, products, categories, regions, sales, profit, discounts, and returns). Here's what Gemma 4 + CrewAI + DuckDB produced for 6 progressively complex natural language questions.


✅ Test Setup — File Uploaded, Gemma 4 31B Selected

The file sample_test_file.csv is attached (8.7 KB). Provider set to Hugging Face, model set to Gemma 4 31B Instruct. The chat interface is ready with starter prompts visible.

![Upload & model selection — sample_test_file.csv loaded, Hugging Face + Gemma 4 31B Instruct selected

💡 The LLM Settings panel (visible by scrolling right on the chat panel) lets you switch between Gemma 4 31B Instruct and Gemma 4 26B A4B Instruct mid-session.


🟢 Query 1 — "Show me the top 10 rows"

What Gemma 4 generated:

SELECT * FROM data LIMIT 10
Enter fullscreen mode Exit fullscreen mode

Result: Processed 49 CSV rows, returned 10 of 10 matching rows. Clean tabular output showing order_id, customer_id, customer_name, customer_email, customer_segment and more.

Query 1 — Top 10 rows result with SQL EXECUTED panel and tabular output


🟢 Query 2 — "What are the key columns and null rates?"

This is where things get interesting. Instead of a simple SELECT, Gemma 4 understood intent — it generated a full null-rate audit query covering every single column:

What Gemma 4 generated:

SELECT
    count(*) AS total_rows,
    (count(*) - count(order_id)) * 100.0 / count(*) AS order_id_null_rate,
    (count(*) - count(customer_id)) * 100.0 / count(*) AS customer_id_null_rate,
    (count(*) - count(customer_name)) * 100.0 / count(*) AS customer_name_null_rate,
    (count(*) - count(customer_email)) * 100.0 / count(*) AS customer_email_null_rate,
    -- ... (every column covered)
    (count(*) - count(returned)) * 100.0 / count(*) AS returned_null_rate
FROM data
Enter fullscreen mode Exit fullscreen mode

Result: 49 total rows, 0% null rate across all 18 columns — a complete data quality snapshot in one shot, without the user writing a single line of SQL.

Query 2 — Full null rate audit SQL generated by Gemma 4, covering all 18 columns

Query 2 — SQL OUTPUT showing 49 total rows with 0 null rates across all key columns

🤯 This is the magic moment. A non-technical user typed a plain English question, and the system produced a production-quality data quality audit query autonomously.


🟡 Query 3 — "Show top 5 categories ranked by total sales and profit margin percentage"

What Gemma 4 generated:

SELECT
    category,
    SUM(sales) AS total_sales,
    (SUM(profit) / SUM(sales)) * 100 AS profit_margin_percentage
FROM data
GROUP BY category
ORDER BY total_sales DESC
LIMIT 5
Enter fullscreen mode Exit fullscreen mode

Result: 3 categories returned with calculated profit margins:

category total_sales profit_margin_percentage
Furniture 4438.45 -4.08% ⚠️
Technology 4228.69 +20.16%
Office Supplies 918.77 +37.42% 🏆

Insight unlocked: Furniture is the highest-revenue category but is running at a loss. Technology and Office Supplies are both profitable — Office Supplies has the best margin.

Query 3 — Category sales and profit margin percentage ranked, showing Furniture at a loss


🟡 Query 4 — "Show me monthly revenue trend with total orders count and average order value grouped by month"

What Gemma 4 generated:

SELECT
    date_trunc('month', order_date) AS month,
    SUM(sales) AS total_revenue,
    COUNT(order_id) AS total_orders,
    AVG(sales) AS avg_order_value
FROM data
GROUP BY date_trunc('month', order_date)
ORDER BY month
Enter fullscreen mode Exit fullscreen mode

Result:

month total_revenue total_orders avg_order_value
2024-01-01 6471.26 30 215.71
2024-02-01 3114.65 19 163.93

Gemma 4 correctly used DuckDB's native date_trunc('month', ...) function — not a workaround, not a SUBSTR hack — real DuckDB date intelligence applied from natural language.

Query 4 — Monthly revenue trend using date_trunc, showing Jan and Feb 2024 with order counts and avg values


🔴 Query 5 — "Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend"

What Gemma 4 generated:

SELECT
    customer_name,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(sales) AS total_spend
FROM data
GROUP BY customer_name
HAVING COUNT(DISTINCT order_id) > 1
ORDER BY total_spend DESC
Enter fullscreen mode Exit fullscreen mode

Result:

customer_name order_count total_spend
Alice Johnson 3 163.85

Gemma 4 correctly identified the need for HAVING COUNT(DISTINCT ...) > 1 — a classic SQL pattern that trips up many developers, generated correctly from plain English.

Query 5 — Repeat customer analysis using COUNT DISTINCT + HAVING, correctly identifying Alice Johnson


🔴 Query 6 — "Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first"

What Gemma 4 generated:

SELECT
    category,
    ship_mode,
    SUM(profit) AS total_profit_loss
FROM data
WHERE returned = 'Yes'
GROUP BY category, ship_mode
ORDER BY total_profit_loss ASC
Enter fullscreen mode Exit fullscreen mode

Result:

category ship_mode total_profit_loss
Furniture Second Class -200 🚨
Furniture Standard Class -155 ⚠️
Furniture First Class -20

Every single return came from Furniture. Second Class shipping returns caused the most financial damage. This is exactly the kind of insight a business analyst would spend hours finding — delivered in under 30 seconds via plain English.

Query 6 — Returned orders profit loss analysis by category and ship mode, showing Furniture returns at -$200, -$155, -$20


💡 Why This Is Genuinely Useful — No SQL Expertise Needed

Look at what just happened across those 6 queries:

What the user typed What DuckDB actually executed
"Show me the top 10 rows" SELECT * FROM data LIMIT 10
"What are the key columns and null rates?" 18-column null rate audit with (count(*) - count(col)) * 100.0 / count(*)
"Top categories by sales and profit margin" SUM + division for margin %, GROUP BY, ORDER BY
"Monthly revenue trend" date_trunc('month', order_date) + AVG + COUNT
"Customers with multiple orders" COUNT(DISTINCT ...) + HAVING clause
"Returned orders profit loss by category" WHERE returned = 'Yes' + GROUP BY + ORDER BY ASC

Top comments (0)