Kumaravelu Saraboji Mahalingam

Posted on May 22

I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀

#gemma #ai #showdev #crewai

What If You Could Just... Ask Your Data a Question? 🤔

Most people who need insights from a data file are blocked by one simple thing: they don't know SQL. Even technically strong users often don't want to stop, inspect schema manually, write queries, debug syntax, and format results just to answer a quick question like "Which category has the highest revenue?" or "Show me null rates by column."

This project removes that friction entirely. Upload your file, type your question in plain English, and let a Gemma 4-powered agentic backend inspect the schema, generate DuckDB SQL, execute it, and return the results — right inside a clean chat interface. 🎯

👉 Try the AI Data Chat Live Demo 🚀

🛠️ The Full Stack at a Glance

Before diving deep, here's the architecture that powers this tool:

Layer	Technology
Frontend	Next.js (Portfolio App)
Edge Middleware	Supabase Edge Functions
Backend API	FastAPI on Google Cloud Run
AI Agents	CrewAI
SQL Engine	DuckDB
LLM	Gemma 4 via Hugging Face Inference API
File Formats	CSV, XLSX, Parquet, JSON, Arrow IPC

🌟 Meet the Gemma 4 Model Family

Gemma 4 is Google DeepMind's latest family of open models — and they are genuinely exciting for agentic builders. Unlike earlier iterations, Gemma 4 is built with logic-heavy, reasoning-oriented workflows in mind, which makes it a natural fit for tasks like natural-language-to-SQL generation.

Here's a builder-friendly overview of the lineup:

Model	Active Params	Context Window	Input Modalities	Ideal For
Gemma 4 E2B 🎙️	2.3B (5.1B total)	128K tokens	📝 Text + 🖼️ Image + 🔊 Audio (Speech)	Ultra-mobile, edge, browser & IoT deployment — e.g. Pixel, Chrome
Gemma 4 E4B 🎙️	4.5B (8B total)	128K tokens	📝 Text + 🖼️ Image + 🔊 Audio (Speech)	On-device agents with richer capability — real-time audio/visual processing
Gemma 4 26B A4B 🎬	3.8B (25.2B total)	256K tokens	📝 Text + 🖼️ Image + 🎬 Video (up to 60s)	High-throughput MoE reasoning — advanced agentic & long-context workflows
Gemma 4 31B 🎬	30.7B	256K tokens	📝 Text + 🖼️ Image + 🎬 Video (up to 60s)	Maximum open-model capability — frontier-grade reasoning on local hardware

🔊 Audio support (E2B & E4B): These models include a built-in USM-style conformer encoder that processes up to 30 seconds of audio at 16kHz. They handle automatic speech recognition (ASR), multilingual speech-to-text translation, and audio Q&A natively on-device — no cloud API call required. Each second of audio costs 25 tokens.

🎬 Video support (26B & 31B): The two larger models can analyze video sequences up to 60 seconds at 1 fps, covering scene understanding, temporal reasoning, and chart reading across frames.

🖼️ Image support (all models): Variable aspect ratio and resolution, configurable token budgets (70–1,120 tokens per image), OCR, document parsing, chart comprehension, and UI/screen understanding.

The AI Data Chat tool in this project exposes google/gemma-4-31B-it and google/gemma-4-26B-A4B-it via Hugging Face endpoints — the two most capable text+reasoning models in the family, ideal for the complex DuckDB SQL generation tasks you'll see below. 🔥

🤖 A Brief Foundation on CrewAI

CrewAI is an open-source framework for orchestrating autonomous AI agents in structured multi-agent workflows. Instead of a single giant prompt doing everything, CrewAI lets you break a problem into specialized, coordinated responsibilities.

Three building blocks to understand:

🧑‍💼 Agents

Specialized workers, each with a defined role. Think of them as employees on your AI team — one might be a "Schema Inspector," another a "SQL Writer," another a "Result Formatter."

📋 Tasks

Units of work assigned to agents. A task has a clear description, expected output, and the agent responsible for it. Examples:

"Analyze the uploaded file and return its schema"
"Given this schema and user intent, write a valid DuckDB SQL query"
"Execute the SQL and format the result for the user"

🔧 Tools

Capabilities agents can invoke to act on the world — file inspection utilities, DuckDB query executors, schema extractors, etc.

This model is powerful because it creates transparent, inspectable pipelines instead of black-box AI magic. Every step has a purpose, and every output is traceable.

🏗️ How the Tool Is Built — End to End

Step 1 — The Next.js Frontend

The portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI:

Left panel: File upload zone (drag & drop), attached file display with name and size, LLM Settings panel with Provider + Model selectors
Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables

The user picks a file, selects a Gemma 4 model via Hugging Face, and types a natural-language question. The frontend packages everything into FormData (file + user intent + provider + model) and sends it to a Supabase Edge Function.

Step 2 — Supabase Edge Function

A lean TypeScript Edge Function validates the multipart request and proxies it to the Google Cloud Run backend. This keeps the frontend thin, avoids exposing Cloud Run endpoints directly to the browser, and centralizes auth concerns at the edge.

Step 3 — FastAPI on Google Cloud Run

The Cloud Run backend receives the uploaded file and user intent. It:

Stores the file in a temporary directory
Detects file type (CSV, Parquet, Arrow, JSON, XLSX)
Loads the data into DuckDB as a data table
Runs DESCRIBE data to extract the full schema
Counts total rows
Passes schema context + row count + user intent to the CrewAI agent pipeline

Step 4 — CrewAI Agent Pipeline

The agentic pipeline runs two coordinated tasks:

Task 1 — Schema Analysis
An agent uses a DuckDB tool to introspect the uploaded file and returns column names, types, and a row count. This grounding step is critical — the LLM sees real column names before generating SQL, which dramatically reduces hallucination.

Task 2 — SQL Generation + Execution
The schema and user intent are passed to the Gemma 4 model via Hugging Face. The LLM returns a valid DuckDB SQL query. The agent then executes that SQL against the uploaded file and serializes the results.

Step 5 — Response Back to the UI

The backend returns a structured JSON response containing:

model — the resolved model used
sql — the exact DuckDB SQL that was generated and run
schema — column definitions
total_rows — row count of the uploaded file
result — the query output rows

The frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in — you always see exactly what query was run.

Even when the implementation keeps the flow compact, the mental model is powerful. Readers instantly understand that the system is not “AI guessing at data,” but a coordinated pipeline where each step has a purpose. ✅

How the data workflow works 🔄

This project uses DuckDB as the execution engine over uploaded files. The backend detects file type, loads the data into a temporary data table, runs DESCRIBE data to extract schema, counts rows, and then uses the resulting schema context as grounding for SQL generation.

That design is important because it reduces hallucination risk. The LLM is not inventing column names from imagination; it sees the actual schema first, then turns the user’s natural-language intent into DuckDB SQL that is executed against the uploaded file.

At a high level, the flow looks like this:

This is the kind of architecture that makes AI feel grounded in engineering reality. Every step is inspectable, and every answer is backed by an actual query. 🛠️

🎬 Live Demo — Real Queries, Real Results

I tested the tool with a real e-commerce sales CSV (49 rows, 18 columns covering orders, customers, products, categories, regions, sales, profit, discounts, and returns). Here's what Gemma 4 + CrewAI + DuckDB produced for 6 progressively complex natural language questions.

✅ Test Setup — File Uploaded, Gemma 4 31B Selected

The file sample_test_file.csv is attached (8.7 KB). Provider set to Hugging Face, model set to Gemma 4 31B Instruct. The chat interface is ready with starter prompts visible.

💡 The LLM Settings panel (visible by scrolling right on the chat panel) lets you switch between Gemma 4 31B Instruct and Gemma 4 26B A4B Instruct mid-session.

🟢 Query 1 — "Show me the top 10 rows"

What Gemma 4 generated:

SELECT * FROM data LIMIT 10

Result: Processed 49 CSV rows, returned 10 of 10 matching rows. Clean tabular output showing order_id, customer_id, customer_name, customer_email, customer_segment and more.

🟢 Query 2 — "What are the key columns and null rates?"

This is where things get interesting. Instead of a simple SELECT, Gemma 4 understood intent — it generated a full null-rate audit query covering every single column:

What Gemma 4 generated:

SELECT
    count(*) AS total_rows,
    (count(*) - count(order_id)) * 100.0 / count(*) AS order_id_null_rate,
    (count(*) - count(customer_id)) * 100.0 / count(*) AS customer_id_null_rate,
    (count(*) - count(customer_name)) * 100.0 / count(*) AS customer_name_null_rate,
    (count(*) - count(customer_email)) * 100.0 / count(*) AS customer_email_null_rate,
    -- ... (every column covered)
    (count(*) - count(returned)) * 100.0 / count(*) AS returned_null_rate
FROM data

Result: 49 total rows, 0% null rate across all 18 columns — a complete data quality snapshot in one shot, without the user writing a single line of SQL.

🤯 This is the magic moment. A non-technical user typed a plain English question, and the system produced a production-quality data quality audit query autonomously.

🟡 Query 3 — "Show top 5 categories ranked by total sales and profit margin percentage"

What Gemma 4 generated:

SELECT
    category,
    SUM(sales) AS total_sales,
    (SUM(profit) / SUM(sales)) * 100 AS profit_margin_percentage
FROM data
GROUP BY category
ORDER BY total_sales DESC
LIMIT 5

Result: 3 categories returned with calculated profit margins:

category	total_sales	profit_margin_percentage
Furniture	4438.45	-4.08% ⚠️
Technology	4228.69	+20.16% ✅
Office Supplies	918.77	+37.42% 🏆

Insight unlocked: Furniture is the highest-revenue category but is running at a loss. Technology and Office Supplies are both profitable — Office Supplies has the best margin.

🟡 Query 4 — "Show me monthly revenue trend with total orders count and average order value grouped by month"

What Gemma 4 generated:

SELECT
    date_trunc('month', order_date) AS month,
    SUM(sales) AS total_revenue,
    COUNT(order_id) AS total_orders,
    AVG(sales) AS avg_order_value
FROM data
GROUP BY date_trunc('month', order_date)
ORDER BY month

Result:

month	total_revenue	total_orders	avg_order_value
2024-01-01	6471.26	30	215.71
2024-02-01	3114.65	19	163.93

Gemma 4 correctly used DuckDB's native date_trunc('month', ...) function — not a workaround, not a SUBSTR hack — real DuckDB date intelligence applied from natural language.

🔴 Query 5 — "Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend"

What Gemma 4 generated:

SELECT
    customer_name,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(sales) AS total_spend
FROM data
GROUP BY customer_name
HAVING COUNT(DISTINCT order_id) > 1
ORDER BY total_spend DESC

Result:

customer_name	order_count	total_spend
Alice Johnson	3	163.85

Gemma 4 correctly identified the need for HAVING COUNT(DISTINCT ...) > 1 — a classic SQL pattern that trips up many developers, generated correctly from plain English.

🔴 Query 6 — "Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first"

What Gemma 4 generated:

SELECT
    category,
    ship_mode,
    SUM(profit) AS total_profit_loss
FROM data
WHERE returned = 'Yes'
GROUP BY category, ship_mode
ORDER BY total_profit_loss ASC

Result:

category	ship_mode	total_profit_loss
Furniture	Second Class	-200 🚨
Furniture	Standard Class	-155 ⚠️
Furniture	First Class	-20

Every single return came from Furniture. Second Class shipping returns caused the most financial damage. This is exactly the kind of insight a business analyst would spend hours finding — delivered in under 30 seconds via plain English.

💡 Why This Is Genuinely Useful — No SQL Expertise Needed

Look at what just happened across those 6 queries:

What the user typed	What DuckDB actually executed
"Show me the top 10 rows"	`SELECT * FROM data LIMIT 10`
"What are the key columns and null rates?"	18-column null rate audit with `(count() - count(col)) 100.0 / count(*)`
"Top categories by sales and profit margin"	`SUM` + division for margin `%`, `GROUP BY`, `ORDER BY`
"Monthly revenue trend"	`date_trunc('month', order_date)` + `AVG` + `COUNT`
"Customers with multiple orders"	`COUNT(DISTINCT ...)` + `HAVING` clause
"Returned orders profit loss by category"	`WHERE returned = 'Yes'` + `GROUP BY` + `ORDER BY ASC`

Why this is useful to end users 📊

The biggest value is accessibility. A user does not need to know SQL, DuckDB syntax, schema introspection, or data-loading APIs to get insights from a file. They can simply ask questions like “Show me the top 10 rows,” “What are the null rates?”, “Find duplicates,” or “Give me a distribution summary,” which the UI even suggests as starter prompts.

That changes the audience dramatically. Analysts can move faster, product managers can self-serve, recruiters or business users can inspect exported CSVs, and students can explore datasets without first learning query languages.

It also improves transparency compared with many “AI chat with your data” demos. The app returns the generated SQL alongside the tabular output, so users can verify what was run and build trust in the response rather than blindly accepting a paragraph from a model. 🔍

Why this makes readers curious about Gemma 4 🌟

There is something deeply compelling about watching an open model turn vague human language into a working analytic query. Once readers see Gemma 4 participating in a real stack — frontend, edge function, backend, SQL engine, and agent workflow — the model stops being an abstract release and starts feeling like a tool they can ship with.

That is where curiosity grows. If Gemma 4 can help power natural-language analytics over local files today, what else could it do tomorrow — internal BI copilots, document intelligence, edge-side assistants, debugging copilots, data quality triage, or multimodal workflow agents? 🤯

And because Gemma 4 is available in multiple sizes and distributed through platforms like Hugging Face, it invites experimentation instead of gatekeeping it. For builders, that is a powerful combination: capability, openness, and a very tangible path from prototype to product.

Final thoughts ✍️

This project started as a portfolio build, but it became a strong example of where agentic AI gets genuinely useful: reducing the distance between a person’s question and the data-backed answer they need.

Gemma 4 gives that experience an extra spark. It feels like the kind of model family that makes developers want to try one more workflow, one more agent, one more product idea — and that is exactly the energy a great open model should create. 🚀

DEV Community

I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀

What If You Could Just... Ask Your Data a Question? 🤔

🛠️ The Full Stack at a Glance

🌟 Meet the Gemma 4 Model Family

🤖 A Brief Foundation on CrewAI

🧑‍💼 Agents

📋 Tasks

🔧 Tools

🏗️ How the Tool Is Built — End to End

Step 1 — The Next.js Frontend

Step 2 — Supabase Edge Function

Step 3 — FastAPI on Google Cloud Run

Step 4 — CrewAI Agent Pipeline

Step 5 — Response Back to the UI

How the data workflow works 🔄

🎬 Live Demo — Real Queries, Real Results

✅ Test Setup — File Uploaded, Gemma 4 31B Selected

🟢 Query 1 — "Show me the top 10 rows"

🟢 Query 2 — "What are the key columns and null rates?"

🟡 Query 3 — "Show top 5 categories ranked by total sales and profit margin percentage"

🟡 Query 4 — "Show me monthly revenue trend with total orders count and average order value grouped by month"

🔴 Query 5 — "Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend"

🔴 Query 6 — "Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first"

💡 Why This Is Genuinely Useful — No SQL Expertise Needed

Why this is useful to end users 📊

Why this makes readers curious about Gemma 4 🌟

Final thoughts ✍️

Top comments (0)