What If You Could Just... Ask Your Data a Question? π€
Most people who need insights from a data file are blocked by one simple thing: they don't know SQL. Even technically strong users often don't want to stop, inspect schema manually, write queries, debug syntax, and format results just to answer a quick question like "Which category has the highest revenue?" or "Show me null rates by column."
This project removes that friction entirely. Upload your file, type your question in plain English, and let a Gemma 4-powered agentic backend inspect the schema, generate DuckDB SQL, execute it, and return the results β right inside a clean chat interface. π―
π Try the AI Data Chat Live Demo π
π οΈ The Full Stack at a Glance
Before diving deep, here's the architecture that powers this tool:
| Layer | Technology |
|---|---|
| Frontend | Next.js (Portfolio App) |
| Edge Middleware | Supabase Edge Functions |
| Backend API | FastAPI on Google Cloud Run |
| AI Agents | CrewAI |
| SQL Engine | DuckDB |
| LLM | Gemma 4 via Hugging Face Inference API |
| File Formats | CSV, XLSX, Parquet, JSON, Arrow IPC |
π Meet the Gemma 4 Model Family
Gemma 4 is Google DeepMind's latest family of open models β and they are genuinely exciting for agentic builders. Unlike earlier iterations, Gemma 4 is built with logic-heavy, reasoning-oriented workflows in mind, which makes it a natural fit for tasks like natural-language-to-SQL generation.
Here's a builder-friendly overview of the lineup:
| Model | Active Params | Context Window | Input Modalities | Ideal For |
|---|---|---|---|---|
| Gemma 4 E2B ποΈ | 2.3B (5.1B total) | 128K tokens | π Text + πΌοΈ Image + π Audio (Speech) | Ultra-mobile, edge, browser & IoT deployment β e.g. Pixel, Chrome |
| Gemma 4 E4B ποΈ | 4.5B (8B total) | 128K tokens | π Text + πΌοΈ Image + π Audio (Speech) | On-device agents with richer capability β real-time audio/visual processing |
| Gemma 4 26B A4B π¬ | 3.8B (25.2B total) | 256K tokens | π Text + πΌοΈ Image + π¬ Video (up to 60s) | High-throughput MoE reasoning β advanced agentic & long-context workflows |
| Gemma 4 31B π¬ | 30.7B | 256K tokens | π Text + πΌοΈ Image + π¬ Video (up to 60s) | Maximum open-model capability β frontier-grade reasoning on local hardware |
π Audio support (E2B & E4B): These models include a built-in USM-style conformer encoder that processes up to 30 seconds of audio at 16kHz. They handle automatic speech recognition (ASR), multilingual speech-to-text translation, and audio Q&A natively on-device β no cloud API call required. Each second of audio costs 25 tokens.
π¬ Video support (26B & 31B): The two larger models can analyze video sequences up to 60 seconds at 1 fps, covering scene understanding, temporal reasoning, and chart reading across frames.
πΌοΈ Image support (all models): Variable aspect ratio and resolution, configurable token budgets (70β1,120 tokens per image), OCR, document parsing, chart comprehension, and UI/screen understanding.
The AI Data Chat tool in this project exposes google/gemma-4-31B-it and google/gemma-4-26B-A4B-it via Hugging Face endpoints β the two most capable text+reasoning models in the family, ideal for the complex DuckDB SQL generation tasks you'll see below. π₯
π€ A Brief Foundation on CrewAI
CrewAI is an open-source framework for orchestrating autonomous AI agents in structured multi-agent workflows. Instead of a single giant prompt doing everything, CrewAI lets you break a problem into specialized, coordinated responsibilities.
Three building blocks to understand:
π§βπΌ Agents
Specialized workers, each with a defined role. Think of them as employees on your AI team β one might be a "Schema Inspector," another a "SQL Writer," another a "Result Formatter."
π Tasks
Units of work assigned to agents. A task has a clear description, expected output, and the agent responsible for it. Examples:
- "Analyze the uploaded file and return its schema"
- "Given this schema and user intent, write a valid DuckDB SQL query"
- "Execute the SQL and format the result for the user"
π§ Tools
Capabilities agents can invoke to act on the world β file inspection utilities, DuckDB query executors, schema extractors, etc.
This model is powerful because it creates transparent, inspectable pipelines instead of black-box AI magic. Every step has a purpose, and every output is traceable.
ποΈ How the Tool Is Built β End to End
Step 1 β The Next.js Frontend
The portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI:
- Left panel: File upload zone (drag & drop), attached file display with name and size, LLM Settings panel with Provider + Model selectors
- Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables
The user picks a file, selects a Gemma 4 model via Hugging Face, and types a natural-language question. The frontend packages everything into FormData (file + user intent + provider + model) and sends it to a Supabase Edge Function.
Step 2 β Supabase Edge Function
A lean TypeScript Edge Function validates the multipart request and proxies it to the Google Cloud Run backend. This keeps the frontend thin, avoids exposing Cloud Run endpoints directly to the browser, and centralizes auth concerns at the edge.
Step 3 β FastAPI on Google Cloud Run
The Cloud Run backend receives the uploaded file and user intent. It:
- Stores the file in a temporary directory
- Detects file type (CSV, Parquet, Arrow, JSON, XLSX)
- Loads the data into DuckDB as a
datatable - Runs
DESCRIBE datato extract the full schema - Counts total rows
- Passes schema context + row count + user intent to the CrewAI agent pipeline
Step 4 β CrewAI Agent Pipeline
The agentic pipeline runs two coordinated tasks:
Task 1 β Schema Analysis
An agent uses a DuckDB tool to introspect the uploaded file and returns column names, types, and a row count. This grounding step is critical β the LLM sees real column names before generating SQL, which dramatically reduces hallucination.
Task 2 β SQL Generation + Execution
The schema and user intent are passed to the Gemma 4 model via Hugging Face. The LLM returns a valid DuckDB SQL query. The agent then executes that SQL against the uploaded file and serializes the results.
Step 5 β Response Back to the UI
The backend returns a structured JSON response containing:
-
modelβ the resolved model used -
sqlβ the exact DuckDB SQL that was generated and run -
schemaβ column definitions -
total_rowsβ row count of the uploaded file -
resultβ the query output rows
The frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in β you always see exactly what query was run.
Even when the implementation keeps the flow compact, the mental model is powerful. Readers instantly understand that the system is not βAI guessing at data,β but a coordinated pipeline where each step has a purpose. β
How the data workflow works π
This project uses DuckDB as the execution engine over uploaded files. The backend detects file type, loads the data into a temporary data table, runs DESCRIBE data to extract schema, counts rows, and then uses the resulting schema context as grounding for SQL generation.
That design is important because it reduces hallucination risk. The LLM is not inventing column names from imagination; it sees the actual schema first, then turns the userβs natural-language intent into DuckDB SQL that is executed against the uploaded file.
At a high level, the flow looks like this:
This is the kind of architecture that makes AI feel grounded in engineering reality. Every step is inspectable, and every answer is backed by an actual query. π οΈ
π¬ Live Demo β Real Queries, Real Results
I tested the tool with a real e-commerce sales CSV (49 rows, 18 columns covering orders, customers, products, categories, regions, sales, profit, discounts, and returns). Here's what Gemma 4 + CrewAI + DuckDB produced for 6 progressively complex natural language questions.
β Test Setup β File Uploaded, Gemma 4 31B Selected
The file sample_test_file.csv is attached (8.7 KB). Provider set to Hugging Face, model set to Gemma 4 31B Instruct. The chat interface is ready with starter prompts visible.
π‘ The LLM Settings panel (visible by scrolling right on the chat panel) lets you switch between
Gemma 4 31B InstructandGemma 4 26B A4B Instructmid-session.
π’ Query 1 β "Show me the top 10 rows"
What Gemma 4 generated:
SELECT * FROM data LIMIT 10
Result: Processed 49 CSV rows, returned 10 of 10 matching rows. Clean tabular output showing order_id, customer_id, customer_name, customer_email, customer_segment and more.
π’ Query 2 β "What are the key columns and null rates?"
This is where things get interesting. Instead of a simple SELECT, Gemma 4 understood intent β it generated a full null-rate audit query covering every single column:
What Gemma 4 generated:
SELECT
count(*) AS total_rows,
(count(*) - count(order_id)) * 100.0 / count(*) AS order_id_null_rate,
(count(*) - count(customer_id)) * 100.0 / count(*) AS customer_id_null_rate,
(count(*) - count(customer_name)) * 100.0 / count(*) AS customer_name_null_rate,
(count(*) - count(customer_email)) * 100.0 / count(*) AS customer_email_null_rate,
-- ... (every column covered)
(count(*) - count(returned)) * 100.0 / count(*) AS returned_null_rate
FROM data
Result: 49 total rows, 0% null rate across all 18 columns β a complete data quality snapshot in one shot, without the user writing a single line of SQL.
π€― This is the magic moment. A non-technical user typed a plain English question, and the system produced a production-quality data quality audit query autonomously.
π‘ Query 3 β "Show top 5 categories ranked by total sales and profit margin percentage"
What Gemma 4 generated:
SELECT
category,
SUM(sales) AS total_sales,
(SUM(profit) / SUM(sales)) * 100 AS profit_margin_percentage
FROM data
GROUP BY category
ORDER BY total_sales DESC
LIMIT 5
Result: 3 categories returned with calculated profit margins:
| category | total_sales | profit_margin_percentage |
|---|---|---|
| Furniture | 4438.45 | -4.08% β οΈ |
| Technology | 4228.69 | +20.16% β |
| Office Supplies | 918.77 | +37.42% π |
Insight unlocked: Furniture is the highest-revenue category but is running at a loss. Technology and Office Supplies are both profitable β Office Supplies has the best margin.
π‘ Query 4 β "Show me monthly revenue trend with total orders count and average order value grouped by month"
What Gemma 4 generated:
SELECT
date_trunc('month', order_date) AS month,
SUM(sales) AS total_revenue,
COUNT(order_id) AS total_orders,
AVG(sales) AS avg_order_value
FROM data
GROUP BY date_trunc('month', order_date)
ORDER BY month
Result:
| month | total_revenue | total_orders | avg_order_value |
|---|---|---|---|
| 2024-01-01 | 6471.26 | 30 | 215.71 |
| 2024-02-01 | 3114.65 | 19 | 163.93 |
Gemma 4 correctly used DuckDB's native date_trunc('month', ...) function β not a workaround, not a SUBSTR hack β real DuckDB date intelligence applied from natural language.
π΄ Query 5 β "Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend"
What Gemma 4 generated:
SELECT
customer_name,
COUNT(DISTINCT order_id) AS order_count,
SUM(sales) AS total_spend
FROM data
GROUP BY customer_name
HAVING COUNT(DISTINCT order_id) > 1
ORDER BY total_spend DESC
Result:
| customer_name | order_count | total_spend |
|---|---|---|
| Alice Johnson | 3 | 163.85 |
Gemma 4 correctly identified the need for HAVING COUNT(DISTINCT ...) > 1 β a classic SQL pattern that trips up many developers, generated correctly from plain English.
π΄ Query 6 β "Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first"
What Gemma 4 generated:
SELECT
category,
ship_mode,
SUM(profit) AS total_profit_loss
FROM data
WHERE returned = 'Yes'
GROUP BY category, ship_mode
ORDER BY total_profit_loss ASC
Result:
| category | ship_mode | total_profit_loss |
|---|---|---|
| Furniture | Second Class | -200 π¨ |
| Furniture | Standard Class | -155 β οΈ |
| Furniture | First Class | -20 |
Every single return came from Furniture. Second Class shipping returns caused the most financial damage. This is exactly the kind of insight a business analyst would spend hours finding β delivered in under 30 seconds via plain English.
π‘ Why This Is Genuinely Useful β No SQL Expertise Needed
Look at what just happened across those 6 queries:
| What the user typed | What DuckDB actually executed |
|---|---|
| "Show me the top 10 rows" | SELECT * FROM data LIMIT 10 |
| "What are the key columns and null rates?" | 18-column null rate audit with (count(*) - count(col)) * 100.0 / count(*)
|
| "Top categories by sales and profit margin" |
SUM + division for margin %, GROUP BY, ORDER BY
|
| "Monthly revenue trend" |
date_trunc('month', order_date) + AVG + COUNT
|
| "Customers with multiple orders" |
COUNT(DISTINCT ...) + HAVING clause |
| "Returned orders profit loss by category" |
WHERE returned = 'Yes' + GROUP BY + ORDER BY ASC
|
Why this is useful to end users π
The biggest value is accessibility. A user does not need to know SQL, DuckDB syntax, schema introspection, or data-loading APIs to get insights from a file. They can simply ask questions like βShow me the top 10 rows,β βWhat are the null rates?β, βFind duplicates,β or βGive me a distribution summary,β which the UI even suggests as starter prompts.
That changes the audience dramatically. Analysts can move faster, product managers can self-serve, recruiters or business users can inspect exported CSVs, and students can explore datasets without first learning query languages.
It also improves transparency compared with many βAI chat with your dataβ demos. The app returns the generated SQL alongside the tabular output, so users can verify what was run and build trust in the response rather than blindly accepting a paragraph from a model. π
Why this makes readers curious about Gemma 4 π
There is something deeply compelling about watching an open model turn vague human language into a working analytic query. Once readers see Gemma 4 participating in a real stack β frontend, edge function, backend, SQL engine, and agent workflow β the model stops being an abstract release and starts feeling like a tool they can ship with.
That is where curiosity grows. If Gemma 4 can help power natural-language analytics over local files today, what else could it do tomorrow β internal BI copilots, document intelligence, edge-side assistants, debugging copilots, data quality triage, or multimodal workflow agents? π€―
And because Gemma 4 is available in multiple sizes and distributed through platforms like Hugging Face, it invites experimentation instead of gatekeeping it. For builders, that is a powerful combination: capability, openness, and a very tangible path from prototype to product.
Final thoughts βοΈ
This project started as a portfolio build, but it became a strong example of where agentic AI gets genuinely useful: reducing the distance between a personβs question and the data-backed answer they need.
Gemma 4 gives that experience an extra spark. It feels like the kind of model family that makes developers want to try one more workflow, one more agent, one more product idea β and that is exactly the energy a great open model should create. π









Top comments (0)