Conversa — A Multi-Agent AI Platform Powered by Gemma 4

#devchallenge #gemmachallenge #gemma #ai

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Conversa is a multi-agent AI platform built with Next.js (App Router) that transforms unstructured files — audio recordings, documents, and images — into structured, actionable intelligence. The platform consists of three specialized agents, each solving a distinct real-world problem:

🎙️ Meeting Analyzer (Audio Agent)

Upload a voice recording (MP3, WAV, M4A) and get back:

A full verbatim transcript via Groq Whisper
Key discussion points extracted by Gemma 4
Action items with clear ownership
Follow-up questions to keep the conversation moving

For large audio files (>25MB), the agent automatically compresses them in the browser — resampling to 16kHz mono WAV using the Web Audio API — before sending to the server.
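The WAV-encoding half of that compression step can be sketched as below. This is a minimal illustration, not the project's actual code: `downmixToMono` and `encodeWav16` are hypothetical helper names, and the sketch assumes the audio has already been resampled to 16 kHz (in a browser this could be done by decoding the file and rendering it through an `OfflineAudioContext` created with a 16000 Hz sample rate).

```typescript
// Hypothetical helper: mixes several channels down to mono by averaging.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// Hypothetical helper: encodes mono float samples in [-1, 1]
// as a 16-bit PCM WAV file (44-byte header followed by sample data).
function encodeWav16(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  }
  return buffer;
}
```

At 16 kHz, mono, 16 bits per sample, one minute of audio comes to about 1.9 MB, which is why this format keeps typical meeting recordings well below the 25MB threshold.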

📄 Brief Generator (Document Agent)

Upload a PDF or Word document and choose from 5 brief types:

Meeting Brief — agenda, discussion points, critical questions
Project Kickoff — goals, scope, roles, milestones
Client Proposal — executive summary, pricing overview
Interview Prep — questions, scorecard, red flags
SOP Generator — step-by-step procedures and checkpoints

Gemma 4's 256K context window lets the agent process the entire document in one pass, with no chunking and no information lost at chunk boundaries. Word documents (.docx/.doc) are automatically converted to PDF via mammoth + jsPDF before processing.
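One way to model the five brief types listed above is a typed lookup table that drives prompt construction. This is only a sketch: the `BriefType` keys, the `BRIEF_SPECS` table, and `buildBriefPrompt` are hypothetical names, and the prompt wording is an assumption rather than the prompt the app actually sends.

```typescript
// Hypothetical identifiers for the five brief types described in the text.
type BriefType = "meeting" | "kickoff" | "proposal" | "interview" | "sop";

interface BriefSpec {
  label: string;
  sections: string[];
}

// Labels and sections mirror the list in the post.
const BRIEF_SPECS: Record<BriefType, BriefSpec> = {
  meeting: { label: "Meeting Brief", sections: ["agenda", "discussion points", "critical questions"] },
  kickoff: { label: "Project Kickoff", sections: ["goals", "scope", "roles", "milestones"] },
  proposal: { label: "Client Proposal", sections: ["executive summary", "pricing overview"] },
  interview: { label: "Interview Prep", sections: ["questions", "scorecard", "red flags"] },
  sop: { label: "SOP Generator", sections: ["step-by-step procedures", "checkpoints"] },
};

// Builds an instruction string for the selected brief type (assumed wording).
function buildBriefPrompt(type: BriefType): string {
  const spec = BRIEF_SPECS[type];
  return (
    `Generate a ${spec.label} with these sections: ${spec.sections.join(", ")}. ` +
    "Respond with ONLY a valid JSON object."
  );
}
```

Keeping the section list in data rather than in five separate prompts means a new brief type only needs one more table entry.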

🖼️ Whiteboard Analyzer (Image Agent)

Upload a whiteboard photo or handwritten notes (JPG, PNG, WEBP) and receive:

Extracted text — every visible word, transcribed verbatim
Diagram & visual element descriptions — shapes, flows, connections
Structured summary — professional 2–4 sentence synthesis
Suggested next steps — 3–5 actionable recommendations

Images above 4MB are automatically compressed via Canvas API before upload to stay within Vercel's serverless function payload limit.
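The sizing decision that precedes such a Canvas re-encode might look like the sketch below. It rests on a loud assumption: that encoded file size scales roughly with pixel area, which is only a heuristic. `MAX_UPLOAD_BYTES` and `targetDimensions` are hypothetical names, and the actual app may pick dimensions differently.

```typescript
// 4MB threshold taken from the text; the constant name is hypothetical.
const MAX_UPLOAD_BYTES = 4 * 1024 * 1024;

interface Dimensions {
  width: number;
  height: number;
}

// Returns dimensions to draw onto the canvas. Assumes (heuristically) that
// file size is proportional to pixel area, so each side shrinks by
// sqrt(maxBytes / fileBytes).
function targetDimensions(
  width: number,
  height: number,
  fileBytes: number,
  maxBytes: number = MAX_UPLOAD_BYTES
): Dimensions {
  if (fileBytes <= maxBytes) return { width, height };
  const scale = Math.sqrt(maxBytes / fileBytes);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

In the browser, the returned dimensions would be applied by drawing the image onto a canvas of that size with `drawImage` and exporting it with `canvas.toBlob`. Because the area heuristic is approximate, checking the resulting blob size before upload is a sensible safeguard.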

All three agents stream results progressively via Server-Sent Events (SSE), so content appears section by section as Gemma 4 generates it — no waiting for the full response.
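On the client, progressive rendering depends on splitting the incoming stream into complete SSE events while holding back any partial event until more data arrives. A minimal sketch of that parsing step follows; `parseSseBuffer` is a hypothetical helper, the event names in the usage note are illustrative, and the sketch assumes the server uses `\n` line endings.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Splits buffered text into complete SSE events. Anything after the last
// blank-line separator is returned as `rest` so it can be prepended to the
// next chunk. Assumes "\n" line endings (SSE also permits "\r\n").
function parseSseBuffer(buffer: string): { events: SseEvent[]; rest: string } {
  const events: SseEvent[] = [];
  const blocks = buffer.split("\n\n");
  const rest = blocks.pop() ?? "";
  for (const block of blocks) {
    let eventName = "message"; // SSE default when no event field is present
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) {
        eventName = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        // Per the SSE format, one leading space after the colon is dropped.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event: eventName, data: dataLines.join("\n") });
    }
  }
  return { events, rest };
}
```

A reader loop would append each decoded chunk to `rest`, call `parseSseBuffer`, and render every returned event (for example a `status` event followed by one event per result field) as soon as it is complete.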

Demo

🌐 Live App: https://conversa-gemma4.vercel.app

Try uploading a meeting recording, a PDF report, or a whiteboard photo to see all three agents in action.

Code

📦 GitHub Repository: https://github.com/jefribulomakassar/conversa-gemma4

Tech stack:

Framework: Next.js 14 (App Router, TypeScript)
AI Model: google/gemma-4-26b-a4b-it via OpenRouter
Transcription: Groq Whisper (audio pipeline)
Deployment: Vercel
Styling: Pure CSS-in-JS (no UI library)

Key files:

app/
├── api/
│   ├── audio/route.ts      → Transcription + Gemma 4 analysis pipeline
│   ├── document/route.ts   → PDF extraction + brief generation pipeline
│   └── image/route.ts      → Base64 encoding + visual analysis pipeline
├── audio/page.tsx          → Meeting Analyzer UI
├── document/page.tsx       → Brief Generator UI
└── image/page.tsx          → Whiteboard Analyzer UI

How I Used Gemma 4

Model Choice: `google/gemma-4-26b-a4b-it` (26B MoE)

I chose Gemma 4 26B (the 26-billion-parameter Mixture-of-Experts variant; the a4b suffix indicates roughly 4B parameters active per token) for three specific reasons:

1. Multimodal capability for the Image Agent
The image agent sends each photo directly to the model as a base64-encoded image_url content block. Gemma 4's native vision support eliminates the need for a separate OCR service: the same model that generates structured summaries also reads handwritten text and interprets diagram flows.

2. 256K context window for the Document Agent
Models with shorter context windows force long documents to be chunked, which can lose information at chunk boundaries. Gemma 4's extended context lets the document agent ingest entire PDFs (legal contracts, project proposals, SOPs) in a single API call and reason over the full content holistically.

3. Structured JSON output reliability
All three agents require the model to return strict JSON (no markdown fences, no preamble). Gemma 4 26B consistently honors the system prompt instruction "respond with ONLY a valid JSON object" with temperature 0.2, which made the SSE streaming pipeline reliable without complex retry logic.
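Even with a model that follows the JSON-only instruction reliably, a small defensive parse is cheap insurance. The sketch below is an assumption about how such a guard could look, not the project's code; `parseModelJson` is a hypothetical helper that tolerates stray text or fences around the object.

```typescript
// Hypothetical guard: extracts the outermost JSON object from model output,
// ignoring any preamble, trailing text, or markdown fences around it.
function parseModelJson(raw: string): Record<string, unknown> {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  // JSON.parse throws a SyntaxError if the extracted span is not valid JSON.
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  return parsed as Record<string, unknown>;
}
```

For a response that is already clean JSON the function behaves like a plain `JSON.parse`, so it adds no cost on the common path.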

Pipeline Architecture

Each agent follows the same SSE streaming pattern:

Client uploads file
      ↓
Browser-side compression (if needed)
      ↓
POST /api/[agent] (FormData)
      ↓
Server: parse → convert → call Gemma 4 via OpenRouter
      ↓
Stream SSE events back: status → field1 → field2 → done
      ↓
Client renders results progressively
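The "status → field1 → field2 → done" step of this flow can be sketched on the server side as plain string formatting. The helper names `formatSseEvent` and `buildSseEvents`, the status message, and the field names in the usage note are all hypothetical.

```typescript
// Formats one SSE event. JSON.stringify never emits raw newlines,
// so the payload always fits on a single data line.
function formatSseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Produces the ordered event sequence for one analysis result:
// a status event, one event per result field, then a done event.
function buildSseEvents(fields: Record<string, unknown>): string[] {
  const events: string[] = [
    formatSseEvent("status", { message: "Analysis complete" }),
  ];
  for (const fieldName of Object.keys(fields)) {
    events.push(formatSseEvent(fieldName, fields[fieldName]));
  }
  events.push(formatSseEvent("done", {}));
  return events;
}
```

In a Next.js route handler, each of these strings could be encoded with `TextEncoder` and enqueued into a `ReadableStream` that is returned in a `Response` with the `Content-Type: text/event-stream` header, for example after calling `buildSseEvents({ keyPoints: [...], actionItems: [...] })`.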

The model is called with temperature: 0.2 and max_tokens: 3000 across all agents; the low temperature trades creative variation for consistent, parseable output.

For the audio agent, Gemma 4 receives the transcript text (produced by Groq Whisper) and extracts key points, action items, and follow-up questions — acting as a reasoning layer on top of the raw transcription.

For the document agent, the PDF is converted to base64 and passed in full to Gemma 4, which then generates structured brief sections streamed one by one as SSE section events.

For the image agent, the photo is passed as an image_url content block alongside a detailed JSON schema prompt, and Gemma 4 returns all four analysis fields in a single structured response.
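A request body for the image agent might be assembled as in the sketch below, using the OpenAI-compatible chat format that OpenRouter accepts. The function name `buildImageRequest`, the type names, and the system prompt wording are assumptions. The model ID, temperature, and max_tokens come from the text.

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatMessage {
  role: "system" | "user";
  content: string | ContentPart[];
}

interface ChatRequest {
  model: string;
  temperature: number;
  max_tokens: number;
  messages: ChatMessage[];
}

// Hypothetical builder: wraps a base64-encoded image in a data URL and
// pairs it with the schema prompt in a single user message.
function buildImageRequest(
  imageBase64: string,
  mimeType: string,
  schemaPrompt: string
): ChatRequest {
  return {
    model: "google/gemma-4-26b-a4b-it",
    temperature: 0.2,
    max_tokens: 3000,
    messages: [
      // Assumed system prompt, based on the instruction quoted in the post.
      { role: "system", content: "Respond with ONLY a valid JSON object." },
      {
        role: "user",
        content: [
          { type: "text", text: schemaPrompt },
          {
            type: "image_url",
            image_url: { url: `data:${mimeType};base64,${imageBase64}` },
          },
        ],
      },
    ],
  };
}
```

The resulting object would be sent as the JSON body of a POST to OpenRouter's chat completions endpoint with a bearer API key. The audio agent could reuse the same shape with a plain-text user message containing the transcript.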
