Conversa — A Multi-Agent AI Platform Powered by Gemma 4
This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
Conversa is a multi-agent AI platform built with Next.js (App Router) that transforms unstructured files — audio recordings, documents, and images — into structured, actionable intelligence. The platform consists of three specialized agents, each solving a distinct real-world problem:
🎙️ Meeting Analyzer (Audio Agent)
Upload a voice recording (MP3, WAV, M4A) and get back:
- A full verbatim transcript via Groq Whisper
- Key discussion points extracted by Gemma 4
- Action items with clear ownership
- Follow-up questions to keep the conversation moving
For large audio files (>25MB), the agent automatically compresses them in the browser — resampling to 16kHz mono WAV using the Web Audio API — before sending to the server.
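The last step of that compression, writing the resampled samples out as a WAV file, can be sketched as below. This is a minimal illustration, not the app's actual code: `encodeWavMono16` is a hypothetical helper name, and it assumes the audio has already been downmixed and resampled to 16 kHz mono (in the browser this would typically be done with the Web Audio API's `OfflineAudioContext`).

```typescript
// Hypothetical helper (not from the Conversa repo): wrap mono Float32 samples
// in a 16-bit PCM WAV container. Assumes resampling already happened upstream.
function encodeWavMono16(samples: Float32Array, sampleRate: number = 16000): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical WAV header
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // remaining file size after this field
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, 1, true); // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample to [-1, 1] and scale to the signed 16-bit range.
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  });

  return buffer;
}
```

At 16 kHz, mono, 16 bits per sample, the output costs 32,000 bytes per second of audio plus the 44-byte header, which is why this format shrinks most stereo, high-sample-rate recordings considerably.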
📄 Brief Generator (Document Agent)
Upload a PDF or Word document and choose from 5 brief types:
- Meeting Brief — agenda, discussion points, critical questions
- Project Kickoff — goals, scope, roles, milestones
- Client Proposal — executive summary, pricing overview
- Interview Prep — questions, scorecard, red flags
- SOP Generator — step-by-step procedures and checkpoints
Gemma 4's 256K context window processes the entire document in one pass — no chunking, no information loss. Word documents (.docx/.doc) are automatically converted to PDF via mammoth + jsPDF before processing.
🖼️ Whiteboard Analyzer (Image Agent)
Upload a whiteboard photo or handwritten notes (JPG, PNG, WEBP) and receive:
- Extracted text — every visible word, transcribed verbatim
- Diagram & visual element descriptions — shapes, flows, connections
- Structured summary — professional 2–4 sentence synthesis
- Suggested next steps — 3–5 actionable recommendations
Images above 4MB are automatically compressed via Canvas API before upload to stay within Vercel's serverless function payload limit.
All three agents stream results progressively via Server-Sent Events (SSE), so content appears section by section as Gemma 4 generates it — no waiting for the full response.
Demo
🌐 Live App: https://conversa-gemma4.vercel.app
Try uploading a meeting recording, a PDF report, or a whiteboard photo to see all three agents in action.
Code
📦 GitHub Repository: https://github.com/jefribulomakassar/conversa-gemma4
Tech stack:
- Framework: Next.js 14 (App Router, TypeScript)
- AI Model: google/gemma-4-26b-a4b-it via OpenRouter
- Transcription: Groq Whisper (audio pipeline)
- Deployment: Vercel
- Styling: Pure CSS-in-JS (no UI library)
Key files:
app/
├── api/
│ ├── audio/route.ts → Transcription + Gemma 4 analysis pipeline
│ ├── document/route.ts → PDF extraction + brief generation pipeline
│ └── image/route.ts → Base64 encoding + visual analysis pipeline
├── audio/page.tsx → Meeting Analyzer UI
├── document/page.tsx → Brief Generator UI
└── image/page.tsx → Whiteboard Analyzer UI
How I Used Gemma 4
Model Choice: google/gemma-4-26b-a4b-it (26B MoE)
I chose Gemma 4 26B (the 26-billion-parameter Mixture-of-Experts variant, a4b architecture) for three specific reasons:
1. Multimodal capability for the Image Agent
The image agent sends photos directly as base64 image_url to the model. Gemma 4's native vision support eliminates the need for a separate OCR service — the same model that generates structured summaries also reads handwritten text and interprets diagram flows.
2. 256K context window for the Document Agent
Many open models have context windows too short for long documents and force chunking, which can lose information at chunk boundaries. Gemma 4's extended context lets the document agent ingest entire PDFs (legal contracts, project proposals, SOPs) in a single API call and reason over the full content holistically.
3. Structured JSON output reliability
All three agents require the model to return strict JSON (no markdown fences, no preamble). Gemma 4 26B consistently honors the system prompt instruction "respond with ONLY a valid JSON object" at temperature 0.2, which kept the SSE streaming pipeline reliable without complex retry logic.
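Even with a model that follows the instruction well, a small defensive parser is cheap insurance. The sketch below is an assumption about how such a guard could look, not code from the repository; `parseModelJson` is a hypothetical name. It tolerates stray preamble or fences by slicing from the first opening brace to the last closing brace before parsing.

```typescript
// Hypothetical helper (not from the Conversa repo): pull a JSON object out of
// a model reply, even if the model wrapped it in extra text or fences.
function parseModelJson(raw: string): Record<string, unknown> {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  // JSON.parse throws a SyntaxError if the sliced text is not valid JSON.
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Model output is not a JSON object");
  }
  return parsed as Record<string, unknown>;
}
```

Slicing on the outermost braces is deliberately simple: it handles the common failure modes (a leading "Sure, here is the JSON" line or a wrapping fence) and lets `JSON.parse` reject anything else.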
Pipeline Architecture
Each agent follows the same SSE streaming pattern:
Client uploads file
↓
Browser-side compression (if needed)
↓
POST /api/[agent] (FormData)
↓
Server: parse → convert → call Gemma 4 via OpenRouter
↓
Stream SSE events back: status → field1 → field2 → done
↓
Client renders results progressively
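The event framing in that flow can be sketched as a pair of helpers: one that formats an event on the server and one that splits a received text buffer back into events on the client. This is a minimal sketch under assumptions: the names `encodeSse` and `parseSse` are hypothetical, every payload is assumed to be JSON, and the event names in the usage note are illustrative rather than the app's real field names.

```typescript
// Hypothetical helpers (not from the Conversa repo) for SSE framing.
type SseEvent = { event: string; data: unknown };

// Server side: one SSE frame is "event:" and "data:" lines ended by a blank line.
function encodeSse(event: string, data: unknown): string {
  return "event: " + event + "\n" + "data: " + JSON.stringify(data) + "\n\n";
}

// Client side: split buffered text into complete frames and keep the
// unfinished tail so it can be prepended to the next network chunk.
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const events: SseEvent[] = [];

  for (const frame of frames) {
    let event = "message"; // default event name when no "event:" line is sent
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) {
        event = line.slice("event:".length).trim();
      } else if (line.startsWith("data:")) {
        dataLines.push(line.slice("data:".length).trim());
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join("\n")) });
    }
  }
  return { events, rest };
}
```

A route handler could write `encodeSse("status", { message: "Transcribing" })`, then one frame per result field, then `encodeSse("done", {})`. The client appends each chunk to a buffer, calls `parseSse`, renders the returned events, and carries `rest` forward.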
The model is called with temperature: 0.2 and max_tokens: 3000 across all agents; the low temperature favors consistent, parseable output over creative variation, and the token cap bounds response length.
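As a rough illustration of those settings, the request body for a text-only call might be assembled like this. It assumes OpenRouter's OpenAI-compatible chat completions endpoint; `buildChatRequest` and the prompt strings are hypothetical and not taken from the repository.

```typescript
// Sketch only: request body for OpenRouter's OpenAI-compatible endpoint.
// buildChatRequest is a hypothetical helper, not code from the Conversa repo.
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";
const MODEL = "google/gemma-4-26b-a4b-it";

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  temperature: number;
  max_tokens: number;
  messages: ChatMessage[];
}

function buildChatRequest(systemPrompt: string, userText: string): ChatRequest {
  return {
    model: MODEL,
    temperature: 0.2, // low temperature for consistent JSON output
    max_tokens: 3000, // cap on generated tokens per response
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
  };
}

// Usage (requires an API key, so it is left as a comment):
// await fetch(OPENROUTER_URL, {
//   method: "POST",
//   headers: {
//     Authorization: "Bearer " + apiKey,
//     "Content-Type": "application/json",
//   },
//   body: JSON.stringify(buildChatRequest(systemPrompt, transcript)),
// });
```

Keeping the sampling settings in one builder means all three agents share the same values and differ only in their prompts and content.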
For the audio agent, Gemma 4 receives the transcript text (produced by Groq Whisper) and extracts key points, action items, and follow-up questions — acting as a reasoning layer on top of the raw transcription.
For the document agent, the PDF is converted to base64 and passed in full to Gemma 4, which then generates structured brief sections streamed one by one as SSE section events.
For the image agent, the photo is passed as an image_url content block alongside a detailed JSON schema prompt, and Gemma 4 returns all four analysis fields in a single structured response.
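A message of that shape can be sketched as follows, assuming the OpenAI-compatible content-part format that OpenRouter accepts. The helper name `buildImageMessage` is hypothetical, and the base64 string is assumed to be produced elsewhere (for example from the uploaded file's bytes on the server).

```typescript
// Sketch only: a user message that pairs a text prompt with an inline image.
// buildImageMessage is a hypothetical helper, not code from the Conversa repo.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ImageMessage {
  role: "user";
  content: ContentPart[];
}

function buildImageMessage(base64: string, mimeType: string, prompt: string): ImageMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      {
        type: "image_url",
        // The image travels inline as a data URL, so no separate upload is needed.
        image_url: { url: "data:" + mimeType + ";base64," + base64 },
      },
    ],
  };
}
```

Because base64 encoding inflates the payload by about a third, inline images are one more reason to compress large photos before upload.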