Built for the Google Gemini Live Agent Challenge. This article covers what it does, the architecture behind it, and how to run it yourself.
Imagine receiving a letter from your landlord. The words "NOTICE TO QUIT" are printed in bold at the top. Below that: dense legal text, references to statutes you've never heard of, a deadline buried somewhere in the third paragraph.
You don't know if you have three days or thirty. You don't know if this is serious or routine. You don't know your rights.
For 60 million Americans facing civil legal crises every year — evictions, debt collection, court summons, insurance denials — this is a real, daily experience. And most of them face it alone, because a lawyer costs $400 an hour and legal aid organizations have years-long waitlists.
That's the problem I set out to address with ClearRight.
What Is ClearRight?
ClearRight is a real-time voice AI assistant that helps people understand their legal documents and know their rights. You upload a document, an AI reads it and gives you an instant briefing, and then you talk — out loud, naturally, conversationally — with Clara, a legal information assistant powered by Google's Gemini Live API.
No typing. No waiting. No cost.
The experience is designed to feel like having a knowledgeable, calm friend walk you through a scary letter — not like querying a legal database.
Here's what happens when you use it:
Step 1 — Upload your document. You drop in a PDF or photo of any legal document. Within a few seconds, the AI reads it using Gemini's multimodal vision and surfaces a structured analysis: what kind of document it is, whether it's high, medium, or low risk, the two most important things you need to know, and four specific questions you should be asking.
Step 2 — Read the briefing. Before you say a word, you already know what you're dealing with. The right panel shows you the document type ("Pay or Quit Notice"), the risk level (a red "High Risk" badge), key points ("You have 3 days to pay $1,200 or vacate the property"), and tappable question chips. Tap any chip and it's sent to Clara instantly.
Step 3 — Talk to Clara. Start a voice session and speak naturally. Clara hears you, responds with audio, and can be interrupted mid-sentence. She uses Google Search in real time to ground her answers in current law. She tells you your rights, explains what the document is asking you to do, and always ends by pointing you to free legal aid resources — because she provides legal information, not legal advice.
The Technical Architecture
ClearRight has two independently deployed services on Google Cloud Run: a Python FastAPI backend and a Next.js frontend. Let me walk through how each piece works.
Document Processing: Two Gemini Calls in One Upload
When you upload a document, the backend makes two sequential calls to gemini-2.5-flash.
The first call passes the raw file bytes directly to the model using Part.from_bytes() with the file's MIME type. Gemini's multimodal vision reads the document — PDF, JPEG, PNG, HEIC, whatever — and returns the full extracted text. This is more reliable than traditional OCR because it understands document structure, handles handwriting, and correctly interprets scanned pages.
```python
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=file_bytes, mime_type=file.content_type),
        "Please extract and transcribe ALL text content from this document...",
    ],
)
extracted_text = response.text
```
The second call takes that extracted text and asks the model to produce a structured JSON analysis:
```python
analysis_prompt = """Based on the document above, return ONLY a valid JSON object:
{
  "doc_type": "type in 2-4 words (e.g. Lease Agreement)",
  "risk_level": "high or medium or low",
  "key_points": ["key point 1", "key point 2"],
  "suggested_questions": ["question 1?", "question 2?", "question 3?", "question 4?"]
}"""
```
The analysis step is fault-tolerant — if JSON parsing fails, the upload still succeeds and the document is still available for Clara to read. The UI just won't show the analysis card. This matters in production: you never want a secondary feature to break the primary flow.
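That fault-tolerant path can be sketched in a few lines. This is a minimal illustration of the behavior described above, assuming the analysis is dropped rather than raised on failure; the `parse_analysis` helper and the fence-stripping are my own names, not the project's actual code:

```python
import json
from typing import Optional

def parse_analysis(raw: str) -> Optional[dict]:
    """Parse the model's JSON analysis; return None instead of raising."""
    text = raw.strip()
    # Models often wrap JSON in ```json fences, so strip those first.
    if text.startswith("```"):
        text = text.split("\n", 1)[-1]      # drop the opening fence line
        text = text.rsplit("```", 1)[0]     # drop the closing fence
    try:
        analysis = json.loads(text)
    except json.JSONDecodeError:
        return None  # upload still succeeds; the UI just hides the card
    # Only accept objects carrying the fields the UI expects.
    required = {"doc_type", "risk_level", "key_points", "suggested_questions"}
    if not isinstance(analysis, dict) or not required.issubset(analysis):
        return None
    return analysis
```

A `None` here simply means the response omits the analysis card while the extracted text is returned as usual.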
The Voice Agent: Google ADK + Gemini Live API
The live conversation is powered by Google's Agent Development Kit (ADK) and the Gemini Live API. Here's the setup:
```python
root_agent = Agent(
    name="clara_legal_assistant",
    model="gemini-2.5-flash-native-audio-latest",
    description="Clara — a compassionate legal information assistant",
    instruction=AGENT_INSTRUCTION,
    tools=[google_search],
)
```
The model gemini-2.5-flash-native-audio-latest is a native audio model — it processes and generates audio directly, without converting speech to text and back. This is what makes the conversation feel natural: there's no robotic cadence, no processing pauses between speech segments. Clara sounds like a person.
When a WebSocket connection opens, the backend starts an ADK InMemoryRunner session and configures it for bidirectional streaming:
```python
run_config = RunConfig(
    streaming_mode="bidi",
    response_modalities=[types.Modality.AUDIO],
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(
            start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
            end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_HIGH,
            prefix_padding_ms=100,
            silence_duration_ms=200,
        )
    ),
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Aoede")
        )
    ),
)
```
The VAD (Voice Activity Detection) settings are tuned for low latency: 200ms silence duration means Clara starts responding quickly after you stop talking, and high sensitivity means she catches soft speech.
If a document was uploaded, it's injected into the session before any user interaction:
```python
live_request_queue.send_content(content=Content(
    role="user",
    parts=[Part(text=f"[DOCUMENT UPLOADED BY USER]\n\n{document_context}")],
))
live_request_queue.send_content(content=Content(
    role="user",
    parts=[Part(text="Please greet me briefly, confirm you've read my document, "
                     "and tell me the single most important thing I should know about it.")],
))
```
Clara's first words when you connect are always relevant to your document. She's already read it.
The Audio Pipeline: AudioWorklet in the Browser
Getting real-time audio in and out of a browser is more involved than it sounds. ClearRight uses two custom AudioWorkletProcessor instances running on dedicated audio threads.
Recording (16kHz): The recorder worklet captures microphone input, resamples to 16kHz PCM, and sends chunks to the main thread roughly every 100ms. It also runs a simple energy-based VAD that detects speech start and end events, which the UI uses to drive the "you're speaking" indicator and flush the playback buffer when you interrupt.
Playback (24kHz): The player worklet receives 24kHz PCM chunks from the backend via WebSocket and plays them in order. When the user starts speaking (detected by the recorder), the player's buffer is flushed immediately — this is what makes interruption work. Clara stops mid-sentence, the buffer clears, and the session moves forward.
One important implementation detail: AudioContexts should be suspended, not closed, between sessions. Chrome does not allow re-registering an AudioWorklet module on a context that has been previously used. Suspending the context keeps the worklet thread alive for reuse.
```javascript
// On disconnect — suspend, don't close
audioRecorderContextRef.current?.suspend().catch(() => {});
audioPlayerContextRef.current?.suspend().catch(() => {});

// On reconnect — resume the existing context
if (ctx.state === "suspended") await ctx.resume();
```
Clara's Persona
Clara is not just a legal search engine. The system prompt defines a careful persona:
- Warm and reassuring. The people talking to Clara are scared. They received a threatening letter and don't know what it means. Clara's tone is the same as a trusted, knowledgeable friend — not a lawyer billing by the hour.
- Plain English always. No legal jargon without immediate explanation. If Clara says "FDCPA," she immediately follows with what that means in plain language.
- Information, not advice. Clara explains what documents say, what rights exist under the law, what deadlines apply. She does not tell you what to do in your specific situation — that requires a licensed attorney. Every substantive response ends with a pointer to free legal aid resources.
- Grounded answers. Clara never guesses at specific deadlines or statute numbers. She uses google_search when she needs current information, state-specific rules, or local legal aid contacts.
Running ClearRight Yourself
Prerequisites
- Python 3.11+
- Node.js 18+
- A free Google AI Studio API key from aistudio.google.com
- Chrome or Edge (required for AudioWorklet support)
1. Clone and configure
```shell
git clone https://github.com/engr-krooozy/clearright.git
cd clearright
cp .env.example server/.env
```
Edit server/.env:
```
GOOGLE_API_KEY=your_api_key_here
APP_NAME=clearright
AGENT_VOICE=Aoede
AGENT_LANGUAGE=en-US
```
2. Start everything
```shell
./run_local.sh
```
This starts the FastAPI backend on port 8000 and the Next.js frontend on port 3000. Open http://localhost:3000.
3. Try it
Upload any PDF — a lease, a letter from a debt collector, an employment contract. Watch the analysis card appear. Then click "Talk to Clara" and ask anything.
If you don't have a legal document handy, try uploading the terms of service from any app. The analysis will still work and Clara can walk you through what you're actually agreeing to.
Things I Learned Building This
Voice-first UX is genuinely different from chat-first UX. The initial instinct was to show a conversation transcript. But the native audio model doesn't surface text through the ADK event stream — it outputs audio only. That constraint forced a better design decision: front-load the useful information (the analysis card) and let the voice feel like voice, not a chat interface with audio bolted on.
Multimodal document processing is remarkably capable. Gemini 2.5 Flash handles handwritten notes, photographs of physical documents, and scanned PDFs with high fidelity. I tested it with water-damaged lease printouts, photos of letters taken on a phone, and PDFs with multi-column legal formatting. The extraction quality was consistently better than expected.
The ADK does the hard orchestration work. Session management, tool call routing, event streaming, and the live connection lifecycle are all handled by the ADK. What you write is the agent definition, the system prompt, and the application logic around it. That said, understanding what events actually flow through run_live() for a native audio model required careful debugging — always add server-side event logging during development.
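That logging advice can be as simple as wrapping the event stream before you consume it. A generic sketch, where run_live() stands in for whatever async iterator the ADK hands you:

```python
import logging
from typing import AsyncIterator, TypeVar

T = TypeVar("T")
log = logging.getLogger("live_events")

async def logged(events: AsyncIterator[T]) -> AsyncIterator[T]:
    """Pass events through untouched, logging each one's type first.

    For a native audio model, expect audio parts rather than text; seeing
    that in the logs is what clarifies the event flow during development.
    """
    async for event in events:
        log.debug("event=%s", type(event).__name__)
        yield event

# Usage (illustrative):
#     async for event in logged(runner.run_live(...)):
#         ...handle audio chunks, tool calls, turn completion...
```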
Guardrails are a product decision, not just a legal disclaimer. The distinction between legal information and legal advice shapes every aspect of Clara's persona. Defining that boundary clearly in the system prompt, and testing that the model respects it across a range of scenarios, was as important as any technical implementation detail.
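Testing that boundary can start from something as blunt as a keyword check over model responses. This is entirely illustrative: the marker phrases and the helper are mine, and a real harness would run many scenario prompts against the live model rather than canned strings:

```python
# Directive phrasing that would cross from information into advice.
ADVICE_MARKERS = [
    "you should sign",
    "i recommend you",
    "your best option is to",
]
# Every substantive answer should point somewhere real for help.
REFERRAL_MARKERS = ["legal aid", "licensed attorney", "lawyer"]

def respects_boundary(response: str) -> bool:
    """True if a response avoids directive advice and refers the user out."""
    lowered = response.lower()
    gives_advice = any(m in lowered for m in ADVICE_MARKERS)
    refers_out = any(m in lowered for m in REFERRAL_MARKERS)
    return not gives_advice and refers_out
```

Crude checks like this won't catch subtle advice, but they make regressions in the system prompt visible immediately.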
What's Next
The foundation is solid. A few natural extensions:
- Persistent sessions — Return to a conversation about your document across multiple days
- Document comparison — "Here's my old lease and my new one — what changed?"
- State-specific legal aid routing — Automatically surface the right legal aid organization based on the document type and detected jurisdiction
- Spanish and other languages — Legal document complexity is compounded by language barriers; this is an obvious high-impact next step
Try It
The live version is deployed at: https://clearright-ui-q63ub5ulzq-uc.a.run.app
The full source code is on GitHub: https://github.com/engr-krooozy/clearright
This project was built for the Gemini Live Agent Challenge hosted by Google.
If you're building something with the Gemini Live API or ADK, I hope this walkthrough saves you some debugging time. The audio pipeline details in particular took a while to get right — happy to answer questions.
ClearRight provides general legal information only — not legal advice. For guidance specific to your situation, consult a licensed attorney or contact your local legal aid organization.
#GeminiLiveAgentChallenge #GoogleAI #ADK #GeminiLiveAPI #LegalTech