
Balaraj R
How I Built OmniSence: A Multimodal AI That Streams Text, Images & Audio Together

Disclosure: I created this piece of content for the purposes of
entering the Gemini Live Agent Challenge hackathon on Devpost.



The Problem That Kept Me Up at Night

Every AI tool I've used thinks in documents, not experiences.

You get text here. An image there. Maybe audio if you switch tabs
and use a different tool entirely. But a real creative director doesn't
hand you a Word document; they paint a scene with words, sketches, and
emotion simultaneously.

That gap is what I built OmniSence to close.


What OmniSence Does

OmniSence is a Creative Director AI that takes a single idea, spoken
or typed, and streams text, images, and audio together in real time
as one cohesive, interleaved experience.

You speak: "A girl who discovers she can paint the future."

OmniSence responds with:

  • 📝 Narrative prose streaming word by word
  • 🖼️ Watercolor illustrations appearing inline mid-sentence
  • 🔊 Studio-quality narration reading the story back to you

All at once. All live. No switching tabs.


The Core Technical Innovation: Orchestrated Interleaved Streaming

This was the hardest problem to solve.

Gemini doesn't natively emit image bytes mid-text stream. So I designed
a pattern I call Orchestrated Interleaved Streaming using Google ADK:

User Prompt
    ↓
Google ADK Agent
    ↓
Gemini 3.1 Flash streams text with [IMAGE_DIRECTIVE: ...] markers
    ↓ (on marker detection)
Imagen 4 (async) ←→ Cloud TTS (async)
    ↓                    ↓
GCS Upload           GCS Upload
    ↓                    ↓
SSE: {type:"image"}  SSE: {type:"audio"}
    ↓
React frontend renders everything inline, live

The key insight: image and audio generation run in parallel while
text continues streaming. The user never waits for one to finish before
the next begins.
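As a minimal sketch of that loop, here is one way the marker detection could work, assuming the `[IMAGE_DIRECTIVE: ...]` format from the diagram and a hypothetical `generate_image` coroutine standing in for the real Imagen-plus-GCS tool:

```python
import asyncio
import re

# Marker format from the streaming diagram; the prompt is captured
# non-greedily up to the closing bracket.
MARKER = re.compile(r"\[IMAGE_DIRECTIVE:\s*(.+?)\]")

async def interleave(text_chunks, generate_image):
    """Yield text events as they stream; on each marker, start image
    generation in parallel instead of blocking the text."""
    pending, buffer = [], ""
    for chunk in text_chunks:
        buffer += chunk
        while (m := MARKER.search(buffer)):
            before, buffer = buffer[:m.start()], buffer[m.end():]
            if before:
                yield {"type": "text", "data": before}
            # Fire-and-continue: the image renders while text keeps flowing.
            pending.append(asyncio.create_task(generate_image(m.group(1))))
    if buffer:
        yield {"type": "text", "data": buffer}
    for task in pending:  # these finished (or are finishing) in parallel
        yield {"type": "image", "data": await task}
```

The buffering also handles markers split across model chunks: an incomplete `[IMAGE_DIREC...` simply waits in the buffer until the rest arrives.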

The perceived latency formula:

$$T_{total} = T_{text} + \max(T_{imagen}, T_{tts}) - T_{overlap}$$

In practice, this means a full illustrated, narrated story appears in
under 30 seconds from a single voice prompt.
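Plugging in illustrative numbers (my own round figures, not measurements from the project) shows why the overlap term is what keeps the total inside that budget:

```python
# Illustrative numbers only: 20 s of text streaming, 8 s for Imagen,
# 5 s for TTS, with all 8 s of media generation overlapping the text.
t_text, t_imagen, t_tts, t_overlap = 20.0, 8.0, 5.0, 8.0

t_total = t_text + max(t_imagen, t_tts) - t_overlap
print(t_total)  # 20.0 -- the media cost is fully hidden behind the text
```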


Building the ADK Agent

The Google ADK (Agent Development Kit) was the backbone of the entire
project. Instead of chaining API calls manually, I defined the agent
with 5 real async tools:

root_agent = Agent(
    name="omnisence",
    model="gemini-3.1-flash",
    description="OmniSence - Elite Creative Director AI",
    instruction=SYSTEM_PROMPT,
    tools=[
        generate_scene_image,        # Imagen 4 → GCS
        narrate_text,                # Cloud TTS → GCS
        search_creative_references,  # Google Search grounding
        save_session_asset,          # GCS persistence
        get_style_constraints,       # Creative mode framework
    ],
)

What surprised me: the agent naturally decides when to generate
images versus keep writing, creating organic pacing without me hardcoding
any rules. That emergent creative judgment was something I didn't expect.


Grounding: Preventing Hallucinations

For educational mode, I integrated Google Search grounding with a
single SDK addition:

if mode == "educational":
    generation_config["tools"] = [{"google_search": {}}]

One line. Dramatically more accurate factual content. The agent now
cites real sources before weaving them into the narrative.


The Google Cloud Stack

Service               What I Used It For
Gemini 3.1 Flash      Core text generation with image directives
Imagen 4              Scene illustration via Vertex AI
Cloud Text-to-Speech  Studio voice narration
Cloud Run             Serverless backend hosting
Cloud Storage         Asset persistence across sessions
Cloud Build           CI/CD with cloudbuild.yaml

One thing I want to highlight: Cloud Run's concurrency model was
perfect for SSE streaming. Each user gets a persistent async generator
that streams for up to 5 minutes, and Cloud Run handles this gracefully
without the connection dropping.


The UX Decision That Changed Everything

Early in development, images appeared below the text after it
finished generating. It felt like "AI output."

The moment I moved images to appear inline, mid-paragraph,
exactly where the story described them, the experience shifted from
"AI output" to "living document."

That single UX change had more impact on how the product felt than
any technical improvement I made.


Challenges I Didn't Anticipate

Stream cancellation was deceptively hard. When a user hits
"Stop & Redirect" mid-generation, you can't just close the SSE
connection; there are async Imagen and TTS calls in flight on
GCP that need to be cleanly abandoned without leaving orphaned
uploads or billing surprises.

My solution: per-session cancellation flags checked at every
yield point in the async generator, combined with asyncio.shield()
for GCS cleanup tasks that must complete regardless.
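A condensed sketch of that pattern (names are illustrative, not the project's actual code): the flag is checked before every `yield`, and `asyncio.shield()` in the `finally` block keeps cleanup alive even if the surrounding task is torn down.

```python
import asyncio

class Session:
    def __init__(self):
        self.cancelled = asyncio.Event()  # set by "Stop & Redirect"
        self.cleaned_up = False

async def cleanup(session):
    # Must finish regardless of cancellation, e.g. deleting
    # half-uploaded GCS objects.
    await asyncio.sleep(0)
    session.cleaned_up = True

async def generate(session, chunks):
    try:
        for chunk in chunks:
            if session.cancelled.is_set():  # checked at every yield point
                break
            yield chunk
    finally:
        # shield() protects the cleanup coroutine from being cancelled
        # along with the generator that spawned it.
        await asyncio.shield(cleanup(session))
```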

Cloud TTS latency was the other beast. Studio voices take 2–4
seconds per paragraph. I solved this with a rolling generation pipeline:
generate audio for completed sentences while later sentences still stream.
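That rolling pipeline can be sketched like this, with `synthesize` standing in for the real Cloud TTS call and a naive regex for sentence boundaries:

```python
import asyncio
import re

# A sentence is "complete" once punctuation is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

async def rolling_tts(text_chunks, synthesize):
    """Kick off TTS for each completed sentence while later text is
    still streaming, instead of waiting for the whole paragraph."""
    buffer, tasks = "", []
    async for chunk in text_chunks:
        buffer += chunk
        *done, buffer = SENTENCE_END.split(buffer)
        for sentence in done:
            tasks.append(asyncio.create_task(synthesize(sentence)))
    if buffer.strip():  # flush the trailing partial sentence
        tasks.append(asyncio.create_task(synthesize(buffer)))
    return [await t for t in tasks]  # audio clips in original order
```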


What I'd Tell Someone Starting This Today

  1. Use ADK, not raw SDK calls. The tool system makes multi-model
    orchestration feel natural instead of spaghetti.

  2. Design for streaming from day one. Adding SSE to a
    request/response architecture later is painful. Build async generators
    first.

  3. Google Search grounding is a one-liner. Add it to any factual
    mode. The quality difference is immediate and obvious.

  4. The persona matters more than the features. OmniSence has a
    distinct voice: warm, bold, cinematic. Users respond to that
    personality more than any specific capability.

  5. Deploy early to Cloud Run. Local dev hides async edge cases
    that only appear under real network conditions.


Try It Yourself

🔗 GitHub: https://github.com/balaraj74/Omnisence

🔗 Live Demo: https://omnisence-518586257861.us-central1.run.app/

The entire project is open source. The README includes a one-command
deploy script (./deploy.sh YOUR_PROJECT_ID YOUR_API_KEY), and
you'll have your own OmniSence instance running on Cloud Run in
under 10 minutes.


Built for the Gemini Live Agent Challenge · Powered by Google Gemini,
Imagen 4, and Google Cloud · #GeminiLiveAgentChallenge
