Disclosure: I created this piece of content to enter the Gemini Live Agent Challenge hackathon on Devpost.
## The Problem That Kept Me Up at Night
Every AI tool I've used thinks in documents, not experiences.

You get text here. An image there. Maybe audio, if you switch tabs and use a different tool entirely. But a real creative director doesn't hand you a Word document; they paint a scene with words, sketches, and emotion simultaneously.
That gap is what I built OmniSence to close.
## What OmniSence Does
OmniSence is a Creative Director AI that takes a single idea, spoken or typed, and streams text, images, and audio together in real time as one cohesive, interleaved experience.
You speak: "A girl who discovers she can paint the future."
OmniSence responds with:
- Narrative prose streaming word by word
- Watercolor illustrations appearing inline, mid-sentence
- Studio-quality narration reading the story back to you
All at once. All live. No switching tabs.
## The Core Technical Innovation: Orchestrated Interleaved Streaming
This was the hardest problem to solve.
Gemini doesn't natively emit image bytes mid-text stream. So I designed
a pattern I call Orchestrated Interleaved Streaming using Google ADK:
```
User Prompt
        ↓
Google ADK Agent
        ↓
Gemini 3.1 Flash streams text with [IMAGE_DIRECTIVE: ...] markers
        ↓  (on marker detection)
Imagen 4 (async)    ∥    Cloud TTS (async)
        ↓                      ↓
   GCS Upload             GCS Upload
        ↓                      ↓
SSE: {type:"image"}    SSE: {type:"audio"}
        ↓
React frontend renders everything inline, live
```
The key insight: image and audio generation run in parallel while
text continues streaming. The user never waits for one to finish before
the next begins.
The perceived-latency formula:

$$T_{total} = T_{text} + \max(T_{imagen}, T_{tts}) - T_{overlap}$$

where $T_{overlap}$ is the media-generation time hidden behind the still-streaming text. In practice, this means a fully illustrated, narrated story appears in under 30 seconds from a single voice prompt.
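Here is a minimal reconstruction of that pattern. The `[IMAGE_DIRECTIVE: ...]` marker and the `{type:"image"}` event shape come from the post; the regex, the `fake_image_job` stub, and the exact orchestration details are my own sketch, not code from the repo:

```python
import asyncio
import re

# Marker name from the post; regex is my reconstruction.
DIRECTIVE = re.compile(r"\[IMAGE_DIRECTIVE:\s*(.*?)\]")

async def fake_image_job(prompt: str) -> dict:
    """Stub standing in for the Imagen 4 call plus GCS upload."""
    await asyncio.sleep(0)
    return {"type": "image", "prompt": prompt}

async def orchestrate(chunks):
    """Yield SSE-style events: text flows through immediately, media jobs run in parallel."""
    pending = []
    for chunk in chunks:
        pos = 0
        for m in DIRECTIVE.finditer(chunk):
            if m.start() > pos:
                yield {"type": "text", "delta": chunk[pos:m.start()]}
            # Fire-and-continue: generation overlaps with further text streaming.
            pending.append(asyncio.create_task(fake_image_job(m.group(1))))
            pos = m.end()
        if pos < len(chunk):
            yield {"type": "text", "delta": chunk[pos:]}
    # Emit media events as their background tasks finish.
    for fut in asyncio.as_completed(pending):
        yield await fut

async def main():
    chunks = ["Once upon a time ", "[IMAGE_DIRECTIVE: girl painting] she dreamed."]
    return [event async for event in orchestrate(chunks)]
```

A real implementation also has to buffer text at chunk boundaries, since a marker can arrive split across two stream chunks; the sketch ignores that for brevity.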
## Building the ADK Agent
The Google ADK (Agent Development Kit) was the backbone of the entire project. Instead of chaining API calls manually, I defined the agent with five async tools:
```python
root_agent = Agent(
    name="omnisence",
    model="gemini-3.1-flash",
    description="OmniSence: Elite Creative Director AI",
    instruction=SYSTEM_PROMPT,
    tools=[
        generate_scene_image,        # Imagen 4 → GCS
        narrate_text,                # Cloud TTS → GCS
        search_creative_references,  # Google Search grounding
        save_session_asset,          # GCS persistence
        get_style_constraints,       # Creative mode framework
    ],
)
```
What surprised me: the agent naturally decides when to generate images vs. keep writing, creating organic pacing without me hardcoding any rules. That emergent creative judgment was something I didn't expect.
## Grounding: Preventing Hallucinations
For educational mode, I integrated Google Search grounding with a
single SDK addition:
```python
if mode == "educational":
    generation_config["tools"] = [{"google_search": {}}]
```
One line. Dramatically more accurate factual content. The agent now
cites real sources before weaving them into the narrative.
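In context, that switch might sit in a small config-building step like the following. The `{"google_search": {}}` key is taken from the snippet; everything else (function name, baseline settings) is illustrative:

```python
def build_generation_config(mode: str) -> dict:
    """Assemble per-request generation config; grounding is enabled per mode."""
    config = {"temperature": 0.9}  # baseline creative settings (illustrative values)
    if mode == "educational":
        # One line: attach Google Search grounding so factual claims are sourced.
        config["tools"] = [{"google_search": {}}]
    return config
```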
## The Google Cloud Stack
| Service | What I Used It For |
|---|---|
| Gemini 3.1 Flash | Core text generation with image directives |
| Imagen 4 | Scene illustration via Vertex AI |
| Cloud Text-to-Speech | Studio voice narration |
| Cloud Run | Serverless backend hosting |
| Cloud Storage | Asset persistence across sessions |
| Cloud Build | CI/CD with cloudbuild.yaml |
One thing I want to highlight: Cloud Run's concurrency model was perfect for SSE streaming. Each user gets a persistent async generator that streams for up to 5 minutes, and Cloud Run handles this gracefully without the connection dropping.
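For readers new to SSE: each event on the wire is a `data:` line followed by a blank line. A minimal serializer for the `{type:"image"}` / `{type:"audio"}` events (my own sketch, not code from the repo):

```python
import json

def format_sse(event: dict) -> str:
    """Serialize one event in Server-Sent Events wire format."""
    return f"data: {json.dumps(event)}\n\n"
```

Inside the Cloud Run service, strings like this would be yielded from the async generator behind a `text/event-stream` response (e.g. via FastAPI's `StreamingResponse`, assuming that framework).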
## The UX Decision That Changed Everything
Early in development, images appeared below the text after it
finished generating. It felt like "AI output."
The moment I moved images to appear inline, mid-paragraph, exactly where the story described them, the experience shifted from "AI output" to "living document."
That single UX change had more impact on how the product felt than
any technical improvement I made.
## Challenges I Didn't Anticipate
Stream cancellation was deceptively hard. When a user hits "Stop & Redirect" mid-generation, you can't just close the SSE connection: there are async Imagen and TTS calls in flight on GCP that need to be cleanly abandoned without leaving orphaned uploads or billing surprises.
My solution: per-session cancellation flags checked at every yield point in the async generator, combined with `asyncio.shield()` for GCS cleanup tasks that must complete regardless.
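A toy version of that shutdown path, with hypothetical names (`Session`, `cleanup`) standing in for the real session state and GCS cleanup:

```python
import asyncio

class Session:
    """Per-session state; `cancelled` is flipped by the Stop & Redirect handler."""
    def __init__(self) -> None:
        self.cancelled = False

async def cleanup() -> str:
    # Stands in for deleting half-written GCS objects; shielded below so it
    # finishes even if the surrounding task is being torn down.
    await asyncio.sleep(0)
    return "cleaned"

async def stream(session: Session, parts):
    try:
        for part in parts:
            if session.cancelled:      # cancellation flag checked at every yield point
                break
            yield part
            await asyncio.sleep(0)     # give the cancel handler a chance to run
    finally:
        await asyncio.shield(cleanup())

async def main():
    session, received = Session(), []
    async for part in stream(session, ["intro", "scene-1", "scene-2"]):
        received.append(part)
        if part == "scene-1":
            session.cancelled = True   # simulate the user hitting Stop & Redirect
    return received
```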
Cloud TTS latency was the other beast. Studio voices take 2–4 seconds per paragraph. I solved this with a rolling generation pipeline: generate audio for completed sentences while later sentences are still streaming.
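A sketch of that rolling pipeline, with a naive sentence splitter and a stub in place of the Cloud TTS call (both my simplifications):

```python
import asyncio
import re

# Naive boundary: split after ., !, or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

async def fake_tts(sentence: str) -> dict:
    """Stub standing in for the 2-4 s Studio-voice synthesis call."""
    await asyncio.sleep(0)
    return {"type": "audio", "text": sentence}

async def rolling_narration(chunks):
    """Dispatch TTS for each completed sentence while later text still streams."""
    buffer, jobs = "", []
    for chunk in chunks:
        buffer += chunk
        *done, buffer = SENTENCE_END.split(buffer)  # last piece may be incomplete
        for sentence in done:
            jobs.append(asyncio.create_task(fake_tts(sentence)))
    if buffer.strip():
        jobs.append(asyncio.create_task(fake_tts(buffer.strip())))
    # In the real pipeline these results are yielded as SSE audio events.
    return [await job for job in jobs]

async def main():
    return await rolling_narration(["She lifted the brush. The ", "canvas shimmered."])
```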
## What I'd Tell Someone Starting This Today
1. **Use ADK, not raw SDK calls.** The tool system makes multi-model orchestration feel natural instead of spaghetti.
2. **Design for streaming from day one.** Adding SSE to a request/response architecture later is painful. Build async generators first.
3. **Google Search grounding is a one-liner.** Add it to any factual mode. The quality difference is immediate and obvious.
4. **The persona matters more than the features.** OmniSence has a distinct voice: warm, bold, cinematic. Users respond to that personality more than to any specific capability.
5. **Deploy early to Cloud Run.** Local dev hides async edge cases that only appear under real network conditions.
## Try It Yourself
- GitHub: https://github.com/balaraj74/Omnisence
- Live Demo: https://omnisence-518586257861.us-central1.run.app/
The entire project is open source. The README includes a one-command deploy script (`./deploy.sh YOUR_PROJECT_ID YOUR_API_KEY`), and you'll have your own OmniSence instance running on Cloud Run in under 10 minutes.
Built for the Gemini Live Agent Challenge · Powered by Google Gemini, Imagen 4, and Google Cloud · #GeminiLiveAgentChallenge