<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Balaraj R</title>
    <description>The latest articles on DEV Community by Balaraj R (@balaraj_r).</description>
    <link>https://dev.to/balaraj_r</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827883%2Fce388338-f088-4f3b-a114-b6eb78b9d08b.jpg</url>
      <title>DEV Community: Balaraj R</title>
      <link>https://dev.to/balaraj_r</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/balaraj_r"/>
    <language>en</language>
    <item>
      <title>How I Built OmniSence - A Multimodal AI That Streams Text, Images &amp; Audio Together</title>
      <dc:creator>Balaraj R</dc:creator>
      <pubDate>Mon, 16 Mar 2026 18:21:55 +0000</pubDate>
      <link>https://dev.to/balaraj_r/how-i-built-omnisence-a-multimodal-ai-that-streams-text-images-audio-together-4gk8</link>
      <guid>https://dev.to/balaraj_r/how-i-built-omnisence-a-multimodal-ai-that-streams-text-images-audio-together-4gk8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I created this piece of content for the purposes of &lt;br&gt;
entering the &lt;strong&gt;Gemini Live Agent Challenge&lt;/strong&gt; hackathon on Devpost. &lt;/p&gt;
&lt;h1&gt;
  
  
  GeminiLiveAgentChallenge
&lt;/h1&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem That Kept Me Up at Night
&lt;/h2&gt;

&lt;p&gt;Every AI tool I've used thinks in &lt;strong&gt;documents, not experiences.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You get text &lt;em&gt;here&lt;/em&gt;. An image &lt;em&gt;there&lt;/em&gt;. Maybe audio if you switch tabs &lt;br&gt;
and use a different tool entirely. But a real creative director doesn't &lt;br&gt;
hand you a Word document — they paint a scene with words, sketches, and &lt;br&gt;
emotion &lt;em&gt;simultaneously&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That gap is what I built &lt;strong&gt;OmniSence&lt;/strong&gt; to close.&lt;/p&gt;


&lt;h2&gt;
  
  
  What OmniSence Does
&lt;/h2&gt;

&lt;p&gt;OmniSence is a Creative Director AI that takes a single idea — spoken &lt;br&gt;
or typed — and streams &lt;strong&gt;text, images, and audio together in real-time&lt;/strong&gt; &lt;br&gt;
as one cohesive, interleaved experience.&lt;/p&gt;

&lt;p&gt;You speak: &lt;em&gt;"A girl who discovers she can paint the future."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OmniSence responds with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📝 Narrative prose streaming word by word&lt;/li&gt;
&lt;li&gt;🖼️ Watercolor illustrations appearing &lt;strong&gt;inline mid-sentence&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🔊 Studio-quality narration reading the story back to you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All at once. All live. No switching tabs.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Technical Innovation: Orchestrated Interleaved Streaming
&lt;/h2&gt;

&lt;p&gt;This was the hardest problem to solve.&lt;/p&gt;

&lt;p&gt;Gemini doesn't natively emit image bytes mid-text stream. So I designed &lt;br&gt;
a pattern I call &lt;strong&gt;Orchestrated Interleaved Streaming&lt;/strong&gt; using Google ADK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Prompt
    ↓
Google ADK Agent
    ↓
Gemini 3.1 Flash streams text with [IMAGE_DIRECTIVE: ...] markers
    ↓ (on marker detection)
Imagen 4 (async) ←→ Cloud TTS (async)
    ↓                    ↓
GCS Upload           GCS Upload  
    ↓                    ↓
SSE: {type:"image"}  SSE: {type:"audio"}
    ↓
React frontend renders everything inline, live
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: image and audio generation run &lt;strong&gt;in parallel&lt;/strong&gt; while &lt;br&gt;
text continues streaming. The user never waits for one to finish before &lt;br&gt;
the next begins.&lt;/p&gt;
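The fan-out step above can be sketched in a few lines of asyncio. This is a minimal illustration, not the project's actual code: the two coroutines are stand-ins for the real Imagen and TTS tools, with sleeps simulating API latency.

```python
import asyncio

async def generate_scene_image(prompt):
    # Stand-in for the Imagen call; simulate latency, return a fake GCS URL.
    await asyncio.sleep(0.02)
    return "gs://assets/scene.png"

async def narrate_text(text):
    # Stand-in for the Cloud TTS call.
    await asyncio.sleep(0.01)
    return "gs://assets/scene.mp3"

async def on_image_directive(prompt, text):
    # Fire both generators concurrently; neither blocks the other,
    # and the text stream keeps flowing while these run.
    image_url, audio_url = await asyncio.gather(
        generate_scene_image(prompt),
        narrate_text(text),
    )
    return {"image": image_url, "audio": audio_url}

result = asyncio.run(
    on_image_directive("a girl painting the future", "Once upon a time...")
)
print(result)
```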

&lt;p&gt;The perceived latency formula:&lt;/p&gt;

&lt;p&gt;$$T_{total} = T_{text} + \max(T_{imagen}, T_{tts}) - T_{overlap}$$&lt;/p&gt;

&lt;p&gt;In practice, this means a full illustrated, narrated story appears in &lt;br&gt;
&lt;strong&gt;under 30 seconds&lt;/strong&gt; from a single voice prompt.&lt;/p&gt;
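To make the formula concrete, here is the arithmetic with illustrative numbers (these timings are assumptions for the example, not measurements from the project):

```python
# Hypothetical stage timings in seconds.
t_text, t_imagen, t_tts, t_overlap = 8.0, 12.0, 4.0, 6.0

# T_total = T_text + max(T_imagen, T_tts) - T_overlap
t_total = t_text + max(t_imagen, t_tts) - t_overlap
print(t_total)  # 14.0, versus 24.0 if all three stages ran strictly in sequence
```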


&lt;h2&gt;
  
  
  Building the ADK Agent
&lt;/h2&gt;

&lt;p&gt;The Google ADK (Agent Development Kit) was the backbone of the entire &lt;br&gt;
project. Instead of chaining API calls manually, I defined the agent &lt;br&gt;
with five async tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;omnisence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OmniSence — Elite Creative Director AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;generate_scene_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Imagen 4 → GCS
&lt;/span&gt;        &lt;span class="n"&gt;narrate_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Cloud TTS → GCS  
&lt;/span&gt;        &lt;span class="n"&gt;search_creative_references&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Google Search grounding
&lt;/span&gt;        &lt;span class="n"&gt;save_session_asset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# GCS persistence
&lt;/span&gt;        &lt;span class="n"&gt;get_style_constraints&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Creative mode framework
&lt;/span&gt;    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What surprised me: the agent &lt;strong&gt;naturally decides&lt;/strong&gt; when to generate &lt;br&gt;
images vs. keep writing — creating organic pacing without me hardcoding &lt;br&gt;
any rules. I hadn't expected that kind of emergent creative judgment.&lt;/p&gt;
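For a sense of what one of those tools looks like, here is a sketch of a generate_scene_image function. It assumes the common ADK convention that a plain Python function, with its signature and docstring, is exposed to the model as a callable tool; the Imagen and GCS calls are stubbed out, and the URL format is invented for the example.

```python
import asyncio

async def generate_scene_image(prompt: str, style: str = "watercolor") -> dict:
    """Generate a scene illustration and upload it to Cloud Storage.

    Args:
        prompt: Visual description of the scene to illustrate.
        style: Art style to apply (defaults to watercolor).

    Returns:
        A dict with the status and the URL of the rendered image.
    """
    styled_prompt = f"{style} illustration: {prompt}"
    # Real code would call Imagen via Vertex AI here, then upload the
    # returned bytes to GCS; we fabricate a URL to keep the sketch runnable.
    url = f"https://storage.googleapis.com/omnisence-assets/{len(styled_prompt)}.png"
    return {"status": "ok", "url": url}

result = asyncio.run(generate_scene_image("a lighthouse at dusk"))
print(result)
```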


&lt;h2&gt;
  
  
  Grounding: Preventing Hallucinations
&lt;/h2&gt;

&lt;p&gt;For educational mode, I integrated &lt;strong&gt;Google Search grounding&lt;/strong&gt; with a &lt;br&gt;
single SDK addition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;educational&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. Dramatically more accurate factual content. The agent now &lt;br&gt;
cites real sources before weaving them into the narrative.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Google Cloud Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;What I Used It For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini 3.1 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core text generation with image directives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Imagen 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scene illustration via Vertex AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Text-to-Speech&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Studio voice narration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless backend hosting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Asset persistence across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CI/CD with &lt;code&gt;cloudbuild.yaml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One thing I want to highlight: &lt;strong&gt;Cloud Run's concurrency model&lt;/strong&gt; was &lt;br&gt;
perfect for SSE streaming. Each user gets a persistent async generator &lt;br&gt;
that streams for up to 5 minutes — Cloud Run handles this gracefully &lt;br&gt;
without the connection dropping.&lt;/p&gt;
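The SSE frames shown in the architecture diagram can be sketched as a standalone async generator; the real backend presumably wraps something like this in a streaming HTTP response, but the frame format ("data: " prefix, blank-line terminator) is the standard Server-Sent Events wire shape.

```python
import asyncio
import json

async def event_stream(story_chunks):
    # Yield one SSE frame per event: text deltas as they stream,
    # plus image/audio events when assets land in GCS.
    for chunk in story_chunks:
        payload = json.dumps(chunk)
        yield f"data: {payload}\n\n"
        await asyncio.sleep(0)  # yield control so the loop can flush frames

async def collect(chunks):
    return [frame async for frame in event_stream(chunks)]

frames = asyncio.run(collect([
    {"type": "text", "delta": "Once upon a time"},
    {"type": "image", "url": "gs://assets/scene1.png"},
    {"type": "audio", "url": "gs://assets/scene1.mp3"},
]))
print(frames[0])
```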




&lt;h2&gt;
  
  
  The UX Decision That Changed Everything
&lt;/h2&gt;

&lt;p&gt;Early in development, images appeared &lt;strong&gt;below&lt;/strong&gt; the text after it &lt;br&gt;
finished generating. It felt like "AI output."&lt;/p&gt;

&lt;p&gt;The moment I moved images to appear &lt;strong&gt;inline&lt;/strong&gt; — mid-paragraph, &lt;br&gt;
exactly where the story described them — the experience shifted from &lt;br&gt;
"AI output" to "living document."&lt;/p&gt;

&lt;p&gt;That single UX change had more impact on how the product felt than &lt;br&gt;
any technical improvement I made.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges I Didn't Anticipate
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stream cancellation was deceptively hard.&lt;/strong&gt; When a user hits &lt;br&gt;
"Stop &amp;amp; Redirect" mid-generation, you can't just close the SSE &lt;br&gt;
connection — there are async Imagen and TTS calls in flight on &lt;br&gt;
GCP that need to be cleanly abandoned without leaving orphaned &lt;br&gt;
uploads or billing surprises.&lt;/p&gt;

&lt;p&gt;My solution: per-session cancellation flags checked at every &lt;br&gt;
&lt;code&gt;yield&lt;/code&gt; point in the async generator, combined with &lt;code&gt;asyncio.shield()&lt;/code&gt; &lt;br&gt;
for GCS cleanup tasks that must complete regardless.&lt;/p&gt;
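A minimal sketch of that pattern, under stated assumptions: the flag registry and cleanup function are hypothetical stand-ins, and cleanup here just records that it ran, where the real version would delete half-uploaded GCS objects.

```python
import asyncio

cancel_flags = {}   # session_id -> bool (hypothetical per-session registry)
cleanup_log = []

async def cleanup(session_id):
    # Must finish even if the surrounding task is cancelled,
    # hence the asyncio.shield() below.
    await asyncio.sleep(0)
    cleanup_log.append(session_id)

async def generate(session_id, steps):
    for step in steps:
        if cancel_flags.get(session_id):
            # Check the flag at every yield point; shield the cleanup
            # so task cancellation cannot interrupt it mid-flight.
            await asyncio.shield(cleanup(session_id))
            return
        yield step

async def run():
    cancel_flags["s1"] = False
    out = []
    async for step in generate("s1", ["a", "b", "c"]):
        out.append(step)
        cancel_flags["s1"] = True   # user hits Stop after the first chunk
    return out

emitted = asyncio.run(run())
print(emitted, cleanup_log)
```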

&lt;p&gt;&lt;strong&gt;Cloud TTS latency&lt;/strong&gt; was the other beast. Studio voices take 2–4 &lt;br&gt;
seconds per paragraph. I solved this with a rolling generation pipeline: &lt;br&gt;
generate audio for completed sentences while later sentences still stream.&lt;/p&gt;
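The rolling pipeline can be sketched like this: cut completed sentences out of the streaming buffer and start a TTS task for each immediately. The tts coroutine is a stub standing in for the 2-4 second Cloud TTS call, and the naive split on periods is an illustration, not real sentence segmentation.

```python
import asyncio

async def tts(sentence):
    # Stand-in for Cloud TTS; in production this is the slow call.
    await asyncio.sleep(0.01)
    return f"audio({sentence})"

async def rolling_narration(text_chunks):
    # As text streams in, dispatch TTS for each completed sentence
    # without waiting for the rest of the text to arrive.
    buffer, tasks = "", []
    for chunk in text_chunks:
        buffer += chunk
        while "." in buffer:
            sentence, buffer = buffer.split(".", 1)
            tasks.append(asyncio.create_task(tts(sentence.strip() + ".")))
    return await asyncio.gather(*tasks)

clips = asyncio.run(rolling_narration(
    ["She lifts the brush", ". The canvas ripples. Col", "ors bleed into tomorrow."]
))
print(clips)
```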




&lt;h2&gt;
  
  
  What I'd Tell Someone Starting This Today
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use ADK, not raw SDK calls.&lt;/strong&gt; The tool system makes multi-model &lt;br&gt;
orchestration feel natural instead of spaghetti.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design for streaming from day one.&lt;/strong&gt; Adding SSE to a &lt;br&gt;
request/response architecture later is painful. Build async generators &lt;br&gt;
first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Search grounding is a one-liner.&lt;/strong&gt; Add it to any factual &lt;br&gt;
mode. The quality difference is immediate and obvious.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The persona matters more than the features.&lt;/strong&gt; OmniSence has a &lt;br&gt;
distinct voice — warm, bold, cinematic. Users respond to that &lt;br&gt;
personality more than any specific capability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy early to Cloud Run.&lt;/strong&gt; Local dev hides async edge cases &lt;br&gt;
that only appear under real network conditions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/balaraj74/Omnisence" rel="noopener noreferrer"&gt;https://github.com/balaraj74/Omnisence&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🔗 &lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://omnisence-518586257861.us-central1.run.app/" rel="noopener noreferrer"&gt;https://omnisence-518586257861.us-central1.run.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The entire project is open source. The README includes a one-command &lt;br&gt;
deploy script — &lt;code&gt;./deploy.sh YOUR_PROJECT_ID YOUR_API_KEY&lt;/code&gt; — and &lt;br&gt;
you'll have your own OmniSence instance running on Cloud Run in &lt;br&gt;
under 10 minutes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built for the Gemini Live Agent Challenge · Powered by Google Gemini, &lt;br&gt;
Imagen 4, and Google Cloud · #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>geminiliveagentchallenge</category>
    </item>
  </channel>
</rss>
