<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amit Bhaskar</title>
    <description>The latest articles on DEV Community by Amit Bhaskar (@amitbhaskar).</description>
    <link>https://dev.to/amitbhaskar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825993%2F91d5307c-0f5a-4c1d-bdf4-faab08ef5026.jpg</url>
      <title>DEV Community: Amit Bhaskar</title>
      <link>https://dev.to/amitbhaskar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amitbhaskar"/>
    <language>en</language>
    <item>
      <title>Building the Harikatha Live Agent: A Spiritual QA System with Real Voice</title>
      <dc:creator>Amit Bhaskar</dc:creator>
      <pubDate>Sun, 15 Mar 2026 23:19:21 +0000</pubDate>
      <link>https://dev.to/amitbhaskar/building-the-harikatha-live-agent-a-spiritual-qa-system-with-real-voice-41l7</link>
      <guid>https://dev.to/amitbhaskar/building-the-harikatha-live-agent-a-spiritual-qa-system-with-real-voice-41l7</guid>
      <description>&lt;h1&gt;
  
  
  Building the Harikatha Live Agent: A Spiritual QA System with Real Voice
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Amit Bhaskar, Vedantic AI Ltd, New Zealand&lt;/p&gt;




&lt;p&gt;Hare Krishna, Namaskar, Hello! When I started the Harikatha Live Agent project for the #GeminiLiveAgentChallenge, I wasn't just building another AI chatbot. I was trying to solve a problem that has been bothering me for years: how do you deliver authentic spiritual teachings at scale without losing the voice, nuance, and presence of the original teacher?&lt;/p&gt;

&lt;p&gt;One of my deepest concerns with the rise of AI — deepfakes, synthetic voices, fabricated content — is the danger this poses in a spiritual context. A devotee could unknowingly receive words that were never spoken by their guru, presented as if they were. To address this, the Harikatha Live Agent is built around a single principle: the source of truth is the archive. Every answer must trace back to an actual recorded moment — a real lecture, a real voice, a real timestamp. The AI never speaks for Gurudeva. It only finds where Gurudeva already spoke.&lt;/p&gt;

&lt;p&gt;The answer I landed on is unconventional. Instead of generating synthetic speech like every other voice AI, the Harikatha Live Agent retrieves &lt;em&gt;actual audio recordings&lt;/em&gt; of Srila Bhaktivedanta Narayana Goswami Maharaja answering questions. Users ask by voice or text, and they receive answers in the guru's real recorded voice: not fabricated, not synthesized, but drawn from an authorized source-of-truth archive, just as it was spoken.&lt;/p&gt;

&lt;p&gt;Designed for the seeker first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text or voice input — ask your question by typing or speaking, whatever your situation allows&lt;/li&gt;
&lt;li&gt;Text, audio, or video response — the answer appears instantly as text, with optional audio and video on demand. If you are in a public place, read the transcript. If you can listen, hear Gurudeva's voice. If you want the full presence, watch the video.&lt;/li&gt;
&lt;li&gt;Barge-in — ask a new question at any time, even while an answer is playing. The agent listens continuously, just like a real conversation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post documents the technical journey of building this system using Google's Gemini Live API, and the unexpected challenges that emerge when you prioritize authenticity over convenience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Spiritual knowledge is traditionally transmitted through direct conversation. A student sits with a teacher, asks a question, and receives wisdom in the teacher's own voice and presence. When that teacher is no longer alive, how do you preserve that transmission?&lt;/p&gt;

&lt;p&gt;Narayana Maharaja spent decades giving audio lectures, interviews, and classes—thousands of hours of recorded material. The material exists, but it's fragmented across files, languages, and decades. A spiritual seeker can't easily ask their question and receive a directly relevant response from the guru's mouth.&lt;/p&gt;

&lt;p&gt;Enter the challenge: Can I build a system that makes this possible? And can I do it using the latest AI tools?&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: Real-Time WebSocket Proxying
&lt;/h2&gt;

&lt;p&gt;The architecture is deceptively simple on paper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User speaks → WebSocket → FastAPI (Cloud Run) → Gemini Live API
    → search_harikatha tool → Firestore vector search → audio segment returned
    → browser plays real voice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But implementing it was trickier than it sounds.&lt;/p&gt;

&lt;p&gt;The Gemini Live API communicates over WebSockets with real-time audio. The browser shouldn't connect directly to Google's servers, since that would expose the API key, so I needed to build a proxy. FastAPI handles the bidirectional WebSocket connection between browser and backend, while simultaneously maintaining a connection to Gemini's Live API.&lt;/p&gt;

&lt;p&gt;Here's the essential pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.websocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ws/live&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;websocket_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Connect to Gemini Live API
&lt;/span&gt;    &lt;span class="n"&gt;gemini_ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_gemini_connection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Bidirectional relay with tool call interception
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_bytes&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Forward audio to Gemini
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;gemini_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Listen for Gemini responses
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;gemini_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv_bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Intercept tool calls before sending to browser
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;tool_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handle_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Send result back to Gemini, which refines its response
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;gemini_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Forward speech response to browser
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic is in the tool interception. Gemini doesn't have direct access to our vector database, so when it detects a spiritual question, it calls the &lt;code&gt;search_harikatha&lt;/code&gt; tool. The backend catches this call, executes the search server-side, and feeds the results back to Gemini—all without the browser knowing anything about it.&lt;/p&gt;
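&lt;p&gt;To make the interception concrete, here is a minimal, self-contained sketch of what &lt;code&gt;is_tool_call&lt;/code&gt; and &lt;code&gt;handle_tool_call&lt;/code&gt; could look like. The &lt;code&gt;toolCall&lt;/code&gt;/&lt;code&gt;toolResponse&lt;/code&gt; message shape follows the Live API's bidirectional protocol; the &lt;code&gt;search_fn&lt;/code&gt; parameter stands in for the real Firestore search and is my illustration, not the production code:&lt;/p&gt;

```python
def is_tool_call(message: dict) -> bool:
    # Live API server messages carry tool invocations under "toolCall".
    return "toolCall" in message

def handle_tool_call(message: dict, search_fn) -> dict:
    # Run each requested function server-side and package the results
    # in the shape the Live API expects back ("toolResponse").
    responses = []
    for call in message["toolCall"]["functionCalls"]:
        if call["name"] == "search_harikatha":
            result = search_fn(**call.get("args", {}))
        else:
            result = {"error": f"unknown tool {call['name']}"}
        responses.append({
            "id": call.get("id"),
            "name": call["name"],
            "response": {"result": result},
        })
    return {"toolResponse": {"functionResponses": responses}}
```

&lt;p&gt;The browser never sees any of this; it only receives the final speech response once Gemini has folded the search results into its answer.&lt;/p&gt;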

&lt;h2&gt;
  
  
  Function Calling: Teaching Gemini to Search
&lt;/h2&gt;

&lt;p&gt;Function calling with the Gemini Live API was surprisingly elegant. I defined the search tool like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_harikatha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the harikatha corpus for teachings related to the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question. Use this when the user asks a spiritual question about bhakti, meditation, Krishna consciousness, or related topics.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The semantic essence of the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question in 1-2 sentences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;array&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Relevant Sanskrit/Hindi spiritual terms (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bhakti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;saranagati&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uttama-bhakti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini learned to call this automatically. The moment a user asks about spiritual practice, karma, or devotion, it triggers a search. What impressed me most was that the model understood when &lt;em&gt;not&lt;/em&gt; to search—casual greetings and off-topic questions skip the tool entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vector Search Challenge: Embedding Spiritual Vocabulary
&lt;/h2&gt;

&lt;p&gt;Here's where things got interesting. Gemini embeddings work brilliantly on English, but what about Sanskrit and Hindi spiritual terms?&lt;/p&gt;

&lt;p&gt;I tested vectors for words like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;uttama bhakti&lt;/strong&gt; (supreme devotion)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;saranagati&lt;/strong&gt; (surrender)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bhava&lt;/strong&gt; (emotional state in spiritual practice)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rasa&lt;/strong&gt; (relish, taste of devotion)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Gemini Embedding API handled multilingual terms surprisingly well. I measured &lt;strong&gt;73-76% accuracy&lt;/strong&gt; when matching user queries to relevant harikatha segments by cosine similarity. That means when someone asks about surrender, the system correctly surfaces answers about saranagati roughly 3 out of 4 times.&lt;/p&gt;
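&lt;p&gt;The matching itself is plain cosine similarity between the query embedding and each segment embedding. A self-contained sketch, with toy 4-dimensional vectors standing in for real Gemini embeddings:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real 768-d embeddings:
query_vec   = [0.9, 0.1, 0.2, 0.1]   # "how do I surrender to God?"
segment_vec = [0.8, 0.2, 0.3, 0.1]   # segment tagged "saranagati"
off_topic   = [0.1, 0.9, 0.1, 0.8]   # unrelated segment

print(cosine_similarity(query_vec, segment_vec))  # high, close to 1
print(cosine_similarity(query_vec, off_topic))    # much lower
```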

&lt;p&gt;The indexing strategy was straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Segment the audio corpus into logical chunks (usually 1-3 minutes each)&lt;/li&gt;
&lt;li&gt;Transcribe each segment&lt;/li&gt;
&lt;li&gt;Generate embeddings using &lt;code&gt;gemini-embedding-001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Store metadata (audio URL, video timestamp, topic tags) in Firestore&lt;/li&gt;
&lt;li&gt;Index the embeddings for vector search&lt;/li&gt;
&lt;/ol&gt;
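&lt;p&gt;Steps 2 through 5 boil down to building one document per audio segment. A hedged sketch of what such a record could look like — the field names are my illustration, and &lt;code&gt;embed_fn&lt;/code&gt; stands in for a call to &lt;code&gt;gemini-embedding-001&lt;/code&gt;:&lt;/p&gt;

```python
def build_segment_doc(segment_id, transcript, audio_url, video_ts, topics, embed_fn):
    # One Firestore document per audio chunk: the embedding drives
    # vector search, the metadata lets the client play the real voice.
    return {
        "segment_id": segment_id,
        "transcript": transcript,
        "audio_url": audio_url,             # where the real recording lives
        "video_timestamp": video_ts,        # seconds into the source video
        "topics": topics,                   # e.g. ["saranagati", "bhakti"]
        "embedding": embed_fn(transcript),  # gemini-embedding-001 in production
    }
```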

&lt;p&gt;The hardest part wasn't the embeddings—it was the segmentation. Harikatha discourse is philosophical and flowing. Where do you split a 2-hour lecture without breaking the meaning?&lt;/p&gt;
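&lt;p&gt;One workable starting point for the algorithmic half is greedy sentence-window chunking with overlap, so a thought split at a chunk boundary still appears whole in at least one chunk. This is a sketch of the idea, not the exact algorithm used, and the parameters are illustrative:&lt;/p&gt;

```python
def chunk_transcript(sentences, max_words=400, overlap=1):
    # Greedy sentence-window chunking: fill a chunk up to ~max_words,
    # then start the next chunk `overlap` sentences back so a thought
    # cut at the boundary is still intact somewhere.
    chunks, current, count = [], [], 0
    for s in sentences:
        words = len(s.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            count = sum(len(x.split()) for x in current)
        current.append(s)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

&lt;p&gt;Each chunk then gets transcribed text, an embedding, and metadata before landing in Firestore; the manual-review pass fixes the places where a philosophical thread runs across chunk boundaries anyway.&lt;/p&gt;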

&lt;h2&gt;
  
  
  The Thinking Text Problem
&lt;/h2&gt;

&lt;p&gt;Here's a problem I didn't anticipate: &lt;strong&gt;Gemini's native audio model leaks internal reasoning.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When using &lt;code&gt;gemini-2.5-flash-native-audio-preview&lt;/code&gt;, the model's internal reasoning process sometimes appears in the text response as well as the audio. You'd get something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Let me search for teachings on this topic... [thinking] The user is asking about karma, I should search for relevant segments... [/thinking] Here's the answer from Maharaja..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Users don't want to hear the backend deliberation. So I added regex-based filtering to strip out thinking markers before passing responses to audio generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_thinking_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove thinking blocks from response text
&lt;/span&gt;    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\[thinking\](.*?)\[/thinking\]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Let me search.*?\.\.\.\n*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a blunt instrument, but it works. Ideally, the API would have a parameter to disable thinking output entirely, but for now, regex is our friend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Always-On Mic Disaster
&lt;/h2&gt;

&lt;p&gt;Early versions used an always-on microphone—the browser constantly streamed audio to the backend. I thought this would create a more natural experience, like talking to someone in the room.&lt;/p&gt;

&lt;p&gt;It didn't. Instead, ambient noise—a car passing, a dog barking, a fan in the background—would trigger searches. The system would think you asked a spiritual question when you were just living your life.&lt;/p&gt;

&lt;p&gt;I switched to &lt;strong&gt;push-to-talk&lt;/strong&gt;: users hold a button to record. It's less flashy, but it actually works. Sometimes the simplest interface is the right one.&lt;/p&gt;
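&lt;p&gt;For context on why the always-on approach failed: the naive fix is an energy gate on the incoming PCM, forwarding only frames whose RMS level clears a threshold. A minimal sketch of such a gate (my illustration, not part of the shipped app) shows the problem — steady ambient noise can clear any threshold that speech also clears:&lt;/p&gt;

```python
import math

def rms(samples):
    # Root-mean-square level of one PCM frame (floats in [-1, 1]).
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def should_forward(frame, threshold=0.05):
    # Crude voice-activity gate: drop frames quieter than the threshold.
    # A fan or passing car often still clears it, which is exactly why
    # push-to-talk ended up being the robust fix.
    return rms(frame) >= threshold

print(should_forward([0.001] * 160))     # quiet room frame: False
print(should_forward([0.3, -0.3] * 80))  # speech-level frame: True
```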

&lt;h2&gt;
  
  
  Real Voice vs Synthetic: A Philosophical Stand
&lt;/h2&gt;

&lt;p&gt;This is the core decision that shaped the entire project.&lt;/p&gt;

&lt;p&gt;Every modern voice AI generates synthetic speech. It's impressive—voices sound natural, modulation is perfect. But it's not real. It's an imitation.&lt;/p&gt;

&lt;p&gt;Building Harikatha Live Agent taught me that this distinction matters, especially in spiritual contexts. When seekers hear Narayana Maharaja answer their question, they're not hearing an approximation or a trained model. They're hearing the actual person—his voice, his cadence, his presence.&lt;/p&gt;

&lt;p&gt;The tradeoff is obvious: synthetic speech can answer any question instantly. Retrieved audio can only answer questions that were already addressed in the corpus. But within that constraint, authenticity wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini Live API&lt;/strong&gt; (gemini-2.5-flash-native-audio-preview) for real-time conversational reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini Embedding API&lt;/strong&gt; (gemini-embedding-001) for vector search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI + WebSockets&lt;/strong&gt; for the proxy layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Run&lt;/strong&gt; for serverless hosting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firestore&lt;/strong&gt; for both vector indexing and metadata storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for containerization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla HTML/JS&lt;/strong&gt; frontend (no frameworks—just WebSocket calls and audio playback)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Gemini's Live API is a game-changer for real-time agents.&lt;/strong&gt; The audio-native model eliminates the latency and quality loss of separate transcription/generation pipelines. It's genuinely different from prior APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Function calling is the secret sauce.&lt;/strong&gt; Seamless tool integration means the model can orchestrate complex workflows without you building state machines. Gemini just calls your functions when it needs them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Firestore + embeddings is a viable vector search solution.&lt;/strong&gt; You don't always need Pinecone or Weaviate. If your dataset fits in Firestore and your QPS is reasonable, it works fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The hardest problem is never the technology.&lt;/strong&gt; It was segmenting and indexing the corpus accurately. The first attempt at automatic segmentation was a disaster; spiritual teachings don't have natural breakpoints. I ended up with a hybrid approach: algorithmic chunking with manual review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Authenticity matters.&lt;/strong&gt; In an age of synthetic everything, users actually notice and care when something is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The Harikatha Live Agent is just the beginning. Future work includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expanding the corpus to other spiritual teachers and traditions&lt;/li&gt;
&lt;li&gt;Adding support for more languages (Gujarati, Tamil, Bengali)&lt;/li&gt;
&lt;li&gt;Building an admin panel for corpus management&lt;/li&gt;
&lt;li&gt;Implementing feedback loops so the system learns which segments users find most helpful&lt;/li&gt;
&lt;li&gt;Eventually, building this into a structured library of spiritual knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For now, the system is live, and seekers are asking questions. Maharaja's voice is answering. It's not perfect, but it's real.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Building the Harikatha Live Agent for the #GeminiLiveAgentChallenge has been a reminder that the best tools aren't the ones with the most features—they're the ones that disappear. When the technology becomes transparent, what remains is just a student, a question, and a teacher's voice. That's what we built.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;#GeminiLiveAgentChallenge&lt;/p&gt;

</description>
      <category>agents</category>
      <category>devchallenge</category>
      <category>gemini</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
