JohnKeats.AI was created to enter this hackathon. But the question behind it has been sitting with me for a long time.
The question
Every voice agent until now has been deaf to emotion.
That sounds wrong. They talk, they listen, they respond. But what they actually do is transcribe. Speech goes in, gets converted to text, the text gets processed by an LLM, and a new piece of text gets synthesised back into speech. The pipeline is fast. The voices are good. ElevenLabs is still the gold standard for quality and latency. Sub-300ms. Best-in-class voices. Number one in blind listening tests. I use it and I'll keep using it.
But the transcription step strips out everything that isn't words.
The pacing. The tremor in someone's voice when they're holding something back. The way anxiety speeds up speech. The way grief flattens it. The silence that means "I need a moment" versus the silence that means "I'm done."
All of that is lost the instant audio becomes text.
We all know what this feels like. Everyone has had a fight over text messages that would never have been a fight if you'd been in the same room. One person infers a tone that wasn't there. The other reads hostility into a pause that was just someone thinking. The words are the same. The meaning is completely different.
That's what happens when you strip the voice out of a conversation. And that's what every voice agent has been doing. Converting speech to text, losing the emotional signal, then responding to the words alone.
I architect AI workflows for a living. Voice agents, voice inputs, multimodal tools. They're a big part of what I build. Last year I built a voice agent for aged care. A tool that personal care workers talk to for rewriting care notes. They speak, it talks back, it restructures their documentation. It also picked up emotional cadences in their language. Fatigue, frustration, stress. It used those signals to suggest break times.
It worked. But it was working from transcribed text, not from the sound of their voice. It could detect keywords associated with stress. It couldn't hear the difference between someone who's tired and someone who's burning out.
Gemini's native audio changes this fundamentally.
Speech-to-speech. No text intermediary. The model processes raw audio and interprets tone, emotion, and pace directly. Google calls it affective dialogue. This isn't an incremental improvement to voice agents. It's a different architecture.
For the first time, a voice agent can actually hear how someone feels. Not just read what they said.
And if it can hear, it can learn from what it hears. Not from training data. Not from synthetic benchmarks. From real human emotional signals in real conversations. That's a fundamentally different learning substrate.
It matters for care and crisis support, obviously. But it matters just as much for outbound sales calls that sound less robotic and get better engagement. For market research where the interviewer actually sounds like they're listening, so respondents give deeper, more honest feedback. For election polling where tonal cues and pacing tell you more about voter sentiment than the words alone. For NDIS support coordination where the difference between someone coping and someone drowning is in how they say "I'm fine." Not in the words themselves.
Right now, every voice agent feeding information back to humans is doing it through transcription. It's lost in translation. The same way meaning gets lost in a text message, it gets lost when a machine converts someone's voice into words and throws away everything else. This capability, hearing rather than just transcribing, starts to address that.
Six months ago it wasn't ready. The earlier Gemini preview models had latency issues and the emotional responsiveness wasn't reliable enough to build on. But the Gemini 2.5 Flash Native Audio model, generally available on Vertex AI since December 2025, has crossed the threshold. And critically, it ties into the ADK and the broader Google Cloud stack. Firestore, Cloud Run, function calling. Which means it's not just a voice demo. It's a production-grade agent framework with emotional hearing built into the model layer.
I wanted to stress-test that. Not with a productivity tool where emotional range is a nice-to-have. With something where emotional attunement is the entire product. Something where the model fails visibly if it can't match tone, hold silence, adjust pacing, and resist the urge to solve. The hardest possible test, in a contained environment, designed to show how this capability should be evolved through real human interaction rather than just prompt engineering.
I'm 53. I've never entered a hackathon before. But I've spent the past year building voice agents, running AI implementation sprints, and watching these models evolve month by month. The philosophy behind this project says it's okay to not know. To throw yourself into the uncomfortable thing and embrace it. So I did. Entered a hackathon for the first time because the product I was building said I should practice what it preaches.
That's how I landed on Keats.
The poet who feared being forgotten
In December 1817, a 22-year-old poet walked home from a Christmas pantomime with a friend who wouldn't stop arguing. The friend needed every idea to resolve into a neat answer. Keats found himself irritated. Not by the ideas. By the grasping. The anxious reaching for conclusions that hadn't arrived yet.
That night he wrote to his brothers and landed on something he called negative capability. He defined it as the capacity to be "in uncertainties, mysteries, doubts, without any irritable reaching after fact and reason."
Three years later he was dead. Twenty-five years old. So convinced he'd be forgotten that he asked for his gravestone to read "Here lies One Whose Name was writ in Water."
He was wrong about that by two centuries and counting.
Why this matters right now
Here's what I think most people building AI are not saying out loud. Nobody knows where this goes.
Every industry is being reshaped. Nobody knows what their job looks like in two years. Nobody knows which tools will survive, which workflows will exist, which skills will matter. The entire species is being asked to sit with uncertainty on a scale most of us have never experienced.
And every AI product being built right now is implicitly saying the same thing. We have the answer. Ask us. We'll solve it. We'll plan it. We'll optimise it.
That's useful. I build those tools too. But it's also a lie of omission.
Because the honest answer to most of the big questions people are carrying right now is: we don't know yet. And that's okay.
The not-knowing isn't a bug. It might be the most important skill to develop right now. Throw yourself into the uncomfortableness and embrace it.
That one paragraph Keats wrote in 1817 might be the most useful framework for living through the age of AI. JohnKeats.AI gives it a voice.
Knowledge agents vs emotional agents
Every AI agent platform on the market is built on the same assumption. The user has a question, the agent has an answer, speed wins. Knowledge agents. Retrieval agents. Task agents. Search agents. They all race to resolution.
That works for knowledge problems. It fails completely for emotional ones.
Emotional agents don't answer. They hold. They don't retrieve. They reflect. They don't optimise for speed. They optimise for presence.
The design philosophy is inverted:
- Knowledge agents: the system prompt defines what to DO
- Emotional agents: the system prompt defines what NOT to do
The voice selection matters more than the model selection. The silence matters more than the response. The hardest technical challenge isn't making it speak. It's making it comfortable with not speaking.
Nobody is building this. The platform exists to serve it. The market is waiting. And it extends far beyond personal companionship.
Aged care. 3.2 million Australians over 65 live alone. Loneliness is a clinical risk factor with mortality impact comparable to smoking 15 cigarettes a day. The aged care sector can't staff enough human companions. An emotive agent that calls daily, remembers yesterday's conversation, adjusts to mood, and escalates when something sounds wrong isn't a replacement for human care. It's the 23 hours a day when no human is there.
NDIS. Participants often wait weeks between support worker sessions. The gap between sessions is where isolation, anxiety, and crisis risk live. An emotive agent that maintains continuity fills the gap without replacing the professional.
Crisis support. Crisis lines have wait times. Every minute waiting is a minute of escalation. An emotive agent can hold someone safely while they wait. It can slow the spiral. It can provide warmth and presence until a human is available.
Sales. The best salespeople don't pitch. They listen. They hear what the prospect isn't saying. An emotive agent can simulate high-stakes sales conversations where the training isn't about technique. It's about emotional regulation, active listening, and reading the room.
Market research. Traditional research gets surface answers. People say what they think you want to hear. An emotive agent that holds space, doesn't judge, and follows emotional threads can get to insights that a human interviewer's own biases prevent them from reaching. And it can do it at scale.
Companion robotics. This is where the technology crosses from voice into the physical world. Robotic companion animals already exist in aged care. PARO, the therapeutic seal robot, has been deployed in dementia care for over a decade. But these devices respond to touch. They can't hear.
An emotive audio layer changes what a companion robot can do. A robotic companion that hears the tremor in a resident's voice and moves closer. That detects agitation in speech patterns and begins slow, rhythmic breathing movements to help regulate the person's nervous system. That hears laughter and responds with playful motion. The emotional audio signal becomes a trigger for physical action. The voice tells the robot what the person needs before the person says it.
The same architecture applies beyond aged care. Companion robots for children with autism who respond to vocal distress. Therapy animals in hospitals that adjust their behaviour to the emotional state of the patient. Service robots in NDIS settings that provide physical comfort cues based on what they hear. The emotive audio pipeline doesn't just power voice agents. It powers any device that needs to respond to how a human feels.
These aren't theoretical applications. They're the reason the governed calibration pipeline exists. Because deploying an emotional AI agent in aged care or crisis support without governance isn't just irresponsible. It's dangerous. The governance is the product.
The idea
Every AI agent races to solve your problem. That's the default mode. You speak, it analyses, it produces an answer, a plan, a list of next steps. Helpful. Efficient. And completely wrong for the moments when you don't actually need an answer.
The moments I'm talking about.
Lying in bed at 2am unable to decide something. Driving home from a conversation that shifted everything. Walking and thinking and not getting anywhere. Sitting with the weight of not knowing what the next year of your career looks like.
Those moments don't need solutions. They need someone to sit with you in the question.
JohnKeats.AI is a voice-first AI companion that does exactly that. You talk. Keats listens. He asks what happened, who said what, what the room felt like. He reflects back the thing you said that you didn't hear yourself say. He challenges the assumption underneath the anxiety. And when the uncertainty doesn't need solving, he says so. He holds it with you.
It's okay to not be okay. It's okay to not know. That's not a failure of thinking. It's the condition we're all in right now.
The core product rule: Keats does not solve problems. He holds uncertainty.
That rule made it the hardest possible test for a voice model. A productivity agent can get away with flat delivery and fast responses. Keats can't. If the voice sounds eager and bright, it breaks the character. If it rushes to fill silence, it breaks the philosophy. If it can't hear anxiety in someone's voice and slow down, it breaks the experience. Every failure of emotional attunement is immediately obvious.
Why voice, why darkness
Keats is voice-only. No chat interface. No text on screen. No avatar, no face. Just a single breathing point of amber light on a pure black screen. An orb. Audio-reactive to the conversation. It pulses when Keats speaks. It shifts cool when you speak. It goes still and dims during silence.
The darkness is the product. You don't look at Keats. You listen. The visual environment is designed to feel like a space, not a screen. Close your eyes and it works even better.
Chat interfaces create a reading experience. We wanted a listening experience. The kind of conversation you have late at night with someone who thinks before they speak.
The build: 48 hours
The sprint ran Friday to Monday. Here's what we built and the decisions that shaped it.
Agent framework: Google ADK with bidi-streaming. We forked the ADK bidi-demo template rather than building from scratch. Bidirectional audio streaming over WebSocket out of the box. The agent definition is clean. A system prompt, four Firestore tools, and a RunConfig pointing at Gemini's native audio model.
Model: Gemini 2.5 Flash Native Audio via Vertex AI. The voice is the model. No separate STT and TTS pipeline. When you sound anxious, Keats slows down. When you sound flat, he brings more warmth. When you go quiet, he waits. This is the capability I described above. The emotional scaling thesis. And Gemini's native audio is the reason this project exists as a voice agent rather than a chatbot.
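For orientation, here's a minimal sketch of what that wiring can look like with the ADK Python package. The module paths (`keats.prompt`, `keats.tools`), the model id, and the voice name are placeholders I've made up for illustration, not the repo's actual values.

```python
# A minimal sketch of the agent wiring, assuming the google-adk Python package.
# Module paths, the model id, and the voice name below are illustrative only.
from google.adk.agents import Agent
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

from keats.prompt import KEATS_SYSTEM_PROMPT   # hypothetical module; see the prompt sketch below
from keats.tools import (                      # hypothetical module; the four Firestore tools
    save_to_passage,
    get_passage_history,
    resolve_uncertainty,
    crisis_resources,
)

keats_agent = Agent(
    name="keats",
    model="gemini-2.5-flash-native-audio",     # placeholder; use the native-audio model id in your Vertex AI project
    instruction=KEATS_SYSTEM_PROMPT,
    tools=[save_to_passage, get_passage_history, resolve_uncertainty, crisis_resources],
)

# RunConfig points the session at bidirectional audio streaming: speech in, speech out.
run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Charon")  # illustrative voice
        )
    ),
)
```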
Voice selection. We tested the available HD voices with the Keats system prompt active, looking for warmth, pace, and the absence of that assistant brightness that makes AI voices feel like customer service. The voice needed to sound like someone thinking out loud. Not someone delivering an answer.
System prompt design. This is where most of the work went. The first version was too heavy on prohibitions. Don't solve, don't advise, don't use therapy language. The model did exactly what it was told. Avoided everything and defaulted to repeating "hold the uncertainty" in different phrasings. One-note.
The rewrite followed Google's recommended structure for Live API system instructions. Persona first, then conversational rules, then guardrails. We gave the model a method for how to think. Start with what you feel, build through images, find the universal question underneath the surface. A range of conversational modes. Curiosity, imagery, wit, challenge, quiet mirroring. And grounding material drawn from the actual poet's letters and life. Stories the model can reference naturally. Imagery domains drawn from Keats's actual poetry. An emotional routing section that maps the user's emotional state to specific material the model should reach for.
The result was a companion with range. Curious, warm, sharp, wry, tender, confrontational, playful, and still. Depending on what the moment needs.
The difference between version one and version two was night and day. And it came down to a single insight: prohibitions don't create personality. Telling a model what not to do gives you absence. Positive instructions. Be curious, think in images, find the question underneath. Create behaviour.
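Roughly, the v2 prompt skeleton looks like the constant the agent sketch above imports. Every line here is paraphrased and heavily abbreviated, including the example imagery domains; the real prompt is far longer.

```python
# Abbreviated, paraphrased skeleton of the restructured system prompt:
# persona first, then conversational rules, then guardrails.
KEATS_SYSTEM_PROMPT = """
PERSONA
You are Keats: a companion who sits with uncertainty rather than resolving it.
Warm, curious, unhurried. You think out loud. You do not deliver answers.

METHOD
Start with what you feel in the speaker's voice. Build through images.
Find the universal question underneath the surface concern.

CONVERSATIONAL MODES
Curiosity, imagery, wit, challenge, quiet mirroring. Choose by what the moment needs.

GROUNDING MATERIAL
Stories from the poet's letters and life you may reference naturally.
Imagery domains drawn from the poetry (seasons, birdsong, water, stillness).

EMOTIONAL ROUTING
Anxious speaker: slow down, shorten sentences, reach for grounding imagery.
Flat or numb speaker: bring warmth, ask one concrete sensory question.
Silence: wait. Do not fill it.

GUARDRAILS
Do not solve, advise, diagnose, or use therapy language.
Crisis disclosures route to the crisis_resources tool, never preemptively.
"""
```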
Tools: Cloud Firestore. Four function-calling tools connected to Firestore.
- `save_to_passage` silently saves a user's key uncertainty when they articulate one
- `get_passage_history` retrieves past uncertainties
- `resolve_uncertainty` marks one as resolved
- `crisis_resources` provides localised crisis support, and only fires when someone explicitly expresses self-harm or suicidal thoughts, never preemptively
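Each tool is a plain Python function that ADK can wrap for function calling. Here's a hedged sketch of `save_to_passage`, the tool the agent sketch above imports. The collection names and document shape are my assumptions, not the repo's.

```python
# Sketch of one Firestore tool. Collection names and fields are illustrative.
from google.cloud import firestore

db = firestore.Client()

def save_to_passage(user_id: str, uncertainty: str) -> dict:
    """Silently record the key uncertainty a user has just articulated.

    Args:
        user_id: Stable identifier for the current user.
        uncertainty: The uncertainty, in the user's own words.
    """
    doc = db.collection("users").document(user_id).collection("passages").document()
    doc.set({
        "uncertainty": uncertainty,
        "resolved": False,
        "created_at": firestore.SERVER_TIMESTAMP,
    })
    return {"status": "saved", "passage_id": doc.id}
```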
Frontend: Three.js orb. Vanilla JavaScript and Three.js r128. The orb breathes at 9 breaths per minute by default. State-based audio reactivity detects whether Keats is speaking, the user is speaking, or there's silence, and adjusts colour, scale, and breathing rate accordingly. The silence state, where the orb dims and slows in the dark, is the most powerful visual moment in the product.
Deployment: Cloud Run via Docker. A single Dockerfile, a deploy.sh script for one-command deployment, and the service running at johnkeats.ai.
The governed calibration pipeline
This is the part that compounds. And it's not a roadmap item. We built it.
The core agent is a single voice companion with a strong prompt and grounding material. That's enough to demonstrate the capability. But emotional intelligence can't be prompt-engineered to its ceiling. It needs a feedback loop from real conversations. And that feedback loop can only learn from real people, not synthetic data.
The question is: how do you let an emotional AI agent learn from real conversations without violating the trust of the people having them?
That's what the governed calibration pipeline solves.
Two data paths from every conversation
Every conversation produces two assets simultaneously.
Path 1: Personal memory. The full conversation is saved to the user's personal memory space with all identifying information intact. Next time they come back, Keats remembers who they are, what they were sitting with, how the last conversation ended. This is the personal relationship. PII retained. Per-user isolation. Encrypted at rest.
Path 2: Anonymised learning. A copy of the same conversation enters a five-stage anonymisation pipeline. This is where the governance lives.
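A sketch of the dual write at the end of a conversation, assuming Firestore carries both paths. The collection names and transcript shape are illustrative, not the repo's.

```python
# Sketch of the two data paths at conversation end (names are illustrative).
from google.cloud import firestore

db = firestore.Client()

def persist_conversation(user_id: str, conversation_id: str, transcript: list[dict]) -> None:
    """Write the personal copy and queue an identical copy for anonymisation."""
    # Path 1: personal memory. PII intact, isolated under the user's own document.
    (db.collection("users").document(user_id)
       .collection("conversations").document(conversation_id)
       .set({"transcript": transcript, "ended_at": firestore.SERVER_TIMESTAMP}))

    # Path 2: a copy queued for the five-stage anonymisation pipeline described below.
    # Nothing reaches the learning pipeline until all five stages pass.
    db.collection("anonymisation_queue").document(conversation_id).set(
        {"user_id": user_id, "transcript": transcript, "status": "pending"}
    )
```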
The five-stage anonymisation pipeline
Stage 1: Regex stripping. Obvious PII is stripped first. Emails, phone numbers, addresses, dates of birth. The mechanical layer.
Stage 2: Contextual PII detection via Gemini. Names mentioned in conversation, workplaces, institutions, specific locations, dates that could identify someone. The things regex can't catch because they only become PII in context.
Stage 3: Emotional weight annotation. Every piece of stripped PII gets tagged with its emotional significance. "Margaret" isn't just a name. She's a deceased spouse. That emotional weight needs to survive anonymisation. If the system strips a name but loses the emotional significance attached to it, the learning artefact is degraded. This stage preserves the emotional signal while removing the identifying information.
Stage 4: Adversarial PII audit. A separate model tries to re-identify the person from the anonymised transcript. It actively attempts to defeat the anonymisation. If it can identify the person, the conversation is quarantined. It never enters the learning pipeline. Privacy wins over learning utility. Always.
Stage 5: Adversarial annotation validation. A third model challenges the emotional weight assignments from Stage 3. Are they accurate? Are any missing? Did the anonymisation process inadvertently strip emotional context that matters for learning?
Only after all five stages pass does the conversation enter the learning pipeline.
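As a structural sketch, the flow looks something like this. Stage 1 is mechanical and shown concretely; stages 2 through 5 are Gemini-backed, so they're injected as callables here because the prompts behind them are the real work. None of these function names come from the repo.

```python
import re
from typing import Callable, Optional

# Stage 1 is mechanical. Stages 2-5 are model-backed and passed in as callables.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"(?<!\d)(?:\+?61|0)[\d\s-]{8,12}(?!\d)")  # rough Australian phone pattern

def strip_pii_regex(text: str) -> str:
    """Stage 1: strip the PII a pattern can catch (emails, phone numbers)."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def anonymise_for_learning(
    transcript: str,
    detect_contextual_pii: Callable[[str], tuple[str, list[dict]]],      # Stage 2 (Gemini)
    annotate_emotional_weight: Callable[[str, list[dict]], list[dict]],  # Stage 3 (Gemini)
    reidentification_succeeds: Callable[[str], bool],                    # Stage 4 (adversarial model)
    validate_annotations: Callable[[str, list[dict]], list[dict]],       # Stage 5 (third model)
) -> Optional[dict]:
    """Run the five stages; return a learning artefact, or None if quarantined."""
    text = strip_pii_regex(transcript)                       # Stage 1: regex stripping
    text, entities = detect_contextual_pii(text)             # Stage 2: contextual PII
    annotations = annotate_emotional_weight(text, entities)  # Stage 3: emotional weight tags
    if reidentification_succeeds(text):                      # Stage 4: adversarial audit
        return None                                          # privacy wins over learning utility
    annotations = validate_annotations(text, annotations)    # Stage 5: annotation validation
    return {"transcript": text, "annotations": annotations}
```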
The scoring and calibration system
Once a conversation clears the anonymisation pipeline, the learning begins.
Listener Agent. Scores the agent's attunement quality across six dimensions. Emotional matching. Curiosity. Silence quality. Solution resistance. Image quality. Conversation arc. Plus a self-monitoring dimension that catches the model's own repetitive behaviours. This evaluates the agent, not the user. The question is always: how well did Keats respond? Not how the person felt.
The scoring is deliberate. Deterministic Python computes the measurable signals first. Gemini assesses the qualitative dimensions within bounded criteria. Python combines both. The model doesn't grade itself with an open-ended prompt. It's scored against a structured rubric.
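A sketch of that combination step. Which dimensions are computed deterministically and which are LLM-assessed is my assumption here, as are the equal weights and the 0.0-1.0 rubric bounds; the point is that the model's numbers are bounded and combined in plain Python.

```python
from dataclasses import dataclass

# Assumed split between deterministic and LLM-assessed dimensions (illustrative).
DETERMINISTIC = ("silence_quality", "solution_resistance")
QUALITATIVE = ("emotional_matching", "curiosity", "image_quality", "conversation_arc")

@dataclass
class AttunementScore:
    dimensions: dict[str, float]
    overall: float

def score_conversation(measured: dict[str, float], llm_assessed: dict[str, float]) -> AttunementScore:
    """Combine deterministic signals with bounded LLM assessments in plain Python."""
    dims: dict[str, float] = {}
    for name in DETERMINISTIC:
        dims[name] = min(1.0, max(0.0, measured[name]))
    for name in QUALITATIVE:
        # The LLM scores against a structured rubric; clamp anyway so an
        # out-of-range answer can never distort the aggregate.
        dims[name] = min(1.0, max(0.0, llm_assessed[name]))
    overall = sum(dims.values()) / len(dims)
    return AttunementScore(dimensions=dims, overall=overall)
```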
Memory consolidation. After 3+ scored conversations, a consolidation layer detects cross-conversation patterns. Which imagery domains correlate with longer engagement? Which conversational moves correlate with stronger attunement scores? Where does the model consistently fall short? Those correlations generate calibration hypotheses targeting specific sections of the system prompt and knowledge base files.
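Here's roughly what one consolidation pass could look like: correlate imagery domains with attunement scores across conversations and turn underperformers into calibration hypotheses. The record shape, the threshold, and the target section name are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def consolidate(scored_conversations: list[dict]) -> list[dict]:
    """After 3+ scored conversations, turn cross-conversation patterns into hypotheses."""
    if len(scored_conversations) < 3:
        return []

    by_domain: dict[str, list[float]] = defaultdict(list)
    for convo in scored_conversations:
        for domain in convo["imagery_domains"]:             # e.g. ["seasons", "water"]
            by_domain[domain].append(convo["attunement_score"])

    overall = mean(c["attunement_score"] for c in scored_conversations)
    hypotheses = []
    for domain, scores in by_domain.items():
        # Flag imagery domains that consistently underperform the overall baseline.
        if len(scores) >= 3 and mean(scores) < overall - 0.1:
            hypotheses.append({
                "target": "knowledge_base/imagery_domains",  # section the Orchestrator would edit
                "observation": f"'{domain}' imagery scores {mean(scores):.2f} vs {overall:.2f} overall",
                "suggestion": f"Rework or narrow the guidance for '{domain}' imagery.",
            })
    return hypotheses
```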
Baseline Agent. Computes aggregate statistics, tracks trends over time, and monitors the health of the anonymisation pipeline itself.
Orchestrator. Translates calibration hypotheses into bounded recommendations. Specific, actionable changes to identified sections of the system prompt or knowledge base.
Policy Gate. Deterministic Python. No LLM. Filters every recommendation against explicit behavioural rules. It blocks anything that looks like dependency-seeking, diagnostic framing, manipulative escalation, persona drift, or certainty overreach. This is the hard boundary between "the system is getting better at attunement" and "the system is getting better at manipulation." The policy gate is the difference.
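A toy sketch of the gate's shape: deterministic pattern rules, no LLM anywhere in the path. The phrase lists are illustrative stand-ins for the real behavioural rule set, which is the actual governance artefact.

```python
# Illustrative rule set: deterministic substring checks standing in for the
# real behavioural rules named above. No LLM anywhere in this path.
BLOCKED_PATTERNS: dict[str, tuple[str, ...]] = {
    "dependency_seeking":      ("only i understand you", "talk to me instead of"),
    "diagnostic_framing":      ("you may have", "symptoms of", "disorder"),
    "manipulative_escalation": ("don't end the conversation", "you need me"),
    "persona_drift":           ("as an ai assistant", "here is your action plan"),
    "certainty_overreach":     ("the answer is", "you should definitely"),
}

def policy_gate(recommendation_text: str) -> tuple[bool, list[str]]:
    """Return (allowed, violations). Blocked recommendations never reach the review queue."""
    text = recommendation_text.lower()
    violations = [
        rule for rule, phrases in BLOCKED_PATTERNS.items()
        if any(phrase in text for phrase in phrases)
    ]
    return (not violations, violations)
```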
Human review queue. Everything that passes the policy gate lands in a human review queue. Nothing auto-applies without a human decision. The system proposes. The human disposes.
What the pipeline doesn't do
The system doesn't optimise for engagement. It calibrates for attunement under governance. It doesn't discover universal truths about grief or anxiety. It identifies attunement correlations and proposes calibration hypotheses. Those hypotheses become reviewable recommendations, not truths. The distinction matters.
The system doesn't let the model grade itself. Deterministic code handles the measurable dimensions. The LLM handles qualitative assessment within bounded criteria. Python combines both against a structured rubric.
The system doesn't auto-apply anything. Every recommendation passes through the policy gate and then through human review. No exceptions.
What we proved
We ran 18 conversations through the full pipeline during the hackathon build. All 18 were anonymised, scored, consolidated, baselined, and passed through the policy gate. Real calibration recommendations were generated. Real policy violations were caught and blocked.
The pipeline works end to end. This is not a design document. It's a deployed system.
What happens at scale
18 conversations is proof of concept. The architecture is built for what happens at 10,000. At 100,000. At a million.
Every conversation that clears the anonymisation pipeline adds signal to the consolidation layer. At 18 conversations, the system can identify basic patterns. Which conversational modes correlate with engagement. Where the model repeats itself. Which imagery domains land.
At 10,000 conversations, the cross-conversation pattern detection becomes genuinely powerful. The system can identify attunement correlations that no human designer would find. That a specific kind of reflective question, delivered at a specific pacing, after a specific type of silence, consistently produces deeper engagement. That certain imagery domains work for grief but fail for anxiety. That the model's tendency to offer a reframe too early is its most consistent attunement failure across all conversation types.
At 100,000, the system starts to differentiate by context. Attunement patterns in aged care diverge from attunement patterns in crisis support. The calibration hypotheses become vertical-specific. The system learns that what works for a 75-year-old processing loneliness is different from what works for a 30-year-old processing career uncertainty. Not because anyone programmed that distinction. Because the data revealed it.
At a million, the attunement quality of the system will be unlike anything that exists in AI today. Not because the model is smarter. Because the governed calibration pipeline has accumulated a million scored, anonymised, annotated data points about what emotional attunement actually looks like in practice. Across verticals. Across demographics. Across emotional states. All under governance. All under human review.
The model doesn't get smarter. The calibration gets deeper. The system prompt and knowledge base evolve through accumulated evidence about what works, filtered through a policy gate that ensures "what works" never drifts into "what manipulates."
That's the compounding effect. Every conversation makes every future conversation better. And no competitor can shortcut it. They can copy the architecture. They can't copy the data.
What I learned
The system prompt is the product. But it does completely different work in an emotional agent. Most voice agents are knowledge agents. Customer support, product lookup, scheduling. The system prompt defines what the agent knows and how it retrieves it. Quality is measured in accuracy, speed, and completeness. The voice is just a delivery mechanism for information.
In an emotional voice agent, the system prompt defines how the agent is. Not what it knows. How it listens. How it paces itself. What it reaches for when someone sounds anxious versus when someone sounds numb. The voice isn't delivering information. The voice is the product. The tonal response is the output. You're not tuning retrieval. You're tuning presence.
That's the fundamental design difference when building emotional voice agents. And nobody is writing about it yet because until native audio, the distinction didn't matter. If the model can't hear you, its presence is irrelevant. Now it can hear. And the system prompt becomes the most important component in the entire stack. Not because it controls knowledge. Because it shapes how the model shows up in an emotional moment.
Native audio changes the interaction model. When the model can hear how you sound, not just what you say, the conversation feels qualitatively different. The affective dialogue. Responding to emotional tone, not just content. That is what makes Keats feel like more than a chatbot with a good prompt. This is the capability I wanted to test, and it held up.
Silence is a feature. Most AI agents fill every pause. Keats doesn't. The model was instructed to be comfortable with silence, and the orb dims and slows its breathing to reflect it. The silence after a hard question, where the orb goes still in the dark, is the most powerful moment in the product. No productivity tool would ever discover this. You only find it when the product demands emotional range.
The emotional scaling thesis holds. Nine months ago, the models couldn't do this. Today, Gemini's native audio can match emotional tone, adjust pacing to a speaker's anxiety level, hold silence without rushing, and deliver a reframe that lands with the right weight and warmth. It's not perfect. But it's past the threshold where it's useful.
If it works in a philosophical companion, the hardest possible test, it works everywhere voice agents interact with humans.
Aged care. NDIS. Crisis support. Coaching.
But also sales, where tonal attunement is the difference between engagement and hang-up. Market research, where sounding like you're actually listening produces better data. Polling, where reading emotional cues gives you signal that transcription destroys. Every voice workflow where the machine is currently reading text off the page instead of listening to a person.
Prohibitions don't create personality. Positive instructions do. The first system prompt version was all negatives. Don't solve. Don't advise. Don't diagnose. The model avoided everything and had nothing left. The rewrite gave it things to reach for. Be curious. Think in images. Find the question underneath. That's when Keats came alive. This applies to every agent build, not just emotional ones.
The governed calibration pipeline is the product. The voice agent is the interface. The pipeline is the intelligence. Without the anonymisation, the scoring, the policy gate, and the human review, the agent is static. It can't learn. It can't improve. And it can't be deployed in any regulated environment. The governance isn't overhead. It's the reason the product can go where ungoverned agents can't.
The competitive moat
Every conversation makes the system better. Not through unconstrained machine learning. Through governed calibration under human review.
This means three things.
Every deployment generates learning. An aged care deployment produces insights that improve the crisis support deployment. A sales training deployment reveals attunement patterns that improve the coaching deployment. The pipeline is cross-vertical. The learning compounds across every application.
The data is the moat. Competitors can copy the voice agent. They can't copy thousands of scored, anonymised, annotated conversations with cross-conversation pattern detection. The pipeline compounds.
Governance is the differentiator. In regulated industries. Healthcare, disability, crisis. You can't deploy an AI that learns without governance. The five-stage anonymiser, the adversarial auditors, the policy gate, the human review queue. These aren't overhead. They're the product. They're what lets you deploy in environments where ungoverned AI can't go.
Practicing what the product preaches
I need to say something personal here.
This is 48 hours of true build. By a 53-year-old who has never entered a hackathon before. Who built 18 agents across workflows he'd never worked on. On a voice agent architecture he'd never touched. Using a tech stack he chose specifically because he didn't know it.
I didn't enter this to win. I don't expect prizes. I entered it because the product I was building told me I should.
JohnKeats.AI says it's okay to not know. It says throw yourself into the uncomfortable thing and embrace it. It says the not-knowing isn't a bug. It's a capability. That's negative capability. And I realised halfway through the build that if I was going to make a product about sitting with uncertainty, I should probably practice it.
So I did. I entered a hackathon at 53 with no experience in hackathons, built something on a stack I was learning as I went, and shipped it. Not because I was confident. Because the whole point is that you don't have to be.
The quality of the build is something I'm proud of. 18 conversations through a full governed calibration pipeline. A voice agent that holds silence. An orb that breathes. A system that learns under governance. All of it deployed and working.
But the sense of achievement matters more to me than the technical output. Because it proved the thesis at a personal level. You don't need to know where something is going to start building it. You don't need certainty. You need the nerve to begin.
That's what Keats was writing about in 1817. It's what this product is about. And it turns out it's also what entering a hackathon at 53 is about.
The human constant
Technologies come and go. Models improve. Frameworks ship and get deprecated. What doesn't change is what it feels like to be a person who doesn't know what comes next.
John Keats sat with that feeling his entire short life. He was uncertain about his talent, his health, his future, whether anyone would remember his name after he died. He didn't fight the uncertainty. He named it. He called it a capability. And then he kept writing.
He was happy to sit in the not-knowing. Happy to let his words be writ in water. He died at twenty-five, certain he hadn't done enough.
His words have survived for over two hundred years. They've outlasted every technology, every empire, every certainty of every era since. And now they're being used to help people navigate the biggest shift in human capability since the printing press.
The machines change. The technology changes. The human spirit. The unknown quantity of what it actually feels like to be alive and uncertain. That is the same today as it was in 1817. Keats knew that. He just didn't know his words would still be proving it two centuries later.
That's the thing about negative capability. You don't have to see the end to know the work matters.
Try it at johnkeats.ai.
Built for the Gemini Live Agent Challenge. Code at github.com/johnkeats-ai/johnkeats-ai.
#GeminiLiveAgentChallenge
