This is a submission for the "New Year, New You" Portfolio Challenge
About Me
I am a Head of AI working across frontier research and cross-sector environments, designing and deploying AI systems intended for real-world use rather than demonstration alone.
I am an engineer by training and inclination, with a strategic lens shaped by working at the intersection of advanced capability, organisational decision-making, and societal impact.
My work focuses on making AI behaviour legible: exposing reasoning, intent, and system boundaries so that decisions are not merely produced, but understood.
This portfolio reflects that philosophy. It is not a showcase of outcomes, but of thinking — how complex AI systems are designed, constrained, and made observable in practice.
Portfolio
This portfolio demonstrates how modern AI systems can be designed to be inspectable, interpretable, and production-ready.
True multi-agent coordination
Queries are handled by four specialised agents rather than a single generalist model. A lightweight Coordinator analyses intent and routes each request to the appropriate specialist, covering technical implementation, research and methodology, or professional context. This mirrors how enterprise AI systems are structured in practice.
Observable reasoning traces
Every interaction exposes how the system arrives at an answer:
- which agent handled the query
- how prior context influenced the response
- token usage and latency
- the decision logic behind routing and generation
Observability here is not a visual flourish. It is a debugging and audit surface, designed to make AI behaviour understandable and accountable.
Session memory
Conversations retain context across turns, enabling coherent, multi-step dialogue that evolves naturally over time rather than resetting on each request.
Production-grade architecture
The system is built with Next.js 14, TypeScript, and Tailwind CSS, powered by Gemini 2.0 Flash and deployed on Google Cloud Run with automatic scaling. Performance targets are treated as constraints, not afterthoughts, with sub-two-second response times and a consistently high Lighthouse score.
Try It Yourself
Ask technical, research, or background-oriented questions and observe how the system responds.
Pay attention to:
- the agent routing indicator
- the live reasoning trace
- how earlier context is incorporated
- performance metrics updating in real time
The goal is not just to receive an answer, but to see how the answer was produced.
How I Built It
This portfolio was designed and delivered in less than a week, from concept to production deployment, with observability and safety treated as first-class architectural constraints rather than add-ons.
Architecture Overview
At its core is a true multi-agent system, following enterprise AI design patterns rather than prompt orchestration. Four specialised agents operate under a lightweight coordination layer:
- a Coordinator agent performs intent analysis and routing
- Projects, Research, and Career agents each maintain a narrow, well-defined knowledge domain
The Coordinator receives the user query plus a summarised conversation context and returns a single agent identifier. This avoids the common anti-pattern of querying all agents in parallel, reducing latency, cost, and response inconsistency while improving semantic routing accuracy.
Routing is LLM-based rather than keyword-driven, allowing intent to be inferred from both language and conversational history. In practice, this yields ~95% routing accuracy with a negligible latency overhead (~200–300 ms).
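The routing step described above can be sketched in TypeScript. This is an illustrative sketch rather than the actual implementation: `AgentId`, `buildRoutingPrompt`, and `parseAgentId` are hypothetical names, and the model call itself is omitted. The essential pattern is that the Coordinator's reply is validated against a closed set of identifiers, with a safe fallback if the model returns anything unexpected.

```typescript
// Hypothetical sketch of the Coordinator's routing step. The agent
// identifiers are assumed from the Projects / Research / Career split
// described in the post; the actual names may differ.

type AgentId = "projects" | "research" | "career";

const AGENT_IDS: AgentId[] = ["projects", "research", "career"];

// Build the routing prompt from the user query plus a summarised
// conversation context, asking for exactly one agent identifier back.
function buildRoutingPrompt(query: string, contextSummary: string): string {
  return [
    "You are a routing coordinator. Reply with exactly one agent id:",
    AGENT_IDS.join(", "),
    `Conversation so far: ${contextSummary}`,
    `User query: ${query}`,
  ].join("\n");
}

// Validate the model's raw reply against the known identifiers; fall
// back to a default agent rather than failing on malformed output.
function parseAgentId(raw: string, fallback: AgentId = "projects"): AgentId {
  const candidate = raw.trim().toLowerCase() as AgentId;
  return AGENT_IDS.includes(candidate) ? candidate : fallback;
}
```

Constraining the reply to a single identifier, and validating it before use, is what keeps LLM-based routing cheap and predictable compared with querying every agent in parallel.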
Session Memory & Context Management
Because Gemini APIs are stateless, I implemented an in-memory session management layer to support coherent multi-turn dialogue.
Each session maintains:
- full conversational history
- the last active agent
- inferred user intent and topic context
To prevent unbounded token growth, only the most recent exchanges are passed verbatim into prompts. Older interactions are compressed into a structured summary of previously discussed topics and active context. This hybrid strategy reduces prompt size by ~60–70% in long conversations while preserving response quality and reducing latency.
The current implementation uses in-memory storage for speed (<1 ms lookup) and simplicity; the architecture is intentionally compatible with a Redis-backed persistence layer for scale.
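The hybrid context strategy can be sketched as follows. This is a simplified illustration, not the real code: `summarise` is a placeholder standing in for the LLM-backed compression step, and the session shape is assumed from the fields listed above.

```typescript
// Sketch of the hybrid memory strategy: recent turns are kept verbatim,
// older turns are collapsed into a compact summary for the prompt.

interface Turn { role: "user" | "assistant"; text: string }

interface Session {
  history: Turn[];
  lastAgent?: string; // the last active agent, per the session fields above
}

// In-memory store; the Map could be swapped for a Redis-backed client
// with the same key/value shape when persistence is needed.
const sessions = new Map<string, Session>();

// Placeholder for the model-backed summariser: a real implementation
// would ask the LLM to compress these turns into topic context.
function summarise(older: Turn[]): string {
  const topics = older
    .filter(t => t.role === "user")
    .map(t => t.text.slice(0, 40));
  return `Earlier discussion (${older.length} turns): ${topics.join("; ")}`;
}

// Split history into a verbatim window plus a summary of everything older.
function buildPromptContext(
  session: Session,
  window = 4,
): { summary: string; recent: Turn[] } {
  const recent = session.history.slice(-window);
  const older = session.history.slice(0, -window);
  return { summary: older.length ? summarise(older) : "", recent };
}
```

The window size and summary format here are illustrative; the trade-off they encode (bounded prompt size versus fidelity of older context) is the one described above.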
Observable Reasoning as a System Primitive
Rather than treating reasoning traces as a UI feature, the system emits a structured reasoning trace alongside every response.
Each trace records:
- selected agent and routing rationale
- context summary used in the prompt
- token usage and response latency
- model version and timestamp
These traces are rendered in a dedicated side panel with progressive disclosure: high-level signals are always visible, while full prompts and metadata are expandable. This makes AI behaviour inspectable without overwhelming the user.
In practice, observability became a core development tool. Debugging, prompt iteration, and performance tuning were significantly faster because routing decisions and context usage were immediately visible.
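The trace fields listed above can be captured in a single structure emitted with each response. A hedged sketch follows; the field names are illustrative rather than the actual schema.

```typescript
// Illustrative shape of the structured reasoning trace; field names are
// assumptions based on the categories described in the post.

interface ReasoningTrace {
  agent: string;            // which specialist handled the query
  routingRationale: string; // why the Coordinator selected it
  contextSummary: string;   // compressed history injected into the prompt
  tokensIn: number;
  tokensOut: number;
  latencyMs: number;
  model: string;
  timestamp: string;        // ISO 8601
}

// Stamp a trace at emission time so every response carries its metadata.
function makeTrace(partial: Omit<ReasoningTrace, "timestamp">): ReasoningTrace {
  return { ...partial, timestamp: new Date().toISOString() };
}
```

Keeping the trace as a typed structure, rather than ad hoc log lines, is what lets the UI apply progressive disclosure: always-visible signals come from a few fields, and the rest expands on demand.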
Model & Tooling Choices
All agents are powered by Gemini 2.0 Flash, selected for its balance of latency, reasoning quality, and structured output reliability. Streaming responses are used to keep perceived latency low, while retries and fallbacks ensure graceful degradation.
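The retry-and-fallback behaviour can be sketched as a generic wrapper around the model call. `withRetries` is a hypothetical helper, not the actual code, and the backoff constants are illustrative; the Gemini call itself is abstracted behind the `call` parameter.

```typescript
// Sketch of graceful degradation around a model call: retry with
// exponential backoff, then fall back to a static response rather
// than surfacing an error to the user.

async function withRetries<T>(
  call: () => Promise<T>,   // e.g. the streaming Gemini request
  fallback: () => T,        // degraded response when all retries fail
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch {
      // Exponential backoff before the next attempt.
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  return fallback(); // graceful degradation once retries are exhausted
}
```

Pairing this with streamed output means the user sees tokens as soon as the first successful attempt begins, while transient API failures stay invisible.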
Development was accelerated using Google Antigravity, which materially shifted effort away from boilerplate and towards architecture, prompt design, and system boundaries. The result was a measurable reduction in development time (≈40–50%) and cleaner, more consistent patterns across the codebase.
Deployment & Performance
The application is deployed on Google Cloud Run using a Next.js standalone build. This enables automatic scaling from zero to ten instances, HTTPS by default, and cost-efficient operation within the free tier.
Cold starts were reduced to ~2–3 seconds through container and bundle optimisation, while warm requests complete in under 100 ms excluding model inference. End-to-end response times typically fall in the 1–3 second range.
Key Engineering Takeaways
- Coordination matters more than agent count: multi-agent systems require explicit routing authority.
- Observability pays for itself: traces built for safety became indispensable for debugging and optimisation.
- Summarisation beats brute force context: intelligent memory handling outperforms full-history prompts.
The result is not just a portfolio of projects, but a working demonstration of modern AI systems engineering: observable, coordinated, and designed to operate safely under real-world constraints.
The portfolio can be viewed on the website.
Why Evaluating AI Systems Matters
Building AI systems is only half the work; rigorous evaluation of outputs, internal processes, and safety properties is what determines whether those systems can be trusted, governed, and deployed responsibly at scale.
The portfolio currently evaluates its multi-agent system through observable reasoning traces and real-time performance metrics. Planned enhancements include LLM-as-a-judge automated quality scoring, user feedback collection, and cross-agent verification, aimed at systematically improving response accuracy and user satisfaction.
The cover image was created using Google AI's Flow. Gemini 3 Pro, Gemini 3 Flash, and Google Antigravity were used for development.