Hackathon disclosure: This article was created for the purposes of entering the Gemini Live Agent Challenge.
Fashion apps are usually good at one part of the journey and weak at the rest.
Some are good for inspiration.
Some are good for catalog search.
Some are good for checkout.
But real style decisions do not happen in isolated tabs. They happen in messy, human moments where confidence, context, budget, taste, and time pressure all collide.
That is what we built for with Anyaself and Aura.
Project links
- Live app: https://www.anyaself.com
- Source code: https://github.com/gotapc/anyaself
Tech stack
Aura and Anyaself are built using:
- Gemini Live API for real-time voice interaction
- Vertex AI for agent orchestration
- Google Cloud Run for service deployment
- Firestore for mission and application state
- Cloud Storage for media and generated assets
Why we built this
Getting dressed is not just a visual decision. It is identity in motion.
It is what you wear to a hard meeting when your confidence is shaky.
It is what you throw on for a last-minute dinner when you are tired but still want to feel good.
It is what you buy for someone in your family when you want to help, but still need approvals, budgets, and practical constraints.
The name Anyaself comes from “on yourself” because the whole idea is grounded in your own life:
- your own wardrobe
- your own context
- your own decisions
The goal was simple to say and hard to build:
Create an experience where AI feels like a capable partner in the moment, not an extra interface you have to manage.
The product idea in one sentence
Anyaself is an agent-first platform, and Aura is the live intelligence layer that can understand, decide, and take action with you.
That sentence shaped almost every architecture choice we made.
What Aura actually does
A real Aura flow looks like this:
- You speak naturally.
- Aura understands intent in context, not as isolated commands.
- Aura pulls from your wardrobe and preferences first.
- If needed, Aura helps discover new items and can open live guided shopping.
- You can run virtual try-on before deciding.
- Sensitive steps require explicit confirmation from you.
That is the key: Aura does not just describe next steps. It can execute them while keeping you in control.
Here is what that feels like in practice.
You open Anyaself and say:
“I need one look for work and one for dinner. Try to use what I already own.”
Aura can pull your wardrobe context, suggest combinations, and explain why they fit your request. If you want a new item to complete the look, Aura can move into guided shopping and show that process live. If you want to preview before deciding, Aura can kick off virtual try-on.
The user experience is one continuous conversation, but under the hood it is a coordinated sequence of agent actions across multiple services.
Why we call Anyaself agent-first
In Anyaself, Aura is the intelligence layer that operates the product experience. The platform is built around that.
Aura can:
- navigate and operate app flows directly
- manage wardrobe actions
- run mission-style workflows
- coordinate virtual try-on
- support guided shopping sessions
- handle household-aware approval paths
For interactive shopping, Aura can work live in-browser and hand control to the user whenever needed. For cart preparation, Aura can automate setup, but final purchase still requires explicit user confirmation.
This is what we mean by agent-first: action + control, not action without control.
How we built it
We built Anyaself as an agent-first system from the start, not as a single app with an assistant bolted on top. (The hackathon offered $100 in GCP credits to support participants. 😊)
Core Google AI + Google Cloud stack:
- Gemini Live API for real-time voice interaction
- Vertex AI for model-backed orchestration
- Google Cloud Run for independently deployable services
- Firestore for mission and application state
- Cloud Storage for media and generated assets
But the real build story is less about the stack list and more about the architecture tradeoffs.
In simple terms, the platform architecture is an 8-service cloud system that handles data, workflows, automation, and observability.
The Aura architecture is the live intelligence layer on top of that platform: voice session + mission state + multi-agent orchestration + tool execution.
We designed around four constraints:
- Voice should feel live, not turn-based.
- Aura should be able to take useful actions, not just suggest them.
- High-risk actions must remain explicitly user-controlled.
- The system had to stay debuggable when multiple tools and services are active in one session.
Those constraints pushed us toward a service-based architecture and a mission-driven agent runtime.
Why we split into 8 services (instead of one large backend)
If everything lived in one process, we would move fast early and lose control later. So we split by responsibility and failure domain:
- The API gateway owns authentication, household/purchase flows, and WebSocket proxying for Gemini Live.
- The orchestration service owns mission state, tool execution, and agent turn logic.
- The wardrobe service owns item/outfit management, curation, and wardrobe image workflows.
- The commerce service owns offer search, ingestion, and scoring.
- The virtual try-on service owns asynchronous try-on jobs and result lifecycle.
- The interactive browser bridge owns live cloud browser sessions, indexed page context, and user takeover/release.
- The headless cart prep service owns constrained cart automation jobs.
- The artifacts and audit service owns transcripts, plan artifacts, recordings metadata, and audit events.
This gave us independent scaling and safer failure behavior. A VTO spike does not degrade mission turns. A browser-session issue does not break wardrobe CRUD. A cart-prep failure does not block voice sessions.
It also made safety enforceable at the correct layer. We do not rely on one prompt to keep behavior safe; policies are encoded in service boundaries and state transitions.
Aura intelligence layer: ADK multi-agent orchestration
Aura’s intelligence layer runs in the orchestrator using Google Agent Development Kit (ADK). We implemented a real multi-agent hierarchy rather than one generalized model prompt.
Root coordinator:
- Anyaself coordinator
Specialists under it:
- A style advisor for wardrobe reasoning and style synthesis
- A try-on specialist for virtual try-on workflows
- A shopping specialist for cart preparation workflows
- A browser specialist for live interactive browsing
- An onboarding specialist for style profile setup
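To make the delegation idea concrete, here is a minimal sketch of a coordinator routing intents to specialists. All names and intent labels are illustrative assumptions, not the actual ADK agent definitions:

```python
# Hypothetical sketch of the coordinator/specialist hierarchy (names illustrative,
# not the real ADK agent configuration).
from dataclasses import dataclass, field

@dataclass
class Specialist:
    name: str
    handles: set[str]  # intent labels this specialist accepts

@dataclass
class Coordinator:
    specialists: list[Specialist] = field(default_factory=list)

    def route(self, intent: str) -> str:
        # Delegate to the first specialist that declares the intent.
        for s in self.specialists:
            if intent in s.handles:
                return s.name
        return "coordinator"  # fall back to handling it directly

coordinator = Coordinator([
    Specialist("style_advisor", {"style", "outfit"}),
    Specialist("tryon_specialist", {"tryon"}),
    Specialist("shopping_specialist", {"cart_prep"}),
    Specialist("browser_specialist", {"interactive_browse"}),
    Specialist("onboarding_specialist", {"profile_setup"}),
])

print(coordinator.route("tryon"))      # tryon_specialist
print(coordinator.route("smalltalk"))  # coordinator
```

In the real system, ADK handles this delegation; the point of the sketch is only that routing is declared per specialist rather than hidden inside one generalized prompt.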
Inside the style advisor, we use a fetch-then-synthesize pattern:
- Wardrobe context and commerce offers are fetched in parallel.
- A synthesis step merges both sources into grounded suggestions.
- A dedicated wardrobe management step handles create/update/delete/curation and outfit tasks.
That structure came from a practical problem: style quality drops when one model has to fetch, reason, and mutate data in one shot. Parallel retrieval plus explicit synthesis made outputs more grounded and consistent.
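The fetch-then-synthesize pattern can be sketched as parallel retrieval followed by an explicit merge step. Function names and return shapes here are assumptions for illustration:

```python
# Illustrative sketch of fetch-then-synthesize: retrieval runs in parallel,
# synthesis only happens once both sources have landed.
import asyncio

async def fetch_wardrobe(user_id: str) -> list[str]:
    await asyncio.sleep(0)  # stand-in for a wardrobe-service call
    return ["navy blazer", "white shirt"]

async def fetch_offers(query: str) -> list[str]:
    await asyncio.sleep(0)  # stand-in for a commerce-service call
    return ["grey trousers"]

async def suggest(user_id: str, query: str) -> dict:
    # Both retrievals are awaited together, then merged in one synthesis step.
    wardrobe, offers = await asyncio.gather(
        fetch_wardrobe(user_id), fetch_offers(query)
    )
    return {"owned": wardrobe, "to_buy": offers}

print(asyncio.run(suggest("u1", "work look")))
```

The design point is the separation: retrieval is dumb and parallel, and only the synthesis step reasons over both sources at once.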
For async operations, we used ADK’s long-running tool path. The try-on specialist starts a long-running try-on job, then checks progress and returns results when rendering is done. This keeps the conversation responsive while heavier image generation runs in the background.
We also kept mission memory outside the ADK runner on purpose. ADK handles delegation and tool execution, while mission history/plan/state are persisted by the orchestrator. That split gave us better traceability and made mission recovery more predictable.
Mission model: how we kept agent behavior predictable
We model user intents as missions, not loose conversational turns. Each mission has a type, a plan, and an explicit state.
Mission types:
- Style missions
- Polling missions
- Cart-prep missions
- Interactive missions
Mission states:
- Planned
- Running
- Waiting for user
- Done
- Failed
Every mission starts with a typed initial plan. As tools run, the orchestrator updates step statuses and enforces next-state rules. That means Aura cannot silently drift from “assistive guidance” into “sensitive execution.” In cart prep flows especially, state is forced back to waiting for user until approvals and confirmations are complete.
This is the core thought process: we wanted the flexibility of an agent, but with the predictability of a workflow engine.
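The states above can be sketched as a tiny transition table. The states come from the article; the code itself is a minimal illustration, not the orchestrator's actual implementation:

```python
# Minimal sketch of mission states with enforced next-state rules.
# Illegal transitions raise instead of silently drifting.
ALLOWED = {
    "planned": {"running"},
    "running": {"waiting_for_user", "done", "failed"},
    "waiting_for_user": {"running", "failed"},
    "done": set(),
    "failed": set(),
}

def transition(state: str, next_state: str) -> str:
    if next_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state

s = transition("planned", "running")
s = transition(s, "waiting_for_user")  # cart prep parks here until approval
print(s)  # waiting_for_user
```

Because "planned" can never jump straight to "done", and terminal states allow nothing, the agent gets workflow-engine predictability for free.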
Tool runtime context and policy enforcement
Each agent turn sets runtime context before execution (auth token, household ID, actor user ID, and mission ID) and records tool traces. Tool functions validate scope against that context before calling downstream services.
That gives us three important properties:
- household isolation (tools cannot operate outside mission scope)
- attributable actions (we can link action trails to actor + mission)
- replayable debugging (tool traces and mission artifacts stay inspectable)
For shopping operations, policy gates are code-enforced:
- purchase request must already be approved
- actor role must be allowed for the action
- final checkout requires explicit confirmation token flow
Live voice path: Gemini Live + mission loop
We adopted a Model B voice architecture:
- Client captures audio.
- API gateway mints a scoped session token and proxies a secure WebSocket session.
- Gemini Live handles low-latency speech recognition and speech synthesis.
- Aura routes intent through mission turns and tool execution.
- User can interrupt and redirect naturally.
This kept voice and mission logic unified. We did not want one “voice brain” and a separate “app brain.” The same mission and policy system should apply whether the user types or speaks.
Interactive browsing and cart automation: action with boundaries
There are two different shopping action modes because they solve different trust problems:
- A browser specialist (Interactive browsing) for visible, interactive, co-controlled browsing
- A shopping specialist (headless cart prep) for constrained cart setup
In interactive browser sessions, Aura can navigate, click, type, scroll, and query page elements while the user watches live. User takeover pauses agent actions immediately, and release hands control back.
In headless cart prep, we enforce domain allowlists, block sensitive form interactions, prevent payment completion, and run hard timeouts. The output is a prepared cart state, not a completed purchase.
That distinction is deliberate. We wanted useful automation, but we did not want hidden autonomy.
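The headless guardrails reduce to simple, code-level checks. Domain values and blocked keywords below are invented for illustration:

```python
# Illustrative sketch of headless cart-prep guardrails (domain and keyword
# values are assumptions, not the production allowlist).
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"shop.example.com"}          # per-merchant allowlist
BLOCKED_KEYWORDS = ("payment", "card", "cvv")   # never touch sensitive forms

def allowed_navigation(url: str) -> bool:
    # Navigation outside the allowlist is refused before the browser moves.
    return urlparse(url).hostname in ALLOWED_DOMAINS

def allowed_field(field_name: str) -> bool:
    # Any field whose name hints at payment data is off-limits.
    name = field_name.lower()
    return not any(k in name for k in BLOCKED_KEYWORDS)

print(allowed_navigation("https://shop.example.com/cart"))  # True
print(allowed_navigation("https://evil.example.net/cart"))  # False
print(allowed_field("card_number"))                         # False
```

Combined with hard timeouts, the worst case for a misbehaving job is a stalled cart, never a completed purchase.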
VTO pipeline: asynchronous by design
VTO runs as a job lifecycle (queued, running, then terminal state) with quality scoring, warnings, and cached reuse of successful results for matching inputs. Signed URLs are generated for image access, and rate limits protect the service under burst load.
We treated this as an async subsystem so Aura can keep the session moving while try-on renders complete.
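The job lifecycle with cached reuse can be sketched like this; the job shape and keying scheme are assumptions for the example:

```python
# Sketch of the async try-on job lifecycle with cached reuse of successful
# results (job shape and keying are assumptions, not the real service schema).
import hashlib

CACHE: dict[str, dict] = {}

def job_key(person_img: str, garment_img: str) -> str:
    # Matching inputs hash to the same key, so finished renders can be reused.
    return hashlib.sha256(f"{person_img}|{garment_img}".encode()).hexdigest()

def submit_tryon(person_img: str, garment_img: str) -> dict:
    key = job_key(person_img, garment_img)
    if key in CACHE and CACHE[key]["state"] == "done":
        return CACHE[key]  # cache hit: skip rendering entirely
    job = {"state": "queued", "key": key}
    CACHE[key] = job
    return job

job = submit_tryon("me.jpg", "jacket.png")
print(job["state"])                                # queued
job["state"] = "done"                              # the renderer flips this
print(submit_tryon("me.jpg", "jacket.png")["state"])  # done (cached)
```

Because submission returns immediately with a queued job, the mission loop can keep the conversation moving and poll for the terminal state later.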
Build workflow
We developed and iterated in Google Antigravity as our IDE, which helped us move quickly across frontend, orchestrator logic, and service boundaries during the hackathon.
The result is an agent system where voice interaction, tool execution, and multi-service workflows all operate inside a single mission runtime.
The design choices that mattered most
1) We modeled workflows as missions, not loose chat turns
A style request, a shopping task, and a checkout flow should not all be treated as the same kind of interaction. We designed mission types with explicit state transitions so Aura knows when to proceed, when to wait, and when to hand control back.
This single decision made the whole system easier to reason about.
2) We separated “can execute” from “can finalize”
Aura can perform useful actions. But high-risk actions must still pass explicit user confirmation.
That line is non-negotiable for trust.
3) We treated household logic as a first-class requirement
Many assistants assume one user and one permission model. Real life is messier. We built for role-based household contexts so approvals and boundaries are part of the flow, not an afterthought.
4) We optimized for continuity
Users should not feel dropped between modules. Voice, styling, shopping, and try-on should feel like one experience, even when multiple services are involved.
What made this hard
1) Real-time quality
Low-latency streaming alone is not enough.
The hard part is interruption handling, session continuity, and maintaining response quality when the user changes intent mid-flow. If Aura sounds natural for five seconds but loses context on the sixth, trust drops immediately.
We spent a lot of effort on making the session feel stable under real conversational behavior, not scripted demos.
2) Live agent actions in unpredictable web environments
Retail websites are inconsistent and noisy. DOM structure changes. Popups appear unpredictably. Some automation paths degrade.
So we designed for graceful fallback behavior and user-visible control handoff rather than pretending every site behaves ideally.
In other words, we built for real web entropy, not happy-path browser demos.
3) Trust boundaries
The moment an assistant can take action, trust stops being a copywriting problem and becomes a systems problem.
We needed explicit boundaries for:
- who can initiate which action
- what can be automated
- what always requires user confirmation
- how every critical action is recorded
If you do not encode those boundaries in the flow itself, users will eventually feel unsafe even if the model seems “smart.”
4) Keeping the experience human
A lot of AI products are either too passive or too pushy. We wanted a different balance:
- confident, but not controlling
- warm, but not performative
- proactive, but not noisy
That tone is part product writing, part interaction design, and part system behavior.
Where Gemini Live made the biggest difference
Gemini Live was central to the product feel because it allowed us to keep interaction natural while still running agent workflows in the background.
Practically, that gave us:
- real-time speech handling
- natural interruption support
- tool-driven loops that do not break the conversation
Without that live loop, the experience would feel like stitched-together features. With it, Aura feels like one continuous companion.
Where Google Cloud made the biggest difference
Cloud Run let us keep services independent and deploy quickly. Firestore and Cloud Storage gave us practical persistence for household context and media-heavy flows. Vertex AI gave us the orchestration foundation for agent behavior.
The biggest benefit was not any one service in isolation. It was how quickly the stack allowed us to connect voice, orchestration, browsing, and try-on into one coherent product.
What we learned about building agent products
Agentic UX is mostly architecture
People often focus on prompts and personality. Those matter. But the real quality comes from state management, tool contracts, permissions, and fallbacks.
“Looks smart” is not enough
Users trust systems that are clear and accountable, not just systems that sound intelligent.
Context quality beats generic intelligence
In fashion, recommendations are only useful when grounded in what a user owns, wants, and can actually act on.
Control is part of delight
Giving users explicit takeover and confirmation paths does not reduce the magic. It increases it, because confidence rises when agency is preserved.
What’s next
We are focused on:
- Better VTO quality and faster turnaround.
- Stronger merchant coverage with safe execution patterns.
- Deeper personalization across longer time horizons.
- Better observability and accountability for agent actions.
Longer term, we see Anyaself becoming an operating system for personal style: where wardrobe memory, real-time guidance, and intelligent commerce come together in one human-centered experience.
Because the best fashion technology should not make people feel more pressure.
It should make them feel more like themselves.
Credits
Built solo by me.
Tooling and model acknowledgements:
- Gemini Live API
- Vertex AI
- Google Cloud Run
- Firestore
- Cloud Storage
- Unsplash API (for licensed image sourcing via their API terms)
- Playwright
- Google Antigravity (IDE)
- Guidance and resources from the Gemini Live Agent Challenge hackathon support team
Built with ❤️ for the Gemini Live Agent Challenge using Google AI models and Google Cloud.
