Livio Kropf

From Paper Documents to Structured Data: How I Built an AI Onboarding Agent in a Week

Submitted for the Gemini Live Agent Challenge #GeminiLiveAgentChallenge


There's a version of this post where everything goes smoothly. Where I open the Gemini docs, integrate the Live API in an afternoon, and ship a polished product by day three.

This is not that post.

This is the honest one — about the planning that saved the project, the APIs that didn't work, and how six versions of a working product shipped in a single day. Because that's actually what happened.


The Problem I Wanted to Solve

I support my wife in running a small wellness business. Every time we hire someone new, the same thing happens: a photo or a paper ID card gets handed over, someone manually types the data into a spreadsheet, someone else copies it into another system, and at least one field ends up wrong.

It's not a glamorous problem. But it's real, slow, and it happens at thousands of small businesses every day. The deeper issue — one I've seen play out in many software projects — is that system rollouts don't fail because the software is bad. They fail because getting existing real-world data into the software is too hard.

I wanted to fix that. Not with a form. With a conversation — the way things actually work in real life.


Starting with a Plan (Not Code)

On March 7th I did something I've learned from experience: I wrote everything down before touching a keyboard.

Five planning documents. System definition, system concept, technical blueprint, API spec, development plan. The whole thing took a few hours and saved me days.

The core of the system definition was five principles I kept coming back to throughout the build:

1. The Reality Principle — the system adapts to reality, not the other way around. If someone hands you a handwritten form, the agent works with that. It doesn't demand a clean scan.

2. Structure Before Content — recognize what kind of thing you're looking at first (this is a person, this is a role, this is a date), then extract the content.

3. Dialog Over Forms — missing fields are clarified through brief conversation, not by presenting users with a form they have to fill out.

4. Error Tolerance — the agent asks rather than guesses. Uncertain information is flagged explicitly. No confident hallucinations.

5. Minimal Intervention — the agent supports the process, it doesn't replace judgment. A human always confirms before anything is saved.

One line from the development plan stuck with me the most:

"Build only what appears in the video. If it's not in the video, don't build it."

This is extremely good advice for hackathons — and honestly for most product development. I kept this visible throughout the build.


Dead End #1: The Gemini Live API

My original architecture was built around the Gemini Live API — the real-time bidirectional API that handles audio and text natively. Perfect on paper for an agent that's supposed to feel truly "live."

Two problems killed this quickly:

  1. Text-only mode wasn't available. The Live API is primarily designed for audio streams. Text-only sessions weren't supported for my API key configuration.

  2. The ADK JavaScript SDK was too early-stage for production use. I spent a full day on @google/adk and ran into dependency conflicts, unresolved imports, and build failures. The Python ADK is well-established — the JavaScript port is clearly still maturing, and for a time-boxed project I couldn't afford to work around it.

I could have pushed through either of these. But here's what the planning phase gave me: a clear sense of what the actual goal was. The goal was real-time streaming dialogue with document understanding — not a specific API. generateContentStream from @google/genai delivers exactly that. Token-by-token streaming, multi-turn conversation history, full multimodal support.

The agent feels live because the text streams in as it's generated. That's all that matters to the user.
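As a rough sketch (the helper name `forwardChunks` and the message shapes are illustrative, not taken from the project), the streaming loop boils down to draining an async iterable of chunks, which is the shape `generateContentStream` yields, and pushing each piece onward the moment it arrives:

```typescript
// Illustrative sketch: drain a token stream and forward each piece immediately.
// generateContentStream from @google/genai yields chunks with a `text` field,
// so any async iterable of that shape works here.
type StreamChunk = { text?: string };

async function forwardChunks(
  stream: AsyncIterable<StreamChunk>,
  send: (msg: { type: string; text?: string }) => void
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    if (chunk.text) {
      full += chunk.text;
      send({ type: 'chunk', text: chunk.text }); // forward as soon as it arrives
    }
  }
  send({ type: 'turn_complete' }); // turn is over; the UI can re-enable input
  return full; // accumulated text, useful for multi-turn history
}
```

The same loop works unchanged whether the source is the real Gemini stream or a mock, which makes the forwarding path easy to test without an API key.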

Lesson: Don't let the perfect be the enemy of the shipped. The feature matters more than the specific API path you imagined.


Dead End #2: The Model That Wasn't Available

Error: gemini-2.0-flash is not available to new users

Short story: gemini-2.5-flash works better anyway. Check which models are actually available for your API key before you design your architecture around one.


The Architecture I Actually Built

After clearing those blockers, the architecture came together quickly. The technical blueprint had laid it out clearly, and reality only diverged in one meaningful way: I'd considered Python/FastAPI for the backend. I went with Node.js/Express/TypeScript instead — same team, no context switch, and WebSocket support is first-class.

HR Staff
  → scans document (camera or file upload)
  → image compressed client-side (≤800px, 70% JPEG)
  → sent over WebSocket as base64 to Express backend
  → AgentSession passes it to Gemini 2.5 Flash
  → Gemini streams response token-by-token
  → chunks forwarded to frontend in real time
  → agent asks follow-up questions for missing fields
  → on completion: structured employee record saved
  → optionally: full employment contract generated (downloadable .txt)

The compression step deserves a mention because it made a bigger difference than I expected. Full-resolution camera output — often 3-5MB — creates significant latency before the first token even arrives. Capping at 800px on the longest edge with 70% JPEG quality keeps the document perfectly readable for Gemini while cutting transmission time dramatically.
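The resize math itself is simple. Here's a sketch of the dimension capping (the function name is illustrative); in the browser the result would feed a `<canvas>` `drawImage` call followed by `toDataURL('image/jpeg', 0.7)`:

```typescript
// Illustrative sketch: cap the longest edge at 800px, preserving aspect ratio.
function capDimensions(
  width: number,
  height: number,
  maxEdge = 800
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return { width, height }; // already small enough
  const scale = maxEdge / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

A typical 4000x3000 phone photo comes out at 800x600, which cuts the base64 payload from megabytes to tens of kilobytes.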


Six Versions in One Day

Here's something the CHANGELOG reveals that I'm still a little surprised by: versions 0.1.0 through 0.6.0 all shipped on March 9th.

That's one day: from scratch to a production-ready feature set in about 8-10 hours of focused work.

The reason this was possible was the planning. I wasn't discovering the architecture as I went — I was implementing one I'd already validated on paper.

The sequence matters. 0.1.0 was the hardest — everything from there was additive. If I'd tried to build 0.5.0 features before the core worked, nothing would have shipped.


The Part That Actually Took Time: The System Instruction

People underestimate how much work goes into a good system instruction. Mine handles four things that took real iteration to get right:

Org grounding — the agent knows the company name, the valid roles (Therapist, Receptionist, Manager, Other) and teams. Without this, Gemini would sometimes invent plausible-sounding but incorrect role names.

The DOB Validation Guard — early versions would confidently confirm a date of birth as "UNKNOWN" or fabricate a plausible date. This required a two-layer fix: a hard rule in the system instruction, plus a backend guard that rejects any completion event where birth_date is null, empty, or a generic placeholder like "N/A".

Confidence calibration — if the document is unclear or a field is unreadable, the agent asks. It doesn't guess and move on. This required explicit instruction — Gemini's default behavior is to be helpful, which sometimes means filling gaps with confident-sounding inferences.

Date normalization — all dates come out as DD.MM.YYYY regardless of how they appear on the source document. This keeps the data store clean and removes a whole class of downstream formatting bugs.

Each of these was discovered by testing edge cases. A crumpled receipt. An ID card photographed at an angle. A document with a partially obscured birth year. The system instruction is where you encode all of that learning.
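To give a flavor of the backend side of this, here's a sketch of the two guards (function names and the placeholder list are my illustrative assumptions, not the project's exact code): one check rejects missing or placeholder birth dates, and one normalizer converts common date shapes to DD.MM.YYYY:

```typescript
// Illustrative sketch of the completion guard and date normalizer.
const PLACEHOLDERS = new Set(['', 'n/a', 'na', 'unknown', 'null', '-']);

// Reject completion events whose birth_date is missing or a generic filler.
function isValidBirthDate(value: string | null | undefined): boolean {
  if (!value) return false;
  return !PLACEHOLDERS.has(value.trim().toLowerCase());
}

// Normalize common date shapes (ISO or already-dotted) to DD.MM.YYYY.
function normalizeDate(value: string): string | null {
  const iso = value.match(/^(\d{4})-(\d{2})-(\d{2})$/); // e.g. 1990-04-23
  if (iso) return `${iso[3]}.${iso[2]}.${iso[1]}`;
  const dotted = value.match(/^(\d{1,2})\.(\d{1,2})\.(\d{4})$/); // e.g. 23.4.1990
  if (dotted) {
    return `${dotted[1].padStart(2, '0')}.${dotted[2].padStart(2, '0')}.${dotted[3]}`;
  }
  return null; // unrecognized: ask the user instead of guessing
}
```

Returning null on anything unrecognized is the point: the dialogue loop turns that into a follow-up question rather than a silent guess.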


The Streaming Implementation

The core of the live feeling is a persistent WebSocket connection between browser and backend. The frontend sends:

{ "type": "start", "imageBase64": "...", "mimeType": "image/jpeg" }

The backend streams back:

{ "type": "chunk", "text": "I can see " }
{ "type": "chunk", "text": "an ID card belonging to " }
{ "type": "turn_complete" }
{ "type": "complete", "employee": { "name": "...", "birth_date": "..." } }
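On the client, a small parser keeps malformed frames from ever crashing the UI. This is a sketch with assumed message shapes, not the project's exact types:

```typescript
// Illustrative sketch: a discriminated union for the server-to-client protocol
// and a parser that narrows raw WebSocket data into it.
type ServerMessage =
  | { type: 'chunk'; text: string }
  | { type: 'turn_complete' }
  | { type: 'complete'; employee: { name: string; birth_date: string } };

function parseServerMessage(raw: string): ServerMessage | null {
  try {
    const msg = JSON.parse(raw);
    switch (msg.type) {
      case 'chunk':
        return typeof msg.text === 'string' ? msg : null;
      case 'turn_complete':
        return { type: 'turn_complete' };
      case 'complete':
        return msg.employee ? msg : null;
      default:
        return null; // unknown message types are ignored
    }
  } catch {
    return null; // malformed JSON never reaches the rendering code
  }
}
```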

One detail that improves the interaction significantly: the AgentSession uses an AbortController to cancel in-flight Gemini requests when the user sends a new message mid-stream. This means the agent is immediately interruptible — you don't have to wait for it to finish a response before you can reply.

async handleMessage(msg: WsMessage): Promise<void> {
  // Abort any in-flight Gemini request so the new user message takes over immediately
  if (this.abortController) {
    this.abortController.abort();
    this.abortController = null;
  }
  // ... build user parts, push to history
  await this.stream();
}

Small detail, significant UX difference.


Dead End #3: Cloud Run and Build-Time Variables

This one hit on deployment day and cost two failed builds.

NEXT_PUBLIC_* variables in Next.js are build-time, not runtime. They get baked into the JavaScript bundle during next build. This means passing them as runtime environment variables to Cloud Run does nothing — they need to be present during the Docker build step.

My first attempt used --set-build-env-vars with gcloud run deploy --source. This sets environment variables in the Cloud Build environment, but does not automatically map them to Docker --build-arg. The ARG NEXT_PUBLIC_API_URL in the Dockerfile was empty on every build.

Clean solution for a stable deployment: hardcode the URL directly in the Dockerfile:

ENV NEXT_PUBLIC_API_URL=https://glac-backend-842528390248.us-central1.run.app

Not elegant for a general-purpose setup. Perfectly valid when your Cloud Run service names are stable — and they are, as long as you don't delete and recreate the service.


One More Thing: The Agent That's Always There

After the core product was working, something kept bothering me during testing. I'd be on the employees page, want to ask something, and instinctively reach for the agent button — only to find a dumb navigation shortcut that just sent me back to the home page.

That's not an agent. That's a link.

The fix was more involved than I expected, but the right move: I replaced the per-page placeholder bubble with a global always-on agent panel, available across all pages at all times. Outside of an active onboarding session, it connects to a separate /ws/chat WebSocket with a different system instruction — one that knows the employee database, understands the app's capabilities, and can navigate on the user's behalf.

You can type "show me everyone in the Therapy team" and get an answer. You can say "start a new onboarding" and the app navigates to the home page where the document scanner is ready to use. The panel minimizes when it's done with its job and stays out of the way until you need it again.

This is what "live agent" should mean. Not a process that starts and stops with a specific workflow — a persistent presence that's useful whenever you reach for it.

The last feature I added takes this one step further: the agent is now proactive. When you open an existing employee record, the agent panel opens automatically after a short delay — and if that employee has no employment contract on file, it offers to generate one without being asked. One tap, and Gemini streams a complete 9-section employment contract. The agent noticed something, acted on it, and delivered a result — without the user ever typing a command.

The technical challenge: two WebSocket servers sharing one HTTP port. The ws library doesn't support this cleanly with its default server option — the first instance rejects connections destined for the second with a 400. The fix is noServer: true on both, with a manual upgrade event handler that routes by path.

server.on('upgrade', (request, socket, head) => {
  // Route each HTTP upgrade to the right WebSocketServer by path
  const pathname = new URL(request.url ?? '', `http://${request.headers.host}`).pathname;
  if (pathname === '/ws/agent') {
    wss.handleUpgrade(request, socket, head, (ws) => wss.emit('connection', ws));
  } else if (pathname === '/ws/chat') {
    chatWss.handleUpgrade(request, socket, head, (ws) => chatWss.emit('connection', ws));
  } else {
    socket.destroy(); // unknown path: refuse the upgrade
  }
});

Small thing. Took a while to find. Worth documenting.


What the Final Product Does

  • File upload or photo — scan any ID document on mobile or desktop
  • Live streaming dialogue — token-by-token, feels immediate
  • Quick-reply buttons — pre-built answers for Yes/No, role, team
  • Voice input — Web Speech API, works in Chrome and Edge
  • Preview → Edit → Confirm — human always in the loop before anything is saved
  • Employee list — full CRUD, CSV import and export
  • AI contract generation — Gemini writes a 9-section employment contract, downloadable as .txt
  • Always-on agent panel — available on every page, answers questions, navigates the app, starts onboarding on request

What I'd Do Differently

Use Firestore from day one.
The current demo uses an in-memory store — it works, and for the hackathon period both services are configured with min-instances=1 so no data is lost between sessions. But that's a workaround, not a solution. The technical blueprint called for Firestore; I moved it to post-hackathon for time reasons. That's a trade-off I'd reconsider.

Test Cloud deployment earlier.
I treated it as a "last step" and hit three configuration issues in rapid succession. Half a day of that could have been avoided by building and deploying a minimal version on day two instead of day four.

Lock the system instruction earlier.
I refined it continuously throughout development, but the core constraints (DOB guard, date format, org grounding) should be fixed from the start. Every system instruction change requires re-running the full dialogue flow to check for regressions.


The Stack

Layer      Choice                          Why
AI         Gemini 2.5 Flash                Best available; multimodal vision; fast streaming
Frontend   Next.js 16 + React 19           App Router, great DX, works well with streaming
Backend    Node.js + Express + WebSocket   Lightweight, full control over streaming pipeline
Hosting    Google Cloud Run                Serverless, scales to zero, WebSocket support
SDK        @google/genai v1.44             Stable; generateContentStream works reliably

Try It

Live demo: https://glac-frontend-842528390248.us-central1.run.app

Demo video: https://www.youtube.com/watch?v=XJjfxDNQwdg

Code: https://github.com/LeeWu-Agents/ai-onboarding-agent

Built for the Gemini Live Agent Challenge by Kropf Systems · kropfsystems.com

Built using Gemini 2.5 Flash and the Google GenAI SDK on Google Cloud Run.


If you're building something similar — document understanding, streaming agent UIs, or onboarding automation — feel free to compare notes in the comments.