The logistics industry runs on information. From tracking numbers on a crumpled label to complex terms in a PDF contract, the speed and accuracy of communication can make or break a shipment. So, for a recent hackathon, my team decided to tackle this challenge head-on. Our goal? To build an intelligent assistant that could understand and assist with logistics queries, no matter how they were presented.
The result was the World Movers AI Agent, a multimodal assistant that can chat, read documents, analyze images, understand voice commands, and even interpret live video from a webcam or screen share.
Here’s a look at how we built this powerful, scalable AI application using the magic of Google Cloud Run and Google's Gemini models.
The Vision: An AI That Sees and Hears
We didn't want to build just another chatbot. We envisioned a true "agent" that could interact with the world in the same ways a human does. Our core feature list looked like this:
- Smart Conversations: Handle text-based questions about services, quotes, and support.
- Document Analysis: Upload a PDF, DOCX, or TXT file and ask questions about its contents.
- Image Recognition: Analyze images of shipping labels, cargo, or warehouse scenes.
- Voice Commands: Speak directly to the agent for hands-free interaction.
- Live Analysis: Use a webcam or screen share to get real-time feedback and support.
- Actionable AI: Go beyond answering questions by taking action, like compiling and emailing a quote request to the sales team.
Why Google Cloud Run was a No-Brainer
To build this, we needed a platform that was fast, scalable, and let us focus on the AI logic, not infrastructure. Google Cloud Run was the perfect fit.
- Serverless Simplicity: As a small hackathon team, we didn't have time for server provisioning or management. With Cloud Run, we just packaged our code in a Docker container and deployed it. Google handles the rest.
- Scale to Zero (and to Infinity): This is a killer feature. When our app wasn't being used, it automatically scaled down to zero instances, meaning we weren't paying for idle time. But if we suddenly got a surge of users, Cloud Run would instantly scale up to handle the load.
- Container-Powered: Using Docker meant our development environment was identical to our production environment. This eliminated the classic "it works on my machine" problem and made deployment a breeze.
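To make the "container-powered" part concrete, here is a minimal sketch of a Cloud Run-ready entrypoint. It isn't our exact code; it just shows the one convention that matters on Cloud Run: the container listens on 0.0.0.0 at whatever port the PORT environment variable specifies.

```python
# main.py - minimal FastAPI app that Cloud Run can serve.
# Cloud Run injects the PORT environment variable; the container
# only needs to listen on 0.0.0.0 at that port.
import os

import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/healthz")
def health_check():
    """Simple liveness endpoint so Cloud Run (and we) can ping the service."""
    return {"status": "ok"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```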
Our Architecture: A Tale of Two Services
We kept our architecture clean and simple, leveraging two distinct Cloud Run services:
- frontend-service: A very lightweight service whose only job was to serve our static web interface (built with HTML, CSS, and JavaScript). This is what the user sees and interacts with.
- backend-api-service: The brains of the operation. A Python application (using the FastAPI framework) running as a separate Cloud Run service, it handled all the heavy lifting:
  - Exposing an API that the frontend could call.
  - Processing all incoming data, whether it was text, an uploaded file, or a video frame.
  - Orchestrating the calls to our AI models on Google's Vertex AI platform.
  - Sending the final, AI-generated response back to the user.
This separation of concerns made our application clean, scalable, and easy to manage.
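To give a feel for the backend, here is a stripped-down sketch of what the backend-api-service's main endpoint can look like. Route and function names (like `ask_gemini`) are illustrative placeholders rather than our exact code, and the CORS setup assumes the frontend calls the API from a different origin.

```python
# backend-api-service (sketch): a single endpoint the frontend calls,
# accepting a text message plus an optional file attachment.
from typing import Optional

from fastapi import FastAPI, File, Form, UploadFile
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# The frontend is a separate Cloud Run service (different origin),
# so browser calls need CORS enabled.
app.add_middleware(
    CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
)


def ask_gemini(message: str, file_bytes: Optional[bytes], mime_type: Optional[str]) -> str:
    """Placeholder for the Vertex AI orchestration shown in the next section."""
    return "stubbed response"


@app.post("/api/chat")
async def chat(message: str = Form(...), attachment: Optional[UploadFile] = File(None)):
    # Read the attachment (document, image, or video frame) if one was sent.
    file_bytes = await attachment.read() if attachment else None
    mime_type = attachment.content_type if attachment else None
    answer = ask_gemini(message, file_bytes, mime_type)
    return {"answer": answer}
```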
The Magic Ingredient: Google's Gemini Models
Our agent's intelligence comes from Google's powerful Gemini models, which we accessed via the Vertex AI API:
- Gemini Pro: This was our workhorse for all language-based tasks. It powered the chat conversations, summarized uploaded documents with incredible accuracy, and helped parse user intent.
- Gemini Pro Vision: This is where the real magic happened. Gemini Pro Vision is a true multimodal model, meaning it can reason across text, images, and video. It's what allowed our agent to:
- Read the tracking number from a photo of a shipping label.
- Describe the contents of a user's shared screen.
- Identify a "shipping container" when one was held up to the webcam.
Crafting the prompts to combine these different data types was a fun challenge, but the results were astounding.
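For the curious, here is roughly what one of those multimodal calls looks like with the Vertex AI Python SDK. The project ID, region, and prompt are placeholders, and model names have shifted since the hackathon, so treat this as a sketch rather than copy-paste code.

```python
# Sketch of the multimodal call behind the image features, using the
# Vertex AI Python SDK. Project, region, and prompt are illustrative;
# error handling is omitted for brevity.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-gcp-project", location="us-central1")


def read_shipping_label(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Send an image plus a text instruction in a single multimodal prompt."""
    model = GenerativeModel("gemini-pro-vision")
    response = model.generate_content([
        Part.from_data(data=image_bytes, mime_type=mime_type),
        "This is a photo of a shipping label. Extract the tracking number "
        "and carrier, and reply in plain text.",
    ])
    return response.text
```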
From Chatbot to "Agent": Taking Action
The defining moment was implementing the quote request workflow. A user can simply type, "I need a quote to ship a package." The AI, powered by Gemini Pro, understands this intent and asks follow-up questions like "What are the dimensions?" and "Where is it going?".
Once it has the necessary info, it doesn't just stop. It formats the data and sends a professional email, via an SMTP service, straight to the sales team's inbox. This transforms the AI from a passive information source into an active participant in the business workflow.
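Under the hood, that last step is plain Python. Here is a hedged sketch using the standard library's smtplib; the SMTP host, credentials, and addresses are placeholders (in a real deployment they would live in environment variables or Secret Manager).

```python
# Sketch of the "take action" step: once Gemini has extracted the quote
# details, format them and email the sales team. All hosts, addresses,
# and credentials below are placeholders.
import smtplib
from email.message import EmailMessage


def send_quote_request(quote: dict) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"New quote request: {quote.get('origin')} -> {quote.get('destination')}"
    msg["From"] = "agent@example.com"
    msg["To"] = "sales@example.com"
    msg.set_content(
        "A customer requested a shipping quote.\n\n"
        + "\n".join(f"{key}: {value}" for key, value in quote.items())
    )

    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("agent@example.com", "app-password")  # placeholder credentials
        server.send_message(msg)
```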
Final Thoughts
Building a powerful, multimodal AI application is more accessible than ever. By combining the serverless power of Google Cloud Run with the incredible reasoning capabilities of Google's Gemini models, our small team was able to create a sophisticated logistics agent in a matter of days.
This project proved to us that the future of AI is multimodal, and the future of deploying it is serverless.


