DEV Community

Rubens Zimbres
Rubens Zimbres

Posted on • Originally published at Medium on

From Proof of Concept to Production: Building an Enterprise-Grade Platform for AI Systems

Introduction

The transition from a working AI prototype to a production-ready system represents one of the most challenging journeys in modern software development. While building a chatbot that can answer questions is relatively straightforward, deploying an AI agent system that can serve thousands or even millions of users securely, reliably, and cost-effectively requires careful architectural decisions and enterprise-grade infrastructure.

This article presents a comprehensive reference architecture for deploying multi-agents AI systems on Google Cloud Platform, designed with the explicit goal of allowing developers to plug any AI agent system into a robust infrastructure.

The architecture presented here contains several critical best practices that make it suitable for enterprise deployment.

  • First, it implements a strict separation of concerns through a decoupled frontend and backend architecture, allowing teams to independently develop, test, and deploy each component.
  • Second, it follows a security-first design philosophy with defense in depth, implementing protections at every layer from the network edge to the application core.
  • Third, it embraces infrastructure as code through modular Terraform configurations, ensuring reproducible deployments and facilitating disaster recovery.
  • Fourth, the system is built for observability with comprehensive distributed tracing, structured logging, and health monitoring throughout.
  • Finally, the architecture is designed for cost efficiency , using serverless compute, intelligent caching, and tiered storage to minimize operational expenses while maintaining high availability.

What makes this infrastructure particularly valuable is its agent-agnostic design. The platform provides all the surrounding capabilities that any AI agent system needs: authentication, payment processing, secure data storage, content delivery, rate limiting, and observability. Developers can focus on building their specific AI capabilities while the infrastructure handles the undifferentiated heavy lifting of enterprise deployment.

In this article, I present the basic structure of the project. For a more detailed description and code, access the Github repository of the project:

GitHub - RubensZimbres/Enterprise-Grade-Infra-for-AI-Agents: Terraform Deployment of AI Agents Solution in Google Cloud

⭐🇾 the repo if you like it. Contributions are welcome!

Architecture Overview

The platform consists of three primary layers: a Next.js frontend serving as the user interface and secure proxy, a FastAPI backend orchestrating the AI capabilities, and a comprehensive infrastructure layer managed through Terraform modules.


Google Cloud Architecture

The frontend layer is built with React 18 and Next.js, utilizing the modern App Router pattern. It serves as more than just a user interface; it acts as a secure proxy that handles all communication with backend services. Authentication is managed through Firebase , providing seamless integration with Google Identity services while supporting millions of consumer-scale users. The frontend implements circuit breaker patterns using the opossum library, ensuring that temporary backend failures do not cascade into system-wide outages. To eliminate cold-start latency, the service maintains a minimum of one warm Cloud Run instance at all times.

The backend layer is a FastAPI application designed for high concurrency and resilience. It orchestrates Retrieval-Augmented Generation using LangGraph and Vertex AI, connecting to Cloud SQL for PostgreSQL with the pgvector extension for semantic search capabilities. The backend is configured for internal-only ingress traffic, ensuring it remains unreachable from the public internet and only accessible through the authenticated frontend proxy. Full OpenTelemetry instrumentation provides distributed tracing capabilities exported to Google Cloud Trace, enabling detailed debugging and performance analysis in production environments.

The Frontend Layer

The frontend architecture centers around three core components that manage the user experience:

  • The AuthProvider component serves as the the authentication system, using Firebase Authentication to manage user state and protect routes from unauthorized access.
  • The ChatInterface component provides the main interaction surface, delivering a real-time streaming chat experience tightly integrated with the backend API. It handles authentication errors and payment-related issues gracefully, redirecting users to appropriate pages when necessary.
  • The PaymentClient component delivers a seamless checkout experience using Stripe Embedded Checkout, guiding users through the payment process with comprehensive error handling.
  • The routing structure implements a clear user journey from landing page through authentication and payment to the main chat interface. Server-side API routes handle critical operations including the chat proxy, payment status verification, and checkout session creation.

The chat API route implements a circuit breaker to prevent cascading failures while using OIDC tokens for secure service-to-service authentication. It streams responses from the backend to provide real-time chat capabilities, forwarding user authentication tokens to the backend for authorization decisions.

The Backend Layer

The backend exposes four primary endpoints:

  • A health check for infrastructure monitoring,
  • A webhook endpoint for Stripe event processing, and
  • Two chat endpoints supporting both standard request-response and streaming communication patterns.

Security is implemented at multiple levels. Rate limiting restricts requests to ten per minute per IP address to prevent abuse. Input validation through Pydantic models enforces strict message size limits to prevent denial-of-service attacks and the authentication dependency ensures all chat requests come from verified users, while session IDs are scoped to authenticated users to prevent insecure direct object reference attacks.

The data layer uses PostgreSQL as the primary database, storing user information including subscription status and Stripe customer identifiers. All database operations are encapsulated in dedicated modules for maintainability and testability. The Stripe integration is tight and bidirectional: webhooks listen for payment events and automatically update user subscription status in the database, while the authentication middleware verifies subscription status for every protected request.

AI Engine and Knowledge Core

The AI capabilities are built around a Retrieval-Augmented Generation pipeline that balances high-performance search with secure session management. The system implements two distinct memory systems: short-term memory for maintaining conversation context and long-term memory for the knowledge base.

  • Short-term memory utilizes Google Cloud Firestore in Native Mode for low-latency persistence of chat history. The implementation leverages FirestoreChatMessageHistory within the LangGraph framework, with every session cryptographically scoped to the authenticated user identity. This ensures strict multi-tenancy where users cannot access or leak into another user’s conversation history. The system automatically retrieves the last N messages and injects them into the RAG prompt, enabling multi-turn, context-aware dialogue.
  • Long-term memory is powered by PostgreSQL 16 with the pgvector extension, enabling semantic similarity search using Vertex AI Embeddings. For every query, the engine retrieves the top five most relevant document chunks to provide grounded context to the language model. A semantic cache backed by Redis provides an additional optimization layer: if a user asks a question semantically similar to a previously cached query, the system returns the cached response instantly, bypassing the language model entirely to save cost and reduce latency.

The document ingestion pipeline transforms raw data into AI-ready vectors through a specialized process, whose ingestion process is triggered automatically through _Cloud Function_s when new documents are uploaded to the storage bucket.

Security and Resilience

The platform implements a multi-layered security strategy addressing both traditional web application vulnerabilities and AI-specific threats. Protection against SQL injection operates at two levels: Cloud Armor is configured with pre-defined WAF (Web Application Firewall) rules to filter malicious SQL patterns at the network edge, while the backend uses asyncpg with strictly parameterized queries to ensure user input is never executed as raw SQL. (OWASP Top 10)

Similarly, cross-site scripting (XSS) protection combines Cloud Armor WAF rules with Next.js’s automatic content sanitization and the backend’s structured JSON responses. Broken access control and insecure direct object reference vulnerabilities are addressed through a verified identity system. The frontend captures user identity from Firebase Authentication tokens and propagates them to the backend for verification.

Chat histories are cryptographically scoped to authenticated user identities, preventing one user from accessing another’s private conversation history. DDoS (Distributed Denial of Service) and resource abuse protection operates at multiple layers: Cloud Armor implements a global rate-limiting policy of 500 requests per minute per IP address with rate-based banning for volumetric attacks, while the backend uses slowapi to enforce granular rate limiting specifically for expensive language model operations.

The architecture addresses AI-specific security concerns including prompt injection and sensitive data leakage. The RAG prompt template uses strict structural delimiters and prioritized system instructions to ensure the model adheres to its enterprise role and ignores adversarial overrides in documents or user queries. (OWASP Top 10 for LLM and MAESTRO Framework).

A sandwich defense using XML tagging provides explicit instructions to ignore external commands found within retrieved context. Google Cloud DLP is integrated into the core pipeline with a regex fast-path that intelligently filters expensive API calls for clean content, invoking the Data Loss Prevention service only when potential PII patterns are detected. The knowledge base itself is stored in a private Cloud SQL instance reachable only via Serverless VPC Access connector, ensuring the AI’s brain is never exposed to the public internet.

Infrastructure as Code

The entire infrastructure is defined through modular Terraform configurations organized into logical components, following cybersecurity best practices:

  • The network module provisions a custom VPC with private subnets and Cloud NAT gateway, ensuring services are not exposed directly to the public internet.
  • The compute module deploys decoupled frontend and backend services on Cloud Run with granular IAM policies.
  • The database module provisions Cloud SQL for PostgreSQL with Firestore for chat history storage. A dedicated Redis module provides Memorystore for semantic caching.
  • The ingress module configures a global external HTTPS load balancer with Cloud Armor providing WAF rules for SQL injection, cross-site scripting, and rate limiting.
  • The function module sets up Cloud Functions for event-driven PDF ingestion.
  • Additional modules handle CI/CD pipelines, storage buckets with lifecycle policies, and billing monitoring with alert policies and notification channels.


Terraform folder

The infrastructure follows a security-first design philosophy.

  • The database has no public IP and uses IAM authentication. All sensitive information is stored in Google Secret Manager.
  • The load balancer provides a single entry point with Cloud CDN improving performance by caching static assets closer to users.
  • Health checks with startup and liveness probes ensure reliability.
  • The CI/CD pipeline automates build and deployment processes, maintaining a Zero-Trust permission model where service accounts have only the specific roles they require.

You just need to run:

terraform init
terraform plan
terraform apply
Enter fullscreen mode Exit fullscreen mode

Performance and Scaling

The architecture is optimized for both performance and cost efficiency.

  • The backend is built on FastAPI with asyncpg for non-blocking database connections, allowing a single instance to handle thousands of concurrent requests with minimal resource usage.
  • Server-Sent Events enable real-time token streaming from the language model directly to the frontend, providing sub-second time-to-first-token for a highly responsive user experience. Expensive operations like PII (Personal Identifiable Information) de-identification are offloaded to asynchronous background threads to prevent blocking the main request-response cycle.
  • Cost control measures include using the Gemini 3 Flash model for a significant reduction in token costs compared to larger models, implementing regex-based pre-checks for PII to intelligently bypass expensive DLP API calls, and enabling Cloud CDN for global caching of static assets.
  • Object Lifecycle Management on storage buckets automatically transitions files to Nearline storage after seven days, Archive storage after thirty days, and deletes them after ninety days, providing disaster recovery capabilities without indefinite storage costs.

The current infrastructure is benchmarked to handle approximately 2,500 users per hour. For scaling to a million users per hour, I recommend offloading vector search to Vertex AI Vector Search , a fully managed service designed to handle billions of vectors and thousands of queries per second with sub-10-millisecond latency. In this configuration, PostgreSQL handles only chat history and user metadata while the specialized vector engine handles the high-throughput similarity search load.

Payment and Subscription System

The platform enforces a strict workflow where users must log in, then pay, before accessing the chat functionality. The PostgreSQL database serves as the single source of truth for user subscription status. Stripe integration is implemented through secure webhooks that listen for checkout completion and invoice payment success events, automatically updating user status when payments succeed.

The backend middleware checks subscription status for every request, while the frontend intercepts these errors and redirects users to the subscription/payment page. The database schema links user emails to Firebase Identity, tracks active subscription status, and maintains Stripe customer identifiers for seamless payment management.

Disaster Recovery

The infrastructure includes disaster recovery capabilities. Cloud SQL is configured with automated backups retained for seven days, point-in-time recovery allowing restoration to any second within the retention window, and deletion protection to prevent accidental instance removal.

For data corruption scenarios, the database can be cloned to a specific point in time before the corruption occurred, allowing verification before switching traffic to the restored instance. For complete instance loss, restoration from the last successful nightly backup is straightforward through the gcloud command-line interface.

Firestore is configured with daily backup schedules retained for seven days. Since Firestore does not support in-place restores, recovery involves restoring to a new database ID and updating the backend configuration to point to the restored database. Post-recovery procedures include verifying backend connectivity, running application-level smoke tests, and ensuring backup schedules are re-applied through Terraform.

Cost Considerations

The architecture is designed for cost efficiency while maintaining enterprise capabilities.

  • Cloud Run compute costs are $25 a month.
  • Cloud SQL database costs approximately $34 per month,
  • Memorystore for Redis at approximately $36 per month,
  • Cloud NAT gateway at approximately $33 per month, and
  • Load balancer with Cloud Armor at approximately $33 per month.

This brings the baseline monthly cost to approximately $161 for a production-ready enterprise platform that handles 2,500 users per hour.

⚠️ Note that you have to be careful to not deploy the Enterprise version of Cloud Armor in Terraform, otherwise it will cost you $3,000.

For development or staging environments, costs can be reduced to under $50 per month by scaling Cloud Run instances to zero, removing the Redis module and using local containers, eliminating the NAT gateway if static outbound IP addresses are not required, and potentially downgrading or replacing Cloud SQL with Firestore for simpler use cases.

Variable costs depend on usage and include storage fees, data transfer, LLM API calls, and DLP processing.

Conclusion

This reference architecture demonstrates that transitioning from AI proof of concept to production deployment requires careful attention to security, scalability, observability, and cost management.

By implementing infrastructure as code, following cloud-native best practices, and building defense in depth, teams can create a foundation that supports any AI agent system while handling the complexities of enterprise deployment.

The modular design allows components to be upgraded or replaced as requirements evolve, while the comprehensive security measures ensure compliance with enterprise standards. Whether deploying a simple RAG-based chatbot or a complex multi-agent system, this infrastructure provides the robust foundation needed for production success.

Acknowledgements

✨ Special thanks for Natalie Godec (https://medium.com/@ouvessvit), my fellow GDE for reviewing the Terraform deployment.

✨ Google ML Developer Programs and Google Developers Program supported this work by providing Google Cloud Credits.

Top comments (0)