Rafael Campos

Posted on Nov 11

Building NetGenius Instructor Copilot: A Multi-Agent AI System on Google Cloud Run

#agents #google #serverless #ai

This blog post was created for the Google Cloud Run Hackathon to share how I built NetGenius Instructor Copilot using Cloud Run Services, Cloud Run Jobs, and Google's Agent Development Kit (ADK).

The Problem: Lab Creation is Painfully Manual

After 25 years of teaching Cisco CCNA courses, I've seen the same frustration play out countless times: an instructor spends an entire afternoon meticulously crafting a networking lab (designing the topology, writing step-by-step instructions, creating configurations for every device) only to discover during class that a command doesn't work, an IP address is wrong, or a verification step fails.

The problem isn't lack of expertise. It's the sheer complexity of coordinating multiple moving parts: network design, CLI syntax, student instructions, and validation. A single typo can derail an entire lab session.

I knew AI could help, but building a production-ready solution required more than just "throw it at an LLM." It needed intelligent orchestration, specialized agents, and real validation. That's where Google Cloud Run and ADK came in.

The Solution: Multi-Agent AI Orchestration

NetGenius Instructor Copilot automates the entire lab creation lifecycle using four specialized AI agents, orchestrated by Google's Agent Development Kit and running on Cloud Run infrastructure.

Architecture Overview

The system consists of three main components:

Frontend (Next.js on Vercel): Web interface for instructors
Orchestrator (FastAPI + ADK on Cloud Run Service): Multi-agent coordination
Network Simulator (Cloud Run Job): Headless validation engine

Why Cloud Run Was Perfect for This

Cloud Run Services: Always-On Orchestrator

The orchestrator runs as a Cloud Run Service, handling incoming API requests and coordinating the four ADK agents.

Why Cloud Run Service?

Automatic scaling based on traffic
Pay only for actual usage
HTTPS out of the box
Fast cold starts (about 2 seconds with Python + FastAPI)

Cloud Run Jobs: On-Demand Validation

The most innovative part is the Validator Agent, which triggers a Cloud Run Job to run headless network simulations.

Why Cloud Run Jobs?

On-demand execution (only runs when validation needed)
No server management
Scales to zero when idle (zero cost)
Isolated execution environment for network simulation
Supports long-running tasks (the task timeout is configurable to 24 hours or more, far beyond the 60-minute request limit on Services)
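
One way an orchestrator could start the job for a specific lab is the Cloud Run Admin API's `jobs.run` method, which accepts per-execution overrides. The sketch below only builds the request URL and JSON body; it assumes the v2 REST endpoint shape and passes the spec location through a container env override. The helper name `build_run_job_request` and the `lab-123` path segment are hypothetical, and the project's actual trigger code may differ.

```python
import json
from typing import Dict, Tuple


def build_run_job_request(
    project: str, region: str, job: str, env: Dict[str, str]
) -> Tuple[str, str]:
    """Return (url, json_body) for a Cloud Run Admin API v2 jobs.run call.

    Assumption: the env overrides apply only to this execution, so each
    lab can point the runner at its own spec file.
    """
    url = (
        "https://run.googleapis.com/v2/"
        f"projects/{project}/locations/{region}/jobs/{job}:run"
    )
    body = {
        "overrides": {
            "containerOverrides": [
                {"env": [{"name": k, "value": v} for k, v in env.items()]}
            ]
        }
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    url, body = build_run_job_request(
        "netgenius-hackathon",
        "us-central1",
        "headless-runner",
        # Hypothetical per-lab path; the real layout is not documented here.
        {"SPEC_GCS_PATH": "gs://netgenius-artifacts-dev/pending/lab-123/spec.json"},
    )
    print(url)
    print(body)
```

Actually sending the request would also need an authenticated POST (for example with an OAuth access token for a service account allowed to run the job), which is left out here.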

The Complete Flow

Planner Agent (LlmAgent) conducts multi-turn Q&A with the instructor
Designer Agent (LlmAgent with tools) generates network topology YAML and device configurations
Author Agent (LlmAgent) writes step-by-step lab guide with verification commands
Validator Agent (Custom BaseAgent):
- Packages everything into spec.json
- Uploads to Google Cloud Storage
- Triggers Cloud Run Job (headless-runner)
- Network simulator downloads spec, runs commands in containerized routers
- Uploads validation results back to GCS
- Validator polls GCS and returns success/failure
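
The last Validator step above (poll until the results file appears) can be sketched as a small loop with a deadline. This is a minimal illustration, not the project's code: `poll_for_result` is a hypothetical helper, and `fetch` stands in for whatever function reads the results object from GCS and returns `None` while it does not exist yet.

```python
import json
import time
from typing import Callable, Optional


def poll_for_result(
    fetch: Callable[[], Optional[str]],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns JSON text or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        raw = fetch()
        if raw is not None:
            # Results are assumed to be a JSON document.
            return json.loads(raw)
        if time.monotonic() >= deadline:
            raise TimeoutError("validation result did not appear in time")
        sleep(interval_s)


if __name__ == "__main__":
    # Fake storage: the result shows up on the third poll.
    responses = iter([None, None, '{"success": true}'])
    result = poll_for_result(lambda: next(responses), interval_s=0)
    print(result)
```

Injecting `fetch` and `sleep` keeps the loop testable without touching GCS, and the deadline guards against a job that crashes before writing any result.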

Key Metrics & Results

After deploying to production:

Time savings: 2-4 hours → 5-10 minutes (over 90% reduction)
Cold start time: ~2 seconds for orchestrator
Validation time: 2-5 minutes per lab (Cloud Run Job execution)
Cost: ~$0.50 per lab generation (mostly Gemini API + Cloud Run Job execution)

Deployment Architecture

Orchestrator Deployment

The orchestrator was deployed as a Cloud Run Service:

# Build and deploy to Cloud Run Service
gcloud builds submit --tag gcr.io/netgenius-hackathon/netgenius-orchestrator
gcloud run deploy netgenius-orchestrator \
  --image gcr.io/netgenius-hackathon/netgenius-orchestrator \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GOOGLE_API_KEY=${GOOGLE_API_KEY}

Headless Runner Job

The network simulator engine was deployed as a Cloud Run Job, which runs on demand each time a lab is generated:

# Deploy as Cloud Run Job
gcloud run jobs create headless-runner \
  --image us-central1-docker.pkg.dev/netgenius-hackathon/netgenius/headless-runner:latest \
  --region us-central1 \
  --set-env-vars SPEC_GCS_PATH=gs://netgenius-artifacts-dev/pending/latest/spec.json \
  --max-retries 0 \
  --task-timeout 10m
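
Inside the container, the runner has to turn `SPEC_GCS_PATH` into a bucket name and an object name before it can download the spec. The simulator is proprietary, so the following is only a guess at that step; `parse_gcs_path` and `spec_location` are hypothetical helpers.

```python
import os
from typing import Mapping, Tuple


def parse_gcs_path(path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/object' into ('bucket', 'some/object')."""
    prefix = "gs://"
    if not path.startswith(prefix):
        raise ValueError(f"not a gs:// path: {path!r}")
    bucket, _, blob = path[len(prefix):].partition("/")
    if not bucket or not blob:
        raise ValueError(f"missing bucket or object name: {path!r}")
    return bucket, blob


def spec_location(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the spec location from the env var set on the job."""
    return parse_gcs_path(env["SPEC_GCS_PATH"])


if __name__ == "__main__":
    print(parse_gcs_path("gs://netgenius-artifacts-dev/pending/latest/spec.json"))
```

Failing fast on a malformed path matters here because the job is created with `--max-retries 0`, so a bad value should surface as one clear error rather than a confusing download failure.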

What I Learned About Cloud Run

1. Services vs Jobs: Complementary, Not Competing

Cloud Run Services and Jobs work beautifully together:

Services: For APIs, webhooks, always-on endpoints
Jobs: For batch processing, scheduled tasks, event-driven workloads

In my architecture:

Service handles HTTP requests and orchestration
Jobs handle compute-intensive simulation (only when needed)

2. Scale-to-Zero is a Game Changer

During development and testing, costs were negligible because:

Orchestrator scales to zero between requests
Validation jobs only run when explicitly triggered
No idle server costs

3. Container Portability is Real

The same Docker containers run identically:

Locally (for development)
On Cloud Run (production)
No environment-specific code needed
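
A concrete instance of this is the listening port. Cloud Run passes the port to the container through the `PORT` environment variable (8080 by default), so reading it with a local fallback lets the same image run in both places. A minimal sketch, with `listen_port` as a hypothetical helper name:

```python
import os
from typing import Mapping


def listen_port(env: Mapping[str, str] = os.environ, default: int = 8080) -> int:
    """Return the port to bind: Cloud Run's PORT if set, else a local default."""
    return int(env.get("PORT", default))


if __name__ == "__main__":
    print(listen_port())
```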

Future Enhancements

1. Expose RCA Agent in UI

The RCA (Root Cause Analysis) agent is already implemented in the backend:

# Already working in backend!
rca_agent = create_rca_agent()  # Analyzes validation failures
patch_router = create_patch_router_agent()  # Routes fixes to appropriate agent

# Classifies failures as:
# - DESIGN: Topology/config issue → retry Designer
# - INSTRUCTION: Lab guide error → retry Author
# - OBJECTIVES: Spec problem → escalate to human

It just needs a frontend UI to show retry progress.
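
The classification-to-retry mapping in the comments above could be expressed as a small lookup. This sketch is illustrative only: `FailureClass`, `ROUTES`, and `route_fix` are hypothetical names, and escalating unknown labels to a human is an added assumption rather than documented behavior.

```python
from enum import Enum


class FailureClass(Enum):
    DESIGN = "DESIGN"            # topology/config issue
    INSTRUCTION = "INSTRUCTION"  # lab guide error
    OBJECTIVES = "OBJECTIVES"    # spec problem


# Which party should handle each failure class.
ROUTES = {
    FailureClass.DESIGN: "designer",
    FailureClass.INSTRUCTION: "author",
    FailureClass.OBJECTIVES: "human",
}


def route_fix(label: str) -> str:
    """Map an RCA label to the agent to retry, defaulting to a human."""
    try:
        failure = FailureClass(label.strip().upper())
    except ValueError:
        # Assumption: anything unrecognized is escalated, not retried.
        return "human"
    return ROUTES[failure]


if __name__ == "__main__":
    for label in ("DESIGN", "instruction", "OBJECTIVES", "???"):
        print(label, "->", route_fix(label))
```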

2. Topology Visualization

Generate visual network diagrams from topology YAML using D3.js or Cytoscape.js.
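
On the backend side, this could start with converting the parsed topology into the node and edge element list that Cytoscape.js accepts. The topology schema below (`devices` with a `name`, `links` with two `endpoints`) is an assumption made for illustration, since the real YAML format is not shown in this post; the dict would typically come from a YAML parser such as PyYAML's `yaml.safe_load`.

```python
import json
from typing import Any, Dict, List


def topology_to_cytoscape(topology: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert a (hypothetical) topology dict into Cytoscape.js elements."""
    elements: List[Dict[str, Any]] = []
    for device in topology.get("devices", []):
        # Nodes only need an id; the label is used for display.
        elements.append({"data": {"id": device["name"], "label": device["name"]}})
    for link in topology.get("links", []):
        a, b = link["endpoints"]
        # Edges reference node ids through source and target.
        elements.append({"data": {"id": f"{a}-{b}", "source": a, "target": b}})
    return elements


if __name__ == "__main__":
    demo = {
        "devices": [{"name": "R1"}, {"name": "R2"}, {"name": "SW1"}],
        "links": [{"endpoints": ["R1", "R2"]}, {"endpoints": ["R1", "SW1"]}],
    }
    print(json.dumps(topology_to_cytoscape(demo), indent=2))
```

The frontend could then pass the resulting JSON as the `elements` option when creating the Cytoscape instance.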

3. Lab Editing

Allow instructors to request modifications: "Add a troubleshooting component" or "Simplify the IP addressing."

Try It Yourself

Live Demo: https://copilot.netgenius.ai
GitHub: https://github.com/racampos/cloud-run-hackathon
Hackathon Submission: https://devpost.com/software/netgenius-instructor-copilot-nic

The orchestrator is fully open-source. The network simulator is proprietary (our "secret sauce"), but the API contract is documented.

Conclusion

Building NetGenius Instructor Copilot taught me that modern cloud infrastructure (Cloud Run) combined with intelligent orchestration frameworks (Google ADK) can solve real-world problems that seemed impossible just a year ago.

The combination of:

Cloud Run Services for the always-on orchestrator
Cloud Run Jobs for on-demand validation
Google ADK for multi-agent coordination
Gemini 2.5 Flash for AI reasoning

created a production-ready system that will save instructors (including myself) hours of manual work.

If you're building multi-agent AI systems, I highly recommend exploring Google ADK and Cloud Run. The developer experience is excellent, and the scale-to-zero cost model is perfect for bootstrapped projects.

Questions? Feedback? Drop a comment below or reach out on X/Twitter (@racampos).

Built for the Google Cloud Run Hackathon. #CloudRunHackathon #GoogleCloud #AI #EdTech
