This blog post was created for the Google Cloud Run Hackathon to share how I built NetGenius Instructor Copilot using Cloud Run Services, Cloud Run Jobs, and Google's Agent Development Kit (ADK).
The Problem: Lab Creation is Painfully Manual
After 25 years of teaching Cisco CCNA courses, I've witnessed the same frustration repeated countless times: instructors spending an entire afternoon meticulously crafting a networking lab (designing the topology, writing step-by-step instructions, creating configurations for each and every device) only to discover during class that a command doesn't work, an IP address is wrong, or a verification step fails.
The problem isn't lack of expertise. It's the sheer complexity of coordinating multiple moving parts: network design, CLI syntax, student instructions, and validation. A single typo can derail an entire lab session.
I knew AI could help, but building a production-ready solution required more than just "throw it at an LLM." It needed intelligent orchestration, specialized agents, and real validation. That's where Google Cloud Run and ADK came in.
The Solution: Multi-Agent AI Orchestration
NetGenius Instructor Copilot automates the entire lab creation lifecycle using four specialized AI agents, orchestrated by Google's Agent Development Kit and running on Cloud Run infrastructure.
Architecture Overview
The system consists of three main components:
- Frontend (Next.js on Vercel): Web interface for instructors
- Orchestrator (FastAPI + ADK on Cloud Run Service): Multi-agent coordination
- Network Simulator (Cloud Run Job): Headless validation engine
Why Cloud Run Was Perfect for This
Cloud Run Services: Always-On Orchestrator
The orchestrator runs as a Cloud Run Service, handling incoming API requests and coordinating the four ADK agents.
Why Cloud Run Service?
- Automatic scaling based on traffic
- Pay only for actual usage
- HTTPS out of the box
- Fast cold starts (about 2 seconds with Python + FastAPI)
Cloud Run Jobs: On-Demand Validation
The most innovative part is the Validator Agent, which triggers a Cloud Run Job to run headless network simulations.
Why Cloud Run Jobs?
- On-demand execution (only runs when validation needed)
- No server management
- Scales to zero when idle (zero cost)
- Isolated execution environment for network simulation
- Supports long-running tasks (task timeouts are configurable well beyond the 60-minute request limit of a Cloud Run Service)
The Complete Flow
- Planner Agent (LlmAgent) conducts multi-turn Q&A with the instructor
- Designer Agent (LlmAgent with tools) generates network topology YAML and device configurations
- Author Agent (LlmAgent) writes step-by-step lab guide with verification commands
- Validator Agent (Custom BaseAgent):
  - Packages everything into spec.json
  - Uploads to Google Cloud Storage
  - Triggers Cloud Run Job (headless-runner)
  - Network simulator downloads spec, runs commands in containerized routers
  - Uploads validation results back to GCS
  - Validator polls GCS and returns success/failure
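The Validator steps above can be sketched in plain Python. This is a minimal illustration, not the actual agent code: the `ArtifactStore` interface, the `InMemoryStore` stand-in, the `trigger_job` callback, the object paths, and the spec fields are all assumptions made so the sketch runs without GCS or Cloud Run credentials.

```python
import json
import time
from typing import Callable, Optional, Protocol


class ArtifactStore(Protocol):
    """Hypothetical storage interface; a real one would wrap a GCS client."""

    def write(self, path: str, data: str) -> None: ...

    def read(self, path: str) -> Optional[str]: ...


class InMemoryStore:
    """Stand-in for a GCS bucket so the sketch runs locally."""

    def __init__(self) -> None:
        self.objects: dict[str, str] = {}

    def write(self, path: str, data: str) -> None:
        self.objects[path] = data

    def read(self, path: str) -> Optional[str]:
        return self.objects.get(path)


def validate_lab(
    store: ArtifactStore,
    trigger_job: Callable[[str], None],
    lab_id: str,
    topology_yaml: str,
    device_configs: dict[str, str],
    lab_guide: str,
    poll_interval_s: float = 5.0,
    timeout_s: float = 600.0,
) -> dict:
    """Package the spec, upload it, trigger the runner, then poll for results."""
    spec_path = f"pending/{lab_id}/spec.json"      # assumed object layout
    result_path = f"results/{lab_id}/result.json"  # assumed object layout

    # 1. Package everything into a single spec document (assumed fields).
    spec = {
        "lab_id": lab_id,
        "topology": topology_yaml,
        "configs": device_configs,
        "guide": lab_guide,
    }
    # 2. Upload the spec, then 3. ask the runner to process it.
    store.write(spec_path, json.dumps(spec))
    trigger_job(spec_path)

    # 4. Poll until the runner has written its result or the deadline passes.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        raw = store.read(result_path)
        if raw is not None:
            return json.loads(raw)
        time.sleep(poll_interval_s)
    return {"success": False, "error": "timed out waiting for validation results"}
```

Passing the store and the trigger in as parameters keeps the polling logic testable: a fake runner that writes a result object as soon as it is triggered exercises the success path, and a runner that does nothing exercises the timeout path.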
Key Metrics & Results
After deploying to production:
- Time savings: 2-4 hours → 5-10 minutes (a reduction of more than 90%)
- Cold start time: ~2 seconds for orchestrator
- Validation time: 2-5 minutes per lab (Cloud Run Job execution)
- Cost: ~$0.50 per lab generation (mostly Gemini API + Cloud Run Job execution)
Deployment Architecture
Orchestrator Deployment
The orchestrator was deployed as a Cloud Run Service:
# Build and deploy to Cloud Run Service
gcloud builds submit --tag gcr.io/netgenius-hackathon/netgenius-orchestrator
gcloud run deploy netgenius-orchestrator \
  --image gcr.io/netgenius-hackathon/netgenius-orchestrator \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars "GOOGLE_API_KEY=${GOOGLE_API_KEY}"
Headless Runner Job
The network simulator engine was deployed as a Cloud Run Job, which is executed on demand each time a lab is generated:
# Deploy as Cloud Run Job
gcloud run jobs create headless-runner \
  --image us-central1-docker.pkg.dev/netgenius-hackathon/netgenius/headless-runner:latest \
  --region us-central1 \
  --set-env-vars SPEC_GCS_PATH=gs://netgenius-artifacts-dev/pending/latest/spec.json \
  --max-retries 0 \
  --task-timeout 10m
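The job above is created with a fixed `SPEC_GCS_PATH`. One way a caller could point a single execution at a different spec is the Cloud Run Admin API v2 `jobs.run` method, which accepts per-execution container overrides. The sketch below only builds the request URL and JSON body; it does not send anything, and the per-lab path (`lab-001`) is a hypothetical example. How the orchestrator actually triggers the job is not shown in this post, so treat this as one possible approach.

```python
import json


def build_run_job_request(
    project: str, region: str, job: str, spec_gcs_path: str
) -> tuple[str, str]:
    """Return the (URL, JSON body) for a Cloud Run Admin API v2 jobs.run call."""
    url = (
        "https://run.googleapis.com/v2/"
        f"projects/{project}/locations/{region}/jobs/{job}:run"
    )
    body = {
        "overrides": {
            "containerOverrides": [
                # Override the env var for this execution only; the job's
                # stored configuration is left unchanged.
                {"env": [{"name": "SPEC_GCS_PATH", "value": spec_gcs_path}]}
            ]
        }
    }
    return url, json.dumps(body)


url, body = build_run_job_request(
    "netgenius-hackathon",
    "us-central1",
    "headless-runner",
    "gs://netgenius-artifacts-dev/pending/lab-001/spec.json",  # hypothetical path
)
```

The resulting request would be sent as an HTTP POST with an OAuth 2.0 access token for an identity that is allowed to run the job.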
What I Learned About Cloud Run
1. Services vs Jobs: Complementary, Not Competing
Cloud Run Services and Jobs work beautifully together:
- Services: For APIs, webhooks, always-on endpoints
- Jobs: For batch processing, scheduled tasks, event-driven workloads
In my architecture:
- Service handles HTTP requests and orchestration
- Jobs handle compute-intensive simulation (only when needed)
2. Scale-to-Zero is a Game Changer
During development and testing, costs were negligible because:
- Orchestrator scales to zero between requests
- Validation jobs only run when explicitly triggered
- No idle server costs
3. Container Portability is Real
The same Docker containers run identically:
- Locally (for development)
- On Cloud Run (production)
- No environment-specific code needed
Future Enhancements
1. Expose RCA Agent in UI
The RCA (Root Cause Analysis) agent is already implemented in the backend:
# Already working in backend!
rca_agent = create_rca_agent() # Analyzes validation failures
patch_router = create_patch_router_agent() # Routes fixes to appropriate agent
# Classifies failures as:
# - DESIGN: Topology/config issue → retry Designer
# - INSTRUCTION: Lab guide error → retry Author
# - OBJECTIVES: Spec problem → escalate to human
It just needs a frontend UI to show retry progress.
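The three failure classes listed above imply a small routing table. The following is a minimal sketch of that idea, not the real `create_patch_router_agent()` logic: the `route_failure` function, the handler names, and the retry budget are assumptions for illustration.

```python
from enum import Enum


class FailureClass(Enum):
    DESIGN = "DESIGN"
    INSTRUCTION = "INSTRUCTION"
    OBJECTIVES = "OBJECTIVES"


# Which step handles each failure class, following the comments above.
ROUTES = {
    FailureClass.DESIGN: "designer",     # topology/config issue
    FailureClass.INSTRUCTION: "author",  # lab guide error
    FailureClass.OBJECTIVES: "human",    # spec problem, needs a person
}


def route_failure(classification: str, attempt: int, max_attempts: int = 2) -> str:
    """Return the next handler: 'designer', 'author', or 'human'."""
    try:
        failure = FailureClass(classification.strip().upper())
    except ValueError:
        return "human"  # unknown label: escalate rather than guess
    if attempt >= max_attempts:
        return "human"  # retry budget exhausted (assumed policy)
    return ROUTES[failure]
```

Escalating on an unrecognized label or an exhausted retry budget keeps the loop from cycling between agents indefinitely, and the returned handler name is the kind of state a retry-progress UI could display.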
2. Topology Visualization
Generate visual network diagrams from topology YAML using D3.js or Cytoscape.js.
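On the backend side, this could start with a conversion from the parsed topology into the `nodes`/`edges` element structure that Cytoscape.js accepts. The sketch below assumes a hypothetical topology shape (a `devices` list with `name` and `type`, and a `links` list with endpoints `a` and `b`); the real topology YAML schema is not shown in this post.

```python
def topology_to_cytoscape(topology: dict) -> dict:
    """Convert a {devices, links} topology dict into Cytoscape.js elements."""
    nodes = [
        {
            "data": {
                "id": device["name"],
                "label": device["name"],
                "type": device.get("type", "router"),  # assumed default
            }
        }
        for device in topology.get("devices", [])
    ]
    edges = [
        {
            "data": {
                "id": f"{link['a']}--{link['b']}",
                "source": link["a"],
                "target": link["b"],
            }
        }
        for link in topology.get("links", [])
    ]
    return {"nodes": nodes, "edges": edges}


# Hypothetical two-device topology, as it might look after YAML parsing.
example = {
    "devices": [{"name": "R1", "type": "router"}, {"name": "SW1", "type": "switch"}],
    "links": [{"a": "R1", "b": "SW1"}],
}
elements = topology_to_cytoscape(example)
```

The returned dictionary can be serialized to JSON and passed to the frontend as the `elements` option of a Cytoscape.js instance.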
3. Lab Editing
Allow instructors to request modifications: "Add a troubleshooting component" or "Simplify the IP addressing."
Try It Yourself
- Live Demo: https://copilot.netgenius.ai
- GitHub: https://github.com/racampos/cloud-run-hackathon
- Hackathon Submission: https://devpost.com/software/netgenius-instructor-copilot-nic
The orchestrator is fully open source. The network simulator is proprietary (our "secret sauce"), but the API contract is documented.
Conclusion
Building NetGenius Instructor Copilot taught me that modern cloud infrastructure (Cloud Run) combined with intelligent orchestration frameworks (Google ADK) can solve real-world problems that seemed impossible just a year ago.
The combination of:
- Cloud Run Services for the always-on orchestrator
- Cloud Run Jobs for on-demand validation
- Google ADK for multi-agent coordination
- Gemini 2.5 Flash for AI reasoning
created a production-ready system that will save instructors (myself included) hours of manual work.
If you're building multi-agent AI systems, I highly recommend exploring Google ADK and Cloud Run. The developer experience is excellent, and the scale-to-zero cost model is perfect for bootstrapped projects.
Questions? Feedback? Drop a comment below or reach out on X/Twitter (@racampos)
Built for the Google Cloud Run Hackathon. #CloudRunHackathon #GoogleCloud #AI #EdTech

