DEV Community

Sarthak Rawat


From Idea to Pull Request in Minutes: Building an Autonomous Dev Team with Google Gemini


This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

Ever wished you could just describe an app idea and have it built for you? That's exactly what I set out to create with AutoStack — an AI-powered autonomous software factory that takes your project description and handles everything from planning to deployment.

Here's the wild part: I built this entire system with Google Gemini, and then the system itself uses AI agents to build other projects. Meta, right?

The Problem

As developers, we spend countless hours on repetitive tasks: setting up project structures, writing boilerplate code, creating tests, writing documentation, and provisioning cloud resources. What if AI could handle all of that while we focus on the creative, high-level decisions?

The Solution: AutoStack

AutoStack is a multi-agent AI system with two core capabilities:

1. AutoStack Core (Software Development)
A team of four specialized AI agents that work together like a real dev team:

  • Project Manager Agent: Analyzes your requirements, researches current best practices using Tavily AI, and creates a detailed technical specification with task breakdowns
  • Developer Agent: Writes production-ready code in batches, manages Git operations, creates branches, and opens pull requests
  • QA Agent: Reviews every line of code, generates comprehensive unit tests, and validates implementations against requirements
  • Documentation Agent: Generates READMEs, API documentation, and user guides that stay in sync with your codebase

2. AutoStack Infra (Cloud Provisioning)
A trio of agents that handle Azure infrastructure:

  • Infrastructure Architect: Designs secure, scalable cloud architectures
  • DevOps Agent: Generates clean Terraform code with proper state management
  • SecOps Agent: Validates security using Checkov and estimates costs with Infracost
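The agents in both teams share one pattern: each reads what earlier agents produced and adds its own artifact to a shared state. A minimal sketch of that data flow for the four Core agents (hypothetical stubs — the real agents are LLM-backed and orchestrated by LangGraph):

```python
# Hypothetical sketch: each agent reads and extends a shared state dict.
# Real AutoStack agents call an LLM; these stubs only show the data flow.

def pm_agent(state):
    state["spec"] = f"spec for: {state['idea']}"          # requirements -> spec
    return state

def developer_agent(state):
    state["code"] = f"code implementing {state['spec']}"  # spec -> code + PR
    return state

def qa_agent(state):
    state["tests"] = f"tests for {state['code']}"         # code -> review + tests
    return state

def docs_agent(state):
    state["docs"] = f"docs for {state['code']}"           # code -> README/API docs
    return state

def autostack_core(idea):
    """Run the four-agent pipeline in order over shared state."""
    state = {"idea": idea}
    for agent in (pm_agent, developer_agent, qa_agent, docs_agent):
        state = agent(state)
    return state
```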

Google Gemini's Role

Google Gemini was absolutely central to this project in multiple ways:

During Development:
I used Gemini as my coding partner throughout the entire build process. From architecting the LangGraph workflow orchestration to debugging tricky async issues with ChromaDB, Gemini helped me think through complex problems and write cleaner code. It was like pair programming with someone who never gets tired.

In the System Architecture:

  • Vector Embeddings: I'm using Google's text-embedding-004 model through ChromaDB for the agent memory system. This allows agents to semantically search through previous decisions, code artifacts, and architectural plans.
  • Initial LLM Choice: I originally built the entire agent system using Google Gemini models for all LLM operations. The structured output capabilities and reasoning quality were excellent for complex tasks like architecture planning and code generation.
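To make the embedding-backed memory concrete, here's a stdlib-only toy: a bag-of-letters vector stands in for text-embedding-004, and retrieval is cosine similarity. The real system delegates both steps to ChromaDB and the Gemini embedding API; this only illustrates why semantic search finds related artifacts without exact keyword matches:

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for text-embedding-004: a bag-of-letters vector.
    # The real system calls the Gemini embedding API instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class AgentMemory:
    """Stores text artifacts with embeddings; retrieval is nearest-neighbor."""

    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```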

The Quota Reality:
Here's where things got interesting. After building and testing the system extensively, I hit my Google Cloud quota limits (turns out running a multi-agent system that generates entire codebases uses a lot of tokens!). This forced me to pivot to Groq and OpenRouter for the LLM operations, but I kept Gemini embeddings because they work so well for semantic search.

This actually taught me an important lesson about building production AI systems: always have fallback options and design for flexibility.
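That pivot is much easier when every provider call already goes through a small fallback wrapper. A hedged sketch, with plain callables standing in for the Gemini/Groq/OpenRouter clients:

```python
def call_with_fallback(prompt, providers):
    """Try each (name, call) provider in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # quota / rate-limit errors in practice
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

With this shape, changing the primary LLM is just reordering the list — no agent code has to know which provider answered.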

Tech Stack

  • Backend: FastAPI + Python 3.11
  • Frontend: Next.js 16 + React 19 + TypeScript
  • AI Orchestration: LangChain + LangGraph with PostgreSQL checkpointing
  • Vector Store: ChromaDB with Google Gemini embeddings
  • LLMs: Groq (Llama 3) + OpenRouter (Qwen) — originally Google Gemini
  • Database: PostgreSQL
  • Tools: GitHub API, Tavily (research), Checkov (security), Infracost (cost estimation)

How It Works

The workflow is beautifully orchestrated using LangGraph:

  1. Initialize: Create project and repository
  2. Plan: PM agent researches and creates task breakdown
  3. Develop: Developer agent implements features in batches
  4. Test: QA agent reviews code and generates tests
  5. Document: Documentation agent creates comprehensive docs
  6. Review: System validates everything works
  7. Finalize: Merge PR and deploy
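In LangGraph these phases become nodes in a StateGraph, with a conditional edge from review back to develop when validation fails. A plain-Python sketch of that control flow (hypothetical — no LangGraph dependency, and a log entry stands in for each node's real work):

```python
PHASES = ["initialize", "plan", "develop", "test", "document", "review", "finalize"]

def run_pipeline(state):
    """Walk the phases in order; a failed review loops back to develop."""
    i = 0
    while i < len(PHASES):
        phase = PHASES[i]
        state["log"].append(phase)          # the real node would do agent work here
        if phase == "review" and not state.get("approved", True):
            state["approved"] = True        # assume the rework passes next time
            i = PHASES.index("develop")     # conditional edge: review -> develop
            continue
        i += 1
    return state
```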

The system supports two modes:

  • Automatic Mode: Fully autonomous execution
  • Human-in-the-Loop Mode: Pause at each phase for your approval and feedback

If you request changes, the agents loop back and refine their work based on your feedback. It's like having a dev team that actually listens!
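LangGraph's interrupt and checkpoint machinery is what makes the pause-and-resume possible. Stripped of persistence, the approval loop reduces to something like this sketch, with a callback standing in for the human reviewer (returning `None` means "approved"):

```python
def run_with_approval(phases, get_feedback):
    """After each phase, ask for feedback; redo the phase until approved."""
    history = []
    for phase in phases:
        while True:
            history.append(phase)       # do (or redo) the phase's work
            feedback = get_feedback(phase)
            if feedback is None:        # approved: move on to the next phase
                break
            # otherwise loop: the agent refines this phase using `feedback`
    return history
```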

Real-World Features

Some cool things the system does:

  • Smart Code Analysis: Uses RepoMap to understand existing codebases semantically
  • Research Integration: Agents use Tavily to look up current package versions and best practices
  • Selective Updates: During refinement, only modifies files that need changes
  • GitHub Actions Integration: Automatically sets up CI/CD and runs tests
  • Notifications: Sends updates via Slack/Discord as work progresses

Demo

Demo video: Watch Demo
GitHub repository: GitHub

What I Learned

Technical Lessons

1. LangGraph is Powerful but Complex
Building a state machine with conditional branching, checkpointing, and interrupt points taught me a lot about workflow orchestration. The ability to pause execution and resume later is crucial for human-in-the-loop systems.

2. Agent Memory is Critical
I initially tried semantic search for everything, but learned that critical data (like architecture plans) needs exact-key retrieval. Semantic search is great for discovery, but not for reliability.
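In practice that means two retrieval paths over the same store: deterministic lookup by key for critical artifacts, similarity search for discovery. A toy sketch, with word overlap standing in for embedding similarity:

```python
class HybridMemory:
    """Exact-key store for critical artifacts, plus a searchable index."""

    def __init__(self):
        self.by_key = {}   # deterministic retrieval, e.g. "architecture_plan"
        self.index = []    # (text, key) pairs for fuzzy discovery

    def put(self, key, text):
        self.by_key[key] = text
        self.index.append((text, key))

    def get(self, key):
        # Critical reads go through exact keys -- never a similarity score.
        return self.by_key[key]

    def search(self, query):
        # Toy "semantic" search: rank by shared words (embeddings in real life).
        q = set(query.lower().split())
        ranked = sorted(self.index,
                        key=lambda it: len(q & set(it[0].lower().split())),
                        reverse=True)
        return [key for _, key in ranked]
```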

3. Rate Limiting Matters
When you have multiple agents making LLM calls in parallel, you hit rate limits fast. I implemented a proper rate limiter that respects Groq's 30 requests/minute limit.
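A sliding-window limiter is enough for this: remember the timestamps of recent calls and block until a slot frees up. A self-contained sketch (the 30/60 defaults mirror Groq's published per-minute limit; the real implementation sits in front of every LLM call):

```python
import threading
import time
from collections import deque

class RateLimiter:
    """Block until a slot is free: at most `limit` calls per `window` seconds."""

    def __init__(self, limit=30, window=60.0):
        self.limit, self.window = limit, window
        self.calls = deque()          # monotonic timestamps of recent calls
        self.lock = threading.Lock()  # agents acquire from multiple threads

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                while self.calls and now - self.calls[0] >= self.window:
                    self.calls.popleft()
                if len(self.calls) < self.limit:
                    self.calls.append(now)
                    return
                wait = self.window - (now - self.calls[0])
            time.sleep(wait)          # sleep outside the lock, then retry
```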

4. Structured Output is Tricky
Different LLM providers handle structured output differently. I built a fallback system that tries tool calling first, then falls back to JSON mode if that fails. This made the system much more robust.
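A sketch of that fallback chain — the real version goes through LangChain's tool-calling and JSON-mode interfaces; here a single hypothetical `llm_call(prompt, mode=...)` stands in for both, and a field check produces the specific errors that make debugging bearable:

```python
import json

def get_structured(llm_call, prompt, required_fields):
    """Prefer tool-calling for structured output; fall back to JSON mode."""
    try:
        result = llm_call(prompt, mode="tools")  # provider may not support this
    except Exception:
        result = llm_call(prompt, mode="json")   # fallback: raw JSON text
    data = json.loads(result) if isinstance(result, str) else result
    missing = [f for f in required_fields if f not in data]
    if missing:
        # A specific message beats "invalid JSON" when debugging agents.
        raise ValueError(f"missing required fields: {missing}")
    return data
```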

5. Context Management is an Art
Agents need enough context to make good decisions, but too much context wastes tokens and slows things down. I learned to use compact summaries and only fetch detailed code when necessary.
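One way to cash that out: always send the compact summaries, and pull in detailed code only while a budget holds (characters here as a stand-in for tokens). A hypothetical helper:

```python
def build_context(summaries, details, budget_chars=2000):
    """Include all summaries; append detail blocks only while budget allows."""
    parts = list(summaries)     # compact summaries always ship
    used = sum(len(p) for p in parts)
    for block in details:       # detailed code is optional, in priority order
        if used + len(block) > budget_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```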

Soft Skills

Debugging Distributed Systems: When you have four agents working together, figuring out which one caused an issue requires systematic thinking and good logging.

User Experience Design: Building a system that's both powerful and approachable meant thinking carefully about when to automate and when to ask for human input.

Resilience Planning: The quota issue forced me to think about graceful degradation and provider flexibility from day one.

Unexpected Lessons

The biggest surprise? Agents need to communicate through shared memory, not just state. I initially tried passing everything through the LangGraph state, but that became unwieldy. Using ChromaDB as a shared memory layer where agents can leave notes for each other worked much better.

Also, I learned that AI agents are opinionated! The Developer agent has strong preferences about code structure, and sometimes you need to guide it with specific instructions in the prompts.

Google Gemini Feedback

What Worked Well

1. Development Partnership
Using Gemini during development was fantastic. It understood complex architectural questions and helped me think through edge cases I hadn't considered. The conversational nature made it feel like I had a senior developer reviewing my approach.

2. Embedding Quality
The text-embedding-004 model produces excellent embeddings for code and technical content. Semantic search results are consistently relevant, which is crucial for agent memory retrieval.

3. Structured Output (When I Used It)
When I was using Gemini models for the agents, the structured output capabilities were solid. It understood Pydantic schemas well and generated valid JSON consistently.

4. Reasoning Quality
For complex tasks like architecture planning and requirement analysis, Gemini's reasoning was impressive. It could break down vague requirements into concrete technical specifications.

Where I Needed More Support

1. Cost Estimation Tools
For a project like this that makes hundreds of LLM calls, having a built-in cost estimator would be incredibly valuable. Something like "this operation will cost approximately X tokens" before execution.

2. Batch Processing
When you need to make many similar LLM calls (like analyzing multiple files), batch processing with better rate limit handling would be helpful.

Suggestions for Improvement

1. Persistent Context Windows
For agent systems, having a way to maintain context across multiple calls without re-sending everything would be game-changing. Something like a session-based context that persists.

2. Agent-Specific Models
Different agents have different needs. A lightweight model for simple tasks and a powerful model for complex reasoning would let developers optimize cost vs. quality.

3. Streaming for Long Operations
When generating large amounts of code, streaming responses would improve UX significantly. Users could see progress instead of waiting for the entire response.

4. Better Error Messages
When structured output fails, more specific error messages about what went wrong would help debugging. "Invalid JSON" is less helpful than "Missing required field: architecture_plan.tech_stack".

The Honest Take

Despite hitting quota limits, I'm genuinely impressed with Google Gemini. The quality of responses, especially for technical tasks, is excellent. The embedding model is rock-solid and hasn't let me down once.

It pushed me to build a more flexible system. In production, you need to handle provider failures anyway, so this was a valuable lesson.

Google Gemini is my go-to for developing modern, beautiful frontends and even for writing polished backends.


Would I use Google Gemini again? Absolutely. In fact, I'm planning to add it back as a primary LLM option. For anyone building AI systems, I'd recommend:

  • Start with Gemini for development and prototyping
  • Use the embedding models for production (they're great)
  • Plan for quota limits from day one

AutoStack is open source and available on GitHub. If you're interested in AI agents, LangGraph orchestration, or just want to see how to build a complex multi-agent system, check it out!
