At Bunnyshell, we’re building the environment layer for modern software delivery. One of the hardest problems our users face is converting arbitrary codebases into production-ready environments, especially when dealing with monoliths, microservices, ML workloads, and non-standard frameworks.
To solve this, we built MACS: a multi-agent system that automates containerization and deployment from any Git repo. With MACS, developers can go from raw source code to a live, validated environment in minutes, without writing Docker or Compose files manually.
In this post, we’ll share how we architected MACS internally, the design patterns we borrowed, and why a multi-agent approach was essential for solving this problem at scale.
Problem: From Codebase to Cloud, Automatically
Containerizing an application isn’t just about writing a Dockerfile. It involves:
- Analyzing unfamiliar codebases
- Detecting languages, frameworks, services, and databases
- Researching Docker best practices (and edge cases)
- Building and testing artifacts
- Debugging failed builds
- Composing services and deploying environments
This process typically takes hours or days for experienced DevOps teams. We wanted to compress it to minutes, with no human intervention.
The Multi-Agent Approach
Similar to Anthropic’s research assistant and other cognitive architectures, we split the problem into multiple specialized agents, each responsible for a narrow set of capabilities. Agents operate independently, communicate asynchronously, and converge on a working deployment through iterative refinement.
Our agent topology:
| Agent | Responsibility |
| --- | --- |
| Orchestrator | Breaks goals into atomic tasks, tracks plan state |
| Delegator | Manages task distribution and parallelism |
| Analyzer | Performs static & semantic code analysis |
| Researcher | Queries web resources for heuristics and Docker patterns |
| Executor | Builds, tests, and validates artifacts |
| Memory Store | Stores past runs, diffs, artifacts, logs |
This modular architecture enables robustness, parallel discovery, and reflexive self-correction when things go wrong.
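Agents talk to each other through plain structured messages (more on why under "What We've Learned"). As a rough illustration of the kind of envelope we pass around, here's a minimal Python sketch; the field names are illustrative, not our actual schema:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class AgentMessage:
    """Illustrative envelope for inter-agent messages (field names are hypothetical)."""
    sender: str                      # e.g. "analyzer"
    recipient: str                   # e.g. "orchestrator"
    task_id: str                     # which subtask this message belongs to
    kind: str                        # "task", "result", or "error"
    payload: dict = field(default_factory=dict)
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        return json.dumps(asdict(self))

# Example: the Analyzer reporting a detected service back to the Orchestrator
msg = AgentMessage(
    sender="analyzer",
    recipient="orchestrator",
    task_id="detect-services",
    kind="result",
    payload={"language": "python", "framework": "fastapi", "port": 8000},
)
print(msg.to_json())
```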
Pipeline Flow
Each repo flows through a pipeline of loosely coupled agent interactions:
Initialization
- A Git URL is submitted via the UI, CLI, or API
- The system builds a contextual index: file tree, README, CI/CD hints, existing Dockerfiles
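As a simplified sketch, building that contextual index can look like a plain directory walk that collects a handful of signals; the real index is richer than this:

```python
from pathlib import Path

def build_context_index(repo_root: str, max_files: int = 500) -> dict:
    """Collect lightweight signals about a checked-out repo (illustrative only)."""
    root = Path(repo_root)
    files = [p for p in root.rglob("*") if p.is_file()][:max_files]
    return {
        "file_tree": [str(p.relative_to(root)) for p in files],
        "readme": next((p.read_text(errors="ignore") for p in files
                        if p.name.lower().startswith("readme")), None),
        "has_dockerfile": any(p.name == "Dockerfile" for p in files),
        "ci_hints": [str(p.relative_to(root)) for p in files
                     if ".github/workflows" in str(p) or p.name == ".gitlab-ci.yml"],
    }
```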
Planning
- The Orchestrator builds a goal tree: identify components, generate artifacts, validate outputs
- The Delegator breaks tasks into subtrees and assigns them to the Analyzer and Researcher in parallel
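Conceptually, the plan is a small tree of tasks that the Delegator can hand out once their dependencies are met. The structure below is an illustrative sketch, not our production data model:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    assignee: str                      # "analyzer", "researcher", "executor"
    depends_on: list[str] = field(default_factory=list)
    children: list["Task"] = field(default_factory=list)

# Hypothetical goal tree for a single repo
plan = Task("deploy-environment", assignee="orchestrator", children=[
    Task("identify-components", assignee="analyzer"),
    Task("gather-docker-patterns", assignee="researcher"),
    Task("generate-artifacts", assignee="executor",
         depends_on=["identify-components", "gather-docker-patterns"]),
    Task("validate-build", assignee="executor", depends_on=["generate-artifacts"]),
])

def ready_tasks(task: Task, done: set[str]) -> list[Task]:
    """Leaf tasks whose dependencies are satisfied — these can run in parallel."""
    out = []
    for child in task.children:
        if not child.children and child.name not in done and set(child.depends_on) <= done:
            out.append(child)
        out.extend(ready_tasks(child, done))
    return out

print([t.name for t in ready_tasks(plan, done=set())])
# -> ['identify-components', 'gather-docker-patterns']
```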
Discovery
- The Analyzer inspects the codebase: it detects languages such as Python, Node.js, and Go, plus frameworks like Flask, FastAPI, and Express
- The Researcher consults external heuristics (e.g., "best Dockerfile for Django + Celery + Redis")
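A rough sketch of the kind of heuristic the Analyzer applies is below; in practice detection is backed by semantic analysis of the source itself, not just manifest files:

```python
from pathlib import Path

# Illustrative manifest-based hints; the real Analyzer also reads the code
MANIFEST_HINTS = {
    "requirements.txt": "python",
    "pyproject.toml": "python",
    "package.json": "node",
    "go.mod": "go",
}
FRAMEWORK_HINTS = {
    "python": {"flask": "Flask", "fastapi": "FastAPI", "django": "Django"},
    "node": {"express": "Express", "next": "Next.js"},
}

def detect_stack(repo_root: str) -> dict:
    """Guess languages and frameworks from dependency manifests (simplified)."""
    root = Path(repo_root)
    stack = {"languages": set(), "frameworks": set()}
    for manifest, language in MANIFEST_HINTS.items():
        path = root / manifest
        if path.exists():
            stack["languages"].add(language)
            text = path.read_text(errors="ignore").lower()
            for needle, framework in FRAMEWORK_HINTS.get(language, {}).items():
                if needle in text:
                    stack["frameworks"].add(framework)
    return stack
```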
Synthesis
- The Executor generates the Dockerfile and Compose services
- Everything is run in ephemeral Docker sandboxes
- Logs and test results are collected
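Stripped of the sandbox resource caps described under Implementation Details, the Executor's build-and-collect-logs step boils down to something like this sketch (it assumes Docker is available and a repo is checked out at `./repo`):

```python
import subprocess

def build_in_sandbox(context_dir: str, tag: str = "macs-candidate") -> tuple[bool, str]:
    """Build the generated Dockerfile and return (success, combined logs)."""
    proc = subprocess.run(
        ["docker", "build", "-t", tag, context_dir],
        capture_output=True, text=True, timeout=600,
    )
    logs = proc.stdout + proc.stderr
    return proc.returncode == 0, logs

ok, logs = build_in_sandbox("./repo")
if not ok:
    # Failure logs are fed back to the agents for diff-based retries (next step)
    print(logs[-2000:])
```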
Refinement
- Failures trigger self-prompting and diff-based retries
- Agents update their plan and try again
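Conceptually, the refinement loop looks like the sketch below: the failure log is fed back, a revised artifact is proposed, and the build is retried a bounded number of times. `propose_fix` here is a placeholder for the relevant sub-agent call, not a real function in our codebase:

```python
def refine(dockerfile: str, build, propose_fix, max_attempts: int = 3) -> str | None:
    """Retry loop: build, and on failure ask an agent for a revised Dockerfile."""
    for attempt in range(max_attempts):
        ok, logs = build(dockerfile)
        if ok:
            return dockerfile
        # propose_fix stands in for the sub-agent that turns failure logs
        # into a revised artifact (e.g., via an LLM call producing a diff)
        dockerfile = propose_fix(current=dockerfile, failure_logs=logs)
    return None  # escalate to the Orchestrator after exhausting retries
```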
Transformation
- Once validated, Compose files are converted into bunnyshell.yml
- The environment is deployed on our infrastructure
- A live URL is returned
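The bunnyshell.yml format itself is out of scope for this post; as a rough illustration of the transformation step, here's a sketch that maps Compose services into a generic environment spec. The output keys are placeholders, not the actual schema:

```python
import yaml  # PyYAML

def compose_to_environment(compose_path: str) -> dict:
    """Illustrative mapping from docker-compose services to an environment spec.
    The output keys below are placeholders, not the real bunnyshell.yml schema."""
    with open(compose_path) as f:
        compose = yaml.safe_load(f)
    components = []
    for name, svc in compose.get("services", {}).items():
        build = svc.get("build")
        components.append({
            "name": name,
            "image": svc.get("image"),
            "build_context": build.get("context") if isinstance(build, dict) else build,
            "ports": svc.get("ports", []),
            "env": svc.get("environment", {}),
        })
    return {"environment": {"components": components}}

# Assumes a validated docker-compose.yml produced by the Executor
print(yaml.safe_dump(compose_to_environment("docker-compose.yml"), sort_keys=False))
```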
Memory & Execution Traces
Unlike simpler systems, we separate planning memory from execution memory:
- Planning Memory (Orchestrator): Tracks reasoning paths, subgoals, dependencies
- Execution Memory (Executor): Stores validated artifacts, performance metrics, diffs, logs
Only Executor memory is persisted across runs; this allows us to optimize for reuse and convergence without bloating the planning context.
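A simplified sketch of that separation, assuming a plain JSON file as the persistent store (the real store is more structured); the point is that only the Executor's side outlives a run:

```python
import json
from pathlib import Path

class PlanningMemory:
    """Held in the Orchestrator's context only; discarded when the run ends."""
    def __init__(self):
        self.subgoals, self.reasoning_trace = [], []

class ExecutionMemory:
    """Persisted across runs so validated artifacts and diffs can be reused."""
    def __init__(self, store_path: str = "execution_memory.json"):
        self.path = Path(store_path)
        self.records = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save_artifact(self, repo_fingerprint: str, artifact: dict) -> None:
        self.records[repo_fingerprint] = artifact
        self.path.write_text(json.dumps(self.records, indent=2))

    def lookup(self, repo_fingerprint: str) -> dict | None:
        return self.records.get(repo_fingerprint)
```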
Implementation Details
Models:
- Orchestrator: GPT-4.1 (high-context)
- Sub-agents: 3B–7B domain-tuned models

Runtime:
- Each agent runs in an ephemeral Docker container with CPU/RAM/network caps

Observability:
- Full token-level tracing of prompts, responses, API calls, and build logs
- Used for debugging, auditing, and improving agent behavior over time

Why Multi-Agent?
We could have built MACS as a single LLM chain, but this quickly broke down in practice. Here's why we went multi-agent:
- Parallelism: Analyzer and Researcher run concurrently to speed up discovery (see the sketch after this list)
- Modular reasoning: Each agent focuses on a narrow domain of expertise
- Error isolation: Build failures don't halt the planner; they trigger retries
- Reflexivity: Agents can revise their plans based on test results and diffs
- Reusability: Learned solutions are reused across similar projects
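For the parallelism point, a minimal sketch of running discovery agents concurrently with asyncio; `run_analyzer` and `run_researcher` are placeholders for the real agent invocations:

```python
import asyncio

async def run_analyzer(repo: str) -> dict:
    # Placeholder for the real Analyzer invocation
    await asyncio.sleep(0.1)
    return {"services": ["web", "worker", "redis"]}

async def run_researcher(stack_hint: str) -> dict:
    # Placeholder for the real Researcher invocation
    await asyncio.sleep(0.1)
    return {"patterns": ["multi-stage build", "non-root user"]}

async def discover(repo: str) -> dict:
    # Analyzer and Researcher proceed concurrently; discovery ends when both return
    analysis, research = await asyncio.gather(
        run_analyzer(repo),
        run_researcher("django+celery+redis"),
    )
    return {**analysis, **research}

print(asyncio.run(discover("https://github.com/example/repo")))
```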
What We’ve Learned
- Multi-agent debugging is hard: you need good observability, logs, and introspection tools.
- Robustness beats optimality: our system favors "works for 95%" over exotic edge-case perfection.
- Emergent behavior happens: some of the most efficient retry paths were not explicitly coded.
- Boundaries matter: defining clean interfaces (e.g., JSON messages) between agents pays off massively.
What’s Next
We’re expanding MACS with:
- Better multi-language support (polyglot repo inference)
- Orchestrator collaboration (multi-planner mode)
- Plugin SDKs for self-hosted agents and agent fine-tuning
Our north star: a fully autonomous DevOps layer, where developers focus only on code — and the system handles the rest.
Want to try it?
Just paste your repo, and Hopx by Bunnyshell instantly turns it into production-ready containers.