How Prathyusha Ramanjaneyulu at Celonis Eliminated Dev Environment Overhead for 600 Engineers

#kubernetes #productivity #devops #docker

I recently spoke with Prathyusha Ramanjaneyulu, a Senior Software Engineer at Celonis. She owns the local development component of the organization's internal developer tooling and is responsible for maintaining high velocity for over 600 developers. The conversation was about a problem that comes up at nearly every large engineering org: how do you give hundreds of developers a way to test their code against a real environment without slowing them down with CI and staging deployments? She described environments that took weeks to set up, 3+ hours a week lost to VM maintenance, and teams blocked whenever someone pushed a bad change. This is a recap of that conversation. I'll share the approaches she tried, the issues she ran into with each, and how mirrord finally solved the problem.

The problems she described are ones I think many engineers at large companies will relate to and can learn from.

The Nature of the Bottleneck at Scale

The engineering org she works with has around 600 developers building microservices-based applications deployed on Kubernetes. The nature of that setup means that when an engineer is working on a single service, their code doesn't run in isolation; it depends on a web of other services, databases, and queues. Testing even a small change means the service needs to interact with those dependencies in a way that reflects what actually happens in production.

What she had tried

Before landing on a solution that worked, she went through two approaches.

Approach 1: Run everything locally

Every engineer would clone, compile, and run all the services their code depended on locally, which in an org of that size meant a lot of repos and a lot of compute. And even after spending the time to do all that, the machines were so resource-constrained while running all of it that engineers couldn't do much else at the same time. And because each engineer was running services they hadn't written, they'd inevitably run into issues they had no context to debug. This led to them spending time on problems that had nothing to do with the code they were actually trying to test.

Approach 2: Shared VM + some local services

The second approach moved some services to a shared virtual machine (VM) while keeping others running locally. That got the initial environment setup time down from the weeks required by the all-local approach to roughly a day and a half, which was meaningful progress. But it introduced a different set of problems. Engineers were spending 3+ hours a week on VM maintenance and things like debugging connectivity issues. This was time that had nothing to do with writing code. And because the VM was shared, when one engineer's change broke something, everyone working against that environment was blocked until it was resolved. The bottleneck had moved, but it hadn't gone away.

The Selection Criteria for Scaling Local Development

As the owner of the Local Development component, Prathyusha is tasked with maintaining high engineering velocity for 600 developers. She mentioned that her team keeps a close watch on the cloud-native ecosystem, and it was through that ongoing evaluation that they identified both Telepresence and mirrord as potential solutions to their bottlenecks.

Both tools solve the same core problem: connecting a local development environment to a remote Kubernetes cluster. Here's the reasoning she gave me for landing on mirrord.

Philosophy. Telepresence's model is "your workstation becomes part of the cluster network." mirrord's model is "your local process impersonates a process inside a pod." With Telepresence, a broader portion of your machine inherits cluster connectivity and behavior. With mirrord, only the specific process being run is affected. That narrower scope reduces blast radius and configuration complexity.
Operational overhead. Telepresence requires managing daemon processes and per-user configuration across the team, which adds ongoing maintenance work for the platform team. mirrord is plug-and-play. Engineers pick it up and use it without her having to maintain it.
Security. Telepresence relies on Preview URLs, which are public endpoints, to enable collaboration. mirrord operates entirely within the private network, so there's no risk of accidentally exposing data through a public endpoint.
Resource footprint. mirrord only consumes cluster resources during an active debug session. Nothing persists between sessions, which keeps the staging cluster clean and means the platform team isn't maintaining persistent dev-tooling components on top of everything else.

What changed

When Prathyusha walked me through the impact, I asked for the numbers first. She mentioned that with the all-local approach, environment setup took weeks per engineer. With the shared VM, that came down to about a day and a half. With mirrord, it takes around 15 minutes. That's roughly a 144x speedup over the shared VM approach if you count a day and a half as 36 hours. She also mentioned that the time between an engineer writing code and knowing whether it works went from an hour down to seconds.
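For readers who want to check that ratio, here is a minimal sketch of the arithmetic. The 8-hour working day comparison is my own assumption, not a number from the conversation.

```python
# Setup-time speedup from the shared VM (about 1.5 days) to mirrord (about 15 minutes).
MIRRORD_SETUP_MINUTES = 15


def speedup(days: float, hours_per_day: float) -> float:
    """Ratio of the old setup time to the mirrord setup time."""
    old_minutes = days * hours_per_day * 60
    return old_minutes / MIRRORD_SETUP_MINUTES


calendar_speedup = speedup(1.5, 24)  # 1.5 days counted as 36 clock hours
working_speedup = speedup(1.5, 8)    # assumption: 8-hour working days

print(f"clock time: {calendar_speedup:.0f}x")
print(f"working time: {working_speedup:.0f}x")
```

Either way the gap is large; the exact multiple depends on how you count a "day".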

But beyond the speed, she pointed to the elimination of an entire category of problem. With mirrord, testing locally became equivalent to testing in staging, because engineers connect their local process directly to the real cluster, with real dependencies. That closed the environment gap that caused the "it works on my machine but not in staging" problem. She described it as a shift in engineers' confidence: if it works with mirrord, they don't have to wonder whether it'll work in staging.

Sharing the staging cluster through mirrord also solved the blocking problem that plagued the shared VM approach. With the VM, one engineer's bad change could bring the environment down for everyone. With mirrord, any number of engineers can test against the same staging cluster at the same time without affecting each other. mirrord handles the isolation through features like HTTP traffic filtering, queue splitting, and DB branching, which ensure each engineer's test traffic and changes don't affect other people using the environment.
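To make the HTTP traffic filtering idea concrete, here is a minimal conceptual sketch of header-based routing. It is not mirrord's implementation or configuration format; the `x-dev-session` header, the `Request` class, and the destination names are all hypothetical, chosen only to show how tagged requests could reach one engineer's local process while everything else goes to the deployed service.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    path: str
    headers: dict[str, str] = field(default_factory=dict)


def route(request: Request, sessions: dict[str, str],
          header: str = "x-dev-session") -> str:
    """Pick a destination for a request.

    `sessions` maps a header value (one per engineer) to that engineer's
    local process. Requests with no header, or with a value nobody has
    claimed, fall through to the deployed service.
    """
    owner = request.headers.get(header)
    return sessions.get(owner, "staging-pod")


# Hypothetical active sessions for two engineers.
sessions = {"alice": "alice-laptop", "bob": "bob-laptop"}

tagged = route(Request("/orders", {"x-dev-session": "alice"}), sessions)
untagged = route(Request("/orders"), sessions)
print(tagged, untagged)
```

The point of the design is that untagged traffic never changes course, so one engineer's experiment stays invisible to everyone else on the cluster.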

What all of this added up to was engineers getting their time back.

No more 3+ hours a week wasted on VM maintenance and configuration.
No debugging of services they didn't write.
No waiting on the dev tools team to fix a broken shared environment.

To put the capacity impact in perspective: if mirrord saves each engineer just one hour per week, at 600 engineers that's 600 hours reclaimed every week. Over a year of roughly 50 working weeks, that's about 30,000 hours of engineering capacity redirected from infrastructure maintenance to product work.
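The estimate is easy to rerun with your own numbers. This is a back-of-the-envelope sketch; the 50 working weeks per year is an assumption on my part, and the one hour per week is the illustrative figure from above rather than a measured saving.

```python
def reclaimed_hours(engineers: int, hours_per_week: float, weeks_per_year: int) -> float:
    """Total engineering hours reclaimed per year across the org."""
    return engineers * hours_per_week * weeks_per_year


weekly = reclaimed_hours(600, 1, 1)       # hours reclaimed in a single week
yearly_50 = reclaimed_hours(600, 1, 50)   # assumption: 50 working weeks
yearly_52 = reclaimed_hours(600, 1, 52)   # full calendar year, for comparison

print(weekly, yearly_50, yearly_52)
```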

Closing remarks

The thing that struck me most from talking with Prathyusha was how much of the improvement came from changing the model, not the tooling. At this scale, even small amounts of per-engineer overhead compound into a significant cost for the organization as a whole. But the solution wasn't faster local tooling or more compute. Instead of every engineer maintaining their own isolated environment, everyone shares one staging cluster that the DevOps team maintains.

Top comments (1)

arun rajkumar • Jul 2

Great writeup, and the closing line is the real lesson: they changed the model, not the tooling. We hit the identical wall with our payment services and went the opposite way from mirrord. Instead of sharing one staging cluster, we invested in making the whole stack runnable locally in about five minutes: a shared env schema, Traefik for routing, mocked AWS services. Both roads close the "works on my machine" gap, and the tradeoff is real-dependencies-but-shared-blast-radius versus fully-isolated-but-you-maintain-the-mocks. Curious where Prathyusha's team landed on data: did connecting local processes to the real cluster ever cause test-data collisions between engineers, or did the queue splitting and DB branching fully handle that?