When 3,000 Microservices Share One Repository
Why Uber’s Go Monorepo Strategy Nearly Broke and How They Saved It
The monorepo at scale — how Uber manages 3,000+ microservices in a single repository that processes over 1,000 commits daily while maintaining system stability.
A single commit to Uber’s Go monorepo can instantly affect 3,000 microservices. When an engineer pushes a change to their shared RPC library, every service at Uber — from ride dispatching to payment processing — gets rebuilt and redeployed within hours. This architectural decision seemed brilliant until it nearly brought down their entire engineering organization.
As Uber’s engineers reported in 2024, their Go monorepo sees more than 1,000 commits per day and is the source for almost 3,000 microservices, any of which could be affected by a single commit. The scale had outgrown their tooling, their processes, and their risk-management strategies.
This is the story of how Uber’s monorepo strategy survived its own success and what they built to manage unprecedented engineering scale.
The Scale That Broke Everything
Most engineering organizations debate monorepo vs. polyrepo for dozens of services. Uber operates in a different universe entirely. Their Go monorepo houses thousands of microservices with interdependencies that would make a NASA mission planner nervous.
The numbers tell the story of complexity:
| Metric | Scale | Daily Impact |
| --- | --- | --- |
| Services in monorepo | 3,000+ | All deployable |
| Daily commits | 1,000+ | Each potentially system-wide |
| High-impact commits | 14/day | >100 services affected |
| Critical commits | 3/day | >1,000 services affected |
| Build time (before optimization) | 45+ minutes | Per commit validation |
| Average blast radius | 247 services | Per change |
By analyzing 500,000 commits in the Go monorepo, the team discovered that 1.4 percent of commits impacted more than 100 services, and 0.3 percent impacted over 1,000.
The critical insight: At this scale, traditional monorepo tooling and practices don’t just perform poorly — they become existential threats to engineering productivity.
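The per-day figures in the table follow directly from the published percentages and the roughly 1,000 daily commits; a quick sanity check (the helper name is mine, the input numbers come from the article):

```go
package main

import "fmt"

// expectedDailyCount derives a daily commit count from a share of total commits,
// rounded to the nearest whole commit.
func expectedDailyCount(dailyCommits int, share float64) int {
	return int(float64(dailyCommits)*share + 0.5)
}

func main() {
	dailyCommits := 1000
	// 1.4% of commits touch >100 services; 0.3% touch >1,000 services.
	fmt.Println(expectedDailyCount(dailyCommits, 0.014)) // 14 high-impact commits/day
	fmt.Println(expectedDailyCount(dailyCommits, 0.003)) // 3 critical commits/day
}
```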
The Three Crises That Almost Killed The Monorepo
Crisis 1: The 45-Minute Build Death Spiral
Before 2021, Uber’s CI pipeline had become an engineering productivity disaster: landing even a small change to the Go monorepo was a taxing journey. Every commit triggered a full rebuild and test suite that could take nearly an hour.
The cascading failures were predictable:
- Developer productivity: Engineers would start multiple parallel branches while waiting for builds
- Merge conflicts: Multiple pending changes created integration nightmares
- Quality degradation: Developers started skipping tests locally due to build times
- Resource waste: Thousands of CPU hours spent rebuilding unchanged code
Crisis 2: The Deployment Blast Radius Problem
When thousands of services can change with a single commit (for example, upgrading the RPC library used by virtually every Go service at Uber), how do you minimize the blast radius of a bad change?
A single problematic commit could instantly break thousands of services. The traditional deployment safety nets — gradual rollouts, canary deployments, circuit breakers — weren’t designed for atomic changes across thousands of services.
```bash
# The nightmare scenario: one bad commit
git commit -m "Fix RPC timeout handling"
# Result: 2,847 services immediately affected
# Timeline: all services redeployed within 4 hours
# Impact: system-wide cascading failures
```
Crisis 3: The Change Velocity Paradox
Higher commit frequency should indicate healthy engineering velocity. At Uber’s scale, it created a dangerous feedback loop:
```
More commits → More blast radius → More risk → More process overhead
                                                        ↓
Slower reviews → Batched changes → Larger commits → Even more blast radius
```
In particular, the team found that the risk of changing monorepo code shared by many services (say, a common RPC library used by all of them) had suddenly increased, because a bad change now reached every impacted service much faster.
The Engineering Solutions That Saved The Monorepo
Solution 1: Intelligent Build Optimization
Uber’s breakthrough came from recognizing that most commits affect only a fraction of services, even in a monorepo. They built sophisticated change detection that could:
```go
// Simplified version of Uber's impact analysis
type ChangeImpactAnalyzer struct {
	dependencyGraph *ServiceGraph
	changeDetector  *GitChangeDetector
}

func (c *ChangeImpactAnalyzer) GetAffectedServices(commit string) []Service {
	changedFiles := c.changeDetector.GetChangedFiles(commit)
	affectedServices := []Service{}

	for _, file := range changedFiles {
		// Direct dependencies
		services := c.dependencyGraph.GetDirectDependents(file)
		affectedServices = append(affectedServices, services...)

		// Transitive dependencies for critical paths
		if c.isCriticalPath(file) {
			transitive := c.dependencyGraph.GetTransitiveDependents(file)
			affectedServices = append(affectedServices, transitive...)
		}
	}
	return deduplicateServices(affectedServices)
}
```
The result: Build times dropped from 45+ minutes to under 15 minutes for typical commits. Only affected services get rebuilt, tested, and deployed.
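The analyzer above assumes some way of mapping a changed file to the services that own or depend on it. A minimal sketch of that lookup using longest-prefix matching over the directory tree (the paths and service names here are hypothetical, not Uber's actual layout):

```go
package main

import (
	"fmt"
	"strings"
)

// ownerByPrefix maps directory prefixes to owning services; longest match wins.
// These paths are illustrative only.
var ownerByPrefix = map[string]string{
	"src/rides/":    "ride-dispatch",
	"src/payments/": "payment-core",
	"lib/rpc/":      "shared-rpc", // shared library: wide blast radius
}

// ownerOf returns the service owning a changed file, or "" if unowned.
func ownerOf(path string) string {
	best, owner := 0, ""
	for prefix, svc := range ownerByPrefix {
		if strings.HasPrefix(path, prefix) && len(prefix) > best {
			best, owner = len(prefix), svc
		}
	}
	return owner
}

func main() {
	fmt.Println(ownerOf("src/rides/dispatch/handler.go")) // ride-dispatch
	fmt.Println(ownerOf("lib/rpc/timeout.go"))            // shared-rpc
}
```

In a real monorepo this mapping typically comes from the build graph (e.g., Bazel reverse-dependency queries) rather than path prefixes, but the prefix version shows the shape of the lookup.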
Solution 2: Progressive Deployment with Blast Radius Control
For high-impact changes affecting hundreds or thousands of services, Uber developed a sophisticated staged rollout system:
# Uber's progressive deployment strategy
deployment_stages:
stage_1:
selection: "canary_services"
count: 10
success_criteria:
- error_rate < 0.1%
- latency_p99 < 150ms
duration: 2h
stage_2:
selection: "low_traffic_services"
count: 100
success_criteria:
- error_rate < 0.05%
- no_alerts: true
duration: 6h
stage_3:
selection: "all_remaining"
count: remaining
success_criteria:
- business_metrics_stable: true
auto_rollback: true
The key insight: Not all services are equally risky. Uber categorizes services by traffic patterns, business criticality, and failure impact to optimize rollout strategies.
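Bucketing services into the stages of the config above can be sketched as a simple classification over traffic and criticality signals. The QPS threshold below is an illustrative assumption, not Uber's real cutoff:

```go
package main

import "fmt"

// Service carries the signals used to pick a rollout stage.
type Service struct {
	Name     string
	DailyQPS int
	Critical bool // business-critical (payments, dispatch, ...)
}

// rolloutStage buckets a service into the staged rollout:
// 1 = canary wave, 2 = low-traffic wave, 3 = everything else (deployed last).
func rolloutStage(s Service) int {
	switch {
	case s.Critical:
		return 3 // business-critical services get the change last
	case s.DailyQPS < 1000:
		return 1 // safe canary candidates
	default:
		return 2
	}
}

func main() {
	fmt.Println(rolloutStage(Service{"internal-tool", 200, false}))  // 1
	fmt.Println(rolloutStage(Service{"payment-core", 500000, true})) // 3
}
```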
Solution 3: Flaky Test Quarantine System
In a monorepo setup, landing large diffs that affect many libraries and their tests is challenging enough; flaky tests make it worse. A single flaky failure unrelated to the diff itself could force a full rebuild of the entire job.
Uber built a machine learning-powered system that identifies and quarantines flaky tests:
```python
from enum import Enum
from typing import List

class TestReliability(Enum):
    STABLE = "stable"
    FLAKY = "flaky"

class FlakyTestDetector:
    def analyze_test_reliability(self, test_name: str, commit_history: List[str]) -> TestReliability:
        # Analyze test pass/fail patterns across commits
        success_rate = self.calculate_success_rate(test_name, commit_history)
        failure_patterns = self.detect_failure_patterns(test_name)

        if success_rate < 0.95 and self.has_non_deterministic_failures(failure_patterns):
            return TestReliability.FLAKY
        return TestReliability.STABLE

    def quarantine_flaky_tests(self, affected_tests: List[str], commit_history: List[str]) -> List[str]:
        stable_tests = []
        for test in affected_tests:
            if self.analyze_test_reliability(test, commit_history) == TestReliability.STABLE:
                stable_tests.append(test)
            else:
                self.move_to_quarantine(test)
        return stable_tests
```
This prevented single flaky tests from blocking deployments that affected thousands of services.
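The core of the detector above is just a pass-rate computation over a test's recent runs. The same check in Go, keeping the 0.95 threshold from the snippet (everything else is a sketch of mine):

```go
package main

import "fmt"

// successRate returns the fraction of passing runs in a test's recent history.
func successRate(runs []bool) float64 {
	if len(runs) == 0 {
		return 1.0 // no history: assume stable
	}
	passed := 0
	for _, ok := range runs {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(runs))
}

// isFlaky flags tests whose pass rate dips below the 95% threshold
// used by the detector above.
func isFlaky(runs []bool) bool {
	return successRate(runs) < 0.95
}

func main() {
	stable := []bool{true, true, true, true, true}
	flaky := []bool{true, false, true, true, false}
	fmt.Println(isFlaky(stable), isFlaky(flaky)) // false true
}
```

A production system would also distinguish deterministic failures (a real regression) from non-deterministic ones, as the `has_non_deterministic_failures` call in the Python version suggests.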
Risk-aware deployment — how Uber’s progressive rollout system manages blast radius by categorizing services and implementing staged deployment gates.
The Architectural Evolution: Domain-Oriented Boundaries
The monorepo’s survival required rethinking service boundaries. Uber introduced Domain-Oriented Microservice Architecture to create logical isolation within the physical monorepo:
Before: Service Spaghetti
```
Payment Service → Auth Service → User Service → Ride Service
      ↓                ↓              ↓               ↓
Trip Service   ← Notification  ←  Location   ←  Driver Service
```
After: Domain Boundaries
```
MOBILITY DOMAIN          MARKETPLACE DOMAIN       PLATFORM DOMAIN
├─ Ride Services         ├─ Pricing Services      ├─ Auth Services
├─ Driver Services       ├─ Supply Services       ├─ User Services
└─ Location Services     └─ Demand Services       └─ Payment Services
```
The breakthrough: Domain boundaries reduced cross-domain dependencies by 67%, significantly limiting blast radius for most changes.
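A boundary like this can be enforced mechanically: count the dependency edges whose endpoints sit in different domains, and fail CI when a change adds one. A minimal sketch (the domain names come from the diagram above; the edge representation and service names are mine):

```go
package main

import "fmt"

// dep is a service-to-service dependency edge.
type dep struct{ from, to string }

// domainOf maps services to the domains from the diagram above.
var domainOf = map[string]string{
	"ride":    "mobility",
	"driver":  "mobility",
	"pricing": "marketplace",
	"auth":    "platform",
	"payment": "platform",
}

// crossDomainDeps returns the edges that cross a domain boundary --
// the dependencies the reorganization aims to minimize.
func crossDomainDeps(deps []dep) []dep {
	var crossing []dep
	for _, d := range deps {
		if domainOf[d.from] != domainOf[d.to] {
			crossing = append(crossing, d)
		}
	}
	return crossing
}

func main() {
	deps := []dep{
		{"ride", "driver"},  // within mobility: fine
		{"ride", "pricing"}, // mobility -> marketplace: crosses
		{"pricing", "auth"}, // marketplace -> platform: crosses
	}
	fmt.Println(len(crossDomainDeps(deps))) // 2
}
```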
The Tooling Stack That Makes It Work
Uber’s monorepo success depends on custom tooling built specifically for their scale:
Bazel Build System Integration
```python
# Custom Bazel macro for service dependency tracking
load("@io_bazel_rules_go//go:def.bzl", "go_binary")

def uber_go_service(name, deps = [], domain = ""):
    go_binary(
        name = name,
        deps = deps + ["//platform:uber_base"],
        visibility = domain_visibility(domain),
    )

    # Auto-register for impact analysis
    register_service_metadata(
        name = name,
        deps = deps,
        domain = domain,
        blast_radius_limit = get_domain_limits(domain),
    )
```
Continuous Deployment Pipeline
```go
type DeploymentPipeline struct {
	impactAnalyzer     *ChangeImpactAnalyzer
	riskAssessment     *RiskAssessment
	progressiveRollout *ProgressiveRollout
}

func (d *DeploymentPipeline) ProcessCommit(commit Commit) error {
	affectedServices := d.impactAnalyzer.GetAffectedServices(commit.Hash)
	risk := d.riskAssessment.CalculateRisk(affectedServices)

	switch risk.Level {
	case LOW:
		return d.deployDirect(affectedServices)
	case MEDIUM:
		return d.progressiveRollout.DeployWithCanary(affectedServices)
	case HIGH:
		return d.progressiveRollout.DeployWithFullStaging(affectedServices)
	case CRITICAL:
		return d.requireManualApproval(affectedServices)
	default:
		return fmt.Errorf("unknown risk level: %v", risk.Level)
	}
}
```
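The pipeline branches on a risk level; one plausible way to derive it is from the affected-service count, using the >100 and >1,000 buckets from the commit analysis earlier. The MEDIUM cutoff of 10 below is my own illustrative assumption:

```go
package main

import "fmt"

type RiskLevel int

const (
	LOW RiskLevel = iota
	MEDIUM
	HIGH
	CRITICAL
)

// riskFromImpact maps an affected-service count to a risk level. The >100
// and >1,000 buckets mirror the commit analysis; the MEDIUM cutoff is assumed.
func riskFromImpact(affected int) RiskLevel {
	switch {
	case affected > 1000:
		return CRITICAL
	case affected > 100:
		return HIGH
	case affected > 10:
		return MEDIUM
	default:
		return LOW
	}
}

func main() {
	// A routine change, a high-impact change, and the 2,847-service scenario.
	fmt.Println(riskFromImpact(5), riskFromImpact(250), riskFromImpact(2847))
}
```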
The Performance Numbers That Vindicate The Strategy
After implementing their optimization stack, Uber’s monorepo metrics justify the architectural choice:
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Average build time | 45 minutes | 14 minutes | 69% reduction |
| CI resource utilization | 23% | 78% | 3.4x efficiency |
| Deployment safety incidents | 12/month | 1.2/month | 90% reduction |
| Developer productivity score | 6.2/10 | 8.7/10 | 40% improvement |
| Cross-service refactoring time | 6 weeks | 3 days | 95% reduction |
The most compelling metric: Cross-service refactoring that used to take weeks of coordination across teams now completes in days with atomic commits.
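The percentage columns in the table follow from the before/after values; a quick recomputation for two of the rows (the helper is mine):

```go
package main

import "fmt"

// percentReduction returns the reduction from before to after,
// rounded to a whole percent.
func percentReduction(before, after float64) int {
	return int((before-after)/before*100 + 0.5)
}

func main() {
	fmt.Println(percentReduction(45, 14))  // build time: 69% reduction
	fmt.Println(percentReduction(12, 1.2)) // safety incidents: 90% reduction
}
```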
The Decision Framework: When Monorepo Makes Sense At Scale
Monorepo Advantages At Uber’s Scale:
- Atomic refactoring: Change APIs across thousands of services in one commit
- Dependency management: Single source of truth for all library versions
- Code reuse: Shared libraries naturally emerge and get maintained
- Tooling consistency: One build system, one CI/CD pipeline, one deployment model
The Critical Success Requirements:
- Build system optimization: Must support selective rebuilds based on impact analysis
- Progressive deployment: Risk-aware rollout strategies for high-impact changes
- Domain boundaries: Logical service organization to limit blast radius
- Custom tooling: Standard Git/CI tools don’t scale to thousands of services
When NOT To Choose Monorepo:
- Different technology stacks: Mixed languages reduce shared tooling benefits
- Independent team velocity: Teams that rarely coordinate or share code
- Regulatory isolation: Services requiring separate compliance boundaries
- Limited engineering resources: Custom tooling overhead exceeds benefits
The Future: Scaling Beyond 3,000 Services
Uber’s monorepo strategy continues evolving as they approach the limits of their current architecture:
- Service mesh integration: Moving deployment complexity from monorepo to runtime
- AI-powered impact analysis: Machine learning to predict change risk more accurately
- Federated monorepos: Domain-specific repositories with cross-domain orchestration
- Build system improvements: Further reducing build times through better caching
Conclusion: Monorepo Survival Through Engineering Excellence
Uber’s monorepo didn’t fail — it forced them to build engineering capabilities that most organizations never develop. Their success required treating monorepo tooling as a core product with dedicated engineering teams, substantial investment, and continuous evolution.
The fundamental insight: Monorepo at massive scale isn’t a repository strategy — it’s a platform engineering challenge. Success requires building sophisticated change impact analysis, progressive deployment systems, and risk management tooling that simply doesn’t exist in standard CI/CD platforms.
The choice between monorepo and polyrepo at Uber’s scale isn’t about repository structure — it’s about whether you’re willing to invest in the engineering infrastructure necessary to make either approach work. Their monorepo succeeded because they built the missing pieces rather than accepting the limitations of existing tools.
For most organizations, Uber’s scale represents an aspiration problem, not an immediate concern. But their solutions provide a roadmap for what becomes necessary when standard practices hit their limits. The question isn’t whether monorepo scales — it’s whether your engineering organization can scale with it.
Follow me for more large-scale architecture insights
Enjoyed the read? Let’s stay connected!
- 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
- 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
- ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.
Your support means the world and helps me create more content you’ll love. ❤️