When 3,000 Microservices Share One Repository
Why Uber’s Go Monorepo Strategy Nearly Broke and How They Saved It
The monorepo at scale — how Uber manages 3,000+ microservices in a single repository that processes over 1,000 commits daily while maintaining system stability.
A single commit to Uber’s Go monorepo can instantly affect 3,000 microservices. When an engineer pushes a change to their shared RPC library, every service at Uber — from ride dispatching to payment processing — gets rebuilt and redeployed within hours. This architectural decision seemed brilliant until it nearly brought down their entire engineering organization.
As Uber’s engineers reported in 2024, their Go monorepo sees more than 1,000 commits per day and is the source for almost 3,000 microservices, any of which could be affected by a single commit. The scale had outgrown their tooling, their processes, and their risk-management strategies.
This is the story of how Uber’s monorepo strategy survived its own success and what they built to manage unprecedented engineering scale.
The Scale That Broke Everything
Most engineering organizations debate monorepo vs. polyrepo for dozens of services. Uber operates in a different universe entirely. Their Go monorepo houses thousands of microservices with interdependencies that would make a NASA mission planner nervous.
The numbers tell the story of complexity:
| Metric | Scale | Daily Impact |
| --- | --- | --- |
| Services in monorepo | 3,000+ | All deployable |
| Daily commits | 1,000+ | Each potentially system-wide |
| High-impact commits | 14/day | >100 services affected |
| Critical commits | 3/day | >1,000 services affected |
| Build time (before optimization) | 45+ minutes | Per commit validation |
| Average blast radius | 247 services | Per change |
By analyzing 500,000 commits in the Go monorepo, the team discovered that 1.4 percent of commits impacted more than 100 services, and 0.3 percent impacted over 1,000.
The critical insight: At this scale, traditional monorepo tooling and practices don’t just perform poorly — they become existential threats to engineering productivity.
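The per-day figures in the table follow directly from the published percentages and the roughly 1,000 daily commits; a quick sanity check (the helper name is mine, the input numbers come from the article):

```go
package main

import "fmt"

// expectedDailyCount derives a daily commit count from a share of total commits,
// rounded to the nearest whole commit.
func expectedDailyCount(dailyCommits int, share float64) int {
	return int(float64(dailyCommits)*share + 0.5)
}

func main() {
	dailyCommits := 1000
	// 1.4% of commits touch >100 services; 0.3% touch >1,000 services.
	fmt.Println(expectedDailyCount(dailyCommits, 0.014)) // 14 high-impact commits/day
	fmt.Println(expectedDailyCount(dailyCommits, 0.003)) // 3 critical commits/day
}
```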
The Three Crises That Almost Killed The Monorepo
Crisis 1: The 45-Minute Build Death Spiral
Before 2021, Uber’s CI pipeline had become an engineering productivity disaster: landing even a small change to the Go monorepo was a taxing journey. Every commit triggered a full rebuild and test suite that could take nearly an hour.
The cascading failures were predictable:
- Developer productivity: Engineers would start multiple parallel branches while waiting for builds
- Merge conflicts: Multiple pending changes created integration nightmares
- Quality degradation: Developers started skipping tests locally due to build times
- Resource waste: Thousands of CPU hours spent rebuilding unchanged code
Crisis 2: The Deployment Blast Radius Problem
When thousands of services can change with a single commit (for example, upgrading the RPC library used by virtually every Go service at Uber), how do you minimize the blast radius of a bad change?
A single problematic commit could instantly break thousands of services. The traditional deployment safety nets — gradual rollouts, canary deployments, circuit breakers — weren’t designed for atomic changes across thousands of services.
```bash
# The nightmare scenario: one bad commit
git commit -m "Fix RPC timeout handling"
# Result: 2,847 services immediately affected
# Timeline: all services redeployed within 4 hours
# Impact: system-wide cascading failures
```
Crisis 3: The Change Velocity Paradox
Higher commit frequency should indicate healthy engineering velocity. At Uber’s scale, it created a dangerous feedback loop:
```
More commits → More blast radius → More risk → More process overhead
                                                        ↓
Slower reviews → Batched changes → Larger commits → Even more blast radius
```
In particular, the team found that the risk of changing monorepo code shared by many services (say, a common RPC library used by all of them) had suddenly increased, because a bad change now reached every impacted service much faster.
The Engineering Solutions That Saved The Monorepo
Solution 1: Intelligent Build Optimization
Uber’s breakthrough came from recognizing that most commits affect only a fraction of services, even in a monorepo. They built sophisticated change detection that could:
```go
// Simplified version of Uber's impact analysis
type ChangeImpactAnalyzer struct {
	dependencyGraph *ServiceGraph
	changeDetector  *GitChangeDetector
}

func (c *ChangeImpactAnalyzer) GetAffectedServices(commit string) []Service {
	changedFiles := c.changeDetector.GetChangedFiles(commit)
	affectedServices := []Service{}

	for _, file := range changedFiles {
		// Direct dependencies
		services := c.dependencyGraph.GetDirectDependents(file)
		affectedServices = append(affectedServices, services...)

		// Transitive dependencies for critical paths
		if c.isCriticalPath(file) {
			transitive := c.dependencyGraph.GetTransitiveDependents(file)
			affectedServices = append(affectedServices, transitive...)
		}
	}
	return deduplicateServices(affectedServices)
}
```
The result: Build times dropped from 45+ minutes to under 15 minutes for typical commits. Only affected services get rebuilt, tested, and deployed.
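The analyzer above assumes some way of mapping a changed file to the services that own or depend on it. A minimal sketch of that lookup using longest-prefix matching over the directory tree (the paths and service names here are hypothetical, not Uber's actual layout):

```go
package main

import (
	"fmt"
	"strings"
)

// ownerByPrefix maps directory prefixes to owning services; longest match wins.
// These paths are illustrative only.
var ownerByPrefix = map[string]string{
	"src/rides/":    "ride-dispatch",
	"src/payments/": "payment-core",
	"lib/rpc/":      "shared-rpc", // shared library: wide blast radius
}

// ownerOf returns the service owning a changed file, or "" if unowned.
func ownerOf(path string) string {
	best, owner := 0, ""
	for prefix, svc := range ownerByPrefix {
		if strings.HasPrefix(path, prefix) && len(prefix) > best {
			best, owner = len(prefix), svc
		}
	}
	return owner
}

func main() {
	fmt.Println(ownerOf("src/rides/dispatch/handler.go")) // ride-dispatch
	fmt.Println(ownerOf("lib/rpc/timeout.go"))            // shared-rpc
}
```

In a real monorepo this mapping typically comes from the build graph (e.g., Bazel reverse-dependency queries) rather than path prefixes, but the prefix version shows the shape of the lookup.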
Solution 2: Progressive Deployment with Blast Radius Control
For high-impact changes affecting hundreds or thousands of services, Uber developed a sophisticated staged rollout system:
# Uber's progressive deployment strategy
deployment_stages:
stage_1:
selection: "canary_services"
count: 10
success_criteria:
- error_rate < 0.1%
- latency_p99 < 150ms
duration: 2h
stage_2:
selection: "low_traffic_services"
count: 100
success_criteria:
- error_rate < 0.05%
- no_alerts: true
duration: 6h
stage_3:
selection: "all_remaining"
count: remaining
success_criteria:
- business_metrics_stable: true
auto_rollback: true
The key insight: Not all services are equally risky. Uber categorizes services by traffic patterns, business criticality, and failure impact to optimize rollout strategies.
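Bucketing services into the stages of the config above can be sketched as a simple classification over traffic and criticality signals. The QPS threshold below is an illustrative assumption, not Uber's real cutoff:

```go
package main

import "fmt"

// Service carries the signals used to pick a rollout stage.
type Service struct {
	Name     string
	DailyQPS int
	Critical bool // business-critical (payments, dispatch, ...)
}

// rolloutStage buckets a service into the staged rollout:
// 1 = canary wave, 2 = low-traffic wave, 3 = everything else (deployed last).
func rolloutStage(s Service) int {
	switch {
	case s.Critical:
		return 3 // business-critical services get the change last
	case s.DailyQPS < 1000:
		return 1 // safe canary candidates
	default:
		return 2
	}
}

func main() {
	fmt.Println(rolloutStage(Service{"internal-tool", 200, false}))  // 1
	fmt.Println(rolloutStage(Service{"payment-core", 500000, true})) // 3
}
```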
Solution 3: Flaky Test Quarantine System
In a monorepo setup, landing large diffs that affect many libraries and their tests is challenging enough; flaky tests make it worse. A single flaky failure unrelated to the diff itself could force a full rebuild of the entire job.
Uber built a machine learning-powered system that identifies and quarantines flaky tests:
```python
from enum import Enum
from typing import List

class TestReliability(Enum):
    STABLE = "stable"
    FLAKY = "flaky"

class FlakyTestDetector:
    def analyze_test_reliability(self, test_name: str, commit_history: List[str]) -> TestReliability:
        # Analyze test pass/fail patterns across commits
        success_rate = self.calculate_success_rate(test_name, commit_history)
        failure_patterns = self.detect_failure_patterns(test_name)

        if success_rate < 0.95 and self.has_non_deterministic_failures(failure_patterns):
            return TestReliability.FLAKY
        return TestReliability.STABLE

    def quarantine_flaky_tests(self, affected_tests: List[str], commit_history: List[str]) -> List[str]:
        stable_tests = []
        for test in affected_tests:
            if self.analyze_test_reliability(test, commit_history) == TestReliability.STABLE:
                stable_tests.append(test)
            else:
                self.move_to_quarantine(test)
        return stable_tests
```
This prevented single flaky tests from blocking deployments that affected thousands of services.
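The core of the detector above is just a pass-rate computation over a test's recent runs. The same check in Go, keeping the 0.95 threshold from the snippet (everything else is a sketch of mine):

```go
package main

import "fmt"

// successRate returns the fraction of passing runs in a test's recent history.
func successRate(runs []bool) float64 {
	if len(runs) == 0 {
		return 1.0 // no history: assume stable
	}
	passed := 0
	for _, ok := range runs {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(runs))
}

// isFlaky flags tests whose pass rate dips below the 95% threshold
// used by the detector above.
func isFlaky(runs []bool) bool {
	return successRate(runs) < 0.95
}

func main() {
	stable := []bool{true, true, true, true, true}
	flaky := []bool{true, false, true, true, false}
	fmt.Println(isFlaky(stable), isFlaky(flaky)) // false true
}
```

A production system would also distinguish deterministic failures (a real regression) from non-deterministic ones, as the `has_non_deterministic_failures` call in the Python version suggests.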
Risk-aware deployment — how Uber’s progressive rollout system manages blast radius by categorizing services and implementing staged deployment gates.
The Architectural Evolution: Domain-Oriented Boundaries
The monorepo’s survival required rethinking service boundaries. Uber introduced Domain-Oriented Microservice Architecture to create logical isolation within the physical monorepo:
Before: Service Spaghetti
```
Payment Service → Auth Service → User Service → Ride Service
      ↓                ↓              ↓               ↓
Trip Service   ← Notification  ←  Location   ←  Driver Service
```
After: Domain Boundaries
```
MOBILITY DOMAIN          MARKETPLACE DOMAIN       PLATFORM DOMAIN
├─ Ride Services         ├─ Pricing Services      ├─ Auth Services
├─ Driver Services       ├─ Supply Services       ├─ User Services
└─ Location Services     └─ Demand Services       └─ Payment Services
```
The breakthrough: Domain boundaries reduced cross-domain dependencies by 67%, significantly limiting blast radius for most changes.
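A boundary like this can be enforced mechanically: count the dependency edges whose endpoints sit in different domains, and fail CI when a change adds one. A minimal sketch (the domain names come from the diagram above; the edge representation and service names are mine):

```go
package main

import "fmt"

// dep is a service-to-service dependency edge.
type dep struct{ from, to string }

// domainOf maps services to the domains from the diagram above.
var domainOf = map[string]string{
	"ride":    "mobility",
	"driver":  "mobility",
	"pricing": "marketplace",
	"auth":    "platform",
	"payment": "platform",
}

// crossDomainDeps returns the edges that cross a domain boundary --
// the dependencies the reorganization aims to minimize.
func crossDomainDeps(deps []dep) []dep {
	var crossing []dep
	for _, d := range deps {
		if domainOf[d.from] != domainOf[d.to] {
			crossing = append(crossing, d)
		}
	}
	return crossing
}

func main() {
	deps := []dep{
		{"ride", "driver"},  // within mobility: fine
		{"ride", "pricing"}, // mobility -> marketplace: crosses
		{"pricing", "auth"}, // marketplace -> platform: crosses
	}
	fmt.Println(len(crossDomainDeps(deps))) // 2
}
```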
The Tooling Stack That Makes It Work
Uber’s monorepo success depends on custom tooling built specifically for their scale:
Bazel Build System Integration
```python
# Custom Bazel macro for service dependency tracking
load("@io_bazel_rules_go//go:def.bzl", "go_binary")

def uber_go_service(name, deps = [], domain = ""):
    go_binary(
        name = name,
        deps = deps + ["//platform:uber_base"],
        visibility = domain_visibility(domain),
    )

    # Auto-register for impact analysis
    register_service_metadata(
        name = name,
        deps = deps,
        domain = domain,
        blast_radius_limit = get_domain_limits(domain),
    )
```
Continuous Deployment Pipeline
```go
type DeploymentPipeline struct {
	impactAnalyzer     *ChangeImpactAnalyzer
	riskAssessment     *RiskAssessment
	progressiveRollout *ProgressiveRollout
}

func (d *DeploymentPipeline) ProcessCommit(commit Commit) error {
	affectedServices := d.impactAnalyzer.GetAffectedServices(commit.Hash)
	risk := d.riskAssessment.CalculateRisk(affectedServices)

	switch risk.Level {
	case LOW:
		return d.deployDirect(affectedServices)
	case MEDIUM:
		return d.progressiveRollout.DeployWithCanary(affectedServices)
	case HIGH:
		return d.progressiveRollout.DeployWithFullStaging(affectedServices)
	case CRITICAL:
		return d.requireManualApproval(affectedServices)
	default:
		return fmt.Errorf("unknown risk level: %v", risk.Level)
	}
}
```
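The pipeline branches on a risk level; one plausible way to derive it is from the affected-service count, using the >100 and >1,000 buckets from the commit analysis earlier. The MEDIUM cutoff of 10 below is my own illustrative assumption:

```go
package main

import "fmt"

type RiskLevel int

const (
	LOW RiskLevel = iota
	MEDIUM
	HIGH
	CRITICAL
)

// riskFromImpact maps an affected-service count to a risk level. The >100
// and >1,000 buckets mirror the commit analysis; the MEDIUM cutoff is assumed.
func riskFromImpact(affected int) RiskLevel {
	switch {
	case affected > 1000:
		return CRITICAL
	case affected > 100:
		return HIGH
	case affected > 10:
		return MEDIUM
	default:
		return LOW
	}
}

func main() {
	// A routine change, a high-impact change, and the 2,847-service scenario.
	fmt.Println(riskFromImpact(5), riskFromImpact(250), riskFromImpact(2847))
}
```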
The Performance Numbers That Vindicate The Strategy
After implementing their optimization stack, Uber’s monorepo metrics justify the architectural choice:
| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Average build time | 45 minutes | 14 minutes | 69% reduction |
| CI resource utilization | 23% | 78% | 3.4x efficiency |
| Deployment safety incidents | 12/month | 1.2/month | 90% reduction |
| Developer productivity score | 6.2/10 | 8.7/10 | 40% improvement |
| Cross-service refactoring time | 6 weeks | 3 days | 95% reduction |
The most compelling metric: Cross-service refactoring that used to take weeks of coordination across teams now completes in days with atomic commits.
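The percentage columns in the table follow from the before/after values; a quick recomputation for two of the rows (the helper is mine):

```go
package main

import "fmt"

// percentReduction returns the reduction from before to after,
// rounded to a whole percent.
func percentReduction(before, after float64) int {
	return int((before-after)/before*100 + 0.5)
}

func main() {
	fmt.Println(percentReduction(45, 14))  // build time: 69% reduction
	fmt.Println(percentReduction(12, 1.2)) // safety incidents: 90% reduction
}
```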
The Decision Framework: When Monorepo Makes Sense At Scale
Monorepo Advantages At Uber’s Scale:
- Atomic refactoring: Change APIs across thousands of services in one commit
- Dependency management: Single source of truth for all library versions
- Code reuse: Shared libraries naturally emerge and get maintained
- Tooling consistency: One build system, one CI/CD pipeline, one deployment model
The Critical Success Requirements:
- Build system optimization: Must support selective rebuilds based on impact analysis
- Progressive deployment: Risk-aware rollout strategies for high-impact changes
- Domain boundaries: Logical service organization to limit blast radius
- Custom tooling: Standard Git/CI tools don’t scale to thousands of services
When NOT To Choose Monorepo:
- Different technology stacks: Mixed languages reduce shared tooling benefits
- Independent team velocity: Teams that rarely coordinate or share code
- Regulatory isolation: Services requiring separate compliance boundaries
- Limited engineering resources: Custom tooling overhead exceeds benefits
The Future: Scaling Beyond 3,000 Services
Uber’s monorepo strategy continues evolving as they approach the limits of their current architecture:
- Service mesh integration: Moving deployment complexity from monorepo to runtime
- AI-powered impact analysis: Machine learning to predict change risk more accurately
- Federated monorepos: Domain-specific repositories with cross-domain orchestration
- Build system improvements: Further reducing build times through better caching
Conclusion: Monorepo Survival Through Engineering Excellence
Uber’s monorepo didn’t fail — it forced them to build engineering capabilities that most organizations never develop. Their success required treating monorepo tooling as a core product with dedicated engineering teams, substantial investment, and continuous evolution.
The fundamental insight: Monorepo at massive scale isn’t a repository strategy — it’s a platform engineering challenge. Success requires building sophisticated change impact analysis, progressive deployment systems, and risk management tooling that simply doesn’t exist in standard CI/CD platforms.
The choice between monorepo and polyrepo at Uber’s scale isn’t about repository structure — it’s about whether you’re willing to invest in the engineering infrastructure necessary to make either approach work. Their monorepo succeeded because they built the missing pieces rather than accepting the limitations of existing tools.
For most organizations, Uber’s scale represents an aspiration problem, not an immediate concern. But their solutions provide a roadmap for what becomes necessary when standard practices hit their limits. The question isn’t whether monorepo scales — it’s whether your engineering organization can scale with it.
Follow me for more large-scale architecture insights
Enjoyed the read? Let’s stay connected!
- 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
- 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
- ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.
Your support means the world and helps me create more content you’ll love. ❤️