<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sahil Singh</title>
    <description>The latest articles on DEV Community by Sahil Singh (@glue_admin_3465093919ac6b).</description>
    <link>https://dev.to/glue_admin_3465093919ac6b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3759596%2Fcb0355dd-ffb9-4207-b9a9-94f3123410e5.png</url>
      <title>DEV Community: Sahil Singh</title>
      <link>https://dev.to/glue_admin_3465093919ac6b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/glue_admin_3465093919ac6b"/>
    <language>en</language>
    <item>
      <title>Code Health Metrics That Actually Matter (Not Lines of Code)</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:13:46 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/code-health-metrics-that-actually-matter-not-lines-of-code-2f5o</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/code-health-metrics-that-actually-matter-not-lines-of-code-2f5o</guid>
      <description>&lt;p&gt;"How healthy is your codebase?"&lt;/p&gt;

&lt;p&gt;If you can't answer that question with data, you're flying blind. Most teams rely on gut feeling: "It's getting harder to ship." "That service is a mess." "Don't touch the billing module."&lt;/p&gt;

&lt;p&gt;Here are the code health metrics that actually predict problems — and the ones that are noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics That Matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Change Failure Rate
&lt;/h3&gt;

&lt;p&gt;What percentage of changes to this code area cause bugs or incidents? This is the most direct measure of code health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthy:&lt;/strong&gt; &amp;lt;5% of changes cause issues&lt;br&gt;
&lt;strong&gt;Unhealthy:&lt;/strong&gt; &amp;gt;15% of changes cause issues&lt;/p&gt;

&lt;p&gt;Track this per module, not just globally. You might have 95% healthy code and one module that's a landmine.&lt;/p&gt;

&lt;p&gt;Part of &lt;a href="https://getglueapp.com/glossary/dora-metrics" rel="noopener noreferrer"&gt;DORA metrics&lt;/a&gt;, but applied at the code-area level instead of org level.&lt;/p&gt;
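&lt;p&gt;As a rough sketch of per-module tracking, assuming you can join your deploy log with your incident tracker to tag each change with a module and an outcome (both data sources are stand-ins here), the computation itself is small:&lt;/p&gt;

```python
from collections import defaultdict

def change_failure_rate(changes):
    """Per-module change failure rate from (module, caused_issue) pairs.

    `changes` would be built by joining your deploy log with your incident
    tracker -- both hypothetical data sources in this sketch."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for module, caused_issue in changes:
        totals[module] += 1
        if caused_issue:
            failures[module] += 1
    return {m: failures[m] / totals[m] for m in totals}

changes = [
    ("billing", True), ("billing", True), ("billing", False), ("billing", False),
    ("search", False), ("search", False), ("search", False), ("search", False),
]
rates = change_failure_rate(changes)
for module, rate in sorted(rates.items()):
    status = "unhealthy" if rate > 0.15 else "healthy"
    print(f"{module}: {rate:.0%} ({status})")
```

&lt;p&gt;Run it over 90 days of changes and the landmine modules surface immediately.&lt;/p&gt;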

&lt;h3&gt;
  
  
  2. Knowledge Distribution
&lt;/h3&gt;

&lt;p&gt;How many people can independently work on this code? A module with 5 active contributors is healthier than one with 1, even if the code quality is identical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthy:&lt;/strong&gt; &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;Bus factor&lt;/a&gt; &amp;gt;= 3&lt;br&gt;
&lt;strong&gt;Unhealthy:&lt;/strong&gt; Bus factor = 1&lt;/p&gt;

&lt;p&gt;This is the most overlooked code health metric. Beautiful code that only one person understands is unhealthy code.&lt;/p&gt;
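&lt;p&gt;A crude approximation you can get from git history alone: count the authors who hold a meaningful share of a module's recent commits. The (module, author) pairs and the 10% share cutoff below are illustrative assumptions, not a standard definition:&lt;/p&gt;

```python
from collections import defaultdict

def bus_factor(commits, min_share=0.1):
    """Approximate per-module bus factor from (module, author) commit pairs.

    Counts authors holding at least `min_share` of a module's recent commits.
    A rough proxy only; real knowledge distribution needs more signal than
    commit counts. The pairs could come from `git log --name-only`."""
    counts = defaultdict(lambda: defaultdict(int))
    for module, author in commits:
        counts[module][author] += 1
    result = {}
    for module, by_author in counts.items():
        total = sum(by_author.values())
        result[module] = sum(1 for n in by_author.values() if n / total >= min_share)
    return result

commits = ([("auth", "dana")] * 9 + [("auth", "lee")] * 6 +
           [("auth", "kim")] * 5 + [("billing", "dana")] * 10)
print(bus_factor(commits))  # auth has 3 significant contributors, billing only 1
```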

&lt;h3&gt;
  
  
  3. Coupling Score
&lt;/h3&gt;

&lt;p&gt;How many other modules does this code depend on, and how many depend on it? High coupling = high blast radius = high risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthy:&lt;/strong&gt; Clear, minimal &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependencies&lt;/a&gt; with defined interfaces&lt;br&gt;
&lt;strong&gt;Unhealthy:&lt;/strong&gt; Circular dependencies, god modules that everything imports&lt;/p&gt;
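&lt;p&gt;Given a list of (importer, imported) dependency edges, say from a static import scanner (the scanner itself is assumed here), fan-in and fan-out fall out directly:&lt;/p&gt;

```python
from collections import defaultdict

def coupling(edges):
    """Fan-out (what a module imports) and fan-in (what imports it),
    from (importer, imported) dependency edges."""
    fan_out = defaultdict(set)
    fan_in = defaultdict(set)
    for importer, imported in edges:
        fan_out[importer].add(imported)
        fan_in[imported].add(importer)
    modules = set(fan_out) | set(fan_in)
    return {m: (len(fan_out[m]), len(fan_in[m])) for m in modules}

edges = [
    ("api", "utils"), ("billing", "utils"), ("search", "utils"),
    ("jobs", "utils"), ("api", "billing"),
]
scores = coupling(edges)
print(scores["utils"])  # (0, 4): four importers, a god-module smell
```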

&lt;h3&gt;
  
  
  4. Change Frequency vs Test Coverage
&lt;/h3&gt;

&lt;p&gt;Code that changes frequently NEEDS high test coverage. Code that never changes can survive with less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthy:&lt;/strong&gt; High-churn code has proportionally high test coverage&lt;br&gt;
&lt;strong&gt;Unhealthy:&lt;/strong&gt; Your most-changed files have the lowest coverage&lt;/p&gt;
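&lt;p&gt;One way to sketch this check, assuming you have per-file churn from git and per-file coverage from your test runner (the thresholds are illustrative, not standard values):&lt;/p&gt;

```python
def risky_files(churn, coverage, churn_threshold=10, coverage_threshold=0.5):
    """Flag files that change often but are poorly tested.

    `churn` maps file -&gt; commit count over a window (e.g. from
    `git log --name-only`); `coverage` maps file -&gt; line coverage
    from your coverage report. Both inputs are assumed available."""
    flagged = []
    for path, commits in churn.items():
        cov = coverage.get(path, 0.0)
        if commits >= churn_threshold and coverage_threshold > cov:
            flagged.append((path, commits, cov))
    return sorted(flagged, key=lambda t: t[1], reverse=True)

churn = {"billing/invoice.py": 24, "docs/readme.md": 1, "api/routes.py": 15}
coverage = {"billing/invoice.py": 0.2, "api/routes.py": 0.8}
print(risky_files(churn, coverage))  # only billing/invoice.py is flagged
```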

&lt;h3&gt;
  
  
  5. Time-to-Understand
&lt;/h3&gt;

&lt;p&gt;How long does it take a new engineer to understand this module well enough to make changes? This is subjective but measurable through onboarding feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthy:&lt;/strong&gt; New engineer can make changes within 1-2 days&lt;br&gt;
&lt;strong&gt;Unhealthy:&lt;/strong&gt; New engineer needs 2+ weeks of ramp-up&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics That Are Noise
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lines of Code
&lt;/h3&gt;

&lt;p&gt;A 500-line file isn't healthier than a 2000-line file by default. It depends on what the code does. LOC tells you nothing about quality, maintainability, or risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cyclomatic Complexity (by itself)
&lt;/h3&gt;

&lt;p&gt;High complexity CAN indicate problems, but many perfectly fine algorithms are complex. Without context about change frequency and failure rate, it's noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comment Density
&lt;/h3&gt;

&lt;p&gt;More comments don't mean healthier code. Often the opposite — excessive comments indicate the code itself is unclear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Coverage (global)
&lt;/h3&gt;

&lt;p&gt;80% coverage doesn't mean your code is healthy if the 20% that's untested is your most critical business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Track Code Health
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Manual review.&lt;/strong&gt; Once per quarter, review your critical modules against these metrics. Simple but doesn't scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: CI integration.&lt;/strong&gt; Add test coverage tracking, dependency analysis, and lint rules to your pipeline. Catches trends but misses knowledge distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 3: &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;Codebase intelligence&lt;/a&gt;.&lt;/strong&gt; Tools that continuously analyze your codebase and surface health metrics automatically — including the human factors (knowledge distribution, &lt;a href="https://getglueapp.com/blog/tribal-knowledge-software-teams" rel="noopener noreferrer"&gt;tribal knowledge&lt;/a&gt; risk) that CI tools miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Action Framework
&lt;/h2&gt;

&lt;p&gt;For any unhealthy code area, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How often does it change?&lt;/strong&gt; (If rarely, it can wait)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the blast radius?&lt;/strong&gt; (If isolated, it's lower priority)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Who's affected?&lt;/strong&gt; (If it blocks many engineers, fix it first)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't try to make everything "healthy." Focus on the code that changes often, affects many people, and has the highest blast radius.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;getglueapp.com/glossary/code-health&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; tracks code health metrics continuously — including &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt;, dependency coupling, and change risk — so you can fix problems before they become incidents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>codequality</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Developer Onboarding Takes 3 Months (and How to Cut It to 3 Weeks)</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:13:10 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/why-developer-onboarding-takes-3-months-and-how-to-cut-it-to-3-weeks-1n57</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/why-developer-onboarding-takes-3-months-and-how-to-cut-it-to-3-weeks-1n57</guid>
      <description>&lt;p&gt;The industry average for developer onboarding to full productivity is &lt;strong&gt;3-6 months&lt;/strong&gt;. The best teams do it in 3-4 weeks.&lt;/p&gt;

&lt;p&gt;That gap isn't about the new hire's ability. It's about how the &lt;em&gt;team&lt;/em&gt; handles knowledge transfer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Takes So Long
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tribal Knowledge Discovery
&lt;/h3&gt;

&lt;p&gt;The new engineer reads the docs (if they exist). The docs are partially wrong. They ask on Slack. Get pointed to "the person who knows." That person is busy. The answer comes 2 days later. Repeat 50 times.&lt;/p&gt;

&lt;p&gt;This is the &lt;a href="https://getglueapp.com/blog/tribal-knowledge-software-teams" rel="noopener noreferrer"&gt;tribal knowledge&lt;/a&gt; problem. Critical system understanding lives in people's heads, not in any discoverable format.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Codebase Archaeology
&lt;/h3&gt;

&lt;p&gt;"Where does the checkout flow start?" In a monolith, maybe you can find it. In a microservices architecture with 30 repos, the answer spans 5 services across 3 teams. Nobody has drawn a current architecture diagram.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Missing Context
&lt;/h3&gt;

&lt;p&gt;The code does X. But &lt;em&gt;why&lt;/em&gt; does it do X? Why not Y, which seems simpler? Without Architecture Decision Records (ADRs), the new engineer either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asks someone (adding to their load)&lt;/li&gt;
&lt;li&gt;Guesses wrong and builds on a misunderstanding&lt;/li&gt;
&lt;li&gt;Spends hours reading git blame and PR history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. No Guided Path
&lt;/h3&gt;

&lt;p&gt;Most onboarding is: "Here's Jira, here's the codebase, good luck." There's no structured path from "I just got access" to "I can independently debug production issues."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Fast-Onboarding Teams Do Differently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Week 1: Orientation and Context
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture overview session&lt;/strong&gt; (recorded, not just live). Not every microservice — just the top-level &lt;a href="https://getglueapp.com/blog/c4-architecture-diagram" rel="noopener noreferrer"&gt;context diagram&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meet the knowledge holders.&lt;/strong&gt; For each critical system, introduce the engineer to the 1-2 people who know it best.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First commit on day 2.&lt;/strong&gt; Something small — a typo fix, a config change. The point is to get through the full PR → review → merge → deploy cycle immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Week 2: Guided Contribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pair programming on a real task.&lt;/strong&gt; Not a toy project. Work alongside a senior engineer on actual sprint work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore with a map.&lt;/strong&gt; Use &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; to show the new engineer: dependency maps, ownership maps, knowledge distribution. "Here's who owns what, here's what depends on what."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-call shadow.&lt;/strong&gt; Observe an on-call shift. Seeing how incidents are detected and resolved teaches more about the system than a month of reading code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Week 3: Independent Work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Own a small feature end-to-end.&lt;/strong&gt; From design through deployment. With a buddy for questions, but doing the work independently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write an ADR or improve docs.&lt;/strong&gt; The new hire has fresh eyes. Things that confuse them will confuse the next hire. Capture that feedback while it's fresh.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ongoing: Make Knowledge Self-Serve
&lt;/h3&gt;

&lt;p&gt;The biggest lever for onboarding speed is making system understanding &lt;strong&gt;discoverable without asking someone&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Living architecture diagrams that stay current&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;Dependency maps&lt;/a&gt; that show blast radius&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;Bus factor&lt;/a&gt; visibility (who knows what)&lt;/li&gt;
&lt;li&gt;Searchable ADRs for "why" questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Measuring Onboarding Success
&lt;/h2&gt;

&lt;p&gt;Track these metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to first commit&lt;/strong&gt; (should be &amp;lt;2 days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to first independent PR&lt;/strong&gt; (should be &amp;lt;2 weeks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to first on-call shift&lt;/strong&gt; (should be &amp;lt;6 weeks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New hire satisfaction survey at 30, 60, 90 days&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
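&lt;p&gt;These are simple to compute once you have the dates; the milestone names and dates below are assumptions for illustration, pulled in practice from git history, your PR tool, and your paging tool:&lt;/p&gt;

```python
from datetime import date

def onboarding_milestones(start, events):
    """Days from start date to each onboarding milestone.

    `events` maps milestone name -&gt; date it happened (data sources assumed)."""
    return {name: (when - start).days for name, when in events.items()}

m = onboarding_milestones(
    date(2026, 3, 2),
    {"first_commit": date(2026, 3, 3),
     "first_independent_pr": date(2026, 3, 12),
     "first_oncall_shift": date(2026, 4, 8)},
)
print(m)  # {'first_commit': 1, 'first_independent_pr': 10, 'first_oncall_shift': 37}
```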

&lt;p&gt;If even your senior hires are taking 12+ weeks to become productive, the problem isn't them. It's the environment. Fix the environment, and every future hire benefits.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/glossary/developer-onboarding" rel="noopener noreferrer"&gt;getglueapp.com/glossary/developer-onboarding&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; accelerates onboarding by making codebase knowledge self-serve — &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependency maps&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;ownership&lt;/a&gt;, and &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge distribution&lt;/a&gt; are always up to date.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>teamwork</category>
      <category>productivity</category>
      <category>career</category>
    </item>
    <item>
      <title>Software Estimation Is Broken. Here's What to Do Instead.</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:12:34 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/software-estimation-is-broken-heres-what-to-do-instead-3k65</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/software-estimation-is-broken-heres-what-to-do-instead-3k65</guid>
      <description>&lt;p&gt;Every engineering manager has lived this: "How long will this take?" followed by a confident answer that turns out to be wrong by 2-5x.&lt;/p&gt;

&lt;p&gt;The problem isn't that engineers are bad at estimating. The problem is that &lt;strong&gt;software estimation is fundamentally hard&lt;/strong&gt;, and most teams use methods that make it harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Estimates Are Always Wrong
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Unknown Unknowns
&lt;/h3&gt;

&lt;p&gt;You can estimate the work you can see. You can't estimate the work you'll discover along the way: the legacy code that doesn't work like the docs say, the API that has undocumented rate limits, the database migration that reveals data inconsistencies.&lt;/p&gt;

&lt;p&gt;In a typical feature implementation, &lt;strong&gt;30-50% of the work is discovered during the work&lt;/strong&gt;. Any upfront estimate misses this by definition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchoring Bias
&lt;/h3&gt;

&lt;p&gt;The first number mentioned becomes the anchor. If a PM says "I was thinking 2 weeks," the estimate gravitates toward 2 weeks regardless of actual complexity. If the tech lead says "probably a sprint," everyone adjusts from there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parkinson's Law
&lt;/h3&gt;

&lt;p&gt;Work expands to fill the time allocated. Give a team 2 weeks and they'll take 2 weeks. Give them 3 weeks and they'll take 3 weeks. The estimate becomes a floor, not a ceiling: work rarely finishes early, but it often finishes late.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge Gaps
&lt;/h3&gt;

&lt;p&gt;The person giving the estimate often doesn't have full understanding of the system they're estimating changes to. &lt;a href="https://getglueapp.com/blog/tribal-knowledge-software-teams" rel="noopener noreferrer"&gt;Tribal knowledge&lt;/a&gt; means the real complexity is hidden in someone else's head. The estimate reflects the visible complexity, not the actual complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reference-Class Forecasting
&lt;/h3&gt;

&lt;p&gt;Instead of estimating from scratch, look at similar past work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The last 5 API endpoints we built took 3-7 days each"&lt;/li&gt;
&lt;li&gt;"Database migrations in this codebase have historically taken 2x our estimate"&lt;/li&gt;
&lt;li&gt;"Integration with external APIs has never taken less than 2 weeks"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your past delivery data is the best predictor of future delivery.&lt;/p&gt;
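&lt;p&gt;A minimal sketch of this: quote a percentile range from the durations of similar past tasks instead of a fresh point estimate. The 20th-80th percentile choice is a team convention, not a rule:&lt;/p&gt;

```python
def forecast_range(past_durations_days, low_pct=0.2, high_pct=0.8):
    """Reference-class forecast: a percentile range over similar past work."""
    xs = sorted(past_durations_days)
    def pct(p):
        # nearest-rank percentile, deliberately simple
        idx = min(len(xs) - 1, int(p * len(xs)))
        return xs[idx]
    return pct(low_pct), pct(high_pct)

# last eight "add an API endpoint" tasks, in days (made-up history)
history = [3, 4, 4, 5, 6, 7, 7, 12]
low, high = forecast_range(history)
print(f"estimate: {low}-{high} days")  # estimate: 4-7 days
```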

&lt;h3&gt;
  
  
  2. Thin-Slice Delivery
&lt;/h3&gt;

&lt;p&gt;Instead of estimating a 6-week project and hoping you're right, break it into 1-week deliverable slices. Ship the first slice. Re-estimate based on what you learned.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://getglueapp.com/glossary/dora-metrics" rel="noopener noreferrer"&gt;high deployment frequency&lt;/a&gt; correlates with better outcomes — small batches give you faster feedback on your estimates.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Understand the Codebase First
&lt;/h3&gt;

&lt;p&gt;Half the reason estimates are wrong is that engineers don't understand the full complexity of the code they'll need to change: the &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependency graph&lt;/a&gt;, the &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; (can you even get help if you're stuck?), and the &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;code health&lt;/a&gt; of the area they'll be working in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;Codebase intelligence&lt;/a&gt; tools can surface this context before you estimate — showing you the blast radius, ownership map, and historical change patterns for the area of code involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Ranges, Not Points
&lt;/h3&gt;

&lt;p&gt;"3 weeks" is a point estimate and is almost certainly wrong. "2-5 weeks, most likely 3" gives the PM what they actually need: a range for planning, a best case, and a worst case.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Track Accuracy
&lt;/h3&gt;

&lt;p&gt;Measure how accurate your estimates are over time. If you consistently estimate 1 week and deliver in 2, you have a systematic 2x bias. Correct for it.&lt;/p&gt;
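&lt;p&gt;A minimal version of accuracy tracking, assuming you log (estimated, actual) pairs per task:&lt;/p&gt;

```python
import statistics

def estimate_bias(pairs):
    """Median ratio of actual to estimated duration.

    A bias of 2.0 means you systematically deliver in twice the estimated
    time; multiply future estimates accordingly."""
    return statistics.median(actual / est for est, actual in pairs)

# (estimated_days, actual_days) for recent tasks -- illustrative data
history = [(5, 9), (3, 7), (10, 18), (2, 5), (8, 15)]
bias = estimate_bias(history)
print(f"bias: {bias:.1f}x")  # bias: 1.9x
```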

&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;The real problem isn't estimation — it's that organizations use estimates for the wrong thing. Estimates should be planning inputs, not commitments. When an estimate becomes a deadline, engineers pad defensively, managers pressure for smaller numbers, and the whole system produces unreliable information.&lt;/p&gt;

&lt;p&gt;The best teams I've worked with don't argue about estimates. They ship small batches, measure throughput, and use historical data to forecast. No guessing. No negotiation. Just data.&lt;/p&gt;
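&lt;p&gt;Forecasting from throughput can be sketched as a small Monte Carlo simulation: repeatedly sample past weekly throughput until the backlog is empty, then read off a percentile. The 85% confidence level and the history below are illustrative choices:&lt;/p&gt;

```python
import random

def weeks_to_finish(backlog, weekly_throughput_history, runs=10000, pct=0.85):
    """Monte Carlo forecast of weeks to clear a backlog of work items."""
    rng = random.Random(7)  # fixed seed so the sketch is reproducible
    outcomes = []
    for _ in range(runs):
        remaining, weeks = backlog, 0
        while remaining > 0:
            remaining -= rng.choice(weekly_throughput_history)
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    return outcomes[int(pct * runs)]

# items shipped per week over the last ten weeks (made-up data)
history = [3, 5, 2, 4, 6, 3, 4, 5, 2, 4]
print(weeks_to_finish(30, history), "weeks at 85% confidence")
```

&lt;p&gt;No negotiation needed: the forecast updates itself as the throughput history grows.&lt;/p&gt;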




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/glossary/software-project-estimation" rel="noopener noreferrer"&gt;getglueapp.com/glossary/software-project-estimation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; helps teams understand codebase complexity before estimating — surfacing &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependencies&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge concentration&lt;/a&gt;, and historical change patterns.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>projectmanagement</category>
      <category>devops</category>
    </item>
    <item>
      <title>What AI Code Assistants Can't Do (Yet): The Gap Between Generation and Understanding</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:11:58 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/what-ai-code-assistants-cant-do-yet-the-gap-between-generation-and-understanding-nh3</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/what-ai-code-assistants-cant-do-yet-the-gap-between-generation-and-understanding-nh3</guid>
      <description>&lt;p&gt;Copilot can write a function. Cursor can refactor a file. Claude Code can scaffold a service. But ask any of them: "What's the blast radius if I change this API endpoint?" and you'll get a hallucination, not an answer.&lt;/p&gt;

&lt;p&gt;The gap between &lt;strong&gt;code generation&lt;/strong&gt; and &lt;strong&gt;code understanding&lt;/strong&gt; is the most important gap in AI tooling right now. And most teams aren't even aware it exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Code Assistants Are Great At
&lt;/h2&gt;

&lt;p&gt;Let's be fair about what works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate generation.&lt;/strong&gt; Creating CRUD endpoints, test scaffolding, type definitions. Massive time saver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-file refactoring.&lt;/strong&gt; Renaming variables, extracting functions, converting patterns. Solid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation.&lt;/strong&gt; Generating docstrings, README sections, inline comments. Good enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autocomplete.&lt;/strong&gt; Suggesting the next line of code based on context. The original killer feature.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For individual developer productivity, these tools are genuinely transformative. But they all operate at the same level: &lt;strong&gt;the file or function level&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Can't Do
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understand Cross-Service Dependencies
&lt;/h3&gt;

&lt;p&gt;"If I change the schema of the UserCreated event, which services will break?"&lt;/p&gt;

&lt;p&gt;This requires understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which services consume that event&lt;/li&gt;
&lt;li&gt;What fields they depend on&lt;/li&gt;
&lt;li&gt;Whether they handle schema evolution gracefully&lt;/li&gt;
&lt;li&gt;Who owns those services and needs to be notified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No code assistant can answer this because it requires analyzing the &lt;em&gt;relationships between&lt;/em&gt; codebases, not just the code within one file. This is &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependency mapping&lt;/a&gt; — a fundamentally different capability.&lt;/p&gt;
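&lt;p&gt;For contrast, here is roughly what even the naive first pass at that question looks like: a text scan across service directories for references to the event name. A real dependency mapper parses schemas and subscriptions; this sketch (with made-up service names) only bounds the search:&lt;/p&gt;

```python
from pathlib import Path
import tempfile

def find_event_consumers(root, event_name):
    """Naive blast-radius pass: services whose code mentions an event name."""
    consumers = set()
    for path in Path(root).rglob("*.py"):
        if event_name in path.read_text(errors="ignore"):
            consumers.add(path.relative_to(root).parts[0])
    return sorted(consumers)

# demo against a throwaway directory tree standing in for your repos
with tempfile.TemporaryDirectory() as root:
    for service, code in {
        "email-service": "def handle(e): ...  # subscribes to UserCreated",
        "billing-service": "process_invoice()",
    }.items():
        d = Path(root) / service
        d.mkdir()
        (d / "main.py").write_text(code)
    print(find_event_consumers(root, "UserCreated"))  # ['email-service']
```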

&lt;h3&gt;
  
  
  Identify Knowledge Risks
&lt;/h3&gt;

&lt;p&gt;"Who can fix the billing pipeline if it breaks at 2 AM?"&lt;/p&gt;

&lt;p&gt;This requires understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who has historically worked on this code&lt;/li&gt;
&lt;li&gt;Who has successfully resolved incidents here before&lt;/li&gt;
&lt;li&gt;Whether that knowledge has been shared with others&lt;/li&gt;
&lt;li&gt;What the &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; is for this system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code assistants generate code. They don't understand the &lt;em&gt;human context&lt;/em&gt; around the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predict Blast Radius
&lt;/h3&gt;

&lt;p&gt;"How risky is this refactoring?"&lt;/p&gt;

&lt;p&gt;Risk isn't about the code change itself — it's about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many other things depend on what you're changing&lt;/li&gt;
&lt;li&gt;How frequently those dependent systems change&lt;/li&gt;
&lt;li&gt;How well-tested the integration points are&lt;/li&gt;
&lt;li&gt;Who needs to review and approve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; — understanding the codebase as a system, not as individual files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surface Architectural Drift
&lt;/h3&gt;

&lt;p&gt;"Is our architecture still aligned with our team structure?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getglueapp.com/blog/conways-law" rel="noopener noreferrer"&gt;Conway's Law&lt;/a&gt; tells us architecture mirrors org structure. Detecting misalignment requires analyzing patterns across the entire codebase and the entire organization. No file-level tool can see this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Layers of AI for Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Code Generation&lt;/strong&gt; (Copilot, Cursor, Claude Code)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operates at file/function level&lt;/li&gt;
&lt;li&gt;Accelerates individual productivity&lt;/li&gt;
&lt;li&gt;Every team should be using these&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Code Intelligence&lt;/strong&gt; (Codebase analysis, engineering analytics)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operates at system/organization level&lt;/li&gt;
&lt;li&gt;Answers strategic questions about the codebase&lt;/li&gt;
&lt;li&gt;Identifies risks that no individual can see&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams have Layer 1 but not Layer 2. They can generate code faster than ever, but they still can't answer: "Is this change safe?" or "Where are our biggest risks?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The faster you generate code, the more important it becomes to understand the &lt;em&gt;impact&lt;/em&gt; of that code. AI code assistants without codebase intelligence are like having a faster car without a map. You'll go fast — but you might be going in the wrong direction.&lt;/p&gt;

&lt;p&gt;The best engineering teams in 2026 use both layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI code assistants for individual productivity&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;Codebase intelligence&lt;/a&gt; for organizational understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The generation gap will close. But the understanding gap is where the real value is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; is the codebase intelligence layer — answering the questions that code assistants can't: &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependency mapping&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt;, and &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;code health&lt;/a&gt; across your entire codebase.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>Knowledge Silos in Microservices: The Hidden Cost of Distributed Systems</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:11:22 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/knowledge-silos-in-microservices-the-hidden-cost-of-distributed-systems-478j</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/knowledge-silos-in-microservices-the-hidden-cost-of-distributed-systems-478j</guid>
      <description>&lt;p&gt;Everyone talks about the technical challenges of microservices: network latency, distributed transactions, service discovery. Nobody talks about the knowledge challenge.&lt;/p&gt;

&lt;p&gt;When you split a monolith into 30 services, you don't just distribute the code. You distribute the &lt;em&gt;understanding&lt;/em&gt;. And unlike code, understanding doesn't scale horizontally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Knowledge Distribution Problem
&lt;/h2&gt;

&lt;p&gt;In a monolith, any engineer can &lt;code&gt;grep&lt;/code&gt; for a function and trace its behavior. In microservices, understanding a single user flow might require reading code in 5 different repositories owned by 3 different teams.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;knowledge silos form along service boundaries&lt;/strong&gt;. The payments team understands payments. The onboarding team understands onboarding. Nobody understands the full picture.&lt;/p&gt;

&lt;p&gt;This is &lt;a href="https://getglueapp.com/blog/conways-law" rel="noopener noreferrer"&gt;Conway's Law&lt;/a&gt; applied to knowledge, not just architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Costs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Incident response.&lt;/strong&gt; An incident in the checkout flow touches the cart service, payment service, and notification service. Three teams get paged. Each team understands their slice but not the interaction between slices. Debugging takes 3x longer because the problem is in the &lt;em&gt;integration&lt;/em&gt;, not any single service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture decisions.&lt;/strong&gt; When nobody holds a complete mental model, every cross-service change requires a committee. "How does this affect the event bus?" becomes a multi-day question because the answer spans three teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding.&lt;/strong&gt; New engineers join a team and learn one service deeply. But to be effective, they need to understand the services their service depends on. That understanding lives in other teams' heads.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect Knowledge Silos
&lt;/h2&gt;

&lt;p&gt;Look at your git history across all repos:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Who commits to which repos?&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;repo &lt;span class="k"&gt;in &lt;/span&gt;service-a service-b service-c&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== &lt;/span&gt;&lt;span class="nv"&gt;$repo&lt;/span&gt;&lt;span class="s2"&gt; ==="&lt;/span&gt;
  &lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$repo&lt;/span&gt;
  git log &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"90 days"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%aN'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
  &lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If each repo has a completely different set of contributors, you have siloed knowledge.&lt;/p&gt;
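&lt;p&gt;You can quantify that silo signal from the same git data by computing the Jaccard overlap of contributor sets between repo pairs (repo names and authors below are made up):&lt;/p&gt;

```python
def contributor_overlap(repo_authors):
    """Jaccard overlap of contributor sets between each pair of repos.

    Scores near 0 mean knowledge is siloed along repo boundaries.
    `repo_authors` maps repo -&gt; set of recent committers, e.g. collected
    by the git log loop above."""
    repos = sorted(repo_authors)
    scores = {}
    for i, a in enumerate(repos):
        for b in repos[i + 1:]:
            union = repo_authors[a] | repo_authors[b]
            inter = repo_authors[a].intersection(repo_authors[b])
            scores[(a, b)] = len(inter) / len(union) if union else 0.0
    return scores

authors = {
    "service-a": {"dana", "lee"},
    "service-b": {"lee", "kim"},
    "service-c": {"priya"},
}
for pair, score in contributor_overlap(authors).items():
    print(pair, f"{score:.2f}")  # service-c overlaps with nothing: a silo
```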

&lt;p&gt;More sophisticated approaches use &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; to map knowledge distribution across your entire org — showing not just who commits where, but who can actually &lt;em&gt;understand&lt;/em&gt; and &lt;em&gt;debug&lt;/em&gt; each service.&lt;/p&gt;

&lt;h2&gt;
  
  
  5 Fixes That Actually Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cross-Service Code Reviews
&lt;/h3&gt;

&lt;p&gt;Require at least one reviewer from a &lt;em&gt;different&lt;/em&gt; team for PRs that change API contracts or event schemas. This forces knowledge transfer at the integration points where silos cause the most damage.&lt;/p&gt;
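&lt;p&gt;One lightweight way to enforce this on GitHub (GitLab has an equivalent) is a CODEOWNERS rule on the contract paths, combined with the "require review from code owners" branch setting. The paths and team handles below are hypothetical:&lt;/p&gt;

```text
# .github/CODEOWNERS (paths and team handles are illustrative)
# Any PR touching shared API contracts or event schemas needs a review
# from the platform team in addition to the owning team.
/api-contracts/    @acme/team-payments @acme/team-platform
/events/schemas/   @acme/team-platform
```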

&lt;h3&gt;
  
  
  2. Rotation Programs
&lt;/h3&gt;

&lt;p&gt;Engineers spend one sprint per quarter embedded in a different team. Not just reading their code — actually shipping features. This builds empathy and understanding that no documentation can replace.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Architecture Decision Records (ADRs)
&lt;/h3&gt;

&lt;p&gt;When you make a cross-service decision, document it in a shared location (not in any single repo). Include: the decision, the alternatives considered, and the constraints that drove the choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Shared On-Call for Integration Flows
&lt;/h3&gt;

&lt;p&gt;For critical user flows that span multiple services, create a shared on-call rotation with members from each team. During incidents, they debug together. The shared context compounds over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Living Dependency Maps
&lt;/h3&gt;

&lt;p&gt;Maintain an always-up-to-date map of which services depend on which. Not a stale Confluence diagram — a &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;living dependency map&lt;/a&gt; derived from actual code and API calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Microservices Tax
&lt;/h2&gt;

&lt;p&gt;Microservices have a knowledge tax that monoliths don't. Every service boundary is a potential knowledge silo. Every API contract is a potential misunderstanding waiting to happen.&lt;/p&gt;

&lt;p&gt;This doesn't mean microservices are wrong. It means you need to &lt;strong&gt;budget for knowledge distribution&lt;/strong&gt; the same way you budget for infrastructure. If you're not spending 10-15% of engineering time on cross-team knowledge sharing, your silos are growing quietly.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; for individual services might be fine. But what's the bus factor for understanding &lt;em&gt;how they all fit together&lt;/em&gt;? For most teams, that number is dangerously low.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; maps &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt; across your entire microservices architecture — showing which services have the highest concentration risk and where cross-training is needed most.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>architecture</category>
      <category>teamwork</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Build an AI Roadmap for Your Engineering Team (2026)</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:10:46 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/how-to-build-an-ai-roadmap-for-your-engineering-team-2026-1c7c</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/how-to-build-an-ai-roadmap-for-your-engineering-team-2026-1c7c</guid>
      <description>&lt;p&gt;Most organizations that fail with AI fail because they skipped the roadmap. They jumped straight to buying tools or training models without understanding what problems AI should solve.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;AI roadmap&lt;/strong&gt; is a strategic plan for how your engineering org will adopt, integrate, and scale AI. Not "we need to use AI" — that leads to solutions looking for problems. Instead: "our code review cycle takes 5 days and we want it under 1 day."&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Stages of AI Adoption
&lt;/h2&gt;

&lt;p&gt;Based on patterns across hundreds of engineering organizations:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: AI-Assisted Individual Productivity (Month 1-3)
&lt;/h3&gt;

&lt;p&gt;Individual devs use coding assistants: GitHub Copilot, Cursor, Claude Code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure:&lt;/strong&gt; Developer self-reported productivity, time saved on routine tasks.&lt;br&gt;
&lt;strong&gt;Mistake:&lt;/strong&gt; Measuring adoption rate instead of actual productivity improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: AI-Augmented Workflows (Month 3-6)
&lt;/h3&gt;

&lt;p&gt;AI moves from individual tools to team workflows: AI code review, automated test generation, AI-assisted sprint planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure:&lt;/strong&gt; Code review cycle time, test coverage improvement, estimation accuracy.&lt;br&gt;
&lt;strong&gt;Mistake:&lt;/strong&gt; Forcing AI into workflows where it adds friction rather than removing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: AI-Powered Engineering Intelligence (Month 6-12)
&lt;/h3&gt;

&lt;p&gt;AI analyzes patterns across the org: &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silo&lt;/a&gt; detection, predictive &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; analysis, code health trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure:&lt;/strong&gt; Time to identify risks, accuracy of predictions, reduction in unplanned work.&lt;br&gt;
&lt;strong&gt;Mistake:&lt;/strong&gt; Treating AI insights as absolute truth rather than signals needing human interpretation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 4: AI-Native Development (Month 12-24)
&lt;/h3&gt;

&lt;p&gt;Development practices redesigned around AI: AI-first testing, automated architecture review, AI-driven refactoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure:&lt;/strong&gt; Ratio of AI-generated to human-written code, and the quality of AI-generated artifacts (code, tests, docs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 5: Autonomous Engineering Operations (Month 24+)
&lt;/h3&gt;

&lt;p&gt;Self-healing infrastructure, automated incident response, AI-managed deployments. Very few orgs are here today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Assess Current State
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data inventory:&lt;/strong&gt; What data do you have, where, how clean?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool inventory:&lt;/strong&gt; What AI tools are devs already using (officially or not)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill assessment:&lt;/strong&gt; What AI/ML skills exist on the team?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process maturity:&lt;/strong&gt; Are your dev processes well-defined enough to augment with AI?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Identify High-Value Use Cases
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;th&gt;Feasibility&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI code review&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Do first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated test generation&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Do second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictive incident detection&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Plan for Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-powered onboarding&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Quick win&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous deployments&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Long-term&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Define Success Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Reduce code review time from 48 hours to 12 hours"&lt;/li&gt;
&lt;li&gt;"Increase test coverage from 45% to 70% in 6 months"&lt;/li&gt;
&lt;li&gt;"Detect 80% of production incidents before user impact"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Plan the Rollout
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pilot (1-2 months):&lt;/strong&gt; One team. Measure everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expansion (2-4 months):&lt;/strong&gt; 3-5 teams. Refine based on learnings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Org-wide (4-6 months):&lt;/strong&gt; Standard rollout with training.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 5: Build Feedback Loops
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Collect developer feedback on AI tool effectiveness&lt;/li&gt;
&lt;li&gt;Track quantitative metrics monthly&lt;/li&gt;
&lt;li&gt;Review and adjust quarterly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sunset AI tools that don't deliver value&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Misconceptions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"We need ML engineers."&lt;/strong&gt; For most teams, adopting AI means using existing tools, not building models. You need engineers who can evaluate and integrate, not necessarily build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"AI will replace developers."&lt;/strong&gt; AI augments developers. The most productive devs in 2026 use AI effectively as a tool — they don't resist it or blindly trust it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We should wait for AI to mature."&lt;/strong&gt; Code completion, code review assistance, and automated testing are all proven. Waiting means falling behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"One AI tool does everything."&lt;/strong&gt; Build your AI stack like your engineering stack: best-of-breed tools that integrate well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Template: Your First Year
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q1:&lt;/strong&gt; Audit AI usage → select coding assistant → pilot with 1-2 teams → establish baselines&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2:&lt;/strong&gt; Roll out coding assistant org-wide → pilot AI code review → begin data readiness assessment&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3:&lt;/strong&gt; Implement AI code review across all teams → pilot AI test generation → pilot &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; for knowledge silo detection&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4:&lt;/strong&gt; Deploy engineering analytics → implement predictive incident detection → plan year 2&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/glossary/ai-roadmap" rel="noopener noreferrer"&gt;getglueapp.com/glossary/ai-roadmap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; is a &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; platform that provides AI-powered engineering insights — from &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;code health&lt;/a&gt; to &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; to &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silo&lt;/a&gt; detection.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>DORA Metrics Explained: The Only 4 Engineering Metrics Backed by Research</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:10:10 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/dora-metrics-explained-the-only-4-engineering-metrics-backed-by-research-70</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/dora-metrics-explained-the-only-4-engineering-metrics-backed-by-research-70</guid>
      <description>&lt;p&gt;Most engineering metrics are vanity metrics. Lines of code, story points, commit counts — they measure activity, not outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DORA metrics&lt;/strong&gt; are different. They're the only engineering metrics backed by rigorous, multi-year research connecting them to business outcomes: the DORA team at Google surveyed 32,000+ professionals and found that these four metrics predict both technical AND business performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 DORA Metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Deployment Frequency
&lt;/h3&gt;

&lt;p&gt;How often you deploy to production.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Elite&lt;/td&gt;
&lt;td&gt;Multiple per day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Daily to weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Weekly to monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Monthly to once every six months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Elite teams deploy &lt;strong&gt;973x more frequently&lt;/strong&gt; than low performers.&lt;/p&gt;
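&lt;p&gt;If every production deploy is tagged (an assumption; adapt to however your pipeline marks releases), you can get a quick read on deployment frequency from git alone. A sketch, assuming GNU &lt;code&gt;date&lt;/code&gt;:&lt;/p&gt;

```shell
# Deploys per ISO week, given one deploy date (YYYY-MM-DD) per line on stdin.
# Produce the input from tags, e.g.:
#   git tag -l 'deploy-*' --format='%(creatordate:short)'
deploys_per_week() {
  while read -r d; do
    date -d "$d" +%G-W%V     # map each date to its ISO week (GNU date)
  done | sort | uniq -c      # count deploys in each week
}
```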

&lt;h3&gt;
  
  
  2. Lead Time for Changes
&lt;/h3&gt;

&lt;p&gt;Time from code commit to running in production.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Lead Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Elite&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;1 day - 1 week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;1 week - 1 month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;1 - 6 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Where time is typically lost: code review queues, manual QA, change advisory boards, deployment windows.&lt;/p&gt;
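&lt;p&gt;Measuring lead time can start crude: for each change, record the commit timestamp and the timestamp of the deploy that shipped it, then average the gap. A portable awk sketch (the epoch pairs are assumed to come from your pipeline, or from &lt;code&gt;git log --format=%ct&lt;/code&gt; plus deploy logs):&lt;/p&gt;

```shell
# Average lead time in hours.
# stdin: one line per change, "commit_epoch deploy_epoch"
avg_lead_time() {
  awk '{ s += ($2 - $1) / 3600; n++ } END { if (n) printf "%.1f\n", s / n }'
}
```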

&lt;h3&gt;
  
  
  3. Change Failure Rate
&lt;/h3&gt;

&lt;p&gt;Percentage of deployments causing failures in production.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Failure Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Elite&lt;/td&gt;
&lt;td&gt;0-5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;5-10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;10-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;16-30% or higher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;How to reduce it: automated testing, feature flags, canary deployments, better code review.&lt;/p&gt;
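&lt;p&gt;The arithmetic itself is trivial once you can count both sides; the hard part is attributing incidents to specific deploys. A minimal sketch:&lt;/p&gt;

```shell
# Change failure rate: deploys that caused an incident / total deploys.
change_failure_rate() {
  # $1 = total deploys, $2 = deploys that caused a failure
  awk -v d="$1" -v f="$2" 'BEGIN { printf "%.1f%%\n", 100 * f / d }'
}
change_failure_rate 40 3   # 3 bad deploys out of 40: prints 7.5%
```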

&lt;h3&gt;
  
  
  4. Mean Time to Recovery (MTTR)
&lt;/h3&gt;

&lt;p&gt;How quickly you recover from a production incident.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Recovery Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Elite&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;1 day - 1 week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;&amp;gt; 1 week&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Low MTTR requires: good monitoring, clear incident response, multiple people who can debug each system (&lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; matters here).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Speed and stability are NOT tradeoffs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most important finding. Elite performers are both faster AND more reliable. The common belief that "moving fast breaks things" is a myth. Teams with better practices achieve both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why These 4 and Not Others?
&lt;/h2&gt;

&lt;p&gt;The research shows teams with better DORA metrics also have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher organizational performance (profitability, market share)&lt;/li&gt;
&lt;li&gt;Lower employee burnout&lt;/li&gt;
&lt;li&gt;Higher job satisfaction&lt;/li&gt;
&lt;li&gt;Better ability to meet business goals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No other set of engineering metrics has this level of evidence behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Measure Them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Manual surveys.&lt;/strong&gt; Ask your team: How often did we deploy? How long from commit to prod? What % caused issues? How fast did we recover? Works for small teams starting out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD pipeline data.&lt;/strong&gt; GitHub Actions, GitLab CI, Jenkins already track deployment frequency and lead time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident management data.&lt;/strong&gt; PagerDuty, Opsgenie, incident.io track MTTR. Correlate with deployment timestamps for change failure rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedicated platforms.&lt;/strong&gt; Sleuth, LinearB, Jellyfish, Swarmia aggregate from multiple sources automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codebase intelligence.&lt;/strong&gt; &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; calculates engineering health metrics including code change velocity and team collaboration patterns that complement DORA metrics with deeper codebase insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Measuring adoption, not improvement.&lt;/strong&gt; "80% of engineers use our CI/CD" is not a DORA metric. Measure outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizing all four simultaneously.&lt;/strong&gt; Start with deployment frequency. When you deploy small batches frequently, lead time drops, failures are easier to diagnose, and recovery is faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gaming the metrics.&lt;/strong&gt; Deploying empty commits, ignoring incidents. Use all four together and focus on trends rather than absolute numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating DORA as DevOps-only.&lt;/strong&gt; These measure the entire software delivery process. They're relevant to engineering leadership, product teams, and anyone who cares about how fast software reaches users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Pick ONE metric. Deployment frequency is the easiest to start with.&lt;/li&gt;
&lt;li&gt;Measure the baseline. Where are you today?&lt;/li&gt;
&lt;li&gt;Set a target. "Move from monthly to weekly deployments in Q2."&lt;/li&gt;
&lt;li&gt;Remove the biggest bottleneck. Usually it's batch size, testing, or approval processes.&lt;/li&gt;
&lt;li&gt;Measure monthly. Track trends, not snapshots.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/glossary/dora-metrics" rel="noopener noreferrer"&gt;getglueapp.com/glossary/dora-metrics&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; provides engineering intelligence that complements DORA metrics — mapping &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;code health&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt;, and team collaboration patterns.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>programming</category>
      <category>productivity</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Bus Factor: The Metric That Predicts Team Disasters Before They Happen</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:08:59 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/bus-factor-the-metric-that-predicts-team-disasters-before-they-happen-k9i</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/bus-factor-the-metric-that-predicts-team-disasters-before-they-happen-k9i</guid>
      <description>&lt;p&gt;How many people can disappear from your team before a critical system becomes unmaintainable?&lt;/p&gt;

&lt;p&gt;That's the &lt;strong&gt;bus factor&lt;/strong&gt; — the minimum number of team members whose loss would leave the project in serious trouble. A bus factor of 1 means one person leaving would be catastrophic. A bus factor of 3 means you can absorb the loss of any two people.&lt;/p&gt;

&lt;p&gt;For most teams, the honest answer for their most critical systems is &lt;strong&gt;one&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bus Factor Matters
&lt;/h2&gt;

&lt;p&gt;Bus factor is not an academic concept. It's a direct measure of operational risk.&lt;/p&gt;

&lt;p&gt;When key engineers leave (and they will), the team's ability to maintain, debug, and evolve that system drops dramatically. Everything slows down: feature development, incident response, code reviews. New engineers can't get up to speed because the person who could explain the system is gone.&lt;/p&gt;

&lt;p&gt;Common triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Employee turnover&lt;/strong&gt; — Engineers leave. Average tenure in tech is 2-3 years.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reorgs and layoffs&lt;/strong&gt; — Knowledge domains can vanish overnight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Illness and vacation&lt;/strong&gt; — Even temporary absence of a bus factor-1 engineer creates blockers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promotion&lt;/strong&gt; — When a senior IC becomes a manager, they stop writing code but nobody absorbs their knowledge.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Calculate Bus Factor
&lt;/h2&gt;

&lt;p&gt;For any system, module, or codebase area:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Look at &lt;strong&gt;git commit history&lt;/strong&gt; over the last 6-12 months&lt;/li&gt;
&lt;li&gt;Count how many unique contributors have made meaningful changes&lt;/li&gt;
&lt;li&gt;Identify the minimum set of people whose combined knowledge covers the system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;That number is your bus factor&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A more precise approach: For each file or module, identify who can independently debug and fix production issues (not just review PRs). If only one person truly understands the billing pipeline enough to fix it at 2 AM, the bus factor for billing is 1.&lt;/p&gt;
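&lt;p&gt;For a first-pass estimate from git history alone, one common heuristic (a simplification — it measures commit share, not true understanding) is to count the authors who each account for a meaningful share of recent commits to an area:&lt;/p&gt;

```shell
# Rough bus factor for one code area: authors with at least 20% of its
# recent commits. The 20% threshold is an arbitrary, tunable cutoff.
# stdin: one author name per commit, e.g. from:
#   git log --since="1 year" --format='%aN' -- src/billing
bus_factor() {
  sort | uniq -c | awk '{ count[NR] = $1; total += $1 }
    END { bf = 0; for (i in count) if (count[i] / total >= 0.2) bf++; print bf }'
}
```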

&lt;p&gt;&lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;Codebase intelligence tools&lt;/a&gt; can automate this by analyzing git history and deriving knowledge distribution maps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bus Factor by Team Size
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Team Size&lt;/th&gt;
&lt;th&gt;Minimum Bus Factor&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Risk if BF = 1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2-3 people&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4-6 people&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7-10 people&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4+&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10+ people&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5+&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Startups (2-5 engineers):&lt;/strong&gt; Bus factor of 1 is common and sometimes unavoidable. Mitigate with documentation and recorded architecture sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Growth-stage (10-50 engineers):&lt;/strong&gt; Bus factor of 1 is unacceptable for any production system. Budget 10-15% of engineering time for knowledge sharing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise (50+):&lt;/strong&gt; Bus factor should be 3+ for all critical systems. Formal rotation policies become necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The OpenSSL Heartbleed Case.&lt;/strong&gt; In 2014, the Heartbleed vulnerability affected millions of servers worldwide. At the time, OpenSSL was maintained by essentially one full-time developer. A project critical to internet security had a bus factor of ~1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The left-pad Incident.&lt;/strong&gt; In 2016, one developer unpublished a small npm package and broke thousands of builds including React and Babel. The npm ecosystem had a bus factor problem at the package level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Loss During Layoffs.&lt;/strong&gt; When companies do large layoffs, entire knowledge domains disappear overnight. If the three people who understood the billing system are all let go, the bus factor drops to zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Improve Bus Factor
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Identify critical systems.&lt;/strong&gt; List every system that would cause significant impact if it went down. For each, identify who can independently debug production issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Pair programming rotations.&lt;/strong&gt; The fastest way to transfer knowledge. One hour of pairing transfers more knowledge than a week of documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Rotate on-call responsibility.&lt;/strong&gt; If only one person handles incidents for a system, start with shadow on-call where others observe before taking primary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Require multi-person code review.&lt;/strong&gt; For critical systems, require a reviewer who is NOT the primary maintainer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Write Architecture Decision Records (ADRs).&lt;/strong&gt; Document the &lt;em&gt;why&lt;/em&gt; behind decisions. When the original author leaves, successors understand the reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Measure and track quarterly.&lt;/strong&gt; Celebrate when bus factor improves. Flag when it regresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tribal Knowledge Connection
&lt;/h2&gt;

&lt;p&gt;Bus factor and &lt;a href="https://getglueapp.com/blog/tribal-knowledge-software-teams" rel="noopener noreferrer"&gt;tribal knowledge&lt;/a&gt; are two sides of the same coin. High tribal knowledge = low bus factor. When critical understanding lives in one person's head, you're one resignation away from a crisis.&lt;/p&gt;

&lt;p&gt;The fix isn't just documentation — it's making knowledge discoverable and distributing it through deliberate practices.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;getglueapp.com/glossary/bus-factor&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; calculates bus factor automatically from your git history and alerts you when &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt; form in critical systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>teamwork</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>Tribal Knowledge: The $300K Problem Nobody Talks About</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:07:47 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/tribal-knowledge-the-300k-problem-nobody-talks-about-1k1f</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/tribal-knowledge-the-300k-problem-nobody-talks-about-1k1f</guid>
      <description>&lt;p&gt;I watched a $4.2 million engineering hire fail because of something that never showed up in a single dashboard.&lt;/p&gt;

&lt;p&gt;We had recruited a senior architect away from Stripe. Brilliant engineer. Perfect cultural fit. She started on a Monday. By Friday, she had asked the same question to four different people and gotten four different answers about how our payment processing pipeline worked.&lt;/p&gt;

&lt;p&gt;By week six, she was spending more time in Slack archaeology than writing code. By month three, she gave her notice. "I can't be effective here," she told me. "The system makes sense to people who built it. I'm not one of them."&lt;/p&gt;

&lt;p&gt;The system she was describing had a name: &lt;strong&gt;tribal knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tribal Knowledge Actually Is
&lt;/h2&gt;

&lt;p&gt;Tribal knowledge in software development is NOT "stuff we haven't documented yet." That framing makes it sound like a documentation problem with a documentation solution. It's deeper than that.&lt;/p&gt;

&lt;p&gt;Tribal knowledge is the &lt;strong&gt;gap between what your code does and why it does it that way&lt;/strong&gt;. It's the architectural decisions made in a meeting three years ago that nobody recorded. It's the workaround in the billing service that prevents a race condition but looks like a bug to anyone who wasn't there when the incident happened.&lt;/p&gt;

&lt;p&gt;Every codebase has two layers of meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Syntactic:&lt;/strong&gt; what the code literally does (anyone can read this)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic:&lt;/strong&gt; why the code exists in this form (lives in people's heads)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tribal knowledge is that second layer. And it's the layer that determines whether a team can move fast or gets stuck in interpretation loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Compounds Silently
&lt;/h2&gt;

&lt;p&gt;A product manager asks "can we add real-time notifications?" The engineering lead doesn't say "I don't know." They say "let me check with Marcus." Marcus built the event system two years ago. He spends 45 minutes explaining the constraints. The PM gets a qualified answer three days later.&lt;/p&gt;

&lt;p&gt;Everyone treats this as normal. It's not normal. &lt;strong&gt;It's a three-day delay on a thirty-minute question&lt;/strong&gt;, and it happens dozens of times per quarter.&lt;/p&gt;

&lt;p&gt;A new engineer gets assigned a bug in the checkout flow. There's a conditional branch that doesn't make sense. They ask on Slack. Someone responds: "Oh, that handles the edge case from the Acme migration. Don't touch it." No documentation. No comment. No test. The new engineer patches around it. Six months later, someone removes the branch. Production breaks on a Saturday night.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost
&lt;/h2&gt;

&lt;p&gt;For a 40-person engineering team at a Series B SaaS company:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Senior onboarding:&lt;/strong&gt; 12-16 weeks to full productivity (vs 4-6 weeks at well-documented teams). That's $200K-300K in lost productivity annually for 6 hires.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision latency:&lt;/strong&gt; 3-5 days for architectural questions requiring tribal knowledge consultation (vs 2-4 hours when codified).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response:&lt;/strong&gt; MTTR roughly doubles when the on-call engineer doesn't have tribal knowledge about the failing system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Senior engineer time:&lt;/strong&gt; 30-40% of the week spent answering questions instead of building, because they're the only translator between the code and everyone else.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bus Factor Connection
&lt;/h2&gt;

&lt;p&gt;The software industry calls this the &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt;. How many people can disappear before a system becomes unmaintainable?&lt;/p&gt;

&lt;p&gt;For most teams, the honest answer for their most critical systems is &lt;strong&gt;one&lt;/strong&gt;. Sometimes zero, because the person who understood it already left.&lt;/p&gt;

&lt;p&gt;The bus factor problem creates a perverse incentive: the more tribal knowledge you accumulate, the more indispensable you become, and the less time you have to distribute that knowledge. The bottleneck reinforces itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Documentation Doesn't Fix It
&lt;/h2&gt;

&lt;p&gt;Knowledge silos don't form because engineers are bad at documentation. They form because the &lt;strong&gt;incentive structure makes documentation irrational&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Writing code is visible, measurable, and rewarded. It ships features. It closes tickets. Writing documentation is invisible, unmeasurable, and unrewarded. Nobody gets promoted for a great Architecture Decision Record.&lt;/p&gt;

&lt;p&gt;There's a second structural cause: code evolves faster than documentation. You write a systems overview on Monday. By Thursday, two services have been refactored. The overview is now partially wrong. &lt;strong&gt;Partially wrong documentation is worse than no documentation&lt;/strong&gt; because it creates false confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Make knowledge discoverable, not just written.&lt;/strong&gt; The problem isn't that knowledge doesn't exist — it's that it can't be found. Tools that analyze your codebase and extract understanding automatically (&lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt;) create a living knowledge layer that stays current.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pair programming rotations.&lt;/strong&gt; One hour of pairing transfers more knowledge than a week of documentation. Schedule regular pairing sessions where the knowledge holder works alongside someone learning the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Architecture Decision Records (ADRs).&lt;/strong&gt; Document the &lt;em&gt;why&lt;/em&gt; behind decisions. When the original author leaves, successors can understand the reasoning, not just the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Rotate on-call responsibility.&lt;/strong&gt; If only one person can handle incidents for a system, that's a bus factor of 1. Add people to the rotation gradually, starting with shadow on-call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Require multi-person code review.&lt;/strong&gt; For critical systems, require at least one reviewer who is not the primary maintainer. This forces knowledge distribution through the review process.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Measure It
&lt;/h2&gt;

&lt;p&gt;Track these signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time-to-first-commit&lt;/strong&gt; for new engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Ask X" frequency&lt;/strong&gt; in Slack (how often people defer to one person)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-training coverage&lt;/strong&gt; — how many people can independently debug each critical system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR review concentration&lt;/strong&gt; — are reviews always assigned to the same 2-3 people?&lt;/li&gt;
&lt;/ul&gt;
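&lt;p&gt;Two of these signals — cross-training coverage and contributor concentration — have rough proxies in plain git history. A minimal sketch, assuming your critical code lives under a path like &lt;code&gt;src/billing/&lt;/code&gt; (swap in your own):&lt;/p&gt;

```shell
# Contributor concentration for one critical path, last 90 days.
# A single dominant name here is a knowledge-silo warning sign.
git shortlog -sn --since="90 days ago" -- src/billing/

# Cross-training proxy: how many unique people touched this path at all?
git log --since="90 days ago" --format='%aN' -- src/billing/ | sort -u | wc -l
```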

&lt;p&gt;If knowledge is concentrated in a few &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt; and your bus factor is low, you have a tribal knowledge problem. The good news: it's fixable. The bad news: it won't fix itself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/blog/tribal-knowledge-software-teams" rel="noopener noreferrer"&gt;getglueapp.com/blog/tribal-knowledge-software-teams&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; automatically detects knowledge silos and &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; risks from your git history — no surveys, no manual tracking.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>teamwork</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why Your Code Reviews Take 3 Days (and How to Fix It)</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:07:08 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/why-your-code-reviews-take-3-days-and-how-to-fix-it-3bfk</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/why-your-code-reviews-take-3-days-and-how-to-fix-it-3bfk</guid>
      <description>&lt;p&gt;The average PR at most companies sits waiting for review for 24-72 hours. Not because reviewers are lazy. Because the &lt;em&gt;system&lt;/em&gt; is broken.&lt;/p&gt;

&lt;p&gt;Here's what's actually happening and how to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Reasons Code Reviews Are Slow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. PRs Are Too Big
&lt;/h3&gt;

&lt;p&gt;A 50-file PR takes exponentially longer to review than five 10-file PRs. Not just because there's more code — because the reviewer has to build a mental model of all the changes at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research shows:&lt;/strong&gt; Defect-detection rates drop sharply once a review exceeds roughly 400 lines of changes (SmartBear's study of code reviews at Cisco is the most-cited source). Past 1000 lines, reviewers start rubber-stamping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Break work into smaller PRs. Ship behind feature flags if needed. A PR should do ONE thing.&lt;/p&gt;
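&lt;p&gt;You can check your current PR size distribution from git alone. A sketch that assumes a merge-commit workflow (squash-merge repos would inspect the squashed commits instead):&lt;/p&gt;

```shell
# Diff size of each of the last 20 merge commits, i.e. roughly the
# last 20 merged PRs. Watch for entries well past the ~400-line mark.
git log --merges -20 --format='%h' | while read -r c; do
  printf '%s ' "$c"
  git diff --shortstat "${c}^1" "$c"
done
```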

&lt;h3&gt;
  
  
  2. Only 1-2 People Can Review Critical Areas
&lt;/h3&gt;

&lt;p&gt;If your billing service PRs always go to the same person, you've created a bottleneck. That person also has their own work to do. Your PR sits in their queue behind 5 others.&lt;/p&gt;

&lt;p&gt;This is a &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; problem in disguise. If only Marcus can review billing PRs, what happens when Marcus is on vacation?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Cross-train reviewers. Pair junior engineers with seniors on reviews. Expand the pool of qualified reviewers for each critical area.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. No Shared Context
&lt;/h3&gt;

&lt;p&gt;The reviewer opens the PR and thinks: "What is this trying to do? Why is it changing the auth flow? What ticket is this for?" They spend 20 minutes just understanding the &lt;em&gt;intent&lt;/em&gt; before they can evaluate the &lt;em&gt;implementation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Write PR descriptions. Not novels — just:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What this changes&lt;/li&gt;
&lt;li&gt;Why&lt;/li&gt;
&lt;li&gt;How to test it&lt;/li&gt;
&lt;li&gt;Any risks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your codebase has a lot of &lt;a href="https://getglueapp.com/blog/tribal-knowledge-software-teams" rel="noopener noreferrer"&gt;tribal knowledge&lt;/a&gt;, even understanding the code being changed requires asking someone. Consider investing in &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; to make system understanding self-serve.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Review Is Not Scheduled
&lt;/h3&gt;

&lt;p&gt;Most engineers treat code review as an interruption — something they do between their "real" work. So reviews happen whenever the reviewer has a gap, which might be never.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Block time for reviews. Two 30-minute review blocks per day (morning and afternoon) creates a max 4-hour wait time. Some teams use "review o'clock" — a daily 30-minute slot where the whole team does reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Unclear Standards
&lt;/h3&gt;

&lt;p&gt;Reviewers spend time debating style (tabs vs spaces, naming conventions) instead of substance (correctness, performance, security). These debates are slow and demoralizing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Automate style enforcement. ESLint, Prettier, Black, gofmt — whatever your language has. If a machine can catch it, a human shouldn't be spending review time on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compound Effect
&lt;/h2&gt;

&lt;p&gt;Slow reviews → larger batch sizes (devs pile up changes while waiting) → even slower reviews → even larger batches.&lt;/p&gt;

&lt;p&gt;This directly impacts your &lt;a href="https://getglueapp.com/glossary/dora-metrics" rel="noopener noreferrer"&gt;DORA metrics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lead time&lt;/strong&gt; increases because code sits in review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment frequency&lt;/strong&gt; drops because changes batch up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change failure rate&lt;/strong&gt; increases because large PRs hide bugs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MTTR&lt;/strong&gt; increases because it's harder to identify which change caused an issue&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Good Looks Like
&lt;/h2&gt;

&lt;p&gt;Elite engineering teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Median PR size: &amp;lt;200 lines&lt;/li&gt;
&lt;li&gt;Median time-to-first-review: &amp;lt;4 hours&lt;/li&gt;
&lt;li&gt;Median time-to-merge: &amp;lt;24 hours&lt;/li&gt;
&lt;li&gt;3+ qualified reviewers for every critical area&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Start Here
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;This week:&lt;/strong&gt; Measure your current median time-to-merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next sprint:&lt;/strong&gt; Implement "review o'clock" — 30 minutes daily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This quarter:&lt;/strong&gt; Cross-train at least 2 additional reviewers for your most bottlenecked area&lt;/li&gt;
&lt;/ol&gt;
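&lt;p&gt;For step 1, a sketch using the GitHub CLI and its built-in jq filter; &lt;code&gt;createdAt&lt;/code&gt; and &lt;code&gt;mergedAt&lt;/code&gt; are real &lt;code&gt;gh&lt;/code&gt; JSON fields, but adjust the sample size to taste:&lt;/p&gt;

```shell
# Median hours from open to merge across the last 50 merged PRs.
gh pr list --state merged --limit 50 --json createdAt,mergedAt \
  --jq 'map((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601))
        | sort | .[(length / 2 | floor)] / 3600'
```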

&lt;p&gt;The bottleneck isn't your reviewers. It's the system around them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; helps identify review bottlenecks, &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge concentration&lt;/a&gt;, and &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; risks — so you can fix the system, not blame the people.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>codereview</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>The 10-Minute Codebase Health Check: A Checklist for Every Sprint</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:07:05 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/the-10-minute-codebase-health-check-a-checklist-for-every-sprint-46oe</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/the-10-minute-codebase-health-check-a-checklist-for-every-sprint-46oe</guid>
      <description>&lt;p&gt;You check your production monitoring dashboards daily. You review your DORA metrics monthly. But when was the last time you checked the health of the &lt;em&gt;codebase itself&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Here's a 10-minute checklist you can run at the start of every sprint to catch problems before they become incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Knowledge Concentration (2 min)
&lt;/h3&gt;

&lt;p&gt;Open your git log for the last 30 days. For your 5 most critical services, count how many unique contributors made changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace 'src/billing' with your critical path&lt;/span&gt;
git log &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"30 days ago"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%aN'&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; src/billing/ | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; If any critical service has only 1 contributor, your &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; is 1 for that service. One resignation away from a crisis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Schedule a pairing session this sprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. PR Review Bottlenecks (2 min)
&lt;/h3&gt;

&lt;p&gt;Check your average PR merge time for the last 2 weeks. Most CI/CD tools or GitHub itself can show this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; If average merge time is &amp;gt;48 hours, you have a review bottleneck. This directly impacts your &lt;a href="https://getglueapp.com/glossary/dora-metrics" rel="noopener noreferrer"&gt;DORA lead time metric&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Identify which PRs are waiting the longest and why.&lt;/p&gt;
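&lt;p&gt;If you use the GitHub CLI, surfacing the stragglers is a one-liner:&lt;/p&gt;

```shell
# Five oldest open PRs: the ones most likely stuck behind a bottleneck.
gh pr list --json number,title,createdAt \
  --jq 'sort_by(.createdAt) | .[:5][] | "#\(.number)  \(.createdAt)  \(.title)"'
```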

&lt;h3&gt;
  
  
  3. Test Coverage Trends (1 min)
&lt;/h3&gt;

&lt;p&gt;Don't look at absolute coverage — look at the &lt;em&gt;direction&lt;/em&gt;. Is coverage going up or down over the last month?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; Declining coverage means new code is being shipped without tests. This increases your change failure rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Require coverage checks in CI for new PRs.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Dependency Freshness (2 min)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For Node.js&lt;/span&gt;
npx npm-check-updates

&lt;span class="c"&gt;# For Python&lt;/span&gt;
pip list &lt;span class="nt"&gt;--outdated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; Dependencies more than 2 major versions behind, especially security-critical ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Schedule a dependency update session. Don't let it pile up.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Dead Code and Unused Imports (1 min)
&lt;/h3&gt;

&lt;p&gt;Run your linter's unused import/variable check. In large codebases, dead code accumulates and confuses new team members.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; Hundreds of unused imports or exports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Add a lint rule to block new unused imports in CI.&lt;/p&gt;
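&lt;p&gt;The exact command depends on your stack; each of the tools below is real, but verify it fits your setup before wiring it into CI:&lt;/p&gt;

```shell
npx eslint . --quiet              # JS/TS, with no-unused-vars enabled
npx ts-prune                      # TypeScript: finds unused exports
ruff check --select F401,F841 .   # Python: unused imports and variables
```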

&lt;h3&gt;
  
  
  6. Cross-Team Coupling (1 min)
&lt;/h3&gt;

&lt;p&gt;Look at your last 10 PRs. How many required changes in code owned by another team?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; If &amp;gt;30% of your PRs touch other teams' code, you have a &lt;a href="https://getglueapp.com/blog/conways-law" rel="noopener noreferrer"&gt;Conway's Law&lt;/a&gt; problem. Your architecture and team boundaries are misaligned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Discuss with the other team whether an API boundary would be better.&lt;/p&gt;
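&lt;p&gt;A rough proxy straight from git, assuming a merge-commit workflow and top-level directories that loosely track team ownership:&lt;/p&gt;

```shell
# Top-level directories touched by the last 10 merge commits.
git log --merges -10 --format='%h' | while read -r c; do
  git diff --name-only "${c}^1" "$c"
done | cut -d/ -f1 | sort | uniq -c | sort -rn
```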

&lt;h3&gt;
  
  
  7. Documentation Freshness (1 min)
&lt;/h3&gt;

&lt;p&gt;Check the last modified date on your main README and architecture docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; &amp;gt;6 months since last update. The docs are probably wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Assign someone to review and update during this sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate What You Can
&lt;/h2&gt;

&lt;p&gt;Most of these checks can be scripted and added to a weekly Slack notification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Codebase Health Report ==="&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Bus Factor (billing):"&lt;/span&gt;
git log &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"30 days ago"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'%aN'&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; src/billing/ | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Avg PR Age (open):"&lt;/span&gt;
gh &lt;span class="nb"&gt;pr &lt;/span&gt;list &lt;span class="nt"&gt;--json&lt;/span&gt; createdAt &lt;span class="nt"&gt;--jq&lt;/span&gt; &lt;span class="s1"&gt;'.[].createdAt'&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Outdated Deps:"&lt;/span&gt;
npx npm-check-updates 2&amp;gt;/dev/null | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a more comprehensive, always-on view, &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence tools&lt;/a&gt; can track all of these metrics automatically and alert you when things degrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Point
&lt;/h2&gt;

&lt;p&gt;Most codebase problems are visible weeks before they become incidents. The difference between teams that catch them early and teams that don't isn't talent — it's having a habit of looking.&lt;/p&gt;

&lt;p&gt;10 minutes per sprint. That's all it takes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want automated codebase health monitoring? &lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; tracks &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;code health&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt;, &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt;, and dependency risks continuously.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>codequality</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>7 Technical Debt Patterns That Are Actually Costing You Money</title>
      <dc:creator>Sahil Singh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 10:06:38 +0000</pubDate>
      <link>https://dev.to/glue_admin_3465093919ac6b/7-technical-debt-patterns-that-are-actually-costing-you-money-37p4</link>
      <guid>https://dev.to/glue_admin_3465093919ac6b/7-technical-debt-patterns-that-are-actually-costing-you-money-37p4</guid>
      <description>&lt;p&gt;If you've been shipping code for more than a year, you have technical debt. The question isn't whether it exists — it's whether you can see it, measure it, and have a plan to address it.&lt;/p&gt;

&lt;p&gt;Most teams feel the drag: slow deployments, fragile tests, the growing anxiety around "what breaks if we touch this module?" But they can't articulate specifically what the debt is.&lt;/p&gt;

&lt;p&gt;After a decade in codebases of all sizes, I've found the damage doesn't come from abstract debt. It comes from &lt;strong&gt;concrete patterns that recur across teams&lt;/strong&gt;. These seven patterns are the ones that actually slow you down.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Dependency Tangling
&lt;/h2&gt;

&lt;p&gt;Modules that should be independent have become tightly coupled through ad-hoc integration. You can't change the API gateway without touching the database layer. Updating the auth service means modifying three payment modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; Try to extract a module for reuse or testing. Discover it imports from 8+ other modules, and those modules import back. The dependency graph isn't a tree — it's a mesh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; Every change becomes risky. Testing becomes expensive. Onboarding slows because nobody can understand code in isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Map the actual &lt;a href="https://getglueapp.com/blog/dependency-mapping" rel="noopener noreferrer"&gt;dependency graph&lt;/a&gt;. Make coupling explicit with defined APIs. Introduce a layering strategy: presentation → business logic → infrastructure, nothing flowing backward.&lt;/p&gt;
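&lt;p&gt;For a JS/TS codebase, &lt;code&gt;madge&lt;/code&gt; is one tool that can flag cycles and render the graph (Python has &lt;code&gt;pydeps&lt;/code&gt;, Go has &lt;code&gt;go mod graph&lt;/code&gt;); adjust the extensions for your project:&lt;/p&gt;

```shell
npx madge --circular --extensions ts,tsx src/   # list circular imports
npx madge --image graph.svg src/                # render the dependency graph
```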

&lt;h2&gt;
  
  
  2. God Objects
&lt;/h2&gt;

&lt;p&gt;A single class or module that knows too much and does too much. The &lt;code&gt;UserService&lt;/code&gt; that handles authentication, authorization, profile management, notification preferences, AND billing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; The class has dozens of public methods with nothing in common. The file is 2000+ lines. Pull requests to this file are always massive and touch unrelated logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; Impossible to test. False bottlenecks (everyone waiting for everyone else's changes). One misunderstood invariant breaks half the application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Break apart by responsibility. &lt;code&gt;UserService&lt;/code&gt; → &lt;code&gt;AuthService&lt;/code&gt; + &lt;code&gt;AuthorizationPolicy&lt;/code&gt; + &lt;code&gt;ProfileManager&lt;/code&gt;. Start with the smallest responsibility you can separate cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Implicit Contracts
&lt;/h2&gt;

&lt;p&gt;Interfaces that work only because of undocumented assumptions about call order, data format, or environment state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; Something works in production but fails in tests. The difference is some invisible precondition. Engineers regularly get surprised by how a system behaves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; Systems become fragile. Debugging takes forever. Refactoring becomes dangerous because you don't know what assumptions the code depends on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Make contracts explicit. Add assertions. Document sequences. Use type systems to encode invariants. If initialization order matters, enforce it in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Test Debt
&lt;/h2&gt;

&lt;p&gt;Production code that can't be tested without heroic mocking effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; Test files are longer than the code they test, and half the test is setup. You avoid writing tests for certain modules because "it's too complicated."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; You lose confidence in changes. Tests don't catch regressions because they're brittle. You ship bugs because you only test through manual clicks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Invert dependencies. Push external connections to the edges. Use dependency injection. Start small — make your &lt;em&gt;next&lt;/em&gt; module testable.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Configuration Sprawl
&lt;/h2&gt;

&lt;p&gt;Environment-specific logic scattered across the codebase. If-statements checking &lt;code&gt;env === "production"&lt;/code&gt;. Different S3 bucket names hardcoded in three different modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; Deployment to a new environment requires code changes. Environment-specific bugs can't be reproduced locally. Same config defined in three places with different values.&lt;/p&gt;
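&lt;p&gt;A crude but fast way to size the sprawl in a JS/TS codebase (adjust the patterns and file globs for your stack):&lt;/p&gt;

```shell
# Every hit outside your one config module is a config-sprawl smell.
grep -rn --include='*.ts' --include='*.js' 'process\.env' src/ | wc -l
grep -rn 'env === "production"' src/
```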

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; Error-prone deployments. Can't safely test without running in the actual environment. Adding a new environment requires changes throughout the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Centralize configuration. Read environment variables once, at startup. Feature flags in a single source of truth. Code should be environment-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Parallel Implementations
&lt;/h2&gt;

&lt;p&gt;Multiple implementations of the same logic existing simultaneously because nobody knew (or trusted) the existing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; Search for "date formatting" and find five different utility files. Three different HTTP client wrappers. Two implementations of the same business rule in different services.&lt;/p&gt;
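&lt;p&gt;Grep is the cheapest detector; the helper name below is illustrative, so swap in whatever you suspect is duplicated:&lt;/p&gt;

```shell
# How many files define their own date formatter?
grep -rln 'formatDate' src/ | wc -l
```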

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; Bug fixes need to be applied in multiple places (and they never are). Behavior becomes inconsistent. The codebase grows without adding value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Search before writing. Use &lt;a href="https://getglueapp.com/glossary/codebase-intelligence" rel="noopener noreferrer"&gt;codebase intelligence&lt;/a&gt; to discover existing implementations. Consolidate gradually. Don't create new utilities without checking what already exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Knowledge Concentration
&lt;/h2&gt;

&lt;p&gt;When critical system understanding lives in one person's head. Not a code pattern, but it's the most expensive debt of all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to spot it:&lt;/strong&gt; "Ask Marcus, he built that." PRs for a critical service always assigned to the same reviewer. When that person is on vacation, changes to their service wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it costs:&lt;/strong&gt; That person becomes a bottleneck. When they leave, the team loses months of productivity. The &lt;a href="https://getglueapp.com/glossary/bus-factor" rel="noopener noreferrer"&gt;bus factor&lt;/a&gt; for the system is 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Pair programming rotations. Require multi-person code review for critical systems. Document architectural decisions (ADRs). Track &lt;a href="https://getglueapp.com/glossary/knowledge-silo" rel="noopener noreferrer"&gt;knowledge silos&lt;/a&gt; explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Prioritize
&lt;/h2&gt;

&lt;p&gt;Not all debt is equal. Prioritize by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius&lt;/strong&gt; — How many things break when this area changes?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change frequency&lt;/strong&gt; — How often does this area need to change?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team impact&lt;/strong&gt; — How many engineers are slowed by this?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Debt in code that changes weekly and affects 10 engineers is more urgent than debt in code that hasn't been touched in a year.&lt;/p&gt;
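&lt;p&gt;Change frequency is directly measurable from git; a common hotspot query (cross its output with your own judgment about blast radius and team impact):&lt;/p&gt;

```shell
# Most frequently changed files over the last 6 months.
git log --since="6 months ago" --format= --name-only | awk 'NF' \
  | sort | uniq -c | sort -rn | head -15
```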

&lt;p&gt;The mistake most teams make is treating tech debt as one amorphous backlog item. Break it into specific patterns. Measure each one. Fix the ones that cost the most.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://getglueapp.com/blog/tech-debt-patterns" rel="noopener noreferrer"&gt;getglueapp.com/blog/tech-debt-patterns&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://getglueapp.com" rel="noopener noreferrer"&gt;Glue&lt;/a&gt; helps you identify these patterns automatically — mapping dependency tangles, knowledge concentration, and &lt;a href="https://getglueapp.com/glossary/code-health" rel="noopener noreferrer"&gt;code health&lt;/a&gt; across your entire codebase.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>devops</category>
      <category>codequality</category>
    </item>
  </channel>
</rss>
