DEV Community: Aditya satrio nugroho

Why Your Team Keeps Ignoring Your Instructions (And What Actually Works)

Aditya satrio nugroho — Thu, 28 May 2026 12:46:02 +0000

You've said it in the sprint planning. You've said it in the 1:1. You've put it in the team wiki. And yet — the same thing keeps happening.

PRs without ticket references. Deployments without checklists. Steps skipped "just this once" because there was deadline pressure.

Sound familiar?

This isn't a knowledge problem. Your team knows the rule. This is a behavior problem. And behavior problems require a completely different approach than just explaining things better or louder.

The Three Levels Most Managers Go Through

Most engineering managers — especially in fast-moving startups — tend to escalate through the same pattern when trying to enforce team discipline:

Level 1 — Verbal instruction: You explain the rule, why it exists, what benefit it brings. You do this in team meetings, in Slack, maybe even in a nicely formatted Confluence page.

Level 2 — Micromanagement: When level 1 doesn't stick, you start observing more closely. You remind people in code reviews. You bring it up again in standups. You personally check that things are done.

Level 3 — System enforcement: You build the guardrail. The pipeline rejects PRs without Jira codes. The checklist is a required form. The system blocks the wrong behavior automatically.

Most managers spend too long in Level 2, burning themselves out, before getting to Level 3. And the worst part? Level 2 only works while you're watching.

Why Explaining Isn't Enough

Here's the uncomfortable truth: your team isn't ignoring you because they don't understand. They're ignoring you because the cost of non-compliance is zero until you enforce it.

Human brains are efficiency machines. We take the path of least resistance. If skipping the Jira code on a PR takes 5 seconds and saves mental overhead during a deadline crunch — and nothing happens — the brain logs that as: acceptable shortcut.

Your explanation created awareness. It did not create consequences. And without consequences, awareness alone fades fast.

There's also a cultural layer to this, particularly in startup environments. Teams that have been through multiple managers, multiple "initiatives of the month," learn something: instructions eventually fade. So they wait to see if this one is real. They're not being malicious. They're being rational based on past experience.

The WIIFM Problem

One of the biggest mistakes in process enforcement is framing everything around what benefits the manager or the organization.

"We need Jira codes so I can track velocity."
"We need this checklist so the audit is clean."
"We need this because compliance requires it."

Your team hears: this is overhead that helps the boss, not me.

WIIFM — What's In It For Me — is the real question every engineer is silently asking. Until you answer it from their perspective, you're asking people to add friction to their day for someone else's benefit.

Some reframes that actually land:

Jira code on a PR = your work gets credited. If you use any engineering metrics tooling, unlinked PRs are invisible contributions. Your output disappears from the data.
Linked PRs = faster code reviews. Reviewers understand context immediately. Fewer back-and-forth questions. Less time blocked waiting for review.
Audit trail = protection during incidents. When something breaks in production, the person with clean, linked commit history is the one who can clearly show what they were working on and what was in scope.

But here's the thing — even if you nail the WIIFM framing, it still might not be enough. Because WIIFM changes motivation. It doesn't change habits. And habits need systems.

Six Behavior Frameworks That Actually Explain What's Happening

After going through this cycle enough times, it's worth understanding the underlying mechanics. These frameworks explain why teams behave the way they do — and more importantly, what to do about it.

1. BJ Fogg's Behavior Model

Fogg's model says behavior only happens when three things converge at the same moment: Motivation + Ability + Prompt.

Most managers nail motivation (explaining why) but completely miss the prompt. You explain the Jira code rule in a Monday meeting. Three days later, an engineer is pushing a hotfix at 11pm under deadline pressure. The motivation has faded. The prompt isn't there. The behavior doesn't happen.

What to do: Put the prompt at the exact moment the behavior needs to occur. A PR template with a mandatory Jira field triggers at the right moment — when the PR is being opened, not three days earlier in a meeting.

2. Nudge Theory

From Thaler and Sunstein: you don't need to force behavior if you design the environment so the desired behavior is the default.

Most process enforcement is designed as a wall — do the wrong thing and something blocks you. But walls require constant maintenance and create resentment. Nudges work differently — they make the right behavior the easiest behavior.

What to do: A PR template that pre-fills the Jira ticket field with a placeholder makes filling it in take 5 seconds. Leaving it blank takes more effort. You've flipped the friction. The wall (pipeline rejection) is the backup, not the primary mechanism.

3. The Consequences Model

This one is simple but often ignored: behavior that has no immediate consequence doesn't change.

The consequence needs to be three things: immediate, consistent, and certain. Not severe — just unavoidable.

A verbal reminder in a 1:1 next week is a delayed, inconsistent consequence. A pipeline that rejects a PR right now, in front of the team, before they can move on, is an immediate and certain consequence. The immediacy is what makes it register.

What to do: Automate the consequence. Don't rely on yourself to catch it and bring it up later. The system should be the one saying no — not you.
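As a minimal sketch of what "automate the consequence" can look like, the check below is the kind of script a CI job could run against a PR title. The function names, the regex, and the sample titles are illustrative assumptions, not any specific CI product's API. Ticket keys are assumed to look like `PAY-482`.

```python
import re

# Assumed ticket format: uppercase project key, hyphen, number (e.g. PAY-482).
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def has_ticket_reference(pr_title: str) -> bool:
    """Return True when the PR title contains something shaped like a ticket key."""
    return TICKET_PATTERN.search(pr_title) is not None


def check_pr(pr_title: str) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 blocks it."""
    if has_ticket_reference(pr_title):
        return 0
    print(f"Rejected: no ticket reference found in PR title: {pr_title!r}")
    return 1


if __name__ == "__main__":
    # Hypothetical titles; a real job would read the title from the CI environment.
    for title in ["PAY-482 Fix retry backoff", "quick hotfix for checkout"]:
        print(title, "->", "pass" if check_pr(title) == 0 else "blocked")
```

The check only looks at the shape of the key, not whether the ticket exists. It also applies the same rule to every author, so the "no" comes from the system rather than from the manager.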

4. Renting vs. Owning Behavior

This is the most important distinction for busy engineering managers.

When you are the enforcement mechanism, you're renting behavior. The team complies because you're watching. The moment you're heads-down on a hiring cycle, or in back-to-back stakeholder meetings, or on leave — behavior reverts. You were the rule, not the system.

Owned behavior means the team follows the rule when you're not there. You don't get that from verbal instructions. You get it from consistent system enforcement over time, until the system becomes the authority — not the manager.

What to do: Ask yourself: "Would this process survive a 2-week vacation without me?" If the answer is no, you have rented behavior. Build the system.

5. Edmondson's Peer Compliance Effect

Amy Edmondson's research on team dynamics shows something important: people calibrate their behavior to what they observe their peers doing — especially influential peers — not just what leaders say.

If your most senior or most respected engineer submits a PR without a Jira code and it gets merged (even once, even with good reason), every other engineer reads that as the real rule: senior engineers can skip this under pressure. And now everyone starts finding their own version of "pressure."

What to do: Make enforcement hierarchy-blind. The pipeline should reject a senior engineer's PR the same way it rejects a junior's. Impersonal, automated enforcement removes the social dynamics from the equation. It's not about punishing the senior — it's about making the rule real for everyone equally.

6. Rule Erosion

Here's the one that managers cause themselves without realizing it.

Every exception you allow — even with a completely valid reason — sends a signal: this rule has conditions. Your team doesn't hear "the rule still applies, this was just an emergency." They hear: "I just need to find the right condition."

Over time, the rule erodes. Not because anyone is being defiant, but because you've taught them the rule is negotiable.

What to do: Separate the exception from the rule. A production incident happens and a hotfix needs to be merged fast — fine. But require a follow-up: the Jira link must be added within 24 hours, or a retroactive ticket created. The rule still applies. The timing is flexible. This closes the exit without blocking production, and preserves the integrity of the rule.
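The "rule still applies, timing is flexible" idea can be sketched as a small status check that a scheduled job might run over recently merged hotfixes. This is an illustration under assumptions: the function name and the three status labels are hypothetical, and only the 24-hour window comes from the rule above.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 24-hour follow-up window comes from the rule described above.
FOLLOW_UP_WINDOW = timedelta(hours=24)


def hotfix_status(merged_at: datetime,
                  ticket_linked_at: Optional[datetime],
                  now: datetime) -> str:
    """Classify a hotfix that was merged without a ticket reference.

    Returns "compliant" if a ticket was linked within the window,
    "pending" if no ticket is linked yet but the window is still open,
    and "overdue" in every other case (including a late link).
    """
    deadline = merged_at + FOLLOW_UP_WINDOW
    if ticket_linked_at is not None and ticket_linked_at <= deadline:
        return "compliant"
    if ticket_linked_at is None and now <= deadline:
        return "pending"
    return "overdue"


if __name__ == "__main__":
    merged = datetime(2026, 5, 28, 23, 0)  # hypothetical 11pm hotfix
    print(hotfix_status(merged, None, merged + timedelta(hours=10)))
    print(hotfix_status(merged, None, merged + timedelta(hours=30)))
```

An "overdue" result is where the automated consequence would attach, for example a reminder or a blocked next merge. Which consequence to use is a team decision, not something this sketch prescribes.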

The Decision Framework: Which Tool For Which Problem

Not every situation calls for the same response. Here's a simple way to map your problem to the right framework:

Your Problem	Start With
Team forgets the rule at execution time	Fogg Behavior Model + Nudge Theory
Team knows but doesn't bother	Consequences Model + Rule Erosion
Behavior improves when you're around, regresses when you're not	Renting vs Owning + Consequences Model
A senior engineer is setting the wrong example	Edmondson Peer Compliance

In most real situations, you need Nudge + Consequences + Rule Erosion running together. The others help you diagnose what's really happening.

The Hardest Part: Protecting the System From Yourself

You can build a perfect pipeline. You can design the right nudges. You can get buy-in from the team.

And then you override the system once, with a good reason, under pressure.

That one override costs more than you think. The team noticed. And they'll use it as a reference point the next time they need an exception.

The real skill isn't building the enforcement system. It's having the discipline to protect it — including from your own judgment calls in the moment.

If you need exceptions to exist (and you will), formalize them. Make the exception process explicit and documented. "Hotfixes can merge without a Jira code IF a follow-up ticket is created within 24 hours" is a better rule than "no exceptions" that gets violated, because it's honest and it closes the gap.

The Bigger Picture

Process discipline in engineering teams isn't really about process. It's about trust and predictability.

When a team consistently does what they say they'll do — when the agreed process is actually followed — it creates a foundation where people can rely on each other, where metrics are trustworthy, and where quality compounds over time.

That foundation doesn't come from better explanations. It comes from systems that enforce the right behavior consistently, leaders who protect those systems, and enough time for the behavior to become habit.

The verbal instruction was never going to be enough. It was always going to end with a pipeline rejection. The question is just how long you spend in the middle before you get there.

Lessons from a MySQL Migration: What We Learned and How to Do It Better Next Time

Aditya satrio nugroho — Sat, 04 Oct 2025 05:35:12 +0000

We migrated our MySQL database. It “worked,” until it didn’t: the new DB’s size didn’t match the old one. Same schema, same rows—different footprint. That tiny mismatch pushed us to build a real migration playbook: understand what’s happening, prove data equality, and leave a paper trail that stakeholders actually trust.

Here’s the journey—told as we lived it—with commands, expected outputs, and the why behind each step.

Step 1 — Capture More Than Just Rows

Before moving data, we grabbed the database’s shape and logic. Otherwise you carry the data but lose the rules that make it behave.

Command

mysqldump -h OLD_HOST -u root -p --no-data --skip-triggers OLD_DB > schema_only.sql
mysqldump -h OLD_HOST -u root -p --no-data --no-create-info --routines --triggers --events OLD_DB > routines.sql

Expected Output

schema_only.sql contains only CREATE TABLE ... statements (no INSERT INTO).
routines.sql contains CREATE PROCEDURE, CREATE FUNCTION, CREATE TRIGGER, CREATE EVENT.
You’ll likely see DEFINER= clauses—note them.

Why this matters

Schema/routine parity prevents silent logic drift (e.g., missing trigger = missing audit row).
Pro tip: keep these files in source control for diffs across migrations.

Step 2 — Dump & Restore (Safely and Predictably)

We wanted a consistent snapshot without table-level locks that freeze the app.

Command

mysqldump -h OLD_HOST -u root -p \
  --single-transaction --quick \
  --routines --triggers --events \
  --default-character-set=utf8mb4 \
  OLD_DB > old_db_dump.sql

mysql -h NEW_HOST -u root -p -e "CREATE DATABASE IF NOT EXISTS NEW_DB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;"
mysql -h NEW_HOST -u root -p NEW_DB < old_db_dump.sql

Expected Output

old_db_dump.sql is large and full of INSERT INTO ... lines.
SHOW TABLES FROM NEW_DB; lists the same tables as OLD_DB.
Critical tables pass a spot-check: SELECT COUNT(*) FROM big_table; (numbers line up or are close; exact checks later).

Why this matters

--single-transaction gives a consistent InnoDB snapshot without blocking writers.
--quick streams rows to keep memory flat.

Now we had data in place—but sizes looked off. Time to measure, then normalize.

Step 3 — Snapshot the Raw Footprint

We measured where space was going before any tuning.

Command

SELECT table_name,
       ENGINE,
       TABLE_ROWS, -- estimate for InnoDB
       ROUND((data_length + index_length)/1024/1024, 2) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'NEW_DB'
ORDER BY size_mb DESC;

Expected Output (sample)

+------------+--------+-----------+---------+
| table_name | ENGINE | TABLE_ROWS| size_mb |
+------------+--------+-----------+---------+
| orders     | InnoDB | 1000000   | 350.12  |
| users      | InnoDB |  500000   |  80.45  |

Why this matters

In InnoDB, TABLE_ROWS is an estimate; sizes reflect fragmentation and stale stats after bulk load.
Don’t conclude inequality yet—stats come next.

The “aha” moment came when we ran ANALYZE/OPTIMIZE. Here’s the deeper why.

Step 4 — The Deep Dive: ANALYZE and OPTIMIZE (What They Do, and Why You Should Care)

After bulk inserts, InnoDB pages are fragmented and index statistics are stale. The optimizer can’t “see” reality, so it guesses—sometimes badly. Two tools fix that:

ANALYZE TABLE: refresh index statistics (cardinality)

What it does: Re-samples index distributions and updates cardinality (estimated unique values per index).
Why it matters: The optimizer chooses join order and index paths based largely on cardinality. Bad cardinality → bad plans.
Where it lives: With innodb_stats_persistent=ON (default in MySQL 8), stats are stored persistently and survive restarts.
Histograms: MySQL 8 supports column histograms to model non-indexed predicates:

ANALYZE TABLE my_table UPDATE HISTOGRAM ON col1, col2 WITH 128 BUCKETS;

Check with:

SELECT * FROM information_schema.COLUMN_STATISTICS
WHERE SCHEMA_NAME='NEW_DB' AND TABLE_NAME='my_table';

(If the query returns no rows, no histogram has been created for that table yet; COLUMN_STATISTICS is populated only by ANALYZE TABLE ... UPDATE HISTOGRAM.)

OPTIMIZE TABLE: rebuild and defragment

What it does: For InnoDB, effectively rebuilds the table and its indexes (similar to ALTER TABLE ... ENGINE=InnoDB), compacting pages and reclaiming space.
Why it matters: You get a cleaner on-disk layout, tighter B-trees, and often a smaller file size that now resembles your source DB more closely.
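Locking/perf: On large tables, it’s heavy. For InnoDB in MySQL 8, the rebuild runs as online DDL, so reads and writes continue for most of it, but it still needs extra disk space and I/O and takes a brief exclusive lock at the start and end. Plan it off-hours for big tables.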

Commands

-- For a single table
ANALYZE TABLE my_table;
OPTIMIZE TABLE my_table;

-- Batch all tables
SELECT CONCAT('ANALYZE TABLE `', table_name, '`; OPTIMIZE TABLE `', table_name, '`;')
FROM information_schema.tables
WHERE table_schema='NEW_DB';

(Run the generator with mysql -N -B so the output contains only the statements, then pipe it into a second mysql invocation to execute them in one go.)
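If the table list comes from somewhere other than information_schema (a migration manifest, for example), the same statements can be built in a script. The sketch below is a hedged illustration: the helper names and the sample table names are hypothetical, and it only produces SQL text without connecting to a server.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def maintenance_statements(table_names):
    """Yield one ANALYZE and one OPTIMIZE statement per table, in input order."""
    for name in table_names:
        quoted = quote_identifier(name)
        yield f"ANALYZE TABLE {quoted};"
        yield f"OPTIMIZE TABLE {quoted};"


if __name__ == "__main__":
    # Hypothetical table names, matching the samples used in this post.
    for statement in maintenance_statements(["orders", "users"]):
        print(statement)
```

Doubling embedded backticks keeps unusual table names from breaking the generated statement. The CONCAT-based generator above does not do this escaping.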

Expected Output

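+------------------+----------+----------+----------+
| Table            | Op       | Msg_type | Msg_text |
+------------------+----------+----------+----------+
| NEW_DB.my_table  | analyze  | status   | OK       |
| NEW_DB.my_table  | optimize | status   | OK       |

For InnoDB tables, OPTIMIZE TABLE also returns a note row before the status row (“Table does not support optimize, doing recreate + analyze instead”). That message is expected: the table is rebuilt and re-analyzed.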

Verifying Cardinality Improved

SHOW INDEX FROM my_table;

Expected Output (excerpt)

+----------+------------+----------+--------------+------------+
| Table    | Non_unique | Key_name | Column_name  | Cardinality|
+----------+------------+----------+--------------+------------+
| my_table |          0 | PRIMARY  | id           |   999800   |
| my_table |          1 | idx_cust | customer_id  |    54012   |

After ANALYZE, Cardinality should look realistic (not suspiciously tiny like 1 or 2 on huge tables).
If predicates rely on non-indexed columns, consider histograms (above). They don’t change cardinality but drastically improve selectivity estimates for those columns.

With stats refreshed and fragmentation reduced, our sizes converged. But “looks good” isn’t enough—we wanted proofs.

Step 5 — Prove Equality with Percona Toolkit (and More You Can Do)

We rely on Percona Toolkit because it’s built for production-grade checks.

pt-table-checksum: detect row-level differences

How it works: Splits each table into chunks (ranges by PK), computes checksums (CRC32) per chunk on the source, then compares on the target (best with replication; otherwise compare results tables).
Why it’s great: It scales. Every row is still read, but in small index-range chunks, so you get a precise answer without one long, locking scan per table.
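To make the chunk-and-checksum idea concrete, here is a toy version that works on in-memory rows. It is not how pt-table-checksum is implemented (the real tool computes checksums in SQL on the server); the function names, the integer primary keys, and the sample rows are assumptions made for illustration only.

```python
import zlib


def chunk_checksums(rows, range_size):
    """Map each primary-key range to a CRC32 of the rows that fall inside it.

    Each row is a tuple whose first element is an integer primary key.
    """
    buckets = {}
    for row in sorted(rows, key=lambda r: r[0]):
        buckets.setdefault(row[0] // range_size, []).append(row)
    return {
        bucket: zlib.crc32("\n".join(repr(r) for r in chunk).encode("utf-8"))
        for bucket, chunk in buckets.items()
    }


def differing_chunks(source_rows, target_rows, range_size=100):
    """Return the key ranges whose checksums differ or exist on one side only."""
    source = chunk_checksums(source_rows, range_size)
    target = chunk_checksums(target_rows, range_size)
    return sorted(
        bucket for bucket in source.keys() | target.keys()
        if source.get(bucket) != target.get(bucket)
    )


if __name__ == "__main__":
    old = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
    new = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "DAVE")]
    print(differing_chunks(old, new, range_size=2))
```

Because chunks are keyed by primary-key range, one changed or missing row only flags its own chunk. That is what lets a follow-up comparison look at a small range instead of the whole table.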

Command (typical replication setup)

pt-table-checksum \
  --host=PRIMARY_HOST --user=USER --password=PASS \
  --databases NEW_DB \
  --replicate=percona.checksums

Expected Output

TS ERRORS DIFFS ROWS DIFF_ROWS CHUNKS SKIPPED TIME TABLE
... 0      0     1000000     0     20     0    5.3 NEW_DB.orders

DIFFS = 0 across all rows/tables = ✅

No replication? Two independent servers?

Option A: Run pt-table-checksum on both and diff the percona.checksums tables.
Option B: Use pt-table-sync directly to compare and optionally fix.

pt-table-sync: generate the minimal fix

pt-table-sync \
  --print \
  h=OLD_HOST,u=USER,p=PASS,D=OLD_DB \
  h=NEW_HOST,u=USER,p=PASS,D=NEW_DB

Expected Output

SQL INSERT/UPDATE/DELETE statements (and execution if --execute).
Use --print first, review, then add --execute.

More Percona tools (worth having on every migration)

pt-query-digest: Analyze slow logs/traces to find worst queries (post-migration).
pt-duplicate-key-checker: Identify redundant/overlapping indexes before/after migration.
pt-index-usage: See which indexes aren’t used (on sampled workload).
pt-online-schema-change: Safer online DDL for big tables without long lock times.

Data proven equal, we closed the loop on objects and behavior.

Step 6 — Don’t Forget the “Invisible” Pieces

Missing triggers or a changed collation can pass unnoticed—until a bug report lands.

Commands

SHOW TRIGGERS FROM NEW_DB;
SHOW PROCEDURE STATUS WHERE Db='NEW_DB';
SHOW FUNCTION STATUS WHERE Db='NEW_DB';
SELECT table_name, constraint_name, referenced_table_name
FROM information_schema.key_column_usage
WHERE table_schema='NEW_DB' AND referenced_table_name IS NOT NULL;
SELECT default_character_set_name, default_collation_name
FROM information_schema.schemata
WHERE schema_name='NEW_DB';

Expected Output

Counts and names match OLD_DB for triggers/procs/functions.
Foreign keys listed as expected.
Charset/collation align with app expectations (e.g., utf8mb4_0900_ai_ci).

Why this matters

A single missing trigger can silently break invariants (e.g., stock, audit, denormalized totals).
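Comparing the "invisible" objects by eye gets error-prone once there are more than a handful. A minimal sketch of a name-level parity check follows, assuming the object names have already been collected from both servers (for example from the SHOW commands above). The function names and the sample object names are hypothetical.

```python
def parity_report(old_objects, new_objects):
    """Compare object names per category and list what is missing or unexpected.

    Both arguments map a category (e.g. "triggers") to an iterable of names.
    """
    report = {}
    for category in sorted(set(old_objects) | set(new_objects)):
        old_names = set(old_objects.get(category, ()))
        new_names = set(new_objects.get(category, ()))
        report[category] = {
            "missing_in_new": sorted(old_names - new_names),
            "unexpected_in_new": sorted(new_names - old_names),
        }
    return report


def has_full_parity(report):
    """True when no category has missing or unexpected objects."""
    return all(
        not diff["missing_in_new"] and not diff["unexpected_in_new"]
        for diff in report.values()
    )


if __name__ == "__main__":
    # Hypothetical object names for illustration.
    old = {"triggers": ["orders_audit", "stock_adjust"], "procedures": ["close_day"]}
    new = {"triggers": ["orders_audit"], "procedures": ["close_day"]}
    report = parity_report(old, new)
    print(report["triggers"]["missing_in_new"])
    print(has_full_parity(report))
```

This compares names only. Two triggers with the same name but different bodies would still pass, so diffing the dumped definitions from Step 1 remains the stronger check.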

With the database validated, we packaged results for non-DBA stakeholders.

Step 7 — Report in Plain Language

Engineers love logs; stakeholders love summaries. Stakeholder Table (example)

Item	Old DB	New DB	Status
Row Count (orders)	1,000,000	1,000,000	✅ Equal
Index Count (orders)	5	5	✅ Equal
Size (MB)	350	348	⚠ 0.6% diff (≤5% = OK)
Checksums (Percona)	Match	Match	✅ All chunks DIFFS=0
Triggers/Procedures	2/3	2/3	✅ Parity
Collation/Charset	utf8mb4/…	utf8mb4/…	✅ Match

Expected Output

Clear pass/fail with small, explained variances.
A link to the raw validation logs for auditors (appendix).

Migration doesn’t end at restore; we watch for regressions.

Step 8 — Watch the System Breathe (Post-Migration)

We enabled the slow log to catch new bad plans caused by fresh stats.

Commands

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SHOW VARIABLES LIKE 'slow_query_log%';

Expected Output

+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| slow_query_log  | ON    |

Why this matters

ANALYZE can change plans; 1–2 weeks of vigilance pays off.
Pair with pt-query-digest to summarize hotspots quickly.

⸻

What We’ll Keep Doing Next Time

Run ANALYZE (and histograms when needed) to fix cardinality → good plans.
Run OPTIMIZE on big movers to defragment and align sizes.
Use Percona for truth (pt-table-checksum to detect, pt-table-sync to fix).
Validate the “invisibles” (triggers, procs, FKs, collation).
Report with thresholds (e.g., size diff ≤5% = acceptable).
Monitor for 1–2 weeks to catch plan regressions early.

⸻

Appendix — Quick “Pass/Fail” Thresholds We Use

Row count (critical tables): 100% match → PASS
Percona checksum: DIFFS=0 for all chunks → PASS
Size variance (post-OPTIMIZE): ≤5% → PASS
Cardinality sanity (via SHOW INDEX): no suspiciously tiny values on high-cardinality columns → PASS
Objects (triggers/procs/functions/FKs): full parity → PASS
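The numeric thresholds above are simple enough to encode, which keeps the stakeholder report from depending on someone's mental arithmetic. The sketch below is one possible shape, with hypothetical function names. The default 5% limit and the sample figures (350 MB vs 348 MB, 1,000,000 rows) are taken from this post.

```python
def size_variance_pct(old_mb: float, new_mb: float) -> float:
    """Relative size difference of the new DB against the old one, in percent."""
    return abs(new_mb - old_mb) / old_mb * 100


def evaluate_migration(old_rows, new_rows, checksum_diffs, old_mb, new_mb,
                       max_size_variance_pct=5.0):
    """Apply the row-count, checksum, and size thresholds; return a result per check."""
    within_size_limit = size_variance_pct(old_mb, new_mb) <= max_size_variance_pct
    return {
        "row_count": "PASS" if old_rows == new_rows else "FAIL",
        "checksum": "PASS" if checksum_diffs == 0 else "FAIL",
        "size_variance": "PASS" if within_size_limit else "FAIL",
    }


if __name__ == "__main__":
    print(round(size_variance_pct(350, 348), 1))
    print(evaluate_migration(1_000_000, 1_000_000, 0, 350, 348))
```

The variance is measured against the old database's size, so `old_mb` must be non-zero. Cardinality sanity and object parity are left out here because they need judgment or name lists rather than a single number.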

SMEs vs Owners in Software Engineering: The Coach and the Captain

Aditya satrio nugroho — Wed, 01 Oct 2025 13:16:36 +0000

When the Lights Go Out

It’s 11 PM on payday. Your payments service crashes. Notifications blow up, managers panic, and someone asks the inevitable question: “Who’s fixing this?”

The service owner jumps in to roll back and stabilize. Meanwhile, a payments SME joins the call, explaining why retry logic wasn’t built to handle this traffic spike and suggesting how to prevent it next time.

Two people, two very different roles, both essential in that moment. One takes responsibility for restoring the system. The other shapes the long-term fix. This is the difference between an Owner Expert and a Subject Matter Expert (SME).

The Coach: Subject Matter Expert (SME)

Every team has someone who knows a specific domain inside out. They might not be the one pushing commits at midnight, but when you’re designing a new feature or trying to avoid a costly mistake, they’re the first call.

That’s the SME — the coach of the engineering world. They’re the ones who:

Define standards and best practices.
Review designs and guide decisions.
Train teams so that knowledge spreads, not just stays locked in one head.

SMEs aren’t measured by the uptime of a single service. Their impact comes from enabling multiple teams to work smarter and more consistently.

But expertise alone isn’t enough. Every team also needs someone who takes the field, owns the system, and carries the accountability.

The Captain: Owner Expert

Where the SME guides, the Owner delivers. This is the engineer who carries the weight of a system every day. If it breaks, they fix it. If it needs scaling, they plan it.

Think of the Owner Expert as the captain of the ship. They’re the ones who:

Keep uptime and reliability on track.
Handle incidents, rollbacks, and bug fixes.
Own the costs, performance, and stability of their system.

If the SME is about how things should be done, the Owner is about making sure it actually gets done. And the difference becomes obvious when you compare them side by side.

Coaches and Captains in Action

SME (Coach)	Owner Expert (Captain)
Security SME (defines authentication guidelines)	Identity Service Owner (keeps login alive)
Database SME (advises schema, migration)	Product DB Owner (ensures data is consistent and available)
Frontend SME (drives design system adoption)	Web App Owner (meets Lighthouse score and bug SLA)

The SME spreads knowledge across teams. The Owner goes deep on a single service. Neither role is optional — and nowhere is this more obvious than in payments.

The Payments Story

Picture the same e-commerce platform. The payments SME is the one who wrote the playbook: retry strategies, PCI-DSS compliance, fraud detection libraries. They’ve made sure every squad knows the rules of the game.

But when the API slows down at 11 PM on payday, it’s the payments service owner who’s accountable. They’re the one watching latency, applying fixes, and making sure money keeps flowing.

The SME sets the strategy. The Owner executes under pressure. To formalize this relationship, many companies use RACI.

Making It Clear with RACI

RACI — Responsible, Accountable, Consulted, Informed — helps untangle who does what.

SMEs fit best as Consulted: they guide, review, and teach.
Owners are both Responsible and Accountable: they’re on the hook for results.

Here’s what that looks like in practice:

Use Case	Responsible	Accountable	Consulted	Informed
Checkout outage	Checkout service owner	Checkout service owner	Infra SME (incident review)	PM, leadership
Database migration	Product DB owner	Product DB owner	Database SME	Affected squads
Design system rollout	Web app owners	Web app owners	Frontend SME	UX team, PM
CI/CD deployment strategy	Pipeline owner	Pipeline owner	DevOps SME	All squads

Another Story: The Migration Gone Wrong

Last year, a squad ran a database migration without consulting the database SME. On paper, the product DB owner was both responsible and accountable. They executed the migration, but a subtle indexing issue caused queries to crawl, impacting three other services.

It took hours of firefighting before the SME jumped in, identified the missing partitioning strategy, and guided the fix. The owner restored the system, but the SME made sure the same mistake would never happen again.

This is why the Consulted role in RACI matters: owners get systems back online, SMEs make sure the org learns and doesn’t repeat mistakes.

But RACI only clarifies roles. To really measure success, you need OKRs.

From Roles to Results: OKRs

RACI defines responsibilities. OKRs define outcomes.

An Owner Expert OKR might be: “Reduce failed deployments to fewer than 3 per quarter.”
An SME OKR might be: “Ensure 100% of squads adopt canary deployment guidelines by end of quarter.”

Owners are measured by the health of their system. SMEs are measured by the adoption of their expertise. And when you make them SMART, they become even sharper.

SMART OKRs for Coaches and Captains

SMART (Specific, Measurable, Achievable, Relevant, Time-bound) highlights the contrast.

Owners: SMART OKRs are outcome-driven.
Example: “Maintain 99.95% uptime for Checkout API by end of Q2.”

SMEs: SMART OKRs are adoption-driven.
Example: “Train 4 squads on retry logic best practices by end of Q2.”
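An uptime target is only "Measurable" once it is translated into a downtime budget. As a rough sketch, assuming a 90-day quarter and a hypothetical helper name, the arithmetic looks like this:

```python
def downtime_budget_minutes(uptime_target_pct: float, period_days: int) -> float:
    """Minutes of downtime allowed over the period for a given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_target_pct) / 100


if __name__ == "__main__":
    # Assumption: a quarter is treated as 90 days.
    print(round(downtime_budget_minutes(99.95, 90), 1))
```

For 99.95% over 90 days, that works out to roughly 65 minutes of total downtime for the quarter. An owner can track a number like that week to week.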

One ensures delivery. The other ensures consistency. Together, they keep the org moving forward. But how does this cascade in a real workflow?

Cascading OKRs in Practice

Imagine the chain from manager to IC.

Engineering Manager (SME role)

Objective: Improve backend reliability across the org

Key Result: 99.95% uptime across critical services

Tech Lead (Owner Expert for Checkout Service)

Objective: Improve Checkout API reliability

Key Result: Reduce p95 latency from 300ms to 200ms

IC (Backend Engineer in Checkout Squad)

Objective: Contribute to Checkout reliability

Key Result: Refactor retry logic to cut failures by 15%

Each level connects. The IC’s change drives the TL’s service reliability, which rolls up to the manager’s org-wide outcome. When done right, cascading OKRs create alignment instead of silos.

Closing the Loop

SMEs and Owners are not competing roles. They’re complementary. The SME ensures everyone knows the right way to play the game. The Owner ensures the game is actually won.

Without SMEs, teams scatter into inconsistency. Without Owners, accountability disappears. Together, they bring both breadth and depth.

If your org doesn’t know who the SMEs and Owners are, your OKRs will drift and responsibility will blur. Define them early, connect them with RACI, and cascade their OKRs. That’s how you get teams that stay aligned at 2 PM in a planning meeting — and steady at 2 AM during a production fire.

Sales Talks Impact, Engineers Talk Process. Bridging the Language Gap

Aditya satrio nugroho — Thu, 28 Aug 2025 01:42:59 +0000

The Meeting Room Divide

Picture this: you’re in a leadership meeting.

The Head of Sales is energized, pacing at the front of the room:

“There’s a huge opportunity in this new vertical. If we move fast, we could add 20% revenue this quarter!”

On the other side of the table, the Head of Engineering calmly counters:

“We need to improve our deployment process. Change failure rates are too high, and our lead time is slowing down.”

Both are speaking passionately, both are right — yet they sound like they’re on different planets.

Here’s the truth: business and sales-oriented people talk about opportunities for sales impact, while software engineers talk about processes for quality impact.

It’s not that one is short-term and the other long-term. It’s simply two different lenses on the same mission: building a business that grows, scales, and lasts.

Two Languages, One Goal

Sales / Business Lens → “Opportunities” → revenue, market share, customer acquisition.
Engineering Lens → “Processes” → code quality, deployment stability, defect rates, developer productivity.

Both are obsessed with impact — just measured differently. Sales sees impact on top-line revenue. Engineering sees impact on system reliability and delivery velocity.

If these two perspectives remain disconnected, companies end up with mismatched expectations: Sales closes big deals the system can’t support, or Engineering optimizes processes with no clear link to business growth.

The bridge lies in translation. And here’s where research and experience back this up.

The Engineering Side: Process = Impact

Software engineering has long been treated as an internal cost center. But modern research shows otherwise.

In Accelerate (Forsgren, Humble, Kim), the authors report that high-performing software teams deploy 46x more frequently and recover from incidents 96x faster than low performers. These process improvements correlate with business performance: profitability, market share, and customer satisfaction.
The DORA metrics (Deployment Frequency, Lead Time, Mean Time to Restore, Change Failure Rate) are now industry standards precisely because they prove that process quality isn’t “nice-to-have” — it drives competitiveness.
The Phoenix Project (Gene Kim et al.) illustrates this in story form: organizations that ignore engineering bottlenecks see their business grind to a halt, no matter how strong their sales pipeline is.
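Two of the DORA metrics named above are easy to compute once deployments are recorded. The sketch below assumes a hypothetical record shape (a dict with a `failed` flag and a `time_to_restore` duration) and made-up sample data. It is not taken from any DORA tooling.

```python
from datetime import timedelta


def change_failure_rate(deployments):
    """Percentage of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["failed"])
    return failed * 100 / len(deployments)


def mean_time_to_restore(deployments):
    """Average restore time across failed deployments, or None if none failed."""
    durations = [d["time_to_restore"] for d in deployments if d["failed"]]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Hypothetical history: five deployments, one of which failed.
    history = [
        {"failed": False, "time_to_restore": None},
        {"failed": False, "time_to_restore": None},
        {"failed": True, "time_to_restore": timedelta(hours=4)},
        {"failed": False, "time_to_restore": None},
        {"failed": False, "time_to_restore": None},
    ]
    print(change_failure_rate(history))
    print(mean_time_to_restore(history))
```

Putting numbers like these next to a revenue forecast is one way to express process quality in terms a sales-oriented audience can weigh.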

In other words: better processes lead to higher-quality software, which leads to faster time-to-market, fewer outages, and ultimately happier customers who stay and spend more.

The Sales Side: Opportunity = Impact

Sales, on the other hand, has always been outcome-obsessed — but the best sales thinking also emphasizes process discipline.

The Challenger Sale (Dixon & Adamson) showed that top-performing sales reps succeed not by chasing every lead, but by following a repeatable approach: teach, tailor, and take control.
SPIN Selling (Rackham) provides a structured framework: Situation → Problem → Implication → Need-payoff. This isn’t freewheeling persuasion; it’s process that scales.
Crossing the Chasm (Geoffrey Moore) demonstrates that capturing new markets requires systematic go-to-market strategies, not opportunistic wins.

In other words: opportunities convert into real impact only if sales organizations follow repeatable, quality-driven processes.

Side-by-Side Comparison

Dimension	Sales / Business Lens	Engineering Lens	Common Ground
Focus	Opportunities → Revenue Impact	Processes → Quality Impact	Both seek predictable growth
Metrics	Pipeline health, win rate, ARR uplift	Deployment frequency, MTTR, defect ratio	Predictability and trust
Risk if ignored	Missed deals, poor market capture	System failure, poor velocity, high churn	Lost credibility and revenue
Key References	Challenger Sale, SPIN Selling, Crossing the Chasm	Accelerate, Phoenix Project, DORA metrics	Discipline creates impact

Notice the symmetry: both sides care about impact, but impact without process is fragile, and process without impact is meaningless.

A Real-World Use Case

Let’s make this real.

A SaaS startup landed a major enterprise client — a global bank. The sales team celebrated: millions in potential ARR, new market credibility, and a case study to unlock future deals.

But Engineering had concerns. Their deployment pipeline had a 20% change failure rate. Incidents occurred weekly. Monitoring was minimal. The bank expected a 99.9% SLA.

Sales was talking opportunity → impact: “This deal could double our revenue.”
Engineering was talking process → quality: “Without fixing deployment, we risk outages that break our SLA.”

Initially, leadership saw this as friction. But once translated, it became synergy. Engineering tied their improvements directly to business outcomes:

Before: 12 incidents per month, 4-hour average recovery, risk of SLA penalties.
After CI/CD & automated testing improvements: incidents down to 5/month, recovery time cut to under 1 hour.
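The translation from incident numbers to SLA risk can be sketched with a few lines of arithmetic. This is a rough Python illustration that assumes a 30-day month and treats incident-hours as a worst-case proxy for downtime (not every incident is a full outage); the function names are made up for this example.

```python
def downtime_budget_minutes(sla: float, days: int = 30) -> float:
    """Minutes of downtime an availability target allows over `days`."""
    return (1 - sla) * days * 24 * 60


def incident_hours(incidents_per_month: int, avg_recovery_hours: float) -> float:
    """Total hours per month spent in an incident state."""
    return incidents_per_month * avg_recovery_hours


budget = downtime_budget_minutes(0.999)   # the bank's 99.9% SLA
before = incident_hours(12, 4.0)          # 12 incidents, 4-hour average recovery
after = incident_hours(5, 1.0)            # 5 incidents, using 1 hour as an upper bound
reduction = 1 - after / before

print(f"99.9% SLA allows about {budget:.1f} minutes of downtime per month")
print(f"Incident-hours per month: {before:.0f} before, at most {after:.0f} after")
print(f"Reduction in incident-hours: {reduction:.0%}")
```

Framing the improvement as a share of the downtime budget is what lets Sales and leadership read an engineering metric as contract risk.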

That reliability gave Sales the confidence to close two more enterprise clients, together worth $3M ARR.

The result: process improvements in engineering enabled opportunity capture in sales.

Insight

Business leaders and engineers don’t need to “speak the same language.” What they need is translation and alignment.

Sales translates opportunities into revenue impact.
Engineering translates processes into quality impact.
Together, they drive sustainable, scalable growth.

As Accelerate shows, quality in engineering fuels business performance. As The Challenger Sale and SPIN Selling show, process discipline in sales fuels revenue impact.

The companies that win are those that see beyond the divide and recognize the truth: sales talks about what is possible, engineering ensures it is sustainable.

Tabular vs Columnar Databases

Aditya satrio nugroho — Mon, 11 Aug 2025 12:30:52 +0000

When you first hear “tabular” vs “columnar” databases, it might sound like an abstract storage concept. But if we put it into a grocery shopping analogy, it suddenly becomes a lot easier to grasp.

🛒 The Grocery Store Analogy

Tabular (Row-Oriented) — Shopping by Recipe

In a row-oriented (tabular) database, data is stored row by row.

Imagine a grocery store where each aisle contains everything you need for a single recipe:

Aisle 1 → Spaghetti Bolognese kit (pasta, sauce, beef, spices)
Aisle 2 → Chicken Curry kit (chicken, curry paste, coconut milk, rice)
Aisle 3 → Salad kit (lettuce, tomato, dressing, croutons)

If you’re cooking one recipe, you simply go to that aisle and grab all the ingredients in one go.

💡 Best for: Tasks where you often need all data for a single record, like retrieving a full customer profile or processing a transaction.

Columnar (Column-Oriented) — Shopping by Ingredient

In a column-oriented database, data is stored column by column.

Imagine a grocery store organized by ingredient type:

Aisle 1 → All pasta types
Aisle 2 → All sauces
Aisle 3 → All meats
Aisle 4 → All vegetables

If you want to find all tomatoes in the store, you only go to the vegetable aisle — you don’t waste time walking through every recipe aisle.

💡 Best for: Analytical tasks where you scan specific columns over large datasets — like calculating the average age of all customers or the total sales per region.
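The two layouts can be mimicked in plain Python to show why each access pattern favors one or the other. This is only a toy sketch with made-up customer data; real databases add pages, indexes, and compression on top of this idea.

```python
# Row-oriented: all fields of one record are stored together.
rows = [
    {"id": 1, "name": "Ana", "age": 31, "region": "West"},
    {"id": 2, "name": "Budi", "age": 45, "region": "East"},
    {"id": 3, "name": "Citra", "age": 26, "region": "West"},
]

# Column-oriented: all values of one field are stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["Ana", "Budi", "Citra"],
    "age": [31, 45, 26],
    "region": ["West", "East", "West"],
}


def customer_from_rows(customer_id: int) -> dict:
    # One lookup returns the whole record: the "recipe aisle".
    return next(r for r in rows if r["id"] == customer_id)


def customer_from_columns(customer_id: int) -> dict:
    # Rebuilding one record means visiting every column.
    i = columns["id"].index(customer_id)
    return {name: values[i] for name, values in columns.items()}


def average_age_from_rows() -> float:
    # Walks every full record just to read one field.
    return sum(r["age"] for r in rows) / len(rows)


def average_age_from_columns() -> float:
    # Reads a single contiguous column: the "ingredient aisle".
    ages = columns["age"]
    return sum(ages) / len(ages)


print(customer_from_rows(2))
print(average_age_from_columns())
```

Both layouts return the same answers; the difference is how much unrelated data each query has to touch along the way.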

⚖️ Pros & Cons

Feature	Tabular (Row-Oriented)	Columnar (Column-Oriented)
Optimized for	OLTP (transactions)	OLAP (analytics)
Read pattern	All columns for a few rows	A few columns for many rows
Insert/Update speed	Fast	Slower
Aggregate queries	Slower	Very fast
Compression	Lower	Higher
Examples	MySQL, PostgreSQL, SQL Server	ClickHouse, BigQuery, Redshift, Snowflake

📌 Best Use Cases

Tabular (Row-Oriented) is ideal when:

You’re handling real-time transactions (banking, e-commerce orders, POS systems).
You frequently insert, update, and delete individual rows.

Columnar (Column-Oriented) is ideal when:

You’re running heavy analytics on large datasets.
You often aggregate or filter by specific columns.
Your queries typically touch a small subset of columns but many rows.

⚠️ Pitfalls to Watch Out For

Tabular

Inefficient for analytical queries on large datasets.
Higher storage I/O when only a few columns are needed.

Columnar

Poor performance for frequent single-row updates.
More complex transactional handling — often not the best choice as a primary OLTP store.
Can be overkill for small datasets or systems with simple queries.

🔧 Popular Tools

Row-Oriented (Tabular):

MySQL
PostgreSQL
Oracle Database
Microsoft SQL Server

Column-Oriented:

Google BigQuery
Amazon Redshift
Snowflake
ClickHouse
Apache Parquet (file format)

🏁 Quick Takeaway

Tabular (Row) → “Give me everything about one thing.”
Columnar → “Give me one thing about everything.”

Choosing the right one depends on your workload — transactional systems thrive on tabular, while analytics shines on columnar.

📘 Build a Tech Performance Framework for Engineering OKRs That Actually Drive Impact

Aditya satrio nugroho — Mon, 23 Jun 2025 23:32:48 +0000

In my experience leading engineering teams, I’ve found that the hardest part of OKRs isn’t setting them — it’s making sure they actually mean something.

Too many teams set OKRs like "refactor the admin panel" or "increase test coverage" without asking the bigger question:

What business outcome are we trying to enable?

This post introduces a simple, powerful framework I use to ensure every engineering OKR ladders up to something that matters — whether that’s profitability, product reliability, user experience, or operational efficiency.

🎯 Why This Framework Exists

Most engineering leaders know we should align our work to business goals. But how do you translate something like “reduce churn” or “increase CVR” into backend initiatives or platform improvements?

The answer: start with engineering fundamentals that map cleanly to business impact, not just project deliverables.

This framework helps me align tech metrics with business metrics. It lets me:

Prioritize what to build and what to cut
Make trade-offs explicit (not accidental)
Hold teams accountable with metrics that matter

🔺 The “Project Management Triangle” and Why It Still Matters

You’ve probably heard the saying:

“You can have it fast, cheap, or good — pick two.”

This idea is rooted in what’s known academically as the Project Management Triangle, sometimes called the Iron Triangle or informally the Golden Triangle.

It describes the fundamental trade-offs in any technical decision:

Constraint	Engineering Focus	Example
Speed	Delivery & time-to-market	Lead time, sprint velocity
Quality	Bug prevention, testability, stability	Defect rates, incident count
Cost	Infra and labor efficiency	Infra cost, developer utilization

No matter the size of the org, these tensions always exist, and the C-level mostly cares about this triangle. The best engineering OKRs don’t ignore them; they make them visible and intentional.

🧠 What the Experts Say (And Why I Take It Seriously)

This framework is inspired by some of the best minds in software engineering and DevOps.

📘 The Mythical Man-Month — Frederick Brooks

“Adding manpower to a late software project makes it later.”

Brooks explains how adding people to a late project increases onboarding and coordination overhead, which delays it further. A powerful reminder that speed does not scale linearly with headcount.

⚙️ Continuous Delivery — Jez Humble & David Farley

“If it hurts, do it more often.”

This quote refers to things like testing, deployment, and integration. The more painful a process is, the more it needs to be automated, so quality doesn’t degrade as you scale speed.

📈 Accelerate — Nicole Forsgren, Jez Humble, Gene Kim

“High performers deploy more frequently, recover faster, and are more stable.”

This book backs everything with data. The takeaway? You don’t have to trade speed for quality — high-performing teams achieve both.

✅ ISO/IEC 25010:2011

This global standard defines what "software quality" actually means, beyond just bugs. It includes:

Reliability
Maintainability
Performance efficiency
Functional suitability

These ideas directly inspired the seven dimensions below.

🧱 The 7 Dimensions of Tech Performance

Every good engineering OKR I’ve seen (or set) can be mapped to one or more of the following seven dimensions. These are the technical levers that actually move the business — across speed, reliability, cost, and growth-readiness.

Dimension	What It Measures	Example Metrics
1. Delivery	How fast and predictably we ship value	Lead time, deployment frequency, sprint velocity
2. Quality	How well we avoid defects and rework	Defect rate, escaped bugs, test coverage
3. Availability	Whether the system is up when users need it	Uptime %, MTTR, alerting coverage
4. Reliability	Whether the system behaves as expected under normal use	API P95 latency, crash-free sessions
5. Maintainability	How easily the system can evolve without breaking	PR cycle time, SonarQube score, legacy deprecation progress
6. Cost Efficiency	How efficiently we use compute and human resources	Infra cost/session, cloud bill reduction, manual hour savings
7. Scalability	How well the system performs as usage or data grows	Throughput under load, autoscaling behavior, resource saturation thresholds

🧠 Pro tip: Every OKR should align to at least two of these dimensions. One is not enough.
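One way to enforce the two-dimension rule is to tag each OKR explicitly and check the tags. The sketch below is a hypothetical Python model (the `OKR` class and `is_well_targeted` helper are names invented for this example), not part of any OKR tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Dimension(Enum):
    DELIVERY = "Delivery"
    QUALITY = "Quality"
    AVAILABILITY = "Availability"
    RELIABILITY = "Reliability"
    MAINTAINABILITY = "Maintainability"
    COST_EFFICIENCY = "Cost Efficiency"
    SCALABILITY = "Scalability"


@dataclass
class OKR:
    objective: str
    key_results: list[str]
    dimensions: set[Dimension] = field(default_factory=set)


def is_well_targeted(okr: OKR, minimum: int = 2) -> bool:
    """True when the OKR is tagged with at least `minimum` dimensions."""
    return len(okr.dimensions) >= minimum


infra = OKR(
    objective="Reduce infrastructure cost and API latency",
    key_results=["Reduce database reads by 60%", "Keep P95 check-in latency <= 500ms"],
    dimensions={Dimension.COST_EFFICIENCY, Dimension.RELIABILITY},
)
coverage_only = OKR(
    objective="Increase test coverage",
    key_results=["Raise unit test coverage"],
    dimensions={Dimension.QUALITY},
)

print(is_well_targeted(infra))          # tagged with two dimensions
print(is_well_targeted(coverage_only))  # tagged with only one
```

An OKR that fails the check is a prompt for a conversation about what it is really for, not an automatic rejection.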

🧭 Aligning to Business Impact (Without Internal Jargon)

Instead of exposing internal OKRs, I prefer to frame impact areas like this:

🔄 Improving system stability for user-facing products
📈 Supporting growth experiments by speeding up delivery
💰 Reducing cloud infrastructure and operational costs
🔧 Eliminating manual work through better tooling
🧪 Improving data quality to make analytics more trustworthy

These themes are universally valuable, whether you’re in a startup or scaling enterprise.

So, when I review OKRs, I ask:

Does this actually improve one of those outcomes?

If not, it's probably technical debt disguised as a priority.

🧠 Example OKRs Using This Framework

Here’s what this looks like in practice:

🚀 Improve Admin Dashboard Quality

Objective	Sunset legacy platform and reduce manual issues
KR 1	Migrate 100% of the legacy admin dashboard to the new codebase to avoid security issues
KR 2	Improve Sentry Perf score page XXX in the admin dashboard by 90%

📌 Dimensions: Maintainability, Reliability, Quality

💸 Infra Cost Optimization

Objective	Reduce infrastructure cost and API latency
KR 1	Reduce Database reads by 60%
KR 2	Keep P95 check-in latency ≤ 500ms

📌 Dimensions: Cost Efficiency, Reliability

🔁 How I Operationalize This

Use this framework not just for OKR planning, but for ongoing decision-making:

During planning: Tag each draft OKR with the dimensions it targets
During reviews: Check if any key business outcomes are neglected
During sprints: Map Jira stories to the OKRs and dimensions
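The review step, checking whether anything is neglected, can be as simple as a set difference over the tags. This Python sketch reuses the two example OKRs from this post; the helper name `neglected_dimensions` is an assumption made for illustration.

```python
ALL_DIMENSIONS = {
    "Delivery", "Quality", "Availability", "Reliability",
    "Maintainability", "Cost Efficiency", "Scalability",
}

# Dimension tags per OKR, as assigned during planning.
okr_tags = {
    "Improve Admin Dashboard Quality": {"Maintainability", "Reliability", "Quality"},
    "Infra Cost Optimization": {"Cost Efficiency", "Reliability"},
}


def neglected_dimensions(tags: dict[str, set[str]]) -> list[str]:
    """Return the dimensions that no OKR currently targets."""
    covered = set().union(*tags.values())
    return sorted(ALL_DIMENSIONS - covered)


print(neglected_dimensions(okr_tags))
```

A gap in the list is not automatically a problem; it just makes the trade-off explicit so it can be a deliberate choice.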

Tools that you can use:

Jira or Google Sheets (delivery and velocity)
APM tools like Sentry or New Relic (monitoring and error tracking)
Static code analysis such as SonarQube (maintainability)
GCP/AWS billing console (cost reports)
A team wiki such as Confluence or Notion (shared visibility)

🔚 Final Thoughts

You don’t need 20 OKRs to show impact. You need fewer, smarter, well-targeted ones.

This framework — based on engineering theory, real-world use case, and business alignment — helps me set OKRs that do more than just tick boxes.

They guide teams.
They inform trade-offs.
They create leverage.

Let’s stop writing OKRs that merely “sound good” or have no link to business impact, and start writing ones that move the needle for real.

Comparing Software Architecture Documentation Models and When to Use Them

Aditya satrio nugroho — Mon, 23 Jun 2025 23:04:10 +0000

Documenting software architecture isn’t just a formality—it’s a critical tool for communication, onboarding, and decision-making. While the C4 Model has become popular for its simplicity and developer focus, there are several other frameworks and templates, each with strengths for specific contexts.

This post breaks down the most widely used architecture documentation models, compares them, highlights real-world use cases, and provides concrete examples to help you choose the right approach for your team and project.

1. C4 Model

What is it?
A hierarchical model (Context, Container, Component, Code) for visualizing software architecture at different levels of detail.

Best For:

Agile teams
Developer-centric communication
Fast onboarding
Modern cloud-native applications

When to Use:

When your audience includes developers, product owners, or external stakeholders who need a visual “big picture” down to component level.
Projects where diagrams need to stay in sync with code (C4 can be generated from code in some tools).

Example:

SaaS web app: Context diagram shows users, payment gateways, and your platform; container diagram shows API, frontend, and database; component diagram details API modules.

Drawbacks:

Doesn’t prescribe document structure, just diagrams.
Less focus on non-visual documentation (rationale, cross-cutting concerns).

2. 4+1 View Model

What is it?
Introduced by Philippe Kruchten, 4+1 organizes architecture into Logical, Development, Process, Physical views, plus Scenarios.

Best For:

Large enterprise systems
Projects with multiple stakeholder groups
Situations where hardware, deployment, and runtime concerns matter

When to Use:

When you need to separate “what the system does” (Logical) from “how it is deployed” (Physical), “how it is built” (Development), and “how it runs” (Process).
Projects with non-technical and technical audiences.

Example:

Telecom system: Logical view shows services, Development view shows microservices repos, Process view shows runtime processes/threads, Physical view maps containers to servers, Scenarios walk through call setup.

Drawbacks:

More effort and overhead than C4.
Can be overkill for simple systems.

3. Views and Beyond (V&B)

What is it?
A framework from the Software Engineering Institute (SEI) that focuses on describing a system from different views (Module, Component & Connector, Allocation), each tailored to stakeholder concerns.

Best For:

Complex systems with many stakeholders (ops, QA, business, dev)
Organizations with a culture of detailed documentation

When to Use:

When you need to ensure every stakeholder’s concern is addressed.
For compliance or formal architecture review processes.

Example:

Banking platform: Module view for code structure, Component & Connector view for service integrations, Allocation view for cloud vs. on-prem deployment.

Drawbacks:

Heavyweight, can be too formal for agile/startup environments.

4. Arc42

What is it?
A template for comprehensive architecture documentation, combining structure (what to write) with flexibility (how to visualize).

Best For:

Teams looking for a complete architecture documentation template
Projects requiring thorough coverage (context, quality scenarios, cross-cutting concerns)

When to Use:

When you want to document not only structure, but also decisions, quality attributes, and concepts.
Good for regulated environments or projects with high turnover.

Example:

Healthcare platform: Use Arc42 template to document system context, business goals, architecture decisions, data flow, deployment, and risks.

Drawbacks:

Can seem overwhelming at first.
Not diagram-focused; you must choose your own diagram styles (often used with C4).

5. ISO/IEC/IEEE 42010

What is it?
An international standard for describing architecture using viewpoints and views.

Best For:

Organizations needing compliance with international standards
Very large, mission-critical projects

When to Use:

When documentation must satisfy formal regulatory, client, or industry requirements.

Example:

Aerospace control system: Architecture documentation split into safety, security, and deployment viewpoints as per ISO 42010 guidelines.

Drawbacks:

Very formal and generic; doesn’t provide concrete diagram or template recommendations.
Usually implemented through other frameworks (Arc42, 4+1, etc.).

6. ADR (Architecture Decision Records)

What is it?
A lightweight way to document individual architectural or technical decisions as short markdown/text files.

Best For:

Agile teams
Projects where decisions evolve rapidly
Complementing high-level documentation

When to Use:

When you want to record the “why” behind important choices (tech stack, database, patterns).
When you need an auditable trail of decisions for future maintainers.

Example:

Microservices platform: Each decision (e.g., "Use Postgres instead of MySQL") gets a 1-pager with context, options, decision, consequences.
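A small script can keep ADRs consistent by rendering each one from the same fields. The Python sketch below is an assumption-laden illustration: the `ADR` class, the file naming scheme, and the sample context and consequences are all hypothetical, and only the "Use Postgres instead of MySQL" title comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class ADR:
    number: int
    title: str
    status: str
    context: str
    options: list[str]
    decision: str
    consequences: str

    def filename(self) -> str:
        # Hypothetical naming scheme: zero-padded number plus a slug of the title.
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.md"

    def to_markdown(self) -> str:
        options = "\n".join(f"- {option}" for option in self.options)
        return (
            f"# {self.number}. {self.title}\n\n"
            f"Status: {self.status}\n\n"
            f"## Context\n{self.context}\n\n"
            f"## Options\n{options}\n\n"
            f"## Decision\n{self.decision}\n\n"
            f"## Consequences\n{self.consequences}\n"
        )


# Placeholder content, invented purely to show the shape of a record.
adr = ADR(
    number=1,
    title="Use Postgres instead of MySQL",
    status="Accepted",
    context="(describe the problem and constraints here)",
    options=["Postgres", "MySQL"],
    decision="Use Postgres as the primary datastore.",
    consequences="(list what becomes easier and what becomes harder)",
)

print(adr.filename())
print(adr.to_markdown())
```

Committing the rendered files next to the code keeps the decision history in the same review flow as the changes it explains.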

Drawbacks:

Not a full documentation framework, but a supplement.

7. UML (Unified Modeling Language)

What is it?
A standard visual language with diagram types (class, sequence, deployment, etc.).

Best For:

Teams needing detailed object-level diagrams
Generating code from diagrams (and vice versa)
Modeling at various levels of abstraction

When to Use:

When low-level relationships, interactions, or deployment details are needed.
When standard visual notations are required (e.g., for handover).

Example:

Library management system: UML class diagram for book, member, loan objects; sequence diagram for the checkout process.

Drawbacks:

Can get too detailed (“diagram for the sake of diagram”).
Not a documentation methodology—just diagrams.

Comparison Table

Model	Best For	Use Case Example	Overhead	Visual/Template
C4	Modern, code-centric teams	SaaS/web app	Low-Med	Visual
4+1 View	Multi-stakeholder, enterprise	Telecom, ERP	Medium	Visual
V&B (SEI)	Formal, many stakeholders	Banking, critical infra	High	Both
Arc42	Thorough, template-driven	Healthcare, gov	Medium-High	Template
ISO 42010	Compliance, formal review	Aerospace, defense	High	Template
ADR	Decision history, agility	All	Low	Text
UML	Object, behavior diagrams	Component, flow models	Variable	Visual

How to Choose?

For startups and fast-moving teams: Start with C4 for high-level clarity + ADRs for decision history.
For large enterprises: Use 4+1 View or V&B if many stakeholders and complex concerns. Arc42 works great if you need a thorough template.
For regulated industries: Consider ISO 42010 compliance, usually with Arc42 or V&B as the documentation base.
For legacy systems or heavy object-oriented designs: Use UML diagrams as needed, but avoid over-documenting.

Conclusion

No one-size-fits-all. Mix and match based on project complexity, team size, compliance needs, and audience. For most modern SaaS products, a combination of C4 diagrams, a few ADRs, and lightweight templates (even Notion or Markdown) gives the right balance of clarity and speed.

My Leadership Playbook

Aditya satrio nugroho — Fri, 02 May 2025 05:49:36 +0000

This playbook is based on my experience leading small teams in bootstrapped startups, navigating growing pains in scale-ups, and coaching managers in legacy corporations trying to stay relevant.

Indonesian leadership isn't like Silicon Valley's and shouldn't try to be. We have our rhythms, cultural codes, expectations, and blind spots. We lead with respect, context, and emotions, but we sometimes struggle with clarity, consistency, and execution.

This playbook is not a list of hacks. It is a mirror: a way to see what is working, what is not, and what might be worth rethinking, and a trigger for discussion.

Each chapter starts with a story because leadership does not happen in slides. It happens in rooms, message threads, crisis calls, and awkward 1:1s. It happens in human moments.

And that is where this playbook belongs: in real conversations, in real companies, with real people.

Let's lead better. Together.

PLAY 1: Leader Selection "Beyond IQ & Degrees"

The Scene: Imagine you are hiring a new team lead. One candidate graduated from a prestigious university with top marks. Another did not, but has led teams through tough pivots, learned from failure, and kept the team intact through two reorgs. The first impresses the board. The second earns silent respect from peers. Who do you pick?

The Lesson: Credentials may open the door, but leadership walks through experience, resilience, and the ability to make others better.

What to do:

Evaluate three core dimensions: Cognitive Skill, Character/Temperament, and Cultural/Team Fit.
Use real-world simulations: present them with live team or delivery problems to solve collaboratively.
Ask for past examples of failure and rebound, not just success.

What to avoid:

Overvaluing academic pedigree. Intelligence is not maturity.
Assuming confidence equals competence. Some great leaders are calm, not loud.

Cultural Frame (Indonesia):

Many local teams still respect authority from title or education. Your job is to model the shift toward credibility through action.

Adage: Degrees do not lead teams. People do.

PLAY 2: Show Up "Presence as Leadership"

The Scene: A product launch failed. Everyone's frustrated, but the team has not seen the engineering lead in days. No acknowledgment, no regroup, no face. Slack goes quiet. Then the frustrated engineers start looking for jobs. The narrative becomes: "We are on our own."

The Lesson: Leadership is not just about solving the problem; it is about being there when the team needs a steady face, a decision, or just acknowledgment. Presence creates psychological safety. Absence creates drift.

What to do:

Be visible during high-stress or high-uncertainty moments (failures, pivots, conflicts).
Create rituals of visibility: daily standups, open office hours, and walkarounds.
Practice emotional presence, and listen fully, even when you do not have the answer yet.

What to avoid:

Becoming a ghost leader during hard times.
Confusing "trust" with "total hands-off".

Cultural Frame (Indonesia):

Teams are used to hierarchy, but what they remember is who stood with them in difficult situations. In a high-context culture, silence often gets interpreted as ignorance or abandonment.

Adage: You do not need to have the answer. You need to be the one who stays in the room. Act as a team and make a plan to find the solutions together.

PLAY 3: Frugal Signals "Kill the Rolex Myth"

The Scene: A new VP arrives at a startup still struggling with burn rate. He parks a luxury car at the office, wears a 150 million IDR watch, and casually mentions his Bali villa. That week, the team quietly stops pushing extra hours. They think: "Why bleed for a leader who is already cashed out?"

The Lesson: Symbolism matters. In leadership, your choices signal values. Frugality is not about being cheap, it is about being credible.

What to do:

Model intentional modesty, especially during resource-constrained phases.
Make financial decisions that reinforce the collective mission.
Talk openly about company finances, so your modesty is not seen as performative but as aligned.

What to avoid:

Flaunting wealth when the team is grinding.
Using frugality to guilt employees.

Cultural Frame (Indonesia):

We have a culture of quiet comparison. People notice, even if they do not say it. Your lifestyle choices impact morale, especially in collectivist teams.

Adage: The more power you have, the more your choices echo.

PLAY 4: Mentor > Patron

The Scene: A senior leader tells a junior PM, "Just follow my instructions." The PM does but never grows. After 6 months, the leader complains, "Why is this person so dependent?" The answer: because you trained them to be.

The Lesson: Paternalistic leadership might bring speed early, but it kills scalability. Great leaders don't just solve problems; they teach people how to think.

What to do:

Shift from command to coaching: Ask guiding questions instead of giving answers.
Empower decisions with boundaries: "You decide X, I will review Y."
Share your thought process, not just your conclusions.

What to avoid:

Being the bottleneck.
Criticizing people for not thinking when you never gave them space to.

Cultural Frame (Indonesia):

Many teams are trained to defer. That is okay, start from where they are, then lead them out of dependency with trust and teaching.

Adage: A leader's job is to make themselves less needed, not more feared.

PLAY 5: Avoid Vacuums "Build Leadership Safety Nets"

The Scene: A team is thriving. The founder takes a two-week vacation, proudly stating, "Let them figure it out. It will build muscle." When he returns, three key decisions have been delayed, two high performers are disengaged, and nobody knows who is accountable.

The Lesson: Autonomy does not mean ambiguity. Absence without structure is abandonment. Great leaders step back with design.

What to do:

Create clear fallback roles: Who decides when you are out?
Write "if-then" escalation plans for product, people, and conflict.
Train people ahead of time, do not surprise them with sudden empowerment.

What to avoid:

Using absence as a test. It is not a test if they are not prepared.

Cultural Frame (Indonesia):

Teams still look upward for direction. Without clarity, they default to inaction. Build soft guardrails, not total freedom.

Adage: Real delegation is not letting go, it is setting up first.

PLAY 6: Ritualize Execution

The Scene: A startup team constantly misses deadlines. Everyone's busy, no one's aligned. When asked who is doing what, responses are vague: "We will discuss in the next sync." The next sync? Canceled.

The Lesson: Without structure, energy scatters. Rituals are the heartbeat of execution. Not for bureaucracy but for rhythm.

What to do:

Set non-negotiable cadences: daily standups, weekly planning, fortnightly retros.
Keep it sharp: focused agendas, rotating leads, timed updates.
Add rhythm to recognition: praise regularly, not as a quarterly event.

What to avoid:

Treating rituals like status updates. Make them decision moments.
Letting meetings become autopilot or optional.

Cultural Frame (Indonesia):

Teams respond well to predictable rhythm, especially in hybrid or distributed setups. It creates psychological security.

Adage: Speed doesn't come from chaos. It comes from choreography.

PLAY 7: Filter Thought Leadership "BS"

The Scene: A new leader joins and starts quoting buzzwords: "Let's be agile, build tribes, and practice radical candor." Everyone nods. No one understands. Execution drops, alignment fades, and engineers roll their eyes whenever a new slogan drops.

The Lesson: Not all thought leadership is useful. A lot of it is packaged noise. Great leaders translate ideas into tools, not vibes.

What to do:

Validate frameworks before adoption: pilot in one team.
Choose ideas based on problem relevance, not trendiness.
Favor materials used in actual business schools over viral LinkedIn content.

What to avoid:

Blindly importing Western models into local teams.
Using jargon as a proxy for insight.

Cultural Frame (Indonesia):

Teams value clarity. A simpler language works better. Over-complex methods often get quietly ignored.

Adage: Good leadership is not found in quotes, but forged in context.

Final Reflection

Leadership is rarely about knowing what to do. It is about doing what you know consistently. In Indonesian companies, where culture is high-context and hierarchies still shape interactions, leading well requires both strength and subtlety.

You do not need to be loud to be powerful. You do not need to be perfect to be trusted. But you must be present. You must be clear. And you must be willing to grow while helping others grow.

Use this playbook as your check-in guide, not your checklist. Come back to it when you feel stuck. Or when your team feels stuck.

Because in the end, leadership is not the spotlight. It is the structure that helps others shine.

This playbook is alive. Revisit it. Rewrite it. And apply it.

Back-of-the-Envelope Thinking for Scalable System Design

Aditya satrio nugroho — Mon, 21 Apr 2025 10:02:59 +0000

Have you ever been assigned a project where you designed an architecture using all the latest state-of-the-art tools — sharded databases, message queues, event buses, and more? At first glance, the architecture looks impressive. It sounds cool. But does it really solve the core problem you're facing?

Even if your CTO gives you the green light to build it, can you be sure the system will perform as expected? Once you start implementing it, doubt often creeps in. You begin questioning the performance, wondering how to validate your assumptions.

One simple but powerful method to validate your system design early is through back-of-the-envelope calculations. It helps you estimate, reason, and catch potential issues long before they become expensive mistakes.

Back-of-the-envelope calculations will help you create estimations using a combination of thought experiments and common performance numbers to get a good feel for which designs will meet your requirements.

🧮 Operation Latency Table

No	Operation	Activity Category	Component Category	Time (ns)	Time (ms)	Time (min)	Time (hr)
1	L1 cache reference	read	cache	0.5	0.0000005	0.00000000000833	0.000000000000139
2	Branch mispredict	misc	cpu	5	0.000005	0.00000000008333	0.000000000001389
3	L2 cache reference	read	cache	7	0.000007	0.00000000011667	0.000000000001944
4	Mutex lock/unlock	sync	cpu	100	0.0001	0.00000000166667	0.00000000002778
5	Main memory reference	read	memory	100	0.0001	0.00000000166667	0.00000000002778
6	Compress 1K bytes with Zippy	compute	cpu	10000	0.01	0.000000166667	0.000000002778
7	Send 2K bytes over 1 Gbps network	write	network	20000	0.02	0.000000333333	0.000000005556
8	Read 1 MB sequentially from memory	read	memory	250000	0.25	0.000004166667	0.000000069444
9	Round trip within same datacenter	network	network	500000	0.5	0.000008333333	0.000000138889
10	Disk seek	read	disk	10000000	10	0.000166667	0.000002778
11	Read 1 MB sequentially from network	read	network	10000000	10	0.000166667	0.000002778
12	Read 1 MB sequentially from disk	read	disk	30000000	30	0.0005	0.000008333
13	Send packet CA→Netherlands→CA	network	network	150000000	150	0.0025	0.000041667
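The absolute numbers matter less than the ratios between them. As a quick illustration, this Python sketch takes a few rows from the table above and compares them; the dictionary keys and the `times_slower` helper are names chosen for this example.

```python
# Selected latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "l1_cache_reference": 0.5,
    "main_memory_reference": 100,
    "round_trip_same_datacenter": 500_000,
    "disk_seek": 10_000_000,
    "packet_ca_netherlands_ca": 150_000_000,
}


def times_slower(slow: str, fast: str) -> float:
    """How many times longer `slow` takes compared with `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


print(times_slower("main_memory_reference", "l1_cache_reference"))
print(times_slower("disk_seek", "main_memory_reference"))
print(times_slower("packet_ca_netherlands_ca", "round_trip_same_datacenter"))
```

Going by the table, main memory is about 200 times slower than L1 cache, a disk seek is about 100,000 times slower than a memory reference, and a cross-continent round trip is about 300 times slower than one inside the datacenter.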

💡 The Lessons

Writes are 40 times more expensive than reads.
- Frequent writes/updates will have high contention.
- To scale writes, you need to partition, and once you do that, it becomes difficult to maintain shared state like counters.
Global shared data is expensive.
- This is a fundamental limitation of distributed systems.
- Lock contention on heavily written shared objects kills performance as transactions become serialized and slow.
Architect for scaling writes.
Optimize for low write contention.
Optimize wide. Make writes as parallel as you can.

🔥 Writes Are Expensive!

Datastores are transactional: writes require disk access.
Disk access means disk seeks.
🧠 Rule of thumb:

1 disk seek = ~10ms
→ 1s / 10ms = 100 seeks/second (max per disk)

Throughput depends on:

The size and shape of your data
Doing work in batches (batch puts/gets)

⚡ Reads Are Cheap!

Reads don’t have to be transactional — just consistent.
After the first disk load, data is cached in memory.
Subsequent reads are super fast.
🧠 Rule of thumb:

Read 1MB from memory ≈ 250μs
→ 1s / 250μs = 4GB/sec
→ For 1MB entities: 4000 fetches/sec
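Both rules of thumb are simple divisions, which makes them easy to rerun with your own numbers. The Python sketch below reproduces them using the figures quoted above; the constant and function names are illustrative.

```python
DISK_SEEK_MS = 10          # one disk seek, from the rule of thumb above
MEMORY_READ_1MB_US = 250   # reading 1 MB sequentially from memory

# Writes: how many seeks fit into one second on a single disk.
seeks_per_second = 1000 / DISK_SEEK_MS

# Reads: how many megabytes can be read from memory in one second.
memory_mb_per_second = 1_000_000 / MEMORY_READ_1MB_US


def fetches_per_second(entity_mb: float) -> float:
    """Entities of a given size that can be read from memory per second."""
    return memory_mb_per_second / entity_mb


print(f"{seeks_per_second:.0f} seeks/second per disk")
print(f"{memory_mb_per_second:.0f} MB/second from memory")
print(f"{fetches_per_second(1.0):.0f} fetches/second for 1 MB entities")
```

Changing `entity_mb` shows how quickly fetch throughput rises as entities get smaller, which is the "size and shape of your data" point in practice.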

🧪 Example: Generate Image Results Page of 30 Thumbnails

❌ Design 1 – Serial

Read images one-by-one:
Each image = disk seek + read 256KB at 30MB/s
Calculation:

30 seeks × 10ms = 300ms
30 × (256KB / 30MB/s) ≈ 30 × 8.5ms ≈ 256ms
→ Total: 300 + 256 ≈ 556ms

✅ Design 2 – Parallel

Issue reads in parallel.
Calculation:

1 seek = 10ms
Read 256KB / 30MBps ≈ 8.5ms
→ Total: 10 + 8.5 = ~18.5ms
- Expect variance in real world: ~30–60ms range
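The two designs can be compared with a short calculation. This Python sketch treats 1 MB as 1000 KB, which puts the serial total at roughly 556 ms (using 1024 KB instead gives about 550 ms), and it assumes the parallel reads truly overlap, for example because the images sit on different disks.

```python
SEEK_MS = 10.0         # one disk seek
DISK_MB_PER_S = 30.0   # sequential disk read throughput
IMAGE_KB = 256
IMAGES = 30


def read_ms(size_kb: float) -> float:
    """Time to read `size_kb` sequentially from disk, in milliseconds."""
    return size_kb / 1000 / DISK_MB_PER_S * 1000


def serial_ms() -> float:
    # Each image costs one seek plus one read, one after another.
    return IMAGES * (SEEK_MS + read_ms(IMAGE_KB))


def parallel_ms() -> float:
    # Assumption: all reads overlap fully, so the page costs one seek plus one read.
    return SEEK_MS + read_ms(IMAGE_KB)


print(f"Serial:   {serial_ms():.0f} ms")
print(f"Parallel: {parallel_ms():.1f} ms")
```

The roughly 30x gap between the two results is the point of the exercise; the exact figures will shift with real hardware and contention.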

🧠 Simplified Mental Models

Insight	What It Means (Simplified)
💾 Disk is super slow	Like walking to the garage. You don’t want to do this often.
🧠 RAM is much faster than disk	Like grabbing from your desk instead of walking to the cabinet.
⚡ CPU is rarely the bottleneck	Your processor is fast. If your system is slow, it’s usually not the CPU’s fault.
🔁 Cache is insanely fast	Think of L1/L2 cache like stuff in your pocket — instant access.
🌐 Network trips are expensive	Talking to another datacenter is like mailing a letter to Europe. Avoid it.
🔃 Batching is your friend	Instead of reading 1 comment at a time, grab 100 at once.
🧵 Avoid shared locks	Waiting for someone to unlock the bathroom wastes time.
📦 Design for locality	Keep data close to where it’s processed — like keeping your tools nearby.

"Cache beats RAM. RAM beats disk. Disk is lava. Network is long-distance love."

🧠 Conclusion

Back-of-the-envelope calculations won’t give you perfect answers, but they give you fast, good-enough estimates. That’s often all you need to:

Avoid wasteful engineering
Identify bottlenecks early
Make sound architecture decisions without building the wrong thing first

Before building that real-time dashboard or scaling out another microservice, ask yourself:

“Did I run the numbers? Even roughly?”

You might just save yourself days of debugging.

📚 References

Google Pro Tip: Back-of-the-Envelope Calculations

Moving Fast and Safely: Lessons from Scaling Tech Organizations

Aditya satrio nugroho — Thu, 10 Apr 2025 06:54:18 +0000

"Why scaling startups need more than just lean practices to survive and thrive."

Scaling a tech organization in the financial industry, particularly in sensitive domains like stocks and crypto, introduces unique challenges. What works for a 10-person startup no longer holds when the team grows to 80+ engineers. While lean principles foster agility early on, structure and governance become critical for sustainable growth.

This article draws heavily from the Team Topologies framework by Matthew Skelton and Manuel Pais, supported by Cognitive Load Theory, and validated by real-world scaling practices from Amazon, Spotify, and Google. Together, these references form the academic and practical foundation for our approach.

We explore the typical scaling pains, diagnose the root causes behind them, and outline a step-by-step guide to solving these issues.

Startup Growth Pains: From Lean Beginnings to Structured Necessity

You are no longer a "startup" at 80 engineers. You are a mid-sized tech company.

In the early stages, small teams rely on flexibility: informal communication, rapid decisions, and blurred responsibilities. However, as the team size grows, these strengths become liabilities:

Delivery slows due to coordination overhead.
Infrastructure strains under increasing demand.
Internal politics and confusion rise.

Lean practices help small teams move fast, but beyond a certain scale, intentional structure must complement speed. Without it, chaos, instability, and organizational mistrust set in.

Team Size and Scaling Needs

Team Size	Typical State	Scaling Requirement
1-10 engineers	Chaos is acceptable	Maximize flexibility and exploration
10-30 engineers	Growing pains start	Light processes, early team ownership
30-80 engineers	Structured chaos	Formalize team types, start Platform teams
80-150 engineers	Scaling complexity	Introduce IDP, enforce clear boundaries, governance
150+ engineers	Large-scale organization	Split into Tribes, strong Platform engineering culture

Our example of 80 engineers sits at the boundary between the "Structured chaos" and "Scaling complexity" phases, where building an Internal Developer Platform and enforcing clear team structures becomes mandatory.

Big tech companies and academic research have run into the same problem and arrived at similar solutions.

1. Amazon (early 2000s) — “You build it, you run it” with platform guardrails

Problem:
Amazon was scaling fast. Developers needed to move fast but infra/security couldn’t let them “touch everything.”

Solution:

Split into 2-pizza teams (small, independent Stream-aligned Teams).
Mandatory self-service platforms (deployment, logging, monitoring).
No manual infra work: Developers use platforms built by platform teams.
Teams own their service end-to-end within predefined guardrails.

Key Quote:

“You build it, you run it. But you run it inside the constraints provided by the central platform.”

Impact:
Allowed Amazon to scale to thousands of services without losing security, compliance, or control.

2. Spotify (2012–2014) — “Squads, Tribes, Chapters, Guilds” model

Problem:
Growing fast, too much friction between teams.

Solution:

Squads: Stream-aligned teams (own one part of the product).
Chapters: Shared function across squads (e.g., Infra Chapter).
Guilds: Loose, voluntary knowledge sharing (e.g., Security Guild).
Platform Teams: Build enabling platforms, not manual ops.

Special Rule:

Infra teams acted as Internal Service Providers.
Developers self-serve infra through APIs, not by asking infra engineers.

Academic Reference:
Spotify Engineering Culture, by Henrik Kniberg (official document, referenced globally).

3. Google SRE Model — “Error Budgets” and strict production control

Problem:
At Google scale, random developer changes = massive risks.

Solution:

Developers are responsible for code and minor ops.
SREs own production environment stability.
Error Budgets: Developers are allowed to break things within acceptable limits. If errors spike, devs lose the right to deploy until fixed.

Paraphrasing the Google SRE Book:

“Letting developers deploy freely without accountability is a path to ruin.”

Impact:

Developers move fast but inside a mathematically defined safety zone.
SREs protect core infra and enforce reliability.
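
The "mathematically defined safety zone" can be made concrete with a small calculation. This is an illustrative sketch only: the 99.9% SLO, the request counts, and the function names are assumptions made here, and real error-budget policies are usually richer than a single deploy/freeze switch.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates in the measurement window."""
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed: int) -> int:
    """Failures still allowed before the budget is exhausted (negative if overspent)."""
    return error_budget(slo, total_requests) - failed


def can_deploy(slo: float, total_requests: int, failed: int) -> bool:
    """Deploys stay open only while some error budget is left."""
    return budget_remaining(slo, total_requests, failed) > 0


# Hypothetical month: 1,000,000 requests against a 99.9% availability SLO.
print(error_budget(0.999, 1_000_000))       # 1000 failures allowed
print(can_deploy(0.999, 1_000_000, 400))    # True: 600 failures of budget left
print(can_deploy(0.999, 1_000_000, 1200))   # False: budget overspent, freeze deploys
```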

4. Academic Reference: “Cognitive Load Theory for Software Teams”

(Skelton and Pais, 2019; the same authors as Team Topologies)

Thesis:

Developers cannot own too many unrelated concerns at once.
Infra must be productized into easy-to-use platforms.
Team boundaries must be designed to optimize flow and minimize handoffs.

Their research shows that high cognitive load (devs doing dev + infra + security manually) = slower delivery, higher burnout, and higher incident rate.

Diagnosing the Core Problems

The problems faced by growing startups often trace back to fundamental organizational issues:

Organizational Maturity Mismatch

Small team behaviors persist even as the organization demands more maturity. Teams lack clear boundaries, and developers are expected to juggle responsibilities across development, infrastructure, operations, and security.

Cognitive Load Overload

Cognitive Load Theory teaches that individuals and teams can only handle a limited amount of complexity effectively. When teams handle too many unrelated domains, delivery becomes error-prone and slow.

Tech Politics: Erosion of Trust

Opaque decision-making processes create mistrust. Engineers begin competing for resources and priorities in an unhealthy way, leading to favoritism and internal alliances.

Root Cause: All these issues stem from the absence of deliberate team structures and communication models, a concept central to Team Topologies.

Principles for Scaling Successfully

Drawing directly from Team Topologies and Cognitive Load Theory, organizations must adopt three core principles:

1. Design Clear Team Boundaries

Team Topologies prescribes explicit team types to reduce cognitive load and improve flow:

Stream-aligned Teams: Build and run product features end-to-end.
Platform Teams: Create internal platforms that other teams consume.
Enabling Teams: Help other teams build missing capabilities.
Complicated Subsystem Teams: Handle highly specialized areas that require deep expertise.

Amazon demonstrates this with their internal platform systems. Developers own services completely but operate within strict platform guardrails, minimizing unnecessary complexity.

2. Build Systems of Trust

To reduce political behavior, decision-making must be transparent and predictable:

RFCs (Request for Comments): Publicly document and discuss major technical decisions.
Open Architecture Boards: Ensure that decisions are made based on merit, not hierarchy.
Public OKRs: Make team goals visible and measurable.

Spotify applied these principles with their Squads, Tribes, Chapters, and Guilds model. Squads operated independently but within a framework that encouraged transparency and cross-team collaboration.

3. Empower Developers Inside Guardrails

Developers should have autonomy but within safe, automated boundaries:

Infrastructure must be self-service.
Access must be controlled and audited.
Guardrails must automate security and compliance.

Google practices this balance through their Site Reliability Engineering (SRE) model. Developers own their services, but SREs enforce reliability through Error Budgets, aligning freedom with operational excellence.

If you allow full infra access "for speed" today, you borrow time against massive technical debt and existential risk later.

Allowing unrestricted access for the sake of moving fast might provide short-term gains, but it compromises the long-term stability of the organization. Technical debt accumulates invisibly, security vulnerabilities grow unnoticed, and incident recovery becomes slower. In regulated industries like finance and crypto, these risks aren't just technical — they are existential.

Building robust guardrails through an Internal Developer Platform protects the organization without throttling developer productivity.

Step-by-Step Solution

Step 1: Define Proper Team Structures

Clearly establish team types:

Stream-aligned Teams own and deliver complete product features.
Platform Teams abstract and simplify complex infrastructure needs.
Enabling Teams improve capability without owning delivery.
Complicated Subsystem Teams manage specialized technical areas.

Each team has a distinct mission, reducing overlap and conflict. This structure directly follows Team Topologies principles.

Step 2: Define Developer and Infra Responsibilities

Responsibilities must be split clearly:

Developers own application code, deployment pipelines, and monitoring.
Infra teams provide secured, templatized pipelines, observability tools, and enforced security policies.

This division helps manage cognitive load and supports faster, safer delivery.

Step 3: Introduce RFCs for Major Changes

Every significant architectural or infrastructural change must go through an RFC process:

Written proposals are discussed openly.
Decisions are transparent and based on technical merit.

This process builds organizational memory and eliminates backchannel decision-making, reinforcing trust systems.

Step 4: Leadership Rituals to Maintain Trust

Leadership must reinforce trust continuously:

Weekly Leads Meetings ensure alignment.
Public OKRs make priorities clear.
Rotating Architecture Review Boards distribute authority and expertise fairly.

These rituals align with building transparent, predictable decision-making systems.

Step 5: Build an Internal Developer Platform (IDP)

An Internal Developer Platform provides the foundation for developer autonomy without sacrificing safety. It must include:

Infrastructure as Code: Tools like Terraform and Pulumi to create pre-approved, self-service modules.
GitOps Deployments: Tools like ArgoCD or FluxCD automate deployment through Git workflows.
Self-Service Portal: Platforms like Backstage allow developers to launch services, view documentation, and manage their environments easily.
Secrets Management: Vault or AWS Secrets Manager centralizes secret handling and improves security.
Observability: Prometheus and Grafana provide monitoring and alerting out-of-the-box.
Incident Management: Slack integrations with Alertmanager or tools like PagerDuty enable professional on-call rotations.

Minimal Viable Stack Recommendation:

Category	Tool Choices
Infrastructure	Terraform + Atlantis
Deployments	GitHub Actions + ArgoCD
Portal	Backstage
Secrets	Vault or AWS Secrets Manager
Observability	Prometheus + Grafana
Incident Management	Slack + Alertmanager or PagerDuty

Building this platform aligns with Team Topologies' goal of enabling fast, secure, and independent delivery.

Conclusion

The desire for developers to "own everything end-to-end" is natural but risky when scaling in regulated industries. True ownership must happen within well-designed systems that balance speed, safety, and organizational trust.

The principles presented here are deeply grounded in the Team Topologies framework and Cognitive Load Theory, and validated by real-world practices at Amazon, Spotify, and Google. These references provide a solid foundation for any growing tech organization to scale successfully.

By applying structured team models, building internal platforms, and fostering trust through transparent processes, financial tech startups can achieve sustainable, scalable growth without chaos.

Build systems, not heroes. Move fast, but move safely.

Appendix: References

Team Topologies by Matthew Skelton and Manuel Pais
Cognitive Load Theory in Software Engineering
Google SRE Book
Spotify Engineering Culture (Henrik Kniberg)
Amazon Leadership Principles ("You build it, you run it")
Ruth Malan: Thoughts on Systems and Architecture

Change Data Capture (CDC) in Modern Systems: Pros, Cons, and Alternatives

Aditya satrio nugroho — Sun, 30 Mar 2025 16:08:16 +0000

Change Data Capture (CDC) is a powerful technique used to track and react to data changes in real time. As modern systems lean more heavily into real-time data flows, microservices, and event-driven architectures, CDC has become a key strategy for syncing data across services, feeding analytics pipelines, and enabling responsiveness without overloading source databases.

What is CDC?

CDC refers to the process of identifying and capturing changes (INSERT, UPDATE, DELETE) in a data source, typically a relational database, and propagating those changes to downstream consumers like data lakes, caches, search indexes, or microservices.

Types of CDC:

Log-based: Taps into database transaction logs (e.g., binlog, WAL). Tools: Debezium, AWS DMS.
Trigger-based: Uses SQL triggers to write changes to an audit or events table.
Timestamp/version-based: Uses columns like updated_at to query for changes during polling.

Example: Debezium listens to PostgreSQL's WAL and emits changes to Kafka topics, which are then consumed by services or streamed to BigQuery.
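
To make the consumer side concrete, here is a minimal sketch of a service applying change events to a local replica. It assumes a simplified, Debezium-style envelope with `op`, `before`, and `after` fields; the exact message shape depends on the connector and converter configuration, and the `id` and `qty` columns are hypothetical.

```python
import json


def apply_change(replica: dict, raw_event: str) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    event = json.loads(raw_event)
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":                # delete: only the 'before' image is present
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


replica = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "qty": 10}},
    {"op": "u", "before": {"id": 1, "qty": 10}, "after": {"id": 1, "qty": 7}},
    {"op": "c", "before": None, "after": {"id": 2, "qty": 5}},
    {"op": "d", "before": {"id": 2, "qty": 5}, "after": None},
]
for e in events:
    apply_change(replica, json.dumps(e))

print(replica)  # {1: {'id': 1, 'qty': 7}}
```

In a real pipeline the events would come from a Kafka consumer rather than a list, and the replica would be a cache, search index, or warehouse table.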

Benefits of Using CDC

Near real-time updates: Data pipelines become reactive, not batch-driven.
Decoupling: Source systems remain focused on core responsibilities.
Event-driven support: Downstream systems can respond to events as they happen.
Less DB strain: Avoids heavy polling logic.
Audit/history capabilities: Replaying and inspecting changes becomes easier.

Example: Syncing inventory updates from a MySQL database into Elasticsearch via CDC ensures the search index is always up to date.

Drawbacks of CDC

Operational complexity: Needs connector management, offset handling, and monitoring.
Schema evolution fragility: Renames, drops, and type changes can break consumers.
Latency and ordering challenges: Out-of-order or delayed delivery in high throughput systems.
Data loss or duplication: Misconfigured offsets or restarts can cause inconsistencies.
Security/access: Log-based CDC often needs high-privilege DB access.
Performance impact: Trigger-based CDC increases write latency and can introduce locks.

Common Pitfalls:

Log rotation without connector sync: If your database rotates or purges logs before the CDC connector has consumed them, you may lose change events. For example, MySQL binlogs may expire and be deleted before Debezium catches up.
Missing schema registry: If you're sending CDC data (especially via Kafka) without a schema registry, changes like renaming fields or adding new ones can break downstream consumers expecting the old structure.
Offset mismanagement: CDC tools track how far they've read through the change log using offsets. If offsets are lost or incorrectly restored after a restart, the system may reprocess changes (duplicates) or skip them entirely.
Backpressure issues: In high-throughput systems, if consumers are slow, buffers fill up and connectors fall behind. This can lead to data lag, system crashes, or inconsistent sync.
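
One common defence against the duplicate-delivery pitfall is to make consumers idempotent. The sketch below assumes a single partition with monotonically increasing offsets and a hypothetical `IdempotentConsumer` class; in a real system the state change and the stored offset would need to be committed atomically.

```python
class IdempotentConsumer:
    """Skips any event whose offset was already processed."""

    def __init__(self) -> None:
        self.last_offset = -1   # highest offset applied so far
        self.balance = 0        # example state derived from the events

    def handle(self, offset: int, amount: int) -> bool:
        """Apply the event once; return False if it was a replay."""
        if offset <= self.last_offset:
            return False        # duplicate caused by a restart or offset rewind
        self.balance += amount
        self.last_offset = offset
        return True


consumer = IdempotentConsumer()
# Offset 1 is delivered twice, as can happen after a connector restart.
for offset, amount in [(0, 50), (1, -20), (1, -20), (2, 5)]:
    consumer.handle(offset, amount)

print(consumer.balance)  # 35
```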

Alternatives to CDC

1. Polling

Querying tables periodically for changes using timestamps.

Pros: Simple, no DB internals required
Cons: High latency, risk of missing updates
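
A minimal polling sketch using Python's built-in sqlite3 module shows both the simplicity and the blind spot. The `items` table, its columns, and the integer `updated_at` values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER, updated_at INTEGER)"
)


def poll_changes(conn, since):
    """Return rows changed after the high-water mark, plus the new mark."""
    rows = conn.execute(
        "SELECT id, qty, updated_at FROM items "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else since
    return rows, new_mark


conn.execute("INSERT INTO items VALUES (1, 10, 100)")
first, mark = poll_changes(conn, 0)

# Two updates land between polls; only the final state will be visible.
conn.execute("UPDATE items SET qty = 7, updated_at = 101 WHERE id = 1")
conn.execute("UPDATE items SET qty = 3, updated_at = 102 WHERE id = 1")
second, mark = poll_changes(conn, mark)

print(first)   # [(1, 10, 100)]
print(second)  # [(1, 3, 102)]
```

The intermediate value (qty = 7) never reaches the consumer, and a hard-deleted row would not show up at all.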

2. Database Triggers

Triggers record changes into separate tables.

Pros: Real-time-ish, customizable
Cons: Adds DB load, brittle, hard to scale
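
As a rough illustration, the following sketch uses SQLite triggers (via Python's sqlite3 module) to copy every change into an audit table. The table and trigger names are hypothetical, and production databases such as PostgreSQL or MySQL use their own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER);
CREATE TABLE items_audit (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, item_id INTEGER, old_qty INTEGER, new_qty INTEGER
);
-- Each trigger runs inside the same transaction as the write it records.
CREATE TRIGGER items_ins AFTER INSERT ON items BEGIN
    INSERT INTO items_audit (op, item_id, old_qty, new_qty)
    VALUES ('INSERT', NEW.id, NULL, NEW.qty);
END;
CREATE TRIGGER items_upd AFTER UPDATE ON items BEGIN
    INSERT INTO items_audit (op, item_id, old_qty, new_qty)
    VALUES ('UPDATE', NEW.id, OLD.qty, NEW.qty);
END;
CREATE TRIGGER items_del AFTER DELETE ON items BEGIN
    INSERT INTO items_audit (op, item_id, old_qty, new_qty)
    VALUES ('DELETE', OLD.id, OLD.qty, NULL);
END;
""")

conn.execute("INSERT INTO items VALUES (1, 10)")
conn.execute("UPDATE items SET qty = 7 WHERE id = 1")
conn.execute("DELETE FROM items WHERE id = 1")

audit = conn.execute(
    "SELECT op, item_id, old_qty, new_qty FROM items_audit ORDER BY seq"
).fetchall()
print(audit)
```

Every write now performs a second write, which is where the extra load and latency come from.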

3. Event Sourcing

Application emits domain events instead of just changing the DB.

Pros: Full audit, strong consistency
Cons: High complexity, requires redesign

4. Dual Writes

App writes to DB and queue (e.g., Kafka) at the same time.

Pros: Simple to start
Cons: Prone to inconsistency, needs idempotency

5. Transactional Outbox Pattern

App writes to a DB + outbox table in one transaction, then a relay service reads from outbox.

Pros: Reliable, atomic
Cons: Extra infra, slight delay
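
The pattern can be sketched in a few lines with sqlite3. The `orders` and `outbox` tables, the event payload, and the `publish` callback are all assumptions for illustration; a real relay would publish to a broker such as Kafka and run as a separate process.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    sent INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(conn, order_id):
    """Write the business row and its event in one transaction."""
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "id": order_id}),),
        )


def relay_once(conn, publish):
    """Publish unsent outbox rows in order, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # a crash after this line means a re-send
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


place_order(conn, 42)
published = []
relay_once(conn, published.append)
print(published)  # [{'type': 'order_placed', 'id': 42}]
```

Because the relay marks rows as sent only after publishing, delivery is at-least-once, so consumers should still tolerate duplicates.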

Tooling Comparison

Approach	Tooling Example	Infra Complexity	Cost	Scalability	Maturity
Log-based CDC	Debezium, AWS DMS	Medium to High	Medium–High	High	Mature
Trigger-based	Custom SQL Triggers	Low to Medium	Low	Low	Low
Polling	Custom cron/schedulers	Low	Low	Medium	Mature
Event Sourcing	Kafka, Axon Framework	High	High	High	Mature
Transactional Outbox	Kafka + relay service	Medium	Medium	High	Proven

Cloud vs Open-source Considerations:

AWS DMS and Google Datastream are managed, easy to set up but more expensive.
Debezium is free but typically requires Kafka Connect (plus ZooKeeper on older Kafka versions) and ongoing ops work.

When to Use CDC vs Alternatives

Use Case	Recommended Approach
Real-time analytics	CDC or polling
Microservices sync	Outbox or CDC
Cache invalidation	Dual write or CDC
Audit/history logging	Event sourcing or CDC
Event-driven orchestration	Event sourcing

Choose based on:

Team maturity: Infra, Kafka, observability
Data sensitivity: Can you tolerate duplicates/loss?
Latency requirements: ms vs seconds vs batch
Complexity budget: Is the benefit worth the effort?

Data Consistency and Integrity Considerations

Your choice of strategy has a direct impact on data consistency and integrity:

Dual writes without transactional guarantees can lead to mismatched states between your DB and event consumers if one write succeeds but the other fails.
Polling only sees the latest state of a row, so intermediate updates between intervals are lost, and hard deletes are never seen at all.
Trigger-based CDC may lose events if triggers fail silently or if permissions/configurations change.
CDC with proper offset tracking and delivery guarantees (like exactly-once semantics in Kafka) offers higher consistency but demands stronger infrastructure.
Transactional Outbox ensures atomicity between the DB change and the emitted event, making it one of the most reliable methods when done correctly.

Always evaluate the failure modes of your strategy—what happens when a component crashes, restarts, or loses network—and choose tools that give you the right trade-offs between consistency, complexity, and performance.

Conclusion

CDC is a powerful pattern to enable reactive and event-driven systems with minimal impact on source DBs. However, it's not a one-size-fits-all solution. Consider operational complexity, data criticality, and your system's maturity before choosing it over simpler polling or more robust outbox/event sourcing models. Thoughtful architecture always beats chasing trends.

Tech Pillars vs. Metrics: Foundations of a Technology Engineering Organization

Aditya satrio nugroho — Sun, 30 Mar 2025 15:51:20 +0000

Why This Matters

As engineering leaders, we often operate in high-velocity, high-uncertainty environments. Our teams are shipping fast, but are we improving sustainably? Without clearly defining the strategic priorities (pillars) and measuring them effectively (metrics), we risk optimizing the wrong things—and building tech that's brittle, expensive, or misaligned with business goals.

This article serves as a framework to build a structured, metrics-driven engineering culture rooted in clear, strategic pillars. It's written for tech leadership who want to scale with intention.

1. What Are Tech Pillars?

Tech pillars are non-negotiable strategic truths for your engineering organization. They represent core domains of focus that support the long-term viability, performance, and alignment of the tech org.

They aren't measured directly, but instead guide decision-making, investment, and cultural behaviors.

Example Tech Pillars:

System Reliability & Observability
Engineering Productivity & Developer Experience (DX)
Cost Efficiency & Resource Optimization
Performance Optimization
Quality, Stability & Security
Innovation & Technical Growth
Cross-Team Collaboration & Alignment

Analogy: Think of pillars like the foundation of a building. You don’t measure the pillar itself; you check for cracks in the walls, leaks in the ceiling—the symptoms of a failing pillar.

Reference: Forsgren et al. (2018, Accelerate) describe these as "capability domains" predictive of software delivery and organizational performance.

2. What Are Metrics?

Metrics are the gauges, dials, and warning lights of your engineering organization. They provide quantifiable feedback on how well you’re upholding each pillar.

Metrics are only meaningful when tied to a strategic context. Measuring uptime means little unless it serves your Reliability pillar. Tracking PR merge time without pairing it with review quality might be counterproductive.

Examples of Metrics (Grouped by Pillar):

System Reliability & Observability

Uptime %
Mean Time to Detect (MTTD)
Mean Time to Recovery (MTTR)
Incident Recurrence Rate

Engineering Productivity & DX

Cycle Time
Lead Time for Changes
PR Review Time
Deployment Frequency
Developer Satisfaction Score

Cost Efficiency & Resource Optimization

Cost per User Session
Cost per API Hit
Infra Utilization %
Cloud Spend per Feature

Performance Optimization

P95 API Latency
Backend Build Time
App Startup Time
Cache Hit Ratio
Database Query Latency
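
Percentile metrics such as P95 are worth computing by hand at least once to see how they differ from an average. This sketch uses the nearest-rank definition and made-up latency samples; monitoring tools may use interpolated or histogram-based percentiles that give slightly different values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical API latencies in milliseconds, including one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13,
                12, 11, 15, 14, 13, 12, 16, 14, 13, 480]

print(sum(latencies_ms) / len(latencies_ms))  # 36.75 (mean, skewed by the outlier)
print(percentile(latencies_ms, 50))           # 13
print(percentile(latencies_ms, 95))           # 16
print(percentile(latencies_ms, 99))           # 480
```

With only 20 samples a single outlier is invisible at P95 but dominates P99, which is why the chosen percentile should match the user experience you care about.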

Quality, Stability & Security

Code Coverage
Hotfix Rate
Rollback Rate
Checkout (CO) Success Rate
SonarQube Score
Security Issue Resolution Ratio

Innovation & Growth

R&D Initiative Completion Rate
New Tech Adoption %
Internal Tech Talks per Quarter
Training Hours per Engineer

Cross-Team Collaboration

Conflict Resolution Time
Tech-Biz OKR Alignment Score
Engineering Contributions to Shared Goals

3. Pillars vs. Metrics: A Simple Table

Pillar	Metric Example
Reliability	Uptime %, MTTR
Productivity & DX	Cycle Time, PR Merge Time
Cost Efficiency	Cost per Session, Infra Utilization
Performance	API Latency, App Startup Time
Quality & Security	Code Coverage, Bug Leakage Rate
Innovation	R&D Completion Rate, Tech Talks
Collaboration	OKR Alignment, Conflict Resolution Time

4. Anti-Patterns: What Happens When You Confuse the Two

Example 1:

You obsess over uptime, but ignore MTTR. The result? Systems stay up—until they don’t. When they crash, it takes hours to recover. You’re measuring the wrong thing.

Fix: Track both Uptime (proactive) and MTTR (reactive) under the Reliability pillar.
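
The gap between the two numbers is easy to see with a small calculation. This is a sketch under stated assumptions: the `Incident` fields, the 30-day window, and measuring MTTR from outage start to resolution are choices made here, and teams define these metrics in slightly different ways.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    started_min: int    # minute of the window when the outage began
    detected_min: int   # minute when the team noticed it
    resolved_min: int   # minute when service was restored


def reliability_summary(incidents, window_min):
    """Compute uptime %, MTTD, and MTTR (all in minutes) for one window."""
    n = len(incidents)
    downtime = sum(i.resolved_min - i.started_min for i in incidents)
    detect = sum(i.detected_min - i.started_min for i in incidents)
    return {
        "uptime_pct": 100.0 * (window_min - downtime) / window_min,
        "mttd_min": detect / n if n else 0.0,
        "mttr_min": downtime / n if n else 0.0,
    }


WINDOW_MIN = 30 * 24 * 60  # 30-day window
summary = reliability_summary([Incident(1000, 1045, 1240)], WINDOW_MIN)
print(round(summary["uptime_pct"], 2))  # 99.44
print(summary["mttd_min"])              # 45.0
print(summary["mttr_min"])              # 240.0
```

Uptime still looks healthy at 99.44%, yet the single incident took four hours to resolve, which only the MTTR figure reveals.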

Example 2:

You optimize for PR merge speed without context. Review quality plummets, bugs leak to prod, and velocity backfires.

Fix: Balance speed (Cycle Time) with review quality or test coverage.

Example 3:

You track too many unaligned metrics. Your dashboard is impressive but meaningless. Engineers feel overwhelmed, not empowered.

Fix: Anchor every metric to a pillar and business goal.

5. How to Structure & Operationalize

Step 1: Define Your Pillars

Use company objectives, system retros, and org pain points. Don’t copy-paste from others. Your pillars should reflect your context.

Step 2: Map Metrics to Each Pillar

2–5 meaningful metrics per pillar is a good starting point. Less is more, especially early on.

Step 3: Assign Ownership

Each pillar should have a driver (e.g., EM, Tech Lead) accountable for continuous improvement and reporting.
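
Steps 2 and 3 can be kept honest with a tiny registry check. The structure below is a hypothetical sketch: the pillar entries, owner names, and the `audit` function are illustrative, and the 2–5 range simply mirrors the guideline above.

```python
# Hypothetical registry: each pillar lists its owner and its metrics.
PILLARS = {
    "Reliability": {"owner": "Platform EM", "metrics": ["Uptime %", "MTTR"]},
    "Productivity & DX": {"owner": None, "metrics": ["Cycle Time"]},
}


def audit(pillars, min_metrics=2, max_metrics=5):
    """Return a list of problems: wrong metric count or missing owner."""
    problems = []
    for name, spec in pillars.items():
        count = len(spec["metrics"])
        if not min_metrics <= count <= max_metrics:
            problems.append(
                f"{name}: {count} metrics (expected {min_metrics}-{max_metrics})"
            )
        if not spec["owner"]:
            problems.append(f"{name}: no owner assigned")
    return problems


for problem in audit(PILLARS):
    print(problem)
```

Running a check like this during quarterly reviews flags pillars that have drifted into having too many, too few, or unowned metrics.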

Step 4: Embed in the Process

Use pillars in OKRs
Mention them in sprint reviews
Include pillar health in quarterly business reviews

Step 5: Review and Adapt

Pillars rarely change. Metrics do. Track monthly, reflect quarterly, refine yearly.

6. Practical Use Cases

Case A: Performance Regression on Checkout

Pillar: Performance Optimization
Metric: P95 latency, CO success rate
Action: Profile APIs, introduce pre-warming or caching

Case B: Developer Burnout from Delivery Pressure

Pillar: Productivity & DX
Metric: PR Cycle Time, Dev Satisfaction Survey
Action: Automate boilerplate, reduce context switching, enable focus blocks

Case C: Infra Costs Spike with Flat Traffic

Pillar: Cost Efficiency
Metric: Cost per Session, Infra Utilization
Action: Analyze low-efficiency services, right-size instances, optimize autoscaling

7. Templates & Playbooks

Quarterly Pillar Review Template:

Pillar Health (Red/Yellow/Green)
Key Metrics
Trends vs. Last Quarter
Upcoming Initiatives

Engineering Metric Audit Checklist:

Is this metric still relevant?
Is it tied to a pillar?
Is someone accountable for it?
Are we acting on it regularly?

Starter Pillar Set for Scaling Teams:

Reliability
Productivity
Quality
Cost

Add others as the org matures.

Final Thoughts

Pillars give purpose. Metrics give feedback.

Together, they create a system that is strategic, actionable, and resilient. When properly structured, they align engineers, guide investment, and let your team scale with clarity instead of chaos.

Don’t just track more. Track what matters.

Don’t just optimize metrics. Reinforce your pillars.

References

Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps.
Treacy, M., & Wiersema, F. (1993). Customer Intimacy and Other Value Disciplines. HBR.
Kerzner, H. (2017). Project Management: A Systems Approach to Planning, Scheduling, and Controlling.
Google DORA Research. (2023). State of DevOps Report.