<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Xiaoming Nian</title>
    <description>The latest articles on DEV Community by Xiaoming Nian (@xiaoming_nian_94953c8c9b8).</description>
    <link>https://dev.to/xiaoming_nian_94953c8c9b8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3607290%2Fe8246991-35c5-47c0-a1d4-61f2cf2491ab.jpg</url>
      <title>DEV Community: Xiaoming Nian</title>
      <link>https://dev.to/xiaoming_nian_94953c8c9b8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xiaoming_nian_94953c8c9b8"/>
    <language>en</language>
    <item>
      <title>Metric Tradeoffs in Data Science: Deciding When One Metric Goes Up and Another Goes Down</title>
      <dc:creator>Xiaoming Nian</dc:creator>
      <pubDate>Thu, 13 Nov 2025 07:54:59 +0000</pubDate>
      <link>https://dev.to/xiaoming_nian_94953c8c9b8/metric-tradeoffs-in-data-science-deciding-when-one-metric-goes-up-and-another-goes-down-1e55</link>
      <guid>https://dev.to/xiaoming_nian_94953c8c9b8/metric-tradeoffs-in-data-science-deciding-when-one-metric-goes-up-and-another-goes-down-1e55</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0murwon0qdu972p4rjj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0murwon0qdu972p4rjj.webp" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
In data science interviews — and in real-world product work — you’ll often face this classic dilemma:&lt;/p&gt;

&lt;p&gt;Metric A goes up 📈 but Metric B goes down 📉 — what should you do?&lt;/p&gt;

&lt;p&gt;Should you celebrate the improvement or worry about the decline?&lt;br&gt;
This post walks through a structured decision framework to help data scientists analyze such trade-offs logically and confidently.&lt;/p&gt;

&lt;p&gt;1️⃣ Identify: Real Degradation or Expected Behavior?&lt;br&gt;
The first step is to determine whether the drop is a true degradation or an expected behavioral shift caused by the product change.&lt;/p&gt;

&lt;p&gt;✅ Expected Behavior (Safe to Launch)&lt;br&gt;
Sometimes, what looks like a “drop” in one metric is actually a normal behavioral adjustment aligned with the product’s goal.&lt;/p&gt;

&lt;p&gt;Example: Meta Group Call Feature&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Result: DAU ↑ but Total Time Spent ↓&lt;/li&gt;
&lt;li&gt;Analysis: Users spend less total time because a single group call replaces several one-on-one calls, making communication more efficient.&lt;/li&gt;
&lt;li&gt;Key metric checks: DAU ↑, average time per session ↑, user engagement ↑&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion:&lt;br&gt;
The decrease in total time spent is expected behavior — not a real degradation.&lt;/p&gt;

&lt;p&gt;2️⃣ Mix Shift vs. Real Degradation&lt;br&gt;
    Sometimes, metrics decline not because the feature worsened but because of user composition changes — a phenomenon called mix shift.&lt;/p&gt;

&lt;p&gt;Example: Retention ↓ but DAU ↑&lt;/p&gt;

&lt;p&gt;Step 1: Segment Analysis&lt;br&gt;
Break down the DAU increase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New users vs. existing users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 2: Evaluate Each Segment&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If new users naturally have lower retention → Mix shift (✅ safe to launch)&lt;/li&gt;
&lt;li&gt;If both groups maintain or improve retention → Not degradation&lt;/li&gt;
&lt;li&gt;If both groups show lower retention → Real degradation (⚠️ requires further investigation)&lt;/li&gt;
&lt;/ul&gt;
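&lt;p&gt;A quick numeric sketch of the mix-shift case (all numbers hypothetical): each segment’s retention is unchanged, yet blended retention falls simply because the treatment attracts more new users, who naturally retain less.&lt;/p&gt;

```python
# Hypothetical mix-shift illustration: per-segment retention is identical in
# control and treatment, but the treatment adds more new users, so the blended
# retention rate drops even though nothing actually degraded.

def blended_retention(segments):
    """segments: list of (user_count, retention_rate) pairs."""
    total = sum(users for users, _ in segments)
    retained = sum(users * rate for users, rate in segments)
    return retained / total

control = [(8_000, 0.60), (2_000, 0.20)]    # (existing users, new users)
treatment = [(8_000, 0.60), (6_000, 0.20)]  # same rates, larger new-user share

print(round(blended_retention(control), 3))    # 0.52
print(round(blended_retention(treatment), 3))  # 0.429 -- lower, but only mix shift
```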

&lt;p&gt;3️⃣ Long-Term vs. Short-Term Trade-Offs&lt;br&gt;
When facing a real trade-off (e.g., engagement ↓ but ad revenue ↑), analyze user behavior patterns to assess risk.&lt;/p&gt;

&lt;p&gt;Scenario A: Loss from low-intent users only&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most core users remain engaged&lt;/li&gt;
&lt;li&gt;Risk: Low long-term impact&lt;/li&gt;
&lt;li&gt;Decision: Proceed or monitor safely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scenario B: Engagement drops across all users&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Risk: High — large-scale disengagement&lt;/li&gt;
&lt;li&gt;Decision: Delay or avoid launch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4️⃣ Build a Trade-Off Calculator&lt;br&gt;
Use historical experiment data to quantify relationships between key metrics and guide consistent decision-making.&lt;/p&gt;

&lt;p&gt;Example Framework&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relationship: 1% capacity cost → ≥2% engagement increase&lt;/li&gt;
&lt;li&gt;Decision rule: If a new test shows &amp;lt;2% engagement increase, don’t launch.&lt;/li&gt;
&lt;li&gt;Benefit: Standardizes decisions using empirically validated ratios.&lt;/li&gt;
&lt;/ul&gt;
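&lt;p&gt;The decision rule above can be sketched as a small helper. The 2%-engagement-per-1%-capacity exchange rate is a hypothetical benchmark you would fit from your own historical experiments.&lt;/p&gt;

```python
# Trade-off "calculator" sketch: launch only if the observed exchange rate
# between engagement gain and capacity cost meets a historical benchmark.
# The 2.0 threshold is a hypothetical value fit from past experiments.

REQUIRED_ENGAGEMENT_PER_CAPACITY = 2.0  # pct engagement gain per 1 pct capacity cost

def should_launch(engagement_gain_pct, capacity_cost_pct):
    if capacity_cost_pct == 0:
        return engagement_gain_pct > 0  # no cost: any gain passes
    rate = engagement_gain_pct / capacity_cost_pct
    return rate >= REQUIRED_ENGAGEMENT_PER_CAPACITY

print(should_launch(engagement_gain_pct=2.5, capacity_cost_pct=1.0))  # True
print(should_launch(engagement_gain_pct=1.5, capacity_cost_pct=1.0))  # False
```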

&lt;p&gt;Common Relationships to Track&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engagement gain per capacity cost&lt;/li&gt;
&lt;li&gt;Revenue per user engagement point&lt;/li&gt;
&lt;li&gt;Retention improvement per feature complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;5️⃣ Use Composite Metrics&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t rely on a single metric — build composite metrics that directly capture trade-offs between multiple objectives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Promo Cost per Incremental Order: before $3 per order, after $2 per order → cost efficiency improved&lt;/li&gt;
&lt;li&gt;Cost per Acquisition (CPA)&lt;/li&gt;
&lt;li&gt;Revenue per Marketing Dollar&lt;/li&gt;
&lt;li&gt;Engagement per Development Hour&lt;/li&gt;
&lt;/ul&gt;
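&lt;p&gt;The promo example can be computed directly. The spend and order counts below are made up, chosen only to reproduce the $3 → $2 figures.&lt;/p&gt;

```python
# Composite metric sketch: promo cost per incremental order.
# All inputs are hypothetical, picked to match the $3 -> $2 example.

def cost_per_incremental_order(promo_spend, orders_with_promo, baseline_orders):
    incremental = orders_with_promo - baseline_orders
    if incremental > 0:
        return promo_spend / incremental
    raise ValueError("promo drove no incremental orders")

before = cost_per_incremental_order(30_000, 25_000, 15_000)
after = cost_per_incremental_order(20_000, 25_000, 15_000)
print(before, after)  # 3.0 2.0 -- cost efficiency improved
```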

&lt;p&gt;🧭 Decision Framework Summary&lt;br&gt;
  First: Identify if the drop is real degradation or expected behavior.&lt;br&gt;
  Second: If it’s real, evaluate short-term vs. long-term trade-offs.&lt;br&gt;
  Third: Use historical benchmarks and trade-off calculators.&lt;br&gt;
  Fourth: Apply composite metrics to balance efficiency and outcome.&lt;/p&gt;

&lt;p&gt;💡 Key Takeaway&lt;br&gt;
When one metric goes up and another goes down, resist the urge to react emotionally.&lt;br&gt;
Instead, follow a structured, data-driven framework to understand why it happened, who it affected, and whether it aligns with your long-term product goals.&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>The Hidden Danger of P-Hacking in A/B Testing: When Curiosity Crosses the Line</title>
      <dc:creator>Xiaoming Nian</dc:creator>
      <pubDate>Wed, 12 Nov 2025 08:12:42 +0000</pubDate>
      <link>https://dev.to/xiaoming_nian_94953c8c9b8/the-hidden-danger-of-p-hacking-in-ab-testing-when-curiosity-crosses-the-line-16hb</link>
      <guid>https://dev.to/xiaoming_nian_94953c8c9b8/the-hidden-danger-of-p-hacking-in-ab-testing-when-curiosity-crosses-the-line-16hb</guid>
      <description>&lt;p&gt;In the world of data science and experimentation, we love finding “statistical significance.” That magical p &amp;lt; 0.05 feels like a stamp of scientific approval — a signal that our experiment “worked.” But what happens when our excitement to find meaning turns into manipulation, even unintentionally?&lt;/p&gt;

&lt;p&gt;Welcome to the world of p-hacking — the quiet villain behind countless misleading A/B test conclusions.&lt;/p&gt;

&lt;p&gt;What Is P-Hacking, Really?&lt;br&gt;
At its core, p-hacking means manipulating your analysis until you find a statistically significant result, whether or not that result truly reflects reality.&lt;/p&gt;

&lt;p&gt;It’s not always malicious. Sometimes, it’s as subtle as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peeking at results every few hours and stopping the test when p &amp;lt; 0.05.&lt;/li&gt;
&lt;li&gt;Dropping “noisy” data points because they make the results look messy.&lt;/li&gt;
&lt;li&gt;Trying multiple metrics or segmentations until one happens to be significant.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The danger? These actions inflate the probability of finding false positives — results that appear meaningful but are actually due to random chance.&lt;/p&gt;

&lt;p&gt;Why It’s So Tempting in A/B Testing&lt;br&gt;
A/B testing feels simple: run two variants, measure the difference, and declare a winner. But in practice, the process is full of judgment calls that can quietly open the door to p-hacking.&lt;/p&gt;

&lt;p&gt;Consider this scenario:&lt;br&gt;
You launch an experiment on a new homepage design. After three days, the conversion rate is up 4% with p = 0.04. You’re excited — it’s significant! But wait — your test was supposed to run two weeks. You stopped early because you “already saw the trend.”&lt;/p&gt;

&lt;p&gt;That’s a classic p-hack.&lt;br&gt;
The more often you check, the higher the chance you’ll catch a false signal that looks significant. In fact, if you peek every day, your true error rate might jump from 5% to 20% or more.&lt;/p&gt;
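&lt;p&gt;You can verify the peeking effect with a small Monte Carlo simulation (a sketch with made-up traffic numbers): when the null is true, stopping the first time p dips below 0.05 inflates the false-positive rate well beyond the nominal 5%.&lt;/p&gt;

```python
# Monte Carlo sketch of the "peeking" problem (all traffic numbers are made up).
# Both variants convert at the same 10% rate, so any significant result is a
# false positive. Peeking daily and stopping early inflates that rate.

import math
import random

def two_prop_p(sa, na, sb, nb):
    """Two-sided z-test p-value for a difference in proportions."""
    pool = (sa + sb) / (na + nb)
    se = math.sqrt(pool * (1 - pool) * (1 / na + 1 / nb))
    if se == 0:
        return 1.0
    z = abs(sa / na - sb / nb) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def false_positive_rate(trials, days, users_per_day, peek, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sa = sb = na = nb = 0
        significant = False
        for _ in range(days):
            na += users_per_day
            nb += users_per_day
            # both arms convert at the same 10% rate: the null is true
            sa += sum(1 for _ in range(users_per_day) if 0.1 > rng.random())
            sb += sum(1 for _ in range(users_per_day) if 0.1 > rng.random())
            if peek and 0.05 > two_prop_p(sa, na, sb, nb):
                significant = True  # stop early: the classic p-hack
                break
        if not peek:
            significant = 0.05 > two_prop_p(sa, na, sb, nb)
        if significant:
            hits += 1
    return hits / trials

print(false_positive_rate(500, 14, 200, peek=True))   # well above 0.05
print(false_positive_rate(500, 14, 200, peek=False))  # close to 0.05
```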

&lt;p&gt;The Psychology Behind It&lt;br&gt;
Humans are pattern-seeking creatures. We want our hypotheses to be right. We want to tell our stakeholders that the new recommendation system improved engagement or that our UX redesign boosted conversion.&lt;/p&gt;

&lt;p&gt;This emotional bias — the pressure to show progress — leads us to “massage the data” just enough to make the story work.&lt;/p&gt;

&lt;p&gt;The problem? When we do this across dozens of tests, we end up building on illusions. False wins pile up, and real learnings get buried under statistical noise.&lt;/p&gt;

&lt;p&gt;How to Avoid P-Hacking&lt;br&gt;
Here’s how to keep your A/B testing honest — and your data credible:&lt;/p&gt;

&lt;p&gt;Pre-register your hypotheses.&lt;br&gt;
Define what you’re testing before you run the experiment. List your primary metric, segmentation, and duration upfront.&lt;/p&gt;

&lt;p&gt;Stick to fixed test durations.&lt;br&gt;
Avoid peeking or stopping early unless you’re using a proper sequential testing framework, such as group-sequential designs with alpha spending or Bayesian methods.&lt;/p&gt;

&lt;p&gt;Correct for multiple comparisons.&lt;br&gt;
If you test multiple metrics or segments, use corrections (e.g., Bonferroni, Holm-Bonferroni, or False Discovery Rate) to maintain integrity.&lt;/p&gt;
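&lt;p&gt;For intuition, here is a minimal sketch of two such corrections applied to the same hypothetical p-values: Bonferroni, which controls the family-wise error rate, and Benjamini–Hochberg, which controls the false discovery rate.&lt;/p&gt;

```python
# Minimal multiple-comparison corrections on hypothetical p-values.
# Bonferroni controls the family-wise error rate; Benjamini-Hochberg
# controls the false discovery rate (less conservative).

def bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    return [alpha / m > p for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0  # largest rank whose p-value clears its step-up threshold
    for rank, i in enumerate(order, start=1):
        if alpha * rank / m >= p_values[i]:
            max_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if max_rank >= rank:
            reject[i] = True
    return reject

pvals = [0.001, 0.013, 0.04, 0.045, 0.2]
print(bonferroni(pvals))          # [True, False, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, False, False, False]
```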

&lt;p&gt;Focus on practical significance.&lt;br&gt;
A p-value of 0.049 doesn’t mean much if the effect size is negligible. Ask: Would this result matter to users or business outcomes?&lt;/p&gt;

&lt;p&gt;Promote a culture of learning, not winning.&lt;br&gt;
Teams that reward genuine insights (including null results) are less likely to p-hack. The goal isn’t to “prove” — it’s to understand.&lt;/p&gt;

&lt;p&gt;The Real Cost of P-Hacking&lt;br&gt;
P-hacking doesn’t just mislead data scientists — it misleads entire organizations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad decisions get shipped to millions of users.&lt;/li&gt;
&lt;li&gt;False confidence undermines trust in experimentation.&lt;/li&gt;
&lt;li&gt;Wasted time and resources accumulate chasing fake improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, this erodes the most valuable thing in data science: credibility.&lt;/p&gt;

&lt;p&gt;Final Thoughts&lt;br&gt;
P-hacking is seductive because it rewards us now — a statistically significant result, a green light, a presentation win.&lt;br&gt;
But in the long run, it poisons our understanding of what actually works.&lt;/p&gt;

&lt;p&gt;As data scientists, our job isn’t to find significance — it’s to find truth.&lt;br&gt;
And sometimes, the truth is that nothing changed. And that’s perfectly okay.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nc26b42spmvfz43g4cd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nc26b42spmvfz43g4cd.jpg" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>career</category>
      <category>ai</category>
      <category>data</category>
    </item>
  </channel>
</rss>
