<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GrowthBook</title>
    <description>The latest articles on DEV Community by GrowthBook (@growthbook).</description>
    <link>https://dev.to/growthbook</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9540%2Fed7b9b79-4c3a-402a-80d1-559eb1ee5cc9.png</url>
      <title>DEV Community: GrowthBook</title>
      <link>https://dev.to/growthbook</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/growthbook"/>
    <language>en</language>
    <item>
      <title>How The Social Hub Cut Experimentation Costs by 82%</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Sat, 24 Jan 2026 03:07:05 +0000</pubDate>
      <link>https://dev.to/growthbook/how-the-social-club-cut-experimentation-costs-by-82-2nnj</link>
      <guid>https://dev.to/growthbook/how-the-social-club-cut-experimentation-costs-by-82-2nnj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv7mavf48arvzpfktth1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv7mavf48arvzpfktth1.webp" alt="How The Social Club Cut Experimentation Costs by 82%" width="630" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rudger de Groot of&lt;/em&gt; &lt;a href="https://www.mintminds.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;Mintminds&lt;/em&gt;&lt;/a&gt; &lt;em&gt;shared how&lt;/em&gt; &lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;The Social Hub&lt;/em&gt;&lt;/a&gt; &lt;em&gt;slashed its experimentation costs with GrowthBook. By driving down the incremental cost per experiment as close as possible to zero, companies can run as many experiments on as much traffic as they want.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The best experimentation programs scale cost-efficiently, so they can run more experiments, learn faster, and ship smarter. But a hidden cost killer is BigQuery query inefficiency. The more you test, the more you pay. What if there were a way to test more and pay less?&lt;/p&gt;

&lt;p&gt;In this case study, we’ll show you how &lt;a href="https://www.mintminds.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Mintminds&lt;/a&gt; cut experimentation costs for &lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The Social Hub&lt;/a&gt; using GrowthBook with BigQuery optimizations from &lt;a href="https://ga4dataform.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GA4Dataform by Superform Labs&lt;/a&gt;. The setup slashed BigQuery costs by 81.8% while improving data refresh speeds and monitoring capabilities. Here's how they did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Scaling Advantage Built into the Cost Structure
&lt;/h2&gt;

&lt;p&gt;The mission at Mintminds is simple: build high-quality experiments with reliable data and analysis. &lt;a href="https://www.growthbook.io/pricing?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GrowthBook’s pricing model&lt;/a&gt; allows for a setup where the more you test, the lower your per-experiment cost. But to optimize costs, you need to understand where money actually flows. Let’s break down the pricing:&lt;/p&gt;

&lt;p&gt;Fixed Costs (GrowthBook Pro, as of Nov 2025):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$40/month per seat for GrowthBook Pro license&lt;/li&gt;
&lt;li&gt;Typical team size: 5 seats = $200/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Variable Costs (GrowthBook Cloud):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 million CDN requests included (≈ pageviews)&lt;/li&gt;
&lt;li&gt;20 GB CDN bandwidth included&lt;/li&gt;
&lt;li&gt;Overage: $10 per million requests, $1 per GB bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-Hosting Alternative: You can eliminate CDN costs by &lt;a href="https://hub.docker.com/r/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;self-hosting GrowthBook&lt;/a&gt; for $11-50/month (depending on your infrastructure choice).&lt;/p&gt;

&lt;h2&gt;
  
  
  How Experimentation Costs Compare
&lt;/h2&gt;

&lt;p&gt;To understand how GrowthBook experimentation costs compare, Mintminds shares a real-world example from a client with 2.6 million unique users per month, running 5–7 experiments a month. In this example, they run the GrowthBook JS SDK on Cloudflare Pages, which means no limit on the number of tested visitors, at no cost. Yes, you read that right: free!&lt;/p&gt;

&lt;p&gt;The variable GrowthBook costs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6.6 million CDN requests: 6.6M − 2M included = 4.6M overage × $10 = $46&lt;/li&gt;
&lt;li&gt;6 GB CDN bandwidth: $0 (first 20 GB are included)&lt;/li&gt;
&lt;li&gt;BigQuery usage cost estimation with daily updates: $300&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fixed GrowthBook Pro costs for a team of 5 members: 5 * $40 = $200&lt;/p&gt;
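
&lt;p&gt;As a sanity check, the arithmetic above fits in a few lines of Python. This is a minimal sketch using the list prices quoted in this post; the 81.8% reduction applied at the end is the BigQuery saving reported below, and the optimized result lands within rounding distance of the $303 in the table that follows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of the cost math above (prices as quoted, Nov 2025).
INCLUDED_REQUESTS_M = 2.0     # first 2 million CDN requests included
INCLUDED_BANDWIDTH_GB = 20.0  # first 20 GB of CDN bandwidth included
PRICE_PER_M_REQUESTS = 10.0   # dollars per extra million requests
PRICE_PER_GB = 1.0            # dollars per extra GB of bandwidth
SEAT_PRICE = 40.0             # dollars per Pro seat per month

def monthly_cost(requests_m, bandwidth_gb, bigquery_usd, seats):
    request_overage = max(0.0, requests_m - INCLUDED_REQUESTS_M)
    bandwidth_overage = max(0.0, bandwidth_gb - INCLUDED_BANDWIDTH_GB)
    variable = (request_overage * PRICE_PER_M_REQUESTS
                + bandwidth_overage * PRICE_PER_GB
                + bigquery_usd)
    return variable + seats * SEAT_PRICE

# The example client: 6.6M requests, 6 GB bandwidth, $300 BigQuery, 5 seats
print(monthly_cost(6.6, 6.0, 300.0, 5))                # 546.0 (unoptimized)
print(monthly_cost(6.6, 6.0, 300.0 * (1 - 0.818), 5))  # about 301 (optimized)
&lt;/code&gt;&lt;/pre&gt;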

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Annual Cost&lt;/th&gt;
&lt;th&gt;vs. GrowthBook Optimized&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Convert.com Pro&lt;/td&gt;
&lt;td&gt;$3,488&lt;/td&gt;
&lt;td&gt;$41,856&lt;/td&gt;
&lt;td&gt;1,050% more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VWO Pro&lt;/td&gt;
&lt;td&gt;$4,308&lt;/td&gt;
&lt;td&gt;$51,696&lt;/td&gt;
&lt;td&gt;1,320% more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GrowthBook (Unoptimized)&lt;/td&gt;
&lt;td&gt;$546&lt;/td&gt;
&lt;td&gt;$6,552&lt;/td&gt;
&lt;td&gt;80% more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GrowthBook (Optimized)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$303&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,640&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With BigQuery costs included, GrowthBook remains dramatically cheaper than traditional alternatives like Convert ($3,500/month) or VWO ($4,300/month) at comparable traffic levels. GrowthBook is already the smart financial choice. With optimization, it becomes unbeatable: per the table above, optimized GrowthBook cuts experimentation costs by roughly 91% versus Convert.com Pro and 93% versus VWO Pro.&lt;/p&gt;

&lt;p&gt;An 82% BigQuery reduction transforms GrowthBook from “very affordable” to an offer you simply can’t refuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  GA4 Structure Wastes BigQuery Resources
&lt;/h2&gt;

&lt;p&gt;Regardless of hosting choice, BigQuery becomes your primary variable cost when using GA4 as your data source. For companies running active experimentation programs with daily updates, Mintminds finds that unoptimized BigQuery costs can easily reach $200 to $400/month.&lt;/p&gt;

&lt;p&gt;The default GrowthBook BigQuery integration queries GA4’s standard &lt;code&gt;events_*&lt;/code&gt; and &lt;code&gt;events_intraday_*&lt;/code&gt; tables. These tables store event parameters in nested structures, forcing BigQuery to process far more data than necessary.&lt;/p&gt;

&lt;p&gt;For example, when you’re running experiments with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 metrics (1 goal + 1 secondary + 3 guardrails)&lt;/li&gt;
&lt;li&gt;3 dimensions for segmentation&lt;/li&gt;
&lt;li&gt;Daily (or more frequent) data refreshes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BigQuery has to scan through nested arrays and repeated fields to extract the specific event parameters you need. You’re paying to process gigabytes of data when you only need megabytes of relevant information.&lt;/p&gt;
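
&lt;p&gt;To see the difference in practice, the sketch below (Python, with the official google-cloud-bigquery client) dry-runs one query against the nested GA4 export and one against a flattened table, reporting the bytes each would scan without incurring any cost. The project, dataset, table, and parameter names are illustrative placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# pip install google-cloud-bigquery (GCP credentials required)
from google.cloud import bigquery

client = bigquery.Client()

# Raw GA4 export: every parameter lives in a nested event_params array,
# so BigQuery has to scan and UNNEST the whole column to find one key.
nested_sql = """
SELECT
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params)
   WHERE key = 'experiment_id') AS experiment_id
FROM `my_project.analytics_123456.events_*`
WHERE event_name = 'experiment_viewed'
"""

# Flattened, partitioned table (the kind GA4Dataform produces): the
# parameter is a plain column and the date filter prunes partitions.
flattened_sql = """
SELECT user_pseudo_id, experiment_id
FROM `my_project.ga4dataform.events_flat`
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
  AND event_name = 'experiment_viewed'
"""

# Dry runs report bytes that would be processed, free of charge.
config = bigquery.QueryJobConfig(dry_run=True)
for label, sql in [("nested", nested_sql), ("flattened", flattened_sql)]:
    job = client.query(sql, job_config=config)
    print(label, job.total_bytes_processed, "bytes")
&lt;/code&gt;&lt;/pre&gt;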

&lt;p&gt;GrowthBook allows &lt;a href="https://blog.growthbook.io/fact-table-query-optimization/" rel="noopener noreferrer"&gt;custom fact tables&lt;/a&gt; and metrics to select only relevant events and parameters. This helps, but optimizations plateau quickly because you’re still querying nested GA4 tables.&lt;/p&gt;

&lt;p&gt;Enterprise customers get access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Advanced fact table query optimization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.growthbook.io/app/data-pipeline?ref=blog.growthbook.io#incremental-refresh-recommended" rel="noopener noreferrer"&gt;Data pipelines&lt;/a&gt; (significantly improved in GrowthBook 4.2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But Pro license users need a different approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use GA4Dataform's Flattened Datasets to Reduce Query Costs
&lt;/h2&gt;

&lt;p&gt;At #CH2024 (the conference formerly known as Conversion Hotel), Rudger connected with &lt;a href="https://www.linkedin.com/in/stuifbergen/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Jules Stuifbergen&lt;/a&gt; from &lt;a href="https://ga4dataform.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Superform Labs&lt;/a&gt; about this exact challenge. Jules introduced him to GA4Dataform, which offered an elegant solution.&lt;/p&gt;

&lt;p&gt;What GA4Dataform Does: The Core Version (free!) creates a customized, flattened dataset optimized for the type of queries that GrowthBook uses.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fully flattened structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No nested fields = dramatically faster queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Smart partitioning and clustering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Restricting queries by date and event name reduces the number of rows scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Smaller data footprint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Less data processed = lower BigQuery costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily automated updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fresh data from GA4 events table is appended to the table, using incremental logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key insight: Even though you’re creating a new dataset in BigQuery (which feeds from the generic GA4 table), the flattened structure makes it cheaper to generate AND cheaper to query than repeatedly querying GA4’s nested tables.&lt;/p&gt;

&lt;p&gt;Bonus benefit: This same optimized dataset can be used for all your other BigQuery reports and dashboards, compounding the savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Rigorous A/A Experiment to Test the Setup
&lt;/h2&gt;

&lt;p&gt;Mintminds partnered with &lt;a href="https://www.linkedin.com/in/laurasemeraro-marketinganalytics/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Laura Semeraro&lt;/a&gt; and the team at &lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The Social Hub&lt;/a&gt;—a hybrid hospitality brand offering hotel rooms, co-living spaces, coworking facilities, and creative playgrounds across Europe—to validate this approach with real data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Using GA4Dataform's flattened datasets didn't just reduce GrowthBook costs—it optimized all our BigQuery reports and dashboards.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt; &lt;span&gt; &lt;span&gt;Laura Semeraro&lt;/span&gt;, &lt;span&gt;Digital Analyst&lt;/span&gt; at &lt;span&gt; &lt;span&gt;The Social Hub&lt;/span&gt; &lt;/span&gt; &lt;/span&gt; &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Implementation Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. GA4Dataform Setup&lt;/strong&gt; – Laura installed GA4Dataform Core (the free version) and added GrowthBook’s custom event parameters (experiment ID and variation ID) to the configuration. With the daily schedule enabled, GA4Dataform updates the flat events table incrementally and automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. GrowthBook Configuration&lt;/strong&gt; – Mintminds created a new assignment query (for counting experiment visitors) and built fact tables for the key conversion events: add-to-cart and purchase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A/A Test Design&lt;/strong&gt; – They ran two identical experiments simultaneously:&lt;/p&gt;

&lt;p&gt;Configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same targeting rules&lt;/li&gt;
&lt;li&gt;Same 5 metrics (1 goal, 1 secondary, 3 guardrails)&lt;/li&gt;
&lt;li&gt;Same 3 dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Only Difference:&lt;/p&gt;

&lt;p&gt;Experiment A: Default GrowthBook queries (nested GA4 tables)&lt;br&gt;&lt;br&gt;
Experiment B: Optimized queries (flattened GA4Dataform dataset)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Measurement&lt;/strong&gt; – GrowthBook usage is automatically labelled in BigQuery, making it possible to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BigQuery costs from Experiment A (old approach)&lt;/li&gt;
&lt;li&gt;BigQuery costs from Experiment B (new approach)&lt;/li&gt;
&lt;li&gt;BigQuery costs for daily dataset updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test duration: 1 week&lt;/p&gt;

&lt;p&gt;This gave them an objective, apples-to-apples comparison.&lt;/p&gt;
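
&lt;p&gt;Because those jobs are labelled, the spend for each approach can be read straight out of BigQuery’s own metadata. Here is a hedged sketch; the region qualifier, the label key, and the on-demand price of roughly $6.25 per TiB are assumptions to adapt to your own project:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT SUM(total_bytes_billed) AS bytes_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  -- assumed label key; inspect a GrowthBook-issued job to find yours
  AND EXISTS (SELECT 1 FROM UNNEST(labels) l WHERE l.key = 'integration')
"""
bytes_billed = list(client.query(sql))[0].bytes_billed or 0
print(f"est. 7-day cost: ${bytes_billed / 2**40 * 6.25:.2f}")
&lt;/code&gt;&lt;/pre&gt;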

&lt;h2&gt;
  
  
  The Social Hub Reduced BigQuery Costs by 82%
&lt;/h2&gt;

&lt;p&gt;When the results came in, Rudger and his team had to verify the numbers multiple times to ensure accuracy: a whopping 81.8% cost reduction and a massive query speed improvement, too.&lt;/p&gt;

&lt;p&gt;By using the GA4Dataform flattened dataset instead of the default GA4 nested tables, they had reduced BigQuery data processing by more than four-fifths.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update experiment results more frequently&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Better SRM and MDE monitoring without budget concerns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Run updates faster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flattened queries execute in a fraction of the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale experiment volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The "more you test, less you pay" promise becomes reality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimize other analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use the same flattened dataset for all BigQuery dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The compounding effect:&lt;/strong&gt; lower per-experiment costs + faster refresh rates = dramatically better experimentation program ROI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise Experimentation at a Fraction of the Cost
&lt;/h2&gt;

&lt;p&gt;This case study demonstrates how to achieve exceptional BigQuery efficiency with GrowthBook. By combining GrowthBook Pro, GA4Dataform Core, and strategic BigQuery optimization, you can build a cost-effective, high-performance experimentation stack that rivals enterprise setups—at a fraction of the price. The cost reduction Mintminds achieved with The Social Hub isn’t an outlier. It’s the new baseline for GrowthBook implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Our Partners
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mintminds.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Mintminds&lt;/a&gt; is a &lt;a href="https://www.mintminds.com/growthbook/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Certified GrowthBook partner&lt;/a&gt; based in the Netherlands. Founded by &lt;a href="https://www.linkedin.com/in/rudgerdegroot/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Rudger de Groot&lt;/a&gt;, the team assists companies worldwide with hyper-scaling experimentation using GrowthBook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The Social Hub&lt;/a&gt; is a European hospitality brand that blends traditional hotel stays with a vibrant, community-focused experience. Its unique hybrid model combines premium design-led short and long-stay hotel rooms with student accommodation, coworking spaces, meeting and event facilities, restaurants and bars, 24-hour gyms, and open-to-the-public spaces like rooftops, parks, and cultural venues.&lt;/p&gt;

</description>
      <category>abtesting</category>
    </item>
    <item>
      <title>AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 20 Jan 2026 15:53:06 +0000</pubDate>
      <link>https://dev.to/growthbook/ai-evals-vs-ab-testing-why-you-need-both-to-ship-genai-54n7</link>
      <guid>https://dev.to/growthbook/ai-evals-vs-ab-testing-why-you-need-both-to-ship-genai-54n7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2ab9e2oo12pt7rrodd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2ab9e2oo12pt7rrodd8.png" alt="AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most teams building with GenAI are flying blind. They've replaced unit tests with vibes and shipped prompts that "felt right" to three engineers on a Friday afternoon.&lt;/p&gt;

&lt;p&gt;This isn't a criticism—it's a diagnosis. For decades, we operated under a &lt;strong&gt;deterministic paradigm&lt;/strong&gt;. The contract between developer and machine was explicit: &lt;code&gt;Input A + Code = Output B&lt;/code&gt;. Always, without fail. In this world, success was binary. A unit test passed or it failed.&lt;/p&gt;

&lt;p&gt;Generative AI has shattered this contract. We have moved from deterministic engineering to &lt;strong&gt;probabilistic engineering&lt;/strong&gt;. We are no longer building binaries; we are managing stochastic agents that produce a distribution of probable outputs. You cannot &lt;code&gt;assert(x == y)&lt;/code&gt; when &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; can change every time.&lt;/p&gt;
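
&lt;p&gt;A toy sketch makes the contrast concrete. The &lt;code&gt;generate()&lt;/code&gt; function below is a hypothetical stand-in for any LLM call; the point is that a single assertion gives way to grading a distribution of outputs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def generate(prompt):
    # Hypothetical stand-in for an LLM call: output varies run to run.
    return random.choice(["4", "four", "The answer is 4."])

def looks_correct(answer):
    return "4" in answer or "four" in answer.lower()

# Deterministic world: one input, one expected output, pass or fail.
assert 2 + 2 == 4

# Probabilistic world: sample many outputs and gate on a pass rate.
samples = [generate("What is 2 + 2?") for _ in range(100)]
pass_rate = sum(looks_correct(s) for s in samples) / len(samples)
print(f"pass rate: {pass_rate:.0%}")
&lt;/code&gt;&lt;/pre&gt;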

&lt;p&gt;Gian Segato (Anthropic) eloquently sums up this shift: “We are no longer guaranteed what &lt;code&gt;x&lt;/code&gt; is going to be, and we're no longer certain about the output &lt;code&gt;y&lt;/code&gt; either, because it's now drawn from a distribution…. Stop for a moment to realize what this means. When building on top of this technology, our products can now succeed in ways we’ve never even imagined, and fail in ways we never intended” (&lt;a href="https://giansegato.com/essays/probabilistic-era?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Building AI Products In The Probabilistic Era&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;As seismic as this shift may be, we’re focusing on a single aspect of it here: the shift from the domain of &lt;strong&gt;verification&lt;/strong&gt; (is it correct?) to the domain of &lt;strong&gt;validation&lt;/strong&gt; (is it good?).&lt;/p&gt;

&lt;p&gt;This shift has left teams scrambling to define quality. Many have fallen into the trap of thinking AI Evaluations (Evals) are a replacement for A/B testing. They aren't.&lt;/p&gt;

&lt;p&gt;And, for those in a hurry, here’s the point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Evals&lt;/strong&gt; check for competence—&lt;em&gt;can the model do the job?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.growthbook.io/products/experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;A/B testing&lt;/strong&gt;&lt;/a&gt; checks for &lt;strong&gt;value&lt;/strong&gt; —&lt;em&gt;do users care?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You cannot ship a good AI product without both AI Evals and A/B testing.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Limits of Vibe Checking
&lt;/h2&gt;

&lt;p&gt;In the early days of the LLM boom, “Prompt Engineering” was largely a feeling-based art. Devs would tweak a prompt, run it three times, read the output, and decide if it “felt” better.&lt;/p&gt;

&lt;p&gt;This manual inspection—“vibe checking”—leverages human intuition, which is great for nuance but terrible for scale.&lt;/p&gt;

&lt;p&gt;Vibe checking suffers from three critical flaws:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sample size:&lt;/strong&gt; You might test 5 inputs. Production brings 50k edge cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression invisibility:&lt;/strong&gt; Making a prompt “polite” might accidentally break its ability to output valid JSON. You won’t feel that until the API breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subjectivity:&lt;/strong&gt; One engineer’s “concise” is another’s “curt.”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As ML systems researcher &lt;a href="https://thingsithinkithink.blog/posts/2025/06-08-llm-evals-lesson-1/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Shreya Shankar notes&lt;/a&gt;, “You can’t vibe check your way to understanding what’s going on.” Manual inspection is mathematically insufficient for understanding probabilistic systems at scale.&lt;/p&gt;

&lt;p&gt;To solve this, the industry turned to &lt;strong&gt;AI Evals&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;💡 For an excellent intro to AI Evals, check out &lt;a href="https://www.youtube.com/watch?v=BsWxPI9UM4c&amp;amp;ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Shreya Shankar and Hamel Husain on Lenny’s Podcast&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are AI Evals?
&lt;/h2&gt;

&lt;p&gt;AI Evaluations are an attempt to systematize the vibe check—turning qualitative judgment into quantitative metrics. They're a way to programmatically test the probabilistic parts of your application: prompts, models, and parameters.&lt;/p&gt;

&lt;p&gt;But the term "Eval" is overloaded. When someone says "we're running evals," they might mean any of three things.&lt;/p&gt;

&lt;h3&gt;
  
  
  3 Types of AI Evals and Why They Matter
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Model Evals
&lt;/h4&gt;

&lt;p&gt;Model evals are benchmarks like MMLU or HumanEval. They're useful for choosing a provider (GPT-5 vs. Claude Opus 4.5), but they tell you almost nothing about your specific application. A model might ace GSM8K (math reasoning) and still be a terrible customer service agent. Worse, these public benchmarks are increasingly contaminated—models have seen the test questions during training, inflating scores that don't transfer to novel problems. (We wrote a whole article about why “&lt;a href="https://blog.growthbook.io/the-benchmarks-are-lying/" rel="noopener noreferrer"&gt;The Benchmarks Are Lying To You&lt;/a&gt;.”)&lt;/p&gt;

&lt;h4&gt;
  
  
  2. System Evals
&lt;/h4&gt;

&lt;p&gt;System evals are what matter most. These test your end-to-end pipeline: prompt + RAG retrieval + model. The key metrics here are things like hallucination rate, faithfulness (does the answer stick to the retrieved context?), and relevance.&lt;/p&gt;

&lt;p&gt;Many teams now use &lt;strong&gt;LLM-as-Judge&lt;/strong&gt;—a strong model grading outputs on subjective criteria like tone, helpfulness, and coherence. It scales better than human review, but inherits the same limitation: it measures whether an answer &lt;em&gt;seems&lt;/em&gt; good, not whether users &lt;em&gt;act&lt;/em&gt; on it.&lt;/p&gt;
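
&lt;p&gt;The pattern itself is simple, as this minimal sketch shows. Here &lt;code&gt;call_model()&lt;/code&gt; is a hypothetical stand-in for whatever judge-model client you use, and the rubric and scores are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;JUDGE_RUBRIC = """You are grading a customer-support answer.
Score 1-5 for tone, helpfulness, and coherence.
Reply with only the three integers, comma-separated."""

def call_model(system, user):
    # Hypothetical judge-model call; canned reply so the sketch runs.
    return "4, 5, 4"

def judge(question, answer):
    reply = call_model(JUDGE_RUBRIC, f"Q: {question}\nA: {answer}")
    tone, helpful, coherent = (int(x) for x in reply.split(","))
    return {"tone": tone, "helpfulness": helpful, "coherence": coherent}

print(judge("Where is my order?", "It ships tomorrow."))
&lt;/code&gt;&lt;/pre&gt;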

&lt;h4&gt;
  
  
  3. Guardrails
&lt;/h4&gt;

&lt;p&gt;Guardrails are real-time safety checks—toxicity filters, PII detection, jailbreak prevention. Important, but a different concern than quality.&lt;/p&gt;

&lt;p&gt;All three share a critical constraint: they measure &lt;em&gt;competence&lt;/em&gt;, not &lt;em&gt;value&lt;/em&gt;. Whether you run evals offline in your CI/CD pipeline against a curated "Golden Dataset," or online against live traffic in shadow mode, you're still asking the same question: &lt;em&gt;Can this model do the job?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some evals do capture preference—human ratings, side-by-side comparisons, thumbs up/down. But these are still proxies. A user clicking "thumbs up" in a sandbox isn't the same as a user returning to your product tomorrow. Evals measure &lt;em&gt;stated&lt;/em&gt; preference; A/B tests measure &lt;em&gt;revealed&lt;/em&gt; preference through behavior.&lt;/p&gt;

&lt;p&gt;What evals can't tell you is whether users will care enough to stick around.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Where Evals Fall Short&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even within the realm of evals, a model that looks good in controlled conditions can fall apart in production.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://careersatdoordash.com/blog/how-to-investigate-the-online-vs-offline-performance-for-dnn-models/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;DoorDash engineering team&lt;/a&gt; documented this problem in detail. They built a new ad-ranking model that performed well in testing—but when deployed to real users, its accuracy dropped by 4.3%. The culprit? Their test data was &lt;em&gt;too clean&lt;/em&gt;. The model had been trained assuming it would always have fresh, up-to-date information about users. But in the real world, that data was often hours or days old due to system delays. The model had been optimized for conditions that didn't exist in production.&lt;/p&gt;

&lt;p&gt;This principle applies even more to LLM applications. LLMs are sensitive to prompt phrasing, context length, and retrieval quality—all of which behave differently in production than in curated test sets.&lt;/p&gt;

&lt;p&gt;Consider a concrete example: you optimize a customer service prompt for &lt;em&gt;faithfulness&lt;/em&gt;—it sticks strictly to your knowledge base and never hallucinates. Evals look great. But in production, users find the responses robotic and impersonal. Satisfaction drops. You optimized for accuracy; they wanted empathy.  &lt;/p&gt;

&lt;p&gt;This is the core limitation of evals: they measure capability, not value. Even when you run evals against live traffic, you're testing whether the model &lt;em&gt;can&lt;/em&gt; do something—not whether that something &lt;em&gt;matters&lt;/em&gt; to users.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why You Should Use A/B Testing with Your AI Evals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If evals are the unit test, A/B testing is the integration test with reality. It’s the only way to measure what actually matters: downstream business impact like retention, revenue, conversion, engagement, and user satisfaction.&lt;/p&gt;

&lt;p&gt;But running A/B tests on LLMs introduces challenges that didn't exist in traditional web experimentation. (For an introduction to the topic, see our &lt;a href="https://blog.growthbook.io/how-to-a-b-test-ai-a-practical-guide/" rel="noopener noreferrer"&gt;practical guide to A/B testing AI&lt;/a&gt;.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges of Running A/B Tests on AI
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The Latency Confound
&lt;/h4&gt;

&lt;p&gt;Intelligence usually costs speed. If you test a fast, simple model against a smart, slow one and the variant loses—why? Was the answer worse or did users just hate waiting three seconds?&lt;/p&gt;

&lt;p&gt;Isolating "intelligence" as a variable often requires artificial latency injection: intentionally slowing the control to match the variant. Only then can you measure what you think you're measuring.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. High Variance
&lt;/h4&gt;

&lt;p&gt;LLMs are non-deterministic. Two users in the same variant might see meaningfully different responses. This noise demands larger sample sizes and longer test durations to reach statistical significance.&lt;/p&gt;

&lt;p&gt;A button-color test might reach significance in a few thousand sessions. An LLM prompt test—where output variance is high and effect sizes are often small—might need 10x that, or weeks of runtime, to detect a meaningful difference.&lt;/p&gt;
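
&lt;p&gt;The standard two-proportion sample-size formula shows why. A sketch using the usual normal approximation (5% alpha, 80% power; the numbers are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from statistics import NormalDist

def n_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    # Normal-approximation sample size for comparing two proportions.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p = p_baseline + mde_abs / 2  # midpoint proportion
    return 2 * (z_a + z_b) ** 2 * p * (1 - p) / mde_abs ** 2

print(round(n_per_arm(0.10, 0.02)))    # ~3,800 users/arm for a 2pp lift
print(round(n_per_arm(0.10, 0.005)))   # ~58,000 users/arm for a 0.5pp lift
&lt;/code&gt;&lt;/pre&gt;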

&lt;h4&gt;
  
  
  3. Choosing the Right Metric
&lt;/h4&gt;

&lt;p&gt;Choosing the right metric is harder for AI features than for traditional UI changes. A chatbot might increase engagement (users ask more questions) while decreasing efficiency (they take longer to get answers). Align your success metric with actual business value, not just surface activity.&lt;/p&gt;

&lt;p&gt;These realities create a tension. A/B testing AI gives you certainty, but certainty takes time. If you have twenty prompts to evaluate, a traditional A/B test could take months. And during those months, a significant portion of your users are experiencing inferior variants.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Multi-Armed Bandits
&lt;/h3&gt;

&lt;p&gt;For prompt optimization—where iterations are cheap, and the cost of a suboptimal variant is low—&lt;a href="https://docs.growthbook.io/bandits/overview?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;multi-armed bandits&lt;/a&gt; offer a different trade-off. Instead of fixed traffic allocation, they dynamically shift users toward winning variants as data accumulates. You sacrifice some statistical rigor for speed and reduced regret.&lt;/p&gt;
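
&lt;p&gt;To show the mechanic, here is a compact sketch of one classic bandit algorithm, Beta-Bernoulli Thompson sampling. It is an illustration of the general technique under assumed conversion rates, not GrowthBook’s exact implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

arms = {name: {"wins": 1, "losses": 1}
        for name in ("prompt_a", "prompt_b", "prompt_c")}

def choose_arm():
    # Sample a plausible conversion rate from each arm's Beta posterior
    # and serve the arm whose sampled rate is highest.
    draws = {name: random.betavariate(s["wins"], s["losses"])
             for name, s in arms.items()}
    return max(draws, key=draws.get)

def record(name, converted):
    arms[name]["wins" if converted else "losses"] += 1

# Simulation: prompt_b secretly converts best; traffic drifts toward it.
true_rates = {"prompt_a": 0.05, "prompt_b": 0.08, "prompt_c": 0.04}
served = {name: 0 for name in arms}
for _ in range(5000):
    arm = choose_arm()
    served[arm] += 1
    record(arm, random.random() &amp;lt; true_rates[arm])
print(served)  # most of the 5,000 sessions end up on prompt_b
&lt;/code&gt;&lt;/pre&gt;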

&lt;p&gt;🎰 &lt;a href="https://blog.growthbook.io/introducing-multi-armed-bandits-in-growthbook/" rel="noopener noreferrer"&gt;Check out our deep dive on how multi-armed bandits work in GrowthBook.&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Comparing A/B Testing to Multi-Armed Bandits
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;A/B Testing&lt;/th&gt;
&lt;th&gt;Multi-Armed Bandits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Goal&lt;/td&gt;
&lt;td&gt;Knowledge. Determine with statistical certainty if B is better than A.&lt;/td&gt;
&lt;td&gt;Reward. Maximize total conversions during the experiment.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic Allocation&lt;/td&gt;
&lt;td&gt;Fixed for the duration.&lt;/td&gt;
&lt;td&gt;Dynamic. Automatically shifts traffic to the winner.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Use Case&lt;/td&gt;
&lt;td&gt;Major model launches, pricing, UI changes&lt;/td&gt;
&lt;td&gt;Prompt optimization, headline testing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bandits aren't a replacement for A/B testing. They're a complement—best suited for rapid iteration loops where you're optimizing within a validated direction, not making major strategic bets.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use AI Evals and A/B Testing Together
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lfph7atyypz82jj3bt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lfph7atyypz82jj3bt1.png" alt="AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At GrowthBook, we see the highest-performing teams treating evals and experimentation not as separate islands, but as a continuous pipeline—each stage filtering out risk with progressively more expensive (but more accurate) methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using AI Evals and A/B Testing Together in Practice
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Stage 1: The Offline Filter (CI/CD)
&lt;/h4&gt;

&lt;p&gt;A developer creates a new prompt branch. The CI/CD pipeline automatically runs evals against the Golden Dataset. If faithfulness drops below 90% or latency exceeds the threshold, the build fails. Bad ideas die here, costing pennies in API credits rather than user trust.&lt;/p&gt;
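
&lt;p&gt;A minimal sketch of such a gate follows; the dataset, scoring stubs, and thresholds are illustrative, and a real pipeline would swap in an actual prompt call and scorer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sys

FAITHFULNESS_FLOOR = 0.90  # build fails below this, as in the example above
LATENCY_CEILING_S = 3.0

GOLDEN = [  # in practice, a versioned golden dataset file
    {"input": "How do I reset my password?", "context": "Reset via settings."},
]

def run_prompt(text):
    # Hypothetical stand-in for calling the new prompt branch.
    return "You can reset it from the settings page.", 1.2

def faithfulness_score(answer, context):
    # Hypothetical scorer; in practice an LLM judge or NLI model.
    return 0.95

scores, latencies = [], []
for case in GOLDEN:
    answer, latency = run_prompt(case["input"])
    scores.append(faithfulness_score(answer, case["context"]))
    latencies.append(latency)

ok = (sum(scores) / len(scores) &amp;gt;= FAITHFULNESS_FLOOR
      and max(latencies) &amp;lt;= LATENCY_CEILING_S)
sys.exit(0 if ok else 1)  # a nonzero exit code fails the CI build
&lt;/code&gt;&lt;/pre&gt;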

&lt;h4&gt;
  
  
  Stage 2: Shadow Mode (Production, Silent)
&lt;/h4&gt;

&lt;p&gt;The prompt passes offline evals and gets deployed—but users never see it. The new model processes live traffic silently, logging predictions without surfacing them.&lt;/p&gt;

&lt;p&gt;This is online evaluation: you're still measuring competence (latency, accuracy, edge case handling), but now against real-world conditions. &lt;a href="https://careersatdoordash.com/blog/how-to-investigate-the-online-vs-offline-performance-for-dnn-models/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;DoorDash's 4% accuracy gap&lt;/a&gt; between testing and production is exactly the kind of discrepancy shadow mode is designed to surface—before users experience the degraded results.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 3: Safe Rollout
&lt;/h4&gt;

&lt;p&gt;Shadow mode passes. Feature flags gradually release the new model to users. You're monitoring guardrail metrics: error rates, refusal spikes, support tickets. If something tanks, you flip the flag and revert instantly—no code rollback required.&lt;/p&gt;

&lt;p&gt;🦺 Use GrowthBook's &lt;a href="https://docs.growthbook.io/features/safe-rollouts?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Safe Rollouts&lt;/a&gt; to monitor guardrail metrics and roll back automatically.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Stage 4: The A/B Test (Causal Proof)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The rollout survives. Now you run the real experiment: new model vs. baseline, measured on business metrics. Not "faithfulness" but retention. Not "relevance" but conversion. This is the only stage that proves value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: AI Evals plus A/B Testing for GenAI
&lt;/h2&gt;

&lt;p&gt;You cannot A/B test a broken model. It’s reckless. And you cannot Eval your way to product-market fit. It’s guesswork.&lt;/p&gt;

&lt;p&gt;To ship generative AI that's both safe and profitable, you need both: rigorous evals to ensure competence, and robust A/B testing to prove value. The pipeline between them—shadow mode, safe rollouts—is how you get from one to the other without breaking things.&lt;/p&gt;

&lt;p&gt;As Segato warned, our products can now fail in ways we never intended. This pipeline is how we catch those failures before users do.&lt;/p&gt;

&lt;p&gt;We've moved from &lt;em&gt;is it correct?&lt;/em&gt; to &lt;em&gt;is it good?&lt;/em&gt; Evals answer the first question. A/B tests answer the second. You need both.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can AI Evals replace A/B testing?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
No. AI Evals and A/B testing serve different purposes in the development lifecycle. Evals measure competence—accuracy, safety, tone—whether run offline or online. A/B testing measures business value through revealed user behavior: retention, revenue, conversion. Evals tell you the model works; A/B tests tell you it's worth shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between Offline and Online Evaluation?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Offline evaluation happens pre-deployment using a static Golden Dataset to check for regressions and quality. Online evaluation happens in production using live traffic (e.g., shadow mode). Both measure competence, but online evaluation catches issues—like feature staleness or latency spikes—that don't appear in controlled conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you handle latency when A/B testing LLMs?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Latency is a major confounding variable because "smarter" models are often slower. If a slower model performs worse, it's unclear if users disliked the answer or the wait time. To fix this, engineers use Artificial Latency Injection—intentionally slowing down the control group to match the variant's response time, isolating "intelligence" as the single variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is "Vibe Checking" in AI development?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
"Vibe checking" is the informal process of manually inspecting a few model outputs to see if they "feel" right. While useful for early exploration, it is unscalable and statistically flawed for production systems because it fails to account for edge cases, regressions, or large-scale user preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should I use a Multi-Armed Bandit instead of an A/B test?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use a Multi-Armed Bandit when your goal is optimization (maximizing reward) rather than knowledge (statistical significance). MABs are ideal for testing prompt variations or content recommendations because they automatically route traffic to the winning variation, minimizing regret. Use A/B tests for major architectural changes or risky launches where you need certainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best way to deploy AI models safely?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use a staged pipeline. Start with offline evals in CI/CD to catch regressions. Then use shadow mode to test against live traffic silently. Next, use feature flags to release to a small percentage of users while monitoring guardrails. Finally, run a full A/B test to measure business impact. Each stage filters out risk before exposing users to problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is LLM-as-Judge?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
LLM-as-Judge is an evaluation technique where a strong model (like GPT-4 or Claude) grades the outputs of your system on subjective criteria such as tone, helpfulness, and coherence. It scales better than human review but shares the same limitation as other evals: it measures whether an answer seems good, not whether users will act on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between stated and revealed preference in AI evaluation?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Stated preference is what users say they like—thumbs up ratings, side-by-side comparisons in a sandbox. Revealed preference is what users actually do—returning to your product, completing tasks, converting. Evals capture stated preference; A/B tests capture revealed preference. The two often diverge.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>experimentation</category>
    </item>
    <item>
      <title>Dark Patterns in A/B Testing: How Short-Term Optimization Leads to Product Enshittification</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Mon, 12 Jan 2026 22:07:30 +0000</pubDate>
      <link>https://dev.to/growthbook/dark-patterns-in-ab-testing-how-short-term-optimization-leads-to-product-enshittification-2p7i</link>
      <guid>https://dev.to/growthbook/dark-patterns-in-ab-testing-how-short-term-optimization-leads-to-product-enshittification-2p7i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj465e58ek5w52rdwfk8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj465e58ek5w52rdwfk8y.png" alt="Dark Patterns in A/B Testing: How Short-Term Optimization Leads to Product Enshittification" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why optimizing for short-term A/B test wins can degrade user trust and product quality. A look at common dark patterns in experimentation, why they “work,” and how better metrics can help teams build products that create real long-term value.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A post supposedly from a software engineer at a meal delivery company went &lt;a href="https://www.reddit.com/r/confession/comments/1q1mzej/im_a_developer_for_a_major_food_delivery_app_the/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;viral recently&lt;/u&gt;&lt;/a&gt;. It accused the unnamed company of unscrupulously manipulating pricing, fees, and salaries to increase revenue. One of the things it described was an A/B test on a “Priority delivery” fee. According to the post, no product changes were made to speed up priority deliveries; instead, regular deliveries were delayed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We actually ran an A/B test last year where we didn't speed up the priority orders, we just purposefully delayed non-priority orders by 5 to 10 minutes to make the Priority ones "feel" faster by comparison. Management loved the results. We generated millions in pure profit just by making the standard service worse, not by making the premium service better.” (Source: Reddit)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While there are questions about the post’s veracity, such dark patterns are absolutely used in A/B testing and product development. That raises an important question about the ethics of these techniques in experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Are Dark Patterns?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Dark patterns are product design or implementation choices that deliberately nudge, coerce, or mislead users into behaviors that primarily benefit the company. They often come at the expense of the user’s understanding or long-term satisfaction. &lt;/p&gt;

&lt;p&gt;For a comprehensive taxonomy, see &lt;a href="https://www.deceptive.design/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;deceptive.design&lt;/u&gt;&lt;/a&gt;, which catalogs these patterns in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Are Dark Patterns Used in A/B Testing?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the context of &lt;a href="https://blog.growthbook.io/what-is-a-b-testing/" rel="noopener noreferrer"&gt;&lt;u&gt;A/B testing&lt;/u&gt;&lt;/a&gt;, dark patterns typically appear when experiments are optimized narrowly for short-term business metrics, such as a conversion rate, without regard for whether the underlying change actually improves the product. Often they are introduced as a response to an organization’s goal metric that fails to capture the complete picture (see &lt;a href="https://blog.growthbook.io/goodharts-law-and-the-dangers-of-metric-selection-with-a-b-testing/" rel="noopener noreferrer"&gt;&lt;u&gt;Goodhart’s Law and the dangers of metric selection&lt;/u&gt;&lt;/a&gt;). &lt;/p&gt;

&lt;h3&gt;
  
  
  Common Dark Patterns Used in Experiments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Artificial degradation:&lt;/strong&gt; Making a baseline experience worse (for example, slowing delivery times as above, or adding friction) so that a paid tier or alternative appears more attractive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Obscured choice:&lt;/strong&gt; Designing UI variants that make it harder to opt out, cancel, or choose a lower-cost option, then validating them via A/B tests that show higher revenue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price obfuscation:&lt;/strong&gt; Experimenting with fees, surcharges, or defaults in ways that users only discover late in the funnel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emotional manipulation:&lt;/strong&gt; Leveraging urgency, guilt, or fear (“Only 2 left!”, “People like you choose…”) to drive behavior, then justifying it with statistically significant lifts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A/B testing itself is not the problem. The problem is using experimentation as a shield: “the data says it works” becomes a way to avoid asking whether the outcome is aligned with user value or long-term trust. It hides the real question of whether we should do this at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Short-Term Wins, Long-Term Costs of Unethical Experimentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Dark patterns can look good in the short term. They are engineered to do so. Revenue goes up, conversion improves, and dashboards turn green. These tactics exploit the goodwill of your current user base and the blind spots of long-term measurement, producing lifts that show up immediately. The costs, however, tend to be delayed and externalized.&lt;/p&gt;

&lt;p&gt;Dark patterns in A/B testing introduce several long-term risks for organizations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reputational Risk&lt;/strong&gt;
Users are not irrational. They may not always articulate why they are unhappy, but they notice when a product feels hostile, manipulative, or designed to nickel-and-dime them. Trust erodes quietly, and then suddenly. When stories like the viral post above surface (whether accurate or not), they resonate precisely because users already suspect this behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legislative and Regulatory Risk&lt;/strong&gt;
Many dark patterns operate in gray areas that are increasingly of interest to regulators. Fee transparency, deceptive defaults, and coercive UX are now explicitly called out in regulations in multiple jurisdictions (see the EU’s Digital Services Act (DSA) and the California Privacy Rights Act (CPRA)). An A/B test that boosts revenue today can become legal exposure tomorrow, complete with internal documentation showing intent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal and Cultural Risk&lt;/strong&gt;
Engineers, designers, and PMs generally want to build products that help people. When teams are repeatedly asked to ship features that intentionally worsen user experience, morale suffers. The best people notice. Over time, this can lead to disengagement or attrition, especially among senior contributors who have other options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk from Competition&lt;/strong&gt;
Dark patterns that don’t improve the product open the door, over the long term, for competitors to build a genuinely better product and put your company at risk.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words, dark patterns trade long-term value for short-term gains. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Practical Solutions to Avoid Dark Patterns in Experimentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are some practical ways to help reduce these risks and avoid the enshittification of products. Chief among these are adopting value principles and establishing ethics committees. &lt;/p&gt;

&lt;p&gt;Value principles, like Google’s “Don’t be evil”, are frequently treated as aspirational marketing artifacts rather than operational constraints. Many are vague, non-actionable, and open to interpretation, which provides no meaningful protection against dark patterns. And even if they are actionable and adopted as policy, they can come into tension with other incentives at the company, such as bonuses or career progression. Google, after all, ditched “Don’t be evil” in 2018.&lt;/p&gt;

&lt;p&gt;Ethics committees are used at some larger companies to ensure consistent application of company values. However, they can face the same issues as the values above, particularly if the company is facing financial pressure; the ethics team can be high on the &lt;a href="https://arstechnica.com/tech-policy/2023/03/amid-bing-chat-controversy-microsoft-cut-an-ai-ethics-team-report-says/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;list of cuts&lt;/u&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The most practical way to avoid dark patterns is not an ethics committee or a vague principle statement; &lt;strong&gt;it is using the right metrics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you only measure immediate revenue or conversion, you will eventually design experiments that extract value rather than create it. To counteract this, teams need to deliberately include metrics that reflect longer-term outcomes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimentation Metrics That Help Avoid Dark Patterns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.growthbook.io/experiment-metrics-simplified-retention-count-distinct-max/" rel="noopener noreferrer"&gt;&lt;u&gt;Retention&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Repeat usage&lt;/li&gt;
&lt;li&gt;Complaint rates&lt;/li&gt;
&lt;li&gt;Refunds&lt;/li&gt;
&lt;li&gt;Customer support contacts&lt;/li&gt;
&lt;li&gt;Brand sentiment&lt;/li&gt;
&lt;li&gt;Qualitative feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not all of these can be measured perfectly, or measured at all (like the likelihood or cost of losing key employees). In the real world, the data will never be perfect. Good product judgment will still be required, as there will always be uncertainty. An experiment that produces a short-term lift but plausibly damages trust should be treated with skepticism, even if the lift is large.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When Experimentation Leads to a Better Product&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ultimately, the goal of &lt;a href="https://www.growthbook.io/products/experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;experimentation&lt;/u&gt;&lt;/a&gt; is not to prove that you can move a number. It is to learn how to make something people genuinely want. A/B testing is a powerful tool in the service of that goal, but the further you drift from it, the more your “wins” become signals of underlying enshittification rather than progress. Make sure your metrics reflect your real goals as much as possible.  &lt;/p&gt;

&lt;p&gt;In the long run, the most effective optimization strategy remains the simplest: make the product better.&lt;/p&gt;

</description>
      <category>abtesting</category>
      <category>experimentation</category>
    </item>
    <item>
      <title>7 Steps to Better Experiment Design</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Mon, 22 Dec 2025 05:40:30 +0000</pubDate>
      <link>https://dev.to/growthbook/7-steps-to-better-experiment-design-5fnf</link>
      <guid>https://dev.to/growthbook/7-steps-to-better-experiment-design-5fnf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jol7quy1oh36vgrxmzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jol7quy1oh36vgrxmzx.png" alt="7 Steps to Better Experiment Design" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A practical checklist for running A/B tests you can trust&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From predictive model accuracy at &lt;a href="https://facebook.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Facebook&lt;/strong&gt;&lt;/a&gt; and experiment design at &lt;a href="https://x.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;X&lt;/a&gt; (formerly Twitter), to building the experimentation platform used by Dropbox, Sony, and Upstart with &lt;a href="https://growthbook.io/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;GrowthBook&lt;/strong&gt;&lt;/a&gt;, I've spent the last six years shaping how some of the largest tech companies measure success and ship features.&lt;/p&gt;

&lt;p&gt;Across companies, industries, and scales, I’ve seen the same pattern repeat: experimentation rarely fails because teams don’t understand A/B testing mechanics. It fails because experiments are poorly designed—unclear goals, misaligned metrics, weak baselines, flawed randomization, or decisions made without a plan for ambiguous results.&lt;/p&gt;

&lt;p&gt;The teams that get the most value from experimentation aren’t running more tests. They’re running &lt;strong&gt;better ones&lt;/strong&gt;. They’re deliberate about what they’re trying to learn and disciplined about how results turn into decisions.&lt;/p&gt;

&lt;p&gt;This article distills the &lt;strong&gt;most reliable experiment design practices I’ve learned from years of work in the field&lt;/strong&gt;. If you already know how A/B testing works and want results you can trust—and act on—these seven steps are a strong place to start.&lt;/p&gt;

&lt;p&gt;(For a deeper technical walkthrough, see GrowthBook’s &lt;a href="https://docs.growthbook.io/using/experimentation-best-practices?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Experimentation Best Practices&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Define the Goal Clearly&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every experiment should answer a specific question.&lt;/p&gt;

&lt;p&gt;Start by writing down the problem you’re trying to solve in plain language. Is it activation? Retention? Conversion efficiency?&lt;/p&gt;

&lt;p&gt;A good test of clarity is whether you can write a concrete hypothesis, such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Users who complete the new onboarding flow will reach the activation milestone 10% more often than users in the existing flow.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Clear goals prevent experiments from drifting into vague “did anything change?” territory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; Teams at &lt;a href="https://www.growthbook.io/customers/dropbox?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Dropbox&lt;/strong&gt;&lt;/a&gt; use tightly framed hypotheses to avoid shipping changes that move surface-level engagement but fail to improve long-term collaboration or retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Choose the Right Success Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the goal is clear, metrics follow.&lt;/p&gt;

&lt;p&gt;Every experiment should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One primary metric&lt;/strong&gt; that defines success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A set of secondary metrics&lt;/strong&gt; for context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail metrics&lt;/strong&gt; to catch unintended harm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Focusing on too many metrics creates confusion. Tracking too few hides important tradeoffs—especially when multiple metrics are evaluated simultaneously (see GrowthBook’s guidance on &lt;a href="https://docs.growthbook.io/statistics/multiple-corrections?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;multiple testing corrections&lt;/a&gt;).  &lt;/p&gt;

&lt;p&gt;Use your secondary metrics to improve your understanding of what drives your primary metric. They also help you periodically check in on your primary metric, ensuring it is well defined and driving you toward your business goals.&lt;/p&gt;

&lt;p&gt;Teams at &lt;a href="https://www.growthbook.io/customers/khan-academy?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Khan Academy&lt;/strong&gt;&lt;/a&gt; use experimentation to iterate on learning experiences while remaining deeply thoughtful about how success is measured in an educational context.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Know Your Baseline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can’t interpret change without knowing where you started.&lt;/p&gt;

&lt;p&gt;Before launching an experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand current performance&lt;/li&gt;
&lt;li&gt;Measure normal variance&lt;/li&gt;
&lt;li&gt;Calibrate expectations for realistic lift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A change from 4% to 5% conversion is only meaningful if you know how stable 4% really is.&lt;/p&gt;
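
&lt;p&gt;A quick way to see why: the week-to-week noise in a 4% baseline depends entirely on sample size. A sketch using the normal approximation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from math import sqrt

def weekly_range(p, n_per_week):
    # Approximate 95% band for one week's conversion-rate readout.
    se = sqrt(p * (1 - p) / n_per_week)
    return p - 1.96 * se, p + 1.96 * se

print(weekly_range(0.04, 2_000))   # ~3.1%..4.9%: a "5% week" means little
print(weekly_range(0.04, 50_000))  # ~3.8%..4.2%: now 5% is a real move
&lt;/code&gt;&lt;/pre&gt;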

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; One GrowthBook customer—a large European marketplace—moved away from before-and-after analysis after realizing they couldn’t separate real lift from seasonality. Establishing proper baselines made results interpretable and decisions easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Understand Leading vs. Lagging Indicators&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not all metrics respond at the same speed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leading indicators&lt;/strong&gt; provide fast feedback and are often better suited for short-term experiments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lagging indicators&lt;/strong&gt; validate long-term impact and strategic alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High-performing teams use both, but they’re intentional about which metric actually determines success.&lt;/p&gt;

&lt;p&gt;Optimizing only for lagging indicators slows learning. Ignoring them risks local optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Define the Experiment Population and Randomization Strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Decide who should be included in the experiment—and exclude everyone else.&lt;/p&gt;

&lt;p&gt;Best practices include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Randomizing users as close to the experience as possible&lt;/li&gt;
&lt;li&gt;Ensuring assignment persists across sessions&lt;/li&gt;
&lt;li&gt;Using a true control group&lt;/li&gt;
&lt;li&gt;Keeping designs simple when traffic is limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t have enough users, avoid multi-variant tests.&lt;/p&gt;
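
&lt;p&gt;To make “assignment persists across sessions” concrete, here’s a minimal sketch of deterministic, hash-based bucketing keyed on a stable user ID. Experimentation SDKs (GrowthBook’s included) do something similar internally, though the exact hashing differs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

VARIANTS = (("control", 0.5), ("treatment", 0.5))

def assign_variant(user_id, experiment_key, variants=VARIANTS):
    """Same user + same experiment = same variant, in every session."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # uniform-ish float in [0, 1]
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if bucket &lt;= cumulative:
            return name
    return variants[-1][0]

# Deterministic: re-running yields the same assignment.
assert assign_variant("user-42", "new-onboarding") == assign_variant("user-42", "new-onboarding")
&lt;/code&gt;&lt;/pre&gt;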

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; One GrowthBook customer, a major European retailer, was running underpowered tests. They moved from partial traffic to testing on 100% of visitors—dramatically reducing time to confidence and revealing insights that challenged long-held assumptions.&lt;/p&gt;

&lt;p&gt;If you’re using feature flags to control exposure, GrowthBook’s approach to &lt;a href="https://docs.growthbook.io/feature-flag-experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;running experiments with feature flags&lt;/a&gt; is designed specifically for this kind of setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Validate Your Setup Before You Trust Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can’t analyze what you can’t connect.&lt;/p&gt;

&lt;p&gt;Before launching real experiments, confirm that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposure data joins cleanly with outcome data&lt;/li&gt;
&lt;li&gt;Identifiers are consistent&lt;/li&gt;
&lt;li&gt;Metrics are computed correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then run an &lt;strong&gt;A/A test&lt;/strong&gt;—two identical variants with no visible change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; Teams operating at scale use A/A tests to catch instrumentation and analysis issues early. If multiple uncorrelated metrics “win” in a no-change test, or multiple A/A tests fail with clear issues, something is broken. GrowthBook strongly recommends this as a validation step (&lt;a href="https://docs.growthbook.io/kb/experiments/aa-tests?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;A/A testing documentation&lt;/a&gt;).&lt;/p&gt;
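
&lt;p&gt;Below is a minimal sketch of that check as a two-proportion z-test (assumes scipy; the counts are invented). In an A/A test both arms are identical, so at a 5% significance level roughly 1 metric in 20 should “win” by chance alone; much more than that suggests a broken setup:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from math import sqrt
from scipy.stats import norm

# Invented A/A counts: both arms saw the identical experience.
n_a, conv_a = 50_000, 2_510
n_b, conv_b = 50_120, 2_430

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))                  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.3f}")
# A tiny p-value here, with no real change shipped, points to broken
# assignment, duplicated exposures, or a bad join -- not a real effect.
&lt;/code&gt;&lt;/pre&gt;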

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Decide How Long to Run the Experiment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ending experiments early increases false positives. Letting them run forever slows learning.&lt;/p&gt;

&lt;p&gt;Plan duration in advance based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected variance&lt;/li&gt;
&lt;li&gt;Minimum detectable effect&lt;/li&gt;
&lt;li&gt;Available traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need flexibility, approaches like &lt;a href="https://docs.growthbook.io/statistics/sequential?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;sequential testing&lt;/a&gt; can help—but only if you understand the tradeoffs.&lt;/p&gt;
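
&lt;p&gt;Those three inputs translate directly into a required sample size, and from there into a duration. A closed-form sketch for a two-proportion test, with illustrative numbers (assumes scipy):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from math import ceil
from scipy.stats import norm

baseline = 0.04                     # current conversion rate
relative_mde = 0.10                 # minimum detectable effect: +10% relative lift
alpha, power = 0.05, 0.80

p1 = baseline
p2 = baseline * (1 + relative_mde)
z_alpha = norm.ppf(1 - alpha / 2)   # 1.96
z_beta = norm.ppf(power)            # 0.84

# Standard two-proportion sample-size formula, per variant.
n = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
users_per_day_per_variant = 4_000   # whatever your traffic split provides

print(f"{ceil(n):,} users per variant")
print(f"~{ceil(n / users_per_day_per_variant)} days at current traffic")
&lt;/code&gt;&lt;/pre&gt;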

&lt;h2&gt;
  
  
  &lt;strong&gt;Bonus: Plan for All Outcomes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Only &lt;strong&gt;10–30% of experiments produce a clear winner&lt;/strong&gt;. That’s normal.&lt;/p&gt;

&lt;p&gt;High-performing teams plan for this reality &lt;em&gt;before&lt;/em&gt; launching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-cost features may ship on directional evidence&lt;/li&gt;
&lt;li&gt;High-cost features require stronger confidence&lt;/li&gt;
&lt;li&gt;Neutral results still generate valuable learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Experiments aren’t always about maximizing win rates. In some cases, they prevent huge losses. In other cases, their primary value is learning about user behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Experimentation isn’t about proving you’re right. It’s about discovering what’s true.&lt;/p&gt;

&lt;p&gt;Every experiment—even a neutral one—teaches you something about your users and your assumptions. Teams that stay curious, document learnings, and iterate deliberately are the ones that compound results over time.&lt;/p&gt;

&lt;p&gt;That’s what turns experimentation into a real competitive advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;FAQ: Experimentation &amp;amp; A/B Testing in Practice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do you decide whether an A/B test result is actionable?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When the results all point to the same decision, even when accounting for uncertainty. If you would ship even if the results were at the bottom end of the confidence intervals and you've collected a reasonable amount of data, ship!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are so many A/B test results inconclusive?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Because most product changes simply don’t meaningfully change behavior. Neutral results often reveal what users don’t care about, guiding better future experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long should an experiment run?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Long enough to reach sufficient statistical power—not until a metric looks good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should you ship a result that isn’t statistically significant?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For low-risk, low-cost changes with stable guardrails. High-risk features need stronger confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s the biggest mistake teams make with experimentation?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Treating experimentation as validation instead of learning.&lt;/p&gt;

</description>
      <category>abtesting</category>
      <category>datascience</category>
      <category>experimentation</category>
    </item>
    <item>
      <title>Announcing GrowthBook 4.2: Product Analytics &amp; Experimentation at Scale</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 11 Nov 2025 19:43:04 +0000</pubDate>
      <link>https://dev.to/growthbook/announcing-growthbook-42-product-analytics-experimentation-at-scale-4di4</link>
      <guid>https://dev.to/growthbook/announcing-growthbook-42-product-analytics-experimentation-at-scale-4di4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwxioxqhjz1dalcwtflv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwxioxqhjz1dalcwtflv.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At GrowthBook, our mission is to provide the insights you need to build better products that grow your business faster. With GrowthBook 4.2, we’ve added a beta version of GrowthBook &lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Product Analytics&lt;/strong&gt;&lt;/a&gt;. Now our users will have a single integrated platform for feature management, experimentation, and product analytics.&lt;/p&gt;

&lt;p&gt;In addition, we’ve continued to enhance the developer experience, making &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;&lt;strong&gt;experimentation at scale&lt;/strong&gt;&lt;/a&gt; and integration into any stack easier than ever. Finally, for companies seeking an alternative to Statsig, our &lt;a href="https://docs.growthbook.io/guide/migrate-from-statsig?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Statsig to GrowthBook Migration Kit&lt;/strong&gt;&lt;/a&gt; automates importing feature gates and dynamic configs while replacing Statsig SDKs with GrowthBook SDKs.&lt;/p&gt;

&lt;p&gt;Release 4.2 is available immediately to both our cloud and self-hosted users. Visit our &lt;a href="https://www.growthbook.io/pricing?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Pricing page&lt;/u&gt;&lt;/a&gt; for details about Starter, Pro, and Enterprise options. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GrowthBook Product Analytics (Beta)&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Adding &lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Product Analytics&lt;/a&gt; to the GrowthBook platform closes the loop for development. Now, you can go from feature management to experimentation to product analytics in a single tool. While in beta, Product Analytics will be available to all users.&lt;/p&gt;

&lt;p&gt;Turn your warehouse data and metrics into actionable product insights. Explore user behavior, share dashboards, and make smarter decisions about what to build next. With Product Analytics, you will be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build and share dashboards that combine graphs, pivot tables, and text&lt;/li&gt;
&lt;li&gt;Create custom charts and tables from any data in your warehouse&lt;/li&gt;
&lt;li&gt;Use GrowthBook SQL Explorer with our AI-powered text-to-SQL capabilities to query, aggregate, and group data&lt;/li&gt;
&lt;li&gt;Access any metric defined in GrowthBook and track its performance over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7duughip42uahjhlpsk.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Build charts with any data in your warehouse using SQL Explorer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtae5wapj7o8zyyz0l6z.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Analyze any metrics defined in GrowthBook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcaczrm1pdt6x8w1z83f.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="315"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Slice and dice data with flexible pivot tables&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This Product Analytics beta provides a glimpse of what’s to come as GrowthBook develops more self-service tools for building, analyzing, and exploring all of your product data. Let us know what you think in our &lt;a href="https://join.slack.com/t/growthbookusers/shared_invite/zt-2xw8fu279-Y~hwnfCEf7WrEI9qScHURQ/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://docs.growthbook.io/guide/migrate-from-statsig?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Statsig to GrowthBook Migration Kit&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;With the OpenAI acquisition of Statsig, we saw a spike in interest in GrowthBook. Product teams looking for alternatives expressed concern about what would happen to their data. Others worried that the product might be discontinued or deprioritized. To make the transition from the acquired platform to an open-source alternative as effortless as possible, we created the &lt;a href="https://docs.growthbook.io/guide/migrate-from-statsig?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Statsig to GrowthBook Migration Kit&lt;/a&gt;, free for all users.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Statsig Importer&lt;/strong&gt; instantly copies over feature gates, dynamic configs, and segments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statsig Code Migration Tool&lt;/strong&gt; (powered by Claude Code) automatically replaces Statsig SDKs with GrowthBook SDKs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Enterprise Enhancements
&lt;/h2&gt;

&lt;p&gt;The 4.2 features below continue our investment in the developer experience that makes GrowthBook a top choice for product development teams with high-volume apps and advanced experimentation programs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;Metric Slices: Simplify Experiment Design&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;When users create experiments, they often want to look at metrics across common dimensions like product category or device type, which quickly multiplies the number of metrics they have to manage. &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;Metric slices&lt;/a&gt; solve this problem. Enable auto slices on a Fact Metric once, and GrowthBook automatically generates drill-down analyses for each dimension value across all experiments using that metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlks5dki77n8f5lp463g.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="473"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;View revenue per user metric by product category&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of creating separate “Orders” metrics for each product category or device type, you can enable &lt;em&gt;Auto Slices&lt;/em&gt; on those columns of a single metric, which means fewer redundant metrics, faster setup, and cleaner reporting.&lt;/p&gt;
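
&lt;p&gt;To illustrate the underlying idea (this is a generic sketch with invented data, not GrowthBook’s implementation), slicing amounts to grouping one metric definition by a dimension instead of cloning the metric per dimension value:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Invented fact table: one row per order, tagged with the experiment variant.
orders = pd.DataFrame({
    "variant":  ["control", "treatment"] * 3,
    "category": ["apparel", "apparel", "equipment", "equipment", "apparel", "equipment"],
    "revenue":  [30.0, 42.0, 120.0, 95.0, 18.0, 140.0],
})

# One metric definition ("average revenue"), automatically sliced by category:
slices = orders.groupby(["category", "variant"])["revenue"].mean().unstack()
print(slices)    # one drill-down row per category, zero duplicated metric definitions
&lt;/code&gt;&lt;/pre&gt;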

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/data-pipeline?ref=blog.growthbook.io#incremental-refresh-recommended" rel="noopener noreferrer"&gt;Incremental Refresh&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;We revamped our &lt;strong&gt;Data Pipeline Mode&lt;/strong&gt; to lower query costs and improve performance for long-running experiments and high-traffic apps. By storing intermediate results and incrementally refreshing them, we’ve seen users save up to &lt;strong&gt;85% in query costs&lt;/strong&gt;. This first version is available on BigQuery, Presto, and Trino. We’ll be adding support for more data warehouses based on customer demand.&lt;/p&gt;
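
&lt;p&gt;The general pattern behind savings like this, sketched generically (the idea, not GrowthBook’s pipeline code): persist intermediate aggregates, then fold in only rows newer than a watermark on each refresh instead of rescanning history:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import datetime as dt

# State persisted between refreshes: running aggregates plus a watermark.
state = {"users": 182_340, "conversions": 9_125,
         "watermark": dt.datetime(2025, 11, 1)}

def incremental_refresh(state, fetch_rows_since):
    """Fold only new rows into stored aggregates; cost scales with new data."""
    for row in fetch_rows_since(state["watermark"]):   # e.g., WHERE ts &gt; watermark
        state["users"] += 1
        state["conversions"] += row["converted"]
        state["watermark"] = max(state["watermark"], row["ts"])
    return state
&lt;/code&gt;&lt;/pre&gt;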

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/official-resources?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Official Metrics&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Many organizations rely on a trusted set of “&lt;a href="https://docs.growthbook.io/official-resources?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;official&lt;/a&gt;” metrics. GrowthBook now makes these easier to manage by letting admins mark and edit official metrics directly from the UI (previously API-only). This helps standardize measurement, reduce confusion, and promote consistency across teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/query-optimization?ref=blog.growthbook.io#sql-template-variables" rel="noopener noreferrer"&gt;New SQL Template Variables&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You can now access custom field values and phase data directly in your metric and experiment SQL, unlocking several use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuned query optimization using non-date partition keys&lt;/li&gt;
&lt;li&gt;Reuse of SQL definitions with minor tweaks per experiment&lt;/li&gt;
&lt;li&gt;More accurate joins between experiment exposure and phase data&lt;/li&gt;
&lt;/ul&gt;
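
&lt;p&gt;As a generic sketch of why template variables keep one SQL definition reusable (the placeholder names here are invented for illustration, not GrowthBook’s exact template syntax):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from string import Template

# One parameterized metric definition instead of near-duplicate copies.
METRIC_SQL = Template("""
SELECT user_id, SUM(amount) AS revenue
FROM orders
WHERE region = '$region'                  -- custom field value
  AND order_date BETWEEN '$phase_start'   -- experiment phase data
                     AND '$phase_end'
GROUP BY user_id
""")

print(METRIC_SQL.substitute(region="eu",
                            phase_start="2025-10-01",
                            phase_end="2025-11-01"))
&lt;/code&gt;&lt;/pre&gt;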

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/growthbook/growthbook/pull/4511?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Custom Validation Hooks&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;GrowthBook has always been flexible — and now it’s even more so. &lt;strong&gt;Self-hosted enterprise users&lt;/strong&gt; can write custom JavaScript validation hooks that run in secure V8 isolates. Use them to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require tags on feature flags&lt;/li&gt;
&lt;li&gt;Prevent targeting rules containing PII&lt;/li&gt;
&lt;li&gt;Enforce naming conventions or internal policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These hooks let teams automate governance without slowing down development.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/self-host/remote-evaluation?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Edge Remote Eval&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Edge Remote Eval lets client-side SDKs offload feature flag evaluation to a backend server, preventing targeting logic from leaking to users. Previously, this required managing your own GrowthBook proxy servers. Now, you can deploy a &lt;strong&gt;Cloudflare Workers–based Remote Eval server&lt;/strong&gt; — a fast, low-cost, zero-maintenance alternative built on Cloudflare’s global infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality-of-Life Improvements
&lt;/h2&gt;

&lt;p&gt;Big thanks to all of our users who reported bugs, shared feedback, and contributed ideas to this release on &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/a&gt; or &lt;a href="https://growthbookusers.slack.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Slack&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many small improvements add up to a big boost in usability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster and more relevant search algorithm for features, metrics, and experiments &lt;/li&gt;
&lt;li&gt;Create feature rules in multiple environments at once&lt;/li&gt;
&lt;li&gt;Better column-type detection for BigQuery Fact Tables&lt;/li&gt;
&lt;li&gt;Add metric row filters based on Boolean columns&lt;/li&gt;
&lt;li&gt;Reduced webhook noise (no more notifications for unpublished drafts)&lt;/li&gt;
&lt;li&gt;Slack and Discord notifications now include more detailed change info&lt;/li&gt;
&lt;li&gt;Custom pre-launch checklist items can be scoped to specific projects&lt;/li&gt;
&lt;li&gt;Faster database schema browsing, even with hundreds of tables&lt;/li&gt;
&lt;li&gt;New setting to disable legacy metrics for smoother transition to Fact Tables&lt;/li&gt;
&lt;li&gt;Sortable experiment results tables — quickly see top or bottom performers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus dozens of smaller fixes and performance improvements.&lt;/p&gt;




&lt;h2&gt;
  
  
  2025: A Year of Rapid Innovation
&lt;/h2&gt;

&lt;p&gt;The 4.2 release is GrowthBook’s sixth major update in 2025, capping off what has easily been the biggest year of innovation in our company’s history. GrowthBook launched over &lt;a href="https://blog.growthbook.io/7-000-github-stars-top-open-source-platform/" rel="noopener noreferrer"&gt;&lt;strong&gt;45 new features&lt;/strong&gt;&lt;/a&gt; across four major themes in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation at Scale:&lt;/strong&gt; New metrics, templates, dashboards, and analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Management:&lt;/strong&gt; Safe rollouts and feature analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artificial Intelligence:&lt;/strong&gt; A new MCP server and embedded AI capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Experience:&lt;/strong&gt; Managed data warehouse, native Vercel integration, 13 updated SDKs, enhanced server-side rendering, and support for new CMSs and FerretDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re on the Starter plan ready for more advanced experimentation and analytics or a Pro user building a culture of experimentation, we’re ready to &lt;a href="https://www.growthbook.io/get-started?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;help you grow&lt;/u&gt;&lt;/a&gt;. We’re excited to see what you build — and how you use these new tools to learn faster.&lt;/p&gt;

</description>
      <category>releases</category>
      <category>cloud</category>
      <category>productanalytics</category>
      <category>performance</category>
    </item>
    <item>
      <title>7,000 Github Stars and Counting</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Thu, 30 Oct 2025 19:38:31 +0000</pubDate>
      <link>https://dev.to/growthbook/7000-github-stars-and-counting-1ghh</link>
      <guid>https://dev.to/growthbook/7000-github-stars-and-counting-1ghh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnvlnl72c72kiy7urs73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnvlnl72c72kiy7urs73.png" alt="7,000 Github Stars and Counting" width="720" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank You for Making GrowthBook the World’s Largest Open-Source Experimentation Platform&lt;/p&gt;

&lt;p&gt;GrowthBook passed 7,000 stars on &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/a&gt; this month thanks to you. Your support affirms our commitment to experimentation-led development and open-source transparency. We see you testing every day in the &lt;strong&gt;100 billion+ feature flag lookups&lt;/strong&gt; we handle and the &lt;strong&gt;2,600 organizations&lt;/strong&gt; actively using GrowthBook each month.&lt;/p&gt;

&lt;p&gt;To celebrate this milestone, let’s look back on how we’ve grown and ahead to where we’re going. Our goal is to help you go faster at scale. Let’s see how we do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9bd8oyq4qeqz1d83xv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9bd8oyq4qeqz1d83xv0.png" alt="7,000 Github Stars and Counting" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s New with GrowthBook in 2025?
&lt;/h2&gt;

&lt;p&gt;GrowthBook released more than 45 new features in our cloud and self-hosted experimentation platform in 4 key areas: data exploration, developer experience, advanced experimentation, and improving the experiment lifecycle with AI. As an engineering-first company, we believe that experiments should be easy and cheap to run so you can learn constantly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Better Data Exploration
&lt;/h2&gt;

&lt;p&gt;What good is an experiment if you can’t easily analyze the results? GrowthBook provides full transparency by exposing the underlying SQL for your experiments. But we know you wanted more ways to explore your data, debug issues, and create custom reports and visualizations without the context switching. Now you can explore your data and build custom dashboards. &lt;/p&gt;

&lt;p&gt;Complexity happens fast when it comes to data analysis across teams and departments. &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Metric slices&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; give everyone flexibility without complexity. For example, instead of separate revenue metrics for each product type, you can use metric slices to automatically generate distinct revenue metrics for each product type (such as “apparel” or “equipment”). Teams benefit from more granular and relevant analysis without duplicating definitions. Everyone stays on the same page. &lt;/p&gt;

&lt;h2&gt;
  
  
  Accelerating Experimentation Culture
&lt;/h2&gt;

&lt;p&gt;Why do so many engineering teams build their own experimentation platforms? So they get exactly what they want. GrowthBook helps teams migrate from homegrown tools to an experimentation culture by giving developers that same control. Customizable dashboards and frameworks help more teams run more experiments, faster, and learn from the results.&lt;/p&gt;

&lt;p&gt;That’s why we developed &lt;a href="https://www.growthbook.io/products/experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;experiment dashboards&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;. Developers, data teams, and product managers create their own custom view to go deep on individual experiments. They get exactly what they need to highlight interesting results, hide the noise, and begin to tell a story with the data that everyone in the organization can understand. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Experiment Decision Framework&lt;/strong&gt; helps teams make systematic, consistent decisions about when and how to conclude experiments. GrowthBook’s default modes include “do no harm” and “clear signal” with the option to customize with your own rules so you can iterate quickly.&lt;/p&gt;

&lt;p&gt;For developers who want to skip setup of a data source for our warehouse native solution, we launched a &lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-4/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Managed Warehouse option&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;. Now, your team can go straight to feature management, experimentation, and product analytics without the data connection, cost, and refresh hassles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Experimentation
&lt;/h3&gt;

&lt;p&gt;The more experiments you run, the more advanced your experimentation program becomes. We believe that so many of you support GrowthBook because of the high bar we set for statistical rigor. We continued that commitment with features for sophisticated metrics, automated decision-making, and comprehensive measurement capabilities for high-frequency testing programs. Measure the long-term impact of changes and control outcomes with &lt;a href="https://blog.growthbook.io/holdouts-in-growthbook/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;holdouts&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://blog.growthbook.io/introducing-multi-armed-bandits-in-growthbook/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;multi-armed bandits&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, and &lt;a href="https://blog.growthbook.io/flavors-of-experimentation-in-growthbook/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;safe rollouts&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.growthbook.io/products/insights?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Insights&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, GrowthBook’s executive dashboard, you get a 10,000-foot view across all of your organization’s experiments to understand what you’ve done and what you’ve learned. Help your team go further, faster with learnings and experiment timelines, and explore metric effects and metric correlations. Filter by project and date range, and view win rate, scaled impact, and velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving the Experiment Lifecycle with AI
&lt;/h2&gt;

&lt;p&gt;It’s time to talk to your experimentation platform. The &lt;a href="https://www.growthbook.io/products/ai-mcp?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;MCP server&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; streamlines workflows and enables AI-powered automation and insights within your development environment. Connect to your favorite LLMs to manage feature flags, experiments, and other tasks without switching contexts. The MCP server works with Cursor, Claude, and VS Code, and it’s open source.&lt;/p&gt;

&lt;p&gt;We’ve also embedded AI into GrowthBook. You can use natural language questions to generate SQL. Your GrowthBook assistant helps you follow best practices by checking hypotheses, summarizing metric descriptions, generating experiment summaries, and comparing past experiments to avoid duplication. &lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead: The Future of Experimentation at GrowthBook
&lt;/h2&gt;

&lt;p&gt;We continue to be inspired by our GitHub stargazers, &lt;a href="https://growthbookusers.slack.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Slack community members&lt;/u&gt;&lt;/a&gt;, and all the experimenters out there, committed to making everything better. As we prepare for the year ahead, we’re looking at a few key themes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In this time of consolidation and disruption, data security and governance matter more than ever. Our warehouse-native approach allows you to keep your data in-house under your control.&lt;/li&gt;
&lt;li&gt;As AI-generated code becomes more pervasive, experimentation provides an essential check on whether code works and benefits the business.&lt;/li&gt;
&lt;li&gt;Fostering a culture of experimentation does more than draw the signal from the noise. It helps you fail sooner, in the smallest ways possible, so you can accelerate success. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's to the next 7,000 stars and beyond! If you haven't already, check out &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;GrowthBook on GitHub&lt;/u&gt;&lt;/a&gt;—we'd love to see what you experiment with next.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ready to join the experimentation revolution? Star us on&lt;/em&gt; &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, join our&lt;/em&gt; &lt;a href="https://growthbookusers.slack.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;Slack community&lt;/u&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, or dive into the code. The future of product development is open, transparent, and data-driven. Let's build it together.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>news</category>
    </item>
    <item>
      <title>The Benchmarks Are Lying to You: Why You Should A/B Test Your AI</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 30 Sep 2025 16:16:40 +0000</pubDate>
      <link>https://dev.to/growthbook/the-benchmarks-are-lying-to-you-why-you-should-ab-test-your-ai-njn</link>
      <guid>https://dev.to/growthbook/the-benchmarks-are-lying-to-you-why-you-should-ab-test-your-ai-njn</guid>
      <description>&lt;h2&gt;
  
  
  Quick Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance varies by domain:&lt;/strong&gt; Models that ace benchmarks often fail on your specific use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trade-offs might not be real:&lt;/strong&gt; Faster, cheaper models might outperform expensive ones for your needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The best solution is rarely one model:&lt;/strong&gt; Most successful deployments use model portfolios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing quantifies what matters:&lt;/strong&gt; User completion rates, costs, and latency—not abstract scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyg1oyu4etxd8cs513l39.png" alt="The Benchmarks Are Lying to You: Why You Should A/B Test Your AI" width="800" height="450"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;OpenAI's GPT-5 (high) model scores 25% on the FrontierMath benchmark for expert-level mathematics. Claude Opus 4.1 only scores 7%. Based on these numbers alone, you might assume GPT-5 is clearly the superior choice for any application requiring mathematical reasoning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9c99hnuafptxuos1jza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9c99hnuafptxuos1jza.png" alt="The Benchmarks Are Lying to You: Why You Should A/B Test Your AI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this assumption illustrates a fundamental problem in AI evaluation, one that we in the experimentation space know quite well as Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The AI industry has turned benchmarks into targets, and now those benchmarks are failing us.&lt;/p&gt;

&lt;p&gt;When GPT-4 launched, it dominated every benchmark. Yet within weeks, engineering teams discovered that smaller, "inferior" models often outperformed it on specific production tasks—at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;Despite all the fanfare of the GPT-5 launch and its chart-topping coding benchmark scores, developers continued to prefer Anthropic's models and tooling for real-world usage. This disconnect between benchmark performance and production reality isn't an edge case. It's the norm.&lt;/p&gt;

&lt;p&gt;The market for LLMs is expanding rapidly—OpenAI, Anthropic, Google, Mistral, Meta, xAI and dozens of open-source options all compete for your attention. But the question isn't which model scores highest on benchmarks. It's which model actually works in your production environment, with your users, under your constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Benchmarks Fail in Production
&lt;/h2&gt;

&lt;p&gt;AI benchmarks are standardized tests designed to measure model performance—MMLU tests general knowledge, HumanEval measures coding ability, and FrontierMath evaluates mathematical reasoning. Every major model release leads with these scores.&lt;/p&gt;

&lt;p&gt;But these benchmarks fail in three critical ways that make them unreliable for production decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. They Don't Measure What Actually Matters&lt;/strong&gt; Benchmarks test surrogate tasks—simplified proxies that are easier to measure than actual performance. A model might excel at multiple-choice medical questions while failing to parse your actual clinical notes. It might ace standardized coding challenges while struggling with your company's specific codebase patterns. The benchmarks measure &lt;em&gt;something&lt;/em&gt;, just not real-world problem-solving ability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. They're Systematically Gamed&lt;/strong&gt; Data contamination lets models memorize benchmark datasets during training, achieving perfect scores on familiar questions while failing on slight variations. Worse, models are specifically optimized to excel at benchmark tasks—essentially teaching to the test. When your model has seen the answers beforehand, the test becomes meaningless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. They Ignore Production Reality&lt;/strong&gt; Benchmarks operate in a fantasy world without your constraints. Latency doesn't exist in benchmarks, but your multi-model chain takes 15+ seconds. Cost doesn't matter in benchmarks, but 10x price differences destroy unit economics. Your infrastructure has real memory limits. Your healthcare app can't hallucinate drug dosages.&lt;/p&gt;

&lt;p&gt;Consider this sobering statistic: 79% of ML papers claiming breakthrough performance used weak baselines to make their results look better. When researchers reran these comparisons fairly, the advantages often disappeared.&lt;/p&gt;

&lt;h2&gt;
  
  
  The A/B Testing Advantage: Finding What Actually Works
&lt;/h2&gt;

&lt;p&gt;So if benchmarks fail us, how do we actually select and optimize LLMs? Through the same methodology that transformed digital products: rigorous A/B testing with real users and real workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Portfolio Approach
&lt;/h3&gt;

&lt;p&gt;The first insight from production A/B testing contradicts everything vendors tell you: the optimal solution is rarely a single model.&lt;/p&gt;

&lt;p&gt;Successful deployments use a portfolio approach. Through testing, teams discover patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple queries handled by models that are fast, cheap, and good enough&lt;/li&gt;
&lt;li&gt;Complex reasoning routed to thinking models&lt;/li&gt;
&lt;li&gt;Domain-specific tasks sent to fine-tuned specialist models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take v0, Vercel's AI app builder. It uses a composite model architecture: a state-of-the-art model for new generations, a Quick Edit model for small changes, and an AutoFix model that checks outputs for errors.&lt;/p&gt;

&lt;p&gt;This dynamic selection approach can &lt;strong&gt;&lt;em&gt;slash costs by 80% while maintaining or improving quality&lt;/em&gt;&lt;/strong&gt;. But you'll only discover your optimal routing strategy through systematic testing.&lt;/p&gt;
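
&lt;p&gt;A portfolio router can start as a simple rules-plus-holdout layer. Here’s a hedged sketch (the model names and the &lt;code&gt;classify&lt;/code&gt; heuristic are placeholders, and the routing strategy itself is what you’d A/B test):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

ROUTES = {
    "simple":    "small-fast-model",        # cheap and good enough
    "reasoning": "frontier-model",          # expensive, for hard queries
    "domain":    "fine-tuned-specialist",   # e.g., your vertical's jargon
}

def classify(query):
    """Placeholder heuristic; real routers learn this from data."""
    if len(query.split()) &lt; 12:
        return "simple"
    if any(k in query.lower() for k in ("prove", "derive", "step by step")):
        return "reasoning"
    return "domain"

def pick_model(user_id, query):
    # Keep a hashed 10% of users on the incumbent model as a control,
    # so the routing strategy is measured rather than assumed.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest()[:4], 16) % 100
    if bucket &lt; 10:
        return "incumbent-model"
    return ROUTES[classify(query)]
&lt;/code&gt;&lt;/pre&gt;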

&lt;h3&gt;
  
  
  Metrics That Actually Drive Business Value
&lt;/h3&gt;

&lt;p&gt;Production A/B testing reveals the metrics that benchmarks completely miss:&lt;/p&gt;

&lt;p&gt;Performance Metrics That Matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task completion rate:&lt;/strong&gt; Do users actually accomplish their goals?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem resolution rate:&lt;/strong&gt; Are issues solved, or do users return?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regeneration requests:&lt;/strong&gt; How often is the first answer insufficient?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session depth:&lt;/strong&gt; Are simple tasks requiring multiple interactions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost and Efficiency Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokens per request:&lt;/strong&gt; Your actual API costs, not theoretical pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P95 latency:&lt;/strong&gt; How long your slowest users wait (the ones most likely to churn)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput limits:&lt;/strong&gt; Can you handle Black Friday or just Tuesday afternoon?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Counterintuitive insight:&lt;/strong&gt; If an LLM solves a user's question on the first try, you may see fewer follow-up prompts. That drop in "requests per session" is actually positive—your model is more effective, not less engaging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making A/B Testing Work for LLMs
&lt;/h3&gt;

&lt;p&gt;Testing LLMs requires adapting traditional experimental methods to handle their unique characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle the Randomness:&lt;/strong&gt; Unlike deterministic code, LLMs produce different outputs for the same prompt. This variance means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run tests longer than typical UI experiments&lt;/li&gt;
&lt;li&gt;Use larger sample sizes to achieve statistical significance&lt;/li&gt;
&lt;li&gt;Consider lowering temperature settings if consistency matters more than creativity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Isolate Your Variables:&lt;/strong&gt; Test one change at a time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model swap (GPT-5 → Claude Opus)&lt;/li&gt;
&lt;li&gt;Prompt refinement (shorter, more specific instructions)&lt;/li&gt;
&lt;li&gt;Parameter tuning (temperature, max tokens)&lt;/li&gt;
&lt;li&gt;Routing logic (which queries go to which model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this discipline, you can't attribute improvements to specific changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Smart Guardrails:&lt;/strong&gt; Layer guardrail metrics alongside your primary success metrics. An improvement in task completion that doubles costs might not be worth deploying. Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per successful interaction (not just cost per request)&lt;/li&gt;
&lt;li&gt;Safety violations that could trigger PR nightmares&lt;/li&gt;
&lt;li&gt;Latency thresholds that cause user abandonment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build Once, Test Forever:&lt;/strong&gt; Invest in infrastructure that makes testing sustainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized proxy service for LLM communications&lt;/li&gt;
&lt;li&gt;Automatic metric collection and monitoring&lt;/li&gt;
&lt;li&gt;Prompt versioning and management&lt;/li&gt;
&lt;li&gt;Response validation and safety checking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This investment pays off immediately—making tests easier to run and results more trustworthy.&lt;/p&gt;
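
&lt;p&gt;A minimal sketch of what that centralized proxy might record per call (the &lt;code&gt;llm_call&lt;/code&gt; argument and the metrics sink are stand-ins, not any specific library’s API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import uuid

def instrumented_call(llm_call, variant, prompt, metrics_sink):
    """Wrap every LLM call so each experiment variant logs identical metrics."""
    start = time.perf_counter()
    response = llm_call(prompt)                       # stand-in for your real client
    metrics_sink.append({
        "request_id": str(uuid.uuid4()),
        "variant": variant,                           # model/prompt/params under test
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "prompt_tokens": len(prompt.split()),         # crude proxy; use real token counts
        "output_chars": len(response),
    })
    return response

sink = []
instrumented_call(lambda p: p.upper(), "model-a", "hello there", sink)  # toy model
print(sink[0])
&lt;/code&gt;&lt;/pre&gt;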

&lt;h2&gt;
  
  
  Embrace Empiricism
&lt;/h2&gt;

&lt;p&gt;Benchmarks aren't entirely useless—use them for initial screening, understanding capability boundaries, and meeting regulatory minimums. But they should never be your final decision criterion.&lt;/p&gt;

&lt;p&gt;The AI industry's benchmark obsession has created a dangerous illusion. Models that dominate standardized tests struggle with real tasks. The metrics we celebrate have divorced from the outcomes we need.&lt;/p&gt;

&lt;p&gt;For teams building with LLMs, the path is clear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with hypotheses, not benchmarks:&lt;/strong&gt; "We believe Model X will improve task completion," not "Model X scores higher"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with real users and real data:&lt;/strong&gt; Your production environment is the only benchmark that matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure what moves your business:&lt;/strong&gt; User satisfaction, cost per outcome, and regulatory compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate based on evidence:&lt;/strong&gt; Let data, not vendor claims, drive your model selection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The benchmarks aren't exactly lying—they're just answering the wrong questions. A/B testing asks the right ones: Will this solve my users' problems? Can we afford it at scale? Does it meet our requirements?&lt;/p&gt;

&lt;p&gt;In the end, the best benchmark for your AI isn't a standardized test. It's users voting with their actions, costs staying within budget, and your application delivering real value.&lt;/p&gt;

&lt;p&gt;Everything else is just numbers on a leaderboard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.nature.com/articles/d41586-025-02462-5?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Is your AI benchmark lying to you?&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pub.towardsai.net/data-driven-llm-evaluation-with-statistical-testing-004b1561793f?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Data-Driven LLM Evaluation with Statistical Testing&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2305.05176?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.06559?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>abtesting</category>
    </item>
    <item>
      <title>The Definitive Technical Guide to Generative Engine Optimization (2025)</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 09 Sep 2025 23:09:32 +0000</pubDate>
      <link>https://dev.to/growthbook/the-definitive-technical-guide-to-generative-engine-optimization-2025-1gdk</link>
      <guid>https://dev.to/growthbook/the-definitive-technical-guide-to-generative-engine-optimization-2025-1gdk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohrraxgqkh6eav1fwvfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohrraxgqkh6eav1fwvfp.png" alt="The Definitive Technical Guide to Generative Engine Optimization (2025)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Has your search behavior changed? Maybe you ask ChatGPT to help diagnose a leaky faucet instead of scrolling through Google Ads. Or you use Claude to debug a type error instead of sifting through Stack Overflow.&lt;/p&gt;

&lt;p&gt;You aren’t alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2024-02-19-gartner-predicts-search-engine-volume-will-drop-25-percent-by-2026-due-to-ai-chatbots-and-other-virtual-agents?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Gartner predicts that by 2026 traditional search engine volume will drop 25%&lt;/a&gt;, losing out to AI chatbots and other virtual agents. In contrast, their traffic has surged, &lt;a href="https://onelittleweb.com/data-studies/ai-chatbots-vs-search-engines/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;growing 80.92% YoY&lt;/a&gt;. ChatGPT has 800 million weekly active users. And your customers are already there: ChatGPT now drives 10% of new signups for companies like Vercel.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ChatGPT now refers 10% of new &lt;a href="https://twitter.com/vercel?ref_src=twsrc%5Etfw&amp;amp;ref=blog.growthbook.io" rel="noopener noreferrer"&gt;@vercel&lt;/a&gt; signups, which have also accelerated &lt;a href="https://t.co/LzatDz8n8u?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;https://t.co/LzatDz8n8u&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Guillermo Rauch (&lt;a class="mentioned-user" href="https://dev.to/rauchg"&gt;@rauchg&lt;/a&gt;) &lt;a href="https://twitter.com/rauchg/status/1910093634445422639?ref_src=twsrc%5Etfw&amp;amp;ref=blog.growthbook.io" rel="noopener noreferrer"&gt;April 9, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But here's what most marketers don't understand: When someone asks ChatGPT about the best project management software, it doesn't show ten blue links. It synthesizes an answer from sources it trusts and gives one response. There's no page two. You're either in the answer or you're invisible 🫥&lt;/p&gt;

&lt;p&gt;This is the fundamental shift. We're moving from competing for rankings to competing for citations. From click-through rates to what &lt;a href="https://a16z.com/geo-over-seo/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;a16z calls "reference rates"&lt;/a&gt;—how often AI chooses to mention you.&lt;/p&gt;

&lt;p&gt;The companies seeing leads from AI platforms aren't using magic. They're using methods you can implement today. These methods are called &lt;strong&gt;GEO (Generative Engine Optimization)&lt;/strong&gt;, and this guide will explain exactly what it is and the practical steps you can take to ensure AI sends customers to you, not your competitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers of AI Visibility (And Why SEO Still Matters)
&lt;/h2&gt;

&lt;p&gt;So, now that we’re optimizing for generative engines (AI chatbots), we don’t have to do SEO anymore, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nx4fu1iroo1nws670l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nx4fu1iroo1nws670l6.png" alt="The Definitive Technical Guide to Generative Engine Optimization (2025)" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’d be great, but no. GEO doesn’t replace SEO—it adds a &lt;strong&gt;brutal new selection layer on top of it&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Pre-Training Knowledge
&lt;/h3&gt;

&lt;p&gt;This is what's "baked into" the model from training—everything it learned before its knowledge cutoff. AI models weren’t just trained on academic papers and Wikipedia. They gorged on Reddit, Stack Overflow, and forum discussions.&lt;/p&gt;

&lt;p&gt;You can't retroactively get into the model's training data. But you can position yourself for future training runs by establishing presence in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community platforms:&lt;/strong&gt; Reddit, Stack Overflow, specialized forums&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authority sources:&lt;/strong&gt; Wikipedia, academic papers, industry publications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Established directories:&lt;/strong&gt; G2, Capterra, Clutch (First Page Sage found these critical)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 2: Search Layer - Real-Time Retrieval (RAG)
&lt;/h3&gt;

&lt;p&gt;When AI needs current information—today's weather, recent news, current prices—it searches the web. And here's the kicker: it uses traditional search engines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT searches Bing&lt;/strong&gt; (Microsoft partnership)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perplexity queries Google&lt;/strong&gt; (primarily)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini uses Google&lt;/strong&gt; (obviously)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is why traditional SEO still matters!&lt;/strong&gt; &lt;a href="https://firstpagesage.com/seo-blog/generative-engine-optimization-geo-strategy-guide/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;First Page Sage's research&lt;/a&gt; is crystal clear: "appearing in lists that rank highly in Google or Bing's organic search results made the biggest difference in earning a chatbot's recommendation."&lt;/p&gt;

&lt;p&gt;If you're not ranking, you ain’t even in the game. The AI can't cite what it can't find.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Selection Filter
&lt;/h3&gt;

&lt;p&gt;This is where GEO comes in. Even when AI tools find dozens of relevant results, they only cite a handful. They're making editorial decisions about what to reference.&lt;/p&gt;

&lt;p&gt;Researchers from Princeton, IIT Delhi, and other institutions, who published the &lt;a href="https://arxiv.org/pdf/2311.09735?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;landmark study that coined the term “Generative Engine Optimization”&lt;/a&gt;, demonstrate what makes content “selectable”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Statistics make you up to 33.9% more visible&lt;/strong&gt; - AI can't generate data, so it gravitates toward sources that provide it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expert quotes boost visibility up to 32%&lt;/strong&gt; - Direct quotations give AI something concrete to reference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear, fluent writing improves citation rates up to 30%&lt;/strong&gt; - If AI struggles to parse your content, it moves on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citations to authoritative sources add 30.3%&lt;/strong&gt; - Credibility signals matter more than ever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; You can rank #1 on Google, but if your content isn't optimized for AI selection, ChatGPT will cite your competitor instead. This means that top-ranking content will need to adapt to stay visible in the world of AI, but also that lower ranking pages have an opportunity to increase visibility by implementing the GEO methods explained below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GEO Playbook
&lt;/h2&gt;

&lt;p&gt;Here are the strategies you can use to become the darling of the chatbots.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Win the List Game
&lt;/h3&gt;

&lt;p&gt;Nearly every AI recommendation starts with a Google or Bing search for “best [category] tools” or “top [solution] companies.” But not all lists are created equal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparison tables that pit products against each other by features and price&lt;/li&gt;
&lt;li&gt;Lists subdivided by use case ("Best for Small Business," "Best for Enterprise")&lt;/li&gt;
&lt;li&gt;Recent timestamps (AI prefers freshness, favoring anything with "2024" or "2025")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The hack:&lt;/strong&gt; Create your own comparison articles that include your product, but be comprehensive and fair. AI can detect obvious bias.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Become Statistically Irresistible
&lt;/h3&gt;

&lt;p&gt;Remember: AI can't create data, only synthesize it. Chatbots are thirsty for original statistics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn customer data into industry insights ("73% of our users reduced churn by...")&lt;/li&gt;
&lt;li&gt;Commission surveys for unique data points&lt;/li&gt;
&lt;li&gt;Add specific numbers to every claim ("Most dentists" → "4 out of 5 dentists")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Create downloadable CSV files with your data. Perplexity and ChatGPT love citing sources that provide raw data access.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Just Write Real Good
&lt;/h3&gt;

&lt;p&gt;AI reads your content differently than humans. It chunks information, analyzes relationships, and looks for clear extractable statements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The formula:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One idea per paragraph (seriously, one)&lt;/li&gt;
&lt;li&gt;Front-load key points in the first sentence&lt;/li&gt;
&lt;li&gt;Use headers that directly answer questions&lt;/li&gt;
&lt;li&gt;Bold your most important statistics and conclusions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like optimizing for featured snippets, but for every paragraph on your page.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Build Your Entity Authority
&lt;/h3&gt;

&lt;p&gt;Here's what's wild: AI tracks brand mentions even without links. It's building a map of who's credible in each space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The playbook:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get mentioned alongside competitors in industry roundups&lt;/li&gt;
&lt;li&gt;Contribute to Reddit discussions in your space (AI ❤️ Reddit)&lt;/li&gt;
&lt;li&gt;Secure profiles in industry directories like G2, Capterra, or Clutch&lt;/li&gt;
&lt;li&gt;Maintain at least 3.5 stars on review platforms (this is table stakes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://backlinko.com/generative-engine-optimization-geo?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;As Backlinko discovered&lt;/a&gt;, these co-citations and co-occurrences are how AI understands where you fit in your market:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"AI systems don't just look at backlinks to understand your authority. They pay attention to every mention of your brand across the web, even when those mentions don't include a clickable link.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5. Go Multi-Modal (Because Text Isn't Enough)
&lt;/h3&gt;

&lt;p&gt;Google Lens processes 20 billion visual queries monthly. Voice searches are 3.7× more likely to be questions. &lt;a href="https://developer.tenten.co/multi-modal-content-for-ai-seo-the-definitive-2025-guide?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The TenTen guide revealed that pages with multi-modal content see 67% more AI referrals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Essential additions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-res product images with descriptive filenames&lt;/li&gt;
&lt;li&gt;30-second video summaries with transcripts&lt;/li&gt;
&lt;li&gt;Schema markup for VideoObject, ImageObject, and Speakable (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Alt text written like mini-tweets (125 characters, naturally includes keywords)&lt;/li&gt;
&lt;/ul&gt;
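
&lt;p&gt;If you haven't worked with structured data before, here's a minimal sketch of VideoObject plus Speakable markup, built as a TypeScript object for readability. Every name, URL, and selector below is a placeholder; the shape follows the public schema.org definitions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative JSON-LD for a page with a 30-second summary video.
// All names and URLs are placeholders, not real endpoints.
const structuredData = {
  "@context": "https://schema.org",
  "@type": "WebPage",
  name: "Example product overview",
  speakable: {
    "@type": "SpeakableSpecification",
    // CSS selectors for the passages a voice assistant may read aloud
    cssSelector: [".article-summary", ".key-takeaway"],
  },
  video: {
    "@type": "VideoObject",
    name: "Example product in 30 seconds",
    description: "A 30-second summary of what the product does.",
    thumbnailUrl: "https://example.com/thumb.jpg",
    uploadDate: "2025-01-15",
    duration: "PT30S", // ISO 8601 duration: 30 seconds
    contentUrl: "https://example.com/video.mp4",
    transcript: "Full transcript text goes here...",
  },
};

// Serialize this and embed it in the page head inside a
// script tag with type "application/ld+json".
const jsonLd = JSON.stringify(structuredData);
&lt;/code&gt;&lt;/pre&gt;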

&lt;h2&gt;
  
  
  The Dark Traffic Problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.seerinteractive.com/insights/are-ai-sites-like-chatgpt-sending-your-website-traffic?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Here’s an inconvenient truth from Seer Interactive&lt;/a&gt;: Most AI traffic is invisible. It shows up as “Direct” in Google Analytics because AI doesn’t always pass referral data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can track:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT citations include &lt;code&gt;utm_source=chatgpt.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a GA4 segment for AI platforms (&lt;a href="http://chatgpt.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;chatgpt.com&lt;/a&gt;, &lt;a href="http://perplexity.ai/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;perplexity.ai&lt;/a&gt;, &lt;a href="http://claude.ai/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;claude.ai&lt;/a&gt;); a regex sketch follows this list&lt;/li&gt;
&lt;li&gt;Monitor brand searches that spike after AI platform updates&lt;/li&gt;
&lt;/ul&gt;
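
&lt;p&gt;For that GA4 segment, one practical approach is a "session source matches regex" condition. A sketch of the pattern is below; the exact domain list is an assumption, so extend it as new AI referrers show up in your reports.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Candidate AI referrer domains for a GA4 "session source matches regex" condition.
// Illustrative list only; add or remove domains to match what you actually see.
const aiReferrerPattern =
  "chatgpt\\.com|perplexity\\.ai|claude\\.ai|gemini\\.google\\.com|copilot\\.microsoft\\.com";

// The same pattern as a JavaScript RegExp, e.g. for tagging rows in an export:
const aiReferrer = new RegExp(aiReferrerPattern, "i");

aiReferrer.test("chatgpt.com"); // true
aiReferrer.test("google.com");  // false
&lt;/code&gt;&lt;/pre&gt;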

&lt;p&gt;&lt;strong&gt;What you can't track:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT search results (no UTMs)&lt;/li&gt;
&lt;li&gt;Voice assistant references&lt;/li&gt;
&lt;li&gt;Most AI-generated summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The workaround:&lt;/strong&gt; Embed UTM parameters in your internal links. Yes, this used to be taboo, but GA4 doesn't create new sessions like Universal Analytics did. When AI scrapes your content, it might preserve these parameters.&lt;/p&gt;
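
&lt;p&gt;As a sketch of that workaround: a small helper that stamps internal links with UTM parameters. The parameter names are the conventional GA4 UTM keys; the values and the origin are placeholders you'd replace with your own.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Append UTM parameters to an internal link so the source context survives
// if an AI assistant lifts the URL out of your page.
function tagInternalLink(href: string, content: string): string {
  const url = new URL(href, "https://www.example.com"); // placeholder origin
  url.searchParams.set("utm_source", "internal");
  url.searchParams.set("utm_medium", "site");
  url.searchParams.set("utm_content", content); // e.g. which article linked here
  return url.toString();
}

tagInternalLink("/pricing", "geo-guide");
// "https://www.example.com/pricing?utm_source=internal&amp;amp;utm_medium=site&amp;amp;utm_content=geo-guide"
&lt;/code&gt;&lt;/pre&gt;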

&lt;h2&gt;
  
  
  Don’t Expect Too Much
&lt;/h2&gt;

&lt;p&gt;GEO isn’t magic. It’s not going to 10x your traffic overnight. What it will do is ensure you're not invisible when your customers ask AI for recommendations.&lt;/p&gt;

&lt;p&gt;The foundational GEO study by Aggarwal et al. showed that combining methods—using statistics plus improved fluency—can boost visibility by 35.8%. That's not a page-one ranking; that's being the trusted source AI chooses from page one.&lt;/p&gt;

&lt;p&gt;More importantly, this isn't optional. Your competitors are already doing this. They're showing up in ChatGPT answers. They're capturing that Vercel-style 10% of new signups from AI.&lt;/p&gt;

&lt;p&gt;While the AI slop torrent isn’t slowing down anytime soon, the methods suggested here for optimizing for AI—adding citations and stats, writing for fluency, engaging with the community—all make for better content.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens Now?
&lt;/h2&gt;

&lt;p&gt;We’re at an inflection point. Traditional search isn’t dying but becoming one channel among many. AI platforms are becoming the new front door to the internet.&lt;/p&gt;

&lt;p&gt;In two years, we'll look back at companies still doing SEO-only strategies the way we look at businesses that refused to build websites in 1999. They'll exist, but they'll be invisible to an entire generation of users who start every quest for information with "Hey ChatGPT..."&lt;/p&gt;

&lt;p&gt;The shift from keywords to concepts, from rankings to references, from pages to entities—it's already happening. The only question is whether you'll ride the wave or watch it pass.&lt;/p&gt;

&lt;p&gt;Because in this new world, there's no page two. There's only the answer. Make sure you're in it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to see if AI is already sending you traffic? Start by searching for your brand on ChatGPT, Perplexity, and Claude. If you don't like what you see—or worse, if you don't see anything—it's time to act.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Wed, 03 Sep 2025 18:46:30 +0000</pubDate>
      <link>https://dev.to/growthbook/holdouts-in-growthbook-the-gold-standard-for-measuring-cumulative-impact-2fmp</link>
      <guid>https://dev.to/growthbook/holdouts-in-growthbook-the-gold-standard-for-measuring-cumulative-impact-2fmp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbqhagdcqk5u9t1lhoro.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbqhagdcqk5u9t1lhoro.webp" alt="Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many successful product teams iterate quickly, running simultaneous experiments and launching new features weekly. Measuring the overall effect of these tests is critical to understanding the team’s impact and to help set product direction. However, actually measuring this cumulative impact can be quite difficult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Holdouts&lt;/strong&gt; in GrowthBook provide a simple way to keep a true control group across multiple features and measure long-run cumulative impact. It’s the gold standard way to answer the question: &lt;strong&gt;“What did all of this shipping actually do to my key metric?”&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why holdouts matter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cumulative impact is important to measure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensuring that your experimentation program is helping you ship winning features and avoid losing features helps set your product direction. Knowing which teams are driving the most impact can help you understand what’s working and what isn’t. Teams successfully moving the needle may deserve more investment to continue driving team goals upward. If a team struggles to have a significant impact, they may have hit diminishing returns, they may need a new direction, or the product may have reached a certain level of maturity, making gains more difficult to achieve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cumulative impact is hard to measure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Looking at the overall trend in your goal metrics is not enough. Forces beyond your control or seasonality can dictate goal metric movements and can mislead you. With constant shipping across product teams, attributing lift to individual teams can be nearly impossible.&lt;/p&gt;

&lt;p&gt;Other approaches try to sum up the effect of individual experiments and apply some bias reduction, like &lt;a href="https://docs.growthbook.io/insights?ref=blog.growthbook.io#scaled-impact" rel="noopener noreferrer"&gt;the one in our own Insights section&lt;/a&gt;. Almost always, the &lt;strong&gt;summed individual impacts of experiments overstate the final effects&lt;/strong&gt; due to selection bias, diminishing returns over time, and cannibalizing interactions with other experiments. This isn’t just theoretical: &lt;a href="https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Airbnb&lt;/a&gt; documented how a naive sum overstates impact by 2x when compared with a holdout, and bias-corrected estimates still overstate impact by 1.3x.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Holdouts as the solution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A well-run holdout exposes a stable baseline of users to &lt;em&gt;none&lt;/em&gt; of your new features for a period of time, then compares them to the general population. Because a holdout can run for longer on a small percentage of traffic, you capture longer-run effects. Furthermore, it allows you to stack all of your features and experiments into one test, capturing cumulative and interactive effects. Finally, it uses the reliable statistics and inference provided by experiments to make holdouts the &lt;strong&gt;gold standard&lt;/strong&gt; for cumulative, long-run impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Holdouts work in GrowthBook
&lt;/h2&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Holdout group&lt;/strong&gt; : A small percentage of traffic (usually users) is diverted away from new features, experiments, and bandits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General population&lt;/strong&gt; : Everyone else—experimenting and shipping as usual. We then take a small subset of the general population to use as a measurement group to compare against the holdout group.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you launch new features and experiments, all new traffic is first checked for diversion to the holdout before seeing any new feature or experiment values.&lt;/p&gt;

&lt;p&gt;When an experiment goes live, the &lt;strong&gt;holdout group&lt;/strong&gt; is completely excluded while the &lt;strong&gt;general population&lt;/strong&gt; gets randomized into one condition or another. Once an experiment is shipped, all users in the &lt;strong&gt;general population&lt;/strong&gt; will receive the shipped variant.&lt;/p&gt;

&lt;p&gt;This means that the holdout measures &lt;strong&gt;the cumulative impact of using your product&lt;/strong&gt;, which includes all the false starts and the test period for the experiments that didn’t ship, because that is a true record of what actually happened in the past quarter.&lt;/p&gt;

&lt;p&gt;Only once the holdout is ended will users in the holdout group receive any shipped features.&lt;/p&gt;
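
&lt;p&gt;Conceptually, the assignment logic looks something like the sketch below. This is not GrowthBook's SDK code, just an illustration of the core idea: deterministically divert a slice of users into the holdout first, then randomize everyone else as usual.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { createHash } from "crypto";

// Map a user id to a stable number in [0, 1) via hashing.
// Deterministic hashing keeps each user in the same group across sessions.
function hashToUnit(userId: string, seed: string): number {
  const digest = createHash("sha256").update(seed + ":" + userId).digest();
  return digest.readUInt32BE(0) / 0x100000000;
}

type Assignment = "holdout" | "control" | "variant";

// Illustrative only: the holdout check happens before any experiment
// bucketing, so held-out users never see new features or variants.
function assign(userId: string, holdoutPercent: number): Assignment {
  if (hashToUnit(userId, "holdout") &amp;lt; holdoutPercent) return "holdout";
  // Everyone else is the general population: randomize as usual.
  return hashToUnit(userId, "experiment-1") &amp;lt; 0.5 ? "control" : "variant";
}
&lt;/code&gt;&lt;/pre&gt;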

&lt;h2&gt;
  
  
  Using your Holdout
&lt;/h2&gt;

&lt;p&gt;Facebook and Twitter product teams ran 6-month holdouts for all their features, withholding 5% or less of traffic, and then used the cumulative impact in reporting and to understand whether they had set their product direction correctly. They then released the holdout and started a new one for the next 6-month period.&lt;/p&gt;

&lt;p&gt;Other teams at Twitter also used long-run, low-traffic holdouts on a bundle of critical features to ensure they were continuing to provide value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Define the population size:&lt;/strong&gt; Pick a sample large enough to measure your cumulative impact, but beware that larger population sizes mean you will end up with less traffic for your day-to-day experiments and fewer users with the latest set of features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define the active period length (half to a full quarter)&lt;/strong&gt;: Pick a period long enough to accumulate some wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During the active period:&lt;/strong&gt; Ship normally. Keep adding experiments and launching features. The holdout quietly accumulates evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis period&lt;/strong&gt; &lt;strong&gt;(2–4 weeks)&lt;/strong&gt;: Freeze adding new changes, let effects settle, and compare cumulative impact with our automatic lookback windows applied to measure only the analysis period.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Product teams at Twitter would run a holdout for half a year, adding new features to the holdout over the course of 6 months. Then, they would use the following quarter to get a reliable, long-run measure of their cumulative impact.&lt;/p&gt;

&lt;p&gt;So, a year would look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timeframe&lt;/th&gt;
&lt;th&gt;Holdout Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h1-holdout&lt;/code&gt; (active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h1-holdout&lt;/code&gt; (active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h2-holdout&lt;/code&gt; (active); &lt;code&gt;h1-holdout&lt;/code&gt; (measurement only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h2-holdout&lt;/code&gt; (active)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Tips &amp;amp; Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project-scope your Holdout:&lt;/strong&gt; If you want to measure the impact of a given team’s set of features, have that team work within one or more GrowthBook Projects and have the Holdout automatically apply to their features and experiments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be wary of the user experience&lt;/strong&gt; : A small group won’t see new features—keep the percentage small and the period finite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be ready to keep feature flags in code&lt;/strong&gt; : Holdouts require feature flags to stick around through the analysis period, so prepare your workflows for longer-lasting features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; : Favor durable outcomes (revenue, retention, engagement) and use lookbacks for clean analysis windows so that you only measure the impact once all experiments have had a chance to bed in.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://app.growthbook.io/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Create your first holdout&lt;/strong&gt; in the app&lt;/a&gt; ( &lt;strong&gt;Experiments&lt;/strong&gt; → &lt;strong&gt;Holdouts&lt;/strong&gt; ) and scope it to a project you want to measure impact within.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick 2–4 metrics&lt;/strong&gt; that your team is hoping to improve over the long run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read more about holdouts in our &lt;a href="https://docs.growthbook.io/kb/experiments/holdouts?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Knowledge Base&lt;/a&gt; and see &lt;a href="https://docs.growthbook.io/app/holdouts?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;our documentation&lt;/a&gt; to help run your first holdout.&lt;/p&gt;

</description>
      <category>experimentation</category>
    </item>
    <item>
      <title>What is A/B Testing? The Complete Guide to Data-Driven Decision Making</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Wed, 13 Aug 2025 20:28:14 +0000</pubDate>
      <link>https://dev.to/growthbook/what-is-ab-testing-the-complete-guide-to-data-driven-decision-making-2kkj</link>
      <guid>https://dev.to/growthbook/what-is-ab-testing-the-complete-guide-to-data-driven-decision-making-2kkj</guid>
      <description>&lt;h2&gt;
  
  
  The 30-Second Summary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zqq5x5eiryrm6oumnn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zqq5x5eiryrm6oumnn6.png" alt="What is A/B Testing? The Complete Guide to Data-Driven Decision Making" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A/B testing&lt;/strong&gt; (also called split testing) is a method of comparing two or more versions of a webpage, app feature, or marketing element to determine which performs better. You show version A (the control) to one group and version B (the variant) to another, then measure which drives better results for your business goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; A/B testing removes guesswork from decision-making, turning "we think" into "we know" based on actual user behavior and statistical evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; Done right, A/B testing can increase conversions without spending more on traffic, validate ideas before full implementation, and build a culture of continuous improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is A/B Testing?
&lt;/h2&gt;

&lt;p&gt;Imagine you're at a coffee shop debating whether to put your tip jar by the register or at the pickup counter. Instead of guessing, you try both locations on alternating days and measure which generates more tips. That's A/B testing in the physical world.&lt;/p&gt;

&lt;p&gt;In digital environments, A/B testing works by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Randomly splitting&lt;/strong&gt; your audience into two (or more) groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Showing different versions&lt;/strong&gt; of the same element to each group simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measuring the impact&lt;/strong&gt; on predetermined metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declaring a winner&lt;/strong&gt; based on statistical significance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementing the better version&lt;/strong&gt; for all users&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Critical Difference: Testing vs. Guessing
&lt;/h3&gt;

&lt;p&gt;Without A/B testing, decisions rely on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiPPO (Highest Paid Person's Opinion)&lt;/li&gt;
&lt;li&gt;Best practices that may not apply to your audience&lt;/li&gt;
&lt;li&gt;Assumptions about user behavior&lt;/li&gt;
&lt;li&gt;Competitor copying without context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With A/B testing, decisions are based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual user behavior from your specific audience&lt;/li&gt;
&lt;li&gt;Statistically validated results&lt;/li&gt;
&lt;li&gt;Measurable business impact&lt;/li&gt;
&lt;li&gt;Continuous learning about what works&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why A/B Testing is Essential in 2025
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Maximize Existing Traffic Value
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.businessofapps.com/marketplace/user-acquisition/research/user-acquisition-costs/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Traffic acquisition costs (TAC) have increased 222% since 2019&lt;/a&gt;. And, with the rise of generative AI, the usefulness of long-standing acquisition strategies are more uncertain than ever. A/B testing helps you extract more value from visitors you already have—often delivering higher ROI than acquiring new traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reduce Risk of Major Changes
&lt;/h3&gt;

&lt;p&gt;Instead of redesigning your entire site and hoping for the best, test changes incrementally. If something doesn't work, you've limited the damage to a small test group.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Resolve Internal Debates with Data
&lt;/h3&gt;

&lt;p&gt;Stop endless meetings debating what "might" work. Run a test, get data, make decisions. As one PM put it: "A/B testing turned our three-hour design debates into 30-minute data reviews."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Discover Surprising Insights
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hbr.org/2017/09/the-surprising-power-of-online-experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Microsoft found that changing their Bing homepage background from white to a slightly different shade generated $10 million in additional revenue&lt;/a&gt;. While you shouldn't expect $10 million revenue gains from A/B test (that's an outlier), these wins don't even become a possibility until you start testing for them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://erindoesthings.com/2024/07/15/microsoft-color-tweaks-conversion-gains/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Big conversion gains, small color tweaks - Erin Does Things&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the most impactful experiments involve improvements in color. Why? Because the right colors can mean the difference between a buy and a bounce.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Build Competitive Advantage
&lt;/h3&gt;

&lt;p&gt;While competitors guess, you know. &lt;a href="https://netflixtechblog.com/a-b-testing-and-beyond-improving-the-netflix-streaming-experience-with-experimentation-and-data-5b0ae9295bdf?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Netflix attributes much of its success to running thousands of tests annually, optimizing everything from thumbnails to recommendation algorithms&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can You Test? (Almost Everything)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Website Elements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Headlines and copy&lt;/strong&gt; : Different value propositions, tones, lengths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call-to-action buttons&lt;/strong&gt; : Color, size, text, placement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images and videos&lt;/strong&gt; : Product photos, hero images, background videos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forms&lt;/strong&gt; : Number of fields, field types, progressive disclosure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigation&lt;/strong&gt; : Menu structure, sticky headers, breadcrumbs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layout&lt;/strong&gt; : Single vs. multi-column, card vs. list view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing&lt;/strong&gt; : Display format, anchoring, bundling options&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social proof&lt;/strong&gt; : Testimonials, reviews, trust badges placement&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Beyond Websites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Email campaigns&lt;/strong&gt; : Subject lines, send times, content length&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile apps&lt;/strong&gt; : Onboarding flows, feature placement, notification timing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ads&lt;/strong&gt; : Creative, copy, targeting parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product features&lt;/strong&gt; : Functionality, user interface, defaults&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools&lt;/strong&gt; : Dashboard layouts, workflow steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Algorithms:&lt;/strong&gt; Recommendations, featured items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Prompts, models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Science Behind A/B Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Statistical Foundations: Two Approaches
&lt;/h3&gt;

&lt;p&gt;Modern A/B testing platforms offer two statistical frameworks, each with distinct advantages:&lt;/p&gt;

&lt;h4&gt;
  
  
  Bayesian Statistics (Often the Default)
&lt;/h4&gt;

&lt;p&gt;Bayesian methods provide more intuitive results by expressing outcomes as probabilities rather than binary significant/not-significant decisions. Instead of p-values, you get statements like "there's a 95% chance variation B is better than A." (A numerical sketch of this calculation follows the list below.)&lt;/p&gt;

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows continuous monitoring without invalidating results&lt;/li&gt;
&lt;li&gt;Incorporates prior knowledge to avoid over-interpreting small samples&lt;/li&gt;
&lt;li&gt;Provides probability distributions showing the range of likely outcomes&lt;/li&gt;
&lt;li&gt;Calculates "risk" or expected loss if you choose the wrong variation&lt;/li&gt;
&lt;/ul&gt;
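
&lt;p&gt;To make the probability statement concrete, here's a minimal sketch of how "chance that B beats A" can be computed for a conversion metric, using Beta posteriors with a normal approximation (reasonable at typical sample sizes). This is an illustration, not GrowthBook's actual engine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Posterior for a conversion rate with a Beta(1, 1) prior is
// Beta(1 + conversions, 1 + non-conversions); at large n it is
// roughly normal with the mean and variance below.
function posterior(conversions: number, visitors: number) {
  const a = 1 + conversions;
  const b = 1 + visitors - conversions;
  const mean = a / (a + b);
  const variance = (a * b) / ((a + b) ** 2 * (a + b + 1));
  return { mean, variance };
}

// Standard normal CDF (Abramowitz-Stegun 26.2.17 approximation).
function normCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const p =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z &amp;gt;= 0 ? 1 - p : p;
}

// P(B beats A) under independent, approximately normal posteriors.
function probBbeatsA(convA: number, nA: number, convB: number, nB: number): number {
  const A = posterior(convA, nA);
  const B = posterior(convB, nB);
  const z = (B.mean - A.mean) / Math.sqrt(A.variance + B.variance);
  return normCdf(z);
}

probBbeatsA(300, 10000, 360, 10000); // about 0.99
&lt;/code&gt;&lt;/pre&gt;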

&lt;h4&gt;
  
  
  Frequentist Statistics (Traditional Approach)
&lt;/h4&gt;

&lt;p&gt;Frequentist methods use hypothesis testing with p-values and confidence intervals. This classical approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires predetermined sample sizes&lt;/li&gt;
&lt;li&gt;Uses statistical significance thresholds (typically 95%)&lt;/li&gt;
&lt;li&gt;Provides clear yes/no decisions based on p-values&lt;/li&gt;
&lt;li&gt;Is familiar to those with traditional statistics backgrounds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key concepts both approaches share:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Null Hypothesis (H₀):&lt;/strong&gt; The assumption that there's no difference between versions A and B&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Alternative Hypothesis (H₁):&lt;/strong&gt; Your prediction that version B will perform differently&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Statistical Significance/Confidence:&lt;/strong&gt; The certainty that results aren't due to chance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Statistical Power:&lt;/strong&gt; The probability of detecting a real difference when it exists (typically aim for 80%+)&lt;/p&gt;

&lt;p&gt;Many modern platforms like GrowthBook default to Bayesian but offer both engines, letting teams choose based on their preferences and expertise. Both approaches can utilize advanced techniques like CUPED for variance reduction and sequential testing for early stopping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Size: The Foundation of Reliable Tests
&lt;/h3&gt;

&lt;p&gt;You need enough data to trust your results. The required sample size depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Baseline conversion rate&lt;/strong&gt; : Your current performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum detectable effect (MDE)&lt;/strong&gt;: The smallest improvement you care about&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical significance threshold&lt;/strong&gt; : Usually 95%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical power&lt;/strong&gt; : Usually 80%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example calculation&lt;/strong&gt; (reproduced in the code sketch after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current conversion rate: 3%&lt;/li&gt;
&lt;li&gt;Want to detect: 20% relative improvement (to 3.6%)&lt;/li&gt;
&lt;li&gt;Required confidence: 95%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result: ~14,000 visitors per variation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
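
&lt;p&gt;The arithmetic behind that number is a standard two-proportion power calculation; here's a minimal sketch that reproduces it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Per-variation sample size for comparing two conversion rates
// (two-sided z-test at 95% confidence, 80% power).
function sampleSizePerVariation(baseline: number, relativeLift: number): number {
  const p1 = baseline;                      // e.g. 0.03
  const p2 = baseline * (1 + relativeLift); // e.g. 0.036 for a 20% lift
  const pBar = (p1 + p2) / 2;
  const zAlpha = 1.96;  // two-sided alpha = 0.05
  const zBeta = 0.8416; // power = 0.80
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p2 - p1) ** 2);
}

sampleSizePerVariation(0.03, 0.2); // about 13,900 per variation
&lt;/code&gt;&lt;/pre&gt;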

&lt;p&gt;&lt;strong&gt;Tools to help:&lt;/strong&gt; Most A/B testing platforms include built-in power calculators and sample size estimators. These tools eliminate guesswork by automatically calculating the visitors needed based on your specific metrics and goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Danger of Peeking
&lt;/h3&gt;

&lt;p&gt;Checking results before reaching sample size is like judging a marathon at the 5-mile mark. Early results fluctuate wildly and often reverse completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why peeking misleads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small samples amplify random variation&lt;/li&gt;
&lt;li&gt;Winners and losers often swap positions multiple times&lt;/li&gt;
&lt;li&gt;"Regression to the mean" causes early extremes to normalize&lt;/li&gt;
&lt;li&gt;Each peek increases your false positive rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Set your sample size, run the test to completion, then analyze. No exceptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Step-by-Step A/B Testing Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Research and Identify Opportunities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Start with data, not opinions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze your analytics for high-traffic, high-impact pages&lt;/li&gt;
&lt;li&gt;Review heatmaps and session recordings&lt;/li&gt;
&lt;li&gt;Collect customer feedback and support tickets&lt;/li&gt;
&lt;li&gt;Run user surveys about friction points&lt;/li&gt;
&lt;li&gt;Audit your conversion funnel for drop-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prioritize using ICE scoring&lt;/strong&gt; (a small scoring sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt; : How much could this improve key metrics?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt; : How sure are you it will work?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease&lt;/strong&gt; : How simple is it to implement?&lt;/li&gt;
&lt;/ul&gt;
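
&lt;p&gt;A minimal sketch of ICE scoring in code, assuming the common convention of rating each dimension 1–10 and ranking by the product (some teams average instead):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// ICE score: rate each idea 1-10 on impact, confidence, and ease,
// then rank the backlog by the product of the three.
interface TestIdea {
  name: string;
  impact: number;
  confidence: number;
  ease: number;
}

function iceScore(idea: TestIdea): number {
  return idea.impact * idea.confidence * idea.ease;
}

const backlog: TestIdea[] = [
  { name: "Simplify checkout form", impact: 8, confidence: 6, ease: 7 },
  { name: "New hero headline", impact: 5, confidence: 7, ease: 9 },
];

backlog.sort((a, b) =&amp;gt; iceScore(b) - iceScore(a)); // highest score first
&lt;/code&gt;&lt;/pre&gt;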

&lt;h3&gt;
  
  
  Step 2: Form a Strong Hypothesis
&lt;/h3&gt;

&lt;p&gt;Weak hypothesis: "Let's try a green button"&lt;/p&gt;

&lt;p&gt;Strong hypothesis: "By changing our CTA button from gray to green (change), we will increase contrast and draw more attention (reasoning), resulting in a 15% increase in click-through rate (predicted outcome) as measured over 14 days with 95% confidence (measurement criteria)."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis framework:&lt;/strong&gt; "By [specific change], we expect [specific metric] to [increase/decrease] by [amount] because [reasoning based on research]."&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Design Your Test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test one variable at a time (multiple changes = unclear results)&lt;/li&gt;
&lt;li&gt;Ensure equal, random traffic distribution&lt;/li&gt;
&lt;li&gt;Keep everything else identical between versions&lt;/li&gt;
&lt;li&gt;Consider mobile vs. desktop separately&lt;/li&gt;
&lt;li&gt;Account for different user segments if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality assurance checklist (guardrails):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both versions load at the same speed&lt;/li&gt;
&lt;li&gt;Tracking is properly implemented&lt;/li&gt;
&lt;li&gt;Test works across all browsers&lt;/li&gt;
&lt;li&gt;Mobile experience is preserved&lt;/li&gt;
&lt;li&gt;No flickering or layout shifts&lt;/li&gt;
&lt;li&gt;Forms and CTAs function correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Calculate Required Sample Size
&lt;/h3&gt;

&lt;p&gt;Never start without knowing your endpoint. Most modern A/B testing platforms include power calculators that do the heavy lifting for you.&lt;/p&gt;

&lt;p&gt;Input these parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current conversion rate (from your analytics)&lt;/li&gt;
&lt;li&gt;Minimum improvement worth detecting (be realistic)&lt;/li&gt;
&lt;li&gt;Significance level (typically 95%)&lt;/li&gt;
&lt;li&gt;Statistical power (typically 80%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform will calculate exactly how many visitors you need per variation. This removes the guesswork and ensures your test has enough power to detect meaningful differences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time consideration:&lt;/strong&gt; Run tests for at least one full business cycle (usually 1-2 weeks minimum) to account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weekday vs. weekend behavior&lt;/li&gt;
&lt;li&gt;Beginning vs. end of month patterns&lt;/li&gt;
&lt;li&gt;External factors (news, weather, events)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Launch and Monitor (Without Peeking!)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Launch checklist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up your test in your A/B testing tool&lt;/li&gt;
&lt;li&gt;Configure goal tracking and secondary metrics&lt;/li&gt;
&lt;li&gt;Document test details in your testing log&lt;/li&gt;
&lt;li&gt;Set calendar reminder for test end date&lt;/li&gt;
&lt;li&gt;Resist the urge to check results early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor only for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical errors or bugs&lt;/li&gt;
&lt;li&gt;Extreme business impact (massive revenue loss)&lt;/li&gt;
&lt;li&gt;Sample ratio mismatch (uneven traffic split; a quick check is sketched below)&lt;/li&gt;
&lt;/ul&gt;
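
&lt;p&gt;Sample ratio mismatch is the one health check worth automating. A quick chi-square test tells you when an observed split deviates from the configured split by far more than chance, which usually means broken assignment or tracking. A minimal sketch for a two-variation, 50/50 test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Chi-square test for sample ratio mismatch on a two-arm, 50/50 split.
// Flags when the split is suspicious (p &amp;lt; 0.001, i.e. a chi-square
// statistic above 10.83 with one degree of freedom).
function hasSampleRatioMismatch(usersA: number, usersB: number): boolean {
  const total = usersA + usersB;
  const expected = total / 2;
  const chiSquare =
    (usersA - expected) ** 2 / expected +
    (usersB - expected) ** 2 / expected;
  return chiSquare &amp;gt; 10.83;
}

hasSampleRatioMismatch(5000, 5103); // false: within normal variation
hasSampleRatioMismatch(5000, 5500); // true: investigate assignment and tracking
&lt;/code&gt;&lt;/pre&gt;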

&lt;h3&gt;
  
  
  Step 6: Analyze Results Properly
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Beyond the winner/loser binary:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check statistical significance&lt;/strong&gt; (p-value &amp;lt; 0.05)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify sample size was reached&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for segment differences&lt;/strong&gt; (mobile vs. desktop, new vs. returning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze secondary metrics&lt;/strong&gt; (did conversions increase but quality decrease?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider practical significance&lt;/strong&gt; (is 0.1% lift worth implementing?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document learnings&lt;/strong&gt; regardless of outcome&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 7: Implement and Iterate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;If your variation wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement for 100% of traffic&lt;/li&gt;
&lt;li&gt;Monitor post-implementation performance&lt;/li&gt;
&lt;li&gt;Test iterations to maximize the improvement&lt;/li&gt;
&lt;li&gt;Apply learnings to similar pages/elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If your variation loses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is still valuable learning&lt;/li&gt;
&lt;li&gt;Analyze why your hypothesis was wrong&lt;/li&gt;
&lt;li&gt;Test the opposite approach&lt;/li&gt;
&lt;li&gt;Document insights for future tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If it's inconclusive:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You may need a larger sample size&lt;/li&gt;
&lt;li&gt;The difference might be too small to matter&lt;/li&gt;
&lt;li&gt;Test a bolder variation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced A/B Testing Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Sequential Testing
&lt;/h3&gt;

&lt;p&gt;Instead of a single A vs. B test, run A vs. B, then the winner vs. C, building improvements incrementally.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Bandit Testing
&lt;/h3&gt;

&lt;p&gt;Automatically shift more traffic to winning variations during the test, maximizing conversions while learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Personalization Layers
&lt;/h3&gt;

&lt;p&gt;Test different experiences for different segments (new vs. returning, mobile vs. desktop, geographic regions).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Full-Funnel Testing
&lt;/h3&gt;

&lt;p&gt;Don't just test for initial conversions—measure downstream impact on retention, lifetime value, and referrals.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Qualitative + Quantitative
&lt;/h3&gt;

&lt;p&gt;Combine A/B tests with user research to understand not just what works, but why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common A/B Testing Mistakes (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake #1: Testing Without Traffic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Running tests on pages with &amp;lt;1,000 visitors/week&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Focus on highest-traffic pages or make bolder changes that require smaller samples to detect&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #2: Stopping Tests at Significance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Ending tests as soon as p-value hits 0.05&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Predetermine sample size and duration; stick to it regardless of interim results&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #3: Ignoring Segment Differences
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Overall winner performs worse for valuable segments&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Always analyze results by key segments before implementing&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #4: Testing Tiny Changes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Button shade variations when the whole page needs work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Match change boldness to your traffic volume; small sites need bigger swings&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #5: One-Hit Wonders
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Running one test then moving on&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Create a testing culture with regular cadence and iteration&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #6: Significance Shopping
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Testing 20 metrics hoping one shows significance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Choose primary metric before starting; treat others as secondary insights&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #7: Seasonal Blindness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Testing during Black Friday, applying results year-round&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Note external factors; retest during normal periods&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #8: Technical Debt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Winner requires complex maintenance or breaks other features&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Consider implementation cost in your analysis&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #9: Learning Amnesia
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Not documenting or sharing test results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Maintain a testing knowledge base; share learnings broadly&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Testing Culture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Moving Beyond Individual Tests
&lt;/h3&gt;

&lt;p&gt;The real value of A/B testing isn't any single win—it's building an organization that makes decisions based on evidence rather than opinions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cultural pillars:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Democratize testing&lt;/strong&gt; : Enable anyone to propose tests (with proper review)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate learning&lt;/strong&gt; : Failed tests that teach are as valuable as winners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share broadly&lt;/strong&gt; : Make results visible across the organization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Think in probabilities&lt;/strong&gt; : Replace "I think" with "Let's test"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embrace iteration&lt;/strong&gt; : Every result leads to new questions&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Testing Program Maturity Model
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Sporadic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Occasional tests when someone remembers&lt;/li&gt;
&lt;li&gt;No formal process&lt;/li&gt;
&lt;li&gt;Results often ignored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Systematic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular testing cadence&lt;/li&gt;
&lt;li&gt;Basic documentation&lt;/li&gt;
&lt;li&gt;Some stakeholder buy-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Strategic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing roadmap aligned with business goals&lt;/li&gt;
&lt;li&gt;Cross-functional involvement&lt;/li&gt;
&lt;li&gt;Knowledge sharing practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Embedded&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing considered for every change&lt;/li&gt;
&lt;li&gt;Sophisticated segmentation and analysis&lt;/li&gt;
&lt;li&gt;Company-wide testing culture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 5: Optimized&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictive models guide testing&lt;/li&gt;
&lt;li&gt;Automated test generation&lt;/li&gt;
&lt;li&gt;Testing drives innovation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future of A/B Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI-Powered Testing
&lt;/h3&gt;

&lt;p&gt;Machine learning increasingly suggests what to test, predicts results, and automatically generates variations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Personalization
&lt;/h3&gt;

&lt;p&gt;Move beyond testing to delivering the optimal experience for each individual user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Inference
&lt;/h3&gt;

&lt;p&gt;Advanced statistical methods better isolate true cause-and-effect relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Channel Orchestration
&lt;/h3&gt;

&lt;p&gt;Test experiences across web, mobile, email, and offline touchpoints simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy-First Methods
&lt;/h3&gt;

&lt;p&gt;New approaches maintain testing capability while respecting user privacy and regulatory requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start Today (Even Small)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one element&lt;/strong&gt; on your highest-traffic page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form a hypothesis&lt;/strong&gt; about how to improve it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a simple test&lt;/strong&gt; for two weeks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze results&lt;/strong&gt; objectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share learnings&lt;/strong&gt; with your team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test again&lt;/strong&gt; based on what you learned&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Wins to Try First
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Headline on your homepage&lt;/li&gt;
&lt;li&gt;CTA button color and text&lt;/li&gt;
&lt;li&gt;Form field reduction&lt;/li&gt;
&lt;li&gt;Social proof placement&lt;/li&gt;
&lt;li&gt;Pricing page layout&lt;/li&gt;
&lt;li&gt;Email subject lines&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resources for Continued Learning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Essential books:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Trustworthy Online Controlled Experiments" by Kohavi, Tang, and Xu&lt;/li&gt;
&lt;li&gt;"A/B Testing: The Most Powerful Way to Turn Clicks Into Customers" by Dan Siroker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Communities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://join.slack.com/t/growthbookusers/shared_invite/zt-2xw8fu279-Y~hwnfCEf7WrEI9qScHURQ?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GrowthBook Slack Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://testandlearn.community/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Test and Learn Community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stay updated:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow industry leaders on LinkedIn&lt;/li&gt;
&lt;li&gt;Subscribe to testing tool blogs&lt;/li&gt;
&lt;li&gt;Join local CRO meetups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: From Guessing to Knowing
&lt;/h2&gt;

&lt;p&gt;A/B testing transforms how organizations make decisions. Instead of lengthy debates, political maneuvering, and costly mistakes, you get clarity through data.&lt;/p&gt;

&lt;p&gt;But remember: A/B testing is a tool, not a religion. Some decisions require vision, creativity, and bold leaps that testing can't validate. The art lies in knowing when to test and when to trust your instincts.&lt;/p&gt;

&lt;p&gt;Start small. Test consistently. Learn continuously. Let data guide you while creativity drives you.&lt;/p&gt;

&lt;p&gt;The companies that win in 2025 won't be those with the best guesses—they'll be those with the best evidence.&lt;/p&gt;

</description>
      <category>experimentation</category>
    </item>
    <item>
      <title>Building in the AI Era: Lessons from Past Technological Revolutions</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 29 Jul 2025 01:07:20 +0000</pubDate>
      <link>https://dev.to/growthbook/building-in-the-ai-era-lessons-from-past-technological-revolutions-2nim</link>
      <guid>https://dev.to/growthbook/building-in-the-ai-era-lessons-from-past-technological-revolutions-2nim</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1567789884554-0b844b597180%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wxMTc3M3wwfDF8c2VhcmNofDh8fGF1dG9tYXRpb258ZW58MHx8fHwxNzUzNzUxMDM2fDA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1567789884554-0b844b597180%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wxMTc3M3wwfDF8c2VhcmNofDh8fGF1dG9tYXRpb258ZW58MHx8fHwxNzUzNzUxMDM2fDA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D2000" alt="Building in the AI Era: Lessons from Past Technological Revolutions" width="2000" height="1333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are living through a generational technology shift—one that comes along only once or twice in a lifetime, reshaping how humans interact with the world. Just as electricity, automobiles, computers, the internet, and mobile computing were transformative, AI is doing the same today. However, history shows us that in the early days of a new technology, people often misunderstand the power that it unlocks. This article will examine some of the historical technology shifts and the lessons we can learn from them. &lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons from History
&lt;/h2&gt;

&lt;p&gt;Practical applications of electricity began to take root in the 1880s and 90s, when Edison opened the first electrical power station in Manhattan. The uses were initially targeted at consumers, with rich New Yorkers able to electrify their homes and replace their gas lights with electric ones. Industry, on the other hand, was slow to adapt, despite the evident advantages. Most industries simply replaced steam-powered equipment with electric equivalents, or added electric lights, without considering how they could operate differently.&lt;/p&gt;

&lt;p&gt;The engineering breakthrough came when Henry Ford reimagined the factory in the 1910s. He utilized electric motors' precise speed control and distributed power to create the moving assembly line in 1913—a feat impossible with centralized steam engines that required complex systems of belts and pulleys. These improvements cut the Model T build time from 12 hours to about 93 minutes—a systemic redesign that enabled scale, lowered costs, and transformed labor and manufacturing fundamentally.&lt;/p&gt;

&lt;p&gt;A similar lesson comes from the introduction of the television. In the early days of television, content was heavily borrowed from radio—simply filmed broadcasts of radio shows without inventing for the new medium. The real shift came when creators embraced television's potential: drama anthologies, magazine-format shows like Today and The Tonight Show, recording and editing footage from multiple cameras, and new storytelling formats were designed for television. By the 1950s, TV overtook radio: between 1950 and 1960, U.S. household ownership jumped from about 9 percent to over 60 percent, nearing 90 percent in the early 1960s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt; : Early adopters who treat a new medium like the old one often miss its full value. The true winners reimagine processes, experiences—and even entire business models—when they adopt these new technologies. &lt;/p&gt;

&lt;h2&gt;
  
  
  Parallels with Today’s AI Adoption
&lt;/h2&gt;

&lt;p&gt;It is evident from the above examples that there are parallels with the adoption of AI into our products and businesses. Pressure to add AI, or to be "the AI for X" in a given industry, results in many uninspired implementations. Many organizations today &lt;strong&gt;bolt on an AI assistant&lt;/strong&gt;—like lighting a few bulbs in a steam-powered factory—but miss the opportunity to reimagine workflows end-to-end. The real transformation occurs when considering how AI can transform the user experience.&lt;/p&gt;

&lt;p&gt;The difference between the past technological shifts and the AI one we’re experiencing today is the incredible velocity of the change.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It took about 13 years for Ford to sell 1 million cars. &lt;/li&gt;
&lt;li&gt;It took Google 1 year to reach 1 million searches per day. &lt;/li&gt;
&lt;li&gt;Apple’s iPhone launched in 2007, heralding the smartphone revolution, and sold 1 million units in just 74 days. &lt;/li&gt;
&lt;li&gt;ChatGPT, on the other hand, reached 1 billion searches per day in under a year—a metric that Google took over 10 years to achieve. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within just two months of its November 2022 launch, ChatGPT surpassed 100 million users—the fastest adoption rate ever recorded for a consumer software product. This rate of adoption suggests that companies that don't learn from history and adapt to the AI era face an existential threat, not just a competitive disadvantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  GrowthBook’s Journey with AI
&lt;/h2&gt;

&lt;p&gt;At GrowthBook, our initial step was adding the lightbulb: we launched an AI chatbot to help users navigate our documentation (a helpful concierge, if you will). &lt;/p&gt;

&lt;p&gt;Simultaneously, we conducted several brainstorming sessions to reevaluate our product and explore the potential impact of AI on our business. We ran the &lt;a href="https://reid.medium.com/how-to-scale-a-magical-experience-4-lessons-from-airbnbs-brian-chesky-eca0a182f3e3?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;11-star brainstorming&lt;/u&gt;&lt;/a&gt; sessions and planned our roadmap to reimagine what AI will mean in the A/B testing and product analytics space. We built &lt;a href="http://weblens.ai/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Weblens.ai&lt;/u&gt;&lt;/a&gt; as a demonstrator of some of the features that AI can unlock for A/B testing—and we have many more features coming very soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From electrification to television to AI, each technological shift has rewarded those who reimagined systems entirely. They didn’t just adopt new tools—they rewrote workflows, content, and the way they delivered value. &lt;/p&gt;

&lt;p&gt;Here are the lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Treat AI as a new paradigm—not just as an add-on&lt;/strong&gt;. Like Ford reengineered production or TV creators abandoned radio formats, &lt;em&gt;design products from an AI-native perspective&lt;/em&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on user journeys and tasks that AI can redefine&lt;/strong&gt;—insights, decisions, personalization—rather than isolated features shoehorned onto existing interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you don’t adapt now, someone else will&lt;/strong&gt;. AI has experienced an explosive rate of growth, resulting in significant productivity gains and a reduction in the time it takes to bring products to market.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>GrowthBook Version 4.0</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Wed, 09 Jul 2025 02:51:51 +0000</pubDate>
      <link>https://dev.to/growthbook/growthbook-version-40-5d8a</link>
      <guid>https://dev.to/growthbook/growthbook-version-40-5d8a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g95srhmgtw9kreo74jm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g95srhmgtw9kreo74jm.webp" alt="GrowthBook Version 4.0" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We shipped so many new features in our June Launch Month that we decided it deserved a major version increase. Version 4.0 brings a huge array of new features.  Here’s a quick summary of everything it includes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/introducing-the-first-mcp-server-for-experimentation-and-feature-management/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;GrowthBook MCP Server&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
AI tools like Cursor can now interact with GrowthBook via our new MCP server. Create feature flags, check the status of running experiments, clean up stale code, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-1/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Safer Rollouts&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Building upon our Safe Rollouts release last version, we added gradual traffic ramp up, auto rollback, a smart update schedule, and a time series view of results.  All of these combine to add even more safety around your feature releases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/experiment-decisions?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Decision Criteria&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
You can now customize the shipping recommendation logic for experiments.  Choose from a “Clear Signals” model, a “Do No Harm” model, or define your own from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search Filters&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We’ve revamped the search experience within GrowthBook to make it easier to find feature flags, metrics, and experiments.  Easily filter by project, owner, tag, type, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-2/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Insights Section&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
We added a brand new left nav section called “Insights” with a bunch of tools to help you learn from your past experiments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Dashboard&lt;/strong&gt; shows velocity, win rate, and scaled metric impact by project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learnings&lt;/strong&gt; is a searchable knowledge base of all of your completed experiments.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Experiment Timeline&lt;/strong&gt; shows when experiments were running and how they overlapped with each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric Effects&lt;/strong&gt; lists the experiments that had the biggest impact on a specific metric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric Correlations&lt;/strong&gt; let you see how two metrics move in relation to each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-3/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;SQL Explorer&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
We launched a lightweight SQL console and BI tool to explore and visualize your data directly within GrowthBook, without needing to switch to another platform like Looker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-4/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Managed Warehouse&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
GrowthBook Cloud now offers a fully managed ClickHouse database that is deeply integrated with the product.  It’s the fastest way to start collecting data and running experiments on GrowthBook.  You still get raw SQL access and all the benefits of a warehouse native product, without the setup and maintenance cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-4/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Feature Flag Usage&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
See analytics about how your feature flags are being evaluated in your app in real time.  This is built on top of the new Managed Warehouse on GrowthBook Cloud and is a game changer for debugging and QA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://flags-sdk.dev/providers/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Vercel Flags SDK&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
GrowthBook now has an official provider for the Vercel Flags SDK.  This is now the easiest way to add server-side feature flags to any Next.js project. We have an even deeper Vercel integration coming soon to make this experience even more seamless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/integrations/framer?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Official Framer Plugin&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
You can now easily run GrowthBook experiments inside of your Framer projects.  Assign visitors to different versions of your design (like layouts, headlines, or calls to action), track results, and confidently choose the best experience for your audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Landing Page&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
There’s a new landing page when you first log into GrowthBook.  Quickly see any features or experiments that need your attention, pick up where you left off, and learn about advanced GrowthBook functionality to get the most out of the platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New Experimentation Left Nav&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
There’s a new “Experimentation” section in the left nav. Experiments and Bandits now live within this section, along with our Power Calculator, Experiment Templates, and Namespaces.  We’ll be expanding this section soon with Holdouts and more, so stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST API Updates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter the listFeatures endpoint by clientKey (a fetch sketch follows this list)&lt;/li&gt;
&lt;li&gt;Support partial rule updates in the putFeature endpoint&lt;/li&gt;
&lt;li&gt;New Queries endpoint to retrieve raw SQL queries and results from an experiment&lt;/li&gt;
&lt;li&gt;Added Custom Field support to feature and experiment endpoints&lt;/li&gt;
&lt;li&gt;New endpoints for getting feature code refs&lt;/li&gt;
&lt;li&gt;New endpoint to revert a feature to a specific revision&lt;/li&gt;
&lt;/ul&gt;
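
&lt;p&gt;As a rough sketch of the clientKey filter (the endpoint and auth header follow GrowthBook's REST API conventions; the token value, client key, and response handling below are placeholder assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal sketch: list feature flags filtered by SDK client key.
// Token and client key values are placeholders.
async function listFeatures(clientKey: string) {
  const res = await fetch(
    "https://api.growthbook.io/api/v1/features?clientKey=" + encodeURIComponent(clientKey),
    { headers: { Authorization: "Bearer secret_abc123" } }
  );
  if (!res.ok) throw new Error("GrowthBook API error: " + res.status);
  return res.json();
}
&lt;/code&gt;&lt;/pre&gt;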

&lt;p&gt;&lt;strong&gt;Performance Improvements&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We’ve drastically improved the CPU and memory usage when self-hosting GrowthBook at scale. On GrowthBook Cloud, we’ve seen a roughly 50% reduction during peak load, leading to lower latency and virtually eliminating container failures in production.&lt;/p&gt;

</description>
      <category>releases</category>
      <category>experimentation</category>
      <category>featureflags</category>
      <category>newfeatures</category>
    </item>
  </channel>
</rss>
