<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GrowthBook</title>
    <description>The latest articles on DEV Community by GrowthBook (@growthbook).</description>
    <link>https://dev.to/growthbook</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9540%2Fed7b9b79-4c3a-402a-80d1-559eb1ee5cc9.png</url>
      <title>DEV Community: GrowthBook</title>
      <link>https://dev.to/growthbook</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/growthbook"/>
    <language>en</language>
    <item>
      <title>How The Social Hub Cut Experimentation Costs by 82%</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Sat, 24 Jan 2026 03:07:05 +0000</pubDate>
      <link>https://dev.to/growthbook/how-the-social-club-cut-experimentation-costs-by-82-2nnj</link>
      <guid>https://dev.to/growthbook/how-the-social-club-cut-experimentation-costs-by-82-2nnj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv7mavf48arvzpfktth1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv7mavf48arvzpfktth1.webp" alt="How The Social Club Cut Experimentation Costs by 82%" width="630" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rudger de Groot of&lt;/em&gt; &lt;a href="https://www.mintminds.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;Mintminds&lt;/em&gt;&lt;/a&gt; &lt;em&gt;shared how&lt;/em&gt; &lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;The Social Hub&lt;/em&gt;&lt;/a&gt; &lt;em&gt;slashed its experimentation costs with GrowthBook. By driving down the incremental cost per experiment as close as possible to zero, companies can run as many experiments on as much traffic as they want.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The best experimentation programs scale cost-efficiently, so they can run more experiments, learn faster, and ship smarter. But a hidden cost killer is BigQuery query inefficiency. The more you test, the more you pay. What if there were a way to test more and pay less?&lt;/p&gt;

&lt;p&gt;In this case study, we’ll show you how &lt;a href="https://www.mintminds.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Mintminds&lt;/a&gt; cut experimentation costs for &lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The Social Hub&lt;/a&gt; using GrowthBook with BigQuery optimizations from &lt;a href="https://ga4dataform.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GA4Dataform by Superform Labs&lt;/a&gt;. The setup slashed BigQuery costs by 81.8% while improving data refresh speeds and monitoring capabilities. Here's how they did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Scaling Advantage Built into the Cost Structure
&lt;/h2&gt;

&lt;p&gt;The mission at Mintminds is simple: build high-quality experiments with reliable data and analysis. &lt;a href="https://www.growthbook.io/pricing?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GrowthBook’s pricing model&lt;/a&gt; allows for a setup where the more you test, the lower your per-experiment cost. But to optimize costs, you need to understand where money actually flows. Let’s break down the pricing:&lt;/p&gt;

&lt;p&gt;Fixed Costs (GrowthBook Pro, as of Nov 2025):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$40/month per seat for GrowthBook Pro license&lt;/li&gt;
&lt;li&gt;Typical team size: 5 seats = $200/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Variable Costs (GrowthBook Cloud):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 million CDN requests included (≈ pageviews)&lt;/li&gt;
&lt;li&gt;20 GB CDN bandwidth included&lt;/li&gt;
&lt;li&gt;Overage: $10 per million requests, $1 per GB bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-Hosting Alternative: You can eliminate CDN costs by &lt;a href="https://hub.docker.com/r/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;self-hosting GrowthBook&lt;/a&gt; for $11-50/month (depending on your infrastructure choice).&lt;/p&gt;

&lt;h2&gt;
  
  
  How Experimentation Costs Compare
&lt;/h2&gt;

&lt;p&gt;To understand how GrowthBook experimentation costs compare, Mintminds shares a real-world example from a client with 2.6 million unique users per month, running 5–7 experiments a month. In this example, they run the GrowthBook JS SDK on Cloudflare Pages, which means no limit on the number of tested visitors, at no cost. Yes, you read that right: free!&lt;/p&gt;

&lt;p&gt;The variable GrowthBook costs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6.6 million CDN requests: 6.6M − 2M included = 4.6M overage × $10 = $46&lt;/li&gt;
&lt;li&gt;6 GB CDN bandwidth: $0 (first 20 GB are included)&lt;/li&gt;
&lt;li&gt;BigQuery usage cost estimation with daily updates: $300&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fixed GrowthBook Pro costs for a team of 5 members: 5 * $40 = $200&lt;/p&gt;
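
&lt;p&gt;As a sanity check, the arithmetic above fits in a few lines of Python. This is a minimal sketch using the list prices quoted in this post; the 81.8% reduction applied at the end is the BigQuery saving reported below, and the optimized result lands within rounding distance of the $303 in the table that follows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of the cost math above (prices as quoted, Nov 2025).
INCLUDED_REQUESTS_M = 2.0     # first 2 million CDN requests included
INCLUDED_BANDWIDTH_GB = 20.0  # first 20 GB of CDN bandwidth included
PRICE_PER_M_REQUESTS = 10.0   # dollars per extra million requests
PRICE_PER_GB = 1.0            # dollars per extra GB of bandwidth
SEAT_PRICE = 40.0             # dollars per Pro seat per month

def monthly_cost(requests_m, bandwidth_gb, bigquery_usd, seats):
    request_overage = max(0.0, requests_m - INCLUDED_REQUESTS_M)
    bandwidth_overage = max(0.0, bandwidth_gb - INCLUDED_BANDWIDTH_GB)
    variable = (request_overage * PRICE_PER_M_REQUESTS
                + bandwidth_overage * PRICE_PER_GB
                + bigquery_usd)
    return variable + seats * SEAT_PRICE

# The example client: 6.6M requests, 6 GB bandwidth, $300 BigQuery, 5 seats
print(monthly_cost(6.6, 6.0, 300.0, 5))                # 546.0 (unoptimized)
print(monthly_cost(6.6, 6.0, 300.0 * (1 - 0.818), 5))  # about 301 (optimized)
&lt;/code&gt;&lt;/pre&gt;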

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Annual Cost&lt;/th&gt;
&lt;th&gt;vs. GrowthBook Optimized&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Convert.com Pro&lt;/td&gt;
&lt;td&gt;$3,488&lt;/td&gt;
&lt;td&gt;$41,856&lt;/td&gt;
&lt;td&gt;1,050% more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VWO Pro&lt;/td&gt;
&lt;td&gt;$4,308&lt;/td&gt;
&lt;td&gt;$51,696&lt;/td&gt;
&lt;td&gt;1,320% more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GrowthBook (Unoptimized)&lt;/td&gt;
&lt;td&gt;$546&lt;/td&gt;
&lt;td&gt;$6,552&lt;/td&gt;
&lt;td&gt;80% more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GrowthBook (Optimized)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$303&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,640&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With BigQuery costs included, GrowthBook remains dramatically cheaper than traditional alternatives like Convert ($3,500/month) or VWO ($4,300/month) at comparable traffic levels. GrowthBook is already the smart financial choice. With optimization, it becomes unbeatable: per the table above, optimized GrowthBook cuts experimentation costs by roughly 91% versus Convert.com Pro and 93% versus VWO Pro.&lt;/p&gt;

&lt;p&gt;An 82% BigQuery reduction transforms GrowthBook from “very affordable” to an offer you simply can’t refuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  GA4 Structure Wastes BigQuery Resources
&lt;/h2&gt;

&lt;p&gt;Regardless of hosting choice, BigQuery becomes your primary variable cost when using GA4 as your data source. For companies running active experimentation programs with daily updates, Mintminds finds that unoptimized BigQuery costs can easily reach $200 to $400/month.&lt;/p&gt;

&lt;p&gt;The default GrowthBook BigQuery integration queries GA4’s standard &lt;code&gt;events_*&lt;/code&gt; and &lt;code&gt;events_intraday_*&lt;/code&gt; tables. These tables store event parameters in nested structures, forcing BigQuery to process far more data than necessary.&lt;/p&gt;

&lt;p&gt;For example, when you’re running experiments with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 metrics (1 goal + 1 secondary + 3 guardrails)&lt;/li&gt;
&lt;li&gt;3 dimensions for segmentation&lt;/li&gt;
&lt;li&gt;Daily (or more frequent) data refreshes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BigQuery has to scan through nested arrays and repeated fields to extract the specific event parameters you need. You’re paying to process gigabytes of data when you only need megabytes of relevant information.&lt;/p&gt;
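
&lt;p&gt;To see the difference in practice, the sketch below (Python, with the official google-cloud-bigquery client) dry-runs one query against the nested GA4 export and one against a flattened table, reporting the bytes each would scan without incurring any cost. The project, dataset, table, and parameter names are illustrative placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# pip install google-cloud-bigquery (GCP credentials required)
from google.cloud import bigquery

client = bigquery.Client()

# Raw GA4 export: every parameter lives in a nested event_params array,
# so BigQuery has to scan and UNNEST the whole column to find one key.
nested_sql = """
SELECT
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params)
   WHERE key = 'experiment_id') AS experiment_id
FROM `my_project.analytics_123456.events_*`
WHERE event_name = 'experiment_viewed'
"""

# Flattened, partitioned table (the kind GA4Dataform produces): the
# parameter is a plain column and the date filter prunes partitions.
flattened_sql = """
SELECT user_pseudo_id, experiment_id
FROM `my_project.ga4dataform.events_flat`
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
  AND event_name = 'experiment_viewed'
"""

# Dry runs report bytes that would be processed, free of charge.
config = bigquery.QueryJobConfig(dry_run=True)
for label, sql in [("nested", nested_sql), ("flattened", flattened_sql)]:
    job = client.query(sql, job_config=config)
    print(label, job.total_bytes_processed, "bytes")
&lt;/code&gt;&lt;/pre&gt;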

&lt;p&gt;GrowthBook allows &lt;a href="https://blog.growthbook.io/fact-table-query-optimization/" rel="noopener noreferrer"&gt;custom fact tables&lt;/a&gt; and metrics to select only relevant events and parameters. This helps, but optimizations plateau quickly because you’re still querying nested GA4 tables.&lt;/p&gt;

&lt;p&gt;Enterprise customers get access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Advanced fact table query optimization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.growthbook.io/app/data-pipeline?ref=blog.growthbook.io#incremental-refresh-recommended" rel="noopener noreferrer"&gt;Data pipelines&lt;/a&gt; (significantly improved in GrowthBook 4.2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But Pro license users need a different approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use GA4Dataform's Flattened Datasets to Reduce Query Costs
&lt;/h2&gt;

&lt;p&gt;At #CH2024 (the conference formerly known as Conversion Hotel), Rudger connected with &lt;a href="https://www.linkedin.com/in/stuifbergen/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Jules Stuifbergen&lt;/a&gt; from &lt;a href="https://ga4dataform.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Superform Labs&lt;/a&gt; about this exact challenge. Jules introduced him to GA4Dataform, which offered an elegant solution.&lt;/p&gt;

&lt;p&gt;What GA4Dataform Does: The Core Version (free!) creates a customized, flattened dataset optimized for the type of queries that GrowthBook uses.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fully flattened structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No nested fields = dramatically faster queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Smart partitioning and clustering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Restricting queries by date and event name reduces the number of rows scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Smaller data footprint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Less data processed = lower BigQuery costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily automated updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fresh data from GA4 events table is appended to the table, using incremental logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key insight: Even though you’re creating a new dataset in BigQuery (which feeds from the generic GA4 table), the flattened structure makes it cheaper to generate AND cheaper to query than repeatedly querying GA4’s nested tables.&lt;/p&gt;

&lt;p&gt;Bonus benefit: This same optimized dataset can be used for all your other BigQuery reports and dashboards, compounding the savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Rigorous A/A Experiment to Test the Setup
&lt;/h2&gt;

&lt;p&gt;Mintminds partnered with &lt;a href="https://www.linkedin.com/in/laurasemeraro-marketinganalytics/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Laura Semeraro&lt;/a&gt; and the team at &lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The Social Hub&lt;/a&gt;—a hybrid hospitality brand offering hotel rooms, co-living spaces, coworking facilities, and creative playgrounds across Europe—to validate this approach with real data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Using GA4Dataform's flattened datasets didn't just reduce GrowthBook costs—it optimized all our BigQuery reports and dashboards.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt; &lt;span&gt; &lt;span&gt;Laura Semeraro&lt;/span&gt;, &lt;span&gt;Digital Analyst&lt;/span&gt; at &lt;span&gt; &lt;span&gt;The Social Hub&lt;/span&gt; &lt;/span&gt; &lt;/span&gt; &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Implementation Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. GA4Dataform Setup&lt;/strong&gt; – Laura installed GA4Dataform Core (the free version) and added GrowthBook’s custom event parameters (experiment ID and variation ID) to the configuration. With the daily schedule enabled, GA4Dataform updates the flat events table incrementally and automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. GrowthBook Configuration&lt;/strong&gt; – Mintminds created a new assignment query (for counting experiment visitors) and built fact tables for the key conversion events: add-to-cart and purchase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A/A Test Design&lt;/strong&gt; – They ran two identical experiments simultaneously:&lt;/p&gt;

&lt;p&gt;Configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same targeting rules&lt;/li&gt;
&lt;li&gt;Same 5 metrics (1 goal, 1 secondary, 3 guardrails)&lt;/li&gt;
&lt;li&gt;Same 3 dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Only Difference:&lt;/p&gt;

&lt;p&gt;Experiment A: Default GrowthBook queries (nested GA4 tables)&lt;br&gt;&lt;br&gt;
Experiment B: Optimized queries (flattened GA4Dataform dataset)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Measurement&lt;/strong&gt; – GrowthBook usage is automatically labelled in BigQuery, making it possible to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BigQuery costs from Experiment A (old approach)&lt;/li&gt;
&lt;li&gt;BigQuery costs from Experiment B (new approach)&lt;/li&gt;
&lt;li&gt;BigQuery costs for daily dataset updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test duration: 1 week&lt;/p&gt;

&lt;p&gt;This gave them an objective, apples-to-apples comparison.&lt;/p&gt;
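
&lt;p&gt;Because those jobs are labelled, the spend for each approach can be read straight out of BigQuery’s own metadata. Here is a hedged sketch; the region qualifier, the label key, and the on-demand price of roughly $6.25 per TiB are assumptions to adapt to your own project:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT SUM(total_bytes_billed) AS bytes_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  -- assumed label key; inspect a GrowthBook-issued job to find yours
  AND EXISTS (SELECT 1 FROM UNNEST(labels) l WHERE l.key = 'integration')
"""
bytes_billed = list(client.query(sql))[0].bytes_billed or 0
print(f"est. 7-day cost: ${bytes_billed / 2**40 * 6.25:.2f}")
&lt;/code&gt;&lt;/pre&gt;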

&lt;h2&gt;
  
  
  The Social Hub Reduced BigQuery Costs by 82%
&lt;/h2&gt;

&lt;p&gt;When the results came in, Rudger and his team had to verify the numbers multiple times to ensure accuracy: a whopping 81.8% cost reduction and a massive query speed improvement, too.&lt;/p&gt;

&lt;p&gt;By using the GA4Dataform flattened dataset instead of the default GA4 nested tables, they had reduced BigQuery data processing by more than four-fifths.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update experiment results more frequently&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Better SRM and MDE monitoring without budget concerns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Run updates faster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flattened queries execute in a fraction of the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale experiment volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The "more you test, less you pay" promise becomes reality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimize other analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use the same flattened dataset for all BigQuery dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The compounding effect:&lt;/strong&gt; lower per-experiment costs + faster refresh rates = dramatically better experimentation program ROI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise Experimentation at a Fraction of the Cost
&lt;/h2&gt;

&lt;p&gt;This case study demonstrates how to achieve exceptional BigQuery efficiency with GrowthBook. By combining GrowthBook Pro, GA4Dataform Core, and strategic BigQuery optimization, you can build a cost-effective, high-performance experimentation stack that rivals enterprise setups—at a fraction of the price. The cost reduction Mintminds achieved with The Social Hub isn’t an outlier. It’s the new baseline for GrowthBook implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Our Partners
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mintminds.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Mintminds&lt;/a&gt; is a &lt;a href="https://www.mintminds.com/growthbook/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Certified GrowthBook partner&lt;/a&gt; based in the Netherlands. Founded by &lt;a href="https://www.linkedin.com/in/rudgerdegroot/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Rudger de Groot&lt;/a&gt;, the team assists companies worldwide with hyper-scaling experimentation using GrowthBook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.thesocialhub.co/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The Social Hub&lt;/a&gt; is a European hospitality brand that blends traditional hotel stays with a vibrant, community-focused experience. Its unique hybrid model combines premium design-led short and long-stay hotel rooms with student accommodation, coworking spaces, meeting and event facilities, restaurants and bars, 24-hour gyms, and open-to-the-public spaces like rooftops, parks, and cultural venues.&lt;/p&gt;

</description>
      <category>abtesting</category>
    </item>
    <item>
      <title>AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 20 Jan 2026 15:53:06 +0000</pubDate>
      <link>https://dev.to/growthbook/ai-evals-vs-ab-testing-why-you-need-both-to-ship-genai-54n7</link>
      <guid>https://dev.to/growthbook/ai-evals-vs-ab-testing-why-you-need-both-to-ship-genai-54n7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2ab9e2oo12pt7rrodd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2ab9e2oo12pt7rrodd8.png" alt="AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most teams building with GenAI are flying blind. They've replaced unit tests with vibes and shipped prompts that "felt right" to three engineers on a Friday afternoon.&lt;/p&gt;

&lt;p&gt;This isn't a criticism—it's a diagnosis. For decades, we operated under a &lt;strong&gt;deterministic paradigm&lt;/strong&gt;. The contract between developer and machine was explicit: &lt;code&gt;Input A + Code = Output B&lt;/code&gt;. Always, without fail. In this world, success was binary. A unit test passed or it failed.&lt;/p&gt;

&lt;p&gt;Generative AI has shattered this contract. We have moved from deterministic engineering to &lt;strong&gt;probabilistic engineering&lt;/strong&gt;. We are no longer building binaries; we are managing stochastic agents that produce a distribution of probable outputs. You cannot &lt;code&gt;assert(x == y)&lt;/code&gt; when &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; can change every time.&lt;/p&gt;
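
&lt;p&gt;A toy sketch makes the contrast concrete. The &lt;code&gt;generate()&lt;/code&gt; function below is a hypothetical stand-in for any LLM call; the point is that a single assertion gives way to grading a distribution of outputs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def generate(prompt):
    # Hypothetical stand-in for an LLM call: output varies run to run.
    return random.choice(["4", "four", "The answer is 4."])

def looks_correct(answer):
    return "4" in answer or "four" in answer.lower()

# Deterministic world: one input, one expected output, pass or fail.
assert 2 + 2 == 4

# Probabilistic world: sample many outputs and gate on a pass rate.
samples = [generate("What is 2 + 2?") for _ in range(100)]
pass_rate = sum(looks_correct(s) for s in samples) / len(samples)
print(f"pass rate: {pass_rate:.0%}")
&lt;/code&gt;&lt;/pre&gt;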

&lt;p&gt;Gian Segato (Anthropic) eloquently sums up this shift: “We are no longer guaranteed what &lt;code&gt;x&lt;/code&gt; is going to be, and we're no longer certain about the output &lt;code&gt;y&lt;/code&gt; either, because it's now drawn from a distribution…. Stop for a moment to realize what this means. When building on top of this technology, our products can now succeed in ways we’ve never even imagined, and fail in ways we never intended” (&lt;a href="https://giansegato.com/essays/probabilistic-era?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Building AI Products In The Probabilistic Era&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;As seismic as this shift may be, we’re focusing on a single aspect of it here: the shift from the domain of &lt;strong&gt;verification&lt;/strong&gt; (is it correct?) to the domain of &lt;strong&gt;validation&lt;/strong&gt; (is it good?).&lt;/p&gt;

&lt;p&gt;This shift has left teams scrambling to define quality. Many have fallen into the trap of thinking AI Evaluations (Evals) are a replacement for A/B testing. They aren't.&lt;/p&gt;

&lt;p&gt;And, for those in a hurry, here’s the point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Evals&lt;/strong&gt; check for competence—&lt;em&gt;can the model do the job?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.growthbook.io/products/experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;A/B testing&lt;/strong&gt;&lt;/a&gt; checks for &lt;strong&gt;value&lt;/strong&gt; —&lt;em&gt;do users care?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You cannot ship a good AI product without both AI Evals and A/B testing.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Limits of Vibe Checking
&lt;/h2&gt;

&lt;p&gt;In the early days of the LLM boom, “Prompt Engineering” was largely a feeling-based art. Devs would tweak a prompt, run it three times, read the output, and decide if it “felt” better.&lt;/p&gt;

&lt;p&gt;This manual inspection—“vibe checking”—leverages human intuition, which is great for nuance but terrible for scale.&lt;/p&gt;

&lt;p&gt;Vibe checking suffers from three critical flaws:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sample size:&lt;/strong&gt; You might test 5 inputs. Production brings 50k edge cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression invisibility:&lt;/strong&gt; Making a prompt “polite” might accidentally break its ability to output valid JSON. You won’t feel that until the API breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subjectivity:&lt;/strong&gt; One engineer’s “concise” is another’s “curt.”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As ML systems researcher &lt;a href="https://thingsithinkithink.blog/posts/2025/06-08-llm-evals-lesson-1/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Shreya Shankar notes&lt;/a&gt;, “You can’t vibe check your way to understanding what’s going on.” Manual inspection is mathematically insufficient for understanding probabilistic systems at scale.&lt;/p&gt;

&lt;p&gt;To solve this, the industry turned to &lt;strong&gt;AI Evals&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;💡 For an excellent intro to AI Evals, check out &lt;a href="https://www.youtube.com/watch?v=BsWxPI9UM4c&amp;amp;ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Shreya Shankar and Hamel Husain on Lenny’s Podcast&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are AI Evals?
&lt;/h2&gt;

&lt;p&gt;AI Evaluations are an attempt to systematize the vibe check—turning qualitative judgment into quantitative metrics. They're a way to programmatically test the probabilistic parts of your application: prompts, models, and parameters.&lt;/p&gt;

&lt;p&gt;But the term "Eval" is overloaded. When someone says "we're running evals," they might mean any of three things.&lt;/p&gt;

&lt;h3&gt;
  
  
  3 Types of AI Evals and Why They Matter
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Model Evals
&lt;/h4&gt;

&lt;p&gt;Model evals are benchmarks like MMLU or HumanEval. They're useful for choosing a provider (GPT-5 vs. Claude Opus 4.5), but they tell you almost nothing about your specific application. A model might ace GSM8K (math reasoning) and still be a terrible customer service agent. Worse, these public benchmarks are increasingly contaminated—models have seen the test questions during training, inflating scores that don't transfer to novel problems. (We wrote a whole article about why “&lt;a href="https://blog.growthbook.io/the-benchmarks-are-lying/" rel="noopener noreferrer"&gt;The Benchmarks Are Lying To You&lt;/a&gt;.”)&lt;/p&gt;

&lt;h4&gt;
  
  
  2. System Evals
&lt;/h4&gt;

&lt;p&gt;System evals are what matter most. These test your end-to-end pipeline: prompt + RAG retrieval + model. The key metrics here are things like hallucination rate, faithfulness (does the answer stick to the retrieved context?), and relevance.&lt;/p&gt;

&lt;p&gt;Many teams now use &lt;strong&gt;LLM-as-Judge&lt;/strong&gt;—a strong model grading outputs on subjective criteria like tone, helpfulness, and coherence. It scales better than human review, but inherits the same limitation: it measures whether an answer &lt;em&gt;seems&lt;/em&gt; good, not whether users &lt;em&gt;act&lt;/em&gt; on it.&lt;/p&gt;
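
&lt;p&gt;The pattern itself is simple, as this minimal sketch shows. Here &lt;code&gt;call_model()&lt;/code&gt; is a hypothetical stand-in for whatever judge-model client you use, and the rubric and scores are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;JUDGE_RUBRIC = """You are grading a customer-support answer.
Score 1-5 for tone, helpfulness, and coherence.
Reply with only the three integers, comma-separated."""

def call_model(system, user):
    # Hypothetical judge-model call; canned reply so the sketch runs.
    return "4, 5, 4"

def judge(question, answer):
    reply = call_model(JUDGE_RUBRIC, f"Q: {question}\nA: {answer}")
    tone, helpful, coherent = (int(x) for x in reply.split(","))
    return {"tone": tone, "helpfulness": helpful, "coherence": coherent}

print(judge("Where is my order?", "It ships tomorrow."))
&lt;/code&gt;&lt;/pre&gt;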

&lt;h4&gt;
  
  
  3. Guardrails
&lt;/h4&gt;

&lt;p&gt;Guardrails are real-time safety checks—toxicity filters, PII detection, jailbreak prevention. Important, but a different concern than quality.&lt;/p&gt;

&lt;p&gt;All three share a critical constraint: they measure &lt;em&gt;competence&lt;/em&gt;, not &lt;em&gt;value&lt;/em&gt;. Whether you run evals offline in your CI/CD pipeline against a curated "Golden Dataset," or online against live traffic in shadow mode, you're still asking the same question: &lt;em&gt;Can this model do the job?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some evals do capture preference—human ratings, side-by-side comparisons, thumbs up/down. But these are still proxies. A user clicking "thumbs up" in a sandbox isn't the same as a user returning to your product tomorrow. Evals measure &lt;em&gt;stated&lt;/em&gt; preference; A/B tests measure &lt;em&gt;revealed&lt;/em&gt; preference through behavior.&lt;/p&gt;

&lt;p&gt;What evals can't tell you is whether users will care enough to stick around.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Where Evals Fall Short&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even within the realm of evals, a model that looks good in controlled conditions can fall apart in production.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://careersatdoordash.com/blog/how-to-investigate-the-online-vs-offline-performance-for-dnn-models/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;DoorDash engineering team&lt;/a&gt; documented this problem in detail. They built a new ad-ranking model that performed well in testing—but when deployed to real users, its accuracy dropped by 4.3%. The culprit? Their test data was &lt;em&gt;too clean&lt;/em&gt;. The model had been trained assuming it would always have fresh, up-to-date information about users. But in the real world, that data was often hours or days old due to system delays. The model had been optimized for conditions that didn't exist in production.&lt;/p&gt;

&lt;p&gt;This principle applies even more to LLM applications. LLMs are sensitive to prompt phrasing, context length, and retrieval quality—all of which behave differently in production than in curated test sets.&lt;/p&gt;

&lt;p&gt;Consider a concrete example: you optimize a customer service prompt for &lt;em&gt;faithfulness&lt;/em&gt;—it sticks strictly to your knowledge base and never hallucinates. Evals look great. But in production, users find the responses robotic and impersonal. Satisfaction drops. You optimized for accuracy; they wanted empathy.  &lt;/p&gt;

&lt;p&gt;This is the core limitation of evals: they measure capability, not value. Even when you run evals against live traffic, you're testing whether the model &lt;em&gt;can&lt;/em&gt; do something—not whether that something &lt;em&gt;matters&lt;/em&gt; to users.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why You Should Use A/B Testing with Your AI Evals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If evals are the unit test, A/B testing is the integration test with reality. It’s the only way to measure what actually matters: downstream business impact like retention, revenue, conversion, engagement, and user satisfaction.&lt;/p&gt;

&lt;p&gt;But running A/B tests on LLMs introduces challenges that didn't exist in traditional web experimentation. (For an introduction to the topic, see our &lt;a href="https://blog.growthbook.io/how-to-a-b-test-ai-a-practical-guide/" rel="noopener noreferrer"&gt;practical guide to A/B testing AI&lt;/a&gt;.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges of Running A/B Tests on AI
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The Latency Confound
&lt;/h4&gt;

&lt;p&gt;Intelligence usually costs speed. If you test a fast, simple model against a smart, slow one and the variant loses—why? Was the answer worse or did users just hate waiting three seconds?&lt;/p&gt;

&lt;p&gt;Isolating "intelligence" as a variable often requires artificial latency injection: intentionally slowing the control to match the variant. Only then can you measure what you think you're measuring.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. High Variance
&lt;/h4&gt;

&lt;p&gt;LLMs are non-deterministic. Two users in the same variant might see meaningfully different responses. This noise demands larger sample sizes and longer test durations to reach statistical significance.&lt;/p&gt;

&lt;p&gt;A button-color test might reach significance in a few thousand sessions. An LLM prompt test—where output variance is high and effect sizes are often small—might need 10x that, or weeks of runtime, to detect a meaningful difference.&lt;/p&gt;
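
&lt;p&gt;The standard two-proportion sample-size formula shows why. A sketch using the usual normal approximation (5% alpha, 80% power; the numbers are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from statistics import NormalDist

def n_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    # Normal-approximation sample size for comparing two proportions.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p = p_baseline + mde_abs / 2  # midpoint proportion
    return 2 * (z_a + z_b) ** 2 * p * (1 - p) / mde_abs ** 2

print(round(n_per_arm(0.10, 0.02)))    # ~3,800 users/arm for a 2pp lift
print(round(n_per_arm(0.10, 0.005)))   # ~58,000 users/arm for a 0.5pp lift
&lt;/code&gt;&lt;/pre&gt;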

&lt;h4&gt;
  
  
  3. Choosing the Right Metric
&lt;/h4&gt;

&lt;p&gt;Choosing the right metric is harder for AI features than for traditional UI changes. A chatbot might increase engagement (users ask more questions) while decreasing efficiency (they take longer to get answers). Align your success metric with actual business value, not just surface activity.&lt;/p&gt;

&lt;p&gt;These realities create a tension. A/B testing AI gives you certainty, but certainty takes time. If you have twenty prompts to evaluate, a traditional A/B test could take months. And during those months, a significant portion of your users are experiencing inferior variants.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Multi-Armed Bandits
&lt;/h3&gt;

&lt;p&gt;For prompt optimization—where iterations are cheap, and the cost of a suboptimal variant is low—&lt;a href="https://docs.growthbook.io/bandits/overview?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;multi-armed bandits&lt;/a&gt; offer a different trade-off. Instead of fixed traffic allocation, they dynamically shift users toward winning variants as data accumulates. You sacrifice some statistical rigor for speed and reduced regret.&lt;/p&gt;
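
&lt;p&gt;To show the mechanic, here is a compact sketch of one classic bandit algorithm, Beta-Bernoulli Thompson sampling. It is an illustration of the general technique under assumed conversion rates, not GrowthBook’s exact implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

arms = {name: {"wins": 1, "losses": 1}
        for name in ("prompt_a", "prompt_b", "prompt_c")}

def choose_arm():
    # Sample a plausible conversion rate from each arm's Beta posterior
    # and serve the arm whose sampled rate is highest.
    draws = {name: random.betavariate(s["wins"], s["losses"])
             for name, s in arms.items()}
    return max(draws, key=draws.get)

def record(name, converted):
    arms[name]["wins" if converted else "losses"] += 1

# Simulation: prompt_b secretly converts best; traffic drifts toward it.
true_rates = {"prompt_a": 0.05, "prompt_b": 0.08, "prompt_c": 0.04}
served = {name: 0 for name in arms}
for _ in range(5000):
    arm = choose_arm()
    served[arm] += 1
    record(arm, random.random() &amp;lt; true_rates[arm])
print(served)  # most of the 5,000 sessions end up on prompt_b
&lt;/code&gt;&lt;/pre&gt;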

&lt;p&gt;🎰 &lt;a href="https://blog.growthbook.io/introducing-multi-armed-bandits-in-growthbook/" rel="noopener noreferrer"&gt;Check out our deep dive on how multi-armed bandits work in GrowthBook.&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Comparing A/B Testing to Multi-Armed Bandits
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;A/B Testing&lt;/th&gt;
&lt;th&gt;Multi-Armed Bandits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Goal&lt;/td&gt;
&lt;td&gt;Knowledge. Determine with statistical certainty if B is better than A.&lt;/td&gt;
&lt;td&gt;Reward. Maximize total conversions during the experiment.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic Allocation&lt;/td&gt;
&lt;td&gt;Fixed for the duration.&lt;/td&gt;
&lt;td&gt;Dynamic. Automatically shifts traffic to the winner.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Use Case&lt;/td&gt;
&lt;td&gt;Major model launches, pricing, UI changes&lt;/td&gt;
&lt;td&gt;Prompt optimization, headline testing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bandits aren't a replacement for A/B testing. They're a complement—best suited for rapid iteration loops where you're optimizing within a validated direction, not making major strategic bets.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use AI Evals and A/B Testing Together
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lfph7atyypz82jj3bt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lfph7atyypz82jj3bt1.png" alt="AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At GrowthBook, we see the highest-performing teams treating evals and experimentation not as separate islands, but as a continuous pipeline—each stage filtering out risk with progressively more expensive (but more accurate) methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using AI Evals and A/B Testing Together in Practice
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Stage 1: The Offline Filter (CI/CD)
&lt;/h4&gt;

&lt;p&gt;A developer creates a new prompt branch. The CI/CD pipeline automatically runs evals against the Golden Dataset. If faithfulness drops below 90% or latency exceeds the threshold, the build fails. Bad ideas die here, costing pennies in API credits rather than user trust.&lt;/p&gt;
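
&lt;p&gt;A minimal sketch of such a gate follows; the dataset, scoring stubs, and thresholds are illustrative, and a real pipeline would swap in an actual prompt call and scorer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sys

FAITHFULNESS_FLOOR = 0.90  # build fails below this, as in the example above
LATENCY_CEILING_S = 3.0

GOLDEN = [  # in practice, a versioned golden dataset file
    {"input": "How do I reset my password?", "context": "Reset via settings."},
]

def run_prompt(text):
    # Hypothetical stand-in for calling the new prompt branch.
    return "You can reset it from the settings page.", 1.2

def faithfulness_score(answer, context):
    # Hypothetical scorer; in practice an LLM judge or NLI model.
    return 0.95

scores, latencies = [], []
for case in GOLDEN:
    answer, latency = run_prompt(case["input"])
    scores.append(faithfulness_score(answer, case["context"]))
    latencies.append(latency)

ok = (sum(scores) / len(scores) &amp;gt;= FAITHFULNESS_FLOOR
      and max(latencies) &amp;lt;= LATENCY_CEILING_S)
sys.exit(0 if ok else 1)  # a nonzero exit code fails the CI build
&lt;/code&gt;&lt;/pre&gt;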

&lt;h4&gt;
  
  
  Stage 2: Shadow Mode (Production, Silent)
&lt;/h4&gt;

&lt;p&gt;The prompt passes offline evals and gets deployed—but users never see it. The new model processes live traffic silently, logging predictions without surfacing them.&lt;/p&gt;

&lt;p&gt;This is online evaluation: you're still measuring competence (latency, accuracy, edge case handling), but now against real-world conditions. &lt;a href="https://careersatdoordash.com/blog/how-to-investigate-the-online-vs-offline-performance-for-dnn-models/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;DoorDash's 4% accuracy gap&lt;/a&gt; between testing and production is exactly the kind of discrepancy shadow mode is designed to surface—before users experience the degraded results.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 3: Safe Rollout
&lt;/h4&gt;

&lt;p&gt;Shadow mode passes. Feature flags gradually release the new model to users. You're monitoring guardrail metrics: error rates, refusal spikes, support tickets. If something tanks, you flip the flag and revert instantly—no code rollback required.&lt;/p&gt;

&lt;p&gt;🦺 Use GrowthBook's &lt;a href="https://docs.growthbook.io/features/safe-rollouts?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Safe Rollouts&lt;/a&gt; to monitor guardrail metrics and roll back automatically.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Stage 4: The A/B Test (Causal Proof)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The rollout survives. Now you run the real experiment: new model vs. baseline, measured on business metrics. Not "faithfulness" but retention. Not "relevance" but conversion. This is the only stage that proves value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: AI Evals plus A/B Testing for GenAI
&lt;/h2&gt;

&lt;p&gt;You cannot A/B test a broken model. It’s reckless. And you cannot Eval your way to product-market fit. It’s guesswork.&lt;/p&gt;

&lt;p&gt;To ship generative AI that's both safe and profitable, you need both: rigorous evals to ensure competence, and robust A/B testing to prove value. The pipeline between them—shadow mode, safe rollouts—is how you get from one to the other without breaking things.&lt;/p&gt;

&lt;p&gt;As Segato warned, our products can now fail in ways we never intended. This pipeline is how we catch those failures before users do.&lt;/p&gt;

&lt;p&gt;We've moved from &lt;em&gt;is it correct?&lt;/em&gt; to &lt;em&gt;is it good?&lt;/em&gt; Evals answer the first question. A/B tests answer the second. You need both.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can AI Evals replace A/B testing?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
No. AI Evals and A/B testing serve different purposes in the development lifecycle. Evals measure competence—accuracy, safety, tone—whether run offline or online. A/B testing measures business value through revealed user behavior: retention, revenue, conversion. Evals tell you the model works; A/B tests tell you it's worth shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between Offline and Online Evaluation?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Offline evaluation happens pre-deployment using a static Golden Dataset to check for regressions and quality. Online evaluation happens in production using live traffic (e.g., shadow mode). Both measure competence, but online evaluation catches issues—like feature staleness or latency spikes—that don't appear in controlled conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you handle latency when A/B testing LLMs?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Latency is a major confounding variable because "smarter" models are often slower. If a slower model performs worse, it's unclear if users disliked the answer or the wait time. To fix this, engineers use Artificial Latency Injection—intentionally slowing down the control group to match the variant's response time, isolating "intelligence" as the single variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is "Vibe Checking" in AI development?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
"Vibe checking" is the informal process of manually inspecting a few model outputs to see if they "feel" right. While useful for early exploration, it is unscalable and statistically flawed for production systems because it fails to account for edge cases, regressions, or large-scale user preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should I use a Multi-Armed Bandit instead of an A/B test?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use a Multi-Armed Bandit when your goal is optimization (maximizing reward) rather than knowledge (statistical significance). MABs are ideal for testing prompt variations or content recommendations because they automatically route traffic to the winning variation, minimizing regret. Use A/B tests for major architectural changes or risky launches where you need certainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best way to deploy AI models safely?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use a staged pipeline. Start with offline evals in CI/CD to catch regressions. Then use shadow mode to test against live traffic silently. Next, use feature flags to release to a small percentage of users while monitoring guardrails. Finally, run a full A/B test to measure business impact. Each stage filters out risk before exposing users to problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is LLM-as-Judge?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
LLM-as-Judge is an evaluation technique where a strong model (like GPT-4 or Claude) grades the outputs of your system on subjective criteria such as tone, helpfulness, and coherence. It scales better than human review but shares the same limitation as other evals: it measures whether an answer seems good, not whether users will act on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between stated and revealed preference in AI evaluation?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Stated preference is what users say they like—thumbs up ratings, side-by-side comparisons in a sandbox. Revealed preference is what users actually do—returning to your product, completing tasks, converting. Evals capture stated preference; A/B tests capture revealed preference. The two often diverge.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>experimentation</category>
    </item>
    <item>
      <title>Dark Patterns in A/B Testing: How Short-Term Optimization Leads to Product Enshittification</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Mon, 12 Jan 2026 22:07:30 +0000</pubDate>
      <link>https://dev.to/growthbook/dark-patterns-in-ab-testing-how-short-term-optimization-leads-to-product-enshittification-2p7i</link>
      <guid>https://dev.to/growthbook/dark-patterns-in-ab-testing-how-short-term-optimization-leads-to-product-enshittification-2p7i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj465e58ek5w52rdwfk8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj465e58ek5w52rdwfk8y.png" alt="Dark Patterns in A/B Testing: How Short-Term Optimization Leads to Product Enshittification" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why optimizing for short-term A/B test wins can degrade user trust and product quality. A look at common dark patterns in experimentation, why they “work,” and how better metrics can help teams build products that create real long-term value.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A post supposedly from a software engineer at a meal delivery company went &lt;a href="https://www.reddit.com/r/confession/comments/1q1mzej/im_a_developer_for_a_major_food_delivery_app_the/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;viral recently&lt;/u&gt;&lt;/a&gt;. It accused the unnamed company of unscrupulously manipulating pricing, fees, and salaries to increase revenue. One of the things it described was an A/B test on a “Priority delivery” fee. According to the post, no product changes were made to speed up priority deliveries; instead, regular deliveries were delayed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We actually ran an A/B test last year where we didn't speed up the priority orders, we just purposefully delayed non-priority orders by 5 to 10 minutes to make the Priority ones "feel" faster by comparison. Management loved the results. We generated millions in pure profit just by making the standard service worse, not by making the premium service better.” (Source: Reddit)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While there are questions about the post’s veracity, such dark patterns are absolutely used in A/B testing and product development. That raises an important question about the ethics of these techniques in experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Are Dark Patterns?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Dark patterns are product design or implementation choices that deliberately nudge, coerce, or mislead users into behaviors that primarily benefit the company. They often come at the expense of the user’s understanding or long-term satisfaction. &lt;/p&gt;

&lt;p&gt;For a comprehensive taxonomy, see &lt;a href="https://www.deceptive.design/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;deceptive.design&lt;/u&gt;&lt;/a&gt;, which catalogs these patterns in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Are Dark Patterns Used in A/B Testing?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the context of &lt;a href="https://blog.growthbook.io/what-is-a-b-testing/" rel="noopener noreferrer"&gt;&lt;u&gt;A/B testing&lt;/u&gt;&lt;/a&gt;, dark patterns typically appear when experiments are optimized narrowly for short-term business metrics, such as a conversion rate, without regard for whether the underlying change actually improves the product. Often they are introduced as a response to an organization’s goal metric that fails to capture the complete picture (see &lt;a href="https://blog.growthbook.io/goodharts-law-and-the-dangers-of-metric-selection-with-a-b-testing/" rel="noopener noreferrer"&gt;&lt;u&gt;Goodhart’s Law and the dangers of metric selection&lt;/u&gt;&lt;/a&gt;). &lt;/p&gt;

&lt;h3&gt;
  
  
  Common Dark Patterns Used in Experiments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Artificial degradation:&lt;/strong&gt; Making a baseline experience worse (for example, slowing delivery times as above, or adding friction) so that a paid tier or alternative appears more attractive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Obscured choice:&lt;/strong&gt; Designing UI variants that make it harder to opt out, cancel, or choose a lower-cost option, then validating them via A/B tests that show higher revenue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price obfuscation:&lt;/strong&gt; Experimenting with fees, surcharges, or defaults in ways that users only discover late in the funnel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emotional manipulation:&lt;/strong&gt; Leveraging urgency, guilt, or fear (“Only 2 left!”, “People like you choose…”) to drive behavior, then justifying it with statistically significant lifts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A/B testing itself is not the problem. The problem is using experimentation as a shield: “the data says it works” becomes a way to avoid asking whether the outcome is aligned with user value or long-term trust. It hides the real question of whether we should do this at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Short-Term Wins, Long-Term Costs of Unethical Experimentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Dark patterns can look good in the short term. They are engineered to do so. Revenue goes up, conversion improves, and dashboards turn green. These tactics exploit the goodwill of your current user base and the blind spots of long-term measurement, producing lifts that show up immediately. The costs, however, tend to be delayed and externalized.&lt;/p&gt;

&lt;p&gt;Dark patterns in A/B testing introduce several long-term risks for organizations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reputational Risk&lt;/strong&gt;
Users are not irrational. They may not always articulate why they are unhappy, but they notice when a product feels hostile, manipulative, or designed to nickel-and-dime them. Trust erodes quietly, and then suddenly. When stories like the viral post above surface (whether accurate or not), they resonate precisely because users already suspect this behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legislative and Regulatory Risk&lt;/strong&gt;
Many dark patterns operate in gray areas that are increasingly of interest to regulators. Fee transparency, deceptive defaults, and coercive UX are now explicitly called out in regulations in multiple jurisdictions (see the EU’s Digital Services Act (DSA) and the California Privacy Rights Act (CPRA)). An A/B test that boosts revenue today can become legal exposure tomorrow, complete with internal documentation showing intent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal and Cultural Risk&lt;/strong&gt;
Engineers, designers, and PMs generally want to build products that help people. When teams are repeatedly asked to ship features that intentionally worsen user experience, morale suffers. The best people notice. Over time, this can lead to disengagement or attrition, especially among senior contributors who have other options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk from Competition&lt;/strong&gt;
Dark patterns that don’t improve the product open the door, over the long term, for competitors to build a genuinely better product and put your company at risk.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words, dark patterns trade long-term value for short-term gains. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Practical Solutions to Avoid Dark Patterns in Experimentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are some practical ways to help reduce these risks and avoid the enshittification of products. Chief among these are adopting value principles and establishing ethics committees. &lt;/p&gt;

&lt;p&gt;Value principles, like Google’s “Don’t be evil”, are frequently treated as aspirational marketing artifacts rather than operational constraints. Many are vague, non-actionable, and open to interpretation, which provides no meaningful protection against dark patterns. And even if they are actionable and adopted as policy, they can come into tension with other incentives at the company, such as bonuses or career progression. Google, after all, ditched “Don’t be evil” in 2018.&lt;/p&gt;

&lt;p&gt;Ethics committees are used at some larger companies to ensure consistent application of company values. However, they can face the same issues as the values above, particularly if the company is facing financial pressure; the ethics team can be high on the &lt;a href="https://arstechnica.com/tech-policy/2023/03/amid-bing-chat-controversy-microsoft-cut-an-ai-ethics-team-report-says/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;list of cuts&lt;/u&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The most practical way to avoid dark patterns is not an ethics committee or a vague principle statement; &lt;strong&gt;it is using the right metrics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you only measure immediate revenue or conversion, you will eventually design experiments that extract value rather than create it. To counteract this, teams need to deliberately include metrics that reflect longer-term outcomes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimentation Metrics That Help Avoid Dark Patterns
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.growthbook.io/experiment-metrics-simplified-retention-count-distinct-max/" rel="noopener noreferrer"&gt;&lt;u&gt;Retention&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Repeat usage&lt;/li&gt;
&lt;li&gt;Complaint rates&lt;/li&gt;
&lt;li&gt;Refunds&lt;/li&gt;
&lt;li&gt;Customer support contacts&lt;/li&gt;
&lt;li&gt;Brand sentiment&lt;/li&gt;
&lt;li&gt;Qualitative feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not all of these can be measured perfectly, or measured at all (like the likelihood or cost of losing key employees). In the real world, the data will never be perfect. Good product judgment will still be required, as there will always be uncertainty. An experiment that produces a short-term lift but plausibly damages trust should be treated with skepticism, even if the lift is large.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When Experimentation Leads to a Better Product&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ultimately, the goal of &lt;a href="https://www.growthbook.io/products/experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;experimentation&lt;/u&gt;&lt;/a&gt; is not to prove that you can move a number. It is to learn how to make something people genuinely want. A/B testing is a powerful tool in the service of that goal, but the further you drift from it, the more your “wins” become signals of underlying enshittification rather than progress. Make sure your metrics reflect your real goals as much as possible.  &lt;/p&gt;

&lt;p&gt;In the long run, the most effective optimization strategy remains the simplest: make the product better.&lt;/p&gt;

</description>
      <category>abtesting</category>
      <category>experimentation</category>
    </item>
    <item>
      <title>7 Steps to Better Experiment Design</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Mon, 22 Dec 2025 05:40:30 +0000</pubDate>
      <link>https://dev.to/growthbook/7-steps-to-better-experiment-design-5fnf</link>
      <guid>https://dev.to/growthbook/7-steps-to-better-experiment-design-5fnf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jol7quy1oh36vgrxmzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jol7quy1oh36vgrxmzx.png" alt="7 Steps to Better Experiment Design" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A practical checklist for running A/B tests you can trust&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From predictive model accuracy at &lt;a href="https://facebook.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Facebook&lt;/strong&gt;&lt;/a&gt; and experiment design at &lt;a href="https://x.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;X&lt;/a&gt; (formerly Twitter), to building the experimentation platform used by Dropbox, Sony, and Upstart with &lt;a href="https://growthbook.io/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;GrowthBook&lt;/strong&gt;&lt;/a&gt;, I've spent the last six years shaping how some of the largest tech companies measure success and ship features.&lt;/p&gt;

&lt;p&gt;Across companies, industries, and scales, I’ve seen the same pattern repeat: experimentation rarely fails because teams don’t understand A/B testing mechanics. It fails because experiments are poorly designed—unclear goals, misaligned metrics, weak baselines, flawed randomization, or decisions made without a plan for ambiguous results.&lt;/p&gt;

&lt;p&gt;The teams that get the most value from experimentation aren’t running more tests. They’re running &lt;strong&gt;better ones&lt;/strong&gt;. They’re deliberate about what they’re trying to learn and disciplined about how results turn into decisions.&lt;/p&gt;

&lt;p&gt;This article distills the &lt;strong&gt;most reliable experiment design practices I’ve learned from years of work in the field&lt;/strong&gt;. If you already know how A/B testing works and want results you can trust—and act on—these seven steps are a strong place to start.&lt;/p&gt;

&lt;p&gt;(For a deeper technical walkthrough, see GrowthBook’s &lt;a href="https://docs.growthbook.io/using/experimentation-best-practices?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Experimentation Best Practices&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Define the Goal Clearly&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every experiment should answer a specific question.&lt;/p&gt;

&lt;p&gt;Start by writing down the problem you’re trying to solve in plain language. Is it activation? Retention? Conversion efficiency?&lt;/p&gt;

&lt;p&gt;A good test of clarity is whether you can write a concrete hypothesis, such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Users who complete the new onboarding flow will reach the activation milestone 10% more often than users in the existing flow.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Clear goals prevent experiments from drifting into vague “did anything change?” territory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; Teams at &lt;a href="https://www.growthbook.io/customers/dropbox?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Dropbox&lt;/strong&gt;&lt;/a&gt; use tightly framed hypotheses to avoid shipping changes that move surface-level engagement but fail to improve long-term collaboration or retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Choose the Right Success Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the goal is clear, metrics follow.&lt;/p&gt;

&lt;p&gt;Every experiment should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One primary metric&lt;/strong&gt; that defines success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A set of secondary metrics&lt;/strong&gt; for context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail metrics&lt;/strong&gt; to catch unintended harm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Focusing on too many metrics creates confusion. Tracking too few hides important tradeoffs—especially when multiple metrics are evaluated simultaneously (see GrowthBook’s guidance on &lt;a href="https://docs.growthbook.io/statistics/multiple-corrections?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;multiple testing corrections&lt;/a&gt;).  &lt;/p&gt;

&lt;p&gt;Use your secondary metrics to improve your understanding of what drives your primary metric. They also help you periodically check in on your primary metric, ensuring it is well defined and driving you toward your business goals.&lt;/p&gt;

&lt;p&gt;Teams at &lt;a href="https://www.growthbook.io/customers/khan-academy?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Khan Academy&lt;/strong&gt;&lt;/a&gt; use experimentation to iterate on learning experiences while remaining deeply thoughtful about how success is measured in an educational context.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Know Your Baseline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can’t interpret change without knowing where you started.&lt;/p&gt;

&lt;p&gt;Before launching an experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand current performance&lt;/li&gt;
&lt;li&gt;Measure normal variance&lt;/li&gt;
&lt;li&gt;Calibrate expectations for realistic lift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A change from 4% to 5% conversion is only meaningful if you know how stable 4% really is.&lt;/p&gt;
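
&lt;p&gt;A quick way to see why: the week-to-week noise in a 4% baseline depends entirely on sample size. A sketch using the normal approximation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from math import sqrt

def weekly_range(p, n_per_week):
    # Approximate 95% band for one week's conversion-rate readout.
    se = sqrt(p * (1 - p) / n_per_week)
    return p - 1.96 * se, p + 1.96 * se

print(weekly_range(0.04, 2_000))   # ~3.1%..4.9%: a "5% week" means little
print(weekly_range(0.04, 50_000))  # ~3.8%..4.2%: now 5% is a real move
&lt;/code&gt;&lt;/pre&gt;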

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; One GrowthBook customer—a large European marketplace—moved away from before-and-after analysis after realizing they couldn’t separate real lift from seasonality. Establishing proper baselines made results interpretable and decisions easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Understand Leading vs. Lagging Indicators&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not all metrics respond at the same speed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leading indicators&lt;/strong&gt; provide fast feedback and are often better suited for short-term experiments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lagging indicators&lt;/strong&gt; validate long-term impact and strategic alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High-performing teams use both, but they’re intentional about which metric actually determines success.&lt;/p&gt;

&lt;p&gt;Optimizing only for lagging indicators slows learning. Ignoring them risks local optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Define the Experiment Population and Randomization Strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Decide who should be included in the experiment—and exclude everyone else.&lt;/p&gt;

&lt;p&gt;Best practices include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Randomizing users as close to the experience as possible&lt;/li&gt;
&lt;li&gt;Ensuring assignment persists across sessions&lt;/li&gt;
&lt;li&gt;Using a true control group&lt;/li&gt;
&lt;li&gt;Keeping designs simple when traffic is limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t have enough users, avoid multi-variant tests.&lt;/p&gt;
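
&lt;p&gt;To make “assignment persists across sessions” concrete, here’s a minimal sketch of deterministic, hash-based bucketing keyed on a stable user ID. Experimentation SDKs (GrowthBook’s included) do something similar internally, though the exact hashing differs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

VARIANTS = (("control", 0.5), ("treatment", 0.5))

def assign_variant(user_id, experiment_key, variants=VARIANTS):
    """Same user + same experiment = same variant, in every session."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # uniform-ish float in [0, 1]
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if bucket &lt;= cumulative:
            return name
    return variants[-1][0]

# Deterministic: re-running yields the same assignment.
assert assign_variant("user-42", "new-onboarding") == assign_variant("user-42", "new-onboarding")
&lt;/code&gt;&lt;/pre&gt;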

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; One GrowthBook customer, a major European retailer, was running underpowered tests. They moved from partial traffic to testing on 100% of visitors—dramatically reducing time to confidence and revealing insights that challenged long-held assumptions.&lt;/p&gt;

&lt;p&gt;If you’re using feature flags to control exposure, GrowthBook’s approach to &lt;a href="https://docs.growthbook.io/feature-flag-experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;running experiments with feature flags&lt;/a&gt; is designed specifically for this kind of setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Validate Your Setup Before You Trust Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can’t analyze what you can’t connect.&lt;/p&gt;

&lt;p&gt;Before launching real experiments, confirm that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposure data joins cleanly with outcome data&lt;/li&gt;
&lt;li&gt;Identifiers are consistent&lt;/li&gt;
&lt;li&gt;Metrics are computed correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then run an &lt;strong&gt;A/A test&lt;/strong&gt;—two identical variants with no visible change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; Teams operating at scale use A/A tests to catch instrumentation and analysis issues early. If multiple uncorrelated metrics “win” in a no-change test, or multiple A/A tests fail with clear issues, something is broken. GrowthBook strongly recommends this as a validation step (&lt;a href="https://docs.growthbook.io/kb/experiments/aa-tests?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;A/A testing documentation&lt;/a&gt;).&lt;/p&gt;
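
&lt;p&gt;Below is a minimal sketch of that check as a two-proportion z-test (assumes scipy; the counts are invented). In an A/A test both arms are identical, so at a 5% significance level roughly 1 metric in 20 should “win” by chance alone; much more than that suggests a broken setup:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from math import sqrt
from scipy.stats import norm

# Invented A/A counts: both arms saw the identical experience.
n_a, conv_a = 50_000, 2_510
n_b, conv_b = 50_120, 2_430

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))                  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.3f}")
# A tiny p-value here, with no real change shipped, points to broken
# assignment, duplicated exposures, or a bad join -- not a real effect.
&lt;/code&gt;&lt;/pre&gt;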

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Decide How Long to Run the Experiment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ending experiments early increases false positives. Letting them run forever slows learning.&lt;/p&gt;

&lt;p&gt;Plan duration in advance based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected variance&lt;/li&gt;
&lt;li&gt;Minimum detectable effect&lt;/li&gt;
&lt;li&gt;Available traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need flexibility, approaches like &lt;a href="https://docs.growthbook.io/statistics/sequential?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;sequential testing&lt;/a&gt; can help—but only if you understand the tradeoffs.&lt;/p&gt;
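
&lt;p&gt;Those three inputs translate directly into a required sample size, and from there into a duration. A closed-form sketch for a two-proportion test, with illustrative numbers (assumes scipy):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from math import ceil
from scipy.stats import norm

baseline = 0.04                     # current conversion rate
relative_mde = 0.10                 # minimum detectable effect: +10% relative lift
alpha, power = 0.05, 0.80

p1 = baseline
p2 = baseline * (1 + relative_mde)
z_alpha = norm.ppf(1 - alpha / 2)   # 1.96
z_beta = norm.ppf(power)            # 0.84

# Standard two-proportion sample-size formula, per variant.
n = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
users_per_day_per_variant = 4_000   # whatever your traffic split provides

print(f"{ceil(n):,} users per variant")
print(f"~{ceil(n / users_per_day_per_variant)} days at current traffic")
&lt;/code&gt;&lt;/pre&gt;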

&lt;h2&gt;
  
  
  &lt;strong&gt;Bonus: Plan for All Outcomes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Only &lt;strong&gt;10–30% of experiments produce a clear winner&lt;/strong&gt;. That’s normal.&lt;/p&gt;

&lt;p&gt;High-performing teams plan for this reality &lt;em&gt;before&lt;/em&gt; launching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-cost features may ship on directional evidence&lt;/li&gt;
&lt;li&gt;High-cost features require stronger confidence&lt;/li&gt;
&lt;li&gt;Neutral results still generate valuable learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Experiments aren’t always about maximizing win rates. In some cases, they prevent huge losses. In other cases, their primary value is learning about user behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Experimentation isn’t about proving you’re right. It’s about discovering what’s true.&lt;/p&gt;

&lt;p&gt;Every experiment—even a neutral one—teaches you something about your users and your assumptions. Teams that stay curious, document learnings, and iterate deliberately are the ones that compound results over time.&lt;/p&gt;

&lt;p&gt;That’s what turns experimentation into a real competitive advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;FAQ: Experimentation &amp;amp; A/B Testing in Practice&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do you decide whether an A/B test result is actionable?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When the results all point to the same decision, even when accounting for uncertainty. If you would ship even if the results were at the bottom end of the confidence intervals and you've collected a reasonable amount of data, ship!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are so many A/B test results inconclusive?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Because most product changes simply don’t meaningfully change behavior. Neutral results often reveal what users don’t care about, guiding better future experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long should an experiment run?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Long enough to reach sufficient statistical power—not until a metric looks good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should you ship a result that isn’t statistically significant?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For low-risk, low-cost changes with stable guardrails. High-risk features need stronger confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s the biggest mistake teams make with experimentation?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Treating experimentation as validation instead of learning.&lt;/p&gt;

</description>
      <category>abtesting</category>
      <category>datascience</category>
      <category>experimentation</category>
    </item>
    <item>
      <title>Announcing GrowthBook 4.2: Product Analytics &amp; Experimentation at Scale</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 11 Nov 2025 19:43:04 +0000</pubDate>
      <link>https://dev.to/growthbook/announcing-growthbook-42-product-analytics-experimentation-at-scale-4di4</link>
      <guid>https://dev.to/growthbook/announcing-growthbook-42-product-analytics-experimentation-at-scale-4di4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwxioxqhjz1dalcwtflv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwxioxqhjz1dalcwtflv.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At GrowthBook, our mission is to provide the insights you need to build better products that grow your business faster. With GrowthBook 4.2, we’ve added a beta version of GrowthBook &lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Product Analytics&lt;/strong&gt;&lt;/a&gt;. Now our users will have a single integrated platform for feature management, experimentation, and product analytics.&lt;/p&gt;

&lt;p&gt;In addition, we’ve continued to enhance the developer experience, making &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;&lt;strong&gt;experimentation at scale&lt;/strong&gt;&lt;/a&gt; and integration into any stack easier than ever. Finally, for companies seeking an alternative to Statsig, our &lt;a href="https://docs.growthbook.io/guide/migrate-from-statsig?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Statsig to GrowthBook Migration Kit&lt;/strong&gt;&lt;/a&gt; automates importing feature gates and dynamic configs while replacing Statsig SDKs with GrowthBook SDKs.&lt;/p&gt;

&lt;p&gt;Release 4.2 is available immediately to both our cloud and self-hosted users. Visit our &lt;a href="https://www.growthbook.io/pricing?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Pricing page&lt;/u&gt;&lt;/a&gt; for details about Starter, Pro, and Enterprise options. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GrowthBook Product Analytics (Beta)&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Adding &lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Product Analytics&lt;/a&gt; to the GrowthBook platform closes the loop for development. Now, you can go from feature management to experimentation to product analytics in a single tool. While in beta, Product Analytics will be available to all users.&lt;/p&gt;

&lt;p&gt;Turn your warehouse data and metrics into actionable product insights. Explore user behavior, share dashboards, and make smarter decisions about what to build next. With Product Analytics, you will be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build and share dashboards that combine graphs, pivot tables, and text&lt;/li&gt;
&lt;li&gt;Create custom charts and tables from any data in your warehouse&lt;/li&gt;
&lt;li&gt;Use GrowthBook SQL Explorer with our AI-powered text-to-SQL capabilities to query, aggregate, and group data&lt;/li&gt;
&lt;li&gt;Access any metric defined in GrowthBook and track its performance over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7duughip42uahjhlpsk.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Build charts with any data in your warehouse using SQL Explorer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtae5wapj7o8zyyz0l6z.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Analyze any metrics defined in GrowthBook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/product-analytics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcaczrm1pdt6x8w1z83f.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="315"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Slice and dice data with flexible pivot tables&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This Product Analytics beta provides a glimpse of what’s to come as GrowthBook develops more self-service tools for building, analyzing, and exploring all of your product data. Let us know what you think in our &lt;a href="https://join.slack.com/t/growthbookusers/shared_invite/zt-2xw8fu279-Y~hwnfCEf7WrEI9qScHURQ/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://docs.growthbook.io/guide/migrate-from-statsig?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Statsig to GrowthBook Migration Kit&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;With the OpenAI acquisition of Statsig, we saw a spike in interest in GrowthBook. Product teams looking for alternatives expressed concern about what would happen to their data. Others worried that the product might be discontinued or deprioritized. To make the transition from the acquired platform to an open-source alternative as effortless as possible, we created the &lt;a href="https://docs.growthbook.io/guide/migrate-from-statsig?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Statsig to GrowthBook Migration Kit&lt;/a&gt;, free for all users.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Statsig Importer&lt;/strong&gt; instantly copies over feature gates, dynamic configs, and segments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statsig Code Migration Tool&lt;/strong&gt; (powered by Claude Code) automatically replaces Statsig SDKs with GrowthBook SDKs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Enterprise Enhancements
&lt;/h2&gt;

&lt;p&gt;The 4.2 features below continue our investment in the developer experience that makes GrowthBook a top choice for product development teams with high-volume apps and advanced experimentation programs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;Metric Slices: Simplify Experiment Design&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;When users create experiments, they often want to look at metrics across common dimensions like product category or device type, which quickly multiplies the number of metrics they have to manage. &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;Metric slices&lt;/a&gt; solve this problem. Enable auto slices on a Fact Metric once, and GrowthBook automatically generates drill-down analyses for each dimension value across all experiments using that metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io#metric-slices" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlks5dki77n8f5lp463g.png" alt="Announcing GrowthBook 4.2: Product Analytics &amp;amp; Experimentation at Scale" width="800" height="473"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;View revenue per user metric by product category&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of creating separate “Orders” metrics for each product category or device type, you can enable &lt;em&gt;Auto Slices&lt;/em&gt; on those columns of a single metric, which means fewer redundant metrics, faster setup, and cleaner reporting.&lt;/p&gt;
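
&lt;p&gt;To illustrate the underlying idea (this is a generic sketch with invented data, not GrowthBook’s implementation), slicing amounts to grouping one metric definition by a dimension instead of cloning the metric per dimension value:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Invented fact table: one row per order, tagged with the experiment variant.
orders = pd.DataFrame({
    "variant":  ["control", "treatment"] * 3,
    "category": ["apparel", "apparel", "equipment", "equipment", "apparel", "equipment"],
    "revenue":  [30.0, 42.0, 120.0, 95.0, 18.0, 140.0],
})

# One metric definition ("average revenue"), automatically sliced by category:
slices = orders.groupby(["category", "variant"])["revenue"].mean().unstack()
print(slices)    # one drill-down row per category, zero duplicated metric definitions
&lt;/code&gt;&lt;/pre&gt;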

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/data-pipeline?ref=blog.growthbook.io#incremental-refresh-recommended" rel="noopener noreferrer"&gt;Incremental Refresh&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;We revamped our &lt;strong&gt;Data Pipeline Mode&lt;/strong&gt; to lower query costs and improve performance for long-running experiments and high-traffic apps. By storing intermediate results and incrementally refreshing them, we’ve seen users save up to &lt;strong&gt;85% in query costs&lt;/strong&gt;. This first version is available on BigQuery, Presto, and Trino. We’ll be adding support for more data warehouses based on customer demand.&lt;/p&gt;
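
&lt;p&gt;The general pattern behind savings like this, sketched generically (the idea, not GrowthBook’s pipeline code): persist intermediate aggregates, then fold in only rows newer than a watermark on each refresh instead of rescanning history:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import datetime as dt

# State persisted between refreshes: running aggregates plus a watermark.
state = {"users": 182_340, "conversions": 9_125,
         "watermark": dt.datetime(2025, 11, 1)}

def incremental_refresh(state, fetch_rows_since):
    """Fold only new rows into stored aggregates; cost scales with new data."""
    for row in fetch_rows_since(state["watermark"]):   # e.g., WHERE ts &gt; watermark
        state["users"] += 1
        state["conversions"] += row["converted"]
        state["watermark"] = max(state["watermark"], row["ts"])
    return state
&lt;/code&gt;&lt;/pre&gt;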

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/official-resources?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Official Metrics&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Many organizations rely on a trusted set of “&lt;a href="https://docs.growthbook.io/official-resources?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;official&lt;/a&gt;” metrics. GrowthBook now makes these easier to manage by letting admins mark and edit official metrics directly from the UI (previously API-only). This helps standardize measurement, reduce confusion, and promote consistency across teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/app/query-optimization?ref=blog.growthbook.io#sql-template-variables" rel="noopener noreferrer"&gt;New SQL Template Variables&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You can now access custom field values and phase data directly in your metric and experiment SQL, unlocking several use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuned query optimization using non-date partition keys&lt;/li&gt;
&lt;li&gt;Reuse of SQL definitions with minor tweaks per experiment&lt;/li&gt;
&lt;li&gt;More accurate joins between experiment exposure and phase data&lt;/li&gt;
&lt;/ul&gt;
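
&lt;p&gt;As a generic sketch of why template variables keep one SQL definition reusable (the placeholder names here are invented for illustration, not GrowthBook’s exact template syntax):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from string import Template

# One parameterized metric definition instead of near-duplicate copies.
METRIC_SQL = Template("""
SELECT user_id, SUM(amount) AS revenue
FROM orders
WHERE region = '$region'                  -- custom field value
  AND order_date BETWEEN '$phase_start'   -- experiment phase data
                     AND '$phase_end'
GROUP BY user_id
""")

print(METRIC_SQL.substitute(region="eu",
                            phase_start="2025-10-01",
                            phase_end="2025-11-01"))
&lt;/code&gt;&lt;/pre&gt;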

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/growthbook/growthbook/pull/4511?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Custom Validation Hooks&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;GrowthBook has always been flexible — and now it’s even more so. &lt;strong&gt;Self-hosted enterprise users&lt;/strong&gt; can write custom JavaScript validation hooks that run in secure V8 isolates. Use them to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require tags on feature flags&lt;/li&gt;
&lt;li&gt;Prevent targeting rules containing PII&lt;/li&gt;
&lt;li&gt;Enforce naming conventions or internal policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These hooks let teams automate governance without slowing down development.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.growthbook.io/self-host/remote-evaluation?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Edge Remote Eval&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Edge Remote Eval lets client-side SDKs offload feature flag evaluation to a backend server, preventing targeting logic from leaking to users. Previously, this required managing your own GrowthBook proxy servers. Now, you can deploy a &lt;strong&gt;Cloudflare Workers–based Remote Eval server&lt;/strong&gt; — a fast, low-cost, zero-maintenance alternative built on Cloudflare’s global infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality-of-Life Improvements
&lt;/h2&gt;

&lt;p&gt;Big thanks to all of our users who reported bugs, shared feedback, and contributed ideas to this release on &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/a&gt; or &lt;a href="https://growthbookusers.slack.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Slack&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many small improvements add up to a big boost in usability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster and more relevant search algorithm for features, metrics, and experiments &lt;/li&gt;
&lt;li&gt;Create feature rules in multiple environments at once&lt;/li&gt;
&lt;li&gt;Better column-type detection for BigQuery Fact Tables&lt;/li&gt;
&lt;li&gt;Add metric row filters based on Boolean columns&lt;/li&gt;
&lt;li&gt;Reduced webhook noise (no more notifications for unpublished drafts)&lt;/li&gt;
&lt;li&gt;Slack and Discord notifications now include more detailed change info&lt;/li&gt;
&lt;li&gt;Custom pre-launch checklist items can be scoped to specific projects&lt;/li&gt;
&lt;li&gt;Faster database schema browsing, even with hundreds of tables&lt;/li&gt;
&lt;li&gt;New setting to disable legacy metrics for smoother transition to Fact Tables&lt;/li&gt;
&lt;li&gt;Sortable experiment results tables — quickly see top or bottom performers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus dozens of smaller fixes and performance improvements.&lt;/p&gt;




&lt;h2&gt;
  
  
  2025: A Year of Rapid Innovation
&lt;/h2&gt;

&lt;p&gt;The 4.2 release is GrowthBook’s sixth major update in 2025, capping off what has easily been the biggest year of innovation in our company’s history. GrowthBook launched over &lt;a href="https://blog.growthbook.io/7-000-github-stars-top-open-source-platform/" rel="noopener noreferrer"&gt;&lt;strong&gt;45 new features&lt;/strong&gt;&lt;/a&gt; across four major themes in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation at Scale:&lt;/strong&gt; New metrics, templates, dashboards, and analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Management:&lt;/strong&gt; Safe rollouts and feature analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artificial Intelligence:&lt;/strong&gt; A new MCP server and embedded AI capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Experience:&lt;/strong&gt; Managed data warehouse, native Vercel integration, 13 updated SDKs, enhanced server-side rendering, and support for new CMSs and FerretDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re on the Starter plan ready for more advanced experimentation and analytics or a Pro user building a culture of experimentation, we’re ready to &lt;a href="https://www.growthbook.io/get-started?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;help you grow&lt;/u&gt;&lt;/a&gt;. We’re excited to see what you build — and how you use these new tools to learn faster.&lt;/p&gt;

</description>
      <category>releases</category>
      <category>cloud</category>
      <category>productanalytics</category>
      <category>performance</category>
    </item>
    <item>
      <title>7,000 Github Stars and Counting</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Thu, 30 Oct 2025 19:38:31 +0000</pubDate>
      <link>https://dev.to/growthbook/7000-github-stars-and-counting-1ghh</link>
      <guid>https://dev.to/growthbook/7000-github-stars-and-counting-1ghh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnvlnl72c72kiy7urs73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnvlnl72c72kiy7urs73.png" alt="7,000 Github Stars and Counting" width="720" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank You for Making GrowthBook the World’s Largest Open-Source Experimentation Platform&lt;/p&gt;

&lt;p&gt;GrowthBook passed 7,000 stars on &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/a&gt; this month thanks to you. Your support affirms our commitment to experimentation-led development and open-source transparency. We see you testing every day in the &lt;strong&gt;100 billion+ feature flag lookups&lt;/strong&gt; we handle and the &lt;strong&gt;2,600 organizations&lt;/strong&gt; actively using GrowthBook each month.&lt;/p&gt;

&lt;p&gt;To celebrate this milestone, let’s look back on how we’ve grown and ahead to where we’re going. Our goal is to help you go faster at scale. Let’s see how we do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9bd8oyq4qeqz1d83xv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9bd8oyq4qeqz1d83xv0.png" alt="7,000 Github Stars and Counting" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s New with GrowthBook in 2025?
&lt;/h2&gt;

&lt;p&gt;GrowthBook released more than 45 new features in our cloud and self-hosted experimentation platform in 4 key areas: data exploration, developer experience, advanced experimentation, and improving the experiment lifecycle with AI. As an engineering-first company, we believe that experiments should be easy and cheap to run so you can learn constantly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Better Data Exploration
&lt;/h2&gt;

&lt;p&gt;What good is an experiment if you can’t easily analyze the results? GrowthBook provides full transparency by exposing the underlying SQL for your experiments. But we know you wanted more ways to explore your data, debug issues, and create custom reports and visualizations without the context switching. Now you can explore your data and build custom dashboards. &lt;/p&gt;

&lt;p&gt;Complexity happens fast when it comes to data analysis across teams and departments. &lt;a href="https://docs.growthbook.io/app/metrics?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Metric slices&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; give everyone flexibility without complexity. For example, instead of separate revenue metrics for each product type, you can use metric slices to automatically generate distinct revenue metrics for each product type (such as “apparel” or “equipment”). Teams benefit from more granular and relevant analysis without duplicating definitions. Everyone stays on the same page. &lt;/p&gt;

&lt;h2&gt;
  
  
  Accelerating Experimentation Culture
&lt;/h2&gt;

&lt;p&gt;Why do so many engineering teams build their own experimentation platforms? So they get exactly what they want. GrowthBook helps teams migrate from homegrown tools to an experimentation culture by giving developers that same control. Customizable dashboards and frameworks help more teams run more experiments, faster, and learn from the results.&lt;/p&gt;

&lt;p&gt;That’s why we developed &lt;a href="https://www.growthbook.io/products/experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;experiment dashboards&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;. Developers, data teams, and product managers create their own custom view to go deep on individual experiments. They get exactly what they need to highlight interesting results, hide the noise, and begin to tell a story with the data that everyone in the organization can understand. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Experiment Decision Framework&lt;/strong&gt; helps teams make systematic, consistent decisions about when and how to conclude experiments. GrowthBook’s default modes include “do no harm” and “clear signal” with the option to customize with your own rules so you can iterate quickly.&lt;/p&gt;

&lt;p&gt;For developers who want to skip setup of a data source for our warehouse native solution, we launched a &lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-4/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Managed Warehouse option&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;. Now, your team can go straight to feature management, experimentation, and product analytics without the data connection, cost, and refresh hassles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Experimentation
&lt;/h3&gt;

&lt;p&gt;The more experiments you run, the more advanced your experimentation program becomes. We believe that so many of you support GrowthBook because of the high bar we set for statistical rigor. We continued that commitment with features for sophisticated metrics, automated decision-making, and comprehensive measurement capabilities for high-frequency testing programs. Measure the long-term impact of changes and control outcomes with &lt;a href="https://blog.growthbook.io/holdouts-in-growthbook/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;holdouts&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://blog.growthbook.io/introducing-multi-armed-bandits-in-growthbook/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;multi-armed bandits&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, and &lt;a href="https://blog.growthbook.io/flavors-of-experimentation-in-growthbook/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;safe rollouts&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.growthbook.io/products/insights?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Insights&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, GrowthBook’s executive dashboard, you get a 10,000-foot view across all of your organization’s experiments to understand what you’ve done and what you’ve learned. Help your team go further, faster with learnings and experiment timelines, and explore metric effects and metric correlations. Filter by project and date range, and view win rate, scaled impact, and velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving the Experiment Lifecycle with AI
&lt;/h2&gt;

&lt;p&gt;It’s time to talk to your experimentation platform. The &lt;a href="https://www.growthbook.io/products/ai-mcp?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;MCP server&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; streamlines workflows and enables AI-powered automation and insights within your development environment. Connect to your favorite LLMs to manage feature flags, experiments, and other tasks without switching contexts. The MCP server works with Cursor, Claude, and VS Code, and it’s open source.&lt;/p&gt;

&lt;p&gt;We’ve also embedded AI into GrowthBook. You can use natural language questions to generate SQL. Your GrowthBook assistant helps you follow best practices by checking hypotheses, summarizing metric descriptions, generating experiment summaries, and comparing past experiments to avoid duplication. &lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead: The Future of Experimentation at GrowthBook
&lt;/h2&gt;

&lt;p&gt;We continue to be inspired by our GitHub stargazers, &lt;a href="https://growthbookusers.slack.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Slack community members&lt;/u&gt;&lt;/a&gt;, and all the experimenters out there, committed to making everything better. As we prepare for the year ahead, we’re looking at a few key themes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In this time of consolidation and disruption, data security and governance matter more than ever. Our warehouse-native approach allows you to keep your data in-house under your control.&lt;/li&gt;
&lt;li&gt;As AI-generated code becomes more pervasive, experimentation provides an essential check on whether code works and benefits the business.&lt;/li&gt;
&lt;li&gt;Fostering a culture of experimentation does more than draw the signal from the noise. It helps you fail sooner, in the smallest ways possible, so you can accelerate success. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's to the next 7,000 stars and beyond! If you haven't already, check out &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;GrowthBook on GitHub&lt;/u&gt;&lt;/a&gt;—we'd love to see what you experiment with next.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ready to join the experimentation revolution? Star us on&lt;/em&gt; &lt;a href="https://github.com/growthbook/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, join our&lt;/em&gt; &lt;a href="https://growthbookusers.slack.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;Slack community&lt;/u&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, or dive into the code. The future of product development is open, transparent, and data-driven. Let's build it together.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>news</category>
    </item>
    <item>
      <title>The Benchmarks Are Lying to You: Why You Should A/B Test Your AI</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 30 Sep 2025 16:16:40 +0000</pubDate>
      <link>https://dev.to/growthbook/the-benchmarks-are-lying-to-you-why-you-should-ab-test-your-ai-njn</link>
      <guid>https://dev.to/growthbook/the-benchmarks-are-lying-to-you-why-you-should-ab-test-your-ai-njn</guid>
      <description>&lt;h2&gt;
  
  
  Quick Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance varies by domain:&lt;/strong&gt; Models that ace benchmarks often fail on your specific use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trade-offs might not be real:&lt;/strong&gt; Faster, cheaper models might outperform expensive ones for your needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The best solution is rarely one model:&lt;/strong&gt; Most successful deployments use model portfolios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing quantifies what matters:&lt;/strong&gt; User completion rates, costs, and latency—not abstract scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyg1oyu4etxd8cs513l39.png" alt="The Benchmarks Are Lying to You: Why You Should A/B Test Your AI" width="800" height="450"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;OpenAI's GPT-5 (high) model scores 25% on the FrontierMath benchmark for expert-level mathematics. Claude Opus 4.1 only scores 7%. Based on these numbers alone, you might assume GPT-5 is clearly the superior choice for any application requiring mathematical reasoning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9c99hnuafptxuos1jza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9c99hnuafptxuos1jza.png" alt="The Benchmarks Are Lying to You: Why You Should A/B Test Your AI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this assumption illustrates a fundamental problem in AI evaluation, one that we in the experimentation space know quite well as Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The AI industry has turned benchmarks into targets, and now those benchmarks are failing us.&lt;/p&gt;

&lt;p&gt;When GPT-4 launched, it dominated every benchmark. Yet within weeks, engineering teams discovered that smaller, "inferior" models often outperformed it on specific production tasks—at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;Despite all the fanfare of the GPT-5 launch and its chart-topping coding benchmark scores, developers continued to prefer Anthropic's models and tooling for real-world usage. This disconnect between benchmark performance and production reality isn't an edge case. It's the norm.&lt;/p&gt;

&lt;p&gt;The market for LLMs is expanding rapidly—OpenAI, Anthropic, Google, Mistral, Meta, xAI and dozens of open-source options all compete for your attention. But the question isn't which model scores highest on benchmarks. It's which model actually works in your production environment, with your users, under your constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Benchmarks Fail in Production
&lt;/h2&gt;

&lt;p&gt;AI benchmarks are standardized tests designed to measure model performance—MMLU tests general knowledge, HumanEval measures coding ability, and FrontierMath evaluates mathematical reasoning. Every major model release leads with these scores.&lt;/p&gt;

&lt;p&gt;But these benchmarks fail in three critical ways that make them unreliable for production decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. They Don't Measure What Actually Matters&lt;/strong&gt; Benchmarks test surrogate tasks—simplified proxies that are easier to measure than actual performance. A model might excel at multiple-choice medical questions while failing to parse your actual clinical notes. It might ace standardized coding challenges while struggling with your company's specific codebase patterns. The benchmarks measure &lt;em&gt;something&lt;/em&gt;, just not real-world problem-solving ability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. They're Systematically Gamed&lt;/strong&gt; Data contamination lets models memorize benchmark datasets during training, achieving perfect scores on familiar questions while failing on slight variations. Worse, models are specifically optimized to excel at benchmark tasks—essentially teaching to the test. When your model has seen the answers beforehand, the test becomes meaningless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. They Ignore Production Reality&lt;/strong&gt; Benchmarks operate in a fantasy world without your constraints. Latency doesn't exist in benchmarks, but your multi-model chain takes 15+ seconds. Cost doesn't matter in benchmarks, but 10x price differences destroy unit economics. Your infrastructure has real memory limits. Your healthcare app can't hallucinate drug dosages.&lt;/p&gt;

&lt;p&gt;Consider this sobering statistic: 79% of ML papers claiming breakthrough performance used weak baselines to make their results look better. When researchers reran these comparisons fairly, the advantages often disappeared.&lt;/p&gt;

&lt;h2&gt;
  
  
  The A/B Testing Advantage: Finding What Actually Works
&lt;/h2&gt;

&lt;p&gt;So if benchmarks fail us, how do we actually select and optimize LLMs? Through the same methodology that transformed digital products: rigorous A/B testing with real users and real workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Portfolio Approach
&lt;/h3&gt;

&lt;p&gt;The first insight from production A/B testing contradicts everything vendors tell you: the optimal solution is rarely a single model.&lt;/p&gt;

&lt;p&gt;Successful deployments use a portfolio approach. Through testing, teams discover patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple queries handled by models that are fast, cheap, and good enough&lt;/li&gt;
&lt;li&gt;Complex reasoning routed to thinking models&lt;/li&gt;
&lt;li&gt;Domain-specific tasks sent to fine-tuned specialist models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take v0, Vercel's AI app builder. It uses a composite model architecture: a state-of-the-art model for new generations, a Quick Edit model for small changes, and an AutoFix model that checks outputs for errors.&lt;/p&gt;

&lt;p&gt;This dynamic selection approach can &lt;strong&gt;&lt;em&gt;slash costs by 80% while maintaining or improving quality&lt;/em&gt;&lt;/strong&gt;. But you'll only discover your optimal routing strategy through systematic testing.&lt;/p&gt;
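
&lt;p&gt;A portfolio router can start as a simple rules-plus-holdout layer. Here’s a hedged sketch (the model names and the &lt;code&gt;classify&lt;/code&gt; heuristic are placeholders, and the routing strategy itself is what you’d A/B test):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

ROUTES = {
    "simple":    "small-fast-model",        # cheap and good enough
    "reasoning": "frontier-model",          # expensive, for hard queries
    "domain":    "fine-tuned-specialist",   # e.g., your vertical's jargon
}

def classify(query):
    """Placeholder heuristic; real routers learn this from data."""
    if len(query.split()) &lt; 12:
        return "simple"
    if any(k in query.lower() for k in ("prove", "derive", "step by step")):
        return "reasoning"
    return "domain"

def pick_model(user_id, query):
    # Keep a hashed 10% of users on the incumbent model as a control,
    # so the routing strategy is measured rather than assumed.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest()[:4], 16) % 100
    if bucket &lt; 10:
        return "incumbent-model"
    return ROUTES[classify(query)]
&lt;/code&gt;&lt;/pre&gt;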

&lt;h3&gt;
  
  
  Metrics That Actually Drive Business Value
&lt;/h3&gt;

&lt;p&gt;Production A/B testing reveals the metrics that benchmarks completely miss:&lt;/p&gt;

&lt;p&gt;Performance Metrics That Matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task completion rate:&lt;/strong&gt; Do users actually accomplish their goals?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem resolution rate:&lt;/strong&gt; Are issues solved, or do users return?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regeneration requests:&lt;/strong&gt; How often is the first answer insufficient?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session depth:&lt;/strong&gt; Are simple tasks requiring multiple interactions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost and Efficiency Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokens per request:&lt;/strong&gt; Your actual API costs, not theoretical pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P95 latency:&lt;/strong&gt; How long your slowest users wait (the ones most likely to churn)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput limits:&lt;/strong&gt; Can you handle Black Friday or just Tuesday afternoon?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Counterintuitive insight:&lt;/strong&gt; If an LLM solves a user's question on the first try, you may see fewer follow-up prompts. That drop in "requests per session" is actually positive—your model is more effective, not less engaging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making A/B Testing Work for LLMs
&lt;/h3&gt;

&lt;p&gt;Testing LLMs requires adapting traditional experimental methods to handle their unique characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle the Randomness:&lt;/strong&gt; Unlike deterministic code, LLMs produce different outputs for the same prompt. This variance means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run tests longer than typical UI experiments&lt;/li&gt;
&lt;li&gt;Use larger sample sizes to achieve statistical significance&lt;/li&gt;
&lt;li&gt;Consider lowering temperature settings if consistency matters more than creativity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Isolate Your Variables:&lt;/strong&gt; Test one change at a time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model swap (GPT-5 → Claude Opus)&lt;/li&gt;
&lt;li&gt;Prompt refinement (shorter, more specific instructions)&lt;/li&gt;
&lt;li&gt;Parameter tuning (temperature, max tokens)&lt;/li&gt;
&lt;li&gt;Routing logic (which queries go to which model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this discipline, you can't attribute improvements to specific changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Smart Guardrails:&lt;/strong&gt; Layer guardrail metrics alongside your primary success metrics. An improvement in task completion that doubles costs might not be worth deploying. Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per successful interaction (not just cost per request)&lt;/li&gt;
&lt;li&gt;Safety violations that could trigger PR nightmares&lt;/li&gt;
&lt;li&gt;Latency thresholds that cause user abandonment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build Once, Test Forever:&lt;/strong&gt; Invest in infrastructure that makes testing sustainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized proxy service for LLM communications&lt;/li&gt;
&lt;li&gt;Automatic metric collection and monitoring&lt;/li&gt;
&lt;li&gt;Prompt versioning and management&lt;/li&gt;
&lt;li&gt;Response validation and safety checking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This investment pays off immediately—making tests easier to run and results more trustworthy.&lt;/p&gt;
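
&lt;p&gt;A minimal sketch of what that centralized proxy might record per call (the &lt;code&gt;llm_call&lt;/code&gt; argument and the metrics sink are stand-ins, not any specific library’s API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import uuid

def instrumented_call(llm_call, variant, prompt, metrics_sink):
    """Wrap every LLM call so each experiment variant logs identical metrics."""
    start = time.perf_counter()
    response = llm_call(prompt)                       # stand-in for your real client
    metrics_sink.append({
        "request_id": str(uuid.uuid4()),
        "variant": variant,                           # model/prompt/params under test
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "prompt_tokens": len(prompt.split()),         # crude proxy; use real token counts
        "output_chars": len(response),
    })
    return response

sink = []
instrumented_call(lambda p: p.upper(), "model-a", "hello there", sink)  # toy model
print(sink[0])
&lt;/code&gt;&lt;/pre&gt;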

&lt;h2&gt;
  
  
  Embrace Empiricism
&lt;/h2&gt;

&lt;p&gt;Benchmarks aren't entirely useless—use them for initial screening, understanding capability boundaries, and meeting regulatory minimums. But they should never be your final decision criterion.&lt;/p&gt;

&lt;p&gt;The AI industry's benchmark obsession has created a dangerous illusion. Models that dominate standardized tests struggle with real tasks. The metrics we celebrate have divorced from the outcomes we need.&lt;/p&gt;

&lt;p&gt;For teams building with LLMs, the path is clear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with hypotheses, not benchmarks:&lt;/strong&gt; "We believe Model X will improve task completion," not "Model X scores higher"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with real users and real data:&lt;/strong&gt; Your production environment is the only benchmark that matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure what moves your business:&lt;/strong&gt; User satisfaction, cost per outcome, and regulatory compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate based on evidence:&lt;/strong&gt; Let data, not vendor claims, drive your model selection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The benchmarks aren't exactly lying—they're just answering the wrong questions. A/B testing asks the right ones: Will this solve my users' problems? Can we afford it at scale? Does it meet our requirements?&lt;/p&gt;

&lt;p&gt;In the end, the best benchmark for your AI isn't a standardized test. It's users voting with their actions, costs staying within budget, and your application delivering real value.&lt;/p&gt;

&lt;p&gt;Everything else is just numbers on a leaderboard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.nature.com/articles/d41586-025-02462-5?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Is your AI benchmark lying to you?&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pub.towardsai.net/data-driven-llm-evaluation-with-statistical-testing-004b1561793f?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Data-Driven LLM Evaluation with Statistical Testing&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2305.05176?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.06559?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>abtesting</category>
    </item>
    <item>
      <title>The Definitive Technical Guide to Generative Engine Optimization (2025)</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 09 Sep 2025 23:09:32 +0000</pubDate>
      <link>https://dev.to/growthbook/the-definitive-technical-guide-to-generative-engine-optimization-2025-1gdk</link>
      <guid>https://dev.to/growthbook/the-definitive-technical-guide-to-generative-engine-optimization-2025-1gdk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohrraxgqkh6eav1fwvfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohrraxgqkh6eav1fwvfp.png" alt="The Definitive Technical Guide to Generative Engine Optimization (2025)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Has your search behavior changed? Maybe you ask ChatGPT to help diagnose a leaky faucet instead of scrolling through Google Ads. Or you use Claude to debug a type error instead of sifting through Stack Overflow.&lt;/p&gt;

&lt;p&gt;You aren’t alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2024-02-19-gartner-predicts-search-engine-volume-will-drop-25-percent-by-2026-due-to-ai-chatbots-and-other-virtual-agents?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Gartner predicts that by 2026 traditional search engine volume will drop 25%&lt;/a&gt;, losing out to AI chatbots and other virtual agents. In contrast, their traffic has surged, &lt;a href="https://onelittleweb.com/data-studies/ai-chatbots-vs-search-engines/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;growing 80.92% YoY&lt;/a&gt;. ChatGPT has 800 million weekly active users. And your customers are already there: ChatGPT now drives 10% of new signups for companies like Vercel.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ChatGPT now refers 10% of new &lt;a href="https://twitter.com/vercel?ref_src=twsrc%5Etfw&amp;amp;ref=blog.growthbook.io" rel="noopener noreferrer"&gt;@vercel&lt;/a&gt; signups, which have also accelerated &lt;a href="https://t.co/LzatDz8n8u?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;https://t.co/LzatDz8n8u&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Guillermo Rauch (&lt;a class="mentioned-user" href="https://dev.to/rauchg"&gt;@rauchg&lt;/a&gt;) &lt;a href="https://twitter.com/rauchg/status/1910093634445422639?ref_src=twsrc%5Etfw&amp;amp;ref=blog.growthbook.io" rel="noopener noreferrer"&gt;April 9, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But here's what most marketers don't understand: When someone asks ChatGPT about the best project management software, it doesn't show ten blue links. It synthesizes an answer from sources it trusts and gives one response. There's no page two. You're either in the answer or you're invisible 🫥&lt;/p&gt;

&lt;p&gt;This is the fundamental shift. We're moving from competing for rankings to competing for citations. From click-through rates to what &lt;a href="https://a16z.com/geo-over-seo/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;a16z calls "reference rates"&lt;/a&gt;—how often AI chooses to mention you.&lt;/p&gt;

&lt;p&gt;The companies seeing leads from AI platforms aren't using magic. They're using methods you can implement today. These methods are called &lt;strong&gt;GEO (Generative Engine Optimization)&lt;/strong&gt;, and this guide will explain exactly what it is and the practical steps you can take to ensure AI sends customers to you, not your competitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers of AI Visibility (And Why SEO Still Matters)
&lt;/h2&gt;

&lt;p&gt;So, now that we’re optimizing for generative engines (AI chatbots), we don’t have to do SEO anymore, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nx4fu1iroo1nws670l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nx4fu1iroo1nws670l6.png" alt="The Definitive Technical Guide to Generative Engine Optimization (2025)" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’d be great, but no. GEO doesn’t replace SEO—it adds a &lt;strong&gt;brutal new selection layer on top of it&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Pre-Training Knowledge
&lt;/h3&gt;

&lt;p&gt;This is what's "baked into" the model from training—everything it learned before its knowledge cutoff. AI models weren’t just trained on academic papers and Wikipedia. They gorged on Reddit, Stack Overflow, and forum discussions.&lt;/p&gt;

&lt;p&gt;You can't retroactively get into the model's training data. But you can position yourself for future training runs by establishing presence in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Community platforms:&lt;/strong&gt; Reddit, Stack Overflow, specialized forums&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authority sources:&lt;/strong&gt; Wikipedia, academic papers, industry publications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Established directories:&lt;/strong&gt; G2, Capterra, Clutch (First Page Sage found these critical)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 2: Search Layer - Real-Time Retrieval (RAG)
&lt;/h3&gt;

&lt;p&gt;When AI needs current information—today's weather, recent news, current prices—it searches the web. And here's the kicker: it uses traditional search engines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT searches Bing&lt;/strong&gt; (Microsoft partnership)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perplexity queries Google&lt;/strong&gt; (primarily)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini uses Google&lt;/strong&gt; (obviously)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is why traditional SEO still matters!&lt;/strong&gt; &lt;a href="https://firstpagesage.com/seo-blog/generative-engine-optimization-geo-strategy-guide/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;First Page Sage's research&lt;/a&gt; is crystal clear: "appearing in lists that rank highly in Google or Bing's organic search results made the biggest difference in earning a chatbot's recommendation."&lt;/p&gt;

&lt;p&gt;If you're not ranking, you ain’t even in the game. The AI can't cite what it can't find.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Selection Filter
&lt;/h3&gt;

&lt;p&gt;This is where GEO comes in. Even when AI tools find dozens of relevant results, they only cite a handful. They're making editorial decisions about what to reference.&lt;/p&gt;

&lt;p&gt;Researchers from Princeton, IIT Delhi, and other institutions, who published the &lt;a href="https://arxiv.org/pdf/2311.09735?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;landmark study that coined the term “Generative Engine Optimization”&lt;/a&gt;, demonstrate what makes content “selectable”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Statistics make you up to 33.9% more visible&lt;/strong&gt; - AI can't generate data, so it gravitates toward sources that provide it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expert quotes boost visibility up to 32%&lt;/strong&gt; - Direct quotations give AI something concrete to reference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear, fluent writing improves citation rates up to 30%&lt;/strong&gt; - If AI struggles to parse your content, it moves on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citations to authoritative sources add 30.3%&lt;/strong&gt; - Credibility signals matter more than ever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; You can rank #1 on Google, but if your content isn't optimized for AI selection, ChatGPT will cite your competitor instead. This means that top-ranking content will need to adapt to stay visible in the world of AI, but also that lower ranking pages have an opportunity to increase visibility by implementing the GEO methods explained below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GEO Playbook
&lt;/h2&gt;

&lt;p&gt;Here are the strategies you can use to become the darling of the chatbots.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Win the List Game
&lt;/h3&gt;

&lt;p&gt;Nearly every AI recommendation starts with a Google or Bing search for “best [category] tools” or “top [solution] companies.” But not all lists are created equal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparison tables that pit products against each other by features and price&lt;/li&gt;
&lt;li&gt;Lists subdivided by use case ("Best for Small Business," "Best for Enterprise")&lt;/li&gt;
&lt;li&gt;Recent timestamps (AI prefers freshness, favoring anything with "2024" or "2025")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The hack:&lt;/strong&gt; Create your own comparison articles that include your product, but be comprehensive and fair. AI can detect obvious bias.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Become Statistically Irresistible
&lt;/h3&gt;

&lt;p&gt;Remember: AI can't create data, only synthesize it. Chatbots are thirsty for original statistics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn customer data into industry insights ("73% of our users reduced churn by...")&lt;/li&gt;
&lt;li&gt;Commission surveys for unique data points&lt;/li&gt;
&lt;li&gt;Add specific numbers to every claim ("Most dentists" → "4 out of 5 dentists")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Create downloadable CSV files with your data. Perplexity and ChatGPT love citing sources that provide raw data access.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Just Write Real Good
&lt;/h3&gt;

&lt;p&gt;AI reads your content differently than humans. It chunks information, analyzes relationships, and looks for clear extractable statements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The formula:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One idea per paragraph (seriously, one)&lt;/li&gt;
&lt;li&gt;Front-load key points in the first sentence&lt;/li&gt;
&lt;li&gt;Use headers that directly answer questions&lt;/li&gt;
&lt;li&gt;Bold your most important statistics and conclusions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like optimizing for featured snippets, but for every paragraph on your page.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Build Your Entity Authority
&lt;/h3&gt;

&lt;p&gt;Here's what's wild: AI tracks brand mentions even without links. It's building a map of who's credible in each space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The playbook:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get mentioned alongside competitors in industry roundups&lt;/li&gt;
&lt;li&gt;Contribute to Reddit discussions in your space (AI ❤️ Reddit)&lt;/li&gt;
&lt;li&gt;Secure profiles in industry directories like G2, Capterra, or Clutch&lt;/li&gt;
&lt;li&gt;Maintain at least 3.5 stars on review platforms (this is table stakes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://backlinko.com/generative-engine-optimization-geo?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;As Backlinko discovered&lt;/a&gt;, these co-citations and co-occurrences are how AI understands where you fit in your market:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"AI systems don't just look at backlinks to understand your authority. They pay attention to every mention of your brand across the web, even when those mentions don't include a clickable link.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5. Go Multi-Modal (Because Text Isn't Enough)
&lt;/h3&gt;

&lt;p&gt;Google Lens processes 20 billion visual queries monthly. Voice searches are 3.7× more likely to be questions. &lt;a href="https://developer.tenten.co/multi-modal-content-for-ai-seo-the-definitive-2025-guide?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;The TenTen guide revealed that pages with multi-modal content see 67% more AI referrals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Essential additions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-res product images with descriptive filenames&lt;/li&gt;
&lt;li&gt;30-second video summaries with transcripts&lt;/li&gt;
&lt;li&gt;Schema markup for VideoObject, ImageObject, and Speakable (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Alt text written like mini-tweets (125 characters, naturally includes keywords)&lt;/li&gt;
&lt;/ul&gt;
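
&lt;p&gt;If you haven't worked with structured data before, here's a minimal sketch of VideoObject plus Speakable markup, built as a TypeScript object for readability. Every name, URL, and selector below is a placeholder; the shape follows the public schema.org definitions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative JSON-LD for a page with a 30-second summary video.
// All names and URLs are placeholders, not real endpoints.
const structuredData = {
  "@context": "https://schema.org",
  "@type": "WebPage",
  name: "Example product overview",
  speakable: {
    "@type": "SpeakableSpecification",
    // CSS selectors for the passages a voice assistant may read aloud
    cssSelector: [".article-summary", ".key-takeaway"],
  },
  video: {
    "@type": "VideoObject",
    name: "Example product in 30 seconds",
    description: "A 30-second summary of what the product does.",
    thumbnailUrl: "https://example.com/thumb.jpg",
    uploadDate: "2025-01-15",
    duration: "PT30S", // ISO 8601 duration: 30 seconds
    contentUrl: "https://example.com/video.mp4",
    transcript: "Full transcript text goes here...",
  },
};

// Serialize this and embed it in the page head inside a
// script tag with type "application/ld+json".
const jsonLd = JSON.stringify(structuredData);
&lt;/code&gt;&lt;/pre&gt;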

&lt;h2&gt;
  
  
  The Dark Traffic Problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.seerinteractive.com/insights/are-ai-sites-like-chatgpt-sending-your-website-traffic?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Here’s an inconvenient truth from Seer Interactive&lt;/a&gt;: Most AI traffic is invisible. It shows up as “Direct” in Google Analytics because AI doesn’t always pass referral data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can track:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT citations include &lt;code&gt;utm_source=chatgpt.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create a GA4 segment for AI platforms (&lt;a href="http://chatgpt.com/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;chatgpt.com&lt;/a&gt;, &lt;a href="http://perplexity.ai/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;perplexity.ai&lt;/a&gt;, &lt;a href="http://claude.ai/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;claude.ai&lt;/a&gt;); a regex sketch follows this list&lt;/li&gt;
&lt;li&gt;Monitor brand searches that spike after AI platform updates&lt;/li&gt;
&lt;/ul&gt;
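
&lt;p&gt;For that GA4 segment, one practical approach is a "session source matches regex" condition. A sketch of the pattern is below; the exact domain list is an assumption, so extend it as new AI referrers show up in your reports.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Candidate AI referrer domains for a GA4 "session source matches regex" condition.
// Illustrative list only; add or remove domains to match what you actually see.
const aiReferrerPattern =
  "chatgpt\\.com|perplexity\\.ai|claude\\.ai|gemini\\.google\\.com|copilot\\.microsoft\\.com";

// The same pattern as a JavaScript RegExp, e.g. for tagging rows in an export:
const aiReferrer = new RegExp(aiReferrerPattern, "i");

aiReferrer.test("chatgpt.com"); // true
aiReferrer.test("google.com");  // false
&lt;/code&gt;&lt;/pre&gt;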

&lt;p&gt;&lt;strong&gt;What you can't track:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT search results (no UTMs)&lt;/li&gt;
&lt;li&gt;Voice assistant references&lt;/li&gt;
&lt;li&gt;Most AI-generated summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The workaround:&lt;/strong&gt; Embed UTM parameters in your internal links. Yes, this used to be taboo, but GA4 doesn't create new sessions like Universal Analytics did. When AI scrapes your content, it might preserve these parameters.&lt;/p&gt;
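
&lt;p&gt;As a sketch of that workaround: a small helper that stamps internal links with UTM parameters. The parameter names are the conventional GA4 UTM keys; the values and the origin are placeholders you'd replace with your own.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Append UTM parameters to an internal link so the source context survives
// if an AI assistant lifts the URL out of your page.
function tagInternalLink(href: string, content: string): string {
  const url = new URL(href, "https://www.example.com"); // placeholder origin
  url.searchParams.set("utm_source", "internal");
  url.searchParams.set("utm_medium", "site");
  url.searchParams.set("utm_content", content); // e.g. which article linked here
  return url.toString();
}

tagInternalLink("/pricing", "geo-guide");
// "https://www.example.com/pricing?utm_source=internal&amp;amp;utm_medium=site&amp;amp;utm_content=geo-guide"
&lt;/code&gt;&lt;/pre&gt;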

&lt;h2&gt;
  
  
  Don’t Expect Too Much
&lt;/h2&gt;

&lt;p&gt;GEO isn’t magic. It’s not going to 10x your traffic overnight. What it will do is ensure you're not invisible when your customers ask AI for recommendations.&lt;/p&gt;

&lt;p&gt;The foundational GEO study by Aggarwal et al. showed that combining methods—using statistics plus improved fluency—can boost visibility by 35.8%. That's not a page-one ranking; that's being the trusted source AI chooses from page one.&lt;/p&gt;

&lt;p&gt;More importantly, this isn't optional. Your competitors are already doing this. They're showing up in ChatGPT answers. They're capturing that Vercel-style 10% of new signups from AI.&lt;/p&gt;

&lt;p&gt;While the AI slop torrent isn’t slowing down anytime soon, the methods suggested here for optimizing for AI—adding citations and stats, writing for fluency, engaging with the community—all make for better content.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens Now?
&lt;/h2&gt;

&lt;p&gt;We’re at an inflection point. Traditional search isn’t dying but becoming one channel among many. AI platforms are becoming the new front door to the internet.&lt;/p&gt;

&lt;p&gt;In two years, we'll look back at companies still doing SEO-only strategies the way we look at businesses that refused to build websites in 1999. They'll exist, but they'll be invisible to an entire generation of users who start every quest for information with "Hey ChatGPT..."&lt;/p&gt;

&lt;p&gt;The shift from keywords to concepts, from rankings to references, from pages to entities—it's already happening. The only question is whether you'll ride the wave or watch it pass.&lt;/p&gt;

&lt;p&gt;Because in this new world, there's no page two. There's only the answer. Make sure you're in it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to see if AI is already sending you traffic? Start by searching for your brand on ChatGPT, Perplexity, and Claude. If you don't like what you see—or worse, if you don't see anything—it's time to act.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Wed, 03 Sep 2025 18:46:30 +0000</pubDate>
      <link>https://dev.to/growthbook/holdouts-in-growthbook-the-gold-standard-for-measuring-cumulative-impact-2fmp</link>
      <guid>https://dev.to/growthbook/holdouts-in-growthbook-the-gold-standard-for-measuring-cumulative-impact-2fmp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbqhagdcqk5u9t1lhoro.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbqhagdcqk5u9t1lhoro.webp" alt="Holdouts in GrowthBook: The Gold Standard for Measuring Cumulative Impact" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many successful product teams iterate quickly, running simultaneous experiments and launching new features weekly. Measuring the overall effect of these tests is critical to understanding the team’s impact and to help set product direction. However, actually measuring this cumulative impact can be quite difficult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Holdouts&lt;/strong&gt; in GrowthBook provide a simple way to keep a true control group across multiple features and measure long-run cumulative impact. It’s the gold standard way to answer the question: &lt;strong&gt;“What did all of this shipping actually do to my key metric?”&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why holdouts matter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cumulative impact is important to measure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensuring that your experimentation program is helping you ship winning features and avoid losing features helps set your product direction. Knowing which teams are driving the most impact can help you understand what’s working and what isn’t. Teams successfully moving the needle may deserve more investment to continue driving team goals upward. If a team struggles to have a significant impact, they may have hit diminishing returns, they may need a new direction, or the product may have reached a certain level of maturity, making gains more difficult to achieve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cumulative impact is hard to measure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Looking at the overall trend in your goal metrics is not enough. Forces beyond your control or seasonality can dictate goal metric movements and can mislead you. With constant shipping across product teams, attributing lift to individual teams can be nearly impossible.&lt;/p&gt;

&lt;p&gt;Other approaches try to sum up the effect of individual experiments and apply some bias reduction, like &lt;a href="https://docs.growthbook.io/insights?ref=blog.growthbook.io#scaled-impact" rel="noopener noreferrer"&gt;the one in our own Insights section&lt;/a&gt;. Almost always, the &lt;strong&gt;summed individual impacts of experiments overstate the final effects&lt;/strong&gt; due to selection bias, diminishing returns over time, and cannibalizing interactions with other experiments. This isn’t just theoretical: &lt;a href="https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Airbnb&lt;/a&gt; documented how a naive sum overstates impact by 2x when compared with a holdout, and bias-corrected estimates still overstate impact by 1.3x.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Holdouts as the solution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A well-run holdout exposes a stable baseline of users to &lt;em&gt;none&lt;/em&gt; of your new features for a period of time, then compares them to the general population. Because a holdout can run for longer on a small percentage of traffic, you capture longer-run effects. Furthermore, it allows you to stack all of your features and experiments into one test, capturing cumulative and interactive effects. Finally, it uses the reliable statistics and inference provided by experiments to make holdouts the &lt;strong&gt;gold standard&lt;/strong&gt; for cumulative, long-run impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Holdouts work in GrowthBook
&lt;/h2&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Holdout group&lt;/strong&gt; : A small percentage of traffic (usually users) is diverted away from new features, experiments, and bandits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General population&lt;/strong&gt; : Everyone else—experimenting and shipping as usual. We then take a small subset of the general population to use as a measurement group to compare against the holdout group.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you launch new features and experiments, all new traffic is first checked for diversion to the holdout before seeing any new feature or experiment values.&lt;/p&gt;

&lt;p&gt;When an experiment goes live, the &lt;strong&gt;holdout group&lt;/strong&gt; is completely excluded while the &lt;strong&gt;general population&lt;/strong&gt; gets randomized into one condition or another. Once an experiment is shipped, all users in the &lt;strong&gt;general population&lt;/strong&gt; will receive the shipped variant.&lt;/p&gt;

&lt;p&gt;This means that the holdout measures &lt;strong&gt;the cumulative impact of using your product&lt;/strong&gt;, which includes all the false starts and the test period for the experiments that didn’t ship, because that is a true record of what actually happened in the past quarter.&lt;/p&gt;

&lt;p&gt;Only once the holdout is ended will users in the holdout group receive any shipped features.&lt;/p&gt;
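
&lt;p&gt;Conceptually, the assignment logic looks something like the sketch below. This is not GrowthBook's SDK code, just an illustration of the core idea: deterministically divert a slice of users into the holdout first, then randomize everyone else as usual.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { createHash } from "crypto";

// Map a user id to a stable number in [0, 1) via hashing.
// Deterministic hashing keeps each user in the same group across sessions.
function hashToUnit(userId: string, seed: string): number {
  const digest = createHash("sha256").update(seed + ":" + userId).digest();
  return digest.readUInt32BE(0) / 0x100000000;
}

type Assignment = "holdout" | "control" | "variant";

// Illustrative only: the holdout check happens before any experiment
// bucketing, so held-out users never see new features or variants.
function assign(userId: string, holdoutPercent: number): Assignment {
  if (hashToUnit(userId, "holdout") &amp;lt; holdoutPercent) return "holdout";
  // Everyone else is the general population: randomize as usual.
  return hashToUnit(userId, "experiment-1") &amp;lt; 0.5 ? "control" : "variant";
}
&lt;/code&gt;&lt;/pre&gt;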

&lt;h2&gt;
  
  
  Using your Holdout
&lt;/h2&gt;

&lt;p&gt;Facebook and Twitter product teams ran 6-month holdouts for all their features, withholding 5% or less of traffic, and then used the cumulative impact in reporting and to understand whether they had set their product direction correctly. They then released the holdout and started a new one for the next 6-month period.&lt;/p&gt;

&lt;p&gt;Other teams at Twitter also used long-run, low-traffic holdouts on a bundle of critical features to ensure they were continuing to provide value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Define the population size:&lt;/strong&gt; Pick a sample large enough to measure your cumulative impact, but beware that larger population sizes mean you will end up with less traffic for your day-to-day experiments and fewer users with the latest set of features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define the active period length (half to a full quarter)&lt;/strong&gt;: Pick a period long enough to accumulate some wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During the active period:&lt;/strong&gt; Ship normally. Keep adding experiments and launching features. The holdout quietly accumulates evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis period&lt;/strong&gt; &lt;strong&gt;(2–4 weeks)&lt;/strong&gt;: Freeze adding new changes, let effects settle, and compare cumulative impact with our automatic lookback windows applied to measure only the analysis period.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Product teams at Twitter would run a holdout for half a year, adding new features to the holdout over the course of 6 months. Then, they would use the following quarter to get a reliable, long-run measure of their cumulative impact.&lt;/p&gt;

&lt;p&gt;So, a year would look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timeframe&lt;/th&gt;
&lt;th&gt;Holdout Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h1-holdout&lt;/code&gt; (active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h1-holdout&lt;/code&gt; (active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h2-holdout&lt;/code&gt; (active); &lt;code&gt;h1-holdout&lt;/code&gt; (measurement only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;h2-holdout&lt;/code&gt; (active)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Tips &amp;amp; Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project-scope your Holdout:&lt;/strong&gt; If you want to measure the impact of a given team’s set of features, have that team work within one or more GrowthBook Projects and have the Holdout automatically apply to their features and experiments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be wary of the user experience&lt;/strong&gt; : A small group won’t see new features—keep the percentage small and the period finite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be ready to keep feature flags in code&lt;/strong&gt; : Holdouts require feature flags to stick around through the analysis period, so prepare your workflows for longer-lasting features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; : Favor durable outcomes (revenue, retention, engagement) and use lookbacks for clean analysis windows so that you only measure the impact once all experiments have had a chance to bed in.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://app.growthbook.io/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Create your first holdout&lt;/strong&gt; in the app&lt;/a&gt; ( &lt;strong&gt;Experiments&lt;/strong&gt; → &lt;strong&gt;Holdouts&lt;/strong&gt; ) and scope it to a project you want to measure impact within.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick 2–4 metrics&lt;/strong&gt; that your team is hoping to improve over the long run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read more about holdouts in our &lt;a href="https://docs.growthbook.io/kb/experiments/holdouts?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Knowledge Base&lt;/a&gt; and see &lt;a href="https://docs.growthbook.io/app/holdouts?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;our documentation&lt;/a&gt; to help run your first holdout.&lt;/p&gt;

</description>
      <category>experimentation</category>
    </item>
    <item>
      <title>What is A/B Testing? The Complete Guide to Data-Driven Decision Making</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Wed, 13 Aug 2025 20:28:14 +0000</pubDate>
      <link>https://dev.to/growthbook/what-is-ab-testing-the-complete-guide-to-data-driven-decision-making-2kkj</link>
      <guid>https://dev.to/growthbook/what-is-ab-testing-the-complete-guide-to-data-driven-decision-making-2kkj</guid>
      <description>&lt;h2&gt;
  
  
  The 30-Second Summary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zqq5x5eiryrm6oumnn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zqq5x5eiryrm6oumnn6.png" alt="What is A/B Testing? The Complete Guide to Data-Driven Decision Making" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A/B testing&lt;/strong&gt; (also called split testing) is a method of comparing two or more versions of a webpage, app feature, or marketing element to determine which performs better. You show version A (the control) to one group and version B (the variant) to another, then measure which drives better results for your business goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; A/B testing removes guesswork from decision-making, turning "we think" into "we know" based on actual user behavior and statistical evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; Done right, A/B testing can increase conversions without spending more on traffic, validate ideas before full implementation, and build a culture of continuous improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is A/B Testing?
&lt;/h2&gt;

&lt;p&gt;Imagine you're at a coffee shop debating whether to put your tip jar by the register or at the pickup counter. Instead of guessing, you try both locations on alternating days and measure which generates more tips. That's A/B testing in the physical world.&lt;/p&gt;

&lt;p&gt;In digital environments, A/B testing works by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Randomly splitting&lt;/strong&gt; your audience into two (or more) groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Showing different versions&lt;/strong&gt; of the same element to each group simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measuring the impact&lt;/strong&gt; on predetermined metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declaring a winner&lt;/strong&gt; based on statistical significance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementing the better version&lt;/strong&gt; for all users&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Critical Difference: Testing vs. Guessing
&lt;/h3&gt;

&lt;p&gt;Without A/B testing, decisions rely on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiPPO (Highest Paid Person's Opinion)&lt;/li&gt;
&lt;li&gt;Best practices that may not apply to your audience&lt;/li&gt;
&lt;li&gt;Assumptions about user behavior&lt;/li&gt;
&lt;li&gt;Competitor copying without context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With A/B testing, decisions are based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual user behavior from your specific audience&lt;/li&gt;
&lt;li&gt;Statistically validated results&lt;/li&gt;
&lt;li&gt;Measurable business impact&lt;/li&gt;
&lt;li&gt;Continuous learning about what works&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why A/B Testing is Essential in 2025
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Maximize Existing Traffic Value
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.businessofapps.com/marketplace/user-acquisition/research/user-acquisition-costs/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Traffic acquisition costs (TAC) have increased 222% since 2019&lt;/a&gt;. And, with the rise of generative AI, the usefulness of long-standing acquisition strategies are more uncertain than ever. A/B testing helps you extract more value from visitors you already have—often delivering higher ROI than acquiring new traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reduce Risk of Major Changes
&lt;/h3&gt;

&lt;p&gt;Instead of redesigning your entire site and hoping for the best, test changes incrementally. If something doesn't work, you've limited the damage to a small test group.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Resolve Internal Debates with Data
&lt;/h3&gt;

&lt;p&gt;Stop endless meetings debating what "might" work. Run a test, get data, make decisions. As one PM put it: "A/B testing turned our three-hour design debates into 30-minute data reviews."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Discover Surprising Insights
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://hbr.org/2017/09/the-surprising-power-of-online-experiments?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Microsoft found that changing their Bing homepage background from white to a slightly different shade generated $10 million in additional revenue&lt;/a&gt;. While you shouldn't expect $10 million revenue gains from A/B test (that's an outlier), these wins don't even become a possibility until you start testing for them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://erindoesthings.com/2024/07/15/microsoft-color-tweaks-conversion-gains/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Big conversion gains, small color tweaks - Erin Does Things&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the most impactful experiments involve improvements in color. Why? Because the right colors can mean the difference between a buy and a bounce.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Build Competitive Advantage
&lt;/h3&gt;

&lt;p&gt;While competitors guess, you know. &lt;a href="https://netflixtechblog.com/a-b-testing-and-beyond-improving-the-netflix-streaming-experience-with-experimentation-and-data-5b0ae9295bdf?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Netflix attributes much of its success to running thousands of tests annually, optimizing everything from thumbnails to recommendation algorithms&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can You Test? (Almost Everything)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Website Elements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Headlines and copy&lt;/strong&gt; : Different value propositions, tones, lengths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call-to-action buttons&lt;/strong&gt; : Color, size, text, placement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images and videos&lt;/strong&gt; : Product photos, hero images, background videos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forms&lt;/strong&gt; : Number of fields, field types, progressive disclosure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigation&lt;/strong&gt; : Menu structure, sticky headers, breadcrumbs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layout&lt;/strong&gt; : Single vs. multi-column, card vs. list view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing&lt;/strong&gt; : Display format, anchoring, bundling options&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social proof&lt;/strong&gt; : Testimonials, reviews, trust badges placement&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Beyond Websites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Email campaigns&lt;/strong&gt; : Subject lines, send times, content length&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile apps&lt;/strong&gt; : Onboarding flows, feature placement, notification timing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ads&lt;/strong&gt; : Creative, copy, targeting parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product features&lt;/strong&gt; : Functionality, user interface, defaults&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools&lt;/strong&gt; : Dashboard layouts, workflow steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Algorithms:&lt;/strong&gt; Recommendations, featured items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI:&lt;/strong&gt; Prompts, models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Science Behind A/B Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Statistical Foundations: Two Approaches
&lt;/h3&gt;

&lt;p&gt;Modern A/B testing platforms offer two statistical frameworks, each with distinct advantages:&lt;/p&gt;

&lt;h4&gt;
  
  
  Bayesian Statistics (Often the Default)
&lt;/h4&gt;

&lt;p&gt;Bayesian methods provide more intuitive results by expressing outcomes as probabilities rather than binary significant/not-significant decisions. Instead of p-values, you get statements like "there's a 95% chance variation B is better than A." (A numerical sketch of this calculation follows the list below.)&lt;/p&gt;

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows continuous monitoring without invalidating results&lt;/li&gt;
&lt;li&gt;Incorporates prior knowledge to avoid over-interpreting small samples&lt;/li&gt;
&lt;li&gt;Provides probability distributions showing the range of likely outcomes&lt;/li&gt;
&lt;li&gt;Calculates "risk" or expected loss if you choose the wrong variation&lt;/li&gt;
&lt;/ul&gt;
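
&lt;p&gt;To make the probability statement concrete, here's a minimal sketch of how "chance that B beats A" can be computed for a conversion metric, using Beta posteriors with a normal approximation (reasonable at typical sample sizes). This is an illustration, not GrowthBook's actual engine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Posterior for a conversion rate with a Beta(1, 1) prior is
// Beta(1 + conversions, 1 + non-conversions); at large n it is
// roughly normal with the mean and variance below.
function posterior(conversions: number, visitors: number) {
  const a = 1 + conversions;
  const b = 1 + visitors - conversions;
  const mean = a / (a + b);
  const variance = (a * b) / ((a + b) ** 2 * (a + b + 1));
  return { mean, variance };
}

// Standard normal CDF (Abramowitz-Stegun 26.2.17 approximation).
function normCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const p =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z &amp;gt;= 0 ? 1 - p : p;
}

// P(B beats A) under independent, approximately normal posteriors.
function probBbeatsA(convA: number, nA: number, convB: number, nB: number): number {
  const A = posterior(convA, nA);
  const B = posterior(convB, nB);
  const z = (B.mean - A.mean) / Math.sqrt(A.variance + B.variance);
  return normCdf(z);
}

probBbeatsA(300, 10000, 360, 10000); // about 0.99
&lt;/code&gt;&lt;/pre&gt;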

&lt;h4&gt;
  
  
  Frequentist Statistics (Traditional Approach)
&lt;/h4&gt;

&lt;p&gt;Frequentist methods use hypothesis testing with p-values and confidence intervals. This classical approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires predetermined sample sizes&lt;/li&gt;
&lt;li&gt;Uses statistical significance thresholds (typically 95%)&lt;/li&gt;
&lt;li&gt;Provides clear yes/no decisions based on p-values&lt;/li&gt;
&lt;li&gt;Is familiar to those with traditional statistics backgrounds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key concepts both approaches share:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Null Hypothesis (H₀):&lt;/strong&gt; The assumption that there's no difference between versions A and B&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Alternative Hypothesis (H₁):&lt;/strong&gt; Your prediction that version B will perform differently&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Statistical Significance/Confidence:&lt;/strong&gt; The certainty that results aren't due to chance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Statistical Power:&lt;/strong&gt; The probability of detecting a real difference when it exists (typically aim for 80%+)&lt;/p&gt;

&lt;p&gt;Many modern platforms like GrowthBook default to Bayesian but offer both engines, letting teams choose based on their preferences and expertise. Both approaches can utilize advanced techniques like CUPED for variance reduction and sequential testing for early stopping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Size: The Foundation of Reliable Tests
&lt;/h3&gt;

&lt;p&gt;You need enough data to trust your results. The required sample size depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Baseline conversion rate&lt;/strong&gt; : Your current performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum detectable effect (MDE)&lt;/strong&gt;: The smallest improvement you care about&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical significance threshold&lt;/strong&gt; : Usually 95%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical power&lt;/strong&gt; : Usually 80%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example calculation&lt;/strong&gt; (reproduced in the code sketch after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current conversion rate: 3%&lt;/li&gt;
&lt;li&gt;Want to detect: 20% relative improvement (to 3.6%)&lt;/li&gt;
&lt;li&gt;Required confidence: 95%&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result: ~14,000 visitors per variation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
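
&lt;p&gt;The arithmetic behind that number is a standard two-proportion power calculation; here's a minimal sketch that reproduces it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Per-variation sample size for comparing two conversion rates
// (two-sided z-test at 95% confidence, 80% power).
function sampleSizePerVariation(baseline: number, relativeLift: number): number {
  const p1 = baseline;                      // e.g. 0.03
  const p2 = baseline * (1 + relativeLift); // e.g. 0.036 for a 20% lift
  const pBar = (p1 + p2) / 2;
  const zAlpha = 1.96;  // two-sided alpha = 0.05
  const zBeta = 0.8416; // power = 0.80
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p2 - p1) ** 2);
}

sampleSizePerVariation(0.03, 0.2); // about 13,900 per variation
&lt;/code&gt;&lt;/pre&gt;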

&lt;p&gt;&lt;strong&gt;Tools to help:&lt;/strong&gt; Most A/B testing platforms include built-in power calculators and sample size estimators. These tools eliminate guesswork by automatically calculating the visitors needed based on your specific metrics and goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Danger of Peeking
&lt;/h3&gt;

&lt;p&gt;Checking results before reaching sample size is like judging a marathon at the 5-mile mark. Early results fluctuate wildly and often reverse completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why peeking misleads:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small samples amplify random variation&lt;/li&gt;
&lt;li&gt;Winners and losers often swap positions multiple times&lt;/li&gt;
&lt;li&gt;"Regression to the mean" causes early extremes to normalize&lt;/li&gt;
&lt;li&gt;Each peek increases your false positive rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Set your sample size, run the test to completion, then analyze. No exceptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Step-by-Step A/B Testing Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Research and Identify Opportunities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Start with data, not opinions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze your analytics for high-traffic, high-impact pages&lt;/li&gt;
&lt;li&gt;Review heatmaps and session recordings&lt;/li&gt;
&lt;li&gt;Collect customer feedback and support tickets&lt;/li&gt;
&lt;li&gt;Run user surveys about friction points&lt;/li&gt;
&lt;li&gt;Audit your conversion funnel for drop-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prioritize using ICE scoring&lt;/strong&gt; (a small scoring sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt; : How much could this improve key metrics?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt; : How sure are you it will work?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease&lt;/strong&gt; : How simple is it to implement?&lt;/li&gt;
&lt;/ul&gt;
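
&lt;p&gt;A minimal sketch of ICE scoring in code, assuming the common convention of rating each dimension 1–10 and ranking by the product (some teams average instead):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// ICE score: rate each idea 1-10 on impact, confidence, and ease,
// then rank the backlog by the product of the three.
interface TestIdea {
  name: string;
  impact: number;
  confidence: number;
  ease: number;
}

function iceScore(idea: TestIdea): number {
  return idea.impact * idea.confidence * idea.ease;
}

const backlog: TestIdea[] = [
  { name: "Simplify checkout form", impact: 8, confidence: 6, ease: 7 },
  { name: "New hero headline", impact: 5, confidence: 7, ease: 9 },
];

backlog.sort((a, b) =&amp;gt; iceScore(b) - iceScore(a)); // highest score first
&lt;/code&gt;&lt;/pre&gt;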

&lt;h3&gt;
  
  
  Step 2: Form a Strong Hypothesis
&lt;/h3&gt;

&lt;p&gt;Weak hypothesis: "Let's try a green button"&lt;/p&gt;

&lt;p&gt;Strong hypothesis: "By changing our CTA button from gray to green (change), we will increase contrast and draw more attention (reasoning), resulting in a 15% increase in click-through rate (predicted outcome) as measured over 14 days with 95% confidence (measurement criteria)."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis framework:&lt;/strong&gt; "By [specific change], we expect [specific metric] to [increase/decrease] by [amount] because [reasoning based on research]."&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Design Your Test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test one variable at a time (multiple changes = unclear results)&lt;/li&gt;
&lt;li&gt;Ensure equal, random traffic distribution&lt;/li&gt;
&lt;li&gt;Keep everything else identical between versions&lt;/li&gt;
&lt;li&gt;Consider mobile vs. desktop separately&lt;/li&gt;
&lt;li&gt;Account for different user segments if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality assurance checklist (guardrails):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both versions load at the same speed&lt;/li&gt;
&lt;li&gt;Tracking is properly implemented&lt;/li&gt;
&lt;li&gt;Test works across all browsers&lt;/li&gt;
&lt;li&gt;Mobile experience is preserved&lt;/li&gt;
&lt;li&gt;No flickering or layout shifts&lt;/li&gt;
&lt;li&gt;Forms and CTAs function correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Calculate Required Sample Size
&lt;/h3&gt;

&lt;p&gt;Never start without knowing your endpoint. Most modern A/B testing platforms include power calculators that do the heavy lifting for you.&lt;/p&gt;

&lt;p&gt;Input these parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current conversion rate (from your analytics)&lt;/li&gt;
&lt;li&gt;Minimum improvement worth detecting (be realistic)&lt;/li&gt;
&lt;li&gt;Significance level (typically 95%)&lt;/li&gt;
&lt;li&gt;Statistical power (typically 80%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform will calculate exactly how many visitors you need per variation. This removes the guesswork and ensures your test has enough power to detect meaningful differences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time consideration:&lt;/strong&gt; Run tests for at least one full business cycle (usually 1-2 weeks minimum) to account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weekday vs. weekend behavior&lt;/li&gt;
&lt;li&gt;Beginning vs. end of month patterns&lt;/li&gt;
&lt;li&gt;External factors (news, weather, events)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: Launch and Monitor (Without Peeking!)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Launch checklist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up your test in your A/B testing tool&lt;/li&gt;
&lt;li&gt;Configure goal tracking and secondary metrics&lt;/li&gt;
&lt;li&gt;Document test details in your testing log&lt;/li&gt;
&lt;li&gt;Set calendar reminder for test end date&lt;/li&gt;
&lt;li&gt;Resist the urge to check results early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor only for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical errors or bugs&lt;/li&gt;
&lt;li&gt;Extreme business impact (massive revenue loss)&lt;/li&gt;
&lt;li&gt;Sample ratio mismatch (uneven traffic split; a quick check is sketched below)&lt;/li&gt;
&lt;/ul&gt;
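
&lt;p&gt;Sample ratio mismatch is the one health check worth automating. A quick chi-square test tells you when an observed split deviates from the configured split by far more than chance, which usually means broken assignment or tracking. A minimal sketch for a two-variation, 50/50 test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Chi-square test for sample ratio mismatch on a two-arm, 50/50 split.
// Flags when the split is suspicious (p &amp;lt; 0.001, i.e. a chi-square
// statistic above 10.83 with one degree of freedom).
function hasSampleRatioMismatch(usersA: number, usersB: number): boolean {
  const total = usersA + usersB;
  const expected = total / 2;
  const chiSquare =
    (usersA - expected) ** 2 / expected +
    (usersB - expected) ** 2 / expected;
  return chiSquare &amp;gt; 10.83;
}

hasSampleRatioMismatch(5000, 5103); // false: within normal variation
hasSampleRatioMismatch(5000, 5500); // true: investigate assignment and tracking
&lt;/code&gt;&lt;/pre&gt;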

&lt;h3&gt;
  
  
  Step 6: Analyze Results Properly
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Beyond the winner/loser binary:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check statistical significance&lt;/strong&gt; (p-value &amp;lt; 0.05)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify sample size was reached&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for segment differences&lt;/strong&gt; (mobile vs. desktop, new vs. returning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze secondary metrics&lt;/strong&gt; (did conversions increase but quality decrease?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider practical significance&lt;/strong&gt; (is 0.1% lift worth implementing?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document learnings&lt;/strong&gt; regardless of outcome&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 7: Implement and Iterate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;If your variation wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement for 100% of traffic&lt;/li&gt;
&lt;li&gt;Monitor post-implementation performance&lt;/li&gt;
&lt;li&gt;Test iterations to maximize the improvement&lt;/li&gt;
&lt;li&gt;Apply learnings to similar pages/elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If your variation loses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is still valuable learning&lt;/li&gt;
&lt;li&gt;Analyze why your hypothesis was wrong&lt;/li&gt;
&lt;li&gt;Test the opposite approach&lt;/li&gt;
&lt;li&gt;Document insights for future tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If it's inconclusive:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You may need a larger sample size&lt;/li&gt;
&lt;li&gt;The difference might be too small to matter&lt;/li&gt;
&lt;li&gt;Test a bolder variation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced A/B Testing Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Sequential Testing
&lt;/h3&gt;

&lt;p&gt;Instead of a single A vs. B test, run A vs. B, then the winner vs. C, building improvements incrementally.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Bandit Testing
&lt;/h3&gt;

&lt;p&gt;Automatically shift more traffic to winning variations during the test, maximizing conversions while learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Personalization Layers
&lt;/h3&gt;

&lt;p&gt;Test different experiences for different segments (new vs. returning, mobile vs. desktop, geographic regions).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Full-Funnel Testing
&lt;/h3&gt;

&lt;p&gt;Don't just test for initial conversions—measure downstream impact on retention, lifetime value, and referrals.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Qualitative + Quantitative
&lt;/h3&gt;

&lt;p&gt;Combine A/B tests with user research to understand not just what works, but why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common A/B Testing Mistakes (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake #1: Testing Without Traffic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Running tests on pages with &amp;lt;1,000 visitors/week&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Focus on highest-traffic pages or make bolder changes that require smaller samples to detect&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #2: Stopping Tests at Significance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Ending tests as soon as p-value hits 0.05&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Predetermine sample size and duration; stick to it regardless of interim results&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #3: Ignoring Segment Differences
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Overall winner performs worse for valuable segments&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Always analyze results by key segments before implementing&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #4: Testing Tiny Changes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Button shade variations when the whole page needs work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Match change boldness to your traffic volume; small sites need bigger swings&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #5: One-Hit Wonders
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Running one test then moving on&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Create a testing culture with regular cadence and iteration&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #6: Significance Shopping
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Testing 20 metrics hoping one shows significance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Choose primary metric before starting; treat others as secondary insights&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #7: Seasonal Blindness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Testing during Black Friday, applying results year-round&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Note external factors; retest during normal periods&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #8: Technical Debt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Winner requires complex maintenance or breaks other features&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Consider implementation cost in your analysis&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #9: Learning Amnesia
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Not documenting or sharing test results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Maintain a testing knowledge base; share learnings broadly&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Testing Culture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Moving Beyond Individual Tests
&lt;/h3&gt;

&lt;p&gt;The real value of A/B testing isn't any single win—it's building an organization that makes decisions based on evidence rather than opinions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cultural pillars:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Democratize testing&lt;/strong&gt; : Enable anyone to propose tests (with proper review)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate learning&lt;/strong&gt; : Failed tests that teach are as valuable as winners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share broadly&lt;/strong&gt; : Make results visible across the organization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Think in probabilities&lt;/strong&gt; : Replace "I think" with "Let's test"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embrace iteration&lt;/strong&gt; : Every result leads to new questions&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Testing Program Maturity Model
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Sporadic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Occasional tests when someone remembers&lt;/li&gt;
&lt;li&gt;No formal process&lt;/li&gt;
&lt;li&gt;Results often ignored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Systematic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular testing cadence&lt;/li&gt;
&lt;li&gt;Basic documentation&lt;/li&gt;
&lt;li&gt;Some stakeholder buy-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Strategic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing roadmap aligned with business goals&lt;/li&gt;
&lt;li&gt;Cross-functional involvement&lt;/li&gt;
&lt;li&gt;Knowledge sharing practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Embedded&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing considered for every change&lt;/li&gt;
&lt;li&gt;Sophisticated segmentation and analysis&lt;/li&gt;
&lt;li&gt;Company-wide testing culture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 5: Optimized&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictive models guide testing&lt;/li&gt;
&lt;li&gt;Automated test generation&lt;/li&gt;
&lt;li&gt;Testing drives innovation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future of A/B Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI-Powered Testing
&lt;/h3&gt;

&lt;p&gt;Machine learning increasingly suggests what to test, predicts results, and automatically generates variations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Personalization
&lt;/h3&gt;

&lt;p&gt;Move beyond testing to delivering the optimal experience for each individual user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Inference
&lt;/h3&gt;

&lt;p&gt;Advanced statistical methods better isolate true cause-and-effect relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Channel Orchestration
&lt;/h3&gt;

&lt;p&gt;Test experiences across web, mobile, email, and offline touchpoints simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy-First Methods
&lt;/h3&gt;

&lt;p&gt;New approaches maintain testing capability while respecting user privacy and regulatory requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start Today (Even Small)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one element&lt;/strong&gt; on your highest-traffic page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form a hypothesis&lt;/strong&gt; about how to improve it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a simple test&lt;/strong&gt; for two weeks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze results&lt;/strong&gt; objectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share learnings&lt;/strong&gt; with your team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test again&lt;/strong&gt; based on what you learned&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Wins to Try First
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Headline on your homepage&lt;/li&gt;
&lt;li&gt;CTA button color and text&lt;/li&gt;
&lt;li&gt;Form field reduction&lt;/li&gt;
&lt;li&gt;Social proof placement&lt;/li&gt;
&lt;li&gt;Pricing page layout&lt;/li&gt;
&lt;li&gt;Email subject lines&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resources for Continued Learning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Essential books:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Trustworthy Online Controlled Experiments" by Kohavi, Tang, and Xu&lt;/li&gt;
&lt;li&gt;"A/B Testing: The Most Powerful Way to Turn Clicks Into Customers" by Dan Siroker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Communities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://join.slack.com/t/growthbookusers/shared_invite/zt-2xw8fu279-Y~hwnfCEf7WrEI9qScHURQ?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;GrowthBook Slack Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://testandlearn.community/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;Test and Learn Community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stay updated:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow industry leaders on LinkedIn&lt;/li&gt;
&lt;li&gt;Subscribe to testing tool blogs&lt;/li&gt;
&lt;li&gt;Join local CRO meetups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: From Guessing to Knowing
&lt;/h2&gt;

&lt;p&gt;A/B testing transforms how organizations make decisions. Instead of lengthy debates, political maneuvering, and costly mistakes, you get clarity through data.&lt;/p&gt;

&lt;p&gt;But remember: A/B testing is a tool, not a religion. Some decisions require vision, creativity, and bold leaps that testing can't validate. The art lies in knowing when to test and when to trust your instincts.&lt;/p&gt;

&lt;p&gt;Start small. Test consistently. Learn continuously. Let data guide you while creativity drives you.&lt;/p&gt;

&lt;p&gt;The companies that win in 2025 won't be those with the best guesses—they'll be those with the best evidence.&lt;/p&gt;

</description>
      <category>experimentation</category>
    </item>
    <item>
      <title>Building in the AI Era: Lessons from Past Technological Revolutions</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Tue, 29 Jul 2025 01:07:20 +0000</pubDate>
      <link>https://dev.to/growthbook/building-in-the-ai-era-lessons-from-past-technological-revolutions-2nim</link>
      <guid>https://dev.to/growthbook/building-in-the-ai-era-lessons-from-past-technological-revolutions-2nim</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1567789884554-0b844b597180%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wxMTc3M3wwfDF8c2VhcmNofDh8fGF1dG9tYXRpb258ZW58MHx8fHwxNzUzNzUxMDM2fDA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1567789884554-0b844b597180%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wxMTc3M3wwfDF8c2VhcmNofDh8fGF1dG9tYXRpb258ZW58MHx8fHwxNzUzNzUxMDM2fDA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D2000" alt="Building in the AI Era: Lessons from Past Technological Revolutions" width="2000" height="1333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are living through a generational technology shift—one that comes along only once or twice in a lifetime, reshaping how humans interact with the world. Just as electricity, automobiles, computers, the internet, and mobile computing were transformative, AI is doing the same today. However, history shows us that in the early days of a new technology, people often misunderstand the power that it unlocks. This article will examine some of the historical technology shifts and the lessons we can learn from them. &lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons from History
&lt;/h2&gt;

&lt;p&gt;Practical applications of electricity began to take root in the 1880s and 90s, when Edison opened the first electrical power station in Manhattan. The uses were initially targeted at consumers, with rich New Yorkers able to electrify their homes and replace their gas lights with electric ones. Industry, on the other hand, was slow to adapt, despite the evident advantages. Most industries simply replaced steam-powered equipment with electric equivalents, or added electric lights, without considering how they could operate differently.&lt;/p&gt;

&lt;p&gt;The engineering breakthrough came when Henry Ford reimagined the factory in the 1910s. He utilized electric motors' precise speed control and distributed power to create the moving assembly line in 1913—a feat impossible with centralized steam engines that required complex systems of belts and pulleys. These improvements cut the Model T build time from 12 hours to about 93 minutes—a systemic redesign that enabled scale, lowered costs, and transformed labor and manufacturing fundamentally.&lt;/p&gt;

&lt;p&gt;A similar lesson comes from the introduction of the television. In the early days of television, content was heavily borrowed from radio—simply filmed broadcasts of radio shows without inventing for the new medium. The real shift came when creators embraced television's potential: drama anthologies, magazine-format shows like Today and The Tonight Show, recording and editing footage from multiple cameras, and new storytelling formats were designed for television. By the 1950s, TV overtook radio: between 1950 and 1960, U.S. household ownership jumped from about 9 percent to over 60 percent, nearing 90 percent in the early 1960s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt; : Early adopters who treat a new medium like the old one often miss its full value. The true winners reimagine processes, experiences—and even entire business models—when they adopt these new technologies. &lt;/p&gt;

&lt;h2&gt;
  
  
  Parallels with Today’s AI Adoption
&lt;/h2&gt;

&lt;p&gt;It is evident from the above examples that there are parallels with the adoption of AI into our products and businesses. Pressure to add AI, or to be "the AI for X" in a given industry, results in many uninspired implementations. Many organizations today &lt;strong&gt;bolt on an AI assistant&lt;/strong&gt;—like lighting a few bulbs in a steam-powered factory—but miss the opportunity to reimagine workflows end-to-end. The real transformation occurs when considering how AI can transform the user experience.&lt;/p&gt;

&lt;p&gt;The difference between the past technological shifts and the AI one we’re experiencing today is the incredible velocity of the change.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It took about 13 years for Ford to sell 1 million cars. &lt;/li&gt;
&lt;li&gt;It took Google 1 year to reach 1 million searches per day. &lt;/li&gt;
&lt;li&gt;Apple’s iPhone launched in 2007, heralding the smartphone revolution, and sold 1 million units in just 74 days. &lt;/li&gt;
&lt;li&gt;ChatGPT, on the other hand, reached 1 billion searches per day in under a year—a metric that Google took over 10 years to achieve. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within just two months of its November 2022 launch, ChatGPT surpassed 100 million users—the fastest adoption rate ever recorded for a consumer software product. This rate of adoption suggests that companies that don't learn from history and adapt to the AI era face an existential threat, not just a competitive disadvantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  GrowthBook’s Journey with AI
&lt;/h2&gt;

&lt;p&gt;At GrowthBook, our initial step was adding the lightbulb: we launched an AI chatbot to help users navigate our documentation (a helpful concierge, if you will). &lt;/p&gt;

&lt;p&gt;Simultaneously, we conducted several brainstorming sessions to reevaluate our product and explore the potential impact of AI on our business. We ran the &lt;a href="https://reid.medium.com/how-to-scale-a-magical-experience-4-lessons-from-airbnbs-brian-chesky-eca0a182f3e3?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;11-star brainstorming&lt;/u&gt;&lt;/a&gt; sessions and planned our roadmap to reimagine what AI will mean in the A/B testing and product analytics space. We built &lt;a href="http://weblens.ai/?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;u&gt;Weblens.ai&lt;/u&gt;&lt;/a&gt; as a demonstrator of some of the features that AI can unlock for A/B testing—and we have many more features coming very soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From electrification to television to AI, each technological shift has rewarded those who reimagined systems entirely. They didn’t just adopt new tools—they rewrote workflows, content, and the way they delivered value. &lt;/p&gt;

&lt;p&gt;Here are the lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Treat AI as a new paradigm—not just as an add-on&lt;/strong&gt;. Like Ford reengineered production or TV creators abandoned radio formats, &lt;em&gt;design products from an AI-native perspective&lt;/em&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on user journeys and tasks that AI can redefine&lt;/strong&gt;—insights, decisions, personalization—rather than isolated features shoehorned onto existing interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you don’t adapt now, someone else will&lt;/strong&gt;. AI has experienced an explosive rate of growth, resulting in significant productivity gains and a reduction in the time it takes to bring products to market.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>GrowthBook Version 4.0</title>
      <dc:creator>Ryan Feigenbaum</dc:creator>
      <pubDate>Wed, 09 Jul 2025 02:51:51 +0000</pubDate>
      <link>https://dev.to/growthbook/growthbook-version-40-5d8a</link>
      <guid>https://dev.to/growthbook/growthbook-version-40-5d8a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g95srhmgtw9kreo74jm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g95srhmgtw9kreo74jm.webp" alt="GrowthBook Version 4.0" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We shipped so many new features in our June Launch Month that we decided it deserved a major version increase. Version 4.0 brings a huge array of new features.  Here’s a quick summary of everything it includes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/introducing-the-first-mcp-server-for-experimentation-and-feature-management/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;GrowthBook MCP Server&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
AI tools like Cursor can now interact with GrowthBook via our new MCP server. Create feature flags, check the status of running experiments, clean up stale code, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-1/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Safer Rollouts&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Building upon our Safe Rollouts release last version, we added gradual traffic ramp up, auto rollback, a smart update schedule, and a time series view of results.  All of these combine to add even more safety around your feature releases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/app/experiment-decisions?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Decision Criteria&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
You can now customize the shipping recommendation logic for experiments.  Choose from a “Clear Signals” model, a “Do No Harm” model, or define your own from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search Filters&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We’ve revamped the search experience within GrowthBook to make it easier to find feature flags, metrics, and experiments.  Easily filter by project, owner, tag, type, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-2/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Insights Section&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
We added a brand new left nav section called “Insights” with a bunch of tools to help you learn from your past experiments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Dashboard&lt;/strong&gt; shows velocity, win rate, and scaled metric impact by project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learnings&lt;/strong&gt; is a searchable knowledge base of all of your completed experiments.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Experiment Timeline&lt;/strong&gt; shows when experiments were running and how they overlapped with each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric Effects&lt;/strong&gt; lists the experiments that had the biggest impact on a specific metric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric Correlations&lt;/strong&gt; let you see how two metrics move in relation to each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-3/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;SQL Explorer&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
We launched a lightweight SQL console and BI tool to explore and visualize your data directly within GrowthBook, without needing to switch to another platform like Looker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-4/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Managed Warehouse&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
GrowthBook Cloud now offers a fully managed ClickHouse database that is deeply integrated with the product.  It’s the fastest way to start collecting data and running experiments on GrowthBook.  You still get raw SQL access and all the benefits of a warehouse native product, without the setup and maintenance cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.growthbook.io/growthbook-launch-month-week-4/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Feature Flag Usage&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
See analytics about how your feature flags are being evaluated in your app in real time.  This is built on top of the new Managed Warehouse on GrowthBook Cloud and is a game changer for debugging and QA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://flags-sdk.dev/providers/growthbook?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Vercel Flags SDK&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
GrowthBook now has an official provider for the Vercel Flags SDK.  This is now the easiest way to add server-side feature flags to any Next.js project. We have an even deeper Vercel integration coming soon to make this experience even more seamless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.growthbook.io/integrations/framer?ref=blog.growthbook.io" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Official Framer Plugin&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
You can now easily run GrowthBook experiments inside of your Framer projects.  Assign visitors to different versions of your design (like layouts, headlines, or calls to action), track results, and confidently choose the best experience for your audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized Landing Page&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
There’s a new landing page when you first log into GrowthBook.  Quickly see any features or experiments that need your attention, pick up where you left off, and learn about advanced GrowthBook functionality to get the most out of the platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New Experimentation Left Nav&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
There’s a new “Experimentation” section in the left nav. Experiments and Bandits now live within this section, along with our Power Calculator, Experiment Templates, and Namespaces.  We’ll be expanding this section soon with Holdouts and more, so stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST API Updates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter the listFeatures endpoint by clientKey (a fetch sketch follows this list)&lt;/li&gt;
&lt;li&gt;Support partial rule updates in the putFeature endpoint&lt;/li&gt;
&lt;li&gt;New Queries endpoint to retrieve raw SQL queries and results from an experiment&lt;/li&gt;
&lt;li&gt;Added Custom Field support to feature and experiment endpoints&lt;/li&gt;
&lt;li&gt;New endpoints for getting feature code refs&lt;/li&gt;
&lt;li&gt;New endpoint to revert a feature to a specific revision&lt;/li&gt;
&lt;/ul&gt;
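
&lt;p&gt;As a rough sketch of the clientKey filter (the endpoint and auth header follow GrowthBook's REST API conventions; the token value, client key, and response handling below are placeholder assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal sketch: list feature flags filtered by SDK client key.
// Token and client key values are placeholders.
async function listFeatures(clientKey: string) {
  const res = await fetch(
    "https://api.growthbook.io/api/v1/features?clientKey=" + encodeURIComponent(clientKey),
    { headers: { Authorization: "Bearer secret_abc123" } }
  );
  if (!res.ok) throw new Error("GrowthBook API error: " + res.status);
  return res.json();
}
&lt;/code&gt;&lt;/pre&gt;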

&lt;p&gt;&lt;strong&gt;Performance Improvements&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We’ve drastically improved the CPU and memory usage when self-hosting GrowthBook at scale. On GrowthBook Cloud, we’ve seen a roughly 50% reduction during peak load, leading to lower latency and virtually eliminating container failures in production.&lt;/p&gt;

</description>
      <category>releases</category>
      <category>experimentation</category>
      <category>featureflags</category>
      <category>newfeatures</category>
    </item>
  </channel>
</rss>
