Whatsonyourmind

The $36,000 A/B Test: What Optimizely Charges vs. What the Algorithm Actually Costs

You Just Want to Test Two Buttons

You're a developer at a Series A startup. Your product manager walks over and says: "We need to A/B test the signup flow. Three variants, maybe four. Can you set that up this week?"

Simple enough. You've read about multi-armed bandits. You know the theory. You start looking at tooling.

Then you open Optimizely's pricing page. Or rather, you try to — because there is no pricing page. Just a "Contact Sales" button and a calendar widget for a 30-minute demo.

After the demo, the sales call, the follow-up, and the "let me check with my manager" email chain, the number comes back: $36,000 per year minimum. For A/B testing.

That's not a typo. And it gets worse. If your product scales to 10 million impressions per month, you're looking at $63,700 to $113,100 per year depending on your package. Enterprise tier? $200,000 to $400,000+. One user reported getting "stuck with a $24,000 bill for a product they no longer needed" after downgrading became impossible without a sales conversation.

The pricing model itself is designed to extract maximum value: Optimizely charges a percentage of your revenue, not a flat fee. The more successful your product becomes, the more you pay for the same algorithm underneath.

It's a system that, as one reviewer put it, "penalizes those just starting with experimentation." If you're a scrappy team trying to validate hypotheses fast, you're priced out before you write a single test.

When Brex — a well-funded fintech company — finally switched away from Optimizely to Statsig, their engineering lead said it plainly: "Our engineers are significantly happier."

The question nobody asks during those sales calls is the one that matters most: what are you actually buying for $36,000?


What You're Actually Buying

Strip away Optimizely's dashboard. Strip away the visual editor, the audience segmentation, the CDN integration, the SSR compatibility, the SDK for six different frameworks.

What's left?

At the mathematical core of Optimizely's experimentation engine is Thompson Sampling — a multi-armed bandit algorithm published by William R. Thompson in 1933. That's not a criticism. Thompson Sampling is genuinely brilliant. It's one of the most elegant solutions to the explore/exploit problem in statistics.

But it fits in about 20 lines of code.

The algorithm itself is public domain. It's been public domain for 91 years. You can find implementations in every language on GitHub, in textbooks, in blog posts. The math is settled.

So when you pay Optimizely $36,000 per year, you're not paying for the algorithm. You're paying for:

  • The visual editor — drag-and-drop test creation for non-technical users
  • Audience targeting — segment by geography, device, behavior, custom attributes
  • The SDK ecosystem — client-side, server-side, edge, mobile, OTT
  • The analytics dashboard — statistical significance calculations, revenue attribution, funnel visualization
  • Compliance and governance — SOC 2, GDPR controls, approval workflows

These are real features. They have real value — especially for large organizations with non-technical stakeholders who need to create and monitor experiments without writing code.

But if you're a developer, and you just need the bandit algorithm — the explore/exploit engine that decides which variant to show next — you're paying $36,000 for something that costs pennies to compute.


Thompson Sampling in 5 Minutes

Let's actually learn the algorithm you'd be paying for. It's more intuitive than you think.

The Explore/Exploit Dilemma

You have three signup button variants. After the first couple hundred visitors:

  • Variant A converted 35 out of 100 (35%)
  • Variant B converted 40 out of 100 (40%)
  • Variant C converted 5 out of 10 (50%)

Which is best? Traditional A/B testing says: "Keep running all three at equal traffic until we hit statistical significance." That wastes thousands of impressions sending traffic to Variant A, which is clearly losing.

A naive approach says: "Variant C has 50% — send all traffic there." But wait — that's based on only 10 observations. It could easily be noise.

This is the explore/exploit dilemma: do you exploit what looks best now, or explore the uncertain option to learn more?

How Thompson Sampling Solves It

Thompson Sampling uses Beta distributions to model uncertainty about each variant's true conversion rate.

For each variant, you maintain two numbers: successes (alpha) and failures (beta). When you need to pick a variant to show, you:

  1. Sample a random value from each variant's Beta(alpha, beta) distribution
  2. Pick the variant whose sample is highest
  3. Show that variant to the next visitor
  4. Update the shown variant's alpha (if the visitor converted) or beta (if they didn't)

That's it. The entire algorithm.
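Those four steps really do fit in a handful of lines. Here's an illustrative sketch of textbook Thompson Sampling in Python — the `+1` uniform prior is a common convention, and this is not any vendor's actual implementation:

```python
import random

def thompson_select(arms):
    """Pick the arm with the highest draw from its Beta(successes+1, failures+1)."""
    # The +1s act as a uniform prior, so a brand-new arm starts as Beta(1, 1).
    return max(arms, key=lambda a: random.betavariate(a["successes"] + 1,
                                                      a["failures"] + 1))

def update(arm, converted):
    """Record the outcome for the arm that was actually shown."""
    if converted:
        arm["successes"] += 1
    else:
        arm["failures"] += 1

# Simulate 5,000 visitors against hidden true conversion rates.
random.seed(0)
true_rates = {"A": 0.35, "B": 0.40, "C": 0.50}
arms = [{"id": k, "successes": 0, "failures": 0} for k in true_rates]

for _ in range(5000):
    arm = thompson_select(arms)
    update(arm, random.random() < true_rates[arm["id"]])

for a in arms:
    print(a["id"], a["successes"] + a["failures"], "impressions")
```

Run it and the best arm (C, at a true 50%) ends up with the overwhelming share of impressions — no stopping rule, no manual reallocation.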

The magic is in the Beta distribution's shape. A variant with 40 successes and 60 failures produces a tight distribution centered around 0.40 — you're fairly confident in that number. A variant with 5 successes and 5 failures produces a wide, flat distribution — it could be anywhere from 0.10 to 0.90.

When you sample from the uncertain distribution, it occasionally produces very high values. That's exploration — the algorithm says "this option might be amazing, let's check." As you gather more data, the distribution tightens, and the algorithm naturally shifts from exploration to exploitation.
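You can see that difference in spread directly by drawing samples — a quick sketch using only Python's standard library:

```python
import random
from statistics import pstdev

random.seed(1)

# Confident arm: 40 successes, 60 failures -> tight distribution around 0.40.
tight = [random.betavariate(40, 60) for _ in range(10_000)]
# Uncertain arm: 5 successes, 5 failures -> wide distribution around 0.50.
wide = [random.betavariate(5, 5) for _ in range(10_000)]

print(f"tight spread: {pstdev(tight):.3f}")  # roughly 0.05
print(f"wide spread:  {pstdev(wide):.3f}")   # roughly 0.15
```

The wide distribution's samples regularly land above 0.70 — that's where the exploration comes from.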

It converges faster than fixed-split A/B tests because it automatically routes more traffic to winning variants while still exploring promising unknowns. No manual intervention. No arbitrary "stop the test" decisions.

The Failure Mode LLMs Hit

Here's something surprising: large language models consistently get Thompson Sampling wrong when they try to implement decision-making. They see uncertainty and interpret it as risk. When a variant has high variance, an LLM tends to pull back — to avoid the uncertain option and stick with the known quantity.

That's the exact opposite of what Thompson Sampling does. The algorithm treats uncertainty as opportunity. High variance means "we might be missing something great here." This is documented in what one team called "The $3,000 Bug" — an AI agent that was supposed to optimize decisions kept choosing the safe, well-known option and ignoring high-upside alternatives because it conflated uncertainty with danger.

Thompson Sampling doesn't make that mistake. The math doesn't have opinions about risk.


The Alternatives Landscape

Optimizely isn't your only option. The market has fragmented significantly, and there are tools at every price point. Here's an honest comparison:

| Tool | Price | Bandit Algorithms | Self-Serve | Lock-in | Best For |
| --- | --- | --- | --- | --- | --- |
| Optimizely | $36K+/yr | Thompson only | No (sales call) | High (SDK) | Enterprise with big budgets |
| VWO | $199+/mo | Thompson only | Partial | Medium | Mid-market marketing teams |
| GrowthBook | Free (self-host) | Yes | Yes | Low | Teams with DevOps capacity |
| Statsig | Free-$150/mo | Yes | Yes | Low | Developer-first teams |
| OraClaw API | $0.005/call | UCB1 + Thompson + LinUCB | Yes | None | Developers who just need the algorithm |

A few things jump out:

GrowthBook is the open-source hero. If you have the DevOps capacity to self-host, maintain, and monitor it, it's genuinely free and full-featured. The catch is operational overhead — you're running the infrastructure, handling uptime, managing database migrations.

Statsig hit a sweet spot for developer teams. Their free tier is generous, the DX is good, and it's what Brex switched to. If you need a full experimentation platform with a dashboard, this is the value pick.

VWO occupies the mid-market — cheaper than Optimizely, still dashboard-focused, still requires some sales interaction for advanced features.

OraClaw takes a fundamentally different approach. It's not a platform — it's an API endpoint. You send it arm data, it runs the algorithm, it returns a decision. No SDK to install, no dashboard to learn, no vendor lock-in. It supports three bandit algorithms (UCB1 for deterministic upper confidence bounds, Thompson for Bayesian exploration, and LinUCB for context-aware decisions that factor in features like time-of-day or user segment).

The right choice depends entirely on what you need. Not every problem requires the same tool.


Try It Right Now

Here's a working example. No signup, no API key, no sales call. Just paste this into your terminal:

```shell
curl -X POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit \
  -H "Content-Type: application/json" \
  -d '{
    "arms": [
      {"id": "variant-a", "name": "Original CTA", "pulls": 500, "totalReward": 175},
      {"id": "variant-b", "name": "New CTA", "pulls": 300, "totalReward": 126},
      {"id": "variant-c", "name": "Bold CTA", "pulls": 12, "totalReward": 8}
    ],
    "algorithm": "thompson"
  }'
```

You'll get back something like:

```json
{"selected": {"id": "variant-c"}, "score": 0.71, "algorithm": "thompson"}
```

Wait — variant-c? The one with only 12 pulls and a 66.7% conversion rate?

Yes. And here's why that's correct.

Variant A has 500 pulls and a 35% conversion rate. Thompson Sampling is very confident about that number — the Beta(175, 325) distribution is tight. It's almost certainly between 31% and 39%.

Variant B has 300 pulls and a 42% conversion rate. Also fairly confident — Beta(126, 174) is tight. Probably between 37% and 47%.

Variant C has 12 pulls and a 66.7% conversion rate. But Beta(8, 4) is wide. The true rate could be anywhere from 35% to 90%. When Thompson samples from this distribution, it frequently draws values above 0.50 — higher than what A or B can produce.

The algorithm is saying: "Variant C looks promising but we barely know anything about it. Let's send more traffic there to find out."

That's exploration in action. If C's true rate is 45%, a few more pulls will tighten the distribution and it'll stop being selected. If C's true rate really is 65%, you just found a massive winner that a fixed 33/33/33 split would have taken 10x longer to identify.
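You can estimate how often Thompson Sampling picks each variant by replaying the draw many times with the counts from the example above — a Monte Carlo sketch, not the API's actual code:

```python
import random

random.seed(42)

# Raw (successes, failures) counts from the curl example above.
arms = {"variant-a": (175, 325), "variant-b": (126, 174), "variant-c": (8, 4)}

wins = {name: 0 for name in arms}
trials = 20_000
for _ in range(trials):
    # One Thompson round: draw from each Beta, highest draw is shown.
    draws = {name: random.betavariate(s, f) for name, (s, f) in arms.items()}
    wins[max(draws, key=draws.get)] += 1

for name, w in wins.items():
    print(f"{name}: selected {w / trials:.0%} of the time")
```

Variant C wins the large majority of rounds despite its tiny sample — and variant A, the confident loser, almost never gets picked.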

This is exactly the behavior that the "$3,000 Bug" LLM got wrong. It saw the small sample size and treated it as a reason to avoid variant C. Thompson Sampling sees the small sample size and treats it as a reason to investigate.

You can swap "algorithm": "thompson" for "ucb1" or "linucb" (with a context vector) to compare strategies.
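For comparison, UCB1 is fully deterministic: each arm scores its empirical mean plus an exploration bonus that shrinks as pulls accumulate, and the highest score wins. Here's a sketch using the classic UCB1 formula with the same counts — the API's exact scoring may differ:

```python
import math

# (pulls, total_reward) from the curl example above.
arms = {"variant-a": (500, 175), "variant-b": (300, 126), "variant-c": (12, 8)}
total_pulls = sum(p for p, _ in arms.values())  # 812

def ucb1(pulls, reward):
    # Empirical mean plus an upper-confidence bonus: sqrt(2 ln N / n).
    return reward / pulls + math.sqrt(2 * math.log(total_pulls) / pulls)

scores = {name: ucb1(p, r) for name, (p, r) in arms.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # prints: variant-c 1.72
```

Same conclusion as Thompson, reached by a different route: variant C's 12 pulls earn it a huge exploration bonus, so it gets the next impression.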


When NOT to Use This

Let's be honest about the tradeoffs.

If you need a visual editor so your marketing team can create tests without writing code — use Optimizely or VWO. If you need audience targeting with complex segmentation rules — use a platform. If you need a dashboard with real-time charts for stakeholders who don't read JSON — use Statsig or GrowthBook.

A bare API endpoint is the wrong tool for organizations where non-developers need to create and monitor experiments. That's a real use case, and the $36K platforms serve it well.

But if you're a developer calling an optimization algorithm from your backend, your data pipeline, or your AI agent — you don't need a visual editor. You don't need a dashboard. You need the math, and you need it fast, and you need it cheap.


The Math Doesn't Care About Your Budget

Thompson Sampling produces the same distribution, the same samples, and the same convergence properties whether the compute costs $36,000 per year or $0.005 per call. The algorithm was published in 1933. It's been proven optimal in the limit. No amount of enterprise packaging changes the underlying mathematics.

The question isn't "which algorithm is best" — for most A/B testing scenarios, Thompson Sampling is the answer regardless of vendor. The question is: how much infrastructure do you need wrapped around it?

If the answer is "a lot" — platforms exist for that. If the answer is "just give me the algorithm" — now you know what your options are.

Stop paying $36,000 for 20 lines of math.
