Hernan Huwyler
Build vs Buy for AI Systems (A Developer’s Guide to Not Regretting the Decision)

Before we get technical, two quick pointers if you want the longer, governance-heavy version of this topic and the rest of my field notes: https://hernanhuwyler.wordpress.com/

Start with the original article: Building vs Buying Decisions for AI Systems

https://hernanhuwyler.wordpress.com/2026/03/12/building-vs-buying-decisions-for-ai-systems/

If you like this style of practical, production-minded AI engineering, the full blog index is here: hernanhuwyler.wordpress.com

Now the developer take.

I keep seeing AI teams ask “build vs buy” after the architecture is already half-decided. Engineering has a repo. Procurement has a short list. Security has questions nobody can answer. Then the project turns into a political debate about speed and control.

That is how you end up with either:

- a custom system that nobody can operate safely at 2 AM, or
- a vendor system that “works in the demo” but you cannot monitor, explain, or roll back when it misbehaves.

This post is the decision framework I wish more teams used before they commit to code, contracts, or platform lock-in.

I am going to be blunt: build vs buy is not a procurement question. It is an operating model decision with consequences for reliability engineering, incident response, and long-term ownership.


What “build vs buy” really means in AI (it is rarely binary)
In AI, “build” can mean at least five different things:

- build a model from scratch
- fine-tune a foundation model
- build a retrieval layer and orchestration around a hosted model
- build the evaluation and monitoring stack around a vendor tool
- build the workflow integration, guardrails, and audit logging around SaaS AI

“Buy” also has levels:

- buy a fully managed end-to-end product
- buy a platform (model hosting, vector database, feature store, pipeline tooling)
- buy a component (OCR, transcription, embeddings, redaction, PII detection)
- buy “AI inside SaaS” that quietly becomes a production dependency

Most production systems end up hybrid. The question is whether you are designing hybrid on purpose, or drifting into it without controls.

The four lenses that keep teams honest
I use four lenses. If you skip even one, the decision becomes biased toward ideology.

1) Solution fit (does it actually solve your problem?)
For developers, “fit” is not a feature checklist. It is:

- Does it support your data shapes and your failure modes?
- Does it support your latency budget and throughput?
- Can it run in your environment (networking, identity, compliance boundaries)?
- Does it support the behavioral constraints you need (tone, safety, refusal, citations, determinism)?

A vendor might be perfect for commodity workflows like OCR, transcription, translation, ticket summarization, or code completion.

A vendor will struggle when your differentiator is your workflow logic, your proprietary corpus, your control requirements, or your need for deep integration and observability.

Practical test: write one “golden path” scenario and ten “nasty path” scenarios. Make the vendor run them in your environment with your data patterns, not their sandbox.

2) Operating capability (can you run it for years, not weeks?)
Most teams can build a prototype. Fewer can operate an AI system like an SRE-owned service.

If you build, you own:

- model registry and artifact lineage
- feature pipelines and data contracts
- evaluation harness, thresholds, and regressions
- model serving, scaling, and cost controls
- monitoring, alerting, incident playbooks
- retraining triggers, rollback, and retirement

If you buy, you still own:

- integration and identity boundaries
- monitoring of outcomes in your workflows
- “vendor changed something” detection
- audit evidence and incident coordination
- fallbacks when the service degrades

Hard question: who will be on call when the model starts producing toxic output at 11 PM and Customer Support escalates?

If the answer is “we’ll figure it out,” the decision is not ready.

3) Control and risk (who owns the hardest failure mode?)
Neither build nor buy is safer by default. The safer option is the one where the risk is measurable and enforceable in your environment.

In real systems, the hardest risks tend to be:

- data leakage (training or inference)
- prompt injection and tool abuse (if you allow tools/actions)
- model drift and silent quality decay
- fairness regressions across segments
- lack of audit logging and replayability
- vendor opacity (no eval access, no update transparency)

Control test: when something goes wrong, can you answer these in under an hour?

- What exact version is running?
- What changed since last week?
- Can we roll back safely?
- Do we have logs that prove what happened?

If you cannot, you do not have operational control. You have hope.
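One way to make those four questions answerable is to emit a structured, append-only log record for every inference. A minimal sketch; the field names and the `sink` callback are my assumptions, not a standard:

```python
import json
import time
import uuid


def log_inference(model_version, config_hash, request, response, sink=print):
    """Emit one append-only record per inference so incidents can be
    replayed and attributed to an exact version and configuration."""
    record = {
        "event_id": str(uuid.uuid4()),
        "ts_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model_version": model_version,  # answers "what exact version is running?"
        "config_hash": config_hash,      # diffable: "what changed since last week?"
        "request": request,
        "response": response,
    }
    sink(json.dumps(record))
    return record
```

In production the sink would be your log pipeline rather than `print`; the point is that version and configuration identifiers travel with every request, so "prove what happened" becomes a query instead of an archaeology project.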

4) Lifecycle economics (five-quarter view, not quarter-one)
AI cost surprises rarely come from build time. They come from running time.

If you build, hidden cost tends to be:

- staffing continuity, turnover, and tribal knowledge
- infra, GPUs, storage, and network egress
- monitoring and evaluation effort
- governance artifacts, audits, and evidence trails
- technical debt from “we shipped it fast”

If you buy, hidden cost tends to be:

- usage pricing (tokens, queries, seats, “premium support”)
- integration complexity and custom connectors
- vendor change management and renegotiations
- lock-in and migration costs
- lack of portability for prompts, embeddings, or policies

Rule I use: compare expected-case cost over five quarters with stressed-case assumptions. AI vendors and internal builds both look great in best-case spreadsheets.
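That five-quarter rule is simple enough to sketch. A minimal total-cost-of-ownership comparison; all numbers below are purely illustrative:

```python
def five_quarter_cost(build_cost: float, run_cost_per_quarter: float,
                      growth: float = 1.0, quarters: int = 5) -> float:
    """One-off build/integration cost plus quarterly run cost that
    compounds by `growth` each quarter (usage growth, price changes)."""
    total, run = build_cost, run_cost_per_quarter
    for _ in range(quarters):
        total += run
        run *= growth
    return total


# Illustrative only: "buy" starts cheap but usage-priced cost compounds;
# "build" pays up front with a flatter run cost.
buy_expected = five_quarter_cost(build_cost=20, run_cost_per_quarter=30, growth=1.1)
buy_stressed = five_quarter_cost(build_cost=20, run_cost_per_quarter=30, growth=1.5)
build_expected = five_quarter_cost(build_cost=120, run_cost_per_quarter=15, growth=1.05)
```

Run the stressed case too. Lock-in decisions made on expected-case numbers are how the quarter-five invoice surprises you.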

A developer-first decision matrix (build, buy, hybrid)
Here is a lean matrix you can actually use in an engineering review.

| Dimension | Build tends to win when | Buy tends to win when | Hybrid tends to win when |
|---|---|---|---|
| Differentiation | Your workflow or model behavior is core IP | It is commodity capability | Core workflow is unique, base capability is commodity |
| Data constraints | You need strict boundary control, custom redaction, or on-prem | Vendor supports your boundary model | You keep sensitive layers in-house, outsource the rest |
| Observability | You need deep tracing, replay, and segment analytics | Vendor offers limited logs | You build monitoring + audit around vendor core |
| Change control | You need deterministic releases | Vendor changes are opaque | You isolate vendor changes behind an abstraction layer |
| Talent | You have ML + platform + security depth | You do not | You buy platform, build app layer |
This is intentionally not “complete.” It is enough to force real trade-offs early.

Technical due diligence if you are buying (what I make teams test)
Buying AI without a test harness is how teams get surprised in production.

1) Black-box evaluation harness (minimum viable)
You need a repeatable harness that can be run:

- before purchase (pilot)
- before upgrades
- after vendor model changes
- after policy or prompt changes

A simple pattern:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import time


@dataclass
class TestCase:
    name: str
    input: str
    expected_tags: List[str]  # e.g., ["no_pii", "refuse_illegal", "cite_sources"]


def run_eval(cases: List[TestCase], call_model: Callable[[str], Dict]) -> Dict:
    results = {"pass": 0, "fail": 0, "latency_ms": []}
    for c in cases:
        t0 = time.time()
        out = call_model(c.input)
        results["latency_ms"].append((time.time() - t0) * 1000)

        # A case passes only if every expected behavior tag is present.
        tags = out.get("tags", [])
        if all(tag in tags for tag in c.expected_tags):
            results["pass"] += 1
        else:
            results["fail"] += 1
            print(f"FAIL: {c.name} got tags={tags}")
    return results
```

Do not argue about vendor quality based on a demo. Run your cases.

2) Update detection
If the vendor can update models or policies, you need detection. At minimum:

- compare output distributions over time
- run nightly regression tests on a fixed suite
- alert when drift crosses a threshold

If you cannot detect vendor changes, you will misdiagnose incidents as “our integration” when the behavior changed upstream.
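A cheap way to start is a Population Stability Index (PSI) over some numeric output signal (response length, latency, a quality score) between a frozen baseline window and the current window. A minimal pure-Python sketch; the bucket count and the usual 0.1/0.25 thresholds are common conventions, not vendor-specific facts:

```python
import math
from collections import Counter


def psi(baseline, current, buckets: int = 10) -> float:
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / buckets or 1.0  # guard against a constant baseline

    def bucketize(xs):
        # Clamp into the baseline's bucket range so outliers land in the edges.
        idx = (min(max(int((x - lo) / width), 0), buckets - 1) for x in xs)
        counts = Counter(idx)
        n = len(xs)
        # Floor at a tiny value so the log is defined for empty buckets.
        return [max(counts.get(i, 0) / n, 1e-6) for i in range(buckets)]

    p, q = bucketize(baseline), bucketize(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Nightly, compute PSI of the current window against the pilot-era baseline and alert above your threshold. Paired with the fixed regression suite, this lets you tell "our inputs changed" apart from "the vendor changed something."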

3) Contractual requirements that matter to engineers
This is not legal advice. It is the engineering reality I’ve seen break production.

Ask for:

- change notification commitments
- data usage boundaries (training, retention, logging)
- incident notification timelines
- audit evidence availability
- export/migration support (prompts, embeddings, configs where possible)
- service-level objectives (latency, uptime, support response)

A vendor that cannot commit to update visibility is not a vendor. It is a variable.

Technical risk if you build (what teams underestimate)
When teams build, the failures are usually boring and brutal:

Reproducibility debt
If you cannot reproduce a model, you cannot fix it under pressure.

Minimum: version code, data snapshots, feature definitions, training config, and model artifacts.
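A lightweight starting point is a training manifest written at the end of every run. A sketch below; it assumes a git checkout, and the field names are my own, not a standard:

```python
import hashlib
import json
import subprocess
import time
from pathlib import Path


def sha256(path: str) -> str:
    """Content hash of an artifact, streamed so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(artifacts: dict, out: str = "train_manifest.json") -> dict:
    """Record what produced this model: code revision, artifact hashes, timestamp."""
    try:
        rev = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True).stdout.strip()
    except OSError:
        rev = "unknown"  # not a git checkout, or git unavailable
    manifest = {
        "git_rev": rev,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "artifacts": {name: sha256(p) for name, p in artifacts.items()},
    }
    Path(out).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Store the manifest next to the model artifact (or in your registry). When the 2 AM incident arrives, "can we reproduce this model?" starts from a file, not from someone's memory.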

Monitoring debt
Teams ship with uptime monitoring and call it done.

You need:

- data drift signals
- prediction distribution shifts
- segment-level performance when labels arrive
- operational metrics (latency, errors, cost per request)
- user feedback loops (complaints, overrides, appeals)

Ownership debt
If only one person understands the training pipeline, that person becomes your availability risk.

Write it down. Automate it. Rotate ownership.

The hybrid architecture I see working most often
If you want speed and control, hybrid is usually the reality.

A practical hybrid stack looks like this:

- Buy a foundation model API or managed model platform
- Build your retrieval layer (RAG), guardrails, and orchestration
- Build your eval harness, monitoring, and audit logging
- Keep sensitive data inside your boundary via redaction, retrieval controls, and least-privilege access
- Use feature flags to route traffic and roll back quickly

Hybrid works when you treat the vendor as a dependency behind an interface, not as your entire system.
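“Behind an interface” can be as small as one Protocol plus a flag-gated router. A sketch; the class names and fallback message are placeholders, not a real vendor SDK:

```python
from typing import Callable, Protocol


class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class VendorModel:
    """Placeholder for a wrapper around a real vendor API client."""
    def complete(self, prompt: str) -> str:
        raise RuntimeError("vendor unavailable")  # stand-in for a real API call


class FallbackModel:
    """Degraded-mode behavior you control entirely."""
    def complete(self, prompt: str) -> str:
        return "Sorry, this feature is temporarily degraded."


class RoutedModel:
    """Feature-flag routing with fallback: the vendor sits behind an
    interface, so rollback is a routing change, not a refactor."""
    def __init__(self, primary: TextModel, fallback: TextModel,
                 use_primary: Callable[[], bool] = lambda: True):
        self.primary, self.fallback, self.use_primary = primary, fallback, use_primary

    def complete(self, prompt: str) -> str:
        if self.use_primary():
            try:
                return self.primary.complete(prompt)
            except Exception:
                pass  # in production: log, alert, increment a fallback counter
        return self.fallback.complete(prompt)
```

Swapping vendors, canarying a fine-tune, or killing the integration at 2 AM then all look the same: flip the flag, keep the call sites untouched.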

Where governance frameworks help developers (without slowing them down)
I am not asking engineers to become lawyers. I am asking teams to ship systems that can be defended and operated.

Three references that translate well into engineering controls:

- NIST AI Risk Management Framework for lifecycle risk thinking
- ISO/IEC 42001 for management system discipline (roles, controls, evidence)
- EU AI Act for risk-tiered obligations where applicable

The developer translation is simple: turn requirements into pipeline gates, monitoring, and evidence artifacts.

Read the original, and then argue with me
If you want the broader operating model version, read: Building vs Buying Decisions for AI Systems

And if you want more production-focused AI engineering notes, the full blog is here: hernanhuwyler.wordpress.com

Closing question (the one I ask before approving either path)
If your AI system starts producing harmful outputs tomorrow, can you prove what changed and roll back in under 30 minutes?
