The Two Places Generative AI Shows Up When You Ship a Custom AI Application

#webdev #machinelearning #programming #ai

Most write-ups about "AI development" quietly conflate two very different activities. One is building software that uses generative AI as a core capability: copilots, retrieval systems, autonomous agents. The other is using generative AI to build software: code generation, test synthesis, legacy modernization. They share a buzzword and almost nothing else. The skills, the risks, and the discipline required are different, and teams that treat them as one thing tend to get burned on both.

If you're shipping a custom AI application, you will run into both at once. This post is a practical map of where each shows up, what tends to break, and how to keep the speed without inheriting the fragility.

Track 1: Generative AI as the Product

When the AI is the feature, the engineering challenge is not "call a model." It's everything wrapped around the call that decides whether the thing is correct, safe, and maintainable.

A few realities that separate a working demo from a deployable application:

Retrieval quality is the silent killer. Most custom AI apps lean on Retrieval-Augmented Generation (RAG). The naive version (embed documents, do a similarity search, stuff results into the prompt) hides every decision that actually matters. Fixed-size chunking severs context. Pure vector search whiffs on exact identifiers like error codes or SKUs, where a hybrid of dense vectors plus keyword search does far better. And if the model can answer without citing which retrieved chunk supports the claim, you have no reliable way to check an answer against its sources, which means no practical mechanism to detect hallucination in production.
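To make the hybrid idea concrete, here is a minimal sketch of one common way to merge a dense ranking with a keyword ranking: reciprocal rank fusion. The post doesn't prescribe a fusion method, so treat this as one option among several. The document IDs, the query, and `k=60` are illustrative assumptions, and the two input lists stand in for whatever your vector store and keyword index actually return.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "error E4012 on checkout".
dense_hits = ["doc_payments_overview", "doc_checkout_faq", "doc_e4012"]
keyword_hits = ["doc_e4012", "doc_error_index"]  # exact match on the error code

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

The document containing the exact error code ranks last in the dense list but appears in both lists, so it moves to the top of the fused ranking. Rank-based fusion also sidesteps the problem that dense similarity scores and keyword scores are not on comparable scales.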

Use the smallest control structure that works. "Autonomous multi-agent system" is rarely the right starting point. Reliability drops and debugging cost climbs with every layer of autonomy you add. A ticket-classification task needs one well-prompted call with a typed output, not three agents deliberating. Reserve orchestration for problems that genuinely require planning and tool use you can't predetermine.
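As a rough illustration of "one well-prompted call with a typed output," the sketch below validates a model's JSON reply against a small schema before anything downstream sees it. The categories, the prompt wording, and the `call_model` callable are all hypothetical; swap in your own client and labels.

```python
import json
from dataclasses import dataclass
from enum import Enum

class Category(str, Enum):
    BILLING = "billing"
    BUG = "bug"
    HOW_TO = "how_to"

@dataclass(frozen=True)
class TicketLabel:
    category: Category
    urgent: bool

def parse_label(raw: str) -> TicketLabel:
    """Validate the model's reply; raise ValueError on anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("model reply must be a JSON object")
    category, urgent = data.get("category"), data.get("urgent")
    if not isinstance(category, str) or category not in {c.value for c in Category}:
        raise ValueError(f"unknown category: {category!r}")
    if not isinstance(urgent, bool):
        raise ValueError("'urgent' must be true or false")
    return TicketLabel(Category(category), urgent)

def classify(ticket_text: str, call_model) -> TicketLabel:
    """One model call, one typed result. `call_model` is a hypothetical client."""
    prompt = (
        "Classify the support ticket. Reply with JSON only, shaped like "
        '{"category": "billing" | "bug" | "how_to", "urgent": true | false}.\n\n'
        f"Ticket: {ticket_text}"
    )
    return parse_label(call_model(prompt))

# Stand-in for a real model client so the sketch runs on its own.
def fake_model(prompt: str) -> str:
    return '{"category": "billing", "urgent": true}'

label = classify("I was charged twice this month", fake_model)
```

A reply that doesn't parse into `TicketLabel` fails loudly at the boundary, which is usually easier to debug than a free-text answer leaking into business logic.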

Guardrails live in code, not prompts. Anything that must always be true (a spending cap, a permission check, a rate limit) belongs in deterministic code that runs no matter what the model decided. A prompt is a request, not an invariant.

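# The model can *propose* a refund. Code decides whether it's allowed.
MAX_AUTO_REFUND_CENTS = 5_000  # illustrative value; set from business policy

def issue_refund(current_user, order_id: str, amount_cents: int) -> dict:
    # Check ownership first, so a non-owner never learns anything about the order.
    if not user_owns_order(current_user, order_id):
        raise PermissionError("user does not own this order")  # never trust the prompt
    if amount_cents <= 0:
        raise ValueError("refund amount must be positive")
    if amount_cents > MAX_AUTO_REFUND_CENTS:
        return {"status": "escalate_to_human"}   # invariant enforced here
    return process_refund(order_id, amount_cents)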

Evaluation is non-negotiable. The defining question is "did that change make the system better or worse?" Without a versioned evaluation set that you run on every prompt tweak and model swap, every change is a guess. It doesn't need to be huge; a few dozen well-chosen cases catch a surprising number of regressions. Because outputs are non-deterministic, your metrics should be thresholds over a sample, not pass/fail on one run.
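Here is a small sketch of what "thresholds over a sample" can look like. Everything in it is an assumption for illustration: the two cases, the 0.9 threshold, and the seeded `make_flaky_system` stub that stands in for a real non-deterministic model call.

```python
import random

def pass_rate(system, cases, runs_per_case=5):
    """Run every case several times and return the fraction of passing runs."""
    passed = total = 0
    for case in cases:
        for _ in range(runs_per_case):
            total += 1
            if case["check"](system(case["input"])):
                passed += 1
    return passed / total

def make_flaky_system(seed, error_rate):
    """Stand-in for a model-backed system that is wrong some fraction of the time."""
    rng = random.Random(seed)
    def system(text):
        return "unknown" if rng.random() < error_rate else text.strip().lower()
    return system

# A tiny eval set; in practice this lives in version control next to the prompts.
cases = [
    {"input": " Refund ", "check": lambda out: out == "refund"},
    {"input": "CANCEL", "check": lambda out: out == "cancel"},
]

THRESHOLD = 0.9  # hypothetical bar; choose it from your own baseline runs
rate = pass_rate(make_flaky_system(seed=0, error_rate=0.1), cases, runs_per_case=20)
regressed = rate < THRESHOLD
```

The comparison that matters is the rate against the threshold (or against the previous version's rate), not whether any single run happened to pass.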

Track 2: Generative AI as the Way You Build

The second track is generative AI accelerating the development lifecycle itself, and it follows one governing principle: generation is the cheap part; ownership is the expensive part. A model can produce fifty plausible lines in seconds. Reviewing, testing, securing, and maintaining those lines for years costs just as much as if a human had written them.

What changes behavior is treating AI output as a draft and the reviewer as fully accountable for it. "The model wrote it" is not a defense in a postmortem. Watch for the failure modes assistants over-produce: subtly wrong edge cases (empty collections, timezones, integer truncation), hallucinated or outdated APIs, and security anti-patterns like string-concatenated SQL that they reproduce from training data.
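The SQL case is worth seeing side by side. This sketch uses Python's standard `sqlite3` module with an in-memory database; the table, the rows, and the payload are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)

def find_user_unsafe(name):
    # Anti-pattern: user input is spliced directly into the SQL text.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Placeholder binding: the driver treats `name` purely as a value.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "nobody' OR '1'='1"
leaked = find_user_unsafe(payload)  # the injected OR clause matches every row
safe = find_user_safe(payload)      # no user has that literal name, so no rows
```

Both functions look equally plausible in a diff, which is exactly why this pattern is better caught by a static check than by a tired reviewer.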

The practical safeguard is a deterministic gate before human review. AI raises the volume of code flowing into review, so the automated floor under that review has to be solid enough to absorb the increase without humans rubber-stamping:

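class ReviewBlocked(Exception):
    """Raised when a change fails an automated gate and must not reach review."""

def gate(change):
    # Each entry pairs a gate name with a boolean result. All gates run, so the
    # author sees every failure at once instead of fixing them one at a time.
    checks = [
        ("type-checks",          run_type_check(change)),
        ("linter clean",         run_linter(change)),
        ("no vuln patterns",     run_sast(change)),
        ("no secrets",           scan_for_secrets(change)),
        ("tests non-trivial",    tests_meaningful(change)),
        ("coverage not reduced", coverage_delta(change) >= 0),
    ]
    failed = [name for name, ok in checks if not ok]
    if failed:
        raise ReviewBlocked(f"AI change failed gates: {failed}")
    return "ready_for_human_review"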

Test generation is the highest-leverage use, with one caveat about the direction of trust. Generating tests for existing, human-written code is safe and valuable, because the code is the trusted artifact. But when the model writes both the implementation and its tests, the tests tend to encode the implementation's bugs as "expected" behavior. Keep a human-authored specification of intended behavior as the anchor.
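One lightweight form of a human-authored anchor is a table of intended behavior that exists before any code is generated. In the sketch below, the discount rule, the spec rows, and `apply_discount` (a stand-in for a model-written implementation) are all hypothetical.

```python
# Human-authored specification: (description, (price_cents, percent), expected_cents).
SPEC = [
    ("typical order",                              (1000, 10), 900),
    ("zero discount",                              (1000, 0),  1000),
    ("full discount",                              (1000, 100), 0),
    ("discounted price rounds down to whole cents", (999, 10),  899),
]

def apply_discount(price_cents: int, percent: int) -> int:
    """Stand-in for a generated implementation under review.

    It rounds the *discount* down, which rounds the final price up.
    """
    return price_cents - (price_cents * percent) // 100

def spec_failures(impl, spec):
    """Return the descriptions of every spec row the implementation violates."""
    return [desc for desc, args, expected in spec if impl(*args) != expected]

failures = spec_failures(apply_discount, SPEC)
```

A test generated from this implementation would likely assert that `apply_discount(999, 10)` returns 900, locking the rounding bug in as expected behavior. The spec row catches it because it was written from the intent, not from the code.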

Legacy modernization is where this track is most seductive and most dangerous. A model will translate an old module into idiomatic modern code while silently dropping a side effect some downstream system depends on. The discipline that works: modernize in small increments, and use characterization tests (tests that capture the legacy code's existing behavior, quirks included) as the contract the new code must satisfy.
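A characterization test can be as simple as snapshotting what the old function does today and diffing the new one against that snapshot. Both functions below are invented stand-ins: `legacy_normalize_sku` carries two quirks, and `modern_normalize_sku` plays the role of a cleaner-looking translation.

```python
def legacy_normalize_sku(raw):
    """Stand-in legacy code with quirks that callers may depend on."""
    if raw is None:
        return ""  # quirk: None becomes an empty string, not an error
    return raw.strip().upper().replace(" ", "-")  # quirk: double spaces become "--"

def modern_normalize_sku(raw: str) -> str:
    """Stand-in for a translated rewrite that looks more idiomatic."""
    return "-".join(raw.split()).upper()

def record_behavior(func, inputs):
    """Snapshot what a function does for each input, exceptions included."""
    snapshot = {}
    for value in inputs:
        try:
            snapshot[repr(value)] = ("ok", func(value))
        except Exception as exc:
            snapshot[repr(value)] = ("raises", type(exc).__name__)
    return snapshot

inputs = ["ab 12", " ab 12 ", "ab  12", None]
golden = record_behavior(legacy_normalize_sku, inputs)     # the contract
candidate = record_behavior(modern_normalize_sku, inputs)  # the rewrite

# Inputs where the rewrite no longer matches the recorded legacy behavior.
drift = [key for key in golden if golden[key] != candidate[key]]
```

Here the rewrite agrees with the legacy code on ordinary inputs but diverges on the double-space input and on `None`. Whether those quirks should survive is a product decision; the point is that the diff surfaces them before a downstream system does.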

Where the Two Tracks Meet

Shipping a custom AI application means running both tracks at the same time, and the same engineering values turn out to govern each:

Determinism around non-determinism. Whether it's a model deciding to issue a refund or a model writing the refund code, the safety net is deterministic checks that don't depend on the model behaving.
Evaluation over vibes. Track 1 needs faithfulness evals; Track 2 needs change-failure rate and defect-escape rate. Both replace "it worked when I tried it" with a measurement.
Human accountability at the boundary. A high-stakes agent action gets a human approval checkpoint; a high-stakes code change gets a human reviewer who owns it. Same pattern.
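The two Track 2 metrics named above are simple ratios. The sketch below uses the usual definitions as I understand them (failed deployments over total deployments, and defects found in production over all defects found); the counts are invented, and your team may define the numerators differently.

```python
def change_failure_rate(deployments: int, failed_deployments: int) -> float:
    """Share of deployments that caused a failure needing remediation."""
    if deployments <= 0:
        raise ValueError("need at least one deployment to compute a rate")
    return failed_deployments / deployments

def defect_escape_rate(found_before_release: int, found_in_production: int) -> float:
    """Share of all known defects that were first caught in production."""
    total = found_before_release + found_in_production
    if total == 0:
        return 0.0  # no defects recorded at all
    return found_in_production / total

# Hypothetical numbers for one quarter.
cfr = change_failure_rate(deployments=40, failed_deployments=6)
escape = defect_escape_rate(found_before_release=18, found_in_production=2)
```

Tracking both before and after adopting AI-assisted development gives you a baseline to compare against, which is the whole point of replacing vibes with a measurement.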

A rough readiness check before you call a custom AI application "done":

[ ] Retrieval returns cited, verifiable sources
[ ] Hard business rules enforced in code, not prompts
[ ] Versioned eval set runs on every model/prompt change
[ ] Full request tracing captured (input → retrieval → prompt → output)
[ ] Idempotency on every state-changing action
[ ] Graceful degradation when the model is unavailable
[ ] PII handling and access control on retrieval
[ ] AI-assisted code passed the same gates as human code

If most of those boxes are empty, you have a prototype, not a product, no matter how good the demo looked.
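Two of those boxes, idempotency and graceful degradation, are easy to sketch. The version below is deliberately minimal and makes several assumptions: an in-memory dictionary stands in for a durable store, `RefundService` and `answer` are hypothetical names, and the fallback message is a placeholder.

```python
class RefundService:
    """State-changing action guarded by an idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency_key -> stored result (use a real DB in practice)
        self.executions = 0  # counts how many times the side effect actually ran

    def refund(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        # A retried request with the same key returns the stored result
        # instead of repeating the side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.executions += 1
        result = {"status": "refunded", "order_id": order_id, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result

def answer(question: str, call_model) -> dict:
    """Return the model's answer, or a clearly labeled fallback if it is unreachable."""
    try:
        return {"source": "model", "text": call_model(question)}
    except (TimeoutError, ConnectionError):
        return {"source": "fallback", "text": "The assistant is unavailable. Please try again shortly."}

service = RefundService()
first = service.refund("key-1", "order-9", 500)
retry = service.refund("key-1", "order-9", 500)  # e.g. an agent retrying after a timeout

def unavailable_model(question: str) -> str:
    raise TimeoutError("model endpoint did not respond")

degraded = answer("Where is my order?", unavailable_model)
```

Idempotency matters more with agents than with ordinary clients, because a model that doesn't see a confirmation may simply decide to call the tool again.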

Closing Thought

The hype frames generative AI as a single revolution. In practice it's two distinct disciplines that happen to share a name, and a custom AI application sits at their intersection. The teams that win aren't the ones generating the most code or wiring up the most agents. They're the ones whose evaluation, guardrails, and review gates are strong enough that more AI (in the product and in the process) makes them faster without making them fragile.

I work on AI engineering at Wizr AI, where we build custom AI applications and use generative AI across the software development lifecycle. These tradeoffs are part of the daily job. Happy to compare notes in the comments.
