DEV Community: Marketing wizr

RAG Is Not a Feature. It's a System, and These Are the Parts Nobody Demos.

Marketing wizr — Wed, 01 Jul 2026 09:05:44 +0000

Retrieval-Augmented Generation demos beautifully. Embed your documents, run a similarity search, drop the results into the prompt, and the model answers questions over your data. Ship it, and it works right up until real users ask real questions, at which point the answers get subtly, confidently wrong. The demo hid every decision that actually determines quality. Here are the parts that separate a RAG demo from a RAG system.

Chunking is a design decision, not a default

Splitting documents on a fixed token count is the default in most tutorials, and it quietly destroys quality. Fixed windows cut tables in half, separate a clause from the sentence that qualifies it, and orphan headings from the text they describe. Structure-aware chunking, which splits on semantic boundaries such as sections, list items, or function definitions, keeps each chunk self-contained and generally retrieves better. The right chunk size is empirical. Measure it; do not inherit it.
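As a minimal sketch of what structure-aware splitting can look like, assume the documents use Markdown-style `#` headings as section boundaries. The function name `chunk_by_heading` and the `max_chars` budget are illustrative choices, not part of any library, and a real pipeline would likely budget in tokens rather than characters.

```python
import re

def chunk_by_heading(text: str, max_chars: int = 1200) -> list[str]:
    """Split on Markdown-style headings, keeping each heading with its body.

    Sections longer than max_chars are split again on blank lines, and the
    heading is repeated on every piece so no chunk loses its context.
    A single paragraph longer than max_chars is left whole.
    """
    # Split immediately before any line that starts with 1-6 '#' and a space.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        heading, _, body = section.partition("\n")
        current = heading
        for para in re.split(r"\n\s*\n", body.strip()):
            # Start a new chunk when adding this paragraph would exceed the budget.
            if len(current) + len(para) + 2 > max_chars and current != heading:
                chunks.append(current)
                current = heading
            current += "\n\n" + para
        chunks.append(current)
    return chunks
```

Run it over a sample of your own corpus at a few `max_chars` values and compare retrieval results; the number that works is the one your evaluation set prefers.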

Pure vector search is not enough

Embeddings are great at "find me something similar" and surprisingly bad at "find the document containing error code E-4021." Exact identifiers, product codes, and rare terms are exactly where semantic search whiffs. Hybrid retrieval fixes most of this: run dense vector search and a sparse keyword index (BM25) together, then rerank the merged set. The keyword half catches the exact matches the vectors miss.
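One common way to merge the two result lists before reranking is reciprocal rank fusion (RRF). That is my choice for illustration; the approach above works with other merge strategies too. The sketch assumes each retriever has already returned a ranked list of document IDs, and the IDs shown are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for the query "error code E-4021".
dense = ["doc_a", "doc_b", "doc_c"]      # from the vector index
sparse = ["doc_e4021", "doc_a"]          # from the keyword (BM25) index
merged = reciprocal_rank_fusion([dense, sparse])
```

The document that only the keyword index found still lands in the merged candidate set, which is the point: the reranker can only promote what retrieval surfaced.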

Grounding is the difference between answer and hallucination

If your model can produce an answer without citing which retrieved chunk supports it, you have no way to detect hallucination in production. Force citations, then validate them.

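def validate_grounding(answer, citations, retrieved_chunks):
    # retrieved_chunks: the set (or mapping) of chunk IDs retrieval returned.
    # makes_factual_claim: your own heuristic or classifier, not shown here.
    for cid in citations:
        if cid not in retrieved_chunks:
            return False           # model invented a source
    if not citations and makes_factual_claim(answer):
        return False               # unsupported claim
    return True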

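This runs on every request and turns an invisible failure into a catchable one. It only proves that every cited chunk was actually retrieved, not that the chunk supports the claim, so treat it as a floor and measure faithfulness in your evaluation set as well.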

Retrieval respects permissions or it leaks

The moment your corpus contains anything access-controlled, retrieval becomes a security surface. Filtering results after the search is fragile. Scope the query itself with the requesting user's permissions so unauthorized chunks are never candidates for the context window in the first place. A retrieval bug that surfaces the wrong customer's data is not a quality issue; it is an incident.
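Here is a small in-memory sketch of the ordering that matters: scope first, rank second. The `Chunk` type, the `allowed_groups` field, and the term-overlap scoring are all stand-ins I made up for illustration. In a real system the scope would be pushed into the index query itself as a metadata filter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]

def scoped_search(query: str, chunks: list[Chunk], user_groups: set[str],
                  top_k: int = 3) -> list[Chunk]:
    # Scope first: chunks the user cannot see never become candidates.
    candidates = [c for c in chunks if c.allowed_groups & user_groups]
    terms = set(query.lower().split())
    # Placeholder relevance score: number of query terms the chunk shares.
    scored = [(len(terms & set(c.text.lower().split())), c) for c in candidates]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].chunk_id))
    return [c for _, c in scored[:top_k]]

corpus = [
    Chunk("c1", "acme renewal pricing", frozenset({"acme"})),
    Chunk("c2", "globex renewal pricing", frozenset({"globex"})),
]
hits = scoped_search("renewal pricing", corpus, user_groups={"acme"})
```

Both chunks match the query equally well, but only the Acme chunk is ever scored for an Acme user. A bug in ranking cannot leak what was never a candidate.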

Evaluation, or you are flying blind

Every knob above (chunk size, retrieval mix, reranking, prompt) changes output quality in ways you cannot eyeball. You need a versioned evaluation set that answers "did that change help or hurt?" on every adjustment. A few dozen well-chosen cases with faithfulness and retrieval-hit metrics catch a startling number of regressions. Without this, every improvement is a guess and quality drifts.
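A retrieval-hit metric is the easiest one to start with. The sketch below assumes each evaluation case names the chunk that should be retrieved, and that your retriever can be called as a function returning ranked chunk IDs. `EvalCase`, `retrieval_hit_rate`, and the fake retriever are illustrative names, not an existing framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_chunk_id: str   # the chunk a correct retrieval must include

def retrieval_hit_rate(cases: list[EvalCase],
                       retrieve: Callable[[str], list[str]],
                       k: int = 5) -> float:
    """Fraction of cases whose expected chunk appears in the top-k results."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for case in cases
        if case.expected_chunk_id in retrieve(case.question)[:k]
    )
    return hits / len(cases)

# Stand-in retriever so the sketch runs without an index.
FAKE_RESULTS = {"how do I reset my password?": ["kb_12", "kb_40"],
                "what does E-4021 mean?": ["kb_07"]}
cases = [EvalCase("how do I reset my password?", "kb_12"),
         EvalCase("what does E-4021 mean?", "kb_99")]
rate = retrieval_hit_rate(cases, lambda q: FAKE_RESULTS.get(q, []))
```

Keep the cases in version control next to the code, and record the rate for every change to chunking, retrieval mix, or prompt so that "did that help or hurt?" has a number attached.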

It compounds into real architecture

Put these together and RAG stops looking like a feature and starts looking like a subsystem: ingestion and chunking, a hybrid index, a retrieval layer with per-user scoping, a grounding validator, and an evaluation harness in CI. That is a lot more than "embed and search," and it is why serious enterprise AI solutions treat retrieval as core infrastructure rather than a wrapper around a vector database.

The takeaway

The gap between a RAG demo and a RAG system is measured in chunking strategy, hybrid retrieval, grounding, access control, and evaluation. None of it shows up in the five-minute demo, and all of it determines whether the thing is trustworthy in production. If you are building this into a product rather than a prototype, it is worth treating it with the same rigor as any other custom AI application development effort, because that is exactly what it is.

What broke first in your RAG system? Mine was chunking. Trade stories below.

We Let AI Write a Third of Our Code. Here's the Review Process That Kept Us Sane.

Marketing wizr — Wed, 01 Jul 2026 08:55:30 +0000

There is a seductive moment when AI coding assistants start pulling real weight: a meaningful share of your diffs are machine-drafted, velocity spikes, and everyone feels ten feet tall. Then the first subtle bug from unreviewed generated code reaches production, and you realize the tool changed how fast you write code without changing how much it costs to own it. Reviewing, testing, securing, and maintaining that code costs exactly what it always did.

Here is the process that let us lean on generation without inheriting fragility.

Rule zero: the human who merges it owns it

The most important change was cultural, not technical. Whoever opens the PR is accountable for every line as if they typed it. "The model wrote it" is not a defense in a postmortem. This one norm ended the skim-and-approve reflex, because skimming now put your name on the incident.

Build an automated floor before you open the tap

AI raises the volume of code hitting review. If human review is your only filter, reviewers start rubber-stamping under the load. So we put a deterministic gate before any AI-drafted change reaches a person:

[ ] type-checks / compiles
[ ] linter clean
[ ] static analysis (SAST) finds no known-vuln patterns
[ ] no secrets introduced
[ ] tests present and non-trivial
[ ] coverage does not drop

None of this is AI-specific, which is the point. The floor has to be solid enough to absorb more code without more human hours.

Watch for the failure modes assistants over-produce

Generated code fails in characteristic ways, and knowing them makes review faster: mishandled edge cases (empty collections, timezones, integer truncation) that the happy path never exercises; hallucinated or outdated API calls that sound plausible; and security anti-patterns like string-concatenated SQL that models reproduce from their training data. We keep a short reviewer checklist of exactly these.
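The SQL case is worth seeing side by side, because the unsafe version looks perfectly reasonable in a diff. This sketch uses Python's built-in `sqlite3` with an in-memory database; the table and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

def find_user_unsafe(name: str) -> list[tuple]:
    # Anti-pattern: user input is spliced directly into the SQL text.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name: str) -> list[tuple]:
    # Parameterized query: the value is bound separately from the SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)   # the injected OR clause matches every row
blocked = find_user_safe(payload)    # treated as a literal name; matches nothing
```

A reviewer checklist item as simple as "no f-strings or concatenation inside `execute`" catches this class before it ships.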

Choosing which assistants and scanners to standardize on was its own project; if you are early in that, it is worth surveying the current AI software development tools rather than defaulting to whatever is bundled with your IDE.

Generate tests for human code, not the other way around

Test generation was our highest-leverage use, with one caveat about the direction of trust. Generating tests for existing, human-written code is great: the code is the trusted artifact and the tests are scaffolding. But when the model writes both the implementation and its tests, the tests tend to encode the implementation's bugs as "expected." So the intended behavior is always asserted by a human who understands the requirement:

def test_discount_never_exceeds_cap():
    # Business rule: discount capped at 30%, regardless of input.
    assert apply_discount(price=100, pct=50) == 70   # capped, not 50
    assert apply_discount(price=0,   pct=30) == 0     # zero price stays zero, never negative

Measure delivery, not typing

The trap is celebrating "lines generated" or "PRs opened." Those are inputs. We watch change-failure rate, time-to-restore, and defect-escape rate. When generation sped up but change-failure rate ticked up, that was the signal we had shifted work from writing to debugging, and debugging is the expensive end.
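Change-failure rate is simple arithmetic once deployments are tagged. The sketch below assumes you can label each deployment with whether it was AI-drafted and whether it caused an incident; the `Deployment` record and its fields are hypothetical, so map them onto whatever your deploy and incident tooling actually exports.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deployment:
    deploy_id: str
    ai_drafted: bool
    caused_incident: bool

def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that led to an incident (0.0 for an empty list)."""
    if not deployments:
        return 0.0
    return sum(d.caused_incident for d in deployments) / len(deployments)

history = [
    Deployment("d1", ai_drafted=True, caused_incident=True),
    Deployment("d2", ai_drafted=True, caused_incident=False),
    Deployment("d3", ai_drafted=False, caused_incident=False),
    Deployment("d4", ai_drafted=False, caused_incident=False),
]
overall = change_failure_rate(history)
ai_only = change_failure_rate([d for d in history if d.ai_drafted])
```

Tracking the overall rate and the AI-drafted slice separately is what lets you see work shifting from writing to debugging, rather than guessing at it.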

The takeaway

More AI in your pipeline is fine, even great, as long as your review gates, test discipline, and accountability are strong enough that the extra volume makes you faster without making you fragile. The teams that win are not the ones generating the most code. They are the ones who treat generation as cheap and ownership as the real work. If your team is trying to formalize this at scale, it is essentially the operating model of any serious generative AI software development company: move fast on generation, stay strict on verification.

What does your AI code-review process look like? I am collecting patterns in the comments.

Your AI Agent Works in the Demo and Breaks in Production. The Problem Is the Last Mile.

Marketing wizr — Tue, 30 Jun 2026 08:45:10 +0000

The demo is always convincing. You ask the agent to "find the overdue invoice for Acme and send a reminder," it reasons through the steps, calls a couple of tools, and reports success. Everyone nods. Then you put it in front of real traffic against real systems and it creates duplicate invoices on a retry, emails the wrong contact, or cheerfully reports success on an action that silently failed.

The reasoning was never the hard part. The hard part is the last mile: the layer where an agent stops talking and starts acting on systems of record like your CRM, your ticketing platform, or your ERP. That layer is ordinary, unglamorous distributed-systems engineering, and almost none of it is AI-specific. Here are the patterns that matter most.

Every Tool Is a Contract, Not a Suggestion

The single biggest source of agents "going rogue" is loose tool definitions. If a tool accepts free-form input and trusts the model to behave, the model eventually won't. Validate at the boundary, and put hard limits in code where the model can't talk its way past them.

from typing import Literal

from pydantic import BaseModel, Field

class SendReminder(BaseModel):
    invoice_id: str = Field(pattern=r"^INV-\d{8}$")
    # Literal types are enforced at validation time, so any other value is rejected.
    channel: Literal["email", "sms"]
    # The model cannot send to an arbitrary address; it picks an
    # on-file contact by role, and code resolves the actual destination.
    recipient_role: Literal["billing", "ap_clerk"]

def send_reminder(req: SendReminder) -> dict:
    invoice = load_invoice(req.invoice_id)        # 404s are real, handle them
    if invoice.status == "paid":
        return {"status": "skipped", "reason": "already_paid"}
    contact = resolve_contact(invoice.account_id, req.recipient_role)
    ...

Notice what the contract removes: the model never supplies a raw email address, never picks an invoice that doesn't match the ID format, and never overrides the "already paid" check. The agent proposes; deterministic code disposes.

Idempotency, Because Agents Retry

Agents retry. Networks fail mid-call. A user double-clicks. If the same logical action can execute twice and produce two effects, you have an incident waiting to happen, and "send payment" or "create ticket" are exactly the actions where a double-execution hurts.

Make state-changing actions idempotent with a key derived from the intent, not from a random ID generated per attempt:

from hashlib import sha256

def create_ticket(account_id: str, summary: str, body: str) -> dict:
    # Same logical request => same key => at most one ticket.
    idem_key = sha256(f"{account_id}:{summary}:{body}".encode()).hexdigest()
    # find-then-create is not atomic on its own; back it with a unique
    # constraint on the key so concurrent retries cannot both create.
    existing = tickets.find_by_idempotency_key(idem_key)
    if existing:
        return {"status": "exists", "ticket_id": existing.id}
    return tickets.create(account_id, summary, body, idempotency_key=idem_key)

If your downstream system supports idempotency keys natively (many payment and ticketing APIs do), pass them through. If it doesn't, enforce it in your own layer before the call.
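For the "enforce it in your own layer" case, here is a minimal single-process sketch. `IdempotentStore` and `create_once` are names I made up. The lock makes check-and-create atomic only within one process, so a multi-worker deployment would need a shared store with a unique constraint on the key instead.

```python
import hashlib
import threading

class IdempotentStore:
    """Remembers the result of each logical request, keyed by its intent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_key = {}

    def create_once(self, key_parts: tuple[str, ...], create):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        material = "".join(f"{len(part)}:{part}" for part in key_parts)
        key = hashlib.sha256(material.encode()).hexdigest()
        with self._lock:
            if key in self._by_key:
                return self._by_key[key], False   # replay: reuse the first result
            result = create()
            self._by_key[key] = result
            return result, True

store = IdempotentStore()
calls = []

def make_ticket():
    calls.append("created")              # stands in for the real downstream call
    return {"ticket_id": "T-1"}

intent = ("acct_1", "Printer down", "Office printer will not power on.")
first, created_first = store.create_once(intent, make_ticket)
retry, created_retry = store.create_once(intent, make_ticket)
```

The retry returns the original ticket and the downstream call runs once, no matter how many times the agent repeats the request.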

Permissions Belong to the User, Not the Agent

A subtle and dangerous mistake: running every agent action with the agent's own service-account privileges. Now any user who can chat with the agent can implicitly do anything the agent can do, including reading records they should never see.

The agent should act on behalf of the requesting user, carrying that user's authorization to every tool call. Retrieval is the easy place to get this wrong: filtering results after the fact is fragile, so scope the query itself so unauthorized records are never candidates.

def search_accounts(query: str, *, acting_user: User) -> list[Account]:
    # The user's scope is part of the query, not a post-filter.
    return crm.search(query, visibility=acting_user.account_scope)

Plan for Partial Failure and Honest Reporting

A multi-step action will sometimes get halfway and fail. The worst outcome is an agent that reports "Done!" when step three threw an exception. Two rules:

Never let the model narrate success it didn't verify. Tool results, not the model's optimism, determine what the agent tells the user. If a call failed, the failure propagates.
Decide your transaction story up front. Either make the sequence atomic where the systems allow it, or design compensating actions (if you created the order but the payment failed, you cancel the order). Silent half-completed workflows are how data integrity quietly erodes.

def fulfill(order):
    created = create_order(order)        # idempotent
    try:
        charge_payment(order)            # may fail
    except PaymentError:
        cancel_order(created.id)         # compensate, then surface the failure
        raise
    return created
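The first rule, never narrating unverified success, can be enforced the same way: build the user-facing status from the tool result in code rather than from the model's summary. A minimal sketch, assuming tool results carry a `status` field like the ones shown earlier ("sent", "skipped"). The `error` field and the wording of the replies are my own placeholders.

```python
def render_reply(tool_result: dict) -> str:
    """Derive the user-facing message from what the tool actually reported."""
    status = tool_result.get("status")
    if status == "sent":
        return "Reminder sent."
    if status == "skipped":
        reason = tool_result.get("reason", "unknown reason")
        return f"No reminder was sent ({reason})."
    # Anything else, including a missing status, is reported as a failure.
    error = tool_result.get("error", "no details available")
    return f"The reminder could not be sent: {error}."
```

The model can still phrase the surrounding conversation, but the claim about what happened comes from the system that did it.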

High-Stakes Actions Get a Human Checkpoint

Full autonomy is rarely the right design for actions that move money, delete data, or contact customers. The more reliable pattern is a confident draft plus a human approval step. This is frequently the difference between a system the business will actually authorize and one stuck in pilot forever, and it costs you very little: the agent does all the work, a human just clicks approve on the irreversible part.

Make the threshold explicit and enforce it in code:

def execute(action):
    if action.risk == "irreversible" or action.amount_cents > AUTO_LIMIT:
        return queue_for_human_approval(action)
    return run(action)

You Cannot Debug What You Did Not Trace

When a user says "the agent messed up my account," you need to replay exactly what happened, not reconstruct it from optimism and partial logs. Capture the full chain for every action: the user input, the model's tool selection and arguments, the raw tool results, and the final response. This is the same trace you'll use to build evaluations, so design it once and use it for both.

trace_id: 8f2c...
  user: "send the overdue reminder for Acme"
  acting_user: u_4471 (scope: account:acme)
  tool_call: send_reminder(invoice_id=INV-00038122, channel=email, recipient_role=billing)
  resolved_recipient: billing@acme.example   # resolved by code, not the model
  tool_result: {status: sent, message_id: m_99...}
  agent_reply: "Reminder sent to Acme's billing contact."

With this, "the agent messed up" becomes a five-minute investigation instead of a guessing game.
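A trace like that is easy to capture as a structured record and serialize per request. The sketch below uses only the standard library; the `Trace` and `ToolCall` classes and their field names are illustrative, chosen to mirror the example above rather than any tracing product's schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: dict          # the raw tool result, not the model's summary of it

@dataclass
class Trace:
    user_input: str
    acting_user: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tool_calls: list[ToolCall] = field(default_factory=list)
    agent_reply: str = ""

    def to_json(self) -> str:
        # asdict recurses into the nested ToolCall records.
        return json.dumps(asdict(self), sort_keys=True)

trace = Trace(user_input="send the overdue reminder for Acme", acting_user="u_4471")
trace.tool_calls.append(ToolCall(
    name="send_reminder",
    arguments={"invoice_id": "INV-00038122", "channel": "email",
               "recipient_role": "billing"},
    result={"status": "sent"},
))
trace.agent_reply = "Reminder sent to Acme's billing contact."
line = trace.to_json()    # one JSON line per request, ready for a log sink
```

Because the record is plain data, the same lines can be replayed later as evaluation cases.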

The Takeaway

An AI agent is only as good as the boundary between its reasoning and your systems of record. The intelligence gets the headlines, but reliability lives in the boring layer: typed tool contracts, idempotency, per-user authorization, partial-failure handling, human checkpoints for irreversible actions, and end-to-end tracing. Get that layer right and the agent becomes something the business can actually trust with real work. Skip it, and you have a very impressive demo.

I work on AI engineering at Wizr AI, a generative AI software development company where custom AI application development services are the day job. Happy to compare integration war stories in the comments.

The Two Places Generative AI Shows Up When You Ship a Custom AI Application

Marketing wizr — Tue, 30 Jun 2026 08:20:52 +0000

Most write-ups about "AI development" quietly conflate two very different activities. One is building software that uses generative AI as a core capability: copilots, retrieval systems, autonomous agents. The other is using generative AI to build software: code generation, test synthesis, legacy modernization. They share a buzzword and almost nothing else. The skills, the risks, and the discipline required are different, and teams that treat them as one thing tend to get burned on both.

If you're shipping a custom AI application, you will run into both at once. This post is a practical map of where each shows up, what tends to break, and how to keep the speed without inheriting the fragility.

Track 1: Generative AI as the Product

When the AI is the feature, the engineering challenge is not "call a model." It's everything wrapped around the call that decides whether the thing is correct, safe, and maintainable.

A few realities that separate a working demo from a deployable application:

Retrieval quality is the silent killer. Most custom AI apps lean on Retrieval-Augmented Generation (RAG). The naive version (embed documents, do a similarity search, stuff results into the prompt) hides every decision that actually matters. Fixed-size chunking severs context. Pure vector search whiffs on exact identifiers like error codes or SKUs, where a hybrid of dense vectors plus keyword search does far better. And if the model can answer without citing which retrieved chunk supports the claim, you have no mechanism to detect hallucination in production.

Use the smallest control structure that works. "Autonomous multi-agent system" is rarely the right starting point. Reliability drops and debugging cost climbs with every layer of autonomy you add. A ticket-classification task needs one well-prompted call with a typed output, not three agents deliberating. Reserve orchestration for problems that genuinely require planning and tool use you can't predetermine.

Guardrails live in code, not prompts. Anything that must always be true (a spending cap, a permission check, a rate limit) belongs in deterministic code that runs no matter what the model decided. A prompt is a request, not an invariant.

# The model can *propose* a refund. Code decides whether it's allowed.
def issue_refund(order_id: str, amount_cents: int) -> dict:
    # Authorize first, so an unauthorized request never reaches escalation.
    if not user_owns_order(current_user, order_id):
        raise PermissionError                     # never trust the prompt
    if amount_cents > MAX_AUTO_REFUND_CENTS:
        return {"status": "escalate_to_human"}   # invariant enforced here
    return process_refund(order_id, amount_cents)

Evaluation is non-negotiable. The defining question is "did that change make the system better or worse?" Without a versioned evaluation set you run on every prompt tweak and model swap, every change is a guess. It doesn't need to be huge; a few dozen well-chosen cases catch a surprising number of regressions. Because outputs are non-deterministic, your metrics should be thresholds over a sample, not pass/fail on one run.
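"Thresholds over a sample" can be as small as this. The sketch assumes each evaluation case can be wrapped in a zero-argument function that calls the model, grades the answer, and returns True or False. `pass_rate`, `meets_threshold`, and the stubbed case are illustrative names, and the 0.9 default is an arbitrary starting point, not a recommendation.

```python
from itertools import cycle
from typing import Callable

def pass_rate(run_case: Callable[[], bool], n_samples: int = 20) -> float:
    """Run the same case n_samples times and return the fraction that passed."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sum(1 for _ in range(n_samples) if run_case()) / n_samples

def meets_threshold(run_case: Callable[[], bool], threshold: float = 0.9,
                    n_samples: int = 20) -> bool:
    return pass_rate(run_case, n_samples) >= threshold

# Stub standing in for "call the model, grade the answer": passes 3 runs in 4.
_outcomes = cycle([True, True, True, False])

def flaky_case() -> bool:
    return next(_outcomes)

observed = pass_rate(flaky_case, n_samples=20)
```

A single run of that case would pass three times out of four and tell you nothing; the sampled rate of 0.75 against a 0.9 threshold is a clear, repeatable fail.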

Track 2: Generative AI as the Way You Build

The second track is generative AI accelerating the development lifecycle itself, and it follows one governing principle: generation is the cheap part; ownership is the expensive part. A model can produce fifty plausible lines in seconds. Reviewing, testing, securing, and maintaining those lines for years costs exactly as much as if a human wrote them.

What changes behavior is treating AI output as a draft and the reviewer as fully accountable for it. "The model wrote it" is not a defense in a postmortem. Watch for the failure modes assistants over-produce: subtly wrong edge cases (empty collections, timezones, integer truncation), hallucinated or outdated APIs, and security anti-patterns like string-concatenated SQL that they reproduce from training data.

The practical safeguard is a deterministic gate before human review. AI raises the volume of code flowing into review, so the automated floor under that review has to be solid enough to absorb the increase without humans rubber-stamping:

def gate(change):
    checks = [
        ("type-checks",          run_type_check(change)),
        ("linter clean",         run_linter(change)),
        ("no vuln patterns",     run_sast(change)),
        ("no secrets",           scan_for_secrets(change)),
        ("tests non-trivial",    tests_meaningful(change)),
        ("coverage not reduced", coverage_delta(change) >= 0),
    ]
    failed = [name for name, ok in checks if not ok]
    if failed:
        raise ReviewBlocked(f"AI change failed gates: {failed}")
    return "ready_for_human_review"

Test generation is the highest-leverage use, with one caveat about the direction of trust. Generating tests for existing, human-written code is safe and valuable, because the code is the trusted artifact. But when the model writes both the implementation and its tests, the tests tend to encode the implementation's bugs as "expected" behavior. Keep a human-authored specification of intended behavior as the anchor.

Legacy modernization is where this track is most seductive and most dangerous. A model will translate an old module into idiomatic modern code while silently dropping a side effect some downstream system depends on. The discipline that works: modernize in small increments, and use characterization tests (tests that capture the legacy code's existing behavior, quirks included) as the contract the new code must satisfy.
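A characterization test can be sketched in a few lines: record what the legacy function does today, exceptions included, and require the rewrite to match. Both formatting functions below are invented stand-ins for "old module" and "modernized module", and the input list is something you would choose to cover the quirks you know about.

```python
def legacy_format_amount(cents: int) -> str:
    # Stand-in for legacy code. Quirk: negatives are wrapped in parentheses.
    dollars, remainder = divmod(abs(cents), 100)
    text = f"${dollars}.{remainder:02d}"
    return f"({text})" if cents < 0 else text

def modern_format_amount(cents: int) -> str:
    # Stand-in for the rewrite; it must preserve the parentheses quirk.
    text = f"${abs(cents) // 100}.{abs(cents) % 100:02d}"
    return f"({text})" if cents < 0 else text

def characterize(fn, inputs):
    """Record current behavior, exceptions included, as the contract."""
    snapshot = {}
    for value in inputs:
        try:
            snapshot[value] = ("ok", fn(value))
        except Exception as exc:
            snapshot[value] = ("raises", type(exc).__name__)
    return snapshot

INPUTS = [0, 5, 199, -250, 100000]
baseline = characterize(legacy_format_amount, INPUTS)   # capture before rewriting
candidate = characterize(modern_format_amount, INPUTS)
```

If a model-written rewrite had "fixed" the parentheses into a minus sign, `candidate` would differ from `baseline` and the increment would be blocked before any downstream system noticed.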

Where the Two Tracks Meet

Shipping a custom AI application means running both tracks at the same time, and the same engineering values turn out to govern each:

Determinism around non-determinism. Whether it's a model deciding to issue a refund or a model writing the refund code, the safety net is deterministic checks that don't depend on the model behaving.
Evaluation over vibes. Track 1 needs faithfulness evals; Track 2 needs change-failure rate and defect-escape rate. Both replace "it worked when I tried it" with a measurement.
Human accountability at the boundary. A high-stakes agent action gets a human approval checkpoint; a high-stakes code change gets a human reviewer who owns it. Same pattern.

A rough readiness check before you call a custom AI application "done":

[ ] Retrieval returns cited, verifiable sources
[ ] Hard business rules enforced in code, not prompts
[ ] Versioned eval set runs on every model/prompt change
[ ] Full request tracing captured (input → retrieval → prompt → output)
[ ] Idempotency on every state-changing action
[ ] Graceful degradation when the model is unavailable
[ ] PII handling and access control on retrieval
[ ] AI-assisted code passed the same gates as human code

If most of those boxes are empty, you have a prototype, not a product, no matter how good the demo looked.

Closing Thought

The hype frames generative AI as a single revolution. In practice it's two distinct disciplines that happen to share a name, and a custom AI application sits at their intersection. The teams that win aren't the ones generating the most code or wiring up the most agents. They're the ones whose evaluation, guardrails, and review gates are strong enough that more AI (in the product and in the process) makes them faster without making them fragile.

I work on AI engineering at Wizr AI, where we build custom AI applications and use generative AI across the software development lifecycle. These tradeoffs are part of the daily job. Happy to compare notes in the comments.