SciForce

Posted on May 20

DevOps Meets Generative AI: Building, Testing, and Deploying LLM-Powered Apps

#ai #devops #llm

Last spring, OpenAI released a GPT-4o update that made the model hard to trust: it returned sycophantic and less reliable answers than usual, even though nothing was changed in users’ prompts and workflows.

When an LLM system starts drifting in production, the deployment history doesn’t catch it early: nothing changed in the codebase, and the provider announced no official update. Meanwhile, the provider might have adjusted a classifier without notice, and a request that worked fine yesterday starts returning confidently wrong answers today.

If you are already running delivery pipelines, the entire process looks familiar. However, an LLM pipeline has a different kind of release object, where a minor change in prompt, model version, or guardrail can alter system behavior, even though the main codebase was never touched.

What Shapes LLM Production Behavior

While application code gets versioned carefully, changes to prompts, retrieval settings, and guardrails often happen without a formal record, making it harder to identify what exactly caused the drift in model behavior.

- Prompts
Sometimes the cause of a regression is a minor change to the system prompt: someone rewrites a sentence to target one edge case, and an unrelated query category unexpectedly starts performing worse. This happens when multiple people can edit the prompt directly, which leaves the edit outside the release record.

- Model versions
In May 2025, Google redirected two dated Gemini endpoints to a newer model without notice. Developers building on gemini-2.5-pro-preview-03-25 found that the software behaved differently than the day before. Afterward, Google updated its documentation to clarify what “stable” and “preview” meant for different endpoint types. If the app behaves oddly, the provider might have updated the model without notice, so it is worth checking which exact model versions show up in your API responses.

- Retrieval configuration and source data
In RAG systems, answers can drift because the index got stale or because someone changed the chunking, ranker, top-k, or embedding model – none of these requires the app to throw an error. As a result, a financial reporting assistant can start citing figures from outdated quarterly reports because the knowledge base was updated without refreshing the index.

- Guardrails
Guardrail rules are often managed outside the main app release process. The compliance team might tighten a refusal rule in a separate console, and the app starts rejecting queries that used to work, without any change on the engineering side.

- Evaluation
A test set built when the product launched doesn't automatically update as the product evolves. A model can keep passing eval while production has moved on: the query mix has shifted, and cases that were rare at launch now make up much of the workload.
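
For the model-version item above, one cheap safeguard is to compare the model ID you pinned at release time with the one the provider reports on each response. The sketch below assumes the response body exposes the serving model under a `model` key, which many chat-completion APIs do but not all; the model IDs and the `check_served_model` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCheck:
    pinned: str    # model ID the release was evaluated against
    served: str    # model ID reported in the API response
    drifted: bool  # True when the two differ


def check_served_model(pinned: str, response: dict) -> ModelCheck:
    """Compare the pinned model ID with the one the provider reports."""
    # Assumption: the response exposes the serving model under "model".
    served = str(response.get("model", "unknown"))
    return ModelCheck(pinned=pinned, served=served, drifted=(served != pinned))


# Hypothetical response where a dated endpoint was silently redirected.
response = {"model": "example-model-2025-05-06", "output": "..."}
result = check_served_model("example-model-2025-03-25", response)
if result.drifted:
    print(f"model drift: pinned {result.pinned}, served {result.served}")
```

Logging this check on every call turns a silent provider change into a timestamped event you can line up against quality metrics.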

Building the delivery pipeline

In traditional software delivery, the release surface is mostly code. In LLM systems it expands to include prompts, model versions, retrieval configuration, and guardrails – components that affect production behavior just as much as the application, but rarely get the same release controls.

Knowing when a release is good enough to ship

In a traditional release you have to make sure that the software runs correctly. When deploying an LLM system, you have to make sure that it behaves acceptably and safely across the full range of inputs it will encounter in production.

Golden prompts

Golden prompts are fixed test cases that reflect what the system is supposed to do. For a customer support assistant, each case checks whether the assistant correctly identified the issue, pointed to the right support article, avoided making things up, and escalated when necessary.

When preparing a release, each golden prompt is checked on those dimensions against pass/fail criteria defined before the evaluation. Some checks can be automated, while ambiguous, user-facing, or high-risk outputs still need human attention. Not every failure is equally important: a failure to escalate or a wrong citation blocks the release immediately, while slightly worse phrasing on a low-traffic query probably doesn't.
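
A minimal sketch of such a gate, assuming each golden prompt has already been scored into named pass/fail checks. The check names, the prompt IDs, and the choice of which checks are blocking are illustrative, not taken from a real project.

```python
from dataclasses import dataclass

# Assumption: illustrative check names; these two block a release on failure.
BLOCKING_CHECKS = {"escalated_when_needed", "citation_correct"}


@dataclass
class GoldenResult:
    prompt_id: str
    checks: dict[str, bool]  # check name -> passed


def gate(results: list[GoldenResult]) -> tuple[bool, list[str], list[str]]:
    """Return (ship, blocking_failures, warnings) for a release candidate."""
    blocking, warnings = [], []
    for r in results:
        for name, passed in r.checks.items():
            if passed:
                continue
            label = f"{r.prompt_id}:{name}"
            (blocking if name in BLOCKING_CHECKS else warnings).append(label)
    return (not blocking, blocking, warnings)


results = [
    GoldenResult("refund-01", {"issue_identified": True, "citation_correct": True,
                               "escalated_when_needed": True, "phrasing_ok": False}),
    GoldenResult("outage-02", {"issue_identified": True, "citation_correct": False,
                               "escalated_when_needed": True, "phrasing_ok": True}),
]
ship, blocking, warnings = gate(results)
print(ship, blocking, warnings)
```

Here the wrong citation on `outage-02` blocks the release, while the phrasing issue on `refund-01` is only reported as a warning.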

Baseline comparison

Eval scores are less stable than they look. One study on prompt sensitivity found accuracy swings of up to 76 accuracy points from formatting differences alone, with no change to meaning. That is why every candidate release needs to be measured against the production version on the same eval set: without that reference, even a strong score can be a regression from what is already running.
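
One way to make the comparison concrete is to score the production version and the candidate on the same eval set and diff them per category rather than only in aggregate. The sketch below uses made-up scores and a hypothetical tolerance of 0.02.

```python
def compare_to_baseline(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumption: illustrative noise margin
) -> dict[str, tuple]:
    """List categories where the candidate scores below the baseline."""
    regressions = {}
    for case_id, base_score in baseline.items():
        cand_score = candidate.get(case_id)
        if cand_score is None:
            # Category missing from the candidate run: treat as a regression.
            regressions[case_id] = (base_score, None)
        elif base_score - cand_score > tolerance:
            regressions[case_id] = (base_score, cand_score)
    return regressions


# Made-up per-category scores for the production and candidate versions.
baseline = {"billing": 0.90, "refunds": 0.85, "login": 0.95}
candidate = {"billing": 0.99, "refunds": 0.75, "login": 0.98}

regressions = compare_to_baseline(baseline, candidate)
print(regressions)
```

In this example the candidate's average is slightly higher than the baseline's, yet the `refunds` category dropped from 0.85 to 0.75, which only the per-category comparison reveals.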

Controlled rollout

Staged deployment strategies let you validate the release in production before committing to it fully. Shadow testing sends user requests through both the current and the new version in parallel, but users only see the responses from the current one. Canary testing goes further and shows the new version's responses to a small share of real users. If something goes wrong, you catch it on a small slice of traffic and roll back before it spreads. Before you start, decide what "something is wrong" means: worse reply quality, more refusals, or higher cost per query.
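
The two pieces of a canary rollout described above can be sketched as a deterministic traffic split plus a rollback rule written down before the rollout begins. The metric names and limits here are assumptions chosen for illustration.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a stable slice of users in the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


# Assumption: illustrative limits, agreed on before the rollout starts.
LIMITS = {"refusal_rate": 0.05, "cost_per_query_usd": 0.02, "quality_drop": 0.03}


def should_roll_back(current: dict, canary: dict) -> list[str]:
    """Return the reasons to roll back; an empty list means keep going."""
    reasons = []
    if canary["refusal_rate"] > LIMITS["refusal_rate"]:
        reasons.append("refusal_rate")
    if canary["cost_per_query_usd"] > LIMITS["cost_per_query_usd"]:
        reasons.append("cost_per_query_usd")
    if current["quality"] - canary["quality"] > LIMITS["quality_drop"]:
        reasons.append("quality_drop")
    return reasons


current = {"quality": 0.90, "refusal_rate": 0.02, "cost_per_query_usd": 0.015}
canary = {"quality": 0.89, "refusal_rate": 0.08, "cost_per_query_usd": 0.015}
reasons = should_roll_back(current, canary)
print(reasons)
```

Hashing the user ID keeps each user on the same version across requests, so a canary user does not flip between old and new behavior mid-session.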

Versioning

A quality gate is only as good as the release record behind it. If the record doesn't include the exact versions of the prompt, retrieval and guardrail configuration, eval set, and embedding model that are going live, you might be testing last week's setup.

A change to any one of them should trigger re-evaluation, because even a single edit can change the behavior of the whole system.
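
A release record like this can be as simple as a manifest whose fingerprint changes whenever any component changes. The sketch below is one possible shape; the field names, versions, and model IDs are hypothetical.

```python
import hashlib
import json


def fingerprint(manifest: dict) -> str:
    """Stable short hash of a release manifest (key order does not matter)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_components(old: dict, new: dict) -> list[str]:
    """Names of top-level components that differ between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Assumption: illustrative manifest of what was evaluated.
evaluated = {
    "model": "example-model-2025-03-25",
    "system_prompt_sha": hashlib.sha256(b"You are a support assistant.").hexdigest(),
    "retrieval": {"chunk_size": 512, "top_k": 5, "embedding_model": "example-embed-v1"},
    "guardrails_version": "v7",
    "eval_set_version": "2025-05-01",
}
# Someone bumps top_k after the evaluation ran.
going_live = {**evaluated, "retrieval": {**evaluated["retrieval"], "top_k": 8}}

needs_reeval = fingerprint(evaluated) != fingerprint(going_live)
print(needs_reeval, changed_components(evaluated, going_live))
```

Storing the fingerprint next to each eval run makes it easy to refuse a deploy when the manifest going live does not match the one that was tested.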

Deploying without losing the gains

Clearing every quality gate doesn't guarantee a smooth release. Inference workloads fail differently from standard web apps under concurrency: adding hardware doesn't resolve bottlenecks caused by provider-side rate limits or by a queue backing up under long-context requests.

Cost behavior is also harder to predict than token billing alone would suggest. Context growth in lengthy conversations, retrieval payloads, tool-call recursion, and retry loops on failed calls all compound, making inference account for 80–90% of total cost of ownership in production GenAI deployments. One way to cut inference costs is query routing: routine lookups run faster and cheaper through deterministic search or rule-based logic than through the LLM.
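
To see why context growth compounds, consider a chat where the full history is resent on every turn. The sketch below counts input tokens only, under assumed sizes (a 500-token system prompt, 100-token user messages, 300-token replies); real numbers depend on your tokenizer and on any history truncation you apply.

```python
def conversation_input_tokens(
    turns: int, system_tokens: int, user_tokens: int, reply_tokens: int
) -> int:
    """Total input tokens billed when the full history is resent each turn."""
    total = 0
    history = 0  # tokens of earlier user messages and replies
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total


# Assumed sizes, for illustration only.
ten_turns = conversation_input_tokens(10, 500, 100, 300)
twenty_turns = conversation_input_tokens(20, 500, 100, 300)
print(ten_turns, twenty_turns)  # 24000 88000
```

Doubling the conversation length from 10 to 20 turns raises input tokens from 24,000 to 88,000, roughly 3.7 times, because every new turn pays again for everything said before it.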

Keeping it reliable once it's live

Once the system is in production, the question shifts from whether it behaves correctly to whether you know when it stops. Factors that affect production LLM behavior, such as provider update, guardrail adjustment, or users phrasing requests differently, don't always leave obvious signals, and the challenge is to catch the shifts earlier than users do.

Monitoring what matters

Specific metrics, such as retry volume and path shifts, can catch tool-use problems early, but without them the signal usually becomes visible only when the bill arrives and users start complaining. Cost growth is easy to overlook as a monitoring problem because it compounds slowly, and failed calls contribute to it too: Azure’s documentation confirms that content filter rejections and timeouts get billed even when processing fails. Set cost thresholds in advance and monitor against them: cost per query, cost per workflow, token growth, and retry spend.
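
A minimal sketch of threshold-based cost alerts over a ledger of calls. The record fields and threshold values are assumptions; failed calls stay in the ledger because, as noted above, they can still be billed.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    workflow: str
    cost_usd: float
    is_retry: bool = False
    succeeded: bool = True  # failed calls are kept: they may still be billed


# Assumption: illustrative thresholds, set before they are needed.
THRESHOLDS = {"cost_per_query_usd": 0.03, "retry_spend_share": 0.15}


def cost_alerts(records: list[CallRecord]) -> list[str]:
    """Return the names of thresholds exceeded by this batch of calls."""
    total = sum(r.cost_usd for r in records)
    queries = sum(1 for r in records if not r.is_retry)
    retry_spend = sum(r.cost_usd for r in records if r.is_retry)
    alerts = []
    if queries and total / queries > THRESHOLDS["cost_per_query_usd"]:
        alerts.append("cost_per_query_usd")
    if total and retry_spend / total > THRESHOLDS["retry_spend_share"]:
        alerts.append("retry_spend_share")
    return alerts


records = [
    CallRecord("report", 0.01),
    CallRecord("report", 0.01, succeeded=False),  # timed out, still billed
    CallRecord("report", 0.01, is_retry=True),    # retry of the failed call
    CallRecord("lookup", 0.01),
]
alerts = cost_alerts(records)
print(alerts)
```

In this batch the cost per query looks fine, but a quarter of the spend went to retries, which is the kind of slow leak that never shows up as an error.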

Where human judgment stays in the loop

While automated evaluation catches a lot, it misses things a human would notice. Automated checks can let confidently wrong answers through, while a human looking at real outputs over time would spot patterns: the system consistently mishandling certain types of requests, or plausible but wrong answers becoming more frequent.

Ownership, decisions, and accountability

Governance in LLM systems tends to fail quietly, and usually for the same reason: basic questions have no owner. Who can block a release? What counts as a production incident? What happens when output quality drops after a provider update nobody initiated?

When responsibility for the app, user experience, guardrails, and eval set is split across different departments, these questions often go unanswered. As a result, when something breaks with no trace in the codebase, there is no designated person to decide whether the regression is acceptable or whether to declare an incident.

What this looks like in practice

One client’s enterprise performance management platform was slow, expensive, and hard to debug. Two problems were compounding each other.

The first was routing: simple queries that could be handled by a database call were being processed by the LLM instead, just like complex analytical tasks. Based on internal benchmarking, making a database call would have been roughly 40x cheaper and 10x faster.

The second was traceability: the platform had been built with a separate ML model for each end customer, so when outputs degraded, there was no reliable way to tell whether the cause was the model, the retrieval configuration, or something else.

What we changed

We replaced the per-client model architecture with a shared vector search foundation and added rule-based routing that directs simple lookups to the database and complex ones to the LLM. To handle complex requests, we tested several models on client data: GPT-4, GPT-4o, GPT-4o-mini, Mistral, and Mixtral. GPT-4o-mini offered the best balance, matching the effectiveness of GPT-4o at a lower cost.

All prompts, retrieval settings, and guardrails were versioned, making it possible to assess each release candidate based on consistent benchmarks.

For the routing layer, we built a dedicated test set and regression checks, and configured periodic recalibration as user queries evolved. The hybrid architecture was no simpler, but it was testable and versioned, which made it easier to manage than the original one.
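
To give a sense of what a rule-based routing layer with its own test set can look like, here is a small sketch. It is not the production router from this project: the patterns, metric names, and test queries are hypothetical, and a real rule set would be derived from the actual query logs.

```python
import re

# Assumption: hypothetical patterns for queries a database call can answer.
LOOKUP_PATTERNS = [
    re.compile(r"^(what is|show|get|list)\b.*\b(revenue|headcount|budget)\b", re.I),
    re.compile(r"^\s*status of\b", re.I),
]
# Words that suggest analysis rather than a plain lookup.
ANALYTICAL_HINTS = re.compile(
    r"\b(why|explain|compare|forecast|trend|recommend)\b", re.I
)


def route(query: str) -> str:
    """Return "database" for simple lookups and "llm" for everything else."""
    if ANALYTICAL_HINTS.search(query):
        return "llm"
    if any(p.search(query) for p in LOOKUP_PATTERNS):
        return "database"
    return "llm"  # when unsure, prefer the more capable (and costlier) path


# A tiny routing test set; rerun it whenever the rules change.
ROUTING_CASES = [
    ("Show Q3 revenue for EMEA", "database"),
    ("What is the headcount in Berlin?", "database"),
    ("Why did Q3 revenue drop in EMEA?", "llm"),
    ("Compare budget and actuals for 2024", "llm"),
    ("Hello", "llm"),
]
failures = [(q, want, route(q)) for q, want in ROUTING_CASES if route(q) != want]
print(failures)  # []
```

Defaulting to the LLM when no rule matches trades some cost for safety: a misrouted complex query is usually worse than an overpaid simple one.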

Results

LLM usage dropped by 37-46% depending on workload type, and latency for simple lookups improved by 32-38%. 68% fewer outputs were flagged as irrelevant or misleading. Manual reconciliation work (the analyst time spent catching and correcting output errors) decreased by 58%.

Conclusion

There's usually a moment, somewhere between the successful demo and the first production incident, when the operational gap becomes obvious. A useful starting point: if something went wrong with your current system today – output degrading, behavior shifting, costs spiking – could you tell within an hour what combination of model, prompt, retrieval configuration, and source data caused it? If the answer is no, that's where to start.

If you want to run that diagnostic on your current system, we're happy to do it with you.

Want to make your LLM systems more reliable, scalable, and cost-efficient in production? Read our articles about LLM and DevOps on the blog 👉 https://sciforce.solutions/blog?tag=LLM&tag=dev-ops

DEV Community