Most teams that create a second backend for AI do not do it because they have a scaling problem. They do it because the feature feels unfamiliar.
That is usually a bad reason.
If your product already has authentication, tenant scoping, billing, permissions, jobs, observability, and domain models, then the cheapest place to add AI is almost always inside the system that already owns those concerns. Spinning up a separate AI service too early means you have to rebuild all of that plumbing around a feature that often only needed one new job queue, one new persistence model, and a few guarded model calls.
So the recommendation up front is blunt: keep AI features inside your existing full stack app until you hit a real boundary that justifies extraction. A real boundary means independent scaling pressure, a different runtime with serious operational needs, hard isolation requirements, or a capability that is genuinely becoming a shared platform. Not excitement. Not architecture fashion. Not a diagram that looks more “AI-native.”
The model call is not the product boundary
The most common architectural mistake is treating the LLM invocation as the center of the feature.
It is not.
The product boundary is still defined by your business rules: who can do what, against which records, under what limits, with what audit trail, and with what downstream side effects. The model call is just one step in that workflow. Sometimes it is a costly step. Sometimes it is slow. Sometimes it is flaky. But it is still a step.
That distinction matters because once you split the AI workflow into a second backend, you introduce a second place where business context has to be reconstructed. Now the AI service needs to know which tenant the request belongs to, which plan the user is on, whether the action is allowed, what data should be visible, how failures are logged, and what to do if the user retries halfway through. Your main app already knows all of that.
What duplication looks like in practice
Teams usually describe the new service as “just an inference layer.” Three weeks later it owns a lot more than inference:
- request signing between services
- duplicated authorization assumptions
- new queue and retry policies
- a second set of logs and traces
- webhook or polling glue for async completions
- serialization rules for domain objects
- out-of-sync read models
- another deployment surface to monitor during incidents
None of those are individually catastrophic. Together they create a permanent tax.
Why AI features are usually tighter to product state than teams admit
Many AI features are not free-floating compute tasks. They are deeply tied to the application’s existing data and rules.
A support reply generator needs access to the ticket, customer history, internal notes, refund policy, and agent permissions. A document summarizer needs the source file, workspace settings, visibility rules, and storage lifecycle. A product-description generator needs catalog attributes, brand voice constraints, approval workflow, and publishing permissions.
Once you see the workflow clearly, the correct default gets obvious: the app should own orchestration because the app already owns meaning.
The default architecture that usually wins
For most SaaS and internal products, the right first version is simple:
- The main app receives the request.
- The main app validates input and authorizes the action.
- The main app persists an AI run record.
- A queued job performs the model work.
- The result is stored back into the same system of record.
- The UI reads status and output from the main app.
This design is not glamorous. It is stable, debuggable, and cheap.
If you are on Laravel, this maps directly onto Queues and Horizon. Long-running or expensive tasks belong in jobs. You do not need a second backend just because the work should not block the web request.
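The lifecycle implied by the steps above can be sketched as a small state machine. This is a minimal sketch in plain PHP 8.1+, independent of any framework. The `RunStatus` enum, the `running` state, and the retry transition from `failed` back to `queued` are assumptions; the post itself only names `queued`, `completed`, and `failed`.

```php
<?php

// Hypothetical status enum for an ai_runs row. Only "queued", "completed",
// and "failed" come from the article; "running" is an assumed extra state.
enum RunStatus: string
{
    case Queued = 'queued';
    case Running = 'running';
    case Completed = 'completed';
    case Failed = 'failed';

    /** Statuses this status may legally move to. */
    public function allowedNext(): array
    {
        return match ($this) {
            self::Queued => [self::Running, self::Failed],
            self::Running => [self::Completed, self::Failed],
            // Assumption: a failed run may be re-queued for a retry.
            self::Failed => [self::Queued],
            // Completed is terminal.
            self::Completed => [],
        };
    }

    public function canTransitionTo(RunStatus $next): bool
    {
        return in_array($next, $this->allowedNext(), true);
    }
}
```

Guarding every status update with `canTransitionTo` is one way to keep a late retry from overwriting a run that already completed.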
A concrete baseline
Start with a durable ai_runs table instead of a direct synchronous call from controller to model provider. That one choice fixes a surprising number of problems.
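public function generateReply(SupportTicket $ticket, GenerateReplyRequest $request)
{
    // Policy check runs before any model token is spent.
    $this->authorize('replyWithAi', $ticket);

    // Persist the run first so the request has a durable record.
    $run = AiRun::create([
        'tenant_id' => auth()->user()->tenant_id,
        'user_id' => auth()->id(),
        'feature' => 'support_reply',
        'status' => 'queued',
        'subject_type' => $ticket::class,
        'subject_id' => $ticket->id,
        'input' => [
            // Store plain strings rather than Stringable objects.
            'tone' => $request->string('tone')->toString(),
            'goal' => $request->string('goal')->toString(),
        ],
    ]);

    // The job receives only the run ID and reloads state from the database.
    GenerateSupportReply::dispatch($run->id);

    // 202 Accepted: the client reads status and output later using run_id.
    return response()->json([
        'run_id' => $run->id,
        'status' => 'queued',
    ], 202);
}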
This keeps the user-facing contract inside the main app. The controller can enforce plan limits, feature flags, and policy checks before any model token is spent.
Why the run record matters
A proper run record gives you:
- idempotency for retries
- a place to store prompt inputs and artifacts
- lifecycle visibility from queued to completed or failed
- cost attribution per tenant or feature
- a recovery path when a provider call times out
- a place to attach moderation, human review, or rollback flags later
Without it, teams often end up with stateless model calls that are impossible to reason about when users say, “I clicked generate twice and got two different drafts but only one saved.”
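The "clicked generate twice" case can be sketched with an idempotency key on the run record. This is a minimal sketch in plain PHP, not the post's implementation: `runIdempotencyKey` and `InMemoryRunStore` are hypothetical names, and the array stands in for a unique index on an `ai_runs` column.

```php
<?php

// Hypothetical helper: derive a stable key so that a double click or a client
// retry maps to the same run instead of creating a second one.
function runIdempotencyKey(
    int $tenantId,
    int $userId,
    string $feature,
    string $subjectType,
    int $subjectId,
    array $input
): string {
    ksort($input); // key order must not change the hash
    $payload = [$tenantId, $userId, $feature, $subjectType, $subjectId, $input];

    return hash('sha256', json_encode($payload, JSON_THROW_ON_ERROR));
}

// In-memory stand-in for the ai_runs table. A real app would rely on a
// unique index on the key column instead of an array lookup.
final class InMemoryRunStore
{
    /** @var array<string, array> runs indexed by idempotency key */
    private array $runsByKey = [];
    private int $nextId = 1;

    /** Returns the existing run for this key, or creates a queued one. */
    public function firstOrCreate(string $key, array $attributes): array
    {
        if (!isset($this->runsByKey[$key])) {
            $this->runsByKey[$key] = ['id' => $this->nextId++, 'status' => 'queued'] + $attributes;
        }

        return $this->runsByKey[$key];
    }

    public function count(): int
    {
        return count($this->runsByKey);
    }
}
```

A key built only from inputs also blocks deliberate regeneration. If users should be able to ask for a fresh draft, one option is to mix a client-supplied request ID into the key.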
Prompt construction should stay close to the use case
Another overcorrection is centralizing prompts too early into a generic “AI gateway.” That sounds clean until every product change needs edits in a shared abstraction that no feature team fully owns.
For feature-specific behavior, keep prompt builders near the feature. The code that understands what a valid support draft or compliant product description looks like should live next to the domain logic.
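use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class GenerateSupportReply implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // The job carries only the run ID; everything else is reloaded in handle().
    public function __construct(public int|string $runId)
    {
    }

    // OpenAIClient is resolved from the container; bind your provider client to it.
    public function handle(OpenAIClient $client): void
    {
        // Reload the run together with the records the prompt needs.
        $run = AiRun::with('subject.customer', 'subject.workspace')->findOrFail($this->runId);
        $ticket = $run->subject;

        // The prompt is built next to the feature, from a feature-owned view.
        $messages = [
            [
                'role' => 'system',
                'content' => 'Write a calm, policy-compliant support reply. Never invent refunds or promises.',
            ],
            [
                'role' => 'user',
                'content' => view('prompts.support-reply', [
                    'ticket' => $ticket,
                    'input' => $run->input,
                ])->render(),
            ],
        ];

        $response = $client->responses()->create([
            'model' => 'gpt-5.5',
            'input' => $messages,
        ]);

        // Write the result back to the same system of record.
        $run->update([
            'status' => 'completed',
            'output' => ['text' => $response->output_text],
            'completed_at' => now(),
        ]);
    }
}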
That is not an argument against reuse. It is an argument for the correct level of reuse. Shared provider clients, response parsers, safety filters, and retry middleware make sense. A giant cross-product prompt abstraction often does not.
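Retry middleware is a good example of reuse at the right level. The following is a minimal sketch in plain PHP 8, under stated assumptions: `TransientProviderError` and `withRetries` are hypothetical names, and which failures count as transient is something your provider client would have to decide.

```php
<?php

// Hypothetical exception a shared provider client could throw for failures
// that are worth retrying (timeouts, rate limits, 5xx responses).
final class TransientProviderError extends RuntimeException
{
}

/**
 * Calls $operation, retrying on TransientProviderError with exponential
 * backoff. $sleep is injectable so tests and jobs can control waiting.
 */
function withRetries(
    callable $operation,
    int $maxAttempts = 3,
    int $baseDelayMs = 200,
    ?callable $sleep = null
): mixed {
    $sleep ??= static fn (int $ms) => usleep($ms * 1000);

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation($attempt);
        } catch (TransientProviderError $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts: let the job mark the run failed
            }
            // Delay doubles each time: 200 ms, 400 ms, 800 ms, ...
            $sleep($baseDelayMs * 2 ** ($attempt - 1));
        }
    }
}
```

A feature job would wrap only the provider call in `withRetries` and keep its prompt building outside, so the shared piece stays free of product logic.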
What you gain by staying in one backend longer
The biggest benefit is not fewer repos. It is that you preserve coherence.
AI features fail in weird ways. They time out, partially complete, return malformed output, hit policy blocks, or produce content that should be reviewed before use. When all of that happens inside your primary application boundary, the recovery path is much cleaner.
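Here is a hedged sketch of what that recovery path can look like: one function decides what gets written to the run for each kind of outcome. It assumes the model was asked to return JSON with a `text` key, which is an assumption of this example, not something the post specifies. The `needs_review` status and the error codes are likewise hypothetical.

```php
<?php

/**
 * Decides what to store on the run for a raw provider result.
 * Status names other than "completed" and "failed" are assumptions.
 *
 * @param string|null $rawOutput null models a timeout or an empty response
 * @return array{status: string, output: ?array, error: ?string}
 */
function classifyModelOutput(?string $rawOutput, bool $requiresReview): array
{
    if ($rawOutput === null || trim($rawOutput) === '') {
        return ['status' => 'failed', 'output' => null, 'error' => 'empty_or_timed_out'];
    }

    $decoded = json_decode($rawOutput, true);
    if (!is_array($decoded) || !isset($decoded['text']) || !is_string($decoded['text'])) {
        // Malformed output never reaches the product surface; record why.
        return ['status' => 'failed', 'output' => null, 'error' => 'malformed_output'];
    }

    return [
        'status' => $requiresReview ? 'needs_review' : 'completed',
        'output' => ['text' => $decoded['text']],
        'error' => null,
    ];
}
```

Because the function returns plain data, the job can store it on the run in one update, and the UI can render each status without knowing anything about the provider.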
Auth and tenancy stay boring
This sounds trivial until you have lived through the alternative.
If the main app owns the feature, your existing authorization layer remains the source of truth. The job can load the exact record the user was allowed to act on. Tenant scoping is already attached to that record. Audit trails stay aligned with the user and workspace that triggered the action.
If you extract too early, you start serializing domain context into payloads, signing requests between services, and hoping the receiving service interprets access assumptions the same way the main app would have.
That is how security bugs become “integration misunderstandings.”
Observability stays attached to user intent
The main app already knows the request path, actor, feature flag state, tenant, billing plan, and subject record. That context is gold during debugging.
When the AI feature stays inside the app, your logs and traces can answer questions like:
- which user triggered the run
- which record it operated on
- what prompt version was used
- how many retries occurred
- whether the UI displayed the result
- whether the result was accepted, edited, or discarded
That is much more useful than “service B returned 500 after 8.3 seconds.”
Queues are already the correct async boundary
A lot of premature service extraction is really just a queue problem in disguise.
The team knows the feature should not run inline with the request. Good instinct. But instead of using jobs, status tables, and async UI patterns, they jump to “this must be a separate backend.”
No. It usually means you need a proper background workflow.
For long-running provider operations, you can also use provider-native async features. OpenAI’s background mode exists specifically for long-running responses that should survive request boundaries more reliably. That still does not require handing product ownership to a second service. Your app can submit the request, persist the response ID, and poll or resume while keeping the workflow tied to your domain model.
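The submit, persist, and poll flow can be sketched as follows. This is a minimal sketch in plain PHP: `BackgroundProvider` is a hypothetical interface invented for illustration, and its method names are not the API of any real SDK. The `$run` array stands in for a row in `ai_runs`, and the `running` status is an assumption.

```php
<?php

// Hypothetical abstraction over a provider's asynchronous mode. The method
// names are illustrative and are NOT the API of any real SDK.
interface BackgroundProvider
{
    /** Starts a long-running generation and returns the provider's response ID. */
    public function submit(array $payload): string;

    /** @return array{status: string, text: ?string} status: in_progress|completed|failed */
    public function fetch(string $responseId): array;
}

/** First job: submit the work and persist the provider ID on the run. */
function startBackgroundRun(array &$run, BackgroundProvider $provider): void
{
    $run['provider_response_id'] = $provider->submit($run['input']);
    $run['status'] = 'running';
}

/** Polling job: returns true once the run has reached a terminal state. */
function pollBackgroundRun(array &$run, BackgroundProvider $provider): bool
{
    $result = $provider->fetch($run['provider_response_id']);

    if ($result['status'] === 'in_progress') {
        return false; // the caller re-queues the poll with a delay
    }

    $run['status'] = $result['status'] === 'completed' ? 'completed' : 'failed';
    $run['output'] = $result['text'] !== null ? ['text' => $result['text']] : null;

    return true;
}
```

The point of the shape is that the provider's ID is just another column on the run. Ownership of the workflow never leaves the app.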
Rollout control is far easier
AI features should almost never launch globally at full power on day one. You want feature flags, per-tenant controls, scenario restrictions, usage quotas, and fallback modes.
Those controls usually already exist in the app.
If you move the feature into a separate service early, now either the service must reimplement rollout logic or the main app must pass increasingly complicated execution policy with every request. Both options are worse than simply letting the app own the rules.
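Letting the app own the rules can be as small as one pre-flight check before a run is created. The sketch below is plain PHP under stated assumptions: `denyReason`, the tenant array shape, and the reason codes are all hypothetical, and a real app would read these values from its existing flag and billing systems.

```php
<?php

/**
 * Hypothetical pre-flight check the app runs before creating a run.
 * Returns null when the run may proceed, or a machine-readable denial reason.
 *
 * @param array{ai_enabled: bool, allowed_features: string[], monthly_quota: int} $tenant
 */
function denyReason(array $tenant, string $feature, int $runsThisMonth): ?string
{
    if (!$tenant['ai_enabled']) {
        return 'feature_flag_off';
    }

    if (!in_array($feature, $tenant['allowed_features'], true)) {
        return 'feature_not_in_rollout';
    }

    if ($runsThisMonth >= $tenant['monthly_quota']) {
        return 'quota_exceeded';
    }

    return null;
}
```

A controller would call this right after authorization and return an error response with the reason, so no run record or model call is created for a denied request.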
The failure modes of a premature second backend
This is where teams lose months.
The first version of the side service often works in demos because happy paths are easy. The trouble starts when usage gets real.
Failure mode 1: the AI service becomes a shadow policy engine
At first the service only generates text. Then it needs to know whether certain actions are allowed. Then it starts checking plan tiers, region restrictions, content policy, or internal workflow rules because “it already has the request.”
Now you have business logic in two places.
That is the point where product bugs become hard to explain. The UI says a user can do something, but the AI service quietly refuses. Or worse, the AI service allows something the main app would not have approved if it had remained the sole authority.
Failure mode 2: retries become unsafe
Distributed retries sound simple until they touch side effects.
Suppose the main app sends a request to the AI service, the AI service calls the provider, the provider succeeds, but the callback to the main app fails. Who owns retry? Who knows whether the result already exists? Who prevents duplicate drafts or duplicated billing events?
When the app owns the run record and the job lifecycle, idempotency is straightforward. When two services both think they are responsible for progress, things get messy fast.
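One way to make that idempotency concrete is a compare-and-set on the run's status, so a duplicate delivery cannot write a second result. This is a minimal sketch in plain PHP: `RunTable` is a hypothetical in-memory stand-in, and in a database the same idea would be a single conditional update that only matches rows still in the `running` state (an assumed status name).

```php
<?php

// In-memory stand-in for ai_runs. In a database this would be one
// conditional UPDATE that only matches rows whose status is still "running".
final class RunTable
{
    /** @var array<int, array{status: string, output: ?array}> */
    private array $rows = [];

    public function insertRunning(int $id): void
    {
        $this->rows[$id] = ['status' => 'running', 'output' => null];
    }

    /** Stores the result only if the run is still running; returns whether this call won. */
    public function completeOnce(int $id, array $output): bool
    {
        if (($this->rows[$id]['status'] ?? null) !== 'running') {
            return false; // duplicate delivery, late retry, or unknown run: ignore it
        }

        $this->rows[$id] = ['status' => 'completed', 'output' => $output];

        return true;
    }

    public function get(int $id): ?array
    {
        return $this->rows[$id] ?? null;
    }
}
```

Downstream side effects such as billing events would fire only when `completeOnce` returns true, which is what prevents the duplicated drafts and charges described above.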
Failure mode 3: data synchronization becomes its own feature
A separate AI service often needs a slice of product data: documents, customer context, policies, catalog metadata, user settings, maybe embeddings. Teams then build one of these:
- a sync pipeline
- a denormalized read model
- a retrieval store maintained by background events
- a request-time hydration layer that fetches app data remotely
Every one of those adds latency, drift risk, and operational overhead.
Sometimes that is justified. Usually it is not for the first several AI features.
Failure mode 4: ownership gets split across teams too early
This one is organizational rather than technical.
Once there is a separate backend, there is pressure for a separate team, roadmap, and abstraction layer. The product team now depends on the platform team to ship a prompt tweak, schema change, or status transition. The platform team does not fully own the user experience, and the product team no longer fully owns the implementation.
That seam slows everything down.
When a separate service is actually justified
There are real boundaries where extraction is the right decision. The mistake is pretending you are already there when you are not.
Compute and runtime boundaries
If the workload involves GPU-heavy inference, custom model serving, large-scale embedding pipelines, media generation, or Python-native ML tooling that is becoming central rather than incidental, a separate execution environment can make sense.
That is a real runtime boundary.
But note the wording: separate execution environment, not automatically a separate product backend. You can still keep the application as the owner of workflow state and user intent.
Independent scaling pressure
If model-heavy traffic grows on a curve that is materially different from the rest of the app, separating execution can reduce cost and blast radius. This matters when AI usage is no longer a background feature but a major throughput domain.
If 95 percent of your app traffic is ordinary CRUD and 5 percent is expensive AI work, queue isolation may be enough. If the AI work becomes its own demand plane, extraction starts earning its keep.
Hard isolation or compliance needs
Sometimes you need stricter network boundaries, separate secrets, isolated storage handling, or dedicated processing environments for regulated workflows. That is a strong reason to split.
This tends to be a better justification than “we may want to reuse this later.” Security and compliance boundaries are concrete. Future reuse is often speculation.
A capability is truly becoming a platform
A shared service is justified when multiple products genuinely need the same capability, contract, and lifecycle. Not merely “all of them call an LLM.”
A real platform capability might be:
- standardized document redaction with the same policy model
- a common retrieval and ranking engine for many apps
- a shared multimodal processing pipeline
- a governed evaluation or moderation layer used across products
What does not count is bundling unrelated feature prompts behind one service and calling it a platform.
The middle path most teams should take
The best move is often to split execution first, not ownership.
That means the app still owns the external API, authorization, persistence, billing, and workflow state. A worker process or specialized executor handles the expensive AI part. The boundary is operational, not conceptual.
A better contract than synchronous service-to-service RPC
Instead of turning the AI subsystem into another live backend that your app must call synchronously, make the contract durable and task-oriented.
{
  "id": "airun_481",
  "tenant_id": 12,
  "feature": "document_summary",
  "status": "queued",
  "subject": {
    "type": "document",
    "id": 933
  },
  "input": {
    "length": "short",
    "audience": "customer_success"
  },
  "policy": {
    "requires_review": true,
    "max_output_chars": 1200
  },
  "attempt": 1
}
A worker can consume that contract from the same database or a queue, execute the model call, and write results back. Later, if you really do need a specialized service, you can move the executor without moving the product boundary.
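A worker for that contract might look like the sketch below. It is plain PHP and assumes the `mbstring` extension. `executeRunContract` is a hypothetical name, `$generate` stands in for the model call, and the `needs_review` status and the choice to truncate over-long output are assumptions of this example rather than rules from the post.

```php
<?php

/**
 * Executes one run contract shaped like the JSON above. $generate stands in
 * for the model call: it receives the feature name and input, returns text.
 */
function executeRunContract(string $contractJson, callable $generate): array
{
    $run = json_decode($contractJson, true, 512, JSON_THROW_ON_ERROR);

    foreach (['id', 'tenant_id', 'feature', 'status', 'input', 'policy'] as $field) {
        if (!is_array($run) || !array_key_exists($field, $run)) {
            throw new InvalidArgumentException("Run contract is missing '$field'");
        }
    }

    if ($run['status'] !== 'queued') {
        return $run; // already picked up or finished: do nothing
    }

    $text = (string) $generate($run['feature'], $run['input']);

    // Enforce the output cap from the contract rather than trusting the model.
    $max = $run['policy']['max_output_chars'] ?? null;
    if (is_int($max) && mb_strlen($text) > $max) {
        $text = mb_substr($text, 0, $max);
    }

    $run['output'] = ['text' => $text];
    $run['status'] = !empty($run['policy']['requires_review']) ? 'needs_review' : 'completed';

    return $run;
}
```

Note that the worker reads policy from the contract and never decides it. Whether review is required was settled by the app when it wrote the run.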
Keep these concerns in the main app as long as possible
Even if you split execution, the main application should usually continue to own:
- authorization decisions
- tenant resolution
- billing and quota enforcement
- feature flags and rollout rules
- final publish, send, or mutate side effects
- audit trail semantics
- human review requirements
The AI layer should generate, classify, summarize, extract, or rank. It should not quietly become the source of truth for business policy.
A practical maturity ladder
A lot of teams would benefit from thinking in stages:
1. Inline prototype: useful only for internal proof of concept.
2. App-owned async workflow: controller, run record, queue job, stored result.
3. App-owned executor pool: isolated workers, stronger retries, better throughput.
4. Specialized execution service: only when runtime or scale truly demands it.
5. Shared platform capability: only after multiple products prove the reuse case.
Most teams should spend a long time in stages two and three.
A decision rule for real product teams
If you are building AI features inside a Laravel, Rails, Node, or other full stack SaaS app, use this rule.
Keep the feature inside the main app when it is primarily about applying AI to existing product context.
Extract only when you are clearly building a separate operational system with different scaling, runtime, or compliance needs.
Ask these questions before creating a second backend:
- Does the feature rely heavily on existing product data and permissions?
- Can the work be modeled as a queued job with a durable run record?
- Would a second service duplicate auth, observability, retries, and rollout logic?
- Are runtime or scaling constraints already hurting us today?
- Will multiple products consume the exact same capability soon?
If the first three are yes and the last two are no, keep it in the app.
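The rule is simple enough to write down as a function. This is only a sketch in plain PHP: the array keys are hypothetical names for the five questions above, in the same order.

```php
<?php

/**
 * Encodes the five questions above. Returns true only for the clear-cut
 * "keep it in the app" case: first three yes, last two no.
 *
 * @param array{
 *     uses_product_data: bool,
 *     fits_queued_job: bool,
 *     would_duplicate_plumbing: bool,
 *     runtime_pain_today: bool,
 *     shared_capability_soon: bool
 * } $answers
 */
function shouldKeepInApp(array $answers): bool
{
    $reasonsToStay = $answers['uses_product_data']
        && $answers['fits_queued_job']
        && $answers['would_duplicate_plumbing'];

    $reasonsToExtract = $answers['runtime_pain_today']
        || $answers['shared_capability_soon'];

    return $reasonsToStay && !$reasonsToExtract;
}
```

A `false` result does not mean "extract". It means the case is not clear-cut and deserves the closer look described in the sections on real boundaries.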
That is the right default for most teams building AI features in 2026. Not because microservices are bad, but because premature boundaries are expensive. They turn one feature into two systems before the feature has earned that complexity.
The cleanest architecture is not the one with the most boxes. It is the one where business truth, user permissions, and workflow ownership stay close together until separation solves a real problem.
That is the practical takeaway: add AI as a feature first, not a new platform. Split execution when necessary. Split product ownership only when it has clearly earned the boundary.
Read the full post on QCode: https://qcode.in/how-to-add-ai-features-without-creating-a-second-backend/