<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aashir As</title>
    <description>The latest articles on DEV Community by Aashir As (@aashir04m).</description>
    <link>https://dev.to/aashir04m</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3888391%2F138ed087-f64d-4cea-a969-a425825e5fa8.jpg</url>
      <title>DEV Community: Aashir As</title>
      <link>https://dev.to/aashir04m</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aashir04m"/>
    <language>en</language>
    <item>
      <title>5 Architecture Decisions That Kill AI Projects Before They Launch</title>
      <dc:creator>Aashir As</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:13:19 +0000</pubDate>
      <link>https://dev.to/aashir04m/5-architecture-decisions-that-kill-ai-projects-before-they-launch-4db8</link>
      <guid>https://dev.to/aashir04m/5-architecture-decisions-that-kill-ai-projects-before-they-launch-4db8</guid>
      <description>&lt;p&gt;$684 billion was invested in AI initiatives in 2025. More than $547 billion of that failed to deliver value (RAND Corporation). That's not a model problem. That's mostly an architecture problem.&lt;/p&gt;

&lt;p&gt;I've been on the failing side of this. Across the 50+ AI projects we've built at Afnexis, these are the five architectural decisions that caused the most damage.&lt;/p&gt;




&lt;h2&gt;1. Building the Model Before Validating the Data&lt;/h2&gt;

&lt;p&gt;I learned this the hard way on a fraud detection project. The client had 18 months of transaction data. We built a beautiful gradient boosting classifier. Precision: 91%. We were proud of it.&lt;/p&gt;

&lt;p&gt;Then we discovered their fraud labels were generated by their old rules engine, not by human review. The rules engine had a known flaw that mislabeled certain transaction types. We'd trained a model to replicate a broken system at 91% precision.&lt;/p&gt;

&lt;p&gt;The fix cost four weeks. The root cause took four minutes to identify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The decision that killed it:&lt;/strong&gt; Starting model development before auditing label quality. We now treat data auditing as a non-negotiable gate before writing model code. Every project. No exceptions.&lt;/p&gt;

&lt;p&gt;What to check before you write a line of code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are labels generated by humans, rules, or another model?&lt;/li&gt;
&lt;li&gt;What's the label error rate? (Use Cleanlab or manual spot-checking)&lt;/li&gt;
&lt;li&gt;What's the class balance? How were rare events captured?&lt;/li&gt;
&lt;li&gt;Is there data leakage between train and test splits?&lt;/li&gt;
&lt;/ul&gt;
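
&lt;p&gt;The checklist above can be sketched as code. This is a minimal illustration with assumed field names ("id", "label"), not a full auditing framework like Cleanlab:&lt;/p&gt;

```python
# Minimal pre-modeling data audit sketch. Field names ("id", "label")
# are assumptions for illustration.

def audit_splits(train_rows, test_rows, id_field="id"):
    """Flag data leakage: the same record appearing in train and test."""
    train_ids = {row[id_field] for row in train_rows}
    test_ids = {row[id_field] for row in test_rows}
    leaked = train_ids.intersection(test_ids)
    return {"leaked_count": len(leaked), "leaked_ids": sorted(leaked)}

def class_balance(rows, label_field="label"):
    """Report label frequencies so rare classes are visible before training."""
    counts = {}
    for row in rows:
        lab = row[label_field]
        counts[lab] = counts.get(lab, 0) + 1
    total = sum(counts.values())
    return {lab: n / total for lab, n in counts.items()}
```

&lt;p&gt;Run checks like these on every split before any model code exists; a leaked ID or a wildly skewed class ratio is much cheaper to find here than after training.&lt;/p&gt;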




&lt;h2&gt;2. Treating Inference as an Afterthought&lt;/h2&gt;

&lt;p&gt;Most ML tutorials end at model accuracy. Production starts at inference.&lt;/p&gt;

&lt;p&gt;A client needed real-time credit decisions. We trained a model with strong AUC scores. Then we tested serving latency. P95 response time: 2.3 seconds. Their requirement: under 200ms.&lt;/p&gt;

&lt;p&gt;The model used 340 features, 20 of which required live API calls at inference time. We'd designed for accuracy, not for serving. Rebuilding the inference architecture added five weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Define the serving constraints before training. Before a single model runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the maximum acceptable latency? (P95, not average)&lt;/li&gt;
&lt;li&gt;What features are available at inference time, with no latency penalty?&lt;/li&gt;
&lt;li&gt;Is the serving environment CPU, GPU, or edge?&lt;/li&gt;
&lt;li&gt;What's the expected RPS (requests per second)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These constraints shape model architecture, feature selection, and serving infrastructure. Define them first.&lt;/p&gt;
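
&lt;p&gt;One way to make "constraints first" concrete is to write the serving contract down as code and check candidate designs against it before training. This is a sketch with illustrative thresholds, not our actual tooling:&lt;/p&gt;

```python
# Serving contract sketch: define constraints before training, then check
# proposed designs against them. Thresholds here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ServingConstraints:
    p95_latency_ms: int       # maximum acceptable P95, not average
    max_live_features: int    # features requiring API calls at inference
    environment: str          # "cpu", "gpu", or "edge"
    expected_rps: int

def check_design(constraints, design):
    """Return contract violations for a proposed design (a dict with
    estimated "p95_latency_ms" and "live_features"; both keys assumed)."""
    violations = []
    if design["p95_latency_ms"] > constraints.p95_latency_ms:
        violations.append("latency budget exceeded")
    if design["live_features"] > constraints.max_live_features:
        violations.append("too many live features at inference time")
    return violations
```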




&lt;h2&gt;3. One Monolithic Model for Everything&lt;/h2&gt;

&lt;p&gt;I once built a single model for a healthcare client that was supposed to classify 14 different document types. It worked okay on 9 of them and poorly on 5. When we pushed updates to improve the poor performers, we sometimes degraded the good ones.&lt;/p&gt;

&lt;p&gt;The model was trying to do too much. Different document types have different data distributions, different error costs, and different update frequencies. Treating them as one problem made the engineering worse, not simpler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Ensemble-first architecture. Start by asking: can this problem be decomposed into smaller problems with clearer boundaries? For My Medical Records AI, we ended up with separate specialist models for lab reports, discharge summaries, prescriptions, and referral letters. Each could be updated independently. Each had its own monitoring. Accuracy improved on every category.&lt;/p&gt;

&lt;p&gt;Monolithic models feel simpler at first. They're not.&lt;/p&gt;
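
&lt;p&gt;The decomposed architecture reduces to a simple idea: a router that dispatches each document type to its own specialist. This sketch uses made-up document types and stub models, not the actual My Medical Records AI implementation:&lt;/p&gt;

```python
# Ensemble-first sketch: route each document type to a specialist model
# instead of one monolith. Types and model stubs are illustrative.

class SpecialistRouter:
    def __init__(self):
        self._models = {}

    def register(self, doc_type, model_fn):
        """Each specialist can be updated and monitored independently."""
        self._models[doc_type] = model_fn

    def route(self, doc_type, document):
        model_fn = self._models.get(doc_type)
        if model_fn is None:
            return {"status": "unsupported", "doc_type": doc_type}
        return {"status": "ok", "result": model_fn(document)}
```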




&lt;h2&gt;4. No Feedback Loop from Production&lt;/h2&gt;

&lt;p&gt;Models degrade. That's not a hypothesis. MIT research found 91% of ML models see accuracy decline over time without active monitoring. The question isn't whether your model will drift. It's whether you'll know before your users do.&lt;/p&gt;

&lt;p&gt;We shipped a churn prediction model for a SaaS client. Six months later, the business had launched two new product lines. User behavior patterns had shifted significantly. The model's precision dropped from 78% to 61%. Nobody noticed for eight weeks. The sales team was acting on stale predictions the whole time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Every model ships with a feedback loop. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outcome tracking:&lt;/strong&gt; Did the predicted thing actually happen? Link predictions to outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution monitoring:&lt;/strong&gt; Are the features at inference time still distributed like the training data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence tracking:&lt;/strong&gt; Is average confidence dropping? That's usually the first signal of drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground truth sampling:&lt;/strong&gt; Regularly label a random sample of recent predictions. Compare to model output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can't close the loop between model output and real-world outcomes, you're flying blind.&lt;/p&gt;
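
&lt;p&gt;Distribution monitoring doesn't require heavy tooling. One common metric is the Population Stability Index (PSI) between training-time and inference-time feature values. This is a sketch in plain Python; the bin count and the usual 0.2 alert threshold are heuristics, not universal constants:&lt;/p&gt;

```python
# Distribution monitoring sketch: Population Stability Index between a
# training sample and a recent inference sample of one numeric feature.
import math

def psi(expected, actual, bins=10):
    """Higher PSI = more drift. A common heuristic: alert above 0.2."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / bins or 1.0

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / step), bins - 1)
            counts[max(idx, 0)] += 1
        total = len(values)
        # Small floor avoids division by zero for empty bins.
        return [max(c / total, 1e-4) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((a[i] - e[i]) * math.log(a[i] / e[i]) for i in range(bins))
```

&lt;p&gt;Run this per feature on a schedule; a feature whose PSI crosses the threshold is a concrete, alertable drift signal rather than a vague suspicion.&lt;/p&gt;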




&lt;h2&gt;5. Hard-Coding the LLM Provider&lt;/h2&gt;

&lt;p&gt;This feels like a minor architectural decision until you get a surprise pricing change or a model deprecation notice.&lt;/p&gt;

&lt;p&gt;We built a document analysis system for a fintech client using GPT-4 directly. Six months later, GPT-4 was deprecated in favor of GPT-4o with a different API signature. Migration cost: two weeks and a small bug in production that nobody caught immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Abstract the LLM provider behind an interface from day one. The calling code doesn't know if it's talking to OpenAI, Anthropic, or a self-hosted Llama model. Provider configuration lives in environment variables, not in code. Switching providers is a config change, not a refactor.&lt;/p&gt;

&lt;p&gt;This also lets you run cost experiments: route 10% of traffic to a cheaper model and measure if quality degrades. You can't do that if your provider is hard-coded.&lt;/p&gt;
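
&lt;p&gt;The abstraction itself is small. In this sketch the provider names, the stand-in client class, and the COMPLETION_PROVIDER environment variable are all assumptions; in real code each entry would wrap a vendor SDK behind the same interface:&lt;/p&gt;

```python
# Provider-agnostic LLM interface sketch. Provider names, the stand-in
# client, and the COMPLETION_PROVIDER env var are illustrative assumptions.
import os
import random

class EchoProvider:
    """Stand-in for a real vendor client; returns a canned completion."""
    def __init__(self, name):
        self.name = name

    def complete(self, prompt):
        return f"[{self.name}] {prompt}"

PROVIDERS = {"primary": EchoProvider("primary"), "cheap": EchoProvider("cheap")}

def get_provider():
    """Switching providers is a config change, not a refactor."""
    return PROVIDERS[os.environ.get("COMPLETION_PROVIDER", "primary")]

def complete_with_experiment(prompt, cheap_fraction=0.1, rng=random.random):
    """Route a fraction of traffic to a cheaper model to measure quality."""
    if cheap_fraction > rng():
        return PROVIDERS["cheap"].complete(prompt)
    return get_provider().complete(prompt)
```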




&lt;h2&gt;The Pattern&lt;/h2&gt;

&lt;p&gt;Every one of these failures came from building for the demo, not for production. Models are trained on clean test data. Production has messy data, time pressure, and users who break assumptions.&lt;/p&gt;

&lt;p&gt;The best architectural advice I have: write down your production constraints before your first model run. Latency, labels, feedback loops, serving environment, provider flexibility. One page. It'll save you weeks.&lt;/p&gt;

&lt;p&gt;More on what kills AI projects in production: &lt;a href="https://afnexis.com/articles/why-ai-projects-fail" rel="noopener noreferrer"&gt;Why AI Projects Fail — and What To Do Instead&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Aashir Tariq is the CEO of &lt;a href="https://afnexis.com" rel="noopener noreferrer"&gt;Afnexis&lt;/a&gt;. We've shipped 50+ production AI systems across healthcare, fintech, and real estate. If your AI project is stuck between POC and production, that's what we fix.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>7 Production RAG Mistakes I Made (And How to Fix Them)</title>
      <dc:creator>Aashir As</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:12:10 +0000</pubDate>
      <link>https://dev.to/aashir04m/7-production-rag-mistakes-i-made-and-how-to-fix-them-26jl</link>
      <guid>https://dev.to/aashir04m/7-production-rag-mistakes-i-made-and-how-to-fix-them-26jl</guid>
      <description>&lt;p&gt;I've shipped RAG systems for healthcare records, financial document search, and real estate databases. The early ones had problems that cost us weeks of debugging in production. Here are the seven mistakes I made most often and what we do now instead.&lt;/p&gt;

&lt;p&gt;Want the full production architecture before reading this? Start here: &lt;a href="https://afnexis.com/articles/build-rag-system" rel="noopener noreferrer"&gt;How to Build a Production RAG System&lt;/a&gt;. This post is about the failure modes.&lt;/p&gt;




&lt;h2&gt;Mistake 1: Fixed-Size Chunking&lt;/h2&gt;

&lt;p&gt;I split every document into 512-token chunks with 50-token overlap. It felt clean and predictable. It broke almost immediately on real documents.&lt;/p&gt;

&lt;p&gt;A medical discharge summary doesn't divide neatly at 512 tokens. A section heading on token 511 would be split from its content on token 513. The retriever would pull the heading without the details, or the details without the heading. The LLM got fragments and hallucinated to fill the gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Semantic chunking based on document structure. For medical documents, we split at section boundaries (Chief Complaint, History, Medications, etc.) using regex + NLP, not fixed token counts. For PDFs without clear structure, we use recursive character text splitting with overlap tuned per document type, not a global default.&lt;/p&gt;

&lt;p&gt;If your chunks have orphaned headings or context-free numbers, your chunking strategy is wrong. Audit 20 random chunks before moving on.&lt;/p&gt;
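
&lt;p&gt;A minimal version of section-boundary chunking looks like this. The heading list is illustrative; a real deployment needs a per-document-type heading inventory:&lt;/p&gt;

```python
# Semantic chunking sketch: split at section boundaries instead of fixed
# token counts, so headings stay attached to their content.
import re

SECTION_HEADINGS = ["Chief Complaint", "History", "Medications", "Plan"]
_PATTERN = re.compile(
    r"^(?:" + "|".join(re.escape(h) for h in SECTION_HEADINGS) + r"):",
    re.MULTILINE,
)

def chunk_by_section(text):
    """Return one chunk per section; fall back to whole text if no headings."""
    starts = [m.start() for m in _PATTERN.finditer(text)]
    if not starts:
        return [text.strip()]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 != len(starts) else len(text)
        chunks.append(text[start:end].strip())
    return chunks
```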




&lt;h2&gt;Mistake 2: Vector Search Only&lt;/h2&gt;

&lt;p&gt;My first instinct was: embed everything, use cosine similarity, done. It worked on my test set. It failed on production queries.&lt;/p&gt;

&lt;p&gt;A radiologist asked: "Show me all reports mentioning 'ground glass opacity' from Q3." Semantic search retrieved reports about lung pathology in general. The specific phrase wasn't matched because its vector representation was similar to dozens of other lung terms. Keyword search would've nailed it instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Hybrid retrieval. BM25 for exact term matching, vector search for semantic similarity, then Reciprocal Rank Fusion to merge the results. For our RadShifts deployment, hybrid retrieval cut false negative rate by 34% compared to vector-only. That's not a small number when the output is a clinical document.&lt;/p&gt;
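
&lt;p&gt;The fusion step is simpler than it sounds. Reciprocal Rank Fusion scores each document by its rank in every result list; the ranked-ID lists below stand in for real BM25 and vector retriever output, and k=60 is the constant from the original RRF paper:&lt;/p&gt;

```python
# Hybrid retrieval sketch: merge BM25 and vector rankings with
# Reciprocal Rank Fusion (RRF).

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-ID lists, best first. Returns fused order."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;Documents that rank well in both lists rise to the top, while a document only one retriever found still stays in the pool, which is exactly the behavior you want for exact-phrase queries like "ground glass opacity".&lt;/p&gt;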




&lt;h2&gt;Mistake 3: No Retrieval Validation Before Generation&lt;/h2&gt;

&lt;p&gt;Here's a failure mode I didn't anticipate: the retriever returns results, but they're all irrelevant. The LLM doesn't know that. It generates a confident, detailed, completely wrong answer from the retrieved garbage.&lt;/p&gt;

&lt;p&gt;On one healthcare client, a query about "aspirin contraindications" was returning cardiology records about aspirin therapy. Technically related. Completely wrong context. The LLM synthesized a response that mixed treatment instructions with contraindications. Nobody caught it for two weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; A retrieval validation layer between the retriever and the generator. We score each retrieved chunk against the query (cross-encoder reranking), set a minimum relevance threshold, and route low-confidence retrievals to a fallback. The fallback either returns a "not found" response or escalates to a human reviewer. The LLM never sees irrelevant context.&lt;/p&gt;
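
&lt;p&gt;The shape of that validation layer, stripped to its essentials: score, threshold, fallback. Here `score_fn` stands in for a real cross-encoder reranker, and the 0.5 threshold is illustrative:&lt;/p&gt;

```python
# Retrieval validation sketch: score chunks against the query and refuse
# to generate from low-confidence context.

def validate_retrieval(query, chunks, score_fn, threshold=0.5):
    """Return (passed_chunks, action). The LLM only ever sees passing chunks."""
    scored = [(score_fn(query, c), c) for c in chunks]
    passed = [c for s, c in scored if s >= threshold]
    if not passed:
        # Fallback path: return "not found" or escalate to a human reviewer.
        return [], "fallback_not_found"
    return passed, "generate"
```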




&lt;h2&gt;Mistake 4: Embedding Drift&lt;/h2&gt;

&lt;p&gt;Six months after launch, our semantic search accuracy dropped noticeably. The documents hadn't changed. The queries hadn't changed. What changed: the embedding model provider had pushed an update.&lt;/p&gt;

&lt;p&gt;The new model produced slightly different vector representations for the same text. Our indexed embeddings from six months ago were now in a different vector space than fresh query embeddings. Retrieval quality degraded because we were comparing apples to oranges at the vector level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; We pin embedding model versions by name and hash. When a new model version is available, we re-embed the entire corpus before switching. We also track embedding drift as a metric in production: periodically re-embed a sample of documents and compare cosine similarity against the original embeddings. If drift exceeds a threshold, we trigger a full re-index.&lt;/p&gt;




&lt;h2&gt;Mistake 5: No Document Versioning&lt;/h2&gt;

&lt;p&gt;Legal documents change. Medical protocols get updated. Product specifications get revised. Our RAG systems didn't know that.&lt;/p&gt;

&lt;p&gt;We'd index version 1 of a contract. The client would update the contract. We'd index version 2. Now the corpus contained contradictory information from the same document. The LLM would retrieve both versions and synthesize an answer that was partly right and partly wrong, with no way for the user to know which version was used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Document versioning as a first-class citizen. Every document gets a version field and a validity period. When a new version is ingested, the old version is either archived (removed from active retrieval) or marked as superseded. Queries against an archived document return the new version with a note that the source was updated.&lt;/p&gt;
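
&lt;p&gt;The supersede-on-ingest rule is a few lines once versioning is a first-class field. Field names here are illustrative:&lt;/p&gt;

```python
# Document versioning sketch: ingesting a new version archives the old one
# so retrieval never mixes contradictory versions of the same document.

class VersionedIndex:
    def __init__(self):
        self._docs = {}  # doc_id maps to a list of versions, newest last

    def ingest(self, doc_id, version, text):
        versions = self._docs.setdefault(doc_id, [])
        for old in versions:
            old["active"] = False  # archived: removed from active retrieval
        versions.append({"version": version, "text": text, "active": True})

    def active(self, doc_id):
        """Only the current version is visible to the retriever."""
        return [v for v in self._docs.get(doc_id, []) if v["active"]]
```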




&lt;h2&gt;Mistake 6: Treating Permissions As an Afterthought&lt;/h2&gt;

&lt;p&gt;This one almost cost us a client relationship. I won't name them, but the domain was financial services.&lt;/p&gt;

&lt;p&gt;Our RAG system was deployed across multiple user roles. A junior analyst ran a query. The retriever pulled a highly relevant document. The document was a board-level financial projection not cleared for analyst access. We'd implemented permissions at the UI layer. The retrieval layer didn't know about permissions at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Permissions applied at query time, at the retrieval layer, not at display time. Every document in the index carries metadata about who can access it. Every query is tagged with the requesting user's role and permissions. The retriever filters candidates by permission before ranking. The LLM never sees a document the user isn't authorized to read.&lt;/p&gt;
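
&lt;p&gt;The key line is a metadata filter that runs before ranking. The role names and the "allowed_roles" field are illustrative:&lt;/p&gt;

```python
# Retrieval-layer permission filter sketch: candidates are filtered by the
# requesting user's roles before ranking, so the LLM never sees a document
# the user is not authorized to read.

def filter_by_permission(candidates, user_roles):
    """candidates: dicts carrying an "allowed_roles" set in their metadata."""
    roles = set(user_roles)
    return [doc for doc in candidates if roles.intersection(doc["allowed_roles"])]
```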




&lt;h2&gt;Mistake 7: Flying Blind in Production&lt;/h2&gt;

&lt;p&gt;We shipped. We got positive feedback. We assumed the system was working.&lt;/p&gt;

&lt;p&gt;Three months in, a client mentioned that questions about a specific topic were getting vague answers. We looked at the logs. Query-to-retrieval latency had doubled. Retrieval quality had degraded across a whole document category. Neither of us had noticed because we weren't tracking it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we do now:&lt;/strong&gt; Four metrics in production for every RAG system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval precision:&lt;/strong&gt; Are the top-k results actually relevant to the query?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer groundedness:&lt;/strong&gt; Is the generated answer supported by the retrieved context? (Use an LLM-as-judge to score this)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query latency (p95):&lt;/strong&gt; Not average. p95. Averages hide the tail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval gap rate:&lt;/strong&gt; What percentage of queries return no results or low-confidence results?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set up alerts on all four before launch. Not after.&lt;/p&gt;
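
&lt;p&gt;Two of the four metrics can be computed straight from a query log. This sketch assumes a log of (latency_ms, result_count, confidence) tuples, with an illustrative 0.5 confidence floor:&lt;/p&gt;

```python
# Monitoring sketch: p95 latency and retrieval gap rate from a query log.

def p95(latencies_ms):
    """95th percentile by nearest rank (integer math avoids float rounding)."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceiling of 0.95 * n
    return ordered[rank - 1]

def gap_rate(query_log, confidence_floor=0.5):
    """Fraction of queries with no results or only low-confidence results."""
    gaps = sum(
        1 for _, result_count, confidence in query_log
        if result_count == 0 or confidence_floor > confidence
    )
    return gaps / len(query_log)
```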




&lt;h2&gt;Summary Table&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mistake&lt;/th&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed-size chunking&lt;/td&gt;
&lt;td&gt;Fragmented, context-free retrievals&lt;/td&gt;
&lt;td&gt;Semantic chunking by document structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector search only&lt;/td&gt;
&lt;td&gt;Exact terms missed&lt;/td&gt;
&lt;td&gt;Hybrid BM25 + vector + RRF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No retrieval validation&lt;/td&gt;
&lt;td&gt;Confident wrong answers&lt;/td&gt;
&lt;td&gt;Cross-encoder reranking + minimum relevance threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding drift&lt;/td&gt;
&lt;td&gt;Silent accuracy degradation&lt;/td&gt;
&lt;td&gt;Pin model versions, monitor vector drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No document versioning&lt;/td&gt;
&lt;td&gt;Contradictory answers&lt;/td&gt;
&lt;td&gt;Version field + validity period + archive on update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Permissions afterthought&lt;/td&gt;
&lt;td&gt;Unauthorized content retrieved&lt;/td&gt;
&lt;td&gt;Filter by permissions at retrieval layer, not display layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No production monitoring&lt;/td&gt;
&lt;td&gt;Problems found by users, not by you&lt;/td&gt;
&lt;td&gt;4 core metrics: precision, groundedness, latency, gap rate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;One More Thing&lt;/h2&gt;

&lt;p&gt;These are mistakes I made on real systems. Most RAG tutorials show you a working demo. They don't show you what breaks at 3am six months after launch.&lt;/p&gt;

&lt;p&gt;If you're building a RAG system for production, the architecture decisions that matter most aren't which vector database to use. They're the ones above.&lt;/p&gt;

&lt;p&gt;Full production RAG architecture guide, including chunking strategy, index design, and the seven failure points in order of impact: &lt;a href="https://afnexis.com/articles/build-rag-system" rel="noopener noreferrer"&gt;Building a Production RAG System&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Aashir Tariq is the CEO of &lt;a href="https://afnexis.com" rel="noopener noreferrer"&gt;Afnexis&lt;/a&gt;, an AI development company that's shipped 50+ production AI systems for healthcare, fintech, and real estate clients.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
