<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: FARHAN HABIB FARAZ</title>
    <description>The latest articles on DEV Community by FARHAN HABIB FARAZ (@faraz_farhan_83ed23a154a2).</description>
    <link>https://dev.to/faraz_farhan_83ed23a154a2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3672904%2F9c5cf0ce-288b-470a-8f56-c16e34f144a6.jpg</url>
      <title>DEV Community: FARHAN HABIB FARAZ</title>
      <link>https://dev.to/faraz_farhan_83ed23a154a2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/faraz_farhan_83ed23a154a2"/>
    <language>en</language>
    <item>
      <title>The RAG System That Found Contradicting Answers (And Confidently Picked The Wrong One)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:39:53 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-rag-system-that-found-contradicting-answers-and-confidently-picked-the-wrong-one-5elh</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-rag-system-that-found-contradicting-answers-and-confidently-picked-the-wrong-one-5elh</guid>
      <description>&lt;p&gt;I built a RAG system for a fintech company's policy knowledge base. Customer asks about refund policy, system retrieves documentation, generates answer. Retrieval found five relevant chunks. Four said refunds within thirty days. One said refunds within fourteen days.&lt;br&gt;
The AI confidently told customers they had thirty days to request refunds. The actual policy was fourteen days. Two hundred seventeen customers were given wrong information before anyone noticed.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
Fintech platform with comprehensive documentation. Policies, procedures, terms of service, FAQs, internal guidelines. Everything indexed in one knowledge base for the support AI to reference.&lt;br&gt;
Standard RAG flow. Customer question triggers semantic search across documentation. Top five most relevant chunks retrieved. LLM generates answer using those chunks as context.&lt;br&gt;
I tested with forty questions. Answers matched documentation. Looked solid. Deployed.&lt;/p&gt;

&lt;p&gt;The Conflicting Sources&lt;br&gt;
Three weeks in, compliance team flagged an issue. Customers were claiming the AI told them refunds were available within thirty days. The current policy was fourteen days and had been for six months.&lt;br&gt;
I pulled the retrieval logs for refund policy questions. Every query returned five chunks. Four chunks said thirty days. One chunk said fourteen days.&lt;br&gt;
The four thirty-day chunks were old. From documentation written before the policy changed. They had not been deleted or archived. They still existed in the knowledge base, still getting retrieved, still being fed to the LLM.&lt;br&gt;
The LLM saw four sources saying thirty days and one source saying fourteen days. It chose the majority. Four votes for thirty days, one vote for fourteen days. Thirty days wins.&lt;br&gt;
Confidently wrong.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
The knowledge base contained documentation from multiple time periods. When policies changed, new documentation was added but old documentation was not removed. Historical context remained searchable.&lt;br&gt;
The vector search ranked chunks by semantic similarity to the query, not by recency or accuracy. Old chunks about refund policy were just as semantically relevant as new chunks. Sometimes more relevant because older documentation was more detailed.&lt;br&gt;
The retrieval system had no concept of document freshness, version history, or authoritative sources. Every chunk was treated equally. A paragraph from two years ago had the same weight as a paragraph from last month.&lt;br&gt;
When contradictions appeared in retrieved chunks, the LLM had no guidance on how to handle them. It defaulted to majority voting or picked whichever chunk appeared first in the context window.&lt;/p&gt;
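The ranking failure is easy to reproduce. A pure similarity sort has no term for recency or authority, so a stale chunk that happens to phrase the topic well outranks the current policy. A minimal sketch with made-up scores, not the production pipeline:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    similarity: float   # cosine similarity to the query
    age_days: int       # ignored by the ranker, which is the bug

chunks = [
    Chunk("Refunds are available within 30 days.", 0.91, 730),
    Chunk("Refund window: 30 days from purchase.", 0.89, 700),
    Chunk("Refunds must be requested within 14 days.", 0.84, 30),
]

# Rank purely by similarity: the two-year-old chunks win.
ranked = sorted(chunks, key=lambda c: c.similarity, reverse=True)
assert ranked[0].age_days == 730  # stale chunk ranked first
```

Nothing in the sort key knows that the top two chunks describe a policy that no longer exists.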

&lt;p&gt;The Scope of The Problem&lt;br&gt;
I audited the knowledge base. Out of eight hundred documents, one hundred thirty-four had been updated in the past year with policy or procedure changes. Only nineteen of those updates included explicit deprecation of the old versions.&lt;br&gt;
That meant one hundred fifteen outdated documents were still live in the vector database, still being retrieved, still generating wrong answers. Refund policy. Pricing tiers. Feature availability. Support hours. Interest rates. All potentially contradicted by newer documentation.&lt;br&gt;
Twenty-two percent of the total documentation was outdated but still active.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried adding timestamps to chunks and telling the LLM to prefer recent information. The prompt said: "If chunks contain conflicting information, prioritize the most recent."&lt;br&gt;
That helped slightly but not enough. The LLM did not consistently identify conflicts. Sometimes it blended old and new information into hybrid answers that were partially correct and partially outdated.&lt;br&gt;
Also, recency is not always the right filter. Sometimes old documentation contains important historical context or grandfathered policies that still apply to certain customers.&lt;/p&gt;

&lt;p&gt;The Real Solution Was Source Authority&lt;br&gt;
The fix required three changes to how the system handled retrieved chunks.&lt;br&gt;
First, document versioning. Every document now has a version number and status flag. Active, deprecated, or archived. Deprecated and archived documents are excluded from retrieval by default unless the query specifically asks for historical information.&lt;br&gt;
Second, authority ranking. Documents are tagged by source authority. Official policy documents have the highest authority. Internal guidelines have medium authority. Draft documents or old FAQs have low authority. When conflicts appear, higher authority sources win regardless of semantic similarity score or timestamp.&lt;br&gt;
Third, conflict detection in the generation prompt. The LLM is explicitly instructed: "Check if retrieved chunks contradict each other. If they do, identify which source has the most recent timestamp and highest authority. Use only that source for your answer. If you cannot resolve the conflict, state that policies may have changed and escalate to human support."&lt;/p&gt;
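Condensed into code, the first two changes are a status filter before generation and an authority-plus-recency rule for conflicts. The field names here (status, source_type, updated_at) are my own labels for the chunk metadata, not any specific vector store's schema:

```python
AUTHORITY = {"official_policy": 3, "internal_guideline": 2, "draft_or_faq": 1}

def filter_active(chunks, include_historical=False):
    # Deprecated and archived documents are excluded by default.
    if include_historical:
        return chunks
    return [c for c in chunks if c["status"] == "active"]

def resolve_conflicts(chunks):
    # Highest authority wins; ties broken by the most recent timestamp.
    # ISO date strings compare correctly as plain strings.
    if not chunks:
        return None
    return max(chunks, key=lambda c: (AUTHORITY[c["source_type"]], c["updated_at"]))

retrieved = [
    {"text": "30-day refunds", "status": "deprecated",
     "source_type": "official_policy", "updated_at": "2023-04-01"},
    {"text": "14-day refunds", "status": "active",
     "source_type": "official_policy", "updated_at": "2025-06-15"},
    {"text": "30-day refunds (FAQ)", "status": "active",
     "source_type": "draft_or_faq", "updated_at": "2024-01-10"},
]

winner = resolve_conflicts(filter_active(retrieved))
assert winner["text"] == "14-day refunds"
```

The deprecated thirty-day chunk never reaches the LLM, and the surviving FAQ chunk loses to the official policy document on authority alone.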

&lt;p&gt;What Changed&lt;br&gt;
Question: "What is your refund policy?"&lt;br&gt;
Old behavior:&lt;br&gt;
Retrieved five chunks, four outdated, one current&lt;br&gt;
Generated answer based on majority: thirty days&lt;br&gt;
Wrong information given to customer&lt;br&gt;
New behavior:&lt;br&gt;
Retrieved five chunks&lt;br&gt;
System filtered: only active documents included&lt;br&gt;
If old chunks still appeared due to the search algorithm, conflict detection triggered&lt;br&gt;
LLM identified contradiction, checked authority and timestamp&lt;br&gt;
Used only the current active policy document&lt;br&gt;
Correct answer: fourteen days&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, twenty-two percent of documentation was outdated and active. Conflicting information appeared in thirty-seven percent of complex queries. Wrong answers were given to customers two hundred seventeen times before detection.&lt;br&gt;
After the fix, outdated documents excluded from search. Conflicting information rate dropped to four percent, mostly edge cases with legitimately different policies for different customer tiers. Wrong answers dropped to near zero.&lt;br&gt;
The business impact was significant. Compliance risk eliminated. Customer disputes over policy misinformation stopped. Support team regained trust in the AI. Legal department approved continued use of the system.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Knowledge bases are not static. Documentation accumulates over time. Old information does not automatically disappear when new information is added. Retrieval systems without versioning or deprecation treat all content equally regardless of accuracy.&lt;br&gt;
Semantic similarity is not the same as correctness. An outdated document can be highly relevant to a query while being factually wrong. Retrieval must filter by authority and currency, not just relevance.&lt;br&gt;
LLMs will not automatically detect or resolve contradictions in retrieved context. They will synthesize, blend, or pick based on unclear heuristics unless explicitly prompted to identify conflicts and apply resolution rules.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
A RAG system retrieving from unversioned documentation generated wrong answers by confidently choosing outdated information when retrieved chunks conflicted. The fix was document versioning, authority ranking, and explicit conflict detection in the generation prompt.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: rag, documentversioning, contradictions, knowledgebase, retrieval, policyaccuracy&lt;/p&gt;

</description>
      <category>contradictions</category>
      <category>rag</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The RAG System That Mixed Documentation From Different Products (And Created Frankenstein Instructions)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:37:35 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-rag-system-that-mixed-documentation-from-different-products-and-created-frankenstein-21i</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-rag-system-that-mixed-documentation-from-different-products-and-created-frankenstein-21i</guid>
      <description>&lt;p&gt;I built a RAG system for a company that sold three different software products. One knowledge base. Three hundred documents. One AI answering questions about all products.&lt;br&gt;
The retrieval worked. The answers were chaos. Customers following the instructions ended up configuring Product A using steps from Product B while referencing features that only existed in Product C.&lt;br&gt;
Nobody could actually complete any task using the AI's guidance.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
Software company with three products. Enterprise CRM, Marketing Automation Platform, and Analytics Dashboard. Separate products but overlapping concepts. All three had settings pages, user management, API integrations, and data exports.&lt;br&gt;
They wanted one unified support bot. Customer asks a question, bot searches across all documentation, returns the answer. Efficient. No need to maintain three separate bots.&lt;br&gt;
I built it as a standard RAG system. All documentation from all three products went into one vector database. Query comes in, retrieve relevant chunks, generate answer.&lt;br&gt;
Tested with fifty questions. Answers looked reasonable. Deployed.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Frankenstein Instructions&lt;br&gt;
Within days, support tickets arrived with a consistent pattern. Customers saying the AI instructions did not match what they saw in their product.&lt;br&gt;
Customer using the CRM asked: "How do I export contact data?"&lt;br&gt;
AI answer: "Go to Settings, click Data Export, select CSV format, choose date range, and click Generate Report. You can schedule automatic exports under Advanced Options."&lt;br&gt;
Customer response: "There is no Data Export in Settings. There is no Advanced Options menu. Where are these features?"&lt;br&gt;
I checked the retrieved chunks. The answer combined three different sources. "Go to Settings" came from CRM documentation. "Select CSV format and choose date range" came from Analytics Dashboard docs. "Schedule automatic exports under Advanced Options" came from Marketing Automation docs.&lt;br&gt;
Each individual chunk was accurate for its own product. But they described three different export workflows that did not exist in combination anywhere.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Pattern&lt;br&gt;
The vector search retrieved chunks based on semantic similarity. "Export contact data" matched documentation about exports from all three products. The retrieval did not filter by product context.&lt;br&gt;
The LLM saw five chunks about exporting data. Each chunk described a different part of the export process, but from different products with different UIs and different features. The LLM synthesized these chunks into one coherent-sounding answer that described a workflow that did not exist in any actual product.&lt;br&gt;
Another example: "How do I add team members?"&lt;br&gt;
Retrieved chunks:&lt;br&gt;
CRM: "Navigate to Team Settings and click Add User"&lt;br&gt;
Marketing Automation: "Go to Account &amp;gt; Users &amp;gt; Invite New Member"&lt;br&gt;
Analytics Dashboard: "Open the sidebar, select Team, and click the plus icon"&lt;br&gt;
AI answer: "Navigate to Team Settings in the sidebar, click Add User or the plus icon, then select Invite New Member from the Account menu."&lt;br&gt;
Completely incoherent. Every product had different navigation, different button labels, different flows. The answer was a mashup that worked nowhere.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
The vector embeddings measured semantic similarity, not product identity. The query "add team members" was semantically similar to documentation from all three products about adding users. Retrieval returned chunks from all three.&lt;br&gt;
The prompt told the LLM to answer using the retrieved chunks. It never told the LLM to check if chunks were from compatible contexts. The LLM saw five relevant chunks and synthesized them into one answer, assuming they described parts of the same system.&lt;br&gt;
There was no product boundary in the retrieval or the generation. The entire knowledge base was treated as one unified system when it actually described three separate systems with different architectures.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried adding product names to every chunk. Each chunk was tagged with metadata: product equals CRM, product equals Marketing Automation, or product equals Analytics Dashboard.&lt;br&gt;
Then I filtered retrieval: only search chunks matching the user's product.&lt;br&gt;
That required knowing which product the user had. The bot asked: "Which product are you using?" at the start of every conversation. Customers hated it. Many did not know the official product names. Some used multiple products and did not know which one their question was about.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Real Solution Was Contextual Separation&lt;br&gt;
The fix required treating the knowledge base as three separate domains, not one merged corpus, but doing so intelligently without forcing users to self-identify upfront.&lt;br&gt;
First, I added product-specific vector namespaces. Each product's documentation lived in its own retrieval space. When a query came in, the system searched all three namespaces in parallel but kept results separated by source.&lt;br&gt;
Second, I added context detection. Before retrieval, the system analyzed the query for product-specific terminology. Mentions of "campaign builder" indicated Marketing Automation. Mentions of "deal pipeline" indicated CRM. Mentions of "dashboard widgets" indicated Analytics.&lt;br&gt;
If the query contained clear product signals, retrieval prioritized that product's namespace and only pulled from others if the primary namespace had low-confidence matches.&lt;br&gt;
If the query was ambiguous, the system retrieved from all products but presented answers separately: "For CRM: [answer]. For Marketing Automation: [answer]. Which product are you using?"&lt;br&gt;
Third, the generation prompt changed. The LLM was explicitly instructed: "These chunks may come from different products. Do not mix instructions from different products. If chunks conflict, they likely describe different systems. Present separate answers or ask for clarification."&lt;br&gt;
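A rough sketch of that routing layer. The terminology lists and the function names are hypothetical placeholders; the real system matched far more product-specific vocabulary:

```python
# Hypothetical product vocabularies; stand-ins, not a vendor API.
PRODUCT_TERMS = {
    "crm": {"deal pipeline", "contact data", "lead scoring"},
    "marketing": {"campaign builder", "email sequence", "drip"},
    "analytics": {"dashboard widgets", "report builder", "funnel chart"},
}

def detect_products(query):
    # Look for product-specific terminology in the query.
    q = query.lower()
    return [p for p, terms in PRODUCT_TERMS.items()
            if any(t in q for t in terms)]

def route(query):
    hits = detect_products(query)
    if len(hits) == 1:
        # Clear product signal: search only that namespace.
        return {"mode": "single", "namespaces": hits}
    # Ambiguous: search every namespace but keep results separated
    # so the generator never blends products.
    return {"mode": "per_product", "namespaces": list(PRODUCT_TERMS)}

assert route("How do I export contact data?")["mode"] == "single"
assert route("How do I export data?")["mode"] == "per_product"
```

The per_product mode is what produces the "Which product are you using?" response instead of a mashup answer.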
&lt;/p&gt;

&lt;p&gt;What Changed&lt;br&gt;
Question: "How do I export contact data?"&lt;br&gt;
Old behavior:&lt;br&gt;
Retrieved chunks from CRM, Marketing Automation, Analytics&lt;br&gt;
Generated mashup answer&lt;br&gt;
Customer could not follow instructions&lt;br&gt;
New behavior:&lt;br&gt;
Detected "contact data" as CRM-specific terminology&lt;br&gt;
Retrieved primarily from CRM namespace&lt;br&gt;
Generated answer using only CRM chunks&lt;br&gt;
Answer matched actual CRM interface&lt;br&gt;
If query was ambiguous: "How do I export data?"&lt;br&gt;
Retrieved from all products but kept separate&lt;br&gt;
Responded: "Export process varies by product. Are you using CRM, Marketing Automation, or Analytics Dashboard?"&lt;br&gt;
User clarifies&lt;br&gt;
System provides accurate product-specific answer&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, forty-one percent of answers mixed incompatible instructions from multiple products. Customers could not complete tasks. Support tickets increased because the AI was giving impossible instructions. Trust in the AI dropped to twenty-three percent.&lt;br&gt;
After the fix, cross-product contamination dropped to under three percent, limited to edge cases where products genuinely shared identical features. Answer accuracy for product-specific questions reached eighty-nine percent. Support tickets related to AI confusion disappeared.&lt;br&gt;
The business impact was immediate. Customers using the AI to solve problems actually succeeded. Support ticket deflection went from negative, since the AI was creating tickets, to a fifty-four percent deflection rate.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Semantic similarity does not equal contextual compatibility. Two chunks can be topically similar while describing completely different systems. Retrieval must consider source boundaries, not just content similarity.&lt;br&gt;
Merging documentation from multiple products into one undifferentiated knowledge base destroys the contextual boundaries that make instructions actionable. Each product is a separate world with its own vocabulary, UI, and workflows.&lt;br&gt;
LLMs will synthesize retrieved chunks into coherent answers even when those chunks are incompatible. The generation prompt must explicitly forbid cross-context mixing and require product consistency.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
A RAG system retrieving from three products without source filtering created Frankenstein instructions by mixing incompatible chunks. The fix was product-aware retrieval with context detection and generation prompts that enforced single-product coherence.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: rag, multiproduct, contextmixing, retrieval, knowledgebase, productboundaries&lt;/p&gt;

</description>
      <category>retrieval</category>
      <category>rag</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>The RAG System That Retrieved Perfect Chunks (But Answered Wrong Anyway)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:35:40 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-rag-system-that-retrieved-perfect-chunks-but-answered-wrong-anyway-239k</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-rag-system-that-retrieved-perfect-chunks-but-answered-wrong-anyway-239k</guid>
      <description>&lt;p&gt;I built a RAG system for a customer support knowledge base. It retrieved relevant documentation chunks and used them to answer questions. Retrieval accuracy was ninety six percent. Answer accuracy was thirty two percent.&lt;br&gt;
The retrieval worked perfectly. The answers were completely wrong.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
Enterprise software company with eight hundred pages of technical documentation. They wanted an AI that could answer customer questions using this knowledge base instead of forcing customers to search manually.&lt;br&gt;
Standard RAG architecture. Customer asks question, system embeds the query, searches vector database for most relevant chunks, feeds those chunks to the LLM with the question, LLM generates answer using the retrieved context.&lt;br&gt;
I tested retrieval quality first. For one hundred sample questions, the system retrieved the correct documentation sections ninety-six times. Nearly perfect retrieval.&lt;br&gt;
Then I tested end-to-end answers. Out of the same one hundred questions, only thirty-two answers were actually correct or helpful. The rest were wrong, incomplete, or misleading.&lt;br&gt;
The retrieval was flawless. The answer generation was broken.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
The problem was not retrieval. The problem was chunk boundaries. The system was retrieving the right paragraphs but those paragraphs did not contain complete information when isolated from surrounding context.&lt;br&gt;
Example question: "How do I reset my API key?"&lt;br&gt;
Retrieved chunk: "Click the regenerate button and confirm. Your old key will stop working immediately."&lt;br&gt;
This chunk is relevant. It mentions API key regeneration. But it is missing critical information. Where is the regenerate button? What menu? What happens to existing API calls? How do I update my code?&lt;br&gt;
That information existed in the documentation, but it was in the paragraph before and the paragraph after the retrieved chunk. The chunking strategy had split one complete procedure into three separate chunks. The retrieval system grabbed the middle chunk and missed the setup and followup steps.&lt;br&gt;
The LLM saw incomplete instructions and either filled in the gaps with hallucinations or gave vague unhelpful answers.&lt;/p&gt;

&lt;p&gt;The Chunking Problem&lt;br&gt;
The documentation was chunked by fixed size. Every five hundred tokens became one chunk. Clean. Consistent. Terrible for meaning preservation.&lt;br&gt;
A procedure that said "First go to Settings. Then navigate to API section. Click regenerate button and confirm" got split into two chunks if it crossed the five hundred token boundary. Retrieval might grab the second chunk, which starts mid-procedure.&lt;br&gt;
Tables were even worse. A pricing table got split horizontally. The retrieved chunk had row data without column headers. The LLM could not interpret what the numbers meant.&lt;br&gt;
Lists broke across chunks. A troubleshooting guide with eight steps got split. The user got steps four through six without context of what came before.&lt;/p&gt;
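The failure is easy to see with a toy fixed-size splitter. Splitting on a word count (standing in for tokens here) lands the boundary mid-procedure:

```python
def fixed_size_chunks(words, size):
    # Naive fixed-size chunking: cut every `size` words, with no
    # regard for sentence, list, or procedure boundaries.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = ("First go to Settings. Then navigate to the API section. "
       "Click the regenerate button and confirm. "
       "Your old key stops working immediately.").split()

chunks = fixed_size_chunks(doc, 10)
# The second chunk starts mid-procedure, exactly like the retrieved
# fragment described above: the button click without the setup steps.
assert chunks[1].startswith("Click the regenerate")
```

Retrieval can then match that middle chunk perfectly and still hand the LLM instructions with no beginning and no end.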

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried increasing chunk size to one thousand tokens. That reduced splitting but created a new problem. Chunks became too broad. A chunk about API keys now also included information about OAuth, webhooks, and rate limiting. Retrieval precision dropped because chunks were less focused.&lt;br&gt;
I tried overlapping chunks. Each chunk included the last one hundred tokens of the previous chunk. That helped slightly but created massive redundancy. The vector database size tripled and search became slower.&lt;/p&gt;

&lt;p&gt;The Real Solution Was Semantic Chunking&lt;br&gt;
The breakthrough was abandoning fixed-size chunks entirely. Instead, I chunked by semantic boundaries. Procedures stayed together. Tables stayed whole. Lists remained intact.&lt;br&gt;
The new chunking logic identified content types. A procedure section with numbered steps became one chunk regardless of length. A table became one chunk with headers and all rows. A conceptual explanation paragraph became one chunk.&lt;br&gt;
If a section was genuinely too large, it split at natural breakpoints. After a procedure ends but before the next topic begins. After a table but before explanatory text. At heading boundaries, not mid-paragraph.&lt;br&gt;
I also added contextual metadata to each chunk. Every chunk now includes the page title, section heading, and subsection heading it came from. When the chunk is retrieved, the LLM sees not just the paragraph but also where it sits in the documentation hierarchy.&lt;/p&gt;
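In sketch form, the new chunker keeps each section whole and prepends its place in the hierarchy. This is a simplified illustration; the real logic also detected tables, lists, and oversized sections and split only at natural breakpoints:

```python
def semantic_chunks(sections):
    # Each section is (page, section heading, subsection heading, body).
    # Whole sections become chunks, with the documentation hierarchy
    # prepended so the LLM sees where the text sits.
    chunks = []
    for page, section, subsection, body in sections:
        header = f"Page: {page} | Section: {section} | Subsection: {subsection}"
        chunks.append(header + "\n" + body.strip())
    return chunks

docs = [
    ("API Reference", "API Management", "Key Regeneration",
     "1. Navigate to Settings. 2. Click regenerate. 3. Confirm."),
]
out = semantic_chunks(docs)
assert out[0].startswith("Page: API Reference | Section: API Management")
```

The embedded text now carries its own context, so even a retrieved fragment tells the model which feature and which part of the docs it belongs to.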

&lt;p&gt;What Changed&lt;br&gt;
Question: "How do I reset my API key?"&lt;br&gt;
Old retrieval: Middle paragraph of procedure. "Click the regenerate button and confirm."&lt;br&gt;
New retrieval: Complete procedure with context. "Section: API Management &amp;gt; Subsection: Key Regeneration&lt;br&gt;
To reset your API key:&lt;br&gt;
Navigate to Settings &amp;gt; API Keys&lt;br&gt;
Locate your current key in the list&lt;br&gt;
Click the regenerate button next to it&lt;br&gt;
Confirm the action in the popup&lt;br&gt;
Copy your new key immediately&lt;br&gt;
Note: Your old key stops working immediately upon regeneration."&lt;br&gt;
The LLM now had complete instructions with all necessary steps and context about where to find the feature.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
After switching to semantic chunking, I retested the same one hundred questions. Retrieval accuracy stayed at ninety-four percent, slightly lower because some chunks were now more specific. But answer accuracy jumped to eighty-seven percent.&lt;br&gt;
The system went from retrieving right and answering wrong to both retrieving right and answering right. Customer satisfaction with AI answers went from thirty-eight percent to eighty-one percent.&lt;br&gt;
Support ticket deflection, the real business metric, increased from eighteen percent to sixty-three percent. The AI was finally reducing support load instead of frustrating customers with incomplete answers.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Retrieval accuracy is not the same as answer quality. You can retrieve the exactly correct paragraph and still generate a wrong answer if that paragraph lacks surrounding context.&lt;br&gt;
Fixed-size chunking optimizes for engineering simplicity, not semantic coherence. Real documentation has structure. Procedures have steps. Tables have relationships. Lists have order. Chunking must preserve these structures.&lt;br&gt;
Contextual metadata matters as much as content. Knowing a chunk comes from the API Management section under Key Regeneration helps the LLM understand what it is reading and how it relates to the question.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
A RAG system with ninety-six percent retrieval accuracy produced wrong answers sixty-eight percent of the time because chunks were split at arbitrary boundaries that destroyed semantic meaning. The fix was chunking by content structure instead of token count and adding hierarchical context to every chunk.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: rag, chunking, retrieval, knowledgebase, vectorsearch, semanticsegmentation&lt;/p&gt;

</description>
      <category>rag</category>
      <category>chunking</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Calendar Sync That Scheduled Meetings During Weekends (And National Holidays)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:32:38 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-calendar-sync-that-scheduled-meetings-during-weekends-and-national-holidays-273l</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-calendar-sync-that-scheduled-meetings-during-weekends-and-national-holidays-273l</guid>
      <description>&lt;p&gt;I built a meeting scheduler bot that automatically booked appointments by syncing with Google Calendar. It checked availability and confirmed bookings instantly. No back-and-forth emails. Pure automation.&lt;br&gt;
First month looked perfect. Second month, complaints started. "Why did your system book me for Saturday at 11 PM?" Another: "I just got a meeting invite for Eid day. Is this a joke?"&lt;br&gt;
The bot was scheduling meetings during weekends, holidays, late nights, and culturally inappropriate times. It saw "calendar slot empty" and booked it, without understanding why that slot was empty.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
Consulting company with international clients. Meeting scheduling was a nightmare. Clients in different time zones, back-and-forth emails trying to find time, hours wasted coordinating.&lt;br&gt;
They wanted automation. Client requests a meeting, bot checks team calendars, finds available slots, books immediately. I built it with Google Calendar API integration, time zone handling, and instant confirmations.&lt;br&gt;
The logic was simple. User requests meeting, bot scans the next two weeks, finds slots where the calendar shows no conflicts, presents options, user picks one, meeting booked.&lt;br&gt;
Tested with thirty scheduling requests across five days. Perfect bookings. Deployed.&lt;/p&gt;

&lt;p&gt;The Weekend Bookings&lt;br&gt;
Two weeks in, first complaint. A client in Dubai got offered meeting times on Friday afternoon and Saturday morning. Friday afternoon is the weekend in the UAE. The client was confused and mildly insulted.&lt;br&gt;
Then a UK client got a meeting scheduled for Sunday at 9 AM. Sunday morning. The client assumed it was a system error and ignored the invite. Missed meeting. Relationship awkwardness.&lt;br&gt;
Then the worst one. Pakistani client got a meeting invite for Eid-ul-Fitr, one of the most important religious holidays. The client replied coldly: "Do you not know what day this is?"&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
My availability logic checked one thing: is the calendar slot free? If yes, that slot is available for booking.&lt;br&gt;
The bot did not understand business days versus weekends. It did not know about public holidays. It did not account for cultural or religious observances. It did not respect reasonable working hours.&lt;br&gt;
A calendar slot being empty does not mean that time is appropriate for a meeting. It usually means the opposite. The slot is empty because nobody wants to work then.&lt;br&gt;
The bot treated 11 PM the same as 11 AM. It treated Saturday the same as Wednesday. It treated Christmas Day the same as any random Tuesday. Empty slot equals available slot. That was the entire logic.&lt;/p&gt;

&lt;p&gt;The Global Holiday Problem&lt;br&gt;
The company worked with clients in fifteen countries. Each country has different public holidays. UAE observes Friday-Saturday weekends. Bangladesh observes Friday. US observes Saturday-Sunday. Israel observes Friday evening through Saturday.&lt;br&gt;
Religious holidays vary by region. Eid dates shift yearly. Diwali timing changes. Lunar New Year moves. Christmas is fixed in Gregorian calendar but not everyone observes it.&lt;br&gt;
The bot knew none of this. It had no holiday database. No cultural calendar. No concept of inappropriate timing.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried manually blocking weekends. Configure the system to skip Saturdays and Sundays.&lt;br&gt;
That broke scheduling for clients in UAE where Sunday is a work day. It also did not solve the holiday problem or the late-night bookings.&lt;br&gt;
I tried adding a "reasonable hours" filter. Only suggest 9 AM to 5 PM.&lt;br&gt;
That created time zone issues. 9 AM in New York is 7 PM in Bangladesh. The bot started suggesting evening slots to clients in Asia when trying to book during New York business hours.&lt;/p&gt;
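The skew is mechanical. With Python's zoneinfo (assuming tzdata is available on the platform), a slot that is mid-morning in New York lands in the evening in Dhaka:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9 or newer

# 9 AM in New York on a January date (EST)...
ny_slot = datetime(2026, 1, 22, 9, 0, tzinfo=ZoneInfo("America/New_York"))

# ...is 8 PM the same day in Dhaka, well past working hours.
dhaka = ny_slot.astimezone(ZoneInfo("Asia/Dhaka"))
assert dhaka.hour == 20
```

Any "reasonable hours" filter pinned to one zone is guaranteed to misfire for the party in the other zone.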

&lt;p&gt;The Real Solution Was Contextual Business Hours&lt;br&gt;
The fix required understanding business context, not just calendar availability. The system now checks multiple layers before considering a slot bookable.&lt;br&gt;
First layer is user-specific working hours. Every team member and every client has defined working hours in their local time zone. For a US team member, that might be Monday to Friday, 9 AM to 6 PM EST. For a UAE client, Monday to Thursday, 9 AM to 5 PM GST, plus Sunday.&lt;br&gt;
Second layer is a holiday database. The system checks public holidays for the countries of both the team member and the client. If either party is observing a holiday, that day is blocked entirely.&lt;br&gt;
Third layer is cultural sensitivity flags. Certain times are marked as generally inappropriate even if not official holidays. Late Friday afternoon for Muslim-majority countries. Friday evening through Saturday for Israeli clients. The week between Christmas and New Year for Western clients.&lt;br&gt;
Fourth layer is minimum notice. No bookings within four hours of the current time, to avoid suggesting a meeting that starts in twenty minutes.&lt;/p&gt;

&lt;p&gt;What Changed&lt;br&gt;
The bot stopped suggesting Friday afternoon meetings to UAE clients. It stopped booking Sunday morning calls. It blocked out Eid, Diwali, Christmas, Lunar New Year, and other major holidays automatically.&lt;br&gt;
Scheduling suggestions became contextually appropriate instead of just technically available. Clients stopped getting insulting meeting invites.&lt;/p&gt;
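&lt;p&gt;A minimal sketch of the layered slot check described above. The profile fields, the holiday table, and the cultural-flag name are hypothetical stand-ins, and a real system would pull holidays from an API rather than a hard-coded dict:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Hypothetical profiles; weekday numbers follow Python (0=Mon .. 6=Sun).
PROFILES = {
    "us_member": {"tz": "America/New_York", "days": {0, 1, 2, 3, 4},
                  "start": 9, "end": 18, "country": "US", "blocks": set()},
    "uae_client": {"tz": "Asia/Dubai", "days": {6, 0, 1, 2, 3},
                   "start": 9, "end": 17, "country": "AE", "blocks": {"fri_pm"}},
}

# Layer 2: per-country public holidays (illustrative, not a real calendar).
HOLIDAYS = {"US": {"2026-12-25"}, "AE": set()}

def slot_ok(slot_utc, parties, now_utc):
    """Return True only if the slot passes every layer for every party."""
    # Layer 4: minimum notice of four hours.
    if slot_utc - now_utc < timedelta(hours=4):
        return False
    for p in parties:
        local = slot_utc.astimezone(ZoneInfo(p["tz"]))
        # Layer 1: this party's working days and hours, in their local time.
        if local.weekday() not in p["days"]:
            return False
        if not (p["start"] <= local.hour < p["end"]):
            return False
        # Layer 2: blocked entirely if either party observes a holiday.
        if local.strftime("%Y-%m-%d") in HOLIDAYS.get(p["country"], set()):
            return False
        # Layer 3: cultural sensitivity flags, e.g. Friday afternoons.
        if "fri_pm" in p["blocks"] and local.weekday() == 4 and local.hour >= 12:
            return False
    return True

now = datetime(2026, 1, 19, 8, tzinfo=timezone.utc)              # a Monday morning
monday_10am_ny = datetime(2026, 1, 19, 15, tzinfo=timezone.utc)  # 10 AM EST
saturday = datetime(2026, 1, 24, 15, tzinfo=timezone.utc)

print(slot_ok(monday_10am_ny, [PROFILES["us_member"]], now))  # True
print(slot_ok(saturday, [PROFILES["us_member"]], now))        # False: weekend
```

&lt;p&gt;Because every layer is evaluated per party in that party's local time, a slot must pass for both sides before it is ever offered.&lt;/p&gt;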

&lt;p&gt;The Results&lt;br&gt;
Before the fix, roughly forty percent of automated bookings landed on weekends, holidays, or inappropriate times. Complaints were frequent. Several client relationships were damaged. Manual rescheduling was constant.&lt;br&gt;
After the fix, inappropriate bookings dropped to under two percent, mostly edge cases like unexpected office closures. Complaints stopped. Clients praised the system for respecting their schedules and cultures.&lt;br&gt;
The business impact was measurable. Meeting no-show rate dropped because meetings were scheduled at reasonable times. Client satisfaction increased. The team saved hours per week not fixing bad automated bookings.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Calendar availability is necessary but not sufficient. Empty slots are often empty for good reasons. Business hours vary by culture, region, and individual preference. Holidays are not universal or static.&lt;br&gt;
Systems that schedule meetings must understand work culture, not just work calendars. Technical availability must be filtered through cultural and contextual appropriateness.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
A scheduling bot that only checked calendar availability booked meetings during weekends, holidays, and culturally inappropriate times. The fix was adding business hours, holiday awareness, and cultural context as filters before presenting time slots.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: scheduling, calendarsync, automation, timezones, culturalawareness, meetings&lt;/p&gt;

</description>
      <category>scheduling</category>
      <category>timezones</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Feedback Collector That Published Negative Reviews Publicly (Before Human Review)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:29:39 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-feedback-collector-that-published-negative-reviews-publicly-before-human-review-26g3</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-feedback-collector-that-published-negative-reviews-publicly-before-human-review-26g3</guid>
      <description>&lt;p&gt;I built a feedback collection system for a SaaS company. Customers submitted reviews, the system collected them, and the marketing team displayed the best ones on the website. Automated social proof. Standard practice.&lt;br&gt;
Two weeks in, a one-star review appeared on the homepage. Then another. Then five more. All visible to every site visitor. All brutally negative.&lt;br&gt;
The marketing director called screaming. "Why are you publishing our worst reviews on the front page?"&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
SaaS company wanted customer testimonials on their site. They had been manually collecting reviews via email, then copying approved ones to the website. Slow process. They wanted automation.&lt;br&gt;
The system I built was simple. After a customer used the product for thirty days, send an automated email asking for feedback. Customer clicks a link, fills out a review form, submits. The review posts automatically to a testimonials section on the homepage.&lt;br&gt;
Fast. Efficient. No manual work required. Tested with eight beta users who all left positive reviews. Deployed.&lt;/p&gt;

&lt;p&gt;The Public Disaster&lt;br&gt;
The first few days were fine. Positive reviews appeared. Marketing was happy. Then the negative ones started showing up.&lt;br&gt;
"Terrible customer support. Took 5 days to get a response." One star. Published immediately on the homepage.&lt;br&gt;
"Product is buggy and crashes constantly. Waste of money." One star. Live on the site.&lt;br&gt;
"Tried to cancel my subscription, they made it impossible. Avoid this company." One star. Front and center.&lt;br&gt;
By the end of week two, the homepage testimonials section showed seventeen reviews. Nine were one or two stars. Only eight were positive.&lt;br&gt;
Every potential customer visiting the site saw a wall of complaints before seeing any product information.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
My automation posted every review immediately upon submission. No filter. No approval step. No human review. The form said "Share your feedback" and the system shared it, instantly and publicly, exactly as written.&lt;br&gt;
I had assumed most reviews would be positive. The company had good customer satisfaction scores. Surely most feedback would be praise. A few negative reviews mixed in would look authentic, I thought.&lt;br&gt;
I was wrong on every assumption.&lt;/p&gt;

&lt;p&gt;The Pattern&lt;br&gt;
Customers who loved the product rarely filled out feedback forms. They were busy using the product. Happy customers are silent customers.&lt;br&gt;
Customers with problems filled out the form immediately. Frustrated users had time and motivation to write detailed complaints. Angry customers wanted to be heard.&lt;br&gt;
The result was selection bias. The feedback system captured complaints at a much higher rate than praise, because complaints were the primary driver of form completion.&lt;br&gt;
The second problem was tone. When someone fills out a private feedback form, they write differently than when writing a public review. The form instructions said "Share your feedback with us" which sounded internal. Users wrote raw, unfiltered complaints meant for the support team.&lt;br&gt;
Those raw complaints went live on the homepage word-for-word. Profanity included. Spelling errors included. No context, no resolution status, no company response. Just pure venting, published as social proof.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried adding a simple profanity filter. If the review contained bad words, do not publish it automatically.&lt;br&gt;
That stopped the worst language but did not solve the core issue. Reviews like "Your product is garbage" and "Worst experience ever" did not trigger profanity filters but were still terrible homepage content.&lt;br&gt;
I tried a sentiment analysis filter. Only publish reviews with positive sentiment scores.&lt;br&gt;
That made the testimonials section look fake. Every review was glowing. No criticism. No authenticity. Potential customers assumed the reviews were fabricated or cherry-picked. Trust actually decreased.&lt;/p&gt;

&lt;p&gt;The Real Solution Was Human Approval&lt;br&gt;
The fix required stepping back from full automation. Reviews are now collected automatically but published manually.&lt;br&gt;
When a customer submits feedback, it goes into a review queue visible to the marketing team. Positive reviews can be approved for public display. Negative reviews are routed to customer support for follow-up instead of being published.&lt;br&gt;
Critical feedback is still valued and acted on internally. But the homepage only shows reviews that were explicitly approved for public use.&lt;br&gt;
I also split the form into two paths. "Leave a public testimonial" versus "Send private feedback to our team." The wording sets expectations. Public testimonials go through approval. Private feedback goes to support and stays internal.&lt;/p&gt;
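&lt;p&gt;The two-path routing and approval queue can be sketched roughly like this. The Review shape and the queue names are hypothetical, not the production schema:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Review:
    text: str
    rating: int       # 1-5 stars
    channel: str      # "public_testimonial" or "private_feedback"

approval_queue = []   # marketing reviews these before anything goes public
support_inbox = []    # complaints routed for follow-up, never published
homepage = []         # only explicitly approved testimonials appear here

def submit(review):
    # Path 1: private feedback never enters the public pipeline.
    if review.channel == "private_feedback":
        support_inbox.append(review)
    # Negative public submissions go to support for follow-up first.
    elif review.rating <= 2:
        support_inbox.append(review)
    # Path 2: public testimonials wait for explicit human approval.
    else:
        approval_queue.append(review)

def approve(review):
    # The only route onto the homepage is a deliberate human action.
    approval_queue.remove(review)
    homepage.append(review)

good = Review("Saved our team hours every week.", 5, "public_testimonial")
vent = Review("Support took 5 days to respond.", 1, "private_feedback")
harsh = Review("Buggy and crashes constantly.", 1, "public_testimonial")

for r in (good, vent, harsh):
    submit(r)
approve(good)

print([r.text for r in homepage])  # only the approved testimonial
print(len(support_inbox))          # 2: the private vent and the 1-star review
```

&lt;p&gt;Nothing reaches the homepage list without an explicit approve call, which is the whole point of the fix.&lt;/p&gt;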

&lt;p&gt;What Changed&lt;br&gt;
After implementing approval workflow, the homepage showed only reviews that customers explicitly intended as public testimonials or that the marketing team felt comfortable displaying.&lt;br&gt;
Negative feedback still came in, but it went to the support team where it belonged instead of being broadcast to potential customers.&lt;br&gt;
Customers with complaints got responses and solutions. Many of those customers later submitted positive testimonials after their issues were resolved.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, the homepage displayed every raw submission immediately, including nine one- or two-star reviews out of seventeen total. Negative reviews visible to all site visitors. Conversion rate dropped thirty two percent. Several potential customers mentioned the bad reviews in sales calls as reasons for hesitation.&lt;br&gt;
After the fix, the homepage showed only approved testimonials. Negative feedback still captured, but routed internally. Conversion rate recovered and exceeded baseline. Sales team stopped hearing objections about bad reviews.&lt;br&gt;
The business impact was significant. Raw negative reviews had been costing conversions. Filtered testimonials built trust. Internal feedback still reached support teams for product improvement without damaging the public brand.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Automation is not always the answer. Some processes need human judgment. Feedback collection can be automated, but publication requires curation. Selection bias means negative feedback often outweighs positive in voluntary submission systems.&lt;br&gt;
Most importantly, users write differently when they think feedback is private versus public. Instructions and expectations matter. A form asking for "your honest feedback" will get brutal honesty that should not be published verbatim.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
An automated review system published every submission immediately, including raw negative feedback that was meant to be private. The homepage became a wall of complaints. The fix was adding human approval for public display while keeping internal feedback channels open for genuine customer input.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: reviews, testimonials, automation, feedbackcollection, contentmoderation, socialproof&lt;/p&gt;

</description>
      <category>help</category>
      <category>discuss</category>
      <category>ai</category>
      <category>testimonial</category>
    </item>
    <item>
      <title>The Product Recommender That Only Suggested Out-of-Stock Items (Inventory Integration Fail)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:26:59 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-product-recommender-that-only-suggested-out-of-stock-items-inventory-integration-fail-29n9</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-product-recommender-that-only-suggested-out-of-stock-items-inventory-integration-fail-29n9</guid>
      <description>&lt;p&gt;I built a product recommendation bot for an e-commerce site. It analyzed what customers were looking at and suggested related items. Smart recommendations. Higher conversions. Standard upsell automation.&lt;br&gt;
First week, conversion rate dropped by forty one percent. Customers were leaving angry reviews. "Why does your bot keep recommending things you don't have?"&lt;br&gt;
The bot was only suggesting out-of-stock products.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
E-commerce company selling electronics. Customers browsed products, and the bot suggested complementary items. Looking at a laptop, the bot suggests a laptop bag, mouse, and external drive. Looking at a camera, the bot suggests memory cards, tripods, lenses.&lt;br&gt;
I connected the bot to their product catalog API. It pulled product data, analyzed relevance, and made recommendations based on what fit well together.&lt;br&gt;
Tested with fifty browsing sessions. Recommendations looked perfect. Related products. Good pairings. Deployed.&lt;/p&gt;

&lt;p&gt;The Complaints Start&lt;br&gt;
Within three days, customer complaints arrived in waves. People clicking recommended products and landing on "Out of Stock" pages. Others adding recommended items to cart only to see "Item Unavailable" at checkout.&lt;br&gt;
One customer left a review: "This site keeps suggesting products they don't even sell anymore. Waste of my time."&lt;br&gt;
I checked the recommendation logs. Out of four hundred twelve recommendations made that week, three hundred seventy nine were out-of-stock items. Ninety two percent of suggestions were unavailable.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
My product recommendation logic searched the entire catalog for relevant matches. Laptop cases that pair well with laptops. Camera lenses that fit specific camera models. The algorithm found the best technical matches.&lt;br&gt;
But it never checked stock levels. The API call pulled product details, descriptions, specifications, and compatibility data. I never added inventory status to the query.&lt;br&gt;
The result was recommendations based purely on product fit, ignoring whether the item could actually be purchased. The bot became an expert at suggesting things customers could not buy.&lt;/p&gt;

&lt;p&gt;The Irony&lt;br&gt;
The out-of-stock products were often the most popular items. High demand meant frequent stockouts. High popularity also meant strong historical sales data, which made the recommendation algorithm rank them highly.&lt;br&gt;
The bot was recommending the best products, the ones customers actually wanted, at exactly the moment those products were unavailable. It was technically correct and practically useless.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I added inventory checks after generating recommendations. If an item was out of stock, the bot removed it from the list and suggested the next best alternative.&lt;br&gt;
That created a new problem. The backup recommendations were worse matches. A customer looking at a high-end camera got recommendations for cheap accessories that did not fit the quality level, because all the premium accessories were out of stock.&lt;br&gt;
The bot went from suggesting unavailable great products to suggesting available mediocre products. Conversion still dropped.&lt;/p&gt;

&lt;p&gt;The Real Solution Was Smart Filtering&lt;br&gt;
The fix required filtering at the source, not after the fact. The recommendation algorithm now checks inventory status during the matching process, not after.&lt;br&gt;
When building recommendations, the system first filters the catalog to only in-stock items, then applies the relevance algorithm within that subset. If the best match is out of stock, it never enters consideration at all.&lt;br&gt;
I also added a secondary filter based on stock velocity. If an item has fewer than five units remaining and historical data shows it sells twenty units per day, the system treats it as effectively out of stock, because it will likely be gone before the customer completes checkout.&lt;br&gt;
For high-value recommendations where no good in-stock alternative exists, the system shows a placeholder: "Customers also liked [Product Name] – currently restocking, available [estimated date]." That way, the customer knows about the better option without being sent to a dead end.&lt;/p&gt;
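&lt;p&gt;One way to encode filter-first matching plus the stock-velocity rule. The catalog fields are hypothetical, and the quarter-day buffer is an illustrative stand-in for the real threshold:&lt;/p&gt;

```python
DAILY_SALES_BUFFER = 0.25  # treat < a quarter-day of stock as unavailable

catalog = [
    {"name": "Premium laptop bag", "stock": 0,  "daily_sales": 4,  "relevance": 0.95},
    {"name": "Popular USB-C hub",  "stock": 3,  "daily_sales": 20, "relevance": 0.90},
    {"name": "Laptop sleeve",      "stock": 40, "daily_sales": 5,  "relevance": 0.80},
]

def effectively_in_stock(item):
    # Low stock plus high velocity counts as out of stock: the item will
    # likely be gone before the customer reaches checkout.
    if item["stock"] == 0:
        return False
    return item["stock"] >= item["daily_sales"] * DAILY_SALES_BUFFER

def recommend(catalog, k=3):
    # Filter FIRST, then rank: out-of-stock items never enter consideration,
    # so the relevance ranking happens within what is actually purchasable.
    available = [i for i in catalog if effectively_in_stock(i)]
    return sorted(available, key=lambda i: i["relevance"], reverse=True)[:k]

print([i["name"] for i in recommend(catalog)])  # ['Laptop sleeve']
```

&lt;p&gt;The hub is excluded even though three units remain, because at twenty sales per day those units will not survive a checkout flow.&lt;/p&gt;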

&lt;p&gt;What Changed&lt;br&gt;
Customers browsing laptops now saw in-stock laptop bags, in-stock mice, and in-stock drives. If the premium option was unavailable, they saw the next-best option that was actually purchasable.&lt;br&gt;
The recommendations stopped being technically perfect and started being practically useful. Conversion rates recovered and then exceeded the pre-bot baseline.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, ninety two percent of recommendations were out of stock, conversion dropped forty one percent, cart abandonment spiked, and customer complaints flooded in.&lt;br&gt;
After the fix, ninety seven percent of recommendations were in stock, conversion exceeded baseline by eighteen percent, cart abandonment returned to normal, and complaints about recommendations stopped entirely.&lt;br&gt;
The business impact was immediate. Revenue per session increased because customers were clicking recommendations and actually buying. Customer satisfaction improved because the site stopped wasting their time.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Relevance without availability is useless. The best recommendation is worthless if the customer cannot buy it. Inventory status is not optional metadata. It is a primary filter.&lt;br&gt;
Systems that recommend products must treat stock levels as a core constraint, not an afterthought. Filtering after recommendation generation creates poor fallback suggestions. Filtering during recommendation generation ensures quality matches within what is actually available.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
A recommendation bot that ignored inventory status suggested unavailable products ninety two percent of the time. The fix was checking stock levels during the matching process, not after, and treating low inventory as effectively out of stock to avoid checkout failures.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: ecommerce, recommendations, inventory, automation, productsuggestions, conversionoptimization&lt;/p&gt;

</description>
      <category>inventory</category>
      <category>automation</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Onboarding Flow That Ghosted 300 New Users (Missing Conditional Logic)</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Thu, 22 Jan 2026 09:24:40 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-onboarding-flow-that-ghosted-300-new-users-missing-conditional-logic-42an</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-onboarding-flow-that-ghosted-300-new-users-missing-conditional-logic-42an</guid>
      <description>&lt;p&gt;I built an onboarding automation for a SaaS product. New users signed up, got a welcome email, then the system walked them through setup in five steps. Each step triggered the next. Simple sequential flow.&lt;br&gt;
Three hundred users signed up in the first week. Only nineteen completed onboarding. The other two hundred eighty one just disappeared after step one.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
SaaS onboarding needed automation. New user signs up, system sends welcome email, then guides them through account setup, profile completion, product tour, first project creation, and invitation to their first team call.&lt;br&gt;
Each step was supposed to trigger automatically when the previous step finished. I built it in n8n with email triggers, database checks, and timed delays.&lt;br&gt;
Tested with twelve users across three days. Everyone completed all five steps perfectly. Deployed Friday.&lt;/p&gt;

&lt;p&gt;The Vanishing Users&lt;br&gt;
Monday morning, the CEO asked why almost no one was finishing onboarding. I pulled the data. Three hundred signups. Two hundred eighty one stuck at step one. Nineteen made it through.&lt;br&gt;
I checked the logs. Every stuck user followed the same pattern. They received the welcome email. They clicked the link. They landed on the setup page. Then nothing. No step two email. No follow-up. The automation just stopped.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
My workflow assumed a perfect path. User completes step one, trigger fires, step two begins. But I never built logic for what happens if step one is not completed.&lt;br&gt;
The reality was messy. Users clicked the welcome email, looked at the setup page, then left to check something, got distracted, or decided to come back later. They did not click the final "Complete Setup" button.&lt;br&gt;
Without that button click, the trigger for step two never fired. The system interpreted incomplete step one as no step one at all. Since step two depended on step one being fully done, it never sent. Step three depended on step two. Step four on step three. The whole chain collapsed at the first gap.&lt;br&gt;
The users were not gone. They were waiting for instructions that would never come.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried adding a timer. If step one is not completed in 24 hours, send a reminder email.&lt;br&gt;
That technically worked, but it did not solve the core problem. Users who came back and finished step one after the reminder still did not trigger step two, because I had built the system to only check completion status at the moment of initial signup, not continuously.&lt;/p&gt;

&lt;p&gt;The Real Solution Was State Tracking&lt;br&gt;
The breakthrough was realizing onboarding is not a sequence, it is a state machine. Users do not move in straight lines. They jump around, pause, return, skip, and backtrack.&lt;br&gt;
I rebuilt the logic around state, not sequence. Every user has a current onboarding state stored in the database. Signed up but setup incomplete. Setup done but profile empty. Profile done but no project. And so on.&lt;br&gt;
The automation checks state every hour for all active users. If a user is stuck in setup incomplete for more than six hours, send a nudge email. If they complete setup, immediately check their state and fire the next appropriate step, regardless of when or how they finished.&lt;br&gt;
If a user skips ahead and creates a project before finishing their profile, the system adapts. It does not rigidly block them or ghost them. It adjusts the remaining onboarding steps based on what is already done.&lt;br&gt;
Conditional logic replaced linear sequence.&lt;/p&gt;
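&lt;p&gt;The state-based decision can be sketched as a single function that an hourly job calls for every active user. Step names and the six-hour nudge threshold follow the description above; the user record shape is hypothetical:&lt;/p&gt;

```python
from datetime import datetime, timedelta

STEPS = ["setup", "profile", "project", "tour", "team_call"]

def next_action(user, now):
    """Decide what to send based on current state, not position in a sequence."""
    done = user["completed"]                     # set of finished steps
    remaining = [s for s in STEPS if s not in done]
    if not remaining:
        return "onboarding_complete"
    current = remaining[0]
    # Nudge users stuck on the same step for more than six hours...
    if now - user["last_activity"] > timedelta(hours=6):
        return f"nudge:{current}"
    # ...otherwise fire the next appropriate step immediately.
    return f"send:{current}"

now = datetime(2026, 1, 26, 12, 0)
fresh = {"completed": set(), "last_activity": now - timedelta(hours=1)}
stuck = {"completed": set(), "last_activity": now - timedelta(hours=9)}
skipped = {"completed": {"setup", "project"}, "last_activity": now - timedelta(hours=1)}

print(next_action(fresh, now))    # send:setup
print(next_action(stuck, now))    # nudge:setup
print(next_action(skipped, now))  # send:profile (project already done, not re-sent)
```

&lt;p&gt;Because the function only looks at what is done, a user who returns after three days or completes steps out of order gets the right next message automatically.&lt;/p&gt;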

&lt;p&gt;What Changed&lt;br&gt;
Users who paused at step one and came back three days later were no longer abandoned. The system recognized their return, saw setup was now complete, and immediately sent step two.&lt;br&gt;
Users who completed steps out of order were not penalized. If someone invited a teammate before finishing the product tour, the system noticed and skipped the invite step in the normal sequence.&lt;br&gt;
Users who never finished certain steps got periodic reminders but were not locked out of later steps that did not depend on the incomplete ones.&lt;br&gt;
The flow became flexible instead of rigid.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
After the fix, I reprocessed the two hundred eighty one stuck users manually to restart their flows. Within 48 hours, one hundred ninety three of them completed onboarding. The others were genuinely inactive, not system-ghosted.&lt;br&gt;
For new signups after the fix, completion rates jumped. Out of the next 400 signups, 312 completed onboarding. That is seventy eight percent compared to the original six percent.&lt;br&gt;
The difference was not better users. It was a system that adapted to how real people actually behave.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Linear flows break the moment users deviate. Onboarding is not a straight line. It is a web of possible states. Systems must track state, not sequence. Conditional logic must account for pauses, skips, returns, and out-of-order actions.&lt;br&gt;
Most importantly, "user did not complete step one" does not mean "user left forever." It often just means "user is not done yet." The system must be patient and adaptive, not rigid and punishing.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
A sequential onboarding flow ghosted three hundred users because it could not handle incomplete steps or non-linear behavior. The fix was replacing sequence with state tracking and building conditional logic that adapts to real user patterns instead of assuming perfect linear completion.&lt;/p&gt;

&lt;p&gt;Written by Farhan Habib Faraz&lt;br&gt;
Senior Prompt Engineer building conversational AI and voice agents&lt;/p&gt;

&lt;p&gt;Tags: onboarding, automation, conditionallogic, userexperience, workflows, statemachine&lt;/p&gt;

</description>
      <category>automation</category>
      <category>onboarding</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Notification System That Sent 12,000 Messages at 3 AM</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Wed, 21 Jan 2026 09:09:14 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-notification-system-that-sent-12000-messages-at-3-am-3hgb</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-notification-system-that-sent-12000-messages-at-3-am-3hgb</guid>
      <description>&lt;p&gt;I built a notification system for a global SaaS company that sent reminder emails and SMS to users. The first week looked perfect. The second week, 12,000 users got notifications at 3 AM local time. Phones buzzing, inboxes flooding, and support tickets piling up with the same message: why are you spamming me at 3 AM.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
The platform had users in 47 countries and needed automated reminders that were supposed to land during business hours in each user’s local time. Trial ending in three days should trigger an email at 9 AM. Payment due tomorrow should trigger an SMS at 10 AM. Webinar starting in one hour should trigger a push notification. Account inactive for seven days should trigger a re-engagement email at 2 PM. I built the workflow in n8n, connected it to their database, email service, and SMS gateway, tested with 50 users across three time zones, and deployed.&lt;/p&gt;

&lt;p&gt;The 3 AM Disaster&lt;br&gt;
In week two, Monday morning, my phone exploded. The logs showed blocks of sends happening in the middle of the night for multiple regions. Notifications were being sent at around 2 to 4 AM for parts of the US, late evening for parts of Asia, and only looked correct for the UK by coincidence. Total volume crossed 12,000 messages across email and SMS in a single night window.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
My workflow had one fatal assumption. It ran daily at 9 AM server time, then checked all users who needed notifications and sent immediately. The server time was 9 AM UTC. When it was 9 AM UTC, it was 4 AM in New York, 1 AM in Los Angeles, 9 AM in London, 6 PM in Tokyo, and 8 PM in Sydney. Only a slice of Europe got “business hours.” Everyone else got night-time spam.&lt;/p&gt;

&lt;p&gt;The Logic I Missed&lt;br&gt;
I thought “send at 9 AM” meant 9 AM in the user’s time zone. I implemented “send at 9 AM” as 9 AM on the server, then broadcast to everyone. What I actually needed was per-user scheduling. For each user, calculate when it is 9 AM in their own time zone, then send at that moment.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
My first fix was to add a timezone field for users and check it before sending. But the workflow still ran at 9 AM UTC. At 9 AM UTC, most users were not at 9 AM local time, so the system didn’t send to them. The result was the opposite failure: almost nobody received notifications, because the workflow had no mechanism to wake up at the correct hour for each user.&lt;/p&gt;

&lt;p&gt;The Real Solution&lt;br&gt;
The system needed timezone-aware scheduling, not a single daily blast.&lt;br&gt;
The first change was storing a real IANA timezone identifier for every user, like America/New_York or Asia/Tokyo, not a raw offset like UTC minus five. That matters because offsets do not handle daylight saving time.&lt;br&gt;
The second change was converting the target local send time into UTC for execution. If a user is in New York and they should receive a 9 AM reminder, the system converts that 9 AM local moment into the correct UTC timestamp and schedules against UTC. Tokyo users will map to a different UTC hour, and that is fine.&lt;br&gt;
The third change was execution cadence. Instead of running once per day, the workflow runs hourly. Every hour, it checks which users have notifications due in that hour window in their local time, converts that to UTC, and sends only to that matching cohort. This spreads the global send load across 24 hours and makes “9 AM local” possible for everyone.&lt;/p&gt;
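&lt;p&gt;The hourly, per-cohort pass can be sketched like this, assuming illustrative user records and a fixed 9 AM local send hour (a real run would live behind a cron or n8n hourly trigger):&lt;/p&gt;

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

SEND_HOUR = 9  # 9 AM in each user's local time

users = [
    {"email": "a@example.com", "tz": "America/New_York"},
    {"email": "b@example.com", "tz": "Asia/Tokyo"},
    {"email": "c@example.com", "tz": "Europe/London"},
]

def due_this_hour(users, now_utc):
    # Each hourly run selects only the cohort whose local clock reads 9 AM.
    return [u["email"] for u in users
            if now_utc.astimezone(ZoneInfo(u["tz"])).hour == SEND_HOUR]

# 14:00 UTC in January is 9 AM in New York (EST, UTC-5)...
print(due_this_hour(users, datetime(2026, 1, 21, 14, 0, tzinfo=timezone.utc)))
# ...while Tokyo's 9 AM cohort matched fourteen hours earlier, at 00:00 UTC.
print(due_this_hour(users, datetime(2026, 1, 21, 0, 0, tzinfo=timezone.utc)))
```

&lt;p&gt;Each hourly run matches a different slice of the user base, so "9 AM local" is delivered to everyone over the course of a day instead of in one UTC blast.&lt;/p&gt;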

&lt;p&gt;Daylight Saving Time Was the Hidden Landmine&lt;br&gt;
DST shifts break systems that store offsets. A New York user’s “9 AM” maps to different UTC times depending on the season. Using timezone names lets the timezone database handle the shift automatically, so 9 AM remains 9 AM for the user even when the offset changes.&lt;/p&gt;
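&lt;p&gt;A quick demonstration of why IANA names beat stored offsets: the same 9 AM in New York maps to UTC-5 in winter and UTC-4 in summer, and the timezone database handles the shift without any code changes:&lt;/p&gt;

```python
from datetime import datetime
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
winter = datetime(2026, 1, 15, 9, 0, tzinfo=ny)  # EST
summer = datetime(2026, 7, 15, 9, 0, tzinfo=ny)  # EDT

print(winter.utcoffset())  # -1 day, 19:00:00  (i.e. UTC-5)
print(summer.utcoffset())  # -1 day, 20:00:00  (i.e. UTC-4)
```

&lt;p&gt;A system that had stored "UTC-5" for this user would drift an hour off target for half the year.&lt;/p&gt;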

&lt;p&gt;Edge Cases That Matter&lt;br&gt;
Travel breaks assumptions. A user who signs up in New York and later moves to California will still receive notifications at New York’s 9 AM unless they update their timezone. The correct approach is to let users update timezone in settings and optionally prompt them when location changes are detected.&lt;br&gt;
Missing timezone data is common for new users. The practical approach is to guess from IP on signup, ask the user to confirm, and if unconfirmed, default to UTC while marking those sends as lower priority until timezone is set.&lt;br&gt;
DST transitions create ambiguous local times during the fall-back hour. The safest operational rule is to avoid scheduling sensitive messaging in the 1 AM to 3 AM window during DST transition weekends when feasible, or to rely on timezone-aware libraries that disambiguate with explicit offsets.&lt;br&gt;
The international date line makes “today” a different date for different users. The only safe method is always tracking date, time, and timezone together rather than assuming a shared calendar day.&lt;/p&gt;

&lt;p&gt;The Transformation&lt;br&gt;
After the fix, New York users received messages at 9 AM local, Tokyo users at 9 AM local, and London users at 9 AM local, but this time it worked by design rather than accident. The middle-of-night sends disappeared.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, roughly 65 percent of notifications went out at the wrong local time, thousands of users received night-time messages, unsubscribes spiked, and support tickets exploded. After the fix, the vast majority of notifications landed inside business-hour windows, night-time sends dropped to near zero, complaints collapsed, and unsubscribe rates normalized.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Server time is never user time. Timezone conversion is not simple math. It includes DST, date line effects, and regional rules. Testing with a few US time zones is not global testing. And timezone math should never be hand-rolled when libraries and timezone databases exist to handle it safely.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
One oversight, scheduling by server time instead of user time, spammed 12,000 people at 3 AM. The fix was timezone-aware scheduling with per-user conversion and frequent execution windows, so notifications fire when users are awake and ready to engage.&lt;/p&gt;

&lt;p&gt;Written by FARHAN HABIB FARAZ, Senior Prompt Engineer and Team Lead at PowerInAI&lt;br&gt;
Building AI automation that adapts to humans.&lt;/p&gt;

&lt;p&gt;Tags: timezones, automation, notifications, scheduling, workflows, globalusers&lt;/p&gt;

</description>
      <category>automation</category>
      <category>notification</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Translation Bot That Turned Professional Emails Into Insults</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Wed, 21 Jan 2026 09:02:37 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-translation-bot-that-turned-professional-emails-into-insults-3e02</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-translation-bot-that-turned-professional-emails-into-insults-3e02</guid>
      <description>&lt;p&gt;I built an email translation bot for a Bangladesh-based company working with international clients. English to Bengali. Bengali to English. Fully automated translation to speed up cross-border communication.&lt;br&gt;
On day two, a client replied with a single line that stopped everything.&lt;br&gt;
“Why is your team being so rude to me?”&lt;br&gt;
The team hadn’t been rude. The translation bot had.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
This was an export company. Bangladeshi suppliers. International buyers. Daily communication flowing in both Bengali and English.&lt;br&gt;
The sales team was most comfortable writing emails in Bengali. Clients expected English. Incoming emails arrived in English, but internal coordination happened in Bengali. Translation felt like the obvious automation win.&lt;br&gt;
The system was simple. Outgoing Bengali emails were translated to English and sent to clients. Incoming English emails were translated to Bengali and shown to the team. Google Translate API did the heavy lifting.&lt;br&gt;
I tested with around twenty sample emails. Everything looked fine. Nothing alarming. We deployed on a Monday.&lt;/p&gt;

&lt;p&gt;The First Complaint&lt;br&gt;
Tuesday morning, a British client emailed back clearly upset. The sales manager panicked and pulled up his original message, written in Bengali.&lt;br&gt;
“আপনার অর্ডারটি আমরা এখনো পাইনি। দয়া করে একটু দেখবেন?”&lt;br&gt;
The meaning was polite and neutral. We haven’t received your order yet. Could you please check.&lt;br&gt;
What the client actually received was very different.&lt;br&gt;
“We still didn’t get your order. You need to check this now.”&lt;br&gt;
The polite “দয়া করে” turned into a command. The soft “একটু” disappeared. A respectful follow-up became a demand.&lt;br&gt;
That single shift in tone was enough to damage a three-year business relationship.&lt;/p&gt;

&lt;p&gt;When Patterns Started Appearing&lt;br&gt;
Once we looked closer, the issue was everywhere.&lt;br&gt;
A formal Bengali request asking someone to kindly review a document turned into a casual, almost dismissive English sentence. A careful apology about pricing became a blunt refusal. Professional gratitude turned into overexcited, exclamation-filled English that sounded childish rather than respectful.&lt;br&gt;
The reverse direction was worse.&lt;br&gt;
An English email saying “We need clarification on the delivery timeline” was translated into Bengali in a way that flipped responsibility. The team read it as the client offering to explain something, when in reality the client was asking the team for the explanation.&lt;br&gt;
In other cases, literal translations produced Bengali sentences that were grammatically confusing and culturally unnatural. The team regularly asked, “What does this even mean?”&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
Translation APIs are optimized for literal meaning and general usage. They are not optimized for business formality, cultural hierarchy, or relationship-sensitive language.&lt;br&gt;
Bengali business communication relies heavily on softeners, indirect phrasing, and respect markers. Words like “দয়া করে,” “একটু,” and “যদি সম্ভব হয়” carry social weight that does not map cleanly to English.&lt;br&gt;
English business communication, on the other hand, is more direct and task-oriented. Literal translation strips Bengali emails of politeness, making them sound rude. Literal translation of English into Bengali often sounds stiff, unclear, or even accusatory.&lt;br&gt;
The language was technically correct. The tone was catastrophically wrong.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
My first instinct was to add politeness automatically. If something sounded blunt, just soften it.&lt;br&gt;
That backfired immediately.&lt;br&gt;
Simple instructions like “Check this” turned into long, overly polite English sentences that sounded sarcastic or artificial. Instead of professionalism, we got awkwardness.&lt;br&gt;
It became clear that tone could not be fixed by blindly adding polite words.&lt;/p&gt;

&lt;p&gt;The Real Fix Was Context&lt;br&gt;
The breakthrough was realizing that translation cannot be language-only. It must be context-aware.&lt;br&gt;
Before translating anything, the system now detects the communication type. Is it a formal B2B email, an ongoing client relationship, or an internal team message?&lt;br&gt;
Next, it identifies politeness markers in the source language. Bengali softeners are not translated word for word. They are converted into structurally polite English. English directness is converted into culturally appropriate Bengali phrasing.&lt;br&gt;
Finally, the system verifies subject and object clarity. “We need you to do X” must never become “You need to do X” unless that was the original intent.&lt;br&gt;
Translation became a two-step process. First, preserve intent and tone. Then, convert language.&lt;/p&gt;
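&lt;p&gt;A rough sketch of that two-step flow in Python. The marker table, metadata fields, and translate_fn are illustrative placeholders, not the production system:&lt;/p&gt;

```python
# Sketch of the two-step flow. The marker table and translate_fn are
# illustrative placeholders, not the production system.
POLITENESS_MARKERS = {
    "দয়া করে": "could you please",   # literal "please" can read as a command
    "একটু": "when you get a chance",  # softener with no one-word English twin
}

def classify_context(meta):
    # Step 0: the target register depends on who is reading.
    return "formal_b2b" if meta.get("external") else "internal"

def translate_with_tone(text, meta, translate_fn):
    context = classify_context(meta)
    # Step 1: preserve intent and tone. Rewrite politeness markers into
    # structurally polite English before the literal engine sees them.
    for marker, english in POLITENESS_MARKERS.items():
        text = text.replace(marker, english)
    # Step 2: only then convert the language.
    return translate_fn(text), context
```

&lt;p&gt;The design choice is ordering: tone is resolved before the literal engine runs, so softeners are never silently dropped.&lt;/p&gt;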

&lt;p&gt;What Changed After&lt;br&gt;
The same emails suddenly sounded right.&lt;br&gt;
Polite Bengali follow-ups became professional English requests. Formal refusals stayed respectful. Incoming English emails became clear, polite Bengali messages that the team immediately understood.&lt;br&gt;
Most importantly, no one felt insulted.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, clients complained about tone. Sales teams manually rewrote translated emails. Automation created more work instead of saving time.&lt;br&gt;
After the fix, complaints dropped to zero. Manual rewrites almost disappeared. The team trusted the system again. International client satisfaction improved noticeably.&lt;br&gt;
The automation finally did what it was supposed to do.&lt;/p&gt;

&lt;p&gt;What This Taught Me&lt;br&gt;
Literal translation is not the same as appropriate translation. Politeness does not map cleanly across languages. Context always determines tone.&lt;br&gt;
Most importantly, translation systems must understand relationships, not just sentences.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
Direct translation turned respectful business emails into rude commands.&lt;br&gt;
The fix was not a better translation engine. It was adding cultural and contextual intelligence before and after translation.&lt;br&gt;
Now the system translates meaning, intent, and tone, not just words.&lt;/p&gt;

&lt;p&gt;Written by FARHAN HABIB FARAZ, Senior Prompt Engineer and Team Lead at PowerInAI&lt;br&gt;
Building AI automation that adapts to humans.&lt;/p&gt;

&lt;p&gt;Tags: translation, multilingual, bengali, crossculture, automation, businesscommunication&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>discuss</category>
      <category>multilingual</category>
    </item>
    <item>
      <title>The FAQ Bot That Made Up Answers When It Couldn’t Find Real Ones</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Wed, 21 Jan 2026 04:39:38 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-faq-bot-that-made-up-answers-when-it-couldnt-find-real-ones-1ml3</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-faq-bot-that-made-up-answers-when-it-couldnt-find-real-ones-1ml3</guid>
      <description>&lt;p&gt;I built an FAQ bot for a SaaS company. It answered customer questions using their internal knowledge base.&lt;br&gt;
For three weeks, it worked beautifully. Customers were happy. Support tickets went down. Everything looked stable.&lt;br&gt;
Then someone asked a simple question.&lt;br&gt;
Do you offer a student discount.&lt;br&gt;
The bot replied yes, students get forty percent off with a valid student ID and a code called STUDENT40.&lt;br&gt;
There was no student discount. There was no STUDENT40 code. The bot invented everything.&lt;br&gt;
By the time we noticed, eighty three students had already tried to use the fake code.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
This was a SaaS company with a massive knowledge base. Around two hundred pages of product documentation, pricing FAQs, and troubleshooting guides.&lt;br&gt;
They wanted a chatbot that could answer questions from the knowledge base, handle support twenty-four seven, and reduce ticket volume.&lt;br&gt;
The system followed a standard RAG flow. A customer asks a question. The bot searches the knowledge base. The bot answers using retrieved content.&lt;br&gt;
I tested it with fifty known questions. Every answer was accurate.&lt;br&gt;
We deployed.&lt;/p&gt;

&lt;p&gt;The Invisible Problem&lt;br&gt;
For the first three weeks, nothing looked wrong.&lt;br&gt;
Customer satisfaction was high. Resolution rate was strong. There were no complaints.&lt;br&gt;
Then support agents started seeing strange tickets.&lt;br&gt;
One customer said the bot told them to use STUDENT40 but the code did not work. Another said the bot claimed there was an iOS app that did not exist. Another tried to integrate with Salesforce because the bot mentioned it.&lt;br&gt;
None of these things were real.&lt;/p&gt;

&lt;p&gt;The Pattern&lt;br&gt;
The bot was answering questions that were not covered anywhere in the knowledge base.&lt;br&gt;
And instead of saying it didn’t know, it confidently made things up.&lt;br&gt;
I tested it deliberately.&lt;br&gt;
I asked if there was a lifetime deal. The bot invented a one-time plan priced at nine hundred ninety nine dollars.&lt;br&gt;
I asked if data could be exported to Excel. The bot gave step by step instructions for a feature that did not exist and added fake limits.&lt;br&gt;
I asked about enterprise refunds. The bot created a sixty day refund policy that was never documented.&lt;br&gt;
The answers were detailed, specific, and completely wrong.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
The problem was in my prompt.&lt;br&gt;
I told the bot to be helpful and provide complete answers. Then I added one dangerous instruction.&lt;br&gt;
If you don’t have exact information, use your best judgment to provide a helpful response.&lt;br&gt;
To a human, that sounds reasonable. To an LLM, it means guess.&lt;br&gt;
The internal logic was simple. The question had no answer in the knowledge base. The prompt said to be helpful anyway. The model relied on general SaaS patterns and invented something plausible.&lt;br&gt;
That is how hallucinations happen.&lt;/p&gt;

&lt;p&gt;The Confidence Trap&lt;br&gt;
The worst part was how confident the lies sounded.&lt;br&gt;
Real answers were careful. Fabricated answers were assertive.&lt;br&gt;
Yes, students get forty percent off. Go to Settings and click Export. Enterprise plans have a sixty day guarantee.&lt;br&gt;
The bot sounded more sure when it was wrong than when it was right.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
My first attempt was to tell the bot to only answer if the information was in the knowledge base.&lt;br&gt;
That broke legitimate questions.&lt;br&gt;
If the wording did not match exactly, the bot refused to answer even common questions like password resets, despite the information being clearly documented under a different heading.&lt;br&gt;
It became too strict.&lt;/p&gt;

&lt;p&gt;The Real Solution&lt;br&gt;
I introduced an explicit honesty protocol.&lt;br&gt;
Every answer now depended on confidence level.&lt;br&gt;
If the bot found an exact match, it answered directly and referenced the documentation.&lt;br&gt;
If it found related information, it shared what it knew and clearly stated the limitation.&lt;br&gt;
If it found nothing relevant, it said so immediately and escalated to a human.&lt;br&gt;
Most importantly, I explicitly forbade fabrication. No invented prices. No fake features. No imaginary integrations. No guessed policies.&lt;br&gt;
If the bot did not know, it had to say it did not know.&lt;/p&gt;
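&lt;p&gt;One way to sketch the honesty protocol is as a router over retrieval confidence. The similarity thresholds and field names below are illustrative assumptions, tune them against your own retriever:&lt;/p&gt;

```python
# Sketch of the honesty protocol as a confidence router. The similarity
# thresholds are illustrative: tune them against your own retriever.
EXACT_MATCH = 0.85
RELATED = 0.60

def route_answer(best_score, best_chunk):
    if best_score >= EXACT_MATCH:
        # Exact match: answer directly and cite the documentation.
        return {"action": "answer", "source": best_chunk}
    if best_score >= RELATED:
        # Related info: share it, but state the limitation out loud.
        return {"action": "answer_with_caveat", "source": best_chunk,
                "note": "Related docs found; exact answer not confirmed."}
    # Nothing relevant: say so and escalate. Fabrication is forbidden.
    return {"action": "escalate", "source": None,
            "note": "Not in our documentation. Connecting you to a human."}
```

&lt;p&gt;The prompt then only ever sees one of three explicit modes, so “use your best judgment” never enters the picture.&lt;/p&gt;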

&lt;p&gt;The Transformation&lt;br&gt;
When asked about student discounts, the bot now said it could not find information and offered to connect the user with sales.&lt;br&gt;
When asked about Excel exports, it acknowledged export functionality but avoided claiming unsupported formats.&lt;br&gt;
When asked about an iOS app, it confirmed Android availability and escalated for iOS clarification.&lt;br&gt;
Customers stopped chasing non-existent features.&lt;/p&gt;

&lt;p&gt;The Escalation Benefit&lt;br&gt;
Before, the bot created frustration by confidently lying.&lt;br&gt;
After, it created trust by being honest and escalating appropriately.&lt;br&gt;
Customers explicitly said they preferred a bot that admitted uncertainty over one that sent them on a wild goose chase.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, nearly a quarter of responses contained fabricated information. Support teams spent time correcting the bot. Brand trust suffered. Legal risk increased.&lt;br&gt;
After the fix, fabricated answers dropped to zero. About eighteen percent of conversations escalated to humans, which was exactly what should happen. Human resolution rates were high and customer trust recovered.&lt;/p&gt;

&lt;p&gt;The Audit Shock&lt;br&gt;
When we audited three weeks of conversations, the numbers were brutal.&lt;br&gt;
Out of two thousand eight hundred forty seven conversations, six hundred forty one contained false information.&lt;br&gt;
Most hallucinations were around discounts, features, integrations, policies, and pricing.&lt;br&gt;
Almost one in four helpful answers were lies.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Telling an AI to be helpful without constraints creates confident liars.&lt;br&gt;
Fabricated answers sound better than real ones.&lt;br&gt;
“I don’t know” is not a failure. It is a feature.&lt;br&gt;
LLMs are trained to complete responses, not to stop. You must teach them when stopping is correct.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
One instruction to use best judgment caused hundreds of fake answers in three weeks.&lt;br&gt;
The fix was not a better model. It was teaching the system that honesty beats completion.&lt;br&gt;
Now the bot knows when to answer and when to step aside.&lt;br&gt;
That is what real reliability looks like.&lt;/p&gt;

&lt;p&gt;Written by FARHAN HABIB FARAZ, Senior Prompt Engineer and Team Lead at PowerInAI&lt;br&gt;
Building AI automation that adapts to humans.&lt;/p&gt;

&lt;p&gt;Tags: hallucination, rag, knowledgebase, promptengineering, aiaccuracy, honesty&lt;/p&gt;

</description>
      <category>rag</category>
      <category>hallucination</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The AI That Quoted Customers Their Competitors’ Prices</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Wed, 21 Jan 2026 04:35:46 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-ai-that-quoted-customers-their-competitors-prices-1218</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-ai-that-quoted-customers-their-competitors-prices-1218</guid>
<description>&lt;p&gt;I built a pricing quote bot for an e-commerce company. It answered customer questions about product prices.&lt;br&gt;
On day three, a customer asked a simple question. How much is this camera.&lt;br&gt;
The bot replied with prices from three competitors.&lt;br&gt;
The customer asked why the bot was advertising other stores. The bot replied that it was providing a pricing comparison.&lt;br&gt;
The customer never asked for a comparison. They just wanted our price.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
This was an electronics retailer. Customers asked price questions through chat like how much is this laptop, what is the price of these headphones, or is this item on sale.&lt;br&gt;
The requirement was simple. When someone asks for a price, show the company’s price.&lt;br&gt;
I added web search so the bot could look up current catalog pricing, check sale status, and verify stock. I tested it with forty sample questions. Everything looked perfect.&lt;br&gt;
We deployed on Wednesday.&lt;/p&gt;

&lt;p&gt;The Leak Begins&lt;br&gt;
On Thursday morning, the first strange response appeared.&lt;br&gt;
A customer asked for the price of Sony headphones. The bot replied with prices from Amazon, Target, and Walmart, then added our price at the end.&lt;br&gt;
The customer replied asking for our price only.&lt;br&gt;
That was the first red flag.&lt;/p&gt;

&lt;p&gt;The Pattern&lt;br&gt;
Every pricing question followed the same pattern.&lt;br&gt;
The customer asked for a price.&lt;br&gt;
The bot searched the web.&lt;br&gt;
The search results contained competitor prices because they ranked highly.&lt;br&gt;
The bot included everything it found.&lt;br&gt;
The response accidentally promoted competitors.&lt;br&gt;
The bot was doing exactly what I told it to do.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
My prompt told the bot to search for current pricing information and be helpful and thorough.&lt;br&gt;
To the model, thorough meant sharing every relevant result it found. Web search does not distinguish between company data and market data. It just returns information.&lt;br&gt;
The bot treated competitor prices as relevant because they appeared in the search results.&lt;br&gt;
This was not a reasoning failure. It was a scope failure.&lt;/p&gt;

&lt;p&gt;The Real Problem&lt;br&gt;
The bot could not distinguish between information for the customer and information about the market.&lt;br&gt;
Once web search was added, competitor pricing entered the context. Without strict rules, the model had no reason to hide it.&lt;br&gt;
It assumed that more information was better.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
I tried telling the bot to only share our prices.&lt;br&gt;
That caused another failure. The bot could not reliably identify which price in the search results belonged to us, so it sometimes refused to answer at all.&lt;br&gt;
The instruction was correct, but the input was still polluted.&lt;/p&gt;

&lt;p&gt;The Real Solution&lt;br&gt;
I restricted the search itself.&lt;br&gt;
Instead of searching the entire web, the bot was only allowed to search our own domain. Competitor sites were explicitly excluded.&lt;br&gt;
The rules became simple. Only retrieve pricing from our website. Ignore everything else. Never mention competitors unless the customer explicitly asks for a comparison.&lt;br&gt;
Once competitor data stopped entering the context, the problem disappeared.&lt;/p&gt;
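&lt;p&gt;The scoping rule is small enough to sketch. The domain allowlist below is a placeholder, “ourstore.example” stands in for the company’s real domain:&lt;/p&gt;

```python
# Sketch: drop everything that is not our own catalog before it reaches
# the model's context. "ourstore.example" is a placeholder domain.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"ourstore.example", "www.ourstore.example"}

def scope_results(results, comparison_requested=False):
    if comparison_requested:
        # Explicit comparison request: the wider market is in scope.
        return results
    return [r for r in results
            if urlparse(r["url"]).hostname in ALLOWED_DOMAINS]
```

&lt;p&gt;Because competitor pages are filtered out before the model ever sees them, there is nothing in the context to overshare.&lt;/p&gt;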

&lt;p&gt;The Transformation&lt;br&gt;
The same questions now produced clean answers.&lt;br&gt;
When a customer asked for a product price, the bot returned a single number, stock status, and shipping information.&lt;br&gt;
No competitor names. No comparisons. No accidental promotions.&lt;br&gt;
When a customer explicitly asked whether our price was better than another store, the bot performed a separate comparison workflow and clearly labeled it as such.&lt;br&gt;
Implicit questions stayed company-focused. Explicit comparison requests triggered comparison logic.&lt;/p&gt;

&lt;p&gt;The Results&lt;br&gt;
Before the fix, competitor prices appeared in most pricing conversations. Customers frequently left the chat after seeing cheaper alternatives. Conversion rates were low and sales teams were frustrated.&lt;br&gt;
After the fix, competitor prices were never mentioned unless requested. Chat abandonment dropped sharply. Conversion rates more than doubled.&lt;br&gt;
The bot stopped acting like a market research tool and started acting like a sales assistant.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Web search without guardrails will always overshare.&lt;br&gt;
Being thorough is dangerous when relevance is not defined.&lt;br&gt;
Customers asking how much want one number, not a price landscape.&lt;br&gt;
Search scope matters more than response wording.&lt;br&gt;
The mistake was not giving the AI too much information. It was letting the wrong information into the context in the first place.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
The bot quoted competitor prices because I told it to search broadly and be thorough.&lt;br&gt;
The fix was not better phrasing. It was controlling where the bot was allowed to look and what it was allowed to say.&lt;br&gt;
Now the bot answers pricing questions without sending customers to competitors.&lt;/p&gt;

&lt;p&gt;Written by FARHAN HABIB FARAZ, Senior Prompt Engineer and Prompt Team Lead at PowerInAI&lt;br&gt;
Building AI that knows what to search and what to share.&lt;/p&gt;

&lt;p&gt;Tags: searchintegration, promptengineering, ecommerce, contextmanagement, automation&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>searchintegration</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Booking System That Created 47 Double-Bookings in One Morning</title>
      <dc:creator>FARHAN HABIB FARAZ</dc:creator>
      <pubDate>Wed, 21 Jan 2026 04:31:44 +0000</pubDate>
      <link>https://dev.to/faraz_farhan_83ed23a154a2/the-booking-system-that-created-47-double-bookings-in-one-morning-ad1</link>
      <guid>https://dev.to/faraz_farhan_83ed23a154a2/the-booking-system-that-created-47-double-bookings-in-one-morning-ad1</guid>
      <description>&lt;p&gt;I built an appointment booking bot for a dental clinic. It managed their calendar automatically.&lt;br&gt;
On the first morning it went live, forty seven appointments were scheduled on top of already booked time slots.&lt;br&gt;
Two patients showed up for the same 9 AM slot. Then three more at 9:30. Then four at 10 AM.&lt;br&gt;
The receptionist called me screaming.&lt;/p&gt;

&lt;p&gt;The Setup&lt;br&gt;
This was a dental clinic with three dentists handling appointments through WhatsApp, phone calls, and walk-ins. They wanted automation so a WhatsApp bot could handle bookings twenty-four seven, check availability, confirm appointments instantly, and sync everything with Google Calendar.&lt;br&gt;
This was a standard booking flow. I had built similar systems before. We deployed on Sunday night for a Monday morning launch.&lt;/p&gt;

&lt;p&gt;The Monday Morning Chaos&lt;br&gt;
The 9:00 AM slot with Dr. Ahmed was booked twice. The bot booked Mrs. Rahman through WhatsApp at 8:47 AM. Five minutes later, at 8:52 AM, a walk-in booking added Mr. Karim to the same slot.&lt;br&gt;
At 9:30 AM, a slot for Dr. Hasan was booked by Ms. Sultana through WhatsApp. Then a phone booking added Mr. Habib. Then another WhatsApp booking added Mrs. Begum.&lt;br&gt;
By 11 AM there were forty seven double and triple bookings. The waiting room was packed and patients were angry.&lt;/p&gt;

&lt;p&gt;Why This Happened&lt;br&gt;
The logic looked correct on paper. When a user requested a slot, the bot checked the calendar. If the slot looked free, it confirmed and booked it. If not, it suggested the next available time.&lt;br&gt;
The problem was timing.&lt;br&gt;
Booking confirmation took about three seconds. Calendar synchronization took between five and eight seconds.&lt;br&gt;
That gap was enough to break everything.&lt;/p&gt;

&lt;p&gt;The Race Condition&lt;br&gt;
When the first patient requested a slot, the bot checked the calendar and saw it as free. It immediately confirmed the booking to the patient. Only after that did it start writing the event to Google Calendar.&lt;br&gt;
Before the calendar finished updating, another booking request came in. The calendar still showed the slot as available. The second booking was accepted. Sometimes a third followed.&lt;br&gt;
In those few seconds between confirmation and persistence, the slot looked free to everyone else.&lt;/p&gt;
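&lt;p&gt;The gap can be reproduced deterministically, no real threads needed. This is a toy illustration of the check-then-act pattern, with invented names:&lt;/p&gt;

```python
# Deterministic illustration of the gap (no real threads needed): both
# requests pass the availability check before either write has landed.
calendar = set()      # persisted bookings: updates take 5-8 seconds
confirmations = []    # instant replies already sent to patients

def is_free(slot):
    return slot not in calendar

def confirm(slot, patient):
    # Reply first; the calendar write is still in flight.
    confirmations.append((slot, patient))

if is_free("9:00"):
    confirm("9:00", "Mrs. Rahman")   # calendar write starts... slowly
if is_free("9:00"):                  # calendar still empty a moment later
    confirm("9:00", "Mr. Karim")

calendar.add("9:00")                 # the write finally lands
# confirmations now holds two patients for one slot: a double booking.
```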

&lt;p&gt;Why Testing Didn’t Catch It&lt;br&gt;
During testing, bookings came in slowly. One request per minute. Everything worked fine.&lt;br&gt;
On Monday morning, real traffic hit. Ten booking requests per minute. WhatsApp messages, phone calls, walk-ins, all happening at once.&lt;br&gt;
The system behaved correctly in isolation and failed completely at scale.&lt;/p&gt;

&lt;p&gt;The Failed Fix&lt;br&gt;
My first attempt was to wait for the calendar write to complete before confirming the booking to the user.&lt;br&gt;
That made things worse.&lt;br&gt;
Users waited fifteen to twenty seconds without feedback. They assumed the bot was broken and sent the same request again. That created even more overlapping booking attempts.&lt;/p&gt;

&lt;p&gt;The Real Solution&lt;br&gt;
I added an instant locking layer.&lt;br&gt;
As soon as a user requested a slot and the calendar check passed, the bot immediately locked that slot in its own database. This lock happened in a fraction of a second.&lt;br&gt;
While the lock was active, the bot treated the slot as unavailable for everyone else. Only after the calendar write succeeded did the lock turn into a confirmed booking.&lt;br&gt;
If another request arrived while the lock was active, the bot suggested the next available time instead.&lt;/p&gt;

&lt;p&gt;How It Worked in Practice&lt;br&gt;
When Mrs. Rahman requested the 9 AM slot, the bot locked it instantly and told her the booking was in progress. While the calendar write was happening, Mr. Karim requested the same slot. The bot saw the lock and rejected the request, offering 9:30 instead.&lt;br&gt;
Once Google Calendar confirmed the booking, the lock was released and the slot was officially marked as booked.&lt;br&gt;
No overlap. No confusion.&lt;/p&gt;

&lt;p&gt;Edge Cases We Had to Handle&lt;br&gt;
Locks couldn’t live forever. If a calendar write failed, the slot would stay blocked. To prevent that, every lock expired automatically after thirty seconds.&lt;br&gt;
If a user cancelled during the booking process, the lock was released immediately.&lt;br&gt;
Because bookings could come from WhatsApp, phone, or walk-ins, every system checked the same lock store before confirming anything.&lt;br&gt;
If the lock system itself failed, the bot entered a safe mode and temporarily rejected all bookings instead of risking corruption.&lt;/p&gt;
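&lt;p&gt;Putting the lock, the TTL, and the release path together, here is a minimal sketch. An in-memory dict stands in for the shared store, in production something like Redis, that WhatsApp, phone, and walk-in flows would all check:&lt;/p&gt;

```python
# Sketch of the lock layer. An in-memory dict stands in for the shared
# store (e.g. Redis) that WhatsApp, phone, and walk-in flows all check.
import time

LOCK_TTL = 30.0   # seconds: a stuck lock frees itself automatically

locks = {}        # slot -> lock expiry timestamp (the fast layer)
bookings = {}     # slot -> patient (the slow, persistent layer)

def try_lock(slot, now=None):
    now = time.time() if now is None else now
    if slot in bookings:
        return False                   # already confirmed
    expiry = locks.get(slot)
    if expiry is not None and expiry > now:
        return False                   # someone else holds the lock
    locks[slot] = now + LOCK_TTL       # sub-second, beats calendar sync
    return True

def confirm_booking(slot, patient):
    # Called only after the calendar write succeeds: lock becomes a booking.
    bookings[slot] = patient
    locks.pop(slot, None)

def release(slot):
    # User cancelled mid-flow: free the slot immediately.
    locks.pop(slot, None)
```

&lt;p&gt;The design choice is to separate the fast layer (the lock) from the slow layer (the calendar), so the user gets an instant answer while the persistent write catches up.&lt;/p&gt;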

&lt;p&gt;The Results&lt;br&gt;
In the first week after the fix, there were zero double bookings. The bot prevented thirty four conflicts that would have turned into overlaps. Average confirmation time stayed just over two seconds.&lt;br&gt;
Before the fix, there were forty seven double bookings in three hours. After the fix, there were none in three months.&lt;br&gt;
The receptionist stopped manually verifying bot bookings. Patient trust recovered. The bot now handles most of the clinic’s bookings reliably.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;br&gt;
Calendar sync is never instant, no matter how real-time it sounds. Fast user responses and slow database writes must be treated as separate systems.&lt;br&gt;
Race conditions don’t show up during light testing. They appear only when traffic increases.&lt;br&gt;
The simplest fix was not smarter AI. It was a two-layer system that acknowledged reality. One fast layer for locking and user experience, and one slow layer for permanent records.&lt;/p&gt;

&lt;p&gt;The Bottom Line&lt;br&gt;
Forty seven double bookings happened because I assumed calendar sync was instant.&lt;br&gt;
The fix was a lock that takes two tenths of a second.&lt;br&gt;
That small change turned chaos into a stable system that now runs hundreds of bookings without conflict.&lt;/p&gt;

&lt;p&gt;Written by FARHAN HABIB FARAZ, Senior Prompt Engineer and Prompt Team Lead at PowerInAI&lt;br&gt;
Building AI systems that handle real world timing issues.&lt;/p&gt;

&lt;p&gt;Tags: bookingautomation, raceconditions, scheduling, automation, systemdesign, calendarsync&lt;/p&gt;

</description>
      <category>bookingautomation</category>
      <category>ai</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
