DEV Community: Ademola Balogun

You Are Probably Calling the Wrong Model for Most of Your Requests

Ademola Balogun — Sat, 16 May 2026 09:40:31 +0000

Here is something most LLM tutorials quietly ignore. Every request in your app gets sent to the same model, at the same cost, with the same latency. It does not matter if someone asks "what is the capital of France?" or "summarise this 40-page legal contract". Same model. Same price. Same wait.

That is not how you would design anything else in your stack. You would not spin up a GPU instance to serve a static HTML page. So why are we treating all LLM queries as equally complex?

The answer is usually that routing feels complicated. But it genuinely isn't. This article shows you a simple pattern that takes about 20 lines of Python to implement and can cut your API spend significantly while keeping quality where it matters.

The idea

Not all queries are equal. Some are simple lookups or rewrites that a small, fast, cheap model handles just fine. Others actually need the reasoning depth of a frontier model. The router pattern sits in front of your LLM calls and makes that decision before any tokens are generated.

User Query
    |
    v
[ Router: classify complexity ]
    |
    |--- simple ---> cheap model (e.g. Haiku, Gemini Flash, Llama 3.1 8B)
    |
    |--- complex --> capable model (e.g. Sonnet, GPT-4o, Gemini Pro)

Simple. But the impact compounds fast at scale.

A basic implementation

Let's build this in Python. We will use the Anthropic SDK, but the same pattern works with any provider.

First, install the SDK:

pip install anthropic

Now the router itself. The simplest version classifies a query before routing it:

import anthropic

client = anthropic.Anthropic()

CHEAP_MODEL = "claude-haiku-4-5-20251001"
CAPABLE_MODEL = "claude-sonnet-4-6"

def classify_query(query: str) -> str:
    """Returns 'simple' or 'complex'."""
    response = client.messages.create(
        model=CHEAP_MODEL,
        max_tokens=10,
        system=(
            "You classify user queries. "
            "Reply with only one word: simple or complex. "
            "Simple = factual lookups, rewrites, short summaries. "
            "Complex = reasoning, analysis, multi-step tasks, long documents."
        ),
        messages=[{"role": "user", "content": query}]
    )
    # Normalise the reply: models sometimes answer "Simple." or quote the word.
    label = response.content[0].text.strip().lower().strip(" .!\"'")
    return label if label in ("simple", "complex") else "complex"


def routed_call(query: str) -> str:
    complexity = classify_query(query)
    model = CAPABLE_MODEL if complexity == "complex" else CHEAP_MODEL

    print(f"[router] complexity={complexity}, model={model}")

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text

Then you call it like this:

answer = routed_call("What is the boiling point of water?")
print(answer)

answer = routed_call(
    "Analyse the trade-offs between microservices and monoliths "
    "for a team of five engineers building a B2B SaaS product."
)
print(answer)

Output:

[router] complexity=simple, model=claude-haiku-4-5-20251001
100 degrees Celsius at standard atmospheric pressure.

[router] complexity=complex, model=claude-sonnet-4-6
For a team of five engineers building a B2B SaaS product...

Why this works

The classification call itself is cheap. You are using the fast model to decide which model to use, and the reply is capped at 10 tokens, so the routing decision costs very little. If 60 to 70 percent of your queries are genuinely simple, which is plausible for many apps but worth measuring on your own traffic, you are running the expensive model only when it actually earns its cost.
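
To make the cost argument concrete, here is a back-of-the-envelope sketch. Every number in it is a placeholder assumption (the per-request costs are not published prices), so swap in figures measured from your own traffic.

```python
# Hypothetical per-request costs in dollars; replace with your own measurements.
CHEAP_COST = 0.001      # assumed cost of one answer from the cheap model
CAPABLE_COST = 0.010    # assumed cost of one answer from the capable model
CLASSIFY_COST = 0.0002  # assumed cost of one classification call


def blended_cost(simple_share: float) -> float:
    """Average cost per request when every query is classified, then routed."""
    answer_cost = simple_share * CHEAP_COST + (1 - simple_share) * CAPABLE_COST
    return CLASSIFY_COST + answer_cost


def savings_vs_always_capable(simple_share: float) -> float:
    """Fraction of spend saved compared with always calling the capable model."""
    return 1 - blended_cost(simple_share) / CAPABLE_COST


if __name__ == "__main__":
    for share in (0.0, 0.3, 0.65):
        print(f"simple share {share:.0%}: saves {savings_vs_always_capable(share):.1%}")
```

Note the first row: if almost nothing is simple, the classifier call is pure overhead and routing costs slightly more than no routing at all.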

There is a second benefit people overlook. Latency. The small models are significantly faster. For the simple majority of your queries, users get a response much quicker. That matters more than most people realise for perceived product quality.

Making the router smarter without overengineering it

The binary simple/complex split is a starting point. You can extend it in a few ways without the router becoming its own maintenance burden.

One pattern is keyword-based pre-screening before you even make an API call:

SIMPLE_SIGNALS = [
    "what is", "define", "translate", "fix the typo",
    "rewrite this", "summarise in one sentence"
]

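def fast_prescreen(query: str) -> str | None:
    """Returns 'simple' immediately if we can tell from keywords alone."""
    q = query.strip().lower()  # ignore leading whitespace and case
    # A long query can open with a "simple" phrase and still be complex,
    # so only short queries are prescreened. The 200-character cutoff is
    # an arbitrary starting point, not a tuned value.
    if len(q) <= 200 and q.startswith(tuple(SIMPLE_SIGNALS)):
        return "simple"
    return None  # unclear, fall through to the classifier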


def routed_call_v2(query: str) -> str:
    complexity = fast_prescreen(query) or classify_query(query)
    model = CAPABLE_MODEL if complexity == "complex" else CHEAP_MODEL

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text

The prescreen avoids the classification API call entirely for obviously simple queries. For everything else it falls back to the classifier. Two layers, each only doing what is necessary.

What this pattern does not solve

Worth being honest about the limits.

The classifier is not perfect. Some queries look complex but aren't, and vice versa. You will want to log which model handled which queries and spot-check the routing decisions early on. If you find the classifier is consistently wrong on a category of queries, add examples to the system prompt or build a small lookup.
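
One low-effort way to do that logging is a JSON Lines file plus a random sample for manual review. This is only a sketch: the function names, the file layout, and the sample size are assumptions, not part of any SDK.

```python
import json
import random
import time
from typing import Optional


def log_decision(path: str, query: str, complexity: str, model: str) -> None:
    """Append one routing decision to a JSON Lines file."""
    record = {"ts": time.time(), "query": query,
              "complexity": complexity, "model": model}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def sample_for_review(path: str, k: int = 20, seed: Optional[int] = None) -> list:
    """Return up to k randomly chosen logged decisions for a human to spot-check."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```

You would call log_decision inside routed_call right after the model is chosen, then read through a sample every few days until the misroutes stop surprising you.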

Also, this pattern assumes you have two tiers that are meaningfully different in cost and capability. If you are only using one provider and all their models are similarly priced, the benefit is mostly latency, not cost.

The bigger point

The routing pattern is one example of a broader principle that is becoming clearer as LLM applications mature. The model is not the product. The system around the model is the product.

Choosing which model to call, when to retry, how to validate the output, how to handle failures gracefully, these are engineering decisions that compound over time. A well-designed system with average models will outperform a poorly-designed system with great models almost every time.

The teams shipping reliable AI products in 2026 are not the ones with access to the best models. They are the ones who stopped treating LLM calls as black boxes and started treating them like any other network dependency worth engineering around.

Start with the router. See what your query distribution actually looks like. Then decide what to optimise next.

The Question We Are All Asking About AI Has Changed

Ademola Balogun — Sun, 26 Apr 2026 05:54:06 +0000

A couple of years ago, the dominant question in every engineering meeting, every Slack thread, every developer blog was: which model is the best?

People ran benchmarks. They argued about MMLU scores. They debated GPT-4 vs Claude vs Gemini like it was a sports rivalry. The energy made sense. These were genuinely new capabilities, and figuring out who was leading felt important.

But that question is mostly settled now. Not because there is a single winner, but because the whole framing stopped being useful.

The question we are all actually asking in 2026 is a different one: how do you make any of this work reliably in production?

The benchmark treadmill is exhausting everyone

Models that topped the leaderboards six months ago are now average performers. The pace is relentless. GPT-4-level performance now costs roughly 1/100th of what it did two years ago, and open-weight models like Llama, Mistral, and Qwen now match or beat proprietary models on several benchmarks. The capability gap that once made the choice obvious has largely collapsed.

This is good news. But it also means your choice of model matters a lot less than how you build around it.

The teams shipping real products are not the ones who found the perfect model. They are the ones who solved the boring problems: retries, rate limits, context management, observability, cost control. That is the actual work in 2026, and very few people are writing about it honestly.

Rate limits are the new production outage

Here is something that surprised me when I first saw the numbers. In February 2026, 5% of all LLM call spans in Datadog's observability data returned an error, and 60% of those errors were caused by exceeded rate limits. Not hallucinations. Not bad prompts. Rate limits.

When the dominant failure mode of your AI application is capacity, not logic, that tells you something important about where the field actually is. We are past the "does this work?" phase. We are in the "can we keep this running?" phase.

The engineers doing well right now are the ones who treat LLM APIs like any other third-party dependency: circuit breakers, backpressure, exponential backoff, fallback models. Nothing glamorous. Just solid engineering applied to a new surface area.
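
As a rough illustration of two of those ideas, here is exponential backoff with jitter combined with a fallback list of models. RateLimited is a hypothetical stand-in for whatever rate-limit exception your provider's SDK raises, and call is any function you supply that takes a model name and returns text.

```python
import random
import time
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(
    call: Callable[[str], str],
    models: Sequence[str],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying rate-limited calls with backoff."""
    for model in models:
        for attempt in range(max_retries):
            try:
                return call(model)
            except RateLimited:
                if attempt < max_retries - 1:
                    # Delay doubles each attempt; jitter spreads out
                    # clients that would otherwise retry in lockstep.
                    delay = base_delay * (2 ** attempt)
                    sleep(delay + random.uniform(0, delay))
        # Retries exhausted for this model; move on to the next one.
    raise RuntimeError("all models exhausted after retries")
```

The sleep parameter is injectable so the function can be tested without actually waiting.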

Agents are real now, but not in the way you imagined

The word "agent" got overhyped badly. A lot of people heard it and pictured fully autonomous systems that manage themselves. What actually arrived is more useful and more complicated than that.

Agent framework adoption nearly doubled in a year, rising from around 9% of organisations in early 2025 to almost 18% by early 2026. Teams are genuinely building multi-step workflows where models call tools, check results, and make decisions across multiple steps. That is real and it is increasingly common.

What is also real is that these systems are harder to debug than anything most developers have built before. The logic is distributed across LLM calls, tool outputs, and state management. When something goes wrong, the error is rarely obvious. You need logs, traces, and careful evaluation of intermediate steps, not just the final output.
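
To show what recording intermediate steps can look like without any framework, here is a minimal sketch. AgentTrace and StepTrace are made-up names for illustration; real tracing tools capture far more, but the shape is the same.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class StepTrace:
    name: str
    inputs: Any
    output: Any = None
    error: Optional[str] = None
    seconds: float = 0.0


@dataclass
class AgentTrace:
    steps: List[StepTrace] = field(default_factory=list)

    def run_step(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Run one step and record its inputs, output, error and duration."""
        step = StepTrace(name=name, inputs=args)
        start = time.perf_counter()
        try:
            step.output = fn(*args)
            return step.output
        except Exception as exc:
            step.error = repr(exc)
            raise
        finally:
            # Runs on success and failure, so failed steps are traced too.
            step.seconds = time.perf_counter() - start
            self.steps.append(step)
```

When a run goes wrong, the list of steps tells you which tool call or model call produced the first bad value, which is usually not the last one.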

The maturity of the developer community around agents is visible in what they are building. Tools like LangGraph, LangChain, Pydantic AI, and CrewAI are becoming infrastructure, not experiments. MCP (Model Context Protocol) is emerging as a way to connect agents to internal systems like Slack, making it possible for non-technical staff to benefit from agentic workflows without needing to understand what is happening underneath.

Open source closed the gap faster than anyone expected

Two years ago, running a capable model locally felt like a research project. Today it is a realistic production option for a lot of use cases.

The Qwen2.5-1.5B-Instruct model alone has 8.85 million downloads, making it one of the most widely used pretrained LLMs available. The Qwen family spans a range of sizes with specialised versions for math, coding, and vision. DeepSeek's decision to open-weight their models changed the incentive structure for everyone, and other Chinese labs followed. Even American firms responded, with OpenAI releasing its first open-weight models since GPT-2 in August 2025 and the Allen Institute for AI releasing Olmo 3 in November.

What this means practically is that "closed API vs open-weight" is now a genuine architectural decision, not a capability decision. If your use case involves sensitive data, predictable latency, or cost at scale, running a model locally or on your own infrastructure is a real option. That was not true eighteen months ago.

The conversation shifted from research to reliability

By 2025, the dominant question shifted from "which model is best?" to "how do we integrate LLMs reliably with up-to-date knowledge, cost efficiency, and safety?" That shift reflects something real. The ecosystem matured. There are now dozens of capable models, multiple retrieval strategies, fine-tuning options, and deployment patterns. The hard part is no longer finding a model that can do the thing. The hard part is building something you can maintain and trust over time.

RAG is increasingly table stakes rather than an innovation. Evaluation pipelines that continuously test your prompts and catch regressions are becoming standard practice on serious teams. The idea that you deploy an LLM application once and leave it alone has been thoroughly disproved.
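
A regression check does not need to be elaborate to be useful. The sketch below assumes a generate function you provide (prompt in, text out) and a hand-written list of cases; the cases, the substring check, and the tolerance are all illustrative choices.

```python
from typing import Callable, List

# Hypothetical regression cases: each pairs a prompt with substrings
# that an acceptable answer must contain.
CASES = [
    {"prompt": "What is the capital of France?", "must_contain": ["Paris"]},
    {"prompt": "What is 2 + 2?", "must_contain": ["4"]},
]


def pass_rate(generate: Callable[[str], str], cases: List[dict]) -> float:
    """Fraction of cases whose output contains every required substring."""
    passed = 0
    for case in cases:
        output = generate(case["prompt"]).lower()
        if all(s.lower() in output for s in case["must_contain"]):
            passed += 1
    return passed / len(cases)


def check_no_regression(
    generate: Callable[[str], str],
    cases: List[dict],
    baseline: float,
    tolerance: float = 0.02,
) -> bool:
    """True if the pass rate is no more than `tolerance` below the baseline."""
    return pass_rate(generate, cases) >= baseline - tolerance
```

Run it in CI whenever a prompt or model version changes, and store the last accepted pass rate as the baseline.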

Multimodal is the new baseline

Reasoning models and large multimodal models are now considered the two most significant developments in the current LLM landscape. A model that can only process text is no longer frontier. The leading models understand images, documents, and increasingly audio and video too.

What this changes for developers is the input surface area. You can now build applications where users upload a screenshot and get back structured data, or where a document image gets parsed without any traditional OCR pipeline. These capabilities are not perfect, but they are reliable enough to ship on, and the failure modes are well understood.

So where does that leave us?

The LLM space still moves fast. March 2026 showed a clear shift from experimental AI prototypes to production-grade deployments across enterprises. The companies that were running pilots are running products now. The experimentation phase, at least for the core technology, is largely over.

What that means for developers is that the most valuable skills have shifted. Being able to call an API and get a response is not the interesting part anymore. Understanding how to evaluate outputs, manage state across agent steps, control costs at scale, and build feedback loops that improve your system over time, that is where the real leverage is.

The models are good enough. The question is whether the systems we build around them are.

Why RAG Is Failing at Complex Questions (And How Knowledge Graphs Fix It)

Ademola Balogun — Sat, 21 Mar 2026 10:53:24 +0000

Retrieval-Augmented Generation solved the hallucination problem. Then everyone discovered it can't actually answer hard questions.
The issue isn't the LLM. It's not even the retrieval mechanism. It's that traditional RAG treats your knowledge base like a bag of disconnected sentences, when the information you need is buried in relationships spanning multiple documents.
GraphRAG is the architecture that's quietly becoming the answer to RAG's biggest limitation.

The Multi-Hop Problem

Here's a question that breaks standard RAG: "What scientific work influenced the mentor of the person who discovered the double helix structure of DNA?"
A traditional RAG system would:

Search for "double helix structure DNA discovery"
Find chunks mentioning Watson and Crick
Maybe find something about their mentors
Fail to connect the dots about who influenced those mentors
Generate a vague or incorrect answer

The problem? This requires connecting information across three hops. First, Watson and Crick discovered the double helix. Second, who was their mentor? Third, what work influenced that mentor?
Each piece of information lives in different documents. Traditional vector similarity search retrieves semantically similar text, but semantic similarity doesn't map to logical inference chains.

How Standard RAG Actually Works (And Why It Breaks)

Traditional RAG follows a simple pattern:

Chunk your documents into 500-1000 word segments
Embed each chunk into a vector representation
Store vectors in a database (Pinecone, Qdrant, Weaviate)
Query comes in → embed it the same way
Retrieve top-k similar chunks using cosine similarity
Feed chunks to LLM as context
Generate answer from context
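
The retrieval half of that pipeline fits in a few lines. This sketch swaps in a toy word-count "embedding" so it runs with no dependencies; a real system would call an embedding model and a vector database instead, and the function names here are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble the context block that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"
```

Notice that nothing in retrieve knows about entities or relationships. It ranks each chunk on its own, which is exactly why multi-hop questions slip through.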

This works beautifully when your question maps directly to content in a single chunk. It falls apart when the answer requires synthesizing information from multiple chunks, when relationships between entities matter more than semantic similarity, when reasoning chains involve indirect connections, or when domain-specific logic needs to be preserved.
Research shows baseline RAG struggles with comprehensiveness (how much of the answer you capture) and diversity (variety of perspectives) on complex queries. Microsoft's experiments found traditional RAG captures only 22-32% of comprehensive answers on multi-hop questions.

What Knowledge Graphs Bring to the Table

Knowledge graphs represent information as nodes (entities) and edges (relationships).
Instead of:
"Albert Einstein developed the theory of relativity in 1915"
"Einstein worked at the Swiss Patent Office before his breakthrough"
"The theory revolutionized physics"
You get:
(Einstein) --[DEVELOPED]--> (Theory of Relativity)
(Einstein) --[WORKED_AT]--> (Swiss Patent Office)
(Theory of Relativity) --[YEAR]--> (1915)
(Theory of Relativity) --[IMPACT]--> (Physics)
(Swiss Patent Office) --[PRECEDED]--> (Breakthrough)
Now multi-hop queries become graph traversal problems. "What organization did the physicist who developed relativity theory work at before his breakthrough?" translates to walking the graph:

Find node: Theory of Relativity
Follow edge: DEVELOPED (in reverse) → Einstein
Follow edge: WORKED_AT → Swiss Patent Office
Filter by: PRECEDED → Breakthrough

The graph structure preserves logical relationships that vector embeddings lose.
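
Here is the same walk as a small sketch over an in-memory list of triples. The triples are copied from the Einstein example; the helper names are made up, and a production system would run an equivalent query against a graph database instead.

```python
from typing import List

# (subject, relation, object) triples from the Einstein example.
TRIPLES = [
    ("Einstein", "DEVELOPED", "Theory of Relativity"),
    ("Einstein", "WORKED_AT", "Swiss Patent Office"),
    ("Theory of Relativity", "YEAR", "1915"),
    ("Theory of Relativity", "IMPACT", "Physics"),
    ("Swiss Patent Office", "PRECEDED", "Breakthrough"),
]


def objects(subject: str, relation: str) -> List[str]:
    """Follow an edge forwards: subject --relation--> ?"""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation: str, obj: str) -> List[str]:
    """Follow an edge backwards: ? --relation--> object"""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


def employer_before_breakthrough(theory: str) -> List[str]:
    """Theory -> developer -> workplace, kept only if it preceded the breakthrough."""
    results = []
    for person in subjects("DEVELOPED", theory):
        for org in objects(person, "WORKED_AT"):
            if "Breakthrough" in objects(org, "PRECEDED"):
                results.append(org)
    return results


if __name__ == "__main__":
    print(employer_before_breakthrough("Theory of Relativity"))
```

Each hop is an exact lookup on a relation, so the answer does not depend on any two sentences happening to be semantically similar.
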

The GraphRAG Architecture

GraphRAG combines vector databases with knowledge graphs in a dual-retrieval system.
Indexing Phase:

Text segmentation - Break documents into analyzable units (paragraphs, sections)
Entity extraction - Use NER to identify entities (people, places, concepts, organizations)
Relation extraction - Identify relationships between entities
Graph construction - Build knowledge graph with entities as nodes, relations as edges
Community detection - Cluster related nodes into hierarchical communities
Summary generation - Create summaries at different graph levels (local, global)
Dual indexing - Store both graph structure AND vector embeddings

Query Phase:

Query processing - Extract entities and intent from user question
Graph traversal - Use graph queries (Cypher, SPARQL) to find relevant subgraphs
Vector retrieval - Simultaneously retrieve semantically similar chunks
Context fusion - Combine graph paths and vector results
Augmented generation - LLM generates answer using both sources

The key insight: graph traversal finds structurally relevant information, vector search finds semantically relevant information. Together they catch what either misses alone.

Real Performance Gains

Microsoft's research on query-focused summarization shows GraphRAG clearly outperforming baseline RAG in head-to-head comparisons judged by an LLM. On comprehensiveness (how much of the complete answer is captured), GraphRAG scored 72 to 83% while baseline RAG only managed 22 to 32%. For diversity (variety of relevant perspectives included), GraphRAG hit 62 to 82% compared to baseline's 18 to 28%.
Perhaps most impressive: GraphRAG used 97% fewer tokens for root-level summaries by precomputing community summaries in the graph.
In multi-hop reasoning benchmarks, GraphRAG consistently outperforms traditional RAG. On the HotpotQA dataset, it shows 15 to 20% improvement in exact match accuracy. For SQuAD 2.0, it handles unanswerable questions better. In manufacturing QA scenarios, it delivers a 25% improvement in domain-specific queries.

The Technical Challenges Nobody Talks About

Building production GraphRAG isn't straightforward. Here are the real problems:

Entity Resolution Is Hard

Your documents mention "Einstein," "A. Einstein," "Albert Einstein," and "the physicist." The graph needs to know these reference the same entity.
Entity resolution requires string matching algorithms (fuzzy matching, edit distance), contextual disambiguation (different people with same name), cross-document coreference resolution, and domain-specific dictionaries.
Get this wrong and your graph fragments into disconnected pieces that should be unified.
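
The string-matching part can be sketched with nothing but the standard library. The alias table and the 0.85 threshold below are assumptions for illustration; the point is what fuzzy matching catches and what it cannot.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical canonical entities with hand-written aliases.
ALIASES = {
    "Albert Einstein": ["einstein", "a. einstein"],
    "James Clerk Maxwell": ["maxwell", "j. c. maxwell"],
}


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's sequence matcher."""
    return SequenceMatcher(None, a, b).ratio()


def resolve(mention: str, threshold: float = 0.85) -> Optional[str]:
    """Map a mention to a canonical entity, or None if nothing is close enough."""
    m = mention.strip().lower()
    best_name, best_score = None, 0.0
    for name, aliases in ALIASES.items():
        for candidate in [name.lower(), *aliases]:
            score = similarity(m, candidate)
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

This handles "A. Einstein" and a typo like "Albert Einstien", but resolve("the physicist") returns None. No amount of string similarity links that phrase to Einstein, which is why the contextual and coreference steps are needed as well.
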

Relation Extraction Isn't Reliable

Off-the-shelf extraction models produce noisy relations. A sentence like "Einstein's theory, which was influenced by Maxwell's work, revolutionized physics" might correctly extract that Einstein authored the theory and that the theory revolutionized physics, but incorrectly attach Maxwell's influence to Einstein rather than to the theory.
Fixing this requires domain-specific training data or rule-based post-processing.

Graph Size Explodes Quickly

A 1000-document corpus can generate 50,000+ entities and 200,000+ relationships. Querying this efficiently requires proper indexing (Neo4j and ArangoDB handle this well), subgraph sampling (don't traverse the entire graph), community-based hierarchies (Microsoft's approach), and caching frequent query paths.
Without optimization, query times balloon to 10+ seconds.

Hybrid Retrieval Is Complex

Combining graph results with vector results isn't trivial. You need to normalize relevance scores from different sources, handle conflicts when sources disagree, decide weighting (70% graph, 30% vector? Depends on query type), and rerank combined results before sending to LLM.
Most implementations use a simple concatenation, which works but leaves performance on the table.
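
A step up from concatenation is to normalize each retriever's scores and combine them with a weight. This sketch uses min-max scaling and a weighted sum, which is one simple option among several; the function names and the default 0.7 weight are illustrative.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    graph_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    graph_weight: float = 0.7,
    top_k: int = 3,
) -> List[str]:
    """Rank documents by a weighted sum of normalized graph and vector scores."""
    g, v = min_max(graph_scores), min_max(vector_scores)
    combined = {
        doc: graph_weight * g.get(doc, 0.0) + (1 - graph_weight) * v.get(doc, 0.0)
        for doc in g.keys() | v.keys()
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(combined, key=lambda d: (-combined[d], d))[:top_k]
```

A document found by only one retriever scores zero from the other, so the weight decides how much a graph-only hit can outrank a vector-only hit.
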

When GraphRAG Actually Matters

GraphRAG isn't always better than standard RAG. Use it when multi-hop reasoning is required, for questions such as "What university did the inventor of the technology that powers electric cars attend?" It shines in domains where relationships are complex: medical diagnosis (symptoms to conditions to treatments), legal analysis (cases to precedents to statutes), manufacturing (components to failures to causes). It's also valuable when hierarchical summarization is needed ("Summarize all security incidents across departments last quarter") or when factual accuracy is critical in areas like financial compliance, medical information, and legal advice where hallucinations have real consequences.
Don't use GraphRAG for simple lookups ("What is the capital of France?"), when semantic similarity is sufficient ("Find articles similar to this one"), when your corpus has no relational structure, or when you need the lowest possible latency since graph traversal adds overhead.

Implementation Patterns That Work

Start with hybrid before going full GraphRAG. Augment your existing RAG with simple entity linking: extract named entities from chunks, link entities across chunks, and use entity co-occurrence as an additional signal. This requires minimal infrastructure changes but improves results 10 to 15%.
Build domain-specific ontologies. Generic knowledge graphs underperform domain-specific ones. For medical RAG, use medical ontologies (SNOMED, ICD). For legal, use citation graphs. The domain structure matters more than the technology.
Precompute subgraphs for common query patterns. If 80% of queries follow 3 or 4 patterns, precompute and cache those subgraphs. Query time drops from 8 seconds to 800ms.
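
In its simplest form, caching a subgraph is just memoising the traversal. The sketch below uses functools.lru_cache over a toy adjacency list; the graph contents are invented for illustration, and a real deployment would also need to invalidate the cache when the graph changes.

```python
from functools import lru_cache

# Toy adjacency list standing in for a graph database (illustrative data).
GRAPH = {
    "Pump A": ["Seal Failure", "Bearing Wear"],
    "Seal Failure": ["Overheating"],
    "Bearing Wear": ["Vibration"],
    "Overheating": [],
    "Vibration": [],
}


@lru_cache(maxsize=1024)
def neighborhood(start: str, depth: int) -> frozenset:
    """Nodes reachable from `start` within `depth` hops, memoised per call."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        # Expand one hop, skipping nodes already visited.
        frontier = {n for node in frontier for n in GRAPH.get(node, [])} - seen
        seen |= frontier
    return frozenset(seen)
```

The second call with the same arguments returns the stored result without touching the graph, and neighborhood.cache_info() shows the hit rate.
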
Use graph embeddings for hybrid ranking. Convert graph paths to embeddings (Node2Vec, GraphSAGE), then combine with text embeddings for unified similarity scoring.

The Future: Multi-Modal Graph RAG

The next evolution adds images, videos, and audio to the graph.
Imagine querying: "Show me product demos where the presenter mentioned reliability issues"
The graph connects:

(Video) --[PRESENTER]--> (Person)
(Person) --[MENTIONED]--> (Reliability)
(Reliability) --[CONTEXT]--> (Timestamp)
(Timestamp) --[IN]--> (Video Segment)

Each modality (transcript, visual frames, speaker identification) becomes nodes in a unified graph. Retrieval works across modalities simultaneously.
Early research shows multi-modal GraphRAG achieving 40-50% better results on tasks requiring cross-modal reasoning (finding specific moments in videos, connecting spoken words to visual content).

Why This Matters Now

GraphRAG is moving from research to production. Companies deploying it are seeing real results. In financial services, there's a 30% reduction in analyst query time for complex compliance questions. Healthcare organizations see improved clinical decision support by connecting symptoms across patient records. Manufacturing companies get faster root cause analysis by linking failures to component relationships. Legal firms have better case law research connecting precedents through reasoning chains.
The technology works. The challenge is implementation complexity. You need expertise in graph databases (Neo4j, Neptune, ArangoDB), vector databases (Pinecone, Qdrant, Milvus), NLP pipelines (SpaCy, Hugging Face), graph algorithms (community detection, path finding), and LLM integration.
That's a heavier lift than vanilla RAG. But for complex domains where standard RAG fails, GraphRAG isn't optional—it's the only architecture that works.

The Bottom Line

RAG revolutionized how LLMs access external knowledge. But it was designed for simple retrieval, not complex reasoning.
GraphRAG fixes this by treating your knowledge base as what it actually is: a web of interconnected concepts, not a pile of disconnected chunks.
The performance gains on complex queries aren't incremental. They're 2 to 3x improvements. The implementation complexity isn't trivial, but it's manageable with the right architecture.
If your RAG system is failing on multi-hop questions, relationship-heavy domains, or hierarchical summarization tasks, the problem isn't your embeddings or your LLM. It's that you're using the wrong retrieval architecture. GraphRAG is how you fix it.

I Tried to Trick 7 AI Models with Fake Facts. They Didn't Fall for It. (That's a Problem.)

Ademola Balogun — Tue, 17 Feb 2026 22:12:10 +0000

I spent a weekend testing whether large language models would confidently repeat misinformation back to me. I fed them 20 fake historical facts alongside 20 real ones and waited for the inevitable hallucinations.

They never came.

Not a single model - across seven different architectures from various providers - accepted even one fabricated fact as true. Zero hallucinations. Clean sweep.

My first reaction was relief. These models are smarter than I thought, right?

Then I looked closer at the data and realized something more concerning: the models weren't being smart. They were being paranoid.

The Experiment

I built a simple benchmark with 40 factual statements:

20 fake facts: "Marie Curie won a Nobel Prize in Mathematics," "The Titanic successfully completed its maiden voyage," "World War I ended in 1925"
20 real facts: "The Berlin Wall fell in 1989," "The Wright brothers achieved powered flight in 1903," "The Soviet Union dissolved in 1991"

I tested seven models available through Together AI's API:

Llama-4-Maverick (17B active parameters)
GPT-OSS (120B)
Qwen3-Next (80B)
Kimi-K2.5
GLM-5
Mixtral (8x7B)
Mistral-Small (24B)

Each model received the same prompt: verify the statement and respond with a verdict (true/false), confidence level (low/medium/high), and brief explanation. Temperature was set to 0 for consistency.

The Results: Perfect... Suspiciously Perfect

Five models scored 100% accuracy. The other two? 97.5% and 95%.

At first glance, this looks incredible. But here's what actually happened:

Model	Accuracy	False Negatives
Llama-4-Maverick	100%	0
GPT-OSS-120B	100%	0
Qwen3-Next	100%	0
Kimi-K2.5	100%	0
GLM-5	100%	0
Mixtral-8x7B	97.5%	1
Mistral-Small	95%	2

Not a single hallucination. Every error was a false negative - rejecting true facts.
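
For reference, this is roughly how those columns can be computed from raw verdicts. It is a sketch rather than the exact script I ran; the example data reproduces the Mistral-Small row (20 fake facts all rejected, 2 of 20 real facts rejected).

```python
from typing import Dict, List, Tuple


def score(results: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Score (statement_is_true, model_said_true) pairs."""
    # False positive: a fake fact accepted as true (a hallucination here).
    fp = sum(1 for truth, verdict in results if not truth and verdict)
    # False negative: a real fact rejected as false.
    fn = sum(1 for truth, verdict in results if truth and not verdict)
    correct = len(results) - fp - fn
    return {
        "accuracy": correct / len(results),
        "false_positives": fp,
        "false_negatives": fn,
    }


if __name__ == "__main__":
    mistral_small_like = (
        [(False, False)] * 20   # every fake fact rejected
        + [(True, True)] * 18   # real facts accepted
        + [(True, False)] * 2   # real facts wrongly rejected
    )
    print(score(mistral_small_like))
```

Keeping the two error types in separate counters is what makes the asymmetry visible; a single accuracy number hides it.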

The Safety-Accuracy Paradox

Here's the uncomfortable truth I discovered: these models have been trained to be so cautious about misinformation that they'd rather reject accurate information than risk spreading a falsehood.

Think about what this means in practice.

If you ask an AI assistant, "Did the Berlin Wall fall in 1989?" and it responds with uncertainty or outright denial because it's been over-tuned for safety, that's not helpful. That's a different kind of failure.

The models that scored less than 100% - Mixtral and Mistral-Small - weren't worse. They were different. They rejected some real facts (false negatives) but never accepted fake ones (hallucinations). They drew the line in a different place on the safety-accuracy spectrum.

Confidence Calibration: Everyone's Certain

What struck me most wasn't the accuracy - it was the confidence.

Every single model reported "high confidence" on 95-100% of their responses. When they were right, they were certain. When they were wrong (the few false negatives), they were still certain.

This is the real issue with confidence scores in current LLMs. They're not probabilistic assessments. They're vibes.

A model that says "I'm highly confident the Berlin Wall fell in 1989" and another that says "I'm highly confident it didn't" are both expressing the same level of certainty despite contradicting each other. The confidence score doesn't tell you about uncertainty - it tells you the model finished its internal reasoning process.
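
A quick way to check whether a stated confidence means anything is to group responses by the label and compare accuracy per group. The sketch below assumes you have (stated_confidence, was_correct) pairs; the function name is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_confidence(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy for each stated confidence label."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for confidence, was_correct in records:
        totals[confidence] += 1
        if was_correct:
            hits[confidence] += 1
    return {label: hits[label] / totals[label] for label in totals}
```

If the labels were calibrated, "high" would be noticeably more accurate than "medium" or "low". When nearly every response lands in "high", as it did here, there is nothing to compare against and the label carries almost no information.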

What This Actually Tells Us

I went into this experiment expecting to write about hallucination rates and confidence miscalibration. Instead, I found something more nuanced: modern LLMs have overcorrected.

The training data and RLHF (Reinforcement Learning from Human Feedback) that went into these models has created systems that:

Err heavily on the side of caution - Better to say "I don't know" than risk spreading misinformation
Treat all uncertainty the same - A 60% confidence and a 95% confidence both get reported as "high"
Optimize for not being wrong over being helpful

This isn't necessarily bad. In many applications, such as medical advice, legal information, and financial guidance, you want conservative models. But it creates a different kind of deployment challenge.

The Pendulum Problem

We've swung from early LLMs that would hallucinate confidently to current models that reject true information to avoid any possibility of error. Neither extreme is ideal.

The chart above shows how models trade off different failure modes. Perfect scores on "anti-hallucination" (none of them accepted fake facts) but varied scores on "anti-false rejection" (some rejected real facts).

What we actually need is something in the middle: models that can express genuine uncertainty, distinguish between "probably false" and "definitely false," and acknowledge when they simply don't know.

The Real-World Impact

Here's where this gets practical.

If you're building:

A fact-checking system: Current models are probably too conservative. They'll flag true statements as suspicious.
A customer service chatbot: You want conservative. Better to escalate to a human than give wrong information.
A research assistant: You need calibrated uncertainty. "This claim appears in 3 sources but contradicts 2 others" is more useful than "high confidence: false."

The failure mode matters as much as the accuracy rate.

What I Got Wrong

My benchmark used obviously false facts. "The Titanic successfully completed its maiden voyage" is not subtle misinformation. It's the kind of statement that gets flagged immediately.

In retrospect, I was testing whether models would accept absurdly false claims, not whether they'd get tricked by plausible misinformation. That's a different experiment entirely.

To actually test hallucination susceptibility, I'd need:

Subtly wrong facts that sound plausible
Mixed information where some details are right and others wrong
Statements that require nuanced understanding, not just fact recall

But that's also what makes this finding interesting. Even with softball fake facts, the models didn't just reject them - they were defensive across the board.

The Technical Debt of Safety

Here's what I think is happening under the hood:

During RLHF training, models get penalized heavily for hallucinations. The training signal is strong: never make up facts. The penalty for false positives (accepting fake information) is much higher than the penalty for false negatives (rejecting true information).

This makes sense from a product safety perspective. A model that occasionally refuses to answer is annoying. A model that confidently spreads misinformation is dangerous.

But it creates a form of technical debt. We've optimized for one failure mode (hallucination) so aggressively that we've introduced another (excessive caution). And because we can't perfectly measure "appropriate uncertainty," the models just default to maximum caution.

Where This Leaves Us

Looking at the full outcome distribution, 98.8% of responses were correct. That's impressive. But the 1.2% that were wrong were all the same type of wrong: false negatives.

This tells me something important about the current state of LLMs: we've solved the hallucination problem by making models reluctant to commit.

That's progress. But it's not the end goal.

The next frontier isn't getting models to stop hallucinating—they've basically done that on straightforward factual questions. It's getting them to:

Express calibrated uncertainty
Distinguish between "definitely false" and "uncertain"
Provide nuanced answers instead of binary true/false judgments
Know what they don't know

Limitations and Future Work

This was a small-scale experiment with limitations:

Dataset size: Only 40 statements
Fact complexity: Simple historical facts, not complex or nuanced claims
Single API provider: All models tested through Together AI
Binary evaluation: True/false doesn't capture nuanced responses

A more robust version would:

Test with subtle misinformation, not obvious falsehoods
Include complex claims requiring reasoning, not just fact recall
Evaluate explanation quality, not just verdict accuracy
Test the same models across different providers to check for API-level filtering

The Bottom Line

I set out to measure how often AI models hallucinate. I discovered they've become so afraid of hallucinating that they're starting to reject reality.

That's not a hallucination problem. It's an overcorrection problem.

And honestly? I'm not sure which one is harder to fix.

How to Build an AI-Powered WhatsApp Bot That Analyzes Images Using Python and Vision Models

Ademola Balogun — Thu, 05 Feb 2026 22:17:29 +0000

WhatsApp has over 2 billion users. Most AI tools live on websites nobody visits. What if you could bring AI directly to where people already spend their time?

In this tutorial, I'll show you how to build a WhatsApp bot that accepts images, analyzes them using AI vision models, and responds with intelligent insights.

What We're Building

By the end of this guide, you'll have a working WhatsApp bot that:

Receives images from users via WhatsApp
Processes them using AI vision models (Llama, GPT-4V, or Claude)
Returns structured analysis in natural conversation
Stores conversation history in MongoDB
Runs on a free-tier cloud server

The entire stack costs nearly nothing to run at low volume, making it perfect for MVPs, side projects, or learning.

Why WhatsApp + AI Vision?

Before we dive into code, let's talk about why this combination is powerful.

Traditional AI apps require users to visit a website, create an account, and learn a new interface. WhatsApp bots eliminate all that friction. Users message your bot exactly like they'd message a friend.

AI vision models have become remarkably capable. They can identify objects, read text, understand context, and generate detailed descriptions. Combining this with WhatsApp's ubiquity creates tools that feel magical.

Some real-world applications:

Receipt scanners that extract expenses automatically
Plant identifiers for gardening enthusiasts
Food analyzers that estimate nutrition from photos
Document readers that summarize uploaded PDFs
Product lookup tools for shopping assistance

Prerequisites

You'll need:

Python 3.9+
A Meta Developer account (free)
A cloud AI provider account (Together AI, OpenAI, or Anthropic)
MongoDB Atlas account (free tier works)
A server with a public URL (we'll use ngrok for development)

Architecture Overview

Here's how the pieces fit together:

User sends image via WhatsApp
         ↓
Meta's WhatsApp Cloud API receives it
         ↓
Webhook forwards to your Flask server
         ↓
Server downloads image from Meta's CDN
         ↓
Image sent to AI Vision API for analysis
         ↓
Response formatted and sent back via WhatsApp API
         ↓
User receives analysis in their chat

The beauty of this architecture is its simplicity. One Python file handles everything.

Step 1: Set Up the Meta Developer Account

First, we need access to the WhatsApp Business API.

Go to developers.facebook.com and create an account
Click "My Apps" → "Create App"
Select "Business" as the app type
Name your app and click "Create"
Find "WhatsApp" in the product list and click "Set Up"

Meta provides a free test phone number that works for development. You'll see it in the WhatsApp section of your app dashboard.

Note down these values from your dashboard:

Phone Number ID (under "From" phone number)
WhatsApp Business Account ID
Temporary Access Token (we'll make this permanent later)

Step 2: Set Up Your AI Vision Provider

For this tutorial, I'll use Together AI with their Llama Vision model because it's cost-effective and doesn't require waitlist approval. The code works with minor modifications for OpenAI's GPT-4V or Anthropic's Claude.

Sign up at together.ai
Get your API key from the dashboard
Note the model name: meta-llama/Llama-4-Scout-17B-16E-Instruct

Together AI charges about $0.18 per million tokens for vision models—significantly cheaper than alternatives.
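For a rough sense of what that price means per month, here is a back-of-the-envelope sketch. It uses the $0.18 per million tokens figure quoted above and assumes a hypothetical average of 1,500 tokens per image request (prompt, image, and reply combined). Real token counts depend on the model and the image, so treat the default as a placeholder.

```python
PRICE_PER_MILLION_TOKENS = 0.18  # USD, the figure quoted in this article

def estimate_monthly_cost(requests_per_month, tokens_per_request=1_500):
    """Estimate monthly spend in USD. tokens_per_request is an assumed average."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"${estimate_monthly_cost(10_000):.2f}")  # $2.70
```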

Step 3: Set Up MongoDB

We'll use MongoDB to store user sessions and analysis history.

Create a free account at mongodb.com/atlas
Create a new cluster (the free M0 tier works fine)
Create a database user with read/write access
Get your connection string (looks like mongodb+srv://user:pass@cluster.xxxxx.mongodb.net/)

Step 4: Project Structure

Create a new directory and set up these files:

whatsapp-ai-bot/
├── app.py           # Main application
├── .env             # Environment variables (don't commit this)
├── .env.example     # Template for environment variables
└── requirements.txt # Python dependencies

Step 5: Install Dependencies

Create requirements.txt:

flask==3.0.0
requests==2.31.0
pymongo==4.6.0
python-dotenv==1.0.0
together==1.0.0
gunicorn==21.2.0

Install them:

pip install -r requirements.txt

Step 6: Environment Variables

Create .env.example (commit this as a template):

# WhatsApp API
WHATSAPP_ACCESS_TOKEN=your_token_here
WHATSAPP_PHONE_NUMBER_ID=your_phone_id_here
WHATSAPP_VERIFY_TOKEN=any_random_string_you_choose

# AI Provider
TOGETHER_API_KEY=your_together_api_key

# Database
MONGODB_URI=mongodb+srv://user:pass@cluster.mongodb.net/dbname

Copy it to .env and fill in your actual values.

Step 7: The Main Application

Here's the complete app.py. I'll explain each section after:

import os
import json
import requests
from datetime import datetime
from flask import Flask, request, jsonify
from pymongo import MongoClient
from dotenv import load_dotenv
from together import Together

load_dotenv()

app = Flask(__name__)

# Configuration
WHATSAPP_TOKEN = os.getenv("WHATSAPP_ACCESS_TOKEN")
PHONE_NUMBER_ID = os.getenv("WHATSAPP_PHONE_NUMBER_ID")
VERIFY_TOKEN = os.getenv("WHATSAPP_VERIFY_TOKEN")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
MONGODB_URI = os.getenv("MONGODB_URI")

# Initialize clients
mongo_client = MongoClient(MONGODB_URI)
db = mongo_client.whatsapp_bot
together_client = Together(api_key=TOGETHER_API_KEY)

# Vision model configuration
VISION_MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"


def send_whatsapp_message(to, message):
    """Send a text message via WhatsApp API."""
    url = f"https://graph.facebook.com/v18.0/{PHONE_NUMBER_ID}/messages"
    headers = {
        "Authorization": f"Bearer {WHATSAPP_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "text",
        "text": {"body": message}
    }

    response = requests.post(url, headers=headers, json=payload, timeout=10)
    return response.json()


def download_media(media_id):
    """Download media from WhatsApp's CDN."""
    # First, get the media URL
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}

    response = requests.get(url, headers=headers, timeout=10)
    media_url = response.json().get("url")

    if not media_url:
        return None

    # Download the actual file
    media_response = requests.get(media_url, headers=headers, timeout=30)
    return media_response.content


def analyze_image(image_data, prompt):
    """Send image to AI vision model for analysis."""
    import base64

    # Convert to base64
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    image_url = f"data:image/jpeg;base64,{image_base64}"

    try:
        response = together_client.chat.completions.create(
            model=VISION_MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt}
                ]
            }],
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Vision API error: {e}")
        return None


def get_analysis_prompt():
    """Return the prompt for image analysis."""
    return """Analyze this image and provide:

1. A brief description of what you see
2. Key details or notable elements
3. Any relevant insights or observations

Keep your response concise and conversational, suitable for a chat message."""


def log_interaction(phone_number, message_type, content, response):
    """Log the interaction to MongoDB."""
    db.interactions.insert_one({
        "phone_number": phone_number,
        "message_type": message_type,
        "content": content[:500] if content else None,
        "response": response[:500] if response else None,
        "timestamp": datetime.utcnow()
    })


@app.route("/webhook", methods=["GET"])
def verify_webhook():
    """Handle webhook verification from Meta."""
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        print("Webhook verified successfully")
        return challenge, 200

    return "Forbidden", 403


@app.route("/webhook", methods=["POST"])
def handle_webhook():
    """Process incoming WhatsApp messages."""
    data = request.json

    try:
        # Extract message details
        entry = data["entry"][0]
        changes = entry["changes"][0]
        value = changes["value"]

        # Check if this is a message (not a status update)
        if "messages" not in value:
            return jsonify({"status": "ok"}), 200

        message = value["messages"][0]
        phone_number = message["from"]
        message_type = message["type"]

        # Handle image messages
        if message_type == "image":
            media_id = message["image"]["id"]

            # Send acknowledgment
            send_whatsapp_message(
                phone_number, 
                "Got your image! Analyzing it now..."
            )

            # Download and analyze
            image_data = download_media(media_id)

            if image_data:
                analysis = analyze_image(image_data, get_analysis_prompt())

                if analysis:
                    send_whatsapp_message(phone_number, analysis)
                    log_interaction(phone_number, "image", "image_received", analysis)
                else:
                    send_whatsapp_message(
                        phone_number,
                        "Sorry, I couldn't analyze that image. Please try again."
                    )
            else:
                send_whatsapp_message(
                    phone_number,
                    "Sorry, I couldn't download that image. Please try again."
                )

        # Handle text messages
        elif message_type == "text":
            text = message["text"]["body"]

            # Simple response for non-image messages
            response = (
                "Hi! Send me an image and I'll analyze it for you.\n\n"
                "Just take a photo or send one from your gallery!"
            )
            send_whatsapp_message(phone_number, response)
            log_interaction(phone_number, "text", text, response)

        return jsonify({"status": "ok"}), 200

    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({"status": "error"}), 500


@app.route("/health", methods=["GET"])
def health_check():
    """Simple health check endpoint."""
    return jsonify({"status": "healthy", "timestamp": datetime.utcnow().isoformat()})


if __name__ == "__main__":
    # debug=True is for local development only; run under gunicorn in production
    app.run(host="0.0.0.0", port=5000, debug=True)

Step 8: Understanding the Code

Let's break down the key parts.

Webhook Verification

When you configure your webhook URL in Meta's dashboard, they send a GET request with a challenge. Your server must echo it back:

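@app.route("/webhook", methods=["GET"])
def verify_webhook():
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")
    if mode == "subscribe" and token == VERIFY_TOKEN:
        return challenge, 200
    return "Forbidden", 403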
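The same check can also be written as a plain function that takes the query parameters as a mapping, which makes it easy to test without a running server. This is a sketch, not part of app.py: the function name is hypothetical, and it swaps the plain == comparison for hmac.compare_digest from the standard library.

```python
import hmac

def check_verification(args, expected_token):
    """Return (body, status) for a webhook verification request.

    args is any mapping of query parameters, e.g. Flask's request.args.
    """
    mode = args.get("hub.mode")
    token = args.get("hub.verify_token") or ""
    challenge = args.get("hub.challenge") or ""
    # compare_digest runs in constant time, so the token cannot be guessed
    # from response timing. This is a swap-in for the == used in app.py.
    token_ok = bool(expected_token) and hmac.compare_digest(
        token.encode(), expected_token.encode()
    )
    if mode == "subscribe" and token_ok:
        return challenge, 200
    return "Forbidden", 403

print(check_verification(
    {"hub.mode": "subscribe", "hub.verify_token": "s3cret", "hub.challenge": "12345"},
    "s3cret",
))
```

Inside the Flask route, the call would be check_verification(request.args, VERIFY_TOKEN).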

Processing Incoming Messages

WhatsApp sends POST requests to your webhook for each message. The nested JSON structure requires careful extraction:

message = value["messages"][0]
phone_number = message["from"]
message_type = message["type"]
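One way to keep that extraction careful is to put it in a small helper that returns None for anything that is not a message. This is a minimal sketch: the helper name is hypothetical, and the payloads are trimmed-down illustrations containing only the keys that app.py reads.

```python
def extract_message(payload):
    """Return (phone_number, message_type, message), or None if there is no message."""
    try:
        value = payload["entry"][0]["changes"][0]["value"]
        message = value["messages"][0]
    except (KeyError, IndexError, TypeError):
        return None  # status update or unexpected shape
    return message.get("from"), message.get("type"), message

# Illustrative payloads, reduced to the fields used in this tutorial
image_payload = {"entry": [{"changes": [{"value": {"messages": [
    {"from": "15551234567", "type": "image", "image": {"id": "MEDIA_ID"}}
]}}]}]}
status_payload = {"entry": [{"changes": [{"value": {"statuses": [{"status": "read"}]}}]}]}

print(extract_message(image_payload)[:2])
print(extract_message(status_payload))
```

Returning None instead of raising lets the webhook handler answer status updates with a plain 200 and skip the rest of the processing.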

Downloading Media

WhatsApp doesn't send images directly. Instead, they provide a media ID. You must first get the download URL, then fetch the actual file:

def download_media(media_id):
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}

    # Get URL from media ID
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    response = requests.get(url, headers=headers)
    media_url = response.json().get("url")

    # Download actual file
    media_response = requests.get(media_url, headers=headers)
    return media_response.content

Vision Analysis

The AI vision API accepts base64-encoded images. We send both the image and a text prompt:

response = together_client.chat.completions.create(
    model=VISION_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt}
        ]
    }]
)
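The image_url value in that request is a base64 data URL. app.py always labels it as image/jpeg. The sketch below shows one possible alternative, assuming you want the label to match the actual bytes: it checks the file's leading "magic bytes" for JPEG and PNG and builds the data URL from the result. Both function names are hypothetical helpers, not part of any SDK.

```python
import base64

def sniff_image_mime(data: bytes) -> str:
    """Guess the MIME type from magic bytes. Only JPEG and PNG are recognized."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    raise ValueError("unsupported image format")

def to_data_url(data: bytes) -> str:
    """Encode raw image bytes as a data URL with a matching MIME type."""
    mime = sniff_image_mime(data)
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
print(to_data_url(png_signature))
```

Raising ValueError for anything else gives the bot a clear point to reply with "please send a JPEG or PNG" rather than forwarding bytes the vision API may reject.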

Step 9: Set Up the Webhook

For development, we'll use ngrok to expose your local server.

Install ngrok from ngrok.com
Run your Flask app: python app.py
In another terminal, run: ngrok http 5000
Copy the HTTPS URL (looks like https://abc123.ngrok.io)

Now configure the webhook in Meta's dashboard:

Go to your app → WhatsApp → Configuration
Click "Edit" next to Webhook
Enter your URL: https://abc123.ngrok.io/webhook
Enter your verify token (same as WHATSAPP_VERIFY_TOKEN in .env)
Click "Verify and Save"
Subscribe to "messages" events

Step 10: Test Your Bot

Open WhatsApp on your phone
Message the test number shown in Meta's dashboard
Send an image
Watch your terminal for logs
Receive the AI analysis!

If something doesn't work, check:

Is ngrok running and the URL current?
Are all environment variables set?
Is the webhook subscribed to "messages"?
Check Meta's webhook logs for delivery status

Step 11: Making the Access Token Permanent

The temporary access token expires in 24 hours. For production, create a permanent one:

Go to your app → WhatsApp → API Setup
Click "Add" under "Add a system user token"
Create a system user if you haven't
Generate a token with whatsapp_business_messaging permission
This token won't expire

Step 12: Deploying to Production

For production, you need a server with a stable URL. Here are affordable options:

Railway

# Install Railway CLI
npm install -g @railway/cli

# Deploy
railway init
railway up

Render (Free tier available)

Connect your GitHub repo
Set environment variables in dashboard
Deploy automatically on push

DigitalOcean

# On your droplet
git clone your-repo
cd your-repo
pip install -r requirements.txt
gunicorn app:app -b 0.0.0.0:5000

For any option, remember to:

Set all environment variables
Use HTTPS (required by WhatsApp)
Run with gunicorn instead of Flask's dev server
Set up process management (systemd, supervisor, or PM2)

Step 13: Customizing for Your Use Case

The base code is intentionally generic. Here's how to adapt it for specific applications.

For a Receipt Scanner:

def get_analysis_prompt():
    return """Analyze this receipt image and extract:

1. Store/merchant name
2. Date of purchase
3. List of items with prices
4. Total amount
5. Payment method if visible

Format the response as a clear summary."""

For a Plant Identifier:

def get_analysis_prompt():
    return """Identify this plant and provide:

1. Plant name (common and scientific)
2. Key identifying features
3. Care requirements (water, sunlight)
4. Is it toxic to pets?

Keep it conversational and helpful."""

For a Food Analyzer:

def get_analysis_prompt():
    return """Analyze this food image and estimate:

1. What foods are present
2. Approximate calories
3. Protein, carbs, and fat estimates
4. Health rating from 1-10
5. A brief nutritional insight

Be helpful but note these are estimates."""

Step 14: Adding Conversation Context

To make your bot smarter, store and use conversation history:

def get_user_context(phone_number, limit=5):
    """Get recent interactions for context."""
    recent = db.interactions.find(
        {"phone_number": phone_number}
    ).sort("timestamp", -1).limit(limit)

    return list(recent)


def analyze_image_with_context(image_data, phone_number):
    """Include conversation history in analysis."""
    context = get_user_context(phone_number)

    context_text = ""
    if context:
        context_text = "Previous interactions:\n"
        for item in reversed(context):
            # 'response' may be stored as None, so fall back to an empty string
            context_text += f"- {(item.get('response') or '')[:100]}\n"

    prompt = f"""{context_text}

Now analyze this new image. Consider any relevant context from previous interactions."""

    return analyze_image(image_data, prompt)

Performance Tips

After running this in production, here's what I've learned:

Send acknowledgments immediately. Users get anxious if there's no response. Send "Analyzing..." before doing the heavy work.
Cache repeated analyses. Hash incoming images and check if you've seen them before.
Set timeout limits. Vision APIs can be slow. Set a 30-second timeout and send a graceful error if exceeded.
Rate limit by user. Prevent abuse by limiting requests per phone number per hour.
Monitor costs. Log API calls and set up billing alerts. Vision APIs charge per image.
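Two of these tips, caching and per-user rate limiting, can be sketched with the standard library alone. This is a minimal in-memory version under stated assumptions: the function names and default limits are hypothetical, state is lost on restart, and it is not shared between gunicorn workers. A production setup would more likely keep this state in MongoDB or another shared store.

```python
import hashlib
import time
from collections import defaultdict, deque

_analysis_cache = {}  # image hash -> analysis text

def cached_analysis(image_data, analyze):
    """Return the cached result for identical image bytes, else call analyze()."""
    key = hashlib.sha256(image_data).hexdigest()
    if key not in _analysis_cache:
        result = analyze(image_data)
        if result is None:
            return None  # don't cache failures
        _analysis_cache[key] = result
    return _analysis_cache[key]

_recent_requests = defaultdict(deque)  # phone number -> request timestamps

def allow_request(phone_number, limit=10, window_seconds=3600, now=None):
    """Sliding-window limiter: at most `limit` requests per window per number."""
    now = time.time() if now is None else now
    stamps = _recent_requests[phone_number]
    # Drop timestamps that have fallen out of the window
    while stamps and now - stamps[0] >= window_seconds:
        stamps.popleft()
    if len(stamps) >= limit:
        return False
    stamps.append(now)
    return True
```

In the webhook handler, allow_request(phone_number) would run before the download, and cached_analysis(image_data, lambda data: analyze_image(data, get_analysis_prompt())) would replace the direct analyze_image call.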

Common Pitfalls

"Webhook verification failed"

Your verify token doesn't match
Your server isn't accessible (check ngrok)
You're not returning the challenge correctly

"Message not delivered"

Access token expired (get a permanent one)
Phone number not in allowed list (in test mode)
Invalid phone number format

"Image download failed"

Access token lacks the required permission (the system user token in Step 11 is generated with whatsapp_business_messaging)
Media URL expired (they're temporary)
Network timeout

"Vision API error"

Image too large (resize before sending)
Unsupported format (stick to JPEG/PNG)
Rate limit hit

What's Next?

This foundation supports many extensions:

Multi-language support: Detect user's language and respond accordingly
Voice messages: Transcribe audio and respond
Buttons and lists: Use WhatsApp's interactive message types
Payment integration: Connect to Stripe for premium features
Admin dashboard: Build a web interface for monitoring

Wrapping Up

You now have a complete AI-powered WhatsApp bot that can analyze images. The stack is simple, affordable, and scales well.

The combination of WhatsApp's reach and AI vision capabilities opens interesting possibilities. Users don't need to learn new interfaces—they just message like they always do.

This article was written based on my hands-on experience building production WhatsApp bots. The code examples are simplified for clarity—production deployments should include proper error handling, logging, and security measures.

How to Build an AI That Roasts Your Spending Habits (3-Hour Weekend Project)

Ademola Balogun — Mon, 19 Jan 2026 21:37:08 +0000

Building an AI Financial Roaster That People Actually Want to Use

Let me show you how to build the most brutally honest financial advisor you'll ever meet: an AI that doesn't care about your feelings.

I love building things that people actually use, so I wanted a tutorial project that teaches developers how to make AI applications people enjoy. Let's be honest: most AI tutorials are boring calculators and chatbots.

The project? Finance Roaster: users upload their bank statement and get roasted by GPT-4o-mini with savage, witty humor.

This tutorial will teach you:

How to process CSV and PDF bank statements

How to engineer AI prompts for humor and personality

How to build shareable, viral-worthy UI/UX

How to create an AI app in under 100 lines of Python

By the end of this guide, you'll have a fully functional app you can deploy this weekend.

The Psychology: Why This Project Concept Works

Before diving into code, let's talk about why this makes such a great tutorial project.

Traditional budgeting apps are boring. They show you pie charts and tell you to "reduce discretionary spending." Not exactly share-worthy.

But an AI that says: "$87 at Foods Co for organic kale? Your wallet is as wilted as that kale will be in three days"? That's memorable. That's the kind of feature that makes people actually want to test your app.

Three principles that make this project interesting:

Humor makes boring tasks fun - People avoid checking their finances because it's painful. Adding humor changes that dynamic.

Highly shareable - When you build something funny, people naturally want to show friends. This teaches you virality mechanics.

Practical learning - You'll learn file uploads, AI prompt engineering, and building clean UIs, all transferable skills.

The Tech Stack (Perfect for Learning)

I deliberately chose a simple stack that beginners can understand while still being production-ready.

What we'll use:

Backend: Flask (Python) - 100 lines of code

AI Engine: OpenAI's GPT-4o-mini ($0.15 per 1M input tokens)

File Processing: Pandas for CSV, pdfplumber for PDF statements

Frontend: Single HTML page with Tailwind CSS

Cost per request: ~$0.01

Why this stack is perfect for learning:

No deployment complexity (runs locally or on a single server)

Sub-3-second response times

Teaches you file handling, AI integration, and modern UI design

Easily extensible (add features as you learn more)

Part 1: Making AI Understand Money (The Easy Part)

First challenge: How do you teach an AI to roast someone's spending?

Step 1: Extract Transaction Data

Most banks export statements as CSV or PDF. Here's the extraction logic:

import pandas as pd
import pdfplumber
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def process_csv(file):
    """Extract spending summary from CSV bank statement"""
    df = pd.read_csv(file)

    # Find the amount column (banks use different names)
    amount_col = next(
        (col for col in df.columns 
         if any(kw in col.lower() for kw in ['amount', 'debit', 'spent'])),
        df.columns[-1]  # Default to last column
    )

    # Convert to numeric, handle currency symbols
    # regex=False treats '$' as a literal character, not an end-of-string anchor
    df[amount_col] = pd.to_numeric(
        df[amount_col].astype(str)
        .str.replace('$', '', regex=False)
        .str.replace(',', '', regex=False),
        errors='coerce'
    )

    # Calculate summary on absolute values, so debits stored as
    # negative numbers are ranked by size as well
    spent = df[amount_col].abs()
    total_spent = spent.sum()
    top_expenses = df.loc[spent.nlargest(5).index]

    return {
        'total_spent': f"${total_spent:,.2f}",
        'top_expenses': top_expenses.to_dict('records')
    }

PDF handling is trickier because banks format statements differently. Solution? Use pdfplumber for text extraction, then let GPT-4 parse it:

def extract_transactions_from_pdf(pdf_text):
    """Use AI as a fallback parser for complex PDFs"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Extract transactions: Date | Description | Amount"
        }, {
            "role": "user",
            "content": pdf_text[:4000]  # Cap input size (characters, not tokens)
        }]
    )
    # Parse AI response into structured data
    # ...
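The parsing step is left open above. As one hypothetical way to fill it, the sketch below assumes the model really does reply with one "Date | Description | Amount" line per transaction, which is not guaranteed, and silently skips anything that does not fit. The function name and sample text are illustrative.

```python
def parse_transaction_lines(text):
    """Parse 'Date | Description | Amount' lines into dicts; skip everything else."""
    transactions = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # not a transaction row
        date, description, raw_amount = parts
        try:
            amount = float(raw_amount.replace("$", "").replace(",", ""))
        except ValueError:
            continue  # header row or malformed amount
        transactions.append(
            {"date": date, "description": description, "amount": amount}
        )
    return transactions

# Made-up model reply for illustration
sample_reply = """Date | Description | Amount
2026-01-03 | STARBUCKS #1234 | $6.45
2026-01-04 | DOORDASH | -1,024.10
Here are your transactions!"""

print(parse_transaction_lines(sample_reply))
```

Because model output can drift from the requested format, it is worth checking that the list is not empty before building a roast from it.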

Pro tip: AI is surprisingly good at parsing messy financial data. I tried regex patterns for weeks before realizing GPT-4 could do it in one call.

Part 2: Teaching AI to Be Savage (The Fun Part)

Now the magic: turning transaction data into comedy gold.

The System Prompt (Your AI's Personality)

This is where most people fail. A generic "be funny" prompt produces generic humor. You need specificity.

Here's my system prompt:

SYSTEM_PROMPT = """
You are a Savage Financial Parent - brutally honest, wickedly funny,
and armed with someone's transaction history.

Your job: Roast their spending habits using SPECIFIC transaction details.

Style guide:

Think disappointed parent meets stand-up comedian
Use their actual purchase names (e.g., "Taco Bell at 2 AM")
Be savage but not cruel - make them laugh while crying
End with a Financial Maturity Grade (A-F) and one-line explanation

Examples of good roasts:

"Five Starbucks trips in one day? That's not a coffee addiction, that's a $30 therapy session with extra foam."
"You spent $200 on DoorDash this month. That's literally paying someone $8 to make your laziness official."

Keep it under 200 words. Make it screenshot-worthy.
"""

Why this works:

Specific examples teach the AI your humor style

Constraints (200 words) force concise, punchy writing

"Screenshot-worthy" reminds the AI this is for social sharing

The Actual API Call

def generate_roast(summary_data):
    """Send transaction summary to OpenAI, get roast back"""

    # Build the prompt from whatever the summary contains, so a missing
    # 'frequent_vendors' key or an empty list doesn't raise an error
    lines = [f"- Total Spent: {summary_data['total_spent']}"]
    if summary_data.get('top_expenses'):
        lines.append(f"- Top Expense: {summary_data['top_expenses'][0]}")
    if summary_data.get('frequent_vendors'):
        top_vendor = summary_data['frequent_vendors'][0]
        lines.append(
            f"- You went to {top_vendor['vendor']} {top_vendor['count']} times"
        )
    details = "\n".join(lines)

    prompt = f"""
Transaction Data:
{details}

Roast these spending habits. Be specific and savage.
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        temperature=0.9  # High creativity for humor
    )

    return response.choices[0].message.content

The temperature setting (0.9) is crucial. Lower values make AI boring. Higher values make it creative... sometimes too creative. 0.9 is the sweet spot for comedy.
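The prompt in generate_roast reads summary_data['frequent_vendors'], a list of entries with 'vendor' and 'count' keys that process_csv, as shown earlier, does not return. One way to build that list is sketched below. It assumes transactions are dicts with a 'description' field, which is a hypothetical key name (real bank exports vary), and it groups vendors by a simple upper-cased match.

```python
from collections import Counter

def frequent_vendors(transactions, description_key="description", top_n=3):
    """Count how often each vendor appears. description_key is an assumed column name."""
    counts = Counter(
        str(row[description_key]).strip().upper()
        for row in transactions
        if row.get(description_key)
    )
    return [
        {"vendor": vendor, "count": count}
        for vendor, count in counts.most_common(top_n)
    ]

# Made-up rows for illustration
rows = [
    {"description": "Starbucks", "amount": 6.45},
    {"description": "starbucks ", "amount": 5.10},
    {"description": "DoorDash", "amount": 31.00},
]
print(frequent_vendors(rows))
```

Real statement descriptions often carry store numbers or city names ("STARBUCKS #1234 SEATTLE"), so exact matching like this will undercount until the descriptions are normalized.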

Part 3: The UI That Makes People Click "Roast Me"

Nobody uploads bank statements to ugly websites. Your UI needs to be:

Dark and edgy (matches the roasting vibe)

Dead simple (one button: "Roast Me")

Fast (loading states matter)

The Landing Page

<div class="bg-gradient-to-br from-gray-900 to-black min-h-screen">
  <h1 class="text-6xl font-bold bg-gradient-to-r from-red-500 to-pink-500 
             bg-clip-text text-transparent">
    💸 Finance Roaster
  </h1>
  <p class="text-xl text-gray-400">
    Upload Your Regrets. Get Brutally Honest Feedback.
  </p>

  <!-- Drag-and-drop upload area -->
  <div class="border-2 border-dashed border-red-500 rounded-xl p-12 
              hover:bg-red-500/10 cursor-pointer">
    <input type="file" accept=".csv,.pdf" hidden />
    <p>Drop your bank statement here</p>
    <p class="text-sm text-gray-500">CSV or PDF • Max 16MB</p>
  </div>

  <button class="w-full bg-gradient-to-r from-red-500 to-pink-500 
                 text-white font-bold py-4 rounded-xl">
    🔥 Roast Me
  </button>
</div>

Key design choices:

Red/pink gradient: Danger + playfulness

Dark background: Focuses attention on the upload area

Emojis: Adds personality without extra design work

Tailwind CSS: Rapid prototyping without writing CSS

The Loading State (Critical for Virality)

While the AI thinks, show personality:

function showLoading() {
  document.getElementById('loadingState').innerHTML = `
    <div class="animate-spin h-16 w-16 border-b-2 border-red-500"></div>
    <p class="text-xl">Analyzing your poor life choices...</p>
    <p class="text-sm text-gray-500">This won't take long. Unlike your debt.</p>
  `;
}

Why this matters: Users wait 3-5 seconds for AI responses. A boring spinner loses attention. A funny loading message keeps them engaged.

Part 4: Security (Because Banks Care About This)

You're handling financial data. Even for a joke app, security matters.

Three Non-Negotiable Rules:

Never store files on disk

❌ BAD: Saves file to server

file.save('statements/' + filename)

✅ GOOD: Process in-memory

file_bytes = io.BytesIO(file.read())
df = pd.read_csv(file_bytes)

Don't log transaction details

❌ BAD: Logs sensitive data

print(f"User spent ${amount} at {merchant}")

✅ GOOD: Generic logging

print(f"Processing statement with {len(transactions)} transactions")

Strip personal info before sending to OpenAI

# Remove account numbers, names, addresses
summary_data = {
    'total_spent': total,
    'top_categories': categories,  # "Food", not "Joe's Pizza"
    'spending_pattern': pattern
}
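If any free-text descriptions do end up in the summary, long digit runs are the first thing to mask. The sketch below is a hypothetical helper built on one loud assumption: any run of nine or more digits, optionally split by spaces or dashes, is treated as an account or card number. It will not catch names or addresses.

```python
import re

# Assumption: 9+ digits (optionally separated by spaces or dashes) is an
# account or card number. Eight-digit dates like 2026-01-03 are left alone.
_LONG_NUMBER = re.compile(r"\b(?:\d[ -]?){8,}\d\b")

def redact(text):
    """Mask long digit runs so account or card numbers are not sent to the API."""
    return _LONG_NUMBER.sub("[REDACTED]", text)

print(redact("TRANSFER TO ACCT 1234-5678-9012 REF 42"))
```

A pattern like this is a safety net, not a guarantee, which is one more reason to tell users where their data goes.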

Reality check: Even with sanitization, tell users their data goes to OpenAI. Transparency builds trust.

Part 5: Adding Viral Features (Optional Enhancement)

If you want to take this tutorial further, here's how to add shareability:

Built-in Sharing Capability:

One-Click Social Sharing

function shareRoast() {
  const roastText = document.getElementById('roast').textContent;
  const shareText = `I just got my spending roasted by AI! 🔥\n\n"${roastText.substring(0, 200)}..."\n\nBuild your own: [your-github-link]`;

  if (navigator.share) {
    navigator.share({ text: shareText });
  } else {
    navigator.clipboard.writeText(shareText);
    alert('Roast copied! Share with friends! 🔥');
  }
}

Why this matters: If you eventually deploy this publicly, sharing features help spread awareness.

The Financial Maturity Grade

Every roast ends with a grade: A through F.

Financial Maturity Grade: D-
"You have the self-control of a toddler in a candy store,
except the candy is overpriced artisanal coffee."

Learning point: Grades, scores, and rankings make results more shareable and comparable.
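To show the grade as its own badge or include it in share text, it first has to be pulled out of the roast. A minimal sketch, assuming the model labels it exactly as "Financial Maturity Grade:" as in the example above; the function name is hypothetical, and the function returns None when the label is missing.

```python
import re

# Letter A-F with an optional + or -, not followed by another letter
_GRADE = re.compile(r"Financial Maturity Grade:\s*([A-F][+-]?)(?![A-Za-z])")

def extract_grade(roast_text):
    """Return the letter grade (e.g. 'D-'), or None if the model omitted it."""
    match = _GRADE.search(roast_text)
    return match.group(1) if match else None

roast = 'Nice kale.\n\nFinancial Maturity Grade: D-\n"You have the self-control of a toddler."'
print(extract_grade(roast))  # D-
```

Since the model may phrase the label differently from run to run, the UI should handle the None case gracefully, for example by hiding the badge.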

What You'll Learn Building This

This project teaches you several valuable skills:

File Processing:

Handle both CSV and PDF uploads

Parse unstructured financial data

Work with Pandas for data analysis

AI Engineering:

Craft effective system prompts for personality

Temperature tuning for creative outputs

Handle AI responses in production

Full-Stack Development:

Build clean REST APIs with Flask

Create engaging UIs with Tailwind CSS

Implement drag-and-drop file uploads

Handle loading states and user feedback

Deployment Ready:

In-memory file processing (no disk storage)

Error handling and validation

Security best practices for financial data

Design Decisions: Why I Built It This Way

What Works in This Architecture:

Simplicity over features: A single-purpose app is easier to understand and build. You can always add features later once you understand the core.

Humor as the hook: The AI's personality is what makes this project interesting. Technical tutorials don't have to be boring.

File processing in memory: No database needed for this tutorial. Everything happens in RAM, making deployment simple.

What You Might Customize:

Add comparison data: "You spent more on coffee than X% of typical users" - requires a database to track aggregates.

Mobile-first design: The current UI works on mobile, but you could optimize it further with responsive breakpoints.

Prompt variations: The system prompt I provide is a starting point. Experiment with different personalities and tones.

Build Your Own AI App: What This Template Teaches

This project is designed to be a template for other AI applications. Here's how to adapt it:

Find a task people find tedious

Checking finances → This tutorial

Reading legal documents → "Legal Jargon Translator"

Analyzing health data → "Fitness Report Card"

Reviewing meeting notes → "Meeting BS Detector"

Add personality through AI prompts

Make AI roast them (finance, fitness)

Turn it into a game (quiz format)

Add competitive elements (grades, scores)

Use unexpected analogies (explain tech in food terms)

Keep the UI simple

One-click file upload

Clear results display

Optional sharing features

Fast loading indicators

Make it easy to extend

Modular code structure

Clear separation of concerns

Well-commented functions

Standard design patterns

The Complete Code Repository

The entire Finance Roaster tutorial project is available on GitHub. Here's the structure:

FinanceRoaster/
├── app.py              # Flask server (81 lines)
├── services.py         # AI + file processing (156 lines)
├── templates/
│   └── home.html       # UI (342 lines with CSS/JS)
├── requirements.txt    # Dependencies (7 packages)
├── .env.example        # Template for your API key
└── README.md           # Setup instructions

To run this tutorial locally:

git clone https://github.com/yourusername/FinanceRoaster
cd FinanceRoaster
pip install -r requirements.txt
# Copy .env.example to .env and add your OpenAI key
python app.py

Visit http://localhost:5000 and test it out.

Learning cost: ~$0.01 per test (OpenAI API calls)

What's included:

Complete working code

Detailed comments explaining each function

Example bank statements for testing

Deployment guide for Heroku/Railway

Potential Use Cases (What You Could Build)

Once you understand this template, you can adapt it for various applications:

Personal Finance Tools:

Budget analyzer with friendly advice

Subscription tracker that flags unused services

Shopping habit analyzer (impulse vs planned purchases)

Professional Development:

Resume reviewer that gives honest feedback

Email tone analyzer (passive-aggressive detector)

Meeting notes summarizer with action items

Health & Wellness:

Food diary analyzer with nutrition roasts

Workout consistency tracker with motivational guilt

Sleep pattern analyzer with bedtime recommendations

Content Creation:

Social media post analyzer (engagement predictor)

Blog post readability scorer

Video script feedback tool

The core pattern: upload file → AI analysis → humorous/useful feedback - works for countless domains.

The Technical Deep Dive (For the Nerds)

Challenge 1: PDF Parsing Hell

Banks format PDFs differently. Chase, Wells Fargo, Bank of America - all unique snowflakes.

My solution: Hybrid approach

import re
import pandas as pd

def parse_pdf_transactions(pdf_text):
    # Try regex first (fast, works 70% of the time)
    # Captures: date (M/D/YYYY), description, amount (optional "-" and "$")
    pattern = r'(\d{1,2}/\d{1,2}/\d{4})\s+(.+?)\s+([-]?\$?\d+\.?\d*)'
    matches = re.findall(pattern, pdf_text)

    if matches:
        return pd.DataFrame(matches, columns=['Date', 'Desc', 'Amount'])

    # Fallback to AI (slower, works 95% of the time)
    # extract_with_gpt is a separate helper (not shown here)
    return extract_with_gpt(pdf_text)

Lesson: AI should be your fallback, not your first choice. Regex is 100x faster.

Challenge 2: Keeping Roasts PG-13

Early versions were too savage. One user got: "Your spending screams 'quarter-life crisis.' Seek therapy."

The fix: Content filters

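FORBIDDEN_TOPICS = [
    'therapy', 'depression', 'mental health',
    'divorce', 'death', 'addiction'
]

def sanitize_roast(roast_text, summary_data, retries=2):
    # Check for sensitive topics
    flagged = any(topic in roast_text.lower() for topic in FORBIDDEN_TOPICS)
    if flagged and retries > 0:
        # Regenerate and re-check, so the retry goes through the same filter
        return sanitize_roast(generate_roast(summary_data), summary_data, retries - 1)
    # Note: once retries run out, the last attempt is returned as-is
    return roast_text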

Temperature tuning also helps: Lowering from 1.0 to 0.9 reduced inappropriate jokes by 60%.

Challenge 3: Rate Limiting Without Breaking UX

OpenAI has rate limits. Hitting them = angry users.

Solution: Queue system

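from flask import jsonify, request
from redis import Redis
from rq import Queue
from rq.job import Job

# Share one Redis connection between the queue and the status lookups
redis_conn = Redis()
queue = Queue(connection=redis_conn)

@app.route('/roast', methods=['POST'])
def roast():
    # Read the upload into bytes so it can be handed to the worker
    file_data = request.files['file'].read()
    job = queue.enqueue(process_and_roast, file_data)
    return jsonify({'job_id': job.id})

@app.route('/status/<job_id>')
def status(job_id):
    job = Job.fetch(job_id, connection=redis_conn)
    if job.is_finished:
        return jsonify({'status': 'complete', 'roast': job.result})
    if job.is_failed:
        return jsonify({'status': 'failed'}), 500
    return jsonify({'status': 'processing'})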

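Frontend polls /status every second. The upload request returns immediately, and the number of workers caps how many OpenAI calls run at once, so a burst of uploads no longer turns into a burst of API requests.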

Extension Ideas: Taking It Further

Once you've built the basic version, here are ways to extend it:

Add data persistence

# Save anonymized spending patterns
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy(app)

class SpendingPattern(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    total_amount = db.Column(db.Float)
    category = db.Column(db.String(50))
    timestamp = db.Column(db.DateTime)

Implement comparison features

def get_percentile(user_spending, category):
    all_spending = SpendingPattern.query.filter_by(category=category).all()
    if not all_spending:
        return None  # No data for this category yet, so avoid dividing by zero
    below = sum(1 for x in all_spending if x.total_amount < user_spending)
    return (below / len(all_spending)) * 100

Add authentication

from flask_login import LoginManager, login_required

@app.route('/history')
@login_required
def view_history():
    # Show user's past roasts
    pass

Implement rate limiting

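from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# Flask-Limiter 3.x takes key_func as the first argument and app as a keyword
limiter = Limiter(get_remote_address, app=app)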

@app.route('/roast', methods=['POST'])
@limiter.limit("5 per hour")
def roast():
    # Prevent API abuse
    pass

Deployment Options: From Local to Production

Once your project works locally, here are deployment options:

Option 1: Heroku (Easiest)

# Install Heroku CLI, then (app names must be lowercase and globally unique):
heroku create finance-roaster
heroku config:set OPENAI_API_KEY=your_key
git push heroku main

Option 2: Railway

Connect your GitHub repo

Add environment variables

Auto-deploys on push

Option 3: DigitalOcean App Platform

Deploy from GitHub

$5/month for basic app

Built-in SSL certificates

Scaling considerations:

Caching for similar inputs: Store common roast patterns

Rate limiting: Prevent abuse with Flask-Limiter

CDN for assets: Use Cloudflare free tier

Database: PostgreSQL if you add user accounts
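
The caching idea in the list above could be sketched roughly like this. It is a minimal in-memory version that only catches identical summaries, and it assumes the summary dict is JSON-serializable. The names `summary_key`, `cached_roast`, and the `generate` callable are hypothetical and not part of the repo.

```python
import hashlib
import json

# Hypothetical in-process cache; a shared store such as Redis would be
# needed once more than one server process is running.
_roast_cache = {}

def summary_key(summary: dict) -> str:
    """Build a stable hash of a spending summary, independent of key order."""
    payload = json.dumps(summary, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_roast(summary: dict, generate) -> str:
    """Return the cached roast for an identical summary, else call `generate`."""
    key = summary_key(summary)
    if key not in _roast_cache:
        _roast_cache[key] = generate(summary)
    return _roast_cache[key]
```

Catching merely similar inputs would need some bucketing step first (for example rounding totals), which is a product decision rather than something this sketch settles.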

For a tutorial project, any of these platforms work great. Note that Heroku retired its free tier in November 2022, so check each platform's current pricing before you pick one.

The Bigger Picture: What This Tutorial Teaches About AI Development

Here's what building Finance Roaster teaches you about creating useful AI applications:

Lesson 1: AI doesn't have to be serious. Most AI tutorials focus on optimization, efficiency, and accuracy. But the best projects are ones people actually want to use. Adding personality makes your projects memorable.

Lesson 2: Simple prompts can be powerful. The entire "roasting" capability comes from a well-crafted system prompt. You don't need fine-tuning or complex models, just clear instructions and examples.

Lesson 3: User experience matters more than complexity. A 100-line Flask app with great UX beats a microservices architecture with boring design. Focus on the user experience first, optimize later.

Lesson 4: File processing is a valuable skill. Being able to parse CSVs and PDFs opens up countless project possibilities. Financial statements, receipts, invoices, medical records, they're all structured data waiting to be analyzed.

This pattern applies to many domains:

Upload document → AI analyzes → Useful/entertaining output

It's simple, it works, and users understand it immediately

Try It Yourself: Weekend Project Checklist

Want to build your own AI app this weekend? Here's your roadmap:

Friday night (0.5 hours):

Pick your painful task (finance, health, career, dating)

Write 10 example roasts manually

Define your AI's personality in 3 sentences

Saturday (1 hour):

Set up Flask + OpenAI

Build file upload handling

Test your system prompt with real data

Iterate on temperature/prompt until roasts are funny

Sunday (1.5 hours):

Design UI with Tailwind

Add one-click sharing

Test on 5 friends (get honest feedback)

Deploy to your preferred cloud platform

Total time: 3 hours

Total cost: roughly $0 (a free hosting tier, if your platform offers one, plus OpenAI free credits)

The Code (Seriously, It's That Simple)

Here's the entire backend in 81 lines:

from flask import Flask, request, jsonify, render_template
import io, pandas as pd, pdfplumber
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
app = Flask(__name__)
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

SYSTEM_PROMPT = """You are a Savage Financial Parent - brutally honest, 
wickedly funny, armed with transaction history. Roast their spending using 
SPECIFIC details. Think disappointed parent meets stand-up comedian. 
Keep under 200 words. End with Financial Maturity Grade (A-F)."""

def process_csv(file):
    df = pd.read_csv(file)
    amount_col = next((c for c in df.columns 
                      if 'amount' in c.lower()), df.columns[-1])
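    df[amount_col] = pd.to_numeric(
        # regex=False strips the literal "$" (as a regex, "$" means end of string);
        # thousands separators are removed too so "1,234.56" parses
        df[amount_col].astype(str)
            .str.replace('$', '', regex=False)
            .str.replace(',', '', regex=False),
        errors='coerce'
    )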

    total = df[amount_col].abs().sum()
    top5 = df.nlargest(5, amount_col)

    return {
        'total_spent': f"${total:,.2f}",
        'top_expenses': top5.to_dict('records')
    }

def generate_roast(summary):
    prompt = f"""
Total Spent: {summary['total_spent']}
Top Expenses: {summary['top_expenses']}

Roast these spending habits with savage humor.
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        temperature=0.9
    )

    return response.choices[0].message.content

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/roast', methods=['POST'])
def roast():
    try:
        file = request.files['file']
        file_bytes = io.BytesIO(file.read())

        summary = process_csv(file_bytes)
        roast_text = generate_roast(summary)

        return jsonify({
            'success': True,
            'roast': roast_text,
            'summary': summary
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)

That's it. 81 lines. No ML training, no complex infrastructure, no month-long sprints.

Final Thoughts: Learn by Building

This tutorial project - a simple weekend build with 81 lines of Python - teaches more practical AI skills than many far more complex systems.

Why? Because you can actually build it yourself, understand every line, and adapt it to your own ideas.

The lesson? The best way to learn AI development is to build projects that are:

Simple enough to finish (weekend projects, not month-long sprints)
Interesting enough to share (add personality, make it fun)
Practical enough to extend (real use cases, clear applications)

So stop reading tutorials and start building. Pick a file format (PDF, CSV, JSON), add some AI analysis, and ship it this weekend.

And when you build something cool, share it! The AI development community loves creative projects.

Resources & Next Steps

Complete tutorial code: https://github.com/ademicho123/FinanceRoaster (Fork it and customize!)

What to do after building this:

Deploy it online (Heroku, Railway, or DigitalOcean)
Share with friends for feedback
Adapt the pattern for a different domain
Add the project to your portfolio

P.S. After you build this, try uploading your own bank statement. The AI is surprisingly insightful (and funny). Consider it a feature test and free financial feedback.

Additionally, if you extend this tutorial with cool features, please submit a pull request! The best additions will be merged into the main repo with credit to you.

The 88% Problem: Why Most AI Projects Die Between Pilot and Production

Ademola Balogun — Sat, 03 Jan 2026 11:59:31 +0000

There's a statistic making the rounds in tech circles that should terrify anyone investing in AI: for every 33 AI prototypes a company builds, only 4 make it into production. That's an 88% failure rate.

This isn't about bad models or insufficient computing power. The technology works. The problem is everything that happens after the model trains successfully.

The Pilot Trap

AI pilots succeed for the wrong reasons. They're built in controlled environments with clean data, dedicated resources, and forgiving success metrics. A data scientist can spend three months perfecting a model on historical data, demo it to stakeholders with cherry-picked examples, and get enthusiastic approval to "scale it up."

Then reality hits.

Production systems need to handle messy real-time data from a dozen different sources. They need to integrate with legacy systems built before anyone thought about APIs. They need to run reliably at 3 AM when the data science team is asleep. They need to make decisions fast enough that users don't notice the AI is even there.

A model that achieves 94% accuracy in the lab can become completely unusable in production if it adds 5 seconds of latency to a checkout process. Or if it requires manual data cleanup before each run. Or if it breaks every time the data format changes slightly.

The pilot proved the concept. Production requires proving the system can survive contact with reality.

The Infrastructure Nightmare

Most companies discover too late that their IT infrastructure can't actually support AI at scale.

The pilot ran on a data scientist's laptop or a small cloud instance. Production needs to handle hundreds or thousands of requests per second, maintain consistent performance during traffic spikes, and fail gracefully when—not if—something breaks.

Consider a fraud detection model. In pilot: analyzing 100 transactions overnight to flag suspicious patterns. In production: making real-time decisions on every transaction flowing through the payment system, with false positives costing customer trust and false negatives costing actual money.

That transition requires:

Infrastructure that scales dynamically with load
Monitoring systems that catch model drift before it causes problems
Fallback mechanisms when the AI service goes down
Data pipelines that can handle schema changes without breaking
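
As one illustration of the fallback item in that list, here is a minimal sketch that gives the model a latency budget and drops to a simpler rule when the call is slow or fails. The function names and the 200 ms budget are assumptions for the example, not a reference to any particular system.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool so a slow model call does not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)

def predict_with_fallback(model_call, fallback_call, payload, timeout_s=0.2):
    """Try the model within a latency budget; fall back on timeout or error.

    Returns a (result, source) tuple so callers can log how often the
    fallback path is taken.
    """
    future = _executor.submit(model_call, payload)
    try:
        return future.result(timeout=timeout_s), "model"
    except FutureTimeout:
        future.cancel()  # best effort: an already running call is not interrupted
        return fallback_call(payload), "fallback"
    except Exception:
        # Any model-side failure is treated the same way as a timeout.
        return fallback_call(payload), "fallback"
```

Returning the source alongside the result is a deliberate choice here: a rising share of "fallback" responses is itself a useful monitoring signal.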

Companies find themselves needing to build an entire MLOps infrastructure just to deploy one model. Many don't have the engineering capacity, so the project stalls indefinitely in "pilot mode."

The Data Quality Reality Check

Pilots use carefully curated datasets. Production uses whatever data the business actually generates.

In controlled testing, someone manually labeled 10,000 examples with perfect accuracy. In production, labels come from an understaffed operations team entering data between phone calls, or from automated systems with their own error rates, or from customers who click random buttons to make dialogs go away.

The training data was historical, meaning someone already fixed the errors and filled in the missing values. Production data arrives incomplete, inconsistent, and occasionally completely wrong. The model trained on "clean" data has no idea what to do when 30% of the required fields are blank.
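
One way to make that failure mode explicit is a small gate in front of the model that measures how much of a record is actually filled in. The sketch below is illustrative only: the field names and the 30% cutoff are assumptions, and a real pipeline would pick both from its own schema.

```python
# Hypothetical required fields for a transaction-style record.
REQUIRED_FIELDS = ["amount", "merchant", "timestamp", "currency", "account_id"]

def missing_ratio(record: dict, required=REQUIRED_FIELDS) -> float:
    """Fraction of required fields that are absent, None, or empty strings."""
    missing = sum(1 for field in required if record.get(field) in (None, ""))
    return missing / len(required)

def route_record(record: dict, max_missing: float = 0.3) -> str:
    """Send sufficiently complete records to the model, the rest to review."""
    if missing_ratio(record) <= max_missing:
        return "score"
    return "manual_review"
```

Logging the ratio over time also shows whether production data is drifting away from the completeness the model was trained on.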

Here's what kills projects: discovering after six months of development that the data quality required to make the AI reliable doesn't actually exist in production. Companies face a choice—invest millions in data infrastructure improvements, or abandon the AI project. Most choose the latter.

Recent surveys show that 43% of organizations cite data quality and readiness as their top obstacle to AI deployment. Not model performance. Not computing costs. Data quality.

The Integration Hell

AI models don't run in isolation. They need to integrate with the dozens of existing systems that actually run the business.

The pilot proved the model works. Now it needs to:

Pull data from a CRM system last updated in 2012
Send predictions to an ERP system that uses SOAP APIs
Log results to a data warehouse built for batch processing, not real-time updates
Trigger alerts in Slack, email, and a proprietary monitoring tool
Comply with access controls defined in Active Directory

Each integration point is a potential failure mode. The legacy systems weren't designed for AI workloads. Their APIs rate-limit at inconvenient thresholds. Their data formats don't quite match what the model expects. Their update schedules conflict with when the model needs fresh data.

Integration complexity grows exponentially with each system involved. What looked like a straightforward deployment becomes a year-long integration project touching half a dozen teams who all have conflicting priorities.

The Organizational Friction

Technical challenges have technical solutions. Organizational challenges are harder.

AI projects in production require coordination between data scientists, ML engineers, software developers, IT operations, security teams, compliance officers, and business stakeholders. Each group speaks a different language and optimizes for different goals.

Data scientists care about model accuracy. Operations cares about uptime. Security cares about access controls. Compliance cares about audit trails. Business stakeholders care about ROI. Getting all these groups aligned is harder than building the model.

Then there's the resistance from people whose workflows the AI will change. The customer service team that doesn't trust automated classifications. The credit analysts who resent being "replaced" by an algorithm. The operations managers who built their processes around manual review and don't want to redesign everything.

A technically perfect AI system can fail in production simply because nobody wants to use it.

The Cost Trap

Pilots are cheap. Production is expensive.

Running a model on a sample dataset costs dollars. Running it continuously on production traffic costs thousands per month in compute, storage, and bandwidth. Fine-tuning and retraining as data drifts adds more costs. Monitoring, logging, and debugging infrastructure adds more costs. The human operators who need to intervene when the AI gets confused add more costs.

Companies approve pilot budgets easily—it's "innovation" and "staying competitive." Production budgets require demonstrating clear ROI, which is hard when the system isn't deployed yet and all the costs are upfront while the benefits are theoretical.

CFOs kill AI projects when the production cost projections arrive. The business case that justified the pilot evaporates when scaling it up requires 10x the ongoing expense.

What Actually Works

The companies successfully moving AI from pilot to production do a few things differently:

They design for production from day one. No "we'll figure out deployment later." The first question is: how will this actually run in production? That constraint shapes everything—model complexity, latency requirements, data dependencies, failure modes.

They build MLOps infrastructure first. Before the third AI pilot starts, they invest in standardized deployment pipelines, monitoring frameworks, and model management systems. The infrastructure work feels like a distraction from "real" AI, but it's what separates successful deployments from permanent pilots.

They start with use cases that have simple integration requirements. Don't make your first production AI project depend on integrating with seven legacy systems. Pick something that can run relatively standalone and deliver value even with imperfect accuracy.

They accept that production models will be "worse" than pilot models. The pilot model achieved 96% accuracy on clean data with infinite compute time. The production model gets 89% accuracy but runs in 100ms with realistic data quality. The second one is actually more valuable because it ships.

They invest in data infrastructure before AI. Companies with mature data practices—centralized data warehouses, standardized schemas, automated quality checks—can deploy AI relatively easily. Companies with fragmented data spend years building that foundation first.

The Uncomfortable Math

Current industry data shows:

74% of organizations struggle to scale AI from pilot to production
Only 1% of companies consider themselves "AI-mature"
46% of AI pilots are scrapped before reaching production
Fewer than 20% of AI initiatives have been fully scaled across the enterprise

This isn't improving. If anything, as more companies launch AI pilots, the failure rate is increasing because everyone hits the same infrastructure and organizational walls.

The gap between "we built a model" and "the model is creating business value" remains stubbornly wide. Not because the technology isn't ready—it is. But because deploying production systems is legitimately hard, and most organizations underestimate how hard until they're deep into failed attempts.

The Path Forward

The AI pilot-to-production problem won't solve itself. It requires acknowledging that:

AI deployment is an engineering problem, not a data science problem. After the model is trained, 90% of the work is software engineering, DevOps, and systems integration. Companies that treat AI deployment as a data science project fail.

Production requirements should drive pilot design. If you can't deploy it, don't build it. Pilots should be proofs of deployment, not just proofs of concept.

Infrastructure investment must precede AI investment. Building five AI pilots without MLOps infrastructure is worse than building one pilot with proper deployment capabilities.

Organizational alignment is as important as technical capability. The best AI in the world fails if nobody trusts it, uses it, or maintains it.

The companies that figure this out will have a massive competitive advantage. Not because their models are better, but because their models actually run in production instead of gathering dust in Jupyter notebooks.

The 88% failure rate isn't inevitable. It's just what happens when organizations conflate "we built an AI model" with "we deployed an AI system." Those are completely different problems requiring completely different capabilities.

The hardest thing about AI isn't building models—it's building the systems, processes, and organizational capabilities required to run those models reliably in production. Until the industry solves the deployment gap, most AI investment will continue producing expensive prototypes instead of business value.

Why I Chose Voice Over Chat for AI Interviews (And Why It Almost Backfired)

Ademola Balogun — Sun, 28 Dec 2025 08:00:26 +0000

Most AI interview platforms are glorified chatbots with better questions. We built Squrrel to do something harder: have actual spoken conversations with candidates.

That decision nearly killed the product before launch.

The Obvious Choice That Wasn't Obvious

When I started building Squrrel, the safe play was text-based interviews. Lower latency, fewer technical headaches, easier to parse and analyze. Every AI product manager I talked to said the same thing: "Start with chat. Voice is a nightmare."

They were right about the nightmare part.

But I kept coming back to one fact: 78% of recruiting happens over the phone. Not email. Not Slack. Phone calls. Because hiring managers want to hear how candidates think on their feet, how they structure explanations, whether they can articulate complex ideas clearly.

A text-based interview platform would be easier to build and completely miss the point.

So we went with voice. And immediately discovered why everyone warned us against it.

The Technical Debt I Didn't See Coming

Speech recognition for interviews is different from speech recognition for everything else.

Siri and Alexa are optimized for short commands. Transcription tools like Otter are optimized for meetings with multiple speakers. We needed something that could handle:

20-40 minute monologues about technical projects

Industry jargon that doesn't exist in standard training data ("Kubernetes," "PostgreSQL," "JWT authentication")

Non-native English speakers with varying accents

Candidates who talk fast when nervous or slow when thinking

Off-the-shelf speech-to-text models failed spectacularly. Our first pilot had a 23% word error rate on technical terms. A candidate said "I implemented Redis caching" and got transcribed as "I implemented ready's catching." Recruiters couldn't trust the output.
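
For readers who have not computed it before, word error rate is the word-level edit distance between the reference and the transcript, divided by the number of reference words. The sketch below is a plain implementation of that standard definition, not the evaluation code used for Squrrel, and it scores whole utterances rather than only technical terms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# The Redis example from above: two of four words are wrong.
print(word_error_rate("I implemented Redis caching",
                      "I implemented ready's catching"))  # 0.5
```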

I spent three weeks fine-tuning Wav2Vec 2.0 on domain-specific data—transcripts from actual tech interviews, recordings of engineers explaining their work, podcasts about software development. Got the error rate down to 6% for technical vocabulary.

But here's what surprised me: the remaining errors weren't random. They clustered around moments of hesitation, filler words, and self-corrections—exactly the moments that reveal how someone thinks under pressure.

We almost removed those "errors" before realizing they were features, not bugs.

The Conversational AI Problem Nobody Talks About

Building an AI that can conduct a natural interview conversation is way harder than building one that asks scripted questions.

The models are good at turn-taking now—knowing when the candidate has finished speaking, when to probe deeper, when to move on. But they're terrible at knowing why to do those things.

Our first version would ask "Tell me about a time you faced a technical challenge" and then immediately jump to the next question, regardless of whether the candidate gave a three-sentence answer or a three-minute story. It felt robotic because it was robotic—no human interviewer would blow past a shallow answer without following up.

We had to build a layer that analyzes response depth and triggers follow-ups. Not just keyword matching—actual semantic understanding of whether the candidate addressed the question or danced around it.

This meant combining LLaMA 3.3 70B for conversation flow with TinyBERT for real-time classification. The large model decides what to ask, the small model decides if the answer was substantive enough to move forward. They run in parallel with about 800ms latency between candidate finishing and AI responding.
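
The control flow around those two models can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production code: `crude_depth_check` is a word-count placeholder standing in for the small classifier, and the cap of two follow-ups is a made-up parameter.

```python
from typing import Callable

def crude_depth_check(question: str, answer: str, min_words: int = 40) -> bool:
    """Placeholder for the small classifier: treats long answers as substantive."""
    return len(answer.split()) >= min_words

def choose_next_action(
    question: str,
    answer: str,
    follow_ups_used: int,
    is_substantive: Callable[[str, str], bool] = crude_depth_check,
    max_follow_ups: int = 2,
) -> str:
    """Decide whether to probe deeper or move on to the next core question."""
    if is_substantive(question, answer):
        return "next_question"
    if follow_ups_used < max_follow_ups:
        return "follow_up"
    # Stop probing after a few attempts so the interview keeps moving.
    return "next_question"
```

Keeping the classifier behind a plain callable means the heuristic can be swapped for a real model without touching the interview loop.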

That 800ms pause? Candidates tell us it makes the conversation feel more natural. Humans don't respond instantly either.

The Bias Problem That Wasn't a Bias Problem

Everyone asked about bias in AI hiring. "How do you prevent discrimination against protected classes?"

Honest answer? We can't. Not completely.

But we can be transparent about where bias enters the system and give recruiters tools to catch it.

Our approach:

Standardized questions - Every candidate gets asked the same core questions in the same order. This eliminates the biggest source of interviewer bias: one person getting softball questions while another gets grilled.

Anonymized analysis - The AI evaluation doesn't see candidate names, photos, or demographic data. It only sees the transcript and voice characteristics relevant to communication (clarity, pace, coherence—not accent or gender).

Bias audit logs - We track which candidates get follow-up questions and why. If the AI is consistently probing deeper with one demographic group, that pattern surfaces in our analytics.

Human override - Recruiters see the full transcript alongside the AI summary. They can—and do—disagree with the AI's assessment.
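
To make the audit-log idea concrete, here is a minimal sketch of the kind of check it enables: compare follow-up rates across groups and flag large gaps. It assumes group labels are held separately from the evaluation pipeline and joined only for auditing; the function names and the 0.2 gap threshold are hypothetical.

```python
from collections import defaultdict

def follow_up_rates(events):
    """Compute the follow-up rate per group.

    `events` is an iterable of (group_label, got_follow_up) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, got_follow_up in events:
        totals[group] += 1
        if got_follow_up:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

def flag_disparity(rates: dict, max_gap: float = 0.2) -> bool:
    """Flag when the highest and lowest group rates differ by more than max_gap."""
    if not rates:
        return False
    return max(rates.values()) - min(rates.values()) > max_gap
```

A flag here is a prompt for a human to look at the transcripts, not proof of bias on its own; small groups will produce noisy rates.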

The dirty secret of AI hiring tools is that removing human bias is impossible. What's possible is making bias visible and consistent. A human interviewer might grill technical candidates on Tuesdays because they're stressed, then lob softballs on Fridays when they're in a good mood. The AI applies the same standards at 2 PM and 2 AM.

That's not unbiased. It's consistently biased, which is actually useful if you know what you're looking for.

What Breaking Things Taught Me

When we started testing the system, the AI asked a great opening question, then froze for 14 seconds before asking it again. The candidate thought the system crashed and hung up.

The bug? Our conversation state management couldn't handle the candidate pausing to think. The silence triggered a "no response detected" error, which triggered a retry, which created a race condition.

Fixed it by adding a confidence threshold—the AI now distinguishes between "finished talking" silence and "still thinking" silence based on speech patterns in the previous 3 seconds. Not perfect, but it dropped the false-positive rate from 18% to 2%.
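
The shape of that fix can be sketched as a confidence score with a threshold. Everything specific below is an assumption made for illustration: the filler-word list, the 4-second scaling, and the 0.6 threshold are invented, and the real system works from richer speech features than the last transcribed word.

```python
# Illustrative only: words that often signal the speaker intends to continue.
CONTINUATION_WORDS = {"um", "uh", "so", "and", "but", "because", "like"}

def finished_confidence(recent_words, silence_s: float) -> float:
    """Heuristic score in [0, 1] that the speaker has finished their answer.

    `recent_words` are the words heard in the last few seconds before the pause.
    """
    score = min(silence_s / 4.0, 1.0)  # longer silence, more likely finished
    if recent_words:
        last_word = recent_words[-1].lower().strip(",.")
        if last_word in CONTINUATION_WORDS:
            score *= 0.5  # trailing filler suggests more is coming
    return score

def should_respond(recent_words, silence_s: float, threshold: float = 0.6) -> bool:
    """Respond only when confidence that the speaker is done clears the threshold."""
    return finished_confidence(recent_words, silence_s) >= threshold
```

With these made-up numbers, a three-second pause after "that was the result" triggers a response, while the same pause after "and, um" does not.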

Here's the lesson I took away: voice AI in high-stakes scenarios requires defensive design at every layer. Unlike a chatbot where someone can retype their message, you can't ask a candidate to "restart the interview" because your error handling failed.

We built in:

Automatic session recovery if connectivity drops

Manual override for recruiters to flag bad transcriptions

A "pause interview" button for candidates (surprisingly popular)

Playback of the actual audio alongside transcripts

The goal isn't perfection. It's resilience when things go wrong, because they will go wrong.

Why This Matters for Other AI Builders

If you're building AI for professional contexts—interviews, legal analysis, medical screening, financial advice—here's what I'd tell you:

Voice is worth the pain. The richness of verbal communication unlocks insights that text can't capture. But only if you're willing to solve the hard problems instead of shipping a minimum viable chatbot.

Domain-specific fine-tuning isn't optional. General-purpose models are amazing and terrible at the same time. They'll handle 90% of your use case brilliantly, then catastrophically fail on the 10% that matters most. Find that 10% early and train specifically for it.

Latency is a feature. We obsessed over response time at first, trying to get under 500ms. Then we realized that instant responses felt uncanny. The sweet spot for conversational AI is 600-1000ms—fast enough to feel responsive, slow enough to feel natural.

Build for the failure modes. Your AI will misunderstand accents, mishear technical terms, and ask nonsensical follow-ups. Design the system so humans can catch these failures gracefully instead of catastrophically.

The Uncomfortable Truth About AI Products

Six months into building Squrrel, I had a realization that almost made me quit: the AI isn't the product. The product is the workflow that the AI enables.

Candidates don't care that we use Wav2Vec 2.0 for transcription or LLaMA 3.3 for conversation. They care that they can interview at midnight without scheduling four emails. Recruiters don't care about our evaluation algorithms. They care that they can review 10 candidates in an hour instead of spending all week on phone screens.

The AI is infrastructure. The value is in removing friction from a broken process.

This realization changed everything. We stopped optimizing for model accuracy and started optimizing for user experience. We added features like letting candidates preview questions before starting, because that reduced anxiety and led to better responses—even though it "broke" the blind evaluation model we'd carefully designed.

Turns out, a slightly worse AI that people actually use beats a perfect AI that sits unused because the UX is terrible.

What's Next

We're expanding our pilots and learning every day. The technology works. The question now is whether we can scale the human side—onboarding, support, training recruiters to trust but verify AI outputs.

I'm also watching the regulatory space closely. The EU AI Act classifies hiring tools as "high-risk AI systems." New York City requires bias audits for automated employment decision tools. This is good—high-stakes AI should be regulated.

But it also means we need to build compliance into the product from day one, not bolt it on later. Audit trails, explainability, human oversight—these aren't nice-to-haves. They're survival requirements.

If you're building AI products in regulated industries, start designing for compliance now. It's way easier than retrofitting later.

The AI Agent Reality Check: What Actually Works in Production (And What Doesn't)

Ademola Balogun — Sun, 14 Dec 2025 15:25:25 +0000

As we close out 2025, everyone's been calling this "the year of AI agents." But here's what nobody wants to admit: most of these agents aren't actually working.

I've spent the last year building production AI systems—speech recognition for enterprise clients, fraud detection models, RAG chatbots handling real customer queries. And the gap between what the AI hype cycle promises and what actually ships to production is... substantial. Let me walk you through what's really happening out there.

The Production Gap Nobody Talks About

According to recent LangChain data, only 51% of companies have agents in production. That's it. Half. And here's the kicker: 78% say they have "active plans" to deploy agents soon. We've all heard that one before.

The problem isn't capability—it's that building reliable agents is genuinely hard. The frameworks have matured (LangGraph, CrewAI, AutoGen), the models have gotten better, but production deployment remains this gnarly problem that most teams underestimate.

I've seen it firsthand. A chatbot that works beautifully in your Jupyter notebook can fall apart spectacularly when real users start hammering it at 3 AM with edge cases you never imagined.

Why Most AI Projects Actually Fail

Let's talk about the uncomfortable truth: somewhere between 70% and 85% of AI projects fail to meet their ROI targets. That's not a typo. Compare that to regular IT projects, which fail at rates of 25% to 50%. AI projects are roughly twice as likely to fail.

Why? Everyone points to different culprits, but having built systems that made it through this gauntlet, here's what I've learned:

Data quality is the silent killer. Not "we don't have enough data"—we're drowning in data. The issue is that the data is fragmented, inconsistent, and fundamentally not ready for what AI needs. Traditional data management assumes you know your schema upfront. AI? It needs representative samples, balanced classes, and context that's often missing from your enterprise data warehouse.

Research shows that 43% of organizations cite data quality and readiness as their top obstacle. Another study found that 80% of companies struggle with data preprocessing and cleaning. When I built our fraud detection system using Autoencoders, we spent 60% of our time on data pipeline issues, not model architecture.

Infrastructure reality bites. The surveys are brutal on this: 79% of companies lack sufficient GPUs to meet current AI demands. Mid-sized companies (100-2000 employees) are actually the most aggressive with production deployments at 63%, probably because they're nimble enough to move fast but big enough to afford the infrastructure.

But here's the thing—you don't always need massive GPU clusters. For our sentiment analysis work with TinyBERT, we ran inference on CPU instances and it worked fine. The key is matching your infrastructure to your actual use case, not what TechCrunch says you need.

The Agent Architecture That's Actually Working

The agents that are succeeding in production aren't the autonomous, do-everything AGI dreams that AutoGPT promised us back in 2023. They're narrowly scoped, highly controllable systems with what developers call "custom cognitive architectures."

Take a look at what companies like Uber, LinkedIn, and Replit are actually deploying:

Uber: Building internal coding tools for large-scale code migrations. Not general-purpose. Specific workflows that only they really understand.

LinkedIn: SQL Bot that converts natural language to SQL queries. Super focused. Does one thing really well.

Replit: Code generation agents with heavy human-in-the-loop controls. They're not letting the AI run wild—humans are in the driver's seat.

The pattern here? These agents are orchestrators calling reliable APIs, not autonomous decision-makers. It's less "AI takes over" and more "AI makes clicking through 17 different interfaces unnecessary."

As 2025 wraps up, the takeaway is clear: the agents shipping to production in 2026 will be the ones built on this year's hard-won lessons.

What Production Actually Looks Like

From my experience building Squrrel.app (an AI recruitment platform), here are the lessons that matter:

Start embarrassingly narrow. Our interview analysis didn't try to do everything—it focused on candidate responses, extracted key insights, and flagged concerning patterns. That's it. We added features incrementally once the core loop was bulletproof.

Observability isn't optional. Tools like Langfuse or Azure AI Foundry show you what's happening inside your agent through traces and spans. Without this, you're flying blind. When our LLaMA 3.3 70B model started producing weird outputs at 2 AM, we could trace it back to a prompt formatting issue within minutes because we had proper logging.

Evaluation needs to be continuous. Offline testing with curated datasets is table stakes. But online evaluation—testing with real user queries—is where you discover the edge cases. We run both, constantly.

Cost management is real. LLM calls add up fast. We found that caching frequently-used completions and using smaller models for classification tasks cut our costs by 40%. Using TinyBERT for sentiment pre-processing before hitting the large model? Game changer.
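To make the caching idea concrete, here is a minimal sketch of an in-memory completion cache. It is an illustration under stated assumptions, not the Squrrel implementation: `CompletionCache` and `call_model` are hypothetical names, and normalising prompts by trimming and lowercasing is one possible design choice that only suits prompts where case does not matter.

```python
import hashlib


class CompletionCache:
    """Hypothetical in-memory cache that sits in front of an LLM call."""

    def __init__(self, call_model):
        # call_model is any function that takes a prompt and returns text.
        self._call_model = call_model
        self._store = {}
        self.hits = 0

    def complete(self, prompt):
        # Normalise the prompt so trivially different inputs share one entry.
        normalised = prompt.strip().lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._call_model(prompt)
        self._store[key] = result
        return result
```

In a real deployment you would likely want expiry and a shared store rather than a per-process dictionary, but the shape of the saving is the same: a repeated prompt never reaches the model.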

The Small Language Model Movement

This deserves its own section because it's one of the most practical developments of 2024.

Everyone obsessed over GPT-4 and Claude, but the real innovation? Getting sophisticated AI to run on devices as small as smartphones. Meta's quantized Llama 3.2 models are 56% smaller and up to four times faster than the originals. Nvidia's Nemotron-Mini-4B gets VRAM usage down to about 2GB.

For production systems, this matters immensely. Lower latency. Lower costs. Less infrastructure complexity. Better privacy since you're not sending everything to external APIs.

We used this approach in our sentiment analysis pipeline—TinyBERT handles the initial classification and routing, only calling the big models when necessary. Works great, costs a fraction.

The Data Problem Won't Fix Itself

Here's something I wish someone had told me earlier: AI-ready data is fundamentally different from analytics-ready data.

Traditional data management is too structured, too slow, too rigid. AI needs:

Representative samples, not just accurate records

Balanced classes for training

Rich context and metadata that analytics never required

Fast iteration cycles that traditional governance processes can't support

63% of organizations either don't have, or aren't sure they have, the right data management practices for AI. Gartner predicts that through 2026, companies will abandon 60% of AI projects that aren't supported by AI-ready data.

This isn't something you can outsource to your existing data team and hope for the best. It requires new practices, new tools, and honestly, new thinking about what "data quality" even means.

What's Coming in 2026

Based on what I'm seeing in the field and the research patterns heading into the new year:

Multimodal agents are arriving for real. Not just text—agents that understand images, generate video, process audio, all from a single interface. OpenAI's Sora and Google's Veo showed what's possible. We're about to see these capabilities embedded in production workflows.

The framework wars are consolidating. LangGraph has emerged as a clear leader for controllable agentic workflows. The verbose, opaque frameworks are getting left behind. Developers want low-level control without hidden prompts.

Agentic AI meets scientific computing. This is exciting—AI agents accelerating materials science, drug discovery, climate modeling. AlphaMissense improved genetic mutation classification. GNoME is discovering new materials. The "AI for science" vertical is heating up.

Regulation is accelerating. The EU's AI Act entered into force in 2024, its bans on certain applications started applying in February 2025, and more compliance requirements rolled out over the rest of the year. 2026 will bring even stricter governance. If you're building agents, you need to be thinking about safety, transparency, and governance now, not later.

The Practical Takeaway

If you're building AI agents as we head into 2026, here's my advice from the trenches:

Start narrow and specific. General-purpose agents are a research problem, not a product strategy.

Invest in data infrastructure early. You'll spend way more time here than on model selection.

Build observability from day one. You can't fix what you can't see.

Use small models where possible. Not every problem needs GPT-4.

Plan for failure modes. Your agent will do weird things. Have fallbacks.

Keep humans in the loop. The best production agents are human-AI collaboration, not AI autonomy.

The hype around AI agents is justified—they really can transform workflows and save significant time. Microsoft's research shows employees save 1-2 hours daily using AI for routine tasks. Our Squrrel.app platform has cut hiring cycle times substantially.

But the path from prototype to production is littered with failed projects. The companies succeeding aren't the ones with the fanciest models or the biggest budgets. They're the ones who understand that production AI is an engineering discipline, not a science experiment.

The technology works. The challenge is everything else—data, infrastructure, evaluation, monitoring, governance. Master those, and you'll be in that 51% with agents actually running in production.

Ignore them, and you'll be in the 85% wondering why your AI initiative didn't deliver.

Why Financial Sentiment Analysis Failed Without Explainability (And How I Fixed It)

Ademola Balogun — Sat, 22 Nov 2025 19:29:34 +0000

Building a Production-Ready NLP System That Traders Actually Trust

A trader approaches you with a question: "Your model says this stock is bearish based on the news. But why? What words triggered that prediction?" You pause. Your 86% accurate sentiment classifier suddenly feels useless because you can't explain it.

This is the hidden crisis in financial AI. Accuracy without explainability is a liability, not an asset.

I learned this the hard way while building a financial sentiment analysis system for Lloyds, IAG, and Vodafone. The project forced me to solve a problem that most data scientists ignore until it's too late: how do you make a black-box NLP model trustworthy enough for high-stakes trading decisions?

The Problem: Accurate But Opaque

When I started, the goal seemed straightforward: build a sentiment classifier that could analyze financial reports and news to predict market sentiment (bullish, neutral, bearish). I tested multiple models—AdaBoost, SVM, Random Forest, traditional Neural Networks—and they all performed reasonably well.

But reasonable wasn't good enough.

Here's the issue: financial markets don't reward accuracy in isolation. A model that's 83% accurate at classifying sentiment is worthless if a trader can't defend why it made a specific prediction. In regulated environments, explainability isn't a nice-to-have feature—it's a requirement.

Some traditional machine learning models are relatively easy to inspect. With a Random Forest, for example, you can look at feature importances or trace the decision paths of individual trees to see why it predicted bearish. But when I tested more sophisticated approaches, specifically TinyBERT, a transformer-based model, I faced the classic deep learning trade-off: superior performance (86.45% accuracy on Vodafone data) paired with complete opacity.

The model had learned something real about financial language. It just wouldn't tell me what.

The Breakthrough: SHAP for Financial Intelligence

Enter SHAP (SHapley Additive exPlanations). Rather than trying to reverse-engineer what the model learned, SHAP provides a principled way to decompose predictions into feature contributions using game theory.

The insight is elegant: for each prediction, SHAP calculates how much each word or phrase contributes to pushing the final sentiment classification in a particular direction. Instead of a black box, you get a transparent ledger of the model's reasoning.
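To show the game-theory idea in isolation, here is a toy sketch that computes exact Shapley values for a tiny hand-written scorer. It does not use the SHAP library or the TinyBERT model. The word weights and the `score` function are made up for illustration, and the brute-force loop over all orderings is only practical for a handful of words.

```python
from itertools import permutations

# Hypothetical word weights for a toy "bearish" scorer (not from the real model).
WEIGHTS = {"revenue": 0.1, "decline": 0.5, "headwinds": 0.3}


def score(words):
    """Toy bearish score: sum of word weights plus one interaction bonus."""
    total = sum(WEIGHTS.get(w, 0.0) for w in words)
    if "revenue" in words and "decline" in words:
        total += 0.2  # "revenue" and "decline" together count for more
    return total


def shapley_values(words):
    """Average each word's marginal contribution over every ordering."""
    contrib = {w: 0.0 for w in words}
    orderings = list(permutations(words))
    for order in orderings:
        present = []
        for w in order:
            before = score(present)
            present.append(w)
            contrib[w] += score(present) - before
    return {w: total / len(orderings) for w, total in contrib.items()}


values = shapley_values(["revenue", "decline", "headwinds"])
```

The per-word values add up to the full score of the sentence, and the interaction bonus is shared evenly between "revenue" and "decline". That additive ledger is the property that makes this style of explanation readable.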

I implemented SHAP analysis into the TinyBERT pipeline and suddenly the model became interpretable. When the classifier predicted bearish on an earnings report mentioning "revenue decline" and "market headwinds," SHAP waterfall plots showed exactly which phrases drove the prediction and by how much.

But here's what made it work in practice: I didn't just add SHAP as an afterthought. I made explainability central to the system design from day one. This meant structuring the entire pipeline around transparency.

The Architecture: Modular and Transparent

The system had eight interconnected modules, each designed with explainability in mind:

Data Collection Module: Extracted text from PDF financial reports and CSV news files from Yahoo Finance. The discipline here was crucial—clean data feeds clean explanations.

Text Preprocessing Module: Normalized text by removing noise (emojis, punctuation, extra spaces) while preserving financial jargon. This matters because "loss" has different meanings in accounting versus everyday language.

Sentiment Scoring Module: Used VADER as a baseline to assign initial sentiment labels. This acted as a sanity check—if VADER and TinyBERT disagreed significantly, it was worth investigating why.

Model Training Module: Fine-tuned TinyBERT on balanced, augmented data. Here's what made the difference: I used SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance, because imbalanced data introduces systematic bias that explainability tools can't fix.

Prediction Module: Deployed the trained model for real-time inference. Nothing flashy, but bulletproof reliability.

Explainability Module: Generated SHAP plots showing feature importance for every prediction. This is where the magic happened.

Attention Visualization Module: Transformer models use attention mechanisms—essentially learned weights showing which parts of the input matter most. By visualizing these attention scores, I gave another layer of interpretability. When the model paid 45% of its attention to a specific phrase like "operational challenges," users could see that directly.

Visualization Module: Built a Streamlit dashboard that brought everything together into a tool that financial analysts could actually use without a machine learning PhD.
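The VADER sanity check from the Sentiment Scoring Module could look roughly like the sketch below. The ±0.05 cutoff on the compound score is the conventional VADER threshold. Mapping positive and negative to bullish and bearish, and flagging only direct contradictions, are assumptions made for this example, as are the function names.

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return "bullish"
    if compound <= -threshold:
        return "bearish"
    return "neutral"


def needs_review(compound, model_label):
    """Flag only direct contradictions (bullish vs bearish) for a human look."""
    baseline = vader_label(compound)
    return {baseline, model_label} == {"bullish", "bearish"}
```

A neutral-versus-bearish mismatch is not flagged here. Whether that counts as "significant" disagreement is a judgment call you would tune against your own data.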

The Results: From Accuracy to Actionability

When I tested the complete system across three companies spanning different sectors, the numbers were strong:

TinyBERT: 83.17% accuracy (Lloyds), 83.67% (IAG), 86.45% (Vodafone)
Traditional models averaged 70-80%, showing the value of transfer learning
Most importantly: Every prediction came with full explainability

But the real win wasn't the accuracy benchmark. It was this: a senior trader could now read a SHAP explanation and either validate the model's reasoning or flag a mistake in its logic. That's when it became useful.

One example: The system flagged a document as bearish based heavily on the phrase "uncertain regulatory environment." A human analyst immediately recognized that for the specific company and time period, that language was routine boilerplate—not a genuine risk signal. The explainability caught the false positive. Without SHAP, this would've passed through unexamined.

The Challenges Nobody Talks About

Building this system taught me that explainability doesn't solve everything—sometimes it exposes new problems.

Challenge 1: Data Quality Is Foundational

SHAP can't fix garbage data. When I extracted text from PDFs with poor formatting or inconsistent structures, the model's explanations became less trustworthy. I spent significant time on data cleaning because I knew that garbage data feeding into SHAP would generate garbage explanations.

Challenge 2: Class Imbalance Distorts Explanations

Financial sentiment in the wild is imbalanced—neutral sentiments dominated the dataset, with bearish sentiments rare. If you train on imbalanced data, the model learns to predict the majority class more confidently, and SHAP will explain why. But those explanations can be misleading because they reflect the data distribution, not market reality.

I addressed this with SMOTE—synthetically creating minority class examples—which meant the model learned real patterns in bearish language rather than just learning "rarely predict bearish."
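For intuition, here is a small pure-Python sketch of the core SMOTE idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. It is a simplified illustration, not the library implementation and not the project code. It assumes the text has already been turned into numeric feature vectors, and `smote_like` is a hypothetical name.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point sits between two real minority examples, the model sees more bearish-like inputs without simply duplicating rows.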

Challenge 3: Explainability Can Be Too Technical

SHAP values are mathematically rigorous but visually abstract. Early versions of my dashboard confused users with technical plots. I had to simplify: show the top 3 words driving the prediction, visualize them clearly, and let users drill deeper if they want.
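The "top 3 words" simplification is easy to sketch. The example below assumes you already have a mapping from word to contribution score (for instance, per-token SHAP values). The function name and the sample numbers are hypothetical.

```python
def top_drivers(word_scores, n=3):
    """Return the n words with the largest absolute contribution, strongest first."""
    ranked = sorted(word_scores.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


example_scores = {
    "decline": -0.42,
    "growth": 0.10,
    "headwinds": -0.31,
    "the": 0.01,
    "challenges": -0.05,
}
top = top_drivers(example_scores)
```

Ranking by absolute value keeps strong negative drivers in view, which matters when the prediction being explained is bearish.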

The Broader Lesson: Explainability Changes Everything

What surprised me most wasn't the technical challenge of implementing SHAP—it was realizing that explainability requirements fundamentally changed how I built the entire system.

When you know your predictions will be questioned and scrutinized, you make different design choices:

You prioritize data quality over dataset size
You use ensemble methods or interpretable models instead of pure black boxes
You validate edge cases obsessively
You document assumptions meticulously

This is the hidden benefit of explainability: it forces better engineering practices.

What's Next

The research highlighted several promising directions that point toward the future of financial AI:

Temporal Sentiment Modeling: Understanding how sentiment shifts over time and correlating that with actual market movements. Does sentiment lead price movements, or follow them?

Multimodal Analysis: Combining text sentiment with quantitative financial metrics. A document might express bullish language while reporting declining revenue—which signal matters more?

Fine-Grained Classification: Moving beyond bullish/neutral/bearish to capture nuanced positions. "Cautiously optimistic" is different from "bullish," and traders would benefit from that distinction.

Causal Inference: The ultimate goal—understanding not just that sentiment and prices correlate, but why. Does positive news drive prices up, or do rising prices drive positive news?

The Takeaway

If you're building AI systems for high-stakes domains—finance, healthcare, criminal justice—remember this: a model is not a product until it's explainable.

I could've stopped at 86% accuracy. That would've been publishable. But it would've been useless in practice because traders would never trust it.

The breakthroughs in my system came not from tuning hyperparameters or finding the perfect architecture, but from making the decision to prioritize explainability from day one. SHAP, attention visualization, modular design—these weren't add-ons. They were the foundation.

That's the real lesson from financial sentiment analysis: sometimes the hardest part of building AI isn't making it accurate. It's making humans trust it enough to use it.

Technical Stack Used: TinyBERT, PyTorch, SHAP, Streamlit, NLTK, spaCy, Hugging Face Transformers, SMOTE, Pandas, PyPDF2
GitHub Repository: https://github.com/ademicho123/financial_sentiment_analysis

Building an ML-Powered Trading Bot: From Theory to Production

Ademola Balogun — Tue, 11 Nov 2025 09:56:26 +0000

How I Built a Machine Learning System That Makes Real-Time Trading Decisions

Trading in financial markets is hard. Really hard. The statistics are sobering: most retail traders lose money, and even professional traders struggle to consistently beat the market. But what if we could leverage machine learning to tip the odds in our favor?

Over the past few months, I've built a system that combines MetaTrader 5 with a Flask-based ML prediction server to make real-time trading decisions on gold (XAU/USD). Here's the story of how it came together, the challenges I faced, and the lessons I learned.

The Problem: Too Much Data, Too Little Time

As a trader, you're bombarded with information: price movements, moving averages, volatility indicators, momentum signals, and more. The human brain simply can't process all these variables in real-time while maintaining consistency and discipline.

I needed a system that could:

Process multiple indicators simultaneously
Learn patterns from historical trades
Make decisions in milliseconds
Maintain consistency (no emotional trading)
Adapt to changing market conditions

The Solution: A Two-Part Architecture

I settled on a clean separation of concerns:

MetaTrader 5 Expert Advisor (EA): Handles market data collection, order execution, and risk management. This runs directly in the MT5 terminal with millisecond-level access to price data.

Flask ML Server: A Python-based REST API that loads a trained Random Forest model and serves predictions. This runs as a separate process (or on a different machine) for scalability and easier model updates.

MT5 EA → HTTP Request → Flask Server → ML Model → Prediction → MT5 EA → Trade Decision

The Magic: Feature Engineering

Here's where things get interesting. Raw price data alone isn't enough. The model needs context. I engineered 38 features from just 10 base inputs:

Price Relationships

Instead of just tracking price, I calculate:

Distance from moving averages (as percentage)
Whether price is above/below key EMAs
Price momentum over different periods

Multi-Timeframe Analysis

The system analyzes two timeframes simultaneously:

Short-term EMAs for entry signals
Long-term EMAs for trend confirmation
Cross-timeframe momentum alignment

Volatility Intelligence

ATR (Average True Range) isn't just a number:

ATR as percentage of price (volatility intensity)
Normalized ATR (compared to 20-period average)
Volatility-adjusted price distances
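As a rough illustration, the three ATR-based features above can be computed like this. The function and argument names are assumptions made for the example, not the names used in the repository.

```python
def volatility_features(price, ema, atr, atr_history):
    """Derive ATR-based features from one bar's values.

    atr_history is a list of recent ATR readings, oldest first.
    """
    recent = atr_history[-20:]  # compare against the 20-period average
    avg_atr = sum(recent) / len(recent)
    return {
        "atr_pct": atr / price * 100,          # volatility intensity
        "atr_normalized": atr / avg_atr,       # above 1 means livelier than usual
        "ema_distance_atr": (price - ema) / atr,  # distance measured in ATRs
    }


features = volatility_features(price=2650.0, ema=2640.0, atr=5.0, atr_history=[4.0] * 20)
```

Expressing the distance from the EMA in units of ATR means the same dollar gap reads differently in a quiet market than in a volatile one.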

Interaction Features

This is where ML shines. I created interaction terms:

Trend Strength × Volatility
Price Distance × ATR
EMA Spread × Trend Strength

These capture non-linear relationships that humans struggle to track mentally.

Time Awareness

Markets behave differently at different times:

Cyclical encoding of hour (sin/cos transformation)
Day of week patterns
Session identification (London, New York, Asian)
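The cyclical hour encoding is worth a quick sketch, because it is the least obvious item in that list. Mapping the hour onto a circle keeps 23:00 and 00:00 close together, which a raw 0 to 23 integer does not. The function name here is an assumption.

```python
import math


def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def circular_distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    sin_a, cos_a = encode_hour(hour_a)
    sin_b, cos_b = encode_hour(hour_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)
```

With this encoding, `circular_distance(23, 0)` equals `circular_distance(0, 1)`, so the model no longer sees an artificial jump at midnight.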

The Training Process

Training a trading model is different from typical ML tasks. Here's what I learned:

1. Data Quality is Everything

I started with actual trade data from my backtests:

Each row represents a real trading signal
Outcome: 1 (profitable trade) or 0 (loss)
Features: All market conditions at signal time

2. Class Imbalance is Real

In a typical trading strategy, you might win 55-60% of trades. This creates a class imbalance that can confuse your model. Solution: use class_weight='balanced' in Random Forest to give the minority class more weight.

3. Feature Scaling Matters

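Price might be 2,650, while a percentage-based feature is 0.5. StandardScaler rescales each feature to zero mean and unit variance so that everything sits on a comparable scale. Tree-based models like Random Forest are largely insensitive to feature scale, so this matters most if you later swap in a scale-sensitive model. Either way, the scaler fitted during training has to be the one applied at prediction time.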

4. Overfitting is the Enemy

I deliberately limited tree depth (max_depth=15) and required minimum samples per leaf. This prevents the model from memorizing specific historical scenarios that won't repeat.
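To see what class_weight='balanced' from point 2 does in practice, here is a small sketch of the formula scikit-learn documents for that setting: n_samples / (n_classes * count of that class). The helper name is hypothetical, and the 60/40 split is an example, not the project's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Example: 60 winning trades (1) and 40 losing trades (0).
labels = [1] * 60 + [0] * 40
weights = balanced_class_weights(labels)
```

The rarer class ends up with the larger weight, so mistakes on losing trades cost more during training than mistakes on winners.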

The Production System

Getting ML into production is where most projects die. Here's how I made it work:

Real-Time Prediction API

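from flask import Flask, jsonify, request

app = Flask(__name__)

# model, scaler, and engineer_features are assumed to be loaded or defined
# once at startup (not shown here), so every request reuses them.

@app.route('/predict', methods=['POST'])
def predict():
    # Receive market data from MT5
    data = request.get_json()

    # Engineer features (same as training)
    features = engineer_features(data)

    # Scale features with the scaler fitted during training
    scaled = scaler.transform(features)

    # Get prediction + confidence
    prediction = model.predict(scaled)[0]
    # Probability of class 1 (profitable trade); assumes model.classes_ is [0, 1]
    confidence = model.predict_proba(scaled)[0][1]

    # Only trade if the model predicts a win with at least 60% confidence
    should_trade = (prediction == 1 and confidence >= 0.60)

    return jsonify({
        'prediction': int(prediction),
        'confidence': float(confidence),
        'should_trade': bool(should_trade)
    })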

Confidence Thresholding

This was a game-changer. Instead of taking every prediction, I only execute trades where the model is >60% confident. This dramatically reduced false signals.

High confidence (>75%): "Take this trade now"
Moderate confidence (60-75%): "Decent setup"
Low confidence (<60%): "Skip this one"

Error Handling

Production systems fail. A lot. I built in multiple safety layers:

Validation of all input data types
Handling missing features gracefully
Infinite value replacement
NaN filling with safe defaults
Comprehensive logging
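A minimal sketch of what those safety layers might look like for a single incoming payload is shown below. The function name, the choice of 0.0 as the safe default, and the logger name are assumptions made for illustration, not the repository's code.

```python
import logging
import math

logger = logging.getLogger("prediction_server")


def sanitize_features(raw, expected, default=0.0):
    """Return a dict with every expected feature as a finite float."""
    clean = {}
    for name in expected:
        value = raw.get(name, default)  # missing features fall back to the default
        try:
            value = float(value)  # rejects None, lists, and non-numeric strings
        except (TypeError, ValueError):
            logger.warning("Bad value for %s: %r", name, value)
            value = default
        if not math.isfinite(value):  # catches NaN and +/- infinity
            logger.warning("Non-finite value for %s", name)
            value = default
        clean[name] = value
    return clean
```

Running every request through one function like this keeps the model from ever seeing a value it was not trained to handle, and the log lines tell you which upstream field is misbehaving.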

Real-World Challenges

Challenge 1: Feature Drift

Markets change. A feature that was predictive last month might not work this month. Solution: Regular retraining with recent data.

Challenge 2: Latency

Every millisecond counts in trading. I optimized:

Kept the model loaded in memory (no disk reads)
Used Gunicorn with multiple workers
Minimized feature engineering computation

Challenge 3: The "Works in Backtest" Problem

A model can look amazing on historical data but fail live. Why?

Look-ahead bias in feature engineering
Overfitting to past market conditions
Ignoring transaction costs

I combated this by:

Using only information available at signal time
Testing on out-of-sample data
Including realistic spread/commission
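One simple guard against look-ahead bias is to split the data by time rather than at random, so the test set only contains signals that came after everything the model trained on. The sketch below assumes each row is a dict with a sortable "time" field. The function name and the 80/20 split are illustrative choices.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split rows into (train, test) so every test row is later than every train row."""
    ordered = sorted(rows, key=lambda row: row["time"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Example with ten signals given out of order.
signals = [{"time": t, "outcome": t % 2} for t in [5, 1, 9, 3, 7, 0, 8, 2, 6, 4]]
train, test = chronological_split(signals)
```

A random shuffle would let the model train on trades from next week and then be "tested" on trades from last week, which is exactly the kind of leak that makes backtests look better than live results.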

Performance Metrics That Matter

Forget about 90% accuracy. Here's what actually matters:

Risk-Adjusted Return: Are you making money after accounting for risk?

Win Rate × Average Win vs. Loss Rate × Average Loss: A 45% win rate can be profitable if your wins are 2x your losses.

Maximum Drawdown: How much can you lose before recovery? This determines position sizing.

Sharpe Ratio: Return per unit of volatility. Higher is better.
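These metrics are simple enough to sketch in a few lines. The versions below are simplified: expectancy is expressed per trade in units of risk, the Sharpe ratio is per period with no annualisation, and the function names are assumptions made for this example.

```python
import statistics


def expectancy(win_rate, avg_win, avg_loss):
    """Expected profit per trade: win rate x average win minus loss rate x average loss."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)
```

With a 45% win rate and wins twice the size of losses, `expectancy(0.45, 2.0, 1.0)` comes out positive at about 0.35 units of risk per trade, which is why win rate alone is a misleading headline number.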

Lessons Learned

1. More Features ≠ Better Model

I started with 100+ features. Performance was worse. Why? Noise. More features mean more chances to overfit. I trimmed to 38 carefully selected features.

2. Domain Knowledge > Fancy Algorithms

Understanding why a feature works matters more than using the latest deep learning architecture. A Random Forest trained on meaningful features beats an LSTM trained on raw prices.

3. Production is Different

What works in a Jupyter notebook often breaks in production:

Memory leaks
Threading issues
Serialization problems
API timeout handling

4. Keep It Simple

My first version had ensemble models, feature selection algorithms, and hyperparameter optimization pipelines. All unnecessary. A well-tuned Random Forest on good features works great.

The Code: Open Source

I've made the entire system available on GitHub. It includes:

Training script with feature engineering
Flask prediction server
Model serialization
Production deployment guide

The system is modular: use the ML component with any trading platform, or swap in your own model while keeping the infrastructure.

Future Improvements

Where can this go next?

Reinforcement Learning: Train an agent to optimize entry/exit timing, not just signal classification.

Ensemble Models: Combine multiple models with different strengths (trend following + mean reversion).

Online Learning: Update the model continuously with new trade results.

Multi-Asset Expansion: Apply the same framework to forex, indices, and crypto.

Risk-Adjusted Position Sizing: Let the model suggest position size based on confidence.

Ethical Considerations

Before you rush to deploy this:

This is NOT a get-rich-quick scheme. Trading is risky. This system improves odds but doesn't eliminate risk.

Markets adapt. What works today may not work tomorrow. Continuous monitoring and retraining are essential.

Technology isn't magic. ML can find patterns, but it can't predict black swan events or market crashes.

Risk management is paramount. Never risk more than you can afford to lose. Use stop losses. Manage position sizing.

Conclusion

Building an ML trading system taught me more about production ML than any tutorial could. The challenges of real-time prediction, model deployment, error handling, and continuous monitoring are universal to any ML product.

The financial domain adds extra complexity (latency requirements, data quality, market dynamics), but the lessons apply broadly:

Start simple
Focus on features over algorithms
Build for production from day one
Monitor and iterate continuously
Respect the domain expertise

Whether you're interested in algorithmic trading or production ML systems, I hope this journey inspires your next project. The code is open source, the architecture is battle-tested, and the patterns are reusable.

Remember: In trading, as in software, there's no substitute for testing in real conditions. Paper trade first, start small, and let the data guide your decisions.

Try It Yourself

The complete system is available on GitHub: https://github.com/ademicho123/gold_trader

Requirements:

Python 3.8+
MetaTrader 5 terminal
Historical trading data

Clone, train, deploy, and start experimenting. And when you make improvements (you will), consider contributing back to the project.

Disclaimer: This article is for educational purposes only. Trading carries significant risk. Always do your own research and never risk more than you can afford to lose. Past performance does not guarantee future results.

Large Language Models in Financial Content Generation: Challenges and Innovative Solutions

Ademola Balogun — Sun, 26 Oct 2025 08:53:29 +0000

Introduction

The financial technology landscape is undergoing a radical transformation, driven by the emergence of large language models (LLMs). As the founder of Trading Flashes, I've pioneered the integration of advanced AI technologies to generate sophisticated financial content. This article delves into the technical challenges and innovative solutions in applying LLMs to financial content generation.

The Complexity of Financial Language

Financial communication is uniquely challenging:

Domain-Specific Vocabulary: Requires precise technical terminology
Nuanced Contextual Understanding: Interpreting complex market dynamics
Balancing Objectivity and Insight: Providing valuable analysis without bias
Rapidly Changing Contextual Landscape: Adapting to real-time market shifts

Technical Architecture of Financial Content Generation

Prompt Engineering Strategy

from typing import List, Dict
import together

class FinancialContentGenerator:
    def __init__(self, api_key: str, model: str):
        self.client = together.Together(api_key=api_key)
        self.model = model
        self.markets = ['forex', 'crypto', 'stocks', 'commodities']

    def generate_market_summary(self, market_data: Dict) -> str:
        """
        Generate a comprehensive market summary using advanced prompt engineering

        Args:
            market_data (Dict): Comprehensive market data dictionary

        Returns:
            str: AI-generated market analysis
        """
        # Construct multi-stage prompt for nuanced analysis
        prompt = f"""
        You are a senior financial analyst providing a professional market summary.
        Context:
        - Current market conditions
        - Historical price trends
        - Significant economic indicators

        Market Data Overview:
        {self._format_market_data(market_data)}

        Guidelines for Analysis:
        1. Provide objective, fact-based insights
        2. Highlight key trends and potential market movements
        3. Maintain a professional, measured tone
        4. Include potential risk factors

        Generate a comprehensive market summary focusing on:
        - Key price movements
        - Underlying economic drivers
        - Short-term market outlook
        """

        response = self.client.completions.create(
            model=self.model,
            prompt=prompt,
            max_tokens=500,
            temperature=0.3,  # Lower temperature for more deterministic output
            top_p=0.9
        )

        return response.choices[0].text

    def _format_market_data(self, market_data: Dict) -> str:
        """
        Format market data for optimal model consumption

        Args:
            market_data (Dict): Raw market data

        Returns:
            str: Formatted market data string
        """
        formatted_data = []
        for market in self.markets:
            if market in market_data:
                market_summary = f"{market.upper()} Market:\n"
                for key, value in market_data[market].items():
                    market_summary += f"- {key}: {value}\n"
                formatted_data.append(market_summary)

        return "\n\n".join(formatted_data)
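To see what the model actually receives, it helps to run the formatting step on its own. The snippet below is a minimal sketch: `format_market_data` is a standalone copy of the method's logic (with slightly tidier spacing) so it runs without a Together client, and the sample values are made up for illustration.

```python
from typing import Dict, List

MARKETS: List[str] = ["forex", "crypto", "stocks", "commodities"]


def format_market_data(market_data: Dict[str, Dict[str, object]]) -> str:
    """Standalone copy of the formatting logic, for illustration only."""
    sections = []
    for market in MARKETS:
        if market in market_data:
            lines = [f"{market.upper()} Market:"]
            lines += [f"- {key}: {value}" for key, value in market_data[market].items()]
            sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Hypothetical sample values, not real market data
sample = {
    "crypto": {"BTC/USD": 64000, "24h_change_pct": -1.2},
    "forex": {"EUR/USD": 1.08},
    "bonds": {"US10Y": 4.3},  # not in MARKETS, so it is silently skipped
}

formatted = format_market_data(sample)
print(formatted)
```

Two things are worth noticing: sections come out in the order of `MARKETS`, not the order of the input dictionary, and any market that is not in that list never reaches the prompt.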

Bias Mitigation Techniques

class BiasMonitor:
    @staticmethod
    def detect_potential_bias(generated_content: str) -> Dict[str, float]:
        """
        Analyze generated content for potential biases

        Args:
            generated_content (str): AI-generated financial content

        Returns:
            Dict[str, float]: Bias probability scores
        """
        bias_metrics = {
            'market_sentiment_skew': 0.0,
            'repetitive_language': 0.0,
            'overly_positive_tone': 0.0
        }

        # Simplified example: flag text whose unique-word ratio falls below 0.7.
        # A production system would use more sophisticated detection.
        words = generated_content.split()
        if words and len(set(words)) / len(words) < 0.7:
            bias_metrics['repetitive_language'] = 0.6

        return bias_metrics
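The monitor above only fills in one of its three metrics. As a rough sketch of how a second one could work, the function below scores tone by counting sentiment words. The word lists and the `score_bias` name are assumptions made for this example, not part of any library, and a real system would need a much larger lexicon or a trained classifier.

```python
import re
from typing import Dict, Set

# Hypothetical word lists chosen for illustration; a real lexicon would be larger
POSITIVE_WORDS: Set[str] = {"surge", "rally", "gain", "gains", "bullish", "soar", "strong"}
NEGATIVE_WORDS: Set[str] = {"drop", "fall", "loss", "losses", "bearish", "slump", "weak"}


def score_bias(text: str) -> Dict[str, float]:
    """Return simple repetition and tone scores, each between 0 and 1."""
    words = re.findall(r"[a-z']+", text.lower())
    metrics = {"repetitive_language": 0.0, "overly_positive_tone": 0.0}
    if not words:
        return metrics  # nothing to score, and no division by zero

    # Share of words that are repeats: 0 means every word is unique
    unique_ratio = len(set(words)) / len(words)
    metrics["repetitive_language"] = round(1.0 - unique_ratio, 3)

    # Share of sentiment-bearing words that are positive
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    if positive + negative > 0:
        metrics["overly_positive_tone"] = round(positive / (positive + negative), 3)
    return metrics


print(score_bias("Stocks rally and gains look strong"))
```

A tone score near 0.5 means positive and negative terms are roughly balanced. A score close to 1.0 is a signal to review the summary before it goes out.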

Advanced Model Selection Strategies

Model Evaluation Framework

from typing import Dict, List

from together import Together

class ModelEvaluator:
    def __init__(self, models: List[str]):
        self.models = models

    def compare_model_performance(self, test_prompts: List[str]) -> Dict[str, float]:
        """
        Compare different LLM models for financial content generation

        Args:
            test_prompts (List[str]): Standardized evaluation prompts

        Returns:
            Dict[str, float]: Performance scores for each model
        """
        performance_scores = {}

        for model in self.models:
            model_performance = self._evaluate_single_model(model, test_prompts)
            performance_scores[model] = model_performance

        return performance_scores

    def _evaluate_single_model(self, model: str, test_prompts: List[str]) -> float:
        """
        Evaluate a single model's performance

        Scoring criteria:
        - Factual accuracy
        - Contextual relevance
        - Linguistic quality
        """
        # Implement multi-dimensional evaluation logic
        return 0.85  # Placeholder performance score
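The placeholder score leaves the interesting part open. One way to fill it in, sketched here under assumptions, is to pass the generation and scoring steps in as functions and average the scores per model. The names `evaluate_models`, `fake_generate`, and `keyword_score` are hypothetical, and the keyword overlap metric is a toy stand-in for real accuracy and relevance checks.

```python
from statistics import mean
from typing import Callable, Dict, List

GenerateFn = Callable[[str, str], str]  # (model, prompt) -> generated text
ScoreFn = Callable[[str, str], float]   # (prompt, output) -> score in [0, 1]


def evaluate_models(
    models: List[str],
    test_prompts: List[str],
    generate: GenerateFn,
    score: ScoreFn,
) -> Dict[str, float]:
    """Average the score of every prompt for each model."""
    results: Dict[str, float] = {}
    for model in models:
        scores = [score(prompt, generate(model, prompt)) for prompt in test_prompts]
        results[model] = mean(scores) if scores else 0.0
    return results


def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for a real API call so the example runs offline."""
    return f"[{model}] summary of {prompt}"


def keyword_score(prompt: str, output: str) -> float:
    """Toy relevance score: fraction of prompt words that reappear in the output."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    output_words = set(output.lower().split())
    return len(prompt_words & output_words) / len(prompt_words)


scores = evaluate_models(
    ["model-a", "model-b"], ["gold prices today"], fake_generate, keyword_score
)
print(scores)
```

Keeping `generate` and `score` as parameters means you can swap in a real client call and a stricter metric later without touching the comparison loop.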

Performance and Optimization Strategies

Caching Mechanisms: Implement intelligent caching to reduce API calls
Asynchronous Processing: Utilize concurrent processing for multiple market analyses
Continuous Model Fine-Tuning: Regularly update models with recent financial data
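The first two strategies fit together naturally. Here is a minimal sketch, assuming an in-memory cache is acceptable and that your generation call can be wrapped in an async function. `CachedSummarizer` and `fake_generate` are hypothetical names, and the fake generator stands in for a real API call.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict, List

SummaryFn = Callable[[str, str], Awaitable[str]]  # (model, prompt) -> text


class CachedSummarizer:
    def __init__(self, generate: SummaryFn):
        self._generate = generate
        self._cache: Dict[str, str] = {}
        self.api_calls = 0  # counts cache misses, i.e. real calls made

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash model and prompt together so the same prompt on another model misses
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    async def summarize(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = await self._generate(model, prompt)
        return self._cache[key]

    async def summarize_many(self, model: str, prompts: List[str]) -> List[str]:
        # Run all prompts concurrently. Identical prompts inside one batch
        # can still trigger duplicate calls, because neither has finished
        # and populated the cache when the other one checks it.
        return list(await asyncio.gather(*(self.summarize(model, p) for p in prompts)))


async def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for a real API call so the example runs offline."""
    await asyncio.sleep(0.01)
    return f"summary for {prompt}"


async def main():
    summarizer = CachedSummarizer(fake_generate)
    first = await summarizer.summarize_many("demo-model", ["forex", "crypto"])
    second = await summarizer.summarize_many("demo-model", ["forex", "crypto"])
    return first, second, summarizer.api_calls


first, second, calls = asyncio.run(main())
print(calls)
```

The second batch is served entirely from the cache. For market data that changes through the day you would also want an expiry time on each entry, which this sketch leaves out.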

Ethical Considerations in AI-Generated Financial Content

Critical ethical guidelines:

Transparency: Clear labeling of AI-generated content
Disclaimer Integration: Highlighting the speculative nature of predictions
Avoiding Market Manipulation: Generating objective, balanced insights
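The first two guidelines are easy to enforce in code rather than leaving them to whoever publishes the summary. The sketch below wraps every generated summary with a label and a disclaimer. The wording of `AI_LABEL` and `DISCLAIMER` and the `add_disclosures` name are assumptions for illustration, so check the exact text against your own legal requirements.

```python
from datetime import date

# Hypothetical wording; adapt to your own compliance requirements
AI_LABEL = "[AI-generated content]"
DISCLAIMER = (
    "This summary is for information only and is not financial advice. "
    "Any outlook it contains is speculative."
)


def add_disclosures(summary: str, generated_on: date) -> str:
    """Prefix the summary with an AI label and date, and append the disclaimer."""
    body = summary.strip()
    header = f"{AI_LABEL} Generated on {generated_on.isoformat()}"
    return f"{header}\n\n{body}\n\n{DISCLAIMER}"


labelled = add_disclosures("  Gold traded in a narrow range.  ", date(2026, 1, 15))
print(labelled)
```

Calling this at the single point where summaries leave your system means no summary can be published without both pieces attached.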

Conclusion

Integrating large language models into financial content generation takes more than a single API call. Structured prompts, bias checks on the generated text, systematic model comparison, and a commitment to ethical practices such as clear labeling and disclaimers together make the output more reliable and easier to trust.

About the Author

Ademola Balogun is the founder and CEO of 180GIG Ltd, creators of Squrrel—an AI-powered interview platform that makes hiring smarter and more equitable. With an MSc in Data Science from Birkbeck, University of London, he specializes in building practical AI solutions for real-world problems. He also created Trading Flashes ⚡, an AI-driven newsletter platform for financial markets.

Key Takeaways:

Large language models require sophisticated engineering for financial applications
Bias detection and mitigation are crucial
Ethical considerations are paramount in AI-driven financial communication