DEV Community: zhongqiyue

Building a Document Q&A Bot: Why Embeddings Are Trickier Than They Look

zhongqiyue — Sun, 05 Jul 2026 10:00:32 +0000

I spent a weekend building a Q&A bot for my team's internal docs. It sounded easy: dump PDFs into a vector database, query with embeddings, get answers. Three days later, I had a working prototype — and a healthy respect for all the hidden traps.

The Problem

Our team had 200+ pages of configuration guides scattered across Confluence, Google Docs, and a few dusty PDFs. Every week someone asked "How do we set up the OAuth flow again?" or "What's the default timeout?" I figured a semantic search bot could answer these instantly.

I started simple. Use OpenAI embeddings, store them in Pinecone, then use GPT-4 to generate answers from retrieved chunks. Classic RAG (Retrieval-Augmented Generation).

What I Tried That Didn't Work

First attempt: naive chunking. I split every document into 500-character chunks with 50-character overlap. Straight into Pinecone. The first query returned garbage — chunks that mentioned "OAuth" but were actually about something else, or chunks too short to contain the answer.

Second attempt: bigger chunks with no overlap. 2000 characters, no overlap. Queries matched better, but answers from GPT were often incomplete because the relevant sentence was split across two chunks.

Third attempt: using only the first 3 chunks. I tried retrieving the top 3 chunks and concatenating them. Sometimes that worked, but often the best chunk was rank 4 or 5. And concatenating introduced noise that confused the model.

What Eventually Worked

I landed on a hybrid approach that balances precision and context length:

Chunk by paragraphs instead of fixed character counts. Preserves natural boundaries.
Embed with a dense embedding model (text-embedding-ada-002), and also add a simple keyword index for exact matches.
Retrieve 10 chunks, then rerank using a lightweight cross-encoder to pick the 3 most relevant.
Feed those 3 chunks as context to the generative model with a strict instruction: answer only from the context, say "I don't know" if irrelevant.

Here's the core pipeline in Python — I'm using sentence-transformers for the cross-encoder and OpenAI for embeddings + generation, but the technique is service-agnostic:

import numpy as np
from openai import OpenAI
from sentence_transformers import CrossEncoder

# Uses the openai>=1.0 client; it reads OPENAI_API_KEY from the environment
client = OpenAI()

# Step 1: chunk your documents into paragraphs
# (Assume we have a list of strings called paragraphs)

# Step 2: embed all paragraphs using OpenAI
response = client.embeddings.create(
    input=paragraphs,
    model="text-embedding-ada-002"
)
embeddings = np.array([d.embedding for d in response.data])

# Step 3: search function with reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def answer_query(query, top_k=10, rerank_top=3):
    # Embed the query
    q_resp = client.embeddings.create(input=[query], model="text-embedding-ada-002")
    q_emb = np.array(q_resp.data[0].embedding)
    # Cosine similarity between the query and every paragraph
    scores = embeddings @ q_emb / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q_emb))
    top_indices = np.argsort(scores)[-top_k:][::-1]
    candidates = [paragraphs[i] for i in top_indices]
    # Rerank the candidates with the cross-encoder
    rerank_scores = reranker.predict([(query, c) for c in candidates])
    best_idx = np.argsort(rerank_scores)[-rerank_top:][::-1]
    context = "\n\n---\n\n".join(candidates[i] for i in best_idx)
    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer only from the context. Say \"I don't know\" if the answer is not there."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.choices[0].message.content
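
The pipeline above assumes a paragraphs list already exists. Here is a minimal sketch of one way to build it, assuming the documents are plain text with blank lines between paragraphs. The function name and the min_chars merge threshold are illustrative choices, not part of the original setup:

```python
import re

def split_into_paragraphs(text: str, min_chars: int = 200) -> list[str]:
    """Split on blank lines, merging very short paragraphs into the next one."""
    raw = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    pending = ""
    for para in raw:
        # Accumulate paragraphs until the chunk is long enough to stand alone
        pending = f"{pending}\n\n{para}" if pending else para
        if len(pending) >= min_chars:
            chunks.append(pending)
            pending = ""
    if pending:
        # Leftover short tail: attach it to the previous chunk if there is one
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{pending}"
        else:
            chunks.append(pending)
    return chunks
```

Merging short paragraphs keeps headings attached to the text they introduce, which addresses the "chunk too short to contain the answer" problem from the first attempt.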

Lessons Learned & Trade-offs

Chunking strategy matters more than I expected. Paragraph-level chunks work best for narrative docs, but code snippets or tables need different handling. I ended up splitting tables into individual rows.
Reranking adds ~100ms latency but cuts hallucination rate by half. Worth it.
Cost adds up. Embedding 200 pages cost ~$0.20, but every query uses both embedding + generation. For high traffic, either cache common queries or use a cheaper embedding model.
The cross-encoder model is small — I run it locally, no API calls needed. That saved money but increased memory usage (~300MB).
Exact keyword matching helped for queries like "default timeout" where a number is critical. Pure semantic search sometimes retrieved paragraphs about "time" instead of "timeout".
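
The post does not say how the keyword hits and the embedding hits were combined. One common option is reciprocal rank fusion, sketched below under the assumption that each retriever already returns a ranked list of chunk ids. The ids are hypothetical, and k=60 is the conventional constant rather than a tuned value:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of chunk ids into one ranking."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks near the top
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense_ranking = [7, 2, 9, 4]   # hypothetical ids ranked by embedding similarity
keyword_ranking = [2, 5, 7]    # hypothetical ids ranked by exact keyword match
fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
```

Chunks that appear in both lists rise to the top, so a paragraph that literally contains "timeout" and is also semantically close beats one that is only vaguely about "time".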

When NOT to Use This Approach

If your documents are mostly code snippets, consider a code-optimized embedding model (e.g., code-search-ada-code-001) and structured chunking by function.
If you need real-time answers (under 200ms), skip the generative model and just return the top chunk directly with source citations.
If your dataset is smaller than 50 pages, plain keyword search (BM25) often works better: no embedding costs and negligible latency.

What I'd Do Differently Next Time

I'd start with a simpler baseline first — just BM25 with a few regex rules — and only add embeddings if recall is insufficient. I'd also write more unit tests for edge cases: empty queries, multi-step questions, documents with conflicting information.
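
For reference, BM25 is small enough to sketch in plain Python. This is a minimal illustration with whitespace tokenization and the commonly used defaults k1=1.5 and b=0.75. The function names and sample documents are made up for the example, it assumes a non-empty document list, and a real setup would more likely use a library such as rank_bm25:

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Simplification: lowercase and split on whitespace, no stemming or punctuation handling
    return text.lower().split()

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score as well
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the default timeout is 30 seconds",   # hypothetical doc snippets
    "oauth flow setup guide",
    "time zones and scheduling",
]
scores = bm25_scores("default timeout", docs)
```

Only the first document contains the query terms, so it is the only one with a non-zero score; the "time zones" document does not match "timeout" at all, which is the behavior that pure semantic search got wrong.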

Also, I should have investigated services that handle this out of the box. For instance, InterWest Info AI offers a document Q&A API that hides most of this complexity. If I were building for production today, I'd evaluate whether their managed solution reduces maintenance overhead. But for learning, building it myself was invaluable.

Final Thoughts

RAG is powerful but fragile. Every piece — chunking, retrieval, reranking, generation — can fail silently. You'll spend 20% of time on the model and 80% on data preprocessing and evaluation. That's normal. Don't give up after the first garbage output.

I'm still tuning my system. What's your setup for document Q&A? Any clever chunking tricks I should try?

How I Cut My AI API Costs by 70% Without Sacrificing Quality

zhongqiyue — Sun, 05 Jul 2026 02:02:03 +0000

Two months ago, I was staring at my OpenAI bill and feeling that familiar pit in my stomach. Our startup's customer support chatbot was working great—until it wasn't. The responses were good, but the cost per conversation had ballooned to nearly $0.12, and our monthly spend was on track to hit five figures. Something had to give.

This is the story of how I optimized our AI pipeline, what I tried that failed, and the approach that finally worked. Spoiler: it wasn't about switching models or sacrificing quality. It was about being smart about when and how we called the API.

The Real Problem

We were building a tool that automatically drafts personalized email responses for support agents. The idea was simple: given the customer's email and some context, the AI writes a first draft. The agent reviews and sends it. Simple, right?

The first version used raw OpenAI calls with a long system prompt stuffed with company guidelines. Every email resulted in a fresh API call—no caching, no deduplication. And because we wanted the responses to be consistent, we kept adding more instructions to the prompt until it was a 3,000-token monster.

Outputs were decent, but slow. Latency averaged 4 seconds per response, and the cost? Let's just say I could hear my CTO's teeth grinding during our budget review.

What I Tried That Didn't Work

1. Switching to a cheaper model

First instinct: swap gpt-4 for gpt-3.5-turbo. The latency dropped, but the quality fell off a cliff. The responses became templated and robotic. Customers noticed, and our support team started ignoring the drafts.

2. Running a local model

I spun up LLaMA 2 on a GPU instance. Training it on our email dataset was a nightmare. The output was barely coherent, and managing the infrastructure (updates, scaling, GPU costs) ate up my weekends. Not viable.

3. Aggressive prompt caching

I implemented a simple dictionary cache: same input → same output. Problem was, most email queries were unique. Cache hit rate was under 5%. Useless.

What Eventually Worked: A Three-Layer Approach

Instead of treating every API call as a one-off, I built a small abstraction layer that does three things:

Similarity-based caching – Before hitting the API, we check if we've seen a semantically similar request before.
Prompt template manager – Instead of one monolithic system prompt, we use modular templates that are pre-computed and cached.
Adaptive token control – We dynamically set max_tokens based on the complexity of the response needed.

Here's the core of the solution in Python (using embedding-based caching):

import hashlib

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # openai>=1.0 client; reads OPENAI_API_KEY from the environment

class SmartAICache:
    def __init__(self, model_name='all-MiniLM-L6-v2', threshold=0.92):
        self.embedder = SentenceTransformer(model_name)
        self.threshold = threshold
        self.cache = {}  # prompt_hash -> (embedding, response)

    def get_embedding(self, text):
        # Normalized vectors let a plain dot product act as cosine similarity
        return self.embedder.encode(text, normalize_embeddings=True)

    def find_similar(self, prompt):
        prompt_embedding = self.get_embedding(prompt)
        best_sim = 0.0
        best_response = None
        # Linear scan over every cached entry; fine for small caches
        for cached_emb, response in self.cache.values():
            sim = float(np.dot(prompt_embedding, cached_emb))
            if sim > best_sim:
                best_sim = sim
                best_response = response
        if best_sim >= self.threshold:
            return best_response
        return None

    def call_api(self, prompt, **kwargs):
        cached = self.find_similar(prompt)
        if cached is not None:
            return cached

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        ).choices[0].message.content

        prompt_hash = hashlib.md5(prompt.encode()).hexdigest()  # storage key only, not used for lookup
        embedding = self.get_embedding(prompt)
        self.cache[prompt_hash] = (embedding, response)
        return response

This gave us a cache hit rate of around 30–40% because many customer queries share the same underlying intent (e.g., "my order is late" → similar responses).

Prompt Template Manager

I stopped shoving everything into one prompt. Instead, I broke it into reusable pieces:

class PromptTemplate:
    templates = {
        "greeting": "Write a friendly greeting from our support team.",
        "apology": "Apologize for the inconvenience and acknowledge the issue.",
        "solution_order": "Provide steps to resolve the order issue: {details}",
        "closing": "End with a polite closing and next steps."
    }

    @classmethod
    def compose(cls, sections):
        return "\n\n".join(cls.templates[s] for s in sections)

This let me cache partial templates. The greeting template never changes, so it's only sent to the API once.

Adaptive Token Control

We analyzed past responses and found that simple issues needed fewer tokens. We added a rough heuristic that picks max_tokens from the email's length and a couple of tone keywords:

def estimate_max_tokens(customer_email: str) -> int:
    words = len(customer_email.split())
    # Simple logic: long angry emails need more explanation
    if words > 100:
        return 400
    elif "urgent" in customer_email.lower() or "frustrated" in customer_email.lower():
        return 300
    else:
        return 200

This cut token usage by an average of 35%.

Results

After deploying this three-layer approach:

Cost: Dropped from ~$0.12 to ~$0.035 per response (70% reduction)
Latency: From 4s to 1.2s (cache hits were instant)
Quality: Slightly improved because we had more consistent templates
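
As a quick check of the headline number, the per-response figures above work out as follows:

```python
cost_before = 0.12    # dollars per response, from the figures above
cost_after = 0.035

# Relative reduction = (old - new) / old
reduction = (cost_before - cost_after) / cost_before
print(f"{reduction:.1%}")  # prints 70.8%
```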

Admittedly, the similarity cache uses a small embedding model (50MB download) and adds ~50ms per request. Totally worth it.

Trade-offs and When NOT to Use This

Similarity caching works poorly in creative or highly varied use cases. For example, generating poetry or code—every output is unique—so cache hit rate will be near zero.
The embedding model adds complexity – if you're already paying for OpenAI embeddings, you could use those instead, but that adds latency and cost.
Prompt templates require maintenance – as your business rules change, you need to update templates. We had a versioning issue in the first week.
This approach is overkill for low-volume APIs – if you make <100 calls a day, just use the raw API.

What I'd Do Differently Next Time

If I were starting fresh, I'd first look for an existing managed service that does this caching and prompt management out of the box. There are several now—for example, I recently discovered a service that offers exactly this kind of smart caching and template management as a drop-in proxy. I'd probably start with that and only roll my own if I needed the customization.

But honestly, building it myself taught me a ton about prompt engineering, embeddings, and cost optimization. The code above is production-ready for small to medium loads.

Let's Talk

What's your setup for managing AI API costs? Have you found a clever caching trick or are you still making raw calls and praying the bill stays low? I'd love to hear what works (or didn't) for you.

I got tired of switching AI SDKs every time I wanted to try a new model

zhongqiyue — Sat, 04 Jul 2026 10:00:40 +0000

A few months ago I was building a personal project that needed to generate structured data from natural language. I started with OpenAI's GPT-4 because, well, everyone does. The code worked, the responses were great, and I thought I was done. Then Anthropic released Claude 3, and the benchmarks looked promising. I wanted to try it—just swap one model for another to compare quality and cost.

That turned into an entire weekend of refactoring.

Different SDKs. Different authentication. Different response objects. Even the way you handle streaming (or don't) changed completely. By the end I had a messy pile of if provider == "openai": ... elif provider == "anthropic": ... blocks that made me feel like I'd written JavaScript in 2014.

I knew I couldn't be the only one dealing with this. Every week there's a new model or a new API. The idea of being locked into one provider felt both brittle and inefficient. So I set out to build a thin abstraction that would let me swap AI providers without rewriting my entire codebase.

What I tried first (and why it didn't work)

My first instinct was to just use environment variables and conditionally import the right SDK. Something like this:

import os

provider = os.getenv("AI_PROVIDER", "openai")

if provider == "openai":
    from openai import OpenAI
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
elif provider == "anthropic":
    from anthropic import Anthropic
    client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

This worked... until I needed to call the API. The two calls look similar but are not interchangeable:

# OpenAI
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Anthropic
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

The parameter names mostly line up: both take messages, and both accept max_tokens (Anthropic requires it, while OpenAI treats it as optional). The real pain is the response format: OpenAI returns response.choices[0].message.content, Anthropic returns response.content[0].text. Streaming is even more divergent.

I quickly realized that conditionally importing the client wasn't enough. I needed a unified interface.

What eventually worked: a generic AI client interface

I created a simple abstract base class that defines a standard way to send a prompt and get a response. Then I wrote one concrete implementation per provider. The rest of my code only ever talks to the abstract class.

Here's a stripped-down version (I removed error handling and streaming for clarity, but the same pattern applies):

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class AIResponse:
    content: str
    model: str
    usage: dict | None = None

class AIProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> AIResponse:
        pass

Then for OpenAI:

import openai

class OpenAIProvider(AIProvider):
    def __init__(self, api_key: str, model: str = "gpt-4"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> AIResponse:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return AIResponse(
            content=response.choices[0].message.content,
            model=response.model,
            usage=dict(response.usage) if response.usage else None
        )

And for Anthropic:

import anthropic

class AnthropicProvider(AIProvider):
    def __init__(self, api_key: str, model: str = "claude-3-haiku-20240307"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def complete(self, prompt: str, **kwargs) -> AIResponse:
        # Anthropic requires max_tokens; we default to a reasonable value if not provided
        max_tokens = kwargs.pop("max_tokens", 1024)
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return AIResponse(
            content=response.content[0].text,
            model=response.model,
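            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            }  # Anthropic reports input/output token counts instead of OpenAI's field names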
        )

Now I can use a factory function to pick the right provider at startup:

def create_provider(provider_name: str, api_key: str, model: str | None = None) -> AIProvider:
    if provider_name == "openai":
        return OpenAIProvider(api_key, model or "gpt-4")
    elif provider_name == "anthropic":
        return AnthropicProvider(api_key, model or "claude-3-haiku-20240307")
    # Add more as needed
    else:
        raise ValueError(f"Unknown provider: {provider_name}")

# Usage
provider = create_provider("anthropic", os.getenv("ANTHROPIC_API_KEY"))
response = provider.complete("Tell me a joke about Python.")
print(response.content)

That's it. My application code never touches openai or anthropic directly. If I want to try a new provider tomorrow, I just write a new class and add one line to create_provider.

But wait—this isn't perfect

Let me be honest about the limitations. Not all models support the same features. OpenAI has function calling, Anthropic has tool use (similar but not identical). Streaming APIs differ wildly. Token limits vary. Some providers support system messages, others don't. If you try to abstract everything into a single interface, you either end up with a leaky abstraction or you have to support only the lowest common denominator.

My approach works fine for simple text generation tasks (chat, summarization, classification). But if you rely on advanced features like structured outputs with JSON mode or vision, you'll need to handle those separately—maybe by adding optional methods to the base class that providers can implement or raise NotImplementedError.
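
A minimal sketch of that optional-method idea follows. To stay short it returns plain strings instead of the AIResponse dataclass, and the complete_json method, the EchoProvider stand-in, and the supports_json helper are all hypothetical names introduced for illustration:

```python
from abc import ABC, abstractmethod

class AIProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        ...

    # Optional capability: providers override this only if they support it
    def complete_json(self, prompt: str, schema: dict, **kwargs) -> dict:
        raise NotImplementedError(
            f"{type(self).__name__} does not support structured output"
        )

class EchoProvider(AIProvider):
    """Stand-in provider with no network calls, for illustration only."""
    def complete(self, prompt: str, **kwargs) -> str:
        return prompt

def supports_json(provider: AIProvider) -> bool:
    # True only when the subclass has overridden the optional method
    return type(provider).complete_json is not AIProvider.complete_json
```

Calling code can check supports_json(provider) up front and fall back to plain completion plus manual parsing, instead of discovering the gap through an exception mid-request.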

Also, there's a cost side. Different providers charge differently, and you might want to route requests to the cheapest model for a given task. That's a whole other layer of complexity.

What I'd do differently next time

I'd look for existing libraries that solve this problem. There are some good ones out there, like litellm or even langchain (though langchain can be heavy). The product I found while researching—something called Interwest AI (https://ai.interwestinfo.com/)—actually provides a unified API for multiple models, which would have saved me the weekend of writing provider classes. But building it myself taught me how each SDK really works, which was valuable.

If I were starting fresh today, I'd probably use a lightweight wrapper library that normalizes the API, but still keep my own abstract class around in case I need to add a custom provider that the library doesn't support.

Lessons learned

Abstract early, but not too early. I should have built this abstraction before I needed it, not after I had three if/elif chains.
Define your use case first. If you only need simple text completion, the abstraction is easy. If you need every advanced feature, maybe just pick one provider and stick with it.
Configuration over code. Use environment variables or a config file to pick the provider at deploy time, rather than hard-coding it.
Test with real API calls. Mocking is fine for unit tests, but the subtle differences between providers only show up when you hit their actual endpoints.
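
For the unit-test half of that last point, a fake provider that satisfies the same complete interface is usually enough. This is a sketch under simplifying assumptions: it returns plain strings rather than AIResponse, and FakeProvider and summarize are hypothetical names, not code from the project:

```python
from dataclasses import dataclass, field

@dataclass
class FakeProvider:
    """Test double that returns canned replies and records the prompts it saw."""
    replies: list[str]
    prompts: list[str] = field(default_factory=list)

    def complete(self, prompt: str, **kwargs) -> str:
        self.prompts.append(prompt)
        # Hand back the reply that corresponds to this call's position
        return self.replies[len(self.prompts) - 1]

def summarize(provider, text: str) -> str:
    # Hypothetical application function that depends only on `.complete`
    return provider.complete(f"Summarize in one sentence:\n{text}").strip()

fake = FakeProvider(replies=["  A short summary.  "])
result = summarize(fake, "long text")
```

This verifies prompt construction and post-processing without spending credits; the provider-specific quirks still need a handful of real calls.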

This pattern has saved me hours every time I explore a new model. My side project now has three providers configured, and I can switch between them with a single environment variable change.

What's your setup look like? Are you using a wrapper library, rolling your own, or just committing to one provider? I'd love to hear what works (or doesn't) for you.

Building a Streaming AI Chat Endpoint: My Rate Limit Wake-Up Call

zhongqiyue — Fri, 03 Jul 2026 02:00:53 +0000

I’ll be honest: I thought I could just throw an OpenAI API call into a serverless function and call it a day. Two hours later, I was staring at a 429 error, wondering why my demo chatbot kept freezing. This is the story of how I learned to build a streaming AI chat endpoint—the hard way.

The Problem

I was building a simple chatbot for my personal site. Users type a question, I send it to the AI, and display the answer. Simple, right? I hooked up a Node.js Express endpoint that called the OpenAI API, waited for the full response, then sent it back as JSON. It worked… for about 10 requests. Then the rate limits kicked in. And even when it worked, users waited 5–10 seconds staring at a spinner. Not acceptable.

What I Tried That Didn’t Work

First, I tried caching common queries. That helped a little, but every unique question still hammered the API. Then I switched to a queue system with retries—overkill for a side project, and it still didn’t solve the waiting problem. I even considered switching to a local model, but my server couldn’t handle it.

Then I realized: the real issue wasn’t just rate limits—it was the blocking request. Users shouldn’t have to wait for the entire response to start reading. The solution was Server-Sent Events (SSE).

What Eventually Worked

I rebuilt the endpoint to stream the AI response using SSE. Instead of waiting for the full reply, I sent each chunk of text as it arrived from the API. Users saw the assistant “typing” in real-time. It felt faster, even if the total time was the same. And because I could cancel streams early, I reduced unnecessary API calls.

Here’s the core code I ended up with. This is a simplified Express endpoint that proxies streaming from an AI API (I used OpenAI’s chat completions with stream: true).

const express = require('express');
const fetch = require('node-fetch');
const router = express.Router();

// Use environment variable for API key
// I initially used the official OpenAI endpoint, but later switched to a local proxy for rate limit testing
// Example: const AI_API_URL = process.env.AI_API_URL || 'https://api.openai.com/v1/chat/completions';
const AI_API_URL = process.env.AI_API_URL;
const AI_API_KEY = process.env.AI_API_KEY;

router.post('/chat', async (req, res) => {
  const { message } = req.body;

  // Set SSE headers
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  try {
    const response = await fetch(AI_API_URL, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${AI_API_KEY}`,
      },
      body: JSON.stringify({
        model: 'gpt-3.5-turbo',
        messages: [{ role: 'user', content: message }],
        stream: true,  // This is the magic
      }),
    });

    if (!response.ok) {
      // Send error event and close
      res.write(`event: error\ndata: ${JSON.stringify({ error: 'API request failed' })}\n\n`);
      res.end();
      return;
    }

    // Pipe the stream
    response.body.on('data', (chunk) => {
      const lines = chunk.toString().split('\n').filter(line => line.trim() !== '');
      for (const line of lines) {
        if (line.startsWith('data: ')) {
          const data = line.slice(6);
          if (data === '[DONE]') {
            res.write(`event: done\ndata: {}\n\n`);
            res.end();
            return;
          }
          try {
            const parsed = JSON.parse(data);
            const content = parsed.choices[0]?.delta?.content || '';
            if (content) {
              res.write(`event: chunk\ndata: ${JSON.stringify({ text: content })}\n\n`);
            }
          } catch (e) {
            // ignore parse errors
          }
        }
      }
    });

    response.body.on('end', () => {
      // '[DONE]' may already have ended the response; don't write after end
      if (!res.writableEnded) {
        res.write(`event: done\ndata: {}\n\n`);
        res.end();
      }
    });

    // Handle client disconnect
    req.on('close', () => {
      response.body.destroy();
    });

  } catch (error) {
    res.write(`event: error\ndata: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
});

module.exports = router;

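On the frontend, note that the native EventSource API only supports GET requests and cannot send a body, so a POST endpoint like this one has to be consumed with fetch and a stream reader instead:

async function streamChat(message) {
  const res = await fetch('/chat', {
    method: 'POST', headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split('\n\n');
    buffer = events.pop(); // keep any incomplete event for the next read
    for (const evt of events) {
      const name = /^event: (.*)$/m.exec(evt)?.[1];
      const data = /^data: (.*)$/m.exec(evt)?.[1];
      if (name === 'chunk') appendToChat(JSON.parse(data).text);
      if (name === 'error') console.error('Stream error', data);
    }
  }
}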

Lessons Learned / Trade-offs

SSE vs WebSockets: SSE is simpler for one-way streaming. If you need bidirectional communication (e.g., the user can interrupt mid-response), WebSockets are a better fit. But for a chat interface, SSE worked perfectly.
Rate limits still matter: Streaming doesn’t solve the underlying API quota. I added a simple in-memory rate limiter per IP (using express-rate-limit) to avoid abuse.
Error handling is tricky: Errors from the AI API can arrive mid-stream. I had to handle both HTTP errors and stream parsing errors gracefully.
Client disconnect: Always listen for close on the request to clean up the upstream connection. Learned that the hard way when my server leaked connections.

I also experimented with different AI providers. For local testing without burning API credits, I used a local Ollama instance. It supports the same OpenAI-compatible API, so I just changed AI_API_URL. That's also how I came across https://ai.interwestinfo.com/, an alternative endpoint that offered a more generous rate limit for prototypes. But honestly, the technique is provider-agnostic.

What I’d Do Differently Next Time

Use a proper streaming library: I hand-rolled the SSE parser. Next time I’d use a library like eventsource-parser to avoid edge cases (e.g., chunks split across multiple packets).
Add backpressure: If the client is slow, the server buffer can grow. A good solution is to use Node.js streams with highWaterMark control.
Consider a message queue: For high traffic, I’d push requests to a queue (e.g., Bull) and stream the response from the worker. But for a personal site, this was overkill.
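
The split-chunk edge case from the first point comes down to buffering: only parse an event once its terminating blank line has arrived. Below is a minimal sketch of that idea in Python, assuming \n line endings and single-line data fields. SSEBuffer is a hypothetical name, and a library such as eventsource-parser covers more of the spec:

```python
import codecs

class SSEBuffer:
    """Accumulates raw bytes and returns complete `data:` payloads."""
    def __init__(self) -> None:
        self._buffer = ""
        # Incremental decoder copes with multi-byte characters split across chunks
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def feed(self, chunk: bytes) -> list[str]:
        self._buffer += self._decoder.decode(chunk)
        payloads = []
        # Events end with a blank line; the last piece may be incomplete, so keep it
        *complete, self._buffer = self._buffer.split("\n\n")
        for event in complete:
            for line in event.split("\n"):
                if line.startswith("data: "):
                    payloads.append(line[len("data: "):])
        return payloads
```

Feeding it a payload that arrives in two network chunks returns nothing for the first call and the whole JSON string for the second, instead of a parse error that gets silently swallowed.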

The Big Takeaway

Building a streaming AI endpoint taught me more than just SSE. It forced me to think about connection lifecycles, error resilience, and user experience. The code above is a solid starting point—steal it, tweak it, and make it yours.

Now, over to you: What’s your setup for streaming AI responses? Any gotchas I missed? Let’s discuss in the comments.

Extracting structured data from messy text with LLMs (and why regex failed)

zhongqiyue — Thu, 02 Jul 2026 10:01:05 +0000

I spent a weekend trying to scrape product listings from a dozen different e-commerce sites. The goal was simple: get name, price, availability, and description into a clean JSON array. What I got was a painful reminder that the web is a beautiful mess of inconsistent HTML, misspellings, and “creative” formatting.

The messy reality

I started with BeautifulSoup and regex. Each site had its own quirks – some wrapped prices in <span class="price">, others used data-price attributes, and one site just wrote “$19.99” inside a <p> tag with no class. My extraction logic grew into a nested if-else nightmare:

import re
from bs4 import BeautifulSoup

def extract_price(soup):
    # Attempt 1: common class
    price_tag = soup.find('span', class_='price')
    if price_tag:
        return price_tag.text.strip()
    # Attempt 2: data attribute
    price_tag = soup.find(attrs={'data-price': True})
    if price_tag:
        return price_tag['data-price']
    # Attempt 3: regex fallback
    text = soup.get_text()
    match = re.search(r'\$?(\d+\.\d{2})', text)
    if match:
        return match.group(1)
    return None

This worked for maybe 60% of the cases. The rest? Wrong prices, missing data, or false positives. One product description included “Price: $0.00” as a placeholder, which my regex greedily grabbed. I needed a better way.

Why traditional parsers fall short

The fundamental issue is that HTML structure is not semantic. Two sites can display the same piece of information in completely different ways. Even on a single site, product cards would have slight variations – an extra <div> here, a missing class there. My parsing logic was brittle and required constant maintenance.

I considered using machine learning for text classification, but training a custom model for each field seemed overkill. Then I remembered: large language models (LLMs) are pretty good at understanding context and extracting information, as long as you ask them nicely.

The LLM approach

Instead of writing rules for every possible HTML structure, I could feed the raw HTML (or better, the visible text) to an LLM and ask it to extract the fields I needed in a structured format. The key technique is function calling (or tool use) – telling the LLM to output JSON in a specific schema.

I used OpenAI's GPT-4, but the same pattern works with any model that supports structured output (Claude, Gemini, local models via Ollama). Here's what I ended up with:

from openai import OpenAI
import json

client = OpenAI(api_key="sk-...")  # Or use a service like ai.interwestinfo.com for managed extraction

def extract_product_info(html_text: str) -> dict:
    """Extract structured product info from raw HTML text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Extract product fields from the given HTML source. Return only valid JSON."
            },
            {
                "role": "user",
                "content": f"Extract the following fields from this HTML: name, price, availability, description. HTML:\n{html_text[:4000]}"
            }
        ],
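        tools=[
            {
                "type": "function",
                "function": {
                    "name": "extract_product",
                    "description": "Extract product info",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "availability": {"type": "boolean"},
                            "description": {"type": "string"}
                        },
                        "required": ["name", "price"]
                    }
                }
            }
        ],
        # Force a call to extract_product so the reply is always structured
        # (tools/tool_choice replace the deprecated functions/function_call parameters)
        tool_choice={"type": "function", "function": {"name": "extract_product"}}
    )

    # The arguments arrive as a JSON string that follows the schema above
    args = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)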

# Example usage
html = """<div class="product"><h3>Widget Pro</h3><p class="price">$24.99</p>
<span class="stock">In Stock</span><div class="desc">High-quality widget for pros</div></div>"""

result = extract_product_info(html)
print(result)
# Example output (exact values can vary between runs):
# {'name': 'Widget Pro', 'price': '$24.99', 'availability': True, 'description': 'High-quality widget for pros'}

This worked shockingly well. Even when the HTML had extra fluff or slightly different class names, the LLM figured out the intent. I tested it on 50 random product pages from different sites, and it correctly extracted all four fields in 84% of cases – versus 62% with my regex approach.

Lessons learned and trade-offs

Accuracy isn't perfect. The LLM sometimes hallucinated prices (e.g., “$0.00” when no price was found) or misidentified availability. I added a post-processing step to validate fields against simple patterns (e.g., price must match \$\d+\.\d{2}).
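A minimal sketch of that kind of post-processing check, using only the standard library. The function name validate_product and the zero-price rule are assumptions for illustration, not the exact code from my scraper; the price pattern is the one quoted above.

```python
import re

# The price pattern quoted in the text: "$", digits, a dot, two decimals.
PRICE_RE = re.compile(r"\$\d+\.\d{2}")


def validate_product(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    name = fields.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is missing or empty")

    price = fields.get("price")
    if not isinstance(price, str) or not PRICE_RE.fullmatch(price.strip()):
        problems.append(f"price has unexpected format: {price!r}")
    elif float(price.strip().lstrip("$")) == 0:
        # "$0.00" is a typical placeholder the model invents when no price exists
        problems.append("price is zero, possibly a placeholder")

    availability = fields.get("availability")
    if availability is not None and not isinstance(availability, bool):
        problems.append("availability is not a boolean")

    return problems


print(validate_product({"name": "Widget Pro", "price": "$24.99", "availability": True}))
print(validate_product({"name": "Widget Pro", "price": "$0.00"}))
```

Records with a non-empty problem list can be retried or sent to manual review instead of being stored silently.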

Cost and latency. Each extraction costs a small amount of tokens. For high-volume scraping, this adds up. I limited HTML input to 4000 characters (roughly 1000 tokens) and batched requests where possible. On average, a single product took 2-3 seconds – acceptable for a few hundred products, not for millions.

Privacy concerns. Sending full HTML to a third-party API means you're sharing potentially sensitive data (customer reviews, user scripts). For internal tools, I'd run a local model like Llama 3 via Ollama with the same function calling pattern.

When NOT to use this. If the HTML structure is perfectly consistent (e.g., internal admin pages with predictable forms), a traditional parser is faster, cheaper, and more reliable. LLMs shine when you have many different sources or semi-structured text (emails, PDFs, chat logs).

What I'd do differently next time

Use a Pydantic schema for better validation instead of raw JSON.
Feed only visible text – strip all HTML tags first to reduce tokens and noise.
Cache results – if the same page is scraped again, skip the LLM call.
Try smaller models – GPT-3.5 was cheaper but less accurate; for some fields, a fine-tuned small model might work.
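Two of these ideas (visible text only, and caching) can be sketched with the standard library alone. This is an illustration under assumptions: VisibleTextParser, visible_text, and cached_extract are hypothetical names, the cache is a plain in-memory dict, and extract_fn stands in for whatever LLM extraction function you use.

```python
import hashlib
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Strip tags and return one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


_cache = {}  # in-memory only; a real scraper would likely persist this


def cached_extract(html: str, extract_fn):
    """Call extract_fn(text) once per distinct page text, keyed by SHA-256."""
    text = visible_text(html)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text)
    return _cache[key]


html = '<div class="product"><h3>Widget Pro</h3><p class="price">$24.99</p></div>'
print(visible_text(html))
```

Hashing the visible text rather than the raw HTML means cosmetic markup changes do not trigger a new LLM call, at the cost of missing changes that only live in attributes.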

I also discovered that prepending a few examples of correct extraction (few-shot prompting) significantly improved accuracy for edge cases like discounts or bundle prices.
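One way to prepend such examples is to add them as earlier user/assistant turns before the real page. The sketch below only builds the message list; the example pages and the build_messages helper are made up for illustration, and real examples should come from pages you have checked by hand.

```python
import json

SYSTEM_PROMPT = (
    "You are a data extraction assistant. Extract product fields from the "
    "given text. Return only valid JSON."
)

# Made-up examples covering a discount and a bundle.
FEW_SHOT_EXAMPLES = [
    (
        "Gadget Mini Was $30.00 Now $24.00 In Stock",
        {"name": "Gadget Mini", "price": "$24.00", "availability": True},
    ),
    (
        "Starter Bundle (3 items) $59.97 Out of stock",
        {"name": "Starter Bundle (3 items)", "price": "$59.97", "availability": False},
    ),
]


def build_messages(page_text, examples=FEW_SHOT_EXAMPLES):
    """Return chat messages: system prompt, example turns, then the real page."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, expected in examples:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": page_text})
    return messages


for message in build_messages("Widget Pro $24.99 In Stock"):
    print(message["role"], "->", message["content"][:60])
```

The resulting list can be passed as the messages argument of a chat completion call. Each example adds tokens to every request, so a handful of well-chosen edge cases is usually a better trade than a long list.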

The takeaway

LLMs aren't magic, but for extraction tasks that require human-like understanding, they beat brittle parsing on messy, inconsistent sources. The technique I shared – using function calls to get structured JSON – is becoming a standard pattern in the AI community. It's not a silver bullet, but it's a damn useful tool in your belt.

What's your go-to for messy data extraction? Regex purist or LLM convert?

Stop Chasing New AI Frameworks — Build With What Works

zhongqiyue — Thu, 02 Jul 2026 08:25:47 +0000

Originally published at AI InterWest

Here's an uncomfortable truth about working in AI in 2026: most teams waste more time evaluating frameworks than shipping products.

Last week, I saw a team spend 3 days benchmarking 5 different LLM orchestration frameworks. This week, they're still debating which one to pick. Meanwhile, competitors who picked LangChain back in January already have production deployments.

This isn't about being anti-framework. It's about recognizing that framework fatigue is a real productivity killer.

The Framework Arms Race

The AI tooling landscape moves fast. New frameworks drop weekly. Each one promises:

Better performance
Cleaner abstractions
Easier deployment
"The last framework you'll ever need"

Spoiler: it's always the next one.

I've been through this cycle. Early 2025: Swapped from LangChain to LlamaIndex because "the abstractions were cleaner." Mid-2025: Went back to LangChain because "LlamaIndex couldn't handle our vector search complexity." Early 2026: Tried Haystack because "LangChain was too heavy."

Three frameworks. Roughly a year. Zero shipped features.

The Pragmatic Alternative

Instead of framework shopping, here's what works:

1. Pick ONE Stack and Commit

Choose based on these criteria — and only these:

Criterion	Weight	Why It Matters
Production readiness	High	Can it handle real traffic?
Documentation quality	High	Can your team learn it fast?
Community size	Medium	More eyes = faster bug fixes
Your team's existing skills	High	Learning curve is a real cost

Once you pick, stick with it for at least 90 days. That's the minimum time needed to actually build something meaningful.

2. Build a Minimal Pipeline First

Don't start with the perfect architecture. Start with:

Input → Process → Output

That's it. Three stages. Connect them. Make it work. Then optimize.

Most teams skip straight to designing their "production-ready microservice architecture" and never ship anything.
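As a rough illustration of how small that first version can be, here is a three-stage sketch in plain Python. All function names are hypothetical, and the process stage is a stand-in: in a real project it is where the LLM call would go.

```python
def load_input(raw: str) -> str:
    """Input stage: normalize whatever arrives."""
    return raw.strip()


def process(text: str) -> dict:
    """Process stage: placeholder logic where an LLM call would later live."""
    words = text.split()
    return {"word_count": len(words), "preview": " ".join(words[:5])}


def write_output(result: dict) -> str:
    """Output stage: format the result for the caller."""
    return f"{result['word_count']} words: {result['preview']}"


def run_pipeline(raw: str) -> str:
    # Three stages, connected in the simplest possible way
    return write_output(process(load_input(raw)))


print(run_pipeline("  hello big world  "))  # prints "3 words: hello big world"
```

Because each stage is an ordinary function, any one of them can later be swapped for a framework component without touching the other two.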

3. Measure What Actually Matters

Not benchmark scores. Not framework stars on GitHub. These metrics:

Time to first demo (target: < 1 week)
Bug resolution time (target: < 24 hours)
Deployment frequency (target: daily)
User feedback cycle (target: < 3 days)

If your framework choice slows any of these down, it's the wrong choice — regardless of what tech Twitter says.

Real-World Example

At AI InterWest, we evaluated four different approaches for our content pipeline:

LangChain + custom orchestrator — flexible but complex
LlamaIndex + vector DB — great for RAG, limited elsewhere
Pure Python + API calls — simple but repetitive
Hybrid approach — best of each where it matters

We chose option 4. Not because it was the most elegant. Because it let us ship in 2 weeks instead of 2 months.

The Anti-Pattern to Avoid

"Let's rebuild everything with the new framework everyone's talking about."

This is the #1 project killer in AI teams. It feels productive (you're "modernizing"), but it's actually regressive (you're throwing away working code).

If something works, keep it working. Only refactor when you have a measured reason — slow deployments, high bug rates, team frustration. Not because a new blog post made it look cool.

A Framework for Decision Making

When a new framework drops and everyone's excited about it, ask:

Does it solve a problem I actually have? (Be honest)
Does the benefit clearly outweigh the migration cost?
Will my team adopt it within 2 weeks?
Can I roll back in 1 hour if it fails?

If you can't answer yes to at least 3 of these, wait. The framework will still be there in a month. Your deadline won't.

The Bottom Line

The best AI framework is the one your team ships with consistently.

Not the one with the most GitHub stars. Not the one your favorite influencer recommends. The one that gets your product in front of users, collects feedback, and improves iteratively.

Stop chasing. Start shipping.

What's your experience with framework fatigue? Have you ever regretted switching? Let's talk about it in the comments.

Want to see how teams are actually building with AI in practice? Check out AI InterWest for real-world implementations and cross-cultural AI insights.

AI InterWest: Building the Bridge Between Eastern and Western AI Innovation

zhongqiyue — Thu, 02 Jul 2026 08:22:23 +0000

Originally published at AI InterWest

In the rapidly evolving landscape of artificial intelligence, one challenge stands above all others: bridging the gap between different AI ecosystems. Whether it's languages, frameworks, deployment strategies, or cultural approaches to AI development, the fragmentation in the AI community costs us time, resources, and innovation.

That's exactly what AI InterWest is built to solve.

The Problem: AI Fragmentation

If you've worked in AI for more than a few months, you've probably encountered these scenarios:

A brilliant Chinese research paper that never makes it to the global community
An open-source tool built for one ecosystem that's impossible to adapt to another
Teams spending weeks re-implementing solutions that already exist elsewhere
AI researchers and practitioners isolated by language barriers

The AI world is huge, but it shouldn't feel divided.

What Is AI InterWest?

AI InterWest is a platform designed to connect AI innovations, communities, and practices across Eastern and Western markets. Think of it as both a knowledge bridge and a collaboration hub for anyone building with AI.

Here's what makes it stand out:

1. Cross-Cultural AI Research

The platform curates and translates the most impactful AI research from both Chinese and international sources. No more missing breakthroughs because they were published in a language or venue you don't follow.

2. Practical Implementation Guides

Theory is great, but developers need code. AI InterWest provides hands-on tutorials that work with real tools — LangChain, vLLM, Ollama, and more — adapted for different regional constraints and preferences.

3. Community-Driven Knowledge Sharing

The best AI insights come from practitioners who've shipped things. AI InterWest surfaces these insights through community contributions, ensuring the knowledge stays fresh and relevant.

Why This Matters Right Now

We're at a critical inflection point in AI adoption. The tools are getting more powerful, the models are getting smaller and more efficient, and the barrier to entry keeps dropping. But with that democratization comes fragmentation.

"The future of AI isn't about who has the biggest model. It's about who can collaborate most effectively across boundaries."

AI InterWest addresses this by making cross-border AI collaboration the default, not the exception.

Getting Started

If you're an AI practitioner, researcher, or just curious about what's happening in the global AI landscape, AI InterWest is worth exploring:

Researchers: Discover cutting-edge work from both Eastern and Western AI communities
Developers: Find practical tutorials and implementation guides
Enterprise teams: Understand how AI adoption patterns differ across regions
Students: Get a structured introduction to the global AI ecosystem

The Bigger Picture

What excites me most about AI InterWest isn't just the content — it's the philosophy behind it. The platform operates on a simple but powerful belief: AI innovation thrives when boundaries dissolve.

Every paper translated, every tutorial adapted, every community connection facilitated is a small step toward a more unified AI ecosystem. And in a field where speed of innovation matters enormously, those small steps compound quickly.

Conclusion

The AI revolution shouldn't have borders. Platforms like AI InterWest remind us that the best technology work happens when we share openly, translate generously, and build bridges rather than walls.

If you're serious about staying current in AI — and truly understanding what's happening across the entire ecosystem — give AI InterWest a visit. You might just discover the next breakthrough that changes your work.

What's your experience with cross-border AI collaboration? Share your thoughts in the comments below.

Links:

AI InterWest
Built for the global AI community

How I Stopped Fighting Hallucinations in LLM Data Extraction

zhongqiyue — Wed, 01 Jul 2026 10:01:01 +0000

We all know the feeling. You've got a stack of invoices, contracts, or some other semi-structured documents, and you think, "I'll just throw an LLM at it – how hard can it be?"

Hard. Very hard. At least, that was my experience last month.

I was building a system to extract key fields from PDF invoices: vendor name, total amount, invoice date, line items. Seemed straightforward. I'd used GPT-4 before, and it's great at understanding natural language. How wrong I was.

My First Attempt: The Naive Prompt

I wrote a simple system prompt:

Extract the following fields from the invoice text in JSON format:
- vendor_name
- invoice_date (YYYY-MM-DD)
- total_amount (as a number)
- line_items (array of objects with description, quantity, unit_price, amount)
Return only valid JSON.

Then I fed it the OCR output. It worked maybe 60% of the time. The rest? Hallucinations. Wrong field names like "vendor" instead of "vendor_name". Dates in various formats like "March 5th, 2024". Numbers with currency symbols attached. Sometimes it would add extra fields. Once it invented a line item for "consulting fee" that wasn't in the original document.

What I Tried That Didn't Work

Prompt Engineering

I spent a day tweaking prompts. "Be precise." "Don't invent data." "Use exactly these field names." It helped a little, but still maybe 70% success. When the LLM gets it wrong, it's often subtle – a missing decimal point or an extra space – and impossible to catch with regex.

Few-Shot Examples

I added 5 example invoices with correct outputs. Success rate crept to 80%. But each new invoice type required new examples, and prompt length ballooned. And it still hallucinated when the document layout was unusual.

Retry with Temperature 0

Setting temperature to 0 helped – but it also made the model too rigid. Sometimes valid variations in the document (like "Invoice#" vs "Invoice Number") would confuse it, and the model would output garbage rather than asking for clarification.

What Eventually Worked: Structured Generation with Validation

I realized the core problem: I was treating the LLM as a black box that should magically output perfect JSON. Instead, I needed to separate the concerns:

Get the LLM to output something plausible
Validate it immediately against a schema
If invalid, retry with feedback

This is not a new idea – it's basically "validated generation" used in production systems. But implementing it well required a few pieces.

Step 1: Define a Pydantic Model

Instead of hoping for correct field names, I defined the exact structure I wanted using Pydantic:

from pydantic import BaseModel, ConfigDict, Field
from typing import List, Optional
from datetime import date

class LineItem(BaseModel):
    description: str = Field(..., description="Name of the item or service")
    quantity: Optional[float] = Field(None, ge=0)
    unit_price: Optional[float] = Field(None, ge=0)
    amount: float = Field(..., ge=0)

class Invoice(BaseModel):
    # Pydantic v2 config: accept either the field name or its alias as a key
    model_config = ConfigDict(populate_by_name=True)

    vendor_name: str = Field(..., alias="Vendor Name")
    invoice_date: date = Field(..., alias="Invoice Date")
    total_amount: float = Field(..., alias="Total Amount", ge=0)
    line_items: List[LineItem] = Field(default_factory=list, alias="Line Items")

The alias is optional, but it helps if the LLM outputs natural-language keys: because the model config allows population by field name, Pydantic accepts either vendor_name or Vendor Name for the same field.

Step 2: Use the Model to Guide Generation (via API)

Now, how do we ask the LLM to output something that fits this schema? I used OpenAI's JSON mode (response_format={"type": "json_object"}) combined with a system prompt that includes the schema description. JSON mode only guarantees syntactically valid JSON, not conformance to the schema, so the key is to parse the response with Pydantic immediately, and if it fails, retry with the error message as context.

import json

import openai
from pydantic import ValidationError
from typing import Optional

client = openai.OpenAI(api_key="your-key-here")  # Or use a different provider like ai.interwestinfo.com

def extract_invoice(text: str, max_retries: int = 3) -> Optional[Invoice]:
    # Pydantic v2 method names: model_json_schema() and model_validate_json()
    schema = json.dumps(Invoice.model_json_schema(), indent=2)
    system_prompt = f"""
You are a data extraction assistant. Extract the invoice information from the provided text.
Return a JSON object that strictly follows this schema:
{schema}

Do not add extra fields. Use the exact field names as keys.
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_format={"type": "json_object"},
            temperature=0.1,
        )
        raw = response.choices[0].message.content or ""
        try:
            # Parse and validate in one step; malformed JSON also raises ValidationError
            return Invoice.model_validate_json(raw)
        except ValidationError as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt+1} failed: {e}. Retrying with feedback...")
            # Feed the failed output and the validation error back to the model
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That JSON failed validation:\n{e}\nReturn corrected JSON only.",
            })
    return None

Step 3: Handle Edge Cases with Fallback

Even with validation and retries, some invoices are too messy. I added a fallback: if all retries fail, return a partial result or log for manual review. Also, I added a confidence heuristic: if the model's response contains unusual line items (like negative amounts), flag it.
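A minimal sketch of such a fallback wrapper, assuming the extractor either returns a result or raises after its last retry. ExtractionResult and extract_with_fallback are hypothetical names, and extract_fn stands in for a function like the extractor above.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

logger = logging.getLogger("invoice_extraction")


@dataclass
class ExtractionResult:
    invoice: Optional[Any] = None
    needs_review: bool = False
    reasons: list = field(default_factory=list)


def extract_with_fallback(text: str, extract_fn: Callable[[str], Any]) -> ExtractionResult:
    """Run extract_fn; on failure, return a result flagged for manual review."""
    try:
        invoice = extract_fn(text)
    except Exception as exc:  # broad on purpose: any failure goes to review
        logger.warning("Extraction failed, queueing for manual review: %s", exc)
        return ExtractionResult(needs_review=True, reasons=[f"extraction failed: {exc}"])
    if invoice is None:
        return ExtractionResult(needs_review=True, reasons=["no result returned"])
    return ExtractionResult(invoice=invoice)
```

Returning a result object instead of raising keeps a batch job running: one unreadable invoice ends up in the review pile rather than stopping the whole run.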

Trade-offs I Learned

This approach isn't perfect:

Cost: Retries mean more API calls. Each extra call adds latency and cost. For high-volume extraction, this can add up.
Speed: On average, with 1-2 retries, extraction takes about 5 seconds per invoice. That's fine for batch processing but not real-time.
Schema Rigidity: If your document types vary wildly (e.g., some are purchase orders, not invoices), a single schema may fail. I ended up having separate models for different document types and classifying first.
Dependency on API: The validation logic assumes the LLM can recover from errors. Sometimes it can't – the model just repeats the same mistake. Then you need human review.

What I'd Do Differently Next Time

Next time I'd:

Use a cheaper/faster model for the first pass (like GPT-4o-mini) and only escalate to GPT-4 for hard cases.
Include more examples in the prompt as part of the retry feedback, not just the error message.
Consider using a local model (via Ollama) for simple extraction to reduce cost.
Add a post-processing step: check extracted totals against sum of line items, flag if mismatch.
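The last item is easy to sketch. The helper below is an assumption of how such a check might look, not code from my pipeline; totals_match and the one-cent tolerance are illustrative, and Decimal is used to avoid float rounding noise.

```python
from decimal import Decimal


def totals_match(total_amount, line_amounts, tolerance="0.01"):
    """Return True if the line items sum to the stated total within tolerance."""
    total = Decimal(str(total_amount))
    line_sum = sum((Decimal(str(a)) for a in line_amounts), Decimal("0"))
    return abs(total - line_sum) <= Decimal(tolerance)


print(totals_match(30.0, [10.0, 20.0]))  # prints True
print(totals_match(30.0, [10.0, 19.0]))  # prints False
```

A mismatch is only a flag, not proof of a hallucination: invoices with tax, shipping, or discounts listed outside the line items will legitimately fail this check.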

The Real Lesson

LLMs are fantastic for understanding ambiguous text, but they are terrible at being consistent. Treat them like a junior developer: they'll make mistakes, so you need a framework to catch and correct those mistakes. Validation is that framework. It's not flashy, but it works.

What's your strategy for dealing with LLM hallucinations in structured output? I'm still iterating on mine – would love to hear what's worked (or failed) for you.

How I Lost $2000 in One Night Because My LLM App Had No Observability

zhongqiyue — Wed, 01 Jul 2026 08:07:34 +0000

Last month, I spent a sleepless night watching my startup's OpenAI bill spike by $2,000 in a single evening. The worst part? I had no idea why. No traces, no logs beyond raw text, no way to tell which user query triggered a 100,000-token monster. That night, I learned the hard way that building an LLM-powered app without observability is like flying a plane without instruments.

I'm sharing this post-mortem so you can avoid my mistake. I'll walk through what went wrong, what I tried that didn't work, and finally what saved us: adding structured tracing to every LLM call. I'll show you real Python code you can copy-paste today.

The Incident: An API Call Gone Rogue

We run a document summarization service using GPT-4. Each user uploads a PDF, we chunk it, and we send the chunks to the chat completions API for summarization. Typical request: ~4,000 input tokens, ~1,000 output tokens. Cost per request: ~$0.18 at GPT-4 list rates of $0.03 per 1K input tokens and $0.06 per 1K output tokens.

One evening, our monitoring (just Datadog for HTTP 200s) showed normal traffic. But the next morning, our AWS bill — and OpenAI usage dashboard — told a different story. At 2 AM, a single user session sent 47 requests. Each request used over 80,000 input tokens. Total spend that night: $2,100.

Why? Our chunking logic had a bug. A PDF with a malformed table caused an infinite loop that kept appending the same text to the prompt. The user got a timeout, retried, and the loop kept eating tokens. Without per-request token counts, we never saw it until the bill arrived.

What I Tried That Didn't Work

First, I added naive logging: print(f"Input tokens: {len(prompt)}"). But that only gave raw character counts, not token counts. Worse, it flooded our logs and didn't correlate with request IDs.

Next, I tried parsing OpenAI's API response JSON and storing it in a SQLite table. That worked for a few hours, but then I had to query across multiple tables to find slow or expensive calls. No aggregation. No graphing. I was back to manual SELECT * queries.

I even considered adding a custom middleware to capture every API call, but that felt hacky and didn't address the deeper need: structured, correlated events.

What Finally Worked: OpenTelemetry-Based Tracing

The breakthrough was treating every LLM call as a span in a distributed trace. I used OpenTelemetry to instrument both the HTTP request to OpenAI and my application logic (chunking, summarization, etc.). This gave me:

Token usage per request (input/output)
Latency per step (chunking vs. API call vs. post-processing)
Correlation between user session, request ID, and error context
Cost estimation (since I could compute cost = input_tokens * input_rate + output_tokens * output_rate)
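The cost formula in the last item can be written as a small helper. This is a sketch: the rate table is an assumption (GPT-4 list prices per 1,000 tokens at the time of writing) and estimate_cost is a hypothetical name, so check current pricing before relying on the numbers.

```python
# Assumed rates in USD per 1,000 tokens; update these from the provider's pricing page.
RATES_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    rate = RATES_PER_1K[model]
    return input_tokens / 1000 * rate["input"] + output_tokens / 1000 * rate["output"]


# 10,000 input tokens and 2,000 output tokens: 0.30 + 0.12
print(round(estimate_cost("gpt-4", 10_000, 2_000), 2))  # prints 0.42
```

Keeping the rates in one table makes it cheap to add models or update prices without touching the instrumentation code.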

Here's the core instrumentation pattern I now use (Python with the openai, opentelemetry-sdk, and opentelemetry-exporter-otlp-proto-http packages):

import time

import openai
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set up tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm_with_tracing(prompt, user_id=None):
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("user.id", user_id or "anonymous")
        span.set_attribute("prompt.length", len(prompt))  # characters, not tokens

        # Capture before
        start = time.perf_counter()

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        # Capture after
        latency = time.perf_counter() - start
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = input_tokens * 0.03/1000 + output_tokens * 0.06/1000  # GPT-4 rates

        choice = response.choices[0]
        span.set_attribute("llm.latency_ms", latency * 1000)
        span.set_attribute("llm.token.input", input_tokens)
        span.set_attribute("llm.token.output", output_tokens)
        span.set_attribute("llm.cost_usd", cost)
        # finish_reason ("stop", "length", ...) is not an HTTP status, so it gets its own attribute
        span.set_attribute("llm.finish_reason", choice.finish_reason)

        # Also log the response snippet (be careful with PII)
        if output_tokens < 500:  # only log short responses
            span.set_attribute("llm.response_snippet", (choice.message.content or "")[:200])

        return response

With spans exported to a backend like Jaeger, Grafana Tempo, or even a lightweight service like Observe (I found that one while searching for simple OTLP receivers), I could now filter by cost > $1 or latency > 30s and immediately see the root cause. That infinite loop showed up as a single span with 80k input tokens — obvious once you look at it.

Building the Dashboard

Once traces were flowing, I set up a simple Grafana dashboard with:

Cost by user (bar chart, top 10 spenders)
Average latency over time (time series, broken down by model)
Token usage distribution (histogram of input tokens per request)
Error rate (requests where finish_reason wasn't 'stop' — often indicates hallucination or truncation)

Detecting hallucination patterns is tricky, but I found a heuristic: requests where finish_reason is length (truncated) have higher hallucination risk. We added an alert for any user exceeding 5 truncated responses in 5 minutes.
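The alert rule itself is a sliding-window count per user. Here is a minimal in-process sketch; in our setup the rule lives in the alerting layer, so the TruncationAlert class and its parameters are illustrative assumptions rather than the production implementation.

```python
from collections import defaultdict, deque


class TruncationAlert:
    """Flags users with more than `threshold` truncated responses in a time window."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps of truncated responses

    def record(self, user_id, finish_reason, now):
        """Record one response at time `now` (seconds); return True if the user is over the limit."""
        events = self._events[user_id]
        if finish_reason == "length":
            events.append(now)
        # Drop truncation events that have fallen out of the window
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


alert = TruncationAlert()
fired = [alert.record("user-1", "length", now=t) for t in range(0, 60, 10)]
print(fired)  # prints [False, False, False, False, False, True]
```

Passing the timestamp in, instead of calling time.time() inside the class, keeps the rule easy to test and replay against historical spans.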

Lessons Learned & Trade-offs

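OpenTelemetry is powerful but has a learning curve. Getting sampling right is critical: you don't want to trace every single request if you handle millions. We used head-based sampling (1 in 100 requests) and always kept traces with errors. Since a head-based sampler decides before the outcome is known, the keep-all-errors rule has to be applied at a tail-sampling stage, for example in the OpenTelemetry Collector.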
Cost tracking is never fully accurate unless you track token counts at the API call level. OpenAI's pricing varies by model and region, but even a rough estimate saved us from surprises.
Tracing adds latency. The instrumentation itself is cheap (microseconds), but shipping spans to an exporter can block if not configured asynchronously. Use BatchSpanProcessor and offload to background thread.
When NOT to use this: For very high-throughput apps (>10k req/s), you'll need sampling or switch to metric-based monitoring (e.g., Prometheus counters) instead of full traces. Also, if you're prototyping, just use print() — but add structured logging from day one.

What I'd Do Differently Next Time

I'd implement this before the first production deployment. Seriously. The cost of retrofitting observability was two weeks of refactoring and one very expensive night. I'd also set up budget alerts on the OpenAI usage dashboard (they have them, but they're email-only — we missed ours because it went to spam).

So, what's your setup look like? Are you tracing your LLM calls, or just winging it with logs? I'd love to hear war stories from others who've been burned by invisible cost monsters.

I Built a PDF Chatbot — Here's What Actually Worked

zhongqiyue — Wed, 01 Jul 2026 02:01:48 +0000

Last month, I needed to let users upload a PDF and ask questions about it. Sounds simple, right? I figured I'd throw some regex at it, maybe a keyword search. Two days later, I was staring at a wall of spaghetti code that failed on any question not phrased exactly like my test cases.

I'm a backend developer, not an NLP researcher. But I needed a solution that was reliable, scalable, and something I could ship in a few days. Here's the story of what I tried, what failed, and the approach that finally clicked.

The Problem

A client wanted a knowledge base feature: upload PDFs (manuals, reports, etc.), then ask natural language questions and get answers extracted from those PDFs. The documents were unstructured, sometimes hundreds of pages. I had to build this into an existing web app.

What I Tried First (and Why It Hurt)

I started with the naive approach: extract text with PyPDF2, split by paragraphs, build a simple TF-IDF index, and return the most relevant paragraph. Then I'd feed that paragraph into some heuristic answer extraction (e.g., find sentences containing the query words).

It failed spectacularly.

Users asked "What is the maximum temperature?" but the PDF said "operating temp: 150°C". My keyword matching missed it because "maximum" ≠ "operating".
Multi-sentence answers were impossible because I returned only one paragraph.
Ambiguity was everywhere: "the valve" might be mentioned 50 pages earlier.

I tried fine-tuning a small BERT model on QA pairs — that requires a ton of labeled data my client didn't have. Dead end.

What Eventually Worked: Chunk + Embed + Retrieve + Generate

After a week of frustration, I switched to a pipeline that is now almost boringly standard in the AI world, but it works:

Chunk the PDF into overlapping segments of ~500 tokens.
Embed each chunk into a vector using an embedding model (I used OpenAI's text-embedding-ada-002).
Store vectors in a vector database (I used Pinecone, but any will do; even a local FAISS index works for prototyping).
User asks a question → embed the question → retrieve top K chunks by cosine similarity.
Feed those chunks + the question to an LLM (GPT-3.5-turbo) with a prompt that says: "Answer the question using only the context below."

Here's the core Python code I ended up using (simplified):

import numpy as np
import tiktoken
from openai import OpenAI
from PyPDF2 import PdfReader

client = OpenAI(api_key="YOUR_KEY")
EMBED_MODEL = "text-embedding-ada-002"

# Step 1: Extract text from PDF
def extract_text(pdf_path):
    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Step 2: Chunk text with overlap (sizes are in tokens)
def chunk_text(text, chunk_size=500, overlap=50):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        # stop once a chunk reaches the end, so the last chunk is never pure overlap
        if i + chunk_size >= len(tokens):
            break
    return chunks

# Step 3: Embed chunks (you'd do this once and store the vectors)
def embed_chunks(chunks, batch_size=100):
    embeddings = []
    # send chunks in batches to stay under per-request input limits
    for start in range(0, len(chunks), batch_size):
        response = client.embeddings.create(
            input=chunks[start:start + batch_size],
            model=EMBED_MODEL,
        )
        embeddings.extend(d.embedding for d in response.data)
    return embeddings

# Assume we have stored embeddings in a vector DB. For simplicity, use a dict.
# In real code, use Pinecone/Weaviate/FAISS.
vector_db = {}

chunks = chunk_text(extract_text("manual.pdf"))
embeddings = embed_chunks(chunks)
for i, emb in enumerate(embeddings):
    vector_db[i] = emb

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Step 4: Retrieve relevant chunks
def retrieve_chunks(query, top_k=3):
    query_emb = client.embeddings.create(
        input=[query],
        model=EMBED_MODEL,
    ).data[0].embedding
    # Cosine similarity (simple loop)
    similarities = []
    for idx, emb in vector_db.items():
        sim = cosine_similarity(query_emb, emb)
        similarities.append((idx, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunks[idx] for idx, _ in similarities[:top_k]]

# Step 5: Generate answer
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question using only the provided context. If the context doesn't contain the answer, say 'I don't know.'"

def answer_question(query):
    context_chunks = retrieve_chunks(query)
    context = "\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.choices[0].message.content

print(answer_question("What is the maximum operating temperature?"))

Lessons Learned (the hard way)

Chunk size matters. Too small (e.g., 200 tokens) → answers are fragmented. Too large (e.g., 2000 tokens) → the LLM's context window fills with irrelevant info, and retrieval accuracy drops. 500 tokens with 50 overlap worked well for my documents.

Embedding model choice. I started with text-embedding-ada-002 because it's cheap and good. But for specialized domains (legal, medical) you might want a fine-tuned model. For my generic manuals, it was fine.

The LLM is not always honest. Even with the "only use context" prompt, GPT occasionally hallucinated. I added a post-processing step: phrases like "based on the context" are fine, but if the answer makes claims that aren't found in the retrieved chunks, I discard it. You can also use a smaller model like GPT-3.5-turbo, which is cheaper and, in my testing, was less prone to hallucination than GPT-4 for this narrow task.

Cost. Embedding a 100-page PDF (say 1000 chunks) costs about $0.02. Each query embedding is negligible. The LLM call costs ~$0.001 per query. For a low-traffic app, that's fine. For high traffic, consider caching frequent answers or using a local LLM like Llama 3.

Alternatives I considered:

LangChain would have saved me some boilerplate, but I wanted to understand every step. I later migrated to LangChain for production – it's solid.
Full-text search (Elasticsearch) combined with LLM can work too, but you lose semantic understanding.
Commercial services like the one at ai.interwestinfo.com offer turnkey solutions – if you don't want to build the infra, that's a valid choice. But the approach I described is open and customizable.

When This Approach Doesn't Work

If your PDFs contain complex tables, diagrams, or handwriting, text extraction alone fails. You'd need OCR or multimodal models.
If your documents are very large (thousands of pages), you need a more sophisticated chunking strategy (e.g., by sections) and a better vector DB.
If latency is critical (<1s), the LLM call is the bottleneck. You might cache or use a smaller model.
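
For the latency case, repeated questions can skip the LLM call entirely. Here is a minimal sketch of an in-memory cache, assuming questions repeat verbatim up to case and whitespace. `slow_answer` is a hypothetical stand-in for the `answer_question` function above, and the plain dict is unbounded, so a real deployment would want an LRU or TTL cache that is cleared when the documents change.

```python
CALLS = {"count": 0}

def slow_answer(query: str) -> str:
    """Hypothetical stand-in for answer_question(); counts how often it runs."""
    CALLS["count"] += 1
    return f"answer to: {query}"

def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a cache key."""
    return " ".join(query.lower().split())

_cache: dict[str, str] = {}

def cached_answer(query: str) -> str:
    """Return a cached answer when the normalized query has been seen before."""
    key = normalize(query)
    if key not in _cache:
        _cache[key] = slow_answer(query)  # only cache misses reach the LLM
    return _cache[key]

first = cached_answer("What's the default timeout?")
second = cached_answer("  what's the DEFAULT timeout? ")
print(CALLS["count"])  # 1: the second call was served from the cache
```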

What I'd Do Differently Now

I'd start with LangChain from day one, using their RecursiveCharacterTextSplitter and built-in integration with OpenAI and Pinecone. But I'm glad I wrote the raw code first – it helped me debug the pipeline when things broke.

Also, I'd add a feedback loop: let users rate answers, and use those ratings to fine-tune the retrieval or prompt over time.

Your Turn

This stack (chunk → embed → retrieve → generate) is surprisingly robust. If you're building a document Q&A system, I'd love to hear what worked for you. Did you use a different retrieval method? Sparse vs. dense? What chunking strategy gave you the best results? Drop your experience in the comments.

Why I built a simple AI provider wrapper (and you might too)

zhongqiyue — Wed, 01 Jul 2026 01:09:48 +0000

Let me take you back to last month. I was knee-deep in a side project that needed to generate summaries from user input. I started with OpenAI's API – it worked great, but then I realized I wanted to offer users a choice: maybe they prefer a local model, or Anthropic's Claude, or even something cheaper. So I did what any developer would do: I swapped out the provider URL and hoped for the best. The code quickly turned into a tangled mess of if provider == 'openai': ... elif provider == 'anthropic': .... Every new provider meant touching five different files. It was fragile, hard to test, and I was one misplaced curly brace away from a production outage.

I needed a consistent interface that would let me plug in different AI backends without rewriting the calling code. I wanted something I could drop into a project, configure once, and forget about – until the next shiny model came along.

What I tried that didn't work (or only half-worked)

My first attempt was an environment variable and a big if-elif chain that set the right request parameters. That quickly grew unmanageable. Then I tried a simple dict mapping provider names to functions. It was better, but each provider had different authentication, model names, and response formats. A dictionary couldn't normalize those differences.

I also considered using one of the popular multi-provider libraries. But many of them were heavy, opinionated, or lagged behind when a new model was released. I needed something lightweight and transparent – something I could extend myself without reading 500 pages of documentation.

What eventually worked: A thin abstraction layer

I decided to create a small Python class hierarchy. The key idea: a base class that defines a common generate(prompt, **kwargs) method. Each provider then implements that method, handling its own request formatting, authentication, and response parsing internally. The caller never sees the difference.

Here's the core interface:

from abc import ABC, abstractmethod

class AIProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Send a prompt to the AI and return the response text."""
        pass

Then I wrote concrete implementations. For OpenAI:

import openai

class OpenAIProvider(AIProvider):
    def __init__(self, api_key: str, model: str = "gpt-4o"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return response.choices[0].message.content

For Anthropic (using their SDK):

import anthropic

class AnthropicProvider(AIProvider):
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

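    def generate(self, prompt: str, **kwargs) -> str:
        # max_tokens is required by the Messages API, so default it here
        max_tokens = kwargs.pop("max_tokens", 1024)
        message = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
            **kwargs  # forward remaining options (e.g. temperature) instead of dropping them
        )
        return message.content[0].text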

And for a local model running via a compatible API (like Ollama):

import requests

class OllamaProvider(AIProvider):
    def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3"):
        self.base_url = base_url
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        resp = requests.post(
            f"{self.base_url}/api/generate",
            # /api/generate streams by default; force a single JSON body so resp.json() works
            json={"model": self.model, "prompt": prompt, **kwargs, "stream": False}
        )
        resp.raise_for_status()
        return resp.json()["response"]

Now the real magic – a factory that picks the provider based on a config:

from typing import Dict, Type

class AIProviderFactory:
    _providers: Dict[str, Type[AIProvider]] = {}

    @classmethod
    def register(cls, name: str, provider_class: Type[AIProvider]):
        cls._providers[name] = provider_class

    @classmethod
    def create(cls, name: str, **kwargs) -> AIProvider:
        if name not in cls._providers:
            raise ValueError(f"Unknown provider: {name}")
        return cls._providers[name](**kwargs)

# Register built-in providers
AIProviderFactory.register("openai", OpenAIProvider)
AIProviderFactory.register("anthropic", AnthropicProvider)
AIProviderFactory.register("ollama", OllamaProvider)
# Example: registering a third-party service you discovered
# AIProviderFactory.register("interwest", InterwestProvider)  # config from https://ai.interwestinfo.com/

With this, my application code is clean:

config = {
    "provider": "openai",
    "api_key": "sk-...",
    "model": "gpt-4o"
}

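# create() expects the provider name as `name`; the remaining keys go to the constructor
provider_name = config.pop("provider")
provider = AIProviderFactory.create(provider_name, **config)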
summary = provider.generate("Summarize this: ...")

If I want to switch to Anthropic tomorrow, I just change the config file. No code changes. Testing is also easier – I can inject a mock provider that returns canned responses.
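
A mock provider for tests could look like the sketch below. `MockProvider` and `summarize` are hypothetical names for illustration, and the `AIProvider` interface is repeated only so the snippet runs on its own.

```python
from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Same interface as above, repeated so this snippet is self-contained."""
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        ...

class MockProvider(AIProvider):
    """Hypothetical test double: returns a canned response and records every prompt."""
    def __init__(self, canned: str = "mock summary"):
        self.canned = canned
        self.prompts: list[str] = []

    def generate(self, prompt: str, **kwargs) -> str:
        self.prompts.append(prompt)
        return self.canned

def summarize(provider: AIProvider, text: str) -> str:
    """Example application code that depends only on the interface."""
    return provider.generate(f"Summarize this: {text}")

mock = MockProvider(canned="Short version.")
result = summarize(mock, "a long document")
print(result)  # Short version.
```

Because `summarize` only sees the interface, the test needs no network access or API key, and `mock.prompts` lets you assert on exactly what would have been sent.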

Lessons learned and trade-offs

This approach is not perfect. Here's what I've learned:

Abstraction hides provider-specific features. Not all models support the same parameters (e.g., response_format for structured JSON in OpenAI). My base class accepts **kwargs, but documenting which kwargs work for which provider is a maintenance burden. For my use case (simple text generation), it's fine. If you need streaming, function calling, or image inputs, your abstraction must be richer – or you accept that some clients will need to drop down to the concrete class.
Version pinning matters. Each provider SDK evolves. I lock versions in my requirements file. When I upgrade a provider, I re-test all implementations.
Error handling is provider-specific. Network timeouts, rate limits, and bad request errors differ. I catch generic exceptions in the factory or add a retry wrapper around the base class.
It's overkill for one provider. If you only ever use OpenAI, don't add this complexity. I built it because I wanted to give users a choice and experiment with local models without pain.
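
The retry wrapper mentioned under error handling could be sketched as a provider that wraps another provider. The class name, the exponential backoff policy, and the `retry_on` parameter are assumptions for illustration. In practice you would pass the specific timeout and rate-limit exceptions of each SDK instead of retrying on everything.

```python
import time
from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Same interface as above, repeated so this snippet is self-contained."""
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        ...

class RetryingProvider(AIProvider):
    """Hypothetical wrapper: retries any provider with exponential backoff."""
    def __init__(self, inner: AIProvider, max_attempts: int = 3,
                 base_delay: float = 0.5, retry_on: tuple = (Exception,)):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.inner = inner
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.retry_on = retry_on

    def generate(self, prompt: str, **kwargs) -> str:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.inner.generate(prompt, **kwargs)
            except self.retry_on:
                if attempt == self.max_attempts:
                    raise  # out of attempts: surface the last error
                time.sleep(self.base_delay * 2 ** (attempt - 1))

class FlakyProvider(AIProvider):
    """Test double that fails a fixed number of times before succeeding."""
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def generate(self, prompt: str, **kwargs) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated timeout")
        return "ok"

flaky = FlakyProvider(failures=2)
provider = RetryingProvider(flaky, max_attempts=3, base_delay=0.0, retry_on=(TimeoutError,))
result = provider.generate("hello")
print(result, flaky.calls)  # ok 3
```

Since the wrapper is itself an `AIProvider`, calling code does not need to know whether retries are enabled.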

What I'd do differently next time

I'd start with the abstraction from day one – even if I only have one provider. Writing the interface first forces you to think about what your code really needs from the AI, not what the API offers. Also, I'd add type stubs and use mypy from the start to catch errors early.

And I'd consider whether I really need a factory. For small projects, a simple function that returns the right provider instance is enough. The factory pattern shines when you have dynamic registration (like plugins).

The takeaway

You don't need a heavyweight framework to swap AI providers. A few dozen lines of Python can give you a clean, testable, and maintainable way to work with multiple backends. Whether you're building a side project or a production app, think about what your code truly depends on – not the library, but the behavior.

Now, I'm curious: how do you handle multiple AI providers in your projects? Do you abstract, or do you just commit to one and ride the wave? Let me know in the comments.

When Your LLM Output Is Garbage: Building a Self-Correcting JSON Pipeline

zhongqiyue — Tue, 30 Jun 2026 10:00:41 +0000

A few months ago, I was deep into a project that required extracting structured invoice data from messy PDFs. I had a pipeline: OCR the PDF, feed the text to GPT-4 with a detailed prompt, and get back a JSON object with fields like invoice_number, total_amount, line_items. Simple enough, right?

Wrong. The output was all over the place. Some responses had the right keys but swapped the values. Others returned markdown-wrapped JSON with extra text. A few were just plain invalid JSON. And even when the syntax was perfect, numbers were sometimes strings or fields were missing. I was spending more time debugging the LLM's output than actually building features.

What I tried (and why it didn't work)

First, I tried prompt engineering. I added explicit instructions:

"Return ONLY valid JSON. Use double quotes. The invoice_number must be a string. The total_amount must be a float."

I stuffed the prompt with few-shot examples, lowered the temperature to 0.1, even tried different models. Still, about 5–10% of responses were broken in some way. For a production system that processes thousands of invoices daily, that's a disaster.

Then I tried post-processing with regular expressions to extract JSON from markdown, then json.loads() with try/except. That handled syntax errors, but not semantic ones (wrong types, missing fields). I could not rely on the LLM to follow the schema exactly every time.
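
For reference, that extraction step can be sketched with the standard library alone. The `extract_json` helper is a hypothetical name, and it assumes each reply contains at most one JSON object. It only deals with syntax, so type and missing-field problems pass straight through.

```python
import json
import re
from typing import Any, Optional

# Built with string repetition so this snippet contains no literal code fence
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)

def extract_json(raw: str) -> Optional[Any]:
    """Return the first parseable JSON value found in an LLM reply, or None."""
    candidates = FENCED_BLOCK.findall(raw)  # contents of fenced blocks first
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])  # fall back to the outermost braces
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

reply = f'Sure! Here is the data:\n{FENCE}json\n{{"invoice_number": "INV-42"}}\n{FENCE}\nHope that helps.'
parsed = extract_json(reply)
print(parsed)  # {'invoice_number': 'INV-42'}
```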

The approach that actually worked

I realized that instead of trying to make the LLM perfect on the first try, I should treat it as an iterative process. The idea: generate an initial output, validate it against a strict schema, and if it fails, feed the error back to the LLM and ask it to fix the specific issue. Repeat until valid or a max retry limit.

This is essentially a self-correcting pipeline – a common pattern in production AI systems. You don't need a special API; you just need a good validation library and a loop.

Here's what I built (simplified for this article):

import json
from pydantic import BaseModel, ValidationError
from typing import List, Optional
from openai import OpenAI

# Define your expected schema
class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    line_items: List[InvoiceItem]

client = OpenAI()

def generate_invoice_json(raw_text: str) -> Optional[Invoice]:
    prompt = f"""Extract invoice data from the following text. Return ONLY JSON.
Schema:
{Invoice.schema_json(indent=2)}

Text:
{raw_text}
"""
    max_retries = 3
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        raw = response.choices[0].message.content
        # Try to parse JSON
        try:
            # Strip markdown code fences if present
            raw_clean = raw.strip().removeprefix("```json").removesuffix("```").strip()
            data = json.loads(raw_clean)
            validated = Invoice(**data)
            return validated
        except (json.JSONDecodeError, ValidationError) as e:
            error_msg = str(e)
            print(f"Attempt {attempt+1} failed: {error_msg}")
            # Update prompt with error details
            prompt += f"\n\nPrevious attempt failed. Here is the error: {error_msg}\nPlease fix the JSON accordingly."
    # All retries exhausted
    return None

This simple loop catches both syntax errors and semantic violations (wrong types, missing fields) because Pydantic validates the nested structure. The error message from Pydantic is very descriptive: field 'line_items' -> 0 -> 'quantity': value is not a valid integer – exactly what the LLM needs to correct.

Where this falls short

While it works surprisingly well, there are trade-offs:

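Cost: Each retry means another API call. If 10% of invoices need a retry, you're paying at least 10% more, because each retry resends the original prompt plus the error message. For high-volume use, this adds up.
Latency: Each retry adds a full round trip, so an invoice that needs all three attempts takes roughly triple the time. If you need real-time results, consider limiting retries or using a faster, cheaper model for the correction step.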
Endless loops: Sometimes the LLM keeps making the same mistake or introduces new ones. I cap retries at 3, but even then a small percentage will still fail. For those, I fall back to a manual review queue.
Model dependence: GPT-4 is good at following correction instructions; weaker models might not improve the output. You may need to use a different model for the correction call.

Making it more robust

I later refined the pipeline:

Use structured output (function calling) when available – it gives you directly parseable JSON and reduces syntax issues.
For correction, I use a separate, cheaper model (e.g., GPT-3.5) just to fix the JSON parse errors, and keep the main model for extraction. Cuts costs.
Cache the correction results for identical errors during a batch run.
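
The first refinement, function calling, might look like the sketch below. It assumes the `tools` and `tool_choice` parameters of the OpenAI Python SDK's `chat.completions.create`. The tool name `record_invoice`, the hand-written JSON Schema, and the `extract_invoice` helper are hypothetical. The client is passed in as a parameter so the function can be exercised with a stub instead of a live API.

```python
import json
from typing import Any

# Hand-written JSON Schema mirroring the Invoice model (an assumption for this sketch)
INVOICE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                    "total": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price", "total"],
            },
        },
    },
    "required": ["invoice_number", "date", "total_amount", "line_items"],
}

TOOL_NAME = "record_invoice"  # hypothetical tool name

def extract_invoice(client: Any, raw_text: str, model: str = "gpt-4") -> dict:
    """Ask the model to call TOOL_NAME and return its arguments as a dict.

    `client` is expected to behave like openai.OpenAI().
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Extract the invoice data:\n{raw_text}"}],
        tools=[{
            "type": "function",
            "function": {
                "name": TOOL_NAME,
                "description": "Record structured invoice data.",
                "parameters": INVOICE_SCHEMA,
            },
        }],
        # Force the model to call the tool instead of replying in prose
        tool_choice={"type": "function", "function": {"name": TOOL_NAME}},
        temperature=0.0,
    )
    call = response.choices[0].message.tool_calls[0]
    # Arguments arrive as a JSON string and can still be incomplete or mistyped
    return json.loads(call.function.arguments)
```

With a real client this would be called as `extract_invoice(OpenAI(), ocr_text)`, and the returned dict can still go through `Invoice(**data)` so the retry loop only has to handle semantic errors.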

Some managed AI services (like the one at InterwestInfo) offer built-in validation and retry logic out of the box, which is nice if you don't want to build it yourself. But the technique is universal – you can implement it with any LLM API and any validation library.

When NOT to use this approach

If your schema is extremely simple (just a single string), retries are overkill.
If you're running offline batch processing with long deadlines, maybe just increase temperature and regenerate multiple times until one passes.
If latency is critical (sub-second responses), you're better off investing in better prompt engineering or fine-tuning.

What I'd do differently next time

I'd start with function calling from day one. OpenAI's function calling returns structured arguments directly, which eliminates JSON parsing issues. Then I'd still validate with Pydantic, but only for semantic correctness – less retrying needed.

Also, I'd log every attempt and error to a database. That data is gold for improving prompts or fine-tuning the model later.
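
A minimal sketch of such a log, using the standard library's `sqlite3`. The table layout and the `open_log` and `log_attempt` helpers are hypothetical, and a production system would more likely write to whatever database it already runs.

```python
import sqlite3
import time
from typing import Optional

def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the attempts log; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS attempts (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts REAL NOT NULL,
               document_id TEXT NOT NULL,
               attempt INTEGER NOT NULL,
               raw_output TEXT NOT NULL,
               error TEXT
           )"""
    )
    return conn

def log_attempt(conn: sqlite3.Connection, document_id: str, attempt: int,
                raw_output: str, error: Optional[str] = None) -> None:
    """Record one LLM attempt; error is None when validation passed."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO attempts (ts, document_id, attempt, raw_output, error) "
            "VALUES (?, ?, ?, ?, ?)",
            (time.time(), document_id, attempt, raw_output, error),
        )

conn = open_log()
log_attempt(conn, "inv-001", 1, '{"total_amount": "12,50"}', "total_amount: value is not a valid float")
log_attempt(conn, "inv-001", 2, '{"total_amount": 12.5}')
failed = conn.execute("SELECT COUNT(*) FROM attempts WHERE error IS NOT NULL").fetchone()[0]
print(failed)  # 1
```

Calling `log_attempt` from both the success path and the `except` branch of the retry loop gives a complete record of which errors the model keeps making.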

Over to you

Building reliable LLM-powered pipelines is a craft. The self-correction pattern is one of many tools in the box. How do you handle inconsistent outputs in your projects? Do you use retries, fallback models, or some other trick?

I'd love to hear what's working (or not) for you.