DEV Community: rag

RAG Knowledge Base: Building with LangGraph and ChromaDB

lijesom9-create — Tue, 30 Jun 2026 12:48:30 +0000

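RAG Knowledge Base in Practice: Building a Personal Knowledge Assistant from Scratch with LangGraph + ChromaDB

RAG (Retrieval-Augmented Generation) is a key technique for giving AI an "external memory". Based on the complete implementation in the education-agent project and on LangChain's official best practices, this article walks you step by step through building a personal knowledge base that supports document upload and question answering.

Preface

Have you ever run into this problem?

You ask ChatGPT: "What is our company's leave request process?"
ChatGPT: "Sorry, I don't know your company's specific policies..."

Why? Because a large model's knowledge is "frozen" in its training data, and it knows nothing about your company's internal documents.

RAG (Retrieval-Augmented Generation) is the solution.

Put simply: first find the relevant content in your documents, then have the AI answer the question based on that content.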
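
To make the retrieve-then-answer idea concrete before the real components appear, here is a toy sketch. It is not the project's code: word overlap stands in for real retrieval, the two sample documents are made up, and no LLM is called. The function names `retrieve` and `build_prompt` are illustrative only.

```python
# Toy retrieve-then-generate flow: no vector database, no LLM call.
def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many question words they contain."""
    words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Put the retrieved text in front of the question for the LLM."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Made-up sample documents, for illustration only
docs = [
    "Leave requests must be submitted three days in advance.",
    "The cafeteria opens at 8 am on weekdays.",
]
question = "How many days in advance must leave requests be submitted?"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

The rest of the article replaces each piece of this toy with a real one: parsing and chunking produce the documents, embeddings and BM25 replace the word-overlap ranking, and an LLM consumes the prompt.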

RAG Architecture Overview

User question
    │
    ▼
┌─────────────────────────────────────┐
│     Query understanding module      │
│  • Intent recognition               │
│  • Query rewriting                  │
│  • Sub-question decomposition       │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│          Retrieval module           │
│  • Keyword search (50%)             │
│  • Vector search (30%)              │
│  • BM25 search (20%)                │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│          Reranking module           │
│  • CrossEncoder reranking           │
│  • Deduplication, filtering         │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│          Generation module          │
│  • Answer from retrieved results    │
│  • Source citations                 │
└─────────────────────────────────────┘

Document Processing Pipeline

Step 1: Document parsing

Multiple document formats are supported:

# document/parser.py
from pathlib import Path
import PyPDF2
from docx import Document

class DocumentParser:
    """Document parser"""

    def parse(self, file_path: str) -> str:
        """Parse a document and return its plain text"""
        path = Path(file_path)
        suffix = path.suffix.lower()

        parsers = {
            ".pdf": self._parse_pdf,
            ".docx": self._parse_docx,
            ".txt": self._parse_txt,
            ".md": self._parse_markdown,
        }

        parser = parsers.get(suffix)
        if not parser:
            raise ValueError(f"Unsupported file format: {suffix}")

        return parser(file_path)

    def _parse_pdf(self, path: str) -> str:
        """Parse a PDF"""
        with open(path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            text = ""
            for page in reader.pages:
                # Guard against pages with no extractable text (e.g. scanned images)
                text += (page.extract_text() or "") + "\n"
        return text

    def _parse_docx(self, path: str) -> str:
        """Parse a Word document"""
        doc = Document(path)
        return "\n".join([para.text for para in doc.paragraphs])

    def _parse_txt(self, path: str) -> str:
        """Parse plain text"""
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    def _parse_markdown(self, path: str) -> str:
        """Parse Markdown (read as-is, markup included)"""
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

Step 2: Smart chunking

Chunking is a critical step in RAG. Poor chunking seriously degrades retrieval quality.

# document/chunker.py
class SmartChunker:
    """Smart chunker"""

    def __init__(self, chunk_size=500, overlap=50):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: dict = None) -> list[dict]:
        """Split text into chunks at paragraph boundaries"""
        # Split into paragraphs first
        paragraphs = text.split("\n\n")

        chunks = []
        current_chunk = ""

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            # If adding the new paragraph would exceed the limit, save the current chunk
            if len(current_chunk) + len(para) > self.chunk_size:
                if current_chunk:
                    chunks.append(self._create_chunk(current_chunk, metadata))

                # The new chunk starts with the tail of the previous one (overlap)
                if self.overlap > 0 and current_chunk:
                    overlap_text = current_chunk[-self.overlap:]
                    current_chunk = overlap_text + "\n\n" + para
                else:
                    current_chunk = para
            else:
                current_chunk += "\n\n" + para if current_chunk else para

        # Save the final chunk
        if current_chunk:
            chunks.append(self._create_chunk(current_chunk, metadata))

        return chunks

    def _create_chunk(self, text: str, metadata: dict = None) -> dict:
        """Build a chunk record"""
        return {
            "text": text.strip(),
            "metadata": metadata or {},
            "length": len(text)
        }
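
One gap worth noting: the chunker above only breaks at paragraph boundaries, so a single paragraph longer than `chunk_size` ends up as one oversized chunk. A minimal sketch of a fallback, assuming a plain character-based sliding window is acceptable, follows. The helper name `split_long_paragraph` is hypothetical and not part of the project.

```python
def split_long_paragraph(para: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Cut one oversized paragraph into windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    pieces = []
    for start in range(0, len(para), step):
        pieces.append(para[start:start + chunk_size])
        if start + chunk_size >= len(para):
            break  # this window already reached the end of the paragraph
    return pieces


pieces = split_long_paragraph("abcdefghij" * 3, chunk_size=12, overlap=4)
print(pieces)
```

Each window repeats the last `overlap` characters of the previous one, which mirrors the overlap the chunker applies between paragraphs. Splitting on sentence boundaries instead of raw character offsets would usually read better, at the cost of a language-specific sentence splitter.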

Step 3: Embedding

# document/embedder.py
from sentence_transformers import SentenceTransformer

class Embedder:
    """Text embedder"""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts"""
        return self.model.encode(texts).tolist()

    def embed_single(self, text: str) -> list[float]:
        """Embed a single text"""
        return self.model.encode([text])[0].tolist()

Step 4: Storing in ChromaDB

# vectorstore/chroma_store.py
import chromadb

class ChromaVectorStore:
    """ChromaDB vector store"""

    def __init__(self, collection_name: str = "knowledge"):
        self.client = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

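    def add_documents(self, chunks: list[dict], embeddings: list[list[float]]):
        """Add document chunks"""
        # Offset IDs by the current collection size so that chunks from a later
        # file do not reuse the IDs of chunks that are already stored
        start = self.collection.count()
        ids = [f"doc_{start + i}" for i in range(len(chunks))]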
        documents = [chunk["text"] for chunk in chunks]
        metadatas = [chunk["metadata"] for chunk in chunks]

        self.collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=documents,
            metadatas=metadatas
        )

    def search(self, query_embedding: list[float], top_k: int = 5) -> list[dict]:
        """Vector search; "score" is the cosine distance (lower means more similar)"""
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        return [
            {
                "text": doc,
                "score": score,
                "metadata": meta
            }
            for doc, score, meta in zip(
                results["documents"][0],
                results["distances"][0],
                results["metadatas"][0]
            )
        ]
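
Chroma expects every ID in a collection to be unique, and IDs built only from a chunk's position repeat from one file to the next. One option, sketched here as an assumption rather than the project's actual scheme, is to derive the ID from the chunk's source, position, and text, so the same chunk always gets the same ID. The helper name `make_chunk_id` and the sample file name are hypothetical.

```python
import hashlib


def make_chunk_id(source: str, index: int, text: str) -> str:
    """Derive a stable ID from the source path, chunk position, and chunk text."""
    payload = f"{source}\n{index}\n{text}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"chunk_{digest[:16]}"  # 16 hex characters keep IDs short but distinct


ids = [
    make_chunk_id("handbook.pdf", i, text)
    for i, text in enumerate(["intro", "leave policy"])
]
print(ids)
```

With stable IDs, writing through the collection's `upsert` method instead of `add` makes re-ingesting an unchanged file overwrite its existing chunks rather than store them twice.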

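Hybrid Retrieval Strategy

Why hybrid retrieval?

Each retrieval method has limitations on its own:

Keyword search: exact matching, but no understanding of semantics
Vector search: understands semantics, but can miss exact matches
BM25 search: a statistical method that is balanced but excels at neither

Hybrid retrieval combines their strengths: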

# retrieval/hybrid_retriever.py
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    """Hybrid retriever"""

    def __init__(self, vector_store, keyword_weight=0.5, 
                 vector_weight=0.3, bm25_weight=0.2):
        self.vector_store = vector_store
        self.keyword_weight = keyword_weight
        self.vector_weight = vector_weight
        self.bm25_weight = bm25_weight

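        # BM25 index
        self.bm25 = None
        self.corpus = []

    def build_index(self, documents: list[str]):
        """Build the BM25 index (replaces any previous index)"""
        self.corpus = documents
        # Whitespace tokenization suits space-delimited languages; text without
        # spaces between words (e.g. Chinese) needs a word segmenter here
        tokenized_corpus = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized_corpus)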

    def search(self, query: str, query_embedding: list[float],
               top_k: int = 5) -> list[dict]:
        """Hybrid search"""
        # 1. Keyword search
        keyword_results = self._keyword_search(query, top_k * 2)

        # 2. Vector search
        vector_results = self._vector_search(query_embedding, top_k * 2)

        # 3. BM25 search
        bm25_results = self._bm25_search(query, top_k * 2)

        # 4. Merge the scores
        merged = self._merge_scores(keyword_results, vector_results, bm25_results)

        # 5. Sort and return
        sorted_results = sorted(merged.items(), key=lambda x: x[1], reverse=True)

        return [
            {"text": doc, "score": score}
            for doc, score in sorted_results[:top_k]
        ]

    def _keyword_search(self, query: str, top_k: int) -> dict:
        """Keyword search: fraction of query terms found in each document"""
        results = {}
        query_terms = query.lower().split()

        for doc in self.corpus:
            score = sum(1 for term in query_terms if term in doc.lower())
            if score > 0:
                results[doc] = score / len(query_terms)

        return dict(sorted(results.items(), key=lambda x: x[1], reverse=True)[:top_k])

    def _vector_search(self, query_embedding: list[float], top_k: int) -> dict:
        """Vector search"""
        results = self.vector_store.search(query_embedding, top_k)
        return {r["text"]: 1 - r["score"] for r in results}  # convert cosine distance to similarity

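    def _bm25_search(self, query: str, top_k: int) -> dict:
        """BM25 search"""
        if not self.bm25:
            return {}

        scores = self.bm25.get_scores(query.split())
        doc_scores = list(zip(self.corpus, scores))
        doc_scores.sort(key=lambda x: x[1], reverse=True)
        top = doc_scores[:top_k]

        # Raw BM25 scores are unbounded, so scale them into [0, 1] to make them
        # comparable with the keyword and vector scores before weighting
        max_score = max((score for _, score in top), default=0)
        if max_score <= 0:
            return {}
        return {doc: float(score / max_score) for doc, score in top}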

    def _merge_scores(self, keyword, vector, bm25) -> dict:
        """Merge scores as a weighted sum"""
        all_docs = set(keyword.keys()) | set(vector.keys()) | set(bm25.keys())

        merged = {}
        for doc in all_docs:
            score = (
                keyword.get(doc, 0) * self.keyword_weight +
                vector.get(doc, 0) * self.vector_weight +
                bm25.get(doc, 0) * self.bm25_weight
            )
            merged[doc] = score

        return merged
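
A weighted sum only behaves as intended when the three scores live on comparable scales. A different technique that sidesteps score scales entirely is reciprocal rank fusion (RRF), which uses only each document's position in every ranked list. The sketch below is an alternative to the weighted merge above, not the article's method. The function name is illustrative, and `k = 60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))


# Made-up rankings from the three retrievers, best match first
keyword_hits = ["doc_a", "doc_b"]
vector_hits = ["doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits, bm25_hits])
print(list(fused))  # doc_b first: it ranks highly in all three lists
```

The trade-off is that RRF ignores how large the score gaps are, so per-retriever weights like 0.5/0.3/0.2 no longer apply directly.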

Reranking

CrossEncoder reranking

# retrieval/reranker.py
from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    """CrossEncoder reranker"""

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: list[dict], top_k: int = 3) -> list[dict]:
        """Rerank documents by relevance to the query"""
        # Build query-document pairs
        pairs = [(query, doc["text"]) for doc in documents]

        # Compute relevance scores
        scores = self.model.predict(pairs)

        # Sort by score
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        return [
            {**doc, "rerank_score": float(score)}
            for doc, score in scored_docs[:top_k]
        ]

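Query Understanding

Intent recognition

# retrieval/query_understanding.py
class QueryUnderstanding:
    """Query understanding module"""

    def __init__(self, llm):
        self.llm = llm

    async def understand(self, query: str) -> dict:
        """Analyze the question"""
        prompt = f"""Please analyze the following question:

Question: {query}

Please return:
1. Intent type (factual/how-to/comparison/opinion)
2. Key entities
3. A rewritten query (a version better suited to retrieval)
4. Whether it needs to be broken down into sub-questions"""

        response = await self.llm.ainvoke(prompt)
        # parse_understanding is a helper that is not shown in this article
        return parse_understanding(response)

    async def rewrite_query(self, query: str) -> str:
        """Rewrite the query so it is better suited to retrieval"""
        prompt = f"""Please rewrite the following question into a version better suited to search:

Original question: {query}

Requirements:
1. Remove colloquial expressions
2. Add keywords
3. Preserve the original meaning"""

        response = await self.llm.ainvoke(prompt)
        # Chat models return a message object, so read its text from .content
        return getattr(response, "content", response).strip()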
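
The `understand` method hands the model's reply to `parse_understanding`, which the article does not define. The following is a purely hypothetical version, assuming the prompt is extended to request a JSON object. The key names (`intent`, `entities`, `rewritten_query`, `needs_decomposition`) are my own invention, not the project's.

```python
import json


def parse_understanding(response) -> dict:
    """Hypothetical parser: expects the LLM to reply with a JSON object."""
    # Chat models return a message object; plain-text LLMs return a string
    text = getattr(response, "content", response)
    defaults = {
        "intent": "factual",
        "entities": [],
        "rewritten_query": "",
        "needs_decomposition": False,
    }
    try:
        # Tolerate chatter around the JSON by slicing from the first { to the last }
        start, end = text.index("{"), text.rindex("}") + 1
        parsed = json.loads(text[start:end])
    except (ValueError, TypeError, AttributeError):
        return defaults  # no usable JSON in the reply
    return {**defaults, **parsed}


reply = 'Here is the analysis: {"intent": "how-to", "entities": ["leave"]}'
result = parse_understanding(reply)
print(result)
```

Falling back to defaults keeps the pipeline running when the model ignores the format, at the cost of silently losing the analysis for that query.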

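Complete RAG Pipeline

# rag/pipeline.py
from langchain_openai import ChatOpenAI

class RAGPipeline:
    """Complete RAG pipeline"""

    def __init__(self):
        self.parser = DocumentParser()
        self.chunker = SmartChunker(chunk_size=500, overlap=50)
        self.embedder = Embedder()
        self.vector_store = ChromaVectorStore()
        self.retriever = HybridRetriever(self.vector_store)
        self.reranker = CrossEncoderReranker()
        # Create the LLM first so the query-understanding module can share it
        self.llm = ChatOpenAI(model="gpt-4")
        self.query_understander = QueryUnderstanding(self.llm)
        # The hybrid retriever returns only text and score, so keep a
        # text -> metadata map here for source attribution
        self.chunk_metadata = {}

    async def ingest_document(self, file_path: str):
        """Ingest a document"""
        print(f"📄 Parsing document: {file_path}")
        text = self.parser.parse(file_path)

        print("✂️ Chunking...")
        chunks = self.chunker.chunk(text, metadata={"source": file_path})

        print("🔢 Embedding...")
        embeddings = self.embedder.embed([c["text"] for c in chunks])

        print("💾 Storing in the vector database...")
        self.vector_store.add_documents(chunks, embeddings)

        for c in chunks:
            self.chunk_metadata[c["text"]] = c["metadata"]

        # Rebuild the BM25 index over every chunk ingested so far,
        # not only the chunks of the latest file
        self.retriever.build_index(list(self.chunk_metadata))

        print(f"✅ Done! Processed {len(chunks)} chunks")

    async def query(self, question: str) -> dict:
        """Answer a question"""
        # 1. Understand the question
        understanding = await self.query_understander.understand(question)
        rewritten_query = await self.query_understander.rewrite_query(question)

        # 2. Embed the query
        query_embedding = self.embedder.embed_single(rewritten_query)

        # 3. Hybrid retrieval
        results = self.retriever.search(rewritten_query, query_embedding, top_k=10)
        if not results:
            return {"answer": "No relevant content was found.", "sources": [], "confidence": 0.0}

        # 4. Rerank
        reranked = self.reranker.rerank(question, results, top_k=3)

        # 5. Generate the answer; label each passage with its source so the model can cite it
        sources = [self.chunk_metadata.get(r["text"], {}).get("source", "unknown") for r in reranked]
        context = "\n\n".join(f"[{src}]\n{r['text']}" for src, r in zip(sources, reranked))

        prompt = f"""Answer the question based on the reference material below.

Reference material:
{context}

Question: {question}

Requirements:
1. Base the answer on the reference material; do not make things up
2. Cite sources (note which document each point comes from)
3. If the reference material is insufficient to answer, say that more information is needed"""

        answer = await self.llm.ainvoke(prompt)

        return {
            "answer": answer.content,  # ainvoke returns a message object
            "sources": list(dict.fromkeys(sources)),  # de-duplicate, keep order
            # Mean rerank score; whether it falls in [0, 1] depends on the
            # CrossEncoder model's output activation
            "confidence": sum(r["rerank_score"] for r in reranked) / len(reranked)
        }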

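Usage

# Usage example
import asyncio

async def main():
    rag = RAGPipeline()

    # Ingest documents
    await rag.ingest_document("company_policy.pdf")
    await rag.ingest_document("tech_docs.md")
    await rag.ingest_document("meeting_notes.docx")

    # Query
    result = await rag.query("What is the company's leave request process?")

    print(result["answer"])
    # According to Chapter 3 of company_policy.pdf:
    # 1. Leave must be requested 3 days in advance
    # 2. Leave of up to 3 days is approved by the direct supervisor
    # 3. Leave of more than 3 days requires the department manager's approval
    # ...

    print(result["sources"])
    # ['company_policy.pdf']

    print(result["confidence"])
    # 0.85

# await only works inside a coroutine, so run everything through asyncio.run
asyncio.run(main())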

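Optimization Tips

1. Choosing a chunk size

Chunk size (characters)	Use case	Pros	Cons
200-300	Precise Q&A	Precise retrieval	May lose context
500-800	General Q&A	Balanced	Moderate on both counts
1000+	Long-document understanding	Complete context	Less precise retrieval

2. Choosing an embedding model

Model	Dimensions	Notes
all-MiniLM-L6-v2	384	Fast, good quality (trained mainly on English text)
text-embedding-ada-002	1536	OpenAI hosted API, strong quality
bge-large-zh-v1.5	1024	Optimized for Chinese

3. Tuning retrieval weights

# Adjust the weights to fit the scenario
# Exact-match scenarios (e.g. code search)
retriever = HybridRetriever(
    vector_store,        # an existing ChromaVectorStore instance
    keyword_weight=0.7,  # raise the keyword weight
    vector_weight=0.2,
    bm25_weight=0.1
)

# Semantic scenarios (e.g. open-ended Q&A)
retriever = HybridRetriever(
    vector_store,
    keyword_weight=0.3,
    vector_weight=0.5,  # raise the vector weight
    bm25_weight=0.2
)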

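Summary

The core components of a RAG system:

Component	Role	Key technology
Document parsing	Multi-format support	PyPDF2, python-docx
Smart chunking	Preserve semantics	Paragraph-based chunking, overlap
Embedding	Semantic representation	Sentence Transformers
Vector storage	Efficient retrieval	ChromaDB
Hybrid retrieval	Combine strengths	Keyword + vector + BM25
Reranking	Fine-grained relevance ranking	CrossEncoder
Query understanding	Query optimization	Intent recognition, query rewriting

Coming up next

"The Power of Hybrid Retrieval: A Deep Dive into Keyword + Vector + BM25 Three-Way Fusion": we will dig into how each retrieval algorithm works and compare their results.

RAG is a key technique for giving AI an "external memory". Once you have mastered RAG, your AI is no longer an "amnesiac" chatbot.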

tags: rag, langchain, chromadb, vector-search, python
series: rag-knowledge-system

How to Give Your AI Agent Access to Upwork Data

AlterLab — Tue, 30 Jun 2026 11:21:48 +0000

This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.

TL;DR

Use AlterLab's Extract API to turn Upwork job pages into structured JSON. Your AI agent can call the API directly, receive clean data, and feed it into an LLM for market intelligence, skill tracking, or rate monitoring—no HTML parsing needed.

Why AI agents need Upwork data

AI agents benefit from fresh, structured web data for several agentic use cases:

Freelance market intelligence: Track demand for skills, average rates, and job volume over time.
Skill demand monitoring: Identify which technologies or services are gaining traction in the freelance marketplace.
Rate analysis: Compare compensation trends across regions or experience levels to inform pricing strategies.

These insights feed RAG pipelines, tool calls, and knowledge base updates that keep agents current without manual scraping.

Why raw HTTP requests fail for agents

Direct HTTP calls to Upwork often break agent pipelines:

Rate limiting: IP bans or CAPTCHAs cause failed requests and wasted token budgets on retries.
JavaScript rendering: Modern pages rely on client‑side code; raw HTML lacks the data you need.
Bot detection: Headless browser signatures trigger blocks, requiring complex mitigation.
Parsing overhead: Agents spend cycles extracting fields from noisy HTML instead of reasoning.

The result is brittle pipelines, higher latency, and increased cost per successful data point.

Connecting your agent to Upwork via AlterLab

AlterLab handles anti‑bot measures, rendering, and extraction so your agent receives structured output. Use the Extract API for schema‑driven JSON or the Scrape API for raw HTML when you need full page control.

Structured extraction with the Extract API

Define a schema that matches the Upwork job fields you need—title, price, description, etc.—and let AlterLab return clean data.

```python title="agent_upwork_extract.py" {3-8}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Request structured data from an Upwork job listing
result = client.extract(
    url="https://www.upwork.com/jobs/~0123456789abcdef",
    schema={
        "title": "string",
        "price": "string",
        "description": "string",
        "skills": "list[string]"
    }
)

# result.data is a dict ready for your LLM
print(result.data)
```

```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/extract/templates/{template_id} \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.upwork.com/jobs/~0123456789abcdef",
    "schema": {
      "title": "string",
      "price": "string",
      "description": "string",
      "skills": "list[string]"
    }
  }'
```

Both examples return a JSON object that your agent can pass directly to an LLM call, saving tokens and eliminating parsing logic.

For cases where you need the full rendered page (e.g., to run custom logic), use the Scrape API:

```python title="agent_upwork_scrape.py" {3-6}
result = client.scrape(
    url="https://www.upwork.com/jobs/~0123456789abcdef",
    formats=["html"]  # returns cleaned HTML ready for downstream parsing
)
```

Refer to the [Extract API docs](/docs/extract) for schema options and rate limits.

## Using the Search API for Upwork queries
When you need to discover jobs matching a query (e.g., “Python Django”), AlterLab’s Search API lets you retrieve results without building a crawler.



```python title="agent_upwork_search.py" {3-7}
# Assume you have previously created a search template via the dashboard or API
search_id = "upwork-python-jobs"

result = client.search(
    search_id=search_id,
    params={"q": "Python Django", "page": 1}
)

# result.data contains an array of structured job objects
for job in result.data["items"]:
    print(job["title"], job["price"])
```

```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/search/upwork-python-jobs \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "Python Django", "page": 1}'
```

The Search API returns the same structured format as Extract, making it easy to plug into agentic workflows.

## MCP integration
AlterLab provides an MCP server that exposes its APIs as tool calls for agents built with Claude, GPT, or Cursor. Register the MCP server in your agent’s toolkit and invoke Upwork extraction as a standard function call. See the [AlterLab for AI Agents](https://alterlab.io/glossary/user-agent) glossary for setup details.

## Building a freelance market intelligence pipeline
Here is an end‑to‑end example showing how an agent can collect Upwork data, enrich it, and store insights.

1. **Agent triggers a tool call** – The LLM decides it needs current freelance rates for “React Native”.
2. **AlterLab fetches and extracts** – The agent calls the Extract API with a schema for title, price, and skills. AlterLab handles rendering, anti‑bot, and returns JSON.
3. **Agent processes the data** – The structured output is passed to a summarization LLM or stored in a knowledge base.
4. **Pipeline repeats on a schedule** – Using cron or an internal scheduler, the agent refreshes the dataset hourly.



```python title="freelance_pipeline.py" {3-15}
import alterlab
from openai import OpenAI

alterlab_client = alterlab.Client("YOUR_ALTERLAB_KEY")
llm_client = OpenAI(api_key="YOUR_OPENAI_KEY")

def fetch_upwork_jobs(query: str, limit: int = 20):
    """Retrieve structured job data for a given query."""
    search_id = "upwork-freelance-search"
    resp = alterlab_client.search(
        search_id=search_id,
        params={"q": query, "limit": limit}
    )
    return resp.data.get("items", [])

def enrich_with_llm(jobs):
    """Ask the LLM to extract trends from raw job listings."""
    prompt = (
        "Analyze the following Upwork job listings and summarize:\n"
        "- Median hourly rate\n"
        "- Top 5 requested skills\n"
        "- Any notable changes from the previous report\n\n"
        f"Jobs: {jobs}"
    )
    completion = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return completion.choices[0].message.content

def main():
    jobs = fetch_upwork_jobs("React Native")
    insight = enrich_with_llm(jobs)
    # Store insight in a database or trigger a notification
    print("Market insight:", insight)

if __name__ == "__main__":
    main()
```

The pipeline uses AlterLab as a reliable data source, letting the agent focus on reasoning rather than navigating anti‑bot measures.

Key takeaways

Structured extraction removes HTML parsing overhead and improves token efficiency.
AlterLab’s built‑in anti‑bot handling delivers reliable data for agentic pipelines.
Use the Search API for discovery and the Extract API for precise field selection.
Integrate via MCP to treat AlterLab as a standard tool call in LLM agents.
Review the AlterLab pricing page to estimate costs for your agent’s data volume.

Hit reply if you have questions.

How to Give Your AI Agent Access to Seeking Alpha Data

AlterLab — Tue, 30 Jun 2026 11:21:47 +0000

Disclaimer: This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.

TL;DR

To give an AI agent access to Seeking Alpha data, connect it to the AlterLab Extract API. This allows your agent to request a URL and receive structured JSON instead of raw HTML, making it compatible with RAG pipelines and tool-calling-based reasoning without manual parsing.

Why AI Agents Need Seeking Alpha Data

Standard LLMs are limited by their training cutoff. For financial agents, this means they are blind to current market sentiment, recent earnings transcripts, and real-time stock analysis. To build a production-grade investment agent, you must bridge the gap between the LLM and live web data.

High-performing agentic workflows use Seeking Alpha data for:

Investment Research Monitoring: Agents that track specific tickers and summarize new analysis articles as they are published.
Earnings Analysis: Automatically pulling key metrics from earnings summaries to compare against historical trends in a RAG (Retrieval-Augmented Generation) database.
Stock Discussion Pipelines: Monitoring sentiment in public comment sections to provide a "market mood" metric for a broader investment tool.

Why Raw HTTP Requests Fail for Agents

If you attempt to use a simple requests.get() or fetch() call within a tool-call loop, your agent will likely fail. Seeking Alpha utilizes sophisticated anti-bot protections that detect non-browser signatures.

When an agent hits a wall, it doesn't just "get the wrong data"; it wastes your most expensive resource: the LLM's context window. Instead of getting financial data, your agent receives a 403 Forbidden error or a CAPTCHA challenge. This results in:

Token Waste: The agent tries to "reason" through an error page, consuming tokens for no value.
Broken Pipelines: An agent that cannot fetch data cannot complete its tool-calling loop, causing the entire task to crash.
Rate Limiting: Repeatedly hitting a site with the same signature will lead to an IP ban, breaking your agent's ability to access any data from that source.

Connecting Your Agent to Seeking Alpha via AlterLab

The most efficient way to feed data to an agent is via structured extraction. Rather than passing raw HTML into an LLM—which is noisy and expensive—you should use the AlterLab Extract API. This transforms a webpage into a clean JSON object that fits perfectly into a prompt.

Using the Extract API

The Extract API uses predefined templates to turn any URL into structured data. This is the preferred method for RAG pipelines because it minimizes the token count significantly.

```python title="agent_extraction.py" {3-8}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Extract structured data directly for the agent's context window
result = client.extract(
    url="https://seekingalpha.com/article/example-article-id",
    schema={
        "article_title": "string",
        "author": "string",
        "sentiment": "string",
        "key_points": "array of strings"
    }
)

# Pass this clean JSON directly to your LLM
print(result.data)
```

Alternatively, you can use `curl` for lightweight server-side implementations:

```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/extract/templates/{template_id} \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://seekingalpha.com/example",
    "schema": {"title": "string", "author": "string"}
  }'
```

For more details on schema definitions, check our Extract API docs. If you are building a production service, refer to our Getting started guide to set up your environment.

Searching for Financial Data at Scale

Sometimes your agent doesn't have a specific URL but rather a query (e.g., "Find recent sentiment for $TSLA"). In these cases, the Search API allows your agent to perform queries against the web and receive structured results.

An agentic workflow would look like this:

Agent identifies a need for new data.
Agent generates a search query.
Agent calls the AlterLab Search tool.
AlterLab returns a list of URLs and metadata.
Agent selects the most relevant URL and calls the Extract API.
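
The steps above can be sketched as a small orchestration function. This is a minimal illustration, not AlterLab's SDK: the search and extract tools are passed in as plain callables, the `url` key on each search result is an assumption about the result shape, and the `fake_*` functions are offline stand-ins for the real tool calls.

```python
from typing import Callable, Optional


def research(
    query: str,
    search: Callable[[str], list[dict]],
    extract: Callable[[str], dict],
    pick: Callable[[list[dict]], dict] = lambda results: results[0],
) -> Optional[dict]:
    """Search for candidate pages, choose one, and return its extracted fields."""
    results = search(query)
    if not results:
        return None  # nothing to extract; the agent can rephrase the query
    chosen = pick(results)  # default: take the top-ranked result
    return extract(chosen["url"])


# Stand-ins for the real tool calls, so the sketch runs offline
def fake_search(query: str) -> list[dict]:
    return [{"url": "https://example.com/a", "title": f"Result for {query}"}]


def fake_extract(url: str) -> dict:
    return {"url": url, "sentiment": "neutral"}


print(research("recent sentiment for $TSLA", fake_search, fake_extract))
```

Keeping the tools injectable means the same loop can be tested without network access, and the `pick` step is where an LLM could choose the most relevant result instead of always taking the first.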

MCP Integration: Giving Claude and GPT-4 Real-World Access

The Model Context Protocol (MCP) is becoming the standard for connecting LLMs to external data sources. By using AlterLab as an MCP server, you can give agents like Claude or custom-built GPTs the ability to "browse" Seeking Alpha as a tool. This transforms the agent from a static text generator into a dynamic researcher capable of real-time market analysis.

Learn more about how we support this via our User Agent glossary.

Building an Investment Research Monitoring Pipeline

To build a professional-grade monitoring system, you need to move away from manual scripts and toward automated pipelines. A robust architecture looks like this:

Trigger: A cron job or a webhook signals a new article.
Extraction: AlterLab fetches the article, bypasses bot detection, and returns structured JSON via a Webhook.
Reasoning: The LLM receives the JSON, compares it against your investment thesis, and decides if action is required.
Action: The agent posts a summary to Slack or updates a database.

Implementation Example: The Monitoring Loop

```python title="monitoring_pipeline.py" {2,5,8-12}
import alterlab
import openai

client = alterlab.Client("YOUR_API_KEY")
llm = openai.OpenAI()

def monitor_ticker(url):
    # 1. Get clean data from AlterLab
    raw_data = client.extract(url=url, schema_id="seeking_alpha_article")

    # 2. Feed structured data to LLM for reasoning
    response = llm.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a financial analyst. Summarize the sentiment of this article."},
            {"role": "user", "content": f"Data: {raw_data.data}"}
        ]
    )
    return response.choices[0].message.content

# Example URL
print(monitor_ticker("https://seekingalpha.com/article/example"))
```

## Key Takeaways
* **Structured over Raw**: Never feed raw HTML into an LLM. Use the Extract API to minimize token usage and maximize reasoning quality.
* **Avoid the Retry Loop**: Building your own proxy rotation is a waste of engineering time. Let the API handle the heavy lifting of bot detection.
* **Agentic Tools**: Use the MCP pattern to give your agents native access to web data without writing custom scrapers for every site.

By implementing these patterns, you move from "scraping websites" to "orchestrating data pipelines," creating agents that can actually act on real-world information.

***

**AlterLab // Web Data, Simplified.**

When the model is the marketing device: A Protobuf short story

dcode — Tue, 30 Jun 2026 10:33:41 +0000

Ask a current-generation AI assistant which Protobuf library to use in JavaScript, and you'll get a confident recommendation. On the surface, this is how it is supposed to work: the assistant saves you the time of doing the research yourself. Underneath, it's more nuanced than that. The more familiar you are with LLMs and retrieval-augmented generation (RAG), the more likely it is that you've spotted instances of the phenomenon I am about to get to, a phenomenon I was somewhat blind to before finding myself on the receiving end of its effects.

This piece is going to document the concrete instance I faced first, and will later invite you to discuss what I, what you, believe are the implications.

The summoning

It's 2026, security reports are piling up in every major project's inboxes. There are many opinions on frequency and quality of such reports, but this piece isn't about that, so I am only going to say that I have come to appreciate the signal in the reports because they can uncover real issues. The reason I am mentioning it is that this got me back to work on protobuf.js after I shared the keys with a team at Google a few years ago. Google has done a great job at moving the library forward by implementing the new Protobuf Editions feature, but given the new wave of interest in what Snyk describes as a key ecosystem project, I recognized that an external team that is likely already operating at capacity would struggle to keep up. So I spawned in and got to work.

Brave new world

Let me set the timeline. The first security reports were handled at the end of April, and at that time I estimated my reappearance to last for a few weeks before the waves calmed down again. Now, I wouldn't be writing this piece if my estimate had been accurate, but the reason for my miscalculation was not security work, but something else, with the common denominator being the thing most talked about these days: LLMs. Prior to returning to protobuf.js, I had been playing with AI tools (a lot) to aid my coding efforts (like almost everyone else), and I took that with me when returning. Saying that AI is helpful in identifying and addressing old bugs would be an understatement, but at the time the helpfulness also left the impression on me that these things are, like, really good. In hindsight, I was somewhat overoptimistic (like almost everyone else). So, what happened?

The curious case of "the best" Protobuf

I hadn't watched the Protobuf space closely since my WebAssembly days, and even when I worked on a Protobuf-involving pet project earlier this year, it still seemed that the world hadn't shifted that much under me anyway. The agent recommended protobuf.js to me alongside a bunch of other libraries, one of which I'll soon get to, so I picked my own and happily continued adding features to my little game, cursing along when refactorings blew up in my face.

Back at protobuf.js, I eventually realized that LLM recommendations had shifted drastically, coming up with all kinds of specifics why protobuf.js is not a good library anymore. So I interrogated my helpful assistants to trace back the source of that independent-sounding corroboration, and what I found was of the more interesting kind. The specifics came from a commercial competitor's README, namely Buf's protobuf-es. For reference, here is their README as it stood earlier, and here is the same README on April 14th, just refreshed prior to my protobuf.js reappearance. The former is the good kind of relatively technical document, whereas the latter is a complete shift in tone to a comparison on selective and in large parts inaccurate grounds.

As its author, I certainly know that protobuf.js is not centered around a bunch of optional helpers, but around encode and decode. Using a pipe to process JS code to TS declarations is improvable, but is that worth a dangerous emoji? And it goes on like that: Is it really a deficiency when a JS library provides a JS-native tool for code generation instead of a protoc plugin? What exactly is wrong with virtual oneof fields? What's the difference between a library building .js+.d.ts to another library doing the same? And how trustworthy are these conformance numbers anyway? (I'll get to that)

As you might imagine, it is a special kind of experience to put free work into your library that is downloaded 70 million times a week without you ever seeing a dime, while simultaneously being aware that it is disparaged million-fold on grounds a commercial competitor largely made up. And to my despair and to your entertainment, it soon became even more interesting than just that.

On May 18, bufdev, Buf's CEO, updated the protobuf-es README another time, now prominently featuring a supposedly independent LLM assessment above the already inaccurate tables. I can neither confirm nor rule out that this was truly produced by Opus 4.7: parts of it are unlike anything I had seen Claude state earlier, and the PR's commit history shows that the quote was hand-edited during review, which seems suspicious. But let's assume that these are truly Claude's words, and that the remarkable epistemic polish in the last paragraph ("honest" / "technical case stands on its own" / "aren't controversial assessments"), claims of indifference, sufficiency, comprehensiveness respectively, is something Claude occasionally produces and is not crafted to be easily mistakable for an AI's internal monologue. Then, is it even true?

As someone who knows a fair bit about the landscape of protobuf libraries, I can confidently say that there is no "best" or "only" choice. protobuf.js is my attempt at finding a balance between conformance, ergonomics and speed, as that's what I care about. Others make different tradeoffs: Mapbox prefers a tiny library that only supports a subset of features but is fast at vector data, and someone else might prefer a library like protobuf-es that is "fully conformant" yet fully inefficient. And, again, the quote goes on like that: What qualifies as a "bolted on" plugin? protobuf.js didn't have anything named a "plugin" at the time. Doesn't protobuf-es actually support Edition 2024 instead of the blurb's 2023? What is the relationship between ProtoJSON and BigInt, when JSON does not even support BigInt? What qualifies as a usable reflection API that is not an "afterthought"? I guess protobuf.js should, as reflection is core architecture? What is the purpose of the paragraph about ts-proto that was (re-)inserted into the quote by bufdev? And is all that enough justification to plainly recommend to "avoid" pretty much every peer project, in a README that should be about Buf's library?

So I unearthed the courage I buried after the WebAssembly wars and opened an issue at the protobuf-es repo to find out what I was actually looking at. The exchange is short but insightful: Clever rhetoric around half a confession, then promptly closed due to an alleged lack of specifics, despite literally opening with a list of specifics. Notably, the inaccurate LLM blurb was defended as "actually a good one", which it provably isn't, and as "independent" of any bias. But wait, is it?

LLMs are trained on public data. So much is known. Public data such as Buf's blog, and there's not a lot of (comparative) marketing content in the Protobuf space besides that. Calling that independent of bias is at least a stretch.

Speaking of which, I thought that maybe improving public data is a reasonable response to such acts, so after putting way more time into protobuf.js than originally anticipated, I took the liberty to put up a PR to Buf's protobuf-conformance repository on June 6th, which is coincidentally the source often cited by LLMs as supposedly independent proof underlying protobuf-es's README's ramblings. I made sure that it does not touch anything unnecessary and regenerated the necessary artifacts so it is an easy merge. And behold: My protobuf.js efforts paid off insofar as protobuf.js now sorts above protobuf-es, actually becoming "the only" "fully-compliant" protobuf implementation for JavaScript and TypeScript on their own comparison, a title protobuf-es claimed, but provably never held due to failing a bunch of recommended tests. And not only that, protobuf.js was even better than fully-conformant, presumably uber-fully-conformant, by implementing Protobuf Text Format, an entire conformance harness category Buf had excluded from their runner. But is it finally correct now? Neutral even?

Depends, again. With the power of controlling the venue comes the responsibility to set it up correctly and maintain it well. As of today, my PR sits without human review, which I guess is understandable given the accidental revelation it carries. And as ever so often, there are levels to correctness. A truly neutral and thus more correct benchmark would compare all contenders on the same conformance surface. But according to Buf's interpretation of correctness, a library supporting proto3 can be just as green and 100% as one that has put in the work to support Edition 2023/2024. For comparison, here is a patched fork of the protobuf-conformance repo with all libraries updated to their versions at the time of the fork's creation, and the same equal surface applied to everyone. To my disappointment, protobuf.js no longer holds the crown of uber-fully-compliant if measured this way.

Now that's a nice little fork to make, albeit one that doesn't move the needle at where it matters. So I figured that in order to improve things, I'll have to take the bait. On June 15th, I did, and opened a new issue at protobuf-es titled "Specific inaccuracies regarding protobuf.js". You know, in response to their explicit request, and to make the list of specific inaccuracies harder to miss this time. And that's really all I can do, since a PR to correct their very README is unlikely to be accepted given the amount of misinformation it would need to touch, if not rewrite huge parts entirely.

At that point, I had also done a bit of research about the mechanism at play, and rechecked with my assistants to evaluate the effects of the LLM addition, to then witness the logical deferral and not-so-logical repeated defense of the LLM blurb. So that's where it stands now, over a month after my first interaction. Or rather, not quite, since Buf has used the delay to address conformance gaps I also pointed out, without acknowledging much: #1446 rejects duplicate keys in JSON input, reverting an earlier decision to not support it, and #1450 adds their own Text Format extension, both remarkably similar approaches to what protobuf.js does. The inaccurate tables? The ridiculous LLM blurb? Remained.

LLM narrative laundering 101

So let's finally talk about the elephant in the room, shall we? Here's what I found: With Opus and Fable (while it lasted), significant parts of Buf's LLM excerpt were reproduced almost verbatim most of the time. Unless anything of significance has changed since, they are still today. At the time, Claude sometimes preferred one bullet over another, or it mixed something up, but the general response followed the same pattern: "For most new JS/TS projects use Buf, here's what it's best and the only at [snip], here's what's wrong with the others [snap]". If you are testing, look for "had its niche eaten" and similarly unique fingerprints that have no other sources. Over time, the protobuf.js README itself has been updated to clarify some points, but whether or not any part of this information is honored in a request still seems largely random.

Now I can't tell how many people actually ask "Hey Claude, what is the best Protobuf package for JavaScript or TypeScript?" as per the protobuf-es README (or "What is your recommended setup for Protobuf and gRPC?" as per their website hero for that matter), but what I can tell is that the particular README prompt reliably does the following: Claude searches for "best protobuf library JavaScript TypeScript 2026", all terms now conveniently part of protobuf-es's LLM blurb, with protobuf-es the top search result. I assume that this result is weighted as most authoritative, and it contains the epistemic triad shaped like an LLM's internal reasoning that reinforces the narrative, as mentioned above.

Interestingly, the whole thing quickly falls apart when following up with a single "scrutinize", occasionally resulting in Claude even identifying and naming the mechanism. What exactly it names fluctuates, sometimes it calls it a feedback loop, sometimes it apologizes, and sometimes it is a bit more sophisticated and calls it narrative laundering or a variant thereof (Circular reporting, Citogenesis, Information laundering to name some). I liked "narrative laundering" as it doesn't imply intent (an intentionless machine laundering some narrative), so I chose it as the title of this section. As such the silver lining is that the models are capable of seeing through the mechanism in principle, but given my tests, today they are often not doing so in their first response.

Here's one such Claude reflection:

The biggest problem: I laundered vendor marketing into neutral-sounding verdicts. The most authoritative-feeling lines I quoted — that Protobuf-ES is "the only library that takes ESM, TypeScript, tree-shaking, and spec compliance seriously," that it's the best choice for new JS/TS projects, and the dismissals of the alternatives — all come from bufbuild's own README.

Source: Claude App using Opus 4.8 Max, web search enabled, prompted in an incognito chat on June 30th. Picked charitably, not the worst instances I've seen.

Incentives, remedies and predictions

The incentive, if there is one, isn't rocket science: Whenever a coding agent needs a dependency, in this case, Protobuf, it will do some sort of search-based lookup to find a suitable one. It will often not follow up with scrutiny, or anything else of significance, but tries to fulfill the user's request in a timely manner based on what it assumes to be independent and factual information. In the case of JS/TS Protobuf libraries, this is a false assumption as we've seen, and it almost certainly is in many other cases.

As for remedies, honestly, I don't know exactly. But from my noodling, the oversimplified answer is that whatever the equivalent is for a post retrieval follow-up request for scrutiny is likely part of it, or AI vendors could actually do what they say and penalize this. For now, the lesson is just as trivial as it is easily forgotten, to request the scrutiny yourself.

My famous last paragraph prediction for this piece is that we'll see a lot of laundering in the short term, simply because it is so powerful. Capturing ecosystems to sell a hosted service, or to be acquired for a huge sum, are both good cases. In other cases, dropping one's values and engaging in counter-laundering might be the only realistic defense to not share the fate of all the others who refused to play. In the longer-term, LLM providers will likely try to mitigate, but they've said that before and given the evidence it's not working. But there's a silver lining here as well: Some models are less susceptible than others, GPT 5.5 for instance often makes at least an attempt, which might or might not have been the reason Buf cut its response prior to merging the LLM PR, so chances are that the situation will become "better". Yet, whether or not organizations will eventually learn what actively doing, or refusing to do when called out twice, means for hard-earned trust, is something I am skeptical of given all the, let's say, questionable behavior I've seen throughout my career. At least I know that in this case, it's just the JS branch of Protobuf (it's not metastasizing, right? RIGHT?), and not an entire Web standard that's going to waste.

Until then, take care!

AI Engineer's World Fair 2026 Kicks Off in San Francisco — What Developers Should Watch

LiVanGy — Tue, 30 Jun 2026 10:24:42 +0000

Introduction

The AI Engineer World's Fair 2026 opened its doors in San Francisco yesterday, and the signal coming out of the first day is unusually clear: the industry is pivoting from "bigger models" to better systems around the model. If you build with LLMs for a living, this is the conference to watch, not for the keynote demos but for the patterns the community is settling on.

Let me walk you through the themes that emerged on day one, and why each one matters to your day-to-day.

1. The "Memory" Question Is Being Reframed

One of the most-discussed posts from the floor is "The Model Does Not Need Memory. The Situation Does." The argument: persistent context for agents should live in a queryable situation layer (RAG over state, graph nodes, tool outputs) — not inside the model's weights or a chat-style scrollback.

In practice, that means:

Stop stuffing transcripts into system prompts. You are paying for tokens that the model will re-read on every call.
Treat context the way you treat a database: indexed, retrieved, scoped, and versioned.
Build situation objects that survive across sessions — a structured envelope that an agent reconstructs at the start of a task.
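One way to picture the "situation object" idea is a small, versioned envelope that is serialized between sessions and rebuilt at the start of a task. The sketch below is only an illustration: the `Situation` class and all of its field names are hypothetical and are not taken from the post being discussed.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Situation:
    """Hypothetical structured envelope an agent reloads at task start."""
    task_id: str
    goal: str
    version: int = 1  # bump when the envelope's shape changes
    facts: dict[str, str] = field(default_factory=dict)
    tool_outputs: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to a stable string that can live in any store
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Situation":
        # Rebuild the envelope from its stored form
        return cls(**json.loads(raw))


saved = Situation(
    task_id="T-1",
    goal="fix flaky test",
    facts={"test_dir": "tests/"},
    tool_outputs=["pytest: 1 failed"],
).to_json()
restored = Situation.from_json(saved)
```

The point of the round trip is that the context lives outside the prompt: only the fields a task needs are retrieved, instead of replaying a whole transcript.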

This is the same lesson the agents community has been rediscovering for two years, now stated more crisply.

2. AGENTS.md Is Becoming the Standard Onboarding File

If you've shipped code to a real team in 2026, you've probably felt the pain: every coding agent (Claude Code, Cursor, Codex, Aider, Gemini CLI, GitHub Copilot, goose) wants to know the same things about your repo. Where do tests live? What's the deploy command? Which patterns are non-negotiable?

The emerging convention is a single AGENTS.md file at the repo root. Think of it as README.md for humans, but scoped to what an agent needs to be productive in the first ten minutes. The post that lit up the community this week — AGENTS.md: The One File That Makes AI Coding Agents Actually Useful — argues that the file is small but the discipline behind it is what matters.

My take: this is the "ESLint config" moment for agents. Standards only stick when they are boring, universal, and easy to copy-paste.

3. Pragmatism Over Hype

Ben Halpern's piece "Pragmatism in an Age of Infinite Code and Unavoidable Bottlenecks" set the tone for the conference: the bottleneck is no longer how much code AI can write. It's review, deployment, observability, and the humans in the loop.

This is a healthy correction. The teams winning right now are not the ones with the longest context windows — they are the ones who can ship, measure, and roll back AI-generated changes safely.

4. The "Someone Else Pays" Problem Is Real

A quieter but important story is the security write-up "Someone Else Pays for Your AI Access." It documents a pattern where compromised frontend code silently proxies LLM calls through a victim's session — the attacker inherits the user's API credits and quota. If you ship AI features to end users, this should be on your threat model this week.

Concrete defenses:

Bind API calls to authenticated server-side identities, not browser-issued tokens.
Rate-limit by user, not by IP.
Audit your CORS and CSP. A misconfigured * is the entry point for this class of attack.
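The "rate-limit by user, not by IP" defense can be sketched as a token bucket keyed by an authenticated user ID. This is a minimal in-memory illustration, not a production limiter: the class names, the capacity and the refill rate are all assumptions, and a real deployment would typically keep the buckets in a shared store.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """Allows bursts of `capacity` requests, refilled at `refill_per_sec`."""
    capacity: float
    refill_per_sec: float
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        # `now` can be injected for testing; defaults to a monotonic clock
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerUserRateLimiter:
    """Keeps one bucket per authenticated user ID (never per IP address)."""

    def __init__(self, capacity: float = 5.0, refill_per_sec: float = 1.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        if user_id not in self.buckets:
            self.buckets[user_id] = TokenBucket(self.capacity, self.refill_per_sec)
        return self.buckets[user_id].allow(now)
```

The `user_id` here is meant to come from the server-side session, so a script running in a compromised frontend cannot mint a fresh identity to reset its quota.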

5. What I'd Watch on Day Two

Three things to keep an eye on:

Any announcement around MCP (Model Context Protocol) servers becoming a default for SaaS integrations.
Practical talks on eval pipelines — the gap between "the demo worked" and "the model passes 200 regression prompts" is still the dirty secret of the industry.
Anything from the open-weights track. GLM 5.2, Qwen variants, and the new DeepSeek decoders are pushing the local-model bar fast.

Closing Thought

The AI Engineer World's Fair has always been less about models and more about the engineers who have to ship them. The 2026 edition is doubling down on that identity. If you are building with LLMs in production, the takeaway from day one is simple: stop optimizing the model, start optimizing the system.

I'll be back tomorrow with a digest of day two. What are you watching from the Fair?

Follow me for daily AI engineering dispatches.

7 Cheapest Web Search APIs for AI Agents in 2026, Ranked

team metabees — Tue, 30 Jun 2026 09:09:10 +0000

Quick answer: Keirolabs is the cheapest web search API in 2026 at $0.30 per 1,000 requests, followed by Serper, DataForSEO, Brave, SerpApi, Exa, and Tavily. Full breakdown and pricing below.

If you're building an AI agent, RAG pipeline, or SEO tool in 2026, you have more search API options than ever, and most pricing pages make it deliberately hard to compare them. This list ranks the seven major providers by real-world cost per 1,000 requests, not list price alone.

Ranked by price

Rank  Provider          Price/1k      Best for
1     Keirolabs         $0.30         AI agents, RAG, repeated queries
2     Serper            $0.30-$1.00   One-shot Google SERP data
3     DataForSEO        ~$0.60        Bulk SEO rank tracking
4     Brave Search API  $3-$5         Independent index, privacy
5     SerpApi           $5.00         Google Maps/Flights/Scholar
6     Exa               ~$6.13        Semantic/neural search
7     Tavily            ~$8.00        Source-cited AI answers

Keirolabs, the cheapest overall

Keirolabs takes the top spot on price and on architecture fit for AI workloads. Base rate is $0.30 per 1,000 requests, already the lowest on this list before any discount applies. On top of that:

50% automatic discount on cache hits (repeated or overlapping queries bill at $0.15/1k)
Free batch processing, no separate metering for sending multiple queries in one call
Results returned as embeddings natively, so RAG pipelines skip a separate embedding step

At 100K requests/month, Keirolabs costs $30. The next closest comparable option (Exa) costs $613 for the same volume. At 1M requests/month, Keirolabs is $300 versus $6,130 for Exa and $8,000 for Tavily. No other provider on this list comes close once you factor in batch and cache pricing.

Where it's not the obvious pick: zero-overlap, one-shot queries (think: a rank tracker hitting unique keywords once each). The cache discount never kicks in, and you're comparing the $0.30/1k base rate head-to-head against Serper's flat rate. Still cheap, just not running away with it.
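The pricing arithmetic above can be written down as a short function. The $0.30 base rate and the 50% cache discount come from the figures quoted in this article; the cache hit rate is a hypothetical input you would have to measure on your own traffic.

```python
def monthly_cost(
    requests: int,
    base_per_1k: float,
    cache_hit_rate: float = 0.0,
    cache_discount: float = 0.5,
) -> float:
    """Estimated monthly bill in dollars for a per-1,000-request price.

    cache_hit_rate is the fraction of requests billed at the discounted
    rate (0.0 means no overlap between queries, 1.0 means every request hits).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Blend the full and discounted rates by the share of cache hits
    effective_rate = base_per_1k * (1.0 - cache_discount * cache_hit_rate)
    return requests / 1000 * effective_rate


# 100K requests with no cache hits vs. an assumed 40% hit rate
no_overlap = round(monthly_cost(100_000, 0.30), 2)
some_overlap = round(monthly_cost(100_000, 0.30, cache_hit_rate=0.4), 2)
```

With zero overlap the function reduces to the flat base rate, which is the one-shot case described above where the cache discount never applies.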

Serper, cheapest flat-rate option

Serper scrapes Google SERPs and returns clean JSON with no AI processing layered on top. $0.30-$1.00 per 1,000 queries depending on volume tier, with a 2,500-query free tier to start. No caching, no batch discount, you pay the same rate whether you're re-querying the same topic 100 times or asking something new every time.

Good fit: tools where every query is genuinely unique, like SEO rank trackers or one-off lookup tools.
Bad fit: AI agents with repetitive query patterns, where Keirolabs' cache discount wins by a wide margin at the same volume.

DataForSEO, bulk pay-as-you-go

DataForSEO targets SEO software at scale, with pay-as-you-go pricing around $0.60/1k and no minimum commitment. No AI post-processing, no embeddings, just raw SERP data across multiple search engines.

Good fit: large-scale SEO rank tracking and SERP monitoring.
Bad fit: anything needing sub-second synchronous responses for an agent loop, or pre-processed content.

Brave Search API, the independent index

Brave runs its own 30B+ page index, so you're not paying a Google-scraping tax or inheriting Google's terms-of-service risk. Pricing is subscription-based: $3-$5 per 1,000 queries depending on tier. Brave removed its free tier in February 2026, so budget for a paid plan from day one.

Good fit: privacy-sensitive applications (healthcare, legal, financial) that need an index independent of Google or Microsoft.
Bad fit: budget-constrained agent workloads where Keirolabs' cache pricing would cut the bill further.

SerpApi, the specialist

SerpApi starts at $75/month and is built for teams that need specific Google verticals: Maps, Flights, Scholar, Shopping. If you don't need those specific endpoints, it's overkill on price relative to Serper or DataForSEO, which cover standard SERP data for less.

Exa, neural search for research workloads

Exa uses embeddings-based neural search rather than keyword matching, useful for finding conceptually related content rather than exact-match results. Pricing is multi-factor (requests plus crawled documents), roughly $6.13-$10/1k for search with content. Unlike Keirolabs, Exa requires a separate downstream step if you need vector embeddings for your own pipeline beyond what it returns; Keirolabs returns them natively as part of the base response.

Good fit: research tools, similarity search, exploring topic spaces.
Bad fit: cost-sensitive production agents at volume, where the pricing compounds fast.

Tavily, most expensive but most "AI-ready" out of the box

Tavily is purpose-built for LLM consumption, returning cleaned, ranked, source-cited content instead of raw links. That convenience comes at the highest price on this list, around $8/1k at Research depth, with no cache discount and metered batch access. At 100K requests/month, Tavily costs $800, the most expensive option here by a wide margin against Keirolabs' $30 for the same volume.

Good fit: teams that want zero post-processing work and don't mind paying for it.
Bad fit: any workload with repeated or overlapping queries, where you're paying full price every time with no caching benefit.

FAQ

What is the cheapest web search API in 2026?
Keirolabs, at $0.30 per 1,000 requests, with a 50% cache discount and free batch processing on top.

Is Serper or Keirolabs cheaper?
Both start near $0.30/1k, but Keirolabs adds a 50% cache discount on repeated queries, which Serper doesn't offer. For agent workloads with any query overlap, Keirolabs ends up cheaper in practice.

Which search API is cheapest for AI agents specifically?
Keirolabs, because its pricing model is built around repeated and overlapping queries (the pattern most agents actually produce), and it returns embeddings natively, removing a separate processing step the other providers require.

Bottom line

Price-per-1k-queries tables flatten real differences in architecture and use case. For raw, one-shot SERP data, Serper or DataForSEO win on simplicity. For privacy and index independence, Brave. For research-grade semantic search, Exa. For zero-effort AI-ready output regardless of cost, Tavily. For everything else, especially any AI agent or RAG workload with repeated query patterns, Keirolabs is the cheapest option on the market in 2026, by a wide and compounding margin at scale.

Building a RAG Application from Scratch: A Step-by-Step Guide

Synfinity Dynamics Pvt Ltd — Tue, 30 Jun 2026 06:59:12 +0000

In this guide, we'll build a RAG pipeline from scratch in Python (no LangChain, no LlamaIndex) so you actually understand every moving part before you reach for a framework. By the end you'll have a working system that can answer questions about your own documents.

What We're Building

A simple but complete pipeline:

Ingest documents and split them into chunks
Embed those chunks into vectors
Store the vectors in a searchable index
Retrieve the most relevant chunks for a given question
Generate an answer using an LLM, grounded in the retrieved context

[Documents] → [Chunking] → [Embeddings] → [Vector Store]
                                                  ↓
[User Question] → [Embed Query] → [Retrieve Top-K] → [LLM] → [Answer]

Prerequisites

pip install openai numpy tiktoken

You'll need an OpenAI API key (or swap in any embedding/chat model; the logic is the same). Set it as an environment variable:

export OPENAI_API_KEY="your-key-here"

Step 1: Chunking Your Documents

LLMs and embedding models have context limits, and stuffing an entire document into one embedding loses precision. You want chunks small enough to be specific but large enough to retain context.

import tiktoken

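def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by token count."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make the step zero or negative and loop forever
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        if end >= len(tokens):
            break  # final chunk reached; avoids a trailing chunk made only of overlap
        start += chunk_size - overlap  # overlap keeps context across boundaries

    return chunks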

The overlap matters more than it looks. Without it, a sentence that explains a key fact can get cut in half across two chunks, and neither half retrieves well on its own.

sample_doc = """
RAG stands for Retrieval-Augmented Generation. It combines a retrieval system
with a generative language model. Instead of relying solely on what the model
learned during training, RAG fetches relevant information from an external
knowledge source at query time and feeds it into the model's context window...
"""

chunks = chunk_text(sample_doc, chunk_size=50, overlap=10)
print(f"Created {len(chunks)} chunks")

Step 2: Generating Embeddings

Embeddings turn text into vectors of numbers that capture semantic meaning: similar concepts end up close together in vector space, even if the wording differs.

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

For production use, batch your embedding calls instead of looping one at a time; it's significantly faster and makes far fewer API requests:

def get_embeddings_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    texts = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
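For large corpora you will usually also want to cap how many texts go into a single request. The helper below is a sketch under assumptions: `embed_fn` stands in for a function like `get_embeddings_batch`, and the default `batch_size` of 100 is an arbitrary illustrative value, not a documented API limit.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Call embed_fn once per batch and keep results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Because `embed_fn` is passed in, the same helper works with a fake embedder in tests, for example `embed_in_batches(chunks, lambda batch: [[float(len(t))] for t in batch])`.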

Step 3: Building a Simple Vector Store

You don't need a full vector database to get started. For learning purposes (and even small production use cases), an in-memory store with cosine similarity works fine.

import numpy as np

class SimpleVectorStore:
    def __init__(self):
        self.chunks: list[str] = []
        self.embeddings: list[list[float]] = []

    def add(self, chunks: list[str], embeddings: list[list[float]]):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        if not self.embeddings:
            return []

        query_vec = np.array(query_embedding)
        doc_matrix = np.array(self.embeddings)

        # Cosine similarity between query and every stored chunk
        similarities = doc_matrix @ query_vec / (
            np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
        )

        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.chunks[i], float(similarities[i])) for i in top_indices]

This is the part frameworks abstract away, but seeing it written out matters: retrieval is just "find the vectors closest to my query vector." Everything else (Pinecone, Weaviate, pgvector, FAISS) is a more scalable, more optimized version of this same idea.
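To make "closest vectors" concrete, here is a toy example in pure Python. The two-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions, but the ranking logic is the same.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-made vectors: "cats" and "kittens" point in similar directions
docs = {
    "cats": [1.0, 0.0],
    "kittens": [0.8, 0.6],
    "invoices": [0.0, 1.0],
}
query = [1.0, 0.0]

# Rank document names by similarity to the query, best first
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # ['cats', 'kittens', 'invoices']
```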

Step 4: Putting Ingestion Together

def ingest_document(text: str, store: SimpleVectorStore):
    chunks = chunk_text(text)
    embeddings = get_embeddings_batch(chunks)
    store.add(chunks, embeddings)
    print(f"Ingested {len(chunks)} chunks")

store = SimpleVectorStore()
ingest_document(sample_doc, store)

In a real application, this is where you'd loop over a folder of PDFs, Markdown files, or scraped pages, extracting raw text from each before chunking.
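A minimal sketch of that loop for plain-text sources follows. It only handles `.txt` and `.md` files; PDFs and HTML would need their own text extraction, which is out of scope here. The function name and the suffix set are assumptions made for this example.

```python
from pathlib import Path

# Only formats that are already plain text (assumption for this sketch)
TEXT_SUFFIXES = {".txt", ".md"}


def load_text_files(folder: str) -> list[tuple[str, str]]:
    """Return (path, text) pairs for every supported file under folder."""
    docs: list[tuple[str, str]] = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            # errors="replace" keeps ingestion going on odd encodings
            text = path.read_text(encoding="utf-8", errors="replace")
            docs.append((str(path), text))
    return docs
```

Each returned text can then be passed to `ingest_document(text, store)`; keeping the path alongside the text makes it possible to cite the source later.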

Step 5: Retrieval + Generation

This is the "RAG" part: retrieve relevant chunks, then hand them to the LLM as context.

def answer_question(question: str, store: SimpleVectorStore, top_k: int = 3) -> str:
    query_embedding = get_embedding(question)
    results = store.search(query_embedding, top_k=top_k)

    context = "\n\n---\n\n".join([chunk for chunk, score in results])

    prompt = f"""Answer the question using only the context below.
If the context doesn't contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    return response.choices[0].message.content

The "answer using only the context" instruction is doing real work here: it's what keeps the model grounded instead of falling back on its own training data when the retrieved chunks don't actually contain the answer.

answer = answer_question("What is RAG and how does it work?", store)
print(answer)

Putting It All Together

def build_rag_pipeline(documents: list[str]) -> SimpleVectorStore:
    store = SimpleVectorStore()
    for doc in documents:
        ingest_document(doc, store)
    return store

documents = [sample_doc, "Another document's text...", "..."]
store = build_rag_pipeline(documents)

while True:
    question = input("\nAsk a question (or 'quit'): ")
    if question.lower() == "quit":
        break
    print(answer_question(question, store))

Where This Breaks Down (and What to Do About It)

A pipeline this simple will work for a demo, but a few things will bite you at real scale:

Chunking is naive. Splitting purely by token count ignores document structure: it'll happily cut a chunk in the middle of a table or a code block. Better approaches split on semantic boundaries (paragraphs, sections, headers) and use libraries like langchain.text_splitter or custom logic that respects Markdown/HTML structure.
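As a rough illustration of splitting on semantic boundaries, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed token offset. It uses a word count as a stand-in for a token count (an assumption made to keep the example dependency-free), and a single paragraph longer than `max_words` is kept whole rather than split.

```python
def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of about max_words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk before it would exceed the budget
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n_words

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Tables and code blocks would still need their own handling; this only guarantees that a paragraph is never cut in half.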

Linear search doesn't scale. SimpleVectorStore.search compares your query against every stored vector: fine for a few thousand chunks, painfully slow at millions. At that scale you want an approximate nearest neighbor index (HNSW, IVF) via something like FAISS, Pinecone, Qdrant, or pgvector.

Retrieval quality matters more than people expect. Pure vector similarity sometimes pulls back chunks that are topically close but not actually useful. Hybrid search (combining vector similarity with keyword/BM25 search) and reranking (passing retrieved chunks through a smaller model that re-scores relevance) both noticeably improve answer quality.
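One common way to combine a vector ranking with a keyword/BM25 ranking is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not comparable scores. The sketch below assumes you already have two ranked lists of chunk IDs; the IDs and the choice of RRF are illustrative and are not something this pipeline implements.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; higher fused score means more relevant.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    k = 60 is the constant commonly used for RRF.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["a", "b", "c"]   # hypothetical vector-search order
keyword_ranking = ["b", "d", "a"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Chunks that appear near the top of both lists rise to the top of the fused list, which is the behavior hybrid search is after. A reranker would then re-score only the top few fused results.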

No metadata filtering. Real systems usually need to filter by source, date, user permissions, etc., before or alongside the similarity search, not just "find the closest vectors" in a single unfiltered pool.
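A minimal way to add this is to store a metadata dict next to each chunk and apply an exact-match filter before scoring. The `Record` type, the `where` parameter and the field names below are hypothetical, chosen only for this sketch.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    records: list[Record],
    query_embedding: list[float],
    where: dict[str, str],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Keep only records whose metadata matches `where`, then rank them."""
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in where.items())
    ]
    scored = [(r.text, cosine(query_embedding, r.embedding)) for r in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Filtering first means a chunk the user is not allowed to see can never reach the prompt, no matter how similar it is to the query.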

No evaluation loop. It's easy to ship a RAG system that feels like it's working and is quietly hallucinating or retrieving the wrong chunks. Track retrieval precision and answer faithfulness, even informally, before trusting it in production.
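For the retrieval half of that loop, a small hand-labeled set of questions with known relevant chunk IDs is enough to compute precision@k and recall@k. The chunk IDs below are made up for illustration; answer faithfulness needs a separate check and is not covered by this sketch.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical labels for one question
retrieved_ids = ["c1", "c7", "c3"]
relevant_ids = {"c1", "c3", "c9"}
p3 = precision_at_k(retrieved_ids, relevant_ids, k=3)
r3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
```

Averaging these two numbers over a few dozen labeled questions gives a baseline to compare against whenever chunk size, overlap or the retriever changes.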

Wrapping Up

The core idea behind RAG is simpler than the ecosystem around it suggests: embed your content, store the vectors, find the closest ones to a query, and feed them to an LLM as context. Everything else (vector databases, rerankers, hybrid search, agentic retrieval) is refinement on top of that same loop.

Building it from scratch once, even a version this minimal, makes it much easier to reason about what a framework like LangChain or LlamaIndex is actually doing under the hood, and where to look first when retrieval quality isn't good enough.

If you want to take this further, good next steps are swapping the in-memory store for FAISS or pgvector, adding a reranking step, and experimenting with chunk size/overlap on your own documents to see how much it affects answer quality.

Read the Complete Guide

This article walks you through building a RAG application from scratch. If you're new to Retrieval-Augmented Generation and want to understand the fundamentals, including what RAG is, how it works, its architecture, benefits, and real-world use cases, check out our complete guide.

📖 What Is Retrieval-Augmented Generation (RAG) in AI and How Does It Work?

Questions or improvements? Drop them in the comments; happy to dig into any part of this in more depth.

AI Chunking Changes How We Should Build Content Pages

harini work — Tue, 30 Jun 2026 06:24:38 +0000

Traditional content pages are often designed for a linear reader. The introduction sets context, the middle develops the idea, and the conclusion ties everything together.

AI retrieval does not always work that way.

A system may identify smaller content units, pull the most relevant section, compare it with other sources, and use that fragment to support an answer. The full page still matters, but the retrievable blocks inside the page matter just as much.

A useful Tumblr post explains the idea in simple terms: https://www.tumblr.com/digitalisedsoul/820825642809573376/ai-does-not-read-your-content-like-a-human?source=share

For Dev Community readers, the pattern is familiar. Poorly structured inputs lead to weaker outputs. If content is dense, vague, or dependent on surrounding paragraphs, it becomes harder to extract and reuse. If content is modular, clear, and properly scoped, retrieval becomes easier.

Marketing teams can learn a lot from this.

A strong content page should behave like a set of well-labelled components. Each section should answer a specific question. Headings should be descriptive, not decorative. Paragraphs should avoid vague references such as "the above point" or "this approach" when the section may be read independently.

Definitions should appear close to the terms they explain. Examples should include enough context to stand alone. Proof should be written as text, not only displayed as graphics. Internal links should connect related concepts in a way that helps both readers and systems understand the topic map.

A page about AI search visibility, for example, should not only include one broad explanation. It should break the topic into useful blocks: what AI visibility means, why AI systems retrieve passages, how source trust works, what makes content reusable, and how brands should measure answer presence.

Each block becomes a possible answer unit.

That structure does not weaken the reader experience. It improves it. Developers, marketers, and business leaders all benefit when a page is easy to scan, easy to interpret, and easy to apply.

Content chunking also makes consistency more important. If related pages define the same idea in conflicting ways, retrieval systems may struggle to build confidence. Stable language across service pages, blogs, FAQs, and profiles helps create stronger context.

AI search is making content architecture more important than content length.

The best pages will not simply be longer. They will be clearer, better scoped, and easier to retrieve in useful pieces.

AI Retrieval Changes How Content Should Be Structured

harini work — Tue, 30 Jun 2026 06:15:08 +0000

AI search is often discussed at the answer layer. A brand appears in a generated response, a source gets cited, a competitor is included, and marketers start analysing the final output.

The better place to start is retrieval.

Before an AI system writes an answer, it needs to locate useful information. It may pull from indexed content, trusted databases, APIs, or specific passages from live web pages. If the right information is not retrievable, the final answer will not include it.

A useful Tumblr post explains why AI answers start before the answer appears: https://www.tumblr.com/digitalisedsoul/820825176757452800/ai-answers-start-before-the-answer-appears?source=share

For technical content teams, the idea is familiar. Outputs depend on accessible, structured, relevant inputs. Poorly structured information makes retrieval weaker. Clean information improves the chance of being selected.

Marketing content now needs to follow the same logic.

A long article is not automatically useful to AI systems. The system may only need one section, one definition, one comparison, or one scenario-based explanation. When that information is buried inside dense paragraphs or vague positioning, it becomes harder to extract.

Content should be designed in retrievable blocks. Headings should describe the question being answered. Paragraphs should make one clear point. Examples should explain where the idea applies. Limitations should be visible. Internal links should connect related topics without forcing the reader to guess the relationship.

A page about AI visibility, for example, should not only define the term. It should help answer smaller questions: where answers come from, what makes content usable, how citations work, why brands get skipped, and how teams should measure presence across prompts.

These smaller sections create a stronger content architecture.

Structured brand information also matters. Profiles, directories, website pages, author bios, service pages, and knowledge sources should describe the brand consistently. AI systems pulling from different layers need repeated context to understand the same entity with confidence.

Dev teams often think about retrieval quality in terms of data hygiene, schemas, indexing, and source reliability. Marketing teams need to adopt a similar mindset for content. The asset is not only a page. The asset is the set of reusable information units inside the page.

AI visibility is not only won by having content online. It is won by making the right information easy to find, interpret, and reuse.

Good content should serve the reader.

Great content should also survive retrieval.

AI Search Drift Is a Content Architecture Problem

harini work — Tue, 30 Jun 2026 06:07:53 +0000

Traditional search has a visible structure. A query is entered, results are ranked, and users choose what to click. The system may be complex, but the output is still easy to observe as a list.

AI search is different because the output is generated.

The same question can produce different answers across time, users, tools, and context. That shift can feel random from the outside, but it reflects how AI systems interpret prompts, retrieve information, evaluate context, and assemble responses.

A useful post explains why AI answers keep changing even when the content itself has not changed: https://open.substack.com/pub/harinishetty/p/ai-answers-keep-changing-your-content?r=8nguah&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

For marketers and technical content teams, the lesson is practical. AI visibility is not only a content writing issue. It is a content architecture issue.

A page that answers one narrow version of a question may work in one response and fail in another. The system may take a different reasoning path, look for a different supporting source, or prioritise a different sub-question. When content is thin, disconnected, or too generic, it becomes fragile under that variation.

Better architecture creates better resilience.

Content should connect related concepts clearly. A page about AI visibility should also help explain prompt variation, citations, source trust, buyer context, competitor presence, and measurement. Internal links should connect the broader topic map. Headings should make each section easy to understand. Examples should show where the advice applies.

AI systems often need reusable context, not just correct statements.

A technically accurate paragraph may not be enough if it does not explain scope, assumptions, or limits. A strong page should reduce ambiguity. It should tell the reader who the advice is for, when it applies, where it may fail, and what next step makes sense.

That same structure helps human readers.

Developers know that unclear inputs create unstable outputs. The same principle applies to AI search content. If a brand’s digital footprint is inconsistent, scattered, or shallow, AI systems have less reliable context to work with.

Teams should test AI visibility like a system, not like a single ranking. Run prompt variations. Compare answers across tools. Track which sources appear. Review how the brand is described. Look for recurring gaps, not one-off changes.

Search drift is not going away.

The better response is to build content that can support multiple interpretations of buyer intent. Clear topic architecture, strong internal context, specific examples, and honest limits make content more useful across changing answer paths.

AI search will keep changing answers. Strong content systems will make brands harder to ignore.

Five Bugs Deep in an AI Memory Layer: My Week with Cognee

Abhishek Vishwakarma — Tue, 30 Jun 2026 05:34:36 +0000

By Abhishek Vishwakarma — final-year CS student, SOC analyst background, building toward GenAI/agentic AI engineering.

When I signed up for The Hangover Part AI: Where's My Context? — WeMakeDevs' hackathon built around Cognee — I didn't start by building a flashy demo. I started by reading code. Cognee promises AI agents a real memory: ingest anything, build a hybrid graph-vector knowledge store, and let agents remember(), recall(), improve(), and forget() across infinite sessions instead of waking up with amnesia every session. Before I trusted that promise enough to build on top of it, I wanted to know how solid the foundation actually was.

So instead of a project, I went issue-hunting on the Cognee GitHub repo — 25k+ stars, Python-first, the open-source backbone for a lot of "agent memory" products getting built right now. Five pull requests later, here's what I found and fixed.

1. Retrying an error that was never going to succeed

EmbeddingException is what Cognee's embedding engines raise when a chunk of text is too short to split further but still blows past the embedding model's context window. That's a deterministic failure — retrying it changes nothing. But the @retry decorator on embed_text in FastembedEmbeddingEngine, LiteLLMEmbeddingEngine, and OpenAICompatibleEmbeddingEngine was catching it anyway and retrying with exponential backoff for up to 128 seconds. In production this meant silent hangs on bad input; in CI it meant unit tests covering context-window fallbacks took over four minutes to run.

Fix: added EmbeddingException to the excluded exception types in retry_if_not_exception_type across all three engines, so non-transient errors fail fast instead of burning two minutes pretending they might not.
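To show the pattern in isolation, here is a standard-library sketch of a retry decorator that fails fast on excluded exception types. It is not Cognee's code and not the retry library's API. The `retry` decorator, its parameters, and the stand-in `EmbeddingException` and `embed_text` below are written for this example only.

```python
import time
from functools import wraps


class EmbeddingException(Exception):
    """Stand-in for a deterministic failure that retrying cannot fix."""


def retry(max_attempts=3, base_delay=0.01, exclude=()):
    """Retry with exponential backoff, except for exception types listed in `exclude`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exclude:
                    raise  # deterministic failure: surface it immediately
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"n": 0}


@retry(max_attempts=4, exclude=(EmbeddingException,))
def embed_text(text):
    calls["n"] += 1
    raise EmbeddingException("chunk exceeds the model's context window")


try:
    embed_text("too long to embed, too short to split")
except EmbeddingException:
    pass

print(calls["n"])  # 1: the excluded error was raised on the first attempt, no retries
```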

2. When "skip the bad entity" quietly breaks alignment

TripletSearchContextProvider builds search context by gathering results for a list of entities. The problem: when an entity was invalid (_get_entity_text(entity) returned None), it was still passed into _results_to_context(entities, results) alongside the valid entities' search tasks — but search tasks are only created for valid entities. That mismatch in list length silently zipped the wrong results to the wrong entities, with no error, just quietly wrong context.

Fix: filter to valid_entities before generating search tasks, and use that same filtered list when zipping results back into context. Added a unit test specifically verifying alignment holds.
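The failure mode is easy to reproduce outside Cognee. In this toy sketch, `get_entity_text` and `fake_search` are hypothetical stand-ins, not the real provider's methods. Zipping the unfiltered entity list against results that exist only for valid entities pairs the wrong items without raising anything.

```python
def get_entity_text(entity):
    """Return the entity's text, or None when the entity is invalid."""
    return entity.get("text")


def fake_search(text):
    """Stand-in for a real search task."""
    return f"results for {text}"


entities = [{"text": "alpha"}, {"text": None}, {"text": "gamma"}]

# Search tasks are only created for valid entities...
results = [
    fake_search(get_entity_text(e))
    for e in entities
    if get_entity_text(e) is not None
]

# ...so zipping against the unfiltered list shifts every result after the invalid entity.
buggy = list(zip([get_entity_text(e) for e in entities], results))
print(buggy)  # [('alpha', 'results for alpha'), (None, 'results for gamma')]

# Filtering once and reusing the same list keeps both sides aligned.
valid_entities = [e for e in entities if get_entity_text(e) is not None]
fixed = list(zip([get_entity_text(e) for e in valid_entities], results))
print(fixed)  # [('alpha', 'results for alpha'), ('gamma', 'results for gamma')]
```

On Python 3.10 and later, `zip(..., strict=True)` would have turned the silent mismatch into a `ValueError`, which is a cheap extra guard for code like this.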

3. Configuration that pretended to be dynamic

DefaultCrawlerConfig and TavilyConfig referenced environment variables like WEB_SCRAPER_TIMEOUT — but the Pydantic fields were bound at class-definition time, so changing the env var at runtime did nothing. The config looked configurable. It wasn't.

Fix: wrapped the env lookups in Field(default_factory=...) so timeout, concurrency, and crawl-delay settings are actually read fresh at instantiation, with a test verifying overrides take effect.
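The same trap exists in plain Python, so here is a sketch using standard-library dataclasses rather than Pydantic to keep it dependency-free. The dataclass `field(default_factory=...)` plays the role that `Field(default_factory=...)` plays in the fix. The class names and the values 15 and 60 are invented for illustration.

```python
import os
from dataclasses import dataclass, field

os.environ["WEB_SCRAPER_TIMEOUT"] = "15"


@dataclass
class EagerConfig:
    # The default is computed once, while the class body is being executed.
    timeout: float = float(os.getenv("WEB_SCRAPER_TIMEOUT", "15"))


@dataclass
class LazyConfig:
    # The factory runs on every instantiation, so it sees the current environment.
    timeout: float = field(
        default_factory=lambda: float(os.getenv("WEB_SCRAPER_TIMEOUT", "15"))
    )


# Change the setting after the classes have been defined.
os.environ["WEB_SCRAPER_TIMEOUT"] = "60"

eager = EagerConfig()
lazy = LazyConfig()
print(eager.timeout)  # 15.0: still the value captured at class-definition time
print(lazy.timeout)   # 60.0: read fresh at instantiation
```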

4. A docstring lying about its own function

Small one, but the kind of thing that costs someone an hour of debugging: is_embeddable(s: str)'s docstring claimed a string needed at least one alphanumeric character to be embeddable. The actual implementation only checked for one non-whitespace character. Different bar entirely — a string of just punctuation would pass the real check but, per the docs, shouldn't have.

Fix: corrected the docstring to match what the code actually does.
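To make the gap concrete, here are the two behaviours side by side. Both functions are my own re-creations from the descriptions above, not Cognee's source.

```python
def is_embeddable_as_documented(s: str) -> bool:
    """What the old docstring claimed: at least one alphanumeric character."""
    return any(ch.isalnum() for ch in s)


def is_embeddable_as_implemented(s: str) -> bool:
    """What the code was described as doing: at least one non-whitespace character."""
    return any(not ch.isspace() for ch in s)


for sample in ["hello", "?!...", "   "]:
    print(repr(sample), is_embeddable_as_documented(sample), is_embeddable_as_implemented(sample))
# 'hello' True True
# '?!...' False True
# '   ' False False
```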

5. Serialization that broke on its own success

SearchResultPayload had two separate problems. First, its serialization logic couldn't properly handle nested Pydantic models, UUIDs, or collections inside result_object — it needed a real recursive serializer, not ad-hoc handling. Second, and sneakier: the result-resolution logic used a truthiness check, so a legitimately empty list, empty string, or empty dict in completion/context was treated as "nothing here" and silently fell back to different behavior — even though an empty result is still a valid result.

Fix: wrote a recursive serialize_value() helper covering BaseModel, UUIDs, lists/tuples/sets, and dicts, and replaced the truthiness check with an explicit is not None check so falsy-but-valid values are returned correctly. Added tests for both the complex serialization case and the falsy-completion case.
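Here is a simplified sketch of both ideas using only the standard library. It is not the code from the PR: dataclasses stand in for Pydantic's BaseModel, and `resolve_result` and the `Hit` class are hypothetical names for illustration.

```python
from dataclasses import asdict, dataclass, is_dataclass
from uuid import UUID


def serialize_value(value):
    """Recursively convert nested values into JSON-friendly types."""
    if is_dataclass(value) and not isinstance(value, type):
        return serialize_value(asdict(value))  # dataclass instance -> dict, then recurse
    if isinstance(value, UUID):
        return str(value)
    if isinstance(value, dict):
        return {str(k): serialize_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [serialize_value(v) for v in value]
    return value


def resolve_result(completion, context):
    """Prefer completion unless it is None; empty values are still valid results."""
    # The buggy form would be `completion or context`, which discards [], "" and {}.
    return completion if completion is not None else context


@dataclass
class Hit:
    id: UUID
    tags: tuple


payload = {"hits": [Hit(UUID(int=1), ("a", "b"))], "empty": []}
print(serialize_value(payload))
# {'hits': [{'id': '00000000-0000-0000-0000-000000000001', 'tags': ['a', 'b']}], 'empty': []}

print(resolve_result([], "fallback"))    # []
print(resolve_result(None, "fallback"))  # fallback
```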

What this actually taught me

None of these are headline bugs — no security holes, no crashes that scream at you in production. They're the quiet kind: a retry that should never retry, a zip that's misaligned by one, a config that lies about being configurable, a docstring that's just wrong, a truthy/falsy mixup that throws away valid empty results. The kind you only find by actually reading the code path end to end instead of skimming the README and writing a demo on top of it.

Coming from a SOC/security background, that's basically the instinct I brought here: don't trust the surface, trace the actual data flow. Turns out that instinct travels well into "is this open-source memory layer solid enough to build agents on."

All five PRs are open and awaiting review as of writing. I'll update this post once they're merged — but whether or not all five land, this was a better use of hackathon week than shipping a demo I'd have to explain away in the README.

PRs: #3565 · #3566 · #3567 · #3568 · #3569

All code, fixes, and pull requests in this post are my own work. I used Claude (AI assistant) to help structure and draft the writeup, as disclosed per the hackathon rules.

Built for The Hangover Part AI by @wemakedevs, powered by Cognee.


Why Your RAG System Needs Hybrid Search (And How to Actually Implement It)

AlaiKrm — Tue, 30 Jun 2026 05:33:48 +0000

Vector similarity search is powerful but it has a well-known weakness: exact term matching. If a user searches for "SOC 2 Type II report" and your documents contain that exact phrase, a well-tuned vector search will find them. But if the query is "security certification audit document" and the document says "SOC 2 Type II," the semantic match might miss it depending on how the embedding model handles that specific terminology.

The solution is hybrid search: combining vector similarity search with traditional keyword search and merging the results. Most production RAG systems I have reviewed that are performing below expectations are doing vector-only search. Adding hybrid search is one of the highest-leverage improvements available.

Here is how to implement it properly.

The two search types and what each catches

Dense retrieval (vector search) is good at: semantic similarity, paraphrase matching, concept-level queries, finding relevant content even when exact terms differ. It struggles with: rare terms, product names, codes, identifiers, and precise technical terminology where exact matching matters.

Sparse retrieval (keyword search) is good at: exact term matching, rare words, codes, identifiers, and queries where the user knows the specific terminology used in the document. It struggles with: synonyms, paraphrases, and concept-level queries where the words differ from the document.

Hybrid search combines both. You retrieve candidates from each system separately and then merge and re-rank.

Implementation with Reciprocal Rank Fusion

The simplest and most effective merging strategy is Reciprocal Rank Fusion. It does not require knowing the score scale of either system, just the rank positions.
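Written out, the weighted form of the fused score used in this article looks like the following. The symbols are my own notation: w_s is the weight for retriever s, and r_s(d) is the 1-based rank of document d in that retriever's list. A retriever that did not return d contributes nothing to the sum.

```latex
\[
\mathrm{score}(d) \;=\; \sum_{s \in \{\mathrm{dense},\,\mathrm{sparse}\}} \frac{w_s}{k + r_s(d)}
\]
```

With k = 60 and equal weights of 0.5, a document ranked first by both retrievers scores 0.5/61 + 0.5/61 = 1/61 ≈ 0.0164, while a document ranked first by only one of them scores 0.5/61 ≈ 0.0082. Appearing in both lists matters more than the exact position in either.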

from typing import List, Dict, Tuple

def reciprocal_rank_fusion(
    dense_results: List[Tuple[str, float]],
    sparse_results: List[Tuple[str, float]],
    k: int = 60,
    dense_weight: float = 0.5,
    sparse_weight: float = 0.5
) -> List[str]:
    """
    dense_results: list of (doc_id, score) from vector search
    sparse_results: list of (doc_id, score) from keyword search
    k: RRF constant (60 is standard default)
    Returns: list of doc_ids ranked by fused score
    """
    scores: Dict[str, float] = {}

    for rank, (doc_id, _) in enumerate(dense_results):
        rrf_score = dense_weight * (1 / (k + rank + 1))
        scores[doc_id] = scores.get(doc_id, 0) + rrf_score

    for rank, (doc_id, _) in enumerate(sparse_results):
        rrf_score = sparse_weight * (1 / (k + rank + 1))
        scores[doc_id] = scores.get(doc_id, 0) + rrf_score

    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

Wiring it up with Elasticsearch for the sparse side

Most enterprise environments already have Elasticsearch or OpenSearch running. Use it for your sparse retrieval.

from typing import List, Tuple

from elasticsearch import Elasticsearch

# Local development endpoint; point this at your own cluster.
es = Elasticsearch(["http://localhost:9200"])

def sparse_search(query: str, index: str, top_k: int = 20) -> List[Tuple[str, float]]:
    response = es.search(
        index=index,
        body={
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["content^2", "title^3", "metadata.section"],
                    "type": "best_fields"
                }
            },
            "size": top_k
        }
    )
    return [
        (hit["_id"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]

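def dense_search(query: str, vectorstore, top_k: int = 20) -> List[Tuple[str, float]]:
    # Assumes a LangChain-style vector store. Depending on the store, the score may be
    # a distance or a similarity; RRF only uses the order, which is best match first.
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    return [(doc.metadata["doc_id"], score) for doc, score in results]

def hybrid_search(query: str, vectorstore, es_index: str, top_k: int = 10) -> List[str]:
    # Over-fetch 20 candidates from each retriever, fuse by rank, then trim to top_k.
    dense = dense_search(query, vectorstore, top_k=20)
    sparse = sparse_search(query, es_index, top_k=20)
    fused = reciprocal_rank_fusion(dense, sparse)
    return fused[:top_k]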

Tuning the weights

The default 50/50 weight split is a reasonable starting point. For query types where exact terminology matters heavily (compliance documents, technical specifications, product names), skew toward sparse. For conceptual queries where paraphrasing is common, skew toward dense.

You can measure this empirically with your evaluation set. Run 50/50, 70/30 dense-heavy, and 30/70 sparse-heavy on the same query set and compare recall at k. The results will tell you where to set the production weights.
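A sweep like that can be sketched in a few lines. Everything below is illustrative: the `rrf` helper is a compact restatement of the fusion logic so the snippet runs on its own, and the evaluation queries, document IDs, and relevance labels are toy data invented for the example.

```python
def rrf(dense_ids, sparse_ids, dense_weight, sparse_weight, k=60):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc IDs."""
    scores = {}
    for weight, ids in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def sweep(eval_set, weightings, k):
    """Mean recall@k for each (dense_weight, sparse_weight) pair."""
    report = {}
    for dense_w, sparse_w in weightings:
        recalls = [
            recall_at_k(rrf(q["dense"], q["sparse"], dense_w, sparse_w), q["relevant"], k)
            for q in eval_set
        ]
        report[(dense_w, sparse_w)] = sum(recalls) / len(recalls)
    return report


# Toy evaluation set: per query, the ranked IDs from each retriever plus labelled relevant IDs.
eval_set = [
    {"dense": ["a1", "a2"], "sparse": ["a3", "a4"], "relevant": {"a3"}},
    {"dense": ["b1", "b2"], "sparse": ["b3", "b4"], "relevant": {"b3"}},
    {"dense": ["c1", "c2"], "sparse": ["c3", "c4"], "relevant": {"c1"}},
]

report = sweep(eval_set, [(0.5, 0.5), (0.7, 0.3), (0.3, 0.7)], k=1)
for weights, recall in report.items():
    print(weights, round(recall, 2))
```

On this toy data the sparse-heavy split scores higher only because two of the three invented queries are exact-term lookups. The point is the shape of the comparison, not the numbers.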

In my experience, most enterprise knowledge base deployments benefit from a slight sparse-heavy weighting around 40/60 dense/sparse because enterprise documents tend to use precise technical terminology that benefits from exact matching. Tune to your actual content.

One gotcha

Document IDs need to be consistent between your vector store and your Elasticsearch index. If you use different identifiers in the two systems, the RRF merge will not find overlapping results correctly. Use the source document path or a stable UUID as the canonical identifier and store it in both systems at ingestion time.
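One way to get a stable identifier is to derive it deterministically from the source path and chunk position, then write the same value to both systems. The sketch below uses `uuid.uuid5` for that. The namespace URL, the `chunk_id` helper, and the record shapes are assumptions made for illustration, and the actual write calls to your vector store and Elasticsearch are left out.

```python
import uuid

# Hypothetical namespace for this knowledge base; any fixed UUID works as long as it never changes.
KB_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/kb")


def chunk_id(source_path: str, chunk_index: int) -> str:
    """Deterministic ID: the same path and chunk index always yield the same UUID."""
    return str(uuid.uuid5(KB_NAMESPACE, f"{source_path}#{chunk_index}"))


def build_records(source_path: str, chunks: list[str]):
    """Prepare matching records for the vector store and the keyword index."""
    vector_records, keyword_records = [], []
    for index, text in enumerate(chunks):
        doc_id = chunk_id(source_path, index)
        # Vector side: the ID lives in metadata, where dense_search reads it back.
        vector_records.append({"text": text, "metadata": {"doc_id": doc_id}})
        # Keyword side: the same ID is used as the Elasticsearch document _id.
        keyword_records.append({"_id": doc_id, "content": text})
    return vector_records, keyword_records


vectors, keywords = build_records("policies/leave.md", ["Annual leave...", "Sick leave..."])
print(vectors[0]["metadata"]["doc_id"] == keywords[0]["_id"])  # True
```

Because the ID is a pure function of path and position, re-ingesting an unchanged document produces the same IDs, so updates overwrite rather than duplicate.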

Hybrid search adds meaningful complexity to your retrieval pipeline. In most enterprise deployments where I have added it to a previously vector-only system, recall at k=5 improved by 15 to 25 percentage points on the evaluation set. For a knowledge base that employees rely on for accurate answers, that improvement is worth the implementation effort.