Gao Dalie (Ilyass)

Posted on Jun 13

LangChain + Segment Any Text + RAG = The Key to Understanding Your Documents

#datascience #machinelearning #programming #rag

In this story, I have a quick tutorial showing you how to combine LangChain, Segment Any Text (SAT), and RAG to build a powerful agent chatbot for your business or personal use.

When developing RAG, a commonly overlooked but crucial pain point is how to avoid the semantic fragmentation caused by fixed-size token or character splitting, where a chunk boundary can fall in the middle of a sentence.
RAG relies heavily on retrieving relevant documents from external sources to improve its responses. This dependence can lead to challenges in maintaining and updating the knowledge base.
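To make the fragmentation problem concrete, here is a minimal sketch in plain Python. It is not the LangChain splitter (which prefers to break on separators such as newlines before falling back to raw characters); it is the simplest fixed-window case, and the function name, sample text, and sizes are made up for illustration.

```python
def fixed_size_chunks(text, chunk_size, overlap):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early so the last window is never just a repeat of the previous overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Illustrative sample text (hypothetical, not taken from the PDF).
sample = (
    "Apple deferred iPhone revenue over two years. "
    "Analysts argued this hid the product's real impact. "
    "New accounting rules later changed that."
)

chunks = fixed_size_chunks(sample, chunk_size=60, overlap=10)
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

Printing the chunks shows that the window boundaries ignore sentence ends entirely, so a retriever can receive half a sentence as its "unit of meaning". The overlap softens this but does not remove it.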

RAG is a generative framework designed for tasks such as question answering and summarisation; however, it does not address the fundamental challenge of effectively understanding or segmenting raw, unstructured, or noisy text before generation.

RAG also assumes that input texts are reasonably well-structured. It does not inherently solve issues arising from unpunctuated, informal, or unstructured input data, which are common in user-generated content.

When the number of documents increases from a few to hundreds, this problem becomes extremely challenging, not only time-consuming but also prone to errors and difficult to scale.

The Segment Any Text (SAT) model cleverly solves this problem through intelligent segmentation technology driven by neural networks. It is not a replacement for RAG, but a powerful front-end enhancement layer of RAG, which significantly reduces the risk of hallucination generated downstream by ensuring the semantic integrity of each text block.

Segment Any Text (SAT) fills this gap by providing a universal, efficient, and robust sentence segmentation method that improves the foundational step in many NLP pipelines, enabling more accurate and scalable downstream tasks, including but not limited to those augmented by RAG.

SAT is not just a concept: pretrained models are available, and it already powers practical tools such as ContextGem, a free and open-source LLM framework developed by Sergii Shcherbak.

ContextGem is designed for structured data extraction from documents, and its goal of “zero boilerplate code” changes how much setup an extraction task needs.

So, let me give you a quick demo of a live chatbot to show you what I mean.


I will ask the chatbot a question: “Why did Steve Jobs feel subscription accounting understated iPhone’s importance?” Feel free to ask any questions you want.

If you look at how the chatbot generates the output, you will see that when I enter a query, the agent loads a PDF file, then splits the content into overlapping 1000-character chunks using RecursiveCharacterTextSplitter. Each chunk is embedded into a vector using OpenAI Embeddings and stored in a FAISS vector store for fast similarity searches.

It uses ContextGem to analyse document structure: a Document model represents the full content of the document, and Concept models capture the actual information, insights, or conclusions.

Then it connects everything to LangChain, setting up a RetrievalQA chain that uses the FAISS retriever to answer questions based on similar text chunks. Finally, it calls get_concept_data() to fetch structured insights and ask_question() to get the answer.

So, by the end of this story, you will understand what ContextGem is, what the difference between SAT and RAG is, how SAT can serve as an enabling tool for RAG, and how we are going to use LangChain, Segment Any Text, and RAG to create a powerful agentic chatbot.

What is ContextGem?

ContextGem is an emerging framework that focuses on transforming unstructured documents into precise, structured data. Through its document-centric design and neural segmentation technology (SAT), ContextGem can not only serve as a pre-processor for RAG or a perception module for an agent, but can also be used independently, providing a flexible and efficient solution for data processing.

Core Components

ContextGem structures document analysis through three core models — Document, Aspect, and Concept — that work together to enable accurate and context-aware information extraction.

The Document model represents the full content of various types of documents.

The Aspect model focuses on specific themes or topics within documents, offering a structured and hierarchical way to narrow down areas of interest.

The Concept model captures the actual information, insights, or conclusions, whether simple facts or complex evaluations, linked to either specific aspects or the entire document.
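The relationship between the three models can be pictured with a few plain dataclasses. This is only an illustrative sketch of the hierarchy described above: the class names (SketchDocument, SketchAspect, SketchConcept), their fields, and the example aspect are hypothetical and are not ContextGem's own classes.

```python
from dataclasses import dataclass, field


@dataclass
class SketchConcept:
    """A piece of information to extract, such as a fact or an evaluation."""
    name: str
    description: str
    extracted_items: list = field(default_factory=list)


@dataclass
class SketchAspect:
    """A theme within the document that narrows where concepts are looked for."""
    name: str
    description: str
    concepts: list = field(default_factory=list)


@dataclass
class SketchDocument:
    """The full text plus the aspects and document-level concepts defined on it."""
    raw_text: str
    aspects: list = field(default_factory=list)
    concepts: list = field(default_factory=list)

    def all_concepts(self):
        """Return document-level concepts followed by those attached to each aspect."""
        found = list(self.concepts)
        for aspect in self.aspects:
            found.extend(aspect.concepts)
        return found


sketch_doc = SketchDocument(raw_text="(full document text goes here)")
# A concept linked to the whole document.
sketch_doc.concepts.append(SketchConcept("AppleFinancials", "Key financial figures"))
# A concept linked to one specific aspect.
accounting = SketchAspect("Accounting", "Passages about revenue recognition rules")
accounting.concepts.append(SketchConcept("AccountingChanges", "Changes to the rules"))
sketch_doc.aspects.append(accounting)

print([concept.name for concept in sketch_doc.all_concepts()])
```

The point of the sketch is the shape: a concept can hang off the whole document or off a single aspect, which is what lets an extraction be scoped to a theme.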

RAG Vs SAT

The key difference between SAT (Segment Any Text) and RAG (Retrieval-Augmented Generation) lies in their roles within natural language processing (NLP) systems. SAT is a preprocessing tool designed to segment raw, unstructured, and often noisy or unpunctuated text into clear, coherent sentences.

It is language-agnostic, robust across domains, and focused solely on accurately identifying sentence boundaries to improve input quality for downstream tasks.

In contrast, RAG is a generative framework that enhances language models by retrieving relevant documents from external sources to generate informed responses, typically used in tasks like question answering and summarization.

While SAT prepares and structures the input text, RAG focuses on retrieving and generating content based on external knowledge. Used together, SAT improves data quality before processing, and RAG builds on that foundation to produce more accurate and knowledge-rich outputs.

How SAT can serve as an enabling tool for RAG

It can be understood this way: SAT is not a substitute for RAG, but a powerful enabling tool and front-end enhancement layer of the RAG system.

In a modern RAG architecture, SAT can be used as a pre-chunking processor that provides the retriever with higher-quality text units, fundamentally improving search quality.

The contribution of SAT is that it addresses the “garbage in, garbage out” problem: no matter how advanced your embedding model, vector database, and search algorithm are, if the input text blocks are semantically fragmented, the errors accumulate irreversibly and limit the quality of the final search and generation.

This problem is also one of the important and hidden reasons why many RAG systems generate hallucinations.

Through SAT intelligent segmentation, the RAG system obtains better semantic units and can:

More accurately match user query intent
Reduce irrelevant results
Provide more coherent contextual information to large language models
Significantly improve the quality and accuracy of the final generated content
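The idea of feeding the retriever whole semantic units can be sketched as follows. Note the loud assumption: the regex splitter below is a naive stand-in for a neural segmenter such as SAT, and it only works on well-punctuated text, which is exactly the limitation SAT is meant to remove. The function names and the sample text are made up for illustration.

```python
import re


def split_sentences(text):
    """Naive stand-in for a neural segmenter: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, max_chars):
    """Greedily group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk,
    because a sentence is never cut in the middle.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Illustrative sample text (hypothetical, not taken from the PDF).
text = (
    "Apple deferred iPhone revenue over two years. "
    "Analysts argued this hid the product's real impact. "
    "New accounting rules later changed that."
)

sentence_chunks = pack_sentences(split_sentences(text), max_chars=100)
for chunk in sentence_chunks:
    print(repr(chunk))
```

Every chunk now starts and ends on a sentence boundary, so each embedded unit carries a complete thought. Swapping split_sentences for a real segmenter keeps the packing logic unchanged.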

Let’s Start Coding

Let us now explore step by step how to create the SAT and RAG app. First, we install the libraries that support the model with pip. The requirements file should list the packages behind the imports used below: langchain, langchain-community, langchain-openai, faiss-cpu, pypdf, and contextgem.

pip install -r requirements.txt

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.

ContextGem is an open-source solution for a key aspect of legal tech: extracting structured data and insights from documents.

ContextGem describes itself as “a flexible and intuitive framework that extracts structured data and insights from documents with minimal effort.”

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from contextgem import Document, DocumentLLM, StringConcept

I developed a process to load and prepare a PDF document for use in a Retrieval-Augmented Generation system using LangChain. I loaded the PDF file with PyPDFLoader.

Next, I created a text splitter using RecursiveCharacterTextSplitter to break the PDF content into manageable chunks: each chunk is at most 1000 characters long, with a 200-character overlap to maintain context between sections.

I then generated vector embeddings for these chunks using OpenAI Embeddings, pulling my API key from the environment variable OPENAI_API_KEY. I stored the resulting vectors in a FAISS vector store, which allows me to perform fast similarity searches later.

# 1. Load PDF with Langchain
pdf_path = "Apple_iPhone_Revenue_Bomb.pdf"  # Replace with your actual path
loader = PyPDFLoader(pdf_path)
langchain_docs = loader.load()

# 2. Process for vector database (RAG)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
pdf_chunks = text_splitter.split_documents(langchain_docs)
embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get("OPENAI_API_KEY"))
vectorstore = FAISS.from_documents(pdf_chunks, embeddings)

# 3. Combine for ContextGem (full document extraction)
# Join the original pages rather than the overlapping chunks, so the
# 200-character overlaps are not duplicated in the text ContextGem sees.
full_text = "\n".join(doc.page_content for doc in langchain_docs)

I created a structured information extraction system using ContextGem to identify key insights from the full text of the Apple iPhone revenue PDF.

First, I built a Document object from the entire raw text, which I combined from the loaded PDF pages rather than from the overlapping chunks, so that overlapped passages are not duplicated. Then, I defined a list of StringConcepts, each representing a specific type of information I wanted to extract.

These included Revenue Recognition Issues, focusing on policy challenges; Apple Financials, capturing important metrics and figures; Accounting Changes, highlighting changes in revenue recognition rules along with supporting justifications; and Stakeholder Opinions, reflecting views from analysts, Apple leadership, and Wall Street.

I enabled references and justifications where necessary to ensure traceability and depth. To process the extraction, I instantiated a DocumentLLM using the openai/gpt-4o-mini model and passed in my API key from the environment.

Finally, I called llm.extract_all() to apply the model to my structured contextgem_doc, enabling automated, structured extraction of key business insights from the PDF.

# 4. ContextGem structured extraction
contextgem_doc = Document(raw_text=full_text)
contextgem_doc.concepts = [
    StringConcept(
        name="RevenueRecognitionIssues",
        description="Issues and challenges with iPhone revenue recognition policy",
        add_references=True,
        reference_depth="sentences"
    ),
    StringConcept(
        name="AppleFinancials",
        description="Key financial figures and metrics for Apple during the period",
        add_references=True,
        reference_depth="sentences"
    ),
    StringConcept(
        name="AccountingChanges",
        description="Proposed and implemented changes to revenue recognition accounting rules",
        add_references=True,
        add_justifications=True
    ),
    StringConcept(
        name="StakeholderOpinions",
        description="Views from Wall Street, analysts, and Apple management regarding accounting practices",
        add_references=True
    )
]

llm = DocumentLLM(
    model="openai/gpt-4o-mini",
    api_key=os.environ.get("OPENAI_API_KEY")
)

contextgem_doc = llm.extract_all(contextgem_doc)

I set up a LangChain-based RAG system to access both structured and unstructured data from the Apple iPhone revenue document.

First, I initialized a ChatOpenAI model using gpt-4o-mini with a temperature of 0 for more consistent answers, and turned my FAISS vector store into a retriever that returns the three most similar chunks.

I then built a RetrievalQA chain that uses this retriever to answer questions based on relevant chunks from the PDF.

To complement this, I created a get_concept_data function that queries structured insights extracted by ContextGem. It returns values, references, and justifications for a given concept name.

This setup lets me interact with the document both through open-ended questions and targeted concept lookups.

# 5. Setup Langchain RAG
llm_chain = ChatOpenAI(
    model_name="gpt-4o-mini",
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm_chain,
    chain_type="stuff",
    retriever=retriever
)

# 6. Create functions to use both capabilities
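def get_concept_data(concept_name):
    """Return data for a specific concept extracted by ContextGem"""
    # Assumption: extracted items expose reference_sentences / reference_paragraphs
    # (each with .raw_text) and a .justification string; the original reference
    # attribute names are kept as fallbacks. Verify against your ContextGem version.
    for concept in contextgem_doc.concepts:
        if concept.name == concept_name:
            return [{
                "value": item.value,
                "references": [
                    getattr(ref, "raw_text", None) or getattr(ref, "text", str(ref))
                    for ref in (getattr(item, "reference_sentences", None)
                                or getattr(item, "reference_paragraphs", None)
                                or getattr(item, "references", None) or [])
                ],
                "justification": getattr(item, "justification", None),
            } for item in concept.extracted_items]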
    return []
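What the retriever does with search_kwargs={"k": 3} can be sketched in plain Python. This is a toy nearest-neighbour search, not FAISS itself. It assumes squared Euclidean distance as the metric, which I believe matches the default of LangChain's FAISS wrapper. The three-dimensional vectors and the example sentences are invented for illustration.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, indexed_chunks, k=3):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = ((squared_l2(query_vec, vec), text) for text, vec in indexed_chunks)
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
toy_index = [
    ("iPhone revenue was deferred under subscription accounting.", [0.9, 0.1, 0.0]),
    ("Analysts looked at adjusted figures instead.", [0.7, 0.3, 0.1]),
    ("The company also sells laptops.", [0.0, 0.2, 0.9]),
    ("Rule changes allowed earlier revenue recognition.", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]

nearest = top_k(query_vector, toy_index, k=3)
for distance, text in nearest:
    print(f"{distance:.2f}  {text}")
```

A smaller distance means a closer match, so the unrelated sentence about laptops is the one left out. A vector index such as FAISS answers the same question, only much faster over many high-dimensional vectors.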

I created a simple interface to ask questions and display structured insights from the PDF. I defined a function called ask_question that takes a query and uses the LangChain RAG system to return an answer by invoking the qa_chain.

I printed out the structured information first: for the Revenue Recognition Issues, I looped through the extracted items, printing each issue along with its first reference sentence when available. Then, I printed the key figures under Apple Financials, this time showing only the values.

Finally, I passed an example question, “Why did Steve Jobs feel subscription accounting understated iPhone’s importance?”, to ask_question to get a contextual answer directly from the document using the RAG pipeline.

def ask_question(query):
    """Use Langchain RAG to answer questions about the document"""
    return qa_chain.invoke({"query": query})["result"]

# Example usage
print("REVENUE RECOGNITION ISSUES:")
for issue in get_concept_data("RevenueRecognitionIssues"):
    print(f"- {issue['value']}")
    if issue.get("references"):
        print(f"  Reference: {issue['references'][0]}")

print("\nAPPLE FINANCIALS:")
for financial in get_concept_data("AppleFinancials"):
    print(f"- {financial['value']}")

print("\nQUESTION ABOUT CASE:")
print(ask_question("Why did Steve Jobs feel subscription accounting understated iPhone's importance?"))
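For readers wondering what chain_type="stuff" means in the RetrievalQA setup: the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. The sketch below shows that idea only. The prompt wording and the build_stuff_prompt helper are hypothetical and are not LangChain's actual template, and the placeholder strings stand in for real retrieved chunks.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


stuffed_prompt = build_stuff_prompt(
    "Why did Steve Jobs feel subscription accounting understated iPhone's importance?",
    ["<retrieved chunk 1>", "<retrieved chunk 2>", "<retrieved chunk 3>"],
)
print(stuffed_prompt)
```

Because every chunk goes into one prompt, this approach is simple and works well with k=3, but the combined context must still fit inside the model's context window.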

Conclusion :

ContextGem completely changes the way structured data is extracted from documents, transforming complex extraction tasks into simple and intuitive declarative definitions. It eliminates the pain points of traditional methods and provides a comprehensive and powerful set of tools that allow developers to focus on business logic rather than underlying implementation details.

Whether you are working on legal contracts, research reports or client feedback, ContextGem helps you get the information you need faster and more accurately, unlocking the full value of your document data.


DEV Community