When Context Windows Meet Reality
Experimenting with RAG‑LCC on Constrained Hardware
Introduction: When context windows meet reality
The journey began with a simple question: is AI-based document classification feasible in practice? That question quickly led me into classification experiments, and almost immediately into a problem that will sound familiar to many practitioners: how do you classify large documents that do not fit into an LLM's context window?
In my case, the limitation was not even the model itself, but the hardware. I work with a second‑hand Tesla T4. It is not a recent card, but its single‑slot form factor and roughly 70‑watt power consumption make it attractive for a consumer workstation. In practice, extra cooling is usually required, but the more important point is the constraint itself: memory is limited, and context size becomes a real bottleneck very quickly.
What initially felt like a disadvantage turned out to be a forcing function. Instead of chasing ever larger context windows, I started questioning whether the model actually needs to see entire documents at all.
Key takeaways
- Context limits are often a hardware problem, not a model problem
- Simply “increasing context” is rarely viable on modest GPUs
- Constraints can lead to better architectural decisions
Rethinking context: classification instead of brute force
The obvious approach to large documents is to feed more text into the model. On a Tesla T4, that idea breaks down quickly. Rather than pushing context size, I flipped the problem around and asked a different question: why does the model need to see all of this text in the first place?
This shift led me to treat document classification not as a side task, but as a core architectural element. Instead of sending full documents downstream, I focused on extracting their meaning in a compact form.
Key takeaways
- Context size should be minimized, not maximized
- Classification can act as a form of semantic compression
- Architecture often matters more than raw model capability
Using KeyBERT to reduce documents to meaning
To make classification practical, I used KeyBERT. Rather than passing full documents to an LLM, KeyBERT reduces each document to a small list of meaning‑dense keywords. This allows downstream components to understand what a document is about without reading it in its entirety.
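KeyBERT's own API is a single call: `KeyBERT().extract_keywords(doc, top_n=5)` returns (keyword, score) pairs, ranked by embedding similarity between candidate terms and the whole document. Since that requires a sentence-transformer model, here is a dependency-free sketch of the underlying idea instead: compressing a document to a few weighted terms. Note that this toy version ranks by plain frequency, not by embeddings:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "on", "for"}

def crude_keywords(text: str, top_n: int = 5) -> list[tuple[str, float]]:
    """Frequency-based stand-in for KeyBERT's extract_keywords():
    reduce a document to a few weighted, meaning-dense terms."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    # Normalize counts into rough relevance weights.
    return [(w, round(n / total, 3)) for w, n in counts.most_common(top_n)]
```

Either way, the output is a handful of (term, weight) pairs, which is all the downstream components need to see.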
In practice, this worked far better than expected. However, it introduced a small but important usability issue: stemming. Words such as predator became predat, which is technically useful but confusing when results are presented to users.

Figure 1: KeyBERT document classification flow
To address this, I added a reverse‑stemming step. For each stem, candidate original words are tracked and weighted. The reconstruction is approximate—and can never be perfect—but it produces output that is readable and meaningful in CSV and Excel exports.
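The article does not show the exact bookkeeping, so here is a minimal sketch of the reverse-stemming idea: record which surface forms produced each stem and report the most frequent one. `crude_stem` is a toy stand-in for a real stemmer such as Porter's:

```python
from collections import Counter, defaultdict

def crude_stem(word: str) -> str:
    # Toy stemmer: strip a few common English suffixes.
    for suffix in ("ors", "or", "ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

class ReverseStemmer:
    """Track which original words produced each stem, weighted by frequency."""

    def __init__(self) -> None:
        self.candidates: defaultdict[str, Counter] = defaultdict(Counter)

    def observe(self, word: str) -> str:
        # Called once per token during extraction.
        stem = crude_stem(word.lower())
        self.candidates[stem][word.lower()] += 1
        return stem

    def readable(self, stem: str) -> str:
        # Most frequent surface form wins; fall back to the stem itself.
        counter = self.candidates.get(stem)
        return counter.most_common(1)[0][0] if counter else stem
```

The reconstruction stays approximate by design: if two different words share a stem, only the more frequent one survives into the exports.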
Key takeaways
- Keyword extraction dramatically reduces effective context size
- Stemming helps matching but hurts human readability
- Reverse stemming is a pragmatic compromise, not a perfect solution
When classification turned into routing
For a long time, classification and RAG felt like parallel ideas that never fully aligned. The breakthrough came with a simple question: why not use classification results to decide which documents even enter the RAG pipeline?
Since classification results were already stored in CSV files, SQLite’s CSV reader made them easy to query.
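One way to reproduce this, with hypothetical CSV columns, is to load the classification results into an in-memory SQLite table using Python's standard `csv` and `sqlite3` modules:

```python
import csv
import sqlite3

def load_csv_into_sqlite(csv_path: str, table: str = "classification") -> sqlite3.Connection:
    """Load a classification CSV into an in-memory SQLite table for querying."""
    conn = sqlite3.connect(":memory:")
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f"CREATE TABLE {table} ({cols})")
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    return conn

# Hypothetical query: which documents were classified as mammal-related English texts?
# rows = conn.execute(
#     "SELECT filename FROM classification WHERE mammal = 'yes' AND language = 'en'"
# ).fetchall()
```

From there, routing is just a SELECT statement: whatever the query returns is what gets ingested.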

Figure 2: Classification “mammal yes/no” of the TestDocs corpus
This effectively turned classification into a routing mechanism. Documents are first classified, results are stored, and structured queries determine which files are worth ingesting.

Figure 3: Route documents about mammals in English language to RAGLoad.py
This approach became query‑driven document routing. Instead of loading everything and filtering later, only relevant documents are sent downstream.
RAGChat and RAGChatService also allow on‑the‑fly switching of the collections a user is chatting with, which nicely complements the routing functionality.
Key takeaways
- Classification output is useful far beyond simple labeling
- Ingestion is an architectural decision, not a default step
- Early routing significantly reduces cost and complexity
Building the ingestion pipeline
Text extraction was the next challenge. Documents arrived in many formats, including scanned PDFs and images. Tesseract OCR proved invaluable for consistent text extraction across heterogeneous sources.
Once extracted, the text is chunked and prepared for embedding. Early versions of the pipeline loaded everything into the vector database. On reflection, this felt wrong. Loading inappropriate content can also introduce legal risks, for example with data‑protection regulations.
That realization led directly to the introduction of filtering before ingestion.
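The chunking step itself can be sketched with a character-based splitter; the sizes and overlap below are illustrative, not the project's actual settings:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into fixed-size chunks with overlap,
    so that sentences cut at a boundary still appear whole in one chunk."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Real pipelines usually split on sentence or paragraph boundaries instead of raw characters, but the overlap idea is the same.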
Key takeaways
- OCR is foundational when working with diverse document formats
- Chunking alone is insufficient—selectivity matters
- Not every document deserves a vector embedding
Filter chains: combining depth and breadth
Instead of relying on a single filtering signal, each chunk is evaluated by multiple algorithms. These include BM25, cosine similarity, Jaccard similarity, KeyBERT, and regular expressions combined with Levenshtein distance.
The crucial insight was that filtering requires two dimensions.
Depth defines how strong a single signal must be to count, while breadth defines how many independent algorithms must agree before a decision is taken.
One lesson became clear very quickly: a minimum depth threshold is essential. Without it, noise causes breadth triggers to fire constantly. Once tuned correctly, however, the depth‑and‑breadth combination proved far more reliable than simple keyword blocking.
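A minimal sketch of the depth-and-breadth decision, with hypothetical algorithm names and thresholds:

```python
def filter_chunk(chunk_scores: dict[str, float],
                 depth: float = 0.6, breadth: int = 2) -> bool:
    """Flag a chunk only when enough independent algorithms agree strongly.

    chunk_scores maps an algorithm name (e.g. 'bm25', 'cosine', 'jaccard')
    to a normalized match score against the banned-term list.
    depth:   minimum score for a single signal to count at all.
    breadth: how many algorithms must exceed the depth threshold.
    """
    strong_signals = [name for name, score in chunk_scores.items() if score >= depth]
    return len(strong_signals) >= breadth
```

Without the depth threshold, many weak scores would accumulate into a breadth trigger; with it, only genuinely strong signals are allowed to vote.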
Key takeaways
- Single filters are unreliable when used in isolation
- Depth prevents noise from dominating decisions
- Breadth enforces consensus across algorithms
Multilingual documents and semantic drift
Handling non‑English documents added another layer of complexity. Banned words are maintained in English, while content may be written in German or other languages. Argos Translate is used to translate banned terms and cache the results.
KeyBERT naturally accounts for semantic similarity, while other algorithms rely on WordNet to expand synonyms. In practice, this combination allowed translated concepts to be detected without overly aggressive filtering.
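The translate-and-cache pattern can be sketched as follows. The dictionary-backed translator is a toy stand-in for Argos Translate (whose entry point is roughly `argostranslate.translate.translate(text, from_code, to_code)`; treat that signature as an assumption):

```python
import functools

def translate_en_to_de(term: str) -> str:
    # Toy stand-in for the actual Argos Translate call.
    demo = {"horse": "Pferd", "saddle": "Sattel"}
    return demo.get(term, term)

@functools.lru_cache(maxsize=None)
def banned_term_in(term: str, lang: str) -> str:
    """Translate a banned term once per language and cache the result,
    so the offline translation cost is paid only on first use."""
    if lang == "en":
        return term
    return translate_en_to_de(term)
```

Caching matters here because the banned-term list is queried for every chunk, while its translations change essentially never.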

Figure 4: “horse”, “saddle” and “western riding” are recognized in German as well
Key takeaways
- Translation must be contextual, not absolute
- Multiple weak signals outperform single strong assumptions
- Multilingual filtering benefits from redundancy
Sanitizing instead of blocking
Not all problematic content needs to be rejected. In some cases, sanitization is sufficient. For this purpose, a masking step applies regular‑expression‑based sanitization early in the pipeline, directly to the extracted text.
This ensures that all downstream consumers automatically work with cleaned input, without duplicating sanitization logic.
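A minimal masking step might look like this; the patterns are illustrative examples, not the project's actual rules:

```python
import re

# Illustrative patterns only; real rules would come from configuration.
SANITIZE_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "[CARD]"),
]

def sanitize(text: str) -> str:
    """Mask sensitive patterns right after text extraction,
    so every downstream consumer already sees cleaned input."""
    for pattern, replacement in SANITIZE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Because this runs directly on the extracted text, neither the chunker, the vector store, nor the chat interface ever handles the raw values.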
Key takeaways
- Blocking is not always the correct response
- Early sanitization simplifies downstream processing
- Pipelines benefit from clear responsibility boundaries
Avoiding unnecessary work
Document ingestion is expensive, especially during experimentation. To avoid reprocessing unchanged documents, each file is hashed and the hash is stored as metadata. Unchanged documents are skipped entirely.
The same concept applies to documents flagged for human review. These are placed on an exclusion list and can be ignored in subsequent runs, preventing repeated processing of known problem cases.
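Both skip conditions fit in a few lines; the hash store and exclusion list are assumed here to be simple in-memory structures, whereas the real project persists them as metadata:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Stable content hash used to detect unchanged documents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_ingestion(path: Path, stored_hashes: dict[str, str],
                    excluded: set[str]) -> bool:
    """Skip documents that are unchanged or flagged for human review."""
    name = path.name
    if name in excluded:
        return False  # known problem case, awaiting review
    return stored_hashes.get(name) != file_hash(path)
```

The check is cheap relative to OCR plus embedding, so it pays for itself after a single skipped document.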
Key takeaways
- Hashing is a simple but powerful optimization
- Skipping unnecessary work matters as much as doing work
- Human‑review flags should influence pipelines, not just reports
Chat interfaces and usability
Interaction started with a simple CLI‑style chat interface. Over time, features such as session‑based history, named chat contexts, and dynamic collection switching were added.
When I decided to open‑source the project, I integrated Open WebUI to provide a more accessible GUI. This also marked the point where thread safety became a requirement and singleton patterns had to be removed.

Figure 5: Open WebUI integration through RAGChatService.py
Key takeaways
- Usability evolves together with the project
- GUI integration forces architectural maturity
- Thread safety becomes unavoidable in real‑world usage
Prompt validation and safety
User prompts go through two validation stages. First, they are checked by the same filter chains used for documents. If they pass, a dedicated prompt‑checking LLM performs a second evaluation.
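Structurally, the two stages compose as a short-circuiting pipeline. The checks below are toy stand-ins: stage one for the document filter chains, stage two for the prompt-guarding LLM:

```python
def stage_one(prompt: str) -> bool:
    # Stand-in for the document filter chains (BM25, cosine, Jaccard, ...).
    banned = {"ignore previous instructions"}
    return not any(b in prompt.lower() for b in banned)

def stage_two(prompt: str) -> bool:
    # Stand-in for the dedicated prompt-checking LLM; the real call would
    # go out to a local guard model rather than a string check.
    return "system prompt" not in prompt.lower()

def validate_prompt(prompt: str) -> bool:
    """Run the cheap heuristic stage first; invoke the guard LLM
    only for prompts that pass it."""
    return stage_one(prompt) and stage_two(prompt)
```

Ordering matters: the cheap filters reject obvious cases immediately, so the expensive LLM check only runs on prompts that already look plausible.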

Figure 6: The first prompt is caught by the filter chain, the second by the prompt‑guarding LLM
This two‑stage approach proved significantly more reliable than relying on a single mechanism.
Key takeaways
- Prompts deserve the same level of scrutiny as documents
- Combining heuristics with LLM checks improves reliability
- Layered decisions improve safety
Compliance and licensing
Releasing software for others to use comes with responsibilities. Downloaders for Hugging Face and Argos include explicit license checks. Once accepted, licenses are recorded so users only need to go through the process once.
For me, this was not just a legal requirement, but a matter of respect for upstream projects.
Key takeaways
- License compliance should be automated
- One‑time acceptance improves the user experience
- Open source includes ethical obligations
Configuration as a design philosophy
Nearly every aspect of the system is configurable, from debug verbosity to model selection and reranking behavior. Reranking, in particular, had an outsized impact on result quality.
Over time, configuration files became cluttered. Grouping related settings under higher‑level blocks improved clarity and enabled rapid switching between predefined setups.
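As a hypothetical sketch of the grouping idea, related settings live under one named block, and a single key switches the whole preset; the names and values are invented for illustration:

```python
# Invented preset names and values, illustrating grouped configuration.
CHUNK_STRATEGIES = {
    "precise": {"chunk_size": 300, "overlap": 30, "rerank": True, "top_k": 4},
    "broad":   {"chunk_size": 800, "overlap": 80, "rerank": True, "top_k": 10},
}

def active_strategy(name: str) -> dict:
    """Resolve one predefined setup; unknown names fail fast with a KeyError."""
    return CHUNK_STRATEGIES[name]
```

Switching an experiment then means changing one preset name instead of editing half a dozen scattered values.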

Figure 7: Two of the predefined chunk selection strategies
Key takeaways
- Configuration enables experimentation without code changes
- Reranking often matters more than retrieval
- Structure matters as much in config files as in code
Offline‑first by design
Operating without an internet connection was a hard requirement. After the initial setup, the system should run entirely offline. To validate this, a network tracer reports any external connections at runtime.
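In Python, one way to build such a tracer (not necessarily the project's implementation) is an audit hook that records every `socket.connect` attempt at runtime:

```python
import socket
import sys

connections: list = []

def audit_hook(event: str, args) -> None:
    # Record every attempt to open a network connection, successful or not.
    if event == "socket.connect":
        _sock, address = args
        connections.append(address)

# Audit hooks are process-global and cannot be removed once installed.
sys.addaudithook(audit_hook)
```

After a full run, an empty `connections` list is evidence, rather than an assumption, that the system stayed offline.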
Key takeaways
- Offline capability must be verified, not assumed
- Visibility into network behavior builds trust
- Constraints simplify threat models
Documentation and reflection
Significant effort went into documentation, including class diagrams, flow charts, and markdown guides. RAG‑LCC follows a configuration‑first approach, which can feel overwhelming at first but ultimately empowers users to experiment deeply.

Figure 8: DocClassify.py overview
Looking back, several decisions stood out. Reranking was worth every hour invested.

Figure 9: Output at standard debug level for a hedgehog query with reranking
Two‑stage prompt validation improved robustness, and hash‑based ingestion avoidance dramatically reduced iteration time. On the other hand, implementing both blocking and non‑blocking Ollama calls turned out to be more interesting than useful under tight hardware constraints.
Along the way, I also explored IBM’s Docling, an excellent document‑parsing toolkit for RAG pipelines and a valuable source of inspiration.
Key takeaways
- Constraints reward architectural thinking
- Not every technically interesting idea is worth keeping
- Learning sometimes matters more than outcomes
Closing thoughts
Every hour spent on this project was worth it. Working under tight constraints shaped the architecture in ways that brute‑force approaches never would have.
The biggest lesson is simple: on constrained hardware, increasing context size is often the wrong optimization. Classification, filtering, and routing can deliver better results at lower cost—and lead to systems that are easier to reason about.
Interestingly, modern LLM systems also rely heavily on selection, prioritization, and compression before generation. Raw context size alone is rarely the deciding factor.
The journey continues.
Project & Source Code
RAG‑LCC is open source and available on GitHub.
If you’d like to explore the code, experiment with the configuration, or provide feedback, you can find it here: