When Context Windows Meet Reality
Experimenting with RAG‑LCC on Constrained Hardware
Introduction: When context windows meet reality
The journey began with a simple question: is AI-based document classification feasible in practice? That question quickly led me into classification experiments, and almost immediately into a problem that will sound familiar to many practitioners: how do you classify large documents that do not fit into an LLM's context window?
In my case, the limitation was not even the model itself, but the hardware. I work with a second‑hand Tesla T4. It is not a recent card, but its single‑slot form factor and roughly 70‑watt power consumption make it attractive for a consumer workstation. In practice, extra cooling is usually required, but the more important point is the constraint itself: memory is limited, and context size becomes a real bottleneck very quickly.
What initially felt like a disadvantage turned out to be a forcing function. Instead of chasing ever larger context windows, I started questioning whether the model actually needs to see entire documents at all.
Key takeaways
- Context limits are often a hardware problem, not a model problem
- Simply “increasing context” is rarely viable on modest GPUs
- Constraints can lead to better architectural decisions
Rethinking context: classification instead of brute force
The obvious approach to large documents is to feed more text into the model. On a Tesla T4, that idea breaks down quickly. Rather than pushing context size, I flipped the problem around and asked a different question: why does the model need to see all of this text in the first place?
This shift led me to treat document classification not as a side task, but as a core architectural element. Instead of sending full documents downstream, I focused on extracting their meaning in a compact form.
Key takeaways
- Context size should be minimized, not maximized
- Classification can act as a form of semantic compression
- Architecture often matters more than raw model capability
Using KeyBERT to reduce documents to meaning
To make classification practical, I used KeyBERT. Rather than passing full documents to an LLM, KeyBERT reduces each document to a small list of meaning‑dense keywords. This allows downstream components to understand what a document is about without reading it in its entirety.
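KeyBERT's own API is a single call: `KeyBERT().extract_keywords(doc, top_n=5)` returns (keyword, score) pairs, ranked by embedding similarity between candidate terms and the whole document. Since that requires a sentence-transformer model, here is a dependency-free sketch of the underlying idea instead: compressing a document to a few weighted terms. Note that this toy version ranks by plain frequency, not by embeddings:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "on", "for"}

def crude_keywords(text: str, top_n: int = 5) -> list[tuple[str, float]]:
    """Frequency-based stand-in for KeyBERT's extract_keywords():
    reduce a document to a few weighted, meaning-dense terms."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    # Normalize counts into rough relevance weights.
    return [(w, round(n / total, 3)) for w, n in counts.most_common(top_n)]
```

Either way, the output is a handful of (term, weight) pairs, which is all the downstream components need to see.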
In practice, this worked far better than expected. However, it introduced a small but important usability issue: stemming. Words such as predator became predat, which is technically useful but confusing when results are presented to users.

Figure 1: KeyBERT document classification flow
To address this, I added a reverse‑stemming step. For each stem, candidate original words are tracked and weighted. The reconstruction is approximate—and can never be perfect—but it produces output that is readable and meaningful in CSV and Excel exports.
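The article does not show the exact bookkeeping, so here is a minimal sketch of the reverse-stemming idea: record which surface forms produced each stem and report the most frequent one. `crude_stem` is a toy stand-in for a real stemmer such as Porter's:

```python
from collections import Counter, defaultdict

def crude_stem(word: str) -> str:
    # Toy stemmer: strip a few common English suffixes.
    for suffix in ("ors", "or", "ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

class ReverseStemmer:
    """Track which original words produced each stem, weighted by frequency."""

    def __init__(self) -> None:
        self.candidates: defaultdict[str, Counter] = defaultdict(Counter)

    def observe(self, word: str) -> str:
        # Called once per token during extraction.
        stem = crude_stem(word.lower())
        self.candidates[stem][word.lower()] += 1
        return stem

    def readable(self, stem: str) -> str:
        # Most frequent surface form wins; fall back to the stem itself.
        counter = self.candidates.get(stem)
        return counter.most_common(1)[0][0] if counter else stem
```

The reconstruction stays approximate by design: if two different words share a stem, only the more frequent one survives into the exports.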
Key takeaways
- Keyword extraction dramatically reduces effective context size
- Stemming helps matching but hurts human readability
- Reverse stemming is a pragmatic compromise, not a perfect solution
When classification turned into routing
For a long time, classification and RAG felt like parallel ideas that never fully aligned. The breakthrough came with a simple question: why not use classification results to decide which documents even enter the RAG pipeline?
Since classification results were already stored in CSV files, SQLite’s CSV reader made them easy to query.
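One way to reproduce this, with hypothetical CSV columns, is to load the classification results into an in-memory SQLite table using Python's standard `csv` and `sqlite3` modules:

```python
import csv
import sqlite3

def load_csv_into_sqlite(csv_path: str, table: str = "classification") -> sqlite3.Connection:
    """Load a classification CSV into an in-memory SQLite table for querying."""
    conn = sqlite3.connect(":memory:")
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f"CREATE TABLE {table} ({cols})")
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    return conn

# Hypothetical query: which documents were classified as mammal-related English texts?
# rows = conn.execute(
#     "SELECT filename FROM classification WHERE mammal = 'yes' AND language = 'en'"
# ).fetchall()
```

From there, routing is just a SELECT statement: whatever the query returns is what gets ingested.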

Figure 2: Classification “mammal yes/no” of the TestDocs corpus
This effectively turned classification into a routing mechanism. Documents are first classified, results are stored, and structured queries determine which files are worth ingesting.

Figure 3: Route documents about mammals in English language to RAGLoad.py
This approach became query‑driven document routing. Instead of loading everything and filtering later, only relevant documents are sent downstream.
RAGChat and RAGChatService also allow on‑the‑fly switching of the collections a user is chatting with, which nicely complements the routing functionality.
Key takeaways
- Classification output is useful far beyond simple labeling
- Ingestion is an architectural decision, not a default step
- Early routing significantly reduces cost and complexity
Building the ingestion pipeline
Text extraction was the next challenge. Documents arrived in many formats, including scanned PDFs and images. Tesseract OCR proved invaluable for consistent text extraction across heterogeneous sources.
Once extracted, the text is chunked and prepared for embedding. Early versions of the pipeline loaded everything into the vector database. On reflection, this felt wrong. Loading inappropriate content can also introduce legal risks, for example with data‑protection regulations.
That realization led directly to the introduction of filtering before ingestion.
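The chunking step itself can be sketched with a character-based splitter; the sizes and overlap below are illustrative, not the project's actual settings:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into fixed-size chunks with overlap,
    so that sentences cut at a boundary still appear whole in one chunk."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Real pipelines usually split on sentence or paragraph boundaries instead of raw characters, but the overlap idea is the same.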
Key takeaways
- OCR is foundational when working with diverse document formats
- Chunking alone is insufficient—selectivity matters
- Not every document deserves a vector embedding
Filter chains: combining depth and breadth
Instead of relying on a single filtering signal, each chunk is evaluated by multiple algorithms. These include BM25, cosine similarity, Jaccard similarity, KeyBERT, and regular expressions combined with Levenshtein distance.
The crucial insight was that filtering requires two dimensions.
Depth defines how strong a single signal must be to count, while breadth defines how many independent algorithms must agree before a decision is taken.
One lesson became clear very quickly: a minimum depth threshold is essential. Without it, noise causes breadth triggers to fire constantly. Once tuned correctly, however, the depth‑and‑breadth combination proved far more reliable than simple keyword blocking.
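A minimal sketch of the depth-and-breadth decision, with hypothetical algorithm names and thresholds:

```python
def filter_chunk(chunk_scores: dict[str, float],
                 depth: float = 0.6, breadth: int = 2) -> bool:
    """Flag a chunk only when enough independent algorithms agree strongly.

    chunk_scores maps an algorithm name (e.g. 'bm25', 'cosine', 'jaccard')
    to a normalized match score against the banned-term list.
    depth:   minimum score for a single signal to count at all.
    breadth: how many algorithms must exceed the depth threshold.
    """
    strong_signals = [name for name, score in chunk_scores.items() if score >= depth]
    return len(strong_signals) >= breadth
```

Without the depth threshold, many weak scores would accumulate into a breadth trigger; with it, only genuinely strong signals are allowed to vote.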
Key takeaways
- Single filters are unreliable when used in isolation
- Depth prevents noise from dominating decisions
- Breadth enforces consensus across algorithms
Multilingual documents and semantic drift
Handling non‑English documents added another layer of complexity. Banned words are maintained in English, while content may be written in German or other languages. Argos Translate is used to translate banned terms and cache the results.
KeyBERT naturally accounts for semantic similarity, while other algorithms rely on WordNet to expand synonyms. In practice, this combination allowed translated concepts to be detected without overly aggressive filtering.
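The translate-and-cache pattern can be sketched as follows. The dictionary-backed translator is a toy stand-in for Argos Translate (whose entry point is roughly `argostranslate.translate.translate(text, from_code, to_code)`; treat that signature as an assumption):

```python
import functools

def translate_en_to_de(term: str) -> str:
    # Toy stand-in for the actual Argos Translate call.
    demo = {"horse": "Pferd", "saddle": "Sattel"}
    return demo.get(term, term)

@functools.lru_cache(maxsize=None)
def banned_term_in(term: str, lang: str) -> str:
    """Translate a banned term once per language and cache the result,
    so the offline translation cost is paid only on first use."""
    if lang == "en":
        return term
    return translate_en_to_de(term)
```

Caching matters here because the banned-term list is queried for every chunk, while its translations change essentially never.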

Figure 4: “horse”, “saddle” and “western riding” are recognized in German as well
Key takeaways
- Translation must be contextual, not absolute
- Multiple weak signals outperform single strong assumptions
- Multilingual filtering benefits from redundancy
Sanitizing instead of blocking
Not all problematic content needs to be rejected. In some cases, sanitization is sufficient. For this purpose, a masking step applies regular‑expression‑based sanitization early in the pipeline, directly to the extracted text.
This ensures that all downstream consumers automatically work with cleaned input, without duplicating sanitization logic.
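A minimal masking step might look like this; the patterns are illustrative examples, not the project's actual rules:

```python
import re

# Illustrative patterns only; real rules would come from configuration.
SANITIZE_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "[CARD]"),
]

def sanitize(text: str) -> str:
    """Mask sensitive patterns right after text extraction,
    so every downstream consumer already sees cleaned input."""
    for pattern, replacement in SANITIZE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Because this runs directly on the extracted text, neither the chunker, the vector store, nor the chat interface ever handles the raw values.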
Key takeaways
- Blocking is not always the correct response
- Early sanitization simplifies downstream processing
- Pipelines benefit from clear responsibility boundaries
Avoiding unnecessary work
Document ingestion is expensive, especially during experimentation. To avoid reprocessing unchanged documents, each file is hashed and the hash is stored as metadata. Unchanged documents are skipped entirely.
The same concept applies to documents flagged for human review. These are placed on an exclusion list and can be ignored in subsequent runs, preventing repeated processing of known problem cases.
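Both skip conditions fit in a few lines; the hash store and exclusion list are assumed here to be simple in-memory structures, whereas the real project persists them as metadata:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Stable content hash used to detect unchanged documents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_ingestion(path: Path, stored_hashes: dict[str, str],
                    excluded: set[str]) -> bool:
    """Skip documents that are unchanged or flagged for human review."""
    name = path.name
    if name in excluded:
        return False  # known problem case, awaiting review
    return stored_hashes.get(name) != file_hash(path)
```

The check is cheap relative to OCR plus embedding, so it pays for itself after a single skipped document.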
Key takeaways
- Hashing is a simple but powerful optimization
- Skipping unnecessary work matters as much as doing work
- Human‑review flags should influence pipelines, not just reports
Chat interfaces and usability
Interaction started with a simple CLI‑style chat interface. Over time, features such as session‑based history, named chat contexts, and dynamic collection switching were added.
When I decided to open‑source the project, I integrated Open WebUI to provide a more accessible GUI. This also marked the point where thread safety became a requirement and singleton patterns had to be removed.

Figure 5: Open WebUI integration through RAGChatService.py
Key takeaways
- Usability evolves together with the project
- GUI integration forces architectural maturity
- Thread safety becomes unavoidable in real‑world usage
Prompt validation and safety
User prompts go through two validation stages. First, they are checked by the same filter chains used for documents. If they pass, a dedicated prompt‑checking LLM performs a second evaluation.
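Structurally, the two stages compose as a short-circuiting pipeline. The checks below are toy stand-ins: stage one for the document filter chains, stage two for the prompt-guarding LLM:

```python
def stage_one(prompt: str) -> bool:
    # Stand-in for the document filter chains (BM25, cosine, Jaccard, ...).
    banned = {"ignore previous instructions"}
    return not any(b in prompt.lower() for b in banned)

def stage_two(prompt: str) -> bool:
    # Stand-in for the dedicated prompt-checking LLM; the real call would
    # go out to a local guard model rather than a string check.
    return "system prompt" not in prompt.lower()

def validate_prompt(prompt: str) -> bool:
    """Run the cheap heuristic stage first; invoke the guard LLM
    only for prompts that pass it."""
    return stage_one(prompt) and stage_two(prompt)
```

Ordering matters: the cheap filters reject obvious cases immediately, so the expensive LLM check only runs on prompts that already look plausible.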

Figure 6: The first prompt is caught by the filter chain, the second by the prompt‑guarding LLM
This two‑stage approach proved significantly more reliable than relying on a single mechanism.
Key takeaways
- Prompts deserve the same level of scrutiny as documents
- Combining heuristics with LLM checks improves reliability
- Layered decisions improve safety
Compliance and licensing
Releasing software for others to use comes with responsibilities. Downloaders for Hugging Face and Argos include explicit license checks. Once accepted, licenses are recorded so users only need to go through the process once.
For me, this was not just a legal requirement, but a matter of respect for upstream projects.
Key takeaways
- License compliance should be automated
- One‑time acceptance improves the user experience
- Open source includes ethical obligations
Configuration as a design philosophy
Nearly every aspect of the system is configurable, from debug verbosity to model selection and reranking behavior. Reranking, in particular, had an outsized impact on result quality.
Over time, configuration files became cluttered. Grouping related settings under higher‑level blocks improved clarity and enabled rapid switching between predefined setups.
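As a hypothetical sketch of the grouping idea, related settings live under one named block, and a single key switches the whole preset; the names and values are invented for illustration:

```python
# Invented preset names and values, illustrating grouped configuration.
CHUNK_STRATEGIES = {
    "precise": {"chunk_size": 300, "overlap": 30, "rerank": True, "top_k": 4},
    "broad":   {"chunk_size": 800, "overlap": 80, "rerank": True, "top_k": 10},
}

def active_strategy(name: str) -> dict:
    """Resolve one predefined setup; unknown names fail fast with a KeyError."""
    return CHUNK_STRATEGIES[name]
```

Switching an experiment then means changing one preset name instead of editing half a dozen scattered values.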

Figure 7: Two of the predefined chunk selection strategies
Key takeaways
- Configuration enables experimentation without code changes
- Reranking often matters more than retrieval
- Structure matters as much in config files as in code
Offline‑first by design
Operating without an internet connection was a hard requirement. After the initial setup, the system should run entirely offline. To validate this, a network tracer reports any external connections at runtime.
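In Python, one way to build such a tracer (not necessarily the project's implementation) is an audit hook that records every `socket.connect` attempt at runtime:

```python
import socket
import sys

connections: list = []

def audit_hook(event: str, args) -> None:
    # Record every attempt to open a network connection, successful or not.
    if event == "socket.connect":
        _sock, address = args
        connections.append(address)

# Audit hooks are process-global and cannot be removed once installed.
sys.addaudithook(audit_hook)
```

After a full run, an empty `connections` list is evidence, rather than an assumption, that the system stayed offline.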
Key takeaways
- Offline capability must be verified, not assumed
- Visibility into network behavior builds trust
- Constraints simplify threat models
Documentation and reflection
Significant effort went into documentation, including class diagrams, flow charts, and markdown guides. RAG‑LCC follows a configuration‑first approach, which can feel overwhelming at first but ultimately empowers users to experiment deeply.

Figure 8: DocClassify.py overview
Looking back, several decisions stood out. Reranking was worth every hour invested.

Figure 9: Output at standard debug level for a hedgehog query with reranking
Two‑stage prompt validation improved robustness, and hash‑based ingestion avoidance dramatically reduced iteration time. On the other hand, implementing both blocking and non‑blocking Ollama calls turned out to be more interesting than useful under tight hardware constraints.
Along the way, I also explored IBM’s Docling, an excellent document‑parsing toolkit for RAG pipelines and a valuable source of inspiration.
Key takeaways
- Constraints reward architectural thinking
- Not every technically interesting idea is worth keeping
- Learning sometimes matters more than outcomes
Closing thoughts
Every hour spent on this project was worth it. Working under tight constraints shaped the architecture in ways that brute‑force approaches never would have.
The biggest lesson is simple: on constrained hardware, increasing context size is often the wrong optimization. Classification, filtering, and routing can deliver better results at lower cost—and lead to systems that are easier to reason about.
Interestingly, modern LLM systems also rely heavily on selection, prioritization, and compression before generation. Raw context size alone is rarely the deciding factor.
The journey continues.
Project & Source Code
RAG‑LCC is open source and available on GitHub.
If you’d like to explore the code, experiment with the configuration, or provide feedback, you can find it here: