Jaskirat Singh

When Your LLM Starts Bleeding Context (And How I Fixed It)

LLM batch processing failing? Learn how context bleeding between data rows tanks accuracy—plus the linearization technique that dropped our error rate to under 15%.

Here’s the thing about working with LLMs at scale: they’re incredible until they’re not.

I learned this the hard way while processing thousands of user feedback entries for sentiment analysis. We’re talking serious volume here — the kind that makes individual processing a non-starter from both a time and cost perspective. So naturally, I went with batch processing.

Smart move, right?

Wrong.


The Day My Accuracy Tanked


Three weeks into production, I started noticing something weird. Our false positive and false negative rates were climbing. Not catastrophically, but enough to make me nervous.

The really frustrating part? When I spot-checked the problematic entries by processing them individually, they came back accurate.

That’s when I knew we had a problem.

After digging into the data (because of course I did — my academic research background doesn’t let me quit), I found a pattern. When feedback entries followed a continuous sentiment sequence — such as five positive reviews in a row — a sudden negative review would be misclassified as positive.

And vice versa.

The numbers were brutal:

  • False positives jumped 23% when negative feedback followed strings of positive reviews
  • False negatives climbed 18% in the reverse scenario
  • Overall accuracy dropped from 91% to 76% on edge cases

The LLM wasn’t treating each entry independently. It was bleeding context across rows, letting previous patterns bias current predictions.

Research shows that LLMs struggle to maintain strict contextual boundaries when processing sequential rows, leading to performance degradation as input length increases.

This is what people in the industry call faithfulness hallucination — when the model generates content that diverges from the actual input because it’s confused by the surrounding context.

Not exactly what you want when you're trying to make data-driven decisions.


Quick Fixes (That Actually Worked)

I needed solutions yesterday, so here’s what I tried first:

1. Batch Size Reduction

I slashed our batch size from 32 entries down to 8. Throughput took a hit, but accuracy improved immediately.

Studies using models like Llama3-70B show that pushing batch sizes beyond 64 often produces diminishing returns. There’s a sweet spot between efficiency and accuracy.
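The change itself is trivial to implement. A minimal sketch of the batching step — the size of 8 is just what worked for my workload, not a universal constant:

```python
from typing import Iterator, List

def chunk(entries: List[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Split entries into fixed-size batches; the last batch may be smaller."""
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]
```

Each yielded batch goes into its own API call, so no request ever carries more than eight rows of mutual context.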

2. Prompt Engineering for Independence

I rewrote our system prompt to explicitly hammer home row independence:

You are analyzing user feedback entries independently. 
Each entry must be evaluated solely on its own content 
without influence from previous entries. 

Treat each review as a completely separate analysis task. 
Do not allow patterns from earlier reviews to bias your 
assessment of subsequent reviews.

Did it help? Yes.

Was it enough? Not even close.


The Real Fix: Learning from AutoPK

Here’s where things get interesting.

I started researching how other teams were handling LLMs with structured data, and I came across the AutoPK framework. AutoPK demonstrates that LLMs often fail to preserve spatial and structural relationships when processing raw tables, requiring transformation to explicit key-value representations.

And honestly?

It changed everything.




Step 1: Linearize Your Data

Instead of feeding the LLM raw tabular rows, I transformed each entry into an explicit key-value format.

Before (What Everyone Does)

Row_ID, Review_Text, Sentiment
1, Great product, positive
2, Terrible service, negative

After (What Actually Works)

<ENTRY_1 @ Review: "Great product" | Sentiment: positive>
<ENTRY_2 @ Review: "Terrible service" | Sentiment: negative>

Why This Works

By converting each relevant cell into a key-value pair, the transformation abstracts away layout differences. This enables text-based models to more effectively extract information from tabular data.

You’re shifting the cognitive load:

  • ❌ Away from spatial reasoning (which LLMs struggle with)
  • ✅ Toward sequential text processing (which they’re actually good at)

Step 2: Add Few-Shot Examples

I included five examples in the linearized format before the actual task data. The examples showed the model exactly how to handle each entry independently.

Task: Analyze sentiment for each feedback entry independently.

Examples:
<ENTRY_A @ Review: "The interface is intuitive and fast" | Sentiment: positive>
<ENTRY_B @ Review: "Customer support was unhelpful" | Sentiment: negative>
<ENTRY_C @ Review: "Product works as described" | Sentiment: neutral>
<ENTRY_D @ Review: "Exceeded my expectations completely" | Sentiment: positive>
<ENTRY_E @ Review: "Shipping took three weeks" | Sentiment: negative>

Now analyze the following entries:
[Production data follows]

Research on TabLLM shows that this approach can outperform traditional deep-learning methods, particularly in few-shot scenarios with minimal labeled data.

We saw error rates drop from 60–95% down to under 15%.


Step 3: Simplify Your Input

I stripped out every column that wasn’t directly relevant to sentiment analysis.

No timestamps.

No URLs.

No user agents.

Just the entry ID and the feedback text.

Original Input (Noisy)

row_id, timestamp, user_id, session_id, feedback_text, page_url, user_agent, sentiment

Cleaned Input

entry_id, feedback_text, sentiment

Research confirms that heterogeneous features — ranging from dense numerical to sparse categorical — can confuse models, especially when columns contain information unrelated to the target task.

This single change reduced hallucination rates by about 12%.
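In pandas terms the cleanup is a one-liner. The column names below come from my schema — yours will differ:

```python
import pandas as pd

# Hypothetical raw export with the noisy columns listed above.
raw = pd.DataFrame({
    "row_id": [1, 2],
    "timestamp": ["2025-01-01T09:00", "2025-01-01T09:05"],
    "user_id": ["u_841", "u_212"],
    "feedback_text": ["Great product", "Terrible service"],
    "page_url": ["/checkout", "/support"],
})

# Keep only the columns the sentiment task actually needs.
cleaned = raw[["row_id", "feedback_text"]].rename(columns={"row_id": "entry_id"})
```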


Step 4: Control the Output Format

I explicitly instructed the model to return results in a structured format using only the provided identifiers:

Return results in the following format, using only the entry identifiers provided:

entry_id,predicted_sentiment,confidence
1,positive,0.92
2,negative,0.87

This prevents the model from inventing data or introducing information that wasn’t in the input.

It also makes validation way easier.
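Because the output schema is fixed, the validation itself becomes mechanical. A sketch of the check I run — the names are mine, and a production version would also handle malformed rows gracefully:

```python
import csv
import io

def validate_batch_output(raw_csv: str, expected_ids: set) -> list:
    """Parse the model's CSV reply and reject unknown or missing entry IDs."""
    rows = csv.reader(io.StringIO(raw_csv.strip()))
    results, seen = [], set()
    for entry_id, sentiment, confidence in rows:
        if entry_id not in expected_ids:
            # The model invented an identifier that was never in the input.
            raise ValueError(f"Unknown entry_id {entry_id!r} in model output")
        seen.add(entry_id)
        results.append((entry_id, sentiment, float(confidence)))
    missing = expected_ids - seen
    if missing:
        raise ValueError(f"Missing predictions for: {sorted(missing)}")
    return results
```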


The Trade-Offs Nobody Talks About

Look — these solutions aren’t free.

Here’s what I learned about the cost-performance balance:

  • Smaller batch sizes = better accuracy but worse throughput
  • Linearization + few-shot prompting = higher token consumption per entry
  • Individual processing = guaranteed independence but eye-watering API costs

Anthropic optimized Claude 3 with continuous batching, increasing throughput from 50 to 450 tokens per second while reducing latency and cutting GPU costs by 40%.

The point is: you need to know what you’re optimising for.

For our use case — customer feedback analysis where 85–90% accuracy was acceptable — optimized batch processing with linearization hit the sweet spot.

Your mileage may vary.
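To make the token trade-off concrete, here's a rough back-of-envelope. Every number below is an illustrative assumption, not a measurement:

```python
# All counts below are illustrative assumptions, not measurements.
few_shot_overhead = 120   # tokens for instructions + five linearized examples
tokens_per_entry = 25     # a linearized entry vs. roughly 15 for a raw CSV row
batch_size = 8

# The few-shot preamble is paid once per request, so larger batches amortize it.
amortized = few_shot_overhead / batch_size + tokens_per_entry
print(f"~{amortized:.0f} tokens per entry at batch size {batch_size}")
```

The preamble amortizes across the batch, which is one more reason shrinking batch sizes has a real cost.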


How to Actually Implement This

Here’s my step-by-step process:

  1. Establish a baseline by processing a representative sample individually
  2. Convert your pipeline to generate key-value formatted entries
  3. Create domain-specific examples that cover your edge cases
  4. Experiment with batch sizes to find your accuracy-throughput balance
  5. Build validation checks comparing batch results to individual processing
  6. Monitor continuously for regression over time
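Steps 5 and 6 can start as a simple disagreement-rate check between batch and individual predictions, keyed by entry ID. The 5% threshold here is an arbitrary starting point:

```python
def regression_check(batch_preds, individual_preds, tolerance=0.05):
    """
    Return True if batch predictions agree with individually processed
    predictions on all but `tolerance` of entries (dicts keyed by entry_id).
    """
    disagreements = sum(
        1 for entry_id, label in individual_preds.items()
        if batch_preds.get(entry_id) != label
    )
    return disagreements / len(individual_preds) <= tolerance
```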

Code: Linearization + Prompt Builder

def linearize_feedback(df, id_column, text_column):
    """
    Convert tabular feedback data into the linearized key-value format:
    one `<ENTRY_id @ Review: "...">` string per row.
    """
    linearized = []
    for _, row in df.iterrows():
        # Escape embedded quotes so the entry delimiters stay unambiguous.
        text = str(row[text_column]).replace('"', '\\"')
        linearized.append(f'<ENTRY_{row[id_column]} @ Review: "{text}">')
    return linearized


def create_few_shot_prompt(examples, task_data):
    """
    Construct a prompt from (linearized_entry, label) example pairs
    followed by the unlabeled task entries.
    """
    prompt = "Task: Analyze sentiment for each feedback entry independently.\n\n"
    prompt += "Examples:\n"

    for entry, label in examples:
        prompt += f"{entry} | Sentiment: {label}\n"

    prompt += "\nNow analyze the following entries:\n"
    prompt += "\n".join(task_data)
    prompt += "\n\nReturn results as CSV: entry_id,predicted_sentiment,confidence"

    return prompt

What I Wish Someone Had Told Me

Context bleeding isn’t some edge case that only affects massive enterprise deployments.

If you’re processing structured data with LLMs — feedback, survey responses, support tickets, whatever — you’re probably experiencing this problem right now.

You just might not know it yet.

Context errors can lead to:

  • Lost essential information
  • Misinterpreted model output
  • Incorrect downstream actions

The AutoPK-inspired pipeline approach fundamentally changed how I think about feeding data to LLMs.

Converting spatial table relationships into explicit textual representations isn’t just a workaround.

It’s actually aligning with what these models are good at.


What’s Next

The research community is exploring:

  • Attention mechanisms specifically designed for structured data
  • Architectural modifications that enforce entry boundaries
  • Hybrid approaches combining LLM strengths with traditional ML for tabular data

Continuous batching combines KV caching, chunked prefill, and ragged batching with dynamic scheduling to maximize throughput — but these optimizations focus on throughput rather than accuracy preservation.

For now, if you’re dealing with batch processing of structured data:

Start with linearization + few-shot prompting.

The results speak for themselves.


