Jaskirat Singh

When Your LLM Starts Bleeding Context (And How I Fixed It)

LLM batch processing failing? Learn how context bleeding between data rows tanks accuracy—plus the linearization technique that dropped our error rate to under 15%.

Here’s the thing about working with LLMs at scale: they’re incredible until they’re not.

I learned this the hard way while processing thousands of user feedback entries for sentiment analysis. We’re talking serious volume here — the kind that makes individual processing a non-starter from both a time and cost perspective. So naturally, I went with batch processing.

Smart move, right?

Wrong.


The Day My Accuracy Tanked


Three weeks into production, I started noticing something weird. Our false positive and false negative rates were climbing. Not catastrophically, but enough to make me nervous.

The really frustrating part? When I spot-checked the problematic entries by processing them individually, they came back accurate.

That’s when I knew we had a problem.

After digging into the data (because of course I did — my academic research background doesn’t let me quit), I found a pattern. When feedback entries followed a continuous sentiment sequence — such as five positive reviews in a row — a sudden negative review would be misclassified as positive.

And vice versa.

The numbers were brutal:

  • False positives jumped 23% when negative feedback followed strings of positive reviews
  • False negatives climbed 18% in the reverse scenario
  • Overall accuracy dropped from 91% to 76% on edge cases

The LLM wasn’t treating each entry independently. It was bleeding context across rows, letting previous patterns bias current predictions.

Research shows that LLMs struggle to maintain strict contextual boundaries when processing sequential rows, leading to performance degradation as input length increases.

This is what people in the industry call faithfulness hallucination — when the model generates content that diverges from the actual input because it’s confused by the surrounding context.

Not exactly what you want when you're trying to make data-driven decisions.


Quick Fixes (That Actually Worked)

I needed solutions yesterday, so here’s what I tried first:

1. Batch Size Reduction

I slashed our batch size from 32 entries down to 8. Throughput took a hit, but accuracy improved immediately.

Studies using models like Llama3-70B show that pushing batch sizes beyond 64 often produces diminishing returns. There’s a sweet spot between efficiency and accuracy.
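The change itself is trivial to implement. A minimal sketch of the batching step — the size of 8 is just what worked for my workload, not a universal constant:

```python
from typing import Iterator, List

def chunk(entries: List[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Split entries into fixed-size batches; the last batch may be smaller."""
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]
```

Each yielded batch goes into its own API call, so no request ever carries more than eight rows of mutual context.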

2. Prompt Engineering for Independence

I rewrote our system prompt to explicitly hammer home row independence:

You are analyzing user feedback entries independently. 
Each entry must be evaluated solely on its own content 
without influence from previous entries. 

Treat each review as a completely separate analysis task. 
Do not allow patterns from earlier reviews to bias your 
assessment of subsequent reviews.

Did it help? Yes.

Was it enough? Not even close.


The Real Fix: Learning from AutoPK

Here’s where things get interesting.

I started researching how other teams were handling LLMs with structured data, and I came across the AutoPK framework. AutoPK demonstrates that LLMs often fail to preserve spatial and structural relationships when processing raw tables, requiring transformation to explicit key-value representations.

And honestly?

It changed everything.




Step 1: Linearize Your Data

Instead of feeding the LLM raw tabular rows, I transformed each entry into an explicit key-value format.

Before (What Everyone Does)

Row_ID, Review_Text, Sentiment
1, Great product, positive
2, Terrible service, negative

After (What Actually Works)

<ENTRY_1 @ Review: "Great product" | Sentiment: positive>
<ENTRY_2 @ Review: "Terrible service" | Sentiment: negative>

Why This Works

By converting each relevant cell into a key-value pair, the transformation abstracts away layout differences. This enables text-based models to more effectively extract information from tabular data.

You’re shifting the cognitive load:

  • ❌ Away from spatial reasoning (which LLMs struggle with)
  • ✅ Toward sequential text processing (which they’re actually good at)

Step 2: Add Few-Shot Examples

I included five examples in the linearized format before the actual task data. The examples showed the model exactly how to handle each entry independently.

Task: Analyze sentiment for each feedback entry independently.

Examples:
<ENTRY_A @ Review: "The interface is intuitive and fast" | Sentiment: positive>
<ENTRY_B @ Review: "Customer support was unhelpful" | Sentiment: negative>
<ENTRY_C @ Review: "Product works as described" | Sentiment: neutral>
<ENTRY_D @ Review: "Exceeded my expectations completely" | Sentiment: positive>
<ENTRY_E @ Review: "Shipping took three weeks" | Sentiment: negative>

Now analyze the following entries:
[Production data follows]

Research on TabLLM shows that this approach can outperform traditional deep-learning methods, particularly in few-shot scenarios with minimal labeled data.

We saw error rates drop from 60–95% down to under 15%.


Step 3: Simplify Your Input

I stripped out every column that wasn’t directly relevant to sentiment analysis.

No timestamps.

No URLs.

No user agents.

Just the entry ID and the feedback text.

Original Input (Noisy)

row_id, timestamp, user_id, session_id, feedback_text, page_url, user_agent, sentiment

Cleaned Input

entry_id, feedback_text, sentiment

Research confirms that heterogeneous features — ranging from dense numerical to sparse categorical — can confuse models, especially when columns contain information unrelated to the target task.

This single change reduced hallucination rates by about 12%.
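In pandas terms the cleanup is a one-liner. The column names below come from my schema — yours will differ:

```python
import pandas as pd

# Hypothetical raw export with the noisy columns listed above.
raw = pd.DataFrame({
    "row_id": [1, 2],
    "timestamp": ["2025-01-01T09:00", "2025-01-01T09:05"],
    "user_id": ["u_841", "u_212"],
    "feedback_text": ["Great product", "Terrible service"],
    "page_url": ["/checkout", "/support"],
})

# Keep only the columns the sentiment task actually needs.
cleaned = raw[["row_id", "feedback_text"]].rename(columns={"row_id": "entry_id"})
```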


Step 4: Control the Output Format

I explicitly instructed the model to return results in a structured format using only the provided identifiers:

Return results in the following format, using only the entry identifiers provided:

entry_id,predicted_sentiment,confidence
1,positive,0.92
2,negative,0.87

This prevents the model from inventing data or introducing information that wasn’t in the input.

It also makes validation way easier.
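Because the output schema is fixed, the validation itself becomes mechanical. A sketch of the check I run — the names are mine, and a production version would also handle malformed rows gracefully:

```python
import csv
import io

def validate_batch_output(raw_csv: str, expected_ids: set) -> list:
    """Parse the model's CSV reply and reject unknown or missing entry IDs."""
    rows = csv.reader(io.StringIO(raw_csv.strip()))
    results, seen = [], set()
    for entry_id, sentiment, confidence in rows:
        if entry_id not in expected_ids:
            # The model invented an identifier that was never in the input.
            raise ValueError(f"Unknown entry_id {entry_id!r} in model output")
        seen.add(entry_id)
        results.append((entry_id, sentiment, float(confidence)))
    missing = expected_ids - seen
    if missing:
        raise ValueError(f"Missing predictions for: {sorted(missing)}")
    return results
```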


The Trade-Offs Nobody Talks About

Look — these solutions aren’t free.

Here’s what I learned about the cost-performance balance:

  • Smaller batch sizes = better accuracy but worse throughput
  • Linearization + few-shot prompting = higher token consumption per entry
  • Individual processing = guaranteed independence but eye-watering API costs

Anthropic optimized Claude 3 with continuous batching, increasing throughput from 50 to 450 tokens per second while reducing latency and cutting GPU costs by 40%.

The point is: you need to know what you’re optimising for.

For our use case — customer feedback analysis where 85–90% accuracy was acceptable — optimized batch processing with linearization hit the sweet spot.

Your mileage may vary.
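To make the token trade-off concrete, here's a rough back-of-envelope. Every number below is an illustrative assumption, not a measurement:

```python
# All counts below are illustrative assumptions, not measurements.
few_shot_overhead = 120   # tokens for instructions + five linearized examples
tokens_per_entry = 25     # a linearized entry vs. roughly 15 for a raw CSV row
batch_size = 8

# The few-shot preamble is paid once per request, so larger batches amortize it.
amortized = few_shot_overhead / batch_size + tokens_per_entry
print(f"~{amortized:.0f} tokens per entry at batch size {batch_size}")
```

The preamble amortizes across the batch, which is one more reason shrinking batch sizes has a real cost.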


How to Actually Implement This

Here’s my step-by-step process:

  1. Establish a baseline by processing a representative sample individually
  2. Convert your pipeline to generate key-value formatted entries
  3. Create domain-specific examples that cover your edge cases
  4. Experiment with batch sizes to find your accuracy-throughput balance
  5. Build validation checks comparing batch results to individual processing
  6. Monitor continuously for regression over time
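Steps 5 and 6 can start as a simple disagreement-rate check between batch and individual predictions, keyed by entry ID. The 5% threshold here is an arbitrary starting point:

```python
def regression_check(batch_preds, individual_preds, tolerance=0.05):
    """
    Return True if batch predictions agree with individually processed
    predictions on all but `tolerance` of entries (dicts keyed by entry_id).
    """
    disagreements = sum(
        1 for entry_id, label in individual_preds.items()
        if batch_preds.get(entry_id) != label
    )
    return disagreements / len(individual_preds) <= tolerance
```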

Code: Linearization + Prompt Builder

def linearize_feedback(df, id_column, text_column):
    """
    Convert tabular feedback data into the linearized key-value format:
    one `<ENTRY_id @ Review: "...">` string per row.
    """
    linearized = []
    for _, row in df.iterrows():
        # Escape embedded quotes so the entry delimiters stay unambiguous.
        text = str(row[text_column]).replace('"', '\\"')
        linearized.append(f'<ENTRY_{row[id_column]} @ Review: "{text}">')
    return linearized


def create_few_shot_prompt(examples, task_data):
    """
    Construct a prompt from (linearized_entry, label) example pairs
    followed by the unlabeled task entries.
    """
    prompt = "Task: Analyze sentiment for each feedback entry independently.\n\n"
    prompt += "Examples:\n"

    for entry, label in examples:
        prompt += f"{entry} | Sentiment: {label}\n"

    prompt += "\nNow analyze the following entries:\n"
    prompt += "\n".join(task_data)
    prompt += "\n\nReturn results as CSV: entry_id,predicted_sentiment,confidence"

    return prompt

What I Wish Someone Had Told Me

Context bleeding isn’t some edge case that only affects massive enterprise deployments.

If you’re processing structured data with LLMs — feedback, survey responses, support tickets, whatever — you’re probably experiencing this problem right now.

You just might not know it yet.

Context errors can lead to:

  • Lost essential information
  • Misinterpreted model output
  • Incorrect downstream actions

The AutoPK-inspired pipeline approach fundamentally changed how I think about feeding data to LLMs.

Converting spatial table relationships into explicit textual representations isn’t just a workaround.

It’s actually aligning with what these models are good at.


What’s Next

The research community is exploring:

  • Attention mechanisms specifically designed for structured data
  • Architectural modifications that enforce entry boundaries
  • Hybrid approaches combining LLM strengths with traditional ML for tabular data

Continuous batching combines KV caching, chunked prefill, and ragged batching with dynamic scheduling to maximize throughput — but these optimizations focus on throughput rather than accuracy preservation.

For now, if you’re dealing with batch processing of structured data:

Start with linearization + few-shot prompting.

The results speak for themselves.


