Prompt Context Ordering: Why Recency Beats Relevance More Often Than You Think


Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Me: xgabriel.com | GitHub

Your retriever ranks chunks by relevance and hands you the top eight. You paste them into the prompt in rank order: best match first, worst match last, instruction at the top. It feels correct. The model should see the strongest evidence first.

Then accuracy sags on the long queries. The retriever is finding the right chunk every time. You can see it in the logs, sitting at rank four. The model just doesn't use it. It answers from chunk one and the tail, and ignores the middle where the answer actually lives.

That gap has a name. It is one of the best-studied failure modes in long-context prompting, and the fix costs you nothing but a few lines of reordering code.

Lost in the middle

The Stanford / Berkeley / Samaya paper Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) ran the experiment cleanly. Take a question, plant the answer document at a known position among N distractors, and measure accuracy as you slide the answer around.

The result was a U-shaped curve. Accuracy was highest when the relevant document sat at the very start or the very end of the context. It dropped in the middle, sometimes below the score the same model got with no documents at all. The model had the answer in its window and still missed it because of where it sat.

The effect holds across model families and gets worse as the context grows. A 4k window hides the middle less than a 32k one. With today's 200k+ windows, the dead zone in the middle is large enough to swallow most of your retrieved chunks.

So the question is not "did I retrieve the right chunk." It is "did I put it where the model reads."

Position beats rank

Here is the part that trips people up. Relevance ranking and positional attention are two different axes, and you are usually optimizing the wrong one.

Your retriever sorts by relevance: chunk A is a 0.91 match, chunk B is 0.88, down to chunk H at 0.71. Natural instinct says put A first. But the model does not read "first" as "most important." It reads the start and the end hardest and the middle softest. If you dump chunks in rank order, your second-best chunk lands at position two (good, near the front) but chunks three through six fall into the trough, and your eighth-best chunk gets the privileged last slot for no reason.

You handed the strongest position to your weakest evidence.

The fix from the long-context literature, and one that the major RAG frameworks now ship as an opt-in step, is a two-sided reorder. Put the highest-ranked chunks at both ends and bury the low-ranked ones in the middle, where they matter least. LangChain ships this as LongContextReorder; LlamaIndex has the same idea. The pattern is older than either: it is just "fold the ranking around the center."

The reorder, in code

The mechanic is small enough to write yourself, and writing it yourself means you understand what it does to your context.

def fold_by_relevance(chunks):
    """Highest-ranked chunks to the edges,
    lowest to the middle.

    chunks: list pre-sorted best-first.
    returns: list reordered for position.
    """
    head, tail = [], []
    for i, chunk in enumerate(chunks):
        if i % 2 == 0:
            tail.append(chunk)   # even ranks -> end
        else:
            head.append(chunk)   # odd ranks -> front
    # tail was built best-first; reverse it so the
    # best chunk lands in the very last slot
    return head + tail[::-1]

Feed it a relevance-sorted list and it interleaves: rank 0 lands at the very end, rank 1 at the very front, rank 2 next-to-last, rank 3 second, and so on inward. The two best chunks own the two strongest positions. The weakest chunk sits dead center.
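
A quick way to sanity-check any layout is to measure how far each rank sits from the nearest edge. The sketch below is illustrative only: edge_distance is a hypothetical helper, not part of any library, and the folded list is simply the eight-chunk layout described above, written out by hand as ranks (0 = best).

```python
def edge_distance(layout):
    """Map each rank to its distance from the nearest edge.

    layout: list of ranks in prompt order (0 = best match).
    A distance of 0 means the first or last slot.
    """
    n = len(layout)
    return {rank: min(pos, n - 1 - pos) for pos, rank in enumerate(layout)}


# Plain rank order: best first, worst last.
rank_order = [0, 1, 2, 3, 4, 5, 6, 7]
# The folded layout described above, written out by hand.
folded = [1, 3, 5, 7, 6, 4, 2, 0]

for name, layout in [("rank order", rank_order), ("folded", folded)]:
    dist = edge_distance(layout)
    # List the distances from best rank to worst rank.
    print(name, [dist[rank] for rank in range(len(layout))])
```

In rank order the distances read 0, 1, 2, 3, 3, 2, 1, 0, so ranks 3 and 4 sit deepest and the worst chunk owns an edge. Folded, they read 0, 0, 1, 1, 2, 2, 3, 3: distance from the edge grows with rank, which is the whole point.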

If that feels too clever, the blunter version works almost as well:

def edges_first(chunks, edge=2):
    """Top `edge` chunks at front, next `edge` at
    the end, rest in the middle.
    """
    front = chunks[:edge]
    back = chunks[edge:2 * edge]
    middle = chunks[2 * edge:]
    return front + middle + back

Both push your strongest evidence out of the trough. Pick whichever you find easier to reason about at 2am.

Where the instruction goes

Chunks are half the story. The other half is the instruction itself: the actual question or task the model has to act on.

Put it last. After the context, not before.

On long contexts the model attends most strongly to the tokens nearest the end of the prompt. If you open with "Answer the question using only the documents below" and then dump 30k tokens of chunks, the instruction sits 30k tokens away from the point where generation starts, and the model has drifted by the time it gets there. State the context first, then state the task right before you hand control back to the model.

<context>
{reordered_chunks}
</context>

<question>
{user_question}
</question>

Answer using only the context above. If the
answer is not present, say you don't know.

A cheap and reliable trick on very long inputs is to state the task twice: once briefly at the top so the model knows what it is reading the context for, and again in full at the end where attention is strongest. The book calls this bookending. It costs a few tokens and recovers a noticeable chunk of the tail accuracy you lose to drift.
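
Here is one way bookending might look in code. This is a minimal sketch, not the book's template: build_bookended_prompt and the "You will be asked" preview wording are assumptions made for illustration, and the tags simply mirror the prompt layout shown above.

```python
def build_bookended_prompt(chunks, question):
    """Assemble a prompt with the task stated at both ends.

    chunks: list of strings, already reordered for position.
    question: the user's question as a string.
    """
    # Short preview so the model knows what it is reading for.
    preview = f"You will be asked: {question}"
    body = "\n\n".join(chunks)
    # The full task statement goes last, right before the model answers.
    return (
        f"{preview}\n\n"
        f"<context>\n{body}\n</context>\n\n"
        f"<question>\n{question}\n</question>\n\n"
        "Answer using only the context above. If the "
        "answer is not present, say you don't know."
    )
```

The preview stays short on purpose. Its job is to orient the read, not to carry the rules; the constraints live at the end, where they are closest to the output.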

Run the experiment yourself

Do not take the curve on faith. The trough moves with model, window size, and task, so measure it on yours. The setup is the same one the paper used, shrunk to something you can run in an afternoon.

def plant(answer_doc, distractors, position):
    """Insert answer_doc at `position` among
    distractors. Returns the ordered list.
    """
    docs = list(distractors)
    docs.insert(position, answer_doc)
    return docs

def build_prompt(docs, question):
    body = "\n\n".join(
        f"[{i}] {d}" for i, d in enumerate(docs)
    )
    return (
        f"<context>\n{body}\n</context>\n\n"
        f"<question>\n{question}\n</question>\n\n"
        "Answer using only the context above."
    )

# sweep the answer across every slot and score
def position_sweep(answer, distractors, q, judge,
                   call_llm):
    n = len(distractors) + 1
    scores = {}
    for pos in range(n):
        docs = plant(answer, distractors, pos)
        out = call_llm(build_prompt(docs, q))
        scores[pos] = 1.0 if judge(out) else 0.0
    return scores

Run position_sweep over 50 to 100 questions, each with its answer planted at every slot, and average the score per position. Plot it. If your stack behaves like the models in the paper, you will see the U: high at slots 0 and n-1, sagging through the middle. The depth of that sag is how much accuracy your current rank-order layout is leaving on the table.
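
The averaging step is easy to get subtly wrong, so here is a small sketch of it. Both helper names are hypothetical. The code assumes each sweep is a dict mapping position to a 0.0 or 1.0 score, one dict per question, and it draws a crude terminal bar chart so you can eyeball the curve before reaching for a plotting library.

```python
from collections import defaultdict


def average_by_position(sweeps):
    """Average per-position scores over many questions.

    sweeps: list of dicts mapping position -> score,
    one dict per question.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in sweeps:
        for pos, score in scores.items():
            totals[pos] += score
            counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}


def text_plot(averages, width=40):
    """Render one bar per position; longer bar = higher accuracy."""
    lines = []
    for pos, avg in averages.items():
        bar = "#" * round(avg * width)
        lines.append(f"{pos:>3} {avg:.2f} {bar}")
    return "\n".join(lines)
```

Collect one sweep per question into a list, pass the list to average_by_position, and print text_plot of the result. If the bars at the top and bottom of the output are visibly longer than the ones between them, you have found your trough.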

Then re-run the same eval with fold_by_relevance applied to your real retriever output, against a flat rank-order baseline. The delta is the free accuracy. Whatever the sweep above showed for your stack is what to expect here: on retrieval tasks with eight to twelve chunks the reorder typically buys back a few points, and the gap widens as the context grows. Measure it on yours rather than trusting a number from someone else's pipeline.
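
A sketch of that comparison follows, under a few assumptions: compare_layouts is a hypothetical helper, each case is a (chunks, question, expected) tuple with chunks sorted best-first, and judge here takes the model output plus the expected answer, which may differ from your own judge's signature. The prompt builder, model call, and judge are passed in so the harness stays independent of any one stack.

```python
def compare_layouts(cases, layouts, build_prompt, call_llm, judge):
    """Score several chunk layouts on the same eval set.

    cases: list of (chunks_best_first, question, expected) tuples.
    layouts: dict mapping a layout name to a reorder function.
    returns: dict mapping layout name to mean accuracy.
    """
    totals = {name: 0.0 for name in layouts}
    for chunks, question, expected in cases:
        for name, reorder in layouts.items():
            # Same chunks and question; only the ordering differs.
            prompt = build_prompt(reorder(list(chunks)), question)
            out = call_llm(prompt)
            totals[name] += 1.0 if judge(out, expected) else 0.0
    return {name: total / len(cases) for name, total in totals.items()}
```

Pass the identity function as the baseline and your fold as the challenger. Because every layout sees the same cases in the same loop, the difference between the two numbers is attributable to ordering alone, apart from any sampling noise in the model itself.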

What this does not fix

Reordering is a layout fix. It does not invent information you failed to retrieve. If the answer chunk never made the top-N, no amount of folding helps. That is a retrieval problem, so fix the retriever.

It also matters less when you have few chunks. With three short documents in a 4k prompt there is barely a middle to get lost in. The effect earns its keep when the context is long: many chunks, a wide window, a question that depends on one specific passage somewhere in the pile.

And it is not a substitute for cutting chunks you do not need. A tighter context with five strong chunks beats a folded context of fifteen mediocre ones. Reorder after you have trimmed, not instead of trimming.
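
In code, "trim, then reorder" can be a single small function. This is a rough sketch: trim_then_reorder is a hypothetical name, the min_score and max_chunks defaults are placeholders rather than recommendations, and the reorder step is passed in so you can plug in whichever layout you settled on.

```python
def trim_then_reorder(scored_chunks, reorder, min_score=0.75, max_chunks=8):
    """Drop weak chunks first, then lay out the survivors.

    scored_chunks: list of (chunk, score) pairs in any order.
    reorder: function taking a best-first list and returning a layout.
    min_score, max_chunks: placeholder cutoffs; tune them on your data.
    """
    # Sort best-first so the cutoffs and the reorder see ranked input.
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept = [chunk for chunk, score in ranked if score >= min_score]
    return reorder(kept[:max_chunks])
```

The order of operations is the design choice. Trimming after the reorder would cut chunks by slot rather than by score, which can throw away an edge chunk you wanted to keep.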

The rules, distilled

Rank by relevance, place by position. They are different decisions. Stop conflating "best match" with "goes first."
Fold the ranking around the center. Strongest chunks at both edges, weakest in the middle.
Instruction last. State the task right before you hand control to the model; bookend it on very long inputs.
Measure the trough on your stack. Run the position sweep. The U-curve's depth is model- and window-specific.
Layout is the cheap fix; retrieval is the real one. Reorder after you trim, not as a way to avoid trimming.

The reorder is a few lines. Most teams running long-context RAG have a perfectly good retriever feeding the model evidence it then ignores, purely because the right chunk landed in the wrong slot.

If this was useful

Position effects are the kind of thing that is obvious once you have plotted the curve and invisible until then, which is exactly why so many RAG pipelines quietly bleed accuracy in the middle. The chapter on context ordering in the Prompt Engineering Pocket Guide goes deeper on bookending, instruction placement, and how the trough shifts as your window grows. Same measure-it-yourself mindset, applied to a few more layout decisions you are probably making on instinct.